I am trying to parse a nested JSON object whose levels can contain bags and/or tuples, using Elephant Bird in Pig. Referencing columns at the fourth level results in some odd behavior.
Pig has trouble referencing columns at and below the fourth level. This seems to be because the data alternates between bag, tuple, and map at those levels. To be clear, it looks like the JsonLoader converts some levels to maps but not others. For example, see how "five" is referenced below.
HDP version: 2.1.2, Pig version: 0.12.1, Elephant Bird version: 4.13
Here is sample data showing the structure, with keys and values replaced by placeholders.
{
  "one" : {
    "output_info" : {
      "sample_key": "sample value"
    },
    "two" : {
      "three" : [{
        "three_id" : "three_id_value",
        "four" : {
          "five" : [{
            "level_five_info" : {
              "five_info_key" : "five_info_value"
            },
            "six" : {
              "seven" : [{
                "eight_id" : "123545",
                "eight_score" : "77"
              }, {
                "eight_id" : "98765",
                "eight_score" : "88"
              }
              ]
            }
          }
          ]
        }
      }]
    }
  }
}
Pig statements:
a = LOAD 'nest_test.dat' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
b = foreach a generate json#'one'#'two'#'three' as (three: {(three_id: chararray,four: map[])});
Running dump b; results in:
({([three_id#three_id_value,four#{five={([six#{seven={([eight_score#77,eight_id#123545]),([eight_score#88,eight_id#98765])}},level_five_info#{five_info_key=five_info_value}])}}])})
This all looks as expected. However:
c = foreach b generate three.four as ({(four:map[])});
But now, running dump c; returns none of the data above:
({()})
The same goes for leaving off the schema description:
c = foreach b generate three.four;
Referencing a level deeper gives an error:
d = foreach b generate three.four#'five';
2016-03-15 11:56:01,655 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052: Cannot cast bag with schema :bag{:tuple(four:map)} to map with schema :map
How should I be referencing levels five and six? My ultimate goal is to be able to reference eight_id and eight_score and to flatten the elements of the seven array/bag.
try using
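One pattern that may work here is sketched below. It is untested against Pig 0.12.1 and Elephant Bird 4.13, and it rests on an assumption drawn from the dump of b above: each JSON array seems to load as a bag whose tuples hold a single map, rather than separate named fields. If that holds, three.four cannot find a field called four, and applying # to a bag projection produces the cast error. Flattening each bag first, so that every row carries one map, and only then dereferencing with # avoids both problems. The relation and field names (lvl3, m3, lvl5, m5, lvl7, m7, scores) are made up for illustration.

```pig
-- Untested sketch: assumes every JSON array loads as a bag of single-map tuples.
a = LOAD 'nest_test.dat' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- "three" is an array: flatten it so each element becomes one row holding one map.
lvl3 = FOREACH a GENERATE FLATTEN(json#'one'#'two'#'three') AS (m3:map[]);

-- "four" is an object (map) and "five" is an array (bag): dereference, then flatten again.
lvl5 = FOREACH lvl3 GENERATE m3#'three_id' AS three_id, FLATTEN(m3#'four'#'five') AS (m5:map[]);

-- Same step one level down: "six" is a map, "seven" is a bag.
lvl7 = FOREACH lvl5 GENERATE three_id, FLATTEN(m5#'six'#'seven') AS (m7:map[]);

-- Each row now holds one element of "seven"; pull out the two values by key.
scores = FOREACH lvl7 GENERATE three_id, m7#'eight_id' AS eight_id, m7#'eight_score' AS eight_score;
DUMP scores;
```

The values pulled out with # are untyped here, so a cast such as (chararray) may be needed before comparing or aggregating them.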