Reading json files in pig - apache-pig

I have three data sets:
1) Base data
2) data_dict_1
3) data_dict_2
Base data is well-formatted JSON.
For example:
{"id1":"foo", "id2":"bar" ,type:"type1"}
{"id1":"foo", "id2":"bar" ,type:"type2"}
data_dict_1
1 foo
2 bar
3 foobar
....
data_dict_2
-1 foo
-2 bar
-3 foobar
... and so on
Now, what I want is: if a record is of type1, then look up id1 in data_dict_1 and id2 in data_dict_2, and assign the corresponding integer ids.
If a record is of type2, then look up id1 in data_dict_2 and id2 in data_dict_1, and assign the corresponding ids.
For example:
{"id1":1, "id2":2 ,type:"type1"}
{"id1":-1, "id2":-2 ,type:"type2"}
And so on.
How do I do this in Pig?

Note: what you have in the example above is not valid JSON; the type key is not quoted.
Assuming Pig 0.10 and up, there's the built-in JsonLoader, which you can pass a schema to:
data = LOAD 'loljson' USING JsonLoader('id1:chararray,id2:chararray,type:chararray');
and load the dicts
dict_1 = LOAD 'data_dict_1' USING PigStorage(' ') AS (id:int, key:chararray);
dict_2 = LOAD 'data_dict_2' USING PigStorage(' ') AS (id:int, key:chararray);
Then split that based on the type value
SPLIT data INTO type1 IF type == 'type1', type2 IF type == 'type2';
JOIN them appropriately
type1_joined = JOIN type1 BY id1, dict_1 BY key;
type1_joined = FOREACH type1_joined GENERATE type1::id1 AS id1, type1::id2 AS id2, type1::type AS type, dict_1::id AS id;
type2_joined = JOIN type2 BY id2, dict_2 BY key;
type2_joined = FOREACH type2_joined GENERATE type2::id1 AS id1, type2::id2 AS id2, type2::type AS type, dict_2::id AS id;
and since the schemas are equal, UNION them together
final_data = UNION type1_joined, type2_joined;
this produces
DUMP final_data;
(foo,bar,type2,-2)
(foo,bar,type1,1)
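Assembled end to end, the steps above might look like the following script (a sketch; 'loljson', 'data_dict_1' and 'data_dict_2' are the placeholder paths used in this answer):

```pig
-- load the JSON base data with an explicit schema (Pig 0.10+)
data = LOAD 'loljson' USING JsonLoader('id1:chararray,id2:chararray,type:chararray');

-- load both dictionaries (space-separated: id, key)
dict_1 = LOAD 'data_dict_1' USING PigStorage(' ') AS (id:int, key:chararray);
dict_2 = LOAD 'data_dict_2' USING PigStorage(' ') AS (id:int, key:chararray);

-- route records by type
SPLIT data INTO type1 IF type == 'type1', type2 IF type == 'type2';

-- resolve ids against the appropriate dictionary
type1_joined = JOIN type1 BY id1, dict_1 BY key;
type1_joined = FOREACH type1_joined GENERATE type1::id1 AS id1, type1::id2 AS id2, type1::type AS type, dict_1::id AS id;
type2_joined = JOIN type2 BY id2, dict_2 BY key;
type2_joined = FOREACH type2_joined GENERATE type2::id1 AS id1, type2::id2 AS id2, type2::type AS type, dict_2::id AS id;

-- the schemas are equal, so UNION is safe
final_data = UNION type1_joined, type2_joined;
DUMP final_data;
```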

Related

How to read a list of values in Pig as a bag and compare it to a specific value?

Input:
ids:
1111,2222,3333,4444
employee:
{"name":"abc","id":"1111"} {"name":"xyz","id":"10"}
{"name":"z","id":"100"} {"name":"m","id":"99"}
{"name":"pqr","id":"3333"}
I want to filter out the employees whose id exists in the given list.
Expected Output:
{"name":"xyz","id":"10"} {"name":"z","id":"100"}
{"name":"m","id":"99"}
Existing Code:
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
empl = LOAD 'pathToFile' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (data:map[]);
output = FILTER empl BY data#'id' in (idList);
-- not working, states: A column needs to be projected from a relation for it to be used as a scalar
output = FILTER empl BY data#'id' in (idList#id);
-- not working, states: mismatched input 'id' expecting set null
JsonLoader() is built into Pig > 0.10, and you can specify the schema:
empl = LOAD 'pathToFile' USING JsonLoader('name:chararray, id:chararray');
DUMP empl;
(abc,1111)
(xyz,10)
(z,100)
(m,99)
(pqr,3333)
You're loading idList as a one-column table of type chararray, but what you want is a list.
Loading it as a one-column table (this implies modifying your file so there is only one record per line):
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Or, keeping it as a one-line file, we change the separator so the line doesn't split into columns (otherwise only the first column would be loaded), then tokenize:
idList = LOAD 'pathToFile' USING PigStorage(' ') AS (id:chararray);
idList = FOREACH idList GENERATE FLATTEN(TOKENIZE(id, '[,]')) AS id;
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Now we can do a LEFT JOIN to find which ids are not present in idList, and then a FILTER to keep only those. Note that output is a reserved keyword in Pig, so you shouldn't use it as an alias:
res = JOIN empl BY id LEFT, idList BY id;
res = FILTER res BY idList::id IS NULL;
DUMP res;
(xyz,10,)
(m,99,)
(z,100,)
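Conversely, if you really wanted only the employees whose id does appear in the list, the same two relations support a plain inner JOIN (a sketch under the same assumptions):

```pig
-- inner JOIN keeps only the ids that match an entry in idList
present = JOIN empl BY id, idList BY id;
present = FOREACH present GENERATE empl::name AS name, empl::id AS id;
DUMP present;
```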

using filter and group by in pig

I am new to pig syntax and was wondering if someone can provide a hint for translating this SQL code into pig.
SELECT column1, column2, SUM(column3)
FROM table
WHERE column5 = 100
GROUP BY column2;
So far I have:
data = LOAD....etc.
filterColumn = FILTER data BY column5 = 100;
groupColumn = Group filterColumn By column2;
result = foreach groupColumn Generate group, column1, SUM(column3) as sumCol3;
DUMP result;
This does not work. The error message is "Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast."
SUM() computes the sum of the numeric values in a single-column bag; it expects a bag as its input. So the FOREACH ... GENERATE would be:
result = foreach groupColumn Generate group, filterColumn.column1, SUM(filterColumn.column3) as sumCol3;
Also in the FILTER statement, to check for equality use ==
filterColumn = FILTER data BY column5 == 100;
The below Pig commands can be used:
test = LOAD '<testdata>' USING PigStorage(',') AS (column1:<datatype>, column2:<datatype>, column3:<datatype>, column5:<datatype>);
A = FILTER test BY column5 == 100;
B = GROUP A BY column2;
C = FOREACH B GENERATE group, A.column1, A.column2, SUM(A.column3);
DUMP C;
Note that the usage of 'PigStorage' and 'AS' is optional.
Hope this helps.
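Putting both corrections together (== in the FILTER, and bag projections in the FOREACH), a complete sketch might be (the 'table_data' path and the column types are placeholders):

```pig
-- assumed file name and types; adjust to your data
data = LOAD 'table_data' USING PigStorage(',') AS (column1:chararray, column2:chararray, column3:int, column5:int);
filtered = FILTER data BY column5 == 100;   -- WHERE column5 = 100
grouped = GROUP filtered BY column2;        -- GROUP BY column2
-- group is the grouping key; filtered.column1 is a bag of values per group
result = FOREACH grouped GENERATE group AS column2, filtered.column1, SUM(filtered.column3) AS sumCol3;
DUMP result;
```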

READ TABLE with dynamic key fields?

I have the name of a table, DATA lv_tablename TYPE tabname VALUE 'xxxxx', and a generic FIELD-SYMBOLS: <lt_table> TYPE ANY TABLE. that contains entries selected from that table.
I've defined my line structure, FIELD-SYMBOLS: <ls_line> TYPE ANY., which I'd use for reading from the table.
Is there a way to write a READ statement on <lt_table> that fully specifies the key fields?
I am aware of the addition READ TABLE xxxx WITH KEY (lv_field_name) = 'asdf'., but this wouldn't work (AFAIK) for a dynamic number of key fields, and I wouldn't like to write a large number of READ TABLE statements with an increasing number of key field specifications.
Can this be done?
Actually I found this to work:
DATA lt_bseg TYPE TABLE OF bseg.
DATA ls_bseg TYPE bseg.
DATA lv_string1 TYPE string.
DATA lv_string2 TYPE string.
lv_string1 = ` `.
lv_string2 = lv_string1.
SELECT whatever FROM wherever INTO TABLE lt_bseg.
READ TABLE lt_bseg INTO ls_bseg
WITH KEY ('MANDT') = 800
(' ') = ''
('BUKRS') = '0005'
('BELNR') = '0100000000'
('GJAHR') = 2005
('BUZEI') = '002'
('') = ''
(' ') = ''
(' ') = ' '
(lv_string1) = '1'
(lv_string2) = ''.
By using this syntax one can specify as many key fields as required. If some field names are empty, they are simply ignored, even if values are specified for those empty fields.
One must pay attention that with this exact syntax (static definitions), two fields with the exact same name (even blank names) are not allowed.
As shown with the variables lv_string1 and lv_string2, at run time this is no problem.
And lastly, one can specify the fields in any order (I don't know what performance benefits or penalties this syntax might have).
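So a caller that only knows its key fields at run time can fill string variables with the field names, leaving the unused slots blank (blank names are ignored), and reuse one statically written READ. A sketch assuming lt_bseg and ls_bseg from above:

```abap
DATA: lv_key1 TYPE string,
      lv_key2 TYPE string,
      lv_key3 TYPE string.   " stays initial: this slot is ignored

lv_key1 = 'BUKRS'.
lv_key2 = 'BELNR'.
" lv_key3 left blank on purpose

READ TABLE lt_bseg INTO ls_bseg
     WITH KEY (lv_key1) = '0005'
              (lv_key2) = '0100000000'
              (lv_key3) = ''.
```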
There also seems to be a possibility similar to a dynamic SELECT statement with binding and an lt_dynwhere table.
Please refer to this post, where someone asked for the same requirement:
http://scn.sap.com/thread/1789520
3 ways:
READ TABLE itab WITH [TABLE] KEY (comp1) = value1 (comp2) = value2 ...
You can handle a dynamic number of key fields by statically writing the maximum number of key fields in the code, and passing empty key field names at runtime for the ones not used.
LOOP AT itab WHERE (where) (see Addition 4, "WHERE (cond_syntax)")
Available since ABAP 7.02.
SELECT ... FROM @itab WHERE (where) ...
Available since ABAP 7.52. It may be slow if the condition is complex and cannot be handled by the ABAP kernel, i.e. it needs to be executed by the database; in that case only a few databases are supported (I think only HANA is supported currently).
Examples (ASSERT statements are used here to prove that the conditions are true, otherwise the program would fail):
TYPES: BEGIN OF ty_table_line,
key_name_1 TYPE i,
key_name_2 TYPE i,
attr TYPE c LENGTH 1,
END OF ty_table_line,
ty_internal_table TYPE SORTED TABLE OF ty_table_line WITH UNIQUE KEY key_name_1 key_name_2.
DATA(itab) = VALUE ty_internal_table( ( key_name_1 = 1 key_name_2 = 1 attr = 'A' )
( key_name_1 = 1 key_name_2 = 2 attr = 'B' ) ).
"------------------ READ TABLE
DATA(key_name_1) = 'KEY_NAME_1'.
DATA(key_name_2) = 'KEY_NAME_2'.
READ TABLE itab WITH TABLE KEY
(key_name_1) = 1
(key_name_2) = 2
ASSIGNING FIELD-SYMBOL(<line>).
ASSERT <line> = VALUE ty_table_line( key_name_1 = 1 key_name_2 = 2 attr = 'B' ).
key_name_2 = ''. " ignore this key field
READ TABLE itab WITH TABLE KEY
(key_name_1) = 1
(key_name_2) = 2 "<=== will be ignored
ASSIGNING FIELD-SYMBOL(<line_2>).
ASSERT <line_2> = VALUE ty_table_line( key_name_1 = 1 key_name_2 = 1 attr = 'A' ).
"------------------ LOOP AT
DATA(where) = 'key_name_1 = 1 and key_name_2 = 1'.
LOOP AT itab ASSIGNING FIELD-SYMBOL(<line_3>)
WHERE (where).
EXIT.
ENDLOOP.
ASSERT <line_3> = VALUE ty_table_line( key_name_1 = 1 key_name_2 = 1 attr = 'A' ).
"---------------- SELECT ... FROM #itab
SELECT SINGLE * FROM @itab WHERE (where) INTO @DATA(line_3).
ASSERT line_3 = VALUE ty_table_line( key_name_1 = 1 key_name_2 = 1 attr = 'A' ).

Find similar entries using pig script

I have data as below
1,ref1,1200,USD,CR
2,ref1,1200,USD,DR
3,ref2,2100,USD,DR
4,ref2,800,USD,CR
5,ref2,700,USD,CR
6,ref2,600,USD,CR
I want to group records where field2 matches, SUM(field3) matches, and field5 is opposite (meaning if the LHS is "CR" then the RHS should be "DR", and vice versa).
How can I achieve this using pig script?
You can also do it this way:
data = LOAD 'myData' USING PigStorage(',') AS
(field1: int, field2: chararray,
field3: int, field4: chararray,
field5: chararray) ;
B = FOREACH (GROUP data BY (field2, field5)) GENERATE group.field2, data ;
-- Since in B there will always be two sets of field2 (one for CR and one for DR)
-- grouping by field2 again will get all pairs of CR and DR
-- (where the sums are equal of course)
C = GROUP B BY (field2, SUM(data.field3)) ;
The schema and output at the last step:
C: {group: (field2: chararray,long),B: {(field2: chararray,data: {(field1: int,field2: chararray,field3: int,field4: chararray,field5: chararray)})}}
((ref1,1200),{(ref1,{(1,ref1,1200,USD,CR)}),(ref1,{(2,ref1,1200,USD,DR)})})
((ref2,2100),{(ref2,{(4,ref2,800,USD,CR),(5,ref2,700,USD,CR),(6,ref2,600,USD,CR)}),(ref2,{(3,ref2,2100,USD,DR)})})
The output is a little unwieldy right now, but this will clean it up:
-- Make sure to look at the schema for C above
D = FOREACH C {
-- B is a bag containing tuples in the form: B: {(field2, data)}
-- What we want is to just extract out the data field and nothing else
-- so we can go over each tuple in the bag and pull out
-- the second element (the data we want).
justbag = FOREACH B GENERATE FLATTEN($1) ;
-- Without FLATTEN the schema for justbag would be:
-- justbag: {(data: (field1, ...))}
-- FLATTEN makes it easier to access the fields by removing data:
-- justbag: {(field1, ...)}
GENERATE justbag ;
}
Into this:
D: {justbag: {(data::field1: int,data::field2: chararray,data::field3: int,data::field4: chararray,data::field5: chararray)}}
({(1,ref1,1200,USD,CR),(2,ref1,1200,USD,DR)})
({(4,ref2,800,USD,CR),(5,ref2,700,USD,CR),(6,ref2,600,USD,CR),(3,ref2,2100,USD,DR)})
I'm not sure that I understand your requirements, but you could load the data, split it into two sets (FILTER/SPLIT) and then COGROUP, such as:
data = LOAD ... AS (field1: int, field2: chararray, field3: int, field4: chararray, field5: chararray);
crs = FILTER data BY field5 == 'CR';
crs_grp = GROUP crs BY field2;
crs_agg = FOREACH crs_grp GENERATE group AS field2, SUM(crs.field3) AS field3;
drs = FILTER data BY field5 == 'DR';
drs_grp = GROUP drs BY field2;
drs_agg = FOREACH drs_grp GENERATE group AS field2, SUM(drs.field3) AS field3;
g = COGROUP crs_agg BY (field2, field3), drs_agg BY (field2, field3);
-- keep only the groups where both a CR side and a DR side exist
matched = FILTER g BY NOT IsEmpty(crs_agg) AND NOT IsEmpty(drs_agg);

bincond evaluation in pig

I am trying to replace missing values with some precomputed values.
So I posted the question here, followed the advice, and here is the code snippet:
input = LOAD 'data.txt' USING PigStorage(',') AS (id1:double, id2:double);
gin = foreach input generate
id1 IS NULL ? 2 : id1,
id2 IS NULL ? 4 : id2;
But I am getting an error: mismatched input 'IS' expecting SEMI_COLON?
Try adding parentheses in the bincond. The following works properly for me:
Contents of input:
0.9,1.11
,0.3
10.3,
Script:
inp = LOAD 'input' USING PigStorage(',') AS (id1:double, id2:double );
gin = foreach inp generate
((id1 IS NULL) ? 2 : id1),
((id2 IS NULL) ? 4 : id2);
DUMP gin;
Output:
(0.9,1.11)
(2.0,0.3)
(10.3,4.0)