PIG script for creating IDxCITY matrix from given csv file - apache-pig

I have an input file that contains ID, CITY and COUNT fields, as shown below, and I want to create a CSV file that lists, for each ID, the count for every CITY. The count should be written as '0' if the ID has no record for that CITY. I tried to write a Pig script using GROUP BY, COGROUP and FLATTEN but couldn't get it to produce this sample output.
How can I write a Pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0

You can use the script below to create the matrix (it writes the data rows only, not the header line):
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1),FLATTEN ((IsEmpty($2.$2) ? {('0')} : $2.$2)),FLATTEN ((IsEmpty($3.$2) ? {('0')} : $3.$2)),FLATTEN ((IsEmpty($4.$2) ? {('0')} : $4.$2)),FLATTEN ((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');
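To sanity-check the expected matrix outside Pig, the same pivot can be sketched in plain Python. This is only a model of what the COGROUP / IsEmpty logic above computes, not part of the Pig script; the function name build_matrix and the fixed city order are assumptions made for illustration.

```python
from collections import defaultdict

def build_matrix(rows, cities):
    """Pivot (id, city, count) rows into one row per id with a count per city."""
    counts = defaultdict(dict)
    for id_, city, count in rows:
        counts[id_][city] = count
    # A missing (id, city) pair becomes '0', mirroring the
    # IsEmpty(...) ? {('0')} : ... branch of the Pig script.
    return [[id_] + [counts[id_].get(c, '0') for c in cities]
            for id_ in sorted(counts)]

# Sample input from the question.
rows = [
    ("00004589", "IZMIR", "2"),
    ("00005275", "KOCAELI", "1"),
    ("00005275", "ISTANBUL", "1"),
    ("00008523", "ESKISEHIR", "2"),
]
cities = ["IZMIR", "ISTANBUL", "ESKISEHIR", "KOCAELI"]

matrix = build_matrix(rows, cities)
for line in matrix:
    print(",".join(line))
```

The printed rows can be compared against the files written to /tmp/id_city_matrix.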

Related

Pig script to concatenate the values in the tuples

Input:
(11111111,{(A,MARK,APPLE,ABC1,11111111),(B,PAUL,AMAZON,ABC2,11111111),(C,TIM,FIVN,ABC3,11111111),(D,LIN,MULESFT,ABC4,11111111),(E,YEP,UHG,ABC5,11111111),(F,QIN,ATT,ABC6,11111111)})
(22222222,{(A,MARK,APPLE,ABC6,22222222),(B,MARK,AMAZON,ABC7,22222222),(C,MARK,PQE,ABC8,22222222),(D,MARK,AMB,ABC9,22222222),(E,MARK,YZQ,ABC19,22222222),(F,MARK,PQR,,22222222)})
I have grouped the data with the key as above. I should generate the output by concatenating all the values of the tuple including nulls as below:
Output:
(1111111,A^B^C^D^E^F,MARK^PAUL^TIM^LIN^YEP^QIN,APPLE^AMAZON^FIVN^MULESFT^UHG^ATT,ABC1^ABC2^ABC3^ABC4^ABC5^^ABC6)
(2222222,A^B^^D^E^G,TIM^AIN^TIM^BIN^CIN^DIN^RIN,APPLE^AMAZON^PQE^AMB^YZQ^RIN,ABC6^ABC7^ABC8^ABC9^ABC19^^)
Can someone please help me?
Here is a code snippet that may help; adapt it to achieve the expected output.
Input :
1,A
1,B
1,C
2,D
2,E
2,F
Output :
(1,C^B^A)
(2,F^E^D)
Pig Snippet :
data1 = load '/Users/muralirao/learning/pig/a.csv' using PigStorage(',') as (id:int, name:chararray);
req_data = FOREACH (GROUP data1 BY id) {
    names = data1.name;
    GENERATE group AS id, BagToString(names,'^');
};
DUMP req_data;
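The output the question asks for (one concatenated string per column, with a missing value leaving an empty slot) can be modelled in plain Python as below. This is a sketch of the desired result only; it does not claim to reproduce how BagToString treats nulls, and the function name concat_columns and the three-column sample are assumptions for illustration.

```python
def concat_columns(group_key, tuples, sep="^"):
    """Join each column of the grouped tuples with sep; None becomes an empty slot."""
    columns = zip(*tuples)  # transpose rows into columns
    joined = [sep.join("" if v is None else str(v) for v in col)
              for col in columns]
    return (group_key, *joined)

# A shortened version of the second group from the question (last value missing).
bag = [("A", "MARK", "ABC6"), ("B", "MARK", "ABC7"), ("F", "MARK", None)]
result = concat_columns("22222222", bag)
print(result)
```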

How to read a list of values in Pig as a bag and compare it to a specific value?

Input:
ids:
1111,2222,3333,4444
employee:
{"name":"abc","id":"1111"} {"name":"xyz","id":"10"}
{"name":"z","id":"100"} {"name":"m","id":"99"}
{"name":"pqr","id":"3333"}
I want to filter out the employees whose id exists in the given list, that is, keep only those whose id is not in the list.
Expected Output:
{"name":"xyz","id":"10"} {"name":"z","id":"100"}
{"name":"m","id":"99"}
Existing Code:
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
empl = LOAD 'pathToFile' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (data:map[]);
output = FILTER empl BY data#'id' in (idList);
-- not working, states: A column needs to be projected from a relation for it to be used as a scalar
output = FILTER empl BY data#'id' in (idList#id);
-- not working, states: mismatched input 'id' expecting set null
JsonLoader is native in Pig > 0.10, and you can specify the schema:
empl = LOAD 'pathToFile' USING JsonLoader('name:chararray, id:chararray');
DUMP empl;
(abc,1111)
(xyz,10)
(z,100)
(m,99)
(pqr,3333)
You're loading idList as a one-column table of type chararray, but your file holds all the ids on a single line.
You can either load it as a one-column table (this implies modifying your file so there is only one record per line):
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Or keep it as a one-line file and change the separator so the line is not split into columns (otherwise only the first column would be loaded):
idList = LOAD 'pathToFile' USING PigStorage(' ') AS (id:chararray);
idList = FOREACH idList GENERATE FLATTEN(TOKENIZE(id, '[,]')) AS id;
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Now we can do a LEFT JOIN to see which ids are not present in idList, and then a FILTER to keep only those. Note that output is a reserved keyword, so you shouldn't use it as an alias:
res = JOIN empl BY id LEFT, idList BY id;
res = FILTER res BY idList::id IS NULL;
DUMP res;
(xyz,10,)
(m,99,)
(z,100,)
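The LEFT JOIN followed by an IS NULL filter is an anti-join. As a rough check of the expected result, the same logic can be written in plain Python; this is a sketch outside Pig, and the function name anti_join is an assumption for illustration.

```python
def anti_join(employees, id_list):
    """Keep the employees whose id does not appear in id_list."""
    ids = set(id_list)  # set membership stands in for the join lookup
    return [e for e in employees if e["id"] not in ids]

# Sample data from the question.
employees = [
    {"name": "abc", "id": "1111"},
    {"name": "xyz", "id": "10"},
    {"name": "z", "id": "100"},
    {"name": "m", "id": "99"},
    {"name": "pqr", "id": "3333"},
]
id_list = ["1111", "2222", "3333", "4444"]

kept = anti_join(employees, id_list)
print(kept)
```

Unlike the Pig join, this sketch preserves the input order of the employees.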

Pig Latin fetching first field

I have the Pig code below for this sample file.
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
I load the above file using the Pig LOAD command and then project the id, fname and lname fields as follows.
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate (id,fname,lname);
output of each1:
((001,Rajive,reddy)) etc.
Now I want to get the first field of each1, i.e. the ID. How can I get it? I tried the code below, but it shows an error:
each2 = foreach each1 generate(students.id)
I need to get the first field into the each2 relation.
The extra parentheses in the GENERATE clause wrap the fields in a tuple in the each1 relation; simply remove them:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate id,fname,lname;
And you will get something like :
(001,Rajive,reddy)
For the each2 relation, you can get any field of each1 without using the qualifier students; use the field name or the field position like this:
each2 = foreach each1 generate id;
each2 = foreach each1 generate $0;
And you will get something like :
(001)

Apache Pig: Add one field dataset to another one as a new column

Let's say we have this scenario:
dataset1.csv :
datefield
field11, field12, field13
field21, field22, field23
field31, field32, field33
What is the best way to get this?:
field11, field12, field13, datefield
field21, field22, field23, datefield
field31, field32, field33, datefield
I tried to generate one dataset (relation1) with just these columns (after loading and generating):
field11, field12, field13
field21, field22, field23
field31, field32, field33
another one (relation2) with just this column (after loading and generating):
datefield
and then doing this:
finalResult = FOREACH dataset1 GENERATE UDFFunction1(relation1::f1) as firstFields, UDFFunction2(relation2::f2) as lastField
But I'm getting 'A column needs to be projected from a relation for it to be used as a scalar'
The problem is with the second field (the one with the datefield).
I would like to avoid joins, since it would be a little messy workaround.
Any suggestions?
Please, forget my UDF functions. They just format the input Tuples accordingly.
Adding the pig script:
register 's3://bucketName/lib/MyJar.jar';
define ParseOutFilesUDF packageName.ParseOutFiles;
define FormatTimestartedUDF packageName.FormatTimestarted;
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
--This UDF just reformat each tuple, adding a String to each Tuple and returning a new one.
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
--load the same csv again to get the TIMESTARTED field
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
--filter to get only one record, which is something like TIMESTARTED=20160101
filetered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
timestarted = foreach filetered GENERATE $0 as fechaStarted;
-- the FormatTimestartedUDF just gets rid of 'TIMESTARTED=' in order to get the date '20160101'
-- this FOREACH statement is where it fails with 'A column needs to be projected...'
finalResult = FOREACH outFile GENERATE f1, FormatTimestartedUDF(timestarted) as f2;
STORE finalResult INTO 's3://bucketName/output/';
You are getting the error because you are referencing f1, which doesn't exist in outFile, and because timestarted is a relation, not a field. You should iterate over resultAll instead and pass the single-row relation filtered as a scalar by projecting its column:
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
filtered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
finalResult = FOREACH resultAll GENERATE initial, FormatTimestartedUDF(filtered.$0);
STORE finalResult INTO 's3://bucketName/output/';
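What the corrected script computes is a broadcast of one value onto every row: filtered must hold exactly one record, and its column is appended to each tuple. A plain-Python model of that behaviour is sketched below; the function name append_scalar, the sample rows, and the choice to raise when the marker is not unique are assumptions for illustration, not Pig's exact semantics.

```python
def append_scalar(rows, marker_rows, prefix="TIMESTARTED="):
    """Append the date from the single marker row to every data row."""
    if len(marker_rows) != 1:
        # This sketch insists on exactly one marker record.
        raise ValueError("expected exactly one marker row")
    marker = marker_rows[0]
    # Strip the prefix, as the FormatTimestartedUDF is described to do.
    date = marker[len(prefix):] if marker.startswith(prefix) else marker
    return [row + (date,) for row in rows]

rows = [("field11", "field12", "field13"),
        ("field21", "field22", "field23")]
result = append_scalar(rows, ["TIMESTARTED=20160101"])
print(result)
```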

How to address fields in Pig Latin after loading

I have a large file with a lot of columns, which I am loading like this:
A = LOAD '/path/to/file' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS name, $1 as address, $2.. ;
C = FOREACH B FILTER BY (name is NOT NULL);
I get an error saying that the projected field [name] does not exist. I don't want to address columns as $0, $1 and so on. How can I give them identifiers?
That Pig script doesn't run for me, but changing it to this:
A = LOAD '/path/to/file' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS name, $1 as address, $2 as another;
C = FILTER B BY (name is NOT NULL);
does work.
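A nested FOREACH is another option, but it operates on a bag inside a relation, so B has to be grouped first:
G = GROUP B ALL;
C = FOREACH G {
    -- B here is the bag of all records produced by GROUP ... ALL
    filtered_rec = FILTER B BY (name is not null);
    GENERATE FLATTEN(filtered_rec);
}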