How to address fields in Pig Latin after loading

I have a large file with many columns, which I am loading like this:
A = LOAD '/path/to/file' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS name, $1 as address, $2.. ;
C = FOREACH B FILTER BY (name is NOT NULL);
I get an error saying that the projected field [name] does not exist. I don't want to address columns as $0, $1, and so on. How can I give them identifiers?

That Pig script doesn't run for me, but changing it to this:
A = LOAD '/path/to/file' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS name, $1 as address, $2 as another;
C = FILTER B BY (name is NOT NULL);
does work.
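If you would rather name the fields once, a schema in the LOAD statement is another way to do it. This is only a minimal sketch: the field names mirror the ones used above, and the chararray types and the three-column list are assumptions, so extend the schema to match the columns in your file.

```pig
-- Hypothetical schema: names and types are assumptions for illustration.
A = LOAD '/path/to/file' USING PigStorage(',')
    AS (name:chararray, address:chararray, another:chararray);

-- Fields can now be referenced by name without an intermediate FOREACH.
C = FILTER A BY name IS NOT NULL;
DUMP C;
```

With the schema on LOAD, the positional references ($0, $1, ...) still work, so you can mix both styles while migrating a script.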

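A nested FOREACH is another option, but it operates on a bag inside each record, so the relation has to be grouped first (note that GROUP ... ALL puts every record into a single group, which can be slow on a large file):
G = GROUP B ALL;
C = FOREACH G {
filtered_rec = FILTER B BY (name is not null);
GENERATE FLATTEN(filtered_rec);
}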

Related

Pig Latin fetching first field

I have the Pig code below for a sample file.
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
I load the file above using the Pig LOAD command, then iterate over it and select a few fields as follows.
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate (id,fname,lname);
Output of each1:
((001,Rajiv,Reddy)) etc.
Now I want to get the first field of each1, i.e. the ID. How do I get it? I tried the code below, but it shows an error:
each2 = foreach each1 generate(students.id)
I need to get the first field into the each2 relation.
The extra parentheses in the GENERATE clause of each1 wrap the three fields in a single tuple. Simply remove them:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate id,fname,lname;
You will get something like:
(001,Rajiv,Reddy)
For the each2 relation, you can get any field of each1 without the qualifier students. Use either the field name or the field position, like this:
each2 = foreach each1 generate id;
each2 = foreach each1 generate $0;
You will get something like:
(001)
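If you do need to keep the parenthesized form of each1 for some reason, the ID is still reachable, because it sits inside the tuple held in position $0. A small sketch under that assumption; the aliases flat and each3 are made up for illustration:

```pig
-- each1 as written in the question: one tuple-valued field per record.
each1 = FOREACH students GENERATE (id, fname, lname);

-- Option 1: project the first field out of the tuple in position $0.
each2 = FOREACH each1 GENERATE $0.$0;

-- Option 2: unnest the tuple into top-level fields, then select by position.
flat  = FOREACH each1 GENERATE FLATTEN($0);
each3 = FOREACH flat GENERATE $0;
```

Running DESCRIBE each1; before and after removing the parentheses is a quick way to see the difference between the nested and the flat schema.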

Pig error in local mode

I'm trying to sort records by salary in descending order, but the output is not correct. I'm running Pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
I need the email and the salary (in descending order), so here is what I did:
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I leave out B = foreach A generate $1,$2; and proceed, the output is as expected.
Any suggestions?
Cast the bytearray to int and then order. Try this code:
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
Without a declared type, Pig treats your numbers as strings and performs a lexicographical sort instead of a numeric one, which is why 9000 sorts ahead of 40000. When you're loading, assign names and types to prevent this and to make your code more readable and maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)
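Put together, the typed version of the script might look like the sketch below. The path is the placeholder from the question and the field names are the ones suggested above; nothing else is assumed.

```pig
-- Declare names and types at load time so salary is an int, not a bytearray.
A = LOAD '/local_input_path' USING PigStorage(',')
    AS (letter:chararray, email:chararray, salary:int);

-- Keep only the two fields of interest, referenced by name.
B = FOREACH A GENERATE email, salary;

-- salary is declared as int, so ORDER compares numerically.
C = ORDER B BY salary DESC;
DUMP C;
```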

Flatten distinct in apache pig

I have a data set that looks like this:
DUMP A;
(10000,({(10000),(20000),(50000)},{(10000),(20000),(30000)}))
(20000,({(10000),(20000),(50000)},{(20000)},{(10000),(20000),(30000)}))
(30000,({(30000)},{(10000),(20000),(30000)}))
(40000,({(40000)},{(40000),(50000)}))
(50000,({(40000),(50000)},{(10000),(20000),(50000)}))
DESCRIBE A;
{foo: bytearray, bar_gp: (baz: {(foo: bytearray)})}
I eventually want it to look like this:
DUMP A;
(10000,{(10000),(20000),(50000),(30000)})
(20000,{(10000),(20000),(50000),(30000)})
(30000,{(10000),(20000),(30000)})
(40000,{(40000),(50000)})
(50000,{(40000),(50000),(10000),(20000)})
I tried using:
B = FOREACH A GENERATE $0, FLATTEN($1);
C = FOREACH B {D = FOREACH B GENERATE FLATTEN($1); D= DISTINCT D; GENERATE $0, D; }
but I kept getting the error:
expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)
How can I get the desired output? I know I could use a UDF to parse it, but I would like to find a built-in solution.
I think you need to apply DISTINCT to the bag before flattening it.
B = FOREACH A {
D = DISTINCT $1;
GENERATE $0, FLATTEN(D);
}

Pig script for creating an ID x CITY matrix from a given CSV file

I have an input file that contains ID, CITY and COUNT information as shown below, and I want to create a CSV file that contains, for each ID, the count for each CITY. The count should be written as '0' if the ID has no entry for that CITY. I tried to write a Pig script using GROUP BY, COGROUP and FLATTEN, but couldn't get it to produce this sample output.
How can I write a Pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use the script below to create the matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1),FLATTEN ((IsEmpty($2.$2) ? {('0')} : $2.$2)),FLATTEN ((IsEmpty($3.$2) ? {('0')} : $3.$2)),FLATTEN ((IsEmpty($4.$2) ? {('0')} : $4.$2)),FLATTEN ((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');
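As an alternative to the filter-and-cogroup approach, the same matrix can be built with conditional aggregation: turn each row into one column per city, then sum per ID. This is a sketch of a different technique, not the script above. The lowercase field names, the int type for the count, and the output path are assumptions, and the city list is still hard-coded.

```pig
-- Hypothetical schema: the count is loaded as int so it can be summed.
DATA = LOAD '/tmp/data.csv' USING PigStorage(',')
       AS (id:chararray, city:chararray, cnt:int);

-- One column per city: the row's count if the city matches, otherwise 0.
PER_ROW = FOREACH DATA GENERATE
              id,
              (city == 'IZMIR'     ? cnt : 0) AS izmir,
              (city == 'ISTANBUL'  ? cnt : 0) AS istanbul,
              (city == 'ESKISEHIR' ? cnt : 0) AS eskisehir,
              (city == 'KOCAELI'   ? cnt : 0) AS kocaeli;

-- Collapse to one row per ID by summing each city column.
BY_ID  = GROUP PER_ROW BY id;
MATRIX = FOREACH BY_ID GENERATE
             group AS id,
             SUM(PER_ROW.izmir)     AS izmir,
             SUM(PER_ROW.istanbul)  AS istanbul,
             SUM(PER_ROW.eskisehir) AS eskisehir,
             SUM(PER_ROW.kocaeli)   AS kocaeli;

-- Hypothetical output path, chosen so it does not clash with the one above.
STORE MATRIX INTO '/tmp/id_city_matrix_alt' USING PigStorage(',');
```

This version needs a single GROUP instead of four filters and a five-way COGROUP, and it also adds up counts if an ID appears more than once for the same city.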

Pig Group By Addition

I have data files delimited by "|", so I'm using the code below.
RAW_LOG = LOAD 'logs.log' USING TextLoader as (line:chararray);
splt = foreach RAW_LOG generate FLATTEN(STRSPLIT($0, '\\|'));
id_vals = foreach splt generate $4 as uid, $8 as site_id , $9 as dsid , $6 as amt;
I want the SUM(amt) for each site_id. I have tried GROUP BY, but it didn't work.
I am assuming you want the end result to be two columns: site_id and the sum of amt for that site_id.
You can load a pipe-separated file directly using PigStorage; there is no need to load and then split. It is a good idea to provide a schema definition, though you can also access fields by position ($0, $1, ...).
Here is the code:
RAW_LOG = LOAD 'logs.log' USING PigStorage('|') as (//YOUR SCHEMA DEFINITION);
SITE_GRP = group RAW_LOG by site_id;
SITE_SUM = foreach SITE_GRP generate group, SUM(RAW_LOG.amt);
Hope this helps.
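If you don't want to write out the full schema, a positional variant is sketched below. The field positions ($4, $6, $8, $9) are the ones from the question; the casts, in particular treating amt as double, and the total_amt alias are assumptions.

```pig
-- Load the pipe-delimited file directly; no schema, so fields are bytearrays.
RAW_LOG = LOAD 'logs.log' USING PigStorage('|');

-- Name and cast only the fields that are needed (types are assumptions).
id_vals = FOREACH RAW_LOG GENERATE
              (chararray)$4 AS uid,
              (chararray)$8 AS site_id,
              (chararray)$9 AS dsid,
              (double)$6    AS amt;

-- Group by site and sum the amounts within each group.
SITE_GRP = GROUP id_vals BY site_id;
SITE_SUM = FOREACH SITE_GRP GENERATE group AS site_id, SUM(id_vals.amt) AS total_amt;
DUMP SITE_SUM;
```

Inside the FOREACH, the bag is referenced by the alias that was grouped (id_vals here), which is the same pattern as SUM(RAW_LOG.amt) in the answer above.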