Pig Latin fetching first field - apache-pig

I have the Pig code below for this sample file:
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
I load the above file using the Pig LOAD command and then loop through it to pick out a few fields as follows:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate (id,fname,lname);
Output of each1:
((001,Rajiv,Reddy)) etc.
Now I want to get the first field of each1, i.e. the ID. How do I get it? I tried the code below, but it shows an error:
each2 = foreach each1 generate(students.id)
I need the each2 relation to contain only the first field.

The extra parentheses in the each1 statement wrap the three fields in a single tuple. Simply remove them:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate id,fname,lname;
And you will get something like:
(001,Rajiv,Reddy)
For the each2 relation, you can get any field of each1 without using the qualifier students. Use the field name or the field position, like this:
each2 = foreach each1 generate id;
each2 = foreach each1 generate $0;
And you will get something like:
(001)
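To see why the parentheses matter, here is a rough Python analogy (not Pig; the row values and variable names are made up for illustration). With the parentheses, each row holds one field that is itself a tuple, so the "first field" is the whole tuple. Without them, the first field is the id.

```python
# Hypothetical rows standing in for the loaded students relation.
students = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
]

# Analogy for: generate (id,fname,lname);  -> one field per row, holding a tuple
nested = [((s[0], s[1], s[2]),) for s in students]

# Analogy for: generate id,fname,lname;    -> three separate fields per row
flat = [(s[0], s[1], s[2]) for s in students]

# Analogy for: generate $0;
first_of_nested = [row[0] for row in nested]  # still the whole (id, fname, lname) tuple
first_of_flat = [row[0] for row in flat]      # just the id

print(first_of_nested)
print(first_of_flat)
```

This is only an analogy for the shape of the data; Pig's own typing and schema rules still apply to the real script.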


Related

How to write a Pig script if a line contains the same delimiter more than once?

Here I have a line in my "test.csv" file as follows:
1987654,file not uploaded,please try again,Johnson
I would like to get the following output using Pig:
Task ID
1987654
Message
file not uploaded,please try again
User
Johnson
Since all lines have the same format, the simple solution is to load each line into 4 fields with a comma as the delimiter and then use CONCAT to join the 2nd and 3rd fields with a comma between them.
A = LOAD 'data.txt' USING PigStorage(',') AS (a1:int,a2:chararray,a3:chararray,a4:chararray);
B = FOREACH A GENERATE a1,CONCAT(CONCAT(a2,','),a3),a4;
DUMP B;
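The same split-and-rejoin idea can be sketched in Python (not Pig; the function names are hypothetical). The first function mirrors the CONCAT approach and assumes exactly four comma-separated fields. The second is a variant, not from the answer, that keeps working if the message contains more commas.

```python
line = "1987654,file not uploaded,please try again,Johnson"

def split_fixed(text):
    # Mirrors the answer: four fields, then rejoin fields 2 and 3 with a comma.
    task_id, part1, part2, user = text.split(",")
    return task_id, part1 + "," + part2, user

def split_flexible(text):
    # Variant (an assumption, not from the answer): first field is the task id,
    # last field is the user, everything in between is the message.
    parts = text.split(",")
    return parts[0], ",".join(parts[1:-1]), parts[-1]

print(split_fixed(line))
print(split_flexible(line))
```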

Pig error in local mode

I'm trying to sort by salary in descending order, but the output is not correct. I'm running Pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
Since I need the email and the salary (in descending order), here is what I did:
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I leave out B = foreach A generate $1,$2; and proceed, the output is as expected.
Any suggestion on this?
Cast the bytearray to int and then order. Try this code:
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)
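The difference between the two orderings is easy to reproduce outside Pig. The Python sketch below (not Pig; variable names are made up) sorts the salary values from the question once as text and once as integers. The text ordering matches the unexpected output shown in the question.

```python
# Salary column from the question, read as raw text.
salaries = ["5000", "3000", "10000", "2000", "40000", "7000", "1000", "9000", "110000"]

# Descending sort on text: compared character by character.
as_text = sorted(salaries, reverse=True)

# Descending sort after casting to int: compared by numeric value.
as_numbers = sorted((int(s) for s in salaries), reverse=True)

print(as_text)
print(as_numbers)
```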

PIG script for creating IDxCITY matrix from given csv file

I have an input file that contains ID, CITY and COUNT information as shown below, and I want to create a CSV file that lists, for each ID, the count for every CITY. The count should be written as '0' if the ID has no entry for that CITY. I tried to write a Pig script using GROUP BY, COGROUP and FLATTEN but could not get it to produce this sample output.
How can I write a Pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use the script below to create the matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1),FLATTEN ((IsEmpty($2.$2) ? {('0')} : $2.$2)),FLATTEN ((IsEmpty($3.$2) ? {('0')} : $3.$2)),FLATTEN ((IsEmpty($4.$2) ? {('0')} : $4.$2)),FLATTEN ((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');
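For readers who want to check the expected matrix by hand, here is the same pivot as a small Python sketch (not Pig; the function and variable names are hypothetical). Like the script above, it assumes the list of cities is known up front and fills missing combinations with '0'.

```python
rows = [
    ("00004589", "IZMIR", "2"),
    ("00005275", "KOCAELI", "1"),
    ("00005275", "ISTANBUL", "1"),
    ("00008523", "ESKISEHIR", "2"),
]
# Column order taken from the desired output header.
cities = ["IZMIR", "ISTANBUL", "ESKISEHIR", "KOCAELI"]

def pivot(records, columns):
    matrix = {}
    for id_, city, count in records:
        # Start every ID with '0' in each city column, then overwrite real counts.
        matrix.setdefault(id_, {c: "0" for c in columns})[city] = count
    return [[id_] + [counts[c] for c in columns] for id_, counts in sorted(matrix.items())]

for line in pivot(rows, cities):
    print(",".join(line))
```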

Pig Programming Logic

10278929012|HDFC1001|SBI|2014-08-03|8000|S
10278929012|HDFC1001|HDFC|2014-08-17|500|S
I need to find out whether the atm_id belongs to the same bank as bank_name, and produce an indicator accordingly.
I need output like this:
10278929012|HDFC1001|SBI|diff_bank
10278929012|HDFC1001|HDFC|same_bank
atm_trans = LOAD '/user/cloudera/inputfiles/atm_trans.txt' USING PigStorage('|') as(accnt_no:long,atm_id:chararray,bank_name :chararray,date:chararray,amt:chararray,status:chararray);
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,(bank_name matches atm_id ?'same_bank' : 'diff_bank') as ind;
dump atm_trans_each;
But I am getting a syntax error. Can somebody give me the correct statement to get this output?
Can you try this?
input.txt
10278929012|HDFC1001|SBI|2014-08-03|8000|S
10278929012|HDFC1001|HDFC|2014-08-17|500|S
PigScript:
atm_trans = LOAD 'input.txt' USING PigStorage('|') as(accnt_no:long,atm_id:chararray,bank_name:chararray,date:chararray,amt:chararray,status:chararray);
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,((STARTSWITH(atm_id,bank_name)== true)?'same_bank':'diff_bank') as ind;
STORE atm_trans_each INTO 'output' USING PigStorage('|');
Update: for Pig version 0.8, use this instead:
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,((REGEX_EXTRACT(atm_id,'([A-Za-z]+)',1) == bank_name)?'same_bank':'diff_bank');
output:
10278929012|HDFC1001|SBI|diff_bank
10278929012|HDFC1001|HDFC|same_bank
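The two checks can be compared side by side in a short Python sketch (not Pig; the function names are made up). The first mirrors the STARTSWITH version, the second mirrors the REGEX_EXTRACT version by pulling the leading letters out of atm_id and comparing them with bank_name.

```python
import re

def indicator_startswith(atm_id, bank_name):
    # Same idea as STARTSWITH(atm_id, bank_name).
    return "same_bank" if atm_id.startswith(bank_name) else "diff_bank"

def indicator_regex(atm_id, bank_name):
    # Same idea as REGEX_EXTRACT(atm_id, '([A-Za-z]+)', 1) == bank_name.
    match = re.search(r"([A-Za-z]+)", atm_id)
    prefix = match.group(1) if match else None
    return "same_bank" if prefix == bank_name else "diff_bank"

for bank in ("SBI", "HDFC"):
    print(bank, indicator_startswith("HDFC1001", bank), indicator_regex("HDFC1001", bank))
```

Note that the two are not strictly equivalent: the prefix check would also accept a bank_name that is only the beginning of the letters in atm_id, while the regex check requires the extracted letters to be equal to bank_name.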

Getting empty values, while parsing XML using Pig

I have an XML file like the following. I am loading the XML using XMLLoader, and that works fine. But when I fetch the values, I get empty values:
<mfh>
<f></f>
<sn>***</sn>
<st>****</st>
<vnr>****</vnr>
<cb>***</cb>
</mfh>
<md>
<nei>
<ne>***</ne>
<k>***</k>
<n>***</n>
</nei>
<mi>
<mts>**</mts>
<g>**</g>
<mv>
<m>***</m>
</mv>
</mi>
.....
.....
</md>
My Pig script is as follows:
REGISTER '/usr/lib/pig/piggybank.jar'
input_xml = load '/user/root/sample.xml' using org.apache.pig.piggybank.storage.XMLLoader('mfh') as (doc:chararray);
dump input_xml;
required_tags = foreach input_xml generate FLATTEN(REGEX_EXTRACT_ALL(doc,'<mfh>\\s*<ffv>(.*)</ffv>\\s*</mfh'));
dump required_tags;
The script does not give any errors, but the output is (). I have updated the XML file and I want to parse all the values.
Can you try this?
To print the value of the 'ffv' element as per your example:
required_tags = foreach input_xml generate FLATTEN(REGEX_EXTRACT_ALL(doc,'<mfh>\\s+<ffv>(.*)</ffv>.*'));
To print all the values of ffv,sn,st,vn,cbt:
required_tags = foreach input_xml generate FLATTEN(REGEX_EXTRACT_ALL(doc,'<mfh>\\s+<ffv>(.*)</ffv>\\s+<sn>(.*)</sn>\\s+<st>(.*)</st>\\s+<vn>(.*)</vn>\\s+<cbt>(.*)</cbt>\\s+</mfh>'));
You can also do this:
required_tags = foreach input_xml generate FLATTEN(REGEX_EXTRACT_ALL(doc,'<mfh>\\s*<ffv>(.*)</ffv>\\s*<sn>(.*)</sn>\\s*<st>(.*)</st>\\s*<vn>(.*)</vn>\\s*<cbt>(.*)</cbt>\\s*</mfh>')) AS (ffv,sn,st,vn,cbt);
dump required_tags;
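To experiment with the patterns without running Pig, here is a rough Python sketch (not Pig). It assumes REGEX_EXTRACT_ALL needs the pattern to match the whole input, which re.fullmatch imitates, and it uses an empty tuple as a stand-in for "nothing extracted". The sample record and its values are made up, using the tag names from the answers.

```python
import re

# Hypothetical record, shaped like one <mfh> element as a single string.
doc = "<mfh>\n<ffv>1</ffv>\n<sn>abc</sn>\n<st>x</st>\n<vn>v1</vn>\n<cbt>c</cbt>\n</mfh>"

# Pattern from the question: covers only <ffv> and stops at "</mfh".
question_pattern = r"<mfh>\s*<ffv>(.*)</ffv>\s*</mfh"

# Pattern from the answer: one capture group per tag, up to the closing </mfh>.
answer_pattern = (
    r"<mfh>\s+<ffv>(.*)</ffv>\s+<sn>(.*)</sn>\s+<st>(.*)</st>"
    r"\s+<vn>(.*)</vn>\s+<cbt>(.*)</cbt>\s+</mfh>"
)

def extract_all(pattern, text):
    # Whole-string match; returns the captured groups, or () when there is no match.
    match = re.fullmatch(pattern, text)
    return match.groups() if match else ()

print(extract_all(question_pattern, doc))
print(extract_all(answer_pattern, doc))
```

The question's pattern fails here because the other tags sit between </ffv> and </mfh>, and "." does not cross line breaks by default, so the whole record is never matched.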