I have an XML file like the following. I am loading the XML using XMLLOader. It is working fine. But, while fetching the values it is giving empty values:
<mfh>
<f></f>
<sn>***</sn>
<st>****</st>
<vnr>****</vnr>
<cb>***</cb>
</mfh>
<md>
<nei>
<ne>***</ne>
<k>***</k>
<n>***</n>
</nei>
<mi>
<mts>**</mts>
<g>**</g>
<mv>
<m>***</m>
</mv>
</mi>
.....
.....
</md>
My Pig script is as follows:
REGISTER '/usr/lib/pig/piggybank.jar'
a = load '/user/root/sample.xml' using org.apache.pig.piggybank.storage.XMLLoader('mfh') as (doc:chararray);
dump input_xml;
b = foreach input_xml generate FLATTEN(REGEX_EXTRACT_ALL(doc,'<mfh>\\s*<ffv>(.*)</ffv>\\s*</mfh'));
dump required_tags;
The output of the script is as follows:
It is not givning any errors, but the output is (). I have updated the XML file and I want to parse all the values.
Can you try this ?
To print the value of 'ffv' attribute as per your example:
required_tags = foreach input_xml generate FLATTEN(REGEX_EXTRACT_ALL(doc,'<mfh>\\s+<ffv>(.*)</ffv>.*'));
To print all the values of ffv,sn,st,vn,cbt:
required_tags = foreach input_xml generate FLATTEN(REGEX_EXTRACT_ALL(doc,'<mfh>\\s+<ffv>(.*)</ffv>\\s+<sn>(.*)</sn>\\s+<st>(.*)</st>\\s+<vn>(.*)</vn>\\s+<cbt>(.*)</cbt>\\s+</mfh>'));
you can do this
required_tags = foreach input_xml generate FLATTEN(REGEX_EXTRACT_ALL(doc,'\s*(.)\s(.)\s(.)\s(.)\s(.*)')) AS (ffv,sn,st,vn,cbt);
dump required_tags;
Related
I have data like this:
1,234,"john, lee", john#xyz.com
I want to remove , inside "" with space using pig script. So that my data will look like:
1,234,john lee, john#xyz.com
I tried using CSVExcelStorage to load this data but i need to use '-tagFile' option as well which is not supported in CSVExcelStorage . So i am planning to use PigStorage only and then replace any comma (,) inside quotes.
I am stuck on this. Any help is highly appreciated. Thanks
Below command will help:
csvFile = load '/path/to/file' using PigStorage(',');
result = foreach csvFile generate $0 as (field1:chararray),$1 as (field2:chararray),CONCAT(REPLACE($2, '\\"', '') , REPLACE($3, '\\"', '')) as field3,$4 as (field4:chararray);
Ouput:
(1,234,john lee, john#xyz.com)
Load it into a single field and then use STRSPLIT and REPLACE
A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE STRSPLIT(line,'\\"',3);
C = FOREACH B GENERATE REPLACE($1,',','');
D = FOREACH C GENERATE CONCAT(CONCAT($0,$1),$2); -- You can further use STRSPLIT to get individual fields or just CONCAT
E = FOREACH D GENERATE STRSPLIT(D.$0,',',4);
DUMP E;
A
1,234,"john, lee", john#xyz.com
B
(1,234,)(john, lee)(, john#xyz.com)
C
(1,234,)(john lee)(, john#xyz.com)
D
(1,234,john lee, john#xyz.com)
E
(1),(234),(john lee),(john#xyz.com)
I got the perfect way to do this. A very generic solution is as below:
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
/*replace comma(,) if it appears in column content*/
replaceComma = FOREACH data GENERATE filename, REPLACE (record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '');
/*replace the quotes("") which is present around the column if it have comma(,) as its a csv file feature*/
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE ($4,'"','') as record;
Detailed use case is available at my blog
Here I have a line in my "test.csv" file as follows:
1987654,file not uploaded,please try again,Johnson
I would like to get output as follows using Pig
Task ID
1987654
Message
file not uploaded,please try again
User
Johnson
Since all lines have the same format, the simple solution is to load it into 4 fields with comma as the delimiter and then use CONCAT to join the 2nd and 3rd field along with a comma.
A = LOAD 'data.txt' USING PigStorage(',') AS (a1:int,a2:chararray,a3:chararray,a4:chararray);
B = FOREACH A GENERATE a1,CONCAT(CONCAT(a2,','),a3),a4;
DUMP B;
I have below pig code for a sample file.
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
I load the above file using PIG load command & then loop thru it and get 2,3 fields as follows.
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate (id,fname,lname);
output of each1:
((001,Rajive,reddy)) etc.
Now i wanna get 1st field of each1 i.e.ID how to get it. I tried below code but showing error
each2 = foreach each1 generate(students.id)
Need to get the first filed from each2 relation.
The extra parenthesis are added to the each1 relation, simply remove them:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate id,fname,lname;
And you will get something like :
(001,Rajive,reddy)
For the each2 relation, you can get any filed of each1 without using the qualifier students, use filed name or filed position like this :
each2 = foreach each1 generate id;
each2 = foreach each1 generate $0;
And you will get something like :
(001)
each2 = foreach each1 generate $0;
I am trying to parse a log extract with multiple delimiters with sample data as below using pig
CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5
msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64
My code is as below:
A = LOAD '/user/cef.csv' USING PigStorage(' ') as
(a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray,g:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2),STRSPLIT(d,'=',2),STRSP LIT(e,'=',2),STRSPLIT(f,'=',2),STRSPLIT(g,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2),FLATTEN($3),FLATTEN($4),FLATTEN($5);
D = FOREACH C GENERATE $2,flatten(STRSPLIT($4,'"',2)),flatten(STRSPLIT($5,'"',2)),$7,$9;
E = FOREACH D GENERATE (int)$0,(chararray)$2,(chararray)$3,(int)$5,(int)$6 as (a:int,b:chararray,c:chararray,D:int,E:int);
Now when i dump E,i get the error
grunt> 2015-05-25 04:06:48,092 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1031: Incompatable schema: left is
"a:int,b:chararray,c:chararray,D:int,E:int", right is ":int"
I am trying to cast the output of my flatten and strsplit operations into chararray and int.
Please let me know whether this can be done
Thank you for the help!
Your problem is how you use the as clause. Since you place the as after the sixth parameter, it assumes you are trying to specify that schema only for that sixth parameter. Therefore, you are assigning a schema of six fields to only one, hence the error.
Do it like this:
E = FOREACH D GENERATE (int)$0 as a:int,(chararray)$2 as b,(chararray)$3 as c,(int)$5 as d,(int)$6 as e;
However, you are casting 09:41:38" to an int, so it will give you another error once you change it. You need to check again how you are splitting the data.
In my humble opinion, you should try to split the files by their delimiter before processing them in Pig, and then load them with their delimiter and perform an union. If your data is too large, then forget this idea... But your code is going to get too messy if you have several delimiters in the same file.
10278929012|HDFC1001|SBI|2014-08-03|8000|S
10278929012|HDFC1001|HDFC|2014-08-17|500|S
I need to find out if the atm_id belongs to the same bank then I need a indicator to be produced
I need output like this
10278929012|HDFC1001|SBI|diff_bank
10278929012|HDFC1001|HDFC|same_bank
atm_trans = LOAD '/user/cloudera/inputfiles/atm_trans.txt' USING PigStorage('|') as(accnt_no:long,atm_id:chararray,bank_name :chararray,date:chararray,amt:chararray,status:chararray);
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,(bank_name matches atm_id ?'same_bank' : 'diff_bank') as ind;
dump atm_trans_each;
but I am getting syntax error. Can somebody correct it give me the correct statement to get the ouput;
Can you try this?
input.txt
10278929012|HDFC1001|SBI|2014-08-03|8000|S
10278929012|HDFC1001|HDFC|2014-08-17|500|S
PigScript:
atm_trans = LOAD 'input.txt' USING PigStorage('|') as(accnt_no:long,atm_id:chararray,bank_name:chararray,date:chararray,amt:chararray,status:chararray);
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,((STARTSWITH(atm_id,bank_name)== true)?'same_bank':'diff_bank') as ind;
STORE atm_trans_each INTO 'output' USING PigStorage('|');
Update: In 0.8 version
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,((REGEX_EXTRACT(atm_id,'([A-Za-z]+)',1) == bank_name)?'same_bank':'diff_bank');
output:
10278929012|HDFC1001|SBI|diff_bank
10278929012|HDFC1001|HDFC|same_bank