I have data like this:
1,234,"john, lee", john#xyz.com
I want to remove the comma inside the double quotes (leaving just the space) using a Pig script, so that my data looks like this:
1,234,john lee, john#xyz.com
I tried using CSVExcelStorage to load this data, but I also need the '-tagFile' option, which CSVExcelStorage does not support. So I am planning to use PigStorage only and then replace any comma (,) inside quotes.
I am stuck on this. Any help is highly appreciated. Thanks
The commands below will help:
csvFile = load '/path/to/file' using PigStorage(',');
result = foreach csvFile generate $0 as (field1:chararray),$1 as (field2:chararray),CONCAT(REPLACE($2, '\\"', '') , REPLACE($3, '\\"', '')) as field3,$4 as (field4:chararray);
Output:
(1,234,john lee, john#xyz.com)
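Load each line into a single field, then use STRSPLIT and REPLACE:
A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
-- Split on the double quotes into three parts and flatten the resulting tuple
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\"',3)) AS (head:chararray, quoted:chararray, tail:chararray);
-- Remove the comma from the quoted part only, keeping the other two parts
C = FOREACH B GENERATE head, REPLACE(quoted,',','') AS quoted, tail;
D = FOREACH C GENERATE CONCAT(CONCAT(head,quoted),tail) AS line; -- You can further use STRSPLIT to get individual fields or just CONCAT
E = FOREACH D GENERATE STRSPLIT(line,',',4);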
DUMP E;
A
1,234,"john, lee", john#xyz.com
B
(1,234,)(john, lee)(, john#xyz.com)
C
(1,234,)(john lee)(, john#xyz.com)
D
(1,234,john lee, john#xyz.com)
E
(1),(234),(john lee),(john#xyz.com)
I found a good way to do this. A generic solution is below:
/* assumption: the data contains no tab characters, so with a tab delimiter each whole line is loaded into the single field 'record' */
data = LOAD 'data.csv' using PigStorage('\t','-tagFile') AS (filename:chararray, record:chararray);
/* remove a comma (,) if it appears inside quoted column content */
replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '') AS record;
/* remove the quotes ("") that surround a column containing a comma, as is standard in CSV files */
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE(record,'"','') AS record;
Detailed use case is available at my blog
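For anyone who wants to check the regular expression outside Pig, here is a minimal Python sketch of the same two REPLACE steps. The function name strip_quoted_commas is hypothetical, and Python's re module stands in for the Java regex engine that Pig uses; for this particular pattern the two should behave the same, but that is an assumption worth verifying on your own data.

```python
import re

# A comma is "inside quotes" when an odd number of double quotes follows it.
# The negative lookahead rejects commas followed by an even number of quotes,
# so only commas inside a quoted field are matched.
QUOTED_COMMA = re.compile(r',(?!(([^"]*"){2})*[^"]*$)')


def strip_quoted_commas(record: str) -> str:
    """Drop commas inside quoted fields, then drop the quotes themselves."""
    without_commas = QUOTED_COMMA.sub('', record)
    return without_commas.replace('"', '')


if __name__ == "__main__":
    print(strip_quoted_commas('1,234,"john, lee", john#xyz.com'))
```

The sketch assumes quotes are balanced on every line; a stray quote would change which commas count as "inside".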
Related
I need to convert the value of a column to uppercase in Pig.
I was able to do it using UPPER, but this creates a new column.
For example:
A = Load 'MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
Dump A;
Returns
a,b,c
d,e,f
Now I need to convert second column to upper case.
B = Foreach A generate *,UPPER(column2);
Dump B;
returns
a,b,c,B
d,e,f,E
But I need
a,B,c
d,E,f
Please let me know if there is a way to do so.
I haven't tried this myself, but you can try it like this:
B = Foreach A generate column1,UPPER(column2),column3;
Using the "*" in the below line is the reason for the extra column:
B = FOREACH A generate *, UPPER(column2);
Instead use the below:
B = Foreach A generate column1, UPPER(column2), column3;
You can also do it with a user-defined function that ships with Apache Pig in the PiggyBank library.
First, find the PiggyBank jar.
command
find / -name "piggybank*.jar*"
Now go to the Pig Grunt shell.
code
grunt> register /usr/local/pig-0.16.0/contrib/piggybank/java/piggybank.jar;
grunt> A = Load 'data/MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
grunt> dump A;
result
(a,b,c)
(d,e,f)
Now convert second column to upper case.
grunt> B = foreach A generate column1,org.apache.pig.piggybank.evaluation.string.UPPER(column2),column3;
grunt> dump B;
result
(a,B,c)
(d,E,f)
I need to extract the part of the string that comes after the '-'.
Let's say:
LONGNAME
Andrew-stellar
Alex-COOK
Expected output:
stellar
COOK
I tried with:
REGEX_EXTRACT(LONGNAME,'(-.*)',1) as shortname
But it gives:
-stellar
-COOK
How can I remove the '-'?
Try adding the REPLACE function as well:
A = LOAD 'data' USING PigStorage() AS (longname:chararray);
B = FOREACH A GENERATE REPLACE(REGEX_EXTRACT(longname,'(-.*)',1),'-','') as shortname;
DUMP B;
Output:
(stellar)
(COOK)
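As a side note, the hyphen can also be left out of the capture group so that no second REPLACE step is needed. The Python sketch below illustrates the idea with a hypothetical helper short_name; in Pig the matching call would presumably be REGEX_EXTRACT(longname,'-(.*)',1), which I have not run against this data.

```python
import re


def short_name(longname):
    """Return the text after the first hyphen, or None when there is no hyphen."""
    # The hyphen sits outside the parentheses, so group 1 excludes it.
    match = re.search(r'-(.*)', longname)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in ['Andrew-stellar', 'Alex-COOK']:
        print(short_name(name))
```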
Here I have a line in my "test.csv" file as follows:
1987654,file not uploaded,please try again,Johnson
I would like to get output as follows using Pig
Task ID
1987654
Message
file not uploaded,please try again
User
Johnson
Since all lines have the same format, the simple solution is to load it into 4 fields with comma as the delimiter and then use CONCAT to join the 2nd and 3rd field along with a comma.
A = LOAD 'data.txt' USING PigStorage(',') AS (a1:int,a2:chararray,a3:chararray,a4:chararray);
B = FOREACH A GENERATE a1,CONCAT(CONCAT(a2,','),a3),a4;
DUMP B;
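If the message could contain more than one comma, the fixed four-field load would no longer line up. A more general rule is "first field is the task ID, last field is the user, everything in between is the message". The Python sketch below shows that rule with a hypothetical parse_line helper; it is an illustration of the logic, not Pig code, and it assumes the ID and user fields never contain commas.

```python
def parse_line(line):
    """Split a record into (task_id, message, user).

    Assumes the first and last comma-separated parts are the ID and the user,
    and that any commas in between belong to the message.
    """
    parts = line.split(',')
    task_id = parts[0]
    user = parts[-1]
    message = ','.join(parts[1:-1])  # re-join the middle parts with their commas
    return task_id, message, user


if __name__ == "__main__":
    task_id, message, user = parse_line(
        '1987654,file not uploaded,please try again,Johnson')
    print('Task ID', task_id, 'Message', message, 'User', user, sep='\n')
```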
I'm trying to replace all commas in a chararray like this:
Example of input lines:
1,compras com cartão, comprei (cp1,cp2,cp3), 206-01-01 00:00:00
Output example:
1,compras com cartão, comprei (cp1 cp2 cp3), 206-01-01 00:00:00
Using this approach:
raw_data = LOAD 's3://datalake/example'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE') AS (id:int, transaction:chararray, transaction_name:chararray, date:chararray);
apply_cleanness = FOREACH raw_data GENERATE id, transaction, REPLACE(transaction_name,',','') as transaction_name, date;
But this command only removes the first occurrence of the comma, and the result is:
1,compras com cartão, comprei (cp1 cp2, cp3), 206-01-01 00:00:00
What am I doing wrong?
Thanks,
There is no clear demarcation of the 3rd field. You have 2 options. The first is to enclose the 3rd field in quotes and then use the Pig script you have.
1,compras com cartão, "comprei (cp1,cp2,cp3)", 206-01-01 00:00:00
raw_data = LOAD 's3://datalake/example' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE') AS (id:int, transaction:chararray, transaction_name:chararray, date:chararray);
apply_cleanness = FOREACH raw_data GENERATE id, transaction, REPLACE(transaction_name,',','') as transaction_name, date;
Alternatively, you can load the fields using comma as the delimiter and then generate the 3rd field as the combination of the 3rd, 4th and 5th fields from the load. See below:
A = LOAD 'test16.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
B = FOREACH A GENERATE $0 as id:int,$1 as transaction:chararray,CONCAT(CONCAT(CONCAT(CONCAT($2,' '),$3),' '),$4) as transaction_name:chararray,$5 as date:chararray;
DUMP B;
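Another way to look at the problem is to replace only the commas that sit inside parentheses before the line is split into fields. The Python sketch below shows such a pattern; the helper name clean_line is hypothetical, and the pattern assumes parentheses are balanced and never nested. The same lookahead could probably be passed to Pig's REPLACE on a whole line loaded with TextLoader, but I have not tested that.

```python
import re

# Match a comma only when a ')' follows it before any other parenthesis,
# which means the comma is inside a "(...)" group.
PAREN_COMMA = re.compile(r',(?=[^()]*\))')


def clean_line(line):
    """Replace commas inside parentheses with spaces; leave other commas alone."""
    return PAREN_COMMA.sub(' ', line)


if __name__ == "__main__":
    print(clean_line(
        '1,compras com cartão, comprei (cp1,cp2,cp3), 206-01-01 00:00:00'))
```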
I am trying to parse a log extract with multiple delimiters with sample data as below using pig
CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5
msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64
My code is as below:
A = LOAD '/user/cef.csv' USING PigStorage(' ') as
(a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray,g:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2),STRSPLIT(d,'=',2),STRSPLIT(e,'=',2),STRSPLIT(f,'=',2),STRSPLIT(g,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2),FLATTEN($3),FLATTEN($4),FLATTEN($5);
D = FOREACH C GENERATE $2,flatten(STRSPLIT($4,'"',2)),flatten(STRSPLIT($5,'"',2)),$7,$9;
E = FOREACH D GENERATE (int)$0,(chararray)$2,(chararray)$3,(int)$5,(int)$6 as (a:int,b:chararray,c:chararray,D:int,E:int);
Now when I dump E, I get the error:
grunt> 2015-05-25 04:06:48,092 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1031: Incompatable schema: left is
"a:int,b:chararray,c:chararray,D:int,E:int", right is ":int"
I am trying to cast the output of my flatten and strsplit operations into chararray and int.
Please let me know whether this can be done
Thank you for the help!
Your problem is how you use the as clause. Since you place the as after the fifth expression, Pig assumes you are trying to specify that schema for that fifth expression only. Therefore, you are assigning a schema of five fields to a single field, hence the error.
Do it like this:
E = FOREACH D GENERATE (int)$0 as a:int,(chararray)$2 as b,(chararray)$3 as c,(int)$5 as d,(int)$6 as e;
However, you are casting 09:41:38" to an int, so it will give you another error once you change it. You need to check again how you are splitting the data.
In my humble opinion, you should try to split the files by their delimiter before processing them in Pig, and then load them with their delimiter and perform a union. If your data is too large, then forget this idea... But your code is going to get too messy if you have several delimiters in the same file.
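To illustrate the kind of preprocessing meant here, below is a minimal Python sketch that pulls the key/value pairs out of one sample line before it ever reaches Pig. The helper name parse_extension and both patterns are my own assumptions based only on the sample in the question (escaped "\=" pairs after the last "|", plus a plain eventId); a real log would likely need more cases.

```python
import re

# key\=value pairs, where the value is either a quoted string or a run of non-spaces
ESCAPED_PAIR = re.compile(r'(\w+)\\=("[^"]*"|\S+)')


def parse_extension(line):
    """Return the fields found after the last '|' of a line as a dict."""
    # Everything after the last '|' is treated as the key=value extension.
    extension = line.rsplit('|', 1)[-1]
    fields = {}
    event = re.search(r'\beventId=(\d+)', extension)
    if event:
        fields['eventId'] = int(event.group(1))
    for key, value in ESCAPED_PAIR.findall(extension):
        fields[key] = value.strip('"')  # drop the surrounding quotes, if any
    return fields


if __name__ == "__main__":
    sample = (r'CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5 '
              r'msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64')
    print(parse_extension(sample))
```

The resulting fields could then be written out with a single delimiter, so that the Pig script only needs one plain PigStorage load.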