Apache Pig how to replace all comma in chararray - apache-pig

I'm trying to replace all comma in a chararray like this:
Example of input lines:
1,compras com cartão, comprei (cp1,cp2,cp3), 206-01-01 00:00:00
Output example:
1,compras com cartão, comprei (cp1 cp2 cp3), 206-01-01 00:00:00
Using this approach:
raw_data = LOAD 's3://datalake/example'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE') AS (id:int, transaction:chararray, transaction_name:chararray, date:chararray);
apply_cleanness = FOREACH raw_data GENERATE id:int, ransaction:chararray, REPLACE(transaction_name,',','') as transaction_name, date:chararray;
But this command just remove the first occurrence of comma, and the result is:
1,compras com cartão, comprei (cp1 cp2, cp3), 206-01-01 00:00:00
What i'm doing wrong?
Thanks,

There is no clear demarkation of the 3rd field.You have 2 options.Enclose the 3rd field in quotes and then use the pigscript you have.
1,compras com cartão, "comprei (cp1,cp2,cp3)", 206-01-01 00:00:00
raw_data = LOAD 's3://datalake/example' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE') AS (id:int, transaction:chararray, transaction_name:chararray, date:chararray);
apply_cleanness = FOREACH raw_data GENERATE id:int, ransaction:chararray, REPLACE(transaction_name,',','') as transaction_name, date:chararray;
Alternatively, you can load the fields using comma as the delimiter and then generate the 3rd field as the combination of 3,4,5 fields in the load.See below
A = LOAD 'test16.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
B = FOREACH A GENERATE $0 as id:int,$1 as transaction:chararray,CONCAT(CONCAT(CONCAT(CONCAT($2,' '),$3),' '),$4) as transaction_name:chararray,$5 as date:chararray;
DUMP B;

Related

data processing in pig , with tab separate

I am very new to Pig , so facing some issues while trying to perform very basic processing in Pig.
1- Load that file using Pig
2- Write a processing logic to filter records based on Date , for example the lines have 2 columns col_1 and col_2 ( assume the columns are chararray ) and I need to get only the records which are having 1 day difference between col_1 and col_2.
3- Finally store that filtered record in Hive table .
Input file ( tab separated ) :-
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I am getting like below :-
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
Not sure Why ?
Please can some one help me in this how to parse tab separated file and how to covert that chararray to Date and filter based on Day difference ?
Thanks
Convert the columns to datetime object using ToDate and use DaysBetween.This should give the difference and if the difference == 1 then filter.Finally load it hive.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
B = FOREACH A GENERATE DaysBetween(ToDate(col_1,'yyyy-MM-dd HH:mm:ss'),ToDate(col_2,'yyyy-MM-dd HH:mm:ss')) as day_diff;
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();

replace comma(,) only if its inside quotes("") in Pig

I have data like this:
1,234,"john, lee", john#xyz.com
I want to remove , inside "" with space using pig script. So that my data will look like:
1,234,john lee, john#xyz.com
I tried using CSVExcelStorage to load this data but i need to use '-tagFile' option as well which is not supported in CSVExcelStorage . So i am planning to use PigStorage only and then replace any comma (,) inside quotes.
I am stuck on this. Any help is highly appreciated. Thanks
Below command will help:
csvFile = load '/path/to/file' using PigStorage(',');
result = foreach csvFile generate $0 as (field1:chararray),$1 as (field2:chararray),CONCAT(REPLACE($2, '\\"', '') , REPLACE($3, '\\"', '')) as field3,$4 as (field4:chararray);
Ouput:
(1,234,john lee, john#xyz.com)
Load it into a single field and then use STRSPLIT and REPLACE
A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE STRSPLIT(line,'\\"',3);
C = FOREACH B GENERATE REPLACE($1,',','');
D = FOREACH C GENERATE CONCAT(CONCAT($0,$1),$2); -- You can further use STRSPLIT to get individual fields or just CONCAT
E = FOREACH D GENERATE STRSPLIT(D.$0,',',4);
DUMP E;
A
1,234,"john, lee", john#xyz.com
B
(1,234,)(john, lee)(, john#xyz.com)
C
(1,234,)(john lee)(, john#xyz.com)
D
(1,234,john lee, john#xyz.com)
E
(1),(234),(john lee),(john#xyz.com)
I got the perfect way to do this. A very generic solution is as below:
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
/*replace comma(,) if it appears in column content*/
replaceComma = FOREACH data GENERATE filename, REPLACE (record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '');
/*replace the quotes("") which is present around the column if it have comma(,) as its a csv file feature*/
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE ($4,'"','') as record;
Detailed use case is available at my blog

Convert the value of a column to uppercase in pig

I need to convert the value of a column to uppercase in pig.
Was able to do using UPPER but this creates a new column.
For example:
A = Load 'MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
Dump A;
Returns
a,b,c
d,e,f
Now I need to convert second column to upper case.
B = Foreach A generate *,UPPER(column2);
Dump B;
returns
a,b,c,B
e,f,g,F
But I need
a,B,c
e,F,g
Please let me know if there is a way to so.
I didn't tried from my side but you can try like this
B = Foreach A generate column1,UPPER(column2),column3;
Using the "*" in the below line is the reason for the extra column:
B = FOREACH A generate *, UPPER(column2);
Instead use the below:
B = Foreach A generate column1, UPPER(column2), column3;
You can do it with user define function that default provided by Apache pig
find PiggyBank Jar
command
find / -name "piggybank*.jar*"
now goto pig grunt shell
code
grunt> register /usr/local/pig-0.16.0/contrib/piggybank/java/piggybank.jar;
grunt> A = Load 'data/MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
grunt> dump A;
result
(a,b,c)
(d,e,f)
Now convert second column to upper case.
grunt> B = foreach A generate column1,org.apache.pig.piggybank.evaluation.string.UPPER(column2),column3;
grunt> dump B;
result
(a,B,c)
(d,E,f)

How to right Pig Script if line contains more than one same delimiter?

Here I have a line in my "test.csv" file as follows:
1987654,file not uploaded,please try again,Johnson
I would like to get output as follows using Pig
Task ID
1987654
Message
file not uploaded,please try again
User
Johnson
Since all lines have the same format, the simple solution is to load it into 4 fields with comma as the delimiter and then use CONCAT to join the 2nd and 3rd field along with a comma.
A = LOAD 'data.txt' USING PigStorage(',') AS (a1:int,a2:chararray,a3:chararray,a4:chararray);
B = FOREACH A GENERATE a1,CONCAT(CONCAT(a2,','),a3),a4;
DUMP B;

Apache Pig: Add one field dataset to another one as a new column

Let's say we have this scenario:
dataset1.csv :
datefield
field11, field12, field13
field21, field22, field23
field31, field32, field33
What is the best way to get this?:
field11, field12, field13, datefield
field21, field22, field23, datefield
field31, field32, field33, datefield
I tried to generate one dataset (relation1) with just this columns (after loading and generating):
field11, field12, field13
field21, field22, field23
field31, field32, field33
another one (relation2) with just this column (after loading and generating):
datefield
and then doing this:
finalResult = FOREACH dataset1 GENERATE UDFFunction1(relation1::f1) as firstFields, UDFFunction2(relation2::f2) as lastField
But I'm getting 'A column needs to be projected from a relation for it to be used as a scalar'
The problem is with the second field (the one with the datefield).
I would like to avoid joins, since it would be a little messy workaround.
Any suggestions?
Please, forget my UDF functions. They just format the input Tuples accordingly.
Adding the pig script:
register 's3://bucketName/lib/MyJar.jar';
define ParseOutFilesUDF packageName.ParseOutFiles;
define FormatTimestartedUDF packageName.FormatTimestarted;
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
--This UDF just reformat each tuple, adding a String to each Tuple and returning a new one.
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
--load the same csv again to get the TIMESTARTED field
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
--filter to get only one record, which is somth like TIMESTARTED=20160101
filetered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
timestarted = foreach filetered GENERATE $0 as fechaStarted;
-- the FormatTimestartedUDF just gets ride of 'TIMESTARTED=' in order to get the date '20160101'
in this FOREACH sentence is where it fails with the 'A column needs to be projected...'
finalResult = FOREACH outFile GENERATE f1, FormatTimestartedUDF(timestarted) as f2;
STORE finalResult INTO 's3://bucketName/output/';
You are getting the error because you are referencing f1 which doesn't exist in outfile and timestarted is a relation and not a field.Also you should be using the field in resultALL and filtered.
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
filtered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
finalResult = FOREACH resultALL GENERATE resultAll.initial, FormatTimestartedUDF(filtered.$0);
STORE finalResult INTO 's3://bucketName/output/';