bincod evaluation in pig - apache-pig

I am trying to replace the missing values with some precomputed value.
So i posted the question here and followed the advice and here is the code snippet
input = LOAD 'data.txt' USING PigStorage(',') AS
(
id1:double , id21:double );
gin = foreach input generate
id1 IS NULL ? 2 : id1,
id2 IS NULL ? 4 : id2;
But I am getting an error mismatched input 'IS' expecting SEMI_COLON?

Try adding parentheses in the bincond. The following works properly for me:
Contents of input:
0.9,1.11
,0.3
10.3,
Script:
inp = LOAD 'input' USING PigStorage(',') AS (id1:double, id2:double );
gin = foreach inp generate
((id1 IS NULL) ? 2 : id1),
((id2 IS NULL) ? 4 : id2);
DUMP gin;
Output:
(0.9,1.11)
(2.0,0.3)
(10.3,4.0)

Related

Pig script to concatenate the values in the tuples

Input:
(11111111,{(A,MARK,APPLE,ABC1,11111111),(B,PAUL,AMAZON,ABC2,11111111),(C,TIM,FIVN,ABC3,11111111),(D,LIN,MULESFT,ABC4,11111111),(E,YEP,UHG,ABC5,11111111),(F,QIN,ATT,ABC6,11111111)})
(22222222,{(A,MARK,APPLE,ABC6,22222222),(B,MARK,AMAZON,ABC7,22222222),(C,MARK,PQE,ABC8,22222222),(D,MARK,AMB,ABC9,22222222),(E,MARK,YZQ,ABC19,22222222),(F,MARK,PQR,,22222222)})
I have grouped the data with the key as above. I should generate the output by concatenating all the values of the tuple including nulls as below:
Output:
(1111111,A^B^C^D^E^F,MARK^PAUL^TIM^LIN^YEP^QIN,APPLE^AMAZON^FIVN^MULESFT^UHG^ATT,ABC1^ABC2^ABC3^ABC4^ABC5^^ABC6)
(2222222,A^B^^D^E^G,TIM^AIN^TIM^BIN^CIN^DIN^RIN,APPLE^AMAZON^PQE^AMB^YZQ^RIN,ABC6^ABC7^ABC8^ABC9^ABC19^^)
Can some one please help me?
Sharing a code snippet which may help, work on this to achieve the expected output.
Input :
1,A
1,B
1,C
2,D
2,E
2,F
Output :
(1,C^B^A)
(2,F^E^D)
Pig Snippet :
data1 = load '/Users/muralirao/learning/pig/a.csv' using PigStorage(',') as (id:int, name:chararray);
req_data = FOREACH (GROUP data1 BY id) {
names = data1.name;
GENERATE group AS id, BagToString(names,'^');
};
DUMP req_data;

Pig error in local mode

I'm trying to find out the salary in descending order but the output is not correct. I'm running pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
As I needed email and salary(in desc) so here is what I did.
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I don't mention B = foreach A generate $1,$2; and proceed,output is as expected.
Any suggestion on this?
Cast the bytearray into int and then order :
Try this code :
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)

Convert one line into multiple line in Pig

I would like to write a pig script for below query.
Input is:
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3,BBB,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3
Output should be:
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3
ABC,DEF,GHI,JKL,AAA,bbb,1,2,3
ABC,DEF,GHI,JKL,AAA,ccc,1,2,3
ABC,DEF,GHI,JKL,BBB,aaa,1,2,3
ABC,DEF,GHI,JKL,BBB,bbb,1,2,3
ABC,DEF,GHI,JKL,BBB,ccc,1,2,3
Could anyone please help me?
You can write your own custom UDF or try the below approach
input.txt
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3,BBB,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3,CCC,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3
PigScript:
A = LOAD 'input.txt' USING PigStorage(',');
B = FOREACH A GENERATE $0,$1,$2,$3,
FLATTEN(TOTUPLE($4)),
FLATTEN(TOBAG(
TOTUPLE($5..$8),
TOTUPLE($9..$12),
TOTUPLE($13..$16)
)
);
C = FOREACH A GENERATE $0,$1,$2,$3,
FLATTEN(TOTUPLE($17)),
FLATTEN(TOBAG(
TOTUPLE($18..$21),
TOTUPLE($22..$25),
TOTUPLE($26..$29)
)
);
D = UNION B,C;
DUMP D
Output:
(ABC,DEF,GHI,JKL,AAA,aaa,1,2,3)
(ABC,DEF,GHI,JKL,AAA,bbb,1,2,3)
(ABC,DEF,GHI,JKL,AAA,ccc,1,2,3)
(ABC,DEF,GHI,JKL,BBB,aaa,1,2,3)
(ABC,DEF,GHI,JKL,BBB,bbb,1,2,3)
(ABC,DEF,GHI,JKL,BBB,ccc,1,2,3)

Pig Programming Logic

10278929012|HDFC1001|SBI|2014-08-03|8000|S
10278929012|HDFC1001|HDFC|2014-08-17|500|S
I need to find out if the atm_id belongs to the same bank then I need a indicator to be produced
I need output like this
10278929012|HDFC1001|SBI|diff_bank
10278929012|HDFC1001|HDFC|same_bank
atm_trans = LOAD '/user/cloudera/inputfiles/atm_trans.txt' USING PigStorage('|') as(accnt_no:long,atm_id:chararray,bank_name :chararray,date:chararray,amt:chararray,status:chararray);
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,(bank_name matches atm_id ?'same_bank' : 'diff_bank') as ind;
dump atm_trans_each;
but I am getting syntax error. Can somebody correct it give me the correct statement to get the ouput;
Can you try this?
input.txt
10278929012|HDFC1001|SBI|2014-08-03|8000|S
10278929012|HDFC1001|HDFC|2014-08-17|500|S
PigScript:
atm_trans = LOAD 'input.txt' USING PigStorage('|') as(accnt_no:long,atm_id:chararray,bank_name:chararray,date:chararray,amt:chararray,status:chararray);
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,((STARTSWITH(atm_id,bank_name)== true)?'same_bank':'diff_bank') as ind;
STORE atm_trans_each INTO 'output' USING PigStorage('|');
Update: In 0.8 version
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,((REGEX_EXTRACT(atm_id,'([A-Za-z]+)',1) == bank_name)?'same_bank':'diff_bank');
output:
10278929012|HDFC1001|SBI|diff_bank
10278929012|HDFC1001|HDFC|same_bank

Pig one row to multiple row

Can you please provide Pig script for below query?
here's input format.
Input
ID, Label
122,a|b
215,q|b|c
214,Z|b|c
218,w|b|c
211,r|b|c
219,u|b
Expected output
122,a
122,b
215,q
215,b
215,c
214,Z
214,b
214,c
218,w
218,b
218,c
...........
Thanks,
Abhi
TOKENIZE the Label, it will give a bag and than FLATTEN it, which will give you as many rows as are tuples in the bag. Sample code
inpt = LOAD '....' USING PigStorage(',') AS (ID: chararray, Label : chararray);
result = FOREACH inpt GENERATE ID, FLATTEN(TOKENIZE(Lable, '|'));
DUMP result;