How do I do a DIFF in PIG - apache-pig

I have 2 data sets on which I am trying to find the difference. I am aware that there are other ways to do the same. What I am interested in is why this snippet of code is failing.
A = LOAD 'raw.people1' using org.apache.hive.hcatalog.pig.HCatLoader();
B = LOAD 'raw.people2' using org.apache.hive.hcatalog.pig.HCatLoader();
C = COGROUP A BY (name, place, animal, thing) , B BY (name, place, animal, thing) ;
D = FOREACH C DIFF(A, B);
A, B and C work correctly. But D fails with the error:
Failed to parse: Syntax error, unexpected symbol at or near 'DIFF'
Now this should not be the case. The pig docs (http://pig.apache.org/docs/r0.9.1/func.html#diff) state the DIFF takes two pags as params and A and B are bags of tuples.
What am I missing here?
Thanks

You missed GENERATE keyword before DIFF stmt, that is the reason for this error. Can you change like this?
D = FOREACH C GENERATE DIFF(A, B);

Related

Convert the value of a column to uppercase in pig

I need to convert the value of a column to uppercase in pig.
Was able to do using UPPER but this creates a new column.
For example:
A = Load 'MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
Dump A;
Returns
a,b,c
d,e,f
Now I need to convert second column to upper case.
B = Foreach A generate *,UPPER(column2);
Dump B;
returns
a,b,c,B
e,f,g,F
But I need
a,B,c
e,F,g
Please let me know if there is a way to so.
I didn't tried from my side but you can try like this
B = Foreach A generate column1,UPPER(column2),column3;
Using the "*" in the below line is the reason for the extra column:
B = FOREACH A generate *, UPPER(column2);
Instead use the below:
B = Foreach A generate column1, UPPER(column2), column3;
You can do it with user define function that default provided by Apache pig
find PiggyBank Jar
command
find / -name "piggybank*.jar*"
now goto pig grunt shell
code
grunt> register /usr/local/pig-0.16.0/contrib/piggybank/java/piggybank.jar;
grunt> A = Load 'data/MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
grunt> dump A;
result
(a,b,c)
(d,e,f)
Now convert second column to upper case.
grunt> B = foreach A generate column1,org.apache.pig.piggybank.evaluation.string.UPPER(column2),column3;
grunt> dump B;
result
(a,B,c)
(d,E,f)

Pig error in local mode

I'm trying to find out the salary in descending order but the output is not correct. I'm running pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
As I needed email and salary(in desc) so here is what I did.
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I don't mention B = foreach A generate $1,$2; and proceed,output is as expected.
Any suggestion on this?
Cast the bytearray into int and then order :
Try this code :
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)

Casting the output from Flatten and Strsplit in Pig

I am trying to parse a log extract with multiple delimiters with sample data as below using pig
CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5
msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64
My code is as below:
A = LOAD '/user/cef.csv' USING PigStorage(' ') as
(a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray,g:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2),STRSPLIT(d,'=',2),STRSP LIT(e,'=',2),STRSPLIT(f,'=',2),STRSPLIT(g,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2),FLATTEN($3),FLATTEN($4),FLATTEN($5);
D = FOREACH C GENERATE $2,flatten(STRSPLIT($4,'"',2)),flatten(STRSPLIT($5,'"',2)),$7,$9;
E = FOREACH D GENERATE (int)$0,(chararray)$2,(chararray)$3,(int)$5,(int)$6 as (a:int,b:chararray,c:chararray,D:int,E:int);
Now when i dump E,i get the error
grunt> 2015-05-25 04:06:48,092 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1031: Incompatable schema: left is
"a:int,b:chararray,c:chararray,D:int,E:int", right is ":int"
I am trying to cast the output of my flatten and strsplit operations into chararray and int.
Please let me know whether this can be done
Thank you for the help!
Your problem is how you use the as clause. Since you place the as after the sixth parameter, it assumes you are trying to specify that schema only for that sixth parameter. Therefore, you are assigning a schema of six fields to only one, hence the error.
Do it like this:
E = FOREACH D GENERATE (int)$0 as a:int,(chararray)$2 as b,(chararray)$3 as c,(int)$5 as d,(int)$6 as e;
However, you are casting 09:41:38" to an int, so it will give you another error once you change it. You need to check again how you are splitting the data.
In my humble opinion, you should try to split the files by their delimiter before processing them in Pig, and then load them with their delimiter and perform an union. If your data is too large, then forget this idea... But your code is going to get too messy if you have several delimiters in the same file.

EqualsIgnoreCase function - Exception : org.apache.pig.backend.executionengine.ExecException

EqualsIgnoreCase function - Exception : org.apache.pig.backend.executionengine.ExecException
Input :
a.csv
-------
a
A
(blank/empty line)
b
B
c
C
Objective : To select the records which are 'a', 'A', 'b' and 'B'.
Approach 1 :
A = LOAD 'a.csv' using PigStorage(',') AS (value:chararray);
B = FILTER A BY LOWER(value) IN ('a','b');
DUMP B;
Output :
(a)
(A)
(b)
(B)
Approach 2 :
C = FILTER A BY EqualsIgnoreCase(value, 'a') or EqualsIgnoreCase(value, 'b');
Output :
2015-04-27 23:48:21,958 [Thread-30] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0014
org.apache.pig.backend.executionengine.ExecException
at org.apache.pig.builtin.EqualsIgnoreCase.exec(EqualsIgnoreCase.java:50)
Trying to understand why this exception is getting thrown. I understand that its because of the blank record.
Tried checking for value NOT being null or empty, still the same error.
D = FILTER A BY (value IS NOT NULL) OR (TRIM(value) != '') AND (EqualsIgnoreCase(value, 'a') or EqualsIgnoreCase(value, 'b'));
Any inputs/ thoughts on achieving our objective using Approach 2 is much appreciated.
Yes you are right, string functions EqualsIgnoreCase and TRIM are not able to handle blank string in the input.
To solve this issue,what ever you did in the last stmt is right, just remove the Trim function it will work.
C = FILTER A BY (value is not null) and (EqualsIgnoreCase(value, 'a') or EqualsIgnoreCase(value, 'b'));
Is not null condition will take care of empty(null, space and tab) chars, so TRIM function is not required.

Assigning value in foreach generate

I'm having trouble with something very simple:
data = LOAD a:chararray, b:chararray, c:chararray;
Now I want to append a date on (which will either be a constant or a param).
add_date = foreach data generate a, b, c, (date = 'today');
or
add_date = foreach data generate a, b, c, (date = $date);
Both of these are throwing me errors in grunt. I know I'm doing something simple wrong, but can't seem to find the answer in the reference manual.
It should be just:
add_date = foreach data generate a, b, c, 'today' AS date ;
No need to put anything in ()s.