Filter data after join using PIG - apache-pig

I would like to filter the records after two files are joined.
The file BX-Books.csv contains the book data, and the file BX-Book-Ratings.csv contains the book rating data; ISBN is the common column in both files. The inner join between the files is done on this column.
I would like to get the books that were published in the year 2002.
I have used the following script, but I am getting 0 records.
grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
grunt> BookXRating = LOAD '/user/pradeep/BX-Book-Ratings.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray);
grunt> BxJoin = JOIN BookXRecords BY ISBN, BookXRating BY ISBN;
grunt> BxJoin_Mod = FOREACH BxJoin GENERATE $0 AS ISBN, $1, $2, $3, $4;
grunt> FLTRBx2002 = FILTER BxJoin_Mod BY $3 == '2002';

I created test.csv and test-rating.csv and ran a Pig script against them. It worked perfectly fine.
test.csv
1;abc;author1;2002
2;xyz;author2;2003
test-rating.csv
user1;1;3
user2;2;5
Pig Script :
A = LOAD 'test.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray);
describe A;
dump A;
B = LOAD 'test-rating.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray);
describe B;
dump B;
C = JOIN A BY ISBN, B BY ISBN;
describe C;
dump C;
D = FOREACH C GENERATE $0 as ISBN,$1,$2,$3;
describe D;
dump D;
E = FILTER D BY $3 == '2002';
describe E;
dump E;
Output:
A: {ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: chararray}
(1,abc,author1,2002)
(2,xyz,author2,2003)
B: {user: chararray,ISBN: chararray,rating: chararray}
(user1,1,3)
(user2,2,5)
C: {A::ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray,B::user: chararray,B::ISBN: chararray,B::rating: chararray}
(1,abc,author1,2002,user1,1,3)
(2,xyz,author2,2003,user2,2,5)
D: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray}
(1,abc,author1,2002)
(2,xyz,author2,2003)
E: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray}
(1,abc,author1,2002)
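For readers who want to check the logic outside Pig, here is a minimal plain-Python sketch of the same JOIN / FOREACH / FILTER steps on the test data above. The helper name inner_join is made up for this illustration and is not a Pig API; the sketch only models what the relations contain, not how Pig executes them.

```python
# Plain-Python model of the test script above (not Pig itself).
books = [
    ("1", "abc", "author1", "2002"),
    ("2", "xyz", "author2", "2003"),
]
ratings = [
    ("user1", "1", "3"),
    ("user2", "2", "5"),
]


def inner_join(left, right, left_key, right_key):
    """Concatenate every pair of rows whose key columns are equal."""
    return [l + r for l in left for r in right if l[left_key] == r[right_key]]


# C = JOIN A BY ISBN, B BY ISBN;
joined = inner_join(books, ratings, 0, 1)
# D = FOREACH C GENERATE $0, $1, $2, $3;
projected = [row[:4] for row in joined]
# E = FILTER D BY $3 == '2002';  (string comparison, as in the chararray schema)
books_2002 = [row for row in projected if row[3] == "2002"]

print(books_2002)
```

Note that the comparison is an exact string match, so any extra character in the year field would make it fail.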

Requirement: get the books that were published in the year 2002.
Two data sets are not required for this.
It can be achieved with "BookXRecords" alone.
grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
grunt> A = FILTER BookXRecords BY YearOfPublication == '2002';
grunt> dump A;

Related

Apache Pig only load first nested tuple

I use the exact sample from official document:
I have data.txt:
(3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)
I run:
A = LOAD 'data.txt' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
dump A
I always got:
((3,8,9),)
((1,4,7),)
((2,5,8),)
The second nested tuple never got loaded. I tried in both versions of 0.16.0 and 0.17.0.
The problem is most likely with the data file you created. The two tuples must be separated by a tab in the data file. If they are separated by a space, the load statement has to change accordingly.
a) With tab (\t) as the delimiter.
grunt> A = LOAD '/home/ec2-user/data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
b) With a single space ( ) as the delimiter.
grunt> A = LOAD '/home/ec2-user/data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),)
((1,4,7),)
((2,5,8),)
# Use PigStorage(' ') if you still want to use a space as the delimiter in the file.
grunt> A = LOAD '/home/ec2-user/data' USING PigStorage(' ') AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
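The effect of the delimiter can be pictured with a small Python sketch. This is only a simplified model of field splitting, not Pig's actual parser; the helper split_fields and its padding behaviour are assumptions made for the illustration.

```python
# One line of data.txt where the two tuples are separated by a single space.
line = "(3,8,9) (mary,19)"


def split_fields(text, delimiter, expected=2):
    """Split on the delimiter and pad any missing fields with None."""
    fields = text.split(delimiter)
    return fields + [None] * (expected - len(fields))


# Default behaviour: split on tab. The line contains no tab, so there is
# only one field and the second one stays empty.
by_tab = split_fields(line, "\t")

# Splitting on a space, as with PigStorage(' '), yields both tuples.
by_space = split_fields(line, " ")

print(by_tab)
print(by_space)
```

With the tab delimiter the second field is simply never found, which matches the empty second tuple in the dump above.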

Pig min command and order by

I have data in the form: shell, $917.14,$654.23,2013
I have to find the minimum value in columns $1 and $2.
I tried ordering by these columns in ascending order,
but the answer does not come out correct. Can anyone please help?
Refer to the MIN function.
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT1: The data you are loading has '$' in it. You will either have to clean it up and load it into a float field to apply the MIN function, or load it into a chararray, replace the '$', cast it to float, and then apply the MIN function.
EDIT2: Here is a solution that leaves the $ in the original data and handles it in the Pig script.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));
DUMP D;
Output
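The output of the script is not reproduced above. As a rough cross-check of the logic, the same clean-up and minimum can be sketched in plain Python using the same character class as the REPLACE call. This is not Pig output: Pig's float is 32-bit, so the exact digits Pig prints may differ from what Python shows.

```python
import re

lines = [
    "shell,$820.48,$11992.70,996,891,1629",
    "shell,$817.12,$2105.57,1087,845,1630",
    "Bharat,$974.48,$5479.10,965,827,1634",
    "Bharat,$943.70,$9162.57,939,895,1635",
]

# Drop every character that is not a letter, digit, dot, comma or
# whitespace. On this input that removes only the '$' signs.
cleaned = [re.sub(r"[^a-zA-Z0-9.,\s]+", "", line) for line in lines]

# Split each cleaned line on commas, like STRSPLIT($0, ',').
rows = [line.split(",") for line in cleaned]

# Cast the two amount columns to float and take the minimum of each.
min_f2 = min(float(row[1]) for row in rows)
min_f3 = min(float(row[2]) for row in rows)

# Put the '$' back in front, like the CONCAT calls.
result = ("$" + str(min_f2), "$" + str(min_f3))
print(result)
```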

Pig Aggregate functions

My Input file is below
a,t1,1000,100
a,t1,2000,200
b,t2,1000,200
b,t2,5000,100
How do I find the count of distinct $0 values in the above file?
myinput = LOAD 'file' AS(a1:chararray,a2:chararray,amt:int,rate:int);
After the above statement, what needs to be done?
Also, can I use that distinct count to divide some other value in a different relation?
First of all, the way you read the data is incorrect. If you dump "myinput", you'll see that the whole row is read into the first field (a1), while the others are empty.
The reason is that you don't specify a load function, so the default, the built-in PigStorage(), is used. It expects a tab-delimited file, so it ignores your commas. You need to specify a load function explicitly (e.g. PigStorage()) via the USING clause and pass it the delimiter as an argument:
myInput = LOAD 'file' using PigStorage(',');
myInput2 = FOREACH myInput GENERATE $0 as (a1:chararray), $1 as (a2:chararray), $2 as (amt:int), $3 as (rate:int);
Moving on, to find the DISTINCT $0 first you have to extract field $0 in a separate relation. The reason is that the DISTINCT statement works on entire records, rather than on separate fields.
myField = FOREACH myInput2 GENERATE a1;
distinctA1 = DISTINCT myField;
Now the result of distinctA1 is {(a), (b)}. GROUP ... ALL puts all of these records into a single group, and then what is left is to COUNT them:
grouped = GROUP distinctA1 all;
countA1 = FOREACH grouped GENERATE COUNT(distinctA1);
And now you're happy. :)
The complete code:
myInput = LOAD 'file' using PigStorage(',');
myInput2 = FOREACH myInput GENERATE $0 as (a1:chararray), $1 as (a2:chararray), $2 as (amt:int), $3 as (rate:int);
myField = FOREACH myInput2 GENERATE a1;
distinctA1 = DISTINCT myField;
grouped = GROUP distinctA1 all;
countA1 = FOREACH grouped GENERATE COUNT(distinctA1);
You can do something like this :
myInput = LOAD 'file.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray,amt:int,rate:int);
Data = GROUP myInput BY $0;
Data = FOREACH Data GENERATE $0;
Data = GROUP Data ALL;
Data = FOREACH Data GENERATE $0,COUNT($1);
NB: Grouping on $0 does the same thing as a DISTINCT, and you get better performance ;)
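To make the two points from the answers concrete (the delimiter problem and the distinct count), here is a small Python sketch on the input from the question. It only models the logic; the variable names are chosen for this illustration.

```python
rows = [
    "a,t1,1000,100",
    "a,t1,2000,200",
    "b,t2,1000,200",
    "b,t2,5000,100",
]

# Splitting on tab (the default delimiter) finds no tab, so the whole
# row ends up in the first field.
tab_first_fields = [row.split("\t")[0] for row in rows]

# Splitting on comma gives the real first column.
comma_first_fields = [row.split(",")[0] for row in rows]

# DISTINCT on the extracted field, then COUNT over the single group.
distinct_a1 = set(comma_first_fields)
distinct_count = len(distinct_a1)

print(tab_first_fields[0])
print(sorted(distinct_a1), distinct_count)
```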

Pick a random value from a bag

I have grouped data in the relation B in the format
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag, and I want output like
1, abc
2, mno
in the case where the first tuple was picked for 1 and the second tuple for 2.
The issue is that I only have the grouped data B;
DESCRIBE B
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
If I try to flatten it by
C = FOREACH B GENERATE FLATTEN($1)
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
Then I try to do
rand =
FOREACH B {
    shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
    shuf = ORDER shuf_ BY r;
    pick1 = LIMIT shuf 1;
    GENERATE
        group,
        FLATTEN(pick1);
};
At line L I get the error "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
You can't refer to C inside a FOREACH on B, because C is itself built from B. You need to use the bag that B contains, i.e. A.
Looking at your DESCRIBE output:
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
You can use A in the nested FOREACH instead, and it will work.
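The idea behind the nested block (attach a random number to every tuple of the group's own bag, order by it, keep the first) can be sketched in plain Python. The function name pick_one_per_group and the fixed seed are assumptions for the illustration; the seed is there only so that reruns behave the same way.

```python
import random

# Grouped data in the shape described in the question: key -> bag of tuples.
grouped = {
    1: [(1, "abc"), (1, "def")],
    2: [(2, "ghi"), (2, "mno"), (2, "pqr")],
}


def pick_one_per_group(groups, rng):
    """Shuffle each bag by a random sort key and keep its first tuple."""
    picks = {}
    for key, bag in groups.items():
        # Mirrors RANDOM() AS r, ORDER BY r, LIMIT 1 applied to the
        # bag that belongs to this group.
        shuffled = sorted(bag, key=lambda _: rng.random())
        picks[key] = shuffled[0]
    return picks


rng = random.Random(0)  # seeded only to make reruns reproducible
picks = pick_one_per_group(grouped, rng)

for key, (_, value) in picks.items():
    print(key, value)
```

The important part is that the shuffle works on each group's own bag, which is the role A plays inside the FOREACH on B.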

Pig Latin split columns to rows

Is there any solution in Pig Latin to transform columns to rows to get the output below?
Input:
id|column1|column2
1|a,b,c|1,2,3
2|d,e,f|4,5,6
required output:
id|column1|column2
1|a|1
1|b|2
1|c|3
2|d|4
2|e|5
2|f|6
thanks
I'm willing to bet this is not the best way to do this however ...
data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray,
col2:chararray);
A = foreach data generate id, flatten(TOKENIZE(col1));
B = foreach data generate id, flatten(TOKENIZE(col2));
RA = RANK A;
RB = RANK B;
store RA into 'ra_temp' using PigStorage(',');
store RB into 'rb_temp' using PigStorage(',');
data_a = load 'ra_temp/part-m-00000' using PigStorage(',');
data_b = load 'rb_temp/part-m-00000' using PigStorage(',');
jed = JOIN data_a BY $0, data_b BY $0;
final = foreach jed generate $1, $2, $5;
dump final;
(1,a,1)
(1,b,2)
(1,c,3)
(2,d,4)
(2,e,5)
(2,f,6)
store final into '~/some_dir' using PigStorage('|');
EDIT: I really like this question and was discussing it with a co-worker and he came up with a much simpler and more elegant solution. If you have Jython installed ...
# create a file called udf.py
@outputSchema("innerBag:bag{innerTuple:(column1:chararray, column2:chararray)}")
def pigzip(column1, column2):
    # split both comma-separated strings and pair their items by position
    c1 = column1.split(',')
    c2 = column2.split(',')
    innerBag = zip(c1, c2)
    return innerBag
Then in Pig
$ pig -x local
register udf.py using jython as udf;
data = load 'input' using PigStorage('|') as (id:chararray, column1:chararray,
column2:chararray);
result = foreach data generate id, flatten(udf.pigzip(column1, column2));
dump result;
store result into 'output' using PigStorage('|');
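If you want to try the zip idea without Pig, here is a standalone Python sketch of what the UDF plus FLATTEN produce on the sample input. It leaves out the Pig-specific outputSchema decorator, and the list comprehension stands in for FLATTEN; both are simplifications for the illustration.

```python
def pigzip(column1, column2):
    """Pair the i-th item of column1 with the i-th item of column2."""
    return list(zip(column1.split(","), column2.split(",")))


# The input rows from the question, already split on '|'.
records = [
    ("1", "a,b,c", "1,2,3"),
    ("2", "d,e,f", "4,5,6"),
]

# One output row per zipped pair, carrying the id along (like FLATTEN).
flattened = [
    (rec_id, c1, c2)
    for rec_id, col1, col2 in records
    for c1, c2 in pigzip(col1, col2)
]

for row in flattened:
    print("|".join(row))
```

Note that zip stops at the shorter list, so rows whose two columns have different numbers of items would silently lose the extra values.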