Pig Latin: need data not in common - apache-pig

Data1
1,a
2,b
3,c
4,d
5,e
Data2
1,a
2,g
3,j
4,b
5,c
6,d
7,e
Script
a = load '/tmp/data/data1' using PigStorage(',') as (timestamp:chararray,constant:chararray);
b = load '/tmp/data/data2' using PigStorage(',') as (timestamp:chararray,constant:chararray);
I need to output only the records whose constants are present in data2 but not in data1, as below:
2,g
3,j

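Use a RIGHT OUTER JOIN on the constant column and FILTER where a::constant is null. That will give you all records in b whose constant does not appear in a. (Joining on timestamp would instead return the rows with timestamps 6 and 7.)
c = JOIN a BY constant RIGHT OUTER, b BY constant;
c1 = FILTER c BY a::constant IS NULL;
d = FOREACH c1 GENERATE b::timestamp, b::constant;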
DUMP d;
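A COGROUP-based anti-join is another way to sketch this. It assumes the comparison should be on the constant column (the expected output lists the constants g and j, which do not occur in data1); the alias names grouped, only_b and result are illustrative, not from the original answer.

```pig
-- Load both inputs with the schema used in the question.
a = LOAD '/tmp/data/data1' USING PigStorage(',') AS (timestamp:chararray, constant:chararray);
b = LOAD '/tmp/data/data2' USING PigStorage(',') AS (timestamp:chararray, constant:chararray);

-- One output row per distinct constant, holding a bag of matching rows from each input.
grouped = COGROUP a BY constant, b BY constant;

-- Keep only constants that have no rows in a (the bag from data1 is empty).
only_b = FILTER grouped BY IsEmpty(a);

-- Expand the bag from data2 back into (timestamp, constant) rows.
result = FOREACH only_b GENERATE FLATTEN(b);
DUMP result;
```

Unlike the outer join, this returns only the two columns from data2, so no extra projection is needed.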

Related

PIG Group into bag by distinct value

recipe,ingredient,inventor
Tacos,Beef,Alex
Tacos,Lettuce,Alex
Tacos,Cheese,Alex
TomatoSoup,Tomatoes,Steve
TomatoSoup,Milk,Steve
I want to group the record by recipe and bag the ingredient and inventor like
(Tacos,{Beef,Lettuce,Cheese},Alex)
(TomatoSoup,{Tomatoes,Milk},Steve)
Group by recipe and inventor and then order the columns as per your requirement.
A = LOAD 'data.txt' USING PigStorage(',') AS (recipe:chararray,ingredient:chararray,inventor:chararray);
B = GROUP A BY (recipe,inventor);
C = FOREACH B GENERATE FLATTEN(group) as (recipe,inventor),A.ingredient;
D = FOREACH C GENERATE recipe,ingredient,inventor;
DUMP D;

Pig - Convert rows into multiple columns

Can we convert input rows into multiple columns, where each group of rows is terminated by a row matching Three.*?
Here is one naive solution.
Group every three rows using RANK and GROUP, then FILTER each group by the three patterns. This assumes every record consists of exactly three consecutive rows.
My Pig Script
A = Load '/path_to_data/data' as (c1 : chararray);
B = RANK A;
C = FOREACH B GENERATE (rank_A+2)/3 as id, c1;
D = FOREACH (GROUP C BY id) {
    ONE = FILTER C BY c1 matches 'One:.*';
    TWO = FILTER C BY c1 matches 'Two:.*';
    THREE = FILTER C BY c1 matches 'Three:.*';
    GENERATE
        group as id
        , FLATTEN(ONE.c1) as c1_one
        , FLATTEN(TWO.c1) as c1_two
        , FLATTEN(THREE.c1) as c1_three
    ;
};
DUMP D;
My Result
(1,One:"A",Two:"2",Three:"last")
(2,One:"B",Two:"1",Three:"first")

Apache Pig Distinct and Count

I'm trying to figure out the following question.
How many female users provided at least one rating of 4? I think my join and filters are correct, but I can't figure out the distinct count part. I have tried numerous versions of the below.
a = load '/user/pig/movie' AS (userid:int, movieid:int, rating:int, timestamp:chararray);
b = load '/user/pig/reviewer' using PigStorage('|') AS (userid:int, age:int, gender:chararray, occupation:chararray, zip:chararray);
a1 = filter a by rating == 4;
b1 = filter b by gender == 'F';
c = join a1 by userid, b1 by userid;
d = FOREACH c GENERATE COUNT(DISTINCT(userid));
dump d;
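You have to GROUP before COUNT. Ref: COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts. To count each user only once, project userid and apply DISTINCT before grouping.
d = FOREACH c GENERATE b1::userid AS userid;
e = DISTINCT d;
f = GROUP e ALL;
g = FOREACH f GENERATE COUNT(e);
dump g;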

Distinct on Multiple columns of a pig

I have a file
(1,1,100)
(1,1,200)
(1,2,300)
Now I want DISTINCT to be applied on the first two columns only.
I did this:
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
A_unique =
    FOREACH (GROUP A BY id3) {
        b = A.(id1,id2);
        s = DISTINCT b;
        GENERATE FLATTEN(s);
    };
DUMP A_unique;
Output comes out to be
(1,1)
(1,1)
(1,2)
I expected it to be
(1,1)
(1,2)
This should give you the desired output:
a = load 'sample1.txt' using PigStorage(',') as (id1:int, id2:int, id3:int);
b = group a by (id1, id2);
c = foreach b generate flatten(group) as (id1, id2);
dump c;
The code below also generates the required result.
a = load '$dir/data' using PigStorage(',') as (d1:int,d2:int,d3:int);
b = group a all;
c = foreach b {
    d = a.(d1,d2);
    e = DISTINCT d;
    generate FLATTEN(e);
};
dump c ;
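Since DISTINCT compares whole tuples, a shorter sketch is to project the two columns first and deduplicate afterwards, with no GROUP at all. The file name and comma delimiter are taken from the answer above and are assumptions about how the data is stored; the aliases p and u are illustrative.

```pig
-- Assumes a comma-separated file with three integer columns per line.
a = LOAD 'sample1.txt' USING PigStorage(',') AS (id1:int, id2:int, id3:int);

-- Keep only the two columns that define uniqueness.
p = FOREACH a GENERATE id1, id2;

-- DISTINCT removes duplicate (id1, id2) tuples across the whole relation.
u = DISTINCT p;
DUMP u;
```

The original attempt grouped by id3, so duplicates were only removed within each id3 group, which is why (1,1) appeared twice.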

how to combine/concat two bags in pig latin

I have two datasets:
A = {uid, url}; B = {uid, url};
now I do a cogroup:
C = COGROUP A BY uid, B BY uid;
and I want to change C into {group AS uid, DISTINCT A.url+B.url};
My question is how do I do this concatenation of two bags A.url and B.url?
Or to put it differently, how do I do DISTINCT on multiple columns?
This may not be what you're expecting, but it's what I understood from your question:
C = JOIN A BY uid, B BY uid;
D = DISTINCT C;
Concatenation is done the following way:
E = FOREACH D GENERATE CONCAT(A::uid,B::uid);
A = LOAD 'A' using PigStorage() as (uid,url);
B = LOAD 'B' using PigStorage() as (uid,url);
C = JOIN A by uid ,B by uid;
D = FOREACH C GENERATE $0,CONCAT(A::url,B::url);
E= DISTINCT D;
dump E;
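If the goal is one bag of distinct urls per uid drawn from both datasets, a UNION followed by DISTINCT and GROUP is one possible sketch. It assumes A and B share the same (uid, url) schema and that both fields are strings; the aliases U, D, G and R are illustrative and not part of the answers above.

```pig
-- Assumes both inputs are tab-delimited (the PigStorage default) with matching schemas.
A = LOAD 'A' USING PigStorage() AS (uid:chararray, url:chararray);
B = LOAD 'B' USING PigStorage() AS (uid:chararray, url:chararray);

-- Stack the rows of both relations into one.
U = UNION A, B;

-- Drop duplicate (uid, url) pairs, whichever input they came from.
D = DISTINCT U;

-- Collect the remaining urls into one bag per uid.
G = GROUP D BY uid;
R = FOREACH G GENERATE group AS uid, D.url AS urls;
DUMP R;
```

Unlike the JOIN-based answers, this also keeps uids that appear in only one of the two datasets.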