Data1
1,a
2,b
3,c
4,d
5,e
Data2
1,a
2,g
3,j
4,b
5,c
6,d
7,e
Script
a = load '/tmp/data/data1' using PigStorage(',') as (timestamp:chararray,constant:chararray);
b = load '/tmp/data/data2' using PigStorage(',') as (timestamp:chararray,constant:chararray);
I need output only the constants which not common and present in data2 as below
2,g
3,j
RIGHT OUTER JOIN and FILTER where a.timestamp is null.That will give you all records in b that are not in a.
c = JOIN a BY (timestamp) RIGHT OUTER,b BY (timestamp);
d = FILTER c BY (a::timestamp is null);
DUMP d;
Related
recipe,ingredient,inventor
Tacos,Beef,Alex
Tacos,Lettuce,Alex
Tacos,Cheese,Alex
TomatoSoup,Tomatoes,Steve
TomatoSoup,Milk,Steve
I want to group the record by recipe and bag the ingredient and inventor like
(Tacos,{Beef,Lettuce,Cheese},Alex)
(TomatoSoup,{Tomatoes,Milk},Steve)
Group by recipe and inventor and then order the columns as per your requirement.
A = LOAD 'data.txt' USING PigStorage(',') AS (recipe:chararray,ingredient:chararray,inventor:chararray);
B = GROUP A BY (recipe,inventory);
C = FOREACH B GENERATE FLATTEN(group) as (recipe,inventor),A.ingredient;
D = FOREACH C GENERATE recipe,ingredient,inventory;
DUMP D;
Can we convert input rows to multiple columns terminated by Three.*
Here is one naive solution.
Grouping every three rows using RANK and GROUP, and FILTER by each of three conditions.
My Pig Script
A = Load '/path_to_data/data' as (c1 : chararray);
B = RANK A;
C = FOREACH B GENERATE (rank_A+2)/3 as id, c1;
D = FOREACH (GROUP C BY id) {
ONE = FILTER C BY c1 matches 'One:.*';
TWO = FILTER C BY c1 matches 'Two:.*';
THREE = FILTER C BY c1 matches 'Three:.*';
GENERATE
group as id
, FLATTEN(ONE.c1) as c1_one
, FLATTEN(TWO.c1) as c1_two
, FLATTEN(THREE.c1) as c1_three
;
};
DUMP D;
My Result
(1,One:"A",Two:"2",Three:"last")
(2,One:"B",Two:"1",Three:"first")
I'm trying to figure out the following question.
How many female users provided at least one rating of 4. I think my join and filters are correct but I cant figure out the distinct count part Have tried numerous versions of the below.
a = load '/user/pig/movie' AS (userid:int, movieid:int, rating:int, timestamp:chararray);
b = load '/user/pig/reviewer' using PigStorage('|') AS (userid:int, age:int, gender:chararray, occupation:chararray, zip:chararray);
a1 = filter a by rating == 4;
b1 = filter b by gender == 'F';
c = join a1 by userid, b1 by userid;
d = FOREACH c GENERATE COUNT(DISTINCT(userid));
dump d;
You have to GROUP before COUNT.Ref:COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.
d = GROUP c BY userid;
e = FOREACH d GENERATE COUNT(DISTINCT(b1.userid));
dump e;
I have a file
(1,1,100)
(1,1,200)
(1,2,300)
Now I want the distinct to be applied on two columns and want the output to be
I did this
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
A_unique =
FOREACH (GROUP A BY id3) {
b = A.(id1,id2);
s = DISTINCT b;
GENERATE FLATTEN(s);
};
DUMP A_unique;
Output comes out to be
(1,1)
(1,1)
(1,2)
I expected it to be
(1,1)
(1,2)
Here you go this should give you the desired output -
a = load 'sample1.txt' using PigStorage(',') as (id1:int, id2:int, id3:int);
b = group a by (id1, id2);
c = foreach b {
first_e = limit a.id3 1;
generate flatten(group) as (id1, id2);
}
Below code generates the required result.
a = load '$dir/data' using PigStorage(',') as (d1:int,d2:int,d3:int);
b= group a all;
c= foreach b {
d = a.(d1,d2);
e = DISTINCT d;
generate FLATTEN(e);
}
dump c ;
~
I have two datasets:
A = {uid, url}; B = {uid, url};
now I do a cogroup:
C = COGROUP A BY uid, B BY uid;
and I want to change C into {group AS uid, DISTINCT A.url+B.url};
My question is how do I do this concatenation of two bags A.url and B.url?
Or to put it differently, how do I do DISTINCT on multiple columns?
It cannot be what you're expecting but that's what I understood from your question:
C = JOIN A BY uid, B BY uid;
D = DISTINCT C;
Concatenation is done the following way:
E = FOREACH D GENERATE CONCAT(A::uid,B::uid);
A = LOAD 'A' using PigStorage() as (uid,url);
B = LOAD 'B' using PigStorage() as (uid,url);
C = JOIN A by uid ,B by uid;
D = FOREACH C GENERATE $0,CONCAT(A::url,B::url);
E= DISTINCT D;
dump E;