Pig: How to access fields from multiple tuples from a bag - apache-pig

My Pig Script:
A = LOAD 'average.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(line,'^(.*?)\\s+(.*?)\\s+(.*?)') AS TUPLE(AA:chararray,BB:chararray,CC:chararray);
C = FILTER B BY tuple_0.AA IS NOT NULL;
D = GROUP C BY $0.AA;
Output after the GROUP statement:
(1,{((1,a,b)),((1,c,d))})
(2,{((2,e,f)),((2,g,h))})
I need final output like this:
(1,a,b,c,d)
(2,e,f,g,h)
Describe query:
| D | group:chararray | C:bag{:tuple(tuple_0:tuple(AA:chararray,BB:chararray,CC:chararray))}

Instead of grouping on $0.AA, I would suggest a self-join on C, as below:
A = LOAD 'average.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(line,'^(.*?)\\s+(.*?)\\s+(.*?)') AS TUPLE(AA:chararray,BB:chararray,CC:chararray);
C = FILTER B BY tuple_0.AA IS NOT NULL;
C = FOREACH C GENERATE tuple_0.AA AS AA, tuple_0.BB AS BB, tuple_0.CC AS CC; --renaming columns to easy names
D = FOREACH C GENERATE AA, BB, CC; -- clone of C
CD = JOIN C BY AA, D BY AA;
CD2 = FOREACH CD
GENERATE
C::AA AS AA,
C::BB AS CBB,
C::CC AS CCC,
D::BB AS DBB,
D::CC AS DCC;
I hope this helps.
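If the goal is exactly one row per key, as in the desired output, a grouping-based variant may be closer than the self-join, which pairs every row with every row sharing the same AA (including itself). A minimal sketch, assuming Pig 0.11 or later (for the built-in BagToTuple) and the flat C relation with fields AA, BB, CC from the renaming step above; the aliases G and R are illustrative:

```pig
-- Assumes C has flat fields (AA, BB, CC); G and R are illustrative aliases.
G = GROUP C BY AA;
-- BagToTuple unrolls the bag {(a,b),(c,d)} into one tuple (a,b,c,d);
-- FLATTEN then promotes that tuple's fields to top-level columns.
R = FOREACH G GENERATE group AS AA, FLATTEN(BagToTuple(C.(BB, CC)));
DUMP R;
```

The order of tuples inside a bag is not guaranteed, so the order of the flattened columns can vary between runs, and the resulting relation has no fixed schema because the number of columns depends on the group size.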

Related

PIG Group into bag by distinct value

recipe,ingredient,inventor
Tacos,Beef,Alex
Tacos,Lettuce,Alex
Tacos,Cheese,Alex
TomatoSoup,Tomatoes,Steve
TomatoSoup,Milk,Steve
I want to group the records by recipe, collecting the ingredients into a bag and keeping the inventor, like this:
(Tacos,{Beef,Lettuce,Cheese},Alex)
(TomatoSoup,{Tomatoes,Milk},Steve)
Group by recipe and inventor and then order the columns as per your requirement.
A = LOAD 'data.txt' USING PigStorage(',') AS (recipe:chararray,ingredient:chararray,inventor:chararray);
B = GROUP A BY (recipe,inventor);
C = FOREACH B GENERATE FLATTEN(group) as (recipe,inventor),A.ingredient;
D = FOREACH C GENERATE recipe,ingredient,inventor;
DUMP D;

Apache Pig Distinct and Count

I'm trying to answer the following question: how many female users provided at least one rating of 4?
I think my join and filters are correct, but I can't figure out the distinct count part. I have tried numerous versions of the code below.
a = load '/user/pig/movie' AS (userid:int, movieid:int, rating:int, timestamp:chararray);
b = load '/user/pig/reviewer' using PigStorage('|') AS (userid:int, age:int, gender:chararray, occupation:chararray, zip:chararray);
a1 = filter a by rating == 4;
b1 = filter b by gender == 'F';
c = join a1 by userid, b1 by userid;
d = FOREACH c GENERATE COUNT(DISTINCT(userid));
dump d;
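You have to GROUP before COUNT. Ref: COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts. To count each user only once, project the user id, apply DISTINCT, and then count over GROUP ALL:
users = FOREACH c GENERATE a1::userid AS userid; -- userid is ambiguous after the join, so qualify it
uniq = DISTINCT users;
d = GROUP uniq ALL;
e = FOREACH d GENERATE COUNT(uniq);
dump e;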

Distinct on Multiple columns of a pig

I have a file
(1,1,100)
(1,1,200)
(1,2,300)
Now I want DISTINCT to be applied to the first two columns only.
I tried the following, based on this advice: group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
A_unique =
FOREACH (GROUP A BY id3) {
b = A.(id1,id2);
s = DISTINCT b;
GENERATE FLATTEN(s);
};
DUMP A_unique;
The output comes out as
(1,1)
(1,1)
(1,2)
I expected it to be
(1,1)
(1,2)
This should give you the desired output:
a = load 'sample1.txt' using PigStorage(',') as (id1:int, id2:int, id3:int);
b = group a by (id1, id2);
-- each (id1, id2) group appears once, so flattening the group key is enough
c = foreach b generate flatten(group) as (id1, id2);
dump c;
The code below also generates the required result:
a = load '$dir/data' using PigStorage(',') as (d1:int,d2:int,d3:int);
b = group a all;
c = foreach b {
    d = a.(d1,d2);
    e = DISTINCT d;
    generate FLATTEN(e);
}
dump c;
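For a plain distinct over two columns, grouping may not be needed at all. A minimal sketch, assuming the relation a loaded above with fields d1, d2, d3; the aliases p and u are illustrative:

```pig
-- Assumes relation a with fields (d1, d2, d3); p and u are illustrative aliases.
p = FOREACH a GENERATE d1, d2;   -- keep only the two columns of interest
u = DISTINCT p;                  -- drop duplicate (d1, d2) pairs
DUMP u;
```

For the sample rows in the question this should leave (1,1) and (1,2). It also avoids GROUP ALL, which sends every record to a single reducer.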

How to compute subtraction of the data from different aliases in Pig?

dump count_a;
(10)
dump count_b;
(20)
dump count_c;
(30)
Now I want to calculate: count_c - count_b - count_a. How can I achieve this in a Pig script?
You'll need to join the three aliases together, for which you'll need a common field to join upon. Assuming they are all single record aliases you can create a field n to join on:
prep_a = FOREACH count_a GENERATE 1 AS n, $0 AS a;
prep_b = FOREACH count_b GENERATE 1 AS n, $0 AS b;
prep_c = FOREACH count_c GENERATE 1 AS n, $0 AS c;
Then you can join a & b & c all by the common field n:
ab = JOIN prep_a by n FULL, prep_b by n;
abc = JOIN ab by prep_a::n FULL, prep_c by n;
Then finally calculate the final result:
result = FOREACH abc GENERATE (c - b - a) AS result;
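When each alias is guaranteed to hold exactly one row, Pig's ability to use a single-row relation as a scalar may avoid the joins altogether. A minimal sketch under that assumption; the aliases ca, cb, cc and the field name n are illustrative, and the casts to long assume the counts are integral:

```pig
-- Assumes count_a, count_b and count_c each contain exactly one row with one field.
-- ca, cb, cc and the field name n are illustrative.
ca = FOREACH count_a GENERATE (long)$0 AS n;
cb = FOREACH count_b GENERATE (long)$0 AS n;
cc = FOREACH count_c GENERATE (long)$0 AS n;
-- cb.n and ca.n are read as scalars; Pig fails at runtime if either relation has more than one row.
result = FOREACH cc GENERATE n - cb.n - ca.n AS result;
DUMP result;
```

With the values shown in the question (30, 20 and 10), the difference is 0.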

how to combine/concat two bags in pig latin

I have two datasets:
A = {uid, url}; B = {uid, url};
now I do a cogroup:
C = COGROUP A BY uid, B BY uid;
and I want to change C into {group AS uid, DISTINCT A.url+B.url};
My question is how do I do this concatenation of two bags A.url and B.url?
Or to put it differently, how do I do DISTINCT on multiple columns?
This may not be what you're expecting, but this is what I understood from your question:
C = JOIN A BY uid, B BY uid;
D = DISTINCT C;
Concatenation is done the following way:
E = FOREACH D GENERATE CONCAT(A::uid,B::uid);
A = LOAD 'A' using PigStorage() as (uid,url);
B = LOAD 'B' using PigStorage() as (uid,url);
C = JOIN A BY uid, B BY uid;
D = FOREACH C GENERATE $0, CONCAT(A::url,B::url);
E = DISTINCT D;
dump E;
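If the intent is one bag per uid holding the distinct URLs from both inputs, rather than concatenated strings, a UNION followed by a nested DISTINCT may be closer to what the question asks. A minimal sketch, assuming A and B are loaded with the same (uid,url) schema as above; the aliases AB, G, R, urls and uniq_urls are illustrative:

```pig
-- Assumes A and B share the schema (uid, url); other aliases are illustrative.
AB = UNION A, B;                 -- stack the rows of both inputs
G = GROUP AB BY uid;
R = FOREACH G {
    urls = AB.url;               -- project the url column out of the grouped bag
    uniq_urls = DISTINCT urls;   -- drop URLs that appear more than once for this uid
    GENERATE group AS uid, uniq_urls;
};
DUMP R;
```

Unlike the JOIN-based answers, this keeps uids that appear in only one of the two inputs.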