PIG: Get all tuples out of a grouped bag - apache-pig

I am using PIG to generate groups from tuples as follows:
a1, b1
a1, b2
a1, b3
...
->
a1, [b1, b2, b3]
...
This is easy and working. But my problem is to get the following: From the obtained groups, I would like to generate a set of all tuples in the group's bag:
a1, [b1, b2, b3]
->
b1,b2
b1,b3
b2,b3
This would be easy if I could nest "foreach" and firstly iterate over each group and then over its bag.
I suppose I am misunderstanding the concept and I will appreciate your explanation.
Thanks.

It looks like you need a Cartesian product between the bag and itself. To do this you need to use FLATTEN(bag) twice.
Code:
inpt = load '.../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as value_bag;
result = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2;
dump result;
Be aware that the row count grows quadratically: a bag of n values produces n x n rows. To limit this you could use TOP(...) before FLATTEN:
inpt = load '....group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    limited_bag = TOP(50, 0, values); -- all sorts of filtering could be done here
    generate id, FLATTEN(limited_bag) as v1, FLATTEN(limited_bag) as v2;
};
dump result;
For your specific output you could use some filtering before FLATTEN:
inpt = load '..../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
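result = foreach id_grp {
    l = filter values by val == 'b1' or val == 'b2';
    generate id, FLATTEN(l) as v1, FLATTEN(values) as v2;
};
-- v1 < v2 drops self-pairs and mirrored pairs such as (b2,b1); v1 != v2 would keep the mirrored ones
result = filter result by v1 < v2;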
I hope it helps.
Cheers

Also relevant is the UnorderedPairs function from the DataFu UDF library. It generates pairs of all items in a bag (in your case, your grouped bag).
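To make the difference between the full Cartesian product and unordered pairs concrete, here is a small plain-Python model (not Pig). The helper name `unordered_pairs` is made up for this sketch, and it only approximates what a pair-generating UDF would return for a bag:

```python
from itertools import combinations, product


def unordered_pairs(bag):
    """Return every pair of items from the bag exactly once (hypothetical helper)."""
    return list(combinations(bag, 2))


bag = ["b1", "b2", "b3"]

# Flattening the same bag twice behaves like a self cross product: n * n rows.
cross = list(product(bag, bag))

# Keeping only rows with v1 < v2 leaves one row per unordered pair.
filtered = [(v1, v2) for v1, v2 in cross if v1 < v2]

pairs = unordered_pairs(bag)
print(len(cross), pairs)
```

For a bag of three values the cross product has nine rows, while only three unordered pairs remain.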

You can use Pig's GROUP ALL statement to collapse one relation into a single bag and attach it to every row of another:
A = -- Some bag
B = -- Another bag
groupedB = group B ALL;
result = foreach A GENERATE
TOTUPLE(*), groupedB.$1;
-- Will generate
((a1), {(b1, b2, b3)})
((a2), {(b1, b2, b3)})
((a3), {(b1, b2, b3)})
...

Related

Calculate the sum/avg of an attribute in Apache pig

How can I calculate the avg or sum of an attribute in Apache Pig (vertically, not horizontally)? Lots of examples are available for doing this horizontally but not vertically.
This is my code
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING
PigStorage(',') AS (Year:int, ArrDelay:chararray);
f123 = f1;
ff123 = FILTER f123 BY something;
grp = GROUP ff123 ALL;
cnt = FOREACH grp GENERATE COUNT(ff123);-- this counts the number of rows and works fine
DUMP cnt;
-- The below code is the problem
DESCRIBE grp;
cntsum = FOREACH grp GENERATE FLATTEN(ff123.ArrDelay);
DESCRIBE cntsum;
and the output:
(2008,30)
(2009,60)
(2)
grp: {group: chararray,ff123: {(Year: int,ArrDelay: chararray)}}
cntsum: {null::ArrDelay: chararray}
But this throws an error:
cntsum = FOREACH grp GENERATE SUM((int)FLATTEN(ff123.ArrDelay));
DESCRIBE cntsum;
I need to get 90 as output (30+60)
By the way, what is this schema as the output:
cntsum: {null::ArrDelay: chararray}
I am using Apache Pig version 0.16.0.2.6.5.0-292.
You should load ArrDelay as an int column:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:int);
ff123 = FILTER f1 BY something;
grp = GROUP ff123 ALL;
total = FOREACH grp GENERATE SUM(ff123.ArrDelay);
DUMP total;
If that is not an option, load ArrDelay as chararray and cast it to int before the GROUP ALL and SUM:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:chararray);
ff123 = FILTER f1 BY something;
f2 = FOREACH ff123 GENERATE Year, (int)ArrDelay as ArrDelay;
grp = GROUP f2 ALL;
total = FOREACH grp GENERATE SUM(f2.ArrDelay);
DUMP total;
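As a rough mental model of why the cast has to happen before the aggregation, the same steps can be written in plain Python (not Pig). The two sample rows are the ones from the question's output; the variable names are only illustrative:

```python
# (Year, ArrDelay) rows with ArrDelay still held as a string, as in the chararray load.
rows = [(2008, "30"), (2009, "60")]

# Per-row cast, like FOREACH ... GENERATE Year, (int)ArrDelay.
casted = [(year, int(delay)) for year, delay in rows]

# GROUP ... ALL: a single output row whose bag holds every input row.
grouped = ("all", casted)

# SUM / AVG work on one projected column of that bag, not on flattened rows.
delays = [delay for _, delay in grouped[1]]
total = sum(delays)
average = total / len(delays)
print(total, average)
```

The aggregate takes the whole column of the bag as input, which is why wrapping the column in FLATTEN before SUM does not fit.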

Count distinct values in a group using pig

My problem, in general terms, is that I'd like to group my data and then count the unique values of a field.
Specifically, for the data below, I want to group by 'category' and 'year' and then count the unique values of 'food'.
category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple
This is as far as I can get, which is just picking out the values and using some of the neat pig date functions:
a = load '$input' using PigStorage(',') as (category:chararray,id:chararray,mydate:chararray,mystore:chararray,food:chararray);
b = foreach a generate category, id, ToDate(mydate,'yyyy-MM-dd HH:mm:ss') as myDt:DateTime, mystore,food;
c = foreach b generate category, GetYear(myDt) as year:int, mystore,food;
dump c;
The output from the alias 'c' is:
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2015,store1,milk)
(catB,2014,store2,milk)
(catB,2014,store2,apple)
I want in the end:
catA, 2014, {(apple, 2), (milk, 2)}
catA, 2015, {(milk, 1)}
catB, 2014, {(apple, 1), (milk, 1)}
I've seen some example of generating value counts, but grouping by category and year is tripping me up.
Input:
category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple
Yes, you can use a nested FOREACH after grouping. Inside the nested FOREACH, apply DISTINCT to the foods and then count the result.
The code below should help:
Pig Script:
list = LOAD 'user/cloudera/apple.txt' USING PigStorage(',') AS(category:chararray,id:chararray,mydate:chararray,my_store:chararray,food:chararray);
list_each = FOREACH list GENERATE category,SUBSTRING(mydate,0,4) as my_year, my_store, food;
list_grp = GROUP list_each BY (category,my_year);
list_nested_each = FOREACH list_grp
{
    list_inner_each = FOREACH list_each GENERATE food;
    list_inner_dist = DISTINCT list_inner_each;
    GENERATE flatten(group) as (category,my_year), COUNT(list_inner_dist) as no_of_uniq_foods;
};
dump list_nested_each;
Output:
(catA,2014,2)
(catA,2015,1)
(catB,2014,2)
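The nested DISTINCT plus COUNT can be pictured in plain Python (not Pig) as collecting a set of foods per (category, year) key. This is only a sketch of the logic; the rows are the sample data from the question with the year already extracted:

```python
from collections import defaultdict

# (category, year, store, food) rows from the sample input.
rows = [
    ("catA", "2014", "store1", "apple"),
    ("catA", "2014", "store1", "milk"),
    ("catA", "2014", "store1", "apple"),
    ("catA", "2014", "store1", "milk"),
    ("catA", "2015", "store1", "milk"),
    ("catB", "2014", "store2", "milk"),
    ("catB", "2014", "store2", "apple"),
]

# GROUP BY (category, year); a set plays the role of DISTINCT on food.
foods_by_group = defaultdict(set)
for category, year, _store, food in rows:
    foods_by_group[(category, year)].add(food)

# COUNT of the distinct foods in each group.
unique_food_counts = sorted(
    (category, year, len(foods))
    for (category, year), foods in foods_by_group.items()
)
print(unique_food_counts)
```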
Appending to the code in the question:
d = group c by (category, year, food);
e = foreach d generate FLATTEN(group), COUNT(c) as count;
will produce:
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
The key is to group by 'food' as well. Interesting. Any other insight is welcomed.
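The flat rows above still differ from the nested form asked for in the question. A plain-Python sketch (not Pig) of the two steps, counting per (category, year, food) and then regrouping by (category, year), is shown below. In Pig the second step would presumably be another GROUP over the flat counts, but that is an assumption and is not shown in the answers:

```python
from collections import Counter, defaultdict

# (category, year, food) rows, as in alias c without the store column.
rows = [
    ("catA", 2014, "apple"),
    ("catA", 2014, "milk"),
    ("catA", 2014, "apple"),
    ("catA", 2014, "milk"),
    ("catA", 2015, "milk"),
    ("catB", 2014, "milk"),
    ("catB", 2014, "apple"),
]

# Step 1: group by (category, year, food) and count the rows in each group.
counts = Counter(rows)

# Step 2: regroup the flat counts by (category, year) into a bag of (food, count).
nested = defaultdict(list)
for (category, year, food), n in sorted(counts.items()):
    nested[(category, year)].append((food, n))

for key, bag in nested.items():
    print(key, bag)
```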

Storing results of each column-operation in separate row in pig

I need to perform some numerical operations (using a UDF) on every column of my table, and for every column I get 2 values (mean and standard deviation). The final result comes out as a single row (mean_1, sd_1, mean_2, sd_2, mean_3, sd_3, ...), where 1, 2, ... are column indexes, but I need the output for each column in a separate row, like:
mean_1, sd_1 \\for col1
mean_2, sd_2 \\for col2
...
Here is the pig script I'm using:
data = LOAD 'input_file.csv' USING PigStorage(',') AS (C0,C1,C2);
grouped_data = GROUP data ALL;
res = FOREACH grouped_data GENERATE FLATTEN(data), AVG(data.$1) as mean, COUNT(data.$1) as count;
tmp = FOREACH res {
    diff = (C1-mean)*(C1-mean);
    GENERATE *,diff as diff;
};
grouped_diff = GROUP tmp all;
sq_tmp = FOREACH grouped_diff GENERATE flatten(tmp), SUM(tmp.diff) as sq_sum;
stat_tmp = FOREACH sq_tmp GENERATE mean as mean, sq_sum/count as variance, SQRT(sq_sum/count) as sd;
stats = LIMIT stat_tmp 1;
Could anybody please guide me on how to achieve this?
Thanks. I got the required output by creating tuples of mean and sd values for respective columns and then storing all such tuples in a bag. Then in the next step I flattened the bag.
tupled_stats = FOREACH raw_stats generate TOTUPLE(mean_0, var_0, sd_0) as T0, TOTUPLE(mean_1, var_1, sd_1) as T1, TOTUPLE(mean_2, var_2, sd_2) as T2;
bagged_stats = FOREACH tupled_stats generate TOBAG(T0, T1, T2) as B;
stats = foreach bagged_stats generate flatten(B);
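The reshaping from one wide row to one row per column can be pictured in plain Python (not Pig). The field names follow the mean_i, var_i, sd_i pattern used above; the numbers and the helper name `to_rows` are invented for illustration:

```python
# One wide row of per-column statistics (values are made up).
raw_stats = {
    "mean_0": 1.0, "var_0": 4.0, "sd_0": 2.0,
    "mean_1": 5.0, "var_1": 9.0, "sd_1": 3.0,
    "mean_2": 7.0, "var_2": 16.0, "sd_2": 4.0,
}


def to_rows(stats, n_columns):
    """Build one (mean, var, sd) tuple per column, then return them as separate rows."""
    # TOTUPLE: bundle the three statistics that belong to one column.
    tuples = [
        (stats[f"mean_{i}"], stats[f"var_{i}"], stats[f"sd_{i}"])
        for i in range(n_columns)
    ]
    # TOBAG followed by FLATTEN: each tuple in the bag becomes its own output row.
    return list(tuples)


for row in to_rows(raw_stats, 3):
    print(row)
```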

How to perform a DISTINCT in Pig Latin on a subset of columns?

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:
You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).
It is simple to perform a DISTINCT operation on all of the columns:
A = LOAD 'data' AS (a1,a2,a3,a4);
A_unique = DISTINCT A;
Let's say that I am interested in performing the distinct across a1, a2, and a3. Can anyone provide an example showing how to perform this operation with a nested foreach as suggested in the documentation?
Here's an example of input and expected output:
A = LOAD 'data' AS(a1,a2,a3,a4);
DUMP A;
(1 2 3 4)
(1 2 3 4)
(1 2 3 5)
(1 2 4 4)
-- insert DISTINCT operation on a1,a2,a3 here:
-- ...
DUMP A_unique;
(1 2 3 4)
(1 2 4 4)
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1,a2,a3);
        s = DISTINCT b;
        GENERATE FLATTEN(s), group AS a4;
    };
The accepted answer is a great solution, but it might not work if you want to reorder the fields in the output (something I had to do recently). Here's an alternative:
A = LOAD '$input' AS (f1, f2, f3, f4, f5);
GP = GROUP A BY (f1, f2, f3);
-- OUTPUT is a reserved word in Pig, so the alias is named reordered here.
reordered = FOREACH GP {
    first = LIMIT A 1; -- f4 and f5 live inside the bag A, so pick one row per group
    GENERATE group.f1, group.f2, FLATTEN(first.(f4, f5)), group.f3;
};
Once you group on certain fields, each output tuple carries a unique combination of those fields.
For your specified input/output, the following works. You might update your test vectors to clarify what you need that is different than this.
A_unique = DISTINCT A;
Here are 2 possible solutions, are there any other good approaches?
Solution 1 (using LIMIT 1):
A = LOAD 'test_data' AS (a1,a2,a3,a4);
-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4;
-- Group by the combined column
grouped_by_combined = GROUP A2 BY combined;
grouped_and_distinct = FOREACH grouped_by_combined {
    single = LIMIT A2 1;
    GENERATE FLATTEN(single);
};
Solution 2 (using DISTINCT):
A = LOAD 'test_data' AS (a1,a2,a3,a4);
-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4;
-- Group by the other columns (those I don't want the distinct applied to)
grouped_by_a4 = GROUP A2 BY a4;
-- Perform the distinct on a projection of combined and flatten
grouped_and_distinct = FOREACH grouped_by_a4 {
combined_unique = DISTINCT A2.combined;
GENERATE FLATTEN(combined_unique);
};
unique_A = FOREACH (GROUP A BY (a1, a2, a3)) {
    limit_a = LIMIT A 1;
    GENERATE FLATTEN(limit_a) AS (a1,a2,a3,a4);
};
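The approaches above do not all return the same rows, which a plain-Python model (not Pig) of the sample data makes visible. The function names are made up for this sketch. Note also that Pig's LIMIT 1 may pick any row of the group, whereas the Python version simply keeps the first one it sees:

```python
rows = [(1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 3, 5), (1, 2, 4, 4)]


def distinct_within_a4(rows):
    """Model of: GROUP BY a4, then DISTINCT over (a1, a2, a3) inside each group."""
    seen = set()
    out = []
    for a1, a2, a3, a4 in rows:
        key = (a4, (a1, a2, a3))
        if key not in seen:
            seen.add(key)
            out.append((a1, a2, a3, a4))
    return out


def first_per_key(rows):
    """Model of: GROUP BY (a1, a2, a3), then keep one row per group."""
    first = {}
    for row in rows:
        first.setdefault(row[:3], row)
    return list(first.values())


print(distinct_within_a4(rows))
print(first_per_key(rows))
```

In this model, grouping by a4 keeps (1, 2, 3, 5) as its own row, while grouping by (a1, a2, a3) yields the two rows listed as the expected output in the question.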
I was looking to do the same: "I would like to perform a DISTINCT operation on a subset of the columns". The way I did it was:
A = LOAD 'data' AS(a1,a2,a3,a4);
interested_fields = FOREACH A GENERATE a1,a2,a3;
distinct_fields= DISTINCT interested_fields;
final_answer = FOREACH distinct_fields GENERATE FLATTEN($0);
I know it's not an example of a nested foreach as suggested in the documentation, but it's a way of doing a distinct over a subset of fields. Hope it helps anyone who gets here just like I did.

Pig split and join

I have a requirement to propagate a field value from one row to the other rows with the same id, based on the record type.
For example, my raw input is
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
Pig is not exactly SQL: it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using GROUP BY, a FILTER nested in the FOREACH, and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
    P = FILTER inpt BY type == 'p'; -- keep the records of type p for this id
    PL = LIMIT P 1; -- make sure there is just one
    GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert the bags produced by GROUP BY back to rows
};
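The same group, filter and flatten logic can be sketched in plain Python (not Pig) to check the expected rows. The function name `propagate_browser` is made up, only part of the sample input is used, and the sketch assumes that an id without a 'p' record produces no output, since flattening an empty bag yields no rows:

```python
from collections import defaultdict

# (id, browser, type) rows; the browser is only present on the 'p' record.
rows = [
    (1, "firefox", "p"), (1, "", "q"), (1, "", "r"),
    (2, "ie", "p"), (2, "", "s"),
    (4, "netscape", "p"),
]


def propagate_browser(rows):
    """Copy the browser of each id's 'p' record onto every record of that id."""
    groups = defaultdict(list)
    for row in rows:  # GROUP BY id
        groups[row[0]].append(row)

    out = []
    for id_, members in groups.items():
        # FILTER BY type == 'p' followed by LIMIT 1.
        browser = next((b for _, b, t in members if t == "p"), None)
        if browser is None:
            continue  # assumption: no 'p' record means no rows for this id
        for _, _, type_ in members:  # FLATTEN back to one row per record
            out.append((id_, browser, type_))
    return out


for row in propagate_browser(rows):
    print(row)
```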