How to Find the Distinct Values in a DataBag? - apache-pig

Say I have some data like
1,A
1,A
1,B
2,C
2,D
3,E
3,E
I want to be able to group the first column and then return the distinct values in that group:
1,A,B
2,C,D
3,E
or
1,{A,B}
2,{C,D}
3,{E}
Is there a way to do this aside from a UDF?
If I do
DATA = LOAD 'data.txt' USING PigStorage(',') AS (a:int, b:chararray);
GROUPED = GROUP DATA BY a;
UNIQUES = FOREACH GROUPED {
distinct_bs = DISTINCT GROUPED.b;
GENERATE
group AS a
,FLATTEN(distinct_bs)
;
}
Regardless of whether I use FLATTEN, or whether I include group AS a, I receive:
ERROR 1200: org.apache.pig.newplan.logical.expression.ScalarExpression
cannot be cast to org.apache.pig.newplan.logical.expression.ProjectExpression

GROUPED does not contain b, but DATA does:
DESCRIBE GROUPED
GROUPED: {group: int,DATA: {(a: int,b: chararray)}}
Try the following:
UNIQUES = FOREACH GROUPED {
distinct_bs = DISTINCT DATA.b;
GENERATE
group AS a,
distinct_bs;
}
Results in:
(1,{(A),(B)})
(2,{(C),(D)})
(3,{(E)})
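If the flat 1,A,B form from the question is preferred over a bag, one option is to join the distinct values into a single string. The following is a minimal sketch, assuming a Pig release that ships the built-in BagToString function and assuming that the order of values inside each bag does not matter (Pig does not guarantee it). The alias bs is introduced here for illustration only.

```pig
DATA = LOAD 'data.txt' USING PigStorage(',') AS (a:int, b:chararray);
GROUPED = GROUP DATA BY a;
UNIQUES = FOREACH GROUPED {
    -- DATA is the bag inside each group; project b and de-duplicate it
    distinct_bs = DISTINCT DATA.b;
    -- Join the distinct values with commas into one chararray (assumed: BagToString is available)
    GENERATE group AS a, BagToString(distinct_bs, ',') AS bs;
};
DUMP UNIQUES;
```

Replacing the BagToString call with FLATTEN(distinct_bs) would instead emit one row per (a, b) pair.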

Related

PIG need to find max

I am new to Pig and working on a problem where I need to find the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course. It told me I had to use an explicit cast. I have the filtering etc. figured out; I just can't figure out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
weights::id AS id,
weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
group AS id,
MAX(just_ids_weights.weight) AS wgt;
If you wanted the maximum weight in all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
MAX(just_ids_weights.weight) AS wgt;
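To see which player carries the maximum weight, and not only the value itself, one possibility is to sort by weight and keep the first row. This is a sketch under assumptions: the paths and column positions ($0, $1, $9, $16) are copied from the question, and the weight is cast to int because fields loaded without a schema default to bytearray. The aliases by_weight and heaviest are hypothetical names.

```pig
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',');
realbatters = FILTER batters BY $1 == 2005;
triphitters = FILTER realbatters BY $9 > 5;
tripids = FOREACH triphitters GENERATE $0 AS id, $1 AS year, $9 AS trips;

names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv' USING PigStorage(',');
-- Cast explicitly so that the weight is compared as a number
weights = FOREACH names GENERATE $0 AS id, (int)$16 AS weight;

get_ids = JOIN weights BY id, tripids BY id;
just_ids_weights = FOREACH get_ids GENERATE
    weights::id AS id,
    weights::weight AS weight;

-- Sort heaviest first and keep a single row
by_weight = ORDER just_ids_weights BY weight DESC;
heaviest = LIMIT by_weight 1;
DUMP heaviest;
```

Note that LIMIT 1 returns only one row even when several players tie for the maximum weight.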

Count distinct values in a group using pig

My problem in a general sense is that I'd like to group my data and then count the uniq values for a field.
Specifically, for the data below, I want to group by 'category' and 'year' and then count the uniq values for 'food'.
category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple
This is as far as I can get, which is just picking out the values and using some of the neat pig date functions:
a = load '$input' using PigStorage(',') as (category:chararray,id:chararray,mydate:chararray,mystore:chararray,food:chararray);
b = foreach a generate category, id, ToDate(mydate,'yyyy-MM-dd HH:mm:ss') as myDt:DateTime, mystore,food;
c = foreach b generate category, GetYear(myDt) as year:int, mystore,food;
dump c;
The output from the alias 'c' is:
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2015,store1,milk)
(catB,2014,store2,milk)
(catB,2014,store2,apple)
I want in the end:
catA, 2014, {(apple, 2), (milk, 2)}
catA, 2015, {(milk, 1)}
catB, 2014, {(apple, 1), (milk, 1)}
I've seen some example of generating value counts, but grouping by category and year is tripping me up.
Yes, you can use a nested FOREACH after the grouping. Inside that nested FOREACH you can apply DISTINCT to the foods and then count the result.
The code below shows how:
Pig Script:
list = LOAD 'user/cloudera/apple.txt' USING PigStorage(',') AS(category:chararray,id:chararray,mydate:chararray,my_store:chararray,food:chararray);
list_each = FOREACH list GENERATE category,SUBSTRING(mydate,0,4) as my_year, my_store, food;
list_grp = GROUP list_each BY (category,my_year);
list_nested_each = FOREACH list_grp
{
list_inner_each = FOREACH list_each GENERATE food;
list_inner_dist = DISTINCT list_inner_each;
GENERATE flatten(group) as (category,my_year), COUNT(list_inner_dist) as no_of_uniq_foods;
};
dump list_nested_each;
Output:
(catA,2014,2)
(catA,2015,1)
(catB,2014,2)
Appending to the code in the question:
d = group c by (category, year, food);
e = foreach d generate FLATTEN(group), COUNT(c) as count;
will produce:
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
The key is to group by 'food' as well. Interesting. Any other insight is welcomed.
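To get from these per-food counts to the bag layout requested in the question (one row per category and year), the counted rows can be grouped a second time. The following is a sketch that repeats the question's aliases a, b and c so that it stands on its own; the aliases f and g and the field names cnt and food_counts are hypothetical.

```pig
a = LOAD '$input' USING PigStorage(',') AS (category:chararray, id:chararray, mydate:chararray, mystore:chararray, food:chararray);
b = FOREACH a GENERATE category, ToDate(mydate, 'yyyy-MM-dd HH:mm:ss') AS myDt, food;
c = FOREACH b GENERATE category, GetYear(myDt) AS year, food;

-- First pass: count rows per (category, year, food)
d = GROUP c BY (category, year, food);
e = FOREACH d GENERATE FLATTEN(group) AS (category, year, food), COUNT(c) AS cnt;

-- Second pass: collect the (food, cnt) pairs of each (category, year) into one bag
f = GROUP e BY (category, year);
g = FOREACH f GENERATE FLATTEN(group) AS (category, year), e.(food, cnt) AS food_counts;
DUMP g;
```

The order of the pairs inside each bag is not guaranteed.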

How to check COUNT of filtered elements in PIG

I have the following data set in which I need to perform some steps based on the Car's company name.
(23,Nissan,12.43)
(23,Nissan Car,16.43)
(23,Honda Car,13.23)
(23,Toyota Car,17.0)
(24,Honda,45.0)
(24,Toyota,12.43)
(24,Nissan Car,12.43)
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
DUMP G;
I am grouping cars based on code and their base company name, so that all the 'Nissan' and 'Nissan Car' records end up in one group, and similarly for the others.
/* Grouped data based on code and company's first name*/
((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
Now, I want to filter out the groups based on whether they contain a tuple corresponding to group's name. If yes, take that tuple from that group and ignore others and if no such tuple exists then take all the tuples for that group.
The Output should be:
((23,Nissan),{(23,Nissan,12.43)}) // Since this group contains a row with group's name i.e. Nissan
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}
Could anybody please help me do this? After filtering by the group's name, how can I find the count of the filtered tuples and get the required data?
OK. Let's consider the records below as your input.
23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43
For the above input, let's say the following is the intermediate output:
((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})
From that intermediate output, suppose you are looking for the output below, as per your requirement:
(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)
Below is the code:
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') as(code:int,name:chararray,rating:double);
nissan_each = FOREACH nissan_load GENERATE code,TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) as brand_name,name,rating;
nissan_grp = GROUP nissan_each by (code,brand_name);
nissan_final_each =FOREACH nissan_grp {
A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 :0) as cnt;
B = (int)SUM(A);
C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ?1: 0) as extra_cnt;
D = SUM(C);
generate flatten(group) as(code,brand_name), (SUM(A.cnt) != 0 ? B : D) as final_cnt;
};
dump nissan_final_each;
Try this code with different inputs as well.
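The question also asks for the tuples themselves, not only their count. One way to sketch that, under assumptions, is to filter inside the nested FOREACH and fall back to the whole group when the filter comes back empty. The file path and the aliases cars, tagged, grp, picked, exact and kept are hypothetical. The sketch also assumes that a bincond may return bags when both branches have the same schema, which is worth verifying on the Pig version in use.

```pig
cars = LOAD 'data.txt' USING PigStorage(',') AS (code:int, name:chararray, rating:double);
-- Carry the extracted brand name as a column so the nested FILTER can compare against it
tagged = FOREACH cars GENERATE
    code,
    TRIM(REGEX_EXTRACT(name, '(?i)(^.+?\\b)\\s*(Car)*$', 1)) AS brand_name,
    name,
    rating;
grp = GROUP tagged BY (code, brand_name);
picked = FOREACH grp {
    -- Tuples whose full name equals the brand name, e.g. 'Nissan' but not 'Nissan Car'
    exact = FILTER tagged BY brand_name == TRIM(name);
    -- Keep only the exact matches if there are any, otherwise keep every tuple of the group
    GENERATE group, (IsEmpty(exact) ? tagged : exact) AS kept;
};
DUMP picked;
```

Applying COUNT to kept would then give the number of tuples that were retained per group.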

How to ORDER columns in a CSV once they are grouped in Pig?

I want to order the columns once they are already grouped. How can I do this?
My data looks like this:
product,next_link,count_value
p1,p2,2
p1,p4,4
p1,p5,5
p2,p1,3
p2,p3,2
p3,p2,1
p3,p5,6
p3,p1,8
p4,p1,8
p4,p5,2
p5,p3,3
p5,p2,5
p5,p4,6
p5,p1,4
I grouped them using this piece of code:
product_group = GROUP product_data BY products;
DUMP product_group;
The output is:
(p1,{(p1,p2,2),(p1,p4,4),(p1,p5,5)})
(p2,{(p2,p1,3),(p2,p3,2)})
(p3,{(p3,p5,6),(p3,p1,8),(p3,p2,1)})
(p4,{(p4,p5,2),(p4,p1,8)})
(p5,{(p5,p1,4),(p5,p3,3),(p5,p2,5),(p5,p4,6)})
I want to use ORDER to order the next_link values based on count_value.
I have written the code as:
B = FOREACH product_data {
field2_ord = ORDER next_link BY count_value;
GENERATE products, field2_ord;
};
If you want to print the product data in order of count_value then you can use:
A = LOAD 'Product_data.csv' USING PigStorage(',') AS (product:chararray, next_link:chararray, count_value:int);
B = ORDER A BY count_value ASC;
C = FOREACH B GENERATE product, next_link;
DUMP C;
I hope this is what was expected.
Try this code below:
a_input = LOAD 'Product_data.csv' USING PigStorage(',') AS (product:chararray, next_link:chararray, count_value:int);
B = GROUP (ORDER a_input BY count_value) BY next_link;
Are you expecting this type of code?
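If the goal is to keep the grouping by product and sort the rows inside each group, a nested ORDER over the grouped bag is one way to express it. This is a minimal sketch that reuses the schema from the answers above; the alias by_count is hypothetical, and the sketch assumes the file has no header row (a header line would load with a null count_value).

```pig
product_data = LOAD 'Product_data.csv' USING PigStorage(',') AS (product:chararray, next_link:chararray, count_value:int);
product_group = GROUP product_data BY product;
ordered = FOREACH product_group {
    -- Inside the nested block, product_data refers to the bag of rows for this group
    by_count = ORDER product_data BY count_value;
    GENERATE group AS product, by_count;
};
DUMP ordered;
```

The nested ORDER is applied to the bag (product_data), not to a field such as next_link, which is where the attempt in the question differs.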

Pig split and join

I have a requirement to propagate field values from one row to other rows, depending on the record type.
For example, my raw input is:
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
Pig is not exactly SQL; it is built with data flows, MapReduce, and groups in mind (joins are also there). You can get the result using GROUP BY, a FILTER nested in the FOREACH, and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
P = FILTER inpt BY type == 'p'; -- keep the records with type p for this id
PL = LIMIT P 1; -- make sure there is just one
GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};
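A join-based variant is also possible, closer in spirit to the attempt in the question. The sketch below assumes that each id has exactly one 'p' record carrying the browser; ids without a 'p' record are dropped by the inner join, and several 'p' records per id would produce duplicate rows. The aliases p_rows, browsers, joined and result are hypothetical.

```pig
inpt = LOAD 'file1.txt' USING PigStorage(',') AS (id:int, browser:chararray, type:chararray);
-- Build a small lookup of id -> browser from the 'p' records
p_rows = FILTER inpt BY type == 'p';
browsers = FOREACH p_rows GENERATE id, browser;
-- Attach the browser to every record with the same id
joined = JOIN inpt BY id, browsers BY id;
result = FOREACH joined GENERATE
    inpt::id AS id,
    browsers::browser AS browser,
    inpt::type AS type;
DUMP result;
```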