How do you get SUM() to work on bags after grouping? - apache-pig

Can someone explain why this works:
data = FOREACH problem_three GENERATE $0 AS playerID, $1 AS yearID, $8 AS doubles, $9 AS triples, $10 AS HR;
filtered_years = FILTER data BY (yearID == '1980' OR yearID == '1981' OR yearID == '1982' OR yearID == '1983' OR yearID == '1984' OR yearID == '1985' OR
yearID == '1986' OR yearID == '1987' OR yearID == '1988' OR yearID == '1989');
extra_bases_total = FOREACH filtered_years GENERATE playerID, doubles+triples+HR AS yearly_total;
players_grouped = GROUP extra_bases_total BY playerID;
sums = FOREACH players_grouped GENERATE group, SUM(extra_bases_total.yearly_total) AS total;
grouped_all = GROUP sums ALL;
but this doesn't:
Batting = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',');
batting_details = FOREACH Batting GENERATE $0 AS playerID, $11 AS RBI;
batting_detail = FOREACH batting_details GENERATE playerID, RBI;
batting_details_groupeddata = GROUP batting_detail BY playerID;
totals = FOREACH batting_details_groupeddata GENERATE group, SUM(batting_detail.RBI) AS total;
DUMP totals;
What I am trying to do is get a sum of the RBIs for each player in the second part. I've tried filtering out the nulls and a bunch of other re-writing but nothing is working. SUM seems like a simple function but I cannot get it to work after I make a group.
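One thing that may be worth trying (a sketch, not a confirmed diagnosis): because Batting is loaded without a schema, every field is a bytearray. The version below assumes the trouble comes from untyped or non-numeric RBI values, such as a header row, and casts explicitly before grouping. A cast that fails yields null, which the filter then drops. The aliases batting_clean and by_player are made up for illustration.

```pig
-- Assumption: column 11 of Batting.csv holds RBI as an integer string.
Batting = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',');

-- Cast explicitly; values that cannot be cast (e.g. a header line) become null.
batting_detail = FOREACH Batting GENERATE (chararray)$0 AS playerID, (int)$11 AS RBI;

-- Drop rows where the cast produced null.
batting_clean = FILTER batting_detail BY RBI IS NOT NULL;

-- Group so that each player has a bag of rows, then sum inside the bag.
by_player = GROUP batting_clean BY playerID;
totals = FOREACH by_player GENERATE group AS playerID, SUM(batting_clean.RBI) AS total;
DUMP totals;
```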

Related

PIG need to find max

I am new to Pig and working on a problem where I need to find the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course. It told me I had to use an explicit cast. I have the filtering etc. figured out; I just can't figure out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
    weights::id AS id,
    weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
    group AS id,
    MAX(just_ids_weights.weight) AS wgt;
If you wanted the maximum weight in all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
    MAX(just_ids_weights.weight) AS wgt;
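Since the question mentions an explicit-cast complaint, here is a sketch of the whole script with the casts written out. It assumes the column positions from the question ($1 is the year, $9 the triples, $16 the weight) and that those columns hold integers. The aliases joined, just_weights and max_wt are illustrative names only.

```pig
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',');
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv' USING PigStorage(',');

-- Fields loaded without a schema are bytearrays, so cast them up front.
tripids = FOREACH batters GENERATE (chararray)$0 AS id, (int)$1 AS year, (int)$9 AS trips;
triphitters = FILTER tripids BY year == 2005 AND trips > 5;
weights = FOREACH names GENERATE (chararray)$0 AS id, (int)$16 AS weight;

joined = JOIN weights BY id, triphitters BY id;
just_weights = FOREACH joined GENERATE weights::id AS id, weights::weight AS weight;

-- GROUP ... ALL puts every row into one bag, which is what MAX expects.
gp_all = GROUP just_weights ALL;
max_wt = FOREACH gp_all GENERATE MAX(just_weights.weight) AS wgt;
DUMP max_wt;
```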

Pig Script error while using SUM()

I get the following error while running this PIG Script....Please Help!!!
Thanks in advance.
"ERROR 1000: Error during parsing. Scalars can be only used with projections"
MOVIES = LOAD '/MOVIES' using PigStorage(',') as (mid:double, mn:chararray, yr:int, rt:float, dr:int);
Filter11 = filter MOVIES by $2 >= 1950;
Filter12 = filter Filter11 by $2 <= 1960;
Group1 = group Filter12 by yr;
Count1 = foreach Group1 generate group, COUNT(Filter12);
Sum1 = foreach Count1 generate SUM(Group1);
DUMP Sum1;
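SUM(Group1) refers to another relation from inside a FOREACH over Count1, which Pig interprets as a scalar reference; hence the error. Combine the two filter conditions, then collect the per-year counts into a single bag with GROUP ... ALL and sum the count column, i.e. COUNT(Filter11) or $1.
MOVIES = LOAD '/MOVIES' using PigStorage(',') as (mid:double, mn:chararray, yr:int, rt:float, dr:int);
Filter11 = filter MOVIES by ($2 >= 1950 and $2 <= 1960);
Group1 = group Filter11 by yr;
Count1 = foreach Group1 generate group, COUNT(Filter11);
-- SUM needs a bag, so gather all the counts into one group first
GroupAll = group Count1 all;
Sum1 = foreach GroupAll generate SUM(Count1.$1);
DUMP Sum1;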

Pig script sum within a bag

Sum up the number of doubles and triples for each birthCity/birthState combination. Output the top 5 birthCity/birthState combinations that produced the players who had the most doubles and triples.
Currently I have this
clean = FOREACH filtered_2 GENERATE id,city,state, dble + tripple AS combined;
dump clean;
My question is how to fit the above. It's obvious I have to group by (city,state), but how do I get a sum within a bag after I group?
counter = foreach clean {
sum1 = SUM(combined);
generate id,city,state,sum1;
};
I was thinking of something like this, but it's not working.
Group the relation clean by city,state and then use the SUM to get the total of the grouping for each city,state.
clean = FOREACH filtered_2 GENERATE id,city,state,(dble + tripple) AS combined;
clean_group = GROUP clean BY (city,state);
counter = FOREACH clean_group GENERATE FLATTEN(group) as (city,state),SUM(clean.combined) as sum1;
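For the "top 5" part of the question, a possible continuation is to order the summed relation and limit it. This is only a sketch: it picks up from the counter relation defined above, and the aliases ordered and top5 are made up for illustration. Note that LIMIT cuts off at exactly five rows, so ties at the boundary are dropped.

```pig
-- Continues from `counter` (city, state, sum1) as defined above.
ordered = ORDER counter BY sum1 DESC;

-- Keep the five city/state combinations with the highest totals.
top5 = LIMIT ordered 5;
DUMP top5;
```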

Count distinct values in a group using pig

My problem in a general sense is that I'd like to group my data and then count the unique values for a field.
Specifically, for the data below, I want to group by 'category' and 'year' and then count the unique values for 'food'.
category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple
This is as far as I can get, which is just picking out the values and using some of the neat pig date functions:
a = load '$input' using PigStorage(',') as (category:chararray,id:chararray,mydate:chararray,mystore:chararray,food:chararray);
b = foreach a generate category, id, ToDate(mydate,'yyyy-MM-dd HH:mm:ss') as myDt:DateTime, mystore,food;
c = foreach b generate category, GetYear(myDt) as year:int, mystore,food;
dump c;
The output from the alias 'c' is:
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2015,store1,milk)
(catB,2014,store2,milk)
(catB,2014,store2,apple)
I want in the end:
catA, 2014, {(apple, 2), (milk, 2)}
catA, 2015, {(milk, 1)}
catB, 2014, {(apple, 1), (milk, 1)}
I've seen some examples of generating value counts, but grouping by category and year is tripping me up.
Yes, you can use a nested FOREACH after your grouping. Inside that nested FOREACH you can apply DISTINCT to the foods and then count the result.
Pig Script:
list = LOAD 'user/cloudera/apple.txt' USING PigStorage(',') AS(category:chararray,id:chararray,mydate:chararray,my_store:chararray,food:chararray);
list_each = FOREACH list GENERATE category,SUBSTRING(mydate,0,4) as my_year, my_store, food;
list_grp = GROUP list_each BY (category,my_year);
list_nested_each = FOREACH list_grp
{
    -- project just the food column, de-duplicate it, then count what is left
    list_inner_each = FOREACH list_each GENERATE food;
    list_inner_dist = DISTINCT list_inner_each;
    GENERATE flatten(group) as (category,my_year), COUNT(list_inner_dist) as no_of_uniq_foods;
};
dump list_nested_each;
Output:
(catA,2014,2)
(catA,2015,1)
(catB,2014,2)
Appending to the code in the question:
d = group c by (category, year, food);
e = foreach d generate FLATTEN(group), COUNT(c) as count;
will produce:
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
The key is to group by 'food' as well. Interesting. Any other insight is welcomed.
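To get closer to the bag-shaped output asked for in the question, the per-food counts can be grouped once more by (category, year). The sketch below assumes the relation c from the question; the aliases f and g and the field names cnt and food_counts are made up for illustration. The order of tuples inside each bag is not guaranteed.

```pig
-- Count rows per (category, year, food), naming the flattened fields explicitly.
d = GROUP c BY (category, year, food);
e = FOREACH d GENERATE FLATTEN(group) AS (category, year, food), COUNT(c) AS cnt;

-- Regroup by (category, year) and keep only (food, cnt) inside each bag.
f = GROUP e BY (category, year);
g = FOREACH f GENERATE FLATTEN(group) AS (category, year), e.(food, cnt) AS food_counts;
DUMP g;
```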

Apache pig about sort top n

Recently I tried to use Pig to sort some data. The following is my script to order the data by count (for example, I want to find the top 3):
in = load 'data.txt';
split = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group split by tmp;
result = foreach C generate group, COUNT(split) as cnt;
des = ORDER result BY cnt DESC;
fin = LIMIT des 3;
The output then looks like this:
A,10
B,9
C,8
But if another entry also has a count of 8, it is not output. In detail, when I run DUMP des, the contents look like the following:
A,10
B,9
C,8
D,8
E,8
F,7
.
.
If I want to output the top 3, the result also needs to include D,8 and E,8, but LIMIT in Pig can't do that. Does anyone have experience dealing with this problem in Pig, or must I write a UDF to handle it?
LIMIT will not work in your case; you have to use the RANK and FILTER operators.
data.txt
A,A,A,A,A,A,A,A,A,A
B,B,B,B,B,B,B,B,B
C,C,C,C,C,C,C,C
D,D,D,D,D,D,D,D
E,E,E,E,E,E,E,E
F,F,F,F,F,F,F
PigScript:
-- `in` is a reserved keyword in newer Pig versions, so the alias is renamed to `lines`
lines = load 'data.txt';
sp = foreach lines generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group sp by tmp;
result = foreach C generate group, COUNT(sp) as cnt;
myrank = RANK result BY cnt DESC DENSE;
top3 = FILTER myrank BY rank_result<=3;
finalOutput = FOREACH top3 GENERATE group,cnt;
DUMP finalOutput;
Output:
(A,10)
(B,9)
(C,8)
(D,8)
(E,8)