Pig Script error while using SUM() - apache-pig

I get the following error when running this Pig script. Any help is appreciated.
"ERROR 1000: Error during parsing. Scalars can be only used with projections"
MOVIES = LOAD '/MOVIES' using PigStorage(',') as (mid:double, mn:chararray, yr:int, rt:float, dr:int);
Filter11 = filter MOVIES by $2 >= 1950;
Filter12 = filter Filter11 by $2 <= 1960;
Group1 = group Filter12 by yr;
Count1 = foreach Group1 generate group, COUNT(Filter12);
Sum1 = foreach Count1 generate SUM(Group1);
DUMP Sum1;

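The error comes from SUM(Group1): Group1 is a relation, not a field of Count1, so Pig tries to treat it as a scalar. Combine the two filter conditions, then collect the per-year counts into a single bag with GROUP ALL and sum that bag, because SUM takes a bag rather than a single value.
MOVIES = LOAD '/MOVIES' using PigStorage(',') as (mid:double, mn:chararray, yr:int, rt:float, dr:int);
Filter11 = filter MOVIES by ($2 >= 1950 and $2 <= 1960);
Group1 = group Filter11 by yr;
Count1 = foreach Group1 generate group as yr, COUNT(Filter11) as cnt;
All1 = group Count1 all;
Sum1 = foreach All1 generate SUM(Count1.cnt);
DUMP Sum1;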
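If only the overall total is needed, the per-year grouping can be skipped, since adding up the per-year counts gives the same number as counting the filtered rows once. A minimal sketch, assuming the same '/MOVIES' input and schema as in the question; the aliases InRange, AllRows and Total are made up for illustration:

```pig
MOVIES = LOAD '/MOVIES' USING PigStorage(',') AS (mid:double, mn:chararray, yr:int, rt:float, dr:int);
-- Keep only the rows whose year falls in the requested range.
InRange = FILTER MOVIES BY (yr >= 1950 AND yr <= 1960);
-- GROUP ALL puts every remaining row into one bag.
AllRows = GROUP InRange ALL;
-- COUNT takes that bag and returns the number of rows in it.
Total = FOREACH AllRows GENERATE COUNT(InRange);
DUMP Total;
```

Keep the GROUP BY yr version when the per-year breakdown is also wanted.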

PIG need to find max

I am new to Pig and working on a problem where I need to find the player in this dataset with the maximum weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
Here is my Pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course: Pig told me I had to use an explicit cast. I have the filtering figured out; I just can't figure out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
weights::id AS id,
weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
group AS id,
MAX(just_ids_weights.weight) AS wgt;
If you want the maximum weight across all of the data, put everything into a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
    MAX(just_ids_weights.weight) AS wgt;
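MAX returns only the weight itself. Since the question asks for the player, one possible approach is to sort by weight and keep the first row. This is a sketch under assumptions: the file ids_weights.csv and its two-column layout are hypothetical stand-ins for the joined data, and the weight column is assumed to hold whole numbers.

```pig
-- Hypothetical input: one "id,weight" pair per line.
pairs = LOAD 'ids_weights.csv' USING PigStorage(',') AS (id:chararray, weight:int);
-- Sort so that the heaviest player comes first.
by_weight = ORDER pairs BY weight DESC;
-- Keep only the first row of the sorted relation.
heaviest = LIMIT by_weight 1;
DUMP heaviest;
```

LIMIT 1 keeps a single row, so if several players tie for the maximum weight only one of them is returned.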

How do you get SUM() to work after grouping and have bags?

Can someone explain why this works:
data = FOREACH problem_three GENERATE $0 AS playerID, $1 AS yearID, $8 AS doubles, $9 AS triples, $10 AS HR;
filtered_years = FILTER data BY (yearID=='1980' OR yearID=='1981'OR yearID=='1982'OR yearID=='1983'OR yearID=='1984'OR yearID=='1985'OR
yearID=='1986'OR yearID=='1987'OR yearID=='1988'OR yearID=='1989');
extra_bases_total = FOREACH filtered_years GENERATE playerID, doubles+triples+HR AS yearly_total;
players_grouped = GROUP extra_bases_total BY playerID;
sums = FOREACH players_grouped GENERATE group, SUM(extra_bases_total.yearly_total) AS total;
grouped_all = GROUP sums ALL;
but this doesn't:
Batting = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',');
batting_details = FOREACH Batting GENERATE $0 AS playerID, $11 AS RBI;
batting_detail = FOREACH batting_details GENERATE playerID, RBI;
batting_details_groupeddata = GROUP batting_detail BY playerID;
totals = FOREACH batting_details_groupeddata GENERATE group, SUM(batting_detail.RBI) AS total;
DUMP totals;
What I am trying to do is get a sum of the RBIs for each player in the second part. I've tried filtering out the nulls and a bunch of other re-writing but nothing is working. SUM seems like a simple function but I cannot get it to work after I make a group.
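The question does not show the error message, so the following is not a confirmed fix. One thing worth ruling out is the column type: without an AS clause in LOAD, every field is a bytearray. A sketch that declares the types up front, assuming RBI holds integers; the file batting_rbi.csv and its two-column layout are hypothetical:

```pig
-- Hypothetical two-column extract "playerID,RBI"; the real Batting.csv has more columns.
batting = LOAD 'batting_rbi.csv' USING PigStorage(',') AS (playerID:chararray, RBI:int);
-- Values that cannot be parsed as int (a header line, for example) load as null.
with_rbi = FILTER batting BY RBI IS NOT NULL;
by_player = GROUP with_rbi BY playerID;
-- SUM takes the bag of RBI values collected for each player.
totals = FOREACH by_player GENERATE group AS playerID, SUM(with_rbi.RBI) AS total;
DUMP totals;
```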

Calculate the sum/avg of an attribute in Apache Pig

How can I calculate the avg or sum of an attribute in Apache Pig (vertically, not horizontally)? Lots of examples are available for doing this horizontally but not vertically.
This is my code:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING
PigStorage(',') AS (Year:int, ArrDelay:chararray);
f123 = f1;
ff123 = FILTER f123 BY something;
grp = GROUP ff123 ALL;
cnt = FOREACH grp GENERATE COUNT(ff123);-- this counts the number of rows and works fine
DUMP cnt;
-- The below code is the problem
DESCRIBE grp;
cntsum = FOREACH grp GENERATE FLATTEN(ff123.ArrDelay);
DESCRIBE cntsum;
and the output:
(2008,30)
(2009,60)
(2)
grp: {group: chararray,ff123: {(Year: int,ArrDelay: chararray)}}
cntsum: {null::ArrDelay: chararray}
But this throws an error:
cntsum = FOREACH grp GENERATE SUM((int)FLATTEN(ff123.ArrDelay));
DESCRIBE cntsum;
I need to get 90 as output (30+60).
By the way, what does this schema in the output mean?
cntsum: {null::ArrDelay: chararray}
I am using Apache Pig version 0.16.0.2.6.5.0-292.
You should load ArrDelay as an int column:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:int);
ff123 = FILTER f1 BY something;
grp = GROUP ff123 ALL;
total = FOREACH grp GENERATE SUM(ff123.ArrDelay);
DUMP total;
If that is not an option, load ArrDelay as chararray and cast it to int before the GROUP ALL and SUM:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:chararray);
ff123 = FILTER f1 BY something;
f2 = FOREACH ff123 GENERATE Year, (int)ArrDelay as ArrDelay;
grp = GROUP f2 ALL;
total = FOREACH grp GENERATE SUM(f2.ArrDelay);
DUMP total;
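Since the question also asks about averages, here is a small sketch of the same GROUP ALL pattern computing SUM and AVG together. The file name flight_delays_sample.csv is hypothetical, and it assumes ArrDelay can be loaded as int. The aggregates take the bag projection f1.ArrDelay directly, so FLATTEN is not needed inside SUM.

```pig
-- Hypothetical file with "Year,ArrDelay" rows such as 2008,30 and 2009,60.
f1 = LOAD 'flight_delays_sample.csv' USING PigStorage(',') AS (Year:int, ArrDelay:int);
-- One group holding every row, so the aggregates run over the whole column.
grp = GROUP f1 ALL;
stats = FOREACH grp GENERATE SUM(f1.ArrDelay) AS total, AVG(f1.ArrDelay) AS mean;
DUMP stats;
```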

Apache pig about sort top n

Recently I tried to use Pig to sort some data. The following script orders the data by count (for example, I want to find the top 3):
in = load 'data.txt';
split = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group split by tmp;
result = foreach C generate group, COUNT(split) as cnt;
des = ORDER result BY cnt DESC;
fin = LIMIT des 3;
The output then looks like:
A,10
B,9
C,8
But if another entry also has a count of 8, it is not output. In detail, when I run DUMP des, the contents look like the following:
A,10
B,9
C,8
D,8
E,8
F,7
.
.
If I want to output the top 3, the result also needs to include D,8 and E,8, but LIMIT in Pig cannot do that. Does anyone have experience solving this in Pig Latin, or do I have to write a UDF to handle it?
LIMIT will not work in your case; you have to use the RANK and FILTER operators.
data.txt
A,A,A,A,A,A,A,A,A,A
B,B,B,B,B,B,B,B,B
C,C,C,C,C,C,C,C
D,D,D,D,D,D,D,D
E,E,E,E,E,E,E,E
F,F,F,F,F,F,F
PigScript:
in = load 'data.txt';
sp = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group sp by tmp;
result = foreach C generate group, COUNT(sp) as cnt;
myrank = RANK result BY cnt DESC DENSE;
top3 = FILTER myrank BY rank_result<=3;
finalOutput = FOREACH top3 GENERATE group,cnt;
DUMP finalOutput;
Output:
(A,10)
(B,9)
(C,8)
(D,8)
(E,8)
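The DENSE keyword controls how ranks continue after a tie, and the difference is easy to miss. A short sketch of both variants, assuming a hypothetical file counts.csv that already holds "word,cnt" rows like the ones above (A,10 through F,7); the aliases ranked_plain and ranked_dense are made up for illustration:

```pig
-- Hypothetical input: precomputed "word,cnt" rows, e.g. A,10 B,9 C,8 D,8 E,8 F,7.
counts = LOAD 'counts.csv' USING PigStorage(',') AS (word:chararray, cnt:long);
-- Without DENSE, tied rows share a rank and the next rank skips ahead (1,2,3,3,3,6).
ranked_plain = RANK counts BY cnt DESC;
-- With DENSE, ranks stay consecutive across ties (1,2,3,3,3,4).
ranked_dense = RANK counts BY cnt DESC DENSE;
-- RANK names the new column rank_<input alias>, here rank_counts.
top3 = FILTER ranked_dense BY rank_counts <= 3;
DUMP top3;
```

With the dense variant, a filter such as rank_counts <= 3 keeps the three highest distinct counts no matter how many rows are tied at each level.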

Pig split and join

I have a requirement to propagate field values from one row to another, given the type of record.
For example, my raw input is:
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated. Thanks.
Pig is not exactly SQL; it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using GROUP BY, a FILTER nested in the FOREACH, and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
    P = FILTER inpt BY type == 'p'; -- keep the records of type p for this id
    PL = LIMIT P 1; -- make sure there is just one
    GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- turn the bags produced by GROUP back into rows
};
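The script above emits the columns as id, type, browser, while the desired result lists them as id, browser, type. A sketch of one way to reorder them, assuming the same 'file1.txt' input; the aliases filled and ordered and the AS names on the flattened fields are additions for illustration:

```pig
inpt = LOAD 'file1.txt' USING PigStorage(',') AS (id:int, browser:chararray, type:chararray);
grp = GROUP inpt BY id;
filled = FOREACH grp {
    -- Pick the single type-p record that carries the browser for this id.
    P = FILTER inpt BY type == 'p';
    PL = LIMIT P 1;
    -- Name the flattened fields so the next step can refer to them directly.
    GENERATE FLATTEN(inpt.(id, type)) AS (id, type), FLATTEN(PL.browser) AS browser;
};
-- Reorder the columns to match the desired id,browser,type layout.
ordered = FOREACH filled GENERATE id, browser, type;
DUMP ordered;
```

An id with no type-p record produces an empty PL bag, and flattening an empty bag drops that id's rows from the output.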