How can I calculate the average or sum of an attribute in Apache Pig (vertically, not horizontally)? Lots of examples are available for doing this horizontally, but not vertically.
This is my code:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING
PigStorage(',') AS (Year:int, ArrDelay:chararray);
f123 = f1;
ff123 = FILTER f123 BY something;
grp = GROUP ff123 ALL;
cnt = FOREACH grp GENERATE COUNT(ff123);-- this counts the number of rows and works fine
DUMP cnt;
-- The below code is the problem
DESCRIBE grp;
cntsum = FOREACH grp GENERATE FLATTEN(ff123.ArrDelay);
DESCRIBE cntsum;
and the output:
(2008,30)
(2009,60)
(2)
grp: {group: chararray,ff123: {(Year: int,ArrDelay: chararray)}}
cntsum: {null::ArrDelay: chararray}
But this throws an error:
cntsum = FOREACH grp GENERATE SUM((int)FLATTEN(ff123.ArrDelay));
DESCRIBE cntsum;
I need to get 90 as output (30+60)
By the way, what does this schema in the output mean?
cntsum: {null::ArrDelay: chararray}
I am using Apache Pig version 0.16.0.2.6.5.0-292.
You should load ArrDelay as an int column:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:int);
ff123 = FILTER f1 BY something;
grp = GROUP ff123 ALL;
total = FOREACH grp GENERATE SUM(ff123.ArrDelay);
DUMP total;
If that is not an option, load ArrDelay as chararray and cast it to int before the GROUP ALL and SUM:
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',') AS (Year:int, ArrDelay:chararray);
ff123 = FILTER f1 BY something;
f2 = FOREACH ff123 GENERATE Year, (int)ArrDelay as ArrDelay;
grp = GROUP f2 ALL;
total = FOREACH grp GENERATE SUM(f2.ArrDelay);
DUMP total;
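Since the question also asks about the average, here is a minimal sketch of computing several column-wise aggregates in one pass. It assumes the path and schema from the question, leaves out the FILTER step, and the output field names (total_delay, avg_delay, n_rows) are illustrative.

```pig
-- Sketch only: path and schema come from the question; the FILTER step is omitted.
f1 = LOAD '/user/maria_dev/flightdelays/flight_delays1.csv' USING PigStorage(',')
     AS (Year:int, ArrDelay:int);

-- GROUP ALL yields a single row: (all, {every tuple of f1}).
grp = GROUP f1 ALL;

-- Each aggregate takes a bag; f1.ArrDelay projects the ArrDelay column out of that bag.
stats = FOREACH grp GENERATE
    SUM(f1.ArrDelay) AS total_delay,
    AVG(f1.ArrDelay) AS avg_delay,
    COUNT(f1)        AS n_rows;

DUMP stats;
```

For the two sample rows shown in the question (delays 30 and 60), this should report a total of 90 and an average of 45.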
Related
I am new to Pig and working on a problem where I need to find the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course. It told me I had to use an explicit cast. I have the filtering etc. figured out; I just can't figure out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
weights::id AS id,
weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
group AS id,
MAX(just_ids_weights.weight) AS wgt;
If you wanted the maximum weight in all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
MAX(just_ids_weights.weight) AS wgt;
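The "explicit cast" message in the question most likely comes from loading without a schema, which leaves every field as a bytearray. The following is a sketch, not the original poster's final script: it reuses the question's paths and column positions, adds casts, and then keeps the rows whose weight equals the overall maximum. The aliases max_wt and heaviest are illustrative.

```pig
-- Sketch only: paths and column positions ($0, $1, $9, $16) are taken from the question.
batters     = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',');
triphitters = FILTER batters BY (int)$1 == 2005 AND (int)$9 > 5;
tripids     = FOREACH triphitters GENERATE (chararray)$0 AS id;

names   = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv' USING PigStorage(',');
-- Cast here so that MAX later sees a bag of ints rather than bytearrays.
weights = FOREACH names GENERATE (chararray)$0 AS id, (int)$16 AS weight;

get_ids          = JOIN weights BY id, tripids BY id;
just_ids_weights = FOREACH get_ids GENERATE weights::id AS id, weights::weight AS weight;

-- One bag with everything, then the single maximum value.
gp_all = GROUP just_ids_weights ALL;
max_wt = FOREACH gp_all GENERATE MAX(just_ids_weights.weight) AS wgt;

-- max_wt has exactly one row, so max_wt.wgt can be used as a scalar.
heaviest = FILTER just_ids_weights BY weight == max_wt.wgt;
DUMP heaviest;
```

Filtering against the scalar keeps every player tied for the maximum, which an ORDER followed by LIMIT 1 would not.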
I get the following error while running this Pig script. Please help!
Thanks in advance.
"ERROR 1000: Error during parsing. Scalars can be only used with projections"
MOVIES = LOAD '/MOVIES' using PigStorage(',') as (mid:double, mn:chararray, yr:int, rt:float, dr:int);
Filter11 = filter MOVIES by $2 >= 1950;
Filter12 = filter Filter11 by $2 <= 1960;
Group1 = group Filter12 by yr;
Count1 = foreach Group1 generate group, COUNT(Filter12);
Sum1 = foreach Count1 generate SUM(Group1);
DUMP Sum1;
Combine the two filter conditions into one. For the last step, SUM needs a bag, so group the per-year counts with GROUP ALL and then sum the count column:
MOVIES = LOAD '/MOVIES' using PigStorage(',') as (mid:double, mn:chararray, yr:int, rt:float, dr:int);
Filter11 = filter MOVIES by ($2 >= 1950 and $2 <= 1960);
Group1 = group Filter11 by yr;
Count1 = foreach Group1 generate group as yr, COUNT(Filter11) as cnt;
-- Put all per-year counts into a single bag so SUM can aggregate them
All1 = group Count1 all;
Sum1 = foreach All1 generate SUM(Count1.cnt);
DUMP Sum1;
Recently I tried to use Pig to sort some data. The following is my script to order the data by count (for example, I want to find the top 3):
in = load 'data.txt';
split = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group split by tmp;
result = foreach C generate group, COUNT(split) as cnt;
des = ORDER result BY cnt DESC;
fin = LIMIT des 3;
The output then looks like this:
A,10
B,9
C,8
But if other entries also have a count of 8, they are not output. In detail, when I type DUMP des, the contents look like the following:
A,10
B,9
C,8
D,8
E,8
F,7
.
.
If I want to output the top 3, the result also needs to include D,8 and E,8, but LIMIT in Pig cannot do that. Does someone have experience dealing with this problem in Pig Latin, or must I write a UDF to handle it?
LIMIT will not work in your case; you have to use the RANK and FILTER operators.
data.txt
A,A,A,A,A,A,A,A,A,A
B,B,B,B,B,B,B,B,B
C,C,C,C,C,C,C,C
D,D,D,D,D,D,D,D
E,E,E,E,E,E,E,E
F,F,F,F,F,F,F
PigScript:
in = load 'data.txt';
sp = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group sp by tmp;
result = foreach C generate group, COUNT(sp) as cnt;
myrank = RANK result BY cnt DESC DENSE;
top3 = FILTER myrank BY rank_result<=3;
finalOutput = FOREACH top3 GENERATE group,cnt;
DUMP finalOutput;
Output:
(A,10)
(B,9)
(C,8)
(D,8)
(E,8)
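The DENSE keyword is what makes "top 3" mean "the three highest distinct counts". The sketch below contrasts the two forms of RANK on the same data.txt. The aliases lines, words, byword, ranked_plain and ranked_dense are illustrative choices, not names from the original script.

```pig
-- Sketch only: same data.txt as above; alias names are illustrative.
lines  = LOAD 'data.txt';
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS tmp;
byword = GROUP words BY tmp;
result = FOREACH byword GENERATE group, COUNT(words) AS cnt;

-- Without DENSE, tied rows share a rank and the next rank skips ahead:
-- counts 10,9,8,8,8,7 get ranks 1,2,3,3,3,6.
ranked_plain = RANK result BY cnt DESC;

-- With DENSE, tied rows share a rank and no ranks are skipped:
-- counts 10,9,8,8,8,7 get ranks 1,2,3,3,3,4.
ranked_dense = RANK result BY cnt DESC DENSE;

DUMP ranked_plain;
DUMP ranked_dense;
```

On this particular input both variants give the same rows for a filter of rank <= 3. They differ when the tie sits higher up: counts 10,10,9 rank as 1,1,3 without DENSE and 1,1,2 with it.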
I have a requirement to propagate field values from one row to the other rows of the same id, based on the record type.
For example, my raw input is
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
Pig is not exactly SQL; it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using GROUP BY, a FILTER nested in the FOREACH, and FLATTEN:
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
P = FILTER inpt BY type == 'p'; -- keep only the 'p' records for this id
PL = LIMIT P 1; -- make sure there is just one
GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};
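The first tuple in the question's result, (,,,1,p ), shows a trailing space after p, which suggests that row's type is 'p ' and never matched type == 'p'. The sketch below is an alternative that stays with a join, assuming that whitespace is the cause. TRIM is a Pig built-in; the aliases p_rows, p_browser and result are illustrative.

```pig
-- Sketch only: same file and schema as in the question; alias names are illustrative.
A = LOAD 'file1.txt' USING PigStorage(',')
    AS (id:int, browser:chararray, type:chararray);

-- Trim before comparing so that a value such as 'p ' still counts as type p.
p_rows    = FILTER A BY TRIM(type) == 'p';
p_browser = FOREACH p_rows GENERATE id, browser;

-- Inner join: every row of A picks up the browser of the 'p' row with the same id.
joined = JOIN A BY id, p_browser BY id;
result = FOREACH joined GENERATE
    A::id              AS id,
    p_browser::browser AS browser,
    TRIM(A::type)      AS type;

DUMP result;
```

Like the nested FOREACH version, this drops ids that have no 'p' row at all. If an id could have several 'p' rows, the join would duplicate its records, which is what the LIMIT 1 above guards against.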
I am using PIG to generate groups from tuples as follows:
a1, b1
a1, b2
a1, b3
...
->
a1, [b1, b2, b3]
...
This is easy and working. But my problem is to get the following: From the obtained groups, I would like to generate a set of all tuples in the group's bag:
a1, [b1, b2, b3]
->
b1,b2
b1,b3
b2,b3
This would be easy if I could nest "foreach" and firstly iterate over each group and then over its bag.
I suppose I am misunderstanding the concept and I will appreciate your explanation.
Thanks.
It looks like you need a Cartesian product between the bag and itself. To do this you need to use FLATTEN(bag) twice.
Code:
inpt = load '.../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as value_bag;
result = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2;
dump result;
Be aware that large bags will produce a lot of rows. To avoid it you could use TOP(...) before FLATTEN:
inpt = load '....group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
limited_bag = TOP(50, 0, values); -- all sorts of filtering could be done here
generate id, FLATTEN(limited_bag) as v1, FLATTEN(limited_bag) as v2;
};
dump result;
For your specific output you could use some filtering before FLATTEN:
inpt = load '..../group.txt' as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
l = filter values by val == 'b1' or val == 'b2';
generate id, FLATTEN(l) as v1, FLATTEN(values) as v2;
};
result = filter result by v1 != v2;
I hope it helps.
Cheers
Also relevant is this UnorderedPairs function from the DataFu UDF library. It generates pairs of all items in a bag (in your case your grouped bag)
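Without DataFu, one way to keep each unordered pair exactly once is to build the self-product with two FLATTENs and then keep only the rows where the first value sorts before the second. This is a sketch under assumptions: the file name group.txt and the chararray types are hypothetical stand-ins, since the original path is elided.

```pig
-- Sketch only: 'group.txt' is a hypothetical path; the schema types are assumed.
inpt   = LOAD 'group.txt' USING PigStorage(',') AS (id:chararray, val:chararray);
grp    = GROUP inpt BY id;
id_grp = FOREACH grp GENERATE group AS id, inpt.val AS value_bag;

-- Two FLATTENs of the same bag give the full Cartesian product per id.
prod = FOREACH id_grp GENERATE id, FLATTEN(value_bag) AS v1, FLATTEN(value_bag) AS v2;

-- v1 < v2 removes self-pairs such as (b1,b1) and mirrored pairs such as (b2,b1).
pairs = FILTER prod BY v1 < v2;
DUMP pairs;
```

The product is still materialized before the filter, so the earlier warning about large bags applies here too.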
You can use the GROUP ALL Pig statement to collect all of B into a single bag and then attach that bag to every row of A:
A = -- Some bag
B = -- Another bag
groupedB = group B ALL;
result = foreach A GENERATE
TOTUPLE(*), groupedB.$1;
-- Will generate
((a1), {(b1, b2, b3)})
((a2), {(b1, b2, b3)})
((a3), {(b1, b2, b3)})
...