Pig script sum within a bag - apache-pig

Sum up the number of doubles and triples for each birthCity/birthState combination. Output the top 5 birthCity/birthState combinations that produced the players who had the most doubles and triples.
Currently I have this
clean = FOREACH filtered_2 GENERATE id,city,state, dble + tripple AS combined;
dump clean;
My question is how to satisfy the requirement above. Clearly I have to group by (city,state), but how do I get the sum within a bag once I group?
counter = foreach clean {
sum1 = SUM(combined);
generate id,city,state,sum1;
};
I was thinking of something like this, but it's not working.

Group the relation clean by (city,state), then use SUM to get the total for each city/state group.
clean = FOREACH filtered_2 GENERATE id,city,state,(dble + tripple) AS combined;
clean_group = GROUP clean BY (city,state);
counter = FOREACH clean_group GENERATE FLATTEN(group) as (city,state),SUM(clean.combined) as sum1;
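The question also asks for the top 5 combinations. A minimal sketch of that last step, assuming the relation filtered_2 from the question already exists; the aliases ordered and top5 are made up for illustration:

```pig
-- Assumes filtered_2 exists as in the question, with fields id, city, state, dble, tripple.
clean = FOREACH filtered_2 GENERATE id, city, state, (dble + tripple) AS combined;
clean_group = GROUP clean BY (city, state);
counter = FOREACH clean_group GENERATE FLATTEN(group) AS (city, state), SUM(clean.combined) AS sum1;
-- Sort by the per-group total, largest first, and keep five rows.
ordered = ORDER counter BY sum1 DESC;
top5 = LIMIT ordered 5;
DUMP top5;
```

Note that LIMIT keeps exactly five rows, so if several combinations tie for fifth place, only some of them will appear.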

Related

PIG need to find max

I am new to Pig and working on a problem where I need to find the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course: Pig told me I had to use an explicit cast. I have the filtering figured out; I just can't work out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
weights::id AS id,
weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
group AS id,
MAX(just_ids_weights.weight) AS wgt;
If you want the maximum weight across all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
MAX(just_ids_weights.weight) AS wgt;
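The "explicit cast" error in the question most likely comes from loading without a schema, which leaves every field as a bytearray. The following is only a sketch: the paths and column positions are copied from the question, the casts assume those columns hold numbers, and the aliases by_weight and heaviest are made up. It also sorts by weight to return the player rather than only the maximum value:

```pig
-- Paths and column positions come from the question; the casts are an assumption
-- that the year, triples, and weight columns contain numeric values.
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',');
tripids = FOREACH batters GENERATE (chararray)$0 AS id, (int)$1 AS year, (int)$9 AS trips;
triphitters = FILTER tripids BY year == 2005 AND trips > 5;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv' USING PigStorage(',');
weights = FOREACH names GENERATE (chararray)$0 AS id, (int)$16 AS weight;
get_ids = JOIN weights BY id, triphitters BY id;
just_ids_weights = FOREACH get_ids GENERATE weights::id AS id, weights::weight AS weight;
-- Heaviest player: sort by weight, largest first, and keep the first row.
by_weight = ORDER just_ids_weights BY weight DESC;
heaviest = LIMIT by_weight 1;
DUMP heaviest;
```

If several players share the maximum weight, this returns only one of them.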

How to check COUNT of filtered elements in PIG

I have the following data set in which I need to perform some steps based on the Car's company name.
(23,Nissan,12.43)
(23,Nissan Car,16.43)
(23,Honda Car,13.23)
(23,Toyota Car,17.0)
(24,Honda,45.0)
(24,Toyota,12.43)
(24,Nissan Car,12.43)
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
DUMP G;
I am grouping cars by code and base company name, so that all the 'Nissan' and 'Nissan Car' records fall into one group, and similarly for the other companies.
/* Grouped data based on code and company's first name*/
((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
Now, I want to filter out the groups based on whether they contain a tuple corresponding to group's name. If yes, take that tuple from that group and ignore others and if no such tuple exists then take all the tuples for that group.
The Output should be:
((23,Nissan),{(23,Nissan,12.43)}) // Since this group contains a row with group's name i.e. Nissan
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}
Could anybody please help with this? After filtering by the group's name, how can I find the count of the filtered tuples and get the required data?
OK. Let's take the records below as your input.
23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43
For the above input, let's say the following is the intermediate output:
((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})
From that intermediate output, suppose you are looking for the output below, as per your requirement:
(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)
Below is the code:
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') AS (code:int, name:chararray, rating:double);
-- Extract the brand name (the company name without a trailing "Car").
nissan_each = FOREACH nissan_load GENERATE code, TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) AS brand_name, name, rating;
nissan_grp = GROUP nissan_each BY (code, brand_name);
nissan_final_each = FOREACH nissan_grp {
    -- A flags rows whose name equals the brand; C flags the remaining rows.
    A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 : 0) AS cnt;
    B = (int)SUM(A);
    C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ? 1 : 0) AS extra_cnt;
    D = SUM(C);
    -- Report the exact-match count if there is one, otherwise the count of the other rows.
    GENERATE FLATTEN(group) AS (code, brand_name), (SUM(A.cnt) != 0 ? B : D) AS final_cnt;
};
DUMP nissan_final_each;
Try this code with different inputs as well.
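The original question asked for the tuples themselves rather than a count. One possible approach, sketched here under the assumption of the same comma-separated input and the same regex, is to attach the per-group number of exact matches to every row and then filter. All aliases (cars, cars_brand, with_cnt, kept, result) are made up for illustration:

```pig
-- Assumes the same comma-separated input file and brand-name regex as the answer above.
cars = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',')
       AS (code:int, name:chararray, rating:double);
cars_brand = FOREACH cars GENERATE code,
             TRIM(REGEX_EXTRACT(name, '(?i)(^.+?\\b)\\s*(Car)*$', 1)) AS brand_name,
             TRIM(name) AS name, rating;
cars_grp = GROUP cars_brand BY (code, brand_name);
-- Attach to every row the number of rows in its group whose name equals the brand.
with_cnt = FOREACH cars_grp {
    exact = FILTER cars_brand BY name == brand_name;
    GENERATE FLATTEN(cars_brand) AS (code, brand_name, name, rating), COUNT(exact) AS exact_cnt;
};
-- Keep only exact matches where a group has any; otherwise keep every row of the group.
kept = FILTER with_cnt BY (exact_cnt == 0) OR (name == brand_name);
result = FOREACH kept GENERATE code, name, rating;
DUMP result;
```

This produces flat rows; if the grouped shape from the question is required, result can be grouped again by code and brand.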

Apache pig about sort top n

Recently I tried to use Pig to sort some data. The following is my script to order the data by count (for example, I want to find the top 3):
in = load 'data.txt';
split = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group split by tmp;
result = foreach C generate group, COUNT(split) as cnt;
des = ORDER result BY cnt DESC;
fin = LIMIT des 3;
The output then looks like this:
A,10
B,9
C,8
But if another value also has a count of 8, it is not output. In detail, when I run DUMP des, the contents look like the following:
A,10
B,9
C,8
D,8
E,8
F,7
.
.
If I want the top 3, the result also needs to include D,8 and E,8, but LIMIT in Pig can't do that. Does anyone have experience handling this in Pig Latin, or must I write a UDF?
LIMIT will not work in your case; you have to use the RANK and FILTER operators.
data.txt
A,A,A,A,A,A,A,A,A,A
B,B,B,B,B,B,B,B,B
C,C,C,C,C,C,C,C
D,D,D,D,D,D,D,D
E,E,E,E,E,E,E,E
F,F,F,F,F,F,F
PigScript:
in = load 'data.txt';
sp = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group sp by tmp;
result = foreach C generate group, COUNT(sp) as cnt;
myrank = RANK result BY cnt DESC DENSE;
top3 = FILTER myrank BY rank_result<=3;
finalOutput = FOREACH top3 GENERATE group,cnt;
DUMP finalOutput;
Output:
(A,10)
(B,9)
(C,8)
(D,8)
(E,8)
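The DENSE keyword matters when "top 3" means the three highest distinct counts. A small sketch that prints both rankings side by side for comparison; the aliases lines, grp, ranked_plain and ranked_dense are made up for illustration:

```pig
-- Same counting pipeline as above; only the RANK lines differ.
lines = LOAD 'data.txt';
sp = FOREACH lines GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS tmp;
grp = GROUP sp BY tmp;
result = FOREACH grp GENERATE group, COUNT(sp) AS cnt;
-- Without DENSE, tied rows share a rank and the ranks after a tie are skipped.
ranked_plain = RANK result BY cnt DESC;
-- With DENSE, tied rows share a rank and the next rank follows without a gap.
ranked_dense = RANK result BY cnt DESC DENSE;
DUMP ranked_plain;
DUMP ranked_dense;
```

In both cases the rank column is named rank_result. With ties near the top of the list, the plain ranking can skip past 3 before three distinct counts have been seen, so a filter on rank_result<=3 behaves as intended only with DENSE.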

Storing results of each column-operation in separate row in pig

I need to perform some numerical operations (using a UDF) on every column of my table, and for every column I get 2 values (mean and standard deviation). The final result comes out like (mean_1, sd_1, mean_2, sd_2, mean_3, sd_3...), where 1,2... are column indexes, but I need the output for every column in a separate row, like:
mean_1, sd_1 \\for col1
mean_2, sd_2 \\for col2
...
Here is the pig script I'm using:
data = LOAD 'input_file.csv' USING PigStorage(',') AS (C0,C1,C2);
grouped_data = GROUP data ALL;
res = FOREACH grouped_data GENERATE FLATTEN(data), AVG(data.$1) as mean, COUNT(data.$1) as count;
tmp = FOREACH res {
diff = (C1-mean)*(C1-mean);
GENERATE *,diff as diff;
};
grouped_diff = GROUP tmp all;
sq_tmp = FOREACH grouped_diff GENERATE flatten(tmp), SUM(tmp.diff) as sq_sum;
stat_tmp = FOREACH sq_tmp GENERATE mean as mean, sq_sum/count as variance, SQRT(sq_sum/count) as sd;
stats = LIMIT stat_tmp 1;
Could anybody please guide me on how to achieve this?
Thanks. I got the required output by creating tuples of mean and sd values for respective columns and then storing all such tuples in a bag. Then in the next step I flattened the bag.
tupled_stats = FOREACH raw_stats generate TOTUPLE(mean_0, var_0, sd_0) as T0, TOTUPLE(mean_1, var_1, sd_1) as T1, TOTUPLE(mean_2, var_2, sd_2) as T2;
bagged_stats = FOREACH tupled_stats generate TOBAG(T0, T1, T2) as B;
stats = foreach bagged_stats generate flatten(B);
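The snippet above starts from a relation raw_stats that is not shown. One way it could be produced with built-in functions, as a sketch only: the double column types are an assumption, the aliases sq, grouped and moments are made up, and the variance uses the population formula E[x^2] - (E[x])^2, which is simple but can lose precision on data with a large mean:

```pig
-- Assumes three numeric columns; the double types are not stated in the question.
data = LOAD 'input_file.csv' USING PigStorage(',') AS (C0:double, C1:double, C2:double);
-- Precompute squares so that a single GROUP ALL yields both E[x] and E[x^2].
sq = FOREACH data GENERATE C0, C1, C2, C0*C0 AS C0_sq, C1*C1 AS C1_sq, C2*C2 AS C2_sq;
grouped = GROUP sq ALL;
moments = FOREACH grouped GENERATE
    AVG(sq.C0) AS mean_0, AVG(sq.C0_sq) AS msq_0,
    AVG(sq.C1) AS mean_1, AVG(sq.C1_sq) AS msq_1,
    AVG(sq.C2) AS mean_2, AVG(sq.C2_sq) AS msq_2;
-- Population variance per column: E[x^2] - (E[x])^2, and its square root.
raw_stats = FOREACH moments GENERATE
    mean_0, (msq_0 - mean_0*mean_0) AS var_0, SQRT(msq_0 - mean_0*mean_0) AS sd_0,
    mean_1, (msq_1 - mean_1*mean_1) AS var_1, SQRT(msq_1 - mean_1*mean_1) AS sd_1,
    mean_2, (msq_2 - mean_2*mean_2) AS var_2, SQRT(msq_2 - mean_2*mean_2) AS sd_2;
```

The result is a single row with nine fields, which the TOTUPLE, TOBAG and FLATTEN steps above then turn into one row per column.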

Pig - Calculating percentage of total for a field

I am trying to calculate the percentage of the total for each value in a field.
For example, for data (name, ct)
(john, 1000)
(Dan, 2000)
(liz, 2000)
I want the output to be (name, % of ct to the total)
(john, .2)
(Dan, .4)
(liz, .4)
data = load 'fakedata.txt' as (name:chararray,sqr:chararray,ct:int);
A = foreach data generate name, ct;
A = FILTER A by ct is not null;
B = group A all;
C = foreach B generate SUM(A.ct) as tot;
D = foreach A generate name, ct/(double)C.tot;
dump D;
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: C in {name: bytearray,ct: int}
I am following exactly the example code in the section "Casting Relations to Scalars" of http://pig.apache.org/docs/r0.10.0/basic.html
If I run DUMP C, the output is correctly generated as 5000, so the problem is in D. Any help is greatly appreciated.
The code below works for me without any error. It is basically the same as what you have, so I am not sure why you are getting this error. Which version of Pig are you using?
data = load 'StackData' as (name:chararray, marks:int);
grp = GROUP data all;
allcount = foreach grp generate SUM(data.marks) as total;
perc = foreach data generate name, marks/(double)allcount.total;
dump perc;
In relation D, you are looping over relation A again, and it knows nothing about C.
I'd suggest calculating the SUM and then doing a JOIN so that each entry contains the sum. That way you'll be able to calculate the percentage of the total for each entry.
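Because the total is a single row with no key to join on, one way to attach it to every entry is CROSS. A minimal sketch, assuming the input file and schema from the question; the aliases nonnull, total, pairs and pct are made up for illustration:

```pig
-- File name and schema are taken from the question.
data = LOAD 'fakedata.txt' AS (name:chararray, sqr:chararray, ct:int);
nonnull = FILTER data BY ct IS NOT NULL;
grp = GROUP nonnull ALL;
-- total has exactly one row holding the grand total.
total = FOREACH grp GENERATE SUM(nonnull.ct) AS tot;
-- CROSS pairs every entry with that single total row.
pairs = CROSS nonnull, total;
pct = FOREACH pairs GENERATE nonnull::name AS name, nonnull::ct / (double) total::tot AS share;
DUMP pct;
```

Since total has only one row, the CROSS produces as many rows as nonnull has, so it does not multiply the data.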