Find 5 top popular based on sum in Pig Script - apache-pig

I'm trying to find the top 3 most popular locations with the greatest tripCount.
So I need to see the total of tripCount per location and return the greatest n...
My data is as follows:
LocationID tripCount tripDistance
101 40 4.6
203 29 1.3
56 25 9.3
101 17 4.5
66 5 1.1
13 5 0.5
203 10 1.2
558 8 0.5
56 10 5.5
So the result I'm expecting is:
101 57
203 39
56 35
So far my code is:
B = GROUP UNION_DATA BY DOLocationID;
C = FOREACH B {
DA = ORDER UNION_DATA BY passenger_count DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(group), FLATTEN(DB.LocationID), FLATTEN(DB.dropoff_datetime);
}
What am I missing and what do I need to do to get the expected result?

The code below should give you the desired results.
I broke the statement down into simple chunks for better understanding and readability. Your aliases and code seem incomplete, so I rewrote it from scratch.
Create the input file (columns: LocationID,tripCount,tripDistance):
cat > trip_data.txt
101,40,4.6
203,29,1.3
56,25,9.3
101,17,4.5
66,5,1.1
13,5,0.5
203,10,1.2
558,8,0.5
56,10,5.5
PIG Code:
A = load '/home/ec2-user/trip_data.txt' using PigStorage(',') as (LocationID:int, tripCount:int, tripDistance:double);
describe A;
B = GROUP A BY LocationID;
describe B;
dump B;
C = FOREACH B GENERATE group, SUM(A.tripCount);
describe C;
dump C;
D = ORDER C BY $1 DESC;
describe D;
dump D;
RESULT = LIMIT D 3;
describe RESULT;
dump RESULT;
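As a cross-check of the logic only, here is a minimal Python sketch of the same group, sum, order, limit pipeline on the sample data. The names ROWS and top_locations are made up for this illustration and are not part of Pig.

```python
from collections import defaultdict

# Sample rows from the question: (LocationID, tripCount, tripDistance)
ROWS = [
    (101, 40, 4.6), (203, 29, 1.3), (56, 25, 9.3),
    (101, 17, 4.5), (66, 5, 1.1), (13, 5, 0.5),
    (203, 10, 1.2), (558, 8, 0.5), (56, 10, 5.5),
]


def top_locations(rows, n):
    """Sum tripCount per LocationID and return the n largest totals."""
    totals = defaultdict(int)
    for location_id, trip_count, _distance in rows:
        totals[location_id] += trip_count  # GROUP BY + SUM
    # ORDER BY total DESC
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]  # LIMIT n


for location_id, total in top_locations(ROWS, 3):
    print(location_id, total)
```

The important point, in Pig as here, is that the ordering and the limit are applied to the per-location totals, not to the rows inside each group.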

How to: For each unique id, for each unique version, grab the best score and organize it into a table

Just wanted to preface this by saying while I do have a basic understanding, I am still fairly new to using Bigquery tables and sql statements in general.
I am trying to make a new view out of a query that grabs all of the best test scores for each version by each employee:
select emp_id,version,max(score) as score from `project.dataset.table` where type = 'assessment_test' group by version,emp_id order by emp_id
I'd like to take the results of that query and make a new table of employee IDs, with a column holding each version's best score for that row's emp_id. I know that I can manually make a table for each version by including "where version = a", "where version = b", etc., and then joining all of the tables at the end, but that doesn't seem like the most elegant solution, plus there are about 20 different versions in total.
Is there a way to programmatically create a column for each unique version or at the very least use my initial query as maybe a subquery and just reference it, something like this:
with a as (
select id,version,max(score) as score
from `project.dataset.table`
where type = 'assessment_test' and version is not null and score is not null and id is not null
group by version,id
order by id),
version_a as (select score from a where version = 'version_a')
version_b as (select score from a where version = 'version_b')
version_c as (select score from a where version = 'version_c')
select
a.id as id,
version_a.score as version_a,
version_b.score as version_b,
version_c.score as version_c
from
a,
version_a,
version_b,
version_c
Example Picture: left table is example data, right table is expected output
Example Data:
id version score
1 a 88
1 b 93
1 c 92
2 a 89
2 b 99
2 c 78
3 a 95
3 b 83
3 c 89
4 a 90
4 b 90
4 c 86
5 a 82
5 b 78
5 c 98
1 a 79
1 b 97
1 c 77
2 a 100
2 b 96
2 c 85
3 a 83
3 b 87
3 c 96
4 a 84
4 b 80
4 c 77
5 a 95
5 b 77
Expected Output:
id  a score  b score  c score
1   88       97       92
2   100      99       85
3   95       87       96
4   90       90       86
5   95       78       98
Thanks in advance and feel free to ask any clarifying questions
Use the approach below:
select * from your_table
pivot (max(score) score for version in ('a', 'b', 'c'))
Applied to the sample data in your question, this produces the expected output.
If the versions are not known in advance, use the query below:
execute immediate (select '''
select * from your_table
pivot (max(score) score for version in (''' || string_agg(distinct "'" || version || "'") || "))"
from your_table
)
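To make the pivot logic concrete, here is a rough Python sketch of what the two queries compute: the best score per (id, version), with the version list either given up front or collected from the data, as string_agg(distinct ...) does. ROWS, distinct_versions and pivot_max are hypothetical names for this illustration, and ROWS is only the subset of the sample data for ids 1 and 2.

```python
# Subset of the question's sample data: (id, version, score)
ROWS = [
    (1, "a", 88), (1, "b", 93), (1, "c", 92),
    (2, "a", 89), (2, "b", 99), (2, "c", 78),
    (1, "a", 79), (1, "b", 97), (1, "c", 77),
    (2, "a", 100), (2, "b", 96), (2, "c", 85),
]


def distinct_versions(rows):
    """Collect the version values present in the data (the dynamic case)."""
    return sorted({version for _id, version, _score in rows})


def pivot_max(rows, versions):
    """Return {id: {version: best score}} with one entry per listed version."""
    best = {}
    for emp_id, version, score in rows:
        if version not in versions:
            continue  # versions outside the IN list are ignored
        per_id = best.setdefault(emp_id, dict.fromkeys(versions))
        if per_id[version] is None or score > per_id[version]:
            per_id[version] = score
    return best


print(pivot_max(ROWS, distinct_versions(ROWS)))
```

A version with no rows for a given id stays None, which corresponds to a NULL cell in the pivoted table.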

SAS sum observations not in a group, by group

I have a data set :
data have;
input group $ value;
datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;
The first variable is a group identifier, the second a value.
For each group, I want a new variable "sum" with the sum of all values in the column, except for the group the observation is in.
My issue is having to do that on nearly 30 million observations, so efficiency matters.
I found that using a data step was more efficient than using procs.
The final dataset should look like:
data want;
input group $ value $ sum;
datalines;
A 4 11
A 3 11
A 2 11
A 1 11
B 1 20
C 1 20
D 2 18
D 1 18
E 1 20
F 1 20
G 2 18
G 1 18
H 1 20
;
run;
Any idea how to perform this please?
Edit: I don't know if this matters, but the example I gave is a simplified version of my issue. In the real case, I have 2 other group variables, so taking the sum of the whole column and subtracting the sum in the group is not a viable solution.
The requirement
"sum of all values in the column, except for the group the observation is in"
indicates that two passes of the data must occur:
1.) Compute allsum and each group's group_sum. A hash can store each group's sum, computed via a specified suminc: variable and .ref() method invocation. A variable can accumulate allsum.
2.) Compute allsum - group_sum for each row of a group. The group_sum is retrieved from the hash and subtracted from allsum.
Example:
data want;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
allsum + value;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve groups sum. Do you hear the Dragnet theme too?;
sum = allsum - sum; * subtract from allsum;
output;
end;
stop;
run;
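The same two-pass idea can be sketched outside SAS. The Python below is only an illustration of the logic, with a dict playing the role of the hash object; HAVE and sum_outside_group are names invented for this sketch.

```python
from collections import defaultdict

# The "have" data set from the question: (group, value)
HAVE = [
    ("A", 4), ("A", 3), ("A", 2), ("A", 1), ("B", 1), ("C", 1), ("D", 2),
    ("D", 1), ("E", 1), ("F", 1), ("G", 2), ("G", 1), ("H", 1),
]


def sum_outside_group(rows):
    """Return (group, value, sum of values in all other groups) per row."""
    group_sum = defaultdict(int)
    all_sum = 0
    # Pass 1: accumulate the grand total and each group's total.
    for group, value in rows:
        group_sum[group] += value
        all_sum += value
    # Pass 2: subtract the row's own group total from the grand total.
    return [(group, value, all_sum - group_sum[group]) for group, value in rows]


for row in sum_outside_group(HAVE):
    print(*row)
```

With several grouping variables, the dict key would simply become a tuple of those variables.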
What is wrong with a straightforward approach? You need to make two passes no matter what you do.
Like this. I included extra variables so you can see how the values are derived.
proc sql ;
create table want as
select a.*,b.grand,sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results:
Obs group value grand total sum
1 A 3 21 10 11
2 A 1 21 10 11
3 A 2 21 10 11
4 A 4 21 10 11
5 B 1 21 1 20
6 C 1 21 1 20
7 D 2 21 3 18
8 D 1 21 3 18
9 E 1 21 1 20
10 F 1 21 1 20
11 G 1 21 3 18
12 G 2 21 3 18
13 H 1 21 1 20
Note it does not matter what you have as your GROUP BY clause.
Do you really need to output all of the original observations? Why not just output the summary table?
proc sql ;
create table want as
select a.group, sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group, b.grand
;
quit;
Results
Obs group total sum
1 A 10 11
2 B 1 20
3 C 1 20
4 D 3 18
5 E 1 20
6 F 1 20
7 G 3 18
8 H 1 20
I would break this out into two different segments:
1.) You could start by using PROC SQL to get the sums by the group
2.) Then use some IF/THEN statements to reassign the values by group

Pig Java UDF: Generating a bag from a tuple

I am hoping someone can help me create a java UDF that will take this input spread across three text files:
Montreal, 5 3 10 9 8
Toronto, 7 2 2 3 4 4
Edmonton, 3 3 1 1 7
Montreal, 2 2 9
and return the following output bags:
{(Montreal,5),(Montreal,3),(Montreal,10),(Montreal,9),(Montreal,8),(Montreal,2),(Montreal,2),(Montreal,9)}
{(Toronto,7),(Toronto,2),(Toronto,2),(Toronto,3),(Toronto,4),(Toronto,4)}
I am fairly new to java and any help you can provide is greatly appreciated. Thank you.
If you're using Pig 0.14 or later, which supports STRSPLITTOBAG, then:
A = load 'test.input' using PigStorage(',') as (place:chararray, numbers:chararray);
B = FOREACH A GENERATE place, FLATTEN(STRSPLITTOBAG(numbers)) as number;
C = FOREACH B GENERATE place, (chararray) number;
D = GROUP C by place;
E = FOREACH D generate C; -- dropping group field
dump E;
Output
({(Toronto,2),(Toronto,2),(Toronto,7),(Toronto,4),(Toronto,4),(Toronto,3)})
({(Edmonton,7),(Edmonton,1),(Edmonton,1),(Edmonton,3),(Edmonton,3)})
({(Montreal,9),(Montreal,2),(Montreal,2),(Montreal,8),(Montreal,9),(Montreal,10),(Montreal,3),(Montreal,5)})
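For readers who want to check the expected grouping without a Pig installation, here is a small Python sketch of the same split-and-group steps. LINES and bags_by_place are illustrative names, and a Python list keeps input order, whereas the order of tuples inside a Pig bag is not guaranteed.

```python
from collections import defaultdict

# Input lines from the question, as if the three files were concatenated.
LINES = [
    "Montreal, 5 3 10 9 8",
    "Toronto, 7 2 2 3 4 4",
    "Edmonton, 3 3 1 1 7",
    "Montreal, 2 2 9",
]


def bags_by_place(lines):
    """Return {place: [(place, number), ...]} across all input lines."""
    bags = defaultdict(list)
    for line in lines:
        place, numbers = line.split(",", 1)  # PigStorage(',') equivalent
        place = place.strip()
        for number in numbers.split():  # split the rest on whitespace
            bags[place].append((place, int(number)))
    return dict(bags)


for place, bag in bags_by_place(LINES).items():
    print(place, bag)
```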

How do I add a key to a row based on its "group"?

I have a data set like this:
a 10
a 13
a 14
b 15
b 44
c 64
c 32
d 12
I want to write a PROC SQL statement or DATA step that will yield this:
a 10 1
a 13 1
a 14 1
b 15 2
b 44 2
c 64 3
c 32 3
d 12 4
How do I do this?
DATA TEST;
INPUT id $ value ;
DATALINES;
a 10
a 13
a 14
b 15
b 44
c 64
c 32
d 12
;
RUN;
Sort your data if needed:
proc sort data=test;
by id;
run;
Then:
data want;
set test;
retain key;
by id;
if _n_ = 1 then key = 0;
if first.id then key = key + 1;
run;
The retain statement keeps the value of key across iterations.
Then, whenever a new id appears, we add 1 to key.
Alternatively, as stated by Keith, you could use this simplified data step to do the job:
data want;
set test;
by id;
if first.id then key + 1;
run;
I'll leave both versions here for reference because I think the first one is easier to understand, and the last one from Keith's comments is a lot cleaner.
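The first.id logic can also be read as "increment a counter whenever the id differs from the previous row". A minimal Python sketch of that idea, assuming the rows are already sorted by id as in the SAS version (ROWS and add_group_key are names made up for this illustration):

```python
# Sample data from the question, already sorted by id: (id, value)
ROWS = [
    ("a", 10), ("a", 13), ("a", 14), ("b", 15),
    ("b", 44), ("c", 64), ("c", 32), ("d", 12),
]


def add_group_key(rows):
    """Append a running group number that increases when the id changes."""
    keyed = []
    key = 0
    previous = object()  # sentinel that never equals a real id
    for group_id, value in rows:
        if group_id != previous:  # same role as first.id in the data step
            key += 1
            previous = group_id
        keyed.append((group_id, value, key))
    return keyed


for row in add_group_key(ROWS):
    print(*row)
```

As in SAS, unsorted input would give a new key every time the id changes, even for an id that has appeared before.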

How can I find the Maximum Year in a given dataset using PIG?

Suppose I have the following dataset:
Year Temp
1974 48
1974 48
1991 56
1983 89
1993 91
1938 41
1938 56
1941 93
1983 87
I want my final answer to be 93 (which belongs to the year 1941). I can find the maximum temperature for each year (say, 1941: 93), but I am unable to find only the overall maximum. Any suggestions are appreciated.
Thanks,
You can solve this problem in two ways.
Option1: Using (Group ALL + MAX)
A = LOAD 'input' USING PigStorage() AS (Year:int,Temp:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MAX(A.Temp);
DUMP C;
Output:
(93)
Option2: Using (ORDER and LIMIT)
A = LOAD 'input' USING PigStorage() AS (Year:int,Temp:int);
B = ORDER A BY Temp DESC;
C = LIMIT B 1;
D = FOREACH C GENERATE Temp;
DUMP D;
Output:
(93)
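For comparison, the two options correspond to the following plain-Python sketch on the sample data. ROWS, max_temp and hottest are illustrative names only. The second form also keeps the year that the maximum belongs to.

```python
# Sample data from the question: (Year, Temp)
ROWS = [
    (1974, 48), (1974, 48), (1991, 56), (1983, 89), (1993, 91),
    (1938, 41), (1938, 56), (1941, 93), (1983, 87),
]

# Option 1: aggregate over all rows at once (GROUP ALL + MAX).
max_temp = max(temp for _year, temp in ROWS)

# Option 2: sort by Temp descending and keep the first row (ORDER + LIMIT 1).
hottest = sorted(ROWS, key=lambda row: row[1], reverse=True)[0]

print(max_temp)
print(hottest)
```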