Pig: Summing Fields - apache-pig

I have some census data in which each line has a number denoting the county and fields for the number of people in a certain age range (e.g., 5 and under, 5 to 17, etc.). After some initial processing in which I removed the unneeded columns, I grouped the filtered data as follows (filtered_data is of the schema {county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int}):
grouped_data = GROUP filtered_data BY county;
So grouped_data is of the schema
{group: chararray,filtered_data: {(county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int)}}
Now I would like to sum up all of the pop fields for each county, yielding the total population of each county. I'm pretty sure the command to do this will be of the form
pop_sums = FOREACH grouped_data GENERATE group, SUM(something about the pop fields);
but I've been unable to get this to work. Thanks in advance!
I don't know if this is helpful, but the following is a representative entry of grouped_data:
(147,{(147,385,1005,283,468,649,738,933,977),(147,229,655,178,288,394,499,579,481)})
Note that the 147 entries are actually county codes, not populations. They are therefore of type chararray.

Can you try the below approach?
Sample input:
147,1,1,1,1,1,1,1,1
147,2,2,2,2,2,2,2,2
145,5,5,5,5,5,5,5,5
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (county:chararray,pop1:int,pop2:int,pop3:int,pop4:int,pop5:int,pop6:int,pop7:int,pop8:int);
B = GROUP A BY county;
C = FOREACH B GENERATE group,(SUM(A.pop1)+SUM(A.pop2)+SUM(A.pop3)+SUM(A.pop4)+SUM(A.pop5)+SUM(A.pop6)+SUM(A.pop7)+SUM(A.pop8)) AS totalPopulation;
DUMP C;
Output:
(145,40)
(147,24)
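An alternative, sketched here under the same sample input and column layout, is to add the eight columns within each row first and then apply a single SUM per county. The alias names row_totals, by_county and county_totals are made up for illustration. One difference to be aware of: if any pop field in a row is null, that row's total is null and SUM skips the whole row, whereas the per-column version above skips only the null cell. For the sample input the totals should again be 40 for county 145 and 24 for county 147.

```pig
-- Same sample input as above: county code followed by eight population columns.
A = LOAD 'input' USING PigStorage(',')
    AS (county:chararray, pop1:int, pop2:int, pop3:int, pop4:int,
        pop5:int, pop6:int, pop7:int, pop8:int);

-- Add the columns within each row first.
row_totals = FOREACH A GENERATE county,
    (pop1 + pop2 + pop3 + pop4 + pop5 + pop6 + pop7 + pop8) AS row_total;

-- Then take one SUM over the bag of row totals for each county.
by_county = GROUP row_totals BY county;
county_totals = FOREACH by_county GENERATE group AS county,
    SUM(row_totals.row_total) AS totalPopulation;

DUMP county_totals;
```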

Related

How to combine multiple rows in a relation into a tuple to perform calculations in PIG Latin

I have the following code:
pitcher_res = UNION pitcher_total_salary,pitcher_total_appearances;
dump pitcher_res;
The output is:
(8965000.0)
(22.0)
However, I want to calculate 8965000.0/22.0, so I need something like:
res = FOREACH some_relation GENERATE $0/$1;
Therefore I need to have some_relation = (8965000.0,22.0). How can I perform such a conversion?
You can do a CROSS.
Computes the cross product of two or more relations.
https://pig.apache.org/docs/r0.11.1/basic.html#cross
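Applied to the two relations from the question, a minimal sketch could look like the following. It assumes that pitcher_total_salary and pitcher_total_appearances each hold exactly one tuple with one numeric field, as the dumped output suggests; the alias combined is made up for illustration.

```pig
-- pitcher_total_salary and pitcher_total_appearances are the asker's
-- single-row, single-column relations (assumed to be numeric).
combined = CROSS pitcher_total_salary, pitcher_total_appearances;

-- After the CROSS, $0 is the salary and $1 is the appearance count.
res = FOREACH combined GENERATE $0 / $1;
DUMP res;
```

Because each input has a single row, the cross product also has a single row, so the cost of CROSS is not a concern here. With larger inputs CROSS grows multiplicatively and a keyed JOIN is usually the better choice.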
Ideally you would have a unique identifier for each entry in your source relations. Then you can perform a join based on this identifier which results in the kind of relation you want to have.
Salary relation
salaries: pitcher_id, pitcher_total_salary
Total appearances relation
appearances: pitcher_id, pitcher_total_appearances
Join
pitcher_relation = join salaries by pitcher_id, appearances by pitcher_id;
Calculation
res = FOREACH pitcher_relation GENERATE pitcher_total_salary/pitcher_total_appearances;
The following Pig Latin script should do what you need.
Load the salary file:
salary = load '/home/abhishek/Work/pigInput/pitcher_total_salary' as (salary:long);
Load the appearances file:
appearances = load '/home/abhishek/Work/pigInput/pitcher_total_appearances' as (appearances:long);
Now use the CROSS command:
C = cross salary, appearances;
Then compute the final output:
res = foreach C generate salary/appearances;
Output:
dump res;
(407500)
Hope this helps

How to check COUNT of filtered elements in PIG

I have the following data set in which I need to perform some steps based on the Car's company name.
(23,Nissan,12.43)
(23,Nissan Car,16.43)
(23,Honda Car,13.23)
(23,Toyota Car,17.0)
(24,Honda,45.0)
(24,Toyota,12.43)
(24,Nissan Car,12.43)
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
DUMP G;
I am grouping cars by code and base company name, so that all the 'Nissan' and 'Nissan Car' records fall into one group, and similarly for the others.
/* Grouped data based on code and company's first name*/
((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
Now, I want to filter out the groups based on whether they contain a tuple corresponding to group's name. If yes, take that tuple from that group and ignore others and if no such tuple exists then take all the tuples for that group.
The Output should be:
((23,Nissan),{(23,Nissan,12.43)}) // Since this group contains a row with group's name i.e. Nissan
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}
Could anybody please help me with this? After filtering by the group's name, how can I find the count of the filtered tuples and get the required data?
OK. Let's assume the records below are your input.
23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43
For the above input, let's say the following is the intermediate output:
((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})
From the above intermediate output, suppose you are looking for the output below, as per your requirement.
(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)
Below is the code.
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') AS (code:int, name:chararray, rating:double);
nissan_each = FOREACH nissan_load GENERATE code, TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) AS brand_name, name, rating;
nissan_grp = GROUP nissan_each BY (code, brand_name);
nissan_final_each = FOREACH nissan_grp {
    A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 : 0) AS cnt;
    B = (int)SUM(A);
    C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ? 1 : 0) AS extra_cnt;
    D = SUM(C);
    GENERATE FLATTEN(group) AS (code, brand_name), (SUM(A.cnt) != 0 ? B : D) AS final_cnt;
};
DUMP nissan_final_each;
Try this code with different inputs as well.

Count the grouped records in pig query

Below is my test data.
John,q1,Correct
Jack,q1,wrong
John,q2,Correct
Jack,q2,wrong
John,q3,wrong
Jack,q3,Correct
John,q4,wrong
Jack,q4,wrong
John,q5,wrong
Jack,q5,wrong
I want to find something like below:
John wrong 3
John Correct 2
Jack wrong 4
Jack Correct 1
My Code:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
Now the output looks like below:
((Jack,wrong),{(Jack,q5,wrong),(Jack,q4,wrong),(Jack,q2,wrong),(Jack,q1,wrong)})
((Jack,Correct),{(Jack,q3,Correct)})
((John,wrong),{(John,q5,wrong),(John,q4,wrong),(John,q3,wrong)})
((John,Correct),{(John,q2,Correct),(John,q1,Correct)})
How should I count the grouped records?
The COUNT function will give you the number of elements in a bag, which is exactly what you want. After grouping by name and result, you end up with one bag per combination, and its size is the number of times that combination appeared.
Therefore, you only have to add one line:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
dump C;
(Jack,wrong,4)
(Jack,Correct,1)
(John,wrong,3)
(John,Correct,2)
The FLATTEN(group) is needed because grouping produces a tuple containing the fields you grouped by. Judging by your desired output, you don't want those fields nested inside a tuple; without FLATTEN the output would look like ((Jack,wrong),4).
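To see the difference side by side, here is a small sketch that generates both forms from the same grouped relation. The aliases nested and flat, and the field name cnt, are made up for illustration.

```pig
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',')
    AS (name:chararray, number:chararray, result:chararray);
B = GROUP data BY (name, result);

-- Without FLATTEN the group key stays a nested tuple, e.g. ((Jack,wrong),4)
nested = FOREACH B GENERATE group, COUNT(data) AS cnt;

-- With FLATTEN the key fields become top-level columns, e.g. (Jack,wrong,4)
flat = FOREACH B GENERATE FLATTEN(group) AS (name, result), COUNT(data) AS cnt;

DUMP nested;
DUMP flat;
```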

Update 2 Columns in PIG

I want to update two columns when a particular condition is satisfied.
For example:
We will first load data
A = load 'students.txt' as (name:chararray, age:int, gpa:float);
Now,
B = foreach A generate name, (age==18?1:age) as age, gpa;
Here, whenever my condition for age is satisfied, I also want to update one more column, say is_adult, and set its value to true. This column is created dynamically (as you can see, there is no is_adult column in the original schema).
Your help will be highly appreciated.
A = LOAD 'students.txt' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE
name,
(age==18?1:age) AS age,
(age>=18?'true':'false') AS adult,
gpa ;
The adult column is set to true or false based on the value of age. This is a pretty standard way of doing this. The new relation produced by the FOREACH can have more or fewer columns than the original alias.

Sending relation to UDF functions

Can I send a relation to a Pig UDF as input? A relation can have multiple tuples in it. How do we read each tuple one by one in a Pig UDF?
Below is my sample input file.
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All I need is the details of the highest-paid employee in each company. How do I use a UDF for that?
I need something like this:
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can someone help me with this?
This script will give you the results you want:
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A on the basis of company. So B is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then, for each tuple in B, we generate another tuple, result, which is the top 1 record from the bag A inside B, ranked by the value of column number 2, i.e. amt. The columns are numbered from 0.
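The same result can also be sketched without TOP, using a nested ORDER and LIMIT inside the FOREACH. This is a different technique from the script above, shown only for comparison; the aliases sorted, top1 and topByOrder are made up for illustration. With the sample data it should yield the Karthic row for CTS and the Raja row for TCS.

```pig
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',')
    AS (name:chararray, bank:chararray, amt:long, company:chararray);
B = GROUP A BY company;

topByOrder = FOREACH B {
    -- Sort each company's bag by amount, highest first.
    sorted = ORDER A BY amt DESC;
    -- Keep only the first (highest-paid) record.
    top1 = LIMIT sorted 1;
    GENERATE FLATTEN(top1);
};

DUMP topByOrder;
```

Referring to the column by name (amt) rather than by position can make the script easier to read if the schema changes later.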
Note
First, your data has extra spaces after the company name. Please remove the extra spaces or use the following data:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
You don't need to write a UDF to do this; you can simply use Pig's built-in TOP function: http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work (not tested):
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE group, FLATTEN(TOP(1,2,myinput));