Sending relation to UDF functions - apache-pig

Can I Send a relation to Pig UDF function as input? A relation can have multiple tuples in it. How do we read each tuple one by one in Pig UDF function?
Ok.Below is my Sample input file.
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All i need is details about highest paid employee in each company. How do i use UDF for that ?
I need something like this
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can SomeOne Help me on this.

This script will give you the results you want :
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A on the basis of company.So A is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then we say foreach tuple in B , generate another tuple result which is equal to the top 1 record from the relation A found in B on the basis of value of column number 2 i.e. amt. The columns are numbered from 0.
Note
First your data has extra spaces after company name. Please remove the extra spaces or use the following data :
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
mit,SBI,70000,CTS

You don't need to write an UDF to do this, you can simply do it with the top function from pig : http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work ( not tested) :
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE company, FLATTEN(TOP(1,2,grouped));

Related

Deriving fields data from grouped data in pig

I am new to PIG. Could someone please help me. Below is the code
a = load 'stage.temp' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = limit a 10;
c = group b by $0;
dump c;
(74409607,{(74409607,a,2),(74409607,b,1)})
(74409607,{(74409607,c,4),(74409607,d,5)})
(74409735,{(74409735,NA,159),(74409735,,158)})
How could we generate this from above operator c?
(74409607,{(2,a),(1,b),(4,c),(5,d)})
(74409735,{(159,NA),(158,)})
Swap the second and third columns while enumerating each group.
d = foreach c generate group,a.$2,a.$1;
Note: When you group by $0 and dump c you should be getting
(74409607,{(74409607,a,2),(74409607,b,1),(74409607,c,4),(74409607,d,5)})
(74409735,{(74409735,NA,159),(74409735,,158)})
Are you sure that the output you posted is what you obtained from that dump statement ?
From what I see this should be the output.
(74409607,{(74409607,d,5),(74409607,c,4),(74409607,b,1),(74409607,a,2)})
(74409735,{(74409735,,158),(74409735,NA,159)})
Coming to your question, group by operator in pig will give you the output as a bag having the grouped column (in Bold) and all the rows(italicized) that are having the same value .
You can how ever fetch the required fields from the bag using the relation c .

How to combine multiple rows in a relation into a tuple to perform calculations in PIG Latin

I have the following code:
pitcher_res = UNION pitcher_total_salary,pitcher_total_appearances;
dump pitcher_res;
The output is:
(8965000.0)
(22.0)
However, I want to calculate 8965000.0/22.0, so I need something like:
res = FOREACH some_relation GENERATE $0/$1;
Therefore I need to have some_relation = (8965000.0,22.0). How can I perform such a conversion?
You can do a CROSS.
Computes the cross product of two or more relations.
https://pig.apache.org/docs/r0.11.1/basic.html#cross
Ideally you would have a unique identifier for each entry in your source relations. Then you can perform a join based on this identifier which results in the kind of relation you want to have.
Salary relation
salaries: pitcher_id, pitcher_total_salary
Total appearances relation
appearances: pitcher_id, pitcher_total_appearances
Join
pitcher_relation = join salaries by pitcher_id, appearances by pitcher_id;
Calculation
res = FOREACH pitcher_relation GENERATE pitcher_total_salary/pitcher_total_apperances;
The below pig latin scripts will surely come to your rescue:
load the salary file
salary = load '/home/abhishek/Work/pigInput/pitcher_total_salary' as (salary:long);
load the appearances file
appearances = load '/home/abhishek/Work/pigInput/pitcher_total_appearances' as (appearances:long);
Now, use the CROSS command
C = cross salary, appearances
Then, the final output
res = foreach C generate salary/appearances;
Output
dump res
407500
Hope this helps

How to check COUNT of filtered elements in PIG

I have the following data set in which I need to perform some steps based on the Car's company name.
(23,Nissan,12.43)
(23,Nissan Car,16.43)
(23,Honda Car,13.23)
(23,Toyota Car,17.0)
(24,Honda,45.0)
(24,Toyota,12.43)
(24,Nissan Car,12.43)
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
DUMP G;
I am grouping cars based on code and their base company name like All the 'Nissan' and 'Nissan Car' records should come in 1 group and similar for others.
/* Grouped data based on code and company's first name*/
((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
Now, I want to filter out the groups based on whether they contain a tuple corresponding to group's name. If yes, take that tuple from that group and ignore others and if no such tuple exists then take all the tuples for that group.
The Output should be:
((23,Nissan),{(23,Nissan,12.43)}) // Since this group contains a row with group's name i.e. Nissan
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}
Could anybody please help how can I do this? After filtering by group's name? How can I find the count of the filtered tuples and get the required data.
Ok. Lets Consider the below records are your input.
23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43
For the above Input , let say the below is intermediate output
((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})
Just Consider, from the above intermediate output, you are looking for below output as per your requirement .
(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)
Below is the code..
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') as(code:int,name:chararray,rating:double);
nissan_each = FOREACH nissan_load GENERATE code,TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) as brand_name,name,rating;
nissan_grp = GROUP nissan_each by (code,brand_name);
nissan_final_each =FOREACH nissan_grp {
A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 :0) as cnt;
B = (int)SUM(A);
C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ?1: 0) as extra_cnt;
D = SUM(C);
generate flatten(group) as(code,brand_name), (SUM(A.cnt) != 0 ? B : D) as final_cnt;
};
dump nissan_final_each;
Try this code with different inputs as well..

Count the grouped records in pig query

Below is my test data.
John,q1,Correct
Jack,q1,wrong
John,q2,Correct
Jack,q2,wrong
John,q3,wrong
Jack,q3,Correct
John,q4,wrong
Jack,q4,wrong
John,q5,wrong
Jack,q5,wrong
I want to find something like below:
John wrong 4
John correct 1
Jack wrong 3
Jack correct 2
My Code:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
Now the out put looks like below:
((John,wrong),{(John,q5,wrong),(John,q4,wrong),(John,q2,wrong),(John,q1,wrong)})
((John,Correct),{(John,q3,Correct)})
((Jack,wrong),{(Jack,q5,wrong),(Jack,q4,wrong),(Jack,q3,wrong)})
((Jack,Correct),{(Jack,q2,Correct),(Jack,q1,Correct)})
How should I calculate count the grouped records.
The COUNT function will give you the number of elements in a bag, which is exactly what you want. After grouping by user and result, you end up with a bag with the number of times each combination appeared.
Therefore, you only have to add one line:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
dump D;
(Jack,wrong,4)
(Jack,Correct,1)
(John,wrong,3)
(John,Correct,2)
The FLATTEN(group) is because after grouping, a tuple containing the elements you grouped by is generated, and by the looks of what you want as output you don't want it inside a tuple, as the output would be like ((Jack,wrong),4).

PIG filter out rows with improper number of columns

I have simple data loaded in a:
dump a
ahoeh,1,e32
hello,2,10
ho,3
I need to filter out all rows with number of columns/fields different than 3. How to do it?
In other words result should be:
dump results
ahoeh,1,e32
hello,2,10
I know there should be a FILTER built-in function. However I cannot figure out what condition (number of columns =3) should be defined.
Thanks!
Can you try this?
input
ahoeh,1,e32
hello,2,10
ho,3
3,te,0
aa,3,b
y,h,3
3,3,3
3,3,3,1,2,3,3,,,,,,4,44,6
PigScript1:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,','));
C = FOREACH B GENERATE COUNT(TOBAG(*)),$0..;
D = FILTER C BY $0==3;
E = FOREACH D GENERATE $1..;
DUMP E;
PigScript2:
A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
C = FILTER B BY (int)$0==3;
D = FOREACH C GENERATE $1..;
DUMP D;
Output:
(ahoeh,1,e32)
(hello,2,10)
(3,te,0)
(aa,3,b)
(y,h,3)
(3,3,3)
(It seems that I don't have enough karma to comment; that's why this is posted as a new answer.)
The accepted answer doesn't quite behave as expected if null/empty string is a valid field value; you need to use COUNT_STAR instead of COUNT to count empty/null fields in your schema.
See: https://pig.apache.org/docs/r0.9.1/func.html#count-star
For example, given the following input data:
1,2,3
1,,3
and this Pig script:
a = load 'input' USING PigStorage(',');
counted = foreach a generate COUNT_STAR(TOBAG(*)), $0..;
filtered = filter counted by $0 != 3;
result = foreach filtered generate $1..;
The filtered alias will contain both rows. The difference is that COUNT({(1),(),(3)}) returns 2 while COUNT_STAR({(1),(),(3)}) returns 3.
I see two ways to do this:
First, you can rephrase the filter I think, as it boils down to: Give me all lines that do not contain an NULL value. For lots of columns, writing this filter statement is rather tedious.
Second, you could convert your columns into a bag per line, using TOBAG (http://pig.apache.org/docs/r0.12.1/func.html#tobag) and then write a UDF that processes the input bag to check for null tuples in this bag and return true or false and use this in the filter statement.
Either way, some tediousness is required I think.