How to check COUNT of filtered elements in PIG - apache-pig

I have the following data set in which I need to perform some steps based on the Car's company name.
(23,Nissan,12.43)
(23,Nissan Car,16.43)
(23,Honda Car,13.23)
(23,Toyota Car,17.0)
(24,Honda,45.0)
(24,Toyota,12.43)
(24,Nissan Car,12.43)
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
DUMP G;
I am grouping cars based on code and their base company name like All the 'Nissan' and 'Nissan Car' records should come in 1 group and similar for others.
/* Grouped data based on code and company's first name*/
((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
Now, I want to filter out the groups based on whether they contain a tuple corresponding to group's name. If yes, take that tuple from that group and ignore others and if no such tuple exists then take all the tuples for that group.
The Output should be:
((23,Nissan),{(23,Nissan,12.43)}) // Since this group contains a row with group's name i.e. Nissan
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}
Could anybody please help how can I do this? After filtering by group's name? How can I find the count of the filtered tuples and get the required data.

Ok. Lets Consider the below records are your input.
23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43
For the above Input , let say the below is intermediate output
((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})
Just Consider, from the above intermediate output, you are looking for below output as per your requirement .
(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)
Below is the code..
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') as(code:int,name:chararray,rating:double);
nissan_each = FOREACH nissan_load GENERATE code,TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) as brand_name,name,rating;
nissan_grp = GROUP nissan_each by (code,brand_name);
nissan_final_each =FOREACH nissan_grp {
A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 :0) as cnt;
B = (int)SUM(A);
C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ?1: 0) as extra_cnt;
D = SUM(C);
generate flatten(group) as(code,brand_name), (SUM(A.cnt) != 0 ? B : D) as final_cnt;
};
dump nissan_final_each;
Try this code with different inputs as well..

Related

Take MIN EFF_DT and MAX_CANC_dt from data in PIG

Schema :
TYP|ID|RECORD|SEX|EFF_DT|CANC_DT
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Suppose i have multiple records like this. I only want to display records that have minimum eff_dt and maximum cancel date.
I only want to display just This 1 record
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Thank you
Get min eff_dt and max canc_dt and use it to filter the relation.Assuming you have a relation A
B = GROUP A ALL;
X = FOREACH B GENERATE MIN(A.EFF_DT);
Y = FOREACH B GENERATE MAX(A.CANC_DT);
C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0));
D = DISTINCT C;
DUMP D;
Let's say you have this data (sample here):
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31
Perform these steps:
-- 1. Read data, if you have not
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
-- 2. Group data by the attribute you like to, in this case it is TYP
grouped = group A by typ;
-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;
--
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)
If you need to, change datetime to charrary.
Note: there are different ways of doing this, what I am showing, except the load step, it produces the desired result in 2 steps: GROUP and FOREACH.

Removing duplicate pair in Pig

I am having the below sample
Update:
OBR|1|METABOLIC PANEL
OBX|1|Glucose
OBX|2|BUN
OBX|3|CREATININE
OBR|2|RFLX TO VERIFICATION
OBX|1|EGFR
OBX|2|SODIUM
OBR|3|AMBIGUOUS DEFAULT
OBX|1|POTASSIUM
In this sample consider all the OBR as one Test and every OBR is followed by OBX which is the result of the OBR. Every OBR is followd by id (such as 1,2 and 3 in this case) all the OBX of a particular OBR starts with 1. SO what i was thing is if i found one OBR i'll create one unique id and put it in all the OBX followed by the OBR untill i reach the OBR with id 2 again i'll do the same.
Below is my expected output.
Expected Result :
OBR|1|METABOLIC PANEL|OBR_filename_1
OBX|1|Glucose|OBR_filename_1
OBX|2|BUN|OBR_filename_1
OBX|3|CREATININE|OBR_filename_1
OBR|2|RFLX TO VERIFICATION|OBR_filename_2
OBX|1|EGFR|OBR_filename_2
OBX|2|SODIUM|OBR_filename_2
OBR|3|AMBIGUOUS DEFAULT|OBR_filename_3
OBX|1|POTASSIUM|OBR_filename_3
Use DISTINCT.Assuming you have relation A with duplicate records.The below statement will remove duplicate records and store the unique records in relation B
B = DISTINCT A;
I tried this, it looks like a HL file. you can use Stitch, Over & Lead and come up with something like this. Probably there might be a better solution than this from a performance standpoint. But this should work I guess, please let me know how it goes.
DEFINE Over org.apache.pig.piggybank.evaluation.Over('long');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE lead org.apache.pig.piggybank.evaluation.Lead;
in = LOAD 'hl_file' using PigStorage('|') as (id:chararray, num:int, reason:chararray);
temp = rank in;
ranked = foreach temp generate $0 as row_no, $1 as id:chararray, $2 as orig_id:int, $3 as reason:chararray;
OBR_data = FILTER ranked by id == 'OBR';
next_row_num_OBR = FOREACH (group OBR_data by id) {
sorted = ORDER OBR_data by row_no;
stitched = Stitch(sorted, Over(sorted.row_no, 'lead',0,1,1,(long)9999));
generate flatten(group) as (id:chararray),
flatten(stitched.(row_no, orig_id, reason, result)) as (row_no:long, orig_id:int, reason:chararray, next_row_no:long);
}
OBX_data = FILTER ranked by id == 'OBX';
Crossed = CROSS next_row_num_OBR, OBX_data;
result = FILTER Crossed BY (OBX_data::row_no > next_row_num_OBR::row_no and OBX_data::row_no < next_row_num_OBR::next_row_no);
This should produce something like this:
(OBR,5,2,RFLX TO VERIFICATION,8,7,OBX,2,SODIUM)
(OBR,1,1,METABOLIC PANEL,5,2,OBX,1,Glucose)
(OBR,5,2,RFLX TO VERIFICATION,8,6,OBX,1,EGFR)
(OBR,8,3,AMBIGUOUS DEFAULT,9999,9,OBX,1,POTASSIUM)
(OBR,1,1,METABOLIC PANEL,5,3,OBX,2,BUN)
(OBR,1,1,METABOLIC PANEL,5,4,OBX,3,CREATININE)
Instead of file name or a constant, it just adds the OBR record to its corresponding OBXs.

Count the grouped records in pig query

Below is my test data.
John,q1,Correct
Jack,q1,wrong
John,q2,Correct
Jack,q2,wrong
John,q3,wrong
Jack,q3,Correct
John,q4,wrong
Jack,q4,wrong
John,q5,wrong
Jack,q5,wrong
I want to find something like below:
John wrong 4
John correct 1
Jack wrong 3
Jack correct 2
My Code:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
Now the out put looks like below:
((John,wrong),{(John,q5,wrong),(John,q4,wrong),(John,q2,wrong),(John,q1,wrong)})
((John,Correct),{(John,q3,Correct)})
((Jack,wrong),{(Jack,q5,wrong),(Jack,q4,wrong),(Jack,q3,wrong)})
((Jack,Correct),{(Jack,q2,Correct),(Jack,q1,Correct)})
How should I calculate count the grouped records.
The COUNT function will give you the number of elements in a bag, which is exactly what you want. After grouping by user and result, you end up with a bag with the number of times each combination appeared.
Therefore, you only have to add one line:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
dump D;
(Jack,wrong,4)
(Jack,Correct,1)
(John,wrong,3)
(John,Correct,2)
The FLATTEN(group) is because after grouping, a tuple containing the elements you grouped by is generated, and by the looks of what you want as output you don't want it inside a tuple, as the output would be like ((Jack,wrong),4).

PIG filter out rows with improper number of columns

I have simple data loaded in a:
dump a
ahoeh,1,e32
hello,2,10
ho,3
I need to filter out all rows with number of columns/fields different than 3. How to do it?
In other words result should be:
dump results
ahoeh,1,e32
hello,2,10
I know there should be a FILTER built-in function. However I cannot figure out what condition (number of columns =3) should be defined.
Thanks!
Can you try this?
input
ahoeh,1,e32
hello,2,10
ho,3
3,te,0
aa,3,b
y,h,3
3,3,3
3,3,3,1,2,3,3,,,,,,4,44,6
PigScript1:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,','));
C = FOREACH B GENERATE COUNT(TOBAG(*)),$0..;
D = FILTER C BY $0==3;
E = FOREACH D GENERATE $1..;
DUMP E;
PigScript2:
A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
C = FILTER B BY (int)$0==3;
D = FOREACH C GENERATE $1..;
DUMP D;
Output:
(ahoeh,1,e32)
(hello,2,10)
(3,te,0)
(aa,3,b)
(y,h,3)
(3,3,3)
(It seems that I don't have enough karma to comment; that's why this is posted as a new answer.)
The accepted answer doesn't quite behave as expected if null/empty string is a valid field value; you need to use COUNT_STAR instead of COUNT to count empty/null fields in your schema.
See: https://pig.apache.org/docs/r0.9.1/func.html#count-star
For example, given the following input data:
1,2,3
1,,3
and this Pig script:
a = load 'input' USING PigStorage(',');
counted = foreach a generate COUNT_STAR(TOBAG(*)), $0..;
filtered = filter counted by $0 != 3;
result = foreach filtered generate $1..;
The filtered alias will contain both rows. The difference is that COUNT({(1),(),(3)}) returns 2 while COUNT_STAR({(1),(),(3)}) returns 3.
I see two ways to do this:
First, you can rephrase the filter I think, as it boils down to: Give me all lines that do not contain an NULL value. For lots of columns, writing this filter statement is rather tedious.
Second, you could convert your columns into a bag per line, using TOBAG (http://pig.apache.org/docs/r0.12.1/func.html#tobag) and then write a UDF that processes the input bag to check for null tuples in this bag and return true or false and use this in the filter statement.
Either way, some tediousness is required I think.

Sending relation to UDF functions

Can I Send a relation to Pig UDF function as input? A relation can have multiple tuples in it. How do we read each tuple one by one in Pig UDF function?
Ok.Below is my Sample input file.
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All i need is details about highest paid employee in each company. How do i use UDF for that ?
I need something like this
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can SomeOne Help me on this.
This script will give you the results you want :
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A on the basis of company.So A is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then we say foreach tuple in B , generate another tuple result which is equal to the top 1 record from the relation A found in B on the basis of value of column number 2 i.e. amt. The columns are numbered from 0.
Note
First your data has extra spaces after company name. Please remove the extra spaces or use the following data :
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
mit,SBI,70000,CTS
You don't need to write an UDF to do this, you can simply do it with the top function from pig : http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work ( not tested) :
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE company, FLATTEN(TOP(1,2,grouped));