Get first and last tuple from a Bag using Apache Pig - apache-pig

I am new to Pig Latin and am trying the example below using only Pig built-in functions.
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
B = GROUP A BY name;
DUMP B;
(John,{(John,sm,3.8),(John,sp,4.0),(John,wt,3.7),(John,fl ,3.9)})
(Mary,{(Mary,sm,4.0),(Mary,sp,4.0),(Mary,wt,3.9),(Mary,fl,3.8)})
I need to retrieve the first element => (John,sm,3.8) and the last element => (John,fl ,3.9) from each bag.
I need help solving this without using a UDF.

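You can use the solution below, although it is a little lengthy. RANK gives every input row a sequential row number, which is then used to pick the first and last row within each group.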
-- Load the comma-separated input.
names = LOAD '/user/user/inputfiles/names.txt' USING PigStorage(',') AS (name:chararray, term:chararray, gpa:float);
-- RANK prepends a sequential row number to every tuple.
names_rank = RANK names;
names_each = FOREACH names_rank GENERATE $0 AS row_id, name, term, gpa;
names_grp = GROUP names_each BY name;
-- First row per name: the smallest row_id.
names_first_each = FOREACH names_grp
{
    order_asc = ORDER names_each BY row_id ASC;
    first_rec = LIMIT order_asc 1;
    GENERATE FLATTEN(first_rec) AS (row_id, name, term, gpa);
};
-- Last row per name: the largest row_id.
names_last_each = FOREACH names_grp
{
    order_desc = ORDER names_each BY row_id DESC;
    last_rec = LIMIT order_desc 1;
    GENERATE FLATTEN(last_rec) AS (row_id, name, term, gpa);
};
-- Combine both results and drop the helper row_id column.
names_unioned = UNION names_first_each, names_last_each;
names_extract = FOREACH names_unioned GENERATE name, term, gpa;
names_ordered = ORDER names_extract BY name;
DUMP names_ordered;
Output:
(John,fl,3.9)
(John,sm,3.8)
(Mary,fl,3.8)
(Mary,sm,4.0)
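If one row per name is more convenient, the same idea can probably be folded into a single nested FOREACH. The following is an untested sketch that reuses the path and field names from the answer above; the output column names (first_term, last_gpa and so on) are made up for illustration.

```pig
names = LOAD '/user/user/inputfiles/names.txt' USING PigStorage(',') AS (name:chararray, term:chararray, gpa:float);
-- RANK prepends a sequential row number, used here as the ordering key.
names_rank = RANK names;
names_each = FOREACH names_rank GENERATE $0 AS row_id, name, term, gpa;
names_grp = GROUP names_each BY name;
-- Pick the first and last row of each group in the same pass.
names_first_last = FOREACH names_grp
{
    order_asc = ORDER names_each BY row_id ASC;
    order_desc = ORDER names_each BY row_id DESC;
    first_rec = LIMIT order_asc 1;
    last_rec = LIMIT order_desc 1;
    -- Hypothetical column names; each bag holds exactly one tuple.
    GENERATE FLATTEN(first_rec) AS (first_id, first_name, first_term, first_gpa),
             FLATTEN(last_rec) AS (last_id, last_name, last_term, last_gpa);
};
DUMP names_first_last;
```

This avoids the UNION and the final ORDER, at the cost of returning the first and last tuple side by side rather than as separate rows.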

Related

How to join and filter 2 relations by columns in Pig?

I'm new to Pig scripting and am trying to modify an existing Pig script to extract some data from log files.
For example, I have two log files, one with this schema:
message Class {
    message Student {
        optional int32 uid = 1;
        optional string name = 2;
    }
    optional int32 cid = 1;
    repeated Student students = 2;
}
After loading, I think a bag (say, bag1) is created (correct me if I'm wrong):
bag1:
{
(uid1, {(cid11, name11), (cid12, name12), (cid13, name13), ...}),
(uid2, {(cid21, name21), (cid22, name22), (cid23, name23), ...}),
...
}
The other log file is simpler; the resulting bag (bag2) looks like this:
bag2:
{
(name11),
(name13),
(name22),
...
}
What I want is to get every row from bag1 that contains any of the names in bag2, like this:
result bag:
{
(uid1, (name11, name13)),
(uid2, (name22)),
}
I think I need to do some kind of join/filter on these two bags, but I don't know how.
I tried a snippet like the one below, but it is not even a valid script.
res = FOREACH bag1 {
    names = FOREACH students GENERATE name;
    xnames = JOIN names by name, bag2 by name;
    GENERATE cid, xnames;
};
FILTER res BY not IsEmpty(xnames);
Could anyone please help me with the script?
You can't use JOIN inside a nested FOREACH. Instead, flatten the inner bag and then join the result with the second relation:
bag1_flat = FOREACH bag1 GENERATE $0 AS uid, FLATTEN($1);
bag1_flat = FOREACH bag1_flat GENERATE uid, $2 AS name;
An inner join then drops the rows whose name has no match in bag2:
bag12 = JOIN bag1_flat by name, bag2 by $0;
bag12 = FOREACH bag12 GENERATE bag1_flat::uid AS uid, bag1_flat::name AS name;
Finally, group by uid. You will get bags rather than tuples, because the number of matching names varies from one uid to the next:
bag12_group = GROUP bag12 BY uid;
res = FOREACH bag12_group GENERATE group AS uid, bag12.name AS names;

Identifying columns through PiG

I have a data set like the one below:
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
What separator should I use in this case to split each row into the three columns below?
First column value is => column,1A
Second column value is => column2A
Third column value is => column3A
Here is the code I tried:
a = LOAD '/home/hduser/pig_ex' USING PigStorage(',') AS (col1,col2,col3,col4);
b = FOREACH a GENERATE REGEX_EXTRACT(col1, '^\\"(.*)', 1) AS (modfirstcol),REGEX_EXTRACT(col2, '^(.*)\\"', 1) AS (modsecondcol),col3,col4;
c = foreach b generate CONCAT($0, CONCAT(', ', $1)), $2 , $3;
dump c;
I was able to resolve it using the steps below:
Input:
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
Pig script:
A = load '/home/hduser/pig_ex' AS line;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\,',4)) AS (firstcol:chararray,secondcol:chararray,thirdcol:chararray,forthcol:chararray);
C = FOREACH B GENERATE REGEX_EXTRACT(firstcol, '^\\"(.*)', 1) AS (modfirstcol),REGEX_EXTRACT(secondcol, '^(.*)\\"', 1) AS (modsecondcol),thirdcol,forthcol;
D = FOREACH C GENERATE CONCAT(modfirstcol,',',modsecondcol),thirdcol,forthcol;
DUMP D;
Output:
(column,1A,column2A,column3A)
(column,1B,column2B,column3B)
(column,1C,column2C,column3C)
(column,1D,column2D,column3D)
Please let me know if there is a better way.
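One alternative that may be worth trying, assuming the piggybank jar is available: its CSVExcelStorage loader is designed to handle double-quoted fields, so the embedded comma should stay inside the first column without any regex work. This is an untested sketch; the jar path is a placeholder, and the column names are the ones used in the question.

```pig
-- Assumption: adjust this placeholder path to wherever piggybank.jar lives.
REGISTER /path/to/piggybank.jar;
-- CSVExcelStorage treats "column,1A" as a single quoted field.
A = LOAD '/home/hduser/pig_ex'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (col1:chararray, col2:chararray, col3:chararray);
DUMP A;
```

With this loader the data arrives as three columns, so the STRSPLIT, REGEX_EXTRACT and CONCAT steps would no longer be needed.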

PIG: FLATTEN error

I have a relation X with the structure X: {group: chararray,inboundCount: {(name: chararray,inb: long)},outboundCount: {(name: chararray,out: long)}} as follows:
(IAD,{},{(IAD,25)})
(LAX,{},{(LAX,2)})
(ORD,{(ORD,27)},{})
(PDX,{},{(PDX,3)})
(SFO,{(SFO,3)},{})
I want output with the structure final: {airport: chararray,inbound: long,outbound: long} and the following content:
(IAD,,25)
(LAX,,2)
(ORD,27,)
(PDX,,3)
(SFO,3,)
I've tried the following code, and it gives the output structure that I want, but nothing gets printed. Is it because of the empty bags?
final = foreach X generate group as airport,FLATTEN(inboundCount.inb) as inbound,FLATTEN(outboundCount.out) as outbound;
Please help me.
EDIT
I got this relation X by executing the following commands.
A= load '/user/hduser/airline.csv' using PigStorage(',') as (year:int,month:int,dayofmonth:int,dayofweek:int,dep:int,CRS:int,Arr:int,CRSArr:int,UniqueCarrier:chararray,FlightNum:int,TailNum:chararray,ActualElapsedTime:int,CRSElapsed:int,AirTime:int,ArrDelay:int,DepDelay:int,Origin:chararray,Dest:chararray,Distance:int,TaxiIn:int,TaxiOut:int,Cancelled:int,CancelCode:chararray,Diverted:int,CarrierDelay:int,WeatherDelay:int,NASDelay:int,SecurityDelay:int,LateAircraft:int);
B= foreach A generate year,month,UniqueCarrier,FlightNum,TailNum,Origin,Dest;
inbound = group B by Dest;
inboundCount = foreach inbound generate group as name,COUNT(B.FlightNum) as inb;
outbound = group B by Origin;
outboundCount = foreach outbound generate group as name,COUNT(B.FlightNum) as out;
X = COGROUP inboundCount BY name, outboundCount BY name;
Sample input record:
2008,1,31,4,1757,1155,2400,1758,UA,114,N845UA,243,243,217,362,362,LAX,ORD,1745,11,15,0,,0,0,0,362,0,0
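You are almost there. Apply SUM instead of FLATTEN: flattening an empty bag produces no rows, so every record here is dropped, whereas SUM over an empty bag returns null and keeps the record.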
A= load '/user/hduser/airline.csv' using PigStorage(',') as (year:int,month:int,dayofmonth:int,dayofweek:int,dep:int,CRS:int,Arr:int,CRSArr:int,UniqueCarrier:chararray,FlightNum:int,TailNum:chararray,ActualElapsedTime:int,CRSElapsed:int,AirTime:int,ArrDelay:int,DepDelay:int,Origin:chararray,Dest:chararray,Distance:int,TaxiIn:int,TaxiOut:int,Cancelled:int,CancelCode:chararray,Diverted:int,CarrierDelay:int,WeatherDelay:int,NASDelay:int,SecurityDelay:int,LateAircraft:int);
B= foreach A generate year,month,UniqueCarrier,FlightNum,TailNum,Origin,Dest;
inbound = group B by Dest;
inboundCount = foreach inbound generate group as name,COUNT(B.FlightNum) as inb;
outbound = group B by Origin;
outboundCount = foreach outbound generate group as name,COUNT(B.FlightNum) as out;
X = COGROUP inboundCount BY name, outboundCount BY name;
final_data = FOREACH X GENERATE group as airport, SUM(inboundCount.inb) as inb, SUM(outboundCount.out) as out;
dump final_data;
The dump of final_data will give you the expected result.
(IAD,,25)
(LAX,,2)
(ORD,27,)
(PDX,,3)
(SFO,3,)
If you prefer, you can replace the null counts with 0:
final_null_check = FOREACH final_data GENERATE airport,(inb is null ? 0 :inb) as inb_cnt, (out is null ? 0 : out) as out_cnt;
After the null check, dumping the final_null_check relation gives output like this:
(IAD,0,25)
(LAX,0,2)
(ORD,27,0)
(PDX,0,3)
(SFO,3,0)
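A different route to the same shape, offered only as an untested sketch, is a full outer join of the two count relations instead of COGROUP plus SUM. It assumes Origin and Dest are the 17th and 18th comma-separated fields (positions $16 and $17), as in the sample record, and it counts rows with COUNT(B) rather than COUNT(B.FlightNum), which differs only when those fields are null.

```pig
-- Load without a schema and keep only the two airport columns.
A = LOAD '/user/hduser/airline.csv' USING PigStorage(',');
B = FOREACH A GENERATE (chararray)$16 AS Origin, (chararray)$17 AS Dest;
inbound = GROUP B BY Dest;
inboundCount = FOREACH inbound GENERATE group AS name, COUNT(B) AS inb;
outbound = GROUP B BY Origin;
outboundCount = FOREACH outbound GENERATE group AS name, COUNT(B) AS out;
-- A full outer join keeps airports that appear on only one side.
J = JOIN inboundCount BY name FULL OUTER, outboundCount BY name;
final = FOREACH J GENERATE
    (inboundCount::name IS NULL ? outboundCount::name : inboundCount::name) AS airport,
    inboundCount::inb AS inbound,
    outboundCount::out AS outbound;
DUMP final;
```

The missing side of each joined row comes back as null, so the same null-to-0 replacement can be applied afterwards if needed.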

Regex to extract first part of string in Apache Pig

I need to extract the postcode district from the input data below:
AB55 4
DD7 6LL
DD5 2HI
My code:
A = load 'data' as postcode:chararray;
B = foreach A {
    code_district = REGEX_EXTRACT(postcode,'<SOME EXP>',1);
    generate code_district;
};
dump B;
The output should look like this:
AB55
DD7
DD5
What should the regular expression be to extract the first part of the string?
Try one of the regular expressions below. Both capture the leading run of word characters, which ends at the first space:
Option1:
A = LOAD 'input' as postcode:chararray;
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode,'(\\w+).*',1);
DUMP code_district;
Option2:
A = LOAD 'input' as postcode:chararray;
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode,'([a-zA-Z0-9]+).*',1);
DUMP code_district;
Output:
(AB55)
(DD7)
(DD5)
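A slightly stricter variant, again only a sketch: anchor the pattern at the start of the string and capture everything up to the first whitespace character. It assumes the district is always separated from the rest of the postcode by whitespace, as in the sample data.

```pig
A = LOAD 'input' AS (postcode:chararray);
-- ^ anchors at the start; \\S+ captures everything before the first whitespace.
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode, '^(\\S+).*', 1);
DUMP code_district;
```

The trailing .* lets the pattern cover the whole string, so it should behave the same whether the function does a partial or a full match.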

Piggybank running total: Sum Over()

I am using the following Pig script to calculate a running total (Pig local mode):
Register /home/ec2-user/pig*/bin/piggybank-0.12.0.jar ;
define Sum org.apache.pig.piggybank.evaluation.int.Sum();
define Over org.apache.pig.piggybank.evaluation.Over();
define Stitch org.apache.pig.piggybank.evaluation.Stitch();
A = load '/home/ec2-user/staff_data.csv' using PigStorage(',') as (id:int, name:chararray, salary:int, department:chararray);
B = group A by department;
C = foreach B {
    C1 = order A by salary;
    generate flatten(Stitch(C1, Over(C1.department, 'Sum(C1.salary)')));
};
However, I am getting the following error:
Unknown aggregate Sum(C1.salary)
Does anyone have any ideas?
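Edit:
I figured out the answer myself. Over takes the column to aggregate as its first argument and the name of one of its own aggregates, such as 'sum(int)', as the second, so the separate Sum definition is not needed: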
Register /home/ec2-user/pig*/bin/piggybank-0.12.0.jar ;
define Over org.apache.pig.piggybank.evaluation.Over();
define Stitch org.apache.pig.piggybank.evaluation.Stitch();
A = load '/home/ec2-user/staff_data.csv' using PigStorage(',') as (id:int, name:chararray, salary:int, department:chararray);
B = group A by department;
C = foreach B {
    C1 = order A by salary;
    generate flatten(Stitch(C1, Over(C1.salary, 'sum(int)')));
};