Piggybank running total: Sum Over() - apache-pig

I am using the following Pig script to calculate a running total (Pig local mode):
Register /home/ec2-user/pig*/bin/piggybank-0.12.0.jar ;
define Sum org.apache.pig.piggybank.evaluation.int.Sum();
define Over org.apache.pig.piggybank.evaluation.Over();
define Stitch org.apache.pig.piggybank.evaluation.Stitch();
A = load '/home/ec2-user/staff_data.csv' using PigStorage(',') as (id:int, name:chararray, salary:int, department:chararray);
B = group A by department;
C = foreach B {
C1 = order A by salary;
generate flatten(Stitch(C1, Over(C1.department, 'Sum(C1.salary)')));
};
However, I am getting the following error:
Unknown aggregate Sum(C1.salary)
Does anyone have any ideas?
Edit:
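I figured out the answer myself. Over takes the bag column to aggregate as its first argument and the name of one of its own aggregates as a string, such as 'sum(int)', as its second, so the separate Sum define is not needed. Here it is: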
Register /home/ec2-user/pig*/bin/piggybank-0.12.0.jar ;
define Over org.apache.pig.piggybank.evaluation.Over();
define Stitch org.apache.pig.piggybank.evaluation.Stitch();
A = load '/home/ec2-user/staff_data.csv' using PigStorage(',') as (id:int, name:chararray, salary:int, department:chararray);
B = group A by department;
C = foreach B {
C1 = order A by salary;
generate flatten(Stitch(C1, Over(C1.salary, 'sum(int)')));
};
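To make the intended result concrete, here is a minimal plain-Python sketch of a per-department running total. It is not Pig and does not call Piggybank; the sample rows are hypothetical, and it only models the output one would hope to get from the script above. If the Pig totals do not accumulate row by row, the window arguments described in the Over Javadoc are the first thing to check.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical sample rows: (id, name, salary, department)
staff = [
    (1, "Ann", 3000, "IT"),
    (2, "Bob", 2000, "IT"),
    (3, "Cid", 4000, "HR"),
    (4, "Dee", 1000, "HR"),
    (5, "Eve", 2500, "IT"),
]

def running_totals(rows):
    """Sort by (department, salary) and append a per-department running salary total."""
    out = []
    by_dept = sorted(rows, key=itemgetter(3, 2))
    for _, group in groupby(by_dept, key=itemgetter(3)):
        group = list(group)
        totals = accumulate(r[2] for r in group)
        out.extend(r + (t,) for r, t in zip(group, totals))
    return out

result = running_totals(staff)
for row in result:
    print(row)
```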

Related

PIG - STORING TEMPORARY VALUES

Data schema: sdesc:chararray, samt:int, syear:chararray, stype:chararray
Data:
Wrench 259000 2000 store
Wrench 135000 2000 online
Wrench 175000 2001 online
Wrench 180000 2001 store
Script
ysales = LOAD 'salesdata.txt' USING PigStorage() AS (sdesc:chararray, samt:int, syear:chararray, stype:chararray);
basedata = FILTER ysales BY (sdesc == 'Wrench') AND (syear == '2000') AND (stype == 'store');
My result set (from DUMP basedata;) is:
(Wrench,259000,2000,store)
So the question is: how do I break up basedata to get, for example, A = 'Wrench', B = 259000, C = 2000, D = 'store'?
You can use positional references ($0, $1, ...) to project each column into its own relation:
a = foreach basedata generate $0;
b = foreach basedata generate $1;
c = foreach basedata generate $2;
d = foreach basedata generate $3;
data = load '/home/satish/wrench' using PigStorage(' ') as (name,total,year,type) ;
-- add a FILTER here if you need one
reqdata = foreach data generate CONCAT('A','=',name) as A, CONCAT('B','=',total) as B, CONCAT('C','=',year) as C,CONCAT('D','=',type) as D;
dump reqdata;
(A=Wrench,B=259000,C=2000,D=store)
(A=Wrench,B=135000,C=2000,D=online)
(A=Wrench,B=175000,C=2001,D=online)
(A=Wrench,B=180000,C=2001,D=store)
fdata = foreach reqdata generate A,B;
dump fdata;
(A=Wrench,B=259000)
(A=Wrench,B=135000)
(A=Wrench,B=175000)
(A=Wrench,B=180000)
If you want to remove the enclosing tuples, use FLATTEN.
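As a rough mental model of the filter-then-project steps above, here is a small plain-Python sketch (not Pig). It uses the four sample rows from the question. Note that each projection is still a collection of one-field tuples, just as `a`, `b`, `c` and `d` are still relations in Pig rather than scalar variables.

```python
# The four sample rows from the question: (sdesc, samt, syear, stype)
ysales = [
    ("Wrench", 259000, "2000", "store"),
    ("Wrench", 135000, "2000", "online"),
    ("Wrench", 175000, "2001", "online"),
    ("Wrench", 180000, "2001", "store"),
]

# Rough analogue of the FILTER statement
basedata = [
    r for r in ysales
    if r[0] == "Wrench" and r[2] == "2000" and r[3] == "store"
]

# Rough analogue of FOREACH basedata GENERATE $0 (and $1, $2, $3):
# each result is a list of one-field tuples, not a single value.
a = [(r[0],) for r in basedata]
b = [(r[1],) for r in basedata]
c = [(r[2],) for r in basedata]
d = [(r[3],) for r in basedata]

print(a, b, c, d)
```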

Identifying columns through PiG

I have data set like below :
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
Which separator should I use to split these rows into the three columns below?
First column value is => column,1A
Second column value is => column2A
Third column value is => column3A
Try this code:
a = LOAD '/home/hduser/pig_ex' USING PigStorage(',') AS (col1,col2,col3,col4);
b = FOREACH a GENERATE REGEX_EXTRACT(col1, '^\\"(.*)', 1) AS (modfirstcol),REGEX_EXTRACT(col2, '^(.*)\\"', 1) AS (modsecondcol),col3,col4;
c = foreach b generate CONCAT($0, CONCAT(', ', $1)), $2 , $3;
dump c;
I was able to resolve it using the steps below:
Input:-
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
Pig Script :-
A = load '/home/hduser/pig_ex' AS line;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\,',4)) AS (firstcol:chararray,secondcol:chararray,thirdcol:chararray,forthcol:chararray);
C = FOREACH B GENERATE REGEX_EXTRACT(firstcol, '^\\"(.*)', 1) AS (modfirstcol),REGEX_EXTRACT(secondcol, '^(.*)\\"', 1) AS (modsecondcol),thirdcol,forthcol;
D = FOREACH C GENERATE CONCAT(modfirstcol,',',modsecondcol),thirdcol,forthcol;
DUMP D;
Output :-
(column,1A,column2A,column3A)
(column,1B,column2B,column3B)
(column,1C,column2C,column3C)
(column,1D,column2D,column3D)
Please let me know if there is a better way.
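For comparison, this is the behaviour a quote-aware CSV parser gives on the same input, sketched here with Python's standard csv module rather than Pig. Piggybank also ships a CSVExcelStorage loader (org.apache.pig.piggybank.storage.CSVExcelStorage) intended for quoted fields; it may remove the need for the split-and-rejoin steps, but that has not been verified against this data.

```python
import csv
import io

# The input from the question, with a comma inside the quoted first field
raw = (
    '"column,1A",column2A,column3A\n'
    '"column,1B",column2B,column3B\n'
    '"column,1C",column2C,column3C\n'
    '"column,1D",column2D,column3D\n'
)

# csv.reader treats a double-quoted field as one value, so the embedded
# comma does not split the first column.
rows = list(csv.reader(io.StringIO(raw)))

for row in rows:
    print(row)
```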

Get first and last tuple from a Bag using Apache Pig

I am new to Pig Latin and am trying the example below with Pig built-in functions.
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
B = GROUP A BY name;
DUMP B;
(John,{(John,sm,3.8),(John,sp,4.0),(John,wt,3.7),(John,fl ,3.9)})
(Mary,{(Mary,sm,4.0),(Mary,sp,4.0),(Mary,wt,3.9),(Mary,fl,3.8)})
I need to retrieve the first element => (John,sm,3.8) and the last element => (John,fl ,3.9) from the bag.
I need help solving this without using a UDF.
OK, you can use this solution, though it is a little lengthy.
names = LOAD '/user/user/inputfiles/names.txt' USING PigStorage(',') AS(name:chararray,term:chararray,gpa:float);
names_rank = RANK names;
names_each = FOREACH names_rank GENERATE $0 as row_id,name,term,gpa;
names_grp = GROUP names_each BY name;
names_first_each = FOREACH names_grp
{
order_asc = ORDER names_each BY row_id ASC;
first_rec = LIMIT order_asc 1;
GENERATE flatten(first_rec) as(row_id,name,term,gpa);
};
names_last_each = FOREACH names_grp
{
order_desc = ORDER names_each BY row_id DESC;
last_rec = LIMIT order_desc 1;
GENERATE flatten(last_rec) as(row_id,name,term,gpa);
};
names_unioned = UNION names_first_each,names_last_each;
names_extract = FOREACH names_unioned GENERATE name,term,gpa;
names_ordered = ORDER names_extract BY name;
dump names_ordered;
Output :-
(John,fl,3.9)
(John,sm,3.8)
(Mary,fl,3.8)
(Mary,sm,4.0)
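The idea behind the script above is to remember each row's load order (via RANK) and then take the lowest and highest position per name. A minimal plain-Python sketch of the same idea follows; it is not Pig, and list order stands in for row_id.

```python
# Sample rows from the question, in load order: (name, term, gpa)
students = [
    ("John", "sm", 3.8), ("John", "sp", 4.0), ("John", "wt", 3.7), ("John", "fl", 3.9),
    ("Mary", "sm", 4.0), ("Mary", "sp", 4.0), ("Mary", "wt", 3.9), ("Mary", "fl", 3.8),
]

def first_and_last(rows):
    """Map each name to (first row seen, last row seen), relying on input order."""
    result = {}
    for row in rows:
        name = row[0]
        # Keep the first row already stored for this name; default to the current row.
        first, _ = result.get(name, (row, row))
        result[name] = (first, row)
    return result

ends = first_and_last(students)
for name, (first, last) in ends.items():
    print(name, first, last)
```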

PIG: FLATTEN error

I have a relation X with structure X: {group: chararray,inboundCount: {(name: chararray,inb: long)},outboundCount: {(name: chararray,out: long)}} as follows:
(IAD,{},{(IAD,25)})
(LAX,{},{(LAX,2)})
(ORD,{(ORD,27)},{})
(PDX,{},{(PDX,3)})
(SFO,{(SFO,3)},{})
I want an output with the following structure final: {airport: chararray,inbound: long,outbound: long} with output:
(IAD,,25)
(LAX,,2)
(ORD,27,)
(PDX,,3)
(SFO,3,)
I've tried the following code and it gives the output structure that I want, but nothing gets printed. Is it because of the empty bags?
final = foreach X generate group as airport,FLATTEN(inboundCount.inb) as inbound,FLATTEN(outboundCount.out) as outbound;
Please help me.
EDIT
I got this relation X by executing the following commands.
A= load '/user/hduser/airline.csv' using PigStorage(',') as (year:int,month:int,dayofmonth:int,dayofweek:int,dep:int,CRS:int,Arr:int,CRSArr:int,UniqueCarrier:chararray,FlightNum:int,TailNum:chararray,ActualElapsedTime:int,CRSElapsed:int,AirTime:int,ArrDelay:int,DepDelay:int,Origin:chararray,Dest:chararray,Distance:int,TaxiIn:int,TaxiOut:int,Cancelled:int,CancelCode:chararray,Diverted:int,CarrierDelay:int,WeatherDelay:int,NASDelay:int,SecurityDelay:int,LateAircraft:int);
B= foreach A generate year,month,UniqueCarrier,FlightNum,TailNum,Origin,Dest;
inbound = group B by Dest;
inboundCount = foreach inbound generate group as name,COUNT(B.FlightNum) as inb;
outbound = group B by Origin;
outboundCount = foreach outbound generate group as name,COUNT(B.FlightNum) as out;
X = COGROUP inboundCount BY name, outboundCount BY name;
Sample input record:
2008,1,31,4,1757,1155,2400,1758,UA,114,N845UA,243,243,217,362,362,LAX,ORD,1745,11,15,0,,0,0,0,362,0,0
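You are almost there. FLATTEN on an empty bag produces no output row, which is why nothing is printed; SUM over an empty bag returns null instead. Just apply SUM instead of FLATTEN: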
A= load '/user/hduser/airline.csv' using PigStorage(',') as (year:int,month:int,dayofmonth:int,dayofweek:int,dep:int,CRS:int,Arr:int,CRSArr:int,UniqueCarrier:chararray,FlightNum:int,TailNum:chararray,ActualElapsedTime:int,CRSElapsed:int,AirTime:int,ArrDelay:int,DepDelay:int,Origin:chararray,Dest:chararray,Distance:int,TaxiIn:int,TaxiOut:int,Cancelled:int,CancelCode:chararray,Diverted:int,CarrierDelay:int,WeatherDelay:int,NASDelay:int,SecurityDelay:int,LateAircraft:int);
B= foreach A generate year,month,UniqueCarrier,FlightNum,TailNum,Origin,Dest;
inbound = group B by Dest;
inboundCount = foreach inbound generate group as name,COUNT(B.FlightNum) as inb;
outbound = group B by Origin;
outboundCount = foreach outbound generate group as name,COUNT(B.FlightNum) as out;
X = COGROUP inboundCount BY name, outboundCount BY name;
final_data = FOREACH X GENERATE group as airport, SUM(inboundCount.inb) as inb, SUM(outboundCount.out) as out;
dump final_data;
The dump of final_data will give you the expected result.
(IAD,,25)
(LAX,,2)
(ORD,27,)
(PDX,,3)
(SFO,3,)
If you want, you can replace the NULL counts with 0:
final_null_check = FOREACH final_data GENERATE airport,(inb is null ? 0 :inb) as inb_cnt, (out is null ? 0 : out) as out_cnt;
After the NULL check, dumping the final_null_check relation gives output like this:
(IAD,0,25)
(LAX,0,2)
(ORD,27,0)
(PDX,0,3)
(SFO,3,0)
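Why the FLATTEN version printed nothing can be modelled as a cross product: flattening two bags in one GENERATE pairs every element of one with every element of the other, so an empty bag on either side leaves zero rows. The plain-Python sketch below (not Pig) uses the rows of X from the question; the helper names are made up for illustration.

```python
from itertools import product

# Relation X from the question: (group, inboundCount bag, outboundCount bag)
X = [
    ("IAD", [], [("IAD", 25)]),
    ("LAX", [], [("LAX", 2)]),
    ("ORD", [("ORD", 27)], []),
    ("PDX", [], [("PDX", 3)]),
    ("SFO", [("SFO", 3)], []),
]

def flatten_both(rows):
    """Model of flattening both bags: a cross product per input row."""
    out = []
    for airport, inbound, outbound in rows:
        inb = [t[1] for t in inbound]
        outb = [t[1] for t in outbound]
        out.extend((airport, i, o) for i, o in product(inb, outb))
    return out

def sum_or_none(values):
    """Model of SUM over a bag: None (null) when the bag is empty."""
    return sum(values) if values else None

def sum_both(rows):
    """Model of the SUM-based answer: always one output row per input row."""
    return [
        (airport, sum_or_none([t[1] for t in inbound]), sum_or_none([t[1] for t in outbound]))
        for airport, inbound, outbound in rows
    ]

print(flatten_both(X))
print(sum_both(X))
```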

Using Aggregate functions in Pig

My input file is below
a1,1,on,400
a1,2,off,100
a1,3,on,200
I need to sum $3 only for rows where $2 equals "on", while $1 should be summed for every row with no filter. I have written the script below, but I don't know how to proceed from there.
Can someone help me finish it?
myinput = LOAD 'file' USING PigStorage(',') AS (id:chararray, num:int, flag:chararray, amt:int);
grouped = GROUP myinput BY id;
I need the output below:
a1,6,600
Here is a possible solution.
You could do something like this (not tested):
myinput = LOAD 'file' USING PigStorage(',');
A = FOREACH myinput GENERATE $0 as id, $1 as first_sum, (($2 == 'on') ? $3 : 0) as second_sum;
grouped = GROUP A BY id;
RESULT = FOREACH grouped GENERATE group as id, SUM($1.first_sum), SUM($1.second_sum);
That should do the trick.
Try this:
myinput = LOAD '/home/gopalkrishna/PIGPRAC/pig-sum.txt' using PigStorage(',') as (name:chararray,num:int,stat:chararray,amt:int);
A = GROUP myinput BY name;
B = FOREACH A {
-- sum amt only over the rows whose stat is 'on'
on_rows = FILTER myinput BY stat == 'on';
GENERATE group, SUM(myinput.num), SUM(on_rows.amt);
};
STORE B INTO 'SUMOUT';
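To check the expected numbers, here is a small plain-Python sketch (not Pig) of the conditional aggregation on the sample input: every num is added, but amt is added only when the flag is "on". The function name is made up for illustration.

```python
from collections import defaultdict

# Sample input from the question: (id, num, flag, amt)
rows = [
    ("a1", 1, "on", 400),
    ("a1", 2, "off", 100),
    ("a1", 3, "on", 200),
]

def conditional_totals(records):
    """Per id: sum of num over all rows, and sum of amt over rows flagged 'on'."""
    acc = defaultdict(lambda: [0, 0])
    for id_, num, flag, amt in records:
        acc[id_][0] += num
        if flag == "on":
            acc[id_][1] += amt
    return [(key, sums[0], sums[1]) for key, sums in acc.items()]

print(conditional_totals(rows))
```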