PIG: sum and division, creating an object - sum

I am writing a Pig program that loads a file whose entries are separated by tabs
ex: name TAB year TAB count TAB...
file = LOAD 'file.csv' USING PigStorage('\t') as (type: chararray, year: chararray,
match_count: float, volume_count: float);
-- Group by type
grouped = GROUP file BY type;
-- Flatten
by_type = FOREACH grouped GENERATE FLATTEN(group) AS (type, year, match_count, volume_count);
group_operat = FOREACH by_type GENERATE
SUM(match_count) AS sum_m,
SUM(volume_count) AS sum_v,
(float)sum_m/sm_v;
DUMP group_operat;
The issue lies in the group_operat relation I am trying to create. I want to sum all the match counts, sum all the volume counts, and divide the summed match counts by the summed volume counts.
What am I doing wrong in my arithmetic operations or in creating the relation?
An error I receive is line 7, column 11> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "type:NULL,year:NULL,match_count:NULL,volume_count:NULL", right is "group:chararray"
Thank you.

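The error comes from FLATTEN(group): after GROUP file BY type, group holds only the type key (a single chararray), so it cannot be given a four-field schema. Aggregate over the grouped bag instead, as below. This returns each type together with the ratio of the two sums.
Updated, working code: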
input.txt
A 2001 10 2
A 2002 20 3
B 2003 30 4
B 2004 40 1
PigScript:
file = LOAD 'input.txt' USING PigStorage() AS (type: chararray, year: chararray,
match_count: float, volume_count: float);
grouped = GROUP file BY type;
group_operat = FOREACH grouped {
sum_m = SUM(file.match_count);
sum_v = SUM(file.volume_count);
GENERATE group,(float)(sum_m/sum_v) as sum_mv;
}
DUMP group_operat;
Output:
(A,6.0)
(B,14.0)
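To double-check the arithmetic outside Pig, here is a minimal plain-Python sketch of the same computation. It is not Pig code: the function name ratio_by_type and the in-memory rows are assumptions for illustration, with values copied from input.txt above.

```python
from collections import defaultdict

# Rows mirror input.txt above: (type, year, match_count, volume_count).
rows = [
    ("A", "2001", 10.0, 2.0),
    ("A", "2002", 20.0, 3.0),
    ("B", "2003", 30.0, 4.0),
    ("B", "2004", 40.0, 1.0),
]


def ratio_by_type(records):
    """Sum both count columns per type, then divide the two sums."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for rec_type, _year, match_count, volume_count in records:
        sums[rec_type][0] += match_count
        sums[rec_type][1] += volume_count
    return {t: m / v for t, (m, v) in sums.items()}


print(ratio_by_type(rows))  # {'A': 6.0, 'B': 14.0}
```

The result matches the DUMP output: 30/5 for A and 70/5 for B.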

Try this:
file = LOAD 'file.csv' USING PigStorage('\t') as (type: chararray, year: chararray,
match_count: float, volume_count: float);
grouped = GROUP file BY (type,year);
group_operat = FOREACH grouped GENERATE group,
SUM(file.match_count) AS sum_m,
SUM(file.volume_count) AS sum_v,
(float)(SUM(file.match_count)/SUM(file.volume_count)) as sum_mv;
The script above groups the results by type and year. If you only want to group by type, remove year from the GROUP statement:
grouped = GROUP file BY type;
group_operat = FOREACH grouped GENERATE group,file.year,
SUM(file.match_count) AS sum_m,
SUM(file.volume_count) AS sum_v,
(float)(SUM(file.match_count)/SUM(file.volume_count)) as sum_mv;

Related

PIG need to find max

I am new to Pig and working on a problem where I need to find the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course. It told me I had to use an explicit cast. I have the filtering etc. figured out; I just can't figure out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
weights::id AS id,
weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
group AS id,
MAX(just_ids_weights.weight) AS wgt;
If you wanted the maximum weight in all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
MAX(just_ids_weights.weight) AS wgt;
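The difference between grouping by id and GROUP ALL can be sketched in plain Python. This is only an analogy, not Pig: the function names are hypothetical, and the (id, weight) pairs are taken from the sample rows in the question.

```python
from collections import defaultdict

# (id, weight) pairs taken from the sample rows in the question.
id_weights = [
    ("bayja01", 210),
    ("crawfca02", 225),
    ("damonjo01", 205),
    ("dejesda01", 190),
    ("eckstda01", 170),
]


def max_weight_by_id(pairs):
    """Analogue of GROUP ... BY id followed by MAX over each bag."""
    bags = defaultdict(list)
    for player_id, weight in pairs:
        bags[player_id].append(weight)
    return {player_id: max(bag) for player_id, bag in bags.items()}


def max_weight_overall(pairs):
    """Analogue of GROUP ... ALL: one bag holding every weight."""
    return max(weight for _player_id, weight in pairs)


print(max_weight_by_id(id_weights)["crawfca02"])  # 225
print(max_weight_overall(id_weights))             # 225
```

Grouping by id gives one maximum per player, while the single bag gives one maximum for the whole data set.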

Take MIN EFF_DT and MAX_CANC_dt from data in PIG

Schema :
TYP|ID|RECORD|SEX|EFF_DT|CANC_DT
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Suppose I have multiple records like this. I only want to display records that have the minimum EFF_DT and the maximum CANC_DT.
In this case, I want to display just this one record:
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Thank you
Get the min eff_dt and max canc_dt and use them to filter the relation. Assuming you have a relation A:
B = GROUP A ALL;
X = FOREACH B GENERATE MIN(A.EFF_DT);
Y = FOREACH B GENERATE MAX(A.CANC_DT);
C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0));
D = DISTINCT C;
DUMP D;
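The same filter-then-deduplicate logic, as a small plain-Python sketch for checking the expected result. It is not Pig; the function name min_eff_max_canc is hypothetical, and dates are kept as ISO strings, which compare correctly as text.

```python
# Rows mirror the sample: (typ, id, record, sex, eff_dt, canc_dt).
# ISO dates stored as strings sort the same way as the dates themselves.
records = [
    ("DMF", "1234567", "98765432", "M", "2011-08-30", "9999-12-31"),
    ("DMF", "1234567", "98765432", "M", "2011-04-30", "9999-12-31"),
    ("DMF", "1234567", "98765432", "M", "2011-04-30", "9999-12-31"),
]


def min_eff_max_canc(rows):
    """Keep rows with the smallest eff_dt and largest canc_dt, without duplicates."""
    min_eff = min(r[4] for r in rows)
    max_canc = max(r[5] for r in rows)
    kept = [r for r in rows if r[4] == min_eff and r[5] == max_canc]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(kept))


for row in min_eff_max_canc(records):
    print("|".join(row))  # DMF|1234567|98765432|M|2011-04-30|9999-12-31
```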
Let's say you have this data (sample here):
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31
Perform these steps:
-- 1. Read data, if you have not
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
-- 2. Group data by the attribute you like to, in this case it is TYP
grouped = group A by typ;
-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;
--
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)
If you need to, change datetime to chararray.
Note: there are different ways of doing this. Apart from the load step, what I am showing produces the desired result in two steps: GROUP and FOREACH.
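For comparison, the per-group MIN/MAX can be sketched in plain Python. This is an illustration only, not Pig: the function name is hypothetical, and only the typ, eff_dt and canc_dt fields of the sample are kept, as ISO date strings.

```python
# (typ, eff_dt, canc_dt) taken from the six sample rows above.
sample = [
    ("DMF", "2011-08-30", "9999-12-31"),
    ("DMF", "2011-04-30", "9999-12-31"),
    ("DMF", "2011-04-30", "9999-12-31"),
    ("DMX", "2011-12-30", "9999-12-31"),
    ("DMX", "2011-04-30", "9999-12-31"),
    ("DMX", "2011-04-01", "9999-12-31"),
]


def min_max_by_typ(rows):
    """Track the smallest eff_dt and largest canc_dt seen for each typ."""
    result = {}
    for typ, eff_dt, canc_dt in rows:
        if typ not in result:
            result[typ] = (eff_dt, canc_dt)
        else:
            lo, hi = result[typ]
            result[typ] = (min(lo, eff_dt), max(hi, canc_dt))
    return result


print(min_max_by_typ(sample))
# {'DMF': ('2011-04-30', '9999-12-31'), 'DMX': ('2011-04-01', '9999-12-31')}
```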

Pig Latin search variable name and min value

I have a pig Latin question. I have a table with the following:
ID:Seller:Price:BID
1:John:20:B1
1:Ben:25:B1
2:John:60:B2
2:Chris:35:B2
3:John:20:B3
I'm able to group the table by ID using the following (assuming A is the LOAD table):
W = GROUP A BY ID;
But what I can't seem to figure out is the command to only return the values for the lowest price for each ID. In this example the final output should be:
1:John:20:B1
2:Chris:35:B2
3:John:20:B3
Cheers,
Shivedog
Generally you'll want to GROUP by the BID, then use MIN. However, since you want the whole tuple associated with the minimum value you'll want to use a UDF to do this.
myudfs.py
@outputSchema('vals: (ID: int, Seller: chararray, Price: chararray, BID: chararray)')
def get_min_tuple(bag):
    # Return the tuple whose third field (Price) is smallest
    return min(bag, key=lambda x: x[2])
myscript.pig
register 'myudfs.py' using jython as myudfs ;
-- A: (ID: int, Seller: chararray, Price: chararray, BID: chararray)
B = GROUP A BY BID ;
C = FOREACH B GENERATE group AS BID, FLATTEN(myudfs.get_min_tuple(A)) ;
-- Now you can do the JOIN to get the type of novel on C
Remember to change the types (int, chararray, etc.) to the appropriate values.
Note: If multiple items in A have the same minimum price for an ID, then this will only return one of them.
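Outside Pig, the "whole tuple with the minimum price per ID" logic looks roughly like the sketch below. The function name cheapest_per_id is hypothetical, and prices are stored as integers here, which is an assumption: compared as strings, '100' would sort before '20'.

```python
# Rows copied from the question: (ID, Seller, Price, BID).
# Price is an int here (an assumption) so the comparison is numeric.
rows = [
    (1, "John", 20, "B1"),
    (1, "Ben", 25, "B1"),
    (2, "John", 60, "B2"),
    (2, "Chris", 35, "B2"),
    (3, "John", 20, "B3"),
]


def cheapest_per_id(records):
    """Keep, for each ID, the first record seen with the lowest price."""
    best = {}
    for rec in records:
        current = best.get(rec[0])
        if current is None or rec[2] < current[2]:
            best[rec[0]] = rec
    return list(best.values())


for rec in cheapest_per_id(rows):
    print(":".join(str(field) for field in rec))
# 1:John:20:B1
# 2:Chris:35:B2
# 3:John:20:B3
```

As with the UDF, ties on the minimum price keep only one record per ID.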
option (1) - get all records with maximum price:
Use the new (Pig 0.11) RANK operator:
A = LOAD ...;
B = RANK A BY Price DESC;
C = FILTER B BY $0 == 1;
option (2) - get all records with maximum price:
Pig version below 0.11:
a = load ...;
b = group a all;
c = foreach b generate MAX(a.price) as maxprice;
d = JOIN a BY price, c BY maxprice;
option (3) - use org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField to get one of the tuples with maximum price:
define mMax org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField( '4', 'max' );
a = load ...;
b = group a all;
c = foreach b generate mMax(a);
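The "compute the maximum, then join back" idea from option (2) can be sketched in plain Python as follows. This is an analogy rather than Pig; the function name rows_with_max_price is hypothetical, and the rows reuse the question's data with integer prices.

```python
# Rows copied from the question: (ID, Seller, Price, BID), prices as ints.
rows = [
    (1, "John", 20, "B1"),
    (1, "Ben", 25, "B1"),
    (2, "John", 60, "B2"),
    (2, "Chris", 35, "B2"),
    (3, "John", 20, "B3"),
]


def rows_with_max_price(records):
    """Find the overall maximum price, then keep every row that has it."""
    top = max(rec[2] for rec in records)
    return [rec for rec in records if rec[2] == top]


print(rows_with_max_price(rows))  # [(2, 'John', 60, 'B2')]
```

Unlike picking a single extremal tuple, this keeps all rows that tie for the maximum.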

Pig - Calculating percentage of total for a field

I am trying to calculate the percentage of the total for each value in a field.
For example, for data (name, ct)
(john, 1000)
(Dan, 2000)
(liz, 2000)
I want the output to be (name, % of ct to the total)
(john, .2)
(Dan, .4)
(liz, .4)
data = load 'fakedata.txt' as (name:chararray,sqr:chararray,ct:int);
A = foreach data generate name, ct;
A = FILTER A by ct is not null;
B = group A all;
C = foreach B generate SUM(A.ct) as tot;
D = foreach A generate name, ct/(double)C.tot;
dump D;
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: C in {name: bytearray,ct: int}
I am following the example code in the "Casting Relations to Scalars" section of http://pig.apache.org/docs/r0.10.0/basic.html exactly.
If I run DUMP C, the output is correctly generated as 5000, so the problem is in D. Any help is greatly appreciated.
The below works for me without any error. This is basically same as what you have. Not sure why you are getting this error. Which version of pig are you using?
data = load 'StackData' as (name:chararray, marks:int);
grp = GROUP data all;
allcount = foreach grp generate SUM(data.marks) as total;
perc = foreach data generate name, marks/(double)allcount.total;
dump perc;
In relation D, you are looping over relation A again, and it knows nothing about C.
I'd suggest calculating the SUM, then doing JOIN so each entry contains the sum. That way you'll be able to calculate the % total for each entry.
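To check the expected numbers, here is a minimal plain-Python sketch of "compute the total once, then divide each row by it". It is not Pig; the function name share_of_total is hypothetical, and the data is the (name, ct) sample from the question.

```python
# (name, ct) pairs from the question.
data = [("john", 1000), ("Dan", 2000), ("liz", 2000)]


def share_of_total(pairs):
    """Compute the grand total once, then each row's fraction of it."""
    total = sum(ct for _name, ct in pairs)
    return [(name, ct / total) for name, ct in pairs]


print(share_of_total(data))  # [('john', 0.2), ('Dan', 0.4), ('liz', 0.4)]
```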

Pig split and join

I have a requirement to propagate field values from one row to another given type of record
for example my raw input is
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
Pig is not exactly SQL; it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using a GROUP BY, a FILTER nested in the FOREACH, and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
P = FILTER inpt BY type == 'p'; -- leave the record that contain p for the id
PL = LIMIT P 1; -- make sure there is just one
GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};
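The group-then-propagate idea can be checked with a short plain-Python sketch. It is not Pig: the function name propagate_browser is hypothetical, and empty browser fields are represented as None, which is an assumption about how the blanks load.

```python
# Rows copied from the raw input: (id, browser, type); blanks become None.
rows = [
    (1, "firefox", "p"), (1, None, "q"), (1, None, "r"), (1, None, "s"),
    (2, "ie", "p"), (2, None, "s"),
    (3, "chrome", "p"), (3, None, "r"), (3, None, "s"),
    (4, "netscape", "p"),
]


def propagate_browser(records):
    """Copy the browser of each id's first 'p' row onto all rows of that id."""
    browser_by_id = {}
    for rec_id, browser, rec_type in records:
        if rec_type == "p" and rec_id not in browser_by_id:
            browser_by_id[rec_id] = browser
    return [
        (rec_id, browser_by_id.get(rec_id), rec_type)
        for rec_id, _browser, rec_type in records
    ]


for rec in propagate_browser(rows):
    print(",".join(str(field) for field in rec))
```

Taking only the first 'p' row per id plays the same role as the LIMIT 1 in the nested FOREACH.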