Pig Aggregate functions - apache-pig

My Input file is below
a,t1,1000,100
a,t1,2000,200
b,t2,1000,200
b,t2,5000,100
How do I find the count of distinct values of $0 in the above file?
myinput = LOAD 'file' AS(a1:chararray,a2:chararray,amt:int,rate:int);
What needs to be done after the above script?
Also, can I use that distinct count as a divisor in a different relation?

First of all, the way you read the data is incorrect. If you try to dump "myinput", you'll see that the whole row is read into the first field (a1), while the others are empty.
The reason is that you don't specify a load function, so Pig falls back to the built-in PigStorage(), which expects a tab-delimited file (so it ignores your commas!). You need to explicitly specify a load function (e.g. PigStorage()) via the USING clause and pass it the delimiter as an argument:
myInput = LOAD 'file' USING PigStorage(',');
myInput2 = FOREACH myInput GENERATE (chararray)$0 AS a1, (chararray)$1 AS a2, (int)$2 AS amt, (int)$3 AS rate;
Moving on, to find the DISTINCT $0 first you have to extract field $0 in a separate relation. The reason is that the DISTINCT statement works on entire records, rather than on separate fields.
myField = FOREACH myInput2 GENERATE a1;
distinctA1 = DISTINCT myField;
Now the result of distinctA1 is {(a), (b)}. By using GROUP ... ALL you group all of the records into a single bag, and then all that is left is to COUNT them:
grouped = GROUP distinctA1 all;
countA1 = FOREACH grouped GENERATE COUNT(distinctA1);
And now you're happy. :)
The complete code:
myInput = LOAD 'file' USING PigStorage(',');
myInput2 = FOREACH myInput GENERATE (chararray)$0 AS a1, (chararray)$1 AS a2, (int)$2 AS amt, (int)$3 AS rate;
myField = FOREACH myInput2 GENERATE a1;
distinctA1 = DISTINCT myField;
grouped = GROUP distinctA1 all;
countA1 = FOREACH grouped GENERATE COUNT(distinctA1);
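As a sanity check (not part of the original answer), the same distinct-count pipeline can be mirrored in plain Python over the sample rows from the question:

```python
# Sample rows from the question
rows = [
    ("a", "t1", 1000, 100),
    ("a", "t1", 2000, 200),
    ("b", "t2", 1000, 200),
    ("b", "t2", 5000, 100),
]

a1_values = [r[0] for r in rows]   # FOREACH myInput2 GENERATE a1
distinct_a1 = set(a1_values)       # DISTINCT
count_a1 = len(distinct_a1)        # GROUP ... ALL + COUNT
print(count_a1)  # 2
```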

You can do something like this :
myInput = LOAD 'file.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray,amt:int,rate:int);
Data = GROUP myInput BY $0;
Data = FOREACH Data GENERATE $0;
Data = GROUP Data ALL;
Data = FOREACH Data GENERATE $0,COUNT($1);
NB: By grouping on $0 you are doing the same thing as a DISTINCT, and you get better performance ;)
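The equivalence this note relies on — one group per distinct key — can be sketched in plain Python (an illustration, not Pig):

```python
rows = [("a", "t1"), ("a", "t1"), ("b", "t2"), ("b", "t2")]

groups = {}                               # GROUP myInput BY $0
for r in rows:
    groups.setdefault(r[0], []).append(r)

# one group per distinct key, so counting groups counts distinct keys
print(len(groups))  # 2
```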

Related

How to filter out the first line from a variable in pig

I imported a csv file into a variable like below:
basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
Below is the output of the first 3 lines:
tmp = limit basketball_players 3;
dump tmp
("playerID","year","stint","tmID","lgID","GP","GS","minutes","points","oRebounds","dRebounds","rebounds","assists","steals","blocks","turnovers","PF","fgAttempted","fgMade","ftAttempted","ftMade","threeAttempted","threeMade","PostGP","PostGS","PostMinutes","PostPoints","PostoRebounds","PostdRebounds","PostRebounds","PostAssists","PostSteals","PostBlocks","PostTurnovers","PostPF","PostfgAttempted","PostfgMade","PostftAttempted","PostftMade","PostthreeAttempted","PostthreeMade","note")
("abramjo01","1946","1","PIT","NBA","47","0","0","527","0","0","0","35","0","0","0","161","834","202","178","123","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0",)
("aubucch01","1946","1","DTF","NBA","30","0","0","65","0","0","0","20","0","0","0","46","91","23","35","19","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0",)
You can see that the first line is the header of the table. I used the command below to filter out the first line, but it doesn't work:
grunt> players_raw = filter basketball_players by $1 > 0;
2017-05-06 11:03:36,389 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 6 time(s).
When I dump the value of players_raw it returns empty. How can I filter the first row out from a variable?
Use RANK to generate a new column that adds row numbers to the dataset. Then use that column to filter out the first row.
basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
ranked = rank basketball_players;
basketball_players_without_header = Filter ranked by (rank_basketball_players > 1);
DUMP basketball_players_without_header;
Another way to do this:
basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
basketball_players_without_header = FILTER basketball_players BY NOT ($0 matches '.*playerID.*');
DUMP basketball_players_without_header;
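The regex approach boils down to dropping any line that contains the header marker. A plain-Python sketch of that filter, using shortened sample lines for brevity:

```python
lines = [
    '"playerID","year","stint","tmID"',
    '"abramjo01","1946","1","PIT"',
    '"aubucch01","1946","1","DTF"',
]

# keep only lines that do NOT contain the header marker,
# mirroring FILTER ... BY NOT ($0 matches '.*playerID.*')
data = [l for l in lines if "playerID" not in l]
print(len(data))  # 2
```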

Basic statistics with Apache Pig

I am trying to characterize fractions of rows having certain properties using Apache Pig.
For example, if the data looks like:
a,15
a,16
a,17
b,3
b,16
I would like to get:
a,0.6
b,0.4
I am trying to do the following:
A = LOAD 'my file' USING PigStorage(',');
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
which gives me total = (5), but then when I attempt to use this 'total':
fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;
I get an error.
Clearly COUNT() returns some kind of projection and both projections (in computing total and fractions) should be consistent. Is there a way to make this work? Or perhaps just to cast total to be a number and avoid this projection consistency requirement?
One more way to do the same:
test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
B = GROUP test by $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group,COUNT(test.$0);
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,(double)$1/$3;
Output:
(a,3,5,0.6)
(b,2,5,0.4)
You will have to project and cast it to double:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
fractions = FOREACH rows GENERATE rows.$0,(double)rows.$1/(double)total.$0;
For some reason the following modification of what @inquisitive-mind suggested works:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;
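For reference, the fraction computation the Pig scripts above perform can be checked in plain Python on the sample data from the question:

```python
rows = [("a", 15), ("a", 16), ("a", 17), ("b", 3), ("b", 16)]

total = len(rows)                      # GROUP A ALL + COUNT
counts = {}                            # GROUP A BY $0 + COUNT
for key, _ in rows:
    counts[key] = counts.get(key, 0) + 1

# the (double) cast in Pig corresponds to float division here
fractions = {k: c / float(total) for k, c in counts.items()}
print(fractions)  # {'a': 0.6, 'b': 0.4}
```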

Pig min command and order by

I have data in the form of shell, $917.14,$654.23,2013
I have to find the minimum values in columns $1 and $2.
I tried to do an order by on these columns in ascending order,
but the answer is not coming out correct. Can anyone please help?
Refer to MIN.
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT1: The data you are loading has '$' in it. You will either have to clean it up and load it into a float field before applying the MIN function, or load it into a chararray, replace the '$', cast it to float, and then apply MIN.
EDIT2: Here is a solution that keeps the '$' in the original data and handles it in the Pig script.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));
DUMP D;
Output
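The REPLACE/cast/MIN chain above can be mirrored in plain Python on the same input, which also shows the expected minimums (an illustration, not the original run):

```python
import re

lines = [
    "shell,$820.48,$11992.70,996,891,1629",
    "shell,$817.12,$2105.57,1087,845,1630",
    "Bharat,$974.48,$5479.10,965,827,1634",
    "Bharat,$943.70,$9162.57,939,895,1635",
]

# same idea as the REPLACE regex: strip anything that is not
# alphanumeric, '.', ',' or whitespace (this removes the '$')
cleaned = [re.sub(r"[^a-zA-Z0-9.,\s]+", "", l).split(",") for l in lines]
min_f2 = min(float(r[1]) for r in cleaned)   # MIN(B1.$1)
min_f3 = min(float(r[2]) for r in cleaned)   # MIN(B1.$2)
print("$%s $%s" % (min_f2, min_f3))  # $817.12 $2105.57
```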

I have a seemingly simple Pig generate and then filter issue

I am trying to run a simple Pig script on a simple csv file and I cannot get FILTER to do what I want. I have a test.csv file that looks like this:
john,12,44,,0
bob,14,56,5,7
dave,13,40,5,5
jill,8,,,6
Here is my script that does not work:
people = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filtered = FILTER data BY first == 13;
DUMP filtered;
When I dump data, everything looks good. I get the name and the first and last integers as expected. When I describe data, everything looks good:
data: {name: bytearray,first: int,second: int}
When I try and filter out data by the first value being 13, I get nothing. DUMP filtered simply returns nothing. Oddly enough, if I change it to first > 13, then all "rows" will print out.
However, this script works:
peopletwo = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',') AS (f1:chararray,f2:int,f3:int,f4:int,f5:int);
datatwo = FOREACH peopletwo GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filteredtwo = FILTER datatwo BY first == 13;
DUMP filteredtwo;
What is the difference between filteredtwo and filtered (or between data and datatwo, for that matter)? I want to know why the relation obtained using GENERATE (i.e. data) won't filter in the first script as one would expect.
Specify the datatype in the load itself. See below:
people = LOAD 'test5.csv' USING PigStorage(',') as (f1:chararray,f2:int,f3:int,f4:int,f5:int);
filtered = FILTER people BY f2 == 13;
DUMP filtered;
Output
Changing the filter to use > gives
filtered = FILTER people BY f2 > 13;
Output
EDIT
When converting from a bytearray you have to explicitly cast the value of the fields in the FOREACH. This works:
people = LOAD 'test5.csv' USING PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray,(int)$1 AS f1,(int)$4 AS f2;
filtered = FILTER data BY f1 == 13;
DUMP filtered;
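The root cause is comparing an uncast value to an int. A loose Python analogy (not Pig semantics, just an illustration of why the equality never matches until you cast):

```python
raw = ["john,12,44,,0", "bob,14,56,5,7", "dave,13,40,5,5", "jill,8,,,6"]
rows = [line.split(",") for line in raw]

# comparing the raw (string) field to an int never matches,
# much like comparing an uncast bytearray to an int in Pig
uncast = [r[0] for r in rows if r[1] == 13]

# casting the field first behaves as expected
cast = [r[0] for r in rows if int(r[1]) == 13]
print(uncast, cast)  # [] ['dave']
```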

Pig Latin split columns to rows

Is there any solution in Pig latin to transform columns to rows to get the below?
Input:
id|column1|column2
1|a,b,c|1,2,3
2|d,e,f|4,5,6
required output:
id|column1|column2
1|a|1
1|b|2
1|c|3
2|d|4
2|e|5
2|f|6
Thanks!
I'm willing to bet this is not the best way to do this however ...
data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray,
col2:chararray);
A = foreach data generate id, flatten(TOKENIZE(col1));
B = foreach data generate id, flatten(TOKENIZE(col2));
RA = RANK A;
RB = RANK B;
store RA into 'ra_temp' using PigStorage(',');
store RB into 'rb_temp' using PigStorage(',');
data_a = load 'ra_temp/part-m-00000' using PigStorage(',');
data_b = load 'rb_temp/part-m-00000' using PigStorage(',');
jed = JOIN data_a BY $0, data_b BY $0;
final = foreach jed generate $1, $2, $5;
dump final;
(1,a,1)
(1,b,2)
(1,c,3)
(2,d,4)
(2,e,5)
(2,f,6)
store final into '~/some_dir' using PigStorage('|');
EDIT: I really like this question and was discussing it with a co-worker, and he came up with a much simpler and more elegant solution. If you have Jython installed ...
# create file called udf.py
@outputSchema("innerBag:bag{innerTuple:(column1:chararray, column2:chararray)}")
def pigzip(column1, column2):
    c1 = column1.split(',')
    c2 = column2.split(',')
    innerBag = zip(c1, c2)
    return innerBag
Then in Pig
$ pig -x local
register udf.py using jython as udf;
data = load 'input' using PigStorage('|') as (id:chararray, column1:chararray,
column2:chararray);
result = foreach data generate id, flatten(udf.pigzip(column1, column2));
dump result;
store result into 'output' using PigStorage('|');
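The whole columns-to-rows transform is easy to verify in plain Python; this sketch pairs the two comma-separated columns the same way the pigzip UDF does, then flattens per id:

```python
data = ["1|a,b,c|1,2,3", "2|d,e,f|4,5,6"]

result = []
for line in data:
    id_, col1, col2 = line.split("|")
    # same pairing the pigzip UDF performs, flattened per id
    for v1, v2 in zip(col1.split(","), col2.split(",")):
        result.append((id_, v1, v2))

for row in result:
    print("|".join(row))
```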