Pig sum on data

I have a file like this:
(1950,10)
(1951,33)
(1952,15)
(1953,17)
(1954,17)
(1955,14)
(1956,60)
(1957,98)
(1958,73)
(1959,87)
(1960,123)
I want to get the sum of the second field through Pig.
E.g. the output should be like:
(547)
Please help

You can do it like this. You have to group all your records first:
x = LOAD '/root/stack.txt' USING PigStorage(',') AS (year:int, score:int);
y = GROUP x ALL;                       -- a single group holding every record
z = FOREACH y GENERATE SUM(x.score);   -- sum the second field across that group
dump z;
Output:
(547)
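One caveat: if the parentheses shown in the question are literally present in the file, PigStorage(',') will load "(1950" and "10)" and the int casts will produce nulls. A minimal sketch that strips them first (assuming the file is at /root/stack.txt as above):
raw = LOAD '/root/stack.txt' USING TextLoader() AS (line:chararray);
t   = FOREACH raw GENERATE STRSPLIT(REPLACE(line, '[()]', ''), ',') AS f;  -- drop ( ) and split on commas
x   = FOREACH t GENERATE (int)f.$0 AS year, (int)f.$1 AS score;
y   = GROUP x ALL;
z   = FOREACH y GENERATE SUM(x.score);
DUMP z;  -- (547)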
Does that solve your problem?

Related

Calculate percentage in Pig

I have the below requirement.
The test data has the following values, and I need to find the percentage of each of the characters out of the total.
I have tried the below query, but without success.
Ex:
W
H
U
U
H
W
U
W
W
H
W
U
H
H
H
U
W
W
W
H
data = LOAD 'location of test data';
grp = GROUP data BY data.$0; // considering only 1 field in this csv.
result = FOREACH grp GENERATE group, COUNT(data.$0)/SUM(data.$0);
Since the fields are chararrays, I am not able to take the SUM of the fields.
Is there an alternative?
If I use a GROUP ALL followed by COUNT(data.$0), I get the total number of entries.
If I GROUP by the field, followed by COUNT(data.$0), I get the individual counts.
Here I need the percentage of each individual count out of the total.
Thanks in advance.
"Here I need the percentage of this individual count by the sum."
To do this, I believe you need to run a few Pig operations:
1) First, as you said, get the individual counts in one relation:
W 8
H 7
U 5
2) Second, count all the elements, as you mentioned earlier, in another relation:
total 20
3) Then CROSS the relations obtained in steps one and two, so that you have a new relation like this:
W 8 20
H 7 20
U 5 20
4) After this, you can calculate the percentage that you wanted.
Update
Below is the Pig script that I came up with.
A = LOAD 'data.txt' USING PigStorage('\n');
--DUMP A;
B = GROUP A BY $0;                           -- group by the character
C = FOREACH B GENERATE group, COUNT(A.$0);   -- individual counts
--DUMP C;
D = GROUP A ALL;                             -- a single group holding every row
E = FOREACH D GENERATE group, COUNT(A.$0);   -- total count
DUMP E;
DESCRIBE C;
DESCRIBE E;
F = CROSS C, E;                              -- attach the total to every count row
G = FOREACH F GENERATE $0, $1, $3, ($1*100/$3);   -- char, count, total, percentage
DESCRIBE G;
DUMP G;
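One detail worth noting: COUNT returns a long, so ($1*100/$3) is integer division and any fractional part of the percentage is truncated. If fractional percentages are needed, a cast fixes it (a sketch reusing relation F from above):
G = FOREACH F GENERATE $0, $1, $3, ((double)$1 * 100.0 / (double)$3);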
You have to do that manually, with something like the following. (The original suggestion called a mean function, which does not exist in Pig; AVG over a GROUP ALL is the runnable equivalent, and B/'b1' stand in for whatever column and value you are counting. Multiply by 100 if you want a percentage rather than a fraction.)
data    = FOREACH data GENERATE *, ((B == 'b1') ? 1 : 0) AS dummy_b1;
grouped = GROUP data ALL;
result  = FOREACH grouped GENERATE AVG(data.dummy_b1) AS fraction;
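Applied to this question's data, a minimal sketch (assuming the single column is named c and the file is chars.txt; both names are made up here):
chars   = LOAD 'chars.txt' AS (c:chararray);
flagged = FOREACH chars GENERATE c, ((c == 'W') ? 1 : 0) AS is_w;
all_grp = GROUP flagged ALL;
pct_w   = FOREACH all_grp GENERATE 100.0 * SUM(flagged.is_w) / COUNT(flagged);
DUMP pct_w;  -- (40.0): 8 of the 20 sample rows are W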

How to loop through tuples in a bag in Pig

I am new to Pig scripting.
I have an input: (A,B,{(XYZ,123,CDE)})
I am looking to loop through the bag inside and print the following records:
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
Can someone please help me out!
Let's say X is your relation and it holds (A,B,{(XYZ,123,CDE)}). FLATTEN unnests the tuple inside the bag into separate fields, and TOBAG then packs those fields into a bag of single-field tuples, which a second FLATTEN turns into rows. (FLATTEN is an operator, not a UDF, so it cannot be nested inside TOBAG; two FOREACH steps are needed.)
Y1 = FOREACH X GENERATE $0, $1, FLATTEN($2);                  -- (A,B,XYZ,123,CDE)
Y  = FOREACH Y1 GENERATE $0, $1, FLATTEN(TOBAG($2, $3, $4));  -- one row per field
Solved!!
Let us load the below file (tab-separated):
A B {(XYZ,123,CDE)}
input_plus_bag = load '' USING PigStorage() AS (entry1:chararray, entry2:chararray, bag1:bag{(te1:chararray, te2:int, te3:chararray)});
intermed_output = foreach input_plus_bag generate entry1, entry2, FLATTEN(bag1);
Dump intermed_output;
This will give
(A,B,XYZ,123,CDE)
DESCRIBE intermed_output;
intermed_output: {entry1: chararray,entry2: chararray,bag1::te1: chararray,bag1::te2: int,bag1::te3: chararray}
Now perform the TOBAG operation:
intermed2_output = foreach intermed_output generate entry1, entry2, TOBAG(bag1::te1,bag1::te2,bag1::te3);
DUMP intermed2_output;
This will result in the below output:
(A,B,{(XYZ),(123),(CDE)})
Now the final step is to FLATTEN the bag:
final_output = foreach intermed2_output generate entry1, entry2, FLATTEN($2);
And we have our desired output:
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
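A small caveat on the TOBAG step: te2 is an int while te1 and te3 are chararrays, and TOBAG over mixed types leaves the bag without a typed schema (which is fine for a DUMP). If downstream steps need a consistent schema, a cast evens the types out (a sketch against the aliases above):
intermed2_typed = FOREACH intermed_output GENERATE entry1, entry2, TOBAG(bag1::te1, (chararray)bag1::te2, bag1::te3);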

Pig min command and order by

I have data in the form: shell,$917.14,$654.23,2013
I have to find the minimum value in columns $1 and $2.
I tried ordering by these columns in ascending order, but the answer is not coming out correct. Can anyone please help?
Refer to MIN:
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT1: The data you are loading has '$' in it. You will either have to clean it up and load it into a float field to apply the MIN function, or load it into a chararray, replace the '$', cast it to float, and then apply the MIN function.
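A minimal sketch of the second option, cleaning each money field individually (assumes the same test5.txt input shown below):
A = LOAD 'test5.txt' USING PigStorage(',') AS (f1:chararray, f2:chararray, f3:chararray, f4:int, f5:int, f6:int);
B = FOREACH A GENERATE f1, (float)REPLACE(f2, '\\$', '') AS f2, (float)REPLACE(f3, '\\$', '') AS f3;
C = GROUP B ALL;
D = FOREACH C GENERATE MIN(B.f2), MIN(B.f3);
DUMP D;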
EDIT2: Here is the solution without removing the '$' from the original data, handling it in the Pig script instead.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript:
A = LOAD 'test5.txt' USING TextLoader() AS (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');   -- strip the '$' (and any other symbols)
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));                 -- split the cleaned line on commas
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;   -- cast the fields to usable types
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));   -- put the '$' back for display
DUMP D;
Output:
($817.12,$2105.57)

Merge two lines in Pig

I would like to write a Pig script for the below requirement.
Input is:
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
Output should be:
ABC,DEF,GHI,JKL
MNO,PQR,STU,VWX
Could anyone please help me?
It will be difficult to solve this problem using native Pig. One option could be to download the datafu-1.2.0.jar library and try the below approach.
input.txt
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
PigScript:
REGISTER /tmp/datafu-1.2.0.jar;
DEFINE BagSplit datafu.pig.bags.BagSplit();
A = LOAD 'input.txt' USING PigStorage(',') AS(f1,f2,f3,f4);
B = GROUP A ALL;                                           -- one bag holding all four rows
C = FOREACH B GENERATE FLATTEN(BagSplit(2,$1)) AS mybag;   -- split it into bags of two consecutive rows
D = FOREACH C GENERATE FLATTEN(STRSPLIT(REPLACE(BagToString(mybag),'_null_null_null_null',''),'_',4));   -- join each pair with '_', drop the nulls, split back into 4 fields
E = FOREACH D GENERATE $2,$3,$0,$1;                        -- restore the original column order
DUMP E;
Output:
(MNO,PQR,STU,VWX)
(ABC,DEF,GHI,JKL)
Note:
Based on the above input format, my assumption is that in the 1st row the last two columns are null, in the 2nd row the first two columns are null, and similarly for the 3rd and 4th rows.
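For what it's worth, a native-Pig sketch is possible on Pig 0.11+ using RANK, under the same assumption about which columns are null, plus the stronger assumption that the loader preserves file order (true for a single small file, but not a general guarantee):
A = LOAD 'input.txt' USING PigStorage(',') AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray);
lefts  = FILTER A BY (f1 IS NOT NULL) AND (f1 != '');   -- rows carrying the first two columns
rights = FILTER A BY (f1 IS NULL) OR (f1 == '');        -- rows carrying the last two columns
lr = RANK lefts;                                        -- prepend a 1-based row number
rr = RANK rights;
J = JOIN lr BY $0, rr BY $0;                            -- pair the i-th left row with the i-th right row
merged = FOREACH J GENERATE lr::f1, lr::f2, rr::f3, rr::f4;
DUMP merged;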

How to rename an alias in Pig

I just started learning Pig and I have a problem with renaming an alias. What I want to do is read a file, filter it, and then join it with itself. What I did is this:
register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar
raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/cse344-test-file' USING TextLoader as (line:chararray);
ntriples = foreach raw generate FLATTEN(myudfs.RDFSplit3(line)) as (subject:chararray,predicate:chararray,object:chararray);
ntriples2 = foreach raw generate FLATTEN(myudfs.RDFSplit3(line)) as (subject2:chararray,predicate2:chararray,object2:chararray);
X = FILTER ntriples BY (subject matches '.*business.*');
X2 = FILTER ntriples2 BY (subject2 matches '.*business.*');
joined= join X by subject, X2 by subject2;
joined = DISTINCT joined;
store joined into '/user/hadoop/join-results' using PigStorage();
As you can see, I read and filter the file twice in order to have two different aliases for each column. How can I simply copy the filtered relation and assign it new aliases? This operation was supposed to take 18 minutes, but took 1.5 hours.
I found the answer:
X = FILTER ntriples BY (subject matches '.*rdfabout\\.com.*') PARALLEL 50;
y = foreach X generate subject as subject2, predicate as predicate2, object as object2 PARALLEL 50;
This is how you make a copy of X and change the aliases.
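Putting it together, the original pipeline with the copy trick would look something like this (a sketch built from the aliases above; note that PARALLEL only affects operators that trigger a reduce, such as JOIN and DISTINCT, so it does nothing on a plain FOREACH):
register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar
raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/cse344-test-file' USING TextLoader as (line:chararray);
ntriples = FOREACH raw GENERATE FLATTEN(myudfs.RDFSplit3(line)) AS (subject:chararray,predicate:chararray,object:chararray);
X = FILTER ntriples BY (subject matches '.*business.*');
Y = FOREACH X GENERATE subject AS subject2, predicate AS predicate2, object AS object2;   -- the renamed copy
joined = JOIN X BY subject, Y BY subject2 PARALLEL 50;
joined = DISTINCT joined PARALLEL 50;
STORE joined INTO '/user/hadoop/join-results' USING PigStorage();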