Using the COV function in Pig - apache-pig

For some reason, I am not able to get a grasp of the proper syntax for this function.
I have a file called testing.txt:
1
2
3
4
5
6
7
8
I have a Pig script:
testing = load '/testing.txt' using PigStorage(',') as (var1:double);
t = foreach testing generate var1, var1 as var2;
grp = group t all;
result = foreach grp generate AVG(t.var1) as average, COV(t.var1,t.var2) as variance;
dump result;
This should give me the mean and variance.
I tried this as well:
testing = load '/testing.txt' using PigStorage(',') as (var1:double);
grp = group testing all;
result = foreach grp generate AVG(testing.var1) as average, COV(testing.var1,testing.var1) as variance;
dump result;
Both these scripts give me the same error:
ERROR 2078: Caught error from UDF: org.apache.pig.builtin.COV$Intermed [Caught exception in COV.Intermed]
I looked in the Java code and couldn't find anything out of the ordinary.
I was wondering how to use the COV function in Pig.
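In case it helps frame what I'm after: if all I need is the mean and the population variance of var1, I assume I could compute E[X^2] - (E[X])^2 with AVG instead of COV, something like the sketch below (this is only a workaround idea, not a fix for the COV error itself), but I would still like to understand COV:
testing = load '/testing.txt' as (var1:double);
sq = foreach testing generate var1, var1 * var1 as var1_sq;
grp = group sq all;
-- population variance = E[X^2] - (E[X])^2
result = foreach grp generate
    AVG(sq.var1) as average,
    AVG(sq.var1_sq) - AVG(sq.var1) * AVG(sq.var1) as variance;
dump result;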

Related

Pig script to calculate total, percentage along with group by

I'm relatively new to Pig scripts. I have the below script to derive the error details grouped by error code and error name, along with their respective counts.
A = LOAD 'traffic_error_details.txt' USING
PigStorage(',') as (id:int, error_code:chararray,error_name:chararray, error_status:int);
B = FOREACH A GENERATE A.error_code as errorCode,A.error_name as
errorName,A.error_status as errorStatus;
C = GROUP B by ($0,$1,$2);
F = FOREACH C GENERATE group, COUNT(B) as count;
Dump F;
The above gives results as below:
INVALID_PARAM,REQUEST_ERROR,10
INTERNAL_ERROR,SERVER_ERROR,15
NOT_ALLOWED,ACCESS_ERROR,4
UNKNOWN_ERR,UNKNOWN_ERROR,10
NIL,NIL,11
I would like to display the percentage of errors as well, as below:
INVALID_PARAM,REQUEST_ERROR,10,20%
INTERNAL_ERROR,SERVER_ERROR,15,30%
NOT_ALLOWED,ACCESS_ERROR,4,9%
UNKNOWN_ERR,UNKNOWN_ERROR,10,20%
NIL,NIL,11,21%
Here the total number of requests considered is 50, of which 21% are successful; the remaining percentages are the split-up of the errors.
So how do I calculate the total as well, in the same script and in the same tuple, so that the percentage can be calculated as (count/total)*100?
Total refers to the count of all records in traffic_error_details.txt.
After you've gotten counts for each error code, you need to do a GROUP ALL to find the total number of errors and add that field to every row. Then you can divide the error code counts by the total count to find the percentage. Make sure you convert the count variables from type long to type double to avoid integer division problems.
This is the code:
A = LOAD 'traffic_error_details.txt' USING PigStorage(',') as
(id:int, errorCode:chararray, errorName:chararray, errorStatus:int);
B = FOREACH A GENERATE errorCode, errorName, errorStatus;
C = GROUP B BY (errorCode, errorName, errorStatus);
D = FOREACH C GENERATE
FLATTEN(group) AS (errorCode, errorName, errorStatus),
COUNT(B) AS num;
E = GROUP D ALL;
F = FOREACH E GENERATE
FLATTEN(D) AS (errorCode, errorName, errorStatus, num),
SUM(D.num) AS num_total;
G = FOREACH F GENERATE
errorCode,
errorName,
errorStatus,
num,
(double)num / (double)num_total * 100 AS percent;
You'll notice I modified your code slightly. I grouped by (errorCode, errorName, errorStatus) instead of ($0,$1,$2). It's safer to refer to the field names themselves instead of their positions in case you modify your code in the future and the positions aren't the same.

Basic statistics with Apache Pig

I am trying to characterize fractions of rows having certain properties using Apache Pig.
For example, if the data looks like:
a,15
a,16
a,17
b,3
b,16
I would like to get:
a,0.6
b,0.4
I am trying to do the following:
A = LOAD 'my file' USING PigStorage(',');
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
which gives me total = (5), but then when I attempt to use this 'total':
fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;
I get an error.
Clearly COUNT() returns some kind of projection and both projections (in computing total and fractions) should be consistent. Is there a way to make this work? Or perhaps just to cast total to be a number and avoid this projection consistency requirement?
One more way to do the same:
test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
B = GROUP test by $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group,COUNT(test.$0);
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,(double)$1/$3;
Output:
(a,3,5,0.6)
(b,2,5,0.4)
You will have to project and cast it to double:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
fractions = FOREACH rows GENERATE rows.$0,(double)rows.$1/(double)total.$0;
The following modification of what @inquisitive-mind suggested works, likely because rows.$0 and rows.$1 inside the FOREACH are treated as scalar projections of the whole rows relation (which has more than one row and so fails), while total.$0 is a valid scalar cast since total has exactly one row:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;

Pig min command and order by

I have data in the form: shell,$917.14,$654.23,2013
I have to find out the minimum value in columns $1 and $2.
I tried to order by these columns in ascending order, but the answer is not coming out correct. Can anyone please help?
Refer MIN
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT1: The data you are loading has '$' in it. You will either have to clean it up and load it into a float field before applying the MIN function, or load it into a chararray, replace the '$', cast it to float, and then apply the MIN function.
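A minimal sketch of that second option (the REPLACE pattern and aliases here are only illustrative, assuming the same test1.txt layout as above) could look like:
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:chararray,f3:chararray,f4:int,f5:int,f6:int);
-- strip the '$' from the money columns and cast to float
A1 = FOREACH A GENERATE f1, (float)REPLACE(f2,'\\$','') as f2, (float)REPLACE(f3,'\\$','') as f3, f4, f5, f6;
B = GROUP A1 ALL;
C = FOREACH B GENERATE MIN(A1.f2), MIN(A1.f3);
DUMP C;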
EDIT2: Here is the solution without removing the $ in the original data but handling it in the PigScript.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));
DUMP D;
Output

Pig Latin split columns to rows

Is there any solution in Pig Latin to transform columns to rows to get the below?
Input:
id|column1|column2
1|a,b,c|1,2,3
2|d,e,f|4,5,6
required output:
id|column1|column2
1|a|1
1|b|2
1|c|3
2|d|4
2|e|5
2|f|6
thanks
I'm willing to bet this is not the best way to do this however ...
data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray,
col2:chararray);
A = foreach data generate id, flatten(TOKENIZE(col1));
B = foreach data generate id, flatten(TOKENIZE(col2));
RA = RANK A;
RB = RANK B;
store RA into 'ra_temp' using PigStorage(',');
store RB into 'rb_temp' using PigStorage(',');
data_a = load 'ra_temp/part-m-00000' using PigStorage(',');
data_b = load 'rb_temp/part-m-00000' using PigStorage(',');
jed = JOIN data_a BY $0, data_b BY $0;
final = foreach jed generate $1, $2, $5;
dump final;
(1,a,1)
(1,b,2)
(1,c,3)
(2,d,4)
(2,e,5)
(2,f,6)
store final into '~/some_dir' using PigStorage('|');
EDIT: I really like this question and was discussing it with a co-worker and he came up with a much simpler and more elegant solution. If you have Jython installed ...
# create file called udf.py
@outputSchema("innerBag:bag{innerTuple:(column1:chararray, column2:chararray)}")
def pigzip(column1, column2):
    c1 = column1.split(',')
    c2 = column2.split(',')
    innerBag = zip(c1, c2)
    return innerBag
Then in Pig
$ pig -x local
register udf.py using jython as udf;
data = load 'input' using PigStorage('|') as (id:chararray, column1:chararray,
column2:chararray);
result = foreach data generate id, flatten(udf.pigzip(column1, column2));
dump result;
store result into 'output' using PigStorage('|');

Pig Script - Min, Avg, Max

Let us say I have these in a file ...
1
2
3
Using a Pig script, how can I get this (number, minimum, mean, maximum in each line)?
1,1,2,3
2,1,2,3
3,1,2,3
Please let me know the Pig script. I am able to get the MIN, AVG, and MAX using Pig's built-in functions, but am not able to get them all in each line.
Thanks,
Naga
Use the TOBAG built-in UDF to get your fields into a bag, and then you can use the MIN, AVG, and MAX functions on that bag. You should have no trouble using all three summary functions on a single record.
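For example (only a sketch of that idea, assuming a hypothetical file numbers.txt whose rows each hold several numeric fields, e.g. lines like 1,2,3):
rows = LOAD 'numbers.txt' USING PigStorage(',') as (a:double, b:double, c:double);
-- TOBAG turns the fields of each row into a bag, so the aggregate
-- functions can run per record
stats = FOREACH rows GENERATE a, b, c,
    MIN(TOBAG(a, b, c)) as minimum,
    AVG(TOBAG(a, b, c)) as mean,
    MAX(TOBAG(a, b, c)) as maximum;
DUMP stats;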
Here is my simple solution for the problem.
I had the following numbers as input,
temp2.txt
1
2
3
4
5
.
.
16
17
18
19
20
I followed these steps:
1] Loaded the data from the file.
2] Grouped all the data.
3] Found the average, minimum, and maximum from the grouped data.
4] For each value in the loaded data, generated the value along with the minimum, maximum, and average.
The code is as follows:
grunt> data = load '/home/temp2.txt' as (val);
grunt> g = group data all;
grunt> avg = foreach g generate AVG(data.val) as a;
grunt> min = foreach g generate MIN(data.val) as m;
grunt> max = foreach g generate MAX(data.val) as x;
grunt> values = foreach data generate val,min.m,max.x,avg.a;
grunt> dump values;
The following is the output:
(1,1.0,20.0,10.5)
(2,1.0,20.0,10.5)
(3,1.0,20.0,10.5)
(4,1.0,20.0,10.5)
(5,1.0,20.0,10.5)
(6,1.0,20.0,10.5)
(7,1.0,20.0,10.5)
(8,1.0,20.0,10.5)
(9,1.0,20.0,10.5)
(10,1.0,20.0,10.5)
(11,1.0,20.0,10.5)
(12,1.0,20.0,10.5)
(13,1.0,20.0,10.5)
(14,1.0,20.0,10.5)
(15,1.0,20.0,10.5)
(16,1.0,20.0,10.5)
(17,1.0,20.0,10.5)
(18,1.0,20.0,10.5)
(19,1.0,20.0,10.5)
(20,1.0,20.0,10.5)