Find continuity of elements in Pig - apache-pig

how can i find the continuity of a field and starting position
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output i want is
A,1,1
B,3,2
C,2,5
Thanks.

Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray;index:int);
B = FOREACH (GROUP A BY value) GENERATE
group as value,
COUNT(A) as continuous_counts,
MIN(A.value) as start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data does have the possibility of discontinuous data, the solution is not longer trivial in native pig and you might need to write a UDF for that purpose.

Group and count the number of values for continous_counts. i.e.
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray;index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group as value,COUNT(A.value) as continuous_counts;
D = FOREACH B {
ordered = ORDER B BY index;
first = LIMIT ordered 1;
GENERATE first.value,first.index;
}
E = JOIN C BY value,D BY value;
F = FOREACH E GENERATE C::value,C::continuous_counts,D::index;
DUMP F;

Related

Pig script to calculate total, percentage along with group by

Relatively new to Pig scripts. I have below script to derive the error details grouped by Error Code, Name and their respective count.
A = LOAD 'traffic_error_details.txt' USING
PigStorage(',') as (id:int, error_code:chararray,error_name:chararray, error_status:int);
B = FOREACH A GENERATE A.error_code as errorCode,A.error_name as
errorName,A.error_status as errorStatus;
C = GROUP B by ($0,$1,$2);
F = FOREACH C GENERATE group, COUNT(B) as count;
Dump F;
Above would give results as below :
INVALID_PARAM,REQUEST_ERROR,10
INTERNAL_ERROR,SERVER_ERROR,15
NOT_ALLOWED,ACCESS_ERROR,4
UNKNOWN_ERR,UNKNOWN_ERROR,10
NIL,NIL,11
I would want to display percentage of errors as well. So as below :
INVALID_PARAM,REQUEST_ERROR,10,20%
INTERNAL_ERROR,SERVER_ERROR,15,30%
NOT_ALLOWED,ACCESS_ERROR,4,9%
UNKNOWN_ERR,UNKNOWN_ERROR,10,20%
NIL,NIL,11,21%
Here total number of requests considered is 50. Out of which 21% are successful. Remaining are splitup of Error %.
So how to calculate the total as well in the same script and in the same tuple ? so that % could be calculated as (count/total)*100.
Total refers to the count of all records error_details.txt.
After you've gotten counts for each error code, you would need to do a GROUP ALL to find the total number of errors and add that field to every row. Then you can divide the error code counts by the total count to find percent. Make sure you convert the counts variables from type long to type double to avoid any integer division problems.
This is the code:
A = LOAD 'traffic_error_details.txt' USING PigStorage(',') as
(id:int, errorCode:chararray, errorName:chararray, errorStatus:int);
B = FOREACH A GENERATE errorCode, errorName, errorStatus;
C = GROUP B BY (errorCode, errorName, errorStatus);
D = FOREACH C GENERATE
FLATTEN(group) AS (errorCode, errorName, errorStatus),
COUNT(B) AS num;
E = GROUP D ALL;
F = FOREACH E GENERATE
FLATTEN(D) AS (errorCode, errorName, errorStatus, num),
SUM(D.num) AS num_total;
G = FOREACH F GENERATE
errorCode,
errorName,
errorStatus,
num,
(double)num/(double)num_total AS percent;
You'll notice I modified your code slightly. I grouped by (errorCode, errorName, errorStatus) instead of ($0,$1,$2). It's safer to refer to the field names themselves instead of their positions in case you modify your code in the future and the positions aren't the same.

Basic statistics with Apache Pig

I am trying to characterize fractions of rows having certain properties using Apache Pig.
For example, if the data looks like:
a,15
a,16
a,17
b,3
b,16
I would like to get:
a,0.6
b,0.4
I am trying to do the following:
A = LOAD 'my file' USING PigStorage(',');
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
which gives me total = (5), but then when I attempt to use this 'total':
fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;
I get an error.
Clearly COUNT() returns some kind of projection and both projections (in computing total and fractions) should be consistent. Is there a way to make this work? Or perhaps just to cast total to be a number and avoid this projection consistency requirement?
One more way to do the same:
test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
B = GROUP test by $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group,COUNT(test.$0);
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,(double)($1*100/$3);
Output:
(a,3,5,0.6)
(b,2,5,0.4)
You will have to project and cast it to double:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
fractions = FOREACH rows GENERATE rows.$0,(double)rows.$1/(double)total.$0;
For some reason the following modification of what #inquisitive-mind suggested works:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;

Calculate percentage in pig

I have the below requirements.
Test data has the following values.
I need to find the percentage of each of the characters out of the total.
I have tried with the below query , but not to success.
Ex:
W
H
U
U
H
W
U
W
W
H
W
U
H
H
H
U
W
W
W
H
data = LOAD 'location of test data';
grp = GROUP data BY data.$0; // considering only 1 field in this csv.
result = FOREACH grp GENERATE group, COUNT(data.$0)/SUM(data.$0);
Since the fields are chararrays, I am not able to do the sum of the fields.
Is there an alternate to find one?
If I use a GROUP ALL, followed by COUNT(data.$0), I get the total number of entries.
If I use a GROUP of the field, followed by COUNT(data.$0), I get the individual count.
Here i need the percentage of this individual count by the sum.
Thanks in advance.
Here i need the percentage of this individual count by the sum.
To do this, you would need to run two Pig Operations I believe -
1) First as you said get the individual counts in one relation
W 8
H 7
U 5
2) Second, you count all the elements as you mentioned earlier in one relation
total 20
3) You then need to CROSS the relations obtained in first and two (CROSS) so that you have a new relation like this
W 8 20
H 7 20
U 5 20
4) Post this, you can calculate the percentage that you wanted.
Update
Below is the Pig script that I came up with.
A = LOAD 'data.txt' using PigStorage('\n');
--DUMP A;
B = GROUP A by $0;
C = FOREACH B GENERATE group, COUNT(A.$0);
--DUMP C;
D = GROUP A ALL;
E = FOREACH D GENERATE group,COUNT(A.$0);
DUMP E;
DESCRIBE C;
DESCRIBE E;
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,($1*100/$3);
DESCRIBE G;
DUMP G;
you have to do that manually,
something like
data = foreach data generate *, ((B=='b1')?1:0) AS dummy_b1;
data = foreach data generate *, mean(dummy_b1) AS percentage;

Pig min command and order by

I have data in the form of shell, $917.14,$654.23,2013
I have to find out the minimum value in column $1 and $2
I tried to do a order by these columns by asc order
But the answer is not coming out correct. Can anyone please help?
Refer MIN
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT1: The data you are loading has '$' in it.You will either have to clean it up and load it to a float field to apply MIN function or load it into a chararray and replace the '$' and then cast it to float and apply the MIN function.
EDIT2: Here is the solution without removing the $ in the original data but handling it in the PigScript.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));
DUMP D;
Output

Sort tuples in a bag based on multiple fileds

I am trying to sort tuples inside a bag based on three fields in descending order..
Example : Suppose I have the following bag created by grouping:
{(s,3,my),(w,7,pr),(q,2,je)}
I want to sort the tuples in the above grouped bag based on $0,$1,$2 fields in such a way that first it will sort on $0 of all the tuples. It will pick the tuple with largest $0 value. If $0 are same for all the tuples then it will sort on $1 and so on.
The sorting should be for all the grouped bags through iterating process.
Suppose if we have databag something like:
{(21,25,34),(21,28,64),(21,25,52)}
Then according to the requirement output should be like:
{(21,25,34),(21,25,52),(21,28,64)}
Please let me know if you need any more clarification
Order your tuple in a nested foreach. This will work.
Input:
(1,s,3,my)
(1,w,7,pr)
(1,q,2,je)
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
od = ORDER A BY b, c, d;
GENERATE od;
};
DUMP C Result(which resembles your data):
({(1,s,3,my),(1,w,7,pr),(1,q,2,je)})
Output:
({(1,q,2,je),(1,s,3,my),(1,w,7,pr)})
This will work for all the cases.
Generate tuple with highest value:
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
od = ORDER A BY b desc , c desc , d desc;
od1 = LIMIT od 1;
GENERATE od1;
};
dump D;
Generate tuple with highest value if all the three fields are different, if all the tuples are same or if field 1 and field2 are same then return all the tuple.
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
F = RANK C; //rank used to separate out the value if two tuples are same
R = FOREACH F {
dis = distinct A;
GENERATE rank_C,COUNT(dis) AS (cnt:long),A;
};
R3 = FILTER R BY cnt!=1; // filter if all the tuples are same
R4 = FOREACH R3 {
fil1 = ORDER A by b desc, c desc, d desc;
fil2 = LIMIT fil1 1;
GENERATE rank_C,fil2;
}; // find largest tuple except if all the tuples are same.
R5 = FILTER R BY cnt==1; // only contains if all the tuples are same
R6 = FOREACH R5 GENERATE A ; // generate required fields
F1 = FOREACH F GENERATE rank_C,FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); // group by field 1,field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long) ,F1; // if count = 2 then Tuples are same on field 1 and field 2
F4 = FILTER F3 BY cnt1==2; //separate that alone
F5 = FOREACH F4 {
DIS = distinct F1;
GENERATE flatten(DIS);
};
F8 = JOIN F BY rank_C, F5 by rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = cross R4,F5; // cross done to genearte if all the tuples are different
Z1 = FILTER Z BY R4::rank_C!=F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
res = UNION Z2,R6,F9; // Z2 - contains value if all the three fields in the tuple are diff holds highest value,
//R6 - contains value if all the three fields in the tuple are same
//F9 - conatains if two fields of the tuples are same
dump res;