Calculate percentage in Pig - apache-pig

I have the following requirement: my test data contains the values shown below, and I need to find the percentage of each character out of the total. I have tried the query below, but without success.
Example data:
W
H
U
U
H
W
U
W
W
H
W
U
H
H
H
U
W
W
W
H
data = LOAD 'location of test data';
grp = GROUP data BY data.$0; -- considering only one field in this CSV
result = FOREACH grp GENERATE group, COUNT(data.$0)/SUM(data.$0);
Since the fields are chararrays, I am not able to take a SUM of them. Is there an alternative way to do this?
If I use GROUP ALL followed by COUNT(data.$0), I get the total number of entries.
If I GROUP by the field and then use COUNT(data.$0), I get the individual counts.
What I need is the percentage of each individual count out of the total.
Thanks in advance.

What I need is the percentage of each individual count out of the total.
To do this, I believe you would need to run two Pig operations and combine them:
1) First, as you said, get the individual counts in one relation:
W 8
H 7
U 5
2) Second, count all the elements, as you mentioned earlier, in another relation:
total 20
3) You then need to CROSS the relations obtained in steps one and two so that you have a new relation like this:
W 8 20
H 7 20
U 5 20
4) After this, you can calculate the percentage that you wanted.
Update
Below is the Pig script that I came up with.
A = LOAD 'data.txt' using PigStorage('\n');
--DUMP A;
B = GROUP A by $0;
C = FOREACH B GENERATE group, COUNT(A.$0);
--DUMP C;
D = GROUP A ALL;
E = FOREACH D GENERATE group,COUNT(A.$0);
DUMP E;
DESCRIBE C;
DESCRIBE E;
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,($1*100/$3);
DESCRIBE G;
DUMP G;
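One caveat: COUNT returns a long, so ($1*100/$3) is integer division and truncates the percentage. A minimal tweak, assuming the aliases above, if you want a fractional percentage:
-- cast to double before dividing to keep the fractional part
G = FOREACH F GENERATE $0, $1, $3, ((double)$1 * 100.0 / (double)$3);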

You have to do that manually, with an indicator column and an average; something like:
data = FOREACH data GENERATE *, ((B == 'b1') ? 1 : 0) AS dummy_b1;
grouped = GROUP data ALL;
result = FOREACH grouped GENERATE AVG(data.dummy_b1) AS percentage;
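As a minimal sketch, here is how that indicator idea could be applied to the single-column data in the original question (the file name and the choice of 'W' are assumptions for illustration):
-- fraction of rows equal to 'W', via an indicator column and AVG over a GROUP ALL
data = LOAD 'test_data.txt' AS (c:chararray); -- hypothetical path
flagged = FOREACH data GENERATE c, ((c == 'W') ? 1.0 : 0.0) AS is_w;
all_grp = GROUP flagged ALL;
pct_w = FOREACH all_grp GENERATE AVG(flagged.is_w) * 100.0 AS pct_w;
DUMP pct_w; -- (40.0) for the sample data: 8 of the 20 rows are 'W'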

Related

Find continuity of elements in Pig

How can I find the continuity of a field and its starting position?
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output I want is
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
B = FOREACH (GROUP A BY value) GENERATE
    group AS value,
    COUNT(A) AS continuous_counts,
    MIN(A.index) AS start_index;
STORE B INTO 'output' USING PigStorage(',');
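With the sample input, the stored output should match the expected result from the question (row order may vary):
A,1,1
B,3,2
C,2,5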
If your data does have the possibility of discontinuous values, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
Group by value and count the number of rows to get continuous_counts, i.e.
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray, index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group AS value, COUNT(A.value) AS continuous_counts;
D = FOREACH B {
    ordered = ORDER A BY index;
    first = LIMIT ordered 1;
    GENERATE FLATTEN(first) AS (value, index);
};
E = JOIN C BY value, D BY value;
F = FOREACH E GENERATE C::value, C::continuous_counts, D::index;
DUMP F;

Pig script to calculate total, percentage along with group by

I'm relatively new to Pig scripts. I have the script below to derive error details grouped by error code and name, along with their respective counts.
A = LOAD 'traffic_error_details.txt' USING PigStorage(',')
    AS (id:int, error_code:chararray, error_name:chararray, error_status:int);
B = FOREACH A GENERATE error_code AS errorCode, error_name AS errorName, error_status AS errorStatus;
C = GROUP B BY ($0, $1, $2);
F = FOREACH C GENERATE group, COUNT(B) AS count;
DUMP F;
The above gives results as below:
INVALID_PARAM,REQUEST_ERROR,10
INTERNAL_ERROR,SERVER_ERROR,15
NOT_ALLOWED,ACCESS_ERROR,4
UNKNOWN_ERR,UNKNOWN_ERROR,10
NIL,NIL,11
I want to display the percentage of errors as well, as below:
INVALID_PARAM,REQUEST_ERROR,10,20%
INTERNAL_ERROR,SERVER_ERROR,15,30%
NOT_ALLOWED,ACCESS_ERROR,4,9%
UNKNOWN_ERR,UNKNOWN_ERROR,10,20%
NIL,NIL,11,21%
Here the total number of requests considered is 50, of which 21% are successful; the remaining rows are the split-up of error percentages.
So how do I calculate the total in the same script and carry it in the same tuple, so that the percentage can be calculated as (count/total)*100? Total refers to the count of all records in traffic_error_details.txt.
After you've gotten counts for each error code, you would need to do a GROUP ALL to find the total number of errors and add that field to every row. Then you can divide the error code counts by the total count to find percent. Make sure you convert the counts variables from type long to type double to avoid any integer division problems.
This is the code:
A = LOAD 'traffic_error_details.txt' USING PigStorage(',') as
(id:int, errorCode:chararray, errorName:chararray, errorStatus:int);
B = FOREACH A GENERATE errorCode, errorName, errorStatus;
C = GROUP B BY (errorCode, errorName, errorStatus);
D = FOREACH C GENERATE
FLATTEN(group) AS (errorCode, errorName, errorStatus),
COUNT(B) AS num;
E = GROUP D ALL;
F = FOREACH E GENERATE
FLATTEN(D) AS (errorCode, errorName, errorStatus, num),
SUM(D.num) AS num_total;
G = FOREACH F GENERATE
errorCode,
errorName,
errorStatus,
num,
(double)num/(double)num_total AS percent;
You'll notice I modified your code slightly. I grouped by (errorCode, errorName, errorStatus) instead of ($0,$1,$2). It's safer to refer to the field names themselves instead of their positions in case you modify your code in the future and the positions aren't the same.
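If you also want the rounded, '%'-suffixed strings shown in the desired output, a hedged follow-up step (the alias H and the formatting choice are mine, not part of the original answer):
-- format the fraction as a rounded percentage string such as '20%'
H = FOREACH G GENERATE
    errorCode,
    errorName,
    num,
    CONCAT((chararray)ROUND(percent * 100.0), '%') AS percent_str;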

PIG flatten vs group on nested bag

I'm learning Pig and I have a question that I know might be answered in books, but unfortunately I don't have the time to do the research.
I have two pipelines:
(option 1):
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid;
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
e = foreach d generate flatten(year) as year, event, mpg;
f = group e by year;
g = foreach f generate group, AVG(e.mpg);
x = limit g 10;
dump x;
I load two files, join them, then take the last two digits of the date to get the year; after that I use FLATTEN to simplify things before grouping to get the average of mpg.
(option 2):
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid;
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
f = group d by year;
g = foreach f generate group, AVG(d.mpg);
x = limit g 10;
dump x;
Same thing, but I don't use FLATTEN before grouping and then getting the average of mpg.
I get the same results, but is there a significant difference? In this case the dataset I used is not big, but I'm curious about what would happen if I had a couple of million records.
Thanks.

Selecting random tuple from bag

Is it possible to (efficiently) select a random tuple from a bag in Pig?
I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection.
One (inefficient) solution is to count the number of tuples in the bag, take a random number within that range, loop through the bag, and stop when the number of iterations matches the random number. Does anyone know of a faster or better way to do this?
You could use RANDOM(), ORDER and LIMIT in a nested FOREACH statement to select one element with the smallest random number:
inpt = load 'group.txt' as (id:int, c1:bytearray, c2:bytearray);
groups = group inpt by id;
randoms = foreach groups {
rnds = foreach inpt generate *, RANDOM() as rnd; -- assign random number to each row in the bag
ordered_rnds = order rnds by rnd;
one_tuple = limit ordered_rnds 1; -- select tuple with the smallest random number
generate group as id, one_tuple;
};
dump randoms;
INPUT:
1 a r
1 a t
1 b r
1 b 4
1 e 4
1 h 4
1 k t
2 k k
2 j j
3 a r
3 e l
3 j l
4 a r
4 b t
4 b g
4 h b
4 j d
5 h k
OUTPUT:
(1,{(1,b,r,0.05172709255901231)})
(2,{(2,k,k,0.14351660053632986)})
(3,{(3,e,l,0.0854104195792681)})
(4,{(4,h,b,8.906013598960483E-4)})
(5,{(5,h,k,0.6219490873384448)})
If you run "dump randoms;" multiple times, you should get different results for each run.
Writing a UDF might give you better performance, as you would not need to do a secondary sort on the random value within the bag.
I needed to do this myself, and surprisingly found that a very simple answer seems to work to get about 10% of an alias A:
B = FILTER A BY RANDOM() < 0.1;
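Note that this samples roughly 10% of the rows of the whole relation A rather than picking one tuple per bag; for whole-relation sampling, Pig's built-in SAMPLE operator is an equivalent alternative:
-- roughly 10% of A's rows, chosen at random
B = SAMPLE A 0.1;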

How to optimize a GROUP BY statement in Pig Latin?

I have a skewed data set and I need to do a GROUP BY operation and then a nested FOREACH on it. Because of the skewed data, a few reducers take a long time while the others take almost none. I know there is a skewed join, but is there anything similar for GROUP BY and FOREACH? Here is my Pig code (variables renamed):
foo_grouped = GROUP foo_grouped BY FOO;
FOO_stats = FOREACH foo_grouped
{
    a_FOO_total = foo_grouped.ATTR;
    a_FOO_total = DISTINCT a_FOO_total;
    bar_count = foo_grouped.BAR;
    bar_count = DISTINCT bar_count;
    a_FOO_type1 = FILTER foo_grouped BY COND1 == 'Y';
    a_FOO_type1 = a_FOO_type1.ATTR;
    a_FOO_type1 = DISTINCT a_FOO_type1;
    a_FOO_type2 = FILTER foo_grouped BY COND2 == 'Y' OR COND3 == 'HIGH';
    a_FOO_type2 = a_FOO_type2.ATTR;
    a_FOO_type2 = DISTINCT a_FOO_type2;
    GENERATE group AS FOO,
             COUNT(a_FOO_total) AS a_FOO_total,
             COUNT(a_FOO_type1) AS a_FOO_type1,
             COUNT(a_FOO_type2) AS a_FOO_type2,
             COUNT(bar_count) AS bar_count;
}
In your example there are a lot of nested DISTINCT operators within the FOREACH. They are executed in the reducer, which relies on RAM to calculate unique values, and the query produces just one MapReduce job. With too many unique elements in a group you could also get memory-related exceptions.
Luckily, Pig Latin is a dataflow language and you write a sort of execution plan. In order to utilize more CPUs you can change your code in a way that forces more MapReduce jobs which can be executed in parallel. For that we rewrite the query without the nested DISTINCTs: the trick is to do the DISTINCT operations and then the GROUP BY as if you had just one column at a time, and then merge the results. It is very SQL-like, but it works. Here it is:
records = LOAD '....' USING PigStorage(',') AS (g, a, b, c, d, fd, s, w);
selected = FOREACH records GENERATE g, a, b, c, d;
grouped_a = FOREACH selected GENERATE g, a;
grouped_a = DISTINCT grouped_a;
grouped_a_count = GROUP grouped_a BY g;
grouped_a_count = FOREACH grouped_a_count GENERATE FLATTEN(group) as g, COUNT(grouped_a) as a_count;
grouped_b = FOREACH selected GENERATE g, b;
grouped_b = DISTINCT grouped_b;
grouped_b_count = GROUP grouped_b BY g;
grouped_b_count = FOREACH grouped_b_count GENERATE FLATTEN(group) as g, COUNT(grouped_b) as b_count;
grouped_c = FOREACH selected GENERATE g, c;
grouped_c = DISTINCT grouped_c;
grouped_c_count = GROUP grouped_c BY g;
grouped_c_count = FOREACH grouped_c_count GENERATE FLATTEN(group) as g, COUNT(grouped_c) as c_count;
grouped_d = FOREACH selected GENERATE g, d;
grouped_d = DISTINCT grouped_d;
grouped_d_count = GROUP grouped_d BY g;
grouped_d_count = FOREACH grouped_d_count GENERATE FLATTEN(group) as g, COUNT(grouped_d) as d_count;
mrg = JOIN grouped_a_count BY g, grouped_b_count BY g, grouped_c_count BY g, grouped_d_count BY g;
out = FOREACH mrg GENERATE grouped_a_count::g, grouped_a_count::a_count, grouped_b_count::b_count, grouped_c_count::c_count, grouped_d_count::d_count;
STORE out into '....' USING PigStorage(',');
After execution I got the following summary, which shows that the DISTINCT operations did not suffer from the skew in the data and were processed by the first job:
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201206061712_0244 669 45 75 8 13 376 18 202 grouped_a,grouped_b,grouped_c,grouped_d,records,selected DISTINCT,MULTI_QUERY
job_201206061712_0245 1 1 3 3 3 12 12 12 grouped_c_count GROUP_BY,COMBINER
job_201206061712_0246 1 1 3 3 3 12 12 12 grouped_b_count GROUP_BY,COMBINER
job_201206061712_0247 5 1 48 27 33 30 30 30 grouped_a_count GROUP_BY,COMBINER
job_201206061712_0248 1 1 3 3 3 12 12 12 grouped_d_count GROUP_BY,COMBINER
job_201206061712_0249 4 1 3 3 3 12 12 12 mrg,out HASH_JOIN ...,
Input(s):
Successfully read 52215768 records (44863559501 bytes) from: "...."
Output(s):
Successfully stored 9 records (181 bytes) in: "..."
From the Job DAG we can see that the GROUP BY operations were executed in parallel:
Job DAG:
job_201206061712_0244 -> job_201206061712_0248,job_201206061712_0246,job_201206061712_0247,job_201206061712_0245,
job_201206061712_0248 -> job_201206061712_0249,
job_201206061712_0246 -> job_201206061712_0249,
job_201206061712_0247 -> job_201206061712_0249,
job_201206061712_0245 -> job_201206061712_0249,
job_201206061712_0249
It works fine on my datasets where one of the group key values (in column g) makes up 95% of the data. It also gets rid of memory-related exceptions.
I recently ran into an error with this join: if there are any nulls in the group key, the corresponding records are dropped from the joined relations.
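A hedged workaround, assuming the aliases from the script above: substitute a sentinel for null group keys right after the projection, so those records are not silently dropped by the inner JOIN (the sentinel value 'NULL_KEY' is illustrative, not part of the original script):
-- replace null group keys with a sentinel so the inner JOIN keeps those records
selected = FOREACH selected GENERATE
    (((chararray)g IS NULL) ? 'NULL_KEY' : (chararray)g) AS g, a, b, c, d;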