Pig: summing column b of rows with the same column a - apache-pig

I'm trying to count the number of tweets with a certain hashtag over a period of time but I'm getting an error when trying to use the built-in SUM function.
Example:
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto by count;
X = FOREACH NBLNabilVoto GENERATE group, SUM(data.count);
Error:
<line 22, column 47> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.

First Load the data then filter for the time interval you want to process. Group the record based on the hashtag. Use count() function to count the number of twitter for the corresponding hashtag.

I am not sure that the code is doing what you think or want it to do but the error you are getting is because you are doing a SUM on the wrong thing. You need to do this
X = FOREACH NBLNabilVoto GENERATE group, SUM(NBLNabilVoto_count.count);
NBLNabilVoto_count is the name of the tuples in the databag

i think you are using the wrong realtion in your SUM, you could SUM NBLNabilVoto_count not data realtion. i have question why you are groupping by COUNT ?
if you want count all your tweet with hashtag NBLNabilVoto.
i think the code must be like :
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto by all;
X = FOREACH NBLNabilVoto GENERATE group, SUM(NBLNabilVoto_count.count.count);

Related

Pig calculating avg of delay fails

I have a file for airplanes data, having airplane dest and delay(delay can be negative or positve number)
A = load ‘flightdelays’ using Pigstorage(‘,’);
B = foreach a generate $14 as delay:int, $17 as dest:chararray;
C = group b all; -- this is failing for cast error, also get an error failed to read data from input file..
D =foreach c generate b.dest, AVG(b.delay);
When i execute this , i get 0 records read from source file and mapreduce job failed..
Why is it not able to calculate AVG?
Check the extension/path of the file.Is your file comma separated? Also,there are plenty of case issues with your script.
PigStorage - s is small in your load statement.
A = load ‘flightdelays’ using PigStorage(‘,’);
B = foreach a generate $14 as delay:int, $17 as dest:chararray;
There is no relation called a,b,c.You are loading data to relation A and so on.
1st thing A,a treated differently(in pig relation names are case sensitive) and 2nd thing while calculating Aggregate function on relation and group by on any attribute..
In FOREACH you should specify grouping attribute and aggregate function..
In this scenario you used group by all so you can't use b.dest along with aggregate function..
If you want destination wise AVG() delay then you should group by dest..

How to count and then compute the total average in PIG

Each line in my dataset is a sale and my goal is to compute the average time a client buys during his lifetime.
I have already grouped and counted by clientId like this:
byClientId = GROUP sales BY clientId;
countByClientId = FOREACH byClientId GENERATE group, count($1);
This creates a table with 2 columns: clientId, count of transactions.
Now, I am trying to get the total average of the second column (i.e. the overall average of sales to same client). I am using this code:
groupCount = GROUP countByClientId all;
avg = foreach groupCount generate AVG($1);
But I get this error message:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<line 18, column 31> Could not infer the matching function for org.apache.pig.builtin.AVG
as multiple or none of them fit. Please use an explicit cast.
How to get the overall average of the second column?
It would have been simpler for us with a sample of input data.. I created my own, to be sure that my solution would work. You only have one mistake : once you grouped all your schema become group:chararray,countByClientId:bag{:tuple(group:chararray,:long)}
So, $1 refers to a bag and this is why you can't compute the mean. If you want to access $1 (which is the second element) inside this bag you have two choices, either $1.$1, or countByClientId.$1. So your last line should be :
avg = foreach groupCount generate AVG(countByClientId.$1);
I hope it's clear.

What's the effective way to count rows in Pig?

In Pig, what is the effective way to get count? We can do a GROUP ALL, but this is given only 1 reducer. When the data size is very large,say n Terabytes, can we try multiple reducers somehow?
dataCount = FOREACH (GROUP data ALL) GENERATE
'count' as metric,
COUNT(dataCount) as value;
Instead of using directly a GROUP ALL, you could divide it into two steps. First, group by some field and count the number of rows. And then, perform a GROUP ALL to sum all of these counts. This way, you would be able to count the number of rows in parallel.
Note, however, that if the field you use in the first GROUP BY does not have duplicates, the resulting counts will all be of 1 so there wont be any difference. Try using a field that has many duplicates to improve its performance.
See this example:
a;1
a;2
b;3
b;4
b;5
If we first group by the first field, which has duplicates, the final COUNT will deal with 2 rows instead of 5:
A = load 'data' using PigStorage(';');
B = group A by $0;
C = foreach B generate COUNT(A);
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
However, if we group by the second one, which is unique, it will deal with 5 rows:
A = load 'data' using PigStorage(';');
B = group A by $1;
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
I just dig a bit more in this topic, and it seems you don't have to afraid that a single reducer will have to process enormous amount of data if you're using an up-to-date pig version.
The algebraic UDF-s will handle the COUNT smart, and it's calculated on the mapper. So the reducer just have to deal with the aggregated data (counts/mapper).
I think it's introduced in 0.9.1, but 0.14.0 definitely has it
Algebraic Interface
An aggregate function is an eval function that takes a bag and returns
a scalar value. One interesting and useful property of many aggregate
functions is that they can be computed incrementally in a distributed
fashion. We call these functions algebraic. COUNT is an example of an
algebraic function because we can count the number of elements in a
subset of the data and then sum the counts to produce a final output.
In the Hadoop world, this means that the partial computations can be
done by the map and combiner, and the final result can be computed by
the reducer.
But my previous answer was definitely wrong:
In the grouping you can use the PARALLEL n keyword this set the
number of reducers.
Increase the parallelism of a job by specifying the number of reduce
tasks, n. The default value for n is 1 (one reduce task).

finding the sum of columns for each row in pig

I need to find the sum of columns in a every row.
Consider the data set
A,1,5,45,25,20
B,5,50,5,23,12
C,1,25,4,15,23
I am trying to get the output as below
(A,96)
(B,95)
(C,68)
I cannot use built in SUM function for this. Should I write custom UDF or is there any other way to do this
You can define the schema and try the below approach.
input:
A,1,5,45,25,20
B,5,50,5,23,12
C,1,25,4,15,23
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(f1:chararray,f2:int,f3:int,f4:int,f5:int,f6:int);
B = FOREACH A GENERATE f1,SUM(TOBAG(f2..));
DUMP B;
Output:
(A,96)
(B,95)
(C,68)

Pig FILTER returns empty bag that I can't COUNT

I'm trying to count how many values in a data set match a filter condition, but I'm running into issues when the filter matches no entries.
There are a lot of columns in my data structure, but there's only three of use for this example: key - data key for the set (not unique), value - float value as recorded, nominal_value - float representing the nominal value.
Our use case right now is to find the number of values that are 10% or more below the nominal value.
I'm doing something like this:
filtered_data = FILTER data BY value <= (0.9 * nominal_value);
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT(filtered_data.value);
DUMP filtered_count;
In most cases, there are no values that fall outside of the nominal range, so filtered_data is empty (or null. Not sure how to tell which.). This results in filtered_count also being empty/null, which is not desirable.
How can I construct a statement that will return a value of 0 when filtered_data is empty/null? I've tried a couple of options that I've found online:
-- Extra parens in COUNT required to avoid syntax error
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT((filtered_data.value is null ? {} : filtered_data.value));
which results in:
Two inputs of BinCond must have compatible schemas. left hand side: #1259:bag{} right hand side: #1261:bag{#1260:tuple(cf#1038:float)}
And:
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE (filtered_data.value is null ? 0 : COUNT(filtered_data.value));
which results in an empty/null result.
The way you have it set up right now, you will lose information about any keys for which the count of bad values is 0. Instead, I'd recommend preserving all keys, so that you can see positive confirmation that the count was 0, instead of inferring it by absence. To do that, just use an indicator and then SUM that:
data2 =
FOREACH data
GENERATE
key,
((value <= 0.9*nominal_value) ? 1 : 0) AS bad;
bad_count = FOREACH (GROUP data2 BY key) GENERATE group, SUM(data2.bad);