What's the effective way to count rows in Pig? - apache-pig

In Pig, what is an effective way to get a row count? We can do a GROUP ALL, but this is given only 1 reducer. When the data size is very large, say n terabytes, can we use multiple reducers somehow?
dataCount = FOREACH (GROUP data ALL) GENERATE
'count' as metric,
COUNT(data) as value;

Instead of using GROUP ALL directly, you could split the count into two steps. First, group by some field and count the rows in each group. Then perform a GROUP ALL to sum all of these counts. This way, the first stage counts rows in parallel.
Note, however, that if the field you use in the first GROUP BY has no duplicates, every resulting count will be 1, so nothing is gained. Try a field that has many duplicates to improve performance.
See this example:
a;1
a;2
b;3
b;4
b;5
If we first group by the first field, which has duplicates, the final SUM will deal with 2 rows instead of 5:
A = load 'data' using PigStorage(';');
B = group A by $0;
C = foreach B generate COUNT(A);
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
However, if we group by the second field, which is unique, the final SUM will deal with 5 rows:
A = load 'data' using PigStorage(';');
B = group A by $1;
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
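The two-step count above can be combined with the PARALLEL clause so that the first GROUP BY really is spread over several reducers. This is only a sketch: the reducer count of 10 is an arbitrary assumption, and the path and delimiter are the ones from the example.

```pig
-- 'data' and the ';' delimiter follow the example above.
A = LOAD 'data' USING PigStorage(';');

-- First stage: the groups are distributed over 10 reducers (10 is an assumed value).
B = GROUP A BY $0 PARALLEL 10;
C = FOREACH B GENERATE COUNT(A) AS cnt;   -- one partial count per group

-- Second stage: GROUP ALL still uses one reducer, but it only sees one row per group.
D = GROUP C ALL;
E = FOREACH D GENERATE SUM(C.cnt) AS total;
DUMP E;
```

The single reducer in the second stage only sums as many rows as there are distinct keys in the first stage, which is why a field with many duplicates helps.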

I just dug a bit more into this topic, and it seems you don't have to worry about a single reducer processing an enormous amount of data if you're using an up-to-date Pig version.
Algebraic UDFs handle COUNT smartly: the partial counts are calculated on the mappers, so the reducer only has to deal with the aggregated data (the counts per mapper).
I think this was introduced in 0.9.1, but 0.14.0 definitely has it.
Algebraic Interface
An aggregate function is an eval function that takes a bag and returns
a scalar value. One interesting and useful property of many aggregate
functions is that they can be computed incrementally in a distributed
fashion. We call these functions algebraic. COUNT is an example of an
algebraic function because we can count the number of elements in a
subset of the data and then sum the counts to produce a final output.
In the Hadoop world, this means that the partial computations can be
done by the map and combiner, and the final result can be computed by
the reducer.
But my previous answer was definitely wrong:
In the grouping you can use the PARALLEL n keyword; this sets the
number of reducers.
Increase the parallelism of a job by specifying the number of reduce
tasks, n. The default value for n is 1 (one reduce task).
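To see whether the combiner is actually used for a plain GROUP ALL count, one option is to inspect the plan with EXPLAIN. A minimal sketch, assuming a hypothetical input path 'data' and ';' as delimiter:

```pig
-- Hypothetical input; any relation works the same way.
data = LOAD 'data' USING PigStorage(';');

-- COUNT is algebraic, so partial counts can be computed map-side.
-- Note: COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple.
dataCount = FOREACH (GROUP data ALL) GENERATE
    'count' AS metric,
    COUNT(data) AS value;

-- EXPLAIN prints the logical, physical and MapReduce plans instead of running the job.
EXPLAIN dataCount;
```

If the algebraic path applies, the MapReduce plan should show a combine stage between map and reduce; the exact plan output depends on the Pig version.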

Related

Pig calculating avg of delay fails

I have a file of airplane data containing the destination and delay of each flight (the delay can be a negative or positive number).
A = load ‘flightdelays’ using Pigstorage(‘,’);
B = foreach a generate $14 as delay:int, $17 as dest:chararray;
C = group b all; -- this is failing for cast error, also get an error failed to read data from input file..
D =foreach c generate b.dest, AVG(b.delay);
When I execute this, I get 0 records read from the source file and the MapReduce job fails.
Why is it not able to calculate AVG?
Check the extension/path of the file. Is your file comma separated? Also, there are plenty of case issues in your script.
PigStorage: the s is lowercase in your load statement.
A = load 'flightdelays' using PigStorage(',');
B = foreach a generate $14 as delay:int, $17 as dest:chararray;
There are no relations called a, b or c. You are loading data into relation A, and so on.
First, A and a are treated differently (relation names in Pig are case sensitive). Second, when you apply an aggregate function to a grouped relation, the FOREACH should generate the grouping attribute and the aggregate function.
In this scenario you used GROUP ALL, so you can't use b.dest alongside the aggregate function.
If you want the average delay per destination, you should group by dest.
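Putting the points above together, a corrected script might look like the following. It is a sketch that keeps the path, delimiter and column positions ($14, $17) from the question; the explicit casts and the avg_delay alias are my own choices.

```pig
-- Straight quotes, correct PigStorage spelling, and consistent upper-case relation names.
A = LOAD 'flightdelays' USING PigStorage(',');

-- Cast the two columns of interest explicitly.
B = FOREACH A GENERATE (int)$14 AS delay, (chararray)$17 AS dest;

-- Group by destination so that each group carries a bag of its delays.
C = GROUP B BY dest;

-- 'group' holds the destination; AVG runs over the delays in the bag B.
D = FOREACH C GENERATE group AS dest, AVG(B.delay) AS avg_delay;
DUMP D;
```

If the file has a header row, or if values in column 14 are not integers, the cast yields null for those rows and AVG ignores them.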

Apache Pig - Store/Flatten Bag so it can output as CSV

Not a great question title I admit.
Here's my problem, I have the following output from a query, where each row is like:
{(570349476329862),(570349476329862),(570349476329862)} {(66638102521614253348753),(66638102521614253348753),(66638102521614253348753)} 3
The schema of the above is:
{{(ID1:chararray)},{(ID2:chararray)},COUNT:long}
What I'm trying to do is generate output in a CSV format so that it can be easily ingested into a database, e.g. turn the above into:
570349476329862,66638102521614253348753,3
I think I want to flatten the bags but although this 'compiles' it doesn't run.
Any ideas welcome.
Thanks
If the bag holds the same value repeated, e.g. as the result of a group, you can do one of two things:
Include the given fields in the grouping, then you won't need to deal with the bags at all:
...
B = FOREACH (GROUP A BY (COUNT, ID1, ID2))
GENERATE FLATTEN(group) AS (COUNT, ID1, ID2),
...
Or use a built-in function, e.g. MAX:
...
B = FOREACH (GROUP A BY COUNT) GENERATE
FLATTEN(group) AS COUNT,
MAX(A.ID1) AS ID1,
MAX(A.ID2) AS ID2,
...
The benefit of this compared with the suggested DataFu function is that you can do it with a built-in function.
I hope this helps
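Since the upstream script is not shown in the question, here is a self-contained sketch of the first option. The input file 'pairs', its layout (one ID1,ID2 pair per line), the cnt alias and the output directory 'pairs_csv' are all assumptions made for illustration.

```pig
-- Hypothetical input: one (ID1, ID2) pair per line, comma-separated.
A = LOAD 'pairs' USING PigStorage(',') AS (ID1:chararray, ID2:chararray);

-- Both IDs are part of the group key, so no bags of repeated IDs are produced.
B = FOREACH (GROUP A BY (ID1, ID2)) GENERATE
    FLATTEN(group) AS (ID1, ID2),
    COUNT(A) AS cnt;

-- PigStorage(',') writes one comma-separated line per tuple.
STORE B INTO 'pairs_csv' USING PigStorage(',');
```

Each output line then has the form ID1,ID2,count, which is the shape the question asks for.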

How to filter an alias by a set of IDs?

I have a huge alias HUGE which has a field ID.
I also have an alias COUNTS, indexed by ID.
I want to create an alias FILTERED which is identical to HUGE but only containing the IDs with small counts, i.e., something like:
A = join HUGE by ID, COUNTS by ID;
B = filter A by COUNTS::N < 1000;
FILTERED = foreach B generate HUGE::*;
The problem is that HUGE is, well, huge (1B rows), and the number of IDs I am removing is relatively small.
So, instead of the hugely expensive join, I want to be able to do something like:
C = foreach (filter COUNTS by N >= 1000) generate ID;
FILTERED = filter HUGE by ID not in C;
Here C is relatively small (say, 10k rows).
How do I do this?
Since C (rows of COUNTS with N >= 1000) is quite small, you can use a replicated join so that it is performed in memory with no reduce phase. This will add minimal processing to whatever else you are doing with HUGE.
*Note that in your example you filtered by N >= 1000 but you stated you wanted IDs with small counts. Not sure which of those is what you intended.
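A replicated join used as a "not in" filter could be sketched as follows. The schemas and paths are hypothetical (the question does not give them), and the left outer join followed by a null check is my assumption about how to apply the suggestion to the exclusion case.

```pig
-- Hypothetical schemas and paths; only ID and N are named in the question.
HUGE   = LOAD 'huge'   USING PigStorage('\t') AS (ID:chararray, payload:chararray);
COUNTS = LOAD 'counts' USING PigStorage('\t') AS (ID:chararray, N:long);

-- Small relation holding the IDs to exclude.
C = FOREACH (FILTER COUNTS BY N >= 1000) GENERATE ID;

-- Fragment-replicate join: the last relation (C) is held in memory on each mapper,
-- so HUGE is streamed through the map phase without a reduce phase.
J = JOIN HUGE BY ID LEFT OUTER, C BY ID USING 'replicated';

-- Rows of HUGE with no match in C have a null C::ID; those are the rows to keep.
K = FILTER J BY C::ID IS NULL;
FILTERED = FOREACH K GENERATE HUGE::ID AS ID, HUGE::payload AS payload;
```

The small relation must be listed last in the JOIN and must fit in memory; if the IDs with small counts are wanted instead, flip the filter on COUNTS as the note above suggests.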

How do I convert a column to a tuple in PIG

I have a Pig question about converting the columns of a table into tuples so that I can pass them to a UDF. Details as follows:
There is a result "C" which looks like the following if I do "dump C":
(a1,b1,c1)
(a2,b2,c2)
I want to extract every combination of 2 columns as follows:
(a1,a2,a3), (b1,b2,b3), (c1,c2,c3)
and then call a UDF on each possible pair of tuples:
UDF((a1,a2,a3), (b1,b2,b3))
UDF((a1,a2,a3), (c1,c2,c3))
UDF((c1,c2,c3), (b1,b2,b3))
How do I do this in PIG?
You can get all of the values for a given "column" by using GROUP .. ALL and then using bag projection:
grpd = GROUP C ALL;
-- Inside the FOREACH, the grouped bag is named after the relation (C), not grpd.
udfs =
    FOREACH grpd
    GENERATE
        UDF(C.a, C.b),
        UDF(C.a, C.c),
        UDF(C.c, C.b);
Note, however, that the values for each column will be stored in bags rather than tuples. This is proper, because relations in Pig do not guarantee that the records are ordered in any particular way. So your UDF should be comparing bags and not rely on the order of the elements.
However, it may be important that you be able to compare values that were originally in the same row, i.e., match up a1 with b1, etc. For this, you will need to write your UDF to take a single bag, with each tuple containing the paired elements a_n and b_n. To do this, use bag projection of two columns:
grpd = GROUP C ALL;
-- Projecting two columns at once keeps values from the same row in the same tuple.
udfs =
    FOREACH grpd
    GENERATE
        UDF(C.(a,b)),
        UDF(C.(a,c)),
        UDF(C.(c,b));
Again, the tuples will not necessarily be in their original order, so you should not rely on ordering. Your bag will contain the tuples (a1,b1), (a2,b2), etc.

Projecting Grouped Tuples in Pig

I have a collection of tuples of the form (t,a,b) that I want to group by b in Pig. Once grouped, I want to filter out b from the tuples in each group and generate a bag of filtered tuples per group.
As an example, assume we have
(1,2,1)
(2,0,1)
(3,4,2)
(4,1,2)
(5,2,3)
The pig script would produce
{(1,2),(2,0)}
{(3,4),(4,1)}
{(5,2)}
The question is: how do I go about producing this result? I'm used to seeing examples where aggregation operations follow a group by operation. It's less clear to me how to filter the tuples and return them in a bag. Thanks for your assistance!
Turns out what I was looking for is the syntax for nested projection in Pig.
If one has tuples of the form (t,a,b) and wants to drop b after the group by, it is done this way.
grouped = GROUP tups BY b;
result = FOREACH grouped GENERATE tups.(t,a);
See the "Nested Projection" section on the PigLatin page. http://wiki.apache.org/pig/PigLatin
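For completeness, here is the nested projection as a full script. It is a sketch that assumes a hypothetical comma-separated file 'tuples' holding the five sample rows from the question.

```pig
-- Hypothetical input file with rows such as 1,2,1 and 2,0,1.
tups = LOAD 'tuples' USING PigStorage(',') AS (t:int, a:int, b:int);

-- Each group gets a bag, named after the relation (tups), with the full (t,a,b) tuples.
grouped = GROUP tups BY b;

-- Nested projection keeps only t and a from every tuple in the bag.
result = FOREACH grouped GENERATE tups.(t, a);
DUMP result;
```

Each output row is a single bag of (t,a) pairs for one value of b; the order of tuples inside a bag is not guaranteed.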