How do I convert a column to a tuple in Pig?

I have a Pig question related to converting the columns of a relation into tuples so that I can pass them to a UDF. Details as follows:
There is a relation "C" which looks like the following if I do "dump C":
(a1,b1,c1)
(a2,b2,c2)
(a3,b3,c3)
I want to extract every column as follows:
(a1,a2,a3), (b1,b2,b3), (c1,c2,c3)
and then call a UDF on each possible pair of these tuples:
UDF((a1,a2,a3), (b1,b2,b3))
UDF((a1,a2,a3), (c1,c2,c3))
UDF((c1,c2,c3), (b1,b2,b3))
How do I do this in Pig?

You can get all of the values for a given "column" by using GROUP .. ALL and then projecting that column out of the grouped bag (the bag takes the name of the original relation, C):
grpd = GROUP C ALL;
udfs = FOREACH grpd GENERATE
           UDF(C.a, C.b),
           UDF(C.a, C.c),
           UDF(C.c, C.b);
Note, however, that the values for each column will be stored in bags rather than tuples. This is proper, because relations in Pig do not guarantee that the records are ordered in any particular way. So your UDF should compare bags and should not rely on the order of the elements.
However, it may be important that you be able to compare values that were originally in the same row, i.e., match up a1 with b1, and so on. For this, you will need to write your UDF to take a single bag, with each tuple containing the paired elements a_n and b_n. To do this, use bag projection of two columns:
grpd = GROUP C ALL;
udfs = FOREACH grpd GENERATE
           UDF(C.(a, b)),
           UDF(C.(a, c)),
           UDF(C.(c, b));
Again, the tuples will not necessarily be in order, and you should not rely on their order; but the bag will contain the tuples (a1,b1), (a2,b2), etc., pairing up the values that came from the same row.
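For instance, with the data above, projecting two columns from the grouped bag yields a single record holding a bag of the paired tuples (a minimal sketch, assuming the fields are named a and b):
grpd  = GROUP C ALL;
pairs = FOREACH grpd GENERATE C.(a, b);
dump pairs;
-- ({(a1,b1),(a2,b2),(a3,b3)})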

Related

Hive UDF to generate all possible ordered combinations from the list

I am trying to figure out how to write a Hive UDF that takes a list as input and outputs a list of all 2-way ordered combinations of the elements in the list.
Input:
list_variable_b
[5142430,5146974,5141766]
Output:
list_variable_b
[(5142430,5146974),(5146974,5141766),(5142430,5141766)]
So you're asking how to write a UDF that can take an array<bigint> and turn it into an array<struct<bigint,bigint>> or an array<array<bigint>>.
It sounds like you want what's called "n choose k", which will produce n!/((n-k)!k!) elements; for your 3-element list with k = 2, that's 3 pairs.
Now, Hive has two kinds of UDFs. The simple kind can only process primitive (non-collection) types; but here you are processing an array, so you'll need a generic UDF. Generic UDFs can do much more than simple UDFs, but they are also more difficult to write. A good guide on how to do it is here: http://www.baynote.com/2012/11/a-word-from-the-engineers/
Another way would be to use a double LATERAL VIEW with the caveat that all the elements in the array have to be unique for this to work.
If the table is
create table xx ( col array<int>);
such that
select * from xx;
OK
[5142430,5146974,5141766]
Using a double lateral view to do the cartesian product of the array with itself, then keeping only the pairs where one element is bigger than the other:
select a1, b1 from xx
lateral view explode(col) a as a1
lateral view explode(col) b as b1
where a1 < b1;
5142430 5146974
5141766 5142430
5141766 5146974
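If duplicates are possible, a variant of the same idea (a sketch, assuming a Hive version with posexplode available) is to compare array positions instead of values, so that repeated elements still pair up:
select a1, b1 from xx
lateral view posexplode(col) a as pa, a1
lateral view posexplode(col) b as pb, b1
where pa < pb;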

What's the effective way to count rows in Pig?

In Pig, what is the effective way to get a count? We can do a GROUP ALL, but that gets only 1 reducer. When the data size is very large, say n terabytes, can we try multiple reducers somehow?
dataCount = FOREACH (GROUP data ALL) GENERATE
    'count' as metric,
    COUNT(data) as value;
Instead of using a GROUP ALL directly, you could divide the count into two steps: first, group by some field and count the number of rows per group; then, perform a GROUP ALL to sum all of those counts. This way, the counting itself is done in parallel.
Note, however, that if the field you use in the first GROUP BY has no duplicates, the resulting counts will all be 1, so there won't be any difference. Try to use a field with many duplicates to improve performance.
See this example:
a;1
a;2
b;3
b;4
b;5
If we first group by the first field, which has duplicates, the final COUNT will deal with 2 rows instead of 5:
A = load 'data' using PigStorage(';');
B = group A by $0;
C = foreach B generate COUNT(A);
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
However, if we group by the second one, which is unique, it will deal with 5 rows:
A = load 'data' using PigStorage(';');
B = group A by $1;
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
I just dug a bit more into this topic, and it seems you don't have to be afraid that a single reducer will have to process an enormous amount of data if you're using an up-to-date Pig version.
Algebraic UDFs handle the COUNT smartly: it is partially calculated on the mappers, so the reducer only has to deal with the aggregated data (one count per mapper).
I think this was introduced in 0.9.1, but 0.14.0 definitely has it:
Algebraic Interface
An aggregate function is an eval function that takes a bag and returns
a scalar value. One interesting and useful property of many aggregate
functions is that they can be computed incrementally in a distributed
fashion. We call these functions algebraic. COUNT is an example of an
algebraic function because we can count the number of elements in a
subset of the data and then sum the counts to produce a final output.
In the Hadoop world, this means that the partial computations can be
done by the map and combiner, and the final result can be computed by
the reducer.
But my previous answer was definitely wrong:
In the grouping you can use the PARALLEL n keyword; this sets the
number of reducers.
Increase the parallelism of a job by specifying the number of reduce
tasks, n. The default value for n is 1 (one reduce task).
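For reference, the quoted syntax looks like this. It sets the reduce parallelism of the grouping, but a GROUP ... ALL still sends every record to a single group (hence a single reducer does the final aggregation), which is why the advice above doesn't help here:
B = GROUP A BY $0 PARALLEL 10;  -- request 10 reduce tasks for this GROUP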

What is the difference between `::` and `.` in pig?

What is the difference between :: and . in pig?
When do I use one vs the other?
E.g., I know that :: is needed in a join when a field exists in both aliases:
A = foreach (join B by (x), C by (y)) generate B::y as b_y, C::y as c_y;
and I need . when accessing group fields:
A = foreach (group B by (x,y)) generate group.x as x, group.y as y, SUM(B?z) as z;
However, do I pass B::z or B.z to SUM above instead of B?z?
In Pig, :: is used as a disambiguation tool after operations which could possibly create naming collisions. Notably, this happens with JOIN, CROSS, and FLATTEN. Consider two relations, A:{(id:int, name:chararray)} and B:{(id:int, location:chararray)}. If you want to associate names with locations, naturally you would do:
C = JOIN A BY id, B BY id;
Without the disambiguation operator, your schema would be
C:{(id:int, name:chararray, id:int, location:chararray)}
Now you can't tell which field id refers to. To avoid this, Pig will instead do
C:{(A::id:int, A::name:chararray, B::id:int, B::location:chararray)}
Likewise, you could FLATTEN two bags whose tuples have fields with the same name, and they would also collide. So the same operator is used in this case as well. When there is no such conflict, you do not need to use the full name: name is unambiguous here. To simplify C, then, you can do this:
D = FOREACH C GENERATE A::id, name, location;
The . operator, by contrast, projects fields from bags and tuples. If you have a bag b with schema {(x:int, y:int, z:int)}, the projection b.y yields a bag with just the specified field: {(y:int)}. You can project multiple fields at once with parentheses: b.(y,z) yields {(y:int, z:int)}.
When used with tuples, the result is a tuple with just the specified fields. If the tuple t has schema (x:int, y:int, z:int), then t.x is the tuple (x:int) and t.(y,z) is the tuple (y:int, z:int).
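A minimal sketch of bag projection (the relation and field names here are assumed for illustration, not taken from the question):
-- B has schema (x: int, y: int, z: int)
grouped = GROUP B BY x;                        -- fields: group, B (a bag)
ys      = FOREACH grouped GENERATE B.y;        -- bag of one-field tuples: {(y: int)}
yz      = FOREACH grouped GENERATE B.(y, z);   -- bag of two-field tuples: {(y: int, z: int)}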
To your specific question about SUM: note that SUM, along with the other summary-statistic UDFs, takes a bag as its argument. Therefore, you need to create a bag containing just the one field per tuple that you want to sum, using the projection operator .: that is, B.z.
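Applied to the statement from the question, that gives:
A = foreach (group B by (x,y)) generate group.x as x, group.y as y, SUM(B.z) as z;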
IIRC you get :: as a side effect of some statements. You don't need to worry about it unless (as you mentioned) the same name exists under two different prefixes.
The . is different in that you are going inside the structure.
group.x as x, group.y as y is equivalent to FLATTEN(group)
SUM(B?z): here you should write SUM(B.z), to specify that you want to SUM a particular field of the bag.
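For example, the FLATTEN form of the statement from the question would be (a sketch):
A = foreach (group B by (x,y)) generate FLATTEN(group) as (x, y), SUM(B.z) as z;
Without the as (x, y), the flattened fields come out prefixed as group::x and group::y, but the rows are the same.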

Projecting Grouped Tuples in Pig

I have a collection of tuples of the form (t,a,b) that I want to group by b in Pig. Once grouped, I want to filter out b from the tuples in each group and generate a bag of filtered tuples per group.
As an example, assume we have
(1,2,1)
(2,0,1)
(3,4,2)
(4,1,2)
(5,2,3)
The pig script would produce
{(1,2),(2,0)}
{(3,4),(4,1)}
{(5,2)}
The question is: how do I go about producing this result? I'm used to seeing examples where aggregation operations follow a group by operation. It's less clear to me how to filter the tuples and return them in a bag. Thanks for your assistance!
Turns out what I was looking for is the syntax for nested projection in Pig.
If one has tuples of the form (t,a,b) and wants to drop b after the group by, it is done this way:
grouped = GROUP tups BY b;
result = FOREACH grouped GENERATE tups.(t,a);
See the "Nested Projection" section on the PigLatin page. http://wiki.apache.org/pig/PigLatin

Is it possible to cross-join a row in a relation with a tuple in that row in Pig?

I have a set of data that shows users, collections of fruit they like, and home city:
Alice\tApple:Orange\tSacramento
Bob\tApple\tSan Diego
Charlie\tApple:Pineapple\tSacramento
I would like to create a Pig query that correlates the number of users that enjoy types of fruits in different cities, where the results from the query for the data above would look like this:
Apple\tSacramento\t2
Apple\tSan Diego\t1
Orange\tSacramento\t1
Pineapple\tSacramento\t1
The part I can't figure out is how to cross join the split fruit rows with the rest of the data from the same row, so:
Alice\tApple:Orange\tSacramento
becomes:
Alice\tApple\tSacramento
Alice\tOrange\tSacramento
I know I can use TOKENIZE to split the string 'Apple:Orange' into the tuple ('Apple', 'Orange'), but I don't know how to get the cross product of that tuple with the rest of the row ('Alice').
One brute-force solution I came up with is to use the streaming to run the input collection through an external program, and handle the "cross join" to produce multiple rows per row there.
This seems like it should be unnecessary though. Are there better ideas?
You should use FLATTEN, which works great with TOKENIZE to do stuff like this.
b = FOREACH a GENERATE name, FLATTEN(TOKENIZE(fruits, ':')) as fruit, city;
FLATTEN takes a bag and "flattens" it out across different rows. TOKENIZE breaks your fruits out into a bag (not a tuple like you said); pass ':' as its second argument (supported in Pig 0.10+), since your fruits are colon-separated rather than whitespace-separated. FLATTEN then gives the cross-like behavior you are looking for. I point out that it is a bag and not a tuple, because FLATTEN is overloaded and behaves differently with tuples.
I first learned of the FLATTEN/TOKENIZE technique in the canonical word count example, which tokenizes each line and then flattens the words out into rows.
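Putting it together, a minimal end-to-end sketch (the load schema and file name are assumed):
a = LOAD 'users' USING PigStorage('\t') AS (name:chararray, fruits:chararray, city:chararray);
b = FOREACH a GENERATE FLATTEN(TOKENIZE(fruits, ':')) AS fruit, city;
c = FOREACH (GROUP b BY (fruit, city)) GENERATE group.fruit, group.city, COUNT(b);
dump c;
-- (Apple,Sacramento,2)
-- (Apple,San Diego,1)
-- (Orange,Sacramento,1)
-- (Pineapple,Sacramento,1)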