Pig: how to avoid blank values in a COGROUP - apache-pig

I am new to Pig.
I ran into an issue while doing a COGROUP.
I am trying to COGROUP two files, and the keys I am using for the COGROUP contain null values.
Below are my input files:
Input_file_1 :
a|b||
e|f||
Input_file_2 :
a|b||
e|f||
I am using all four columns as the key for the COGROUP. (The last two columns are blank.)
My expected output is two records, but I am getting four records as output.
Can anyone please help me avoid blank values while doing a COGROUP in Pig?
Thanks in advance.

Null values get special treatment in Pig.
Alan Gates, the author of the book Programming Pig, says:
cogroup handles null values in the keys similarly to group and unlike
join. That is, all records with a null value in the key will be collected together.
Keys with nulls that come from different relations are still kept apart, however, so the output of COGROUP would be
((a,b,,),{(a,b,,)},{})
((a,b,,),{},{(a,b,,)})
((e,f,,),{(e,f,,)},{})
((e,f,,),{},{(e,f,,)})
In your case, you have to use JOIN instead of COGROUP, which gives you the following result:
(a,b,,,a,b,,)
(e,f,,,e,f,,)
Then generate the required values.
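One caveat worth checking before relying on JOIN: the Pig documentation states that inner joins filter out records with null keys, so blank key columns may not match there either. A common alternative is to replace the nulls with a placeholder before grouping (in Pig, for example, with a bincond in a FOREACH before the COGROUP). The sketch below is a plain-Python model of that idea, not Pig code. The `SENTINEL` value and all function names are hypothetical, and the `cogroup` function only imitates the behaviour seen in the output above, where keys containing nulls from different inputs stay separate.

```python
from collections import defaultdict

# Hypothetical placeholder; pick a value that cannot occur in the real data.
SENTINEL = "__EMPTY__"


def parse(line):
    """Split a pipe-delimited line; blank fields become None (null)."""
    return tuple(field if field != "" else None for field in line.split("|"))


def fill_nulls(record):
    """Replace each null field with the sentinel so keys compare as ordinary values."""
    return tuple(SENTINEL if field is None else field for field in record)


def cogroup(left, right):
    """Model of a two-input cogroup keyed on the whole record, in which a key
    containing a null never matches the same key from the other input."""
    groups = defaultdict(lambda: ([], []))
    for side, records in enumerate((left, right)):
        for rec in records:
            has_null = any(field is None for field in rec)
            # Tag null-bearing keys with their input index to keep them apart.
            key = (rec, side if has_null else None)
            groups[key][side].append(rec)
    return [(key[0], bags[0], bags[1]) for key, bags in groups.items()]


file_1 = [parse(line) for line in ["a|b||", "e|f||"]]
file_2 = [parse(line) for line in ["a|b||", "e|f||"]]

raw = cogroup(file_1, file_2)
filled = cogroup([fill_nulls(r) for r in file_1], [fill_nulls(r) for r in file_2])

print(len(raw), len(filled))  # 4 2
```

With the nulls replaced, each key yields a single group holding the record from both inputs. The placeholder can be mapped back to a blank when generating the final output.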

Related

Is there a way to select a line in a database with null in a field only if there is no line with values?

I have a DB2 table that stores brand names. Each line has a supplier ID and a country, so I can use those to filter the brands.
The lines also contain a customer ID and a group.
Either the customer is filled (and the group is NULL), the group is filled (and the customer is NULL), or both are NULL.
What I'm trying to do is this: get all the lines that match a specific supplier and country, but with a special filter on the customer and group.
If I only have lines with NULL in both fields, return all of these lines.
If some lines have the group specified, return only the lines with that group (and ignore those with a NULL group).
BUT if some lines have the customer specified, return only the lines with the customer specified and ignore the others (the lines with the group or the lines with NULL in both fields).
Is that doable in SQL, specifically DB2, or do I need to get ALL the lines and filter later during processing (in this case, in PHP)?
Since you want me to show some code, here's what I came up with after some research (I had never used "WITH" before):
https://dbfiddle.uk/?rdbms=db2_11.1&fiddle=e7a54ff6eec8ed4be2594644ba3224bd
EDIT: I finally got it working by myself, though I don't know if it's the best way to do it:
https://dbfiddle.uk/?rdbms=db2_11.1&fiddle=ce7c99d8a37e3dad90900440127730e4
EDIT 2: I merged the 3 subqueries into 1 by using CASE. I think it's better, but I'm not sure.
https://dbfiddle.uk/?rdbms=db2_11.1&fiddle=b581532553b5619d9cdc388d54619836
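For comparison, the "filter later during processing" option can be sketched in a few lines. This is a minimal Python model of the priority rule described above, not the query from the fiddles. It assumes the rows have already been restricted to the supplier, the country and the customer or group of interest, and the field names `brand`, `customer` and `group` are hypothetical.

```python
def select_brand_rows(rows):
    """Pick rows by priority: customer-specific, else group-specific, else generic."""
    with_customer = [r for r in rows if r["customer"] is not None]
    if with_customer:
        return with_customer
    with_group = [r for r in rows if r["group"] is not None]
    if with_group:
        return with_group
    # Only rows with NULL in both fields can remain at this point.
    return [r for r in rows if r["customer"] is None and r["group"] is None]


# Made-up sample rows for illustration only.
rows = [
    {"brand": "Generic", "customer": None, "group": None},
    {"brand": "GroupBrand", "customer": None, "group": "G1"},
    {"brand": "CustomerBrand", "customer": "C1", "group": None},
]

print([r["brand"] for r in select_brand_rows(rows)])      # ['CustomerBrand']
print([r["brand"] for r in select_brand_rows(rows[:2])])  # ['GroupBrand']
print([r["brand"] for r in select_brand_rows(rows[:1])])  # ['Generic']
```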

How to get the first tuple inside a bag when Grouping

I do not understand how to deal with duplicates when generating my output: I end up with several duplicates, but I want only one.
I've tried using LIMIT, but I suppose that only applies when selecting. I also tried DISTINCT, but I guess that is the wrong scenario.
grouped = GROUP wantedTails BY tail_number;
smmd = FOREACH grouped GENERATE wantedTails.tail_number as Tails, SUM(wantedTails.distance) AS totaldistance;
So I get something like this (not the whole output):
({(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB)},44550)
but I expect (N983JB,44550). How can I remove the duplicates generated during grouping? Thank you!
The way I see it, there are two ways to de-duplicate data in Pig.
A less flexible but convenient way is to apply MAX to the columns that need to be de-duplicated after performing a GROUP BY. Apply SUM only if you want to add up values across duplicates:
dataWithDuplicates = LOAD '<path_to_data>';
grouped = GROUP dataWithDuplicates BY tail_number;
dedupedData = FOREACH grouped GENERATE
    -- Since you have grouped on tail_number, the group key is already de-duped
    group AS tailNumber,
    MAX(dataWithDuplicates.distance) AS dedupedDistance,
    SUM(dataWithDuplicates.distance) AS totalDistance;
If you want more flexibility while de-duping, you can use a nested FOREACH in Pig. This question captures the gist of its usage: how to delete the rows of data which is repeating in Pig. Other references for nested FOREACH: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html
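The reason the first approach returns one row per tail number is that it emits the group key itself rather than the bag of repeated values. A minimal Python model of that GROUP BY plus MAX/SUM step is sketched below. The flight rows are made up for illustration, and `summarize` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up (tail_number, distance) rows, for illustration only.
flights = [("N983JB", 2475), ("N983JB", 2475), ("N12345", 300)]


def summarize(rows):
    """Group rows by tail number and emit (key, max distance, total distance)."""
    distances = defaultdict(list)
    for tail_number, distance in rows:
        distances[tail_number].append(distance)
    # One output tuple per key, so the key appears exactly once.
    return sorted((tail, max(ds), sum(ds)) for tail, ds in distances.items())


print(summarize(flights))  # [('N12345', 300, 300), ('N983JB', 2475, 4950)]
```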

Pentaho Adding summary rows

Any idea how to summarize data in a Pentaho transformation and then insert the summary row directly under the group being summarized?
I can use a Group By step and get a summarised result stream having one row per key field, but what I want is each sorted group written to the output and the summary row inserted underneath, thus preserving the input.
In the Group By, you can do 'Include all Rows', but this just appends the summary fields to the end of each existing row. It does not create new summary rows.
Thanks in advance
To get the summary rows to appear under the group-by blocks, you have to use some tricks, such as introducing a numeric "order" field and setting its value to 1 for the original data and to 2 for the subtotal rows.
Also, in the group-by/subtotals stream I am generating a sum field, say "subtotal". You have to make sure to include this field as a blank in your regular stream as well, or else the metadata will diverge and the final merge will not work.
Here is the best explanation I have found for this pattern:
https://www.packtpub.com/books/content/pentaho-data-integration-4-working-complex-data-flows
You will need to copy the rows to a different stream, and then merge or join them again, to make the summary a separate row.
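The pattern described above (copy the rows, aggregate one copy, give both streams the same fields, then merge and sort on the key plus an "order" field) can be modelled outside Pentaho. The following is a rough Python sketch of the data flow only, not of any Pentaho step. The field names `key`, `amount`, `subtotal` and `order` and the sample rows are assumptions.

```python
def with_subtotals(rows):
    """Return detail rows followed by one subtotal row per key."""
    # Detail stream: order = 1, with a blank "subtotal" so both streams share a layout.
    detail = [
        {"key": key, "amount": amount, "subtotal": None, "order": 1}
        for key, amount in rows
    ]
    # Summary stream: one row per key, order = 2, with a blank "amount".
    totals = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    summary = [
        {"key": key, "amount": None, "subtotal": total, "order": 2}
        for key, total in totals.items()
    ]
    # Merge, then sort by key and order so each subtotal lands under its group.
    # sorted() is stable, so detail rows keep their input order within a key.
    return sorted(detail + summary, key=lambda r: (r["key"], r["order"]))


# Made-up input rows: (key, amount).
for row in with_subtotals([("A", 10), ("B", 5), ("A", 20)]):
    print(row)
```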

How do I convert a column to a tuple in PIG

I have a Pig question related to converting columns of tables into tuples so that I can pass them to a UDF. Details follow.
There is a result "C" which looks like the following if I do "dump C":
(a1,b1,c1)
(a2,b2,c2)
I want to extract every combination of 2 columns as follows:
(a1,a2,a3), (b1,b2,b3), (c1,c2,c3)
and then call a UDF on each possible pair of tuples:
UDF((a1,a2,a3), (b1,b2,b3))
UDF((a1,a2,a3), (c1,c2,c3))
UDF((c1,c2,c3), (b1,b2,b3))
How do I do this in Pig?
You can get all of the values for a given "column" by using GROUP .. ALL and then using bag projection:
grpd = GROUP C ALL;
-- After GROUP, the bag inside each record of grpd is named after the input relation (C)
udfs =
    FOREACH grpd
    GENERATE
        UDF(C.a, C.b),
        UDF(C.a, C.c),
        UDF(C.c, C.b);
Note, however, that the values for each column will be stored in bags rather than tuples. This is proper, because relations in Pig do not guarantee that the records are ordered in any particular way. So your UDF should be comparing bags and not rely on the order of the elements.
However, it may be important that you be able to compare values that were originally in the same row; i.e., match up a1 with b1, etc. For this, you will need to write your UDF to take a single bag, with each tuple containing the paired elements a_n and b_n. To do this, use bag projection of two columns:
grpd = GROUP C ALL;
udfs =
    FOREACH grpd
    GENERATE
        UDF(C.(a,b)),
        UDF(C.(a,c)),
        UDF(C.(c,b));
Again, the tuples will not necessarily be in order, so you should not rely on their ordering. Your bag will contain the tuples (a1,b1), (a2,b2), etc.
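To see why the paired projection is the safer input for a UDF, here is a small Python model of bag projection. It is only an illustration, not Pig code: the sample rows, the `COLUMNS` mapping and the `project` helper are all hypothetical, and the shuffle stands in for the fact that a bag has no guaranteed order.

```python
import random

# Made-up stand-in for relation C, with columns a, b and c.
C = [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")]
COLUMNS = {"a": 0, "b": 1, "c": 2}


def project(bag, *names):
    """Model of bag projection: keep only the named columns of every tuple."""
    return [tuple(row[COLUMNS[name]] for name in names) for row in bag]


# A bag has no defined order, so shuffle to imitate an arbitrary arrangement.
shuffled = C[:]
random.Random(0).shuffle(shuffled)

# Single-column projection: one bag of 1-tuples per column.
col_a = project(shuffled, "a")

# Two-column projection: values from the same row stay together,
# whatever order the tuples arrive in.
pairs_ab = project(shuffled, "a", "b")

print(sorted(pairs_ab))  # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
```

A UDF that receives `pairs_ab` can match a_n with b_n directly, whereas two separately projected column bags carry no such guarantee.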

Projecting Grouped Tuples in Pig

I have a collection of tuples of the form (t,a,b) that I want to group by b in Pig. Once grouped, I want to filter out b from the tuples in each group and generate a bag of filtered tuples per group.
As an example, assume we have
(1,2,1)
(2,0,1)
(3,4,2)
(4,1,2)
(5,2,3)
The pig script would produce
{(1,2),(2,0)}
{(3,4),(4,1)}
{(5,2)}
The question is: how do I go about producing this result? I'm used to seeing examples where aggregation operations follow a group by operation. It's less clear to me how to filter the tuples and return them in a bag. Thanks for your assistance!
Turns out what I was looking for is the syntax for nested projection in Pig.
If one has tuples of the form (t,a,b) and wants to drop b after the group by, it is done this way.
grouped = GROUP tups BY b;
result = FOREACH grouped GENERATE tups.(t,a);
See the "Nested Projection" section on the PigLatin page. http://wiki.apache.org/pig/PigLatin
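As a cross-check of what the nested projection produces, the same group-then-project step can be modelled in Python using the example tuples from the question. This is a sketch of the semantics only, and `group_and_project` is a hypothetical name.

```python
from collections import defaultdict

# The (t, a, b) tuples from the question.
tups = [(1, 2, 1), (2, 0, 1), (3, 4, 2), (4, 1, 2), (5, 2, 3)]


def group_and_project(records):
    """Group (t, a, b) records by b and keep only (t, a) in each group's bag."""
    bags = defaultdict(list)
    for t, a, b in records:
        bags[b].append((t, a))
    return dict(bags)


for key, bag in sorted(group_and_project(tups).items()):
    print(key, bag)
# 1 [(1, 2), (2, 0)]
# 2 [(3, 4), (4, 1)]
# 3 [(5, 2)]
```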