Reference columns in a FOREACH after a JOIN? - apache-pig

A = load 'a.txt' as (id, a1);
B = load 'b.txt as (id, b1);
C = join A by id, B by id;
D = foreach C generate id,a1,b1;
dump D;
4th line fails on:
Invalid field projection. Projected field [id] does not exist in schema
I tried to change to A.id but then the last line fails on: ERROR 0: Scalar has more than one row in the output.

What you are looking for is the "Disambiguate Operator". What you want is A::id, not A.id.
A.id says "there is a relation/bag A and there is a column called id in its schema"
A::id says "there is a record from A and that has a column called id"
So, you would do:
A = load 'a.txt' as (id, a1);
B = load 'b.txt as (id, b1);
C = join A by id, B by id;
D = foreach C generate A::id,a1,b1;
dump D;
A dirty alternative:
Just because I'm lazy, and disambiguation gets really weird when you start doing multiple joins one after another: use unique identifiers.
A = load 'a.txt' as (ida, a1);
B = load 'b.txt as (idb, b1);
C = join A by ida, B by idb;
D = foreach C generate ida,a1,b1;
dump D;

Related

How do I find intersection of a bag and an element without UDF in Pig?

I have say two Pig variables,
p which is (id: int, companies: tuple(name:chararray))
and
q which is (id: int, company: chararray).
Now after I join p and q by their "id"'s, how do I filter out those rows where q::company is not present in p::companies?
PS I went through this question Check if an element is present in a bag? but it seems to be not exactly as my problem.
Example
sample p
1,(c1 c2 c3)
2,(c4 c5 c6)
3,(c2 c3 c5)
sample q
1,c3
2,c8
3,c5
expected output after the joins
1,c3
3,c5
First, you need to convert p so that every combination of ID and company name appears on its own line.
p_flattened = FOREACH p GENERATE
id,
FLATTEN(TOKENIZE(companies.name, ' ')) AS company;
dump p_flattened;
(1,c1)
(1,c2)
(1,c3)
(2,c4)
(2,c5)
(2,c6)
(3,c2)
(3,c3)
(3,c5)
Then join with q to return only IDs and names which appear in both relations and do foreach to get rid of the duplicate fields.
pq_joined = JOIN p_flattened BY (id, company), q BY (id, company);
final = FOREACH pq_joined GENERATE
q::id AS id,
q::company AS company;
dump final;
(1,c3)
(3,c5)

how to match the two load statements in pig

I have two load statements A and B.
In each one I have a surrogate key. I want to match the surrogate key columns if both keys will match the stored data.
I tried the following code.
A = LOAD 'a/data/' using PigStorage('\t') as (SourceWebSite:chararray,PropertyID:chararray,ListedOn:chararray,ContactName:chararray,TotalViews:int,Price:chararray);
B = LOAD 'b/data/' using PigStorage('\t') as (SourceWebsite:chararray,PropertyType:chararray,IPLSNO:int,Locality:chararray,City:chararray,Price:chararray);
C = COGROUP A BY Price, B BY Price;
D = FOREACH C GENERATE FLATTEN((IsEmpty(A) ? null : A)), FLATTEN((IsEmpty(B) ? null : B));
The above command prints all the data.
If I understand it right you would like to have dose data where both A and B has any data for the given price, am I right?
Than you may have to use filter:
D = FILTER C BY (NOT IsEmpty(A) AND NOT IsEmpty(B));
The D will contain those data rows where both A and B has value for the price used to group.

Apache Pig Nulls and COGROUP Operators with multiple relations

The below example is taken from Apache Document. My doubt is When using Nulls and COGROUP Operators with multiple relations, how we are getting null tuple in the output.
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'student' as (name:chararray, age:int, gpa:float);
dump B;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)
X = cogroup A by age, B by age;
dump X;
(18,{(joe,18,2.5)},{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})
By default COGROUP will do OUTER JOIN between the two relation, that is the reason you are seeing null in the output.
If you do INNER JOIN, you will not get this null .
C = cogroup A by age INNER, B by age INNER;
This document will give more info about cogroup and Null.
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#cogroup

Self cross-join in pig is disregarded

If one have data like those:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
And then a cross-join is done on A, A:
B = CROSS A, A;
DUMP B;
(1,2,3)
(4,2,1)
Why is second A optimized out from the query?
info: pig version 0.11
== UPDATE ==
If I sort A like:
C = ORDER A BY a1;
D = CROSS A, C;
It will give a correct cross-join.
davek is correct -- you cannot CROSS (or JOIN) a relation with itself. If you wish to do this, you must create a copy of the data. In this case, you can use another LOAD statement. If you want to do this with a relation further down a pipeline, you'll need to duplicate it using FOREACH.
I have several macros that I use frequently and IMPORT by default in all of my Pig scripts in case I need them. One is used for just this purpose:
DEFINE DUPLICATE(in) RETURNS out
{
$out = FOREACH $in GENERATE *;
};
This will work for you wherever in your pipeline you need a duplicate:
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = DUPLICATE(A1);
B = CROSS A1, A2;
Note that even though A1 and A2 are identical, you cannot assume that the records are in the same order. But if you are doing a CROSS or JOIN, this probably doesn't matter.
I think you have to load the data twice to achieve what you want.
i.e.
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;

Pig split and join

I have a requirement to propagate field values from one row to another given type of record
for example my raw input is
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
PIG is not exactly SQL, it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using a GROUP BY, FILTER nested in the FOREACH and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
P = FILTER inpt BY type == 'p'; -- leave the record that contain p for the id
PL = LIMIT P 1; -- make sure there is just one
GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};