Looping through element in Pig to generate a new tuple for relation - apache-pig

Say I have a relation as follows:
(A, (1, 2, 3))
(B, (2, 3))
Is it possible to make a new relation by expanding the bag element as follows using Pig Latin?
(A, 1)
(A, 2)
(A, 3)
(B, 2)
(B, 3)
I tried using FOREACH and GENERATE, but I am having difficulty generating a new tuple while looping through a bag element.
Thanks,
-------------
EDIT
-------------
Here's a sample input:
A 1 2 3
B 2 3
The title is separated from the links by a tab, and the links are separated from each other by whitespace.
I used STRSPLIT to split the links on whitespace, which gives me a tuple.
raw_x = LOAD './sample.txt' using PigStorage('\t') AS (title:chararray, links:chararray);
data_x = FOREACH raw_x GENERATE title, STRSPLIT(links, '\\s+') AS links;

Can you try this?
input.txt
A 1 2 3
B 2 3
PigScript:
A = LOAD 'input.txt' USING PigStorage() AS (title:chararray,links:chararray);
B = FOREACH A GENERATE title,FLATTEN(TOKENIZE(links));
DUMP B;
Output:
(A,1)
(A,2)
(A,3)
(B,2)
(B,3)
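The difference from the STRSPLIT attempt in the question is the return type: STRSPLIT returns a tuple, while TOKENIZE returns a bag, and FLATTEN treats the two differently. A minimal sketch of the contrast, reusing the question's LOAD statement (the aliases as_tuple, as_rows and link are made up for illustration):

```pig
raw_x = LOAD './sample.txt' USING PigStorage('\t') AS (title:chararray, links:chararray);

-- STRSPLIT returns a tuple, so FLATTEN widens the row instead of multiplying it,
-- e.g. (A,1,2,3) for the first input line.
as_tuple = FOREACH raw_x GENERATE title, FLATTEN(STRSPLIT(links, '\\s+'));

-- TOKENIZE returns a bag, so FLATTEN emits one row per token,
-- e.g. (A,1), (A,2), (A,3) for the first input line.
as_rows = FOREACH raw_x GENERATE title, FLATTEN(TOKENIZE(links)) AS link;
DUMP as_rows;
```

So if you want one row per link, produce a bag first and then flatten it.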

Related

Transpose array to rows using PIG latin

How can I convert the array elements in a bag into multiple rows, as in the example below?
My input:
tuple, ARRAY_ELEM
(32,{(1,emp,3271409712),(2,emp,3271409712)})
Output
(32,1,emp,3271409712)
(32,2,emp,3271409712)
You probably need to call FLATTEN twice.
Note, FLATTEN on a tuple just elevates each field in the tuple to a top-level field.
FLATTEN on bag produces a cross product of every record in the bag with all of the other expressions in GENERATE.
A = load 'test.txt' using PigStorage() as (a0:int, t1:(a1:int, b1:{(a3:int,a4:chararray,a5:chararray)}));
describe A;
B = FOREACH A GENERATE FLATTEN(t1);
describe B;
C = FOREACH B GENERATE a1, FLATTEN(b1);
describe C;
dump C;
Output
A: {a0: int,t1: (a1: int,b1: {(a3: int,a4: chararray,a5: chararray)})}
B: {t1::a1: int,t1::b1: {(a3: int,a4: chararray,a5: chararray)}}
C: {t1::a1: int,t1::b1::a3: int,t1::b1::a4: chararray,t1::b1::a5: chararray}
(32,1,emp,3271409712)
(32,2,emp,3271409712)
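If the id and the bag are loaded as two top-level fields rather than wrapped inside a tuple, a single FLATTEN may be enough. A minimal sketch, assuming a tab between the two fields in the input file; the field names id, elems, seq, kind and code are made up for illustration:

```pig
-- Assumed layout: id<TAB>{(1,emp,3271409712),(2,emp,3271409712)}
A = LOAD 'test.txt' USING PigStorage()
    AS (id:int, elems:{t:(seq:int, kind:chararray, code:chararray)});

-- FLATTEN on the bag crosses each inner tuple with id,
-- giving one output row per bag element.
B = FOREACH A GENERATE id, FLATTEN(elems);
DUMP B;
```

Under that assumption the bag needs no enclosing tuple to unwrap first, so the intermediate relation from the two-step version is not needed.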

Pig and Parsing issue

I am trying to figure out the best way to parse key-value pairs with Pig in a dataset with mixed delimiters.
My sample dataset is in the format below:
a|b|c|k1=v1 k2=v2 k3=v3
The final output I require is:
k1,v1,k2,v2,k3,v3
I guess one way to start is:
A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
Here I get (k1=v1 k2=v2 k3=v3) for B.
Is there any way I can further split this on " " to get the three fields k1=v1, k2=v2 and k3=v3, which can then be split into k1,v1,k2,v2,k3,v3 using STRSPLIT and FLATTEN on "="?
Thanks for the help!
San
If you know beforehand how many key=value pairs are in each record, try this:
A = LOAD 'sample' USING PigStorage('|') AS (a1,b1,c1,d1);
B = FOREACH A GENERATE d1;
C = FOREACH B GENERATE STRSPLIT($0,'[= ]',6); -- split on '=' or space; 6 = number of fields (3 pairs x 2)
D = FOREACH C GENERATE FLATTEN($0);
DUMP D;
output:
(k1,v1,k2,v2,k3,v3)
If you don't know the number of key=value pairs, use ' ' as the delimiter and strip the unwanted prefix from the $0 column. Note that the schema below still lists one column per pair, so it needs a column for each pair you expect.
A = LOAD 'sample' USING PigStorage(' ') AS (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1,k2,v2,k3,v3)
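When the number of pairs really does vary from record to record, another option is to turn the last field into a bag and flatten it, so that no per-pair columns are needed. This is only a sketch under the assumption that one (key, value) row per pair is acceptable instead of a single wide row; the aliases pairs, kvs, kv, k and v are made up for illustration:

```pig
A = LOAD 'sample' USING PigStorage('|')
    AS (a1:chararray, b1:chararray, c1:chararray, d1:chararray);

-- TOKENIZE with an explicit ' ' delimiter yields a bag of "key=value" tokens;
-- FLATTEN turns it into one row per token, however many there are.
pairs = FOREACH A GENERATE FLATTEN(TOKENIZE(d1, ' ')) AS kv:chararray;

-- Split each token once on '=' into a key and a value.
kvs = FOREACH pairs GENERATE FLATTEN(STRSPLIT(kv, '=', 2)) AS (k, v);
DUMP kvs;
```

For the sample record this should give the rows (k1,v1), (k2,v2) and (k3,v3). Keep a1, b1 or c1 in the first GENERATE if you need to know which record each pair came from.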

Using Pig conditional operator to implement or?

Let's say I have some table f, consisting of the following columns:
a, b
0, 1
0, 0
0, 0
0, 1
1, 0
1, 1
I want to create a new column, c, that is equal to a | b.
I've tried the following:
f = foreach f generate a, b, ((a or b) == 1) ? 1 : 0 as c;
but receive the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: NoViableAltException(91#[])
The OR condition is not constructed correctly. OR expects boolean operands, so compare each column with 1 first. Can you try this?
f = foreach f generate a, b, (((a==1) or (b==1))?1:0) AS c;
Sample example:
input:
0,1
0,0
0,0
0,1
1,0
1,1
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (a:int,b:int);
B = foreach A generate a, b, (((a==1) or (b==1))?1:0) AS c;
DUMP B;
Output:
(0,1,1)
(0,0,0)
(0,0,0)
(0,1,1)
(1,0,1)
(1,1,1)

Pick a random value from a bag

I have grouped data in relation B in this format:
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag, with output like
1, abc
2, mno
(for example, if the first tuple was picked for group 1 and the second for group 2).
The issue is that I only have the grouped data B:
DESCRIBE B;
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
If I try to flatten it by
C = FOREACH B GENERATE FLATTEN($1);
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
Then I try to do
rand =
FOREACH B {
    shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
    shuf = ORDER shuf_ BY r;
    pick1 = LIMIT shuf 1;
    GENERATE
        group,
        FLATTEN(pick1);
};
I get an error at line L: "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
You can't refer to C inside a nested FOREACH over B, because C is a separate relation built from B, not a field of B. Inside the block you can only project fields of B, i.e. the bag A.
Looking at your DESCRIBE schema:
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
Use A instead of C there and it should work.
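Put together, the script might look like the sketch below. It assumes a Pig version that supports a nested FOREACH, and the input file name and the ',' delimiter are made up for illustration; only the schema comes from the DESCRIBE output above.

```pig
-- Hypothetical input file and delimiter; the schema matches DESCRIBE B above.
A = LOAD 'data.txt' USING PigStorage(',')
    AS (id:int, min:chararray, fan:chararray, max:chararray);
B = GROUP A BY id;

rand = FOREACH B {
    -- A is the bag inside each group, so it can be projected in the nested block.
    shuf_ = FOREACH A GENERATE RANDOM() AS r, *;
    shuf = ORDER shuf_ BY r;   -- shuffle the bag by sorting on the random key
    pick1 = LIMIT shuf 1;      -- keep a single tuple per group
    GENERATE group, FLATTEN(pick1);
};
DUMP rand;
```

Each output row still carries the random key r and the full tuple, so add a final FOREACH that projects only the fields you want.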

return a databag given a tuple in PIG

I am trying to write a UDF which takes a tuple as input and returns a databag as output. I am very new to Pig. Please help. The example I have is the UPPER.java class example.
Example of what the UDF should do:
If the input is, say, <8,9,1,8,9>, the output should be <{8,2} {9,2} {1,1}>: 8 appears 2 times, 9 does too, and 1 appears once.
I found a solution.
Step 1: Load input file
input_data= load '/idn/home/ksing143/tuple_related_data/tuple_frequency.txt' USING PigStorage() AS ip_tuple:tuple(a:int, b:int, c:int, d:int, e:int);
Result:
((8,9,1,8,9))
Step 2: flatten the input tuple
ip_flattened = foreach input_data generate FLATTEN($0);
Step 3: Convert to bag
ip_tobag = foreach ip_flattened generate TOBAG(ip_tuple::a,ip_tuple::b,ip_tuple::c,ip_tuple::d,ip_tuple::e);
Result:
({(8),(9),(1),(8),(9)})
Step 4: Flatten the bag
ip_tobag_flattened = foreach ip_tobag generate FLATTEN($0);
Result:
(8)
(9)
(1)
(8)
(9)
Step 5: Perform Grouping and then count
ip_grouped = group ip_tobag_flattened BY $0;
ip_out = foreach ip_grouped generate group, COUNT($1);
Result:
(1,1)
(8,2)
(9,2)
Step 6: Convert output TOBAG as we want output in the form of bag.
ip_output_bag = foreach ip_out generate TOBAG($0,$1);
Result:
({(1),(1)})
({(8),(2)})
({(9),(2)})
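As the Step 6 result shows, TOBAG($0,$1) puts the value and its count into two separate one-field tuples. If you would rather have a single bag of (value, count) pairs, which is closer to the output described in the question, one option is GROUP ... ALL. This is a sketch rather than the original solution; the shortened file name and the aliases vals, counts, v and n are made up for illustration:

```pig
-- Hypothetical input path; each line holds one tuple such as (8,9,1,8,9).
input_data = LOAD 'tuple_frequency.txt' USING PigStorage()
    AS ip_tuple:tuple(a:int, b:int, c:int, d:int, e:int);

-- Turn the five fields into a bag and flatten it to one value per row.
vals = FOREACH input_data GENERATE
    FLATTEN(TOBAG(ip_tuple.a, ip_tuple.b, ip_tuple.c, ip_tuple.d, ip_tuple.e)) AS v;

-- Count how often each value occurs.
grouped = GROUP vals BY v;
counts = FOREACH grouped GENERATE group AS v, COUNT(vals) AS n;

-- Collect every (value, count) pair into one bag.
all_counts = GROUP counts ALL;
result = FOREACH all_counts GENERATE counts;
DUMP result;
```

For the sample input this should produce one bag containing (1,1), (8,2) and (9,2); the order of tuples inside a bag is not guaranteed. Like the steps above, it counts values across all input rows rather than per row.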