How to loop through tuples in a Bag, Pig - apache-pig

I am new to pig scripting.
I have an input, (A,B,{(XYZ,123,CDE)})
I am looking to loop through the bag inside and print the following records.
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
Can someone please help me out!

Lets say X is your relation and it has (A,B,{(XYZ,123,CDE)}).ToBag converts the expression into bags and FLATTEN unnests the tuples,bag.
Y = FOREACH X GENERATE $0,$1,ToBag(FLATTEN($2));

Solved!!
Let us load below file (Tab separated)
A B {(XYZ,123,CDE)}
input_plus_bag = load '' USING PigStorage() AS (entry1:chararray, entry2:chararray, bag1:bag{(te1:chararray, te2:int, te3:chararray)});
intermed_output = foreach input_plus_bag generate entry1, entry2, FLATTEN(bag1);
Dump intermed_output;
This will give
(A,B,XYZ,123,CDE)
DESCRIBE intermed_output;
intermed_output: {entry1: chararray,entry2: chararray,bag1::te1: chararray,bag1::te2: int,bag1::te3: chararray}
Now perform TOBAG operation
intermed2_output = foreach intermed_output generate entry1, entry2, TOBAG(bag1::te1,bag1::te2,bag1::te3);
DUMP intermed2_output;
This will result in below output:-
(A,B,{(XYZ),(123),(CDE)})
Now final step is FLATTEN the bag
final_output = foreach intermed2_output generate entry1, entry2, FLATTEN($2);
And we have our desired output:-
(A,B,XYZ)
(A,B,123)
(A,B,CDE)

Related

Transpose array to rows using PIG latin

How to convert ARRY elements in BAG to multiple rows eg: below
My input:
tuple, ARRAY_ELEM
(32,{(1,emp,3271409712),(2,emp,3271409712)})
Output
(32,1,emp,3271409712)
(32,2,emp,3271409712)
You probably need to call FLATTEN twice.
Note, FLATTEN on a tuple just elevates each field in the tuple to a top-level field.
FLATTEN on bag produces a cross product of every record in the bag with all of the other expressions in GENERATE.
A = load 'test.txt' using PigStorage() as (a0:int, t1:(a1:int, b1 {(a3:int,a4:chararray,a5:chararray)}));
describe A;
B = FOREACH A GENERATE FLATTEN(t1);
describe B;
C = FOREACH B GENERATE a1, FLATTEN(b1);
describe C;
dump C;
Output
A: {a0: int,t1: (a1: int,b1: {(a3: int,a4: chararray,a5: chararray)})}
B: {t1::a1: int,t1::b1: {(a3: int,a4: chararray,a5: chararray)}}
C: {t1::a1: int,t1::b1::a3: int,t1::b1::a4: chararray,t1::b1::a5: chararray}
(32,1,emp,3271409712)
(32,2,emp,3271409712)

Change separator and generate output through PiG

I want to generate the below output from the given input. What will be the best way to get that.
Input:-
"column,1A,extra-A1,extra-A2",column2A,column3A
"((column,1B,extra-B1))",column2B,column3B
"column,1C,extra-C1,extra-C2,extra-C3,extra-C4",column2C,column3C
"column,1D,extra-D1",column2D,column3D
Output:-
column,1A,extra-A1,extra-A2|column2A|column3A
((column,1B,extra-B1))|column2B|column3B
column,1C,extra-C1,extra-C2,extra-C3,extra-C4|column2C|column3C
column,1D,extra-D1|column2D|column3D
I am able to resolve it using below, let me know if you have any better option
Input:-
"column,1A,extra-A1,extra-A2",column2A,column3A
"((column,1B,extra-B1))",column2B,column3B
"column,1C,extra-C1,extra-C2,extra-C3,extra-C4",column2C,column3C
"column,1D,extra-D1",column2D,column3D
Pig Script:-
A = LOAD '/home/hduser/pig_ex1/sample1.txt' AS line;
B = FOREACH A GENERATE SUBSTRING(line,1,(LAST_INDEX_OF(line,'"'))) AS firstcol, SUBSTRING(line,(LAST_INDEX_OF(line,'"')+2),(INT) SIZE(line)) as lastcol;
C = FOREACH B GENERATE firstcol, FLATTEN(STRSPLIT(lastcol,'\\,',2)) AS (secondcol,thirdcol);
D = FOREACH C GENERATE CONCAT(firstcol,'|',secondcol,'|',thirdcol);
Output:-
(column,1A,extra-A1,extra-A2|column2A|column3A)
(((column,1B,extra-B1))|column2B|column3B)
(column,1C,extra-C1,extra-C2,extra-C3,extra-C4|column2C|column3C)
(column,1D,extra-D1|column2D|column3D)
I am using Regular Expression:
Let be try this code:
a = LOAD '/home/hduser/pig_ex1/sample1.txt' as line;
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'["](.*)["][,](.*)[,](.*)')) AS (f1,f2,f3);
c = FOREACH b GENERATE CONCAT(f1,'|',f2,'|',f3);
dump c;
Try org.apache.pig.piggybank.storage.CSVExcelStorage(','); from piggybank jar.

Pig and Parsing issue

I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which i require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here i get (k1=v1 k2=v2 k3=v3) for B
Is there any way i can further parse this by "" so as to get 3 fields k1=v1,k2=v2 and K3=v3 which can then be further split into k1,v1,k2,v2,k3,v3 using Strsplit and Flatten on "=".
Thanks for the help!
San
If you know beforehand how many key=value pair are in each record, try this:
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'=',6); -- 6= no. of key=value pairs
D = FOREACH C GENERATE FLATTEN($0);
DUMP D
output:
(k1,v1, k2,v2, k3,v3)
If you dont know the # of key=value pair, use ' ' as delimiter and remove the unwanted prefix from $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)

Merge two lines in Pig

I would like to write a pig script for below query.
Input is:
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
Output should be:
ABC,DEF,GHI,JKL
MNO,PQR,STU,VWX
Could anyone please help me?
It will be difficult to solve this problem using native pig. One option could be download the datafu-1.2.0.jar library and try the below approach.
input.txt
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
PigScript:
REGISTER /tmp/datafu-1.2.0.jar;
DEFINE BagSplit datafu.pig.bags.BagSplit();
A = LOAD 'input.txt' USING PigStorage(',') AS(f1,f2,f3,f4);
B = GROUP A ALL;
C = FOREACH B GENERATE FLATTEN(BagSplit(2,$1)) AS mybag;
D = FOREACH C GENERATE FLATTEN(STRSPLIT(REPLACE(BagToString(mybag),'_null_null_null_null',''),'_',4));
E = FOREACH D GENERATE $2,$3,$0,$1;
DUMP E;
Output:
(MNO,PQR,STU,VWX)
(ABC,DEF,GHI,JKL)
Note:
Based on the above input format, my assumption will be 1st row last two cols will be null, 2nd row first two cols will be null, similarly for 3rd and 4th row also

Getting a count of a particular string

How can I count the number of occurances of a particular string,say 'Y' , for each individual row and do calculations on that count after that. For ex. how can I find the number of 'Y' for each 'FMID' and do calculations on that count for each FMID ?Dataset Screenshot
You could use TOKENIZE built-in function which converts row into a BAG and than use nested filtering in the foreach to get a BAG that contains only word you are interested in, on which you can use COUNT. See FOREACH description
For example:
inpt = load '....' as (line : string);
row_bags = foreach inpt generate line, TOKENIZE(line) as word;
cnt = foreach row_bags {
match_1 = filter word by 'Y';
match_2 = filter word by 'X';
generate line, COUNT(match_1) as count_1, COUNT(match_2) as count_2;
};
dump cnt;
Using some functions from DataFu library you could get count for each string in the BAG word.
Here's the simplest way I can think of solving your problem:
define Transpose datafu.pig.util.TransposeTupleToBag();
data = LOAD 'input' USING PigStorage(',') AS
(fmid:int, field1:chararray, field2:chararray,
field3:chararray, field4:chararray);
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
}
I'm not really certain about the data schema you're working with. I'm going to assume you have a relation in Pig consisting of a sequence of tuples. I'm also going to assume you have a lot of fields, making it a pain to reference each individually.
I'll walk through this example piece by piece to explain. Without loss of generality, I will use this data below for my example:
data = LOAD 'input' USING PigStorage(',') AS (fmid:int, field1:chararray, field2:chararray, field3:chararray, field4:chararray);
Now that we've loaded the data, we want to transpose each tuple to a bag, because once it is in a bag we can perform counts on the items within it more easily. We'll use TransposeTupleToBag from the DataFu library:
define Transpose datafu.pig.util.TransposeTupleToBag();
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
Note that you need at least Pig 0.11 to use this UDF. Note the use of $1.., which is known as a project-range expression in Pig. If you have many fields this is really convenient.
If you were to dump data2 at this point you would get this:
(1000,{(field1,N),(field2,N),(field3,N),(field4,N)})
(2000,{(field1,N),(field2,Y),(field3,N),(field4,N)})
(3000,{(field1,Y),(field2,Y),(field3,N),(field4,N)})
(4000,{(field1,Y),(field2,Y),(field3,Y),(field4,Y)})
What we've done is taken the fields from the tuple after the 0th element (fmid), and tranposed these into a bag where each tuple has a key and value field.
Now that we have a bag we can do a simple filter and count:
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
}
Now if you dump data2 you get the expected counts reflecting the number of Y values in the tuple.
(1000,0)
(2000,1)
(3000,2)
(4000,4)
Here is the full source code for my example as a unit test, which you can put directly in the DataFu unit tests to try out:
/**
register $JAR_PATH
define Transpose datafu.pig.util.TransposeTupleToBag();
data = LOAD 'input' USING PigStorage(',') AS (fmid:int, field1:chararray, field2:chararray, field3:chararray, field4:chararray);
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
dump data2;
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
}
dump data2;
STORE data2 INTO 'output';
*/
#Multiline
private String example;
#Test
public void example() throws Exception
{
PigTest test = createPigTestFromString(example);
writeLinesToFile("input",
"1000,N,N,N,N",
"2000,N,Y,N,N",
"3000,Y,Y,N,N",
"4000,Y,Y,Y,Y");
test.runScript();
super.getLinesForAlias(test, "data2");
}