Transpose array to rows using PIG latin - apache-pig

How can I convert the ARRAY elements in a BAG into multiple rows, as in the example below?
My input:
tuple, ARRAY_ELEM
(32,{(1,emp,3271409712),(2,emp,3271409712)})
Output
(32,1,emp,3271409712)
(32,2,emp,3271409712)

You probably need to call FLATTEN twice.
Note, FLATTEN on a tuple just elevates each field in the tuple to a top-level field.
FLATTEN on bag produces a cross product of every record in the bag with all of the other expressions in GENERATE.
A = load 'test.txt' using PigStorage() as (a0:int, t1:(a1:int, b1:{(a3:int,a4:chararray,a5:chararray)}));
describe A;
B = FOREACH A GENERATE FLATTEN(t1);
describe B;
C = FOREACH B GENERATE a1, FLATTEN(b1);
describe C;
dump C;
Output
A: {a0: int,t1: (a1: int,b1: {(a3: int,a4: chararray,a5: chararray)})}
B: {t1::a1: int,t1::b1: {(a3: int,a4: chararray,a5: chararray)}}
C: {t1::a1: int,t1::b1::a3: int,t1::b1::a4: chararray,t1::b1::a5: chararray}
(32,1,emp,3271409712)
(32,2,emp,3271409712)
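To make the bag behaviour of FLATTEN concrete, here is a minimal sketch in plain Python (not Pig). The helper name flatten_bag is hypothetical and only models the semantics described above: one output row per tuple in the bag, with the other fields repeated.

```python
def flatten_bag(record, bag_index):
    """Model of Pig's FLATTEN on a bag (illustrative, not Pig itself).

    Emits one row per tuple in the bag at `bag_index`, repeating the
    remaining fields of the record next to each bag tuple.
    """
    before = record[:bag_index]
    after = record[bag_index + 1:]
    return [before + tuple(t) + after for t in record[bag_index]]


# The record from the question: an id plus a bag of three-field tuples.
row = (32, [(1, "emp", "3271409712"), (2, "emp", "3271409712")])
rows = flatten_bag(row, 1)
for r in rows:
    print(r)
# (32, 1, 'emp', '3271409712')
# (32, 2, 'emp', '3271409712')
```

An empty bag produces no rows at all in this model, which matches what Pig does when it flattens an empty bag.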

Related

Find continuity of elements in Pig

How can I find the length and starting position of each continuous run of a field value?
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output I want is
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
B = FOREACH (GROUP A BY value) GENERATE
group as value,
COUNT(A) as continuous_counts,
MIN(A.index) as start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data can be discontinuous, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
Group and count the number of values to get continuous_counts, i.e.
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray, index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group as value,COUNT(A.value) as continuous_counts;
D = FOREACH B {
ordered = ORDER A BY index; -- order the tuples inside each group's bag
first = LIMIT ordered 1;
GENERATE group as value, FLATTEN(first.index) as index;
}
E = JOIN C BY value,D BY value;
F = FOREACH E GENERATE C::value,C::continuous_counts,D::index;
DUMP F;
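For the discontinuous case mentioned above, the logic such a UDF would need can be sketched in plain Python (not Pig). This is only an illustration under the assumption that the records are already sorted by index; the function name runs is hypothetical.

```python
from itertools import groupby


def runs(records):
    """Return (value, run_length, start_index) for each consecutive run.

    `records` is a list of (value, index) pairs sorted by index. Unlike a
    plain GROUP BY, a value that reappears later starts a new run.
    """
    result = []
    for value, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        result.append((value, len(group), group[0][1]))
    return result


data = [("A", 1), ("B", 2), ("B", 3), ("B", 4), ("C", 5), ("C", 6)]
print(runs(data))
# [('A', 1, 1), ('B', 3, 2), ('C', 2, 5)]

# A value that comes back after a gap is reported as a separate run.
print(runs([("A", 1), ("B", 2), ("A", 3)]))
# [('A', 1, 1), ('B', 1, 2), ('A', 1, 3)]
```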

How to loop through tuples in a Bag, Pig

I am new to pig scripting.
I have an input, (A,B,{(XYZ,123,CDE)})
I am looking to loop through the bag inside and print the following records.
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
Can someone please help me out!
Let's say X is your relation and it contains (A,B,{(XYZ,123,CDE)}). TOBAG converts expressions into a bag, and FLATTEN un-nests tuples and bags.
Y = FOREACH X GENERATE $0,$1,TOBAG(FLATTEN($2));
Solved!!
Let us load the file below (tab separated):
A B {(XYZ,123,CDE)}
input_plus_bag = load '' USING PigStorage() AS (entry1:chararray, entry2:chararray, bag1:bag{(te1:chararray, te2:int, te3:chararray)});
intermed_output = foreach input_plus_bag generate entry1, entry2, FLATTEN(bag1);
Dump intermed_output;
This will give
(A,B,XYZ,123,CDE)
DESCRIBE intermed_output;
intermed_output: {entry1: chararray,entry2: chararray,bag1::te1: chararray,bag1::te2: int,bag1::te3: chararray}
Now perform the TOBAG operation:
intermed2_output = foreach intermed_output generate entry1, entry2, TOBAG(bag1::te1,bag1::te2,bag1::te3);
DUMP intermed2_output;
This will result in the output below:
(A,B,{(XYZ),(123),(CDE)})
Now the final step is to FLATTEN the bag:
final_output = foreach intermed2_output generate entry1, entry2, FLATTEN($2);
And we have our desired output:-
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
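The three steps above (FLATTEN the bag, wrap each field with TOBAG, FLATTEN again) amount to emitting one row per field of each inner tuple. A minimal plain-Python model of that idea, with a hypothetical helper name, might look like this:

```python
def explode_fields(entry1, entry2, bag):
    """Model of FLATTEN -> TOBAG -> FLATTEN (illustrative, not Pig itself)."""
    rows = []
    for tup in bag:        # first FLATTEN: one row per tuple in the bag
        for field in tup:  # TOBAG + second FLATTEN: one row per field
            rows.append((entry1, entry2, field))
    return rows


result = explode_fields("A", "B", [("XYZ", 123, "CDE")])
for r in result:
    print(r)
# ('A', 'B', 'XYZ')
# ('A', 'B', 123)
# ('A', 'B', 'CDE')
```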

Pig and Parsing issue

I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which i require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' USING PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here I get (k1=v1 k2=v2 k3=v3) for B.
Is there any way I can further parse this by " " so as to get the 3 fields k1=v1, k2=v2 and k3=v3, which can then be further split into k1,v1,k2,v2,k3,v3 using STRSPLIT and FLATTEN on "="?
Thanks for the help!
San
If you know beforehand how many key=value pairs are in each record, try this:
A = load 'sample' USING PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'[= ]',6); -- split on '=' or space; 6 = number of output fields (3 key=value pairs x 2)
D = FOREACH C GENERATE FLATTEN($0);
DUMP D;
output:
(k1,v1,k2,v2,k3,v3)
Alternatively, use ' ' as the delimiter and remove the unwanted prefix from the $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)
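The parsing logic itself (take the last '|' field, split on spaces, then split each pair on '=') can be checked outside Pig with a short plain-Python sketch. The function name parse_pairs is hypothetical, and unlike the Pig scripts above it does not need to know the number of pairs in advance.

```python
def parse_pairs(line):
    """Flatten the key=value pairs in the last '|'-separated field."""
    last_field = line.rsplit("|", 1)[-1]   # drop the 'a|b|c|' prefix
    flat = []
    for pair in last_field.split():        # split on whitespace
        key, value = pair.split("=", 1)    # split each pair on the first '='
        flat.extend([key, value])
    return flat


print(",".join(parse_pairs("a|b|c|k1=v1 k2=v2 k3=v3")))
# k1,v1,k2,v2,k3,v3
```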

Pick a random value from a bag

I have grouped data in the relation B in the format
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag and get output like
1, abc
2, mno
for example, if the first tuple was picked for 1 and the second tuple for 2.
The issue is that I only have the grouped data B:
DESCRIBE B
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
If I try to flatten it by
C = FOREACH B GENERATE FLATTEN($1);
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
Then I try to do
rand =
FOREACH B {
shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1);
};
I get this error at line L: "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
You can't refer to C when doing a FOREACH on B, because C is a separate relation built from B. Inside the nested block you need to use the bag that B actually contains, i.e. A.
Looking at your DESCRIBE schema
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
you can use A instead of C in the nested FOREACH, and it will work.
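The point of the answer is that the random pick has to happen on each group's own bag. As a rough plain-Python model of that idea (not Pig; the function name is hypothetical and the seeded generator is only there to make the sketch repeatable):

```python
import random


def pick_one_per_group(grouped, rng):
    """Pick one random tuple from each group's bag, skipping empty bags."""
    return [(key, rng.choice(bag)) for key, bag in grouped if bag]


grouped = [
    (1, [(1, "abc"), (1, "def")]),
    (2, [(2, "ghi"), (2, "mno"), (2, "pqr")]),
]
picks = pick_one_per_group(grouped, random.Random(0))
for key, chosen in picks:
    print(key, chosen[1])
```

Each pick is drawn only from the bag that belongs to its own key, which is what using A inside the nested FOREACH gives you.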

Getting a count of a particular string

How can I count the number of occurrences of a particular string, say 'Y', in each individual row and then do calculations on that count? For example, how can I find the number of 'Y' values for each 'FMID' and do calculations on that count for each FMID?
You could use the built-in TOKENIZE function, which converts a row into a BAG, and then use nested filtering in the FOREACH to get a BAG that contains only the word you are interested in, on which you can use COUNT. See the FOREACH description.
For example:
inpt = load '....' as (line:chararray);
row_bags = foreach inpt generate line, TOKENIZE(line) as word;
cnt = foreach row_bags {
match_1 = filter word by token == 'Y'; -- TOKENIZE names each bag field "token"
match_2 = filter word by token == 'X';
generate line, COUNT(match_1) as count_1, COUNT(match_2) as count_2;
};
dump cnt;
Using some functions from DataFu library you could get count for each string in the BAG word.
Here's the simplest way I can think of solving your problem:
define Transpose datafu.pig.util.TransposeTupleToBag();
data = LOAD 'input' USING PigStorage(',') AS
(fmid:int, field1:chararray, field2:chararray,
field3:chararray, field4:chararray);
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
}
I'm not really certain about the data schema you're working with. I'm going to assume you have a relation in Pig consisting of a sequence of tuples. I'm also going to assume you have a lot of fields, making it a pain to reference each individually.
I'll walk through this example piece by piece to explain. Without loss of generality, I will use this data below for my example:
data = LOAD 'input' USING PigStorage(',') AS (fmid:int, field1:chararray, field2:chararray, field3:chararray, field4:chararray);
Now that we've loaded the data, we want to transpose each tuple to a bag, because once it is in a bag we can perform counts on the items within it more easily. We'll use TransposeTupleToBag from the DataFu library:
define Transpose datafu.pig.util.TransposeTupleToBag();
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
Note that you need at least Pig 0.11 to use this UDF. Note the use of $1.., which is known as a project-range expression in Pig. If you have many fields this is really convenient.
If you were to dump data2 at this point you would get this:
(1000,{(field1,N),(field2,N),(field3,N),(field4,N)})
(2000,{(field1,N),(field2,Y),(field3,N),(field4,N)})
(3000,{(field1,Y),(field2,Y),(field3,N),(field4,N)})
(4000,{(field1,Y),(field2,Y),(field3,Y),(field4,Y)})
What we've done is take the fields from the tuple after the 0th element (fmid) and transpose them into a bag where each tuple has a key and a value field.
Now that we have a bag we can do a simple filter and count:
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
}
Now if you dump data2 you get the expected counts reflecting the number of Y values in the tuple.
(1000,0)
(2000,1)
(3000,2)
(4000,4)
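As a quick cross-check of these counts outside Pig, the same per-row counting can be sketched in plain Python. The function name count_value is hypothetical; the rows are the four sample records used in this example.

```python
def count_value(rows, target="Y"):
    """Count how many fields after the first (fmid) equal `target`."""
    return [
        (fmid, sum(1 for value in fields if value == target))
        for fmid, *fields in rows
    ]


rows = [
    (1000, "N", "N", "N", "N"),
    (2000, "N", "Y", "N", "N"),
    (3000, "Y", "Y", "N", "N"),
    (4000, "Y", "Y", "Y", "Y"),
]
counts = count_value(rows)
for fmid, y_cnt in counts:
    print((fmid, y_cnt))
# (1000, 0)
# (2000, 1)
# (3000, 2)
# (4000, 4)
```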
Here is the full source code for my example as a unit test, which you can put directly in the DataFu unit tests to try out:
/**
register $JAR_PATH
define Transpose datafu.pig.util.TransposeTupleToBag();
data = LOAD 'input' USING PigStorage(',') AS (fmid:int, field1:chararray, field2:chararray, field3:chararray, field4:chararray);
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
dump data2;
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
}
dump data2;
STORE data2 INTO 'output';
*/
@Multiline
private String example;
@Test
public void example() throws Exception
{
PigTest test = createPigTestFromString(example);
writeLinesToFile("input",
"1000,N,N,N,N",
"2000,N,Y,N,N",
"3000,Y,Y,N,N",
"4000,Y,Y,Y,Y");
test.runScript();
super.getLinesForAlias(test, "data2");
}