Pick a random value from a bag - apache-pig

I have grouped data in the relation B in the following format:
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag, with output like:
1, abc
2, mno
That is the result if, say, the first tuple was picked for group 1 and the second tuple for group 2.
The issue is that I only have the grouped data B:
DESCRIBE B;
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
If I try to flatten it with
C = FOREACH B GENERATE FLATTEN($1);
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
Then I try to do
rand =
FOREACH B {
shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1);
};
At line L I get the error "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"

You can't refer to C inside a FOREACH on B, because C is a separate relation derived from B. Inside the nested block you need to use the projection that B is built from, i.e. the bag A.
Looking at your DESCRIBE output:
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
Use A in place of C on line L (that is, FOREACH A GENERATE RANDOM() AS r, *;) and it will work.
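The shuffle-and-limit idea from the question (tag each tuple with a random key, order by that key, keep one tuple) can be illustrated outside Pig. This is a minimal Python sketch of that logic only, not Pig code; the function name pick_one_per_group and the sample rows are illustrative assumptions.

```python
import random

def pick_one_per_group(grouped, rng=random):
    """Return (group, value) for one randomly chosen tuple of each bag.

    Mirrors the nested FOREACH: attach a random key to every tuple in the
    bag, order by that key, and keep the first tuple.
    """
    picked = []
    for group, bag in grouped:
        keyed = [(rng.random(), tup) for tup in bag]  # RANDOM() AS r, *
        keyed.sort(key=lambda pair: pair[0])          # ORDER ... BY r
        _, first = keyed[0]                           # LIMIT ... 1 (bags from GROUP are non-empty)
        picked.append((group, first[1]))              # keep only the value field
    return picked

# Sample data shaped like relation B in the question (assumed values).
B = [
    (1, [(1, "abc"), (1, "def")]),
    (2, [(2, "ghi"), (2, "mno"), (2, "pqr")]),
]

if __name__ == "__main__":
    for row in pick_one_per_group(B):
        print(row)
```

Each run may print a different tuple per group, which is the intended behaviour.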

Related

Transpose array to rows using PIG latin

How do I convert the ARRAY elements in a BAG to multiple rows? For example:
My input:
tuple, ARRAY_ELEM
(32,{(1,emp,3271409712),(2,emp,3271409712)})
Output
(32,1,emp,3271409712)
(32,2,emp,3271409712)
You probably need to call FLATTEN twice.
Note, FLATTEN on a tuple just elevates each field in the tuple to a top-level field.
FLATTEN on a bag produces a cross product of every record in the bag with all of the other expressions in GENERATE.
A = load 'test.txt' using PigStorage() as (a0:int, t1:(a1:int, b1:{(a3:int,a4:chararray,a5:chararray)}));
describe A;
B = FOREACH A GENERATE FLATTEN(t1);
describe B;
C = FOREACH B GENERATE a1, FLATTEN(b1);
describe C;
dump C;
Output
A: {a0: int,t1: (a1: int,b1: {(a3: int,a4: chararray,a5: chararray)})}
B: {t1::a1: int,t1::b1: {(a3: int,a4: chararray,a5: chararray)}}
C: {t1::a1: int,t1::b1::a3: int,t1::b1::a4: chararray,t1::b1::a5: chararray}
(32,1,emp,3271409712)
(32,2,emp,3271409712)
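The difference between the two FLATTEN steps above can be modelled in a few lines of Python. This is a hedged sketch of the semantics only; the helper names flatten_tuple and flatten_bag are made up for illustration, and a bag is represented as a list of tuples.

```python
def flatten_tuple(record, index):
    """FLATTEN on a tuple field: splice its fields into the enclosing record."""
    return record[:index] + tuple(record[index]) + record[index + 1:]

def flatten_bag(record, index):
    """FLATTEN on a bag field: emit one record per tuple in the bag,
    each combined with the other fields of the record."""
    return [
        record[:index] + tuple(item) + record[index + 1:]
        for item in record[index]
    ]

# The t1 field from the question: (a1, b1) where b1 is a bag.
t1 = (32, [(1, "emp", 3271409712), (2, "emp", 3271409712)])

b_record = flatten_tuple((t1,), 0)    # like B = FOREACH A GENERATE FLATTEN(t1)
c_records = flatten_bag(b_record, 1)  # like C = FOREACH B GENERATE a1, FLATTEN(b1)

if __name__ == "__main__":
    for rec in c_records:
        print(rec)
```

In this model an empty bag yields no output records at all.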

Sort tuples in a bag based on multiple fields

I am trying to sort tuples inside a bag based on three fields in descending order.
Example : Suppose I have the following bag created by grouping:
{(s,3,my),(w,7,pr),(q,2,je)}
I want to sort the tuples in the above grouped bag on the $0, $1 and $2 fields: sort on $0 of all the tuples first and pick the tuple with the largest $0 value; if $0 is the same for all the tuples, sort on $1, and so on.
The sorting should be applied to every grouped bag.
Suppose we have a bag like:
{(21,25,34),(21,28,64),(21,25,52)}
Then according to the requirement output should be like:
{(21,25,34),(21,25,52),(21,28,64)}
Please let me know if you need any more clarification
Order the tuples in a nested FOREACH. This will work.
Input:
(1,s,3,my)
(1,w,7,pr)
(1,q,2,je)
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
od = ORDER A BY b, c, d;
GENERATE od;
};
Result of DUMP C (which resembles your data):
({(1,s,3,my),(1,w,7,pr),(1,q,2,je)})
Output:
({(1,q,2,je),(1,s,3,my),(1,w,7,pr)})
This will work for all the cases.
Generate tuple with highest value:
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
od = ORDER A BY b desc , c desc , d desc;
od1 = LIMIT od 1;
GENERATE od1;
};
dump D;
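The two scripts above order each bag on b, c, d and, in the second case, keep only the first tuple. A minimal Python sketch of the same multi-key ordering follows; order_bag is a hypothetical helper name. The scripts load every field as chararray, so this sketch also compares strings, which means "10" sorts before "9"; cast to int first if numeric order is wanted.

```python
def order_bag(bag, descending=False):
    """Sort tuples on fields b, c, d (positions 1, 2, 3), like ORDER A BY b, c, d."""
    return sorted(bag, key=lambda t: (t[1], t[2], t[3]), reverse=descending)

# The sample bag from the answer, with every field held as a string.
bag = [("1", "s", "3", "my"), ("1", "w", "7", "pr"), ("1", "q", "2", "je")]

ascending = order_bag(bag)                 # ORDER A BY b, c, d
top = order_bag(bag, descending=True)[:1]  # ORDER ... DESC followed by LIMIT 1

if __name__ == "__main__":
    print(ascending)
    print(top)
```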
Generate the tuple with the highest value if all three fields are different; if all the tuples are the same, or if field 1 and field 2 are the same, return all the tuples.
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
F = RANK C; -- rank used to separate out the values if two tuples are the same
R = FOREACH F {
dis = distinct A;
GENERATE rank_C,COUNT(dis) AS (cnt:long),A;
};
R3 = FILTER R BY cnt!=1; -- drop the bags in which all the tuples are the same
R4 = FOREACH R3 {
fil1 = ORDER A by b desc, c desc, d desc;
fil2 = LIMIT fil1 1;
GENERATE rank_C,fil2;
}; -- find the largest tuple, except where all the tuples are the same
R5 = FILTER R BY cnt==1; -- only the bags in which all the tuples are the same
R6 = FOREACH R5 GENERATE A ; -- generate required fields
F1 = FOREACH F GENERATE rank_C,FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); -- group by field 1, field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long) ,F1; -- if count = 2 then the tuples are the same on field 1 and field 2
F4 = FILTER F3 BY cnt1==2; -- separate those out
F5 = FOREACH F4 {
DIS = distinct F1;
GENERATE flatten(DIS);
};
F8 = JOIN F BY rank_C, F5 by rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = cross R4,F5; -- cross done to generate the result when all the tuples are different
Z1 = FILTER Z BY R4::rank_C!=F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
res = UNION Z2,R6,F9; -- Z2 - holds the highest tuple when all three fields in the tuples are different
-- R6 - contains the tuples when all three fields in the tuples are the same
-- F9 - contains the tuples when two fields of the tuples are the same
dump res;

Using Pig conditional operator to implement or?

Let's say I have some table f, consisting of the following columns:
a, b
0, 1
0, 0
0, 0
0, 1
1, 0
1, 1
I want to create a new column, c, that is equal to a | b.
I've tried the following:
f = foreach f generate a, b, ((a or b) == 1) ? 1 : 0 as c;
but receive the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: NoViableAltException(91#[])
The OR condition is not constructed correctly: or combines boolean expressions, and a and b are ints, so compare each one to 1 first. Can you try this?
f = foreach f generate a, b, (((a==1) or (b==1))?1:0) AS c;
Sample example:
input:
0,1
0,0
0,0
0,1
1,0
1,1
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (a:int,b:int);
B = foreach A generate a, b, (((a==1) or (b==1))?1:0) AS c;
DUMP B;
Output:
(0,1,1)
(0,0,0)
(0,0,0)
(0,1,1)
(1,0,1)
(1,1,1)
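For comparison, here is the same conditional written as a small Python sketch. It only mirrors the logic of the bincond above; the function name or_flag is an illustrative assumption.

```python
def or_flag(a, b):
    """Return 1 if either input equals 1, else 0, like ((a==1) or (b==1)) ? 1 : 0."""
    return 1 if (a == 1 or b == 1) else 0

# The six input rows from the question.
rows = [(0, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 1)]
result = [(a, b, or_flag(a, b)) for a, b in rows]

if __name__ == "__main__":
    for row in result:
        print(row)
```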

Doing word count in pig

I have data already processed in following form:
( id ,{ bag of words})
So for example:
(foobar, {(foo), (foo),(foobar),(bar)})
(foo,{(bar),(bar)})
and so on..
describe processed gives me:
processed: {id: chararray,tokens: {tuple_of_tokens: (token: chararray)}}
Now I also want to count the number of times each word appears in this data and output it as
foobar, foo, 2
foobar,foobar,1
foobar,bar,1
foo,bar,2
and so on...
How do I do this in pig?
Though you can do this in pure pig, it should be much more efficient to do this with a UDF. Something along the lines of:
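@outputSchema('wordcounts: {T:(word:chararray, count:int)}')
def generate_wordcount(BAG):
    # Count how many times each word occurs in the bag.
    d = {}
    for t in BAG:
        word = t[0]  # each bag element is a one-field tuple
        if word in d:
            d[word] += 1
        else:
            d[word] = 1
    return d.items()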
You can then use this UDF like this:
REGISTER 'myudfs.py' USING jython AS myudfs ;
-- A: (id, words: {T:(word:chararray)})
B = FOREACH A GENERATE id, FLATTEN(myudfs.generate_wordcount(words)) ;
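The counting logic of the UDF can be checked locally before registering it with Pig. The sketch below is plain Python with no Pig involved: it uses collections.Counter in place of the hand-written dictionary, represents a bag as a list of one-field tuples, and sorts the pairs only to make the output order predictable. The sample data is taken from the question.

```python
from collections import Counter

def generate_wordcount(bag):
    """Count tokens in one bag of single-field tuples; return (word, count) pairs."""
    counts = Counter(t[0] for t in bag)
    return sorted(counts.items())

# Records shaped like the 'processed' relation: (id, bag of tokens).
processed = [
    ("foobar", [("foo",), ("foo",), ("foobar",), ("bar",)]),
    ("foo", [("bar",), ("bar",)]),
]

# Equivalent of GENERATE id, FLATTEN(generate_wordcount(words)).
rows = [
    (id_, word, n)
    for id_, tokens in processed
    for word, n in generate_wordcount(tokens)
]

if __name__ == "__main__":
    for row in rows:
        print(row)
```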
Alternatively, try this in pure Pig:
$ cat input
foobar foo
foobar foo
foobar foobar
foobar bar
foo bar
foo bar
--preparing
inputs = LOAD 'input' AS (first: chararray, second: chararray);
grouped = GROUP inputs BY first;
formatted = FOREACH grouped GENERATE group, inputs.second AS second;
--what you need
flattened = FOREACH formatted GENERATE group, FLATTEN(second);
result = FOREACH (GROUP flattened BY (group, second)) GENERATE FLATTEN(group), COUNT(flattened);
DUMP result;
Output:
(foo,bar,2)
(foobar,bar,1)
(foobar,foo,2)
(foobar,foobar,1)

Sort related bag

I have a Pig script which generated a relation
A: {x: chararray,B: {(y: chararray,z: int)}}
I want to sort A based on B.y; however, the following snippet gives me an error:
Syntax error, unexpected symbol at or near z
output = foreach A{
sorted = order B by z DSC;
generate x,sorted;
}
Use DESC instead of DSC.
e.g.
output = foreach A{
sorted = order B by z DESC;
generate x,sorted;
}