Using Pig conditional operator to implement or? - apache-pig

Let's say I have some table f, consisting of the following columns:
a, b
0, 1
0, 0
0, 0
0, 1
1, 0
1, 1
I want to create a new column, c, that is equal to a | b.
I've tried the following:
f = foreach f generate a, b, ((a or b) == 1) ? 1 : 0 as c;
but receive the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: NoViableAltException(91#[])

The OR condition is not constructed correctly. Can you try this?
f = foreach f generate a, b, (((a==1) or (b==1))?1:0) AS c;
Sample example:
input:
0,1
0,0
0,0
0,1
1,0
1,1
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (a:int,b:int);
B = foreach A generate a, b, (((a==1) or (b==1))?1:0) AS c;
DUMP B;
Output:
(0,1,1)
(0,0,0)
(0,0,0)
(0,1,1)
(1,0,1)
(1,1,1)
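For what it's worth, since a and b are 0/1 integers, an arithmetic variant of the same idea should also parse (an untested sketch, not from the answer above):
-- a + b is greater than 0 whenever at least one of a, b is 1
f = foreach f generate a, b, ((a + b > 0) ? 1 : 0) as c;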

Related

Write a unique key of the group as folder name and the bag content as records?

Objective: to write the unique key of each group as a folder name and the bag content as records.
File : employee.txt
#JoiningDate Employee Id Employee Name
20140302 1 A
20140302 2 B
20140302 3 C
20140303 4 D
20140303 5 E
20140303 6 F
Pig script :
X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
Y = group X by joining_date;
The output of this (Y) would be:
(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})
The objective is to have two folders in the output path:
1. outputfolder/20140302 : having three records
20140302,1,A
20140302,2,B
20140302,3,C
2. outputfolder/20140303 :
20140303,4,D
20140303,5,E
20140303,6,F
I tried:
store Y into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
But I am seeing the result below:
1. outputfolder/20140302/20140302-0
(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
2. outputfolder/20140303/20140303-0
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})
One option could be to just flatten the values before the store command.
X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
Y = group X by joining_date;
Z = FOREACH Y GENERATE FLATTEN($1);
store Z into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
The output will be stored in the outputfolder/20140302 folder, and the file names will start with something like 20140302-0,000.
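For reference, the second argument to MultiStorage ('0' here) is the index of the field used for splitting, and after FLATTEN the joining_date is still field 0, so the same index applies. A quick way to sanity-check one of the generated folders could be (a sketch; the path comes from the run above):
CHK = LOAD 'outputfolder/20140302' USING PigStorage(',') AS (joining_date:chararray, employee_id:long, employee_name:chararray);
DUMP CHK;
-- expected: (20140302,1,A), (20140302,2,B), (20140302,3,C)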

Looping through element in Pig to generate a new tuple for relation

Say I have a relation as follows:
(A, (1, 2, 3))
(B, (2, 3))
Is it possible to make a new relation by expanding the bag element as follows using Pig Latin?
(A, 1)
(A, 2)
(A, 3)
(B, 2)
(B, 3)
I tried using FOREACH and GENERATE, but I am having difficulty generating a new tuple while looping through a bag element.
Thanks,
-------------
EDIT
-------------
Here's a sample input:
A 1 2 3
B 2 3
The title and the links are separated by a tab, and the individual links by whitespace.
I used STRSPLIT to handle whitespace to generate a tuple.
raw_x = LOAD './sample.txt' using PigStorage('\t') AS (title:chararray, links:chararray);
data_x = FOREACH raw_x GENERATE title, STRSPLIT(links, '\\s+') AS links;
Can you try this?
input.txt
A 1 2 3
B 2 3
PigScript:
A = LOAD 'input.txt' USING PigStorage() AS (title:chararray,links:chararray);
B = FOREACH A GENERATE title,FLATTEN(TOKENIZE(links));
DUMP B;
Output:
(A,1)
(A,2)
(A,3)
(B,2)
(B,3)
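If the links are ever separated by something other than whitespace (say commas), TOKENIZE also takes an optional delimiter argument in Pig 0.10 and later, so a small variation of the same idea would be (a sketch):
B = FOREACH A GENERATE title, FLATTEN(TOKENIZE(links, ','));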

Pick a random value from a bag

I have grouped data in the relation B in the format
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from the bag, and I want the output like
1, abc
2, mno
in case we happened to pick, say, the first tuple for 1 and the second tuple for 2.
The issue is that I only have the grouped data B:
DESCRIBE B
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
If I try to flatten it by
C = FOREACH B GENERATE FLATTEN($1)
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
Then I try to do
rand =
FOREACH B {
shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1);
};
I get an error at line L: "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
You can't refer to C when doing a FOREACH on B, because C is built from B. You need to project the relation that B is built from, i.e. A.
Looking at your describe schemas
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
Why can't you use A? It will work.
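For example, adapting your nested FOREACH to project the inner bag A instead of C should parse (a sketch based on the schema you posted; nested FOREACH inside FOREACH needs Pig 0.10 or later):
rand = FOREACH B {
    shuf_ = FOREACH A GENERATE RANDOM() AS r, *;  -- A here is the bag column of B
    shuf = ORDER shuf_ BY r;
    pick1 = LIMIT shuf 1;
    GENERATE group, FLATTEN(pick1);
};
Note that FLATTEN(pick1) will also carry the random number r; a trailing FOREACH can project it away if you only want the group and the original fields.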

Find all paths of at most length 2 from a set of relationships

I have a connection data set where each row marks that A connects to B, in the form A B. The direct connection between A and B appears only once, either in the form A B or B A. I want to find all the connections at most one hop away, i.e. A and C are at most one hop away if they are directly connected, or if A connects to C through some B.
For example, I have the following direct connection data
1 2
2 4
3 7
4 5
Then the resulting data I want is
1 {2,4}
2 {1,4,5}
3 {7}
4 {1,2,5}
5 {2,4}
7 {3}
Could anybody help me to find a way as efficient as possible? Thank you.
You could do this:
myudf.py
@outputSchema('bagofnums: {(num:int)}')
def merge_distinct(b1, b2):
    # Merge the link fields of two bags of (num, link) tuples, keeping each value only once.
    out = []
    for ignore, n in b1:
        if n not in out:
            out.append(n)
    for ignore, n in b2:
        if n not in out:
            out.append(n)
    return out
script.pig
register 'myudf.py' using jython as myudf ;
A = LOAD 'foo.in' USING PigStorage(' ') AS (num: int, link: int) ;
-- Essentially flips A
B = FOREACH A GENERATE link AS num, num AS link ;
-- We need to union the flipped A with A so that we will know:
-- 3 links to 7
-- 7 links to 3
-- Instead of just:
-- 3 links to 7
C = UNION A, B ;
-- C is in the form (num, link)
-- You can't do JOIN C BY link, C BY num ;
-- So, T is just C replicated
T = FOREACH C GENERATE * ;
D = JOIN C BY link, T BY num ;
E = FOREACH (FILTER D BY $0 != $3) GENERATE $0 AS num, $3 AS link_hopped ;
-- The output from E is (num, link) pairs where the link is one hop away. E.g., given:
-- 1 links to 2
-- 2 links to 4
-- 3 links to 7
-- The output will be:
-- 1 links to 4
F = COGROUP C BY num, E BY num ;
-- I use a UDF here to merge the bags together. Otherwise you will end
-- up with a bag for C (direct links) and E (links one hop away).
G = FOREACH F GENERATE group AS num, myudf.merge_distinct(C, E) ;
Schema and output for G using your sample input:
G: {num: int,bagofnums: {(num: int)}}
(1,{(2),(4)})
(2,{(4),(1),(5)})
(3,{(7)})
(4,{(5),(2),(1)})
(5,{(4),(2)})
(7,{(3)})
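If you would rather avoid the UDF, roughly the same result can be had in pure Pig by aligning the field names, unioning the direct and one-hop edges, and grouping (a sketch; the aliases E2, H, H2, G2 and G3 are new):
E2 = FOREACH E GENERATE num, link_hopped AS link ;  -- rename so the schema matches C and the UNION keeps it
H = UNION C, E2 ;
H2 = DISTINCT H ;
G2 = GROUP H2 BY num ;
G3 = FOREACH G2 GENERATE group AS num, H2.link AS bagofnums ;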

How to optimize a group by statement in PIG latin?

I have a skewed data set and I need to do a group by operation and then a nested foreach on it. Because of the skewed data, a few reducers are taking a long time while the others take almost none. I know a skewed join exists, but what is there for group by and foreach? Here is my pig code (with the variables renamed):
foo_grouped = GROUP foo by FOO;
FOO_stats = FOREACH foo_grouped
{
a_FOO_total = foo_grouped.ATTR;
a_FOO_total = DISTINCT a_FOO_total;
bar_count = foo_grouped.BAR;
bar_count = DISTINCT bar_count;
a_FOO_type1 = FILTER foo_grouped by COND1=='Y';
a_FOO_type1 = a_FOO_type1.ATTR;
a_FOO_type1 = DISTINCT a_FOO_type1;
a_FOO_type2 = FILTER foo_grouped by COND2=='Y' OR COND3=='HIGH';
a_FOO_type2 = a_FOO_type2.ATTR;
a_FOO_type2 = DISTINCT a_FOO_type2;
generate group as FOO,
    COUNT(a_FOO_total) as a_FOO_total, COUNT(a_FOO_type1) as a_FOO_type1, COUNT(a_FOO_type2) as a_FOO_type2, COUNT(bar_count) as bar_count;
}
In your example there are a lot of nested DISTINCT operators within the FOREACH. They are executed in the reducer, which relies on RAM to calculate unique values, and the whole query produces just one MapReduce job. With too many unique elements in a group you could also get memory-related exceptions.
Luckily, Pig Latin is a dataflow language, so you are essentially writing an execution plan. In order to utilize more CPUs you can change your code in a way that forces more MapReduce jobs which can be executed in parallel. For that we should rewrite the query without nested DISTINCT; the trick is to do the distinct operations and then the group by as if you had just one column, and then merge the results. It is very SQL-like, but it works. Here it is:
records = LOAD '....' USING PigStorage(',') AS (g, a, b, c, d, fd, s, w);
selected = FOREACH records GENERATE g, a, b, c, d;
grouped_a = FOREACH selected GENERATE g, a;
grouped_a = DISTINCT grouped_a;
grouped_a_count = GROUP grouped_a BY g;
grouped_a_count = FOREACH grouped_a_count GENERATE FLATTEN(group) as g, COUNT(grouped_a) as a_count;
grouped_b = FOREACH selected GENERATE g, b;
grouped_b = DISTINCT grouped_b;
grouped_b_count = GROUP grouped_b BY g;
grouped_b_count = FOREACH grouped_b_count GENERATE FLATTEN(group) as g, COUNT(grouped_b) as b_count;
grouped_c = FOREACH selected GENERATE g, c;
grouped_c = DISTINCT grouped_c;
grouped_c_count = GROUP grouped_c BY g;
grouped_c_count = FOREACH grouped_c_count GENERATE FLATTEN(group) as g, COUNT(grouped_c) as c_count;
grouped_d = FOREACH selected GENERATE g, d;
grouped_d = DISTINCT grouped_d;
grouped_d_count = GROUP grouped_d BY g;
grouped_d_count = FOREACH grouped_d_count GENERATE FLATTEN(group) as g, COUNT(grouped_d) as d_count;
mrg = JOIN grouped_a_count BY g, grouped_b_count BY g, grouped_c_count BY g, grouped_d_count BY g;
out = FOREACH mrg GENERATE grouped_a_count::g, grouped_a_count::a_count, grouped_b_count::b_count, grouped_c_count::c_count, grouped_d_count::d_count;
STORE out into '....' USING PigStorage(',');
After execution I got the following summary, which shows that the distinct operations, processed by the first job, did not suffer from the skew in the data:
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201206061712_0244 669 45 75 8 13 376 18 202 grouped_a,grouped_b,grouped_c,grouped_d,records,selected DISTINCT,MULTI_QUERY
job_201206061712_0245 1 1 3 3 3 12 12 12 grouped_c_count GROUP_BY,COMBINER
job_201206061712_0246 1 1 3 3 3 12 12 12 grouped_b_count GROUP_BY,COMBINER
job_201206061712_0247 5 1 48 27 33 30 30 30 grouped_a_count GROUP_BY,COMBINER
job_201206061712_0248 1 1 3 3 3 12 12 12 grouped_d_count GROUP_BY,COMBINER
job_201206061712_0249 4 1 3 3 3 12 12 12 mrg,out HASH_JOIN ...,
Input(s):
Successfully read 52215768 records (44863559501 bytes) from: "...."
Output(s):
Successfully stored 9 records (181 bytes) in: "..."
From the job DAG we can see that the group by operations were executed in parallel:
Job DAG:
job_201206061712_0244 -> job_201206061712_0248,job_201206061712_0246,job_201206061712_0247,job_201206061712_0245,
job_201206061712_0248 -> job_201206061712_0249,
job_201206061712_0246 -> job_201206061712_0249,
job_201206061712_0247 -> job_201206061712_0249,
job_201206061712_0245 -> job_201206061712_0249,
job_201206061712_0249
It works fine on my datasets where one of the group key values (in column g) makes up 95% of the data. It also gets rid of memory-related exceptions.
I recently ran into an error with this join: if there are any nulls in the group key, those records will be dropped.
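One way around that (a sketch, not part of the answer above) is to deal with null group keys right after the projection of selected, either by filtering them out or by mapping them to a sentinel value so those rows survive the joins:
-- simplest: drop rows whose group key is null
selected = FILTER selected BY g IS NOT NULL;
-- or keep them under a sentinel key ('NULL_KEY' is arbitrary; assumes g can be cast to chararray)
-- selected = FOREACH selected GENERATE (g IS NULL ? 'NULL_KEY' : (chararray)g) AS g, a, b, c, d;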