Limit but not Order in PIG - apache-pig

I meet one problem while I using Limit in PIG.
The result of Limit is sorted, but I don't want the result be sorted.
From the example on the website:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Using Limit
X = LIMIT A 3;
DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
Is it possible that show the top 3 lines without sorted in the reuslt?
(1,2,3)
(4,2,1)
(8,3,4)
My code is below:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = FOREACH C {
topnresult = LIMIT B $lines;
GENERATE FLATTEN(topnresult);
}
dump D;
Thank you very much.

By default LIMIT will execute ORDER command followed by LIMIT command internally, so obviously you will get the sorted list. There are many way to solve this problem, one option could be
input.txt
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
PigScript:
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = RANK A;
C = FILTER B BY rank_A<=3;
D = FOREACH C GENERATE a1,a2,a3;
DUMP D;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
Option2:
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = GROUP A ALL;
C = FOREACH B {
top3list = LIMIT A 3;
GENERATE FLATTEN(top3list);
}
DUMP C;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
UPDATE: Solution1
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = GROUP C ALL;
E = FOREACH D {
topnresult = LIMIT C $lines;
GENERATE FLATTEN(topnresult);
}
DUMP E;
Solution2:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = RANK C;
E = FILTER D BY rank_C<=$lines;
F = FOREACH E GENERATE $1..;
DUMP F;
I have tested the solution using the below command line and its working fine
>pig -x local -param input='input.txt' -param s_field='$0,$1,$2' -param pattern='$0<10' -param lines=3 myscript.pig

Related

Compare Tuples value present inside a bag with a hardcoded String value

I have a data set with these columns:-
FMID,County,WIC,WICcash
Here is a sample of data:-
1002267,Douglas,Y,N
21005876,Douglas,Y,N
1001666,Douglas,N,Y
I have grouped the data based on County and have filtered the data based on County = 'Douglas'. Here is the output:
(Douglas,{(1002267,Douglas,Y,N),(21005876,Douglas,Y,N),(1001666,Douglas,N,Y)})
Now if the WIC and WICcash columns have value as Y then I want to take the combine count of the values from both the columns.
Here, combining WIC and WICcash columns I have 3 Y values, so my output will be
Douglas 3
How can I achieve this?
Below is the code that I have written till now
load_data = LOAD 'PigPrograms/Markets/DATA_GOV_US_Farmers_Market_DataSet.csv' USING PigStorage(',') as (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
group_markets_by_county = GROUP load_data BY County;
filter_county = FILTER group_markets_by_county BY group == 'Douglas';
DUMP filter_county;
For looking inside a bag, you can use a nested-foreach.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = GROUP A by County;
describe B; /* B: {group: chararray,A: {(FMID: long,County: chararray,WIC: chararray,WICcash: chararray)}} */
C = FOREACH B {
FILTER_WIC_Y = FILTER A by WIC == 'Y';
COUNT_WIC_Y = COUNT(FILTER_WIC_Y);
FILTER_WICcash_Y = FILTER A by WICcash == 'Y';
COUNT_WICcash_Y = COUNT(FILTER_WICcash_Y);
GENERATE group, COUNT_WIC_Y + COUNT_WICcash_Y as count;
}
dump C;
Or, you can replace 'Y'&'N' into 1&0 and add them up.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = FOREACH A GENERATE FMID, County, (WIC == 'Y' ? 1 : 0 ) as wic, (WICcash == 'Y' ? 1 : 0 ) as wiccash;
C = GROUP B by County;
D = FOREACH C GENERATE group, SUM(B.wic) + SUM(B.wiccash) as count;
dump D;

How to calculate the sum using pig script

Getting error while running the below command
Y = FOREACH X GENERATE ('entry1',(chararray)($0 matches '.*entry1.*'? 1:0)) as t1,('entry2',(chararray)($0 matches '.*entry2.*'?1:0)) as t2,('entry3', (chararray)($0 matches '.*entry3.*'?1:0)) as t3,('entry4',(chararray)($0 matches '.*entry4.*'?1:0)) as t4;
UPDATE: full code
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(line))) as word;
C = FOREACH B GENERATE ((word matches '.*entry1.*'? 1:0)) as t1,((word matches '.*entry2.*'?1:0)) as t2,((word matches '.*entry3.*'?1:0)) as t3,((word matches '.*entry4.*'?1:0)) as t4;
D = GROUP C ALL;
E = FOREACH D GENERATE FLATTEN(TOBAG(CONCAT('entry1',' ',(chararray)SUM(C.t1)),CONCAT('entry2',' ',(chararray)SUM(C.t2)),CONCAT('entry3',' ',(chararray)SUM(C.t3)),CONCAT('entry4',' ',(chararray)SUM(C.t4))));
DUMP E;
Output:
(entry1 2)
(entry2 0)
(entry3 2)
(entry4 1)

How to group by index and index+1 in Pig

I have data input like this:
(index,x,y)
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
...
how can I group by [index] and [index + 1] like
{(1, 0.0, 0.0), (2, -0.1, -0.1)}
{(2, -0.1, -0.1), (3, 1.0, -2.2)}
...
Please help me through this. Thanks.
The below approach will work for your case.
input:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(index:int,x:double,y:double);
B = FILTER A BY index>=1;
C = FILTER A BY index>1;
D = FOREACH C GENERATE ($0-1) AS dindex,index,x,y;
E = JOIN B BY index, D BY dindex;
F = FOREACH E GENERATE TOBAG(TOTUPLE(B::index,B::x,B::y),TOTUPLE(D::index,D::x,D::y));
DUMP F;
Output:
({(1,0.0,0.0),(2,-0.1,-0.1)})
({(2,-0.1,-0.1),(3,1.0,-2.2)})
You can use the following query (explanation in comments).
-- load relation
R = LOAD 'data.txt' USING PigStorage(',') AS (index,x,y);
-- project each tuple to 2 different keys
-- one with index and one with index+1
R1 = FOREACH R GENERATE index+0, index, x, y;
R2 = FOREACH R GENERATE index+1, index, x, y;
-- group
result = COGROUP R1 by $0, R2 by $0;
-- clean out wrong combinations
result2 = filter result by NOT(IsEmpty(R1)) and NOT(IsEmpty(R2));
-- flatten the results
result3 = FOREACH result2 GENERATE FLATTEN(R1), FLATTEN(R2);
result4 = FOREACH result3 GENERATE (R1::index,R1::x,R1::y), (R2::index,R2::x,R2::y);
The file I used to test contains the following:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
Note that the parentheses are not present, but you can filter them away using a simple preprocessing script.
The dumps of intermediate results are:
DUMP R;
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
DUMP R1;
((1,1,0.0,0.0))
((2,2,-0.1,-0.1))
((3,3,1.0,-2.2))
DUMP R2;
((1,1,0.0,0.0))
((2,2,-0.1,-0.1))
((3,3,1.0,-2.2))
DUMP result;
(1,{(1,1,0.0,0.0)},{})
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
(4,{},{(4,3,1.0,-2.2)})
DUMP result2;
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
DUMP result3;
(2,2,-0.1,-0.1,2,1,0.0,0.0)
(3,3,1.0,-2.2,3,2,-0.1,-0.1)
DUMP result4;
((2,-0.1,-0.1),(1,0.0,0.0))
((3,1.0,-2.2),(2,-0.1,-0.1))

Random selection in pig after doing group BY

I have a query. I have a data in the format id:int, name:chararray
1, abc
1, def
2, ghi,
2, mno
2, pqr
After that I do Group BY id and my data becomes
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I wan to pick a random value from the bag and I want the output like
1, abc
2, mno
In case we picked up like first tuple for 1 or second tuple for 2
Any idea what How this can be done ?
The question is I have grouped data B;
DESCRIBE B
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
C = FOREACH B GENERATE FLATTEN($1)
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
rand =
FOREACH B {
shuf_ = FOREACH C GENERATE RANDOM() AS r, *; line L
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1);
};
I get an error at line L an error at this point "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
Use a nested foreach. Assign each item in the bag a random value, order by that value, and choose the first one to keep. You can make it more compact than this, but this shows you each idea.
Script:
data = LOAD 'tmp/data.txt' AS (f1:int, f2:chararray);
grpd = GROUP data BY f1;
rand =
FOREACH grpd {
shuf_ = FOREACH data GENERATE f2, RANDOM() AS r;
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1.f2);
};
DUMP rand;
Output:
(1,abc)
(2,ghi)
Running it again:
(1,abc)
(2,pqr)
And again:
(1,def)
(2,pqr)
One more time!
(1,abc)
(2,ghi)
Whee!
(1,def)
(2,mno)

Dividing counts in Pig Script

ch = LOAD 'ch.txt';
ch_all = GROUP ch ALL;
ch_count = FOREACH ch_all GENERATE COUNT(ch);
ca = LOAD 'ca.txt';
ca_all = GROUP ca ALL;
ca_count = FOREACH ca_all GENERATE COUNT(ca);
I have the above pig script code, which computes two counts.
Now I want to divide ch_count by ca_count and store it in a file.
How do I do that?
There is no convenient way to do this in Pig but a JOIN could help you:
Pig:
ch = LOAD 'ch.txt';
ch_all = GROUP ch ALL;
ch_count = FOREACH ch_all GENERATE 'same' AS key, (DOUBLE) COUNT(ch) AS ct;
ca = LOAD 'ca.txt';
ca_all = GROUP ca ALL;
ca_count = FOREACH ca_all GENERATE 'same' AS key, (DOUBLE) COUNT(ca) AS ct;
ca_ch = JOIN ch_count BY key, ca_count BY key;
ca_ch_div = FOREACH ca_ch GENERATE ch_count::ct / ca_count::ct;
DUMP ca_ch_div;
Output:
(0.6666666666666666)
Input:
cat ch.txt
1
2
cat ca.txt
1
2
3