Random selection in pig after doing group BY

Random selection in pig after doing group BY - apache-pig

I have a query. I have a data in the format id:int, name:chararray
1, abc
1, def
2, ghi,
2, mno
2, pqr
After that I do Group BY id and my data becomes
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I wan to pick a random value from the bag and I want the output like
1, abc
2, mno
In case we picked up like first tuple for 1 or second tuple for 2
Any idea what How this can be done ?
The question is I have grouped data B;
DESCRIBE B
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
C = FOREACH B GENERATE FLATTEN($1)
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
rand =
FOREACH B {
shuf_ = FOREACH C GENERATE RANDOM() AS r, *; line L
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1);
};
I get an error at line L an error at this point "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"

Use a nested foreach. Assign each item in the bag a random value, order by that value, and choose the first one to keep. You can make it more compact than this, but this shows you each idea.
Script:
data = LOAD 'tmp/data.txt' AS (f1:int, f2:chararray);
grpd = GROUP data BY f1;
rand =
FOREACH grpd {
shuf_ = FOREACH data GENERATE f2, RANDOM() AS r;
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1.f2);
};
DUMP rand;
Output:
(1,abc)
(2,ghi)
Running it again:
(1,abc)
(2,pqr)
And again:
(1,def)
(2,pqr)
One more time!
(1,abc)
(2,ghi)
Whee!
(1,def)
(2,mno)

Related

Compare Tuples value present inside a bag with a hardcoded String value

I have a data set with these columns:-
FMID,County,WIC,WICcash
Here is a sample of data:-
1002267,Douglas,Y,N
21005876,Douglas,Y,N
1001666,Douglas,N,Y
I have grouped the data based on County and have filtered the data based on County = 'Douglas'. Here is the output:
(Douglas,{(1002267,Douglas,Y,N),(21005876,Douglas,Y,N),(1001666,Douglas,N,Y)})
Now if the WIC and WICcash columns have value as Y then I want to take the combine count of the values from both the columns.
Here, combining WIC and WICcash columns I have 3 Y values, so my output will be
Douglas 3
How can I achieve this?
Below is the code that I have written till now
load_data = LOAD 'PigPrograms/Markets/DATA_GOV_US_Farmers_Market_DataSet.csv' USING PigStorage(',') as (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
group_markets_by_county = GROUP load_data BY County;
filter_county = FILTER group_markets_by_county BY group == 'Douglas';
DUMP filter_county;

For looking inside a bag, you can use a nested-foreach.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = GROUP A by County;
describe B; /* B: {group: chararray,A: {(FMID: long,County: chararray,WIC: chararray,WICcash: chararray)}} */
C = FOREACH B {
FILTER_WIC_Y = FILTER A by WIC == 'Y';
COUNT_WIC_Y = COUNT(FILTER_WIC_Y);
FILTER_WICcash_Y = FILTER A by WICcash == 'Y';
COUNT_WICcash_Y = COUNT(FILTER_WICcash_Y);
GENERATE group, COUNT_WIC_Y + COUNT_WICcash_Y as count;
}
dump C;
Or, you can replace 'Y'&'N' into 1&0 and add them up.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = FOREACH A GENERATE FMID, County, (WIC == 'Y' ? 1 : 0 ) as wic, (WICcash == 'Y' ? 1 : 0 ) as wiccash;
C = GROUP B by County;
D = FOREACH C GENERATE group, SUM(B.wic) + SUM(B.wiccash) as count;
dump D;

Printed HashTable in Pig-Latin?

For example, I have this data:
x, 23
y, 492
v, 2034
x, 45
z, 25
v, 29
Which I want to transform into:
x, 23, 45
y, 492
v, 2034, 29
z, 25
It would be the equivalent of a printed hash table.
Here is my current script:
logs = LOAD 'tmp' using MyLoader (Parameters) as
(x:bytearray, y:bytearray, z, x1, y1:bytearray, z1:long, x2:bytearray,
z2:bytearray, z3:bytearray, z4:float, dataMap:map[],
recs:bag{(record:bytearray)}, key:bytearray, colo:bytearray);
filtered_logs = foreach logs {
info = FILTER records BY record MATCHES 'FIRST_REGEX';
info_records = FOREACH info GENERATE GET_FIELDS($0) as
rec:tuple(mClass:bytearray, rType:bytearray,
rName:bytearray, rStatus:bytearray, rDuration:float,
rData:bytearray, rDataMap:map[]);
name = FOREACH info_records GENERATE rec.rName;
matching_requests = FILTER records BY record MATCHES 'SECOND_REGEX';
GENERATE FLATTEN(client_name) as client_name:chararray,
dataMap#'corr_id_', (SIZE(matching_requests) > 0 ? true : false)
as matched:boolean;
}
A = FILTER filtered_logs BY matched;
key_corr_id = foreach A generate (chararray) $1 as key, (chararray) $2 as corr_id;
id_group = group key_corr_id by key; -- ERROR thrown when this line is included.
STORE id_group into '$output' using
org.apache.pig.piggybank.storage.CSVExcelStorage(, 'YES_MULTILINE');
The error being thrown:
java.lang.ClassCastException: org.apache.pig.data.DataByteArrayString cannot be cast to java.lang.String

No need to create a new relation and join.Just group by the key and dump the relation.
key_corr_id = foreach A generate (chararray) $1 as key:chararray, (chararray) $2 as corr_id:chararray;
id_group = group key_corr_id by key;
dump id_group;
Now if you don't want the tuples say for key x , {(23),(45)} but want the items separated like x,23,45 then add another step to use BagToString on the corr_id in the grouping like this
final = foreach id_group generate key,BagToString(A.$1, ',');
dump final;

How to calculate the sum using pig script

Getting error while running the below command
Y = FOREACH X GENERATE ('entry1',(chararray)($0 matches '.*entry1.*'? 1:0)) as t1,('entry2',(chararray)($0 matches '.*entry2.*'?1:0)) as t2,('entry3', (chararray)($0 matches '.*entry3.*'?1:0)) as t3,('entry4',(chararray)($0 matches '.*entry4.*'?1:0)) as t4;

UPDATE: full code
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(line))) as word;
C = FOREACH B GENERATE ((word matches '.*entry1.*'? 1:0)) as t1,((word matches '.*entry2.*'?1:0)) as t2,((word matches '.*entry3.*'?1:0)) as t3,((word matches '.*entry4.*'?1:0)) as t4;
D = GROUP C ALL;
E = FOREACH D GENERATE FLATTEN(TOBAG(CONCAT('entry1',' ',(chararray)SUM(C.t1)),CONCAT('entry2',' ',(chararray)SUM(C.t2)),CONCAT('entry3',' ',(chararray)SUM(C.t3)),CONCAT('entry4',' ',(chararray)SUM(C.t4))));
DUMP E;
Output:
(entry1 2)
(entry2 0)
(entry3 2)
(entry4 1)

How to group by index and index+1 in Pig

I have data input like this:
(index,x,y)
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
...
how can I group by [index] and [index + 1] like
{(1, 0.0, 0.0), (2, -0.1, -0.1)}
{(2, -0.1, -0.1), (3, 1.0, -2.2)}
...
Please help me through this. Thanks.

The below approach will work for your case.
input:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(index:int,x:double,y:double);
B = FILTER A BY index>=1;
C = FILTER A BY index>1;
D = FOREACH C GENERATE ($0-1) AS dindex,index,x,y;
E = JOIN B BY index, D BY dindex;
F = FOREACH E GENERATE TOBAG(TOTUPLE(B::index,B::x,B::y),TOTUPLE(D::index,D::x,D::y));
DUMP F;
Output:
({(1,0.0,0.0),(2,-0.1,-0.1)})
({(2,-0.1,-0.1),(3,1.0,-2.2)})

You can use the following query (explanation in comments).
-- load relation
R = LOAD 'data.txt' USING PigStorage(',') AS (index,x,y);
-- project each tuple to 2 different keys
-- one with index and one with index+1
R1 = FOREACH R GENERATE index+0, index, x, y;
R2 = FOREACH R GENERATE index+1, index, x, y;
-- group
result = COGROUP R1 by $0, R2 by $0;
-- clean out wrong combinations
result2 = filter result by NOT(IsEmpty(R1)) and NOT(IsEmpty(R2));
-- flatten the results
result3 = FOREACH result2 GENERATE FLATTEN(R1), FLATTEN(R2);
result4 = FOREACH result3 GENERATE (R1::index,R1::x,R1::y), (R2::index,R2::x,R2::y);
The file I used to test contains the following:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
Note that the parentheses are not present, but you can filter them away using a simple preprocessing script.
The dumps of intermediate results are:
DUMP R;
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
DUMP R1;
((1,1,0.0,0.0))
((2,2,-0.1,-0.1))
((3,3,1.0,-2.2))
DUMP R2;
((1,1,0.0,0.0))
((2,2,-0.1,-0.1))
((3,3,1.0,-2.2))
DUMP result;
(1,{(1,1,0.0,0.0)},{})
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
(4,{},{(4,3,1.0,-2.2)})
DUMP result2;
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
DUMP result3;
(2,2,-0.1,-0.1,2,1,0.0,0.0)
(3,3,1.0,-2.2,3,2,-0.1,-0.1)
DUMP result4;
((2,-0.1,-0.1),(1,0.0,0.0))
((3,1.0,-2.2),(2,-0.1,-0.1))

Limit but not Order in PIG

I meet one problem while I using Limit in PIG.
The result of Limit is sorted, but I don't want the result be sorted.
From the example on the website:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Using Limit
X = LIMIT A 3;
DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
Is it possible that show the top 3 lines without sorted in the reuslt?
(1,2,3)
(4,2,1)
(8,3,4)
My code is below:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = FOREACH C {
topnresult = LIMIT B $lines;
GENERATE FLATTEN(topnresult);
}
dump D;
Thank you very much.

By default LIMIT will execute ORDER command followed by LIMIT command internally, so obviously you will get the sorted list. There are many way to solve this problem, one option could be
input.txt
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
PigScript:
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = RANK A;
C = FILTER B BY rank_A<=3;
D = FOREACH C GENERATE a1,a2,a3;
DUMP D;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
Option2:
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = GROUP A ALL;
C = FOREACH B {
top3list = LIMIT A 3;
GENERATE FLATTEN(top3list);
}
DUMP C;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
UPDATE: Solution1
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = GROUP C ALL;
E = FOREACH D {
topnresult = LIMIT C $lines;
GENERATE FLATTEN(topnresult);
}
DUMP E;
Solution2:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = RANK C;
E = FILTER D BY rank_C<=$lines;
F = FOREACH E GENERATE $1..;
DUMP F;
I have tested the solution using the below command line and its working fine
>pig -x local -param input='input.txt' -param s_field='$0,$1,$2' -param pattern='$0<10' -param lines=3 myscript.pig

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Random selection in pig after doing group BY - apache-pig

Related

Compare Tuples value present inside a bag with a hardcoded String value

Printed HashTable in Pig-Latin?

How to calculate the sum using pig script

How to group by index and index+1 in Pig

Limit but not Order in PIG

Categories

Resources