Printed HashTable in Pig-Latin? - apache-pig

For example, I have this data:
x, 23
y, 492
v, 2034
x, 45
z, 25
v, 29
Which I want to transform into:
x, 23, 45
y, 492
v, 2034, 29
z, 25
It would be the equivalent of a printed hash table.
Here is my current script:
logs = LOAD 'tmp' using MyLoader (Parameters) as
(x:bytearray, y:bytearray, z, x1, y1:bytearray, z1:long, x2:bytearray,
z2:bytearray, z3:bytearray, z4:float, dataMap:map[],
recs:bag{(record:bytearray)}, key:bytearray, colo:bytearray);
filtered_logs = foreach logs {
info = FILTER records BY record MATCHES 'FIRST_REGEX';
info_records = FOREACH info GENERATE GET_FIELDS($0) as
rec:tuple(mClass:bytearray, rType:bytearray,
rName:bytearray, rStatus:bytearray, rDuration:float,
rData:bytearray, rDataMap:map[]);
name = FOREACH info_records GENERATE rec.rName;
matching_requests = FILTER records BY record MATCHES 'SECOND_REGEX';
GENERATE FLATTEN(client_name) as client_name:chararray,
dataMap#'corr_id_', (SIZE(matching_requests) > 0 ? true : false)
as matched:boolean;
}
A = FILTER filtered_logs BY matched;
key_corr_id = foreach A generate (chararray) $1 as key, (chararray) $2 as corr_id;
id_group = group key_corr_id by key; -- ERROR thrown when this line is included.
STORE id_group into '$output' using
org.apache.pig.piggybank.storage.CSVExcelStorage(, 'YES_MULTILINE');
The error being thrown:
java.lang.ClassCastException: org.apache.pig.data.DataByteArrayString cannot be cast to java.lang.String

No need to create a new relation and join. Just group by the key and dump the relation.
key_corr_id = foreach A generate (chararray) $1 as key:chararray, (chararray) $2 as corr_id:chararray;
id_group = group key_corr_id by key;
dump id_group;
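If you don't want the values as a bag of tuples, e.g. {(23),(45)} for key x, but as separated items such as x,23,45, add another step that applies BagToString to the corr_id values of each group. After the GROUP, the key field is named group and the bag carries the name of the grouped relation, key_corr_id:
final = foreach id_group generate group, BagToString(key_corr_id.corr_id, ',');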
dump final;
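To check what the GROUP plus BagToString steps should produce on a small sample, here is a plain-Python model of the same logic. It is only a sketch, not Pig: the function name group_to_lines is made up for illustration, and it keeps keys and values in first-seen order, which Pig does not guarantee for groups or for the contents of a bag.

```python
from collections import defaultdict

def group_to_lines(pairs, sep=","):
    """Group values by key and join each group into one delimited line.

    Hypothetical helper mirroring GROUP ... BY key followed by BagToString.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # One line per key: the key first, then every value seen for it.
    return [sep.join([key, *map(str, values)]) for key, values in groups.items()]

# Sample data from the question.
sample = [("x", 23), ("y", 492), ("v", 2034), ("x", 45), ("z", 25), ("v", 29)]
for line in group_to_lines(sample):
    print(line)
```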

Related

Compare Tuples value present inside a bag with a hardcoded String value

I have a data set with these columns:-
FMID,County,WIC,WICcash
Here is a sample of data:-
1002267,Douglas,Y,N
21005876,Douglas,Y,N
1001666,Douglas,N,Y
I have grouped the data based on County and have filtered the data based on County = 'Douglas'. Here is the output:
(Douglas,{(1002267,Douglas,Y,N),(21005876,Douglas,Y,N),(1001666,Douglas,N,Y)})
Now I want to count every Y value in the WIC and WICcash columns and get the combined count across both columns.
Here, the WIC and WICcash columns together contain 3 Y values, so my output should be
Douglas 3
How can I achieve this?
Below is the code that I have written so far:
load_data = LOAD 'PigPrograms/Markets/DATA_GOV_US_Farmers_Market_DataSet.csv' USING PigStorage(',') as (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
group_markets_by_county = GROUP load_data BY County;
filter_county = FILTER group_markets_by_county BY group == 'Douglas';
DUMP filter_county;
To look inside a bag, you can use a nested FOREACH.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = GROUP A by County;
describe B; /* B: {group: chararray,A: {(FMID: long,County: chararray,WIC: chararray,WICcash: chararray)}} */
C = FOREACH B {
FILTER_WIC_Y = FILTER A by WIC == 'Y';
COUNT_WIC_Y = COUNT(FILTER_WIC_Y);
FILTER_WICcash_Y = FILTER A by WICcash == 'Y';
COUNT_WICcash_Y = COUNT(FILTER_WICcash_Y);
GENERATE group, COUNT_WIC_Y + COUNT_WICcash_Y as count;
}
dump C;
Alternatively, you can map 'Y' and 'N' to 1 and 0 and add them up.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = FOREACH A GENERATE FMID, County, (WIC == 'Y' ? 1 : 0 ) as wic, (WICcash == 'Y' ? 1 : 0 ) as wiccash;
C = GROUP B by County;
D = FOREACH C GENERATE group, SUM(B.wic) + SUM(B.wiccash) as count;
dump D;
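Both scripts compute the same per-county total. As a sanity check on a small sample, the second approach (turn each Y into 1, then sum per county) can be modelled in plain Python. This is a sketch rather than Pig, and count_yes_by_county is a hypothetical name.

```python
from collections import Counter

def count_yes_by_county(rows):
    """Sum the number of 'Y' flags in the WIC and WICcash columns per county."""
    totals = Counter()
    for _fmid, county, wic, wiccash in rows:
        # Each comparison is a bool, which adds as 0 or 1.
        totals[county] += (wic == "Y") + (wiccash == "Y")
    return dict(totals)

# Sample rows from the question: (FMID, County, WIC, WICcash).
rows = [
    (1002267, "Douglas", "Y", "N"),
    (21005876, "Douglas", "Y", "N"),
    (1001666, "Douglas", "N", "Y"),
]
print(count_yes_by_county(rows))
```

For the three sample rows this gives a total of 3 for Douglas, matching the output the question asks for.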

Pig: how to parse tuple with variable number of elements?

This is my output file, which I wrote out with another Pig script:
1 3,5
2 4,6,7
I'm trying to parse each line as (chararray, tuple)
data = load 'test45' as (x:chararray, y:tuple());
But when I try to dump the tuples, they're empty:
rows = foreach data generate y;
()
()
Try this:
X = LOAD 'pigtuple.txt' AS (str:chararray);
X1 = FOREACH X GENERATE FLATTEN(STRSPLIT(str, '\\s+')) AS (id:int, attr:chararray);
X3 = FOREACH X1 GENERATE id, STRSPLIT(attr, ',') AS (y:tuple());
X4 = foreach X3 GENERATE id,y;
dump X4;
If you want to access each element in the tuple:
X4 = foreach X3 GENERATE y.$0,y.$1;
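The idea in the script is two splits: first on whitespace to separate the id from the rest, then on commas to get a tuple of any length. A plain-Python model of that two-step parse, useful for checking expected results on sample lines, might look like this. It is a sketch, not Pig; parse_line is a hypothetical name, and the elements are left as strings, as STRSPLIT would leave them.

```python
def parse_line(line):
    """Split 'id<whitespace>a,b,c' into (id, (a, b, c)) with a variable-length tuple."""
    # First split: the id and the remaining comma-separated text.
    key, attrs = line.strip().split(None, 1)
    # Second split: however many comma-separated elements there are.
    return int(key), tuple(attrs.split(","))

for raw in ["1\t3,5", "2\t4,6,7"]:
    print(parse_line(raw))
```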

How to calculate the sum using pig script

I get an error when running the command below:
Y = FOREACH X GENERATE ('entry1',(chararray)($0 matches '.*entry1.*'? 1:0)) as t1,('entry2',(chararray)($0 matches '.*entry2.*'?1:0)) as t2,('entry3', (chararray)($0 matches '.*entry3.*'?1:0)) as t3,('entry4',(chararray)($0 matches '.*entry4.*'?1:0)) as t4;
UPDATE: full code
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(line))) as word;
C = FOREACH B GENERATE ((word matches '.*entry1.*'? 1:0)) as t1,((word matches '.*entry2.*'?1:0)) as t2,((word matches '.*entry3.*'?1:0)) as t3,((word matches '.*entry4.*'?1:0)) as t4;
D = GROUP C ALL;
E = FOREACH D GENERATE FLATTEN(TOBAG(CONCAT('entry1',' ',(chararray)SUM(C.t1)),CONCAT('entry2',' ',(chararray)SUM(C.t2)),CONCAT('entry3',' ',(chararray)SUM(C.t3)),CONCAT('entry4',' ',(chararray)SUM(C.t4))));
DUMP E;
Output:
(entry1 2)
(entry2 0)
(entry3 2)
(entry4 1)
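The full script tokenizes each line, flags every word that matches each pattern with 1 or 0, and sums the flags over the whole relation. A plain-Python model of that counting follows. It is a sketch with assumptions: the input lines are made up (the original input file is not shown), count_entries is a hypothetical name, and words are split on whitespace only, whereas Pig's TOKENIZE also splits on a few other delimiters.

```python
def count_entries(lines, names):
    """Count, for each name, how many words contain it (case-insensitive)."""
    counts = {name: 0 for name in names}
    for line in lines:
        for word in line.lower().split():
            for name in names:
                # Same test as the regex '.*name.*': the word contains the name.
                if name in word:
                    counts[name] += 1
    return counts

# Hypothetical input, not the original data.
lines = ["entry1 entry3 other", "ENTRY1 entry3 entry4"]
names = ["entry1", "entry2", "entry3", "entry4"]
for name, total in count_entries(lines, names).items():
    print(name, total)
```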

How to group by index and index+1 in Pig

I have data input like this:
(index,x,y)
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
...
How can I group by [index] and [index + 1], like this:
{(1, 0.0, 0.0), (2, -0.1, -0.1)}
{(2, -0.1, -0.1), (3, 1.0, -2.2)}
...
Please help me through this. Thanks.
The below approach will work for your case.
input:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(index:int,x:double,y:double);
B = FILTER A BY index>=1;
C = FILTER A BY index>1;
D = FOREACH C GENERATE ($0-1) AS dindex,index,x,y;
E = JOIN B BY index, D BY dindex;
F = FOREACH E GENERATE TOBAG(TOTUPLE(B::index,B::x,B::y),TOTUPLE(D::index,D::x,D::y));
DUMP F;
Output:
({(1,0.0,0.0),(2,-0.1,-0.1)})
({(2,-0.1,-0.1),(3,1.0,-2.2)})
You can use the following query (explanation in comments).
-- load relation
R = LOAD 'data.txt' USING PigStorage(',') AS (index,x,y);
-- project each tuple to 2 different keys
-- one with index and one with index+1
R1 = FOREACH R GENERATE index+0, index, x, y;
R2 = FOREACH R GENERATE index+1, index, x, y;
-- group
result = COGROUP R1 by $0, R2 by $0;
-- clean out wrong combinations
result2 = filter result by NOT(IsEmpty(R1)) and NOT(IsEmpty(R2));
-- flatten the results
result3 = FOREACH result2 GENERATE FLATTEN(R1), FLATTEN(R2);
result4 = FOREACH result3 GENERATE (R1::index,R1::x,R1::y), (R2::index,R2::x,R2::y);
The file I used to test contains the following:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
Note that this file has no parentheses around each row; if your input has them, you can strip them with a simple preprocessing script.
The dumps of intermediate results are:
DUMP R;
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
DUMP R1;
(1,1,0.0,0.0)
(2,2,-0.1,-0.1)
(3,3,1.0,-2.2)
DUMP R2;
(2,1,0.0,0.0)
(3,2,-0.1,-0.1)
(4,3,1.0,-2.2)
DUMP result;
(1,{(1,1,0.0,0.0)},{})
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
(4,{},{(4,3,1.0,-2.2)})
DUMP result2;
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
DUMP result3;
(2,2,-0.1,-0.1,2,1,0.0,0.0)
(3,3,1.0,-2.2,3,2,-0.1,-0.1)
DUMP result4;
((2,-0.1,-0.1),(1,0.0,0.0))
((3,1.0,-2.2),(2,-0.1,-0.1))
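Both answers pair each row with the row whose index is one higher, and drop rows that have no such neighbour. A plain-Python model of that pairing, for checking the expected result on a small sample, is sketched here. It is not Pig; adjacent_pairs is a hypothetical name, and it assumes index values are unique.

```python
def adjacent_pairs(rows):
    """Pair each row with the row whose index is exactly one higher."""
    by_index = {row[0]: row for row in rows}  # assumes unique index values
    return [
        (by_index[i], by_index[i + 1])
        for i in sorted(by_index)
        if i + 1 in by_index  # rows with no successor produce no pair
    ]

# Sample rows from the question: (index, x, y).
rows = [(1, 0.0, 0.0), (2, -0.1, -0.1), (3, 1.0, -2.2)]
for pair in adjacent_pairs(rows):
    print(pair)
```

Each pair here lists the lower index first, as in the first answer's output; the COGROUP answer lists the higher index first.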

Random selection in pig after doing group BY

I have a question. I have data in the format id:int, name:chararray:
1, abc
1, def
2, ghi
2, mno
2, pqr
After that I do a GROUP BY id and my data becomes
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag, so that the output looks like
1, abc
2, mno
Here we happened to pick the first tuple for 1 and the second tuple for 2.
Any idea how this can be done?
My follow-up: I have grouped the data into B.
DESCRIBE B;
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
C = FOREACH B GENERATE FLATTEN($1);
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
rand =
FOREACH B {
shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1);
};
At line L I get this error: "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
Use a nested foreach. Assign each item in the bag a random value, order by that value, and choose the first one to keep. You can make it more compact than this, but this shows you each idea.
Script:
data = LOAD 'tmp/data.txt' AS (f1:int, f2:chararray);
grpd = GROUP data BY f1;
rand =
FOREACH grpd {
shuf_ = FOREACH data GENERATE f2, RANDOM() AS r;
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1.f2);
};
DUMP rand;
Output:
(1,abc)
(2,ghi)
Running it again:
(1,abc)
(2,pqr)
And again:
(1,def)
(2,pqr)
One more time!
(1,abc)
(2,ghi)
Whee!
(1,def)
(2,mno)
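The nested foreach picks one value uniformly at random from each group's bag. A plain-Python model of that behaviour is sketched below. It is not Pig; pick_one_per_group is a hypothetical name, and the rng parameter is an illustrative addition so the choice can be seeded for repeatable runs.

```python
import random
from collections import defaultdict

def pick_one_per_group(rows, rng=random):
    """Group values by key, then choose one value at random from each group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # rng.choice is equivalent to shuffling the group and keeping the first item.
    return {key: rng.choice(values) for key, values in groups.items()}

# Sample data from the question: (id, name).
rows = [(1, "abc"), (1, "def"), (2, "ghi"), (2, "mno"), (2, "pqr")]
print(pick_one_per_group(rows))
```

Passing a seeded generator such as random.Random(0) as rng makes the picks reproducible, which the Pig script's RANDOM() call does not offer in this form.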