Matrix multiplication in Apache Pig

I am trying to perform matrix multiplication in Pig Latin. Here's my attempt so far:
matrix1 = LOAD 'mat1' AS (row,col,value);
matrix2 = LOAD 'mat2' AS (row,col,value);
mult_mat = COGROUP matrix1 BY row, matrix2 BY col;
mult_mat = FOREACH mult_mat {
A = COGROUP matrix1 BY col, matrix2 BY row;
B = FOREACH A GENERATE group AS col, matrix1.value*matrix2.value AS prod;
GENERATE group AS row, B.col AS col, SUM(B.prod) AS value;}
However, this doesn't work. I get stopped at
A = COGROUP matrix1...
with
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 14, column 37> mismatched input 'matrix1' expecting LEFT_PAREN

After some playing around, I figured it out:
matrix1 = LOAD 'mat1' AS (row,col,value);
matrix2 = LOAD 'mat2' AS (row,col,value);
-- pair each matrix1 entry (i,k) with each matrix2 entry (k,j); an inner join avoids null rows for unmatched entries
A = JOIN matrix1 BY col, matrix2 BY row;
B = FOREACH A GENERATE matrix1::row AS m1r, matrix2::col AS m2c, (matrix1::value)*(matrix2::value) AS value;
-- sum the partial products for each output cell (i,j)
C = GROUP B BY (m1r, m2c);
multiplied_matrices = FOREACH C GENERATE group.$0 AS row, group.$1 AS col, SUM(B.value) AS value;
multiplied_matrices holds the product matrix1*matrix2 in the same (row, col, value) format as the two input matrices.
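For readers who want to check the join-then-group idea outside Pig, here is a minimal plain-Python sketch of the same logic. The function name multiply_sparse and the 2x2 sample matrices are assumptions made for illustration; they are not part of the original post.

```python
from collections import defaultdict

def multiply_sparse(matrix1, matrix2):
    """Multiply two matrices given as lists of (row, col, value) triples."""
    # Index matrix2 by its row, which is the join key on that side.
    by_row = defaultdict(list)
    for row, col, value in matrix2:
        by_row[row].append((col, value))

    # Join matrix1's col against matrix2's row, then sum the products
    # for each (matrix1 row, matrix2 col) output cell.
    sums = defaultdict(float)
    for m1r, k, v1 in matrix1:
        for m2c, v2 in by_row.get(k, []):
            sums[(m1r, m2c)] += v1 * v2

    return sorted((r, c, v) for (r, c), v in sums.items())

# Hypothetical inputs: [[1, 2], [3, 4]] times [[5, 6], [7, 8]].
m1 = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
m2 = [(0, 0, 5.0), (0, 1, 6.0), (1, 0, 7.0), (1, 1, 8.0)]
product = multiply_sparse(m1, m2)
print(product)
```

Entries with no partner on the other side contribute nothing, which is the usual behavior for sparse matrices stored as triples.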

Related

Compare Tuples value present inside a bag with a hardcoded String value

I have a data set with these columns:
FMID,County,WIC,WICcash
Here is a sample of the data:
1002267,Douglas,Y,N
21005876,Douglas,Y,N
1001666,Douglas,N,Y
I have grouped the data by County and filtered it on County = 'Douglas'. Here is the output:
(Douglas,{(1002267,Douglas,Y,N),(21005876,Douglas,Y,N),(1001666,Douglas,N,Y)})
Now I want to count every Y value in the WIC and WICcash columns and get the combined count across both columns.
Here, WIC and WICcash together contain 3 Y values, so my output should be
Douglas 3
How can I achieve this?
Below is the code I have written so far:
load_data = LOAD 'PigPrograms/Markets/DATA_GOV_US_Farmers_Market_DataSet.csv' USING PigStorage(',') as (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
group_markets_by_county = GROUP load_data BY County;
filter_county = FILTER group_markets_by_county BY group == 'Douglas';
DUMP filter_county;
To look inside a bag, you can use a nested FOREACH.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = GROUP A by County;
describe B; /* B: {group: chararray,A: {(FMID: long,County: chararray,WIC: chararray,WICcash: chararray)}} */
C = FOREACH B {
    FILTER_WIC_Y = FILTER A by WIC == 'Y';
    COUNT_WIC_Y = COUNT(FILTER_WIC_Y);
    FILTER_WICcash_Y = FILTER A by WICcash == 'Y';
    COUNT_WICcash_Y = COUNT(FILTER_WICcash_Y);
    GENERATE group, COUNT_WIC_Y + COUNT_WICcash_Y as count;
}
dump C;
Alternatively, you can map 'Y' and 'N' to 1 and 0 and add them up.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = FOREACH A GENERATE FMID, County, (WIC == 'Y' ? 1 : 0 ) as wic, (WICcash == 'Y' ? 1 : 0 ) as wiccash;
C = GROUP B by County;
D = FOREACH C GENERATE group, SUM(B.wic) + SUM(B.wiccash) as count;
dump D;
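The map-to-1/0-and-sum idea can be sanity-checked with a small plain-Python sketch. It uses the three sample rows from the question; the function name count_yes_by_county is an assumption introduced here for illustration.

```python
from collections import defaultdict

# Sample rows from the question: (FMID, County, WIC, WICcash).
rows = [
    (1002267, "Douglas", "Y", "N"),
    (21005876, "Douglas", "Y", "N"),
    (1001666, "Douglas", "N", "Y"),
]

def count_yes_by_county(records):
    """Return {county: number of 'Y' values across WIC and WICcash}."""
    counts = defaultdict(int)
    for _fmid, county, wic, wiccash in records:
        # Each comparison is True/False, which adds up as 1/0.
        counts[county] += (wic == "Y") + (wiccash == "Y")
    return dict(counts)

counts = count_yes_by_county(rows)
print(counts)
```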

Append new columns to a pandas dataframe in a groupby object

I would like to add columns to a pandas dataframe in a groupby object.
import numpy as np
import pandas as pd

# create the dataframe
idx = ['a','b','c'] * 10
df = pd.DataFrame({
    'f1' : np.random.randn(30),
    'f2' : np.random.randn(30),
    'f3' : np.random.randn(30),
    'f4' : np.random.randn(30),
    'f5' : np.random.randn(30)},
    index = idx)
colnum = [1,2,3,4,5]
newcol = ['a' + str(s) for s in colnum]
# group by the index
df1 = df.groupby(df.index)
I am trying to loop over each group in the groupby object and add new columns to that group's dataframe:
for group in df1:
    tmp = group[1]
    for s in range(len(tmp.columns)):
        print(s)
        tmp.loc[:,newcol[s]] = tmp[[tmp.columns[s]]] * colnum[s]
    group[1] = tmp
I'm unable to put the new dataframe back into the group object. I get this warning, followed by an error:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
TypeError: 'tuple' object does not support item assignment
Is there a way to replace the dataframe in the groupby object with a new dataframe?
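Based on your code (PS: df.mul([1,2,3,4,5]) also works for your example output):
grouplist = []
for _, group in df1:
    # each item is a (key, dataframe) tuple, so unpack it instead of assigning into it
    tmp = group.copy()  # copy to avoid the "set on a copy of a slice" warning
    for s in range(len(colnum)):
        print(s)
        tmp[newcol[s]] = tmp[tmp.columns[s]] * colnum[s]
    grouplist.append(tmp)
grouplist[1]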
Out[217]:
f1 f2 f3 f4 f5 a1 a2 \
b -0.262064 -1.148832 -1.835077 -0.244675 -0.215145 -0.262064 -2.297664
b -1.595659 -0.448111 -0.908683 -0.157839 0.208497 -1.595659 -0.896222
b 0.373039 -0.557571 1.154175 -0.172326 1.236915 0.373039 -1.115142
b -1.485564 1.508292 0.420220 -0.380387 -0.725848 -1.485564 3.016584
b -0.760250 -0.380997 -0.774745 -0.853975 0.041411 -0.760250 -0.761994
b 0.600410 1.822984 -0.310327 -0.281853 0.458621 0.600410 3.645968
b -0.707724 1.706709 -0.208969 -1.696045 -1.644065 -0.707724 3.413417
b -0.892057 1.225944 -1.027265 -1.519110 -0.861458 -0.892057 2.451888
b -0.454419 -1.989300 2.241945 -1.071738 -0.905364 -0.454419 -3.978601
b 1.171569 -0.827023 -0.404192 -1.495059 0.500045 1.171569 -1.654046
a3 a4 a5
b -5.505230 -0.978700 -1.075727
b -2.726048 -0.631355 1.042483
b 3.462526 -0.689306 6.184576
b 1.260661 -1.521547 -3.629239
b -2.324236 -3.415901 0.207056
b -0.930980 -1.127412 2.293105
b -0.626908 -6.784181 -8.220324
b -3.081796 -6.076439 -4.307289
b 6.725834 -4.286954 -4.526821
b -1.212577 -5.980235 2.500226

How to calculate the sum using pig script

I get an error when running the command below:
Y = FOREACH X GENERATE ('entry1',(chararray)($0 matches '.*entry1.*'? 1:0)) as t1,('entry2',(chararray)($0 matches '.*entry2.*'?1:0)) as t2,('entry3', (chararray)($0 matches '.*entry3.*'?1:0)) as t3,('entry4',(chararray)($0 matches '.*entry4.*'?1:0)) as t4;
UPDATE: full code
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(line))) as word;
C = FOREACH B GENERATE ((word matches '.*entry1.*'? 1:0)) as t1,((word matches '.*entry2.*'?1:0)) as t2,((word matches '.*entry3.*'?1:0)) as t3,((word matches '.*entry4.*'?1:0)) as t4;
D = GROUP C ALL;
E = FOREACH D GENERATE FLATTEN(TOBAG(CONCAT('entry1',' ',(chararray)SUM(C.t1)),CONCAT('entry2',' ',(chararray)SUM(C.t2)),CONCAT('entry3',' ',(chararray)SUM(C.t3)),CONCAT('entry4',' ',(chararray)SUM(C.t4))));
DUMP E;
Output:
(entry1 2)
(entry2 0)
(entry3 2)
(entry4 1)
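The tokenize, flag, and sum steps of the script can be mirrored in a short plain-Python sketch. The two input lines are hypothetical (the post does not show its input file), and count_matches is a name introduced here for illustration; the delimiter set approximates Pig's TOKENIZE defaults.

```python
import re

PATTERNS = ["entry1", "entry2", "entry3", "entry4"]

def count_matches(lines, patterns=PATTERNS):
    """Count, per pattern, how many tokens contain that pattern."""
    counts = {p: 0 for p in patterns}
    for line in lines:
        # Lower-case, then split on whitespace, quotes, commas, parens, '*'.
        tokens = [t for t in re.split(r'[\s",()*]+', line.lower()) if t]
        for word in tokens:
            for p in patterns:
                # Substring test, equivalent to matching '.*p.*'.
                if p in word:
                    counts[p] += 1
    return counts

# Hypothetical input lines, not taken from the original post.
sample = ["entry1 foo entry3", "Entry1,entry3 entry4"]
totals = count_matches(sample)
for name in PATTERNS:
    print(name, totals[name])
```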

How to group by index and index+1 in Pig

I have data input like this:
(index,x,y)
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
...
How can I group by [index] and [index + 1], like this:
{(1, 0.0, 0.0), (2, -0.1, -0.1)}
{(2, -0.1, -0.1), (3, 1.0, -2.2)}
...
Please help me through this. Thanks.
The approach below will work for your case.
input:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(index:int,x:double,y:double);
B = FILTER A BY index>=1;
C = FILTER A BY index>1;
D = FOREACH C GENERATE ($0-1) AS dindex,index,x,y;
E = JOIN B BY index, D BY dindex;
F = FOREACH E GENERATE TOBAG(TOTUPLE(B::index,B::x,B::y),TOTUPLE(D::index,D::x,D::y));
DUMP F;
Output:
({(1,0.0,0.0),(2,-0.1,-0.1)})
({(2,-0.1,-0.1),(3,1.0,-2.2)})
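The self-join on index = next index - 1 can be sketched in plain Python as a dictionary lookup. This is only an illustration under the assumption that each index appears once; pair_with_next is a name introduced here.

```python
def pair_with_next(records):
    """Pair each (index, x, y) record with the record at index + 1."""
    # Assumes every index value is unique.
    by_index = {rec[0]: rec for rec in records}
    pairs = []
    for index in sorted(by_index):
        nxt = by_index.get(index + 1)
        if nxt is not None:  # the last record has no successor
            pairs.append((by_index[index], nxt))
    return pairs

# Sample input from the question.
data = [(1, 0.0, 0.0), (2, -0.1, -0.1), (3, 1.0, -2.2)]
pairs = pair_with_next(data)
for pair in pairs:
    print(pair)
```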
You can use the following query (explanation in comments).
-- load relation
R = LOAD 'data.txt' USING PigStorage(',') AS (index,x,y);
-- project each tuple to 2 different keys
-- one with index and one with index+1
R1 = FOREACH R GENERATE index+0, index, x, y;
R2 = FOREACH R GENERATE index+1, index, x, y;
-- group
result = COGROUP R1 by $0, R2 by $0;
-- clean out wrong combinations
result2 = filter result by NOT(IsEmpty(R1)) and NOT(IsEmpty(R2));
-- flatten the results
result3 = FOREACH result2 GENERATE FLATTEN(R1), FLATTEN(R2);
result4 = FOREACH result3 GENERATE (R1::index,R1::x,R1::y), (R2::index,R2::x,R2::y);
The file I used to test contains the following:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
Note that the file has no parentheses around each record; if your input does, strip them first with a simple preprocessing script.
The dumps of intermediate results are:
DUMP R;
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
DUMP R1;
(1,1,0.0,0.0)
(2,2,-0.1,-0.1)
(3,3,1.0,-2.2)
DUMP R2;
(2,1,0.0,0.0)
(3,2,-0.1,-0.1)
(4,3,1.0,-2.2)
DUMP result;
(1,{(1,1,0.0,0.0)},{})
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
(4,{},{(4,3,1.0,-2.2)})
DUMP result2;
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
DUMP result3;
(2,2,-0.1,-0.1,2,1,0.0,0.0)
(3,3,1.0,-2.2,3,2,-0.1,-0.1)
DUMP result4;
((2,-0.1,-0.1),(1,0.0,0.0))
((3,1.0,-2.2),(2,-0.1,-0.1))

Random selection in pig after doing group BY

I have data in the format id:int, name:chararray
1, abc
1, def
2, ghi
2, mno
2, pqr
After that I GROUP BY id and my data becomes
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag, and I want output like
1, abc
2, mno
(in this case the first tuple was picked for 1 and the second tuple for 2).
Any idea how this can be done?
My situation is that I have grouped data B:
DESCRIBE B;
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
C = FOREACH B GENERATE FLATTEN($1);
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
rand =
FOREACH B {
    shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
    shuf = ORDER shuf_ BY r;
    pick1 = LIMIT shuf 1;
    GENERATE
        group,
        FLATTEN(pick1);
};
At line L, I get the error "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
Use a nested foreach. Assign each item in the bag a random value, order by that value, and choose the first one to keep. You can make it more compact than this, but this shows you each idea.
Script:
data = LOAD 'tmp/data.txt' AS (f1:int, f2:chararray);
grpd = GROUP data BY f1;
rand =
FOREACH grpd {
    shuf_ = FOREACH data GENERATE f2, RANDOM() AS r;
    shuf = ORDER shuf_ BY r;
    pick1 = LIMIT shuf 1;
    GENERATE
        group,
        FLATTEN(pick1.f2);
};
DUMP rand;
Output:
(1,abc)
(2,ghi)
Running it again:
(1,abc)
(2,pqr)
And again:
(1,def)
(2,pqr)
One more time!
(1,abc)
(2,ghi)
Whee!
(1,def)
(2,mno)
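For comparison, here is a minimal plain-Python sketch of picking one random member per group. Note that it swaps in random.choice rather than the shuffle-and-take-first approach of the Pig script; both give each member of a group an equal chance. The function name pick_one_per_group and the seeded generator are assumptions for illustration.

```python
import random
from collections import defaultdict

def pick_one_per_group(records, rng=random):
    """Group (key, value) pairs by key and pick one value per key at random."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    # rng.choice selects uniformly among the values of each group.
    return {key: rng.choice(values) for key, values in groups.items()}

# Sample data from the question.
data = [(1, "abc"), (1, "def"), (2, "ghi"), (2, "mno"), (2, "pqr")]

# A seeded generator makes a run repeatable; omit it for fresh picks each run.
picked = pick_one_per_group(data, random.Random(0))
for key in sorted(picked):
    print(key, picked[key])
```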