Counting how many words of each length are in the data, for example (8,1) as (length, count) - apache-pig

The function should output pairs in a format such as <"Length 8", 1> or <"Length 7", 1>, or something similar such as <"8", 1>.
To get the length of a string such as "theWord" in Pig, use the SIZE function on each word. To concatenate the size of a word with the string "Length ", use the CONCAT function on each size. Lastly, to convert an integer to a string so that it can be concatenated with another string, cast it with (CHARARRAY). For example, I would use "(CHARARRAY)SIZE(word)".
I have written some code, but when I try to dump the data it does not do what I expect. I think I might need a COUNT function, but I am a little stumped.
p1 = LOAD 'poems/input/Poem1.txt' USING TextLoader AS(line:Chararray);
p2 = LOAD 'poems/input/Poem2.txt' USING TextLoader AS(line:Chararray);
p3 = LOAD 'poems/input/Poem3.txt' USING TextLoader AS(line:Chararray);
p4 = LOAD 'poems/input/Poem4.txt' USING TextLoader AS(line:Chararray);
p5 = LOAD 'poems/input/Poem5.txt' USING TextLoader AS(line:Chararray);
p6 = LOAD 'poems/input/Poem6.txt' USING TextLoader AS(line:Chararray);
p = UNION p1, p2, p3, p4, p5, p6;
words = foreach p generate flatten(TOKENIZE(line , ' ,;:!?\t\n\r\f\\.\\-')) as word;
words_lower = foreach words generate LOWER(word) as word_lower;
words_unique = group words_lower by word_lower;
words_with_size = foreach words_unique generate SIZE(words_lower) as size, group;
words_with_size_concat = CONCAT words_with_count BY (CHARARRAY)size(words_lower) DESC, group;

I figured it out! The code should be as such:
p1 = LOAD 'poems/input/Poem1.txt' USING TextLoader AS(line:Chararray);
p2 = LOAD 'poems/input/Poem2.txt' USING TextLoader AS(line:Chararray);
p3 = LOAD 'poems/input/Poem3.txt' USING TextLoader AS(line:Chararray);
p4 = LOAD 'poems/input/Poem4.txt' USING TextLoader AS(line:Chararray);
p5 = LOAD 'poems/input/Poem5.txt' USING TextLoader AS(line:Chararray);
p6 = LOAD 'poems/input/Poem6.txt' USING TextLoader AS(line:Chararray);
p = UNION p1, p2, p3, p4, p5, p6;
words = foreach p generate flatten(TOKENIZE(line , ' ,;:!?\t\n\r\f\\.\\-')) as word;
words_lower = foreach words generate LOWER(word) as word_lower;
words_length = foreach words generate CONCAT('Length ', (CHARARRAY)SIZE(word)) as word_length;
words_unique = group words_length by word_length
words_with_count = foreach words_unique generate COUNT(words_length) as cnt, group;
words_with_count_sorted = ORDER words_with_count BY cnt DESC, group;
store words_with_count_sorted into 'poems/output/wordcount1';

I believe the words_length statement should be modified. It does not make sense to convert the words to lower case and then not use them. Consider:
words_length = foreach words_lower generate CONCAT('Length ', (CHARARRAY)SIZE(word_lower)) as word_length;
In fact, the length count can be applied directly to words without lowercasing them, so the words_lower statement may be removed.
Also, don't forget the ';' at the end of the words_unique statement.
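To sanity-check the Pig output on a small sample, the same counting can be sketched in plain Python. This is only a reference sketch, not part of the Pig script: the function name word_length_counts and the sample lines are made up for illustration, and it assumes every character in the TOKENIZE delimiter string acts as a separator. Note that it returns (label, count) pairs, whereas the Pig script above emits (cnt, group).

```python
import re
from collections import Counter

# Assumed separators, mirroring the TOKENIZE call: space , ; : ! ? tab newline CR FF . -
DELIMITERS = r"[ ,;:!?\t\n\r\f.\-]+"


def word_length_counts(lines):
    """Return ("Length N", count) pairs, highest count first, then by label."""
    counts = Counter()
    for line in lines:
        for word in re.split(DELIMITERS, line):
            if word:  # re.split yields empty strings at the edges of a line
                counts["Length " + str(len(word))] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample input, not taken from the poem files.
sample = ["The quick brown fox", "jumps over the lazy dog."]
for label, count in word_length_counts(sample):
    print(label, count)
```

Lowercasing is left out on purpose, since it does not change the length of a word.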

Related

matrix multiplication apache pig

I am trying to perform matrix multiplication in Pig Latin. Here's my attempt so far:
matrix1 = LOAD 'mat1' AS (row,col,value);
matrix2 = LOAD 'mat2' AS (row,col,value);
mult_mat = COGROUP matrix1 BY row, matrix2 BY col;
mult_mat = FOREACH mult_mat {
A = COGROUP matrix1 BY col, matrix2 BY row;
B = FOREACH A GENERATE group AS col, matrix1.value*matrix2.value AS prod;
GENERATE group AS row, B.col AS col, SUM(B.prod) AS value;}
However, this doesn't work. I get stopped at
A = COGROUP matrix1...
with
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 14, column 37> mismatched input 'matrix1' expecting LEFT_PAREN
After some playing around, I figured it out:
matrix1 = LOAD 'mat1' AS (row,col,value);
matrix2 = LOAD 'mat2' AS (row,col,value);
A = JOIN matrix1 BY col FULL OUTER, matrix2 BY row;
B = FOREACH A GENERATE matrix1::row AS m1r, matrix2::col AS m2c, (matrix1::value)*(matrix2::value) AS value;
C = GROUP B BY (m1r, m2c);
multiplied_matrices = FOREACH C GENERATE group.$0 as row, group.$1 as column, SUM(B.value) AS val;
multiplied_matrices should contain the product matrix1*matrix2 in the same format in which the two matrices were entered: (row, col, value).
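The join-then-group idea can be checked against a small reference sketch in Python. The function name multiply_sparse and the 2x2 sample matrices are assumptions for illustration. Unlike the script above, this sketch behaves like an inner join, so cells with no matching entries are simply absent from the result.

```python
from collections import defaultdict


def multiply_sparse(matrix1, matrix2):
    """Multiply two matrices given as lists of (row, col, value) triples."""
    # Counterpart of joining matrix1.col with matrix2.row: index matrix2 by row.
    by_row = defaultdict(list)
    for row, col, value in matrix2:
        by_row[row].append((col, value))

    # Counterpart of GROUP BY (m1r, m2c) followed by SUM: add up the products per cell.
    cells = defaultdict(float)
    for row, col, value in matrix1:
        for col2, value2 in by_row.get(col, []):
            cells[(row, col2)] += value * value2

    return sorted((r, c, v) for (r, c), v in cells.items())


# Hypothetical inputs: [[1, 2], [3, 4]] times [[5, 6], [7, 8]] with 1-based indices.
m1 = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)]
m2 = [(1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8)]
print(multiply_sparse(m1, m2))
```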

Pig: how to parse tuple with variable number of elements?

This is my output file, which I wrote out with another Pig script:
1 3,5
2 4,6,7
I'm trying to parse each line as (chararray, tuple)
data = load 'test45' as (x:chararray, y:tuple());
But when I try to dump the tuples, they're empty:
rows = foreach data generate y;
()
()
Try this:
X = LOAD 'pigtuple.txt' AS (str:chararray);
X1 = FOREACH X GENERATE FLATTEN(STRSPLIT(str, '\\s+')) AS (id:int, attr:chararray);
X3 = FOREACH X1 GENERATE id, STRSPLIT(attr, ',') AS (y:tuple());
X4 = foreach X3 GENERATE id,y;
dump X4;
If you want to access each element of the tuple:
X4 = foreach X3 GENERATE y.$0,y.$1;
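For comparison, the two-step split (first on whitespace, then on commas) looks like this in plain Python. The function name parse_line is made up for illustration, and the sketch assumes each line has exactly one id followed by one comma-separated list. As with STRSPLIT, the tuple elements stay strings.

```python
def parse_line(line):
    """Split 'id<whitespace>a,b,c' into (id, (a, b, c))."""
    # First split: the id and the attribute list, separated by any whitespace.
    key, attrs = line.split(None, 1)
    # Second split: a variable number of comma-separated elements.
    return int(key), tuple(attrs.strip().split(","))


for raw in ["1 3,5", "2 4,6,7"]:
    print(parse_line(raw))
```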

How to calculate the sum using pig script

I get an error when running the command below:
Y = FOREACH X GENERATE ('entry1',(chararray)($0 matches '.*entry1.*'? 1:0)) as t1,('entry2',(chararray)($0 matches '.*entry2.*'?1:0)) as t2,('entry3', (chararray)($0 matches '.*entry3.*'?1:0)) as t3,('entry4',(chararray)($0 matches '.*entry4.*'?1:0)) as t4;
UPDATE: full code
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(line))) as word;
C = FOREACH B GENERATE ((word matches '.*entry1.*'? 1:0)) as t1,((word matches '.*entry2.*'?1:0)) as t2,((word matches '.*entry3.*'?1:0)) as t3,((word matches '.*entry4.*'?1:0)) as t4;
D = GROUP C ALL;
E = FOREACH D GENERATE FLATTEN(TOBAG(CONCAT('entry1',' ',(chararray)SUM(C.t1)),CONCAT('entry2',' ',(chararray)SUM(C.t2)),CONCAT('entry3',' ',(chararray)SUM(C.t3)),CONCAT('entry4',' ',(chararray)SUM(C.t4))));
DUMP E;
Output:
(entry1 2)
(entry2 0)
(entry3 2)
(entry4 1)
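The logic of the script (flag each token with 1 or 0 per pattern, then sum the flags) can be sketched in Python for checking small inputs. The function name count_entries and the sample lines are hypothetical. The sketch also splits on whitespace only, whereas TOKENIZE uses a few more default delimiters, so results can differ on punctuated input.

```python
ENTRIES = ["entry1", "entry2", "entry3", "entry4"]


def count_entries(lines, entries=ENTRIES):
    """Count, per entry name, how many tokens contain that name."""
    totals = {entry: 0 for entry in entries}
    for line in lines:
        for word in line.lower().split():
            for entry in entries:
                # Same idea as: word matches '.*entryN.*' ? 1 : 0
                if entry in word:
                    totals[entry] += 1
    return [f"{entry} {totals[entry]}" for entry in entries]


# Hypothetical input lines, chosen only to illustrate the counting.
sample = ["entry1 foo entry3", "Entry1 entry3 entry4"]
for row in count_entries(sample):
    print(row)
```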

How to group by index and index+1 in Pig

I have data input like this:
(index,x,y)
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
...
How can I group by [index] and [index + 1], like this:
{(1, 0.0, 0.0), (2, -0.1, -0.1)}
{(2, -0.1, -0.1), (3, 1.0, -2.2)}
...
Please help me through this. Thanks.
The below approach will work for your case.
input:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(index:int,x:double,y:double);
B = FILTER A BY index>=1;
C = FILTER A BY index>1;
D = FOREACH C GENERATE ($0-1) AS dindex,index,x,y;
E = JOIN B BY index, D BY dindex;
F = FOREACH E GENERATE TOBAG(TOTUPLE(B::index,B::x,B::y),TOTUPLE(D::index,D::x,D::y));
DUMP F;
Output:
({(1,0.0,0.0),(2,-0.1,-0.1)})
({(2,-0.1,-0.1),(3,1.0,-2.2)})
You can use the following query (explanation in comments).
-- load relation
R = LOAD 'data.txt' USING PigStorage(',') AS (index,x,y);
-- project each tuple to 2 different keys
-- one with index and one with index+1
R1 = FOREACH R GENERATE index+0, index, x, y;
R2 = FOREACH R GENERATE index+1, index, x, y;
-- group
result = COGROUP R1 by $0, R2 by $0;
-- clean out wrong combinations
result2 = filter result by NOT(IsEmpty(R1)) and NOT(IsEmpty(R2));
-- flatten the results
result3 = FOREACH result2 GENERATE FLATTEN(R1), FLATTEN(R2);
result4 = FOREACH result3 GENERATE (R1::index,R1::x,R1::y), (R2::index,R2::x,R2::y);
The file I used to test contains the following:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
Note that the parentheses from your example are not present in my file; if your data has them, you can strip them with a simple preprocessing script.
The dumps of intermediate results are:
DUMP R;
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
DUMP R1;
(1,1,0.0,0.0)
(2,2,-0.1,-0.1)
(3,3,1.0,-2.2)
DUMP R2;
(2,1,0.0,0.0)
(3,2,-0.1,-0.1)
(4,3,1.0,-2.2)
DUMP result;
(1,{(1,1,0.0,0.0)},{})
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
(4,{},{(4,3,1.0,-2.2)})
DUMP result2;
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
DUMP result3;
(2,2,-0.1,-0.1,2,1,0.0,0.0)
(3,3,1.0,-2.2,3,2,-0.1,-0.1)
DUMP result4;
((2,-0.1,-0.1),(1,0.0,0.0))
((3,1.0,-2.2),(2,-0.1,-0.1))
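Both answers amount to a self-join of the relation on index and index + 1. A minimal Python sketch of that pairing follows, useful for checking expected output. The function name consecutive_pairs is made up, and the sketch assumes each index occurs at most once.

```python
def consecutive_pairs(rows):
    """Pair each (index, x, y) row with the row whose index is one higher."""
    by_index = {row[0]: row for row in rows}
    return [
        (by_index[i], by_index[i + 1])
        for i in sorted(by_index)
        if i + 1 in by_index  # drop rows that have no successor
    ]


rows = [(1, 0.0, 0.0), (2, -0.1, -0.1), (3, 1.0, -2.2)]
for pair in consecutive_pairs(rows):
    print(pair)
```

The pairs come out in (index, index + 1) order, as in the first answer; the second answer lists the higher index first.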

Dividing counts in Pig Script

ch = LOAD 'ch.txt';
ch_all = GROUP ch ALL;
ch_count = FOREACH ch_all GENERATE COUNT(ch);
ca = LOAD 'ca.txt';
ca_all = GROUP ca ALL;
ca_count = FOREACH ca_all GENERATE COUNT(ca);
I have the above pig script code, which computes two counts.
Now I want to divide ch_count by ca_count and store it in a file.
How do I do that?
There is no convenient way to do this in Pig but a JOIN could help you:
Pig:
ch = LOAD 'ch.txt';
ch_all = GROUP ch ALL;
ch_count = FOREACH ch_all GENERATE 'same' AS key, (DOUBLE) COUNT(ch) AS ct;
ca = LOAD 'ca.txt';
ca_all = GROUP ca ALL;
ca_count = FOREACH ca_all GENERATE 'same' AS key, (DOUBLE) COUNT(ca) AS ct;
ca_ch = JOIN ch_count BY key, ca_count BY key;
ca_ch_div = FOREACH ca_ch GENERATE ch_count::ct / ca_count::ct;
DUMP ca_ch_div;
Output:
(0.6666666666666666)
Input:
cat ch.txt
1
2
cat ca.txt
1
2
3
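As a quick check of the expected value, the same ratio can be computed in Python from the two sample files. The function name count_ratio is an assumption for illustration, and the sketch simply counts every line.

```python
def count_ratio(ch_lines, ca_lines):
    """Return the number of ch records divided by the number of ca records."""
    if not ca_lines:
        raise ValueError("ca is empty, so the ratio is undefined")
    # Python's / is already floating-point division, like the (DOUBLE) casts above.
    return len(ch_lines) / len(ca_lines)


ch = ["1", "2"]       # contents of ch.txt from the example
ca = ["1", "2", "3"]  # contents of ca.txt from the example
print(count_ratio(ch, ca))
```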