Doing word count in pig - apache-pig

I have data already processed in the following form:
( id ,{ bag of words})
So for example:
(foobar, {(foo), (foo),(foobar),(bar)})
(foo,{(bar),(bar)})
and so on..
describe processed gives me:
processed: {id: chararray,tokens: {tuple_of_tokens: (token: chararray)}}
Now I also want to count the number of times each word appears for each id in this data, and output it as:
foobar, foo, 2
foobar,foobar,1
foobar,bar,1
foo,bar,2
and so on...
How do I do this in pig?

Though you can do this in pure pig, it should be much more efficient to do this with a UDF. Something along the lines of:
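@outputSchema('wordcounts: {T:(word:chararray, count:int)}')
def generate_wordcount(BAG):
    d = {}
    # Each element of the bag arrives as a one-field tuple, so take field 0.
    for t in BAG:
        word = t[0]
        if word in d:
            d[word] += 1
        else:
            d[word] = 1
    # A list of (word, count) tuples, matching the declared schema.
    return d.items()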
You can then use this UDF like this:
REGISTER 'myudfs.py' USING jython AS myudfs ;
-- A: (id, words: {T:(word:chararray)})
B = FOREACH A GENERATE id, FLATTEN(myudfs.generate_wordcount(words)) ;
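The counting logic of the UDF can be checked outside Pig. The following is a minimal plain-Python sketch, assuming the bag reaches the function as a list of one-field tuples; the sorting is only added here to make the printed result stable and is not something Pig requires.

```python
from collections import Counter


def generate_wordcount(bag):
    """Count words in a bag modeled as a list of one-field tuples."""
    # Each bag element looks like ('foo',), so unwrap field 0 before counting.
    counts = Counter(t[0] for t in bag)
    # Return (word, count) pairs, sorted by word for a stable result.
    return sorted(counts.items())


if __name__ == "__main__":
    bag = [("foo",), ("foo",), ("foobar",), ("bar",)]
    print(generate_wordcount(bag))
    # [('bar', 1), ('foo', 2), ('foobar', 1)]
```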

Try this:
$ cat input
foobar foo
foobar foo
foobar foobar
foobar bar
foo bar
foo bar
--preparing
inputs = LOAD 'input' AS (first: chararray, second: chararray);
grouped = GROUP inputs BY first;
formatted = FOREACH grouped GENERATE group, inputs.second AS second;
--what you need
flattened = FOREACH formatted GENERATE group, FLATTEN(second);
result = FOREACH (GROUP flattened BY (group, second)) GENERATE FLATTEN(group), COUNT(flattened);
DUMP result;
Output:
(foo,bar,2)
(foobar,bar,1)
(foobar,foo,2)
(foobar,foobar,1)
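For readers who want to see what the FLATTEN, GROUP BY (group, second) and COUNT steps compute, here is a rough plain-Python model of them. It is only an illustration of the semantics, not how Pig executes the script; the function name count_pairs is made up for this sketch.

```python
from collections import Counter


def count_pairs(rows):
    """rows: iterable of (id, word) pairs, i.e. the flattened relation."""
    # Counting identical (id, word) pairs plays the role of
    # GROUP flattened BY (group, second) followed by COUNT.
    counts = Counter(rows)
    return sorted((key, word, n) for (key, word), n in counts.items())


if __name__ == "__main__":
    rows = [
        ("foobar", "foo"), ("foobar", "foo"), ("foobar", "foobar"),
        ("foobar", "bar"), ("foo", "bar"), ("foo", "bar"),
    ]
    for row in count_pairs(rows):
        print(row)
    # ('foo', 'bar', 2)
    # ('foobar', 'bar', 1)
    # ('foobar', 'foo', 2)
    # ('foobar', 'foobar', 1)
```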

Related

Transpose array to rows using PIG latin

How do I convert the array elements in a BAG into multiple rows? For example:
My input:
tuple, ARRAY_ELEM
(32,{(1,emp,3271409712),(2,emp,3271409712)})
Output
(32,1,emp,3271409712)
(32,2,emp,3271409712)
You probably need to call FLATTEN twice.
Note, FLATTEN on a tuple just elevates each field in the tuple to a top-level field.
FLATTEN on a bag produces a cross product of every record in the bag with all of the other expressions in GENERATE.
A = load 'test.txt' using PigStorage() as (a0:int, t1:(a1:int, b1:{(a3:int,a4:chararray,a5:chararray)}));
describe A;
B = FOREACH A GENERATE FLATTEN(t1);
describe B;
C = FOREACH B GENERATE a1, FLATTEN(b1);
describe C;
dump C;
Output
A: {a0: int,t1: (a1: int,b1: {(a3: int,a4: chararray,a5: chararray)})}
B: {t1::a1: int,t1::b1: {(a3: int,a4: chararray,a5: chararray)}}
C: {t1::a1: int,t1::b1::a3: int,t1::b1::a4: chararray,t1::b1::a5: chararray}
(32,1,emp,3271409712)
(32,2,emp,3271409712)
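The behaviour of FLATTEN on a bag described above can be modeled in a few lines of plain Python. This is only a sketch of the semantics, with the bag represented as a list of tuples; flatten_bag is a hypothetical helper name.

```python
def flatten_bag(record):
    """Model of FLATTEN on a bag: one output row per tuple in the bag."""
    key, bag = record
    # The non-bag field is repeated for every tuple in the bag,
    # and the tuple's own fields are lifted to the top level.
    return [(key,) + tuple(t) for t in bag]


if __name__ == "__main__":
    record = (32, [(1, "emp", "3271409712"), (2, "emp", "3271409712")])
    for row in flatten_bag(record):
        print(row)
    # (32, 1, 'emp', '3271409712')
    # (32, 2, 'emp', '3271409712')
```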

Pick a random value from a bag

I have grouped data in the relation B in the format:
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag, and I want the output to look like:
1, abc
2, mno
(in the case where the first tuple was picked for 1 and the second tuple for 2).
The issue is that I only have the grouped data B:
DESCRIBE B
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
If I try to flatten it by
C = FOREACH B GENERATE FLATTEN($1)
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
Then I try to do
rand =
    FOREACH B {
        shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
        shuf = ORDER shuf_ BY r;
        pick1 = LIMIT shuf 1;
        GENERATE
            group,
            FLATTEN(pick1);
    };
I get an error at line L: "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
You can't refer to C inside a FOREACH on B, because C is itself built from B. Inside the nested block you need to use the bag that B actually contains, i.e. A.
Looking at your DESCRIBE output:
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
Why not use A inside the nested block instead? That should work.
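To make the intent concrete: the nested block is meant to shuffle each group's own bag and keep one tuple. A small plain-Python sketch of that per-group pick is shown here, assuming the grouped relation is modeled as a dict from group key to a non-empty list of tuples; pick_one and the seeded generator are illustrative choices, not part of Pig.

```python
import random


def pick_one(grouped, rng=random):
    """grouped: dict mapping group key -> non-empty list of tuples (the bag)."""
    # One independent random choice per group, taken from that group's own bag.
    return {key: rng.choice(bag) for key, bag in grouped.items()}


if __name__ == "__main__":
    grouped = {
        1: [(1, "abc"), (1, "def")],
        2: [(2, "ghi"), (2, "mno"), (2, "pqr")],
    }
    # A seeded generator makes the run repeatable; which tuples come out
    # still depends on the seed.
    for key, picked in pick_one(grouped, random.Random(0)).items():
        print(key, picked[1])
```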

Getting a count of a particular string

How can I count the number of occurrences of a particular string, say 'Y', for each individual row, and then do calculations on that count? For example, how can I find the number of 'Y' values for each 'FMID' and do calculations on that count for each FMID?
You could use the built-in TOKENIZE function, which converts a row into a BAG, and then use nested filtering in the FOREACH to get a BAG that contains only the word you are interested in, on which you can use COUNT. See FOREACH description
For example:
inpt = load '....' as (line : chararray);
row_bags = foreach inpt generate line, TOKENIZE(line) as word;
cnt = foreach row_bags {
    -- each tuple in the bag has a single token field, referenced as $0
    match_1 = filter word by $0 == 'Y';
    match_2 = filter word by $0 == 'X';
    generate line, COUNT(match_1) as count_1, COUNT(match_2) as count_2;
};
dump cnt;
Using some functions from the DataFu library, you could get a count for each string in the BAG word.
Here's the simplest way I can think of solving your problem:
define Transpose datafu.pig.util.TransposeTupleToBag();
data = LOAD 'input' USING PigStorage(',') AS
(fmid:int, field1:chararray, field2:chararray,
field3:chararray, field4:chararray);
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
data2 = FOREACH data2 {
    y_fields = FILTER fields BY value == 'Y';
    GENERATE fmid, SIZE(y_fields) as y_cnt;
}
I'm not really certain about the data schema you're working with. I'm going to assume you have a relation in Pig consisting of a sequence of tuples. I'm also going to assume you have a lot of fields, making it a pain to reference each individually.
I'll walk through this example piece by piece to explain. Without loss of generality, I will use this data below for my example:
data = LOAD 'input' USING PigStorage(',') AS (fmid:int, field1:chararray, field2:chararray, field3:chararray, field4:chararray);
Now that we've loaded the data, we want to transpose each tuple to a bag, because once it is in a bag we can perform counts on the items within it more easily. We'll use TransposeTupleToBag from the DataFu library:
define Transpose datafu.pig.util.TransposeTupleToBag();
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
Note that you need at least Pig 0.11 to use this UDF. Note the use of $1.., which is known as a project-range expression in Pig. If you have many fields this is really convenient.
If you were to dump data2 at this point you would get this:
(1000,{(field1,N),(field2,N),(field3,N),(field4,N)})
(2000,{(field1,N),(field2,Y),(field3,N),(field4,N)})
(3000,{(field1,Y),(field2,Y),(field3,N),(field4,N)})
(4000,{(field1,Y),(field2,Y),(field3,Y),(field4,Y)})
What we've done is take the fields from the tuple after the 0th element (fmid) and transpose them into a bag where each tuple has a key and a value field.
Now that we have a bag we can do a simple filter and count:
data2 = FOREACH data2 {
    y_fields = FILTER fields BY value == 'Y';
    GENERATE fmid, SIZE(y_fields) as y_cnt;
}
Now if you dump data2 you get the expected counts reflecting the number of Y values in the tuple.
(1000,0)
(2000,1)
(3000,2)
(4000,4)
Here is the full source code for my example as a unit test, which you can put directly in the DataFu unit tests to try out:
/**
register $JAR_PATH
define Transpose datafu.pig.util.TransposeTupleToBag();
data = LOAD 'input' USING PigStorage(',') AS (fmid:int, field1:chararray, field2:chararray, field3:chararray, field4:chararray);
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
dump data2;
data2 = FOREACH data2 {
    y_fields = FILTER fields BY value == 'Y';
    GENERATE fmid, SIZE(y_fields) as y_cnt;
}
dump data2;
STORE data2 INTO 'output';
*/
@Multiline
private String example;

@Test
public void example() throws Exception
{
    PigTest test = createPigTestFromString(example);
    writeLinesToFile("input",
                     "1000,N,N,N,N",
                     "2000,N,Y,N,N",
                     "3000,Y,Y,N,N",
                     "4000,Y,Y,Y,Y");
    test.runScript();
    super.getLinesForAlias(test, "data2");
}
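The transpose-then-filter idea walked through above can also be sketched in plain Python, which may help when checking the expected counts by hand. This is only a model of what the Pig script computes, assuming comma-separated rows whose first field is fmid; count_matches and the field names passed in are illustrative.

```python
def count_matches(line, field_names, target="Y"):
    """Return (fmid, number of fields equal to target) for one input line."""
    parts = line.split(",")
    fmid, values = int(parts[0]), parts[1:]
    # Transpose the remaining fields into (key, value) pairs,
    # similar in spirit to what TransposeTupleToBag produces.
    fields = list(zip(field_names, values))
    # Filter on the value field and count what is left.
    return fmid, sum(1 for _, value in fields if value == target)


if __name__ == "__main__":
    names = ["field1", "field2", "field3", "field4"]
    lines = ["1000,N,N,N,N", "2000,N,Y,N,N", "3000,Y,Y,N,N", "4000,Y,Y,Y,Y"]
    for line in lines:
        print(count_matches(line, names))
    # (1000, 0)
    # (2000, 1)
    # (3000, 2)
    # (4000, 4)
```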

Counting result lines in pig latin

I'm trying to run a simple word counter in Pig Latin as follows:
lines = LOAD 'SOME_FILES' using PigStorage('#') as (line:chararray);
word = FILTER lines BY (line matches '.*SOME_VALUE.*');
I want to count how many SOME_VALUEs are found when searching SOME_FILES, so the expected output should be something like:
(SOME_VALUE,xxxx)
Where xxxx, is the total number of SOME_VALUE found.
How can I search for multiple values and print each one as above ?
What you should do is split each line into a bag of tokens, then FLATTEN it. Then you can do a GROUP on the words to pull all occurrences of each word into its own line. Once you do a COUNT of the resulting bag, you'll have the total count for each word in the document.
This will look something like:
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) ;
C = GROUP B BY $0 ;
D = FOREACH C GENERATE group AS word, COUNT(B) AS count ;
If you aren't sure what each step is doing, then you can use DESCRIBE and DUMP to help visualize what is happening.
Update: If you want to filter the results to contain only the couple of strings you want you can do:
E = FILTER D BY (word == 'foo') OR
(word == 'bar') OR
(word == 'etc') ;
-- Another way...
E = FILTER D BY (word matches 'foo|bar|etc') ;
However, you can also do this between B and C so you don't do any COUNTs you don't need to.
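As a rough illustration of filtering before grouping, here is a plain-Python model of the tokenize, filter, group and count sequence. It assumes whitespace splitting as a simplification of TOKENIZE (which also splits on some other delimiters), and count_words is a made-up name for this sketch.

```python
from collections import Counter


def count_words(lines, wanted):
    """Count only the words in `wanted` across all lines."""
    # Tokenize and flatten: one token at a time across all lines.
    tokens = (tok for line in lines for tok in line.split())
    # Filter first, so words that are not wanted are never counted.
    return Counter(tok for tok in tokens if tok in wanted)


if __name__ == "__main__":
    lines = ["foo bar foo", "baz etc", "bar qux"]
    counts = count_words(lines, {"foo", "bar", "etc"})
    for word in sorted(counts):
        print((word, counts[word]))
    # ('bar', 2)
    # ('etc', 1)
    # ('foo', 2)
```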

Boolean to int conversion in Pig

I have a data set that looks like this:
foo,R
foo,Y
bar,C
foo,R
baz,Y
foo,R
baz,Y
baz,R
...
I'd like to generate a report that sums up the number of 'R', 'Y' and 'C' records for each unique value in the first column. For this data set, it would look like:
foo,3,1,0
bar,0,0,1
baz,1,2,0
Where the 2nd column is the number of 'R' records, the third is the number of 'Y' records and the last is the number of 'C' records.
I know I can first filter by record type, group and aggregate, but that leads to an expensive join of the three sub-reports. I would much rather group once and GENERATE each of the {R, Y, C} columns in my group.
How can I convert the Boolean result of comparing the second column in my data set to 'R', 'Y' or 'C' to a numeric value I can aggregate? Ideally I want 1 for a match and 0 for a non-match for each of the three columns.
Apache Pig is well suited to this type of problem. It can be solved with one GROUP BY and one nested FOREACH:
inpt = load '~/pig/data/group_pivot.csv' using PigStorage(',') as (val : chararray, cat : chararray);
grp = group inpt by (val);
final = foreach grp {
    rBag = filter inpt by cat == 'R';
    yBag = filter inpt by cat == 'Y';
    cBag = filter inpt by cat == 'C';
    generate flatten(group) as val, SIZE(rBag) as R, SIZE(yBag) as Y, SIZE(cBag) as C;
};
dump final;
--(bar,0,0,1)
--(baz,1,2,0)
--(foo,3,1,0)
bool = foreach final generate val, (R == 0 ? 0 : 1) as R, (Y == 0 ? 0 : 1) as Y, (C == 0 ? 0 : 1) as C;
dump bool;
--(bar,0,0,1)
--(baz,1,1,0)
--(foo,1,1,0)
I have tried it on your example and got the expected result. The idea is that after GROUP BY each value has a BAG that contains all rows with R, Y, C categories. Using FILTER within FOREACH we create 3 separate BAGs (one per R, Y, C) and SIZE(bag) in GENERATE counts the number of rows in each bag.
The only problem you might encounter is when there are too many rows with the same value in the val column, as nested FOREACH relies on in-memory operations and the resulting intermediate BAGs could get quite large. If you start getting memory-related exceptions, you can take inspiration from How to handle spill memory in pig. The idea would be to use two GROUP BY operations, the first to get counts per (val, cat) and the second to pivot R, Y, C around val, thus avoiding an expensive JOIN operation (see Pivoting in Pig).
Regarding the question with BOOLEAN: I have used bincond operator.
If you do not need the counts, you could use IsEmpty(bag) instead of SIZE(bag), which would be slightly faster, together with bincond to get your 0 and 1 conversions.
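The two-step idea (count per (val, cat), then pivot around val) and the bincond conversion can be modeled in plain Python as a sanity check. This is a sketch of the logic only, not of how Pig runs it; pivot_counts and to_flags are hypothetical names.

```python
from collections import Counter


def pivot_counts(rows, cats=("R", "Y", "C")):
    """rows: list of (val, cat) pairs. Returns one (val, R, Y, C) row per val."""
    # First step: count per (val, cat) pair.
    counts = Counter(rows)
    # Second step: pivot the categories into columns, one row per val.
    vals = sorted({val for val, _ in rows})
    return [(val,) + tuple(counts[(val, c)] for c in cats) for val in vals]


def to_flags(row):
    """Bincond-style conversion: 1 if the count is non-zero, else 0."""
    return (row[0],) + tuple(1 if n else 0 for n in row[1:])


if __name__ == "__main__":
    rows = [("foo", "R"), ("foo", "Y"), ("bar", "C"), ("foo", "R"),
            ("baz", "Y"), ("foo", "R"), ("baz", "Y"), ("baz", "R")]
    for row in pivot_counts(rows):
        print(row, to_flags(row))
    # ('bar', 0, 0, 1) ('bar', 0, 0, 1)
    # ('baz', 1, 2, 0) ('baz', 1, 1, 0)
    # ('foo', 3, 1, 0) ('foo', 1, 1, 0)
```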