How to get the number of words per line in pig? - apache-pig

I'm trying to figure out how many words there are per line in a file in Pig. I've gotten as far as loading and splitting:
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*);
which gets me a bag of tuples, each containing a word. Then I try to count these items:
counts = FOREACH words GENERATE COUNT(*);
I get an error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException
Is that because some of the lines have an empty bag? Or is there something else I'm doing wrong?

If the problem is an empty or null bag, you can try something like this (not tested):
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*) AS tokenized_words;
counts = FOREACH words GENERATE (tokenized_words IS NULL ? 0L : COUNT(tokenized_words)) AS total_count;
The conditional checks whether tokenized_words is null; if it is, the count is set to zero, otherwise COUNT returns the number of tokens. COUNT already returns 0 for an empty bag, so only the null case needs special handling.

Can you try like this?
input
Hi hello how are you
this is apache pig
works
like a charm
Pigscript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;
Output:
(5)
(4)
(1)
()
(3)
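The empty () row in this output is presumably the null that TOKENIZE produces for an empty line, which COUNT then passes through. A minimal sketch, assuming that is the cause, that keeps each line next to its count and reports 0 instead of null (the aliases tokens and word_count are illustrative names, not from the answers above):

```pig
A = LOAD 'input' AS (line:chararray);
-- TOKENIZE yields a bag of single-word tuples, or null when the line itself is null
B = FOREACH A GENERATE line, TOKENIZE(line) AS tokens;
-- Guard against the null bag before counting; COUNT returns a long, hence 0L
C = FOREACH B GENERATE line, (tokens IS NULL ? 0L : COUNT(tokens)) AS word_count;
DUMP C;
```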

Related

I have a seemingly simple Pig generate and then filter issue

I am trying to run a simple Pig script on a simple CSV file and I cannot get FILTER to do what I want. I have a test.csv file that looks like this:
john,12,44,,0
bob,14,56,5,7
dave,13,40,5,5
jill,8,,,6
Here is my script that does not work:
people = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filtered = FILTER data BY first == 13;
DUMP filtered;
When I dump data, everything looks good. I get the name and the first and last integer as expected. When I describe the data, everything looks good:
data: {name: bytearray,first: int,second: int}
When I try and filter out data by the first value being 13, I get nothing. DUMP filtered simply returns nothing. Oddly enough, if I change it to first > 13, then all "rows" will print out.
However, this script works:
peopletwo = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',') AS (f1:chararray,f2:int,f3:int,f4:int,f5:int);
datatwo = FOREACH peopletwo GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filteredtwo = FILTER datatwo BY first == 13;
DUMP filteredtwo;
What is the difference between filteredtwo and filtered (or data and datatwo for that matter)? I want to know why the new relation obtained using GENERATE (i.e. data) won't filter in the first script as one would expect.
Specify the data types in the LOAD statement itself. See below:
people = LOAD 'test5.csv' USING PigStorage(',') as (f1:chararray,f2:int,f3:int,f4:int,f5:int);
filtered = FILTER people BY f2 == 13;
DUMP filtered;
The same script with the filter changed to use >:
filtered = FILTER people BY f2 > 13;
EDIT
When converting from bytearray, you have to explicitly cast the field values in the FOREACH. This works:
people = LOAD 'test5.csv' USING PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray,(int)$1 AS f1,(int)$4 AS f2;
filtered = FILTER data BY f1 == 13;
DUMP filtered;

Apache Pig floating number SUM error in precision

I have rows with double values.
Their sum, however, has additional floating-point digits that I don't want in the output. Any idea how to avoid this problem?
A = LOAD 'test.csv' Using PigStorage(',') AS (
ID: chararray,
COST:double
);
B = GROUP A BY (ID);
C = FOREACH B GENERATE SUM(A.COST);
STORE C INTO 'out.txt' USING PigStorage(',');
INPUT FILE
A,0.51
A,0.51
B,4.81
B,4.81
EXPECTED OUTPUT FILE
A,1.02
B,9.62
ACTUAL INVALID OUTPUT FILE
10.020000457763672
9.619999885559082
Try C = FOREACH B GENERATE ROUND(SUM(A.COST)*100.0)/100.0;
EDIT
It works.
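The script in the question also drops the group key, so its output has no ID column. A sketch of the full script with the rounding applied and the key kept, assuming two decimal places are wanted (TOTAL is an illustrative alias, not from the question):

```pig
A = LOAD 'test.csv' USING PigStorage(',') AS (ID:chararray, COST:double);
B = GROUP A BY ID;
-- ROUND returns a long, so scale up by 100, round, then divide by 100.0 to get back a double
C = FOREACH B GENERATE group AS ID, ROUND(SUM(A.COST) * 100.0) / 100.0 AS TOTAL;
STORE C INTO 'out.txt' USING PigStorage(',');
```

Rounding only at the end, after SUM, keeps the intermediate total at full precision.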

Error while returning output of Pig macro via tuple

The error is in the macro below. I'm trying to generate two measures of entropy (the second excludes all events with a frequency below 5).
My error:
ERROR 1200: Cannot expand macro 'TOTUPLE'. Reason: Macro must be defined before expansion.
Which is weird, because TOTUPLE is a built-in function. Other pig scripts use TOTUPLE with no problems.
Code:
define dual_entropies (search, field) returns entropies {
summary = summary_total($search, $field);
entr1 = count_sum_entropy(summary, $field);
summary = filter summary by events >= 5L;
entr2 = count_sum_entropy(summary, $field);
$entropies = TOTUPLE(entr1, entr2);
};
Note that entr1 and entr2 are both single numbers, not vectors of numbers - I suspect that's part of the issue.
I ran into similar confusion. I'm not sure whether it holds in general, but Pig only accepted TOTUPLE when it was part of a FOREACH operation. I worked around it by doing a GROUP ... ALL, which returns a bag with a single tuple in it, followed by a FOREACH ... GENERATE such as:
B = group A ALL;
C = foreach B generate 'x', 2, TOTUPLE('hi', 2, 3);
dump C;
...
(x,2,(hi,2,3))
Perhaps this will help
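Applied to the macro in the question, one possible shape is to combine the two one-row relations first and build the tuple inside a FOREACH. This is an untested sketch: the LOAD statements and the 'entr1' and 'entr2' paths are placeholders standing in for the results of count_sum_entropy, and it assumes each relation holds exactly one row with one numeric field.

```pig
-- Placeholders for the one-row relations produced by the macros in the question
entr1 = LOAD 'entr1' AS (value:double);
entr2 = LOAD 'entr2' AS (value:double);
-- CROSS of two one-row relations gives a single row holding both values
both = CROSS entr1, entr2;
-- TOTUPLE is now an expression inside GENERATE, not a standalone statement
entropies = FOREACH both GENERATE TOTUPLE($0, $1) AS pair;
DUMP entropies;
```

Inside the macro, the last assignment would target $entropies instead of entropies.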

How to extract keys from map?

How do I extract all keys from a map field?
I have a bag of tuples where one of the fields is a map that contains HTTP headers (and their values). I want to create a set of all possible keys (in my dataset) for an HTTP header and count how many times I've seen each of them.
Ideally, something like:
A = LOAD ...
B = FOREACH A GENERATE KEYS(http_headers)
C = GROUP FLATTEN(B) BY $0
D = FOREACH C GENERATE group, COUNT($0)
(didn't test it but it illustrates the idea..)
How do I do something like this? If I could extract a bag of keys from a map, that would actually solve it. I just couldn't find any function like this in Pig Latin's documentation.
Yes, Pig has a built-in function for this: KEYSET.
Example:
/* data */
[a#1,b#2,c#3]
[green#sam,eggs#I,ham#am]
A = load 'data' as (M:[]);
B = foreach A generate KEYSET($0);
dump B;
Output:
({(b),(c),(a)})
({(ham),(eggs),(green)})
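To get from the bag of keys to the per-key counts the question asks for, the bag can be flattened and grouped. A minimal sketch, assuming the same 'data' file and map field M as in the example above (the aliases key and occurrences are illustrative names):

```pig
A = LOAD 'data' AS (M:[]);
-- KEYSET returns a bag of single-field tuples; FLATTEN turns it into one row per key
B = FOREACH A GENERATE FLATTEN(KEYSET(M)) AS key;
-- Collect all rows that share a key, then count them
C = GROUP B BY key;
D = FOREACH C GENERATE group AS key, COUNT(B) AS occurrences;
DUMP D;
```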

Counting result lines in pig latin

I'm trying to run a simple word counter in Pig Latin as follows:
lines = LOAD 'SOME_FILES' using PigStorage('#') as (line:chararray);
word = FILTER lines BY (line matches '.*SOME_VALUE.*');
I want to count how many times SOME_VALUE is found when searching SOME_FILES, so the expected output should be something like:
(SOME_VALUE,xxxx)
Where xxxx is the total number of occurrences of SOME_VALUE.
How can I search for multiple values and print each one as above?
What you should do is split each line into a bag of tokens, then FLATTEN it. Then you can do a GROUP on the words to pull all occurrences of each word into its own row. Once you do a COUNT of the resulting bag, you'll have the count for each word in the document.
This will look something like:
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) ;
C = GROUP B BY $0 ;
D = FOREACH C GENERATE group AS word, COUNT(B) AS count ;
If you aren't sure what each step is doing, then you can use DESCRIBE and DUMP to help visualize what is happening.
Update: If you want to filter the results to contain only the couple of strings you want you can do:
E = FILTER D BY (word == 'foo') OR
(word == 'bar') OR
(word == 'etc') ;
-- Another way...
E = FILTER D BY (word matches 'foo|bar|etc') ;
However, you can also do this between B and C so you don't do any COUNTs you don't need to.
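Filtering before the GROUP might look like the sketch below. It assumes the same LOAD as in the question and the same three example words; the aliases wanted and total are illustrative names, not from the answer.

```pig
lines = LOAD 'SOME_FILES' USING PigStorage('#') AS (line:chararray);
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Keep only the words of interest so the GROUP handles fewer rows;
-- matches must match the whole word, so this keeps exactly foo, bar, or etc
wanted = FILTER B BY word matches 'foo|bar|etc';
C = GROUP wanted BY word;
D = FOREACH C GENERATE group AS word, COUNT(wanted) AS total;
DUMP D;
```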