Counting result lines in pig latin - apache-pig

I'm trying to run simple word counter in pig latin as follows:
lines = LOAD 'SOME_FILES' using PigStorage('#') as (line:chararray);
word = FILTER lines BY (line matches '.*SOME_VALUE.*');
I want to count how many times SOME_VALUE is found when searching SOME_FILES, so the expected output should be something like:
(SOME_VALUE,xxxx)
where xxxx is the total number of occurrences of SOME_VALUE.
How can I search for multiple values and print each one as above?

What you should do is split each line into a bag of tokens, then FLATTEN it. Then you can GROUP on the words to pull all occurrences of each word into its own line. Once you COUNT the resulting bag, you'll have the count for each word in the document.
This will look something like:
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) ;
C = GROUP B BY $0 ;
D = FOREACH C GENERATE group AS word, COUNT(B) AS count ;
If you aren't sure what each step is doing, then you can use DESCRIBE and DUMP to help visualize what is happening.
Update: If you want to filter the results to contain only the few strings you care about, you can do:
E = FILTER D BY (word == 'foo') OR
(word == 'bar') OR
(word == 'etc') ;
-- Another way...
E = FILTER D BY (word matches 'foo|bar|etc') ;
However, you can also apply this filter between B and C so that you don't compute any COUNTs you don't need.
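As a rough sketch of that last suggestion, the filter can be moved ahead of the GROUP so that only the wanted words are grouped and counted. The relation name B_wanted and the alias total are made up for illustration, and 'foo|bar|etc' stands in for the real search terms. Note that this matches whole tokens, not substrings within a line.

```pig
-- Sketch only: 'SOME_FILES' and foo/bar/etc are placeholders taken from the thread.
lines = LOAD 'SOME_FILES' USING PigStorage('#') AS (line:chararray);
-- One word per row
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Keep only the words of interest before grouping (B_wanted is an illustrative name)
B_wanted = FILTER B BY word matches 'foo|bar|etc';
C = GROUP B_wanted BY word;
-- One (word, total) row per wanted word
D = FOREACH C GENERATE group AS word, COUNT(B_wanted) AS total;
DUMP D;
```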

Find continuity of elements in Pig

How can I find the continuity of a field and its starting position?
The input looks like:
A-1
B-2
B-3
B-4
C-5
C-6
The output I want is:
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
B = FOREACH (GROUP A BY value) GENERATE
group as value,
COUNT(A) as continuous_counts,
MIN(A.index) as start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data may be discontinuous, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
Group and count the number of values for continuous_counts, i.e.
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray, index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group as value,COUNT(A.value) as continuous_counts;
D = FOREACH B {
-- sort the grouped bag A (not the relation B) and keep its first row
ordered = ORDER A BY index;
first = LIMIT ordered 1;
GENERATE FLATTEN(first) AS (value, index);
}
E = JOIN C BY value,D BY value;
F = FOREACH E GENERATE C::value,C::continuous_counts,D::index;
DUMP F;

I have 50 fields, Is there any option in pig to print first 40 field in Apache Pig? I require something like range $0-$39

I have 50 fields. Is there any option in Pig to print the first 40 fields? I need something like the range $0-$39.
I don't want to specify each and every field like $0, $1, $2, etc.
Listing every column is acceptable when there are only a few, but what about when there is a huge number of columns?
You can use the .. notation.
First 40 fields
B = FOREACH A GENERATE $0..$39;
All fields
B = FOREACH A GENERATE $0..;
Multiple ranges, for example the first 10, 15-20, and 25-50
B = FOREACH A GENERATE $0..$9,$14..$19,$24..;
Random fields 22,33-44,46
B = FOREACH A GENERATE $21,$32..$43,$45;
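When the relation has a schema, the same .. range notation should also work with field names instead of positions. The sketch below assumes a made-up five-field schema; the file name and field names are hypothetical and only there to show the syntax.

```pig
-- Hypothetical file and schema; the field names are illustrative only.
A = LOAD 'data' USING PigStorage(',') AS (id:int, name:chararray, city:chararray, zip:chararray, score:double);
-- Fields from name through zip, inclusive
B = FOREACH A GENERATE name..zip;
-- Everything from city to the last field
C = FOREACH A GENERATE city..;
-- Everything from the first field up to and including name
D = FOREACH A GENERATE ..name;
```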

character Counting in apache

I have a few text files and I'm looking to count the letters in all those text files combined. For example, text1.txt contains "Stackoverflow is so cool". I'm looking to get the total letter count.
Load all the files using the wildcard character * into a field of type chararray. Split each line into words, then into letters, and count them.
A = LOAD '/path/text*.txt' AS (lines:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(lines))) AS words;
C = FOREACH B GENERATE FLATTEN(TOKENIZE(REPLACE(words,'','|'), '|')) AS letters;
D = GROUP C BY letters;
E = FOREACH D GENERATE COUNT(C), group;
DUMP E;
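The script above gives one count per letter. If a single grand total is wanted instead, as the question asks, one possible variation is to group everything with GROUP ... ALL. The sketch below also drops anything that is not a lowercase a-z character, which is an assumption about what should count as a letter; the relation names letters_only, G, and T are illustrative.

```pig
A = LOAD '/path/text*.txt' AS (lines:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(lines))) AS words;
-- Put '|' between every character, then split on it: one character per row
C = FOREACH B GENERATE FLATTEN(TOKENIZE(REPLACE(words,'','|'), '|')) AS letters;
-- Assumption: only a-z counts as a letter, so digits and punctuation are dropped
letters_only = FILTER C BY letters matches '[a-z]';
-- Collect every remaining row into a single group and count it
G = GROUP letters_only ALL;
T = FOREACH G GENERATE COUNT(letters_only) AS total_letters;
DUMP T;
```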

Pig script to count the number of letters in a file

I want to extend the hello world program of hadoop word count to be able to count the number of letters in the input file.
I have written this so far and I'm unable to figure out what is wrong with this code. Any help identifying the issue will be appreciated.
A = load '/tmp/alice.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(REGEX_EXTRACT_ALL(word, '([a-zA-Z])')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into '/tmp/alice_wordcount';
Let me say that I am a PIG newbie, but somehow this query got me interested. I diverged into all kinds of complex stuff like nested foreach, UDFs etc. But in the end, the answer is pretty simple. It's just a correction in one of your pig latin lines as below:
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
Instead of using REGEX_EXTRACT_ALL, I opted to REPLACE each letter boundary with a special character ('|' here, though you can use an uncommon sequence if you like) and then TOKENIZE around that delimiter.
Try the following code:
-- Load the data
A = load '/tmp/alice.txt';
-- Split the line into words
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- Split words into chars
C = foreach B generate flatten(TOKENIZE(REPLACE($0,'','|'),'|')) as letter;
-- Group the letters
D = GROUP C BY letter;
-- Generate the results with count of each letter
E = foreach D generate COUNT(C), group;
-- Store the result (E is the final relation)
store E into '/tmp/alice_wordcount';

How to get the number of words per line in pig?

I'm trying to figure out how many words there are per line in a file in Pig. I've gotten as far as loading and splitting:
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*);
which gets me a bag of tuples, each containing a word. Then I try to count these items:
counts = FOREACH words GENERATE COUNT(*);
I get an error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException
Is that because some of the lines have an empty bag? Or is there something else I'm doing wrong?
If the problem is an empty bag, then you can try something like this (not tested):
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*) as tokenized_words;
counts = FOREACH words GENERATE ((tokenized_words IS NULL OR IsEmpty(tokenized_words)) ? 0L : COUNT(tokenized_words)) as total_count;
Here the if-else (bincond) expression checks whether tokenized_words is null or empty; if so, it assigns zero, otherwise the total count.
Can you try like this?
input
Hi hello how are you
this is apache pig
works
like a charm
Pig script:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;
Output:
(5)
(4)
(1)
()
(3)
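The empty () in the output above corresponds to the blank input line, where the count comes back as null. If a 0 is preferred there, one option is to guard the COUNT with a null check. This is an untested sketch; it assumes the blank line produces a null bag, and the aliases words and word_count are illustrative.

```pig
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line) AS words;
-- Assumption: a blank line yields a null bag, so report it as zero words
C = FOREACH B GENERATE (words IS NULL ? 0L : COUNT(words)) AS word_count;
DUMP C;
```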