Pig script to count the number of letters in a file - apache-pig

I want to extend the hello world program of hadoop word count to be able to count the number of letters in the input file.
I have written this so far and I'm unable to figure out what is wrong with this code. Any help identifying the issue will be appreciated.
A = load '/tmp/alice.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(REGEX_EXTRACT_ALL(word, '([a-zA-Z])')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into '/tmp/alice_wordcount';

Let me say that I am a PIG newbie, but somehow this query got me interested. I diverged into all kinds of complex stuff like nested foreach, UDFs etc. But in the end, the answer is pretty simple. It's just a correction in one of your pig latin lines as below:
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
Instead of using regexp_extract_all, I instead opt to REPLACE each letter boundary with a special character ('|' here, though you can use an uncommon sequence also if you like) and then TOKENIZE around that delimiter.

try the following code
Load the data A = load '/tmp/alice.txt';
Split the line into words B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
Split words into chars C = foreach B generate flatten(TOKENIZE(REPLACE($0,'','|'),'|')) as letter;
Group the letters D = GROUP C BY letter;
Generate the results with count of each letter E = foreach D generate COUNT(C), group;
Store F into '/tmp/alice_wordcount';

Related

character Counting in apache

I have few text files and I'm looking to count letters in all those text files combined in total. For example text1.txt contains "Stackoverflow is so cool". I'm looking to get the total letter count
Load all the files using wildcard character * into field of type chararray.Split the line into words and then into letters and count them.
A = LOAD '/path/text*.txt' AS (lines:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(lines))) AS words;
C = FOREACH B GENERATE FLATTEN(TOKENIZE(REPLACE(words,'','|'), '|')) AS letters;
D = GROUP C BY letters;
E = FOREACH D GENERATE COUNT(C), group;
DUMP E;

How to match ',' in PIG?

The below pig script gives the count of various characters in a file. It works for all characters except ','.
My code :
A = load 'a.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '(.+)';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
This matches all characters except ',' and gives an output.
Input: (cat a.txt)
HI, I.
Output:(output in file generated)
1 H
2 I
1 .
It doesn't give the count of , in the file. I don't understand why it isn't giving the count of ',' !
The first tokenize will eliminate the token separators space, double quote("), coma(,) parenthesis(()), star(*).Instead use replace to tokenize each character and then count.See below
Input
HI, I.
PigScript
A = LOAD 'test3.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
C = FILTER B BY letter != ' ';
D = GROUP C BY letter;
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;
Output

Pig and Parsing issue

I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which i require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here i get (k1=v1 k2=v2 k3=v3) for B
Is there any way i can further parse this by "" so as to get 3 fields k1=v1,k2=v2 and K3=v3 which can then be further split into k1,v1,k2,v2,k3,v3 using Strsplit and Flatten on "=".
Thanks for the help!
San
If you know beforehand how many key=value pair are in each record, try this:
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'=',6); -- 6= no. of key=value pairs
D = FOREACH C GENERATE FLATTEN($0);
DUMP D
output:
(k1,v1, k2,v2, k3,v3)
If you dont know the # of key=value pair, use ' ' as delimiter and remove the unwanted prefix from $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)

Counting result lines in pig latin

I'm trying to run simple word counter in pig latin as follows:
lines = LOAD 'SOME_FILES' using PigStorage('#') as (line:chararray);
word = FILTER lines BY (line matches '.*SOME_VALUE.*');
I want to count how many SOME_VALUEs found searching SOME_FILES, so the expected output should be something like:
(SOME_VALUE,xxxx)
Where xxxx, is the total number of SOME_VALUE found.
How can I search for multiple values and print each one as above ?
What you should do is split each line into a bag of tokens, then FLATTEN it. Then you can do a GROUP on the words to pull all occurrences of each word into it's own line. Once you do a COUNT of the resulting bag you'll have the total count for all words in the document.
This will look something like:
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) ;
C = GROUP B BY $0 ;
D = FOREACH C GENERATE group AS word, COUNT(B) AS count ;
If you aren't sure what each step is doing, then you can use DESCRIBE and DUMP to help visualize what is happening.
Update: If you want to filter the results to contain only the couple of strings you want you can do:
E = FILTER D BY (word == 'foo') OR
(word == 'bar') OR
(word == 'etc') ;
-- Another way...
E = FILTER D BY (word matches 'foo|bar|etc') ;
However, you can also do this between B and C so you don't do any COUNTs you don't need to.

Apache Pig occurences of a character in descending order

I want to count the number of occurrences of a character in descending order/ascending order and neglecting special characters in Apache Pig? Can anyone give the solution for this?
My input file is like the following:
adaek#482;awst%16
alf$951;adftu*15
Desired output:
a : 5
d,t,f:2
e,k,w,l,u: 1
You'll need a UDF StringToCharArray that breaks a string into a bag of characters (wrap toCharArray() and return bag) and then do the following:
a = load ... as (inp : chararray);
b = foreach a generate flatten(StringToCharArray(inp)) as singlechar;
c = group b by singlechar;
d = foreach c generate c.group as singlechar, COUNT_STAR(b) as total;
e = group d by total;
f = foreach e generate d as chargroup, e.group as total;
dump f;