How to match ',' in PIG? - apache-pig

The Pig script below counts the occurrences of each character in a file. It works for every character except ','.
My code :
A = load 'a.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '(.+)';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
This matches all characters except ',' and gives an output.
Input: (cat a.txt)
HI, I.
Output:(output in file generated)
1 H
2 I
1 .
It doesn't give the count of ',' in the file, and I don't understand why.

The first TOKENIZE drops its default token separators: space, double quote ("), comma (,), parentheses (()), and star (*). Instead, use REPLACE to split the line into individual characters and then count them. See below.
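The effect can be reproduced outside Pig. The following is a minimal Python sketch of the two approaches, not Pig code; the delimiter set is simply the list given above, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: the separators that single-argument TOKENIZE drops,
# taken from the list in the answer above.
DEFAULT_DELIMS = ' ",()*'

def tokenize_default(line):
    """Model of TOKENIZE(line): split on the default separators and discard them."""
    pattern = '[' + re.escape(DEFAULT_DELIMS) + ']+'
    return [tok for tok in re.split(pattern, line) if tok]

def count_via_words(line):
    """Question's approach: tokenize into words first, then count characters."""
    return Counter(ch for word in tokenize_default(line) for ch in word)

def count_direct(line):
    """Answer's approach: split the raw line into characters, skip spaces, count."""
    return Counter(ch for ch in line if ch != ' ')

print(count_via_words("HI, I."))  # the comma never reaches the counter
print(count_direct("HI, I."))     # the comma is counted once
```

In the word-based version the comma is consumed as a separator before any counting happens, which is why it never shows up in the result.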
Input
HI, I.
PigScript
A = LOAD 'test3.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
C = FILTER B BY letter != ' ';
D = GROUP C BY letter;
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;
Output

Related

character Counting in apache

I have a few text files and want to count the letters in all of them combined. For example, text1.txt contains "Stackoverflow is so cool". I'm looking to get the total letter count.
Load all the files using the wildcard character * into a field of type chararray. Split each line into words, then into letters, and count them.
A = LOAD '/path/text*.txt' AS (lines:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(lines))) AS words;
C = FOREACH B GENERATE FLATTEN(TOKENIZE(REPLACE(words,'','|'), '|')) AS letters;
D = GROUP C BY letters;
E = FOREACH D GENERATE COUNT(C), group;
DUMP E;

Pig script to count the number of letters in a file

I want to extend the hello world program of hadoop word count to be able to count the number of letters in the input file.
I have written this so far and I'm unable to figure out what is wrong with this code. Any help identifying the issue will be appreciated.
A = load '/tmp/alice.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(REGEX_EXTRACT_ALL(word, '([a-zA-Z])')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into '/tmp/alice_wordcount';
Let me say that I am a PIG newbie, but somehow this query got me interested. I diverged into all kinds of complex stuff like nested foreach, UDFs etc. But in the end, the answer is pretty simple. It's just a correction in one of your pig latin lines as below:
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
Instead of using REGEX_EXTRACT_ALL, I REPLACE each letter boundary with a special character ('|' here, though you can use an uncommon sequence if you like) and then TOKENIZE around that delimiter.
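To see what the REPLACE/TOKENIZE trick does to a single word, here is a small Python sketch of the same idea. It is only a model: `split_into_chars` is a hypothetical helper, and Python's `re.sub` stands in for Pig's REPLACE.

```python
import re

def split_into_chars(word, delim="|"):
    """Insert delim at every character boundary, then split on it."""
    marked = re.sub("", delim, word)  # 'cat' -> '|c|a|t|'
    # Splitting on the delimiter leaves empty strings at both ends; drop them.
    return [piece for piece in marked.split(delim) if piece]

print(split_into_chars("cat"))
```

One caveat follows directly from the technique: if the word itself contains the delimiter, that character is swallowed, which is the reason for picking an uncommon sequence.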
Try the following code:
A = load '/tmp/alice.txt'; -- load the data
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word; -- split each line into words
C = foreach B generate flatten(TOKENIZE(REPLACE($0,'','|'),'|')) as letter; -- split words into characters
D = GROUP C BY letter; -- group the letters
E = foreach D generate COUNT(C), group; -- count each letter
store E into '/tmp/alice_wordcount'; -- store the results

Sort tuples in a bag based on multiple fields

I am trying to sort the tuples inside a bag on three fields in descending order.
Example : Suppose I have the following bag created by grouping:
{(s,3,my),(w,7,pr),(q,2,je)}
I want to sort the tuples in the above grouped bag based on $0,$1,$2 fields in such a way that first it will sort on $0 of all the tuples. It will pick the tuple with largest $0 value. If $0 are same for all the tuples then it will sort on $1 and so on.
The sorting should be for all the grouped bags through iterating process.
Suppose we have a bag like:
{(21,25,34),(21,28,64),(21,25,52)}
Then according to the requirement output should be like:
{(21,25,34),(21,25,52),(21,28,64)}
Please let me know if you need any more clarification
Order your tuple in a nested foreach. This will work.
Input:
(1,s,3,my)
(1,w,7,pr)
(1,q,2,je)
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
od = ORDER A BY b, c, d;
GENERATE od;
};
Result of DUMP C (which resembles your data):
({(1,s,3,my),(1,w,7,pr),(1,q,2,je)})
Output:
({(1,q,2,je),(1,s,3,my),(1,w,7,pr)})
This will work for all the cases.
Generate tuple with highest value:
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
od = ORDER A BY b desc , c desc , d desc;
od1 = LIMIT od 1;
GENERATE od1;
};
dump D;
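The ordering logic of the nested ORDER and LIMIT can be checked in isolation. The Python sketch below models it on the bag from the question; the helper names are hypothetical, and it uses integers, whereas the Pig script above declares its fields as chararray.

```python
def order_bag(bag, descending=False):
    """Sort tuples on field 0, then field 1, then field 2 (like ORDER ... BY b, c, d)."""
    return sorted(bag, key=lambda t: (t[0], t[1], t[2]), reverse=descending)

def top_tuple(bag):
    """Model of ORDER ... DESC followed by LIMIT 1: the largest tuple."""
    return order_bag(bag, descending=True)[0]

bag = [(21, 25, 34), (21, 28, 64), (21, 25, 52)]
print(order_bag(bag))   # ties on field 0 fall through to field 1, then field 2
print(top_tuple(bag))

# Strings compare character by character, so "10" sorts before "9".
print(sorted(["9", "10"]))
```

The last line is worth keeping in mind: with fields typed as chararray I would expect the comparison to be string-based, so numeric data may need a numeric type in the AS clause.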
Generate the tuple with the highest value if all three fields differ; if all the tuples are the same, or if field 1 and field 2 are the same, return all the tuples:
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
F = RANK C; -- rank used to tell the bags apart when two tuples are the same
R = FOREACH F {
dis = distinct A;
GENERATE rank_C,COUNT(dis) AS (cnt:long),A;
};
R3 = FILTER R BY cnt!=1; -- drop bags in which all the tuples are the same
R4 = FOREACH R3 {
fil1 = ORDER A by b desc, c desc, d desc;
fil2 = LIMIT fil1 1;
GENERATE rank_C,fil2;
}; -- find the largest tuple, except where all the tuples are the same
R5 = FILTER R BY cnt==1; -- only bags in which all the tuples are the same
R6 = FOREACH R5 GENERATE A ; -- generate the required fields
F1 = FOREACH F GENERATE rank_C,FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); -- group by field 1 and field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long) ,F1; -- if count = 2, the tuples are the same on field 1 and field 2
F4 = FILTER F3 BY cnt1==2; -- keep only those
F5 = FOREACH F4 {
DIS = distinct F1;
GENERATE flatten(DIS);
};
F8 = JOIN F BY rank_C, F5 by rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = cross R4,F5; -- cross used to generate the case where all the tuples are different
Z1 = FILTER Z BY R4::rank_C!=F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
res = UNION Z2,R6,F9; -- Z2: highest tuple when all three fields differ
-- R6: all tuples when every tuple in the bag is the same
-- F9: all tuples when two fields of the tuples are the same
dump res;

Pig and Parsing issue

I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output I require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' USING PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here I get (k1=v1 k2=v2 k3=v3) for B.
Is there any way I can further split this on " " to get the three fields k1=v1, k2=v2 and k3=v3, which can then be split into k1,v1,k2,v2,k3,v3 using STRSPLIT and FLATTEN on "="?
Thanks for the help!
San
If you know beforehand how many key=value pairs are in each record, try this:
A = load 'sample' USING PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'[= ]',6); -- split on '=' or space; 6 = max number of fields (3 pairs x 2)
D = FOREACH C GENERATE FLATTEN($0);
DUMP D;
output:
(k1,v1,k2,v2,k3,v3)
If you don't know the number of key=value pairs, use ' ' as the delimiter and remove the unwanted prefix from the $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1,k2,v2,k3,v3)
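The parsing steps themselves (take the last '|' field, split on spaces, split each pair on '=') can be sketched in Python to check the expected result for any number of pairs. This is only a model of the logic with a hypothetical `parse_pairs` helper, not a Pig solution.

```python
def parse_pairs(line, field_sep="|", pair_sep=" ", kv_sep="="):
    """Return the keys and values from the last field as one flat list."""
    last_field = line.split(field_sep)[-1]     # 'k1=v1 k2=v2 k3=v3'
    flat = []
    for pair in last_field.split(pair_sep):
        if not pair:                           # skip empty pieces from repeated spaces
            continue
        key, value = pair.split(kv_sep, 1)     # split on the first '=' only
        flat.extend([key, value])
    return flat

print(",".join(parse_pairs("a|b|c|k1=v1 k2=v2 k3=v3")))
```

Splitting on the first '=' only means a value that itself contains '=' stays intact.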

Apache Pig occurrences of a character in descending order

I want to count the number of occurrences of each character in Apache Pig, sorted in descending or ascending order and ignoring special characters. Can anyone give a solution for this?
My input file is like the following:
adaek#482;awst%16
alf$951;adftu*15
Desired output:
a : 5
d,t,f:2
e,k,w,l,u: 1
You'll need a UDF StringToCharArray that breaks a string into a bag of characters (wrap toCharArray() and return bag) and then do the following:
a = load ... as (inp : chararray);
b = foreach a generate flatten(StringToCharArray(inp)) as singlechar;
c = group b by singlechar;
d = foreach c generate group as singlechar, COUNT_STAR(b) as total; -- one row per character
e = group d by total;
f = foreach e generate d as chargroup, group as total; -- characters sharing the same count
g = order f by total desc;
dump g;
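For checking what the pipeline should produce, the same count, group-by-count and sort steps can be modeled in Python on the sample input. This is a sketch, not the UDF itself; it assumes that "neglecting special characters" means keeping letters only, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def letters_grouped_by_count(lines):
    """Count letters, group letters sharing a count, and sort by count descending."""
    # Assumption: only alphabetic characters are kept; digits and symbols are ignored.
    counts = Counter(ch for line in lines for ch in line if ch.isalpha())
    by_total = defaultdict(list)
    for ch, total in counts.items():
        by_total[total].append(ch)
    return [(total, sorted(chars))
            for total, chars in sorted(by_total.items(), reverse=True)]

sample = ["adaek#482;awst%16", "alf$951;adftu*15"]
for total, chars in letters_grouped_by_count(sample):
    print(",".join(chars), ":", total)
```

On this input the letter s also occurs once, so it appears in the last group even though the desired output in the question does not list it.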