Split string and use last value? - apache-pig

I would like to split a string field into parts (space separator) and use the last value. I know I can split data using STRSPLIT, but how can I take the last value?
eg: input:
AAA BB CC
SS DD
AA
output:
CC
DD
AA
thanks

You can do that with a combination of LAST_INDEX_OF, SUBSTRING and SIZE.

input
AAA BB CC
SS DD
AA
A = load 'input.txt' as (line : chararray);
B = FOREACH A generate line, LAST_INDEX_OF(line,' ') AS ind;
C = FOREACH B GENERATE (ind>0?SUBSTRING(line,ind+1,ind+3):SUBSTRING(line,0,2));
Dump C;
output
(CC)
(DD)
(AA)
If the last value is not always the same length, use (int)SIZE(line) as the end index instead of ind+3, and return line itself in the else branch.

One more solution. It works for a last value of any length, as long as it consists of word characters.
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE REGEX_EXTRACT(line,'\\s*(\\w+)$',1);
DUMP B;
Output:
(CC)
(DD)
(AA)
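For checking the expected results outside Pig, the LAST_INDEX_OF + SUBSTRING idea can be sketched in a few lines of Python. This is only an illustration of the logic, not Pig code; the function name last_token is made up for the example.

```python
def last_token(line: str) -> str:
    """Return the text after the last space, or the whole line if there is no space."""
    ind = line.rfind(" ")      # plays the role of LAST_INDEX_OF(line, ' ')
    # rfind returns -1 when no space is found, so ind + 1 is 0 and the
    # slice below is the whole line; no separate else branch is needed.
    return line[ind + 1:]


for line in ["AAA BB CC", "SS DD", "AA"]:
    print(last_token(line))
```

Because the slice runs to the end of the string, the length of the last value does not matter, which is the same effect as using SIZE for the end index.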

Related

Find continuity of elements in Pig

How can I find the length of each continuous run of a field and its starting position?
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output i want is
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
B = FOREACH (GROUP A BY value) GENERATE
group as value,
COUNT(A) as continuous_counts,
MIN(A.index) as start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data can be discontinuous, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
Group and count the number of values for continuous_counts, i.e.
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray, index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group as value,COUNT(A.value) as continuous_counts;
D = FOREACH B {
ordered = ORDER A BY index;
first = LIMIT ordered 1;
GENERATE FLATTEN(first) AS (value, index);
}
E = JOIN C BY value,D BY value;
F = FOREACH E GENERATE C::value,C::continuous_counts,D::index;
DUMP F;
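Regarding the caveat about discontinuous data: the logic such a UDF would need is essentially run detection over rows sorted by index. A minimal Python sketch of that logic follows, assuming the rows are already sorted by index; the function name runs is hypothetical and this is not a Pig UDF.

```python
from itertools import groupby


def runs(rows):
    """Yield (value, run_length, start_index) for each consecutive run of equal values.

    rows is an iterable of (value, index) pairs, assumed sorted by index.
    """
    for value, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        yield value, len(group), group[0][1]


rows = [("A", 1), ("B", 2), ("B", 3), ("B", 4), ("C", 5), ("C", 6)]
for value, count, start in runs(rows):
    print(f"{value},{count},{start}")
```

Unlike grouping on value, this keeps two separate runs of the same value apart, so A-1, B-2, A-3 yields three runs of length one.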

How to match ',' in PIG?

The below pig script gives the count of various characters in a file. It works for all characters except ','.
My code :
A = load 'a.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '(.+)';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
This matches all characters except ',' and gives an output.
Input: (cat a.txt)
HI, I.
Output:(output in file generated)
1 H
2 I
1 .
It doesn't give the count of , in the file. I don't understand why it isn't giving the count of ',' !
The first TOKENIZE drops its default token separators: space, double quote ("), comma (,), parentheses (()), and star (*). Instead, use REPLACE to put a delimiter between every character, tokenize on that delimiter, and then count. See below.
Input
HI, I.
PigScript
A = LOAD 'test3.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
C = FILTER B BY letter != ' ';
D = GROUP C BY letter;
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;
Output
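As a cross-check on the counts to expect, here is a minimal Python sketch of the same per-character tally, run outside Pig. The function name char_counts is made up for illustration, and unlike the script above it skips all whitespace rather than only the single space character.

```python
from collections import Counter


def char_counts(text: str) -> Counter:
    """Count every character in text except whitespace."""
    return Counter(ch for ch in text if not ch.isspace())


counts = char_counts("HI, I.")
for ch, n in sorted(counts.items()):
    print(n, ch)
```

Since no tokenizer with built-in separators is involved, the comma is counted like any other character.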

Pig min command and order by

I have data in the form of shell, $917.14,$654.23,2013
I have to find the minimum value in columns $1 and $2.
I tried an ORDER BY on these columns in ascending order,
but the answer does not come out correct. Can anyone please help?
Refer to MIN.
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT1: The data you are loading has '$' in it. You will either have to clean it up and load it into a float field to apply the MIN function, or load it into a chararray, replace the '$', cast it to float, and then apply MIN.
EDIT2: Here is the solution without removing the $ in the original data but handling it in the PigScript.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));
DUMP D;
Output
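To see what the minimums should be for the sample input, and why sorting the raw text misleads, here is a small Python sketch of the same clean-then-compare idea. It is not Pig; column_minimums is a hypothetical helper, and columns 1 and 2 are assumed to be the '$' amounts as in the sample.

```python
def column_minimums(lines, columns=(1, 2)):
    """Return the numeric minimum of each requested column, ignoring a leading '$'."""
    rows = [line.split(",") for line in lines]
    return [min(float(row[c].lstrip("$")) for row in rows) for c in columns]


lines = [
    "shell,$820.48,$11992.70,996,891,1629",
    "shell,$817.12,$2105.57,1087,845,1630",
    "Bharat,$974.48,$5479.10,965,827,1634",
    "Bharat,$943.70,$9162.57,939,895,1635",
]

mins = column_minimums(lines)
print(",".join(f"${m:.2f}" for m in mins))

# Compared as text, "$11992.70" sorts before "$2105.57" because '1' < '2',
# which is why ordering the uncleaned strings gives a misleading minimum.
print("$11992.70" < "$2105.57")
```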

Pig and Parsing issue

I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which i require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here I get (k1=v1 k2=v2 k3=v3) for B.
Is there any way I can further parse this by " " so as to get 3 fields k1=v1, k2=v2 and k3=v3, which can then be further split into k1,v1,k2,v2,k3,v3 using STRSPLIT and FLATTEN on "="?
Thanks for the help!
San
If you know beforehand how many key=value pairs are in each record, try this:
A = load 'sample' using PigStorage('|') as (a1:chararray,b1:chararray,c1:chararray,d1:chararray);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'[= ]',6); -- 6 = max number of tokens (3 pairs x 2)
D = FOREACH C GENERATE FLATTEN($0);
DUMP D;
output:
(k1,v1,k2,v2,k3,v3)
If you don't know the number of key=value pairs, use ' ' as the delimiter and remove the unwanted prefix from the $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1,k2,v2,k3,v3)
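Both scripts above fix the number of pairs in advance (a token limit of 6, or three named columns). For comparison, the parsing logic for an arbitrary number of pairs can be sketched in Python as below. This is an illustration outside Pig, and flatten_pairs is a made-up name; it assumes the key=value pairs always sit in the last '|'-separated field.

```python
def flatten_pairs(record: str) -> list:
    """Take the last '|'-separated field and flatten its key=value pairs."""
    tail = record.rsplit("|", 1)[-1]       # "k1=v1 k2=v2 k3=v3"
    flat = []
    for pair in tail.split():              # split on whitespace
        key, value = pair.split("=", 1)    # split each pair on the first '='
        flat.extend([key, value])
    return flat


print(",".join(flatten_pairs("a|b|c|k1=v1 k2=v2 k3=v3")))
```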

Apache Pig : Append two data sets to one

I have two data sets.
1st set A
(111)
(222)
(555)
2nd set B
(333)
(444)
(666)
I did
C = UNION A,B;
After appending the two data sets, the output should be the first data set followed by the second.
Expected output C is
(111)
(222)
(555)
(333)
(444)
(666)
But my output C is
(333)
(444)
(666)
(111)
(222)
(555)
If I apply UNION, the result is not in order,
and I find it difficult to append the sets in order.
How can I do this?
I can't think of anything, so any help will be appreciated.
Add an extra column to each data set giving the file number, then take the union of the modified data sets and sort on that 'file_number' column.
A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
A_mod = FOREACH A GENERATE a, 1 AS file_number;
B_mod = FOREACH B GENERATE b, 2 AS file_number;
unified_mod = UNION A_mod, B_mod;
sorted = ORDER unified_mod BY file_number;
I tried a plain UNION and for me the data stayed in order.
But let's try to force it if it doesn't :)
As I said in the previous comment, it's not efficient, but it does the job.
--In order to determine nbA you can run the following cmd in the shell : wc -l A.txt
%default nbA 3
A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
A = RANK A;
B = RANK B;
--DESCRIBE B;
B = FOREACH B GENERATE rank_B + $nbA, $1;
C= UNION B,A;
C= ORDER C BY $0;
C= FOREACH C GENERATE $1; --If you want to drop the first column
DUMP C;
Output :
(111)
(222)
(555)
(333)
(444)
(666)
Where :
A.txt
111
222
555
And B.txt:
333
444
666
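The idea shared by both answers (tag each row with where it came from, union, then sort on the tag) can be sketched in Python as follows. This is only an illustration of the logic; append_in_order is a hypothetical name, and the reversal merely simulates a union that returns rows in an arbitrary order.

```python
def append_in_order(*datasets):
    """Concatenate datasets so that order across and within datasets is preserved."""
    # Tag every value with (file_number, position), like the extra
    # file_number column combined with a per-file rank.
    tagged = [
        (file_number, position, value)
        for file_number, data in enumerate(datasets, start=1)
        for position, value in enumerate(data)
    ]
    tagged.reverse()                              # simulate an unordered union
    tagged.sort(key=lambda t: (t[0], t[1]))       # restore order from the tags
    return [value for _, _, value in tagged]


print(append_in_order([111, 222, 555], [333, 444, 666]))
```

Sorting on both the file number and the position within the file is what keeps rows of the same file in their original order; sorting on the file number alone only separates the files.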