Pig min command and order by - apache-pig

I have data in the form shell,$917.14,$654.23,2013
I have to find the minimum value in columns $1 and $2.
I tried an ORDER BY on these columns in ascending order,
but the answer does not come out correct. Can anyone please help?

Refer to the MIN function:
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT1: The data you are loading has '$' in it. You will either have to clean it up and load it into a float field to apply the MIN function, or load it into a chararray, replace the '$', cast the result to float and then apply the MIN function.
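To illustrate the clean-then-MIN idea outside of Pig, here is a minimal Python sketch (not Pig code). The function name min_dollar_columns is hypothetical, and the rows are the sample input used in this answer. It also shows one plausible reason an ORDER BY on the raw text gives a wrong answer: strings compare character by character, so '$11992.70' sorts before '$2105.57'.

```python
def min_dollar_columns(lines):
    """Return (min of field 1, min of field 2) after stripping the '$' sign."""
    col1, col2 = [], []
    for line in lines:
        fields = line.strip().split(",")
        # Remove the currency symbol before converting to a number.
        col1.append(float(fields[1].replace("$", "")))
        col2.append(float(fields[2].replace("$", "")))
    return min(col1), min(col2)


rows = [
    "shell,$820.48,$11992.70,996,891,1629",
    "shell,$817.12,$2105.57,1087,845,1630",
    "Bharat,$974.48,$5479.10,965,827,1634",
    "Bharat,$943.70,$9162.57,939,895,1635",
]

print(min_dollar_columns(rows))  # (817.12, 2105.57)

# Comparing the raw strings instead picks the wrong "minimum".
print(min(["$11992.70", "$2105.57"]))  # $11992.70
```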
EDIT2: Here is a solution that leaves the '$' in the original data and handles it in the Pig script instead.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));
DUMP D;
Output

Related

Find continuity of elements in Pig

How can I find the length of each continuous run of a field value and its starting position?
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output I want is
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray,index:int);
B = FOREACH (GROUP A BY value) GENERATE
group as value,
COUNT(A) as continuous_counts,
MIN(A.index) as start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data can be discontinuous, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
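As a rough idea of what such a UDF would have to compute, here is a minimal Python sketch (not a Pig UDF). The function name runs is hypothetical. It assumes the rows can be sorted by index, and it reports every run of consecutive equal values separately, so a value that reappears later produces a second row.

```python
from itertools import groupby


def runs(lines):
    """Return (value, run_length, start_index) for each consecutive run."""
    parsed = []
    for line in lines:
        value, index = line.strip().split("-")
        parsed.append((value, int(index)))
    # Runs are only meaningful once the rows are ordered by index.
    parsed.sort(key=lambda pair: pair[1])
    result = []
    # groupby only merges adjacent equal keys, which is exactly a "run".
    for value, group in groupby(parsed, key=lambda pair: pair[0]):
        items = list(group)
        result.append((value, len(items), items[0][1]))
    return result


print(runs(["A-1", "B-2", "B-3", "B-4", "C-5", "C-6"]))
# [('A', 1, 1), ('B', 3, 2), ('C', 2, 5)]

print(runs(["A-1", "B-2", "A-3"]))
# [('A', 1, 1), ('B', 1, 2), ('A', 1, 3)]
```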
Group and count the number of values to get continuous_counts, i.e.
A,1
B,3
C,2
Get the top row (lowest index) for each value, i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray,index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group as value,COUNT(A.value) as continuous_counts;
D = FOREACH B {
ordered = ORDER A BY index;
first = LIMIT ordered 1;
GENERATE FLATTEN(first) AS (value,index);
}
E = JOIN C BY value,D BY value;
F = FOREACH E GENERATE C::value,C::continuous_counts,D::index;
DUMP F;

Basic statistics with Apache Pig

I am trying to characterize fractions of rows having certain properties using Apache Pig.
For example, if the data looks like:
a,15
a,16
a,17
b,3
b,16
I would like to get:
a,0.6
b,0.4
I am trying to do the following:
A = LOAD 'my file' USING PigStorage(',');
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
which gives me total = (5), but then when I attempt to use this 'total':
fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;
I get an error.
Clearly COUNT() returns some kind of projection and both projections (in computing total and fractions) should be consistent. Is there a way to make this work? Or perhaps just to cast total to be a number and avoid this projection consistency requirement?
One more way to do the same:
test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
B = GROUP test by $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group,COUNT(test.$0);
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,(double)$1/(double)$3;
Output:
(a,3,5,0.6)
(b,2,5,0.4)
You will have to project and cast it to double:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
fractions = FOREACH rows GENERATE rows.$0,(double)rows.$1/(double)total.$0;
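The following modification of what @inquisitive-mind suggested works. Inside FOREACH rows the fields are referenced directly by their aliases rather than as rows.$0 and rows.$1, and only total.$0 is used as a scalar: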
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;
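For comparison, the same per-group fraction can be written in a few lines of Python (not Pig). This is only a sketch of the arithmetic, using the sample data from the question; the function name fractions is hypothetical. The point it shares with the Pig versions is that the division must be done in floating point, not on integer counts.

```python
from collections import Counter


def fractions(rows):
    """Return {key: share of rows having that key} for (key, number) rows."""
    counts = Counter(key for key, _ in rows)
    total = sum(counts.values())
    # "/" is true division in Python 3, so 3 / 5 gives 0.6 rather than 0.
    return {key: count / total for key, count in counts.items()}


data = [("a", 15), ("a", 16), ("a", 17), ("b", 3), ("b", 16)]
print(fractions(data))  # {'a': 0.6, 'b': 0.4}

# Integer division, which is what dividing two integer counts amounts to,
# loses the fraction entirely.
print(3 // 5)  # 0
```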

Apache Pig floating number SUM error in precision

I have rows with double values.
Their sum, however, has additional floating-point digits which I don't want in the output. Any idea how to avoid this problem?
A = LOAD 'test.csv' USING PigStorage(',') AS (
ID: chararray,
COST:double
);
B = GROUP A BY (ID);
C = FOREACH B GENERATE SUM(A.COST);
STORE C INTO 'out.txt' USING PigStorage(',');
INPUT FILE
A,0.51
A,0.51
B,4.81
B,4.81
EXPECTED OUTPUT FILE
A,1.02
B,9.62
ACTUAL INVALID OUTPUT FILE
10.020000457763672
9.619999885559082
Try C = FOREACH B GENERATE ROUND(SUM(A.COST)*100.0)/100.0;
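The multiply, round, divide trick can be checked on the values from the question with a short Python sketch (not Pig). The function name round_to_cents is hypothetical, and Python's round stands in for Pig's ROUND here; the two may treat exact ties differently, which does not matter for these inputs.

```python
def round_to_cents(x):
    """Round x to two decimal places by scaling to cents and back."""
    # Scale up so the digits to keep sit left of the decimal point,
    # round to the nearest integer, then scale back down.
    return round(x * 100.0) / 100.0


print(round_to_cents(9.619999885559082))   # 9.62
print(round_to_cents(10.020000457763672))  # 10.02
```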
EDIT
It works, see below the output

Pig and Parsing issue

I am trying to figure out the best way to parse key-value pairs with Pig in a dataset with mixed delimiters.
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which I require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is
A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here I get (k1=v1 k2=v2 k3=v3) for B.
Is there any way I can further parse this by " " so as to get 3 fields k1=v1, k2=v2 and k3=v3, which can then be further split into k1,v1,k2,v2,k3,v3 using STRSPLIT and FLATTEN on "="?
Thanks for the help!
San
If you know beforehand how many key=value pairs are in each record, try this:
A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'[= ]',6); -- split on '=' or space; 6 = 3 key=value pairs x 2 fields
D = FOREACH C GENERATE FLATTEN($0);
DUMP D;
output:
(k1,v1, k2,v2, k3,v3)
If you don't know the number of key=value pairs, use ' ' as the delimiter and remove the unwanted prefix from the $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)
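The parsing steps are easy to verify outside Pig. The following Python sketch (not Pig) takes the text after the last '|', splits it on whitespace into key=value pairs, then splits each pair on '='. The function name parse_pairs is hypothetical, and unlike the scripts above it does not fix the number of pairs in advance.

```python
def parse_pairs(line):
    """Return [k1, v1, k2, v2, ...] from a line like 'a|b|c|k1=v1 k2=v2'."""
    # Everything after the last '|' holds the key=value pairs.
    last_field = line.strip().split("|")[-1]
    flat = []
    for pair in last_field.split():
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = pair.split("=", 1)
        flat.extend([key, value])
    return flat


print(",".join(parse_pairs("a|b|c|k1=v1 k2=v2 k3=v3")))  # k1,v1,k2,v2,k3,v3
```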

Merge two lines in Pig

I would like to write a Pig script for the problem below.
Input is:
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
Output should be:
ABC,DEF,GHI,JKL
MNO,PQR,STU,VWX
Could anyone please help me?
It will be difficult to solve this problem using native Pig. One option is to download the datafu-1.2.0.jar library and try the approach below.
input.txt
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
PigScript:
REGISTER /tmp/datafu-1.2.0.jar;
DEFINE BagSplit datafu.pig.bags.BagSplit();
A = LOAD 'input.txt' USING PigStorage(',') AS(f1,f2,f3,f4);
B = GROUP A ALL;
C = FOREACH B GENERATE FLATTEN(BagSplit(2,$1)) AS mybag;
D = FOREACH C GENERATE FLATTEN(STRSPLIT(REPLACE(BagToString(mybag),'_null_null_null_null',''),'_',4));
E = FOREACH D GENERATE $2,$3,$0,$1;
DUMP E;
Output:
(MNO,PQR,STU,VWX)
(ABC,DEF,GHI,JKL)
Note:
Based on the above input format, my assumption is that in the 1st row the last two columns are null and in the 2nd row the first two columns are null, and similarly for the 3rd and 4th rows.
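The merge itself, under the same assumption about alternating empty columns, can be sketched in Python (not Pig) as follows. The function name merge_pairs is hypothetical. It pairs each odd line with the even line after it and keeps whichever side has a value in each column; unlike the DUMP output above, it preserves the input order.

```python
def merge_pairs(lines):
    """Merge each pair of consecutive lines column by column."""
    merged = []
    # Step through the input two lines at a time; a trailing odd line is ignored.
    for i in range(0, len(lines) - 1, 2):
        first = lines[i].strip().split(",")
        second = lines[i + 1].strip().split(",")
        # For each column take the non-empty value from either line.
        merged.append(",".join(a or b for a, b in zip(first, second)))
    return merged


sample = ["ABC,DEF,,", ",,GHI,JKL", "MNO,PQR,,", ",,STU,VWX"]
print(merge_pairs(sample))  # ['ABC,DEF,GHI,JKL', 'MNO,PQR,STU,VWX']
```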