Sort tuples in a bag based on multiple fields - apache-pig

I am trying to sort tuples inside a bag based on three fields in descending order.
Example : Suppose I have the following bag created by grouping:
{(s,3,my),(w,7,pr),(q,2,je)}
I want to sort the tuples in the above grouped bag on fields $0, $1, $2, so that it first sorts on $0 of all the tuples and picks the tuple with the largest $0 value. If $0 is the same for all the tuples, it then sorts on $1, and so on.
The sorting should be applied to all the grouped bags, iterating over each one.
Suppose we have a databag like:
{(21,25,34),(21,28,64),(21,25,52)}
Then, according to the requirement, the output should be:
{(21,25,34),(21,25,52),(21,28,64)}
Please let me know if you need any more clarification.
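For reference, the field-by-field comparison described here is ordinary lexicographic tuple ordering; a minimal sketch in plain Python (not Pig) of the requested result:

```python
# Lexicographic sort sketch (plain Python, not Pig): tuples are compared
# field by field ($0 first, then $1, then $2), exactly as described above.
bag = [(21, 25, 34), (21, 28, 64), (21, 25, 52)]
ordered = sorted(bag)
print(ordered)  # [(21, 25, 34), (21, 25, 52), (21, 28, 64)]
```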

Order your tuples in a nested FOREACH. This will work.
Input:
(1,s,3,my)
(1,w,7,pr)
(1,q,2,je)
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
od = ORDER A BY b, c, d;
GENERATE od;
};
DUMP C result (which resembles your data):
({(1,s,3,my),(1,w,7,pr),(1,q,2,je)})
Output of DUMP D:
({(1,q,2,je),(1,s,3,my),(1,w,7,pr)})
This will work for all the cases.
Generate tuple with highest value:
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
od = ORDER A BY b desc, c desc, d desc;
od1 = LIMIT od 1;
GENERATE od1;
};
dump D;
Generate the tuple with the highest value if all three fields are different; if all the tuples are the same, or if field 1 and field 2 are the same, then return all the tuples.
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
F = RANK C; -- RANK used to tell groups apart when two groups contain the same tuples
R = FOREACH F {
dis = distinct A;
GENERATE rank_C,COUNT(dis) AS (cnt:long),A;
};
R3 = FILTER R BY cnt!=1; -- keep groups whose tuples are not all the same
R4 = FOREACH R3 {
fil1 = ORDER A by b desc, c desc, d desc;
fil2 = LIMIT fil1 1;
GENERATE rank_C,fil2;
}; -- find the largest tuple, except for groups where all the tuples are the same
R5 = FILTER R BY cnt==1; -- contains only groups where all the tuples are the same
R6 = FOREACH R5 GENERATE A; -- generate the required fields
F1 = FOREACH F GENERATE rank_C,FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); -- group by field 1, field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long), F1; -- if count = 2 then the tuples match on field 1 and field 2
F4 = FILTER F3 BY cnt1==2; -- separate those alone
F5 = FOREACH F4 {
DIS = distinct F1;
GENERATE flatten(DIS);
};
F8 = JOIN F BY rank_C, F5 by rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = cross R4,F5; -- cross done to generate groups where all the tuples are different
Z1 = FILTER Z BY R4::rank_C!=F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
res = UNION Z2,R6,F9; -- Z2: groups where all three fields differ; holds the highest-value tuple
-- R6: groups where all the tuples are the same
-- F9: groups where two fields of the tuples are the same
dump res;
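One simplified reading of the three cases above, sketched in plain Python rather than Pig (the helper name `pick` and the tie rule on the first two fields are my assumptions, not taken from the script):

```python
def pick(bag):
    """Sketch of the three-case logic: all tuples identical -> return all;
    tie on the first two fields of the largest tuple -> return the ties;
    otherwise -> return the single lexicographically largest tuple."""
    if len(set(bag)) == 1:
        return bag
    best = max(bag)
    ties = [t for t in bag if t[:2] == best[:2]]
    return ties if len(ties) > 1 else [best]

print(pick([(1, 2, 3), (4, 5, 6)]))  # [(4, 5, 6)]
print(pick([(1, 2, 3), (1, 2, 4)]))  # [(1, 2, 3), (1, 2, 4)]
```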

Related

Create a key comparing two rows in SAS

I have SAS data set which has 2 columns
Var1 Var2
A B
B C
C D
D E
F G
H F
Can I create the same unique key for the above rows? The final output I want is:
Var1 Var2 Key
A B 1
B C 1
C D 1
D E 1
F G 2
H F 2
The general problem of assigning a group identifier based on row-to-row linkages can be very rich and difficult. However, for the sequential case the solution is not so bad.
Sample code
Presume the group identity changes when both variable values are not present in the prior row.
data have;
input Var1 $ Var2 $;
datalines;
A B
B C
C D
D E
F G
H F
run;
data want;
set have;
group_id + ( var1 ne lag(var2) AND var2 ne lag(var1) );
run;
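The same sequential rule can be sketched in Python (not SAS) to check the logic; the `lag` comparisons become a remembered previous row:

```python
# A new group starts when neither value of the current row appears
# in the prior row (mirrors: var1 ne lag(var2) AND var2 ne lag(var1)).
rows = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("F", "G"), ("H", "F")]
group_id = 0
prev_v1, prev_v2 = None, None
keys = []
for v1, v2 in rows:
    if v1 != prev_v2 and v2 != prev_v1:
        group_id += 1
    keys.append(group_id)
    prev_v1, prev_v2 = v1, v2
print(keys)  # [1, 1, 1, 1, 2, 2]
```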
Complex case
@Vivek Gupta states in comments:
There are random arrangement of rows in the dataset
Consider arbitrary rows p and q with items X and Y. Groups are created by linkages whose criterion is:
p.X = q.X
OR p.X = q.Y
OR p.Y = q.X
OR p.Y = q.Y
A hash based solver will populate groups initially from a data scan. Repeated scans of data with hash lookups migrate items into lower groups (thus enlarging the group) until there is a scan with no migrations.
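The repeated-scan idea can be sketched in Python with a plain dict standing in for the hash object (a union-find structure would converge faster; this mirrors the scan-until-no-migration approach described above):

```python
pairs = [("A", "B"), ("C", "D"), ("D", "E"), ("B", "C"),
         ("H", "F"), ("X", "Y"), ("F", "G")]
group = {}      # item -> group id (plays the role of the 'map' hash)
next_id = 0
changed = True
while changed:  # rescan until a full pass causes no migration
    changed = False
    for a, b in pairs:
        ga, gb = group.get(a), group.get(b)
        if ga is None and gb is None:
            next_id += 1
            group[a] = group[b] = next_id
            changed = True
        elif gb is None:
            group[b] = ga
            changed = True
        elif ga is None:
            group[a] = gb
            changed = True
        elif ga != gb:
            group[a] = group[b] = min(ga, gb)  # migrate into the lower group
            changed = True
print(group)
```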
data pairs;
id + 1;
input item1 $ item2 $ ;
cards;
A B
C D
D E
B C
H F
X Y
F G
run;
data _null_ ;
length item $8 group 8;
retain item '' group .;
if 0 then set pairs;
declare hash pairs();
pairs.defineKey('item1', 'item2');
pairs.defineDone();
declare hash map(ordered:'A');
map.definekey ('item');
map.definedata ('item', 'group');
map.definedone();
_groupId = 0;
noMappings = 0;
nPass = 0;
do until (end);
set pairs end=end;
pairs.replace();
found1 = map.find(key:item1) eq 0; item1g = group;
found2 = map.find(key:item2) eq 0; item2g = group;
put item1= item2= found1= found2= item1g= item2g=;
select;
when ( found1 and not found2) map.add(key:item2,data:item2,data:item1g);
when (not found1 and found2) map.add(key:item1,data:item1,data:item2g);
when (not found1 and not found2) do;
_groupId + 1;
map.add(key:item1,data:item1,data:_groupId);
map.add(key:item2,data:item2,data:_groupId);
end;
otherwise
;
end;
end;
declare hiter data('pairs');
do iteration = 1 to 1000 until (discrete);
put iteration=;
discrete = 1;
do index = 1 by 1 while (data.next() = 0);
found1 = map.find(key:item1) eq 0; item1g = group;
found2 = map.find(key:item2) eq 0; item2g = group;
put index= item1= item2= item1g= item2g=;
if (item1g < item2g) then do; map.replace(key:item2,data:item2,data:item1g); discrete=0; end;
if (item2g < item1g) then do; map.replace(key:item1,data:item1,data:item2g); discrete=0; end;
end;
end;
if discrete then put 'NOTE: discrete groups at ' iteration=; else put 'NOTE: groups not discrete after ' iteration=;
map.output(dataset:'map');
run;
Complex case #2
Groups are created by linkages whose criteria is
p.X = q.X
OR p.y = q.y
The following example is offsite and too long to post here.
How to create groups from rows associated by linkages in either of two variables
General statement of problem:
Given: P = p{i} = (p{i,1}, p{i,2}), a set of pairs (key1, key2).
Find: The distinct groups, G = g{x}, of P,
such that each pair p in a group g has this property:
key1 matches key1 of any other pair in g.
-or-
key2 matches key2 of any other pair in g.
In short, the example shows
An iterative way using hashes.
Two hashes maintain the groupId assigned to each key value.
Two additional hashes are used to maintain group mapping paths.
When the data can be passed without causing a mapping, then the groups
have been fully determined.
A final pass is done
groupIds are assigned to each pair
data is output to a table
As you have not described any logic, the below query will work for your sample output:
select Var1, Var2, 1 as [key]
from t

Find continuity of elements in Pig

how can i find the continuity of a field and starting position
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output i want is
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index, respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
B = FOREACH (GROUP A BY value) GENERATE
group AS value,
COUNT(A) AS continuous_counts,
MIN(A.index) AS start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data does have the possibility of discontinuous values, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
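For the discontinuous case, the run-length logic such a UDF would implement can be sketched in Python with `itertools.groupby`, which only groups consecutive rows:

```python
from itertools import groupby

data = [("A", 1), ("B", 2), ("B", 3), ("B", 4), ("C", 5), ("C", 6)]
runs = []
for value, run in groupby(data, key=lambda t: t[0]):
    run = list(run)
    # (value, length of the consecutive run, index where the run starts)
    runs.append((value, len(run), run[0][1]))
print(runs)  # [('A', 1, 1), ('B', 3, 2), ('C', 2, 5)]
```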
Group and count the number of values for continuous_counts, i.e.
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray, index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group AS value, COUNT(A.value) AS continuous_counts;
D = FOREACH B {
ordered = ORDER A BY index;
first = LIMIT ordered 1;
GENERATE FLATTEN(first) AS (value:chararray, index:int);
};
E = JOIN C BY value,D BY value;
F = FOREACH E GENERATE C::value,C::continuous_counts,D::index;
DUMP F;

Pig min command and order by

I have data in the form of shell, $917.14,$654.23,2013
I have to find out the minimum value in columns $1 and $2.
I tried to order by these columns in ascending order,
but the answer is not coming out correctly. Can anyone please help?
Refer to MIN:
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT1: The data you are loading has '$' in it. You will either have to clean it up and load it into a float field to apply the MIN function, or load it into a chararray, replace the '$', then cast it to float and apply the MIN function.
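The cleanup EDIT1 describes (strip the '$', cast to a number, then take the minimum) can be sketched in Python to check the expected result:

```python
rows = [
    "shell,$820.48,$11992.70,996,891,1629",
    "shell,$817.12,$2105.57,1087,845,1630",
]
amounts = []
for line in rows:
    f = line.split(",")
    # strip the leading '$' so the fields can be treated as floats
    amounts.append((float(f[1].lstrip("$")), float(f[2].lstrip("$"))))
min_f2 = min(a for a, _ in amounts)
min_f3 = min(b for _, b in amounts)
print(min_f2, min_f3)  # 817.12 2105.57
```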
EDIT2: Here is the solution without removing the $ in the original data but handling it in the PigScript.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));
DUMP D;
Output

PIG flatten vs group on nested bag

I'm learning PIG and I have a question that I know might be in books, but unfortunately I don't have the time to do the research.
I have two pipelines:
(option 1):
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid;
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
e = foreach d generate flatten(year) as year, event, mpg;
f = group e by year;
g = foreach f generate group, AVG(e.mpg);
x = limit g 10;
dump x;
I load two files, then join them, then take the last two digits of the date to get the year; after that I use flatten to simplify things before grouping to get the average mpg.
(option 2):
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid;
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
f = group d by year;
g = foreach f generate group, AVG(d.mpg);
x = limit g 10;
dump x;
Same thing, but I don't use flatten before grouping to get the average mpg.
I get the same results, but is there a significant difference? The dataset I used here is not big; I'm curious how it would behave if I had a couple of million records.
Thanks.

Apache Pig occurrences of a character in descending order

I want to count the number of occurrences of each character, in descending/ascending order, neglecting special characters, in Apache Pig. Can anyone give a solution for this?
My input file is like the following:
adaek#482;awst%16
alf$951;adftu*15
Desired output:
a : 5
d,t,f:2
e,k,w,l,u: 1
You'll need a UDF StringToCharArray that breaks a string into a bag of characters (wrap toCharArray() and return a bag), then do the following:
a = load ... as (inp : chararray);
b = foreach a generate flatten(StringToCharArray(inp)) as singlechar;
c = group b by singlechar;
d = foreach c generate group as singlechar, COUNT_STAR(b) as total;
e = group d by total;
f = foreach e generate d.singlechar as chargroup, group as total;
dump f;
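The counting logic the answer describes (explode to characters, count, then regroup characters by their count) can be sketched end-to-end in Python; `collections.Counter` stands in for the UDF plus the GROUP/COUNT_STAR steps, and `isalpha()` is an assumed way to neglect digits and special characters:

```python
from collections import Counter

lines = ["adaek#482;awst%16", "alf$951;adftu*15"]

# count only letters, neglecting digits and special characters
counts = Counter(ch for line in lines for ch in line if ch.isalpha())

# regroup: characters sharing the same count, highest count first
by_total = {}
for ch, total in counts.items():
    by_total.setdefault(total, []).append(ch)
for total in sorted(by_total, reverse=True):
    print(",".join(sorted(by_total[total])), ":", total)
```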