accessing an element like array in pig - apache-pig

I have data in the form:
id,val1,val2
example
1,0.2,0.1
1,0.1,0.7
1,0.2,0.3
2,0.7,0.9
2,0.2,0.3
2,0.4,0.5
So first I want to sort each id by val1 in decreasing order, so something like:
1,0.2,0.1
1,0.2,0.3
1,0.1,0.7
2,0.7,0.9
2,0.4,0.5
2,0.2,0.3
And then select the second element id,val2 combination for each id
So for example:
1,0.3
2,0.5
How do I approach this?
Thanks

Pig is a scripting language, not a relational one like SQL. It is well suited to working with groups, using operators nested inside a FOREACH. Here is the solution:
A = LOAD 'input' USING PigStorage(',') AS (id:int, v1:float, v2:float);
B = GROUP A BY id; -- isolate all rows for the same id
C = FOREACH B { -- here comes the scripting bit
    elems = ORDER A BY v1 DESC; -- sort rows belonging to the id
    two = LIMIT elems 2; -- select top 2
    two_invers = ORDER two BY v1 ASC; -- sort in opposite order to bubble second value to the top
    second = LIMIT two_invers 1;
    GENERATE FLATTEN(group) as id, FLATTEN(second.v2);
};
DUMP C;
In your example id 1 has two rows with v1 == 0.2 but different v2, thus the second value for the id 1 can be 0.1 or 0.3
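To make the tie concrete, here is a minimal plain-Python sketch of what the script above computes. It is not Pig code; the function name `second_by_v1` and the in-memory `rows` list are made up for illustration. It groups rows by id, sorts each group by v1 in descending order, and takes the v2 of the second row.

```python
from collections import defaultdict

def second_by_v1(rows):
    """For each id, sort its rows by v1 descending and return the v2 of the second row."""
    groups = defaultdict(list)
    for rid, v1, v2 in rows:
        groups[rid].append((v1, v2))
    result = {}
    for rid, elems in groups.items():
        # Python's sort is stable, so tied v1 values keep their input order.
        ordered = sorted(elems, key=lambda e: e[0], reverse=True)
        if len(ordered) >= 2:  # ids with a single row have no second element
            result[rid] = ordered[1][1]
    return result

rows = [
    (1, 0.2, 0.1), (1, 0.1, 0.7), (1, 0.2, 0.3),
    (2, 0.7, 0.9), (2, 0.2, 0.3), (2, 0.4, 0.5),
]
print(second_by_v1(rows))  # {1: 0.3, 2: 0.5}
```

Python happens to return 0.3 for id 1 only because its sort keeps tied rows in input order. The Pig script makes no such promise, so add a secondary sort key if the result for ties must be deterministic.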

A = LOAD 'input' USING PigStorage(',') AS (id:int, v1:float, v2:float); -- the sample values are decimals, so load them as float
B = ORDER A BY id ASC, v1 DESC;
C = FOREACH B GENERATE id, v2;
DUMP C;

Related

Take MIN EFF_DT and MAX_CANC_dt from data in PIG

Schema :
TYP|ID|RECORD|SEX|EFF_DT|CANC_DT
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Suppose I have multiple records like this. I only want to display records that have the minimum eff_dt and the maximum canc_dt.
In this case I want to display just this one record:
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Thank you
Get the min eff_dt and max canc_dt and use them to filter the relation. Assuming you have a relation A:
B = GROUP A ALL;
X = FOREACH B GENERATE MIN(A.EFF_DT);
Y = FOREACH B GENERATE MAX(A.CANC_DT);
C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0));
D = DISTINCT C;
DUMP D;
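As a rough illustration of the logic in this script (global min and max, filter, then distinct), here is a small plain-Python sketch. It is not Pig; the function name `min_max_records` and the hard-coded lines are assumptions for the example, and the dates are compared as ISO-formatted strings, which sort correctly as text.

```python
def min_max_records(records):
    """Keep distinct records whose eff_dt is the overall minimum and canc_dt the overall maximum."""
    min_eff = min(r[4] for r in records)   # column 4 is EFF_DT
    max_canc = max(r[5] for r in records)  # column 5 is CANC_DT
    kept = [r for r in records if r[4] == min_eff and r[5] == max_canc]
    return sorted(set(kept))               # set() plays the role of DISTINCT

lines = [
    "DMF|1234567|98765432|M|2011-08-30|9999-12-31",
    "DMF|1234567|98765432|M|2011-04-30|9999-12-31",
    "DMF|1234567|98765432|M|2011-04-30|9999-12-31",
]
records = [tuple(line.split("|")) for line in lines]
print(min_max_records(records))
```

Note that the filter requires both conditions on the same record. If no single record carries both the minimum eff_dt and the maximum canc_dt, the result is empty.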
Let's say you have this data (sample here):
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31
Perform these steps:
-- 1. Read data, if you have not
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
-- 2. Group data by the attribute you like to, in this case it is TYP
grouped = group A by typ;
-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;
--
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)
If you need to, change datetime to chararray.
Note: there are different ways of doing this. Apart from the load step, the approach shown here produces the desired result in two steps: GROUP and FOREACH.

Apache pig about sort top n

Recently I tried to use Pig to sort some data. The following is my script to order the data by count (for example, I want to find the top 3):
in = load 'data.txt';
split = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group split by tmp;
result = foreach C generate group, COUNT(split) as cnt;
des = ORDER result BY cnt DESC;
fin = LIMIT des 3;
The output then looks like this:
A,10
B,9
C,8
But if other entries also have a count of 8, they are not output. In detail, when I run DUMP des, the contents look like the following:
A,10
B,9
C,8
D,8
E,8
F,7
.
.
If I want to output the top 3, the result also needs to include D,8 and E,8, but LIMIT in Pig can't do that. Does anyone have experience dealing with this problem in Pig Latin, or do I have to write a UDF to handle it?
LIMIT will not work in your case; you have to use the RANK and FILTER operators.
data.txt
A,A,A,A,A,A,A,A,A,A
B,B,B,B,B,B,B,B,B
C,C,C,C,C,C,C,C
D,D,D,D,D,D,D,D
E,E,E,E,E,E,E,E
F,F,F,F,F,F,F
PigScript:
in = load 'data.txt';
sp = foreach in generate flatten(TOKENIZE((chararray)$0)) as tmp;
C = group sp by tmp;
result = foreach C generate group, COUNT(sp) as cnt;
myrank = RANK result BY cnt DESC DENSE;
top3 = FILTER myrank BY rank_result<=3;
finalOutput = FOREACH top3 GENERATE group,cnt;
DUMP finalOutput;
Output:
(A,10)
(B,9)
(C,8)
(D,8)
(E,8)
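The reason RANK ... DENSE works here is that tied counts share one rank and no rank numbers are skipped, so "rank <= 3" means "the three highest distinct counts". A minimal plain-Python sketch of that idea follows. It is not Pig; `top_n_dense` is a hypothetical helper and the input is rebuilt in memory from the data.txt contents above.

```python
from collections import Counter

def top_n_dense(counts, n):
    """Return (key, count) pairs whose dense rank by count (descending) is <= n."""
    distinct_counts = sorted(set(counts.values()), reverse=True)
    rank_of = {c: i + 1 for i, c in enumerate(distinct_counts)}  # dense rank: no gaps
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(k, c) for k, c in ranked if rank_of[c] <= n]

# Rebuild the sample file: A appears 10 times, B 9 times, and so on.
lines = [",".join([ch] * n) for ch, n in
         [("A", 10), ("B", 9), ("C", 8), ("D", 8), ("E", 8), ("F", 7)]]
counts = Counter(tok for line in lines for tok in line.split(","))
print(top_n_dense(counts, 3))
# [('A', 10), ('B', 9), ('C', 8), ('D', 8), ('E', 8)]
```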

How to perform a DISTINCT in Pig Latin on a subset of columns?

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:
You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).
It is simple to perform a DISTINCT operation on all of the columns:
A = LOAD 'data' AS (a1,a2,a3,a4);
A_unique = DISTINCT A;
Let's say that I am interested in performing the distinct across a1, a2, and a3. Can anyone provide an example showing how to perform this operation with a nested foreach as suggested in the documentation?
Here's an example of input and expected output:
A = LOAD 'data' AS(a1,a2,a3,a4);
DUMP A;
(1 2 3 4)
(1 2 3 4)
(1 2 3 5)
(1 2 4 4)
-- insert DISTINCT operation on a1,a2,a3 here:
-- ...
DUMP A_unique;
(1 2 3 4)
(1 2 4 4)
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
A_unique =
FOREACH (GROUP A BY a4) {
b = A.(a1,a2,a3);
s = DISTINCT b;
GENERATE FLATTEN(s), group AS a4;
};
The accepted answer is a great solution, but it might not work if you want to reorder the fields in the output (something I had to do recently). Here's an alternative:
A = LOAD '$input' AS (f1, f2, f3, f4, f5);
GP = GROUP A BY (f1, f2, f3);
OUTPUT = FOREACH GP GENERATE
group.f1, group.f2, f4, f5, group.f3 ;
When you group on certain fields, each output tuple holds a unique combination of values for the group key.
For your specified input/output, the following works. You might update your test vectors to clarify what you need that is different than this.
A_unique = DISTINCT A;
Here are two possible solutions. Are there any other good approaches?
Solution 1 (using LIMIT 1):
A = LOAD 'test_data' AS (a1,a2,a3,a4);
-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4;
-- Group by the combined column
grouped_by_a4 = GROUP A2 BY combined;
grouped_and_distinct = FOREACH grouped_by_a4 {
single = LIMIT A2 1;
GENERATE FLATTEN(single);
};
Solution 2 (using DISTINCT):
A = LOAD 'test_data' AS (a1,a2,a3,a4);
-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4;
-- Group by the other columns (those I don't want the distinct applied to)
grouped_by_a4 = GROUP A2 BY a4;
-- Perform the distinct on a projection of combined and flatten
grouped_and_distinct = FOREACH grouped_by_a4 {
combined_unique = DISTINCT A2.combined;
GENERATE FLATTEN(combined_unique);
};
unique_A = FOREACH (GROUP A BY (a1, a2, a3)) {
limit_a = LIMIT A 1;
GENERATE FLATTEN(limit_a) AS (a1,a2,a3,a4);
};
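The GROUP plus LIMIT 1 approach amounts to "keep one row per distinct (a1, a2, a3) key". A minimal plain-Python sketch of that behaviour, using the sample rows from the question, might look like this. It is not Pig, and `distinct_on` is a made-up helper name.

```python
def distinct_on(rows, key_indexes):
    """Keep the first row seen for each distinct combination of the key columns."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [(1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 3, 5), (1, 2, 4, 4)]
print(distinct_on(rows, (0, 1, 2)))  # [(1, 2, 3, 4), (1, 2, 4, 4)]
```

The sketch keeps the first row it encounters for each key. In Pig, a LIMIT without a preceding ORDER does not specify which row of the group survives, so the a4 value for key (1, 2, 3) could be either 4 or 5.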
I was looking to do the same: "I would like to perform a DISTINCT operation on a subset of the columns". The way I did it was:
A = LOAD 'data' AS(a1,a2,a3,a4);
interested_fields = FOREACH A GENERATE a1,a2,a3;
distinct_fields= DISTINCT interested_fields;
final_answer = FOREACH distinct_fields GENERATE FLATTEN($0);
I know it's not an example of how to perform a nested foreach as suggested in the documentation, but it's a way of doing a distinct over a subset of fields. I hope it helps anyone who gets here just like I did.

Pig split and join

I need to propagate a field value from one row to the other rows with the same id, based on the record type.
For example, my raw input is:
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
Pig is not exactly SQL; it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using a GROUP BY, a FILTER nested in the FOREACH, and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
P = FILTER inpt BY type == 'p'; -- keep only the records of type p for the id
PL = LIMIT P 1; -- make sure there is just one
GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};
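The same idea in a minimal plain-Python sketch, for readers who want to check the logic outside Pig: find the browser on each id's 'p' row, then attach it to every row of that id. The function name `propagate_browser` is hypothetical, and the output columns are ordered (id, browser, type) to match the desired result in the question rather than the column order the Pig script generates.

```python
def propagate_browser(rows):
    """Copy the browser from each id's first 'p' row onto every row of that id."""
    browser_by_id = {}
    for rid, browser, rtype in rows:
        if rtype == "p" and rid not in browser_by_id:  # like LIMIT P 1
            browser_by_id[rid] = browser
    # Rows whose id has no 'p' row are dropped, as flattening an empty bag would do.
    return [(rid, browser_by_id[rid], rtype)
            for rid, _, rtype in rows if rid in browser_by_id]

rows = [
    (1, "firefox", "p"), (1, "", "q"), (1, "", "r"), (1, "", "s"),
    (2, "ie", "p"), (2, "", "s"),
    (3, "chrome", "p"), (3, "", "r"), (3, "", "s"),
    (4, "netscape", "p"),
]
for row in propagate_browser(rows):
    print(row)
```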

ranking in Apache Pig

Is there a good way to do ranking on a column in Apache Pig after you've sorted it? Even better would be if the ranking handled ties.
A = LOAD 'file.txt' as (score:int, name:chararray);
B = foreach A generate score, name order by score;
....
Try the RANK operator:
A = load 'data' AS (f1:chararray,f2:int,f3:chararray);
DUMP A;
(David,1,N)
(Tete,2,N)
B = rank A;
dump B;
(1,David,1,N)
(2,Tete,2,N)
Reference https://blogs.apache.org/pig/entry/apache_pig_it_goes_to
I think you could use "ORDER BY" operator. And here is the link
B = ORDER A BY score DESC;
or
B = ORDER A BY score ASC;
You should combine both solutions:
B = ORDER A BY score DESC;
C = rank B;
Let's say you want the second largest:
D = filter C by $0 == 2;
You can use RANK in Pig and it will also handle ties, but it uses only one reducer while applying the rank, so performance will suffer.
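To show what "handling ties" means in a ranking, here is a small plain-Python sketch of standard (competition) ranking, where tied scores share a rank and the following rank skips ahead. It is not Pig, and `rank_with_ties` is a made-up name. I am assuming this matches how RANK ... BY behaves without the DENSE keyword, so check the Pig documentation for your version.

```python
def rank_with_ties(scores):
    """Competition ranking, descending: ties share a rank and the next rank skips ahead."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for position, score in enumerate(ordered, start=1):
        if ranks and score == ranks[-1][1]:
            ranks.append((ranks[-1][0], score))  # same score: reuse the previous rank
        else:
            ranks.append((position, score))      # new score: rank equals its position
    return ranks

print(rank_with_ties([10, 9, 8, 8, 7]))
# [(1, 10), (2, 9), (3, 8), (3, 8), (5, 7)]
```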