How to compute subtraction of the data from different aliases in Pig? - apache-pig

dump count_a;
(10)
dump count_b;
(20)
dump count_c;
(30)
Now I want to calculate: count_c - count_b - count_a. How can I achieve this in a Pig script?

You'll need to join the three aliases together, for which you'll need a common field to join on. Assuming they are all single-record aliases (with count fields named a, b and c), you can create a constant field n to join on:
prep_a = FOREACH count_a GENERATE 1 AS n, a;
prep_b = FOREACH count_b GENERATE 1 AS n, b;
prep_c = FOREACH count_c GENERATE 1 AS n, c;
Then you can join a, b and c on the common field n:
ab = JOIN prep_a by n FULL, prep_b by n;
abc = JOIN ab by prep_a::n FULL, prep_c by n;
Finally, calculate the result:
result = FOREACH abc GENERATE (c - b - a) AS result;
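As an alternative to the joins, Pig can cast a single-tuple relation to a scalar inside an expression. The following is a minimal sketch, assuming that count_a, count_b and count_c each contain exactly one row with the count in the first field (Pig raises a runtime error if a relation used this way has more than one tuple). The output field name diff is chosen here for illustration only.

```pig
-- Assumption: each count_* alias holds exactly one single-field tuple.
-- (long)count_b.$0 reads the first field of that single tuple as a scalar.
result = FOREACH count_c GENERATE ($0 - (long)count_b.$0 - (long)count_a.$0) AS diff;
DUMP result;
```

With the values shown in the question (10, 20 and 30), this computes 30 - 20 - 10.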

Related

Pig - Convert rows into multiple columns

Can we convert input rows to multiple columns terminated by Three.*
Here is one naive solution.
Grouping every three rows using RANK and GROUP, and FILTER by each of three conditions.
My Pig Script
A = LOAD '/path_to_data/data' AS (c1:chararray);
B = RANK A;
C = FOREACH B GENERATE (rank_A+2)/3 AS id, c1;
D = FOREACH (GROUP C BY id) {
    ONE = FILTER C BY c1 matches 'One:.*';
    TWO = FILTER C BY c1 matches 'Two:.*';
    THREE = FILTER C BY c1 matches 'Three:.*';
    GENERATE
        group AS id
        , FLATTEN(ONE.c1) AS c1_one
        , FLATTEN(TWO.c1) AS c1_two
        , FLATTEN(THREE.c1) AS c1_three
    ;
};
DUMP D;
My Result
(1,One:"A",Two:"2",Three:"last")
(2,One:"B",Two:"1",Three:"first")

Apache Pig Distinct and Count

I'm trying to figure out the following question.
How many female users provided at least one rating of 4? I think my join and filters are correct, but I can't figure out the distinct count part. I have tried numerous versions of the below.
a = load '/user/pig/movie' AS (userid:int, movieid:int, rating:int, timestamp:chararray);
b = load '/user/pig/reviewer' using PigStorage('|') AS (userid:int, age:int, gender:chararray, occupation:chararray, zip:chararray);
a1 = filter a by rating == 4;
b1 = filter b by gender == 'F';
c = join a1 by userid, b1 by userid;
d = FOREACH c GENERATE COUNT(DISTINCT(userid));
dump d;
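You have to GROUP before COUNT. Ref: COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts. To count distinct users, project the user id, apply DISTINCT, and then GROUP ALL:
users = FOREACH c GENERATE a1::userid AS userid;
uniq = DISTINCT users;
d = GROUP uniq ALL;
e = FOREACH d GENERATE COUNT(uniq);
dump e;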

Pig :: Get MAX value from the COUNTs of a groups

I have a file with bank name, location and a few other fields. I want to find the bank with the maximum number of branches.
A = LOAD 'bank.txt';
B = GROUP A by $0;
C = FOREACH B GENERATE COUNT($1);
I got the bank-wise counts. Now I am stuck on how to use C to get the bank with the maximum number of branches.
Since you are grouping by bank, you will have to generate the group key and count the branch rows in each group, then order by the count in descending order and take the top row.
A = LOAD 'bank.txt';
B = GROUP A by $0;
C = FOREACH B GENERATE group AS Bank, COUNT(A) AS cnt; -- the bag inside B is named A; each row is one branch
D = ORDER C BY cnt DESC;
E = LIMIT D 1;
DUMP E;

Distinct on Multiple columns of a pig

I have a file
(1,1,100)
(1,1,200)
(1,2,300)
Now I want DISTINCT to be applied on two columns only (the expected output is shown further below).
I did this:
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
A_unique =
    FOREACH (GROUP A BY id3) {
        b = A.(id1,id2);
        s = DISTINCT b;
        GENERATE FLATTEN(s);
    };
DUMP A_unique;
Output comes out to be
(1,1)
(1,1)
(1,2)
I expected it to be
(1,1)
(1,2)
Here you go. This should give you the desired output:
a = load 'sample1.txt' using PigStorage(',') as (id1:int, id2:int, id3:int);
b = group a by (id1, id2);
c = foreach b {
    first_e = limit a.id3 1;
    generate flatten(group) as (id1, id2);
}
The code below generates the required result.
a = load '$dir/data' using PigStorage(',') as (d1:int, d2:int, d3:int);
b = group a all;
c = foreach b {
    d = a.(d1,d2);
    e = DISTINCT d;
    generate FLATTEN(e);
}
dump c;

how to combine/concat two bags in pig latin

I have two datasets:
A = {uid, url}; B = {uid, url};
now I do a cogroup:
C = COGROUP A BY uid, B BY uid;
and I want to change C into {group AS uid, DISTINCT A.url+B.url};
My question is how do I do this concatenation of two bags A.url and B.url?
Or to put it differently, how do I do DISTINCT on multiple columns?
This may not be what you're expecting, but this is what I understood from your question:
C = JOIN A BY uid, B BY uid;
D = DISTINCT C;
Concatenation is done the following way:
E = FOREACH D GENERATE CONCAT(A::uid,B::uid);
A = LOAD 'A' using PigStorage() as (uid,url);
B = LOAD 'B' using PigStorage() as (uid,url);
C = JOIN A by uid ,B by uid;
D = FOREACH C GENERATE $0,CONCAT(A::url,B::url);
E= DISTINCT D;
dump E;
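If the goal is one bag of distinct URLs per uid drawn from both datasets, as the question's {group AS uid, DISTINCT A.url+B.url} suggests, CONCAT is not the right tool because it joins strings rather than bags. A minimal sketch of another route, assuming both inputs share the (uid, url) schema and that both fields are chararray (the types and the aliases AB, U, G and R are assumptions made here), is to UNION the relations first and then DISTINCT and GROUP:

```pig
-- Assumption: both files have the same two-column layout.
A = LOAD 'A' USING PigStorage() AS (uid:chararray, url:chararray);
B = LOAD 'B' USING PigStorage() AS (uid:chararray, url:chararray);
AB = UNION A, B;          -- stack the rows of both datasets
U = DISTINCT AB;          -- drop duplicate (uid, url) pairs
G = GROUP U BY uid;       -- one group per uid
R = FOREACH G GENERATE group AS uid, U.url AS urls;  -- bag of distinct urls per uid
DUMP R;
```

Because DISTINCT runs on whole (uid, url) tuples before grouping, each url appears at most once in the bag for a given uid.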