How can I sum two fields from different tables in Pig?

table example:
A = LOAD 'data' AS (a1:int,a2:int);
DUMP A;
(1,2)
(1,3)
(2,2)
(3,4)
(3,1)
and I get
A2 = GROUP A BY a1;
DUMP A2;
(1,{(1,2),(1,3)})
(2,{(2,2)})
(3,{(3,4),(3,1)})
B = LOAD 'data2' AS (b1:int,b2:int);
(1,4)
(2,3)
(3,2)
The results that I want are
(1,{(1,6),(1,7)})
(2,{(2,5)})
(3,{(3,6),(3,3)})
That is, something like
FOREACH A2 GENERATE group,A.a2+B.b2
WHERE A.a1==B.b1
but this fails with the error:
Invalid scalar projection: B
Any thoughts would be great, thanks.

You may have to join first, then add the two fields, and then group by the key:
joined_data = JOIN A by a1, B by b1;
summed_data = FOREACH joined_data GENERATE a1 as a1,a2+b2 as sum;
final_answer = GROUP summed_data by a1;
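
To check the join, add, group logic outside Pig, here is a minimal Python sketch run on the sample data from the question. The function name `join_add_group` is hypothetical, and the sketch assumes b1 is unique in B, as it is in the sample.

```python
from collections import defaultdict

def join_add_group(rows_a, rows_b):
    """Model of: JOIN A BY a1, B BY b1 -> a2 + b2 -> GROUP BY a1."""
    # JOIN: look up b2 by key (assumes b1 is unique in B, as in the sample).
    b_lookup = dict(rows_b)
    joined = [(a1, a2, b_lookup[a1]) for a1, a2 in rows_a if a1 in b_lookup]
    # FOREACH ... GENERATE a1, a2 + b2
    summed = [(a1, a2 + b2) for a1, a2, b2 in joined]
    # GROUP summed BY a1
    grouped = defaultdict(list)
    for a1, total in summed:
        grouped[a1].append((a1, total))
    return dict(grouped)

# Sample rows from the question: A is (a1, a2), B is (b1, b2).
rows_a = [(1, 2), (1, 3), (2, 2), (3, 4), (3, 1)]
rows_b = [(1, 4), (2, 3), (3, 2)]

result = join_add_group(rows_a, rows_b)
for key in sorted(result):
    print(key, result[key])
```

On the sample data this yields the bags asked for in the question: (1,6) and (1,7) for key 1, (2,5) for key 2, and (3,6) and (3,3) for key 3.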

Related

Summarized table in PostgreSQL for better performance

I am using PostgreSQL as my database. I have a table MASTER(A, B, C, D, N1, N2, N3, N4, N5, N6) where the primary key is (A, B, C, D) and N1, N2, N3, N4, N5, N6 are numeric columns.
I have the query below to get the summarized data of each A selected from each list in MASTERCOMB.
SELECT MASTERCOMB.A
,STATS.sumn1
,STATS.sumn2
,STATS.sumn3
,STATS.sumn4
,STATS.sumn5
,STATS.sumn6
FROM (WITH
sum1 AS (SELECT A, SUM(N1) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N1) DESC LIMIT $2),
sum2 AS (SELECT A, SUM(N2) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N2) DESC LIMIT $2),
sum3 AS (SELECT A, SUM(N3) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N3) DESC LIMIT $2),
sum4 AS (SELECT A, SUM(N4) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N4) DESC LIMIT $2),
sum5 AS (SELECT A, SUM(N5) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N5) DESC LIMIT $2),
sum6 AS (SELECT A, SUM(N6) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N6) DESC LIMIT $2)
SELECT DISTINCT COALESCE(sum1.A, sum2.A, sum3.A, sum4.A, sum5.A, sum6.A) A
FROM sum1
FULL OUTER JOIN sum2 ON sum2.A = sum1.A
FULL OUTER JOIN sum3 ON sum3.A = sum1.A
FULL OUTER JOIN sum4 ON sum4.A = sum1.A
FULL OUTER JOIN sum5 ON sum5.A = sum1.A
FULL OUTER JOIN sum6 ON sum6.A = sum1.A) MASTERCOMB
LEFT JOIN (SELECT A
,SUM(N1) sumn1
,SUM(N2) sumn2
,SUM(N3) sumn3
,SUM(N4) sumn4
,SUM(N5) sumn5
,SUM(N6) sumn6
FROM MASTER WHERE B = $1 GROUP BY A) AS STATS
ON STATS.A = MASTERCOMB.A
This is just one kind of query, with B in the WHERE clause. I may have to query with other conditions such as 'WHERE C = $3' or 'WHERE D = $4'. In rare cases I may have to query with conditions on B, C and D combined.
As the table grows, the performance of these queries could drop, so I am considering two approaches.
Approach #1:
Create Summary Tables SMRY_A_B, SMRY_A_C, SMRY_A_D
On each insert, update and delete of MASTER table, SUM the values and insert/update/delete respective tables
Approach #2:
Create a Summary table SMRY_A_B_C_D with primary key (A, B, C, D)
On each insert, update and delete of MASTER table, SUM the values and insert/update/delete SMRY_A_B_C_D table
possible values for SMRY_A_B_C_D could be
(valA, valB, 'N/A', 'N/A', sumn1, sumn2, sumn3, sumn4, sumn5, sumn6)
(valA, 'N/A', valC, 'N/A', sumn1, sumn2, sumn3, sumn4, sumn5, sumn6)
(valA, 'N/A', 'N/A', valD, sumn1, sumn2, sumn3, sumn4, sumn5, sumn6)
Questions:
Which approach is better to go with?
Should I not consider both the approaches and query from the master table itself? If so should I optimize the query?
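
As a rough illustration of the bookkeeping in Approach #1, the sketch below keeps running sums per (A, B) and adjusts them on every insert or delete, which is what a trigger on MASTER would have to do. It is an in-memory Python model, not PostgreSQL code: the class name `SummaryAB` and its methods are hypothetical, and an update is assumed to be handled as a delete followed by an insert.

```python
from collections import defaultdict

class SummaryAB:
    """In-memory model of a SMRY_A_B table: running sums of N1..N6 per (A, B)."""

    def __init__(self):
        self.sums = defaultdict(lambda: [0] * 6)
        self.row_counts = defaultdict(int)

    def insert(self, a, b, values):
        # Add the new MASTER row's N1..N6 to the running sums for (a, b).
        key = (a, b)
        self.sums[key] = [s + v for s, v in zip(self.sums[key], values)]
        self.row_counts[key] += 1

    def delete(self, a, b, values):
        # Subtract the removed MASTER row's N1..N6 from the running sums.
        key = (a, b)
        self.sums[key] = [s - v for s, v in zip(self.sums[key], values)]
        self.row_counts[key] -= 1
        if self.row_counts[key] == 0:  # last MASTER row for this key is gone
            del self.sums[key]
            del self.row_counts[key]

summary = SummaryAB()
summary.insert("valA", "valB", [1, 2, 3, 4, 5, 6])
summary.insert("valA", "valB", [10, 20, 30, 40, 50, 60])
summary.delete("valA", "valB", [1, 2, 3, 4, 5, 6])
print(summary.sums[("valA", "valB")])
```

The same structure would be repeated for (A, C) and (A, D), so every write to MASTER touches three summaries. That write cost is the trade-off to weigh against faster reads.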

Pig - Convert rows into multiple columns

Can we convert input rows to multiple columns terminated by Three.*?
Here is one naive solution:
group every three rows using RANK and GROUP, then FILTER each group by the three patterns.
My Pig Script
A = Load '/path_to_data/data' as (c1 : chararray);
B = RANK A;
C = FOREACH B GENERATE (rank_A+2)/3 as id, c1;
D = FOREACH (GROUP C BY id) {
ONE = FILTER C BY c1 matches 'One:.*';
TWO = FILTER C BY c1 matches 'Two:.*';
THREE = FILTER C BY c1 matches 'Three:.*';
GENERATE
group as id
, FLATTEN(ONE.c1) as c1_one
, FLATTEN(TWO.c1) as c1_two
, FLATTEN(THREE.c1) as c1_three
;
};
DUMP D;
My Result
(1,One:"A",Two:"2",Three:"last")
(2,One:"B",Two:"1",Three:"first")
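
The id arithmetic in the script, (rank_A+2)/3 with integer division, maps ranks 1 to 3 to id 1, ranks 4 to 6 to id 2, and so on. A minimal Python sketch of the same idea follows. The input lines are an assumption reconstructed from the result shown above (the original input is not included in the post), `rows_to_columns` is a hypothetical helper name, and every group is assumed to contain all three prefixes.

```python
def rows_to_columns(lines):
    """Group every three consecutive lines into one row keyed by a 1-based id."""
    groups = {}
    for rank, line in enumerate(lines, start=1):  # RANK starts at 1
        group_id = (rank + 2) // 3                # ranks 1-3 -> 1, 4-6 -> 2, ...
        prefix = line.split(":", 1)[0]            # "One", "Two" or "Three"
        groups.setdefault(group_id, {})[prefix] = line
    return [
        (gid, cols["One"], cols["Two"], cols["Three"])
        for gid, cols in sorted(groups.items())
    ]

# Assumed input, inferred from the result in the post.
lines = ['One:"A"', 'Two:"2"', 'Three:"last"',
         'One:"B"', 'Two:"1"', 'Three:"first"']
for row in rows_to_columns(lines):
    print(row)
```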

Apache Pig Distinct and Count

I'm trying to answer the following question:
How many female users provided at least one rating of 4? I think my join and filters are correct, but I can't figure out the distinct count part. I have tried numerous versions of the below.
a = load '/user/pig/movie' AS (userid:int, movieid:int, rating:int, timestamp:chararray);
b = load '/user/pig/reviewer' using PigStorage('|') AS (userid:int, age:int, gender:chararray, occupation:chararray, zip:chararray);
a1 = filter a by rating == 4;
b1 = filter b by gender == 'F';
c = join a1 by userid, b1 by userid;
d = FOREACH c GENERATE COUNT(DISTINCT(userid));
dump d;
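You have to GROUP before COUNT. Ref: COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.
Since each user should be counted only once, project the user id, apply DISTINCT, and then GROUP ALL before COUNT. After the join the field name userid is ambiguous, so it is referenced as a1::userid.
d = FOREACH c GENERATE a1::userid AS userid;
e = DISTINCT d;
f = GROUP e ALL;
g = FOREACH f GENERATE COUNT(e);
dump g;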
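
What the question needs is the number of distinct user ids that survive the filters and the join, not one count per joined row. A minimal Python sketch of that logic follows. The sample rows are invented for illustration and are not from the question, and `count_female_raters` is a hypothetical name.

```python
def count_female_raters(ratings, reviewers, wanted_rating=4):
    """Count distinct female users who gave at least one rating of wanted_rating."""
    # filter b by gender == 'F'
    female_ids = {userid for userid, _age, gender in reviewers if gender == "F"}
    # filter a by rating, join on userid, then keep each userid once (DISTINCT)
    matching_ids = {
        userid
        for userid, _movieid, rating in ratings
        if rating == wanted_rating and userid in female_ids
    }
    return len(matching_ids)

# Invented sample data: (userid, movieid, rating) and (userid, age, gender).
ratings = [(1, 10, 4), (1, 11, 4), (2, 10, 4), (3, 12, 5), (4, 10, 4)]
reviewers = [(1, 30, "F"), (2, 25, "M"), (3, 41, "F"), (4, 19, "F")]
print(count_female_raters(ratings, reviewers))
```

User 1 rated two movies with a 4 but is counted once, which is the effect DISTINCT has in the Pig script.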

Distinct on Multiple columns of a pig

I have a file
(1,1,100)
(1,1,200)
(1,2,300)
Now I want DISTINCT applied to the first two columns only.
I tried this, following this advice:
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
A_unique =
FOREACH (GROUP A BY id3) {
b = A.(id1,id2);
s = DISTINCT b;
GENERATE FLATTEN(s);
};
DUMP A_unique;
Output comes out to be
(1,1)
(1,1)
(1,2)
I expected it to be
(1,1)
(1,2)
Grouping by both columns and flattening the group key should give you the desired output:
a = load 'sample1.txt' using PigStorage(',') as (id1:int, id2:int, id3:int);
b = group a by (id1, id2);
c = foreach b {
first_e = limit a.id3 1;
generate flatten(group) as (id1, id2);
}
The code below generates the required result:
a = load '$dir/data' using PigStorage(',') as (d1:int,d2:int,d3:int);
b = group a all;
c = foreach b {
d = a.(d1,d2);
e = DISTINCT d;
generate FLATTEN(e);
}
dump c ;
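
The attempt in the question returns (1,1) twice because DISTINCT runs separately inside each id3 group, and the three sample rows all have different id3 values. A minimal Python sketch contrasting the two behaviours follows; both helper names are hypothetical.

```python
def distinct_per_group(rows):
    """Mimic DISTINCT inside GROUP A BY id3: duplicates survive across groups."""
    by_id3 = {}
    for id1, id2, id3 in rows:
        by_id3.setdefault(id3, set()).add((id1, id2))
    return [pair for id3 in sorted(by_id3) for pair in sorted(by_id3[id3])]

def distinct_overall(rows):
    """Mimic DISTINCT over the whole relation, as with GROUP ... ALL."""
    return sorted({(id1, id2) for id1, id2, _id3 in rows})

rows = [(1, 1, 100), (1, 1, 200), (1, 2, 300)]
print(distinct_per_group(rows))  # [(1, 1), (1, 1), (1, 2)]
print(distinct_overall(rows))    # [(1, 1), (1, 2)]
```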

GROUP BY without the key in the resulting bag

I have:
a b
a c
a d
and I would like to generate:
a, {(b),(c),(d)}
Doing this by using GROUP results in:
a, {(a,b),(a,c),(a,d)}
How do I get rid of the first field in the bag?
Thanks.
There is no option to do this in GROUP. You'll have to project that column out in a FOREACH.
-- DESCRIBE A ;
-- A: {c1: chararray, c2: chararray}
-- DUMP A ;
-- a b
-- a c
-- a d
B = GROUP A BY c1 ;
C = FOREACH B GENERATE group AS c1, A.c2 AS grpd_c2 ;
When I have to do this, I generally use the inline form for brevity:
D = FOREACH (GROUP A BY c1)
GENERATE group AS c1, A.c2 AS grpd_c2 ;
(This form also helps remind me not to use B.c2.)
The key is A.c2 which returns a bag with only the c2 column from the original bag. If, for example, you had 3 fields (c1, c2, c3) you would use A.(c2, c3) instead.
B = GROUP A BY c1 ;
If you have more fields, it will be something like this:
C = FOREACH B GENERATE group AS c1, A.(c2,....);
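
For comparison, the same projection can be modelled outside Pig. This is a minimal Python sketch on the three rows from the question, with `group_without_key` as a hypothetical helper name. It groups by the first field and keeps only the second field in each bag, which is what A.c2 does.

```python
from collections import defaultdict

def group_without_key(rows):
    """Group (c1, c2) rows by c1, keeping only c2 in each bag."""
    bags = defaultdict(list)
    for c1, c2 in rows:
        bags[c1].append((c2,))  # project the key out, like A.c2
    return dict(bags)

rows = [("a", "b"), ("a", "c"), ("a", "d")]
print(group_without_key(rows))
```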