GROUP BY without the key in the resulting bag - apache-pig

I have:
a b
a c
a d
and I would like to generate:
a, {(b),(c),(d)}
Doing this by using GROUP results in:
a, {(a,b),(a,c),(a,d)}
How do I get rid of the first field in the bag?
Thanks.

There is no option to do this in GROUP. You'll have to project that column out in a FOREACH.
-- DESCRIBE A ;
-- A: {c1: chararray, c2: chararray}
-- DUMP A ;
-- a b
-- a c
-- a d
B = GROUP A BY c1 ;
C = FOREACH B GENERATE group AS c1, A.c2 AS grpd_c2 ;
When I have to do this, I generally use this form for brevity:
D = FOREACH (GROUP A BY c1)
GENERATE group AS c1, A.c2 AS grpd_c2 ;
(This form also helps remind me not to use B.c2.)
The key is A.c2 which returns a bag with only the c2 column from the original bag. If, for example, you had 3 fields (c1, c2, c3) you would use A.(c2, c3) instead.
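To make the effect of the A.c2 projection concrete, here is a minimal Python sketch that emulates the GROUP followed by the FOREACH on the sample rows. It is not Pig; the function name group_and_project is made up for illustration, and a Python list stands in for a bag (real Pig bags are unordered).

```python
from collections import defaultdict

# Rows of relation A as (c1, c2) tuples, matching the DUMP shown above.
rows = [("a", "b"), ("a", "c"), ("a", "d")]

def group_and_project(rows):
    """Emulate: FOREACH (GROUP A BY c1) GENERATE group, A.c2 (hypothetical helper)."""
    grouped = defaultdict(list)
    for c1, c2 in rows:
        # Keep only c2 inside the "bag"; the key is stored once, as the dict key.
        grouped[c1].append((c2,))
    return dict(grouped)

print(group_and_project(rows))  # {'a': [('b',), ('c',), ('d',)]}
```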

B = GROUP A BY c1 ;
If you have more fields, it will be something like this:
C = FOREACH B GENERATE group AS c1, A.(c2,....);

Related

How to get all Contract no against Leads in oracle sql query?

I need to create a sql query for below scenario:
Table name is remark
Columns are contractno and leadid.
1 contractno can have multiple leadid.
similarly,
1 leadid can be assigned to multiple contractno.
Let's assume:
C1 --> L1
C2 --> L1, L2
C3 --> L2
I will receive a single contractno, e.g. C1, as a parameter.
Now I have to find all other contracts that share a leadid with C1.
How can I achieve this?
Thank you.
SELECT r1.contractno
FROM remark r1
JOIN remark r2
ON r1.leadid = r2.leadid
WHERE r2.contractno = 'C1'
AND r1.contractno <> 'C1'
This assumes your table has this format:
contractno leadid
C1 L1
C2 L1
C2 L2
C3 L2
If it doesn't, you need to split the CSV values into rows first:
Turning a Comma Separated string into individual rows
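As a quick check of the self-join, here is a small Python sketch that runs the same query against an in-memory SQLite database instead of Oracle. The helper name related_contracts is hypothetical, the data is the mapping from the question, and DISTINCT plus ORDER BY are additions so that a contract sharing several leads is listed once and the output order is stable.

```python
import sqlite3

def related_contracts(rows, contract):
    """Return the contracts sharing at least one leadid with `contract` (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE remark (contractno TEXT, leadid TEXT)")
    conn.executemany("INSERT INTO remark VALUES (?, ?)", rows)
    cur = conn.execute(
        """
        SELECT DISTINCT r1.contractno
        FROM remark r1
        JOIN remark r2 ON r1.leadid = r2.leadid
        WHERE r2.contractno = ? AND r1.contractno <> ?
        ORDER BY r1.contractno
        """,
        (contract, contract),
    )
    result = [row[0] for row in cur]
    conn.close()
    return result

# One row per (contractno, leadid) pair: C1 -> L1, C2 -> L1, L2, C3 -> L2.
rows = [("C1", "L1"), ("C2", "L1"), ("C2", "L2"), ("C3", "L2")]
print(related_contracts(rows, "C1"))  # ['C2']
```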
You can use LISTAGG if you need the contracts combined into a single list. Here too it is assumed that your table has one leadid per row, not comma-separated leadids.
WITH cn
     AS (SELECT DISTINCT leadid
         FROM   remark
         WHERE  contractno = 'C1')
SELECT Listagg(r.contractno, ',')
         within GROUP (ORDER BY ROWNUM) contractno_C1
FROM   remark r
       join cn
         ON r.leadid = cn.leadid
WHERE  r.contractno <> 'C1'
GROUP  BY cn.leadid;
http://sqlfiddle.com/#!4/54e48/1/0

Pig - Convert rows into multiple columns

Can we convert input rows into multiple columns, where each group of rows is terminated by a row matching Three.*?
Here is one naive solution.
Group every three rows using RANK and GROUP, then FILTER on each of the three conditions.
My Pig Script
A = Load '/path_to_data/data' as (c1 : chararray);
B = RANK A;
C = FOREACH B GENERATE (rank_A+2)/3 as id, c1;
D = FOREACH (GROUP C BY id) {
    ONE = FILTER C BY c1 matches 'One:.*';
    TWO = FILTER C BY c1 matches 'Two:.*';
    THREE = FILTER C BY c1 matches 'Three:.*';
    GENERATE
        group as id
        , FLATTEN(ONE.c1) as c1_one
        , FLATTEN(TWO.c1) as c1_two
        , FLATTEN(THREE.c1) as c1_three
    ;
};
DUMP D;
My Result
(1,One:"A",Two:"2",Three:"last")
(2,One:"B",Two:"1",Three:"first")
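The bucketing arithmetic (rank_A+2)/3 is the heart of the script: with integer division, ranks 1 to 3 map to id 1, ranks 4 to 6 map to id 2, and so on. Below is a rough Python emulation of that idea, not Pig. The function name pivot_triples is invented, the input lines are reconstructed from the result above (the original input is not shown), and ranks are assumed to follow input order.

```python
def pivot_triples(lines):
    """Bucket every three consecutive lines and pivot them by their label prefix."""
    buckets = {}
    for rank, line in enumerate(lines, start=1):
        group_id = (rank + 2) // 3          # ranks 1..3 -> 1, ranks 4..6 -> 2, ...
        label = line.split(":", 1)[0]       # "One", "Two" or "Three"
        buckets.setdefault(group_id, {})[label] = line
    return [
        (gid, cols.get("One"), cols.get("Two"), cols.get("Three"))
        for gid, cols in sorted(buckets.items())
    ]

# Assumed input, reconstructed from the result shown above.
lines = ['One:"A"', 'Two:"2"', 'Three:"last"', 'One:"B"', 'Two:"1"', 'Three:"first"']
for row in pivot_triples(lines):
    print(row)
```

One difference worth noting: this sketch fills a missing column with None, whereas FLATTEN on an empty bag in Pig drops the whole output row.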

How to avoid using a subquery

Here is a sample table:
A | B
----------
DF RUI
EF RUI
AF FRO
EF FRO
I want to get all rows except those WHERE (A = 'EF' AND B = 'RUI'), like this:
A | B
----------
DF RUI
AF FRO
EF FRO
But is it possible to do this without a subquery?
EDIT:
I have added some extra rows to show what I want to get. I want to get a row if A = EF or B = RUI, but I don't want to get a row if A = EF AND B = RUI.
Just add a NOT in front of the condition in the WHERE clause:
SELECT A,B FROM table_name
WHERE NOT (A = 'EF' AND B = 'RUI')
If I've understood you correctly . . .
select * from your_table
where not (A = 'EF' and B = 'RUI');
SELECT A, B FROM table WHERE NOT (A = 'EF' AND B = 'RUI')
or
SELECT A, B FROM table WHERE (A <> 'EF' OR B <> 'RUI')
The WHERE clause is essentially a boolean expression, so you can apply any of the boolean transformations you're used to. It is a bit more complicated if NULL values are involved, which I assumed is not the case for your example; if it is, you might need to add some additional rules or check the behaviour under SQL's three-valued logic.
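To illustrate the NULL caveat, here is a small Python sketch using an in-memory SQLite database. The table name t, the extra row ('EF', NULL), and the helper select are assumptions added for the demonstration. With a NULL in B, the condition evaluates to NULL rather than true, so the row is silently filtered out unless NULLs are handled explicitly. Adding IS NULL checks is only one possible rule; whether such rows should be kept depends on your requirements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A TEXT, B TEXT)")  # hypothetical table name
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("DF", "RUI"), ("EF", "RUI"), ("AF", "FRO"), ("EF", "FRO"), ("EF", None)],
)

def select(where):
    """Run the sample query with the given WHERE condition (trusted input only)."""
    return conn.execute(f"SELECT A, B FROM t WHERE {where} ORDER BY rowid").fetchall()

# ('EF', NULL): TRUE AND NULL is NULL, NOT NULL is NULL, so the row is dropped.
plain = select("NOT (A = 'EF' AND B = 'RUI')")

# One possible rule: keep rows where either column is NULL.
null_safe = select("NOT (A = 'EF' AND B = 'RUI') OR A IS NULL OR B IS NULL")

print(plain)
print(null_safe)
```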
SELECT A, B FROM table WHERE (A = 'EF' XOR B = 'RUI')
Select * from SAMPLETABLE where A <> 'EF' OR B <> 'RUI';

How can I sum two fields from different tables in PIG?

table example:
A = LOAD 'data' AS (a1:int,a2:int);
DUMP A;
(1,2)
(1,3)
(2,2)
(3,4)
(3,1)
and I get
A2 = GROUP A BY a1;
DUMP A2;
(1,{(1,2),(1,3)})
(2,{(2,2)})
(3,{(3,4),(3,1)})
B = LOAD 'data2' AS (b1:int,b2:int);
(1,4)
(2,3)
(3,2)
The results that I want are
(1,{(1,6),(1,7)})
(2,{(2,5)})
(3,{(3,6),(3,3)})
That is,
FOREACH A2 GENERATE group,A.a2+B.b2
WHERE A.a1==B.b1, but this fails with the error:
Invalid scalar projection: B
Any thoughts would be great, thanks.
You have to do the join first, then the addition, and then the group by.
joined_data = JOIN A by a1, B by b1;
summed_data = FOREACH joined_data GENERATE a1 as a1,a2+b2 as sum;
final_answer = GROUP summed_data by a1;
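The same join, add, group order can be traced in plain Python on the data from the question. This is only an emulation of the three Pig statements; the function name join_sum_group is made up, and lists stand in for bags.

```python
from collections import defaultdict

A = [(1, 2), (1, 3), (2, 2), (3, 4), (3, 1)]   # (a1, a2)
B = [(1, 4), (2, 3), (3, 2)]                   # (b1, b2)

def join_sum_group(a_rows, b_rows):
    """Inner-join on a1 == b1, add a2 + b2, then group the sums by a1."""
    b_by_key = defaultdict(list)
    for b1, b2 in b_rows:
        b_by_key[b1].append(b2)

    grouped = defaultdict(list)
    for a1, a2 in a_rows:
        for b2 in b_by_key.get(a1, []):        # no match means no output row
            grouped[a1].append((a1, a2 + b2))
    return dict(grouped)

print(join_sum_group(A, B))
# {1: [(1, 6), (1, 7)], 2: [(2, 5)], 3: [(3, 6), (3, 3)]}
```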

SQL: Flatten hierarchy with missing levels

I have a table with a parent-child hierarchy together with a column that tells on which hierarchy level the current id is.
An example:
id  pid  level
A        H1
B   A    H2
C   B    H3
D   C    H4
E        H1
F   E    H3
G   F    H4
I want this to be transposed (flattened) into one row for each id on the lowest level, two rows in this example.
Like this:
id  H1  H2  H3  H4
D   A   B   C   D
G   E       F   G
Can you do it in SQL using pivot? I was thinking that maybe the value of the "level" column could be used as name for the columns in the result table? Value "H1" maps to column name "H1" and so on.
A stored procedure would also be a possible solution that I could think of. Has anyone done something like this?
Thanks for your help!
/Andreas
I created a table ttt that contains your data. This query will pivot five levels for you:
select id
     , [H1], [H2], [H3], [H4], [H5]
from
(
    select distinct
           coalesce(c6.id,c5.id,c4.id,c3.id,c2.id,c1.id) as id
         , c1.id as lid
         , c1.level
    from ttt c1
    left join ttt c2
        on c1.id = c2.pid
    left join ttt c3
        on c2.id = c3.pid
    left join ttt c4
        on c3.id = c4.pid
    left join ttt c5
        on c4.id = c5.pid
    left join ttt c6
        on c5.id = c6.pid ) as sourcetable
pivot ( max(lid)
    for level in ([H1], [H2], [H3], [H4], [H5])
) as pivottable;
you can check it out at sqlfiddle: example
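For comparison, the same flattening can be sketched outside SQL by walking from each leaf up its parent chain and recording every ancestor under its level name. The Python below is only an illustration of that idea; the function name flatten is invented, the rows are the sample data from the question, and the hierarchy is assumed to contain no cycles.

```python
# (id, pid, level) rows from the question; None marks a missing parent.
rows = [
    ("A", None, "H1"), ("B", "A", "H2"), ("C", "B", "H3"), ("D", "C", "H4"),
    ("E", None, "H1"), ("F", "E", "H3"), ("G", "F", "H4"),
]

def flatten(rows):
    """Map each leaf id to {level: ancestor id}, skipping levels that are absent."""
    parent = {node: pid for node, pid, _ in rows}
    level = {node: lvl for node, _, lvl in rows}
    has_children = {pid for _, pid, _ in rows if pid is not None}
    leaves = [node for node, _, _ in rows if node not in has_children]

    result = {}
    for leaf in leaves:
        path = {}
        node = leaf
        while node is not None:        # climb until the root (assumes no cycles)
            path[level[node]] = node
            node = parent[node]
        result[leaf] = path
    return result

for leaf, path in flatten(rows).items():
    print(leaf, [path.get(h, "") for h in ("H1", "H2", "H3", "H4")])
```

Unlike the fixed chain of left joins, this walk does not depend on a maximum depth, but it does its work in application code rather than in the database.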