Hive: count matches between INNER JOIN on unique values of a column

I am trying to count matches between columns resulting from an INNER JOIN of two tables on unique values of a single column in one of the two tables. An example may make things more clear:
If I had the following two tables:
Table A
-------
id_A: info_A
1 'a'
2 'b'
3 'c'
3 'd'
Table B
-------
id_B: info_B
1 'a'
3 'c'
5 'b'
I want to find the unique id_A: [1,2,3] and the info_A associated with them: ['a','b','c','d'].
I want to create a table that looks like the following:
Table join of A+B
-----------------
id_A: info_A id_B info_B match_cnt
1 'a' 1 'a' 1
3 'c','d' 3 'c' 0.5
where match_cnt is the number of matches between info_A and info_B for a given id_A, divided by the number of info_A values for that id_A (hence 0.5 for id 3). FYI, the actual tables I'm working with have billions of rows.
A code chunk demonstrates what I've tried, plus variations (not shown below):
SELECT z.id_A, z.info_A, z.id_B, z.info_B
FROM(
SELECT u.id_A AS id_A, u.info_A AS info_A, y.id_B AS true_id_B, y.info_B AS true_info_B
FROM db.table_A u
WHERE EXISTS
( SELECT id_B, info_B
FROM table_B l
where l.id_B= u.id_A)
INNER JOIN table_B y
ON u.id_A = y.id_B
) z

You may use something like the following:
WITH T1 AS (SELECT tableA.ID_A, count(1) AS cnt
FROM tableA INNER JOIN tableB ON tableA.ID_A = tableB.ID_B AND tableA.INFO_A = tableB.INFO_B
GROUP BY tableA.ID_A)
SELECT tmp.ID_A, tmp.a, tmp.ID_B, tmp.b, (T1.cnt / size(tmp.a)) AS match_cnt
FROM (SELECT tableA.ID_A, collect_set(tableA.INFO_A) AS a, tableB.ID_B, collect_set(tableB.INFO_B) AS b
FROM tableA INNER JOIN tableB ON tableA.ID_A = tableB.ID_B
GROUP BY tableA.ID_A, tableB.ID_B) tmp
JOIN T1 ON T1.ID_A = tmp.ID_A

select id
,collect_list (case when a=1 then info end) as info_a
,collect_list (case when b=1 then info end) as info_b
,count (case when a=1 and b=1 then 1 end) / count(*) as match_cnt
from (select id
,info
,min (case when tab = 'A' then 1 end) as a
,min (case when tab = 'B' then 1 end) as b
from ( select 'A' as tab ,id_A as id ,info_A as info from A
union all select 'B' as tab ,id_B as id ,info_B as info from B
) t
group by id
,info
) t
group by id
having min(a) = 1
and min(b) = 1
;
+----+-----------+--------+-----------+
| id | info_a | info_b | match_cnt |
+----+-----------+--------+-----------+
| 1 | ["a"] | ["a"] | 1.0 |
| 3 | ["c","d"] | ["c"] | 0.5 |
+----+-----------+--------+-----------+
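The match_cnt logic of the UNION ALL query above can be checked on the sample data without a Hive cluster. This is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for Hive. The info_a/info_b list columns are left out because collect_list is Hive-specific, and the `1.0 *` factor is an addition needed only because SQLite truncates integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id_A INTEGER, info_A TEXT);
    CREATE TABLE B (id_B INTEGER, info_B TEXT);
    INSERT INTO A VALUES (1,'a'),(2,'b'),(3,'c'),(3,'d');
    INSERT INTO B VALUES (1,'a'),(3,'c'),(5,'b');
""")

query = """
SELECT id,
       -- 1.0 * forces real division (SQLite-only adjustment)
       1.0 * COUNT(CASE WHEN a = 1 AND b = 1 THEN 1 END) / COUNT(*) AS match_cnt
FROM (SELECT id, info,
             MIN(CASE WHEN tab = 'A' THEN 1 END) AS a,  -- 1 if (id, info) occurs in A
             MIN(CASE WHEN tab = 'B' THEN 1 END) AS b   -- 1 if (id, info) occurs in B
      FROM (SELECT 'A' AS tab, id_A AS id, info_A AS info FROM A
            UNION ALL
            SELECT 'B' AS tab, id_B AS id, info_B AS info FROM B) t
      GROUP BY id, info) t
GROUP BY id
HAVING MIN(a) = 1 AND MIN(b) = 1   -- keep only ids present in both tables
ORDER BY id
"""
match_counts = conn.execute(query).fetchall()
print(match_counts)
```

On the sample tables this prints `[(1, 1.0), (3, 0.5)]`, matching the match_cnt column shown above.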

Related

How to get 2 values of same num to 2 separate columns

I got table like this:
num |type|value
--------------
1 | a | 5
1 | b | 7
3 | c | 9
2 | a | 6
2 | b | 9
and want this kind of result:
num| value (a) | value (b)
-------------------------
1 | 5 | 7
2 | 6 | 9
You can use a self-join, which will also remove the rows with just one value (num = 3 in your sample data):
select t1.num, t1.value as value_a, t2.value as value_b
from the_table t1
join the_table t2 on t1.num = t2.num and t2.type = 'b'
where t1.type = 'a'
You can use GROUP BY and CASE, as in:
select
num,
max(case when type = 'a' then value end) as value_a,
max(case when type = 'b' then value end) as value_b
from t
group by num
I'd join the table to itself, once for a and once for b:
SELECT a.num, a.value AS value_a, b.value AS value_b
FROM mytable a
JOIN mytable b ON a.num = b.num AND a.type = 'a' AND b.type = 'b'
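The self-join and the GROUP BY/CASE answers above differ in how they treat num = 3, which has neither an 'a' nor a 'b' row. A minimal sketch on the sample data, assuming SQLite via Python's sqlite3 as the engine (the table name the_table follows the first answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE the_table (num INTEGER, type TEXT, value INTEGER);
    INSERT INTO the_table VALUES (1,'a',5),(1,'b',7),(3,'c',9),(2,'a',6),(2,'b',9);
""")

# Self-join: only nums that have both an 'a' row and a 'b' row survive.
self_join = conn.execute("""
    SELECT t1.num, t1.value AS value_a, t2.value AS value_b
    FROM the_table t1
    JOIN the_table t2 ON t1.num = t2.num AND t2.type = 'b'
    WHERE t1.type = 'a'
    ORDER BY t1.num
""").fetchall()

# Conditional aggregation: every num is kept; missing types come back as NULL.
pivot = conn.execute("""
    SELECT num,
           MAX(CASE WHEN type = 'a' THEN value END) AS value_a,
           MAX(CASE WHEN type = 'b' THEN value END) AS value_b
    FROM the_table
    GROUP BY num
    ORDER BY num
""").fetchall()

print(self_join)
print(pivot)
```

The self-join returns `[(1, 5, 7), (2, 6, 9)]`, while the conditional aggregation also returns `(3, None, None)`. To drop that row from the GROUP BY variant, one option is a HAVING clause that requires both aggregated values to be non-NULL.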

Oracle join to either of multiple columns

I have a RELATION table
NUM1 | NUM2 | NUM3
-- --- -----
1 2 3
2 4 5
3 4 null
3 4 null
and the actual INFO table where NUM is primary key.
NUM | A_LOT_OF_OTHER_INFO
--- --------------------
1 asdff
2 werwr
3 erert
4 ghfgh
5 cvbcb
I want to create a view to see the count of the NUM that appeared in any of the NUM1, NUM2, NUM3 of the RELATION table.
MY_VIEW
NUM | A_LOT_OF_OTHER_INFO | TOTAL_COUNT
--- -------------------- ------------
1 asdff 1
2 werwr 2
3 erert 3
4 ghfgh 3
5 cvbcb 1
I can do this by doing three selects from RELATION table and UNION them, but I do not want to use UNION because the tables have a lot of records, MY_VIEW is already large enough and I am looking for a better way to join to the RELATION table in the view. Can you suggest a way?
What I would try is to unpivot the RELATION table.
After that, join the INFO table on the unpivoted values and count how many times each value occurs.
create table relation(num1 int,num2 int, num3 int);
insert into relation values(1,2,3);
insert into relation values(2,4,5);
insert into relation values(3,4,null);
insert into relation values(3,4,null); -- the sample data lists this row twice
create table info(num int, a_lot_of_other_info varchar2(100));
insert into info
select 1,'asdff' from dual union all
select 2,'werwr' from dual union all
select 3,'erert' from dual union all
select 4,'ghfgh' from dual union all
select 5,'cvbcb' from dual;
select a.num
,max(a_lot_of_other_info) as a_lot_of_other_info
,count(*) as num_of_times
from info a
join (select val
from relation a
unpivot(val for x in (num1,num2,num3))
)b
on a.num=b.val
group by a.num
order by 1
I would suggest a correlated subquery:
select i.*,
(select sum((case when r.num1 = i.num then 1 else 0 end) +
(case when r.num2 = i.num then 1 else 0 end) +
(case when r.num3 = i.num then 1 else 0 end)
)
from relation r
where i.num in (r.num1, r.num2, r.num3)
) as total_count
from info i;
If performance is a consideration, and each NUM occurs at most once in each of the three columns, it might be faster to use left joins:
select i.*,
((case when r1.num1 is not null then 1 else 0 end) +
(case when r2.num2 is not null then 1 else 0 end) +
(case when r3.num3 is not null then 1 else 0 end)
) as total_count
from info i left join
relation r1
on i.num = r1.num1 left join
relation r2
on i.num = r2.num2 left join
relation r3
on i.num = r3.num3;
In particular, this will make optimal use of three separate indexes on relation: relation(num1), relation(num2), and relation(num3).
It seems what you want is UNPIVOT. Perhaps easiest to do with a cross join in this case:
select NUM, count(*) as TOTAL_COUNT
from (
select decode(column_value, 1, NUM1, 2, NUM2, 3, NUM3) as NUM
from RELATION cross join table(sys.odcinumberlist(1,2,3))
)
group by NUM
;
Then join this to the second table; the join part is really irrelevant here.
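The unpivot-then-count idea above can be tried on the question's sample data. This is a minimal sketch, assuming SQLite via Python's sqlite3 instead of Oracle: a CASE expression stands in for decode, and a three-row constant subquery (alias k, a name chosen here) stands in for sys.odcinumberlist(1,2,3). The UNION ALL in it only builds that three-row constant; RELATION itself is scanned once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relation (num1 INTEGER, num2 INTEGER, num3 INTEGER);
    CREATE TABLE info (num INTEGER PRIMARY KEY, a_lot_of_other_info TEXT);
    INSERT INTO relation VALUES (1,2,3),(2,4,5),(3,4,NULL),(3,4,NULL);
    INSERT INTO info VALUES (1,'asdff'),(2,'werwr'),(3,'erert'),(4,'ghfgh'),(5,'cvbcb');
""")

total_counts = conn.execute("""
    SELECT i.num, i.a_lot_of_other_info, COUNT(*) AS total_count
    FROM info i
    JOIN (
        -- one output row per (relation row, column position)
        SELECT CASE k.pos WHEN 1 THEN r.num1 WHEN 2 THEN r.num2 ELSE r.num3 END AS num
        FROM relation r
        CROSS JOIN (SELECT 1 AS pos UNION ALL SELECT 2 UNION ALL SELECT 3) k
    ) u ON u.num = i.num   -- NULL slots never match, so they are not counted
    GROUP BY i.num, i.a_lot_of_other_info
    ORDER BY i.num
""").fetchall()

for row in total_counts:
    print(row)
```

With the four RELATION rows from the question, the counts come out as 1, 2, 3, 3, 1 for NUM 1 through 5, which matches MY_VIEW.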

Select count of rows in two other tables

I have 3 tables. The main one in which I want to retrieve some information and two others for row count only.
I used a query like this:
SELECT A.*,
COUNT(B.id) AS b_count
FROM A
LEFT JOIN B on B.a_id = A.id
WHERE A.id > 50 AND B.ID < 100
GROUP BY A.id
from Gerry Shaw's comment here. It works perfectly but only for one table.
Now I need to add the row count for the third (C) table. I tried
SELECT A.*,
COUNT(B.id) AS b_count,
COUNT(C.id) AS c_count
FROM A
LEFT JOIN B on B.a_id = A.id
LEFT JOIN C on C.a_id = A.id
GROUP BY A.id
but, because of the two left joins, my b_count and my c_count are wrong and equal to each other. In fact, both come out equal to real_b_count*real_c_count. Any idea how I could fix this without adding a lot of complexity/subqueries?
Data sample as requested:
Table A (primary key : id)
id | data1 | data2
------+-------+-------
1 | 0,45 | 0,79
----------------------
2 | -2,24 | -0,25
----------------------
3 | 1,69 | 1,23
Table B (primary key : (a_id,fruit))
a_id | fruit
------+-------
1 | apple
------+-------
1 | banana
--------------
2 | apple
Table C (primary key : (a_id,color))
a_id | color
------+-------
2 | blue
------+-------
2 | purple
--------------
3 | blue
expected result:
id | data1 | data2 | b_count | c_count
------+-------+-------+---------+--------
1 | 0,45 | 0,79 | 2 | 0
----------------------+---------+--------
2 | -2,24 | -0,25 | 1 | 2
----------------------+---------+--------
3 | 1,69 | 1,23 | 0 | 1
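The row multiplication behind the wrong counts can be reproduced on this sample data. A minimal sketch, assuming SQLite via Python's sqlite3; since the sample tables B and C have no id column, the sketch counts fruit and color instead (an assumption made here, not part of the original query).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, data1 REAL, data2 REAL);
    CREATE TABLE B (a_id INTEGER, fruit TEXT, PRIMARY KEY (a_id, fruit));
    CREATE TABLE C (a_id INTEGER, color TEXT, PRIMARY KEY (a_id, color));
    INSERT INTO A VALUES (1, 0.45, 0.79), (2, -2.24, -0.25), (3, 1.69, 1.23);
    INSERT INTO B VALUES (1,'apple'),(1,'banana'),(2,'apple');
    INSERT INTO C VALUES (2,'blue'),(2,'purple'),(3,'blue');
""")

# Rows produced for id = 2 before grouping: 1 fruit x 2 colors = 2 rows.
rows_for_2 = conn.execute("""
    SELECT A.id, B.fruit, C.color
    FROM A
    LEFT JOIN B ON B.a_id = A.id
    LEFT JOIN C ON C.a_id = A.id
    WHERE A.id = 2
    ORDER BY C.color
""").fetchall()

# Counting over those joined rows counts each fruit once per color.
naive_counts = conn.execute("""
    SELECT A.id, COUNT(B.fruit) AS b_count, COUNT(C.color) AS c_count
    FROM A
    LEFT JOIN B ON B.a_id = A.id
    LEFT JOIN C ON C.a_id = A.id
    GROUP BY A.id
    ORDER BY A.id
""").fetchall()

print(rows_for_2)
print(naive_counts)
```

For id 2 the join yields `(2, 'apple', 'blue')` and `(2, 'apple', 'purple')`, so b_count is reported as 2 instead of 1; the naive result is `[(1, 2, 0), (2, 2, 2), (3, 0, 1)]`.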
There are two possible solutions. One uses scalar subqueries in the SELECT list:
SELECT A.*,
(
SELECT COUNT(B.id) FROM B WHERE B.a_id = A.id AND B.ID < 100
) AS b_count,
(
SELECT COUNT(C.id) FROM C WHERE C.a_id = A.id
) AS c_count
FROM A
WHERE A.id > 50
The second joins two separately aggregated queries together:
SELECT t1.*, t2.c_count
FROM
(
SELECT A.*,
COUNT(B.id) AS b_count
FROM A
LEFT JOIN B on B.a_id = A.id
WHERE A.id > 50 AND B.ID < 100
GROUP BY A.id
) t1
JOIN
(
SELECT A.*,
COUNT(C.id) AS c_count
FROM A
LEFT JOIN C on C.a_id = A.id
WHERE A.id > 50
GROUP BY A.id
) t2 ON t1.id = t2.id
I prefer the second syntax since it clearly shows the optimizer that you are interested in GROUP BY; however, the query plans are usually the same.
If tables B & C also have their own key fields, then you can use COUNT DISTINCT on the primary key rather than the foreign key. That gets around the row-multiplication problem you see when joining to several tables. If you can post the table structures then we can advise further.
Try something like this
SELECT A.*,
(SELECT COUNT(B.id) FROM B WHERE B.a_id = A.id) AS b_count,
(SELECT COUNT(C.id) FROM C WHERE C.a_id = A.id) AS c_count
FROM A
This is the easiest way I can think of:
Create table #a (id int, data1 float, data2 float)
Create table #b (id int, fruit varchar(50))
Create table #c (id int, color varchar(50))
Insert into #a
SELECT 1, 0.45, 0.79
UNION ALL SELECT 2, -2.24, -0.25
UNION ALL SELECT 3, 1.69, 1.23
Insert into #b
SELECT 1, 'apple'
UNION ALL SELECT 1, 'banana'
UNION ALL SELECT 2, 'orange'
Insert into #c
SELECT 2, 'blue'
UNION ALL SELECT 2, 'purple'
UNION ALL SELECT 3, 'orange'
SELECT #a.*,
(SELECT COUNT(#b.id) FROM #b where #b.id = #a.id) AS b_count,
(SELECT COUNT(#c.id) FROM #c where #c.id = #a.id) AS c_count
FROM #a
ORDER BY #a.id
Result:
id data1 data2 b_count c_count
1 0,45 0,79 2 0
2 -2,24 -0,25 1 2
3 1,69 1,23 0 1
If the rows of tables B and C are uniquely identified within each a_id (here by fruit and color), you can try this:
SELECT A.*,
COUNT(distinct B.fruit) AS b_count,
COUNT(distinct C.color) AS c_count
FROM A
LEFT JOIN B on B.a_id = A.id
LEFT JOIN C on C.a_id = A.id
GROUP BY A.id
See SQLFiddle MySQL demo.
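The scalar-subquery and COUNT(DISTINCT ...) approaches above can be compared side by side on the question's sample data. A minimal sketch, assuming SQLite via Python's sqlite3 rather than MySQL or SQL Server, and counting rows of B and C by a_id because the sample tables have no separate id column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, data1 REAL, data2 REAL);
    CREATE TABLE B (a_id INTEGER, fruit TEXT, PRIMARY KEY (a_id, fruit));
    CREATE TABLE C (a_id INTEGER, color TEXT, PRIMARY KEY (a_id, color));
    INSERT INTO A VALUES (1, 0.45, 0.79), (2, -2.24, -0.25), (3, 1.69, 1.23);
    INSERT INTO B VALUES (1,'apple'),(1,'banana'),(2,'apple');
    INSERT INTO C VALUES (2,'blue'),(2,'purple'),(3,'blue');
""")

# Each count is computed independently, so nothing can multiply.
subquery_counts = conn.execute("""
    SELECT A.id,
           (SELECT COUNT(*) FROM B WHERE B.a_id = A.id) AS b_count,
           (SELECT COUNT(*) FROM C WHERE C.a_id = A.id) AS c_count
    FROM A
    ORDER BY A.id
""").fetchall()

# The joins still multiply rows, but DISTINCT collapses the repeats.
distinct_counts = conn.execute("""
    SELECT A.id,
           COUNT(DISTINCT B.fruit) AS b_count,
           COUNT(DISTINCT C.color) AS c_count
    FROM A
    LEFT JOIN B ON B.a_id = A.id
    LEFT JOIN C ON C.a_id = A.id
    GROUP BY A.id
    ORDER BY A.id
""").fetchall()

print(subquery_counts)
print(distinct_counts)
```

Both queries return `[(1, 2, 0), (2, 1, 2), (3, 0, 1)]`, the expected result. The DISTINCT variant relies on fruit and color being unique per a_id, which the stated primary keys guarantee here.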

Merge multiple columns into one column with multiple rows

In PostgreSQL, how can I merge multiple columns into one column with multiple rows?
The columns are all boolean, so I want to:
Filter for true values only
Replace the true value (1) with the name of the column (A, B or C)
I have this table:
ID | A | B | C
1 0 1 0
2 1 1 0
3 0 0 1
4 1 0 1
5 1 0 0
6 0 1 1
I want to get this table:
ID | Letter
1 B
2 A
2 B
3 C
4 A
4 C
5 A
6 B
6 C
I think you need something like this:
SELECT ID, 'A' as Letter FROM table WHERE A=1
UNION ALL
SELECT ID, 'B' as Letter FROM table WHERE B=1
UNION ALL
SELECT ID, 'C' as Letter FROM table WHERE C=1
ORDER BY ID, Letter
SELECT ID,
(CASE
WHEN TABLE.A = 1 then 'A'
WHEN TABLE.B = 1 then 'B'
WHEN TABLE.C = 1 then 'C'
ELSE NULL END) AS LETTER
from TABLE
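The UNION ALL query and the single CASE expression above do not produce the same rows. A minimal sketch on the sample data, assuming SQLite via Python's sqlite3 instead of PostgreSQL, integer 0/1 flags as in the question, and a hypothetical table name `flags` (the answers use the placeholder `table`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flags (id INTEGER, a INTEGER, b INTEGER, c INTEGER);
    INSERT INTO flags VALUES
        (1,0,1,0),(2,1,1,0),(3,0,0,1),(4,1,0,1),(5,1,0,0),(6,0,1,1);
""")

# One branch per column: an id appears once for every flag that is set.
union_rows = conn.execute("""
    SELECT id, 'A' AS letter FROM flags WHERE a = 1
    UNION ALL
    SELECT id, 'B' AS letter FROM flags WHERE b = 1
    UNION ALL
    SELECT id, 'C' AS letter FROM flags WHERE c = 1
    ORDER BY id, letter
""").fetchall()

# A CASE expression stops at the first matching branch: one letter per id.
case_rows = conn.execute("""
    SELECT id,
           CASE WHEN a = 1 THEN 'A' WHEN b = 1 THEN 'B' WHEN c = 1 THEN 'C' END AS letter
    FROM flags
    ORDER BY id
""").fetchall()

print(union_rows)
print(case_rows)
```

The UNION ALL version returns the nine rows of the desired table, while the CASE version returns six rows and loses, for example, (2, 'B') and (4, 'C').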
You may try this.
insert into t2 select id, 'A' from t1 where A=1;
insert into t2 select id, 'B' from t1 where B=1;
insert into t2 select id, 'C' from t1 where C=1;
If you care about the order, then you can do this.
insert into t3 select id, letter from t2 order by id, letter;
Without UNION
You can use a single query to get the desired output. Real time example
select id
,regexp_split_to_table(
concat_ws(','
,case when a = 0 then null else 'A' end
,case when b = 0 then null else 'B' end
,case when c = 0 then null else 'C' end)
,',') as letter
from c1;
regexp_split_to_table() & concat_ws()

Exclude value of a record in a group if another is present

In the example table below, I'm trying to figure out a way to sum amount over id for all marks where mark 'C' doesn't exist within an id. When mark 'C' does exist in an id, I want the sum of amounts over that id, excluding the amount against mark 'A'. As illustration, my desired output is at the bottom. I've considered using partitions and the EXISTS command, but I'm having trouble conceptualizing the solution. If any of you could take a look and point me in the right direction, it would be greatly appreciated :)
sample table:
id mark amount
------------------
1 A 1
2 A 3
2 B 2
3 A 2
4 A 1
4 B 3
5 A 1
5 C 3
6 A 2
6 C 2
desired output:
id sum(amount)
-----------------
1 1
2 5
3 2
4 4
5 3
6 2
select
id,
case
when count(case mark when 'C' then 1 else null end) = 0
then
sum(amount)
else
sum(case when mark <> 'A' then amount else 0 end)
end
from sampletable
group by id
Here is my effort:
select id, sum(amount) from atable t where t.mark <> 'A' group by id
having id in (select id from atable where mark = 'C')
union
select id, sum(amount) from atable t group by id
having id not in (select id from atable where mark = 'C')
SELECT
id,
sum(amount) AS sum_amount
FROM atable t
WHERE mark <> 'A'
OR NOT EXISTS (
SELECT *
FROM atable
WHERE id = t.id
AND mark = 'C'
)
GROUP BY
id
;
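The conditional-aggregation answer and the NOT EXISTS answer above can both be checked against the desired output. A minimal sketch, assuming SQLite via Python's sqlite3 and the table name atable from the last answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atable (id INTEGER, mark TEXT, amount INTEGER);
    INSERT INTO atable VALUES
        (1,'A',1),(2,'A',3),(2,'B',2),(3,'A',2),(4,'A',1),
        (4,'B',3),(5,'A',1),(5,'C',3),(6,'A',2),(6,'C',2);
""")

# Decide per group: if the id has no 'C' row, sum everything; else skip 'A'.
case_sums = conn.execute("""
    SELECT id,
           CASE WHEN COUNT(CASE mark WHEN 'C' THEN 1 END) = 0
                THEN SUM(amount)
                ELSE SUM(CASE WHEN mark <> 'A' THEN amount ELSE 0 END)
           END AS sum_amount
    FROM atable
    GROUP BY id
    ORDER BY id
""").fetchall()

# Decide per row: drop an 'A' row only when its id also has a 'C' row.
exists_sums = conn.execute("""
    SELECT id, SUM(amount) AS sum_amount
    FROM atable t
    WHERE mark <> 'A'
       OR NOT EXISTS (SELECT 1 FROM atable WHERE id = t.id AND mark = 'C')
    GROUP BY id
    ORDER BY id
""").fetchall()

print(case_sums)
print(exists_sums)
```

Both queries return `[(1, 1), (2, 5), (3, 2), (4, 4), (5, 3), (6, 2)]`, the desired output. They would differ only for an id whose rows are all filtered out by the WHERE clause, which cannot happen here because an 'A' row is removed only when a 'C' row for the same id remains.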