Calculating overlap between groups - sql

I have a table with two columns of interest, item_id and bucket_id. There are a fixed number of values for bucket_id and I'm okay with listing them out if I need to.
Each item_id can appear multiple times, but each occurrence will have a separate bucket_id value. For example, the item_id of 123 can appear twice in the table, once under bucket_id of A, once under B.
My goal is to determine how much overlap exists between each pair of bucket_id values and display it as an N-by-N matrix.
For example, consider the following small example table:
item_id bucket_id
========= ===========
111 A
111 B
111 C
222 B
222 D
333 A
333 C
444 C
So for this dataset, buckets A and B have one item_id in common, buckets C and D have no items in common, etc.
I would like to get the above table formatted into something like the following:
     A    B    C    D
===================================
A    2    1    2    0
B    1    2    1    1
C    2    1    3    0
D    0    1    0    1
In the above table, the intersect of a row and column tells you how many records exist in both bucket_id values. For example, where the A row intersects the C column we have a 2, because there are 2 records that exist in both bucket_id A and C. Because the intersection of X and Y is the same as the intersection of Y and X, the above table is mirrored across the diagonal.
I imagine the query involves a PIVOT, but I can't for the life of me figure out how to get it working.

You can use a simple pivot built from a self-join and conditional aggregation:
SELECT t1.bucket_id,
SUM( CASE WHEN t2.bucket_id = 'A' THEN 1 ELSE 0 END ) AS A,
SUM( CASE WHEN t2.bucket_id = 'B' THEN 1 ELSE 0 END ) AS B,
SUM( CASE WHEN t2.bucket_id = 'C' THEN 1 ELSE 0 END ) AS C,
SUM( CASE WHEN t2.bucket_id = 'D' THEN 1 ELSE 0 END ) AS D
FROM table1 t1
JOIN table1 t2 ON t1.item_id = t2.item_id
GROUP BY t1.bucket_id
ORDER BY 1
;
or you can use the Oracle PIVOT clause (works on 11.2 and above):
SELECT * FROM (
SELECT t1.bucket_id AS Y_bid,
t2.bucket_id AS x_bid
FROM table1 t1
JOIN table1 t2 ON t1.item_id = t2.item_id
)
PIVOT (
count(*) FOR x_bid in ('A','B','C','D')
)
ORDER BY 1
;
Examples: http://sqlfiddle.com/#!4/39d30/7
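To check the self-join idea against the sample data from the question, here is a minimal sketch that runs the conditional-aggregation query through Python's built-in sqlite3 module. SQLite is only a stand-in for Oracle here, and the in-memory database and the helper name overlap_matrix are assumptions made for illustration.

```python
import sqlite3

# Sample rows from the question: (item_id, bucket_id).
ROWS = [
    (111, "A"), (111, "B"), (111, "C"),
    (222, "B"), (222, "D"),
    (333, "A"), (333, "C"),
    (444, "C"),
]

# Self-join on item_id, then count matches per bucket_id with CASE expressions.
QUERY = """
SELECT t1.bucket_id,
       SUM(CASE WHEN t2.bucket_id = 'A' THEN 1 ELSE 0 END) AS A,
       SUM(CASE WHEN t2.bucket_id = 'B' THEN 1 ELSE 0 END) AS B,
       SUM(CASE WHEN t2.bucket_id = 'C' THEN 1 ELSE 0 END) AS C,
       SUM(CASE WHEN t2.bucket_id = 'D' THEN 1 ELSE 0 END) AS D
FROM table1 t1
JOIN table1 t2 ON t1.item_id = t2.item_id
GROUP BY t1.bucket_id
ORDER BY t1.bucket_id
"""


def overlap_matrix(rows):
    """Load the rows into an in-memory table and return the pivoted counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (item_id INTEGER, bucket_id TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


matrix = overlap_matrix(ROWS)
for row in matrix:
    print(*row)
```

The diagonal holds each bucket's own item count because every row also joins to itself.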

I believe this should get you the data you need. Pivoting the table could then be done programmatically (or in Excel, etc.).
-- This gets the distinct pairs of buckets
select distinct
a.name,
b.name
from
bucket a
join bucket b
where
a.name < b.name
order by
a.name,
b.name
+ --------- + --------- +
| name | name |
+ --------- + --------- +
| A | B |
| A | C |
| A | D |
| B | C |
| B | D |
| C | D |
+ --------- + --------- +
6 rows
-- This gets the distinct pairs of buckets with the counts you are looking for
select distinct
a.name,
b.name,
count(distinct bi.item_id)
from
bucket a
join bucket b
left outer join bucket_item ai on ai.bucket_name = a.name
left outer join bucket_item bi on bi.bucket_name = b.name and ai.item_id = bi.item_id
where
a.name < b.name
group by
a.name,
b.name
order by
a.name,
b.name
+ --------- + --------- + ------------------------------- +
| name | name | count(distinct bi.item_id) |
+ --------- + --------- + ------------------------------- +
| A | B | 2 |
| A | C | 1 |
| A | D | 0 |
| B | C | 2 |
| B | D | 0 |
| C | D | 0 |
+ --------- + --------- + ------------------------------- +
6 rows
Here's the entire example with the DDL and inserts to set it up (this is in mysql but the same ideas apply elsewhere):
use example;
drop table if exists bucket;
drop table if exists item;
drop table if exists bucket_item;
create table bucket (
name varchar(1)
);
create table item(
id int
);
create table bucket_item(
bucket_name varchar(1) references bucket(name),
item_id int references item(id)
);
insert into bucket values ('A');
insert into bucket values ('B');
insert into bucket values ('C');
insert into bucket values ('D');
insert into item values (111);
insert into item values (222);
insert into item values (333);
insert into item values (444);
insert into item values (555);
insert into bucket_item values ('A',111);
insert into bucket_item values ('A',222);
insert into bucket_item values ('A',333);
insert into bucket_item values ('B',222);
insert into bucket_item values ('B',333);
insert into bucket_item values ('B',444);
insert into bucket_item values ('C',333);
insert into bucket_item values ('C',444);
insert into bucket_item values ('D',555);
-- query to get distinct pairs of buckets
select distinct
a.name,
b.name
from
bucket a
join bucket b
where
a.name < b.name
order by
a.name,
b.name
;
select distinct
a.name,
b.name,
count(distinct bi.item_id)
from
bucket a
join bucket b
left outer join bucket_item ai on ai.bucket_name = a.name
left outer join bucket_item bi on bi.bucket_name = b.name and ai.item_id = bi.item_id
where
a.name < b.name
group by
a.name,
b.name
order by
a.name,
b.name
;
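Since this answer leaves the pivoting to application code, here is one way it might look in Python. It is only a sketch: the function name pivot_pairs and the fill parameter are made up for illustration, and the pair counts are copied from the query output shown above. The query only returns pairs with a.name < b.name, so the diagonal is left at the fill value.

```python
def pivot_pairs(buckets, pair_counts, fill=None):
    """Build a symmetric N-by-N matrix from (bucket_a, bucket_b, count) rows."""
    index = {name: i for i, name in enumerate(buckets)}
    size = len(buckets)
    matrix = [[fill] * size for _ in range(size)]
    for first, second, count in pair_counts:
        i, j = index[first], index[second]
        # Overlap is symmetric, so mirror each count across the diagonal.
        matrix[i][j] = count
        matrix[j][i] = count
    return matrix


BUCKETS = ["A", "B", "C", "D"]
# Rows as returned by the pair-count query for the inserts in this answer.
PAIR_COUNTS = [
    ("A", "B", 2), ("A", "C", 1), ("A", "D", 0),
    ("B", "C", 2), ("B", "D", 0), ("C", "D", 0),
]

matrix = pivot_pairs(BUCKETS, PAIR_COUNTS)
for name, row in zip(BUCKETS, matrix):
    print(name, *("-" if cell is None else cell for cell in row))
```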

Related

How can I get full results in SQL query using 3 tables, where 1 of them keeps relation of 2 another?

I need help writing a query to display results I want.
"Table 3 - relations" keeps all relations between table 1 and 2. Often, a relation between table 1 and 2 will not exist in table 3, so I want to see the missing relation in the results for all Table 1 rows (see expected results below).
I can't modify these tables - I have only SELECT privilege.
Data and expected result below:
Table 1 - a:
a_id, a_name
e.g.:
1 A
2 B
Table 2 - b:
b_id, b_name
e.g.:
1 X
2 Y
Table 3 - relation:
asset1_id (it's always id from Table 1), asset2_id (it's always id from Table 2), relation_type
e.g.:
1 1 covers
1 2 covers
Expected result:
Table1_name, Table2_name, Table3_relation_type (including NULL for b_name and relation_type when such relation does not exist in Table 3 - relation)
e.g.
A X covers
A Y covers
B NULL NULL
I can't get the 3rd expected line with NULLs.
I think that this query will produce those results.
select a.name as a_name,b.name as b_name, r.relation_type from relation r
join a on a.id=r.asset1_id
join b on b.id=r.asset2_id
union
select a.name as a_name,b.name as b_name,r.relation_type from relation r
full outer join a on a.id=r.asset1_id
full outer join b on b.id=r.asset2_id
where a.id is null or b.id is null
With your data sample you could try this one.
It should work in both Hive and Impala.
SELECT t1.name ,t2.name ,r.relation_type
FROM relation r
FULL OUTER JOIN table1 t1 ON(t1.id = r.id1)
FULL OUTER JOIN table2 t2 ON(t2.id = r.id2);
+------+------+---------------+
| name | name | relation_type |
+------+------+---------------+
| A | X | covers |
| A | Y | covers |
| B | NULL | NULL |
+------+------+---------------+
WITH
cte_A AS (
SELECT id as a_id, name as a_name
FROM a
),
cte_C AS (
SELECT c.asset_id1 as a_id, b.name, c.relation
FROM c
LEFT JOIN b ON c.asset_id2=b.id
)
SELECT cte_A.a_name, cte_C.name as c_name, cte_C.relation
FROM cte_A
LEFT JOIN cte_C ON cte_A.a_id=cte_C.a_id
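For comparison, the expected result can also be reached by starting from table a and chaining two LEFT JOINs, which avoids FULL OUTER JOIN entirely. This is a variation on the answers above, not any one answerer's exact query. The sketch below uses Python's sqlite3 module as a stand-in database, with the column names from the question (a_id, a_name, b_id, b_name, asset1_id, asset2_id, relation_type).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (a_id INTEGER, a_name TEXT);
CREATE TABLE b (b_id INTEGER, b_name TEXT);
CREATE TABLE relation (asset1_id INTEGER, asset2_id INTEGER, relation_type TEXT);
INSERT INTO a VALUES (1, 'A'), (2, 'B');
INSERT INTO b VALUES (1, 'X'), (2, 'Y');
INSERT INTO relation VALUES (1, 1, 'covers'), (1, 2, 'covers');
""")

# Every row of a is kept; relation and b columns are NULL when nothing matches.
rows = conn.execute("""
SELECT a.a_name, b.b_name, r.relation_type
FROM a
LEFT JOIN relation r ON r.asset1_id = a.a_id
LEFT JOIN b ON b.b_id = r.asset2_id
ORDER BY a.a_name, b.b_name
""").fetchall()

for row in rows:
    print(row)
```

Only SELECT privilege is needed for the query itself; the CREATE and INSERT statements are there just to make the sketch self-contained.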

Get count of foreign key from multiple tables

I have 3 tables, with Table B & C referencing Table A via Foreign Key. I want to write a query in PostgreSQL to get all ids from A and also their total occurrences from B & C.
a | b | c
-----------------------------------
id | txt | id | a_id | id | a_id
---+---- | ---+----- | ---+------
1 | a | 1 | 1 | 1 | 3
2 | b | 2 | 1 | 2 | 4
3 | c | 3 | 3 | 3 | 4
4 | d | 4 | 4 | 4 | 4
Output desired (just the id from A & total count in B & C) :
id | Count
---+-------
1 | 2 -- twice in B
2 | 0 -- occurs nowhere
3 | 2 -- once in B & once in C
4 | 4 -- once in B & thrice in C
SQL so far (SQL Fiddle):
SELECT a_id, COUNT(a_id)
FROM
( SELECT a_id FROM b
UNION ALL
SELECT a_id FROM c
) AS union_table
GROUP BY a_id
The query I wrote fetches from B & C and counts the occurrences. But if the key doesn't occur in B or C, it doesn't show up in the output (e.g. id=2 is missing). How can I start my selection from table A and join/union B & C to get the desired output?
If the query involves large parts of b and / or c it is more efficient to aggregate first and join later.
I expect these two variants to be considerably faster:
SELECT a.id
, COALESCE(b.ct, 0) + COALESCE(c.ct, 0) AS bc_ct
FROM a
LEFT JOIN (SELECT a_id, count(*) AS ct FROM b GROUP BY 1) b ON b.a_id = a.id
LEFT JOIN (SELECT a_id, count(*) AS ct FROM c GROUP BY 1) c ON c.a_id = a.id;
You need to account for the possibility that some a.id are not present at all in b and / or c. count() never returns NULL, but that's cold comfort in the face of LEFT JOIN, which leaves you with NULL values for missing rows nonetheless. You must prepare for NULL. Use COALESCE().
Or UNION ALL a_id from both tables, aggregate, then JOIN:
SELECT a.id
, COALESCE(ct.bc_ct, 0) AS bc_ct
FROM a
LEFT JOIN (
SELECT a_id, count(*) AS bc_ct
FROM (
SELECT a_id FROM b
UNION ALL
SELECT a_id FROM c
) bc
GROUP BY 1
) ct ON ct.a_id = a.id;
Probably slower. But still faster than solutions presented so far. And you could do without COALESCE() and still not lose any rows. You might get occasional NULL values for bc_ct, in this case.
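To see the effect of COALESCE() on the sample data, here is a small sketch using Python's sqlite3 module in place of PostgreSQL. The subquery aliases bx and cx are chosen here for illustration; the tables and values come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, txt TEXT);
CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id));
CREATE TABLE c (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id));
INSERT INTO a VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
INSERT INTO b VALUES (1, 1), (2, 1), (3, 3), (4, 4);
INSERT INTO c VALUES (1, 3), (2, 4), (3, 4), (4, 4);
""")

# Aggregate each child table first, then LEFT JOIN the small results to a.
with_coalesce = conn.execute("""
SELECT a.id, COALESCE(bx.ct, 0) + COALESCE(cx.ct, 0) AS bc_ct
FROM a
LEFT JOIN (SELECT a_id, count(*) AS ct FROM b GROUP BY a_id) bx ON bx.a_id = a.id
LEFT JOIN (SELECT a_id, count(*) AS ct FROM c GROUP BY a_id) cx ON cx.a_id = a.id
ORDER BY a.id
""").fetchall()

# Same joins without COALESCE: a missing side turns the whole sum into NULL.
without_coalesce = conn.execute("""
SELECT a.id, bx.ct + cx.ct AS bc_ct
FROM a
LEFT JOIN (SELECT a_id, count(*) AS ct FROM b GROUP BY a_id) bx ON bx.a_id = a.id
LEFT JOIN (SELECT a_id, count(*) AS ct FROM c GROUP BY a_id) cx ON cx.a_id = a.id
ORDER BY a.id
""").fetchall()

print(with_coalesce)
print(without_coalesce)
```

In the second result, id 1 comes back as NULL even though it occurs twice in b, because NULL plus anything is NULL.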
Another option:
SELECT
a.id,
(SELECT COUNT(*) FROM b WHERE b.a_id = a.id) +
(SELECT COUNT(*) FROM c WHERE c.a_id = a.id)
FROM
a
Use left join with a subquery:
SELECT a.id, COUNT(x.id)
FROM a
LEFT JOIN (
SELECT id, a_id FROM b
UNION ALL
SELECT id, a_id FROM c
) x ON (a.id = x.a_id)
GROUP BY a.id;

TSQL - retrieve results from table A that contains exact data contained in table B

I have TableA (id bigint, name varchar) and TableB (name varchar) that contains the following data:
Table A:        Table B:     Results:
-------------   ---------    -------------
| 1 | "A" |     | "A" |      | 1 | "A" |
| 1 | "B" |     | "B" |      | 1 | "B" |
| 2 | "A" |     ---------    | 4 | "A" |
| 3 | "B" |                  | 4 | "B" |
| 4 | "A" |                  -------------
| 4 | "B" |
-------------
I want to return results from TableA that contains an EXACT match of what's in table B.
Using the 'IN' clause only checks for an occurrence of each name, not an exact match of the whole set.
Also, another example, if TableB has only "A", I want it to return back: 2-"A"
I understand your question, but it is a tricky one because it is not exactly in line with relational logic. You are looking for ids for which SELECT name FROM TableA WHERE id IN ... ORDER BY name; is identical to SELECT name FROM B ORDER BY name;.
Can you assume that A(id,name) is unique and B(name) is unique? Better said, are there constraints like that or can you set them up?
If yes, here is a solution:
1. Get rid of ids in A with rows not matching the rows in B
SELECT id, A.name FROM A WHERE id NOT IN
(SELECT id FROM A LEFT JOIN B ON A.name = B.name WHERE B.name IS NULL);
2. Count rows per each id (this is why the unique constraints are necessary)
SELECT id, COUNT(*) FROM
(
SELECT id, A.name FROM A WHERE id NOT IN
(SELECT id FROM A LEFT JOIN B ON A.name = B.name WHERE B.name IS NULL)
) t
GROUP BY id;
3. Only retain those that match the number of rows of B.
SELECT id, COUNT(*) FROM
(
SELECT id, A.name FROM A WHERE id NOT IN
(SELECT id FROM A LEFT JOIN B ON A.name = B.name WHERE B.name IS NULL)
) t
GROUP BY id
HAVING COUNT(*) = (SELECT COUNT(*) FROM B);
This works in SQL Server
select * from TableA a
where
(select count(*) from TableB) = (select count(*) from TableA where id = a.id) and
(select count(*) from TableB) =
(
select count(*) from
(
select name from TableA where id = a.id
intersect
select name from TableB
) as b
)
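Both answers boil down to comparing counts: an id qualifies when it has as many rows as TableB and every one of them matches a TableB name. Here is a sketch of that idea as a single GROUP BY / HAVING, run through Python's sqlite3 module rather than SQL Server. It is a variation, not either answer's exact query, and it assumes (id, name) is unique in TableA and name is unique in TableB. The helper name exact_matches is made up for the example.

```python
import sqlite3

TABLE_A = [(1, "A"), (1, "B"), (2, "A"), (3, "B"), (4, "A"), (4, "B")]


def exact_matches(table_a_rows, table_b_names):
    """Return TableA rows whose id carries exactly the set of names in TableB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TableA (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE TableB (name TEXT)")
    conn.executemany("INSERT INTO TableA VALUES (?, ?)", table_a_rows)
    conn.executemany("INSERT INTO TableB VALUES (?)", [(n,) for n in table_b_names])
    rows = conn.execute("""
        SELECT a.id, a.name
        FROM TableA a
        WHERE a.id IN (
            SELECT x.id
            FROM TableA x
            LEFT JOIN TableB y ON y.name = x.name
            GROUP BY x.id
            -- COUNT(*) counts all rows of the id, COUNT(y.name) only matched ones.
            HAVING COUNT(*) = (SELECT COUNT(*) FROM TableB)
               AND COUNT(y.name) = (SELECT COUNT(*) FROM TableB)
        )
        ORDER BY a.id, a.name
    """).fetchall()
    conn.close()
    return rows


both = exact_matches(TABLE_A, ["A", "B"])
only_a = exact_matches(TABLE_A, ["A"])
print(both)
print(only_a)
```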

SQL Join to Get Row with Maximum Value from Right table

I am having problem with sql join (oracle/ms sql)
I have two tables
A
ID | B_ID
---|------
1 | 1
1 | 4
2 | 3
2 | 2
----------
B
B_ID | B_VA| B_VB
-------|--------|-------
1 | 1 | a
2 | 2 | b
3 | 5 | c
4 | 2 | d
-----------------------
From these two tables I need A.ID, B.B_ID, B.B_VA (MAX), B.B_VB (with max B.B_VA)
So result table would be like
ID | B_ID | B_VA| B_VB
-------|--------|--------|-------
1 | 4 | 2 | d
2 | 3 | 5 | c
I tried some joins without success. Can anyone help me with a query to get the result I want?
Thank you
Your logic as described doesn't quite correspond to the data. For instance, b_va is numeric, but the column in the output is a string.
Perhaps you want this: aggregate the data in a to get the maximum b_id value, then join to b to get the corresponding b_vb column. That, at least, conforms to your desired output:
select a.id, a.b_id, b1.b_vb as b_va, b2.b_vb
from (select id, max(b_id) as b_id
from a
group by id
) a join
b b1
on a.id = b1.b_id join
b b2
on a.b_id = b2.b_id;
EDIT:
For the corrected data, I think this is what you want:
select a.id, a.b_id, max(b1.b_va) as b_va, b2.b_vb
from (select id, max(b_id) as b_id
from a
group by id
) a join
b b1
on a.id = b1.b_id join
b b2
on a.b_id = b2.b_id
group by a.id, a.b_id, b2.b_vb;
Try this
SELECT X.ID, Y.B_ID, X.B_VA, Y.B_VB
FROM (SELECT A.ID, MAX(B_VA) AS B_VA
FROM A INNER JOIN B ON A.B_ID = B.B_ID
GROUP BY A.ID) X INNER JOIN
A Z ON X.ID = Z.ID INNER JOIN
B Y ON Z.B_ID=Y.B_ID AND X.B_VA=Y.B_VA
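Here is a quick check of this last query against the tables from the question, sketched with Python's sqlite3 module as a stand-in for Oracle or SQL Server. The derived table X finds the maximum B_VA per ID, and the joins back to A and B pick up the row that carries that maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (ID INTEGER, B_ID INTEGER);
CREATE TABLE B (B_ID INTEGER, B_VA INTEGER, B_VB TEXT);
INSERT INTO A VALUES (1, 1), (1, 4), (2, 3), (2, 2);
INSERT INTO B VALUES (1, 1, 'a'), (2, 2, 'b'), (3, 5, 'c'), (4, 2, 'd');
""")

rows = conn.execute("""
SELECT X.ID, Y.B_ID, X.B_VA, Y.B_VB
FROM (SELECT A.ID, MAX(B.B_VA) AS B_VA
      FROM A INNER JOIN B ON A.B_ID = B.B_ID
      GROUP BY A.ID) X
INNER JOIN A Z ON X.ID = Z.ID
INNER JOIN B Y ON Z.B_ID = Y.B_ID AND X.B_VA = Y.B_VA
ORDER BY X.ID
""").fetchall()

for row in rows:
    print(row)
```

If two B rows for the same ID tie on the maximum B_VA, this approach returns both of them.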

SQL Server Create VIEW for Sorting

I have a database table that has the following structure:
TABLE_A
DOC_ID | STATUS
1 | 0
2 | 1
TABLE_B
PK_ID | DOC_ID | NAME | VALUE
1 | 1 | A | 1
2 | 1 | B | 2
3 | 2 | A | 1
4 | 2 | B | 1
5 | 2 | C | 1
DOC_ID is the FOREIGN KEY on TABLE_B.
Then I create a VIEW so that I may more easily sort on NAME.
CREATE VIEW [dbo].[V_MY_VIEW] AS
SELECT a.DOC_ID, a1.VALUE AS 'A', a2.VALUE AS 'B', a3.VALUE AS 'C'
FROM dbo.TABLE_A a,
( SELECT DOC_ID, VALUE FROM dbo.TABLE_B WHERE NAME = 'A') a1
LEFT OUTER JOIN ( SELECT DOC_ID, VALUE FROM dbo.TABLE_B WHERE NAME = 'B') a2
ON a1.DOC_ID = a2.DOC_ID
LEFT OUTER JOIN ( SELECT DOC_ID, VALUE FROM dbo.TABLE_B WHERE NAME = 'C') a3
ON a1.DOC_ID = a3.DOC_ID
WHERE a.STATUS IN (0, 1)
This view will only include the rows with DOC_ID = 2 since the rows with DOC_ID = 1 do not have a row with NAME = C. How should I modify the VIEW so that it will include all the rows from TABLE_B?
Thanks.
CREATE VIEW [dbo].[V_MY_VIEW] AS
SELECT a.DOC_ID, a1.VALUE AS A, a2.VALUE AS B, a3.VALUE AS C
FROM dbo.TABLE_A a
LEFT JOIN (SELECT DOC_ID, VALUE FROM dbo.TABLE_B WHERE NAME = 'A') a1
ON a.DOC_ID = a1.DOC_ID
LEFT OUTER JOIN ( SELECT DOC_ID, VALUE FROM dbo.TABLE_B WHERE NAME = 'B') a2
ON a.DOC_ID = a2.DOC_ID
LEFT OUTER JOIN ( SELECT DOC_ID, VALUE FROM dbo.TABLE_B WHERE NAME = 'C') a3
ON a.DOC_ID = a3.DOC_ID
WHERE a.STATUS IN (0, 1)
See the results at http://sqlfiddle.com/#!3/5574b/4/0
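The same view can be tried locally. Below is a sketch with Python's sqlite3 module standing in for SQL Server, so the dbo schema prefix and the bracketed view name are dropped; the data is taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE_A (DOC_ID INTEGER PRIMARY KEY, STATUS INTEGER);
CREATE TABLE TABLE_B (PK_ID INTEGER PRIMARY KEY, DOC_ID INTEGER, NAME TEXT, VALUE INTEGER);
INSERT INTO TABLE_A VALUES (1, 0), (2, 1);
INSERT INTO TABLE_B VALUES
    (1, 1, 'A', 1), (2, 1, 'B', 2),
    (3, 2, 'A', 1), (4, 2, 'B', 1), (5, 2, 'C', 1);

-- Every per-name subquery is LEFT JOINed to TABLE_A, so a missing name
-- yields NULL in that column instead of dropping the whole document.
CREATE VIEW V_MY_VIEW AS
SELECT a.DOC_ID, a1.VALUE AS A, a2.VALUE AS B, a3.VALUE AS C
FROM TABLE_A a
LEFT JOIN (SELECT DOC_ID, VALUE FROM TABLE_B WHERE NAME = 'A') a1
    ON a.DOC_ID = a1.DOC_ID
LEFT JOIN (SELECT DOC_ID, VALUE FROM TABLE_B WHERE NAME = 'B') a2
    ON a.DOC_ID = a2.DOC_ID
LEFT JOIN (SELECT DOC_ID, VALUE FROM TABLE_B WHERE NAME = 'C') a3
    ON a.DOC_ID = a3.DOC_ID
WHERE a.STATUS IN (0, 1);
""")

rows = conn.execute(
    "SELECT DOC_ID, A, B, C FROM V_MY_VIEW ORDER BY DOC_ID"
).fetchall()
for row in rows:
    print(row)
```

DOC_ID 1 now appears with NULL in column C, since it has no row with NAME = 'C'.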
SELECT * FROM TABLE_A LEFT OUTER JOIN TABLE_B ON TABLE_A.DOC_ID = TABLE_B.DOC_ID
WHERE TABLE_A.STATUS IN (0, 1)
Replace * with the columns you want to display.
Include an ISNULL(Name, ) in your ORDER BY clause unless you want non-matches at the top.