Count of Intersection Returns Unexpected Value - sql

I have two tables, say A and B, with a common column X which isn't nullable.
Query 1:
SELECT COUNT(*)
FROM A
WHERE A.X in
(SELECT distinct(B.X) FROM B)
Query 2:
SELECT COUNT(*)
FROM B
WHERE B.X in
(SELECT distinct(X) FROM A)
Query 3:
SELECT COUNT(*)
FROM A, B
WHERE A.X=B.X
Query 1 returns 5990. Queries 2 and 3 both return 6222. Removing distinct, or checking a distinct count on top, doesn't change the results. Can someone explain why the results aren't the same for all three queries, since they all return the intersection count?

Assume A has values
A
B
C
Assume B has values
A
B
C
D
E
C
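In this case Query 1 returns 3, because each of the three rows in A has a match in B.
Query 2 returns 4, because C is repeated in B and both of those rows have a match in A.
Query 3 also returns 4: with a join, the single C in A matches every C in B, so each duplicate produces its own row. In your data, B has more duplicates than A. A may have duplicates too, but fewer of them.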
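The example values above can be checked with a small script. This is a minimal sketch that runs the three queries from the question through Python's built-in sqlite3 module; SQLite is only a stand-in for whatever database the asker uses, and the fourth query (an INTERSECT of the distinct values) is an addition that is not part of the original question.

```python
import sqlite3


def intersection_counts():
    """Run the question's three queries on the answer's sample values."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE A (X TEXT NOT NULL);
        CREATE TABLE B (X TEXT NOT NULL);
        INSERT INTO A (X) VALUES ('A'), ('B'), ('C');
        -- 'C' appears twice in B
        INSERT INTO B (X) VALUES ('A'), ('B'), ('C'), ('D'), ('E'), ('C');
    """)
    queries = {
        # Counts rows of A that have a match in B.
        "query1": "SELECT COUNT(*) FROM A WHERE A.X IN (SELECT DISTINCT B.X FROM B)",
        # Counts rows of B that have a match in A.
        "query2": "SELECT COUNT(*) FROM B WHERE B.X IN (SELECT DISTINCT X FROM A)",
        # Counts matching (A row, B row) pairs.
        "query3": "SELECT COUNT(*) FROM A, B WHERE A.X = B.X",
        # Addition: counts distinct values present in both tables.
        "set_intersection": "SELECT COUNT(*) FROM (SELECT X FROM A INTERSECT SELECT X FROM B)",
    }
    counts = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
    conn.close()
    return counts


print(intersection_counts())
# {'query1': 3, 'query2': 4, 'query3': 4, 'set_intersection': 3}
```

Each query counts a different thing (rows of A, rows of B, or matching pairs), so the numbers only agree when X has no duplicates in either table.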

Related

Three-Way Diff in SQL

I have three SQL tables (A, B, and C), representing three different versions of a dataset. I want to devise an SQL query whose effect is to extract the ids of rows/tuples whose values are different in all three tables. (Two records are different if there exists a field where the records do not share the same value.) For simplicity, let's just assume that A, B, and C each have N records with record ids ranging from 1 to N, so for every id from 1 to N, there is a record in each table with that ID.
What might be the most efficient way to do this in SQL? One way would be to do something like
(SELECT id FROM
((SELECT * FROM A EXCEPT SELECT * FROM B) EXCEPT (SELECT * FROM C)) result)
EXCEPT
(SELECT id FROM
((SELECT * FROM B) INTERSECT (SELECT * FROM C)) result2)
Basically, what I've done above is first find the ids of records where the version in A differs from the version in B and from the version in C (the first two lines of the SQL query I've written). What's left is to filter out the ids of records where the version in B matches the version in C (which is done in the last two lines). But this seems horribly inefficient; is there a better, more concise way?
Note: I'm using PostgreSQL syntax here.
I would do it like this:
select id,
a.id is null as "missing in a",
b.id is null as "missing in b",
c.id is null as "missing in c",
a is distinct from b as "a and b different",
a is distinct from c as "a and c different",
b is distinct from c as "b and c different"
from a
full join b using (id)
full join c using (id)
where a is distinct from b
or b is distinct from c
or a is distinct from c
The id column is assumed to be a primary (or unique) key.
You can use GROUP BY and HAVING as follows:
select id from
(select * from A
union select * from B
union select * from C) t
group by id
-- use a separator that you know will not appear in these columns for the concat
having count(distinct column1 || '#~#' || column2 || '#~#' || column3) = 3
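To see what the GROUP BY / HAVING approach returns, here is a minimal sketch run through Python's built-in sqlite3 module (the question targets PostgreSQL; SQLite is used here only so the example is self-contained). The column names `column1` and `column2` and the sample rows are hypothetical, and the columns are assumed to be non-null, since `||` yields NULL when any operand is NULL.

```python
import sqlite3


def three_way_diff_ids():
    """Return ids whose rows differ pairwise across tables A, B and C."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE A (id INTEGER PRIMARY KEY, column1 TEXT NOT NULL, column2 TEXT NOT NULL);
        CREATE TABLE B (id INTEGER PRIMARY KEY, column1 TEXT NOT NULL, column2 TEXT NOT NULL);
        CREATE TABLE C (id INTEGER PRIMARY KEY, column1 TEXT NOT NULL, column2 TEXT NOT NULL);
        -- id 1: identical everywhere; id 2: B and C agree; id 3: all three differ
        INSERT INTO A VALUES (1, 'x', 'p'), (2, 'x', 'p'), (3, 'x', 'p');
        INSERT INTO B VALUES (1, 'x', 'p'), (2, 'y', 'p'), (3, 'y', 'p');
        INSERT INTO C VALUES (1, 'x', 'p'), (2, 'y', 'p'), (3, 'z', 'q');
    """)
    rows = conn.execute("""
        SELECT id
        FROM (SELECT * FROM A
              UNION SELECT * FROM B
              UNION SELECT * FROM C) AS t
        GROUP BY id
        -- three distinct versions of the row means no two tables agree
        HAVING COUNT(DISTINCT column1 || '#~#' || column2) = 3
        ORDER BY id
    """).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(three_way_diff_ids())  # [3]
```

Because UNION already removes duplicate rows, `COUNT(*) = 3` would give the same result on this data; the concatenation form is kept to mirror the answer.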

Subqueries vs Multi Table Join

I have 3 tables: A, B, and C. I want to get the count of their intersection.
Way 1:-
select count(a.id) from A a join B b on a.id = b.id join C c on b.id = c.id;
Result Count - X
Way 2:-
SELECT count(id) FROM A WHERE id IN (SELECT id FROM B WHERE id IN (SELECT id FROM C));
Result Count - Y
The two queries return different counts. What exactly is wrong?
A JOIN can multiply the number of rows as well as filtering out rows.
In this case, the second count should be the correct one because nothing is double counted -- assuming id is unique in a. If not, it needs count(distinct a.id).
The equivalent using JOIN would use COUNT(DISTINCT):
select count(distinct a.id)
from A a join
B b
on a.id = b.id join
C c
on B.id = C.id;
I mention this for completeness but do not recommend this approach. Multiplying the number of rows just to remove them using distinct is inefficient.
In many databases, the most efficient method might be:
select count(*)
from a
where exists (select 1 from b where b.id = a.id) and
exists (select 1 from c where c.id = a.id);
Note: This assumes there are indexes on the id columns and that id is unique in a.
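The difference between the approaches can be seen on a tiny data set. This is a minimal sketch using Python's built-in sqlite3 module; the sample ids and the index names `idx_b_id` and `idx_c_id` are made up for illustration, and `id` is unique in `a` as the answer assumes.

```python
import sqlite3


def count_variants():
    """Compare the plain join count with the duplicate-safe alternatives."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER PRIMARY KEY);
        CREATE TABLE b (id INTEGER);
        CREATE TABLE c (id INTEGER);
        INSERT INTO a (id) VALUES (1), (2), (3);
        INSERT INTO b (id) VALUES (1), (1), (2);        -- id 1 is duplicated
        INSERT INTO c (id) VALUES (1), (2), (2), (4);   -- id 2 is duplicated
        CREATE INDEX idx_b_id ON b (id);
        CREATE INDEX idx_c_id ON c (id);
    """)
    queries = {
        # Every duplicate in b or c multiplies the joined rows.
        "join": """SELECT COUNT(a.id) FROM a
                   JOIN b ON a.id = b.id JOIN c ON b.id = c.id""",
        # Same join, with the multiplied rows collapsed again.
        "join_distinct": """SELECT COUNT(DISTINCT a.id) FROM a
                            JOIN b ON a.id = b.id JOIN c ON b.id = c.id""",
        # The nested IN form from the question.
        "nested_in": """SELECT COUNT(id) FROM a
                        WHERE id IN (SELECT id FROM b
                                     WHERE id IN (SELECT id FROM c))""",
        # The EXISTS form: each row of a is counted at most once.
        "exists": """SELECT COUNT(*) FROM a
                     WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
                       AND EXISTS (SELECT 1 FROM c WHERE c.id = a.id)""",
    }
    counts = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
    conn.close()
    return counts


print(count_variants())
# {'join': 4, 'join_distinct': 2, 'nested_in': 2, 'exists': 2}
```

Only ids 1 and 2 are present in all three tables, yet the plain join reports 4 because each duplicate in `b` or `c` produces an extra joined row.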

How to fill in missing rows in a table with default values in sqlite

I have a table with 3 columns (a, b, c) and I want to make sure that for each possible combination of values in the first two columns, there is a row containing that combination. For example if this is my table:
a    b    c
---  ---  ---
P    X    1
Q    Y    2
Q    Z    3
R    Y    4
S    Y    5
S    Z    6
The unique values in column a are P, Q, R, S, and the unique values in column b are X, Y, Z. So I want to create a query that returns 12 rows (4×3) that fills in missing values in column c with a default value like 0, for example:
a    b    c
---  ---  ---
P    X    1
P    Y    0
P    Z    0
Q    X    0
Q    Y    2
Q    Z    3
R    X    0
R    Y    4
R    Z    0
S    X    0
S    Y    5
S    Z    6
The way I'm currently doing it is this:
select a, b, ifnull(c, 0)
from (select distinct a from table),
(select distinct b from table)
left join table using (a, b)
Unfortunately, this query is very slow since the table contains like ten thousand rows. If I precompute the query and store it in a table, then accessing the results is faster, but it takes a lot of space, most of which is probably just filled with zeros in the c column. Is there any way to make this query faster?
For this query:
select a.a, b.b, coalesce(c.c, 0)
from (select distinct a from table) a cross join
(select distinct b from table) b left join
table c
using (a, b);
You want indexes on:
(a, b)
(b)
The first index can be used for the select distinct a and for the join. The second can be used for the select distinct b.
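A minimal sketch of the whole setup, run through Python's built-in sqlite3 module against the sample data from the question. The table name `t` (standing in for `table`, which is a reserved word), the aliases `av` and `bv`, and the index names are assumptions made for this example; the join is written with an explicit ON clause rather than USING.

```python
import sqlite3


def fill_missing_combinations():
    """Return every (a, b) combination, with c defaulting to 0."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t (a TEXT, b TEXT, c INTEGER);
        INSERT INTO t (a, b, c) VALUES
            ('P', 'X', 1), ('Q', 'Y', 2), ('Q', 'Z', 3),
            ('R', 'Y', 4), ('S', 'Y', 5), ('S', 'Z', 6);
        -- the two indexes suggested in the answer (names are hypothetical)
        CREATE INDEX idx_t_a_b ON t (a, b);
        CREATE INDEX idx_t_b ON t (b);
    """)
    rows = conn.execute("""
        SELECT av.a, bv.b, COALESCE(t.c, 0)
        FROM (SELECT DISTINCT a FROM t) AS av
        CROSS JOIN (SELECT DISTINCT b FROM t) AS bv
        LEFT JOIN t ON t.a = av.a AND t.b = bv.b
        ORDER BY av.a, bv.b
    """).fetchall()
    conn.close()
    return rows


rows = fill_missing_combinations()
print(len(rows))  # 12
print(rows[:3])   # [('P', 'X', 1), ('P', 'Y', 0), ('P', 'Z', 0)]
```

Whether the indexes actually help depends on the data and the SQLite version; prefixing the SELECT with `EXPLAIN QUERY PLAN` shows whether they are being used.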

How can I write a query to match groups of records

This is a puzzling SQL task. I'd like to do it with a query instead of stepping through with a cursor and doing it the "hard way".
If I have two tables TableA and TableB each with a grouping column as below:
TableA            TableB
-------------     -------------
id      group     id      group
------  ------    ------  ------
1       D         1       X
2       D         2       X
3       D         3       Y
4       D         4       Y
4       E         5       Y
5       E         5       Z
5       F         6       Z
Note that the group names are not the same in the two tables.
I want to know if a given TableB group is comprised entirely of IDs which are also grouped together in any group in TableA. The TableA group can have more IDs than the TableB group, so long as it has all of the same IDs as the TableB group. IDs can be in more than one group in either table.
From the tables above, I should find out that group X from TableB matches a group in TableA, but groups Y and Z do not.
I've tried many different queries, subqueries, recursive CTEs. I've only ended up with wrong results and headaches. The real dataset is significantly larger, so performance should be considered a factor too. Unfortunately, that means the cross-join solution proposed in an answer below won't work.
Is this even possible with SQL?
The idea for this query is to compare every group in TableB with every group in TableA, then count the number of ids that match. This involves a cross join to generate all the group pairs, then some joins and aggregation:
-- group is a reserved word, so it is quoted here
select bg.[group] as b_group, ag.[group] as a_group
from (select [group], count(*) as cnt from TableB group by [group]) bg cross join
(select distinct [group] from TableA) ag left join
TableB b
on b.[group] = bg.[group] left join
TableA a
on a.[group] = ag.[group] and a.id = b.id
group by bg.[group], ag.[group], bg.cnt
having bg.cnt = count(a.id)
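The cross-join-and-count idea can be tried on the sample tables from the question. This is a minimal sketch using Python's built-in sqlite3 module; the grouping column is renamed to `grp` (a hypothetical name, chosen to avoid quoting the reserved word `group`), and it assumes each (id, group) pair appears at most once per table.

```python
import sqlite3


def matching_groups():
    """Return (TableB group, TableA group) pairs where A's group covers B's."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE TableA (id INTEGER, grp TEXT);
        CREATE TABLE TableB (id INTEGER, grp TEXT);
        INSERT INTO TableA (id, grp) VALUES
            (1, 'D'), (2, 'D'), (3, 'D'), (4, 'D'),
            (4, 'E'), (5, 'E'), (5, 'F');
        INSERT INTO TableB (id, grp) VALUES
            (1, 'X'), (2, 'X'), (3, 'Y'), (4, 'Y'),
            (5, 'Y'), (5, 'Z'), (6, 'Z');
    """)
    rows = conn.execute("""
        SELECT bg.grp, ag.grp
        FROM (SELECT grp, COUNT(*) AS cnt FROM TableB GROUP BY grp) AS bg
        CROSS JOIN (SELECT DISTINCT grp FROM TableA) AS ag
        LEFT JOIN TableB AS b ON b.grp = bg.grp
        LEFT JOIN TableA AS a ON a.grp = ag.grp AND a.id = b.id
        GROUP BY bg.grp, ag.grp, bg.cnt
        -- every id of the TableB group must be found in the TableA group
        HAVING bg.cnt = COUNT(a.id)
        ORDER BY bg.grp, ag.grp
    """).fetchall()
    conn.close()
    return rows


print(matching_groups())  # [('X', 'D')]
```

Group X is fully contained in group D, while Y and Z have no TableA group holding all of their ids, which matches the result expected in the question.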
This query could give the answer.
SELECT B.[group]
FROM
TableB B
FULL JOIN TableA A ON B.id = A.id
GROUP BY
B.[group]
HAVING
COUNT(DISTINCT A.[group]) = 1

Multiple NOT distinct

I've got an MS Access database and I need to create an SQL query that selects all the non-distinct entries in one column while still keeping all of their rows.
In this case more than ever an example is worth thousands of words:
Table:
A B C
1 x q
2 y w
3 y e
4 z r
5 z t
6 z y
SQL magic
Result:
B C
y w
y e
z r
z t
z y
Basically it removes all rows whose value in column B is unique but keeps every row for the values that appear more than once. I can "group by b" and then "count>1" to get the non-distinct values, but the result will only list one row per value of B, not the 2 or more that I need.
Any help?
Thanks.
Select B, C
From Table
Where B In
(Select B From Table
Group By B
Having Count(*) > 1)
Another way of returning the results you want would be this:
select *
from
my_table
where
B in
(select B from my_table group by B having count(*) > 1)
select
*
from
my_table t1,
my_table t2
where
t1.B = t2.B
and
t1.C != t2.C
-- apparently you need to use <> instead of != in Access
-- Thanks, Dave!
Something like that?
Join the values of B that you found with group by b and count > 1 back to the original table to retrieve the C values.
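A minimal sketch of that join-back approach, run on the sample table from the question through Python's built-in sqlite3 module. SQLite is used only to make the example self-contained; the question is about MS Access, where the join syntax may need adjusting (for example, writing INNER JOIN explicitly). The table name `my_table` and the alias `dup` are assumptions.

```python
import sqlite3


def non_distinct_rows():
    """Return (B, C) for every row whose B value occurs more than once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE my_table (A INTEGER, B TEXT, C TEXT);
        INSERT INTO my_table (A, B, C) VALUES
            (1, 'x', 'q'), (2, 'y', 'w'), (3, 'y', 'e'),
            (4, 'z', 'r'), (5, 'z', 't'), (6, 'z', 'y');
    """)
    rows = conn.execute("""
        SELECT t.B, t.C
        FROM my_table AS t
        -- dup holds each repeated B value exactly once
        JOIN (SELECT B FROM my_table GROUP BY B HAVING COUNT(*) > 1) AS dup
          ON dup.B = t.B
        ORDER BY t.A
    """).fetchall()
    conn.close()
    return rows


print(non_distinct_rows())
# [('y', 'w'), ('y', 'e'), ('z', 'r'), ('z', 't'), ('z', 'y')]
```

Because the derived table lists each repeated value of B only once, the join returns every original row once and does not multiply rows the way a self-join on B would.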