filter duplicates in SQL join - sql

When using a SQL join, is it possible to keep only rows that have a single row for the left table?
For example:
select * from A, B where A.id = B.a_id;
a1 b1
a2 b1
a2 b2
In this case, I want to remove all except the first row, where a single row from A matched exactly 1 row from B.
I'm using MySQL.

This should work in MySQL:
select * from A, B where A.id = B.a_id GROUP BY A.id HAVING COUNT(*) = 1;
For those of you not using MySQL, you will need to use aggregate functions (like min() or max()) on all the columns (except A.id) so your database engine doesn't complain.

It helps if you specify the keys of your tables when asking a question such as this. It isn't obvious from your example what the key of B might be (assuming it has one).
Here's a possible solution assuming that ID is a candidate key of table B.
SELECT *
FROM A, B
WHERE B.id =
(SELECT MIN(B.id)
FROM B
WHERE A.id = B.a_id);

First, I would recommend using the JOIN syntax instead of the outdated syntax of separating tables by commas. Second, if A.id is the primary key of the table A, then you need only inspect table B for duplicates:
Select ...
From A
Join B
On B.a_id = A.id
Where Exists (
Select 1
From B B2
Where B2.a_id = A.id
Having Count(*) = 1
)

This avoids the cost of counting matching rows, which can be expensive for large tables.
As usual, when comparing various possible solutions, benchmarking / comparing the execution plans is suggested.
select
*
from
A
join B on A.id = B.a_id
where
not exists (
select
1
from
B B2
where
A.id = b2.a_id
and b2.id != b.id
)

Related

Subqueries vs Multi Table Join

I've 3 tables A, B, C. I want to list the intersection count.
Way 1:-
select count(id) from A a join B b on a.id = b.id join C c on B.id = C.id;
Result Count - X
Way 2:-
SELECT count(id) FROM A WHERE id IN (SELECT id FROM B WHERE id IN (SELECT id FROM C));
Result Count - Y
The result count in each of the query is different. What exactly is wrong?
A JOIN can multiply the number of rows as well as filtering out rows.
In this case, the second count should be the correct one because nothing is double counted -- assuming id is unique in a. If not, it needs count(distinct a.id).
The equivalent using JOIN would use COUNT(DISTINCT):
select count(distinct a.id)
from A a join
B b
on a.id = b.id join
C c
on B.id = C.id;
I mention this for completeness but do not recommend this approach. Multiplying the number of rows just to remove them using distinct is inefficient.
In many databases, the most efficient method might be:
select count(*)
from a
where exists (select 1 from b where b.id = a.id) and
exists (select 1 from c where c.id = a.id);
Note: This assumes there are indexes on the id columns and that id is unique in a.

select join result as columns to same row

I have a one-to-many relation in a pg database. I have table A and table B, where rows of B have a foreign key to A.
I want to select certain rows from A and attach certain columns from matching rows of B to same row from A.
E.g.
A
id | created_at |
B
id | created_at | a_id | type |
I tried to do multiple subqueries, e.g.
select A.id,
(select created_at from B where b.a_id = a.id and B.type = 'some_type' limit 1) as some_type_created_at,
(select created_at from B where b.a_id = a.id and B.type = 'another_type' limit 1) as another_type_created_at
from A
But this is obviously ugly and wrong, feels like that. What is the better way of achieving it in Postgres?
Ofcourse I can do join and get the full cartesian product, but I want the result from the db to be directly like this.
There's nothing wrong about using scalar subqueries the way you are doing it. That will work well and will give you the result you want.
Alternatively, you could use lateral table expressions; that will also give you the same result, it's more complex, and in this case I don't see any particular benefit to use them. Lateral queries will take the form:
select
a.id,
b1.created_at as some_type_created_at,
b2.created_at as another_type_created_at
from a
left join lateral (
select created_at from B where b.a_id = a.id and B.type = 'some_type' limit 1
) b1 on true,
left join lateral (
select created_at from B where b.a_id = a.id and B.type = 'another_type' limit 1
) b2 on true
In sum, you are good as you are.

Query left join without all the right rows from B table

I have 2 tables, A and B.
I need all columns from A + 1 column from B in my select.
Unfortunately, B has multiples rows(all identicals) for 1 row in A
on the join condition.
I tried but I can't isolate one row in A for one row in B with left join for example while keeping my select.
How can I do this query ? Query in ORACLE SQL
Thanks in advance.
This is a good use for outer apply. The structure of the query looks like this:
select a.*, b.col
from a outer apply
(select top 1 b.col
from b
where b.? = a.?
) b;
Normally, you would only use top 1 with order by. In this case, it doesn't seem to make a difference which row you choose.
You can group by on all columns from A, and then use an aggregate (like max or min) to pick any of the identical B values:
select a.*
, b.min_col1
from TableA a
left join
(
select a_id
, min(col1) as min_col1
from TableB
group by
a_id
) b
on b.a_id = a.id

INNER JOIN where **every** row must match the WHERE clause?

Here's a simplified example of what I'm trying to do. I have two tables, A and B.
A B
----- -----
id id
name a_id
value
I want to select only the rows from A where ALL the values of the rows from B match a where clause. Something like:
SELECT * from A INNER JOIN B on B.a_id = A.id WHERE B.value > 2
The problem with the above query is that if ANY row from B has a value > 2 I'll get the corresponding row from A, and I only want the row from A if
1.) ALL the rows in B for B.a_id = A.id match the WHERE, OR
2.) There are no rows in B that reference A
B is basically a table of filters.
SELECT *
FROM a
WHERE NOT EXISTS
(
SELECT NULL
FROM b
WHERE b.a_id = a.a_id
AND (b.value <= 2 OR b.value IS NULL)
)
This should solve your problem:
SELECT *
FROM a
WHERE NOT EXISTS (SELECT *
FROM b
WHERE b.a_id = a.id
AND b.value <= 2)
Here is the way in which this is obtained.
Suppose that we have available a universal quantifier (parallel to EXISTS, the existential quantifier), with a syntax like:
FORALL table WHERE condition1 : condition2
(to be read: FORALL the elements of table that satisfy the condition1, then condition2 is true)
So you could write your query in this way:
SELECT *
FROM a
WHERE FORALL b WHERE b.a_id = a.id : b.value > 2
(Note that forall is true even when no element in b exists with a value of a.id)
Then we can transform the universal quantifier in the existential one, with a double negation, as usual:
SELECT *
FROM a
WHERE NOT EXISTS b WHERE b.a_id = a.id : NOT (b.value > 2)
In plain SQL this can be written as:
SELECT *
FROM a
WHERE NOT EXISTS (SELECT *
FROM b
WHERE b.a_id = a.id
AND (b.value > 2) IS NOT TRUE)
This technique is very handy in case of universal quantification.
Answering this question (which it seems you actually meant to ask):
Return all rows from A, where all rows in B with B.a_id = A.id also pass the test B.value > 2.
Which is equivalent to:
Return all rows from A, where no row in B with B.a_id = A.id fails the test B.value > 2.
SELECT a.* -- "rows from A" (so don't include other columns)
FROM a
LEFT JOIN b ON b.a_id = a.id
AND (b.value > 2) IS NOT TRUE -- safe inversion of logic
WHERE b.a_id IS NULL;
When inverting a WHERE condition carefully consider NULL. IS NOT TRUE is the simple and safe way to perfectly invert a WHERE condition. The alternative would be (b.value <= 2 OR b.value IS NULL) which is longer but may be faster (easier to support with index).
Select rows which are not present in other table
Try this
SELECT * FROM A
LEFT JOIN B ON B.a_id = A.id
WHERE B.value > 2 OR B.a_id IS NULL
SELECT * FROM A LEFT JOIN B ON b.a_id = a.id
WHERE B.a_id IS NULL OR NOT EXIST (
SELECT 1
FROM b
WHERE b.value <= 2)
SELECT a.is, a.name, c.id as B_id, c.value from A
INNER JOIN (Select b.id, b.a_id, b.value from B WHERE B.value > 2) C
on C.a_id = A.id
Note it is a poor practice to use select *. You shoudl only specify fields you need. IN this case, I might possibly remove the b.Id refernces becasue they are probably not needed. If you have a join there is a 100% chance you are wasting resouces sending data you don't need becasue the join fields will be repeated. That is why I did nto include a_id in the final result set.
If you prefer not to use EXISTS, you can use an outer join.
SELECT A.*
FROM
A
LEFT JOIN B ON
B.a_id = A.id
AND B.value <= 2 -- note: condition reversed!!
WHERE B.id IS NULL
This works by searching for the existence of a failing record in B. If it finds one, then the join will match, and the final WHERE clause will exclude that record.

Is there alternative way to write this query?

I have tables A, B, C, where A represents items which can have zero or more sub-items stored in C. B table only has 2 foreign keys to connect A and C.
I have this sql query:
select * from A
where not exists (select * from B natural join C where B.id = A.id and C.value > 10);
Which says: "Give me every item from table A where all sub-items have value less than 10.
Is there a way to optimize this? And is there a way to write this not using exists operator?
There are three commonly used ways to test if a value is in one table but not another:
NOT EXISTS
NOT IN
LEFT JOIN ... WHERE ... IS NULL
You have already shown code for the first. Here is the second:
SELECT *
FROM A
WHERE id NOT IN (
SELECT b.id
FROM B
NATURAL JOIN C
WHERE C.value > 10
)
And with a left join:
SELECT *
FROM A
LEFT JOIN (
SELECT b.id
FROM B
NATURAL JOIN C
WHERE C.value > 10
) BC
ON A.id = BC.id
WHERE BC.id IS NULL
Depending on the database type and version, the three different methods can result in different query plans with different performance characteristics.