How do I find records that are not joined? - sql

I have two tables that are joined together.
A has many B
Normally you would do:
select * from a,b where b.a_id = a.id
to get all of the records from a that have a record in b.
How do I get just the records in a that do not have anything in b?

select * from a where id not in (select a_id from b)
Or, as some other people on this thread say:
select a.* from a
left outer join b on a.id = b.a_id
where b.a_id is null

select * from a
left outer join b on a.id = b.a_id
where b.a_id is null

The following image helps to understand SQL LEFT JOIN: [image not available]

Another approach:
select * from a where not exists (select * from b where b.a_id = a.id)
The "exists" approach is useful if there is some other "where" clause you need to attach to the inner query.

SELECT id FROM a
EXCEPT
SELECT a_id FROM b;
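To compare the approaches suggested so far (NOT IN, LEFT OUTER JOIN with IS NULL, NOT EXISTS, EXCEPT), here is a small sketch that runs them against the same data. It uses Python's sqlite3 module with an in-memory database; the sample rows are made up for illustration, and b.a_id is declared NOT NULL here, which is an assumption the question does not state.

```python
import sqlite3

# Hypothetical sample data: a has ids 1-4, and only 1 and 3 are referenced from b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER NOT NULL REFERENCES a(id));
    INSERT INTO a (id) VALUES (1), (2), (3), (4);
    INSERT INTO b (id, a_id) VALUES (10, 1), (11, 1), (12, 3);
""")

queries = {
    "not_in": "SELECT id FROM a WHERE id NOT IN (SELECT a_id FROM b)",
    "left_join": "SELECT a.id FROM a LEFT OUTER JOIN b ON a.id = b.a_id WHERE b.a_id IS NULL",
    "not_exists": "SELECT id FROM a WHERE NOT EXISTS (SELECT * FROM b WHERE b.a_id = a.id)",
    "except": "SELECT id FROM a EXCEPT SELECT a_id FROM b",
}

# Sort in Python so the comparison does not depend on the row order each query returns.
results = {}
for name, sql in queries.items():
    results[name] = sorted(row[0] for row in conn.execute(sql))

for name, ids in results.items():
    print(name, ids)  # every variant prints [2, 4] for this data
```

Note that the EXCEPT form only returns the id column, so you would still have to join back to a to get the full rows.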

You will probably get a lot better performance (than using 'not in') if you use an outer join:
select * from a left outer join b on a.id = b.a_id where b.a_id is null;

SELECT <columns>
FROM a WHERE id NOT IN (SELECT a_id FROM b)

With a single join it is pretty fast, but when we are removing records from a database that has about 50 million records and four or more joins due to foreign keys, it takes a few minutes.
It is much faster to use a WHERE NOT IN condition like this:
select a.* from a
where a.id NOT IN (SELECT DISTINCT a_id FROM b where a_id IS NOT NULL)
-- and for more joins
AND a.id NOT IN (SELECT DISTINCT a_id FROM c where a_id IS NOT NULL)
I can also recommend this approach for deleting when cascade delete is not configured.
This query takes only a few seconds.
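As a rough illustration of the deletion case, here is a sketch that removes the rows of a that are referenced from neither b nor c, using the same chained NOT IN conditions. It runs on SQLite through Python's sqlite3 module, and the tables and rows are invented for the example, so it says nothing about timing on a 50 million row database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
    CREATE TABLE c (id INTEGER PRIMARY KEY, a_id INTEGER);
    INSERT INTO a (id) VALUES (1), (2), (3), (4);
    INSERT INTO b (id, a_id) VALUES (10, 1), (11, NULL);
    INSERT INTO c (id, a_id) VALUES (20, 3);
""")

# Delete every row of a that has no child row in b and no child row in c.
# The IS NOT NULL filters keep the NULL a_id in b from breaking NOT IN.
cursor = conn.execute("""
    DELETE FROM a
    WHERE id NOT IN (SELECT DISTINCT a_id FROM b WHERE a_id IS NOT NULL)
      AND id NOT IN (SELECT DISTINCT a_id FROM c WHERE a_id IS NOT NULL)
""")
deleted = cursor.rowcount
remaining = [row[0] for row in conn.execute("SELECT id FROM a ORDER BY id")]
print(deleted, remaining)  # 2 [1, 3]
```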

The first approach is
select a.* from a where a.id not in (select b.ida from b)
the second approach is
select a.*
from a left outer join b on a.id = b.ida
where b.ida is null
The first approach is very expensive. The second approach is better.
With PostgreSQL 9.4, I ran EXPLAIN on both queries, and the first query has a cost of cost=0.00..1982043603.32.
The join query, by contrast, has a cost of cost=45946.77..45946.78.
For example, I search for all products that are not compatible with any vehicle. I have 100k products and more than 1m compatibilities.
select count(*) from product a left outer join compatible c on a.id=c.idprod where c.idprod is null
The join query took about 5 seconds, whereas the subquery version had still not finished after 3 minutes.

Another way of writing it
select a.*
from a
left outer join b
on a.id = b.a_id
where b.a_id is null
Ouch, beaten by Nathan :)

This will protect you from nulls in the IN clause, which can cause unexpected behavior.
select * from a where id not in (select a_id from b where a_id is not null)
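To see what the "unexpected behavior" looks like, here is a minimal sketch in Python with sqlite3. The data is made up, and b.a_id is deliberately nullable. With a NULL in the subquery result, the unguarded NOT IN returns no rows at all, while the guarded NOT IN and NOT EXISTS return the unmatched rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);  -- a_id is nullable here
    INSERT INTO a (id) VALUES (1), (2), (3);
    INSERT INTO b (id, a_id) VALUES (10, 1), (11, NULL);
""")

def ids(connection, sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in connection.execute(sql))

# "2 NOT IN (1, NULL)" evaluates to NULL rather than true, so no row passes the filter.
unguarded = ids(conn, "SELECT id FROM a WHERE id NOT IN (SELECT a_id FROM b)")
guarded = ids(conn, "SELECT id FROM a WHERE id NOT IN (SELECT a_id FROM b WHERE a_id IS NOT NULL)")
not_exists = ids(conn, "SELECT id FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id)")

print(unguarded)   # []
print(guarded)     # [2, 3]
print(not_exists)  # [2, 3]
```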

Related

Re-Writing a SQL Statement with a Subquery to Have a Join

I have to re-write a SQL statement with a subquery so that it has a join for my job. So far, this is what I have.
SELECT * FROM Table_A
WHERE TABLE_A.A_ID NOT IN
(SELECT LK.A_ID FROM Link_Table LK
LEFT JOIN Table_B B
ON B.B_ID = LK.B_ID)
I am really having a hard time with this. I feel like this is because of the link tables though. Can anyone give me advice on altering this query?
Seems like you want a LEFT JOIN with an IS NULL in the WHERE:
SELECT {Column list} --Don't use *
FROM dbo.Table_A A
LEFT JOIN dbo.Link_Table LK ON A.A_ID = LK.A_ID
WHERE LK.A_ID IS NULL;
You don't need the reference to Table_B at all here.
Personally, however, I would prefer an EXISTS, but that is a subquery again:
SELECT {Column List}
FROM dbo.Table_A A
WHERE NOT EXISTS (SELECT 1
FROM dbo.Link_Table LK
WHERE A.A_ID = LK.A_ID);

Best way to eliminate duplicates rows after multiple joins

I'll consider three simple tables. A, B are my entity tables and C is an intermediate table that creates a many-to-many relationship between A & B.
Schemas:
A: (id INTEGER PRIMARY KEY)
B: (id INTEGER PRIMARY KEY)
C: (
A_id INTEGER,
B_id INTEGER,
FOREIGN KEY(A_id) REFERENCES A(id),
FOREIGN KEY(B_id) REFERENCES B(id)
)
Now, consider the below query
SELECT
A.id
FROM A
LEFT OUTER JOIN C
ON (A.id = C.A_id)
LEFT OUTER JOIN B
ON (C.B_id = B.id)
WHERE ...;
This query would result in duplicate values of A.id, which is expected because C might have multiple rows associated with each row of A. My question is what's the best way to eliminate these duplicates and get the A records. I only need the A records.
I am aware of two ways,
-- Using DISTINCT
SELECT
DISTINCT A.id, ...
FROM A
LEFT OUTER JOIN C
ON (A.id = C.A_id)
LEFT OUTER JOIN B
ON (C.B_id = B.id)
WHERE ...
ORDER BY A.id;
And
-- Or using A.id IN (above query)/ A.id = Any(above query)
SELECT
...
FROM A
WHERE A.id IN (
SELECT
A.id
FROM A
LEFT OUTER JOIN C
ON (A.id = C.A_id)
LEFT OUTER JOIN B
ON (C.B_id = B.id)
WHERE ...
);
I'm using PostgreSQL. I need to include all the tables for filtering, so dropping a table from the join cannot be considered an improvement. I've analyzed both queries, but I still feel there might be a better way to do this (in terms of performance).
Any help is really appreciated!
I would suggest exists:
SELECT A.id
FROM A
WHERE EXISTS (SELECT 1
FROM C JOIN
B
ON C.B_id = B.id
WHERE A.id = C.A_id AND . . .
)
You can also try following query:
SELECT
a.* -- or whatever columns you need of a
FROM a
WHERE EXISTS(
SELECT 1
FROM c
WHERE c.a_id = a.id
)
Note that there is no need to join table b: the existence of a row in c always guarantees a corresponding row in b, and you do not need any information contained in that row/table.
Perhaps even cleaner might be:
SELECT DISTINCT
a.* -- or whatever columns you need of a
FROM a
LEFT JOIN c
You can look at the query plans and execution times using EXPLAIN ANALYZE <query>. Perhaps this gives you a hint about which one to use.
But be aware of caching: run both queries several times to get comparable results.
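For a concrete picture of the duplicates and of how DISTINCT and EXISTS get rid of them, here is a small sketch using the A/B/C schema from the question. It runs on SQLite via Python's sqlite3 rather than PostgreSQL, the rows are invented, and "B.id IS NOT NULL" is only a stand-in for the elided WHERE filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY);
    CREATE TABLE B (id INTEGER PRIMARY KEY);
    CREATE TABLE C (
        A_id INTEGER REFERENCES A(id),
        B_id INTEGER REFERENCES B(id)
    );
    INSERT INTO A (id) VALUES (1), (2), (3);
    INSERT INTO B (id) VALUES (1), (2);
    INSERT INTO C (A_id, B_id) VALUES (1, 1), (1, 2), (2, 1);
""")

def ids(connection, sql):
    """Run a query and return its first column as a list, in query order."""
    return [row[0] for row in connection.execute(sql)]

# Hypothetical filter standing in for the question's "WHERE ...".
joins = """
    FROM A
    LEFT OUTER JOIN C ON A.id = C.A_id
    LEFT OUTER JOIN B ON C.B_id = B.id
    WHERE B.id IS NOT NULL
"""

joined = ids(conn, "SELECT A.id" + joins + "ORDER BY A.id")
distinct = ids(conn, "SELECT DISTINCT A.id" + joins + "ORDER BY A.id")
exists = ids(conn, """
    SELECT A.id
    FROM A
    WHERE EXISTS (SELECT 1 FROM C JOIN B ON C.B_id = B.id WHERE C.A_id = A.id)
    ORDER BY A.id
""")

print(joined)    # [1, 1, 2]  A.id 1 appears once per matching C row
print(distinct)  # [1, 2]
print(exists)    # [1, 2]
```

Which form is faster on real data is something only EXPLAIN ANALYZE on your own tables can tell you; this only shows that the results agree.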

Postgresql LATERAL vs INNER JOIN

JOIN
SELECT *
FROM a
INNER JOIN (
SELECT b.id, Count(*) AS Count
FROM b
GROUP BY b.id ) AS b ON b.id = a.id;
LATERAL
SELECT *
FROM a,
LATERAL (
SELECT Count(*) AS Count
FROM b
WHERE a.id = b.id ) AS b;
As I understand it, in the first query the join subquery is computed once and then merged with the main query, whereas the LATERAL subquery is run once for each row of the FROM clause.
It seems to me that if the join collapses several rows into one, the JOIN will be more efficient, but if the relationship is 1 to 1, LATERAL will be. Am I right?
If I understand you correctly, you are asking which of the two statements is more efficient.
You can test that yourself using EXPLAIN (ANALYZE), and I guess that the answer depends on the data:
If there are few rows in a, the LATERAL join will probably be more efficient if there is an index on b(id).
If there are many rows in a, the first query will probably be more efficient, because it can use a hash or merge join.
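Before comparing speed, it may be worth checking that the two statements return the same rows for your data. The sketch below is only an approximation: it uses SQLite through Python's sqlite3, which as far as I know has no LATERAL, so a correlated scalar subquery stands in for the LATERAL subquery (that substitution, the cnt alias and the sample rows are all assumptions of this example). It shows that the per-row form keeps rows of a with no match in b and reports a count of 0, while the INNER JOIN form drops them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER);  -- not unique, so several b rows can share an id
    INSERT INTO a (id) VALUES (1), (2), (3);
    INSERT INTO b (id) VALUES (1), (1), (2);
""")

# Aggregate b once, then join the grouped result to a.
join_rows = conn.execute("""
    SELECT a.id, b.cnt
    FROM a
    INNER JOIN (SELECT b.id, COUNT(*) AS cnt FROM b GROUP BY b.id) AS b ON b.id = a.id
    ORDER BY a.id
""").fetchall()

# Stand-in for LATERAL: the count is evaluated once per row of a.
per_row_rows = conn.execute("""
    SELECT a.id, (SELECT COUNT(*) FROM b WHERE b.id = a.id) AS cnt
    FROM a
    ORDER BY a.id
""").fetchall()

print(join_rows)     # [(1, 2), (2, 1)]
print(per_row_rows)  # [(1, 2), (2, 1), (3, 0)]
```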

Postgresql: alternative to WHERE IN respective WHERE NOT IN

I have several statements which access very large PostgreSQL tables, e.g.:
SELECT a.id FROM a WHERE a.id IN ( SELECT b.id FROM b );
SELECT a.id FROM a WHERE a.id NOT IN ( SELECT b.id FROM b );
Some of them access even more tables in that way. What is the best approach to increase the performance? Should I switch, e.g., to joins?
Many thanks!
JOIN will be far more efficient, or you can use EXISTS:
SELECT a.id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
The subquery will return at most 1 row.
Here's a way to filter rows with an INNER JOIN:
SELECT a.id
FROM a
INNER JOIN b ON a.id = b.id
Note that each version can perform differently; sometimes IN is faster, sometimes EXISTS, and sometimes the INNER JOIN.
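Apart from performance, the three versions are only interchangeable when b.id is unique. Here is a small sketch (Python with sqlite3, made-up data, b.id deliberately not unique) showing that IN and EXISTS return each matching row of a once, while the INNER JOIN repeats it for every matching row in b.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER);  -- no uniqueness constraint in this example
    INSERT INTO a (id) VALUES (1), (2), (3);
    INSERT INTO b (id) VALUES (1), (1), (3);
""")

def ids(connection, sql):
    """Run a query and return its first column as a list, in query order."""
    return [row[0] for row in connection.execute(sql)]

in_ids = ids(conn, "SELECT a.id FROM a WHERE a.id IN (SELECT b.id FROM b) ORDER BY a.id")
exists_ids = ids(conn, "SELECT a.id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id) ORDER BY a.id")
join_ids = ids(conn, "SELECT a.id FROM a INNER JOIN b ON a.id = b.id ORDER BY a.id")

print(in_ids)      # [1, 3]
print(exists_ids)  # [1, 3]
print(join_ids)    # [1, 1, 3]
```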
Yes, I would recommend switching to joins. It will speed up the SELECT statements.

How can I implement SQL INTERSECT and MINUS operations in MS Access

I have researched and haven't found a way to run INTERSECT and MINUS operations in MS Access. Is there any way to do it?
INTERSECT is an inner join. MINUS is an outer join, where you choose only the records that don't exist in the other table.
INTERSECT
select distinct
a.*
from
a
inner join b on a.id = b.id
MINUS
select distinct
a.*
from
a
left outer join b on a.id = b.id
where
b.id is null
If you edit your original question and post some sample data then an example can be given.
EDIT: Forgot to add in the distinct to the queries.
INTERSECT is NOT an INNER JOIN. They're different. An INNER JOIN will give you duplicate rows in cases where INTERSECT will not. You can get equivalent results by:
SELECT DISTINCT a.*
FROM a
INNER JOIN b
on a.PK = b.PK
Note that PK must be the primary key column or columns. If there is no PK on the table (BAD!), you must write it like so:
SELECT DISTINCT a.*
FROM a
INNER JOIN b
ON a.Col1 = b.Col1
AND a.Col2 = b.Col2
AND a.Col3 = b.Col3 ...
With MINUS, you can do the same thing, but with a LEFT JOIN, and a WHERE condition checking for null on one of table b's non-nullable columns (preferably the primary key).
SELECT DISTINCT a.*
FROM a
LEFT JOIN b
on a.PK = b.PK
WHERE b.PK IS NULL
That should do it.
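The join-based emulation can be checked against a database that does have the set operators. The sketch below uses SQLite through Python's sqlite3, where MINUS is spelled EXCEPT. The single-column tables and their rows are invented for the example, and it is not MS Access, so treat it as a sanity check of the logic only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col1 INTEGER);
    CREATE TABLE b (col1 INTEGER);
    INSERT INTO a (col1) VALUES (1), (1), (2), (3);
    INSERT INTO b (col1) VALUES (1), (1), (4);
""")

def values(connection, sql):
    """Run a query and return its first column as a list, in query order."""
    return [row[0] for row in connection.execute(sql)]

intersect_native = values(conn, "SELECT col1 FROM a INTERSECT SELECT col1 FROM b ORDER BY col1")
intersect_join = values(conn, """
    SELECT DISTINCT a.col1 FROM a INNER JOIN b ON a.col1 = b.col1 ORDER BY a.col1
""")
# Without DISTINCT the join returns one row per matching pair (2 x 2 here).
join_no_distinct = values(conn, "SELECT a.col1 FROM a INNER JOIN b ON a.col1 = b.col1")

minus_native = values(conn, "SELECT col1 FROM a EXCEPT SELECT col1 FROM b ORDER BY col1")
minus_join = values(conn, """
    SELECT DISTINCT a.col1 FROM a LEFT JOIN b ON a.col1 = b.col1
    WHERE b.col1 IS NULL ORDER BY a.col1
""")

print(intersect_native, intersect_join)  # [1] [1]
print(join_no_distinct)                  # [1, 1, 1, 1]
print(minus_native, minus_join)          # [2, 3] [2, 3]
```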
They're done through JOINs. The old fashioned way :)
For INTERSECT, you can use an INNER JOIN. Pretty straightforward. You just need to use a GROUP BY or DISTINCT if you don't have a pure one-to-one relationship going on. Otherwise, as others have mentioned, you can get more results than you'd expect.
For MINUS, you can use a LEFT JOIN and use the WHERE to limit it so you're only getting back rows from your main table that don't have a match with the LEFT JOINed table.
Easy peasy.
Unfortunately MINUS is not supported in MS Access - one workaround would be to create three queries, one with the full dataset, one that pulls the rows you want to filter out, and a third that left joins the two tables and only pulls records that only exist in your full dataset.
Same thing goes for INTERSECT, except you would be doing it via an inner join and only returning records that exist in both.
No MINUS in Access, but you can use a subquery.
SELECT DISTINCT a.*
FROM a
WHERE a.PK NOT IN (SELECT DISTINCT b.pk FROM b)
I believe this one does the MINUS
SELECT DISTINCT
a.CustomerID,
b.CustomerID
FROM
tblCustomers a
LEFT JOIN
[Copy Of tblCustomers] b
ON
a.CustomerID = b.CustomerID
WHERE
b.CustomerID IS NULL