SQL: Performance comparison for exclusion (JOIN vs NOT IN)

I am curious about the most efficient way to query an exclusion in SQL. E.g. there are two tables (tableA and tableB) which can be joined on one column (col1). I want to display the data of tableA for all the rows whose col1 does not exist in tableB.
(So, in other words, tableB contains a subset of the col1 values of tableA, and I want to display tableA without the data that exists in tableB.)
Let's say tableB has 100 rows while tableA is gigantic (more than 1M rows). I know NOT IN (or NOT EXISTS) can be used, but perhaps there are more efficient ways (less computation time) to do it? I don't know, maybe with outer joins?
Code snippets and comments are much appreciated.

It depends on the RDBMS. For Microsoft SQL Server, NOT EXISTS is preferred to the OUTER JOIN, as it can use the more efficient anti-semi join.
For Oracle, MINUS is apparently preferred to NOT EXISTS (where suitable).
You would need to look at the execution plans and decide.
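To make this concrete, here is a minimal sketch of both forms, using the tableA/tableB/col1 names from the question:
-- SQL Server: NOT EXISTS, which the optimizer can run as an anti-semi join
SELECT a.*
FROM tableA a
WHERE NOT EXISTS (SELECT 1 FROM tableB b WHERE b.col1 = a.col1);
-- Oracle: set difference via MINUS (returns distinct col1 values only)
SELECT col1 FROM tableA
MINUS
SELECT col1 FROM tableB;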

I prefer to use
Select a.Col1
From TableA a
Left Join TableB b on a.Col1 = b.Col1
Where b.Col1 Is Null
I believe this will be quicker as you are utilising the FK constraint (providing you have one, of course).
Sample data:
create table #a (Col1 int)
create table #b (Col1 int)

insert into #a values (1), (2), (3), (4)
insert into #b values (1), (2)

select a.Col1
from #a a
left join #b b on a.Col1 = b.Col1
where b.Col1 is null

This question has been asked several times. Often the fastest way is to do this:
SELECT * FROM table1
WHERE id IN (SELECT id FROM table1 EXCEPT SELECT id FROM table2)
The whole operation can be done on indexes, whereas with NOT IN it generally cannot.
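If you only need the key column itself, note that the inner EXCEPT is already the answer on its own; a minimal sketch, assuming table1 and table2 each expose an id column:
-- the set difference alone returns the ids of table1 absent from table2
-- (note: EXCEPT also removes duplicates)
SELECT id FROM table1
EXCEPT
SELECT id FROM table2;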

There is no single correct answer to this question. Every RDBMS has a query optimizer that will determine the best execution plan based on the available indices, table statistics (number of rows, index selectivity), join condition, query condition, ...
When you have a relatively simple query like the one in your question, there are often several ways you can get the results in SQL. Every self-respecting RDBMS will recognize your intention and will create the same execution plan, no matter which syntax you use (subqueries with the IN or EXISTS operator, a query with JOIN, ...).
So, the best solution here is to write the simplest query that works and then check the execution plan.
If that solution is not acceptable, then you should try to find a better query.
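As a starting point for that check, a minimal sketch of how to surface cost figures in SQL Server, for example (other engines have EXPLAIN equivalents):
-- show logical reads and elapsed time for the statement that follows
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT a.col1
FROM tableA a
WHERE NOT EXISTS (SELECT 1 FROM tableB b WHERE b.col1 = a.col1);
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;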

Related

Very slow query with TOP and ORDER BY

I have a query in SQL Server 2014 that takes a lot of time to return results when I execute it.
When I remove the TOP or the ORDER BY instructions, it executes faster, but if I write both of them, it takes a lot of time.
SELECT TOP (10) A.ColumnValue AS ValueA
FROM TableA AS A
INNER JOIN TableB AS B
ON A.ID = B.ID
WHERE A.DateValue > '1982-05-02'
ORDER BY ValueA
How could I make it faster?
You say
When I remove the TOP or the ORDER BY ... it executes faster
Which would indicate that SQL Server has no problem generating the entire result set in the desired order. It just goes pear-shaped with the limiting of TOP 10. This is a common issue with row goals. When SQL Server knows you just need the first few results, it can choose a different plan attempting to optimise for this case, and that can backfire.
More recent versions include the hint DISABLE_OPTIMIZER_ROWGOAL to disable this on a per-query basis. On older versions you can use QUERYTRACEON 4138 as below.
SELECT TOP (10) A.ColumnValue AS ValueA
FROM TableA AS A
INNER JOIN TableB AS B
ON A.ID = B.ID
WHERE A.DateValue > '1982-05-02'
ORDER BY ValueA
OPTION (QUERYTRACEON 4138)
You can use this to verify the cause but may find permissions to run QUERYTRACEON are a problem.
In that eventuality you can hide the TOP value in a variable as below
DECLARE @Top INT = 10

SELECT TOP (@Top) A.ColumnValue AS ValueA
FROM TableA AS A
INNER JOIN TableB AS B
ON A.ID = B.ID
WHERE A.DateValue > '1982-05-02'
ORDER BY ValueA
OPTION (OPTIMIZE FOR (@Top = 1000000))
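On SQL Server 2016 SP1 and later you can also reach DISABLE_OPTIMIZER_ROWGOAL through USE HINT, which does not need the elevated permissions of QUERYTRACEON; a sketch:
SELECT TOP (10) A.ColumnValue AS ValueA
FROM TableA AS A
INNER JOIN TableB AS B
ON A.ID = B.ID
WHERE A.DateValue > '1982-05-02'
ORDER BY ValueA
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'))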
Create indexes on the ID columns of both tables:
CREATE INDEX index_nameA
ON TableA (ID, DateValue)
;
CREATE INDEX index_nameB
ON TableB (ID)
This will give the optimizer a better plan at query execution time.
The best way would be to use indexes to improve performance.
Here, in this case, the index can be put on (DateValue).
For the use of indexes, refer to this URL: using indexes
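A minimal sketch of that index, assuming the TableA and DateValue names from the question:
CREATE INDEX IX_TableA_DateValue
ON TableA (DateValue);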
This is pretty hopeless, unless most of your data has an earlier date. If the date is special, you could create a computed persisted column to speed up the query in general. However, I doubt that is the case.
I can envision a better execution plan for the query phrased this way:
SELECT TOP (10) A.ColumnValue AS ValueA
FROM TableA A
WHERE EXISTS (SELECT 1 FROM TableB b WHERE A.ID = B.ID) AND
A.DateValue > '1982-05-02'
ORDER BY ValueA;
with indexes on TableA(ValueA, DateValue, Id, ColumnValue) and TableB(Id). That execution plan would scan the index from the beginning, test DateValue and Id, and return ColumnValue for the matching rows.
However, I don't think SQL Server would generate this plan (although it is worth a try), and I don't know how to force it if it doesn't.
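A sketch of those indexes; placing ColumnValue in an INCLUDE rather than in the key is my assumption, and any covering layout serves:
CREATE INDEX IX_TableA_Covering
ON TableA (ValueA, DateValue, ID)
INCLUDE (ColumnValue);
CREATE INDEX IX_TableB_ID
ON TableB (ID);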

SQL: Is it advised to filter table with 'select ... where' before join?

Which of these two variants is more advisable in general SQL practice?
Let's consider table A with columns 1, 2, 3 and table B with columns 2, 4. Filtering the table with a select first:
select col2,col4 from
(select col1,col2 from tabA
where tabA.col3='sth') as t
join tabB using (col2);
or using plain join?:
select col2,col4 from tabA
join tabB using(col2)
where col3='sth';
We can assume the where clause matches 1 row. The tables are of similar size. Does the Oracle planner deal with such joins properly, or is it going to create a huge joined table and then filter it?
Test it yourself on the real tables, using explain plans to learn how many rows are evaluated. I don't believe it is possible to know which would be better, or if there would be any difference; the available indexes make a difference to the optimizer's choice of approach, for example.
Regarding your two examples: I don't like the "natural join" syntax ("using"), so the first option below is, I believe, the more common approach (the where clause refers directly to the "from" table):
select a.col2,b.col4
from tabA a
inner join tabB b on a.col2 = b.col2
where a.col3='sth'
;
but you could also try a join condition like this:
select b.col2,b.col4
from tabB b
inner join tabA a on b.col2 = a.col2 and a.col3='sth'
;
note this reverses the table relationship.
Your second version is an excellent way of writing the query, with the minor exception that col4 and col3 are not qualified:
select col2, b.col4
from tabA a join
     tabB b
     using (col2)
where a.col3 = 'sth';
Just for the record, this is not a natural join. That is a separate construct in SQL -- and rather abominable.
In my experience, Oracle has a good query optimizer. It will know how to optimize the query for the given conditions. The subquery should make no difference whatsoever to the query plan. Oracle is not going to filter before the join, unless that is the right thing to do.
The only time an inner filter will perform better than an outer filter is in (usually complex) cases where the query optimizer chooses a different query plan for one query versus the other.
Think of it in the same way as a query hint: by reorganizing the query you can influence the query plan without explicitly doing so with a hint.
As far as which is more advisable: whichever is easiest to read is usually best. Even if one performs better than the other because of where you put your filter, you should instead focus on having the correct indexes to ensure good performance.
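For comparison, an explicit hint would look like the sketch below; LEADING and USE_NL are standard Oracle hints, but treating them as appropriate for these particular tables is an assumption:
select /*+ LEADING(a) USE_NL(b) */ a.col2, b.col4
from tabA a
inner join tabB b on a.col2 = b.col2
where a.col3 = 'sth';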

Why does adding a WHERE clause with NOT IN make a SQL query slow in Postgres?

I have written a SQL query that seems very simple: tableA has primary key x, and ytime is a timestamp field.
select a.x, b.z from
-- tableA join tableB etc.
where a.id not in (select x from tableA where a.ytime is not null)
There is another Stack Overflow question similar to this one, but it only talks about a smaller subset of data:
Why SQL "NOT IN" is so slow?
I am not sure if I need to index the ytime column.
I have no science to back this up -- only anecdotal history -- but for cases where the "in" list is significantly large, I have found that the semi-join (or anti-join, in this case) is typically much more efficient than the IN list:
select a1.x, b.z
from tableA a1
join tableB b on ...
where not exists (
    select null
    from tableA a2
    where a1.id = a2.x
      and a2.ytime is not null
)
By significant I mean: a query that ran for minutes using the in-list ran in just seconds after I changed it to the semi-join.
Try it and see if it makes a difference.
I made some assumptions since your code was largely notional, but I think you get the idea.
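To see whether it does, Postgres can report the actual plan and timings. A minimal sketch; the join condition here is a hypothetical stand-in for the one elided above:
-- look for an Anti Join node in the resulting plan
EXPLAIN (ANALYZE, BUFFERS)
select a1.x, b.z
from tableA a1
join tableB b on a1.id = b.id -- hypothetical join condition
where not exists (
    select null
    from tableA a2
    where a1.id = a2.x and a2.ytime is not null
);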

Improving SQL cartesian product performance by reducing columns

I have an SQL query which uses cartesian product on a large table. However, I only need one column from one of the tables. Would it actually perform better, if I selected only that one column before using the cartesian product?
So, in other words, would this:
SELECT A.Id, B.Id
FROM (SELECT Id FROM Table1) AS A , Table2 AS B;
be faster than this, given that Table1 has more columns than Id?:
SELECT A.Id, B.Id
FROM Table1 AS A , Table2 AS B;
Or does the number of columns not matter?
On most databases, the two forms would have the same execution plan.
The first form would be worse on a database (such as MySQL) that materializes subqueries.
The second should be better with indexes on the two tables, table1(id) and table2(id): the index would be used to get the value rather than the base data.
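A sketch of those two indexes, assuming the table1/table2/id names from the question:
CREATE INDEX ix_table1_id ON table1 (id);
CREATE INDEX ix_table2_id ON table2 (id);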
Try it out yourself! But generally speaking, having a subquery reduce the number of rows will help improve performance. Your query should, however, be written differently:
select a.id as aid, b.id as bid
from (select id from table1 where id = <specific_id>) a, table2 b

WHERE or JOIN: which one is evaluated first in SQL Server?

Query
select * from TableA a join TableB b
on a.col1=b.col1
where b.col2 = 'SomeValue'
I expect the server to first filter col2 from TableB and then do the join; that would be more efficient.
Does SQL Server evaluate the WHERE clause first and then the JOIN?
Any link to learn the order in which SQL Server processes a query?
Thanks in advance.
This has already been answered ... read both answers:
https://dba.stackexchange.com/questions/5038/sql-server-join-where-processing-order
To summarise: it depends on the server implementation and its execution plan, so you will need to read up on your server in order to optimise your queries.
But I'm sure each server optimises simple joins as best it can.
If you are not sure, measure execution time on a large dataset.
It's decided by the SQL Server query optimiser engine based on which execution plan has the lower cost.
If you think that the filter clause will benefit your query performance, you can get the subset of the table by filtering it with your desired value and make a CTE for it.
Then join the CTE with your other table (a sketch follows below).
You can check which query performs better in your case in SSMS and go with it :)
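A minimal sketch of that CTE rewrite, using the TableA/TableB/col1/col2 names from the question:
WITH filtered_b AS (
    -- filter TableB first ...
    SELECT col1, col2
    FROM TableB
    WHERE col2 = 'SomeValue'
)
-- ... then join the reduced set
SELECT *
FROM TableA a
JOIN filtered_b b ON a.col1 = b.col1;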
We will use this code:
IF OBJECT_ID(N'tempdb..#TableA',N'U') IS NOT NULL DROP TABLE #TableA;
IF OBJECT_ID(N'tempdb..#TableB',N'U') IS NOT NULL DROP TABLE #TableB;
CREATE TABLE #TableA (col1 INT NOT NULL,Col2 NVARCHAR(255) NOT NULL)
CREATE TABLE #TableB (col1 INT NOT NULL,Col2 NVARCHAR(255) NOT NULL)
INSERT INTO #TableA VALUES (1,'SomeValue'),(2,'SomeValue2'),(3,'SomeValue3')
INSERT INTO #TableB VALUES (1,'SomeValue'),(2,'SomeValue2'),(3,'SomeValue3')
select * from #TableA a join #TableB b
on a.col1=b.col1
where b.col2 = 'SomeValue'
Let's analyze the query plan in SSMS (SQL Server Management Studio): mark the full SELECT statement, then right-click --> Display Estimated Execution Plan. The estimated plan shows that
it first does a Table Scan for the WHERE clause, then the JOIN.
1. Does SQL Server evaluate the WHERE clause first and then the JOIN?
First the WHERE clause, then the JOIN.
2. Any link to learn the order in which SQL will process a query?
I think you will find useful information here:
Execution Plan Basics
Graphical Execution Plans for Simple SQL Queries