Which is faster - NOT IN or NOT EXISTS? - sql

I have an insert-select statement that needs to only insert rows where a particular identifier of the row does not exist in either of two other tables. Which of the following would be faster?
INSERT INTO Table1 (...)
SELECT (...) FROM Table2 t2
WHERE ...
AND NOT EXISTS (SELECT 'Y' from Table3 t3 where t2.SomeFK = t3.RefToSameFK)
AND NOT EXISTS (SELECT 'Y' from Table4 t4 where t2.SomeFK = t4.RefToSameFK AND ...)
... or...
INSERT INTO Table1 (...)
SELECT (...) FROM Table2 t2
WHERE ...
AND t2.SomeFK NOT IN (SELECT RefToSameFK from Table3)
AND t2.SomeFK NOT IN (SELECT RefToSameFK from Table4 WHERE ...)
... or do they perform about the same? Additionally, is there any other way to structure this query that would be preferable? I generally dislike subqueries as they add another "dimension" to the query that increases runtime by polynomial factors.

Usually it does not matter if NOT IN is slower / faster than NOT EXISTS, because they are NOT equivalent in presence of NULL. Read:
NOT IN vs NOT EXISTS
In these cases you almost always want NOT EXISTS, because it has the usually expected behaviour.
If they are equivalent, it is likely that your database already has figured that out and will generate the same execution plan for both.
In the few cases where both options are aquivalent and your database is not able to figure that out, it is better to analyze both execution plans and choose the best options for your specific case.

You could use a LEFT OUTER JOIN and check if the value in the RIGHT table is NULL. If the value is NULL, the row doesn't exist. That is one way to avoid subqueries.
SELECT (...) FROM Table2 t2
LEFT OUTER JOIN t3 ON (t2.someFk = t3.ref)
WHERE t3.someField IS NULL

It's dependent on the size of the tables, the available indices, and the cardinality of those indices.
If you don't get the same execution plan for both queries, and if neither query plans out to perform a JOIN instead of a sub query, then I would guess that version two is faster. Version one is correlated and therefore would produce many more sub queries, version two can be satisfied with three queries total.
(Also, note that different engines may be biased in one direction or another. Some engines may correctly determine that the queries are the same (if they really are the same) and resolve to the same execution plan.)

For bigger tables, it's recomended to use NOT EXISTS/EXISTS, because the IN clause runs the subquery a lot of times depending of the architecture of the tables.
Based on cost optimizer:
There is no difference.

Related

Explanation of using the operator EXISTS on a correlated subqueries

What is an explanation of the mechanics behind the following Query?
It looks like a powerful method of doing dynamic filtering on a table.
CREATE TABLE tbl (ID INT, amt INT)
INSERT tbl VALUES
(1,1),
(1,1),
(1,2),
(1,3),
(2,3),
(2,400),
(3,400),
(3,400)
SELECT *
FROM tbl T1
WHERE EXISTS
(
SELECT *
FROM tbl T2
WHERE
T1.ID = T2.ID AND
T1.amt < T2.amt
)
Live test of it here on SQL Fiddle
You can usually convert correlated subqueries into an equivalent expression using explicit joins. Here is one way:
SELECT distinct t1.*
FROM tbl T1 left outer join
tbl t2
on t1.id = t2.id and
t1.amt < t2.amt
where t2.id is null
Martin Smith shows another way.
The question of whether they are a "powerful way of doing dynamic filtering" is true, but (usually) unimportant. You can do the same filtering using other SQL constructs.
Why use correlated subqueries? There are several positives and several negatives, and one important reason that is both. On the positive side, you do not have to worry about "multiplication" of rows, as happens in the above query. Also, when you have other filtering conditions, the correlated subquery is often more efficient. And, sometimes using delete or update, it seems to be the only way to express a query.
The Achilles heel is that many SQL optimizers implement correlated subqueries as nested loop joins (even though do not have to). So, they can be highly inefficient at times. However, the particular "exists" construct that you have is often quite efficient.
In addition, the nature of the joins between the tables can get lost in nested subqueries, which complicated conditions in where clauses. It can get hard to understand what is going on in more complicated cases.
My recommendation. If you are going to use them on large tables, learn about SQL execution plans in your database. Correlated subqueries can bring out the best or the worst in SQL performance.
Possible Edit. This is more equivalent to the script in the OP:
SELECT distinct t1.*
FROM tbl T1 inner join
tbl t2
on t1.id = t2.id and
t1.amt < t2.amt
Let's translate this to english:
"Select rows from tbl where tbl has a row of the same ID and bigger amt."
What this does is select everything except the rows with maximum values of amt for each ID.
Note, the last line SELECT * FROM tbl is a separate query and probably not related to the question at hand.
As others have already pointed out, using EXISTS in a correlated subquery is essentially telling the database engine "return all records for which there is a corresponding record which meets the criteria specified in the subquery." But there's more.
The EXISTS keyword represents a boolean value. It could also be taken to mean "Where at least one record exists that matches the criteria in the WHERE statement." In other words, if a single record is found, "I'm done, and I don't need to search any further."
The efficiency gain that CAN result from using EXISTS in a correlated subquery comes from the fact that as soon as EXISTS returns TRUE, the subquery stops scanning records and returns a result. Similarly, a subquery which employs NOT EXISTS will return as soon as ANY record matches the criteria in the WHERE statement of the subquery.
I believe the idea is that the subquery using EXISTS is SUPPOSED to avoid the use of nested loop searches. As #Gordon Linoff states above though, the query optimizer may or may not perform as desired. I believe MS SQL Server usually takes full advantage of EXISTS.
My understanding is that not all queries benefit from EXISTS, but often, they will, particularly in the case of simple structures such as that in your example.
I may have butchered some of this, but conceptually I believe it's on the right track.
The caveat is that if you have a performance-critical query, it would be best to evaluate execution of a version using EXISTS with one using simple JOINS as Mr. Linoff indicates. Depending on your database engine, table structure, time of day, and the alignment of the moon and stars, it is not cut-and-dried which will be faster.
Last note - I agree with lc. When you use SELECT * in your subquery, you may well be negating some or all of any performance gain. SELECT only the PK field(s).

WHERE and JOIN order of operation

My question is similar to this SQL order of operations but with a little twist, so I think it's fair to ask.
I'm using Teradata. And I have 2 tables: table1, table2.
table1 has only an id column.
table2 has the following columns: id, val
I might be wrong but I think these two statements give the same results.
Statement 1.
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table2.val<100
Statement 2.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT *
FROM table2
WHERE val<100
) table3
ON table1.id=table3.id
My questions is, will the query optimizer be smart enough to
- execute the WHERE clause first then JOIN later in Statement 1
- know that table 3 isn't actually needed in Statement 2
I'm pretty new to SQL, so please educate me if I'm misunderstanding anything.
this would depend on many many things (table size, index, key distribution, etc), you should just check the execution plan:
you don't say which database, but here are some ways:
MySql EXPLAIN
SQL Server SET SHOWPLAN_ALL (Transact-SQL)
Oracle EXPLAIN PLAN
what is explain in teradata?
Teradata Capture and compare plans faster with Visual Explain and XML plan logging
Depending on the availability of statistics and indexes for the tables in question the query rewrite mechanism in the optimizer will may or may not opt to scan Table2 for records where val < 100 before scanning Table1.
In certain situations, based on data demographics, joins, indexing and statistics you may find that the optimizer is not eliminating records in the query plan when you feel that it should. Even if you have a derived table such as the one in your example. You can force the optimizer to process a derived table by simply placing a GROUP BY in your derived table. The optimizer is then obligated to resolve the GROUP BY aggregate before it can consider resolving the join between the two tables in your example.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT table2.id, tabl2.val
FROM table2
WHERE val<100
GROUP BY 1,2
) table3
ON table1.id=table3.id
This is not to say that your standard approach should be to run with this through out your code. This is typically one of my last resorts when I have a query plan that simply doesn't eliminate extraneous records earlier enough in the plan and results in too much data being scanned and carried around through the various SPOOL files. This is simply a technique you can put in your toolkit to when you encounter such a situation.
The query rewrite mechanism is continually being updated from one release to the next and the details about how it works can be found in the SQL Transaction Processing Manual for Teradata 13.0.
Unless I'm missing something, Why do you even need Table1??
Just query Table2
Select id, val
From table2
WHERE val<100
or are you using the rows in table1 as a filter? i.e., Does table1 only copntain a subset of the Ids in Table2??
If so, then this will work as well ...
Select id, val
From table2
Where val<100
And id In (Select id
From table1)
But to answer your question, Yes the query optimizer should be intelligent enough to figure out the best order in which to execute the steps necessary to translate your logical instructions into a physical result. It uses the strored statistics that the database maintains on each table to determine what to do (what type of join logic to use for example), as wekll as what order to perform the operations in in order to minimize Disk IOs and processing costs.
Q1. execute the WHERE clause first then JOIN later in Statement 1
The thing is, if you switch the order of inner join, i.e. table2 INNER JOIN table1, then I guess WHERE clause can be processed before JOIN operation, during the preparation phase. However, I guess even if you don't change the original query, the optimizer should be able to switch their order, if it thinks the join operation will be too expensive with fetching the whole row, so it will apply WHERE first. Just my guess.
Q2. know that table 3 isn't actually needed in Statement 2
Teradata will interpret your second query in such way that the derived table is necessary, so it will keep processing table 3 involved operation.

Most optimal order (of joins) for left join

I have 3 tables Table1 (with 1020690 records), Table2(with 289425 records), Table 3(with 83692 records).I have something like this
SELECT * FROM Table1 T1 /* OK fine select * is bad when not all columns are needed, this is just an example*/
LEFT JOIN Table2 T2 ON T1.id=T2.id
LEFT JOIN Table3 T3 ON T1.id=T3.id
and a query like this
SELECT * FROM Table1 T1
LEFT JOIN Table3 T3 ON T1.id=T3.id
LEFT JOIN Table2 T2 ON T1.id=T2.id
The query plan shows me that it uses 2 Merge Join for both the joins. For the first query, the first merge is with T1 and T2 and then with T3. For the second query, the first merge is with T1 and T3 and then with T2.
Both these queries take about the same time(40 seconds approx.) or sometimes Query1 takes couple of seconds longer.
So my question is, does the join order matter ?
The join order for a simple query like this should not matter. If there's a way to reorder the joins to improve performance, that's the job of the query optimizer.
In theory, you shouldn't worry about it -- that's the point of SQL. Trying to outthink the query optimizer is generally not going to give better results. Especially in MS SQL Server, which has a very good query optimizer.
I wouldn't expect this query to take 40 seconds. You might not have the right indexes defined. You should use tools like SQL Server Profiler or SQL Server Database Engine Tuning Advisor to see if it can recommend any new indexes.
The query optimizer will use a combination of the constraints, indexes, and statistics collected on the table to build an execution plan. In most cases this works well. However, I do occasionally encounter scenarios where the execution plan is chosen poorly. Often times tweaking the query can effectively coerce the optimizer into a choosing a better plan. I can offer no general rules for doing this though. When all else fails you could resort to the FORCE ORDER query hint.
And yes, the join order can have a significant impact on execution time of your query. The idea is that by joining the tables that yield the smallest results first will cause the next join to be computed more quickly. Edit: It is important to note, however, that in the abscense of FORCE ORDER and in all other things being equal the order you specify in the query may have no correlation with the way the optimizer builds the execution plan.
In general, SQL Server is smart enough to pick out the best way to join and it will not only use the order you wrote in the query. That said, I find it easier to understand a complex query if all the inner joins are first and then the left joins.

IN vs. JOIN with large rowsets

I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article in my blog summarizes both my answer and my comments to another answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key preserved (i. e. the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
Neither of these methods reevaluates the whole subquery each time.
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all RDBMS's of the big four.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a JOIN b a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This remove the duplicates that could be generated by the JOIN, but runs just as fast if not faster
Speaking from experience on a Table with 49,000,000 rows I would recommend LEFT OUTER JOIN.
Using IN, or EXISTS Took 5 minutes to complete where the LEFT OUTER JOIN finishes in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINS. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
,t1id int references t1(t1id)
)
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop, this difference is expected because as I mentioned above the join has different semantics.
From MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that
include subqueries can be
alternatively formulated as joins.
Other questions can be posed only with
subqueries. In Transact-SQL, there is
usually no performance difference
between a statement that includes a
subquery and a semantically equivalent
version that does not. However, in
some cases where existence must be
checked, a join yields better
performance. Otherwise, the nested
query must be processed for each
result of the outer query to ensure
elimination of duplicates. In such
cases, a join approach would yield
better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join would be more efficient than the IN subquery; however the SQL*Server optimizer normally results in no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever needs to be migrated in the future, the database engine may not be so forgiving (for example using a join instead of an IN subquery makes a huge difference in MySql).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
Ive always been a supporter of the IN methodology. This link contains details of a test conducted in PostgresSQL.
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php

is it better to put more logic in your ON clause or should it only have the minimum necessary?

Given these two queries:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId
WHERE t2.aField <> 'C'
OR:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId and t2.aField <> 'C'
Is there a demonstrable difference between the two? Seems to me that the clause "t2.aField <> 'C'" will run on every row in t2 that meets the join criteria regardless. Am I incorrect?
Update: I did an "Include Actual Execution Plan" in SQL Server. The two queries were identical.
I prefer to use the Join criteria for explaining how the tables are joined together.
So I would place the additional clause in the where section.
I hope (although I have no stats), that SQL Server would be clever enough to find the optimal query plan regardless of the syntax you use.
HOWEVER, if you have indexes which also have id, and aField in them, I would suggest placing them together in the inner join criteria.
It would be interesting to see the query plan's in these 2 (or 3) scenarios, and see what happens. Nice question.
There is a difference. You should do an EXPLAIN PLAN for both of the selects and see it in detail.
As for a simplier explanation:
The WHERE clause gets executed only after the joining of the two tables, so it executes for each row returned from the join and not nececerally every one from table2.
Performance wise its best to eliminate unwanted results early on so there should be less rows for joins, where clauses or other operations to deal with later on.
In the second example, there are 2 columns that have to be same for the rows to be joined together so it usually will give different results than the first one.
It depends.
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId
WHERE
t2.SomeValue IS NULL
is different from
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId AND t2.SomeValue IS NULL
It is different because the former crosses out all records from t2 that have NULL in t2.SomeValue and those from t1 that are not referenced in t2. The latter crosses out only the t2 records that have NULL in t2.SomeValue.
Just use the ON clause for the join condition and the WHERE clause for the filter.
Unless moving the join condition to the where clause changes the meaning of the query (like in the left join example above), then it doesn't matter where you put them. SQL will re-arrange them, and as long as they are provably equivalent, you'll get the same query.
That being said, I think it's more of a logical / readability thing. I usually put anything that relates two tables in the join, and anything that filters in the where.
I'd prefer first query. SQL server will use the best join type for your query based on indexes you have, after that will apply WHERE clause. But you can run both queries at the same time, look at execution plans, compare and choose the fastest (optimize adding indexes also).
unless you are working on a single-user app or something similarly small that creates trivial load, the only considerations that mean anything is how the server will process your query.
The answers that mention query plans give good advice.
In addition, set io statistics on to get an idea of how many reads your query will generate (I especially love Azder's post).
Think of every DB server as a pump of data from disk to client. That pump goes faster if it performs only the IO needed to get the job done. If the data is in cache it will be even faster. But you don't want to be reading more than you need from disk - that will result in crowding out of your cache useful data for no good reason.