I have heard a lot of people over the years say that:
"join" operators are preferred over “NOT EXISTS”
Why?
In MySQL, Oracle, SQL Server and PostgreSQL, NOT EXISTS is of the same efficiency or even more efficient than LEFT JOIN / IS NULL.
While it may seem that "the inner query should be executed for each record from the outer query" (which seems to be bad for NOT EXISTS and even worse for NOT IN, since the latter query is not even correlated), it may be optimized just as well as all other queries are optimized, using appropriate anti-join methods.
In SQL Server, actually, LEFT JOIN / IS NULL may be less efficient than NOT EXISTS / NOT IN in case of unindexed or low cardinality column in the inner table.
It is often heard that MySQL is "especially bad in treating subqueries".
This roots from the fact that MySQL is not capable of any join methods other than nested loops, which severely limits its optimization abilities.
The only case when a query would benefit from rewriting subquery as a join would be this:
SELECT *
FROM big_table
WHERE big_table_column IN
(
SELECT small_table_column
FROM small_table
)
small_table will not be queried completely for each record in big_table: though it does not seem to be correlated, it will be implicitly correlated by the query optimizer and in fact rewritten to an EXISTS (using index_subquery to search for the first much if needed if small_table_column is indexed)
But big_table would always be leading, which makes the query complete in big * LOG(small) rather than small * LOG(big) reads.
This could be rewritten as
SELECT DISTINCT bt.*
FROM small_table st
JOIN big_table bt
ON bt.big_table_column = st.small_table_column
However, this won't improve NOT IN (as opposed to IN). In MySQL, NOT EXISTS and LEFT JOIN / IS NULL are almost the same, since with nested loops the left table should always be leading in a LEFT JOIN.
You may want to read these articles:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL
IN vs. JOIN vs. EXISTS: Oracle
IN vs. JOIN vs. EXISTS (SQL Server)
It may have to do with the optimization process... NOT EXISTS implies a subquery, and "optimizers" usually don't do subqueries justice. On the other hand, joins can be dealt with more easily...
I think this is a MySQL specific case. MySQL do not optimize subquery in IN / not in / any / not exists clauses, and actually performs the subquery for each row matched by the outer query. Because of this in MySQL, you should use join. In PostgreSQL however, you can just use subquery.
Related
I have recently noticed that SQL Server 2016 appears to be ignoring left joins if any column is not used in the select or where clause. Also not found in Actual execution plan.
This is good for if anyone added extra join but still not affecting performance.
I have query that took 9 sec, if I add column in select clause for Left join tables but without that only 1 sec.
Can anyone please check and suggest, Is that true or not?
Query with Actual execution plan. You can see there is no any table from left join in execution plan.
I'm not 100% sure what the question is asking, but a SQL optimizer can ignore left join. Consider this type of query:
select a.*
from a left join
b
on a.b_id = b.id;
If b.id is declared as unique (or equivalently a primary key) then the above query returns exactly the same result set as:
select a.*
from a;
I am not per se aware that SQL Server started putting this optimization in 2016. But the optimization is perfectly valid and (I believe) other optimizers do implement it.
Remember, SQL is a declarative language, not a procedural language. The SQL query describes the result set, not how it is produced.
If you have a left join and your matching condition don't return any data from the joined table it will return data as inner join return, when select statement does not contains columns from right tables. Not only in ms server 2016 but most of the DB's.
Left join reduces the performance of the query if there are large amount of data available in join tables.
What is an explanation of the mechanics behind the following Query?
It looks like a powerful method of doing dynamic filtering on a table.
CREATE TABLE tbl (ID INT, amt INT)
INSERT tbl VALUES
(1,1),
(1,1),
(1,2),
(1,3),
(2,3),
(2,400),
(3,400),
(3,400)
SELECT *
FROM tbl T1
WHERE EXISTS
(
SELECT *
FROM tbl T2
WHERE
T1.ID = T2.ID AND
T1.amt < T2.amt
)
Live test of it here on SQL Fiddle
You can usually convert correlated subqueries into an equivalent expression using explicit joins. Here is one way:
SELECT distinct t1.*
FROM tbl T1 left outer join
tbl t2
on t1.id = t2.id and
t1.amt < t2.amt
where t2.id is null
Martin Smith shows another way.
The question of whether they are a "powerful way of doing dynamic filtering" is true, but (usually) unimportant. You can do the same filtering using other SQL constructs.
Why use correlated subqueries? There are several positives and several negatives, and one important reason that is both. On the positive side, you do not have to worry about "multiplication" of rows, as happens in the above query. Also, when you have other filtering conditions, the correlated subquery is often more efficient. And, sometimes using delete or update, it seems to be the only way to express a query.
The Achilles heel is that many SQL optimizers implement correlated subqueries as nested loop joins (even though do not have to). So, they can be highly inefficient at times. However, the particular "exists" construct that you have is often quite efficient.
In addition, the nature of the joins between the tables can get lost in nested subqueries, which complicated conditions in where clauses. It can get hard to understand what is going on in more complicated cases.
My recommendation. If you are going to use them on large tables, learn about SQL execution plans in your database. Correlated subqueries can bring out the best or the worst in SQL performance.
Possible Edit. This is more equivalent to the script in the OP:
SELECT distinct t1.*
FROM tbl T1 inner join
tbl t2
on t1.id = t2.id and
t1.amt < t2.amt
Let's translate this to english:
"Select rows from tbl where tbl has a row of the same ID and bigger amt."
What this does is select everything except the rows with maximum values of amt for each ID.
Note, the last line SELECT * FROM tbl is a separate query and probably not related to the question at hand.
As others have already pointed out, using EXISTS in a correlated subquery is essentially telling the database engine "return all records for which there is a corresponding record which meets the criteria specified in the subquery." But there's more.
The EXISTS keyword represents a boolean value. It could also be taken to mean "Where at least one record exists that matches the criteria in the WHERE statement." In other words, if a single record is found, "I'm done, and I don't need to search any further."
The efficiency gain that CAN result from using EXISTS in a correlated subquery comes from the fact that as soon as EXISTS returns TRUE, the subquery stops scanning records and returns a result. Similarly, a subquery which employs NOT EXISTS will return as soon as ANY record matches the criteria in the WHERE statement of the subquery.
I believe the idea is that the subquery using EXISTS is SUPPOSED to avoid the use of nested loop searches. As #Gordon Linoff states above though, the query optimizer may or may not perform as desired. I believe MS SQL Server usually takes full advantage of EXISTS.
My understanding is that not all queries benefit from EXISTS, but often, they will, particularly in the case of simple structures such as that in your example.
I may have butchered some of this, but conceptually I believe it's on the right track.
The caveat is that if you have a performance-critical query, it would be best to evaluate execution of a version using EXISTS with one using simple JOINS as Mr. Linoff indicates. Depending on your database engine, table structure, time of day, and the alignment of the moon and stars, it is not cut-and-dried which will be faster.
Last note - I agree with lc. When you use SELECT * in your subquery, you may well be negating some or all of any performance gain. SELECT only the PK field(s).
Every now and then I see these being used, but it never seems to be anything that can't be performed as equally well, if not better, by using a normal join or subquery.
I see them as being misleading (they're arguably harder to accurately visualize compared to conventional joins and subqueries), often misunderstood (e.g. using SELECT * will behave the same as SELECT 1 in the EXISTS/NOT EXISTS subquery), and from my limited experience, slower to execute.
Can someone describe and/or provide me an example where they are best suited or where there is no option other than to use them? Note that since their execution and performance are likely platform dependent, I'm particularly interested in their use in MySQL.
Every now and then I see these being used, but it never seems to be anything that can't be performed as equally well, if not better, by using a normal join or subquery.
This article (though SQL Server related):
IN vs. JOIN vs. EXISTS
may be of interest to you.
In a nutshell, JOIN is a set operation, while EXISTS is a predicate.
In other words, these queries:
SELECT *
FROM a
JOIN b
ON some_condition(a, b)
vs.
SELECT *
FROM a
WHERE EXISTS
(
SELECT NULL
FROM b
WHERE some_condition(a, b)
)
are not the same: the former can return more than one record from a, while the latter cannot.
Their counterparts, NOT EXISTS vs. LEFT JOIN / IS NULL, are the same logically but not performance-wise.
In fact, the former may be more efficient in SQL Server:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
if the main query returned much less rows then the table where you want to find them. example:
SELECT st.State
FROM states st
WHERE st.State LIKE 'N%' AND EXISTS(SELECT 1 FROM addresses a WHERE a.State = st.State)
doing this with a join will be much slower. or a better example, if you want to search if a item exists in 1 of multiple tables.
You can't [easily] use a join in an UPDATE statement, so WHERE EXISTS works excellently there:
UPDATE mytable t
SET columnX = 'SomeValue'
WHERE EXISTS
(SELECT 1
FROM myothertable ot
WHERE ot.columnA = t.columnY
AND ot.columnB = 'XYX'
);
Edit: Basing this on Oracle more than MySQL, and yes there are ways to do it with an inline view, but IMHO this is cleaner.
Do these two queries differ from each other?
Query 1:
SELECT * FROM Table1, Table2 WHERE Table1.Id = Table2.RefId
Query 2:
SELECT * FROM Table1 INNER JOIN Table2 ON Table1.Id = Table2.RefId
I analysed both methods and they clearly produced the same actual execution plans. Do you know any cases where using inner joins would work in a more efficient way. What is the real advantage of using inner joins rather than approaching the manner of "Query 1"?
The two statements you have provided are functionally equivalent to one another.
The variation is caused by differing SQL syntax standards.
For a really exciting read, you can lookup the various SQL standards by visiting the following Wikipedia link. On the right hand side are references and links to the various dialects/standards of SQL.
http://en.wikipedia.org/wiki/SQL
These SQL statements are synonymous, though specifying the INNER JOIN is the preferred method and follows ISO format. I prefer it as well because it limits the plumbing of joining the tables from your where clause and makes the goal of your query clearer.
These will result in an identical query plan, but the INNER JOIN, OUTER JOIN, CROSS JOIN keywords are prefered because they add clarity to the code.
While you have the ability to specifiy join hints using the keywords in the FROM clause, you can do more complicated joins in the WHERE clause. But otherwise, there will be no difference in query plan.
I will also add that the first syntax is much more subject to inadvertent cross joins as the queries get complicated. Further the left and right joins in this syntax do not work properly in SQL server and should never be used. Mixing the syntax when you add a left join can also cause problems where the query does not correctly return the results. The syntax in the first example has been outdated for 17 years, I see no reason to ever use it.
Query 1 is considered an old syntax style and its use is discouraged. You will run into problems with you use LEFT and Right joins using that syntax style. Also on SQL Server you can have problems mixing those two different syles together in queries that use view of different formats.
I have found a significant difference using the LEFT OUTER JOINS and putting the conditions on the joined table in the ON clause rather than the WHERE clause. Once you put a condition on the joined table in the WHERE clause, you defeat the left outer join.
When I was using Oracle, I used the archaic (+) after the joined table (with all conditions including join conditions in the WHERE clause)because that's what I knew. When we became a SQL Server shop, I was forced to use LEFT OUTER JOINs, and I found they didn't work as before until I discovered this behavior. Here's an example:
select NC.*,
IsNull(F.STRING_VAL, 'NONE') as USER_ID,
CO.TOTAL_AMT_ORDERED
from customer_order CO
INNER JOIN VTG_CO_NET_CHANGE NC
ON NC.CUST_ORDER_ID=CO.ID
LEFT OUTER JOIN USER_DEF_FIELDS F
ON F.DOCUMENT_ID = CO.ID and
F.PROGRAM_ID='VMORDENT' and
F.ID='UDF-0000072' and
F.DOCUMENT_ID is not null
where NC.acct_year=2017
I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article in my blog summarizes both my answer and my comments to another answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key preserved (i. e. the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
Neither of these methods reevaluates the whole subquery each time.
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all RDBMS's of the big four.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a JOIN b a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This remove the duplicates that could be generated by the JOIN, but runs just as fast if not faster
Speaking from experience on a Table with 49,000,000 rows I would recommend LEFT OUTER JOIN.
Using IN, or EXISTS Took 5 minutes to complete where the LEFT OUTER JOIN finishes in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINS. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
,t1id int references t1(t1id)
)
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop, this difference is expected because as I mentioned above the join has different semantics.
From MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that
include subqueries can be
alternatively formulated as joins.
Other questions can be posed only with
subqueries. In Transact-SQL, there is
usually no performance difference
between a statement that includes a
subquery and a semantically equivalent
version that does not. However, in
some cases where existence must be
checked, a join yields better
performance. Otherwise, the nested
query must be processed for each
result of the outer query to ensure
elimination of duplicates. In such
cases, a join approach would yield
better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join would be more efficient than the IN subquery; however the SQL*Server optimizer normally results in no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever needs to be migrated in the future, the database engine may not be so forgiving (for example using a join instead of an IN subquery makes a huge difference in MySql).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
Ive always been a supporter of the IN methodology. This link contains details of a test conducted in PostgresSQL.
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php