The purpose of SQL's EXISTS and NOT EXISTS

Every now and then I see these being used, but it never seems to be anything that can't be done equally well, if not better, by using a normal join or subquery.
I see them as being misleading (they're arguably harder to accurately visualize compared to conventional joins and subqueries), often misunderstood (e.g. using SELECT * will behave the same as SELECT 1 in the EXISTS/NOT EXISTS subquery), and from my limited experience, slower to execute.
Can someone describe and/or provide me an example where they are best suited or where there is no option other than to use them? Note that since their execution and performance are likely platform dependent, I'm particularly interested in their use in MySQL.

Every now and then I see these being used, but it never seems to be anything that can't be done equally well, if not better, by using a normal join or subquery.
This article (though SQL Server related):
IN vs. JOIN vs. EXISTS
may be of interest to you.
In a nutshell, JOIN is a set operation, while EXISTS is a predicate.
In other words, these queries:
SELECT *
FROM a
JOIN b
ON some_condition(a, b)
vs.
SELECT *
FROM a
WHERE EXISTS
(
SELECT NULL
FROM b
WHERE some_condition(a, b)
)
are not the same: the former can return more than one record from a, while the latter cannot.
Their counterparts, NOT EXISTS vs. LEFT JOIN / IS NULL, are the same logically but not performance-wise.
In fact, the former may be more efficient in SQL Server:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
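The "more than one record from a" point is easy to verify with a small runnable sketch (SQLite through Python's sqlite3 module; the table and column names here are invented for the demo): when b has two rows matching a single row of a, the JOIN returns that row twice, while EXISTS returns it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1);
    INSERT INTO b VALUES (1), (1);  -- two rows in b match the single row in a
""")

join_rows = conn.execute(
    "SELECT a.id FROM a JOIN b ON a.id = b.a_id").fetchall()
exists_rows = conn.execute(
    "SELECT a.id FROM a WHERE EXISTS "
    "(SELECT NULL FROM b WHERE a.id = b.a_id)").fetchall()

print(len(join_rows))    # 2 -- JOIN yields one row per match
print(len(exists_rows))  # 1 -- EXISTS is a yes/no predicate per row of a
```

The same experiment works on MySQL or any other engine; only the connection boilerplate changes.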

EXISTS is useful if the main query returns far fewer rows than the table in which you want to find them. Example:
SELECT st.State
FROM states st
WHERE st.State LIKE 'N%' AND EXISTS(SELECT 1 FROM addresses a WHERE a.State = st.State)
Doing this with a join would be much slower. Or, a better example: when you want to check whether an item exists in one of several tables.

You can't [easily] use a join in an UPDATE statement, so WHERE EXISTS works excellently there:
UPDATE mytable t
SET columnX = 'SomeValue'
WHERE EXISTS
    (SELECT 1
     FROM myothertable ot
     WHERE ot.columnA = t.columnY
       AND ot.columnB = 'XYX');
Edit: Basing this on Oracle more than MySQL, and yes there are ways to do it with an inline view, but IMHO this is cleaner.
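A minimal runnable version of the same pattern (SQLite through Python's sqlite3; the table and column names mirror the snippet above and are otherwise invented). Note that SQLite doesn't allow an alias on the updated table, so the subquery references mytable by name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (columnY INTEGER, columnX TEXT);
    CREATE TABLE myothertable (columnA INTEGER, columnB TEXT);
    INSERT INTO mytable VALUES (1, 'old'), (2, 'old');
    INSERT INTO myothertable VALUES (1, 'XYX');  -- only columnY = 1 has a match
""")

conn.execute("""
    UPDATE mytable
    SET columnX = 'SomeValue'
    WHERE EXISTS (SELECT 1
                  FROM myothertable ot
                  WHERE ot.columnA = mytable.columnY
                    AND ot.columnB = 'XYX')
""")

rows = conn.execute(
    "SELECT columnY, columnX FROM mytable ORDER BY columnY").fetchall()
print(rows)  # [(1, 'SomeValue'), (2, 'old')] -- only the matched row changed
```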

Related

Most efficient SQL Statement: Exists vs IN

I have three tables that I need to JOIN to get values from two columns.
These columns are GRN_STATUS and STATUS. I have written some SQL that achieves the desired result, but I've been advised that using IN is very inefficient and that I should use EXISTS instead.
Is this true in my situation? And what would a solution using EXISTS instead of IN look like?
SQL:
SELECT c.GRN_STATUS, a.STATUS
FROM
TableA a
INNER JOIN
TableB b
ON a.ORD_NO = b.ORD_NO
AND a.COMPANY_ID = b.COMPANY_ID
INNER JOIN
TableC c
ON b.GRN_NO = c.GRN_NO
AND b.COMPANY_ID = c.COMPANY_ID
AND a.STATUS IN ( 'B', 'C', 'D', 'E' )
AND c.GRN_STATUS = 'A';
In general, it depends on the implementation in the DBMS.
EXISTS mostly stops and returns at the first match, so it COULD be more efficient, but it makes no sense when you have a list of constants.
Since SQL is a declarative language, you can't tell the DBMS the how, just the what. You describe the expected result and it is up to the server to try to find the most efficient way to fulfill your request.
The way the DBMS finds the efficient algorithm is based on several things including the amount and the distribution of the data, the actual statistics, the expected resources needed, etc.
So EXISTS may perform better on a huge table, while having no effect on smaller ones (or vice versa).
Your best bet is to actually check the estimated query plans or try them out.
My personal view is to use EXISTS when no data is required and JOIN when data is required. IN is for constant lists.
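For the common case where IN takes a subquery rather than a constant list, the IN and EXISTS forms return the same rows, so the choice really is about readability and the optimizer. A small sketch (SQLite via Python's sqlite3; the orders/grn tables are invented stand-ins for the TableA/TableC pattern above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (ord_no INTEGER, status TEXT);
    CREATE TABLE grn (ord_no INTEGER, grn_status TEXT);
    INSERT INTO orders VALUES (1, 'B'), (2, 'Z'), (3, 'C');
    INSERT INTO grn VALUES (1, 'A'), (3, 'A'), (3, 'A');  -- note the duplicate
""")

in_rows = conn.execute("""
    SELECT ord_no FROM orders
    WHERE ord_no IN (SELECT ord_no FROM grn WHERE grn_status = 'A')
    ORDER BY ord_no
""").fetchall()

exists_rows = conn.execute("""
    SELECT ord_no FROM orders o
    WHERE EXISTS (SELECT 1 FROM grn g
                  WHERE g.ord_no = o.ord_no AND g.grn_status = 'A')
    ORDER BY ord_no
""").fetchall()

print(in_rows)      # [(1,), (3,)]
print(exists_rows)  # [(1,), (3,)] -- same rows, even with duplicates in grn
```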

Is using “NOT EXISTS” considered to be bad SQL practise?

I have heard a lot of people over the years say that:
"join operators are preferred over NOT EXISTS"
Why?
In MySQL, Oracle, SQL Server and PostgreSQL, NOT EXISTS is of the same efficiency or even more efficient than LEFT JOIN / IS NULL.
While it may seem that "the inner query should be executed for each record from the outer query" (which seems to be bad for NOT EXISTS and even worse for NOT IN, since the latter query is not even correlated), it may be optimized just as well as all other queries are optimized, using appropriate anti-join methods.
In SQL Server, actually, LEFT JOIN / IS NULL may be less efficient than NOT EXISTS / NOT IN in case of unindexed or low cardinality column in the inner table.
It is often heard that MySQL is "especially bad in treating subqueries".
This roots from the fact that MySQL is not capable of any join methods other than nested loops, which severely limits its optimization abilities.
The only case when a query would benefit from rewriting subquery as a join would be this:
SELECT *
FROM big_table
WHERE big_table_column IN
(
SELECT small_table_column
FROM small_table
)
small_table will not be queried completely for each record in big_table: though the subquery does not seem to be correlated, the query optimizer will implicitly correlate it and in fact rewrite it to an EXISTS (using index_subquery to search for the first match if small_table_column is indexed).
But big_table would always be leading, which makes the query complete in big * LOG(small) rather than small * LOG(big) reads.
This could be rewritten as
SELECT DISTINCT bt.*
FROM small_table st
JOIN big_table bt
ON bt.big_table_column = st.small_table_column
However, this won't improve NOT IN (as opposed to IN). In MySQL, NOT EXISTS and LEFT JOIN / IS NULL are almost the same, since with nested loops the left table should always be leading in a LEFT JOIN.
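The rewrite above preserves the result only because of the DISTINCT; without it, a duplicate value in small_table would duplicate a row of big_table. A runnable check of the equivalence (SQLite via Python's sqlite3; toy tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big_table (big_table_column INTEGER);
    CREATE TABLE small_table (small_table_column INTEGER);
    INSERT INTO big_table VALUES (1), (2), (3), (4);
    INSERT INTO small_table VALUES (2), (2), (3);  -- note the duplicate 2
""")

in_rows = conn.execute("""
    SELECT big_table_column FROM big_table
    WHERE big_table_column IN (SELECT small_table_column FROM small_table)
    ORDER BY big_table_column
""").fetchall()

join_rows = conn.execute("""
    SELECT DISTINCT bt.big_table_column
    FROM small_table st
    JOIN big_table bt ON bt.big_table_column = st.small_table_column
    ORDER BY bt.big_table_column
""").fetchall()

print(in_rows, join_rows)  # both [(2,), (3,)] -- DISTINCT absorbs the duplicate
```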
You may want to read these articles:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL
IN vs. JOIN vs. EXISTS: Oracle
IN vs. JOIN vs. EXISTS (SQL Server)
It may have to do with the optimization process... NOT EXISTS implies a subquery, and "optimizers" usually don't do subqueries justice. On the other hand, joins can be dealt with more easily...
I think this is a MySQL-specific case. MySQL does not optimize subqueries in IN / NOT IN / ANY / NOT EXISTS clauses, and actually performs the subquery for each row matched by the outer query. Because of this, in MySQL you should use a join. In PostgreSQL, however, you can just use the subquery.

Where does the practice "exists (select 1 from ...)" come from?

The overwhelming majority of people support my own view that there is no difference between the following statements:
SELECT * FROM tableA WHERE EXISTS (SELECT * FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT y FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT NULL FROM tableB WHERE tableA.x = tableB.y)
Yet today I came face-to-face with the opposite claim when in our internal developer meeting it was advocated that select 1 is the way to go and select * selects all the (unnecessary) data, hence hurting performance.
I seem to remember that there was some old version of Oracle or something where this was true, but I cannot find references to that. So, I'm curious - how was this practice born? Where did this myth originate from?
Added: Since some people insist on having evidence that this is indeed a false belief, here is a Google query which shows plenty of people saying so. If you're too lazy, check this direct link, where one guy even compares execution plans to find that they are equivalent.
The main part of your question is - "where did this myth come from?"
So, to answer that: I guess one of the first performance hints people learn with SQL is that SELECT * is inefficient in most situations. The fact that it isn't inefficient in this specific situation is somewhat counterintuitive, so it's not surprising that people are skeptical about it. But some simple research or experiments should be enough to banish most myths. Although human history kinda shows that myths are quite hard to banish.
As a demo, try these
SELECT * FROM tableA WHERE EXISTS (SELECT 1/0 FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT CAST('bollocks' as int) FROM tableB WHERE tableA.x = tableB.y)
Now read the ANSI standard. ANSI-92, page 191, case 3a
If the <select list> "*" is simply contained in a <subquery>
that is immediately contained in an <exists predicate>, then
the <select list> is equivalent to a <value expression> that
is an arbitrary <literal>.
Finally, the behaviour of most RDBMSs is to ignore the * in the EXISTS clause. As per this question yesterday (Sql Server 2005 - Insert if not exists), this doesn't work on SQL Server 2000, but I know it does on SQL Server 2005+.
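You can confirm the equivalence of all four select-list variants yourself; here is a sketch on SQLite through Python's sqlite3 (table names taken from the question, data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (x INTEGER);
    CREATE TABLE tableB (y INTEGER);
    INSERT INTO tableA VALUES (1), (2), (3);
    INSERT INTO tableB VALUES (2), (3), (3);
""")

# Try all four select-list variants inside EXISTS.
results = [
    conn.execute(
        f"SELECT x FROM tableA WHERE EXISTS "
        f"(SELECT {sel} FROM tableB WHERE tableA.x = tableB.y) ORDER BY x"
    ).fetchall()
    for sel in ("*", "y", "1", "NULL")
]

print(results[0])                             # [(2,), (3,)]
print(all(r == results[0] for r in results))  # True -- all four are identical
```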
For SQL Server Conor Cunningham from the Query Optimiser team explains why he typically uses SELECT 1
The QP will take and expand all *'s early in the pipeline and bind them to objects (in this case, the list of columns). It will then remove unneeded columns due to the nature of the query.
So for a simple EXISTS subquery like this:
SELECT col1 FROM MyTable WHERE EXISTS (SELECT * FROM Table2 WHERE MyTable.col1=Table2.col2)
The * will be expanded to some potentially big column list and then it will be determined that the semantics of the EXISTS does not require any of those columns, so basically all of them can be removed.
"SELECT 1" will avoid having to examine any unneeded metadata for that table during query compilation.
However, at runtime the two forms of the query will be identical and will have identical runtimes.
Edit: However I have looked at this in some detail since posting this answer and come to the conclusion that SELECT 1 does not avoid this column expansion. Full details here.
This question has an answer that says it was some version of MS Access that actually did not ignore the field of the SELECT clause. I have done some Access development, and I have heard that SELECT 1 is best practice, so this seems very likely to me to be the source of the "myth."
Performance of SQL EXISTS usage variants

SQL: Is a query like this OK or is there a more efficient way of doing it, like using a join?

I often find myself wanting to write an SQL query like the following:
SELECT body
FROM node_revisions
where vid = (SELECT vid
FROM node
WHERE nid = 4);
I know that there are joins and stuff you could do, but they seem to make things more complicated. Are joins a better way to do it? Is it more efficient? Easier to understand?
Joins tend to be more efficient since databases are written with set operations in mind (and joins are set operations).
However, performance will vary from database to database, how the tables are structured, the amount of data in them and how much will be returned by the query.
If the amount of data is small, I would use a subquery like yours rather than a join.
Here is what a join would look like:
SELECT body
FROM node_revisions nr
INNER JOIN node n
ON nr.vid = n.vid
WHERE n.nid = 4
I would not use the query you posted, as there is a chance of more than one node record with nid = 4, which would cause it to fail.
I would use:
SELECT body
FROM node_revisions
WHERE vid IN (SELECT vid
FROM node
WHERE nid = 4);
Is this more readable or understandable? In this case, it's a matter of personal preference.
I think joins are easier to understand and can be more efficient. Your case is pretty simple, so it is probably a toss-up. Here is how I would write it:
SELECT body
FROM node_revisions
inner join node
on (node_revisions.vid = node.vid)
WHERE node.nid = 4
The answer to any performance-related question in databases is "it depends", and we're short on details in the OP. Knowing no specifics about your situation... (thus, these are general rules of thumb)
Joins are better and easier to understand
If for some reason you need multiple column keys (fishy), you can continue to use a join and simply tack on another expression to the join condition.
If in the future you really do need to join auxiliary data, the join framework is already there.
It makes it more clear exactly what you're joining on and where indexes should be implemented.
Use of joins makes you better at joins and better at thinking about joins.
Joins are clear about what tables are in play
Written queries have nothing to do with efficiency*
The queries you write and what actually gets run have little to do with one another. There are many ways to write a query but only so few ways to fetch the data, and it's up to the query engine to decide. This relates mostly to indexes. It's very possible to write four queries that look totally different but internally do the same thing.
(* It's possible to write a horrible query that is inefficient but it takes a special kind of crazy to do that.)
select
body
from node_revisions nr
join node n
on n.vid = nr.vid
where n.nid = 4
A join is interesting:
select body
from node_revisions nr
join node n on nr.vid = n.vid
where n.nid = 4
But you can also express a join without a join [!]:
select body
from node_revisions nr, node n
where n.nid = 4 and nr.vid = n.vid
Interestingly enough, SQL Server gives a slightly different query plan for the two queries: while the join has a clustered index scan, the "join without a join" has a clustered index seek in its place, which indicates it's better, at least in this case!
select
body
from node_revisions A
where exists (select 'x'
from Node B
Where A.Vid = B.Vid and B.NID=4)
I don't see anything wrong with what you wrote, and a good optimizer may even change it to a join if it sees fit.
SELECT body
FROM node_revisions
WHERE vid =
(
SELECT vid
FROM node
WHERE nid = 4
)
This query is logically equivalent to a join if and only if nid is a PRIMARY KEY or is covered by a UNIQUE constraint.
Otherwise, the queries are not equivalent: a join will always succeed, while the subquery will fail if there is more than one row in node with nid = 4.
If nid is a PRIMARY KEY, then the JOIN and the subquery will have same performance.
In case of a join, node will be made leading
In case of a subquery, the subquery will be executed once and transformed into a const on parsing stage.
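The failure mode is worth seeing once. SQLite (used here through Python's sqlite3) is actually more forgiving than most engines: instead of raising an error, it silently uses only the subquery's first row, whereas MySQL and PostgreSQL would abort with a "subquery returns more than one row" error. Either way, the hazard of = versus IN is visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER, vid INTEGER);
    CREATE TABLE node_revisions (vid INTEGER, body TEXT);
    INSERT INTO node VALUES (4, 10), (4, 20);  -- nid = 4 is NOT unique
    INSERT INTO node_revisions VALUES (10, 'first'), (20, 'second');
""")

eq_rows = conn.execute("""
    SELECT body FROM node_revisions
    WHERE vid = (SELECT vid FROM node WHERE nid = 4)
""").fetchall()

in_rows = conn.execute("""
    SELECT body FROM node_revisions
    WHERE vid IN (SELECT vid FROM node WHERE nid = 4)
""").fetchall()

print(len(eq_rows))  # 1 -- SQLite quietly keeps only the subquery's first row
                     # (MySQL/PostgreSQL raise an error here instead)
print(len(in_rows))  # 2 -- IN handles multiple matches as intended
```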
The latest MySQL 6.x code will automatically convert that IN expression into an INNER JOIN using a semi-join subquery optimization, making the 2 statements largely equivalent:
http://forge.mysql.com/worklog/task.php?id=3740
But actually writing it out is pretty simple to do, because INNER JOIN is the default join type, and doing this wouldn't rely on the server optimizing it away (which it might decide not to for some reason, and which wouldn't necessarily be portable). All things being equal, why not go with:
select body from node_revisions r, node n where r.vid = n.vid and n.node = 4

IN vs. JOIN with large rowsets

I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article in my blog summarizes both my answer and my comments to another answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent: they can yield different results if your table b is not key-preserved (i.e. if the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
Neither of these methods reevaluates the whole subquery each time.
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all RDBMS's of the big four.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a JOIN b ON a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This removes the duplicates that could be generated by the JOIN, but runs just as fast, if not faster.
Speaking from experience on a Table with 49,000,000 rows I would recommend LEFT OUTER JOIN.
Using IN, or EXISTS Took 5 minutes to complete where the LEFT OUTER JOIN finishes in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINS. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered,
                 t1id int references t1(t1id))
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop, this difference is expected because as I mentioned above the join has different semantics.
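The same experiment is easy to replay outside SQL Server; here is a sketch of the test script translated to SQLite through Python's sqlite3 (AUTOINCREMENT stands in for IDENTITY), which shows the row counts diverging exactly as described:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (t1id INTEGER PRIMARY KEY);
    CREATE TABLE t2 (t2id INTEGER PRIMARY KEY AUTOINCREMENT,
                     t1id INTEGER REFERENCES t1(t1id));
    INSERT INTO t1 (t1id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO t2 (t1id) VALUES (1), (2), (2), (3), (4);
""")

in_rows = conn.execute(
    "SELECT t1id FROM t1 WHERE t1id IN (SELECT t1id FROM t2)").fetchall()
exists_rows = conn.execute(
    "SELECT t1id FROM t1 WHERE EXISTS "
    "(SELECT 1 FROM t2 WHERE t2.t1id = t1.t1id)").fetchall()
join_rows = conn.execute(
    "SELECT t1.t1id FROM t1 JOIN t2 ON t1.t1id = t2.t1id "
    "ORDER BY t1.t1id").fetchall()

print(in_rows)      # [(1,), (2,), (3,), (4,)]
print(exists_rows)  # [(1,), (2,), (3,), (4,)] -- same as IN
print(join_rows)    # [(1,), (2,), (2,), (3,), (4,)] -- the duplicate shows through
```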
From MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join would be more efficient than the IN subquery; however the SQL*Server optimizer normally results in no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever needs to be migrated in the future, the database engine may not be so forgiving (for example using a join instead of an IN subquery makes a huge difference in MySql).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
I've always been a supporter of the IN methodology. This link contains details of a test conducted on PostgreSQL.
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php