SQL: IN vs EXISTS - sql

I read that normally you should use EXISTS when the results of the subquery are large, and IN when the subquery results are small.
But it would seem to me that it's also relevant if a subquery has to be re-evaluated for each row, or if it can be evaluated once for the entire query.
Consider the following example of two equivalent queries:
SELECT * FROM t1
WHERE attr IN
(SELECT attr FROM t2
WHERE attr2 = ?);
SELECT * FROM t1
WHERE EXISTS
(SELECT * FROM t2
WHERE t1.attr = t2.attr
AND attr2 = ?);
The former subquery can be evaluated once for the entire query, the latter has to be evaluated for each row.
Assume that the results of the subquery are very large. Which would be the best way to write this?

This is a good question. Especially as in Oracle you can convert every EXISTS clause into an IN clause and vice versa, because Oracle's IN clause can deal with tuples (where (abc) in (select x,y,z from ...), which most other dbms cannot.
And your reasoning is good. Yes, with the IN clause you suggest to load all the subquery's data once instead of looking up the records in a loopg. However this is just partly true, because:
As good as it seems to get all subquery data selected just once, the outer query must loop through the resulting array for every record. This can be quite slow, because it's just an array. If Oracle looks up data in a table instead there are often indexes to help it, so the nested loop with repeated table lookups is eventually faster.
Oracle's optimizer re-writes queries. So it can come to the same execution plan for the two statements or even get to quite unexpected plans. You never know ;-)
Oracle might decide not to loop at all. It may decide for a hash join instead, which works completely different and is usually very effective.
Having said this, Oracle's optimizer should notice that the two statements are exactly the same actually and should generate the same execution plan. But experience shows that the optimizer sometimes doesn't notice, and quite often the optimizer does better with the EXISTS clause for whatever reason. (Not as much difference as in MySQL, but still, EXISTS seems preferable over IN in Oracle, too.)
So as to your question "Assume that the results of the subquery are very large. Which would be the best way to write this?", it is unlikely for the IN clause to be faster than the EXISTS clause.
I often like the IN clause better for its simplicity and mostly find it a bit more readable. But when it comes to performance, it is sometimes better to use EXISTS (or even outer joins for that matter).

Related

SQL Execution Order: does it exist or not?

I am really confused about the execution order in SQL. Basically, given any query (assume it's a complex one that has multiple JOINS, WHERE clauses, etc), is the query executed sequentially or not?
From the top answer at Order Of Execution of the SQL query, it seems like "SQL has no order of execution. ... The optimizer is free to choose any order it feels appropriate to produce the best execution time."
From the top answer at What's the execute order of the different parts of a SQL select statement?, in contrast, we see a clear execution order in the form
"
FROM
ON
OUTER
WHERE
...
"
I feel like I am missing something, but it seems as though the two posts are contradicting each other, and different articles online seem to support either one or the other.
But more fundamentally, what I wanted to know initially is this: Suppose we have a complex SQL query with multiple joins, INNER JOINs and LEFT JOINS, in a specific order. Is there going to be an order to the query, such that a later JOIN will apply to the result of an earlier join rather than to the initial table specified in the top FROM clause?
It is tricky. The short answer is: the DBMS will decide what order is best such that it produces the result that you have declared (remember, SQL is declarative, it does not prescribe how the query is to be computed).
But we can think of a "conceptual" order of execution that the DBMS will use to create the result. This conceptual order might be totally ignored by the DBMS, but if we (humans) follow it, we will get the same results as the DBMS. I see this as one of the benefits of a DBMS. Even if we suck and write an inefficient query, the DBMS might say, "no, no, this query you gave me sucks in terms of performance, I know how to do better" and most of the time, the DBMS is right. Sometimes it is not, and rewriting a query helps the DBMS find the "best" approach. This is very dependent of the DBMS of course...
This conceptual order help us we (humans) to understand how the DBMS executes a query. These are listed below.
First the order for non-aggregation:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the SELECT portion (report results, this is called projection).
If you use an aggregation function, without a group by then:
Do the FROM section. Includes any joins, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the aggregation function in the SELECT portion (converting all tuples of the result into one tuple). There is an implicit group by in this query.
If you use a group by:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Cluster subsets of the tuples according to the GROUP BY.
For each cluster of these tuples:
if there is a HAVING, do this predicate (similar to selection of the WHERE).Note that you can have access to aggregation functions.
For each cluster of these tuples output exactly one tuple such that:
Do the SELECT part of the query (similar to select in above aggregation, i.e. you can use aggregation functions).
Window functions happen during the SELECT stage (they take into consideration the set of tuples that would be output by the select at that stage).
There is one more kink:
if you have
select distinct ...
then after everything else is done, then remove DUPLICATED tuples from the results (i.e. return a set of tuples, not a list).
Finally, do the ORDER BY. The ORDER BY happens in all cases at the end, once the SELECT part has been done.
With respect to JOINS. As I mentioned above, they happen at the "FROM" part of the conceptual execution. The WHERE, GROUP BY, SELECT apply on the results of these operations. So you can think of these being the first phase of the execution of the query. If it contains a subquery, the process is recursive.
By the way, you can refer in an inner query to a relation in the outside context of the inner query, but not the other way around.
All of this is conceptual. In reality the DBMS might rewrite your query for the purpose of efficiency.
For example, assume R(a,b) and S(a,c). WHere S(a) is a foreign key that references R(A).
The query:
select b from R JOIN S using (a) where a > 10
can be rewritten by the DBMS to something similar to this:
select b FROM R JOIN (select a from s where a > 10) as T using (a);
or:
select b FROM (select * from R where a > 10) as T JOIN S using (a);
In fact, the DBMS does this all the time. It takes your query, and creates alternates queries. then estimates the execution time of each query and decides which one is the most likely to be the fastest. And then it executes it.
This is a fundamental process of query evaluation. Note that the 3 queries are identical in terms of results. But depending on the sizes of the relations, they might have very different execution times. For example, if R and S are huge, but very few tuples have a>0, then the join wastes time. Each query with a subselect might perform fast if that subselect matches very few tuples, but badly if they match a lot of tuples. This is the type of "magic" that happens inside the query evaluation engine of the DBMS.
You are confusing Order of execution with Logical Query Processing.
I did a quick google search and found a bunch of articles referring to Logical Query Processing as "order of execution". Let's clear this up.
Logical Query Processing
Logical Query Processing details the under-the-hood processing phases of a SQL Query... First the WHERE clause is evaluated for the optimizer to know where to get data from, then table operators, etc.
Understanding this will help you better design and tune queries. Logical query processing order will help you understand why you can reference a column by it's alias in an ORDER BY clause but not anywhere else.
Order of Execution
Consider this WHERE clause:
WHERE t1.Col1 = 'X'
AND t2.Col2 = 1
AND t3.Col3 > t2.Col4
The optimizer is not required to evaluate these predicates in any order; it can evaluate t2.Col2 = 1 first, then t1.Col1 = 'X'.... The optimizer, in some cases can evaluate joins in a different order than than you have presented in your query. When predicate logic dictates that the result will be the same, it is free to make (what it considers) the best choices for optimal performance.
Sadly there is not a lot about this topic out there. I do discuss this a little more here.
First there's the SQL query and the rules of SQL that apply to it. That's what in the other answers is referred to as "Logical query processing". With SQL you specify a result. The SQL standard does not allow you to specify how this result is reached.
Then there's the query optimizer. Based on statistics, heuristics, amount of available CPU, memory and other factors, it will determine the execution plan. It will evaluate how long the execution is expected to take. It will evaluate different execution plans to find the one that executes fastest. In that process, it can evaluate execution plans that use different indexes, and/or rearranges the join order, and/or leave out (outer) joins, etc. The optimizer has many tricks. The more expensive the best execution plan is expected to be, the more (advanced) execution plans will be evaluated. The end result is one (serial) execution plan and potentially a parallel execution plan.
All the evaluated execution plans will guarantee the correct result; the result that matches execution according to the "Logical query processing".
Finally, there's the SQL Server engine. After picking either the serial or parallel execution plan, it will execute it.
The other answers, whilst containing useful and interesting information, risk causing confusion in my view.
They all seem to introduce the notion of a "logical" order of execution, which differs from the actual order of execution, as if this is something special about SQL.
If someone asked about the order of execution of any ordinary language besides SQL, the answer would be "strictly sequential" or (for expressions) "in accordance with the rules of that language". I feel as though we wouldn't be having a long-winded exploration about how the compiler has total freedom to rearrange and rework any algorithm that the programmer writes, and distinguishing this from the merely "logical" representation in the source code.
Ultimately, SQL has a defined order of evaluation. It is the "logical" order referred to in other answers. What is most confusing to novices is that this order does not correspond with the syntactic order of the clauses in an SQL statement.
That is, a simple SELECT...FROM...WHERE...ORDER BY query would actually be evaluated by taking the table referred to in the from-clause, filtering rows according to the where-clause, then manipulating the columns (including filtering, renaming, or generating columns) according to the select-clause, and finally ordering the rows according to the order-by-clause. So clauses here are evaluated second, third, first, fourth, which is a disorderly pattern to any sensible programmer - the designers of SQL preferred to make it correspond more in their view to the structure of something spoken in ordinary English ("tell me the surnames from the register!").
Nevertheless, when the programmer writes SQL, they are specifying the canonical method by which the results are produced, same as if they write source code in any other language.
The query simplification and optimisation that database engines perform (like that which ordinary compilers perform) would be a completely separate topic of discussion, if it hadn't already been conflated. The essence of the situation on this front, is that the database engine can do whatever it damn well likes with the SQL you submit, provided that the data it returns to you is the same as if it had followed the evaluation order defined in SQL.
For example, it could sort the results first, and then filter them, despite this order of operations being clearly different to the order in which the relevant clauses are evaluated in SQL. It can do this because if you (say) have a deck of cards in random order, and go through the deck and throw away all the aces, and then sort the deck into standard order, the outcome (in terms of the final content and order of the deck) is no different than if you sort the deck into standard order first, and then go through and throw away all the aces. But the full details and rationale of this behaviour would be for a separate question entirely.

Will SQL Server be smart enough to not execute expensive queries if it is not needed ? (short-circuiting)

So SQL Server does not have short-circuiting in the explicit manner as with for example if-statements in general-purpose programming languages.
So consider the following mock-up query:
SELECT * FROM someTable
WHERE name = 'someValue' OR name in (*some extremely expensive nested sub-query only needed to cover 0.01% of cases*)
Let's say there are only 3 rows in the table and all of them match name = 'someValue'. Will the expensive sub-query ever run?
Let's say there are 3 million rows and all but 1 could be fetched with the name = 'someValue' except 1 row which need to be fetched with the sub-query. Will the sub-query ever be evaluated when it is not needed?
If one has a similar real case, one might be ok with letting the 0.01% wait for the expensive sub-query to run before getting the results as long as the results are fetched quickly without the sub-query for the 99.99% of cases.
(I know that my specific example above could be handled explicitly with IF-statements in an SP, as suggested in this related thread:
Sql short circuit OR or conditional exists in where clause
but let's assume that is not an option.)
As the comments point out, the optimizer in SQL Server is pretty smart.
You could attempt the short-circuiting by using case. As the documentation states:
The CASE expression evaluates its conditions sequentially and stops with the first condition whose condition is satisfied.
Note that there are some exceptions involving aggregation. So, you could do:
SELECT t.*
FROM someTable t
WHERE 'true' = (CASE WHEN t.name = 'someValue' THEN 'true'
WHEN t.name in (*some extremely expensive nested sub-query only needed to cover 0.01% of cases*)
THEN 'true'
END)
This type of enforced ordering is generally considered a bad idea. One exception is when one of the paths might involve an error,such as a type conversion error) -- however, that is generally fixed nowadays with the try_ functions.
In your case, I suspect that replacing the IN with EXISTS and using appropriate indexes might eliminate almost all the performance penalty of the subquery. However, that is a different matter.

Do subqueries preserve indexes

I would like to know if the value returned from subqueries in Oracle loses indexes or not.
select * from emp where empid = 1
-- empid is indexed
select t1.* from (select t2.* from emp t2) t1 where t1.empid = 1
-- t1.empid is still indexed?
Yes, the second query uses the index. In fact, both compile down to the exact same execution plan. Check it on SQLFiddle.
You should keep in mind that SQL processing is lazy: sub-queries are not necessarily fully executed to get input data to the top-level query. Instead, they should be regarded as code that can be invoked, as needed, to get that data.
Part of the query optimisation process is to simplify the structure of the supplied query. This generally can mean replacing IN and EXISTS with joins, pushing predicates into inline views and eliminating subqueries entirely.
In fact because this behaviour is so prevalant there are a few techniques (optimiser hints, for example, or common table expressions, or certain logically redundant clauses) specifically designed to prevent predicate pushing and subquery merging and other query transformations where they are unintentionally disadvantageous.
By default you should expect that subqueries and inline views will be merged into the parent query where logically possible, and as others have mentioned this is almost certainly the case in your example.
It follows from all of this, of course, that using subqueries or inline views generally doesn't impair the optimiser's ability to use indexes or query rewrite or various other performance enhancing techniques.
There is no simple answer on this question: some times yes and sometimes no. It depends on many factors. In general , just not going deeply into theory, try to avoid such style of query writing to be on safe side or in other words views are bad in most of the cases. Please refer to http://www.orafaq.com/tuningguide/push%20predicates.html for more details.

Performance of SQL "EXISTS" usage variants

Is there any difference in the performance of the following three SQL statements?
SELECT * FROM tableA WHERE EXISTS (SELECT * FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT y FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.x = tableB.y)
They all should work and return the same result set. But does it matter if the inner SELECT selects all fields of tableB, one field, or just a constant?
Is there any best practice when all statements behave equal?
The truth about the EXISTS clause is that the SELECT clause is not evaluated in an EXISTS clause - you could try:
SELECT *
FROM tableA
WHERE EXISTS (SELECT 1/0
FROM tableB
WHERE tableA.x = tableB.y)
...and should expect a divide by zero error, but you won't because it's not evaluated. This is why my habit is to specify NULL in an EXISTS to demonstrate that the SELECT can be ignored:
SELECT *
FROM tableA
WHERE EXISTS (SELECT NULL
FROM tableB
WHERE tableA.x = tableB.y)
All that matters in an EXISTS clause is the FROM and beyond clauses - WHERE, GROUP BY, HAVING, etc.
This question wasn't marked with a database in mind, and it should be because vendors handle things differently -- so test, and check the explain/execution plans to confirm. It is possible that behavior changes between versions...
Definitely #1. It "looks" scary, but realize the optimizer will do the right thing and is expressive of intent. Also ther is a slight typo bonus should one accidently think EXISTS but type IN. #2 is acceptable but not expressive. The third option stinks in my not so humble opinion. It's too close to saying "if 'no value' exists" for comfort.
In general it's important to not be scared to write code that mearly looks inefficient if it provides other benefits and does not actually affect performance.
That is, the optimizer will almost always execute your complicated join/select/grouping wizardry to save a simple EXISTS/subquery the same way.
After having given yourself kudos for cleverly rewriting that nasty OR out of a join you will eventually realize the optimizer still used the same crappy execution plan to resolve the much easier to understand query with embedded OR anyway.
The moral of the story is know your platforms optimizer. Try different things and see what is actually being done because the rampant knee jerks assumptions regarding 'decorative' query optimization are almost always incorrect and irrelevant from my experience.
I realize this is an old post, but I thought it important to add clarity about why one might choose one format over another.
First, as others have pointed out, the database engine is supposed to ignore the Select clause. Every version of SQL Server has/does, Oracle does, MySQL does and so on. In many, many moons of database development, I have only ever encountered one DBMS that did not properly ignore the Select clause: Microsoft Access. Specifically, older versions of MS Access (I can't speak to current versions).
Prior to my discovery of this "feature", I used to use Exists( Select *.... However, i discovered that MS Access would stream across every column in the subquery and then discard them (Select 1/0 also would not work). That convinced me switch to Select 1. If even one DBMS was stupid, another could exist.
Writing Exists( Select 1... is as abundantly clear in conveying intent (It is frankly silly to claim "It's too close to saying "if 'no value' exists" for comfort.") and makes the odds of a DBMS doing something stupid with the Select statement nearly impossible. Select Null would serve the same purpose but is simply more characters to write.
I switched to Exists( Select 1 to make absolutely sure the DBMS couldn't be stupid. However, that was many moons ago, and today I would expect that most developers would expect seeing Exists( Select * which will work exactly the same.
That said, I can provide one good reason for avoiding Exists(Select * even if your DBMS evaluates it properly. It is much easier to find and trounce all uses of Select * if you don't have to skip every instance of its use in an Exists clause.
In SQL Server at least,
The smallest amount of data that can be read from disk is a single "page" of disk space. As soon as the processor reads one record that satisfies the subquery predicates it can stop. The subquery is not executed as though it was standing on it's own, and then included in the outer query, it is executed as part of the complete query plan for the whole thing. So when used as a subquery, it really doesn't matter what is in the Select clause, nothing is returned" to the outer query anyway, except a boolean to indicate whether a single record was found or not...
All three use the exact same execution plan
I always use [Select * From ... ] as I think it reads better, by not implying that I want something in particular returned from the subquery.
EDIT: From dave costa comment... Oracle also uses the same execution plan for all three options
This is one of those questions that verges on initiating some kind of holy war.
There's a fairly good discussion about it here.
I think the answer is probably to use the third option, but the speed increase is so infinitesimal it's really not worth worrying about. It's easily the kind of query that SQL Server can optimise internally anyway, so you may find that all options are equivalent.
The EXISTS returns a boolean not actual data, that said best practice is to use #3.
Execution Plan.
Learn it, use it, love it
There is no possible way to guess, really.
In addition to what others have said, the practice of using SELECT 1 originated on old Microsoft SQL Server (prior 2005) - its query optimizer wasn't clever enough to avoid physically fetching fields from the table for SELECT *. No other DBMS, to my knowledge, has this deficiency.
The EXISTS tests for existence of rows, not what's in them, so other than some optimizer quirk similar to above, it doesn't really matter what's in the SELECT list.
The SELECT * seems to be most usual, but others are acceptable as well.
#3 Should be the best one, as you don´t need the returned data anyway. Bringing the fields will only add an extra overhead

Subqueries vs joins

I refactored a slow section of an application we inherited from another company to use an inner join instead of a subquery like:
WHERE id IN (SELECT id FROM ...)
The refactored query runs about 100x faster. (~50 seconds to ~0.3) I expected an improvement, but can anyone explain why it was so drastic? The columns used in the where clause were all indexed. Does SQL execute the query in the where clause once per row or something?
Update - Explain results:
The difference is in the second part of the "where id in ()" query -
2 DEPENDENT SUBQUERY submission_tags ref st_tag_id st_tag_id 4 const 2966 Using where
vs 1 indexed row with the join:
SIMPLE s eq_ref PRIMARY PRIMARY 4 newsladder_production.st.submission_id 1 Using index
A "correlated subquery" (i.e., one in which the where condition depends on values obtained from the rows of the containing query) will execute once for each row. A non-correlated subquery (one in which the where condition is independent of the containing query) will execute once at the beginning. The SQL engine makes this distinction automatically.
But, yeah, explain-plan will give you the dirty details.
You are running the subquery once for every row whereas the join happens on indexes.
Here's an example of how subqueries are evaluated in MySQL 6.0.
The new optimizer will convert this kind of subqueries into joins.
before the queries are run against the dataset they are put through a query optimizer, the optimizer attempts to organize the query in such a fashion that it can remove as many tuples (rows) from the result set as quickly as it can. Often when you use subqueries (especially bad ones) the tuples can't be pruned out of the result set until the outer query starts to run.
With out seeing the the query its hard to say what was so bad about the original, but my guess would be it was something that the optimizer just couldn't make much better. Running 'explain' will show you the optimizers method for retrieving the data.
Look at the query plan for each query.
Where in and Join can typically be implemented using the same execution plan, so typically there is zero speed-up from changing between them.
Optimizer didn't do a very good job. Usually they can be transformed without any difference and the optimizer can do this.
This question is somewhat general, so here's a general answer:
Basically, queries take longer when MySQL has tons of rows to sort through.
Do this:
Run an EXPLAIN on each of the queries (the JOIN'ed one, then the Subqueried one), and post the results here.
I think seeing the difference in MySQL's interpretation of those queries would be a learning experience for everyone.
The where subquery has to run 1 query for each returned row. The inner join just has to run 1 query.
Usually its the result of the optimizer not being able to figure out that the subquery can be executed as a join in which case it executes the subquery for each record in the table rather then join the table in the subquery against the table you are querying. Some of the more "enterprisey" database are better at this, but they still miss it sometimes.
With a subquery, you have to re-execute the 2nd SELECT for each result, and each execution typically returns 1 row.
With a join, the 2nd SELECT returns a lot more rows, but you only have to execute it once. The advantage is that now you can join on the results, and joining relations is what a database is supposed to be good at. For example, maybe the optimizer can spot how to take better advantage of an index now.
It isn't so much the subquery as the IN clause, although joins are at the foundation of at least Oracle's SQL engine and run extremely quickly.
The subquery was probably executing a "full table scan". In other words, not using the index and returning way too many rows that the Where from the main query were needing to filter out.
Just a guess without details of course but that's the common situation.
Taken from the Reference Manual (14.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOINS.