Is there any difference in the performance of the following three SQL statements?
SELECT * FROM tableA WHERE EXISTS (SELECT * FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT y FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.x = tableB.y)
They all should work and return the same result set. But does it matter if the inner SELECT selects all fields of tableB, one field, or just a constant?
Is there any best practice when all statements behave equal?
The truth about the EXISTS clause is that the SELECT clause is not evaluated in an EXISTS clause - you could try:
SELECT *
FROM tableA
WHERE EXISTS (SELECT 1/0
FROM tableB
WHERE tableA.x = tableB.y)
...and should expect a divide by zero error, but you won't because it's not evaluated. This is why my habit is to specify NULL in an EXISTS to demonstrate that the SELECT can be ignored:
SELECT *
FROM tableA
WHERE EXISTS (SELECT NULL
FROM tableB
WHERE tableA.x = tableB.y)
All that matters in an EXISTS clause is the FROM and beyond clauses - WHERE, GROUP BY, HAVING, etc.
This question wasn't marked with a database in mind, and it should be because vendors handle things differently -- so test, and check the explain/execution plans to confirm. It is possible that behavior changes between versions...
Definitely #1. It "looks" scary, but realize the optimizer will do the right thing and is expressive of intent. Also ther is a slight typo bonus should one accidently think EXISTS but type IN. #2 is acceptable but not expressive. The third option stinks in my not so humble opinion. It's too close to saying "if 'no value' exists" for comfort.
In general it's important to not be scared to write code that mearly looks inefficient if it provides other benefits and does not actually affect performance.
That is, the optimizer will almost always execute your complicated join/select/grouping wizardry to save a simple EXISTS/subquery the same way.
After having given yourself kudos for cleverly rewriting that nasty OR out of a join you will eventually realize the optimizer still used the same crappy execution plan to resolve the much easier to understand query with embedded OR anyway.
The moral of the story is know your platforms optimizer. Try different things and see what is actually being done because the rampant knee jerks assumptions regarding 'decorative' query optimization are almost always incorrect and irrelevant from my experience.
I realize this is an old post, but I thought it important to add clarity about why one might choose one format over another.
First, as others have pointed out, the database engine is supposed to ignore the Select clause. Every version of SQL Server has/does, Oracle does, MySQL does and so on. In many, many moons of database development, I have only ever encountered one DBMS that did not properly ignore the Select clause: Microsoft Access. Specifically, older versions of MS Access (I can't speak to current versions).
Prior to my discovery of this "feature", I used to use Exists( Select *.... However, i discovered that MS Access would stream across every column in the subquery and then discard them (Select 1/0 also would not work). That convinced me switch to Select 1. If even one DBMS was stupid, another could exist.
Writing Exists( Select 1... is as abundantly clear in conveying intent (It is frankly silly to claim "It's too close to saying "if 'no value' exists" for comfort.") and makes the odds of a DBMS doing something stupid with the Select statement nearly impossible. Select Null would serve the same purpose but is simply more characters to write.
I switched to Exists( Select 1 to make absolutely sure the DBMS couldn't be stupid. However, that was many moons ago, and today I would expect that most developers would expect seeing Exists( Select * which will work exactly the same.
That said, I can provide one good reason for avoiding Exists(Select * even if your DBMS evaluates it properly. It is much easier to find and trounce all uses of Select * if you don't have to skip every instance of its use in an Exists clause.
In SQL Server at least,
The smallest amount of data that can be read from disk is a single "page" of disk space. As soon as the processor reads one record that satisfies the subquery predicates it can stop. The subquery is not executed as though it was standing on it's own, and then included in the outer query, it is executed as part of the complete query plan for the whole thing. So when used as a subquery, it really doesn't matter what is in the Select clause, nothing is returned" to the outer query anyway, except a boolean to indicate whether a single record was found or not...
All three use the exact same execution plan
I always use [Select * From ... ] as I think it reads better, by not implying that I want something in particular returned from the subquery.
EDIT: From dave costa comment... Oracle also uses the same execution plan for all three options
This is one of those questions that verges on initiating some kind of holy war.
There's a fairly good discussion about it here.
I think the answer is probably to use the third option, but the speed increase is so infinitesimal it's really not worth worrying about. It's easily the kind of query that SQL Server can optimise internally anyway, so you may find that all options are equivalent.
The EXISTS returns a boolean not actual data, that said best practice is to use #3.
Execution Plan.
Learn it, use it, love it
There is no possible way to guess, really.
In addition to what others have said, the practice of using SELECT 1 originated on old Microsoft SQL Server (prior 2005) - its query optimizer wasn't clever enough to avoid physically fetching fields from the table for SELECT *. No other DBMS, to my knowledge, has this deficiency.
The EXISTS tests for existence of rows, not what's in them, so other than some optimizer quirk similar to above, it doesn't really matter what's in the SELECT list.
The SELECT * seems to be most usual, but others are acceptable as well.
#3 Should be the best one, as you donĀ“t need the returned data anyway. Bringing the fields will only add an extra overhead
Related
Update 11/2
After some additional troubleshooting, my team was able to tie this Oracle bug directly to a parameter change that was made on the 12c database the night before the query stopped working. After experiencing some performance issues from an application tied to this database, my team had our DBA change the OPTIMIZER_FEATURES_ENABLE parameter from 12.1.02 to 11.2.0.4. This fixed the performance issue for the problem application but caused the bug I have described above. To verify, I've been able to replicate this same issue in a separate environment by changing this parameter. My DBA has filed a ticket with Oracle to have this looked at.
As a workaround, I was able to make a slight change to my query in order to retrieve the expected results. Specifically, I combined Subquery1 with Subquery2 and I moved a few predicates in Subquery1 from the WHERE clause to the JOIN (where they more properly belonged). This change edited my execution plan (it is slightly less efficient than what was listed before) but was enough to address the original issue.
Original Post
Firstly, let me apologize for any vagueness in this question but I'm dealing with a confidential financial system so I am forced to hide certain implementation details.
Background
I have an Oracle query that I put into production a long time ago that has recently stopped producing expected results coincidentally after an upgrade from 11g to 12c. To my (and my production support team's) knowledge this query had been working fine for well over a year before that.
Details
The query is overly complicated and not very efficient but this is in large part because I am dealing with non-normalized tables (historically modeled after a Mainframe) and poor data input from upstream systems. In order to deal with a complicated business situation I leveraged multiple levels of Subquery Factoring (the WITH statement) and then my final statement joins together two Inline Views. The basic structure of the query without all of the complicated predicates is as follows:
I have 3 tables Table1, Table2, Table3. Table1 is a processing table made up of records from Table2.
--This grabs a subset from Table1
WITH Subquery1 as (
SELECT FROM Table1),
--This eliminates certain records from the first subset based on sister records
--from the original source table
Subquery2 as (
SELECT FROM Subquery1
WHERE NOT EXISTS FROM (SELECT from Table2)),
--This ties the records from Subquery2 to Table3
Subquery3 as (
SELECT FROM Table3
JOIN (SELECT Max(Date) FROM Table3)
JOIN Subquery2)
--This final query evaluates subquery3 in two different ways and
--only takes those records which fit the criteria items from both sets
SELECT FROM
(SELECT FROM Subquery3) -- Call this Inline View A
JOIN (SELECT FROM Subquery3) -- Call this Inline View B
The final query is pretty basic:
SELECT A.Group_No, B.Sub_Group, B.Key, B.Lob
FROM (SELECT Group_No, Lob, COUNT(Sub_Group)
FROM Subquery3
GROUP BY Group_No, Lob
HAVING COUNT(Sub_Group) = 1) A
JOIN (SELECT Group_No, Sub_Group, Key, Lob
FROM Subquery3
WHERE Sub_Group LIKE '0000%') B
ON A.Group_No = B.Group_No
AND A.Lob = B.Lob
Problem
If I edit the final query to remove the second Inline View and evaluate the output of the A inline view, I come away with 0 returned rows. I've manually evaluated the records for each individual subquery and can confirm this is an expected result.
Likewise, if I edit the final query to produce the output of only the 'B' inline view, I come away with 6 returned rows. Again, I've manually evaluated the data and this is exactly as expected.
Now when joining these two subsets (Inline View A and Inline View B) together, I would expect that the final query result would be 0 rows (since an inner join between a full set and an empty set produces no matches). However, when I run the entire query with the inner join as described above, I am getting back 1158 rows!
I have reviewed the Execution Plan but nothing jumps out at me:
Questions
Clearly I have done something to confuse the Oracle Optimizer and the updated query plan is pulling back a much different query than the one I have submitted. My best guess is that with all of these temporary views floating around within the same query, I have confused Oracle into evaluating some set before one that it depends upon.
To this day I've been unable to locate the official Oracle documentation around the WITH statement so I've never been completely confident about the order that subqueries are evaluated. I did notice in searching SO (can't find it now) someone mentioned that a factored subquery cannot refer to another factored query. I've never known this to be true before but the bizarre output above is making me wonder if I had only been lucky before with this query?
Can anyone explain the behavior I am seeing? Am I attempting to do something obviously incorrect with this query plan? Or alternatively, is there any chance that something changed between 11g and 12c that could explain why the behavior of this query might have changed?
This sounds like a "wrong results" bug in Oracle. These bugs are usually extremely specific to the version and the features you are using. There's nothing obviously wrong with the queries or execution plan you posted.
You have two ways of handling this:
Try to find the precise bug. What you're doing with common table expressions looks fine. There are some rare times when your query is technically invalid, you get "lucky" in one version and it works, and when you upgrade it fails. But when that happens the new version usually throws an error, not return wrong results. There's probably some extremely weird, specific combination of features you're using that's causing the issue. To find the real issue you need to massively simplify the query until you can make the smallest possible change and see the problem appear and disappear. You'll also want to remove all objects and only use DUAL. This process can take hours. At the end, when you're left with only a few lines of code, either post them here, look on Oracle Support, or create a Service Request.
Avoid the bug. Even if you go through the above steps there may not be a fix anyway. Sometimes the best work-around is to do something differently. It's nice to get to the bottom of every problem but you don't always have time. Instead, try re-writing the query in syntactically different but logically equivalent ways. Remove some or all of the common table expressions, maybe even repeat some SQL. But be sure to leave a comment warning future programmers of why you're doing things in a weird way.
I read that normally you should use EXISTS when the results of the subquery are large, and IN when the subquery results are small.
But it would seem to me that it's also relevant if a subquery has to be re-evaluated for each row, or if it can be evaluated once for the entire query.
Consider the following example of two equivalent queries:
SELECT * FROM t1
WHERE attr IN
(SELECT attr FROM t2
WHERE attr2 = ?);
SELECT * FROM t1
WHERE EXISTS
(SELECT * FROM t2
WHERE t1.attr = t2.attr
AND attr2 = ?);
The former subquery can be evaluated once for the entire query, the latter has to be evaluated for each row.
Assume that the results of the subquery are very large. Which would be the best way to write this?
This is a good question. Especially as in Oracle you can convert every EXISTS clause into an IN clause and vice versa, because Oracle's IN clause can deal with tuples (where (abc) in (select x,y,z from ...), which most other dbms cannot.
And your reasoning is good. Yes, with the IN clause you suggest to load all the subquery's data once instead of looking up the records in a loopg. However this is just partly true, because:
As good as it seems to get all subquery data selected just once, the outer query must loop through the resulting array for every record. This can be quite slow, because it's just an array. If Oracle looks up data in a table instead there are often indexes to help it, so the nested loop with repeated table lookups is eventually faster.
Oracle's optimizer re-writes queries. So it can come to the same execution plan for the two statements or even get to quite unexpected plans. You never know ;-)
Oracle might decide not to loop at all. It may decide for a hash join instead, which works completely different and is usually very effective.
Having said this, Oracle's optimizer should notice that the two statements are exactly the same actually and should generate the same execution plan. But experience shows that the optimizer sometimes doesn't notice, and quite often the optimizer does better with the EXISTS clause for whatever reason. (Not as much difference as in MySQL, but still, EXISTS seems preferable over IN in Oracle, too.)
So as to your question "Assume that the results of the subquery are very large. Which would be the best way to write this?", it is unlikely for the IN clause to be faster than the EXISTS clause.
I often like the IN clause better for its simplicity and mostly find it a bit more readable. But when it comes to performance, it is sometimes better to use EXISTS (or even outer joins for that matter).
Let's say I have following query:
SELECT Id, Name, ForeignKeyId,
(SELECT TOP (1) FtName FROM ForeignTable WHERE FtId = ForeignKeyId)
FROM Table
Would that query execute faster if it is written with JOIN:
SELECT Id, Name, ForeignKeyId, FtName
FROM Table t
LEFT OUTER JOIN ForeignTable ft
ON ft.FtId = t.ForeignTableIf
Just curious... also, if JOINs are faster, will it be faster in all cases (tables with lots of columns, large number of rows)?
EDIT: Queries I wrote are just for illustrating concept of TOP (1) vs JOIN. Yes - I know about Query Execution Plan in SQL Server but I'm not looking to optimize single query - I'm trying to understand if there is certain theory behind SELECT TOP (1) vs JOIN and if certain approach is preferred because of speed (not because of personal preference or readability).
EDIT2: I would like to thank Aaron for his detailed answer and encourage to people to check his company's SQL Sentry Plan Explorer free tool he mentioned in his answer.
Originally, I wrote:
The first version of the query is MUCH less readable to me. Especially
since you don't bother aliasing the matched column inside the
correlated subquery. JOINs are much clearer.
I still believe and stand by those statements, but I'd like to add to my original response based on the new information added to the question. You asked, are there general rules or theories about what performs better, a TOP (1) or a JOIN, leaving readability and preference aside)? I will re-state as I commented that no, there are no general rules or theories. When you have a specific example, it is very easy to prove what works better. Let's take these two queries, similar to yours but which run against system objects that we can all verify:
-- query 1:
SELECT name,
(SELECT TOP (1) [object_id]
FROM sys.all_sql_modules
WHERE [object_id] = o.[object_id]
)
FROM sys.all_objects AS o;
-- query 2:
SELECT o.name, m.[object_id]
FROM sys.all_objects AS o
LEFT OUTER JOIN sys.all_sql_modules AS m
ON o.[object_id] = m.[object_id];
These return the exact same results (3,179 rows on my system), but by that I mean the same data and the same number of rows. One clue that they're not really the same query (or at least not following the same execution plan) is that the results come back in a different order. While I wouldn't expect a certain order to be maintained or obeyed, because I didn't include an ORDER BY anywhere, I would expect SQL Server to choose the same ordering if they were, in fact, using the same plan.
But they're not. We can see this by inspecting the plans and comparing them. In this case I'll be using SQL Sentry Plan Explorer, a free execution plan analysis tool from my company - you can get some of this information from Management Studio, but other parts are much more readily available in Plan Explorer (such as actual duration and CPU). The top plan is the subquery version, the bottom one is the join. Again, the subquery is on the top, the join is on the bottom:
[click for full size]
[click for full size]
The actual execution plans: 85% of the overall cost of running the two queries is in the subquery version. This means it is more than 5 times as expensive as the join. Both CPU and I/O are much higher with the subquery version - look at all those reads! 6,600+ pages to return ~3,000 rows, whereas the join version returns the data using much less I/O - only 110 pages.
But why? Because the subquery version works essentially like a scalar function, where you're going and grabbing the TOP matching row from the other table, but doing it for every row in the original query. We can see that the operation occurs 3,179 times by looking at the Top Operations tab, which shows number of executions for each operation. Once again, the more expensive subquery version is on top, and the join version follows:
I'll spare you more thorough analysis, but by and large, the optimizer knows what it's doing. State your intent (a join of this type between these tables) and 99% of the time it will work out on its own what is the best underlying way to do this (e.g. execution plan). If you try to out-smart the optimizer, keep in mind that you're venturing into quite advanced territory.
There are exceptions to every rule, but in this specific case, the subquery is definitely a bad idea. Does that mean the proposed syntax in the first query is always a bad idea? Absolutely not. There may be obscure cases where the subquery version works just as well as the join. I can't think that there are many where the subquery will work better. So I would err on the side of the one that is more likely to be as good or better and the one that is more readable. I see no advantages to the subquery version, even if you find it more readable, because it is most likely going to result in worse performance.
In general, I highly advise you to stick to the more readable, self-documenting syntax unless you find a case where the optimizer is not doing it right (and I would bet in 99% of those cases the issue is bad statistics or parameter sniffing, not a query syntax issue). I would suspect that, outside of those cases, the repros you could reproduce where convoluted queries that work better than their more direct and logical equivalents would be quite rare. Your motivation for trying to find those cases should be about the same as your preference for the unintuitive syntax over generally accepted "best practice" syntax.
Your queries do different things. The first is more akin to a LEFT OUTER JOIN.
It depends how your indexes are setup for performance. But JOINs are more clear.
I agree with statements above (Rick). Run this in Execution Plan...you'll get a clear answer. No speculation needed.
I agree with Daniel and Davide, that these are two different SQL statements. If the ForeignTable has multiple records of the same FtId value, then you'll have get duplication of data. Assuming the 1st SQL statement is correct, you'll have to rewrite the 2nd with some GROUP BY clause.
select *
from ContactInformation c
where exists (select * from Department d where d.Id = c.DepartmentId )
select *
from ContactInformation c
inner join Department d on c.DepartmentId = d.Id
Both the queries give out the same output, which is good in performance wise join or correlated sub query with exists clause, which one is better.
Edit :-is there alternet way for joins , so as to increase performance:-
In the above 2 queries i want info from dept as well as contactinformation tables
Generally, the EXISTS clause because you may need DISTINCT for a JOIN for it to give the expected output. For example, if you have multiple Department rows for a ContactInformation row.
In your example above, the SELECT *:
means different output too so they are not actually equivalent
less chance of a index being used because you are pulling all columns out
Saying that, even with a limited column list, they will give the same plan: until you need DISTINCT... which is why I say "EXISTS"
You need to measure and compare - there's no golden rule which one will be better - it depends on too many variables and things in your system.
In SQL Server Management Studio, you could put both queries in a window, choose Include actual execution plan from the Query menu, and then run them together.
You should get a comparison of both their execution plans and a percentage of how much of the time was spent on one or the other query. Most likely, both will be close to 50% in this case. If not - then you know which of the two queries performs better.
You can learn more about SQL Server execution plans (and even download a free e-book) from Simple-Talk - highly recommended.
I assume that either you meant to add the DISTINCT keyword to the SELECT clause in your second query (or, less likely, a Department has only one Contact).
First, always start with 'logical' considerations. The EXISTS construct is arguably more intuitive so, all things 'physical' being equal, I'd go with that.
Second, there will be one day when you will need to ports this code, not necessarily to a different SQL product but, say, the same product but with a different optimizer. A decent optimizer should recognise that both are equivalent and come up with the same ideal plan. Consider that, in theory, the EXISTS construct has slightly more potential to short circuit.
Third, test it using a reasonably large data set. If performance isn't acceptable, start looking at the 'physical' considerations (but I suggest you always keep your 'logically-pure' code in comments for the forthcoming day when the perfect optimizer arrives :)
Your first query should output Department columns, while the second one should not.
If you're only interested in ContactInformation, these queries are equivalent. You could run them both and examine the query execution plan to see which one runs faster. For example, on MYSQL, where exists is more efficient with nullable columns, while inner join performs better if neither column is nullable.
We have a web application in which the users perform ad-hoc queries based on parameters that have been entered. I might also mention that response time is of high importance to the users.
The web page dynamically construct a SQL to execute based on the parameters entered. For example, if the user enters "1" for "Business Unit" we construct a SQL like this:
SELECT * FROM FACT WHERE
BUSINESS_UNIT = '1'
--AND other criteria based on the input params
I found that where the user does not specify a BUSINESS_UNIT the following query is constructed
SELECT * FROM FACT WHERE
BUSINESS_UNIT LIKE '%'
--AND other criteria based on the input params
IMHO, this is unnecessarily (if not grossly) inefficient and warrants sending the code bad for modification but since I have a much higher rate of sending code back for rework than others, I believe that I may be earning a reputation as being "too picky."
If this is an inappropriate question because it is not a direct coding Q, let me know and I will delete it immediately. I am very confused whether subjective questions like this are allowed or not! I'll be watching your replies.
ty
Update:
I am using an Oracle database.
My impression is that Oracle does not optimize "LIKE '%'" by removing the condition and that leaving it in is less efficient. Could someone confirm?
The two queries are completely different (from the point of view of a result set)
SELECT * FROM FACT;
and
SELECT * FROM FACT WHERE
BUSINESS_UNIT LIKE '%';
The first will return all rows, the second, if there are NULL values those rows will not be returned because anything compared to NULL is NULL and therefore does not satisfy the predicate. This is how it would work in Oracle.
Although that looks grossly inefficient, I just tested it out in SQL Server, and the query optimizer was smart enough to filter it out.
In other words,
SELECT * FROM FACT WHERE
BUSINESS_UNIT LIKE '%'
and
SELECT * FROM FACT
generated the exact same query plan. So there shouldn't be a performance difference (depending on your DB engine I guess), although it does look kind of sloppy.
That may affect your decision to send it back or not. Personally I probably would, but if you're under a cloud already then it's something you can probably relax on, at least in terms of performance.
That query, with the BUSINESS_UNIT LIKE '%', doesn't look quite efficient -- I suppose it'll force a full scan of the table...
So, yes, I would try to have that query reworked, to deal with that kind of situation in a proper way -- or, at least, I would report that problem through our bug-tracker (Not sure of the priority I would assign, but, still, the problem would be logged somewhere, and someone will correct it someday, when there is no higher-priority problem to deal with).
SELECT * is a red flag. Specify a column list for the query.
I would send it back and I like the question.
All code review is to some extent subjective. The criteria should be based on a number of factors.
Does it work.
Does it meet reasonable expectations of performance, maintainability, usability, and scalability
While I have not tested this particular construct -- I suspect (as do you) this code will do horrible things to your SQL server. Thus performance and scalability are called into question.
The writer wanted a dummy statement so that he/she could append "and" to all the subsequent statements. changing it to a better "no-op" statement like 1==1 would help. Or they could do slightly more work and insert the "where" intelligently.
I would question why bind variables are not being used (unless there is a good reason in particular cases):
SELECT * FROM FACT WHERE
BUSINESS_UNIT = '1'
Why is it not:
SELECT * FROM FACT WHERE
BUSINESS_UNIT = :bu