What is better - SELECT TOP (1) or INNER JOIN? - sql

Let's say I have following query:
SELECT Id, Name, ForeignKeyId,
(SELECT TOP (1) FtName FROM ForeignTable WHERE FtId = ForeignKeyId)
FROM Table
Would that query execute faster if it is written with JOIN:
SELECT Id, Name, ForeignKeyId, FtName
FROM Table t
LEFT OUTER JOIN ForeignTable ft
ON ft.FtId = t.ForeignTableIf
Just curious... also, if JOINs are faster, will it be faster in all cases (tables with lots of columns, large number of rows)?
EDIT: Queries I wrote are just for illustrating concept of TOP (1) vs JOIN. Yes - I know about Query Execution Plan in SQL Server but I'm not looking to optimize single query - I'm trying to understand if there is certain theory behind SELECT TOP (1) vs JOIN and if certain approach is preferred because of speed (not because of personal preference or readability).
EDIT2: I would like to thank Aaron for his detailed answer and encourage to people to check his company's SQL Sentry Plan Explorer free tool he mentioned in his answer.

Originally, I wrote:
The first version of the query is MUCH less readable to me. Especially
since you don't bother aliasing the matched column inside the
correlated subquery. JOINs are much clearer.
I still believe and stand by those statements, but I'd like to add to my original response based on the new information added to the question. You asked, are there general rules or theories about what performs better, a TOP (1) or a JOIN, leaving readability and preference aside)? I will re-state as I commented that no, there are no general rules or theories. When you have a specific example, it is very easy to prove what works better. Let's take these two queries, similar to yours but which run against system objects that we can all verify:
-- query 1:
SELECT name,
(SELECT TOP (1) [object_id]
FROM sys.all_sql_modules
WHERE [object_id] = o.[object_id]
)
FROM sys.all_objects AS o;
-- query 2:
SELECT o.name, m.[object_id]
FROM sys.all_objects AS o
LEFT OUTER JOIN sys.all_sql_modules AS m
ON o.[object_id] = m.[object_id];
These return the exact same results (3,179 rows on my system), but by that I mean the same data and the same number of rows. One clue that they're not really the same query (or at least not following the same execution plan) is that the results come back in a different order. While I wouldn't expect a certain order to be maintained or obeyed, because I didn't include an ORDER BY anywhere, I would expect SQL Server to choose the same ordering if they were, in fact, using the same plan.
But they're not. We can see this by inspecting the plans and comparing them. In this case I'll be using SQL Sentry Plan Explorer, a free execution plan analysis tool from my company - you can get some of this information from Management Studio, but other parts are much more readily available in Plan Explorer (such as actual duration and CPU). The top plan is the subquery version, the bottom one is the join. Again, the subquery is on the top, the join is on the bottom:
[click for full size]
[click for full size]
The actual execution plans: 85% of the overall cost of running the two queries is in the subquery version. This means it is more than 5 times as expensive as the join. Both CPU and I/O are much higher with the subquery version - look at all those reads! 6,600+ pages to return ~3,000 rows, whereas the join version returns the data using much less I/O - only 110 pages.
But why? Because the subquery version works essentially like a scalar function, where you're going and grabbing the TOP matching row from the other table, but doing it for every row in the original query. We can see that the operation occurs 3,179 times by looking at the Top Operations tab, which shows number of executions for each operation. Once again, the more expensive subquery version is on top, and the join version follows:
I'll spare you more thorough analysis, but by and large, the optimizer knows what it's doing. State your intent (a join of this type between these tables) and 99% of the time it will work out on its own what is the best underlying way to do this (e.g. execution plan). If you try to out-smart the optimizer, keep in mind that you're venturing into quite advanced territory.
There are exceptions to every rule, but in this specific case, the subquery is definitely a bad idea. Does that mean the proposed syntax in the first query is always a bad idea? Absolutely not. There may be obscure cases where the subquery version works just as well as the join. I can't think that there are many where the subquery will work better. So I would err on the side of the one that is more likely to be as good or better and the one that is more readable. I see no advantages to the subquery version, even if you find it more readable, because it is most likely going to result in worse performance.
In general, I highly advise you to stick to the more readable, self-documenting syntax unless you find a case where the optimizer is not doing it right (and I would bet in 99% of those cases the issue is bad statistics or parameter sniffing, not a query syntax issue). I would suspect that, outside of those cases, the repros you could reproduce where convoluted queries that work better than their more direct and logical equivalents would be quite rare. Your motivation for trying to find those cases should be about the same as your preference for the unintuitive syntax over generally accepted "best practice" syntax.

Your queries do different things. The first is more akin to a LEFT OUTER JOIN.
It depends how your indexes are setup for performance. But JOINs are more clear.

I agree with statements above (Rick). Run this in Execution Plan...you'll get a clear answer. No speculation needed.
I agree with Daniel and Davide, that these are two different SQL statements. If the ForeignTable has multiple records of the same FtId value, then you'll have get duplication of data. Assuming the 1st SQL statement is correct, you'll have to rewrite the 2nd with some GROUP BY clause.

Related

Why is my SQL query getting disproportionally slow when adding a simple string comparison?

So, I have an SQL query for MSSQL looking like this (simplified for readability):
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
WHERE STATUS NOT IN (107)
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
It works fine, but I need a comparison against another table in the inner query, so I added a left join and said comparison:
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
LEFT JOIN OTHER_TABLE ON MY_DATA.ID=OTHER_TABLE.ID //new JOIN
WHERE STATUS NOT IN (107) AND (DEPARTMENT_ID='SP' OR DEPARTMENT_ID='BL') //new AND branch
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
This query works, but is A LOT slower thant the previous one. The wierd thing is, commenting out the new AND-branch of the WHERE clause while leaving the JOIN as it is makes it faster again. As if it's not joining another table that is slowing the query down, but the actual string comparisons... I am lost as to why this is so slow, or how I could speed it up... any advice would be appreciated!
Use an INNER JOIN. The outer join is being undone by the WHERE clause anyway:
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA d INNER JOIN
OTHER_TABLE ot
ON d.ID = ot.ID //new JOIN
WHERE od.STATUS NOT IN (107) AND DEPARTMENT_ID IN ('SP', 'BL') //new AND branch
GROUP BY ...
(The IN shouldn't make a difference; it is just easier to write.)
Next, if this still has slow performance, then look at the execution plans. It means that SQL Server is making a poor decision, probably on the JOIN algorithm. Normally, I fix this by forbidding nested loop joins, but there might be other solutions as well.
It's hard to say definitively what will or won't speed things up without seeing the execution plan. Also, understanding how fast you need it to be affects what steps you might want to (or not want to) consider taking.
What follows is admittedly somewhat vague, but these are a few things that came to mind when I thought about this. Take a look at the execution plan as Philip Couling suggested in that good link to get an idea where the pain points are, and of course, take these suggestions with a grain of salt.
You might consider adding some indexes to either or both of the tables. The execution plan might even give you suggestions on what could be useful, but off the top of my head, something on OTHER_TABLE.DEPARTMENT_ID probably wouldn't hurt.
You might be able to build potential new indexes as Filtered Indexes if you know those hard-coded search terms (like STATUS and DEPARTMENT_ID are always going to be the same).
You could pre-calculate some of this information if it's not changing so rapidly that you need to query it fresh on every call. This comes back to how fast you need it to go, because for just about any query, you can add columns or pre-populated lookup tables to avoid doing work at run time. For example, you could make an new bit field like IsNewOrBranch or IsStatusNot107 (both somewhat egregious steps, but things which could work). Or that might be pre-aggregating the data in the inner query ahead of time.
I know you simplified the query for our benefit, but that also makes it a little hard to know what's going on with the subquery, and the subsequent GROUP BY being performed against that subquery. There might be a way to avoid having to do two group bys.
Along the same vein, you might also look into splitting what you're doing into separate statements if SQL is having a difficult time figuring out how best to return the data. For example, you might populate a temp table or table variable with the results of your inner query, then perform your subsequent GROUP BY on that. While this approach isn't always useful, there are many times where trying to cram all the work into a single query will actually end up being worse than several individual, simple, optimized steps would be.
And as Gordon Linoff suggested, there are a number of query hints which could be used to coax the execution plan into doing things a specific way. But be careful, often that way lies madness.
Your SQL is fine, and restricting your data with an additional AND clause should usually not make it slower.
As it happens, choosing a fast execution path is a hard problem, and SQL Server sometimes (albeit seldom) gets it wrong.
What you can do to help SQL Server find the best execution path is to:
make sure the statistics on your tables are up-to-date and
make sure that there is an "obviously suitable" index that SQL Server can use. SQL Server Management studio will usually give you suggestions on missing indexes when selecting the "show actual execution plan" option.

JOIN or Correlated subquery with exists clause, which one is better

select *
from ContactInformation c
where exists (select * from Department d where d.Id = c.DepartmentId )
select *
from ContactInformation c
inner join Department d on c.DepartmentId = d.Id
Both the queries give out the same output, which is good in performance wise join or correlated sub query with exists clause, which one is better.
Edit :-is there alternet way for joins , so as to increase performance:-
In the above 2 queries i want info from dept as well as contactinformation tables
Generally, the EXISTS clause because you may need DISTINCT for a JOIN for it to give the expected output. For example, if you have multiple Department rows for a ContactInformation row.
In your example above, the SELECT *:
means different output too so they are not actually equivalent
less chance of a index being used because you are pulling all columns out
Saying that, even with a limited column list, they will give the same plan: until you need DISTINCT... which is why I say "EXISTS"
You need to measure and compare - there's no golden rule which one will be better - it depends on too many variables and things in your system.
In SQL Server Management Studio, you could put both queries in a window, choose Include actual execution plan from the Query menu, and then run them together.
You should get a comparison of both their execution plans and a percentage of how much of the time was spent on one or the other query. Most likely, both will be close to 50% in this case. If not - then you know which of the two queries performs better.
You can learn more about SQL Server execution plans (and even download a free e-book) from Simple-Talk - highly recommended.
I assume that either you meant to add the DISTINCT keyword to the SELECT clause in your second query (or, less likely, a Department has only one Contact).
First, always start with 'logical' considerations. The EXISTS construct is arguably more intuitive so, all things 'physical' being equal, I'd go with that.
Second, there will be one day when you will need to ports this code, not necessarily to a different SQL product but, say, the same product but with a different optimizer. A decent optimizer should recognise that both are equivalent and come up with the same ideal plan. Consider that, in theory, the EXISTS construct has slightly more potential to short circuit.
Third, test it using a reasonably large data set. If performance isn't acceptable, start looking at the 'physical' considerations (but I suggest you always keep your 'logically-pure' code in comments for the forthcoming day when the perfect optimizer arrives :)
Your first query should output Department columns, while the second one should not.
If you're only interested in ContactInformation, these queries are equivalent. You could run them both and examine the query execution plan to see which one runs faster. For example, on MYSQL, where exists is more efficient with nullable columns, while inner join performs better if neither column is nullable.

Correlated query vs inner join performance in SQL Server

let's say that you want to select all rows from one table that have a corresponding row in another one (the data in the other table is not important, only the presence of a corresponding row is important). From what I know about DB2, this kinda query is better performing when written as a correlated query with a EXISTS clause rather than a INNER JOIN. Is that the same for SQL Server? Or doesn't it make any difference whatsoever?
I just ran a test query and the two statements ended up with the exact same execution plan. Of course, for just about any performance question I would recommend running the test on your own environment; With SQL server Management Studio this is easy (or SQL Query Analyzer if your running 2000). Just type both statements into a query window, select Query|Include Actual Query Plan. Then run the query. Go to the results tab and you can easily see what the plans are and which one had a higher cost.
Odd: it's normally more natural for me to write these as a correlated query first, at which point I have to then go back and re-factor to use a join because in my experience the sql server optimizer is more likely to get that right.
But don't take me too seriously. For all I have 26K rep here and one of only 2 current sql topic-specific badges, I'm actually pretty junior in terms of sql knowledge (It's all about the volume! ;) ); certainly I'm no DBA. In practice, you will of course need to profile each method to gauge it's actual performance. I would expect the optimizer to recognize what you're asking for and handle either query in the optimal way, but you never know until you check.
As everyone notes, it all boils down to the optimizer. I'd suggest writing it in whatever way feels more natural to you, then making sure the optimizer can figure out the most effective query plan (gather statistics, create an index, whatever). The SQL Server optimizer is pretty good overall, so long as you give it the information it needs to work with.
Use the join. It might not make much of a difference in performance if you have small tables, but if the "outer" table is very large then it will need to do the EXISTS sub-query for each row. If your tables are indexed on the common columns then it should be far quicker to do the INNER JOIN. BTW, if you want to find all rows that are NOT in the second table, use a LEFT JOIN and test for NULL in the second table--it is much faster than using EXISTS when you have very large tables and indexes.
Probably the best performance is with a join to a derived table. Exists would probably be next (and might be faster). The worst performance would be with a subquery inside the select as it would tend to run row by row instead of as a set.
However, all things being equal and database performance being very dependent on the database design. I would try out all possible methods and see which are faster in your circumstances.

Transact-SQL - sub query or left-join?

I have two tables containing Tasks and Notes, and want to retrieve a list of tasks with the number of associated notes for each one. These two queries do the job:
select t.TaskId,
(select count(n.TaskNoteId) from TaskNote n where n.TaskId = t.TaskId) 'Notes'
from Task t
-- or
select t.TaskId,
count(n.TaskNoteId) 'Notes'
from Task t
left join
TaskNote n
on t.TaskId = n.TaskId
group by t.TaskId
Is there a difference between them and should I be using one over the other, or are they just two ways of doing the same job? Thanks.
On small datasets they are wash when it comes to performance. When indexed, the LOJ is a little better.
I've found on large datasets that an inner join (an inner join will work too.) will outperform the subquery by a very large factor (sorry, no numbers).
In most cases, the optimizer will treat them the same.
I tend to prefer the second, because it has less nesting, which makes it easier to read and easier to maintain. I have started to use SQL Server's common table expressions to reduce nesting as well for the same reason.
In addition, the second syntax is more flexible if there are further aggregates which may be added in the future in addition to COUNT, like MIN(some_scalar), MAX(), AVG() etc.
The subquery will be slower as it is being executed for every row in the outer query. The join will be faster as it is done once. I believe that the query optimiser will not rewrite this query plan as it can't recognize the equivalence.
Normally you would do a join and group by for this sort of count. Correlated subqueries of the sort you show are mainly of interest if they have to do some grouping or more complex predicate on a table that is not participating in another join.
If you're using SQL Server Management Studio, you can enter both versions into the Query Editor and then right-click and choose Display Estimated Execution Plan. It will give you two percentage costs relative to the batch. If they're expected to take the same time, they'll both show as 50% - in which case, choose whichever you prefer for other reasons (easier to read, easier to maintain, better fit with your coding standards etc). Otherwise, you can pick the one with the lower percentage cost relative to the batch.
You can use the same technique to look at changing any query to improve performance by comparing two versions that do the same thing.
Of course, because it's a cost relative to the batch, it doesn't mean that either query is as fast as it could be - it just tells you how they compare to each other, not to some notional optimum query to get the same results.
There's no clear-cut answer on this. You should view the SQL Plan. In terms of relational algebra, they are essentially equivalent.
I make it a point to avoid subqueries wherever possible. The join will generally be more efficient.
You can use either, and they are semantically identical. In general, the rule of thumb is to use whichever form is easier for you to read, unless performance is an issue.
If performance is an issue, then experiment with rewriting the query using the other form. Sometimes the optimizer will use an index for one form, and not the other.

subselect vs outer join

Consider the following 2 queries:
select tblA.a,tblA.b,tblA.c,tblA.d
from tblA
where tblA.a not in (select tblB.a from tblB)
select tblA.a,tblA.b,tblA.c,tblA.d
from tblA left outer join tblB
on tblA.a = tblB.a where tblB.a is null
Which will perform better? My assumption is that in general the join will be better except in cases where the subselect returns a very small result set.
RDBMSs "rewrite" queries to optimize them, so it depends on system you're using, and I would guess they end up giving the same performance on most "good" databases.
I suggest picking the one that is clearer and easier to maintain, for my money, that's the first one. It's much easier to debug the subquery as it can be run independently to check for sanity.
non-correlated sub queries are fine. you should go with what describes the data you're wanting. as has been noted, this likely gets rewritten into the same plan, but isn't guaranteed to! what's more, if table A and B are not 1:1 you will get duplicate tuples from the join query (as the IN clause performs an implicit DISTINCT sort), so it's always best to code what you want and actually think about the outcome.
Well, it depends on the datasets. From my experience, if You have small dataset then go for a NOT IN if it's large go for a LEFT JOIN. The NOT IN clause seems to be very slow on large datasets.
One other thing I might add is that the explain plans might be misleading. I've seen several queries where explain was sky high and the query run under 1s. On the other hand I've seen queries with excellent explain plan and they could run for hours.
So all in all do test on your data and see for yourself.
I second Tom's answer that you should pick the one that is easier to understand and maintain.
The query plan of any query in any database cannot be predicted because you haven't given us indexes or data distributions. The only way to predict which is faster is to run them against your database.
As a rule of thumb I tend to use sub-selects when I do not need to include any columns from tblB in my select clause. I would definitely go for a sub-select when I want to use the 'in' predicate (and usually for the 'not in' that you included in the question), for the simple reason that these are easier to understand when you or someone else has come back and change them.
The first query will be faster in SQL Server which I think is slighty counter intuitive - Sub queries seem like they should be slower. In some cases (as data volumes increase) an exists may be faster than an in.
It should be noted that these queries will produce different results if TblB.a is not unique.
From my observations, MSSQL server produces same query plan for these queries.
I created a simple query similar to the ones in the question on MSSQL2005 and the explain plans were different. The first query appears to be faster. I am not a SQL expert but the estimated explain plan had 37% for query 1 and 63% for the query 2. It appears that the biggest cost for query 2 is the join. Both queries had two table scans.