Are there rules of thumb for when developers should use a join instead of a subquery, or are they the same?
The first principle is "State the query accurately". The second principle is "State the query simply and obviously" (which is where you usually make choices). The third is "State the query so it will process efficiently".
If it's a DBMS with a good query processor, equivalent query designs should result in query plans that are the same (or at least equally efficient).
My greatest frustration upon using MySQL for the first time was how conscious I had to be to anticipate the optimizer. After long experience with Oracle, SQL Server, Informix, and other dbms products, I very seldom expected to concern myself with such issues. It's better now with newer versions of MySQL, but it's still something I end up needing to pay attention to more often than with the others.
Performance-wise, there is no difference between them in most modern DB engines.
The problem with subqueries is that you might end up with a sub-resultset that has no key, so joining to it would be more expensive.
If possible, always try to write JOIN queries and filter with the ON clause instead of WHERE (although it should be the same, as modern engines are optimized for this).
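To illustrate (table and column names here are made up), a filter expressed in the ON clause looks like this:

```sql
-- Hypothetical tables: orders(order_id, customer_id), customers(customer_id, region)
SELECT o.order_id, c.region
FROM orders o
JOIN customers c
  ON c.customer_id = o.customer_id
 AND c.region = 'EMEA'   -- filter placed in the ON clause
```

For an INNER JOIN this is equivalent to putting `c.region = 'EMEA'` in WHERE; for an OUTER JOIN the two are not equivalent, since an ON condition filters which rows get joined while a WHERE condition filters the final result.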
Depends on RDBMS. You should compare execution plans for both queries.
In my experience with Oracle 10 and 11, execution plans are always the same.
Theoretically, every subquery can be rewritten as a join query.
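For example (table names are illustrative), an IN subquery and a join equivalent:

```sql
-- Subquery form
SELECT e.name
FROM employees e
WHERE e.dept_id IN (SELECT d.dept_id FROM departments d WHERE d.active = 1);

-- Join form; DISTINCT guards against duplicated rows
-- if dept_id were not unique in departments
SELECT DISTINCT e.name
FROM employees e
JOIN departments d ON d.dept_id = e.dept_id
WHERE d.active = 1;
```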
As with many things, it depends.
- how complex the subquery is
- how often the subquery is executed within the query
I try to avoid subqueries whenever I can. Especially when expecting large result sets, never use subqueries, since the subquery may be executed for each item of the result set.
take care,
Alex
Let's ignore the performance impact for now (as we should if we are aware that "Premature optimization is the root of all evil").
Choose what looks clearer and easier to maintain.
In SQL Server, a correlated subquery usually performs worse than a join or, often even better for performance, a join to a derived table. I almost never write a subquery for anything that will have to be performed multiple times. This is because correlated subqueries often essentially turn your query into a cursor and run one row at a time. In databases it is usually better to do things in a set-based fashion.
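A sketch of the two forms the answer above contrasts, with hypothetical table names:

```sql
-- Correlated subquery: conceptually evaluated once per outer row
SELECT c.customer_id,
       (SELECT MAX(o.order_date)
        FROM orders o
        WHERE o.customer_id = c.customer_id) AS last_order
FROM customers c;

-- Join to a derived table: the aggregate is computed once, set-based
SELECT c.customer_id, agg.last_order
FROM customers c
LEFT JOIN (SELECT customer_id, MAX(order_date) AS last_order
           FROM orders
           GROUP BY customer_id) agg
  ON agg.customer_id = c.customer_id;
```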
Related
I've seen many examples of SQL with complex nested subqueries (and subsubqueries and subsubsubqueries and...). Is there ever any reason to write complex queries like this instead of using WITH and CTEs, just as one would use variables in a programming language to break up complex expressions?
In particular, is there a performance advantage?
Any query that you can write using only subqueries in the FROM clause and regular joins can use CTEs with direct substitution.
Subqueries are needed for:
Correlated subqueries (which are generally not in the FROM clause).
Lateral joins (in databases that support LATERAL or APPLY keywords in the FROM clause).
Scalar subqueries.
Sometimes, a query could be rewritten to avoid these constructs.
Subqueries in the FROM clause (except for lateral joins) can be written using CTEs.
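A lateral join, for instance, lets the inner query reference the outer row, which a plain FROM-clause subquery cannot do. A sketch in SQL Server's CROSS APPLY syntax, with made-up names:

```sql
-- Top 3 orders per customer
SELECT c.customer_id, t.order_id, t.amount
FROM customers c
CROSS APPLY (SELECT TOP 3 o.order_id, o.amount
             FROM orders o
             WHERE o.customer_id = c.customer_id   -- references the outer row
             ORDER BY o.amount DESC) t;
```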
Why are subqueries used and not CTEs? The most important reason is that CTEs are a later addition to the SQL language. With the exception of recursive CTEs, they are not really needed. They are really handy when a subquery is being referenced more than one time, but one could argue that a view serves the same purpose.
As mentioned in the comments, CTEs and subqueries might be optimized differently. This could be a performance gain or loss, depending on the query and the underlying indexes and so on.
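The direct substitution mentioned above looks like this (illustrative names):

```sql
-- FROM-clause subquery ...
SELECT d.dept_id, d.avg_sal
FROM (SELECT dept_id, AVG(salary) AS avg_sal
      FROM employees
      GROUP BY dept_id) d
WHERE d.avg_sal > 50000;

-- ... substituted directly as a CTE
WITH d AS (
    SELECT dept_id, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY dept_id
)
SELECT dept_id, avg_sal
FROM d
WHERE avg_sal > 50000;
```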
Unless your query plan tells you that the subquery performs better than the CTE, I would use a CTE instead of a subquery.
In particular, is there a performance advantage?
Comparing the subquery vs. simple (non-recursive) CTE versions, they are probably very similar. You would have to use the profiler and the actual execution plan to spot any differences.
There are some reasons I would use a CTE:
In general, a CTE can be used recursively but a subquery cannot, which can help you build a calendar table and is especially well suited to tree structures.
A CTE will be easier to maintain and read (as @Lukasz Szozda commented), because you can break up complex queries into several CTEs and give them good names, which is very comfortable when writing the main query.
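As a sketch of the recursive case, a calendar table in SQL Server syntax (dates chosen arbitrarily):

```sql
-- Recursive CTE generating one row per day of January 2024
WITH calendar AS (
    SELECT CAST('2024-01-01' AS DATE) AS day
    UNION ALL
    SELECT DATEADD(DAY, 1, day)
    FROM calendar
    WHERE day < '2024-01-31'
)
SELECT day
FROM calendar
OPTION (MAXRECURSION 100);  -- guard against runaway recursion
```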
Without performance considerations:
CTEs are more readable as sql code, meaning easier to maintain and debug.
Subqueries (at the FROM clause) are good as long as there are few, small and simple, thus converting to CTE would actually make it more difficult to read.
There is also the option of views which mostly prevents sql code duplication.
With performance considerations:
CTEs may degrade the more complex they become. If so, they become too risky to be trusted with tweaks and changes, and may lead to a more aggressive performance approach, like converting all CTEs to temp tables (#).
Subqueries behave as well as views, and a little better than CTEs in most cases. Still, becoming too complex may hinder performance and make performance optimization difficult. Eventually someone may need to tweak them, or even extract the heavier ones out to temp tables to lighten the main select.
Views are slightly better under increased complexity, as long as they are composed of plain tables and simple views, they have clean SQL, and possible filters are pushed wherever possible into the view's joins. Still, joining two complex views will get you to the same situation as complex CTEs or subqueries.
What's better in terms of SQL Server efficiency: using subqueries, or joins?
I know uncorrelated subqueries are better than correlated ones. But what about joins?
The SQL becomes more readable and understandable using joins (an OUTER JOIN with a check for NULLs), but is it worse or better for the performance of the DB?
You'll find that by using JOINs, the query optimization engine in SQL Server can formulate a more efficient query execution plan. As a rule, you are better off always using JOINs, but if there is a performance problem, try to rework the query and document performance differences. There are always very strange exceptions.
Take for example the query I was working on yesterday: by adding an ORDER BY, it ran faster than without the ORDER BY. What the heck? How can that be? It seems to go against the very concept of SQL, since every operation has a time cost. However, SQL Server 2000 created completely different query execution plans with and without the ORDER BY. Go figure! It points out the importance of checking execution plans and monitoring query performance.
It's difficult to answer a question as abstract as this, but in the obvious cases I can think of where I would need to choose between the two, I'd choose subqueries.
These cases are NOT EXISTS vs. OUTER JOIN with a NULL check, and IN/EXISTS vs. JOIN with DISTINCT. The subqueries appear in the plan as JOINs anyway.
(Edit: Just noticed that the second of my examples is mentioned in your question)
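For reference, the two cases named above look like this (table names are illustrative):

```sql
-- Anti-join: NOT EXISTS form ...
SELECT c.customer_id
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);

-- ... versus the OUTER JOIN / NULL-check form
SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;

-- Semi-join: IN/EXISTS avoids the DISTINCT the JOIN form would need
SELECT c.customer_id
FROM customers c
WHERE c.customer_id IN (SELECT o.customer_id FROM orders o);
```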
One of my jobs is to maintain our database; we usually have trouble with lack of performance while generating reports and working with that database.
When I started looking at the queries our ERP sends to the database, I saw a lot of totally needless subselect queries inside the main queries.
As I am not a member of the development team that created the program we use, they do not much like it when I criticize their code and work. Let's say they do not take my reviews as serious statements.
So I am asking you a few questions about subselects in SQL:
Does a subselect take a lot more time than a left outer join?
Does there exist any blog, article, or anything else where subselects are recommended against?
How can I prove that if we avoid subselects in a query, that query is going to be faster?
Our database server is MSSQL2005
"Show, Don't Tell" - Examine and compare the query plans of the queries identified using SQL Profiler. Particularly look out for table scans and bookmark lookups (you want to see index seeks as often as possible). The 'goodness of fit' of query plans depends on up-to-date statistics, what indexes are defined, and the overall query workload.
Execution Plan Basics
Understanding More Complex Query Plans
Using SQL Server Profiler (2005 Version)
Run the queries in SQL Server Management Studio (SSMS) and turn on Query->Include Actual Execution Plan (CTRL+M)
Count yourself lucky they're only subselects (for which the optimiser will in some cases produce equivalent 'join' plans) and not correlated sub-queries!
Identify a query that is performing a high number of logical reads, re-write it using your preferred technique, and then show how few logical reads it does by comparison.
Here's a tip. To get the total number of logical reads performed, wrap the query in question with:
SET STATISTICS IO ON
GO
-- Run your query here
SET STATISTICS IO OFF
GO
Run your query, and switch to the messages tab in the results pane.
If you are interested in learning more, there is no better book than SQL Server 2008 Query Performance Tuning Distilled, which covers the essential techniques for monitoring, interpreting and fixing performance issues.
One thing you can do is to load SQL Profiler and show them the cost (in terms of CPU cycles, reads and writes) of the sub-queries. It's tough to argue with cold, hard statistics.
I would also check the query plan for these queries to make sure appropriate indexes are being used, and table/index scans are being held to a minimum.
In general, I wouldn't say sub-queries are bad, if used correctly and the appropriate indexes are in place.
I'm not very familiar with MSSQL, as we use PostgreSQL in most of our applications. However, there should be something like EXPLAIN which shows you the execution plan for the query. There you should be able to see the various steps the query will take in order to retrieve the needed data.
If you see a lot of table scans or loop joins there without any index usage, it is definitely a hint of slow query execution. With such a tool you should be able to compare the two queries (one with the join, the other without).
It is difficult to state which is the better way, because it depends heavily on the indexes the optimizer can use in each case, and depending on the DBMS, the optimizer may be able to implicitly rewrite a subquery into a join and execute that.
If you really want to show which is better, you have to execute both and measure the time, CPU usage, and so on.
UPDATE:
Probably it is this one for MSSQL -->QueryPlan
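In PostgreSQL the tool mentioned above is EXPLAIN; a minimal sketch (table names are made up):

```sql
-- Show the planner's chosen plan without running the query
EXPLAIN
SELECT e.name
FROM employees e
JOIN departments d ON d.dept_id = e.dept_id;

-- Actually execute and report real row counts, timings, and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT e.name
FROM employees e
WHERE e.dept_id IN (SELECT dept_id FROM departments WHERE active);
```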
From my own experience, both methods can be valid; for example, an EXISTS subselect can avoid a lot of work by breaking out early.
But most of the time, queries with a lot of subselects are written by devs who do not really understand SQL and apply their classic procedural-programmer way of thinking to queries. They don't even think about joins, and write some awful queries. So I prefer joins, and I always check subqueries. To be completely honest, I track slow queries, and my first attempt on slow queries containing subselects is to rewrite them as joins. That works much of the time.
But there is no rule establishing that subselects are bad or slower than joins; it's just that bad SQL programmers often write subselects :-)
Does a subselect take a lot more time than a left outer join?
This depends on the subselect and the left outer join.
Generally, this construct:
SELECT *
FROM mytable
WHERE mycol NOT IN
(
SELECT othercol
FROM othertable
)
is more efficient than this:
SELECT m.*
FROM mytable m
LEFT JOIN
othertable o
ON o.othercol = m.mycol
WHERE o.othercol IS NULL
See here:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
Does there exist any blog, article, or anything else where subselects are recommended against?
I would steer clear of the blogs which blindly recommend to avoid subselects.
They are implemented for a reason and, believe it or not, the developers have put some effort into optimizing them.
How can I prove that if we avoid subselects in a query, that query is going to be faster?
Write a query without the subselects which runs faster.
If you post your query here we possibly will be able to improve it. However, a version with the subselects may turn out to be faster.
Try rewriting some of the queries to eliminate the sub-select and compare runtimes.
Share and enjoy.
Let's say that you want to select all rows from one table that have a corresponding row in another (the data in the other table is not important; only the presence of a corresponding row matters). From what I know about DB2, this kind of query performs better when written as a correlated query with an EXISTS clause rather than an INNER JOIN. Is that the same for SQL Server? Or doesn't it make any difference whatsoever?
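The two forms in question look like this (table names are illustrative):

```sql
-- "Rows in A that have at least one match in B", as a correlated EXISTS:
SELECT a.id
FROM table_a a
WHERE EXISTS (SELECT 1 FROM table_b b WHERE b.a_id = a.id);

-- The same as an INNER JOIN; DISTINCT is needed because multiple
-- matches in B would otherwise duplicate rows of A:
SELECT DISTINCT a.id
FROM table_a a
INNER JOIN table_b b ON b.a_id = a.id;
```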
I just ran a test query and the two statements ended up with the exact same execution plan. Of course, for just about any performance question I would recommend running the test in your own environment; with SQL Server Management Studio this is easy (or SQL Query Analyzer if you're running 2000). Just type both statements into a query window and select Query | Include Actual Query Plan. Then run the query. Go to the results tab and you can easily see what the plans are and which one had a higher cost.
Odd: it's normally more natural for me to write these as a correlated query first, at which point I have to then go back and re-factor to use a join because in my experience the sql server optimizer is more likely to get that right.
But don't take me too seriously. For all I have 26K rep here and one of only 2 current SQL topic-specific badges, I'm actually pretty junior in terms of SQL knowledge (it's all about the volume! ;) ); certainly I'm no DBA. In practice, you will of course need to profile each method to gauge its actual performance. I would expect the optimizer to recognize what you're asking for and handle either query in the optimal way, but you never know until you check.
As everyone notes, it all boils down to the optimizer. I'd suggest writing it in whatever way feels more natural to you, then making sure the optimizer can figure out the most effective query plan (gather statistics, create an index, whatever). The SQL Server optimizer is pretty good overall, so long as you give it the information it needs to work with.
Use the join. It might not make much of a difference in performance if you have small tables, but if the "outer" table is very large then it will need to do the EXISTS sub-query for each row. If your tables are indexed on the common columns then it should be far quicker to do the INNER JOIN. BTW, if you want to find all rows that are NOT in the second table, use a LEFT JOIN and test for NULL in the second table--it is much faster than using EXISTS when you have very large tables and indexes.
Probably the best performance is with a join to a derived table. Exists would probably be next (and might be faster). The worst performance would be with a subquery inside the select as it would tend to run row by row instead of as a set.
However, all things being equal and database performance being very dependent on the database design. I would try out all possible methods and see which are faster in your circumstances.
I have two tables containing Tasks and Notes, and want to retrieve a list of tasks with the number of associated notes for each one. These two queries do the job:
select t.TaskId,
(select count(n.TaskNoteId) from TaskNote n where n.TaskId = t.TaskId) 'Notes'
from Task t
-- or
select t.TaskId,
count(n.TaskNoteId) 'Notes'
from Task t
left join
TaskNote n
on t.TaskId = n.TaskId
group by t.TaskId
Is there a difference between them and should I be using one over the other, or are they just two ways of doing the same job? Thanks.
On small datasets, they are a wash when it comes to performance. When indexed, the LOJ is a little better.
I've found on large datasets that a join (an inner join will work too) will outperform the subquery by a very large factor (sorry, no numbers).
In most cases, the optimizer will treat them the same.
I tend to prefer the second, because it has less nesting, which makes it easier to read and easier to maintain. I have started to use SQL Server's common table expressions to reduce nesting as well for the same reason.
In addition, the second syntax is more flexible if there are further aggregates which may be added in the future in addition to COUNT, like MIN(some_scalar), MAX(), AVG() etc.
The subquery will be slower as it is being executed for every row in the outer query. The join will be faster as it is done once. I believe that the query optimiser will not rewrite this query plan as it can't recognize the equivalence.
Normally you would do a join and group by for this sort of count. Correlated subqueries of the sort you show are mainly of interest if they have to do some grouping or more complex predicate on a table that is not participating in another join.
If you're using SQL Server Management Studio, you can enter both versions into the Query Editor and then right-click and choose Display Estimated Execution Plan. It will give you two percentage costs relative to the batch. If they're expected to take the same time, they'll both show as 50% - in which case, choose whichever you prefer for other reasons (easier to read, easier to maintain, better fit with your coding standards etc). Otherwise, you can pick the one with the lower percentage cost relative to the batch.
You can use the same technique to look at changing any query to improve performance by comparing two versions that do the same thing.
Of course, because it's a cost relative to the batch, it doesn't mean that either query is as fast as it could be - it just tells you how they compare to each other, not to some notional optimum query to get the same results.
There's no clear-cut answer on this. You should view the SQL Plan. In terms of relational algebra, they are essentially equivalent.
I make it a point to avoid subqueries wherever possible. The join will generally be more efficient.
You can use either, and they are semantically identical. In general, the rule of thumb is to use whichever form is easier for you to read, unless performance is an issue.
If performance is an issue, then experiment with rewriting the query using the other form. Sometimes the optimizer will use an index for one form, and not the other.