Subqueries vs joins - sql

I refactored a slow section of an application we inherited from another company to use an inner join instead of a subquery like:
WHERE id IN (SELECT id FROM ...)
The refactored query runs about 100x faster (from ~50 seconds down to ~0.3 seconds). I expected an improvement, but can anyone explain why it was so drastic? The columns used in the WHERE clause were all indexed. Does SQL execute the query in the WHERE clause once per row or something?
Update - Explain results:
The difference is in the second part of the "where id in ()" query -
2 DEPENDENT SUBQUERY submission_tags ref st_tag_id st_tag_id 4 const 2966 Using where
vs 1 indexed row with the join:
SIMPLE s eq_ref PRIMARY PRIMARY 4 newsladder_production.st.submission_id 1 Using index

A "correlated subquery" (i.e., one in which the where condition depends on values obtained from the rows of the containing query) will execute once for each row. A non-correlated subquery (one in which the where condition is independent of the containing query) will execute once at the beginning. The SQL engine makes this distinction automatically.
But, yeah, explain-plan will give you the dirty details.
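To make the correlated/non-correlated distinction concrete, here is a minimal sketch. It assumes SQLite through Python's standard `sqlite3` module rather than the MySQL server from the question, and the `submissions` / `submission_tags` schema is hypothetical, loosely modelled on the names visible in the EXPLAIN output. It only shows that the three forms ask the same question; how many times an engine actually evaluates the inner SELECT is up to its optimizer.

```python
import sqlite3

# Hypothetical schema; names are borrowed from the EXPLAIN output, the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE submissions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE submission_tags (submission_id INTEGER, tag_id INTEGER);
    CREATE INDEX st_tag_id ON submission_tags (tag_id);
    INSERT INTO submissions VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO submission_tags VALUES (1, 10), (2, 20), (3, 10);
""")

# Non-correlated: the inner SELECT never mentions the outer row,
# so an engine is free to evaluate it once.
non_correlated = """
    SELECT id FROM submissions
    WHERE id IN (SELECT submission_id FROM submission_tags WHERE tag_id = 10)
    ORDER BY id
"""

# Correlated: the inner SELECT refers to s.id from the outer query,
# so conceptually it is evaluated for each outer row.
correlated = """
    SELECT s.id FROM submissions AS s
    WHERE EXISTS (SELECT 1 FROM submission_tags AS st
                  WHERE st.submission_id = s.id AND st.tag_id = 10)
    ORDER BY s.id
"""

# The join form of the same question.
joined = """
    SELECT s.id FROM submissions AS s
    INNER JOIN submission_tags AS st ON st.submission_id = s.id
    WHERE st.tag_id = 10
    ORDER BY s.id
"""

results = {
    name: [row[0] for row in conn.execute(sql)]
    for name, sql in [("non_correlated", non_correlated),
                      ("correlated", correlated),
                      ("joined", joined)]
}
print(results)
```

On this toy data each submission has at most one matching tag row, so the join returns no duplicates; with several matching rows per submission the join form would need DISTINCT to stay equivalent.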

You are running the subquery once for every row whereas the join happens on indexes.

Here's an example of how subqueries are evaluated in MySQL 6.0.
The new optimizer will convert this kind of subquery into a join.

Before queries are run against the dataset, they are put through a query optimizer. The optimizer attempts to organize the query so that it can remove as many tuples (rows) from the result set as quickly as it can. Often when you use subqueries (especially bad ones), the tuples can't be pruned out of the result set until the outer query starts to run.
Without seeing the query it's hard to say what was so bad about the original, but my guess is that it was something the optimizer just couldn't make much better. Running 'explain' will show you the optimizer's method for retrieving the data.

Look at the query plan for each query.
WHERE ... IN and JOIN can typically be implemented using the same execution plan, so typically there is zero speed-up from changing between them.

The optimizer didn't do a very good job. Usually the two forms can be transformed into each other without any difference, and the optimizer can do this itself.

This question is somewhat general, so here's a general answer:
Basically, queries take longer when MySQL has tons of rows to sort through.
Do this:
Run an EXPLAIN on each of the queries (the JOIN'ed one, then the Subqueried one), and post the results here.
I think seeing the difference in MySQL's interpretation of those queries would be a learning experience for everyone.
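As a rough sketch of what comparing plans looks like in practice: the snippet below assumes SQLite via Python's `sqlite3` module, where the statement is `EXPLAIN QUERY PLAN`, not MySQL's `EXPLAIN`. The schema is hypothetical, and the plan text varies between SQLite versions, so no particular output is implied.

```python
import sqlite3

# Hypothetical schema; only the shape of the two queries matters here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE submissions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE submission_tags (submission_id INTEGER, tag_id INTEGER);
    CREATE INDEX st_tag_id ON submission_tags (tag_id);
""")


def show_plan(label, sql):
    """Print and return the human-readable plan steps for one query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the textual description of the step.
    details = [row[-1] for row in rows]
    print(label)
    for detail in details:
        print("   ", detail)
    return details


subquery_plan = show_plan(
    "IN (subquery):",
    "SELECT id FROM submissions "
    "WHERE id IN (SELECT submission_id FROM submission_tags WHERE tag_id = 10)",
)
join_plan = show_plan(
    "JOIN:",
    "SELECT s.id FROM submissions AS s "
    "JOIN submission_tags AS st ON st.submission_id = s.id "
    "WHERE st.tag_id = 10",
)
```

Reading the two listings side by side shows which indexes each form uses and whether the subquery is handled as a separate step.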

The where subquery has to run 1 query for each returned row. The inner join just has to run 1 query.

Usually it's the result of the optimizer not being able to figure out that the subquery can be executed as a join, in which case it executes the subquery for each record in the table rather than joining the table in the subquery against the table you are querying. Some of the more "enterprisey" databases are better at this, but they still miss it sometimes.

With a subquery, you have to re-execute the 2nd SELECT for each result, and each execution typically returns 1 row.
With a join, the 2nd SELECT returns a lot more rows, but you only have to execute it once. The advantage is that now you can join on the results, and joining relations is what a database is supposed to be good at. For example, maybe the optimizer can spot how to take better advantage of an index now.

It isn't so much the subquery as the IN clause, although joins are at the foundation of at least Oracle's SQL engine and run extremely quickly.

The subquery was probably executing a "full table scan". In other words, it was not using the index and was returning far too many rows, which the WHERE clause of the main query then had to filter out.
Just a guess without details of course but that's the common situation.

Taken from the Reference Manual (14.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOINS.

Related

Is an inner select from the same table more efficient than a "regular" select?

I can't find any answer that addresses efficiency (which query runs faster, not how to write the query). I appreciate your help in this matter.
Is this query better (does it run faster):
SELECT c.user_id_1
FROM (
SELECT a.user_id_1
FROM (
SELECT b.user_id_1, b.mail_id, b.age
FROM table_name AS b WHERE b.name = 'test'
) AS a WHERE a.age = '18') AS c
than this query:
SELECT user_id_1 FROM table_name
WHERE name = 'test'
AND age = '18'
Both queries give me the same results, but how can I test which one is faster?
I can think of no database where using a subquery would be faster. In most databases, the two queries would produce exactly the same execution plan -- regardless of indexes, partitions, and other factors.
It is important to understand that SQL engines do not directly execute a SQL statement. They convert the SQL statement into a directed acyclic graph (DAG) that looks (to the uninitiated) nothing like the original statement. Part of this process is optimizing the code, which makes the execution graph even less like the original code.
Some versions of MySQL and MariaDB have a habit of materializing subqueries in the FROM clause. This can have a deleterious effect on performance! So, a subquery can sometimes make things much worse.
It is also possible that very complex subqueries might confuse the optimizer, but a simple case such as yours would not be one of those cases.
I just tested this on a table with 13,745,928 records. If both user_id and age are covered with a clustered index then both queries will produce the same plan using a Clustered Index Seek with a Cost 100%. In this case, both queries returned in under 5 seconds.
On a side note: If you have multiple subqueries that return data from the same table(s) then it may be more performant to build up an indexed #temptable to replace the subqueries or CTEs. When you use the same subquery more than once, the query analyzer will return a plan for each subquery, meaning each subquery will be executed, not just one.
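One way to test the two forms from the question is simply to run both and time them. The sketch below is only an illustration under stated assumptions: it uses SQLite through Python's `sqlite3` module (not the asker's database), invents 1,000 rows of data, and reduces the nested form to a single derived table. Timings on such a toy table say nothing about a production system; the point is the method.

```python
import sqlite3
import time

# Made-up data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (user_id_1 INTEGER, mail_id INTEGER, name TEXT, age TEXT)"
)
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?, ?)",
    [
        (i, i * 10, "test" if i % 2 == 0 else "other", "18" if i % 3 == 0 else "30")
        for i in range(1000)
    ],
)

# Filter inside a derived table, then filter again outside it.
nested = """
    SELECT a.user_id_1
    FROM (SELECT b.user_id_1, b.age FROM table_name AS b WHERE b.name = 'test') AS a
    WHERE a.age = '18'
    ORDER BY a.user_id_1
"""

# Both filters in one flat WHERE clause.
flat = """
    SELECT user_id_1 FROM table_name
    WHERE name = 'test' AND age = '18'
    ORDER BY user_id_1
"""


def timed(sql):
    """Run a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


nested_rows, nested_seconds = timed(nested)
flat_rows, flat_seconds = timed(flat)
print(len(nested_rows), len(flat_rows))
print(f"nested: {nested_seconds:.6f}s  flat: {flat_seconds:.6f}s")
```

For a meaningful comparison, repeat each query several times against realistic data and compare the execution plans as well as the wall-clock times.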

Does the order of joins in a big search query affect the response time in SQL Server?

I have a query that has joins between 15 tables or even more. I need to optimize the response time. I created some index columns, changed some conditions from NOT IN to NOT EXISTS, but I found myself wondering about this.
Does the order of these joins affect the response time?
The order of JOINs definitely does affect performance, as well as the type of join. INNER JOINs, generally, will yield quicker results than RIGHT or LEFT OUTER JOINs due to the selectivity of the join.
SQL Server also tries its best to optimize every query, but at 15 joins it may have a hard time. Consider breaking the statement up into smaller chunks with fewer joins. A strategy I've used in the past to resolve things like this is to create a temporary table to store the results, then INSERT into it and UPDATE it through several different statements, with the 15 tables spread out across the appropriate insert/update steps.
The order of the joins definitely matters. The question is, can you change the order by rewriting the SQL? The query optimizer doesn't necessarily care what order you write the joins in, depending on the type of join. It does its best to find the most efficient execution plan based on the SQL you've written, but it is nowhere close to perfect. If, after looking at the execution plan, you notice that the query could be done more efficiently another way, you can trick the optimizer into doing it your way and see if it helps.
One way to trick it is to use temp tables to pare down the result set before joining to large tables. Fewer records are then selected, which reduces I/O.
Another way, as demonstrated by Adam Machanic, is to use TOP in the SELECT clause together with an ORDER BY.
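The temp-table idea mentioned above can be sketched as follows. This is an assumption-laden illustration: it uses SQLite via Python's `sqlite3` module, where the syntax is `CREATE TEMP TABLE`, whereas SQL Server would use a `#temp` table, and the `customers` / `orders` tables are hypothetical stand-ins for the 15 real ones.

```python
import sqlite3

# Hypothetical two-table schema standing in for a much larger join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
    INSERT INTO orders VALUES (1, 1, 50), (2, 2, 75), (3, 3, 20), (4, 1, 30);
""")

# Step 1: pare the driving set down to the rows of interest and index it.
conn.executescript("""
    CREATE TEMP TABLE north_customers AS
        SELECT id FROM customers WHERE region = 'north';
    CREATE INDEX temp.north_customers_id ON north_customers (id);
""")

# Step 2: join the small staged set to the larger table.
staged = conn.execute("""
    SELECT o.id, o.total
    FROM north_customers AS n
    JOIN orders AS o ON o.customer_id = n.id
    ORDER BY o.id
""").fetchall()

# The single-statement equivalent, for comparison of results.
direct = conn.execute("""
    SELECT o.id, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.region = 'north'
    ORDER BY o.id
""").fetchall()

print(staged)
print(direct)
```

Both statements return the same rows; whether the staged version is actually faster depends on the engine and the data, so it should be checked against the execution plan.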

Is it possible to cache the results of a subquery in a variable?

I have a query in PostgreSQL 9.1 like this:
SELECT id
FROM students
INNER JOIN exams ON /some condition
WHERE studentsid NOT IN (SUBQUERY);
When I run only the subquery, it executes in 120 ms. When I run the query above without the subquery condition, it takes 12 seconds, but when I add the subquery, it runs for half an hour.
Is it possible to cache the results of the subquery in some variable (the results are always the same array of ids) and run the query from the console or pgAdmin?
I found the WITH statement, but it looks like it is not supported in Postgres.
First, the WITH statement is supported in Postgres.
Second, you need to identify where the performance problem is. Is it in the subquery? Or is it the NOT IN?
You can put the subquery in a table, add indexes, and make the query more efficient.
You can rewrite the subquery using a left join, which often allows the query to be better optimized.
You can add appropriate indexes to make the entire query more efficient.
Without knowing what the subquery actually does, choosing the right approach is speculation.
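The options listed above can be sketched like this. The example assumes SQLite via Python's `sqlite3` module instead of PostgreSQL 9.1, and the `excluded_source` table is a hypothetical stand-in for whatever the real subquery reads. The join to `exams` is left out because its condition is not given in the question.

```python
import sqlite3

# Hypothetical data; excluded_source plays the role of the unknown subquery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE excluded_source (student_id INTEGER NOT NULL);
    INSERT INTO students VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Di');
    INSERT INTO excluded_source VALUES (2), (4);
""")

# Option A: name the subquery once with WITH and reference it in NOT IN.
with_rows = conn.execute("""
    WITH excluded AS (SELECT student_id FROM excluded_source)
    SELECT s.id FROM students AS s
    WHERE s.id NOT IN (SELECT student_id FROM excluded)
    ORDER BY s.id
""").fetchall()

# Option B: put the subquery result in a table and index it.
conn.executescript("""
    CREATE TEMP TABLE excluded_ids AS SELECT student_id FROM excluded_source;
    CREATE INDEX temp.excluded_ids_idx ON excluded_ids (student_id);
""")

# Option C: rewrite NOT IN as a LEFT JOIN that keeps only unmatched rows.
left_join_rows = conn.execute("""
    SELECT s.id FROM students AS s
    LEFT JOIN excluded_ids AS e ON e.student_id = s.id
    WHERE e.student_id IS NULL
    ORDER BY s.id
""").fetchall()

print(with_rows)
print(left_join_rows)
```

The `NOT NULL` constraint on `student_id` matters: if the subquery can yield a NULL, NOT IN returns no rows at all, while the LEFT JOIN form still returns the unmatched students.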

Correlated query vs inner join performance in SQL Server

Let's say that you want to select all rows from one table that have a corresponding row in another one (the data in the other table is not important, only the presence of a corresponding row). From what I know about DB2, this kind of query performs better when written as a correlated query with an EXISTS clause rather than an INNER JOIN. Is that the same for SQL Server? Or doesn't it make any difference whatsoever?
I just ran a test query and the two statements ended up with the exact same execution plan. Of course, for just about any performance question I would recommend running the test in your own environment. With SQL Server Management Studio this is easy (or SQL Query Analyzer if you're running 2000). Just type both statements into a query window and select Query | Include Actual Execution Plan. Then run the query and look at the execution plan tab in the results pane, where you can easily see what the plans are and which one had a higher cost.
Odd: it's normally more natural for me to write these as a correlated query first, at which point I have to then go back and re-factor to use a join because in my experience the sql server optimizer is more likely to get that right.
But don't take me too seriously. For all that I have 26K rep here and one of only 2 current sql topic-specific badges, I'm actually pretty junior in terms of sql knowledge (It's all about the volume! ;) ); certainly I'm no DBA. In practice, you will of course need to profile each method to gauge its actual performance. I would expect the optimizer to recognize what you're asking for and handle either query in the optimal way, but you never know until you check.
As everyone notes, it all boils down to the optimizer. I'd suggest writing it in whatever way feels more natural to you, then making sure the optimizer can figure out the most effective query plan (gather statistics, create an index, whatever). The SQL Server optimizer is pretty good overall, so long as you give it the information it needs to work with.
Use the join. It might not make much of a difference in performance if you have small tables, but if the "outer" table is very large then it will need to do the EXISTS sub-query for each row. If your tables are indexed on the common columns then it should be far quicker to do the INNER JOIN. BTW, if you want to find all rows that are NOT in the second table, use a LEFT JOIN and test for NULL in the second table--it is much faster than using EXISTS when you have very large tables and indexes.
Probably the best performance is with a join to a derived table. Exists would probably be next (and might be faster). The worst performance would be with a subquery inside the select as it would tend to run row by row instead of as a set.
However, database performance is very dependent on the database design, so all things being equal I would try out all possible methods and see which are faster in your circumstances.
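Whichever form turns out faster, the two are only interchangeable if they return the same rows. A minimal sketch of the difference, assuming SQLite via Python's `sqlite3` module rather than SQL Server, with hypothetical `parent_rows` / `child_rows` tables: when the other table holds several matching rows, EXISTS reports each outer row once, while a plain INNER JOIN repeats it.

```python
import sqlite3

# Hypothetical tables: parent 1 has two matching child rows, parent 2 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent_rows (id INTEGER PRIMARY KEY);
    CREATE TABLE child_rows (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO parent_rows VALUES (1), (2), (3);
    INSERT INTO child_rows VALUES (10, 1), (11, 1), (12, 3);
""")


def ids(sql):
    """Return the first column of every result row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Correlated EXISTS: only the presence of a matching row matters.
exists_ids = ids("""
    SELECT p.id FROM parent_rows AS p
    WHERE EXISTS (SELECT 1 FROM child_rows AS c WHERE c.parent_id = p.id)
    ORDER BY p.id
""")

# Plain INNER JOIN: one output row per matching pair.
join_ids = ids("""
    SELECT p.id FROM parent_rows AS p
    INNER JOIN child_rows AS c ON c.parent_id = p.id
    ORDER BY p.id
""")

# INNER JOIN with DISTINCT: equivalent to the EXISTS form again.
distinct_join_ids = ids("""
    SELECT DISTINCT p.id FROM parent_rows AS p
    INNER JOIN child_rows AS c ON c.parent_id = p.id
    ORDER BY p.id
""")

print(exists_ids, join_ids, distinct_join_ids)
```

So a fair performance comparison pits EXISTS against the join plus whatever de-duplication the join needs.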

subselect vs outer join

Consider the following 2 queries:
select tblA.a,tblA.b,tblA.c,tblA.d
from tblA
where tblA.a not in (select tblB.a from tblB)
select tblA.a,tblA.b,tblA.c,tblA.d
from tblA left outer join tblB
on tblA.a = tblB.a where tblB.a is null
Which will perform better? My assumption is that in general the join will be better except in cases where the subselect returns a very small result set.
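For reference, the two queries can be checked for equivalence on a small dataset before comparing their speed. This sketch assumes SQLite via Python's `sqlite3` module and made-up rows; `tblB.a` is declared NOT NULL here because NOT IN returns no rows when the subquery yields a NULL, which would make the two forms differ.

```python
import sqlite3

# Made-up rows; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblA (a INTEGER, b TEXT, c TEXT, d TEXT);
    CREATE TABLE tblB (a INTEGER NOT NULL);
    INSERT INTO tblA VALUES (1, 'b1', 'c1', 'd1'),
                            (2, 'b2', 'c2', 'd2'),
                            (3, 'b3', 'c3', 'd3');
    INSERT INTO tblB VALUES (2);
""")

# Subselect form from the question.
not_in_rows = conn.execute("""
    SELECT tblA.a, tblA.b, tblA.c, tblA.d
    FROM tblA
    WHERE tblA.a NOT IN (SELECT tblB.a FROM tblB)
    ORDER BY tblA.a
""").fetchall()

# Outer-join form from the question.
left_join_rows = conn.execute("""
    SELECT tblA.a, tblA.b, tblA.c, tblA.d
    FROM tblA LEFT OUTER JOIN tblB
    ON tblA.a = tblB.a WHERE tblB.a IS NULL
    ORDER BY tblA.a
""").fetchall()

print(not_in_rows)
print(left_join_rows)
```

Once the results agree on your own data, timing both forms and reading their plans tells you which one your engine handles better.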
RDBMSs "rewrite" queries to optimize them, so it depends on the system you're using, and I would guess they end up giving the same performance on most "good" databases.
I suggest picking the one that is clearer and easier to maintain; for my money, that's the first one. It's much easier to debug the subquery, as it can be run independently to check for sanity.
Non-correlated subqueries are fine. You should go with what describes the data you're wanting. As has been noted, this likely gets rewritten into the same plan, but that isn't guaranteed! What's more, if tables A and B are not 1:1 you will get duplicate tuples from the join query (as the IN clause performs an implicit DISTINCT sort), so it's always best to code what you want and actually think about the outcome.
Well, it depends on the datasets. From my experience, if you have a small dataset then go for NOT IN; if it's large, go for a LEFT JOIN. The NOT IN clause seems to be very slow on large datasets.
One other thing I might add is that the explain plans might be misleading. I've seen several queries where the explain cost was sky high and the query ran in under 1s. On the other hand, I've seen queries with an excellent explain plan that could run for hours.
So all in all do test on your data and see for yourself.
I second Tom's answer that you should pick the one that is easier to understand and maintain.
The query plan of any query in any database cannot be predicted because you haven't given us indexes or data distributions. The only way to predict which is faster is to run them against your database.
As a rule of thumb I tend to use sub-selects when I do not need to include any columns from tblB in my select clause. I would definitely go for a sub-select when I want to use the 'in' predicate (and usually for the 'not in' that you included in the question), for the simple reason that these are easier to understand when you or someone else has to come back and change them.
The first query will be faster in SQL Server, which I think is slightly counterintuitive: subqueries seem like they should be slower. In some cases (as data volumes increase) an EXISTS may be faster than an IN.
It should be noted that these queries will produce different results if TblB.a is not unique.
From my observations, MS SQL Server produces the same query plan for these queries.
I created a simple query similar to the ones in the question on MSSQL2005 and the explain plans were different. The first query appears to be faster. I am not a SQL expert, but the estimated explain plan showed 37% for query 1 and 63% for query 2. It appears that the biggest cost for query 2 is the join. Both queries had two table scans.