SQL query optimization: using IN over INNER JOIN

Given:
Table y
  id int, clustered index
  name nvarchar(25)
Table AnotherTable
  id int, clustered index
  name nvarchar(25)
Function SomeFunction
  does some math, then returns a valid id
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) ON anotherTable.id
Question:
While timing these two queries, I found that at large data sets the first query using IN is much faster than the second query using an INNER JOIN. I do not understand why; can someone help explain, please?
Execution Plan

Generally speaking, IN differs from JOIN in that a JOIN can return additional rows when a row has more than one match in the joined table.
From your estimated execution plan, though, it can be seen that in this case the two queries are semantically the same:
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN, they will be removed by the GROUP BY, as it only references columns from the left-hand table. Additionally, these duplicate rows will not alter the result, as MAX(A.Col2) will not change. This would not be the case for all aggregates, however: if you were to use SUM(A.Col2) (or AVG or COUNT), then the presence of the duplicates would change the result.
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM, so quite possibly it is expanding out all the duplicates and then aggregating them later, simply doing a lot more work.
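A tiny demonstration of that difference (a hypothetical sketch, not taken from the question):
create table A (Col1 int, Col2 int)
create table B (Col1 int)
insert A values (1, 10)
insert B values (1), (1)   -- two matches for A's single row

select A.Col1, max(A.Col2) as MaxVal, sum(A.Col2) as SumVal
from A
join B on A.Col1 = B.Col1
group by A.Col1
-- MaxVal = 10 either way, but SumVal = 20: the duplicate join rows doubled it,
-- whereas the semantically equivalent IN form would return SumVal = 10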
The estimated number of rows being aggregated is 2893.54 for IN vs. 28,271,800 for JOIN, but these estimates won't necessarily be very reliable, as the join predicate is unsargable.

Your second query is a bit funny - can you try this one instead?
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data, etc.) and your system (RAM, disk, etc.), it's really hard to give a "globally" valid statement.
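In SQL Server Management Studio you can compare the two side by side with something like this (a minimal sketch):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run both queries here, then compare logical reads and CPU time in the Messages tab
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;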

Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with an inline table-valued function that returns one row, it will be better.
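For illustration, here is a hedged sketch of that rewrite; the body of SomeFunction is unknown, so the math below is purely hypothetical:
-- Hypothetical inline table-valued version of the scalar UDF
CREATE FUNCTION dbo.SomeFunctionTVF (@id int)
RETURNS TABLE
AS
RETURN (SELECT @id * 2 AS id);  -- placeholder for the real math
GO

SELECT y.name
FROM dbo.y
CROSS APPLY dbo.SomeFunctionTVF(y.id) AS f
WHERE f.id IN (SELECT id FROM dbo.AnotherTable);
Unlike a scalar UDF, an inline TVF is expanded into the calling query's plan, so the optimizer can reason about it.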
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.

Related

SQL: Is it advised to filter table with 'select ... where' before join?

Which of those two variants is more advised in general SQL practice:
Let's consider table A with columns col1, col2, col3 and table B with columns col2, col4. Filtering the table with a select first:
select col2,col4 from
(select col1,col2 from tabA
where tabA.col3='sth') as t
join tabB using (col2);
or using plain join?:
select col2,col4 from tabA
join tabB using(col2)
where col3='sth';
We can assume the WHERE clause matches 1 row. The tables are of similar size. Does the Oracle planner deal with such joins properly, or is it going to create a huge joined table and then filter it?
Test it yourself on the real tables, using explain plans to learn how many rows are evaluated. I don't believe it is possible to know which would be better, or if there would be any difference; the available indexes make a difference to the optimizer's choice of approach, for example.
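In Oracle, for instance, you can produce an explain plan for either variant like this (a minimal sketch, using the second query from the question):
EXPLAIN PLAN FOR
select col2, col4
from tabA
join tabB using (col2)
where col3 = 'sth';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);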
Regarding your two examples: I don't like "natural join" syntax ("using"), so the first option below is, I believe, the more common approach (the where clause refers directly to the "from" table):
select a.col2,b.col4
from tabA a
inner join tabB b on a.col2 = b.col2
where a.col3='sth'
;
but you could also try a join condition like this:
select b.col2,b.col4
from tabB b
inner join tabA a on b.col2 = a.col2 and a.col3='sth'
;
note this reverses the table relationship.
Your second version is an excellent way of writing the query, with the minor exceptions that col4 and col3 are not qualified:
select col2, col4
from tabA a join
tabB b
using (col2)
where a.col3 = 'sth';
Just for the record, this is not a natural join. That is a separate construct in SQL -- and rather abominable.
In my experience, Oracle has a good query optimizer. It will know how to optimize the query for the given conditions. The subquery should make no difference whatsoever to the query plan. Oracle is not going to filter before the join, unless that is the right thing to do.
The only time an inner filter will perform better than an outer filter is in (usually complex) cases where the query optimizer chooses a different query plan for one query versus the other.
Think of it in the same way as a query hint: by reorganizing the query, you can influence the query plan without explicitly doing so with a hint.
As far as which is more advised, whichever is easiest to read is usually best. Even if one performs better than the other because of where you put your filter, you should instead focus on having the correct indexes to ensure good performance.
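For instance, assuming col3 is the filter column and col2 the join key (the index names here are illustrative), indexes like these would serve either phrasing equally well:
create index ix_tabA_col3 on tabA (col3, col2);  -- seek on the filter, with the join key at hand
create index ix_tabB_col2 on tabB (col2, col4);  -- seek on the join key, with col4 covered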

How to eliminate duplicate of subquery in ".. where X in (S) or Y in (S)"?

I have a query where I need to get rows from a table where either of two foreign keys exists in the results of another query. Here is the simplified SQL:
Select MainID From MainTable Where
  Key1 In (Select SubID From SubTable Where UserID = @UserID) Or
  Key2 In (Select SubID From SubTable Where UserID = @UserID)
As you can see, the sub-query is duplicated. Is the SQL compiler intelligent enough to recognize this and run the sub-query once only or does it run twice?
Is there a better way I can write this SQL?
Update: I should have mentioned this originally - SubID is the primary key on SubTable.
You would replace the IN clause with an EXISTS clause:
Select MainID
From MainTable
Where Exists
(
    Select *
    From SubTable
    Where UserID = @UserID
    And SubID In (MainTable.Key1, MainTable.Key2)
);
You can use a common table expression:
with subid_data as (
Select SubID
From SubTable
Where UserID = @UserID
)
Select MainID
From MainTable
Where Key1 In (select SubID from subid_data)
Or Key2 In (select SubID from subid_data);
I don't think the compiler is intelligent enough to do the table scan or index seek only once.
If you have a complicated where clause, you can push the sub-query results into a temp table, then use the temp table in the where clause for better performance.
SELECT SubID
INTO #SubTable
FROM SubTable
WHERE UserID = @UserID

SELECT MainID
FROM MainTable M
WHERE EXISTS (SELECT 1
              FROM #SubTable S
              WHERE M.Key1 = S.SubID)
   OR EXISTS (SELECT 1
              FROM #SubTable S
              WHERE M.Key2 = S.SubID)
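If SubTable returns many rows for the user, an index on the temp table may help the two probes (a hedged addition, not part of the original answer):
CREATE CLUSTERED INDEX ix_SubID ON #SubTable (SubID);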
Please try the following query:
Select MainID
From MainTable m
Where Exists
    (Select 1 From SubTable s Where s.UserID = @UserID And s.SubID In (m.Key1, m.Key2))
tl;dr: both the original and the following JOIN proposal (which merely looks less redundant) should generate equivalent query plans. View the actual query plans if there are any doubts as to how SQL Server is [currently] treating a query. (See IN vs. JOIN vs. EXISTS for a taste of the magic.)
Is the SQL compiler intelligent enough to recognize this and run the sub-query once only or does it run twice?
Yes, SQL Server is smart enough to handle this. It does not need to "run twice" (nit: the subquery does not "run" at all in a procedural sense); that is, there is no mandated explicit materialization stage - much less two. The JOIN transformation below shows why none is required.
Since these are independent (non-correlated) sub-queries1 - they do not depend on the outer query - they can, and I dare say will, be optimized, as they can be freely and easily moved under Relational Algebra (RA) rules.
As you can see, the sub-query is duplicated ... Is there a better way I can write this SQL?
However, it still "looks redundant" visually, because it is written so. SQL Server doesn't care - but a human might. Thus the following is how I would write it, and what I consider "better".
I am a big fan of using JOINs over subqueries; once a JOIN approach is adopted, it often "fits better" with RA. This simple transformation to a JOIN is possible because of the non-correlated nature of the original subqueries; the [SQL Server] query planner is capable of doing such RA rewrites internally. View the actual query plans to see what differences there are, if any.
Rewriting the query would then be:
Select MainID
From MainTable
Join (
    Select Distinct SubID -- SubID must be unique from this select
    From SubTable
    Where UserID = @UserID
) t
    -- Joining on "A or B" may indicate an ARC relationship,
    -- but this obtains the original results
    On Key1 = t.SubID
    Or Key2 = t.SubID
The DISTINCT is added to the derived-table query because of the (to me) unknown multiplicity of the SubID column - SQL Server can treat it as a redundant qualifier if SubID is bound by a unique constraint, so it's either required or "free". See IN vs. JOIN with large rowsets for why it matters that the joined table's keys are unique.
Note: SQL Server does not necessarily rewrite an IN to the join as shown above, as discussed in IN vs. JOIN vs. EXISTS; but the fundamental concept of being able to move the RA operation (and being able to treat the query as a what and not a how) is still used.
1 Some of the answers change the original subquery to a dependent/correlated subquery which is going the wrong way. It may still result in a respectable (or even equivalent) query plan as SQL Server will try to "undo" the changes - but that's going a step away from a clean RA model and JOINs! (And if SQL Server can't "undo" the added correlation then the query will be far inferior.)

Why does breaking out this correlated subquery vastly improve performance?

I tried running this query against two tables of very different sizes - #temp was about 15,000 rows, and Member is about 70,000,000 rows, about 68,000,000 of which do not have the ID 307.
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id as varchar) NOT IN (
SELECT IndividualID
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307)
This query ran for 18 hours, before I killed it and tried something else, which was:
SELECT IndividualID
INTO #source
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id AS VARCHAR) NOT IN (
SELECT IndividualID
FROM #source)
And this ran for less than a second before giving me a result.
I was pretty surprised by this. I'm a middle-tier developer rather than a SQL expert, and my understanding of what goes on under the hood is a little murky, but I would have presumed that, since the sub-query in my first attempt is the exact same code asking for the exact same data as in the second attempt, these would be roughly equivalent.
But that's obviously wrong. I can't look at the execution plan for my original query to see what SQL Server is trying to do. So can someone kindly explain why splitting the data out into a temp table is so much faster?
EDIT: Table schemas and indexes
The #temp table has two columns, Individual_ID int and Source_Code varchar(50)
Member and Person are more complex. They have 29 and 13 columns respectively, so I don't really want to post them in full. PersonID is an int and is the PK on Person and an FK on Member. IndividualID is a column on Person - this is not clear in the query as written.
I tried using a LEFT JOIN instead of NOT IN before asking the question. The performance on the second query wasn't noticeably different - both were sub-second. On the first query I let it run for an hour before stopping it, presuming it would make no significant difference.
I also added an index on #source, just like on the original table, so the performance impact should be identical.
First, your query has two faux pas that really stick out. You are converting to varchar, but you do not include a length argument. This should not be allowed! The default length varies by context, and you need to be explicit.
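A quick illustration of that trap (in a CAST, SQL Server's default length is 30; in a DECLARE it is 1):
DECLARE @s varchar(40) = REPLICATE('x', 40);
SELECT CAST(@s AS varchar) AS truncated,    -- silently cut to 30 characters
       CAST(@s AS varchar(40)) AS intact;   -- explicit length preserves the value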
Second, you are matching two keys in different tables, and they seemingly have different types. Foreign key references should always have the same type; a mismatch can have a very big impact on performance. If you are dealing with tables that have millions of rows, then you need to pay some attention to the data structure.
To understand the difference in performance, you need to understand execution plans. The two queries have very different execution plans. My (educated) guess is that the first version is using a nested loop join algorithm; the second version is using a more sophisticated one. In your case, this would be due to the ability of SQL Server to maintain statistics on tables. So, instantiating the intermediate results actually helps the optimizer produce a better query plan.
The subject of how best to write this logic has been investigated a lot. Here is a very good discussion on the subject by Aaron Bertrand.
I do agree with Aaron on the preference for not exists in this case:
SELECT COUNT(*)
FROM #temp t
WHERE NOT EXISTS (SELECT 1
                  FROM Member m JOIN
                       Person p
                       ON p.PersonID = m.PersonID
                  WHERE CompanyID <> 307 AND p.IndividualID = t.individual_id
                 );
However, I don't know if this will have better performance in this particular case.
This line is probably what kills the first query:
WHERE CAST(individual_id as varchar) NOT IN
My guess would be that this forces a table scan rather than using any indexes.

IN vs. JOIN with large rowsets

I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article on my blog summarizes both my answer and my comments to other answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent: they can yield different results if your table b is not key-preserved (i.e. the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
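For instance, if b.d is in fact unique but the uniqueness is simply not declared, declaring it is enough (a hedged sketch; the index name is illustrative):
CREATE UNIQUE INDEX ix_b_d ON b (d);
-- with this in place, the IN form and the plain JOIN can compile to the same plan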
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
Neither of these methods reevaluates the whole subquery each time.
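To see which method was actually chosen, inspect the plan; for example (a minimal sketch):
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM a WHERE a.c IN (SELECT d FROM b);
GO
SET SHOWPLAN_TEXT OFF;
GO
-- look for Nested Loops / Merge Join / Hash Match operators marked as Left Semi Join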
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all of the big four RDBMSs.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a
JOIN b ON a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This removes the duplicates that could be generated by the JOIN, but runs just as fast, if not faster.
Speaking from experience on a table with 49,000,000 rows, I would recommend a LEFT OUTER JOIN.
Using IN or EXISTS took 5 minutes to complete, whereas the LEFT OUTER JOIN finishes in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINS. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
,t1id int references t1(t1id)
)
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop; this difference is expected because, as I mentioned above, the join has different semantics.
From MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join is more efficient than the IN subquery; however, the SQL Server optimizer normally produces no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever need to be migrated in the future, the database engine may not be so forgiving (for example, using a join instead of an IN subquery makes a huge difference in MySQL).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
I've always been a supporter of the IN methodology. This link contains details of a test conducted in PostgreSQL:
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php

How To Create a SQL Index to Improve ORDER BY performance

I have some SQL similar to the following, which joins four tables and then orders the results by the "status" column of the first:
SELECT *
FROM a, b, c, d
WHERE b.aid=a.id AND c.id=a.cid AND a.did=d.id AND a.did='XXX'
ORDER BY a.status
It works. However, it's slow. I've worked out this is because of the ORDER BY clause and the lack of any index on table "a".
All four tables have the PRIMARY KEYs set on the "id" column.
So, I know I need to add an index to table a which includes the "status" column but what else does it need to include? Should "bid", "cid" and "did" be in there too?
I've tried to ask this in a general SQL sense but, if it's important, the target is SQLite for use with Gears.
Thanks in advance,
Jake (noob)
I would say it's slow because the engine is doing scans all over the place instead of seeks. Did you mean to do SELECT a.* instead? That would be faster as well; SELECT * here is equivalent to a.*, b.*, c.*, d.*.
You will probably get better results if you put a separate index on each of these columns:
a.did (so that a.did = 'XXX' is a seek instead of a scan, also helps a.did = d.id)
a.cid (for a.cid = c.id)
b.aid (for a.id = b.aid)
You could try adding Status to the first and second indexes with ASCENDING order, for additional performance - it doesn't hurt.
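A hedged sketch of those suggestions in SQLite syntax (the index names are illustrative):
CREATE INDEX idx_a_did_status ON a (did, status);  -- seek on did; may also satisfy the ORDER BY
CREATE INDEX idx_a_cid_status ON a (cid, status);
CREATE INDEX idx_b_aid ON b (aid);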
I'd be curious as to how you worked out that the problem is 'the ORDER BY clause and the lack of any index on table "a"'. I find this a little suspicious because, as you say later, there is an index on table a: the primary key.
Looking at the nature of the query and what I can guess about the nature of the data, I would think that this query would generally produce relatively few results compared to the size of the tables it's using, and that thus the ORDER BY would be extremely cheap. Of course, this is just a guess.
Whether an index will even help at all is dependent on the data in the table. What indices your query optimizer will use when doing a query is dependent on a lot of different factors, one of the big ones being the expected number of results produced from a lookup.
One thing that would help a lot is if you would post the output of EXPLAINing your query.
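In SQLite (which Gears uses) that would look like this:
EXPLAIN QUERY PLAN
SELECT * FROM a, b, c, d
WHERE b.aid=a.id AND c.id=a.cid AND a.did=d.id AND a.did='XXX'
ORDER BY a.status;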
Have you tried explicit joins?
select *
from a
inner join b on a.id = b.aid
inner join c on a.cid = c.id
inner join d on a.did = d.id
where a.did = 'XXX'
order by a.status
The correct use of joins (left, right, inner, outer) depends on the structure of the tables.
Hope this helps!