JOIN versus EXISTS performance - sql

Generally speaking, is there a performance difference between using a JOIN to select rows versus an EXISTS where clause? Searching various Q&A web sites suggests that a join is more efficient, but I recall learning a long time ago that EXISTS was better in Teradata.
I do see other SO answers, like this and this, but my question is specific to Teradata.
For example, consider these two queries, which return identical results:
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
join MY_TARGET_TABLE x
on x.srv_accs_id=svc.srv_accs_id
group by 1
order by 1
-and-
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
where exists(
select 1
from MY_TARGET_TABLE x
where x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1
The primary index (unique) on both tables is 'srv_accs_id'. MY_BASE_TABLE is rather large (200 million rows) and MY_TARGET_TABLE relatively small (200,000 rows).
There is one significant difference in the EXPLAIN plans: The first says the two tables are joined "by way of a RowHash match scan" and the second says "by way of an all-rows scan". Both say it is "an all-AMPs JOIN step" and the total estimated time is identical (0.32 seconds).
Both queries perform the same (I'm using Teradata 13.10).
A similar experiment to find non-matches comparing a LEFT OUTER JOIN with a corresponding IS NULL where clause to a NOT EXISTS sub-query does show a performance difference:
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
left outer join MY_TARGET_TABLE x
on x.srv_accs_id=svc.srv_accs_id
where x.srv_accs_id is null
group by 1
order by 1
-and-
select svc.ltv_scr, count(*) as freq
from MY_BASE_TABLE svc
where not exists(
select 1
from MY_TARGET_TABLE x
where x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1
The second query plan is faster (2.21 versus 2.14 seconds as described by EXPLAIN).
My example may be too trivial to see a difference; I'm just looking for coding guidance.

NOT EXISTS is more efficient than using a LEFT OUTER JOIN to exclude records that are missing from the participating table using an IS NULL condition because the optimizer will elect to use an EXCLUSION MERGE JOIN with the NOT EXISTS predicate.
While your second test did not yield impressive results for the data sets you were using, the performance increase from NOT EXISTS over a LEFT JOIN becomes very noticeable as your data volumes increase. Keep in mind that the tables will need to be hash distributed by the columns that participate in the NOT EXISTS join, just like they would in the LEFT JOIN. Therefore, data skew can impact the performance of the EXCLUSION MERGE JOIN.
EDIT:
Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.
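To illustrate the point about NULLs, here is a minimal sketch using Python's built-in sqlite3 module rather than Teradata, so it shows only the standard SQL semantics and says nothing about Teradata's plans or timings. The tables base_t and target_t are hypothetical. When the compared column can be NULL, NOT IN returns no rows at all, while NOT EXISTS and the LEFT JOIN ... IS NULL form both return the non-matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_t (id INTEGER);
    CREATE TABLE target_t (id INTEGER);   -- nullable on purpose
    INSERT INTO base_t VALUES (1), (2), (3);
    INSERT INTO target_t VALUES (1), (NULL);
""")

# NOT IN: "2 NOT IN (1, NULL)" evaluates to NULL, so the row is rejected.
not_in = conn.execute(
    "SELECT id FROM base_t WHERE id NOT IN (SELECT id FROM target_t) ORDER BY id"
).fetchall()

# NOT EXISTS: the correlated equality simply never matches the NULL row.
not_exists = conn.execute("""
    SELECT b.id FROM base_t b
    WHERE NOT EXISTS (SELECT 1 FROM target_t t WHERE t.id = b.id)
    ORDER BY b.id
""").fetchall()

# LEFT JOIN ... IS NULL: the other common way to write the same anti-join.
left_join = conn.execute("""
    SELECT b.id FROM base_t b
    LEFT JOIN target_t t ON t.id = b.id
    WHERE t.id IS NULL
    ORDER BY b.id
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(2,), (3,)]
print(left_join)   # [(2,), (3,)]
```

Which of the two equivalent forms runs faster is still something to confirm with EXPLAIN on your own system.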

Related

Optimizing OUTER JOIN queries using filters from WHERE clause.(Query Planner)

I am writing a distributed SQL query planner(Query Engine). Data will be fetched from RDBMS(PostgreSQL) nodes involving network I/O.
I want to optimize JOIN queries.
Logical Order of Execution is:
Do JOIN(make use of ON clause)
Apply WHERE clause on the joined result.
I was thinking about applying Filter(WHERE clause specific to a table) first itself, and then do join.
In what cases would that result in wrong results?
Example:
SELECT *
FROM tableA
LEFT JOIN tableB ON(tableA.col1 = tableB.col1)
LEFT JOIN tableC ON(tableB.col2 = tableC.col1)
WHERE tableA.colY < 100 AND tableB.colX > 50
Logical Execution:
joinResult = (tableA left join tableB ON() ) left join tableC ON()
Filter joinResult using given WHERE clause.
Proposed Execution:
filteredA = tableA WHERE tableA.colY < 100
filteredB = tableB WHERE tableB.colX > 50
Result = (filteredA left join filteredB ON(..))left join tableC ON(..)
Can I optimize any query like this? That is filtering the table first and then applying join above that.
Edit:
Some people are confused and are discussing this specific example. I am not asking about this specific example query; I am writing a query planner and I want to handle all types of queries.
Please note that, each of the tables is sharded and stored in different machines, and the current execution model is to fetch each of the tables and then do join locally. So if I apply the WHERE filter before fetching, it would be better.
This is actually a complex topic.
We can filter the table in some cases. We can also reorder outer joins and then push the filter quals inside.
I was going through a research paper on this, but I haven't finished it yet (and may not).
So for now, for those who are looking for answers, you could probably go through this research paper particularly section 2.2. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.2531&rep=rep1&type=pdf
For now I'm relying on PostgreSQL's planner and taking its output and reconstructing the query for my requirements.
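As a concrete case of where filtering first goes wrong, here is a small sketch using Python's sqlite3 module in place of the sharded PostgreSQL setup. It reduces the question's example to two tables, and the data is made up. It suggests a rule of thumb rather than a complete planner rule: a WHERE filter on the preserved (left) side of a LEFT JOIN can be pushed below the join, but a filter on the null-supplying (right) side cannot, unless the join is first turned into an INNER JOIN (which is valid here because colX > 50 rejects the NULL-extended rows anyway).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (col1 INTEGER, colY INTEGER);
    CREATE TABLE tableB (col1 INTEGER, colX INTEGER);
    INSERT INTO tableA VALUES (1, 10), (2, 20), (3, 500);
    INSERT INTO tableB VALUES (1, 60), (2, 40);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Logical order: join first, then apply the WHERE clause.
logical = run("""
    SELECT a.col1, b.colX
    FROM tableA a LEFT JOIN tableB b ON a.col1 = b.col1
    WHERE a.colY < 100 AND b.colX > 50
    ORDER BY a.col1
""")

# Proposed order: filter both tables first, then LEFT JOIN. This is wrong:
# row 2 loses its match in tableB and comes back NULL-extended.
pushed_both = run("""
    SELECT a.col1, b.colX
    FROM (SELECT * FROM tableA WHERE colY < 100) a
    LEFT JOIN (SELECT * FROM tableB WHERE colX > 50) b ON a.col1 = b.col1
    ORDER BY a.col1
""")

# Pushing only the filter on the preserved side keeps the result intact.
pushed_left_only = run("""
    SELECT a.col1, b.colX
    FROM (SELECT * FROM tableA WHERE colY < 100) a
    LEFT JOIN tableB b ON a.col1 = b.col1
    WHERE b.colX > 50
    ORDER BY a.col1
""")

# The filter on tableB rejects NULLs, so the join can become an INNER JOIN,
# after which both filters can be pushed down safely.
pushed_as_inner = run("""
    SELECT a.col1, b.colX
    FROM (SELECT * FROM tableA WHERE colY < 100) a
    INNER JOIN (SELECT * FROM tableB WHERE colX > 50) b ON a.col1 = b.col1
    ORDER BY a.col1
""")

print(logical)           # [(1, 60)]
print(pushed_both)       # [(1, 60), (2, None)]
print(pushed_left_only)  # [(1, 60)]
print(pushed_as_inner)   # [(1, 60)]
```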

SAS Enterprise: left join and right join difference?

I joined a new company that uses SAS Enterprise Guide.
I have 2 tables: table A has 100 rows, and table B has over 30M rows (50-60 columns).
I tried to do a right join from A (100) to B (30M); it took over 2 hours and no result came back. I want to ask: will it help if I do a left join? I used the GUI and created the following query.
30M Record <- 100 Record ?
or
100 Record -> 30M Record ?
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CASE_NUMBER AS
SELECT t2.EMPGRPCOM,
t2.SEQINVNUM,
t2.SBSID,
t2.SBSLASTNAME,
t2.SBSFIRSTNAME,
t2.PMTDUEDATE,
t2.PREMAMT,
t2.ITEMDESC,
t2.EFFDATE,
t2.PAYAMT,
t2.MCAIDRATECD,
t2.REBILLIND,
t2.BILLTYPE
FROM WORK.'CASE NUMBER'n t1
LEFT JOIN DW.BILLING t2 ON (t1.CaseNumber = t2.SBSID)
WHERE t2.LOB = 'MD' AND t2.PMTDUEDATE BETWEEN '1Jan2015:0:0:0'dt AND '31Dec2017:0:0:0'dt AND t2.SITEID = '0001';
QUIT;
Left join and Right join, all other things aside, are equivalent - if you implement them the same way, anyway. I.E.,
select a.*
from a
left join
b
on a.id=b.id
;
vs
select a.*
from b
right join
a
on b.id=a.id
;
Same exact query, no difference, same time used. SQL is a declarative language, meaning the query optimizer looks at what you send it and figures out the best way to do it, so it sees both queries and knows in both cases to do the same thing.
You can read about this in all sorts of articles, this one is a good starting point, or if that link ages just search for "right join vs left join".
Now, what you might want to consider is writing this in a different way, namely not using SQL; this kind of query SQL should be good at but sometimes isn't for some reason. I would write it as a hash table search, where the smaller case_number dataset is loaded to memory, then data step iterate over the larger table and check if it's found in the smaller dataset - if so, then great, return it.
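The hash-lookup idea can be sketched as follows. This is plain Python rather than a SAS data step, purely to show the shape of the algorithm: load the small key set into memory once, then make a single pass over the large table. The function name and the sample rows are hypothetical; only the column names SBSID and PREMAMT come from the query above.

```python
def hash_lookup_join(small_keys, big_rows, key_field):
    """Return the rows of big_rows whose key appears in small_keys.

    small_keys is loaded into a set (a hash table) once, so each probe
    is O(1) on average and the large table is scanned exactly once.
    """
    wanted = set(small_keys)
    return [row for row in big_rows if row[key_field] in wanted]


# Stand-ins for the 100-row case-number table and the 30M-row billing table.
case_numbers = ["A1", "B2"]
billing = [
    {"SBSID": "A1", "PREMAMT": 100.0},
    {"SBSID": "Z9", "PREMAMT": 55.0},
    {"SBSID": "B2", "PREMAMT": 70.0},
    {"SBSID": "A1", "PREMAMT": 25.0},
]

matched = hash_lookup_join(case_numbers, billing, "SBSID")
print([row["SBSID"] for row in matched])  # ['A1', 'B2', 'A1']
```

Note that this returns only matching rows, which is inner-join behaviour.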
I'd also think about whether left/right join is what you want, vs. inner join. Seems to me that if you're returning solely t2 values, right/left join isn't correct (when t1 is the "primary"): you'll just get empty rows for the non-matches. Either return a t1 variable, or use inner join.

Is there any performance difference between using inner join vs left join? [duplicate]

I've created a SQL command that uses INNER JOIN on 9 tables, but this command takes a very long time (more than five minutes). A colleague suggested that I change INNER JOIN to LEFT JOIN because, contrary to what I know, the performance of LEFT JOIN is supposedly better. After I changed it, the speed of the query improved significantly.
I would like to know why LEFT JOIN is faster than INNER JOIN?
My SQL command looks like this:
SELECT * FROM A INNER JOIN B ON ... INNER JOIN C ON ... INNER JOIN D and so on
Update:
This is brief of my schema.
FROM sidisaleshdrmly a -- NOT HAVE PK AND FK
INNER JOIN sidisalesdetmly b -- THIS TABLE ALSO HAVE NO PK AND FK
ON a.CompanyCd = b.CompanyCd
AND a.SPRNo = b.SPRNo
AND a.SuffixNo = b.SuffixNo
AND a.dnno = b.dnno
INNER JOIN exFSlipDet h -- PK = CompanyCd, FSlipNo, FSlipSuffix, FSlipLine
ON a.CompanyCd = h.CompanyCd
AND a.sprno = h.AcctSPRNo
INNER JOIN exFSlipHdr c -- PK = CompanyCd, FSlipNo, FSlipSuffix
ON c.CompanyCd = h.CompanyCd
AND c.FSlipNo = h.FSlipNo
AND c.FSlipSuffix = h.FSlipSuffix
INNER JOIN coMappingExpParty d -- NO PK AND FK
ON c.CompanyCd = d.CompanyCd
AND c.CountryCd = d.CountryCd
INNER JOIN coProduct e -- PK = CompanyCd, ProductSalesCd
ON b.CompanyCd = e.CompanyCd
AND b.ProductSalesCd = e.ProductSalesCd
LEFT JOIN coUOM i -- PK = UOMId
ON h.UOMId = i.UOMId
INNER JOIN coProductOldInformation j -- PK = CompanyCd, BFStatus, SpecCd
ON a.CompanyCd = j.CompanyCd
AND b.BFStatus = j.BFStatus
AND b.ProductSalesCd = j.ProductSalesCd
INNER JOIN coProductGroup1 g1 -- PK = CompanyCd, ProductCategoryCd, UsedDepartment, ProductGroup1Cd
ON e.ProductGroup1Cd = g1.ProductGroup1Cd
INNER JOIN coProductGroup2 g2 -- PK = CompanyCd, ProductCategoryCd, UsedDepartment, ProductGroup2Cd
ON e.ProductGroup1Cd = g2.ProductGroup1Cd
A LEFT JOIN is absolutely not faster than an INNER JOIN. In fact, it's slower; by definition, an outer join (LEFT JOIN or RIGHT JOIN) has to do all the work of an INNER JOIN plus the extra work of null-extending the results. It would also be expected to return more rows, further increasing the total execution time simply due to the larger size of the result set.
(And even if a LEFT JOIN were faster in specific situations due to some difficult-to-imagine confluence of factors, it is not functionally equivalent to an INNER JOIN, so you cannot simply go replacing all instances of one with the other!)
Most likely your performance problems lie elsewhere, such as not having a candidate key or foreign key indexed properly. 9 tables is quite a lot to be joining so the slowdown could literally be almost anywhere. If you post your schema, we might be able to provide more details.
Edit:
Reflecting further on this, I could think of one circumstance under which a LEFT JOIN might be faster than an INNER JOIN, and that is when:
Some of the tables are very small (say, under 10 rows);
The tables do not have sufficient indexes to cover the query.
Consider this example:
CREATE TABLE #Test1
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test1 (ID, Name) VALUES (1, 'One')
INSERT #Test1 (ID, Name) VALUES (2, 'Two')
INSERT #Test1 (ID, Name) VALUES (3, 'Three')
INSERT #Test1 (ID, Name) VALUES (4, 'Four')
INSERT #Test1 (ID, Name) VALUES (5, 'Five')
CREATE TABLE #Test2
(
ID int NOT NULL PRIMARY KEY,
Name varchar(50) NOT NULL
)
INSERT #Test2 (ID, Name) VALUES (1, 'One')
INSERT #Test2 (ID, Name) VALUES (2, 'Two')
INSERT #Test2 (ID, Name) VALUES (3, 'Three')
INSERT #Test2 (ID, Name) VALUES (4, 'Four')
INSERT #Test2 (ID, Name) VALUES (5, 'Five')
SELECT *
FROM #Test1 t1
INNER JOIN #Test2 t2
ON t2.Name = t1.Name
SELECT *
FROM #Test1 t1
LEFT JOIN #Test2 t2
ON t2.Name = t1.Name
DROP TABLE #Test1
DROP TABLE #Test2
If you run this and view the execution plan, you'll see that the INNER JOIN query does indeed cost more than the LEFT JOIN, because it satisfies the two criteria above. It's because SQL Server wants to do a hash match for the INNER JOIN, but does nested loops for the LEFT JOIN; the former is normally much faster, but since the number of rows is so tiny and there's no index to use, the hashing operation turns out to be the most expensive part of the query.
You can see the same effect by writing a program in your favourite programming language to perform a large number of lookups on a list with 5 elements, vs. a hash table with 5 elements. Because of the size, the hash table version is actually slower. But increase it to 50 elements, or 5000 elements, and the list version slows to a crawl, because it's O(N) vs. O(1) for the hashtable.
But change this query to be on the ID column instead of Name and you'll see a very different story. In that case, it does nested loops for both queries, but the INNER JOIN version is able to replace one of the clustered index scans with a seek - meaning that this will literally be an order of magnitude faster with a large number of rows.
So the conclusion is more or less what I mentioned several paragraphs above; this is almost certainly an indexing or index coverage problem, possibly combined with one or more very small tables. Those are the only circumstances under which SQL Server might sometimes choose a worse execution plan for an INNER JOIN than a LEFT JOIN.
There is one important scenario that can lead to an outer join being faster than an inner join that has not been discussed yet.
When using an outer join, the optimizer is always free to drop the outer joined table from the execution plan if the join columns are the PK of the outer table, and none of the outer table columns are referenced outside of the outer join itself. For example, SELECT A.* FROM A LEFT OUTER JOIN B ON A.KEY=B.KEY where B.KEY is the PK for B. Both Oracle (I believe I was using release 10) and SQL Server (I used 2008 R2) prune table B from the execution plan.
The same is not necessarily true for an inner join: SELECT A.* FROM A INNER JOIN B ON A.KEY=B.KEY may or may not require B in the execution plan depending on what constraints exist.
If A.KEY is a nullable foreign key referencing B.KEY, then the optimizer cannot drop B from the plan because it must confirm that a B row exists for every A row.
If A.KEY is a mandatory foreign key referencing B.KEY, then the optimizer is free to drop B from the plan because the constraints guarantee the existence of the row. But just because the optimizer can drop the table from the plan, doesn't mean it will. SQL Server 2008 R2 does NOT drop B from the plan. Oracle 10 DOES drop B from the plan. It is easy to see how the outer join will out-perform the inner join on SQL Server in this case.
This is a trivial example, and not practical for a stand-alone query. Why join to a table if you don't need to?
But this could be a very important design consideration when designing views. Frequently a "do-everything" view is built that joins everything a user might need related to a central table. (Especially if there are naive users doing ad-hoc queries that do not understand the relational model.) The view may include all the relevant columns from many tables. But the end users might only access columns from a subset of the tables within the view. If the tables are joined with outer joins, then the optimizer can (and does) drop the un-needed tables from the plan.
It is critical to make sure that the view using outer joins gives the correct results. As Aaronaught has said - you cannot blindly substitute OUTER JOIN for INNER JOIN and expect the same results. But there are times when it can be useful for performance reasons when using views.
One last note - I haven't tested the impact on performance in light of the above, but in theory it seems you should be able to safely replace an INNER JOIN with an OUTER JOIN if you also add the condition <FOREIGN_KEY> IS NOT NULL to the where clause.
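The last note can be checked for result equivalence (not for performance) with a small sketch. It uses Python's sqlite3 module rather than Oracle or SQL Server, and the tables and the column name b_key are made up for illustration. With a foreign key in place, a LEFT JOIN plus an IS NOT NULL test on the foreign key column returns the same rows as the INNER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE B (b_key INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE A (id INTEGER PRIMARY KEY,
                    b_key INTEGER REFERENCES B(b_key));  -- nullable FK
    INSERT INTO B VALUES (1, 'one'), (2, 'two');
    INSERT INTO A VALUES (10, 1), (11, 2), (12, NULL), (13, 1);
""")

inner = conn.execute("""
    SELECT A.id FROM A INNER JOIN B ON A.b_key = B.b_key ORDER BY A.id
""").fetchall()

# The plain outer join keeps the row whose foreign key is NULL ...
outer_plain = conn.execute("""
    SELECT A.id FROM A LEFT JOIN B ON A.b_key = B.b_key ORDER BY A.id
""").fetchall()

# ... and the extra IS NOT NULL condition removes it again.
outer_filtered = conn.execute("""
    SELECT A.id FROM A LEFT JOIN B ON A.b_key = B.b_key
    WHERE A.b_key IS NOT NULL
    ORDER BY A.id
""").fetchall()

print(inner)           # [(10,), (11,), (13,)]
print(outer_plain)     # [(10,), (11,), (12,), (13,)]
print(outer_filtered)  # [(10,), (11,), (13,)]
```

The equivalence relies on the foreign key actually being enforced; without it, an orphaned non-NULL key would survive the outer join but not the inner one.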
If everything works as it should, it shouldn't, BUT we all know that not everything works the way it should, especially when it comes to the query optimizer, query plan caching and statistics.
First I would suggest rebuilding indexes and statistics, then clearing the query plan cache, just to make sure that's not screwing things up. However, I've experienced problems even when that's done.
I've experienced some cases where a left join has been faster than an inner join.
The underlying reason is this:
Suppose you have two tables and you join on a column that is indexed in both tables.
The inner join will produce the same result whether you loop over the entries in the index on table one and match against the index on table two, or do the reverse: loop over the entries in the index on table two and match against the index on table one.
The problem arises when you have misleading statistics: the query optimizer uses the statistics of the index to find the table with the fewest matching entries (based on your other criteria).
Say you have two tables with 1 million rows each; in table one 10 rows match and in table two 100000 rows match. The best way would be to do an index scan on table one and match 10 times in table two. The reverse would be an index scan that loops over 100000 rows and tries to match 100000 times, with only 10 succeeding. So if the statistics aren't correct, the optimizer might choose the wrong table and index to loop over.
If the optimizer chooses to optimize the left join in the order it is written it will perform better than the inner join.
BUT, the optimizer may also optimize a left join sub-optimally as a left semi join. To make it choose the one you want you can use the force order hint.
Try both queries (the one with inner and left join) with OPTION (FORCE ORDER) at the end and post the results. OPTION (FORCE ORDER) is a query hint that forces the optimizer to build the execution plan with the join order you provided in the query.
If INNER JOIN starts performing as fast as LEFT JOIN, it's because:
In a query composed entirely of INNER JOINs, the join order doesn't matter. This gives the query optimizer freedom to order the joins as it sees fit, so the problem might lie with the optimizer.
With LEFT JOIN, that's not the case because changing the join order will alter the results of the query. This means the engine must follow the join order you provided on the query, which might be better than the optimized one.
Don't know if this answers your question but I was once in a project that featured highly complex queries making calculations, which completely messed up the optimizer. We had cases where a FORCE ORDER would reduce the execution time of a query from 5 minutes to 10 seconds.
Have done a number of comparisons between left outer and inner joins and have not been able to find a consistent difference. There are many variables. Am working on a reporting database with thousands of tables, many with a large number of fields and many changes over time (vendor versions and local workflow). It is not possible to create all of the combinations of covering indexes to meet the needs of such a wide variety of queries and handle historical data. Have seen inner join queries kill server performance because two large (millions to tens of millions of rows) tables are inner joined, both pulling a large number of fields, and no covering index exists.
The biggest issue, though, doesn't seem to appear in the discussions above. Maybe your database is well designed, with triggers and well-designed transaction processing to ensure good data. Mine frequently has NULL values where they aren't expected. Yes, the table definitions could enforce no-NULLs, but that isn't an option in my environment.
So the question is: do you design your query only for speed, which is a higher priority for transaction processing that runs the same code thousands of times a minute? Or do you go for the accuracy that a left outer join will provide? Remember that inner joins must find matches on both sides, so an unexpected NULL will not only remove data from the two tables but possibly entire rows of information. And it happens so nicely, with no error messages.
You can be very fast at getting 90% of the needed data and not discover that the inner joins have silently removed information. Sometimes inner joins can be faster, but I wouldn't trust anyone making that assumption unless they have reviewed the execution plan. Speed is important, but accuracy is more important.
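The silent data loss described above is easy to reproduce. Here is a minimal sketch with Python's sqlite3 module; the orders and customers tables are hypothetical, not taken from the reporting database in question. One order has an unexpected NULL customer_id, and the inner join drops it from the total without any error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER,   -- unexpectedly NULL for order 2
                         amount INTEGER);
    INSERT INTO customers VALUES (100, 'Ann'), (101, 'Bob');
    INSERT INTO orders VALUES (1, 100, 50), (2, NULL, 75), (3, 101, 20);
""")

# The INNER JOIN needs a match on both sides, so order 2 disappears.
inner_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o INNER JOIN customers c ON c.id = o.customer_id
""").fetchone()[0]

# The LEFT JOIN keeps every order and NULL-extends the missing customer.
outer_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o LEFT JOIN customers c ON c.id = o.customer_id
""").fetchone()[0]

print(inner_total)  # 70
print(outer_total)  # 145
```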
Outer joins can offer superior performance when used in views.
Say you have a query that involves a view, and that view is comprised of 10 tables joined together. Say your query only happens to use columns from 3 out of those 10 tables.
If those 10 tables had been inner-joined together, then the query optimizer would have to join them all even though your query itself doesn't need 7 out of 10 of the tables. That's because the inner joins themselves might filter down the data, making them essential to compute.
If those 10 tables had been outer-joined together instead, then the query optimizer would only actually join the ones that were necessary: 3 out of 10 of them in this case. That's because the joins themselves are no longer filtering the data, and thus unused joins can be skipped.
Source:
http://www.sqlservercentral.com/blogs/sql_coach/2010/07/29/poor-little-misunderstood-views/
Your performance problems are more likely to be because of the number of joins you are doing and whether the columns you are joining on have indexes or not.
Worst case you could easily be doing 9 whole table scans for each join.
I found something interesting in SQL Server when checking if inner joins are faster than left joins.
If you don't include the columns of the left joined table in the select statement, the left join will be faster than the same query with an inner join.
If you do include the left joined table in the select statement, the inner join with the same query was equal to or faster than the left join.
From my comparisons, I find that they have the exact same execution plan. There're three scenarios:
If and when they return the same results, they have the same speed. However, we must keep in mind that they are not the same queries, and that LEFT JOIN will possibly return more results (when some ON conditions aren't met) --- this is why it's usually slower.
When the main table (first non-const one in the execution plan) has a restrictive condition (WHERE id = ?) and the corresponding ON condition is on a NULL value, the "right" table is not joined --- this is when LEFT JOIN is faster.
As discussed in Point 1, usually INNER JOIN is more restrictive and returns fewer results and is therefore faster.
Both use (the same) indices.

Question about SQL Server Optimization Sub Query vs. Join

Which of these queries is more efficient, and would a modern DBMS (like SQL Server) make the changes under the hood to make them equal?
SELECT DISTINCT S#
FROM shipments
WHERE P# IN (SELECT P#
FROM parts
WHERE color = 'Red')
vs.
SELECT DISTINCT S#
FROM shipments, parts
WHERE shipments.P# = parts.P#
AND parts.color = 'Red'
The best way to satiate your curiosity about this kind of thing is to fire up Management Studio and look at the Execution Plan. You'll also want to look at SQL Profiler as well. As one of my professors said: "the compiler is the final authority." A similar ethos holds when you want to know the performance profile of your queries in SQL Server - just look.
Starting here, this answer has been updated
The actual comparison might be very revealing. For example, in testing that I just did, I found that either approach might yield the fastest time depending on the nature of the query. For example, a query of the form:
Select F1, F2, F3 From Table1 Where F4='X' And UID in (Select UID From Table2)
yielded a table scan on Table1 and a mere index scan on table 2 followed by a right semi join.
A query of the form:
Select A.F1, A.F2, A.F3 From Table1 A inner join Table2 B on (A.UID=B.UID)
Where A.Gender='M'
yielded the same execution plan with one caveat: the hash match was a simple right join this time. So that is the first thing to note: the execution plans were not dramatically different.
These are not duplicate queries though since the second one may return multiple, identical records (one for each record in table 2). The surprising thing here was the performance: the subquery was far faster than the inner join. With datasets in the low thousands (thank you Red Gate SQL Data Generator) the inner join was 40 times slower. I was fairly stunned.
Ok, how about a real apples to apples? This is the matching inner join - note the extra step to winnow out the duplicates:
Select Distinct A.F1, A.F2, A.F3 From Table1 A inner join Table2 B
on (A.UID=B.UID)
Where A.Gender='M'
The execution plan does change in that there is an extra step - a sort after the inner join. Oddly enough, though, the time drops dramatically such that the two queries are almost identical (on two out of five trials the inner join is very slightly faster). Now, I can imagine the first inner join (without the "distinct") being somewhat longer just due to the fact that more data is being forwarded to the query window - but it was only twice as much (two Table2 records for every Table1 record). I have no good explanation why the first inner join was so much slower.
When you add a predicate to the search on table 2 using a subquery:
Select F1, F2, F3 From Table1 Where F4='X' And UID in
(Select UID From Table2 Where F1='Y')
then the Index Scan is changed to a Clustered Index Scan (which makes sense since the UID field has its own index in the tables I am using) and the percentage of time it takes goes up. A Stream Aggregate operation is also added. Sure enough, this does slow the query down. However, plan caching obviously kicks in as the first run of the query shows a much greater effect than subsequent runs.
When you add a predicate using the inner join, the entire plan changes pretty dramatically (left as an exercise to the reader - this post is long enough). The performance, again, is pretty much the same as that of the subquery - as long as the "Distinct" is included. Similar to the first example, omitting distinct led to a significant increase in time to completion.
One last thing: someone suggested (and your question now includes) a query of the form:
Select Distinct F1, F2, F3 From table1, table2
Where (table1.UID=table2.UID) AND table1.F4='X' And table2.F1='Y'
The execution plan for this query is similar to that of the inner join (there is a sort after the original table scan on table2 and a merge join rather than a hash join of the two tables). The performance of the two is comparable as well. I may need a larger dataset to tease out difference but, so far, I'm not seeing any advantage to this construct or the "Exists" construct.
With all of this being said - your results may vary. I came nowhere near covering the full range of queries that you may run into when I was doing the above tests. As I said at the beginning, the tools included with SQL Server are your friends: use them.
So: why choose one over the other? It really comes down to your personal preferences, since there appears to be no advantage of an inner join over a subquery in terms of time complexity across the range of examples I tested.
In most classic query cases I use an inner join just because I "grew up" with them. I do use subqueries, however, in two situations. First, some queries are simply easier to understand using a subquery: the relationship between the tables is manifest. The second and most important reason, though, is that I am often in a position of dynamically generating SQL from within my application and subqueries are almost always easier to generate automatically from within code.
So, the takeaway is simply that the best solution is the one that makes your development the most efficient.
Using IN is more readable, and I recommend using ANSI-92 over ANSI-89 join syntax:
SELECT DISTINCT S#
FROM SHIPMENTS s
JOIN PARTS p ON p.p# = s.p#
AND p.color = 'Red'
Check your explain plans to see which is better, because it depends on data and table setup.
If you aren't selecting anything from the table I would use an EXISTS clause.
SELECT DISTINCT S#
FROM shipments a
WHERE EXISTS (SELECT 1
FROM parts b
WHERE b.color = 'Red'
AND a.P# = b.P#)
This will optimize out to be the same as the second one you posted.
SELECT DISTINCT S#
FROM shipments,parts
WHERE shipments.P# = parts.P# and parts.color = 'Red';
Using IN forces SQL Server to not use indexing on that column, and subqueries are usually slower

IN vs. JOIN with large rowsets

I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article in my blog summarizes both my answer and my comments to another answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key preserved (i.e., the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
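The difference between the three forms can be seen on a tiny data set. This is a sketch using Python's sqlite3 module instead of SQL Server 2005, with made-up values in which b.d contains a duplicate; it demonstrates the result sets only, not the execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c INTEGER);
    CREATE TABLE b (d INTEGER);            -- d is deliberately not unique
    INSERT INTO a VALUES (1), (2), (3), (4);
    INSERT INTO b VALUES (1), (2), (2), (3);
""")

in_rows = conn.execute(
    "SELECT * FROM a WHERE a.c IN (SELECT d FROM b) ORDER BY a.c"
).fetchall()

# The plain join repeats a row of a once per matching row of b.
join_rows = conn.execute(
    "SELECT a.* FROM a JOIN b ON a.c = b.d ORDER BY a.c"
).fetchall()

# Joining against the distinct values of b.d matches the IN query again.
join_distinct_rows = conn.execute("""
    SELECT a.*
    FROM a
    JOIN (SELECT DISTINCT d FROM b) bo ON a.c = bo.d
    ORDER BY a.c
""").fetchall()

print(in_rows)             # [(1,), (2,), (3,)]
print(join_rows)           # [(1,), (2,), (2,), (3,)]
print(join_distinct_rows)  # [(1,), (2,), (3,)]
```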
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
None of these methods reevaluates the whole subquery each time.
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all RDBMS's of the big four.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a JOIN b ON a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This removes the duplicates that could be generated by the JOIN, but runs just as fast, if not faster.
Speaking from experience on a Table with 49,000,000 rows I would recommend LEFT OUTER JOIN.
Using IN or EXISTS took 5 minutes to complete, whereas the LEFT OUTER JOIN finished in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINS. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
,t1id int references t1(t1id)
)
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop, this difference is expected because as I mentioned above the join has different semantics.
From MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join would be more efficient than the IN subquery; however, the SQL Server optimizer normally results in no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever need to be migrated in the future, the database engine may not be so forgiving (for example, using a join instead of an IN subquery makes a huge difference in MySQL).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
I've always been a supporter of the IN methodology. This link contains details of a test conducted in PostgreSQL.
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php