Improving performance of adding a column with a single value - sql

By experimentation, and to my surprise, I have found that LEFT JOINing a point-table is much faster on large tables than simply assigning a single value to a column. By a point-table I mean a 1x1 table (1 row and 1 column).
Approach 1. By a simple assigning value, I mean this (slower):
SELECT A.*, 'Value' as NewColumn
FROM Table1 A
Approach 2. By left-joining a point-table, I mean this (faster):
WITH B AS (SELECT 'Value' as NewColumn)
SELECT *
FROM Table1 A
LEFT JOIN B
ON A.ID <> B.NewColumn
Now the core of my question. Can someone advise me how to get rid of the whole ON clause:
ON A.ID <> B.NewColumn?
Checking the join condition seems an unnecessary waste of time, because the key of table A must never equal the key of table B. It would throw rows out of the results if A.ID ever had the same value as 'Value'. Removing that condition, or maybe changing <> to =, seems like further room to improve the join's performance.
Update February 23, 2015
Bounty question addressed to performance experts: which of the approaches mentioned in my question and the answers is the fastest?
Approach 1 Simple assigning value,
Approach 2 Left joining a point-table,
Approach 3 Cross joining a point-table (thanks to answer of Gordon Linoff)
Approach 4 Any other approach which may be suggested during the bounty period.
As I have empirically measured the query execution time, in seconds, of the 3 approaches, the second approach with LEFT JOIN is the fastest, then the CROSS JOIN method, and last the simple assignment of a value. Surprising as it is, a performance expert with a Solomon's sword is needed to confirm or deny it.

I'm surprised this is faster for a simple expression, but you seem to want a cross join:
WITH B AS (SELECT 'Value' as NewColumn)
SELECT *
FROM Table1 A CROSS JOIN
B;
I use this construct to put "parameters" in queries (values that can easily be changed). However, I don't see why it would be faster. If the expression is more complicated (such as a subquery or very complicated calculation), then this method only evaluates it once. In the original query, it would normally be evaluated only once, but there might be cases where it is evaluated for each row.
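To make that "parameters" construct concrete, here is a minimal sketch with more than one value (the Orders table and its columns are hypothetical, purely for illustration): changing a value inside the CTE changes it everywhere it is referenced.
WITH Params AS (
    SELECT CAST('20150101' AS DATE) AS StartDate,
           CAST(0.15 AS DECIMAL(4,2)) AS TaxRate
)
SELECT O.OrderID,
       O.Amount * P.TaxRate AS Tax   -- every reference reads the same single row
FROM Orders O
CROSS JOIN Params P
WHERE O.OrderDate >= P.StartDate;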

You can also try with CROSS APPLY:
SELECT A.*, B.*
FROM Table1 A
CROSS APPLY (SELECT 'Value' as NewColumn) B

Can you try to insert into a temp table instead of outputting to screen:
SELECT A.*, 'Value' as NewColumn
INTO #Table1Assign
FROM Table1 A
and
WITH B AS (SELECT 'Value' as NewColumn)
SELECT *
INTO #Table1Join
FROM Table1 A
LEFT JOIN B
ON A.ID <> B.NewColumn
That takes the actual transmission and rendering of the data to SSMS out of the equation, which could be affected by network slowdown or processing on the client.
When I run this with a 1M row table, I consistently get better performance with the simple assigning method, even if I switch to CROSS JOIN for the join method.
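For anyone wanting to reproduce this, a minimal timing harness along those lines might look as follows (a sketch assuming SQL Server 2005+, using the temp-table names above):
SET STATISTICS TIME ON;

IF OBJECT_ID('tempdb..#Table1Assign') IS NOT NULL DROP TABLE #Table1Assign;
SELECT A.*, 'Value' AS NewColumn
INTO #Table1Assign
FROM Table1 A;

IF OBJECT_ID('tempdb..#Table1Join') IS NOT NULL DROP TABLE #Table1Join;
WITH B AS (SELECT 'Value' AS NewColumn)
SELECT A.*, B.NewColumn
INTO #Table1Join
FROM Table1 A
CROSS JOIN B;

SET STATISTICS TIME OFF;
-- Compare the "SQL Server Execution Times" entries in the Messages tab.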

I doubt that the second approach will be faster, with its extra SELECT and LEFT JOIN.
First of all, you should test the same query with various sample data, repeatedly.
What is the real scenario like?
An inner join will definitely be faster than a left join.
How about this?
Declare @t table(id int, c2 varchar(10))
INSERT INTO @t
select 1,'A' union all
select 2,'A' union all
select 3,'B' union all
select 4,'B'
Declare @t1 table(nEWcOL varchar(10))
INSERT INTO @t1 Values('Value')
-- Approach 1
--SELECT * FROM @t outer apply (SELECT * FROM @t1) b
--Create an index on both join columns
-- Approach 2
SELECT * FROM @t A inner join
@t1 b on a.c2<>b.nEWcOL
-- Approach 3
Declare @value varchar(20)
Select @value = nEWcOL from @t1
select *, @value value from @t

Too much text for a comment, so added this as an answer although I'm actually more adding to the question (**)
Somehow I think this is going to be one of those 'it depends' situations. I think it depends a lot on the amount of rows involved and even more on what happens afterwards with the data. Is it simply returned, is it used in a GROUP BY or DISTINCT later on, do we further JOIN or calculate with it etc..
Anyway, I think this IS an interesting question in that I've had to find out the hard way that having a dozen 'parameters' in a single-row temp-table was faster than having them assigned upfront to 12 variables. Many, many moons ago, the code I was given looked like an absurd construction to me, so I rewrote it to use @variables instead. This was in a 1000+ line stored procedure which needed some extra performance squeezed out of it. After quite a bit of refactoring, it turned out to run remarkably slower than before the change?!?!
I've never really understood why, and at the time I simply reverted to the old version. My best guess is some weird kind of combination of parameter-sniffing vs (auto-created?) statistics on the temp-table in question; if anyone could bring some light to your question, it will probably lead to an answer to mine too =)
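For concreteness, the two patterns contrasted were roughly these (a drastically reduced, hypothetical version in modern T-SQL; BigTable and its columns are placeholders):
-- Pattern 1: local variables. The optimizer cannot see their runtime values
-- (local variables are not sniffed), so it falls back to average-density estimates.
DECLARE @StartDate DATE = '20150101', @RegionID INT = 3;
SELECT B.OrderID
FROM BigTable B
WHERE B.OrderDate >= @StartDate AND B.RegionID = @RegionID;

-- Pattern 2: single-row temp table. Temp tables get auto-created statistics,
-- and joining to one can yield a different (sometimes better) plan.
CREATE TABLE #Params (StartDate DATE, RegionID INT);
INSERT INTO #Params VALUES ('20150101', 3);
SELECT B.OrderID
FROM BigTable B
JOIN #Params P
  ON B.OrderDate >= P.StartDate AND B.RegionID = P.RegionID;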
(**: I realize SO is not a forum so I apologise upfront, simply wanted to chime in that the observed behaviour of the OP isn't entirely anecdotal)

SELECT * doesn't use indexes properly in SQL Server; you should always specify your columns.
Other than that I would use
DECLARE @Value VARCHAR(30) = 'Value'
SELECT t.Id, t.C2, @Value NewColumn
FROM Table1 t

Related

Sql query optimization using IN over INNER JOIN

Given:
Table y
id int clustered index
name nvarchar(25)
Table anothertable
id int clustered Index
name nvarchar(25)
Table someFunction
does some math then returns a valid ID
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) = anotherTable.id
Question:
While timing these two queries I found that at large data sets the first query using IN is much faster than the second query using an INNER JOIN. I do not understand why; can someone help explain, please?
Execution Plan
Generally speaking, IN is different from JOIN in that a JOIN can return additional rows where a row has more than one match in the JOIN-ed table.
From your estimated execution plan, though, it can be seen that in this case the two queries are semantically the same:
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs 28271800 for JOIN but these estimates won't necessarily be very reliable as the join predicate is unsargable.
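To make the MAX-versus-SUM point concrete, here is a tiny repro (toy tables, not the schema from the question):
DECLARE @A TABLE (Col1 INT, Col2 INT);
DECLARE @B TABLE (Col1 INT);
INSERT INTO @A VALUES (1, 10);
INSERT INTO @B VALUES (1);
INSERT INTO @B VALUES (1);   -- duplicate match

-- IN form: the row in A qualifies once -> MAX = 10, SUM = 10
SELECT A.Col1, MAX(A.Col2) AS MaxVal, SUM(A.Col2) AS SumVal
FROM @A A
WHERE A.Col1 IN (SELECT Col1 FROM @B)
GROUP BY A.Col1;

-- JOIN form: the duplicate in B doubles the row -> MAX = 10, but SUM = 20
SELECT A.Col1, MAX(A.Col2) AS MaxVal, SUM(A.Col2) AS SumVal
FROM @A A
JOIN @B B ON A.Col1 = B.Col1
GROUP BY A.Col1;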
Your second query is a bit funny - can you try this one instead??
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data etc.) and your system (RAM, disk etc.), it's really really hard to give a "globally" valid statement
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with a one-row inline table-valued function it will be better.
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.
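To illustrate the suggested rewrite, here is a hedged sketch (dbo.SomeFunction's body is unknown, so the calculation below is a placeholder):
-- An inline table-valued stand-in for the scalar UDF; unlike a scalar UDF,
-- an inline TVF can be expanded into the query plan by the optimizer.
CREATE FUNCTION dbo.SomeFunctionTVF (@id INT)
RETURNS TABLE
AS
RETURN (SELECT @id * 2 + 1 AS ValidId);   -- placeholder for the real math
GO

SELECT y.name
FROM y
CROSS APPLY dbo.SomeFunctionTVF(y.id) AS f
WHERE f.ValidId IN (SELECT id FROM AnotherTable);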

SQL - JOIN using UNION ?? UNION using JOIN?

I was asked this question during one of my interviews.
Can you do JOIN using UNION keyword?
Can you do UNION using JOIN keyword?
That is -
1. I should get same output as JOIN without using JOIN keyword, but using UNION Keyword?
2. I should get same output as UNION without using UNION keyword, but using JOIN Keyword?
Can you give me an example of how to do this if possible?
An interview is the framework on which you set out your wares. Remember: don't answer questions ;)
Think of a press conference: the spokesperson is not looking to answer difficult questions from journos that would catch them out. Rather, they are looking for questions to which they already have answers, being the information they want to release (and no more!)
If I faced this question in an interview, I would use it to demonstrate my knowledge of relational algebra, because that's what I'd have gone into the interview with the intention of doing; I'd be alert for the "Talk about relational algebra here" question, and this would be it.
Loosely speaking, JOIN is the counterpart of logical AND, whereas UNION is the counterpart of logical OR. Therefore, similar questions using conventional logic could be, "Can you do AND using OR?" and "Can you do OR using AND?" The answer would depend on what else you could use, e.g. NOT might come in handy ;)
I'd also be tempted to discuss the differences between the set of primitive operators, the set of operators necessary for computational completeness and the set of operators and shorthands required for practical purposes.
Trying to answer the question directly raises further questions. JOIN implies 'natural join' in relational algebra whereas in SQL it implies INNER JOIN. If the question specifically relates to SQL, do you have to answer for all the JOIN types? What about UNION JOIN?
To take one example, SQL's outer join is famously a UNION. Chris Date expresses it better than I could ever hope to:
Outer join is expressly designed to produce nulls in its result and should therefore be avoided, in general. Relationally speaking, it's a kind of shotgun marriage: It forces tables into a kind of union—yes, I do mean union, not join—even when the tables in question fail to conform to the usual requirements for union (see Chapter 6). It does this, in effect, by padding one or both of the tables with nulls before doing the union, thereby making them conform to those usual requirements after all. But there's no reason why that padding shouldn't be done with proper values instead of nulls
SQL and Relational Theory, 1st Edition by C.J. Date
This would be a good discussion point if, "I hate nulls" is something you wanted to get across in the interview!
These are just a few thoughts that spring to mind. The crucial point is, by asking these questions the interviewer is offering you a branch. What will YOU hang on it? ;)
As this is an interview question, they are testing your understanding of both these functions.
The likely answer they are expecting is "generally no, you cannot do this, as they perform different actions", and you would explain this in more detail by stating that a union appends rows to the end of the result set, whereas a join adds further columns.
The only way you could have a Join and a Union work is where rows contain data from only one of the two sources:
SELECT A.AA, '' AS BB FROM A
UNION ALL
SELECT '' AS AA, B.BB FROM B
Is the same as:
SELECT ISNULL(A.AA, '') AS AA, ISNULL(B.BB, '') AS BB FROM A
FULL OUTER JOIN B ON 1=0
Or to do this with only one column where the types match:
SELECT A.AA AS TT FROM A
UNION ALL
SELECT B.BB AS TT FROM B
Is the same as:
SELECT ISNULL(A.AA, B.AA) AS TT
FROM A
FULL OUTER JOIN B ON 1=0
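(The ON 1=0 predicate guarantees that no rows ever match, so the FULL OUTER JOIN returns every row of A null-padded on B's side plus every row of B null-padded on A's side: exactly the shape of a UNION ALL, which the ISNULL then collapses into single columns.)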
One case where you would do this is if you have data spread over multiple tables but you want to see it all together; however, I would advise using a UNION in this case rather than a FULL OUTER JOIN, because the query then does what a reader would expect.
Do you mean something like this?
create table Test1 (TextField nvarchar(50), NumField int)
create table Test2 (NumField int)
create table Test3 (TextField nvarchar(50), NumField int)
insert into Test1 values ('test1a', 1)
insert into Test1 values ('test1b', 2)
insert into Test2 values (1)
insert into Test3 values ('test3a', 4)
insert into Test3 values ('test3b', 5)
select Test1.*
from Test1 inner join Test2 on Test1.NumField = Test2.NumField
union
select * from Test3
(written on SQL Server 2008)
UNION works when both SELECT statements have the same number of columns, AND the columns have the same (or at least similar) data types.
UNION doesn't care if both SELECT statements select data only from a single table, or if one or both of them are already JOINs on more than one table.
I think it also depends on other operations available.
If I remember correctly, a UNION can be done using a FULL OUTER JOIN:
Table a (x, y)
Table b (x, y)
CREATE VIEW one
AS
SELECT a.x AS Lx
, b.x AS Rx
, a.y AS Ly
, b.y AS Ry
FROM a FULL OUTER JOIN b
ON a.x = b.x
AND a.y = b.y
CREATE VIEW unionTheHardWay
AS
SELECT COALESCE(Lx, Rx) AS x
, COALESCE(Ly, Ry) AS y
FROM one
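A quick sanity check of the construct (the sample rows are made up):
INSERT INTO a VALUES (1, 10);
INSERT INTO a VALUES (2, 20);
INSERT INTO b VALUES (2, 20);
INSERT INTO b VALUES (3, 30);

SELECT x, y FROM unionTheHardWay;
-- Returns (1,10), (2,20), (3,30): rows matching across tables collapse,
-- just as UNION would merge them. Caveat: duplicates *within* one table
-- are not removed, so this mimics UNION exactly only when each input is
-- itself duplicate-free.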

I Need some sort of Conditional Join

Okay, I know there are a few posts that discuss this, but my problem cannot be solved by a conditional where statement on a join (the common solution).
I have three join statements, and depending on the query parameters, I may need to run any combination of the three. My join statement is quite expensive, so I want to do the join only when the query needs it, and I'm not prepared to write a 7-combination IF..ELSE.. statement to cover all those combinations.
Here is what I've used for solutions thus far, but all of these have been less than ideal:
LEFT JOIN joinedTable jt
ON jt.someCol = someCol
WHERE jt.someCol = conditions
OR @neededJoin is null
(This is just too expensive, because I'm performing the join even when I don't need it, just not evaluating the join)
OUTER APPLY
(SELECT TOP(1) * FROM joinedTable jt
WHERE jt.someCol = someCol
AND @neededjoin is null)
(this is even more expensive than always left joining)
SELECT @sql = @sql + ' INNER JOIN joinedTable jt ' +
' ON jt.someCol = someCol ' +
' WHERE (conditions...) '
(this one is IDEAL, and how it is written now, but I'm trying to convert it away from dynamic SQL).
Any thoughts or help would be great!
EDIT:
If I take the dynamic SQL approach, I'm trying to figure out what would be most efficient with regard to structuring my query. Given that I have three optional conditions, and I need the results from all of them, my current query does something like this:
IF condition one
SELECT from db
INNER JOIN condition one
UNION
IF condition two
SELECT from db
INNER JOIN condition two
UNION
IF condition three
SELECT from db
INNER JOIN condition three
My non-dynamic query does this task by performing left joins:
SELECT from db
LEFT JOIN condition one
LEFT JOIN condition two
LEFT JOIN condition three
WHERE condition one is true
OR condition two is true
OR condition three is true
Which makes more sense to do, since all of the code in the "SELECT from db" part is the same? It appears that the UNION version is more efficient, but my query gets VERY long because of it...
Thanks!
LEFT JOIN
joinedTable jt ON jt.someCol = someCol AND jt.someCol = conditions AND @neededjoin ...
...
OR
LEFT JOIN
(
SELECT col1, someCol, col2 FROM joinedTable WHERE someCol = conditions AND @neededjoin ...
) jt ON jt.someCol = someCol
...
OR
;WITH jtCTE AS
(SELECT col1, someCol, col2 FROM joinedTable WHERE someCol = conditions AND @neededjoin ...)
SELECT
...
LEFT JOIN
jtCTE ON jtCTE.someCol = someCol
...
To be honest, there is no such construct as a conditional JOIN unless you use literals.
If it's in the SQL statement it's evaluated... so don't have it in the SQL statement by using dynamic SQL or IF ELSE
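For completeness, a minimal sketch of that dynamic SQL route using sp_executesql (all table, column, and flag names below are placeholders, not the OP's schema):
DECLARE @needJoin1 BIT, @needJoin2 BIT, @filterValue INT;
SELECT @needJoin1 = 1, @needJoin2 = 0, @filterValue = 42;

DECLARE @sql NVARCHAR(MAX);
SET @sql = N'SELECT t.* FROM MainTable t';

-- Append each join only when it is actually needed
IF @needJoin1 = 1
    SET @sql = @sql + N' INNER JOIN JoinedTable1 j1 ON j1.someCol = t.someCol';
IF @needJoin2 = 1
    SET @sql = @sql + N' INNER JOIN JoinedTable2 j2 ON j2.someCol = t.someCol';

SET @sql = @sql + N' WHERE t.filterCol = @filter';

-- sp_executesql keeps the value parameterized and allows plan reuse per query shape
EXEC sys.sp_executesql @sql, N'@filter INT', @filter = @filterValue;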
The dynamic SQL solution is usually the best for these situations, but if you really need to get away from that, a series of IF statements in a stored proc will do the job. It's a pain and you have to write much more code, but it will be faster than trying to make joins conditional in the statement itself.
I would go for a simple and straightforward approach like this:
DECLARE @ret TABLE(...) ;
IF <condition one> BEGIN ;
INSERT INTO @ret() SELECT ...
END ;
IF <condition two> BEGIN ;
INSERT INTO @ret() SELECT ...
END ;
IF <condition three> BEGIN ;
INSERT INTO @ret() SELECT ...
END ;
SELECT DISTINCT ... FROM @ret ;
Edit: I am suggesting a table variable, not a temporary table, so that the procedure will not recompile every time it runs. Generally speaking, three simpler inserts have a better chance of getting better execution plans than one big huge monster query combining all three.
However, we cannot guesstimate performance; we must benchmark to determine it. Still, simpler code chunks are better for readability and maintainability.
Try this:
LEFT JOIN joinedTable jt
ON jt.someCol = someCol
AND jt.someCol = conditions
AND @neededJoin = 1 -- or whatever indicates join is needed
I think you'll find it is good performance and does what you need.
Update
If this doesn't give the performance I claimed, then perhaps that's because the last time I did this, it was by joining to a table. The value I needed could come from one of 3 tables, based on 2 columns, so I built a 'join-map' table like so:
Col1  Col2  TableCode
1     2     A
1     4     A
1     3     B
1     5     B
2     2     C
2     5     C
1     11    C
Then,
SELECT
V.*,
LookedUpValue =
CASE M.TableCode
WHEN 'A' THEN A.Value
WHEN 'B' THEN B.Value
WHEN 'C' THEN C.Value
END
FROM
ValueMaster V
INNER JOIN JoinMap M ON V.Col1 = M.Col1 AND V.Col2 = M.Col2
LEFT JOIN TableA A ON M.TableCode = 'A'
LEFT JOIN TableB B ON M.TableCode = 'B'
LEFT JOIN TableC C ON M.TableCode = 'C'
This gave me a huge performance improvement querying these tables (most of them with dozens or hundreds of millions of rows).
This is why I'm asking if you actually get improved performance. Of course it's going to throw a join into the execution plan and assign it some cost, but overall it's going to do a lot less work than some plan that just indiscriminately joins all 3 tables and then Coalesce()s to find the right value.
If you find that compared to dynamic SQL it's only 5% more expensive to do the joins this way, but with the indiscriminate joins is 100% more expensive, it might be worth it to you to do this because of the correctness, clarity, and simplicity over dynamic SQL, all of which are probably more valuable than a small improvement (depending on what you're doing, of course).
Whether the cost scales with the number of rows is also another factor to consider. If even with a huge amount of data you only save 200ms of CPU on a query that isn't run dozens of times a second, it's a no-brainer to use it.
The reason I keep hammering on the fact that I think it's going to perform well is that even with a hash match, it wouldn't have any rows to probe with, or it wouldn't have any rows to create a hash of. The hash operation is going to stop a lot earlier compared to using the WHERE clause OR-style query of your initial post.
The dynamic SQL solution is best in most respects; you are trying to run different queries with different numbers of joins without rewriting the query to do different numbers of joins - and that doesn't work very well in terms of performance.
When I was doing this sort of stuff an æon or so ago (say the early 90s), the language I used was I4GL and the queries were built using its CONSTRUCT statement. This was used to generate part of a WHERE clause, so (based on the user input), the filter criteria it generated might look like:
a.column1 BETWEEN 1 AND 50 AND
b.column2 = 'ABCD' AND
c.column3 > 10
In those days, we didn't have the modern JOIN notations; I'm going to have to improvise a bit as we go. Typically there is a core table (or a set of core tables) that are always part of the query; there are also some tables that are optionally part of the query. In the example above, I assume that 'c' is the alias for the main table. The way the code worked would be:
Note that table 'a' was referenced in the query:
Add 'FullTableName AS a' to the FROM clause
Add a join condition 'AND a.join1 = c.join1' to the WHERE clause
Note that table 'b' was referenced...
Add bits to the FROM clause and WHERE clause.
Assemble the SELECT statement from the select-list (usually fixed), the FROM clause and the WHERE clause (occasionally with decorations such as GROUP BY, HAVING or ORDER BY too).
The same basic technique should be applied here - but the details are slightly different.
First of all, you don't have the string to analyze; you know from other circumstances which tables you need to add to your query. So, you still need to design things so that they can be assembled, but...
The SELECT clause with its select-list is probably fixed. It will identify the tables that must be present in the query because values are pulled from those tables.
The FROM clause will probably consist of a series of joins.
One part will be the core query:
FROM CoreTable1 AS C1
JOIN CoreTable2 AS C2
ON C1.JoinColumn = C2.JoinColumn
JOIN CoreTable3 AS M
ON M.PrimaryKey = C1.ForeignKey
Other tables can be added as necessary:
JOIN AuxilliaryTable1 AS A
ON M.ForeignKey1 = A.PrimaryKey
Or you can specify a full query:
JOIN (SELECT RelevantColumn1, RelevantColumn2
FROM AuxilliaryTable1
WHERE Column1 BETWEEN 1 AND 50) AS A
In the first case, you have to remember to add the WHERE criterion to the main WHERE clause, and trust the DBMS Optimizer to move the condition into the JOIN table as shown. A good optimizer will do that automatically; a poor one might not. Use query plans to help you determine how able your DBMS is.
Add the WHERE clause for any inter-table criteria not covered in the joining operations, and any filter criteria based on the core tables. Note that I'm thinking primarily in terms of extra criteria (AND operations) rather than alternative criteria (OR operations), but you can deal with OR too as long as you are careful to parenthesize the expressions sufficiently.
Occasionally, you may have to add a couple of JOIN conditions to connect a table to the core of the query - that is not dreadfully unusual.
Add any GROUP BY, HAVING or ORDER BY clauses (or limits, or any other decorations).
Note that you need a good understanding of the database schema and the join conditions. Basically, this is coding in your programming language the way you have to think about constructing the query. As long as you understand this and your schema, there aren't any insuperable problems.
Good luck...
Just because no one else mentioned this, here's something that you could use (not dynamic). If the syntax looks weird, it's because I tested it in Oracle.
Basically, you turn your joined tables into sub-selects that have a where clause that returns nothing if your condition does not match. If the condition does match, then the sub-select returns data for that table. The Case statement lets you pick which column is returned in the overall select.
with m as (select 1 Num, 'One' Txt from dual union select 2, 'Two' from dual union select 3, 'Three' from dual),
t1 as (select 1 Num from dual union select 11 from dual),
t2 as (select 2 Num from dual union select 22 from dual),
t3 as (select 3 Num from dual union select 33 from dual)
SELECT m.*
,CASE 1
WHEN 1 THEN
t1.Num
WHEN 2 THEN
t2.Num
WHEN 3 THEN
t3.Num
END SelectedNum
FROM m
LEFT JOIN (SELECT * FROM t1 WHERE 1 = 1) t1 ON m.Num = t1.Num
LEFT JOIN (SELECT * FROM t2 WHERE 1 = 2) t2 ON m.Num = t2.Num
LEFT JOIN (SELECT * FROM t3 WHERE 1 = 3) t3 ON m.Num = t3.Num
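(The hard-coded 1 in CASE 1 and in the WHERE 1 = n sub-selects stands in for a runtime parameter; with a bound variable in its place, only the sub-select whose condition matches returns rows, and the others are cheap because their WHERE clause is false before any data is read.)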

IN vs. JOIN with large rowsets

I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article on my blog summarizes both my answer and my comments to other answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key-preserved (i.e. if the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
Neither of these methods reevaluates the whole subquery each time.
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all RDBMS's of the big four.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a JOIN b ON a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This removes the duplicates that could be generated by the JOIN, but runs just as fast, if not faster.
Speaking from experience on a table with 49,000,000 rows, I would recommend LEFT OUTER JOIN.
Using IN or EXISTS took 5 minutes to complete, where the LEFT OUTER JOIN finishes in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINS. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
,t1id int references t1(t1id)
)
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop; this difference is expected because, as I mentioned above, the join has different semantics.
From MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join would be more efficient than the IN subquery; however, the SQL Server optimizer normally produces no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever need to be migrated in the future, the database engine may not be so forgiving (for example, using a join instead of an IN subquery makes a huge difference in MySQL).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
I've always been a supporter of the IN methodology. This link contains details of a test conducted in PostgreSQL.
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php

SQL - improve NOT EXISTS query performance

Is there a way I can improve this kind of SQL query performance:
INSERT
INTO ...
WHERE NOT EXISTS(Validation...)
The problem is that when I have a lot of data in my table (like millions of rows), the execution of the WHERE NOT EXISTS clause is very slow. I have to do this verification because I can't insert duplicated data.
I use SQLServer 2005
thx
Make sure you are searching on indexed columns, with no manipulation of the data within those columns (like substring etc.)
Off the top of my head, you could try something like:
TRUNCATE temptable
INSERT INTO temptable ...
INSERT INTO temptable ...
...
INSERT INTO realtable
SELECT temptable.* FROM temptable
LEFT JOIN realtable on realtable.key = temptable.key
WHERE realtable.key is null
Try to replace the NOT EXISTS with a left outer join, it sometimes performs better in large data sets.
Outer Apply tends to work for me...
instead of:
from t1
where not exists (select 1 from t2 where t1.something=t2.something)
I'll use:
from t1
outer apply (
select top 1 1 as found from t2 where t1.something=t2.something
) t2f
where t2f.found is null
Pay attention to the other answer regarding indexing. NOT EXISTS is typically quite fast if you have good indexes.
But I have had performance issues with statements like you describe. One method I've used to get around that is to use a temp table for the candidate values, perform a DELETE FROM ... WHERE EXISTS (...), and then blindly INSERT the remainder. Inside a transaction, of course, to avoid race conditions. Splitting up the queries sometimes allows the optimizer to do its job without getting confused.
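Roughly, that pattern looks like this (table and key names are placeholders):
BEGIN TRAN;

-- Stage the candidate rows
SELECT *
INTO #candidates
FROM SourceTable;

-- Throw away anything that already exists in the target
DELETE c
FROM #candidates c
WHERE EXISTS (SELECT 1 FROM TargetTable t WHERE t.keyCol = c.keyCol);

-- Blindly insert whatever remains
INSERT INTO TargetTable
SELECT * FROM #candidates;

COMMIT TRAN;

DROP TABLE #candidates;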
If you can at all reduce your problem space, then you'll gain heaps of performance. Are you absolutely sure that every one of those rows in that table needs to be checked?
The other thing you might want to try is a DELETE InsertTable FROM InsertTable INNER JOIN ExistingTable ON <Validation criteria> before your insert. However, your mileage may vary
insert into customers
select *
from newcustomers
where customerid not in (select customerid
from customers)
...may be more efficient. As others have said, make sure you've got indexes on any lookup fields.
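One caveat worth noting with NOT IN: if the subquery can ever return a NULL, the predicate matches nothing and no rows are inserted, so guard against it:
-- NOT IN returns nothing if the subquery produces any NULL:
-- WHERE x NOT IN (1, 2, NULL) is never true.
insert into customers
select *
from newcustomers
where customerid not in (select customerid
                         from customers
                         where customerid is not null)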