Where or Join Which one is evaluated first in sql server? - sql

Query
select * from TableA a join TableB b
on a.col1=b.col1
where b.col2 = 'SomeValue'
I'm expecting the server, first filter the col2 from TableB then do the join. This will be more efficient.
Is that the sql server evaluates the where clause first and then Join?
Any link the to know in which order sql will process a query ?
Thanks In Advance

Already answered ... read both answers ...
https://dba.stackexchange.com/questions/5038/sql-server-join-where-processing-order
To summarise: it depends on the server implementation and its execution plan ... so you will need to read up on your server in order to optimise your queries.
But I'm sure that simple joins get optimised by each server as best as it can.
If you are not sure measure execution time on a large dataset.

It's decided by the sql server query optmiser engine based on which which execution plan have lesser cost.
If you think that the filter clause will benefit your query performance, you can get the subset of the table by filtering it with your desired value and make a CTE for it.
Then join the cte expression with your other table.
You can check which query performs better in your case in SSMS and go with it :)

We will use this code:
IF OBJECT_ID(N'tempdb..#TableA',N'U') IS NOT NULL DROP TABLE #TableA;
IF OBJECT_ID(N'tempdb..#TableB',N'U') IS NOT NULL DROP TABLE #TableB;
CREATE TABLE #TableA (col1 INT NOT NULL,Col2 NVARCHAR(255) NOT NULL)
CREATE TABLE #TableB (col1 INT NOT NULL,Col2 NVARCHAR(255) NOT NULL)
INSERT INTO #TableA VALUES (1,'SomeValue'),(2,'SomeValue2'),(3,'SomeValue3')
INSERT INTO #TableB VALUES (1,'SomeValue'),(2,'SomeValue2'),(3,'SomeValue3')
select * from #TableA a join #TableB b
on a.col1=b.col1
where b.col2 = 'SomeValue'
Let`s analyze query plan in MSSQL Management studio. Mark full SELECT statement and right click --> Diplay Estimated Execution Plan. As you can seen on the picture below
first it does Table Scan for the WHERE clause, then JOIN.
1.Is that the sql server evaluates the where clause first and then Join?
First the where clause then JOIN
2.Any link the to know in which order sql will process a query?
I think you will find useful information here:
Execution Plan Basics
Graphical Execution Plans for Simple SQL Queries

Related

SQL: Is it advised to filter table with 'select ... where' before join?

Which of those two variants is more advised in general SQL practice:
Lets consider table A with columns: 1,2,3 and table B with columns 3,4. Filtering table with select first:
select col2,col4 from
(select col1,col2 from tabA
where tabA.col3='sth') as t
join tabB using (col2);
or using plain join?:
select col2,col4 from tabA
join tabB using(col2)
where col3='sth';
We can assume where clause matches 1 row. Tables are of similar size. Does Oracle planner deal with such joins properly ot it's gonna create huge joined table and then filter it?
Test it yourself on the real tables using explain plans to learn how many rows are evaluated. I don't believe it is possible to know which would be better, or if there would be any difference. The available indexes make a difference to the optimizer's choice of approach for example.
Regarding your 2 examples I don't like "natural join" syntax ("using") so the first option below is I believe the more common approach (where clause refers directly to the "from table"):
select a.col2,b.col4
from tabA a
inner join tabB b on a.col2 = b.col2
where a.col3='sth'
;
but you could also try a join condition like this:
select b.col2,b.col4
from tabB b
inner join tabA a on b.col2 = a.col2 and a.col3='sth'
;
note this reverses the table relationship.
Your second version is an excellent way of writing the query, with the minor exceptions that col4 and col3 are not qualified:
select col2, col4
from tabA a join
tabB b
using (col2)
where a.col3 = 'sth';
Just for the record, this is not a natural join. That is a separate construct in SQL -- and rather abominable.
In my experience, Oracle has a good query optimizer. It will know how to optimize the query for the given conditions. The subquery should make no difference whatsoever to the query plan. Oracle is not going to filter before the join, unless that is the right thing to do.
The only time an inner filter will perform better than an outer filter is in (usually complex) cases where the query optimizer chooses a different query plan for one query versus the other.
Think of it in the same way as a query hint
By reorganizing the query you can influence the query plan without explicitly doing so with a hint
As far as which is more advised, whichever is easiest to read is usually best. Even if one performs better than the other because of where you put your filter, you should instead focus on having the correct indexes to ensure good performance.

Sql query optimization using IN over INNER JOIN

Given:
Table y
id int clustered index
name nvarchar(25)
Table anothertable
id int clustered Index
name nvarchar(25)
Table someFunction
does some math then returns a valid ID
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) ON anotherTable.id
Question:
While timing these two queries out I found that at large data sets the first query using IN is much faster then the second query using an INNER JOIN. I do not understand why can someone help explain please.
Execution Plan
Generally speaking IN is different from JOIN in that a JOIN can return additional rows where a row has more than one match in the JOIN-ed table.
From your estimated execution plan though it can be seen that in this case the 2 queries are semantically the same
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs 28271800 for JOIN but these estimates won't necessarily be very reliable as the join predicate is unsargable.
Your second query is a bit funny - can you try this one instead??
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data etc.) and your system (RAM, disk etc.), it's really really hard to give a "globally" valid statement
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with a one-row inline table-valued function it will be better.
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.

SQL: Performance comparison for exclusion (Join vs Not in)

I am curious on the most efficient way to query exclusion on sql. E.g. There are 2 tables (tableA and tableB) which can be joined on 1 column (col1). I want to display the data of tableA for all the rows which col1 does not exist in tableB.
(So, in other words, tableB contains a subset of col1 of tableA. And I want to display tableA without the data that exists in tableB)
Let's say tableB has 100 rows while tableA is gigantic (more than 1M rows). I know 'Not in (not exists)' can be used but perhaps there are more efficient ways (less comp. time) to do it.? I don't maybe with outer joins?
Code snippets and comments are much appreciated.
Depends on the RDBMS. For Microsoft SQL Server NOT EXISTS is preferred to the OUTER JOIN as it can use the more efficient Anti-Semi join.
For Oracle Minus is apparently preferred to NOT EXISTS (where suitable)
You would need to look at the execution plans and decide.
I prefer to use
Select a.Col1
From TableA a
Left Join TableB b on a.Col1 = b.Col1
Where b.Col1 Is Null
I believe this will be quicker as you are utilising the FK constraint (providing you have
them of course)
Sample data:
create table #a
(
Col1 int
)
Create table #b
(
col1 int
)
insert into #a
Values (1)
insert into #a
Values (2)
insert into #a
Values (3)
insert into #a
Values (4)
insert into #b
Values (1)
insert into #b
Values (2)
Select a.Col1
From #a a
Left Join #b b on a.col1 = b.Col1
Where b.Col1 is null
The questions has been asked several times. The often fastest way is to do this:
SELECT * FROM table1
WHERE id in (SELECT id FROM table1 EXCEPT SELECT id FROM table2)
As the whole joining can be done on indexes, where using NOT IN it generally cannot.
There is no correct answer to this question. Every RDBMS has query optimizer that will determine best execution plan based on available indices, table statistics (number of rows, index selectivity), join condition, query condition, ...
When you have relatively simple query like in your question, there is often several ways you can get results in SQL. Every self respecting RDBMS will recognize your intention and will create same execution plan, no matter which syntax you use (subqueries with IN or EXISTS operator, query with JOIN, ...)
So, best solution here is to write simplest query that works and then check execution plan.
If that solution is not acceptable then you should try to find better query.

IN vs. JOIN with large rowsets

I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article in my blog summarizes both my answer and my comments to another answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key preserved (i. e. the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
Neither of these methods reevaluates the whole subquery each time.
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all RDBMS's of the big four.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a JOIN b a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This remove the duplicates that could be generated by the JOIN, but runs just as fast if not faster
Speaking from experience on a Table with 49,000,000 rows I would recommend LEFT OUTER JOIN.
Using IN, or EXISTS Took 5 minutes to complete where the LEFT OUTER JOIN finishes in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINS. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
,t1id int references t1(t1id)
)
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop, this difference is expected because as I mentioned above the join has different semantics.
From MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that
include subqueries can be
alternatively formulated as joins.
Other questions can be posed only with
subqueries. In Transact-SQL, there is
usually no performance difference
between a statement that includes a
subquery and a semantically equivalent
version that does not. However, in
some cases where existence must be
checked, a join yields better
performance. Otherwise, the nested
query must be processed for each
result of the outer query to ensure
elimination of duplicates. In such
cases, a join approach would yield
better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join would be more efficient than the IN subquery; however the SQL*Server optimizer normally results in no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever needs to be migrated in the future, the database engine may not be so forgiving (for example using a join instead of an IN subquery makes a huge difference in MySql).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
Ive always been a supporter of the IN methodology. This link contains details of a test conducted in PostgresSQL.
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php

How does the IN predicate work in SQL?

After prepairing an answer for this question I found I couldn't verify my answer.
In my first programming job I was told that a query within the IN () predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.
For example, given the query:
SELECT count(*) FROM Table1 WHERE Table1Id NOT IN (
SELECT Table1Id FROM Table2 WHERE id_user = 1)
Table1 Rows | # of "IN" executions
----------------------------------
10 | 10
100 | 100
1000 | 1000
10000 | 10000
Is this correct? How does the IN predicate actually work?
The warning you got about subqueries executing for each row is true -- for correlated subqueries.
SELECT COUNT(*) FROM Table1 a
WHERE a.Table1id NOT IN (
SELECT b.Table1Id FROM Table2 b WHERE b.id_user = a.id_user
);
Note that the subquery references the id_user column of the outer query. The value of id_user on each row of Table1 may be different. So the subquery's result will likely be different, depending on the current row in the outer query. The RDBMS must execute the subquery many times, once for each row in the outer query.
The example you tested is a non-correlated subquery. Most modern RDBMS optimizers worth their salt should be able to tell when the subquery's result doesn't depend on the values in each row of the outer query. In that case, the RDBMS runs the subquery a single time, caches its result, and uses it repeatedly for the predicate in the outer query.
PS: In SQL, IN() is called a "predicate," not a statement. A predicate is a part of the language that evaluates to either true or false, but cannot necessarily be executed independently as a statement. That is, you can't just run this as an SQL query: "2 IN (1,2,3);" Although this is a valid predicate, it's not a valid statement.
It will entirely depend on the database you're using, and the exact query.
Query optimisers are very smart at times - in your sample query, I'd expect the better databases to be able to use the same sort of techniques that they do with a join. More naive databases may just execute the same query many times.
This depends on the RDBMS in question.
See detailed analysis here:
MySQL, part 1
MySQL, part 2
SQL Server
Oracle
PostgreSQL
In short:
MySQL will optimize the query to this:
SELECT COUNT(*)
FROM Table1 t1
WHERE NOT EXISTS
(
SELECT 1
FROM Table2 t2
WHERE t2.id_user = 1
AND t2.Table1ID = t1.Table2ID
)
and run the inner subquery in a loop, using the index lookup each time.
SQL Server will use MERGE ANTI JOIN.
The inner subquery will not be "executed" in a common sense of word, instead, the results from both query and subquery will be fetched concurrently.
See the link above for detailed explanation.
Oracle will use HASH ANTI JOIN.
The inner subquery will be executed once, and a hash table will be built from the resultset.
The values from the outer query will be looked up in the hash table.
PostgreSQL will use NOT (HASHED SUBPLAN).
Much like Oracle.
Note that rewriting the query as this:
SELECT (
SELECT COUNT(*)
FROM Table1
) -
(
SELECT COUNT(*)
FROM Table2 t2
WHERE (t2.id_user, t2.Table1ID) IN
(
SELECT 1, Table1ID
FROM Table1
)
)
will greatly improve the performance in all four systems.
Depends on optimizer. Check exact query plan for each particular query to see how the RDBMS will actually execute that.
In Oracle that'd be:
EXPLAIN PLAN FOR «your query»
In MySQL or PostgreSQL
EXPLAIN «your query»
Most SQL engines nowadays will almost always create the same execution plan for LEFT JOIN, NOT IN and NOT EXISTS
I would say look at your execution plan and find out :-)
Also if you have NULL values for the Table1Id column you will not get any data back
Not really. But it's butter to write such queries using JOIN
Yes, but execution stops as soon as the query processer "finds" the value you are looking for... So if, for example the first row in the outer select has Table1Id = 32, and if Table2 has a record with a TableId = 32, then
as soon as the subquery finds the row in Table2 where TableId = 32, it stops...