SQL Server join query - sql

I have two tables: one is a small table and the other is a large table. When joining the two tables, which table should I put on the left and which on the right so that the query optimizer searches quicker, or does it not matter where I put each table?
For example:
--1
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON smalltable.column1 = largetable.column1 ;
--2
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON largetable.column1 = smalltable.column1 ;
Which query will be faster, or does it not matter?

If you're talking about Microsoft SQL Server, both queries are equivalent to the query optimizer. In fact, they'll be equivalent to almost any cost-based query optimizer. You can verify this by looking at the execution plan (see http://www.simple-talk.com/sql/performance/execution-plan-basics/ for details).

The query optimizer in most decent SQL Server variants will solve that. Some primitive engines don't have a query optimizer at all (older MySQL and Access come to mind), and some may get overloaded by complex decisions (this one is simple).
But in general, trust the query optimizer first.

It should not matter which order you use, as your SQL Server should optimise the query execution for you. However, (if you are using Microsoft SQL Server) you could use SQL Server Profiler (found under the Tools menu of SQL Server Management Studio) to check the execution plans of both options.
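As a quick alternative to Profiler, here is a minimal sketch for comparing the two orderings, reusing the smalltable/largetable names from the question; the actual execution plan and the IO/TIME statistics should come out identical:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Order 1: small table's column written first in the join condition
SELECT smalltable.column1, largetable.column1
FROM smalltable
INNER JOIN largetable ON smalltable.column1 = largetable.column1;
-- Order 2: join condition reversed
SELECT smalltable.column1, largetable.column1
FROM smalltable
INNER JOIN largetable ON largetable.column1 = smalltable.column1;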

If one of the tables is smaller than the other, place the smaller table first and then the larger table; the optimizer will have less work to do, and this helps it choose a plan that uses a hash join.
Then run the query in the profiler and check that a hash join is used, because it is the best and fastest join in this scenario.
If there are no indexes on the joined tables, the optimizer will select a hash join.
You can force a hash join by adding OPTION (HASH JOIN) at the end of the statement.
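For example, a minimal sketch using the table names from the question (the hint goes at the end of the statement, not inside the JOIN clause):
SELECT smalltable.column1, largetable.column1
FROM smalltable
INNER JOIN largetable ON smalltable.column1 = largetable.column1
OPTION (HASH JOIN); -- restricts the optimizer to hash joins for this statement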
From MSDN, http://blogs.msdn.com/b/irenak/archive/2006/03/24/559855.aspx
The column name that joins the table is called a hash key. In the example above, it’ll be au_id. SQL Server examines the two tables being joined, chooses the smaller table (so called build input), and builds a hash table applying a hash algorithm to the values of a hash key. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If build input is done completely in-memory, the hash join is called an “in-memory hash join”. If SQL Server doesn’t have enough memory to hold the entire build input, the process will be done in chunks, and is called “grace hash join”.

Before running both queries, select 'Include Actual Execution Plan' from the menu and then run them. SQL Server will show the execution plan, which is the best tool for creating optimized queries. See more about Execution Plan here.

The order of the join columns does matter. See this post for more detail. Also there has been no discussion of indexing in this thread. It is the combination of optimal join table order AND useful indexing that results in the fastest executing queries.

Related

MSSQL Different execution plan for same query/data

I have a query that is running "fast" on production, but very slow (1 hour) on test servers.
The following query is in question:
select z.PrimaryKeyColumn
FROM [table1] z
inner join Table2 p on p.PrimaryKeyColumn=z.PrimaryKeyColumn
left outer join table3 pz on z.PrimaryKeyColumn = RTRIM(RTRIM(pz.column2) + LTRIM(pz.column3))
I analyzed the query execution plans and realized that production uses a hash match while test uses a nested loop for the first join, hence the slowness.
I have rebuilt indexes and updated statistics, but the results are the same.
Additionally, on the TEST server, where results are slow, I copied/duplicated the Table2 table with its indexes and data, and when I use that copy the query is as fast as it is on the production server...
These are the query execution plans:
TEST server:
TEST server but using duplicate of Table2 in INNER JOIN:
PRODUCTION server:
Probably the two servers are not the same, or they have different configurations. Also, the RTRIM() functions are not necessary; remove them from the comparison and you will get the same result.
The fact that copying Table2 to a new table resolved the query plan to the same one as production does indicate that there is something different in the cardinality estimates between the original and the copy.
There have to be some differences in the statistics created for the tables, so check that the original and the copy have the same statistics created.
Also review the histograms for the statistics, especially those related to the different index choices observed between the plans: do the steps look the same?
Also, possibly obvious and you may already have confirmed it, but do all the same indexes exist on both tables?
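A hedged sketch of how you might compare them on the TEST server; Table2 is the name from the question, while Table2_Copy and the statistics names are hypothetical placeholders:
-- Compare the histograms that feed the cardinality estimates
DBCC SHOW_STATISTICS ('dbo.Table2', PK_Table2);           -- statistics name is hypothetical
DBCC SHOW_STATISTICS ('dbo.Table2_Copy', PK_Table2_Copy); -- likewise for the copy
-- Confirm the same indexes exist on the original and the copy
EXEC sp_helpindex 'dbo.Table2';
EXEC sp_helpindex 'dbo.Table2_Copy';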

Why is there a HUGE performance difference between temp table and subselect

This is a question about SQL Server 2008 R2
I'm not a DBA, by far. I'm a Java developer who has to write SQL from time to time (mostly embedded in code). I want to know if I did something wrong here, and if so, what I can do to avoid it happening again.
Q1:
SELECT something FROM (SELECT * FROM T1 WHERE condition1) x JOIN ...
Q1 features 14 joins
Q2 is the same as Q1, with one exception: (SELECT * FROM T1 WHERE condition1) is executed beforehand and stored in a temp table.
This is not a correlated sub-query.
Q2:
SELECT * INTO #tempTable FROM T1 WHERE condition1
SELECT something FROM #tempTable JOIN ...
again, 14 joins.
The thing that puzzles me now is that Q1 took > 2 min (I tried it a few times, to rule out caching effects), while Q2 (both queries combined) took 2 sec! What gives?
Why is it not recommended to use subqueries?
A database optimizer (regardless of which database you are using) cannot always properly optimize such a query (one with subqueries). In this case, the optimizer's problem is choosing the right way to join the result sets. There are several algorithms for joining two result sets, and the choice of algorithm depends on the number of records contained in each of them. If you join two physical tables (a subquery is not a physical table), the database can easily determine the amount of data in the two result sets from the available statistics. If one of the result sets is a subquery, it is very difficult to know how many records it will return. In that case the database can choose the wrong join plan, which leads to a dramatic reduction in query performance.
Rewriting the query with temporary tables is intended to simplify the optimizer's job. In the rewritten query, all result sets participating in joins are physical tables, so the database can easily determine the length of each result set. This allows the database to choose the fastest of all possible query plans, and it will make the right choice no matter what the conditions are. A rewritten query with temporary tables will work well on any database, which is especially important when developing portable solutions. In addition, the rewritten query is easier to read, easier to understand and easier to debug.
Admittedly, rewriting the query with temporary tables can lead to some slowdown due to the additional expense of creating the temporary tables. If the database does not make a mistake in its choice of query plan, it will perform the old query faster than the new one. However, this slowdown is usually negligible; typically, creating a temporary table takes a few milliseconds, so the delay cannot have a significant impact on system performance and can usually be ignored.
Important! Do not forget to create indexes on the temporary tables. The indexed fields should include all fields that are used in the join conditions.
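A minimal sketch of that advice, reusing T1, condition1 and something from the question above; #tempTable, T2 and the join column JoinKey are hypothetical placeholders:
-- Materialize the filtered rows into a physical temp table with real statistics
SELECT * INTO #tempTable FROM T1 WHERE condition1;
-- Index the temp table on the column(s) used in the join conditions
CREATE CLUSTERED INDEX IX_tempTable_JoinKey ON #tempTable (JoinKey);
-- The joins now run against a table whose size the optimizer can estimate accurately
SELECT something
FROM #tempTable t
INNER JOIN T2 ON T2.JoinKey = t.JoinKey;
DROP TABLE #tempTable;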
There are a lot of things to tackle here: indexes, execution plans, etc. Testing and comparing results is the way to go.
You could take a look at the usual suspects: indexes. Look at the execution plans and compare them. Make sure the WHERE clause is using the correct ones, and ensure you are using indexes on your JOINs.
These answers will surely help you a lot:
Performance: Subquery or Joining
Is there a speed difference between CTE, SubQuery and Temp tables?

What are the steps followed by the SQL engine to execute a query?

My question is not about how to use an inner join in SQL. I know how it matches rows between table A and table B.
I'd like to ask how an inner join works internally. What algorithm does it involve? What happens internally when joining multiple tables?
There are different algorithms, depending on the DB server, the indexes and data order (clustered PK), whether calculated values are joined, etc.
Have a look at the query plan, which most SQL systems can create for a query; it should give you an idea of what the engine does.
In MS SQL Server, different join algorithms will be used in different situations, depending on the tables (their size, what sort of indexes are available, etc.). I imagine other DB engines also use a variety of algorithms.
The main types of join used by MS SQL Server are:
- Nested loops joins
- Merge joins
- Hash joins
You can read more about them on this page: MSDN - Advanced Query Tuning Concepts
If you get SQL to display the 'execution plan' for your queries you will be able to see what type of join is being used in different situations.
It depends on what database you're using, what you're joining (large/small, in sequence/random, indexed/non-indexed etc).
For example, SQL Server has several different join algorithms; loop joins, merge joins, hash joins. Which one is used is determined by the optimizer when it is working out an execution plan. Sometimes it makes a misjudgement and you can then force a specific join algorithm by using join hints.
You may find the following MSDN pages interesting:
http://msdn.microsoft.com/en-us/library/ms191318.aspx (loop)
http://msdn.microsoft.com/en-us/library/ms189313.aspx (hash)
http://msdn.microsoft.com/en-us/library/ms190967.aspx (merge)
http://msdn.microsoft.com/en-us/library/ms173815.aspx (hints)
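For illustration, a hedged sketch of the two ways to hint a join algorithm in SQL Server (the tables a and b and the column id are hypothetical):
-- Table-level hint: force a merge join for this particular join
SELECT a.id
FROM a
INNER MERGE JOIN b ON b.id = a.id;
-- Query-level hint: restrict the algorithm the optimizer may use for every join in the statement
SELECT a.id
FROM a
INNER JOIN b ON b.id = a.id
OPTION (LOOP JOIN);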
In this case you should look at how the data is stored in a B-tree; after that, I think you will understand the join algorithms.
It is all based on set theory, which has been around for a while.
Try not to link too many tables at any one time; it seems to exhaust database resources with all the scanning. Indexes help with performance; look at some SQL sites and search for optimising SQL queries to get some insight. SQL Server Management Studio has a built-in execution plan utility that is often interesting, especially for large, complex queries.
The optimizer will (or should) choose the fastest join algo.
However there are two different kinds of determining what is fast:
You measure the time that it takes to return all the joined rows.
You measure the time that it takes to return the first joined rows.
If you want to return all the rows as fast as possible, the optimizer will often choose a hash join or a merge join. If you want to return the first few rows as fast as possible, the optimizer will choose a nested loops join.
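A hedged sketch of nudging SQL Server toward the second goal; Orders, Customers and their columns are hypothetical names:
SELECT o.OrderID, c.Name
FROM Orders o
INNER JOIN Customers c ON c.CustomerID = o.CustomerID
OPTION (FAST 10); -- optimize for returning the first 10 rows quickly, which usually favors nested loops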
Conceptually, it creates a Cartesian product of the two tables and then selects the matching rows from it. Read the Korth book on databases for the details.

Correlated query vs inner join performance in SQL Server

Let's say that you want to select all rows from one table that have a corresponding row in another (the data in the other table is not important, only the presence of a corresponding row). From what I know about DB2, this kind of query performs better when written as a correlated query with an EXISTS clause rather than an INNER JOIN. Is that the same for SQL Server, or does it not make any difference whatsoever?
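For concreteness, the two forms I'm comparing would look roughly like this (TableA, TableB and KeyCol are placeholder names):
-- Correlated EXISTS form
SELECT a.*
FROM TableA a
WHERE EXISTS (SELECT 1 FROM TableB b WHERE b.KeyCol = a.KeyCol);
-- INNER JOIN form (DISTINCT guards against duplicates when TableB has several matching rows)
SELECT DISTINCT a.*
FROM TableA a
INNER JOIN TableB b ON b.KeyCol = a.KeyCol;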
I just ran a test query and the two statements ended up with the exact same execution plan. Of course, for just about any performance question I would recommend running the test in your own environment; with SQL Server Management Studio this is easy (or SQL Query Analyzer if you're running 2000). Just type both statements into a query window, select Query | Include Actual Execution Plan, then run the queries. Go to the results tab and you can easily see what the plans are and which one had the higher cost.
Odd: it's normally more natural for me to write these as a correlated query first, at which point I have to go back and refactor to use a join, because in my experience the SQL Server optimizer is more likely to get that right.
But don't take me too seriously. For all that I have 26K rep here and one of only two current SQL topic-specific badges, I'm actually pretty junior in terms of SQL knowledge (it's all about the volume! ;) ); certainly I'm no DBA. In practice, you will of course need to profile each method to gauge its actual performance. I would expect the optimizer to recognize what you're asking for and handle either query in the optimal way, but you never know until you check.
As everyone notes, it all boils down to the optimizer. I'd suggest writing it in whatever way feels more natural to you, then making sure the optimizer can figure out the most effective query plan (gather statistics, create an index, whatever). The SQL Server optimizer is pretty good overall, so long as you give it the information it needs to work with.
Use the join. It might not make much of a difference in performance if you have small tables, but if the "outer" table is very large then the EXISTS sub-query will need to run for each row. If your tables are indexed on the common columns, it should be far quicker to do the INNER JOIN. BTW, if you want to find all rows that are NOT in the second table, use a LEFT JOIN and test for NULL in the second table; it is much faster than using EXISTS when you have very large tables and indexes.
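A minimal sketch of that anti-join pattern (TableA, TableB and KeyCol are placeholder names):
SELECT a.*
FROM TableA a
LEFT JOIN TableB b ON b.KeyCol = a.KeyCol
WHERE b.KeyCol IS NULL; -- keeps only the TableA rows that have no match in TableB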
Probably the best performance is with a join to a derived table. Exists would probably be next (and might be faster). The worst performance would be with a subquery inside the select as it would tend to run row by row instead of as a set.
However, all things being equal, and with database performance being very dependent on the database design, I would try out all the possible methods and see which is faster in your circumstances.

Are LEFT JOIN subquery table arguments evaluated more than once?

I have a query that looks like this:
SELECT *
FROM employees e
LEFT JOIN
(
SELECT *
FROM timereports
WHERE date = '2009-05-04'
) t
ON e.id = t.employee_id
As you can see, the second table argument of my LEFT JOIN is generated by a subquery.
Does the db evaluate this subquery only once, or multiple times?
thanks.
matti
This depends on the RDBMS.
In most of them, a HASH OUTER JOIN will be employed, in which case the subquery will be evaluated once.
MySQL, on the other hand, isn't capable of performing HASH JOINs, which is why it will most probably push the predicate into the subquery and issue this query:
SELECT *
FROM timereports t
WHERE t.employee_id = e.id
AND date = '2009-05-04'
in a nested loop. If you have an index on timereports (employee_id, date), this will also be efficient.
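For reference, a hedged sketch of that index (the index name is hypothetical):
CREATE INDEX idx_timereports_employee_date ON timereports (employee_id, date);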
If you are using SQL Server, you can take a look at the Execution Plan of the query. The SQL Server query optimizer will optimize the query so that it takes the least time in execution. Best time will be based on some conditions viz. indexing and the like.
You have to ask the database to show the plan. The algorithm for doing this is chosen dynamically (at query time) based on many factors. Some databases use statistics of the key distribution to decide which algorithm to use. Other databases have relatively fixed rules.
Further, each database has a menu of different algorithms. The database could use a sort-merge algorithm, or nested loops. In this case, there may be a query flattening strategy.
You need to use your database's unique "Explain Plan" feature to look at the query execution plan.
You also need to know if your database uses hints (usually comments embedded in the SQL) to pick an algorithm.
You also need to know if your database uses statistics (sometimes called a "cost-based query optimizer") to pick an algorithm.
Once you know all that, you'll know how your query is executed and whether an inner query is evaluated multiple times, flattened into the parent query, or evaluated once to create a temporary result that is used by the parent query.
What do you mean by evaluated?
The database has a couple of different options for how to perform a join, the two most common being:
Nested loops, in which case each row in one table will be looped through and the corresponding row in the other table will be looked up, and
Hash join, which means that both tables will be scanned once and the results are then merged using some hash algorithm.
Which of those two options is chosen depends on the database, the size of the table and the available indexes (and perhaps other things as well).