What steps does the SQL engine follow to execute a query? - sql

My question is not how to use an inner join in SQL. I know how it matches rows between table a and table b.
I'd like to ask how an inner join works internally. What algorithm does it involve? What happens internally when joining multiple tables?

There are different algorithms, depending on the DB server, the indexes and data order (clustered PK), whether calculated values are joined or not, etc.
Have a look at a query plan, which most SQL systems can produce for a query; it should give you an idea of what the engine does.

In MS SQL Server, different join algorithms will be used in different situations depending on the tables (their size, what sort of indexes are available, etc). I imagine other DB engines also use a variety of algorithms.
The main types of join used by MS SQL Server are:
- Nested loops joins
- Merge joins
- Hash joins
You can read more about them on this page: MSDN - Advanced Query Tuning Concepts
If you get SQL Server to display the 'execution plan' for your queries you will be able to see what type of join is being used in different situations.
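To make the three names above concrete, here is a minimal Python sketch of each algorithm. It is an illustration only, not SQL Server's implementation: rows are plain tuples, the join key is assumed to be a column position, and the sample data is made up.

```python
from collections import defaultdict

def nested_loops_join(left, right, key=0):
    """Compare every left row with every right row."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key=0):
    """Build a hash table on one input, then probe it with the other."""
    buckets = defaultdict(list)
    for r in right:
        buckets[r[key]].append(r)
    return [(l, r) for l in left for r in buckets.get(l[key], [])]

def merge_join(left, right, key=0):
    """Walk two inputs that are sorted on the join key in step."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ...and pair it with every left row carrying the same key.
            while i < len(left) and left[i][key] == lk:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

# Made-up sample data: (id, payload)
table_a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
table_b = [(2, 10), (2, 20), (3, 30), (4, 40)]

nl_rows = sorted(nested_loops_join(table_a, table_b))
hash_rows = sorted(hash_join(table_a, table_b))
merge_rows = sorted(merge_join(table_a, table_b))
```

All three return the same rows; they differ only in how much work they do and in what they need (nothing, memory for the hash table, or sorted inputs).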

It depends on what database you're using, what you're joining (large/small, in sequence/random, indexed/non-indexed etc).
For example, SQL Server has several different join algorithms; loop joins, merge joins, hash joins. Which one is used is determined by the optimizer when it is working out an execution plan. Sometimes it makes a misjudgement and you can then force a specific join algorithm by using join hints.
You may find the following MSDN pages interesting:
http://msdn.microsoft.com/en-us/library/ms191318.aspx (loop)
http://msdn.microsoft.com/en-us/library/ms189313.aspx (hash)
http://msdn.microsoft.com/en-us/library/ms190967.aspx (merge)
http://msdn.microsoft.com/en-us/library/ms173815.aspx (hints)

In this case you should look at how data is stored in a B-tree; after that I think you will understand the JOIN algorithms.

It's all based on set theory, which has been around a while.
Try not to join too many tables at any one time; it seems to exhaust database resources with all the scanning. Indexes help with performance; look at some SQL sites and search on optimising SQL queries to get some insight. SQL Server Management Studio has a built-in execution plan utility that's often interesting, especially for large complex queries.

The optimizer will (or should) choose the fastest join algorithm.
However, there are two different ways of deciding what is fast:
- You measure the time that it takes to return all the joined rows.
- You measure the time that it takes to return the first joined rows.
If you want to return all the rows as fast as possible, the optimizer will often choose a hash join or a merge join. If you want to return the first few rows as fast as possible, the optimizer will usually choose a nested loops join.
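A rough way to see the first-row difference is to count how many input rows each algorithm must read before it can hand back its first result. The sketch below uses Python generators; the table contents, the `rows_read` counter and the dictionary standing in for an index are all assumptions made for the illustration.

```python
def nested_loops_lazy(outer, inner_by_key, counter):
    """Emit matches one outer row at a time, using an index lookup on the inner side."""
    for key, payload in outer:
        counter["rows_read"] += 1
        for match in inner_by_key.get(key, []):
            yield (key, payload, match)

def hash_join_lazy(build, probe, counter):
    """The whole build input is read before the first row can be produced."""
    table = {}
    for key, payload in build:
        counter["rows_read"] += 1
        table.setdefault(key, []).append(payload)
    for key, payload in probe:
        counter["rows_read"] += 1
        for match in table.get(key, []):
            yield (key, payload, match)

orders = [(i, f"order-{i}") for i in range(1000)]
customers = [(i, f"customer-{i}") for i in range(1000)]
# Stands in for an index that already exists on the inner table.
customer_index = {key: [name] for key, name in customers}

nl_counter = {"rows_read": 0}
first_nl = next(nested_loops_lazy(orders, customer_index, nl_counter))

hj_counter = {"rows_read": 0}
first_hj = next(hash_join_lazy(customers, orders, hj_counter))
```

Both calls return the same first row, but the nested loops version has touched one input row at that point while the hash join has already read the entire build input plus one probe row.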

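Conceptually, it creates a Cartesian product of the two tables and then selects the rows that satisfy the join condition; real engines avoid materializing the full product where they can. Read Korth's book on databases for details.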
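Taken literally, the product-then-select definition looks like the short Python sketch below (made-up tuples, join key in the first position). It is meant as the logical definition of the result, not as a description of how an engine actually runs the join.

```python
from itertools import product

a = [(1, "x"), (2, "y"), (3, "z")]
b = [(2, "p"), (3, "q"), (3, "r"), (4, "s")]

# Step 1: every possible pairing, len(a) * len(b) rows.
cartesian = list(product(a, b))

# Step 2: keep only the pairs that satisfy the join condition a.key = b.key.
joined = [(left, right) for left, right in cartesian if left[0] == right[0]]
```

Even on this tiny input the product has four times as many rows as the final result, which is why building it in full is avoided in practice.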

Related

Why is there a HUGE performance difference between temp table and subselect

This is a question about SQL Server 2008 R2
I'm not a DBA, by far. I'm a Java developer who has to write SQL from time to time (mostly embedded in code). I want to know if I did something wrong here and, if so, what I can do to avoid it happening again.
Q1:
SELECT something FROM (SELECT * FROM T1 WHERE condition1) JOIN ...
Q1 features 14 joins
Q2 is the same as Q1, with one exception. (SELECT * FROM T1 WHERE condition1) is executed before, and stored in a temp table.
This is not a correlated sub-query.
Q2:
SELECT * INTO #tempTable FROM T1 WHERE condition1
SELECT something FROM #tempTable JOIN ...
again, 14 joins.
The thing that puzzles me now is that Q1 took > 2min (I tried it a few times, to rule out caching playing a role) while Q2 (both queries combined) took 2sec!!! What gives?
Why it's not recommended to use subqueries?
A database optimizer (regardless of which database you are using) cannot always properly optimize such a query (with subqueries). In this case, the optimizer's problem is choosing the right way to join the result sets. There are several algorithms for joining two result sets, and the choice depends on the number of records in each one. If you join two physical tables (a subquery is not a physical table), the database can easily estimate the amount of data in the two result sets from the available statistics. If one of the result sets is a subquery, estimating how many records it returns is much harder. In that case the database can choose the wrong join plan, which can lead to a dramatic reduction in query performance.
Rewriting the query to use temporary tables is intended to make the optimizer's job simpler. In the rewritten query, all result sets participating in joins are physical tables, so the database can easily determine the size of each one. This gives the optimizer a much better chance of choosing a good query plan, whatever the conditions are. The rewritten query with temporary tables tends to work well on any database, which is especially important when developing portable solutions. In addition, the rewritten query is easier to read, understand and debug.
Of course, rewriting the query with temporary tables can cause some slowdown due to the additional cost of creating the temporary tables. If the database does not make a mistake when choosing the query plan, it will run the old query faster than the new one. However, this slowdown is usually negligible. Typically the creation of a temporary table takes a few milliseconds. That is, the delay rarely has a significant impact on system performance and can usually be ignored.
Important! Do not forget to create indexes for temporary tables. The index fields should include all fields that are used in join conditions.
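The rewrite described above (materialize the filtered subquery, index the join column, then join) can be tried out with a small Python script. This sketch runs against SQLite through the standard sqlite3 module, not SQL Server, so `CREATE TEMP TABLE ... AS` stands in for `SELECT ... INTO #tempTable`; the tables `t1`, `t2` and their columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, status TEXT);
    CREATE TABLE t2 (t1_id INTEGER, amount INTEGER);
    INSERT INTO t1 VALUES (1, 'open'), (2, 'closed'), (3, 'open');
    INSERT INTO t2 VALUES (1, 10), (1, 15), (2, 20), (3, 30);

    -- Step 1: materialize the filtered subquery.
    CREATE TEMP TABLE filtered AS SELECT * FROM t1 WHERE status = 'open';
    -- Step 2: index the column used in the join condition.
    CREATE INDEX idx_filtered_id ON filtered (id);
""")

# Original shape: the filter is an inline subquery.
subquery_sql = """
    SELECT f.id, t2.amount
    FROM (SELECT * FROM t1 WHERE status = 'open') AS f
    JOIN t2 ON t2.t1_id = f.id
    ORDER BY f.id, t2.amount
"""
# Rewritten shape: join against the temporary table instead.
temp_sql = """
    SELECT f.id, t2.amount
    FROM filtered AS f
    JOIN t2 ON t2.t1_id = f.id
    ORDER BY f.id, t2.amount
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
temp_rows = conn.execute(temp_sql).fetchall()
```

Both shapes return the same rows; whether the temporary-table version is actually faster depends on the engine and the data, so it is worth timing both on the real tables.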
There are a lot of things to tackle here: indexes, execution plans, etc. Testing and comparing results is the way to go.
You could take a look at the usual suspects, indexes. Take a look at the execution plans and compare them. Make sure the WHERE clause is using the correct ones. Ensure you are using the indexes on your JOINs.
These answers sure will help you a lot.
Performance: Subquery or Joining
Is there a speed difference between CTE , SubQuery and Temp tables?

Sql Server Join query

I have two tables: one is small and the other is large. When joining the two tables, which table should I put on the left and which on the right so that the query optimiser searches faster, or does it not matter where I put each table in the join?
for example :
--1
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON smalltable.column1 = largetable.column1 ;
--2
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON largetable.column1 = smalltable.column1 ;
Which query will be faster, or does it not matter?
If you're talking about Microsoft SQL Server, both queries are equivalent to the query optimizer. In fact, to almost any cost-based query optimizer they'll be equivalent. You can check it by looking at the execution plan (see here for details: http://www.simple-talk.com/sql/performance/execution-plan-basics/).
The query optimizer in most decent SQL database servers will sort that out. Some primitive ones don't (have a query optimizer; older MySQL and Access come to mind). Some may get overloaded with complex decisions (this one is simple).
But in general - trust the query optimizer first.
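If you want to check this kind of claim yourself without a SQL Server instance at hand, a small Python script against SQLite shows the idea. This is only a sketch on a different engine: it reuses the table and column names from the question, fills them with made-up data, and prints SQLite's `EXPLAIN QUERY PLAN` output for both spellings of the ON clause so the plans can be compared by eye.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE smalltable (column1 INTEGER);
    CREATE TABLE largetable (column1 INTEGER);
""")
conn.executemany("INSERT INTO smalltable VALUES (?)", [(i,) for i in range(0, 100, 10)])
conn.executemany("INSERT INTO largetable VALUES (?)", [(i,) for i in range(100)])

query_1 = """
    SELECT smalltable.column1, largetable.column1
    FROM smalltable
    INNER JOIN largetable ON smalltable.column1 = largetable.column1
"""
query_2 = """
    SELECT smalltable.column1, largetable.column1
    FROM smalltable
    INNER JOIN largetable ON largetable.column1 = smalltable.column1
"""

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes one step of the plan.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

rows_1 = sorted(conn.execute(query_1).fetchall())
rows_2 = sorted(conn.execute(query_2).fetchall())

print(plan(query_1))
print(plan(query_2))
```

The results are identical by definition; the printed plans are what tell you whether the engine treated the two spellings differently.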
It should not matter which order you use, as your SQL Server should optimise the query execution for you. However, (if you are using Microsoft SQL Server) you could use SQL Server Profiler (found under the Tools menu of SQL Server Management Studio) to check the execution plans of both options.
If one of the tables is smaller than the other, place the smaller table first and then the larger table, as it will have less work to do; moreover, this will help the query optimizer choose a plan that uses a hash join.
Then run the query profiler and check that the hash join is used, because this is the best and fastest in this scenario.
If there are no indexes on the joined tables, the optimizer will usually select a hash join.
You can force a hash join by adding OPTION (HASH JOIN) at the end of the query, or by writing INNER HASH JOIN in the join itself.
From MSDN, http://blogs.msdn.com/b/irenak/archive/2006/03/24/559855.aspx
The column name that joins the table is called a hash key. In the example above, it’ll be au_id. SQL Server examines the two tables being joined, chooses the smaller table (so called build input), and builds a hash table applying a hash algorithm to the values of a hash key. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If build input is done completely in-memory, the hash join is called an “in-memory hash join”. If SQL Server doesn’t have enough memory to hold the entire build input, the process will be done in chunks, and is called “grace hash join”.
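The build/probe steps and the "in chunks" variant from the quote can be sketched in Python as follows. This is a simplified illustration, not SQL Server's code: everything stays in memory (a real grace hash join writes the partitions to disk), the partition count is arbitrary, and the author and title rows are made up around the `au_id` key mentioned above.

```python
from collections import defaultdict

def in_memory_hash_join(build, probe):
    """Hash the build input on its key, then look up each probe row."""
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]

def grace_hash_join(left, right, partitions=4):
    """Split both inputs by hash of the key, then join matching partitions."""
    # The smaller input becomes the build input.
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    build_parts = defaultdict(list)
    probe_parts = defaultdict(list)
    for row in build:
        build_parts[hash(row[0]) % partitions].append(row)
    for row in probe:
        probe_parts[hash(row[0]) % partitions].append(row)
    out = []
    # Rows with equal keys always land in the same partition number,
    # so each pair of partitions can be joined independently.
    for p in range(partitions):
        out.extend(in_memory_hash_join(build_parts[p], probe_parts[p]))
    return out

authors = [(1, "Ann"), (2, "Bob"), (3, "Cy")]              # (au_id, name)
titles = [(1, "T1"), (1, "T2"), (3, "T3"), (5, "T5")]      # (au_id, title)

chunked = sorted(grace_hash_join(authors, titles))
direct = sorted(in_memory_hash_join(authors, titles))
```

The partitioned version gives the same rows as the single in-memory pass; the point of partitioning is that each chunk only needs enough memory for its own slice of the build input.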
Before running both queries, select 'Include Actual Execution Plan' from the menu and then run the queries. SQL Server will show the execution plan, which is the best tool for writing optimized queries. See more about Execution Plan here.
The order of the join columns does matter. See this post for more detail. Also there has been no discussion of indexing in this thread. It is the combination of optimal join table order AND useful indexing that results in the fastest executing queries.

Methods of visualizing joins

Just wondering if anyone has any tricks (or tools) they use to visualize joins. You know, you write the perfect query, hit run, and after it's been running for 20 minutes, you realize you've probably created a cartesian join.
I sometimes have difficulty visualizing what's going to happen when I add another join statement and wondered if folks have different techniques they use when trying to put together lots of joins.
Always keep the end in mind.
Ascertain which are the columns you need
Try to figure out the minimum number of tables which will be needed to do it.
Write your FROM part with the table which will give max number of columns. eg FROM Teams T
Add each join one by one on a new line. Ensure whether you'll need OUTER, INNER, LEFT, RIGHT JOIN at each step.
Usually works for me. Keep in mind that it is Structured query language. Always break your query into logical lines and it's much easier.
Every join combines two resultsets into one. Each may be from a single database table or a temporary resultset which is the result of previous join(s) or of a subquery.
Always know the order that joins are processed, and, for each join, know the nature of the two temporary result sets that you are joining together. Know what logical entity each row in that resultset represents, and what attributes in that resultset uniquely identify that entity. If your join is intended to always join one row to one row, these key attributes are the ones you need to use (in join conditions) to implement the join. If your join is intended to generate some kind of cartesian product, then it is critical to understand the above to understand how the join conditions (whatever they are) will affect the cardinality of the new joined resultset.
Try to be consistent in the use of outer join directions. I try to always use Left Joins when I need an outer join, as I "think" of each join as "joining" the new table (to the right) to whatever I have already joined together (on the left) of the Left Join statement...
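One way to act on the advice about knowing which attributes uniquely identify each row is to estimate the size of a join before running it. The Python sketch below does that from the join-key values alone; the helper names and the team, player and sponsor keys are hypothetical.

```python
from collections import Counter

def predicted_join_rows(left_keys, right_keys):
    """Rows an inner equi-join would produce: sum of count_left * count_right per key."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    return sum(n * right_counts[key] for key, n in left_counts.items())

def is_unique(keys):
    """True when the key identifies at most one row on that side of the join."""
    return len(keys) == len(set(keys))

team_ids = [1, 2, 3]                    # one row per team
player_team_ids = [1, 1, 2, 2, 2, 3]    # many players per team
sponsor_team_ids = [1, 1, 3]            # many sponsors per team

one_to_many = predicted_join_rows(team_ids, player_team_ids)
many_to_many = predicted_join_rows(player_team_ids, sponsor_team_ids)
```

When neither side is unique on the join key, rows multiply per key, which is usually the first sign of an accidental partial Cartesian product.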
Run an explain plan.
These are always hierarchical trees (to do this, first I must do that). Many tools exist to render these plans as graphical trees, some in SQL browsers (e.g., Oracle SQL Developer, or SQL Server Management Studio for SQL Server). If you don't have a tool, most plan text output includes a "depth" column, which you can use to indent the lines.
What you want to look for is the cost of each row. (Note that for Oracle, though, higher costs can mean less time, if it allows Oracle to do a hash join rather than nested loops, and if the final result set has high cardinality (many, many rows).)
I have never found a better tool than thinking it through and using my own mind.
If the query is so complicated that you cannot do that, you may want to use either CTE's, views, or some other carefully organized subqueries to break it into logical pieces so you can easily understand and visualize each piece even if you cannot manage the whole.
Also, if your concern is efficiency, then SQL Server Management Studio 2005 or later lets you get estimated query execution plans without actually executing the query. This can give you very good ideas of where problems lie, if you are using MS SQL Server.

Correlated query vs inner join performance in SQL Server

let's say that you want to select all rows from one table that have a corresponding row in another one (the data in the other table is not important, only the presence of a corresponding row is important). From what I know about DB2, this kinda query is better performing when written as a correlated query with a EXISTS clause rather than a INNER JOIN. Is that the same for SQL Server? Or doesn't it make any difference whatsoever?
I just ran a test query and the two statements ended up with the exact same execution plan. Of course, for just about any performance question I would recommend running the test in your own environment. With SQL Server Management Studio this is easy (or SQL Query Analyzer if you're running 2000). Just type both statements into a query window and select Query | Include Actual Execution Plan. Then run the query. Go to the results tab and you can easily see what the plans are and which one had a higher cost.
Odd: it's normally more natural for me to write these as a correlated query first, at which point I have to then go back and re-factor to use a join because in my experience the sql server optimizer is more likely to get that right.
But don't take me too seriously. For all I have 26K rep here and one of only 2 current sql topic-specific badges, I'm actually pretty junior in terms of sql knowledge (It's all about the volume! ;) ); certainly I'm no DBA. In practice, you will of course need to profile each method to gauge its actual performance. I would expect the optimizer to recognize what you're asking for and handle either query in the optimal way, but you never know until you check.
As everyone notes, it all boils down to the optimizer. I'd suggest writing it in whatever way feels more natural to you, then making sure the optimizer can figure out the most effective query plan (gather statistics, create an index, whatever). The SQL Server optimizer is pretty good overall, so long as you give it the information it needs to work with.
Use the join. It might not make much of a difference in performance if you have small tables, but if the "outer" table is very large then it will need to do the EXISTS sub-query for each row. If your tables are indexed on the common columns then it should be far quicker to do the INNER JOIN. BTW, if you want to find all rows that are NOT in the second table, use a LEFT JOIN and test for NULL in the second table--it is much faster than using EXISTS when you have very large tables and indexes.
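Performance aside, the EXISTS and INNER JOIN forms are not always interchangeable, and the difference is easy to show. The Python sketch below uses SQLite through the sqlite3 module rather than SQL Server, with made-up `customers` and `orders` tables; it also runs the two "rows NOT in the second table" forms mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Customers that have at least one order: one row per customer.
exists_rows = conn.execute("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()

# Same intent as a join: one row per matching order, so Ann appears twice.
join_rows = conn.execute("""
    SELECT c.name FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# Customers without orders, written two ways.
not_exists_rows = conn.execute("""
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()
left_join_rows = conn.execute("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.id IS NULL
    ORDER BY c.name
""").fetchall()
```

If the second table can hold several matching rows, the join form needs a DISTINCT (or a unique key on the join column) to return the same result as EXISTS, and that extra step can itself change the plan.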
Probably the best performance is with a join to a derived table. Exists would probably be next (and might be faster). The worst performance would be with a subquery inside the select as it would tend to run row by row instead of as a set.
However, all things being equal, and database performance being very dependent on the database design, I would try out all possible methods and see which are faster in your circumstances.

What generic techniques can be applied to optimize SQL queries?

What techniques can be applied effectively to improve the performance of SQL queries? Are there any general rules that apply?
Use primary keys
Avoid select *
Be as specific as you can when building your conditional statements
De-normalisation can often be more efficient
Table variables and temporary tables (where available) will often be better than using a large source table
Partitioned views
Employ indices and constraints
Learn what's really going on under the hood - you should be able to understand the following concepts in detail:
Indexes (not just what they are but actually how they work).
Clustered indexes vs heap allocated tables.
Text and binary lookups and when they can be in-lined.
Fill factor.
How records are ghosted for update/delete.
When page splits happen and why.
Statistics, and how they affect various query speeds.
The query planner, and how it works for your specific database (for instance on some systems "select *" is slow, on modern MS-Sql DBs the planner can handle it).
The biggest thing you can do is to look for table scans in sql server query analyzer (make sure you turn on "show execution plan"). Otherwise there are a myriad of articles at MSDN and elsewhere that will give good advice.
As an aside, when I started learning to optimize queries I ran sql server query profiler against a trace, looked at the generated SQL, and tried to figure out why that was an improvement. Query profiler is far from optimal, but it's a decent start.
There are a couple of things you can look at to optimize your query performance.
Ensure that you just have the minimum of data. Make sure you select only the columns you need. Reduce field sizes to a minimum.
Consider de-normalising your database to reduce joins
Avoid loops (i.e. fetch cursors), stick to set operations.
Implement the query as a stored procedure as this is pre-compiled and will execute faster.
Make sure that you have the correct indexes set up. If your database is used mostly for searching then consider more indexes.
Use the execution plan to see how the processing is done. What you want to avoid is a table scan as this is costly.
Make sure that the Auto Statistics is set to on. SQL needs this to help decide the optimal execution. See Mike Gunderloy's great post for more info. Basics of Statistics in SQL Server 2005
Make sure your indexes are not fragmented. Reducing SQL Server Index Fragmentation
Make sure your tables are not fragmented. How to Detect Table Fragmentation in SQL Server 2000 and 2005
Use a WITH statement to handle query filtering.
Limit each subquery to the minimum number of rows possible,
then join the subqueries.
WITH
master AS
(
SELECT SSN, FIRST_NAME, LAST_NAME
FROM MASTER_SSN
WHERE STATE = 'PA' AND
GENDER = 'M'
),
taxReturns AS
(
SELECT SSN, RETURN_ID, GROSS_PAY
FROM MASTER_RETURNS
WHERE YEAR < 2003 AND
YEAR > 2000
)
SELECT *
FROM master,
taxReturns
WHERE master.ssn = taxReturns.ssn
Subqueries within a WITH statement may end up being the same as inline views,
or automatically generated temp tables. I find in the work I do, retail data, that about 70-80% of the time there is a performance benefit.
100% of the time, there is a maintenance benefit.
I think using SQL query analyzer would be a good start.
In Oracle you can look at the explain plan to compare variations on your query
Make sure that you have the right indexes on the table. if you frequently use a column as a way to order or limit your dataset an index can make a big difference. I saw in a recent article that select distinct can really slow down a query, especially if you have no index.
The obvious optimization for SELECT queries is ensuring you have indexes on columns used for joins or in WHERE clauses.
Since adding indexes can slow down data writes you do need to monitor performance to ensure you don't kill the DB's write performance, but that's where using a good query analysis tool can help you balance things accordingly.
Indexes
Statistics
On the Microsoft stack, Database Engine Tuning Advisor
Some other points (mine are based on SQL Server; since each db backend has its own implementations they may or may not hold true for all databases):
Avoid correlated subqueries in the select part of a statement, they are essentially cursors.
Design your tables to use the correct datatypes to avoid having to apply functions on them to get the data out. It is far harder to do date math when you store your data as varchar for instance.
If you find that you are frequently doing joins that have functions in them, then you need to think about redesigning your tables.
If your WHERE or JOIN conditions include OR conditions (which are slower) you may get better speed using a UNION statement.
UNION ALL is faster than UNION because it skips duplicate removal; use it if (and only if) the two statements are mutually exclusive, so that both return the same results.
NOT EXISTS is usually faster than NOT IN or using a left join with a WHERE clause of ID IS NULL
In an UPDATE query add a WHERE condition to make sure you are not updating values that are already equal. The difference between updating 10,000,000 records and 4 can be quite significant!
Consider pre-calculating some values if you will be querying them frequently or for large reports. A sum of the values in an order only needs to be done when the order is made or adjusted, rather than when you are summarizing the results of 10,000,000 orders in a report. Pre-calculations should be done in triggers so that they are always up-to-date if the underlying data changes. And it doesn't have to be just numbers either; we have a calculated field that concatenates names that we use in reports.
Be wary of scalar UDFs, they can be slower than putting the code in line.
Temp tables tend to be faster for large data sets and table variables faster for small ones. In addition you can index temp tables.
Formatting is usually faster in the user interface than in SQL.
Do not return more data than you actually need.
This one seems obvious but you would not believe how often I end up fixing this. Do not join to tables that you are not using to filter the records or actually calling one of the fields in the select part of the statement. Unnecessary joins can be very expensive.
It is a very bad idea to create views that call other views that call other views. You may find you are joining to the same table 6 times when you only need to once and creating 100,000,00 records in an underlying view in order to get the 6 that are in your final result.
In designing a database, think about reporting not just the user interface to enter data. Data is useless if it is not used, so think about how it will be used after it is in the database and how that data will be maintained or audited. That will often change the design. (This is one reason why it is a poor idea to let an ORM design your tables, it is only thinking about one use case for the data.) The most complex queries affecting the most data are in reporting, so designing changes to help reporting can speed up queries (and simplify them) considerably.
Database-specific implementations of features can be faster than using standard SQL (That's one of the ways they sell their product), so get to know your database features and find out which are faster.
And because it can't be said too often, use indexes correctly, not too many or too few. And make your WHERE clauses sargable (Able to use indexes).
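To see what "sargable" means in practice, compare the plan for a predicate on a bare indexed column with the plan for the same column wrapped in a function. The Python sketch below uses SQLite's `EXPLAIN QUERY PLAN` through the sqlite3 module as a stand-in for SQL Server's execution plans; the `people` table, its index and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT);
    CREATE INDEX idx_people_last_name ON people (last_name);
    INSERT INTO people (last_name, first_name)
    VALUES ('Smith', 'A'), ('Jones', 'B'), ('smith', 'C');
""")

def plan(sql):
    # Join the step descriptions (last column of each plan row) into one string.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Bare column compared to a constant: the index on last_name can be used.
sargable = plan("SELECT first_name FROM people WHERE last_name = 'Smith'")

# Column wrapped in a function: the plain index cannot be searched on this predicate.
non_sargable = plan("SELECT first_name FROM people WHERE upper(last_name) = 'SMITH'")

print(sargable)
print(non_sargable)
```

The same habit applies to joins: a function applied to a join column tends to push the engine toward scanning, which is the pattern to look for in the plan.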