Are LEFT JOIN subquery table arguments evaluated more than once? - sql

I have a query that looks like this:
SELECT *
FROM employees e
LEFT JOIN
(
SELECT *
FROM timereports
WHERE date = '2009-05-04'
) t
ON e.id = t.employee_id
As you can see, the second table argument of my LEFT JOIN is generated by a subquery.
Does the db evaluate this subquery only once, or multiple times?
thanks.
matti

This depends on the RDBMS.
In most of them, a HASH OUTER JOIN will be employed, in which case the subquery will be evaluated once.
MySQL, on the other hand, historically could not perform hash joins (they were only added in MySQL 8.0.18), which is why it will most probably push the predicate into the subquery and issue this query:
SELECT *
FROM timereports t
WHERE t.employee_id = e.id
AND date = '2009-05-04'
in a nested loop. If you have an index on timereports (employee_id, date), this will also be efficient.

If you are using SQL Server, you can take a look at the execution plan of the query. The SQL Server query optimizer will try to optimize the query so that it takes the least time to execute. Which plan is fastest depends on factors such as the available indexes.

You have to ask the database to show the plan. The algorithm for doing this is chosen dynamically (at query time) based on many factors. Some databases use statistics of the key distribution to decide which algorithm to use. Other databases have relatively fixed rules.
Further, each database has a menu of different algorithms. The database could use a sort-merge algorithm, or nested loops. In this case, there may be a query flattening strategy.
You need to use your database's unique "Explain Plan" feature to look at the query execution plan.
You also need to know if your database uses hints (usually comments embedded in the SQL) to pick an algorithm.
You also need to know if your database uses statistics (sometimes called a "cost-based query optimizer") to pick an algorithm.
Once you know all that, you'll know how your query is executed and whether an inner query is evaluated multiple times, flattened into the parent query, or evaluated once to create a temporary result that's used by the parent query.
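As a rough illustration of asking a database for its plan, here is a minimal sketch using SQLite through Python's sqlite3 module. The schema and index name are made up to mirror the question, and EXPLAIN QUERY PLAN is SQLite's own syntax; other databases use different commands and will very likely pick a different plan.

```python
import sqlite3

# Hypothetical minimal schema mirroring the question's tables (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE timereports (
        id INTEGER PRIMARY KEY,
        employee_id INTEGER,
        date TEXT,
        hours REAL
    );
    CREATE INDEX ix_timereports_emp_date ON timereports (employee_id, date);
""")

query = """
    SELECT *
    FROM employees e
    LEFT JOIN (
        SELECT * FROM timereports WHERE date = '2009-05-04'
    ) t ON e.id = t.employee_id
"""

# EXPLAIN QUERY PLAN is SQLite-specific; the last column of each row
# is a human-readable description of one plan step.
plan_steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan_steps:
    print(step)
```

The printed steps show whether the subquery was flattened into the outer query or handled separately; the exact wording varies between SQLite versions.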

What do you mean by evaluated?
The database has a couple of different options how to perform a join, the two most common ones being
Nested loops, in which case each row in one table will be looped through and the corresponding rows in the other table will be looked up, and
Hash join, which means that one table is scanned once to build a hash table on the join key, and the other table is then scanned once to probe it for matches.
Which of those two options is chosen depends on the database, the size of the table and the available indexes (and perhaps other things as well).
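To make the "once or multiple times" difference concrete, here is a toy sketch in plain Python rather than a real engine. The data, the `subquery()` helper and the scan counter are all invented for illustration; a real nested loop would normally use an index lookup instead of re-running the whole subquery.

```python
from collections import defaultdict

# Toy data standing in for the question's tables (hypothetical values).
employees = [{"id": 1}, {"id": 2}, {"id": 3}]
timereports = [
    {"employee_id": 1, "date": "2009-05-04"},
    {"employee_id": 1, "date": "2009-05-05"},
    {"employee_id": 3, "date": "2009-05-04"},
]

scans = {"count": 0}

def subquery():
    """Stand-in for the derived table; counts how often it is evaluated."""
    scans["count"] += 1
    return [t for t in timereports if t["date"] == "2009-05-04"]

def nested_loop_left_join(outer):
    result = []
    for e in outer:
        # Naive nested loop: the inner side is evaluated per outer row.
        matches = [t for t in subquery() if t["employee_id"] == e["id"]]
        if matches:
            result.extend((e, t) for t in matches)
        else:
            result.append((e, None))  # LEFT JOIN keeps unmatched outer rows
    return result

def hash_left_join(outer):
    table = defaultdict(list)
    for t in subquery():  # build phase: the inner side is evaluated once
        table[t["employee_id"]].append(t)
    result = []
    for e in outer:  # probe phase
        matches = table.get(e["id"])
        if matches:
            result.extend((e, t) for t in matches)
        else:
            result.append((e, None))
    return result

scans["count"] = 0
nl_rows = nested_loop_left_join(employees)
nl_scans = scans["count"]

scans["count"] = 0
hash_rows = hash_left_join(employees)
hash_scans = scans["count"]

print(nl_scans, hash_scans)
```

Both functions return the same rows; only the number of times the inner side is evaluated differs.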

Related

Does my previous SQL query/ies affect my current query?

I have multiple SQL queries that I run one after the other to get a set of data. In each query, there are a bunch of tables joined that are exactly the same with the other queries. For example:
Query1
SELECT * FROM
Product1TableA A1
INNER JOIN Product1TableB B on A1.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
Query2
SELECT * FROM Product2TableA A2
INNER JOIN Product2TableB B on A2.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
I am playing around re-ordering the joins (around 2 dozen tables joined per query) and I read here that they should not really affect query execution unless SQL "gives up" during optimization because of how big the query is...
What I am wondering is if bunching up common table joins at the start of all my queries actually helps...
In theory, the order of the joins in the from clause doesn't make a difference on query performance. For a small number of tables, there should be no difference. The optimizer should find the best execution path.
For a larger number of tables, the optimizer may have to short-circuit its search regarding join order. It would then be using heuristics -- and these could be affected by join order.
Earlier queries would have no effect on a particular execution plan.
If you are having problems with performance, I am guessing that join order is not the root cause. The most common problem that I have in SQL Server is inappropriate nested-loop joins -- and these can be handled with an optimizer hint.
I think I understood what he was trying to say/to do:
What I am wondering is if bunching up common table joins at the start
of all my queries actually helps...
Imagine that you have some queries and every query has more than 3 inner joins. The queries are different but always have (for example) 3 tables in common that are joined on the same fields. Now the question is:
what will happen if every query will start with these 3 tables in join, and all the other tables are joined after?
The answer is that it will change nothing, i.e. the optimizer will rearrange the tables in whatever order it thinks will lead to optimal execution.
Things may change if, for example, you save the result of these 3 joins into a temporary table and then use this saved result to join with other tables. But this depends on the filters that your queries use. If you have appropriate indexes and your query filters are selective enough (so that your query returns very few rows), there is no need to cache an unfiltered intermediate result with too many rows, because the optimizer can choose to filter every table first and only then join them.
Gordon's answer is a good explanation, but this answer explains the JOIN's behavior and also specifies that SQL Server's version is relevant:
Although the join order is changed in optimisation, the optimiser
doesn't try all possible join orders. It stops when it finds what it
considers a workable solution as the very act of optimisation uses
precious resources.
While the optimizer tries its best in choosing a good order for the JOINs, having many JOINs creates a bigger chance of obtaining a not so good plan.
Personally, I have seen many JOINs in some views within an ERP and they usually ran ok. However, from time to time (based on client's data volume, instance configuration etc.), some selects from these views took much more than expected.
If this data reaches an actual application (.NET, Java etc.), one option is to cache the contents of all the small tables, store them as dictionaries (hashes) and perform O(1) lookups based on the keys.
This provides the advantages of reducing the JOIN count and not performing reads from the database for these tables (except once when caching data). However, this increases the complexity of the application (cache management).
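A minimal sketch of that caching idea in Python, with made-up table contents and column names (none of them come from the question). The small lookup tables are loaded once into dictionaries, and rows from the large table are enriched in application code instead of being joined in SQL.

```python
# Hypothetical small lookup tables, loaded once (e.g. at application start).
categories = {1: "Hardware", 2: "Software"}  # category id -> name
units = {10: "pcs", 11: "kg"}                # unit id -> symbol

# Rows from the large table, fetched without joining the lookup tables.
order_lines = [
    {"id": 100, "category_id": 1, "unit_id": 10},
    {"id": 101, "category_id": 2, "unit_id": 11},
    {"id": 102, "category_id": 3, "unit_id": 10},  # unknown category
]

def enrich(row):
    """Attach lookup values with O(1) dictionary lookups (None if missing)."""
    return {
        **row,
        "category": categories.get(row["category_id"]),
        "unit": units.get(row["unit_id"]),
    }

enriched = [enrich(row) for row in order_lines]
for row in enriched:
    print(row)
```

Using `.get()` makes a missing key behave like an unmatched LEFT JOIN row; cache invalidation is left out entirely, and that is where the extra complexity comes from.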
Another solution is to use temporary tables and populate them in multiple queries to avoid many JOINs in a single query. This solution usually performs better and is also easier to debug (if the query does not return the correct data, or no data at all, which of the 10-15 JOINs is the problem?).
So, my answer to your question is: you might get some benefit from reordering the JOIN clauses, but I recommend avoiding lots of JOINs in the first place.
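For what the temporary-table approach might look like, here is a small sketch using SQLite via Python's sqlite3. All table and column names are invented, and SQL Server would use its own syntax (e.g. `#temp` tables), so treat it as an outline only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: two shared lookup tables and two product tables.
    CREATE TABLE common1 (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE common2 (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE product_b (id INTEGER PRIMARY KEY, c_id INTEGER, d_id INTEGER);
    CREATE TABLE product_a (id INTEGER PRIMARY KEY, b_id INTEGER, name TEXT);

    INSERT INTO common1 VALUES (1, 'retail'), (2, 'wholesale');
    INSERT INTO common2 VALUES (1, 'EU'), (2, 'US');
    INSERT INTO product_b VALUES (10, 1, 2), (11, 2, 1);
    INSERT INTO product_a VALUES (100, 10, 'widget'), (101, 11, 'gadget');

    -- Step 1: materialize the shared joins once.
    CREATE TEMP TABLE b_enriched AS
    SELECT b.id AS b_id, c.label, d.region
    FROM product_b b
    JOIN common1 c ON c.id = b.c_id
    JOIN common2 d ON d.id = b.d_id;
""")

# Step 2: the final query needs a single join, and the staged
# result can be inspected on its own when something looks wrong.
rows = conn.execute("""
    SELECT a.name, s.label, s.region
    FROM product_a a
    JOIN b_enriched s ON s.b_id = a.b_id
    ORDER BY a.id
""").fetchall()
print(rows)
```

Whether this is faster than one big query depends on the data and filters; the clearer win is usually that each stage can be checked separately.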

If you join two tables in the SELECT statement, all indexes on the table columns can no longer be used?

Let's say we have:
SELECT *
FROM Pictures
JOIN Categories ON Categories.CategoryId = Pictures.CategoryId
WHERE Pictures.UserId = @UserId
ORDER BY Pictures.UploadDate DESC
In this case, the database first joins the two tables and then works on the derived table, which I think would mean the indexes on the individual tables would be of no use, unless you can come up with an index that is bound to some column in the derived table?
You have a fundamental misunderstanding of how SQL works. The SQL language specifies what result set should be returned. It says nothing about how the database should achieve those results.
It is up to the database engine to parse the statement and come up with an execution plan (hopefully an efficient one) that will produce the correct results. Many modern relational databases have sophisticated query optimizers that completely pull apart the statement and derive execution plans that seem to have no relationship with the original query. (At least not to the untrained eye)
The execution plan for the same query can even change over time if the engine uses a cost based optimizer. A cost based optimizer makes decisions based on statistics that have been gathered about data and indexes. As the statistics change, the execution plan can also change.
With your simple query you assume that the database has to join the tables and create a temporary result set before it applies the where clause. That might be how you think about the problem, but the database is free to implement it entirely differently. I doubt there are many (if any) databases that would create a temporary result set for your simple query.
This is not to say that you cannot ever predict when an index may or may not be used. But it takes practice and experience to get a feel for how a database might execute a query.
This will join the tables, giving you all the category information if a picture's CategoryId is found in the CategoryId field of the Categories table (and no result for a particular picture if there is no such category).
This query will likely return several rows of data. The indexes of either table will be useful no matter which table you would like to access.
Normally your program would loop through the result set.
CategoryId will give you the row in Categories with all the relevant fields in that Category and 'Picture.Id' (assuming there is such a field) will give you a reference to that exact picture row in the database.
You can then manipulate either table by using the relevant index
"UPDATE Categories SET .... WHERE CategoryId = " +
"UPDATE Pictures ..... WHERE PictureId =" +
or some such depending on your programming environment.
Whether an index is used is up to the optimizer, which depends on what is occurring in the query. For the query posted, there's nothing obvious to stop an index from being used. However, not all databases operate the same -- MySQL has traditionally used only one index per table in a SELECT (check the query plan, because the optimizer might interpret the JOIN so that another index may be used).
The thing most likely to prevent an index from being used is any function or operation applied to the indexed column, e.g. getting the month out of a date, or wildcarding the left side of a LIKE pattern...
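You can check this kind of thing yourself by comparing plans. Below is a small sketch with SQLite (via Python's sqlite3); the table, index name and the `plan()` helper are made up for the demo, and other databases have their own rules and plan output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table loosely based on the question's Pictures table.
    CREATE TABLE pictures (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        upload_date TEXT,
        title TEXT
    );
    CREATE INDEX ix_pictures_upload_date ON pictures (upload_date);
""")

def plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Plain comparison on the indexed column: the index can be used.
plain = plan("SELECT * FROM pictures WHERE upload_date = '2009-05-04'")

# Function applied to the column: the index on the raw column is bypassed.
wrapped = plan(
    "SELECT * FROM pictures WHERE strftime('%m', upload_date) = '05'"
)

# Leading wildcard: no index range can be derived from the pattern.
wildcard = plan("SELECT * FROM pictures WHERE upload_date LIKE '%-05-04'")

for label, text in [("plain", plain), ("wrapped", wrapped), ("wildcard", wildcard)]:
    print(label, "->", text)
```

Only the first plan should mention the index; the other two fall back to scanning the table.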

Sql Server Join query

I have two tables: one is small and the other is large. When joining the two, which table should I put on the left and which on the right so that the query optimiser runs the query faster, or does it not matter where each table goes in the join?
for example :
--1
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON smalltable.column1 = largetable.column1 ;
--2
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON largetable.column1 = smalltable.column1 ;
Which query will be faster, or does it not matter?
If you're talking about Microsoft SQL Server, both queries are equivalent to the query optimizer. In fact, to almost any cost-based query optimizer they'll be equivalent. You can try it by looking at the execution plan (see here for details: http://www.simple-talk.com/sql/performance/execution-plan-basics/).
The query optimizer in most decent SQL database servers will solve that. Some primitive ones don't (have a query optimizer; older MySQL and Access come to mind). Some may get overloaded by complex decisions (this one is simple).
But in general - trust the query optimizer first.
It should not matter which order you use, as your SQL Server should optimise the query execution for you. However, (if you are using Microsoft SQL Server) you could use SQL Server Profiler (found under the Tools menu of SQL Server Management Studio) to check the execution plans of both options.
If one of the tables is smaller than the other, place the smaller table first and then the larger table, as it will have less work to do; moreover, this will help the query optimizer to choose a plan that uses a hash join.
Then run the query profiler and check that the hash join is used, because this is the best and fastest in this scenario.
If there are no indexes on the joined tables, the optimizer will likely select a hash join.
You can force a hash join by adding the query hint OPTION (HASH JOIN) at the end of the statement.
From MSDN, http://blogs.msdn.com/b/irenak/archive/2006/03/24/559855.aspx
The column name that joins the table is called a hash key. In the example above, it’ll be au_id. SQL Server examines the two tables being joined, chooses the smaller table (so called build input), and builds a hash table applying a hash algorithm to the values of a hash key. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If build input is done completely in-memory, the hash join is called an “in-memory hash join”. If SQL Server doesn’t have enough memory to hold the entire build input, the process will be done in chunks, and is called “grace hash join”.
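The build/probe idea from that quote, and the "done in chunks" variant, can be sketched in a few lines of Python. This is a toy under stated assumptions: the data is invented (only the `au_id` key comes from the quote), partitions live in memory instead of on disk, and SQL Server's actual implementation is far more involved.

```python
from collections import defaultdict

def in_memory_hash_join(build, probe, key):
    """Build a hash table on the build input, then probe it row by row."""
    buckets = defaultdict(list)
    for row in build:
        buckets[row[key]].append(row)
    return [(b, p) for p in probe for b in buckets.get(p[key], [])]

def grace_hash_join(left, right, key, partitions=4):
    """Partition both inputs by hash of the key, then join partition pairs."""
    # The smaller input becomes the build input, as in the quote.
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    build_parts = defaultdict(list)
    probe_parts = defaultdict(list)
    for row in build:
        build_parts[hash(row[key]) % partitions].append(row)
    for row in probe:
        probe_parts[hash(row[key]) % partitions].append(row)
    result = []
    # Matching keys always land in the same partition number, so each
    # pair of partitions can be joined independently of the others.
    for part in range(partitions):
        result.extend(
            in_memory_hash_join(build_parts[part], probe_parts[part], key)
        )
    return result

# Hypothetical rows; au_id is the hash key.
authors = [{"au_id": "A1", "name": "Ann"}, {"au_id": "A2", "name": "Bob"}]
titleauthor = [
    {"au_id": "A1", "title": "T1"},
    {"au_id": "A1", "title": "T2"},
    {"au_id": "A2", "title": "T3"},
    {"au_id": "A9", "title": "T4"},
]

joined = grace_hash_join(authors, titleauthor, "au_id")
pairs = sorted((b["name"], p["title"]) for b, p in joined)
print(pairs)
```

The partitioned version returns the same pairs as the plain in-memory join; it only changes how much has to be held in memory at once.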
Before running both queries, select 'Include Actual Execution Plan' from the menu and then run the queries. SQL Server will show the execution plan, which is the best tool for writing optimized queries. See more about Execution Plan here.
The order of the join columns does matter. See this post for more detail. Also there has been no discussion of indexing in this thread. It is the combination of optimal join table order AND useful indexing that results in the fastest executing queries.

JOIN or Correlated subquery with exists clause, which one is better

select *
from ContactInformation c
where exists (select * from Department d where d.Id = c.DepartmentId )
select *
from ContactInformation c
inner join Department d on c.DepartmentId = d.Id
Both queries give the same output. Which performs better: the join or the correlated subquery with an EXISTS clause?
Edit: is there an alternative to joins that would increase performance?
In the above 2 queries I want info from both the Department and ContactInformation tables.
Generally, the EXISTS clause because you may need DISTINCT for a JOIN for it to give the expected output. For example, if you have multiple Department rows for a ContactInformation row.
In your example above, the SELECT *:
means different output too so they are not actually equivalent
less chance of an index being used because you are pulling all columns out
Saying that, even with a limited column list, they will give the same plan: until you need DISTINCT... which is why I say "EXISTS"
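The DISTINCT point is easy to reproduce. Here is a small sketch with SQLite via Python's sqlite3, using invented rows and a Department table that deliberately has no unique key, so that one contact matches two department rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ContactInformation (
        Id INTEGER PRIMARY KEY, Name TEXT, DepartmentId INTEGER
    );
    -- Deliberately no unique key, so duplicates are possible (assumption).
    CREATE TABLE Department (Id INTEGER, Name TEXT);

    INSERT INTO ContactInformation VALUES
        (1, 'Ann', 10), (2, 'Bob', 20), (3, 'Cy', 99);
    INSERT INTO Department VALUES
        (10, 'Sales'), (10, 'Sales (duplicate)'), (20, 'IT');
""")

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

exists_ids = ids("""
    SELECT c.Id FROM ContactInformation c
    WHERE EXISTS (SELECT 1 FROM Department d WHERE d.Id = c.DepartmentId)
    ORDER BY c.Id
""")
join_ids = ids("""
    SELECT c.Id FROM ContactInformation c
    INNER JOIN Department d ON c.DepartmentId = d.Id
    ORDER BY c.Id
""")
distinct_join_ids = ids("""
    SELECT DISTINCT c.Id FROM ContactInformation c
    INNER JOIN Department d ON c.DepartmentId = d.Id
    ORDER BY c.Id
""")

print(exists_ids, join_ids, distinct_join_ids)
```

EXISTS returns each contact at most once, while the plain JOIN repeats contact 1 for every matching department row until DISTINCT is added.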
You need to measure and compare - there's no golden rule which one will be better - it depends on too many variables and things in your system.
In SQL Server Management Studio, you could put both queries in a window, choose Include actual execution plan from the Query menu, and then run them together.
You should get a comparison of both their execution plans and a percentage of how much of the time was spent on one or the other query. Most likely, both will be close to 50% in this case. If not - then you know which of the two queries performs better.
You can learn more about SQL Server execution plans (and even download a free e-book) from Simple-Talk - highly recommended.
I assume that you meant to add the DISTINCT keyword to the SELECT clause in your second query (or, less likely, that a Department has only one Contact).
First, always start with 'logical' considerations. The EXISTS construct is arguably more intuitive so, all things 'physical' being equal, I'd go with that.
Second, there will be a day when you need to port this code, not necessarily to a different SQL product but, say, to the same product with a different optimizer. A decent optimizer should recognise that both are equivalent and come up with the same ideal plan. Consider that, in theory, the EXISTS construct has slightly more potential to short circuit.
Third, test it using a reasonably large data set. If performance isn't acceptable, start looking at the 'physical' considerations (but I suggest you always keep your 'logically-pure' code in comments for the forthcoming day when the perfect optimizer arrives :)
Your second query (the JOIN) will output Department columns as well, while the first one (EXISTS) will not.
If you're only interested in ContactInformation, these queries are equivalent. You could run them both and examine the query execution plan to see which one runs faster. For example, on MySQL, WHERE EXISTS is more efficient with nullable columns, while INNER JOIN performs better if neither column is nullable.

What are the steps followed by sql engine to execute the query..??

My question is not how to use inner join in sql. I know about how it matches between table a and table b.
I'd like to ask how an inner join works internally. What algorithm does it involve? What happens internally when joining multiple tables?
There are different algorithms, depending on the DB server, indexes and data order (clustered PK), whether calculated values are joined or not etc.
Have a look at a query plan, which most SQL systems can create for a query, it should give you an idea what it does.
In MS SQL Server, different join algorithms will be used in different situations depending on the tables (their size, what sort of indexes are available, etc). I imagine other DB engines also use a variety of algorithms.
The main types of join used by MS SQL Server are:
- Nested loops joins
- Merge joins
- Hash joins
You can read more about them on this page: Msdn -Advanced Query Tuning Concepts
If you get SQL to display the 'execution plan' for your queries you will be able to see what type of join is being used in different situations.
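Of the three join types listed above, the merge join is probably the least obvious, so here is a rough Python sketch of the idea. It assumes both inputs are already sorted on the join key (in a real engine that usually means an index or an explicit sort step); the function name and data are made up.

```python
def merge_join(left, right, key):
    """Inner-join two lists of dicts that are already sorted by `key`."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1  # left side is behind, advance it
        elif left_key > right_key:
            j += 1  # right side is behind, advance it
        else:
            # Find the run of rows with this key on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][key] == left_key:
                result.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return result

# Hypothetical inputs, both sorted by "k".
left_rows = [
    {"k": 1, "l": "a"}, {"k": 2, "l": "b"},
    {"k": 2, "l": "c"}, {"k": 4, "l": "d"},
]
right_rows = [
    {"k": 2, "r": "x"}, {"k": 2, "r": "y"},
    {"k": 3, "r": "z"}, {"k": 4, "r": "w"},
]

merged = [(l["l"], r["r"]) for l, r in merge_join(left_rows, right_rows, "k")]
print(merged)
```

Each input is walked once from start to finish, which is why this join tends to be chosen when both sides are already available in key order.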
It depends on what database you're using, what you're joining (large/small, in sequence/random, indexed/non-indexed etc).
For example, SQL Server has several different join algorithms; loop joins, merge joins, hash joins. Which one is used is determined by the optimizer when it is working out an execution plan. Sometimes it makes a misjudgement and you can then force a specific join algorithm by using join hints.
You may find the following MSDN pages interesting:
http://msdn.microsoft.com/en-us/library/ms191318.aspx (loop)
http://msdn.microsoft.com/en-us/library/ms189313.aspx (hash)
http://msdn.microsoft.com/en-us/library/ms190967.aspx (merge)
http://msdn.microsoft.com/en-us/library/ms173815.aspx (hints)
In this case you should look at how data is stored in a B-tree; after that, I think you will understand the JOIN algorithm.
It's all based on set theory, which has been around for a while.
Try not to join too many tables at any one time; it seems to exhaust database resources with all the scanning. Indexes help with performance; look at some SQL sites and search for optimising SQL queries to get some insight. SQL Server Management Studio has a built-in execution plan utility that's often interesting, especially for large, complex queries.
The optimizer will (or should) choose the fastest join algo.
However, there are two different ways of defining what is fast:
You measure the time that it takes to return all the joined rows.
You measure the time that it takes to return the first joined rows.
If you want to return all the rows as fast as possible, the optimizer will often choose a hash join or a merge join. If you want to return the first few rows as fast as possible, the optimizer will often choose a nested loops join.
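The first-row versus all-rows difference can be illustrated with Python generators. This is only a toy: the `counted()` wrapper, the data and the row counts are invented, and a real nested loops join would typically seek into an index rather than scan the inner input.

```python
def counted(rows, counter):
    """Yield rows while counting how many have been read."""
    for row in rows:
        counter["read"] += 1
        yield row

def nested_loops(outer, inner, key, counter):
    # Streams: a match is emitted as soon as it is found.
    for o in counted(outer, counter):
        for i in counted(inner, counter):
            if o[key] == i[key]:
                yield (o, i)

def hash_join(build, probe, key, counter):
    # Blocks: the whole build input is read before any output.
    table = {}
    for b in counted(build, counter):
        table.setdefault(b[key], []).append(b)
    for p in counted(probe, counter):
        for b in table.get(p[key], []):
            yield (b, p)

# Two hypothetical 1000-row inputs with matching keys.
a = [{"k": n} for n in range(1000)]
b = [{"k": n} for n in range(1000)]

nl_counter = {"read": 0}
first_nl = next(nested_loops(a, b, "k", nl_counter))

hash_counter = {"read": 0}
first_hash = next(hash_join(a, b, "k", hash_counter))

print(nl_counter["read"], hash_counter["read"])
```

The nested loops version produces its first row after touching a couple of rows, while the hash join has to read the entire build input first; for the full result the picture reverses, since the naive nested loop rescans the inner input for every outer row.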
Conceptually, it creates a Cartesian product of the two tables and then selects the matching rows out of it (real engines avoid materializing the full product). Read Korth's book on databases for more on this.