Why doesn't the planner execute joins participating in the WHERE clause first? - sql

I'm experimenting with PostgreSQL (v9.3). I have a fairly large database, and I often need to execute queries with 8-10 joined tables (as the source of large data grids). I'm using DevExpress XPO as the ORM on top of PostgreSQL, so unfortunately I have no control over how the joins are generated.
The following example is fairly simplified; the real scenario is more complex, but as far as I can tell the main problem can be seen here too.
Consider the following variants of the (semantically) same query:
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN orderdetails od ON o.details = od.oid
LEFT JOIN customers c ON o.customer = c.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN customers c ON o.customer = c.oid
LEFT JOIN orderdetails od ON o.details = od.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
The orders table contains about 1 million rows and customers about 30 thousand. The orderdetails table contains the same number of rows as orders due to a one-to-one relation.
UPDATE:
It seems the example is too simplified to reproduce the issue: I checked again, and in this case the two execution plans are identical. However, in my real query, where there are many more joins, the problem does occur: if I put customers as the first join, the execution is 100x faster. I'll add my real query, although the Hungarian identifiers and the fact that it was generated by XPO and Npgsql make it less readable.
The first query is significantly slower (about 100x) than the second, and when I output the plans with EXPLAIN ANALYZE I can see that the order of the joins reflects their position in the query string. So first the two "giant" tables are joined together, and only afterwards is the filtered customers table joined (where the filter selects only one row).
The second query is fast because the join starts with that single customer row, and then joins the 20-30 matching order detail rows to it.
Unfortunately, in my case XPO generates the first version, so I'm suffering from poor performance.
Why does the PostgreSQL query planner not notice that the join on customers has a condition in the WHERE clause? In my opinion, the correct optimization would be to execute the joins that have some kind of filter first, and only then the joins that participate only in the selection.
Any kind of help or advice is appreciated.

Join order only matters if your query's joins are not collapsed. Collapsing is done internally by the query planner, but you can influence the process with the join_collapse_limit runtime option.
Note, however, that the query planner will not always find the best join order by default:
Constraining the planner's search in this way is a useful technique both for reducing planning time and for directing the planner to a good query plan. If the planner chooses a bad join order by default, you can force it to choose a better order via JOIN syntax — assuming that you know of a better order, that is. Experimentation is recommended.
For the best performance, I recommend using some kind of native querying, if available. Raising join_collapse_limit can be a good-enough solution, though, provided you ensure it doesn't cause other problems.
It's also worth mentioning that raising join_collapse_limit will most likely increase planning time.
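For example (a sketch; the value 12 is illustrative, and the tables are the ones from the question):

-- Allow the planner to reorder more explicitly joined tables than the
-- default of 8, which matters for the 8-10 table queries described above:
SET join_collapse_limit = 12;

-- Alternatively, setting it to 1 disables reordering entirely, so the
-- joins run in the order written (customers first in this version):
SET join_collapse_limit = 1;
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN customers c ON o.customer = c.oid
LEFT JOIN orderdetails od ON o.details = od.oid
WHERE c.code = 32435 AND o.date > '2012-01-01';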

Related

Does SQL query structure have an impact on its performance?

My question is theoretical. Take the following query (option A):
Select *
from orders o
inner join (
    select *
    from orderDetails
    where product = 'shirt'
) od on o.orderId = od.orderId
Versus the following (option B)
Select *
from orders o
inner join orderDetails od on o.orderId = od.orderId
where od.product = 'shirt'
Is there any technical advantage of one over the other? For example, I get the impression that option A is less resource-demanding on the DB, since the inner join occurs on an already narrowed-down set of rows, whereas option B gives the same result but seems to perform the inner join on all the orderIds available before narrowing it down to the shirts.
I'm curious about the end impact because, at times, stored procedures get quite large, and I'd like to ensure that they don't affect report loading time needlessly.
The two queries should evaluate to exactly the same execution plan in SQL Server -- same performance. This is regardless of indexes.
Why? SQL is a descriptive language. A SELECT query describes the result set. It does not specify how the result set is created. In most databases, the work of figuring out what to do is handled by the SQL compiler and optimizer, which produce a directed acyclic graph (DAG) of operations (some databases also do run-time optimizations). To a newcomer, the operations in the DAG look nothing like the original SELECT.
Not all databases have optimizers as smart as SQL Server's. For instance, there is a difference in MySQL -- particularly in older versions. MySQL has a tendency to materialize subqueries, which usually adversely affects performance. However, that is due to a poor optimization strategy rather than to SQL in general.
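If you want to verify this in SQL Server, you can capture the estimated plan text for both forms and compare them (a minimal sketch using the tables from the question):

SET SHOWPLAN_TEXT ON;
GO
-- Option A: filter inside a derived table
SELECT *
FROM orders o
INNER JOIN (SELECT * FROM orderDetails WHERE product = 'shirt') od
ON o.orderId = od.orderId;
GO
-- Option B: filter in the outer WHERE clause
SELECT *
FROM orders o
INNER JOIN orderDetails od ON o.orderId = od.orderId
WHERE od.product = 'shirt';
GO
SET SHOWPLAN_TEXT OFF;
GO
-- Both batches should print the same plan tree.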

SQL Sub Query within a join

Is there a difference between the results of the two sets of code below?
If there isn't, I don't understand why my teachers keep teaching subqueries. When would they be useful in basic SQL commands?
Select soh.Total, c.*
From SalesLT.Customer As c
Inner join (select oh.CustomerID, Sum(oh.TotalDue) As Total
From SalesLT.SalesOrderHeader As oh Group by oh.CustomerID
Having Sum(oh.TotalDue) > 90000) As soh on c.CustomerID = soh.CustomerID
VS
Select A.*, C.*
From Sales as A inner join Customer as C on A.customerID=C.customerID
Group by A.CustomerID
Having Sum(C.totaldue) > 90000
Is there a difference? Well, obviously. The two are constructed differently.
Do they produce the same result? Obviously not. In fact, the second one will produce an error in almost all databases, because the columns from A are not aggregated.
In addition, the number of columns is likely to differ between the two queries, unless Customer has exactly two columns.
I would suggest that you study SQL a bit harder. If your teachers are suggesting that you need to understand subqueries, then that is probably because they are an important part of the language.
Homework: Write a reasonable second query that doesn't use subqueries.
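(For reference, here is a sketch of one such join-based rewrite; it assumes the SalesLT schema from the first query, and CompanyName is a placeholder for whichever customer columns you actually need:)

Select c.CustomerID, c.CompanyName, Sum(oh.TotalDue) As Total
From SalesLT.Customer As c
Inner join SalesLT.SalesOrderHeader As oh on oh.CustomerID = c.CustomerID
Group by c.CustomerID, c.CompanyName
Having Sum(oh.TotalDue) > 90000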
Subqueries often take more time to return results, whereas inner joins generally give the engine a faster way to fetch results and process the query.
So it is usually better to use inner joins and avoid subqueries where possible, as they can affect execution time. To test this further, enable the execution plan before running each query in the query panel.
This will show you the difference in the results and the time taken to execute.
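For instance, in SSMS (a minimal sketch, reusing the valid first query from the question):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

Select soh.Total, c.*
From SalesLT.Customer As c
Inner join (select oh.CustomerID, Sum(oh.TotalDue) As Total
From SalesLT.SalesOrderHeader As oh Group by oh.CustomerID
Having Sum(oh.TotalDue) > 90000) As soh on c.CustomerID = soh.CustomerID;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
-- The Messages tab then reports elapsed time and logical reads for each run.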

Adding a join condition in the from clause *and* the where clause makes the query faster. Why?

I'm tuning a query for a large transactional financial system. I've noticed that including a join condition in the where clause as well as the from clause makes the query run significantly faster than either of the two individually. I note that the join in the from clause has more than one condition; I mention this in case it is significant. Here's a simplified example:
SELECT *
FROM employee e
INNER JOIN car c ON c.id = e.car_id AND -- some other join
-- Adding the join above again, in the where clause makes the query faster
WHERE c.id = e.car_id;
I thought ANSI vs old-school was purely syntactic. What's going on?
Update
Having analysed the two execution plans, it's clear that adding the same join condition in the where clause as well as the from clause produces a very different execution plan from having it in either one alone.
Comparing the plans, I could see what the plan with the additional where-clause condition was doing better, and wondered why the one without it was joining the way it was. Knowing the optimal plan, a quick tweak to the join conditions resolved matters, although I'm still surprised that the two queries didn't compile into the same thing. Black magic.
It could be that the WHERE c.id = e.car_id addition is a way to control the order in which the tables are used to perform the search.
It may force the query optimizer to use the table in the WHERE condition (and the table related to it) as the driving side, because the written sequence of joins, while useful for understanding the query logic, is not necessarily a good order for searching.
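As an illustration of that idea (the second ON condition below is a hypothetical stand-in for the elided one in the question):

SELECT *
FROM employee e
INNER JOIN car c ON c.id = e.car_id
AND c.purchase_date > e.hire_date -- hypothetical "other" join condition
WHERE c.id = e.car_id; -- logically redundant, but it can steer the optimizer to a different plan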

What are the differences between these?

What are the differences between the two queries?
SELECT CountryMaster.Id
FROM Districts
INNER JOIN CountryMaster ON Districts.CountryId = CountryMaster.Id

SELECT CountryMaster.Id
FROM CountryMaster
INNER JOIN Districts ON Districts.CountryId = CountryMaster.Id
I know the output will be the same, but I want to know whether there are any drastic effects if I neglect the positions of tables and columns in complex queries, or with tables holding tons of data, like hundreds of thousands of rows.
No difference whatsoever. The order of the joins is irrelevant. The query optimizer inside the database engine will decide on a plan for actually processing the records from the two tables, based on the stored statistics for the data in those tables.
In fact, in many cases the query optimizer will generate exactly the same plan for a query phrased using joins as for one phrased with a correlated subquery.
The lesson I have learned is:
Always start with the syntax, or representation, that most clearly expresses the meaning of the process you are trying to create, and trust the query optimizer to do its job. Having said that, the query optimizer is not perfect, so if there is a performance issue, compare the query plans of alternate constructions and see if one improves matters...
One quick comment on the performance of inner vs. outer joins: it is simply not true that inner joins are intrinsically faster than outer joins. The relative performance depends entirely on which of the three join-processing strategies the query engine uses:
1. Nested Loop Join, 2. Merge Join, or 3. Hash Join.
The Nested Loop join, for example, is used when the set of records on one side of the join is very much smaller than on the other side, and the larger set is indexed on the join column[s]. In this case, if the smaller set is the "outer" side, then an outer join will be faster. The reason is that the nested loop join takes the entire set of records from that smaller set, and iterates through each one, finding the records from the larger set that match. An inner join has to perform a second step of removing rows from the smaller side when no matches were found in the larger set. The outer join does not do this second step.
Each of the three possible join processing types has its own characteristic behavior patterns... See Nested Loop Joins, Merge Joins, and Hash Joins for the details.
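For example, in SQL Server you can force each physical join strategy with a query hint and compare the resulting plans (a sketch; the orders and customers tables and columns are hypothetical):

SELECT o.orderId, c.name
FROM orders o
INNER JOIN customers c ON c.customerId = o.customerId
OPTION (LOOP JOIN);

SELECT o.orderId, c.name
FROM orders o
INNER JOIN customers c ON c.customerId = o.customerId
OPTION (MERGE JOIN);

SELECT o.orderId, c.name
FROM orders o
INNER JOIN customers c ON c.customerId = o.customerId
OPTION (HASH JOIN);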
As written they are identical. Excellent answer from Charles.
If you want to know whether they will have different execution plans, simply display the execution plan in SSMS.
As for speed, have the columns used in the join indexed.
Maintain the indexes -- a fragmented index is not nearly as effective.
The query plan will not always be the same.
The query optimizer keeps statistics and as the profile of the data changes the optimal plan may change.
Thousands of rows is not a lot.
Once you get into the millions, then tune indexes and syntax (with hints).
Sometimes you have to get into the millions before you have enough data to tune.
There is also the UNION operator, which is sometimes an equivalent and faster alternative.
The LOOP join hint is not symmetric, so in that case the query plan is different for the following two queries, even though they return the same results.
If one table is the PK side, I always put it first.
In this case the first query is twice as fast as the second.
select top 10 docSVsys.sID, docMVtext.fieldID
from docSVsys
inner loop join docMVtext
on docMVtext.sID = docSVsys.sID
where docSVsys.sID < 100
order by docSVsys.sID, docMVtext.fieldID
select top 10 docSVsys.sID, docMVtext.fieldID
from docMVtext
inner loop join docSVsys
on docMVtext.sID = docSVsys.sID
where docSVsys.sID < 100
order by docSVsys.sID, docMVtext.fieldID
Advanced Query Tuning Concepts

Aggregating two selects with a group by in SQL is really slow

I am currently working with a query in MSSQL that looks like:
SELECT
...
FROM
(SELECT
...
)T1
JOIN
(SELECT
...
)T2
GROUP BY
...
The inner selects are relatively fast, but the outer select, which aggregates the inner selects, takes an incredibly long time to execute, often timing out. Removing the group by makes it run somewhat faster, and changing the join to a LEFT OUTER JOIN speeds things up a bit as well.
Why would doing a group by on a select which aggregates two inner selects cause the query to run so slowly? Why does an INNER JOIN run slower than a LEFT OUTER JOIN? What can I do to troubleshoot this further?
EDIT: What makes this even more perplexing is that the two inner queries are date-limited, and the overall query only runs slowly for date ranges between the start of July and any other day in July; if the date range is anything before July 1 and today, it runs fine.
Without some more detail of your query, it's impossible to offer any hints as to what might speed it up. A possible guess is that the two inner queries prevent the use of any indexes which might have helped perform the join, resulting in large scans, but there are probably many other possible reasons.
To check where the time is used in the query, look at the execution plan; there is a detailed explanation here:
http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx
The basic rundown: run the query, display the execution plan, and then look for any large percentages -- they are what is slowing your query down.
Try rewriting your query without the nested SELECTs, which are rarely necessary. When using nested SELECTs -- except in trivial cases -- the inner SELECT result sets are not indexed, which makes joining them to anything slow.
As Tetraneutron said, post details of your query -- we may help you rewrite it in a straight-through way.
Have you given a join predicate? I.e., JOIN tableA ON tableA.colA = tableB.colB. If you don't give a predicate, then SQL may be forced to use nested loops, so if you have a lot of rows in that range it would explain the slowdown.
Have a look at the plan in SQL Server Management Studio if you have MS SQL Server to play with.
After your T2 statement, add a join condition: ON t1.joinfield = t2.joinfield.
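In other words, something along these lines (a sketch; the tables and join columns are hypothetical stand-ins for the elided subqueries):

SELECT T1.customerId, SUM(T2.amount) AS total
FROM (SELECT customerId FROM orders WHERE orderDate >= '20090701') T1
JOIN (SELECT customerId, amount FROM payments) T2
ON T1.customerId = T2.customerId -- the missing join predicate
GROUP BY T1.customerId;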
The issue was with fragmented data. After the data was defragmented the query started running within reasonable time constraints.
A JOIN with no predicate is a Cartesian product: every row from one table is paired with every row from the other. It is slow because the inner queries run quickly against their separate tables, but once they hit the join without a condition, the result becomes a Cartesian product and is far more expensive to manage. This happens at the outer select statement.
Have a look at INNER JOINs as Tetraneutron recommended.