Does SQL query structure have an impact on its performance?

My question is theoretical. Say the following query (option A):
Select *
from orders o
inner join (
select *
from orderDetails
where product = 'shirt'
) od on o.orderId = od.orderId
Versus the following (option B):
Select *
from orders o
inner join orderDetails od on o.orderId = od.orderId
where od.product = 'shirt'
Is there any technical advantage of one over the other? For example, I get the impression that option A is less resource-demanding on the DB, since the inner join operates on an already narrowed-down number of rows. Option B gives the same result, but it seems to perform the inner join on all available orderIds before narrowing the rows down to the shirts.
I'm curious about the end impact because stored procedures can get quite large at times, and I'd like to ensure that they don't affect report loading time needlessly.

The two queries should evaluate to exactly the same execution plan in SQL Server -- same performance. This is regardless of indexes.
Why? SQL is a descriptive language. A SELECT query describes the result set. It does not specify how the result set is created. In most databases, the work of figuring out what to do is handled by the SQL compiler and optimizer, which produce a directed-acyclic graph (DAG) of operations (some databases also do run-time optimizations). To the newcomer, the operations in the DAG look nothing like the original SELECT.
Not all databases have optimizers as smart as SQL Server's. For instance, there is a difference in MySQL, particularly in older versions. MySQL has a tendency to materialize subqueries, which usually adversely affects performance. However, that is due to a poor optimizing strategy rather than to SQL in general.
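To check this kind of claim on your own engine, compare the results and the reported plans of both forms. Below is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in; SQLite is not SQL Server, the sample rows are made up, and the plans it prints say nothing about what SQL Server or MySQL would choose. The explicit column list replaces `Select *` only to keep the comparison readable.

```python
import sqlite3

# SQLite is only a stand-in engine here; the rows are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (orderId INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE orderDetails (orderId INTEGER, product TEXT);
    INSERT INTO orders VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    INSERT INTO orderDetails VALUES
        (1, 'shirt'), (2, 'hat'), (3, 'shirt'), (3, 'sock');
""")

# Option A: filter inside a derived table, then join.
option_a = """
    SELECT o.orderId, o.customer, od.product
    FROM orders o
    INNER JOIN (SELECT * FROM orderDetails WHERE product = 'shirt') od
        ON o.orderId = od.orderId
"""
# Option B: join first (as written), filter in WHERE.
option_b = """
    SELECT o.orderId, o.customer, od.product
    FROM orders o
    INNER JOIN orderDetails od ON o.orderId = od.orderId
    WHERE od.product = 'shirt'
"""

rows_a = sorted(conn.execute(option_a).fetchall())
rows_b = sorted(conn.execute(option_b).fetchall())
print("same result:", rows_a == rows_b)

# Show what the engine decided to do for each form; the last column of
# each EXPLAIN QUERY PLAN row is the human-readable step description.
for name, sql in (("A", option_a), ("B", option_b)):
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(name, step[-1])
```

On SQL Server the equivalent check is to compare the actual execution plans of the two statements.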

Related

SQL Server choosing inefficient execution plan

I've got a query that gets run in certain circumstances with an 'over-simplified' execution plan that actually turns out to be quite slow (3-5 seconds). The query is:
SELECT DISTINCT Salesperson.*
FROM Salesperson
INNER JOIN SalesOrder on Salesperson.Id = SalesOrder.SalespersonId
INNER JOIN PrelimOrder on SalesOrder.Id = PrelimOrder.OrderId
INNER JOIN PrelimOrderStatus on PrelimOrder.CurrentStatusId = PrelimOrderStatus.Id
INNER JOIN PrelimOrderStatusType on PrelimOrderStatus.StatusTypeId = PrelimOrderStatusType.Id
WHERE
PrelimOrderStatusType.StatusTypeCode = 'Draft'
AND Salesperson.EndDate IS NULL
and the slow execution plan looks like:
The thing that stands out straight away is that the actual number of rows/executions is significantly higher than the respective estimates:
If I remove the Salesperson.EndDate IS NULL clause, then a faster, parallelized execution plan is run:
A similar execution plan also runs quite fast if I remove the DISTINCT keyword.
From what I can gather, it seems that the optimiser decides, based on its incorrect estimates, that the query won't be costly to run and therefore doesn't choose the parallelized plan. But I can't for the life of me figure out why it is choosing the incorrect plan. I have checked my statistics and they are all as they should be. I have tested in SQL Server 2008 through 2016 with identical results.
SELECT DISTINCT is expensive. So, it is best to avoid it. Something like this:
SELECT sp.*
FROM Salesperson sp
WHERE EXISTS (SELECT 1
FROM SalesOrder so INNER JOIN
PrelimOrder po
ON so.Id = po.OrderId INNER JOIN
PrelimOrderStatus pos
ON po.CurrentStatusId = pos.Id INNER JOIN
PrelimOrderStatusType post
ON pos.StatusTypeId = post.Id
WHERE sp.Id = so.SalespersonId AND
post.StatusTypeCode = 'Draft'
) AND
sp.EndDate IS NULL;
Note: An index on SalesPerson(EndDate, Id) would be helpful.
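To see why the EXISTS form avoids the extra work, it helps to count the rows the join produces before DISTINCT collapses them. The sketch below is illustrative only: it runs on SQLite through Python's sqlite3 module rather than SQL Server, the Name column and all sample rows are invented, and it says nothing about the plans SQL Server would pick.

```python
import sqlite3

# Invented sample data; SQLite stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salesperson (Id INTEGER PRIMARY KEY, Name TEXT, EndDate TEXT);
    CREATE TABLE SalesOrder (Id INTEGER PRIMARY KEY, SalespersonId INTEGER);
    CREATE TABLE PrelimOrder (OrderId INTEGER, CurrentStatusId INTEGER);
    CREATE TABLE PrelimOrderStatus (Id INTEGER PRIMARY KEY, StatusTypeId INTEGER);
    CREATE TABLE PrelimOrderStatusType (Id INTEGER PRIMARY KEY, StatusTypeCode TEXT);
    INSERT INTO Salesperson VALUES (1, 'Ann', NULL), (2, 'Bob', NULL), (3, 'Cy', '2015-01-01');
    INSERT INTO SalesOrder VALUES (10, 1), (11, 1), (12, 2), (13, 3);
    INSERT INTO PrelimOrderStatusType VALUES (1, 'Draft'), (2, 'Final');
    INSERT INTO PrelimOrderStatus VALUES (100, 1), (101, 1), (102, 2), (103, 1);
    INSERT INTO PrelimOrder VALUES (10, 100), (11, 101), (12, 102), (13, 103);
""")

joins = """
    FROM Salesperson
    INNER JOIN SalesOrder ON Salesperson.Id = SalesOrder.SalespersonId
    INNER JOIN PrelimOrder ON SalesOrder.Id = PrelimOrder.OrderId
    INNER JOIN PrelimOrderStatus ON PrelimOrder.CurrentStatusId = PrelimOrderStatus.Id
    INNER JOIN PrelimOrderStatusType ON PrelimOrderStatus.StatusTypeId = PrelimOrderStatusType.Id
    WHERE PrelimOrderStatusType.StatusTypeCode = 'Draft'
      AND Salesperson.EndDate IS NULL
"""

# Rows produced by the join before DISTINCT squeezes them together.
joined_rows = conn.execute("SELECT Salesperson.* " + joins).fetchall()
distinct_rows = sorted(conn.execute("SELECT DISTINCT Salesperson.* " + joins).fetchall())

# EXISTS only asks whether at least one matching order exists per salesperson.
exists_rows = sorted(conn.execute("""
    SELECT sp.*
    FROM Salesperson sp
    WHERE EXISTS (SELECT 1
                  FROM SalesOrder so
                  INNER JOIN PrelimOrder po ON so.Id = po.OrderId
                  INNER JOIN PrelimOrderStatus pos ON po.CurrentStatusId = pos.Id
                  INNER JOIN PrelimOrderStatusType post ON pos.StatusTypeId = post.Id
                  WHERE sp.Id = so.SalespersonId
                    AND post.StatusTypeCode = 'Draft')
      AND sp.EndDate IS NULL
""").fetchall())

print("rows before DISTINCT:", len(joined_rows))
print("same final result:", distinct_rows == exists_rows)
```

In this toy data Ann has two draft orders, so the join yields her row twice and DISTINCT has to remove the duplicate, while EXISTS never produces it in the first place.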
As @Gordon Linoff already said, DISTINCT is usually bad news for performance. Often it means you're amassing way too much data and then squeezing it back together into a more compact set. Better to keep it small throughout the process, if possible.
Also, it's kind of counter-intuitive that the query plan with index scans turns out to be faster than the one with index seeks; it seems that (in this case) parallelism makes up for it. You could try playing around with the Cost Threshold For Parallelism option, but beware that this is a server-wide setting! (Then again, in my opinion the default of 5 is rather high for most use cases I've run into personally; CPUs are aplenty these days, time still isn't =).
Bit of a long reach, but I was wondering if you could 'split' the query in 2, thus eliminating (a small) part of the guesswork of the server. I'm assuming here that StatusTypeCode is unique. (verify the datatype of the variable too!)
-- Look up the status type once, so the main query filters on a known value.
DECLARE @StatusTypeId int
SELECT @StatusTypeId = Id
FROM PrelimOrderStatusType
WHERE StatusTypeCode = 'Draft'
SELECT Salesperson.*
FROM Salesperson
WHERE Salesperson.EndDate IS NULL
AND EXISTS ( SELECT *
FROM SalesOrder
JOIN PrelimOrder
ON PrelimOrder.OrderId = SalesOrder.Id
JOIN PrelimOrderStatus
ON PrelimOrderStatus.Id = PrelimOrder.CurrentStatusId
AND PrelimOrderStatus.StatusTypeId = @StatusTypeId
-- Correlate the subquery with the outer Salesperson row.
WHERE SalesOrder.SalespersonId = Salesperson.Id)
If it doesn't help, could you give the definitions of the indexes that are being used?

Why the planner does not execute joins participating in WHERE clause first?

I'm experimenting with PostgreSQL (v9.3). I have a quite large database, and often I need to execute queries with 8-10 joined tables (as source of large data grids). I'm using Devexpress XPO as the ORM above PostgreSQL, so unfortunately I don't have any control over how joins are generated.
The following example is fairly simplified; the real scenario is more complex, but as far as I can tell the main problem shows up here too.
Consider the following variants of the (semantically) same query:
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN orderdetails od ON o.details = od.oid
LEFT JOIN customers c ON o.customer = c.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN customers c ON o.customer = c.oid
LEFT JOIN orderdetails od ON o.details = od.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
The orders table contains about 1 million rows, and the customers about 30 thousand. The order details contains the same amount as orders due to a one-to-one relation.
UPDATE:
It seems the example is too simplified to reproduce the issue: I checked again, and in this case the two execution plans are identical. However, in my real query, which has many more joins, the problem occurs: if I put customers as the first join, the execution is 100x faster. I'll add my real query, but the Hungarian names and the fact that it was generated by XPO and Npgsql make it less readable.
The first query is significantly slower (about 100x) than the second, and when I output the plans with EXPLAIN ANALYZE I can see that the order of the joins reflects their position in the query string. So first the two "giant" tables are joined together, and only then is the filtered customers table joined (where the filter selects only one row).
The second query is faster because the join starts with that one customer row, and after that it joins the 20-30 order details rows.
Unfortunately in my case XPO generates the first version so I'm suffering with performance.
Why does the PostgreSQL query planner not notice that the join on customers has a condition in the WHERE clause? IMO the correct optimization would be to take the joins that have any kind of filter first, and then the joins that only contribute to the selection.
Any kind of help or advice is appreciated.
Join order only matters if your query's joins are not collapsed. Collapsing is done internally by the query planner, but you can influence it with the join_collapse_limit runtime option (default 8, so a query that joins more tables than that may keep part of the order in which its joins were written).
Note, however, that the query planner will not always find the best join order by default:
Constraining the planner's search in this way is a useful technique both for reducing planning time and for directing the planner to a good query plan. If the planner chooses a bad join order by default, you can force it to choose a better order via JOIN syntax — assuming that you know of a better order, that is. Experimentation is recommended.
For the best performance, I recommend using some kind of native querying, if available. Raising join_collapse_limit can be a good-enough solution, though, as long as you make sure it does not cause other problems.
It is also worth mentioning that raising join_collapse_limit will most likely increase planning time.
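One way to experiment without touching the server configuration is to change the setting for a single transaction only. The sketch below just builds the statements to send, in order, over one PostgreSQL connection; the helper name explain_with_join_collapse_limit and the value 12 are hypothetical choices for illustration, and how you send the statements (psql, Npgsql, and so on) is up to you. Keep in mind that EXPLAIN ANALYZE actually executes the query.

```python
from typing import List


def explain_with_join_collapse_limit(query: str, limit: int) -> List[str]:
    """Build the statements to run, in order, on one PostgreSQL connection.

    Hypothetical helper: SET LOCAL keeps the new join_collapse_limit
    scoped to the surrounding transaction, so other sessions and later
    queries on this connection are unaffected.
    """
    if limit < 1:
        raise ValueError("join_collapse_limit must be at least 1")
    return [
        "BEGIN",
        f"SET LOCAL join_collapse_limit = {limit}",
        "EXPLAIN ANALYZE " + query.strip().rstrip(";"),
        "ROLLBACK",
    ]


# The first variant from the question, used as sample input.
query = """
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN orderdetails od ON o.details = od.oid
LEFT JOIN customers c ON o.customer = c.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
"""

statements = explain_with_join_collapse_limit(query, 12)
for statement in statements:
    print(statement + ";")
```

Comparing the plan printed this way against the plan under the default setting shows whether the limit is what keeps the planner from reordering the joins.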

Performance impacts on specifying multiple columns in inner join

I want to select some records from two tables based on matching the values of two columns.
I have two queries for this. The first joins on both columns:
SELECT
*
FROM
USER_MASTER UM
INNER JOIN
USER_LOCATION UL
ON
UM.CUSTOMER_ID=UL.CUSTOMER_ID AND UM.CREATED_BY=UL.USER_ID
The same results can be achieved by the following query, which joins on a single column and moves the other comparison to the WHERE clause:
SELECT
*
FROM
USER_MASTER UM
INNER JOIN
USER_LOCATION UL
ON
UM.CREATED_BY=UL.USER_ID
WHERE
UM.CUSTOMER_ID=UL.CUSTOMER_ID
Is there any difference in performance between the above queries?
As with everything concerning performance, the answer is: it depends.
In general the engine is smart enough to optimize both queries; I would not be surprised if both produced the same execution plan.
To be sure, run both queries a few times and study the execution plans to determine whether they take about the same time and use the same amount of CPU, I/O, and memory. (Remember that performance is not only about running fast; it is about smart use of all resources.)
From a semantic point of view, your rows are matched on two keys. In that case you can put both expressions in the JOIN predicate and leave only filters in the WHERE clause.
The advantage of explicit joins over implicit ones is that they create this logical (and visual) separation.
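The "run both a few times and compare" advice can be sketched as follows. This is only an illustration on SQLite via Python's sqlite3 module, with generated rows; the timings it prints depend on the machine and say nothing about your actual database, where you would also look at CPU, I/O, and memory in the real execution plans.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE USER_MASTER (CUSTOMER_ID INTEGER, CREATED_BY INTEGER);
    CREATE TABLE USER_LOCATION (CUSTOMER_ID INTEGER, USER_ID INTEGER);
""")
# Generated data: every user has one location row, but only even user ids
# have a location whose CUSTOMER_ID also matches.
for i in range(1000):
    conn.execute("INSERT INTO USER_MASTER VALUES (?, ?)", (i % 10, i))
    loc_customer = i % 10 if i % 2 == 0 else (i + 1) % 10
    conn.execute("INSERT INTO USER_LOCATION VALUES (?, ?)", (loc_customer, i))

both_in_on = """
    SELECT UM.CUSTOMER_ID, UM.CREATED_BY
    FROM USER_MASTER UM
    INNER JOIN USER_LOCATION UL
        ON UM.CUSTOMER_ID = UL.CUSTOMER_ID AND UM.CREATED_BY = UL.USER_ID
"""
one_in_where = """
    SELECT UM.CUSTOMER_ID, UM.CREATED_BY
    FROM USER_MASTER UM
    INNER JOIN USER_LOCATION UL
        ON UM.CREATED_BY = UL.USER_ID
    WHERE UM.CUSTOMER_ID = UL.CUSTOMER_ID
"""


def best_time(sql, repeats=5):
    """Run the query several times and return the fastest wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return min(timings)


rows_on = sorted(conn.execute(both_in_on).fetchall())
rows_where = sorted(conn.execute(one_in_where).fetchall())
print("same result:", rows_on == rows_where, "rows:", len(rows_on))
print("both columns in ON : %.6f s" % best_time(both_in_on))
print("one column in WHERE: %.6f s" % best_time(one_in_where))
```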

Filter table before inner join condition

There's a similar question here, but my question is slightly different:
select *
from process a inner join subprocess b on a.id=b.id and a.field=true
and b.field=true
So, when using inner join, which operation comes first: the join or the a.field=true condition?
As the two tables are very big, my goal is to filter table process first and after that join only the rows filtered with table subprocess.
Which is the best approach?
First things first:
which operation comes first: the join or the a.field=true condition?
Your INNER JOIN includes this (a.field=true) as part of the condition for the join. So it will prevent rows from being added during the JOIN process.
A part of an RDBMS is the "query optimizer" which will typically find the most efficient way to execute the query - there is no guarantee on the order of evaluation for the INNER JOIN conditions.
Lastly, I would recommend rewriting your query this way:
SELECT *
FROM process AS a
INNER JOIN subprocess AS b ON a.id = b.id
WHERE a.field = true AND b.field = true
This will effectively do the same thing as your original query, but it is widely seen as much more readable by SQL programmers. The optimizer can rearrange INNER JOIN and WHERE predicates as it sees fit.
You are thinking about SQL in terms of a procedural language, which it is not. SQL is a declarative language, and the engine is free to pick the execution plan that works best for a given situation. So there is no way to predict whether a join or a where will be executed first.
A better way to think about SQL is in terms of optimizing queries: things like ensuring that your joins and wheres are covered by indexes. Also, at least in MS SQL Server, you can preview an estimated or actual execution plan. There is nothing stopping you from doing that and seeing for yourself.
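"Seeing for yourself" might look like the sketch below. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in for SQL Server's plan viewer; the sample rows are made up, and `true` is written as 1 so the example also runs on older SQLite builds. The plans printed are SQLite's, not those of any other engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE process (id INTEGER PRIMARY KEY, field INTEGER);
    CREATE TABLE subprocess (id INTEGER PRIMARY KEY, field INTEGER);
    INSERT INTO process VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO subprocess VALUES (1, 1), (2, 1), (3, 0);
""")

# Filters written as part of the join condition.
filters_in_on = """
    SELECT a.id FROM process AS a
    INNER JOIN subprocess AS b ON a.id = b.id AND a.field = 1 AND b.field = 1
"""
# Filters written in the WHERE clause.
filters_in_where = """
    SELECT a.id FROM process AS a
    INNER JOIN subprocess AS b ON a.id = b.id
    WHERE a.field = 1 AND b.field = 1
"""


def plan(sql):
    """Return the human-readable steps of SQLite's plan for the query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


rows_on = sorted(conn.execute(filters_in_on).fetchall())
rows_where = sorted(conn.execute(filters_in_where).fetchall())
plan_on = plan(filters_in_on)
plan_where = plan(filters_in_where)

print("same result:", rows_on == rows_where)
print("ON plan   :", plan_on)
print("WHERE plan:", plan_where)
```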

SQL Server: join order produces different execution plans, which one has better performance? WHY?

select * from AdventureWorks.Sales.Customer c
inner loop join AdventureWorks.Sales.SalesOrderHeader o on o.CustomerID = c.CustomerID
select * from AdventureWorks.Sales.SalesOrderHeader o
inner loop join AdventureWorks.Sales.Customer c on c.CustomerID = o.CustomerID
In MS SQL Server, the above two statements can produce different execution plans.
Assume the Customer and SalesOrderHeader tables differ in row count by orders of magnitude. Which one has better performance? WHY?
Using join hints forces join order. Look into the messages tab: there is a message saying that.
This is a very unfortunate side-effect of using join-hints. It makes them very awkward to use.
Which one has better performance?
Look at the query execution time and plan cost estimation to answer that.