ORDER BY on a column that allows null is slow. Why? - sql

So really my question is WHY this worked.
Anyway, I had this query that does a few inner joins, has a WHERE clause, and does an ORDER BY on an nvarchar column. If I run the query WITHOUT the order by, the query takes less than a second. If I run the query WITH the order by, it takes 12 seconds.
Now I had a great idea and changed all the INNER JOINs to LEFT JOINs, and also included the ORDER BY clause. That took less than a second. So I remembered the difference between LEFT JOINs and INNER JOINs: INNER JOINs check for NULL and LEFT JOINs don't. So I went into the table design and unchecked "Allow Nulls". Now I run the query WITH INNER JOINs and an ORDER BY clause and the query takes less than a second. WHY?
From what I understand, the FROM, JOIN, and WHERE clauses, then the SELECT clause, should run first and return a result set. Then the ORDER BY clause runs at the very end on the resulting record set. Therefore the query should have taken AT MOST a second, yes, even with the column allowing nulls. So why would the query take less than a second WITHOUT the order by clause, but take 12 seconds WITH the order by clause? That doesn't make sense to me.
Query below:
SELECT PlanInfo.PlanId, PlanName, COALESCE(tResponsible, '') AS tResponsible, Processor, CustName, TaskCategoryId, MapId, tEnd,
CASE MapId WHEN 9 THEN 1 ELSE 2 END AS sor
FROM PlanInfo INNER JOIN [orders].dbo.BaanOrders_Ext ON PlanInfo.PlanName = [orders].dbo.BaanOrders_Ext.OrderNo
INNER JOIN [orders].dbo.BaanOrders ON PlanInfo.PlanName = [orders].dbo.BaanOrders.OrderNo
INNER JOIN Tasks ON PlanInfo.PlanId = Tasks.PlanId
INNER JOIN EngSchedToTimingMap ON Tasks.CatId = EngSchedToTimingMap.TaskCategoryId
WHERE (MapId = 9 OR MapId = 11 or MapId = 13 or MapId = 15)
AND([orders].dbo.BaanOrders_Ext.Processor = 'metest' OR tResponsible = 'metest')
ORDER BY PlanInfo.PlanId

I would have to guess that it is due to having an index on PlanInfo.PlanId, the column on which you are sorting.
SQL Server can stream the rows in index order and build the rest of the columns along the way. When the column is NULLable, the optimizer may decide the index cannot be used for the sort, because it will not contain the NULL values (which, incidentally, sort first), so it optimizes along a different path.
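If you want to reproduce what the question describes in T-SQL rather than through the table designer, a minimal sketch would look like the following. The nvarchar(50) type and the index name are assumptions for illustration; use the column's real definition.
ALTER TABLE PlanInfo ALTER COLUMN PlanId nvarchar(50) NOT NULL;  -- assumes PlanId is nvarchar(50); fails if existing rows contain NULLs
CREATE INDEX IX_PlanInfo_PlanId ON PlanInfo (PlanId);            -- index name is illustrative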
Showing the execution plan always helps. Either paste images of the plans, or just show the text-mode plans: add the following line above the query, then execute it.
SET SHOWPLAN_TEXT ON;
<the query>
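In SSMS the SET statement has to be in its own batch, so the full pattern looks roughly like this (nothing actually executes while the option is on, so remember to turn it back off):
SET SHOWPLAN_TEXT ON;
GO
<the query>
GO
SET SHOWPLAN_TEXT OFF;
GO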

When you use the ORDER BY clause, you force the database engine to sort the results. This takes some time (especially if the result contains many rows), so it is quite possible for a query that runs in 1 second without an ORDER BY clause to run for 12 seconds with it. Note that sorting takes at best O(N*log(N)) time, where N is the number of rows.
The reason NULLs are generally slow is that they must be treated specially. Sorting with NULLs adds more complex comparison conditions and slows the sort down.
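If you want to see how much of the 12 seconds is the sort itself, a simple check is to compare server timings with and without the ORDER BY, for example:
SET STATISTICS TIME ON;
GO
<the query without the ORDER BY>
GO
<the query with the ORDER BY>
GO
SET STATISTICS TIME OFF;
GO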

If your question is "Why does the ORDER BY clause cause my query to run longer?" the answer is because sorting the results is added to the query execution plan.
If you use the "Show Estimated Query Execution Plan" tool in SQL Server Management Studio, it will show you exactly what it thinks the SQL Server engine will do.
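If you prefer to capture that estimate from T-SQL instead of the SSMS button, SET SHOWPLAN_XML returns the estimated plan as XML without executing the query (it must be the only statement in its batch):
SET SHOWPLAN_XML ON;
GO
<the query>
GO
SET SHOWPLAN_XML OFF;
GO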

Related

How to improve the performance of a 10 min running query?

I have a query which is taking approximately 10 mins to execute and produce the results. When I try to break it into parts and run it, it seems to run fine, within seconds.
I tried modifying the subselects in the top and bottom portions of the query to determine whether one of them was causing the issue, but neither was; each returned its results within 3 seconds.
I am trying to learn to read the estimated execution plan, but it is confusing and hard for me to trace the issue through it.
Can anyone please point out the mistakes I made that are making the query run so long?
Select Distinct
PostExtended.BatchNum,
post.ControlNumStatus,
post.AccountSeg,
Post.PostDat
From
Post
Post Records
join (Select Post, MAX(Dist) as Dist, COUNT(fkglDist) as RecordCount From PostExtend WITH (NOLOCK) Group By flPost) as PostExtender on Post.PK = PostExtender.flPost
join glPostExtended WITH (NOLOCK) on glPostExtendedLimiter.Post = glPostExtended.Post and (PostExtendedLimiter.fkglDist = PostExtend.Dist or PostExtend.Dist is null)
join (select lP.fkosControlNumberStatus, lP.SourceJENumber, AccountSegment,
sum(case
............
from Post WITH (NOLOCK)
join AccountingPeriod WITH (NOLOCK) on AccountingPeriod.pk = lP.fkglAccountingPeriod
join FiscalYear WITH (NOLOCK) on FiscalYear.pk = AccountingPeriod.FiscalYear
join Account WITH (NOLOCK) on Account.pk = FiscalYear.Account
where FiscalYear.Period = #Date
and glP.fkMLSosCodeEntryType = 2202
group by glP.fkosControlNumberStatus, glP.SourceNumber, AccountSeg) post on post.ControlNumStatus = Post.fkControlNumberStatus and postdata.SourceJENumber = glPost.SourceJENumber
where post.AmountT <> 0)......
Group by
Subqueries are very often the source of problems.
I would try to:
separate the postdata subquery from the main query,
save the result in a temporary table or even in a table variable,
put a clustered index on the fkosControlNumberStatus and SourceJENumber fields,
join this temporary table back to the main query.
Sometimes the result of these simple actions is pleasantly surprising; a rough sketch follows below.
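The sketch below uses the aggregate subquery aliased post from the question. All names are taken from the question's anonymised query or are placeholders (SUM(Amount) stands in for the elided CASE sums), so adjust them to the real schema:
SELECT fkosControlNumberStatus, SourceJENumber, AccountSeg,
       SUM(Amount) AS AmountT                 -- placeholder for the elided CASE expressions
INTO   #post                                  -- temporary table
FROM   Post WITH (NOLOCK)
WHERE  fkMLSosCodeEntryType = 2202
GROUP BY fkosControlNumberStatus, SourceJENumber, AccountSeg;

CREATE CLUSTERED INDEX IX_post ON #post (fkosControlNumberStatus, SourceJENumber);

-- then join #post back into the main query in place of the inline "post" derived table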
This is a fairly complex query. You are joining on Aggregate Queries (with GROUP BY).
The first thing I would do is see how long it takes to run each of the join queries. One of these may run very fast, while another may run very long. So, you may not really need to optimize the entire query--just one of the joined queries.
Another way to do it is to just start eliminating joins one by one, then run the entire query and see how fast it goes. When you see a really significant decrease in time, you've found the culprit.
Typically, one thing that can add a lot of CPU is comparisons. The SUMs with CASE statements might be the biggest suspect.
Have you used the Database Engine Tuning Advisor? If all else fails, go with that and see what it tells you.
So, maybe try this approach:
Take away the CASE Statements inside the SUM expressions on that last join.
Remove the last JOIN with all the sums.
Remove the first join with that GROUP BY and the MAX expression.
That would be my strategy.

Why would a subquery perform better than a literal value in a WHERE clause with multiple joins?

Take the following query:
SELECT *
FROM FactALSAppSnapshot AS LB
LEFT OUTER JOIN MLALSInfoStage AS LA ON LB.ProcessDate = LA.ProcessDate AND
LB.ALSAppID = LA.ALSNumber
LEFT OUTER JOIN MLMonthlyIncomeStage AS LC ON LB.ProcessDate = LC.ProcessDate AND
LB.ALSAppID = LC.ALSNumber
LEFT OUTER JOIN DimBranchCategory AS LI on LB.ALSAppBranchKey = LI.Branch
WHERE LB.ProcessDate=(SELECT TOP 1 LatestProcessDateKey
FROM DimDate)
Notice that the WHERE condition is a scalar subquery. The runtime for this is 0:54, returning 367,853 records.
However, if I switch the WHERE clause to the following:
WHERE LB.ProcessDate=20161116
This somehow causes the query runtime to jump to 57:33, still returning 367,853 records. What is happening behind the scenes that would cause this huge jump in runtime? I would have expected the subquery version to take longer, not the literal integer value.
The table aliased as LI (the last join in the list) seems to be the only table that isn't indexed on its join key; if I remove that join, the literal-value version performs much closer to the first query.
SQL Server 11
The real answer to your question lies in the execution plan for the query. You can see the actual plan in SSMS.
Without the plan, the rest is speculation. However, based on my experience, what changes is the way the joins are processed. In my experience, queries slow down considerably when the plan switches to nested loop joins. This is at the whim of the optimizer, which, when there is a constant, thinks this is the best way to run the query.
I'm not sure why this would be the case here. Perhaps an index on FactALSAppSnapshot(ProcessDate, ALSAppID, ALSAppBranchKey) would speed up both versions of the query.
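If you want to try that suggestion, the index would look roughly like this (the index name is just illustrative):
CREATE NONCLUSTERED INDEX IX_FactALSAppSnapshot_ProcessDate
    ON FactALSAppSnapshot (ProcessDate, ALSAppID, ALSAppBranchKey);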

Left join or Select in select (SQL - Speed of query)

I have something like this:
SELECT CompanyId
FROM Company
WHERE CompanyId not in
(SELECT CompanyId
FROM Company
WHERE (IsPublic = 0) and CompanyId NOT IN
(SELECT ShoppingLike.WhichId
FROM Company
INNER JOIN
ShoppingLike ON Company.CompanyId = ShoppingLike.UserId
WHERE (ShoppingLike.IsWaiting = 0) AND
(ShoppingLike.ShoppingScoreTypeId = 2) AND
(ShoppingLike.UserId = 75)
)
)
It has 3 SELECTs. I want to know how I could write it without making 3 SELECTs, and which one has better speed for 1 million records: "select in select" or "left join"?
My experience is from Oracle. There is never a single correct answer to optimising tricky queries; it's a collaboration between you and the optimiser. You need to check explain plans and sometimes traces, often at each stage of writing the query, to find out what the optimiser is thinking. Having said that:
You could remove the outer SELECT by putting the entire contents of its subquery's WHERE clause in a NOT(...). On the face of it, this will prevent the outer full scan of Company (or of its index on CompanyId). Try it, check the output is the same and get timings, then remove it temporarily before trying the suggestions below. The NOT() may well cause the optimiser to stop considering an ANTI-JOIN against the ShoppingLike subquery due to an implicit OR being created.
Ensure that CompanyId and WhichId are defined as NOT NULL columns. Without this (or the likes of an explicit CompanyId IS NOT NULL), ANTI-JOIN options are often discarded.
The innermost subquery is not correlated (it does not reference anything from its outer query), so it can be extracted and tuned separately. As a matter of style I'd swap the table names around in the INNER JOIN so that ShoppingLike is scanned first, since it has all the filters against it. It won't make any difference, but it reads more easily and makes it possible to use a hint to scan the tables in the order specified. I would even question the need for the Company table in this subquery.
You've used NOT IN, when sometimes the very similar NOT EXISTS gives the optimiser more/alternative options (see the sketch below).
All the above is just trial and error unless you start examining the explain plans. Oracle can, with a following wind, convert between LEFT JOIN and IN SELECT. 1M+ rows will make the time worth investing.
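As an illustration of the NOT EXISTS point, the innermost NOT IN (together with the IsPublic = 0 level above it) could be sketched as a correlated NOT EXISTS, dropping the Company table from the subquery as suggested above. Note that NOT IN and NOT EXISTS only behave identically when the compared columns cannot be NULL:
SELECT c.CompanyId
FROM Company c
WHERE c.IsPublic = 0
  AND NOT EXISTS (SELECT 1
                  FROM ShoppingLike sl
                  WHERE sl.WhichId = c.CompanyId
                    AND sl.IsWaiting = 0
                    AND sl.ShoppingScoreTypeId = 2
                    AND sl.UserId = 75)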

Does Sql JOIN order affect performance?

I was just tidying up some sql when I came across this query:
SELECT
jm.IMEI ,
jm.MaxSpeedKM ,
jm.MaxAccel ,
jm.MaxDeccel ,
jm.JourneyMaxLeft ,
jm.JourneyMaxRight ,
jm.DistanceKM ,
jm.IdleTimeSeconds ,
jm.WebUserJourneyId ,
jm.lifetime_odo_metres ,
jm.[Descriptor]
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE ( wu.isActive = 1 )
AND ( j.JourneyDuration > 2 )
AND ( j.JourneyDuration < 1000 )
AND ( j.JourneyDistance > 0 )
My question is: does the order of the joins make any performance difference? For the above query I would have done
FROM dbo.Reporting_JourneyMaster90 AS jm
and then joined the other 2 tables to that one
Join order on SQL Server 2008 R2 does unquestionably affect query performance, particularly in queries with a large number of table joins and WHERE clauses applied against multiple tables.
Although the join order is changed during optimisation, the optimiser doesn't try all possible join orders. It stops when it finds what it considers a workable solution, as the very act of optimisation uses precious resources.
We have seen queries that were performing like dogs (1 min+ execution time) come down to sub-second performance just by changing the order of the join expressions. Note, however, that these are queries with 12 to 20 joins and WHERE clauses on several of the tables.
The trick is to set your order to help the query optimiser figure out what makes sense. You can use FORCE ORDER, but that can be too rigid. Try to make sure that your join order starts with the tables whose WHERE clauses will reduce the data the most.
No, the JOIN order is changed during optimization.
The only caveat is the OPTION (FORCE ORDER) hint, which will force the joins to happen in the exact order you have specified them; a sketch follows below.
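For illustration, here is the query from the question pinned to the order the asker would have written it in; OPTION (FORCE ORDER) goes at the very end of the statement (the SELECT list is shortened here):
SELECT jm.IMEI, jm.MaxSpeedKM, jm.DistanceKM          -- plus the other jm columns from the question
FROM dbo.Reporting_JourneyMaster90 AS jm
INNER JOIN dbo.Reporting_WebUsers AS wu ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE wu.isActive = 1
  AND j.JourneyDuration > 2
  AND j.JourneyDuration < 1000
  AND j.JourneyDistance > 0
OPTION (FORCE ORDER);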
I have a clear example of join order affecting performance with an inner join. It is a simple join between two tables. One has 50+ million records, the other has 2,000. If I select from the smaller table and join the larger, it takes 5+ minutes.
If I select from the larger table and join the smaller, it takes 2 minutes 30 seconds.
This is with SQL Server 2012.
To me this is counter intuitive since I am using the largest dataset for the initial query.
Usually not. I'm not 100% sure this applies verbatim to SQL Server, but in Postgres the query planner reserves the right to reorder the inner joins as it sees fit. The exception is when you reach a threshold beyond which it is too expensive to investigate changing their order.
JOIN order doesn't matter; the query engine will reorganize the order based on index statistics and other factors.
To test, do the following:
enable "Include Actual Execution Plan" and run the first query
change the JOIN order and run the query again
compare execution plans
They should be identical as the query engine will reorganize them according to other factors.
As commented on the other answer, you could use OPTION (FORCE ORDER) to use exactly the order you want, but maybe it would not be the most efficient one.
As a general rule of thumb, the JOIN order should put the table with the fewest records first and the one with the most records last, as in some DBMS engines the order can make a difference, as can using the FORCE ORDER hint to help limit the results.
Wrong. In SQL Server 2005 it definitely matters, since you are limiting the dataset from the beginning of the FROM clause. If you start with 2,000 records instead of 2 million, your query is faster.