I have a large "Deals" table (1.3 million rows) that needs to be displayed in a paginated grid in my application. The application also includes filters to help the user search through those rows. The generated SQL follows the structure below:
SELECT TOP 10 *
FROM (
SELECT
ROW_NUMBER() OVER(ORDER BY [DealID] DESC) AS RowNumber, *
FROM (
SELECT d.DealID, a.[Description] AS Asset
FROM Deals d
INNER JOIN Assets a ON d.AssetID = a.AssetID
) AS Sorted
WHERE Asset LIKE '%*my asset%'
) AS Sorted
My problem is with the execution plan generated for this query. Because it's ordered by DealID, SQL Server chooses the clustered index on DealID and performs a clustered index scan over all 1.3 million rows. However, the query is also filtered by Asset, and only 171 rows satisfy that filter, so it's much faster to use the non-clustered index on the asset first and then sort the resulting rows. I can already fix the issue by adding a WITH (INDEX(IX_Asset_ID)) hint to the query, but since this is a generated query, that would add a lot of complexity to the code that generates it.
So my question is: is there a way to get SQL Server to detect this situation without the hint? Maybe by updating statistics or something like that? Even moving the hint to the end of the query would help, since the middle of the query is actually a report written by the client.
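For reference, this is roughly how the hint gets injected today (a sketch; the exact placement of the hint on the Deals table is my assumption based on the description above):
SELECT d.DealID, a.[Description] AS Asset
FROM Deals d WITH (INDEX(IX_Asset_ID))
INNER JOIN Assets a ON d.AssetID = a.AssetID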
--Edit--
As pointed out in the comments, there are a few issues with the query, but those were introduced when I tried to create a minimal reproducible example and omitted the paging part of the query. The structure below is a more complete version that should make more sense:
SELECT TOP (@pageLength) * FROM (
SELECT ROW_NUMBER() OVER(ORDER BY [DealID] DESC) AS RowNumber, *
FROM (
SELECT d.DealID, a.[Description] AS Asset FROM Deals d
INNER JOIN Assets a on d.AssetID = a.AssetID
) AS Sorted
WHERE Asset LIKE '%*my asset%'
) AS Paged
WHERE RowNumber > @startRow
ORDER BY RowNumber
OPTION(RECOMPILE)
It's much better to page based off your clustered index key values, something like this
SELECT TOP (@pageLength)
d.DealID,
a.[Description] AS Asset,
@startRowNumber + ROW_NUMBER() OVER(ORDER BY [DealID] DESC) AS RowNumber
FROM Deals d
INNER JOIN Assets a on d.AssetID = a.AssetID
WHERE d.DealID < @startDealID
and a.[Description] LIKE '%*my asset%'
ORDER BY d.DealID DESC
This technique is sometimes called "keyset pagination", and it leverages the ordered index to allow SQL to seek directly to the next clustered index key after the last page. You track the rows based on the key value rather than the generated row number.
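To make the page-to-page flow concrete, here is a minimal sketch (parameter names such as @lastDealID are my own; the filter and column list come from the question):
-- Page 1: no key filter yet
SELECT TOP (@pageLength) d.DealID, a.[Description] AS Asset
FROM Deals d
INNER JOIN Assets a ON d.AssetID = a.AssetID
WHERE a.[Description] LIKE '%*my asset%'
ORDER BY d.DealID DESC;

-- Subsequent pages: pass the smallest DealID from the previous page as @lastDealID
SELECT TOP (@pageLength) d.DealID, a.[Description] AS Asset
FROM Deals d
INNER JOIN Assets a ON d.AssetID = a.AssetID
WHERE a.[Description] LIKE '%*my asset%'
  AND d.DealID < @lastDealID
ORDER BY d.DealID DESC;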
In the end the solution for me was to use the DISABLE_OPTIMIZER_ROWGOAL hint.
I think what's happening here is that SQL Server is being too optimistic about the query only requiring 10 rows: it scans too much of the table because it thinks it won't take long to find the first 10, when in reality it's better to use the available indexes. Adding the hint changes the plan and the query runs quickly.
SELECT TOP (@pageLength) * FROM (
SELECT ROW_NUMBER() OVER(ORDER BY [DealID] DESC) AS RowNumber, *
FROM (
SELECT d.DealID, a.[Description] AS Asset FROM Deals d
INNER JOIN Assets a on d.AssetID = a.AssetID
) AS Sorted
WHERE Asset LIKE '%*my asset%'
) AS Paged
WHERE RowNumber > @startRow
ORDER BY RowNumber
OPTION(RECOMPILE, USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'))
I have a table of 'games' where some of those games (but not all) are grouped into 'campaigns' by a campaign ID.
I am trying to write an SQL query which will get a dataset containing various information about the games, but in particular: if a given game is part of a campaign, how many games are in that campaign in total (I have this working), and the index of that game within the campaign (e.g. the earliest game in a campaign is index '1', the next is '2', and so on).
I have achieved this, but the execution plan looks terrible and the obvious way to fix that doesn't work, but I get ahead of myself.
Here is the working query, with some extraneous stuff removed:
SELECT
g1.`id` AS `game_id`,
(SELECT
COUNT(*)
FROM `games` g3
WHERE g3.`campaign` = g1.`campaign`
) AS `campaign_length`,
ca2.`ri` AS `campaign_index`,
ca1.`id` AS `campaign_id`, ca1.`name` AS `campaign_name`
FROM `games` g1
LEFT JOIN `campaigns` ca1 ON ca1.`id` = g1.`campaign`
LEFT JOIN (
SELECT
g4.`id` AS `id`,
ROW_NUMBER() OVER (
PARTITION BY g4.`campaign`
ORDER BY g4.`start` ASC) AS `ri`
FROM `games` g4
) AS ca2 ON ca2.`id` = g1.`id`
WHERE g1.`end` > CURRENT_TIMESTAMP()
AND g1.`gamemaster` = 25
ORDER BY g1.`start` ASC
;
The problem with this version is that for table g4 the execution plan lists a full table scan - which is fine at the moment as there are only a few hundred records, but long term this will be terrible for performance, especially as this query (or ones very similar to it) will be executed on many different pages of my website. I believe this is happening because the ROW_NUMBER() function needs to number all the rows before the LEFT JOIN's ON clause can filter them down to the ones I actually need.
The obvious solution, which I have tried to no avail, is to add
WHERE g4.`campaign` = g1.`campaign` after FROM `games` g4;
that way ROW_NUMBER() would only need to number those records that have a chance of being returned in the dataset. However this does not work because g1.`campaign` is not in scope.
I can do WHERE g4.`campaign` IS NOT NULL, which at least gets the execution plan down to an index condition instead of a full table scan, but it will still not scale nicely as the number of games in campaigns grows over time.
I know that my "obvious solution" won't work because of the scope problem, but does anyone have a suggestion for how I can achieve what I'm trying to do without a terrible execution plan?
Based on your comments, the campaign_index must be calculated before the WHERE clause is applied. This means that calculation of the campaign_index will always require a full table scan, as the WHERE clause can't reduce the rows being computed over.
You can, however, use windowed functions rather than a self join and correlated sub-query...
WITH games AS
(
    SELECT
        *,
        COUNT(*)     OVER (PARTITION BY `campaign`) AS `campaign_length`,
        ROW_NUMBER() OVER (PARTITION BY `campaign` ORDER BY `start`) AS `campaign_index`
    FROM games
)
SELECT
    games.*,
    campaigns.`name` AS `campaign_name`
FROM games
LEFT JOIN campaigns
    ON campaigns.`id` = games.`campaign`
WHERE games.`end` > CURRENT_TIMESTAMP()
  AND games.`gamemaster` = 25
ORDER BY games.`start`;
Copy the table into a new table with a fresh AUTO_INCREMENT id. This will quickly add row numbers.
CREATE TABLE new_list (
row_num INT AUTO_INCREMENT NOT NULL,
INDEX(row_num) ) ENGINE=InnoDB
SELECT ... FROM ...
ORDER BY ... -- this will do the sorting before numbering
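For instance, applied to the games table from the earlier question, a concrete version might look like this (a sketch; which columns to copy is an assumption):
CREATE TABLE new_list (
row_num INT AUTO_INCREMENT NOT NULL,
INDEX(row_num) ) ENGINE=InnoDB
SELECT g.`id`, g.`campaign`, g.`start`
FROM `games` AS g
ORDER BY g.`campaign`, g.`start`;  -- rows receive row_num in this order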
I have the below query selecting items and one of their features from a hundred thousand rows of items.
But I am concerned about the performance of the sub query. Will it be executed before or after the WHERE clause?
Suppose I am selecting 25 items out of 10,000: will this subquery be executed only for the 25 items, or for all 10,000?
declare @BlockStart int = 1
, @BlockSize int = 25
;
select *, (
select Value_Float
from Features B
where B.Feature_ModelID = Itm.ModelID
and B.Feature_PropertyID = 5
) as Price
from (
select *
, row_number() over (order by ModelID desc) as RowNumber
from Models
) Itm
where Itm.RowNumber >= @BlockStart
and Itm.RowNumber < @BlockStart + @BlockSize
order by ModelID desc
The sub query in the FROM clause produces a full set of results, but the sub query in the SELECT clause will (generally!) only be run for the records included in the final result set.
As with all things SQL, there is a query optimizer involved, which may at times decide to create seemingly-strange execution plans. In this case, I believe we can be pretty confident, but I need to caution about making sweeping generalizations about SQL language order of operations.
Moving on, have you seen the OFFSET/FETCH syntax available in SQL Server 2012 and later? This seems like a better way to handle the @BlockStart and @BlockSize values, especially as it looks like you're paging on the clustered key. (If you end up paging on an alternate column, the link shows a much faster method).
Also, at the risk of making generalizations again, if you know that only one Features record exists per ModelID with Feature_PropertyID = 5, you will tend to get better performance using a JOIN:
SELECT m.*, f.Value_Float As Price
FROM Models m
LEFT JOIN Features f ON f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
ORDER BY m.ModelID DESC
OFFSET (@BlockStart - 1) ROWS FETCH NEXT @BlockSize ROWS ONLY
If you can't make that guarantee, you may get better performance from an APPLY operation:
SELECT m.*, f.Value_Float As Price
FROM Models m
OUTER APPLY (
SELECT TOP 1 Value_Float
FROM Features f
WHERE f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
) f
ORDER BY m.ModelID DESC
OFFSET (@BlockStart - 1) ROWS FETCH NEXT @BlockSize ROWS ONLY
Finally, this smells like yet another variation of the Entity-Attribute-Value pattern... which, while it has its places, should typically be a pattern of last resort.
I've inherited a SQL Server-based application, and it has a stored procedure that contains the following, but it hits a timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives, such as ROW_NUMBER() OVER (PARTITION BY ...).
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.
SQL performance problems are seldom addressed by rewriting the query. The compiler already knows how to rewrite it anyway. The problem is always indexing. To get MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2 you need an index on BItems.ICID.
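A minimal sketch of those two indexes (the index names are my own):
CREATE INDEX IX_BData_BID_StatusTime ON dbo.BData (BID, StatusTime);
CREATE INDEX IX_BItems_ICID ON dbo.BItems (ICID);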
The query could also probably be expressed as a correlated APPLY, because it seems that is what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
This is not semantically the same query as the OP's: the OP's version would return multiple rows on a StatusTime collision. My guess, though, is that this is what is actually desired ('the most recent BData for this BItem').
Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
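As a sketch of what a covering index might look like here (the INCLUDE column names are hypothetical, since the real column list isn't shown in the question):
CREATE INDEX IX_BData_Covering
ON dbo.BData (BID, StatusTime DESC)
INCLUDE (StatusText, UpdatedBy);  -- hypothetical columns the query actually needs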
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick: avoiding the schema prefix
This may be a late response, but I recently ran into the same performance issue, where a simple query involving max() was taking more than an hour to execute.
After looking at the execution plan, it seems that in order to perform the max() function, every record meeting the where clause condition must be fetched. In your case, every record in the table needs to be fetched before the max() function is performed. Also, indexing BData.StatusTime will not speed up the query: indexing is useful for looking up a particular record, but it will not help with performing the comparison.
In my case, I didn't have the group by, so all I did was use an ORDER BY ... DESC clause and SELECT TOP 1. The query went from over an hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully your query speeds up.
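As an illustration, the single-row version of that idea against the OP's table would be something like this (a sketch; it only applies when no per-BID grouping is needed):
SELECT TOP (1) *
FROM dbo.BData
ORDER BY StatusTime DESC;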
Cheers!
The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*, row_number() over (partition by bid order by statustime desc) as seqnum
from BData bd
) bd INNER JOIN
BItems b
ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE B.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.
It depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery, but more than likely it won't yield any faster results.
The best option would probably be to add an index on BID, with INCLUDE containing StatusTime, and, if possible, filtering that to the InternalIDs matching BItems.ICID = 2.
[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers / suggestions. Unfortunately I couldn't get any further with this, so have given-up trying for now.
It looks like the best solution is to re-write the application to UPDATE the latest data into a different table; that way it's a really quick and simple SELECT to get the latest readings.
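A rough sketch of that idea (the table name, extra columns, and parameter names are assumptions; only BID and StatusTime come from the original query):
-- One row per BID, overwritten as new readings arrive
CREATE TABLE dbo.BDataLatest (
    BID        int      NOT NULL PRIMARY KEY,
    StatusTime datetime NOT NULL
    -- ...plus whichever BData columns the reports need
);

-- Run for every new reading, e.g. from the insert code or a trigger
MERGE dbo.BDataLatest AS t
USING (SELECT @BID AS BID, @StatusTime AS StatusTime) AS s
    ON t.BID = s.BID
WHEN MATCHED AND s.StatusTime > t.StatusTime THEN
    UPDATE SET StatusTime = s.StatusTime
WHEN NOT MATCHED THEN
    INSERT (BID, StatusTime) VALUES (s.BID, s.StatusTime);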
Thanks again for the suggestions.
I have a performance issue on the following (example) select statement that returns the first row using a sub query:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT DISTINCT
FIRST_VALUE(L.LOCATION) OVER (ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
),
P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
The DISTINCT is causing the performance issue by performing a SORT and UNIQUE but I can't figure out an alternative.
I would, however, prefer something akin to the following, but correlating the outer table across two levels of SELECT statements doesn't work:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT LOCATION
FROM (SELECT L.LOCATION LOCATION,
ROWNUM RN
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
ORDER BY L.SORT1, L.SORT2 DESC
) R
WHERE RN <=1
), P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
Additionally:
- My permissions do not allow me to create a function.
- I am cycling through 10k to 100k records in the main query.
- The sub query could return 3 to 7 rows before limiting to 1 row.
Any assistance in improving the performance is appreciated.
It's difficult to understand without sample data and cardinalities, but does this get you what you want? A unique list of projects and items, with the first occurrence of a location?
SELECT
P.ITEM_NUMBER,
P.PROJECT_NUMBER,
MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM
LOCATIONS L
INNER JOIN
PROJECT P
ON L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
GROUP BY
P.ITEM_NUMBER,
P.PROJECT_NUMBER
I encountered a similar problem in the past -- and while this is not the ultimate solution (in fact it might just be corner-cutting) -- the Oracle query optimizer can be adjusted with the OPTIMIZER_MODE init param.
Have a look at chapter 11.2.1 on http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i38318
FIRST_ROWS: The optimizer uses a mix of cost and heuristics to find a best plan for fast delivery of the first few rows. Note: Using heuristics sometimes leads the query optimizer to generate a plan with a cost that is significantly larger than the cost of a plan without applying the heuristic. FIRST_ROWS is available for backward compatibility and plan stability; use FIRST_ROWS_n instead.
Of course there are tons of other factors you should analyse, like your indexes, join efficiency, query plan, etc.
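For example, the mode can be switched at session level before running the query (a sketch; FIRST_ROWS_10 is just one of the available FIRST_ROWS_n values):
-- favour plans that deliver the first 10 rows quickly
ALTER SESSION SET optimizer_mode = FIRST_ROWS_10;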
Please bear with me -- I know this is complex.
I have a table that contains apartments, and another that contains leases for those apartments. My task is to select the "most relevant" lease from the list. In general, that means the most recent lease, but there are a few quirks that make it more complex than just ordering by date.
That has led me to create this common table expression query inside a View, which I then JOIN with a number of others inside a Stored Procedure to get the results I need:
WITH TempTable AS (
SELECT l.BuildingID, l.ApartmentID, l.LeaseID, l.ApplicantID,
ROW_NUMBER() OVER (PARTITION BY l.ApartmentID ORDER BY s.Importance DESC, MovedOut, MovedIN DESC, LLSigned DESC, Approved DESC, Applied DESC) AS 'RowNumber'
FROM dbo.NPleaseapplicant AS l INNER JOIN
dbo.NPappstatus AS s ON l.BuildingID = s.BuildingID AND l.AppStatus = s.Code
)
SELECT BuildingID, ApartmentID, LeaseID, ApplicantID
FROM TempTable
WHERE RowNumber = 1
This works and returns the correct result. The challenge I'm facing is very slow performance.
As a test, I created a temp table inside the Stored Procedure instead of using View, and got much, much better performance:
CREATE TABLE #Relevant (
BuildingID int,
ApartmentID int,
LeaseID int,
ApplicantID int,
RowNumber int
)
INSERT INTO #Relevant (BuildingID, ApartmentID, LeaseID, ApplicantID, RowNumber)
SELECT l.BuildingID, l.ApartmentID, l.LeaseID, l.ApplicantID,
ROW_NUMBER() OVER (PARTITION BY l.ApartmentID ORDER BY s.Importance DESC, MovedOut, MovedIN DESC, LLSigned DESC, Approved DESC, Applied DESC) AS 'RowNumber'
FROM dbo.NPleaseapplicant AS l INNER JOIN
dbo.NPappstatus AS s ON l.BuildingID = s.BuildingID AND l.AppStatus = s.Code
WHERE (l.BuildingID = @BuildingID)
DROP TABLE #Relevant
At first glance this doesn't add up to me. I've heard that temp tables are notoriously bad for performance. The wrinkle is that I'm able to better limit the query in the Temp Table with a WHERE clause that I can't in the View. With over 10,000 leases across 16 buildings in the table, the ability to filter with the WHERE may drop the rows affected by 90% - 95%.
Bearing all that in mind, is there anything glaring that I'm missing here? Am I doing something wrong with the View that might cause the dreadful performance, or is it just a matter of a smaller result set in the Temp Table beating the unrestricted result set in the CTE?
EDIT: I should add that this business logic of selecting the "most relevant lease" is key to many reports in the system. That's why it was placed inside a View to begin with. The View gives us "Write Once, Use Many" capabilities whereas a Temp Table in a Stored Procedure would need to be recreated for every other Stored Proc in the system. Ugly.
EDIT #2: Could I use a Table Based Function instead of a view? Would that allow me to limit the rows affected up front, and still use the resulting dataset in a JOIN with other tables? If it works -- and has decent performance -- this would allow me to keep the business logic in one place (the function) instead of duplicating it in dozens of Stored Procedures.
Just to put a bow on this one, here's what I ended up doing:
Instead of using a View to join all possible rows from 2 or 3 tables, I created a Table Based Function that makes the same basic query. As one of the parameters I pass in the Building ID, and use that in a WHERE clause, like so:
SELECT l.BuildingID, l.ApartmentID, l.LeaseID, l.ApplicantID,
ROW_NUMBER() OVER (PARTITION BY l.ApartmentID ORDER BY s.Importance DESC, MovedOut, MovedIN DESC, LLSigned DESC, Approved DESC, Applied DESC) AS 'RowNumber'
FROM dbo.NPleaseapplicant AS l INNER JOIN
dbo.NPappstatus AS s ON l.BuildingID = s.BuildingID AND l.AppStatus = s.Code
WHERE (l.BuildingID = @BuildingID)
The result is that it drastically reduces the number of joins required, and it speeds up the query immensely.
I then change all the stored procedures that rely on the View to use the Function instead, and bingo -- huge performance gains.
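For completeness, a sketch of what that inline table-valued function might look like (the function name is my own; everything else comes from the query above):
CREATE FUNCTION dbo.fn_RelevantLeases (@BuildingID int)
RETURNS TABLE
AS
RETURN
(
    SELECT l.BuildingID, l.ApartmentID, l.LeaseID, l.ApplicantID,
           ROW_NUMBER() OVER (PARTITION BY l.ApartmentID ORDER BY s.Importance DESC, MovedOut, MovedIN DESC, LLSigned DESC, Approved DESC, Applied DESC) AS RowNumber
    FROM dbo.NPleaseapplicant AS l
    INNER JOIN dbo.NPappstatus AS s ON l.BuildingID = s.BuildingID AND l.AppStatus = s.Code
    WHERE l.BuildingID = @BuildingID
);
GO

-- Stored procedures can then filter and join it like a table:
SELECT r.BuildingID, r.ApartmentID, r.LeaseID, r.ApplicantID
FROM dbo.fn_RelevantLeases(@BuildingID) AS r
WHERE r.RowNumber = 1;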
You could also re-write the view with subquery syntax:
SELECT BuildingID, ApartmentID, LeaseID, ApplicantID
FROM
(
SELECT l.BuildingID, l.ApartmentID, l.LeaseID, l.ApplicantID,
ROW_NUMBER() OVER (PARTITION BY l.ApartmentID ORDER BY s.Importance DESC, MovedOut, MovedIN DESC, LLSigned DESC, Approved DESC, Applied DESC) AS 'RowNumber'
FROM dbo.NPleaseapplicant AS l INNER JOIN
dbo.NPappstatus AS s ON l.BuildingID = s.BuildingID AND l.AppStatus = s.Code
) subquery
WHERE RowNumber = 1
Doing this allows the bounding WHERE clause (at the point where the view is used) to be applied to the subquery, whereas in the CTE case it isn't bounded.
Views also have fewer issues around parallel execution plans than table-valued functions (although this one will probably be inlined anyway, making them effectively identical).
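For instance, assuming the view were named something like dbo.RelevantLeases (the name is hypothetical), callers could then write:
SELECT v.ApartmentID, v.LeaseID, v.ApplicantID
FROM dbo.RelevantLeases AS v
WHERE v.BuildingID = @BuildingID;  -- this predicate can be pushed down into the subquery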