My company has a series of SQL views. One critical view has a sub-select that fetches the MAX(id) from a large table and joins with another table.
On a sample, I populated a test table with 1m rows. MAX(id) (id is an integer column) takes 8 minutes. TOP 1 with ORDER BY id DESC also takes 8 minutes. Just experimenting, I tried MAX(id) OVER (PARTITION BY id), which takes one second, and the result set is correct. I'm not sure why this sped things up so much; any ideas are much appreciated. The new test table with 1m rows is tblmsg_nicholas.
INNER JOIN LongviewHoldTable lvhold WITH (NOLOCK)
    ON lvhold.MsgID = CASE tm.MsgType
        WHEN 'LV_BLIM'
            /*then (select max(tm2.ID) from [dbo].[TBLMSG_NICHOLAS] tm2
                    where msgtype = 'LV_ALLOC' and tm.GroupID = tm2.GroupID)*/
            /*then (SELECT TOP 1 ID FROM TBLMSG_NICHOLAS tm2
                    WHERE msgtype = 'LV_ALLOC' and tm.GroupID = tm2.GroupID ORDER BY ID DESC)*/
            THEN (SELECT MAX(tm2.ID) OVER (PARTITION BY ID)
                  FROM [dbo].[TBLMSG_NICHOLAS] tm2
                  WHERE msgtype = 'LV_ALLOC' AND tm.GroupID = tm2.GroupID)
        ELSE tm.ID
    END
WHERE
TA.TARGETTASKID IS NOT NULL AND
TA.RESPONSE IS NULL
About MAX(): it looks like you are computing your MAX() more or less like this:
select max(tm2.ID)
from [dbo].[TBLMSG_NICHOLAS] tm2
where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID
An index on TBLMSG_NICHOLAS (msgtype, GroupID, ID DESC) accelerates that subquery. The query planner can random-access that index directly: the first matching row contains the MAX(ID) value you want.
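For instance, something along these lines (the index name here is just illustrative):
-- Equality columns first, then the MAX target descending, so the first
-- matching index row already holds the MAX(ID) the subquery wants.
CREATE INDEX IX_TBLMSG_NICHOLAS_MsgType_GroupID_ID
    ON dbo.TBLMSG_NICHOLAS (msgtype, GroupID, ID DESC);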
But you are also using a dependent (correlated) subquery. It's usually a good idea to refactor such subqueries into JOINed independent subqueries, though it's hard to help you do that here because you didn't show your entire query.
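As a rough illustration of that kind of refactor, here is a sketch only (the full view isn't shown, and it assumes the goal is "the max LV_ALLOC ID per GroupID"):
-- Compute the max LV_ALLOC ID per group once as an independent derived table,
-- instead of re-running a correlated subquery for every outer row.
SELECT tm.*, alloc.MaxAllocID
FROM dbo.TBLMSG_NICHOLAS AS tm
LEFT JOIN (
    SELECT GroupID, MAX(ID) AS MaxAllocID
    FROM dbo.TBLMSG_NICHOLAS
    WHERE msgtype = 'LV_ALLOC'
    GROUP BY GroupID
) AS alloc
    ON alloc.GroupID = tm.GroupID;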
Related
I have the query below selecting items and one of their features from a hundred thousand rows of items.
But I am concerned about the performance of the sub query. Will it be executed after or before the WHERE clause?
Suppose I am selecting 25 items out of 10,000: will this subquery be executed only for the 25 items, or for all 10,000?
declare @BlockStart int = 1
, @BlockSize int = 25
;
select *, (
select Value_Float
from Features B
where B.Feature_ModelID = Itm.ModelID
and B.Feature_PropertyID = 5
) as Price
from (
select *
, row_number() over (order by ModelID desc) as RowNumber
from Models
) Itm
where Itm.RowNumber >= @BlockStart
and Itm.RowNumber < @BlockStart + @BlockSize
order by ModelID desc
The sub query in the FROM clause produces a full set of results, but the sub query in the SELECT clause will (generally!) only be run for the records included with the final result set.
As with all things SQL, there is a query optimizer involved, which may at times decide to create seemingly-strange execution plans. In this case, I believe we can be pretty confident, but I need to caution about making sweeping generalizations about SQL language order of operations.
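If you want to verify that against your own data rather than take it on faith, the actual execution plan (the Number of Executions property on the subquery's operator) or STATISTICS IO will show it. For example:
-- Compare logical reads on Features with and without the paging predicate.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the paged query here, then check the Messages tab
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;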
Moving on, have you seen the OFFSET/FETCH syntax available in SQL Server 2012 and later? This seems like a better way to handle the @BlockStart and @BlockSize values, especially as it looks like you're paging on the clustered key. (If you end up paging on an alternate column, the link shows a much faster method.)
Also, at risk of making generalizations again, if you can know that only one Features record exists per ModelID with Feature_PropertyID = 5, you will tend to get better performance using a JOIN:
SELECT m.*, f.Value_Float As Price
FROM Models m
LEFT JOIN Features f ON f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
ORDER BY m.ModelID DESC
OFFSET (@BlockStart - 1) ROWS FETCH NEXT @BlockSize ROWS ONLY -- OFFSET counts skipped rows, so the 1-based @BlockStart needs the -1
If you can't make that guarantee, you may get better performance from an APPLY operation:
SELECT m.*, f.Value_Float As Price
FROM Models m
OUTER APPLY (
SELECT TOP 1 Value_Float
FROM Features f
WHERE f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
) f
ORDER BY m.ModelID DESC
OFFSET (@BlockStart - 1) ROWS FETCH NEXT @BlockSize ROWS ONLY -- OFFSET counts skipped rows, so the 1-based @BlockStart needs the -1
Finally, this smells like yet another variation of the Entity-Attribute-Value pattern... which, while it has its places, should typically be a pattern of last resort.
I've inherited a SQL Server based application, and it has a stored procedure that contains the following, but it hits the timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives such as ROW_NUMBER() OVER (PARTITION BY ...).
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.
SQL performance problems are seldom addressed by rewriting the query; the compiler already knows how to rewrite it anyway. The problem is almost always indexing. To get MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2 you need an index on BItems(ICID).
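For reference, those two indexes could look something like this (the index names are just placeholders):
-- Lets MAX(StatusTime) ... GROUP BY BID be answered from the index order
CREATE INDEX IX_BData_BID_StatusTime ON dbo.BData (BID, StatusTime);
-- Supports the seek for WHERE B.ICID = 2
CREATE INDEX IX_BItems_ICID ON dbo.BItems (ICID);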
The query could probably also be expressed as a correlated APPLY, because it seems that this is what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
This is not semantically the same query as the OP's: the OP's version would return multiple rows on a StatusTime collision. My guess, though, is that this is what is actually desired ('the most recent BData for this BItem').
Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick : avoiding the schema prefix
This may be a late response, but I recently ran into the same performance issue, where a simple query involving max() was taking more than an hour to execute.
After looking at the execution plan, it seems that in order to perform the max() function, every record meeting the WHERE clause condition has to be fetched. In your case, every record in your table will need to be fetched before the max() is evaluated. Also, indexing BData.StatusTime on its own will not speed up the query; an index on that column alone helps with looking up particular records, but not with this kind of aggregate comparison.
In my case I didn't have the GROUP BY, so all I did was use an ORDER BY ... DESC clause with SELECT TOP 1. The query went from over 1 hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully your query speeds up.
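For what it's worth, the rewrite I used was roughly this shape (sketched here against this thread's BData names, which is an assumption on my part):
-- TOP (1) + ORDER BY replaces MAX() when only the single latest row is needed
SELECT TOP (1) *
FROM dbo.BData
ORDER BY StatusTime DESC;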
Cheers!
The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*, row_number() over (partition by bid order by statustime desc) as seqnum
from BData bd
) bd INNER JOIN
BItems b
ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE B.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.
Depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery. But more than likely it won't yield any faster results.
The best option would probably be to add an index on BID, with an INCLUDE containing StatusTime, and if possible filtering it down to just the rows relevant to BItems records with ICID = 2.
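As a rough sketch (the index name is made up, and note that a filtered index can only reference columns of BData itself, so the ICID = 2 restriction cannot be expressed in the index directly):
-- BID as the key, StatusTime carried along so the latest-per-BID lookup is index-only
CREATE INDEX IX_BData_BID_IncStatusTime
    ON dbo.BData (BID)
    INCLUDE (StatusTime);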
[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers and suggestions. Unfortunately I couldn't get any further with this, so I have given up trying for now.
It looks like the best solution is to re-write the application to UPDATE the latest data into a different table; that way it's a really quick and simple SELECT to get the latest readings.
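If anyone else goes this route, the shape I have in mind is roughly the following (the table and column names are guesses based on this thread, and the write side is only sketched):
-- A denormalised "latest row per BID" table, kept current by the writer
CREATE TABLE dbo.BDataLatest
(
    BID        int      NOT NULL PRIMARY KEY,
    StatusTime datetime NOT NULL
    -- plus whichever BData columns the readers actually need
);

-- On each new BData row, upsert the latest record for that BID
DECLARE @BID int = 1, @StatusTime datetime = GETDATE(); -- placeholder values for the sketch
UPDATE dbo.BDataLatest SET StatusTime = @StatusTime WHERE BID = @BID;
IF @@ROWCOUNT = 0
    INSERT INTO dbo.BDataLatest (BID, StatusTime) VALUES (@BID, @StatusTime);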
Thanks again for the suggestions.
I cannot post the actual query here, so I am posting the basic outline of the query, which should suffice. The query is used to page through and return a set of users ranked according to the output of a function, say F. F takes parameters from the User table and other tables that are joined in. The query is something like the following:
Select TOP (20) *
from (select row_number() OVER (Order By F desc) as rownum,
             [user].*, ..
      from [user]
      inner join X on [user].blah = X.blah
      left outer join Y on [user].foo = Y.foo
      where DATEDIFF(dd, LastLogin, GetDate()) > 200 and Y.bar > FUBAR) as temp
where rownum > 0
According to the execution plan, 91% of the cost is in the Sort. Since the sort is based on F, I cannot add an index to speed it up. The inner query reads all the records, filters them, and then sorts. Most of the time the users only look at results in pages 1 - 5 (one page has 20 records, hence the TOP (20)), so I was wondering whether there is any way I could limit the rows being processed and sorted, to make the query faster and less CPU intensive most of the time.
EDIT: When I say that tables are joined to calculate F, what I mean is this: F takes in parameters such as X.blah, Y.foo, and Y.bar. That's it. All these parameters also need to be returned as part of the result set, e.g. the latitude and longitude of the user's last location are stored in X.
At least you could try not to call DATEDIFF on every row:
declare @target_date datetime
set @target_date = DATEADD(dd, -200, GetDate())
Select TOP (20) *
from (select row_number() OVER (Order By F desc) as rownum,
             [user].*, ..
      from [user]
      inner join X on [user].blah = X.blah
      left outer join Y on [user].foo = Y.foo
      where LastLogin < @target_date and Y.bar > FUBAR) as temp
where rownum > 0
Perhaps do the same thing with FUBAR and F?
The example above doesn't gain you much performance by itself, but it shows the general idea of how to reduce function calls.
Not sure if and how much it'll help - but two things:
can you make sure all the foreign key columns and the columns in the WHERE clause (user.blah, X.blah, user.foo, Y.foo, Y.bar) are indeed indexed? This will significantly help JOIN performance.
If those columns are not indexed, there might also be a sort operation in the execution plan that SQL Server adds so it can use a Merge Join for the data. So your sort might not even really come from the OVER (ORDER BY F DESC) that you think causes it.
you're combining TOP (20) with row numbers, but you're not defining any real ORDER BY for the complete result set - so your results will be random at best. Also, if you already define the rownum, couldn't you just use:
SELECT (columns)
FROM (.......) as temp
WHERE rownum BETWEEN 0 AND 20
Some thoughts:
What kind of function is F? Can it be rewritten as an inline table-valued function? That would give the optimizer an opportunity to expand the function into a reusable execution plan (see the sketch after these notes).
You're doing a LEFT OUTER JOIN on Y, but then include a column from Y in your WHERE clause, effectively rendering it as an INNER JOIN. Although the optimizer probably renders the execution plan in the same way, I would clean that up so that it's easier to troubleshoot in the future.
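On the first thought above, this is a minimal sketch of what an inline table-valued function looks like; the name, parameters and formula are all hypothetical, since the real F wasn't shown:
-- Hypothetical inline TVF: the body is a single SELECT, so the optimizer can
-- expand it into the calling query instead of invoking it row by row.
CREATE FUNCTION dbo.fn_ScoreUser (@blah int, @foo int, @bar int)
RETURNS TABLE
AS
RETURN
(
    SELECT CAST(@blah + @foo - @bar AS float) AS Score   -- placeholder formula
);
It would then be consumed with something like CROSS APPLY dbo.fn_ScoreUser(X.blah, Y.foo, Y.bar) AS s and ordered by s.Score.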
I get a different result set for this query intermittently when I run it... sometimes it gives 1363, sometimes 1365, and sometimes 1366 results. The data doesn't change. What could be causing this, and is there a way to prevent it? The query looks something like this:
SELECT *
FROM
(
SELECT
RC.UserGroupId,
RC.UserGroup,
RC.ClientId AS CLID,
CASE WHEN T1.MultipleClients = 1 THEN RC.Salutation1 ELSE RC.DisplayName1 END AS szDisplayName,
T1.MultipleClients,
RC.IsPrimaryRecord,
RC.RecordTypeId,
RC.ClientTypeId,
RC.ClientType,
RC.IsDeleted,
RC.IsCompany,
RC.KnownAs,
RC.Salutation1,
RC.FirstName,
RC.Surname,
Relationship,
C.DisplayName Client,
RC.DisplayName RelatedClient,
E.Email,
RC.DisplayName + ' is the ' + R.Relationship + ' of ' + C.DisplayName Description,
ROW_NUMBER() OVER (PARTITION BY E.Email ORDER BY Relationship DESC) AS sequence_id
FROM
SSDS.Client.ClientExtended C
INNER JOIN
SSDS.Client.ClientRelationship R WITH (NOLOCK) ON C.ClientId = R.ClientID
INNER JOIN
SSDS.Client.ClientExtended RC WITH (NOLOCK) ON R.RelatedClientId = RC.ClientId
LEFT OUTER JOIN
SSDS.Client.Email E WITH (NOLOCK) ON RC.ClientId = E.ClientId
LEFT OUTER JOIN
SSDS.Client.UserDefinedData UD WITH (NOLOCK) ON C.ClientId = UD.ClientId AND C.UserGroupId = UD.UserGroupId
INNER JOIN
(
SELECT
E.Email,
CASE WHEN (COUNT(DISTINCT RC.DisplayName) > 1) THEN 1 ELSE 0 END AS MultipleClients
FROM
SSDS.Client.ClientExtended C
INNER JOIN
SSDS.Client.ClientRelationship R WITH (NOLOCK) ON C.ClientId = R.ClientID
INNER JOIN
SSDS.Client.ClientExtended RC WITH (NOLOCK) ON R.RelatedClientId = RC.ClientId
LEFT OUTER JOIN
SSDS.Client.Email E WITH (NOLOCK) ON RC.ClientId = E.ClientId
LEFT OUTER JOIN
SSDS.Client.UserDefinedData UD WITH (NOLOCK) ON C.ClientId = UD.ClientId AND C.UserGroupId = UD.UserGroupId
WHERE
Relationship IN ('z-Group Principle', 'z-Group Member ')
AND E.Email IS NOT NULL
GROUP BY E.Email
) T1 ON E.Email = T1.Email
WHERE
Relationship IN ('z-Group Principle', 'z-Group Member ')
AND E.Email IS NOT NULL
) T
WHERE
sequence_id = 1
AND T.UserGroupId IN (Select * from iCentral.dbo.GetSubUserGroups('471b9cbd-2312-4a8a-bb20-35ea53d30340',0))
AND T.IsDeleted = 0
AND T.RecordTypeId = 1
AND T.ClientTypeId IN
(
'1', --Client
'-1652203805' --NTU
)
AND T.CLID NOT IN
(
SELECT DISTINCT
UDDF.CLID
FROM
SLacsis_SLM.dbo.T_UserDef UD WITH (NOLOCK)
INNER JOIN
SLacsis_SLM.dbo.T_UserDefData UDDF WITH (NOLOCK)
ON UD.UserDef_ID = UDDF.UserDef_ID
INNER JOIN
SLacsis_SLM.dbo.T_Client CLL WITH (NOLOCK)
ON CLL.CLID = UDDF.CLID AND CLL.UserGroup_CLID = UD.UserID
WHERE
UD.UserDef_ID in
(
'F68F31CE-525B-4455-9D50-6DA77C66FEE5',
'A7CECB03-866C-4F1F-9E1A-CEB09474FE47'
)
AND UDDF.Data = 'NO'
)
ORDER BY T.Surname
EDIT:
I have removed all NOLOCKs (including the ones in views and UDFs) and I'm still having the same issue. I get the same results every time for the nested select (T), and if I put the result set of T into a temp table at the beginning of the query and join onto the temp table instead of the nested select, then the final result set is the same every time I run the query.
EDIT2:
I have been doing some more reading on ROW_NUMBER()... I'm partitioning by email (of which there are duplicates) and ordering by Relationship (of which there are only two possible values). Could this cause the query to be non-deterministic, and would there be a way to fix that?
EDIT3:
Here are the actual execution plans if anyone is interested http://www.mediafire.com/?qo5gkh5dftxf0ml. Is it possible to see that it is running as read committed from the execution plan? I've compared the files using WinMerge and the only differences seem to be the counts (ActualRows="").
EDIT4:
This works:
SELECT * FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY B.Email ORDER BY Relationship DESC) AS sequence_id
FROM
(
SELECT DISTINCT
RC.UserGroupId,
...
) B...
EDIT5:
When running the same ROW_NUMBER() query (T in the original question, just selecting RC.DisplayName and the ROW_NUMBER) twice in a row, I get a different rank for some people.
Does anyone have a good explanation or example of why or how ROW_NUMBER() over a result set that contains duplicates can rank differently each time it is run, and ultimately change the number of results?
EDIT6:
Ok, I think this makes sense to me now. This occurs when 2 people have the same email address (e.g. a husband and wife pair) and the same relationship. I guess in this case their ROW_NUMBER() ranking is arbitrary and can change every time it is run.
Your use of NOLOCK all over means you are doing dirty reads and will see uncommitted data, data that will be rolled back, transient and inconsistent data, etc.
Take these off, try again, and report back please.
Edit: some options with NOLOCKS removed
Data is really changing
Some parameter or filter is changing (eg GETDATE)
Some float comparisons running on different cores each time
See this on dba.se https://dba.stackexchange.com/q/4810/630
Embedded NOLOCKs in udfs or views (eg iCentral.dbo.GetSubUserGroups)
...
As I said yesterday in the comments, the row numbering for rows with duplicate E.Email, Relationship values will be arbitrary.
To make it deterministic you would need to do PARTITION BY B.Email ORDER BY Relationship DESC, SomeUniqueColumn. It is interesting that it changes between runs even though it uses the same execution plan; I assume this is a consequence of the hash join.
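Concretely, that just means extending the ORDER BY with something unique; for instance (using RC.ClientId as the tie-breaker is an assumption, substitute whatever is actually unique per row):
-- Deterministic numbering: ties on Relationship are broken by a unique column
ROW_NUMBER() OVER (PARTITION BY E.Email
                   ORDER BY Relationship DESC, RC.ClientId) AS sequence_id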
I think your problem is that the first row over the partition is not deterministic. I suspect that Email and Relationship are not unique.
ROW_NUMBER() OVER (PARTITION BY E.Email ORDER BY Relationship DESC) AS sequence_id
Later you examine the first row of the partition.
WHERE T.sequence_id = 1
AND T.UserGroupId ...
If that first row is arbitrary then you are going to get an arbitrary where comparison. You need to add to the ORDER BY to include a complete unique key. If there is no unique key then you need to make one or live with arbitrary results. Even on a table with a clustered PK the select row order is not guaranteed unless the entire PK is in the sort clause.
This probably has to do with ordering. You have a sequence_id defined as a row_number ordered by Relationship. You'll always get a sensible order by Relationship, but other than that your row_number will be random, so you can get different rows with sequence_id 1 each time. That in turn will affect your WHERE clause, and you can get different numbers of results. To fix this and get a consistent result, add another field to your row_number's ORDER BY; use a primary key to be certain of consistent results.
There's a recent KB that addresses problems with ROW_NUMBER() ... see FIX: You receive an incorrect result when you run a query that uses the row_number function in SQL Server 2008 for the details.
However this KB indicates that it's a problem when parallelism is invoked for execution, and looking at your execution plans I can't see this kicking in. But the fact that MS have found a problem with it in one situation makes me a little bit wary - i.e., could the same issue occur for a sufficiently complicated query (and your execution plan does look sufficiently large).
So it may be worth checking your patch levels of SQL Server 2008.
You must use only ORDER BY, without PARTITION BY:
ROW_NUMBER() OVER (ORDER BY Relationship DESC) AS sequence_id
Can anybody explain to me why this query takes 13 seconds:
SELECT Table1.Location, Table2.SID, Table2.CID, Table1.VID, COUNT(*)
FROM Table1 INNER JOIN
Table2 ON Table1.TID = Table2.TID
WHERE Table1.Last = Table2.Last
GROUP BY Table1.Location, Table2.SID, Table2.CID, Table1.VID
And this one only 1 second:
DECLARE @Test TABLE (Location INT, SID INT, CID INT, VID INT)
INSERT INTO @Test
SELECT Table1.Location, Table2.SID, Table2.CID, Table1.VID
FROM Table1 INNER JOIN
Table2 ON Table1.TID = Table2.TID
WHERE Table1.Last = Table2.Last
SELECT Location, SID, CID, VID, COUNT(*)
FROM @Test
GROUP BY Location, SID, CID, VID
When I remove the GROUP BY from the first query, it needs only 1 second too. I also tried writing a subselect and grouping its result, but that takes 13 seconds as well. I don't understand this.
Compare the execution plans for the two queries.
The GROUP BY may perform better if you have an INDEX on each of the columns in the GROUP BY (either one index each, or a single combined index over all the columns).
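Since the GROUP BY mixes columns from both tables, a single combined index across all of them isn't possible, but a rough sketch of per-table indexes that also cover the join and filter columns might be (the names and column order are guesses to validate against the actual plan):
-- Covers the join/filter columns plus Table1's grouping columns
CREATE INDEX IX_Table1_TID_Last ON dbo.Table1 (TID, Last, Location, VID);
-- Covers the join/filter columns plus Table2's grouping columns
CREATE INDEX IX_Table2_TID_Last ON dbo.Table2 (TID, Last, SID, CID);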
The reason your temporary version works better is probably because the GROUP BY is being performed on a much smaller subset of the data and therefore it is fast even without the index.
Your temp table method is by no means the wrong way of doing it. It is one of those situations where you weigh up the pros and cons of each method. An index on the main table may slow down your inserts/updates and increase your database size. The temp table, however, may not perform adequately once the data size increases over time.
It could be that in the first query you group and count on a larger query result than in the second query, where you work with an already smaller dataset. It could also be an indexing problem. Once you fix it, don't forget to re-check with a larger result set, because the seemingly "worse" query can perform better with larger datasets.