Order of Operation in SQL Server Query - sql

I have the below query selecting items and one of its feature from a hundred thousand row of items.
But I am concerned about the performance of sub query. Will it be executed after or before the where clause ?
Suppose, I am selecting 25 items from 10000 items, this subquery will be executed only for 25 items or 10000 items ?
declare #BlockStart int = 1
, #BlockSize int = 25
;
select *, (
select Value_Float
from Features B
where B.Feature_ModelID = Itm.ModelID
and B.Feature_PropertyID = 5
) as Price
from (
select *
, row_number() over (order by ModelID desc) as RowNumber
from Models
) Itm
where Itm.RowNumber >= #BlockStart
and Itm.RowNumber < #BlockStart + #BlockSize
order by ModelID desc

The sub query in the FROM clause produces a full set of results, but the sub query in the SELECT clause will (generally!) only be run for the records included with the final result set.
As with all things SQL, there is a query optimizer involved, which may at times decide to create seemingly-strange execution plans. In this case, I believe we can be pretty confident, but I need to caution about making sweeping generalizations about SQL language order of operations.
Moving on, have you seen the OFFSET/FECTH syntax available in Sql Server 2012 and later? This seems like a better way to handle the #BlockStart and #BlockSize values, especially as it looks like you're paging on the clustered key. (If you end up paging on an alternate column, the link shows a much faster method).
Also, at risk of making generalizations again, if you can know that only one Features record exists per ModelID with Feature_PropertyID = 5, you will tend to get better performance using a JOIN:
SELECT m.*, f.Value_Float As Price
FROM Models m
LEFT JOIN Features f ON f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
ORDER BY m.ModelID DESC
OFFSET #BlockStart ROWS FETCH NEXT #BlockSize ROWS ONLY
If you can't make that guarantee, you may get better performance from an APPLY operation:
SELECT m.*, f.Value_Float As Price
FROM Models m
OUTER APPLY (
SELECT TOP 1 Value_Float
FROM Features f
WHERE f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
) f
ORDER BY m.ModelID DESC
OFFSET #BlockStart ROWS FETCH NEXT #BlockSize ROWS ONLY
Finally, this smells like yet another variation of the Entity-Attribute-Value pattern... which, while it has it's places, typically should be a pattern of last resort.

Related

SQL Server - Optimizing MAX() on large tables

My company has a series of SQL views. One critical view has a sub select that fetches the max(id) from a large table and joins with another table.
On a sample, I populated a test table with 1m rows. Max(id) (id is an integer value) takes 8 minutes. Top with and order by desc take 8 minutes. Just experimenting, I tried max(id) over(partion by(id)) takes one second. The result set is correct. Not sure why this sped things up so much. Any ideas much appreciated. New test table with 1m rows is tblmsg_nicholas
INNER JOIN LongviewHoldTable lvhold WITH (NOLOCK) ON lvhold.MsgID = case tm.MsgType when 'LV_BLIM' /*then (select max(tm2.ID) from [dbo].[TBLMSG_NICHOLAS] tm2
where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID)*/
/*then (SELECT TOP 1 ID FROM TBLMSG_NICHOLAS TM2 WHERE msgtype = 'LV_ALLOC' and TM.GroupID =tm2.GroupID ORDER BY ID DESC)*/
then (select max(tm2.ID) OVER (PARTITION BY ID) from [dbo].[TBLMSG_NICHOLAS] tm2
where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID)
else tm.ID
end
WHERE
TA.TARGETTASKID IS NOT NULL AND
TA.RESPONSE IS NULL
About MAX(). It looks like you are computing your MAX() more-or-less like this.
select max(tm2.ID)
from [dbo].[TBLMSG_NICHOLAS] tm2
where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID
An index on TBLMSG_NICHOLAS (msgtype, GroupID, ID DESC) accelerates that subquery. The query planner can random-access that index directly: the first matching row contains the MAX(ID) value you want.
But you also use a so-called dependent -- correlated -- subquery. It's usually a good idea to refactor such subqueries into JOINed independent subqueries. But it's hard to help you do that because you didn't show your entire query.

Oracle FIRST_ROWS optimizer hint

I'm writing a query against what is currently a small table in development. In production, we expect it to grow quite large over the life of the table (the primary key is a number(10)).
My query does a selection for the top N rows of my table, filtered by specific criteria and ordered by date ascending. Essentially, we're assigning records, in bulk, to a specific user for processing. In my case, N will only be 10, 20, or 30.
I'm currently selecting my primary keys inside a subselect, using rownum to limit my results, like so:
SELECT log_number FROM (
SELECT
il2.log_number,
il2.final_date
FROM log il2
INNER JOIN agent A ON A.agent_id = il2.agent_id
INNER JOIN activity lat ON il2.activity_id = lat.activity_id
WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
WHERE lat.criteria2 = p_criteria2
AND lat.criteria3 = p_criteria3
AND il2.criteria3 = p_criteria4
AND il2.current_user IS NULL
GROUP BY il2.log_number, il2.final_date
ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;
Although I have a stopkey due to the rownum, I'm wondering if using an Oracle hint here (/*+ FIRST_ROWS(p_how_many) */) on the inner select will affect the query plan in the future. I'd like to know more about what the database does when this hint is specified; does it actually make a difference if you have to order the table? (Seems like it wouldn't.) Or does it only affect the select portion, after the access and join parts?
Looking at the explain plan now doesn't get me much as the table hasn't grown yet.
Thanks for your help!
Even with an ORDER BY, different execution plans could be selected when you limit the number of rows returned. It can be easier to select the top n rows by some order key, then sort those, than to sort the entire table then select the top n rows.
However, the GROUP BY is likely to restrict the benefit of this sort of optimization. Grouping (or a DISTINCT operation) generally prevents the optimizer from using a plan that can pipe individual rows into a STOPKEY operation.

Limit the number of rows being processed in this query

I cannot post the actual query here, so I am posting the basic outline of the query which should suffice. The query is used to page and return a set of users ranked according the output of a function, say F. F takes parameters from the User table and other tables which are joined. The query is something like as follows
Select TOP (20)
from (select row_number OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
left outer join Y on user.foo = Y.foo
where DATEDIFF(dd, LastLogin, GetDate()) > 200 and Y.bar > FUBAR) as temp
where rownum > 0
According to the execution plan 91% of the cost is in the Sort. Since the sort is based on F, I cannot add an index to speed the sort. The inner query queries all the records, filters then sorts. Now most of the time the users just look at results in the 1 - 5 pages (1 page has 20 records hence the Top(20)) so I was thinking if there was any way I could limit the rows being processed and sorted and make the query faster and less CPU intensive most of the time.
EDIT: When I say to Calculate F tables are joined, what I mean is this. F takes in parameters such as X.blah and Y.foo and Y.bar. That's it. All these parameters also need to be returned as part of the resultset. e.g. The Latitude and Longitude of the User's Last location is stored in X.
At least you could try not to call DATEDIFF on every row
declare #target_date datetime
set #target_date = DATEADD(dd, -200, GetDate())
Select TOP (20)
from (select row_number OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
left outer join Y on user.foo = Y.foo
where LastLogin < #target_date and Y.bar > FUBAR) as temp
where rownum > 0
Perhaps do the same thing with FUBAR and F?
The example above doesn't give you much performance but provides a general idea on how to reduce function calls
Not sure if and how much it'll help - but two things:
can you make sure all the foreign key columns and colums in the WHERE clause (user.blah, X.blah, user.foo, Y.foo, Y.bar) are indeed indexed? This will significantly help JOIN performance.
If those columns are not indexed, there also might be a sort operation in the execution plan that SQL Server uses so it can then use a Merge Join for the data. So your sort might not even really come from the OVER (ORDER BY F DESC) that you think causes the sort
you're combining TOP (20) with row numbers, but you're not defining any real ORDER BY for the complete result set - so your results will be random at best. Also, if you already define the rownum, couldn't you just use:
SELECT (columns)
FROM (.......) as temp
WHERE rownum BETWEEN 0 AND 20
Some thoughts:
What kind of function is F? Can it be rewritten as an inline table-valued function? That would give the optimizer an opportunity to expand the function into a reusable execution plan.
You're doing a LEFT OUTER JOIN on Y, but then include a column from Y in your WHERE clause, effectively rendering it as an INNER JOIN. Although the optimizer probably renders the execution plan in the same way, I would clean that up so that it's easier to troubleshoot in the future.

T-SQL SELECT TOP returns duplicates

I'm using SQL Server 2008 R2.
I'm not sure if I've discovered a strange SQL quirk, or (more likely) something in my code is causing this strange behaviour, particularly as Google has turned up nothing. I have a view called vwResponsible_Office_Address.
SELECT * FROM vwResponsible_Office_Address
..returns 403 rows
This code:
SELECT TOP 1000 * FROM vwResponsible_Office_Address
..returns 409 rows, as it includes 6 duplicates.
However this:
SELECT TOP 1000 * FROM vwResponsible_Office_Address
ORDER BY ID
..returns 403 rows again.
I can post the code for the view if it's relevant, but does it make sense for SELECT TOP to ever work in this way? I understand that SELECT TOP is free to return records in any order but don't understand why the number of records returned should vary.
The view does use cross apply which might be affecting the result set some how?
EDIT: View definition as requested
CREATE VIEW [dbo].[vwResponsible_Office_Address]
AS
SELECT fp.Entity_ID [Reg_Office_Entity_ID],
fp.Entity_Name [Reg_Office_Entity_Name],
addr.Address_ID
FROM [dbo].[Entity_Relationship] er
INNER JOIN [dbo].[Entity] fp
ON er.[Related_Entity_ID] = fp.[Entity_ID]
INNER JOIN [dbo].[Entity_Address] ea
ON ea.[Entity_ID] = fp.[Entity_ID]
CROSS APPLY (
SELECT TOP 1 Address_ID
FROM [dbo].[vwEntity_Address] vea
WHERE [vea].[Entity_ID] = fp.Entity_ID
ORDER by ea.[Address_Type_ID] ASC, ea.[Address_ID] DESC
) addr
WHERE [Entity_Relationship_Type_ID] = 25 -- fee payment relationship
UNION
SELECT ets.[Entity_ID],
ets.[Entity_Name],
addr.[Address_ID]
FROM dbo.[vwEntity_Entitlement_Status] ets
INNER JOIN dbo.[Entity_Address] ea
ON ea.[Entity_ID] = ets.[Entity_ID]
CROSS APPLY (
SELECT TOP 1 [Address_ID]
FROM [dbo].[vwEntity_Address] vea
WHERE vea.[Entity_ID] = ets.[Entity_ID]
ORDER by ea.[Address_Type_ID] ASC, ea.[Address_ID] DESC
) addr
WHERE ets.[Entitlement_Type_ID] = 40 -- registered office
AND ets.[Entitlement_Status_ID] = 11 -- active
I would assume that there is some non determinism going on which means that different access methods can return different results.
Looking at the view definition the only place that appears likely would be if vwEntity_Address has some duplicates for Entity_ID.
This would make the top 1 Address_ID returned arbitrary in that case which will effect the result of the union operation when it removes duplicates.
Definitely this does look extremely suspect
SELECT TOP 1 [Address_ID]
FROM [dbo].[vwEntity_Address] vea
WHERE vea.[Entity_ID] = ets.[Entity_ID]
ORDER by ea.[Address_Type_ID] ASC, ea.[Address_ID] DESC
You are ordering by values from the outer query in the cross apply. This will have absolutely no effect whatsoever as these will be constant for a particular CROSS APPLY invocation.
Can you try changing to
SELECT TOP 1 [Address_ID]
FROM [dbo].[vwEntity_Address] vea
WHERE vea.[Entity_ID] = ets.[Entity_ID]
ORDER by vea.[Address_ID] DESC
I was wondering if your view included a function, until I got to the end, where you say you use cross-apply. I would assume that is your problem, if your interested in the details, take a look at the various query plans.
EDIT: Expansion of answer
I.e. your function is non-deterministic and can either return more than one row per input or return the same row for different input. In combination, this means that you'll get exactly what you are seeing: duplicate rows under some circumsntaces. Adding a distinct to your view is the costly way to solve your problem, a better way would be to change your function so that for any input there is only one row output, and for a row output only one input will produce that row.
EDIT: I didn't see that you're now including your view definition.
Your problem is definitely the cross apply, in particular you are sorting inside the cross apply by values from OUTSIDE of the cross apply, making the top 1 effectively random.

Is there a way to get different results for the same SQL query if the data stays the same?

I get a different result set for this query intermittently when I run it...sometimes it gives 1363, sometimes 1365 and sometimes 1366 results. The data doesn't change. What could be causing this and is there a way to prevent it? Query looks something like this:
SELECT *
FROM
(
SELECT
RC.UserGroupId,
RC.UserGroup,
RC.ClientId AS CLID,
CASE WHEN T1.MultipleClients = 1 THEN RC.Salutation1 ELSE RC.DisplayName1 END AS szDisplayName,
T1.MultipleClients,
RC.IsPrimaryRecord,
RC.RecordTypeId,
RC.ClientTypeId,
RC.ClientType,
RC.IsDeleted,
RC.IsCompany,
RC.KnownAs,
RC.Salutation1,
RC.FirstName,
RC.Surname,
Relationship,
C.DisplayName Client,
RC.DisplayName RelatedClient,
E.Email,
RC.DisplayName + ' is the ' + R.Relationship + ' of ' + C.DisplayName Description,
ROW_NUMBER() OVER (PARTITION BY E.Email ORDER BY Relationship DESC) AS sequence_id
FROM
SSDS.Client.ClientExtended C
INNER JOIN
SSDS.Client.ClientRelationship R WITH (NOLOCK)ON C.ClientId = R.ClientID
INNER JOIN
SSDS.Client.ClientExtended RC WITH (NOLOCK)ON R.RelatedClientId = RC.ClientId
LEFT OUTER JOIN
SSDS.Client.Email E WITH (NOLOCK)ON RC.ClientId = E.ClientId
LEFT OUTER JOIN
SSDS.Client.UserDefinedData UD WITH (NOLOCK)ON C.ClientId = UD.ClientId AND C.UserGroupId = UD.UserGroupId
INNER JOIN
(
SELECT
E.Email,
CASE WHEN (COUNT(DISTINCT RC.DisplayName) > 1) THEN 1 ELSE 0 END AS MultipleClients
FROM
SSDS.Client.ClientExtended C
INNER JOIN
SSDS.Client.ClientRelationship R WITH (NOLOCK)ON C.ClientId = R.ClientID
INNER JOIN
SSDS.Client.ClientExtended RC WITH (NOLOCK)ON R.RelatedClientId = RC.ClientId
LEFT OUTER JOIN
SSDS.Client.Email E WITH (NOLOCK)ON RC.ClientId = E.ClientId
LEFT OUTER JOIN
SSDS.Client.UserDefinedData UD WITH (NOLOCK)ON C.ClientId = UD.ClientId AND C.UserGroupId = UD.UserGroupId
WHERE
Relationship IN ('z-Group Principle', 'z-Group Member ')
AND E.Email IS NOT NULL
GROUP BY E.Email
) T1 ON E.Email = T1.Email
WHERE
Relationship IN ('z-Group Principle', 'z-Group Member ')
AND E.Email IS NOT NULL
) T
WHERE
sequence_id = 1
AND T.UserGroupId IN (Select * from iCentral.dbo.GetSubUserGroups('471b9cbd-2312-4a8a-bb20-35ea53d30340',0))
AND T.IsDeleted = 0
AND T.RecordTypeId = 1
AND T.ClientTypeId IN
(
'1', --Client
'-1652203805' --NTU
)
AND T.CLID NOT IN
(
SELECT DISTINCT
UDDF.CLID
FROM
SLacsis_SLM.dbo.T_UserDef UD WITH (NOLOCK)
INNER JOIN
SLacsis_SLM.dbo.T_UserDefData UDDF WITH (NOLOCK)
ON UD.UserDef_ID = UDDF.UserDef_ID
INNER JOIN
SLacsis_SLM.dbo.T_Client CLL WITH (NOLOCK)
ON CLL.CLID = UDDF.CLID AND CLL.UserGroup_CLID = UD.UserID
WHERE
UD.UserDef_ID in
(
'F68F31CE-525B-4455-9D50-6DA77C66FEE5',
'A7CECB03-866C-4F1F-9E1A-CEB09474FE47'
)
AND UDDF.Data = 'NO'
)
ORDER BY T.Surname
EDIT:
I have removed all NOLOCK's (including the ones in views and UDFs) and I'm still having the same issue. I get the same results every time for the nested select (T) and if I put the result set of T into a temp table in the beginning of the query and join onto the temp table instead of the nested select then the final result set is the same every time I run the query.
EDIT2:
I have been doing some more reading on ROW_NUMBER()...I'm partitioning by email (of which there are duplicates) and ordering by Relationship (where there are only 1 of 2 relationships). Could this cause the query to be non-deterministic and would there be a way to fix that?
EDIT3:
Here are the actual execution plans if anyone is interested http://www.mediafire.com/?qo5gkh5dftxf0ml. Is it possible to see that it is running as read committed from the execution plan? I've compared the files using WinMerge and the only differences seem to be the counts (ActualRows="").
EDIT4:
This works:
SELECT * FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY B.Email ORDER BY Relationship DESC) AS sequence_id
FROM
(
SELECT DISTINCT
RC.UserGroupId,
...
) B...
EDIT5:
When running the same ROW_NUMBER() query (T in the original question, just selecting RC.DisplayName and ROW_NUMBER) twice in a row I get different rank for some people:
Does anyone have a good explanation/example of why or how ROW_NUMBER() over a result set that contains duplicates can rank differently each time it is run and ultimately change the number of results?
EDIT6:
Ok I think this makes sense to me now. This occurs when people 2 people have the same email address (e.g a husband and wife pair) and relationship. I guess in this case their ROW_NUMBER() ranking is arbitrary and can change every time it is run.
Your use of NOLOCK all over means you are doing dirty reads and will see uncommitted data, data that will be rolled back, transient and inconsistent data etc
Take these off, try again, report back pleas
Edit: some options with NOLOCKS removed
Data is really changing
Some parameter or filter is changing (eg GETDATE)
Some float comparisons running on different cores each time
See this on dba.se https://dba.stackexchange.com/q/4810/630
Embedded NOLOCKs in udfs or views (eg iCentral.dbo.GetSubUserGroups)
...
As I said yesterday in the comments the row numbering for rows with duplicate E.Email, Relationship values will be arbitrary.
To make it deterministic you would need to do PARTITION BY B.Email ORDER BY Relationship DESC, SomeUniqueColumn . Interesting that it changes between runs though using the same execution plan. I assume this is a consequence of the hash join.
I think your problem is the first row over the partition is not deterministic. I suspect that Email and Relationship is not unique.
ROW_NUMBER() OVER (PARTITION BY E.Email ORDER BY Relationship DESC) AS sequence_id
Later you examine the first row of the partition.
WHERE T.sequence_id = 1
AND T.UserGroupId ...
If that first row is arbitrary then you are going to get an arbitrary where comparison. You need to add to the ORDER BY to include a complete unique key. If there is no unique key then you need to make one or live with arbitrary results. Even on a table with a clustered PK the select row order is not guaranteed unless the entire PK is in the sort clause.
This probably has to do with ordering. You have a sequence_id defined as a row_number ordered by Relationship. You'll always get a sensible order by relationship, but other than that your row_number will be random. So you can get different rows with sequence_id 1 each time. That in turn will affect your where clause, and you can get different numbers of results. To fix this to get a consistent result, add another field to your row_number's order by. Use a primary key to be certain of consistent results.
There's a recent KB that addresses problems with ROW_NUMBER() ... see FIX: You receive an incorrect result when you run a query that uses the row_number function in SQL Server 2008 for the details.
However this KB indicates that it's a problem when parallelism is invoked for execution, and looking at your execution plans I can't see this kicking in. But the fact that MS have found a problem with it in one situation makes me a little bit wary - i.e., could the same issue occur for a sufficiently complicated query (and your execution plan does look sufficiently large).
So it may be worth checking your patch levels of SQL Server 2008.
U Must only use
Order by
without prtition by.
ROW_NUMBER() OVER (ORDER BY Relationship DESC) AS sequence_id