SQL Server 2005 Table Spool (Lazy Spool) performance

I have some legacy SQL (a stored procedure):
declare @FactorCollectionId int; select @FactorCollectionId = collectionID from dbo.collection where name = 'Factor'
declare @changeDate datetime; set @changeDate = getDate()
declare @changeTimeID int; set @changeTimeID = convert(int, convert(varchar(8), @changeDate, 112))
declare @MaxWindowID int; select @MaxWindowID = MAX(windowID) from dbo.window
select distinct @FactorCollectionId, ElementId, T.TimeID, @changeTimeID ChangeTimeID, 1 UserID, @MaxWindowID, 0 ChangeID
, null TransactionID, SystemSourceID, changeTypeID, 'R' OlapStatus, Comment, Net0 Delta0, Net0
, 1 CreatedBy, 1 UpdatedBy, @changeDate CreatedDate, @changeDate UpdatedDate, 1 CurrentRecord, MeasureTypeID
from dbo.aowCollectedFact FV
inner join dbo.timeView T on T.timeID >= FV.timeID
where FV.currentRecord = 1 --is current record
and T.CurrentHorizon <> 0 --indicator that Time is part of current horizon
and FV.collectionID = @FactorCollectionId --factor collections only
and FV.timeID = (select MAX(timeID) --latest collected fact timeID for given collectionID and elementID
                 from aowCollectedFact FV2
                 where FV2.collectionId = @FactorCollectionId
                 and FV2.elementId = FV.elementID)
and (((T.ForecastLevel = 'Month') and (T.FirstDayInMonth = T.Date)) --Date is first of month for monthly customers, or
or
((T.ForecastLevel = 'Quarter') and (T.FirstDayInQuarter = T.Date))) --Date is first of quarter for quarterly customers
and not exists (select 1 --record does not already exist in collected fact view
                from aowCollectedFact FV3 --for this factor collection, elementID, and timeID
                where FV3.collectionId = @FactorCollectionId
                and FV3.elementID = FV.elementId
                and FV3.timeID = T.timeID)
This SQL processes over 2 million rows, and I need to improve its performance. When I look at the execution plan, a lot of the time is spent on a Table Spool (Lazy Spool) operation (indexes exist on the tables and they work well).
How can I improve performance for this part?

Before seeing the execution plan or table indexes, I'll give my best educated guesses. First, here are a couple of links worth reading.
showplan operator of the week - lazy spool
Table spool/Lazy spool
INDEXING: Check that your indexes cover the columns you're selecting from the tables. Aim to have every column used in a JOIN or WHERE clause be part of an index key; the remaining columns in the SELECT list should be covered via INCLUDE.
OPERATORS: See if you can replace the not-equals ("<>") operators with a single greater-than or less-than operator. Can the condition and T.CurrentHorizon <> 0 be changed to and T.CurrentHorizon > 0?
JOINS: Get rid of the subqueries that reference tables outside themselves (correlated subqueries). For instance, the line and FV2.elementId = FV.elementID might be causing some problems. There's no reason you can't move that out of a subquery and into a JOIN against dbo.aowCollectedFact FV, given that you're already grouping (DISTINCT) in the main query.
DISTINCT: Change it to a GROUP BY. No reason other than that it's good practice and takes two minutes.
LAST NOTE: The exception to all of the above might be to leave the final subquery, the NOT EXISTS, as a subquery. If you change it to a JOIN, it becomes a LEFT JOIN ... WHERE ... IS NULL pattern, which can itself cause spooling operations. There's no great way around that one.
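Put together, the first three suggestions might look roughly like the sketch below. This is an untested sketch against the tables from the question: the derived-table JOIN replacing the correlated MAX subquery and the GROUP BY are my assumptions, and the elided column lists must be filled in from the original SELECT.

```sql
-- Sketch: correlated MAX subquery de-correlated into a derived table,
-- DISTINCT replaced with GROUP BY; NOT EXISTS deliberately left alone.
select @FactorCollectionId, FV.ElementId, T.TimeID /* remaining columns as in the original */
from dbo.aowCollectedFact FV
inner join (select elementID, MAX(timeID) maxTimeID  -- latest collected-fact timeID per element
            from dbo.aowCollectedFact
            where collectionID = @FactorCollectionId
            group by elementID) latest
        on latest.elementID = FV.elementID
       and latest.maxTimeID = FV.timeID
inner join dbo.timeView T on T.timeID >= FV.timeID
where FV.currentRecord = 1
  and T.CurrentHorizon > 0   -- only if the column never holds negative values
  and FV.collectionID = @FactorCollectionId
  -- plus the month/quarter filter and the NOT EXISTS, unchanged
group by FV.ElementId, T.TimeID /* and the remaining non-constant columns */
```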

Related

SQL Server - Optimizing MAX() on large tables

My company has a series of SQL views. One critical view has a sub-select that fetches the max(id) from a large table and joins it with another table.
As a sample, I populated a test table with 1M rows. MAX(id) (id is an integer value) takes 8 minutes; TOP 1 with ORDER BY id DESC also takes 8 minutes. Just experimenting, I tried max(id) over (partition by id), which takes one second, and the result set is correct. Not sure why this sped things up so much. Any ideas much appreciated. The new test table with 1M rows is tblmsg_nicholas.
INNER JOIN LongviewHoldTable lvhold WITH (NOLOCK) ON lvhold.MsgID = case tm.MsgType when 'LV_BLIM' /*then (select max(tm2.ID) from [dbo].[TBLMSG_NICHOLAS] tm2
where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID)*/
/*then (SELECT TOP 1 ID FROM TBLMSG_NICHOLAS TM2 WHERE msgtype = 'LV_ALLOC' and TM.GroupID =tm2.GroupID ORDER BY ID DESC)*/
then (select max(tm2.ID) OVER (PARTITION BY ID) from [dbo].[TBLMSG_NICHOLAS] tm2
where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID)
else tm.ID
end
WHERE
TA.TARGETTASKID IS NOT NULL AND
TA.RESPONSE IS NULL
About MAX(). It looks like you are computing your MAX() more-or-less like this.
select max(tm2.ID)
from [dbo].[TBLMSG_NICHOLAS] tm2
where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID
An index on TBLMSG_NICHOLAS (msgtype, GroupID, ID DESC) accelerates that subquery. The query planner can random-access that index directly: the first matching row contains the MAX(ID) value you want.
But you also use a so-called dependent -- correlated -- subquery. It's usually a good idea to refactor such subqueries into JOINed independent subqueries. But it's hard to help you do that because you didn't show your entire query.
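As a concrete sketch of both points (the index is the one recommended above; the OUTER APPLY shape is one assumed way to pull the lookup out of the CASE, based only on the fragment shown in the question):

```sql
-- The first row of this index for a given (msgtype, GroupID)
-- holds the MAX(ID), so the lookup becomes a single seek.
CREATE INDEX IX_tblmsg_msgtype_group
    ON dbo.TBLMSG_NICHOLAS (msgtype, GroupID, ID DESC);

-- Compute the latest LV_ALLOC ID per group alongside each row.
SELECT tm.ID, tm.MsgType, tm.GroupID  -- plus whatever the full query selects
FROM dbo.TBLMSG_NICHOLAS tm
OUTER APPLY (SELECT MAX(tm2.ID) AS MaxAllocID
             FROM dbo.TBLMSG_NICHOLAS tm2
             WHERE tm2.msgtype = 'LV_ALLOC'
               AND tm2.GroupID = tm.GroupID) alloc
INNER JOIN LongviewHoldTable lvhold WITH (NOLOCK)
        ON lvhold.MsgID = CASE tm.MsgType
                            WHEN 'LV_BLIM' THEN alloc.MaxAllocID
                            ELSE tm.ID
                          END;
```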

Performance slow in query its getting slow due to DATEDIFF function

I am writing a SQL query that performs slowly because of the DATEDIFF function, so no results make it into the mails. Please help me rework this query so that it produces output faster. I will put the query below.
SELECT DISTINCT isnull(hrr.SourceEmailID,'')
,''
,''
,hrr.RID
,hrr.ResID
,hrr.ReqID
,'Interview Rejected To Employee'
,(
SELECT TOP 1
RID
FROM HCM_TEMPLATE_LIBRARY WITH (NOLOCK)
WHERE Title = 'Interview Reject Mail To Employee (Applicant Source- EGES)'
)
,GETUTCDATE()
,hrr.CreatedUserID
,0
FROM hc_resume_bank hrb WITH (NOLOCK)
INNER JOIN hc_req_resume hrr WITH (NOLOCK)
ON hrr.resid = HRB.rid
INNER JOIN HC_REQ_RESUME_STAGE_STATUS hrrss WITH (NOLOCK) ON hrrss.ReqResID = hrr.RID
INNER JOIN HCM_RESUME_SOURCE hrs WITH (NOLOCK) ON hrs.RID = hrr.SourceID
INNER JOIN HC_REQ_RES_INTERVIEW_STAGES hrris ON hrris.ReqResId = hrr.RID
WHERE hrrss.stageid = 4
AND hrrss.statusid = 9
AND hrr.SourceID = 4
AND isnull(hrb.SourceEmailId, '') <> ''
AND isnull(hrr.SourceEmailId, '') <> ''
and hrr.AddedType=10
AND Datediff(MI, dateadd(mi, 330, hrrss.StatusDate), DATEADD(mi, 330, GETUTCDATE())) <=5
Assuming that you have established that DATEDIFF is the root cause of the poor performance, I suggest changing this:
Datediff(MI, dateadd(mi, 330, hrrss.StatusDate), DATEADD(mi, 330, GETUTCDATE())) <=5
to this:
hrrss.StatusDate >= DATEADD(MI, -5, GETUTCDATE())
The 330-minute offsets on both sides cancel out, so they can be dropped, and with the bare column on the left the predicate becomes sargable.
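For that predicate to seek rather than scan, StatusDate needs to lead an index. A sketch (the index name and INCLUDE list are assumptions; the equality columns from the query could also lead the key instead):

```sql
CREATE INDEX IX_hrrss_statusdate
    ON HC_REQ_RESUME_STAGE_STATUS (StatusDate)
    INCLUDE (ReqResID, StageID, StatusID);
```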
Salman A has a great answer that I'd like to expand on.
Just as Salman A suggested moving the function off hrrss.StatusDate, the same applies to SourceEmailId: wrapping the column in a function on the left side of a predicate prevents the use of an index on that column.
However, ISNULL() is a bit more tricky to resolve, and there are several possible ways it could be addressed.
Consider whether the column should really allow NULLs, and whether altering it to NOT NULL is an option. Then your WHERE clause would look like this:
AND hrb.SourceEmailId <> ''
AND hrr.SourceEmailId <> ''
It's also possible that SourceEmailId will always either hold a valid value or be NULL. This would be preferable, as NULL should be used where a value is unknown. In that case you shouldn't be checking for <> '' at all; simply check that the email IS NOT NULL.
AND hrb.SourceEmailId IS NOT NULL
AND hrr.SourceEmailId IS NOT NULL
If options 1 and 2 are not available, consider a UNION result set. In this case, you'd write one query for hrb.SourceEmailId <> '' and UNION it with a second query for hrb.SourceEmailId IS NOT NULL. Since you have checks on SourceEmailId in two different tables, that could mean as many as four queries. However, don't get caught up on the fact that it's more queries, or assume that must make it slower: if all four queries are properly tuned and each runs in 100 ms, that is better than one combined query running in 5 minutes.
More details on the issue, and possible workarounds for ISNULL(), can be found in the links below.
isnull-around-the-predicate-and-sargability
What are different ways to replace ISNULL() in a WHERE clause that uses only literal values?
Once these changes have been applied, you'll have a query that can actually use indexes on these columns. At that point, I'd start reviewing your execution plans and indexes, and possibly look at removing the DISTINCT. But as long as several predicates in your WHERE clause force a SCAN on every execution, doing those things now won't yield much benefit.
DISTINCT in a query is almost always an indicator of a badly written query, where the author joins a lot of tables, builds a huge intermediate result, and must then boil it down to its real size with DISTINCT. This is a costly operation, and it seems to apply to your query. If you simply want to make sure that hc_req_resume.resid has an entry in hc_resume_bank with a sourceemailid, use EXISTS or IN for that lookup, not a join.
Your query with appropriate lookup clauses:
SELECT
ISNULL(hrr.sourceemailid,'')
,''
,''
,hrr.rid
,hrr.resid
,hrr.reqid
,'Interview Rejected To Employee'
,(
SELECT TOP 1
rid
FROM hcm_template_library
WHERE title = 'Interview Reject Mail To Employee (Applicant Source- EGES)'
)
,GETUTCDATE()
,hrr.createduserid
,0
FROM hc_req_resume hrr
WHERE hrr.sourceid = 4
AND hrr.addedtype = 10
AND hrr.resid IN
(
SELECT hrb.rid
FROM hc_resume_bank hrb
WHERE hrb.sourceemailid <> ''
)
AND hrr.rid IN
(
SELECT hrrss.reqresid
FROM hc_req_resume_stage_status hrrss
WHERE hrrss.stageid = 4
AND hrrss.statusid = 9
AND hrrss.statusdate >= DATEADD(MI, -5, GETUTCDATE())
)
AND hrr.sourceid IN (SELECT hrs.rid FROM hcm_resume_source hrs)
AND hrr.rid IN (SELECT hrris.reqresid FROM hc_req_res_interview_stages);
The naming of the columns doesn't make things easier here. Why is the column sometimes called rid and sometimes reqresid? And then I see a rid combined with a resid. Is this just yet another name for the same thing, or are there two meanings of rid? And which table does the ID actually refer to? Is there a table called r, reqres, or res? It doesn't seem so, but then why does the ID have a different name from its table, so that the reader must guess what is what? We can't even guess whether it is possible for a rid to have no match in hc_req_res_interview_stages, or for a sourceid to have no match in hcm_resume_source. Usually you have a foreign key constraint on IDs, so either the ID is NULL (if that is allowed) or it has a match in the parent table, and a lookup would be pointless. Is that the case in your query? Or aren't those the parent tables, but just other child tables referring to the same parent?
Remove any lookups that are not needed. The lookups in hcm_resume_source and hc_req_res_interview_stages may be such candidates, but I cannot know.
At last you want appropriate indexes. For hc_req_resume this may be something like
create index idx1 on hc_req_resume (sourceid, addedtype, rid, resid);
Then you may want:
create index idx2 on hc_resume_bank (rid) where sourceemailid <> '';
create index idx3 on hc_req_resume_stage_status (stageid, statusid, statusdate, reqresid);
The order of the columns in the indexes should be adjusted according to their selectivity.
You search for a result in the future, is this correct? Edit: I realised it's just the last 5 minutes you are looking for, so in this case you might just as well remove the function on the left and see if this prevents the index scan.
About the slow performance: your query (focusing only on the DATEDIFF here) is not sargable this way. SQL Server will need to compute the expression for every row in the table first, always resulting in a table scan. Remove the function from the left side.
One way to work around this is to fetch the rows from the main table in a sargable way first, put them in a temp table, then apply the function to the temp table and use its IDs to get back to the main table for the results. See the example below.
IF OBJECT_ID('tempdb..#MyTableName') IS NOT NULL
DROP TABLE #MyTableName
CREATE TABLE #MyTableName
(
PK INT PRIMARY KEY IDENTITY (1,1) NOT NULL,
ID INT,
StatusDate DATETIME
)
INSERT INTO #MyTableName (ID,StatusDate )
SELECT
ID,StatusDate
FROM dbo.basetable p
WHERE p.StatusDate > GETUTCDATE() --narrow your date criteria as much as needed
GO
SELECT P.* FROM #MyTableName T
JOIN dbo.basetable P
ON P.Id = T.ID
WHERE Datediff(MI, dateadd(mi, 330, T.StatusDate), DATEADD(mi, 330, GETUTCDATE())) <= 5
OPTION (RECOMPILE)
;
You could also create a nonclustered index on your date column and see what it brings. Written the way you have it, the query will still always scan, but at least it has an index to scan. Written the sargable way, that index will help a bunch.

SQL Query Performance Issues Using Subquery

I am having issues with my query's run time. I want the query to automatically pull the max id for a column, because the table is indexed on that column. If I punch in the number manually, it runs in seconds, but I want the query to be more dynamic if possible.
I've tried placing the subquery in different places with no luck.
SELECT *
FROM TABLE A
JOIN TABLE B
ON A.SLD_MENU_ITM_ID = B.SLD_MENU_ITM_ID
AND B.ACTV_FLG = 1
WHERE A.WK_END_THU_ID_NU >= (SELECT DISTINCT MAX (WK_END_THU_ID_NU) FROM TABLE A)
AND A.WK_END_THU_END_YR_NU = YEAR(GETDATE())
AND A.LGCY_NATL_STR_NU IN (7731)
AND B.SLD_MENU_ITM_ID = 4314
I just want this to run faster. Maybe there is a different approach I should be taking?
I would move the subquery to the FROM clause and change the WHERE clause to refer only to A:
SELECT *
FROM A JOIN
     (SELECT MAX(WK_END_THU_ID_NU) as max_wet
      FROM A
     ) am
     ON A.WK_END_THU_ID_NU = am.max_wet JOIN
     B
     ON A.SLD_MENU_ITM_ID = B.SLD_MENU_ITM_ID AND
        B.ACTV_FLG = 1
WHERE A.WK_END_THU_END_YR_NU = YEAR(GETDATE()) AND
      A.LGCY_NATL_STR_NU IN (7731) AND
      A.SLD_MENU_ITM_ID = 4314; -- is the same as B
Then you want indexes. I'm pretty sure you want indexes on:
A(SLD_MENU_ITM_ID, WK_END_THU_END_YR_NU, LGCY_NATL_STR_NU, WK_END_THU_ID_NU)
B(SLD_MENU_ITM_ID, ACTV_FLG)
I will note that moving the subquery to the FROM clause probably does not affect performance, because SQL Server is smart enough to only execute it once. However, I prefer table references in the FROM clause when reasonable. I don't think a window function would actually help in this case.

Optimize WHERE clause in query

I have the following query:
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
WHERE table2.serialcode = '49257'
and date = (select max(a.date) from table1 a where a.serialcode ='49257');
It seems it is retrieving the select max subquery for each join. It takes a lot of time. Is there a way to optimize it? Any help will be appreciated.
Sub-selects that end up being evaluated per row of the main query can cause tremendous performance problems once you try to scale to larger numbers of rows.
Sub-selects can almost always be eliminated with a data model tweak.
Here's one approach: add a new is_latest column to the table to track whether the row holds the max value (for ties, break them with other fields such as a created timestamp or the row ID). Set it to 1 if true, else 0.
Then you can add where is_latest = 1 to your query and this will radically improve performance.
You can schedule the update to happen or add a trigger etc. if you need an automated way of keeping is_latest up to date.
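A sketch of the batch-update variant (table and column names follow the question; is_latest is the assumed new column):

```sql
-- Recompute is_latest for one serial after new rows are loaded.
UPDATE t
SET t.is_latest = CASE WHEN t.date = m.max_date THEN 1 ELSE 0 END
FROM table1 t
INNER JOIN (SELECT serialcode, MAX(date) AS max_date
            FROM table1
            GROUP BY serialcode) m
        ON m.serialcode = t.serialcode
WHERE t.serialcode = '49257';
```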
Other approaches involve 2 tables - one where you keep only the latest record and another table where you keep the history.
declare #maxDate datetime;
select #maxDate = max(a.date) from table1 a where a.serialcode ='49257';
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
WHERE table2.serialcode = '49257'
and date =#maxDate;
You can optimize this query using indexes. Here are some that should help: table1(serialcode, serial, date), table1(serialcode, date), and pivots(serialcode).
Note: I find it very strange that you have columns called serial and serialcode in the same table, and the join is on serial.
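In index-creation form, those suggestions would be something like this (the index names are made up):

```sql
CREATE INDEX ix_table1_serialcode_serial_date ON table1 (serialcode, serial, date);
CREATE INDEX ix_table1_serialcode_date ON table1 (serialcode, date);
CREATE INDEX ix_pivots_serialcode ON pivots (serialcode);
```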
Since you haven't mentioned which DB you are using, I'll answer as if it were Oracle.
You can use WITH clause to take out the subquery and make it perform just once.
WITH d AS (
SELECT max(a.date) max_date from TABLE1 a WHERE a.serialcode ='49257'
)
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
JOIN d on (table2.date = d.max_date)
WHERE table2.serialcode = '49257'
Please note that you haven't qualified the date column, so I assumed it belongs to table1 and not pivots; you can change it if not. A piece of advice on the same note: always qualify your columns using the table.column format.

Limit the number of rows being processed in this query

I cannot post the actual query here, so I am posting its basic outline, which should suffice. The query is used to page through and return a set of users ranked according to the output of a function, say F. F takes parameters from the User table and other joined tables. The query is something like the following:
Select TOP (20) *
from (select row_number() OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
left outer join Y on user.foo = Y.foo
where DATEDIFF(dd, LastLogin, GetDate()) > 200 and Y.bar > FUBAR) as temp
where rownum > 0
According to the execution plan, 91% of the cost is in the Sort. Since the sort is based on F, I cannot add an index to speed it up. The inner query reads all the records, filters, then sorts. Most of the time users only look at pages 1 - 5 of the results (one page has 20 records, hence the TOP (20)), so I was wondering whether there is any way to limit the rows being processed and sorted, making the query faster and less CPU-intensive most of the time.
EDIT: When I say tables are joined to calculate F, I mean this: F takes in parameters such as X.blah, Y.foo, and Y.bar. That's it. All these parameters also need to be returned as part of the result set; e.g. the latitude and longitude of the user's last location are stored in X.
At least you could try not to call DATEDIFF on every row:
declare @target_date datetime
set @target_date = DATEADD(dd, -200, GetDate())

Select TOP (20) *
from (select row_number() OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
left outer join Y on user.foo = Y.foo
where LastLogin < @target_date and Y.bar > FUBAR) as temp
where rownum > 0
Perhaps do the same thing with FUBAR and F?
The example above won't buy you much performance by itself, but it illustrates the general idea of reducing function calls.
Not sure if and how much it'll help - but two things:
can you make sure all the foreign key columns and the columns in the WHERE clause (user.blah, X.blah, user.foo, Y.foo, Y.bar) are indeed indexed? This will significantly help JOIN performance.
If those columns are not indexed, there might also be a sort operation in the execution plan that SQL Server uses so it can perform a Merge Join on the data. So your sort might not even come from the OVER (ORDER BY F DESC) that you think causes it.
you're combining TOP (20) with row numbers, but you're not defining any real ORDER BY for the complete result set, so your results will be arbitrary at best. Also, since you already define rownum, couldn't you just use:
SELECT (columns)
FROM (.......) as temp
WHERE rownum BETWEEN 0 AND 20
Some thoughts:
What kind of function is F? Can it be rewritten as an inline table-valued function? That would give the optimizer an opportunity to expand the function into a reusable execution plan.
You're doing a LEFT OUTER JOIN on Y, but then include a column from Y in your WHERE clause, effectively rendering it an INNER JOIN. Although the optimizer probably produces the same execution plan either way, I would clean that up so that it's easier to troubleshoot in the future.
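On the outline from the question, that cleanup is a one-line change, since Y.bar > FUBAR in the WHERE clause already discards the NULL rows an outer join would preserve:

```sql
-- state the intent explicitly instead of relying on the WHERE clause
inner join Y on user.foo = Y.foo and Y.bar > FUBAR
```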