Stored procedure execution taking a long time because of a function used inside - SQL

In SQL Server 2012 I have the following user defined function:
CREATE FUNCTION [dbo].[udfMaxDateTime]()
RETURNS datetime
AS
BEGIN
RETURN '99991231';
END;
This is then being used in a stored procedure like so:
DECLARE @MaxDate datetime = dbo.udfMaxDateTime();

DELETE FROM TABLE_NAME
WHERE ValidTo = @MaxDate
  AND Id NOT IN
  (
      SELECT MAX(Id)
      FROM TABLE_NAME
      WHERE ValidTo = @MaxDate
      GROUP BY COL1
  );
Now, if I run the stored procedure with the above code, it takes around 12 seconds to execute (1.2 million rows).
If I change the WHERE clauses to ValidTo = '99991231' then the stored procedure runs in under 1 second, and it runs in parallel.
Could anyone try and explain why this is happening?

It is not because of the user-defined function, it is because of the variable.
When you use a variable @MaxDate in the DELETE, the query optimizer doesn't know the value of this variable when generating the execution plan. So it generates a plan based on the available statistics on the ValidTo column and some built-in heuristic rules for cardinality estimation for an equality comparison.
When you use a literal constant in the query, the optimizer knows its value and can generate a more efficient plan.
If you add OPTION (RECOMPILE), the execution plan is not cached and is regenerated on every execution, so all parameter and variable values are known to the optimizer. It is quite likely that the query will run fast with this option. The option does add some compilation overhead, but it is noticeable only when you run the query very often.
DECLARE @MaxDate datetime = dbo.udfMaxDateTime();

DELETE FROM TABLE_NAME
WHERE ValidTo = @MaxDate
  AND Id NOT IN
  (
      SELECT MAX(Id)
      FROM TABLE_NAME
      WHERE ValidTo = @MaxDate
      GROUP BY COL1
  )
OPTION (RECOMPILE);
I highly recommend reading Slow in the Application, Fast in SSMS by Erland Sommarskog.
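If recompiling on every execution is too costly, another option (my addition, not part of the answer above, just a sketch) is to tell the optimizer which value to assume with OPTIMIZE FOR; here the sentinel date is known in advance:

DECLARE @MaxDate datetime = dbo.udfMaxDateTime();

DELETE FROM TABLE_NAME
WHERE ValidTo = @MaxDate
  AND Id NOT IN
  (
      SELECT MAX(Id)
      FROM TABLE_NAME
      WHERE ValidTo = @MaxDate
      GROUP BY COL1
  )
-- Compile the plan as if @MaxDate were the sentinel literal.
-- Unlike OPTION (RECOMPILE), the plan is compiled once and then cached.
OPTION (OPTIMIZE FOR (@MaxDate = '99991231'));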

Related

Maximizing query performance by joining with XML

While working on query performance optimisation, I noticed that the pattern below outperforms, by a wide margin, other more obvious ways of writing the same query. After looking at the execution plans, it appears this is due to parallelism.
The table MyTable has a clustered primary key on (Identifier, MyId, date). The @xml variable usually contains tens of entries, and the data returned is a few hundred thousand rows.
Is there a way to achieve parallelism without using the XML, or is this a standard pattern/trick?
SET QUOTED_IDENTIFIER ON;
DECLARE @xml xml;
SET @xml = '<recipe MyId="3654969" Identifier="foo1" StartDate="12-Dec-2017 00:00:00" EndDate="09-Jan-2018 23:59:59"/>
<recipe MyId="3670306" Identifier="foo2" StartDate="10-Jan-2018 00:00:00" EndDate="07-Feb-2018 23:59:59"/>
';

EXEC sp_executesql N'
SELECT date, val
FROM MyTable tbl
INNER JOIN (
    SELECT t.data.value(''@MyId'', ''int'') AS xmlMyId,
           t.data.value(''@StartDate'', ''datetime'') AS xmlStartDate,
           t.data.value(''@EndDate'', ''datetime'') AS xmlEndDate,
           t.data.value(''@Identifier'', ''varchar(32)'') AS xmlIdentifier
    FROM @queryXML.nodes(''/recipe'') t(data) ) cont
    ON tbl.MyId = cont.xmlMyId
   AND tbl.date >= cont.xmlStartDate
   AND tbl.date <= cont.xmlEndDate
WHERE Identifier = cont.xmlIdentifier
ORDER BY date', N'@queryXML xml', @xml;
For example, the stored procedure below, which returns the same data, severely underperforms the query above (parameters for the stored proc are passed in and the whole thing is executed using sp_executesql).
SELECT tbl.date, val
FROM marketdb.dbo.MyTable tbl
INNER JOIN @MyIds ids ON tbl.MyId = ids.MyId
    AND (ids.StartDate IS NULL OR (ids.StartDate IS NOT NULL AND ids.StartDate <= tbl.date))
    AND (ids.EndDate IS NULL OR (ids.EndDate IS NOT NULL AND tbl.date <= ids.EndDate))
WHERE tbl.Identifier IN (SELECT Identifier FROM @identifier_list)
  AND date >= @start_date AND date <= @end_date
The actual execution plan of the XML query (screenshot not reproduced here) shows a parallel plan.
See also:
sp_executesql is slow with parameters
SQL Server doesn't have the statistics for the table variable?
As Jeroen Mostert said, table variables do not have statistics, so the actual execution plan is not optimal. In my case the XML version of the query was parallelised, whereas the stored proc was not (this is what I mean by the execution plan not being optimal).
A way to help the optimiser is to add an appropriate primary key or index to the table variables. One could also create statistics on the columns in question, but on the SQL Server version I am using, creating statistics on table variables is not supported.
Having added an index on all columns of the table variables, the optimiser started parallelising the query and the execution speed improved greatly.
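As an illustration only (the real column list is not shown in the question, so these names are hypothetical), a table variable can get an index through a PRIMARY KEY or UNIQUE constraint in its declaration; from SQL Server 2014 an inline INDEX clause works as well:

-- Hypothetical declaration; substitute the real @MyIds columns.
DECLARE @MyIds TABLE
(
    MyId      int      NOT NULL,
    StartDate datetime NULL,
    EndDate   datetime NULL,
    PRIMARY KEY (MyId)                          -- gives the optimizer a useful access path
    -- , INDEX ix_dates (StartDate, EndDate)    -- inline index syntax, SQL Server 2014+
);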

Big difference in Estimated and Actual rows when using a local variable

This is my first post on Stack Overflow, so I hope I'm correctly following all protocols!
I'm struggling with a stored procedure in which I create a table variable and fill it with an insert statement using an inner join. The insert itself is simple, but it gets complicated because the join is filtered on a local variable. Since the optimizer doesn't have statistics for this variable, my estimated row count gets skewed.
The specific piece of code that causes trouble:
declare @minorderid int

select @minorderid = MIN(lo.order_id)
from [order] lo with (nolock)
where lo.order_datetime >= @datefrom

insert into @OrderTableLog_initial
    (order_id, order_log_id, order_id, order_datetime, account_id, domain_id)
select ot.order_id, lol.order_log_id, ot.order_id, ot.order_datetime, ot.account_id, ot.domain_id
from [order] ot with (nolock)
inner join order_log lol with (nolock)
    on ot.order_id = lol.order_id
   and ot.order_datetime >= @datefrom
where (ot.domain_id in (1, 2, 4) and lol.order_log_id not in (select order_log_id
                                                              from dbo.order_log_detail lld with (nolock)
                                                              where order_id >= @minorderid)
       or
       (ot.domain_id = 3 and ot.order_id not in (select order_id
                                                 from dbo.order_log_detail_spa llds with (nolock)
                                                 where order_id >= @minorderid)
       ))
order by lol.order_id, lol.order_log_id
The @datefrom local variable is also declared earlier in the stored procedure:
declare @datefrom datetime

if datepart(hour, GETDATE()) between 4 and 9
begin
    set @datefrom = '2011-01-01'
end
else
begin
    set @datefrom = DATEADD(DAY, -2, GETDATE())
end
I've also tested this with a temporary table instead of a table variable, but nothing changes. However, when I replace the local variable >= @datefrom with a fixed datestamp, my estimates and actuals are almost the same.
ot.order_datetime >= @datefrom: SQL Sentry Plan Explorer (plan screenshot)
ot.order_datetime >= '2017-05-03 18:00:00.000': SQL Sentry Plan Explorer (plan screenshot)
I've come to understand that there's a way to fix this by turning the code into a dynamic SP, but I'm not sure how to do that. I would be grateful if someone could give me suggestions on how to do this. Maybe I have to use a completely different approach? Forgive me if I forgot to mention something; this is my first post.
EDIT:
MSSQL version = 11.0.5636
I've also tested with trace flag 2453, but with no success
Best regards,
Peter
Indeed, the behavior you are experiencing is because of the variables. SQL Server won't store an execution plan for each and every possible input, so for some inputs the cached execution plan may or may not be optimal.
To answer your explicit question: you'll have to build the query as a string in an nvarchar variable and then execute that string.
Some notes before the actual code:
This can be prone to SQL injection (in general)
SQL Server will store the plans separately, meaning they will use more memory and possibly push other plans out of the cache
Using an imaginary setup, this is what you want to do:
DECLARE @inputDate DATETIME2 = '2017-01-01 12:21:54';
DECLARE @dynamicSQL NVARCHAR(MAX) = CONCAT('SELECT col1, col2 FROM MyTable WHERE myDateColumn = ''', FORMAT(@inputDate, 'yyyy-MM-dd HH:mm:ss'), ''';');

INSERT INTO @myTableVar (col1, col2)
EXEC sp_executesql @stmt = @dynamicSQL;
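As a safer variation (my sketch, not part of the original answer, and assuming @myTableVar is declared elsewhere as above): keep the statement parameterised and add OPTION (RECOMPILE). This avoids the string concatenation and the injection risk while still letting the optimizer compile for the actual value:

DECLARE @inputDate DATETIME2 = '2017-01-01 12:21:54';
DECLARE @sql NVARCHAR(MAX) = N'
    SELECT col1, col2
    FROM MyTable
    WHERE myDateColumn = @p_date
    OPTION (RECOMPILE);';   -- the plan is built for the sniffed value of @p_date

INSERT INTO @myTableVar (col1, col2)
EXEC sp_executesql @sql, N'@p_date DATETIME2', @p_date = @inputDate;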
As an additional note:
You can try to use EXISTS and NOT EXISTS instead of IN and NOT IN.
You can try to use a temp table (#myTempTable) instead of a table variable and put some indexes on it. Physical temp tables can perform better with large amounts of data, you can index them, and they get statistics. (For more info you can go here: What's the difference between a temp table and table variable in SQL Server? or to the official documentation.) A minimal sketch of this variant follows.
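A minimal sketch of that temp-table variant; the column list is made up, since the real columns of @OrderTableLog_initial are not shown in the question:

-- Hypothetical column list; substitute the real columns.
CREATE TABLE #OrderTableLog_initial
(
    order_id       int      NOT NULL,
    order_log_id   int      NOT NULL,
    order_datetime datetime NOT NULL,
    account_id     int      NULL,
    domain_id      int      NULL
);

-- An index that matches the later joins/filters; unlike a table variable,
-- the temp table also gets column statistics the optimizer can use.
CREATE INDEX IX_OrderTableLog_order_id
    ON #OrderTableLog_initial (order_id)
    INCLUDE (order_datetime);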

Does SQL Server optimize DATEADD calculation in select query?

I have a query like this on SQL Server 2008:
DECLARE @START_DATE DATETIME
SET @START_DATE = GETDATE()

SELECT * FROM MY_TABLE
WHERE TRANSACTION_DATE_TIME > DATEADD(MINUTE, -1440, @START_DATE)
In the SELECT query above, does SQL Server optimize the query so that the DATEADD result is not calculated again and again, or is it my own responsibility to store the DATEADD result in a temp variable?
SQL Server functions that are considered runtime constants are evaluated only once. GETDATE() is such a function, and DATEADD(..., constant, GETDATE()) is also a runtime constant. By leaving the actual function call inside the query you let the optimizer see the value that will actually be used (as opposed to a sniffed variable value), and it can then adjust its cardinality estimates accordingly, possibly coming up with a better plan.
Also read this: Troubleshooting Poor Query Performance: Constant Folding and Expression Evaluation During Cardinality Estimation.
@Martin Smith
You can run this query:
set nocount on;
declare @known int;
select @known = count(*) from sysobjects;
declare @cnt int = @known;
while @cnt = @known
    select @cnt = count(*) from sysobjects where getdate() = getdate();
select @cnt, @known;
In my case, after 22 seconds it hit the boundary case and the loop exited. The important thing is that the loop exited with @cnt equal to zero. If getdate() were evaluated per row, one would expect @cnt to differ from the correct @known count, but not to be exactly 0. The fact that @cnt is zero when the loop exits shows that each getdate() was evaluated once, and the same constant value was then used for the WHERE filtering of every row (matching none, because the two calls straddled a clock tick). I am aware that one positive example does not prove a theorem, but I think the case is conclusive enough.
Surprisingly, I've found that using GETDATE() inline seems to be more efficient than performing this type of calculation beforehand.
DECLARE @sd1 DATETIME, @sd2 DATETIME;
SET @sd1 = GETDATE();

SELECT * FROM dbo.table
WHERE datetime_column > DATEADD(MINUTE, -1440, @sd1);

SELECT * FROM dbo.table
WHERE datetime_column > DATEADD(MINUTE, -1440, GETDATE());

SET @sd2 = DATEADD(MINUTE, -1440, @sd1);
SELECT * FROM dbo.table
WHERE datetime_column > @sd2;
If you check the plans for those, the middle query will always come out with the lowest cost (but not always the lowest elapsed time). Of course it may depend on your indexes and data, and you should not assume that the same pre-emptive optimization will work on another query just because it worked on one. My instinct would be to not perform any calculations inline and instead use the @sd2 variation above... but I've learned that I can't trust my instinct all the time, and I can't make general assumptions based on behavior I experience in particular scenarios. A quick way to compare the variants yourself is sketched below.
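A hedged sketch of that comparison, reusing the placeholder names from the examples above (table is a reserved word, so it is bracketed here):

-- Show CPU/elapsed time and logical reads for each variant in one batch.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

DECLARE @sd1 DATETIME = GETDATE();

SELECT * FROM dbo.[table] WHERE datetime_column > DATEADD(MINUTE, -1440, @sd1);
SELECT * FROM dbo.[table] WHERE datetime_column > DATEADD(MINUTE, -1440, GETDATE());

DECLARE @sd2 DATETIME = DATEADD(MINUTE, -1440, @sd1);
SELECT * FROM dbo.[table] WHERE datetime_column > @sd2;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;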
It will be executed just once. You can double-check this in the execution plan ("Compute Scalar" -> Estimated Number of Executions == 1).

SQL query takes much longer on the first run compared to the next run

I'm running a procedure which takes around 1 minute the first time it executes, but on subsequent runs it takes around 9-10 seconds. After some time it again takes around 1 minute.
My procedure works with a single table which has 6 non-clustered indexes and 1 clustered index; the unique id column is of the uniqueidentifier data type and the table has 1,218,833 rows.
Can you guide me to where the problem is and what possible performance improvements there are?
Thanks in advance.
Here is the procedure.
PROCEDURE [dbo].[Proc]
(
    @HLevel NVARCHAR(100),
    @HLevelValue INT,
    @Date DATE,
    @Numbers NVARCHAR(MAX) = NULL
)
AS
DECLARE @LoopCount INT, @DateLastYear DATE

DECLARE @Table1 TABLE ( list of columns )
DECLARE @Table2 TABLE ( list of columns )

-- LOOP FOR 12 MONTHS OF DATA
SET @LoopCount = 12
WHILE (@LoopCount > 0)
BEGIN
    SET @LoopCount = @LoopCount - 1

    -- LAST YEAR'S DATA
    DECLARE @LastDate DATE;
    SET @LastDate = DATEADD(D, -1, DATEADD(yy, -1, DATEADD(D, 1, @Date)))

    INSERT INTO @Table1
    SELECT list of columns
    FROM Table3
    WHERE Date = @Date
      AND CASE
              WHEN @HLevel = 'criteria1' THEN col1
              WHEN @HLevel = 'criteria2' THEN col2
              WHEN @HLevel = 'criteria3' THEN col3
          END = @HLevelValue

    INSERT INTO @Table2
    SELECT list of columns
    FROM table4
    WHERE Date = @LastDate
      AND ( @Numbers IS NULL OR columnNumber IN ( SELECT * FROM dbo.ConvertNumbersToTable(@Numbers) ) )

    INSERT INTO @Table1
    SELECT list of columns
    FROM @Table2 Prf2
    WHERE Prf2.col1 IN (SELECT col2 FROM @Table1)
      AND Year(Date) = Year(@Date)

    SET @Date = DATEADD(D, -1, DATEADD(m, -1, DATEADD(D, 1, @Date)));
END

SELECT list of columns FROM @Table1
The first time the query runs, the data is not in the data cache and so has to be retrieved from disk. It also has to compile an execution plan. On subsequent runs the data will be in the cache, so it will not have to go to disk to read it, and it can reuse the execution plan generated originally. This means execution time can be much quicker, and it is why an ideal situation is to have a large amount of RAM in order to cache as much data in memory as possible (it's the data cache that offers the biggest performance improvement).
If execution times subsequently increase again, it's possible the data is being removed from the cache (execution plans can be removed from the cache too); it depends on how much pressure there is on RAM. If SQL Server needs to free some up, it will remove things from the cache. Data and execution plans that are used most often, or have the highest value, remain cached for longer.
There are of course other things that could be factors, such as what load is on the server at the time, whether your query is being blocked by other processes, etc.
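If you want to confirm that the slow runs really are cold-cache runs, a rough sketch (only ever on a test server, never in production; the parameter values here are made up) is to clear the caches and re-run the procedure:

-- WARNING: test/dev servers only. These flush caches for the whole instance.
CHECKPOINT;                -- write dirty pages so DROPCLEANBUFFERS empties everything
DBCC DROPCLEANBUFFERS;     -- empty the data cache (forces reads from disk)
DBCC FREEPROCCACHE;        -- discard cached execution plans (forces recompilation)

EXEC [dbo].[Proc] @HLevel = N'criteria1', @HLevelValue = 1, @Date = '2012-01-01';
-- If this run takes ~1 minute and an immediate second run takes ~10 seconds,
-- the difference is cache warm-up rather than the procedure itself.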
It seems that the stored procedure is being recompiled repeatedly after some time. To reduce recompilation, please check this article (a quick way to check the plan cache is sketched after the link):
http://blog.sqlauthority.com/2010/02/18/sql-server-plan-recompilation-and-reduce-recompilation-performance-tuning/
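A hedged way to check whether the plan is actually being evicted or recompiled between the fast and slow runs (the procedure name is the one from the question):

-- cached_time resets whenever the procedure gets a brand-new plan;
-- if it changes between a fast run and a slow run, the plan was evicted or recompiled.
SELECT ps.cached_time,
       ps.last_execution_time,
       ps.execution_count,
       ps.total_elapsed_time / ps.execution_count AS avg_elapsed_microseconds
FROM sys.dm_exec_procedure_stats AS ps
WHERE ps.object_id = OBJECT_ID(N'dbo.Proc');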

Will index be used when using OR clause in where

I wrote a stored procedure with optional parameters.
CREATE PROCEDURE dbo.GetActiveEmployee
    @startTime DATETIME = NULL,
    @endTime DATETIME = NULL
AS
SET NOCOUNT ON

SELECT columns
FROM table
WHERE (@startTime IS NULL OR table.StartTime >= @startTime) AND
      (@endTime IS NULL OR table.EndTime <= @endTime)
I'm wondering whether indexes on StartTime and EndTime will be used?
Yes, they will be used (well, probably; check the execution plan, but I do know that the optional-ness of your parameters shouldn't make any difference).
If you are having performance problems with your query then it might be a result of parameter sniffing. Try the following variation of your stored procedure and see if it makes any difference:
CREATE PROCEDURE dbo.GetActiveEmployee
    @startTime DATETIME = NULL,
    @endTime DATETIME = NULL
AS
SET NOCOUNT ON

DECLARE @startTimeCopy DATETIME
DECLARE @endTimeCopy DATETIME
SET @startTimeCopy = @startTime
SET @endTimeCopy = @endTime

SELECT columns
FROM table
WHERE (@startTimeCopy IS NULL OR table.StartTime >= @startTimeCopy) AND
      (@endTimeCopy IS NULL OR table.EndTime <= @endTimeCopy)
This disables parameter sniffing (SQL Server using the actual values passed to the SP to optimise the plan). In the past I've fixed some weird performance issues by doing this, though I still can't satisfactorily explain why. A couple of query hints, sketched below, achieve a similar effect more explicitly.
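Two query hints (my addition, not part of the original answer, using the question's placeholder names; OPTIMIZE FOR UNKNOWN requires SQL Server 2008 or later):

-- Variant 1: compile a fresh plan per execution using the actual parameter values.
SELECT columns
FROM table
WHERE (@startTime IS NULL OR table.StartTime >= @startTime)
  AND (@endTime   IS NULL OR table.EndTime   <= @endTime)
OPTION (RECOMPILE);

-- Variant 2: keep one cached plan but base it on average statistics,
-- which is what the local-variable copy trick achieves implicitly.
SELECT columns
FROM table
WHERE (@startTime IS NULL OR table.StartTime >= @startTime)
  AND (@endTime   IS NULL OR table.EndTime   <= @endTime)
OPTION (OPTIMIZE FOR UNKNOWN);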
Another thing you might want to try is splitting your query into several different statements depending on the NULL-ness of your parameters:
IF @startTime IS NULL
BEGIN
    IF @endTime IS NULL
        SELECT columns FROM table
    ELSE
        SELECT columns FROM table WHERE table.EndTime <= @endTime
END
ELSE
BEGIN
    IF @endTime IS NULL
        SELECT columns FROM table WHERE table.StartTime >= @startTime
    ELSE
        SELECT columns FROM table WHERE table.StartTime >= @startTime AND table.EndTime <= @endTime
END
This is messy, but might be worth a try if you are having problems. The reason it helps is that SQL Server can only have a single execution plan per SQL statement, whereas your statement can potentially return vastly different result sets.
For example, if you pass in NULL and NULL you will return the entire table, and a scan is the most optimal plan; if you pass in a small range of dates, it is more likely that a seek with row lookups will be the most optimal plan.
With the query as a single statement, SQL Server is forced to choose between these two options, so the plan is likely to be sub-optimal in certain situations. By splitting the query into several statements, SQL Server can use a different execution plan in each case.
(You could also use EXEC / dynamic SQL to achieve the same thing if you prefer; see the sketch after the link below.)
There is a great article on dynamic search criteria in SQL. The method I personally use from the article is the X = @X OR @X IS NULL style with OPTION (RECOMPILE) added at the end. If you read the article it will explain why:
http://www.sommarskog.se/dyn-search-2008.html
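A hedged sketch of that dynamic SQL approach: build the WHERE clause only from the parameters that were supplied, then execute with sp_executesql so the statement stays parameterised. The table name dbo.Employee and the column names are illustrative, not from the question:

DECLARE @sql NVARCHAR(MAX) = N'SELECT columns FROM dbo.Employee WHERE 1 = 1';

IF @startTime IS NOT NULL
    SET @sql += N' AND StartTime >= @startTime';
IF @endTime IS NOT NULL
    SET @sql += N' AND EndTime <= @endTime';

-- Each distinct combination of supplied parameters gets its own cached plan,
-- and every predicate that remains is sargable. Passing parameters the statement
-- does not reference is allowed by sp_executesql.
EXEC sp_executesql @sql,
     N'@startTime DATETIME, @endTime DATETIME',
     @startTime = @startTime, @endTime = @endTime;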
Yes, based on the query provided, indexes on (or including) the StartTime and EndTime columns can be used.
However, the [variable] IS NULL OR ... pattern makes the query non-sargable. If you don't want to use an IF statement (CASE is an expression and cannot be used for control-of-flow logic), dynamic SQL is the next alternative for performant SQL.
IF @startTime IS NOT NULL AND @endTime IS NOT NULL
BEGIN
    SELECT columns
    FROM TABLE
    WHERE starttime >= @startTime
      AND endtime <= @endTime
END
ELSE IF @startTime IS NOT NULL
BEGIN
    SELECT columns
    FROM TABLE
    WHERE starttime >= @startTime
END
ELSE IF @endTime IS NOT NULL
BEGIN
    SELECT columns
    FROM TABLE
    WHERE endtime <= @endTime
END
ELSE
BEGIN
    SELECT columns
    FROM TABLE
END
Dynamically changing searches based on the given parameters is a complicated subject, and doing it one way over another, even with only a very slight difference, can have massive performance implications. The key is to use an index; ignore compact code, ignore worrying about repeating code; you must produce a good query execution plan (use an index).
Read these and consider all the methods. Your best method will depend on your parameters, your data, your schema, and your actual usage:
Dynamic Search Conditions in T-SQL by Erland Sommarskog
The Curse and Blessings of Dynamic SQL by Erland Sommarskog
The portion of the above articles that applies to this query is Umachandar's Bag of Tricks: it basically comes down to defaulting the parameters to sentinel values so that the ORs are no longer needed. This gives the best index usage and overall performance:
CREATE PROCEDURE dbo.GetActiveEmployee
    @startTime DATETIME = NULL,
    @endTime DATETIME = NULL
AS
SET NOCOUNT ON

DECLARE @startTimeCopy DATETIME
DECLARE @endTimeCopy DATETIME
SET @startTimeCopy = COALESCE(@startTime, '01/01/1753')
SET @endTimeCopy = COALESCE(@endTime, '12/31/9999')

SELECT columns
FROM table
WHERE table.StartTime >= @startTimeCopy AND table.EndTime <= @endTimeCopy
Probably not. Take a look at this blog posting from Tony Rogerson, SQL Server MVP:
http://sqlblogcasts.com/blogs/tonyrogerson/archive/2006/05/17/444.aspx
You should at least get the idea that you need to test with credible data and examine the execution plans.
I don't think you can guarantee that the index will be used. It will depend a lot on the size of the table, the columns you are selecting, the structure of the index, and other factors.
Your best bet is to use SQL Server Management Studio (SSMS), run the query, and include the "Actual Execution Plan". Then you can study that and see exactly which index or indices were used.
You'll often be surprised by what you find.
This is especially true if there is an OR or IN in the query.
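For what it's worth, a quick way to capture the actual plan from a script rather than the SSMS toolbar button (a sketch; the procedure name is the one from the question and the parameter values are made up):

-- Returns the actual execution plan as XML alongside the results,
-- so you can see which indexes on StartTime/EndTime were really used.
SET STATISTICS XML ON;

EXEC dbo.GetActiveEmployee @startTime = '2012-01-01', @endTime = '2012-12-31';

SET STATISTICS XML OFF;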