Maximizing query performance by joining with XML

While working on query performance optimisation, I noticed that the pattern below outperforms other, more obvious ways of writing the same query by a wide margin. After looking at the execution plans, it appears this is due to parallelism.
The table MyTable has a clustered primary key on (Identifier, MyId, date). The @xml variable usually contains tens of entries and the data returned is a few hundred thousand rows.
Is there a way to achieve parallelism without using the XML or is this a standard pattern/trick?
SET QUOTED_IDENTIFIER ON;
DECLARE @xml xml;
SET @xml = '<recipe MyId="3654969" Identifier="foo1" StartDate="12-Dec-2017 00:00:00" EndDate="09-Jan-2018 23:59:59"/>
<recipe MyId="3670306" Identifier="foo2" StartDate="10-Jan-2018 00:00:00" EndDate="07-Feb-2018 23:59:59"/>
';
exec sp_executesql N'
SELECT date, val
FROM MyTable tbl
inner join (
    SELECT t.data.value(''@MyId'', ''int'') AS xmlMyId,
           t.data.value(''@StartDate'', ''datetime'') AS xmlStartDate,
           t.data.value(''@EndDate'', ''datetime'') AS xmlEndDate,
           t.data.value(''@Identifier'', ''varchar(32)'') as xmlIdentifier
    FROM @queryXML.nodes(''/recipe'') t(data) ) cont
ON tbl.MyId = cont.xmlMyId
AND tbl.date >= cont.xmlStartDate
AND tbl.date <= cont.xmlEndDate
WHERE Identifier = cont.xmlIdentifier
ORDER BY date', N'@queryXML xml', @xml;
For example, the stored procedure below, which returns the same data, severely underperforms the query above (parameters for the stored proc are passed in and the whole thing is executed using sp_executesql).
SELECT tbl.date, val
FROM marketdb.dbo.MyTable tbl
INNER JOIN @MyIds ids ON tbl.MyId = ids.MyId
    AND (ids.StartDate IS NULL OR (ids.StartDate IS NOT NULL AND ids.StartDate <= tbl.date))
    AND (ids.EndDate IS NULL OR (ids.EndDate IS NOT NULL AND tbl.date <= ids.EndDate))
WHERE tbl.Identifier IN (SELECT Identifier FROM @identifier_list) AND date >= @start_date AND date <= @end_date
The actual execution plan of the XML query is shown below.
See also:
sp_executesql is slow with parameters
SQL Server doesn't have the statistics for the table variable?

As Jeroen Mostert said, table variables do not have statistics and the actual execution plan is not optimal. In my case, the XML version of the query was parallelised whereas the stored proc version was not (this is what I mean by the execution plan not being optimal).
A way to help the optimiser is to add an appropriate primary key or an index to the table variables. One can also create statistics on the table columns in question, but in the version of SQL Server that I am using, table variables do not support statistics.
Having added an index on all columns in the table variables, the optimiser started parallelising the query and the execution speed was greatly improved.
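For illustration, a minimal sketch of such a declaration, assuming column names and types matching the query above (an inline PRIMARY KEY works on all versions; the inline INDEX syntax needs SQL Server 2014 or later):
-- Sketch only: column names/types are assumed from the query above; adjust to your schema.
DECLARE @MyIds TABLE
(
    MyId      int      NOT NULL,
    StartDate datetime NULL,
    EndDate   datetime NULL,
    PRIMARY KEY CLUSTERED (MyId)              -- assumes one row per MyId; widen the key if not
    -- , INDEX IX_Dates (StartDate, EndDate)  -- inline index syntax, SQL Server 2014+
);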

Related

Big difference in Estimated and Actual rows when using a local variable

This is my first post on Stackoverflow so I hope I'm correctly following all protocols!
I'm struggling with a stored procedure in which I create a table variable and fill it with an insert statement using an inner join. The insert itself is simple, but it gets complicated because the inner join condition uses a local variable. Since the optimizer doesn't have statistics for this variable, my estimated row count is getting skewed.
The specific piece of code that causes trouble:
declare @minorderid int

select @minorderid = MIN(lo.order_id)
from [order] lo with(nolock)
where lo.order_datetime >= @datefrom

insert into @OrderTableLog_initial
(order_id, order_log_id, order_id, order_datetime, account_id, domain_id)
select ot.order_id, lol.order_log_id, ot.order_id, ot.order_datetime, ot.account_id, ot.domain_id
from [order] ot with(nolock)
inner join order_log lol with(nolock)
    on ot.order_id = lol.order_id
    and ot.order_datetime >= @datefrom
where (ot.domain_id in (1,2,4) and lol.order_log_id not in ( select order_log_id
                                                             from dbo.order_log_detail lld with(nolock)
                                                             where order_id >= @minorderid
                                                           )
       or
       (ot.domain_id = 3 and ot.order_id not in ( select order_id
                                                  from dbo.order_log_detail_spa llds with(nolock)
                                                  where order_id >= @minorderid
                                                )
      ))
order by lol.order_id, lol.order_log_id
The @datefrom local variable is also declared earlier in the stored procedure:
declare @datefrom datetime

if datepart(hour, GETDATE()) between 4 and 9
begin
    set @datefrom = '2011-01-01'
end
else
begin
    set @datefrom = DATEADD(DAY, -2, GETDATE())
end
I've also tested this with a temporary table instead of a table variable, but nothing changes. However, when I replace the local variable >= @datefrom with a fixed date literal, my estimates and actuals are almost the same.
ot.order_datetime >= @datefrom: SQL Sentry Plan Explorer
ot.order_datetime >= '2017-05-03 18:00:00.000': SQL Sentry Plan Explorer
I've come to understand that there's a way to fix this by turning this code into a dynamic stored procedure, but I'm not sure how to do that. I would be grateful if someone could give me suggestions on how to do this. Maybe I have to use a completely different approach? Forgive me if I forgot to mention something; this is my first post.
EDIT:
MSSQL version = 11.0.5636
I've also tested with trace flag 2453, but with no success
Best regards,
Peter
Indeed, the behavior you are experiencing is caused by the variables. SQL Server won't store an execution plan for each and every possible input, so for some inputs the cached execution plan may not be optimal.
To answer your explicit question: You'll have to create a varchar variable and build the query as a string, then execute it.
Some notes before the actual code:
This can be prone to SQL injection (in general)
SQL Server will store the plans separately, meaning they will use more memory and possibly knock out other plans from the cache
Using an imaginary setup, this is what you want to do:
DECLARE @inputDate DATETIME2 = '2017-01-01 12:21:54';
DECLARE @dynamicSQL NVARCHAR(MAX) = CONCAT('SELECT col1, col2 FROM MyTable WHERE myDateColumn = ''', FORMAT(@inputDate, 'yyyy-MM-dd HH:mm:ss'), ''';');

INSERT INTO @myTableVar (col1, col2)
EXEC sp_executesql @stmt = @dynamicSQL;
As an additional note:
You can try to use EXISTS and NOT EXISTS instead of IN and NOT IN.
You can try to use a temp table (#myTempTable) instead of a local variable and put some indexes on it. Physical temp tables can perform better with large amounts of data, and you can put indexes on them. (For more info you can go here: What's the difference between a temp table and table variable in SQL Server? or to the official documentation.)
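As a hedged sketch of the first suggestion, the NOT IN branch for domain_id IN (1,2,4) from the question could be rewritten with NOT EXISTS as below (same tables and variables as in the question; whether the plan actually improves has to be verified). The domain_id = 3 branch against order_log_detail_spa can be rewritten the same way.
SELECT ot.order_id, lol.order_log_id, ot.order_datetime, ot.account_id, ot.domain_id
FROM [order] ot
INNER JOIN order_log lol
        ON ot.order_id = lol.order_id
       AND ot.order_datetime >= @datefrom
WHERE ot.domain_id IN (1, 2, 4)
  AND NOT EXISTS (SELECT 1                              -- correlated anti-join instead of NOT IN
                  FROM dbo.order_log_detail lld
                  WHERE lld.order_id >= @minorderid
                    AND lld.order_log_id = lol.order_log_id);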

Stored procedure execution taking long because of function used inside

In SQL Server 2012 I have the following user defined function:
CREATE FUNCTION [dbo].[udfMaxDateTime]()
RETURNS datetime
AS
BEGIN
RETURN '99991231';
END;
This is then being used in a stored procedure like so:
DECLARE @MaxDate datetime = dbo.udfMaxDateTime();

DELETE FROM TABLE_NAME
WHERE
    ValidTo = @MaxDate
    AND
    Id NOT IN
    (
        SELECT
            MAX(Id)
        FROM
            TABLE_NAME
        WHERE
            ValidTo = @MaxDate
        GROUP BY
            COL1
    );
Now, if I run the stored procedure with the above code, it takes around 12 seconds to execute (1.2 million rows).
If I change the WHERE clauses to ValidTo = '99991231', then the stored procedure runs in under 1 second and it runs in parallel.
Could anyone try and explain why this is happening?
It is not because of the user-defined function, it is because of the variable.
When you use a variable @MaxDate in the DELETE, the query optimizer doesn't know the value of this variable when generating the execution plan. So, it generates a plan based on the available statistics on the ValidTo column and some built-in heuristic rules for cardinality estimates when you have an equality comparison in a query.
When you use a literal constant in the query the optimizer knows its value and can generate a more efficient plan.
If you add OPTION (RECOMPILE), the execution plan will not be cached and will always be regenerated, and all parameter values will be known to the optimizer. It is quite likely that the query will run fast with this option. The option does add a certain overhead, but it is noticeable only when you run the query very often.
DECLARE @MaxDate datetime = dbo.udfMaxDateTime();

DELETE FROM TABLE_NAME
WHERE
    ValidTo = @MaxDate
    AND
    Id NOT IN
    (
        SELECT
            MAX(Id)
        FROM
            TABLE_NAME
        WHERE
            ValidTo = @MaxDate
        GROUP BY
            COL1
    )
OPTION (RECOMPILE);
I highly recommend reading Slow in the Application, Fast in SSMS by Erland Sommarskog.

Performance Dynamic SQL vs Temporary Tables

I'm wondering if copying an existing table into a temporary table results in worse performance compared to dynamic SQL.
To be concrete, I wonder if I should expect different performance between the following two SQL Server stored procedures:
CREATE PROCEDURE UsingDynamicSQL
(
    @ID INT ,
    @Tablename VARCHAR(100)
)
AS
BEGIN
    DECLARE @SQL VARCHAR(MAX)

    SELECT @SQL = 'Insert into Table2 Select Sum(ValColumn) From '
        + @Tablename + ' Where ID=' + @ID

    EXEC(@SQL)
END
CREATE PROCEDURE UsingTempTable
(
    @ID INT ,
    @Tablename VARCHAR(100)
)
AS
BEGIN
    CREATE TABLE #TempTable (ValColumn float, ID int)

    DECLARE @SQL VARCHAR(MAX)

    SELECT @SQL = 'Select ValColumn, ID From ' + @Tablename
        + ' Where ID=' + @ID

    INSERT INTO #TempTable
    EXEC ( @SQL );

    INSERT INTO Table2
    SELECT SUM(ValColumn)
    FROM #TempTable;

    DROP TABLE #TempTable;
END
I'm asking this since I'm currently using a procedure built in the latter style, where I create many temporary tables at the beginning as simple extracts of existing tables and afterwards work with these temporary tables.
Could I improve the performance of the stored procedure by getting rid of the temporary tables and using dynamic SQL instead? In my opinion the dynamic SQL version is a lot uglier to program, which is why I used temporary tables in the first place.
Table variables suffer performance problems because the query optimizer always assumes there will be exactly one row in them. If you have table variables holding > 100 rows, I'd switch them to temp tables.
Using dynamic SQL with EXEC(@sql) instead of EXEC sp_executesql @sql will prevent the query plan from being cached, which will probably hurt performance.
However, you are using dynamic SQL in both procedures. The only difference is that the second one has the unnecessary step of loading into a temp table first, then loading into the final table. Go with the first stored procedure you have, but switch to sp_executesql.
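A hedged sketch of that change, reusing the Table2 / ValColumn schema from the question: the table name still has to be concatenated (so it is wrapped in QUOTENAME), but @ID becomes a real parameter, which lets the cached plan be reused.
CREATE PROCEDURE UsingDynamicSQLParameterized
(
    @ID INT ,
    @Tablename SYSNAME
)
AS
BEGIN
    DECLARE @SQL NVARCHAR(MAX) =
        N'INSERT INTO Table2 SELECT SUM(ValColumn) FROM ' + QUOTENAME(@Tablename)
      + N' WHERE ID = @ID;';

    -- @ID is passed as a parameter instead of being concatenated into the string
    EXEC sp_executesql @SQL, N'@ID INT', @ID = @ID;
END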
In the posted query the temporary table is an extra write.
It is not going to help.
Don't just time a query; look at the query plan.
If you have two queries, the query plan will tell you the split.
And there is a difference between a table variable and a temp table.
The temp table is faster - the query optimizer does more with a temp table.
A temporary table can help in a few situations.
The output from a select is going to be used more than once:
You materialize the output so it is only executed once.
Where you see this is with an expensive CTE that is evaluated many times.
People often falsely think a CTE is only executed once - no, it is just syntax.
The query optimizer needs help:
An example is a self join on a large table with multiple conditions, where some of the conditions eliminate most of the rows.
A query into a #temp table can pre-filter the rows and also reduce the number of join conditions.
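To make the CTE point concrete, here is a minimal sketch (dbo.Orders, CustomerId and Amount are made-up names for illustration) of materializing an expensive aggregate into a #temp table once, indexing it, and reusing it, instead of letting a CTE be re-evaluated on every reference:
-- Hypothetical tables/columns, for illustration only.
SELECT CustomerId, SUM(Amount) AS TotalAmount
INTO #CustomerTotals
FROM dbo.Orders
GROUP BY CustomerId;

CREATE CLUSTERED INDEX IX_CustomerTotals ON #CustomerTotals (CustomerId);

-- The aggregate is computed once; every later query just reads the temp table.
SELECT CustomerId, TotalAmount FROM #CustomerTotals WHERE TotalAmount > 1000;
SELECT AVG(TotalAmount) AS AvgTotal FROM #CustomerTotals;

DROP TABLE #CustomerTotals;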
I agree with everyone else that you always need to test both... I'm putting it in an answer here so it's more clear.
If you have an index setup that is perfect for the final query, going to temp tables could be nothing but extra work.
If that's not the case, pre-filtering to a temp table may or may not be faster.
You can predict it at the extremes - if you're filtering down from a million to a dozen rows, I would bet it helps.
But otherwise it can be genuinely difficult to know without trying.
I agree with you that maintenance is also an issue and lots of dynamic sql is a maintenance cost to consider.

Why is SQL evaluating a WHERE clause that is False?

I've got a query in SQL Server 2008, and I can't understand why it takes so much longer to evaluate when I include a clause in the WHERE statement that shouldn't affect the result. Here is an example of the query:
declare @includeAll bit = 0;

SELECT
    Id
    ,Name
    ,Total
FROM
    MyTable
WHERE
    @includeAll = 1 OR Id = 3926
Obviously, in this case, @includeAll = 1 will evaluate to false; however, including it increases the time of the query as if it were always true. The result I get is correct with or without that clause: I only get the one entry with Id = 3926, but (in my real-world query) including that line increases the query time from well under a second to about 7 minutes... so it seems it's running the query as if the statement were true, even though it's not, while still returning the correct results.
Any light that can be shed on why would be helpful. Also, if you have a suggestion on working around it I'd take it. I want to have a clause such as this one so that I can include a parameter in a stored procedure that will make it disregard the Id that it has and return all results if set to true, but I can't allow that to affect the performance when simply trying to get one record.
You'd need to look at the query plan to be sure, but using OR like this will often make the engine scan in many DBMSs.
Also, read @Bogdan Sahlean's response for some great details on why this is happening.
This may not work, but you can try something like the following if you need to stick with straight SQL:
SELECT
Id
,Name
,Total
FROM
MyTable
WHERE Id = 3926
UNION ALL
SELECT
Id
,Name
,Total
FROM
MyTable
WHERE Id <> 3926
AND @includeAll = 1
If you are using a stored procedure, you could conditionally run the SQL either way instead which is probably more efficient.
Something like:
IF @includeAll = 0
BEGIN
    SELECT
        Id
        ,Name
        ,Total
    FROM
        MyTable
    WHERE Id = 3926
END
ELSE
BEGIN
    SELECT
        Id
        ,Name
        ,Total
    FROM
        MyTable
END
Obviously, in this case, @includeAll = 1 will evaluate to false;
however, including that increases the time of the query as if it were
always true.
This happens because those two predicates force SQL Server to choose an Index|Table Scan operator. Why?
The execution plan is generated for all possible values of the @includeAll variable / parameter. So, the same execution plan is used when @includeAll = 0 and when @includeAll = 1. If @includeAll = 0 is true and there is an index on the Id column, SQL Server could use an Index Seek or Index Seek + Key|RID Lookup to find the rows. But if @includeAll = 1 is true, the optimal data access operator is an Index|Table Scan. So, if the execution plan must be usable for all values of the @includeAll variable, which data access operator does SQL Server use: Seek or Scan? The answer is below, where you can find a similar query:
DECLARE @includeAll BIT = 0;

-- Initial solution
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
WHERE @includeAll = 1 OR p.ProductID = 345

-- My solution
DECLARE @SqlStatement NVARCHAR(MAX);
SET @SqlStatement = N'
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
' + CASE WHEN @includeAll = 1 THEN '' ELSE 'WHERE p.ProductID = @ProductID' END;
EXEC sp_executesql @SqlStatement, N'@ProductID INT', @ProductID = 345;
These queries have the following execution plans:
As you can see, the first execution plan includes a Clustered Index Scan with two non-optimized predicates.
My solution is based on dynamic queries, and it generates two different queries depending on the value of the @includeAll variable:
[ 1 ] When @includeAll = 0 the generated query (@SqlStatement) is
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
WHERE p.ProductID = @ProductID
and the execution plan includes an Index Seek (as you can see in the image above), and
[ 2 ] When @includeAll = 1 the generated query (@SqlStatement) is
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
and the execution plan includes a Clustered Index Scan. These two generated queries have different optimal execution plans.
Note: I've used Adventure Works 2012 sample database
My guess would be parameter sniffing - the procedure was compiled when @includeAll was 1, and this is the query plan that has been cached. That means that when it is false you are still doing a full table scan, when potentially an index seek and key lookup would be faster.
I think the best way of doing this is:
declare @includeAll bit = 0;
if @includeAll = 1
BEGIN
SELECT Id, Name,Total
FROM MyTable;
END
ELSE
BEGIN
SELECT Id, Name,Total
FROM MyTable
WHERE Id = 3926;
END
Or you can force recompilation each time it is run:
SELECT Id, Name,Total
FROM MyTable
WHERE Id = 3926
OR @IncludeAll = 1
OPTION (RECOMPILE);
To demonstrate this further, I set up a very simple table and filled it with nonsense data:
CREATE TABLE dbo.T (ID INT, Filler CHAR(1000));
INSERT dbo.T (ID)
SELECT TOP 100000 a.Number
FROM master..spt_values a, master..spt_values b
WHERE a.type = 'P'
AND b.Type = 'P'
AND b.Number BETWEEN 1 AND 100;
CREATE NONCLUSTERED INDEX IX_T_ID ON dbo.T (ID);
I then ran the same query 4 times.
With @IncludeAll set to 1, the query plan uses a table scan and the plan is cached.
Same query with @IncludeAll set to 0: the plan with the table scan is still cached, so that is used.
Clear the plan cache and run the query again with @IncludeAll set to 0, so that the plan is now compiled and stored with an index seek and bookmark lookup.
Run with @IncludeAll set to 1. The index seek and lookup are again used.
DECLARE @SQL NVARCHAR(MAX) = 'SELECT COUNT(Filler) FROM dbo.T WHERE @IncludeAll = 1 OR ID = 2;',
        @ParamDefinition NVARCHAR(MAX) = '@IncludeAll BIT',
        @PlanHandle VARBINARY(64);

EXECUTE sp_executesql @SQL, @ParamDefinition, 1;
EXECUTE sp_executesql @SQL, @ParamDefinition, 0;

SELECT @PlanHandle = cp.Plan_Handle
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS st
WHERE st.text LIKE '%' + @SQL;

DBCC FREEPROCCACHE (@PlanHandle); -- CLEAR THE CACHE

EXECUTE sp_executesql @SQL, @ParamDefinition, 0;
EXECUTE sp_executesql @SQL, @ParamDefinition, 1;

DBCC FREEPROCCACHE (@PlanHandle); -- CLEAR THE CACHE
Inspecting the execution plans show that once the query has been compiled it will reuse the same plan regardless of parameter value, and that it will cache the plan that is appropriate for the value passed when it is run first, not on a most flexible basis.
Putting an OR statement in your SQL like this will cause a scan. It is probably scanning your entire table or index which will be extremely inefficient in a very large table.
If you look at the query plan for a query without the @includeAll portion at all, you would probably see an Index Seek operation. As soon as you add that OR, you are more than likely changing the query plan to a table/index scan. You should take a look at the plan for your query to see what exactly it is doing.
When SQL Server generates a query plan, it must create a plan that would work for any possible values of any embedded variables. In your case an index seek would give you the best results when @IncludeAll = 0, but an index seek cannot be used when @IncludeAll = 1. So the query optimizer has no choice but to use the query plan that would work for either value of @IncludeAll, and that results in a table scan.
It is because SQL Server does not short-circuit when you have OR in the WHERE clause.
Normally, in other programming languages, if we have a condition like
IF (@Param1 == 1 || @Param2 == 2)
and @Param1 = 1, the other expression is not even evaluated.
In SQL Server, if you have a similar OR in your WHERE clause like you have in your query,
WHERE @includeAll = 1 OR Id = 3926
then even if @includeAll = 1 evaluates to true, it may go ahead and check the second condition anyway.
And changing the order of the expressions in your WHERE clause makes no difference, because which expression is evaluated first is something the query optimiser decides at run time. You cannot control this behaviour of SQL Server: neither short-circuiting of expression evaluation nor the order in which expressions are evaluated.

SQL query takes much longer time compared to next run

I'm running a procedure which takes around 1 minute the first time it is executed, but on subsequent runs this drops to around 9-10 seconds. And after some time it again takes around 1 minute.
My procedure works with a single table which has 6 non-clustered indexes and 1 clustered index; the unique id column is of the uniqueidentifier data type, and the table holds 1,218,833 rows.
Can you guide me to where the problem is and what performance improvements are possible?
Thanks in advance.
Here is the procedure.
PROCEDURE [dbo].[Proc] (
    @HLevel NVARCHAR(100),
    @HLevelValue INT,
    @Date DATE,
    @Numbers NVARCHAR(MAX) = NULL
)
AS
declare @LoopCount INT, @DateLastYear DATE

DECLARE @Table1 TABLE ( list of columns )
DECLARE @Table2 TABLE ( list of columns )

-- LOOP FOR 12 MONTH DATA
SET @LoopCount = 12

WHILE (@LoopCount > 0)
BEGIN
    SET @LoopCount = @LoopCount - 1

    -- LAST YEAR DATA
    DECLARE @LastDate DATE;
    SET @LastDate = DATEADD(D, -1, DATEADD(yy, -1, DATEADD(D, 1, @Date)))

    INSERT INTO @Table1
    SELECT list of columns
    FROM Table3
    WHERE Date = @Date
      AND CASE
              WHEN @HLevel = 'criteria1' THEN col1
              WHEN @HLevel = 'criteria2' THEN col2
              WHEN @HLevel = 'criteria3' THEN col3
          END = @HLevelValue

    INSERT INTO @Table2
    SELECT list of columns
    FROM table4
    WHERE Date = @LastDate
      AND ( @Numbers IS NULL OR columnNumber IN ( SELECT * FROM dbo.ConvertNumbersToTable(@Numbers) ) )

    INSERT INTO @Table1
    SELECT list of columns
    FROM @Table2 Prf2
    WHERE Prf2.col1 IN (SELECT col2 FROM @Table1) AND Year(Date) = Year(@Date)

    SET @Date = DATEADD(D, -1, DATEADD(m, -1, DATEADD(D, 1, @Date)));
END

SELECT list of columns FROM @Table1
The first time the query runs, the data is not in the data cache and so has to be retrieved from disk. It also has to prepare an execution plan. On subsequent runs, the data will be in the cache, so it will not have to go to disk to read it, and it can reuse the execution plan generated originally. This means execution time can be much quicker, which is why an ideal situation is to have large amounts of RAM so that as much data as possible can be cached in memory (it's the data cache that offers the biggest performance improvements).
If execution times subsequently increase again, it's possible that the data is being removed from the cache (and execution plans can be removed from the cache too); it depends on how much pressure there is for RAM. If SQL Server needs to free some up, it will remove things from the cache. Data and execution plans that are used most often or have the highest value will remain cached for longer.
There are of course other things that could be a factor, such as what load is on the server at the time, whether your query is being blocked by other processes, etc.
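If you want to confirm that caching is what you are seeing, a test-only sketch like the one below (never run this on a production server; the parameter values are placeholders) forces a cold run and then a warm run of the procedure:
CHECKPOINT;              -- flush dirty pages so the data cache can be emptied
DBCC DROPCLEANBUFFERS;   -- empty the data cache
DBCC FREEPROCCACHE;      -- empty the plan cache

SET STATISTICS TIME, IO ON;
EXEC [dbo].[Proc] @HLevel = N'criteria1', @HLevelValue = 1, @Date = '2017-01-01';  -- cold run: physical reads + compile
EXEC [dbo].[Proc] @HLevel = N'criteria1', @HLevelValue = 1, @Date = '2017-01-01';  -- warm run: cached data and plan
SET STATISTICS TIME, IO OFF;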
It seems that the stored procedure is being recompiled repeatedly after some time. To reduce the recompilation, please check this article:
http://blog.sqlauthority.com/2010/02/18/sql-server-plan-recompilation-and-reduce-recompilation-performance-tuning/
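One hedged way to check for recompilation from the DMVs: plan_generation_num in sys.dm_exec_query_stats increases each time a cached statement is recompiled, so persistently climbing values for the procedure's statements point in that direction.
SELECT TOP (20)
       qs.plan_generation_num,      -- grows with each recompile of the statement
       qs.execution_count,
       SUBSTRING(st.text, qs.statement_start_offset / 2 + 1, 80) AS statement_start
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.plan_generation_num DESC;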