While working on query performance optimisation, I noticed that the pattern below outperforms other, more obvious ways of writing the same query by a wide margin. After looking at the execution plans, it appears this is due to parallelism.
The table MyTable has a clustered primary key on (Identifier, MyId, date). The @xml variable usually contains tens of entries, and the data returned is a few hundred thousand rows.
Is there a way to achieve parallelism without using the XML or is this a standard pattern/trick?
SET QUOTED_IDENTIFIER ON;
DECLARE @xml xml;
SET @xml = '<recipe MyId="3654969" Identifier="foo1" StartDate="12-Dec-2017 00:00:00" EndDate="09-Jan-2018 23:59:59"/>
<recipe MyId="3670306" Identifier="foo2" StartDate="10-Jan-2018 00:00:00" EndDate="07-Feb-2018 23:59:59"/>
';
exec sp_executesql N'
SELECT date, val
FROM MyTable tbl
inner join (
    SELECT t.data.value(''@MyId'', ''int'') AS xmlMyId,
           t.data.value(''@StartDate'', ''datetime'') AS xmlStartDate,
           t.data.value(''@EndDate'', ''datetime'') AS xmlEndDate,
           t.data.value(''@Identifier'', ''varchar(32)'') as xmlIdentifier
    FROM @queryXML.nodes(''/recipe'') t(data) ) cont
ON tbl.MyId = cont.xmlMyId
AND tbl.date >= cont.xmlStartDate
AND tbl.date <= cont.xmlEndDate
WHERE Identifier = cont.xmlIdentifier
ORDER BY date', N'@queryXML xml', @xml;
For example, the stored procedure below which returns the same data severely underperforms the query above (parameters for stored proc are passed in and the whole thing is executed using sp_executesql).
SELECT tbl.date, val
FROM marketdb.dbo.MyTable tbl
INNER JOIN @MyIds ids ON tbl.MyId = ids.MyId
    AND (ids.StartDate IS NULL OR (ids.StartDate IS NOT NULL AND ids.StartDate <= tbl.date))
    AND (ids.EndDate IS NULL OR (ids.EndDate IS NOT NULL AND tbl.date <= ids.EndDate))
WHERE tbl.Identifier IN (SELECT Identifier FROM @identifier_list) AND date >= @start_date AND date <= @end_date
The actual execution plan of the XML query is shown below.
See also:
sp_executesql is slow with parameters
SQL Server doesn't have the statistics for the table variable?
As Jeroen Mostert said, table variables do not have statistics and the actual execution plan is not optimal. In my case, the xml version of the query was parallelised whereas the stored proc was not (this is what I mean by the execution plan not being optimal).
A way to help the optimiser is to add an appropriate primary key or index to the table variables. One could also create statistics on the columns in question, but the version of SQL Server I am using does not support statistics on table variables.
After I added an index on all columns in the table variables, the optimiser started parallelising the query and execution speed improved greatly.
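As a sketch of that fix (column names here are illustrative, matching the join columns used above), an index on a table variable is declared inline with the table definition, since table variables cannot be indexed after the fact:

```sql
-- Sketch: a table variable with an inline primary key on the join
-- column. Table variables carry no column statistics, but the inline
-- key at least gives the optimiser a uniqueness guarantee and an
-- access path for the join.
DECLARE @MyIds TABLE (
    MyId      int      NOT NULL,
    StartDate datetime NULL,
    EndDate   datetime NULL,
    PRIMARY KEY (MyId)   -- inline index on the join column
);
```

On SQL Server 2014 and later, non-unique inline indexes (e.g. `INDEX ix_dates (StartDate, EndDate)` inside the table definition) are also allowed.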
I've got a query in SQL Server (2008) that takes much longer to evaluate if I include a clause in the WHERE statement that shouldn't affect the result. Here is an example of the query:
declare @includeAll bit = 0;

SELECT
    Id
    ,Name
    ,Total
FROM
    MyTable
WHERE
    @includeAll = 1 OR Id = 3926
Obviously, in this case, @includeAll = 1 will evaluate false; however, including that clause increases the time of the query as if it were always true. The result I get is correct with or without the clause: I only get the one entry with Id = 3926, but (in my real-world query) including that line increases the query time from well under a second to about 7 minutes... so it seems the query runs as if the condition were true, even though it's not, while still returning the correct results.
Any light that can be shed on why would be helpful. Also, I'd welcome suggestions for a workaround. I want a clause like this one so that I can add a parameter to a stored procedure that, when set to true, makes it disregard the Id and return all results, but I can't allow that to hurt performance when fetching just one record.
You'd need to look at the query plan to be sure, but an OR like this will often force a scan in many DBMSs.
Also, read @Bogdan Sahlean's response for some great details on why this is happening.
This may not work, but if you need to stick with straight SQL, you can try something like:
SELECT
Id
,Name
,Total
FROM
MyTable
WHERE Id = 3926
UNION ALL
SELECT
Id
,Name
,Total
FROM
MyTable
WHERE Id <> 3926
AND @includeAll = 1
If you are using a stored procedure, you could instead run either query conditionally, which is probably more efficient.
Something like:
IF @includeAll = 0
    SELECT
        Id
        ,Name
        ,Total
    FROM
        MyTable
    WHERE Id = 3926
ELSE
    SELECT
        Id
        ,Name
        ,Total
    FROM
        MyTable
Obviously, in this case, the @includeAll = 1 will evaluate false;
however, including that increases the time of the query as if it were
always true.
This happens because those two predicates force SQL Server to choose an Index|Table Scan operator. Why?
The execution plan is generated for all possible values of the @includeAll variable / parameter, so the same execution plan is used both when @includeAll = 0 and when @includeAll = 1. If @includeAll = 0 and there is an index on the Id column, SQL Server could use an Index Seek, or an Index Seek + Key|RID Lookup, to find the rows. But if @includeAll = 1, the optimal data access operator is an Index|Table Scan. So, if the execution plan must be usable for all values of @includeAll, which data access operator does SQL Server use: Seek or Scan? The answer is below, where you can find a similar query:
DECLARE @includeAll BIT = 0;

-- Initial solution
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
WHERE @includeAll = 1 OR p.ProductID = 345

-- My solution
DECLARE @SqlStatement NVARCHAR(MAX);
SET @SqlStatement = N'
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
' + CASE WHEN @includeAll = 1 THEN '' ELSE 'WHERE p.ProductID = @ProductID' END;
EXEC sp_executesql @SqlStatement, N'@ProductID INT', @ProductID = 345;
These queries have the following execution plans:
As you can see, the first execution plan includes a Clustered Index Scan with two non-optimised predicates.
My solution is based on dynamic queries: it generates two different queries depending on the value of the @includeAll variable, thus:
[ 1 ] When @includeAll = 0, the generated query (@SqlStatement) is
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
WHERE p.ProductID = @ProductID
and the execution plan includes an Index Seek (as you can see in the image above); and
[ 2 ] When @includeAll = 1, the generated query (@SqlStatement) is
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
and the execution plan includes a Clustered Index Scan. These two generated queries have different optimal execution plans.
Note: I've used Adventure Works 2012 sample database
My guess would be parameter sniffing: the procedure was compiled when @includeAll was 1, and that is the query plan that has been cached. This means that when it is false you are still doing a full table scan, when potentially an index seek and key lookup would be faster.
I think the best way of doing this is:
declare @includeAll bit = 0;

if @includeAll = 1
BEGIN
    SELECT Id, Name, Total
    FROM MyTable;
END
ELSE
BEGIN
    SELECT Id, Name, Total
    FROM MyTable
    WHERE Id = 3926;
END
Or you can force recompilation each time it is run:
SELECT Id, Name, Total
FROM MyTable
WHERE Id = 3926
OR @IncludeAll = 1
OPTION (RECOMPILE);
To demonstrate this further, I set up a very simple table and filled it with nonsense data:
CREATE TABLE dbo.T (ID INT, Filler CHAR(1000));
INSERT dbo.T (ID)
SELECT TOP 100000 a.Number
FROM master..spt_values a, master..spt_values b
WHERE a.type = 'P'
AND b.Type = 'P'
AND b.Number BETWEEN 1 AND 100;
CREATE NONCLUSTERED INDEX IX_T_ID ON dbo.T (ID);
I then ran the same query 4 times.
1. With @IncludeAll set to 1, the query plan uses a table scan, and that plan is cached.
2. The same query with @IncludeAll set to 0: the plan with the table scan is still cached, so that is what is used.
3. Clear the plan cache and run the query again with @IncludeAll = 0, so that the plan is now compiled and cached with an index seek and bookmark lookup.
4. Run with @IncludeAll set to 1. The index seek and lookup are again used.
DECLARE @SQL NVARCHAR(MAX) = 'SELECT COUNT(Filler) FROM dbo.T WHERE @IncludeAll = 1 OR ID = 2;',
        @ParamDefinition NVARCHAR(MAX) = '@IncludeAll BIT',
        @PlanHandle VARBINARY(64);

EXECUTE sp_executesql @SQL, @ParamDefinition, 1;
EXECUTE sp_executesql @SQL, @ParamDefinition, 0;

SELECT @PlanHandle = cp.Plan_Handle
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%' + @SQL;

DBCC FREEPROCCACHE (@PlanHandle); -- CLEAR THE CACHE

EXECUTE sp_executesql @SQL, @ParamDefinition, 0;
EXECUTE sp_executesql @SQL, @ParamDefinition, 1;

DBCC FREEPROCCACHE (@PlanHandle); -- CLEAR THE CACHE
Inspecting the execution plans shows that once the query has been compiled it will reuse the same plan regardless of the parameter value, and that it caches whichever plan suits the value passed on the first run, not the most flexible plan.
Putting an OR statement in your SQL like this will cause a scan. It is probably scanning your entire table or index which will be extremely inefficient in a very large table.
If you look at the query plan for a query without the @includeAll portion at all, you would probably see an Index Seek operation. As soon as you add that OR, you are more than likely changing the query plan to a table/index scan. You should look at the plan for your query to see exactly what it is doing.
When SQL Server generates a query plan, it must create a plan that works for all possible values of any embedded variables. In your case an index seek would give the best results when @IncludeAll = 0, but an index seek cannot be used when @IncludeAll = 1. So the query optimizer has no choice but to use the plan that works for either value of @IncludeAll, and that results in a table scan.
It is because SQL Server does not guarantee short-circuit evaluation when you have OR in a WHERE clause.
Normally in other programming languages, if we have a condition like
IF (@Param1 == 1 || @Param2 == 2)
then if @Param1 = 1, it would not even bother evaluating the other expression.
In SQL Server, if you have a similar OR in your WHERE clause, as in your query:
WHERE @includeAll = 1 OR Id = 3926
even if @includeAll = 1 evaluates to true, it may go ahead and check the second condition anyway.
Changing the order of the expressions in your WHERE clause makes no difference either, because the query optimiser decides at run time which expression is evaluated first. You cannot control this behaviour of SQL Server: neither the short-circuiting of expression evaluation, nor the order in which expressions are evaluated.
In my SELECT statement I use optional parameters in a way like this:
DECLARE @p1 INT = 1
DECLARE @p2 INT = 1

SELECT name FROM some_table WHERE (id = @p1 OR @p1 IS NULL) AND (name = @p2 OR @p2 IS NULL)
In this case the optimizer generates "index scan" (not seek) operations, which is not the most efficient approach when the parameters are supplied with non-null values.
If I add the RECOMPILE hint to the query, the optimizer builds a more effective plan which uses a "seek". It works on my MSSQL 2008 R2 SP1 server, and it also shows that the optimizer CAN build a plan that considers only one logic branch of my query.
How can I make it use that plan everywhere I want without recompiling? The USE PLAN hint seems not to work in this case.
Below is test code:
-- see plans
CREATE TABLE test_table(
id INT IDENTITY(1,1) NOT NULL,
name varchar(10),
CONSTRAINT [pk_test_table] PRIMARY KEY CLUSTERED (id ASC))
GO
INSERT INTO test_table(name) VALUES ('a'),('b'),('c')
GO
DECLARE @p INT = 1

SELECT name FROM test_table WHERE id = @p OR @p IS NULL
SELECT name FROM test_table WHERE id = @p OR @p IS NULL OPTION(RECOMPILE)
GO
DROP TABLE test_table
GO
Note that not all versions of SQL Server will change the plan the way I showed.
The reason you get a scan is that the predicate will not short-circuit: both conditions will always be considered. As you have already noticed, this does not play well with the optimizer and forces a scan. Even though RECOMPILE sometimes appears to help, it is not consistent.
If you have a large table where seeks are a must, then you have two options:
1. Dynamic SQL.
2. IF statements separating your queries, thus creating separate execution plans (when @p is null you will of course always get a scan).
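The dynamic SQL option can be sketched for the test table above like this (a sketch, not a drop-in: it builds the predicate only when the parameter is supplied, so each query shape gets its own cached plan):

```sql
DECLARE @p INT = 1;
DECLARE @sql NVARCHAR(MAX) = N'SELECT name FROM test_table';

-- Append the filter only when @p is supplied. Each distinct statement
-- text is compiled and cached separately, so the filtered shape gets
-- a seek plan and the unfiltered shape gets a scan plan.
IF @p IS NOT NULL
    SET @sql += N' WHERE id = @p';

EXEC sp_executesql @sql, N'@p INT', @p = @p;
```

Passing @p even when the statement does not reference it is harmless; sp_executesql permits declared-but-unused parameters.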
Response to Comment on Andreas' Answer
The problem is that you need two different plans.
If @p1 = 1, then you can use a SEEK on the index.
If @p1 IS NULL, however, it is not a seek; by definition it is a SCAN.
This means that when the optimiser generates a plan prior to knowing the parameter values, it needs to create a plan that can fulfil all possibilities. Only a scan can cover the needs of both @p1 = 1 and @p1 IS NULL.
It also means that if the plan is recompiled at the time when the parameters are known, and @p1 = 1, a SEEK plan can be created.
This is the reason that, as you mention in your comment, IF statements resolve your problem: each IF block represents a different portion of the problem space, and each can be given a different execution plan.
See Dynamic Search Conditions in T-SQL.
This explains comprehensively the versions where the RECOMPILE option works and alternatives where it doesn't.
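For the optional-parameter query from the question, the RECOMPILE route that article discusses looks like this (a sketch; note the per-execution compilation cost it incurs):

```sql
DECLARE @p1 INT = 1, @p2 INT = 1;

-- OPTION (RECOMPILE) lets the optimiser see the actual parameter
-- values at compile time, so it can prune the "@pN IS NULL" branches
-- and produce a seek plan when both parameters are supplied.
SELECT name
FROM some_table
WHERE (id = @p1 OR @p1 IS NULL)
  AND (name = @p2 OR @p2 IS NULL)
OPTION (RECOMPILE);
```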
Look at this article http://www.bigresource.com/Tracker/Track-ms_sql-fTP7dh01/
It seems you can try the proposed solution:
`SELECT * FROM <table> WHERE ISNULL(column, -1) = ISNULL(@value, -1)`
or
`SELECT * FROM <table> WHERE COALESCE(column, -1) = COALESCE(@value, -1)`
I am trying to understand what is going on internally with CREATE INDEX. When I create a NONCLUSTERED index, it shows as an INSERT in the execution plan, as well as when I get the query text.
DECLARE @sqltext VARBINARY(128)

SELECT @sqltext = sql_handle
FROM sys.sysprocesses s
WHERE spid = 73 --73 is the process creating the index

SELECT TEXT
FROM sys.dm_exec_sql_text(@sqltext)
GO
Show:
insert [dbo].[tbl] select * from [dbo].[tbl] option (maxdop 1)
This is consistent in the execution plan. Any info is appreciated.
This was my lack of knowledge on indexes, what a difference 4 months of experience makes! :)
The index creation causes writes to the index in order to insert the leaf rows as needed.
How do conditional statements (like IF ... ELSE) affect the query execution plan in SQL Server (2005 and above)?
Can conditional statements cause poor execution plans, and are there any form of conditionals you need to be wary of when considering performance?
** Edited to add ** :
I'm specifically referring to the cached query execution plan. For instance, when the query execution plan in the example below is cached, are two execution plans cached, one for each outcome of the conditional?
DECLARE @condition BIT

IF @condition = 1
BEGIN
    SELECT * from ...
END
ELSE
BEGIN
    SELECT * from ..
END
You'll get plan recompiles often with that approach. I generally try to split them up, so you end up with:
DECLARE @condition BIT

IF @condition = 1
BEGIN
    EXEC MyProc1
END
ELSE
BEGIN
    EXEC MyProc2
END
This way there's no difference to the end users, and MyProc1 and MyProc2 each get their own proper cached execution plan. One procedure, one query.
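A sketch of that split, using the Id filter from the question (MyProc1/MyProc2 and the column names are placeholders):

```sql
-- Each procedure contains exactly one query shape, so each gets its
-- own cached plan: typically a seek for the filtered case and a scan
-- for the unfiltered one. No OR predicate ever reaches the optimiser.
CREATE PROCEDURE MyProc1 @Id int
AS
    SELECT Id, Name, Total FROM MyTable WHERE Id = @Id;
GO

CREATE PROCEDURE MyProc2
AS
    SELECT Id, Name, Total FROM MyTable;
GO
```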