Why is SQL evaluating a WHERE clause that is False? - sql

I have a query on SQL Server 2008, and I can't understand why it takes so much longer to run when the WHERE clause includes a condition that shouldn't affect the result. Here is an example of the query:
declare @includeAll bit = 0;
SELECT
Id
,Name
,Total
FROM
MyTable
WHERE
@includeAll = 1 OR Id = 3926
Obviously, in this case, @includeAll = 1 evaluates to false; however, including it slows the query down as if it were always true. The result is correct with or without that clause: I get only the one row with Id = 3926. But in my real-world query, including that condition increases the query time from under a second to about 7 minutes. So it seems the query runs as if the condition were true, even though it isn't, while still returning the correct results.
Any light that can be shed on why would be helpful. Also, if you have a suggestion on working around it I'd take it. I want to have a clause such as this one so that I can include a parameter in a stored procedure that will make it disregard the Id that it has and return all results if set to true, but I can't allow that to affect the performance when simply trying to get one record.

You'd need to look at the query plan to be sure, but in some DBMSs an OR like this often forces a scan.
Also, read @Bogdan Sahlean's response for some great details on why this is happening.
If you need to stick with straight SQL, you can try something like this, though it may not work:
SELECT
Id
,Name
,Total
FROM
MyTable
WHERE Id = 3926
UNION ALL
SELECT
Id
,Name
,Total
FROM
MyTable
WHERE Id <> 3926
AND @includeAll = 1
If you are using a stored procedure, you could conditionally run the SQL either way instead which is probably more efficient.
Something like:
IF @includeAll = 0
SELECT
Id
,Name
,Total
FROM
MyTable
WHERE Id = 3926
else
SELECT
Id
,Name
,Total
FROM
MyTable
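The question mentions wanting this behaviour behind a stored procedure parameter. A minimal sketch of the conditional approach wrapped in a procedure is shown below; the procedure name dbo.usp_GetMyTableRows and the @Id parameter are hypothetical, and variables use the standard T-SQL @ prefix.

```sql
-- Hypothetical procedure name and parameters, for illustration only.
CREATE PROCEDURE dbo.usp_GetMyTableRows
    @Id INT,
    @includeAll BIT = 0
AS
BEGIN
    SET NOCOUNT ON;

    IF @includeAll = 1
        -- No filter: every row is wanted, so a scan is appropriate.
        SELECT Id, Name, Total
        FROM MyTable;
    ELSE
        -- Single-row lookup: this statement can be planned as a seek on Id.
        SELECT Id, Name, Total
        FROM MyTable
        WHERE Id = @Id;
END
```

Because each branch is a separate statement with a single possible shape, neither plan has to cover both cases.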

Obviously, in this case, the @includeAll = 1 will evaluate false;
however, including that increases the time of the query as if it were
always true.
This happens because those two predicates force SQL Server to choose an Index or Table Scan operator. Why?
The execution plan is generated for all possible values of the @includeAll variable or parameter, so the same execution plan is used when @includeAll = 0 and when @includeAll = 1. If @includeAll = 0 and there is an index on the Id column, SQL Server could use an Index Seek, or an Index Seek plus a Key or RID Lookup, to find the rows. But if @includeAll = 1, the optimal data access operator is an Index or Table Scan. So if the execution plan must be usable for all values of @includeAll, which data access operator does SQL Server use: Seek or Scan? The answer is below, where you can find a similar query:
DECLARE @includeAll BIT = 0;
-- Initial solution
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
WHERE @includeAll = 1 OR p.ProductID = 345
-- My solution
DECLARE @SqlStatement NVARCHAR(MAX);
SET @SqlStatement = N'
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
' + CASE WHEN @includeAll = 1 THEN '' ELSE 'WHERE p.ProductID = @ProductID' END;
EXEC sp_executesql @SqlStatement, N'@ProductID INT', @ProductID = 345;
These queries have the following execution plans:
As you can see, the first execution plan includes a Clustered Index Scan with two non-optimized predicates.
My solution is based on dynamic SQL: it generates two different queries depending on the value of the @includeAll variable:
[ 1 ] When @includeAll = 0 the generated query (@SqlStatement) is
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
WHERE p.ProductID = @ProductID
and the execution plan includes an Index Seek (as you can see in the image above), and
[ 2 ] When @includeAll = 1 the generated query (@SqlStatement) is
SELECT p.ProductID, p.Name, p.Color
FROM Production.Product p
and the execution plan includes a Clustered Index Scan. These two generated queries have different optimal execution plans.
Note: I've used Adventure Works 2012 sample database

My guess would be parameter sniffing: the procedure was compiled when @includeAll was 1, and that is the query plan that has been cached. This means that when it is false you are still doing a full table scan, when an index seek and key lookup would potentially be faster.
I think the best way of doing this is:
declare @includeAll bit = 0;
if @includeAll = 1
BEGIN
SELECT Id, Name,Total
FROM MyTable;
END
ELSE
BEGIN
SELECT Id, Name,Total
FROM MyTable
WHERE Id = 3926;
END
Or you can force recompilation each time it is run:
SELECT Id, Name,Total
FROM MyTable
WHERE Id = 3926
OR @IncludeAll = 1
OPTION (RECOMPILE);
To demonstrate this further, I set up a very simple table and filled it with nonsense data:
CREATE TABLE dbo.T (ID INT, Filler CHAR(1000));
INSERT dbo.T (ID)
SELECT TOP 100000 a.Number
FROM master..spt_values a, master..spt_values b
WHERE a.type = 'P'
AND b.Type = 'P'
AND b.Number BETWEEN 1 AND 100;
CREATE NONCLUSTERED INDEX IX_T_ID ON dbo.T (ID);
I then ran the same query 4 times:
1. With @IncludeAll set to 1: the query plan uses a table scan, and the plan is cached.
2. The same query with @IncludeAll set to 0: the plan with the table scan is still cached, so it is reused.
3. Clear the plan from the cache and run the query again with @IncludeAll set to 0, so that the plan is now compiled and cached with an index seek and bookmark lookup.
4. Run with @IncludeAll set to 1: the index seek and lookup are used again.
DECLARE @SQL NVARCHAR(MAX) = 'SELECT COUNT(Filler) FROM dbo.T WHERE @IncludeAll = 1 OR ID = 2;',
@ParamDefinition NVARCHAR(MAX) = '@IncludeAll BIT',
@PlanHandle VARBINARY(64);
EXECUTE sp_executesql @SQL, @ParamDefinition, 1;
EXECUTE sp_executesql @SQL, @ParamDefinition, 0;
SELECT @PlanHandle = cp.Plan_Handle
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS st
WHERE st.text LIKE '%' + @SQL;
DBCC FREEPROCCACHE (@PlanHandle); -- CLEAR THE CACHE
EXECUTE sp_executesql @SQL, @ParamDefinition, 0;
EXECUTE sp_executesql @SQL, @ParamDefinition, 1;
DBCC FREEPROCCACHE (@PlanHandle); -- CLEAR THE CACHE
Inspecting the execution plans shows that once the query has been compiled, the same plan is reused regardless of the parameter value, and that the cached plan is the one appropriate for the value passed on the first run, not the most flexible one.
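To check which plan is currently cached and how often it has been reused, a query along these lines may help. It is a sketch that assumes the dbo.T demo above has been run and that the login has VIEW SERVER STATE permission; the LIKE pattern is just one way of locating the statement.

```sql
-- Find the cached plan for the demo statement and show its reuse count.
SELECT cp.usecounts,      -- number of times the cached plan has been used
       cp.objtype,        -- 'Prepared' for sp_executesql statements
       st.text,           -- statement text, prefixed with the parameter list
       qp.query_plan      -- showplan XML; click it in SSMS to view the plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.text LIKE N'%FROM dbo.T WHERE @IncludeAll = 1 OR ID = 2%'
  AND st.text NOT LIKE N'%dm_exec_cached_plans%';  -- exclude this query itself
```

If usecounts increases across runs with different @IncludeAll values, the same plan is being reused for both.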

Putting an OR condition in your SQL like this is likely to cause a scan. It is probably scanning your entire table or index, which is extremely inefficient on a very large table.
If you look at the query plan for the query without the @includeAll portion, you will probably see an Index Seek operation. As soon as you add that OR, you are more than likely changing the query plan to a table or index scan. You should look at the plan for your query to see exactly what it is doing.
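One way to compare the two variants is to run them side by side with I/O and timing statistics switched on, alongside "Include Actual Execution Plan" in Management Studio. This is a sketch using the table and column names from the question:

```sql
SET STATISTICS IO ON;    -- report logical and physical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time per statement

DECLARE @includeAll BIT = 0;

-- Variant 1: plain equality predicate.
SELECT Id, Name, Total
FROM MyTable
WHERE Id = 3926;

-- Variant 2: the same filter combined with the OR condition.
SELECT Id, Name, Total
FROM MyTable
WHERE @includeAll = 1 OR Id = 3926;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

A large gap in logical reads between the two statements in the Messages tab would be consistent with a seek in the first and a scan in the second.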

When SQL Server generates a query plan, it must create a plan that works for any possible values of any embedded variables. In your case an index seek would give the best results when @IncludeAll = 0, but an index seek cannot be used when @IncludeAll = 1. So the query optimizer has no choice but to use a query plan that works for either value of @IncludeAll. That results in a table scan.

This is because SQL Server does not guarantee short-circuit evaluation when you have an OR in the WHERE clause.
In most other programming languages, if we have a condition like
IF (@Param1 == 1 || @Param2 == 2)
and Param1 is equal to 1, the second expression is not evaluated at all.
In SQL Server, if you have a similar OR in your WHERE clause, as in your query
WHERE @includeAll = 1 OR Id = 3926
then even if @includeAll = 1 evaluates to true, it might go ahead and check the second condition anyway.
Changing the order of the expressions in your WHERE clause makes no difference either, because the query optimizer decides the evaluation order itself. You cannot control whether SQL Server short-circuits expression evaluation, or the order in which it evaluates expressions.

Related

Maximizing query performance by joining with XML

While working on query performance optimisation, I noticed that the pattern below outperforms by a wide margin other, more obvious, ways of writing the same query. After looking at the execution plans, it appears this is due to parallelism.
The table MyTable has a clustered primary key on (Identifier, MyId, date). The @xml variable usually contains tens of entries, and the data returned is a few hundred thousand rows.
Is there a way to achieve parallelism without using the XML or is this a standard pattern/trick?
SET QUOTED_IDENTIFIER ON;
DECLARE @xml xml;
SET @xml = '<recipe MyId="3654969" Identifier="foo1" StartDate="12-Dec-2017 00:00:00" EndDate="09-Jan-2018 23:59:59"/>
<recipe MyId="3670306" Identifier="foo2" StartDate="10-Jan-2018 00:00:00" EndDate="07-Feb-2018 23:59:59"/>
';
exec sp_executesql N'
SELECT date, val
FROM MyTable tbl
inner join (
SELECT t.data.value(''@MyId'', ''int'') AS xmlMyId,
t.data.value(''@StartDate'', ''datetime'') AS xmlStartDate,
t.data.value(''@EndDate'', ''datetime'') AS xmlEndDate,
t.data.value(''@Identifier'', ''varchar(32)'') as xmlIdentifier
FROM @queryXML.nodes(''/recipe'') t(data) ) cont
ON tbl.MyId = cont.xmlMyId
AND tbl.date >= cont.xmlStartDate
AND tbl.date <= cont.xmlEndDate
WHERE Identifier = cont.xmlIdentifier
ORDER BY date', N'@queryXML xml',@xml;
For example, the stored procedure below which returns the same data severely underperforms the query above (parameters for stored proc are passed in and the whole thing is executed using sp_executesql).
SELECT tbl.date, val
FROM marketdb.dbo.MyTable tbl
INNER JOIN @MyIds ids ON tbl.MyId = ids.MyId
AND (ids.StartDate IS NULL or (ids.StartDate IS NOT NULL AND ids.StartDate <= tbl.date))
AND (ids.EndDate IS NULL or (ids.EndDate IS NOT NULL AND tbl.date <= ids.EndDate))
WHERE tbl.Identifier in (SELECT Identifier FROM @identifier_list) AND date >= @start_date AND date <= @end_date
The actual execution plan of the XML query is shown below.
See also:
sp_executesql is slow with parameters
SQL Server doesn't have the statistics for the table variable?
As Jeroen Mostert said, table variables do not have statistics, so the actual execution plan is not optimal. In my case, the XML version of the query was parallelised whereas the stored procedure was not (this is what I mean by the execution plan not being optimal).
A way to help the optimiser is to add an appropriate primary key or an index on the table variables. One can also create statistics on the table columns in question, but in the version of SQL Server that I am using, table variables do not support statistics.
After I added an index on all columns in the table variables, the optimiser started parallelising the query and the execution speed improved greatly.
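As a rough sketch of what such declarations could look like: the column types below are assumptions (the original post does not show the table variable definitions), and the keys are declared as constraints, which is the way to get an index on a table variable in older SQL Server versions.

```sql
-- Assumed column types; adjust to match the real procedure parameters.
DECLARE @MyIds TABLE
(
    MyId      INT      NOT NULL PRIMARY KEY,  -- creates a clustered index on MyId
    StartDate DATETIME NULL,
    EndDate   DATETIME NULL
);

DECLARE @identifier_list TABLE
(
    Identifier VARCHAR(32) NOT NULL PRIMARY KEY  -- clustered index on Identifier
);
```

This assumes MyId and Identifier are unique within their lists; if they are not, a UNIQUE constraint over a wider column combination would be needed instead.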

sql execution latency when assign to a variable

The following query runs in about 22 seconds:
DECLARE @i INT, @x INT
SET @i = 156567
SELECT
TOP 1
@x = AncestorId
FROM
dbo.tvw_AllProjectStructureParents_ChildView a
WHERE
ProjectStructureId = @i AND
a.NodeTypeCode = 42 AND
a.AncestorTypeDiffLevel = 1
OPTION (RECOMPILE)
The problem is with the variable assignment (specifically this line: @x = AncestorId). When I remove the assignment, the query speeds up to about 15 milliseconds!
I worked around it by inserting the result into a temp table, but I think that is a bad approach.
Can anyone help me understand the source of the problem?
P.S.
Bad execution plan (22s): https://www.brentozar.com/pastetheplan/?id=Sy6a4c9bW
Good execution plan (20ms): https://www.brentozar.com/pastetheplan/?id=Byg8Hc5ZZ
When you use OPTION (RECOMPILE) SQL Server can generally perform parameter embedding optimisation.
The plan it is compiling is single use so it can sniff the values of all variables and parameters and treat them as constants.
A trivial example showing the parameter embedding optimisation in action and the effect of assigning to a variable is below (actual execution plans not estimated).
DECLARE @A INT = 1,
@B INT = 2,
@C INT;
SELECT TOP (1) number FROM master..spt_values WHERE @A > @B;
SELECT TOP (1) number FROM master..spt_values WHERE @A > @B OPTION (RECOMPILE);
SELECT TOP (1) @C = number FROM master..spt_values WHERE @A > @B OPTION (RECOMPILE);
The plans for this are below
Note that the middle one does not touch the table at all, as SQL Server can deduce at compile time that @A > @B is not true. But plan 3 goes back to including the table in the plan, as the variable assignment evidently prevents the effect of OPTION (RECOMPILE) shown in plan 2.
(As an aside, the third plan is not really 4-5 times as expensive as the first. Assigning to a variable also seems to suppress the usual row goal logic, where the cost of the index scan would be scaled down to reflect the TOP 1.)
In your good plan the @i value of 156567 is pushed right into the seek in the anchor leg of the recursive CTE; it returned 0 rows, so the recursive part had no work to do.
In your bad plan the recursive CTE gets fully materialised, with 627,393 executions of the recursive subtree, and the predicate is applied to the resulting 627,393 rows (discarding all of them) only at the end.
I'm not sure why SQL Server can't push the predicate with a variable down. You haven't supplied the definitions of your tables - or the view with the recursive CTE. There is a similar issue with predicate pushing, views, and window functions though.
One solution would be to change the view to an inline table-valued function that accepts a parameter for mainid, and to add that parameter to the WHERE clause in the anchor part of the definition, rather than relying on SQL Server to push the predicate down for you.
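Since the view definition was not posted, the following is only a sketch of the shape such a function might take. The function name dbo.tvf_Ancestors, the table dbo.ProjectStructure and its ParentId column are all hypothetical; the point is that the parameter filters the anchor member directly.

```sql
-- Hypothetical names throughout; the real view and tables were not shown.
CREATE FUNCTION dbo.tvf_Ancestors (@ProjectStructureId INT)
RETURNS TABLE
AS
RETURN
(
    WITH Ancestors AS
    (
        -- Anchor: start from the single requested node only.
        SELECT ps.ProjectStructureId,
               ps.ParentId AS AncestorId,
               1 AS AncestorLevel
        FROM dbo.ProjectStructure AS ps
        WHERE ps.ProjectStructureId = @ProjectStructureId
          AND ps.ParentId IS NOT NULL

        UNION ALL

        -- Recursive part: walk up one parent at a time.
        SELECT a.ProjectStructureId,
               ps.ParentId,
               a.AncestorLevel + 1
        FROM Ancestors AS a
        JOIN dbo.ProjectStructure AS ps
            ON ps.ProjectStructureId = a.AncestorId
        WHERE ps.ParentId IS NOT NULL
    )
    SELECT ProjectStructureId, AncestorId, AncestorLevel
    FROM Ancestors
);
```

A caller would then select from dbo.tvf_Ancestors(@i) and apply the remaining filters, so the recursion starts from one row regardless of whether the result is assigned to a variable.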
The difference probably comes from SELECT TOP 1.
When you select only the column, SQL Server fetches just the first row. With the variable assignment, SQL Server fetches all the results but uses only the top one.
I checked this on different queries and it is not always the case, but here the SQL Server optimization probably fails because of the complexity of the views and tables.
You can try the following workaround:
DECLARE @i INT, @x INT
SET @i = 156567
SET @x = (SELECT
TOP 1
AncestorId
FROM
dbo.tvw_AllProjectStructureParents_ChildView a
WHERE
ProjectStructureId = @i AND
a.NodeTypeCode = 42 AND
a.AncestorTypeDiffLevel = 1)

Optional parameters, "index seek" plan

In my SELECT statement I use optional parameters like this:
DECLARE @p1 INT = 1
DECLARE @p2 INT = 1
SELECT name FROM some_table WHERE (id = @p1 OR @p1 IS NULL) AND (name = @p2 OR @p2 IS NULL)
In this case the optimizer generates "index scan" (not seek) operations for the table, which is not the most effective choice when the parameters are supplied with non-null values.
If I add the RECOMPILE hint to the query, the optimizer builds a more effective plan that uses a seek. This works on my MSSQL 2008 R2 SP1 server, and it also means that the optimizer CAN build a plan that considers only one logical branch of my query.
How can I make it use that plan wherever I want without recompiling? The USE PLAN hint does not seem to work in this case.
Below is test code:
-- see plans
CREATE TABLE test_table(
id INT IDENTITY(1,1) NOT NULL,
name varchar(10),
CONSTRAINT [pk_test_table] PRIMARY KEY CLUSTERED (id ASC))
GO
INSERT INTO test_table(name) VALUES ('a'),('b'),('c')
GO
DECLARE @p INT = 1
SELECT name FROM test_table WHERE id = @p OR @p IS NULL
SELECT name FROM test_table WHERE id = @p OR @p IS NULL OPTION(RECOMPILE)
GO
DROP TABLE test_table
GO
Note that not all versions of SQL Server will change the plan the way I have shown.
You get a scan because the predicate does not short-circuit, so both conditions are always evaluated. As you have already stated, this does not work well with the optimizer and forces a scan. Although recompiling appears to help sometimes, it is not consistent.
If you have a large table where seeks are a must, then you have two options:
Dynamic SQL.
IF statements separating your queries, thus creating separate execution plans (when @p is null you will of course always get a scan).
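A minimal sketch of the dynamic SQL option for two optional filters, using the test_table from the question. The variable names @id and @name are illustrative, and the `+=` operator and inline DECLARE initializers assume SQL Server 2008 or later.

```sql
DECLARE @id INT = 1,
        @name VARCHAR(10) = NULL;

-- Start with a predicate that is always true so each filter can be appended with AND.
DECLARE @sql NVARCHAR(MAX) = N'SELECT name FROM test_table WHERE 1 = 1';

-- Append only the filters whose parameters were actually supplied.
IF @id IS NOT NULL
    SET @sql += N' AND id = @id';
IF @name IS NOT NULL
    SET @sql += N' AND name = @name';

-- Values are still passed as parameters, so each distinct statement text
-- gets its own cached plan and there is no string concatenation of user input.
EXEC sp_executesql @sql,
     N'@id INT, @name VARCHAR(10)',
     @id = @id, @name = @name;
```

Each combination of supplied filters produces a different statement text, so the optimizer can pick a seek where one applies.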
Response to Comment on Andreas' Answer
The problem is that you need two different plans.
If @p1 = 1 then you can use a SEEK on the index.
If @p1 IS NULL, however, it is not a seek; by definition it's a SCAN.
This means that when the optimiser generates a plan before it knows the parameter values, it needs to create a plan that can fulfil all possibilities. Only a scan can cover both @p1 = 1 and @p1 IS NULL.
It also means that if the plan is recompiled at a time when the parameters are known, and @p1 = 1, a SEEK plan can be created.
This is why, as you mention in your comment, IF statements resolve your problem: each IF block represents a different portion of the problem space, and each can be given a different execution plan.
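For the single-parameter test case from the question, the IF approach could be sketched as follows (using the test_table and @p names from the test code):

```sql
DECLARE @p INT = 1;

IF @p IS NULL
    -- No filter requested: a scan is the right plan for this statement.
    SELECT name FROM test_table;
ELSE
    -- Filter requested: this statement can be compiled with a seek on the primary key.
    SELECT name FROM test_table WHERE id = @p;
```

The trade-off is that the number of branches doubles with each additional optional parameter, which is where dynamic SQL tends to scale better.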
See Dynamic Search Conditions in T-SQL.
This explains comprehensively the versions where the RECOMPILE option works and alternatives where it doesn't.
Look at this article http://www.bigresource.com/Tracker/Track-ms_sql-fTP7dh01/
It seems you can try the proposed solution:
`SELECT * FROM <table> WHERE IsNull(column, -1) = IsNull(@value, -1)`
or
`SELECT * FROM <table> WHERE COALESCE(column, -1) = COALESCE(@value, -1)`

SQL Server: what will trigger execution plan?

If I have statement
DECLARE @i INT;
DECLARE @d NUMERIC(9,3);
SET @i = 123;
SET @d = @i;
SELECT @d;
and I include the actual execution plan and run this query, I don't get an execution plan. Will a query produce an execution plan only when there is a FROM clause in the batch?
The simple answer is you don't get execution plans without table access.
Execution plans are what the optimiser produces: it works out the best way to satisfy the query based on indexes, statistics, etc.
What you have above is trivial and has no table access. Why do you need a plan?
Edit:
A derived table is table access as per Lucero's example in comments
Edit 2:
"Trivial" table access gives constant scans, not real scans or seeks:
SELECT * FROM sys.tables WHERE 1=0
Lucero's examples in comments
What do you mean by "what will trigger execution plan"? I also didn't understand "I include actual execution plan and run this query, I don't get an execution plan". I hope this link helps:
SQL Tuning Tutorial - Understanding a Database Execution Plan (1)
I would assume that a query plan is generated whenever a set-based operation needs to be performed.
Yes you need a from clause.
You can do like this
declare @i int
declare @d numeric(9,3)
set @i = 123
select @d = @i
from (select 1) as x(x)
select @d
And in the execution plan you see this
<ScalarOperator ScalarString="CONVERT_IMPLICIT(numeric(9,3),[@i],0)">

Stored Procedure; Insert Slowness

I have an SP that takes 10 seconds to run about 10 times (about a second every time it is run). The platform is ASP.NET, and the server is SQL Server 2005. I have indexed the table (on more than just the PK), and that is not the issue. Some caveats:
usp_SaveKeyword is not the issue. I commented out that entire SP and it made no difference.
I set @SearchID to 1 and the time was significantly reduced, taking only about 15ms on average for the transaction.
I commented out the entire stored procedure except the insert into tblSearches and, strangely, it took more time to execute.
Any ideas of what could be going on?
set ANSI_NULLS ON
go
ALTER PROCEDURE [dbo].[usp_NewSearch]
@Keyword VARCHAR(50),
@SessionID UNIQUEIDENTIFIER,
@time SMALLDATETIME = NULL,
@CityID INT = NULL
AS
BEGIN
SET NOCOUNT ON;
IF @time IS NULL SET @time = GETDATE();
DECLARE @KeywordID INT;
EXEC @KeywordID = usp_SaveKeyword @Keyword;
PRINT 'KeywordID : '
PRINT @KeywordID
DECLARE @SearchID BIGINT;
SELECT TOP 1 @SearchID = SearchID
FROM tblSearches
WHERE SessionID = @SessionID
AND KeywordID = @KeywordID;
IF @SearchID IS NULL BEGIN
INSERT INTO tblSearches
(KeywordID, [time], SessionID, CityID)
VALUES
(@KeywordID, @time, @SessionID, @CityID)
SELECT Scope_Identity();
END
ELSE BEGIN
SELECT @SearchID
END
END
Why are you using top 1 @SearchID instead of max (SearchID) or where exists in this query? top requires you to run the query and retrieve the first row from the result set. If the result set is large, this could consume quite a lot of resources before you get the final result.
SELECT TOP 1 @SearchID = SearchID
FROM tblSearches
WHERE SessionID = @SessionID
AND KeywordID = @KeywordID;
I don't see any obvious reason for this - either of aforementioned constructs should get you something semantically equivalent to this with a very cheap index lookup. Unless I'm missing something you should be able to do something like
select @SearchID = isnull (max (SearchID), -1)
from tblSearches
where SessionID = @SessionID
and KeywordID = @KeywordID
This ought to be fairly efficient and (unless I'm missing something) semantically equivalent.
Enable "Display Estimated Execution Plan" in SQL Management Studio - where does the execution plan show you spending the time? It'll guide you on the heuristics being used to optimize the query (or not in this case). Generally the "fatter" lines are the ones to focus on - they're ones generating large amounts of I/O.
Unfortunately even if you tell us the table schema, only you will be able to see actually how SQL chose to optimize the query. One last thing - have you got a clustered index on tblSearches?
Triggers!
They are insidious indeed.
What is the clustered index on tblSearches? If the clustered index is not on primary key, the database may be spending a lot of time reordering.
How many other indexes do you have?
Do you have any triggers?
Where does the execution plan indicate the time is being spent?
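The index and trigger questions above can be answered from the catalog views. The following sketch assumes the table lives in the dbo schema:

```sql
-- Indexes on the table, including which one (if any) is clustered.
SELECT i.name, i.type_desc, i.is_primary_key, i.is_unique
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.tblSearches');

-- Triggers defined on the table, and whether they are enabled.
SELECT t.name, t.is_disabled, t.is_instead_of_trigger
FROM sys.triggers AS t
WHERE t.parent_id = OBJECT_ID(N'dbo.tblSearches');
```

A row with type_desc = 'HEAP' in the first result means the table has no clustered index at all.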