I've had a SQL performance review done on a project we're working on, and one 'Critical' item that has come up is this:
This kind of wildcard query pattern will cause a table scan, resulting
in poor query performance.
SELECT *
FROM TabFoo
WHERE ColBar = @someparam OR @someparam IS NULL
Their recommendation is:
In many cases, an OPTION (RECOMPILE) hint can be a quick workaround.
From a design point of view, you can also consider using separate If
clauses or (not recommended) use a dynamic SQL statement.
Dynamic SQL surely isn't the right way forward. Basically, the procedure is one where I am searching for something, OR something else. Two parameters come into the procedure, and I am filtering on one, or the other.
A better example than what they showed is:
SELECT ..
FROM...
WHERE (ColA = @ParA OR @ParA IS NULL)
AND (ColB = @ParB OR @ParB IS NULL)
Is that bad practice, and besides dynamic SQL (because I thought dynamic SQL can't really be precompiled and be more efficient in its execution plan?), how would this best be done?
A query like
select *
from foo
where foo.bar = @p OR @p is null
might or might not cause a table scan. My experience is that it will not: the optimizer is perfectly able to do an index seek on the expression foo.bar = @p, assuming a suitable index exists. Further, it's perfectly able to short-circuit things if the variable is null. You won't know what your execution plan looks like until you try it and examine the bound execution plan. A better technique, however, is this:
select *
from foo
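where foo.bar = coalesce(@p,foo.bar)
which will give you the same behavior, provided foo.bar is never null (a null foo.bar does not equal itself, so those rows would be filtered out).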
If you are using a stored procedure, one thing that can and will bite you in the tookus is something like this:
create procedure dbo.spFoo
@p varchar(32)
as
select *
from dbo.foo
where foo.bar = @p or @p is null
return @@rowcount
The direct use of the stored procedure parameter in the where clause will cause the cached execution plan to be based on the value of @p on its first execution. That means that if the first execution of your stored procedure has an outlier value for @p, you may get a cached execution plan that performs really poorly for the 95% of "normal" executions and really well only for the oddball cases. To prevent this from occurring, you want to do this:
create procedure dbo.spFoo
@p varchar(32)
as
declare @pMine varchar(32)
set @pMine = @p
select *
from dbo.foo
where foo.bar = @pMine or @pMine is null
return @@rowcount
That simple assignment of the parameter to a local variable makes it an expression and so the cached execution plan is not bound to the initial value of @p. Don't ask how I know this.
Further the recommendation you received:
In many cases, an OPTION (RECOMPILE) hint can be a quick workaround.
From a design point of view, you can also consider using separate
If clauses or (not recommended) use a dynamic SQL statement.
is hogwash. Option(recompile) means that the hinted statement is recompiled on every execution of the stored procedure. While it is being compiled, compile-time locks are taken out on dependent objects. Further, nobody else is going to be able to execute the stored procedure until the compilation is completed. This has, shall we say, a negative impact on concurrency and performance. Use of option(recompile) should be a measure of last resort.
Write clean SQL and vet your execution plans using production data, or as close as you can get to it: the execution plan you get is affected by the size and shape/distribution of the data.
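For reference, here is a minimal sketch of where the hint under discussion goes. It is a query hint attached to a single statement, shown here against the foo/bar example from above; the procedure name dbo.spFooRecompile and the null default on @p are made up for illustration. Whether the per-execution compile cost is acceptable is something only your own execution plans and timings can tell you.

```sql
-- Hypothetical procedure name; dbo.foo and bar are the example objects used above.
create procedure dbo.spFooRecompile
  @p varchar(32) = null
as
  select *
  from dbo.foo
  where foo.bar = @p or @p is null
  option (recompile)  -- query hint: this statement gets a fresh plan on each execution
  return @@rowcount
```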
I could be wrong, but I'm pretty sure a table scan will occur no matter what if the column you have in your where clause isn't indexed. Also, you could probably get better performance by reordering your OR clauses so that if @ParA IS NULL is true, it evaluates first and would not require evaluating the value in the column. Something to remember is that the where clause is evaluated for every row that comes back from the from clause. I would not recommend dynamic SQL, and honestly, even under relatively heavy load I'd find it difficult to believe that this form of filter would cause a significant performance hit, since a table scan is required anytime the column isn't indexed.
We did a Microsoft engagement where they noted that we had a ton of this "Wildcard Pattern Usage", and their suggestion was to convert the query to an IF/ELSE structure...
IF (@SomeParam is null) BEGIN
SELECT *
FROM TabFoo
END
ELSE BEGIN
SELECT *
FROM TabFoo
WHERE ColBar = @SomeParam
END
They preferred this approach over recompile (adds to execution time) or dynamic code (can't plan ahead, so kind of the same thing, having to figure out the plan every time); and I seem to recall that it is still an issue even with local variables (plus, you need extra memory regardless).
You can see that things get a bit crazy if you write queries with multiple WPU issues, but at least for the smaller ones, MS recommends the IF/ELSE approach.
In all the examples I saw, NULL was involved, but I can't help but think if you had a parameter utilizing a default, whether on the parameter itself or set with an ISNULL(), and essentially the same pattern used, that might also be bad (well, as long as the default is something an "actual value" would never be, that is).
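For comparison, here is a rough sketch of the dynamic alternative for the two-parameter case, where nested IF/ELSE branches start to multiply. It uses sp_executesql with real parameters, so each distinct statement text gets its own cached plan. The procedure name, the int parameter types and the dbo schema are assumptions for illustration; TabFoo, ColA and ColB are the placeholder names from the question.

```sql
-- Sketch only: TabFoo, ColA and ColB stand in for the real table and columns.
CREATE PROCEDURE dbo.SearchTabFoo      -- hypothetical name
    @ParA int = NULL,                  -- types are assumed
    @ParB int = NULL
AS
BEGIN
    DECLARE @sql nvarchar(max);
    SET @sql = N'SELECT * FROM dbo.TabFoo WHERE 1 = 1';

    -- Append only the predicates that apply; values stay parameters,
    -- they are never concatenated into the statement text.
    IF @ParA IS NOT NULL SET @sql = @sql + N' AND ColA = @ParA';
    IF @ParB IS NOT NULL SET @sql = @sql + N' AND ColB = @ParB';

    -- sp_executesql accepts declared parameters that the statement does not use.
    EXEC sys.sp_executesql
        @sql,
        N'@ParA int, @ParB int',
        @ParA = @ParA,
        @ParB = @ParB;
END
```

At most four statement texts can be produced here, so at most four plans end up in the cache, one per combination of supplied parameters.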
Related
I have a parameterized query which looks like this (with ? being the application's parameter):
SELECT * FROM tbl WHERE tbl_id = ?
What are the performance implications of adding a variable like so:
DECLARE @id INT = ?;
SELECT * FROM tbl WHERE tbl_id = @id
I have attempted to investigate myself but have had no luck other than query plans taking slightly longer to compile when the query is first run.
If tbl_id is unique there is no difference at all. I'll try to explain why.
SQL Server can usually solve a query with many different execution plans, and it has to choose one. It tries to find the most efficient one without too much effort. Once SQL Server chooses a plan, it usually caches it for later reuse. Cardinality plays a key role in the efficiency of an execution plan, i.e. how many rows are there in tbl with a given value of tbl_id? SQL Server stores statistics on column value frequencies to estimate cardinality.
Firstly, let's assume tbl_id is not unique and has a non-uniform distribution.
In the first case we have tbl_id = ?. Let's figure out its cardinality. The first thing we need to know is the value of the parameter ?. Is it unknown? Not really. We have a value the first time the query is executed. SQL Server takes this value, goes to the stored statistics and estimates the cardinality for this specific value. It then estimates the cost of a bunch of possible execution plans taking that estimated cardinality into account, chooses the most efficient one and caches it for later reuse. This approach works most of the time. However, if you execute the query later with another parameter value that has a very different cardinality, the cached execution plan might be very inefficient.
In the second case we have tbl_id = @id, where @id is a variable declared in the query; it isn't a query parameter. What is the value of @id? SQL Server treats it as an unknown value. SQL Server picks the mean frequency from the stored statistics as the estimated cardinality for unknown values. Then SQL Server does the same as before: it estimates the cost of a bunch of possible execution plans taking the estimated cardinality into account, chooses the most efficient one and caches it for later reuse. Again, this approach works most of the time. However, if you execute the query with a value whose cardinality is very different from the mean, the execution plan might be very inefficient.
When all values have the same cardinality, they all have the mean cardinality, so there is no difference between parameter and variable. This is the case for unique values, therefore there is no difference when values are unique.
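The difference between the two estimates can be observed with a small skewed table. The following is only a sketch: the table dbo.SniffDemo, its index and its contents are invented for the demonstration, and the exact estimates depend on your server version and settings.

```sql
-- Hypothetical demo table: value 1 occurs 10,000 times, values 2..1001 once each.
CREATE TABLE dbo.SniffDemo (tbl_id int NOT NULL, filler char(100) NOT NULL DEFAULT 'x');

INSERT INTO dbo.SniffDemo (tbl_id)
SELECT TOP (10000) 1
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

INSERT INTO dbo.SniffDemo (tbl_id)
SELECT TOP (1000) 1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

CREATE INDEX ix_SniffDemo_tbl_id ON dbo.SniffDemo (tbl_id);  -- also creates statistics

-- Histogram: per-value row counts, usable when the value is known at compile time.
DBCC SHOW_STATISTICS ('dbo.SniffDemo', ix_SniffDemo_tbl_id) WITH HISTOGRAM;

-- Density vector: "All density" times the row count is the average rows per value,
-- which is what an unknown value is estimated at.
DBCC SHOW_STATISTICS ('dbo.SniffDemo', ix_SniffDemo_tbl_id) WITH DENSITY_VECTOR;

-- Compare the estimated row counts in the two execution plans:
SELECT * FROM dbo.SniffDemo WHERE tbl_id = 1;       -- literal: histogram estimate

DECLARE @id int;
SET @id = 1;
SELECT * FROM dbo.SniffDemo WHERE tbl_id = @id;     -- variable: average (density) estimate
```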
One advantage of the 2nd approach is that it reduces the number of plans SQL Server will store.
In the first version it will create a different plan for every datatype (tinyint, smallint, int & bigint).
That's assuming it's an ad hoc statement.
If it's in a stored proc, you might run into parameter sniffing, as mentioned above.
You could try adding
OPTION ( OPTIMIZE FOR (@id = [some good value]))
to the select to see if that helps, but it is usually not considered good practice to couple your queries to values.
I'm not sure if this helps, but I have to account for parameter sniffing for a lot of the stored procedures I write. I do this by creating local variables, setting those to the parameter values, and then using the local variables in the stored procedure. If you look at the stored execution plan, you can see that this prevents the parameter values from being used in the plan.
This is what I do:
CREATE PROCEDURE dbo.Test ( @var int )
AS
DECLARE
@_var int
SELECT
@_var = @var
SELECT *
FROM dbo.SomeTable
WHERE
Id = @_var
I do this mostly for SSRS. I've had a query/stored procedure return <1sec, but the report takes several minutes, for example. Doing the trick above fixed that.
There are also options for optimizing for specific values (e.g. OPTION (OPTIMIZE FOR (@var UNKNOWN))), but I've found this usually does not help me and does not have the same effect as the trick above. I haven't been able to investigate the specifics of why they are different, but in my experience OPTIMIZE FOR UNKNOWN did not help, whereas using local variables in place of parameters did.
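For comparison with the local-variable version above, this is roughly what the hint form looks like in a complete procedure. The procedure name is hypothetical; dbo.SomeTable and Id come from the example above. As noted, it did not behave the same as the local-variable trick in the cases described here, so treat it as something to test rather than a drop-in replacement.

```sql
-- dbo.SomeTable and Id come from the example above; the procedure name is hypothetical.
CREATE PROCEDURE dbo.TestOptimizeForUnknown ( @var int )
AS
SELECT *
FROM dbo.SomeTable
WHERE
    Id = @var
OPTION (OPTIMIZE FOR (@var UNKNOWN));  -- available in SQL Server 2008 and later
```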
Is there any difference, with regards to performance, when there are many queries running with (different) constant values inside a where clause, as opposed to having a query with declared parameters on top, where instead the parameter value is changing?
Sample query with a constant value in the where clause:
select
*
from [table]
where [guid_field] = '00000000-0000-0000-0000-000000000000' --value changes
Proposed (improved?) query with declared parameters:
declare @var uniqueidentifier = '00000000-0000-0000-0000-000000000000' --value changes
select
*
from [table]
where [guid_field] = @var
Is there any difference? I'm looking at the execution plans of something similar to the two above queries and I don't see any difference. However, I seem to recall that if you use constant values in SQL statements that SQL server won't reuse the same query execution plans, or something to that effect that causes worse performance -- but is that actually true?
It is important to distinguish between parameters and variables here. Parameters are passed to procedures and functions, variables are declared.
Addressing variables, which is what the SQL in the question has: when compiling an ad hoc batch, SQL Server compiles each statement in its own right.
So when compiling the query with a variable it does not go back to check any assignment, and it compiles an execution plan optimised for an unknown value.
On first run, this execution plan is added to the plan cache; future executions can, and will, reuse this cached plan for all variable values.
When you pass a constant, the query is compiled based on that specific value, so it can create a more optimal plan, but with the added cost of recompilation.
So to specifically answer your question:
However, I seem to recall that if you use constant values in SQL statements that SQL server won't reuse the same query execution plans, or something to that effect that causes worse performance -- but is that actually true?
Yes, it is true that the same plan cannot be reused for different constant values, but that does not necessarily cause worse performance. It is possible that a more appropriate plan can be used for that particular constant (e.g. choosing bookmark lookup over index scan for sparse data), and this query plan change may outweigh the cost of recompilation. So, as is almost always the case with SQL performance questions, the answer is: it depends.
For parameters, the default behaviour is that the execution plan is compiled based on the parameter value(s) used when the procedure or function is first executed.
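One way to observe the reuse behaviour described above is to look at the plan cache after running each form of the query a few times with different values. This is a sketch only: it needs VIEW SERVER STATE permission, and the LIKE filter on guid_field is simply a convenient way to find the question's queries.

```sql
-- Requires VIEW SERVER STATE permission.
SELECT cp.usecounts,      -- how many times this cached plan has been reused
       cp.objtype,        -- Adhoc, Prepared, Proc, ...
       st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE N'%guid_field%'                 -- column name from the question
  AND st.text NOT LIKE N'%dm_exec_cached_plans%'   -- hide this query itself
ORDER BY cp.usecounts DESC;
```

Each distinct batch text shows up as its own row, so the number of rows and their usecounts show how often each form was actually reused.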
I have answered similar questions before in much more detail with examples, that cover a lot of the above, so rather than repeat various aspects of it I will just link the questions:
Does assigning stored procedure input parameters to local variables help optimize the query?
Ensure cold cache when running query
Why is SQL Server using index scan instead of index seek when WHERE clause contains parameterized values
There are many things involved in your question, and they all have to do with statistics.
SQL Server compiles an execution plan even for ad hoc queries and stores it in the plan cache for reuse if it is deemed safe.
select * into test from sys.objects
select schema_id,count(*) from test
group by schema_id
--schema_id 1 has 15
--4 has 44 rows
First ask:
We are trying a different literal every time, so SQL Server saves the plan if it deems it safe. You can see that the second query's estimates are the same as for literal 4, since SQL Server saved the plan for 4.
--lets clear cache first--not for prod
dbcc freeproccache
select * from test
where schema_id=4
output:
select * from test where
schema_id=1
output:
Second ask:
Passing a local variable as the param; let's use the same value of 4.
--let's pass 4, which we know has 44 rows; estimates were 44 when we used literals
declare @id int
set @id=4
select * from test
where schema_id=@id
With the local variable, the estimate drops to roughly 29.5 rows, which has to do with statistics.
output:
So in summary, statistics are crucial in choosing the query plan (nested loops, or doing a scan or a seek), and from the examples you can see how the estimates differ for each method.
From a plan cache bloat perspective, you might also wonder what happens if you pass many ad hoc queries, since SQL Server generates a new plan for the same query even if only the whitespace changes. The links below will help with that.
Further readings:
http://www.sqlskills.com/blogs/kimberly/plan-cache-adhoc-workloads-and-clearing-the-single-use-plan-cache-bloat/
http://sqlperformance.com/2012/11/t-sql-queries/ten-common-threats-to-execution-plan-quality
First, note that a local variable is not the same as a parameter.
Assuming the column is indexed or has statistics, SQL Server uses the statistics histogram to glean an estimate of the qualifying row count based on the constant value supplied. The query will also be auto-parameterized and cached if it is trivial (yields the same plan regardless of values) so that subsequent executions avoid query compilation costs.
A parameterized query also generates a plan using the stats histogram with the initially supplied parameter value. The plan is cached and reused for subsequent executions regardless of whether or not it is trivial.
With a local variable, SQL Server uses the overall statistics cardinality to generate the plan because the actual value is unknown at compile time. This plan may be good for some values but suboptimal for others when the query is not trivial.
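The three cases described above correspond to three ways of writing what is logically the same query. A small sketch, using a hypothetical demo table dbo.PlanDemo built the same way as the test table in the previous answer:

```sql
-- Hypothetical demo table, built the same way as the test table in the previous answer.
IF OBJECT_ID('dbo.PlanDemo') IS NOT NULL DROP TABLE dbo.PlanDemo;
SELECT * INTO dbo.PlanDemo FROM sys.objects;

-- 1. Constant: estimated from the histogram for the value 4.
SELECT * FROM dbo.PlanDemo WHERE schema_id = 4;

-- 2. Parameterized query: estimated from the histogram for the first value supplied,
--    then cached and reused for later values.
EXEC sys.sp_executesql
    N'SELECT * FROM dbo.PlanDemo WHERE schema_id = @id',
    N'@id int',
    @id = 4;

-- 3. Local variable: value unknown at compile time, estimated from overall density.
DECLARE @id int;
SET @id = 4;
SELECT * FROM dbo.PlanDemo WHERE schema_id = @id;
```

Comparing the estimated row counts in the three actual execution plans shows which statistics each form used.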
I found this article that explains using IF/ELSE statements in a SP can cause performance deterioration over using separate SPs for each 'branch'. http://sqlmag.com/t-sql/if-statements-and-stored-procedure-performance
But I have an SP which selects the same columns, from the same tables, and only the WHERE clause changes depending on what variables are present. Here is an example:
IF @Variable1 IS NOT NULL
BEGIN
SELECT
*
FROM
dbo.Table1
WHERE
Column1 = @Variable1
END
ELSE IF @Variable1 IS NULL AND @Variable2 IS NOT NULL
BEGIN
SELECT
*
FROM
dbo.Table1
WHERE
Column1 = Column1
AND
Column2 = @Variable2
END
So in this example, is it better to have 2 separate SPs to handle the different variables or is it ok to have it all in one like this?
(I know using SELECT * is not good practice. I just did it for the sake of example)
Normally, I wouldn't worry about this, although you should look at the white paper referenced by Mikael Eriksson which has a copious amount of useful information on this subject. However, I would remove the Column1 = Column1 statement in the else branch, because that could potentially confuse the optimizer.
What the article is referring to is the fact that the stored procedure is compiled the first time it is run. This can have perverse results. For instance, if the table is empty when you first call it, then the optimizer might prefer a full table scan to an index lookup, and that would be bad as the table gets larger.
The issue may be that one of the branches gets a suboptimal performance plan, because the data is not typical on the first call. This is especially true if one of the values is NULL. This doesn't only occur with if, but that is one case where you need to be sensitive to the issue.
I would recommend the following:
If your tables are growing/shrinking over time, periodically recompile your stored procedures.
If your tables are representative of the data, don't worry about splitting into multiple stored procedures.
Your examples should do an index lookup, which is pretty simple. But monitor performance and check execution plans to be sure they are what you want.
You can use hints if you want to force index usage. (Personally, I have needed hints to force particular join algorithms, but not index usage, but I'm sure someone else has had different experiences.)
For your examples, an index on table1(column1) and table1(column2) should suffice.
The summary of the advice is not to fix this until you see there is a problem. Putting the logic into two stored procedures should be for fixing a problem that you actually see, rather than pre-empting a problem that may never exist. If you do go with a two-procedure approach, you can still have a single interface that calls each of them, so you still have the same API. In other words, the one procedure should become three rather than two.
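If you do end up splitting, the "one procedure becomes three" layout could look roughly like the sketch below. All three procedure names and the int parameter types are hypothetical; dbo.Table1 and its columns are the ones from the question. Each inner procedure gets its own cached plan, compiled for its own predicate.

```sql
-- All three procedure names are hypothetical; dbo.Table1 is the table from the question.
-- GO is the batch separator used by SSMS and sqlcmd.
CREATE PROCEDURE dbo.GetTable1ByColumn1 @Variable1 int
AS
    SELECT * FROM dbo.Table1 WHERE Column1 = @Variable1;
GO
CREATE PROCEDURE dbo.GetTable1ByColumn2 @Variable2 int
AS
    SELECT * FROM dbo.Table1 WHERE Column2 = @Variable2;
GO
-- The wrapper keeps the original interface and only routes the call.
CREATE PROCEDURE dbo.GetTable1
    @Variable1 int = NULL,
    @Variable2 int = NULL
AS
    IF @Variable1 IS NOT NULL
        EXEC dbo.GetTable1ByColumn1 @Variable1 = @Variable1;
    ELSE IF @Variable2 IS NOT NULL
        EXEC dbo.GetTable1ByColumn2 @Variable2 = @Variable2;
GO
```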
I have a query that looks something like this:
select xmlelement("rootNode",
(case
when XH.ID is not null then
xmlelement("xhID", XH.ID)
else
xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
end),
(case
when XH.SER_NUM is not null then
xmlelement("serialNumber", XH.SER_NUM)
else
xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
end),
/*repeat this pattern for many more columns from the same table...*/
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest executing query (there are others of similar form, but much smaller and they aren't causing any major problems - yet). Maintenance is relatively easy as this is mostly a generated query, but my concern now is for performance. I am wondering how much of an overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same thing; my plan was to move the logic for handling the renaming and the xsi:nil attribute to XSL or maybe to PL/SQL.)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the case-when using the decode function instead. Would that perform better (than case-statements)?
Just about anything in your SELECT list, unless it is a user-defined function which reads a table or view, or a nested subselect, can usually be neglected for the purpose of analyzing your query's performance.
Open your connection properties and set the value SET STATISTICS IO on. Check out how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyze the plan to see?
For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version then you need to consider the performance of the functions, but I would be astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just the emphasis is different with regards to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason why your scope for tuning the query is so restricted is because - as presented - it is a single table, single column read. Most tuning is about efficient joining. Now if XH transpires to be a view over a complex query then it is a different matter.
You can use good old tkprof to analyze statistics. Use one of the many forms of ALTER SESSION that turn on stats gathering. The DBMS_PROFILER package also gathers statistics if your cursor is in a PL/SQL code block.
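As a rough illustration of the tkprof route (this thread is about Oracle, so the snippet is Oracle SQL rather than T-SQL), one of the simpler ALTER SESSION forms is sketched below. The identifier string and the report file name are arbitrary, and where the trace file lands depends on your instance configuration.

```sql
-- Oracle session-level SQL trace; the identifier string is an arbitrary label
-- that makes the trace file easier to find.
ALTER SESSION SET tracefile_identifier = 'xh_case_test';
ALTER SESSION SET sql_trace = TRUE;

-- ... run the CASE version and the XMLFOREST version of the query here ...

ALTER SESSION SET sql_trace = FALSE;

-- Then, on the database server, format the trace file from the shell, e.g.:
--   tkprof <trace_file>.trc report.txt
```

The report lists parse, execute and fetch times per statement, which allows the two query versions to be compared without hand-written timing code.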
So I have this weird problem with an SQL Server stored procedure. Basically I have this long and complex procedure. Something like this:
SELECT Table1.col1, Table2.col2, col3
FROM Table1 INNER JOIN Table2
Table2 INNER JOIN Table3
-----------------------
-----------------------
(Lots more joins)
WHERE Table1.Col1 = dbo.fnGetSomeID() AND (More checks)
-----------------------
-----------------------
(6-7 More queries like this with the same check)
The problem is that check in the WHERE clause at the end Table1.Col1 = dbo.fnGetSomeID(). The function dbo.fnGetSomeID() returns a simple integer value 1. So when I hardcode the value 1 where the function call should be the SP takes only about 15 seconds. BUT when I replace it with that function call in the WHERE clause it takes around 3.5 minutes.
So I do this:
DECLARE @SomeValue INT
SET @SomeValue = dbo.fnGetSomeID()
--Where clause changed
WHERE Table1.Col1 = @SomeValue
So now the function is only called once. But still the same 3.5 minutes. So I go ahead and do this:
DECLARE @SomeValue INT
--Removed the function, replaced it with 1
SET @SomeValue = 1
--Where clause changed
WHERE Table1.Col1 = @SomeValue
And still it takes 3.5 minutes. Why the performance impact? And how to make it go away?
Even with @SomeValue set at 1, when you have
WHERE Table1.Col1 = @SomeValue
SQL Server probably still views @SomeValue as a variable, not as a hardcoded 1, and that would affect the query plan accordingly. And since Table1 is linked to Table2, and Table2 is linked to Table3, etc., the amount of time to run the query is magnified. On the other hand, when you have
WHERE Table1.Col1 = 1
The query plan gets locked in with Table1.Col1 at a constant value of 1. Just because we see
WHERE Table1.Col1 = @SomeValue
as 'hardcoding', doesn't mean SQL sees it the same way. Every possible cartesian product is a candidate and @SomeValue needs to be evaluated for each.
So, the standard recommendations apply - check your execution plan, rewrite the query if needed.
Also, are those join columns indexed?
As is mentioned elsewhere, there will be execution plan differences depending on which approach you take. I'd look at both execution plans to see if there's an obvious answer there.
This question described a similar problem, and the answer in that case turned out to involve connection settings.
I've also run into almost the exact same problem as this myself, and what I found out in that case was that using the newer constructs (analytic functions in SQL 2008) was apparently confusing the optimizer. This may not be the case for you, as you're using SQL 2005, but something similar might be going on depending on the rest of your query.
One other thing to look at is whether you have a biased distribution of values for Table1.Col1 -- if the optimizer is using a general execution plan when you use the function or the variable rather than the constant, that might lead it to choose suboptimal joins than when it can clearly see that the value is one specific constant.
If all else fails, and this query is not inside another UDF, you can precalculate the fnGetSomeID() UDF's value like you were doing, then wrap the whole query in dynamic SQL, providing the value as a constant in the SQL string. That should give you the faster performance, at the cost of recompiling the query every time (which should be a good trade in this case).
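A minimal sketch of that last-resort approach is shown below, written with separate DECLARE and SET statements so that it also runs on SQL Server 2005. The SELECT inside the string is a stand-in for the real query, whose joins are elided in the question.

```sql
DECLARE @SomeValue int;
DECLARE @sql nvarchar(max);

SET @SomeValue = dbo.fnGetSomeID();   -- function called exactly once

-- Embed the value as a literal so the optimizer compiles for that specific constant.
-- Casting an int to text cannot inject SQL; do not do this with string input.
SET @sql = N'SELECT Table1.col1
             FROM Table1
             WHERE Table1.Col1 = ' + CAST(@SomeValue AS nvarchar(11)) + N';';

EXEC sys.sp_executesql @sql;
```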
Another thing to try.
Instead of loading the id into a variable, load it into a table
if object_id('myTable') is not null drop table myTable
select dbo.fnGetSomeID() as myID into myTable
and then use
WHERE Table1.Col1 = (select myID from myTable)
in your query.
You could try the OPTIMIZE FOR hint to force a plan for a given constant, but it may have inconsistent results; in 2008 you can use OPTIMIZE FOR UNKNOWN
I think that since the optimizer has no idea how much work the function does, it tries to evaluate them last.
I would try storing the return value of the function in a variable ahead of time, and using that in your where clause.
Also, you might want to try schema binding your function, because apparently sometimes it seriously affects performance.
You can make your function schema bound like so:
create function dbo.fnGetSomeID()
returns int
with schemabinding
... etc.
(Lots more joins)
WHERE Table1.Col1 = dbo.fnGetSomeID() AND (More checks)
This is not a nice problem to have. It shouldn't matter, finally, whether the value is returned by a function or subquery or variable or is a constant. But it does, and at some level of complexity it's very hard to get consistent results. And you can't really debug it, because neither you nor anyone else here can peer inside the black box that is the query optimizer. All you can do is poke at it and see how it behaves.
I think the query optimizer is behaving erratically because there are many tables in the query. When you tell it to look for 1 it looks at the index statistics and makes a good choice. When you tell it anything else, it assumes it should join based on what it does know, not trusting your function/variable to return a selective value. For that to be true, Table1.Col1 must have an uneven distribution of values. Or the query optimizer is not, um, optimal.
Either way, the estimated query plan should show a difference. Look for opportunities to add (or, sometimes, remove) an index. It could be that the 3.5-minute plan is reasonable in a lot of cases, and what the server really wants is better indexes.
Beyond that is guesswork. Sometimes, sad to say, the answer lies in finding the subset of tables that produce a small set of rows, putting them in a temporary table, and joining that to the rest of the tables. The OPTIMIZE FOR hint might be useful, too.
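The temporary-table approach might look something like the sketch below. The join columns (Id, Table1Id, Table2Id) are invented, since the question elides the join conditions, and col3 is assumed to live in Table3. Here the # prefix really does denote a temporary table.

```sql
-- Join columns (Id, Table1Id, Table2Id) are hypothetical; the question elides them.
SELECT t1.Id, t1.col1
INTO #Table1Subset                      -- temp table holding the small driving set
FROM Table1 AS t1
WHERE t1.Col1 = dbo.fnGetSomeID();

-- The temp table has its own row count and statistics, so the optimizer
-- knows how small the driving set is when it plans the remaining joins.
SELECT s.col1, t2.col2, t3.col3
FROM #Table1Subset AS s
INNER JOIN Table2 AS t2 ON t2.Table1Id = s.Id
INNER JOIN Table3 AS t3 ON t3.Table2Id = t2.Id;

DROP TABLE #Table1Subset;
```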
Keep in mind, though, that any solution you come up with will be fragile, and dependent on data and version.