So I have this weird problem with an SQL Server stored procedure. Basically I have this long and complex procedure. Something like this:
SELECT Table1.Col1, Table2.Col2, Col3
FROM Table1
INNER JOIN Table2 ON ...
INNER JOIN Table3 ON ...
-----------------------
-----------------------
(Lots more joins)
WHERE Table1.Col1 = dbo.fnGetSomeID() AND (More checks)
-----------------------
-----------------------
(6-7 More queries like this with the same check)
The problem is the check in the WHERE clause at the end: Table1.Col1 = dbo.fnGetSomeID(). The function dbo.fnGetSomeID() simply returns the integer value 1. When I hardcode the value 1 where the function call should be, the SP takes only about 15 seconds. BUT when I replace it with the function call in the WHERE clause, it takes around 3.5 minutes.
So I do this:
DECLARE @SomeValue INT
SET @SomeValue = dbo.fnGetSomeID()
--Where clause changed
WHERE Table1.Col1 = @SomeValue
So now the function is only called once. But still the same 3.5 minutes. So I go ahead and do this:
DECLARE @SomeValue INT
--Removed the function, replaced it with 1
SET @SomeValue = 1
--Where clause changed
WHERE Table1.Col1 = @SomeValue
And still it takes 3.5 minutes. Why the performance impact? And how to make it go away?
Even with @SomeValue set at 1, when you have
WHERE Table1.Col1 = @SomeValue
SQL Server probably still views @SomeValue as a variable, not as a hardcoded 1, and that would affect the query plan accordingly. And since Table1 is linked to Table2, and Table2 is linked to Table3, etc., the amount of time to run the query is magnified. On the other hand, when you have
WHERE Table1.Col1 = 1
The query plan gets locked in with Table1.Col1 at a constant value of 1. Just because we see
WHERE Table1.Col1 = @SomeValue
as 'hardcoding', doesn't mean SQL sees it the same way. Every possible Cartesian product is a candidate and @SomeValue needs to be evaluated for each.
So, the standard recommendations apply - check your execution plan, rewrite the query if needed.
Also, are those join columns indexed?
As is mentioned elsewhere, there will be execution plan differences depending on which approach you take. I'd look at both execution plans to see if there's an obvious answer there.
This question described a similar problem, and the answer in that case turned out to involve connection settings.
I've also run into almost the exact same problem as this myself, and what I found out in that case was that using the newer constructs (analytic functions in SQL 2008) was apparently confusing the optimizer. This may not be the case for you, as you're using SQL 2005, but something similar might be going on depending on the rest of your query.
One other thing to look at is whether you have a biased distribution of values for Table1.Col1 -- if the optimizer is using a generic execution plan when you use the function or the variable rather than the constant, that might lead it to choose less optimal joins than it would when it can clearly see that the value is one specific constant.
If all else fails, and this query is not inside another UDF, you can precalculate the fnGetSomeID() UDF's value like you were doing, then wrap the whole query in dynamic SQL, providing the value as a constant in the SQL string. That should give you the faster performance, at the cost of recompiling the query every time (which should be a good trade in this case).
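A rough sketch of that approach, with placeholder join conditions and the precalculated value held in a hypothetical @SomeValue variable:
DECLARE @SomeValue INT
SET @SomeValue = dbo.fnGetSomeID()
DECLARE @sql NVARCHAR(MAX)
-- Embed the precalculated value as a literal so the plan is compiled for a constant
SET @sql = N'SELECT ... FROM Table1 INNER JOIN Table2 ON ... WHERE Table1.Col1 = '
           + CAST(@SomeValue AS NVARCHAR(12))
EXEC sp_executesql @sql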
Another thing to try.
Instead of loading the id into a variable, load it into a table
if object_id('myTable') is not null drop table myTable
select dbo.fnGetSomeID() as myID into myTable
and then use
WHERE Table1.Col1 = (select myID from myTable)
in your query.
You could try the OPTIMIZE FOR hint to force a plan for a given constant, but it may have inconsistent results; in 2008 you can use OPTIMIZE FOR UNKNOWN.
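For reference, the hint goes at the end of the statement; something like this, assuming the @SomeValue variable from the question:
WHERE Table1.Col1 = @SomeValue
-- ...
OPTION (OPTIMIZE FOR (@SomeValue = 1))   -- compile the plan as though @SomeValue were 1
-- or, on SQL Server 2008 and later:
-- OPTION (OPTIMIZE FOR UNKNOWN)         -- compile for an 'average' value based on density statistics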
I think that since the optimizer has no idea how much work the function does, it tries to evaluate it last.
I would try storing the return value of the function in a variable ahead of time, and using that in your where clause.
Also, you might want to try schema binding your function, because apparently that can sometimes make a serious difference to performance.
You can make your function schema bound like so:
create function fnGetSomeID()
returns int
with schemabinding
as
... etc.
This is not a nice problem to have. It shouldn't matter, in the end, whether the value is returned by a function, a subquery, a variable, or a constant. But it does, and at some level of complexity it's very hard to get consistent results. And you can't really debug it, because neither you nor anyone else here can peer inside the black box that is the query optimizer. All you can do is poke at it and see how it behaves.
I think the query optimizer is behaving erratically because there are many tables in the query. When you tell it to look for 1, it looks at the index statistics and makes a good choice. When you tell it anything else, it assumes it should join based on what it does know, not trusting your function/variable to return a selective value. For that to be true, Table1.Col1 must have an uneven distribution of values. Or the query optimizer is not, um, optimal.
Either way, the estimated query plan should show a difference. Look for opportunities to add (or, sometimes, remove) an index. It could be that the 3.5-minute plan is reasonable in a lot of cases, and what the server really wants is better indexes.
Beyond that is guesswork. Sometimes, sad to say, the answer lies in finding the subset of tables that produce a small set of rows, putting them in a temporary table, and joining that to the rest of the tables. The OPTIMIZE FOR hint might be useful, too.
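A rough sketch of that temp-table idea, with invented table, column, and key names:
-- First reduce the most selective tables to a small row set
SELECT Table1.Col1, Table2.Col2, Table2.SomeOtherKey
INTO #Narrowed
FROM Table1
INNER JOIN Table2 ON Table1.SomeKey = Table2.SomeKey
WHERE Table1.Col1 = @SomeValue

-- Then join the small set to the remaining tables
SELECT n.Col1, n.Col2, Table3.Col3
FROM #Narrowed n
INNER JOIN Table3 ON n.SomeOtherKey = Table3.SomeOtherKey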
Keep in mind, though, that any solution you come up with will be fragile, being both data- and version-dependent.
I read that normally you should use EXISTS when the results of the subquery are large, and IN when the subquery results are small.
But it would seem to me that it's also relevant if a subquery has to be re-evaluated for each row, or if it can be evaluated once for the entire query.
Consider the following example of two equivalent queries:
SELECT * FROM t1
WHERE attr IN
(SELECT attr FROM t2
WHERE attr2 = ?);
SELECT * FROM t1
WHERE EXISTS
(SELECT * FROM t2
WHERE t1.attr = t2.attr
AND attr2 = ?);
The former subquery can be evaluated once for the entire query, the latter has to be evaluated for each row.
Assume that the results of the subquery are very large. Which would be the best way to write this?
This is a good question. Especially as in Oracle you can convert every EXISTS clause into an IN clause and vice versa, because Oracle's IN clause can deal with tuples (where (a, b, c) in (select x, y, z from ...)), which most other DBMSs cannot.
And your reasoning is good. Yes, with the IN clause you suggest loading all the subquery's data once instead of looking up the records in a loop. However, this is only partly true, because:
As good as it seems to get all the subquery data selected just once, the outer query must loop through the resulting array for every record. This can be quite slow, because it's just an array. If Oracle looks up data in a table instead, there are often indexes to help it, so the nested loop with repeated table lookups can end up being faster.
Oracle's optimizer re-writes queries. So it can come to the same execution plan for the two statements or even get to quite unexpected plans. You never know ;-)
Oracle might decide not to loop at all. It may opt for a hash join instead, which works completely differently and is usually very effective.
Having said this, Oracle's optimizer should notice that the two statements are exactly the same actually and should generate the same execution plan. But experience shows that the optimizer sometimes doesn't notice, and quite often the optimizer does better with the EXISTS clause for whatever reason. (Not as much difference as in MySQL, but still, EXISTS seems preferable over IN in Oracle, too.)
So as to your question "Assume that the results of the subquery are very large. Which would be the best way to write this?", it is unlikely for the IN clause to be faster than the EXISTS clause.
I often like the IN clause better for its simplicity and mostly find it a bit more readable. But when it comes to performance, it is sometimes better to use EXISTS (or even outer joins for that matter).
We are in the process of some data integration and I get update scripts in the form of
UPDATE Table1 SET Table1.field1 = '12345' WHERE Table1.field2 = '345667';
UPDATE Table1 SET Table1.field1 = '12365' WHERE Table1.field2 = '567885';
Table1.field2 is not indexed.
The scripts run without problems, but it takes forever. 8000 rows affected in a bit over 7 minutes, which I feel is a bit long. (It's running on a dev server which is not the best, but a look at the server doesn't indicate that it is overly busy).
So my question is, is there a better way (i.e. faster) to run this type of update statements. (SQL 2008 R2)
Many Thanks!
You may be RBAR-ing (row-by-agonizing-row) the server with multiple UPDATE statements. Essentially, you're having it do a table scan for each query, which is obviously non-ideal. While an index would help the most, executing multiple single-value statements will still cost you.
SQL Server allows you to use JOINs for update statements, so you may see some improvement doing something like this:
WITH Incoming AS (SELECT field1, field2
FROM (VALUES('12345', '345667'),
('12365', '567885')) i(field1, field2))
UPDATE Table1
SET Table1.field1 = Incoming.field1
FROM Table1
JOIN Incoming
ON Incoming.field2 = Table1.field2;
SQL Fiddle Example
If it turns out that the number of rows in Incoming is large, you should probably materialize it as an actual table that you bulk-load into first. You should be able to put an index on the load table (created or rebuilt after the import, to make sure statistics are correct).
But really, an index on field2 should probably be the first thing, especially if there are multiple queries that use that column.
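For example (the index name is just a placeholder); after this, each UPDATE becomes an index seek rather than a full table scan:
CREATE NONCLUSTERED INDEX IX_Table1_field2 ON Table1 (field2);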
As I see it, you can try different things (like storing all values on another table, then updating one using the other), but in the end, the engine is going to search using a single field, testing equality with a value.
That would require an index. If you can at least test in dev, maybe you can show the performance improvement to someone who can authorize the creation of the new index in the production environment.
That's my answer, I hope someone comes up with a better one!
I've had a SQL performance review done on a project we're working on, and one 'Critical' item that has come up is this:
This kind of wildcard query pattern will cause a table scan, resulting
in poor query performance.
SELECT *
FROM TabFoo
WHERE ColBar = @someparam OR @someparam IS NULL
Their recommendation is:
In many cases, an OPTION (RECOMPILE) hint can be a quick workaround.
From a design point of view, you can also consider using separate If
clauses or (not recommended) use a dynamic SQL statement.
Dynamic SQL surely isn't the right way forward. Basically the procedure is one where I am searching for something, OR something else. Two parameters come into the procedure, and I am filtering on one, or the other.
A better example than what they showed is:
SELECT ..
FROM...
WHERE (ColA = @ParA OR @ParA IS NULL)
AND (ColB = @ParB OR @ParB IS NULL)
Is that bad practice, and besides dynamic SQL (because, I thought, dynamic SQL can't really be compiled and be more efficient in its execution plan?), how would this best be done?
A query like
select *
from foo
where foo.bar = @p OR @p is null
might or might not cause a table scan. My experience is that it will not: the optimizer is perfectly able to do an index seek on the expression foo.bar = @p, assuming a suitable index exists. Further, it's perfectly able to short-circuit things if the variable is null. You won't know what your execution plan looks like until you try it and examine the bound execution plan. A better technique, however, is this:
select *
from foo
where foo.bar = coalesce(@p, foo.bar)
which will give you the same behavior.
If you are using a stored procedure, one thing that can and will bite you in the tookus is something like this:
create procedure dbo.spFoo
  @p varchar(32)
as
select *
from dbo.foo
where foo.bar = @p or @p is null
return @@rowcount
The direct use of the stored procedure parameter in the where clause will cause the cached execution plan to be based on the value of @p on its first execution. That means that if the first execution of your stored procedure has an outlier value for @p, you may get a cached execution plan that performs really poorly for the 95% of "normal" executions and really well only for the oddball cases. To prevent this from occurring, you want to do this:
create procedure dbo.spFoo
  @p varchar(32)
as
declare @pMine varchar(32)
set @pMine = @p
select *
from dbo.foo
where foo.bar = @pMine or @pMine is null
return @@rowcount
That simple assignment of the parameter to a local variable makes it an expression, and so the cached execution plan is not bound to the initial value of @p. Don't ask how I know this.
Further the recommendation you received:
In many cases, an OPTION (RECOMPILE) hint can be a quick workaround.
From a design point of view, you can also consider using separate
If clauses or (not recommended) use a dynamic SQL statement.
is hogwash. Option(recompile) means that the stored procedure is recompiled on every execution. When the stored procedure is being compiled, compile-time locks are taken out on the dependent objects. Further, nobody else is going to be able to execute the stored procedure until the compilation is completed. This has, shall we say, a negative impact on concurrency and performance. Use of option(recompile) should be a measure of last resort.
Write clean SQL and vet your execution plans using production data, or as close as you can get to it: the execution plan you get is affected by the size and shape/distribution of the data.
I could be wrong, but I'm pretty sure a table scan will occur no matter what if the column you have in your where clause isn't indexed. Also, you could probably get better performance by reordering your OR clauses so that if @ParA IS NULL is true, it evaluates first and would not require evaluating the value in the column. Something to remember is that the where clause is evaluated for every row that comes back from the from clause. I would not recommend dynamic SQL, and honestly, even under relatively heavy load I'd find it difficult to believe that this form of filter would cause a significant performance hit, since a table scan is required anytime the column isn't indexed.
We did a Microsoft engagement where they noted that we had a ton of this "Wildcard Pattern Usage", and their suggestion was to convert the query to an IF/ELSE structure...
IF (@SomeParam is null) BEGIN
SELECT *
FROM TabFoo
END
ELSE BEGIN
SELECT *
FROM TabFoo
WHERE ColBar = @SomeParam
END
They preferred this approach over recompile (adds to execution time) or dynamic code (can't plan ahead, so kind of the same thing, having to figure out the plan every time); and I seem to recall that it is still an issue even with local variables (plus, you need extra memory regardless).
You can see that things get a bit crazy if you write queries with multiple WPU issues, but at least for the smaller ones, MS recommends the IF/ELSE approach.
In all the examples I saw, NULL was involved, but I can't help wondering whether essentially the same pattern with a parameter that uses a default, whether on the parameter itself or set with an ISNULL(), might also be bad (well, as long as the default is something an "actual value" would never be, that is).
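For clarity, the pattern I have in mind is something like this (the sentinel default is invented):
CREATE PROCEDURE dbo.spFooSearch
    @someparam VARCHAR(32) = 'ZZZ_NONE'   -- a default no real value should ever match
AS
    -- or normalize inside the body: SET @someparam = ISNULL(@someparam, 'ZZZ_NONE')
    SELECT *
    FROM TabFoo
    WHERE ColBar = @someparam OR @someparam = 'ZZZ_NONE'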
I am re-iterating the question asked by Mongus Pong Why would using a temp table be faster than a nested query? which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple queryplan setting to make the engine just spool each subquery in turn, working from the inside out. No second guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
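In outline, and with invented names, that looks like this; the index also gives the optimizer real statistics to work with:
-- Materialize the former subquery/CTE yourself
SELECT KeyCol, SUM(Amount) AS Amount
INTO #intermediate
FROM dbo.Detail
GROUP BY KeyCol;

CREATE CLUSTERED INDEX IX_intermediate ON #intermediate (KeyCol);

-- Then join the temp table to the rest of the query
SELECT i.KeyCol, i.Amount, o.OtherCol
FROM #intermediate i
JOIN dbo.OtherTable o ON o.KeyCol = i.KeyCol;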
Failing that, regarding point 1, see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP (large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works, however, there are no statistics on the spool.
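A sketch of that trick on made-up tables (the huge TOP value is arbitrary; together with the ORDER BY it merely nudges the optimizer towards spooling the derived table):
SELECT d.SomeKey, d.cnt, o.OtherCol
FROM (
    SELECT TOP (2000000000) t.SomeKey, COUNT(*) AS cnt
    FROM dbo.SomeTable t
    GROUP BY t.SomeKey
    ORDER BY t.SomeKey
) AS d
JOIN dbo.OtherTable o ON o.SomeKey = d.SomeKey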
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified, and which could potentially coax it into achieving that result in some instances. This hint will sometimes produce a more efficient plan for a complex query where the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
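For completeness, the hint is simply appended to the statement (the table names here are invented):
SELECT s.SomeKey, b.SomeCol
FROM SmallSelectiveTable s
JOIN BigTable b ON b.SomeKey = s.SomeKey
OPTION (FORCE ORDER)   -- joins are performed in the order the tables are written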
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
qty smallint NOT NULL) AS
BEGIN
INSERT @t (title, qty)
SELECT t.title, s.qty
FROM sales s
JOIN titles t ON t.title_id = s.title_id
WHERE s.stor_id = @storeid
RETURN
END
CREATE VIEW SalesData As
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in the wrong order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to get executed first by limiting the results using TOP, however this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose since SQL Server just ignores it.
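For illustration, that hack applied to the example above would look roughly like this (the limit is arbitrary and must stay above the real row count):
SELECT bar.*
FROM
(SELECT TOP 1000000 * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())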
I have been asked the following question: what would you look into when you want to improve a stored procedure's performance? The stored procedure returns some value and has three joins in it.
Other than making sure the joins are well written, what can one do to make it perform better? This was a general question and no code was provided.
Any ideas?
Check the indexes on the tables used in the joins. Particularly, are the columns used in the joins indexed?
Example -
SELECT *
FROM SomeTable a
JOIN SomeOtherTable b on a.ItemId = b.ItemId
If these tables are large, indexing ItemId in both tables will typically help performance a lot.
You should do the same thing for any columns that are used in the WHERE clause, if your query has one.
WHERE a.ProductId = @SomeVariableYouPassedToTheStoredProc
Indexing ProductId may help in this case.
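For instance (the index names are placeholders):
-- Support the join
CREATE NONCLUSTERED INDEX IX_SomeTable_ItemId ON SomeTable (ItemId);
CREATE NONCLUSTERED INDEX IX_SomeOtherTable_ItemId ON SomeOtherTable (ItemId);
-- Support the WHERE clause
CREATE NONCLUSTERED INDEX IX_SomeTable_ProductId ON SomeTable (ProductId);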
Query performance is something you could go into a rabbit hole on, but this is a logical (and quick) place to start.
There are a lot of things you can do to optimize procedures, but it sounds like your SQL statement is pretty simple. Some things to watch out for:
Inline functions. These can cause SQL to do a row-by-row evaluation and slow things down.
Data conversions on join statements. These can prevent indexes from being used (see the sketch after this list).
Make sure columns being joined on/in the where clause are indexed (for large data sets).
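As a sketch of the data-conversion point, with invented tables: if the joined columns have mismatched types, SQL Server adds an implicit CONVERT, which can turn an index seek into a scan.
-- Suppose Orders.CustomerCode is VARCHAR(20) but Customers.CustomerCode is NVARCHAR(20):
-- the implicit conversion on the VARCHAR side can prevent an index seek on it
SELECT o.OrderId, c.Name
FROM dbo.Orders o
JOIN dbo.Customers c ON o.CustomerCode = c.CustomerCode
-- Fix: align the column data types, or convert explicitly on the side that is not indexed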
You can check out this website for more performance tips, but I think I covered most of what you need for simple statements:
SQL Optimizations School
The fact that it's a stored procedure has little to nothing to do with it. Optimise the sql inside.
As to how: all the usual suspects apply, and don't be the sort of eejit who thinks you can guess what's wrong without looking.
Copy the sql from the proc into a suitable tool, prefix it with Explain to see what's going on.
I presume there are other options. For example:
1. Each of those joins could use a restricting condition that looks like 'and permitted_user_name = (select user_name from user_list where ...)'. That value could be derived once at procedure start (I mean in the first statement of the procedure) so as not to overload the DB with many similar queries.
2. Starting from Oracle 11 you can declare a function with cached results (i.e. the function is calculated once and isn't recalculated each time it is invoked), defining a set of tables whose changes invalidate the cache.
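A rough sketch of point 2, assuming Oracle 11g's RESULT_CACHE feature and invented object names:
CREATE OR REPLACE FUNCTION get_permitted_user_name RETURN VARCHAR2
    RESULT_CACHE RELIES_ON (user_list)   -- cache is invalidated when user_list changes
IS
    v_name user_list.user_name%TYPE;
BEGIN
    SELECT user_name INTO v_name FROM user_list WHERE ROWNUM = 1;
    RETURN v_name;
END;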
In any case, the question is mostly DB-specific.
Run the Query Analyser on the SQL statement