IF/ELSE Performance in Stored Procedures

IF/ELSE Performance in Stored Procedures - sql

I found this article that explains using IF/ELSE statements in a SP can cause performance deterioration over using separate SPs for each 'branch'. http://sqlmag.com/t-sql/if-statements-and-stored-procedure-performance
But I have an SP which selects the same columns, from the same tables, and only the WHERE clause changes depending on what variables are present. Here is an example:
IF #Variable1 IS NOT NULL
BEGIN
SELECT
*
FROM
dbo.Table1
WHERE
Column1 = #Variable1
END
ELSE IF #Variable1 IS NULL AND #Variable2 IS NOT NULL
BEGIN
SELECT
*
FROM
dbo.Table1
WHERE
Column1 = Column1
AND
Column2 = #Variable2
END
So in this example, is it better to have 2 seperate SPs to handle the different variables or is it ok to have it all in one like this?
(I know using SELECT * is not good practice. I just did it for the sake of example)

Normally, I wouldn't worry about this, although you should look at the white paper referenced by Mikael Eriksson which has a copious amount of useful information on this subject. However, I would remove the Column1 = Column1 statement in the else branch, because that could potentially confuse the optimizer.
What the article is referring to is the fact that the stored procedure is compiled the first time it is run. This can have perverse results. For instance, if the table is empty when you first call it, then the optimizer might prefer a full table scan to an index lookup, and that would be bad as the table gets larger.
The issue may be that one of the branches gets a suboptimal performance plan, because the data is not typical on the first call. This is especially true if one of the values is NULL. This doesn't only occur with if, but that is one case where you need to be sensitive to the issue.
I would recommend the following:
If your tables are growing/shrinking over time, periodically recompile your stored procedures.
If your tables are representative of the data, don't worry about splitting into multiple stored procedures.
Your examples should do an index lookup, which is pretty simple. But monitor performance and check execution plans to be sure they are what you want.
You can use hints if you want to force index usage. (Personally, I have needed hints to force particular join algorithms, but not index usage, but I'm sure someone else has had different experiences.)
For your examples, an index on table1(column1) and table1(column2) should suffice.
The summary of the advice is not to fix this until you see there is a problem. Putting the logic into two stored procedures should be for fixing a problem that you actually see, rather than pre-empting problem that may never exist. If you do go with a two-procedure approach, you can still have a single interface that calls each of them, so you still have the same API. In other words, the one procedure should become three rather than two.

Related

Wildcard Pattern usage

I've had a SQL performance review done on a project we're working on, and one 'Critical' item that has come up is this:
This kind of wildcard query pattern will cause a table scan, resulting
in poor query performance.
SELECT *
FROM TabFoo
WHERE ColBar = #someparam OR #someparam IS NULL
Their recommendation is:
In many cases, an OPTION (RECOMPILE) hint can be a quick workaround.
From a design point of view, you can also consider using separate If
clauses or (not recommended) use a dynamic SQL statement.
Dynamic SQL surely isn't the right way forward. Basically the procedure is one where I am search for something, OR something else. Two parameters come into the procedure, and I am filtering on one, or the other.
A better example than what they showed is:
SELECT ..
FROM...
WHERE (ColA = #ParA OR #ColA IS NULL)
(AND ColB = #ParB OR #ParB IS NULL)
Is that bad practice, and besides dynamic SQL (because, I thought dynamic sql can't really compile and be more efficient in it's execution plan?), how would this best be done?

A query like
select *
from foo
where foo.bar = #p OR #p is null
might or might not cause a table scan. My experience is that it will not: the optimizer perfectly able to do an index seek on the expression foo.bar = #p, assuming a suitable index exists. Further, it's perfectly able to short-circuit things if the variable is null. You won't know what your execution plan looks like until you try it and examine the bound execution plane. A better technique, however is this:
select *
from foo
where foo.bar = coalesce(#p,foo.bar)
which will give you the same behavior.
If you are using a stored procedure, one thing that can and will bite you in the tookus is something like this:
create dbo.spFoo
#p varchar(32)
as
select *
from dbo.foo
where foo.bar = #p or #p = null
return ##rowcount
The direct use of the stored procedure parameter in the where clause will cause the cached execution plan to be based on the value of #p on its first execution. That means that if the first execution of your stored procedure has an outlier value for #p, you may get a cached execution plan that performs really poorly for the 95% of "normal" executions and really well only for the oddball cases. To prevent this from occurring, you want to do this:
create dbo.spFoo
#p varchar(32)
as
declare #pMine varchar(32)
set #pMine = #p
select *
from dbo.foo
where foo.bar = #pMine or #pMine = null
return ##rowcount
That simple assignment of the parameter to a local variable makes it an expression and so the cached execution plan is not bound to the initial value of #p. Don't ask how I know this.
Further the recommendation you received:
In many cases, an OPTION (RECOMPILE) hint can be a quick workaround.
From a design point of view, you can also consider using separate
If clauses or (not recommended) use a dynamic SQL statement.
is hogwash. Option(recompile) means that the stored procedure is recompiled on every execution. When the stored procedure is being compiled, compile-time locks on taken out on dependent object. Further, nobody else is going to be able to execute the stored procedure until the compilation is completed. This has, shall we say, negative impact on concurrency and performance. Use of option(recompile) should be a measure of last resort.
Write clean SQL and vet your execution plans using production data, or as close as you can get to it: the execution plan you get is affected by the size and shape/distribution of the data.

I could be wrong, but I'm pretty sure a table scan will occur no matter what if the column you have in your where clause isn't indexed. Also, you could probably get better performance by reordering your OR clauses so that if #ParA IS NULL is true, it evaluates first and would not require evaluating the value in the column. Something to remember is that the where clause is evaluated for every row that comes back from the from clause. I would not recommend dynamic SQL, and honestly, even under relatively heavy load I'd find it difficult to believe that this form of filter would cause a significant performance hit, since a table scan is required anytime the column isn't indexed.

We did a Microsoft engagement where they noted that we had a ton of this "Wildcard Pattern Usage", and their suggestion was to convert the query to an IF/ELSE structure...
IF (#SomeParam is null) BEGIN
SELECT *
FROM TabFoo
END
ELSE BEGIN
SELECT *
FROM TabFoo
WHERE ColBar = #someparam
END
They preferred this approach over recompile (adds to execution time) or dynamic code (can't plan ahead, so kind of the same thing, having to figure out the plan every time); and I seem to recall that it is still an issue even with local variables (plus, you need extra memory regardless).
You can see that things get a bit crazy if you write queries with multiple WPU issues, but at least for the smaller ones, MS recommends the IF/ELSE approach.
In all the examples I saw, NULL was involved, but I can't help but think if you had a parameter utilizing a default, whether on the parameter itself or set with an ISNULL(), and essentially the same pattern used, that might also be bad (well, as long as the default is something an "actual value" would never be, that is).

SQL performance for a returning stored procedure

I have been asked the following question, what would you look into when you want to improve a stored procedure performance? The stored procedure is returning some value and have three joins in it.
Other than making sure the joins are well written what one can do to make it perform better? This was a general question and no code was provided.
Any ideas?

Check the indexes on the tables used in the joins. Particularly, are the columns used in the joins indexed?
Example -
SELECT *
FROM SomeTable a
JOIN SomeOtherTable b on a.ItemId = b.ItemId
If these tables are large, indexing ItemId in both tables will typically help performance a lot.
You should do the same thing for any columns that are used in the WHERE clause, if your query has one.
WHERE a.ProductId = #SomeVariableYouPassedToTheStoredProc
Indexing ProductId may help in this case.
Query performance is something you could go into a rabbit hole on, but this is a logical (and quick) place to start.

There are a lot of things you can do to optimize procedures, but it sounds like your SQL statement is pretty simple. Some things to watch out for:
Inline functions. These can cause SQL to do a row by row evaluation and slow things down
Data conversions on join statements. These can prevent indexes from being used.
Make sure columns being joined on/in the where clause are indexed (for large data sets)
You can check out this website for more performance tips, but I think I covered most of what you need for simple statements:
SQL Optimizations School

The fact that it's a stored procedure has little to nothing to do with it. Optimise the sql inside.
As to how, all the usual suspects, including written by the sort of eejit who thinks you can guess what's wrong.
Copy the sql from the proc into a suitable tool, prefix it with Explain to see what's going on.

I presume there are others options. For example:
1. each of those joins could use restrict conditions which looks like 'and permited_used_name = (select user_name from user_list where )'. The value could be derived once during procedure start (I mean the first string of procedure) to not overload the DB by many similar queries.
2. starting from Oracle11 you could declare a function as function with cached results (i.e. function is calculated once and isn't recalculated each time when it is invoked) defining a set of tables which changes invalidate cache.
At any case the question is mostly DB-specific.

Run the Query Analyser on the SQL statement

is "where (ParamID = #ParamID) OR (#ParamID = -1)" a good practice in sql selection

i used to write sql statments like
select * from teacher where (TeacherID = #TeacherID) OR (#TeacherID = -1)
read more
and pass #TeacherID value = -1 to select all teachers
now i'm worry about the performance
can you tell me is that a good practice or bad one?
many thanks

If TeacherID is indexed and you are passing a value other than -1 as TeacherID to search for details of a specific teacher then this query will end up doing a full table scan rather than the potentially far more efficient option of seeking into the index to retrieve the details of the specific teacher...
... Unless you are on SQL 2008 SP1 CU5 and later and use the OPTION (RECOMPILE) hint. See Dynamic Search Conditions in T-SQL for the definitive article on the topic.

We use this in a very limited fashion in stored procedures.
The problem is that the database engine isn't able to keep a good query plan for it. When dealing with a lot of data this can have a serious negative performance impact.
However, for smaller data sets (I'd say less than 1000 records, but that's a guess) it should be fine. You'll have to test in your particular environment.
If it's in a stored procedure, you might want to include something like a WITH RECOMPILE option so that the plan is regenerated on each execution. This adds (slightly) to the time for each run, but over several runs can actually reduce the average execution time. Also, this allows the database to inspect the actual query and "short circuit" the parts that aren't necessary on each call.
If you are directly creating your SQL and passing it through, then I'd suggest you make the part that builds your sql a little smarter so that it only includes the part of the where clause you actually need.
Another path you might consider is using UNION ALL queries as opposed to optional parameters. For example:
SELECT * FROM Teacher WHERE (TeacherId = #TeacherID)
UNION ALL
SELECT * FROM Teacher WHERE (#TeacherId = -1)
This actually accomplishes the exact same thing; however, the query plan is cacheable. We've used this method in a few places as well and saw performance improvements over using WITH RECOMPILE. We don't do this everywhere because some of our queries are extremely complicated and I'd rather have a performance hit than to complicate them further.
Ultimately though, you need to do a lot of testing.
There is a second part here that you should reconsider. SELECT *. It is ALWAYS preferable to actually name the columns you want returned and to make sure that you are only returning the ones you will actually need. Moving data across network boundaries is very expensive and you can generally get a fair amount of performance boost simply by specifying exactly what you want. In addition if what you need is very limited you can sometimes do covering indexes so that the database engine doesn't even have to touch the underlying tables to get the data you want.

If you're really worried about performance, you could break up your procedure to call on two different procs: one for all records, and one based on the parameter.
If #TeacherID = -1
exec proc_Get_All_Teachers
else
exec proc_Get_Teacher_By_TeacherID #TeacherID
Each one can be optimized individually.
It's your system, compare the performance. Consider optimizing on the most popular choice. If most users are going to select a single record, why hider their preformance just to accomodate the few that selct all teachers (And should have a reasonable expectation of performance.).
I know a single select query is easier to maintain, but at some point ease of maintenance eventually gives way to performance.

Cost of logic in a query

I have a query that looks something like this:
select xmlelement("rootNode",
(case
when XH.ID is not null then
xmlelement("xhID", XH.ID)
else
xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
end),
(case
when XH.SER_NUM is not null then
xmlelement("serialNumber", XH.SER_NUM)
else
xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
end),
/*repeat this pattern for many more columns from the same table...*/
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest executing query (there are others of similar form, but much smaller and they aren't causing any major problems - yet). Maintenance is relatively easy as this is mostly a generated query, but my concern now is for performance. I am wondering how much of an overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same, thing, my plan was to move the logic for handling the renaming and xsi:nil attribute to XSL or maybe to PL/SQL)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the case-when using the decode function instead. Would that perform better (than case-statements)?

Just about anything in your SELECT list, unless it is a user-defined function which reads a table or view, or a nested subselect, can usually be neglected for the purpose of analyzing your query's performance.
Open your connection properties and set the value SET STATISTICS IO on. Check out how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyze the plan to see?

For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version then you need to consider the performance of the functions, but I would astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just the emphasis is different with regards to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason why your scope for tuning the query is so restricted is because - as presented - it is a single table, single column read. Most tuning is about efficient joining. Now if XH transpires to be a view over a complex query then it is a different matter.

You can use good old tkprof to analyze statistics. One of the many forms of ALTER SESSION that turn on stats gathering. The DBMS_PROFILER package also gathers statistics if your cursor is in a PL/SQL code block.

SQL Query theory question - single-statement vs multi-statement queries

When I write SQL queries, I find myself often thinking that "there's no way to do this with a single query". When that happens I often turn to stored procedures or multi-statement table-valued functions that use temp tables (of one sort or another) and end up simply combining the results and returning the result table.
I'm wondering if anyone knows, simply as a matter of theory, whether it should be possible to write ANY query that returns a single result set as a single query (not multiple statements). Obviously, I'm ignoring relevant points such as code readability and maintainability, maybe even query performance/efficiency. This is more about theory - can it be done... and don't worry, I certainly don't plan to start forcing myself to write a single-statement query when multi-statement will better suit my purpose in all cases, but it might make me think twice or a little bit longer on whether there is a viable way to get the result from a single query.
I guess a few parameters are in order - I'm thinking of a relational database (such as MS SQL) with tables that follow common best practices (such as all tables having a primary key and so forth).
Note: in order to win 'Accepted Answer' on this, you'll need to provide a definitive proof (reference to web material or something similar.)

I believe it is possible. I've worked with very difficult queries, very long queries, and often, it is possible to do it with a single query. But most of the time, it's harder to mantain, so if you do it with a single query, make sure you comment your query carefully.
I've never encountered something that could not be done in a single query.
But sometimes it's best to do it in more than one query.

At least with the a recent version of Oracle is absolutely possible. It has a 'model clause' which makes sql turing complete. ( http://blog.schauderhaft.de/2009/06/18/building-a-turing-engine-in-oracle-sql-using-the-model-clause/ ). Of course this is all with the usual limitation that we don't really have unlimited time and memory.
For a normal sql dialect without these abdominations I don't think it is possible.
A task that I can't see how to implement in 'normal sql' would be:
Assume a table with a single column of type integer
For every row
'take the value at the current row and go that many rows back, fetch that value, go that many rows back, and continue until you fetch the same value twice consecutively and return that as the result.'

I can't prove it, but I believe the answer is a cautious yes - provided your database design is done properly. Usually being forced to write multiple statements to get a certain result is a sign that your schema may need some improvements.

I'd say "yes" but can't prove it. However, my main thought process:
Any select should be a set based operation
Your assumption is that you are dealing with mathematically correct sets (ie normalised correctly)
Set theory should guarantee it's possible
Other thoughts:
Multiple SELECT statement often load temp tables/table variables. These can be derived or separated in CTEs.
Any RBAR processing (for good or bad) now be dealt with CROSS/OUTER APPLY onto derived tables
UDFs would be classed as "cheating" in this context I feel, because it allows you to put a SELECT into another module rather than in your single one
No writes allowed in your "before" sequence of DML: this changes state from SELECT to SELECT
Have you seen some of the code in our shop?
Edit, glossary
RBAR = Row By Agonising Row
CTE = Common Table Expression
UDF = User Defined Function
Edit: APPLY: cheating?
SELECT
*
FROM
MyTable1 t1
CROSS APPLY
(
SELECT * FROM MyTable2 t2
WHERE t1.something = t2.something
) t2

In theory yes, if you use functions or a torturous maze of OUTER APPLYs or sub-queries; however, for readability and performance, we have always ended up going with temp tables and multi-statement stored procedures.
As someone above commented, this is usually a sign that your data structure is starting to smell; not that it's bad, but that maybe it's time to denormalise for performance reasons (happens to the best of us), or maybe put a denormalised querying layer in front of your normalised "real" data.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas