WHERE-CASE clause Subquery Performance - sql

This question is specific to SQL Server.
When I write a query such as:
SELECT * FROM IndustryData WHERE Date='20131231'
AND ReportTypeID = CASE WHEN (fnQuarterDate('20131231')='20131231') THEN 1
WHEN (fnQuarterDate('20131231')!='20131231') THEN 4
END;
Is the function call fnQuarterDate (or any subquery) inside the CASE in the WHERE clause executed for EACH row of the table?
Would it be better to fetch the function's (or subquery's) value into a variable beforehand, like this:
DECLARE @X INT
IF fnQuarterDate('20131231')='20131231'
SET @X=1
ELSE
SET @X=0
SELECT * FROM IndustryData WHERE Date='20131231'
AND ReportTypeID = CASE WHEN (@X = 1) THEN 1
WHEN (@X = 0) THEN 4
END;
I know that in MySQL a subquery inside IN(...) in a WHERE clause is executed for each row; I just wanted to find out whether the same is true for SQL Server.
...
I just populated the table with about 30K rows and measured the time difference:
Query 1 = 70 ms; Query 2 = 6 ms. I think that explains it, but I still don't know the actual facts behind it.
Also, would there be any difference if instead of a UDF there was a simple subquery?

I think the change may in theory help you increase performance, but it also depends on what the scalar function actually does. I think that in this case (my guess is that it maps the date to the last day of the quarter) the difference would really be negligible.
You may want to read this page with suggested workarounds:
http://connect.microsoft.com/SQLServer/feedback/details/273443/the-scalar-expression-function-would-speed-performance-while-keeping-the-benefits-of-functions#
Because SQL Server must execute each function on every row, using any function incurs a cursor-like performance penalty.
And under Workarounds there is this comment:
I had the same problem when I used scalar UDF in join column, the
performance was horrible. After I replaced the UDF with temp table
that contains the results of UDF and used it in join clause, the
performance was order of magnitudes better. MS team should fix UDF's
to be more reliable.
So it appears that yes, this may increase the performance.
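The temp-table workaround described in that comment would look roughly like this (a sketch only; the table, column, and function names below are made up for illustration):

-- evaluate the scalar UDF once per distinct input value and keep the results
SELECT DISTINCT t.JoinKey,
       dbo.SomeScalarUdf(t.JoinKey) AS UdfValue
INTO   #UdfResults
FROM   dbo.SomeTable AS t;

-- join against the precomputed values instead of calling the UDF per row
SELECT t.*
FROM   dbo.SomeTable AS t
JOIN   #UdfResults   AS r ON r.JoinKey = t.JoinKey
WHERE  r.UdfValue = 'X';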
Your solution is correct, but I would recommend one improvement to the SQL: use ELSE instead; it looks cleaner to me:
AND ReportTypeID = CASE WHEN (@X = 1) THEN 1
ELSE 4
END;

It depends. See User-Defined Functions:
The number of times that a function specified in a query is actually executed can vary between execution plans built by the optimizer. An example is a function invoked by a subquery in a WHERE clause. The number of times the subquery and its function is executed can vary with different access paths chosen by the optimizer.
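If you want to see the behaviour on your own data, one way is to run both forms with timing and I/O statistics turned on and compare the output (a sketch; the dbo. schema prefix on the function is an assumption):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- form 1: scalar function referenced inside the CASE in the WHERE clause
SELECT * FROM IndustryData
WHERE Date = '20131231'
  AND ReportTypeID = CASE WHEN dbo.fnQuarterDate('20131231') = '20131231' THEN 1 ELSE 4 END;

-- form 2: function evaluated once, result stored in a variable
DECLARE @X int = CASE WHEN dbo.fnQuarterDate('20131231') = '20131231' THEN 1 ELSE 4 END;

SELECT * FROM IndustryData
WHERE Date = '20131231'
  AND ReportTypeID = @X;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;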

This approach uses in-line MySQL variables. The query aliased as "sqlvars" prepares @dateBasis first with the date in question, then a second variable @qtrReportType based on the function call done ONCE for the entire query. Then, via a cross join (there is no WHERE condition between the tables, and sqlvars is a single row anyhow), those values are used to get data from your IndustryData table.
select
      ID.*
   from
      ( select
              @dateBasis := '20131231',
              @qtrReportType := case when fnQuarterDate(@dateBasis) = @dateBasis
                                     then 1 else 4 end ) sqlvars,
      IndustryData ID
   where
          ID.Date = @dateBasis
      AND ID.ReportTypeID = @qtrReportType
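Since the question is about SQL Server, roughly the same shape can be written there with a single-row derived table instead of MySQL user variables (a sketch; the dbo. prefix on the function and the date column type are assumptions):

SELECT ID.*
FROM ( SELECT CAST('20131231' AS date) AS dateBasis,
              CASE WHEN dbo.fnQuarterDate('20131231') = '20131231'
                   THEN 1 ELSE 4 END AS qtrReportType ) AS v
JOIN IndustryData AS ID
  ON  ID.Date = v.dateBasis
 AND  ID.ReportTypeID = v.qtrReportType;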

Related

SQL - Why does adding CASE to an ORDER BY clause drastically cut performance? And how can I avoid this?

I'm looking to improve the performance of one of our stored procedures and have come across something that I'm struggling to find information on. I'm by no means a DBA, so my knowledge of SQL is not brilliant.
Here's a simplified version of the problem:
If I use the following query -
SELECT * FROM Product
ORDER BY Name
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
I get the results in around 20ms
However if I apply a conditional ordering -
DECLARE @so int = 1
SELECT * FROM Product
ORDER BY
CASE WHEN @so = 1 THEN Name END,
CASE WHEN @so = 2 THEN Name END DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
The overall request in my mind is the same, but the results take 600ms, 30x longer.
The execution plans are drastically different, but being a novice I've no idea how to bring the execution plan for the second case into line with the first case.
Is this even possible, or should I look at creating separate procedures for the order by cases and move choosing the order logic to the code?
NB. This is using MS SQL Server
The reason is that SQL Server can no longer use an index. One solution is dynamic SQL. Another is a simple IF:
IF (@so = 1)
BEGIN
    SELECT p.*
    FROM Product p
    ORDER BY Name
    OFFSET 100 ROWS
    FETCH NEXT 28 ROWS ONLY;
END
ELSE
BEGIN
    SELECT p.*
    FROM Product p
    ORDER BY Name DESC
    OFFSET 100 ROWS
    FETCH NEXT 28 ROWS ONLY;
END;
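For reference, the dynamic SQL option mentioned above could look roughly like this (a sketch that assumes only the sort direction varies):

DECLARE @sql nvarchar(max) = N'
SELECT p.*
FROM Product p
ORDER BY Name ' + CASE WHEN @so = 2 THEN N'DESC' ELSE N'ASC' END + N'
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY;';

EXEC sp_executesql @sql;

Each distinct statement text gets its own cached plan, so each variant can get a plan that uses an index on Name if one exists.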
Gordon Linoff is right that this prevents an index from being used, but to expand a bit on that:
When SQL Server prepares execution of a query, it generates an execution plan: the query is compiled into steps that the database engine can execute. It's generally at this point that the optimizer looks at which indexes are available for use, but the parameter values are not yet known, so the query optimiser cannot see whether an index on Name is useful.
The workarounds in his answer are valid, but I'd like to offer one more:
Add OPTION (RECOMPILE) to your query. This forces your query execution plan to be recompiled each time, and each time, the parameter values are known, and allows the optimiser to optimise for those specific parameter values. It will generally be a bit less efficient than fully dynamic SQL, since dynamic SQL allows each possible statement's execution plan to be cached, but it will likely be better than what you have now, and more maintainable than the other options.
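Applied to the query from the question, the hint goes at the very end of the statement:

DECLARE @so int = 1

SELECT * FROM Product
ORDER BY
CASE WHEN @so = 1 THEN Name END,
CASE WHEN @so = 2 THEN Name END DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
OPTION (RECOMPILE);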

can oracle hints be used to defer a (pl/sql) condition till last?

I'm trying to optimise a select (cursor in pl/sql code actually) that includes a pl/sql function e.g.
select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
and myplsqlfunction(t.val) = 'X'
The myplsqlfunction() is very expensive, but it only needs to be applied to the manageably small subset of rows that satisfy the other conditions.
The problem is that Oracle appears to be evaluating myplsqlfunction() on more rows than is ideal.
My evidence for this is that if I recast the above as either
select * from (
select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
) where myplsqlfunction(t.val) = 'X'
or pl/sql as:
begin
for t in ( select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns ) loop
if myplsqlfunction(t.val) = 'X' then
-- process the desired subset
end if;
end loop;
end;
performance is an order of magnitude better.
I am resigned to restructuring the offending code to use either of the two idioms above, but I would be delighted if there were a simpler way to get the Oracle optimizer to do this for me.
You could specify a bunch of hints to force a particular plan. But that would almost assuredly be more of a pain than restructuring the code.
I would expect that what you really want to do is to associate non-default statistics with the function. If you tell Oracle that the function is less selective than the optimizer is guessing or (more likely) if you provide high values for the CPU or I/O cost of the function, you'll cause the optimizer to try to call the function as few times as possible. The oracle-developer.net article walks through how to pick reasonably correct values for the cost (or going a step beyond that how to make those statistics change over time as the cost of the function call changes). You can probably fix your immediate problem by setting crazy-high costs but you probably want to go to the effort of setting accurate values so that you're giving the optimizer the most accurate information possible. Setting costs way too high or way too low tends to cause some set of queries to do something stupid.
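For reference, the basic form of associating a default cost with the function looks like this (the three numbers are CPU, I/O and network cost per call; the values here are placeholders you would calibrate as that article describes):

ASSOCIATE STATISTICS WITH FUNCTIONS myplsqlfunction DEFAULT COST (10000, 10, 0);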
You can use a WITH clause to first evaluate all your join conditions and get a manageable subset of data. Then you can apply the PL/SQL function to that subset. It all depends on the volume, but you can try this. Let me know of any issues.
You can use CTE like:
WITH X as
( select /*+ MATERIALIZE */ * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
)
SELECT * FROM X
where myplsqlfunction(X.val) = 'X';
Note the MATERIALIZE hint. CTEs can be either inlined or materialized (into the TEMP tablespace).
Another option would be to use NO_PUSH_PRED hint. This is generally better solution (avoids materializing of the subquery), but it requires some tweaking.
PS: you should not run another SQL query from inside myplsqlfunction. That SQL might see data added after your query started, and you might get surprising results.
You can also declare your function as RESULT_CACHE, to make Oracle remember the function's return values; this is applicable if the number of possible parameter values is reasonably small.
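A minimal sketch of that, assuming the function takes and returns VARCHAR2 (the real signature may differ):

CREATE OR REPLACE FUNCTION myplsqlfunction (p_val IN VARCHAR2)
  RETURN VARCHAR2
  RESULT_CACHE
IS
BEGIN
  -- the existing expensive logic goes here; results are cached per distinct p_val
  RETURN 'X';
END;
/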
Probably the best solution is to associate the stats, as Justin describes.

What is the most efficient way to process rows in a table?

I am teaching myself basic and intermediate SQL concepts for a project I am working on.
I have a lot of data that needs to undergo processing so it can be presented in different ways. Right now I am using scalar function calls in my SELECT statement to process the data.
A simple example: let's say I have an attribute in my table called fun with data type int. I want to process my table so that all rows with fun < 10 become 'foo' and all rows with fun > 10 become 'faa'.
So I write an SQL function something like
CREATE FUNCTION dbo.fooORfaa
(
    @fun AS int
)
RETURNS VARCHAR(3)
AS
BEGIN
    IF (@fun < 10)
        RETURN 'foo'
    RETURN 'faa'
END
Then I use my function in something like this select statement
select dbo.fooORfaa([mytable].[fun]) AS 'blah'
from mytable
This example is trivial, but in my real code I need to perform some fairly involved logic against one or more columns, and I am selecting sub-results from procedures, joining tables together, and doing other things you need to do in a database.
I have to process lots of records in a reasonable time span. Is this method an efficient way to tackle this problem? Is there another technique I should be using instead?
For this use case, you need a CASE construct.
SELECT
CASE
WHEN T.fun < 10 THEN 'foo'
ELSE 'faa'
END foo_faa
FROM
myTable T
Always try to use set-based operations. User-defined functions will (mostly) kill your performance, and should be a last resort.
See: CASE (Transact-SQL)
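If you still want the logic packaged in a reusable function, one commonly suggested compromise (not from the original answer; dbo.fooORfaa_inline is an illustrative name) is an inline table-valued function, which the optimizer can expand into the calling query instead of invoking it row by row:

CREATE FUNCTION dbo.fooORfaa_inline (@fun int)
RETURNS TABLE
AS
RETURN (SELECT CASE WHEN @fun < 10 THEN 'foo' ELSE 'faa' END AS foo_faa);
GO

SELECT f.foo_faa
FROM myTable AS T
CROSS APPLY dbo.fooORfaa_inline(T.fun) AS f;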

Does Postgresql plpgsql/sql support short circuiting in the where clause?

If I have the following toy query
SELECT *
FROM my_tables
WHERE my_id in (
SELECT my_other_id
FROM my_other_tables
) AND some_slow_func(arg) BETWEEN 1 AND 2;
Would the first condition in the WHERE clause short-circuit the second condition, which has an expensive run time?
I'm working on some SQL that is actually part of a FOR loop in plpgsql. I could iterate over all records that exist in my_other_tables and then test some_slow_func() within the scope of the FOR loop, but I'm curious whether SQL or plpgsql supports short-circuiting.
Some Research:
I looked in the Postgres mailing lists and found this post saying that SQL in general doesn't support short-circuiting:
http://www.postgresql.org/message-id/171423D4-9229-4D56-B06B-58D29BB50A77@yahoo.com
But one of the responses says that order can be enforced through subselects. I'm not exactly sure what he means. I know what a subselect is, but I'm not sure how order would be enforced. Could someone clarify this for me?
As documented, the evaluation order in a WHERE clause is supposed to be unpredictable.
It's different with subqueries. With PostgreSQL older than version 12, the simplest and common technique to drive the evaluation order is to write a subquery in a CTE. To make sure that the IN(...) is evaluated first, your code could be written as:
WITH subquery AS
(select * from my_tables
WHERE my_id in (SELECT my_other_id FROM my_other_tables)
)
SELECT * FROM subquery
WHERE some_slow_func(arg) BETWEEN 1 AND 2;
Starting with PostgreSQL version 12, WITH subqueries may be inlined by the optimizer (see the doc page on WITH queries for all the details), and the non-inlining is only guaranteed when adding the MATERIALIZED clause:
WITH subquery AS MATERIALIZED
(select * ... the rest is the same as above)
Something else that you may tweak is the cost of your function to signal to the optimizer that it's slow. The default cost for a function is 100, and it can be altered with a statement like:
ALTER FUNCTION funcname(argument types) cost N;
where N is the estimated per-call cost, expressed in an arbitrary unit that should be compared to the Planner Cost Constants.
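For example, if some_slow_func takes a single integer argument (an assumption here), this marks it as roughly 100 times more expensive than the default:

ALTER FUNCTION some_slow_func(integer) COST 10000;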
I know this is an old question, but I recently ran into a similar issue and found that using a CASE predicate in the WHERE clause worked better for me. In the context of the answer above:
SELECT *
FROM my_tables
WHERE CASE WHEN my_id in (SELECT my_other_id
FROM my_other_tables)
AND some_slow_func(arg) BETWEEN 1 AND 2
THEN 1
ELSE 0
END = 1;
This makes for SQL that is slightly more DB agnostic. Of course, it may not make use of indexes if you have some on my_id, but depending on the context you're in, this could be a good option.
According to the Postgresql docs and this answer by Tom Lane, the order of execution of WHERE constraints is not reliable.
I think your best bet here may be to move that other part of your WHERE clause into the top of your function and "fail fast"; i.e., run my_id IN (SELECT my_other_id FROM my_other_tables) inside your function, and if it doesn't pass, return right there before doing your intensive processing. That should get you about the same effect.
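A rough sketch of that idea (the signature, parameter names and return type below are made up for illustration; the point is the early RETURN before the expensive work):

CREATE OR REPLACE FUNCTION some_slow_func(arg integer, p_my_id integer)
RETURNS integer AS $$
BEGIN
    -- fail fast: skip the expensive work for rows that would be filtered out anyway
    IF NOT EXISTS (SELECT 1 FROM my_other_tables WHERE my_other_id = p_my_id) THEN
        RETURN NULL;
    END IF;
    -- ... expensive processing on arg goes here ...
    RETURN 1;
END;
$$ LANGUAGE plpgsql;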

"IF..ElseIf..Else" or "Where clause" to guide Stored Procedure results

I have the following two SQL statements
First one:
IF (@User_Id IS NULL)
BEGIN
SELECT *
FROM [UserTable]
END
ELSE
BEGIN
SELECT *
FROM [UserTable] AS u
WHERE u.[Id] = @User_Id
END
Second one:
SELECT *
FROM [UserTable] AS u
WHERE (@User_Id IS NULL OR u.[Id] = @User_Id)
Each of those queries would be wrapped in its own stored procedure. I suspect that the IF statement is causing a lot of recompilations in SQL Server. I am faced with either separating each part of the IF statement into its own stored procedure, or replacing the entire IF statement with a WHERE clause (illustrated above in the second SQL statement).
My question is: What is the difference between the two statements from a performance perspective, and how would SQL treat each statement?
Thanks.
Both solutions will generate an identical number of compilations.
With the first solution, the query optimizer is free to come up with the best plan for each of the two different queries. For the first query (the NULL branch of the IF) there is not much to optimize, but the second one (the NOT NULL branch of the IF) can be optimized if an index on the Id column exists.
But the second solution is an optimization disaster. No matter the value of the @User_Id parameter, the optimizer has to come up with a plan that works for any value of the parameter. As such, no matter the value of @User_Id, the plan will always use a suboptimal table scan. There is just no way around this issue, and this is not parameter sniffing as some might think. It is simply about the correctness of the plan: even if the value at plan-generation time is NOT NULL, the plan has to work when the parameter is NULL, so it cannot use the index on Id.
Always, always, always, use the first form with the explicit IF.
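(Not something this answer recommends, but tying back to the OPTION (RECOMPILE) note earlier on this page: if you must keep the single catch-all query, that hint lets the optimizer build a fresh plan for the actual parameter value on every execution, at the cost of a compile each time.)

SELECT *
FROM [UserTable] AS u
WHERE (@User_Id IS NULL OR u.[Id] = @User_Id)
OPTION (RECOMPILE);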