Does ISNULL or OR have better performance?

I have the SQL query:
SELECT ISNULL(t.column1, t.column2) as [result]
FROM t
I need to filter the data by the [result] column. Which of the two approaches listed below is better for performance:
WHERE ISNULL(t.column1, t.column2) = @filterValue
or:
WHERE t.column1 = @filterValue OR t.column2 = @filterValue
UPDATE: Sorry, I forgot to mention that column2 is always null if column1 is filled.

Measure, don't guess! This is something you should be doing yourself, with production-like data. We don't know the make-up of your data and that makes a big difference.
Having said that, I wouldn't do it either way. I'd create another column, column3 to store column1 if non-NULL and column2 if column1 is NULL.
Then I'd have an insert/update trigger to populate that column correctly, index it and use the screaming-banshee-speed:
select t.column3 as [result] from t
The vast majority of databases are read more often than written and it's better if this calculation is done as few times as possible (i.e., when the data changes, not every time you select it). If you want your databases to be scalable, don't use per-row functions.
It's perfectly valid to sacrifice disk space for speed and the triggers ensure that the data doesn't become inconsistent.
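For illustration, here is a minimal sketch of that approach, assuming SQL Server, an integer value column and a primary key named ID (neither is given in the question), so adjust to taste:
ALTER TABLE t ADD column3 INT NULL;   -- same type as column1/column2
GO
CREATE TRIGGER trg_t_column3 ON t
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- keep column3 in sync with the ISNULL(column1, column2) rule
    UPDATE t
    SET    column3 = ISNULL(i.column1, i.column2)
    FROM   t
    JOIN   inserted i ON i.ID = t.ID;
END;
GO
CREATE INDEX IX_t_column3 ON t (column3);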
If adding another column and triggers is out of the question, I'd go for the OR solution, since it can often be split into two parallel queries by the smarter DBMS engines.
An alternative, which MarkB gave but since deleted his answer so I'll have to go hunting for another good answer of his to upvote :-), is to use UNION ALL. If your DBMS isn't quite smart enough to recognise OR as a chance for parallelism, it may be smart enough to recognise UNION ALL in that context, something like:
select column1 as c from t where column1 is not NULL
union all
select column2 as c from t where column1 is NULL
But again, it depends on both your database and your data. A smart DBA would put the whole thing in a stored procedure so they could swap in a new method seamlessly should the data change its properties.
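For example, a hedged sketch of such a wrapper procedure (SQL Server syntax; the procedure name and parameter type are assumptions), so the implementation can be swapped later without touching the callers:
CREATE PROCEDURE dbo.GetByResult
    @filterValue INT   -- type is an assumption; match it to column1/column2
AS
BEGIN
    SET NOCOUNT ON;
    -- current implementation; could be swapped for the UNION ALL or column3 variant later
    SELECT ISNULL(t.column1, t.column2) AS [result]
    FROM   t
    WHERE  t.column1 = @filterValue
       OR (t.column1 IS NULL AND t.column2 = @filterValue);
END;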

On an MSSQL table (SQL Server 2000) with 13,000,000 rows and indexes on Col1 and Col2, I get the following results:
select top 1000000 * from Table1 with(nolock) where isnull(Col1,Col2) > '0'
-- Compile-Time: 4ms
-- CPU-Time: 18265ms
-- Elapsed-Time: 24882ms = ~25s
select top 1000000 * from Table1 with(nolock) where Col1 > '0' or (Col1 is null and Col2 > '0')
-- Compile-Time: 9ms
-- CPU-Time: 7781ms
-- Elapsed-Time: 25734ms = ~26s
The measured values are subject to strong fluctuations depending on the workload of the server.
The first statement needs less time to compile but takes more CPU time to execute (clustered index scan).
It's important to know that many storage engines have an optimizer that reorganizes the statement for better results and execution times. Ultimately, both statements will be rewritten by the optimizer into mostly the same statement.

I think your replacement expression does not mean the same thing. Assume filterValue is 2: then ISNULL(1,2)=2 is false, but 1=2 or 2=2 is true. The expression you need looks more like:
(c1=filter) or ((c1 is null) and (c2 = filter));
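A quick way to see the difference, using the example values above (SQL Server syntax):
SELECT CASE WHEN ISNULL(1, 2) = 2 THEN 'match' ELSE 'no match' END;  -- no match: ISNULL(1,2) is 1
SELECT CASE WHEN 1 = 2 OR 2 = 2   THEN 'match' ELSE 'no match' END;  -- match: the plain OR form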
There is a chance that a server can answer this from the index on c1. The first part of the solution is an index scan over c1=filter. The second part is a scan over c1 is null and then a linear search for c2=filter. I'd even say that a clustered index (c1,c2) could work here.
OTOH, you should measure before making assumptions like this; speculation usually doesn't work in SQL unless you have intimate knowledge of the implementation. For example, I'm pretty sure the query planner already knows that ISNULL(X,Y) can be decomposed into a boolean statement, with its implications for searching, but I would not rely on that; I'd rather measure and then decide what to do.

I have measured the performance of both queries on SQL Server 2008
and got the following results:
Both approaches had almost the same Estimated Subtree Cost metric.
But the OR approach had a more accurate value for the Estimated Number of Rows metric.
So the query optimizer will build a more appropriate execution plan for the OR approach than for the ISNULL approach.
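If you want to repeat this kind of comparison on your own data, a rough sketch (the parameter value is a placeholder; look at the actual execution plans as well):
DECLARE @filterValue INT = 1;

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM t WHERE ISNULL(t.column1, t.column2) = @filterValue;

SELECT * FROM t WHERE t.column1 = @filterValue
                   OR (t.column1 IS NULL AND t.column2 = @filterValue);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;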

Related

Is there any performance benefit if we exclude null rows in where clause in Select query

Table Foo structure:
ID – PK
SampleCol – Can have null and is not indexed
SampleCol2, SampleCol3, etc
Table Foo has some 100,000+ rows with many SampleCol as NULL.
SQL query #1:
select *
from Foo
where SampleCol = 'Test'
SQL query #2:
select *
from Foo
where SampleCol is not null and SampleCol = 'Test'
Does query #2 have any performance benefit over query #1? Or any suggestions on how to improve performance of these SQL queries?
Thanks!
No, it will not help -- although it could make things slightly (probably unmeasurably) worse.
The condition SampleCol = 'Test' is exactly the comparison you want to make. So, the database has to make this comparison, in some fashion, for every row that is returned.
There are basically two situations. Without an index, your query needs to do a full table scan. Two comparisons on each row (one for NULL and one for the value) take longer than a single comparison. To be honest, some databases might optimize this just to the equality comparison, so the two could be equal. I don't think SQL Server does this elimination but it might.
With an index, SQL Server will use an index for the = comparison. It might then do an additional comparison against NULL (even though that is redundant). You run into a bigger issue here, though: The more complicated the predicate the more likely the optimizer gets confused and doesn't use an index.
There is a third case where your column is used for partitioning. I do not know if the redundant comparison would have an impact on partition pruning.
You want your where comparisons to be simple. In general, you want to let the optimizer do its work. On very rare occasions, you might want to give the optimizer some help, but that is very, very, very rare -- and generally involves functions that are much more expensive to run than simple comparisons.
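Since the question says SampleCol is not indexed, the most direct improvement (if this lookup is frequent) is simply to index it and keep the predicate simple; a sketch:
CREATE INDEX IX_Foo_SampleCol ON Foo (SampleCol);

SELECT *
FROM Foo
WHERE SampleCol = 'Test';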

SQL 'case when' vs 'where' efficiency

Which is more efficient:
Select SUM(case when col2=2 then col1 Else 0 End) From myTable
OR
Select SUM(Col1) From myTable where col2=2
Or are they the same speed?
Definitely, the second one should be faster. This is because of the concept of "access". Access refers to the amount of data that the query needs to retrieve in order to produce the result. It has a big impact on the operators the database engine's optimizer decides to include in the execution plan.
Save for some exceptions, the first query needs to access all the table rows and then compute the result, including rows that have nothing to do with the CASE.
The second query only accesses the specific rows needed to compute the result. Therefore, it has the potential to be faster. For that potential to materialize, the presence of indexes is crucial. For example:
create index ix1 on myTable (col2);
In this case it will only access the subset of rows that match the filtering predicate col2 = 2.
The second is more efficient:
It would generally process fewer rows (assuming there are non-"2" values), because rows would be ignored before the aggregation function is called.
It allows the optimizer to take advantage of indexes.
It allows the optimizer to take advantage of table partitions.
Under some circumstances, they might appear to take the same amount of time, particularly on small tables.

Can I alias and reuse my subqueries?

I'm working with a data warehouse doing report generation. As the name would suggest, I have a LOT of data. One of the queries that pulls a LOT of data is getting to take longer than I like (these aren't performed ad-hoc, these queries run every night and rebuild tables to cache the reports).
I'm looking at optimizing it, but I'm a little limited on what I can do. I have one query that's written along the lines of...
SELECT column1, column2,... columnN, (subQuery1), (subquery2)... and so on.
The problem is, the sub queries are repeated a fair amount because each statement has a case around them such as...
SELECT
column1
, column2
, columnN
, (SELECT
CASE
WHEN (subQuery1) > 0 AND (subquery2) > 0
THEN CAST((subQuery1)/(subquery2) AS decimal)*100
ELSE 0
END) AS "longWastefulQueryResults"
Our data comes from multiple sources and there are occasional data entry errors, so this prevents potential errors when dividing by a zero. The problem is, the sub-queries can repeat multiple times even though the values won't change. I'm sure there's a better way to do it...
I'd love something like what you see below, but I get errors about needing sq1 and sq2 in my group by clause. I'd provide an exact sample, but it'd be painfully tedious to go over.
SELECT
column1
, column2
, columnN
, (subQuery1) as sq1
, (subquery2) as sq2
, (SELECT
CASE
WHEN (sq1) > 0 AND (sq2) > 0
THEN CAST((sq1)/(sq2) AS decimal)*100
ELSE 0
END) AS "lessWastefulQueryResults"
I'm using Postgres 9.3 but haven't been able to get a successful test yet. Is there anything I can do to optimize my query?
Yup, you can create a temp table to store your results and query it again in the same session.
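A rough sketch of that temp-table idea in Postgres; the tables and subqueries below are placeholders, not the ones from the question:
CREATE TEMP TABLE sub_results AS
SELECT o.id,
       (SELECT count(*) FROM child_a a WHERE a.parent_id = o.id) AS sq1,
       (SELECT count(*) FROM child_b b WHERE b.parent_id = o.id) AS sq2
FROM   base_table o;

SELECT id,
       CASE WHEN sq1 > 0 AND sq2 > 0
            THEN CAST(sq1 AS decimal) / sq2 * 100
            ELSE 0
       END AS less_wasteful_query_results
FROM   sub_results;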
I'm not sure how good the Postgres optimizer is, so I'm not sure whether optimizing in this way will do any good. (In my opinion, it shouldn't because the DBMS should be taking care of this kind of thing; but it's not at all surprising if it isn't.) OTOH if your current form has you repeating query logic, then you can benefit from doing something different whether or not it helps performance...
You could put the subqueries in with clauses up front, and that might help.
with subquery1 as (select ...)
, subquery2 as (select ...)
select ...
This is similar to putting the subqueries in the FROM clause as Allen suggests, but may offer more flexibility if your queries are complex.
If you have the freedom to create a temp table as Andrew suggests, that too might work but could be a double-edged sword. At this point you're limiting the optimizer's options by insisting that the temp tables be populated first and then used in the way that makes sense to you, which may not always be the way that actually gets the most efficiency. (Again, this comes down to how good the optimizer is... it's often folly to try to outsmart a really good one.) On the other hand, if you do create temp or working tables, you might be able to apply useful indexes or stats (if they contain large datasets) that would further improve downstream steps' performance.
It looks like many of your subqueries might return single values. You could put the queries into a procedure and capture those individual values as variables. This is similar to the temp table approach, but doesn't require creation of objects (as you may not be able to do that) and will have less risk of confusing the optimizer by making it worry about a table where there's really just one value.
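A hedged sketch of that idea in PL/pgSQL; the stand-in subqueries and the target table are made up for illustration:
DO $$
DECLARE
    sq1 numeric;
    sq2 numeric;
BEGIN
    SELECT count(*) INTO sq1 FROM child_a;   -- stands in for subQuery1
    SELECT count(*) INTO sq2 FROM child_b;   -- stands in for subquery2

    INSERT INTO report_cache (ratio)
    VALUES (CASE WHEN sq1 > 0 AND sq2 > 0
                 THEN CAST(sq1 / sq2 AS decimal) * 100
                 ELSE 0 END);
END $$;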
Sub-queries in the column list tend to be a questionable design. The first approach I'd take to solving this is to see if you can move them down to the from clause.
In addition to allowing you to use the result of those queries in multiple columns, doing this often helps the optimizer to come up with a better plan for your query. This is because the queries in the column list have to be executed for every row, rather than merged into the rest of the result set.
Since you only included a portion of the query in your question, I can't demonstrate this particularly well, but what you should be looking for would look more like:
SELECT column1,
column2,
columnN,
subquery1.sq1,
subquery2.sq2,
(SELECT CASE
WHEN (subquery1.sq1) > 0 AND (subquery2.sq2) > 0 THEN
CAST ( (subquery1.sq1) / (subquery2.sq2) AS DECIMAL) * 100
ELSE
0
END)
AS "lessWastefulQueryResults"
FROM some_table
JOIN (SELECT *
FROM other_table
GROUP BY some_columns) subquery1
ON some_table.some_columns = subquery1.some_columns
JOIN (SELECT *
FROM yet_another_table
GROUP BY more_columns) subquery2
ON some_table.more_columns = subquery2.more_columns

can oracle hints be used to defer a (pl/sql) condition till last?

I'm trying to optimise a select (cursor in pl/sql code actually) that includes a pl/sql function e.g.
select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
and myplsqlfunction(t.val) = 'X'
The myplsqlfunction() is very expensive, but is only applicable to a manageably small subset of the other conditions.
The problem is that Oracle appears to be evaluating myplsqlfunction() on more data than is ideal.
My evidence for this is if I recast the above as either
select * from (
select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
) where myplsqlfunction(t.val) = 'X'
or pl/sql as:
begin
for t in ( select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns ) loop
if myplsqlfunction(t.val) = 'X' then
-- process the desired subset
end if;
end loop;
end;
performance is an order of magnitude better.
I am resigned to restructuring the offending code to use either of the two idioms above, but I would be delighted if there were a simpler way to get the Oracle optimizer to do this for me.
You could specify a bunch of hints to force a particular plan. But that would almost assuredly be more of a pain than restructuring the code.
I would expect that what you really want to do is to associate non-default statistics with the function. If you tell Oracle that the function is less selective than the optimizer is guessing or (more likely) if you provide high values for the CPU or I/O cost of the function, you'll cause the optimizer to try to call the function as few times as possible. The oracle-developer.net article walks through how to pick reasonably correct values for the cost (or going a step beyond that how to make those statistics change over time as the cost of the function call changes). You can probably fix your immediate problem by setting crazy-high costs but you probably want to go to the effort of setting accurate values so that you're giving the optimizer the most accurate information possible. Setting costs way too high or way too low tends to cause some set of queries to do something stupid.
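As a hedged sketch of what that DDL looks like (the cost and selectivity numbers below are placeholders you would calibrate for your own system):
ASSOCIATE STATISTICS WITH FUNCTIONS myplsqlfunction
    DEFAULT COST (100000, 10, 0)   -- cpu, i/o, network; deliberately high placeholders
    DEFAULT SELECTIVITY 1;         -- i.e. assume the predicate matches about 1% of rows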
You can use a WITH clause to first evaluate all your join conditions and get a manageable subset of data, then apply the PL/SQL function to that subset. It all depends on the volume, but you can try this. Let me know if you run into any issues.
You can use CTE like:
WITH X as
( select /*+ MATERIALIZE */ * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
)
SELECT * FROM X
where myplsqlfunction(X.val) = 'X';
Note the MATERIALIZE hint. CTEs can be either inlined or materialized (into the TEMP tablespace).
Another option would be to use the NO_PUSH_PRED hint. This is generally a better solution (it avoids materializing the subquery), but it requires some tweaking.
PS: you should not call SQL from inside myplsqlfunction. That SQL might see data added after your query started, and you might get surprising results.
You can also declare your function as RESULT_CACHE, to force Oracle to remember return values from the function - if applicable, i.e. if the number of possible parameter values is reasonably small.
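A sketch of what that looks like, assuming Oracle 11g or later; the parameter name, type and body are placeholders for the real function:
CREATE OR REPLACE FUNCTION myplsqlfunction (p_val IN VARCHAR2)
    RETURN VARCHAR2
    RESULT_CACHE
IS
BEGIN
    -- existing expensive logic goes here
    RETURN 'X';
END;
/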
Probably the best solution is to associate the stats, as Justin describes.

SQL massive performance difference using SELECT TOP x even when x is much higher than selected rows

I'm selecting some rows from a table valued function but have found an inexplicable massive performance difference by putting SELECT TOP in the query.
SELECT col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
is taking upwards of 5 or 6 mins to complete.
However
SELECT TOP 6000 col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
completes in about 4 or 5 seconds.
This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rows out of 200,000.
So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)?
Has anyone else witnessed something like this?
Thanks
Table valued functions can have a non-linear execution time.
Let's consider a function equivalent of this query:
SELECT (
SELECT SUM(mi.value)
FROM mytable mi
WHERE mi.id <= mo.id
)
FROM mytable mo
ORDER BY
mo.value
This query (which calculates the running SUM) is fast at the beginning and slow at the end, since for each row from mo it has to sum all the preceding values, which requires rewinding the rowsource.
The time taken to calculate the SUM for each row increases as the row number increases.
If you make mytable large enough (say, 100,000 rows, as in your example) and run this query you will see that it takes considerable time.
However, if you apply TOP 5000 to this query, you will see that it completes in much less than 1/20 of the time required for the full table.
Most probably, something similar happens in your case too.
To say something more definite, I need to see the function definition.
Update:
SQL Server can push predicates into the function.
For instance, I just created this TVF:
CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN (
SELECT *
FROM master
);
These queries:
SELECT *
FROM fn_test()
WHERE name = @name
SELECT TOP 1000 *
FROM fn_test()
WHERE name = @name
yield different execution plans (the first one uses a clustered index scan, the second one uses an index seek with a TOP).
I had the same problem: a simple query joining five tables and returning 1000 rows took two minutes to complete. When I added "TOP 10000" to it, it completed in less than one second. It turned out that the clustered index on one of the tables was heavily fragmented.
After rebuilding the index the query now completes in less than a second.
Your TOP has no ORDER BY, so it's simply the same as SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and that would take a lot longer.
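To illustrate that equivalence, mirroring the query from the question:
SET ROWCOUNT 6000;

SELECT col1, col2, col3
FROM dbo.some_table_function
WHERE col1 = @parameter;

SET ROWCOUNT 0;   -- reset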
If dbo.some_table_function is an inline table-valued udf, then it's simply a macro that's expanded, so it returns the first 6000 rows, as mentioned, in no particular order.
If the udf is multi-statement, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening.
Not directly related, but another SO question on TVFs
You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?
In any case, the best way to quench your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Server Management Studio, and it'll tell you EXACTLY what operations are being performed and how long each is predicted to take.
All SQL implementations are quirky in their own way - SQL Server's no exception. These kind of "whaaaaaa?!" moments are pretty common. ;^)
It's not necessarily true that the whole table is processed if col1 has an index.
The SQL optimizer will choose whether or not to use an index. Perhaps your TOP is forcing it to use the index.
If you are using MSSQL Query Analyzer (the name escapes me), hit Ctrl-K. This will show the execution plan for the query instead of executing it. Mousing over the icons will show the I/O and CPU usage, I believe.
I bet one is using an index seek, while the other isn't.
If you have a generic client:
SET SHOWPLAN_ALL ON;
GO
select ...;
go
see http://msdn.microsoft.com/en-us/library/ms187735.aspx for details.
I think Quassnoi's suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows is going to be returned. The optimizer then uses an index seek instead of a clustered index scan or table scan.
Another possible explanation could be caching, as Jim Davis suggests. This is fairly easy to rule out by running the queries again. Try running the one with TOP 6000 first.