SQL and logical operators and null checks

I've got a vague, possibly cargo-cult memory from years of working with SQL Server that when you've got a possibly-null column, it's not safe to write "WHERE" clause predicates like:
... WHERE the_column IS NULL OR the_column < 10 ...
It had something to do with the fact that SQL rules don't stipulate short-circuiting (and in fact that's kind-of a bad idea possibly for query optimization reasons), and thus the "<" comparison (or whatever) could be evaluated even if the column value is null. Now, exactly why that'd be a terrible thing, I don't know, but I recall being sternly warned by some documentation to always code that as a "CASE" clause:
... WHERE 1 = CASE WHEN the_column IS NULL THEN 1 WHEN the_column < 10 THEN 1 ELSE 0 END ...
(the goofy "1 = " part is because SQL Server doesn't/didn't have first-class booleans, or at least I thought it didn't.)
So my questions here are:
Is that really true for SQL Server (or perhaps back-rev SQL Server 2000 or 2005) or am I just nuts?
If so, does the same caveat apply to PostgreSQL? (8.4 if it matters)
What exactly is the issue? Does it have to do with how indexes work or something?
My grounding in SQL is pretty weak.

I don't know SQL Server so I can't speak to that.
Given an expression a L b for some logical operator L, there is no guarantee that a will be evaluated before or after b or even that both a and b will be evaluated:
Expression Evaluation Rules
The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all.
[...]
Note that this is not the same as the left-to-right "short-circuiting" of Boolean operators that is found in some programming languages.
As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since those clauses are extensively reprocessed as part of developing an execution plan.
As far as an expression of the form:
the_column IS NULL OR the_column < 10
is concerned, there's nothing to worry about since NULL < n is NULL for all n, even NULL < NULL evaluates to NULL; furthermore, NULL isn't true so
null is null or null < 10
is just a complicated way of saying true or null and that's true regardless of which sub-expression is evaluated first.
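The three-valued logic described above can be checked directly. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in engine (an assumption of this example; the question is about SQL Server and PostgreSQL, but the NULL rules involved here are standard SQL). SQLite reports true as 1, and Python shows SQL NULL as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a one-value query and return that value (None stands for SQL NULL)."""
    return conn.execute(sql).fetchone()[0]

null_lt_10 = scalar("SELECT NULL < 10")                # comparison with NULL gives NULL
null_or = scalar("SELECT NULL IS NULL OR NULL < 10")   # true OR NULL
flipped = scalar("SELECT NULL < 10 OR NULL IS NULL")   # NULL OR true

print(null_lt_10, null_or, flipped)
```

Both orderings of the OR give the same answer, which is why the order in which the two sub-expressions are evaluated does not matter here.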
The whole "use a CASE" advice sounds mostly like cargo-cult SQL to me. However, like most cargo-cultism, there is a kernel of truth buried under the cargo; just below my first excerpt from the PostgreSQL manual, you will find this:
When it is essential to force evaluation order, a CASE construct (see Section 9.16) can be used. For example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:
SELECT ... WHERE x > 0 AND y/x > 1.5;
But this is safe:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
So, if you need to guard against a condition that will raise an exception or have other side effects, then you should use a CASE to control the order of evaluation as a CASE is evaluated in order:
Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed. If the condition's result is not true, any subsequent WHEN clauses are examined in the same manner.
So given this:
case when A then Ra
when B then Rb
when C then Rc
...
A is guaranteed to be evaluated before B, B before C, etc. and evaluation stops as soon as one of the conditions evaluates to a true value.
In summary, a CASE short-circuits but neither AND nor OR is guaranteed to short-circuit, so you only need a CASE when you have to protect against exceptions or other side effects.
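To make the ordering rule concrete, here is a minimal Python model of a searched CASE. It is only a sketch of the semantics quoted above, not how any database implements it; the names searched_case and safe_ratio_check are made up for this illustration. Conditions and results are passed as zero-argument callables so that nothing is computed until the CASE actually reaches it.

```python
def searched_case(branches, default=None):
    """Model of a searched CASE: test conditions in order, stop at the first true one.

    `branches` is a list of (condition, result) pairs of zero-argument callables.
    None stands for SQL NULL; like false, it makes the branch fall through.
    """
    for condition, result in branches:
        if condition() is True:
            return result()
    return default

def safe_ratio_check(x, y):
    # Models: CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END
    return searched_case([(lambda: x > 0, lambda: y / x > 1.5)], default=False)

print(safe_ratio_check(2, 4))  # guard passes, so the division runs
print(safe_ratio_check(0, 4))  # guard fails first, so y / x is never computed
```

With x = 0 the division is never attempted, which is the protection that a plain AND does not promise.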

Instead of
the_column IS NULL OR the_column < 10
I'd do
isnull(the_column,0) < 10
or for the first example
WHERE 1 = CASE WHEN isnull(the_column,0) < 10 THEN 1 ELSE 0 END ...
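One caveat with this rewrite: it only matches the original predicate while the substitute value 0 itself passes the comparison. The sketch below checks that on a toy table, using Python's sqlite3 as a stand-in engine and COALESCE in place of SQL Server's two-argument isnull (both choices are assumptions of this example, as are the table name t and the sample values).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (the_column INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(None,), (-5,), (3,), (10,), (42,)])

def rows(where, limit):
    """Return the_column values matching `where`, with :limit bound as a parameter."""
    sql = f"SELECT the_column FROM t WHERE {where} ORDER BY rowid"
    return [r[0] for r in conn.execute(sql, {"limit": limit})]

explicit = "the_column IS NULL OR the_column < :limit"
folded = "COALESCE(the_column, 0) < :limit"

print(rows(explicit, 10), rows(folded, 10))  # identical for a limit of 10
print(rows(explicit, 0), rows(folded, 0))    # the NULL row drops out of the folded form
```

So the shortcut is fine for "< 10", but it silently changes meaning if the threshold ever drops to 0 or below.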

I've never heard of such a problem, and this bit of SQL Server 2000 documentation uses WHERE advance < $5000 OR advance IS NULL in an example, so it must not have been a very stern rule. My only concern with OR is that it has lower precedence than AND, so you might accidentally write something like WHERE the_column IS NULL OR the_column < 10 AND the_other_column > 20 when that's not what you mean; but the usual solution is parentheses rather than a big CASE expression.
I think that in most RDBMSes, indices don't include null values, so an index on the_column wouldn't be terribly useful for this query; but even if that weren't the case, I don't see why a big CASE expression would be any more index-friendly.
(Of course, it's hard to prove a negative, and maybe someone else will know what you're referring to?)

Well, I've repeatedly written queries like the first example since about forever (heck, I've written query generators that generate queries like that), and I've never had a problem.
I think you may be remembering some admonishment somebody gave you sometime against writing funky join conditions that use OR. In your first example, the conditions joined by the OR restrict the same one column of the same table, which is OK. If your second condition was a join condition (i.e., it restricted columns from two different tables), then you could get into bad situations where the query planner just has no choice but to use a Cartesian join (bad, bad, bad!!!).
I don't think your CASE function is really doing anything there, except perhaps hamper your query planner's attempts at finding a good execution plan for the query.
But more generally, just write the straightforward query first and see how it performs for realistic data. No need to worry about a problem that might not even exist!

Nulls can be confusing. The "... WHERE 1 = CASE ..." pattern is useful if you are trying to pass either a NULL or a value as a parameter, e.g. "WHERE the_column = @parameter". This post may be helpful: Passing Null using OLEDB.

Another example where CASE is useful is when using date functions on varchar columns. Adding an ISDATE check before, say, CONVERT(datetime, colA) might not work, because the conversion can still be evaluated for rows where colA holds non-date data, and the query can error out.

Related

Why in MS-SQL-SERVER documentation BETWEEN has lower precedence than AND?

Documentation shows that the AND operator has higher precedence than ALL, ANY, BETWEEN, IN, LIKE, OR, and SOME.
How can the AND operator, which is used to combine conditions, have higher precedence than operators such as IN, BETWEEN, and LIKE that are used within conditions? It seems impossible to combine conditions before the conditions have been evaluated individually.
What would that mean?
Thanks to Jeroen Mostert, Martin Smith and allmhuran for the clarification.
The crucial point to understand is that the "evaluation" in the documentation really refers to parsing, not actual execution.
Suppose we have the following table:
Test
col1  col2  col3
1     2     'a'
0     4     'e'
0     2     'f'
The following query is then executed against the above table:
SELECT *
FROM Test
WHERE col3 LIKE 'a%' AND col1*col1*col1<>0 AND col2/col1>0;
Now the Operator Precedence doesn't determine the evaluation order of conditions, but, as Martin Smith said in his comment, it is just about where the implied parentheses would go.
In this case the two AND have the same precedence and the query is parsed in one of the two following ways:
(col3 LIKE 'a%' AND col1*col1*col1<>0) AND col2/col1>0 (1)
or
col3 LIKE 'a%' AND (col1*col1*col1<>0 AND col2/col1>0) (2)
In my case it is the first one (1). The optimizer then chooses (again, in my case) to evaluate the division col2/col1>0 first, and a division-by-zero error is thrown.
Keep in mind the declarative nature of SQL: you say what you want, not how it's done. That is left to the DBMS engine, which will perform optimizations, create the optimal query plan and execute it, returning the desired result.

PostgreSQL: Performance of using CASE inside WHERE

With PostgreSQL, I needed to define a query that would conditionally SELECT some data based on a parameter. In this statement, I have a condition that needs to be evaluated only if another condition evaluates to TRUE.
To solve this, my first idea was to use a CASE inside a WHERE, which worked fine. For example, using a table test_table with columns id, name, and value:
SELECT
name
FROM test_table
WHERE
CASE
WHEN $1 > 10 THEN test_table.value > $1
ELSE TRUE
END
;
However, other peers suggested to use regular boolean logic instead, as it would perform faster. For example:
SELECT
name
FROM test_table
WHERE
(
$1 <= 10
OR test_table.value > $1
)
;
After testing with around 100k rows and using EXPLAIN ANALYSE, the two versions differed by about 1 ms on average:
With CASE: ~19.5ms Execution Time
With Regular Booleans: ~18.5ms Execution Time
Question is: Is the second approach really faster? Or was my testing flawed, and thus the results incorrect?
It is not so much that not using the case is faster all the time. The issue is that case prevents the optimizer from choosing different execution paths. It basically says that the conditions have to be executed in order, because case expressions impose ordering.
With fewer options, the resulting query plan might be slower.
There are some cases where the execution order is an advantage. This is particularly true when you know that some condition is very expensive to evaluate, so you want to be sure that other conditions are evaluated first:
where (case when x = 1 then true
when <some expensive function> then true
end)
could be more performant than:
where (x = 1) or <some expensive function>
(although in this simple example I would expect the Postgres compiler to be smart enough to do the right thing in the second case).
The second reason for avoiding case in a where clause is aesthetic. There are already sufficient boolean operators to generate just about any condition you need, so case is syntactic sugar that usually provides no additional nutrients.
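Speed aside, it is worth checking that the two formulations from the question return the same rows. The sketch below does that on three sample rows, with Python's sqlite3 standing in for PostgreSQL (an assumption; the named parameter :p replaces $1 and the literal 1 replaces TRUE). It also shows one case where they are not interchangeable: a NULL parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (id INTEGER PRIMARY KEY, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO test_table (name, value) VALUES (?, ?)",
    [("a", 5), ("b", 15), ("c", 25)],
)

CASE_FORM = """
    SELECT name FROM test_table
    WHERE CASE WHEN :p > 10 THEN value > :p ELSE 1 END
    ORDER BY id
"""
OR_FORM = """
    SELECT name FROM test_table
    WHERE (:p <= 10 OR value > :p)
    ORDER BY id
"""

def names(sql, p):
    """Run one of the two queries with parameter p and return the matching names."""
    return [row[0] for row in conn.execute(sql, {"p": p})]

for p in (3, 20, None):
    print(p, names(CASE_FORM, p), names(OR_FORM, p))
```

For non-NULL parameters the results match. With a NULL parameter the CASE falls through to its ELSE branch and keeps every row, while the OR form evaluates to NULL and keeps none, so the rewrite is only safe if the parameter can never be NULL.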

Can the order of individual operands affect whether a SQL expression is sargable?

A colleague of mine who is generally well-versed in SQL told me that the order of operands in a > or = expression could determine whether or not the expression was sargable. In particular, with a query whose case statement included:
CASE
when (select count(i.id)
from inventory i
inner join orders o on o.idinventory = i.idInventory
where o.idOrder = @order) > 1 THEN 2
ELSE 1
END
and was told to reverse the order of the operands to the equivalent
CASE
when 1 < (select count(i.id)
from inventory i
inner join orders o on o.idinventory = i.idInventory
where o.idOrder = @order) THEN 2
ELSE 1
END
for sargability concerns. I found no difference in query plans, though ultimately I made the change for the sake of sticking to team coding standards. Is what my co-worker said true in some cases? Does the order of operands in an expression have potential impact on its execution time? This doesn't mesh with how I understand sargability to work.
For Postgres, the answer is definitely: "No." (sql-server was added later.)
The query planner can flip around left and right operands of an operator as long as a COMMUTATOR is defined, which is the case for all instances of < and >. (Operators are actually defined by the operator name together with their accepted operand types.) And the query planner will do so to make an expression "sargable". Related answer with detailed explanation:
Can PostgreSQL index array columns?
It's different for other operators without COMMUTATOR. Example for ~~ (LIKE):
LATERAL JOIN not using trigram index
If you're talking about the most popular modern databases like Microsoft SQL Server, Oracle, Postgres, MySQL, Teradata, the answer is definitely NO.
What is a SARGable query?
A SARGable query is one that strives to narrow the number of rows a database has to process in order to return the expected result. What I mean, for example:
Consider this query:
select * from table where column1 <> 'some_value';
Obviously, using an index in this case is useless, because a database most certainly would have to look through all rows in a table to give you expected rows.
But what if we change the operator?
select * from table where column1 = 'some_value';
In this case an index can give good performance and return expected rows almost in a flash.
SARGable operators are: =, <, >, <= ,>=, LIKE (without %), BETWEEN
Non-SARGable operators are: <>, IN, OR
Now, back to your case.
Your problem is simple. You have X and you have Y. X > Y or Y < X - in both cases you have to determine the values of both variables, so this switching gives you nothing.
P.S. Of course, I concede, there could be databases with very poor optimizers where this kind of switching could play a role. But, as I said before, in modern databases you should not worry about it.

Behavior of SQL OR and AND operator

We have following expression as T-Sql Query:
Exp1 OR Exp2
Is Exp2 evaluated when Exp1 is True? I think there is no need to evaluate it.
Similarly; for,
Exp1 AND Exp2
is Exp2 evaluated when Exp1 is false?
Unlike in some programming languages, you cannot count on short-circuiting in T-SQL WHERE clauses. It might happen, or it might not.
SQL Server doesn't necessarily evaluate expressions in left to right order. Evaluation order is controlled by the execution plan and the plan is chosen based on the overall estimated cost for the whole of a query. So there is no certainty that SQL will perform the kind of short circuit optimisation you are describing. That flexibility is what makes the optimiser useful. For example it could be that the second expression in each case can be evaluated more efficiently than the first (if it is indexed or subject to some constraint for example).
SQL also uses three-value logic, which means that some of the equivalence rules used in two-value logic don't apply (although that doesn't alter the specific example you describe).
SQL Server query operators OR and AND are commutative. There is no inherent order and the query optimizer is free to choose the path of least cost to begin evaluation. Once the plan is set, the other part is not evaluated if a result is pre-determined.
This knowledge allows queries like
select * from master..spt_values
where (type = 'P' or 1=@param1)
and (1=@param2 or number < 1000)
option (recompile)
Where the pattern of evaluation is guaranteed to short circuit when the corresponding @param is set to 1. This pattern is typical of optional filters. Notice that it does not matter whether the @params are tested before or after the other part.
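The optional-filter pattern can be tried out on a toy table. This sketch uses Python's sqlite3 as a stand-in engine; the table vals, its rows and the parameter names skip_type_filter and skip_number_filter are all invented for the example. It only demonstrates which rows come back for each parameter combination; it says nothing about whether SQL Server short-circuits, which depends on the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vals (type TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO vals VALUES (?, ?)",
    [("P", 5), ("P", 2000), ("X", 7), ("X", 3000)],
)

QUERY = """
    SELECT type, number FROM vals
    WHERE (type = 'P' OR 1 = :skip_type_filter)
      AND (1 = :skip_number_filter OR number < 1000)
    ORDER BY rowid
"""

def run(skip_type_filter, skip_number_filter):
    """Setting a flag to 1 switches the corresponding filter off."""
    params = {"skip_type_filter": skip_type_filter,
              "skip_number_filter": skip_number_filter}
    return conn.execute(QUERY, params).fetchall()

print(run(0, 0))  # both filters active
print(run(1, 0))  # type filter switched off
print(run(0, 1))  # number filter switched off
print(run(1, 1))  # no filtering at all
```

The flag appears on the right of one OR and on the left of the other, and the results are the same either way, matching the point that operand order does not matter.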
If you are very good with SQL and know for a fact that the query is best forced down a certain plan, you can game SQL Server using CASE statements, which are always evaluated in nested order. Example below will force type='P' to always be evaluated first.
select *
from master..spt_values
where
case when type='P' then
case when number < 100 then 1
end end = 1
If you don't believe order of evaluation of the last query, try this
select *
from master..spt_values
where
case when type='P' then
case when number < 0 then
case when 1/0=1 then 1
end end end = 1
Even though the constant expression 1/0=1 is the cheapest to evaluate, it is NEVER evaluated; otherwise the query would have raised a divide-by-zero error instead of returning no rows (there are no rows in master..spt_values matching both conditions).
SQL Server sometimes performs boolean short circuiting, and sometimes does not.
It depends upon the query execution plan that is generated. The execution plan chosen depends on several factors, including the selectivity of the columns in the WHERE clause, table size, available indexes etc.

Logical operator AND having higher order of precedence than IN

I’ve read that the logical operator AND has higher precedence than the logical operator IN, but that doesn’t make sense: if it were true, wouldn’t the AND condition in the following statement be evaluated before the IN condition (and thus before the IN operator could check whether the Released field equals any of the values listed within the parentheses)?
SELECT Song, Released, Rating
FROM Songs
WHERE
Released IN (1967, 1977, 1987)
AND
SongName = 'WTTJ'
thanx
EDIT:
Egrunin and ig0774, I’ve checked it and unless I totally misunderstood your posts, it seems that
WHERE x > 0 AND x < 10 OR special_case = 1
is indeed the same as
WHERE (x > 0 AND x < 10) OR special_case = 1
Namely, I ran the following three queries
SELECT *
FROM Songs
WHERE AvailableOnCD='N' AND Released > 2000 OR Released = 1989
SELECT *
FROM Songs
WHERE (AvailableOnCD='N' AND Released > 2000) OR Released = 1989
SELECT *
FROM Songs
WHERE AvailableOnCD='N' AND (Released > 2000 OR Released = 1989)
and as it turns out the following two queries produce the same result:
SELECT *
FROM Songs
WHERE AvailableOnCD='N' AND Released > 2000 OR Released = 1989
SELECT *
FROM Songs
WHERE (AvailableOnCD='N' AND Released > 2000) OR Released = 1989
while
SELECT *
FROM Songs
WHERE AvailableOnCD='N' AND (Released > 2000 OR Released = 1989)
gives a different result
I'm going to assume you're using SQL Server, as in SQL Server AND has a higher order of precedence than IN. So, yes, the AND is evaluated first, but the rule for evaluating AND, is to check the expression on the left (in your sample, the IN part) and, if that is true, the expression on the right. In short, the AND clause is evaluated first, but the IN clause is evaluated as part of the AND evaluation.
It may be simpler to understand the order of precedence here as referring to how the statement is parsed, rather than how it is executed (even if MS's documentation equivocates on this).
Edit in response to comment from the OP:
I'm not altogether certain that IN being classified as a logical operator is not specific to SQL Server. I've never read the ISO standard, but I would note that the MySQL and Oracle docs define IN as a comparison operator, Postgres as a subquery expression, and Sybase itself as a "list operator". In my view, Sybase is the nearest to the mark here since the expression a IN (...) asks whether the value of attribute a is an element of the list of items between the parentheses.
That said, I might imagine the reason that SQL Server chose to classify IN as a logical operator is two-fold:
IN and the like do not have the type restrictions of the SQL Server comparison operators (=, !=, etc. cannot apply to text, ntext or image types; IN and other subset operators can be used against any type, except, in strict ISO SQL, NULL)
The result of an IN, etc. operation is a boolean value just like the other "logical operators"
Again, to my mind, this is not a sensible classification, but it is what Microsoft chose. Maybe someone else has further insight into why they may have so decided?
Call me a n00b, but I always use parentheses in nontrivial compound conditions.
SELECT Song, Released, Rating
FROM Songs
WHERE
(Released IN (1967, 1977, 1987))
AND
SongName = 'WTTJ'
Edited (Corrected, the point remains the same.)
Just yesterday I got caught by this. Started with working code:
WHERE x < 0 or x > 10
Changed it in haste:
WHERE x < 0 or x > 10 AND special_case = 1
Broke, because this is what I wanted:
WHERE (x < 0 or x > 10) AND special_case = 1
But this is what I got:
WHERE x < 0 or (x > 10 AND special_case = 1)
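The grouping is easy to confirm on a handful of rows. This sketch uses Python's sqlite3 as a stand-in engine (an assumption, though AND binding tighter than OR is standard SQL); the table t and its sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, special_case INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(-5, 0), (-5, 1), (5, 1), (20, 0), (20, 1)],
)

def rows(where):
    """Return the (x, special_case) rows matching the given WHERE text."""
    sql = f"SELECT x, special_case FROM t WHERE {where} ORDER BY rowid"
    return conn.execute(sql).fetchall()

bare = rows("x < 0 OR x > 10 AND special_case = 1")
and_first = rows("x < 0 OR (x > 10 AND special_case = 1)")
or_first = rows("(x < 0 OR x > 10) AND special_case = 1")

print(bare == and_first)  # the unparenthesized form groups the AND first
print(bare, or_first)     # the intended grouping drops the (-5, 0) row
```

The unparenthesized condition keeps the row (-5, 0), which the intended version excludes, so the explicit parentheses are what carry the meaning.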
In MySQL at least, it has a lower precedence. See http://dev.mysql.com/doc/refman/5.0/en/operator-precedence.html
I think of IN as a comparison operator whereas AND is a logical operator. So it's a bit apples and oranges, since the comparison operator must be evaluated first to see if the condition is true, then the logical operator is used to evaluate the conditions.