Why in MS-SQL-SERVER documentation BETWEEN has lower precedence than AND? - sql

Documentation shows that the AND operator has higher precedence than ALL, ANY, BETWEEN, IN, LIKE, OR, and SOME.
How can the AND operator, that is used to combine conditions,have higher precedence than operators such as IN, BETWEEN, and LIKE that are used within conditions? It seems impossible to combine conditions before the conditions have been evaluated individually.
What would that mean?

Thanks to Jeroen Mostert, Martin Smith and allmhuran for the clarification.
The crucial point to understand is that : the "evaluation" in the documentation is really with respect to parsing, not actual execution.
Suppose having the following table
Test
[col1, col2, col3]
1 2 'a'
0 4 'e'
0 2 'f'
The following query is then executed against the above table:
SELECT *
FROM Test
WHERE col3 LIKE 'a%' AND col1*col1*col1<>0 AND col2/col1>0;
Now the Operator Precedence doesn't determine the evaluation order of conditions, but, as Martin Smith said in his comment, it is just about where the implied parentheses would go.
In this case the two AND have the same precedence and the query is parsed in one of the two following ways:
(col3 LIKE 'a%' AND col1*col1*col1<>0) AND col2/col1>0 (1)
or
col3 LIKE 'a%' AND (col1*col1*col1<>0 AND col2/col1>0) (2)
(in my case the first one (1)) and then the optimizer (always in my case!!) chooses to execute the division col2/col1>0 first and a division by zero error is thrown.
Keep in mind the declarative nature of SQL: you say what you want not how it's done: this is left to the dbms engine which will performs optimizations, create the optimal query plan and execute it returning the desired result.

Related

PostgresSQL: Performance of using CASE inside WHERE

With PostgreSQL, I needed to define a query that would conditionally SELECT some data based on a parameter. In this statement, I have a condition that only needs to be evaluated only if another condition evaluates to TRUE.
To solve this, my first idea was to use a CASE inside a WHERE, which worked fine. For example, using a table test_table with columns id, name, and value:
SELECT
name
FROM test_table
WHERE
CASE
WHEN $1 > 10 THEN test_table.value > $1
ELSE TRUE
END
;
However, other peers suggested to use regular boolean logic instead, as it would perform faster. For example:
SELECT
name
FROM test_table
WHERE
(
$1 <= 10
OR test_table.value > $1
)
;
After testing with around 100k rows and using EXPLAIN ANALYSE, the speed difference seemed to average at about 1ms difference:
With CASE: ~19.5ms Execution Time
With Regular Booleans: ~18.5ms Execution Time
Question is: Is the second approach really faster? Or was my testing flawed, and thus the results incorrect?
It is not so much that not using the case is faster all the time. The issue is that case prevents the optimizer from choosing different execution paths. It basically says that the conditions have to be executed in order, because case expressions impose ordering.
With fewer options, the resulting query plan might be slower.
There are some cases where the execution order is an advantage. This is particularly true when you know that some condition is very expensive to evaluate, so you want to be sure that other conditions are evaluated first:
where (case when x = 1 then true
when <some expensive function> then true
end)
could be more performant than:
where (x = 1) or <some expensive function>
(although in this simple example I would expect the Postgres compiler to be smart enough to do the right thing in the second case).
The second reason for avoiding case in a where clause is aesthetic. There are already sufficient boolean operators to generate just about any condition you need -- so case is syntactic sugar that usually provides no additional nutrients

Can the order of individual operands affect whether a SQL expression is sargable?

A colleague of mine who is generally well-versed in SQL told me that the order of operands in a > or = expression could determine whether or not the expression was sargable. In particular, with a query whose case statement included:
CASE
when (select count(i.id)
from inventory i
inner join orders o on o.idinventory = i.idInventory
where o.idOrder = #order) > 1 THEN 2
ELSE 1
and was told to reverse the order of the operands to the equivalent
CASE
when 1 < (select count(i.id)
from inventory i
inner join orders o on o.idinventory = i.idInventory
where o.idOrder = #order) THEN 2
ELSE 1
for sargability concerns. I found no difference in query plans, though ultimately I made the change for the sake of sticking to team coding standards. Is what my co-worker said true in some cases? Does the order of operands in an expression have potential impact on its execution time? This doesn't mesh with how I understand sargability to work.
For Postgres, the answer is definitely: "No." (sql-server was added later.)
The query planner can flip around left and right operands of an operator as long as a COMMUTATOR is defined, which is the case for all instance of < and >. (Operators are actually defined by the operator itself and their accepted operands.) And the query planner will do so to make an expression "sargable". Related answer with detailed explanation:
Can PostgreSQL index array columns?
It's different for other operators without COMMUTATOR. Example for ~~ (LIKE):
LATERAL JOIN not using trigram index
If you're talking about the most popular modern databases like Microsoft SQL, Oracle, Postgres, MySql, Teradata, the answer is definitely NO.
What is a SARGable query?
A SARGable query is the one that strive to narrow the number of rows a database has to process in order to return you the expected result. What I mean, for example:
Consider this query:
select * from table where column1 <> 'some_value';
Obviously, using an index in this case is useless, because a database most certainly would have to look through all rows in a table to give you expected rows.
But what if we change the operator?
select * from table where column1 = 'some_value';
In this case an index can give good performance and return expected rows almost in a flash.
SARGable operators are: =, <, >, <= ,>=, LIKE (without %), BETWEEN
Non-SARGable operators are: <>, IN, OR
Now, back to your case.
Your problem is simple. You have X and you have Y. X > Y or Y < X - in both cases you have to determine the values of both variables, so this switching gives you nothing.
P.S. Of course, I concede, there could be databases with very poor optimizers where this kind of swithing could play role. But, as I said before, in modern databases you should not worry about it.

Order of filtering operations with AND & OR in Hive

Let's say I have a Hive query that looks something like this:
SELECT COUNT(*) FROM my_table WHERE
col1 LIKE "%str1%" -- matches 1% of rows
OR col1 LIKE "%str2%" -- matches 1% of rows
OR col1 LIKE "%str3%" -- matches 0 rows
OR col1 LIKE "%str4%" -- matches 90% of rows
(...more...);
If some of these strings that I'm matching are far more common than the others, I'm wondering what (if any) performance gains I would get from moving col 1 LIKE "%str4%" to the top of this list.
The above example is somewhat simple, but if each of these OR operations is a regular expression match on a long string, I would imagine the time to perform 3 matches (str1, str2, str3) that fail almost all of the time would become quite expensive.
Does Hive loop through these operations sequentially and break when it determines a true match? I suppose it would be worth asking if the equivalent Pig operation works this way as well.
For Pig the following from Programming Pig should clarify:
Pig will short-circuit Boolean operations when possible. If the first
(left) predicate of an and is false, the second (right) will not be
evaluated. So in 1 == 2 and udf(x), the UDF will never be invoked.
Similarly, if the first predicate of an or is true, the second
predicate will not be evaluted. 1 == 1 or udf(x) will never invoke the
UDF.
So if each of your logical operators performs some heavy weight operation then reordering them so they will short-circuit on the first condition for 90% of the records should buy you some performance gains. Note that YMMV for "some performance gains" as it will vary on the total number of logical operations (the example only gives 4, there could be more), the complexity of the regex's being short circuited, and the size/characteristics of the data being matched.

SQL and logical operators and null checks

I've got a vague, possibly cargo-cult memory from years of working with SQL Server that when you've got a possibly-null column, it's not safe to write "WHERE" clause predicates like:
... WHERE the_column IS NULL OR the_column < 10 ...
It had something to do with the fact that SQL rules don't stipulate short-circuiting (and in fact that's kind-of a bad idea possibly for query optimization reasons), and thus the "<" comparison (or whatever) could be evaluated even if the column value is null. Now, exactly why that'd be a terrible thing, I don't know, but I recall being sternly warned by some documentation to always code that as a "CASE" clause:
... WHERE 1 = CASE WHEN the_column IS NULL THEN 1 WHEN the_column < 10 THEN 1 ELSE 0 END ...
(the goofy "1 = " part is because SQL Server doesn't/didn't have first-class booleans, or at least I thought it didn't.)
So my questions here are:
Is that really true for SQL Server (or perhaps back-rev SQL Server 2000 or 2005) or am I just nuts?
If so, does the same caveat apply to PostgreSQL? (8.4 if it matters)
What exactly is the issue? Does it have to do with how indexes work or something?
My grounding in SQL is pretty weak.
I don't know SQL Server so I can't speak to that.
Given an expression a L b for some logical operator L, there is no guarantee that a will be evaluated before or after b or even that both a and b will be evaluated:
Expression Evaluation Rules
The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all.
[...]
Note that this is not the same as the left-to-right "short-circuiting" of Boolean operators that is found in some programming languages.
As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since those clauses are extensively reprocessed as part of developing an execution plan.
As far as an expression of the form:
the_column IS NULL OR the_column < 10
is concerned, there's nothing to worry about since NULL < n is NULL for all n, even NULL < NULL evaluates to NULL; furthermore, NULL isn't true so
null is null or null < 10
is just a complicated way of saying true or null and that's true regardless of which sub-expression is evaluated first.
The whole "use a CASE" sounds mostly like cargo-cult SQL to me. However, like most cargo-cultism, there is a kernel a truth buried under the cargo; just below my first excerpt from the PostgreSQL manual, you will find this:
When it is essential to force evaluation order, a CASE construct (see Section 9.16) can be used. For example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:
SELECT ... WHERE x > 0 AND y/x > 1.5;
But this is safe:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
So, if you need to guard against a condition that will raise an exception or have other side effects, then you should use a CASE to control the order of evaluation as a CASE is evaluated in order:
Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed. If the condition's result is not true, any subsequent WHEN clauses are examined in the same manner.
So given this:
case when A then Ra
when B then Rb
when C then Rc
...
A is guaranteed to be evaluated before B, B before C, etc. and evaluation stops as soon as one of the conditions evaluates to a true value.
In summary, a CASE short-circuits buts neither AND nor OR short-circuit so you only need to use a CASE when you need to protect against side effects.
Instead of
the_column IS NULL OR the_column < 10
I'd do
isnull(the_column,0) < 10
or for the first example
WHERE 1 = CASE WHEN isnull(the_column,0) < 10 THEN 1 ELSE 0 END ...
I've never heard of such a problem, and this bit of SQL Server 2000 documentation uses WHERE advance < $5000 OR advance IS NULL in an example, so it must not have been a very stern rule. My only concern with OR is that it has lower precedence than AND, so you might accidentally write something like WHERE the_column IS NULL OR the_column < 10 AND the_other_column > 20 when that's not what you mean; but the usual solution is parentheses rather than a big CASE expression.
I think that in most RDBMSes, indices don't include null values, so an index on the_column wouldn't be terribly useful for this query; but even if that weren't the case, I don't see why a big CASE expression would be any more index-friendly.
(Of course, it's hard to prove a negative, and maybe someone else will know what you're referring to?)
Well, I've repeatedly written queries like the first example since about forever (heck, I've written query generators that generate queries like that), and I've never had a problem.
I think you may be remembering some admonishment somebody gave you sometime against writing funky join conditions that use OR. In your first example, the conditions joined by the OR restrict the same one column of the same table, which is OK. If your second condition was a join condition (i.e., it restricted columns from two different tables), then you could get into bad situations where the query planner just has no choice but to use a Cartesian join (bad, bad, bad!!!).
I don't think your CASE function is really doing anything there, except perhaps hamper your query planner's attempts at finding a good execution plan for the query.
But more generally, just write the straightforward query first and see how it performs for realistic data. No need to worry about a problem that might not even exist!
Nulls can be confusing. The " ... WHERE 1 = CASE ... " is useful if you are trying to pass a Null OR a Value as a parameter ex. "WHERE the_column = #parameter. This post may be helpful Passing Null using OLEDB .
Another example where CASE is useful is when using date functions on the varchar columns. adding ISDATE before using say convert(colA,datetime) might not work, and when colA has non-date data the query can error out.

Behavior of SQL OR and AND operator

We have following expression as T-Sql Query:
Exp1 OR Exp2
Is Exp2 evaluated when Exp1 is True? I think there is no need to evaluate it.
Similarly; for,
Exp1 AND Exp2
is Exp2 evaluated when Exp1 is false?
Unlike in some programming languages, you cannot count on short-circuiting in T-SQL WHERE clauses. It might happen, or it might not.
SQL Server doesn't necessarily evaluate expressions in left to right order. Evaluation order is controlled by the execution plan and the plan is chosen based on the overall estimated cost for the whole of a query. So there is no certainty that SQL will perform the kind of short circuit optimisation you are describing. That flexibility is what makes the opimiser useful. For example it could be that the second expression in each case can be evaluated more efficiently than the first (if it is indexed or subject to some constraint for example).
SQL also uses three-value logic, which means that some of the equivalence rules used in two-value logic don't apply (although that doesn't alter the specific example you describe).
SQL Server query operators OR and AND are commutative. There is no inherent order and the query optimizer is free to choose the path of least cost to begin evaluation. Once the plan is set, the other part is not evaluated if a result is pre-determined.
This knowledge allows queries like
select * from master..spt_values
where (type = 'P' or 1=#param1)
and (1=#param2 or number < 1000)
option (recompile)
Where the pattern of evaluation is guaranteed to short circuit when #param is set to 1. This pattern is typical of optional filters. Notice that it does not matter whether the #params are tested before or after the other part.
If you are very good with SQL and know for a fact that the query is best forced down a certain plan, you can game SQL Server using CASE statements, which are always evaluated in nested order. Example below will force type='P' to always be evaluated first.
select *
from master..spt_values
where
case when type='P' then
case when number < 100 then 1
end end = 1
If you don't believe order of evaluation of the last query, try this
select *
from master..spt_values
where
case when type='P' then
case when number < 0 then
case when 1/0=1 then 1
end end end = 1
Even though the constants in the expression 1/0=1 is the least cost to evaluate, it is NEVER evaluated - otherwise the query would have resulted in divide-by-zero instead of returning no rows (there are no rows in master..spt_values matching both conditions).
SQL Server sometimes performs boolean short circuiting, and sometimes does not.
It depends upon the query execution plan that is generated. The execution plan chosen depends on several factors, including the selectivity of the columns in the WHERE clause, table size, available indexes etc.