Behavior of SQL OR and AND operators

We have the following expression in a T-SQL query:
Exp1 OR Exp2
Is Exp2 evaluated when Exp1 is true? I think there is no need to evaluate it.
Similarly, for
Exp1 AND Exp2
is Exp2 evaluated when Exp1 is false?

Unlike in some programming languages, you cannot count on short-circuiting in T-SQL WHERE clauses. It might happen, or it might not.

SQL Server doesn't necessarily evaluate expressions in left-to-right order. Evaluation order is controlled by the execution plan, and the plan is chosen based on the overall estimated cost for the whole of a query. So there is no certainty that SQL will perform the kind of short-circuit optimisation you are describing. That flexibility is what makes the optimiser useful. For example, it could be that the second expression in each case can be evaluated more efficiently than the first (if it is indexed or subject to some constraint, for example).
SQL also uses three-valued logic, which means that some of the equivalence rules used in two-valued logic don't apply (although that doesn't alter the specific example you describe).
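As a quick illustration of three-valued logic (a minimal sketch; the literal values are only for demonstration), a comparison involving NULL yields UNKNOWN rather than true or false, which is why some two-valued rewrites are not valid in SQL:
-- NULL = NULL is UNKNOWN, not true, so the ELSE branch is returned
select case when null = null then 'equal' else 'not comparable' end;
-- UNKNOWN OR TRUE is TRUE, so this still returns a row
select 1 where null < 10 or 1 = 1;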

SQL Server query operators OR and AND are commutative. There is no inherent order and the query optimizer is free to choose the path of least cost to begin evaluation. Once the plan is set, the other part is not evaluated if a result is pre-determined.
This knowledge allows queries like
select * from master..spt_values
where (type = 'P' or 1 = @param1)
and (1 = @param2 or number < 1000)
option (recompile)
Where the pattern of evaluation is guaranteed to short-circuit when the corresponding @param is set to 1. This pattern is typical of optional filters. Notice that it does not matter whether the @params are tested before or after the other part.
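A minimal way to exercise the pattern with local variables (the variable declarations and values below are just an illustration, not part of the original query):
declare @param1 int = 1;   -- 1 switches the type filter off
declare @param2 int = 0;   -- 0 keeps the number filter active
select * from master..spt_values
where (type = 'P' or 1 = @param1)
and (1 = @param2 or number < 1000)
option (recompile)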
If you are very good with SQL and know for a fact that the query is best forced down a certain plan, you can game SQL Server using CASE expressions, which are always evaluated in nested order. The example below will force type='P' to always be evaluated first.
select *
from master..spt_values
where
case when type='P' then
case when number < 100 then 1
end end = 1
If you don't believe the order of evaluation of the last query, try this:
select *
from master..spt_values
where
case when type='P' then
case when number < 0 then
case when 1/0=1 then 1
end end end = 1
Even though the constant expression 1/0=1 is the cheapest to evaluate, it is NEVER evaluated - otherwise the query would have resulted in a divide-by-zero error instead of returning no rows (there are no rows in master..spt_values matching both conditions).

SQL Server sometimes performs boolean short circuiting, and sometimes does not.
It depends upon the query execution plan that is generated. The execution plan chosen depends on several factors, including the selectivity of the columns in the WHERE clause, table size, available indexes etc.
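If you want to see which way a particular query went, you can inspect the plan text yourself; a minimal sketch (using the same showplan option that appears in a later answer below; the sample predicate is only for illustration):
set showplan_text on
go
select * from master..spt_values
where type = 'P' or number < 1000
go
set showplan_text off
go
The predicates shown in the plan's Filter and Scan operators give a view of how the engine decided to apply them, which may differ from the order they are written in the query text.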

Related

PostgreSQL: Performance of using CASE inside WHERE

With PostgreSQL, I needed to define a query that would conditionally SELECT some data based on a parameter. In this statement, I have a condition that needs to be evaluated only if another condition evaluates to TRUE.
To solve this, my first idea was to use a CASE inside a WHERE, which worked fine. For example, using a table test_table with columns id, name, and value:
SELECT
name
FROM test_table
WHERE
CASE
WHEN $1 > 10 THEN test_table.value > $1
ELSE TRUE
END
;
However, other peers suggested using regular boolean logic instead, as it would perform faster. For example:
SELECT
name
FROM test_table
WHERE
(
$1 <= 10
OR test_table.value > $1
)
;
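For reference, one way such a comparison can be reproduced (a sketch only; the prepared-statement name, the integer parameter type, and the value 15 are assumptions for illustration):
PREPARE with_bool (integer) AS
SELECT name
FROM test_table
WHERE ($1 <= 10 OR test_table.value > $1);

EXPLAIN ANALYZE EXECUTE with_bool(15);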
After testing with around 100k rows and using EXPLAIN ANALYSE, the difference seemed to average about 1 ms:
With CASE: ~19.5ms Execution Time
With Regular Booleans: ~18.5ms Execution Time
Question is: Is the second approach really faster? Or was my testing flawed, and thus the results incorrect?
It is not so much that not using the case is faster all the time. The issue is that case prevents the optimizer from choosing different execution paths. It basically says that the conditions have to be executed in order, because case expressions impose ordering.
With fewer options, the resulting query plan might be slower.
There are some cases where the execution order is an advantage. This is particularly true when you know that some condition is very expensive to evaluate, so you want to be sure that other conditions are evaluated first:
where (case when x = 1 then true
when <some expensive function> then true
end)
could be more performant than:
where (x = 1) or <some expensive function>
(although in this simple example I would expect the Postgres compiler to be smart enough to do the right thing in the second case).
The second reason for avoiding case in a where clause is aesthetic. There are already sufficient boolean operators to generate just about any condition you need -- so case is syntactic sugar that usually provides no additional nutrients.
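To make the ordering point concrete, here is a sketch (expensive_check is a hypothetical, deliberately slow function used only for illustration; test_table.value is the column from the question):
-- hypothetical expensive predicate
CREATE FUNCTION expensive_check(v integer) RETURNS boolean AS $$
BEGIN
    PERFORM pg_sleep(0.1);   -- simulate an expensive evaluation
    RETURN v > 100;
END;
$$ LANGUAGE plpgsql;

-- the CASE is intended to ensure the cheap test runs first, so the
-- function is only reached for rows where value > 10
SELECT name
FROM test_table
WHERE CASE WHEN value <= 10 THEN true
           WHEN expensive_check(value) THEN true
      END;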

Can the order of individual operands affect whether a SQL expression is sargable?

A colleague of mine who is generally well-versed in SQL told me that the order of operands in a > or = expression could determine whether or not the expression was sargable. In particular, with a query whose case statement included:
CASE
    when (select count(i.id)
          from inventory i
          inner join orders o on o.idinventory = i.idInventory
          where o.idOrder = @order) > 1 THEN 2
    ELSE 1
END
and was told to reverse the order of the operands to the equivalent
CASE
    when 1 < (select count(i.id)
              from inventory i
              inner join orders o on o.idinventory = i.idInventory
              where o.idOrder = @order) THEN 2
    ELSE 1
END
for sargability concerns. I found no difference in query plans, though ultimately I made the change for the sake of sticking to team coding standards. Is what my co-worker said true in some cases? Does the order of operands in an expression have potential impact on its execution time? This doesn't mesh with how I understand sargability to work.
For Postgres, the answer is definitely: "No." (The sql-server tag was added to the question later.)
The query planner can flip around left and right operands of an operator as long as a COMMUTATOR is defined, which is the case for all instances of < and >. (An operator is actually defined by its name together with the types of its operands.) And the query planner will do so to make an expression "sargable". Related answer with detailed explanation:
Can PostgreSQL index array columns?
It's different for other operators without COMMUTATOR. Example for ~~ (LIKE):
LATERAL JOIN not using trigram index
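A quick way to see the commutator in the catalogs (a sketch; restricting to the integer operators is just an example):
-- each integer '<' / '>' operator lists the other as its COMMUTATOR
SELECT oprname,
       oprleft::regtype    AS left_type,
       oprright::regtype   AS right_type,
       oprcom::regoperator AS commutator
FROM   pg_operator
WHERE  oprname IN ('<', '>')
AND    oprleft = 'integer'::regtype
AND    oprright = 'integer'::regtype;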
If you're talking about the most popular modern databases like Microsoft SQL Server, Oracle, Postgres, MySQL, Teradata, the answer is definitely NO.
What is a SARGable query?
A SARGable query is one that strives to narrow the number of rows a database has to process in order to return the expected result.
Consider this query:
select * from table where column1 <> 'some_value';
Obviously, using an index in this case is useless, because the database would almost certainly have to look through all rows in the table to give you the expected rows.
But what if we change the operator?
select * from table where column1 = 'some_value';
In this case an index can give good performance and return expected rows almost in a flash.
SARGable operators are: =, <, >, <= ,>=, LIKE (without %), BETWEEN
Non-SARGable operators are: <>, IN, OR
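A common illustration of the same idea (the table and column names below are only for the example, and year() stands for any function applied to the column): wrapping the indexed column in a function defeats the index, while an equivalent predicate on the bare column can use it.
-- non-SARGable: the function hides the column from the index
select * from orders where year(order_date) = 2020;

-- SARGable rewrite: a range on the bare column can seek the index
select * from orders
where order_date >= '2020-01-01' and order_date < '2021-01-01';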
Now, back to your case.
Your problem is simple. You have X and you have Y. X > Y or Y < X - in both cases you have to determine the values of both variables, so this switching gives you nothing.
P.S. Of course, I concede, there could be databases with very poor optimizers where this kind of switching could play a role. But, as I said before, in modern databases you should not worry about it.

SQL - HAVING (execution vs structure)

I'm a beginner, studying on my own... please help me to clarify something about a query. I am working with a soccer database and trying to answer this question: list all seasons with an average goals-per-match rate of over 1, counting only matches that didn't end in a draw.
The right query for it is:
select season, round((sum(home_team_goal + away_team_goal) * 1.0) / count(id), 3) as ratio
from match
where home_team_goal != away_team_goal
group by season
having ratio > 1
I don't understand 2 things about this query:
Why do I multiply by 1.0? Why is it necessary?
I know that the execution in SQL is by this order:
from
where
group
having
select
So how can this query include having ratio > 1 if "ratio" is only defined in the "select", which is executed AFTER the HAVING?
Am I confused?
Thanks in advance for the help!
The multiplication is added as a typecast to convert INT to FLOAT, because by default the sum of ints is an int, and dividing two ints loses the decimal places.
HAVING. You can consider HAVING as WHERE but applied to the query results. Imagine the query is executed first without HAVING and then the HAVING condition is applied to result rows leaving only suitable ones.
In your case you first select grouped data and calculate aggregated results, and then skip unnecessary results of aggregation.
The * 1.0 is used for its ".0" part so that it tells the system to treat the expression as a decimal, and thus not perform an integer division, which would cut off the decimal part (e.g. 1 instead of 1.33).
About the second part: select being at the end just means that the last thing to be done is showing the data. However, assigning an alias to a calculated field is done, you could say, with first priority. Still, I am a bit doubtful; I am almost certain field aliases cannot be used in the WHERE/GROUP BY/HAVING in, say, SQL Server.
There is no order of execution of a SQL query. SQL is a descriptive language not a procedural language. A SQL query describes the result set that the query is producing. The SQL engine can execute it however it likes. In fact, most SQL engines compile the query into a directed acyclic graph, which looks nothing like the original query.
What you are referring to might be better phrased as the "order of interpretation". This is more simply described by simple rules. Column aliases can be used in the ORDER BY clause in any database. They cannot be used in the FROM, WHERE, or GROUP BY clauses. Some databases -- such as SQLite -- allow them to be referenced in the HAVING clause.
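For example (behaviour varies by database, as noted above; the column names follow the question's match table):
-- the alias is accepted in ORDER BY ...
select season, sum(home_team_goal + away_team_goal) as goals
from match
group by season
order by goals

-- ... but most databases reject the alias in WHERE:
-- select season, sum(home_team_goal + away_team_goal) as goals
-- from match
-- where goals > 2        -- error: the alias is not visible here
-- group by season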
As for the * 1.0, it is because some databases -- such as SQLite -- do integer arithmetic. However, the logic that you want is probably more simply expressed as:
round(avg((home_team_goal + away_team_goal) * 1.0), 3)
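Putting that into the full query might look like this (a sketch; the aggregate is repeated in the HAVING clause because most databases do not allow the column alias there):
select season,
       round(avg((home_team_goal + away_team_goal) * 1.0), 3) as ratio
from match
where home_team_goal <> away_team_goal
group by season
having avg((home_team_goal + away_team_goal) * 1.0) > 1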

Is there a standard for SQL aggregate function calculation?

Is there a standard for SQL implementations regarding multiple calls to the same aggregate function in the same query?
For example, consider the following example, based on a popular example schema:
SELECT Customer,SUM(OrderPrice) FROM Orders
GROUP BY Customer
HAVING SUM(OrderPrice)>1000
Presumably, it takes computation time to calculate the value of SUM(OrderPrice). Is this cost incurred for each reference to the aggregate function, or is the result stored for a particular query?
Or, is there no standard for SQL engine implementation for this case?
Although I have worked with many different DBMS, I will only show you the result of proving this on SQL Server. Consider this query, which even includes a CAST in the expression. Looking at the query plan, the expression sum(cast(number as bigint)) is only computed once, which is defined as DEFINE:([Expr1005]=SUM([Expr1006])).
set showplan_text on
select type, sum(cast(number as bigint))
from master..spt_values
group by type
having sum(cast(number as bigint)) > 100000
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|--Filter(WHERE:([Expr1005]>(100000)))
|--Hash Match(Aggregate, HASH:([Expr1004]), RESIDUAL:([Expr1004] = [Expr1004]) DEFINE:([Expr1005]=SUM([Expr1006])))
|--Compute Scalar(DEFINE:([Expr1004]=CONVERT(nchar(3),[mssqlsystemresource].[sys].[spt_values].[type],0), [Expr1006]=CONVERT(bigint,[mssqlsystemresource].[sys].[spt_values].[number],0)))
|--Index Scan(OBJECT:([mssqlsystemresource].[sys].[spt_values].[ix2_spt_values_nu_nc]))
It may not be very obvious above, since it doesn't show the SELECT result, so I have added a *10 to the query below. Notice that it now includes one extra step DEFINE:([Expr1006]=[Expr1005]*(10)) (steps run bottom to top) which demonstrates that the new expression required it to perform an extra calculation. Yet, even this is optimized, as it doesn't recalculate the entire expression - merely, it is taking Expr1005 and multiplying that by 10!
set showplan_text on
select type, sum(cast(number as bigint))*10
from master..spt_values
group by type
having sum(cast(number as bigint)) > 100000
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|--Compute Scalar(DEFINE:([Expr1006]=[Expr1005]*(10)))
|--Filter(WHERE:([Expr1005]>(100000)))
|--Hash Match(Aggregate, HASH:([Expr1004]), RESIDUAL:([Expr1004] = [Expr1004]) DEFINE:([Expr1005]=SUM([Expr1007])))
|--Compute Scalar(DEFINE:([Expr1004]=CONVERT(nchar(3),[mssqlsystemresource].[sys].[spt_values].[type],0), [Expr1007]=CONVERT(bigint,[mssqlsystemresource].[sys].[spt_values].[number],0)))
|--Index Scan(OBJECT:([mssqlsystemresource].[sys].[spt_values].[ix2_spt_values_nu_nc]))
This is very likely how all the other DBMS work as well, at least the major ones, i.e. PostgreSQL, Sybase, Oracle, DB2, Firebird and MySQL.

SQL and logical operators and null checks

I've got a vague, possibly cargo-cult memory from years of working with SQL Server that when you've got a possibly-null column, it's not safe to write "WHERE" clause predicates like:
... WHERE the_column IS NULL OR the_column < 10 ...
It had something to do with the fact that SQL rules don't stipulate short-circuiting (and in fact that's kind-of a bad idea possibly for query optimization reasons), and thus the "<" comparison (or whatever) could be evaluated even if the column value is null. Now, exactly why that'd be a terrible thing, I don't know, but I recall being sternly warned by some documentation to always code that as a "CASE" clause:
... WHERE 1 = CASE WHEN the_column IS NULL THEN 1 WHEN the_column < 10 THEN 1 ELSE 0 END ...
(the goofy "1 = " part is because SQL Server doesn't/didn't have first-class booleans, or at least I thought it didn't.)
So my questions here are:
Is that really true for SQL Server (or perhaps back-rev SQL Server 2000 or 2005) or am I just nuts?
If so, does the same caveat apply to PostgreSQL? (8.4 if it matters)
What exactly is the issue? Does it have to do with how indexes work or something?
My grounding in SQL is pretty weak.
I don't know SQL Server so I can't speak to that.
Given an expression a L b for some logical operator L, there is no guarantee that a will be evaluated before or after b or even that both a and b will be evaluated:
Expression Evaluation Rules
The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all.
[...]
Note that this is not the same as the left-to-right "short-circuiting" of Boolean operators that is found in some programming languages.
As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since those clauses are extensively reprocessed as part of developing an execution plan.
As far as an expression of the form:
the_column IS NULL OR the_column < 10
is concerned, there's nothing to worry about since NULL < n is NULL for all n, even NULL < NULL evaluates to NULL; furthermore, NULL isn't true so
null is null or null < 10
is just a complicated way of saying true or null and that's true regardless of which sub-expression is evaluated first.
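A quick way to convince yourself (a minimal sketch using a throwaway table; the names and values are illustrative only):
create temporary table t (the_column integer);
insert into t values (null), (5), (42);

select * from t where the_column is null or the_column < 10;
-- returns the NULL row and the row with 5, with no error,
-- regardless of which side of the OR gets evaluated first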
The whole "use a CASE" sounds mostly like cargo-cult SQL to me. However, like most cargo-cultism, there is a kernel a truth buried under the cargo; just below my first excerpt from the PostgreSQL manual, you will find this:
When it is essential to force evaluation order, a CASE construct (see Section 9.16) can be used. For example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:
SELECT ... WHERE x > 0 AND y/x > 1.5;
But this is safe:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
So, if you need to guard against a condition that will raise an exception or have other side effects, then you should use a CASE to control the order of evaluation as a CASE is evaluated in order:
Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed. If the condition's result is not true, any subsequent WHEN clauses are examined in the same manner.
So given this:
case when A then Ra
when B then Rb
when C then Rc
...
A is guaranteed to be evaluated before B, B before C, etc. and evaluation stops as soon as one of the conditions evaluates to a true value.
In summary, a CASE short-circuits but neither AND nor OR short-circuits, so you only need to use a CASE when you need to protect against side effects.
Instead of
the_column IS NULL OR the_column < 10
I'd do
isnull(the_column,0) < 10
or for the first example
WHERE 1 = CASE WHEN isnull(the_column,0) < 10 THEN 1 ELSE 0 END ...
I've never heard of such a problem, and this bit of SQL Server 2000 documentation uses WHERE advance < $5000 OR advance IS NULL in an example, so it must not have been a very stern rule. My only concern with OR is that it has lower precedence than AND, so you might accidentally write something like WHERE the_column IS NULL OR the_column < 10 AND the_other_column > 20 when that's not what you mean; but the usual solution is parentheses rather than a big CASE expression.
I think that in most RDBMSes, indices don't include null values, so an index on the_column wouldn't be terribly useful for this query; but even if that weren't the case, I don't see why a big CASE expression would be any more index-friendly.
(Of course, it's hard to prove a negative, and maybe someone else will know what you're referring to?)
Well, I've repeatedly written queries like the first example since about forever (heck, I've written query generators that generate queries like that), and I've never had a problem.
I think you may be remembering some admonishment somebody gave you sometime against writing funky join conditions that use OR. In your first example, the conditions joined by the OR restrict the same one column of the same table, which is OK. If your second condition was a join condition (i.e., it restricted columns from two different tables), then you could get into bad situations where the query planner just has no choice but to use a Cartesian join (bad, bad, bad!!!).
I don't think your CASE function is really doing anything there, except perhaps hamper your query planner's attempts at finding a good execution plan for the query.
But more generally, just write the straightforward query first and see how it performs for realistic data. No need to worry about a problem that might not even exist!
Nulls can be confusing. The "... WHERE 1 = CASE ..." is useful if you are trying to pass a NULL or a value as a parameter, e.g. "WHERE the_column = @parameter". This post may be helpful: Passing Null using OLEDB.
Another example where CASE is useful is when using date functions on varchar columns. Adding ISDATE before using, say, convert(colA, datetime) might not work, and when colA has non-date data the query can error out.
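A sketch of the CASE-guarded version of that pattern (colA and the table name are illustrative):
-- the CASE is meant to ensure CONVERT only runs on values ISDATE accepts
select case when isdate(colA) = 1
            then convert(datetime, colA)
       end as colA_as_date
from some_table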