Netezza does not do lazy evaluation of case statements?

I'm performing a computation which might contain division by 0, in which case I want the result to be an arbitrary value (55). To my surprise, wrapping the computation with a case statement did not do the job!
select case when 1=0 then 3/0 else 55 end
ERROR HY000: Divide by 0
Why is that? Is there another workaround?

ok, I was being inaccurate. This is the exact query that fails with "divide by 0":
select case when min(baba) = 0 then 55 else sum(1/baba) end from t group by baba
This looks like a lazy-evaluation failure in Netezza. Notice that I group by baba, so whenever baba is 0, min(baba) is also 0, and evaluation should have stopped gracefully without ever reaching the 1/baba term and failing on division by 0. Right? Well, no.
My guess at the gotcha here, and the reason for the failure, is that Netezza evaluates the row-level terms before it can evaluate the aggregate terms. It must evaluate 1/baba and baba at every row, and only then can it compute the aggregates min(baba) and sum(1/baba).
So the workaround (for me) was: select case when min(baba) = 0 then 55 else 1/min(baba) end from t group by baba, which has the same meaning.
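Another way to sidestep the per-row division, a sketch assuming the same table t and column baba and the standard NULLIF function, is to neutralize the zero rows inside the aggregate:

-- NULLIF(baba, 0) is NULL when baba = 0, so those rows divide by NULL,
-- which yields NULL; sum() then simply ignores them instead of erroring
select case when min(baba) = 0 then 55
            else sum(1.0 / nullif(baba, 0))
       end
from t
group by baba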

Related

Hive case when clause efficiency

Is there difference between:
select
  case some_calculation()
    when 'a' then 1
    when 'b' then 2
    else 0
  end
and
select
  case
    when some_calculation() = 'a' then 1
    when some_calculation() = 'b' then 2
    else 0
  end
I assume in the second version, the some_calculation() function will be evaluated twice.
I don't know how to verify that. Any input will be appreciated.
This is not well-documented in Hive. But it is well understood in the standard and in other databases.
In the second version, some_calculation() will definitely be evaluated multiple times. A case expression is evaluated sequentially, so the second when is not evaluated unless the first evaluates to false.
In the first version, the value should be evaluated once.
You can get a flavor for the difference by using a volatile function, such as random(). This db<>fiddle illustrates the difference between the two approaches: the first version never returns NULL; the second returns lots of them.
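The experiment is easy to reproduce. A minimal sketch in PostgreSQL syntax, matching the random() call in the answer; the two-bucket split is just an arbitrary placeholder:

-- simple CASE: random() is drawn once and compared to each WHEN, so a
-- branch always matches and the result is never NULL
select case floor(random() * 2) when 0 then 'a' when 1 then 'b' end;

-- searched CASE: each WHEN draws random() again; both tests can fail
-- (first draw 1, second draw 0), so roughly a quarter of runs return NULL
select case when floor(random() * 2) = 0 then 'a'
            when floor(random() * 2) = 1 then 'b'
       end;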

SQL and logical operators and null checks

I've got a vague, possibly cargo-cult memory from years of working with SQL Server that when you've got a possibly-null column, it's not safe to write "WHERE" clause predicates like:
... WHERE the_column IS NULL OR the_column < 10 ...
It had something to do with the fact that SQL rules don't stipulate short-circuiting (and in fact mandating it would arguably be a bad idea, for query-optimization reasons), and thus the "<" comparison (or whatever) could be evaluated even if the column value is null. Now, exactly why that'd be a terrible thing, I don't know, but I recall being sternly warned by some documentation to always code that as a "CASE" clause:
... WHERE 1 = CASE WHEN the_column IS NULL THEN 1 WHEN the_column < 10 THEN 1 ELSE 0 END ...
(the goofy "1 = " part is because SQL Server doesn't/didn't have first-class booleans, or at least I thought it didn't.)
So my questions here are:
Is that really true for SQL Server (or perhaps back-rev SQL Server 2000 or 2005) or am I just nuts?
If so, does the same caveat apply to PostgreSQL? (8.4 if it matters)
What exactly is the issue? Does it have to do with how indexes work or something?
My grounding in SQL is pretty weak.
I don't know SQL Server so I can't speak to that.
Given an expression a L b for some logical operator L, there is no guarantee that a will be evaluated before or after b or even that both a and b will be evaluated:
Expression Evaluation Rules
The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all.
[...]
Note that this is not the same as the left-to-right "short-circuiting" of Boolean operators that is found in some programming languages.
As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since those clauses are extensively reprocessed as part of developing an execution plan.
As far as an expression of the form:
the_column IS NULL OR the_column < 10
is concerned, there's nothing to worry about since NULL < n is NULL for all n, even NULL < NULL evaluates to NULL; furthermore, NULL isn't true so
null is null or null < 10
is just a complicated way of saying true or null and that's true regardless of which sub-expression is evaluated first.
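You can verify this from any PostgreSQL session:

select (true or null);   -- true
select (false or null);  -- null, which is also not true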
The whole "use a CASE" sounds mostly like cargo-cult SQL to me. However, like most cargo-cultism, there is a kernel of truth buried under the cargo; just below my first excerpt from the PostgreSQL manual, you will find this:
When it is essential to force evaluation order, a CASE construct (see Section 9.16) can be used. For example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:
SELECT ... WHERE x > 0 AND y/x > 1.5;
But this is safe:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
So, if you need to guard against a condition that will raise an exception or have other side effects, then you should use a CASE to control the order of evaluation as a CASE is evaluated in order:
Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed. If the condition's result is not true, any subsequent WHEN clauses are examined in the same manner.
So given this:
case when A then Ra
when B then Rb
when C then Rc
...
A is guaranteed to be evaluated before B, B before C, etc. and evaluation stops as soon as one of the conditions evaluates to a true value.
In summary, a CASE short-circuits but neither AND nor OR is guaranteed to, so you only need to use a CASE when you must protect against side effects.
Instead of
the_column IS NULL OR the_column < 10
I'd do
isnull(the_column,0) < 10
or for the first example
WHERE 1 = CASE WHEN isnull(the_column,0) < 10 THEN 1 ELSE 0 END ...
I've never heard of such a problem, and this bit of SQL Server 2000 documentation uses WHERE advance < $5000 OR advance IS NULL in an example, so it must not have been a very stern rule. My only concern with OR is that it has lower precedence than AND, so you might accidentally write something like WHERE the_column IS NULL OR the_column < 10 AND the_other_column > 20 when that's not what you mean; but the usual solution is parentheses rather than a big CASE expression.
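Spelled out, the parenthesized fix looks like this (same placeholder columns as the question):

-- without parentheses, AND binds tighter than OR, so the condition parses as
--   the_column IS NULL OR (the_column < 10 AND the_other_column > 20)
... WHERE (the_column IS NULL OR the_column < 10) AND the_other_column > 20 ...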
I think that in most RDBMSes, indices don't include null values, so an index on the_column wouldn't be terribly useful for this query; but even if that weren't the case, I don't see why a big CASE expression would be any more index-friendly.
(Of course, it's hard to prove a negative, and maybe someone else will know what you're referring to?)
Well, I've repeatedly written queries like the first example since about forever (heck, I've written query generators that generate queries like that), and I've never had a problem.
I think you may be remembering some admonishment somebody gave you sometime against writing funky join conditions that use OR. In your first example, the conditions joined by the OR restrict the same column of the same table, which is OK. If your second condition were a join condition (i.e., it restricted columns from two different tables), then you could get into bad situations where the query planner just has no choice but to use a Cartesian join (bad, bad, bad!!!).
I don't think your CASE function is really doing anything there, except perhaps to hamper your query planner's attempts at finding a good execution plan for the query.
But more generally, just write the straightforward query first and see how it performs for realistic data. No need to worry about a problem that might not even exist!
Nulls can be confusing. The "... WHERE 1 = CASE ..." form is useful if you are trying to pass either a null or a value as a parameter, e.g. "WHERE the_column = @parameter". This post may be helpful: Passing Null using OLEDB.
Another example where CASE is useful is when applying date functions to varchar columns. Adding an ISDATE check before, say, CONVERT(datetime, colA) might not work on its own (the check is not guaranteed to run first), and when colA holds non-date data the query can error out.
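A sketch of the CASE-guarded version in T-SQL (colA and the table name t are hypothetical stand-ins from the example above):

-- the CASE guarantees ISDATE runs first; rows whose colA is not a valid
-- date yield NULL instead of raising a conversion error
SELECT CASE WHEN ISDATE(colA) = 1
            THEN CONVERT(datetime, colA)
       END AS parsed_date
FROM t;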

Behavior of SQL OR and AND operator

We have the following expression in a T-SQL query:
Exp1 OR Exp2
Is Exp2 evaluated when Exp1 is True? I think there is no need to evaluate it.
Similarly; for,
Exp1 AND Exp2
is Exp2 evaluated when Exp1 is false?
Unlike in some programming languages, you cannot count on short-circuiting in T-SQL WHERE clauses. It might happen, or it might not.
SQL Server doesn't necessarily evaluate expressions in left-to-right order. Evaluation order is controlled by the execution plan, and the plan is chosen based on the overall estimated cost for the whole query. So there is no certainty that SQL will perform the kind of short-circuit optimisation you are describing. That flexibility is what makes the optimiser useful. For example, it could be that the second expression in each case can be evaluated more efficiently than the first (if it is indexed or subject to some constraint, for example).
SQL also uses three-value logic, which means that some of the equivalence rules used in two-value logic don't apply (although that doesn't alter the specific example you describe).
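Three-valued logic is easy to see directly; a quick check that runs anywhere (no tables needed):

-- NULL = NULL is neither true nor false, so both branches are skipped
SELECT CASE WHEN NULL = NULL THEN 'equal'
            WHEN NOT (NULL = NULL) THEN 'not equal'
            ELSE 'unknown'
       END;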
SQL Server query operators OR and AND are commutative. There is no inherent order and the query optimizer is free to choose the path of least cost to begin evaluation. Once the plan is set, the other part is not evaluated if a result is pre-determined.
This knowledge allows queries like
select * from master..spt_values
where (type = 'P' or 1 = @param1)
  and (1 = @param2 or number < 1000)
option (recompile)
Where the pattern of evaluation is guaranteed to short-circuit when @param1 or @param2 is set to 1. This pattern is typical of optional filters. Notice that it does not matter whether the @params are tested before or after the other part.
If you are very good with SQL and know for a fact that the query is best forced down a certain plan, you can game SQL Server using CASE statements, which are always evaluated in nested order. Example below will force type='P' to always be evaluated first.
select *
from master..spt_values
where case when type = 'P' then
           case when number < 100 then 1 end
      end = 1
If you don't believe the order of evaluation of the last query, try this
select *
from master..spt_values
where case when type = 'P' then
           case when number < 0 then
                case when 1/0 = 1 then 1 end
           end
      end = 1
Even though the constant expression 1/0=1 would be the cheapest to evaluate, it is NEVER evaluated; otherwise the query would have resulted in a divide-by-zero error instead of returning no rows (there are no rows in master..spt_values matching both conditions).
SQL Server sometimes performs boolean short circuiting, and sometimes does not.
It depends upon the query execution plan that is generated. The execution plan chosen depends on several factors, including the selectivity of the columns in the WHERE clause, table size, available indexes etc.

SQL GROUP BY CASE statement to avoid erroring out

I am trying to write the following case statement in my SELECT list:
CASE
   WHEN SUM(up.[Number_Of_Stops]) = 0 THEN 0
   ELSE SUM(up.[Total_Trip_Time] / up.[Number_Of_Stops])
END
I keep getting divide-by-zero errors though. Avoiding them is the whole point of this case statement. Any other ideas?
You're checking for a different case than the one that's causing the error. Note:
WHEN SUM(up.[Number_Of_Stops]) = 0
Will only be true when all records in the grouping have Number_Of_Stops = 0. When that isn't the case, but some records do have Number_Of_Stops = 0, you'll divide by zero.
Instead, try this:
SUM(CASE
       WHEN up.[Number_Of_Stops] = 0 THEN 0
       ELSE up.[Total_Trip_Time] / up.[Number_Of_Stops]
    END)
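An equivalent spelling, assuming you still want 0 (rather than NULL) for rows with no stops, pushes the guard into NULLIF so the division itself can never fail:

-- NULLIF turns 0 stops into NULL; dividing by NULL yields NULL,
-- which ISNULL maps back to 0 before the SUM
SUM(ISNULL(up.[Total_Trip_Time] / NULLIF(up.[Number_Of_Stops], 0), 0))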
The zero check is based on SUM, an aggregate function, which means it is not executing per row -- which is when the division is occurring.
You're going to have to review the GROUP BY clause, or run the division (and zero check) in a subquery before applying SUM to the result. I.e.:
SELECT SUM(x.result)
FROM (SELECT CASE
                WHEN up.[Number_Of_Stops] > 0 THEN
                   up.[Total_Trip_Time] / up.[Number_Of_Stops]
                ELSE
                   0
             END AS result
      FROM TABLE up) x

Difference between not equal and greater than for a number

I have a query:
select *
from randomtable
where randomnumber <> 0
The column "randomnumber" will never be a negative number.
So, if I write the query as:
select *
from randomtable
where randomnumber > 0
Is there any major difference?
No, there is no difference at all (in your specific situation). All numeric comparisons take the same time.
What's done at the lowest level is that zero is subtracted from randomnumber, and then the result is examined. The > operator looks for a positive non-zero result while the <> operator looks for a non-zero result. Those comparisons are trivial, and take the same amount of time to perform.
If it will never be less than zero, then NO
The important thing is to determine how you know that random_number will never be a negative number. Is there a constraint that guarantees it? If not, what do you want your code to do if a bug somewhere else causes it to be negative?
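If no such guarantee exists yet, a constraint is the usual way to make the assumption explicit (the constraint name here is made up):

-- reject negative values at write time, so > 0 and <> 0 stay equivalent
ALTER TABLE randomtable
  ADD CONSTRAINT chk_randomnumber_nonnegative CHECK (randomnumber >= 0);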
The result set should never be different. The query path might be, though, as the second form might choose an index range scan starting at randomnumber = 0 and reading the records in sequence. As such, the order of the results may differ.
If the order of the results is important, then put in an ORDER BY
What does 'greater than' mean? It means 'not equal' AND 'not less than'. In this case, if the number cannot be less than zero, 'greater than' is equivalent to 'not equal'.