Hive case when clause efficiency - sql

Is there a difference between:
select
case some_calculation()
when 'a' then 1
when 'b' then 2
else 0
end
and
select
case
when some_calculation() ='a' then 1
when some_calculation() ='b' then 2
else 0
end
I assume in the second version, the some_calculation() function will be evaluated twice.
I don't know how to verify that. Any input will be appreciated.

This is not well-documented in Hive. But it is well understood in the standard and in other databases.
In the second version, some_calculation() will definitely be evaluated multiple times. A case expression is evaluated sequentially, so the second when is not evaluated until the first evaluates to false.
In the first version, the value should be evaluated once.
You can get a flavor for the difference by using a volatile function, such as random(). This db<>fiddle illustrates the difference between the two approaches. The first version never returns NULL. The second returns lots of them.
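For example, a sketch along these lines (PostgreSQL-style syntax in the spirit of that fiddle; random() and generate_series() are purely illustrative) makes the difference visible:
select
case (case when random() < 0.5 then 'a' else 'b' end)
when 'a' then 1
when 'b' then 2
end as simple_case,
case
when (case when random() < 0.5 then 'a' else 'b' end) = 'a' then 1
when (case when random() < 0.5 then 'a' else 'b' end) = 'b' then 2
end as searched_case
from generate_series(1, 100)
simple_case always matches one of the WHEN values because its operand is evaluated once; searched_case comes back NULL on roughly a quarter of the rows because each WHEN re-runs the inner expression and can get a different answer each time.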

Related

Is this statement quicker than the previous?

I am running through some old code and wondered, if I changed the logic of this CASE statement:
CASE WHEN ClaimNo.ClaimNo IS NULL THEN '0'
WHEN ClaimNo.ClaimNo = 1 THEN '1'
WHEN ClaimNo.ClaimNo = 2 THEN '2'
WHEN ClaimNo.ClaimNo = 3 THEN '3'
WHEN ClaimNo.ClaimNo = 4 THEN '4'
ELSE '5+'
END AS ClaimNo ,
If I changed it to:
CASE WHEN ClaimNo.ClaimNo >= 5 THEN '5+'
ELSE COALESCE(ClaimNo.ClaimNo,0) END 'ClaimNo' ,
Would the statement technically be quicker? It's obviously a lot shorter as a statement, and it appears it wouldn't need to evaluate as many conditions to obtain the same result.
These are not the same! The case expression returns one type, and in this case you want the type to be a string (because '5+' is a string). However, mixing strings and integers in the THEN/ELSE branches will result in a type conversion error.
Which is faster depends on the distribution of the data. If most of the data consists of 5 or more, then the second method would be faster . . . and work if written as:
(CASE WHEN ClaimNo.ClaimNo >= 5 THEN '5+'
ELSE CAST(COALESCE(ClaimNo.ClaimNo, 0) as VARCHAR(255))
END) as ClaimNo,
In fact, there is only one comparison, so from the perspective of doing the comparisons it will be faster.
The next question is whether the conversion from a number to a string is faster than the multiple comparisons with each value listed separately. Let me be honest: I do not know. And I have been concerned about query performance for a long time.
Why don't I know? Such micro-optimizations generally have essentially no impact in the real world. You should use the version of the logic that works; readability and maintainability are also important. Of course performance is an issue, but the bit-fiddling techniques that are important in other languages often have no place in SQL, which is designed to handle much larger quantities of data, spread across multiple processors and disks.

SQL Case inside WHEN

Maybe this is silly, but can we write a case inside another case's WHEN?
The code below is working for me but I am not sure if this is correct.
SELECT
(SUM(CASE
WHEN (
CASE
WHEN r.status < b.status
THEN r.status
ELSE b.status
END
) = '4'
THEN 1
ELSE 0
END)
) AS WORKED
FROM
tbl1 r, tbl2 b
All the examples on nested cases are like CASE inside a THEN, so I am not sure if this is a good practice. Is there a better way to get the same results?
Yes, you can. MSDN also informs us that in SQL SERVER you can only have a maximum of 10 CASE expressions embedded into each other. Oddly enough, a search for ORACLE turns up nothing about this potential limitation. Probably important to note.
Of course, you can also just use more WHEN clauses (up to 255 in ORACLE), but that only works if you do not need to nest your logic (such as comparing two different columns' values).
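If your database supports LEAST() (Oracle, MySQL and PostgreSQL do, for example), the nested logic above can also be flattened into a sketch like this; note that it assumes neither status is NULL, since LEAST treats NULLs differently from the nested CASE:
SELECT
SUM(CASE WHEN LEAST(r.status, b.status) = '4' THEN 1 ELSE 0 END) AS WORKED
FROM
tbl1 r, tbl2 b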
Sources:
https://msdn.microsoft.com/en-us/library/ms181765.aspx
http://www.techonthenet.com/oracle/functions/case.php

SQL and logical operators and null checks

I've got a vague, possibly cargo-cult memory from years of working with SQL Server that when you've got a possibly-null column, it's not safe to write "WHERE" clause predicates like:
... WHERE the_column IS NULL OR the_column < 10 ...
It had something to do with the fact that SQL rules don't stipulate short-circuiting (and in fact that's kind-of a bad idea possibly for query optimization reasons), and thus the "<" comparison (or whatever) could be evaluated even if the column value is null. Now, exactly why that'd be a terrible thing, I don't know, but I recall being sternly warned by some documentation to always code that as a "CASE" clause:
... WHERE 1 = CASE WHEN the_column IS NULL THEN 1 WHEN the_column < 10 THEN 1 ELSE 0 END ...
(the goofy "1 = " part is because SQL Server doesn't/didn't have first-class booleans, or at least I thought it didn't.)
So my questions here are:
Is that really true for SQL Server (or perhaps back-rev SQL Server 2000 or 2005) or am I just nuts?
If so, does the same caveat apply to PostgreSQL? (8.4 if it matters)
What exactly is the issue? Does it have to do with how indexes work or something?
My grounding in SQL is pretty weak.
I don't know SQL Server so I can't speak to that.
Given an expression a L b for some logical operator L, there is no guarantee that a will be evaluated before or after b or even that both a and b will be evaluated:
Expression Evaluation Rules
The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all.
[...]
Note that this is not the same as the left-to-right "short-circuiting" of Boolean operators that is found in some programming languages.
As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since those clauses are extensively reprocessed as part of developing an execution plan.
As far as an expression of the form:
the_column IS NULL OR the_column < 10
is concerned, there's nothing to worry about since NULL < n is NULL for all n, even NULL < NULL evaluates to NULL; furthermore, NULL isn't true so
null is null or null < 10
is just a complicated way of saying true or null and that's true regardless of which sub-expression is evaluated first.
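A quick sanity check of that claim (PostgreSQL syntax; the literals are purely illustrative):
select
(null < 10) as null_compare, /* yields NULL, not false */
(true or (null < 10)) as whole_or /* yields true */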
The whole "use a CASE" sounds mostly like cargo-cult SQL to me. However, like most cargo-cultism, there is a kernel a truth buried under the cargo; just below my first excerpt from the PostgreSQL manual, you will find this:
When it is essential to force evaluation order, a CASE construct (see Section 9.16) can be used. For example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:
SELECT ... WHERE x > 0 AND y/x > 1.5;
But this is safe:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
So, if you need to guard against a condition that will raise an exception or have other side effects, then you should use a CASE to control the order of evaluation as a CASE is evaluated in order:
Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed. If the condition's result is not true, any subsequent WHEN clauses are examined in the same manner.
So given this:
case when A then Ra
when B then Rb
when C then Rc
...
A is guaranteed to be evaluated before B, B before C, etc. and evaluation stops as soon as one of the conditions evaluates to a true value.
In summary, a CASE short-circuits but neither AND nor OR short-circuits, so you only need to use a CASE when you need to protect against side effects.
Instead of
the_column IS NULL OR the_column < 10
I'd do
isnull(the_column,0) < 10
or for the first example
WHERE 1 = CASE WHEN isnull(the_column,0) < 10 THEN 1 ELSE 0 END ...
I've never heard of such a problem, and this bit of SQL Server 2000 documentation uses WHERE advance < $5000 OR advance IS NULL in an example, so it must not have been a very stern rule. My only concern with OR is that it has lower precedence than AND, so you might accidentally write something like WHERE the_column IS NULL OR the_column < 10 AND the_other_column > 20 when that's not what you mean; but the usual solution is parentheses rather than a big CASE expression.
I think that in most RDBMSes, indices don't include null values, so an index on the_column wouldn't be terribly useful for this query; but even if that weren't the case, I don't see why a big CASE expression would be any more index-friendly.
(Of course, it's hard to prove a negative, and maybe someone else will know what you're referring to?)
Well, I've repeatedly written queries like the first example since about forever (heck, I've written query generators that generate queries like that), and I've never had a problem.
I think you may be remembering some admonishment somebody gave you sometime against writing funky join conditions that use OR. In your first example, the conditions joined by the OR restrict the same one column of the same table, which is OK. If your second condition was a join condition (i.e., it restricted columns from two different tables), then you could get into bad situations where the query planner just has no choice but to use a Cartesian join (bad, bad, bad!!!).
I don't think your CASE function is really doing anything there, except perhaps hamper your query planner's attempts at finding a good execution plan for the query.
But more generally, just write the straightforward query first and see how it performs for realistic data. No need to worry about a problem that might not even exist!
Nulls can be confusing. The "... WHERE 1 = CASE ..." pattern is useful if you are trying to pass either a NULL or a value as a parameter, e.g. "WHERE the_column = @parameter". This post may be helpful: Passing Null using OLEDB.
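For instance, a sketch of that pattern (the @parameter name is just illustrative):
... WHERE 1 = CASE WHEN @parameter IS NULL THEN 1 /* no value supplied: match every row */
WHEN the_column = @parameter THEN 1 /* value supplied: match it */
ELSE 0 END ...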
Another example where CASE is useful is when using date functions on varchar columns. Adding ISDATE before using, say, convert(colA, datetime) might not work, and when colA has non-date data the query can error out.
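A minimal sketch of that idea in T-SQL (colA comes from the sentence above; the table name is hypothetical):
SELECT CASE WHEN ISDATE(colA) = 1 THEN CONVERT(datetime, colA) END AS parsed_date
FROM some_table
Because a CASE is evaluated in order, CONVERT only runs on rows where ISDATE has already returned 1, whereas combining ISDATE and CONVERT in a single WHERE clause gives no such ordering guarantee.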

Selecting COUNT from different criteria on a table

I have a table named 'jobs'. For a particular user a job can be active, archived, overdue, pending, or closed. Right now every page request is generating 5 COUNT queries, and in an attempt at optimization I'm trying to reduce this to a single query. This is what I have so far, but it is barely faster than the 5 individual queries. Note that I've simplified the conditions for each subquery to make it easier to understand; the full query acts the same, however.
Is there a way to get these 5 counts in the same query without using the inefficient subqueries?
SELECT
(SELECT count(*)
FROM "jobs"
WHERE
jobs.creator_id = 5 AND
jobs.status_id NOT IN (8,3,11) /* 8,3,11 being 'inactive' related statuses */
) AS active_count,
(SELECT count(*)
FROM "jobs"
WHERE
jobs.creator_id = 5 AND
jobs.due_date < '2011-06-14' AND
jobs.status_id NOT IN(8,11,5,3) /* Grabs the overdue active jobs
('5' means completed successfully) */
) AS overdue_count,
(SELECT count(*)
FROM "jobs"
WHERE
jobs.creator_id = 5 AND
jobs.due_date BETWEEN '2011-06-14' AND '2011-06-15 06:00:00.000000'
) AS due_today_count
This goes on for 2 more subqueries but I think you get the idea.
Is there an easier way to collect this data since it's basically 5 different COUNT's off of the same subset of data from the jobs table?
The subset of data is 'creator_id = 5', after that each count is basically just 1-2 additional conditions. Note that right now we're using Postgres but may be moving to MySQL in the near future. So if you can provide an ANSI-compatible solution I'd be grateful :)
This is the typical solution. Use a case statement to break out the different conditions; if a record meets a condition it gets a 1, else a 0. Then do a SUM on the values.
SELECT
SUM(active_count) active_count,
SUM(overdue_count) overdue_count,
SUM(due_today_count) due_today_count
FROM
(
SELECT
CASE WHEN jobs.status_id NOT IN (8,3,11) THEN 1 ELSE 0 END active_count,
CASE WHEN jobs.due_date < '2011-06-14' AND jobs.status_id NOT IN(8,11,5,3) THEN 1 ELSE 0 END overdue_count,
CASE WHEN jobs.due_date BETWEEN '2011-06-14' AND '2011-06-15 06:00:00.000000' THEN 1 ELSE 0 END due_today_count
FROM "jobs"
WHERE
jobs.creator_id = 5 ) t
UPDATE
As noted, when 0 records are returned as t, this results in a single row of NULLs for all the values. You have three options:
1) Add a HAVING clause so that you get no rows returned rather than a single row of all NULLs
HAVING SUM(active_count) is not null
2) If you want all zeros returned then you could add COALESCE to all your sums
For example
SELECT
COALESCE(SUM(active_count), 0) active_count,
COALESCE(SUM(overdue_count), 0) overdue_count,
COALESCE(SUM(due_today_count), 0) due_today_count
3) Take advantage of the fact that COUNT ignores NULLs (so a NULL contributes 0 to the count), as sbarro demonstrated. You should note that the non-null value could be anything; it doesn't have to be a 1.
For example:
SELECT
COUNT(CASE WHEN
jobs.status_id NOT IN (8,3,11) THEN 'Manticores Rock' ELSE NULL
END) as [active_count]
I would use this approach: COUNT in combination with CASE WHEN.
SELECT
COUNT(CASE WHEN
jobs.status_id NOT IN (8,3,11) THEN 1
END) as [Count1],
COUNT(CASE WHEN
jobs.due_date < '2011-06-14'
AND jobs.status_id NOT IN(8,11,5,3) THEN 1
END) as [COUNT2],
COUNT(CASE WHEN
jobs.due_date BETWEEN '2011-06-14' AND '2011-06-15 06:00:00.000000' THEN 1
END) as [COUNT3]
FROM
"jobs"
WHERE
jobs.creator_id = 5
Brief
SQL Server 2012 introduced the IIF logical function. Using SQL Server 2012 or greater, you can now use this function instead of a CASE expression. The IIF function also works with Azure SQL Database (but at the moment it does not work with Azure SQL Data Warehouse or Parallel Data Warehouse). It's shorthand for the CASE expression.
I find myself using the IIF function rather than the CASE expression when there is only one case. This alleviates the pain of having to write CASE WHEN condition THEN x ELSE y END, letting you write IIF(condition, x, y) instead. If multiple conditions may be met (multiple WHENs), you should instead consider using the regular CASE expression rather than nested IIF functions.
Returns one of two values, depending on whether the Boolean expression
evaluates to true or false in SQL Server.
Syntax
IIF ( boolean_expression, true_value, false_value )
Arguments
boolean_expression A valid Boolean expression.
If this argument is not a Boolean expression, then a syntax error is
raised.
true_value Value to return if boolean_expression evaluates to
true.
false_value Value to return if boolean_expression evaluates
to false.
Remarks
IIF is a shorthand way for writing a CASE expression. It evaluates
the Boolean expression passed as the first argument, and then returns
either of the other two arguments based on the result of the
evaluation. That is, the true_value is returned if the Boolean
expression is true, and the false_value is returned if the Boolean
expression is false or unknown. true_value and false_value can be
of any type. The same rules that apply to the CASE expression for
Boolean expressions, null handling, and return types also apply to
IIF. For more information, see CASE (Transact-SQL).
The fact that IIF is translated into CASE also has an impact on
other aspects of the behavior of this function. Since CASE
expressions can be nested only up to the level of 10, IIF statements
can also be nested only up to the maximum level of 10. Also, IIF is
remoted to other servers as a semantically equivalent CASE
expression, with all the behaviors of a remoted CASE expression.
Code
Implementation of the IIF function in SQL would resemble the following (using the same logic presented by @rsbarro in his answer):
SELECT
COUNT(
IIF(jobs.status_id NOT IN (8,3,11), 1, NULL)
) as active_count,
COUNT(
IIF(jobs.due_date < '2011-06-14' AND jobs.status_id NOT IN(8,11,5,3), 1, NULL)
) as overdue_count,
COUNT(
IIF(jobs.due_date BETWEEN '2011-06-14' AND '2011-06-15 06:00:00.000000', 1, NULL)
) as due_today_count
FROM
"jobs"
WHERE
jobs.creator_id = 5

Netezza does not do lazy evaluation of case statements?

I'm performing a computation which might contain division by 0, in which case I want the result to be an arbitrary value (55). To my surprise, wrapping the computation with a case statement did not do the job!
select case when 1=0 then 3/0 else 55 end
ERROR HY000: Divide by 0
Why is that? Is there another workaround?
OK, I was being inaccurate. This is the exact query that fails with "divide by 0":
select case when min(baba) = 0 then 55 else sum(1/baba) end from t group by baba
This looks like a lazy evaluation failure in Netezza: notice that I group by baba, so whenever baba is 0 it also means that min(baba) is 0, and the evaluation should have stopped gracefully without ever getting to the 1/baba term and failing on division by 0. Right? Well, no.
What I guess is the gotcha here, and the reason for the failure, is that Netezza evaluates the row-level terms before it can evaluate the aggregate terms. So it must evaluate 1/baba and baba at every row, and only then can it evaluate the aggregate terms min(baba) and sum(1/baba).
So, the workaround (for me) was: select case when min(baba) = 0 then 55 else 1/min(baba) end from t group by baba, which has the same meaning.
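Another possible workaround, assuming NULLIF is available in your Netezza version (it is a standard SQL function), is to make the row-level division itself incapable of raising and let the CASE supply the arbitrary value:
select case when min(baba) = 0 then 55 else sum(1 / nullif(baba, 0)) end from t group by baba
nullif(baba, 0) turns the zero into NULL, so 1 / NULL evaluates to NULL instead of erroring at the row level; the groups where baba is 0 fall into the min(baba) = 0 branch anyway.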