Is this statement quicker than the previous? - sql

I am running through some old code and wondered about changing the logic of this CASE statement:
CASE WHEN ClaimNo.ClaimNo IS NULL THEN '0'
WHEN ClaimNo.ClaimNo = 1 THEN '1'
WHEN ClaimNo.ClaimNo = 2 THEN '2'
WHEN ClaimNo.ClaimNo = 3 THEN '3'
WHEN ClaimNo.ClaimNo = 4 THEN '4'
ELSE '5+'
END AS ClaimNo ,
If I changed it to:
CASE WHEN ClaimNo.ClaimNo >= 5 THEN '5+'
ELSE COALESCE(ClaimNo.ClaimNo,0) END 'ClaimNo' ,
Would the statement technically be quicker? It's obviously a lot shorter as a statement, and it appears it wouldn't have to evaluate as many comparisons to obtain the same result.

These are not the same! A CASE expression returns a single type, and in this case you want that type to be a string (because '5+' is a string). Mixing strings and integers across the THEN branches will result in a type conversion error.
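A minimal sketch of the failure in T-SQL (the derived table is just illustrative data):
-- The CASE's result type is INT (INT outranks VARCHAR in type precedence),
-- so returning '5+' forces a conversion that fails at runtime:
SELECT CASE WHEN n >= 5 THEN '5+' ELSE COALESCE(n, 0) END AS ClaimNo
FROM (VALUES (3), (7)) AS t(n);
-- Fails with something like: "Conversion failed when converting the
-- varchar value '5+' to data type int."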
Which is faster depends on the distribution of the data. If most of the rows have a value of 5 or more, then the second method would be faster, and it would work if written as:
(CASE WHEN ClaimNo.ClaimNo >= 5 THEN '5+'
ELSE CAST(COALESCE(ClaimNo.ClaimNo, 0) as VARCHAR(255))
END) as ClaimNo,
In fact, there is only one comparison, so from the perspective of doing the comparisons it will be faster.
The next question is whether the conversion from a number to a string is faster than the multiple comparisons with each value listed separately. Let me be honest: I do not know. And I have been concerned about query performance for a long time.
Why don't I know? Such micro-optimizations generally have basically no impact in the real world. You should use the version of the logic that works; readability and maintainability are also important. Of course performance is an issue, but the bit-fiddling techniques that are important in other languages often have no place in SQL, which is designed to handle much larger quantities of data spread across multiple processors and disks.

Related

SQL Case inside WHEN

Maybe this is silly, but can we write a CASE inside another CASE's WHEN?
The code below works for me, but I am not sure if it is correct.
SELECT
(SUM(CASE
WHEN (
CASE
WHEN r.status < b.status
THEN r.status
ELSE b.status
END
) = '4'
THEN 1
ELSE 0
END)
) AS WORKED
FROM
tbl1 r, tbl2 b
All the examples of nested CASEs I can find put the CASE inside a THEN, so I am not sure if this is good practice. Is there a better way to get the same results?
Yes you can. MSDN also informs us that in SQL Server you can only have a maximum of 10 CASE expressions embedded into each other. Oddly enough, a similar search for Oracle turns up nothing about this potential limitation. Probably important to note.
Of course, you can also just use more WHEN clauses (up to 255 in Oracle), but that only works if you do not need to nest your logic (such as when comparing two different columns' values).
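For what it's worth, this particular nesting can also be flattened, since the inner CASE just picks the smaller of the two statuses. A sketch of an equivalent single CASE (same tables and columns as the question, and assuming the status columns are NOT NULL; with NULLs the two forms differ slightly):
SELECT
    SUM(CASE
            -- "the smaller of r.status and b.status equals '4'",
            -- spelled out without the inner CASE:
            WHEN r.status < b.status AND r.status = '4' THEN 1
            WHEN r.status >= b.status AND b.status = '4' THEN 1
            ELSE 0
        END) AS WORKED
FROM
    tbl1 r, tbl2 b
In Oracle (and recent SQL Server versions that have it), LEAST(r.status, b.status) = '4' expresses the same idea even more compactly, again assuming no NULLs.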
Sources:
https://msdn.microsoft.com/en-us/library/ms181765.aspx
http://www.techonthenet.com/oracle/functions/case.php

Combining output rows by date

I have the following problem in PL/SQL.
I need to retrieve data from the table for various parameters over a certain time period, but the output contains duplicate rows per date: one row for each parameter instead of one combined row. Could I please borrow your genius for this issue?
Here is my code (part of it, as it repeats the same for other parameters I need to deliver):
select /*+FULL(k)*/ k.date_n,
SUM(decode(bucket_flag_n,
'1',
(DECODE(type_s,
'MOC',
decode(on_off_net_s, 'On net', duration_sum),
'MOC_4',
decode(on_off_net_s, 'On net', duration_sum),
'MOC CF_4',
decode(on_off_net_s, 'On net', duration_sum),0)))) test1,
SUM(decode(bucket_flag_n,
'0',
(DECODE(type_s,
'MOC',
decode(on_off_net_s, 'Off net', duration_sum),
'MOC_4',
decode(on_off_net_s, 'Off net', duration_sum),
'MOC CF_4',
decode(on_off_net_s, 'Off net', duration_sum),0)))) test2
from (select /*+FULL(a)*/
a.d_timestamp date_n,
a.service_s type_s,
a.country_s,
a.on_off_net_s,
a.bucket_flag_n,
round(SUM(a.duration_n / 60)) duration_sum, --minutes rounded
SUM(a.count_n) sms_count, -- sms count
round(SUM(a.volume_n / 1024 / 1024)) volume_sum -- volume mb rounded
from database a, database2 b
where a.country_s = 'Country'
and a.free_of_charge_flag_n = '1'
and a.d_timestamp between b.date_from and b.date_to
group by a.d_timestamp,
a.service_s,
a.country_s,
a.on_off_net_s,
a.bucket_flag_n) k
group by k.date_n, bucket_flag_n
order by 1
Here is what I get on the output:
Thank you in advance!
Your group by clause is:
group by k.date_n, bucket_flag_n
If you only want one row per date, then change it to:
group by k.date_n
I would also suggest that you learn modern join syntax ("never use commas in the from clause") and replace decode() with case. However, those are syntactic conventions and don't affect the results from the query.
There are several strange things going on here.
First off, you say:
Here is my code (part of it, as it repeats the same for other
parameters I need to deliver):
Which implies that all aggregate, non-grouped columns have the DECODE(...) containing 'MOC', 'MOC_4', and 'MOC CF_4' - if so, you can make those part of the WHERE clause instead, which may actually speed up your query (assuming service_s has other codes not used in the query, and relevant indices exist).
The next thing is, you're using an inclusive upper bound (<=, which is what BETWEEN uses) with what appears to be a timestamp. This will give you wrong results - often, midnight of the next day is incorrectly included, although there are other possibilities too. When dealing with continuous-range types, you must use an exclusive upper bound (<) or suffer the consequences: this is an inherent property of such ranges, and has nothing to do with implementation in a computer or any specific application. (I also find the names somewhat poor, especially as d_timestamp doesn't really tell me anything about what it represents.)
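A hedged illustration (the table and columns come from the question; the ANSI date literals are placeholders):
-- BETWEEN is inclusive at both ends, so a row stamped exactly at
-- midnight of the upper bound is counted in this range AND the next:
SELECT COUNT(*) FROM database a
WHERE a.d_timestamp BETWEEN DATE '2015-01-01' AND DATE '2015-02-01';
-- Half-open interval: every timestamp lands in exactly one range.
SELECT COUNT(*) FROM database a
WHERE a.d_timestamp >= DATE '2015-01-01'
  AND a.d_timestamp <  DATE '2015-02-01';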
Math and rounding issues:
Assuming duration_n, count_n, and volume_n (...what does _n stand for? Why the suffix?) are INTEGER types, ROUND(...) is unnecessary on engines with integer division, as all the math performed will return non-fractional amounts in the first place. The distributive property of division over addition can potentially be exploited - you can rewrite SUM(a.duration_n / 60) as SUM(a.duration_n) / 60 (performance gains, if any, would be low) - however, if the column is an INTEGER type, the two forms will give different results (which one is correct is up to you - actually, given computer limitations they can give different answers no matter what the type is, but the difference is most pronounced with an integral type).
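A quick sketch of the per-row versus after-sum difference (SQL Server syntax, where INT / INT truncates; in Oracle, NUMBER division returns fractions, so there the difference comes from where you put ROUND):
-- Three 30-second rows: dividing per row truncates each 30/60 to 0.
SELECT SUM(duration_n / 60) AS per_row,    -- 0 + 0 + 0 = 0
       SUM(duration_n) / 60 AS after_sum   -- 90 / 60 = 1
FROM (VALUES (30), (30), (30)) AS t(duration_n);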
Given some of the mentioned assumptions (namely, that all aggregate columns have the same DECODE(...)), we can simplify the query somewhat:
SELECT A.d_timestamp AS date_n,
SUM(CASE WHEN A.bucket_flag_n = '1' AND A.on_off_net_s = 'On net'
THEN A.duration_n END) / 60 AS test1,
SUM(CASE WHEN A.bucket_flag_n = '0' AND A.on_off_net_s = 'Off net'
THEN A.duration_n END) / 60 AS test2
FROM Database A
JOIN Database2 B
ON A.d_timestamp >= B.date_from
AND A.d_timestamp < B.date_to
WHERE A.country_s = 'Country'
AND A.free_of_charge_flag_n = '1'
AND A.service_s IN ('MOC', 'MOC_4', 'MOC CF_4')
AND ((bucket_flag_n = '1' AND on_off_net_s = 'On net')
OR (bucket_flag_n = '0' AND on_off_net_s = 'Off net'))
GROUP BY A.d_timestamp
ORDER BY A.d_timestamp
... adding the remaining aggregate columns is left as an exercise to the reader.
A couple of notes: if the relationship between bucket_flag_n and on_off_net_s is as indicated in all cases, you can actually remove those conditions from the WHERE clause; if you are bucketing other combinations, you may have to keep them anyway. I'm also suspicious of the usefulness of grouping by something that claims to be a timestamp, as these are usually too high-resolution to form useful groups in aggregation (i.e., each value tends to end up on its own row). If the value is actually a date, you have a different problem...
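If date_n really is a timestamp and one row per calendar day is the goal, a hedged Oracle-style sketch would truncate before grouping (names again taken from the question):
-- TRUNC strips the time-of-day portion, so all rows from the same
-- calendar day fall into one group:
SELECT TRUNC(a.d_timestamp) AS date_n,
       SUM(a.duration_n) AS duration_sum
FROM database a
GROUP BY TRUNC(a.d_timestamp)
ORDER BY 1;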

sql server group by with an alias

I am new to sql server, transitioning from mysql.
I have a complicated CASE statement that I would like to group on - currently 6 WHENs and an ELSE, and likely to get larger. To be able to run the query, I need to copy the whole statement into the GROUP BY each time there is a modification. In MySQL I would just group by the column number. Is there any workaround for this in SQL Server? It is making the code very ugly.
Will there be a performance penalty in creating a subquery for my CASE and then just grouping on the result field? It seems like trying to make the code more elegant will cause the query to use more resources.
Thanks
Below is a field I am grouping on. Each time I modify the field for more edge cases, I need to change code in up to 3 places. Makes for some very ugly code, and I need no extra help doing that myself.
dz_code = case
when isnull(dz.dz_code,'N/A') in ('GAB', 'MAB', 'N/A') and dc.howdidyouhear = 'Television' then 'Television'
when isnull(dz.dz_code,'N/A') in ('GAB', 'MAB', 'N/A') and dc.howdidyouhear in ('Other', 'N/A') then 'Other'
WHEN dz.dz_code = 'irs,irs' THEN 'irs'
when dz.dz_code like '%SDE%' THEN 'SDE'
when dz.dz_code like 'referral,' then REPLACE(dz.dz_code, 'referral','')
when charindex(',',dz.dz_code) = 4 then left(dz.dz_code,3)
else
dz.dz_code
END,
Maybe you can wrap the query in a subquery and use the alias in the select and the group by. It looks a little bulky in this example, but if you've got more complex CASE switches, or more than one of them, then this solution will probably be much smaller and more readable.
select
CaseField
from
(select
case when 1 = 2 then
3
else 4 end as CaseField
from
YourTable t) c
group by
CaseField
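Applied to the dz_code expression from the question, a sketch might look like the following (the question doesn't show the FROM/JOIN between dz and dc, so that part is a placeholder, as is the row_count column):
SELECT x.dz_code,
       COUNT(*) AS row_count  -- row_count is illustrative
FROM
    (SELECT CASE
                WHEN ISNULL(dz.dz_code,'N/A') IN ('GAB', 'MAB', 'N/A') AND dc.howdidyouhear = 'Television' THEN 'Television'
                WHEN ISNULL(dz.dz_code,'N/A') IN ('GAB', 'MAB', 'N/A') AND dc.howdidyouhear IN ('Other', 'N/A') THEN 'Other'
                WHEN dz.dz_code = 'irs,irs' THEN 'irs'
                WHEN dz.dz_code LIKE '%SDE%' THEN 'SDE'
                WHEN dz.dz_code LIKE 'referral,' THEN REPLACE(dz.dz_code, 'referral','')
                WHEN CHARINDEX(',', dz.dz_code) = 4 THEN LEFT(dz.dz_code, 3)
                ELSE dz.dz_code
            END AS dz_code
     FROM dz
     JOIN dc ON dz.id = dc.id  -- placeholder join condition
    ) x
GROUP BY x.dz_code
The CASE expression now lives in exactly one place; adding an edge case means editing only the inner query.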

SQL and logical operators and null checks

I've got a vague, possibly cargo-cult memory from years of working with SQL Server that when you've got a possibly-null column, it's not safe to write "WHERE" clause predicates like:
... WHERE the_column IS NULL OR the_column < 10 ...
It had something to do with the fact that SQL rules don't stipulate short-circuiting (and in fact mandating it would arguably be a bad idea for query optimization reasons), and thus the "<" comparison (or whatever) could be evaluated even if the column value is null. Now, exactly why that'd be a terrible thing, I don't know, but I recall being sternly warned by some documentation to always code that as a "CASE" clause:
... WHERE 1 = CASE WHEN the_column IS NULL THEN 1 WHEN the_column < 10 THEN 1 ELSE 0 END ...
(the goofy "1 = " part is because SQL Server doesn't/didn't have first-class booleans, or at least I thought it didn't.)
So my questions here are:
Is that really true for SQL Server (or perhaps back-rev SQL Server 2000 or 2005) or am I just nuts?
If so, does the same caveat apply to PostgreSQL? (8.4 if it matters)
What exactly is the issue? Does it have to do with how indexes work or something?
My grounding in SQL is pretty weak.
I don't know SQL Server so I can't speak to that.
Given an expression a L b for some logical operator L, there is no guarantee that a will be evaluated before or after b or even that both a and b will be evaluated:
Expression Evaluation Rules
The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all.
[...]
Note that this is not the same as the left-to-right "short-circuiting" of Boolean operators that is found in some programming languages.
As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since those clauses are extensively reprocessed as part of developing an execution plan.
As far as an expression of the form:
the_column IS NULL OR the_column < 10
is concerned, there's nothing to worry about since NULL < n is NULL for all n, even NULL < NULL evaluates to NULL; furthermore, NULL isn't true so
null is null or null < 10
is just a complicated way of saying true or null and that's true regardless of which sub-expression is evaluated first.
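A quick way to see this in PostgreSQL (the casts just give the NULLs a concrete type):
SELECT (NULL::int < 10) IS NULL AS comparison_yields_null,  -- true: NULL < 10 is NULL
       (TRUE OR NULL::boolean) AS or_result;                -- true: OR needs only one true input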
The whole "use a CASE" thing sounds mostly like cargo-cult SQL to me. However, like most cargo-cultism, there is a kernel of truth buried under the cargo; just below my first excerpt from the PostgreSQL manual, you will find this:
When it is essential to force evaluation order, a CASE construct (see Section 9.16) can be used. For example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:
SELECT ... WHERE x > 0 AND y/x > 1.5;
But this is safe:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
So, if you need to guard against a condition that will raise an exception or have other side effects, then you should use a CASE to control the order of evaluation as a CASE is evaluated in order:
Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed. If the condition's result is not true, any subsequent WHEN clauses are examined in the same manner.
So given this:
case when A then Ra
when B then Rb
when C then Rc
...
A is guaranteed to be evaluated before B, B before C, etc. and evaluation stops as soon as one of the conditions evaluates to a true value.
In summary, a CASE short-circuits but neither AND nor OR does, so you only need to use a CASE when you need to protect against side effects.
Instead of
the_column IS NULL OR the_column < 10
I'd do
isnull(the_column,0) < 10
or for the first example
WHERE 1 = CASE WHEN isnull(the_column,0) < 10 THEN 1 ELSE 0 END ...
I've never heard of such a problem, and this bit of SQL Server 2000 documentation uses WHERE advance < $5000 OR advance IS NULL in an example, so it must not have been a very stern rule. My only concern with OR is that it has lower precedence than AND, so you might accidentally write something like WHERE the_column IS NULL OR the_column < 10 AND the_other_column > 20 when that's not what you mean; but the usual solution is parentheses rather than a big CASE expression.
I think that in most RDBMSes, indices don't include null values, so an index on the_column wouldn't be terribly useful for this query; but even if that weren't the case, I don't see why a big CASE expression would be any more index-friendly.
(Of course, it's hard to prove a negative, and maybe someone else will know what you're referring to?)
Well, I've repeatedly written queries like the first example since about forever (heck, I've written query generators that generate queries like that), and I've never had a problem.
I think you may be remembering some admonishment somebody gave you sometime against writing funky join conditions that use OR. In your first example, the conditions joined by the OR restrict the same one column of the same table, which is OK. If your second condition was a join condition (i.e., it restricted columns from two different tables), then you could get into bad situations where the query planner just has no choice but to use a Cartesian join (bad, bad, bad!!!).
I don't think your CASE function is really doing anything there, except perhaps hamper your query planner's attempts at finding a good execution plan for the query.
But more generally, just write the straightforward query first and see how it performs for realistic data. No need to worry about a problem that might not even exist!
Nulls can be confusing. The "... WHERE 1 = CASE ..." construct is useful if you are trying to pass either a NULL or a value as a parameter, e.g. WHERE the_column = #parameter. This post may be helpful: Passing Null using OLEDB.
Another example where CASE is useful is when using date functions on varchar columns. Adding an ISDATE test alongside, say, CONVERT(datetime, colA) might not be enough on its own, and when colA has non-date data the query can error out.
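A sketch of that guard in T-SQL (SomeTable stands in for the real table; colA is the varchar column being discussed):
-- CASE evaluates its WHEN conditions in order, so CONVERT only runs on
-- rows where ISDATE has already vouched for the value:
SELECT CASE WHEN ISDATE(colA) = 1
            THEN CONVERT(datetime, colA)
            ELSE NULL
       END AS parsed_date
FROM SomeTable;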

Which is faster: Sum(Case When) Or Group By/Count(*)?

I can write
Select
Sum(Case When Resposta.Tecla = 1 Then 1 Else 0 End) Valor1,
Sum(Case When Resposta.Tecla = 2 Then 1 Else 0 End) Valor2,
Sum(Case When Resposta.Tecla = 3 Then 1 Else 0 End) Valor3,
Sum(Case When Resposta.Tecla = 4 Then 1 Else 0 End) Valor4,
Sum(Case When Resposta.Tecla = 5 Then 1 Else 0 End) Valor5
From Resposta
Or
Select
Count(*)
From Resposta Group By Tecla
I tried this over a large number of rows and it seems to take about the same time.
Can anyone confirm this?
I believe the GROUP BY is better because it needs no value-specific handling.
It can be optimized by the database engine.
I think the results may depend on the database engine you use.
Maybe the one you are using optimizes the first query, understanding that it is like a GROUP BY!
You can try the EXPLAIN / EXPLAIN PLAN command to see how the engine is computing your queries, but with my Microsoft SQL Server 2008 I just see a swap between two operations ("Compute Scalar" and "Aggregate").
I tried such queries on a database table:
SQL Server 2008
163,000 rows in the table
12 categories (Valor1 -> Valor12)
The results were quite different:
Group By: 2 seconds
Case When: 6 seconds!
So my choice is GROUP BY.
Another benefit is that the query is simpler to write!
What the DB does internally with the second query is practically the same as what you explicitly tell it to do with the first. There should be no difference in the execution plan and thus in the time the query takes. Taking this into account, clearly using the second query is better:
it's much more flexible: when there are more values of Tecla you don't need to change your query
it's easier to understand: if you have a lot of values for Tecla, it's harder to read the first query and realize it just counts rows per value
it's smaller: you're sending less information to the DB server, and it will probably parse the query faster, which is the only performance difference I see in these queries. This makes a difference, albeit a small one
Either one is going to have to read all rows from Resposta, so for any reasonably sized table, I'd expect the I/O cost to dominate - giving approximately the same overall runtime.
I'd generally use:
Select
Tecla,
Count(*)
From Resposta
Group By Tecla
If there's a reasonable chance that the range of Tecla values will change in the future.
In my opinion the GROUP BY statement will always be faster than SUM(CASE WHEN ...), because in your example the SUM version performs 5 different calculations, while with GROUP BY the DB will simply sort and count.
Imagine you have a bag of different coins and you need to know how many of each type of coin you have. You could do it these ways:
The SUM(CASE WHEN ...) way would be to compare each coin with predefined sample coins and do the math for each sample (add 1 or 0);
The GROUP BY way would be to sort the coins by type and then count each group.
Which method would you prefer?
To fairly compete with the single COUNT(*), your first SQL should probably be:
Select
Sum(Case When Resposta.Tecla >= 1 AND Resposta.Tecla <=5 Then 1 Else 0 End) Valor
From Resposta
And to answer your question: I'm not noticing any difference in speed between SUM(CASE WHEN) and COUNT. I'm querying over 250,000 rows in PostgreSQL.