Why use the BETWEEN operator when we can do without it?

Looking at the two queries below, we find that they both work. So I am confused: why should we ever use BETWEEN, especially when I have found (on W3Schools) that BETWEEN can behave differently in different databases?
SELECT *
FROM employees
WHERE salary BETWEEN 5000 AND 15000;
SELECT *
FROM employees
WHERE salary >= 5000
AND salary <= 15000;

BETWEEN can help to avoid unnecessary reevaluation of the expression:
SELECT AVG(RAND(20091225) BETWEEN 0.2 AND 0.4)
FROM t_source;
---
0.1998
SELECT AVG(RAND(20091225) >= 0.2 AND RAND(20091225) <= 0.4)
FROM t_source;
---
0.3199
t_source is just a dummy table with 1,000,000 records.
Of course this can be worked around using a subquery, but in MySQL it's less efficient.
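A sketch of that subquery workaround, assuming MySQL as above (the derived-table alias q is made up): the random value is computed once per row in the inner query, so plain >= and <= comparisons in the outer query see a single consistent value.
SELECT AVG(r >= 0.2 AND r <= 0.4)
FROM (
    SELECT RAND(20091225) AS r    -- evaluated once per row
    FROM t_source
) q;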
And of course, BETWEEN is more readable. Using it three times in a query is enough to remember the syntax forever.
In SQL Server and MySQL, LIKE against a constant with a non-leading '%' is also shorthand for a pair of >= and <:
SET SHOWPLAN_TEXT ON
GO
SELECT *
FROM master
WHERE name LIKE 'string%'
GO
SET SHOWPLAN_TEXT OFF
GO
|--Index Seek(OBJECT:([test].[dbo].[master].[ix_name_desc]), SEEK:([test].[dbo].[master].[name] < 'strinH' AND [test].[dbo].[master].[name] >= 'string'), WHERE:([test].[dbo].[master].[name] like 'string%') ORDERED FORWARD)
However, LIKE syntax is more legible.

Using BETWEEN has extra merit when the expression being compared is a complex calculation rather than just a simple column: it saves writing out that complex expression twice.
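For instance (a hypothetical T-SQL tenure filter; the hire_date column is made up for illustration):
-- with BETWEEN, the expression appears once
WHERE DATEDIFF(day, hire_date, GETDATE()) BETWEEN 30 AND 90
-- without it, the same expression has to be written twice
WHERE DATEDIFF(day, hire_date, GETDATE()) >= 30
  AND DATEDIFF(day, hire_date, GETDATE()) <= 90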

BETWEEN in T-SQL supports the NOT operator, so you can use constructions like
WHERE salary not between 5000 AND 15000;
In my opinion that's clearer to a human reader than
WHERE salary < 5000 OR salary > 15000;
And finally, if you type the column name only once, there is less chance of making a mistake.

The version with "between" is easier to read. If I were to use the second version I'd probably write it as
5000 <= salary and salary <= 15000
for the same reason.

I'm with @Quassnoi - correctness is a big win.
I usually find keyword operators like BETWEEN more useful than symbols such as <, <=, >, >=, and !=. Yes, we need (better, accurate) results, and keywords remove the risk of visually misreading or reversing a symbol. If you use <= and notice logically incorrect output coming from your SELECT query, you may wander around for some time before concluding that you wrote <= in place of >= (a visual misreading). I hope I am clear.
And aren't we also shortening the code (while making it read at a higher level), which makes it more concise and easier to maintain?
SELECT *
FROM employees
WHERE salary BETWEEN 5000 AND 15000;
SELECT *
FROM employees
WHERE salary >= 5000 AND salary <= 15000;
The first query uses only 10 words while the second uses 12!

Personally, I wouldn't use BETWEEN, simply because there seems to be no clear definition of whether it should include or exclude the values that bound the condition. In your given example:
SELECT *
FROM employees
WHERE salary BETWEEN 5000 AND 15000;
The range could include the 5000 and 15000, or it could exclude them.
Syntactically, I think it should exclude them, since the values themselves are not strictly between the given numbers. But that is just my opinion, whereas an operator such as >= is very specific, and less likely to change between databases or between increments/versions of the same one.
Edited in response to Pavel and Jonathan's comments.
As noted by Pavel, ANSI SQL (http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), as far back as 1992, mandates that the endpoints be considered within the returned data, i.e. equivalent to X >= lower_bound AND X <= upper_bound:
8.3 <between predicate>
Function
Specify a range comparison.
Format
<between predicate> ::=
    <row value constructor> [ NOT ] BETWEEN
    <row value constructor> AND <row value constructor>
Syntax Rules
1) The three <row value constructor>s shall be of the same degree.
2) Let respective values be values with the same ordinal position in the two <row value constructor>s.
3) The data types of the respective values of the three <row value constructor>s shall be comparable.
4) Let X, Y, and Z be the first, second, and third <row value constructor>s, respectively.
5) "X NOT BETWEEN Y AND Z" is equivalent to "NOT ( X BETWEEN Y AND Z )".
6) "X BETWEEN Y AND Z" is equivalent to "X>=Y AND X<=Z".

If the endpoints are inclusive, then BETWEEN is the preferred syntax.
Fewer references to a column means fewer spots to update when things change. It's the engineering principle that fewer parts means less stuff can break.
It also means less possibility of someone putting a bracket in the wrong place when adding something like an OR. E.g.:
WHERE salary BETWEEN 5000 AND (15000
OR ...)
...you'll get an error if you put the bracket around the AND part of a BETWEEN statement. Versus:
WHERE salary >= 5000
AND (salary <= 15000
OR ...)
...you'd only know there's a problem when someone reviews the data returned from the query.

Semantically, the two expressions have the same result.
However, BETWEEN is a single predicate, instead of two comparison predicates combined with AND. Depending on the optimizer provided by your RDBMS, a single predicate may be easier to optimize than two predicates.
Although I expect most modern RDBMS implementations optimize the two expressions identically.

It's worse when the value being tested is itself a subquery:
SELECT id FROM entries
WHERE
(SELECT COUNT(id) FROM anothertable LEFT JOIN something ON ... WHERE something)
BETWEEN entries.min AND entries.max;
Try rewriting this one with your syntax without using temporary storage.
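A sketch of what that rewrite forces on you: without BETWEEN, the scalar subquery has to be spelled out (and, depending on the optimizer, evaluated) twice. Names and elisions mirror the example above:
SELECT id FROM entries
WHERE (SELECT COUNT(id) FROM anothertable LEFT JOIN something ON ... WHERE something) >= entries.min
  AND (SELECT COUNT(id) FROM anothertable LEFT JOIN something ON ... WHERE something) <= entries.max;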

I'd rather use the second one, since you always know immediately whether it's <= or <.

In SQL, I agree that BETWEEN is mostly unnecessary and can be emulated syntactically with 5000 <= salary AND salary <= 15000. It is also limited; I often want to apply an inclusive lower bound and an exclusive upper bound: #start <= when AND when < #end, which you can't do with BETWEEN.
OTOH, BETWEEN is convenient when the value being tested is the result of a complex expression.
It would be nice if SQL and other languages would follow Python's lead in using proper mathematical notation: 5000 <= salary <= 15000.
One small tip that will make your code more readable: use < and <= in preference to > and >=.
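For instance, writing both bounds with < / <= keeps the range reading left-to-right along the number line (an illustrative half-open range over the salary column from the question):
WHERE 5000 <= salary AND salary < 15000   -- inclusive lower bound, exclusive upper bound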


How to limit the character amount you are selecting from

I'm trying to average a field, which is very simple to do, but there are problems with some values. I know some values are way too big, and I was hoping to exclude them by their number of characters (probably 4 characters max).
I'm not familiar with a SQL clause that could do this. If there is one, that would be great.
select avg(convert(float,duration)) as averageduration
from AsteriskCalls where ISNUMERIC(duration) = 1
I expect the output to be around 500-1000, but it comes up as an 8-digit number.
That is easy enough:
select avg(convert(float,duration)) as averageduration
from AsteriskCalls
where ISNUMERIC(duration) = 1 and len(duration) <= 4;
This will not necessarily work, of course, because you could have '1E30', which would be a pretty big number. And it would miss '0.001', which is a pretty small number.
The more accurate method uses try_convert():
select avg(try_convert(float, duration)) as averageduration
from AsteriskCalls
where try_convert(float, duration) <= 1000.0
And that should probably really be:
where abs(try_convert(float, duration)) <= 1000.0
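A note on why this works, assuming SQL Server (which is where try_convert() exists): try_convert() returns NULL when the conversion fails, a NULL comparison is never true so those rows are filtered out, and AVG() ignores NULLs anyway. Put together:
select avg(try_convert(float, duration)) as averageduration
from AsteriskCalls
where abs(try_convert(float, duration)) <= 1000.0;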

Can the order of individual operands affect whether a SQL expression is sargable?

A colleague of mine, who is generally well-versed in SQL, told me that the order of operands in a > or = expression could determine whether or not the expression was sargable. In particular, with a query whose CASE statement included:
CASE
    when (select count(i.id)
          from inventory i
          inner join orders o on o.idinventory = i.idInventory
          where o.idOrder = #order) > 1 THEN 2
    ELSE 1
END
and was told to reverse the order of the operands to the equivalent
CASE
    when 1 < (select count(i.id)
              from inventory i
              inner join orders o on o.idinventory = i.idInventory
              where o.idOrder = #order) THEN 2
    ELSE 1
END
for sargability concerns. I found no difference in query plans, though ultimately I made the change for the sake of sticking to team coding standards. Is what my co-worker said true in some cases? Does the order of operands in an expression have potential impact on its execution time? This doesn't mesh with how I understand sargability to work.
For Postgres, the answer is definitely: "No." (The sql-server tag was added to the question later.)
The query planner can flip the left and right operands of an operator as long as a COMMUTATOR is defined, which is the case for all instances of < and >. (An operator is identified by the operator name together with its accepted operand types.) And the query planner will do so to make an expression "sargable". Related answer with detailed explanation:
Can PostgreSQL index array columns?
It's different for other operators without COMMUTATOR. Example for ~~ (LIKE):
LATERAL JOIN not using trigram index
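A minimal illustration of that operand flipping, reusing the employees table from the first question and assuming an index on salary: because < and > are declared as each other's commutators, the planner can produce the same index-based plan for both spellings.
-- the planner flips the second form into the first
SELECT * FROM employees WHERE salary > 1000;
SELECT * FROM employees WHERE 1000 < salary;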
If you're talking about the most popular modern databases, like Microsoft SQL Server, Oracle, Postgres, MySQL, or Teradata, the answer is definitely NO.
What is a SARGable query?
A SARGable query is one that strives to narrow down the number of rows a database has to process in order to return the expected result. For example, consider this query:
select * from table where column1 <> 'some_value';
Obviously, using an index in this case is useless, because the database almost certainly has to look through all rows in the table to find the expected rows.
But what if we change the operator?
select * from table where column1 = 'some_value';
In this case an index can give good performance and return the expected rows almost in a flash.
SARGable operators are: =, <, >, <=, >=, LIKE (without a leading %; see the sketch below), BETWEEN
Non-SARGable operators are: <>, IN, OR
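To make the LIKE entry concrete (a hypothetical name column with an index on it): a pattern with no leading wildcard bounds an index range seek, while a leading % forces every value to be examined.
-- SARGable: the constant prefix bounds an index range seek
WHERE name LIKE 'string%'
-- non-SARGable: a leading wildcard means the index cannot narrow the range
WHERE name LIKE '%string'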
Now, back to your case.
Your problem is simple. You have X and you have Y. Whether you write X > Y or Y < X, the values of both operands have to be determined either way, so the switch gains you nothing.
P.S. Of course, I concede there could be databases with very poor optimizers where this kind of switching could play a role. But, as I said before, with modern databases you should not worry about it.

Number between a and b - non-inclusive on a, inclusive on b

(I'm a little new to SQL.) I have a lot of queries I'm rewriting which have a WHERE clause like this:
where some_number > A
and some_number <= B
I want to use a single where clause (fewer lines; it isn't faster or slower, is it?) like this:
where some_number between A and B
The problem is that the first clause is exclusive of A and inclusive of B. Is there any way I can specify that "inclusivity" on a single line, as in the second query? Thanks.
A couple of points...
Firstly, it's only "fewer lines" if you use fewer lines. I would format it like this:
where some_number > A and some_number <= B
because it's really one range condition with each end of the range coded separately.
Secondly, it's actually no faster or slower than the between version, because under the covers between A and B gets converted to:
where (some_number >= A) and (some_number <= B)
so the performance is identical.
Basically, don't worry about it.
You can just offset your A by +1 (note this only works when some_number is an integer type):
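A sketch of that, assuming integer values throughout (the +1 trick breaks for decimals, where values between A and A + 1 exist):
where some_number between A + 1 and B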
Or just use your first syntax, it's easier to read.

SQL and logical operators and null checks

I've got a vague, possibly cargo-cult memory from years of working with SQL Server that when you've got a possibly-null column, it's not safe to write "WHERE" clause predicates like:
... WHERE the_column IS NULL OR the_column < 10 ...
It had something to do with the fact that SQL rules don't stipulate short-circuiting (and in fact that's kind of a bad idea, possibly for query-optimization reasons), and thus the < comparison (or whatever) could be evaluated even if the column value is null. Now, exactly why that would be a terrible thing, I don't know, but I recall being sternly warned by some documentation to always code it as a CASE clause:
... WHERE 1 = CASE WHEN the_column IS NULL THEN 1 WHEN the_column < 10 THEN 1 ELSE 0 END ...
(the goofy "1 = " part is because SQL Server doesn't/didn't have first-class booleans, or at least I thought it didn't.)
So my questions here are:
Is that really true for SQL Server (or perhaps back-rev SQL Server 2000 or 2005) or am I just nuts?
If so, does the same caveat apply to PostgreSQL? (8.4 if it matters)
What exactly is the issue? Does it have to do with how indexes work or something?
My grounding in SQL is pretty weak.
I don't know SQL Server so I can't speak to that.
Given an expression a L b for some logical operator L, there is no guarantee that a will be evaluated before or after b or even that both a and b will be evaluated:
Expression Evaluation Rules
The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all.
[...]
Note that this is not the same as the left-to-right "short-circuiting" of Boolean operators that is found in some programming languages.
As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since those clauses are extensively reprocessed as part of developing an execution plan.
As far as an expression of the form:
the_column IS NULL OR the_column < 10
is concerned, there's nothing to worry about since NULL < n is NULL for all n; even NULL < NULL evaluates to NULL. Furthermore, NULL isn't true, so
null is null or null < 10
is just a complicated way of saying true or null, and that's true regardless of which sub-expression is evaluated first.
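You can verify this directly at a psql prompt (the first query yields NULL, the second yields true):
SELECT NULL < 10;     -- NULL
SELECT true OR NULL;  -- true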
The whole "use a CASE" sounds mostly like cargo-cult SQL to me. However, like most cargo-cultism, there is a kernel a truth buried under the cargo; just below my first excerpt from the PostgreSQL manual, you will find this:
When it is essential to force evaluation order, a CASE construct (see Section 9.16) can be used. For example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:
SELECT ... WHERE x > 0 AND y/x > 1.5;
But this is safe:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
So, if you need to guard against a condition that will raise an exception or have other side effects, then you should use a CASE to control the order of evaluation as a CASE is evaluated in order:
Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed. If the condition's result is not true, any subsequent WHEN clauses are examined in the same manner.
So given this:
case when A then Ra
when B then Rb
when C then Rc
...
A is guaranteed to be evaluated before B, B before C, etc. and evaluation stops as soon as one of the conditions evaluates to a true value.
In summary, a CASE short-circuits but neither AND nor OR do, so you only need to use a CASE when you need to protect against side effects.
Instead of
the_column IS NULL OR the_column < 10
I'd do
isnull(the_column,0) < 10
or for the first example
WHERE 1 = CASE WHEN isnull(the_column,0) < 10 THEN 1 ELSE 0 END ...
I've never heard of such a problem, and this bit of SQL Server 2000 documentation uses WHERE advance < $5000 OR advance IS NULL in an example, so it must not have been a very stern rule. My only concern with OR is that it has lower precedence than AND, so you might accidentally write something like WHERE the_column IS NULL OR the_column < 10 AND the_other_column > 20 when that's not what you mean; but the usual solution is parentheses rather than a big CASE expression.
I think that in most RDBMSes, indices don't include null values, so an index on the_column wouldn't be terribly useful for this query; but even if that weren't the case, I don't see why a big CASE expression would be any more index-friendly.
(Of course, it's hard to prove a negative, and maybe someone else will know what you're referring to?)
Well, I've repeatedly written queries like the first example since about forever (heck, I've written query generators that generate queries like that), and I've never had a problem.
I think you may be remembering an admonishment somebody once gave you against writing funky join conditions that use OR. In your first example, the conditions joined by the OR restrict the same column of the same table, which is fine. If your second condition were a join condition (i.e., it restricted columns from two different tables), then you could get into bad situations where the query planner simply has no choice but to use a Cartesian join (bad, bad, bad!!!).
I don't think your CASE function is really doing anything there, except perhaps hamper your query planner's attempts at finding a good execution plan for the query.
But more generally, just write the straightforward query first and see how it performs for realistic data. No need to worry about a problem that might not even exist!
Nulls can be confusing. The "... WHERE 1 = CASE ..." construction is useful if you are trying to pass either a NULL or a value as a parameter, e.g. "WHERE the_column = #parameter". This post may be helpful: Passing Null using OLEDB.
Another example of where CASE is useful is when applying date functions to varchar columns. Adding an ISDATE check before, say, convert(colA, datetime) might not work, and when colA contains non-date data the query can error out.
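A sketch of the CASE guard for that situation (the table, colA, and the date literal are hypothetical): the CASE guarantees ISDATE is checked before CONVERT runs, and rows failing the check produce NULL, which never satisfies the comparison.
SELECT *
FROM some_table
WHERE CASE WHEN ISDATE(colA) = 1
           THEN CONVERT(datetime, colA)
      END >= '20200101';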

SQL comparison operators

Are these two statements the same?
query 1: WHERE salary > 999;
query 2: WHERE salary >= 1000;
I thought they were, but apparently, according to my peers, they are not (though they have failed to explain why).
That's not necessarily the same. If you're storing doubles, then with:
WHERE salary >= 1000;
you're not taking into account the values between 999 and 1000 (e.g. 999.50) that the first query would match.
If, on the other hand, you're working with integers, the two are equivalent, not only in programming but also in math:
n > k <=> n >= k+1
The two queries will give the same results if salary is an integral type. But if salary is some real type, the results will be different.
This is entirely dependent on the datatype of salary.
If salary is an INT or a BIGINT, then yes, they will yield the same results.
If salary is pretty much any other datatype, the first one will return results for 999.9, but the second one won't.
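A quick way to see the difference, with literal constants standing in for a DECIMAL salary (999.9 satisfies the first predicate but not the second):
SELECT CASE WHEN 999.9 > 999   THEN 'matches' ELSE 'no match' END,  -- matches
       CASE WHEN 999.9 >= 1000 THEN 'matches' ELSE 'no match' END;  -- no match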