Using the BETWEEN clause - SQL

Whenever I write a query that needs to filter rows on a range of values, should I use the BETWEEN clause or <= and >= ?
Which one performs better?

Neither. They create exactly the same execution plan.
Which one I use depends not on performance, but on the data.
If the data are Discrete Values, then I use BETWEEN...
x BETWEEN 0 AND 9
But if the data are Continuous Values, then that doesn't work so well...
x BETWEEN 0.000 AND 9.999999999999999999
Instead, I use >= AND <...
x >= 0 AND x < 10
Interestingly, however, the >= AND < technique actually works for both Continuous and Discrete data types. So, in general, I rarely use BETWEEN at all.
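To illustrate the difference on a continuous column, here is a minimal sketch (assuming SQL Server and a hypothetical table variable with a decimal column x): BETWEEN 0 AND 9 silently drops values such as 9.5, while the half-open form keeps everything below 10.
DECLARE @t TABLE (x decimal(10, 2));
INSERT INTO @t (x) VALUES (0), (4.25), (9), (9.5), (10);

SELECT x FROM @t WHERE x BETWEEN 0 AND 9;     -- returns 0, 4.25, 9 (9.5 is lost)
SELECT x FROM @t WHERE x >= 0 AND x < 10;     -- returns 0, 4.25, 9, 9.5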

Also, don't use BETWEEN for date/time range queries.
What does the following really mean?
BETWEEN '20120201' AND '20120229'
Some people think that means get me all the data from February, including all of the data from any time on February 29th. The above gets translated to:
BETWEEN '20120201 00:00:00.000' AND '20120229 00:00:00.000'
So if there is data on the 29th any time after midnight, your report is going to be incomplete.
People also try to be clever and pick the "end" of the day:
BETWEEN '20120201' AND '20120229 23:59:59.997'
That works if the data type is datetime. If it is smalldatetime the end of the range gets rounded up, and you may include data from the next day that you didn't mean to. If it's datetime2 you might actually miss a small portion of data that happened in the last 2+ milliseconds of the day. In most cases statistically irrelevant, but if the query is wrong, the query is wrong.
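As a quick illustration of that rounding (a sketch, assuming SQL Server; the literal is the supposed "end of day" for February 29th):
DECLARE @eod varchar(23) = '20120229 23:59:59.997';

SELECT CAST(@eod AS datetime)      AS as_datetime,       -- 2012-02-29 23:59:59.997
       CAST(@eod AS smalldatetime) AS as_smalldatetime,  -- 2012-03-01 00:00:00 (rounded up into the next day)
       CAST(@eod AS datetime2(7))  AS as_datetime2;      -- 23:59:59.9970000, so anything later that day is missed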
So for date range queries I always strongly recommend using an open-ended range, e.g. to report on the month of February the WHERE clause would say "on or after February 1st, and before March 1st" as follows:
WHERE date_col >= '20120201' AND date_col < '20120301'
BETWEEN only works as expected with the date type (no time component), but I still prefer an open-ended range in queries, because later someone may change the underlying column to a data type that includes time.
I blogged a lot more details here:
What do BETWEEN and the devil have in common?

Related

Difference between mydatetime BETWEEN mindate AND maxdate ...or... mydatetime >= mindate and mydatetime <= maxdate [duplicate]

In SQL Server 2000 and 2005:
what is the difference between these two WHERE clauses?
which one should I use in which scenarios?
Query 1:
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate BETWEEN '10/15/2009' AND '10/18/2009'
Query 2:
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate >='10/15/2009'
AND EventDate <='10/18/2009'
(Edit: the second Eventdate was originally missing, so the query was syntactically wrong)
They are identical: BETWEEN is a shorthand for the longer syntax in the question that includes both values (EventDate >= '10/15/2009' AND EventDate <= '10/18/2009').
Use the longer syntax where BETWEEN doesn't work because one or both of the bounds should not be included, e.g.
Select EventId,EventName from EventMaster
where EventDate >= '10/15/2009' and EventDate < '10/19/2009'
(Note < rather than <= in second condition.)
They are the same.
One thing to be careful of: if you are using this against a DATETIME, the match for the end date will be the beginning of that day:
<= 20/10/2009
is not the same as:
<= 20/10/2009 23:59:59
(it would match against <= 20/10/2009 00:00:00.000)
Although BETWEEN is easy to read and maintain, I rarely recommend its use because it is a closed interval and as mentioned previously this can be a problem with dates - even without time components.
For example, when dealing with monthly data it is common to compare dates BETWEEN first AND last, but in practice it is usually easier to write dt >= first AND dt < next-first (which also solves the time-part issue), since determining last usually takes one step more than determining next-first (you have to subtract a day).
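A minimal sketch of the next-first pattern in T-SQL, assuming a hypothetical Sales table with an OrderDate column (the DATEADD/DATEDIFF idiom computes month boundaries):
DECLARE @anyDate   datetime = '20240215';
DECLARE @first     datetime = DATEADD(month, DATEDIFF(month, 0, @anyDate), 0);      -- 2024-02-01
DECLARE @nextFirst datetime = DATEADD(month, DATEDIFF(month, 0, @anyDate) + 1, 0);  -- 2024-03-01

SELECT *
FROM Sales                      -- hypothetical table
WHERE OrderDate >= @first
  AND OrderDate <  @nextFirst;  -- half-open: no "last day of the month" needed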
In addition, another gotcha is that lower and upper bounds do need to be specified in the correct order (i.e. BETWEEN low AND high).
Typically, there is no difference - the BETWEEN keyword is not supported on all RDBMS platforms, but if it is, the two queries should be identical.
Since they're identical, there's really no distinction in terms of speed or anything else - use the one that seems more natural to you.
As mentioned by @marc_s, @Cloud, et al., they're basically the same for a closed range.
But any fractional time values may cause issues with a closed range (greater-or-equal and less-or-equal) as opposed to a half-open range (greater-or-equal and less-than) with an end value after the last possible instant.
So to avoid that the query should be rewritten as:
SELECT EventId, EventName
FROM EventMaster
WHERE (EventDate >= '2009-10-15' AND
EventDate < '2009-10-19') /* <<<== 19th, not 18th */
Since BETWEEN doesn't work for half-open intervals, I always take a hard look at any date/time query that uses it, since it's probably an error.
I have a slight preference for BETWEEN because it makes it instantly clear to the reader that you are checking one field for a range. This is especially true if you have similar field names in your table.
If, say, our table has both a transactiondate and a transitiondate, and I read
transactiondate between ...
I know immediately that both ends of the test are against this one field.
If I read
transactiondate>='2009-04-17' and transactiondate<='2009-04-22'
I have to take an extra moment to make sure the two fields are the same.
Also, as a query gets edited over time, a sloppy programmer might separate the two fields. I've seen plenty of queries that say something like
where transactiondate>='2009-04-17'
and salestype='A'
and customernumber=customer.idnumber
and transactiondate<='2009-04-22'
If they try this with a BETWEEN, of course, it will be a syntax error and promptly fixed.
I think the only difference is the amount of syntactical sugar on each query. BETWEEN is just a slick way of saying exactly the same as the second query.
There might be some RDBMS specific difference that I'm not aware of, but I don't really think so.
Logically there is no difference at all.
Performance-wise there is (typically, on most DBMSes) no difference at all.
There are infinite logically equivalent statements, but I'll consider three(ish).
Case 1: Two Comparisons in a standard order (Evaluation order fixed)
A >= MinBound AND A <= MaxBound
Case 2: Syntactic sugar (Evaluation order is not chosen by author)
A BETWEEN MinBound AND MaxBound
Case 3: Two Comparisons in an educated order (Evaluation order chosen at write time)
A >= MinBound AND A <= MaxBound
Or
A <= MaxBound AND A >= MinBound
In my experience, Case 1 and Case 2 do not have any consistent or notable differences in performance as they are dataset ignorant.
However, Case 3 can greatly improve execution times. Specifically, if you're working with a large data set and have some heuristic knowledge about whether A is more likely to be greater than the MaxBound or less than the MinBound, you can improve execution times noticeably by using Case 3 and ordering the comparisons accordingly.
One use case I have is querying a large historical dataset with non-indexed dates for records within a specific interval. When writing the query, I will have a good idea of whether or not more data exists BEFORE the specified interval or AFTER the specified interval and can order my comparisons accordingly. I've had execution times cut by as much as half depending on the size of the dataset, the complexity of the query, and the amount of records filtered by the first comparison.
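As an illustration of the Case 3 idea: if I know most rows in a hypothetical, non-indexed history table fall after the interval of interest, I would write the upper-bound test first (a sketch only; whether the written order of predicates actually influences evaluation is engine- and plan-specific):
-- Most rows in history are newer than 2015, so the upper-bound test rejects the bulk of them first.
SELECT *
FROM history                     -- hypothetical table, event_date not indexed
WHERE event_date <= '20151231'   -- the test most likely to be false goes first
  AND event_date >= '20150101';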
In this scenario col BETWEEN ... AND ... and col <= ... and col >= ... are equivalent.
The SQL Standard also defines the T461 Symmetric BETWEEN predicate:
<between predicate part 2> ::=
[ NOT ] BETWEEN [ ASYMMETRIC | SYMMETRIC ]
<row value predicand> AND <row value predicand>
Transact-SQL does not support this feature.
BETWEEN requires that the bounds are given in ascending order (low first, then high). For instance:
SELECT 1 WHERE 3 BETWEEN 10 AND 1
-- no rows
<=>
SELECT 1 WHERE 3 >= 10 AND 3 <= 1
-- no rows
On the other hand:
SELECT 1 WHERE 3 BETWEEN SYMMETRIC 1 AND 10;
-- 1
SELECT 1 WHERE 3 BETWEEN SYMMETRIC 10 AND 1
-- 1
It works exactly like the normal BETWEEN, but sorts the two comparison values first.
db<>fiddle demo
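For engines without BETWEEN SYMMETRIC (Transact-SQL among them, as noted above), a rough emulation is to sort the two bounds by hand. A sketch, using hypothetical @lo/@hi variables:
DECLARE @lo int = 10, @hi int = 1;   -- bounds possibly supplied in either order

SELECT 1
WHERE 3 BETWEEN CASE WHEN @lo <= @hi THEN @lo ELSE @hi END
            AND CASE WHEN @lo <= @hi THEN @hi ELSE @lo END;
-- 1 (same result as 3 BETWEEN SYMMETRIC 10 AND 1)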

sqlalchemy select by date column only x newest days

suppose I have a table MyTable with a column some_date (date type of course) and I want to select the newest 3 months data (or x days).
What is the best way to achieve this?
Please notice that the date should not be measured from today but rather from the date range in the table (which might be older than today).
I need to find the maximum date and compare it to each row - if the difference is less than x days, return it.
All of this should be done with sqlalchemy and without loading the entire table.
What is the best way of doing it? Must I use a subquery to find the maximum date? How do I select the last X days?
Any help is appreciated.
EDIT:
The following query works in Oracle but seems inefficient (is max calculated for each row?) and I don't think that it'll work for all dialects:
select * from my_table where (select max(some_date) from my_table) - some_date < 10
You can do this in a single query and without resorting to creating datediff.
Here is an example I used for getting everything in the past day:
from datetime import datetime, timedelta

one_day = timedelta(hours=24)
one_day_ago = datetime.now() - one_day
Message.query.filter(Message.created > one_day_ago).all()
You can adapt the timedelta to whatever time range you are interested in.
UPDATE
Upon re-reading your question it looks like I failed to take into account the fact that you want to compare two dates which are in the database rather than today's day. I'm pretty sure that this sort of behavior is going to be database specific. In Postgres, you can use straightforward arithmetic.
Operations with DATEs
1. The difference between two DATEs is always an INTEGER, representing the number of DAYS difference:
DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19
2. You may add or subtract an INTEGER to a DATE to produce another DATE:
DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
You're probably using timestamps if you are storing dates in Postgres. Doing math with timestamps produces an interval object, and SQLAlchemy works with timedeltas as a representation of intervals. So you could do something like:
from datetime import timedelta

one_day = timedelta(hours=24)
Model.query.join(ModelB, Model.created - ModelB.created < one_day)
I haven't tested this exactly, but I've done things like this and they have worked.
I ended up doing two selects: one to get the max date and another to get the data.
Using the datediff recipe from this thread, I added a datediff function and used the query q = session.query(MyTable).filter(datediff(max_date, some_date) < 10).
I still don't think this is the best way, but until someone proves me wrong, it will have to do...
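For reference, the raw query from the question's edit can also be written with an analytic function so the maximum is clearly computed once over the table rather than looking like a per-row subquery (a sketch in Oracle-style SQL, reusing the table and column names from the question; most optimizers evaluate the original uncorrelated subquery only once anyway):
select *
from (
    select t.*,
           max(some_date) over () as max_some_date   -- window max, computed once
    from my_table t
)
where max_some_date - some_date < 10;                 -- Oracle DATE arithmetic: difference in days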

SQL finding overlapping of times past midnight (across 2 days)

I know there are lots of these types of questions, but I didn't see one that was similar enough to my criteria, so I'd like to ask for your help please. The fields I have are just start and end, which are of time types. I cannot involve any specific dates in this. If the time ranges don't go past midnight into the next day, I'd just compare the two tuples as such:
end1 > start2 AND start1 < end2
(end points touching are not considered overlapped here.)
But when a time range passes (or ends at) midnight, this obviously doesn't work. For example, given:
start | end
--------+--------
06:00PM | 01:00AM
03:00PM | 09:00PM
Without involving dates, how can I achieve this? My assumption is that if end is less than start, then the range spans 2 days.
I'm trying to do this in plain standard SQL, so just a simple and concise logic in the WHERE clause.
Thank you everyone!
Added:
Also, how would I test if one time range completely envelopes another? thanks again!
If your SQL supports time differences:
(end1 - start1) > (start2 - start1) AND (end2 - start2) > (start1 - start2)
Unfortunately, "plain" SQL will be too general to use against an actual database. The reason is that the various database products have different levels of support for calculating the duration between two times. For example, in SQL Server 2008, it would be substantially simpler to convert the time values to DateTime and then do the comparison since many comparison operators are not supported on the Time data type.
Select ...
From (
    Select Cast(Start1 As DateTime) As Start1
         , Case
               When Cast(Start1 As DateTime) > Cast(End1 As DateTime)
                   Then DateAdd(d, 1, Cast(End1 As DateTime))
               Else Cast(End1 As DateTime)
           End As End1
         -- Start2 and End2 would be normalized in the same way
    From ...
    ) As T
Where T.End1 > T.Start2 And T.Start1 < T.End2
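Building on that normalization idea, here is a self-contained sketch (assuming SQL Server, hypothetical @s1/@e1 and @s2/@e2 time variables, and the question's rule that touching endpoints do not count as overlap). Because time-of-day ranges repeat every day, each range is also tested shifted by one day:
DECLARE @s1 time(0) = '18:00', @e1 time(0) = '01:00';   -- range 1: crosses midnight
DECLARE @s2 time(0) = '15:00', @e2 time(0) = '21:00';   -- range 2: same day

SELECT CASE WHEN EXISTS (
           SELECT 1
           FROM (VALUES (
                    CAST(@s1 AS datetime),
                    CASE WHEN @e1 > @s1 THEN CAST(@e1 AS datetime)
                         ELSE DATEADD(DAY, 1, CAST(@e1 AS datetime)) END,
                    CAST(@s2 AS datetime),
                    CASE WHEN @e2 > @s2 THEN CAST(@e2 AS datetime)
                         ELSE DATEADD(DAY, 1, CAST(@e2 AS datetime)) END
                )) AS v (s1, e1, s2, e2)
           WHERE (v.e1 > v.s2 AND v.s1 < v.e2)                                    -- as linearized
              OR (v.e1 > DATEADD(DAY, 1, v.s2) AND v.s1 < DATEADD(DAY, 1, v.e2))  -- range 2 shifted +1 day
              OR (DATEADD(DAY, 1, v.e1) > v.s2 AND DATEADD(DAY, 1, v.s1) < v.e2)  -- range 1 shifted +1 day
       )
       THEN 1 ELSE 0 END AS overlaps;   -- 1 here: 18:00-01:00 does overlap 15:00-21:00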
use start time and duration (in minutes or whatever unit is appropriate)
The program Transtar had this problem. The time data was not associated with any date, nor was the time data in a datetime field. The program initially was designed to issue transit itineraries from about 4AM to midnight, which worked fine as long as the transit wasn't around the clock. I built a function which did a sliding test for the times, so that if you asked for 5AM it would look at times from 1AM to 12:59AM. I wrote it in FORTRAN, but the algorithm would be the same regardless of language.

Is SQL DATEDIFF(year, ..., ...) an Expensive Computation?

I'm trying to optimize up some horrendously complicated SQL queries because it takes too long to finish.
In my queries, I have dynamically created SQL statements with lots of the same functions, so I created a temporary table where each function is only called once instead of many, many times - this cut my execution time by 3/4.
So my question is, can I expect to see much of a difference if say, 1,000 datediff computations are narrowed to 100?
EDIT:
The query looks like this :
SELECT DISTINCT M.MID, M.RE FROM #TEMP INNER JOIN M ON #TEMP.MID=M.MID
WHERE ( #TEMP.Property1=1 ) AND
DATEDIFF( year, M.DOB, @date2 ) >= 15 AND DATEDIFF( year, M.DOB, @date2 ) <= 17
where these are being generated dynamically as strings (put together in bits and pieces) and then executed so that various parameters can be changed along each iteration - mainly the last lines, containing all sorts of DATEDIFF queries.
There are about 420 queries like this where these datediffs are being calculated like so. I know that I can pull them all into a temp table easily (1,000 datediffs becomes 50) - but is it worth it, will it make any difference in seconds? I'm hoping for an improvement better than in the tenths of seconds.
It depends on exactly what you are doing to be honest as to the extent of the performance hit.
For example, if you are using DATEDIFF (or indeed any other function) within a WHERE clause, then this will be a cause of poorer performance as it will prevent an index being used on that column.
e.g. basic example, finding all records in 2009
WHERE DATEDIFF(yyyy, DateColumn, '2009-01-01') = 0
would not make good use of an index on DateColumn. Whereas a better solution, providing optimal index usage would be:
WHERE DateColumn >= '2009-01-01' AND DateColumn < '2010-01-01'
I recently blogged about the difference this makes (with performance stats/execution plan comparisons), if you're interested.
That would be costlier than say returning DATEDIFF as a column in the resultset.
I would start by identifying the individual queries that are taking the most time. Check the execution plans to see where the problem lies and tune from there.
Edit:
Based on the example query you've given, here's an approach you could try out to remove the use of DATEDIFF within the WHERE clause. Basic example to find everyone who was 10 years old on a given date - I think the maths is right, but you get the idea anyway! Gave it a quick test, and seems fine. Should be easy enough to adapt to your scenario. If you want to find people between (e.g.) 15 and 17 years old on a given date, then that's also possible with this approach.
-- Assuming @Date2 is set to the date at which you want to calculate someone's age
DECLARE @AgeAtDate INTEGER
SET @AgeAtDate = 10
DECLARE @BornFrom DATETIME
DECLARE @BornUntil DATETIME
SELECT @BornFrom = DATEADD(yyyy, -(@AgeAtDate + 1), @Date2)
SELECT @BornUntil = DATEADD(yyyy, -@AgeAtDate, @Date2)
SELECT DOB
FROM YourTable
WHERE DOB > @BornFrom AND DOB <= @BornUntil
An important note to add: for ages calculated from DOB, this approach is more accurate. Your current implementation only takes the year of birth into account, not the actual day (e.g. someone born on 1st Dec 2009 would show as being 1 year old on 1st Jan 2010, when they are not 1 until 1st Dec 2010).
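Adapting the same idea to the 15-to-17 range from the question might look like this (a sketch, reusing the question's tables and assuming @date2 is already set):
SELECT DISTINCT M.MID, M.RE
FROM #TEMP
INNER JOIN M ON #TEMP.MID = M.MID
WHERE #TEMP.Property1 = 1
  AND M.DOB >  DATEADD(yyyy, -18, @date2)   -- not yet 18 on @date2
  AND M.DOB <= DATEADD(yyyy, -15, @date2)   -- already 15 on @date2
Unlike the DATEDIFF version, both conditions compare the bare M.DOB column, so an index on DOB remains usable.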
Hope this helps.
DATEDIFF is quite efficient compared to other methods of handling datetime values, like strings (see this SO answer).
In this case, it sounds like you're going over and over the same data, which is likely more expensive than using a temp table. For example, statistics will be generated on the temp table.
One thing you might be able to do to improve performance is to put an index on the temp table on MID.
Check your execution plan to see if it helps (may depend on the number of rows in the temp table).
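A minimal sketch of that suggestion, using the #TEMP table and MID column from the question (the index name is made up); create it after the temp table has been populated:
CREATE NONCLUSTERED INDEX IX_TEMP_MID ON #TEMP (MID);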
