In SQL Server 2000 and 2005:
what is the difference between these two WHERE clauses?
which one should I use in which scenarios?
Query 1:
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate BETWEEN '10/15/2009' AND '10/18/2009'
Query 2:
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate >='10/15/2009'
AND EventDate <='10/18/2009'
(Edit: the second EventDate was originally missing, so the query was syntactically wrong.)
They are identical: BETWEEN is shorthand for the longer syntax in the question that includes both values (EventDate >= '10/15/2009' AND EventDate <= '10/18/2009').
Use the longer syntax where BETWEEN doesn't work because one or both of the endpoints should not be included, e.g.
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate >= '10/15/2009' AND EventDate < '10/19/2009'
(Note < rather than <= in second condition.)
They are the same.
One thing to be careful of: if you are using this against a DATETIME, the match for the end date will be the beginning of that day:
<= 20/10/2009
is not the same as:
<= 20/10/2009 23:59:59
(it would match against <= 20/10/2009 00:00:00.000)
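For example, here is a minimal sketch against the question's EventMaster table (the 14:30 row is an assumed example) showing how a closed range silently drops rows with a time-of-day, while a half-open range ending the next day keeps them:

-- Suppose EventMaster holds an event at 2009-10-18 14:30.
-- This closed range ends at 2009-10-18 00:00:00.000, so the event is missed:
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate BETWEEN '20091015' AND '20091018'

-- A half-open range ending at the NEXT day catches it:
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate >= '20091015' AND EventDate < '20091019'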
Although BETWEEN is easy to read and maintain, I rarely recommend its use because it is a closed interval and as mentioned previously this can be a problem with dates - even without time components.
For example, when dealing with monthly data it is common to compare dates BETWEEN first AND last, but in practice it is usually easier to write dt >= first AND dt < next-first (which also solves the time-part issue), since determining last usually takes one step more than determining next-first (subtracting a day).
In addition, another gotcha is that lower and upper bounds do need to be specified in the correct order (i.e. BETWEEN low AND high).
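As a concrete sketch of the next-first approach (SQL Server syntax; the input date is an assumed example), month-flooring with DATEADD/DATEDIFF gives next-first in one step:

-- First day of the month containing @AnyDate, and first day of the next month
DECLARE @AnyDate datetime
SET @AnyDate = '20091015'  -- hypothetical input

DECLARE @First datetime, @NextFirst datetime
SET @First = DATEADD(month, DATEDIFF(month, 0, @AnyDate), 0)  -- round down to month start
SET @NextFirst = DATEADD(month, 1, @First)

SELECT EventId, EventName
FROM EventMaster
WHERE EventDate >= @First AND EventDate < @NextFirst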
Typically, there is no difference - the BETWEEN keyword is not supported on all RDBMS platforms, but if it is, the two queries should be identical.
Since they're identical, there's really no distinction in terms of speed or anything else - use the one that seems more natural to you.
As mentioned by @marc_s, @Cloud, et al., they're basically the same for a closed range.
But any fractional time values may cause issues with a closed range (greater-or-equal and less-or-equal) as opposed to a half-open range (greater-or-equal and less-than) with an end value after the last possible instant.
So to avoid that, the query should be rewritten as:
SELECT EventId, EventName
FROM EventMaster
WHERE (EventDate >= '2009-10-15' AND
EventDate < '2009-10-19') /* <<<== 19th, not 18th */
Since BETWEEN doesn't work for half-open intervals, I always take a hard look at any date/time query that uses it, since it's probably an error.
I have a slight preference for BETWEEN because it makes it instantly clear to the reader that you are checking one field for a range. This is especially true if you have similar field names in your table.
If, say, our table has both a transactiondate and a transitiondate, if I read
transactiondate between ...
I know immediately that both ends of the test are against this one field.
If I read
transactiondate>='2009-04-17' and transactiondate<='2009-04-22'
I have to take an extra moment to make sure the two fields are the same.
Also, as a query gets edited over time, a sloppy programmer might separate the two conditions. I've seen plenty of queries that say something like
where transactiondate>='2009-04-17'
and salestype='A'
and customernumber=customer.idnumber
and transactiondate<='2009-04-22'
If they try this with a BETWEEN, of course, it will be a syntax error and promptly fixed.
I think the only difference is the amount of syntactic sugar in each query. BETWEEN is just a slick way of saying exactly the same thing as the second query.
There might be some RDBMS specific difference that I'm not aware of, but I don't really think so.
Logically there is no difference at all.
Performance-wise there is, typically, on most DBMSes, no difference at all.
There are infinitely many logically equivalent statements, but I'll consider three(ish).
Case 1: Two Comparisons in a standard order (Evaluation order fixed)
A >= MinBound AND A <= MaxBound
Case 2: Syntactic sugar (Evaluation order is not chosen by author)
A BETWEEN MinBound AND MaxBound
Case 3: Two Comparisons in an educated order (Evaluation order chosen at write time)
A >= MinBound AND A <= MaxBound
Or
A <= MaxBound AND A >= MinBound
In my experience, Case 1 and Case 2 do not have any consistent or notable differences in performance as they are dataset ignorant.
However, Case 3 can greatly improve execution times. Specifically, if you're working with a large data set and happen to have some heuristic knowledge about whether A is more likely to be greater than the MaxBound or less than the MinBound, you can improve execution times noticeably by using Case 3 and ordering the comparisons accordingly.
One use case I have is querying a large historical dataset with non-indexed dates for records within a specific interval. When writing the query, I will have a good idea of whether or not more data exists BEFORE the specified interval or AFTER the specified interval and can order my comparisons accordingly. I've had execution times cut by as much as half depending on the size of the dataset, the complexity of the query, and the number of records filtered by the first comparison.
In this scenario col BETWEEN ... AND ... and col <= ... and col >= ... are equivalent.
The SQL Standard also defines the T461 Symmetric BETWEEN predicate:
<between predicate part 2> ::=
[ NOT ] BETWEEN [ ASYMMETRIC | SYMMETRIC ]
<row value predicand> AND <row value predicand>
Transact-SQL does not support this feature.
Plain BETWEEN requires that the bounds be given in sorted order. For instance:
SELECT 1 WHERE 3 BETWEEN 10 AND 1
-- no rows
<=>
SELECT 1 WHERE 3 >= 10 AND 3 <= 1
-- no rows
On the other hand:
SELECT 1 WHERE 3 BETWEEN SYMMETRIC 1 AND 10;
-- 1
SELECT 1 WHERE 3 BETWEEN SYMMETRIC 10 AND 1
-- 1
It works exactly like the normal BETWEEN, but sorts the two comparison values first.
db<>fiddle demo
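Since Transact-SQL lacks T461, one hedged way to emulate BETWEEN SYMMETRIC is a pair of CASE expressions that sort the bounds first (a sketch, with @a and @b as the unordered bounds):

-- Emulating 3 BETWEEN SYMMETRIC 10 AND 1 in T-SQL
DECLARE @a int, @b int
SET @a = 10
SET @b = 1

SELECT 1
WHERE 3 BETWEEN CASE WHEN @a <= @b THEN @a ELSE @b END
            AND CASE WHEN @a <= @b THEN @b ELSE @a END
-- returns 1, matching BETWEEN SYMMETRIC 10 AND 1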
I'm running the following code on a dataset of 100M to test some things out before I eventually join the entire range (not just the top 10) on another table to make it even smaller.
SELECT TOP 10 *
FROM Table
WHERE CONVERT(datetime, DATE, 112) BETWEEN '2020-07-04 00:00:00' AND '2020-07-04 23:59:59'
The table isn't mine but a client's, so unfortunately I'm not responsible for the data types of the columns. The DATE column, along with the rest of the data, is in varchar. As for the dates in the BETWEEN clause, I just put in a relatively small range for testing.
I have heard that CONVERT shouldn't be in the WHERE clause, but I need to convert it to dates in order to filter. What is the proper way of going about this?
Going to summarise my comments here, as they are "second class citizens" and thus could be removed.
Firstly, the reason your query is slow is the CONVERT on the column DATE in your WHERE. Applying functions to a column in your WHERE will almost always make your query non-SARGable (there are some exceptions, but that doesn't make them a good idea). As a result, the entire table must be scanned to find rows applicable to your WHERE; it can't use an index to help.
The real problem, therefore, is that you are storing a date (and time) value in your table as a non-date (and time) datatype; presumably a (n)varchar. This is, in truth, a major design flaw and needs to be fixed. String type values aren't validated to be valid dates, so someone could easily insert the "date" '20210229' or even '20211332'. Fixing the design not only stops this, but also makes your data smaller (a date is 3 bytes in size, while a varchar(8) would be 10 bytes), and you could pass strongly typed date and time values to your query, making it SARGable.
"Fortunately" it appears your data is in the style code 112, which is yyyyMMdd; this at least means that the ordering of the dates is the same as if it were a strongly typed date (and time) data type. This means that the below query will work and return the results you want:
SELECT TOP 10 * --Ideally don't use * and list your columns properly
FROM dbo.[Table]
WHERE [DATE] >= '20200704' AND [DATE] < '20200705'
ORDER BY {Some Column};
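If retyping the column can't happen right away, a hedged stopgap (assumes SQL Server 2012+ for TRY_CONVERT; the column name DateTyped is illustrative) is a computed column that performs the conversion once, which can then be indexed:

-- Convert the string once, at the table level, instead of per-query;
-- TRY_CONVERT returns NULL for invalid "dates" instead of erroring
ALTER TABLE dbo.[Table]
    ADD DateTyped AS TRY_CONVERT(date, [DATE], 112);

-- CONVERT with an explicit style is deterministic, so the column can be indexed
CREATE INDEX IX_Table_DateTyped ON dbo.[Table] (DateTyped);

-- Now a strongly typed, SARGable filter works:
SELECT TOP 10 *
FROM dbo.[Table]
WHERE DateTyped >= '20200704' AND DateTyped < '20200705'
ORDER BY {Some Column};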
You can use something like this to get better performance:
SELECT TOP 10 *
FROM Table
WHERE CAST([DATE] AS date) BETWEEN '2020-07-04' AND '2020-07-04'
  AND CAST([DATE] AS time) BETWEEN '00:00:00' AND '23:59:59'
There is no need to include the time portion if you want to search a full day.
Whenever you write a query where you need to filter rows on a range of values, should you use the BETWEEN clause or <= and >=?
Which one performs better?
Neither. They create exactly the same execution plan.
Which one I use depends not on performance, but on the data.
If the data are Discrete Values, then I use BETWEEN...
x BETWEEN 0 AND 9
But if the data are Continuous Values, then that doesn't work so well...
x BETWEEN 0.000 AND 9.999999999999999999
Instead, I use >= AND <...
x >= 0 AND x < 10
Interestingly, however, the >= AND < technique actually works for both Continuous and Discrete data types. So, in general, I rarely use BETWEEN at all.
Also, don't use BETWEEN for date/time range queries.
What does the following really mean?
BETWEEN '20120201' AND '20120229'
Some people think that means get me all of the data from February, including all of the data any time on February 29th. The above gets translated to:
BETWEEN '20120201 00:00:00.000' AND '20120229 00:00:00.000'
So if there is data on the 29th any time after midnight, your report is going to be incomplete.
People also try to be clever and pick the "end" of the day:
BETWEEN '20120201 00:00:00.000' AND '20120229 23:59:59.997'
That works if the data type is datetime. If it is smalldatetime the end of the range gets rounded up, and you may include data from the next day that you didn't mean to. If it's datetime2 you might actually miss a small portion of data that happened in the last 2+ milliseconds of the day. In most cases statistically irrelevant, but if the query is wrong, the query is wrong.
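A quick, hedged way to see the smalldatetime rounding for yourself:

-- smalldatetime is accurate to the minute; .997 just before midnight rounds UP
SELECT CAST('20120229 23:59:59.997' AS smalldatetime);
-- returns 2012-03-01 00:00:00, i.e. the next day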
So for date range queries I always strongly recommend using an open-ended range, e.g. to report on the month of February the WHERE clause would say "on or after February 1st, and before March 1st" as follows:
WHERE date_col >= '20120201' AND date_col < '20120301'
BETWEEN can work as expected using the date type only, but I still prefer an open-ended range in queries because later someone may change that underlying data type to allow it to include time.
I blogged a lot more details here:
What do BETWEEN and the devil have in common?
I have a table where each row has a start and stop date-time. These can be arbitrarily short or long spans.
I want to query the total duration of the intersection of all rows with a given pair of start and stop date-times.
How can you do this in MySQL?
Or do you have to select the rows that intersect the query start and stop times, then calculate the actual overlap of each row and sum it client-side?
To give an example, using milliseconds to make it clearer:
Some rows:
ROW START STOP
1 1010 1240
2 950 1040
3 1120 1121
And we want to know the total time that these rows overlap the interval from 1030 to 1100.
Let's compute the overlap of each row:
ROW INTERSECTION
1 70
2 10
3 0
So the sum in this example is 80.
Assuming @range_start and @range_end as your condition parameters:
SELECT SUM( LEAST(@range_end, stop) - GREATEST(@range_start, start) )
FROM Table
WHERE @range_start < stop AND @range_end > start
Using GREATEST/LEAST together with the date functions, you should be able to get what you need while operating directly on the date type.
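Here is a hedged check of that query against the question's sample data (MySQL syntax; table and column names are assumed):

CREATE TEMPORARY TABLE spans (row_id INT, start INT, stop INT);
INSERT INTO spans VALUES (1, 1010, 1240), (2, 950, 1040), (3, 1120, 1121);

SET @range_start = 1030, @range_end = 1100;

SELECT SUM(LEAST(@range_end, stop) - GREATEST(@range_start, start)) AS total_overlap
FROM spans
WHERE @range_start < stop AND @range_end > start;
-- total_overlap = 80 (70 from row 1, 10 from row 2; row 3 is filtered out)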
I fear you're out of luck.
Since you don't know the number of rows that you will be "cumulatively intersecting", you need either a recursive solution, or an aggregation operator.
The aggregation operator you need is not an option because SQL does not have the data type that it is supposed to operate on (that type being an interval type, as described in "Temporal Data and the Relational Model").
The recursive solution may be possible, but it is likely to be difficult to write and difficult for other programmers to read, and it is also questionable whether the optimizer can turn that query into the optimal data access strategy.
Or I misunderstood your question.
There's a fairly interesting solution if you know the maximum time you'll ever have. Create a table with all the numbers in it from one to your maximum time.
millisecond
-----------
1
2
3
...
1240
Call it time_dimension (this technique is often used in dimensional modelling in data warehousing.)
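One hedged way to populate such a table (MySQL 8+ syntax; the cap of 1240 matches the example and should be raised to your real maximum):

CREATE TABLE time_dimension (millisecond INT PRIMARY KEY);

-- Recursive CTEs default to a depth limit of 1000 in MySQL, so raise it first
SET SESSION cte_max_recursion_depth = 10000;

INSERT INTO time_dimension (millisecond)
WITH RECURSIVE ms (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM ms WHERE n < 1240
)
SELECT n FROM ms;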
Then this:
SELECT
COUNT(*)
FROM
your_data
INNER JOIN time_dimension ON time_dimension.millisecond >= your_data.start
                         AND time_dimension.millisecond < your_data.stop
WHERE
time_dimension.millisecond >= 1030 AND time_dimension.millisecond < 1100
...will give you the total number of milliseconds of running time between 1030 and 1100. (Half-open comparisons, rather than a closed BETWEEN, avoid counting one extra millisecond per row and match the example's sum of 80.)
Of course, whether you can use this technique depends on whether you can safely predict the maximum number of milliseconds that will ever be in your data.
This is often used in data warehousing, as I said; it fits well with some kinds of problems -- for example, I've used it for insurance systems, where a total number of days between two dates was needed, and where the overall date range of the data was easy to estimate (from the earliest customer date of birth to a date a couple of years into the future, beyond the end date of any policies that were being sold.)
Might not work for you, but I figured it was worth sharing as an interesting technique!
After you added the example, it is clear that indeed I misunderstood your question.
You are not "cumulatively intersecting rows".
The steps that will bring you to a solution are:
intersect each row's start and end point with the given start and end points. This should be doable using CASE expressions, something in the style of:
SELECT CASE WHEN startdate < givenstartdate THEN givenstartdate ELSE startdate END AS retainedstartdate, (likewise for enddate) AS retainedenddate FROM ... Cater for nulls and that sort of stuff as needed.
With the retainedstartdate and retainedenddate, use a date function to compute the length of the retained interval (which is the overlap of your row with the given time section).
SELECT the SUM() of those; a combined sketch follows below.
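Putting the three steps together, a hedged sketch in MySQL (the column names startdate/enddate and the @given... parameters are assumptions):

SELECT SUM(
         TIMESTAMPDIFF(SECOND,
                       GREATEST(startdate, @givenstartdate),  -- step 1: clamp the start
                       LEAST(enddate, @givenenddate))         -- step 1: clamp the end
       ) AS total_overlap_seconds                             -- steps 2 and 3: length, then SUM
FROM your_table
WHERE startdate < @givenenddate                               -- keep only rows that overlap
  AND enddate > @givenstartdate;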
I'm trying to optimize some horrendously complicated SQL queries because they take too long to finish.
In my queries, I have dynamically created SQL statements with lots of the same functions, so I created a temporary table where each function is only called once instead of many, many times - this cut my execution time by 3/4.
So my question is, can I expect to see much of a difference if say, 1,000 datediff computations are narrowed to 100?
EDIT:
The query looks like this :
SELECT DISTINCT M.MID, M.RE
FROM #TEMP
INNER JOIN M ON #TEMP.MID = M.MID
WHERE ( #TEMP.Property1 = 1 )
  AND DATEDIFF( year, M.DOB, @date2 ) >= 15
  AND DATEDIFF( year, M.DOB, @date2 ) <= 17
where these are being generated dynamically as strings (put together in bits and pieces) and then executed, so that various parameters can be changed on each iteration - mainly the last lines, containing all sorts of DATEDIFF conditions.
There are about 420 queries like this where these datediffs are being calculated. I know that I can pull them all into a temp table easily (1,000 datediffs become 50), but is it worth it? Will it make any difference in seconds? I'm hoping for an improvement of more than tenths of a second.
To be honest, the extent of the performance hit depends on exactly what you are doing.
For example, if you are using DATEDIFF (or indeed any other function) within a WHERE clause, then this will be a cause of poorer performance as it will prevent an index being used on that column.
e.g. basic example, finding all records in 2009
WHERE DATEDIFF(yyyy, DateColumn, '2009-01-01') = 0
would not make good use of an index on DateColumn, whereas a better solution, providing optimal index usage, would be:
WHERE DateColumn >= '2009-01-01' AND DateColumn < '2010-01-01'
I recently blogged about the difference this makes (with performance stats/execution plan comparisons), if you're interested.
That is costlier than, say, returning DATEDIFF as a column in the result set.
I would start by identifying the individual queries that are taking the most time. Check the execution plans to see where the problem lies and tune from there.
Edit:
Based on the example query you've given, here's an approach you could try out to remove the use of DATEDIFF within the WHERE clause. Basic example to find everyone who was 10 years old on a given date - I think the maths is right, but you get the idea anyway! I gave it a quick test, and it seems fine. It should be easy enough to adapt to your scenario. If you want to find people between (e.g.) 15 and 17 years old on a given date, that's also possible with this approach (see the sketch at the end of this answer).
-- Assuming @Date2 is set to the date at which you want to calculate someone's age
DECLARE @AgeAtDate INTEGER
SET @AgeAtDate = 10
DECLARE @BornFrom DATETIME
DECLARE @BornUntil DATETIME
SELECT @BornFrom = DATEADD(yyyy, -(@AgeAtDate + 1), @Date2)
SELECT @BornUntil = DATEADD(yyyy, -@AgeAtDate, @Date2)
SELECT DOB
FROM YourTable
WHERE DOB > @BornFrom AND DOB <= @BornUntil
An important note: for ages calculated from DOB, this approach is also more accurate. Your current implementation only takes the year of birth into account, not the actual day (e.g. someone born on 1st Dec 2009 would show as being 1 year old on 1st Jan 2010, when they are not 1 until 1st Dec 2010).
Hope this helps.
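And a hedged adaptation to the 15-to-17 range in the original query (reusing @date2 and the question's tables):

DECLARE @MinAge INT, @MaxAge INT
SET @MinAge = 15
SET @MaxAge = 17

DECLARE @BornFrom DATETIME, @BornUntil DATETIME
SET @BornFrom  = DATEADD(yyyy, -(@MaxAge + 1), @date2)  -- strictly after this: at most 17
SET @BornUntil = DATEADD(yyyy, -@MinAge, @date2)        -- on or before this: at least 15

SELECT DISTINCT M.MID, M.RE
FROM #TEMP
INNER JOIN M ON #TEMP.MID = M.MID
WHERE #TEMP.Property1 = 1
  AND M.DOB > @BornFrom
  AND M.DOB <= @BornUntil  -- no DATEDIFF in the WHERE clause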
DATEDIFF is quite efficient compared to other methods of handling datetime values, such as strings (see this SO answer).
In this case, it sounds like you're going over the same data again and again, which is likely more expensive than using a temp table. (For example, statistics will be generated on the temp table, which can help the optimizer.)
One thing you might be able to do to improve performance is to put an index on MID on the temp table.
Check your execution plan to see if it helps (may depend on the number of rows in the temp table).
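For instance, a minimal sketch (reusing the temp table name from the question; whether it helps depends on row counts):

-- Index the join column so the join to M can seek rather than scan
CREATE INDEX IX_TEMP_MID ON #TEMP (MID);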