Is SQL DATEDIFF(year, ..., ...) an Expensive Computation? - sql

I'm trying to optimize some horrendously complicated SQL queries because they take too long to finish.
In my queries, I have dynamically created SQL statements with lots of the same functions, so I created a temporary table where each function is only called once instead of many, many times - this cut my execution time by 3/4.
So my question is, can I expect to see much of a difference if say, 1,000 datediff computations are narrowed to 100?
EDIT:
The query looks like this :
SELECT DISTINCT M.MID, M.RE FROM #TEMP INNER JOIN M ON #TEMP.MID=M.MID
WHERE ( #TEMP.Property1=1 ) AND
DATEDIFF( year, M.DOB, @date2 ) >= 15 AND DATEDIFF( year, M.DOB, @date2 ) <= 17
where these are being generated dynamically as strings (put together in bits and pieces) and then executed, so that various parameters can be changed on each iteration - mainly the last lines, containing all sorts of DATEDIFF conditions.
There are about 420 queries like this where these datediffs are being calculated. I know that I can pull them all into a temp table easily (1,000 datediffs becomes 50), roughly as sketched below - but is it worth it, will it make a difference measured in seconds? I'm hoping for an improvement better than tenths of a second.
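For illustration, precomputing the age once per row might look something like this (a sketch only; the column name AgeAtDate2 and the exact join are assumptions, since the full temp-table layout isn't shown):
-- Hypothetical sketch: compute each DATEDIFF once into #TEMP
ALTER TABLE #TEMP ADD AgeAtDate2 INT;

UPDATE T
SET T.AgeAtDate2 = DATEDIFF(year, M.DOB, @date2)
FROM #TEMP T
INNER JOIN M ON T.MID = M.MID;

-- The generated WHERE clauses can then test the precomputed value:
-- ... AND #TEMP.AgeAtDate2 >= 15 AND #TEMP.AgeAtDate2 <= 17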

To be honest, the extent of the performance hit depends on exactly what you are doing.
For example, if you are using DATEDIFF (or indeed any other function) within a WHERE clause, this will cause poorer performance, as it will prevent an index from being used on that column.
e.g. basic example, finding all records in 2009
WHERE DATEDIFF(yyyy, DateColumn, '2009-01-01') = 0
would not make good use of an index on DateColumn. Whereas a better solution, providing optimal index usage would be:
WHERE DateColumn >= '2009-01-01' AND DateColumn < '2010-01-01'
I recently blogged about the difference this makes (with performance stats/execution plan comparisons), if you're interested.
That would be costlier than, say, returning DATEDIFF as a column in the resultset.
I would start by identifying the individual queries that are taking the most time. Check the execution plans to see where the problem lies and tune from there.
Edit:
Based on the example query you've given, here's an approach you could try to remove the use of DATEDIFF from the WHERE clause. It's a basic example to find everyone who was 10 years old on a given date - I think the maths is right, but you get the idea anyway! I gave it a quick test and it seems fine, and it should be easy enough to adapt to your scenario. If you want to find people between (e.g.) 15 and 17 years old on a given date, that's also possible with this approach - see the sketch after the code below.
-- Assuming @Date2 is set to the date at which you want to calculate someone's age
DECLARE @AgeAtDate INTEGER
SET @AgeAtDate = 10
DECLARE @BornFrom DATETIME
DECLARE @BornUntil DATETIME
SELECT @BornFrom = DATEADD(yyyy, -(@AgeAtDate + 1), @Date2)
SELECT @BornUntil = DATEADD(yyyy, -@AgeAtDate, @Date2)
SELECT DOB
FROM YourTable
WHERE DOB > @BornFrom AND DOB <= @BornUntil
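And here is the range variant mentioned above; a sketch carrying over the same assumed names, with the arithmetic mirroring the single-age case:
-- Sketch: everyone aged between @MinAge and @MaxAge (inclusive) on @Date2
DECLARE @MinAge INTEGER
DECLARE @MaxAge INTEGER
SET @MinAge = 15
SET @MaxAge = 17
DECLARE @BornFrom DATETIME
DECLARE @BornUntil DATETIME
-- The oldest qualifying person was born just after @Date2 minus (@MaxAge + 1) years
SELECT @BornFrom = DATEADD(yyyy, -(@MaxAge + 1), @Date2)
-- The youngest qualifying person was born on or before @Date2 minus @MinAge years
SELECT @BornUntil = DATEADD(yyyy, -@MinAge, @Date2)
SELECT DOB
FROM YourTable
WHERE DOB > @BornFrom AND DOB <= @BornUntil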
An important note to add: for age calculations from DOB, this approach is also more accurate. Your current implementation only takes the year of birth into account, not the actual day (e.g. someone born on 1st Dec 2009 would show as being 1 year old on 1st Jan 2010, when they are not 1 until 1st Dec 2010).
Hope this helps.

DATEDIFF is quite efficient compared to other methods of handling datetime values, like strings (see this SO answer).
In this case, it sounds like you're going over and over the same data, which is likely more expensive than using a temp table. For example, statistics will be generated on the temp table, which helps the optimizer.

One thing you might be able to do to improve performance is to put an index on the temp table's MID column.
Check your execution plan to see if it helps (may depend on the number of rows in the temp table).
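A minimal sketch, assuming MID is the join column from the question's query:
CREATE CLUSTERED INDEX IX_TEMP_MID ON #TEMP (MID);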

Why does using CONVERT(DATETIME, [date], [format]) in WHERE clause take so long?

I'm running the following code on a dataset of 100M to test some things out before I eventually join the entire range (not just the top 10) on another table to make it even smaller.
SELECT TOP 10 *
FROM Table
WHERE CONVERT(datetime, DATE, 112) BETWEEN '2020-07-04 00:00:00' AND '2020-07-04 23:59:59'
The table isn't mine but a client's, so unfortunately I'm not responsible for the data types of the columns. The DATE column, along with the rest of the data, is in varchar. As for the dates in the BETWEEN clause, I just put in a relatively small range for testing.
I have heard that CONVERT shouldn't be in the WHERE clause, but I need to convert it to dates in order to filter. What is the proper way of going about this?
Going to summarise my comments here, as they are "second class citizens" and thus could be removed.
Firstly, the reason your query is slow is the CONVERT on the column DATE in your WHERE. Applying functions to a column in your WHERE will almost always make your query non-SARGable (there are some exceptions, but that doesn't make them a good idea). As a result, the entire table must be scanned to find rows that are applicable for your WHERE; it can't use an index to help it.
The real problem, therefore, is that you are storing a date (and time) value in your table as a non-date (and time) datatype; presumably a (n)varchar. This is, in truth, a major design flaw and needs to be fixed. String-type values aren't validated to be valid dates, so someone could easily insert the "date" '20210229' or even '20211332'. Fixing the design not only stops this, but also makes your data smaller (a date is 3 bytes in size, a varchar(8) would be 10 bytes), and you could pass strongly typed date and time values to your query and it would be SARGable.
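If the column can be converted in place, the fix might be as simple as this sketch (an assumption: this only works if every existing value really is a valid yyyyMMdd string, and any dependent objects are adjusted first):
-- Hypothetical sketch of the design fix; verify the data first
-- (e.g. look for rows where TRY_CONVERT(date, [DATE], 112) IS NULL)
ALTER TABLE dbo.[Table] ALTER COLUMN [DATE] date NOT NULL;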
"Fortunately" it appears your data is in the style code 112, which is yyyyMMdd; this at least means that the ordering of the dates is the same as if it were a strongly typed date (and time) data type. This means that the below query will work and return the results you want:
SELECT TOP 10 * --Ideally don't use * and list your columns properly
FROM dbo.[Table]
WHERE [DATE] >= '20200704' AND [DATE] < '20200705'
ORDER BY {Some Column};
You can use something like this to get better performance:
SELECT TOP 10 *
FROM Table
WHERE cast(DATE as date) BETWEEN '2020-07-04' AND '2020-07-04' and cast(DATE as time) BETWEEN '00:00:00' AND '23:59:59'
There's no need to include the time portion if you want to search a full day.
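That is, when the whole day is wanted, the date cast alone should be enough; a sketch:
SELECT TOP 10 *
FROM Table
WHERE cast(DATE as date) = '2020-07-04'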

Difference between mydatetime BETWEEN mindate AND maxdate ...or... mydatetime >= mindate and mydatetime <= maxdate [duplicate]

In SQL Server 2000 and 2005:
what is the difference between these two WHERE clauses?
which one I should use on which scenarios?
Query 1:
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate BETWEEN '10/15/2009' AND '10/18/2009'
Query 2:
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate >='10/15/2009'
AND EventDate <='10/18/2009'
(Edit: the second Eventdate was originally missing, so the query was syntactically wrong)
They are identical: BETWEEN is a shorthand for the longer syntax in the question that includes both values (EventDate >= '10/15/2009' and EventDate <= '10/18/2009').
Use an alternative longer syntax where BETWEEN doesn't work because one or both of the values should not be included e.g.
SELECT EventId, EventName FROM EventMaster
WHERE EventDate >= '10/15/2009' AND EventDate < '10/19/2009'
(Note < rather than <= in second condition.)
They are the same.
One thing to be careful of: if you are using this against a DATETIME, the end date will only match the very beginning of that day:
<= 20/10/2009
is not the same as:
<= 20/10/2009 23:59:59
(it would match against <= 20/10/2009 00:00:00.000)
Although BETWEEN is easy to read and maintain, I rarely recommend its use, because it is a closed interval and, as mentioned previously, this can be a problem with dates - even without time components.
For example, when dealing with monthly data it is common to compare dates BETWEEN first AND last, but in practice this is usually easier to write as dt >= first AND dt < next-first (which also solves the time-part issue), since determining last usually takes one step more than determining next-first (by subtracting a day).
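For example, filtering October 2009 with the half-open pattern might look like this (a sketch against the EventMaster table from the question):
SELECT EventId, EventName
FROM EventMaster
WHERE EventDate >= '20091001' AND EventDate < '20091101' -- first AND < next-first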
In addition, another gotcha is that lower and upper bounds do need to be specified in the correct order (i.e. BETWEEN low AND high).
Typically, there is no difference - the BETWEEN keyword is not supported on all RDBMS platforms, but if it is, the two queries should be identical.
Since they're identical, there's really no distinction in terms of speed or anything else - use the one that seems more natural to you.
As mentioned by @marc_s, @Cloud, et al., they're basically the same for a closed range.
But any fractional time values may cause issues with a closed range (greater-or-equal and less-or-equal) as opposed to a half-open range (greater-or-equal and less-than) with an end value after the last possible instant.
So to avoid that the query should be rewritten as:
SELECT EventId, EventName
FROM EventMaster
WHERE (EventDate >= '2009-10-15' AND
EventDate < '2009-10-19') /* <<<== 19th, not 18th */
Since BETWEEN doesn't work for half-open intervals, I always take a hard look at any date/time query that uses it, since it's probably an error.
I have a slight preference for BETWEEN because it makes it instantly clear to the reader that you are checking one field for a range. This is especially true if you have similar field names in your table.
If, say, our table has both a transactiondate and a transitiondate, then when I read
transactiondate between ...
I know immediately that both ends of the test are against this one field.
If I read
transactiondate>='2009-04-17' and transactiondate<='2009-04-22'
I have to take an extra moment to make sure the two fields are the same.
Also, as a query gets edited over time, a sloppy programmer might separate the two fields. I've seen plenty of queries that say something like
where transactiondate>='2009-04-17'
and salestype='A'
and customernumber=customer.idnumber
and transactiondate<='2009-04-22'
If they try this with a BETWEEN, of course, it will be a syntax error and promptly fixed.
I think the only difference is the amount of syntactical sugar on each query. BETWEEN is just a slick way of saying exactly the same as the second query.
There might be some RDBMS specific difference that I'm not aware of, but I don't really think so.
Logically, there is no difference at all.
Performance-wise there is (typically, on most DBMSes) no difference at all.
There are infinite logically equivalent statements, but I'll consider three(ish).
Case 1: Two Comparisons in a standard order (Evaluation order fixed)
A >= MinBound AND A <= MaxBound
Case 2: Syntactic sugar (Evaluation order is not chosen by author)
A BETWEEN MinBound AND MaxBound
Case 3: Two Comparisons in an educated order (Evaluation order chosen at write time)
A >= MinBound AND A <= MaxBound
Or
A <= MaxBound AND A >= MinBound
In my experience, Case 1 and Case 2 do not have any consistent or notable differences in performance as they are dataset ignorant.
However, Case 3 can greatly improve execution times. Specifically, if you're working with a large data set and happen to have some heuristic knowledge about whether A is more likely to be greater than the MaxBound or less than the MinBound, you can improve execution times noticeably by using Case 3 and ordering the comparisons accordingly.
One use case I have is querying a large historical dataset with non-indexed dates for records within a specific interval. When writing the query, I will have a good idea of whether or not more data exists BEFORE the specified interval or AFTER the specified interval and can order my comparisons accordingly. I've had execution times cut by as much as half depending on the size of the dataset, the complexity of the query, and the amount of records filtered by the first comparison.
In this scenario col BETWEEN ... AND ... and col <= ... and col >= ... are equivalent.
The SQL Standard also defines the T461 Symmetric BETWEEN predicate:
<between predicate part 2> ::=
[ NOT ] BETWEEN [ ASYMMETRIC | SYMMETRIC ]
<row value predicand> AND <row value predicand>
Transact-SQL does not support this feature.
BETWEEN requires that the bounds are given in sorted order. For instance:
SELECT 1 WHERE 3 BETWEEN 10 AND 1
-- no rows
<=>
SELECT 1 WHERE 3 >= 10 AND 3 <= 1
-- no rows
On the other hand:
SELECT 1 WHERE 3 BETWEEN SYMMETRIC 1 AND 10;
-- 1
SELECT 1 WHERE 3 BETWEEN SYMMETRIC 10 AND 1
-- 1
It works exactly like the normal BETWEEN, but sorts the comparison values first.
db<>fiddle demo

SQL Performance with Distinct and Count

I have a stored procedure which contains many SET statements and is taking very long to execute. What can I do to increase the performance? I have included one statement here:
SET @VisitedOutlets = (select count(distinct CustomerId) from dbo.VisitDetail
where RouteId = @intRouteID
and CONVERT(VARCHAR(10), VisitDate, 111) between CONVERT(VARCHAR(10), @FromDate, 111)
and CONVERT(VARCHAR(10), @ToDate, 111));
I think your problem comes from the fact that you are using variables in your query. Normally, the optimizer will ... optimize (!) the query for a given (hard-coded) value (let's say id = 123), whereas it cannot do so when the value is a variable.
Let's take a great example from here:
OK,
You are the Optimizer and the Query Plan is a vehicle.
I will give you a query and you have to choose the vehicle.
All the books in the library have a sequential number
My query is Go to the library and get me all the books between 3 and 5
You'd pick a bike right, quick, cheap, efficient and big enough to
carry back 3 books.
New query.
Go to the library and get all the books between @x and @y.
Pick the vehicle.
Go ahead.
That's what happens. Do you pick a dump truck in case I ask for books
between 1 and Maxvalue? That's overkill if x=3 and y=5. SQL has to
pick the plan before it sees the numbers.
So your problem is that the optimizer cannot do its job correctly. To let it do its job, you can make it recompile the statement, or update the statistics. See here, here, and here.
So my 2 solutions to your problem would be:
Recompile: OPTION (RECOMPILE)
Update statistics: EXEC sp_updatestats
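A sketch of the first option applied to the statement from the question (note that OPTION is a statement-level clause, so the SET form becomes a SELECT assignment):
SELECT @VisitedOutlets = COUNT(DISTINCT CustomerId)
FROM dbo.VisitDetail
WHERE RouteId = @intRouteID
  AND CONVERT(VARCHAR(10), VisitDate, 111) BETWEEN CONVERT(VARCHAR(10), @FromDate, 111)
                                               AND CONVERT(VARCHAR(10), @ToDate, 111)
OPTION (RECOMPILE);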
Your query is essentially:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail
where RouteId = @intRouteID and
      CONVERT(VARCHAR(10), VisitDate, 111) between
      CONVERT(VARCHAR(10), @FromDate, 111) and CONVERT(VARCHAR(10), @ToDate, 111);
I think this query can be optimized to take advantage of indexes. One major problem is the date comparison. You should not be doing any conversion for the comparison on VisitDate. So, I would rewrite the query as:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail vd
where vd.RouteId = @intRouteID and
      vd.VisitDate >= cast(@FromDate as date) and
      vd.VisitDate < dateadd(day, 1, cast(@ToDate as date));
For this query, you want an index on VisitDetail(RouteId, VisitDate, CustomerId). I would also store the constants in the appropriate format, so conversions are not needed in the query itself.
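A sketch of that index (the name is illustrative):
CREATE INDEX IX_VisitDetail_RouteId_VisitDate_CustomerId
    ON dbo.VisitDetail (RouteId, VisitDate, CustomerId);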
between is dangerous when using dates. Here is an interesting discussion on the topic by Aaron Bertrand.

How Does Dateadd Impact the Performance of a SQL Query?

Say for instance I'm joining on a number table to perform some operation between two dates in a subquery, like so:
select n
,(select avg(col1)
from table1
where timestamp between dateadd(minute, 15*n, @ArbitraryDate)
and dateadd(minute, 15*(n+1), @ArbitraryDate))
from numbers
where n < 1200
Would the query perform better if I, say, constructed the date by concatenating varchars rather than using the DATEADD function?
Keeping data in the datetime format using DATEADD is most likely to be quicker
Check this question: Most efficient way in SQL Server to get date from date+time?
The accepted answer (not me!) demonstrates DATEADD over string conversions. I saw another answer many years ago that showed the same.
Be careful with between and dates, take a look at How Does Between Work With Dates In SQL Server?
I once optimized a query so that it went from running in over 24 hours to 36 seconds. Just don't use date functions or conversions on the column; see here: Only In A Database Can You Get 1000% + Improvement By Changing A Few Lines Of Code
To see which query performs better, execute both queries and look at the execution plans. You can also use STATISTICS IO and STATISTICS TIME to see how many reads occurred and the time it took to execute the queries.
I would NOT go with concatenating varchars.
DATEADD will definitely give better performance than string concatenation and casting to DATETIME.
As always, your best bet would be to profile the 2 options and determine the best result, as no DB is specified.
Most likely there will be no difference one way or another.
I would run this:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
followed by both variants of your query, so that you see and compare real execution costs.
As long as your predicate calculations do not include references to the columns of the table you're querying, your approach shouldn't matter either way (go for clarity).
If you were to include something from Table1 in the calculation, though, I'd watch out for table scans or covering index scans as it may no longer be sargable.
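To illustrate against the query above (a sketch; only the second form drags the column into the calculation):
-- Sargable: the column stands alone on one side of the comparison
where timestamp >= dateadd(minute, 15*n, @ArbitraryDate)
-- Not sargable: a function is applied to the column itself, so an index seek is no longer possible
where dateadd(minute, -15*n, timestamp) >= @ArbitraryDate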
In any case, check (or post!) the execution plan to confirm.
Why would you ever use a correlated subquery to begin with? That will slow you down far more than DATEADD. Correlated subqueries are like cursors: they work row by row.
Will something like this work?
select n.n, t.avgcol1
from numbers n
left outer join
(
    -- join the numbers table inside the derived table so n is available to the dateadd bounds
    select nums.n, avg(t1.col1) as avgcol1
    from numbers nums
    inner join table1 t1
        on t1.timestamp between dateadd(minute, 15 * nums.n, @ArbitraryDate)
                            and dateadd(minute, 15 * (nums.n + 1), @ArbitraryDate)
    where nums.n < 1200
    group by nums.n
) t
on n.n = t.n
where n.n < 1200