How Does DATEADD Impact the Performance of a SQL Query?

Say for instance I'm joining on a number table to perform some operation between two dates in a subquery, like so:
select n,
       (select avg(col1)
        from table1
        where timestamp between dateadd(minute, 15*n, @ArbitraryDate)
                            and dateadd(minute, 15*(n+1), @ArbitraryDate))
from numbers
where n < 1200
Would the query perform better if I, say, constructed the dates by concatenating varchars rather than using the DATEADD function?

Keeping the data in datetime format and using DATEADD is most likely to be quicker.
Check this question: Most efficient way in SQL Server to get date from date+time?
The accepted answer (not mine!) demonstrates DATEADD beating string conversions. I've seen another benchmark years ago that showed the same.
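To illustrate what that question compares — getting the date part of a datetime via date arithmetic versus a string round-trip — here is a minimal self-contained sketch (plain T-SQL, nothing assumed beyond the current datetime):

DECLARE @d DATETIME = GETDATE();

-- datetime arithmetic: stays in the datetime/numeric domain throughout
SELECT DATEADD(day, DATEDIFF(day, 0, @d), 0);

-- string round-trip: converts out to varchar and back, typically slower
SELECT CONVERT(datetime, CONVERT(varchar(10), @d, 120));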

Be careful with BETWEEN and dates; take a look at How Does Between Work With Dates In SQL Server?
I once optimized a query to run in 36 seconds instead of over 24 hours, just by not using date functions or conversions on the column; see Only In A Database Can You Get 1000%+ Improvement By Changing A Few Lines Of Code. A sketch of that kind of rewrite follows below.
To see which query performs better, execute both queries and look at the execution plans. You can also use STATISTICS IO and STATISTICS TIME to get the number of reads and the time it took to execute each query.
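To make the "no functions on the column" advice concrete, here is a minimal sketch of the classic rewrite (the Orders table and OrderDate column are made up for illustration):

-- Not sargable: converting OrderDate on every row defeats any index on it
SELECT * FROM Orders WHERE CONVERT(VARCHAR(10), OrderDate, 112) = '20090101';

-- Sargable: a range on the bare column lets the optimizer use an index seek
SELECT * FROM Orders WHERE OrderDate >= '20090101' AND OrderDate < '20090102';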

I would NOT go with concatenating varchars.
DATEADD will definitely perform better than string concatenation followed by a cast to DATETIME.
As always, your best bet is to profile the two options and determine the best result, as no DB is specified.

Most likely there will be no difference one way or the other.
I would run this:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
followed by both variants of your query, so that you see and compare real execution costs.

As long as your predicate calculations do not include references to the columns of the table you're querying, your approach shouldn't matter either way (go for clarity).
If you were to include something from Table1 in the calculation, though, I'd watch out for table scans or covering index scans as it may no longer be sargable.
In any case, check (or post!) the execution plan to confirm.

Why would you ever use a correlated subquery to begin with? That is going to slow you down far more than DATEADD. Correlated subqueries are like cursors: they work row by row.
Will something like this work?
select n.n, t.avgcol1
from numbers n
left outer join
(
    -- numbers must be joined in here as well: table1 has no n column,
    -- and a derived table cannot reference the outer numbers alias
    select nums.n, avg(t1.col1) as avgcol1
    from numbers nums
    join table1 t1
      on t1.timestamp between dateadd(minute, 15 * nums.n, @ArbitraryDate)
                          and dateadd(minute, 15 * (nums.n + 1), @ArbitraryDate)
    group by nums.n
) t
on n.n = t.n
where n.n < 1200
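If you'd rather avoid the join-with-BETWEEN inside the derived table as well, another option (my sketch, not part of the original answer) is to compute each row's 15-minute bucket number once and aggregate in a single pass. Note this uses half-open buckets, whereas the question's BETWEEN includes both endpoints:

select n.n, t.avgcol1
from numbers n
left outer join
(
    -- integer division assigns each row to its 15-minute bucket
    select datediff(minute, @ArbitraryDate, timestamp) / 15 as n,
           avg(col1) as avgcol1
    from table1
    where timestamp >= @ArbitraryDate
      and timestamp < dateadd(minute, 15 * 1200, @ArbitraryDate)
    group by datediff(minute, @ArbitraryDate, timestamp) / 15
) t on n.n = t.n
where n.n < 1200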

Related

DB2 - optimized query - apply function to current date or db column?

I need to write a query that will be as efficient as possible returning rows that have mycol (timestamp value) equal to today's date minus 100 or 200 days. (exactly 100 or 200 days ago - not a range)
Note that the time in mycol is always 00.00.00.000000 (midnight). (Ignore why that is.)
Here is one example of how this can be written:
select * from mytable mt
where date(mycol) in (current date - 100 days, current date - 200 days)
I'm thinking this may be more efficient:
select * from mytable mt
where mycol in (timestampadd(16, -100, timestamp(current date, '00:00:00')),
                timestampadd(16, -200, timestamp(current date, '00:00:00')))
The reason I believe it is more efficient is that I'm not calling a function on mycol (as I did in the first example), and the calculations on current date happen only once per execution of this query, not for every row.
Am I correct in my assumption?
I would write the second version as:
select *
from mytable mt
where mycol in (timestamp(current date - 100 days, '00:00:00'),
                timestamp(current date - 200 days, '00:00:00'));
If you care about performance, then you should have an index on mytable(mycol), because it will speed up the query. Without an index, the additional overhead is the call to date() on each row. You would need to run timings to determine whether that is an issue in your environment.
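A minimal sketch of that index (the index name is made up; the table and column come from the question):

CREATE INDEX idx_mytable_mycol ON mytable (mycol);

With this in place, the IN list of two precomputed timestamps can be answered with two index probes instead of a scan of the whole table.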
I think the second option is more efficient, but for another reason.
If you declare an index on your column mycol, the IN operator will be much faster.
But the moment you apply date(mycol), the DB can't use that index anymore.
The best way is to test both queries using EXPLAIN.
Additional info:
You should consider using EXISTS instead of IN, as shown HERE.
Test your two queries using DB2's EXPLAIN or the DB2 SQL Performance Analyzer (sketched below), and then try the same with the EXISTS version.
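In DB2 for LUW, the comparison might look like the sketch below (this assumes the explain tables already exist, e.g. created from EXPLAIN.DDL, and that the captured plans are then formatted with db2exfmt):

EXPLAIN PLAN FOR
select * from mytable mt
where date(mycol) in (current date - 100 days, current date - 200 days);

EXPLAIN PLAN FOR
select * from mytable mt
where mycol in (timestamp(current date - 100 days, '00:00:00'),
                timestamp(current date - 200 days, '00:00:00'));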

SQL Performance with Distinct and Count

I have a stored procedure which contains many SET statements and is taking very long to execute. What can I do to increase its performance? I have included one of the statements here.
SET @VisitedOutlets = (select count(distinct CustomerId)
                       from dbo.VisitDetail
                       where RouteId = @intRouteID
                         and CONVERT(VARCHAR(10), VisitDate, 111)
                             between CONVERT(VARCHAR(10), @FromDate, 111)
                                 and CONVERT(VARCHAR(10), @ToDate, 111));
I think your problem comes from the fact that you are using variables in your query. Normally, the optimizer will ... optimize (!) the query for a given (hard-coded) value (let's say id = 123), whereas it cannot do so when the value is a variable.
Let's take a great example from here:
OK,
You are the Optimizer and the Query Plan is a vehicle.
I will give you a query and you have to choose the vehicle.
All the books in the library have a sequential number
My query is Go to the library and get me all the books between 3 and 5
You'd pick a bike right, quick, cheap, efficient and big enough to
carry back 3 books.
New query.
Go to the library and get all the books between @x and @y.
Pick the vehicle.
Go ahead.
That's what happens. Do you pick a dump truck in case I ask for books
between 1 and Maxvalue? That's overkill if x=3 and y=5. SQL has to
pick the plan before it sees the numbers.
So your problem is that the optimizer cannot do its job correctly. To let it do its job, you can make it recompile the statement, or update the statistics. See here, here, and here.
So my 2 solutions to your problem would be:
Recompile: OPTION (RECOMPILE)
Update statistics: EXEC sp_updatestats
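Applied to the statement from the question, the recompile hint might look like this (a sketch reusing the question's names; the assignment is written as a SELECT because the OPTION clause goes at the end of a statement):

SELECT @VisitedOutlets = COUNT(DISTINCT CustomerId)
FROM dbo.VisitDetail
WHERE RouteId = @intRouteID
  AND CONVERT(VARCHAR(10), VisitDate, 111) BETWEEN CONVERT(VARCHAR(10), @FromDate, 111)
                                               AND CONVERT(VARCHAR(10), @ToDate, 111)
OPTION (RECOMPILE); -- compile a fresh plan using the actual variable values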
Your query is essentially:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail
where RouteId = @intRouteID and
      CONVERT(VARCHAR(10), VisitDate, 111) between
          CONVERT(VARCHAR(10), @FromDate, 111) and CONVERT(VARCHAR(10), @ToDate, 111);
I think this query can be optimized to take advantage of indexes. One major problem is the date comparison. You should not be doing any conversion for the comparison on VisitDate. So, I would rewrite the query as:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail vd
where vd.RouteId = @intRouteID and
      vd.VisitDate >= cast(@FromDate as date) and
      vd.VisitDate < dateadd(day, 1, cast(@ToDate as date))
For this query, you want an index on VisitDetail(RouteId, VisitDate, CustomerId). I would also store the constants in the appropriate format, so conversions are not needed in the query itself.
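A sketch of that index (the name is hypothetical; with CustomerId in the key as suggested, the index covers the query entirely):

CREATE INDEX IX_VisitDetail_Route_Date
    ON dbo.VisitDetail (RouteId, VisitDate, CustomerId);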
between is dangerous when using dates. Here is an interesting discussion on the topic by Aaron Bertrand.

Converting a string to a date time in SQL

I'm importing data from a different system and the datetime is stored as string in this format:
20061105084755ES
yyyymmddhhmmss(es/ed) where es is EST and ed is EDT.
I will have to query this table for the last 30 days. I'm using the conversion query:
select convert(
           datetime,
           left(cdts, 4) + '-' + substring(cdts, 5, 2) + '-' + substring(cdts, 7, 2) + ' '
           + substring(cdts, 9, 2) + ':' + substring(cdts, 11, 2) + ':' + substring(cdts, 13, 2)
       ) as dt
from tb1
where dt < getdate() - 30
I'm looking for a more efficient query that will reduce the time taken. This table has around 90 million records and the query runs forever.
No calculation at runtime is going to speed this query up if you are performing the calculation and then need to filter against the result of the calculation - SQL Server will be forced to perform a table scan. The main problem is that you've chosen to store your dates as a string. For a variety of reasons, this is a terrible decision. Is the string column indexed at least? If so, then this may help get the data only from the last 30 days:
DECLARE @ThirtyDays CHAR(8);
SET @ThirtyDays = CONVERT(CHAR(8), DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0) - 30, 112);
SELECT ...
WHERE cdts >= @ThirtyDays;
If you need to return all the data from all of history except the past 30 days, this isn't going to help either, because unless you are only pulling data from the indexed column, the most efficient approach for retrieving most of the data in the table is a clustered index scan. (If you are retrieving a narrow set of columns, it may opt for an index scan, if you have a covering index.) So your bottleneck in many of these scenarios is not something a formula can fix, but rather the time it takes to actually retrieve a large volume of data, transmit it over the network, and render it on the client.
Also, as an aside, you can't do this:
SELECT a + b AS c FROM dbo.somewhere
WHERE c > 10;
c doesn't exist in dbo.somewhere, it is an expression derived in the SELECT list. The SELECT list is parsed second last (right before ORDER BY), so you can't reference something in the WHERE clause that doesn't exist yet. Typical workarounds are to repeat the expression or use a subquery / CTE.
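Sketched against the hypothetical dbo.somewhere table, the two workarounds look like this:

-- 1) repeat the expression in the WHERE clause
SELECT a + b AS c FROM dbo.somewhere WHERE a + b > 10;

-- 2) derive the expression first in a CTE (or subquery), then filter on the alias
WITH x AS (SELECT a + b AS c FROM dbo.somewhere)
SELECT c FROM x WHERE c > 10;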
One potential option is to add a date column to your table and populate that information on load. This way the conversion is all done before you need to query for it.
Then, make sure you have an index on that field which the actual query can take advantage of.
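For instance, the backfill might look like this (a sketch reusing the question's tb1/cdts names and borrowing the STUFF conversion from the answer below; like that answer, it ignores the trailing ES/ED zone suffix):

ALTER TABLE tb1 ADD cdts_dt DATETIME NULL;

-- 'yyyymmddhhmmss' -> 'yyyymmdd hh:mm:ss', which CONVERT parses directly
UPDATE tb1
SET cdts_dt = CONVERT(datetime,
    STUFF(STUFF(STUFF(LEFT(cdts, 14), 9, 0, ' '), 12, 0, ':'), 15, 0, ':'));

CREATE INDEX IX_tb1_cdts_dt ON tb1 (cdts_dt);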
convert(datetime, stuff(stuff(stuff(datevalue, 9, 0, ' '), 12, 0, ':'), 15, 0, ':'))
or
convert(datetime,
        dateadd(second, right(datevalue, 2) / 1,
        dateadd(minute, right(datevalue, 4) / 100,
        dateadd(hour,   right(datevalue, 6) / 10000,
        '1900-01-01'))))
+ convert(datetime, left(datevalue, 8))
Link

How to get all rows from a table inserted on a particular date

I am trying to write a query that gets all the rows of a table for a particular date.
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE='2013-05-07'
However that does not work, because in the table the COLUMN_CONTAINING_DATE contains data like '2013-05-07 00:00:01' etc. So, this would work
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE>='2013-05-07' AND COLUMN_CONTAINING_DATE<'2013-05-08'
However, I don't want to go for option 2 because that feels like a hacky way. I would rather write a query that says "get me all the rows for a given date" and somehow not bother about the minutes and hours in COLUMN_CONTAINING_DATE.
I am trying to have this query run on both H2 and DB2.
Any suggestions?
You can do:
select *
from MY_Table
where trunc(COLUMN_CONTAINING_DATE) = '2013-05-07';
However, the version that you describe as a "hack" is actually better. By wrapping a function around the data, many SQL optimizers will not use indexes. With just direct comparisons, an index would definitely be used.
Use something like this
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE=DATE('2013-05-07')
You can ease this if you use the Temporal data management capability from DB2 10.1.
For more information:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/
If your concerns are related to the different data types (timestamp in the column, and a string containing a date), you can do this:
SELECT * FROM MY_TABLE
WHERE
COLUMN_CONTAINING_DATE >= '2013-05-07 00:00:00'
and COLUMN_CONTAINING_DATE < '2013-05-08 00:00:00'
and I'd pay attention to the formatting of the WHERE clause, because it improves readability a lot if you have to look at your queries two months later. Just pick a style you prefer for ranges like "a <= x < b". Unfortunately, SQL's BETWEEN does not support half-open ranges like this.
One could argue that the milliseconds are still missing, so perfectionists may append another ".0" in the timestamp ...

Is SQL DATEDIFF(year, ..., ...) an Expensive Computation?

I'm trying to optimize some horrendously complicated SQL queries because they take too long to finish.
In my queries, I have dynamically created SQL statements with lots of the same functions, so I created a temporary table where each function is only called once instead of many, many times - this cut my execution time by 3/4.
So my question is, can I expect to see much of a difference if say, 1,000 datediff computations are narrowed to 100?
EDIT:
The query looks like this :
SELECT DISTINCT M.MID, M.RE
FROM #TEMP
INNER JOIN M ON #TEMP.MID = M.MID
WHERE ( #TEMP.Property1 = 1 )
  AND DATEDIFF(year, M.DOB, @date2) >= 15
  AND DATEDIFF(year, M.DOB, @date2) <= 17
where these are being generated dynamically as strings (put together in bits and pieces) and then executed so that various parameters can be changed along each iteration - mainly the last lines, containing all sorts of DATEDIFF queries.
There are about 420 queries like this where these datediffs are being calculated like so. I know that I can pull them all into a temp table easily (1,000 datediffs becomes 50) - but is it worth it, will it make any difference in seconds? I'm hoping for an improvement better than in the tenths of seconds.
To be honest, the extent of the performance hit depends on exactly what you are doing.
For example, if you are using DATEDIFF (or indeed any other function) within a WHERE clause, then this will be a cause of poorer performance as it will prevent an index being used on that column.
e.g. basic example, finding all records in 2009
WHERE DATEDIFF(yyyy, DateColumn, '2009-01-01') = 0
would not make good use of an index on DateColumn. Whereas a better solution, providing optimal index usage would be:
WHERE DateColumn >= '2009-01-01' AND DateColumn < '2010-01-01'
I recently blogged about the difference this makes (with performance stats/execution plan comparisons), if you're interested.
That would be costlier than, say, returning DATEDIFF as a column in the resultset.
I would start by identifying the individual queries that are taking the most time. Check the execution plans to see where the problem lies and tune from there.
Edit:
Based on the example query you've given, here's an approach you could try out to remove the use of DATEDIFF within the WHERE clause. Basic example to find everyone who was 10 years old on a given date - I think the maths is right, but you get the idea anyway! Gave it a quick test, and seems fine. Should be easy enough to adapt to your scenario. If you want to find people between (e.g.) 15 and 17 years old on a given date, then that's also possible with this approach.
-- Assuming @Date2 is set to the date at which you want to calculate someone's age
DECLARE @AgeAtDate INTEGER
SET @AgeAtDate = 10

DECLARE @BornFrom DATETIME
DECLARE @BornUntil DATETIME
SELECT @BornFrom = DATEADD(yyyy, -(@AgeAtDate + 1), @Date2)
SELECT @BornUntil = DATEADD(yyyy, -@AgeAtDate, @Date2)

SELECT DOB
FROM YourTable
WHERE DOB > @BornFrom AND DOB <= @BornUntil
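Adapting the same idea to the question's 15-17 band might look like this (a sketch reusing the question's #TEMP / M names and its @date2 variable):

DECLARE @MinAge INT = 15, @MaxAge INT = 17;
DECLARE @BornFrom DATETIME = DATEADD(yyyy, -(@MaxAge + 1), @date2); -- must be born after this to be at most 17
DECLARE @BornUntil DATETIME = DATEADD(yyyy, -@MinAge, @date2);      -- born on or before this to be at least 15

SELECT DISTINCT M.MID, M.RE
FROM #TEMP
INNER JOIN M ON #TEMP.MID = M.MID
WHERE #TEMP.Property1 = 1
  AND M.DOB > @BornFrom
  AND M.DOB <= @BornUntil;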
An important note to add: for ages calculated from DOB, this approach is also more accurate. Your current implementation only takes the year of birth into account, not the actual day (e.g. someone born on 1st Dec 2009 would show as being 1 year old on 1st Jan 2010, when they are not 1 until 1st Dec 2010).
Hope this helps.
DATEDIFF is quite efficient compared to other methods of handling datetime values, like strings (see this SO answer).
In this case, it sounds like you are going over the same data again and again, which is likely more expensive than using a temp table. For example, statistics will be generated.
One thing you might be able to do to improve performance is to put an index on the temp table's MID column.
Check your execution plan to see if it helps (it may depend on the number of rows in the temp table).
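A minimal sketch of that index (the name is made up; MID as the key column comes from the question's join):

CREATE INDEX IX_TEMP_MID ON #TEMP (MID);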