I have a table that registers visitors to a website. I want to count how many have visited my website on each day.
My problem is that I can't figure out how to typecast the datetime value so that it doesn't use the entire date field when making a distinct count.
Can anyone explain this?
SELECT
DateWithNoTimePortion = DateAdd(Day, DateDiff(Day, '19000101', DateCol), '19000101'),
VisitorCount = Count(*)
FROM Log
GROUP BY DateDiff(Day, '19000101', DateCol);
For some reason I assumed you were using SQL Server. If that is not true, please let us know. I think the DateDiff method could work for you in other DBMSes depending on the functions they support, but they may have better ways to do the job (such as TRUNC in Oracle).
In SQL Server the above method is one of the fastest ways of doing the job. There are only two faster ways:
Intrinsic int-conversion rounding:
Convert(datetime, Convert(int, DateCol - '12:00:00.003'))
If using SQL Server 2008 and up, this is the fastest of all (and you should use it if that's what you have):
Convert(date, DateCol)
When SQL Server 2008 is not available, I think the method I posted is the best mix of speed and clarity for future developers looking at the code, as it avoids magic that isn't obvious. You can see the tests backing up my speed claims.
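For completeness, here is how the 2008+ conversion slots into the visitor-count query (a sketch, reusing the Log table and DateCol column from the query above):
-- SQL Server 2008+: group on the date-typed value directly, so there is no time portion to strip
SELECT
    Convert(date, DateCol) AS DateWithNoTimePortion,
    Count(*)               AS VisitorCount
FROM Log
GROUP BY Convert(date, DateCol);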
I would go with the GROUP BY method (already posted as an answer by @Erik).
However, you can also use OVER and PARTITION BY to accomplish this.
SELECT DISTINCT
    CONVERT(date, VisitDate),                                          -- convert the datetime to a date
    COUNT(*) OVER (PARTITION BY CONVERT(date, VisitDate)) AS Visitors
FROM MyLogTable
ORDER BY CONVERT(date, VisitDate)
It helps to convert your DateTime to a Date in these situations.
This is not as efficient as @Erik's solution; however, it's a good idea to learn the method, because you can do things with OVER and PARTITION BY that you can't do with GROUP BY (at least, not efficiently). With that said, this is probably overkill for your situation.
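As an illustration of what OVER buys you, the sketch below keeps every individual log row and shows that day's total next to it, which a plain GROUP BY would have to collapse (same MyLogTable / VisitDate names assumed as above):
-- Each visit row is kept; the day's total is repeated on every row for that day
SELECT
    VisitDate,
    COUNT(*) OVER (PARTITION BY CONVERT(date, VisitDate)) AS VisitorsThatDay
FROM MyLogTable
ORDER BY VisitDate;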
Related
I'm running the following code on a dataset of 100M to test some things out before I eventually join the entire range (not just the top 10) on another table to make it even smaller.
SELECT TOP 10 *
FROM Table
WHERE CONVERT(datetime, DATE, 112) BETWEEN '2020-07-04 00:00:00' AND '2020-07-04 23:59:59'
The table isn't mine but a client's, so unfortunately I'm not responsible for the data types of the columns. The DATE column, along with the rest of the data, is in varchar. As for the dates in the BETWEEN clause, I just put in a relatively small range for testing.
I have heard that CONVERT shouldn't be in the WHERE clause, but I need to convert it to dates in order to filter. What is the proper way of going about this?
Going to summarise my comments here, as they are "second class citizens" and thus could be removed.
Firstly, the reason your query is slow is the CONVERT on the DATE column in your WHERE. Applying functions to a column in your WHERE will almost always make your query non-SARGable (there are some exceptions, but that doesn't make them a good idea). As a result, the entire table must be scanned to find rows that satisfy your WHERE; it can't use an index to help.
The real problem, therefore, is that you are storing a date (and time) value in your table as a non-date (and time) datatype, presumably a (n)varchar. This is, in truth, a major design flaw and needs to be fixed. String values aren't validated to be valid dates, so someone could easily insert the "date" '20210229' or even '20211332'. Fixing the design not only stops this, it also makes your data smaller (a date is 3 bytes in size, whereas a varchar(8) would be 10 bytes), and you could pass strongly typed date and time values to your query and it would be SARGable.
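If you are able to change the schema, the fix can be as small as the sketch below. This assumes every existing value really is a valid yyyyMMdd string (validate first) and that no index, constraint, or other code depends on the column staying character data:
-- One-off cleanup: change the column to a real date type.
-- Existing 'yyyyMMdd' strings are implicitly converted during the ALTER.
ALTER TABLE dbo.[Table]
    ALTER COLUMN [DATE] date;   -- add NULL / NOT NULL to match the current column definition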
"Fortunately" it appears your data is in the style code 112, which is yyyyMMdd; this at least means that the ordering of the dates is the same as if it were a strongly typed date (and time) data type. This means that the below query will work and return the results you want:
SELECT TOP 10 * --Ideally don't use * and list your columns properly
FROM dbo.[Table]
WHERE [DATE] >= '20200704' AND [DATE] < '20200705'
ORDER BY {Some Column};
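If the column isn't already indexed, a supporting index (the name below is made up for illustration) is what allows that >= / < range to be answered with a seek instead of a scan:
CREATE INDEX IX_Table_Date ON dbo.[Table] ([DATE]);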
You can use something like this to get better performance:
SELECT TOP 10 *
FROM Table
WHERE cast(DATE as date) BETWEEN '2020-07-04' AND '2020-07-04' and cast(DATE as time) BETWEEN '00:00:00' AND '23:59:59'
There is no need to include the time portion if you want to search a full day.
Which method is most efficient when comparing PARTS of date/datetime values? Example for comparing month of datetimes:
where insdate = DATEADD(month, DATEDIFF(month, 0, @insdate), 0)
or
where year(insdate) = year(@insdate) and month(insdate) = month(@insdate)
I'm using SQL Server.
I disagree with Damien_The_Unbeliever's assertion that you should just use whichever reads cleaner, as there are objective reasons why one approach will be better than the other. The most pertinent of these is what is known as SARGability.
In essence this refers to whether SQL Server can use your values in the efficient manners it is designed to do, such as utilising indexes.
The differences between your two examples are nicely outlined here.
In short, when functions or calculations are applied to the column itself in your conditions (as in your second example), SQL Server has to evaluate them for every single row, so it cannot seek an index on that column. If you keep the column bare and apply any functions to the variable side instead, the predicate stays SARGable; even if you don't see a significant benefit immediately, applying the principles of SARGability from the off means you are at least in a position to realise those benefits later on if required.
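For the month comparison in the question, a SARGable version keeps insdate bare and turns the test into a range (a sketch, reusing @insdate; note that, unlike the literal equality in the first example, this matches any time within the month):
-- every row whose insdate falls in the same calendar month as @insdate,
-- without wrapping the column in a function
WHERE insdate >= DATEADD(month, DATEDIFF(month, 0, @insdate), 0)
  AND insdate <  DATEADD(month, DATEDIFF(month, 0, @insdate) + 1, 0)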
In my opinion, the best way to implement a Year or YearMonth check is to convert the date to the YYYYMMDD format and then work with that.
This is an example:
Filter by YearMonthDay
SELECT * FROM myTable
WHERE CONVERT(VARCHAR(8), MyField, 112) = 20170607
Filter by YearMonth
SELECT * FROM myTable
WHERE CONVERT(VARCHAR(8), MyField, 112) / 100 = 201706
Filter by Year
SELECT * FROM myTable
WHERE CONVERT(VARCHAR(8), MyField, 112) / 10000 = 2017
For sure this performs better than using the Year(), Month(), DateAdd(), and DateDiff() functions.
I have a stored procedure which contains many SET statements and is taking very long to execute. What can I do to improve its performance? One of the statements is included here.
SET @VisitedOutlets = (select count(distinct CustomerId) from dbo.VisitDetail
    where RouteId = @intRouteID
    and CONVERT(VARCHAR(10), VisitDate, 111) between CONVERT(VARCHAR(10), @FromDate, 111)
    and CONVERT(VARCHAR(10), @ToDate, 111));
I think your problem comes from the fact that you are using variables in your query. Normally, the optimizer will optimize the query for a given (hard-coded) value (say, id = 123), whereas it cannot do that when the value is only known at run time through a variable.
Let's take a great example from here:
OK,
You are the Optimizer and the Query Plan is a vehicle.
I will give you a query and you have to choose the vehicle.
All the books in the library have a sequential number
My query is Go to the library and get me all the books between 3 and 5
You'd pick a bike right, quick, cheap, efficient and big enough to
carry back 3 books.
New query.
Go to the library and get all the books between @x and @y.
Pick the vehicle.
Go ahead.
That's what happens. Do you pick a dump truck in case I ask for books
between 1 and Maxvalue? That's overkill if x=3 and y=5. SQL has to
pick the plan before it sees the numbers.
So your problem is that the optimizer cannot do its job correctly. To let it do its job, you can force a recompile or update the statistics. See here, here, and here.
So my two solutions to your problem would be:
Recompile: OPTION(RECOMPILE) (see the sketch after this list)
Update statistics: EXEC sp_updatestats
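Applied to the statement from your procedure, one way to attach the hint is to rewrite the SET as a SELECT assignment (a sketch; the predicate is left as you wrote it, although the answer below shows a better way to express it):
-- Rewritten as a SELECT assignment so the query-level hint can be attached
SELECT @VisitedOutlets = count(distinct CustomerId)
FROM dbo.VisitDetail
WHERE RouteId = @intRouteID
  AND CONVERT(VARCHAR(10), VisitDate, 111) BETWEEN CONVERT(VARCHAR(10), @FromDate, 111)
                                               AND CONVERT(VARCHAR(10), @ToDate, 111)
OPTION (RECOMPILE);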
Your query is essentially:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail
where RouteId = @intRouteID and
      CONVERT(VARCHAR(10), VisitDate, 111) between
      CONVERT(VARCHAR(10), @FromDate, 111) and CONVERT(VARCHAR(10), @ToDate, 111);
I think this query can be optimized to take advantage of indexes. One major problem is the date comparison. You should not be doing any conversion for the comparison on VisitDate. So, I would rewrite the query as:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail vd
where vd.RouteId = @intRouteID and
      vd.VisitDate >= cast(@FromDate as date) and
      vd.VisitDate < dateadd(day, 1, cast(@ToDate as date))
For this query, you want an index on VisitDetail(RouteId, VisitDate, CustomerId). I would also store the constants in the appropriate format, so conversions are not needed in the query itself.
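A sketch of that index (the name is made up for illustration):
CREATE INDEX IX_VisitDetail_RouteId_VisitDate
    ON dbo.VisitDetail (RouteId, VisitDate, CustomerId);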
BETWEEN is dangerous when used with dates. Here is an interesting discussion of the topic by Aaron Bertrand.
I'm trying to optimize up some horrendously complicated SQL queries because it takes too long to finish.
In my queries, I have dynamically created SQL statements with lots of the same functions, so I created a temporary table where each function is only called once instead of many, many times - this cut my execution time by 3/4.
So my question is, can I expect to see much of a difference if say, 1,000 datediff computations are narrowed to 100?
EDIT:
The query looks like this:
SELECT DISTINCT M.MID, M.RE
FROM #TEMP
INNER JOIN M ON #TEMP.MID = M.MID
WHERE ( #TEMP.Property1 = 1 )
  AND DATEDIFF(year, M.DOB, @date2) >= 15
  AND DATEDIFF(year, M.DOB, @date2) <= 17
where these are being generated dynamically as strings (put together in bits and pieces) and then executed so that various parameters can be changed along each iteration - mainly the last lines, containing all sorts of DATEDIFF queries.
There are about 420 queries like this where these datediffs are being calculated like so. I know that I can pull them all into a temp table easily (1,000 datediffs becomes 50) - but is it worth it, will it make any difference in seconds? I'm hoping for an improvement better than in the tenths of seconds.
It depends on exactly what you are doing to be honest as to the extent of the performance hit.
For example, if you are using DATEDIFF (or indeed any other function) within a WHERE clause, then this will be a cause of poorer performance as it will prevent an index being used on that column.
e.g. basic example, finding all records in 2009
WHERE DATEDIFF(yyyy, DateColumn, '2009-01-01') = 0
would not make good use of an index on DateColumn. Whereas a better solution, providing optimal index usage would be:
WHERE DateColumn >= '2009-01-01' AND DateColumn < '2010-01-01'
I recently blogged about the difference this makes (with performance stats/execution plan comparisons), if you're interested.
Using DATEDIFF in the WHERE clause like that is costlier than, say, returning DATEDIFF as a column in the result set.
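In other words, something like the sketch below is harmless: the function only runs on rows that have already been selected, so it has no bearing on index usage (names borrowed from the query in your question):
SELECT M.MID,
       M.RE,
       DATEDIFF(year, M.DOB, @date2) AS YearsDiff  -- counts year boundaries crossed, not exact age
FROM M;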
I would start by identifying the individual queries that are taking the most time. Check the execution plans to see where the problem lies and tune from there.
Edit:
Based on the example query you've given, here's an approach you could try out to remove the use of DATEDIFF within the WHERE clause. Basic example to find everyone who was 10 years old on a given date - I think the maths is right, but you get the idea anyway! Gave it a quick test, and seems fine. Should be easy enough to adapt to your scenario. If you want to find people between (e.g.) 15 and 17 years old on a given date, then that's also possible with this approach.
-- Assuming @Date2 is set to the date at which you want to calculate someone's age
DECLARE @AgeAtDate INTEGER
SET @AgeAtDate = 10
DECLARE @BornFrom DATETIME
DECLARE @BornUntil DATETIME
SELECT @BornFrom  = DATEADD(yyyy, -(@AgeAtDate + 1), @Date2)
SELECT @BornUntil = DATEADD(yyyy, -@AgeAtDate, @Date2)
SELECT DOB
FROM YourTable
WHERE DOB > @BornFrom AND DOB <= @BornUntil
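Adapted to the 15-to-17 band from your generated query, a sketch of the same idea (the @MinAge / @MaxAge variables are made up for illustration) would be:
-- Everyone aged 15, 16 or 17 on @Date2:
-- born after (@Date2 - 18 years) and on or before (@Date2 - 15 years)
DECLARE @MinAge INT = 15, @MaxAge INT = 17;
SELECT DISTINCT M.MID, M.RE
FROM #TEMP
INNER JOIN M ON #TEMP.MID = M.MID
WHERE #TEMP.Property1 = 1
  AND M.DOB >  DATEADD(yyyy, -(@MaxAge + 1), @Date2)
  AND M.DOB <= DATEADD(yyyy, -@MinAge, @Date2);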
An important note to add: for ages calculated from DOB, this approach is also more accurate. Your current implementation only takes the year of birth into account, not the actual day (e.g. someone born on 1st Dec 2009 would show as being 1 year old on 1st Jan 2010, when they are not 1 until 1st Dec 2010).
Hope this helps.
DATEDIFF is quite efficient compared to other methods of handling of datetime values, like strings. (see this SO answer).
In this case, it sounds like you are going over the same data again and again, which is likely more expensive than using a temp table; the temp table, for example, will have statistics generated for it.
One thing you might be able to do to improve performance is to put an index on the temp table on MID.
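A sketch of that (the index name is made up):
CREATE INDEX IX_TEMP_MID ON #TEMP (MID);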
Check your execution plan to see if it helps (may depend on the number of rows in the temp table).
Say for instance I'm joining on a number table to perform some operation between two dates in a subquery, like so:
select n
,(select avg(col1)
from table1
where timestamp between dateadd(minute, 15*n, @ArbitraryDate)
and dateadd(minute, 15*(n+1), @ArbitraryDate))
from numbers
where n < 1200
Would the query perform better if I, say, constructed the dates by concatenating varchars rather than using the DATEADD function?
Keeping the data in datetime format and using DATEADD is most likely to be quicker.
Check this question: Most efficient way in SQL Server to get date from date+time?
The accepted answer (not me!) demonstrates DATEADD over string conversions. I've seen another example, too, from many years ago that showed the same.
Be careful with between and dates, take a look at How Does Between Work With Dates In SQL Server?
I once optimized a query to run from over 24 hours down to 36 seconds. Just don't use date functions or conversions on the column; see here: Only In A Database Can You Get 1000% + Improvement By Changing A Few Lines Of Code
To see which query performs better, execute both and look at the execution plans; you can also use STATISTICS IO and STATISTICS TIME to get the number of reads and the time each query took to execute.
I would NOT go with concatenating varchars.
DATEADD will definitely give better performance than string concatenation and casting to DATETIME.
As always, your best bet would be to profile the two options and determine the best result, as no DB is specified.
Most likely there will be no difference one way or the other.
I would run this:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
followed by both variants of your query, so that you see and compare real execution costs.
As long as your predicate calculations do not include references to the columns of the table you're querying, your approach shouldn't matter either way (go for clarity).
If you were to include something from Table1 in the calculation, though, I'd watch out for table scans or covering index scans as it may no longer be sargable.
In any case, check (or post!) the execution plan to confirm.
Why would you ever use a correlated subquery to begin with? That's going to slow you up far more than DATEADD. They are like cursors: they work row by row.
Will something like this work?
select n.n, t.avgcol1
from numbers n
left outer join
(
    -- bucket each row into a 15-minute slot counted from @ArbitraryDate
    select datediff(minute, @ArbitraryDate, timestamp) / 15 as n,
           avg(col1) as avgcol1
    from table1
    where timestamp >= @ArbitraryDate
    group by datediff(minute, @ArbitraryDate, timestamp) / 15
) t
    on n.n = t.n
where n.n < 1200