SQL Performance with Distinct and Count - sql

I have a stored procedure that contains many SET statements and takes a long time to execute. What can I do to improve its performance? I have included one statement here.
SET @VisitedOutlets = (select count(distinct CustomerId) from dbo.VisitDetail
where RouteId = @intRouteID
and CONVERT(VARCHAR(10), VisitDate, 111) between CONVERT(VARCHAR(10), @FromDate, 111)
and CONVERT(VARCHAR(10), @ToDate, 111));

I think your problem comes from the fact that you are using variables in your query. Normally, the optimizer will optimize the query for a given hard-coded value (say, id = 123), whereas it cannot do so when the value is a variable.
Let's take a great example from here :
OK,
You are the Optimizer and the Query Plan is a vehicle.
I will give you a query and you have to choose the vehicle.
All the books in the library have a sequential number
My query is Go to the library and get me all the books between 3 and 5
You'd pick a bike, right? Quick, cheap, efficient, and big enough to
carry back 3 books.
New query.
Go to the library and get all the books between @x and @y.
Pick the vehicle.
Go ahead.
That's what happens. Do you pick a dump truck in case I ask for books
between 1 and Maxvalue? That's overkill if x=3 and y=5. SQL has to
pick the plan before it sees the numbers.
So your problem is that the optimizer cannot do its job correctly. To let it do its job, you can force a recompile or update statistics. See here, here, and here.
So my two solutions to your problem would be:
Recompile: OPTION (RECOMPILE)
Update statistics: EXEC sp_updatestats
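As a sketch of the first option (variable names taken from the question, written with the standard T-SQL @ prefix), the original statement with a recompile hint might look like this:

```sql
-- OPTION (RECOMPILE) makes the optimizer build a fresh plan using the
-- actual values of @intRouteID, @FromDate and @ToDate at execution time,
-- instead of a generic plan built before the values are known.
SELECT @VisitedOutlets = COUNT(DISTINCT CustomerId)
FROM dbo.VisitDetail
WHERE RouteId = @intRouteID
  AND CONVERT(VARCHAR(10), VisitDate, 111)
      BETWEEN CONVERT(VARCHAR(10), @FromDate, 111)
          AND CONVERT(VARCHAR(10), @ToDate, 111)
OPTION (RECOMPILE);
```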

Your query is essentially:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail
where RouteId = @intRouteID and
      CONVERT(VARCHAR(10), VisitDate, 111) between
      CONVERT(VARCHAR(10), @FromDate, 111) and CONVERT(VARCHAR(10), @ToDate, 111);
I think this query can be optimized to take advantage of indexes. One major problem is the date comparison. You should not be doing any conversion for the comparison on VisitDate. So, I would rewrite the query as:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail vd
where vd.RouteId = @intRouteID and
      vd.VisitDate >= cast(@FromDate as date) and
      vd.VisitDate < dateadd(day, 1, cast(@ToDate as date));
For this query, you want an index on VisitDetail(RouteId, VisitDate, CustomerId). I would also store the constants in the appropriate format, so conversions are not needed in the query itself.
between is dangerous when using dates. Here is an interesting discussion on the topic by Aaron Bertrand.
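A sketch of that index (the index name is illustrative):

```sql
-- Equality column first, then the range column, then the column being
-- counted, so the rewritten query can be answered from the index alone.
CREATE INDEX IX_VisitDetail_RouteId_VisitDate_CustomerId
ON dbo.VisitDetail (RouteId, VisitDate, CustomerId);
```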

Related

SQL Server - Weird Index Usage

So here is the original query I'm working with
SELECT TOP(10) *
FROM Orders o
WHERE (o.DateAdded >= DATEADD(DAY, - 30, getutcdate())) AND (o.DateAdded <= GETUTCDATE())
ORDER BY o.DateAdded ASC,
o.Price ASC,
o.Quantity DESC
Datatype:
DateAdded - smalldatetime
Price - decimal(19,8)
Quantity - int
I have an index on the Orders table with the same 3 columns in the same order, so when I run this, it's perfect. Time < 0ms, Live Query Statistics shows it only reads the 10 rows. Awesome.
However, as soon as I add this line to the WHERE clause
AND o.Price BETWEEN convert(decimal(19,8), 0) AND @BuyPrice
It all goes to hell (and unfortunately I need that line). It behaves the same if it's just o.Price <= @BuyPrice. Live Query Statistics shows the number of rows read is ~30k. It also shows that the o.Price comparison isn't being used as a seek predicate, and I'm having a hard time understanding why. I've verified @BuyPrice is the right datatype, as I found several articles that discuss issues with implicit conversions. At first I thought it was because I had two ranges (first DateAdded, then Price), but I have other queries using multi-column indexes and multiple ranges and they all perform just fine. I'm absolutely baffled as to why this one has decided to be a burden. I've tried changing the order of columns in the index and changing them from ASC to DESC, but nada.
Would highly appreciate anyone telling me what I'm missing. Thanks
It is impossible for the optimizer to seek on two range predicates at the same time.
Think about it: it starts scanning from a certain spot in the index, sorted by DateAdded. Within each individual DateAdded value it would then need to seek to a particular Price, start scanning, stop at another Price, and then jump to the next DateAdded.
This is called skip-scanning. It is only efficient when the first predicate matches few distinct values; otherwise it is inefficient, and because of this only Oracle has implemented it, not SQL Server.
I think this is due to the TOP 10 which cannot take place before the ORDER BY.
And this ORDER BY must wait until the result set is ready.
Without your additional price range, the TOP 10 can be taken from the existing index directly. But adding the second range will force another operation to be run first.
In short:
First your filter must get the rows for the price range together with the date range.
The resulting set is sorted and the top 10 rows are taken.
Did you try to add a separate index on your price column? This should speed up the first filter.
We cannot predict the execution plan in many cases, but you might try to:
- write an intermediate set, filtered by the date range, into a temp table and proceed from there. You might even create an index on the price column there (depends on the expected row count; probably the best option).
- use a CTE to define a set filtered by the date range and use this set to apply your price range. But a CTE is not the same as a temp table, and the final execution plan might be the same as before...
- use two CTEs to define two sets (one per range) and use INNER JOIN as a way to get the same result as with WHERE condition1 AND condition2.
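A minimal sketch of the temp-table variant, reusing the table and column names from the question (@BuyPrice as in the original):

```sql
-- Step 1: materialize the small date-range slice, which the existing
-- index on (DateAdded, Price, Quantity) serves efficiently.
SELECT o.*
INTO #RecentOrders
FROM Orders o
WHERE o.DateAdded >= DATEADD(DAY, -30, GETUTCDATE())
  AND o.DateAdded <= GETUTCDATE();

-- Step 2: index the temp table on the second range column.
CREATE INDEX IX_RecentOrders_Price ON #RecentOrders (Price);

-- Step 3: apply the price filter and the TOP 10 to the much smaller set.
SELECT TOP (10) *
FROM #RecentOrders
WHERE Price <= @BuyPrice
ORDER BY DateAdded ASC, Price ASC, Quantity DESC;
```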

Understanding why SQL query is taking so long

I have a fairly large SQL query written. Below is a simplification of the issue I am seeing.
SELECT *
FROM dbo.MyTransactionDetails TDTL
JOIN dbo.MyTransactions TRANS
on TDTL.ID = TRANS.ID
JOIN dbo.Customer CUST
on TRANS.CustID = CUST.CustID
WHERE TDTL.DetailPostTime > CONVERT(datetime, '2015-05-04 10:25:53', 120)
AND TDTL.DetailPostTime < CONVERT(datetime, '2015-05-04 19:25:53', 120)
The MyTransactionDetails contains about 7 million rows and MyTransactions has about 300k rows.
The above query takes about 10 minutes to run which is insane. All indexes have been reindexed and there is an index on all the ID columns.
Now if I add the lines below to the WHERE clause, the query takes about 1 second.
AND TRANS.TransBeginTime > CONVERT(datetime, '2015-05-05 10:25:53', 120)
AND TRANS.TransBeginTime < CONVERT(datetime, '2015-05-04 19:25:53', 120)
I know the contents of the database, and TransBeginTime is almost identical to DetailPostTime, so these extra WHERE clauses shouldn't filter much more than the JOIN does.
Why does adding them make the query so much faster?
The problem is that I cannot use the filter on TransBeginTime, as it is not guaranteed that the transaction detail will be posted on the same date.
EDIT: I should also add that the execution plan says that 50% of the time is taken up by MyTransactionDetails
The percentages shown in the plan (both estimated and actual) are estimates based on the assumption that the estimated row counts are correct. In bad cases the percentages can be totally wrong, to the point that a reported 1% can actually be 95%.
To figure out what is actually happening, turn on "statistics io". That will tell you the logical I/O count per table -- and getting that down usually means that also the time goes down.
You can also look at the actual plan, and there's a lot of things that can cause slowness, like scans, sorts, key lookups, spools etc. If you include both statistics I/O and execution plan (preferably the actual xml, not just the picture) it is a lot easier to figure out what's going wrong.
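As a sketch, wrapping the query from the question in the statistics switches looks like this:

```sql
-- Capture per-table logical reads and CPU/elapsed time for the slow query.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT *
FROM dbo.MyTransactionDetails TDTL
JOIN dbo.MyTransactions TRANS ON TDTL.ID = TRANS.ID
JOIN dbo.Customer CUST ON TRANS.CustID = CUST.CustID
WHERE TDTL.DetailPostTime > CONVERT(datetime, '2015-05-04 10:25:53', 120)
  AND TDTL.DetailPostTime < CONVERT(datetime, '2015-05-04 19:25:53', 120);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The per-table logical-read counts appear on the Messages tab; the table with the largest count is usually where the tuning effort should go.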

TSQL counting how many occurrences on each day

I have a table that registers visitors to a website. I want to count how many have visited my website on each day.
My problem is that I can't figure out how to cast the datetime value so that the time portion is ignored when making a distinct count.
Can anyone explain this?
SELECT
DateWithNoTimePortion = DateAdd(Day, DateDiff(Day, '19000101', DateCol), '19000101'),
VisitorCount = Count(*)
FROM Log
GROUP BY DateDiff(Day, '19000101', DateCol);
For some reason I assumed you were using SQL Server. If that is not true, please let us know. I think the DateDiff method could work for you in other DBMSes depending on the functions they support, but they may have better ways to do the job (such as TRUNC in Oracle).
In SQL Server the above method is one of the fastest ways of doing the job. There are only two faster ways:
Intrinsic int-conversion rounding :
Convert(datetime, Convert(int, DateCol - '12:00:00.003'))
If using SQL Server 2008 and up, this is the fastest of all (and you should use it if that's what you have):
Convert(date, DateCol)
When SQL Server 2008 is not available, I think the method I posted is the best mix of speed and clarity for future developers looking at the code, avoiding doing magic stuff that isn't clear. You can see the tests backing up my speed claims.
I would do the Group By method (already posted as an answer by @Erik).
However, You can also use OVER and Partition By to accomplish this.
SELECT
DISTINCT CONVERT(Date, VisitDate),
COUNT(*) OVER (PARTITION BY CONVERT(Date, VisitDate)) as Visitors -- Convert Datetime to Date
FROM
MyLogTable
ORDER BY
CONVERT(Date, VisitDate)
It helps to convert your DateTime to a Date in these situations.
This is not as efficient as @Erik's solution; however, it's a good idea to learn the method, because you can do things with Over and Partition By that you can't do with Group By (at least, not efficiently). With that said, this is probably overkill for your situation.

Is SQL DATEDIFF(year, ..., ...) an Expensive Computation?

I'm trying to optimize some horrendously complicated SQL queries because they take too long to finish.
In my queries, I have dynamically created SQL statements with lots of the same functions, so I created a temporary table where each function is only called once instead of many, many times - this cut my execution time by 3/4.
So my question is, can I expect to see much of a difference if say, 1,000 datediff computations are narrowed to 100?
EDIT:
The query looks like this :
SELECT DISTINCT M.MID, M.RE
FROM #TEMP INNER JOIN M ON #TEMP.MID = M.MID
WHERE ( #TEMP.Property1 = 1 ) AND
DATEDIFF( year, M.DOB, @date2 ) >= 15 AND DATEDIFF( year, M.DOB, @date2 ) <= 17
where these are being generated dynamically as strings (put together in bits and pieces) and then executed so that various parameters can be changed along each iteration - mainly the last lines, containing all sorts of DATEDIFF queries.
There are about 420 queries like this where these datediffs are being calculated like so. I know that I can pull them all into a temp table easily (1,000 datediffs becomes 50) - but is it worth it, will it make any difference in seconds? I'm hoping for an improvement better than in the tenths of seconds.
It depends on exactly what you are doing to be honest as to the extent of the performance hit.
For example, if you are using DATEDIFF (or indeed any other function) within a WHERE clause, then this will be a cause of poorer performance as it will prevent an index being used on that column.
e.g. basic example, finding all records in 2009
WHERE DATEDIFF(yyyy, DateColumn, '2009-01-01') = 0
would not make good use of an index on DateColumn. Whereas a better solution, providing optimal index usage would be:
WHERE DateColumn >= '2009-01-01' AND DateColumn < '2010-01-01'
I recently blogged about the difference this makes (with performance stats/execution plan comparisons), if you're interested.
That would be costlier than, say, returning DATEDIFF as a column in the result set.
I would start by identifying the individual queries that are taking the most time. Check the execution plans to see where the problem lies and tune from there.
Edit:
Based on the example query you've given, here's an approach you could try out to remove the use of DATEDIFF within the WHERE clause. Basic example to find everyone who was 10 years old on a given date - I think the maths is right, but you get the idea anyway! Gave it a quick test, and seems fine. Should be easy enough to adapt to your scenario. If you want to find people between (e.g.) 15 and 17 years old on a given date, then that's also possible with this approach.
-- Assuming @Date2 is set to the date at which you want to calculate someone's age
DECLARE @AgeAtDate INTEGER
SET @AgeAtDate = 10
DECLARE @BornFrom DATETIME
DECLARE @BornUntil DATETIME
SELECT @BornFrom = DATEADD(yyyy, -(@AgeAtDate + 1), @Date2)
SELECT @BornUntil = DATEADD(yyyy, -@AgeAtDate, @Date2)
SELECT DOB
FROM YourTable
WHERE DOB > @BornFrom AND DOB <= @BornUntil
An important note: for ages calculated from DOB, this approach is also more accurate. Your current implementation only takes the year of birth into account, not the actual day (e.g. someone born on 1st Dec 2009 would show as being 1 year old on 1st Jan 2010, when they are not 1 until 1st Dec 2010).
Hope this helps.
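Adapting the same idea to the 15-to-17 range from the question is a small change (a sketch; table and variable names follow the original query, written with the usual @ prefix). Note this computes actual-day ages, whereas the original DATEDIFF(year, ...) only counts year boundaries, so results can differ near birthdays:

```sql
-- Everyone who is between 15 and 17 (inclusive) on @date2,
-- expressed as a sargable range on M.DOB so an index on DOB can be used.
DECLARE @BornFrom DATETIME
DECLARE @BornUntil DATETIME
SET @BornFrom  = DATEADD(yyyy, -18, @date2)  -- born after this: not yet 18
SET @BornUntil = DATEADD(yyyy, -15, @date2)  -- born on/before this: already 15

SELECT DISTINCT M.MID, M.RE
FROM #TEMP INNER JOIN M ON #TEMP.MID = M.MID
WHERE #TEMP.Property1 = 1
  AND M.DOB > @BornFrom
  AND M.DOB <= @BornUntil
```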
DATEDIFF is quite efficient compared to other methods of handling of datetime values, like strings. (see this SO answer).
In this case, it sounds like you are going over the same data again and again, which is likely more expensive than using a temp table. For example, statistics will be generated on the temp table.
One thing you might be able do to improve performance might be to put an index on the temp table on MID.
Check your execution plan to see if it helps (may depend on the number of rows in the temp table).
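As a sketch (the index name is illustrative), indexing the temp table on the join column before running the query would be:

```sql
-- Hypothetical index on the temp table's join column, so the join to
-- M.MID can seek instead of scan.
CREATE INDEX IX_TEMP_MID ON #TEMP (MID);
```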

How Does Dateadd Impact the Performance of a SQL Query?

Say for instance I'm joining on a number table to perform some operation between two dates in a subquery, like so:
select n
,(select avg(col1)
  from table1
  where timestamp between dateadd(minute, 15*n, @ArbitraryDate)
                      and dateadd(minute, 15*(n+1), @ArbitraryDate))
from numbers
where n < 1200
Would the query perform better if I, say, constructed the date by concatenating varchars rather than using the dateadd function?
Keeping data in the datetime format using DATEADD is most likely to be quicker
Check this question: Most efficient way in SQL Server to get date from date+time?
The accepted answer (not me!) demonstrates DATEADD over string conversions. I've seen another from many years ago that showed the same.
Be careful with between and dates, take a look at How Does Between Work With Dates In SQL Server?
I once optimized a query to run from over 24 hours down to 36 seconds. Just don't use date functions or conversions on the column; see here: Only In A Database Can You Get 1000%+ Improvement By Changing A Few Lines Of Code
To see which query performs better, execute both queries and look at the execution plans. You can also use STATISTICS IO and STATISTICS TIME to get the number of reads and the time it took to execute each query.
I would NOT go with concatenating varchars.
DateAdd will definitely give better performance than string concatenation and casting to DATETIME.
As always, your best bet would be to profile the two options and determine the best result, as no DB is specified.
Most likely there will be no difference one way or another.
I would run this:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
followed by both variants of your query, so that you see and compare real execution costs.
As long as your predicate calculations do not include references to the columns of the table you're querying, your approach shouldn't matter either way (go for clarity).
If you were to include something from Table1 in the calculation, though, I'd watch out for table scans or covering index scans as it may no longer be sargable.
In any case, check (or post!) the execution plan to confirm.
Why would you ever use a correlated subquery to begin with? That's going to slow you down far more than DATEADD. They are like cursors: they work row by row.
Will something like this work? (Note the derived table needs its own join to numbers, since it cannot reference the outer n.)
select n.n, t.avgcol1
from numbers n
left outer join
(
    select nums.n, avg(t1.col1) as avgcol1
    from numbers nums
    join table1 t1
        on t1.timestamp between dateadd(minute, 15*nums.n, @ArbitraryDate)
                            and dateadd(minute, 15*(nums.n+1), @ArbitraryDate)
    group by nums.n
) t
on n.n = t.n
where n.n < 1200