Understanding why a SQL query is taking so long

I have a fairly large SQL query written. Below is a simplification of the issue I am seeing.
SELECT *
FROM dbo.MyTransactionDetails TDTL
JOIN dbo.MyTransactions TRANS
on TDTL.ID = TRANS.ID
JOIN dbo.Customer CUST
on TRANS.CustID = CUST.CustID
WHERE TDTL.DetailPostTime > CONVERT(datetime, '2015-05-04 10:25:53', 120)
AND TDTL.DetailPostTime < CONVERT(datetime, '2015-05-04 19:25:53', 120)
The MyTransactionDetails table contains about 7 million rows and MyTransactions has about 300k rows.
The above query takes about 10 minutes to run, which is insane. All indexes have been reindexed and there is an index on all the ID columns.
Now if I add the lines below to the WHERE clause, the query takes about 1 second.
AND TRANS.TransBeginTime > CONVERT(datetime, '2015-05-04 10:25:53', 120)
AND TRANS.TransBeginTime < CONVERT(datetime, '2015-05-04 19:25:53', 120)
I know the contents of the database, and TransBeginTime is almost identical to DetailPostTime, so these extra WHERE clauses shouldn't filter out much more than the JOIN already does.
Why is the addition of these so much faster?
The problem is that I cannot use the filter on TransBeginTime, as it is not guaranteed that the transaction detail will be posted on the same date.
EDIT: I should also add that the execution plan says that 50% of the time is taken up by MyTransactionDetails.

The percentages shown in the plan (both estimated and actual) are estimates based on the assumption that the estimated row counts are correct. In bad cases the percentages can be totally wrong, so much so that an operator shown as 1% can actually account for 95% of the time.
To figure out what is actually happening, turn on STATISTICS IO. That will tell you the logical I/O count per table -- and getting that number down usually means the elapsed time goes down too.
You can also look at the actual plan; there are a lot of things that can cause slowness, such as scans, sorts, key lookups, spools, etc. If you include both the STATISTICS IO output and the execution plan (preferably the actual XML, not just the picture), it is a lot easier to figure out what's going wrong.
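For reference, a minimal sketch of how to capture both pieces of information in SSMS; the query is the one from the question, and you would enable "Include Actual Execution Plan" before running it to get the actual plan XML:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Run the query; logical reads per table and CPU/elapsed time appear on the Messages tab
SELECT *
FROM dbo.MyTransactionDetails TDTL
JOIN dbo.MyTransactions TRANS ON TDTL.ID = TRANS.ID
JOIN dbo.Customer CUST ON TRANS.CustID = CUST.CustID
WHERE TDTL.DetailPostTime > CONVERT(datetime, '2015-05-04 10:25:53', 120)
  AND TDTL.DetailPostTime < CONVERT(datetime, '2015-05-04 19:25:53', 120);
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;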


SQL Server - Weird Index Usage

So here is the original query I'm working with
SELECT TOP(10) *
FROM Orders o
WHERE (o.DateAdded >= DATEADD(DAY, - 30, getutcdate())) AND (o.DateAdded <= GETUTCDATE())
ORDER BY o.DateAdded ASC,
o.Price ASC,
o.Quantity DESC
Datatype:
DateAdded - smalldatetime
Price - decimal(19,8)
Quantity - int
I have an index on the Orders table with the same 3 columns in the same order, so when I run this, it's perfect. Time < 0ms, Live Query Statistics shows it only reads the 10 rows. Awesome.
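For reference, a sketch of the index being described; the index name is made up and the sort directions are assumed to match the ORDER BY:
CREATE INDEX IX_Orders_DateAdded_Price_Quantity
    ON Orders (DateAdded ASC, Price ASC, Quantity DESC);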
However, as soon as I add this line to the WHERE clause
AND o.Price BETWEEN convert(decimal(19,8), 0) AND @BuyPrice
It all goes to hell (and unfortunately I need that line). It also behaves the same if it's just o.Price <= @BuyPrice. Live Query Statistics shows the number of rows read is ~30k. It also shows that the o.Price comparison isn't being used as a seek predicate, and I'm having a hard time understanding why it isn't. I've verified @BuyPrice is the right datatype, as I found several articles that discuss issues with implicit conversions. At first I thought it was because I had two ranges, first on DateAdded and then on Price, but I have other queries with multi-column indexes and multiple ranges, and they all perform just fine. I'm absolutely baffled as to why this one has decided to be a burden. I've tried changing the order of columns in the index and changing them from ASC to DESC, but nada.
Would highly appreciate anyone telling me what I'm missing. Thanks
It is impossible for the optimizer to use two range predicates as index seek predicates at the same time.
Think about it: it starts scanning from a certain spot in the index, which is sorted by DateAdded. Within each individual DateAdded value it would now need to seek to a particular Price, start scanning, stop at another Price, and then jump to the next DateAdded.
This is called skip-scanning. It is only efficient when the first predicate matches few distinct values; otherwise it is inefficient, which is why only Oracle has implemented it, not SQL Server.
I think this is due to the TOP 10 which cannot take place before the ORDER BY.
And this ORDER BY must wait until the result set is ready.
Without your additional price range, the TOP 10 can be taken from the existing index directly. But adding the second range will force another operation to be run first.
In short:
First, the filter must get the rows matching both the price range and the date range.
The resulting set is then sorted and the top 10 rows are taken.
Did you try to add a separate index on your price column? This should speed up the first filter.
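A minimal sketch of such a single-column index; the table and column names come from the question and the index name is made up:
CREATE INDEX IX_Orders_Price ON Orders (Price);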
We cannot predict the execution plan in many cases, but you might try to
write an intermediate set, filtered by the date range, into a temp table and proceed from there. You might even create an index on the price column there (depends on the expected row count; probably the best option). A sketch of this follows after this list.
use a CTE to define a set filtered by the date range and use this set to apply your price range. But a CTE is not the same as a temp table; the final execution plan might be the same as before...
use two CTEs to define two sets (one per range) and use an INNER JOIN as a way to get the same result as with WHERE condition1 AND condition2.
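A sketch of the temp-table option mentioned above; the temp table and index names are made up, the other object names and @BuyPrice come from the question:
-- 1. Materialize the date-range slice first
SELECT o.*
INTO #RecentOrders
FROM Orders o
WHERE o.DateAdded >= DATEADD(DAY, -30, GETUTCDATE())
  AND o.DateAdded <= GETUTCDATE();
-- 2. Optionally index the price column of the intermediate set
CREATE INDEX IX_RecentOrders_Price ON #RecentOrders (Price);
-- 3. Apply the price range and the final ordering on the much smaller set
SELECT TOP (10) *
FROM #RecentOrders
WHERE Price BETWEEN CONVERT(decimal(19,8), 0) AND @BuyPrice
ORDER BY DateAdded ASC, Price ASC, Quantity DESC;
A temp table also gets its own statistics, which is one reason it can behave differently from the CTE variants here.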

SQL to group time intervals by arbitrary time period

I need help with this SQL query. I have a big table with the following schema:
time_start (timestamp) - start time of the measurement,
duration (double) - duration of the measurement in seconds,
count_event1 (int) - number of measured events of type 1,
count_event2 (int) - number of measured events of type 2
I am guaranteed that no rows will overlap - in SQL talk, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.
I would like to design an efficient SQL query which would group the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:
SELECT
ROUND(time_start/group_period,0) AS time_period,
SUM(count_event1) AS sum_event1,
SUM(count_event2) AS sum_event2
FROM measurements
GROUP BY time_period;
However, there seems to be a problem. If there is a measurement with a duration greater than the group_period, I would expect such a measurement to be grouped into all the time periods it belongs to, but since the duration is never taken into account, it gets grouped only into the first one. Is there a way to fix this?
Performance is of concern to me because in time, I expect the table size to grow considerably reaching millions, possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or any other optimizations to improve the speed of this query?
Based on Timekiller's advice, I have come up with the following query:
-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.
-- First some configuration:
-- group_period = 3600 -- group by 1 hour (= 3600 seconds)
-- min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
-- max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT
-- Calculate the number of started periods in the given interval in advance.
-- period_count = CEIL((max_time - min_time) / group_period)
SET TIME ZONE UTC;
BEGIN TRANSACTION;
-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
ON COMMIT DROP;
INSERT INTO periods (period_start)
SELECT to_timestamp(min_time + group_period * coefficient)
FROM generate_series(0, period_count) as coefficient;
-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
-- A. [period_start, period_start + group_period]
-- B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start <= period_start + group_period
SELECT
period_start,
COUNT(measurements.*) AS count_measurements,
SUM(count_event1) AS sum_event1,
SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
ON time_start BETWEEN period_start AND (period_start + group_period * INTERVAL '1 second')
GROUP BY period_start;
COMMIT TRANSACTION;
It does exactly what I was going for, so mission accomplished. However, I would still appreciate it if anybody could give me some feedback on the performance of this query under the following conditions:
I expect the measurements table to have about 500-800 million rows.
The time_start column is the primary key and has a unique btree index on it.
I have no guarantees about min_time and max_time. I only know that group period will be chosen so that 500 <= period_count <= 2000.
(This turned out way too large for a comment, so I'll post it as an answer instead).
Adding to my comment on your answer, you probably should go with getting best results first and optimize later if it turns out to be slow.
As for performance, one thing I've learned while working with databases is that you can't really predict performance. Query optimizers in advanced DBMS are complex and tend to behave differently on small and large data sets. You'll have to get your table filled with some large sample data, experiment with indexes and read the results of EXPLAIN, there's no other way.
There are a few things to suggest, though I know the Oracle optimizer much better than Postgres, so some of them might not work.
Things will be faster if all fields you're checking against are included in the index. Since you're performing a left join and periods is the base table, there's probably no reason to index it, since it'll be read fully either way. duration should be included in the index, though, if you're going to go with proper interval overlap - that way Postgres won't have to fetch the row to calculate the join condition; the index will suffice. Chances are it will not even fetch the table rows at all, since it needs no data other than what exists in the indexes. I think it'll perform better if duration is included as the second field of the time_start index, at least in Oracle it would, but IIRC Postgres is able to combine indexes, so perhaps a second, separate index would perform better - you'll have to check it with EXPLAIN.
Indexes and math don't mix well. Even if duration is included in the index, there's no guarantee it will be used for (time_start + duration) - though, again, look at EXPLAIN first. If it's not used, try to either create an expression index (what Oracle calls a function-based index) on time_start + duration, or alter the structure of the table a bit so that time_start + duration is a separate column, and index that column instead. A sketch of the expression index follows below.
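For example, a sketch of such an expression index in PostgreSQL, assuming duration holds seconds in a double precision column as described in the question; the index name is made up:
-- Index on the computed end time of each measurement
CREATE INDEX measurements_end_time_idx
    ON measurements ((time_start + duration * INTERVAL '1 second'));
Remember that a query has to use exactly the same expression for this index to be considered.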
If you don't really need the left join (that is, you're fine with missing empty periods), then use an inner join instead - the optimizer will likely start with the larger table (measurements) and join periods against it, possibly using a hash join instead of nested loops. If you do that, then you should also index your periods table in the same fashion, and perhaps restructure it the same way, so that it contains the period start and end explicitly, as the optimizer has even more options when it doesn't have to perform any operations on the columns.
Perhaps most important: if you have max_time and min_time, USE THEM to limit the measurements before joining! The smaller your sets, the faster it will work.
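For instance, a sketch of the grouping query from above with that pre-filter added inside the join condition, so the LEFT JOIN semantics are preserved; min_time, max_time and group_period are the same placeholders used in the script:
SELECT
    period_start,
    COUNT(measurements.*) AS count_measurements,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
    ON time_start BETWEEN period_start AND (period_start + group_period * INTERVAL '1 second')
   -- limit the scanned range of the big table up front
   AND time_start BETWEEN to_timestamp(min_time) AND to_timestamp(max_time)
GROUP BY period_start;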

SQL Performance with Distinct and Count

I have a stored procedure which contains many SET statements, and it is taking very long to execute. What can I do to increase the performance? I have included one statement here.
SET @VisitedOutlets=(select count (distinct CustomerId) from dbo.VisitDetail
where RouteId = @intRouteID
and CONVERT(VARCHAR(10),VisitDate,111) between CONVERT(VARCHAR(10),@FromDate,111)
and CONVERT(VARCHAR(10),@ToDate,111));
I think your problem comes from the fact that you are using variables in your query. Normally, the optimizer will ... optimize (!) the query for a given (hard-coded) value, let's say id = 123 for instance, whereas it cannot do that when all it sees is a variable.
Let's take a great example from here:
OK,
You are the Optimizer and the Query Plan is a vehicle.
I will give you a query and you have to choose the vehicle.
All the books in the library have a sequential number
My query is Go to the library and get me all the books between 3 and 5
You'd pick a bike, right? Quick, cheap, efficient and big enough to
carry back 3 books.
New query.
Go to the library and get all the books between @x and @y.
Pick the vehicle.
Go ahead.
That's what happens. Do you pick a dump truck in case I ask for books
between 1 and Maxvalue? That's overkill if x=3 and y=5. SQL has to
pick the plan before it sees the numbers.
So your problem is that the optimizer cannot do its job correctly. To let it do its job, you can make it recompile the statement, or update the statistics. See here, here, and here.
So my 2 solutions to your problem would be (a sketch follows this list):
Recompile: OPTION (RECOMPILE)
Update statistics: EXEC sp_updatestats
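For example, a sketch of the recompile option applied to the statement from the question. Note that OPTION (RECOMPILE) attaches to a SELECT, so the SET @Var = (...) form is rewritten as a SELECT assignment; everything else is unchanged:
SELECT @VisitedOutlets = COUNT(DISTINCT CustomerId)
FROM dbo.VisitDetail
WHERE RouteId = @intRouteID
  AND CONVERT(VARCHAR(10), VisitDate, 111)
      BETWEEN CONVERT(VARCHAR(10), @FromDate, 111) AND CONVERT(VARCHAR(10), @ToDate, 111)
OPTION (RECOMPILE);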
Your query is essentially:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail
where RouteId = @intRouteID and
CONVERT(VARCHAR(10), VisitDate, 111) between
CONVERT(VARCHAR(10), @FromDate, 111) and CONVERT(VARCHAR(10), @ToDate, 111);
I think this query can be optimized to take advantage of indexes. One major problem is the date comparison. You should not be doing any conversion for the comparison on VisitDate. So, I would rewrite the query as:
select @VisitedOutlets = count(distinct CustomerId)
from dbo.VisitDetail vd
where vd.RouteId = @intRouteID and
vd.VisitDate >= cast(@FromDate as date) and
vd.VisitDate < dateadd(day, 1, cast(@ToDate as date))
For this query, you want an index on VisitDetail(RouteId, VisitDate, CustomerId). I would also store the constants in the appropriate format, so conversions are not needed in the query itself.
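A sketch of that index; the index name is made up, the key columns are the ones named above:
CREATE INDEX IX_VisitDetail_RouteId_VisitDate_CustomerId
    ON dbo.VisitDetail (RouteId, VisitDate, CustomerId);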
between is dangerous when using dates. Here is an interesting discussion on the topic by Aaron Bertrand.

SQL: Minimising rows in subqueries/partitioning

So here's an odd thing. I have limited SQL access to a database - the most relevant restriction here being that if I create a query, a maximum of 10,000 rows is returned.
Anyway, I've been trying to have a query return individual case details, but only at busy times - say when 50+ cases are attended to in an hour. So, I inserted the following line:
COUNT(CaseNo) OVER (PARTITION BY DATEADD(hh,
DATEDIFF(hh, 0, StartDate), 0)) AS CasesInHour
... And then used this as a subquery, selecting only those cases where CasesInHour >= 50
However, it turns out that the 10,000 rows limit affects the partitioning - when I tried to run over a longer period nothing came up, as it was counting the cases in any given hour from only a (fairly random) much smaller selection.
Can anyone think of a way to get around this limit? The final total returned will be much lower than 10,000 rows, but it will be looking at far more than 10,000 as a starting point.
If this is really MySQL we're talking about, sql_big_selects and max_join_size affect the number of rows examined, not the number of rows "returned". So, you'll need to reduce the number of rows examined by being more selective and using proper indexes.
For example, the following query may be examining over 10,000 rows:
SELECT * FROM stats
To limit the selectivity, you might want to grab only the rows from the last 30 days:
SELECT * FROM stats
WHERE created > DATE_SUB(NOW(), INTERVAL 30 DAY)
However, this only reduces the number of rows examined if there is an index on the created column and the index is selective enough.
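For example, a sketch assuming a MySQL table called stats with a created column, as in the example above; the index name is made up:
CREATE INDEX idx_stats_created ON stats (created);
-- check the plan to see how many rows would actually be examined
EXPLAIN SELECT * FROM stats
WHERE created > DATE_SUB(NOW(), INTERVAL 30 DAY);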

How Does Dateadd Impact the Performance of a SQL Query?

Say for instance I'm joining on a number table to perform some operation between two dates in a subquery, like so:
select n
,(select avg(col1)
from table1
where timestamp between dateadd(minute, 15*n, @ArbitraryDate)
and dateadd(minute, 15*(n+1), @ArbitraryDate))
from numbers
where n < 1200
Would the query perform better if I, say, constructed the date from concatenating varchars than using the dateadd function?
Keeping data in the datetime format using DATEADD is most likely to be quicker
Check this question: Most efficient way in SQL Server to get date from date+time?
The accepted answer (not me!) demonstrates DATEADD over string conversions. I've seen another one, too, many years ago that showed the same.
Be careful with BETWEEN and dates; take a look at How Does Between Work With Dates In SQL Server?
I once optimized a query to run from over 24 hours to 36 seconds. Just don't use date functions or conversions on the column; see here: Only In A Database Can You Get 1000% + Improvement By Changing A Few Lines Of Code
To see which query performs better, execute both queries and look at the execution plans. You can also use STATISTICS IO and STATISTICS TIME to see how many reads there were and how long each query took to execute.
I would NOT go with concatenating varchars.
DATEADD will definitely give better performance than string concatenation and casting to DATETIME.
As always, your best bet would be to profile the two options and determine the best result, as no DB is specified.
Most likely there will be no difference one way or another.
I would run this:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
followed by both variants of your query, so that you see and compare real execution costs.
As long as your predicate calculations do not include references to the columns of the table you're querying, your approach shouldn't matter either way (go for clarity).
If you were to include something from Table1 in the calculation, though, I'd watch out for table scans or covering index scans as it may no longer be sargable.
In any case, check (or post!) the execution plan to confirm.
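To illustrate the sargability point, a hypothetical pair of predicates; the column and variable names are borrowed from the question, and neither form appears in the original post:
-- Not sargable: the function wraps the table column,
-- so an index on the timestamp column cannot be used for a seek
SELECT AVG(col1)
FROM table1
WHERE DATEADD(minute, -15, timestamp) >= @ArbitraryDate;
-- Sargable (equivalent filter): the calculation only touches the variable,
-- so an index range seek on the timestamp column is possible
SELECT AVG(col1)
FROM table1
WHERE timestamp >= DATEADD(minute, 15, @ArbitraryDate);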
Why would you ever use a correlated subquery to begin with? That's going to slow you down far more than DATEADD. They are like cursors: they work row by row.
Will something like this work?
-- join each 15-minute bucket n to its rows, then aggregate per bucket
select n.n, avg(t.col1) as avgcol1
from numbers n
left outer join table1 t
    on t.timestamp >= dateadd(minute, 15 * n.n, @ArbitraryDate)
   and t.timestamp < dateadd(minute, 15 * (n.n + 1), @ArbitraryDate)
where n.n < 1200
group by n.n