I have a simple SELECT statement which selects data from a SQL Server 2000 (so old) table with about 10-20 million rows, like this:
@startDate = '2014-01-25' -- yyyy-mm-dd
@endDate = '2014-02-20'
SELECT
Id, 6-7 other columns
FROM
Table1 as t1
LEFT OUTER JOIN
Table2 as t2 ON t1.Code = t2.Code
WHERE
t1.Id = 'G59' -- yes, it's a varchar
AND (t1.Entry_Date >= @startDate AND t1.Entry_Date < @endDate)
This gives me about 40K rows in about 10 seconds. But if I set @startDate = '2014-01-30', keeping @endDate the same ALWAYS, the query takes about 2 min 30 sec.
To produce the same number of rows, I tried it with 01-30 again and it took 2 min 48 seconds.
I am surprised to see the difference; I was not expecting it to be so big. Rather, I was expecting it to take the same time or less for a smaller date range.
What could be the reason for this, and how do I fix it?
Have you recently inserted and/or deleted a large number of rows? It could be that the statistics on the table's indices are out of date, and thus the query optimizer will go for an "index seek + key lookup" scenario on the smaller date range - but that turns out to be slower than just doing a table/clustered index scan.
I would recommend updating the statistics (see this TechNet article on how to update the statistics) and trying again - any improvement?
The query optimizer uses statistics to determine whether it's faster to just do a table scan (read all the table's data pages and select the rows that match), or whether it's faster to search for the value in an index. That index typically doesn't contain all the data, so once a match is found, a key lookup needs to be performed against the table to get at the data - an expensive operation that is only viable for small sets of rows. If out-of-date statistics "mislead" the query optimizer, it might choose a suboptimal execution plan.
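A minimal sketch of what updating the statistics could look like, using Table1 from the query above (the FULLSCAN option is optional but gives the most accurate statistics):
-- Refresh statistics for Table1 (from the query above)
UPDATE STATISTICS Table1 WITH FULLSCAN
-- Or refresh statistics for every table in the database
EXEC sp_updatestats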
Related
I have the following query that takes approximately 2 minutes to return an output.
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
I noticed that if I use this, it takes less than a second
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= '2015-06-22'
Why does that happen? I need the first query since the date can vary from day to day.
When you use a literal value in a query SQL Server will generate a query plan that is optimised for that specific value.
For example if you have a non clustered index on createdDate, the query optimiser can get a good estimate on the number of rows with createdDate >= '2015-06-22'. If this a small proportion of the rows on the table the optimal query will find the matching records in the createdDate index, then lookup the rest of the selected columns for the matching rows from the table.
When you use a variable in the WHERE clause, SQL Server generates a single query plan that will be used for all possible values of @today. The plan generated is intended to be optimal for an arbitrary value of @today, but a single plan can only be optimal for a certain number of rows being selected.
Say the estimate for the average number of rows selected is half the number of rows in the table, in this case it is estimated to be more efficient to scan the entire table to filter the records rather than filtering the createdDate index and then having to do a large number of lookups on the table to get the rest of the selected columns.
The issue is that SQL Server is using a single query plan for queries that can have radically different row counts. The reason a single query plan is used for all values of @today is that it is often more expensive to compile an optimal plan than to just run a sub-optimal one. In your example this obviously isn't the case.
There are ways to change this behaviour:
1) You can get SQL Server to generate a single plan but optimise it for a pre-determined value of @today:
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
OPTION (OPTIMIZE FOR (@today = '2015-06-22'))
This should produce the same plan as using the con.createdDate >= '2015-06-22' predicate and could provide a nice solution if your application is always going to be querying for records after a pre-determined date.
2) Specifying the RECOMPILE option will cause a new query plan to be generated every time the query is run; this allows a plan optimised for the specific @today value:
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
OPTION (RECOMPILE)
This should produce the same plan as using the con.createdDate >= '2015-06-22' predicate.
WARNING: RECOMPILING QUERIES CAN ALSO BE SLOW. USING THE RECOMPILE OPTION CAN DEGRADE PERFORMANCE. USE AS A LAST RESORT. ETC.
This is just a workaround. Apply this command:
UPDATE STATISTICS GAS_consumptions WITH FULLSCAN, NORECOMPUTE
Run some queries for a few minutes.
Measure again.
Keep in mind that NORECOMPUTE will require running this command on a regular basis, because it stops SQL Server from generating statistics automatically in the future.
In a similar case I had, this was "the solution". I don't know why it happens.
I wrote a T-SQL query for an ad-hoc report that is reading off a very large table (500 million records) that's indexed (clustered) on Date/Time.
The query runs terribly slow on certain date ranges versus others where it's lightning fast. I'm trying to figure out why it's doing that.
I took 2 sets of date ranges: one for (04-03-2014 to 04-04-2014) and the other for (05-03-2014 to 05-04-2014) - basically one month apart between the two test results. The first range is fast, returning in a mere 10 seconds or so, whereas the other hangs forever.
Looking at the data sets to see if one is significantly larger than the other, I analyzed 2 tables in my query as a form of unit testing each segment. TableA is the first table I'm selecting from, with the big data. TableB is the table joined later in the query, where I LEFT JOIN TableA ON TableB:
TableA (04-03) = 239,806 Records (1 Second Query Time)
TableB (04-03) = 6,569 Records (0 Second Query Time)
TableA (05-03) = 203,535 Records (8 Second Query Time)
TableB (05-03) = 3,388 Records (0 Second Query Time)
As you can see, TableA for the 04 month is faster and returns more records than TableA for the 05 month, which has fewer records but slower times.
Now for the query itself (I'm working on updating that). Here is some pseudo code:
CTE Query
SELECT PRODUCTS (TableA - 100K+ Records)
LEFT JOIN PRODUCT TABLE (1K Records)
FILTERED BY [Time], LIKE Statement off LEFT JOIN
SELECT FROM ( --SUBQUERY
SELECT FROM CTE Query
LEFT JOIN SALES (TableB - 1K+ Records)
JOIN ON [User-ID]
)
PIVOT SUBQUERY (18 Columns in Pivot)
Product is indexed (Clustered) on [Time], which is used in the query.
Sales is joined on [User-ID], which has a non-clustered index on SALES (TableB).
The bottleneck looks to be when I join SALES within the SUBQUERY.
Optimizations
I looked at index fragmentation to see if that was the cause. I noticed the Product table has a non-clustered index that is 85% fragmented, which could be the cause. I rebuilt that last night and saw no change. The Sales table also had a smaller fragmented index that was rebuilt too.
I also rebuilt the clustered index where there was a low percentage of fragmentation on disk. After rebuilding the index, I had to restart SQL Server for an unrelated task, and the query now runs at the same speed on the bad date range as on all other ranges. I will attribute the fix to the index rebuild, as that makes the most sense given that the same query was faster on some date ranges than on others even where the record sets were larger.
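For reference, an index rebuild on SQL Server 2008 looks roughly like this (the index name below is a placeholder, not the real one):
-- Rebuild a single (hypothetically named) index on the Product table
ALTER INDEX IX_Product_Time ON Product REBUILD
-- Or rebuild every index on the Sales table
ALTER INDEX ALL ON Sales REBUILD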
I'm using SQL Server 2008.
I need your advice on why these two queries take similar time (around 52 seconds for over 2 million rows):
Query 1:
DBCC DROPCLEANBUFFERS
DECLARE @curr INT
SET @curr = YEAR(GETDATE())
SELECT MAX([Date])
FROM DB_Item
WHERE YEAR([Date]) = @curr
Query 2:
DBCC DROPCLEANBUFFERS
SELECT MAX([Date])
FROM DB_Item
Using the Actual Execution Plan, I see both queries do a Clustered Index Scan.
So why is that, and is there another way to quickly get the maximum Date in a table?
Your help is highly appreciated.
Thanks.
For the second query, you can speed it up by adding an index on the date column.
For the first query, you need to make two changes. First create an index on the date column, and then change the query to use a range predicate instead of a function on the left-hand side of the equals sign. Search for dates between January 1 12:00 am and December 31 11:59:59 pm of the target year. That way SQL Server can use the index.
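A sketch of that rewrite, using the table and column names from the question (the half-open upper bound is a variation that also avoids any rounding issues at the year boundary):
DECLARE @curr INT
SET @curr = YEAR(GETDATE())

DECLARE @yearStart DATETIME, @yearEnd DATETIME
SET @yearStart = DATEADD(year, @curr - 1900, 0) -- January 1 of the target year
SET @yearEnd = DATEADD(year, @curr - 1899, 0)   -- January 1 of the following year

SELECT MAX([Date])
FROM DB_Item
WHERE [Date] >= @yearStart
  AND [Date] < @yearEnd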
A clustered index scan is a table scan because the actual data is the lowest level of the clustered index. So in this case both queries are looking at all rows.
Therefore, a nonclustered index on the Date column will help the 2nd query.
In this case, it will also help the first query because YEAR is SARGable (? - can't find where I read this). This is quite rare in SQL Server: usually functions on predicate columns mean indexes won't be used.
I have this query:
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
The biggest table here is RESULT, which contains 11.1M records. The other 2 tables have about 1M each.
This query runs slowly (more than 10 minutes) and returns about 800 records. The execution plan shows a clustered index scan (over its PRIMARY KEY, result.result_number, which doesn't actually take part in the query) over all 11M records.
RESULT.TEST_NUMBER is a clustered primary key.
If I change 2010-03-17 09:00 to 2010-03-17 10:00, I get about 40 records; it executes in 300ms, and the plan shows an index seek (over the result.test_number index).
If I replace * in the SELECT clause with result.test_number (covered by an index), then everything becomes fast in the first case too. This points to HDD I/O issues, but doesn't explain the change of plan.
So, any ideas?
UPDATE:
sampled_date is in table sample and is covered by an index.
Other fields from this query: test.sample_number is covered by an index, and so is result.test_number.
UPDATE 2:
Obviously, SQL Server for some reason doesn't want to use the index.
I did a small experiment: I removed the INNER JOIN with result, selected all the test.test_number values, and after that did
SELECT * FROM RESULT WHERE TEST_NUMBER IN (...)
This, of course, works fast. But I can't work out what the difference is, or why the query optimizer chooses such an inappropriate way to select the data in the first case.
UPDATE 3:
After backing up the database and restoring it under a new name, both queries run fast as expected, even over much larger ranges...
So - are there any special commands to clean or optimize, whatever, that could be relevant to this? :-(
A couple of things to try:
Update statistics
Add hints to the query about what index to use (in SQL Server you might say WITH (INDEX(myindex)) after specifying a table; see the sketch below)
EDIT: You noted that copying the database made it work, which tells me that the index statistics were out of date. You can update them with something like UPDATE STATISTICS mytable on a regular basis.
Use EXEC sp_updatestats to update the whole database.
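A sketch of both suggestions applied to the query from the question (IX_result_test_number is just a guess at the name of the index on result.test_number):
-- Refresh statistics on the biggest table, or on the whole database
UPDATE STATISTICS result
EXEC sp_updatestats

-- Force the optimizer to use the (hypothetically named) index on result.test_number
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result WITH (INDEX(IX_result_test_number))
    ON test.test_number = result.test_number
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'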
The first thing I would do is specify the exact columns I want, and see if the problem persists. I doubt you need all the columns from all three tables.
It sounds like it has trouble getting all the rows out of the result table. How big is a row? Look at how big all the data in the table is and divide it by the number of rows. Right click on the table -> properties..., Storage tab.
Try putting the WHERE clause into a subquery to force it to run first?
SELECT *
FROM
(SELECT * FROM sample
WHERE sampled_date
BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00') s
INNER JOIN test ON s.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
Or this might work better if you expect a small number of samples:
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
WHERE sample.sample_ID in (
SELECT sample_ID
FROM sample
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)
If you do a SELECT *, you want all the data from the table. The data for the table is in the clustered index - the leaf nodes of the clustered index are the data pages.
So if you want all of those data pages anyway, and since you're joining 1 million rows to 11 million rows (1 out of 11 isn't very selective for SQL Server), using an index to find the rows and then doing bookmark lookups into the actual data pages for each of those rows might just not be very efficient - so SQL Server uses the clustered index scan instead.
So to make a long story short: only select those rows you really need! You thus give SQL Server a chance to use an index, do a seek there, and find the necessary data.
If you only select three, four columns, then the chances that SQL Server will find and use an index that contains those columns are just so much higher than if you ask for all the data from all the tables involved.
Another option would be to find a way to express a subquery, using e.g. a Common Table Expression, that grabs data from the two smaller tables, reduces that number of rows even more, and joins the hopefully quite small result against the main table. If you have a small result set of only 40 or 800 rows (rather than two tables with 1 million rows each), then SQL Server might be more inclined to use a Clustered Index Seek and do bookmark lookups on 40 or 800 rows, rather than doing a full Clustered Index Scan.
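A rough sketch of that idea, reusing the tables from the question (untested; just to show the shape):
;WITH small_set AS (
    -- Filter the two smaller tables down to the interesting rows first
    SELECT test.test_number
    FROM sample
    INNER JOIN test ON sample.sample_number = test.sample_number
    WHERE sample.sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)
SELECT r.*
FROM small_set s
INNER JOIN result r ON r.test_number = s.test_number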
What is the best index or ideas to improve date range lookups? (or query change)
begin tran t1
delete from h1 where histdate between @d1 and @d2 and HistRecordType = 'L'
insert into h1
select * from v_l WHERE HistDate between @d1 and @d2
commit tran t1
it is far slower than histdate = @d1
I have a clustered, non-unique index on the date column
However, the performance is the same when switching to a non-clustered index.
If @d1 = @d2, the query takes 8 minutes to run, while histdate = @d1 runs in 1 second
(so those should be roughly equivalent, right?)
A clustered index is the best index for between queries.
If you are doing anything else in the WHERE part, there may be ways to improve the query.
A quick why: a clustered index sorts the rows in the table (that is why you can only have one clustered index per table). So SQL Server first needs to find the first row to return (@d1); from there all the rows are stored in order and are fast to retrieve. histdate = @d1 is quicker, as all it needs to do then is find the first row - it doesn't have to continue reading all the other rows up to @d2.
A non-clustered index on the column will yield the same performance as the clustered key, as long as the non-clustered index contains all the fields (in the index or as INCLUDE columns) for any SELECT statement against it (a so-called "covering index").
The only reason a clustered index would be faster for range queries in many cases is the fact that the clustered index IS the table data, so any column you might need is right there in the clustered index. With a covering index, you achieve the same result - if the index plus any INCLUDE columns contains all the columns your SELECT statement needs to retrieve, you can satisfy the query just by looking at the index - there's no need to jump from the non-clustered index into the actual table to fetch more data ("bookmark lookup", which tends to be slow for lots of lookups).
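For illustration, a covering index for a SELECT that only needs a couple of columns might look like this (histdate and HistRecordType are from the query above; SomeOtherColumn is a placeholder):
-- Hypothetical covering index: histdate as the key, the other selected columns as INCLUDEs
CREATE NONCLUSTERED INDEX IX_h1_histdate_covering
ON h1 (histdate)
INCLUDE (HistRecordType, SomeOtherColumn)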
For a DELETE, I don't think it makes any difference, really, as long as you just have an index on that column used in the WHERE clause (histdate).
Marc
DateTime is stored as a number internally, so your current example should be pretty efficient. One potential optimization would be to have another column, like DayOffset, that is calculated along the lines of DATEDIFF(day, 0, histDate). This would go as the first column in the clustered key.
Then if you delete entire days at a time, you could just delete based on the DayOffset. If you do not want to delete on midnight boundaries you could delete based on the pair of DayOffset and date range.
WHERE DayOffset between DATEDIFF(day, 0, @d1) and DATEDIFF(day, 0, @d2)
and histdate between @d1 and @d2
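A sketch of what adding that column could look like (column and index names here are made up, and the existing clustered index would have to be dropped first):
-- Persisted computed column holding the day number of histdate
ALTER TABLE h1 ADD DayOffset AS DATEDIFF(day, 0, histdate) PERSISTED

-- Recluster the table with DayOffset as the leading key column
CREATE CLUSTERED INDEX CIX_h1_DayOffset ON h1 (DayOffset, histdate)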
Another possible option is to use partitioning. It is far far more efficient to age out data by dropping an old partition than it is to delete rows.
Try each of these with SET SHOWPLAN_ALL ON:
delete from h1 where histdate between CONVERT(datetime,'1/1/2009') and CONVERT(datetime,'5/30/2009')
delete from h1 where histdate =CONVERT(datetime,'1/1/2009')
delete from h1 where histdate>= CONVERT(datetime,'1/1/2009') and histdate<CONVERT(datetime,'6/01/2009')
I'll bet the execution plans are all about the same and use the index. The difference may be that the number of rows in the range is greater than for an exact "=" match.
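For example, to see the estimated plans without actually running the deletes (each SET statement has to be in its own batch):
SET SHOWPLAN_ALL ON
GO
delete from h1 where histdate between CONVERT(datetime,'1/1/2009') and CONVERT(datetime,'5/30/2009')
GO
SET SHOWPLAN_ALL OFF
GO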
Since indices are stored in a binary tree structure (sorted rather than a hash), the engine can equate a where clause range to an index range by seeking to two locations.
However, sometimes the engine may choose to scan the whole index even if the specified range retrieves only a few records. The explain plan should show whether an index seek or an index scan was chosen for the query.
As an example, there can be slow performance with a ranged select that returns a small number of records (<10) from a million-row table with an index on the value range (the index happens to be non-clustered, but that's not the point; the index is on v1 + v2):
select * from
tbl t
where t.v1 = 123
and t.v2 < 4;
Since an index seek is probably a better choice for this case (a small result set), it can be forced using "with (forceseek)":
select *
from tbl t with (forceseek) -- is 6x faster for this case
where t.v1 = 123
and t.v2 < 4;
Optimally, it would be nice if the engine made the smarter decision without the force, perhaps by doing the seek and counting the records, especially when the table is large.