What is the best index (or query change) to improve date range lookups?
begin tran t1
delete from h1 where histdate between @d1 and @d2 and HistRecordType = 'L'
insert into h1
select * from v_l WHERE HistDate between @d1 and @d2
commit tran t1
It is far slower than histdate = @d1. I have a clustered, non-unique index on the date column; however, performance is the same after switching to a non-clustered index. If @d1 = @d2, the query takes 8 minutes to run, while histdate = @d1 runs in 1 second (so the two should be roughly equivalent, right?).
A clustered index is the best index for between queries.
If you are doing anything else in the WHERE clause, there may be ways to improve the query.
A quick explanation of why: a clustered index sorts the rows of the table (that is why you can only have one clustered index per table). So SQL Server first needs to find the first row to return (@d1); from there, the rows are stored in order and are fast to retrieve. histdate = @d1 is quicker because all it needs to do is find the first row - it doesn't have to keep reading rows until it reaches @d2.
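For reference, such a clustered index on the question's table would look like this (the index name is my own):
CREATE CLUSTERED INDEX CIX_h1_HistDate ON h1 (HistDate);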
A non-clustered index on the column will yield the same performance as the clustered key, as long as the non-clustered index contains all the fields (in the index or as INCLUDE columns) for any SELECT statement against it (a so-called "covering index").
The only reason a clustered index would be faster for range queries in many cases is the fact that the clustered index IS the table data, so any column you might need is right there in the clustered index. With a covering index, you achieve the same result - if the index plus any INCLUDE columns contains all the columns your SELECT statement needs to retrieve, you can satisfy the query just by looking at the index - there's no need to jump from the non-clustered index into the actual table to fetch more data ("bookmark lookup", which tends to be slow for lots of lookups).
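As an illustrative sketch, a covering index for the question's query shape might look like the following; HistRecordType comes from the question, while SomeValue is a hypothetical stand-in for whatever other columns the SELECT needs:
CREATE NONCLUSTERED INDEX IX_h1_HistDate_Covering
ON h1 (HistDate)
INCLUDE (HistRecordType, SomeValue); -- SomeValue is a hypothetical column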
For a DELETE, I don't think it makes any difference, really, as long as you just have an index on that column used in the WHERE clause (histdate).
Marc
DATETIME is stored internally as two 4-byte integers (a day offset and a tick count), so it sorts chronologically and your current example should be pretty efficient. One potential optimization would be to have another column like DayOffset that is calculated along the lines of DATEDIFF(day, 0, histDate). This would go as the first column in the clustered key.
Then if you delete entire days at a time, you could just delete based on the DayOffset. If you do not want to delete on midnight boundaries you could delete based on the pair of DayOffset and date range.
WHERE DayOffset between DATEDIFF(day, 0, @d1) and DATEDIFF(day, 0, @d2)
and histdate between @d1 and @d2
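A hedged sketch of that approach, assuming the h1 table from the question (the existing clustered index would have to be dropped and recreated):
ALTER TABLE h1
ADD DayOffset AS DATEDIFF(day, 0, histdate) PERSISTED;

-- after dropping the current clustered index on histdate:
CREATE CLUSTERED INDEX CIX_h1_DayOffset_HistDate
ON h1 (DayOffset, histdate);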
Another possible option is to use partitioning. It is far far more efficient to age out data by dropping an old partition than it is to delete rows.
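A hedged sketch of monthly partitioning (boundary values and object names are illustrative assumptions):
CREATE PARTITION FUNCTION pf_HistDate (datetime)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2009-02-01', '2009-03-01');

CREATE PARTITION SCHEME ps_HistDate
AS PARTITION pf_HistDate ALL TO ([PRIMARY]);

-- aging out a month then becomes a metadata-only operation, e.g.:
-- ALTER TABLE h1 SWITCH PARTITION 1 TO h1_staging;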
Try each of these with SET SHOWPLAN_ALL ON:
delete from h1 where histdate between CONVERT(datetime,'1/1/2009') and CONVERT(datetime,'5/30/2009')
delete from h1 where histdate =CONVERT(datetime,'1/1/2009')
delete from h1 where histdate>= CONVERT(datetime,'1/1/2009') and histdate<CONVERT(datetime,'6/01/2009')
I'll bet the execution plans are all about the same and use the index. The difference may be that the number of rows in the range is greater than for an exact "=" match.
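For reference, the usage pattern (while SHOWPLAN_ALL is ON, statements are not executed - the plan is returned instead, and the SET statement must sit in its own batch):
SET SHOWPLAN_ALL ON;
GO
delete from h1 where histdate between CONVERT(datetime,'1/1/2009') and CONVERT(datetime,'5/30/2009');
GO
SET SHOWPLAN_ALL OFF;
GO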
Since indexes are stored in a B-tree structure (sorted rather than hashed), the engine can translate a WHERE clause range into an index range by seeking to two locations.
However, sometimes the engine may choose to scan the whole index even if the specified range retrieves only a few records. The execution plan will show whether an index seek or an index scan was chosen for the query.
As an example, a range select that returns fewer than 10 records from a million-row table can perform slowly even with an index on the value range (the index happens to be non-clustered, but that is not the point; it is on v1 + v2):
select *
from tbl t
where t.v1 = 123
and t.v2 < 4;
Since an index seek is probably a better choice for this case (small result set), it can be forced with the WITH (FORCESEEK) table hint:
select *
from tbl t with (forceseek) -- is 6x faster for this case
where t.v1 = 123
and t.v2 < 4;
Ideally, the engine would make the smarter decision without the hint, perhaps by doing the seek and counting the records, especially when the table is large.
Related
I have the following SQL. When I check the execution plan of this query, I observe an index scan. How do I replace it with an index seek? I have a non-clustered index on the IsDeleted column.
SELECT Cast(Format(Sum(COALESCE(InstalledSubtotal, 0)), 'F') AS MONEY) AS TotalSoldNet,
BP.BoundProjectId AS ProjectId
FROM BoundProducts BP
WHERE BP.IsDeleted=0 or BP.IsDeleted is null
GROUP BY BP.BoundProjectId
I tried it like this and got an index seek, but the result was wrong.
SELECT Cast(Format(Sum(COALESCE(InstalledSubtotal, 0)), 'F') AS MONEY) AS TotalSoldNet,
BP.BoundProjectId AS ProjectId
FROM BoundProducts BP
WHERE BP.IsDeleted=0
GROUP BY BP.BoundProjectId
Can anyone suggest how to get the right result set while using an index seek? That is, how do I rewrite the (BP.IsDeleted=0 or BP.IsDeleted is null) condition so it can use an index seek?
Edit, added row counts from the comments on one of the answers:
null: 254962 rows
0: 392002 rows
1: 50211 rows
You're not getting an index seek because you're fetching almost 93% of the rows in the table, and in that kind of scenario just scanning the whole index is faster and cheaper.
If you have performance issues, you should look into removing the FORMAT() function, especially if the query returns a lot of rows.
Another option might be to create an indexed view and pre-aggregate your data. This of course adds overhead to update/insert operations, so consider it only if the query runs often relative to how often the table is updated.
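A hedged sketch of such an indexed view (object names are my own). Grouping by IsDeleted as well keeps the view filterable, so the original condition can be applied to a handful of pre-aggregated rows:
CREATE VIEW dbo.v_BoundProductTotals
WITH SCHEMABINDING
AS
SELECT BoundProjectId,
       IsDeleted,
       SUM(ISNULL(InstalledSubtotal, 0)) AS TotalSold, -- ISNULL keeps the aggregate non-nullable, as indexed views require
       COUNT_BIG(*) AS RowCnt                          -- COUNT_BIG(*) is mandatory in an aggregated indexed view
FROM dbo.BoundProducts
GROUP BY BoundProjectId, IsDeleted;
GO

CREATE UNIQUE CLUSTERED INDEX IX_v_BoundProductTotals
ON dbo.v_BoundProductTotals (BoundProjectId, IsDeleted);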
Have you tried a UNION ALL?
select ...
Where isdeleted = 0
UNION ALL
select ...
Where isdeleted is null
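Spelled out against the question's query, that rewrite might look like this (a sketch; each branch has a simple sargable predicate that can seek on the IsDeleted index):
SELECT Cast(Format(Sum(COALESCE(InstalledSubtotal, 0)), 'F') AS MONEY) AS TotalSoldNet,
       BP.BoundProjectId AS ProjectId
FROM (
    SELECT InstalledSubtotal, BoundProjectId
    FROM BoundProducts
    WHERE IsDeleted = 0
    UNION ALL
    SELECT InstalledSubtotal, BoundProjectId
    FROM BoundProducts
    WHERE IsDeleted IS NULL
) BP
GROUP BY BP.BoundProjectId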
Or you can add an index hint: WITH (INDEX = [indexname]).
Also be aware that cardinality will determine whether SQL Server uses a seek or a scan. Seeks are quicker, but can require key lookups if the index doesn't cover all the required columns. A good rule of thumb is that if the query will return more than 5% of the table, SQL Server will prefer a scan.
I have an update query like the one below, which updates AccessDate only when the stored date is less than the passed one. The table has a clustered index on Id.
Is there any use to have another non clustered index on Id, AccessDate?
Update Person
Set AccessDate = @NewAccessDate
Where Id = @Id
And AccessDate < @NewAccessDate
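For reference, the candidate index under discussion would be something like:
CREATE NONCLUSTERED INDEX IX_Person_Id_AccessDate
ON Person (Id, AccessDate);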
Under most circumstances, I would say that the update would be faster without the index. The key consideration is that the index itself would also need to be updated by the statement.
The one mitigating factor is when each id has lots and lots of AccessDates, and very few of them are less than @NewAccessDate. For instance, if there were 10,000 rows per id and only 1 matched the condition, then updating the index is probably faster than scanning all the access dates.
Or, similarly, if most ids had no matching records for the WHERE clause.
I'm not sure what the cutoff is for when one is better than the other - it would depend on other factors, such as your hardware and the number of records per page. But given that there is a trade-off, you are probably safe not creating the index.
I have the following SQL statement, which I would like to make more efficient. Looking through the execution plan, I can see that there is a Clustered Index Scan on @newWebcastEvents. Is there a way I can turn this into a seek? Or are there any other ways I can make the statement below more efficient?
declare @newWebcastEvents table (
    webcastEventId int not null,
    primary key clustered (webcastEventId) with (ignore_dup_key = off)
)
insert into @newWebcastEvents
select wel.WebcastEventId
from WebcastChannelWebcastEventLink wel with (nolock)
where wel.WebcastChannelId = 1178
Update WebcastEvent
set WebcastEventTitle = LEFT(WebcastEventTitle, CHARINDEX('(CLONE)',WebcastEventTitle,0) - 2)
where
WebcastEvent.WebcastEventId in (select webcastEventId from @newWebcastEvents)
The @newWebcastEvents table variable contains only a single column, and you're asking for all rows of that table variable in this WHERE clause:
where
WebcastEvent.WebcastEventId in (select webcastEventId from @newWebcastEvents)
So doing a seek on this clustered index is usually rather pointless - the SQL Server query optimizer needs all columns and all rows of that table variable anyway, so it chooses an index scan.
I don't think this is a performance issue, anyway.
An index seek is useful if you need to pick a very small number of rows (<= 1-2% of the total) from a large table. In that case, navigating the clustered index tree to find those few rows makes a lot more sense than scanning the whole table. But here, with a single int column and 15 rows, seeking is absolutely pointless - it's much faster to just read those 15 int values in a single scan and be done with it.
Update: I'm not sure whether it makes any difference in terms of performance, but I personally prefer to use joins rather than subselects for "connecting" two tables:
UPDATE we
SET we.WebcastEventTitle = LEFT(we.WebcastEventTitle, CHARINDEX('(CLONE)', we.WebcastEventTitle, 0) - 2)
FROM dbo.WebcastEvent we
INNER JOIN @newWebcastEvents nwe ON we.WebcastEventId = nwe.webcastEventId
So, we have a table, InventoryListItems, that has several columns. Because we're going to be looking up rows at times based on a particular column (g_list_id, a foreign key), we have that foreign key column in a non-clustered index we'll call MYINDEX.
So when I search for data like this:
-- fake data for example
DECLARE @ListId uniqueidentifier
SELECT @ListId = '7BCD0E9F-28D9-4F40-BD67-803005179B04'
SELECT *
FROM [dbo].[InventoryListItems]
WHERE [g_list_id] = @ListId
I expected that it would use the MYINDEX index to find just the needed rows, and then look up the remaining information for those rows. Not as good as finding everything in the index itself, but still a big win over a full scan of the table.
But instead it seems that I'm still getting a clustered index scan. I can't figure out why that would happen.
If I do something like SELECTing only the values in the included columns of the index, it does what I would expect, an index seek, and just pulls everything from the index.
But if I SELECT *, why does it bail on the index and do a scan, when it seems like it would still benefit greatly from using the index referenced in the WHERE clause?
Since you're doing a SELECT * and thus retrieving all columns, SQL Server's query optimizer may have decided it's easier and more efficient to just do a clustered index scan: it needs to go to the clustered index leaf level to get all the columns anyway, and doing a seek first and then a key lookup per row to fetch the whole data page is quite an expensive operation. A scan might simply be more efficient in this setup.
I'm almost sure if you try
SELECT g_list_id
FROM [dbo].[InventoryListItems]
WHERE [g_list_id] = @ListId
then there will be an index seek instead (since you're only retrieving a single column - not everything).
That's one of the reasons why I would recommend being extra careful with SELECT * - try to avoid it whenever possible.
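If the query really must return every column, one option (at a storage cost) is to rebuild MYINDEX as a covering index. A hedged sketch, where the INCLUDE list stands in for the table's remaining columns (hypothetical names):
CREATE NONCLUSTERED INDEX MYINDEX
ON dbo.InventoryListItems (g_list_id)
INCLUDE (col_a, col_b, col_c) -- hypothetical: the rest of the table's columns
WITH (DROP_EXISTING = ON);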
I have a table
Archive(VarId SMALLINT, Timestamp DATETIME, Value FLOAT)
VarId is not unique. The table contains measurements. I have a clustered index on Timestamp. Now I have the requirement of finding a measurement for a specific VarId before a specific date. So I do:
SELECT TOP(1) *
FROM Archive
WHERE VarId = 135
AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;
If there is no such measurement this query searches the whole table. So I introduced another index on (VarId, Timestamp).
My problem is that SQL Server doesn't seem to care about it; the query still takes forever. When I explicitly state WITH (INDEX = <id>), it works as it should. What can I do to make SQL Server use my index automatically?
I'm using SQL Server 2005.
There are different possibilities here.
I'll try to help you isolate them:
1. It could be that SQL Server is favouring your clustered index (very likely it's the primary key) over your newly created index. One way to solve this is to have a non-clustered primary key and cluster the index on the other two fields (VarId and Timestamp) - that is, if you don't want VarId and Timestamp to be the PK.
2. Also, looking at the (estimated) execution plan might help.
But I believe #1 only works nicely if those two fields form the most commonly queried index. To find out whether that is the case, it helps to analyse which indexes are actually used (from http://sqlblog.com/blogs/louis_davidson/archive/2007/07/22/sys-dm-db-index-usage-stats.aspx):
select
ObjectName = object_schema_name(indexes.object_id) + '.' + object_name(indexes.object_id),
indexes.name,
case when is_unique = 1 then 'UNIQUE ' else '' end + indexes.type_desc,
ddius.user_seeks,
ddius.user_scans,
ddius.user_lookups,
ddius.user_updates
from
sys.indexes
left join sys.dm_db_index_usage_stats ddius on (
indexes.object_id = ddius.object_id
and indexes.index_id = ddius.index_id
and ddius.database_id = db_id()
)
WHERE
object_schema_name(indexes.object_id) != 'sys' -- exclude sys objects
AND object_name(indexes.object_id) LIKE 'Archive'
order by
ddius.user_seeks + ddius.user_scans + ddius.user_lookups
desc
Good luck
My guess is that your index design is the issue. You have a CLUSTERED index on a DATETIME field, and I suspect that it does not contain unique data, much like VarId, which is why you did not declare it as UNIQUE. Because it is not unique, there is a hidden 4-byte "uniqueifier" field (so that each row can be physically unique regardless of you not giving it unique data), and rows with the same DATETIME value are essentially in random order within the group of identical DATETIME values (so even narrowing down a time still requires scanning through that group). You also have a NONCLUSTERED index on VarId, Timestamp. NONCLUSTERED indexes include the key of the CLUSTERED index, so internally your NONCLUSTERED index is really: VarId, Timestamp, Timestamp (from the CLUSTERED index). So you could have left off the Timestamp column in the NONCLUSTERED index and it would have been all the same to the optimizer - in a sense it would even have been better, as the index would be smaller.
So your physical layout is based on the date, while the VarId values are spread across those dates. Hence VarId = 135 can be spread very far apart in terms of data pages. Yes, your non-clustered index does group those rows together, but the optimizer is probably looking at the fact that you want all fields (the SELECT * part) and that the Timestamp < '2012-06-01 14:21:00' condition on its own seems to get most of what you need, as opposed to finding a few rows and doing a bookmark lookup to get the Value field to satisfy the SELECT *. Quite possibly, if you do just SELECT TOP(1) VarId, Timestamp, it would be more likely to use your NONCLUSTERED index without needing the INDEX = hint.
Another issue affecting performance overall could be that the ORDER BY is requesting Timestamp in DESC order; if the CLUSTERED index is in ASC order, that is the opposite direction of what you are looking for (at least in this query). Of course, in that case it might be OK to have Timestamp in the NONCLUSTERED index if it was in DESC order.
My advice is to rethink the CLUSTERED index. Judging by this query alone (other queries/uses might alter the recommendation), try dropping the NONCLUSTERED index and recreating the CLUSTERED index with the Timestamp field first, in DESC order, and with VarId added so the index can be declared UNIQUE. So:
CREATE UNIQUE CLUSTERED INDEX [UIX_Archive_Timestamp_VarId]
ON Archive (Timestamp DESC, VarId ASC)
This, of course, assumes that the Timestamp and VarId combination is unique. If not, then still try this without the UNIQUE keyword.
Update:
To pull all of this info and advice together:
When designing indexes you need to consider the distribution of the data and the use cases for interacting with it. More often than not there is A LOT to consider, and several different approaches will look good in theory. You need to try a few approaches, profile/test them, and see which works best in reality. There is no "always do this" approach without knowing all aspects of what you are doing, what else is going on, and what else is planned to use and/or modify this table - which I suspect has not all been presented in the original question.
So, to start the journey: you are ordering records by date and looking at ranges of dates, and dates naturally occur in order, so putting Timestamp first benefits more of what you are doing and causes less fragmentation, especially if defined as DESC in the CREATE. Having an NC index on just VarId at that point will then be fine, even if those rows are spread out, for looking at the set of rows for a particular VarId. So maybe start there (change the order and direction of the CLUSTERED index and remove Timestamp from the NC index) and see how those changes compare to the existing structure. Then try moving the VarId field into the CLUSTERED index and removing the NC index. You say that the combination is still not unique, but it does increase the predictability of the ordering of the rows; see how that works. Does this table ever get updated? If not, and if the Value field along with Timestamp and VarId would be unique, then try adding Value to the CLUSTERED index, and be sure to create it with the UNIQUE keyword. Compare these approaches by looking at the actual execution plan and by running SET STATISTICS IO ON before the query to see how the logical reads differ.
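For reference, a minimal sketch of that measurement, using the query from the question:
SET STATISTICS IO ON;

SELECT TOP(1) *
FROM Archive
WHERE VarId = 135
AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;

SET STATISTICS IO OFF;
-- compare the "logical reads" reported in the Messages tab across the different index layouts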
Hope this helps :)
You might need to update the statistics on your table, so the optimizer can determine whether to use the index or not.
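In SQL Server that is done with UPDATE STATISTICS; a minimal sketch against the table from the question (FULLSCAN is optional, but gives the most accurate histogram):
UPDATE STATISTICS Archive WITH FULLSCAN;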