I am investigating indexes and have read through many articles and would like some expert advice. As a warning, index fields are fairly new to me and a bit confusing even after reading up on the subject!
To simplify, I have a table that has a guid (transaction id), event id and an updt_tmstmp (there are many other fields but unimportant to this question).
My PK is (transaction_id, event_id) and the table is ordered by these keys. Since the transaction_id is a guid, the updt_tmstmp values are scattered randomly through the table. As the table has grown to 6 million records the query has slowed. My idea was to add an index on the updt_tmstmp field. Our extracts often search the table for the transaction_ids that have had updates in the past 24 hours. The query is scanning the entire table to find the records that have been updated. Average time: 1 minute.
Details Current:
Table size: 6.2 million records
Index: transaction_id + event_id (clustered)
Details Attempted:
Additional Index: updt_tmstmp (non-unique, non-clustered)
When I did this update and ran the query it improved by about 10% and the explain plan indicates I am still table scanning an index. I expected a little bit better performance than this. My updt_tmstmp is not guaranteed to be unique (I blame the application programmer for doing this :) ).
The query I am using to access this is a standard start_time to end_time range: updt_tmstmp >= @start_time and updt_tmstmp < @end_time
Thanks in advance and have a great day!
Chris
Question: Given a clustered index on event_ID, and non_clustered indexes on transaction_id and updt_tmstmp, I would recommend the following:
-- drop the clustered index on event_id and the non-clustered index on updt_tmstmp.
-- Create the clustered index on updt_tmstmp.
Logic: The SQL Server query optimizer has always favored clustered indexes over non-clustered indexes. The showplan for the query most likely shows a clustered index SCAN when using event_id as the clustered index. By moving the clustered index to updt_tmstmp, the query showplan should show a range search for all transactions in the last 24 hours and should do it quickly, because the cluster key is sorted and the rows are physically adjacent on disk... perfect for a range scan of a cluster.
By doing this, you will have accomplished a few key design goals.
Defined the clustered index key with as few columns as possible, choosing a column that is very selective or contains many distinct values.
-- Also... the updt_tmstmp data will be accessed sequentially and rarely, if at all, updated. Perfect for a clustered index. The updt_tmstmp field will be used frequently to sort the data retrieved from the table.
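A minimal sketch of the change described above, assuming a table called dbo.txn_event with a primary key constraint PK_txn_event and an existing non-clustered index IX_txn_event_updt (all three names are hypothetical; the original post does not give them). Because the question says the clustered index backs the primary key on transaction_id + event_id, the sketch drops that constraint and recreates it as non-clustered. Try it on a copy first: rebuilding a clustered index on 6 million rows rewrites the whole table.

```sql
-- Hypothetical object names: dbo.txn_event, PK_txn_event, IX_txn_event_updt

-- 1) Drop the non-clustered index first so it is not rebuilt needlessly
DROP INDEX IX_txn_event_updt ON dbo.txn_event;

-- 2) Drop the clustered primary key (the table becomes a heap for a moment)
ALTER TABLE dbo.txn_event DROP CONSTRAINT PK_txn_event;

-- 3) Cluster the table on the timestamp (non-unique is allowed)
CREATE CLUSTERED INDEX CIX_txn_event_updt
    ON dbo.txn_event (updt_tmstmp);

-- 4) Keep uniqueness of the original key, now as a non-clustered PK
ALTER TABLE dbo.txn_event
    ADD CONSTRAINT PK_txn_event
    PRIMARY KEY NONCLUSTERED (transaction_id, event_id);
```

Whether this is the right trade-off depends on how the rest of the workload reads the table; lookups by transaction_id would now go through the non-clustered primary key.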
I suggest that you run the following before the query to help you understand the query optimizer's behavior.
set statistics io on
go
set statistics time on
go
Table: Name of the table.
Scan count: Number of index or table scans performed.
logical reads: Number of pages read from the data cache.
physical reads: Number of pages read from disk.
read-ahead reads: Number of pages placed into the cache for the query.
lob logical reads: Number of text, ntext, image, or large value type (varchar(max), nvarchar(max), varbinary(max)) pages read from the data cache.
lob physical reads: Number of text, ntext, image or large value type pages read from disk.
lob read-ahead reads: Number of text, ntext, image or large value type pages placed into the cache for the query.
SQL Server parse and compile elapsed time and cpu
SQL Server Execution Time and cpu
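As a sketch of how these settings might be wrapped around the query from the question (dbo.txn_event is a hypothetical table name, and the 24-hour window is an assumption based on the description of the extract):

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

-- Assumed window: the last 24 hours
DECLARE @end_time   datetime = GETDATE();
DECLARE @start_time datetime = DATEADD(HOUR, -24, @end_time);

SELECT transaction_id
FROM dbo.txn_event            -- hypothetical table name
WHERE updt_tmstmp >= @start_time
  AND updt_tmstmp <  @end_time;
GO

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```

In SSMS the counters appear on the Messages tab. Comparing logical reads before and after an index change is usually more telling than comparing elapsed time, since elapsed time also depends on what happens to be cached.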
I created a non-clustered, non-unique index on a column (date) on a large table (16 million rows), but am getting very similar query speeds when compared to the exact same query that's being forced to not use any indexes.
Query 1 (uses index):
SELECT *
FROM testtable
WHERE date BETWEEN '01/01/2017' AND '03/01/2017'
ORDER BY date
Query 2 (no index):
SELECT *
FROM testtable WITH(INDEX(0))
WHERE date BETWEEN '01/01/2017' AND '03/01/2017'
ORDER BY date
Both queries take the same amount of time to run and return the same result. When looking at the execution plan for each, Query 1's number of rows read is ~4 million, whereas Query 2 is reading 106 million rows. It appears that the index is working, but I'm not gaining any performance benefit from it.
Any ideas as to why this is, or how to increase my query speed in this case would be much appreciated.
Create Indexes with Included Columns: Cover index
This topic describes how to add included (or nonkey) columns to extend the functionality of nonclustered indexes in SQL Server by using SQL Server Management Studio or Transact-SQL. By including nonkey columns, you can create nonclustered indexes that cover more queries. This is because the nonkey columns have the following benefits:
They can be data types not allowed as index key columns.
They are not considered by the Database Engine when calculating the number of index key columns or index key size.
An index with nonkey columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index; table or clustered index data is not accessed resulting in fewer disk I/O operations.
CREATE NONCLUSTERED INDEX IX_your_index_name
ON testtable (date)
INCLUDE (col1,col2,col3);
GO
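For the index above to cover the query, the query has to ask only for columns the index holds; SELECT * pulls in every other column and forces a lookup back into the table for each row. A hedged sketch, where col1, col2 and col3 are the placeholder column names from the index definition above:

```sql
-- Only columns present in IX_your_index_name (key or included) are requested,
-- so the index alone can answer the query.
SELECT [date], col1, col2, col3
FROM testtable
WHERE [date] BETWEEN '01/01/2017' AND '03/01/2017'
ORDER BY [date];
```

If the plan still shows a Key Lookup or RID Lookup, some column the query uses is missing from the index.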
You need to build an index around the needs of your query. This quick and free video course should bring you up to speed really quickly.
https://www.brentozar.com/archive/2016/10/think-like-engine-class-now-free-open-source/
I have a query that fails to execute with "Could not allocate a new page for database 'TEMPDB' because of insufficient disk space in filegroup 'DEFAULT'".
While troubleshooting, I am examining the execution plan. There are two costly steps labeled "Clustered Index Scan (Clustered)". I am having a hard time finding out what this means.
I would appreciate any explanations to "Clustered Index Scan (Clustered)" or suggestions on where to find the related document?
I would appreciate any explanations to "Clustered Index Scan (Clustered)"
I will try to put it in the simplest terms. To understand it, you need to understand both index seek and index scan.
So let's build the table:
use tempdb
GO
-- demo table with a clustered index on id
create table scanseek (id int, name varchar(50) default ('some random names'))
create clustered index IX_ID_scanseek on scanseek(ID)
-- load 5,000 rows: (0, 'Name0') ... (4999, 'Name4999')
declare @i int
set @i = 0
while (@i < 5000)
begin
    insert into scanseek
    select @i, 'Name' + convert(varchar(5), @i)
    set @i = @i + 1
end
An index seek is where SQL Server uses the b-tree structure of the index to seek directly to matching records.
You can check the root and leaf levels of your table's index using the dynamic management function below:
-- check index level
SELECT
index_level
,record_count
,page_count
,avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'),OBJECT_ID('scanseek'),NULL,NULL,'DETAILED')
GO
Now we have a clustered index on column "ID".
Let's look for a directly matching record:
select * from scanseek where id =340
and look at the execution plan.
You requested rows by the index key directly in the query; that's why you got a clustered index SEEK.
Clustered index scan: SQL Server reads through the rows in the clustered index from top to bottom.
This happens, for example, when searching on a non-key column. In our table NAME is a non-key column, so if we search for some data in the name column we will see a clustered index scan, because all the rows are in the clustered index leaf level.
Example
select * from scanseek where name = 'Name340'
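To see the scan turn into a seek, one option is to give the name column its own index. This is a sketch added for illustration; the index name IX_Name_scanseek is hypothetical and not part of the original demo:

```sql
-- Hypothetical index name; run in tempdb after the demo table is loaded
create nonclustered index IX_Name_scanseek on scanseek(name)
GO

-- Same query as before; check the execution plan again
select * from scanseek where name = 'Name340'
```

The plan should now show an Index Seek on the new index rather than a Clustered Index Scan. No lookup into the clustered index should be needed here, because a non-clustered index also carries the clustering key (id), so it holds both columns of this small table.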
please note: I made this answer short for better understanding only, if you have any question or suggestion please comment below.
Expanding on Gordon's answer in the comments, a clustered index scan means SQL Server is scanning the table's clustered index to find the values that match your WHERE clause filter, or to join to the next table in your query plan.
Tables can have multiple indexes (one clustered and many non-clustered) and SQL Server will search the appropriate one based upon the filter or join being executed.
Clustered Indexes are explained pretty well on MSDN. The key difference between clustered and non-clustered is that the clustered index defines how rows are stored on disk.
If your clustered index is very expensive to search due to the number of records, you may want to add a non-clustered index on the table for fields that you search for often, such as date fields used for filtering ranges of records.
A clustered index is one in which the terminal (leaf) node of the index is the actual data page itself. There can be only one clustered index per table, because it specifies how records are arranged within the data page. It is generally (and with some exceptions) considered the most performant index type (primarily because there is one less level of indirection before you get to your actual data record).
A "clustered index scan" means that the SQL engine is reading through the leaf level of your clustered index, which is the table data itself, in search of a particular value (or set of values). On a large table that is effectively a full table scan and can be expensive. It is beaten by a "clustered index seek", in which the SQL engine uses the b-tree to go directly to the rows that match the selected value or range.
The error message has absolutely nothing to do with the query plan. It just means that you are out of space on TempDB.
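A quick way to check how much room tempdb actually has is sketched below. It only reads catalog and dynamic management views, and it assumes you have permission to view server state:

```sql
USE tempdb;
GO

-- File sizes and growth settings (size and max_size are in 8 KB pages;
-- max_size = -1 means the file may grow until the disk is full)
SELECT name,
       physical_name,
       size * 8 / 1024 AS size_mb,
       max_size,
       growth
FROM sys.database_files;

-- What is using the space right now
SELECT SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_mb,
       SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_mb,
       SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
       SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_mb
FROM sys.dm_db_file_space_usage;
```

Internal objects include the work tables that sorts and hash operations spill into, so a large internal_objects_mb value while the query runs points back at the query itself.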
I have been having issues with performance and timeouts due to a clustered index scan. However another seemingly identical database did not have the same issue.
Turns out the COMPATIBILITY_LEVEL setting on the db was different... the version with COMPATIBILITY_LEVEL 100 was using the scan, the db with level 130 wasn't. The performance difference is huge (from more than 1 minute to less than 1 second for the same query).
ALTER DATABASE [mydb] SET COMPATIBILITY_LEVEL = 130
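Before changing anything, it may help to compare the current setting across the databases involved. A small sketch, where mydb and mydb_copy stand in for your own database names (mydb_copy is hypothetical):

```sql
-- List the compatibility level of the databases being compared
SELECT name, compatibility_level
FROM sys.databases
WHERE name IN (N'mydb', N'mydb_copy');
```

Note that level 130 is only available on SQL Server 2016 or later, and changing the level can alter plans for other queries too, so it is worth testing first.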
If you hover over the step in the query plan, SSMS displays a description of what the step does. That will give you a baseline understanding of "Clustered Index Scan (Clustered)" and all other steps involved.
As far as I know, heap tables are tables without a clustered index and have no physical order.
I have a heap table "scan" with 120k rows and I am using this select:
SELECT id FROM scan
If I create a non-clustered index for the column "id", I get 223 physical reads.
If I remove the non-clustered index and alter the table to make "id" my primary key (and so my clustered index), I get 515 physical reads.
If the clustered index table is something like this picture:
Why does a Clustered Index Scan work like a table scan (or worse, in the case of retrieving all rows)? Why is it not using the "clustered index table" that has fewer blocks and already has the ID that I need?
SQL Server indices are b-trees. A non-clustered index just contains the indexed columns, with the leaf nodes of the b-tree being pointers to the appropriate data page. A clustered index is different: its leaf nodes are the data pages themselves, and the clustered index's b-tree becomes the backing store for the table; the heap ceases to exist for the table.
Your non-clustered index contains a single, presumably integer column. It's a small, compact index to start with. Your query select id from scan has a covering index: the query can be satisfied just by examining the index, which is what is happening. If, however, your query included columns not in the index, assuming the optimizer elected to use the non-clustered index, an additional lookup would be required to fetch the data pages required, either from the clustering index or from the heap.
To understand what's going on, you need to examine the execution plan selected by the optimizer:
See Displaying Graphical Execution Plans
See Red Gate's SQL Server Execution Plans, by Grant Fritchey
A clustered index is generally about as big as the same data in a heap would be (assuming the same page fullness). It should need only slightly more reads than a heap would, because of the additional B-tree levels.
A CI cannot be smaller than a heap would be. I don't see why you would think that. Most of the size of a partition (be it a heap or a tree) is in the data.
Note that fewer physical reads does not necessarily translate to a query being faster. Random IO can be 100x slower than sequential IO.
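To see the size difference behind those read counts, the page counts of each structure can be read from the metadata. A sketch against the table named scan from the question (run it once with the non-clustered index in place and once with the clustered primary key):

```sql
-- Pages used by each index (or the heap) on the table "scan"
SELECT i.name            AS index_name,   -- NULL for a heap
       i.type_desc,                        -- HEAP, CLUSTERED or NONCLUSTERED
       ps.used_page_count,
       ps.row_count
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID('scan');
```

The narrow non-clustered index on id should report far fewer pages than the heap or the clustered index, which is why scanning it costs fewer reads.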
When to use Clustered Index-
Query Considerations:
1) Return a range of values by using operators such as BETWEEN, >, >=, <, and <=
2) Return large result sets
3) Use JOIN clauses; typically these are foreign key columns
4) Use ORDER BY, or GROUP BY clauses. An index on the columns specified in the ORDER BY or GROUP BY clause may remove the need for the Database Engine to sort the data, because the rows are already sorted. This improves query performance.
Column Considerations :
Consider columns that have one or more of the following attributes:
1) Are unique or contain many distinct values
2) Defined as IDENTITY because the column is guaranteed to be unique within the table
3) Used frequently to sort the data retrieved from a table
Clustered indexes are not a good choice for the following attributes:
1) Columns that undergo frequent changes
2) Wide keys
When to use Nonclustered Index-
Query Considerations:
1) Use JOIN or GROUP BY clauses. Create multiple nonclustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
2) Queries that do not return large result sets
3) Contain columns frequently involved in search conditions of a query, such as WHERE clause, that return exact matches
Column Considerations :
Consider columns that have one or more of the following attributes:
1) Cover the query. For more information, see Index with Included Columns
2) Lots of distinct values, such as a combination of last name and first name, if a clustered index is used for other columns
3) Used frequently to sort the data retrieved from a table
Database Considerations:
1) Databases or tables with low update requirements, but large volumes of data can benefit from many nonclustered indexes to improve query performance.
2) Online Transaction Processing applications and databases that contain heavily updated tables should avoid over-indexing. Additionally, indexes should be narrow, that is, with as few columns as possible.
Try running
DBCC DROPCLEANBUFFERS
before each query if you really want to compare them.
Physical reads are not the same thing as logical reads when optimizing a query: physical reads depend on what is already in the buffer cache.
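A sketch of a cold-cache comparison, using the query from the question. This is meant for a test server only: DBCC DROPCLEANBUFFERS empties the buffer cache for the whole instance and requires sysadmin rights.

```sql
-- Write dirty pages to disk first so they become clean and can be dropped
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO

SET STATISTICS IO ON;
SELECT id FROM scan;      -- repeat the whole script for each index setup
SET STATISTICS IO OFF;
GO
```

With the cache emptied before each run, the physical reads of the two setups start from the same state and can be compared fairly.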
Running On: SQL Server 2008 R2 Standard. Though I imagine this is a question for all databases, not just SQL Server.
Background: I've always heard/read/been told that the leading edge on an index should be highly selective. This makes sense when you've got queries seeking for a particular value or small set of values -- a product id or something like that.
General question: are there times when a non highly-selective index is useful?
For example: I have a table with 350 million rows. The table contains a bunch of prices. The table has the following columns:
priceId -- clustered index on the table
warehouseId -- fk to one of 10 warehouses, equally distributed among the 150m rows
algorithmId -- fk to one of 23 algorithms for how I calculated price, equally distributed among 150m rows
priceDate -- the date we last calculated the price
productId
Then I run this query:
select productId
from price
where warehouseId = 1
and algorithmId = 1
order by priceDate
Specific question: Wouldn't I benefit from an index like this?
create nonclustered index ix_p
on price (warehouseId, algorithmId, priceDate) include (productId)
It seems I would benefit b/c I've created a covering index with the filter columns nicely organized so that SQL Server can carve out huge chunks at a time and order by priceDate. Does that make sense? And does it work?
Note: I am going to try this out and will let you know what I find.
Short answer - yes, but you basically have doubled your storage.
Long answer:
I tested this out on a SQL 2012 VirtualBox Server 2008 VM with 150 million rows of data. Filegroups were stored on the VM image, which is on a USB 3.0 connection to a solid state drive (sequential read seems to be about 250 MB/s, write about 150 MB/s).
I built a table with pseudo-random dates & productIds, with warehouseIds from 1-10 evenly distributed, and algorithmIds from 1-23 evenly distributed. (Basically I wrote a source script component in SSIS that loaded the data.)
Table storage space was about 4.7 GB, with a clustered index on primary key priceid.
Running this query:
select productId
from price
where warehouseId = 1
and algorithmId = 1
order by priceDate
~1 million rows returned in about 30 seconds.
Plan indicates a clustered index scan plus a sort (order by priceDate).
I then added this nonclustered index:
create nonclustered index ix_p
on price (warehouseId, algorithmId, priceDate) include (productId)
This index is nearly as large as the table - about 4.3 GB.
Adding the nonclustered index eliminated the SORT step on the priceDate, and changed to do a nonclustered index seek to access the data. Creating this index took over 11 minutes.
Same query:
~1 million rows returned in about 4 seconds.
Plan indicates a nonclustered index seek.
I think the biggest thing that this is doing is essentially creating two copies of your data - one in the clustered index structure, and one in the "nonclustered" structure.
I expect inserts to take about twice as long since now you have to create basically two rows for each insert.
Are you doing updates to this table on a regular basis? There may be some other strategies that may help.
I just finished implementing a non-clustered index similar to what I describe in my question. Table had 101,308,183 rows, 61 bytes per row. Here were some results:
With current "selective" index with productId and warehouse as the keys:
461,000 Rows returned
Average Run Time: 2 min 36 sec
Scan Count: 116
Logical Reads: 9,870,354
Physical Reads: 20,086
Read-ahead Reads: 967,324
With the new non-selective index as described in my original question:
461,000 Rows returned
Average Run Time: 47 sec
Scan Count: 76
Logical Reads: 109,934
Physical Reads: 0
Read-ahead Reads: 1
So to summarize, a non-selective index gave me about 90 times fewer logical reads (9.87 million to 110k), a 100% decrease in physical reads (from 20k to 0) and a nearly 100% decrease in read-ahead reads (967k to 1).
Again, I believe this is because SQL already has all the data sorted, so it's very easy to cleave off (i.e. exclude) large chunks of data. Because the index covers this query (which is one of just two queries we run on it in our production environment) we don't waste time w/ key lookups.
As far as I understand it, each transaction sees its own version of the database, so the system cannot get the total number of rows from some counter and thus needs to scan an index. But I thought it would be the clustered index on the primary key, not the additional indexes. If I had more than one additional index, which one will be chosen, anyway?
When digging into the matter, I've noticed another strange thing. Suppose there are two identical tables, Articles and Articles2, each with three columns: Id, View_Count, and Title. The first has only a clustered PK-based index, while the second one has an additional non-clustered, non-unique index on view_count. The query SELECT COUNT(1) FROM Articles runs 2 times faster for the table with the additional index.
SQL Server will optimize your query - if it needs to count the rows in a table, it will choose the smallest possible set of data to do so.
So if you consider your clustered index - it contains the actual data pages - possibly several thousand bytes per row. To load all those bytes just to count the rows would be wasteful - even just in terms of disk I/O.
Therefore, if there is a non-clustered index that's not filtered or restricted in any way, SQL Server will pick that data structure to count, since the non-clustered index basically contains only the columns you've put into the NC index (plus the clustered index key). That is much less data to load just to count the number of rows.
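The effect can be checked with the two tables from the question. A sketch, assuming Articles2 does not yet have the extra index (the index name IX_Articles2_View_Count is hypothetical):

```sql
-- Narrow non-clustered index on the second table only
CREATE NONCLUSTERED INDEX IX_Articles2_View_Count
    ON Articles2 (View_Count);
GO

SET STATISTICS IO ON;
SELECT COUNT(1) FROM Articles;    -- only the clustered index is available
SELECT COUNT(1) FROM Articles2;   -- the optimizer can pick the narrow index
SET STATISTICS IO OFF;
GO
```

The logical reads reported for Articles2 should be noticeably lower, and its plan should show a scan of the non-clustered index instead of the clustered one. With several candidate indexes, the optimizer would normally be expected to choose the one with the fewest pages.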