Defining indexes: Which Columns, and Performance Impact? - sql

I know how to use indexes (clustered and non-clustered),
but when should I use non-clustered indexes in my table?
What scenarios call for making a column a non-clustered index?
I have gone through the MSDN guidelines, but I'm still a bit confused.
Should I only make unique columns non-clustered, or should other columns be non-clustered as well?
If I overload my table with non-clustered indexes, will that also decrease my performance?
Should I use a composite non-clustered index on columns that are foreign keys?
I know the primary key should be clustered and unique keys should be non-clustered, but what about foreign keys?

The clustered index defines your table's physical structure (to a certain degree) - e.g. it defines in what order the data is ordered. Think of the phonebook, which is "clustered" by (LastName,FirstName) - at least in most countries it is.
You only get one clustered index per table - so choose it wisely! According to the gospel of the Queen of Indexing, Kimberly Tripp, the clustering key should be narrow, stable (never change), unique (yes!) and ideally ever-increasing.
It should be narrow, because the clustering key will be added to each and every entry of each and every non-clustered index - after all, the clustering key is the value used to ultimately find the actual data.
It should be stable, since constantly updating lots of index values is a costly affair - especially since the clustering key would have to be updated in all non-clustered indices as well.
It needs to be unique, since again - it's ultimately the value used to locate the actual data. If you choose a column that is not guaranteed to be unique, SQL Server will "uniquify" your clustering key by adding a 4-byte value to it - not a good thing.
And ideally, the clustering key should be ever-increasing since that causes the least page and index fragmentation and thus is best for performance.
The ideal candidate for a clustering key is an INT (or BIGINT) IDENTITY column - it fulfills all those requirements.
As for non-clustered indices - use and choose them wisely! There's only one general rule I can give you: all columns that are part of a foreign key (referencing another table) should be in an index - SQL Server will not (contrary to popular belief and lots of myths) put such an index in place automatically - never has, never does.
Other than that - you need to watch your system and see what kinds of queries you have: all columns that show up in a WHERE or ORDER BY clause are potential candidates to be indexed - but too many indexes aren't a good thing either....
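As a sketch of the foreign-key advice above (the table and column names here are hypothetical, not from the question):

```sql
-- Hypothetical example: Orders references Customers via CustomerId.
-- SQL Server creates the FK constraint, but NOT an index on it -
-- that has to be added explicitly.
ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId);

-- Support joins and FK lookups (e.g. checking for child rows on delete):
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId);
```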

You can only have one clustered index per table. It doesn't have to be the primary key, but in most cases it will be.
Beyond that - it really depends on the queries & the tipping point for what indexes will be used. But defining indexes also means there will be an impact to DML - inserts, updates & deletes will take a slight performance hit.
Should I use composite non clustered index(es) on columns that are foreign keys?
Doesn't matter what the column is, it's the usage that matters for the optimizer to determine what index, clustered or otherwise, to use.

Yes, you can overload your tables with too many indexes. In general, every additional index costs performance time in terms of index maintenance. Tables that are heavily updated should generally have fewer indexes.
Another broad rule (from Richard Campbell, on RunAs Radio and DotNetRocks), is that a few broad indexes will perform better than a larger number of narrow indexes. A broad index will cover a wider range of queries, and there's less for the query optimizer to investigate. Remember that the query optimizer has a limited time to run.
Investigate SQL Server Profiler. There are tools there (they used to be stand-alone, but they've changed and I haven't used them recently) that can analyze workloads and make indexing recommendations. These will be better choices than indexes picked "intuitively."

If you have queries that reference columns that are not in your index, the SQL Server engine will have to perform a lookup into the base table to get the non-included columns.
If you are running these queries often, you should create non-clustered indexes that "cover" the query by including all the referenced columns in the index. This should include any non-unique columns.
Adding indexes to a table always decreases write performance, since the index will have to be updated every time the table is updated.
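As a sketch of such a covering index (the table, columns, and query here are made up for illustration):

```sql
-- Hypothetical query to cover:
--   SELECT OrderNum, Status FROM dbo.Orders WHERE CustomerId = @id;
-- The index below answers it entirely from the index leaf pages,
-- avoiding a key/RID lookup into the base table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderNum, Status);
```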

What fields are you doing lookups on? Searching? Etc.
Determine which fields you use when running your queries (the WHERE clause); those could be good candidates.
For instance, think of a library. The book catalog has a clustered index on the ISBN number and a non-clustered index on, say, the publishing year, etc.
Also, something that Bart Duncan posted a long time ago helped me; he deserves the credit for this.
The article was entitled "Are You Using SQL's Missing Index DMVs?". Look it up and run this query:
SELECT
migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) AS improvement_measure,
'CREATE INDEX [missing_index_' + CONVERT (varchar, mig.index_group_handle) + '_' + CONVERT (varchar, mid.index_handle)
+ '_' + LEFT (PARSENAME(mid.statement, 1), 32) + ']'
+ ' ON ' + mid.statement
+ ' (' + ISNULL (mid.equality_columns,'')
+ CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
+ ISNULL (mid.inequality_columns, '')
+ ')'
+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,
migs.*, mid.database_id, mid.[object_id]
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) > 10
ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC
It is not the ultimate solution, but it will help you determine some candidate indexes.
And the link to the article: http://blogs.msdn.com/bartd/archive/2007/07/19/are-you-using-sql-s-missing-index-dmvs.aspx
By default, when you create a PK in SQL Server it is the clustered index. It doesn't have to be, but it generally is.

Whether or not you should create a clustered index depends on your workload (usually dominated by the amount and kind of SELECT statements hitting your table).
A clustered index forces the on-disk storage order of the rows to follow the clustered index values. (For this reason there can be only one clustered index per table, as rows are stored on disk only once.) This makes sense if most of your queries demand a group of related rows.
Example: suppose you are storing CustomerOrders, and you frequently want to know the number of CustomerOrders (regardless of the customer) in a certain time period. In this case it may be useful to create a clustered index with the OrderDate as first column. If on the other hand you are frequently looking for all CustomerOrders with the same CustomerId, it makes more sense to put the CustomerId as first column in your clustered index.
The disadvantage of clustered indexes is not in the clustered index itself, but in the secondary indexes: secondary indexes are themselves not clustered (by definition, as the rows can only be stored once, in the order of the clustered index), and their index entries point to the entries of the clustered index. So retrieving a row via a secondary index requires two read operations: one of the secondary index, and then one of the clustered index it is pointing to.
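The CustomerOrders example above might be sketched like this (a hypothetical schema; you would pick one of the two clustered indexes, not both):

```sql
-- If queries mostly filter on a date range, put OrderDate first:
CREATE CLUSTERED INDEX CIX_CustomerOrders_OrderDate
    ON dbo.CustomerOrders (OrderDate, OrderId);

-- If queries mostly fetch all orders for one customer, putting
-- CustomerId first keeps those rows physically adjacent instead:
-- CREATE CLUSTERED INDEX CIX_CustomerOrders_CustomerId
--     ON dbo.CustomerOrders (CustomerId, OrderId);
```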

Related

Why NonClustered index scan faster than Clustered Index scan?

As far as I know, heap tables are tables without a clustered index, with no physical order.
I have a heap table "scan" with 120k rows and I am using this select:
SELECT id FROM scan
If I create a non-clustered index for the column "id", I get 223 physical reads.
If I remove the non-clustered index and alter the table to make "id" my primary key (and so my clustered index), I get 515 physical reads.
(The question originally included a diagram of the clustered index's B-tree structure here.)
Why does a Clustered Index Scan work like a table scan (or worse, in the case of retrieving all rows)? Why is it not using the "clustered index table", which has fewer blocks and already contains the ID that I need?
SQL Server indices are B-trees. A non-clustered index just contains the indexed columns, with the leaf nodes of the B-tree being pointers to the appropriate data page. A clustered index is different: its leaf nodes are the data pages themselves, and the clustered index's B-tree becomes the backing store for the table itself; the heap ceases to exist for the table.
Your non-clustered index contains a single, presumably integer column. It's a small, compact index to start with. Your query select id from scan has a covering index: the query can be satisfied just by examining the index, which is what is happening. If, however, your query included columns not in the index, assuming the optimizer elected to use the non-clustered index, an additional lookup would be required to fetch the data pages required, either from the clustering index or from the heap.
To understand what's going on, you need to examine the execution plan selected by the optimizer:
See Displaying Graphical Execution Plans
See Red Gate's SQL Server Execution Plans, by Grant Fritchey
A clustered index generally is about as big as the same data in a heap would be (assuming the same page fullness). It should use just a little more reads than a heap would use because of additional B-tree levels.
A CI cannot be smaller than a heap would be. I don't see why you would think that. Most of the size of a partition (be it a heap or a tree) is in the data.
Note, that less physical reads does not necessarily translate to a query being faster. Random IO can be 100x slower than sequential IO.
When to use Clustered Index-
Query Considerations:
1) Return a range of values by using operators such as BETWEEN, >, >=, <, and <=
2) Return large result sets
3) Use JOIN clauses; typically these are foreign key columns
4) Use ORDER BY, or GROUP BY clauses. An index on the columns specified in the ORDER BY or GROUP BY clause may remove the need for the Database Engine to sort the data, because the rows are already sorted. This improves query performance.
Column Considerations :
Consider columns that have one or more of the following attributes:
1) Are unique or contain many distinct values
2) Defined as IDENTITY because the column is guaranteed to be unique within the table
3) Used frequently to sort the data retrieved from a table
Clustered indexes are not a good choice for the following attributes:
1) Columns that undergo frequent changes
2) Wide keys
When to use Nonclustered Index-
Query Considerations:
1) Use JOIN or GROUP BY clauses. Create multiple nonclustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
2) Queries that do not return large result sets
3) Contain columns frequently involved in search conditions of a query, such as WHERE clause, that return exact matches
Column Considerations :
Consider columns that have one or more of the following attributes:
1) Cover the query. For more information, see Index with Included Columns
2) Lots of distinct values, such as a combination of last name and first name, if a clustered index is used for other columns
3) Used frequently to sort the data retrieved from a table
Database Considerations:
1) Databases or tables with low update requirements, but large volumes of data can benefit from many nonclustered indexes to improve query performance.
2) Online Transaction Processing applications and databases that contain heavily updated tables should avoid over-indexing. Additionally, indexes should be narrow, that is, with as few columns as possible.
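As a sketch of those guidelines in DDL form (the table and column names here are hypothetical):

```sql
-- Clustered index: supports range scans and ordered retrieval
-- on a column frequently used for sorting/ranges.
CREATE CLUSTERED INDEX CIX_Sales_SaleDate
    ON dbo.Sales (SaleDate);

-- Narrow non-clustered index: supports exact-match WHERE lookups.
CREATE NONCLUSTERED INDEX IX_Sales_CustomerId
    ON dbo.Sales (CustomerId);
```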
Try running
DBCC DROPCLEANBUFFERS
before the queries, if you really want to compare them.
Physical reads don't mean the same as logical reads when optimizing a query.
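A minimal sketch of such a cold-cache comparison (only do this on a test server, never in production):

```sql
-- Flush dirty pages to disk first, so DROPCLEANBUFFERS can evict everything:
CHECKPOINT;
-- Remove all clean pages from the buffer pool (cold-cache test):
DBCC DROPCLEANBUFFERS;

-- Show physical and logical reads for the query being compared:
SET STATISTICS IO ON;
SELECT id FROM scan;  -- the query from the question
SET STATISTICS IO OFF;
```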

Optimizing my SQL queries - picking the right indexes

I have a basic table as follows.
create table Orders
(
ID INT IDENTITY(1,1) PRIMARY KEY,
Company VARCHAR(3),
ItemID INT,
BoxID INT,
OrderNum VARCHAR(5),
Status VARCHAR(5),
--about 10 more columns, varchars and ints and dates
)
I'm trying to optimize all my SQL since I am getting a fair few deadlocks and some slowness - but I'm no expert on this sort of thing!
I created a few indexes:
Clustered on the ID (Primary Key).
Non-Clustered index on ([ItemID])
Non-Clustered index on ([BoxID])
Non-Clustered index on ([Company],[OrderNum],[Status])
Maybe 1 or 2 more on some other columns
But I'm not 100% happy with the results.
SELECT * FROM Orders WHERE ItemID=100
Gives me an index seek + a key lookup and a Nested loop (Inner join).
I can see why - but I don't know if I should do anything about it. The key lookup is 97% of the batch, which seems bad!
Every query used will pull back every column in the table, but I don't like the idea of including every column in the index.
I'm making a change now to query everything on the [Company] field. Every query will be using it, because results should never contain more than one company. So they will all change:
SELECT * FROM Orders WHERE ItemID=100 --Old
SELECT * FROM Orders WHERE Company='a' and ItemID=100 --New
But the execution plan of that gives me exactly the same as not including company (which does surprise me!).
Why are the two execution plans above the same? (I have no index on [Company] at the moment.)
Is it worth adding [Company] to all my indexes, since it seems to make no difference to the execution plan?
Should I instead just add one single index on [Company] and keep the original indexes? But will that mean every query will have two seeks?
Is it worth 'including' all other columns in my indexes to avoid the key lookup? (Making the index a tonne bigger, but potentially speeding it up?) i.e.
CREATE NONCLUSTERED INDEX [IX_Orders_MyIndex] ON [Orders]
( [Company] ASC, [OrderNum] ASC, [Status] ASC )
INCLUDE ([ID],[ItemID],[BoxID],
[Column5],[Column6],[Column7],[Column8],[Column9],[Column10],etc)
That seems messy if I did it on 4 or 5 indexes.
Basically I have 4-5 queries which run quite often (some selects and updates) so I want to make it as efficient as possible.
All queries will use the [Company] field, and at least one other. How should I go about it?
Any help appreciated :)
In your execution plan, the lookup takes 97% of the batch.
In this case that doesn't mean much, because an index seek is very fast and there wasn't much other work to be done.
That lookup is actually reading the record pointed to by the index you specified.
Why are the two execution plans above the same? (I have no index on [company] at the moment)
Non-Clustered index on ([Company],[OrderNum],[Status])
This index will be considered only if Company, OrderNum and Status appear in your where clause.
Concatenated indexes generate a key that would look like this: 0000000000000. When you pass only Company, it creates an incomplete key that requires using a wildcard for the other two values.
It would look a little like this: key LIKE 'XXX%', and this logic requires an index scan, which is time-consuming.
The optimizer will determine that it's preferable to first seek the rows from the ItemID index and then scan those to match any with the required company.
Is it worth adding [Company] to all my indexes since it seems to make 0 different to the execution plan?
You should consider having a Company index instead of adding it to all your indexes.
Composite indexes can speed things up by reducing the number of nested loops, but you have to think them through thoroughly.
The order of the fields you add to such an index is very important: they should be ordered by selectivity (most unique first) to allow a better seek. Also, you should never add a field that might not be used in a query.
Should I instead just add 1 single index to [Company] and keep the original indexes? - but will that mean every query will have 2 seeks?
Having more than one index seek is not all that bad; the seeks can run side by side, and only the rows matching both results are returned.
Is it worth 'including' all other columns in my indexes to avoid the key lookup? (making the index a tonne bigger, but potentially speeding it up?)
It is worth it when only a few fields might be optional in the WHERE clause, or when you have queries that select only those fields while using the specified index.
Last notes
Not all indexes are equal: comparing strings (varchar) is not the same as comparing numbers (integer, datetime, bytes, etc.).
Also, keeping them clean helps a lot; if your indexes are fragmented, they will be next to useless in terms of performance gain.

Why is my index not automatically used?

I have a table
Archive(VarId SMALLINT, Timestamp DATETIME, Value FLOAT)
VarId is not unique. The table contains measurements. I have a clustered index on Timestamp. Now I have the requirement of finding a measurement for a specific VarId before a specific date, so I do:
SELECT TOP(1) *
FROM Archive
WHERE VarId = 135
AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;
If there is no such measurement this query searches the whole table. So I introduced another index on (VarId, Timestamp).
My problem is: SQL Server doesn't seem to care about it, the query still takes forever. When I explicitly state 'WITH (INDEX = <id>)' it works as it should. What can I do so SQL Server uses my index automatically?
I'm using SQL Server 2005.
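For reference, the hint described above looks like this as a sketch (the index name is assumed, not taken from the question):

```sql
-- Force the optimizer to use the (VarId, Timestamp) index by name:
SELECT TOP(1) *
FROM Archive WITH (INDEX = IX_Archive_VarId_Timestamp)  -- hypothetical name
WHERE VarId = 135
  AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;
```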
There are different possibilities with this.
I'll try help you to isolate them:
It could be SQL Server is favouring your Clustered Index (very likely it's the Primary Key) over your newly created index. One way to solve this is to have a NonClustered Primary Key and cluster the index on the other two fields (varid and timestamp). That is, if you don't want varid and timestamp to be the PK.
Also, looking at the (estimated) execution plan might help.
But I believe #1 only works nicely if those two fields form the most commonly used (queried) index. To find out whether this is the case, it would be good to analyse which indexes are actually used the most (from http://sqlblog.com/blogs/louis_davidson/archive/2007/07/22/sys-dm-db-index-usage-stats.aspx):
select
ObjectName = object_schema_name(indexes.object_id) + '.' + object_name(indexes.object_id),
indexes.name,
case when is_unique = 1 then 'UNIQUE ' else '' end + indexes.type_desc,
ddius.user_seeks,
ddius.user_scans,
ddius.user_lookups,
ddius.user_updates
from
sys.indexes
left join sys.dm_db_index_usage_stats ddius on (
indexes.object_id = ddius.object_id
and indexes.index_id = ddius.index_id
and ddius.database_id = db_id()
)
WHERE
object_schema_name(indexes.object_id) != 'sys' -- exclude sys objects
AND object_name(indexes.object_id) LIKE 'Archive'
order by
ddius.user_seeks + ddius.user_scans + ddius.user_lookups
desc
Good luck
My guess is that your index design is the issue. You have a CLUSTERED index on a DATETIME field, and I suspect that it is not unique data, much like VarId, and hence you did not declare it as UNIQUE. Because it is not unique, there is a hidden, 4-byte "uniqueifier" field (so that each row can be physically unique regardless of you not giving it unique data), and the rows with the same DATETIME value are essentially random within the group of same DATETIME values (so even narrowing down a time still requires scanning through that grouping).

You also have a NONCLUSTERED index on VarId, Timestamp. NONCLUSTERED indexes include the key of the CLUSTERED index, so internally your NONCLUSTERED index is really: VarId, Timestamp, Timestamp (from the CLUSTERED index). So you could have left off the Timestamp column in the NONCLUSTERED index and it would have all been the same to the optimizer - but in a sense it would have been better, as it would be a smaller index.
So your physical layout is based on a date while the VarId values are spread across those dates. Hence VarId = 135 can be spread very far apart in terms of data pages. Yes, your non-clustered index does group them together, but the optimizer is probably looking at the fact that you are wanting all fields (the "SELECT *" part) and the Timestamp < '2012-06-01 14:21:00' condition in addition to that seems to get most of what you need as opposed to finding a few rows and doing a bookmark lookup to get the "Value" field to fulfill the "SELECT *". Quite possibly if you do just "SELECT TOP(1) VarId, Timestamp" it would more likely use your NONCLUSTERED index without needing the "INDEX =" hint.
Another issue affecting performance overall could be that the ORDER BY is requesting the Timestamp in DESC order and if you have the CLUSTERED index in ASC order then it would be the opposite direction of what you are looking for (at least in this query). Of course, in that case then it might be ok to have Timestamp in the NONCLUSTERED index if it was in DESC order.
My advice is to rethink the CLUSTERED index. Judging on just this query alone (other queries/uses might alter the recommendation), try dropping the NONCLUSTERED index and recreating the CLUSTERED index with the Timestamp field first, in DESC order, and also with the VarId so it can be declared UNIQUE. So:
CREATE UNIQUE CLUSTERED INDEX [UIX_Archive_Timestamp_VarId]
ON Archive (Timestamp DESC, VarId ASC)
This, of course, assumes that the Timestamp and VarId combination is unique. If not, then still try this without the UNIQUE keyword.
Update:
To pull all of this info and advice together:
When designing indexes you need to consider the distribution of the data and the use-cases for interacting with it. More often than not there is A LOT to consider and several different approaches will appear good in theory. You need to try a few approaches, profile/test them, and see which works best in reality. There is no "always do this" approach without knowing all aspects of what you are doing and what else is going on and what else is planned to use and/or modify this table which I suspect has not been presented in the original question.
So to start the journey, you are ordering records by date and are looking at ranges of dates AND dates naturally occur in order so putting Timestamp first benefits more of what you are doing and has less fragmentation, especially if defined as DESC in the CREATE. Having an NC index on just VarId at that point will then be fine, even if spread out, for looking at a set of rows for a particular VarId. So maybe start there (change order of direction of CLUSTERED index and remove Timestamp from the NC index). See how those changes compare to the existing structure. Then try moving the VarId field into the CLUSTERED index and remove the NC index. You say that the combination is also not unique but does increase the predictability of the ordering of the rows. See how that works. Does this table ever get updated? If not and if the Value field along with Timestamp and VarId would be unique, then try adding that to the CLUSTERED index and be sure to create with the UNIQUE keyword. See how these different approaches work by looking at the Actual Execution Plan and use SET STATISTICS IO ON before running the query and see how the logical reads between the different approaches compare.
Hope this helps :)
You might need to analyze your table to collect statistics, so the optimizer can determine whether to use the index or not.
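For example (a sketch; the table name comes from the question above):

```sql
-- Refresh optimizer statistics for the Archive table, so cardinality
-- estimates reflect the current data distribution:
UPDATE STATISTICS dbo.Archive WITH FULLSCAN;
```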

What is the proper index to make SELECT queries more efficient?

I have a table similar you can see below
Table Keywords
Column ID
Column Keyword
Column Keyword2
the first query is
select keyword from keywords with (nolock) where keyword = @keyword
another query for the same table is
select keyword2 from keywords with (nolock) where keyword2 like 'keyword%'
My question is: what index type should I set on which columns of this table
to make the SELECTs more efficient? Should it be a clustered or a non-clustered index, and on which columns should I set it?
This table contains about 600k rows and it constantly growing.
Another question: I'm getting a deadlock error when I try to insert a new record into the Keywords table. What can be the problem? I'm selecting records with NOLOCK.
Thank you
Since your two queries are on totally separate columns, you will need two separate non-clustered indices:
one index on keyword to speed up the first query
a second index on keyword2 to speed up the second query
And assuming you're using SQL Server: neither of them really makes a good clustered index, I would say - but a good clustered index would be really beneficial!
A good clustered index should be:
unique
small
stable (never changing)
ever-increasing
Your best bet would be on an INT IDENTITY field. See Kimberly Tripp's outstanding blog post Ever-increasing clustering key - the Clustered Index Debate..........again! for more detailed background on the requirements for a good clustering key.
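Putting that advice together as a sketch (the column types are assumptions; adjust to your actual schema):

```sql
-- Surrogate IDENTITY key as the clustered primary key:
-- narrow, unique, stable, ever-increasing.
CREATE TABLE dbo.Keywords
(
    ID       INT IDENTITY(1,1) NOT NULL,
    Keyword  VARCHAR(200) NOT NULL,
    Keyword2 VARCHAR(200) NOT NULL,
    CONSTRAINT PK_Keywords PRIMARY KEY CLUSTERED (ID)
);

-- One non-clustered index per searched column:
CREATE NONCLUSTERED INDEX IX_Keywords_Keyword  ON dbo.Keywords (Keyword);
CREATE NONCLUSTERED INDEX IX_Keywords_Keyword2 ON dbo.Keywords (Keyword2);
```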
If we are really seeing the only use cases, you want a clustered key on keyword2 and then hope your DBMS is smart enough to optimize index use with LIKE operator. Clustering helps when the returned rows from a typical query are adjacent in the DB, so keeping the table in alphabetical order on keyword2 will mean fewer pages have to be scanned on the SELECT. Clustering a table where access is pretty much random (e.g., user names) won't give you any more than a standard index.

is count(indexed column) faster than count(*)? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Performance of COUNT SQL function
Hi all,
I have very large tables and I need to know the number of records in each. My question is: does it reduce the run time if I run:
select count(indexed column like my PK) from tbTest
instead of
select count(*) from tbTest
See Performance of COUNT SQL function.
The important thing to note is that they are not equivalent.
Since the question is whether or not there is a performance difference, it depends on the index. When you do COUNT(*), it will use the PK column(s) to determine the number of rows. If you do not have any indexes besides a clustered index on the PK column(s), it will scan the leaf nodes of the clustered index. That's probably a lot of pages. If you have a non-clustered index that is skinnier than the clustered index, it will choose that instead, resulting in fewer reads.
So, if the column you select is contained in the smallest possible non-clustered index on the table, the SQL query optimizer will choose that for both count() (if you have a clustered ix that is the PK) and count(indexed_column). If you choose a count(indexed_col) that is only contained in a wide index, then the count() will be faster if your PK is a clustered index. The reason this works is that there is a pointer to the clustered index in all non-clustered indexes and SQL Server can figure out the number of rows based on that non-clustered index.
So, as usual in SQL Server, it depends. Do a showplan and compare the queries to each other.
SELECT COUNT(*) may be faster. That is because using * gives the optimizer the liberty to choose any column to count on. Say you have a primary key on an INT column, and a non-clustered index on a different BIGINT column. The primary key is likely the clustered index, and as such it is in fact significantly larger than the non-clustered BIGINT index (it has more pages). So if the optimizer is free to choose the BIGINT non-clustered index, it can return the response faster - possibly much faster, depending on the table.
So overall it is always better to leave it as COUNT(*) and let the optimizer choose.
Most likely, if the query scans the index instead of the whole table.
It is an easy thing to test - become your own scientist.
Both are identical: if you look at the query execution plan for both, both will do an "index scan".
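A quick way to test this yourself (a sketch using the table name from the question; ID is assumed to be the PK column):

```sql
-- Compare logical reads and the chosen plans for the two forms:
SET STATISTICS IO ON;

SELECT COUNT(*) FROM dbo.tbTest;
SELECT COUNT(ID) FROM dbo.tbTest;

SET STATISTICS IO OFF;
-- Note: COUNT(ID) skips NULLs while COUNT(*) does not -
-- they are only equivalent if ID is NOT NULL (e.g. a PK).
```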