What happens if a clustered index is not unique? Can it lead to bad performance because inserted rows flow to an "overflow" page of some sorts?
Is it "made" unique and if so how? What is the best way to make it unique?
I am asking because I am currently using a clustered index to divide my table in logical parts, but the performance is so-so, and recently I got the advice to make my clustered indexes unique. I'd like a second opinion on that.
They don't have to be unique but it certainly is encouraged.
I haven't encountered a scenario yet where I wanted to create a CI on a non-unique column.
What happens if you create a CI on a non-unique column
If the clustered index is not a unique
index, SQL Server makes any duplicate
keys unique by adding an internally
generated value called a uniqueifier
Does this lead to bad performance?
Adding a uniqueifier certainly adds some overhead in calculating and in storing it.
If this overhead will be noticable depends on several factors.
How much data the table contains.
What is the rate of inserts.
How often is the CI used in a select (when no covering indexes exist, pretty much always).
Edit
as been pointed out by Remus in comments, there do exist use cases where creating a non-unique CI would be a reasonable choice. Me not having encountered one off those scenarios merely shows my own lack of exposure or competence (pick your choice).
I like to check out what The Queen of Indexing, Kimberly Tripp, has to say on the topic:
I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....
Let's start with the key things that I look for in a clustering key:
* Unique
* Narrow
* Static
Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustereD index rebuild (which because of the poor choice for the clustering key is now more likely).
Source: Ever-increasing clustering key debate - again!
Do clustered indexes have to be unique?
They don't, and there are times where it's better if they're not.
Consider a table with a semi-random, unique EmployeeId, and a DepartmentId for each employee: if your select statement is
SELECT * FROM EmployeeTable WHERE DepartmentId=%DepartmentValue%
then it's best for performance if the DepartmentId is the clustered index even though (or even especially because) it's not the unique index (best for performance because it ensures that all the records within a given DepartmentId are clustered).
Do you have any references?
There's Clustered Index Design Guidelines for example, which says,
With few exceptions, every table
should have a clustered index defined
on the column, or columns, that offer
the following:
Can be used for frequently used queries.
Provide a high degree of uniqueness.
Can be used in range queries.
My understanding of "high degree of uniqueness" for example is that it isn't good to choose "Country" as the clusted index if most of your queries want to select the records within a given town.
If you are tuning an old DB this is a Godsend. I am working on Perf issues on a 20-year-old DB. It has nonclustered PKs with 3 - 8 columns. Instead of using all 8 columns to be unique I can pick one column with broad distribution, and it applies a Uniqueifier. It is an Int but by using a column like Project ID it can handle 2147483647 unique projectIDs which is enough for most use-cases. If it is not enough add a second or third column to the cluster.
This works without any coding modification in the App layer. 20 years in production and management doesn't have to order a major rewrite.
Related
Here is the scenario:
SQL Server 2000 (8.0.2055)
Table currently has 478 million rows of data. The Primary Key column is an INT with IDENTITY. There is an Unique Constraint imposed on two other columns with a Non-Clustered Index. This is a vendor application and we are only responsible for maintaining the DB.
Now the vendor has recommended doing the following "to improve performance"
Drop the PK and Clustered Index
Drop the non-clustered index on the two columns with the UNIQUE CONSTRAINT
Recreate the PK, with a NON-CLUSTERED index
Create a CLUSTERED index on the two columns with the UNIQUE CONSTRAINT
I am not convinced that this is the right thing to do. I have a number of concerns.
By dropping the PK and indexes, you will be creating a heap with 478 million rows of data. Then creating a CLUSTERED INDEX on two columns would be a really mammoth task. Would creating another table with the same structure and new indexing scheme and then copying the data over, dropping the old table and renaming the new one be a better approach?
I am also not sure how the stored procs will react. Will they continue using the cached execution plan, considering that they are not being explicitly recompiled.
I am simply not able to understand what kind of "performance improvement" this change will provide. I think that this will actually have the reverse effect.
All thoughts welcome.
Thanks in advance,
Raj
All stored procs will recompile automatically. This will happen anyway when stats change and after index maintenance anyway.
At some point, you have to reorganise 478 million rows (drop/create indexes) or move (new table). Neither way is better then the other, unfortunately. I feel your pain though: we have similarly large tables with pending new columns and an index changes.
Saying that, you should do it step 2-1-4-3 to avoid unnecessary non-clustered index maintenance when you drop/create the clustered index.
And drop the unique constraint. The clustered index could be unique and clustered. A unique constrint is just another index that would be unnecessary.
As for the performance benefit, perhaps ask the vendor why.
The one thing I would have a serious look at is the question what type those other two columns are - how big are they, compared to the INT IDENTITY (4 byte) ??
The reason I ask: the clustering key will be added to all non-clustered indices on the table, too - and if you have close to 500 million rows, it will make a huge difference whether the clustering key is a single 4-byte INT, or e.g. two 16-byte GUID's.
This is not only on disk, mind you - the pages are loaded into SQL Server's RAM in their entirety - so by potentially bloating up your clustering key, you'd incur performance penalties due to the larger number of pages on disk (and in RAM) that your non-clustered indices would need.
The only compelling reason I could see to actually go through with those changes would be if by clustering the table using those two other columns, you'd gain something in terms of query performance, e.g. if some of the most frequent queries would be faster, due to the fact that the table is now clustered by these two columns. That's really hard to know unless you know what the access and query patterns really are....
Would creating another table with the same structure and new indexing scheme and then copying the data over, dropping the old table and renaming the new one be a better approach?
I believe that this is what SQL Enterprise Manager will do behind the scenes anyways if you use the visual tools to do this. If you make a schema change such as add a column in the middle of a table, or change primary keys, there is a little button that will allow you to "Script Changes". If you view this script, you can see the steps that Enterprise Manager will take in order to do what you requested.
In many places it's recommended that clustered indexes are better utilized when used to select range of rows using BETWEEN statement. When I select joining by foreign key field in such a way that this clustered index is used, I guess, that clusterization should help too because range of rows is being selected even though they all have same clustered key value and BETWEEN is not used.
Considering that I care only about that one select with join and nothing else, am I wrong with my guess ?
Discussing this type of issue in the absolute isn't very useful.
It is always a case-by-case situation !
Essentially, access by way of a clustered index saves one indirection, period.
Assuming the key used in the JOIN, is that of the clustered index, in a single read [whether from an index seek or from a scan or partial scan, doesn't matter], you get the whole row (record).
One problem with clustered indexes, is that you only get one per table. Therefore you need to use it wisely. Indeed in some cases, it is even wiser not to use any clustered index at all because of INSERT overhead and fragmentation (depending on the key and the order of new keys etc.)
Sometimes one gets the equivalent benefits of a clustered index, with a covering index, i.e. a index with the desired key(s) sequence, followed by the column values we are interested in. Just like a clustered index, a covering index doesn't require the indirection to the underlying table. Indeed the covering index may be slightly more efficient than the clustered index, because it is smaller.
However, and also, just like clustered indexes, and aside from the storage overhead, there is a performance cost associated with any extra index, during INSERT (and DELETE or UPDATE) queries.
And, yes, as indicated in other answers, the "foreign-key-ness" of the key used for the clustered index, has absolutely no bearing on the the performance of the index. FKs are constraints aimed at easing the maintenance of the integrity of the database but the underlying fields (columns) are otherwise just like any other field in the table.
To make wise decisions about index structure, one needs
to understands the way the various index types (and the heap) work
(and, BTW, this varies somewhat between SQL implementations)
to have a good image of the statistical profile of the database(s) at hand:
which are the big tables, which are the relations, what's the average/maximum cardinality of relation, what's the typical growth rate of the database etc.
to have good insight regarding the way the database(s) is (are) going to be be used/queried
Then and only then, can one can make educated guesses about the interest [or lack thereof] to have a given clustered index.
I would ask something else: would it be wise to put my clustered index on a foreign key column just to speed a single JOIN up? It probably helps, but..... at a price!
A clustered index makes a table faster, for every operation. YES! It does. See Kim Tripp's excellent The Clustered Index Debate continues for background info. She also mentions her main criteria for a clustered index:
narrow
static (never changes)
unique
if ever possible: ever increasing
INT IDENTITY fulfills this perfectly - GUID's do not. See GUID's as Primary Key for extensive background info.
Why narrow? Because the clustering key is added to each and every index page of each and every non-clustered index on the same table (in order to be able to actually look up the data row, if needed). You don't want to have VARCHAR(200) in your clustering key....
Why unique?? See above - the clustering key is the item and mechanism that SQL Server uses to uniquely find a data row. It has to be unique. If you pick a non-unique clustering key, SQL Server itself will add a 4-byte uniqueifier to your keys. Be careful of that!
So those are my criteria - put your clustering key on a narrow, stable, unique, hopefully ever-increasing column. If your foreign key column matches those - perfect!
However, I would not under any circumstances put my clustering key on a wide or even compound foreign key. Remember: the value(s) of the clustering key are being added to each and every non-clustered index entry on that table! If you have 10 non-clustered indices, 100'000 rows in your table - that's one million entries. It makes a huge difference whether that's a 4-byte integer, or a 200-byte VARCHAR - HUGE. And not just on disk - in server memory as well. Think very very carefully about what to make your clustered index!
SQL Server might need to add a uniquifier - making things even worse. If the values will ever change, SQL Server would have to do a lot of bookkeeping and updating all over the place.
So in short:
putting an index on your foreign keys is definitely a great idea - do it all the time!
I would be very very careful about making that a clustered index. First of all, you only get one clustered index, so which FK relationship are you going to pick? And don't put the clustering key on a wide and constantly changing column
An index on the FK column will help the JOIN because the index itself is ordered: clustered just means that the data on disk (leaf) is ordered rather then the B-tree.
If you change it to a covering index, then clustered vs non-clustered is irrelevant. What's important is to have a useful index.
It depends on the database implementation.
For SQL Server, a clustered index is a data structure where the data is stored as pages and there are B-Trees and are stored as a separate data structure. The reason you get fast performance, is that you can get to the start of the chain quickly and ranges are an easy linked list to follow.
Non-Clustered indexes is a data structure that contains pointers to the actual records and as such different concerns.
Refer to the documentation regarding Clustered Index Structures.
An index will not help in relation to a Foreign Key relationship, but it will help due to the concept of "covered" index. If your WHERE clause contains a constraint based upon the index. it will be able to generate the returned data set faster. That is where the performance comes from.
The performance gains usually come if you are selecting data sequentially within the cluster. Also, it depends entirely on the size of the table (data) and the conditions in your between statement.
OK. I've read things here and there about SQL Server heaps, but nothing too definitive to really guide me. I am going to try to measure performance, but was hoping for some guidance on what I should be looking into. This is SQL Server 2008 Enterprise. Here are the tables:
Jobs
JobID (PK, GUID, externally generated)
StartDate (datetime2)
AccountId
Several more accounting fields, mainly decimals and bigints
JobSteps
JobStepID (PK, GUID, externally generated)
JobID FK
StartDate
Several more accounting fields, mainly decimals and bigints
Usage: Lots of inserts (hundreds/sec), usually 1 JobStep per Job. Estimate perhaps 100-200M rows per month. No updates at all, and the only deletes are from archiving data older than 3 months.
Do ~10 queries/sec against the data. Some join JobSteps to Jobs, some just look at Jobs. Almost all queries will range on StartDate, most of them include AccountId and some of the other accounting fields (we have indexes on them). Queries are pretty simple - the largest part of the execution plans is the join for JobSteps.
The priority is the insert performance. Some lag (5 minutes or so) is tolerable for data to appear in the queries, so replicating to other servers and running queries off them is certainly allowable.
Lookup based on the GUIDs is very rare, apart from joining JobSteps to Jobs.
Current Setup: No clustered index. The only one that seems like a candidate is StartDate. But, it doesn't increase perfectly. Jobs can be inserted anywhere in a 3 hour window after their StartDate. That could mean a million rows are inserted in an order that is not final.
Data size for a 1 Job + 1 JobStepId, with my current indexes, is about 500 bytes.
Questions:
Is this a good use of a heap?
What's the effect of clustering on StartDate, when it's pretty much non-sequential for ~2 hours/1 million rows? My guess is the constant re-ordering would kill insert perf.
Should I just add bigint PKs just to have smaller, always increasing keys? (I'd still need the guids for lookups.)
I read GUIDs as PRIMARY KEYs and/or the clustering key, and it seemed to suggest that even inventing a key will save considerable space on other indexes. Also some resources suggest that heaps have some sort of perf issues in general, but I'm not sure if that still applies in SQL 2008.
And again, yes, I'm going to try to perf test and measure. I'm just trying to get some guidance or links to other articles so I can make a more informed decision on what paths to consider.
Yes, heaps have issues. Your data will logically fragment all over the show and can not be defragmented simply.
Imagine throwing all your telephone directory into a bucket and then trying to find "bob smith". Or using a conventional telephone directory with a clustered index on lastname, firstname.
The overhead of maintaining the index is trivial.
StartDate, unless unique, is not a good choice. A clustered index requires internal uniqueness for the non-clustered indexes. If not declared unique, SQL Server will add a 4 byte "uniquifier".
Yes, I'd use int or bigint to make it easier. As for GUIDs: see the questions at the right hand side of the screen.
Edit:
Note, PK and clustered index are 2 separate issues even if SQL Server be default will make the PK clustered.
Heap fragmentation isn't necessarily the end of the world. It sounds like you'll rarely be scanning the data, so that's not the end of the world.
Your non-clustered indexes are the things that will impact your performance. Each one will need to store the address of the row in the underlynig table (either a heap or a clustered index). Ideally, your queries never have to use the underlying table itself, because it stores all the information needed in the ideal way (including all columns, so that it's a covering index).
And yes, Kimberly Tripp's stuff is the best around for indexes.
Rob
As your own research has shown, and as all the other answerers have mentioned, using a GUID as the clustered index on a table is a bad idea.
However, having a heap also isn't really a good choice, since heaps have other issues, mostly to do with fragmentation and other things that just don't work well with a heap.
My best practice advice would always be this:
do use a primary, clustered key on any data table (unless it's a temporary table, or a table used for bulk-loading)
try to make sure the clustered key is a INT IDENTITY or BIGINT IDENTITY
I would argue that the benefits you get by adding a INT/BIGINT - even just for the sake of having a good clustered index - far outweigh the drawbacks this has (as Kim Tripp also argues in her blog post you cited).
Marc
As a GUId is your primary and foreign key your database will still need to check the contraints on every insert you will probably need to index this. Indexing a GUId is not advisable due to it's randomness. Therefore I'd say absolutely you should go down the bigint (probably identity) route for your primary key and use it as a clustered index.
I have inherited a database where there are clustered indexes and additional duplicate indexes for each of the clustered index.
i.e
IX_PrimaryKey is a clustered index on the column ID.
IX_ID is a non clustered index on the column ID.
I want to clean up these duplicate non clustered indexes and I wanted to check to see if anyone could think of a reason to do this.
Can anyone think of a performance benefit for doing this?
For exact same indexes, there's no performance gain. Actually, it incurs performance loss in insertion and updates. However, if there are multicolumn indexes with different column order, there might be a valid reason for them.
Maybe I'm not thinking hard enough, but I can't see any reason to do this; the nature of the clustered index is that the data is organized in the order of the index. It seems that the extra index is a complete waste.
Digging through BOL and watching this question, though ...
There seems no sensible reason for doing this, and there is a performance hit.
The only thing I could think of to do this is to create an index with an incredibly narrow row width so that the rows per page was very high, making it very quick to scan / seek. But since it contains no other fields (except the clustered key, which is the same value) I still cannot see a reason for it.
It's quite possible the original creator was not aware that the PK was defaulting to a clustered index and created an NC index without realising it was a duplicate.
I presume what would have happened is that SQL Server would have automatically created clustered index when a primary key constraint was specified (this would happen if another index (non-clustered/clustered) is not present already) and then some one might have created a non-clustered index for the primary key column.
Such a scenario would:
Have some adverse effect on performance as indexes are updated when inserts/deletes/updates happen.
Use additional disk space.
Might lead to deadlocks.
Would contribute to more time in backup/restore of database.
cheers
It will be a waste to create a clustered primary key. Unless you have query that search for records using WHERE ID = 10 ?
You may want to create a clustered index on the column which will be frequently queried on WHERE City = 'Sydney'. Clustered means that SQL will group the data in the table based on the clustered index. By grouping the City values in the table means SQL can search for data quicker.
Storing two indexes over the same data is a waste of disk space and the processing needed to maintain the data.
However, I can imagine a product which depends on the existence of an index named IX_PrimaryKey. E.G.
string queryPattern = "select * from {0} as t with (index(IX_PrimaryKey))";
You can make the argument that the clustered index itself occupies much less space than the others, since the leaf is the actual data. On the other hand, the clustered index can be more susceptible to page splitting, and some indexes are better non-clustered.
Putting this together, I can definitely think of scenarios where removing the duplicate indexes would be a Bad Thing:
Code like above which depends on a known index name.
Code which can alter the clustered index to any of the non-clustered indexes.
Code which uses the presence/absence of IX_PrimaryKey to treat the table in a certain way.
I don't consider any of these good design, but I can definitely imagine someone doing it. (Have you posted this to DailyWTF?)
There are cases where it makes sense to have overlapping indexes which are not identical:
create index IX_1 on table1 (ID)
create index IX_2 on table1 (ID, TYPE, ORDER_DATE, TOTAL_CHARGES)
If you are looking up strictly by ID, SQL can optimize and use IX_1. If you are running a query based on ID, TYPE, ORDER_DATE and summing up TOTAL_CHARGES, SQL can use IX_2 as a "covering index", satisfying all the query details from the index without ever touching the table. Generally this is something you add in the course of performance tuning, after extensive testing.
Looking at your given example of two indexes on exactly the same field, I don't see a great fit. Perhaps SQL can use IX_ID as a "covering index" when checking for the existence of a value and bypass some blocking on IX_PrimaryKey?
I have a number of indexes on some tables, they are all similar and I want to know if the Clustered Index is on the correct column. Here are the stats from the two most active indexes:
Nonclustered
I3_Identity (bigint)
rows: 193,781
pages: 3821
MB: 29.85
user seeks: 463,355
user_scans: 784
user_lookups: 0
updates: 256,516
Clustered Primary Key
I3_RowId (varchar(80))
rows: 193,781
pages: 24,289
MB: 189.76
user_seeks: 2,473,413
user_scans: 958
user_lookups: 463,693
updates: 2,669,261
As you can see, the PK is being seeked often, but all the seeks for the i3_identity column are doing key lookups to this PK as well, so am I really benefiting from the index on I3_Identity much at all? Should I change to using the I3_Identity as the clustered? This could have a huge impact as this table structure is repeated about 10000 times where I work, so any help would be appreciated.
Frederik sums it up nicely, and that's really what Kimberly Tripp also preaches: the clustering key should be stable (never changes), ever increasing (IDENTITY INT), small and unique.
In your scenario, I'd much rather put the clustering key on the BIGINT column rather than the VARCHAR(80) column.
First of all, with the BIGINT column, it's reasonably easy to enforce uniqueness (if you don't enforce and guarantee uniqueness yourself, SQL Server will add a 4-byte "uniquefier" to each and every one of your rows) and it's MUCH smaller on average than a VARCHAR(80).
Why is size so important? The clustering key will also be added to EACH and every one of your non-clustered indexes - so if you have a lot of rows and a lot of non-clustered indexes, having 40-80 byte vs. 8 byte can quickly make a HUGE difference.
Also, another performance tip: in order to avoid the so-called bookmark lookups (from a value in your non-clustered index via the clustering key into the actual data leaf pages), SQL Server 2005 has introduced the notion of "included columns" in your non-clustered indexes. Those are extremely helpful, and often overlooked. If your queries often require the index fields plus just one or two other fields from the database, consider including those in order to achieve what is called "covering indexes". Again - see Kimberly Tripp's excellent article - she's the SQL Server Indexing Goddess! :-) and she can explain that stuff much better than I can...
So to sum it up: put your clustering key on a small, stable, unique column - and you'll do just fine!
Marc
quick 'n dirty:
Put the clustered index on:
a column who's values (almost) never change
a column for which values on new records increase / decrease
sequentially
a column where you perform range - searches
Here's the best discussion I've found about the topic. Kimberly Tripp is a MS blogger that stays on top of the debate. I could interpret it for you, but you obviously uonderstand the basic words and concepts, and the article is highly readable. So enjoy!
Hint: you'll find that short answers are almost always too simplistic.
From what I've read in the past, two of the most important measures with regards to indexing tables are the number of queries performed against the index and the index density. By using DBCC_SHOWSTATISTICS([table],[index]), you can examine index density. The idea is that you want your clustered index on the columns that provide the most distinctness per query.
In short, if you look at the "All density" measure from DBCC SHOW_STATISTICS and notice the number is very low, this is a good index to cluster. It makes logical sense to cluster on an index that provides more uniqueness, but only if it's actively queried against. Clustering on a seldom-used index will probably do more harm than good.
In the end it's a judgment call. You may want to talk with your DBA and analyze your code to see where you'll get the biggest benefit. In this limited example, your indexing seems to be clustered in the right area if you only consider usage (and even when you consider all density, given the fact that a primary key provides the most uniqueness you can muster.)
Edit: There's a pretty good article on MSDN that explains what SHOW_STATISTICS provides you. I'm certainly not an uber DBA, but most of the information I've provided here came from guidance given by our DBA :)
Here's the article: http://msdn.microsoft.com/en-us/library/ms174384.aspx
Generally, when I see key lookups to the primarykey/clustered key, it means I need to include (using the INCLUDE statement) more columns in the non-clustered key. Look at your queries and see what columns are being selected/used in those statements. If you include those columns in the non-clustered key, then it won't need to do the key lookup any more.