I'm a little confused about clustered and non-clustered indexes.
Are there any differences between MySQL and DB2 regarding clustered indexing?
In DB2, any single index on a table can be designated as the table's clustering index. The index is a normal b-tree index, no different (physically) than any other index other than the fact that it's been identified as the clustering index. The index has a series of index keys, and each index key has a list of RIDs (row IDs) that point to the physical location of the data for each row that matches the index key.
If you reorganize the table (using the REORG TABLE utility), DB2 will physically arrange the table's data (which is separate from the index's data) in the same physical order as the clustering index. DB2 will attempt to maintain the physical clustering order as new rows are inserted into the table (and you can help it by choosing an appropriate value for the table's PCTFREE attribute), but over time the cluster ratio may decrease and you may need to reorganize the table again.
Compare this with MySQL, where with InnoDB, the table's data is stored in the primary key index's structure. So, unlike DB2 where the index has the key columns and then a list of RIDs, the primary key index stores the entire row – there is no separate storage object holding the table's data. This is why it's called a clustered index rather than a clustering index. This massively increases the size of the physical index, making it significantly harder to ensure that it will remain cached in memory.
Secondary indexes in InnoDB store the index key and the primary key columns for the rows (rather than a RID) – this could be inefficient if the primary key is made up of many columns.
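To make the difference between the two layouts concrete, here is a toy sketch in Python. It is only a model under loud assumptions: plain dicts stand in for B-trees and pages, and every name and value in it (heap, db2_index_on_account, innodb_primary, the sample rows) is made up for illustration and is not taken from either product.

```python
# Toy model of the two layouts described above. All names and data are
# illustrative; real engines use B-trees and pages, not Python dicts.

# DB2-style: rows live in a separate heap; an index maps key -> list of RIDs.
heap = {
    101: ("t1", "acct_a", 50),
    102: ("t2", "acct_b", 75),
    103: ("t3", "acct_a", 20),
}
db2_index_on_account = {"acct_a": [101, 103], "acct_b": [102]}


def db2_lookup(account):
    """Probe the index, then fetch each row directly from the heap by RID."""
    return [heap[rid] for rid in db2_index_on_account[account]]


# InnoDB-style: the primary key index holds the whole row; a secondary
# index maps its key -> primary key values, not physical locations.
innodb_primary = {
    "t1": ("t1", "acct_a", 50),
    "t2": ("t2", "acct_b", 75),
    "t3": ("t3", "acct_a", 20),
}
innodb_secondary_on_account = {"acct_a": ["t1", "t3"], "acct_b": ["t2"]}


def innodb_lookup(account):
    """Probe the secondary index, then probe the primary key index per row."""
    return [innodb_primary[pk] for pk in innodb_secondary_on_account[account]]


print(db2_lookup("acct_a"))
print(innodb_lookup("acct_a"))
```

Both lookups return the same rows. The difference is what the second hop costs: a direct fetch by RID in the first layout, another index traversal keyed by the primary key in the second, which is why a wide primary key makes every secondary index larger.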
<soapbox>
Using the primary key (or any unique key) for "clustering" is ridiculous. The entire point of clustering is to maintain locality of related data. InnoDB is not alone here - Microsoft SQL Server does this as well.
Take, for example, a transaction table. The primary key for this table may be transaction_id. With InnoDB, this is the clustered index. However, the likelihood that one transaction ID is related to the next transaction ID is pretty low.
account_id would make a much better clustering key precisely because it is not unique. If I'm looking for all transactions for a particular account_id, having all of those rows on a single physical page makes a lot of sense and will greatly reduce the amount of I/O necessary to find all of those rows.
If the table's data is stored as part of the primary key's structure (i.e. on transaction_id), then you'll likely be reading pages from all over the index just to find all of the transactions for a single account.
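A rough way to see the size of this effect is to count how many pages hold rows for one account under each physical order. The sketch below is a simulation, not a measurement: the page capacity, the row counts, the number of accounts and the random assignment of accounts to transactions are all assumptions chosen for illustration.

```python
import random

ROWS_PER_PAGE = 100   # assumed page capacity
random.seed(0)        # fixed seed so the run is repeatable

# 10,000 hypothetical transactions spread at random over 50 accounts.
transactions = [(txn_id, random.randrange(50)) for txn_id in range(10_000)]


def pages_touched(rows, sort_key, account):
    """Lay the rows out on pages in sort_key order, then count the
    distinct pages that hold at least one row for the given account."""
    ordered = sorted(rows, key=sort_key)
    return len({
        position // ROWS_PER_PAGE
        for position, (_, acct) in enumerate(ordered)
        if acct == account
    })


# Physical order follows transaction_id (primary key clustering).
by_txn = pages_touched(transactions, lambda row: row[0], account=7)
# Physical order follows account_id, then transaction_id.
by_acct = pages_touched(transactions, lambda row: (row[1], row[0]), account=7)

print("pages read, clustered by transaction_id:", by_txn)
print("pages read, clustered by account_id:   ", by_acct)
```

With these assumptions the account's rows fit on a handful of adjacent pages when the table is ordered by account_id, and are scattered over most of the table when it is ordered by transaction_id.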
You may argue that storing all of the data as part of the primary key is a performance benefit (i.e., 1 I/O to get any particular row), but this also means that caching the index has just become a lot harder because it will be much bigger. "In memory" may be de rigueur, but if you need as much RAM as the size of your database to maintain performance that's useful only up to a point.
</soapbox>
Related
I have a large domain set of tables in a database - over 100 tables. Every single one uses a uniqueidentifier as a PK.
I'm realizing now that my mistake is that these are also, by default, the clustered index.
Consider a table with this type of structure:
Orders
Id (uniqueidentifier) Primary Key
UserId (uniqueidentifier)
.
.
.
.
Other columns
Most queries are going to be something like "Get top 10 orders for user X sorted by OrderDate".
In this case, would it make sense to create a clustered index on UserId,Id...that way the data is physically stored sorted by UserId?
I'm not too concerned about Inserts and Updates - those will be few enough that performance loss there isn't a big deal. I'm mostly concerned with READs.
A clustered index means that data is physically stored in the order of the values. By default, the primary key is used for the clustered index.
The problem with GUIDs is that they are generated in (essentially) random order. That means that inserts are happening "in the middle" of the table. And such inserts result in fragmentation.
Without getting into database internals, this is a little hard to explain. But what it means is that inserts require much more work than just inserting the values "at the end" of the table, because new rows go in the middle of a data page so the other rows have to be moved around.
SQL Server offers a solution for this, newsequentialid(). On a given server, this returns a sequential value which is inserted at the end. Often, this is an excellent compromise if you have to use GUIDs.
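The difference between "in the middle" and "at the end" inserts can be sketched with a small simulation. This is a simplified model, not how SQL Server is implemented: pages are sorted Python lists, the page capacity is an assumption, random 128-bit integers stand in for GUIDs, and an increasing counter stands in for sequential values.

```python
import bisect
import random

PAGE_CAPACITY = 50  # assumed number of rows that fit on one page


def simulate_inserts(keys, capacity=PAGE_CAPACITY):
    """Insert keys one by one into ordered pages.
    Returns (page_splits, rows_moved, average_page_fill)."""
    pages = [[]]            # each page is a sorted list of keys
    splits = rows_moved = 0
    for key in keys:
        # Locate the page whose key range should hold this key.
        firsts = [page[0] for page in pages if page]
        idx = max(bisect.bisect_right(firsts, key) - 1, 0)
        page = pages[idx]
        if len(page) < capacity:
            bisect.insort(page, key)
            continue
        if idx == len(pages) - 1 and key > page[-1]:
            # Insert at the very end: start a fresh page, nothing moves.
            pages.append([key])
            continue
        # Full page, key belongs inside it: split and move half the rows.
        half = capacity // 2
        new_page = page[half:]
        del page[half:]
        pages.insert(idx + 1, new_page)
        splits += 1
        rows_moved += len(new_page)
        bisect.insort(page if key < new_page[0] else new_page, key)
    fill = sum(len(p) for p in pages) / (len(pages) * capacity)
    return splits, rows_moved, fill


rng = random.Random(42)
random_keys = [rng.getrandbits(128) for _ in range(5_000)]  # GUID stand-ins
sequential_keys = range(5_000)                              # ever-increasing

print("random:    ", simulate_inserts(random_keys))
print("sequential:", simulate_inserts(sequential_keys))
```

In this model the sequential keys never split a page and leave every page full, while the random keys cause repeated splits and leave pages partly empty, which is the fragmentation described above.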
That said, I have a preference for just plain old ints as ids -- identity columns. These are smaller, so they take up less space. This is particularly true for indexes. Inserts work well because new values go at the "end" of the table. I also find integers easier to work with visually.
Using identity columns for primary keys and foreign key references still allows you to have unique GUID columns for each identity, if that is a requirement for the database (say for interfacing to other applications).
A clustered index helps when you want to retrieve rows for a range of values of a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
A GUID, while excellent as a primary key, could be positively detrimental to performance as a clustering key, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster on a GUID.
Clustering factor - an awesome, simple explanation of how it is calculated:
Basically, the CF is calculated by performing a Full Index Scan and
looking at the rowid of each index entry. If the table block being
referenced differs from that of the previous index entry, the CF is
incremented. If the table block being referenced is the same as the
previous index entry, the CF is not incremented. So the CF gives an
indication of how well ordered the data in the table is in relation to
the index entries (which are always sorted and stored in the order of
the index entries). The better (lower) the CF, the more efficient it
would be to use the index as less table blocks would need to be
accessed to retrieve the necessary data via the index.
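The calculation described in that explanation can be sketched in a few lines of Python. This is an illustration of the counting rule only, not Oracle's code: the function name, the (key, block) tuples and the two sample layouts are assumptions made up for the example.

```python
def clustering_factor(index_entries):
    """index_entries: iterable of (index_key, table_block) pairs.
    Walk the entries in key order (a full index scan) and count how
    often the referenced table block differs from the previous one."""
    cf = 0
    previous_block = None
    for _, block in sorted(index_entries, key=lambda entry: entry[0]):
        if block != previous_block:
            cf += 1
            previous_block = block
    return cf


# 1,000 hypothetical rows stored in 100 blocks of 10 rows each.
# Layout A: the table is stored in the same order as the index key.
ordered = [(key, key // 10) for key in range(1_000)]
# Layout B: consecutive keys always land in different blocks.
scattered = [(key, key % 100) for key in range(1_000)]

print(clustering_factor(ordered))    # close to the number of blocks
print(clustering_factor(scattered))  # close to the number of rows
```

A CF near the number of table blocks means the table is well ordered with respect to that index; a CF near the number of rows means almost every index entry points to a different block than the one before it.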
My Index statistics:
So, here are my indexes (each over just one column) under analysis.
The index starting with PK_ is my primary key and the UI one is a unique key. (Of course, both hold unique values.)
Query1:
SELECT index_name,
UNIQUENESS,
clustering_factor,
num_rows,
CEIL((clustering_factor/num_rows)*100) AS cluster_pct
FROM all_indexes
WHERE table_name='MYTABLE';
Result:
INDEX_NAME           UNIQUENES CLUSTERING_FACTOR   NUM_ROWS CLUSTER_PCT
-------------------- --------- ----------------- ---------- -----------
PK_TEST              UNIQUE             10009871   10453407          96  --> So High
UITEST01             UNIQUE               853733   10113211           9  --> Very Less
We can see that the PK has a very high CF, while the other unique index does not.
The only logical explanation that strikes me is that the underlying data is actually stored in the order of the column covered by the unique index.
1) Am I right with this understanding?
2) Is there any way to give the PK the lowest CF number?
3) Judging by the query cost with both of these indexes, single-row selects are very fast. Still, the CF number is what baffles us.
The table is relatively huge, with over 10M records, and also receives real-time inserts/updates.
My Database version is Oracle 11gR2, over Exadata X2
You are seeing the evidence of a heap table indexed by an ordered tree structure.
To get extremely low CF numbers you'd need to order the data as per the index. If you want to do this (like SQL Server or Sybase clustered indexes), in Oracle you have a couple of options:
Simply create supplemental indexes with additional columns that can satisfy your common queries. Oracle can return a result set from an index without referring to the base table if all of the required columns are in the index. If possible, consider adding columns to the trailing end of your PK to serve your heaviest query (practical if your query has small number of columns). This is usually advisable over changing all of your tables to IOTs.
Use an IOT (Index Organized Table) - It is a table, stored as an index, so is ordered by the primary key.
Sorted hash cluster - More complicated, but can also yield gains when accessing a list of records for a certain key (like a bunch of text messages for a given phone number)
Reorganize your data and store the records in the table in order of your index. This option is ok if your data isn't changing, and you just want to reorder the heap, though you can't explicitly control the order; all you can do is order the query and let Oracle append it to a new segment.
If most of your access patterns are random (OLTP), single record accesses, then I wouldn't worry about the clustering factor alone. That is just a metric that is neither bad nor good, it just depends on the context, and what you are trying to accomplish.
Always remember, Oracle's issues are not SQL Server's issues, so make sure any design change is justified by performance measurement. Oracle is highly concurrent, and very low on contention. Its multi-version concurrency design is very efficient and differs from other databases. That said, it is still a good tuning practice to order data for sequential access if that is your common use case.
To read some better advice on this subject, read Ask Tom: what are oracle's clustered and nonclustered indexes
I have some knowledge of clustered and non-clustered indexes, but I'm not sure when and under what conditions it would be helpful to use a non-clustered index over a clustered index. Can someone explain or provide some links that would be helpful to all of us?
pick your clustered index! Every "regular" data table ought to have a clustered index, since having a clustered index does indeed speed up a lot of operations - yes, speed up, even inserts and deletes! But only if you pick a good clustered index.
It's the most replicated data structure in your SQL Server database. The clustering key will be part of each and every non-clustered index on your table, too.
You should use extreme care when picking a clustering key - it should be:
narrow (4 bytes ideal)
unique (it's the "row pointer" after all. If you don't make it unique SQL Server will do it for you in the background, costing you a couple of bytes for each entry times the number of rows and the number of nonclustered indices you have - this can be very costly!)
static (never change - if possible)
ideally ever-increasing so you won't end up with horrible index fragmentation (a GUID is the total opposite of a good clustering key - for that particular reason)
it should be non-nullable and ideally also fixed width - a varchar(250) makes a very poor clustering key
Anything else should really be second and third level of importance behind these points ....
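Because the clustering key is repeated in every non-clustered index, its width gets multiplied by both the row count and the number of indexes. The back-of-the-envelope arithmetic below illustrates that; the row count and index count are hypothetical, and the model ignores page overhead, fill factor and compression.

```python
def clustering_key_bytes(key_width, rows, nonclustered_indexes):
    """Bytes spent storing the clustering key: once per row in the
    clustered index and once per row in every non-clustered index."""
    return key_width * rows * (1 + nonclustered_indexes)


ROWS = 10_000_000   # hypothetical table size
NC_INDEXES = 5      # hypothetical number of non-clustered indexes

int_cost = clustering_key_bytes(4, ROWS, NC_INDEXES)    # 4-byte INT key
guid_cost = clustering_key_bytes(16, ROWS, NC_INDEXES)  # 16-byte GUID key

print(f"INT key:  {int_cost / 1e6:.0f} MB")
print(f"GUID key: {guid_cost / 1e6:.0f} MB")
```

Under these assumptions the GUID key costs four times as much storage (and buffer pool memory) as the INT key, before counting any fragmentation.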
See some of Kimberly Tripp's (The Queen of Indexing) blog posts on the topic - anything she has written in her blog is absolutely invaluable - read it, digest it - live by it!
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
A non-clustered index is useful for columns that have some repeated values.
A clustered index is created automatically, by default, when we create the primary key for the table. Non-clustered indexes we have to create ourselves.
There can be only one clustered index per table. When creating the clustered index, SQL Server 2005 reads the column and builds a B-tree (a balanced tree, not a binary tree) on it, with the table's rows stored at the leaf level. Searching for a specific value in that column through the B-tree then needs far fewer comparisons.
I have a table Item with autoinc int primary key Id and a foreign key UserId.
And I have a table User with autoinc int primary key Id.
Default is that the index for Item.Id gets clustered.
I will mostly query items on user-id so my question is: Would it be better to set the UserId foreign key index to be clustered instead?
Having the clustered index on the identity field has the advantage that the records will be stored in the order that they are created. New records are added at the end of the table.
If you use the foreign key as clustered index, the records will be stored in that order instead. When you create new records the data will be fragmented as records are inserted in the middle, which can reduce performance.
If you want an index on the foreign key, then just add a non-clustered index for it.
The answer depends only on the usage scenario. For example, Guffa says that the data will be fragmented. That's wrong. If your queries depend mostly on UserId, then data clustered by ItemId is what is fragmented from your point of view, because the items for one user may be spread over a lot of pages.
Of course, compared to a sequential ItemId (if it is sequential in your schema), using UserId as the clustering key can cause page splits on insert. That is two additional page writes at most. But when you select by user, that user's items may be spread over tens of pages (depending on items per user, item size, insertion strategy, etc.), which means a lot of page reads. If you have a lot of such selects per single insert (a very common web/OLAP scenario), you can face hundreds of I/O operations compared to the few spent on page splitting. That is what the clustered index was created for, not only for clustering by surrogate IDs.
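The trade-off in the previous paragraph can be put into a crude break-even formula. The sketch below is only that: the workload numbers are hypothetical, every insert is charged the worst case of two extra page writes mentioned above, and caching is ignored.

```python
def io_saved_by_clustering_on_user(selects, inserts,
                                   pages_per_select_scattered,
                                   pages_per_select_clustered,
                                   extra_writes_per_insert=2):
    """Page reads saved on selects minus page writes added on inserts.
    A positive result means clustering on UserId wins in this model."""
    reads_saved = selects * (pages_per_select_scattered
                             - pages_per_select_clustered)
    writes_added = inserts * extra_writes_per_insert
    return reads_saved - writes_added


# Hypothetical read-heavy workload: 100 selects per insert,
# each user's items scattered over 30 pages when clustered by ItemId.
read_heavy = io_saved_by_clustering_on_user(
    selects=100_000, inserts=1_000,
    pages_per_select_scattered=30, pages_per_select_clustered=1)

# Hypothetical write-heavy workload with tiny per-user result sets.
write_heavy = io_saved_by_clustering_on_user(
    selects=1_000, inserts=100_000,
    pages_per_select_scattered=2, pages_per_select_clustered=1)

print(read_heavy, write_heavy)
```

The first workload comes out strongly in favour of clustering on UserId and the second against it, which is the point: the answer depends on the select/insert ratio and on how scattered a user's rows are.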
So there is no clear answer as to whether clustering on UserId is good or bad in your case, because it depends heavily on context. What is the ratio of selects to inserts? How scattered are a user's rows when the table is clustered by ItemId? How many additional indexes are on the table? That last point matters because of a pitfall in SQL Server (below).
As you might know, SQL Server needs clustering key values to be unique (if the key is not declared unique, it adds a hidden uniquifier). This is not a big problem, because you can create the index on the pair (UserId, ItemId). The clustered index's leaf level is the table data itself, so the number of key columns adds little to its size. But non-clustered indexes store the clustering key values in their leaves. So if you have a clustered index on UserId+ItemId (let's imagine both are [int], 8 bytes in total) and a non-clustered index on ItemId, that index will be twice the size (8 bytes per B-tree leaf entry) compared to having just ItemId as the clustered index (4 bytes per leaf entry).
In general, you want to cluster on the most frequently accessed index. But you're not required to have a clustering index at all. You (or your DBAs) need to evaluate things and weigh the advantages and disadvantages so as to choose the most appropriate indexing strategy.
If you cluster on a monotonic counter like an identity column, all new rows are going to be inserted at the end of the table: that means a "hot spot" is created that is likely to cause lock contention on inserts, since every SPID doing an insert is hitting the same data page.
Tables without a clustering index have their data pages organized as a heap, pretty much just a linked list of data pages.
SQL Server indexes are B-trees. For non-clustered indexes, the leaf nodes of the B-tree are pointers to the appropriate data page. That means if the index is used and doesn't cover the query's columns, an additional look aside has to be done to fetch the data page. That means additional I/O and paging.
Clustered indices are different: their leaf nodes are the data pages themselves, meaning the heap essentially goes away: a table scan means a traversal of the clustering index's B-tree. The advantage is that once you've found what you need in the clustered index, you already have the data page you need, thus avoiding the additional I/O that a seek on a non-clustered index is likely to require. The disadvantage, of course, is that the clustered index is larger, since it carries the entire table with it, so traversals of the clustered index are more expensive.
The clustered index is created on the primary key by default, so you can leave that as clustered and then create a non-clustered index on the UserId column of Item. This will still be very fast, as the User.Id column will be the clustered index of the User table.
Possibly.
Is the item.user-id column a unique column within your item table? If not, you'd need to make this a clustered primary key by adding a second (possibly more) column to the key to make it unique; this may add overhead that you'd not anticipated.
Are there any relationships with the item.id column? If so those may be important to the performance of your application so should be taken into account.
How often is the item.user-id value likely to change? If not at all that counts in its favour; the more often it's likely to be updated the worse, since that leads to fragmentation.
My recommendation would be to build your app with the regular item.id as the clustered key, then later, once you've got some data, try (in a test system using a copy of your production data) switching the clustered index and testing its impact; that way you can easily see real results rather than trying to guess among the multitude of possibilities. This avoids premature optimisation and ensures you make the correct choice.
I've recently been advised to convert all our tables from heaps so that each table has a clustered index. What are the consequences of pursuing this strategy? E.g. is it more important to regularly reorganize the database? Data growth? Danger of really slow inserts? Danger of page fragmentation if the PK is a GUID? Noticeable speed increase of my application? What are your experiences?
To serve as inspiration for good answers, here are some of the "facts" I've picked up from other threads here on stackoverflow
Almost certainly want to establish a clustered index on every table in your database. If a table does not have one. Performance of most common queries is better.
Clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. The INSERT speed will suffer, but the SELECT speed will be improved.
The problem with clustered indexes in a GUID field are that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the records into the middle of the table.
Clustered index on GUID is ok in situations where the GUID has a meaning and improves performance by placing related data close to each other http://randommadness.blogspot.com/2008/07/guids-and-clustered-indexes.html
Clustering doesn't affect lookup speed - a unique non-clustered index should do the job.
If your key is a GUID, then a non-clustered index on it is probably just as effective as a clustered index on it. This is because you can absolutely never have range scans on GUIDs (what could "between 'b4e8e994-c315-49c5-bbc1-f0e1b000ad7c' and '3cd22676-dffe-4152-9aef-54a6a18d32ac'" possibly mean?). With a width of 16 bytes, a GUID clustered index key is wider than the row id that you'd get from a heap, so an NC index on a PK GUID is actually a strategy that can be defended in a discussion.
But making the primary key the clustered index key is not the only way to build a clustered index over your heap. Do you have other frequent queries that request ranges over a certain column? Typical candidates are columns like date, state or deleted. If you do, then you should consider making those columns the clustered index key (it does not have to be unique) because doing so may help queries that request ranges, like 'all records from yesterday'.
The only scenario where heaps have a significant performance benefit is inserts, especially bulk inserts. If your load is not insert heavy, then you should definitely go for a clustered index. See Clustered Index Design Guidelines.
Going over your points:
Almost certainly want to establish a clustered index on every table in
your database. If a table does not
have one. Performance of most common
queries is better.
A clustered index that can satisfy range requirements for most queries will dramatically improve performance, true. A clustered index that can satisfy order requirements can be helpful too, but nowhere as helpful as one that can satisfy a range.
Clustered indexes are not always bad on GUIDs... it all depends upon
the needs of your application. The
INSERT speed will suffer, but the
SELECT speed will be improved.
Only probe SELECTs will be improved: SELECT ... WHERE key='someguid';. Queries by object ID and foreign key lookups will benefit from this clustered index. An NC index can serve the same purpose just as well.
The problem with clustered indexes in a GUID field are that the GUIDs are
random, so when a new record is
inserted, a significant portion of the
data on disk has to be moved to insert
the records into the middle of the
table.
Wrong. Inserting into a position in an index does not have to move data. The worst that can happen is a page split. A page split is (somewhat) expensive, but it is not the end of the world. Your comment suggests that all data (or at least a 'significant' part) has to be moved to make room for the new row; this is nowhere near true.
Clustered index on GUID is ok in situations where the GUID has a
meaning and improves performance by
placing related data close to each
other
http://randommadness.blogspot.com/2008/07/guids-and-clustered-indexes.html
I can't possibly imagine a scenario where a GUID can have 'related data'. A GUID is the quintessential random structure; how could two random GUIDs relate in any way? The scenario Donald gives has a better solution: Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads, which is cheaper to implement (less storage required) and works for unique keys too (the solution in the linked article would not work for unique keys, only for foreign keys).
Clustering doesn't affect lookup speed - a unique non-clustered index
should do the job.
For probes (looking up a specific unique key), yes. An NC index is almost as fast as the clustered index (the NC index lookup does require an additional key lookup to fetch the rest of the columns). Where the clustered index shines is range scans, as the clustered index can cover any query, while an NC index that could potentially satisfy the same range may lose on coverage and trigger the Index Tipping Point.
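A crude cost model shows why the tipping point arrives early. Everything here is an assumption for illustration: one page read per key lookup (the worst case, with no caching), matching rows stored contiguously in the clustered index, and the page reads needed to walk the index B-tree itself ignored. The table size and rows-per-page figures are hypothetical.

```python
def page_reads(matching_rows, total_rows, rows_per_page):
    """Return page reads for (NC index + key lookups,
    clustered index range scan, full table scan)."""
    table_pages = -(-total_rows // rows_per_page)         # ceiling division
    nc_with_lookups = matching_rows                        # one read per lookup
    clustered_range = -(-matching_rows // rows_per_page)   # adjacent pages
    return nc_with_lookups, clustered_range, table_pages


# Hypothetical table: 1,000,000 rows, 100 rows per page (10,000 pages).
for matching in (1_000, 10_000, 50_000):
    lookups, clustered, full_scan = page_reads(matching, 1_000_000, 100)
    print(f"{matching:>6} rows: NC+lookups={lookups}, "
          f"clustered range={clustered}, full scan={full_scan}")
```

In this model the non-covering NC index stops beating a full scan once the range matches about as many rows as the table has pages, here only 1% of the rows, while the clustered range scan stays cheap throughout.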
I would also recommend you read Kimberly Tripp's The Clustered Index Debate Continues... in which she details quite clearly all the benefits of having a good clustering key over having a heap.
Pretty much all operations are faster - yes! even inserts and updates!
But this requires a good clustering key, and a GUID with its very random and unpredictable nature is not considered a good candidate for a clustering key. GUIDs as clustering key are bad - whether they have application meaning or not - just avoid those.
Your best bet is a key which is narrow, stable, unique and ever-increasing - a column of type INT IDENTITY fulfills all those requirements ideally.
For a lot more background on why a GUID doesn't make a good clustering key, and on just how bad it is, see more of Kim Tripp's blog posts:
GUIDs as PRIMARY key and/or clustering key
Disk space is cheap... but that's not the point!
I can recommend the book "SQL Performance Explained" - it is a 200 page book about indexes.
It also mentions when clustered indexes have worse performance than normal indexes. One of the problems is that the clustered index itself is a B-tree. So when you have other indexes on the same table, they can't point to a specific row; instead they point to a "key" in the clustered index, so "the way" to the data gets longer.