SQL Server - can GUID be a good choice as part of a clustered index? - sql

I have a large set of domain tables in a database - over 100 tables. Every single one uses a uniqueidentifier as its PK.
I'm realizing now that my mistake is that these are also, by default, the clustered indexes.
Consider a table with this type of structure:
Orders
Id (uniqueidentifier) Primary Key
UserId (uniqueidentifier)
...
Other columns
Most queries are going to be something like "Get top 10 orders for user X sorted by OrderDate".
In this case, would it make sense to create a clustered index on (UserId, Id), so that the data is physically stored sorted by UserId?
I'm not too concerned about Inserts and Updates - those will be few enough that performance loss there isn't a big deal. I'm mostly concerned with READs.
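For concreteness, the access pattern and the candidate index would look roughly like this (a sketch; the index name is made up, and the PK would have to be made non-clustered first):
SELECT TOP (10) *
FROM Orders
WHERE UserId = @UserId
ORDER BY OrderDate DESC;

CREATE UNIQUE CLUSTERED INDEX CIX_Orders_UserId_Id
ON Orders (UserId, Id);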

A clustered index means that data is physically stored in the order of the values. By default, the primary key is used for the clustered index.
The problem with GUIDs is that they are generated in (essentially) random order. That means that inserts happen "in the middle" of the table. And such inserts result in fragmentation.
Without getting into database internals, this is a little hard to explain. But what it means is that inserts require much more work than just inserting the values "at the end" of the table, because new rows go in the middle of a data page so the other rows have to be moved around.
SQL Server offers a solution for this, newsequentialid(). On a given server, this returns a sequential value which is inserted at the end. Often, this is an excellent compromise if you have to use GUIDs.
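A minimal sketch of that compromise, with illustrative names (note that NEWSEQUENTIALID() can only be used in a DEFAULT constraint):
CREATE TABLE dbo.OrdersSeq (
    Id uniqueidentifier NOT NULL
        CONSTRAINT DF_OrdersSeq_Id DEFAULT NEWSEQUENTIALID(),
    OrderDate datetime NOT NULL,
    CONSTRAINT PK_OrdersSeq PRIMARY KEY CLUSTERED (Id)
);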
That said, I have a preference for just plain old ints as ids -- identity columns. These are smaller, so they take up less space. This is particularly true for indexes. Inserts work well because new values go at the "end" of the table. I also find integers easier to work with visually.
Using identity columns for primary keys and foreign key references still allows you to have unique GUID columns for each identity, if that is a requirement for the database (say for interfacing to other applications).
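A sketch of that pattern, with illustrative names: an identity column as the clustered PK, plus a GUID column kept unique for external consumers:
CREATE TABLE dbo.Customer (
    CustomerId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
    PublicGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_PublicGuid DEFAULT NEWID(),
    CONSTRAINT UQ_Customer_PublicGuid UNIQUE (PublicGuid)
);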

A clustered index is useful when you want to retrieve rows for a range of values in a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
A GUID, while excellent for a primary key, could be positively detrimental to performance as a clustered key, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes - don't cluster an index on a GUID.

Related

Clustering Factor and Unique Key

Clustering factor - a wonderfully simple explanation of how it is calculated:
Basically, the CF is calculated by performing a Full Index Scan and
looking at the rowid of each index entry. If the table block being
referenced differs from that of the previous index entry, the CF is
incremented. If the table block being referenced is the same as the
previous index entry, the CF is not incremented. So the CF gives an
indication of how well ordered the data in the table is in relation to
the index entries (which are always sorted and stored in the order of
the index entries). The better (lower) the CF, the more efficient it
would be to use the index as less table blocks would need to be
accessed to retrieve the necessary data via the index.
My Index statistics:
So, here are my indexes (each over just one column) under analysis.
The index starting with PK_ is my primary key and UI is a unique key. (Of course, both hold unique values.)
Query1:
SELECT index_name,
UNIQUENESS,
clustering_factor,
num_rows,
CEIL((clustering_factor/num_rows)*100) AS cluster_pct
FROM all_indexes
WHERE table_name='MYTABLE';
Result:
INDEX_NAME           UNIQUENES CLUSTERING_FACTOR   NUM_ROWS CLUSTER_PCT
-------------------- --------- ----------------- ---------- -----------
PK_TEST              UNIQUE             10009871   10453407          96 --> so high
UITEST01             UNIQUE               853733   10113211           9 --> very low
We can see that the PK has a very high CF while the other unique index does not.
The only logical explanation that strikes me is that the underlying data is actually stored in the order of the column covered by the unique index.
1) Am I right with this understanding?
2) Is there any way to give the PK the lowest CF number?
3) Looking at the query cost when using each of these indexes, single selects are very fast. But still, the CF number is what baffles us.
The table is relatively huge - over 10M records - and it also receives real-time inserts/updates.
My database version is Oracle 11gR2, on Exadata X2.
You are seeing the evidence of a heap table indexed by an ordered tree structure.
To get extremely low CF numbers you'd need to order the data as per the index. If you want to do this (like SQL Server or Sybase clustered indexes), in Oracle you have a couple of options:
Simply create supplemental indexes with additional columns that can satisfy your common queries. Oracle can return a result set from an index without referring to the base table if all of the required columns are in the index. If possible, consider adding columns to the trailing end of your PK to serve your heaviest query (practical if your query has a small number of columns). This is usually advisable over changing all of your tables to IOTs.
Use an IOT (Index Organized Table) - a table stored as an index, so it is ordered by the primary key (a sketch follows this list of options).
Sorted hash cluster - More complicated, but can also yield gains when accessing a list of records for a certain key (like a bunch of text messages for a given phone number)
Reorganize your data and store the records in the table in order of your index. This option is ok if your data isn't changing, and you just want to reorder the heap, though you can't explicitly control the order; all you can do is order the query and let Oracle append it to a new segment.
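Here is the promised sketch of the IOT option, echoing the text-message example; all names are hypothetical. The ORGANIZATION INDEX clause stores the rows in primary-key order inside the index structure itself:
CREATE TABLE text_message (
    phone_no NUMBER       NOT NULL,
    msg_seq  NUMBER       NOT NULL,
    msg_text VARCHAR2(160),
    CONSTRAINT pk_text_message PRIMARY KEY (phone_no, msg_seq)
) ORGANIZATION INDEX;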
If most of your access patterns are random (OLTP), single record accesses, then I wouldn't worry about the clustering factor alone. That is just a metric that is neither bad nor good, it just depends on the context, and what you are trying to accomplish.
Always remember, Oracle's issues are not SQL Server's issues, so make sure any design change is justified by performance measurement. Oracle is highly concurrent, and very low on contention. Its multi-version concurrency design is very efficient and differs from other databases. That said, it is still a good tuning practice to order data for sequential access if that is your common use case.
For better advice on this subject, read Ask Tom: what are oracle's clustered and nonclustered indexes

Clustered index on foreign key or primary key?

I have a table Item with autoinc int primary key Id and a foreign key UserId.
And I have a table User with autoinc int primary key Id.
The default is that the index for Item.Id gets clustered.
I will mostly query items on user-id so my question is: Would it be better to set the UserId foreign key index to be clustered instead?
Having the clustered index on the identity field has the advantage that the records will be stored in the order that they are created. New records are added at the end of the table.
If you use the foreign key as clustered index, the records will be stored in that order instead. When you create new records the data will be fragmented as records are inserted in the middle, which can reduce performance.
If you want an index on the foreign key, then just add a non-clustered index for it.
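A sketch of that index, assuming the table and column names from the question:
CREATE NONCLUSTERED INDEX IX_Item_UserId
ON dbo.Item (UserId);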
The answer depends entirely on the usage scenario. For example, Guffa says that the data will be fragmented. That's wrong. If your queries depend mostly on UserId, then data clustered by ItemId is fragmented for you, because items for the same user may be spread over a lot of pages.
Of course, compared to a sequential ItemId (if it is sequential in your schema), using UserId as the clustered key can cause page splits while inserting. That is two additional page writes at most. But when you're selecting by some user, his items may be fragmented over tens of pages (depending on items per user, item size, insertion strategy, etc.) and therefore require a lot of page reads. If you have a lot of such selects per single insert (a very common web/OLAP scenario), you can face hundreds of IO operations compared to the few spent on page splitting. That is what the clustered index was created for, not only for clustering by surrogate IDs.
So there is no clear answer as to whether clustering on UserId is good or bad in your case, because it highly depends on context. What is the ratio of selects to inserts? How fragmented are a user's items if clustered by ItemId? How many additional indexes are on the table? (There is a pitfall here in SQL Server, described below.)
As you might know, a clustered index wants unique key values (SQL Server silently adds a hidden uniquifier to duplicates). This is not a big problem, because you can create the index on the pair (UserId, ItemId). The clustered index doesn't store a separate copy of the rows, so it matters little how many fields its key has. But non-clustered indexes store the clustered key values in their leaves. So if you have a clustered index on UserId+ItemId (let's imagine both are [int], so the pair is 8 bytes) and a non-clustered index on ItemId, that index's leaves will be twice the size (8 bytes per B-tree leaf) compared to having just ItemId as the clustered key (4 bytes per leaf).
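A sketch of that layout, assuming int keys and the names from the question; the PK is declared NONCLUSTERED so the (UserId, Id) pair can serve as the clustered key:
CREATE TABLE dbo.Item (
    Id     int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Item PRIMARY KEY NONCLUSTERED,
    UserId int NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX CIX_Item_UserId_Id
ON dbo.Item (UserId, Id);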
In general, you want to cluster on the most frequently accessed index. But you're not required to have a clustering index at all. You (or your DBAs) need to evaluate things and weigh the advantages and disadvantages so as to choose the most appropriate indexing strategy.
If you cluster on a monotonic counter like an identity column, all new rows are going to be inserted at the end of the table: that means a "hot spot" is created that is likely to cause lock contention on inserts, since every SPID doing an insert is hitting the same data page.
Tables without a clustering index have their data pages organized as a heap, pretty much just a linked list of data pages.
SQL Server indexes are B-trees. For non-clustered indexes, the leaf nodes of the B-tree are pointers to the appropriate data page. That means that if the index is used but doesn't cover the query's columns, an additional lookup has to be done to fetch the data page. That means additional I/O and paging.
Clustered indices are different: their leaf nodes are the data pages themselves, meaning the heap essentially goes away: a table scan is a traversal of the clustering index's B-tree. The advantage is that once you've found what you need in the clustered index, you already have the data page you need, thus avoiding the additional I/O that a seek on a non-clustered index is likely to require. The disadvantage, of course, is that the clustered index is larger, since it carries the entire table with it, so traversals of the clustered index are more expensive.
The clustered index is created on the primary key, so what you can do is leave that as clustered and then create a non-clustered index on the UserId column of Item. This will still be very fast, as the Item.Id column will be the clustered index.
Possibly.
Is the item.user-id column a unique column within your item table? If not, you'd need to make this clustered primary key unique by adding a second (possibly more) column to the key; this may add additional overhead that you hadn't anticipated.
Are there any relationships with the item.id column? If so those may be important to the performance of your application so should be taken into account.
How often is the item.user-id value likely to change? If not at all, that counts in its favour; the more often it's likely to be updated, the worse, since updates to the clustered key lead to fragmentation.
My recommendation would be to build your app with the regular item.id as the clustered key, then later, once you've got some data, try switching the clustered index (in a test system using a copy of your production data) and testing its impact; that way you can see real results rather than trying to guess among the multitude of possibilities. This avoids premature optimisation and ensures you make the correct choice.

SQL Db index recommendation

I am trying to see if using a custom index for a specific type of data might reduce fragmentation in my database.
[Edit: we are using MS SQL Server 2008 R2]
I have an SQL database containing timestamped measurement data. Lots of data is inserted all the time, but once inserted it practically never needs to be updated. These timestamps are, however, not unique, as several devices (around 50 of them) measure the data at the same time.
This means that every 50 rows in the table contain equal timestamp values. This data is received more or less simultaneously, although I could take additional care to ensure that rows are written as sequentially as possible (if that would help), perhaps by keeping them in memory for some time and then writing only when I get the data from all the devices for a single timestamp.
We are using NHibernate with Guid.Comb to avoid index lookups we would have with plain bigint IDs. As opposed to plain GUIDs, this should reduce fragmentation, but for so many inserts, fragmentation nevertheless happens very soon.
Since my data is timestamped, and data is inserted almost sequentially (increasing timestamps), I am wondering if there is a more clever way to create a primary key with a unique clustered index for this table. Timestamp column is basically a bigint number (.NET DateTime ticks).
I have also noticed that a non-clustered index over that same timestamp column also gets pretty fragmented. So what index strategy would you recommend to reduce heap fragmentation in this case?
Maybe take a look at this answer, HiLo looks interesting.
Also, maybe your fragmentation is not a result of the discrepancy between the ordering of the index values and the order in which they are added, but a natural file-growth effect (as explained here)?
A separate column for a key doesn't make a lot of sense for this table, since you won't be updating any of the data. I imagine you'll be doing a lot of queries, though, probably based on that timestamp column.
You could try making the primary key a combination of the timestamp column and a device id column. You could try making that clustered. That should allow you to write nearly as fast as possible. If you query by device however, you may need another index on device id and timestamp (the reverse). I wouldn't make the reverse the clustered one though, as that will make the writes happen all over the table rather than on the trailing pages. And if most queries involve a date range and more than one device, clustering on timestamp first should give you the best performance.
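A sketch of that layout, with hypothetical table and column names:
CREATE TABLE dbo.Measurement (
    MeasuredAt bigint NOT NULL, -- .NET DateTime ticks
    DeviceId   int    NOT NULL,
    Value      float  NOT NULL,
    CONSTRAINT PK_Measurement PRIMARY KEY CLUSTERED (MeasuredAt, DeviceId)
);
-- The "reverse" index for per-device queries, kept non-clustered:
CREATE NONCLUSTERED INDEX IX_Measurement_Device
ON dbo.Measurement (DeviceId, MeasuredAt);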

Is ROWID internally indexed unique by an SQL DBMS?

It's my understanding that the quickest way to access a particular row is by its ROWID. In INFORMIX-SE 7.3, when I do a SELECT ROWID FROM table I notice that its values are type SERIAL[INT]. In Oracle, they are SERIAL[HEX]. Has anyone ever used ROWID for any practical use? If I wanted to locate the most recent row added to a table, would SELECT MAX(ROWID) FROM table be quicker and more reliable than say SELECT MAX(pk_id) FROM table, where pk_id is a user-created SERIAL column? What other practical use have you ever put ROWID to work for you?
Your understanding is not necessarily correct. The ROWID property in SQL Server is primarily intended for replication, as a way to guarantee that the table has a single-field unique index value. This way the replication system does not have to account for any specific primary key semantics that your design might employ, while still being able to identify every row by a single column. No table is required to have a ROWID column unless it is part of a merge replication publication, so it's not something that every table has, unlike Oracle. It also doesn't serve the same purpose (they're GUIDs -- or uniqueidentifier in T-SQL parlance -- on SQL Server, and are random, not sequential integers like they are on Oracle).
The quickest way to retrieve a row from a table is by accessing the row via the clustered index. A table can have only one clustered index, as it's what determines the physical ordering of the rows on the disk. Furthermore, if the table has a primary key, the primary key is by default the clustered index. While it's possible to declare a table without a primary key and assign the clustered index to something else, I can't (off the top of my head) fathom a reason why you'd want to do this (or, for practical purposes, how you can justify having a table without a primary key).
In short, that means that the quickest way to retrieve a row is by using the primary key of the table. Unless the ROWID column is the primary key (which is certainly possible to do), it isn't the fastest way.
Well, I can only really tell you how it works in Oracle, having used it for 19+ years :-)
Put simply, ROWID is an internal identifier that acts like a physical address. It can be split into database file no, block no, and row no. So obtaining the ROWID lets the db engine look the data up in a single direct IO.
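You can see the three components for yourself with the DBMS_ROWID package (the table name is hypothetical):
SELECT ROWID,
       DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) AS file_no,
       DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) AS block_no,
       DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)   AS row_no
FROM mytable
WHERE ROWNUM <= 5;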
In an index, the B* tree will have ROWIDs on the leaf nodes pointing directly to the location of the data, e.g. in a primary key index.
Being a physical address, it is subject to change upon relocation on disk, which can happen after restoring a backup, rebuilding a table, or an export/import of data.
The db engine can do some tricks, e.g. when moving a pluggable tablespace from one instance to another, to avoid rebuilding indexes; however this is strictly db engine internals.
So to keep out of trouble leave the ROWID for internal use for the db engine. Storing the ROWID for your own usage will eventually lead to inconsistency.
In Informix-SE, the ROWID is basically the record number within the C-ISAM file that is used to hold the table. SE only deals in fixed size records, of course (no VARCHAR data).
In Informix Dynamic Server, the ROWID is (a) more complex (page number plus slot number) and (b) not always present (fragmented tables do not expose the ROWID, unless the table was created WITH ROWIDS, in which case the ROWID is a physical column that is indexed after all) - be aware!
If no data is ever deleted and you are using SE, then selecting the row with the maximum ROWID will be the most recently added row. If a row is deleted, then that space will eventually be reused, and then the most recently added row ceases to be the one with the maximum ROWID. (IDS does not make that promise for a variety of complex reasons.)
The SE implementation of ROWID does not store the ROWID in the table, and does not create an index on it, but it does not need an index because it knows the formula for where to go to find the data (offset in data file = ROWID * RowSize), give or take a plus one on RowSize or a minus one on ROWID or both.
As to practical use for ROWID, the style that was used before fragmentation was added to IDS was to select a list of ROWID values for the records of interest in the table, maintaining that list in memory:
SELECT ROWID
FROM InterestingTable
WHERE SomeColumn = xxx
AND AnotherColumn < yyy;
Then, the program could present these rows one at a time, fetching the current data via the stored ROWID. The ROWID for a record would not change while a program was running. This ensured that the current data - whether edits from the current user or someone else - was shown when the record was displayed.
There's a program you're familiar with, ISQL Perform, that behaves like this. And it does not work with fragmented tables in IDS (SE does not support fragmented tables) unless they are created with a physical ROWID column via the WITH ROWIDS clause.
Perhaps the term "RDBMS" rather than "an SQL server"?
Attaching any purpose to a ROWID is a bad idea. Particularly if you're in the habit of dropping and recreating tables. If your table needs a SERIAL PK, then that's what it should have. No good can come of using ROWIDs within your application.

Indexing Speed in SQL

When creating indexes for an SQL table, if I had an index on 2 columns and I changed the index to be on 4 columns, what would be a reasonable increase to expect in the time taken to save, say, 1 million rows?
I know that the answer to this question will vary depending on a lot of factors, such as foreign keys, other indexes, etc., but I thought I'd ask anyway. Not sure if it matters, but I am using MS SQL Server 2005.
EDIT: Ok, so here's some more information that might help get a better answer. I have a table called CostDependency. Inside this table are the following columns:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
DistributionID as UniqueIdentifier (FK)
IsValid as Bit
At the moment there is one Unique index involving ParentPriceID, DependantPriceID, LocationID and DistributionID. The reason for this index is to guarantee that the combination of those four columns is unique. We are not doing any searching on these four columns together. I can however normalise this table and make it into three tables:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
Unique Index on ParentPriceID and DependantPriceID
and
ExtensionID as UniqueIdentifier (PK)
CostDependencyID (FK)
DistributionID as UniqueIdentifier (FK)
Unique Index on CostDependencyID and DistributionID
and
ID as UniqueIdentifier (PK)
ExtensionID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
IsValid as Bit
Unique Index on ExtensionID and LocationID
I am trying to work out if normalising this table and thus reducing the number of columns in the indexes will mean speed improvements when adding a large number of rows (i.e. 1 million).
Thanks, Dane.
With all the new info available, I'd like to suggest the following:
1) If a few of the GUID (UniqueIdentifier) columns are such that a) there are relatively few distinct values and b) relatively few new values are added after the initial load (for example, LocationID may represent a store, and we may only see a few new stores every day), it would be profitable to spin these off into separate lookup table(s) mapping GUID -> LocalId (an INT or some other small type), and to use this LocalId in the main table.
==> Doing so will greatly reduce the overall size of the main table and its associated indexes, at the cost of slightly complicating the update logic (but not its performance), because of the lookup(s) and the need to maintain the lookup table(s) with new values.
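A sketch of such a lookup table for LocationID (names are made up):
CREATE TABLE dbo.LocationLookup (
    LocalId    int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_LocationLookup PRIMARY KEY,
    LocationID uniqueidentifier NOT NULL
        CONSTRAINT UQ_LocationLookup_LocationID UNIQUE
);
-- The main table would then store the 4-byte LocalId instead of the 16-byte GUID.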
2) Unless a particularly important/frequent search case could make [good] use of a clustered index, we could make the clustered index on the main table the 4-column unique composite key. This would avoid replicating that much data in a separate non-clustered index, and as counter-intuitive as it seems, it would save time for the initial load and with new inserts. The trick would be to use a relatively low fill factor so that node splitting and rebalancing would be infrequent. BTW, if we make the main record narrower with the use of local IDs, we can more readily afford "wasting" space in the fill factor, and more new records will fit in this space before requiring node rebalancing.
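A sketch of point 2, with a hypothetical index name; it assumes the primary key on CostDependancyID has been declared non-clustered so the composite key can take the clustered slot:
CREATE UNIQUE CLUSTERED INDEX CIX_CostDependency
ON dbo.CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID)
WITH (FILLFACTOR = 70, PAD_INDEX = ON);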
3) link664 could provide an order of magnitude for the total number of records in the "main" table and the number expected daily/weekly/whenever updates are scheduled. These two parameters could confirm the validity of the approach suggested above, as well as provide hints about possibly dropping the indexes (or some of them) prior to big batch inserts, as suggested by Philip Kelley. Doing so, however, would be contingent on operational considerations such as the need to keep serving searches while new data is inserted.
4) Other considerations such as SQL partitioning, storage architecture, etc. can also be put to work to improve load and/or retrieval performance.
It depends pretty much on whether the wider index forms a covering index for your queries (and, to a lesser extent, on the ratio of reads to writes on that table). Suggest you post your execution plan(s) for the query workload you are trying to improve.
I'm a bit confused over your goals. The (post-edit) question reads that you're trying to optimize data (row) insertion, comparing one table of six columns and a four-column compound primary key against a "normalized" set of three tables of three or four columns each, and each of the three with a two-column compound key. Is this your issue?
My first question is, what are the effects of the "normalization" from one table to three? If you had 1M rows in the single table, how many rows are you likely to have in the three normalized ones? Normalization usually removes redundant data, does it do so here?
Inserting 1M rows into a four-column-PK table will take more time than into a two-column-PK table--perhaps a little, perhaps a lot (see next paragraph). However, if all else is equal, I believe that inserting 1M rows into three two-column-PK tables will be slower than into the single four-column one. Testing is called for.
One thing that is certain is that if the data to be inserted is not loaded in the same order as it will be stored in, it will be a LOT slower than if the data being inserted were already sorted. Multiply that by three, and you'll have a long wait indeed. The most common work-around to this problem is to drop the index, load the data, and then recreate the index (sounds like a time-waster, but for large data sets it can be faster than inserting into an indexed table). A more abstract work-around is to load the table into an unindexed partition, (re)build the index, then switch the partition into the "live" table. Is this an option you can consider?
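The drop/load/recreate work-around, sketched with a hypothetical index name:
DROP INDEX UQ_CostDependency ON dbo.CostDependency;
-- ... bulk insert the 1 million rows here ...
CREATE UNIQUE INDEX UQ_CostDependency
ON dbo.CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID);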
By and large, people are not overly concerned with performance when data is being inserted--generally they sweat over data retrieval performance. If it's not a warehouse situation, I'd be interested in knowing why insert performance is your apparent bottleneck.
The query optimizer will look at the index and determine if it can use the leading column. If the first column is not in the query, then it won't be used, period. If the index can be used, then it will check whether the second column can be used. If your query contains 'where A=? and C=?' and your index is on (A,B,C,D), then only the 'A' column will be used in the query plan.
Adding columns to an index can sometimes be useful to avoid the database having to go from the index page to the data page. If your query is 'select D from table where a=? and b=? and c=?', then column 'D' will be returned from the index, saving you a bit of IO by not having to go to the data page.
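A sketch of that, using the INCLUDE clause (available since SQL Server 2005); all names are hypothetical:
CREATE NONCLUSTERED INDEX IX_MyTable_A_B_C
ON dbo.MyTable (A, B, C)
INCLUDE (D);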