Changing newid() to newsequentialid() on an existing table - sql

At the moment we have a number of tables that are using newid() on the primary key. This is causing large amounts of fragmentation. So I would like to change the column to use newsequentialid() instead.
I imagine that the existing data will remain quite fragmented but the new data will be less fragmented. This would imply that I should perhaps wait some time before changing the PK index from non-clustered to clustered.
My question is, does anyone have experience doing this? Is there anything I have overlooked that I should be careful of?

You might think about using comb guids, as opposed to newsequentialid.
SELECT CAST(
    CAST(NEWID() AS binary(10)) +      -- 10 random bytes
    CAST(GETDATE() AS binary(6))       -- 6 bytes taken from the current datetime
AS uniqueidentifier) AS CombGuid;
Comb guids are a combination of purely random guids along with the non-randomness of the current datetime, so that sequential generations of comb guids are near each other and in general in ascending order. Comb guids have various advantages over newsequentialid, including the facts that they are not a black box, that you can use this formula outside of a default constraint, and that you can use this formula outside of SQL Server.
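As a minimal sketch of using the formula outside of a default constraint, the expression can be assigned to a variable before the insert. The table variable @Demo and its columns are hypothetical names used only for illustration, and the inline DECLARE initializer assumes SQL Server 2008 or later.

```sql
-- Hypothetical demo table; @Demo, Id and Note are illustrative names only.
DECLARE @Demo TABLE (
    Id   uniqueidentifier NOT NULL PRIMARY KEY,
    Note varchar(20)      NOT NULL
);

-- Build the comb guid up front, so the caller knows the key before inserting.
DECLARE @Id uniqueidentifier =
    CAST(CAST(NEWID() AS binary(10)) + CAST(GETDATE() AS binary(6)) AS uniqueidentifier);

INSERT INTO @Demo (Id, Note) VALUES (@Id, 'first row');

-- The key is already known, so the row can be read back without any lookup trick.
SELECT Id, Note FROM @Demo WHERE Id = @Id;
```

The datetime bytes go at the end because SQL Server gives the last six bytes of a uniqueidentifier the most weight when sorting.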

If you switch to sequential GUIDs and reorganize the index once at the same time, you'll eliminate the fragmentation. I don't understand why you would want to wait for the fragmented page links to rearrange themselves into contiguous extents.
That being said, have you done any measurement to show that the fragmentation is actually affecting your system? Just looking at an index and seeing 'is fragmented 75%' does not imply that the access time is affected. There are many more factors that come into play (buffer pool page life expectancy, rate of reads vs. writes, locality of sequential operations, concurrency of operations, etc.). While switching from guids to sequential guids is usually safe, you may still introduce problems. For instance, you can see page latch contention on an insert-intensive OLTP system, because sequential keys create a hot-spot page where the inserts accumulate.
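The steps discussed above might look roughly like the following sketch. All object names (dbo.Orders, OrderId, DF_Orders_OrderId, PK_Orders) are hypothetical, and the CREATE TABLE only stands in for an existing table. The sketch uses REBUILD; REORGANIZE is the lighter-weight option mentioned in the answer.

```sql
-- Stand-in for the existing table (hypothetical names throughout).
CREATE TABLE dbo.Orders (
    OrderId uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_OrderId DEFAULT NEWID()
        CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED,
    Placed  datetime NOT NULL
);

-- 1. Look up the name of the existing default constraint on the key column.
SELECT dc.name AS DefaultConstraintName
FROM sys.default_constraints AS dc
JOIN sys.columns AS c
  ON c.object_id = dc.parent_object_id
 AND c.column_id = dc.parent_column_id
WHERE dc.parent_object_id = OBJECT_ID(N'dbo.Orders')
  AND c.name = N'OrderId';

-- 2. Swap the default from NEWID() to NEWSEQUENTIALID().
--    Existing rows keep their old values; only new rows are affected.
ALTER TABLE dbo.Orders DROP CONSTRAINT DF_Orders_OrderId;
ALTER TABLE dbo.Orders
    ADD CONSTRAINT DF_Orders_OrderId DEFAULT NEWSEQUENTIALID() FOR OrderId;

-- 3. Measure fragmentation before deciding whether it matters.
SELECT ips.index_id, ips.avg_fragmentation_in_percent, ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Orders'), NULL, NULL, 'LIMITED') AS ips;

-- 4. Rebuild once to remove the fragmentation left behind by the random keys.
ALTER INDEX ALL ON dbo.Orders REBUILD;
```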

Thank you, yfeldblum! Your simple and concise explanation of COMB GUIDs really helped me out. I was actually looking at doing the reverse of this post: I had to get away from relying on newsequentialid() since I was trying to migrate a SQL Server 2012 db to Azure, and the newsequentialid() function is not supported there.
I was able to change all of my table PK defaults to COMB GUIDs, with the following syntax:
ALTER TABLE [dbo].[Company]
ADD CONSTRAINT [DF__Company__Company_ID__72E6D332]
DEFAULT (CONVERT([uniqueidentifier],CONVERT([binary](10),newid(),0)+CONVERT([binary](6),getdate(),0),0)) FOR [CompanyId]
GO
My SQL2012 db is now happily living in the Azure cloud.

If this is SQL Server, you are generating a Guid by calling newid(). This is not good for primary keys. Use an integer identity column for the primary key, and make your Guid a surrogate key (and a row guid column).

Related

Which is the most common ID type in SQL Server databases, and which is better?

Is it better to use a Guid (UniqueIdentifier) as your Primary/Surrogate Key column, or a serialized "identity" integer column; and why is it better? Under which circumstances would you choose one over the other?
I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep two things apart. The primary key is a logical construct: it uniquely identifies your rows, so it has to be unique, stable and NOT NULL. A GUID works well for a primary key, too, since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct that is used for the physical ordering of the data, and it is a lot more difficult to get right. Kimberly Tripp, the Queen of Indexing on SQL Server, requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
A GUID is a really bad choice for a clustering key, since it's wide, totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key row(s) is also stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - GUID is 16 byte vs. INT is 4 byte, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
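A minimal sketch of that arrangement, assuming a hypothetical dbo.Customer table (all table, column, constraint and index names here are made up for illustration):

```sql
-- GUID as the logical primary key, explicitly NONCLUSTERED.
CREATE TABLE dbo.Customer (
    CustomerId  uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_CustomerId DEFAULT NEWID(),
    CustomerSeq int IDENTITY(1, 1) NOT NULL,
    Name        nvarchar(100) NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerId)
);

-- Narrow, unique, ever-increasing INT IDENTITY as the physical clustering key.
CREATE UNIQUE CLUSTERED INDEX CIX_Customer_CustomerSeq
    ON dbo.Customer (CustomerSeq);
```

With this layout the non-clustered indexes carry the 4-byte CustomerSeq rather than the 16-byte GUID as their row locator.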
Use a GUID in a replicated system where you need to guarantee uniqueness.
Use ints where you have a non-replicated database and you want to maximise performance.
Use GUIDs very seldom.
Rather, use a primary key/surrogate key for storage purposes.
This will also make it easier for humans to work with the data.
Creating indexes will be a lot more efficient, too.
See:
How Using GUIDs in SQL Server Affect Index Performance
Performance Effects of Using GUIDs as Primary Keys
When considering using integers, be sure to allow for the maximum possible value that might occur. You often end up with skipped numbers because of deletions, so the actual maximum ID might be much larger than the total number of records in the table.
For example, if you aren't sure that a 32-bit integer will do, use a 64-bit integer.
You might also find these other SO discussions useful:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Picking the best primary key + numbering system.
And if you search here in SO for "primary key", you'll find those and a lot more very useful discussions.
There's no single answer to this. The issues that people are quick to jump on with Guid's (that their random nature combined with the default behavior of the primary key also acting as the clustered key) can be easily mitigated. Guids have a larger range than integers do, but as you start to fill that range with values you increase your risk of a collision.
Guid's can be very useful when you have a distributed system (for example, replicated databases) where a non-trivial amount of work would have to go into a key generation mechanism that wouldn't cause collisions between the portions of the system. Likewise, integers are useful because they're simple to use (every language has an integral type, not every language has a Guid type) and can be sequential (Guids can, too, but that's not their intended use).
It's all about what you're storing and how. The people who say "never use Guids!" are just spreading FUD, but Guids also aren't the answer to every problem.
I believe it is almost always a serialized identity integer, but some will disagree. It does depend on the situation.
The reasons for identity is efficiency and simplicity. It's smaller. More easily indexed. It makes a great clustered index. Less fragmentation as new records are kept in order. Great for indexes on joins. Easier when eyeballing records in a db.
There is definitely a place for Guids in certain circumstances: when merging disparate data, or when records have to be created in certain places. Guids should be in your bag of tricks but usually will not be your first choice.
This is an oft-debated topic, but I tend to lean more towards identities for a couple of reasons. Firstly, an integer is only 4 bytes vs. a 16-byte GUID. This means narrower indexes and more efficient queries. Secondly, we make use of @@IDENTITY and SCOPE_IDENTITY() a lot in stored procs, etc., which goes out the window with GUIDs.
Here's a nice little article by Jeff Atwood.
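To illustrate the SCOPE_IDENTITY() point, here is a small sketch using hypothetical temp tables (#OrdersInt and #OrdersGuid are made-up names). The second half uses the OUTPUT clause, which is one common way to read back a generated GUID; it is shown as an assumption about what a GUID-based design would do instead, not something the answer above prescribes.

```sql
-- Identity key: the new value is available through SCOPE_IDENTITY().
CREATE TABLE #OrdersInt (
    OrderId int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    Item    varchar(20) NOT NULL
);
INSERT INTO #OrdersInt (Item) VALUES ('widget');
SELECT SCOPE_IDENTITY() AS NewOrderId;

-- GUID key: SCOPE_IDENTITY() does not apply, so capture the value with OUTPUT.
CREATE TABLE #OrdersGuid (
    OrderId uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
    Item    varchar(20) NOT NULL
);
DECLARE @New TABLE (OrderId uniqueidentifier NOT NULL);
INSERT INTO #OrdersGuid (Item)
OUTPUT inserted.OrderId INTO @New (OrderId)
VALUES ('widget');
SELECT OrderId AS NewOrderId FROM @New;

DROP TABLE #OrdersInt;
DROP TABLE #OrdersGuid;
```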
Use a GUID if you think you'll ever need to use the data outside the database (i.e. in other databases). Some would argue that is always the case, but it's a judgment call.

Why not always use GUIDs instead of Integer IDs?

What are the disadvantages of using GUIDs?
Why not always use them by default?
Integers join a lot faster, for one. This is especially important when dealing with millions of rows.
For two, GUIDs take up more space than integers do. Again, very important when dealing with millions of rows.
For three, GUIDs sometimes take different formats that can cause hiccups in applications, etc. An integer is an integer, through and through.
A more in depth look can be found here, and on Jeff's blog.
GUIDs are four times larger than an int, and twice as large as a bigint.
GUIDs are really difficult to look at if you are trying to troubleshoot tables.
GUIDs are great from a programmer's perspective - they're guaranteed to be (almost) unique, so why not use them everywhere, right?
If you look at it from the DBA perspective and from the database standpoint, at least for SQL Server, there are a few things to consider:
GUIDs as primary key (which is responsible for uniquely identifying a single row in your table) might be okay - after all, they're unique, right?
however, SQL Server also has the concept of the clustering key, which physically orders the data in your table; if you don't know about this, and don't do anything explicitly, your primary key becomes your clustering key.
Kimberly Tripp - world-known expert on SQL Server indexing and performance - has a great many blog posts on why a GUID as your clustering key is a really bad idea - check out her blog on indexes.
Most notably, her best practices for a clustering key are:
narrow
static
unique
ever-increasing
GUIDs are typically static and unique - but they're neither narrow (16 byte vs. 4 byte for a INT) nor ever-increasing. Due to their nature, they're unique and (pseudo-)random.
The narrow part is important because the clustering key will be added to each and every index page for each and every non-clustered index on your table - and if you have a few of those, and a few million rows in your table, this amounts to a massive waste of space - and not just on your disk, but also in your SQL Server's RAM.
The ever-increasing part is important, because the randomness of the GUIDs causes a lot of fragmentation in your indices, which negatively affects your performance all around. Even the newsequentialid() of SQL Server 2005 and up doesn't really create sequential GUIDs all around - they're sequential for a while and then there's a jump again, causing fragmentation (less than totally random GUIDs, but still).
So all in all, if you're really concerned with your SQL Server performance, using GUIDs as a clustering key is a really bad idea - use INT IDENTITY() instead, possibly using a GUID as the primary (non-clustered) key if you really have to.
Marc
GUIDS can simplify generating keys ahead of time, or generating keys offline, or in a cluster, without risk of collision. There may also be a slight security benefit, with all keys being unguessable.
The disadvantage is that it's harder to read/type and on many of your tables you may later realize a need to go back and generate human-friendly keys anyways. They'll also evenly distribute your records in a table, which may make it slower to query multiple records that were inserted at around the same time vs having an autonumber key where your records are in order of time inserted.
GUIDs are big and slow compared to ints - so use them when they're needed, eschew them when they're NOT needed, it's as simple as that!
This answer does NOT preclude the idea of using INT's as a primary key. It is mainly meant to point-out WHEN a guid is useful.
HERE IS A GREAT (SHORT) ARTICLE:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
Explained...
I use guids for any (common) DB entity-type which may need to be exported or shared with another DB instance. This way, I have a DNA marker (i.e. the guid) that can be used to differentiate between "like" objects of the same "physical" entity.
For example, let's pretend two database instances have a table called PROJECT. If the two projects share the same name or number it is hard to distinguish which one is which. Using GUID's though you can easily distinguish between 2 projects and where they come from...even when they have many similar values between them. This seems impossible...but actually can and does happen.
The biggest performance hit you'll see with GUIDs as a primary/clustered key is when inserting records into large tables. Maintaining the index can be a heavy task, since each new key will fall somewhere in the middle of the index rather than at the end.
Using random GUIDs as a clustered primary key leads to heavy index fragmentation as the table grows, which can seriously degrade performance.

Uniqueidentifier PK: Is a SQL Server heap the right choice?

OK. I've read things here and there about SQL Server heaps, but nothing too definitive to really guide me. I am going to try to measure performance, but was hoping for some guidance on what I should be looking into. This is SQL Server 2008 Enterprise. Here are the tables:
Jobs
    JobID (PK, GUID, externally generated)
    StartDate (datetime2)
    AccountId
    Several more accounting fields, mainly decimals and bigints
JobSteps
    JobStepID (PK, GUID, externally generated)
    JobID FK
    StartDate
    Several more accounting fields, mainly decimals and bigints
Usage: Lots of inserts (hundreds/sec), usually 1 JobStep per Job. Estimate perhaps 100-200M rows per month. No updates at all, and the only deletes are from archiving data older than 3 months.
Do ~10 queries/sec against the data. Some join JobSteps to Jobs, some just look at Jobs. Almost all queries will range on StartDate, most of them include AccountId and some of the other accounting fields (we have indexes on them). Queries are pretty simple - the largest part of the execution plans is the join for JobSteps.
The priority is the insert performance. Some lag (5 minutes or so) is tolerable for data to appear in the queries, so replicating to other servers and running queries off them is certainly allowable.
Lookup based on the GUIDs is very rare, apart from joining JobSteps to Jobs.
Current Setup: No clustered index. The only one that seems like a candidate is StartDate. But, it doesn't increase perfectly. Jobs can be inserted anywhere in a 3 hour window after their StartDate. That could mean a million rows are inserted in an order that is not final.
Data size for a 1 Job + 1 JobStepId, with my current indexes, is about 500 bytes.
Questions:
Is this a good use of a heap?
What's the effect of clustering on StartDate, when it's pretty much non-sequential for ~2 hours/1 million rows? My guess is the constant re-ordering would kill insert perf.
Should I just add bigint PKs just to have smaller, always increasing keys? (I'd still need the guids for lookups.)
I read GUIDs as PRIMARY KEYs and/or the clustering key, and it seemed to suggest that even inventing a key will save considerable space on other indexes. Also some resources suggest that heaps have some sort of perf issues in general, but I'm not sure if that still applies in SQL 2008.
And again, yes, I'm going to try to perf test and measure. I'm just trying to get some guidance or links to other articles so I can make a more informed decision on what paths to consider.
Yes, heaps have issues. Your data will logically fragment all over the show and can not be defragmented simply.
Imagine throwing all your telephone directory into a bucket and then trying to find "bob smith". Or using a conventional telephone directory with a clustered index on lastname, firstname.
The overhead of maintaining the index is trivial.
StartDate, unless unique, is not a good choice. A clustered index requires internal uniqueness for the non-clustered indexes. If not declared unique, SQL Server will add a 4 byte "uniquifier".
Yes, I'd use int or bigint to make it easier. As for GUIDs: see the questions at the right hand side of the screen.
Edit:
Note, PK and clustered index are 2 separate issues, even if SQL Server by default will make the PK clustered.
Heap fragmentation isn't necessarily the end of the world. It sounds like you'll rarely be scanning the data, so it shouldn't hurt you much.
Your non-clustered indexes are the things that will impact your performance. Each one will need to store the address of the row in the underlying table (either a heap or a clustered index). Ideally, your queries never have to use the underlying table itself, because the index stores all the information needed in the ideal way (including all the columns the query uses, so that it's a covering index).
And yes, Kimberly Tripp's stuff is the best around for indexes.
Rob
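A covering index for the kind of query described in the question might be sketched as follows. The table is a cut-down, hypothetical version of Jobs: the int type for AccountId, the TotalCost column (standing in for the accounting fields) and all constraint and index names are assumptions.

```sql
-- Hypothetical cut-down Jobs table, kept as a heap with a non-clustered GUID PK.
CREATE TABLE dbo.Jobs (
    JobID     uniqueidentifier NOT NULL
        CONSTRAINT PK_Jobs PRIMARY KEY NONCLUSTERED,
    StartDate datetime2        NOT NULL,
    AccountId int              NOT NULL,
    TotalCost decimal(18, 4)   NOT NULL
);

-- Key columns match the range/filter predicates; INCLUDE carries the extra
-- column so the query below never needs to touch the underlying table.
CREATE NONCLUSTERED INDEX IX_Jobs_StartDate_AccountId
    ON dbo.Jobs (StartDate, AccountId)
    INCLUDE (TotalCost);

-- Every referenced column is in the index, so the index covers this query.
SELECT AccountId, SUM(TotalCost) AS Cost
FROM dbo.Jobs
WHERE StartDate >= '2010-01-01' AND StartDate < '2010-02-01'
GROUP BY AccountId;
```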
As your own research has shown, and as all the other answerers have mentioned, using a GUID as the clustered index on a table is a bad idea.
However, having a heap also isn't really a good choice, since heaps have other issues, mostly to do with fragmentation and other things that just don't work well with a heap.
My best practice advice would always be this:
do use a primary, clustered key on any data table (unless it's a temporary table, or a table used for bulk-loading)
try to make sure the clustered key is a INT IDENTITY or BIGINT IDENTITY
I would argue that the benefits you get by adding a INT/BIGINT - even just for the sake of having a good clustered index - far outweigh the drawbacks this has (as Kim Tripp also argues in her blog post you cited).
Marc
Because a GUID is your primary and foreign key, your database will still need to check the constraints on every insert, so you will probably need to index it. Indexing a GUID is not advisable due to its randomness. Therefore I'd say you should absolutely go down the bigint (probably identity) route for your primary key and use it as the clustered index.

Improving performance of cluster index GUID primary key

I have a table with a large number of rows (10K+) whose primary key is a GUID. The primary key is clustered. Query performance on this table is quite low. Please provide suggestions to make it more efficient.
A clustered index on GUID is not a good design. The very nature of GUID is that it's random, while a clustered index physically orders the records by the key. The two things are completely at odds. For every insert SQL has to reorder the records on disk! Remove clustering from this index!
The time to use clustering is when you have a "natural" order to the data: time inserted, account number, etc. For time fields, clustering is almost free. For account number, it might be free or cheap (when account numbers are assigned sequentially).
While there may be technical ways around the GUID issue, the best idea is to understand when to use clustering.
There is no problem with using a GUID as the primary key. Just make sure that when you set the GUID to be the primary key, you set the index it automatically creates to be non-clustered. A lot of people forget (or don't know) to do this in SQL Server.
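For a table that already has a clustered GUID primary key, the change could be sketched like this. dbo.Widget and its constraint name are hypothetical, and the CREATE TABLE only stands in for the existing table.

```sql
-- Stand-in for the existing table with a clustered GUID primary key.
CREATE TABLE dbo.Widget (
    WidgetId uniqueidentifier NOT NULL
        CONSTRAINT PK_Widget PRIMARY KEY CLUSTERED,
    Name     nvarchar(50) NOT NULL
);

-- Drop the clustered primary key. Any foreign keys that reference it
-- would have to be dropped first and re-created afterwards.
ALTER TABLE dbo.Widget DROP CONSTRAINT PK_Widget;

-- Re-create the primary key as a non-clustered index.
ALTER TABLE dbo.Widget
    ADD CONSTRAINT PK_Widget PRIMARY KEY NONCLUSTERED (WidgetId);
```

After this the table is a heap unless a clustered index is created on some other column.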
NEVER use a clustered index on a GUID. This will cause a physical ordering around the GUID on disk, which is obviously pointless (as others have already pointed out)
You need to use newsequentialid() instead; see Some Simple Code To Show The Difference Between Newid And Newsequentialid.
You can try sequential GUIDs, which will make the index more effective. Info here.
You need to analyze your query. Without viewing the execution plan (which you can get quite easily from SQL Server or Oracle), we can only guess why your queries perform badly.
Considering that a GUID is a 128-bit value (if stored raw), a GUID cuts the density of the data and index blocks by as much as 50% (in the case of the primary key index) so make sure GUID is appropriate.
But that might not be the problem, so review the query plan. It could be several other issues.
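In SQL Server, one way to capture the plan and I/O figures for a single query is sketched below. The query against sys.objects is only a placeholder for the slow query against your own table.

```sql
-- Report logical/physical reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Return the actual execution plan as XML alongside the results.
SET STATISTICS XML ON;

-- Placeholder query; substitute the query that performs badly.
SELECT name, create_date
FROM sys.objects
WHERE type = 'U'
ORDER BY create_date;

SET STATISTICS XML OFF;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```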
Please avoid creating a clustered index on lengthy string columns. A GUID stored as a string is 36 characters long, and it will reduce query performance even if you have created a clustered index on it. As a better practice, use integer identity columns.

What's the purpose of sequential uniqueidentifier in PK of MS SQL?

I know that primary keys based on Guids do not have the best performance (due to the fragmentation), but they are globally unique and allow replication scenarios.
Integral identifiers, on the other side, have greater performance at the cost of scalability.
But in what scenarios would someone want to use a sequential uniqueidentifier as the primary key? I think that it defeats the purpose of a GUID, but I still see sequential GUIDs mentioned now and then.
What do I miss here?
What are commonly known as sequential guids in SQL Server 2005 (generated by NEWSEQUENTIALID()) are an attempt to overcome the issues with normal guids.
They are still universally unique, but each new value is also greater than the ones previously generated on that machine since Windows was last started (after a restart the sequence can begin again from a lower range). This means that they can be used for replication and have much better insert performance than traditional GUIDs.
The one drawback is that they are not "secure" because it is possible to guess the next sequential guid.
http://msdn.microsoft.com/en-us/library/ms189786.aspx
Using sequential guids ensures that you are always using a value larger than the last value. For an indexed field this is important. Instead of inserting randomly all over the spectrum, you are always inserting at the end of the last "page" of data, resulting in drastically reduced page splits, especially in the case where your uniqueidentifier column is also your clustered index.
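A small sketch to observe this behaviour, using a hypothetical temp table #SeqDemo. NEWSEQUENTIALID() can only be used as a column default, so the values are produced by the inserts themselves.

```sql
-- Hypothetical demo table; Id is filled in by the NEWSEQUENTIALID() default.
CREATE TABLE #SeqDemo (
    Id          uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
    InsertOrder int              NOT NULL
);

INSERT INTO #SeqDemo (InsertOrder) VALUES (1);
INSERT INTO #SeqDemo (InsertOrder) VALUES (2);
INSERT INTO #SeqDemo (InsertOrder) VALUES (3);

-- Sorting by Id should normally return the rows in insertion order,
-- because each generated value is greater than the previous one.
SELECT InsertOrder, Id
FROM #SeqDemo
ORDER BY Id;

DROP TABLE #SeqDemo;
```

Replacing the default with NEWID() in the same script would typically return the rows in an arbitrary order instead.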