What's the purpose of sequental uniqueidentifier in PK of MS SQL? - sql

I know that primary keys based on Guids do not have the best performance (due to the fragmentation), but they are globally unique and allow replication scenarios.
Integral identifiers, on the other side, have greater performance at the cost of scalability.
But in what scenarios would someone want to use sequential uniqueidentifier as the primary key? I think, that it beats the purpose of GUID, but still I see mentioning of the sequentals now and then.
What do I miss here?

What is commonly known as a sequential guids in SQL Server 2005 (generated by NEWSEQUENTIALID()) are an attempt to overcome the issues with normal guids.
They are still universally unique but also are always ascending. This means that they can be used for replication and have much better insert performance than traditional GUIDs.
The one drawback is that they are not "secure" because it is possible to guess the next sequential guid.
http://msdn.microsoft.com/en-us/library/ms189786.aspx

Using sequential guids ensures that you are always using a value larger than the last value. For an indexed field this is important. Instead of inserting randomly all over the spectrum, you are always inserting at the end of the last "page" of data, resulting in drastically reduced page splits, especially in the case where your uniqueidentifier column is also your clustered index.

Related

Index on a Varchar?

We're suffering some heavy table locking issues in production. I have noticed that I created a stored procedure which gets a list of orders by order number. Order number is a VARCHAR(150). There is no index of any type on this column.
At the moment, there is a LOT of NULL values in this column. However, over time (This table went live recently), the table will grow significantly. No more NULL values will be added in this time.
My question is two fold. Firstly, would an index be beneficial here. The proc is heavily used. And if so, should it be clustered or not? Data is things like CP123456, DR126512.
The second question, which probably influences the first question is - would it be beneficial to change the column to a CHAR(10), as it 'seems' the order number is always the same size. Is there any speed benefit in putting an index on a fixed length CHAR, as opposed to a VARCHAR(150)?
(The different in size is because of unknown requirements when the column was created).
SQL Server 2008.
Yes, absolutely! Go right ahead and add an index. Clustering the index is probably unnecessary here, and will not be possible anyway if you already have another clustered index (such as the primary key) on the table.
Changing the column to a CHAR(10) might have some benefits in terms of storage size, but it's unlikely to make a particularly great difference in index performance. I'd skip it for now.
I don't have references to cite on this, only experience / anecdotal evidence.
Firstly, queries can nearly always be improved through use of indexes. The exact benefit depends on the query.
- If a query requires only specific records / a small portion of the table, an index will help
- If a query requires the whole table, but could benefit from ordered data, an index will help
Clustered indexes generally provide performance benefits over non-clustered indexes. In a very simplified sense, using a non-clustered index is like using two tables and joining them (The search friendly index is used first, which is then joined to the data itself - Unless the index contains all the data fields you need).
A consideration here, however, is the order in which data is added to your table. If your clustered index means that data is often inserted or deleted at the middle of the table, you'll get fragmentation and other artefacts. In my experience, however, awareness and consideration to this is only needed in extreme situations.
In short, DEFINITELY index your data. And the clustered index is normally best placed to serve your worst performing queries.
As for the difference between VARCHAR and CHAR? In the olden days it was important to keep variable length fields at the end of your data, to make the fixed length fields easier to identify. This meant that having a VARCHAR field as your first field, and using it as a unique identifier, was pretty poor.
Nowadays, the performance difference is marginal. Personally, I'd still keep unique identifiers as fixed length though. Variable length data won't normally have noticeable performance costs, but when you're actually making comparisons for join predicates, etc., it's much tidier to have fixed length fields if possible.

Which is the most common ID type in SQL Server databases, and which is better?

Is it better to use a Guid (UniqueIdentifier) as your Primary/Surrogate Key column, or a serialized "identity" integer column; and why is it better? Under which circumstances would you choose one over the other?
I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep apart the primary key which is a logical construct - it uniquely identifies your rows, it has to be unique and stable and NOT NULL. A GUID works well for a primary key, too - since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case, you need an uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct is used for the physical ordering of the data, and is a lot more difficult to get right. Typically, the Queen of Indexing on SQL Server, Kimberly Tripp, also requires a good clustering key to be uniqe, stable, as narrow as possible, and ideally ever-increasing (which a INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
A GUID is a really bad choice for a clustering key, since it's wide, totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key row(s) is also stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - GUID is 16 byte vs. INT is 4 byte, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
Use a GUID in a replicated system where you need to guarantee uniqueness.
Use ints where you have a non-replicated database and you want to maximise performance.
Very Seldomly use GUID.
Use rather a primary key/Surrogate Key for stoage purposes.
Also this will make it easier for human interaction with the data.
Creating Indexes will be a lot more efficient too.
See
How Using GUIDs in SQL Server Affect
Index Performance
Performance Effects of Using GUIDs
as Primary Keys
When considering using integers, be sure to allow for the maximum possible value that might occur. You often end up with skipped numbers because of deletions, so the actual maximum ID might be much larger than the total number of records in the table.
For example, if you aren't sure that a 32-bit integer will do, use a 64-bit integer.
You might also find these other SO discussions useful:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Picking the best primary key + numbering system.
And if you search here in SO for "primary key", you'll find those and a lot more very useful discussions.
There's no single answer to this. The issues that people are quick to jump on with Guid's (that their random nature combined with the default behavior of the primary key also acting as the clustered key) can be easily mitigated. Guids have a larger range than integers do, but as you start to fill that range with values you increase your risk of a collision.
Guid's can be very useful when you have a distributed system (for example, replicated databases) where a non-trivial amount of work would have to go into a key generation mechanism that wouldn't cause collisions between the portions of the system. Likewise, integers are useful because they're simple to use (every language has an integral type, not every language has a Guid type) and can be sequential (Guids can, too, but that's not their intended use).
It's all about what you're storing and how. The people that say "never use Guid's!" are just spreading FUD, but they also aren't the answer to every problem.
I believe it is almost always a serialized identy integer, but some will disagree. It does depend on the situation.
The reasons for identity is efficiency and simplicity. It's smaller. More easily indexed. It makes a great clustered index. Less fragmentation as new records are kept in order. Great for indexes on joins. Easier when eyeballing records in a db.
There is definately a place for Guids in certain circumstances. When merging disparate data, or when records have to be created in certain places. Guids should be in your bag of tricks but usually will not be your first choice.
This is an oft debated topic, but I tend to lean more towards identities for a couple of reasons. Firstly, an integer is only 4 bytes vs a 16 byte GUID. This means narrower indexes and more efficient queries. Secondly, we make use of ##IDENTITY and SCOPE_IDENTITY a lot in stored procs, etc which goes out the window with GUIDs.
Here's a nice little article by Jeff Atwood.
Use a GUID if you think you'll ever need to use the data outside the database, i.e. other databases). Some would argue, that is always the case, but it's a judgment call.

Why not always use GUIDs instead of Integer IDs?

What are the disadvantages of using GUIDs?
Why not always use them by default?
Integers join a lot faster, for one. This is especially important when dealing with millions of rows.
For two, GUIDs take up more space than integers do. Again, very important when dealing with millions of rows.
For three, GUIDs sometimes take different formats that can cause hiccups in applications, etc. An integer is an integer, through and through.
A more in depth look can be found here, and on Jeff's blog.
GUIDs are four times larger than an int, and twice as large as a bigint.
GUIDs are really difficult to look at if you are trying to troubleshoot tables.
GUIDs are great from a programmer's perspective - they're guaranteed to be (almost) unique, so why not use them everywhere, right?
If you look at it from the DBA perspective and from the database standpoint, at least for SQL Server, there are a few things to consider:
GUIDs as primary key (which is responsible for uniquely identifying a single row in your table) might be okay - after all, they're unique, right?
however, SQL Server also has the concept of the clustering key, which physically orders the data in your table; if you don't know about this, and don't do anything explicitly, your primary key becomes your clustering key.
Kimberly Tripp - world-known expert on SQL Server indexing and performance - has a great many blog posts on why a GUID as your clustering key is a really bad idea - check out her blog on indexes.
Most notably, her best practices for a clustering key are:
narrow
static
unique
ever-increasing
GUIDs are typically static and unique - but they're neither narrow (16 byte vs. 4 byte for a INT) nor ever-increasing. Due to their nature, they're unique and (pseudo-)random.
The narrow part is important because the clustering key will be added to each and every index page for each and every non-clustered index on your table - and if you have a few of those, and a few million rows in your table, this amounts to a massive waste of space - and not just on your disk, but also in your SQL Server's RAM.
The ever-increasing part is immportant, because the randomness of the GUIDs causes a lot of fragmentation in your indices, which negatively affects your performance all around. Even the newsequentialid() of SQL Server 2005 and up doesn't really create sequential GUIDs all around - they're sequential for a while and then there's a jump again, causing fragmentation (less than totally random GUIDs, but still).
So all in all, if you're really concerned with your SQL Server performance, using GUIDs as a clustering key is a really bad idea - use INT IDENTITY() instead, possibly using a GUID as the primary (non-clustered) key if you really have to.
Marc
GUIDS can simplify generating keys ahead of time, or generating keys offline, or in a cluster, without risk of collision. There may also be a slight security benefit, with all keys being unguessable.
The disadvantage is that it's harder to read/type and on many of your tables you may later realize a need to go back and generate human-friendly keys anyways. They'll also evenly distribute your records in a table, which may make it slower to query multiple records that were inserted at around the same time vs having an autonumber key where your records are in order of time inserted.
GUIDs are big and slow compared to ints - so use them when they're needed, eschew them when they're NOT needed, it's as simple as that!
This answer does NOT preclude the idea of using INT's as a primary key. It is mainly meant to point-out WHEN a guid is useful.
HERE IS A GREAT (SHORT) ARTICLE:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
Explained...
I use guids for any (common) DB entity-type which may need to be exported or shared with another DB instance. This way, I have a DNA marker (i.e. the guid) that can be used to differentiate between "like" objects of the same "physical" entity.
For example, let's pretend two database instances have a table called PROJECT. If the two projects share the same name or number it is hard to distinguish which one is which. Using GUID's though you can easily distinguish between 2 projects and where they come from...even when they have many similar values between them. This seems impossible...but actually can and does happen.
The biggest performance hit you'll see with GUIDs as a primary/clustered key is inserting records in large tables. It can be a heavy task to reindex since your key will fall somewhere in the middle
Using GUIDs as a primary key will eventually lead to your database crashing because the drive becomes too fragmented. This is a condition known as thrashing.

Changing newid() to newsequentialid() on an existing table

At the moment we have a number of tables that are using newid() on the primary key. This is causing large amounts of fragmentation. So I would like to change the column to use newsequentialid() instead.
I imagine that the existing data will remain quite fragmented but the new data will be less fragmented. This would imply that I should perhaps wait some time before changing the PK index from non-clustered to clustered.
My question is, does anyone have experience doing this? Is there anything I have overlooked that I should be careful of?
You might think about using comb guids, as opposed to newsequentialid.
cast(
cast(NewID() as binary(10)) +
cast(GetDate() as binary(6))
as uniqueidentifier)
Comb guids are a combination of purely random guids along with the non-randomness of the current datetime, so that sequential generations of comb guids are near each other and in general in ascending order. Comb guids have various advantages over newsequentialid, including the facts that they are not a black box, that you can use this formula outside of a default constraint, and that you can use this formula outside of SQL Server.
If you switch to sequentialguids and reorganize the index once at the same time, you'll eliminate fragmentation. I don't understand why you want to just wait until the fragmented page links rearrange themselves in continuous extents.
That being said, have you done any measurement to show that the fragmentation is actually affecting your system? Just looking at an index and seeing 'is fragmented 75%' does not imply that the access time is affected. There are many more factors that come into play (buffer pool page life expectancy, rate of reads vs. writes, locality of sequential operations, concurrency of operations etc etc). While switching from guids to sequential guids is usualy safe, you may introduce problems still. For instance you can see page latch contention for an insert intensive OLTP system because it creates a hot-spot page where the inserts accumulate.
Thank you, yfeldblum! Your simple and concise explanation of COMB GUIDs really helped me out. I was actually looking at doing the reverse of this post: I had to get away from relying on newsequentialid() since I was trying to migrate a SQL Server 2012 db to Azure, and the newsequentialid() function is not supported there.
I was able to change all of my table PK defaults to COMB GUIDs, with the following syntax:
ALTER TABLE [dbo].[Company]
ADD CONSTRAINT [DF__Company__Company_ID__72E6D332]
DEFAULT (CONVERT([uniqueidentifier],CONVERT([binary](10),newid(),0)+CONVERT([binary](6),getdate(),0),0)) FOR [CompanyId]
GO
My SQL2012 db is now happily living in the Azure cloud.
If this is SQL Server, you are generating a Guid by calling newid(). This is not good for primary keys. Use an integer identity column for the primary key, and make your Guid a surrogate key (and a row guid column).

Improving performance of cluster index GUID primary key

I've a table with large number of rows (10K+) and it primary key is GUID. The primary key is clustered. The query performance is quite low on this table. Please provide suggestions to make it efficient.
A clustered index on GUID is not a good design. The very nature of GUID is that it's random, while a clustered index physically orders the records by the key. The two things are completely at odds. For every insert SQL has to reorder the records on disk! Remove clustering from this index!
The time to use clustering is when you have a "natural" order to the data: time inserted, account number, etc. For time fields, clustering is almost free. For account number, it might be free or cheap (when account numbers are assigned sequentially).
While there may be technical ways around the GUID issue, the best idea is to understand when to use clustering.
There is no problem with using a GUID as the primary key. Just make sure that when you actually set the GUID to be the primary key then set the index it automatically creates to be of type Non-clustered. A lot of people forget (or dont know) to do this in SQL Server.
NEVER use a clustered index on a GUID. This will cause a physical ordering around the GUID on disk, which is obviously pointless (as others have already pointed out)
You need to use newsequentialid() instead see here Some Simple Code To Show The Difference Between Newid And Newsequentialid
You can try sequential GUIDS, which will make the index more effective. Info here.
You need to analyze your query. We can only guess why your queries perform badly without viewing the execution plan (which you can get quiet easily from SQL Server or Oracle).
Considering that a GUID is a 128-bit value (if stored raw), a GUID cuts the density of the data and index blocks by as much as 50% (in the case of the primary key index) so make sure GUID is appropriate.
But that might not be the problem, so review the query plan. It could be several other issues.
Please avoid creating clustered index for lenghty string columns. GUID will have 36 char. It will reduce the query performance even you have created as clustered index. for better practice, use integer identity columns.