SQL primary key, INT or GUID or..? [duplicate] - sql

This question already has answers here:
INT vs Unique-Identifier for ID field in database
(6 answers)
Closed 9 years ago.
Is there any reason why I should not use an Integer as primary key for my tables?
Database is SQL-CE, two main tables of approx 50,000 entries per year, and a few minor tables. Only two connections will exist constantly open to the database. But updates will be triggered through multiple TCP socket connections, so it will be many cross threads that access and use the same database-connection. Although activity is very low, so simultanous updates are quite unlikely, but may occur maybe a couple of times per day max.
Will probably use LINQ2SQL for DAL, or typed datasets.
Not sure if this info is relevant, but that's why I'm asking, since I don't know :)

You should use an integer - it is smaller, meaning less memory, less IO (disk and network), less work to join on.
The database should handle the concurrency issues, regardless of the type of PK.

The advantage of using GUID primkey is that it should be unique in the world, such as whether to move data from one database to another. So you know that the row is unique.
But if we are talking about a small db, so I prefer integer.
Edit:
If you using SQL Server 2005++, can you also use NEWSEQUENTIALID(),
this generates a GUID based on the row above.Allows the index problem with newid() is not there anymore.

I see no reason not to use an auto-increment integer in this scenario. If you ever get to the point where an integer can't handle the volume of data then you're talking about an application scaled up to the point that a lot more work is involved anyway.
Keep in mind a few things:
An integer is the native word size for the hardware. It's about as fast and simple and easy on the computer as a data type gets.
When considering possibly using a GUID, know that they make for terrible primary keys. Relational databases in general (I can't speak for all, but MS SQL is a good example) don't index GUIDs well. There are hacks out there to try to make more index-friendly GUIDs, take them or leave them. But in general a GUID should be avoided as a PK for performance reasons.

Is there any reason why I should not
use an Integer as primary key for my
tables?
Nope, as long as each one is unique, integers are fine. Guids sounds like a good idea at first, but in reality they are much too large. Most of the time, it's using a sledgehammer to kill a fly, and the size of the Guid makes it much slower than using an integer.

Definitely use an integer, you do not want to use a GUID in a clustered index (PK) as it will cause the table to unnecessarily fragment.

Related

Which is the most common ID type in SQL Server databases, and which is better?

Is it better to use a Guid (UniqueIdentifier) as your Primary/Surrogate Key column, or a serialized "identity" integer column; and why is it better? Under which circumstances would you choose one over the other?
I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep apart the primary key which is a logical construct - it uniquely identifies your rows, it has to be unique and stable and NOT NULL. A GUID works well for a primary key, too - since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case, you need an uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct is used for the physical ordering of the data, and is a lot more difficult to get right. Typically, the Queen of Indexing on SQL Server, Kimberly Tripp, also requires a good clustering key to be uniqe, stable, as narrow as possible, and ideally ever-increasing (which a INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
A GUID is a really bad choice for a clustering key, since it's wide, totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key row(s) is also stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - GUID is 16 byte vs. INT is 4 byte, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
Use a GUID in a replicated system where you need to guarantee uniqueness.
Use ints where you have a non-replicated database and you want to maximise performance.
Very Seldomly use GUID.
Use rather a primary key/Surrogate Key for stoage purposes.
Also this will make it easier for human interaction with the data.
Creating Indexes will be a lot more efficient too.
See
How Using GUIDs in SQL Server Affect
Index Performance
Performance Effects of Using GUIDs
as Primary Keys
When considering using integers, be sure to allow for the maximum possible value that might occur. You often end up with skipped numbers because of deletions, so the actual maximum ID might be much larger than the total number of records in the table.
For example, if you aren't sure that a 32-bit integer will do, use a 64-bit integer.
You might also find these other SO discussions useful:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Picking the best primary key + numbering system.
And if you search here in SO for "primary key", you'll find those and a lot more very useful discussions.
There's no single answer to this. The issues that people are quick to jump on with Guid's (that their random nature combined with the default behavior of the primary key also acting as the clustered key) can be easily mitigated. Guids have a larger range than integers do, but as you start to fill that range with values you increase your risk of a collision.
Guid's can be very useful when you have a distributed system (for example, replicated databases) where a non-trivial amount of work would have to go into a key generation mechanism that wouldn't cause collisions between the portions of the system. Likewise, integers are useful because they're simple to use (every language has an integral type, not every language has a Guid type) and can be sequential (Guids can, too, but that's not their intended use).
It's all about what you're storing and how. The people that say "never use Guid's!" are just spreading FUD, but they also aren't the answer to every problem.
I believe it is almost always a serialized identy integer, but some will disagree. It does depend on the situation.
The reasons for identity is efficiency and simplicity. It's smaller. More easily indexed. It makes a great clustered index. Less fragmentation as new records are kept in order. Great for indexes on joins. Easier when eyeballing records in a db.
There is definately a place for Guids in certain circumstances. When merging disparate data, or when records have to be created in certain places. Guids should be in your bag of tricks but usually will not be your first choice.
This is an oft debated topic, but I tend to lean more towards identities for a couple of reasons. Firstly, an integer is only 4 bytes vs a 16 byte GUID. This means narrower indexes and more efficient queries. Secondly, we make use of ##IDENTITY and SCOPE_IDENTITY a lot in stored procs, etc which goes out the window with GUIDs.
Here's a nice little article by Jeff Atwood.
Use a GUID if you think you'll ever need to use the data outside the database, i.e. other databases). Some would argue, that is always the case, but it's a judgment call.

Why not always use GUIDs instead of Integer IDs?

What are the disadvantages of using GUIDs?
Why not always use them by default?
Integers join a lot faster, for one. This is especially important when dealing with millions of rows.
For two, GUIDs take up more space than integers do. Again, very important when dealing with millions of rows.
For three, GUIDs sometimes take different formats that can cause hiccups in applications, etc. An integer is an integer, through and through.
A more in depth look can be found here, and on Jeff's blog.
GUIDs are four times larger than an int, and twice as large as a bigint.
GUIDs are really difficult to look at if you are trying to troubleshoot tables.
GUIDs are great from a programmer's perspective - they're guaranteed to be (almost) unique, so why not use them everywhere, right?
If you look at it from the DBA perspective and from the database standpoint, at least for SQL Server, there are a few things to consider:
GUIDs as primary key (which is responsible for uniquely identifying a single row in your table) might be okay - after all, they're unique, right?
however, SQL Server also has the concept of the clustering key, which physically orders the data in your table; if you don't know about this, and don't do anything explicitly, your primary key becomes your clustering key.
Kimberly Tripp - world-known expert on SQL Server indexing and performance - has a great many blog posts on why a GUID as your clustering key is a really bad idea - check out her blog on indexes.
Most notably, her best practices for a clustering key are:
narrow
static
unique
ever-increasing
GUIDs are typically static and unique - but they're neither narrow (16 byte vs. 4 byte for a INT) nor ever-increasing. Due to their nature, they're unique and (pseudo-)random.
The narrow part is important because the clustering key will be added to each and every index page for each and every non-clustered index on your table - and if you have a few of those, and a few million rows in your table, this amounts to a massive waste of space - and not just on your disk, but also in your SQL Server's RAM.
The ever-increasing part is immportant, because the randomness of the GUIDs causes a lot of fragmentation in your indices, which negatively affects your performance all around. Even the newsequentialid() of SQL Server 2005 and up doesn't really create sequential GUIDs all around - they're sequential for a while and then there's a jump again, causing fragmentation (less than totally random GUIDs, but still).
So all in all, if you're really concerned with your SQL Server performance, using GUIDs as a clustering key is a really bad idea - use INT IDENTITY() instead, possibly using a GUID as the primary (non-clustered) key if you really have to.
Marc
GUIDS can simplify generating keys ahead of time, or generating keys offline, or in a cluster, without risk of collision. There may also be a slight security benefit, with all keys being unguessable.
The disadvantage is that it's harder to read/type and on many of your tables you may later realize a need to go back and generate human-friendly keys anyways. They'll also evenly distribute your records in a table, which may make it slower to query multiple records that were inserted at around the same time vs having an autonumber key where your records are in order of time inserted.
GUIDs are big and slow compared to ints - so use them when they're needed, eschew them when they're NOT needed, it's as simple as that!
This answer does NOT preclude the idea of using INT's as a primary key. It is mainly meant to point-out WHEN a guid is useful.
HERE IS A GREAT (SHORT) ARTICLE:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
Explained...
I use guids for any (common) DB entity-type which may need to be exported or shared with another DB instance. This way, I have a DNA marker (i.e. the guid) that can be used to differentiate between "like" objects of the same "physical" entity.
For example, let's pretend two database instances have a table called PROJECT. If the two projects share the same name or number it is hard to distinguish which one is which. Using GUID's though you can easily distinguish between 2 projects and where they come from...even when they have many similar values between them. This seems impossible...but actually can and does happen.
The biggest performance hit you'll see with GUIDs as a primary/clustered key is inserting records in large tables. It can be a heavy task to reindex since your key will fall somewhere in the middle
Using GUIDs as a primary key will eventually lead to your database crashing because the drive becomes too fragmented. This is a condition known as thrashing.

Database design: why use an autoincremental field as primary key?

here is my question: why should I use autoincremental fields as primary keys on my tables instead of something like UUID values?
What are the main advantages of one over another? What are the problems and strengths of them?
Simple numbers consume less space. UUID values take consume 128 bits each. Working with numbers is also simpler. For most practical purposes 32-bit or 64-bit integers can server well as the primary key. 264 is a very large number.
Consuming less space doesn't just save hard disk space In means faster backups, better performance in joins, and having more real stuff cached in the database server memory.
You don't have to use auto-incrementing primary keys, but I do. Here's why.
First, if you're using int's, they're smaller than UUIDs.
Second, it's much easier to query using ints than UUIDs, especially if your primary keys turn up as foreign keys in other tables.
Also, consider the code you'll write in any data access layer. A lot of my constructors take a single id as an int. It's clean, and in a type-safe language like C# - any problems are caught at compile time.
Drawbacks of autoincrementers? Potentially running out of space. I have a table which is at 200M on it's id field at the moment. It'll bust the 2 billion limit in a year if I leave as is.
You could also argue that an autoincrementing id has no intrinsic meaning, but then the same is true of a UUID.
I guess by UUID you mean like a GUID? GUIDs are better when you will later have to merge tables. For example, if you have local databases spread around the world, they can each generate unique GUIDs for row identifiers. Later the data can be combined into a single database and the id's shouldn't conflict. With an autoincrement in this case you would have to have a composite key where the other half of the key identifies the originating location, or you would have to modify the IDs as you imported data into the master database.

Changing newid() to newsequentialid() on an existing table

At the moment we have a number of tables that are using newid() on the primary key. This is causing large amounts of fragmentation. So I would like to change the column to use newsequentialid() instead.
I imagine that the existing data will remain quite fragmented but the new data will be less fragmented. This would imply that I should perhaps wait some time before changing the PK index from non-clustered to clustered.
My question is, does anyone have experience doing this? Is there anything I have overlooked that I should be careful of?
You might think about using comb guids, as opposed to newsequentialid.
cast(
cast(NewID() as binary(10)) +
cast(GetDate() as binary(6))
as uniqueidentifier)
Comb guids are a combination of purely random guids along with the non-randomness of the current datetime, so that sequential generations of comb guids are near each other and in general in ascending order. Comb guids have various advantages over newsequentialid, including the facts that they are not a black box, that you can use this formula outside of a default constraint, and that you can use this formula outside of SQL Server.
If you switch to sequentialguids and reorganize the index once at the same time, you'll eliminate fragmentation. I don't understand why you want to just wait until the fragmented page links rearrange themselves in continuous extents.
That being said, have you done any measurement to show that the fragmentation is actually affecting your system? Just looking at an index and seeing 'is fragmented 75%' does not imply that the access time is affected. There are many more factors that come into play (buffer pool page life expectancy, rate of reads vs. writes, locality of sequential operations, concurrency of operations etc etc). While switching from guids to sequential guids is usualy safe, you may introduce problems still. For instance you can see page latch contention for an insert intensive OLTP system because it creates a hot-spot page where the inserts accumulate.
Thank you, yfeldblum! Your simple and concise explanation of COMB GUIDs really helped me out. I was actually looking at doing the reverse of this post: I had to get away from relying on newsequentialid() since I was trying to migrate a SQL Server 2012 db to Azure, and the newsequentialid() function is not supported there.
I was able to change all of my table PK defaults to COMB GUIDs, with the following syntax:
ALTER TABLE [dbo].[Company]
ADD CONSTRAINT [DF__Company__Company_ID__72E6D332]
DEFAULT (CONVERT([uniqueidentifier],CONVERT([binary](10),newid(),0)+CONVERT([binary](6),getdate(),0),0)) FOR [CompanyId]
GO
My SQL2012 db is now happily living in the Azure cloud.
If this is SQL Server, you are generating a Guid by calling newid(). This is not good for primary keys. Use an integer identity column for the primary key, and make your Guid a surrogate key (and a row guid column).

Sanity Check: Floats as primary keys?

I'm working with an old sql server 2000 database, mixing some of it's information with a new app I'm building. I noticed some of the primary keys in several of the tables are floats rather than any type of ints. They aren't foreign keys and are all unique. I can't think of any reason that anyone would want to make their unique primary key IDs floats, but I'm not a SQL expert by any means. So I guess what I'm asking is does whoever designed this fairly extensive database know something I don't?
I'm currently working with a rather big accountant package where EACH of 350+ tables has a primary key of FLOAT(53). All actual values are integers and the system strictly checks that they indeed are (there are special functions that do all the incrementing work).
I did wonder at this design, yet I can understand why it was chosen and give it some credits.
On the one hand, the system is big enough to have billion records in some tables. On the other hand, those primary keys must be easily readable from external applications like Excel or VB6, in which case you don't really want to make them BIGINT.
Hence, float is fine.
I worked with someone who used floats as PKs in SQL Server databases. He was worried about running out of numbers for identifiers if he stuck to INTs. (32 bit on SQL Server.) He just looked at the range of float and did not think about the fact never mind that for larger numbers the ones place is not held in the number, due to limited precision. So his code to take MAX(PK) + 1.0 would at some point return a number equal to MAX(PK). Not good. I finally convinced him to not use float for surrogate primary keys for future databases. He quit before fixing the DB he was working on.
To answer your question, "So I guess what I'm asking is does whoever designed this fairly extensive database know something I don't?" Most likely NO! At least not with respect to choosing datatypes.
Floats have one interesting property: it is always possible to insert a value between two others, except for the pathological case where you run out of bits. It has the disadvantage that representational issues may keep you from referring to a row by the key; it's hard to make two floating point values equal to each other.
Is it a NUMERIC( x, y) format and an IDENTITY? If so, it might be an upgrade from an older version of SQL Server. Back-in-the-day IDENTITY could only be a NUMERIC format, not the common INT we use today.
Otherwise, there's no way to tell if a float is suitable as a primary key -- it depends upon your domain. It's a bit harder to compare (IEEE INT is more efficient than float) and most people use monotonically increasing numbers (IDENTITY), so integers are often what people really want.
Since it looks like you're storing ints:
To answer the original question more directly: If you're storing ints, use the integer datatype. It's more efficient to store and compare.
I have been working with the Cerner Millenium database for a few years (under the covers it uses Oracle). Initially I was very surprised to see that it used floats for IDs on tables. Then I encountered an ID in our database > 2^32 and a query I wrote gave the wrong results because I had incorrectly cast it to an INT I realized why they did it. I don't find any of the arguments above against using floats persuasive in the real world, where for keys you just need numbers only "somewhat greater" than 2^32 and the value of the ID is always of the form ######.0. (No one is talking about an ID of the form ######.######.) However now that we are importing this data into a SQL Server warehouse, and bigint is available, we are going to go with bigint instead of float.
FYI-- there is another way of looking at this:
I work in real-time process control and as such, most of my row entries are time-based and are generated automatically at high rates by non-ASCII machines. time--that's what my users usually search on, and many of my 'users' are actually machines themselves. Hence, UTC based primary keys.