Here is my question: why should I use auto-incrementing fields as primary keys on my tables instead of something like UUID values?
What are the main advantages of one over the other? What are the strengths and weaknesses of each?
Simple numbers consume less space. UUID values consume 128 bits each. Working with numbers is also simpler. For most practical purposes, 32-bit or 64-bit integers can serve well as the primary key. 2^64 is a very large number.
Consuming less space doesn't just save hard disk space. It means faster backups, better performance in joins, and more real data cached in the database server's memory.
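To put rough numbers on the size argument, here is a minimal sketch. The table size and the number of copies per key (one primary key plus foreign key or index entries that repeat it) are made-up figures, and the calculation ignores page and row overhead, so treat it as a lower bound rather than what any particular database would report.

```python
import struct
import uuid

# Storage size of each candidate key type, in bytes.
INT32_BYTES = struct.calcsize("<i")    # 32-bit integer
INT64_BYTES = struct.calcsize("<q")    # 64-bit integer
UUID_BYTES = len(uuid.uuid4().bytes)   # 128-bit UUID

# Largest value a signed 32-bit and a signed 64-bit key can hold.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def key_storage_mb(rows: int, key_bytes: int, copies: int = 1) -> float:
    """Raw bytes spent on the key alone, in MB (ignores page and row overhead).

    `copies` counts how many places each key value is stored, e.g. the
    primary key plus every foreign key or index entry that repeats it.
    """
    return rows * key_bytes * copies / 1_000_000


rows = 10_000_000  # hypothetical table size
for name, size in [("int32", INT32_BYTES), ("int64", INT64_BYTES), ("uuid", UUID_BYTES)]:
    mb = key_storage_mb(rows, size, copies=3)
    print(f"{name}: {size} bytes per key, {mb:.0f} MB for 3 copies of each key")
```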
You don't have to use auto-incrementing primary keys, but I do. Here's why.
First, if you're using ints, they're smaller than UUIDs.
Second, it's much easier to query using ints than UUIDs, especially if your primary keys turn up as foreign keys in other tables.
Also, consider the code you'll write in any data access layer. A lot of my constructors take a single id as an int. It's clean, and in a type-safe language like C# - any problems are caught at compile time.
Drawbacks of autoincrementers? Potentially running out of space. I have a table whose id field is at 200M at the moment. It'll bust the 2 billion limit of a signed 32-bit int in a year if I leave it as is.
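The arithmetic behind that kind of estimate is easy to sketch. The function below assumes a constant insert rate, which real tables rarely have. The rate used here is simply inferred from "200M now, full within a year" and is not a measured figure.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647: the "2 billion limit" of a signed 32-bit key
INT64_MAX = 2**63 - 1


def days_until_exhausted(current_id: int, ids_per_day: float, max_id: int = INT32_MAX) -> float:
    """Days left before the next id would exceed max_id, at a constant rate."""
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    return (max_id - current_id) / ids_per_day


current_id = 200_000_000
# Rate implied by reaching the 32-bit limit in 365 days (an inferred figure).
implied_rate = (INT32_MAX - current_id) / 365

print(f"implied rate: {implied_rate:,.0f} ids per day")
years_on_bigint = days_until_exhausted(current_id, implied_rate, INT64_MAX) / 365
print(f"years left at that rate with a 64-bit key: {years_on_bigint:,.0f}")
```

Switching the column to a 64-bit integer pushes the limit far enough out that it stops being a practical concern at that rate.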
You could also argue that an autoincrementing id has no intrinsic meaning, but then the same is true of a UUID.
I guess by UUID you mean something like a GUID? GUIDs are better when you will later have to merge tables. For example, if you have local databases spread around the world, they can each generate unique GUIDs for row identifiers. Later the data can be combined into a single database and the IDs shouldn't conflict. With an autoincrement in this case you would have to have a composite key where the other half of the key identifies the originating location, or you would have to modify the IDs as you imported data into the master database.
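A small sketch of that merge scenario, using plain dictionaries as stand-ins for tables. The site names and rows are invented for illustration, and the UUIDs are random version 4 values, so "no conflicts" means overwhelmingly unlikely rather than strictly impossible.

```python
import uuid


def make_site_rows_autoincrement(names):
    """Each site numbers its own rows 1, 2, 3, ... independently."""
    return {i: name for i, name in enumerate(names, start=1)}


def make_site_rows_uuid(names):
    """Each site generates a random (version 4) UUID per row."""
    return {uuid.uuid4(): name for name in names}


def merge(*sites):
    """Combine per-site tables; return (merged rows, list of conflicting keys)."""
    merged, conflicts = {}, []
    for site in sites:
        for key, value in site.items():
            if key in merged:
                conflicts.append(key)
            else:
                merged[key] = value
    return merged, conflicts


london = ["alice", "bob"]          # hypothetical rows at one site
tokyo = ["chen", "daiki", "emi"]   # hypothetical rows at another site

_, int_conflicts = merge(make_site_rows_autoincrement(london),
                         make_site_rows_autoincrement(tokyo))
uuid_merged, uuid_conflicts = merge(make_site_rows_uuid(london),
                                    make_site_rows_uuid(tokyo))

# Composite-key alternative: pair each integer id with its originating site.
composite = {("london", k): v for k, v in make_site_rows_autoincrement(london).items()}
composite.update({("tokyo", k): v for k, v in make_site_rows_autoincrement(tokyo).items()})

print("autoincrement conflicts:", int_conflicts)   # ids 1 and 2 exist at both sites
print("uuid rows merged:", len(uuid_merged), "conflicts:", uuid_conflicts)
print("composite-key rows merged:", len(composite))
```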
I've read that UUIDs are typically not recommended as a primary key due to size and performance issues on large data sets.
However, would it be detrimental at all to use it on a few of the top level organizational tables? E.g. Organization or Branch, where there are only a handful of entries?
I would recommend that you use serial instead of UUIDs. Why are integers preferable to UUIDs?
They occupy less space. This is a marginal consideration in the base table, but a bigger issue for foreign keys.
Integers are easier to read and remember.
In many databases, tables are physically ordered using primary keys. In such databases, new inserts on a UUID will almost always go "between" records, which is expensive. However, Postgres does not support clustered indexes so the underlying data is not ordered.
There are downsides to integers:
There are a finite number, although big ints pretty much solve that problem.
They encode order-of-insertion information. Actually, this can be a positive or a negative.
Other than space usage, I don't think there is much harm in using UUIDs on a static table. I strongly prefer integers, only resorting to UUIDs in situations where an integer would be difficult to calculate.
I'm looking for a solution for reverse engineering a DB without foreign keys (really! a 20-year-old DB...). The intention is to do this completely without additional application or persistence logic, just by analyzing the data.
I know this would be somewhat difficult, but it should be possible if the data itself, especially the PKs, is analyzed as well.
I don't think there is a universal solution to your problem. Hopefully there is some sort of a naming convention for the tables/columns that can lead you. You can query the system tables to try and figure what's going on (Oracle: user_tab_columns, SQL Server: INFORMATION_SCHEMA.COLUMNS, etc.). Good luck!
I also don't think you'll find a universal solution to this... but I'd like to suggest an approach, especially if you don't have any source code that could guide you on the mapping:
First, scan all the tables in your database. By scan, I mean store the table names and columns.
You can infer column types by trying to convert the data to a specific format (start by trying to convert to dates, numbers, booleans and so on)... You can also try to discover data types by analysing the contents (whether a column holds only numbers without decimal points, numbers with slashes, long or short texts, only a single digit, etc.).
Once you have mapped all the tables, start comparing the contents of all columns that have a numeric type. (Why? If the database was designed by a human... then he/she/they will probably have used numbers as primary/foreign keys.)
Every single time you find more than X successful correspondences between the contents of 2 columns from 2 different tables, log this connection. (This X factor depends on the amount of records you have...)
This analysis must run for each table comparing all other tables... column by column - so... this will take some time...
Of course, this is only an overview of what needs to be done, but the code is not complex to write...
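The steps above can be sketched roughly as follows, here against SQLite because it ships with Python. Two things are my own assumptions on top of the description: a parent column only counts if its values are unique, and every child value has to appear in the parent. The `min_matches` parameter plays the role of the X threshold, and the `customer`/`invoice` tables are invented sample data.

```python
import sqlite3
from itertools import permutations


def list_columns(conn):
    """Return [(table, column), ...] for every user table."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    cols = []
    for t in tables:
        for row in conn.execute(f'PRAGMA table_info("{t}")'):
            cols.append((t, row[1]))  # row[1] is the column name
    return cols


def column_profile(conn, table, column):
    """Return (distinct integer values, is_unique), or None for non-integer columns."""
    rows = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL').fetchall()
    values = [r[0] for r in rows]
    if not values or not all(isinstance(v, int) for v in values):
        return None
    distinct = set(values)
    return distinct, len(distinct) == len(values)


def candidate_foreign_keys(conn, min_matches=2):
    """Log (child, parent, matches) for every plausible child -> parent column pair."""
    profiles = {}
    for table, column in list_columns(conn):
        profile = column_profile(conn, table, column)
        if profile is not None:
            profiles[(table, column)] = profile
    found = []
    for child, parent in permutations(profiles, 2):
        if child[0] == parent[0]:
            continue  # only compare columns from different tables
        child_vals, _ = profiles[child]
        parent_vals, parent_unique = profiles[parent]
        matches = len(child_vals & parent_vals)
        # Assumption: the parent looks like a key (unique) and contains every child value.
        if parent_unique and child_vals <= parent_vals and matches >= min_matches:
            found.append((child, parent, matches))
    return found


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_no INTEGER, name TEXT);
    CREATE TABLE invoice (inv_no INTEGER, cust INTEGER, total REAL);
    INSERT INTO customer VALUES (10, 'Ann'), (20, 'Bob'), (30, 'Cy');
    INSERT INTO invoice VALUES (501, 10, 9.5), (502, 10, 20.0), (503, 30, 7.25);
""")
for child, parent, matches in candidate_foreign_keys(conn):
    print(child, "->", parent, "matches:", matches)
```

On real data a low threshold will produce false positives (small lookup ids overlap with almost everything), so the logged pairs are candidates to review, not confirmed relationships.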
Good luck and let me know if you find any sort of tool to do this! :-)
No offense but you can't have been in databases very long if this surprises you.
I am going to assume that by "reverse engineering" you are just looking to fill in the foreign keys, not moving to NoSQL or something. It could be an interesting project. Here is how I would go about it:
Look at all the SELECT statements and see how joins are made to a table. 20 years ago this would be in a WHERE clause, but it gets more complicated than that, of course, with correlated subqueries, UPDATE statements with FROM clauses, and whatever else implies a join of some sort. You have to be able to figure all that out. If you want to do it formally (you can probably suss out all this stuff intuitively), you could list the number of times column combinations are used in joins between tables. List them by pairs of tables, not by the set of all the tables in the join. Those would be the candidate foreign keys if one side is a primary key. The other side gets the foreign key. There are multi-column PKs, but you can figure that out (so if the other side of the primary key is in two tables, that's not a foreign key). If one column ends up pointing to two different table PKs, that's not a proper foreign key either, but it might be appropriate to pick a table and use it as the target.
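A very rough sketch of the "count the join pairs" idea. It assumes the statements reference columns as table.column with the full table name; real code usually uses aliases, which would have to be resolved first, and a regex like this is no substitute for a SQL parser. The sample statements are invented.

```python
import re
from collections import Counter

# Matches equality predicates of the form table_a.col_a = table_b.col_b.
JOIN_PREDICATE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")


def count_join_pairs(statements):
    """Count how often each pair of table.column references is equated."""
    counts = Counter()
    for sql in statements:
        for t1, c1, t2, c2 in JOIN_PREDICATE.findall(sql.lower()):
            if t1 == t2:
                continue  # same-table comparison, not a join between tables
            # Sort so that a.x = b.y and b.y = a.x count as the same pair.
            pair = tuple(sorted([(t1, c1), (t2, c2)]))
            counts[pair] += 1
    return counts


statements = [
    "SELECT * FROM invoice, customer WHERE invoice.cust = customer.cust_no",
    "SELECT name FROM customer JOIN invoice ON customer.cust_no = invoice.cust",
    "UPDATE invoice SET total = 0 FROM customer "
    "WHERE invoice.cust = customer.cust_no AND customer.name = 'x'",
]
for pair, n in count_join_pairs(statements).most_common():
    print(n, pair)
```

Pairs that show up often, where one side is a primary key, are the candidates to look at first.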
If you don't already have primary keys, you should define them first. Indexes, perhaps even clustered indexes (in Sybase/MSSQL), aren't always the correct primary keys. In any case you may have to change the primary keys accordingly.
To collect all the statements might be challenging in itself. You could use perl/awk to parse them out of their C/Java/PHP/Basic/COBOL programs, or you could reap them from monitoring input to the server. You would want to look for WHERE/JOIN/APPLY etc. rather than SELECT. There are lots of other ways.
Is there any reason why I should not use an Integer as primary key for my tables?
Database is SQL-CE, two main tables of approx 50,000 entries per year, and a few minor tables. Only two connections will be constantly open to the database. But updates will be triggered through multiple TCP socket connections, so many threads will access and use the same database connection. Activity is very low, though, so simultaneous updates are quite unlikely, but may occur maybe a couple of times per day at most.
Will probably use LINQ2SQL for DAL, or typed datasets.
Not sure if this info is relevant, but that's why I'm asking, since I don't know :)
You should use an integer - it is smaller, meaning less memory, less IO (disk and network), less work to join on.
The database should handle the concurrency issues, regardless of the type of PK.
The advantage of a GUID primary key is that it should be globally unique, which helps when you move data from one database to another: you know each row stays unique.
But for a small database, I prefer an integer.
Edit:
If you are using SQL Server 2005 or later, you can also use NEWSEQUENTIALID(), which generates each GUID greater than the ones it generated before, so the index problem you get with NEWID() largely goes away.
I see no reason not to use an auto-increment integer in this scenario. If you ever get to the point where an integer can't handle the volume of data then you're talking about an application scaled up to the point that a lot more work is involved anyway.
Keep in mind a few things:
An integer is the native word size for the hardware. It's about as fast and simple and easy on the computer as a data type gets.
When considering possibly using a GUID, know that they make for terrible primary keys. Relational databases in general (I can't speak for all, but MS SQL is a good example) don't index GUIDs well. There are hacks out there to try to make more index-friendly GUIDs, take them or leave them. But in general a GUID should be avoided as a PK for performance reasons.
Is there any reason why I should not use an Integer as primary key for my tables?
Nope, as long as each one is unique, integers are fine. Guids sound like a good idea at first, but in reality they are much too large. Most of the time, it's using a sledgehammer to kill a fly, and the size of the Guid makes it much slower than using an integer.
Definitely use an integer, you do not want to use a GUID in a clustered index (PK) as it will cause the table to unnecessarily fragment.
Is it better to use a Guid (UniqueIdentifier) as your Primary/Surrogate Key column, or a serialized "identity" integer column; and why is it better? Under which circumstances would you choose one over the other?
I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep apart the primary key, which is a logical construct: it uniquely identifies your rows, and it has to be unique, stable, and NOT NULL. A GUID works well for a primary key, too, since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct used for the physical ordering of the data, and it is a lot more difficult to get right. The Queen of Indexing on SQL Server, Kimberly Tripp, requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
A GUID is a really bad choice for a clustering key, since it's wide and totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
Use a GUID in a replicated system where you need to guarantee uniqueness.
Use ints where you have a non-replicated database and you want to maximise performance.
Use GUIDs very seldom.
Rather, use a primary key/surrogate key for storage purposes.
Also this will make it easier for human interaction with the data.
Creating Indexes will be a lot more efficient too.
See
How Using GUIDs in SQL Server Affect Index Performance
Performance Effects of Using GUIDs as Primary Keys
When considering using integers, be sure to allow for the maximum possible value that might occur. You often end up with skipped numbers because of deletions, so the actual maximum ID might be much larger than the total number of records in the table.
For example, if you aren't sure that a 32-bit integer will do, use a 64-bit integer.
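To see how the highest ID drifts away from the row count, here is a small sketch using SQLite's AUTOINCREMENT (chosen only because it ships with Python; the `item` table and the delete pattern are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT)")

# Insert 1000 rows, then delete most of them, as churn in a real table would.
conn.executemany("INSERT INTO item (label) VALUES (?)",
                 [(f"row {i}",) for i in range(1000)])
conn.execute("DELETE FROM item WHERE id % 10 != 0")  # keep only every tenth row
conn.execute("INSERT INTO item (label) VALUES ('newest')")

row_count, max_id = conn.execute("SELECT COUNT(*), MAX(id) FROM item").fetchone()
print(f"rows in table: {row_count}, highest id handed out: {max_id}")

# Size the key for the highest id you expect to hand out, not for the row count.
fits_in_int32 = max_id <= 2**31 - 1
print("fits in a signed 32-bit integer:", fits_in_int32)
```

The deleted ids are never reused, so the key range has to cover every row that was ever inserted.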
You might also find these other SO discussions useful:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Picking the best primary key + numbering system.
And if you search here in SO for "primary key", you'll find those and a lot more very useful discussions.
There's no single answer to this. The issues that people are quick to jump on with Guids (their random nature, combined with the default behavior of the primary key also acting as the clustered key) can be easily mitigated. Guids have a larger range than integers do, but as you start to fill that range with values you increase your risk of a collision.
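For a sense of scale on the collision risk, the usual birthday-bound approximation can be sketched like this. It assumes random version 4 UUIDs (122 random bits out of 128) and a good random source; other GUID generation schemes behave differently.

```python
import math

RANDOM_BITS_UUID4 = 122  # a version 4 UUID has 128 bits, 6 of which are fixed


def collision_probability(n_keys: int, random_bits: int = RANDOM_BITS_UUID4) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2.0 ** random_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * space))


for n in (10**6, 10**9, 10**12):
    print(f"{n:>18,} keys -> p ~= {collision_probability(n):.3e}")
```

The probability grows roughly with the square of the number of keys, which is the "filling the range" effect, but it stays very small at row counts a single database is likely to hold.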
Guids can be very useful when you have a distributed system (for example, replicated databases) where a non-trivial amount of work would have to go into a key generation mechanism that wouldn't cause collisions between the portions of the system. Likewise, integers are useful because they're simple to use (every language has an integral type, not every language has a Guid type) and can be sequential (Guids can, too, but that's not their intended use).
It's all about what you're storing and how. The people who say "never use Guids!" are just spreading FUD, but Guids also aren't the answer to every problem.
I believe it is almost always a serialized identity integer, but some will disagree. It does depend on the situation.
The reasons for identity are efficiency and simplicity. It's smaller. More easily indexed. It makes a great clustered index. Less fragmentation as new records are kept in order. Great for indexes on joins. Easier when eyeballing records in a db.
There is definitely a place for Guids in certain circumstances: when merging disparate data, or when records have to be created in certain places. Guids should be in your bag of tricks but usually will not be your first choice.
This is an oft-debated topic, but I tend to lean more towards identities for a couple of reasons. Firstly, an integer is only 4 bytes vs a 16-byte GUID. This means narrower indexes and more efficient queries. Secondly, we make use of @@IDENTITY and SCOPE_IDENTITY a lot in stored procs, etc., which go out the window with GUIDs.
Here's a nice little article by Jeff Atwood.
Use a GUID if you think you'll ever need to use the data outside the database (i.e. in other databases). Some would argue that is always the case, but it's a judgment call.
What are the disadvantages of using GUIDs?
Why not always use them by default?
Integers join a lot faster, for one. This is especially important when dealing with millions of rows.
For two, GUIDs take up more space than integers do. Again, very important when dealing with millions of rows.
For three, GUIDs sometimes take different formats that can cause hiccups in applications, etc. An integer is an integer, through and through.
A more in depth look can be found here, and on Jeff's blog.
GUIDs are four times larger than an int, and twice as large as a bigint.
GUIDs are really difficult to look at if you are trying to troubleshoot tables.
GUIDs are great from a programmer's perspective - they're guaranteed to be (almost) unique, so why not use them everywhere, right?
If you look at it from the DBA perspective and from the database standpoint, at least for SQL Server, there are a few things to consider:
GUIDs as primary key (which is responsible for uniquely identifying a single row in your table) might be okay - after all, they're unique, right?
however, SQL Server also has the concept of the clustering key, which physically orders the data in your table; if you don't know about this, and don't do anything explicitly, your primary key becomes your clustering key.
Kimberly Tripp - world-known expert on SQL Server indexing and performance - has a great many blog posts on why a GUID as your clustering key is a really bad idea - check out her blog on indexes.
Most notably, her best practices for a clustering key are:
narrow
static
unique
ever-increasing
GUIDs are typically static and unique - but they're neither narrow (16 bytes vs. 4 bytes for an INT) nor ever-increasing. Due to their nature, they're unique and (pseudo-)random.
The narrow part is important because the clustering key will be added to each and every index page for each and every non-clustered index on your table - and if you have a few of those, and a few million rows in your table, this amounts to a massive waste of space - and not just on your disk, but also in your SQL Server's RAM.
The ever-increasing part is important, because the randomness of the GUIDs causes a lot of fragmentation in your indices, which negatively affects your performance all around. Even the newsequentialid() of SQL Server 2005 and up doesn't really create sequential GUIDs all around - they're sequential for a while and then there's a jump again, causing fragmentation (less than totally random GUIDs, but still).
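The fragmentation effect can be illustrated with a toy model. This is not how SQL Server actually stores pages: it is a simplified sorted list of fixed-size "leaf pages" where a full page is split in half when a key lands in the middle, and a fresh page is started when a key is appended past the current maximum. The page capacity and row count are arbitrary.

```python
import bisect
import uuid

PAGE_CAPACITY = 100  # hypothetical number of keys per leaf page


def simulate_inserts(keys, capacity=PAGE_CAPACITY):
    """Insert keys into a toy ordered leaf level; return (mid-page splits, average fill)."""
    pages = [[]]   # leaf pages, each a sorted list of keys
    lows = []      # lows[i] is the lowest key on pages[i + 1]
    splits = 0
    for key in keys:
        i = bisect.bisect_right(lows, key)   # index of the page whose range holds key
        page = pages[i]
        bisect.insort(page, key)
        if len(page) > capacity:
            if i == len(pages) - 1 and page[-1] == key:
                # Appending past the current maximum: start a fresh page.
                new_page = [page.pop()]
            else:
                # Key landed mid-table: split the full page in half.
                half = len(page) // 2
                new_page = page[half:]
                del page[half:]
                splits += 1
            pages.insert(i + 1, new_page)
            lows.insert(i, new_page[0])
    fill = sum(len(p) for p in pages) / (len(pages) * capacity)
    return splits, fill


n = 10_000
seq_splits, seq_fill = simulate_inserts(range(n))                       # identity-style keys
rand_splits, rand_fill = simulate_inserts(uuid.uuid4().int for _ in range(n))  # random GUIDs

print(f"sequential: {seq_splits} mid-page splits, {seq_fill:.0%} average page fill")
print(f"random:     {rand_splits} mid-page splits, {rand_fill:.0%} average page fill")
```

In this model the ever-increasing keys never split a page and leave every page full, while the random keys keep splitting pages and leave them partly empty, which is the wasted space and extra I/O the paragraph above describes.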
So all in all, if you're really concerned with your SQL Server performance, using GUIDs as a clustering key is a really bad idea - use INT IDENTITY() instead, possibly using a GUID as the primary (non-clustered) key if you really have to.
Marc
GUIDs can simplify generating keys ahead of time, or generating keys offline, or in a cluster, without risk of collision. There may also be a slight security benefit, with all keys being unguessable.
The disadvantage is that they're harder to read/type, and on many of your tables you may later realize a need to go back and generate human-friendly keys anyway. They'll also evenly distribute your records in a table, which may make it slower to query multiple records that were inserted at around the same time vs having an autonumber key where your records are in order of time inserted.
GUIDs are big and slow compared to ints - so use them when they're needed, eschew them when they're NOT needed, it's as simple as that!
This answer does NOT preclude the idea of using INTs as a primary key. It is mainly meant to point out WHEN a guid is useful.
HERE IS A GREAT (SHORT) ARTICLE:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
Explained...
I use guids for any (common) DB entity-type which may need to be exported or shared with another DB instance. This way, I have a DNA marker (i.e. the guid) that can be used to differentiate between "like" objects of the same "physical" entity.
For example, let's pretend two database instances have a table called PROJECT. If the two projects share the same name or number, it is hard to distinguish which one is which. Using GUIDs, though, you can easily distinguish between 2 projects and where they come from... even when they have many similar values between them. This seems impossible... but it actually can and does happen.
The biggest performance hit you'll see with GUIDs as a primary/clustered key is when inserting records into large tables. It can be a heavy task to reindex, since your key will fall somewhere in the middle.
Using random GUIDs as a clustered primary key will, over time, leave the index heavily fragmented, which degrades insert and scan performance as the table grows.