Using a UUID as a primary key for small SQL tables

I've read that UUIDs are typically not recommended as a primary key due to size and performance issues on large data sets.
However, would it be detrimental at all to use it on a few of the top level organizational tables? E.g. Organization or Branch, where there are only a handful of entries?

I would recommend that you use serial instead of UUIDs. Why are integers preferable to UUIDs?
They occupy less space. This is a marginal consideration in the base table, but a bigger issue for foreign keys.
Integers are easier to read and remember.
In many databases, tables are physically ordered using primary keys. In such databases, new inserts on a UUID will almost always go "between" records, which is expensive. However, Postgres does not support clustered indexes so the underlying data is not ordered.
There are downsides to integers:
There are a finite number, although big ints pretty much solve that problem.
They encode order-of-insertion information. Actually, this can be a positive or a negative.
Other than space usage, I don't think there is much harm in using UUIDs on a static table. I strongly prefer integers, only resorting to UUIDs in situations where an integer would be difficult to calculate.
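For a handful of rows either choice works; here is a minimal Postgres sketch of the two options, with table names invented for illustration:

    -- Integer surrogate key, as recommended above:
    CREATE TABLE organization (
        organization_id bigserial PRIMARY KEY,
        name            text NOT NULL UNIQUE
    );

    -- UUID key for comparison; gen_random_uuid() is built in from
    -- Postgres 13 onward (earlier versions need the pgcrypto extension):
    CREATE TABLE branch (
        branch_id       uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        organization_id bigint NOT NULL REFERENCES organization
    );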

Related

How best to update row where composite primary key values change

We have numerous tables with composite keys made up of MULTIPLE columns. In some cases as many as SIX columns make up the primary key of a table that is not super large, maybe a few thousand entries, and is not accessed very heavily.
A better solution to this would be to use a primary key that is a single auto-incremented ID field and in order to make sure the combination of the six different fields now used as a primary key are unique you can create an index with a unique constraint. The performance might not be quite so good, but the code complexity would be DRASTICALLY reduced.
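Sketched in T-SQL, that suggestion would look something like this (the table and its six columns are invented for illustration):

    -- One narrow surrogate primary key...
    CREATE TABLE RateSchedule (
        rate_schedule_id INT IDENTITY(1,1) PRIMARY KEY,
        region_code      CHAR(2) NOT NULL,
        product_code     CHAR(4) NOT NULL,
        customer_type    TINYINT NOT NULL,
        valid_from       DATE    NOT NULL,
        valid_to         DATE    NOT NULL,
        version_no       INT     NOT NULL,
        -- ...plus a unique constraint that still guarantees the
        -- six-column combination cannot repeat:
        CONSTRAINT UQ_RateSchedule_NaturalKey
            UNIQUE (region_code, product_code, customer_type,
                    valid_from, valid_to, version_no)
    );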
I was told that making the primary key so complex is necessary because the primary key is the only clustered index on a table and that this enhances the performance. I can understand how this would help, but is it THAT big of a performance enhancement? It seems to be a premature optimization.
Is it common practice to use composite primary keys? I understand that if you had a very large table, with many thousands of entries, and that was hit constantly, then even a small performance enhancement could be worth adding the complexity I am seeing.
It also seems like having a primary key made up of values that can be updated/changed is just asking for trouble. If other tables are referencing it couldn't that lead to issues?
This would mostly be for adding new tables, since changing the structure of the existing tables could be too drastic a change for them to accept. But I want to know if I am out of line before trying to push back against this practice.
Generally, using many columns to form a PRIMARY KEY is the worst practice I find regularly in my database audits. In fact, it was the norm in the hierarchical database model dating from the '50s... and it was dropped due to poor performance!
The relational model says that a key can be any column or group of columns, but database experts and practitioners have all demonstrated that the best way to get performance that scales is a key that is a single column, with a datatype that is:
the shortest in terms of bytes
the largest in terms of range of values
asemantic (carrying no business meaning)
monotonically increasing
The only way to satisfy all of these considerations is a PRIMARY KEY with an auto-increment datatype such as IDENTITY or a SEQUENCE.
Every other datatype, or other way of doing it, has some extra overhead or performs poorly.
In the case of a PK with compound columns, the optimizer's statistics are accurate only for the first column of the key. Accurate statistics for the combination of multiple columns do not exist (except for a strict-equality match on the complete set of key values, which of course always equals 1), which leads the optimizer to fall back on an average of the overall selectivity or, worse, to compute a correlated cardinality. In both cases the execution plan will be of poor quality and sometimes catastrophic...
For MS SQL Server, a clustered index is the best choice for the PK, but only if all the specifications I wrote above are strictly applied.
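A minimal T-SQL illustration of those four criteria (table and sequence names are hypothetical):

    -- Short (4 bytes), huge range, no business meaning, monotonic:
    CREATE TABLE OrderLine (
        order_line_id INT IDENTITY(1,1) NOT NULL,
        line_total    DECIMAL(10,2) NOT NULL,
        CONSTRAINT PK_OrderLine PRIMARY KEY CLUSTERED (order_line_id)
    );

    -- The same idea using a SEQUENCE (SQL Server 2012+) instead of IDENTITY:
    CREATE SEQUENCE seq_order_line AS INT START WITH 1 INCREMENT BY 1;
    CREATE TABLE OrderLine2 (
        order_line_id INT NOT NULL
            DEFAULT (NEXT VALUE FOR seq_order_line),
        line_total    DECIMAL(10,2) NOT NULL,
        CONSTRAINT PK_OrderLine2 PRIMARY KEY CLUSTERED (order_line_id)
    );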

Modeling a database: many small tables or not?

I have a database where some information is repeated in several tables.
I want to know whether it is worth creating a table to hold this information, and putting only its id in the other tables.
This appeals to me because it removes the redundancy, but then I will have to do many joins between my tables in my queries, and I'm afraid my queries will be slower.
(I work with Symfony, if that changes anything.)
It sounds like the 'information' in question is data that makes up key values. If so, it sounds like the database designer likes to use natural keys and that you prefer to use surrogate keys.
First, these are both merely a question of style. If the natural key values are composite (i.e. involve more than one column) and are included in other tables for data integrity purposes, then they are not redundant.
Second, as you have observed, when it comes to the performance of surrogate keys you have to weigh the advantage of the more efficient data type (e.g. a single integer column) against the degraded performance of needing to write more JOINs. Note that using surrogates tends to make constraints more troublesome to write: e.g. when the required values for a rule are in another table and your SQL product doesn't support subqueries in CHECK constraints, then you will need to use a trigger (sketched below), which degrades performance in a high-activity environment.
Further, consider that performance is not the only consideration: using natural key values tends to make the data more readable and therefore the schema easier to maintain, because the physical model reflects the logical model more closely (surrogate keys do not appear in the logical model at all).
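A hedged T-SQL sketch of such a trigger, with all names invented: a rule whose required values live in another table, and which therefore cannot be a CHECK constraint when subqueries aren't supported there.

    -- Reject any order whose total exceeds the customer's credit limit:
    CREATE TRIGGER trg_CustomerOrder_CreditLimit
    ON CustomerOrder
    AFTER INSERT, UPDATE
    AS
    BEGIN
        IF EXISTS (
            SELECT 1
            FROM inserted i
            JOIN Customer c ON c.customer_id = i.customer_id
            WHERE i.order_total > c.credit_limit
        )
        BEGIN
            RAISERROR('Order total exceeds customer credit limit.', 16, 1);
            ROLLBACK TRANSACTION;
        END
    END;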
You're talking about Normalisation. As with so many design aspects it's a trade-off.
Having duplication within the database leads to many problems - for example how to keep those duplicates in step when updating data. So Inserts and Updates may well go more slowly because of the duplication. Hence we tend to normalise the database to avoid such duplication. That does lead to more complex queries and possibly some retrieval overhead.
Modern database products tend to do such queries really well if you take a bit of care to have the right indexes in place.
Hence my starting position would be to normalise your data and avoid duplication. Then, in a special case, perhaps denormalise just the pieces where it really becomes essential. For example, suppose some part of your database is large and mostly queried rather than updated (e.g. historic order information); then perhaps denormalise that bit.
It is not a question of style.
The answer is, as the seeker has already identified, removal of duplication; Normalisation. Pull them all into one table, and place a Foreign Key wherever they are used.
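A minimal sketch of that step (names invented): the repeated values move into one table, and everything else carries only the key.

    -- The repeated information lives in exactly one place...
    CREATE TABLE city (
        city_id   INT PRIMARY KEY,
        city_name VARCHAR(100) NOT NULL UNIQUE
    );

    -- ...and each table that used to repeat it stores only the key:
    CREATE TABLE customer (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        city_id     INT NOT NULL REFERENCES city (city_id)
    );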
Now an Integer FK may be "tidy", but any good, short, fixed length key will do. Variable length keys are very bad for performance, as the key needs to be unpacked every time the index is searched.
The nature of a Normalised database is more, smaller tables, which is much faster than an Unnormalised data heap, with fewer, larger tables. Get used to it.
As long as you are Joining on keys, Joins do not cost anything in themselves; ten joins to construct a row do not cost more than five. The cost is in the table sizes; the indices used; the distribution; the datatypes of the index columns; etc. Relational dbms are heavily engineered for Normalised databases.
If you need to do lookups of lookups, then that is the way it is. Just ensure that the tables are Normalised.
If you don't normalise:
How are you going to store values that could potentially be used?
How are you going to separate "Lookup value" from "Look up value" from "LookUpValue", etc.?
You'll be slower, because you are storing the variable-length string "Lookup value" across many rows, rather than a nice tidy integer key.
These are the more practical points to add to the other two answers...

Which is the most common ID type in SQL Server databases, and which is better?

Is it better to use a Guid (UniqueIdentifier) as your Primary/Surrogate Key column, or a serialized "identity" integer column; and why is it better? Under which circumstances would you choose one over the other?
I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep apart the primary key, which is a logical construct - it uniquely identifies your rows, and it has to be unique, stable, and NOT NULL. A GUID works well for a primary key, too - since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct used for the physical ordering of the data, and it is a lot more difficult to get right. Typically, the Queen of Indexing on SQL Server, Kimberly Tripp, also requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
A GUID is a really bad choice for a clustering key, since it's wide and totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
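A hedged T-SQL sketch of that arrangement (names invented):

    -- GUID primary key for logical identification (e.g. replication),
    -- but NONCLUSTERED so it does not order the data...
    CREATE TABLE Customer (
        row_id      INT IDENTITY(1,1) NOT NULL,
        customer_id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
        name        NVARCHAR(200) NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (customer_id)
    );
    -- ...while a narrow, ever-increasing INT IDENTITY is the clustering key:
    CREATE UNIQUE CLUSTERED INDEX CIX_Customer_RowId ON Customer (row_id);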
Use a GUID in a replicated system where you need to guarantee uniqueness.
Use ints where you have a non-replicated database and you want to maximise performance.
Very seldom use a GUID.
Rather, use a primary key/surrogate key for storage purposes.
This will also make it easier for humans to interact with the data.
Creating indexes will be a lot more efficient too.
See
How Using GUIDs in SQL Server Affects Index Performance
Performance Effects of Using GUIDs as Primary Keys
When considering using integers, be sure to allow for the maximum possible value that might occur. You often end up with skipped numbers because of deletions, so the actual maximum ID might be much larger than the total number of records in the table.
For example, if you aren't sure that a 32-bit integer will do, use a 64-bit integer.
You might also find these other SO discussions useful:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Picking the best primary key + numbering system.
And if you search here in SO for "primary key", you'll find those and a lot more very useful discussions.
There's no single answer to this. The issues that people are quick to jump on with Guids (their random nature, combined with the default behavior of the primary key also acting as the clustering key) can be easily mitigated. Guids have a larger range than integers do, but as you start to fill that range with values you increase your risk of a collision.
Guids can be very useful when you have a distributed system (for example, replicated databases) where a non-trivial amount of work would have to go into a key generation mechanism that wouldn't cause collisions between the portions of the system. Likewise, integers are useful because they're simple to use (every language has an integral type; not every language has a Guid type) and can be sequential (Guids can be, too, but that's not their intended use).
It's all about what you're storing and how. The people that say "never use Guids!" are just spreading FUD, but Guids also aren't the answer to every problem.
I believe it is almost always a serialized identity integer, but some will disagree. It does depend on the situation.
The reasons for identity is efficiency and simplicity. It's smaller. More easily indexed. It makes a great clustered index. Less fragmentation as new records are kept in order. Great for indexes on joins. Easier when eyeballing records in a db.
There is definitely a place for Guids in certain circumstances: when merging disparate data, or when records have to be created in certain places. Guids should be in your bag of tricks, but usually they will not be your first choice.
This is an oft-debated topic, but I tend to lean more towards identities for a couple of reasons. Firstly, an integer is only 4 bytes vs. a 16-byte GUID. This means narrower indexes and more efficient queries. Secondly, we make use of @@IDENTITY and SCOPE_IDENTITY() a lot in stored procs, etc., which goes out the window with GUIDs.
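For example (a sketch with an invented table, assuming it has an identity primary key):

    INSERT INTO Widget (name) VALUES ('gizmo');
    -- Returns the identity value generated by that insert, in this scope:
    SELECT SCOPE_IDENTITY() AS new_widget_id;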
Here's a nice little article by Jeff Atwood.
Use a GUID if you think you'll ever need to use the data outside the database (i.e. in other databases). Some would argue that is always the case, but it's a judgment call.

Why not always use GUIDs instead of Integer IDs?

What are the disadvantages of using GUIDs?
Why not always use them by default?
Integers join a lot faster, for one. This is especially important when dealing with millions of rows.
For two, GUIDs take up more space than integers do. Again, very important when dealing with millions of rows.
For three, GUIDs sometimes take different formats that can cause hiccups in applications, etc. An integer is an integer, through and through.
A more in-depth look can be found here, and on Jeff's blog.
GUIDs are four times larger than an int, and twice as large as a bigint.
GUIDs are really difficult to look at if you are trying to troubleshoot tables.
GUIDs are great from a programmer's perspective - they're guaranteed to be (almost) unique, so why not use them everywhere, right?
If you look at it from the DBA perspective and from the database standpoint, at least for SQL Server, there are a few things to consider:
GUIDs as primary key (which is responsible for uniquely identifying a single row in your table) might be okay - after all, they're unique, right?
however, SQL Server also has the concept of the clustering key, which physically orders the data in your table; if you don't know about this, and don't do anything explicitly, your primary key becomes your clustering key.
Kimberly Tripp - a world-renowned expert on SQL Server indexing and performance - has a great many blog posts on why a GUID as your clustering key is a really bad idea - check out her blog on indexes.
Most notably, her best practices for a clustering key are:
narrow
static
unique
ever-increasing
GUIDs are typically static and unique - but they're neither narrow (16 bytes vs. 4 bytes for an INT) nor ever-increasing. Due to their nature, they're unique and (pseudo-)random.
The narrow part is important because the clustering key will be added to each and every index page for each and every non-clustered index on your table - and if you have a few of those, and a few million rows in your table, this amounts to a massive waste of space - and not just on your disk, but also in your SQL Server's RAM.
The ever-increasing part is important, because the randomness of the GUIDs causes a lot of fragmentation in your indices, which negatively affects your performance all around. Even the newsequentialid() of SQL Server 2005 and up doesn't really create fully sequential GUIDs - they're sequential for a while, and then there's a jump again, causing fragmentation (less than totally random GUIDs, but still).
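For reference, a hedged sketch of how newsequentialid() is used (names invented); it is only allowed as a column default:

    CREATE TABLE Event (
        event_id UNIQUEIDENTIFIER NOT NULL
            DEFAULT NEWSEQUENTIALID(),  -- roughly sequential, so it reduces
                                        -- (but does not eliminate) page splits
        payload  NVARCHAR(MAX) NULL,
        CONSTRAINT PK_Event PRIMARY KEY CLUSTERED (event_id)
    );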
So all in all, if you're really concerned with your SQL Server performance, using GUIDs as a clustering key is a really bad idea - use INT IDENTITY() instead, possibly using a GUID as the primary (non-clustered) key if you really have to.
Marc
GUIDs can simplify generating keys ahead of time, or generating keys offline or in a cluster, without risk of collision. There may also be a slight security benefit, with all keys being unguessable.
The disadvantage is that they're harder to read/type, and on many of your tables you may later realize a need to go back and generate human-friendly keys anyway. They'll also evenly distribute your records in a table, which may make it slower to query multiple records that were inserted at around the same time vs. having an autonumber key where your records are in order of time inserted.
GUIDs are big and slow compared to ints - so use them when they're needed, eschew them when they're NOT needed, it's as simple as that!
This answer does NOT preclude the idea of using INTs as a primary key. It is mainly meant to point out WHEN a guid is useful.
HERE IS A GREAT (SHORT) ARTICLE:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
Explained...
I use guids for any (common) DB entity-type which may need to be exported or shared with another DB instance. This way, I have a DNA marker (i.e. the guid) that can be used to differentiate between "like" objects of the same "physical" entity.
For example, let's pretend two database instances have a table called PROJECT. If the two projects share the same name or number, it is hard to distinguish which one is which. Using GUIDs, though, you can easily distinguish between the two projects and see where they come from... even when they have many similar values between them. This seems impossible... but it actually can and does happen.
The biggest performance hit you'll see with GUIDs as a primary/clustered key is when inserting records into large tables. Reindexing can be a heavy task, since each new key will fall somewhere in the middle of the index.
Using random GUIDs as a primary key will also leave the table and its indexes increasingly fragmented over time, which can degrade performance badly if the indexes are never rebuilt.

Database design: why use an autoincremental field as primary key?

Here is my question: why should I use auto-incremental fields as primary keys on my tables instead of something like UUID values?
What are the main advantages of one over another? What are the problems and strengths of them?
Simple numbers consume less space. UUID values consume 128 bits each. Working with numbers is also simpler. For most practical purposes, 32-bit or 64-bit integers can serve well as the primary key. 2^64 is a very large number.
Consuming less space doesn't just save hard disk space. It means faster backups, better performance in joins, and having more real data cached in the database server's memory.
You don't have to use auto-incrementing primary keys, but I do. Here's why.
First, if you're using ints, they're smaller than UUIDs.
Second, it's much easier to query using ints than UUIDs, especially if your primary keys turn up as foreign keys in other tables.
Also, consider the code you'll write in any data access layer. A lot of my constructors take a single id as an int. It's clean, and in a type-safe language like C# - any problems are caught at compile time.
Drawbacks of autoincrementers? Potentially running out of space. I have a table which is at 200M on its id field at the moment. It'll bust the 2 billion limit in a year if I leave it as is.
You could also argue that an autoincrementing id has no intrinsic meaning, but then the same is true of a UUID.
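If running out of range is a realistic risk, starting with a 64-bit key avoids it entirely. A sketch using standard identity syntax (supported by e.g. Postgres 10+; names invented):

    CREATE TABLE page_view (
        page_view_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
        viewed_at    TIMESTAMP NOT NULL
    );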
I guess by UUID you mean something like a GUID? GUIDs are better when you will later have to merge tables. For example, if you have local databases spread around the world, they can each generate unique GUIDs for row identifiers. Later, the data can be combined into a single database and the ids shouldn't conflict. With an autoincrement, in this case you would have to have a composite key where the other half of the key identifies the originating location, or you would have to modify the IDs as you imported data into the master database.