Deterministic Surrogate Key

Deterministic Surrogate Key - sql

I have a table with an integer identity column as a surrogate key for two other columns (int and datetime). To keep the value of this key in sync across test and production environments, I had the idea to make a trigger that sets the surrogate key to some deterministic value instead of the auto-generated identity (in this case a hash of the natural key). The tradeoff, as far as I can tell, is that I introduce the risk of collisions (which can be offset by changing the surrogate column to bigint).
CREATE TRIGGER dbo.TRG_TestTable_SetID ON dbo.TestTable
INSTEAD OF INSERT
AS
BEGIN
insert into dbo.TestTable (ID, IntKey, DateKey, MoreData)
select convert(bigint, hashbytes('md5', convert(binary(4), IntKey) + convert(binary(8), DateKey))),
IntKey, DateKey, MoreData
from inserted
END
Is this a good solution from a design standpoint? Will it still perform better than using the natural composite key as the primary key?
Edit: The int in the natural key is a foreign key to another table, where it is the surrogate key for a guid and a varchar. So the "natural key" alternative on this table would be the rather ugly composite of guid, varchar, and datetime.

I have used similar techniques before for similar reasons and with good success. To get the deterministic qualities that you want, you might try coercing the the composite natural key column values to strings, string-concatenating them together, and then generating an MD5 hash from that to use as your deterministic primary key.
Some considerations:
Case-sensitivity. Unless some of your business keys are meant by design to be case-sensitive, it is a good idea to establish a convention in your system to downcase or upcase letters first as 'a' is not the same as 'A' to a hash function. This can help avoid issues if you are creating a key from possibly manually user keyed data. For example, if a user keyed in item number 'itm009876' instead of 'ITM009876', and your various source systems aren't robust enough to conform the value before storing them.
String coercion: Make sure that you coerce values into strings in a way that makes sense and is very specific. For example, using ISO dates and date times plus time zone, or converting dates and date times to Unix timestamp integers before coercing to string
String delimiter. Use a good string separater between the strings before concatenation, such as ';'. (E.g., string concatenation of A+CB should not be the same as AB+C)
Store hash as binary: If possible store the MD5 hash as a 16-byte binary value on the table, and use a HEX() function to display it in a human readable format. Storing an MD5 hash as binary uses exactly half of the amount of space it would take to store a 32 byte hexadecimal string, which has advantages for performance of lookups and joins because it is both shorter and completely avoids any possible cycles wasted on special string comparison logic.
Pros
May avoid accidental duplication of row data at times
May avoid unnecessary round trips to single authority that must generate or retrieve serial or UUID surrogate keys.
Single column keys are easier for end users work with.
Single column keys are easier for downstream developers writing SQL, generating urls, etc to work with.
MD5 is old and well established so it's very well supported as an SQL function by most DBMS, so you can use compute them there too as needed without and third party extensions.
With MD5, collisions are extremely rare. As in more likely that your data center gets destroyed by a meteor than to experience a collision, even with hundreds of billions of rows and a single table. There is quite a bit of robust discussion about this online if you Google for one popular methodology that employs hash keys: 'data vault hash keys'.
Cons
Collisions are of course still theoretically possible with MD5. Many organizations are still very hesitant about this. So if you must have more bytes on the hash space, and you can live with the potential performance hit during joins and index updates, you can always choose a longer SHA hash.
Generation is complicated. You must choose and document the algorithm for generating the composite key hashes very well and communicate well with other developers in the organization. Just make sure that everyone is doing it the same way.
because of the non-sequential nature of hashes, they can be inefficient to query in some scenarios, such as in clustered index tables. Be careful with this as some DBMS's use clustered index as the default - or may not even have any other option - such as MYSQL's InnoDB. Heap tables are generally better, such as is supported/default in PostgreSQL and Microsoft SQL Server.
(Sorry for any typos and grammar mistakes. I am writing this on my mobile phone. I'll try to come back later and clean it up.)

Related

Using a UUID as a primary key for small SQL tables

I've read that UUIDs are typically not recommended as a primary key due to size and performance issues on large data sets.
However, would it be detrimental at all to use it on a few of the top level organizational tables? E.g. Organization or Branch, where there are only a handful of entries?

I would recommend that you use serial instead of UUIDs. Why are integers preferable to UUIDs?
They occupy less space. This is a marginal consideration in the base table, but a bigger issue for foreign keys.
Integers are easier to read and remember.
In many databases, tables are physically ordered using primary keys. In such databases, new inserts on a UUID will almost always go "between" records, which is expensive. However, Postgres does not support clustered indexes so the underlying data is not ordered.
There are downsides to integers:
There are a finite number, although big ints pretty much solve that problem.
They encode order-of-insertion information. Actually, this can be a positive or a negative.
Other than space usage, I don't think there is much harm in using UUIDs on a static table. I strongly prefer integers, only resorting to UUIDs in situations where an integer would be difficult to calculate.

Generally, are string (or varchar) fields used as join fields?

We have two tables. The first contains a name (varchar) field. The second contains a field that references the name field from the first table. This foreign key in the second table will be repeated for every row associated with that name. Is it generally discouraged to use a varchar/string field as a join between two tables? When is the best case where a string field can be used as a join field?

It's certainly possible to use a varchar as a key field (or simply something to join on). The main problems with it are based on what you normally store in a varchar field; mutable data. Strictly speaking, it's not advisable to have key fields change. A person's name, telephone number, even their SSN can all change. However, the employee with internal ID 3 will always be ID 3, even if there are two John Smiths.
Second, string comparison is dependent on a number of nit-picky details, such as culture, collation, whitespace translation, etc. that can break a join for no immediately-apparent reason. Say you use a tabspace character \t for a certain string you're joining on. Later, you change your software to replace \t with 3 spaces to reduce character escapes in your raw strings. You have now broken any functionality requiring a string with escaped tabs to be matched to an identical-looking, but differently-composed, string.
Lastly, even given two perfectly identical strings, there is a slight performance benefit to comparing two integer numbers than comparing two strings. Integer comparison is effectively constant-time. String comparison is linear at best, based on the length of the string.

Is it generally discouraged to use a varchar/string field as a join between two tables?
If there's a natural key to be used (extremely rare in real life, but state/province abbreviations are a good example), then VARCHAR fields are fine.
When is the best case where a string field can be used as a join field?
Depends on the database because of the bits allocated to the data type, but generally VARCHAR(4) or less takes around the same amount of space (less the less number of characters) as INT would.

Generally speaking you shouldn't use anything that is editable by the end users as a FK as an edit would require not one update, but one update per table which references that key.
Everyone else has already mentioned the potenetial performance implications of a query, but the update cost is also worth noting. I strongly suggest the use of a generated key instead.

If you're concerned about performance, the best way to know is to create tables that implement your potential design choices, then load them up with massive amounts of data to see what happens.
In theory, very small strings should perform as well as a number in joins. In practice, it would definitely depend upon the database, indexing, and other implementation choices.

In a relational database, you shouldn't use a string in one table that references the same string in another table. If the second table is a look-up, create an identity column for the table, and then reference the integer value in the first. When displaying the data, use a join to the second table. Just make sure in the second table you never actually delete records.
The only exception would be if you are creating an archive table where you want to store exactly what was chosen at a given time.

Sometimes a join will happen on fields that are not "join fields", because that's just the nature of the query (e.g. most ways of identifying records that are duplicates in a particular column). If the query you want relates to those values, then that's what the join will be on, end of story.
If a field genuinely identifies a row, then it is possible to use it as a key. It's even possible to do so if it could change (it brings issues, but not insurmountable issues) as long as it remains a genuine identifier (it'll never change to a value that exists for another row).
The performance impact varies by common query and by database. By database the type of indexing strategies of some makes them better at using varchar and other textual keys than other databases (in particular, hash-indices are nice).
Common queries can be such that it becomes more performant to use varchar even without hash indices. A classic example is storing pieces of text for a multi-lingual website. Each such piece of text will have a particular languageID relating to the language it is in. However, obtaining other information about that language (it's name etc.) is rarely needed; what's much more often needed is to either filter by the RFC 5646 code, or to find out what that RFC 6546 code is. If we use a numeric id, then we will have to join for both types of query to obtain that code. If we use the code as the ID, then the most common queries concerned with the language won't need to look in the language table at all. Most queries that do care about the details of the language also won't need to do any join; pretty much the only time the key will be used as a foreign key is in maintaining referential integrity on update and insert of text or on deletion of languages. Hence while the join is less efficient when it is used the system as a whole will be more efficient by using fewer joins.

It depends on the nature of your data.
If the string is some user-entered and updated value then I would probably shy away from joining on it. You may run into consistency difficulties for storing the name in both the parent and the detail table.
Nothing has duplicate names?
I have used a string field as a join when using GUIDs or single char identifiers or when I know the string to be a natural key (though I almost always prefer a surrogate)

Natural primary keys like a zip code, phone number, email address or user name are by definition strings. There are unique and relatively short.
If you put an index on such a column there is no problem with using them a join. Impact on performance will usually be minimal.

Which is the most common ID type in SQL Server databases, and which is better?

Is it better to use a Guid (UniqueIdentifier) as your Primary/Surrogate Key column, or a serialized "identity" integer column; and why is it better? Under which circumstances would you choose one over the other?

I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep apart the primary key which is a logical construct - it uniquely identifies your rows, it has to be unique and stable and NOT NULL. A GUID works well for a primary key, too - since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case, you need an uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct is used for the physical ordering of the data, and is a lot more difficult to get right. Typically, the Queen of Indexing on SQL Server, Kimberly Tripp, also requires a good clustering key to be uniqe, stable, as narrow as possible, and ideally ever-increasing (which a INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
A GUID is a really bad choice for a clustering key, since it's wide, totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key row(s) is also stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - GUID is 16 byte vs. INT is 4 byte, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.

Use a GUID in a replicated system where you need to guarantee uniqueness.
Use ints where you have a non-replicated database and you want to maximise performance.

Very Seldomly use GUID.
Use rather a primary key/Surrogate Key for stoage purposes.
Also this will make it easier for human interaction with the data.
Creating Indexes will be a lot more efficient too.
See
How Using GUIDs in SQL Server Affect
Index Performance
Performance Effects of Using GUIDs
as Primary Keys

When considering using integers, be sure to allow for the maximum possible value that might occur. You often end up with skipped numbers because of deletions, so the actual maximum ID might be much larger than the total number of records in the table.
For example, if you aren't sure that a 32-bit integer will do, use a 64-bit integer.
You might also find these other SO discussions useful:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Picking the best primary key + numbering system.
And if you search here in SO for "primary key", you'll find those and a lot more very useful discussions.

There's no single answer to this. The issues that people are quick to jump on with Guid's (that their random nature combined with the default behavior of the primary key also acting as the clustered key) can be easily mitigated. Guids have a larger range than integers do, but as you start to fill that range with values you increase your risk of a collision.
Guid's can be very useful when you have a distributed system (for example, replicated databases) where a non-trivial amount of work would have to go into a key generation mechanism that wouldn't cause collisions between the portions of the system. Likewise, integers are useful because they're simple to use (every language has an integral type, not every language has a Guid type) and can be sequential (Guids can, too, but that's not their intended use).
It's all about what you're storing and how. The people that say "never use Guid's!" are just spreading FUD, but they also aren't the answer to every problem.

I believe it is almost always a serialized identy integer, but some will disagree. It does depend on the situation.
The reasons for identity is efficiency and simplicity. It's smaller. More easily indexed. It makes a great clustered index. Less fragmentation as new records are kept in order. Great for indexes on joins. Easier when eyeballing records in a db.
There is definately a place for Guids in certain circumstances. When merging disparate data, or when records have to be created in certain places. Guids should be in your bag of tricks but usually will not be your first choice.

This is an oft debated topic, but I tend to lean more towards identities for a couple of reasons. Firstly, an integer is only 4 bytes vs a 16 byte GUID. This means narrower indexes and more efficient queries. Secondly, we make use of ##IDENTITY and SCOPE_IDENTITY a lot in stored procs, etc which goes out the window with GUIDs.
Here's a nice little article by Jeff Atwood.

Use a GUID if you think you'll ever need to use the data outside the database, i.e. other databases). Some would argue, that is always the case, but it's a judgment call.

Why not always use GUIDs instead of Integer IDs?

What are the disadvantages of using GUIDs?
Why not always use them by default?

Integers join a lot faster, for one. This is especially important when dealing with millions of rows.
For two, GUIDs take up more space than integers do. Again, very important when dealing with millions of rows.
For three, GUIDs sometimes take different formats that can cause hiccups in applications, etc. An integer is an integer, through and through.
A more in depth look can be found here, and on Jeff's blog.

GUIDs are four times larger than an int, and twice as large as a bigint.
GUIDs are really difficult to look at if you are trying to troubleshoot tables.

GUIDs are great from a programmer's perspective - they're guaranteed to be (almost) unique, so why not use them everywhere, right?
If you look at it from the DBA perspective and from the database standpoint, at least for SQL Server, there are a few things to consider:
GUIDs as primary key (which is responsible for uniquely identifying a single row in your table) might be okay - after all, they're unique, right?
however, SQL Server also has the concept of the clustering key, which physically orders the data in your table; if you don't know about this, and don't do anything explicitly, your primary key becomes your clustering key.
Kimberly Tripp - world-known expert on SQL Server indexing and performance - has a great many blog posts on why a GUID as your clustering key is a really bad idea - check out her blog on indexes.
Most notably, her best practices for a clustering key are:
narrow
static
unique
ever-increasing
GUIDs are typically static and unique - but they're neither narrow (16 byte vs. 4 byte for a INT) nor ever-increasing. Due to their nature, they're unique and (pseudo-)random.
The narrow part is important because the clustering key will be added to each and every index page for each and every non-clustered index on your table - and if you have a few of those, and a few million rows in your table, this amounts to a massive waste of space - and not just on your disk, but also in your SQL Server's RAM.
The ever-increasing part is immportant, because the randomness of the GUIDs causes a lot of fragmentation in your indices, which negatively affects your performance all around. Even the newsequentialid() of SQL Server 2005 and up doesn't really create sequential GUIDs all around - they're sequential for a while and then there's a jump again, causing fragmentation (less than totally random GUIDs, but still).
So all in all, if you're really concerned with your SQL Server performance, using GUIDs as a clustering key is a really bad idea - use INT IDENTITY() instead, possibly using a GUID as the primary (non-clustered) key if you really have to.
Marc

GUIDS can simplify generating keys ahead of time, or generating keys offline, or in a cluster, without risk of collision. There may also be a slight security benefit, with all keys being unguessable.
The disadvantage is that it's harder to read/type and on many of your tables you may later realize a need to go back and generate human-friendly keys anyways. They'll also evenly distribute your records in a table, which may make it slower to query multiple records that were inserted at around the same time vs having an autonumber key where your records are in order of time inserted.

GUIDs are big and slow compared to ints - so use them when they're needed, eschew them when they're NOT needed, it's as simple as that!

This answer does NOT preclude the idea of using INT's as a primary key. It is mainly meant to point-out WHEN a guid is useful.
HERE IS A GREAT (SHORT) ARTICLE:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
Explained...
I use guids for any (common) DB entity-type which may need to be exported or shared with another DB instance. This way, I have a DNA marker (i.e. the guid) that can be used to differentiate between "like" objects of the same "physical" entity.
For example, let's pretend two database instances have a table called PROJECT. If the two projects share the same name or number it is hard to distinguish which one is which. Using GUID's though you can easily distinguish between 2 projects and where they come from...even when they have many similar values between them. This seems impossible...but actually can and does happen.

The biggest performance hit you'll see with GUIDs as a primary/clustered key is inserting records in large tables. It can be a heavy task to reindex since your key will fall somewhere in the middle

Using GUIDs as a primary key will eventually lead to your database crashing because the drive becomes too fragmented. This is a condition known as thrashing.

Database design: why use an autoincremental field as primary key?

here is my question: why should I use autoincremental fields as primary keys on my tables instead of something like UUID values?
What are the main advantages of one over another? What are the problems and strengths of them?

Simple numbers consume less space. UUID values take consume 128 bits each. Working with numbers is also simpler. For most practical purposes 32-bit or 64-bit integers can server well as the primary key. 264 is a very large number.
Consuming less space doesn't just save hard disk space In means faster backups, better performance in joins, and having more real stuff cached in the database server memory.

You don't have to use auto-incrementing primary keys, but I do. Here's why.
First, if you're using int's, they're smaller than UUIDs.
Second, it's much easier to query using ints than UUIDs, especially if your primary keys turn up as foreign keys in other tables.
Also, consider the code you'll write in any data access layer. A lot of my constructors take a single id as an int. It's clean, and in a type-safe language like C# - any problems are caught at compile time.
Drawbacks of autoincrementers? Potentially running out of space. I have a table which is at 200M on it's id field at the moment. It'll bust the 2 billion limit in a year if I leave as is.
You could also argue that an autoincrementing id has no intrinsic meaning, but then the same is true of a UUID.

I guess by UUID you mean like a GUID? GUIDs are better when you will later have to merge tables. For example, if you have local databases spread around the world, they can each generate unique GUIDs for row identifiers. Later the data can be combined into a single database and the id's shouldn't conflict. With an autoincrement in this case you would have to have a composite key where the other half of the key identifies the originating location, or you would have to modify the IDs as you imported data into the master database.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas