Related
I am designing a database to contain a table reference, with a column type that is one of several predefined values (e.g., book, movie, magazine, etc.). I intend the range of possible values to expand over time (e.g. if I realize that I missed the academic_paper type, I want to be able to put that in).
The easiest solution would seem to be to simply store a string representing the type into the table. But this sounds like it would result in a lot of wasted space.
The other solution I thought of is creating a new table reference_types, which the type column references in its foreign key. This seems to have the added benefit of ensuring valid foreign keys (so that I won't accidentally mistype a "magzine" somewhere in my code), possible allow for faster queries for all media of a certain type (since integer comparisons should be much faster than string comparisons), but also slow my application down a bit as joins would be required whenever I need the reference type, and probably complicate logic because of those extra joins.
What are your thoughts on schema design for this problem?
Your second solution is the correct one. Create a secondary table to store your reference types and link them using a foreign key.
For further reading on this subject the search term you'd want to use is 'database normalisation'.
Create the reference_types table. And in your references table use integer and also add a reference_type_name field.
You can query the references table to get the integer key and print its name when needed without performing a join to the other table, and still use that table to perfom other operations, just keep both tables with equal type names.
I know it sonds redundant, but it's really the fastest way to do a simple query by int key and have it all together.
It depends, if you will want to add some other information to reference types, then use the second approach. If not, use the first one because it's faster and the information stored is only a string (you can always select unique to retrieve your types). Read this article for more info.
I want to know if I can use human readable primary keys for a relatively small number of database objects, which will describe large metropolitan areas.
For example, using "washington_dc" as the pk for the Washington, DC metro area, or "nyc" for the New York City one.
Tons of objects will be foreign keyed to these metro area objects, and I'd like to be able to tell where a person or business is located just by looking at their database record.
I'm just worried because my gut tells me this might be a serious crime against good practices.
So, am I "allowed" to do this kind of thing?
Thanks!
It all depends on the application - natural primary keys make a good deal of sense on the surface, since they are human readable and don't require any joins when displaying data to end users.
However, natural primary keys tend to be larger than INT (or even BIGINT) suragate primary keys and there are very few domains where there isn't some danger of having a natural primary key change. To take your example, a city changing its name is not a terribly uncommon occurrence. When a city's name changes you are then left with either an update that needs to touch every instance of city as a foreign key or with a primary key that no longer reflects reality ("The data shows Leningrad, but it really is St. Petersburg.")
So in sum, natural primary keys:
Take up more disc space (most of the time)
Are more susceptible to change (in the majority of cases)
Are more human readable (as long as they don't change)
Whether #1 and #2 are sufficiently counteracted by #3 depends on what you are building and what its use is.
I think that this question
What are the design criteria for primary keys?
gives a really good overview of the tradeoffs you might be making. I think the answer given is the correct one, but its brevity belies some significant thinking you actually have to do to work out what's right for you.
(From that answer)
The criteria for consideration of a primary key are:
Uniqueness
Irreducibility (no subset of the key uniquely identifies a row in the table)
Simplicity (so that relational representation & manipulation can be simpler)
Stability (should not be altered frequently)
Familiarity (meaningful to the user)
For what it's worth, the small number of times I've had problems with scaling by choosing strings as the primary key is about the same as the number of time's I've had problems with redundant data using an autoincrement key. The problems that arise with autoincrement keys are worse, in my opinion, because you don't usually see them as soon.
A primary key must be unique and immutable, a human-readable string can be used as a PK so long as it meets both of those requirements.
In the example you've given, it sounds fine, given that cities don't change their names (and in the rare event they do then you can change the PK value with enough effort).
One of the main reasons you'd use numeric PKs instead of strings is performance (the other being to take advantage of automatically-incrementing IDs, see IDENTITY). If you anticipate more than a hundred queries per second on your textual PK then I would move to use int or bigint as a PK type. When you reach that level of database size and complexity you tend to stop using SSMS to edit table data directly and use your own tools, which would presumably perform a JOIN so you'd get the city name in the same resultset as the city's numeric PK.
you are allowed.
it is generally not the best practice.
numeric - auto incrementing keys are preferred. they are easily maintained and allow for coding of input forms and other interfaces where the user does not have to think up a new string as a key...
imagine: should it be washington, or washington_dc or dc or washingtondc.. etc.
I found this reading material on choosing a primary key.
Is there a guide / blog post on how to choose the primary key for a given table?
Should I use a auto-incremented/generated key, or should I base the primary key on the data being modeled (assuming it has a truly unique field)?
Should the primary key always be long for performance's sake, or can I take an external unique id as primary key, even if it's a string?
I believe that in practice using a natural key is rarely better than a surrogate key.
The following are the main disadvantages of using a natural key as the primary key:
You might have an incorrect key value, or you may simply want to rename a key value. To edit it, you would have to update all the tables that would be using it as a foreign key.
It is often difficult to have a truly unique natural key.
Natural keys are often strings. An index on an numeric field will be much more compact than one on a string field.
There is no hard rule on what the data type of the primary key should be. A numeric key normally performs better, but you could use a string, especially if the table is not big, and the tables that reference it are not big either.
A key is a set of attributes with two fundamental features: uniqueness and minimality. Minimality means the key has only the minimum number of attributes required to ensure uniqueness.
There are three criteria commonly applied as a guide to choosing a good key:
Familiarity - keys should be meaningful and familiar to the people who use them
Simplicity - keys should be as simple and concise as possible
Stability - key values should change infrequently
These are good guidelines but are not absolute requirements. In all cases functional requirements and the needs of data integrity should determine what keys to use.
I use surrogate keys, often referred to as non-sensical keys, made up of an autogenerated int/bigint datatype.
Here are some of the reasons I like using these keys.
When deleting several items from a list (such as old email) you can supply a comma separated list of integers instead of guids or natural keys
I find it makes writing your own cascade deletes easier
I think inner-joins are faster on integer fields
It can make learning a new system without documentation easier to understand.
Here are a couple of blog posts about primary keys:
http://www.mysqlperformanceblog.com/2006/10/03/long-primary-key-for-innodb-tables/
http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
I have worked with a lot of different data models in professional systems (mostly bank software) and there were different solutions. There was the GUID solution I have seen and it seemed not to have impacted the performances too much. I have seen the "number provided by a service as a system wide unique number". I have seen algorithms of providing something like a GUID "but shorter". I have seen also that the business key was used (like the account number) which is poor design and caused problems and I would not recommend it. I have seen the auto-incremented key for each table.
What did I like the most? The number provided by a service as a system wide number. It works well. And with a simple key translation table one can use a user key (like an account number) to find out what unique number and what sort of data object (not necessarily the table because the same unique key may apply to several tables if a data object is split up on different tables depending on its type).
So is there a blog or something? Well I have a book to recommend called "Data Modeling Essentials" by Graeme Simsion and Graham Witt. They might not suggest my preferred solution but they give many real live examples and show the different kind of solutions that are possible.
I always choose uuid as a primary key. In comparison to int/long key, there is a slight overhead, but there are a lot of benefits: you cannot run into type overflow, you can shard database later on without changing primary keys, you can integrate with other systems and be sure that your primary keys are always unique, uuid cannot be guessed etc.
Via this link, I know that a GUID is not good as a clustered index, but it can be uniquely created anywhere. It is required for some advanced SQL Server features like replication, etc.
Is it considered bad design if I want to have a GUID column as a typical Primary Key ? Also this assumes a separate int identity column for my clustering ID, and as an added bonus a "user friendly" id?
update
After viewing your feedback, I realise I didn't really word my question right. I understand that a Guid makes a good (even if its overkill) PK, but a bad clustering index (in general). My question more directly asked, is, is it bad to add a second "int identity" column to act as the clustering index?
I was thinking that the Guid would be the PK and use it to build all relationships/joins etc. Then I would instead of using a natural key for the Cluster Index, I would add an additional "ID" that not data-specific. What I'm wondering is that bad?
If you are going to create the identity field anyway, use that as the primary key. Think about querying this data. Ints are faster for joins and much easier to specify when writing queries.
Use the GUID if you must for replication, but don't use it as a primary key.
What are you intending to accomplish with the GUID? The int identity column will also be unique within that table. Do you actually need or expect to need the ability to replicate? If so, is using a GUID actually preferable in your architecture over handling identity columns through one of the identity range mangement options?
If you like the "pretty" ids generated using the Active Record pattern, then I think I'd try to use it instead of GUIDs. If you do need replication, then use one of the replication strategies appropriate for identity columns.
Consider using only GUID, but get your GUIDs using the NEWSEQUENTIALID method (which allocates sequential values and so doesn't have the same clustering performance problems as the NEWID method).
A problem with using a secondary INT key as an index is that, if it's an adequate index, why use a GUID at all? If a GUID is necessary, how can you use an INT index instead? I'm not sure whether you need a GUID, and if so then why: are you doing replication and/or merging between multiple databases? And if you do need a GUID then you haven't specified exactly how you intend to use the non-globally-unique INT index in that scenario.
Sounds like what you are saying is that I have not made a good case for using a Guid at all, and I agree I know its overkill, but my question I guess would be is it too much overkill?
I think it's convenient to use GUID instead of INT for the primary key, if you have a use case for doing so (e.g. multiple databases) and if you can tolerate the linear, O(1) loss of performance caused simply by using a bigger (16-byte) key (which results in there being fewer index instances per page of memory).
The bigger worry is the way in which using a (random) GUID could affect performance when it's used for clustering. To counter-act that:
Either, use something else (e.g. one of the record's natural keys) as the clustered index, even if you still use a GUID for the primary key
Or, let the clustered index be the same field as the GUID primary key, but use NewSequentialId() instead of NewId() to allocate the GUID values.
is it bad to insert an additional artifical "id" for clustering, since I'm not sure I'll have a good natural ID candidate for clustering?
I don't understand why you wouldn't prefer to instead use just the GUID with NewSequentialId(), which is I think is provided for exactly this reason.
Using a GUID is lazy -- i.e., the DBA can't be bothered to model his data properly. Also it offers very bad join performance -- typically (16-byte type with poor locality).
Is it a bad design, if I want to have a GUID column as my typical Primary Key, and a separate, int identity column for my clustering ID, and as an added bonus a "user friendly" id?
Yes it is very bad -- firstly you don't want more than one "artificial" candidate key for your table. Secondly, if you want a user friendly id to use as keys just use a fixed length type such as char[8] or binary(8) -- preferably binary as the sort won't use the locale; you could use 16-byte types however you will notice a deterioration in performance -- however not as bad as GUID's. You can use these fixed types to build your own user-friendly allocation scheme that preserves some locality but generates sensible and meaningful id's.
As an Example:
If you are writing some sort of a CRM system (lets say online insurance quotes) and you want an extremely user friendly type for example a insurance quote reference (QR) that looks like so "AD CAR MT 122299432".
In this case -- since the quote length huge -- I would create a separate LUT/Symboltable to resolve the quote reference to the actual identifier used. but I will divorce this LUT from the rest of the model, I will never use the quote reference anywhere else in the model, especially not in the table representing the QR's.
Create Table QRLut
{
bigint bigint_id;
char(32) QR;
}
Now if my model has one table that represents the QR and 20 other tables featuring the bigint QR as a foreign key -- the fact that a bigint is used will allow my DB to scale well -- the wider the join predicates the more contention is caused on the memory bus -- and the amount of contention on the memory bus determines how well your CPU's can be saturated (multiple CPU's).
You might think with this example that you could just place the user-friendly QR in the table that actually represents the quote, however keep in mind that SQL server gathers statistics on tables and indices, and you don't want to let the server make caching decisions based on the user-friendly QR -- since it is huge and wastefull.
I think it is bad design to do it that way but I don't know if it is bad otherwise. Remember, SQLServer automatically assigns the clustered index to the Primary key. You would have to remove it after making the GUID the primary key. Also, you usually want your identity column to be your primary key. So doing what you are saying would confuse anyone who reads your code that doesn't look closely. I would suggest you make the ID column your primary key, identity column, and put the clustered index on it. Then make your GUID column a unique key, making it a non-clustered index and not allowing nulls. That in affect will do what you want but will follow more of the standard.
Personally, I would go this way:
An internally known identity field for
your PK (one that isn't known to the
end-user because they will inevitably
want to control it somehow). A
user-friendly "ID" that is unique with
respect to some business rule
(enforced either in your app code or
as a constraint). A GUID in the
future if it's ever deemed necessary
(like if it's required for
replication).
Now with respect to the clustered index, which you may or may not be confused about, consider this guide from MS for SQL Server 2000.
You are right that GUIDs make good object identifiers, which are implemented in a database as primary keys. Additionally, you are right that primary keys do not need to be the clustered indices.
GUIDs share the same characteristics for clustered indexes as INT IDENTITY columns, provided that the GUIDs are sequential. There is a NewSequentialID specific to SQL Server, but there is also a generic algorithm for creating them called COMB GUID, based on combining the current datetime with random bytes in a way that retains a large degree of randomness while retaining sequentiality.
One thing to keep in mind, if you intend to use NHibernate at some point, is that NHibernate natively knows how to use the COMB GUID strategy - and NHibernate can even use it to do batch-inserts, something that cannot be done with INT IDENTITY or NewSequentialID. If you are inserting multiple objects with NHibernate, then it will be faster to use the COMB GUID strategy than either of the other two methods.
It is not bad design at all, an int Identity for your clustering key gives you a number of good benefits (Narrow,Unique,Ascending) whilst keeping the GUID for functionality purposes very separate and acting as your primary key.
If anything I would suggest you have the right approach, although the "user friendly" ID is the most questionable part - as in what purpose is it there to serve.
Addendum : I should put in the obligatory link to (possibly?) the most read article about the topic by Kimberley Tripp. http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx
When I am creating a new database table, what factors should I take into account for selecting the primary key's data type?
Sorry to do that, but I found that the answers I gave to related questions (you can check this and this) could apply to this one. I reshaped them a little bit...
You will find many posts dealing with this issue, and each choice you'll make has its pros and cons. Arguments for these usually refer to relational database theory and database performance.
On this subject, my point is very simple: surrogate primary keys ALWAYS work, while Natural keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons: field too short, rules change, etc.
To this point, you've guessed here that I am basically a member of the uniqueIdentifier/surrogate primary key team, and even if I appreciate and understand arguments such as the ones presented here, I am still looking for the case where "natural" key is better than surrogate ...
In addition to this, one of the most important but always forgotten arguments in favor of this basic rule is related to code normalization and productivity:
each time I create a table, shall I lose time
identifying its primary key and its physical characteristics (type, size)
remembering these characteristics each time I want to refer to it in my code?
explaining my PK choice to other developers in the team?
My answer is no to all of these questions:
I have no time to lose trying to identify "the best Natural Primary Key" when the surrogate option gives me a bullet-proof solution.
I do not want to remember that the Primary Key of my Table_whatever is a 10 characters long string when I write the code.
I don't want to lose my time negotiating the Natural Key length: "well if You need 10 why don't you take 12 to be on the safe side?". This "on the safe side" argument really annoys me: If you want to stay on the safe side, it means that you are really not far from the unsafe side! Choose surrogate: it's bullet-proof!
So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_MyTable' which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, where a field combination offers a very acceptable Primary Key, I prefer to create this 'id_myManyToManyTable' field being a uniqueIdentifier, just to stick to the rule, and because, finally, it does not hurt.
The major advantage is that you don't have to care anymore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.
And if you still want to have your "Natural Key" somewhere in your table, I advise you to build it following a standard model such as
Tbl_whatever
id_whatever, unique identifier, primary key
code_whatever, whateverTypeYouWant(whateverLengthYouEstimateTheRightOne), indexed
.....
Where id_ is the prefix for primary key, and code_ is used for "natural" indexed field. Some would argue that the code_ field should be set as unique. This is true, and it can be easily managed either through DDL or external code. Note that many "natural" keys are calculated (invoice numbers), so they are already generated through code
I am not sure that my rule is the best one. But it is a very efficient one! If everyone was applying it, we would for example avoid time lost answering to this kind of question!
If using a numeric key, make sure the datatype is giong to be large enough to hold the number of rows you might expect the table to grow to.
If using a guid, does the extra space needed to store the guid need to be considered? Will coding against guid PKs be a pain for developers or users of the application.
If using composite keys, are you sure that the combined columns will always be unique?
I don't really like what they teach in school, that is using a 'natural key' (for example ISBN on a bookdatabase) or even having a primary key made up off 2 or more fields. I would never do that. So here's my little advice:
Always have one dedicated column in every table for your primary key.
They all should have the same colomn name across all tables, i.e. "ID" or "GUID"
Use GUIDs when you can (if you don't need performance), otherwise incrementing INTs
EDIT:
Okay, I think I need to explain my choices a little bit.
Having a dedicated column namend the same across all table for you primary key, just makes your SQL-Statements a lot of easier to construct and easier for someone else (who might not be familiar with your database layout) easier to understand. Especially when you're doing lots of JOINS and things like that. You won't need to look up what's the primary key for a specific table, you already know, because it's the same everywhere.
GUIDs vs. INTs doesn't really matters that much most of the time. Unless you hit the performance cap of GUIDs or doing database merges, you won't have major issues with one or another. BUT there's a reason I prefer GUIDs. The global uniqueness of GUIDs might always come in handy some day. Maybe you don't see a need for it now, but things like, synchronizing parts of the database to a laptop / cell phone or even finding datarecords without needing to know which table they're in, are great examples of the advantages GUIDs can provide. An Integer only identifies a record within the context of one table, whereas a GUID identifies a record everywhere.
In most cases I use an identity int primary key, unless the scenario requires a lot of replication, in which case I may opt for a GUID.
I (almost) never used meaningful keys.
Unless you have an ultra-convenient natural key available, always use a synthetic (a.k.a. surrogate) key of a numeric type. Even if you do have a natural key available, you might want to consider using a synthetic key anyway and placing an additional unique index on your natural key. Consider what happened to higher-ed databases that used social security numbers as PKs when federal law changed, the costs of changing over to synthetic keys were enormous.
Also, I have to disagree with the practice of naming every primary key the same, e.g. "id". This makes queries harder to understand, not easier. Primary keys should be named after the table. For example employee.employee_id, affiliate.affiliate_id, user.user_id, and so on.
Do not use a floating point numeric type, since floating point numbers cannot be properly compared for equality.
Where do you generate it? Incrementing number's don't fit well for keys generated by the client.
Do you want a data-dependent or independent key (sometimes you could use an ID from business data, can't say if this is always useful or not)?
How well can this type be indexed by your DB?
I have used uniqueidentifiers (GUIDs) or incrementing integers so far.
Cheers
Matthias
Numbers that have meaning in the real world are usually a bad idea, because every so often the real world changes the rules about how those numbers are used, in particular to allow duplicates, and then you've got a real mess on your hands.
I'm partial to using an generated integer key. If you expect the database to grow very large, you can go with bigint.
Some people like to use guids. The pro there is that you can merge multiple instances of the database without altering any keys but the con is that performance can be affected.
For a "natural" key, whatever datatype suits the column(s). Artifical (surrogate) keys are usually integers.
It all depends.
a) Are you fine having unique sequential numeric numbers as your primary key? If yes, then selecting UniqueIdentifier as your primary key will suffice.
b) If your business demand is such that you need to have alpha numeric primary key, then you got to go for varchar or nvarchar.
These are the two options I could think of.
A great factor is how much data you're going to store. I work for a web analytics company, and we have LOADS of data. So a GUID primary key on our pageviews table would kill us, due to the size.
A rule of thumb: For high performance, you should be able to store your entire index in memory. Guids could easily break this!
Use natural keys when they can be trusted. Some sources of natural keys can't be trusted. Years ago, the Social Security Administration used to occasionally mess up an assign the same SSN to two different people. Theyv'e probably fixed that by now.
You can probably trust VINs for vehicles, and ISBNs for books (but not for pamphlets, which may not have an ISBN).
If you use natural keys, the natural key will determine the datatype.
If you can't trust any natural keys, create a synthetic key. I prefer integers for this purpose. Leave enough room for reasonable expansion.
I usually go with a GUID column primary key for all tables (rowguid in mssql). What could be natural keys I make unique constraints. A typical example would be a produkt identification number that the user have to make up and ensure that is unique. If I need a sequence, like in a invoice i build a table to keep a lastnumber and a stored procedure to ensure serialized access. Or a Sequence in Oracle :-) I hate the "social security number" sample for natural keys as that number will never be alway awailable in a registration process. Resulting in a need for a scheme to generate dummy numbers.
I usually always use an integer, but here's an interesting perspective.
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
Whenever possible, try to use a primary key that is a natural key. For instance, if I had a table where I logged one record every day, the logdate would be a good primary key. Otherwise, if there is no natural key, just use int. If you think you will use more than 2 billion rows, use a bigint. Some people like to use GUIDs, which works well, as they are unique, and you will never run out of space. However, they are needlessly long, and hard to type in if you are just doing adhoc queries.