Related
I have a database structure as follow:
tbl_products
-pk id (AutoIncrement)
-name
-description
tbl_tags (1) OR tbl_tags (2)
-pk name -pk id (AutoIncrement)
-name (unique)
tbl_products_tags
-fk product_id
-fk tag_id
I have seen most choose data structures tbl_tags (2). I want to ask whether i could choose tbl_tags(1) since name is always unique, so i want to to make it primary. Does it have any downside ?
If you make the tag name unique, you have to think about what you'll do if a name needs to be changed. For example, if I want to change "tag" to "tags".
If this is a primary key, then all the child records that refer to "tag" will also have to be updated so the constraint is valid. If you have a lot of rows referring to a given name, running this change is likely to be slow and introduce some blocking/contention into your application. Whereas if you use a surrogate primary key, you only have to update the unique name field, not all the child rows as well.
If you're certain that you'll never update a tag name then you could use it as the primary key. Beware of changing requirements however!
Natural keys generally make sense when using codes that are issued and managed by an external source (e.g. airport, currency and country codes). In these cases you can be sure that the natural key won't change and is guaranteed to be unique within the domain.
My understanding is there would be a marginal performance penalty to tbl_tags (1) in the context of a very large dataset when compared to option 2. In smaller datasets, probably not so much. The machine can process integers much more efficiently than strings.
In the bigger picture though, with modern processor speeds, the difference between the two might be negligable in all but the largest datasets.
Of course, I am speaking about relational databases here. The various flavors of NoSQL are a different animal.
Also, there is the matter of consistency. The other tables in your database all seem to be using (what I assume to be) an auto-incrementing integer ID. For that reason, I would use it on the tags table as well.
The use of auto-incrementing integer PK fields vs "Natural Keys" in designing a database is a long-standing debate. My understanding is academics largely prefer the "Natural Keys" concept, while in practice some form of generated unique key tends to be the norm.
Personally, I prefer to create generated keys which have no meaning to the end user, integers where possible. Unless I have missed something, index performance is significantly enhanced.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 12 years ago.
Almost every table in every database I've seen in my 7 years of development experience has an auto-incrementing primary key. Why is this? If I have a table of U.S. states where each state where each state must have a unique name, what's the use of an auto-incrementing primary key? Why not just use the state name as the primary key? Seems to me like an excuse to allow duplicates disguised as unique rows.
This seems plainly obvious to me, but then again, no one else seems to be arriving at and acting on the same logical conclusion as me, so I must assume there's a good chance I'm wrong.
Is there any real, practical reason we need to use auto-incrementing keys?
This question has been asked numerous times on SO and has been the subject of much debate over the years amongst (and between) developers and DBAs.
Let me start by saying that the premise of you question implies that one approach is universally superior to the other ... this is rarely the case in real life. Surrogate keys and natural keys both have their uses and challenges - and it's important to understand what they are. Whichever choice you make in your system, keep in mind there is benefit to consistency - it makes the data model easier to understand and easier to develop queries and applications for. I also want to say that I tend to prefer surrogate keys over natural keys for PKs ... but that doesn't mean that natural keys can't sometimes be useful in that role.
It is important to realize that surrogate and natural keys are NOT mutually exclusive - and in many cases they can complement each other. Keep in mind that a "key" for a database table is simply something that uniquely identifies a record (row). It's entirely possible for a single row to have multiple keys representing the different categories of constraints that make a record unique.
A primary key, on the other hand, is a particular unique key that the database will use to enforce referential integrity and to represent a foreign key in other tables. There can only be a single primary key for any table. The essential quality of a primary key is that it be 100% unique and non-NULL. A desirable quality of a primary key is that it be stable (unchanging). While mutable primary keys are possible - they cause many problems for database that are better avoided (cascading updates, RI failures, etc). If you do choose to use a surrogate primary key for your table(s) - you should also consider creating unique constraints to reflect the existence of any natural keys.
Surrogate keys are beneficial in cases where:
Natural keys are not stable (values may change over time)
Natural keys are large or unwieldy (multiple columns or long values)
Natural keys can change over time (columns added/removed over time)
By providing a short, stable, unique value for every row, we can reduce the size of the database, improve its performance, and reduce the volatility of dependent tables which store foreign keys. There's also the benefit of key polymorphism, which I'll get to later.
In some instances, using natural keys to express relationships between tables can be problematic. For instance, imagine you had a PERSON table whose natural key was {LAST_NAME, FIRST_NAME, SSN}. What happens if you have some other table GRANT_PROPOSAL in which you need to store a reference to a Proposer, Reviewer, Approver, and Authorizer. You now need 12 columns to express this information. You also need to come up with a naming convention of some kind to identify which columns belong to which kind of individual. But what if your PERSON table required 6, or 8, or 24 columns to for a natural key? This rapidly becomes unmanageable. Surrogate keys resolve such problems by divorcing the semantics (meaning) of a key from its use as an identifier.
Let's also take a look at the example you described in your question.
Should the 2-character abbreviation of a state be used as the primary key of that table.
On the surface, it looks like the abbreviation field meets the requirements of a good primary key. It's relatively short, it is easy to propagate as a foreign key, it looks stable. Unfortunately, you don't control the set of abbreviations ... the postal service does. And here's an interesting fact: in 1973 the USPS changed the abbreviation of Nebraska from NB to NE to minimize confusion with New Brunswick, Canada. The moral of the story is that natural keys are often outside of the control of the database ... and they can change over time. Even when you think they cannot. This problem is even more pronounced for more complicated data like people, or products, etc. As businesses evolve, the definitions for what makes such entities unique can change. And this can create significant problems for data modelers and application developers.
Earlier I mentioned that primary keys can support key polymorphism. What does that mean? Well, polymorphism is the ability of one type, A, to appear as and be used like another type, B. In databases, this concept refers to the ability to combine keys from different classes of entities into a single table. Let's look at an example. Imagine for a moment that you want have an audit trail in your system that identifies which entities were modified by which user on what date. It would be nice to create a table with the fields: {ENTITY_ID, USER_ID, EDIT_DATE}. Unfortunately, using natural keys, different entities have different keys. So now we need to create a separate linking table for each kind of entity ... and build our application in a manner where it understand the different kinds of entities and how their keys are shaped.
Don't get me wrong. I'm not advocating that surrogate keys should ALWAYS be used. In the real world never, ever, and always are a dangerous position to adopt. One of the biggest drawbacks of surrogate keys is that they can result in tables that have foreign keys consisting of lots of "meaningless" numbers. This can make it cumbersome to interpret the meaning of a record since you have to join or lookup records from other tables to get a complete picture. It also can make a distributed database deployment more complicated, as assigning unique incrementing numbers across servers isn't always possible (although most modern database like Oracle and SQLServer mitigate this via sequence replication).
No.
In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, so can SSN's and other items of information.
In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.
A primary key must be:
unique (100% guaranteed! Not just "almost" unique)
NON NULL
A primary key should be:
stable if ever possible (not change - or at least not too frequently)
State two-letter codes definitely would offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, a two-letter CHAR(2) column just the same. I would not ever use a VARCHAR(100) field or something like that as a key - it's just too clunky, most likely will change all the time - not a good key candidate.
So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since no naturally occuring data is really up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.
I think the use of the word "Primary", in the phrase "Primary" Key is in a real sense, misleading.
First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table,
Then, having any key serves several often mutually inconsistent purposes.
Purpose 1. To use as joins conditions to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables)
Purpose 2. (related) Ensuring that child records must have a parent record in the parent table (The child table FK must exist as Key in the parent table)
Purpose 3. To increase performance of queries that need to rapidly locate a specific record/row in the table.
Purpose 4. (Most Important from data consistency perspective!) To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted itno the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)
Clearly, any non-meaningfull, non-natural key (like a GUID or an auto-generated integer is totally incapable of satisfying Purpose 4.
But often, with many (most) tables, a totally natural key which can provide #4 will often consist of multiple attributes and be excessively wide, or so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequencecs.
The answer is simple. Use both. Use a simple auto-Generating integral key for all Joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) have an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is a PK, to avoid inconsistent duplciate data...
The only time you can get away without both is for a completely stand alone table that participates in no relationships with other tables and has an obvious and reliable natural key.
In general, a numeric primary key will perform better than a string. You can additionaly create unique keys to prevent duplicates from creeping in. That way you get the assurance of no duplicates, but you also get the performance of numbers (vs. strings in your scenario).
In all likelyhood, the major databases have some performance optimizations for integer-based primary keys that are not present for string-based primary keys. But, that is only a reasonable guess.
Yes, in my opinion every table needs an auto incrementing integer key because it makes both JOINs and (especially) front-end programming much, much, much easier. Others feel differently, but this is over 20 years of experience speaking.
The single exception is small "code" or "lookup" tables in which I'm willing to substitute a short (4 or 5 character) TEXT code value. I do this because the I often use a lot of these in my databases and it allows me to present a meaningful display to the user without having to look up the description in the lookup table or JOIN it into a result set. Your example of a States table would fit in this category.
No, absolutely not.
Having a primary key which can't change is a good idea (UPDATE is legal for primary key columns, but in general potentially confusing and can create problems for child rows). But if your application has some other candidate which is more suitable than an auto-incrementing value, then you should probably use that instead.
Performance-wise, in general fewer columns are better, and particularly fewer indexes. If you have another column which has a unique index on it AND can never be changed by any business process, then it may be a suitable primary key.
Speaking from a MySQL (Innodb) perspective, it's also a good idea to use a "real" column as a primary key rather than an "artificial" one, as InnoDB always clusters the primary key and includes it in secondary indexes (that is how it finds the rows in them). This gives it potential to do useful optimisation with a primary key which it can't with any other unique index. MSSQL users often choose to cluster the primary key, but it can also cluster a different unique index.
EDIT:
But if it's a small database and you don't really care about performance or size too much, adding an unnecessary auto-increment column isn't that bad.
A non auto-incrementing value (e.g. UUID, or some other string generated according to your own algorithm) may be useful for distributed, sharded, or diverse systems where maintaining a consistent auto-incrementing ID is difficult (or impossible - think of a distributed system which continues to insert rows on both sides of a network partition).
I think there are two things that may explain the reason why auto-incrementing keys are sometimes used:
Space consideration; ok your state name doesn't amount to much, but the space it takes may add up. If you really want to store the state with its name as a primary key, then go ahead, but it will take more place. That may not be a problem in certain cases, and it sounds like a problem of olden days, but the habit is perhaps ingrained. And we programmers and DBA do love habits :D
Defensive consideration: i recently had the following problem; we have users in the database where the email is the key to all identification. Why not make the email the promary key? except suddenly border cases creep in where one guy must be there twice to have two different adresses, and nobody talked about it in the specs so the adress is not normalized, and there's this situation where two different emails must point to the same person and... After a while, you stop pulling your hairs out and add the damn integer id column
I'm not saying it's a bad habit, nor a good one; i'm sure good systems can be designed around reasonable primary keys, but these two points lead me to believe fear and habit are two among the culprits
It's a key component of relational databases. Having an integer relate to a state instead of having the whole state name saves a bunch of space in your database! Imagine you have a million records referencing your state table. Do you want to use 4 bytes for a number on each of those records or do you want to use a whole crapload of bytes for each state name?
Here are some practical considerations.
Most modern ORMs (rails, django, hibernate, etc.) work best when there is a single integer column as the primary key.
Additionally, having a standard naming convention (e.g. id as primary key and table_name_id for foreign keys) makes identifying keys easier.
Via this link, I know that a GUID is not good as a clustered index, but it can be uniquely created anywhere. It is required for some advanced SQL Server features like replication, etc.
Is it considered bad design if I want to have a GUID column as a typical Primary Key ? Also this assumes a separate int identity column for my clustering ID, and as an added bonus a "user friendly" id?
update
After viewing your feedback, I realise I didn't really word my question right. I understand that a Guid makes a good (even if its overkill) PK, but a bad clustering index (in general). My question more directly asked, is, is it bad to add a second "int identity" column to act as the clustering index?
I was thinking that the Guid would be the PK and use it to build all relationships/joins etc. Then I would instead of using a natural key for the Cluster Index, I would add an additional "ID" that not data-specific. What I'm wondering is that bad?
If you are going to create the identity field anyway, use that as the primary key. Think about querying this data. Ints are faster for joins and much easier to specify when writing queries.
Use the GUID if you must for replication, but don't use it as a primary key.
What are you intending to accomplish with the GUID? The int identity column will also be unique within that table. Do you actually need or expect to need the ability to replicate? If so, is using a GUID actually preferable in your architecture over handling identity columns through one of the identity range mangement options?
If you like the "pretty" ids generated using the Active Record pattern, then I think I'd try to use it instead of GUIDs. If you do need replication, then use one of the replication strategies appropriate for identity columns.
Consider using only GUID, but get your GUIDs using the NEWSEQUENTIALID method (which allocates sequential values and so doesn't have the same clustering performance problems as the NEWID method).
A problem with using a secondary INT key as an index is that, if it's an adequate index, why use a GUID at all? If a GUID is necessary, how can you use an INT index instead? I'm not sure whether you need a GUID, and if so then why: are you doing replication and/or merging between multiple databases? And if you do need a GUID then you haven't specified exactly how you intend to use the non-globally-unique INT index in that scenario.
Sounds like what you are saying is that I have not made a good case for using a Guid at all, and I agree I know its overkill, but my question I guess would be is it too much overkill?
I think it's convenient to use GUID instead of INT for the primary key, if you have a use case for doing so (e.g. multiple databases) and if you can tolerate the linear, O(1) loss of performance caused simply by using a bigger (16-byte) key (which results in there being fewer index instances per page of memory).
The bigger worry is the way in which using a (random) GUID could affect performance when it's used for clustering. To counter-act that:
Either, use something else (e.g. one of the record's natural keys) as the clustered index, even if you still use a GUID for the primary key
Or, let the clustered index be the same field as the GUID primary key, but use NewSequentialId() instead of NewId() to allocate the GUID values.
is it bad to insert an additional artifical "id" for clustering, since I'm not sure I'll have a good natural ID candidate for clustering?
I don't understand why you wouldn't prefer to instead use just the GUID with NewSequentialId(), which is I think is provided for exactly this reason.
Using a GUID is lazy -- i.e., the DBA can't be bothered to model his data properly. Also it offers very bad join performance -- typically (16-byte type with poor locality).
Is it a bad design, if I want to have a GUID column as my typical Primary Key, and a separate, int identity column for my clustering ID, and as an added bonus a "user friendly" id?
Yes it is very bad -- firstly you don't want more than one "artificial" candidate key for your table. Secondly, if you want a user friendly id to use as keys just use a fixed length type such as char[8] or binary(8) -- preferably binary as the sort won't use the locale; you could use 16-byte types however you will notice a deterioration in performance -- however not as bad as GUID's. You can use these fixed types to build your own user-friendly allocation scheme that preserves some locality but generates sensible and meaningful id's.
As an Example:
If you are writing some sort of a CRM system (lets say online insurance quotes) and you want an extremely user friendly type for example a insurance quote reference (QR) that looks like so "AD CAR MT 122299432".
In this case -- since the quote length huge -- I would create a separate LUT/Symboltable to resolve the quote reference to the actual identifier used. but I will divorce this LUT from the rest of the model, I will never use the quote reference anywhere else in the model, especially not in the table representing the QR's.
Create Table QRLut
{
bigint bigint_id;
char(32) QR;
}
Now if my model has one table that represents the QR and 20 other tables featuring the bigint QR as a foreign key -- the fact that a bigint is used will allow my DB to scale well -- the wider the join predicates the more contention is caused on the memory bus -- and the amount of contention on the memory bus determines how well your CPU's can be saturated (multiple CPU's).
You might think with this example that you could just place the user-friendly QR in the table that actually represents the quote, however keep in mind that SQL server gathers statistics on tables and indices, and you don't want to let the server make caching decisions based on the user-friendly QR -- since it is huge and wastefull.
I think it is bad design to do it that way but I don't know if it is bad otherwise. Remember, SQLServer automatically assigns the clustered index to the Primary key. You would have to remove it after making the GUID the primary key. Also, you usually want your identity column to be your primary key. So doing what you are saying would confuse anyone who reads your code that doesn't look closely. I would suggest you make the ID column your primary key, identity column, and put the clustered index on it. Then make your GUID column a unique key, making it a non-clustered index and not allowing nulls. That in affect will do what you want but will follow more of the standard.
Personally, I would go this way:
An internally known identity field for
your PK (one that isn't known to the
end-user because they will inevitably
want to control it somehow). A
user-friendly "ID" that is unique with
respect to some business rule
(enforced either in your app code or
as a constraint). A GUID in the
future if it's ever deemed necessary
(like if it's required for
replication).
Now with respect to the clustered index, which you may or may not be confused about, consider this guide from MS for SQL Server 2000.
You are right that GUIDs make good object identifiers, which are implemented in a database as primary keys. Additionally, you are right that primary keys do not need to be the clustered indices.
GUIDs share the same characteristics for clustered indexes as INT IDENTITY columns, provided that the GUIDs are sequential. There is a NewSequentialID specific to SQL Server, but there is also a generic algorithm for creating them called COMB GUID, based on combining the current datetime with random bytes in a way that retains a large degree of randomness while retaining sequentiality.
One thing to keep in mind, if you intend to use NHibernate at some point, is that NHibernate natively knows how to use the COMB GUID strategy - and NHibernate can even use it to do batch-inserts, something that cannot be done with INT IDENTITY or NewSequentialID. If you are inserting multiple objects with NHibernate, then it will be faster to use the COMB GUID strategy than either of the other two methods.
It is not bad design at all, an int Identity for your clustering key gives you a number of good benefits (Narrow,Unique,Ascending) whilst keeping the GUID for functionality purposes very separate and acting as your primary key.
If anything I would suggest you have the right approach, although the "user friendly" ID is the most questionable part - as in what purpose is it there to serve.
Addendum : I should put in the obligatory link to (possibly?) the most read article about the topic by Kimberley Tripp. http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx
I have read a lot of articles about whether we should have primary keys that are identity columns, but I'm still confused.
There are advantages of making columns are identity as it would give better performance in joins and provides data consistency. But there is a major drawback associated with identity ,i.e.When INSERT statement fails, still the IDENTITY value increases If a transaction is rolled back, the new IDENTITY column value isn't rolled back, so we end up with gaps in sequencing. I can use GUIDs (by using NEWSEQUENTIALID) but it reduces performance.
Gaps should not matter: the identity column is internal and not for end user usage or recognition.
GUIDs will kill performance, even sequential ones, because of the 16 byte width.
An identity column should be chosen to respect the physical implementation after modelling your data and working out what your natural keys are. That is, the chosen natural key is the logical key but you choose a surrogate key (identity) because you know how the engine works.
Or you use an ORM and let the client tail wag the database dog...
For all practical purposes, integers are ideal for primary keys and auto increment is a perfect way to generate them. As long as your PK is meaningless (surrogate) it will be protected from creativity of you customers and serve its main purpose (to identify a row in a table) just fine. Indexes are packed, joins fast as it gets, and it is easy to partition tables.
If you happen to need GUID, that's fine too; however, think auto-increment integer first.
I would like to say that depends on your needs. We use only Guids as primary keys (with default set to NewID) because we develop a distributed system with many Sql Server instances, so we have to be sure that every Sql Server generate unique primary key values.
But when using a Guid column as PK, be sure not to use it as your clustered index (thanks to marc_s for the link)
Advantage of the Guid type:
You can create unique values on different locations without synchronization
Disadvantage:
Its a large datatype (16 Bytes) and needs significant more space
It creates index fragmentation (at least when using the newid() function)
Dataconsistency is not an issue with primary keys independent of the datatype because a primary key has to be unique by definition!
I don't believe that an identity column has better join performance. At all, performance is a matter of the right indexes. A primary key is a constraint not an index.
Is your need to have a primary key of typ int with no gaps? This should'nt be a problem normally.
"yes, it KILLS performance - totally. I went from a legacy system with GUID as PK/CK and 99.5% index fragmentation on a daily basis to using INT IDENTITY - HUGE difference. Hardly any index fragmentation anymore, performance is significantly better. GUIDs as Clustering Index on your SQL Server table are BAD BAD BAD - period."
Might be true, but I see no logical reasoning according to which this leads me to conclude that GUIDs PER SE are also BAD BAD BAD.
Maybe you should consider using other types of indexes on such data. And if your dbms does not offer you a choice between several types of index, then perhaps you should consider getting yourself a better dbms.
I know this is subjective, but I'd like to know peoples opinions and hopefully some best practices that I can apply when designing sql server table structures.
I personally feel that keying a table on a fixed (max) length varchar is a no-no, because it means having to also propogate the same fixed length across any other tables that use this as a foreign key. Using an int, would avoid having to apply the same length across the board, which is bound to lead to human error, i.e. 1 table has varchar (10), and the other varchar (20).
This sounds like a nightmare to initially setup, plus means future maintaining of the tables is cumbersome too. For example, say the keyed varchar column suddenly becomes 12 chars instead of 10. You now have to go and update all the other tables, which could be a huge task years down the line.
Am I wrong? Have I missed something here? I'd like to know what others think of this and if sticking with int for primary keys is the best way to avoid maintainace nightmares.
When choosing the primary key usualy you also choose the clustered key. Them two are often confused, but you have to understand the difference.
Primary keys are logical business elements. The primary key is used by your application to identify an entity, and the discussion about primary keys is largely wether to use natural keys or surrogate key. The links go into much more detail, but the basic idea is that natural keys are derived from an existing entity property like ssn or phone number, while surrogate keys have no meaning whatsoever with regard to the business entity, like id or rowid and they are usually of type IDENTITY or some sort of uuid. My personal opinion is that surrogate keys are superior to natural keys, and the choice should be always identity values for local only applicaitons, guids for any sort of distributed data. A primary key never changes during the lifetime of the entity.
Clustered keys are the key that defines the physical storage of rows in the table. Most times they overlap with the primary key (the logical entity identifier), but that is not actually enforced nor required. When the two are different it means there is a non-clustered unique index on the table that implements the primary key. Clustered key values can actualy change during the lifetime of the row, resulting in the row being physically moved in the table to a new location. If you have to separate the primary key from the clustered key (and sometimes you do), choosing a good clustered key is significantly harder than choosing a primary key. There are two primary factors that drive your clustered key design:
The prevalent data access pattern.
The storage considerations.
Data Access Pattern. By this I understand the way the table is queried and updated. Remember that clustered keys determine the actual order of the rows in the table. For certain access patterns, some layouts make all the difference in the world in regard to query speed or to update concurency:
current vs. archive data. In many applications the data belonging to the current month is frequently accessed, while the one in the past is seldom accessed. In such cases the table design uses table partitioning by transaction date, often times using a sliding window algorithm. The current month partition is kept on filegroup located a hot fast disk, the archived old data is moved to filegroups hosted on cheaper but slower storage. Obviously in this case the clustered key (date) is not the primary key (transaction id). The separation of the two is driven by the scale requirements, as the query optimizer will be able to detect that the queries are only interested in the current partition and not even look at the historic ones.
FIFO queue style processing. In this case the table has two hot spots: the tail where inserts occur (enqueue), and the head where deletes occur (dequeue). The clustered key has to take this into account and organize the table as to physically separate the tail and head location on disk, in order to allow for concurency between enqueue and dequeue, eg. by using an enqueue order key. In pure queues this clustered key is the only key, since there is no primary key on the table (it contains messages, not entities). But most times the queue is not pure, it also acts as the storage for the entities, and the line between the queue and the table is blured. In this case there is also a primary key, which cannot be the clustered key: entities may be re-enqueued, thus changing the enqueue order clustered key value, but they cannot change the primary key value. Failure to see the separation is the primary reason why user table backed queues are so notoriously hard to get right and riddled with deadlocks: because the enqueue and dequeue occur interleaved trought the table, instead of localized at the tail and the head of the queue.
Correlated processing. When the application is well designed it will partition processing of correlated items between its worker threads. For instance a processor is designed to have 8 worker thread (say to match the 8 CPUs on the server) so the processors partition the data amongst themselves, eg. worker 1 picks up only accounts named A to E, worker 2 F to J etc. In such cases the table should be actually clustered by the account name (or by a composite key that has the leftmost position the first letter of account name), so that workers localize their queries and updates in the table. Such a table would have 8 distinct hot spots, around the area each worker concentrates at the moment, but the important thing is that they don't overlap (no blocking). This kind of design is prevalent on high throughput OLTP designs and in TPCC benchmark loads, where this kind of partitioning also reflects in the memory location of the pages loaded in the buffer pool (NUMA locality), but I digress.
Storage Considerations. The clustered key width has huge repercursions in the storage of the table. For one the key occupies space in every non-leaf page of the b-tree, so a large key will occupy more space. Second, and often more important, is that the clustered key is used as the lookup key by every non-clustred key, so every non-clustered key will have to store the full width of the clustered key for each row. This is what makes large clustered keys like varchar(256) and guids poor choices for clustered index keys.
Also the choice of the key has impact on the clustered index fragmentation, sometimes drastically affecting performance.
These two forces can sometimes be antagonistic, the data access pattern requiring a certain large clustered key which will cause storage problems. In such cases of course a balance is needed, but there is no magic formula. You measure and you test to get to the sweet spot.
So what do we make from all this? Always start with considering clustered key that is also the primary key of the form entity_id IDENTITY(1,1) NOT NULL. Separate the two and organize the table accordingly (eg. partition by date) when appropiate.
I would definitely recommend using an INT NOT NULL IDENTITY(1,1) field in each table as the
primary key.
With an IDENTITY field, you can let the database handle all the details of making sure it's really unique and all, and the INT datatype is just 4 bytes, and fixed, so it's easier and more suited to be used for the primary (and clustering) key in your table.
And you're right - INT is an INT is an INT - it will not change its size of anything, so you won't have to ever go recreate and/or update your foreign key relations.
Using a VARCHAR(10) or (20) just uses up too much space - 10 or 20 bytes instead of 4, and what a lot of folks don't know - the clustering key value will be repeated on every single index entry on every single non-clustered index on the table, so potentially, you're wasting a lot of space (not just on disk - that's cheap - but also in SQL Server's main memory). Also, since it's variable (might be 4, might be 20 chars) it's harder to SQL server to properly maintain a good index structure.
Marc
I'd agree that in general an INT (or identity) field type is the best choice in most "normal" database designs:
it requires no "algorithm" to generate the id/key/value
you have fast(er) joins and the optimizer can work a lot harder over ranges and such under the hood
you're following a defacto standard
That said, you also need to know your data. If you're going to blow through a signed 32-bit int, you need to think about unsigned. If you're going to blow through that, maybe 64-bit ints are what you want. Or maybe you need a UUID/hash to make syncing between database instances/shards easier.
Unfortunately, it depends and YMMV but I'd definitely use an int/identity unless you have a good reason not to.
Like you said, consistency is key. I personally use unsigned ints. You're not going to run out of them unless you are working with ludicrous amounts of data, and you can always know any key column needs to be that type and you never have to go looking for the right value for individual columns.
Based on going through this exercise countless times and then supporting the system with the results, there are some caveats to the blanket statement that INT is always better. In general, unless there is a reason, I would go along with that. However, in the trenches, here are some pros and cons.
INT
Use unless good reason not to do so.
GUID
Uniqueness - One example is the case where there is one way communication between remote pieces of the program and the side that needs to initiate is not the side with the database. In that case, setting a Guid on the remote side is safe where selecting an INT is not.
Uniqueness Again - A more far fetched scenario is a system where multiple customers are coexisting in separate databases and there is migration between customers like similar users using a suite of programs. If that user signs up for another program, their user record can be used there without conflict. Another scenario is if customers acquire entities from each other. If both are on the same system, they will often expect that migration to be easier. Essentially, any frequent migration between customers.
Hard to Use - Even an experienced programmer cannot remember a guid. When troubleshooting, it is often frustrating to have to copy and paste identifiers for queries, especially if the support is being done with a remote access tool. It is much easier to constantly refer to SELECT * FROM Xxx WHERE ID = 7 than SELECT * FROM Xxx WHERE ID = 'DF63F4BD-7DC1-4DEB-959B-4D19012A6306'
Indexing - using a clustered index for a guid field requires constant rearrangement of the data pages and is not as efficient to index as INTs or even short strings. It can kill performance - don't do it.
CHAR
Readability - Although conventional wisdom is that nobody should be in the database, the reality of systems is that people will have access - hopefully personnel from your organization. When those people are not savvy with join syntax, a normalized table with ints or guids is not clear without many other queries. The same normalized table with SOME string keys can be much more usable for troubleshooting. I tend to use this for the type of table where I supply the records at installation time so they do not vary. Things like StatusID on a major table is much more usable for support when the key is 'Closed' or 'Pending' than a digit. Using traditional keys in these areas can turn an easily resolved issue to something that requires developer assistance. Bottlenecks like that are bad even when caused by letting questionable personnel access to the database.
Constrain - Even if you use strings, keep them fixed length, which speeds indexing and add a constraint or foreign key to keep garbage out. Sometimes using this string can allow you to remove the lookup table and maintain the selection as a simple Enum in the code - it is still important to constrain the data going into this field.
For best performance, 99.999% of the time the primary key should be a single integer field.
Unless you require the primary key to be unique across multiple tables in a database or across multiple databases. I am assuming that you are asking about MS SQL-Server because that is how your question was tagged. In which case, consider using the GUID field instead. Though better than a varchar, the GUID field performance is not as good as an integer.
Use INT. Your points are all valid; I would prioritize as:
Ease of using SQL auto increment capabiity - why reinvent the wheel?
Managability - you don't want to have to change the key field.
Performance
Disk Space
1 & 2 require the developer's time/energy/effort. 3 & 4 you can throw hardware at.
If Joe Celko was on here, he would have some harsh words... ;-)
I want to point out that INTs as a hard and fast rule aren't always appropriate. Say you have a vehicle table with all types of cars trucks, etc. Now say you had a VehicleType table. If you wanted to get all trucks you might do this (with an INT identity seed):
SELECT V.Make, V.Model
FROM Vehicle as V
INNER JOIN VehicleType as VT
ON V.VehicleTypeID = VT.VehicleTypeID
WHERE VT.VehicleTypeName = 'Truck'
Now, with a Varchar PK on VehicleType:
SELECT Make, Model
FROM Vehicle
WHERE VehicleTypeName = 'Truck'
The code is a little cleaner and you avoid a join. Perhaps the join isn't the end of the world, but if you only have one tool in your toolbox, you're missing some opportunities for performance gains and cleaner schemas.
Just a thought. :-)
While INT is generally recommended, it really depends on your situation.
If you're concerned with maintainability, then other types are just as feasible. For example, you could use a Guid very effectively as a primary key. There's reasons for not doing this, but consistency is not one of them.
But yes, unless you have a good reason not to, an int is the simplest to use, and the least likely to cause you any problems.
With PostgreSQL I generally use the "Serial" or "BigSerial" 'data type' for generating primary keys. The values are auto incremented and I always find integers to be easy to work with. They are essentially equivalent to a MySQL integer field that is set to "auto_increment".
One should think hard about whether 32-bit range is enough for what you're doing. Twitter's status IDs were 32-bit INTs and they had trouble when they ran out.
Whether to use a BIGINT or a UUID/GUID in that situation is debatable and I'm not a hardcore database guy, but UUIDs can be stored in a fixed-length VARCHAR without worrying that you'll need to change the field size.
We have to keep in mind that the primary key of a table should not have "business logic" and it should be only an identity of the record it belongs. Following this simple rule an int and especially an identity int is a very good solution. By asking about varchar I guess that you mean using for example the "Full Name" as a key to the "people" table. But what if we want to change the name from "George Something" to "George A. Something" ? And what size will the field be ? If we change the size we have to change the size on all foreign tables too. So we should avoid logic on keys. Sometimes we can use the social ID (integer value) as key but I avoid that too. Now if a project has the prospects to scale up you should consider using Guids too (uniqueidentifier SQL type).
Keeping in mind that this is quite old a question, I still want to make the case for using varchar with surrogate keys fur future readers:
An environment with several replicated machines
Scenarios where it is required that the ID of a to be inserted row is known before it is actually inserted (i.e., the client assigns this ID, not the database)