<select> for an entity with composite keys - strategy needed - sql

So say I have database table tours (PK tour_id) holding region independent information and tours_regional_details (PK tour_id, region_id) holding region specific information.
Let's say I want to populate select control with entities from tours_regional_details table (my real scenarios are bit different, just imagine this for the sake of simplicity).
So, how would you tackle this? My gut says to concatenate the PKs into a delimited string, like "pk1|pk2" or "pk1,pk2", and use that as the value of the select control. While it works, it feels dirty and possibly needs additional validation steps before splitting the string again, which again feels dirty.
I don't want to start a composite vs single PK holy war, but might this be a bad database design decision on my part? I always believed identifying relationships and composite keys are there for a reason, but I feel tempted to alter my tables and just stuff them with auto-incremental IDs and unique constraints. I'm just not sure what kind of fresh hell that will introduce.
I am a little bit flabbergasted that I encounter this for the first time now after so many years.
EDIT: Yes, there is a table regions (PK region_id), but it is mostly irrelevant to the topic. While in some scenarios two select boxes would make sense, let's say here they don't; let's say I want only one select box and want to select from:
Dummy tour (Region 1)
Dummy tour (Region 2)
Another dummy tour (region 3)
...

Composite primary keys aren't bad database design. In an ideal world, our programming languages and UI libraries would support tuples and relations as first-class values, so you'd be able to assign a pair of values as the value of an option in your dropdown control. However, since they generally only support scalar variables, we're stuck trying to encode or reduce our identifiers.
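As a rough illustration of the "encode our identifiers" workaround, here is a minimal Python sketch. The function names, the "|" separator, and the assumption that both key parts are non-negative integers are mine, not something from the question; with string keys you would need escaping or a structured encoding instead.

```python
from html import escape
from typing import Iterable, List, Tuple

SEPARATOR = "|"  # assumed delimiter; unambiguous only because both parts are integers


def encode_key(tour_id: int, region_id: int) -> str:
    """Pack the composite key into one string usable as an option value."""
    return f"{tour_id}{SEPARATOR}{region_id}"


def decode_key(value: str) -> Tuple[int, int]:
    """Unpack an option value, rejecting anything that is not exactly two integers."""
    parts = value.split(SEPARATOR)
    if len(parts) != 2 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"malformed composite key: {value!r}")
    return int(parts[0]), int(parts[1])


def render_options(rows: Iterable[Tuple[int, int, str]]) -> List[str]:
    """Build <option> tags from (tour_id, region_id, label) rows."""
    return [
        f'<option value="{encode_key(tour_id, region_id)}">{escape(label)}</option>'
        for tour_id, region_id, label in rows
    ]
```

The decoded pair is still untrusted input, so it should go into a parameterized query rather than being concatenated into SQL.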
You can certainly add surrogate keys / autoincrement columns (and unique constraints on the natural keys where available) to every table. It's a very common pattern; most databases I've seen have at least some tables set up like this. You may be able to keep existing composite foreign keys as is, or you may want/need to change them to reference the surrogate primary keys instead.
The risk with using surrogate keys for foreign keys is that your access paths in the database become fixed. For example, let's assume tours_regional_details had a primary key tours_regional_detail_id that's referenced by a foreign key in another table. Queries against this other table would always need to join with tours_regional_details to obtain the tour_id or region_id. Natural keys allow more flexible access paths since identifiers are reused throughout the database. This becomes significant in deep hierarchies of dependent concepts. These are exactly the scenarios where opponents of composite keys complain about the "explosion" of keys, and I can at least agree that it becomes cumbersome to remember and type out joins on numerous columns when writing queries.
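To make the access-path point concrete, here is a small sketch using Python's sqlite3 module. The two bookings tables are hypothetical and exist only for this illustration; the point is that filtering by region needs a join when the child table stores the surrogate key, but not when it stores the natural key columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tours_regional_details (
    tours_regional_detail_id INTEGER PRIMARY KEY,
    tour_id   INTEGER NOT NULL,
    region_id INTEGER NOT NULL,
    UNIQUE (tour_id, region_id)
);
-- Hypothetical child table that references the surrogate key.
CREATE TABLE bookings_surrogate (
    booking_id INTEGER PRIMARY KEY,
    tours_regional_detail_id INTEGER NOT NULL
        REFERENCES tours_regional_details (tours_regional_detail_id)
);
-- Hypothetical child table that references the composite natural key.
CREATE TABLE bookings_natural (
    booking_id INTEGER PRIMARY KEY,
    tour_id   INTEGER NOT NULL,
    region_id INTEGER NOT NULL,
    FOREIGN KEY (tour_id, region_id)
        REFERENCES tours_regional_details (tour_id, region_id)
);
INSERT INTO tours_regional_details VALUES (1, 10, 1), (2, 10, 2), (3, 20, 1);
INSERT INTO bookings_surrogate VALUES (100, 1), (101, 2), (102, 3);
INSERT INTO bookings_natural VALUES (100, 10, 1), (101, 10, 2), (102, 20, 1);
""")

# Surrogate FK: region_id is only reachable through the parent table.
via_join = conn.execute("""
    SELECT b.booking_id
    FROM bookings_surrogate AS b
    JOIN tours_regional_details AS d
      ON d.tours_regional_detail_id = b.tours_regional_detail_id
    WHERE d.region_id = 1
    ORDER BY b.booking_id
""").fetchall()

# Natural FK: region_id is already a column of the child table.
direct = conn.execute("""
    SELECT booking_id
    FROM bookings_natural
    WHERE region_id = 1
    ORDER BY booking_id
""").fetchall()
```

Both queries return the same bookings; the difference is only in how many tables have to be touched to get there.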
You could duplicate the natural key columns into the referencing tables, but storing redundant information requires additional effort to maintain consistency. I often see this done for performance or convenience reasons where surrogate keys were used as foreign keys, since it allows querying a table without having to do all the joins to dereference the surrogate identifiers. In these cases, it might've been better to reference the natural key instead.
If I'm allowed to return to my ideal world, perhaps DBMSs could allow naming and storing joins.
In practice, surrogate keys help balance the complexity we have to deal with. Use them, but don't worship them.

Related

SQL: Primary key column. Artificial "Id" column vs "Natural" columns [duplicate]

When I create a relational table, there is a temptation to choose as the primary key a column whose values are unique. But for optimization and uniformity purposes I create an artificial Id column every time. If there is a column (or combination of columns) that should be unique, I create a unique index for it instead of marking it as the (composite) primary key.
Is it really a good practice always to prefer artificial "Id" column + indexes instead of natural columns for a primary key?
This is a bit of a religious debate. My personal preference is to have synthetic primary keys rather than natural primary keys but there are good arguments on both sides. Realistically, so long as you are consistent and reasonable, either approach can work well.
If you use natural keys, the two major downsides are the presence of composite keys and mutating primary key values. If you have composite primary keys, you'd obviously have to have multiple columns in each child table. That can get unwieldy from a data model perspective when there are many relationships among entities. But it can also cause grief for people developing queries-- it's awfully easy to create queries that use N-1 of N join conditions and get almost the right result. If you have natural keys, you'll also inevitably encounter a situation where the natural key value changes and you then have to ripple that change through many different entities-- that's vastly more complicated than changing a unique value in the table.
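The "N-1 of N join conditions" trap is easy to reproduce. The sketch below uses Python's sqlite3 module with two made-up tables that share a two-column key; dropping one join condition still runs without error but silently multiplies rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE detail (
    tour_id INTEGER, region_id INTEGER, title TEXT,
    PRIMARY KEY (tour_id, region_id)
);
CREATE TABLE price (
    tour_id INTEGER, region_id INTEGER, amount INTEGER,
    PRIMARY KEY (tour_id, region_id)
);
INSERT INTO detail VALUES (1, 1, 'Dummy tour (Region 1)'), (1, 2, 'Dummy tour (Region 2)');
INSERT INTO price  VALUES (1, 1, 100), (1, 2, 120);
""")

# Correct: both key columns take part in the join.
full_join = conn.execute("""
    SELECT d.title, p.amount
    FROM detail AS d
    JOIN price AS p ON p.tour_id = d.tour_id AND p.region_id = d.region_id
""").fetchall()

# Mistake: region_id was forgotten, so every region pairs with every price.
partial_join = conn.execute("""
    SELECT d.title, p.amount
    FROM detail AS d
    JOIN price AS p ON p.tour_id = d.tour_id
""").fetchall()

print(len(full_join), len(partial_join))
```

With this data the correct join yields two rows and the incomplete one yields four, which is exactly the "almost right" result described above.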
On the other hand, if you use synthetic keys, you're wasting space by adding additional columns, adding additional overhead to maintain an additional index, and you're increasing the risk that you'll get functionally duplicated results. It's awfully easy to either forget to create a unique constraint on the business key or to see that there is a non-unique index on the combination and just assume that it was a unique index. I actually just got bitten by this particular failing a couple days ago-- I had indexed the composite natural key (with a non-unique index) rather than creating a unique constraint. Dumb mistake but one that's relatively easy to make.
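The mistake described here (a plain index where a unique constraint was intended) can be shown in a few lines. This is a sketch with Python's sqlite3 module and invented table names; other engines behave the same way in principle, though error types differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Synthetic key plus a NON-unique index on the business key: duplicates slip in.
CREATE TABLE doc_loose (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    revision INTEGER NOT NULL
);
CREATE INDEX ix_doc_loose ON doc_loose (name, revision);

-- Synthetic key plus a UNIQUE constraint on the business key: duplicates are rejected.
CREATE TABLE doc_strict (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    revision INTEGER NOT NULL,
    UNIQUE (name, revision)
);
""")

for _ in range(2):
    conn.execute("INSERT INTO doc_loose (name, revision) VALUES ('spec', 1)")
loose_count = conn.execute("SELECT COUNT(*) FROM doc_loose").fetchone()[0]

conn.execute("INSERT INTO doc_strict (name, revision) VALUES ('spec', 1)")
try:
    conn.execute("INSERT INTO doc_strict (name, revision) VALUES ('spec', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The first table happily stores the same business key twice under two different ids; the second raises an integrity error on the repeat insert.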
From a query writing and naming convention standpoint, I would also tend to prefer synthetic keys because it's nice to know when you're joining tables that the primary key of A is going to be A_ID and the primary key of B is going to be B_ID. That's far more self-documenting than trying to remember that the primary key of A is the combination of A_NAME and A_REVISION_NUMBER and that the primary key of B is B_CODE.
There is little or no difference between a key enforced through a PRIMARY KEY constraint and a key enforced through a UNIQUE constraint. What's important is that you enforce ALL the keys necessary from a data integrity perspective. Usually that means at least one "natural" key (a key exposed to the users/consumers of the data and used to identify the facts about the universe of discourse) per table.
Optionally you might also want to create "technical" keys to support the application and database features rather than the end user (usually called surrogate keys). That should be very much a secondary consideration however. In the interests of simplicity (and very often performance as well) it usually makes sense only to create surrogate keys where you have identified a particular need for them and not before.
It depends on your natural columns. If they are small and steadily increasing, then they are good candidates for the primary key.
Small - the smaller the key, the more values you can fit into a single index page, and the faster your index scans will be
Steadily increasing - produces fewer index reshuffles as the table grows, improving performance.
My preference is to always use an artificial key.
First it is consistent. Anyone working on your application knows that there is a key and they can make assumptions on it. This makes it easier to understand and maintain.
I've also seen scenarios where the natural key (aka. a string from an HR system that identifies an employee) has to change during the life of the application. If you have an artificial key that links the natural id to your employee record then you only have to change that natural id in the one table. However, if that natural id is a primary key and you have it duplicated across a number of other tables as a foreign key, then you have a mess on your hands.
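A small sketch of that scenario, using Python's sqlite3 module. The employee and timesheet tables and the HR code values are invented for illustration; the point is that the changed identifier lives in exactly one row, and the child rows never need to be touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,        -- artificial key used by child tables
    hr_code     TEXT NOT NULL UNIQUE        -- natural id from the HR system
);
CREATE TABLE timesheet (
    timesheet_id INTEGER PRIMARY KEY,
    employee_id  INTEGER NOT NULL REFERENCES employee (employee_id),
    hours        INTEGER NOT NULL
);
INSERT INTO employee  VALUES (1, 'HR-0042');
INSERT INTO timesheet VALUES (1, 1, 8), (2, 1, 6);
""")

# The HR system renumbers the employee: one row changes, no child rows do.
cur = conn.execute(
    "UPDATE employee SET hr_code = ? WHERE hr_code = ?", ("EMP-0042", "HR-0042")
)
rows_touched = cur.rowcount

# Existing timesheets are still reachable under the new code.
hours = conn.execute("""
    SELECT SUM(t.hours)
    FROM timesheet AS t
    JOIN employee AS e ON e.employee_id = t.employee_id
    WHERE e.hr_code = 'EMP-0042'
""").fetchone()[0]
```

Had hr_code been the primary key, every timesheet row would also carry the old value and would need to be updated (or cascaded) along with it.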
In my humble opinion, it is always better to have an artificial Id, if I understand properly your meaning of it.
Some people would use, for instance, business-significant unique values as their table Id, but I have read on MSDN, and even in the official NHibernate documentation, that a unique value with no business meaning (an artificial Id) is preferred, though you may still want to create an index on the business value for future reference. That way, the day the company changes its nomenclature, the system will keep running flawlessly.
Yes, it is. If nothing else, one of the most important properties of the artificial primary key is opacity, which means the artificial key doesn't reflect any information beyond itself; if you use natural row contents for keys, you wind up exposing that information to things like Web interfaces, which is just a terrible idea on all manner of principle.

Does every table really need an auto-incrementing artificial primary key? [closed]

Almost every table in every database I've seen in my 7 years of development experience has an auto-incrementing primary key. Why is this? If I have a table of U.S. states where each state must have a unique name, what's the use of an auto-incrementing primary key? Why not just use the state name as the primary key? Seems to me like an excuse to allow duplicates disguised as unique rows.
This seems plainly obvious to me, but then again, no one else seems to be arriving at and acting on the same logical conclusion as me, so I must assume there's a good chance I'm wrong.
Is there any real, practical reason we need to use auto-incrementing keys?
This question has been asked numerous times on SO and has been the subject of much debate over the years amongst (and between) developers and DBAs.
Let me start by saying that the premise of your question implies that one approach is universally superior to the other ... this is rarely the case in real life. Surrogate keys and natural keys both have their uses and challenges - and it's important to understand what they are. Whichever choice you make in your system, keep in mind there is benefit to consistency - it makes the data model easier to understand and easier to develop queries and applications for. I also want to say that I tend to prefer surrogate keys over natural keys for PKs ... but that doesn't mean that natural keys can't sometimes be useful in that role.
It is important to realize that surrogate and natural keys are NOT mutually exclusive - and in many cases they can complement each other. Keep in mind that a "key" for a database table is simply something that uniquely identifies a record (row). It's entirely possible for a single row to have multiple keys representing the different categories of constraints that make a record unique.
A primary key, on the other hand, is a particular unique key that the database will use to enforce referential integrity and to represent a foreign key in other tables. There can only be a single primary key for any table. The essential quality of a primary key is that it be 100% unique and non-NULL. A desirable quality of a primary key is that it be stable (unchanging). While mutable primary keys are possible, they cause many problems for databases that are better avoided (cascading updates, RI failures, etc). If you do choose to use a surrogate primary key for your table(s), you should also consider creating unique constraints to reflect the existence of any natural keys.
Surrogate keys are beneficial in cases where:
Natural keys are not stable (values may change over time)
Natural keys are large or unwieldy (multiple columns or long values)
The definition of a natural key can change over time (columns added/removed)
By providing a short, stable, unique value for every row, we can reduce the size of the database, improve its performance, and reduce the volatility of dependent tables which store foreign keys. There's also the benefit of key polymorphism, which I'll get to later.
In some instances, using natural keys to express relationships between tables can be problematic. For instance, imagine you had a PERSON table whose natural key was {LAST_NAME, FIRST_NAME, SSN}. What happens if you have some other table GRANT_PROPOSAL in which you need to store a reference to a Proposer, Reviewer, Approver, and Authorizer? You now need 12 columns to express this information. You also need to come up with a naming convention of some kind to identify which columns belong to which kind of individual. But what if your PERSON table required 6, or 8, or 24 columns for a natural key? This rapidly becomes unmanageable. Surrogate keys resolve such problems by divorcing the semantics (meaning) of a key from its use as an identifier.
Let's also take a look at the example you described in your question.
Should the 2-character abbreviation of a state be used as the primary key of that table?
On the surface, it looks like the abbreviation field meets the requirements of a good primary key. It's relatively short, it is easy to propagate as a foreign key, and it looks stable. Unfortunately, you don't control the set of abbreviations ... the postal service does. And here's an interesting fact: in 1969 the USPS changed the abbreviation of Nebraska from NB to NE to minimize confusion with New Brunswick, Canada. The moral of the story is that natural keys are often outside of the control of the database ... and they can change over time. Even when you think they cannot. This problem is even more pronounced for more complicated data like people, or products, etc. As businesses evolve, the definitions for what makes such entities unique can change. And this can create significant problems for data modelers and application developers.
Earlier I mentioned that surrogate keys can support key polymorphism. What does that mean? Well, polymorphism is the ability of one type, A, to appear as and be used like another type, B. In databases, this concept refers to the ability to combine keys from different classes of entities into a single table. Let's look at an example. Imagine for a moment that you want to have an audit trail in your system that identifies which entities were modified by which user on what date. It would be nice to create a table with the fields: {ENTITY_ID, USER_ID, EDIT_DATE}. Unfortunately, using natural keys, different entities have different keys. So now we need to create a separate linking table for each kind of entity ... and build our application in a manner where it understands the different kinds of entities and how their keys are shaped.
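A hedged sketch of such an audit table, using Python's sqlite3 module. The person and product tables are invented, and the entity_type column is my own addition rather than part of the answer: with per-table counters the same integer can appear in several tables, so the audit row records which table the id belongs to. A database-wide sequence or GUID keys would make that column unnecessary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person  (person_id  INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- One audit table for every kind of entity, because every key has the same shape.
-- entity_type is an assumption added for this sketch (see the note above).
CREATE TABLE audit_log (
    entity_type TEXT    NOT NULL,
    entity_id   INTEGER NOT NULL,
    user_id     INTEGER NOT NULL,
    edit_date   TEXT    NOT NULL
);
INSERT INTO person  VALUES (1, 'Ada');
INSERT INTO product VALUES (1, 'Widget');
""")


def record_edit(entity_type: str, entity_id: int, user_id: int, edit_date: str) -> None:
    """Append one audit row; works unchanged for any table with an integer key."""
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
        (entity_type, entity_id, user_id, edit_date),
    )


record_edit("person", 1, 7, "2024-01-15")
record_edit("product", 1, 7, "2024-01-16")

audit_rows = conn.execute(
    "SELECT entity_type, entity_id FROM audit_log ORDER BY edit_date"
).fetchall()
```

With natural keys, record_edit would need a different signature, and a different table, for each entity whose key has a different shape.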
Don't get me wrong. I'm not advocating that surrogate keys should ALWAYS be used. In the real world, "never" and "always" are dangerous positions to adopt. One of the biggest drawbacks of surrogate keys is that they can result in tables that have foreign keys consisting of lots of "meaningless" numbers. This can make it cumbersome to interpret the meaning of a record since you have to join or look up records from other tables to get a complete picture. It also can make a distributed database deployment more complicated, as assigning unique incrementing numbers across servers isn't always possible (although most modern databases like Oracle and SQL Server mitigate this via sequence replication).
No.
In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, so can SSN's and other items of information.
In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.
A primary key must be:
unique (100% guaranteed! Not just "almost" unique)
NON NULL
A primary key should be:
stable if ever possible (not change - or at least not too frequently)
State two-letter codes definitely would offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, a two-letter CHAR(2) column just the same. I would not ever use a VARCHAR(100) field or something like that as a key - it's just too clunky, most likely will change all the time - not a good key candidate.
So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since no naturally occurring data is really up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.
I think the use of the word "Primary" in the phrase "Primary Key" is, in a real sense, misleading.
First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table,
Then, having any key serves several often mutually inconsistent purposes.
Purpose 1. To use as joins conditions to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables)
Purpose 2. (related) Ensuring that child records must have a parent record in the parent table (The child table FK must exist as Key in the parent table)
Purpose 3. To increase performance of queries that need to rapidly locate a specific record/row in the table.
Purpose 4. (Most Important from data consistency perspective!) To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted into the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)
Clearly, any non-meaningful, non-natural key (like a GUID or an auto-generated integer) is totally incapable of satisfying Purpose 4.
But often, with many (most) tables, a totally natural key which can provide #4 will consist of multiple attributes and be excessively wide, or so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequences.
The answer is simple. Use both. Use a simple auto-generating integral key for all joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) has an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is a PK, to avoid inconsistent duplicate data...
The only time you can get away without both is for a completely stand alone table that participates in no relationships with other tables and has an obvious and reliable natural key.
In general, a numeric primary key will perform better than a string. You can additionally create unique keys to prevent duplicates from creeping in. That way you get the assurance of no duplicates, but you also get the performance of numbers (vs. strings in your scenario).
In all likelihood, the major databases have some performance optimizations for integer-based primary keys that are not present for string-based primary keys. But that is only a reasonable guess.
Yes, in my opinion every table needs an auto incrementing integer key because it makes both JOINs and (especially) front-end programming much, much, much easier. Others feel differently, but this is over 20 years of experience speaking.
The single exception is small "code" or "lookup" tables in which I'm willing to substitute a short (4 or 5 character) TEXT code value. I do this because I often use a lot of these in my databases and it allows me to present a meaningful display to the user without having to look up the description in the lookup table or JOIN it into a result set. Your example of a States table would fit in this category.
No, absolutely not.
Having a primary key which can't change is a good idea (UPDATE is legal for primary key columns, but in general potentially confusing and can create problems for child rows). But if your application has some other candidate which is more suitable than an auto-incrementing value, then you should probably use that instead.
Performance-wise, in general fewer columns are better, and particularly fewer indexes. If you have another column which has a unique index on it AND can never be changed by any business process, then it may be a suitable primary key.
Speaking from a MySQL (InnoDB) perspective, it's also a good idea to use a "real" column as a primary key rather than an "artificial" one, as InnoDB always clusters the primary key and includes it in secondary indexes (that is how it finds the rows in them). This gives it the potential to do useful optimisation with a primary key which it can't with any other unique index. MSSQL users often choose to cluster the primary key, but it can also cluster a different unique index.
EDIT:
But if it's a small database and you don't really care about performance or size too much, adding an unnecessary auto-increment column isn't that bad.
A non auto-incrementing value (e.g. UUID, or some other string generated according to your own algorithm) may be useful for distributed, sharded, or diverse systems where maintaining a consistent auto-incrementing ID is difficult (or impossible - think of a distributed system which continues to insert rows on both sides of a network partition).
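As a minimal sketch of that idea in Python: each node generates its own random (version 4) UUIDs with the standard-library uuid module, with no coordination between nodes. The function name and the two "node" lists are illustrative only.

```python
import uuid


def new_row_id() -> str:
    """Generate a key locally, without asking any central counter."""
    return str(uuid.uuid4())


# Two sides of a network partition keep inserting rows independently.
node_a_ids = [new_row_id() for _ in range(1000)]
node_b_ids = [new_row_id() for _ in range(1000)]

# After the partition heals, the rows can be merged without renumbering.
merged = set(node_a_ids) | set(node_b_ids)
```

Uniqueness here is probabilistic rather than guaranteed, and random UUIDs give up the "steadily increasing" property that keeps index inserts cheap, so this is a trade-off rather than a free replacement.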
I think there are two things that may explain the reason why auto-incrementing keys are sometimes used:
Space consideration: OK, your state name doesn't amount to much, but the space it takes may add up. If you really want to store the state with its name as a primary key, then go ahead, but it will take more space. That may not be a problem in certain cases, and it sounds like a problem of olden days, but the habit is perhaps ingrained. And we programmers and DBAs do love habits :D
Defensive consideration: I recently had the following problem. We have users in the database where the email is the key to all identification. Why not make the email the primary key? Except suddenly border cases creep in where one guy must be there twice to have two different addresses, and nobody talked about it in the specs so the address is not normalized, and there's this situation where two different emails must point to the same person and... After a while, you stop pulling your hair out and add the damn integer id column.
I'm not saying it's a bad habit, nor a good one; I'm sure good systems can be designed around reasonable primary keys, but these two points lead me to believe fear and habit are two among the culprits.
It's a key component of relational databases. Having an integer relate to a state instead of having the whole state name saves a bunch of space in your database! Imagine you have a million records referencing your state table. Do you want to use 4 bytes for a number on each of those records or do you want to use a whole crapload of bytes for each state name?
Here are some practical considerations.
Most modern ORMs (rails, django, hibernate, etc.) work best when there is a single integer column as the primary key.
Additionally, having a standard naming convention (e.g. id as primary key and table_name_id for foreign keys) makes identifying keys easier.

Why do I read so many negative opinions on using composite keys?

I was working on an Access database which loved auto-numbered identifiers. Every table used them except one, which used a key made up of the first name, last name and birthdate of a person. Anyways, people started running into a lot of problems with duplicates, as tables representing relationships could hold the same relationship twice or more. I decided to get around this by implementing composite keys for the relationship tables and I haven't had a problem with duplicates since.
So I was wondering what's the deal with the bad rep of composite keys in the Access world? I guess it's slightly more difficult to write a query, but at least you don't have to put in place tons of checks every time data is entered or even edited in the front end. Are they incredibly super inefficient or something?
A composite key works fine for a single table, but when you start to create relations between tables it can get a bit much.
Consider two tables Person and Event, and a many-to-many relations between them called Appointment.
If you have a composite key in the Person table made up of the first name, last name and birth date, and a composite key in the Event table made up of place and name, you will get five fields in the Appointment table to identify the relation.
A condition to bind the relation will be quite long:
select Person.*, Event.*
from Person, Event, Appointment
where
Person.FirstName = Appointment.PersonFirstName and
Person.LastName = Appointment.PersonLastName and
Person.BirthDate = Appointment.PersonBirthDate and
Event.Place = Appointment.EventPlace and
Event.Name = Appointment.EventName
If you on the other hand have auto-numbered keys for the Person and Event tables, you only need two fields in the Appointment table to identify the relation, and the condition is a lot smaller:
select Person.*, Event.*
from Person, Event, Appointment
where
Person.Id = Appointment.PersonId and Event.Id = Appointment.EventId
If you only use pure self-written SQL to access your data, they are OK.
However, some ORMs, adapters etc. require having a single PK field to identify a record.
Also note that a composite primary key is almost invariably a natural key (there is hardly a point in creating a surrogate composite key, you can as well use a single-field one).
The most common usage of a composite primary key is a many-to-many link table.
When using natural keys, you should ensure they are inherently unique and immutable: once an entity is reflected in the model, it is always identified by the same key value, and any key value identifies only one entity.
This is not so in your case.
First, a person can change their name and even the birthdate.
Second, I can easily imagine two John Smiths born on the same day.
The former means that if a person changes their name, you will have to update it in each and every table that refers to persons; the latter means that the second John Smith will not be able to make it into your database.
For the case like yours, I would really consider adding a surrogate identifier to your model.
Unfortunately one reason for those negative opinions is probably ignorance. Too many people don't understand the concept of Candidate Keys properly. There are people who seem to think that every table needs only one key, that one key is sufficient for data integrity and that choosing that one key is all that matters.
I have often speculated that it would be a good thing to deprecate and phase out the use of the term "primary key" altogether. Doing that would focus database designers' minds on the real issue: that a table should have as many keys as are necessary to ensure the correctness of the data and that some of those keys will probably be composite. Abolishing the primary key concept would do away with all those fatuous debates about what the primary key ought to be or not be.
If your RDBMS supports them and if you use them correctly (and consistently), unique keys on the composite PK should be sufficient to avoid duplicates. In SQL Server at least, you can also create FKs against a unique key instead of the PK, which can be useful.
The advantage of a single "id" column (or surrogate key) is that it can improve performance by making for a narrower key. Since this key may be carried to indexes on that table (as a pointer back to the physical row from the index row) and other tables as a FK column that can decrease space and improve performance. A lot of it depends on the specific architecture of your RDBMS though. I'm not familiar enough with Access to comment on that unfortunately.
As Quassnoi points out, some ORMs (and other third party applications, ETL solutions, etc.) don't have the capability to handle composite keys. Other than some ORMs, though, most recent third party apps worth anything will support composite keys. ORMs have generally been a little slower to adopt them.
My personal preference for composite keys is that although a unique index can solve the problem of duplicates, I've yet to see a development shop that actually fully used them. Most developers get lazy about it. They throw on an auto-incrementing ID and move on. Then, six months down the road they pay me a lot of money to fix their duplicate data issues.
Another issue is that auto-incrementing IDs aren't generally portable. Sure, you can move them around between systems, but since they have no actual basis in the real world it's impossible to determine one given everything else about an entity. This becomes a big deal in ETL.
PKs are a pretty important thing in the data modeling world and they generally deserve more thought than "add an auto-incrementing ID" if you want your data to be consistent and clean.
Surrogate keys are also useful, but I prefer to use them when I have a known performance issue that I'm trying to deal with. Otherwise it's the classic problem of wasting time trying to solve a problem that you might not even have.
One last note... on cross-reference tables (or joining tables as some call them) it's a little silly (in my opinion) to add a surrogate key unless required by an ORM.
Composite Keys are not just composite primary keys, but composite foreign keys as well. What do I mean by that? I mean that each table that refers back to the original table needs a column for each column in the composite key.
Here's a simple example, using a generic student/class arrangement.
Person
    FirstName
    LastName
    Address
Class
    ClassName
    InstructorFirstName
    InstructorLastName
    InstructorAddress
    MeetingTime
StudentClass - a many to many join table
    StudentFirstName
    StudentLastName
    StudentAddress
    ClassName
    InstructorFirstName
    InstructorLastName
    InstructorAddress
    MeetingTime
You just went from having a 2-column many-to-many table using surrogate keys to having an 8-column many-to-many table using composite keys, because they have 3 and 5 column foreign keys. You can't really get rid of any of these fields, because then the records wouldn't be unique, since both students and instructors can have duplicate names. Heck, if you have two people from the same address with the same name, you're still in serious trouble.
Most of the answers given here don't seem to me to be given by people who work with Access on a regular basis, so I'll chime in from that perspective (though I'll be repeating what some of the others have said, just with some Access-specific comments).
I use a surrogate key only when there is no single-column candidate key. This means I have tables with surrogate PKs and with single-column natural PKs, but no composite keys (except in joins, where they are the composite of two FKs, surrogate or natural doesn't matter).
Jet/ACE clusters on the PK, and only on the PK. This has potential drawbacks and potential benefits (if you consider a random Autonumber as PK, for instance).
In my experience, the non-Null requirement for a composite PK makes most natural keys impossible without using potentially problematic default values. It likewise wrecks your unique index in Jet/ACE, so in an Access app (before 2010), you end up enforcing uniqueness in your application. Starting with A2010, table-level data macros (which work like triggers) can conceivably be used to move that logic into the database engine.
Composite keys can help you avoid joins, because they repeat data that with surrogate keys you'd have to get from the source table via a join. While joins can be expensive, it's mostly outer joins that are a performance drain, and it's only with non-required FKs that you'd get the full benefit of avoiding outer joins. But that much data repetition has always bothered me a lot, since it seems to go against everything we've ever been taught about normalization!
As I mentioned above, the only composite keys in my apps are in N:N join tables. I would never add a surrogate key to a join table except in the relatively rare case in which the join table is itself a parent to related tables (e.g., a Person/Company N:N record might have related JobTitles, i.e., multiple jobs within the same company). Rather than store the composite key in the child table, you'd store the surrogate key. I'd likely not make the surrogate key the PK, though -- I'd keep the composite PK on the pair of FK values. I would just add an Autonumber with a unique index for joining to the child table(s).
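A rough sketch of that Person/Company arrangement, using SQLite from Python rather than Jet/ACE, so treat it as an approximation. All table and column names here are invented. SQLite only auto-generates values for an INTEGER PRIMARY KEY, so the extra `person_company_id` is supplied by hand where Access would use an Autonumber.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person  (person_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- The PK stays on the pair of FKs; the extra id is only a unique handle
-- for child tables to reference.
CREATE TABLE person_company (
    person_id         INTEGER NOT NULL REFERENCES person (person_id),
    company_id        INTEGER NOT NULL REFERENCES company (company_id),
    person_company_id INTEGER NOT NULL UNIQUE,
    PRIMARY KEY (person_id, company_id)
);

-- The child stores one column instead of repeating the composite key.
CREATE TABLE job_title (
    person_company_id INTEGER NOT NULL
        REFERENCES person_company (person_company_id),
    title             TEXT NOT NULL,
    PRIMARY KEY (person_company_id, title)
);
""")

conn.execute("INSERT INTO person VALUES (1, 'Ann')")
conn.execute("INSERT INTO company VALUES (10, 'Acme')")
conn.execute("INSERT INTO person_company VALUES (1, 10, 100)")
conn.executemany("INSERT INTO job_title VALUES (?, ?)",
                 [(100, "Engineer"), (100, "Manager")])

# The composite PK still rejects a second row for the same pair,
# even though the surrogate value differs.
try:
    conn.execute("INSERT INTO person_company VALUES (1, 10, 101)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

titles = [row[0] for row in conn.execute("""
    SELECT jt.title
      FROM person_company AS pc
      JOIN job_title AS jt ON jt.person_company_id = pc.person_company_id
     WHERE pc.person_id = 1 AND pc.company_id = 10
     ORDER BY jt.title
""")]
print(duplicate_rejected, titles)
```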
I'll add more as I think of it.
It complicates queries and maintenance. If you are really interested in this subject I'd recommend looking over the number of posts that already cover this. This will give you better info than any one response here.
https://stackoverflow.com/search?q=composite+primary+key
In the first place, composite keys are bad for performance in joins. Further, they are much worse for updating records, as you have to update all the child records as well. Finally, very few composite keys are actually good keys. To be a good key, it should be unique and not subject to change. The composite key in your example fails both tests: it is not unique (there are people with the same name born on the same day), and names change frequently, causing much unnecessary updating of all the child tables.
As far as tables with autogenerated keys causing duplicates, that is mostly due to several factors:
the rest of the data in the table can't be identified in any way as unique
a design failure of forgetting to create a unique index on the possible composite key
poor design of the user interface, which doesn't attempt to find matching records or which allows data entry when a pull down might be more appropriate.
None of those are the fault of the surrogate key, they just indicate incompetent developers.
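The second factor is the easiest one to show. A minimal sketch in SQLite via Python, with a made-up `customer` table whose natural key is assumed to be `email`: the same insert runs twice against a table with only a surrogate key and against one that also has a unique index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Surrogate key only: nothing stops logical duplicates.
CREATE TABLE customer_unguarded (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL
);

-- Surrogate key plus a unique index on the natural key.
CREATE TABLE customer_guarded (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL
);
CREATE UNIQUE INDEX ux_customer_guarded_email ON customer_guarded (email);
""")

def insert_twice(table):
    """Insert the same email twice; report (was_rejected, row_count)."""
    rejected = False
    for _ in range(2):
        try:
            conn.execute(f"INSERT INTO {table} (email) VALUES (?)",
                         ("a@example.com",))
        except sqlite3.IntegrityError:
            rejected = True
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return rejected, count

unguarded = insert_twice("customer_unguarded")
guarded = insert_twice("customer_guarded")
print(unguarded, guarded)
```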
I think some coders see the complexity but want to avoid it, and most coders don't even think to look for the complexity at all.
Let's consider a common example of a table that had more than one candidate key: a Payroll table with columns employee_number, salary_amount, start_date and end_date.
The four candidate keys are as follows:
UNIQUE (employee_number, start_date); -- simple constraint
UNIQUE (employee_number, end_date); -- simple constraint
UNIQUE (employee_number, start_date, end_date); -- simple constraint
CHECK (
    NOT EXISTS (
        SELECT Calendar.day_date
          FROM Calendar, Payroll AS P1
         WHERE P1.start_date <= Calendar.day_date
           AND Calendar.day_date < P1.end_date
         GROUP BY P1.employee_number, Calendar.day_date
        HAVING COUNT(*) > 1
    )
); -- sequenced key i.e. no overlapping periods for the same employee
Only one of those keys is required to be enforced i.e. the sequenced key. However, most coders wouldn't think to add such a key, let alone know how to code it in the first place. In fact, I would wager that most Access coders would add an incrementing autonumber column to the table, make the autonumber column the PRIMARY KEY, fail to add constraints for any of the candidate keys and will have convinced themselves that their table has a key!
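Many engines (SQLite among them) don't allow a subquery inside a CHECK constraint, so as a rough sketch the sequenced-key test can at least be run as an ordinary query. This assumes the grouped query is meant to end with HAVING COUNT(*) > 1, so that it returns only days covered by more than one period for the same employee. The dates, salaries and the `has_overlap` helper are invented for the demo.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Calendar (day_date TEXT PRIMARY KEY);
CREATE TABLE Payroll (
    employee_number INTEGER NOT NULL,
    salary_amount   INTEGER NOT NULL,
    start_date      TEXT NOT NULL,
    end_date        TEXT NOT NULL,
    UNIQUE (employee_number, start_date)  -- one of the simple keys
);
""")

# One Calendar row per day of 2024, stored as ISO strings so that
# string comparison matches date order.
first_day = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO Calendar VALUES (?)",
    [((first_day + timedelta(days=i)).isoformat(),) for i in range(366)],
)

OVERLAP_QUERY = """
SELECT P1.employee_number, Calendar.day_date
  FROM Calendar, Payroll AS P1
 WHERE P1.start_date <= Calendar.day_date
   AND Calendar.day_date < P1.end_date
 GROUP BY P1.employee_number, Calendar.day_date
HAVING COUNT(*) > 1
"""

def has_overlap():
    # Any returned row is a day covered by two or more periods.
    return conn.execute(OVERLAP_QUERY).fetchone() is not None

# Two back-to-back periods: end_date is exclusive, so no day is shared.
conn.execute("INSERT INTO Payroll VALUES (1, 50000, '2024-01-01', '2024-07-01')")
conn.execute("INSERT INTO Payroll VALUES (1, 55000, '2024-07-01', '2024-12-31')")
before = has_overlap()

# This row passes the simple UNIQUE key but overlaps both periods.
conn.execute("INSERT INTO Payroll VALUES (1, 60000, '2024-06-01', '2024-08-01')")
after = has_overlap()
print(before, after)
```

The third row is the point: the simple UNIQUE constraint accepts it, and only the sequenced check notices the problem.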

Pros and Cons of autoincrement keys on "every table"

We are having a rather long discussion in our company about whether or not to put an autoincrement key on EVERY table in our database.
I can understand putting one on tables that would be referenced by a FK, but I kind-of dislike putting such keys on each and every one of our tables, even though the keys would never be used.
Please help with pros and cons for putting autoincrement keys on every table apart from taking extra space and slowing everything a little bit (we have some tables with hundreds of millions of records).
Thanks
I'm assuming that almost all tables will have a primary key - and it's just a question of whether that key consists of one or more natural keys or a single auto-incrementing surrogate key. If you aren't using primary keys, then you will generally gain a lot by adding them to almost all tables.
So, here are some pros & cons of surrogate keys. First off, the pros:
Most importantly: they allow the natural keys to change. Trivial example, a table of persons should have a primary key of person_id rather than last_name, first_name.
Read performance - very small indexes are faster to scan. However, this is only helpful if you're actually constraining your query by the surrogate key. So, good for lookup tables, not so good for primary tables.
Simplicity - if named appropriately, it makes the database easy to learn & use.
Capacity - if you're designing something like a data warehouse fact table - surrogate keys on your dimensions allow you to keep a very narrow fact table - which results in huge capacity improvements.
And cons:
They don't prevent duplicates of the natural values. So, you'll still usually want a unique constraint (index) on the logical key.
Write performance. With an extra index you're going to slow down inserts, updates and deletes that much more.
Simplicity - for small tables of data that almost never changes they are unnecessary. For example, if you need a list of countries you can use the ISO list of countries. It includes meaningful abbreviations. This is better than a surrogate key because it's both small and useful.
In general, surrogate keys are useful, just keep in mind the cons and don't hesitate to use natural keys when appropriate.
You need primary keys on these tables. You just don't know it yet.
If you use small keys like this for Clustered Indexes, then there are quite significant advantages.
Like:
Inserts will always go at the end of pages.
Non-Clustered Indexes (which need a reference to the CIX key(s)) won't have long row addresses to consider.
And more... Kimberly Tripp's stuff is the best resource for this. Google her...
Also - if you have nothing else ensuring uniqueness, you have a hook into each row that you wouldn't otherwise have. You should still put unique indexes on fields that should be unique, and use FKs onto appropriate fields.
But... please consider the overhead of creating such things on existing tables. It could be quite scary. You can put unique indexes on tables without needing to create extra fields. Those unique indexes can then be used for FKs.
I'm not a fan of auto-increment primary keys on every table. The ideas that these give you fast joins and fast row inserts are really not true. My company calls this meatloaf thinking after the story about the woman who always cut the ends off her meatloaf just because her mother always did it. Her mother only did it because the pan was too short--the tradition keeps going even though the reason no longer exists.
When the driving table in a join has an auto-increment key, the joined table frequently shouldn't because it must have the FK to the driving table. It's the same column type, but not auto-increment. You can use the FK as the PK or part of a composite PK.
Adding an auto-increment key to a table with a naturally unique key will not always speed things up--how can it? You are adding more work by maintaining an extra index. If you never use the auto-increment key, this is completely wasted effort.
It's very difficult to predict optimizer performance--and impossible to predict future performance. On some databases, compressed or clustered indexes will decrease the costs of naturally unique PKs. On some parallel databases, auto-increment keys are negotiated between nodes and that increases the cost of auto-increment. You can only find out by profiling, and it really sucks to have to change Company Policy just to change how you create a table.
Having autoincrementing primary keys may make it easier for you to switch ORM layers in the future, and doesn't cost much (assuming you retain your logical unique keys).
You add surrogate auto increment primary keys as part of the implementation after logical design to respect the physical, on-disk architecture of the db engine.
That is, they have physical properties (narrow, numeric, strictly monotonically increasing) that suit use as clustered keys, in joins etc.
Example: If you're modelling your data, then "product SKU" is your key. "product ID" is added afterwards, (with a unique constraint on "product SKU") when writing your "CREATE TABLE" statements because you know SQL Server.
This is the main reason.
The other reason is a brain-dead ORM that can't work without one...
Many tables are better off with a compound PK, composed of two or more FKs. These tables correspond to relationships in the Entity-Relationship (ER) model. The ER model is useful for conceptualizing a schema and understanding the requirements, but it should not be confused with a database design.
The tables that represent entities from an ER model should have a simple PK. You use a surrogate PK when none of the natural keys can be trusted. The decision about whether a key can be trusted or not is not a technical decision. It depends on the data you are going to be given, and what you are expected to do with it.
If you use a surrogate key that's autoincremented, you now have to make sure that duplicate references to the same entity don't creep into your databases. These duplicates would show up as two or more rows with a distinct PK (because it's been autoincremented), but otherwise duplicates of each other.
If you let duplicates into your database, eventually your use of the data is going to be a mess.
The simplest approach is to always use surrogate keys that are auto-incremented either by the db or via an ORM, and on every table. This is because they are generally the fastest method for joins, and they also make learning the database extremely simple, i.e. none of this "what's my key for this table" nonsense, as all tables use the same kind of key. Yes, they can be slower, but in truth the most important part of design is something that won't break over time. This is proven for surrogate keys. Remember, maintenance of the system happens a lot longer than development. Plan for a system that can be maintained. Also, with current hardware the potential performance loss is really negligible.
Consider this:
A record is deleted in one table that has a relationship with another table. The corresponding record in the second table cannot be deleted for auditing reasons. This record becomes orphaned from the first table. If a new record is inserted into the first table, and a sequential primary key is used, this record is now linked to the orphan. Obviously, this is bad. By using an auto incremented PK, an id that has never been used before is always guaranteed. This means that orphans remain orphans, which is correct.
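Whether a deleted id can ever come back depends on the engine and on how the column is declared, so it is worth checking on yours. As one data point, a minimal sketch in SQLite (the table names are made up): a plain INTEGER PRIMARY KEY can hand a deleted id out again, while adding the AUTOINCREMENT keyword prevents reuse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Plain rowid alias: the next id is based on the largest id currently present.
CREATE TABLE plain_ids (id INTEGER PRIMARY KEY, name TEXT);

-- AUTOINCREMENT: SQLite remembers the highest id it has ever handed out.
CREATE TABLE auto_ids (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
""")

def insert_delete_insert(table):
    """Insert a row, delete it, insert another; return both generated ids."""
    first = conn.execute(
        f"INSERT INTO {table} (name) VALUES ('a')").lastrowid
    conn.execute(f"DELETE FROM {table} WHERE id = ?", (first,))
    second = conn.execute(
        f"INSERT INTO {table} (name) VALUES ('b')").lastrowid
    return first, second

plain = insert_delete_insert("plain_ids")  # id of the deleted row is reused
auto = insert_delete_insert("auto_ids")    # a fresh id is issued
print(plain, auto)
```

With the plain key, an orphaned child row pointing at the deleted id would silently attach to the new record, which is exactly the scenario described above.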
I would never use natural keys as a PK. A numeric PK, like an auto increment is the ideal choice the majority of the time, because it can be indexed efficiently. Auto increments are guaranteed to be unique, even when records are deleted, creating trusted data relationships.

ID fields in SQL tables: rule or law?

Just a quick database design question: Do you ALWAYS use an ID field in EVERY table, or just most of them? Clearly most of your tables will benefit, but are there ever tables that you might not want to use an ID field?
For example, I want to add the ability to add tags to objects in another table (foo). So I've got a table FooTag with a varchar field to hold the tag, and a fooID field to refer to the row in foo. Do I really need to create a clustered index around an essentially arbitrary ID field? Wouldn't it be more efficient to use fooID and my text field as the clustered index, since I will almost always be searching by fooID anyway? Plus using my text in the clustered index would keep the data sorted, making sorting easier when I have to query my data. The downside is that inserts would take longer, but wouldn't that be offset by the gains during selection, which would happen far more often?
What are your thoughts on ID fields? Bendable rule, or unbreakable law?
edit: I am aware that the example provided is not normalized. If tagging is to be a major part of the project, with multiple tables being tagged, and other 'extras', a two-table solution would be a clear answer. However in this simplest case, would normalization be worthwhile? It would save some space, but require an extra join when running queries
As in much of programming: rule, not law.
Proof by exception: Some two-column tables exist only to form relationships between other more meaningful tables.
If you are making tables that bridge between two or more other tables and the only fields you need are the dual PK/FK's, then I don't know why you would need an ID column in there as well.
ID columns generally can be very helpful, but that doesn't mean you should go peppering them in at every occasion.
As others have said, it's a general, rather than absolute, rule and there are plenty of exceptions (tables with composite keys for example).
There are occasional but useful cases where you might want to create an artificial ID in a table that already has a (usually composite) unique identifier. For example, in one system I've created a table to store part numbers; although the part numbers are unique, they may actually change, so we add an arbitrary integer PartID. Not so common, but it's a typical real-world example.
In general, what you really want is, if at all possible, some way to uniquely identify a record. It could be an id field or it could be a unique index (which does not have to be on just one field). Anytime I thought I could get away without creating a way to uniquely identify a record, I have been proven wrong. Not all tables have a natural key, though, and if they do not, you really need to have an id field of some kind. If you have a natural key, you could use that instead, but I find that even then I need an id field in most cases to prevent having to do too much updating when the natural key changes (it always seems to change). Plus, having worked with literally hundreds of databases concerning many, many different topics, I can tell you that a true natural key is rare. As others have mentioned, there is no need for an id field in a table that is simply there to join two tables that have a many to many relationship, but even this should have a unique index.
If you need to retrieve records from that table with unique id then yes. If you will retrieve them by some other composite key made up of foreign keys then no. The last thing you need is fields, data, and indexes that you do not use.
A clustered index does not need to be on primary key or a surrogate (identity column) either.
Your design, however, is not normalized. Typically for tagging, I use two tables, a table of tags (with a surrogate key) and a table of links from the tags to the subject table(s) using the surrogate key in the tag table and the primary key in the subject table. This allows your tags to apply to different entities (photos, articles, employees, locations, products, whatever). It allows you to enforce foreign key relationships to multiple tables, and also allows you to invent tag hierarchies and other things about the tag table.
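A rough sketch of that two-table layout for the foo example, in SQLite via Python. The names `tag`, `foo_tag` and their columns are assumptions, not anything from the question. The link table has no ID column of its own; its primary key is the pair of foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE foo (foo_id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- Tags get a surrogate key, with the tag text itself kept unique.
CREATE TABLE tag (
    tag_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE
);

-- Link table: composite PK on the two FKs, leading with foo_id because
-- lookups are assumed to be mostly by foo_id.
CREATE TABLE foo_tag (
    foo_id INTEGER NOT NULL REFERENCES foo (foo_id),
    tag_id INTEGER NOT NULL REFERENCES tag (tag_id),
    PRIMARY KEY (foo_id, tag_id)
);
""")

conn.executemany("INSERT INTO foo VALUES (?, ?)",
                 [(1, "first"), (2, "second")])
conn.executemany("INSERT INTO tag VALUES (?, ?)",
                 [(1, "sql"), (2, "design")])
conn.executemany("INSERT INTO foo_tag VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

# The extra join the question worried about: one hop through foo_tag.
tags_for_foo_1 = [row[0] for row in conn.execute("""
    SELECT tag.name
      FROM foo_tag
      JOIN tag ON tag.tag_id = foo_tag.tag_id
     WHERE foo_tag.foo_id = ?
     ORDER BY tag.name
""", (1,))]
print(tags_for_foo_1)
```

A second link table (say, for photos) could point at the same `tag` table, which is where the reuse across entities comes from.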
As far as the indexes on this design, it will be dictated by the usage patterns.
In general, developers love having an ID field on all tables except for 'linking' tables because it makes development much easier, and I am no exception to this. DBAs, on the other hand, see no problem with making natural primary keys made up of 3 or 4 columns. It can be a butting of heads to try and get a good database design.