When would you ever need to change a primary key's value?

I've been reading up on foreign keys in PostgreSQL and noticed that it allows cascading updates for foreign keys.
So my question is: when would you need to update the primary key of a row?
Apparently this guy needs to (http://www.oreillynet.com/onlamp/blog/2004/10/hey_sql_fans_check_out_foreign.html), but I don't quite understand how it could ever be useful.
Edit:
I see how this could be used for natural primary keys. But what about technical primary keys, the ones that have no meaning and are almost always auto-generated on insert?

Well... we have a lot of primary keys that are defined as a human readable code. Terrible idea, but not much choice in the matter.
It is very very handy to be able to fix that PK, and all dependent records, when someone realizes it is misspelled, or the meaning has changed.
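For what it's worth, here is a minimal sketch of that kind of fix. It uses Python's built-in sqlite3 module rather than PostgreSQL, and the department/employee tables and the misspelled code 'ACTT' are made up for illustration. It shows ON UPDATE CASCADE carrying a corrected key into the dependent rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and cascades) when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE department (
        code TEXT PRIMARY KEY,   -- human-readable natural key (hypothetical)
        name TEXT NOT NULL
    );
    CREATE TABLE employee (
        id        INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL,
        dept_code TEXT NOT NULL
            REFERENCES department(code) ON UPDATE CASCADE
    );
    INSERT INTO department VALUES ('ACTT', 'Accounting');   -- misspelled code
    INSERT INTO employee (full_name, dept_code)
        VALUES ('Ann', 'ACTT'), ('Bob', 'ACTT');
""")

# Correct the key once, in the parent table; the cascade rewrites the children.
conn.execute("UPDATE department SET code = 'ACCT' WHERE code = 'ACTT'")
conn.commit()

child_codes = [row[0] for row in
               conn.execute("SELECT dept_code FROM employee ORDER BY id")]
print(child_codes)
```

Without the cascade, the same UPDATE would be rejected as long as child rows still referenced the old code.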

You would need to do it if you chose your primary key as a natural key instead of a surrogate key, and then later found out that the user changed their surname, or that they wrote their SSN incorrectly on the application form.
Moral of story: don't use natural keys as primary keys.

I had to change my PK several times after exposing it to a third-party system. From time to time they called us asking to change the PKs to fit the records in their database (every so often, due to a technical problem, the synchronization between the two systems failed).
After several rounds of this we just stopped exposing the PK and added a new column.

For a synthetic, meaningless primary key like an autoincrementing column there should (with a few exceptions) never be any reason to update the PK value. If the PK is a user-visible value you might have to update it (which is one of the many arguments in favour of synthetic keys). An example of this situation is an insurance policy number. In some cases the year is a part of the number, and may tick over on every renewal. In some data models the record is just updated in situ.
Where this happens you would be better off to use a synthetic key, so that other items are not dependent on the visible number.
One possible scenario where you would need to update a synthetic key is if you were merging two or more application databases together. In this case you may need to shift keys en masse to avoid collisions with the keys of records from the other source.
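As a rough sketch of the merge case (Python with sqlite3; the customer/invoice tables and the sample rows are assumptions, not anything from a real system), the incoming keys can be shifted by the current maximum so they no longer collide, with the foreign keys shifted by the same amount:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE invoice (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        total       INTEGER NOT NULL
    );
    -- Rows already present from application A.
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO invoice  VALUES (1, 1, 100), (2, 2, 250);
""")

# Rows exported from application B; their ids collide with A's.
b_customers = [(1, "Carol"), (2, "Dave")]
b_invoices = [(1, 2, 75)]  # (id, customer_id, total)

# Shift every incoming key past the highest key already in use.
cust_shift = conn.execute("SELECT COALESCE(MAX(id), 0) FROM customer").fetchone()[0]
inv_shift = conn.execute("SELECT COALESCE(MAX(id), 0) FROM invoice").fetchone()[0]

with conn:  # one transaction: either the whole merge lands or none of it
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(cid + cust_shift, name) for cid, name in b_customers],
    )
    conn.executemany(
        "INSERT INTO invoice VALUES (?, ?, ?)",
        [(iid + inv_shift, cid + cust_shift, total) for iid, cid, total in b_invoices],
    )

merged = conn.execute("""
    SELECT i.id, c.name, i.total
    FROM invoice i JOIN customer c ON c.id = i.customer_id
    ORDER BY i.id
""").fetchall()
print(merged)
```

The point is that the parent keys and every foreign key that refers to them have to move together, which is exactly the "update a synthetic key" case.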

You may get into this situation if you use a natural primary key.
Here's one very fresh example: in Croatia the government changed the tax identification numbers for both companies and individuals. The new law was introduced on January 1st, 2010.
Last year, I was a consultant on several projects where companies were changing from a natural key (the old tax number) to a surrogate key in existing applications. The natural key seemed a logical selection to the original designers of those apps because it was defined by law. And then it changed.

For autogenerated keys, one example that I came across here (can't remember which question) is if you need to merge two database tables together. In this case, you'll likely have duplicates unless your keys happened to be offset enough.

You might need to do this if using a natural primary key (one that has an actual meaning in the problem domain). If the meaning changed, then you'd need to cascade the change.
I suppose a bad example of this would be a database of buildings on a school campus, with the building name as the primary key (don't do this at home). If the building is renamed to bribe, er, honor a new donor, then the key would need to change.

Related

Adding an artificial primary key versus using a unique field [duplicate]

Recently I inherited a huge app from somebody who left the company.
The app uses a SQL Server database.
The developer always defined an int-based primary key on tables. For example, even though the Users table has a unique UserName field, he still added an integer identity primary key.
This is done for every table, no matter whether other fields could be unique and serve as the primary key.
Do you see any benefit whatsoever in this? That is, using UserName as the primary key versus adding UserID (an identity column) and setting that as the primary key?
I feel like I have to add another element to my comments, which were turning into an essay, so I think it is better that I post it all as an answer instead.
Sometimes there are domain-specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static, ever-increasing clustered index key alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that.
But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.
One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.
But a natural primary key can eliminate the need for a join in the first place.
Let's put all this together with an example:
    -- Few, stable rows: the natural key is also the clustered primary key.
    create table Countries
    (
        countryCode char(2) not null primary key clustered,
        countryName varchar(64) not null
    );
    insert Countries values
        ('AU', 'Australia'),
        ('FR', 'France');
    -- Many, frequently changing rows: cluster on the surrogate,
    -- but keep the natural key as the primary key.
    create table TourLocations
    (
        tourLocationName varchar(64) not null,
        tourLocationId int identity(1,1) unique clustered,
        countryCode char(2) not null foreign key references Countries(countryCode),
        primary key (countryCode, tourLocationName)
    );
    insert TourLocations (tourLocationName, countryCode) values
        ('Bondi Beach', 'AU'),
        ('Eiffel Tower', 'FR');
I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.
On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.
But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.
By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.
OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.
But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.
Suppose I gave the results of this query to my users:
    select countryCode, tourLocationName
    from TourLocations
    order by 1, 2;
Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.
When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.
If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.
You are talking about the difference between synthetic and natural keys.
In my [very] personal opinion, I would recommend to always use synthetic keys (and always call it id). The main problem is that natural keys are never unique; they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.
In database design:
Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They only serve as unique identifiers for the rows.
I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".
Yes, having a dedicated int is a good thing for PK use.
You may have multiple alternate keys as well; that's OK too.
Two great reasons for it:
It is performant.
It protects against key mutation (editing a name, etc.).
A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.
Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.
By meaningless I mean that apart from being the primary key it has no meaning to the users.
An int identity column has other advantages over a text field as primary key:
It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
It is faster than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture-dependent behaviors, resulting in text comparisons not always working as expected.
Int primary keys generated in ascending order have superior performance in conjunction with clustered primary keys (which is a SQL Server specialty).
Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.
But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.
This subject is much like Gulliver's Travels, where wars are fought over which end of the egg you are supposed to crack open.
However, using the SAME "id" name for all tables, with an autonumber? Yes, it is a LONG-established choice.
There are of course MANY different views on this subject, and many advantages and disadvantages.
Regardless of which choice one prefers (or even needs), this is a long-established concept in our industry. In fact, SharePoint tables use "ID" and autonumber by default. So does MS Access, and there are probably more that do this.
The simple concept?
You can build your tables with the PK, and child tables with foreign keys.
At that point you set up your relationships between the tables.
Now, you might decide to add, say, an invoice number or whatever. Rules might mean that such an invoice number must not be duplicated.
But WHY do we care if you have some "user" name, or some "invoice" number, or whatever? Why should that fact affect your relational database model?
You mean I don't have a user name, or don't have an invoice number, and the whole database and its relationships don't work anymore? We don't care!!!!
The concept of data, even required fields, or even a column having to be unique?
That has ZERO to do with a working relational data model.
And maybe you decide that the invoice number is not generated until, say, the invoice is sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care - you can have all kinds of business rules for those numbers, but they have ZERO to do with the fact that you designed a working relational data model based on so-called "surrogate" (sometimes called synthetic) keys.
So, once you build that data model - even with JUST the PK "id" and FKs (foreign keys) - you are NOW free to start adding columns and defining what type of data you are going to put in each table. But what you shove into each table has ZERO to do with that working relational data model. They are to be thought of as separate concepts.
So, if you have a user name, add that column to the table. If you don't want the user name, remove the column. The data you store in the table has ZERO to do with the automatic PK ID you are using - it is not really any different from, say, what area of memory the computer is going to allocate to load that data. The basic data operations of the system have nothing to do with having built a database with relationships that simply exist. And the data columns you add after having built those relationships are up to you - but they will not, and should not, affect the operation of the database and the relationships you built and set up. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships, as opposed to the data columns you add to such tables to store user data.
I mean, in JSON data, or XML? We often have a master + child relationship. We don't care how that relationship is maintained - only that it exists.
Thus yes, all tables have that PK "ID". Even better? In code, you NEVER have to guess what the PK column is called - it is always the same!!!
So, the data and columns you toss into a table? Those columns and data have zero to do with the PK id, and while it is the database generating that PK, it could just as well be a web service call to some monkeys living in a faraway jungle eating bananas, who give you a PK value based on how many bananas they have eaten. We really don't care about that number - it is just an internal housekeeping number, one that we don't see or even care about in most code. And thus the number one rule for such automatic PK values?
You NEVER give that auto PK number any meaning from a user and application point of view.
In summary:
Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many other systems it is not only the default, but is in fact required for such systems to operate.
It's better to use userid. The user table is referenced by many other tables.
Each referencing table would contain the primary key of the user table as a foreign key.
It's better to use userid since it's an integer value:
it takes less space than the string values of username, and
searches by the database engine would be faster.
user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)

Does every table really need an auto-incrementing artificial primary key? [closed]

Almost every table in every database I've seen in my 7 years of development experience has an auto-incrementing primary key. Why is this? If I have a table of U.S. states where each state must have a unique name, what's the use of an auto-incrementing primary key? Why not just use the state name as the primary key? Seems to me like an excuse to allow duplicates disguised as unique rows.
This seems plainly obvious to me, but then again, no one else seems to be arriving at and acting on the same logical conclusion as me, so I must assume there's a good chance I'm wrong.
Is there any real, practical reason we need to use auto-incrementing keys?
This question has been asked numerous times on SO and has been the subject of much debate over the years amongst (and between) developers and DBAs.
Let me start by saying that the premise of your question implies that one approach is universally superior to the other ... this is rarely the case in real life. Surrogate keys and natural keys both have their uses and challenges - and it's important to understand what they are. Whichever choice you make in your system, keep in mind there is benefit to consistency - it makes the data model easier to understand and easier to develop queries and applications for. I also want to say that I tend to prefer surrogate keys over natural keys for PKs ... but that doesn't mean that natural keys can't sometimes be useful in that role.
It is important to realize that surrogate and natural keys are NOT mutually exclusive - and in many cases they can complement each other. Keep in mind that a "key" for a database table is simply something that uniquely identifies a record (row). It's entirely possible for a single row to have multiple keys representing the different categories of constraints that make a record unique.
A primary key, on the other hand, is a particular unique key that the database will use to enforce referential integrity and to represent a foreign key in other tables. There can only be a single primary key for any table. The essential quality of a primary key is that it be 100% unique and non-NULL. A desirable quality of a primary key is that it be stable (unchanging). While mutable primary keys are possible, they cause many problems for databases that are better avoided (cascading updates, RI failures, etc). If you do choose to use a surrogate primary key for your table(s), you should also consider creating unique constraints to reflect the existence of any natural keys.
Surrogate keys are beneficial in cases where:
Natural keys are not stable (values may change over time)
Natural keys are large or unwieldy (multiple columns or long values)
Natural keys can change over time (columns added/removed over time)
By providing a short, stable, unique value for every row, we can reduce the size of the database, improve its performance, and reduce the volatility of dependent tables which store foreign keys. There's also the benefit of key polymorphism, which I'll get to later.
In some instances, using natural keys to express relationships between tables can be problematic. For instance, imagine you had a PERSON table whose natural key was {LAST_NAME, FIRST_NAME, SSN}. What happens if you have some other table GRANT_PROPOSAL in which you need to store a reference to a Proposer, Reviewer, Approver, and Authorizer? You now need 12 columns (4 roles x 3 key columns) to express this information. You also need to come up with a naming convention of some kind to identify which columns belong to which kind of individual. But what if your PERSON table required 6, or 8, or 24 columns to form a natural key? This rapidly becomes unmanageable. Surrogate keys resolve such problems by divorcing the semantics (meaning) of a key from its use as an identifier.
Let's also take a look at the example you described in your question.
Should the 2-character abbreviation of a state be used as the primary key of that table?
On the surface, it looks like the abbreviation field meets the requirements of a good primary key. It's relatively short, it is easy to propagate as a foreign key, it looks stable. Unfortunately, you don't control the set of abbreviations ... the postal service does. And here's an interesting fact: in 1969 the USPS changed the abbreviation of Nebraska from NB to NE to minimize confusion with New Brunswick, Canada. The moral of the story is that natural keys are often outside of the control of the database ... and they can change over time. Even when you think they cannot. This problem is even more pronounced for more complicated data like people, or products, etc. As businesses evolve, the definitions for what makes such entities unique can change. And this can create significant problems for data modelers and application developers.
Earlier I mentioned that surrogate keys can support key polymorphism. What does that mean? Well, polymorphism is the ability of one type, A, to appear as and be used like another type, B. In databases, this concept refers to the ability to combine keys from different classes of entities into a single table. Let's look at an example. Imagine for a moment that you want to have an audit trail in your system that identifies which entities were modified by which user on what date. It would be nice to create a table with the fields: {ENTITY_ID, USER_ID, EDIT_DATE}. Unfortunately, using natural keys, different entities have different keys. So now we need to create a separate linking table for each kind of entity ... and build our application in a manner where it understands the different kinds of entities and how their keys are shaped. With uniformly shaped surrogate keys, a single table can cover every kind of entity (plus a column naming the entity type, unless the IDs are unique across tables).
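To make the audit-trail idea concrete, here is a small sketch in Python with sqlite3. The person/product tables, the entity_type column and the sample dates are all assumptions for illustration; the only point is that one audit table can reference any entity once every entity has the same shape of key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person  (id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT);
    CREATE TABLE product (id INTEGER PRIMARY KEY, sku TEXT);

    -- One audit table for every kind of entity.
    CREATE TABLE audit_log (
        entity_type TEXT    NOT NULL,  -- which table entity_id points into
        entity_id   INTEGER NOT NULL,  -- same key shape for every entity
        user_id     INTEGER NOT NULL,
        edit_date   TEXT    NOT NULL
    );

    INSERT INTO person  VALUES (1, 'Smith', 'John');
    INSERT INTO product VALUES (1, 'SKU-001');
    INSERT INTO audit_log VALUES
        ('person',  1, 42, '2024-01-05'),
        ('product', 1, 42, '2024-01-06');
""")

edits_by_user = conn.execute(
    "SELECT entity_type, entity_id FROM audit_log "
    "WHERE user_id = ? ORDER BY edit_date",
    (42,),
).fetchall()
print(edits_by_user)
```

One trade-off worth noting: entity_id here is not a declared foreign key, so the database cannot check that the referenced row exists. That is a cost of this design, not something the sketch solves.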
Don't get me wrong. I'm not advocating that surrogate keys should ALWAYS be used. In the real world, "never" and "always" are dangerous positions to adopt. One of the biggest drawbacks of surrogate keys is that they can result in tables that have foreign keys consisting of lots of "meaningless" numbers. This can make it cumbersome to interpret the meaning of a record, since you have to join or look up records from other tables to get a complete picture. It also can make a distributed database deployment more complicated, as assigning unique incrementing numbers across servers isn't always possible (although most modern databases like Oracle and SQL Server mitigate this via sequence replication).
No.
In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, so can SSN's and other items of information.
In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.
A primary key must be:
unique (100% guaranteed! Not just "almost" unique)
NOT NULL
A primary key should be:
stable if ever possible (not change - or at least not too frequently)
State two-letter codes definitely would offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, a two-letter CHAR(2) column just the same. I would not ever use a VARCHAR(100) field or something like that as a key - it's just too clunky, most likely will change all the time - not a good key candidate.
So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since no naturally occurring data is really up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.
I think the use of the word "Primary" in the phrase "Primary Key" is, in a real sense, misleading.
First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table.
Then, having any key serves several, often mutually inconsistent, purposes.
Purpose 1. To use as joins conditions to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables)
Purpose 2. (related) Ensuring that child records must have a parent record in the parent table (The child table FK must exist as Key in the parent table)
Purpose 3. To increase performance of queries that need to rapidly locate a specific record/row in the table.
Purpose 4. (Most important from a data consistency perspective!) To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted into the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)
Clearly, any non-meaningful, non-natural key (like a GUID or an auto-generated integer) is totally incapable of satisfying Purpose 4.
But often, with many (most) tables, a totally natural key which can provide #4 will consist of multiple attributes and be excessively wide, or so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequences.
The answer is simple. Use both. Use a simple auto-generating integral key for all joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) has an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is a key, to avoid inconsistent duplicate data...
The only time you can get away without both is for a completely stand alone table that participates in no relationships with other tables and has an obvious and reliable natural key.
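A minimal sketch of "use both", in Python with sqlite3 (the building/room tables and names are hypothetical): the integer id carries the joins and foreign keys, while a UNIQUE constraint on the natural key rejects duplicates. Renaming the natural key then touches one row and no child rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE building (
        id     INTEGER PRIMARY KEY,     -- surrogate: used for joins and FKs
        campus TEXT NOT NULL,
        name   TEXT NOT NULL,
        UNIQUE (campus, name)           -- natural key: blocks duplicate facts
    );
    CREATE TABLE room (
        id          INTEGER PRIMARY KEY,
        building_id INTEGER NOT NULL REFERENCES building(id),
        number      TEXT NOT NULL,
        UNIQUE (building_id, number)
    );
    INSERT INTO building (campus, name) VALUES ('North', 'Science Hall');
    INSERT INTO room (building_id, number) VALUES (1, '101');
""")

# The same natural key a second time is rejected, even though the
# surrogate id would have been different.
try:
    conn.execute("INSERT INTO building (campus, name) VALUES ('North', 'Science Hall')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Changing the natural key is a one-row update; room.building_id is untouched.
conn.execute("UPDATE building SET name = 'Donor Hall' WHERE id = 1")
room_building = conn.execute(
    "SELECT b.name FROM room r JOIN building b ON b.id = r.building_id"
).fetchone()[0]
print(duplicate_rejected, room_building)
```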
In general, a numeric primary key will perform better than a string. You can additionally create unique keys to prevent duplicates from creeping in. That way you get the assurance of no duplicates, but you also get the performance of numbers (vs. strings in your scenario).
In all likelihood, the major databases have some performance optimizations for integer-based primary keys that are not present for string-based primary keys. But that is only a reasonable guess.
Yes, in my opinion every table needs an auto incrementing integer key because it makes both JOINs and (especially) front-end programming much, much, much easier. Others feel differently, but this is over 20 years of experience speaking.
The single exception is small "code" or "lookup" tables, in which I'm willing to substitute a short (4 or 5 character) TEXT code value. I do this because I often use a lot of these in my databases and it allows me to present a meaningful display to the user without having to look up the description in the lookup table or JOIN it into a result set. Your example of a States table would fit in this category.
No, absolutely not.
Having a primary key which can't change is a good idea (UPDATE is legal for primary key columns, but in general potentially confusing and can create problems for child rows). But if your application has some other candidate which is more suitable than an auto-incrementing value, then you should probably use that instead.
Performance-wise, in general fewer columns are better, and particularly fewer indexes. If you have another column which has a unique index on it AND can never be changed by any business process, then it may be a suitable primary key.
Speaking from a MySQL (InnoDB) perspective, it's also a good idea to use a "real" column as a primary key rather than an "artificial" one, as InnoDB always clusters the primary key and includes it in secondary indexes (that is how it finds the rows in them). This gives it potential to do useful optimisation with a primary key which it can't with any other unique index. MSSQL users often choose to cluster the primary key, but it can also cluster a different unique index.
EDIT:
But if it's a small database and you don't really care about performance or size too much, adding an unnecessary auto-increment column isn't that bad.
A non auto-incrementing value (e.g. UUID, or some other string generated according to your own algorithm) may be useful for distributed, sharded, or diverse systems where maintaining a consistent auto-incrementing ID is difficult (or impossible - think of a distributed system which continues to insert rows on both sides of a network partition).
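A quick sketch of the UUID case, in Python with sqlite3 and the standard uuid module. The two in-memory databases stand in for two nodes that cannot talk to each other, and the event table is made up. Each side generates keys locally, and the rows can later be copied across unchanged:

```python
import sqlite3
import uuid


def new_node_db():
    """Create an independent database standing in for one node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE event (id TEXT PRIMARY KEY, payload TEXT NOT NULL)")
    return conn


def insert_event(conn, payload):
    """Insert a row whose key is generated locally, with no coordination."""
    event_id = str(uuid.uuid4())
    conn.execute("INSERT INTO event VALUES (?, ?)", (event_id, payload))
    return event_id


node_a, node_b = new_node_db(), new_node_db()
for i in range(3):
    insert_event(node_a, f"a-{i}")
    insert_event(node_b, f"b-{i}")

# Once the nodes can talk again, B's rows are copied into A as they are.
rows_from_b = node_b.execute("SELECT id, payload FROM event").fetchall()
node_a.executemany("INSERT INTO event VALUES (?, ?)", rows_from_b)
total = node_a.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(total)
```

With auto-incrementing ids both nodes would have produced 1, 2, 3 and the copy would have failed on the primary key; with random UUIDs a collision is possible in principle but vanishingly unlikely.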
I think there are two things that may explain the reason why auto-incrementing keys are sometimes used:
Space consideration: OK, your state name doesn't amount to much, but the space it takes may add up. If you really want to store the state with its name as the primary key, then go ahead, but it will take more space. That may not be a problem in certain cases, and it sounds like a problem of olden days, but the habit is perhaps ingrained. And we programmers and DBAs do love habits :D
Defensive consideration: I recently had the following problem. We have users in the database where the email is the key to all identification. Why not make the email the primary key? Except suddenly border cases creep in where one guy must be there twice to have two different addresses, and nobody talked about it in the specs so the address is not normalized, and there's this situation where two different emails must point to the same person and... After a while, you stop pulling your hair out and add the damn integer id column.
I'm not saying it's a bad habit, nor a good one; I'm sure good systems can be designed around reasonable primary keys, but these two points lead me to believe fear and habit are two among the culprits.
It's a key component of relational databases. Having an integer relate to a state instead of having the whole state name saves a bunch of space in your database! Imagine you have a million records referencing your state table. Do you want to use 4 bytes for a number on each of those records or do you want to use a whole crapload of bytes for each state name?
Here are some practical considerations.
Most modern ORMs (rails, django, hibernate, etc.) work best when there is a single integer column as the primary key.
Additionally, having a standard naming convention (e.g. id as primary key and table_name_id for foreign keys) makes identifying keys easier.

Why do I read so many negative opinions on using composite keys?

I was working on an Access database which loved auto-numbered identifiers. Every table used them except one, which used a key made up of the first name, last name and birthdate of a person. Anyways, people started running into a lot of problems with duplicates, as tables representing relationships could hold the same relationship twice or more. I decided to get around this by implementing composite keys for the relationship tables and I haven't had a problem with duplicates since.
So I was wondering what's the deal with the bad rep of composite keys in the Access world? I guess it's slightly more difficult to write a query, but at least you don't have to put in place tons of checks every time data is entered or even edited in the front end. Are they incredibly super inefficient or something?
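For readers who want to see the fix described above, here is a rough sketch in Python with sqlite3 (not Access; the person/club/membership tables are invented for illustration). The relationship table gets a composite primary key over its two foreign keys, so the same relationship cannot be stored twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE club   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Relationship table: the pair of foreign keys IS the key.
    CREATE TABLE membership (
        person_id INTEGER NOT NULL REFERENCES person(id),
        club_id   INTEGER NOT NULL REFERENCES club(id),
        PRIMARY KEY (person_id, club_id)
    );

    INSERT INTO person VALUES (1, 'Ann');
    INSERT INTO club   VALUES (1, 'Chess');
    INSERT INTO membership VALUES (1, 1);
""")

# Entering the same relationship again is refused by the database itself,
# so the front end needs no duplicate check of its own.
try:
    conn.execute("INSERT INTO membership VALUES (1, 1)")
    second_insert_ok = True
except sqlite3.IntegrityError:
    second_insert_ok = False

membership_count = conn.execute("SELECT COUNT(*) FROM membership").fetchone()[0]
print(second_insert_ok, membership_count)
```

An auto-numbered id on membership alone would have accepted the second row, which is how the duplicates got in.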
A composite key works fine for a single table, but when you start to create relations between tables it can get a bit much.
Consider two tables, Person and Event, and a many-to-many relation between them called Appointment.
If you have a composite key in the Person table made up of the first name, last name and birth date, and a composite key in the Event table made up of place and name, you will get five fields in the Appointment table to identify the relation.
A condition to bind the relation will be quite long:
    select Person.*, Event.*
    from Person, Event, Appointment
    where
        Person.FirstName = Appointment.PersonFirstName and
        Person.LastName = Appointment.PersonLastName and
        Person.BirthDate = Appointment.PersonBirthDate and
        Event.Place = Appointment.EventPlace and
        Event.Name = Appointment.EventName
If you on the other hand have auto-numbered keys for the Person and Event tables, you only need two fields in the Appointment table to identify the relation, and the condition is a lot smaller:
select Person.*, Event.*
from Person, Event, Appointment
where
Person.Id = Appointment.PersonId and Event.Id = Appointment.EventId
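As a rough sketch of the surrogate-key layout described above, here is the same join run against an in-memory SQLite database from Python. The table and column names follow the queries above; the sample rows and the column types are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate-key layout: one auto-numbered Id per table.
    CREATE TABLE Person (
        Id        INTEGER PRIMARY KEY,
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL,
        BirthDate TEXT NOT NULL
    );
    CREATE TABLE Event (
        Id    INTEGER PRIMARY KEY,
        Place TEXT NOT NULL,
        Name  TEXT NOT NULL
    );
    -- The link table needs only two columns to identify the relation.
    CREATE TABLE Appointment (
        PersonId INTEGER NOT NULL REFERENCES Person (Id),
        EventId  INTEGER NOT NULL REFERENCES Event (Id),
        PRIMARY KEY (PersonId, EventId)
    );
    INSERT INTO Person VALUES (1, 'John', 'Smith', '1970-01-01');
    INSERT INTO Event VALUES (1, 'Town Hall', 'Annual Meeting');
    INSERT INTO Appointment VALUES (1, 1);
""")

rows = conn.execute("""
    SELECT Person.FirstName, Person.LastName, Event.Name
    FROM Person
    JOIN Appointment ON Person.Id = Appointment.PersonId
    JOIN Event ON Event.Id = Appointment.EventId
""").fetchall()
print(rows)
```

Note that the link table itself still uses a composite key (PersonId, EventId); it is just made of two narrow columns instead of five wide ones.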
If you only use pure self-written SQL to access your data, they are OK.
However, some ORMs, adapters etc. require having a single PK field to identify a record.
Also note that a composite primary key is almost invariably a natural key (there is hardly a point in creating a surrogate composite key; you might as well use a single-field one).
The most common usage of a composite primary key is a many-to-many link table.
When using natural keys, you should ensure they are inherently unique and immutable: once an entity has been reflected by the model, it is always identified by the same key value, and any key value identifies only one entity.
This is not so in your case.
First, a person can change their name and even the birthdate.
Second, I can easily imagine two John Smiths born on the same day.
The former means that if a person changes their name, you will have to update it in each and every table that refers to persons; the latter means that the second John Smith will not be able to make it into your database.
For the case like yours, I would really consider adding a surrogate identifier to your model.
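To make the "second John Smith" problem concrete, here is a minimal sketch using SQLite from Python. The Person table mirrors the composite key from the question; the sample name and birth date are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Person (
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL,
        BirthDate TEXT NOT NULL,
        PRIMARY KEY (FirstName, LastName, BirthDate)
    )
""")
conn.execute("INSERT INTO Person VALUES ('John', 'Smith', '1980-05-17')")

# A different person who happens to share name and birth date.
try:
    conn.execute("INSERT INTO Person VALUES ('John', 'Smith', '1980-05-17')")
    second_john_accepted = True
except sqlite3.IntegrityError:
    second_john_accepted = False

print(second_john_accepted)
```

The database has no way to tell the two people apart, so the second insert is rejected as a duplicate key.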
Unfortunately one reason for those negative opinions is probably ignorance. Too many people don't understand the concept of Candidate Keys properly. There are people who seem to think that every table needs only one key, that one key is sufficient for data integrity and that choosing that one key is all that matters.
I have often speculated that it would be a good thing to deprecate and phase out the use of the term "primary key" altogether. Doing that would focus database designers' minds on the real issue: that a table should have as many keys as are necessary to ensure the correctness of the data and that some of those keys will probably be composite. Abolishing the primary key concept would do away with all those fatuous debates about what the primary key ought to be or not be.
If your RDBMS supports them and if you use them correctly (and consistently), unique keys on the composite PK should be sufficient to avoid duplicates. In SQL Server at least, you can also create FKs against a unique key instead of the PK, which can be useful.
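A small sketch of that idea, assuming SQLite and a hypothetical Membership relationship table: the surrogate Id is the primary key, and a separate UNIQUE constraint on the pair of columns still keeps duplicates out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Membership (
        Id       INTEGER PRIMARY KEY,   -- surrogate key
        PersonId INTEGER NOT NULL,
        GroupId  INTEGER NOT NULL,
        UNIQUE (PersonId, GroupId)      -- composite candidate key, still enforced
    );
    INSERT INTO Membership (PersonId, GroupId) VALUES (1, 10);
""")

# Storing the same relationship a second time violates the unique key.
try:
    conn.execute("INSERT INTO Membership (PersonId, GroupId) VALUES (1, 10)")
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

row_count = conn.execute("SELECT COUNT(*) FROM Membership").fetchone()[0]
print(duplicate_accepted, row_count)
```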
The advantage of a single "id" column (or surrogate key) is that it can improve performance by making for a narrower key. Since this key may be carried to indexes on that table (as a pointer back to the physical row from the index row) and other tables as a FK column that can decrease space and improve performance. A lot of it depends on the specific architecture of your RDBMS though. I'm not familiar enough with Access to comment on that unfortunately.
As Quassnoi points out, some ORMs (and other third party applications, ETL solutions, etc.) don't have the capability to handle composite keys. Other than some ORMs, though, most recent third party apps worth anything will support composite keys. ORMs have been a little slower in adopting that in general.
My personal preference for composite keys is that although a unique index can solve the problem of duplicates, I've yet to see a development shop that actually fully used them. Most developers get lazy about it. They throw on an auto-incrementing ID and move on. Then, six months down the road they pay me a lot of money to fix their duplicate data issues.
Another issue is that auto-incrementing IDs aren't generally portable. Sure, you can move them around between systems, but since they have no actual basis in the real world it's impossible to determine one given everything else about an entity. This becomes a big deal in ETL.
PKs are a pretty important thing in the data modeling world and they generally deserve more thought than "add an auto-incrementing ID" if you want your data to be consistent and clean.
Surrogate keys are also useful, but I prefer to use them when I have a known performance issue that I'm trying to deal with. Otherwise it's the classic problem of wasting time trying to solve a problem that you might not even have.
One last note... on cross-reference tables (or joining tables as some call them) it's a little silly (in my opinion) to add a surrogate key unless required by an ORM.
Composite Keys are not just composite primary keys, but composite foreign keys as well. What do I mean by that? I mean that each table that refers back to the original table needs a column for each column in the composite key.
Here's a simple example, using a generic student/class arrangement.
Person
    FirstName
    LastName
    Address
Class
    ClassName
    InstructorFirstName
    InstructorLastName
    InstructorAddress
    MeetingTime
StudentClass - a many to many join table
    StudentFirstName
    StudentLastName
    StudentAddress
    ClassName
    InstructorFirstName
    InstructorLastName
    InstructorAddress
    MeetingTime
You just went from having a 2-column many-to-many table using surrogate keys to having an 8-column many-to-many table using composite keys, because they have 3 and 5 column foreign keys. You can't really get rid of any of these fields, because then the records wouldn't be unique, since both students and instructors can have duplicate names. Heck, if you have two people from the same address with the same name, you're still in serious trouble.
Most of the answers given here don't seem to me to be given by people who work with Access on a regular basis, so I'll chime in from that perspective (though I'll be repeating what some of the others have said, just with some Access-specific comments).
I use a surrogate key only when there is no single-column candidate key. This means I have tables with surrogate PKs and with single-column natural PKs, but no composite keys (except in joins, where they are the composite of two FKs, surrogate or natural doesn't matter).
Jet/ACE clusters on the PK, and only on the PK. This has potential drawbacks and potential benefits (if you consider a random Autonumber as PK, for instance).
In my experience, the non-Null requirement for a composite PK makes most natural keys impossible without using potentially problematic default values. It likewise wrecks your unique index in Jet/ACE, so in an Access app (before 2010), you end up enforcing uniqueness in your application. Starting with A2010, table-level data macros (which work like triggers) can conceivably be used to move that logic into the database engine.
Composite keys can help you avoid joins, because they repeat data that with surrogate keys you'd have to get from the source table via a join. While joins can be expensive, it's mostly outer joins that are a performance drain, and it's only with non-required FKs that you'd get the full benefit of avoiding outer joins. But that much data repetition has always bothered me a lot, since it seems to go against everything we've ever been taught about normalization!
As I mentioned above, the only composite keys in my apps are in N:N join tables. I would never add a surrogate key to a join table except in the relatively rare case in which the join table is itself a parent to related tables (e.g., Person/Company N:N record might have related JobTitles, i.e., multiple jobs within the same company). Rather than store the composite key in the child table, you'd store the surrogate key. I'd likely not make the surrogate key the PK, though -- I'd keep the composite PK on the pair of FK values. I would just add an Autonumber with a unique index for joining to the child table(s).
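The join-table arrangement just described can be sketched roughly as follows. This uses SQLite rather than Jet/ACE, so the Autonumber is stood in for by a hand-assigned PersonCompanyId column with a UNIQUE constraint; all table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE PersonCompany (
        PersonId        INTEGER NOT NULL,
        CompanyId       INTEGER NOT NULL,
        PersonCompanyId INTEGER NOT NULL UNIQUE,  -- stand-in for the Autonumber
        PRIMARY KEY (PersonId, CompanyId)         -- composite PK on the FK pair
    );
    CREATE TABLE JobTitle (
        JobTitleId      INTEGER PRIMARY KEY,
        PersonCompanyId INTEGER NOT NULL
                        REFERENCES PersonCompany (PersonCompanyId),
        Title           TEXT NOT NULL
    );
    INSERT INTO PersonCompany VALUES (1, 7, 100);
    INSERT INTO JobTitle (PersonCompanyId, Title) VALUES (100, 'Analyst');
    INSERT INTO JobTitle (PersonCompanyId, Title) VALUES (100, 'Manager');
""")

# The child table joins on the single surrogate column.
titles = [row[0] for row in conn.execute(
    "SELECT Title FROM JobTitle WHERE PersonCompanyId = 100 ORDER BY Title")]

# The composite PK still rejects a repeated Person/Company pair.
try:
    conn.execute("INSERT INTO PersonCompany VALUES (1, 7, 101)")
    duplicate_pair_accepted = True
except sqlite3.IntegrityError:
    duplicate_pair_accepted = False

print(titles, duplicate_pair_accepted)
```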
I'll add more as I think of it.
It complicates queries and maintenance. If you are really interested in this subject I'd recommend looking over the number of posts that already cover this. This will give you better info than any one response here.
https://stackoverflow.com/search?q=composite+primary+key
In the first place composite keys are bad for performance in joins. Further they are much worse for updating records as you have to update all the child records as well. Finally very few composite keys are actually really good keys. To be a good key it should be unique and not be subject to change. The example you gave as a composite key you used fails both tests. It is not unique (there are people with the same name born on the same day) and names change frequently causing much unnecessary updating of all the child tables.
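To illustrate the update cost, here is a sketch in SQLite where a name change has to be propagated to every child row. It assumes the engine supports ON UPDATE CASCADE (SQLite does once foreign keys are switched on); without it you would have to update each child table by hand. The table layout and sample data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Person (
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL,
        BirthDate TEXT NOT NULL,
        PRIMARY KEY (FirstName, LastName, BirthDate)
    );
    CREATE TABLE Appointment (
        PersonFirstName TEXT NOT NULL,
        PersonLastName  TEXT NOT NULL,
        PersonBirthDate TEXT NOT NULL,
        EventName       TEXT NOT NULL,
        FOREIGN KEY (PersonFirstName, PersonLastName, PersonBirthDate)
            REFERENCES Person (FirstName, LastName, BirthDate)
            ON UPDATE CASCADE
    );
    INSERT INTO Person VALUES ('Jane', 'Doe', '1985-03-02');
    INSERT INTO Appointment VALUES ('Jane', 'Doe', '1985-03-02', 'Checkup');
    INSERT INTO Appointment VALUES ('Jane', 'Doe', '1985-03-02', 'Follow-up');
""")

# One logical change to the parent rewrites every referencing child row.
conn.execute("UPDATE Person SET LastName = 'Roe' WHERE LastName = 'Doe'")
child_names = conn.execute(
    "SELECT DISTINCT PersonLastName FROM Appointment").fetchall()
print(child_names)
```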
As far as tables with autogenerated keys causing duplicates, that is mostly due to several factors:
the rest of the data in the table can't be identified in any way as unique
a design failure of forgetting to create a unique index on the possible composite key
poor design of the user interface which doesn't attempt to find matching records or which allows data entry when a pull down might be more appropriate.
None of those are the fault of the surrogate key, they just indicate incompetent developers.
I think some coders see the complexity but want to avoid it, and most coders don't even think to look for the complexity at all.
Let's consider a common example of a table that has more than one candidate key: a Payroll table with columns employee_number, salary_amount, start_date and end_date.
The four candidate keys are as follows:
UNIQUE (employee_number, start_date); -- simple constraint
UNIQUE (employee_number, end_date); -- simple constraint
UNIQUE (employee_number, start_date, end_date); -- simple constraint
CHECK (
NOT EXISTS (
SELECT Calendar.day_date
FROM Calendar, Payroll AS P1
WHERE P1.start_date <= Calendar.day_date
AND Calendar.day_date < P1.end_date
GROUP BY P1.employee_number, Calendar.day_date
HAVING COUNT(*) > 1
)
); -- sequenced key i.e. no over-lapping periods for the same employee
Only one of those keys is required to be enforced i.e. the sequenced key. However, most coders wouldn't think to add such a key, let alone know how to code it in the first place. In fact, I would wager that most Access coders would add an incrementing autonumber column to the table, make the autonumber column the PRIMARY KEY, fail to add constraints for any of the candidate keys and will have convinced themselves that their table has a key!
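The sequenced-key test can be tried out as an ordinary query. The sketch below uses SQLite, which does not allow subqueries inside CHECK constraints, so it only detects overlapping periods after the fact rather than preventing them. It includes a HAVING COUNT(*) > 1 filter so that only days covered by more than one period are reported. The Calendar contents and Payroll rows are invented sample data.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Calendar (day_date TEXT PRIMARY KEY);
    CREATE TABLE Payroll (
        employee_number INTEGER NOT NULL,
        salary_amount   INTEGER NOT NULL,
        start_date      TEXT NOT NULL,
        end_date        TEXT NOT NULL,
        UNIQUE (employee_number, start_date)
    );
""")

# One Calendar row per day of 2024 (ISO strings sort in date order).
first_day = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO Calendar VALUES (?)",
    [((first_day + timedelta(days=i)).isoformat(),) for i in range(366)])

conn.executemany("INSERT INTO Payroll VALUES (?, ?, ?, ?)", [
    (1, 30000, "2024-01-01", "2024-07-01"),
    (1, 32000, "2024-06-01", "2025-01-01"),  # overlaps the previous period in June
    (2, 28000, "2024-01-01", "2024-07-01"),
    (2, 29000, "2024-07-01", "2025-01-01"),  # starts where the previous one ends
])

# Days that fall inside more than one period for the same employee.
rows = conn.execute("""
    SELECT P1.employee_number, Calendar.day_date
    FROM Calendar, Payroll AS P1
    WHERE P1.start_date <= Calendar.day_date
      AND Calendar.day_date < P1.end_date
    GROUP BY P1.employee_number, Calendar.day_date
    HAVING COUNT(*) > 1
""").fetchall()

offenders = sorted({employee for employee, _ in rows})
print(offenders, len(rows))
```

Note that the simple UNIQUE (employee_number, start_date) constraint is satisfied by all four rows, yet employee 1 still has two salaries for every day in June.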

What should I consider when selecting a data type for my primary key?

When I am creating a new database table, what factors should I take into account for selecting the primary key's data type?
Sorry to do that, but I found that the answers I gave to related questions (you can check this and this) could apply to this one. I reshaped them a little bit...
You will find many posts dealing with this issue, and each choice you'll make has its pros and cons. Arguments for these usually refer to relational database theory and database performance.
On this subject, my point is very simple: surrogate primary keys ALWAYS work, while Natural keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons: field too short, rules change, etc.
To this point, you've guessed here that I am basically a member of the uniqueIdentifier/surrogate primary key team, and even if I appreciate and understand arguments such as the ones presented here, I am still looking for the case where "natural" key is better than surrogate ...
In addition to this, one of the most important but always forgotten arguments in favor of this basic rule is related to code normalization and productivity:
each time I create a table, shall I lose time
identifying its primary key and its physical characteristics (type, size)
remembering these characteristics each time I want to refer to it in my code?
explaining my PK choice to other developers in the team?
My answer is no to all of these questions:
I have no time to lose trying to identify "the best Natural Primary Key" when the surrogate option gives me a bullet-proof solution.
I do not want to remember that the Primary Key of my Table_whatever is a 10 characters long string when I write the code.
I don't want to lose my time negotiating the Natural Key length: "well if You need 10 why don't you take 12 to be on the safe side?". This "on the safe side" argument really annoys me: If you want to stay on the safe side, it means that you are really not far from the unsafe side! Choose surrogate: it's bullet-proof!
So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_MyTable' which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, where a field combination offers a very acceptable Primary Key, I prefer to create this 'id_myManyToManyTable' field being a uniqueIdentifier, just to stick to the rule, and because, finally, it does not hurt.
The major advantage is that you don't have to care anymore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.
And if you still want to have your "Natural Key" somewhere in your table, I advise you to build it following a standard model such as
Tbl_whatever
id_whatever, unique identifier, primary key
code_whatever, whateverTypeYouWant(whateverLengthYouEstimateTheRightOne), indexed
.....
Where id_ is the prefix for primary key, and code_ is used for "natural" indexed field. Some would argue that the code_ field should be set as unique. This is true, and it can be easily managed either through DDL or external code. Note that many "natural" keys are calculated (invoice numbers), so they are already generated through code.
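Here is one way the id_/code_ convention might look in practice. This is only a sketch: it uses SQLite, which has no uniqueIdentifier type, so the GUID is generated in Python with uuid4 and stored as text. The Tbl_invoice table and the add_invoice helper are hypothetical names.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Tbl_invoice (
        id_invoice   TEXT PRIMARY KEY,     -- surrogate key: a GUID stored as text
        code_invoice TEXT NOT NULL UNIQUE  -- "natural" code, unique and indexed
    )
""")

def add_invoice(code):
    """Insert an invoice under a freshly generated GUID and return that GUID."""
    new_id = str(uuid.uuid4())
    conn.execute("INSERT INTO Tbl_invoice VALUES (?, ?)", (new_id, code))
    return new_id

first_id = add_invoice("INV-0001")

# The natural code is not the PK, but the unique constraint still protects it.
try:
    add_invoice("INV-0001")
    duplicate_code_accepted = True
except sqlite3.IntegrityError:
    duplicate_code_accepted = False

print(first_id, duplicate_code_accepted)
```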
I am not sure that my rule is the best one. But it is a very efficient one! If everyone was applying it, we would for example avoid time lost answering to this kind of question!
If using a numeric key, make sure the datatype is going to be large enough to hold the number of rows you might expect the table to grow to.
If using a guid, does the extra space needed to store the guid need to be considered? Will coding against guid PKs be a pain for developers or users of the application?
If using composite keys, are you sure that the combined columns will always be unique?
I don't really like what they teach in school, that is using a 'natural key' (for example ISBN on a book database) or even having a primary key made up of 2 or more fields. I would never do that. So here's my little advice:
Always have one dedicated column in every table for your primary key.
They all should have the same column name across all tables, i.e. "ID" or "GUID"
Use GUIDs when you can (if you don't need performance), otherwise incrementing INTs
EDIT:
Okay, I think I need to explain my choices a little bit.
Having a dedicated column named the same across all tables for your primary key just makes your SQL statements a lot easier to construct and easier for someone else (who might not be familiar with your database layout) to understand. Especially when you're doing lots of JOINs and things like that. You won't need to look up what the primary key is for a specific table; you already know, because it's the same everywhere.
GUIDs vs. INTs doesn't really matter that much most of the time. Unless you hit the performance cap of GUIDs or are doing database merges, you won't have major issues with one or the other. BUT there's a reason I prefer GUIDs. The global uniqueness of GUIDs might always come in handy some day. Maybe you don't see a need for it now, but things like synchronizing parts of the database to a laptop / cell phone, or even finding data records without needing to know which table they're in, are great examples of the advantages GUIDs can provide. An Integer only identifies a record within the context of one table, whereas a GUID identifies a record everywhere.
In most cases I use an identity int primary key, unless the scenario requires a lot of replication, in which case I may opt for a GUID.
I (almost) never used meaningful keys.
Unless you have an ultra-convenient natural key available, always use a synthetic (a.k.a. surrogate) key of a numeric type. Even if you do have a natural key available, you might want to consider using a synthetic key anyway and placing an additional unique index on your natural key. Consider what happened to higher-ed databases that used social security numbers as PKs when federal law changed: the costs of changing over to synthetic keys were enormous.
Also, I have to disagree with the practice of naming every primary key the same, e.g. "id". This makes queries harder to understand, not easier. Primary keys should be named after the table. For example employee.employee_id, affiliate.affiliate_id, user.user_id, and so on.
Do not use a floating point numeric type, since floating point numbers cannot be properly compared for equality.
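The equality problem is easy to reproduce. This snippet uses Python floats (IEEE 754 doubles, the same representation most databases use for their double-precision type) and a plain dict as a stand-in for a key lookup.

```python
# Two ways of arriving at "0.3" that do not compare equal as floats.
a = 0.1 + 0.2
b = 0.3
floats_equal = (a == b)

# A lookup keyed on a float therefore misses the row it was meant to find.
rows_by_key = {0.3: "the row we want"}
lookup_result = rows_by_key.get(0.1 + 0.2)

print(floats_equal, lookup_result)
```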
Where do you generate it? Incrementing numbers don't fit well for keys generated by the client.
Do you want a data-dependent or independent key (sometimes you could use an ID from business data, can't say if this is always useful or not)?
How well can this type be indexed by your DB?
I have used uniqueidentifiers (GUIDs) or incrementing integers so far.
Numbers that have meaning in the real world are usually a bad idea, because every so often the real world changes the rules about how those numbers are used, in particular to allow duplicates, and then you've got a real mess on your hands.
I'm partial to using a generated integer key. If you expect the database to grow very large, you can go with bigint.
Some people like to use guids. The pro there is that you can merge multiple instances of the database without altering any keys but the con is that performance can be affected.
For a "natural" key, whatever datatype suits the column(s). Artificial (surrogate) keys are usually integers.
It all depends.
a) Are you fine having unique sequential numbers as your primary key? If yes, then an auto-incrementing integer (identity) column will suffice. (UniqueIdentifier, by contrast, is a GUID type, not a sequential number.)
b) If your business demand is such that you need to have an alphanumeric primary key, then you have to go for varchar or nvarchar.
These are the two options I could think of.
A great factor is how much data you're going to store. I work for a web analytics company, and we have LOADS of data. So a GUID primary key on our pageviews table would kill us, due to the size.
A rule of thumb: For high performance, you should be able to store your entire index in memory. Guids could easily break this!
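A back-of-the-envelope calculation shows why key width matters at that scale. The row count below is hypothetical, and the figures cover only the raw key bytes, ignoring page overhead, row pointers and any secondary indexes that also carry the key.

```python
ROWS = 1_000_000_000  # hypothetical number of rows in the table
KEY_BYTES = {"int": 4, "bigint": 8, "guid": 16}
GIB = 1024 ** 3

# Raw storage for the key column alone, in GiB.
key_gib = {name: ROWS * size / GIB for name, size in KEY_BYTES.items()}

for name, gib in key_gib.items():
    print(f"{name}: {gib:.1f} GiB of raw key data")
```

Whatever the real overhead turns out to be, a GUID key takes four times the raw space of a 4-byte integer key, and that factor carries over to every index and foreign key that stores it.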
Use natural keys when they can be trusted. Some sources of natural keys can't be trusted. Years ago, the Social Security Administration used to occasionally mess up and assign the same SSN to two different people. They've probably fixed that by now.
You can probably trust VINs for vehicles, and ISBNs for books (but not for pamphlets, which may not have an ISBN).
If you use natural keys, the natural key will determine the datatype.
If you can't trust any natural keys, create a synthetic key. I prefer integers for this purpose. Leave enough room for reasonable expansion.
I usually go with a GUID column primary key for all tables (rowguid in mssql). What could be natural keys I make unique constraints. A typical example would be a product identification number that the user has to make up and ensure is unique. If I need a sequence, like in an invoice, I build a table to keep a last number and a stored procedure to ensure serialized access. Or a Sequence in Oracle :-) I hate the "social security number" sample for natural keys, as that number will not always be available in a registration process, resulting in a need for a scheme to generate dummy numbers.
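A "last number" table might look something like this. It is only a sketch of the idea: SQLite has no stored procedures, so a Python function wrapped in a BEGIN IMMEDIATE transaction plays that role here, and the LastNumber table and next_number helper are hypothetical names.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# transactions can be started and ended explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE LastNumber (
        name  TEXT PRIMARY KEY,
        value INTEGER NOT NULL
    );
    INSERT INTO LastNumber VALUES ('invoice', 0);
""")

def next_number(name):
    """Increment and return the counter for `name` inside one write transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        conn.execute(
            "UPDATE LastNumber SET value = value + 1 WHERE name = ?", (name,))
        value = conn.execute(
            "SELECT value FROM LastNumber WHERE name = ?", (name,)).fetchone()[0]
        conn.execute("COMMIT")
        return value
    except Exception:
        conn.execute("ROLLBACK")
        raise

numbers = [next_number("invoice") for _ in range(3)]
print(numbers)
```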
I usually always use an integer, but here's an interesting perspective.
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
Whenever possible, try to use a primary key that is a natural key. For instance, if I had a table where I logged one record every day, the logdate would be a good primary key. Otherwise, if there is no natural key, just use int. If you think you will use more than 2 billion rows, use a bigint. Some people like to use GUIDs, which works well, as they are unique, and you will never run out of space. However, they are needlessly long, and hard to type in if you are just doing adhoc queries.
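For reference, the "2 billion" figure is the upper limit of a signed 32-bit integer; a signed 64-bit bigint goes a great deal further.

```python
INT_MAX = 2**31 - 1      # largest signed 32-bit value
BIGINT_MAX = 2**63 - 1   # largest signed 64-bit value

print(f"{INT_MAX:,}")     # 2,147,483,647
print(f"{BIGINT_MAX:,}")  # 9,223,372,036,854,775,807
```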

What are the down sides of using a composite/compound primary key?

What are the down sides of using a composite/compound primary key?
Could cause more problems for normalisation (2NF, "Note that when a 1NF table has no composite candidate keys (candidate keys consisting of more than one attribute), the table is automatically in 2NF")
More unnecessary data duplication. If your composite key consists of 3 columns, you will need to create the same 3 columns in every table, where it is used as a foreign key.
Generally avoidable with the help of surrogate keys (read about their advantages and disadvantages)
I can imagine a good scenario for composite key -- in a table representing a N:N relation, like Students - Classes, and the key in the intermediate table will be (StudentID, ClassID). But if you need to store more information about each pair (like a history of all marks of a student in a class) then you'll probably introduce a surrogate key.
There's nothing wrong with having a compound key per se, but a primary key should ideally be as small as possible (in terms of number of bytes required). If the primary key is long then this will cause non-clustered indexes to be bloated.
Bear in mind that the order of the columns in the primary key is important. The first column should be as selective as possible i.e. as 'unique' as possible. Searches on the first column will be able to seek, but searches just on the second column will have to scan, unless there is also a non-clustered index on the second column.
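The effect of column order can be observed with a query plan. The sketch below uses SQLite purely as a stand-in (the answer above is phrased in SQL Server terms, and other engines may plan differently); the Enrollment table is hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Enrollment (
        StudentID INTEGER NOT NULL,
        ClassID   INTEGER NOT NULL,
        Mark      INTEGER,
        PRIMARY KEY (StudentID, ClassID)
    )
""")

def plan(sql):
    """Return the human-readable detail text of each query plan step."""
    # The last column of every EXPLAIN QUERY PLAN row holds the description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Filtering on the leading key column can use the primary key index.
first_column_plan = plan("SELECT * FROM Enrollment WHERE StudentID = 1")
# Filtering only on the second key column cannot seek into that index.
second_column_plan = plan("SELECT * FROM Enrollment WHERE ClassID = 1")

print(first_column_plan)
print(second_column_plan)
```

On the SQLite builds I would expect this to be run on, the first plan reports a SEARCH using the automatic primary key index and the second reports a SCAN of the table.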
I think this is a specialisation of the synthetic key debate (whether to use meaningful keys or an arbitrary synthetic primary key). I come down almost completely on the synthetic key side of this debate for a number of reasons. These are a few of the more pertinent ones:
You have to keep dependent child tables on the end of a foreign key up to date. If you change the value of one of the primary key fields (which can happen - see below) you have to somehow change all of the dependent tables where their PK value includes these fields. This is a bit tricky because changing key values will invalidate FK relationships with child tables so you may (depending on the constraint validation options available on your platform) have to resort to tricks like copying the record to a new one and deleting the old records.
On a deep schema the keys can get quite wide - I've seen 8 columns once.
Changes in primary key values can be troublesome to identify in ETL processes loading off the system. The example I once had occasion to see was an MIS application extracting from an insurance underwriting system. On some occasions a policy entry would be re-used by the customer, changing the policy identifier. This was a part of the primary key of the table. When this happens the warehouse load is not aware of what the old value was so it cannot match the new data to it. The developer had to go searching through audit logs to identify the changed value.
Most of the issues with non-synthetic primary keys revolve around what happens when PK values of records change. The most useful applications of non-synthetic values are where a database schema is intended to be used directly by people, such as an M.I.S. application where report writers are using the tables directly. In this case short values with fixed domains such as currency codes or dates might reasonably be placed directly on the table for convenience.
I would recommend a generated primary key in those cases with a unique not null constraint on the natural composite key.
If you use the natural key as primary then you will most likely have to reference both values in foreign key references to make sure you are identifying the correct record.
Take the example of a table with two candidate keys: one simple (single-column) and one compound (multi-column). Your question in that context seems to be, "What disadvantage may I suffer if I choose to promote one key to be 'primary' and I choose the compound key?"
First, consider whether you actually need to promote a key at all: "the very existence of the PRIMARY KEY in SQL seems to be an historical accident of some kind. According to author Chris Date the earliest incarnations of SQL didn't have any key constraints and PRIMARY KEY was only later added to the SQL standards. The designers of the standard obviously took the term from E.F.Codd who invented it, even though Codd's original notion had been abandoned by that time! (Codd originally proposed that foreign keys must only reference one key - the primary key - but that idea was forgotten and ignored because it was widely recognised as a pointless limitation)." [source: David Portas' Blog: Down with Primary Keys?]
Second, what criteria would you apply to choose which key in a table should be 'primary'?
In SQL, the choice of PRIMARY KEY is arbitrary and product specific. In ACE/Jet (a.k.a. MS Access) the two main and often competing factors are whether you want to use PRIMARY KEY to favour clustering on disk or whether you want the columns comprising the key to appear in bold in the 'Relationships' picture in the MS Access user interface; I'm in the minority by thinking that index strategy trumps pretty picture :) In SQL Server, you can specify the clustered index independently of the PRIMARY KEY and there seems to be no product-specific advantage afforded. The only remaining advantage seems to be the fact that you can omit the columns of the PRIMARY KEY when creating a foreign key in SQL DDL; this is SQL-92 Standard behaviour and anyhow doesn't seem such a big deal to me (perhaps another one of the things they added to the Standard because it was a feature already widespread in SQL products?). So, it's not a case of looking for drawbacks; rather, you should be looking to see what advantage, if any, your SQL product gives the PRIMARY KEY. Put another way, the only drawback to choosing the wrong key is that you may be missing out on a given advantage.
Third, are you rather alluding to using an artificial/synthetic/surrogate key to implement in your physical model a candidate key from your logical model because you are concerned there will be performance penalties if you use the natural key in foreign keys and table joins? That's an entirely different question and largely depends on your 'religious' stance on the issue of natural keys in SQL.
Need more specificity.
Taken too far, it can overcomplicate Inserts (Every key MUST exist) and documentation and your joined reads could be suspect if incomplete.
Sometimes it can indicate a flawed data model (is a composite key REALLY what's described by the data?)
I don't believe there is a performance cost...it just can go really wrong really easily.
when you see it on a diagram, it is less readable
when you use it in a query join, it is less readable
when you use it as a foreign key, you have to add a check constraint requiring that the attributes are either all null or all not null (if only one is null, the key is not checked)
it usually needs more storage when used as a foreign key
some tools don't manage composite keys
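The point about partly-null foreign keys can be demonstrated in SQLite, which (like the SQL standard's default match rule) skips the foreign key check when any column of the child key is NULL. The Event, Booking and CheckedBooking tables are hypothetical; the second child table adds the all-or-nothing CHECK constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Event (
        Place TEXT NOT NULL,
        Name  TEXT NOT NULL,
        PRIMARY KEY (Place, Name)
    );
    -- Composite FK without any extra guard.
    CREATE TABLE Booking (
        BookingId  INTEGER PRIMARY KEY,
        EventPlace TEXT,
        EventName  TEXT,
        FOREIGN KEY (EventPlace, EventName) REFERENCES Event (Place, Name)
    );
    -- Same FK, plus a CHECK that both columns are null or both are not null.
    CREATE TABLE CheckedBooking (
        BookingId  INTEGER PRIMARY KEY,
        EventPlace TEXT,
        EventName  TEXT,
        FOREIGN KEY (EventPlace, EventName) REFERENCES Event (Place, Name),
        CHECK ((EventPlace IS NULL) = (EventName IS NULL))
    );
""")

# No Event rows exist, yet a half-null reference slips past the FK check.
conn.execute(
    "INSERT INTO Booking (EventPlace, EventName) VALUES ('Nowhere', NULL)")
unchecked_rows = conn.execute("SELECT COUNT(*) FROM Booking").fetchone()[0]

# With the CHECK constraint the same row is rejected.
try:
    conn.execute(
        "INSERT INTO CheckedBooking (EventPlace, EventName) "
        "VALUES ('Nowhere', NULL)")
    half_null_accepted = True
except sqlite3.IntegrityError:
    half_null_accepted = False

print(unchecked_rows, half_null_accepted)
```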
The main downside of using a compound primary key is that you will confuse the hell out of typical ORM code generators.