This feels like a very basic question, but I really don't see the obvious answer at the moment.
I have a simple table that maps object ids between two namespaces:
|---------------------|------------------|
| id_in_ns1 | id_in_ns2 |
|---------------------|------------------|
| 1 | 5 |
|---------------------|------------------|
| 2 | 17 |
|---------------------|------------------|
| 3 | NULL |
|---------------------|------------------|
| NULL | 1 |
|---------------------|------------------|
The mapping is basically 1:1, but as you can see, some objects from namespace 1 do not exist in namespace 2, and vice versa, so that there are NULL values in the table.
So, what would be the primary key of this table? As a PK cannot be NULL, I can neither use (id_in_ns1) nor (id_in_ns2) nor the composite.
The only idea I have is to replace NULL by a definite value, say -1, and to use (id_in_ns1, id_in_ns2)as PK. However, this feels not only hackish but also "unnormal" because the non-NULL (or non--1)) value alone is already sufficient to uniquely identify an object.
Only add entries that have a valid id on both sides. This will effectively get rid of all NULL values, allowing you to specify a proper composite key on (id_in_ns1, id_in_ns2).
Ultimately, those are the values that allow you to identify a single row and you will not lose relevant information - a SELECT id_in_ns2 FROM mapping_table WHERE id_in_ns1 = x will return NULL either way, whether there is a (x, NULL) row or not.
If you insist on keeping those NULLs you could add another column with an artificial (auto incrementing) primary key, but that feels as hacky as using -1.
Use a synthetic primary key and use unique constraints for the rest:
create table mapping (
mappingId int auto_increment primary key, -- or whatever for your database
id_in_ns1 int references ns1(id),
id_in_ns2 int references ns2(id),
unique (id_in_ns1),
unique (id_in_ns2)
);
Just one caveat: some databases only allow one NULL value for UNIQUE constraints. You might need to use a filtered unique index instead (or some other mechanism) for this construct.
Take three layers of information:
Layer 1: Information
This layer contains data with UNIQUE natural indexes and a surrogate key that is easily transferrable.
Table Surnames:
+-----------------------------+--------------+
| ID (Auto Increment, PK) | Surname |
+-----------------------------+--------------+
| 1 | Smith |
| 2 | Edwards |
| 3 | Brown |
+-----------------------------+--------------+
Table FirstNames
+-----------------------------+--------------+
| ID (Auto Increment, PK) | FirstName |
+-----------------------------+--------------+
| 1 | John |
| 2 | Bob |
| 3 | Mary |
| 4 | Kate |
+-----------------------------+--------------+
Natural Keys
Alternatively, the two tables above can be without ID and utilize Surname and FirstName as Natural Primary Keys, as explained by Mike Sherrill. In this instance, assume the layer below references varchar rather than int.
Layer 2: People
In this layer a composite index is used. This value can be UNIQUE or PRIMARY, depending on whether a surrogate key is utilized as the Primary Key.
+-----------------+--------------+
| FirstName | LastName |
+-----------------+--------------+
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 3 | 1 |
| 4 | 2 |
| ... | ... |
+-----------------+--------------+
Layer 3: Parents
In this layer, relationships between people are explored through a ParentsOf table.
ParentsOf
+-----------------+-----------------+
| Person | PersonParent |
+-----------------+-----------------+
OR
+-----------------+-----------------+-----------------+-----------------+
| PersonFirstName | PersonSurname | ParentFirstName | ParentSurname |
+-----------------+-----------------+-----------------+-----------------+
The Question
Assuming that referential integrity is VERY important to me at its very core, and I will have FOREIGN KEYS on these indexes so that I keep the database responsible for monitoring its own integrity on this front, and that, if I were to use an ORM, it would be one like Doctrine which has native support for Compound Primary Keys...
Please help me to understand:
The list of trade-offs that take place with utilizing surrogate keys vs. natural keys on the 1st Layer.
The list of trade-offs that take place with utilizing compound keys vs. surrogate keys on the 2nd Layer which can be transferred over to the 3rd Layer.
I am not interested in hearing which is better, because I understand that there are significant disagreements among professionals on this topic and it would be sparking a religious war. Instead, I am asking, very simply and as objectively as is humanly possible, what trade-offs will you be taking by passing surrogate keys to each Layer vs maintaining Primary keys (natural/composite, or surrogate/composite). Anyone will be able to find someone saying NEVER or ALWAYS use surrogate keys on SO and other websites. Instead, a reasoned analyses of trade-offs is what I will most appreciate in your answers.
EDIT: It has been pointed out that a surname example is a poor example for a use of 6NF. For the sake of keeping the question intact, I am going to leave it be. If you are having trouble imagining the use case for this, a better one might be a list of "Grocery Items". AKA:
+-----------------------------+--------------+
| ID (Auto Increment, PK) | Grocery |
+-----------------------------+--------------+
| 1 | Sponges |
| 2 | Tomato Soup |
| 3 | Ice Cream |
| 4 | Lemons |
| 5 | Strawberries |
| 6 | Whipped Cream|
+-----------------------------+--------------+
+-----------------------------+--------------+
| ID (Auto Increment, PK) | Brand |
+-----------------------------+--------------+
| 1 | Bright |
| 2 | Ben & Jerry's|
| 3 | Store Brand |
| 4 | Campbell's |
| 5 | Cool Whip |
+-----------------------------+--------------+
Natural Composite Key Example:
+-----------------------------+--------------+
| Grocery | Brand |
+-----------------------------+--------------+
| Sponges | Bright |
| Ice Cream | Ben & Jerry's|
| Ice Cream | Store Brand |
| Tomato Soup | Campbell's |
| Tomato Soup | Store Brand |
| Lemons | Store Brand |
| Whipped Cream | Cool Whip |
+-----------------------------+--------------+
Recommended Pairings
+-----------------+-----------------+-----------------+-----------------+
| Grocery1 | Brand1 | Grocery2 | Brand2 |
+-----------------+-----------------+-----------------+-----------------+
To reiterate, this is also just an example. This is not how I would recommend proceeding, but it should help to illustrate my question.
There ARE shortfalls to this method. I'll reiterate that this question was to request walking through the benefits and drawbacks of each method below, not to highlight one as better than another. I believe most people were able to look past the questionable nature of this specific example to answer the core question. This edit is for those that cannot.
There are some very good answers below and if you are curious about which direction to go, please read them.
END EDIT
Thank you!
Here's some trade-offs:
Single Surrogate (artificially created):
All child tables foreign keys only need a single column to reference the primary key.
Very easy to update the natural keys in table without needing to update every child table with foreign keys
Smaller primary/foreign key indexes (ie. not a wide) This can make the database run faster, for example when a record is deleted in a parent table, the child tables need to be searched to make sure this will not create orphans. Narrow indexes are faster to scan (just sightly).
you will have more indexes because you most likely will also want to index whatever natural keys exists in the data.
Natural composite keyed tables:
fewer indexes in the database
less columns in the database
easier/faster to insert a ton of records as you will not need to grab the sequence generator
updating one of the keys in the compound requires that every child table also be updated.
Then there is another category: artificial composite primary keys
I've only found one instance where this makes sense. When you need to tag every record in every table for row level security.
For example, suppose you had an database which stored data for 50,000 clients and each client was not supposed to see other client's data--very common in web application development.
If each record was tagged with a client_id field, you are creating a row level security environment. Most databases have the tools to enforce row level security when setup correctly.
First thing to do is setup primary and foreign keys. Normally a table with have an id field as the primary key. By adding client_id the key is now composite key. And it is necessary to carry client_id to all child table.
The composite key is based on 2 surrogate keys and is a bulletproof way to ensure data integrity among clients and within the database a whole.
After this you would create views (or if using Oracle EE setup Virtual Private Database) and other various structures to allow the database to enforce row level security (which is a topic all it own).
Granted that this data structure is no longer normalized to the nth degree. The client_id field in each pk/fk denormalizes an otherwise normal model. The benefit of the model is the ease of enforcing row level security at the database level (which is what databases should do). Every select, insert, update, delete is restricted to whatever client_id your session is currently set. The database has session awareness.
Summary
Surrogate keys are always the safe bet. They require a little more work to setup and require more storage.
The biggest benefit in my opinion is:
Being able to update the PK in one table and all other child tables are instantaneously changed without ever being touched.
When data gets messed up--and it will at some point due to a programming mistake, surrogate keys make the clean up much much easier and in some cases only possible to do because there are surrogate keys.
Query performance is improved as the db is able to search attributes to locate the s.key and then join all child table by a single numeric key.
Natural Keys especially composite NKeys make writing code a pain. When you need to join 4 tables the "where clause" will be much longer (and easier to mess up) than when single SKeys were used.
Surrogate keys are the "safe" route. Natural keys are beneficial in a few places, I'd say around 1% of the tables in a db.
First of all, your second layer can be expressed at least four different ways, and they're all relevant to your question. Below I'm using pseudo-SQL, mainly with PostgreSQL syntax. Certain kinds of queries will require recursion and more than one additional index regardless of the structure, so I won't say any more about that. Using a dbms that supports clustered indexes can affect some decisions here, but don't assume that six joins on clustered indexes will be faster than simply reading values from a single, covering index; test, test, test.
Second, there really aren't many tradeoffs at the first layer. Foreign keys can reference a column declared not null unique in exactly the same way they can reference a column declared primary key. The surrogate key increases the width of the table by 4 bytes; that's trivial for most, but not all, database applications.
Third, correct foreign keys and unique constraints will maintain referential integrity in all four of these designs. (But see below, "About Cascades".)
A. Foreign keys to surrogate keys
create table people (
FirstName integer not null
references FirstNames (ID),
LastName integer not null
references Surnames (ID),
primary key (FirstName, LastName)
);
B. Foreign keys to natural keys
create table people (
FirstName varchar(n) not null
references FirstNames (FirstName),
LastName varchar(n) not null
references Surnames (Surname),
primary key (FirstName, Surname)
);
C. Foreign keys to surrogate keys, additional surrogate key
create table people (
ID serial primary key,
FirstName integer not null
references FirstNames (ID),
LastName integer not null
references Surnames (ID),
unique (FirstName, LastName)
);
D. Foreign keys to natural keys, additional surrogate key
create table people (
ID serial primary key,
FirstName varchar(n) not null
references FirstNames (FirstName),
LastName varchar(n) not null
references Surnames (Surname),
unique (FirstName, Surname)
);
Now let's look at the ParentsOf table.
A. Foreign keys to surrogate keys in A, above
create table ParentsOf (
PersonFirstName integer not null,
PersonSurname integer not null,
foreign key (PersonFirstName, PersonSurname)
references people (FirstName, LastName),
ParentFirstName integer not null,
ParentSurname integer not null,
foreign key (ParentFirstName, ParentSurname)
references people (FirstName, LastName),
primary key (PersonFirstName, PersonSurname, ParentFirstName, ParentSurname)
);
To retrieve the names for a given row, you'll need four joins. You can join directly to the "FirstNames" and "Surnames" tables; you don't need to join through the "People" table to get the names.
B. Foreign keys to natural keys in B, above
create table ParentsOf (
PersonFirstName varchar(n) not null,
PersonSurname varchar(n) not null,
foreign key (PersonFirstName, PersonSurname)
references people (FirstName, LastName),
ParentFirstName varchar(n) not null,
ParentSurname varchar(n) not null,
foreign key (ParentFirstName, ParentSurname)
references people (FirstName, LastName),
primary key (PersonFirstName, PersonSurname, ParentFirstName, ParentSurname)
);
This design needs zero joins to retrieve the names for a given row. Many SQL platforms won't need to read the table at all, because they can get all the data from the index on the primary key.
C. Foreign keys to surrogate keys, additional surrogate key in C, above
create table ParentsOf (
Person integer not null
references People (ID),
PersonParent integer not null
references People (ID),
primary key (Person, PersonParent)
);
To retrieve names, you must join through the "people" table. You'll need a total of six joins.
D. Foreign keys to natural keys, additional surrogate key in D, above
This design has the same structure as in C immediately above. Because the "people" table in D, farther above, has natural keys referencing the tables "FirstNames" and "Surnames", you'll only need two joins to the table "people" to get the names.
About ORMs
ORMs don't build SQL the way a SQL developer writes SQL. If a SQL developer writes a SELECT statement that needs six joins to get the names, an ORM is liable to execute seven simpler queries to get the same data. This might be a problem; it might not.
About Cascades
Surrogate ID numbers make every foreign key reference an implicit, undeclared "ON UPDATE CASCADE". For example, if you run this update statement against your table of surnames . . .
update surnames
set surname = 'Smythe'
where surname = 'Smith';
then all the Smiths will become Smythes. The only way to prevent that is to revoke update permissions on "surnames". Implicit, undeclared "ON UPDATE CASCADE" is not always a Good Thing. Revoking permissions solely to prevent unwanted implicit "cascades" is not always a Good Thing.
Using natural keys can enable simpler, faster queries since one needn't join all the way up the foreign key chain to find the "natural" value e.g. for display on-screen.
I will avoid a pure academic discussion here and look at a few pragmatic considerations given a modern database design typically needs to consider scalability, mobility (disconnected operation), and conflict resolution where choice of key can have a large impact.
Things that can influence your choice are:
how to deal with distinct records which may have the same natural keys. Eg identical firstname and surname.
how does a web or mobile client persist a complex model graph if server assigned surrogate keys are used (requires some kind of mapping layer). Alternative is to avoid the mapping problem and use client assigned v4 UUIDs.
following on from the above, how do you deal with conflict resolution in temporarily disconnected environments such as mobile apps or where clients can peer/share with one another without having to first sync with a server . Object identity is an important concept to support and solve these problems.
scalability through sharding your database can be easy or difficult based on choice of key. Auto incrementing surrogate keys are hard to shard and require choosing a fixed number of shards a-priori so keys don't clash, whilst v4 UUID based surrogate keys are easy and can be client assigned. Composite and natural keys are hard because the key whilst relatively stable may still change and this requires the ability to migrate records from one shard to another.
how do your clients manage object identity? Often user interfaces require building a local graph of models for later persistence to a "server in the cloud". During this time before persistence those objects need identity and after persistence there needs to be an agreement or mapping between the server object identity and client object identity.
Do you force everything above the database (including the application server) to deal with an identity mapping problem or build it into the database key design and also help solve scalabilty/sharding for the db whilst your at it?
My advice is look at the characteristics of the system as a whole and look beyond the theoretical db design to what will work well for the non trivial full stack that sits above the database. The choice you make for key design can make or break usability of the system and help or harm the development complexity, thus increasing or decreasing your resulting time to market and overall tradeoff for quality and reliability.
I once saw this list of criteria for a primary key. I find it a rather good starting point for this kind of discussion
unique
stable (not necessarily immutable)
irreducible
simple
familiar
Sometimes there's a conflict between two or more criteria and we have to compromise between them. Unfortunate, many people never even reflect on how to design the key, they go with some kind of auto generated key, may it be an identity column, a guid or whatever.
One drawback with surrogate keys is that it becomes more difficult to enforce rules declarative (most DBMS don't support sub queries in check constraints). I'm thinking rules like:
CHECK ( jobtitle <> 'BOSS' OR salary > 100 )
However, I find the biggest problem with surrogate keys to be that you can get away with really weird constructions, and not even notice.
A frequent use case you can find in databases is versioning an history
Exemple with an user table:
ID Name Value DeletedFlag
1 Smith 78 0
2 Martin 98 0
3 John 78 1
4 Edouard 54 0
5 John 64 0
John has filled an info, then decided to delete it and fill a new one.
If you do not use a unique pk, you wont be able to manage this situation.
It makes it extremely easy in development and production to mark some data as deleted, and unmark them to do some tests or data corrections, instead of backing up or restoring or getting a lot of confusion.
It also faster to rebuild indexes on integers and it takes less disk space.
If our composite key consists of columns STUDENT and COURSE, the
database will ensure that we never enter duplicate values.
Example.
With composite key, this is not allowed, database will prevent it.
STUDENT COURSE
1 CS101
1 CS101
But if we choose a surrogate key as the key, we will need to find another way to prevent such duplications.
Thinking about
which combinations of fields are possible keys can help you discover and understand the problem better.
I think you have misunderstood something fundamental as regards data:
1) You are taking a single identifier (person name - assuming that does uniquely identify a person), splitting it into subatomic parts then, because of 6NF, putting them into separate relation variables. Often such a split is made for practical reasons and first name/last name is a common example; the decision is usually made on grounds of complexity, frequency, etc of splitting when compared to those of putting the attribute back together again. Here the split is not practical.
2) 6NF is always achievable but not always desirable. In this case, it makes it harder to define a constraint that would be able to verify the parts as being valid in combination (imagine you had split a date by time granules day, month and year and stored each part in separate relvars!).
3) For person identifiers, a compound of first and last names is rarely adequate. Identifiers are usually chosen based on the level of trust required. An employer checks references, qualifications, etc then issues a payroll reference. A police offer might require sight of your driver licence by the roadside but fingerprints will be taken if you are convicted of a crime. A DBMS cannot verify a person and therefore an auto increment integer is rarely adequate either.
Example:
survivor table
owner_id unique_id isDead
|__________ |__________|__________
| foreign key | |
| 1234 ------------> | 1234 | true <--- this row is referenced
|__________ |__________|__________
| |
| 1234 | false <--- this row should be referenced
|__________|__________
Currently the code is like that:
CONSTRAINT `key_2` FOREIGN KEY (`owner_id`) REFERENCES `survivor` (`unique_id`)
I would like to have it function like that:
CONSTRAINT `key_2` FOREIGN KEY (`owner_id`) REFERENCES `survivor` (`unique_id`) WHERE `isDead` = false
but thats obviously not possible like that.
Does anyone have an idea how I could achieve anything like this?
Your unique_id is not actually unique. You should create a key by two columns in your second table (I mean unique_id and isDead). - then it will be unique combination. But you still won't be able to refer that by foreign key properly.
If you need to refer that via foreign key, you should store only valid owners in your second table: for example, delete those, who are not valid. If you want to keep them, you have to update their unique_id according to some algorithm, that preserves unique values and in the same time allows to separate valid values from invalid. Besides this, you'll need to store old values of unique_id - for example, create a new column named old_id and that will be:
+-----------+--------+--------+
| unique_id | isDead | old_id |
+-----------+--------+--------+
| 1234 | false | 1234 |
+-----------+--------+--------+
| -1 | true | 1234 |
+-----------+--------+--------+
as you see, one of possible (but not best) solutions is to set unique_id to negative value. In sample above you will refer to correct row via foreign key and still able to restore connection of owner since old_id is stored. I'm not sure what you're trying to achieve, so in general that could be not an optimal way.
Suppose I have a table, Document, that looks something like this:
_______________________
| Document |
|-----------------------|
| DocumentId int PK |
| (add't fields) |
|_______________________|
Now suppose I have a second table:
_______________________
| DocumentVersion |
|-----------------------|
| DocumentId int PK, FK |
| VersionId int PK |
| (add't fields) |
|_______________________|
Finally, suppose I wish to create a third table that references DocumentVersion, perhaps an audit of users that have accessed each version (assume a User table exists):
_______________________
| VersionAccessLog |
|-----------------------|
| DocumentId int PK, FK |
| VersionId int PK, FK |
| UserId int PK, FK |
| AccessTime DateTime PK|
|_______________________|
Maybe not the best example but hopefully enough to illustrate my question.
Focusing on VersionAccessLog, we have:
PK(DocumentId, VersionId, UserId, AccessTime)
FK(DocumentId, VersionId) REFERENCES DocumentVersion
FK(userId) REFERENCES User
Now my question is, should I also in VersionAccessLog create a FK(DocumentId)? At first glance, the key seems superfluous-- referential integrity is enforced by the FK to DocumentVersion. However, from a relational algebra standpoint, should this key exist or not? From a practical standpoint (assume SQL Server 2012 if necessary), are there any performance implications of including or excluding the key?
There are performance impacts to including the additional key, because the database will check the FK on every insert (child) or delete(parent). From a practical perspective, I generally do not include the extra keys because there really isn't any value in the validations and there is the slight performance cost.
I would make sure that if the schema was ever changed and the FK between DocumentVersion and VersionAccessLog was removed, to add the FK between VersionAccessLog and Document. The only other time I could see adding it is if some Schema aware DAO library was used that reverses Has-a relationships from FK's, and it was worthwhile to include that.
I understand the value of primary keys.
I understand the value of indexes.
Should every MySQL table have an auto-incremented primary key (ideally with INT field type)?
Update
#Raj More's answer seems most efficient. The issue when I think about it, however, is how this auto-incremented primary key ID will relate to other tables. For example:
table 1
ID | firstname | lastname | email
----------------------------------------
1 | john | doe | 1#email.com
2 | sarah | stow | 2#email.com
3 | mike | bro | 3#email.com
table 2
ID | memberid | display | address
--------------------------------------------
1 | 1 | funtime zone | 123 street
2 | 3 | silly place llc | 944 villa dr
In the example above, a consumer may come to the site and choose to register for a free product/service. If the consumer chooses, they are able to give additional information (stored in table 2) for additional mailing, etc. The problem I see is in how these tables relate to the 'primary key auto-incremented field'. In table 2, the 'memberid' relates to table 1's ID but this is not 'extremely' clear. Any new information placed into table 2 will increment by 1 whereas not all consumers will choose to participate in the table 2 required data.
I am not a huge fan of surrogate keys. I have yet to see a scenario where I would prefer to use one for every table of a database.
I would say No.
Read up on this answer: surrogate-vs-natural-business-keys
The above may be seen as sarcastic or flaming (despite the surprisingly many upvotes) so it's deleted.
In the general case, there have been many questions and answers on surrogate and natural keys so I felt this question is more like a duplicate. My view is that surrogate keys are fine and very useful, mainly because natural keys can lead to very big primary keys in the low end of a chain of connected tables - and this is not handled well by many RDBMS, clustered indexes get big, etc. But saying that "every MySQL table should have an auto-incremented primary key" is a very absolute statement and I think there are cases when they really offer little or nothing.
Since the OP updated the question, I'll try to comment on that specific topic.
I think this is exactly a case where an autoincrementing primary key is not only useless but adds negative value. Supposing that table1 and table2 are in 1:1 relationship, the memberid can be both the Primary Key and a Foreign Key to table1.
Adding an autoincrementing id column adds one index and if it's a clustered one (like InnoDB PK indexes) increases the size of the memberid index. Even more, if you have such an auto-incrementing id, some JOIN of table2 to other tables will have to be done using this id (the JOINs to tables in 1:n relation to table2) and some using memberid (the JOINs to tables in 1:n relation to table1). If you only have memberid both these types of JOINs can be
done using memberid.
I am a huge fan of surrogate keys. I have yet to see a scenario where I would prefer not use one.
I would say Yes.
Read up on this answer Surrogate vs. natural/business keys
Edit
I will change my answer to include the following:
There are certain scenarios that I now use the actual value as a surrogate key:
DimDate (20151031, 20151101, 20151102....)
DimZipCode (10001, 10002, 10003...)
Everything else gets Surrogate Keys.
Yes, with one exception:
A table which implements a n:m relationship between two other tables, a pure link table, does not need such a field if it has two fields only, referencing the two primary keys of the linked tables. The primary key of the link table then consists of the two fields.
As soon as the link table has extra information, it needs a simple single-field primary key.
Having said that, there may be more exceptions; database design is a very broad field...
EDIT: Added more in the previous sentence.