Do link tables need a meaningless primary key field? - sql

I am working on a couple of link tables and I got to thinking (Danger Will Robinson, Danger): what are the possible structures of a link table, and what are their pros and cons?
I came up with a few possible structures for the link table:
Traditional 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key
table2fk - foreign key
It's a classic, in most of the books, 'nuff said.
Indexed 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key INDEX ('table1fk')
table2fk - foreign key INDEX ('table2fk')
In my own experience, the fields that you are querying against are not indexed in the traditional model. I have found that indexing the foreign key fields does improve performance as would be expected. Not a major change but a nice optimizing tweak.
Composite key 2 columns ADD PRIMARY KEY ('table1fk' , 'table2fk')
table1fk - foreign key
table2fk - foreign key
With this I use a composite key so that a record from table1 can only be linked to a record on table2 once. Because the key is composite I can add records (1,1), (1,2), (2,2) without any duplication errors.
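A minimal sketch of the composite-key option, assuming SQLite (through Python's built-in sqlite3 module) as a stand-in engine, since the question does not name a database. The table name link is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE link (
        table1fk INTEGER NOT NULL,
        table2fk INTEGER NOT NULL,
        PRIMARY KEY (table1fk, table2fk)
    )
    """
)

# Distinct pairs are accepted even though the individual values repeat.
conn.executemany("INSERT INTO link VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])

# Re-inserting an existing pair violates the composite primary key.
try:
    conn.execute("INSERT INTO link VALUES (1, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
print(duplicate_rejected, row_count)
```

The database itself refuses the repeated pair, so no application-side duplicate check is needed.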
Any potential problems with the composite key 2 columns option? Is there an indexing issue that this might cause? A performance hit? Anything that would disqualify this as a possible option?

I would use composite key, and no extra meaningless key.
I would not use an ORM system that enforces such rules on my db structure.

For true link tables, they typically do not exist as object entities in my object models. Thus the surrogate key is never used. The removal of an item from a collection results in the removal of an item from a link relationship where both foreign keys are known (Person.Siblings.Remove(Sibling) or Person.RemoveSibling(Sibling), which is appropriately translated at the data access layer as usp_Person_RemoveSibling(PersonID, SiblingID)).
As Mike mentioned, if it does become an actual entity in your object model, then it may merit an ID. However, even with addition of temporal factors like effective start and end dates of the relationship and things like that, it's not always clear. For instance, the collection may have an effective date associated at the aggregate level, so the relationship itself may still not become an entity with any exposed properties.
I'd like to add that you might very well need the table indexed both ways on the two foreign key columns.

If this is a true many-to-many join table, then dump the unnecessary id column (unless your ORM requires one; in that case you've got to decide whether your intellect is going to trump your practicality).
But I find that true join tables are pretty rare. It usually isn't long before I start wanting to put some other data in that table. Because of that I almost always model these join tables as entities from the beginning and stick an id in there.

Having a single-column PK can help out a lot in a disaster recovery situation. So while it is correct in theory that you only need the two foreign keys, in practice, when the shit hits the fan, you may want the single-column key. I have never been in a situation where I was screwed because I had a single-column identifier, but I have been in ones where I was screwed because I didn't.

Composite PK and turn off clustering.

I have used a composite key to prevent duplicate entries and let the database handle the exception. With a single key, you are relying on the front-end application to check the database for duplicates before adding a new record.

There is something called identifying and non-identifying relationship. With identifying relationships the FK is a part of the PK in the many-to-many table. For example, say we have tables Person, Company and a many-to-many table Employment. In an identifying relationship both fk PersonID and CompanyID are part of the pk, so we can not repeat PersonID, CompanyID combination.
TABLE Employment(PersonID int (PK,FK), CompanyID int (PK,FK))
Now, suppose we want to capture history of employment, so a person can leave a company, work somewhere else and return to the same company later. The relationship is non-identifying here, combination of PersonID, CompanyID can now repeat, so the table would look something like:
TABLE Employment(EmploymentID int (PK), PersonID int (FK), CompanyID int (FK),
FromDate datetime, ToDate datetime)
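A rough sketch of the non-identifying version, assuming SQLite via Python's sqlite3 module. The Person and Company parent tables are left out for brevity, dates are stored as ISO text because SQLite has no dedicated datetime type, and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employment (
        EmploymentID INTEGER PRIMARY KEY,  -- surrogate key
        PersonID     INTEGER NOT NULL,
        CompanyID    INTEGER NOT NULL,
        FromDate     TEXT    NOT NULL,
        ToDate       TEXT                  -- NULL while the job is current
    )
    """
)

# The same person joins the same company twice, years apart.
stints = [
    (1, 10, "2001-01-01", "2003-06-30"),
    (1, 10, "2008-02-01", None),
]
conn.executemany(
    "INSERT INTO Employment (PersonID, CompanyID, FromDate, ToDate) "
    "VALUES (?, ?, ?, ?)",
    stints,
)

stint_count = conn.execute(
    "SELECT COUNT(*) FROM Employment WHERE PersonID = 1 AND CompanyID = 10"
).fetchone()[0]
print(stint_count)
```

With the two-column composite key from the identifying version, the second insert would have been rejected.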

If you are using an ORM to get to/alter the data, some of them require a single-column primary key (Thank you Tom H for pointing this out) in order to function correctly (I believe Subsonic 2.x was this way, not sure about 3.x).
In my mind, having the primary key doesn't impact performance to any measurable degree, so I usually use it.

If you need to traverse the join table 'in both directions', that is starting with a table1fk or a table2fk key only, you might consider adding a second, reversed, composite index.
ADD KEY ('table2fk', 'table1fk')

The correct answer is:
Primary key is ('table1fk' , 'table2fk')
Another index on ('table2fk' , 'table1fk')
Because:
You don't need an index on table1fk or table2fk alone: the optimiser will use the PK
You'll most likely use the table "both" ways
Adding a surrogate key is only needed because of braindead ORMs
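One way to check the "both ways" claim is to ask the query planner. This is a sketch assuming SQLite through Python's sqlite3 module; the index name idx_link_reverse is made up, and other engines report their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE link (
        table1fk INTEGER NOT NULL,
        table2fk INTEGER NOT NULL,
        PRIMARY KEY (table1fk, table2fk)
    );
    CREATE INDEX idx_link_reverse ON link (table2fk, table1fk);
    """
)


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)  # column 3 holds the detail text


# Lookup starting from table1fk: can be served by the primary key's index.
forward = plan("SELECT table2fk FROM link WHERE table1fk = 1")
# Lookup starting from table2fk: needs the reversed index.
reverse = plan("SELECT table1fk FROM link WHERE table2fk = 1")
print(forward)
print(reverse)
```

Each direction is answered by an index search rather than a scan of the whole table, and neither needs a separate single-column index.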

I've used both. The only benefit of the first model (with uid) is that you can transport the identifier around as a number, whereas with the composite key you would in some cases have to do some string concatenation to transport it around.
I agree that not indexing the foreign keys is a bad idea whichever way you go.

I (almost) always use the additional single-column primary key. This generally makes it easier to build user interfaces, because when a user selects that particular linking entity I can identify with a single integer value rather than having to create and then parse compound identifiers.

Related

SQL many-to-many

I need to build a simple forum (message board) as a school project. But I came across one problem. In the img above there are 2 tables: post and category, which have many-to-many relationship. I made a bridge table, which stores the postKey and categoryKey. Is it a bad practice to create a composite primary key from those 2 keys, or I need something like postCategoryKey? And what else should I improve?
In my opinion, there is no need for postCategoryKey, since it's only a relationship table and you won't access it by postCategoryKey.
I would create the PK using the 2 others FK (postKey and categoryKey).
Hope it helps!
It depends, if you plan later on to add some extra metadata to postCategoryKey in a separate table, then it makes sense.
In your case - I'd go with a composite primary key and get rid of postCategoryKey
You will have to make postKey and categoryKey not null and create a unique constraint on them anyway. That makes them a key for the table, no matter whether you call this the "primary key" or not.
So, there are three options:
Leave this as described with NOT NULL and the unique constraint.
Declare the two columns the table's primary key.
Create an additional column postCategoryKey and make this the primary key.
The decision doesn't really matter. Some companies have a database style convention. In that case it's easy; just follow the company rules. Some people want every table to have a single-column primary key. If so, add that PK column. Some people want bridge tables to have a composite primary key to immediately show what identifies a row. My personal preference is the latter, but any method is as good as the other actually. Just stay consistent in your database.

What would it mean If I change the identifying relationship from this part of a database design to a non-identifying relationship?

I have a question regarding this database design. I am a bit unsure of the difference between identifying and non-identifying relationships in a database leading me to some puzzles in my head.
I have this database design: (kind of like a movie rental stores. "friend" are those who borrow the movie. "studio" is the production studios that collaborated in making the movie.)
I somewhat understand how it works. However, I was wondering what if I create a loan_id in the loan table, and use movie_id and friend_id as normal foreign keys?
Some of my questions are:
What are the advantages or disadvantages of the latter approach?
A situation where the initial or latter model is better?
Does the initial model enable a friend to borrow a movie more than once?
Any thorough explanation would be much appreciated.
The way you have all of your many-to-many tables (tables collaboration, loan, role), is called a composite primary key: Where two (or more) columns form a unique value.
When you have a composite pk, a lot of db designers prefer to create a surrogate primary key (like your proposed loan_id). I'm one of them. This post does a good job going through the arguments of why or why not: Composite primary keys versus unique object ID field.
My relatively simple reason for it is that composite keys tend to grow. Using the loan example, what happens if that movie is loaned more than once? Using the composite approach, you would then have to add loan_date to the composite key.
What if you then wanted to track re-loans of some sort? You would then have to have a 2nd table carrying all the composite pk fields from the loan table (original_loan_movie_id, original_loan_friend_id, original_loan_date) just to refer to the original loan...
In the LOAN table, you'd need to guarantee the following columns are unique:
movie_id (replace with copy_id assuming there are multiple copies of a movie)
friend_id
loan_date
...because I, or anyone else, should be able to rent the same movie more than once. These are also the columns most likely to be searched on...
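A small sketch of that three-column key, assuming SQLite via Python's sqlite3 module. The parent movie and friend tables are omitted, loan_date is stored as ISO text, and the ids and dates are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE loan (
        movie_id  INTEGER NOT NULL,
        friend_id INTEGER NOT NULL,
        loan_date TEXT    NOT NULL,
        PRIMARY KEY (movie_id, friend_id, loan_date)
    )
    """
)

conn.execute("INSERT INTO loan VALUES (7, 3, '2010-01-05')")
# Same movie, same friend, later date: a second loan is accepted.
conn.execute("INSERT INTO loan VALUES (7, 3, '2010-03-12')")

# Repeating an identical (movie, friend, date) triple is rejected.
try:
    conn.execute("INSERT INTO loan VALUES (7, 3, '2010-03-12')")
    same_day_repeat_rejected = False
except sqlite3.IntegrityError:
    same_day_repeat_rejected = True

loans = conn.execute(
    "SELECT COUNT(*) FROM loan WHERE movie_id = 7 AND friend_id = 3"
).fetchone()[0]
print(loans, same_day_repeat_rejected)
```

A key on (movie_id, friend_id) alone would not allow the second loan, which is why the date has to be part of the key in this design.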
With that in mind, I find the idea of defining a column called loan_id as the primary key for the table to be redundant. ORMs have been mandating the use of non-composite primary keys to simplify queries...
But it Makes Queries Easier...
At first glance, it makes deleting or updating a specific loan/etc easier - until you realize that you need to know the applicable id value first. If you have to search for that id value based on a movie, user/friend, and date then you'd have been better off using the criteria directly in the first place.
But Composite keys are Complex...
In this example, a primary key constraint will ensure that the three columns--movie_id, friend_id and loan_date--will be unique and indexed (most DBs these days automatically index primary keys if the clustered index doesn't already exist) using the best index possible for the table.
The lone primary key approach means the loan_id is indexed with the best index for the table (SQL Server & MySQL call them clustered indexes, to Oracle they're all just indexes), and requires an additional composite unique constraint/index. Some databases might require additional indexing beyond the unique constraint... So this makes the data model more involved/complex, and for no benefit:
Some databases, like MySQL, put a limit on the amount of space you can use for indexes.
The primary key gets the most ideal index, yet its value has no relevance to the data in the table, so the index tied to the primary key will seldom, if ever, be used.
Conclusion
I've yet to see a legitimate justification for a single column primary key over a composite primary key.

in general, should every table in a database have an identity field to use as a PK?

I'm running into an issue with a join: getting back too many records. I added a table to the set of joins and the number of rows expanded. Usually when this happens I add a select of all the ID fields that are involved in the join. That way it's pretty obvious where the expansion is happening and I can change the ON of the join to fix it. Except in this case, the table that I added doesn't have an ID field. This is a problem. But perhaps I'm wrong.
Should every table in a database have an IDENTITY field that's used as the PK? Are there any drawbacks to having an ID field in every table? What if you're reasonably sure this table will never be used in a PK/FK relationship?
When having an identity column is not a good idea?
Surrogate vs. natural/business keys
Wikipedia Surrogate Key article
There are two concepts that are close but should not be confused: IDENTITY and PRIMARY KEY
Every table (except for the rare conditions) should have a PRIMARY KEY, that is a value or a set of values that uniquely identify a row.
See here for discussion why.
IDENTITY is a property of a column in SQL Server which means that the column will be filled automatically with incrementing values.
Due to the nature of this property, the values of this column are inherently UNIQUE.
However, no UNIQUE constraint or UNIQUE index is automatically created on an IDENTITY column, and after issuing SET IDENTITY_INSERT ON it's possible to insert duplicate values into an IDENTITY column, unless it has been explicitly UNIQUE constrained.
The IDENTITY column should not necessarily be a PRIMARY KEY, but most often it's used to fill the surrogate PRIMARY KEYs
It may or may not be useful in any particular case.
Therefore, the answer to your question:
The question: should every table in a database have an IDENTITY field that's used as the PK?
is this:
No. There are cases when a database table should NOT have an IDENTITY field as a PRIMARY KEY.
Three cases come into my mind when it's not the best idea to have an IDENTITY as a PRIMARY KEY:
If your PRIMARY KEY is composite (like in many-to-many link tables)
If your PRIMARY KEY is natural (like, a state code)
If your PRIMARY KEY should be unique across databases (in this case you use GUID / UUID / NEWID)
All these cases imply the following condition:
You shouldn't have IDENTITY when you care for the values of your PRIMARY KEY and explicitly insert them into your table.
Update:
Many-to-many link tables should have the pair of id's to the table they link as the composite key.
It's a natural composite key which you already have to use (and make UNIQUE), so there is no point to generate a surrogate key for this.
I don't see why would you want to reference a many-to-many link table from any other table except the tables they link, but let's assume you have such a need.
In this case, you just reference the link table by the composite key.
This query:
CREATE TABLE a (id, data)
CREATE TABLE b (id, data)
CREATE TABLE ab (a_id, b_id, PRIMARY KEY (a_id, b_id))
CREATE TABLE business_rule (id, a_id, b_id, FOREIGN KEY (a_id, b_id) REFERENCES ab)
SELECT *
FROM business_rule br
JOIN a
ON a.id = br.a_id
is much more efficient than this one:
CREATE TABLE a (id, data)
CREATE TABLE b (id, data)
CREATE TABLE ab (id, a_id, b_id, PRIMARY KEY (id), UNIQUE KEY (a_id, b_id))
CREATE TABLE business_rule (id, ab_id, FOREIGN KEY (ab_id) REFERENCES ab)
SELECT *
FROM business_rule br
JOIN ab
ON br.ab_id = ab.id
JOIN a
ON a.id = ab.a_id
, for obvious reasons.
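For anyone who wants to try the composite-key variant, here is a runnable sketch assuming SQLite via Python's sqlite3 module. The column types and sample rows are assumptions, since the pseudo-DDL above leaves them out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript(
    """
    CREATE TABLE a (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE ab (
        a_id INTEGER NOT NULL REFERENCES a (id),
        b_id INTEGER NOT NULL REFERENCES b (id),
        PRIMARY KEY (a_id, b_id)
    );
    CREATE TABLE business_rule (
        id   INTEGER PRIMARY KEY,
        a_id INTEGER NOT NULL,
        b_id INTEGER NOT NULL,
        FOREIGN KEY (a_id, b_id) REFERENCES ab (a_id, b_id)
    );
    INSERT INTO a VALUES (1, 'alpha');
    INSERT INTO b VALUES (1, 'beta');
    INSERT INTO ab VALUES (1, 1);
    INSERT INTO business_rule VALUES (100, 1, 1);
    """
)

# business_rule already carries a_id, so it joins to a without touching ab.
rows = conn.execute(
    "SELECT br.id, a.data FROM business_rule br JOIN a ON a.id = br.a_id"
).fetchall()

# A rule that points at a pair missing from ab is refused by the composite FK.
try:
    conn.execute("INSERT INTO business_rule VALUES (101, 1, 2)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows, orphan_rejected)
```

The join reaches table a in one hop, while the composite foreign key still guarantees that every rule refers to an existing link row.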
Almost always yes. I generally default to including an identity field unless there's a compelling reason not to. I rarely encounter such reasons, and the cost of the identity field is minimal, so generally I include.
Only thing I can think of off the top of my head where I didn't was a highly specialized database that was being used more as a datastore than a relational database where the DBMS was being used for nearly every feature except significant relational modelling. (It was a high volume, high turnover data buffer thing.)
I'm a firm believer that natural keys are often far worse than artificial keys because you often have no control over whether they will change which can cause horrendous data integrity or performance problems.
However, there are some (very few) natural keys that make sense without being an identity field (two-letter state abbreviation comes to mind, it is extremely rare for these official type abbreviations to change.)
Any table which is a join table to model a many to many relationship probably also does not need an additional identity field. Making the two key fields together the primary key will work just fine.
Other than that I would, in general, add an identity field to most other tables unless given a compelling reason in that particular case not to. It is a bad practice to fail to create a primary key on a table or if you are using surrogate keys to fail to place a unique index on the other fields needed to guarantee uniqueness where possible (unless you really enjoy resolving duplicates).
Every table should have some set of field(s) that uniquely identify it. Whether or not there is a numeric identifier field separate from the data fields will depend on the domain you are attempting to model. Not all data easily falls into the 'single numeric id' paradigm, and as such it would be inappropriate to force it. Given that, a lot of data does easily fit in this paradigm and as such would call for such an identifier. There is no one answer to always do X in any programming environment, and this is another example.
If you have modelled, designed, normalised etc, then you will have no identity columns.
You will have identified natural and candidate keys for your tables.
You may decide on a surrogate key because of the physical architecture (eg narrow, numeric, strictly monotonically increasing), say, because using a nvarchar(100) column is not a good idea (still need unique constraint).
Or because of ideology: they appeal to OO developers I've found.
Ok, assume ID columns. As your db gets more complex, say several layers, how can you join parent and grandchild tables directly? You can't: you always need intermediate tables and well-indexed PK-FK columns. With a composite key, it's all there for you...
Don't get me wrong: I use them. But I know why I use them...
Edit:
I'd be interested to collate "always ID"+"no stored procs" matches on one hand, with "use stored procs"+"IDs when they benefit" on the other...
No. Whenever you have a table with an artificial identity column, you also need to identify the natural primary key for the table and ensure that there is a unique constraint on that set of columns too so that you don't get two rows that are identical apart from the meaningless identity column by accident.
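To illustrate the point about duplicates, here is a sketch assuming SQLite via Python's sqlite3 module. The table names link_unguarded and link_guarded are hypothetical; both have a surrogate id, but only the second also constrains the natural key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE link_unguarded (
        id   INTEGER PRIMARY KEY,
        a_id INTEGER NOT NULL,
        b_id INTEGER NOT NULL
    );
    CREATE TABLE link_guarded (
        id   INTEGER PRIMARY KEY,
        a_id INTEGER NOT NULL,
        b_id INTEGER NOT NULL,
        UNIQUE (a_id, b_id)
    );
    """
)

tables = ("link_unguarded", "link_guarded")
for table in tables:
    for _ in range(2):  # try to insert the same pair twice
        try:
            conn.execute(f"INSERT INTO {table} (a_id, b_id) VALUES (1, 2)")
        except sqlite3.IntegrityError:
            pass  # the UNIQUE constraint refused the second copy

counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in tables
}
print(counts)
```

Without the unique constraint, the surrogate id happily distinguishes two rows that mean the same thing.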
Adding an identity column is not cost free. There is an overhead in adding an unnecessary identity column to a table - typically 4 bytes per row of storage for the identity value, plus a whole extra index (which will probably weigh in at 8-12 bytes per row plus overhead). It also takes slightly longer to work out the most cost-effective query plan because there is an extra index per table. Granted, if the table is small and the machine is big, this overhead is not critical - but for the biggest systems, it matters.
Yes, for the vast majority of cases.
Edge cases or exceptions might be things like:
two-way join tables to model m:n relationships
temporary tables used for bulk-inserting huge amounts of data
But other than that, I think there is no good reason against having a primary key to uniquely identify each row in a table, and in my opinion, using an IDENTITY field is one of the best choices (I prefer surrogate keys over natural keys - they're more reliable, stable, never changing etc.).
Marc
I can't think of any drawback to having an ID field in each table, provided the type of your ID field leaves enough space for your table to grow.
However, you don't necessarily need a single field to ensure the identity of your rows.
So no, a single ID field is not mandatory.
Primary and Foreign Keys can consist not only of one field, but of multiple fields. This is typical for tables implementing a N-N relationship.
You can perfectly have PRIMARY KEY (fa, fb) on your table:
CREATE TABLE t(fa INT , fb INT);
ALTER TABLE t ADD PRIMARY KEY(fa , fb);
Recognize the distinction between an Identity field and a key... Every table should have a key, to eliminate the data corruption of inadvertently entering multiple rows that represent the same 'entity'. If the only key a table has is a meaningless surrogate key, then this function is effectively missing.
otoh, No table 'needs' an identity, and certainly not every table benefits from one... Examples are: A table with a short and functional key, a table which does not have any other table referencing it through a foreign Key, or a table which is in a one to zero-or-one relationship with another table... none of these need an Identity
I'd say, if you can find a simple, natural key in your table (i.e. one column), use that as a key instead of an identity column.
I generally give every table some kind of unique identifier, whether it is natural or generated, because then I am guaranteed that every row is uniquely identified somehow.
Personally, I avoid IDENTITY (incrementing identity columns, like 1, 2, 3, 4) columns like the plague. They cause a lot of hassle, especially if you delete rows from that table. I use generated uniqueidentifiers instead if there is no natural key in the table.
Anyway, no idea if this is the accepted practice, just seems right to me. YMMV.

SQL: Do you need an auto-incremental primary key for Many-Many tables?

Say you have a Many-Many table between Artists and Fans. When it comes to designing the table, do you design the table like such:
ArtistFans
ArtistFanID (PK)
ArtistID (FK)
UserID (FK)
(ArtistID and UserID will then be constrained with a Unique Constraint
to prevent duplicate data)
Or do you use a compound PK for the two relevant fields:
ArtistFans
ArtistID (PK)
UserID (PK)
(The need for the separate unique constraint is removed because of the
compound PK)
Are there any advantages (maybe indexing?) for using the former schema?
ArtistFans
ArtistID (PK)
UserID (PK)
The use of an auto incremental PK has no advantages here, even if the parent tables have them.
I'd also create a "reverse PK" index automatically on (UserID, ArtistID) too: you will need it because you'll query the table by both columns.
Autonumber/ID columns have their place. You'd choose them to improve certain things after the normalisation process based on the physical platform. But not for link tables: if your braindead ORM insists, then change ORMs...
Edit, Oct 2012
It's important to note that you'd still need unique (UserID, ArtistID) and (ArtistID, UserID) indexes. Adding an auto-increment column just uses more space (in memory, not just on disk) that shouldn't be used.
Assuming that you're already a devotee of the surrogate key (you're in good company), there's a case to be made for going all the way.
A key point that is sometimes forgotten is that relationships themselves can have properties. Often it's not enough to state that two things are related; you might have to describe the nature of that relationship. In other words, there's nothing special about a relationship table that says it can only have two columns.
If there's nothing special about these tables, why not treat it like every other table and use a surrogate key? If you do end up having to add properties to the table, you'll thank your lucky presentation layers that you don't have to pass around a compound key just to modify those properties.
I wouldn't even call this a rule of thumb, more of a something-to-consider. In my experience, some slim majority of relationships end up carrying around additional data, essentially becoming entities in themselves, worthy of a surrogate key.
The rub is that adding these keys after the fact can be a pain. Whether the cost of the additional column and index is worth the value of preempting this headache, that really depends on the project.
As for me, once bitten, twice shy – I go for the surrogate key out of the gate.
Even if you create an identity column, it doesn't have to be the primary key.
ArtistFans
ArtistFanId
ArtistId (PK)
UserId (PK)
Identity columns can be useful to relate this relation to other relations. For example, if there was a creator table which specified the person who created the artist-user relation, it could have a foreign key on ArtistFanId, instead of the composite ArtistId+UserId primary key.
Also, identity columns are required (or greatly improve the operation of) certain ORM packages.
I cannot think of any reason to use the first form you list. The compound primary key is fine, and having a separate, artificial primary key (along with the unique constraint you need on the foreign keys) will just take more time to compute and space to store.
The standard way is to use the composite primary key. Adding in a separate autoincrement key is just creating a substitute that is already there using what you have. Proper database normalization patterns would look down on using the autoincrement.
Funny how all answers favor variant 2, so I have to dissent and argue for variant 1 ;)
To answer the question in the title: no, you don't need it. But...
Having an auto-incremental or identity column in every table simplifies your data model so that you know that each of your tables always has a single PK column.
As a consequence, every relation (foreign key) from one table to another always consists of a single column for each table.
Further, if you happen to write some application framework for forms, lists, reports, logging etc you only have to deal with tables with a single PK column, which simplifies the complexity of your framework.
Also, an additional id PK column does not cost you very much in terms of disk space (except for billion-record-plus tables).
Of course, I need to mention one downside: in a grandparent-parent-child relation, child will lose its grandparent information and require a JOIN to retrieve it.
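The grandparent downside can be sketched as follows, assuming SQLite via Python's sqlite3 module. All table and column names here are hypothetical, and foreign key clauses are left out to keep the comparison short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Surrogate-key chain: a child row knows only its parent.
    CREATE TABLE grandparent (id INTEGER PRIMARY KEY);
    CREATE TABLE parent (id INTEGER PRIMARY KEY, grandparent_id INTEGER NOT NULL);
    CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER NOT NULL);

    -- Composite-key chain: each level carries its ancestors' keys.
    CREATE TABLE parent_ck (
        grandparent_id INTEGER NOT NULL,
        parent_no      INTEGER NOT NULL,
        PRIMARY KEY (grandparent_id, parent_no)
    );
    CREATE TABLE child_ck (
        grandparent_id INTEGER NOT NULL,
        parent_no      INTEGER NOT NULL,
        child_no       INTEGER NOT NULL,
        PRIMARY KEY (grandparent_id, parent_no, child_no)
    );

    INSERT INTO grandparent VALUES (1);
    INSERT INTO parent VALUES (10, 1);
    INSERT INTO child VALUES (100, 10), (101, 10);
    INSERT INTO parent_ck VALUES (1, 1);
    INSERT INTO child_ck VALUES (1, 1, 1), (1, 1, 2);
    """
)

# Surrogate keys: reaching the grandparent requires a JOIN through parent.
via_join = conn.execute(
    "SELECT COUNT(*) FROM child c "
    "JOIN parent p ON p.id = c.parent_id "
    "WHERE p.grandparent_id = 1"
).fetchone()[0]

# Composite keys: the grandparent's key is already present in the child row.
direct = conn.execute(
    "SELECT COUNT(*) FROM child_ck WHERE grandparent_id = 1"
).fetchone()[0]

print(via_join, direct)
```

Both queries find the same two children; the difference is only whether the intermediate table has to be read.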
In my opinion, in pure SQL an id column is not necessary and should not be used. But for ORM frameworks such as Hibernate, managing many-to-many relations is not simple with compound keys etc., especially if the join table has extra columns.
So if I am going to use an ORM framework on the db, I prefer putting an auto-increment id column on that table, plus a unique constraint across the referencing columns together. And, of course, a not-null constraint if it is required.
Then I treat the table just like any other table in my project.

Should I use Natural Identity Columns without GUID?

Is it ok to define tables primary key with 2 foreign key column that in combination must be unique? And thus not add a new id column (Guid or int)?
Is there a negative to this?
Yes, it's completely OK. Why not? The downside of composite primary keys is that they can be long, and it might be harder to identify a single row uniquely from the application perspective. However, for a couple of integer columns (especially in junction tables), it's a good practice.
Natural versus Artificial primary keys is one of those issues that gets widely debated and IMHO the discussion only seems to see the positions harden.
In my opinion both work so long as the developer knows how to avoid the downside of both. A natural primary key (whether composite or single column) more nearly ensures that duplicate rows are not added to the DB. Whereas with artificial primary keys it is necessary to first check the record is unique (as opposed to the primary key, which being artificial will always be unique). One effective way to achieve this is to have a unique or candidate index on the fields that make the record unique (e.g. the fields that make a candidate for a primary key)
Meanwhile an artificial primary key makes for easy joining. Relations can be made with a single field to single field join. With a composite key the writer of the SQL statement must know how many fields to include in the join.
For some definitions of "ok", yes. As long as you never intend to add additional fields to this intersection table, that'll be fine. However, if you intend to have more fields, it's good practice to have an ID field. It's still fine, come to think of it, but can be more awkward.
Unless, of course, disk space is at a serious premium!
If you look into any database textbook, you will find such tables en masse. This is the default way to define n-to-m relations. For example:
article = (id, title, text)
author = (id, name)
article_author = (article_id, author_id)
Semantically, article_author is not a new entity so you might refrain from defining it as the primary key and instead create it as a normal index with UNIQUE constraint.
Yes, I agree, with "some definition of OK" it is OK. But the moment you decide to reference this composite primary key from somewhere (i.e. move it to a foreign key), it quickly becomes NG (Not Good).