Surrogate key as a foreign key over composite keys - sql

I realise there might be similar questions but I couldn't find one that was close enough for guidance.
Given this spec,
Site
---------------------------
SiteID int identity
Name varchar(50)
Series
---------------------
SiteID int
SeriesCode varchar(6)
...
--SeriesCode will be unique for every unique SiteID
Episode
----------------------
SiteID int
SeriesCode varchar(6)
EpisodeCode varchar(10)
...
my proposed design/implementation is
Site
----------------------------
SiteID int identity
Name varchar(50)
Series
-------------------------------------------
SeriesID int identity, surrogate key
SiteID int natural key
SeriesCode varchar(6) natural key
UNIQUE(SiteID, SeriesCode)
...
Episode
-------------------------------------------
EpisodeID int identity, surrogate key
SeriesID int foreign key
EpisodeCode varchar(10) natural key
...
Anything wrong with this? Is it okay to have the SeriesID surrogate as a foreign* key here? I'm not sure if I'm missing any obvious problems that can arise. Or would it be better to use composite natural keys (SiteID+SeriesCode / SiteID+EpisodeCode)? In essence that'd decouple the Episode table from the Series table and that doesn't sit right for me.
Worth adding is that SeriesCode looks like 'ABCD-1' and EpisodeCode like 'ABCD-1NMO9' in the raw input data that will populate these tables, so that's another thing that could be changed I suppose.
*: "virtual" foreign key, since it's been previously decided by the higher-ups we should not use actual foreign keys

Yes, it all looks fine. The only (minor) point I might make is that unless you have another, fourth child table hanging off of Episode, you probably don't need EpisodeID, as Episode.EpisodeCode is a single-attribute natural key sufficient to identify and locate rows in Episode. There's no harm in leaving it there, of course, but as a general rule I add surrogate keys to act as targets for FKs in child tables, and try to add a natural key to every table to identify and control redundant data rows. So if a table has no other table with a FK referencing it (and never will), I sometimes don't bother including a surrogate key in it.
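A minimal sketch of the proposed design, assuming SQLite (driven from Python's sqlite3 module) as a stand-in for the real database. INTEGER PRIMARY KEY plays the role of "int identity", no foreign key constraints are declared (matching the "virtual" foreign key in the question), and the UNIQUE constraint on Episode is an assumption, not something the question specifies:

```python
import sqlite3

# SQLite stands in for the real RDBMS here; INTEGER PRIMARY KEY plays the
# role of "int identity". No FOREIGN KEY constraints are declared, matching
# the "virtual" foreign keys described in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Site (
    SiteID INTEGER PRIMARY KEY,
    Name   VARCHAR(50)
);
CREATE TABLE Series (
    SeriesID   INTEGER PRIMARY KEY,      -- surrogate key
    SiteID     INTEGER NOT NULL,         -- natural key, part 1
    SeriesCode VARCHAR(6) NOT NULL,      -- natural key, part 2
    UNIQUE (SiteID, SeriesCode)
);
CREATE TABLE Episode (
    EpisodeID   INTEGER PRIMARY KEY,     -- surrogate key
    SeriesID    INTEGER NOT NULL,        -- "virtual" FK to Series.SeriesID
    EpisodeCode VARCHAR(10) NOT NULL,    -- sized to fit 'ABCD-1NMO9'
    UNIQUE (SeriesID, EpisodeCode)       -- assumed natural-key constraint
);
""")

site_id = conn.execute("INSERT INTO Site (Name) VALUES ('Example site')").lastrowid
series_id = conn.execute(
    "INSERT INTO Series (SiteID, SeriesCode) VALUES (?, ?)", (site_id, "ABCD-1")
).lastrowid
conn.execute(
    "INSERT INTO Episode (SeriesID, EpisodeCode) VALUES (?, ?)",
    (series_id, "ABCD-1NMO9"),
)


def episode_codes(site, series_code):
    """Resolve the natural key (SiteID, SeriesCode) to episodes via the surrogate."""
    rows = conn.execute(
        """
        SELECT e.EpisodeCode
        FROM Episode AS e
        JOIN Series AS s ON s.SeriesID = e.SeriesID
        WHERE s.SiteID = ? AND s.SeriesCode = ?
        ORDER BY e.EpisodeCode
        """,
        (site, series_code),
    ).fetchall()
    return [code for (code,) in rows]
```

The natural key stays enforced by UNIQUE(SiteID, SeriesCode), while child rows carry only the narrow SeriesID.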

What's a "virtual" foreign key? Did the higher-ups decide not to use foreign key constraints? In that case, you're not using foreign keys at all. You're just pretending to.
And is Episode the best choice for an entity? Doesn't it really mean Show or Podcast or so, and just happens to always be part of a series right now? If so, will that change in the future? Will Episode eventually be abused to encompass Show outside of a Series? In that case, tying Episode to Site via Series might come back to haunt you.
Given all that, and assuming that you as a grunt probably can't change any of it: if I were you, I'd feel safer using natural keys wherever possible. In the absence of foreign key constraints, it makes recognizing bad data easier, and if you have to resort to some SeriesCode='EMPTY' trickery later on, that's easier with natural keys, too.
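Without declared constraints, nothing stops an Episode row from pointing at a SeriesID that does not exist. One way to surface such rows is a periodic orphan check. The sketch below assumes SQLite via Python's sqlite3 and uses made-up sample rows; it is an illustration, not part of the original design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series (SeriesID INTEGER PRIMARY KEY, SiteID INTEGER, SeriesCode TEXT);
CREATE TABLE Episode (EpisodeID INTEGER PRIMARY KEY, SeriesID INTEGER, EpisodeCode TEXT);
INSERT INTO Series (SeriesID, SiteID, SeriesCode) VALUES (1, 1, 'ABCD-1');
INSERT INTO Episode (SeriesID, EpisodeCode) VALUES (1, 'ABCD-1NMO9');
-- made-up bad row: there is no Series with SeriesID 99
INSERT INTO Episode (SeriesID, EpisodeCode) VALUES (99, 'WXYZ-2AAA1');
""")


def orphaned_episodes(connection):
    """Return the codes of Episode rows whose SeriesID matches no Series row."""
    rows = connection.execute(
        """
        SELECT e.EpisodeCode
        FROM Episode AS e
        LEFT JOIN Series AS s ON s.SeriesID = e.SeriesID
        WHERE s.SeriesID IS NULL
        ORDER BY e.EpisodeCode
        """
    ).fetchall()
    return [code for (code,) in rows]
```

With a surrogate key, the number 99 says nothing on its own, so a join like this is needed to spot the problem; with natural keys the mismatch is often visible in the row itself.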

My suggestion:
Use the natural/business key as the primary key whenever possible, except in the following 3 situations:
1. The natural/business key is unknown at the moment of inserting.
2. The natural/business key is not good (it's not unique, or it's liable to change frequently).
3. The natural/business key is a composite of more than 3 columns and the table will have child tables.
In situations 1 and 2 a surrogate key is required.
In situation 3 a surrogate key is strongly recommended.

Related

Identity field and primary key in SQL Server when values are unique

When a set of values that will be stored in a table have a name or a code that should be unique across the system, should it be created with a primary key of ID auto increment (int)?
Take the situation of State Abbreviations. Other than consistency, what would be the purpose of an ID on the table that was the primary key other than the state name or abbreviation?
If for example the foreign key from an shipping address referenced the state abbreviation that is not mutable then ... is there a purpose for having an auto increment int ID?
You highlighted one positive aspect of a separate table: consistency. It is much easier to have this:
CREATE TABLE dbo.States
(
StateID TINYINT PRIMARY KEY,
Name VARCHAR(32),
Abbreviation CHAR(2)
);
CREATE TABLE dbo.CustomerAddresses
(
AddressID INT PRIMARY KEY,
...,
StateID TINYINT NOT NULL FOREIGN KEY REFERENCES dbo.States(StateID)
);
Than to have a trigger or check constraint like:
CHECK (StateAbbreviation IN ('AL', 'AK' /* , 50+ more states/territories... */))
Now, with something static and small like a 2-character state abbreviation, this design might make more sense, eliminating some unnecessary mapping between the abbreviations and some surrogate ID:
CREATE TABLE dbo.States
(
Abbreviation CHAR(2) PRIMARY KEY,
Name VARCHAR(32)
);
CREATE TABLE dbo.CustomerAddresses
(
AddressID INT PRIMARY KEY,
...,
StateAbbreviation CHAR(2) FOREIGN KEY REFERENCES dbo.States(Abbreviation)
);
This constrains the data to the known set of states, allows you to store the actual data in the table (which can eliminate a lot of joins in queries), actually saves you some space, and avoids having any messy hard-coded check constraints (or constraints using UDFs, or triggers validating the data).
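A small runnable sketch of the natural-key variant, assuming SQLite via Python's sqlite3 (where foreign keys must be switched on per connection) rather than SQL Server. It shows the lookup table rejecting an unknown abbreviation and the abbreviation being readable without a join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE States (
    Abbreviation CHAR(2) PRIMARY KEY,
    Name         VARCHAR(32)
);
CREATE TABLE CustomerAddresses (
    AddressID         INTEGER PRIMARY KEY,
    StateAbbreviation CHAR(2) REFERENCES States(Abbreviation)
);
INSERT INTO States (Abbreviation, Name) VALUES ('AL', 'Alabama'), ('AK', 'Alaska');
""")


def try_insert_address(state):
    """Insert an address; return False if the state is not in the lookup table."""
    try:
        conn.execute(
            "INSERT INTO CustomerAddresses (StateAbbreviation) VALUES (?)", (state,)
        )
        return True
    except sqlite3.IntegrityError:
        return False


def stored_states():
    """Read the abbreviations straight from the child table, with no join."""
    rows = conn.execute(
        "SELECT StateAbbreviation FROM CustomerAddresses ORDER BY AddressID"
    ).fetchall()
    return [abbr for (abbr,) in rows]
```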
That all said, there is no magic blanket answer that satisfies all designs. As your string gets larger, it can make more sense to use an integer instead of just storing the string. A counter-example would be storing all of the User Agent strings from your web logs - it makes a lot of sense to store the same string once and assign an integer to it, than to store the same 255-character string over and over and over again.
Other things that can make this design troublesome:
What if you expand beyond the US later?
Forget about state abbreviations for a moment (which are pretty static); what if your lookups are things that do change frequently?
State Abbreviation is a rare example of a good non-increment primary key for the following reasons:
They are small (2-character)
They don't change
The set of values is relatively static - new records are unlikely
Just because the natural key is unique doesn't make it a good candidate for the primary key.
Even real-world values that are unique (like SSN) may not be good candidates if they are entered by humans. For example, suppose someone enters a bunch of related data for a person, then gets a letter saying that the SSN is wrong: now you can't just update the primary key, you need to update all of the foreign keys as well!
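Where foreign keys are declared, some databases can propagate such a correction with ON UPDATE CASCADE. The sketch below assumes SQLite via Python's sqlite3, hypothetical Person and Account tables, and obviously fake SSN values. Note that every referencing row is still rewritten, which is the cost described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE Person (
    SSN  TEXT PRIMARY KEY,               -- natural key typed in by a human
    Name TEXT NOT NULL
);
CREATE TABLE Account (
    AccountID INTEGER PRIMARY KEY,
    SSN       TEXT NOT NULL REFERENCES Person(SSN) ON UPDATE CASCADE
);
INSERT INTO Person (SSN, Name) VALUES ('000-00-0001', 'Pat Example');
INSERT INTO Account (SSN) VALUES ('000-00-0001'), ('000-00-0001');
""")


def correct_ssn(old, new):
    """Fix a mistyped SSN; the cascade rewrites every referencing row."""
    conn.execute("UPDATE Person SET SSN = ? WHERE SSN = ?", (new, old))
    return conn.execute(
        "SELECT COUNT(*) FROM Account WHERE SSN = ?", (new,)
    ).fetchone()[0]


rewritten_accounts = correct_ssn("000-00-0001", "000-00-0002")
```

With a surrogate PersonID as the key, the same correction would touch a single row in Person.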
As a general rule (which may not apply in every single case), it's better to use integers as primary keys for performance reasons. So if your unique key is a string, create an autoincrement primary key.
Also, states don't have to be necessarily unique. It's true in one country but when you look at all countries in the world, same abbreviations may happen.
EDIT
I can't find very good evidence on string vs. integer performance, but take a look e.g. here: Strings as Primary Keys in SQL Database
Having said that, there's never a lot of states so performance gain will be small in this case.

Surrogate key 'preference' explanation

As I understand it, there is a war going on between natural key purists and surrogate key purists. In posts like this one (there are more), people say 'natural key is bad for you, always use surrogate'...
However, either I am stupid or blind, but I cannot see a reason to always have a surrogate key!
Say you have 3 tables in configuration like this:
Why would I need a surrogate key for it?? I mean it makes perfect sense not to have it.
Also, can someone please explain why primary keys should never change according to surrogate key purists? I mean, if I have say color_id VARCHAR(30) and a key is black, and I no longer need black because I am changing it to charcoal, why is it a bad idea to change black key to charcoal and all referencing columns too?
EDIT: Just noticed that I don't even need to change it! Just create a new one, change the referencing columns (same as I would have to do with a surrogate key) and leave the old one in peace...
In surrogate key mantra I need to then create additional entry with, say, id=232 and name=black. How does that benefit me really? I have a spare key in table which I don't need any more. Also I need to join to get a colour name while otherwise I can stay in one table and be merry?
Please explain like to a 5 year old, and please keep in mind that I am not trying to say 'surrogate key is bad', I am trying to understand why would someone say things like 'always use surrogate key!'.
Surrogate keys are useful where there is a suboptimal natural key: no more, no less.
A suboptimal natural key would be a GUID or varchar or otherwise wide/non-ordered.
However, the decision to use a surrogate is an implementation decision after the conceptual and logical modelling process, based on knowledge of how the chosen RDBMS works.
However, this best practice of "have a surrogate key" has now become "always have a surrogate key", and it is introduced immediately. Object-relational mappers also often add surrogate keys to all tables whether needed or not, which doesn't help.
For a link (many-many) table, you don't need one: SQL: Do you need an auto-incremental primary key for Many-Many tables?. For a table with 2 int columns, the overhead is an extra 50% of data for a surrogate column (assuming ints and ignoring row metadata)
Well, I am more on the natural key side myself :)
But surrogate keys can have their advantages, even if you, like me, want to go "natural" all the way :)
For example, I have a table that, due to various constraints, has to be defined as being dependent on others. Something like
Table Fee (
foreign_key1,
foreign_key2,
foreign_key3,
value)
The record is defined/identified by the three foreign keys, but at the same time, at most 2 of them can be null.
So you cannot make them the primary key (you'll just put a unique constraint on the 3 columns).
In order to have a primary key on that table, the only way to do that is to use a surrogate :)
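A sketch of that Fee table, assuming SQLite via Python's sqlite3. The fee_id name and the CHECK constraint (at least one reference present) are illustrative assumptions. Be aware that how UNIQUE treats rows containing NULLs differs between database products, so the uniqueness rule for partially filled rows may need a different mechanism in your RDBMS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Fee (
    fee_id       INTEGER PRIMARY KEY,    -- surrogate key (hypothetical name)
    foreign_key1 INTEGER,
    foreign_key2 INTEGER,
    foreign_key3 INTEGER,
    value        NUMERIC NOT NULL,
    UNIQUE (foreign_key1, foreign_key2, foreign_key3),
    -- at most two of the three references may be missing
    CHECK (foreign_key1 IS NOT NULL
        OR foreign_key2 IS NOT NULL
        OR foreign_key3 IS NOT NULL)
);
""")


def add_fee(fk1, fk2, fk3, value):
    """Insert a fee row; return its surrogate id, or None if a constraint rejects it."""
    try:
        cur = conn.execute(
            "INSERT INTO Fee (foreign_key1, foreign_key2, foreign_key3, value) "
            "VALUES (?, ?, ?, ?)",
            (fk1, fk2, fk3, value),
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None
```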
Now... why not to change a primary key... This can be considered pretty philosophical... I see it in this way, hope it will make sense...
A primary key, in itself, is not only a combination of unique + not null; it is more about "the real essence of the record", what defines the record at its core.
In that sense, it is not something you could change easily, could you?
Consider yourself as an example. You have a nickname, but it does not define what you really are. You could change it, but the essence of being yourself would not change.
Now, if you maintain the nickname, but change your essence... would it still be the same person? Nope, it would make more sense to consider it a "new" person... And for records it's the same...
So that's why you usually do not change the primary key and define a new record from scratch
Remember that a surrogate key is an additional column on top of the table's actual columns. Take a table with columns like these:
patient_name
address
mobile_no
email_address
Imagine we are recording patient admissions. We could try mobile_no as the primary key, but some people might not have a mobile number, so instead add a surrogate key and make it the primary key, keeping mobile_no and patient_name as the natural identifying columns. Then if a mobile number changes, it's no problem: we can still find the row with the help of the surrogate key.
Like below, the surrogate key sits on top of the actual data:
patient_no -----> primary key [surrogate key]
patient_name ----> natural key
address
mobile_no ----> natural key
email_address

Relational database design question - Surrogate-key or Natural-key?

Which one is the best practice and Why?
a) Type Table, Surrogate/Artificial Key
Foreign key is from user.type to type.id:
b) Type Table, Natural Key
Foreign key is from user.type to type.typeName:
I believe that in practice, using a natural key is rarely the best option. I would probably go for the surrogate key approach as in your first example.
The following are the main disadvantages of the natural key approach:
You might have an incorrect type name, or you may simply want to rename the type. To edit it, you would have to update all the tables that would be using it as a foreign key.
An index on an int field will be much more compact than one on a varchar field.
In some cases, it might be difficult to have a unique natural key, and this is necessary since it will be used as a primary key. This might not apply in your case.
The first one is more future proof, because it allows you to change the string representing the type without updating the whole user table. In other words you use a surrogate key, an additional immutable identifier introduced for the sake of flexibility.
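To make the renaming argument concrete, here is a sketch of option (a), assuming SQLite via Python's sqlite3 and made-up sample rows. Renaming a type is a single-row update in the type table, and the user table is left alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE type (
    id       INTEGER PRIMARY KEY,        -- surrogate key
    typeName TEXT NOT NULL UNIQUE        -- natural key, still enforced
);
CREATE TABLE "user" (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type INTEGER NOT NULL REFERENCES type(id)
);
INSERT INTO type (id, typeName) VALUES (1, 'admin'), (2, 'guest');
INSERT INTO "user" (name, type) VALUES ('alice', 1), ('bob', 2), ('carol', 1);
""")

# Rename the type: exactly one row changes, no user rows are rewritten.
rows_touched = conn.execute(
    "UPDATE type SET typeName = 'administrator' WHERE id = 1"
).rowcount


def users_of_type(type_name):
    """Look users up by type name; this is the join the surrogate key requires."""
    rows = conn.execute(
        """
        SELECT u.name
        FROM "user" AS u
        JOIN type AS t ON t.id = u.type
        WHERE t.typeName = ?
        ORDER BY u.name
        """,
        (type_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

The trade-off is visible in users_of_type: reading the name back always costs a join.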
A good reason to use a surrogate key (instead of a natural key like name) is when the natural key isn't really a good choice in terms of uniqueness. In my lifetime i've known no fewer than 4 "Chris Smith"s. Person names are not unique.
I prefer to use the surrogate key. Often people will identify and use the natural key, which will be fine for a while, until they decide they want to change the value. Then problems start.
You should probably always use an ID number (that way, if you change the type name, you don't need to update the user table). It also allows you to keep your data size down, as a table full of INTs is much smaller than one full of 45-character varchars.
If typeName is a natural key, then it's probably the preferable option, because it won't require a join to get the value.
You should only really use a surrogate key (id) when the name is likely to change.
Surrogate key for me too, please.
The other might be easier when you need to bang out some code, but it will eventually be harder. Back in the day, my tech boss decided using an email addr as a primary key was a good idea. Needless to say, when people wanted to change their addresses it really sucked.
Use natural keys whenever they work. Names usually don't work. They are too mutable.
If you are inventing your own data, you might as well invent a synthetic key. If you are building a database of data provided by other people or their software, analyze the source data to see how they identify things that need identification.
If they are managing data at all well, they will have natural keys that work for the important stuff. For the unimportant stuff, suit yourself.
Well, I think a surrogate key is helpful when you don't have any uniquely identifying column whose value is meaningful enough to be the primary key as is. Moreover, a surrogate key is easier to implement and has less overhead to maintain.
But on the other hand, a surrogate key sometimes adds extra cost by requiring table joins.
Think about 'User'. I have
UserId varchar(20), ID int, Name varchar(200)
as the table structure.
Now consider that I want to track, across many tables, who is inserting records. If I use ID as the primary key, then [1,2,3,4,5..] etc. will be stored in the referencing tables, and whenever I need to know who inserted the data I have to join to the User table, because 1,2,3,4,5,6 is meaningless. But if I use UserId as the primary key, which is uniquely identifying, then [john, annie, nadia, linda123] etc. will be saved in the referencing tables, which is often easily distinguishable and meaningful, so I need not join to the User table every time I query.
But mind you, it takes some extra physical space, since a varchar stored in the referencing tables takes extra bytes, and of course indexing has a significant performance impact, where int performs better than varchar.
Surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to the table.
The only requirement for a surrogate primary key is that it is unique for each row in the table.
It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult.

Do link tables need a meaningless primary key field?

I am working on a couple of link tables and I got to thinking (Danger Will Robinson, Danger): what are the possible structures of a link table, and what are their pros and cons?
I came up with a few possible structures for the link table:
Traditional 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key
table2fk - foreign key
It's a classic, in most of the books, 'nuff said.
Indexed 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key INDEX ('table1fk')
table2fk - foreign key INDEX ('table2fk')
In my own experience, the fields that you are querying against are not indexed in the traditional model. I have found that indexing the foreign key fields does improve performance as would be expected. Not a major change but a nice optimizing tweak.
Composite key 2 columns ADD PRIMARY KEY ('table1fk' , 'table2fk')
table1fk - foreign key
table2fk - foreign key
With this I use a composite key so that a record from table1 can only be linked to a record on table2 once. Because the key is composite I can add records (1,1), (1,2), (2,2) without any duplication errors.
Any potential problems with the composite key 2 columns option? Is there an indexing issue that this might cause? A performance hit? Anything that would disqualify this as a possible option?
I would use composite key, and no extra meaningless key.
I would not use an ORM system that enforces such rules on my db structure.
For true link tables, they typically do not exist as object entities in my object models. Thus the surrogate key is not ever used. The removal of an item from a collection results in the removal of an item from a link relationship where both foreign keys are known (Person.Siblings.Remove(Sibling) or Person.RemoveSibling(Sibling), which is appropriately translated at the data access layer as usp_Person_RemoveSibling(PersonID, SiblingID)).
As Mike mentioned, if it does become an actual entity in your object model, then it may merit an ID. However, even with addition of temporal factors like effective start and end dates of the relationship and things like that, it's not always clear. For instance, the collection may have an effective date associated at the aggregate level, so the relationship itself may still not become an entity with any exposed properties.
I'd like to add that you might very well need the table indexed both ways on the two foreign key columns.
If this is a true many-to-many join table, then dump the unnecessary id column (unless your ORM requires one; in that case you've got to decide whether your intellect is going to trump your practicality).
But I find that true join tables are pretty rare. It usually isn't long before I start wanting to put some other data in that table. Because of that I almost always model these join tables as entities from the beginning and stick an id in there.
Having a single-column PK can help out a lot in a disaster recovery situation. So while it's correct in theory that you only need the 2 foreign keys, in practice, when the shit hits the fan, you may want the single-column key. I have never been in a situation where I was screwed because I had a single-column identifier, but I have been in ones where I was screwed because I didn't.
Composite PK and turn off clustering.
I have used a composite key to prevent duplicate entries and let the database handle the exception. With a single key, you are relying on the front-end application to check the database for duplicates before adding a new record.
There is something called identifying and non-identifying relationship. With identifying relationships the FK is a part of the PK in the many-to-many table. For example, say we have tables Person, Company and a many-to-many table Employment. In an identifying relationship both fk PersonID and CompanyID are part of the pk, so we can not repeat PersonID, CompanyID combination.
TABLE Employment(PersonID int (PK,FK), CompanyID int (PK,FK))
Now, suppose we want to capture history of employment, so a person can leave a company, work somewhere else and return to the same company later. The relationship is non-identifying here, combination of PersonID, CompanyID can now repeat, so the table would look something like:
TABLE Employment(EmploymentID int (PK), PersonID int (FK), CompanyID int (FK),
FromDate datetime, ToDate datetime)
If you are using an ORM to get to/alter the data, some of them require a single-column primary key (Thank you Tom H for pointing this out) in order to function correctly (I believe Subsonic 2.x was this way, not sure about 3.x).
In my mind, having the primary key doesn't impact performance to any measurable degree, so I usually use it.
If you need to traverse the join table 'in both directions', that is starting with a table1fk or a table2fk key only, you might consider adding a second, reversed, composite index.
ADD KEY ('table2fk', 'table1fk')
The correct answer is:
Primary key is ('table1fk' , 'table2fk')
Another index on ('table2fk' , 'table1fk')
Because:
You don't need an index on table1fk or table2fk alone: the optimiser will use the PK
You'll most likely use the table "both" ways
Adding a surrogate key is only needed because of braindead ORMs
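The two-index layout described above can be sketched as follows, assuming SQLite via Python's sqlite3 and a hypothetical table name link. The composite primary key rejects duplicate pairs, and the reversed index serves lookups that start from table2fk:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE link (
    table1fk INTEGER NOT NULL,
    table2fk INTEGER NOT NULL,
    PRIMARY KEY (table1fk, table2fk)     -- one link per pair, no id column
);
-- reversed index for lookups that start from table2fk
CREATE INDEX link_reverse ON link (table2fk, table1fk);
INSERT INTO link (table1fk, table2fk) VALUES (1, 1), (1, 2), (2, 2);
""")


def linked_from_table1(table1_id):
    """All table2 ids linked to one table1 row (served by the primary key)."""
    rows = conn.execute(
        "SELECT table2fk FROM link WHERE table1fk = ? ORDER BY table2fk",
        (table1_id,),
    ).fetchall()
    return [r[0] for r in rows]


def linked_from_table2(table2_id):
    """All table1 ids linked to one table2 row (served by the reversed index)."""
    rows = conn.execute(
        "SELECT table1fk FROM link WHERE table2fk = ? ORDER BY table1fk",
        (table2_id,),
    ).fetchall()
    return [r[0] for r in rows]


def add_link(table1_id, table2_id):
    """Insert a link; return False if the pair already exists."""
    try:
        conn.execute(
            "INSERT INTO link (table1fk, table2fk) VALUES (?, ?)",
            (table1_id, table2_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```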
I've used both. The only benefit of using the first model (with uid) is that you can transport the identifier around as a number, whereas in some cases you would have to do some string concatenation with the composite key to transport it around.
I agree that not indexing the foreign keys is a bad idea whichever way you go.
I (almost) always use the additional single-column primary key. This generally makes it easier to build user interfaces, because when a user selects that particular linking entity I can identify with a single integer value rather than having to create and then parse compound identifiers.

SQL: To primary key or not to primary key?

I have a table with sets of settings for users, it has the following columns:
UserID INT
Set VARCHAR(50)
Key VARCHAR(50)
Value NVARCHAR(MAX)
TimeStamp DATETIME
UserID together with Set and Key are unique. So a specific user cannot have two of the same keys in a particular set of settings. The settings are retrieved by set, so if a user requests a certain key from a certain set, the whole set is downloaded, so that the next time a key from the same set is needed, it doesn't have to go to the database.
Should I create a primary key on all three columns (userid, set, and key), or should I create an extra field that holds the primary key (for example an autoincrement integer called SettingID; bad idea, I guess), or not create a primary key and just create a unique index?
----- UPDATE -----
Just to clear things up: this is an end-of-the-line table; it is not joined in any way. UserID is a FK to the Users table. Set is not a FK. It is pretty much a helper table for my GUI.
Just as an example: the first time users visit parts of the website, they get a help balloon, which they can close if they want. Once they click it away, I will add a setting to the "GettingStarted" set stating that help balloon X has been disabled. The next time the user comes to the same page, the setting will state that help balloon X should not be shown anymore.
Having composite unique keys is mostly not a good idea.
Having any business-relevant data as a primary key can also cause you trouble, for instance if you need to change the value. Even if the application cannot change the value today, it might in the future, or the value may have to be changed in an upgrade script.
It's best to create a surrogate key, an automatic number which does not have any business meaning.
Edit after your update:
In this case, you can think of the table as conceptually having no primary key, and make these three columns either the primary key or a composite unique key (to make it changeable).
Should I create a primary key on all three columns (userid, set, and key)
Make this one.
Using surrogate primary key will result in an extra column which is not used for other purposes.
Creating a UNIQUE INDEX along with a surrogate primary key is the same as creating a non-clustered PRIMARY KEY, and will result in an extra KEY lookup, which is worse for performance.
Creating a UNIQUE INDEX without a PRIMARY KEY will result in a HEAP-organized table which will need an extra RID lookup to access the values: also not very good.
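A sketch of the composite-primary-key option, assuming SQLite via Python's sqlite3 and a hypothetical table name UserSettings. WITHOUT ROWID is SQLite's way of storing rows in primary-key order, used here only as a rough analogue of a clustered primary key; the lookup-cost details above are specific to SQL Server and are not reproduced by this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserSettings (              -- hypothetical table name
    UserID      INTEGER NOT NULL,
    "Set"       VARCHAR(50) NOT NULL,    -- quoted: SET and KEY are SQL keywords
    "Key"       VARCHAR(50) NOT NULL,
    "Value"     TEXT,
    "TimeStamp" TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (UserID, "Set", "Key")
) WITHOUT ROWID;                         -- rows stored in primary-key order
""")


def save_setting(user_id, set_name, key, value):
    """Insert or overwrite one setting, addressed by the composite key."""
    conn.execute(
        'INSERT OR REPLACE INTO UserSettings (UserID, "Set", "Key", "Value") '
        "VALUES (?, ?, ?, ?)",
        (user_id, set_name, key, value),
    )


def load_set(user_id, set_name):
    """Fetch a whole set at once, as the question describes."""
    rows = conn.execute(
        'SELECT "Key", "Value" FROM UserSettings WHERE UserID = ? AND "Set" = ?',
        (user_id, set_name),
    ).fetchall()
    return dict(rows)
```

Because (UserID, Set) is a prefix of the key, fetching one user's set reads a contiguous range of the primary key.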
How many Keys and Sets do you have? Do these need to be varchar(50), or can they point to a lookup table? If you can convert Set and Key into SetId and KeyId, then you can create your primary key on the 3 integer values, which will be much faster.
I would probably try to make sure that UserID was a unique identifier, rather than having duplicates of UserID throughout the code. Composite keys tend to get confusing later on in your code's life.
I'm assuming this is a lookup field for config values of some kind, so you could probably go with the composite key if this is the case. The data is already there. You can guarantee its uniqueness using the primary key. If you change your mind and decide later that it isn't appropriate for you, you can easily add a SettingId and make the original composite key a unique index.
Create one separate primary key. No matter how the business logic changes, or what new rules have to be applied to your Key VARCHAR(50) field, having one primary key will make you completely independent of business logic.
In my experience it all depends how many tables will be using this table as FK information. Do you want 3 extra columns in your other tables just to carry over a FK?
Personally I would create another FK column and put a unique constraint over the other three columns. This makes foreign keys to this table a lot easier to swallow.
I'm not a proponent of composite keys, but in this case, as an end-of-the-line table, it might make sense. However, if you allow nulls in any of these three fields because one or more of the values is not known at the time of the insert, there can be difficulty, and a unique index might be better.
Better to have UserID as a uniqueidentifier generated by NEWID() (a 128-bit GUID), because an int UserID gives a user a hint about other probable UserIDs. This will also solve your issue with the composite key.