Identity field and primary key in SQL Server when values are unique - sql

When a set of values that will be stored in a table have a name or a code that should be unique across the system, should it be created with a primary key of ID auto increment (int)?
Take the situation of State Abbreviations. Other than consistency, what would be the purpose of an ID on the table that was the primary key other than the state name or abbreviation?
If, for example, the foreign key from a shipping address referenced the state abbreviation, which is not mutable, then ... is there a purpose for having an auto increment int ID?

You highlighted one positive aspect of a separate table: consistency. It is much easier to have this:
CREATE TABLE dbo.States
(
StateID TINYINT PRIMARY KEY,
Name VARCHAR(32),
Abbreviation CHAR(2)
);
CREATE TABLE dbo.CustomerAddresses
(
AddressID INT PRIMARY KEY,
...,
StateID TINYINT NOT NULL FOREIGN KEY REFERENCES dbo.States(StateID)
);
Than to have a trigger or check constraint like:
CHECK (StateAbbreviation IN ('AL', 'AK', /* 50+ more states/territories... */))
Now, with something static and small like a 2-character state abbreviation, this design might make more sense, eliminating some unnecessary mapping between the abbreviations and some surrogate ID:
CREATE TABLE dbo.States
(
Abbreviation CHAR(2) PRIMARY KEY,
Name VARCHAR(32)
);
CREATE TABLE dbo.CustomerAddresses
(
AddressID INT PRIMARY KEY,
...,
StateAbbreviation CHAR(2) FOREIGN KEY REFERENCES dbo.States(Abbreviation)
);
This constrains the data to the known set of states, allows you to store the actual data in the table (which can eliminate a lot of joins in queries), actually saves you some space, and avoids having any messy hard-coded check constraints (or constraints using UDFs, or triggers validating the data).
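The natural-key design above can be tried out with a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for SQL Server, so the types and the foreign-key pragma are SQLite's; the helper name try_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
conn.executescript("""
CREATE TABLE States (
    Abbreviation CHAR(2) PRIMARY KEY,
    Name VARCHAR(32)
);
CREATE TABLE CustomerAddresses (
    AddressID INTEGER PRIMARY KEY,
    StateAbbreviation CHAR(2) NOT NULL REFERENCES States(Abbreviation)
);
INSERT INTO States VALUES ('AL', 'Alabama'), ('AK', 'Alaska');
INSERT INTO CustomerAddresses VALUES (1, 'AK');
""")

def try_insert(address_id, state):
    """Return True if the row is accepted, False if the foreign key rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO CustomerAddresses VALUES (?, ?)", (address_id, state)
            )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(2, "AL")   # known state
rejected = try_insert(3, "ZZ")   # not in States

# The abbreviation is readable straight from the child table, with no join.
states_in_use = [
    row[0]
    for row in conn.execute(
        "SELECT StateAbbreviation FROM CustomerAddresses ORDER BY AddressID"
    )
]
print(accepted, rejected, states_in_use)
```

The foreign key does the job of the hard-coded CHECK list, and the query at the end never touches the States table.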
That all said, there is no magic blanket answer that satisfies all designs. As your string gets larger, it can make more sense to use an integer instead of just storing the string. A counter-example would be storing all of the User Agent strings from your web logs - it makes a lot more sense to store each distinct string once and assign an integer to it than to store the same 255-character string over and over and over again.
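For the long-string case, a rough sketch of the "store it once, reference it by integer" approach follows. It again assumes SQLite via Python's sqlite3; the table names UserAgents and WebLog and both helper functions are hypothetical, and INSERT OR IGNORE is SQLite syntax rather than T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserAgents (
    UserAgentID INTEGER PRIMARY KEY,          -- surrogate key
    UserAgent VARCHAR(255) NOT NULL UNIQUE    -- the long string, stored once
);
CREATE TABLE WebLog (
    LogID INTEGER PRIMARY KEY,
    UserAgentID INTEGER NOT NULL REFERENCES UserAgents(UserAgentID)
);
""")

def user_agent_id(ua):
    """Return the integer ID for a user agent string, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO UserAgents (UserAgent) VALUES (?)", (ua,))
    row = conn.execute(
        "SELECT UserAgentID FROM UserAgents WHERE UserAgent = ?", (ua,)
    ).fetchone()
    return row[0]

def log_hit(ua):
    """Record one web log row that references the string by its integer ID."""
    conn.execute("INSERT INTO WebLog (UserAgentID) VALUES (?)", (user_agent_id(ua),))

for ua in ["Mozilla/5.0 (X11; Linux x86_64)", "curl/8.0", "Mozilla/5.0 (X11; Linux x86_64)"]:
    log_hit(ua)
conn.commit()

distinct_agents = conn.execute("SELECT COUNT(*) FROM UserAgents").fetchone()[0]
log_rows = conn.execute("SELECT COUNT(*) FROM WebLog").fetchone()[0]
print(distinct_agents, log_rows)
```

Each log row carries only an integer, however many times the same string appears.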
Other things that can make this design troublesome:
What if you expand beyond the US later?
Forget about state abbreviations for a moment (which are pretty static); what if your lookups are things that do change frequently?

State Abbreviation is a rare example of a good non-increment primary key for the following reasons:
They are small (2-character)
They don't change
The set of values is relatively static - new records are unlikely
Just because the natural key is unique doesn't make it a good candidate for the primary key.
Even real-world values that are unique (like SSN) may not be good candidates if they are entered by humans. For example, suppose someone enters a bunch of related data for a person, then gets a letter saying that the SSN is wrong - now you can't just update the primary key - you need to update all of the foreign keys as well!
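The SSN correction problem can be illustrated with a small sketch. It assumes SQLite via Python's sqlite3 with foreign keys switched on and no cascade rule; all table names, the helper fix_ssn, and the placeholder SSN values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Variant 1: the SSN itself is the primary key and is referenced by a child table.
CREATE TABLE PersonNatural (SSN CHAR(11) PRIMARY KEY, Name VARCHAR(50));
CREATE TABLE OrderNatural (
    OrderID INTEGER PRIMARY KEY,
    SSN CHAR(11) NOT NULL REFERENCES PersonNatural(SSN)
);
-- Variant 2: a surrogate key is referenced; the SSN is only a unique attribute.
CREATE TABLE PersonSurrogate (
    PersonID INTEGER PRIMARY KEY,
    SSN CHAR(11) NOT NULL UNIQUE,
    Name VARCHAR(50)
);
CREATE TABLE OrderSurrogate (
    OrderID INTEGER PRIMARY KEY,
    PersonID INTEGER NOT NULL REFERENCES PersonSurrogate(PersonID)
);
INSERT INTO PersonNatural VALUES ('000-00-0001', 'Pat');
INSERT INTO OrderNatural VALUES (1, '000-00-0001');
INSERT INTO PersonSurrogate VALUES (1, '000-00-0001', 'Pat');
INSERT INTO OrderSurrogate VALUES (1, 1);
""")

def fix_ssn(table):
    """Try to correct the mistyped SSN in one statement; report whether it worked."""
    try:
        with conn:
            conn.execute(
                f"UPDATE {table} SET SSN = '000-00-0002' WHERE SSN = '000-00-0001'"
            )
        return True
    except sqlite3.IntegrityError:
        return False

natural_ok = fix_ssn("PersonNatural")      # blocked: child rows still reference the old key
surrogate_ok = fix_ssn("PersonSurrogate")  # fine: nothing references the SSN
print(natural_ok, surrogate_ok)
```

With the natural key, the fix needs either a cascade rule or a coordinated update of every referencing table; with the surrogate key it is a single-row change.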

As a general rule (which may not apply in every single case), it's better to use integers as primary keys for performance reasons. So if your unique key is a string, create an autoincrement primary key.
Also, state abbreviations are not necessarily unique. They are within one country, but when you look at all countries in the world, the same abbreviation may occur more than once.
EDIT
I can't find any very good evidence of string vs. integer performance, but take a look, e.g., here: Strings as Primary Keys in SQL Database
Having said that, there's never a lot of states so performance gain will be small in this case.

Related

Having an Identity column in a table with primary key with varchar type

I was told to create an autID identity column in the table with GUID varchar(40) as the primary key and use the autID column as a reference key to help in the join process. But is that a good approach?
This causes a lot of problems like this
CREATE TABLE OauthClientInfo
(
autAppID INT IDENTITY(1,1) UNIQUE, -- must be unique to be the target of a foreign key
strClientID VARCHAR(40) PRIMARY KEY, -- GUID
strClientSecret VARCHAR(40)
);
CREATE TABLE OAuth_AuthToken
(
autID INT IDENTITY(1,1),
strAuthToken VARCHAR(40),
autAppID_fk INT
FOREIGN KEY REFERENCES OauthClientInfo(autAppID)
);
I was told that having autAppID_fk helps with joins compared with having a strClientID_fk of varchar(40), but my counter-argument is that we are unnecessarily adding a new ID as the reference, which sometimes forces extra joins.
For example, to find out which strClientID a strAuthToken belongs to: if strClientID_fk were the reference key, the OAuth_AuthToken table data would make a lot more sense to me. Please comment with your views on this.
"I was told to create an autID identity column in the table with GUID varchar(40) as the primary key and use the autID column as a reference key to help in the join process. But is that a good approach?"
You were told this by someone that confuses clustering and primary keys. They are not one and the same, despite the confusing implementation of the database engine that "helps" the lazy developer.
You might get arguments about adding an identity column to every table and designating it as the primary key. I'll disagree with all of this. One does not BLINDLY do anything of this type in a schema. You do the proper analysis, identify (and enforce) any natural keys, and then you decide whether a synthetic key is both useful and needed. And then you determine which columns to use for the clustered index (because you only have one). And then you verify the appropriateness of your decisions, by testing how efficient and effective your schema is under load. There are no absolute rules about how to implement your schema.
Many of your indexing (and again note - indexing and primary key are completely separate things) choices will be affected by how your tables are updated over time. Do you have hotspots that need to be minimized? Does your table experience lots of random inserts, updates, and deletes over time? Or maybe just lots of inserts but relatively few updates or deletes? These are just some of the factors that guide your decision.
You need to use UNIQUEIDENTIFIER data type for GUID columns not VARCHAR
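The size difference behind that advice can be seen in a small sketch using Python's standard uuid module. It only compares the textual and binary forms of a GUID; SQL Server's UNIQUEIDENTIFIER stores the 16-byte form.

```python
import uuid

client_id = uuid.uuid4()        # a random GUID
as_text = str(client_id)        # what ends up in a VARCHAR(40) column
as_binary = client_id.bytes     # the native 16-byte representation

print(len(as_text))    # 36 characters, hyphens included
print(len(as_binary))  # 16 bytes
```

Every index and foreign key that carries the textual form pays for the extra 20 bytes per entry.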
As far as I have read, Auto increment int is the most suitable column for clustered index.
And strClientID is the worst candidate for PK or cluster index.
Most importantly, you haven't mentioned the purpose of strClientID. What kind of data does it hold, and how does it get populated?

Defined pattern for MySQL table primary key

Is there any way to create, let's say, a pattern for a primary key? I.e. for a table products such a pattern would be p-1, p-2, ... p-n etc.
Thanks
Well, you can manually create and enforce that pattern into your application (or using triggers). A primary key just needs to be unique to work.
But I don't recommend it. In your sample, P-1 seems to have a business meaning. And, if it belongs to your business realm, it can be changed. While most databases have an ON UPDATE CASCADE equivalent, that doesn't change the basic reason you shouldn't use it as a key: it's information, not data.
I suggest you create a field named ProductCode char(10) NOT NULL UNIQUE and maybe fill it with P-00000001, P-00000002, and so on. Maybe you prefer to use a varchar: this doesn't matter, as long as it fulfills your business requirement. Create an Id INTEGER AUTO_INCREMENT PRIMARY KEY field to use as the primary key, as it never needs to be changed.
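One way this could look in practice is sketched below, assuming SQLite via Python's sqlite3 rather than MySQL (so AUTOINCREMENT instead of AUTO_INCREMENT). The helper add_product is hypothetical, and ProductCode is left nullable here only because the sketch fills it in right after the insert, once the generated Id is known.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Products (
    Id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, never changes
    ProductCode CHAR(10) UNIQUE,           -- business code, free to change later
    Name VARCHAR(50) NOT NULL
)""")

def add_product(name):
    """Insert a product and derive its display code from the generated Id."""
    with conn:  # both statements commit together
        cur = conn.execute("INSERT INTO Products (Name) VALUES (?)", (name,))
        code = f"P-{cur.lastrowid:08d}"
        conn.execute(
            "UPDATE Products SET ProductCode = ? WHERE Id = ?", (code, cur.lastrowid)
        )
    return code

codes = [add_product("widget"), add_product("gadget")]
print(codes)
```

Other tables reference Id, so a later change to the code format touches only the Products table.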

Surrogate key as a foreign key over composite keys

I realise there might be similar questions but I couldn't find one that was close enough for guidance.
Given this spec,
Site
---------------------------
SiteID int identity
Name varchar(50)
Series
---------------------
SiteID int
SeriesCode varchar(6)
...
--SeriesCode will be unique for every unique SiteID
Episode
----------------------
SiteID int
SeriesCode varchar(6)
EpisodeCode varchar(10)
...
my proposed design/implementation is
Site
----------------------------
SiteID int identity
Name varchar(50)
Series
-------------------------------------------
SeriesID int identity, surrogate key
SiteID int natural key
SeriesCode varchar(6) natural key
UNIQUE(SiteID, SeriesCode)
...
Episode
-------------------------------------------
EpisodeID int identity, surrogate key
SeriesID int foreign key
EpisodeCode varchar(10) natural key
...
Anything wrong with this? Is it okay to have the SeriesID surrogate as a foreign* key here? I'm not sure if I'm missing any obvious problems that can arise. Or would it be better to use composite natural keys (SiteID+SeriesCode / SiteID+EpisodeCode)? In essence that'd decouple the Episode table from the Series table and that doesn't sit right for me.
Worth adding is that SeriesCode looks like 'ABCD-1' and EpisodeCode like 'ABCD-1NMO9' in the raw input data that will populate these tables, so that's another thing that could be changed I suppose.
*: "virtual" foreign key, since it's been previously decided by the higher-ups we should not use actual foreign keys
Yes, it all looks fine. The only (minor) point I might make is that unless you have another, 4th, child table hanging off of Episode, you probably don't need EpisodeID, as Episode.EpisodeCode is a single-attribute natural key sufficient to identify and locate rows in Episode. It's no harm to leave it there, of course, but as a general rule I add surrogate keys to act as targets for FKs in child tables, and try to add a natural key to every table to identify and control redundant data rows... So if a table has no other table with an FK referencing it (and never will), I sometimes don't bother including a surrogate key in it.
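The design under discussion (a surrogate SeriesID plus a UNIQUE constraint on the natural key) can be sketched as follows. It assumes SQLite via Python's sqlite3, leaves out foreign key constraints as the question describes, and uses a hypothetical helper add_series and made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Site (SiteID INTEGER PRIMARY KEY, Name VARCHAR(50));
CREATE TABLE Series (
    SeriesID INTEGER PRIMARY KEY,       -- surrogate key
    SiteID INTEGER NOT NULL,
    SeriesCode VARCHAR(6) NOT NULL,
    UNIQUE (SiteID, SeriesCode)         -- the natural key is still enforced
);
CREATE TABLE Episode (
    EpisodeID INTEGER PRIMARY KEY,
    SeriesID INTEGER NOT NULL,          -- "virtual" FK: no constraint declared
    EpisodeCode VARCHAR(10) NOT NULL
);
INSERT INTO Site VALUES (1, 'Example site');
INSERT INTO Series VALUES (1, 1, 'ABCD-1');
INSERT INTO Episode VALUES (1, 1, 'ABCD-1NMO9');
""")

def add_series(site_id, code):
    """Return True if the series is accepted, False if it duplicates a natural key."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Series (SiteID, SeriesCode) VALUES (?, ?)", (site_id, code)
            )
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = add_series(1, "ABCD-1")   # same site, same code
other_site_ok = add_series(2, "ABCD-1")  # same code under a different site

# Reaching the site from an episode takes two joins through the surrogate key.
site_of_episode = conn.execute("""
    SELECT s.Name
    FROM Episode e
    JOIN Series se ON se.SeriesID = e.SeriesID
    JOIN Site s ON s.SiteID = se.SiteID
    WHERE e.EpisodeCode = 'ABCD-1NMO9'
""").fetchone()[0]
print(duplicate_ok, other_site_ok, site_of_episode)
```

Note that without a declared foreign key, nothing stops the second insert from pointing at a SiteID that does not exist.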
What's a "virtual" foreign key? Did the higher-ups decide not to use foreign key constraints? In that case, you're not using foreign keys at all. You're just pretending to.
And is Episode the best choice for an entity? Doesn't it really mean Show or Podcast or so, and just happens to always be part of a series right now? If so, will that change in the future? Will Episode eventually be abused to encompass Show outside of a Series? In that case, tying Episode to Site via Series might come back to haunt you.
Given all that, and assuming that you as a grunt probably can't change any of it: if I were you I'd feel safer using natural keys wherever possible. In the absence of foreign key constraints, it makes recognizing bad data easier, and if you have to resort to some SeriesCode='EMPTY' trickery later on, that's easier with natural keys, too.
My suggestion:
Use a natural/business key as the primary key whenever possible, except in the following 3 situations:
1. The natural/business key is unknown at the moment of inserting
2. The natural/business key is not good (it's not unique, or it's liable to change frequently)
3. The natural/business key is a composite of more than 3 columns and the table will have child tables
In situations 1 and 2 a surrogate key is required.
In situation 3 a surrogate key is strongly recommended.

Database Design and the use of non-numeric Primary Keys

I'm currently in the process of designing the database tables for a customer & website management application. My question is in regards to the use of primary keys as functional parts of a table (and not assigning "ID" numbers to every table just because).
For example, here are four related tables from the database so far, one of which uses the traditional primary key number, the others which use unique names as the primary key:
--
-- website
--
CREATE TABLE IF NOT EXISTS `website` (
`name` varchar(126) NOT NULL,
`client_id` int(11) NOT NULL,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`notes` text NOT NULL,
`website_status` varchar(26) NOT NULL,
PRIMARY KEY (`name`),
KEY `client_id` (`client_id`),
KEY `website_status` (`website_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
--
-- website_status
--
CREATE TABLE IF NOT EXISTS `website_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `website_status` (`name`) VALUES
('demo'),
('disabled'),
('live'),
('purchased'),
('transfered');
--
-- client
--
CREATE TABLE IF NOT EXISTS `client` (
`id` int(11) NOT NULL auto_increment,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`client_status` varchar(26) NOT NULL,
`firstname` varchar(26) NOT NULL,
`lastname` varchar(46) NOT NULL,
`address` varchar(78) NOT NULL,
`city` varchar(56) NOT NULL,
`state` varchar(2) NOT NULL,
`zip` int(11) NOT NULL,
`country` varchar(3) NOT NULL,
`phone` text NOT NULL,
`email` varchar(78) NOT NULL,
`notes` text NOT NULL,
PRIMARY KEY (`id`),
KEY `client_status` (`client_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;
--
-- client_status
--
CREATE TABLE IF NOT EXISTS `client_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `client_status` (`name`) VALUES
('affiliate'),
('customer'),
('demo'),
('disabled'),
('reseller');
As you can see, 3 of the 4 tables use their 'name' as the primary key. I know that these will always be unique. In 2 of the cases (the *_status tables) I am basically using a dynamic replacement for ENUM, since status options could change in the future, and for the 'website' table, I know that the 'name' of the website will always be unique.
I'm wondering if this is sound logic, getting rid of table ID's when I know the name is always going to be a unique identifier, or a recipe for disaster? I'm not a seasoned DBA so any feedback, critique, etc. would be extremely helpful.
Thanks for taking the time to read this!
There are 2 reasons I would always add an ID number to a lookup / ENUM table:
If you are referencing a single column table with the name then you may be better served by using a constraint
What happens if you wanted to rename one of the client_status entries? e.g. if you wanted to change the name from 'affiliate' to 'affiliate user' you would need to update the client table which should not be necessary. The ID number serves as the reference and the name is the description.
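The rename scenario can be sketched like this, assuming SQLite via Python's sqlite3 as a stand-in for MySQL. The column client_status_id and the cut-down client table are illustrative, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE client_status (
    id INTEGER PRIMARY KEY,            -- the reference
    name VARCHAR(26) NOT NULL UNIQUE   -- the description
);
CREATE TABLE client (
    id INTEGER PRIMARY KEY,
    client_status_id INTEGER NOT NULL REFERENCES client_status(id)
);
INSERT INTO client_status VALUES (1, 'affiliate'), (2, 'customer');
INSERT INTO client VALUES (1, 1), (2, 1), (3, 2);
""")

# Renaming the status touches exactly one row; the client table is untouched.
with conn:
    cur = conn.execute(
        "UPDATE client_status SET name = 'affiliate user' WHERE name = 'affiliate'"
    )
rows_touched = cur.rowcount

names = [
    row[0]
    for row in conn.execute("""
        SELECT s.name
        FROM client c
        JOIN client_status s ON s.id = c.client_status_id
        ORDER BY c.id
    """)
]
print(rows_touched, names)
```

The trade-off is the join needed to read the status name, which the name-as-key design avoids.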
In the website table, if you are confident that the name will be unique then it is fine to use as a primary key. Personally I would still assign a numeric ID as it reduces the space used in foreign key tables and I find it easier to manage.
EDIT:
As stated above, you will run into problems if the website name is renamed. By making this the primary key you will be making it very difficult if not impossible for this to be changed at a later date.
When making natural PRIMARY KEY's, make sure their uniqueness is under your control.
If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's.
Since website_status and client_status seem to be generated and used by you and only by you, it's acceptable to use them as a PRIMARY KEY, though having a long key may impact performance.
The website name seems to be under the control of the outside world, which is why I'd make it a plain field. What if they want to rename their website?
The counterexamples would be SSN and ZIP codes: it's not you who generates them and there is no guarantee that they won't be ever duplicated.
Kimberly Tripp has an excellent series of blog articles (GUIDs as PRIMARY KEYs and/or the clustering key and The Clustered Index Debate Continues) on the issue of creating clustered indexes, and choosing the primary key (related issues, but not always exactly the same). Her recommendation is that a clustered index/primary key should be:
1. Unique (otherwise useless as a key)
2. Narrow (the key is used in all non-clustered indexes, and in foreign-key relationships)
3. Static (you don't want to have to change all related records)
4. Always Increasing (so new records always get added to the end of the table, and don't have to be inserted in the middle)
Using "Name" as your key, while it seems to satisfy #1, doesn't satisfy ANY of the other three.
Even for your "lookup" table, what if your boss decides to change all affiliates to partners instead? You'll have to modify all rows in the database that use this value.
From a performance perspective, I'm probably most concerned that a key be narrow. If your website name is actually a long URL, then that could really bloat the size of any non-clustered indexes, and all tables that use it as a foreign key.
Besides all the other excellent points that have already been made, I would add one more word of caution against using large fields as clustering keys in SQL Server (if you're not using SQL Server, then this probably doesn't apply to you).
I add this because in SQL Server, the primary key on a table by default also is the clustering key (you can change that, if you want to and know about it, but most of the cases, it's not done).
The clustering key that determines the physical ordering of the SQL Server table is also being added to every single non-clustered index on that table. If you have only a few hundred to a few thousand rows and one or two indices, that's not a big deal. But if you have really large tables with millions of rows, and potentially lots of indices to speed up the queries, this will indeed cause a lot of disk space and server memory to be wasted unnecessarily.
E.g. if your table has 10 million rows, 10 non-clustered indices, and your clustering key is 26 bytes instead of 4 (for an INT), then you're wasting 10 million rows times 10 indices times 22 bytes, for a total of 2.2 billion bytes (approx. 2.2 GB) - that's not peanuts anymore!
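That estimate can be reproduced with a few lines of arithmetic. The variable names are made up, and the calculation ignores page overhead and index internals, so it is a rough lower bound rather than a measurement.

```python
rows = 10_000_000
nonclustered_indexes = 10
wide_key_bytes = 26   # clustering key as a 26-byte string
int_key_bytes = 4     # clustering key as an INT

# Every non-clustered index entry carries a copy of the clustering key.
extra_bytes = rows * nonclustered_indexes * (wide_key_bytes - int_key_bytes)

print(extra_bytes)           # 2200000000
print(extra_bytes / 10**9)   # 2.2 (decimal gigabytes)
```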
Again - this only applies to SQL Server, and only if you have really large tables with lots of non-clustered indices on them.
Marc
"If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's."
If you're absolutely sure you will never ever have uniqueness violation, then don't bother to define the key.
Personally, I think you will run into trouble using this idea. As you end up with more parent-child relationships, you end up with a huge amount of work when the names change (as they always will sooner or later). There can be a big performance hit when having to update a child table that has thousands of rows when the name of the website changes. And you have to plan for how to make sure that those changes happen. Otherwise, website name changes (oops, we let the name expire and someone else bought it) either break because of the foreign key constraint or you need to put in an automated way (cascade update) to propagate the change through the system. If you use cascading updates, then you can suddenly bring your system to a dead halt while a large change is processed. This is not considered to be a good thing. It really is more effective and efficient to use IDs for relationships and then put unique indexes on the name field to ensure they stay unique. Database design needs to consider maintenance of the data integrity and how that will affect performance.
Another thing to consider is that website names tend to be longer than a few characters. This means the performance difference between using an ID field for joins and the name for joins could be quite significant. You have to think of these things at the design phase, as it is too late to change to an ID when you have a production system with millions of records that is timing out and the fix is to completely restructure the database and rewrite all of the SQL code. Not something you can fix in fifteen minutes to get the site working again.
This just seems like a really bad idea. What if you need to change the value of the enum? The idea is to make it a relational database and not a set of flat files. At this point, why have the client_status table? Moreover, if you are using the data in an application, by using a type like a GUID or INT, you can validate the type and avoid bad data (in so far as validating the type). Thus, it is another of many lines to deter hacking.
I would argue that a database that is resistant to corruption, even if it runs a little slower, is better than one that isn’t.
In general, surrogate keys (such as arbitrary numeric identifiers) undermine the integrity of the database. Primary keys are the main way of identifying rows in the database; if the primary key values are not meaningful, the constraint is not meaningful. Any foreign keys that refer to surrogate primary keys are therefore also suspect. Whenever you have to retrieve, update or delete individual rows (and be guaranteed of affecting only one), the primary key (or another candidate key) is what you must use; having to work out what a surrogate key value is when there is a meaningful alternative key is a redundant and potentially dangerous step for users and applications.
Even if it means using a composite key to ensure uniqueness, I would advocate using a meaningful, natural set of attributes as the primary key, whenever possible. If you need to record the attributes anyway, why add another one? That said, surrogate keys are fine when there is no natural, stable, concise, guaranteed-to-be-unique key (e.g. for people).
You could also consider using index key compression, if your DBMS supports it. This can be very effective, especially for indexes on composite keys (think trie data structures), and especially if the least selective attributes can appear first in the index.
I think I am in agreement with cheduardo. It has been 25 years since I took a course in database design but I recall being told that database engines can more efficiently manage and load indexes that use character keys. The comments about the database having to update thousands of records when a key is changed and on all of the added space being taken up by the longer keys and then having to be transferred across systems, assumes that the key is actually stored in the records and that it does not have to be transferred across systems anyway. If you create an index on a column(s) of a table, I do not think the value is stored in the records of the table (unless you set some option to do so).
If you have a natural key for a table, even if it is changed occasionally, creating another key creates a redundancy that could result in data integrity issues and actually creates even more information that needs to be stored and transferred across systems. I work for a team that decided to store the local application settings in the database. They have an identity column for each setting, a section name, a key name, and a key value. They have a stored procedure (another holy war) to save a setting that ensures it does not appear twice. I have yet to find a case where I would use a setting's ID. I have, however, ended up with multiple records with the same section and key name that caused my application to fail. And yes, I know that could have been avoided by defining a constraint on the columns.
Here are a few points to consider before deciding on the keys for a table:
A numeric key is more suitable when you use references (foreign keys); since you are not using foreign keys, it is OK in your case to use a non-numeric key.
A non-numeric key uses more space than a numeric key, which can decrease performance.
Numeric keys make the database simpler to understand (you can easily tell the number of rows just by looking at the last row).
You NEVER know when the company you work for suddenly explodes in growth and you have to hire 5 developers overnight. Your best bet is to use numeric (integer) primary keys because they will be much easier for the entire team to work with AND will help your performance if and when the database grows. If you have to break records out and partition them, you might want to use the primary key. If you are adding records with a datetime stamp (as every table should), and there is an error somewhere in the code that updates that field incorrectly, the only way to confirm whether the record was entered in the proper sequence is to check the primary keys. There are probably 10 more T-SQL or debugging reasons to use INT primary keys, not the least of which is writing a simple query to select the last 5 records entered into the table.

SQL: To primary key or not to primary key?

I have a table with sets of settings for users, it has the following columns:
UserID INT
Set VARCHAR(50)
Key VARCHAR(50)
Value NVARCHAR(MAX)
TimeStamp DATETIME
UserID together with Set and Key are unique. So a specific user cannot have two of the same keys in a particular set of settings. The settings are retrieved by set, so if a user requests a certain key from a certain set, the whole set is downloaded, so that the next time a key from the same set is needed, it doesn't have to go to the database.
Should I create a primary key on all three columns (userid, set, and key) or should I create an extra field that has a primary key (for example an autoincrement integer called SettingID, bad idea i guess), or not create a primary key, and just create a unique index?
----- UPDATE -----
Just to clear things up: This is an end of the line table, it is not joined in anyway. UserID is a FK to the Users table. Set is not a FK. It is pretty much a helper table for my GUI.
Just as an example: the first time users visit parts of the website, they get a help balloon, which they can close if they want. Once they click it away, I will add a setting to the "GettingStarted" set stating that help balloon X has been disabled. The next time the user comes to the same page, the setting will state that help balloon X should not be shown anymore.
Having composite unique keys is mostly not a good idea.
Having any business-relevant data as the primary key can also cause you trouble, for instance if you need to change the value. Even if the application cannot change the value today, it might in the future, or the value may have to be changed in an upgrade script.
It's best to create a surrogate key, an automatic number which does not have any business meaning.
Edit after your update:
In this case, you can think of the table as conceptually having no primary key, and make these three columns either the primary key or a composite unique key (to keep them changeable).
Should I create a primary key on all three columns (userid, set, and key)
Make this one.
Using surrogate primary key will result in an extra column which is not used for other purposes.
Creating a UNIQUE INDEX along with surrogate primary key is same as creating a non-clustered PRIMARY KEY, and will result in an extra KEY lookup which is worse for performance.
Creating a UNIQUE INDEX without a PRIMARY KEY will result in a HEAP-organized table which will need an extra RID lookup to access the values: also not very good.
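The composite primary key option can be sketched as follows, assuming SQLite via Python's sqlite3 instead of SQL Server. The table name UserSettings and both helpers are hypothetical, the TimeStamp column is omitted for brevity, and INSERT OR REPLACE is SQLite-specific (a MERGE or an update-then-insert would play that role in T-SQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE UserSettings (
    UserID INTEGER NOT NULL,
    "Set" VARCHAR(50) NOT NULL,     -- quoted because SET and KEY are keywords
    "Key" VARCHAR(50) NOT NULL,
    Value TEXT,
    PRIMARY KEY (UserID, "Set", "Key")
)""")

def save_setting(user_id, set_name, key, value):
    """Insert the setting, or overwrite it if the (user, set, key) row exists."""
    with conn:
        conn.execute(
            'INSERT OR REPLACE INTO UserSettings (UserID, "Set", "Key", Value) '
            "VALUES (?, ?, ?, ?)",
            (user_id, set_name, key, value),
        )

def load_set(user_id, set_name):
    """Fetch a whole set for one user, as the question describes."""
    rows = conn.execute(
        'SELECT "Key", Value FROM UserSettings WHERE UserID = ? AND "Set" = ?',
        (user_id, set_name),
    ).fetchall()
    return dict(rows)

save_setting(1, "GettingStarted", "HelpBalloonX", "shown")
save_setting(1, "GettingStarted", "HelpBalloonX", "disabled")  # same key: replaced
save_setting(1, "GettingStarted", "HelpBalloonY", "disabled")

settings = load_set(1, "GettingStarted")
row_count = conn.execute("SELECT COUNT(*) FROM UserSettings").fetchone()[0]
print(settings, row_count)
```

The key order (UserID, Set, Key) matches the lookup by user and set, so loading a set is a prefix search on the primary key.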
How many Key's and Set's do you have? Do these need to be varchar(50) or can they point to a lookup table? If you can convert this Set and Key into SetId and KeyId then you can create your primary key on the 3 integer values which will be much faster.
I would probably try to make sure that UserID was a unique identifier, rather than having duplicates of UserID throughout the code. Composite keys tend to get confusing later on in your code's life.
I'm assuming this is a lookup field for config values of some kind, so you could probably go with the composite key if this is the case. The data is already there. You can guarantee its uniqueness using the primary key. If you change your mind and decide later that it isn't appropriate for you, you can easily add a SettingId and make the original composite key a unique index.
Create one separate primary key. No matter how the business logic changes, or what new rules have to be applied to your Key VARCHAR(50) field, having one primary key will make you completely independent of the business logic.
In my experience it all depends how many tables will be using this table as FK information. Do you want 3 extra columns in your other tables just to carry over a FK?
Personally I would create another FK column and put a unique constraint over the other three columns. This makes foreign keys to this table a lot easier to swallow.
I'm not a proponent of composite keys, but in this case, as an end-of-the-line table, it might make sense. However, if you allow nulls in any of these three fields because one or more of the values is not known at the time of the insert, there can be difficulty, and a unique index might be better.
Better to have UserID as a NEWID() value (uniqueidentifier), because an int UserID gives users a hint about other probable UserIDs. This will also solve your composite key issue.