Should I use an ENUM for primary and foreign keys? - sql

An associate has created a schema that uses an ENUM() column for the primary key on a lookup table. The table turns a product code "FB" into it's name "Foo Bar".
This primary key is then used as a foreign key elsewhere. And at the moment, the FK is also an ENUM().
I think this is not a good idea. This means that to join these two tables, we end up with four lookups. The two tables, plus the two ENUM(). Am I correct?
I'd prefer to have the FKs be CHAR(2) to reduce the lookups. I'd also prefer that the PKs were also CHAR(2) to reduce it completely.
The benefit of the ENUM()s is to get constraints on the values. I wish there was something like: CHAR(2) ALLOW('FB', 'AB', 'CD') that we could use for both the PK and FK columns.
What is: Best PracticeYour preference
This concept is used elsewhere too. What if the ENUM()'s values are longer? ENUM('Ding, dong, dell', 'Baa baa black sheep'). Now the ENUM() is useful from a space point-of-view. Should I only care about this if there are several million rows using the values? In which case, the ENUM() saves storage space.

ENUM should be used to define a possible range of values for a given field. This also implies that you may have multiple rows which have the same value for this perticular field.
I would not recommend using an ENUM for a primary key type of foreign key type.
Using an ENUM for a primary key means that adding a new key would involve modifying the table since the ENUM has to be modified before you can insert a new key.
I am guessing that your associate is trying to limit who can insert a new row and that number of rows is limited. I think that this should be achieved through proper permission settings either at the database level or at the application and not through using an ENUM for the primary key.
IMHO, using an ENUM for the primary key type violates the KISS principle.

but when you only trapped with differently 10 or less rows that wont be a problem
e.g's
CREATE TABLE `grade`(
`grade` ENUM('A','B','C','D','E','F') PRIMARY KEY,
`description` VARCHAR(50) NOT NULL
)
This table it is more than diffecult to get a DML

We've had more discussion about it and here's what we've come up with:
Use CHAR(2) everywhere. For both the PK and FK. Then use mysql's foreign key constraints to disallow creating an FK to a row that doesn't exist in the lookup table.
That way, given the lookup table is L, and two referring tables X and Y, we can join X to Y without any looking up of ENUM()s or table L and can know with certainty that there's a row in L if (when) we need it.
I'm still interested in comments and other thoughts.

Having a lookup table and a enum means you are changing values in two places all the time. Funny... We spent to many years using enums causing issues where we need to recompile to add values. In recent years, we have moved away from enums in many situations an using the values in our lookup tables. The biggest value I like about lookup tables is that you add or change values without needing to compile. Even with millions of rows I would stick to the lookup tables and just be intelligent in your database design

Related

Adding an artificial primary key versus using a unique field [duplicate]

This question already has answers here:
Surrogate vs. natural/business keys [closed]
(19 answers)
Why would one consider using Surrogate keys vs Natural with ON UPDATE CASCADE?
(1 answer)
Closed 7 months ago.
Recently I Inherited a huge app from somebody who left the company.
This app used a SQL server DB .
Now the developer always defines an int base primary key on tables. for example even if Users table has a unique UserName field , he always added an integer identity primary key.
This is done for every table no matter if other fields could be unique and define primary key.
Do you see any benefits whatsoever on this? using UserName as primary key vs adding UserID(identify column) and set that as primary key?
I feel like I have to add add another element to my comments, which started to produce an essay of comments, so I think it is better that I post it all as an answer instead.
Sometimes there are domain specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static and ever-increasing clustered index alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that. Read this for further details.
But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.
One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.
But a natural primary key can eliminate the need for a join in the first place.
Let's put all this together with an example:
create table Countries
(
countryCode char(2) not null primary key clustered,
countryName varchar(64) not null
);
insert Countries values
('AU', 'Australia'),
('FR', 'France');
create table TourLocations
(
tourLocationName varchar(64) not null,
tourLocationId int identity(1,1) unique clustered,
countryCode char(2) not null foreign key references Countries(countryCode),
primary key (countryCode, tourLocationName)
);
insert TourLocations (TourLocationName, countryCode) values
('Bondi Beach', 'AU'),
('Eiffel Tower', 'FR')
I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.
On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.
But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.
By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.
OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.
But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.
Suppose I gave the results of this query to my users:
select countryCode, tourLocationName
from TourLocations
order by 1, 2;
Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.
When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.
If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.
You are talking about the difference between synthetic and natural keys.
In my [very] personal opinion, I would recommend to always use synthetic keys (and always call it id). The main problem is that natural keys are never unique; they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.
In database design:
Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They only serve as a unique identifiers for the rows.
I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".
Yes having a dedicated int is a good thing for PK use.
you may have multiple alternate keys, that's ok too.
two great reasons for it:
it is performant
it protects against key mutation ( editing a name etc. )
A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.
Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.
By meaningless I mean that apart from being the primary key it has no meaning to the users.
An int identity column has other advantages over a text field as primary key.
It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
it is faster than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture dependent behaviors resulting in text comparisons not always working as expected.
int primary keys generated in ascending order have a superior performance in conjunction with clustered primary keys (which is a SQL-Server specialty).
Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.
But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.
This subject - quite much like gulivar travels and wars being fought over which end of the egg you supposed to crack open to eat.
However, using the SAME "id" name for all tables, and autonumber? Yes, it is LONG establihsed choice.
There are of course MANY different views on this subject, and many advantages and disavantages.
Regardless of which choice one perfers (or even needs), this is a long established concept in our industry. In fact SharePoint tables use "ID" and autonumber by defualt. So does ms-access, and there probably more that do this.
The simple concpet?
You can build your tables with the PK and child tables with forighen keys.
At that point you setup your relationships between the tables.
Now, you might decide to add say some invoice number or whatever. Rules might mean that such invoice number is not duplicated.
But, WHY do we care of you have some "user" name, or some "invoice" number or whatever. Why should that fact effect your relational database model?
You mean I don't have a user name, or don't have a invoice number, and the whole database and relatonships don't work anymore? We don't care!!!!
The concept of data, even required fields, or even a column having to be unique ?
That has ZERO to do with a working relational data model.
And maybe you decide that invoice number is not generated until say sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care - you can have all kinds of business rules for those numbers, but they have ZERO do to do with the fact that you designed a working relational data model based on so called "surrogate" or sometime called synthetic keys.
So, once you build that data model - even with JUST the PK "id" and FK (forighen keys), you are NOW free to start adding columns and define what type of data you going to put in each table. but, what you shove into each table has ZERO to do with that working related data model. They are to be thought as seperate concpets.
So, if you have a user name - add that column to the table. If you don't want users name, remove the column. As such data you store in the table has ZERO to do with the automatic PK ID you using - it not really any different then say what area of memory the computer going to allocate to load that data. Basic data operations of the system is has nothing to do with having build database with relationships that simple exist. And the data columns you add after having built those relationships is up to you - but will not, and should not effect the operation of the database and relationships you built and setup. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships as opposed to data column you add to such tables to store user data.
I mean, in json data, xml? We often have a master + child table relationship. We don't care how that relationship is maintained - but only that it exists.
Thus yes, all tables have that pk "ID". Even better? in code, you NEVER have to guess what the PK id is - it always the same!!!
So, data and columns you put and toss into a table? Those columns and data have zero to do with the PK id, and while it is the database generating that PK? It could be a web service call to some monkeys living in a far away jungle eating banana's and they give you a PK value based on how many bananas they eaten. We just really don't' care about that number - it is just internal house keeping numbers - one that we don't see or even care about in most code. And thus the number one rule to such auto matic PK values?
You NEVER give that auto PK number any meaning from a user and applcation point of view.
In summary:
Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many systems, it not only the default, but is in fact required for such systems to operate.
Its better to use userid. User table is referenced by many other tables.
The referenced table would contain the primary key of the user table as foreign key.
Its better to use userid since its integer value,
it takes less space than string values of username and
the searches by the database engine would be faster
user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)

is it necessary to have foreign key for simple tables

have a table called RoundTable
It has the following columns
RoundName
RoundDescription
RoundType
RoundLogo
Now the RoundType will be having values like "Team", "Individual", "Quiz"
is it necessary to have a one more table called "RoundTypes" with columns
TypeID
RoundType
and remove the RoundType from the rounds table and have a column "TypeID" which has a foreign key to this RoundType table?
Some say that if you have the RoundType in same table it is like hard-coding as there will be lot of round types in future.
is it like if there are going to be only 2-3 round types, i need not have foreign key??
Is it necessary? Obviously not. SQL works fine either way. In a properly defined database, you would do one of two things for RoundType:
Have a lookup table
Have a constraint that checks that values are within an agreed upon set (and I would put enums into this category)
If you have a lookup table, I would advocate having an auto-incremented id (called RoundTypeId) for it. Remember, that in a larger database, such a table would often have more than two columns:
CreatedAt -- when it was created
CreatedBy -- who created it
CreatedOn -- where it was created (important for distributed systems)
Long name
In a more advanced system, you might also need to internationalize the system -- that is, make it work for multiple languages. Then you would be looking up the actual string value in other tables.
is it like if there are going to be only 2-3 round types, i need not
have foreign key??
Usually it's just the opposite: If you have a different value for most of the records (like in a "lastName" column) you won't use a lookup table.
If, however, you know that you will have a limited set of allowed/possible values, a lookup table referenced via a foreign key is probably the better solution.
Maybe read up on "database normalization", starting perhaps # Wikipedia.
Actually you need to have separate table if you have following association between entities,
One to many
Many to many
because of virtue of these association simple DBMS becomes **R**DBMS ( Relation .)
Now ask simple question,
Whether my single record in round table have multiple roundTypes?
If so.. Make a new table and have foreign key in ROUNDTable.
Otherwise no.
yeah I think you should normalize it. Because if you will not do so then definitely you have to enter the round types (value) again and again for each record which is not good practice at all in case if you have large data. so i will suggest you to make another table
however later on you can make a view to get the desired result as fallow
create view vw_anyname
as
select RoundName, RoundDescription , RoundLogo, RoundType from roundtable join tblroundtype
on roundtable.TypeID = tblroundtype .typeid
select * from vw_anyname

SQL: what exactly do Primary Keys and Indexes do?

I've recently started developing my first serious application which uses a SQL database, and I'm using phpMyAdmin to set up the tables. There are a couple optional "features" I can give various columns, and I'm not entirely sure what they do:
Primary Key
Index
I know what a PK is for and how to use it, but I guess my question with regards to that is why does one need one - how is it different from merely setting a column to "Unique", other than the fact that you can only have one PK? Is it just to let the programmer know that this value uniquely identifies the record? Or does it have some special properties too?
I have no idea what "Index" does - in fact, the only times I've ever seen it in use are (1) that my primary keys seem to be indexed, and (2) I heard that indexing is somehow related to performance; that you want indexed columns, but not too many. How does one decide which columns to index, and what exactly does it do?
edit: should one index colums one is likely to want to ORDER BY?
Thanks a lot,
Mala
Primary key is usually used to create a numerical 'id' for your records, and this id column is automatically incremented.
For example, if you have a books table with an id field, where the id is the primary key and is also set to auto_increment (Under 'Extra in phpmyadmin), then when you first add a book to the table, the id for that will become 1'. The next book's id would automatically be '2', and so on. Normally, every table should have at least one primary key to help identifying and finding records easily.
Indexes are used when you need to retrieve certain information from a table regularly. For example, if you have a users table, and you will need to access the email column a lot, then you can add an index on email, and this will cause queries accessing the email to be faster.
However there are also downsides for adding unnecessary indexes, so add this only on the columns that really do need to be accessed more than the others. For example, UPDATE, DELETE and INSERT queries will be a little slower the more indexes you have, as MySQL needs to store extra information for each indexed column. More info can be found at this page.
Edit: Yes, columns that need to be used in ORDER BY a lot should have indexes, as well as those used in WHERE.
The primary key is basically a unique, indexed column that acts as the "official" ID of rows in that table. Most importantly, it is generally used for foreign key relationships, i.e. if another table refers to a row in the first, it will contain a copy of that row's primary key.
Note that it's possible to have a composite primary key, i.e. one that consists of more than one column.
Indexes improve lookup times. They're usually tree-based, so that looking up a certain row via an index takes O(log(n)) time rather than scanning through the full table.
Generally, any column in a large table that is frequently used in WHERE, ORDER BY or (especially) JOIN clauses should have an index. Since the index needs to be updated for evey INSERT, UPDATE or DELETE, it slows down those operations. If you have few writes and lots of reads, then index to your hear's content. If you have both lots of writes and lots of queries that would require indexes on many columns, then you have a big problem.
The difference between a primary key and a unique key is best explained through an example.
We have a table of users:
USER_ID number
NAME varchar(30)
EMAIL varchar(50)
In that table the USER_ID is the primary key. The NAME is not unique - there are a lot of John Smiths and Muhammed Khans in the world. The EMAIL is necessarily unique, otherwise the worldwide email system wouldn't work. So we put a unique constraint on EMAIL.
Why then do we need a separate primary key? Three reasons:
the numeric key is more efficient
when used in foreign key
relationships as it takes less space
the email can change (for example
swapping provider) but the user is
still the same; rippling a change of
a primary key value throughout a schema
is always a nightmare
it is always a bad idea to use
sensitive or private information as
a foreign key
In the relational model, any column or set of columns that is guaranteed to be both present and unique in the table can be called a candidate key to the table. "Present" means "NOT NULL". It's common practice in database design to designate one of the candidate keys as the primary key, and to use references to the primary key to refer to the entire row, or to the subject matter item that the row describes.
In SQL, a PRIMARY KEY constraint amounts to a NOT NULL constraint for each primary key column, and a UNIQUE constraint for all the primary key columns taken together. In practice many primary keys turn out to be single columns.
For most DBMS products, a PRIMARY KEY constraint will also result in an index being built on the primary key columns automatically. This speeds up the systems checking activity when new entries are made for the primary key, to make sure the new value doesn't duplicate an existing value. It also speeds up lookups based on the primary key value and joins between the primary key and a foreign key that references it. How much speed up occurs depends on how the query optimizer works.
Originally, relational database designers looked for natural keys in the data as given. In recent years, the tendency has been to always create a column called ID, an integer as the first column and the primary key of every table. The autogenerate feature of the DBMS is used to ensure that this key will be unique. This tendency is documented in the "Oslo design standards". It isn't necessarily relational design, but it serves some immediate needs of the people who follow it. I do not recommend this practice, but I recognize that it is the prevalent practice.
An index is a data structure that allows for rapid access to a few rows in a table, based on a description of the columns of the table that are indexed. The index consists of copies of certain table columns, called index keys, interspersed with pointers to the table rows. The pointers are generally hidden from the DBMS users. Indexes work in tandem with the query optimizer. The user specifies in SQL what data is being sought, and the optimizer comes up with index strategies and other strategies for translating what is being sought into a stategy for finding it. There is some kind of organizing principle, such as sorting or hashing, that enables an index to be used for fast lookups, and certain other uses. This is all internal to the DBMS, once the database builder has created the index or declared the primary key.
Indexes can be built that have nothing to do with the primary key. A primary key can exist without an index, although this is generally a very bad idea.

in general, should every table in a database have an identity field to use as a PK?

I'm running into an issue with a join: getting back too many records. I added a table to the set of joins and the number of rows expanded. Usually when this happens I add a select of all the ID fields that are involved in the join. That way it's pretty obvious where the expansion is happening and I can change the ON of the join to fix it. Except in this case, the table that I added doesn't have an ID field. This is a problem. But perhaps I'm wrong.
Should every table in a database have an IDENTITY field that's used as the PK? Are there any drawbacks to having an ID field in every table? What if you're reasonably sure this table will never be used in a PK/FK relationship?
When having an identity column is not a good idea?
Surrogate vs. natural/business keys
Wikipedia Surrogate Key article
There are two concepts that are close but should not be confused: IDENTITY and PRIMARY KEY
Every table (except for the rare conditions) should have a PRIMARY KEY, that is a value or a set of values that uniquely identify a row.
See here for discussion why.
IDENTITY is a property of a column in SQL Server which means that the column will be filled automatically with incrementing values.
Due to the nature of this property, the values of this column are inherently UNIQUE.
However, no UNIQUE constraint or UNIQUE index is automatically created on IDENTITY column, and after issuing SET IDENTITY_INSERT ON it's possible to insert duplicate values into an IDENTITY column, unless it had been explicity UNIQUE constrained.
The IDENTITY column should not necessarily be a PRIMARY KEY, but most often it's used to fill the surrogate PRIMARY KEYs
It may or may not be useful in any particular case.
Therefore, the answer to your question:
The question: should every table in a database have an IDENTITY field that's used as the PK?
is this:
No. There are cases when a database table should NOT have an IDENTITY field as a PRIMARY KEY.
Three cases come into my mind when it's not the best idea to have an IDENTITY as a PRIMARY KEY:
If your PRIMARY KEY is composite (like in many-to-many link tables)
If your PRIMARY KEY is natural (like, a state code)
If your PRIMARY KEY should be unique across databases (in this case you use GUID / UUID / NEWID)
All these cases imply the following condition:
You shouldn't have IDENTITY when you care for the values of your PRIMARY KEY and explicitly insert them into your table.
Update:
Many-to-many link tables should have the pair of id's to the table they link as the composite key.
It's a natural composite key which you already have to use (and make UNIQUE), so there is no point to generate a surrogate key for this.
I don't see why would you want to reference a many-to-many link table from any other table except the tables they link, but let's assume you have such a need.
In this case, you just reference the link table by the composite key.
This query:
CREATE TABLE a (id, data)
CREATE TABLE b (id, data)
CREATE TABLE ab (a_id, b_id, PRIMARY KEY (a_id, b_id))
CREATE TABLE business_rule (id, a_id, b_id, FOREIGN KEY (a_id, b_id) REFERENCES ab)
SELECT *
FROM business_rule br
JOIN a
ON a.id = br.a_id
is much more efficient than this one:
CREATE TABLE a (id, data)
CREATE TABLE b (id, data)
CREATE TABLE ab (id, a_id, b_id, PRIMARY KEY (id), UNIQUE KEY (a_id, b_id))
CREATE TABLE business_rule (id, ab_id, FOREIGN KEY (ab_id) REFERENCES ab)
SELECT *
FROM business_rule br
JOIN a_to_b ab
ON br.ab_id = ab.id
JOIN a
ON a.id = ab.a_id
, for obvious reasons.
Almost always yes. I generally default to including an identity field unless there's a compelling reason not to. I rarely encounter such reasons, and the cost of the identity field is minimal, so generally I include.
Only thing I can think of off the top of my head where I didn't was a highly specialized database that was being used more as a datastore than a relational database where the DBMS was being used for nearly every feature except significant relational modelling. (It was a high volume, high turnover data buffer thing.)
I'm a firm believer that natural keys are often far worse than artificial keys because you often have no control over whether they will change which can cause horrendous data integrity or performance problems.
However, there are some (very few) natural keys that make sense without being an identity field (two-letter state abbreviation comes to mind, it is extremely rare for these official type abbreviations to change.)
Any table which is a join table to model a many to many relationship probably also does not need an additional identity field. Making the two key fields together the primary key will work just fine.
Other than that I would, in general, add an identity field to most other tables unless given a compelling reason in that particular case not to. It is a bad practice to fail to create a primary key on a table or if you are using surrogate keys to fail to place a unique index on the other fields needed to guarantee uniqueness where possible (unless you really enjoy resolving duplicates).
Every table should have some set of field(s) that uniquely identify it. Whether or not there is a numeric identifier field separate from the data fields will depend on the domain you are attempting to model. Not all data easily falls into the 'single numeric id' paradigm, and as such it would be inappropriate to force it. Given that, a lot of data does easily fit in this paradigm and as such would call for such an identifier. There is no one answer to always do X in any programming environment, and this is another example.
If you have modelled, designed, normalised etc, then you will have no identity columns.
You will have identified natural and candidate keys for your tables.
You may decide on a surrogate key because of the physical architecture (eg narrow, numeric, strictly monotonically increasing), say, because using a nvarchar(100) column is not a good idea (still need unique constraint).
Or because of ideology: they appeal to OO developers I've found.
Ok, assume ID columns. As your db gets more complex, say several layers, how can you jon parent and grand-.child tables directly. You can't: you always need intermediate tables and well indexed PK-FL columns. With a composite key, it's all there for you...
Don't get me wrong: I use them. But I know why I use them...
Edit:
I'd be interested to collate "always ID"+"no stored procs" matches on one hand, with "use stored procs"+"IDs when they benefit" on the other...
No. Whenever you have a table with an artificial identity column, you also need to identify the natural primary key for the table and ensure that there is a unique constraint on that set of columns too so that you don't get two rows that are identical apart from the meaningless identity column by accident.
Adding an identity column is not cost free. There is an overhead in adding an unnecessary identity column to a table - typically 4 bytes per row of storage for the identity value, plus a whole extra index (which will probably weigh in at 8-12 bytes per row plus overhead). It also takes slightly to work out the most cost-effective query plan because there is an extra index per table. Granted, if the table is small and the machine is big, this overhead is not critical - but for the biggest systems, it matters.
Yes, for the vast majority of cases.
Edge cases or exceptions might be things like:
two-way join tables to model m:n relationships
temporary tables used for bulk-inserting huge amounts of data
But other than that, I think there is no good reason against having a primary key to uniquely identify each row in a table, and in my opinion, using an IDENTITY field is one of the best choices (I prefer surrogate keys over natural keys - they're more reliable, stable, never changing etc.).
Marc
I can't think of any drawback about having an ID field in each table. Providing your the type of your ID field provides enough space for your table to grow.
However, you don't necessarily need a single field to ensure the identity of your rows.
So no, a single ID field is not mandatory.
Primary and Foreign Keys can consist not only of one field, but of multiple fields. This is typical for tables implementing a N-N relationship.
You can perfectly have PRIMARY KEY (fa, fb) on your table:
CREATE TABLE t(fa INT , fb INT);
ALTER TABLE t ADD PRIMARY KEY(fa , fb);
Recognize the distinction between an Identity field and a key... Every table should have a key, to eliminate the data corruption of inadvertently entering multiple rows that represent the same 'entity'. If the only key a table has is a meaningless surrogate key, then this function is effectively missing.
otoh, No table 'needs' an identity, and certainly not every table benefits from one... Examples are: A table with a short and functional key, a table which does not have any other table referencing it through a foreign Key, or a table which is in a one to zero-or-one relationship with another table... none of these need an Identity
I'd say, if you can find a simple, natural key in your table (i.e. one column), use that as a key instead of an identity column.
I generally give every table some kind of unique identifier, whether it is natural or generated, because then I am guaranteed that every row is uniquely identified somehow.
Personally, I avoid IDENTITY (incrementing identity columns, like 1, 2, 3, 4) columns like the plague. They cause a lot of hassle, especially if you delete rows from that table. I use generated uniqueidentifiers instead if there is no natural key in the table.
Anyway, no idea if this is the accepted practice, just seems right to me. YMMV.

Decision between storing lookup table id's or pure data

I find this comes up a lot, and I'm not sure the best way to approach it.
The question I have is how to make the decision between using foreign keys to lookup tables, or using lookup table values directly in the tables requesting it, avoiding the lookup table relationship completely.
Points to keep in mind:
With the second method you would
need to do mass updates to all
records referencing the data if it
is changed in the lookup table.
This is focused more
towards tables that have a lot of
the column's referencing many lookup
tables.Therefore lots of foreign
keys means a lot of
joins every time you query the
table.
This data would be coming from drop
down lists which would be pulled
from the lookup tables. In order to match up data when reloading, the values need to be in the existing list (related to the first point).
Is there a best practice here, or any key points to consider?
You can use a lookup table with a VARCHAR primary key, and your main data table uses a FOREIGN KEY on its column, with cascading updates.
CREATE TABLE ColorLookup (
color VARCHAR(20) PRIMARY KEY
);
CREATE TABLE ItemsWithColors (
...other columns...,
color VARCHAR(20),
FOREIGN KEY (color) REFERENCES ColorLookup(color)
ON UPDATE CASCADE ON DELETE SET NULL
);
This solution has the following advantages:
You can query the color names in the main data table without requiring a join to the lookup table.
Nevertheless, color names are constrained to the set of colors in the lookup table.
You can get a list of unique colors names (even if none are currently in use in the main data) by querying the lookup table.
If you change a color in the lookup table, the change automatically cascades to all referencing rows in the main data table.
It's surprising to me that so many other people on this thread seem to have mistaken ideas of what "normalization" is. Using a surrogate keys (the ubiquitous "id") has nothing to do with normalization!
Re comment from #MacGruber:
Yes, the size is a factor. In InnoDB for example, every secondary index stores the primary key value of the row(s) where a given index value occurs. So the more secondary indexes you have, the greater the overhead for using a "bulky" data type for the primary key.
Also this affects foreign keys; the foreign key column must be the same data type as the primary key it references. You might have a small lookup table so you think the primary key size in a 50-row table doesn't matter. But that lookup table might be referenced by millions or billions of rows in other tables!
There's no right answer for all cases. Any answer can be correct for different cases. You just learn about the tradeoffs, and try to make an informed decision on a case by case basis.
In cases of simple atomic values, I tend to disagree with the common wisdom on this one, mainly on the complexity front. Consider a table containing hats. You can do the "denormalized" way:
CREATE TABLE Hat (
hat_id INT NOT NULL PRIMARY KEY,
brand VARCHAR(255) NOT NULL,
size INT NOT NULL,
color VARCHAR(30) NOT NULL /* color is a string, like "Red", "Blue" */
)
Or you can normalize it more by making a "color" table:
CREATE TABLE Color (
color_id INT NOT NULL PRIMARY KEY,
color_name VARCHAR(30) NOT NULL
)
CREATE TABLE Hat (
hat_id INT NOT NULL PRIMARY KEY,
brand VARCHAR(255) NOT NULL,
size INT NOT NULL,
color_id INT NOT NULL REFERENCES Color(color_id)
)
The end result of the latter is that you've added some complexity - instead of:
SELECT * FROM Hat
You now have to say:
SELECT * FROM Hat H INNER JOIN Color C ON H.color_id = C.color_id
Is that extra join a huge deal? No - in fact, that's the foundation of the relational design model - normalizing allows you to prevent possible inconsistencies in the data. But every situation like this adds a little bit of complexity, and unless there's a good reason, it's worth asking why you're doing it. I consider possible "good reasons" to include:
Are there other attributes that "hang off of" this attribute? Are you capturing, say, both "color name" and "hex value", such that hex value is always dependent on color name? If so, then you definitely want a separate color table, to prevent situations where one row has ("Red", "#FF0000") and another has ("Red", "#FF3333"). Multiple correlated attributes are the #1 signal that an entity should be normalized.
Will the set of possible values change frequently? Using a normalized lookup table will make future changes to the elements of the set easier, because you're just updating a single row. If it's infrequent, though, don't balk at statements that have to update lots of rows in the main table instead; databases are quite good at that. Do some speed tests if you're not sure.
Will the set of possible values be directly administered by the users? I.e. is there a screen where they can add / remove / reorder the elements in the list? If so, a separate table is a must, obviously.
Will the list of distinct values power some UI element? E.g. is "color" a droplist in the UI? Then you'll be better off having it in its own table, rather than doing a SELECT DISTINCT on the table every time you need to show the droplist.
If none of those apply, I'd be hard pressed to find another (good) reason to normalize. If you just want to make sure that the value is one of a certain (small) set of legal values, you're better off using a CONSTRAINT that says the value must be in a specific list; keeps things simple, and you can always "upgrade" to a separate table later if the need arises.
One thing no one has considered is that you would not join to the lookup table if the data in it can change over time and the records joined to are historical. The example is a parts table and an order table. The vendors may drop parts or change part numbers, but the orders table should alawys have exactly what was ordered at the time it was ordered. Therefore, it should lookup the data to do the record insert but should never join to the lookup table to get information about an existing order. Instead the part number and description and price, etc. should be stored in the orders table. This is espceially critical so that price changes do not propagate through historical data and make your financial records inaccurate. In this case, you would also want to avoid using any kind of cascading update as well.
rauhr.myopenid.com wrote:
The way we decided to solve this problem is with 4th normal form.
...
That is not 4th normal form. That is a common mistake called One True Lookup:
http://www.dbazine.com/ofinterest/oi-articles/celko22
4th normal form is :
http://en.wikipedia.org/wiki/Fourth_normal_form
Normalization is pretty universally regarded as part of best practices in databases, and normalization says yeah, you push the data out and refer to it by key.
Since no one else has addressed your second point: When queries become long and difficult to read and write due to all those joins, a view will usually resolve that.
You can even make it a rule to always program against the views, having the view get the lookups.
This makes it possible to optimize the view and make your code resistant to changes in the tables.
In oracle, you could even convert the view into a materialized view if you ever need to.