id columns or clustered primary keys/database consistency - sql

If I had a table with the columns:
Artist
Album
Song
NumberOfListens
...is it better to put a clustered primary key on Artist, Album, and Song, or to have an auto-incrementing id column and put a unique constraint on Artist, Album, and Song?
How important is database consistency? If half of my tables have clustered primary keys and the other half an id column with unique constraints, is that bad or does it not matter? Both ways seem the same to me but I do not know what the industry standard is or which is better and why.

I would never put a primary key on columns of long text like Artist, Album, and Song. Use an auto-increment ID as the clustered PK. If you want Artist, Album, and Song to be unique, add a unique index on the three. If you want to search by Album or Song independent of Artist, you'll need an index for each, which pulls in the PK, so having a small PK saves you on each other index. The savings are not just in disk space but also in memory cache, with more keys fitting on a page.
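A minimal sketch of that layout, run through Python's built-in sqlite3 module as a stand-in engine. That is an assumption on my part: the question doesn't name an engine, and SQLite has no CLUSTERED keyword, so this only shows the surrogate key plus unique constraint, not the physical clustering. The table name listens and the helper try_insert are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE listens (
        id INTEGER PRIMARY KEY,              -- small surrogate key
        artist TEXT NOT NULL,
        album TEXT NOT NULL,
        song TEXT NOT NULL,
        number_of_listens INTEGER NOT NULL DEFAULT 0,
        UNIQUE (artist, album, song)         -- uniqueness is still enforced
    )
""")

def try_insert(artist, album, song):
    """Return True if the row went in, False if it broke the unique constraint."""
    try:
        conn.execute(
            "INSERT INTO listens (artist, album, song) VALUES (?, ?, ?)",
            (artist, album, song),
        )
        return True
    except sqlite3.IntegrityError:
        return False

first_ok = try_insert("Artist A", "Album B", "Song C")
duplicate_ok = try_insert("Artist A", "Album B", "Song C")   # same triple again
other_song_ok = try_insert("Artist A", "Album B", "Song D")  # differs in one column
row_count = conn.execute("SELECT COUNT(*) FROM listens").fetchone()[0]
```

The second insert is rejected even though the primary key is only the id column, so you lose nothing in terms of uniqueness by keeping the key narrow.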

You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario. You reference primary keys in your foreign key constraints, so those are crucial for the integrity of your database. Use them - always - period.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, unique, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way, you can easily pick a column that is not your primary key to be your clustering key.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a VARCHAR(20) or so as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
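To put a rough number on that, here is a back-of-envelope calculation in Python. The row count, the number of non-clustered indexes and the key widths are hypothetical figures I picked for illustration, and the formula ignores per-row overhead, variable-length bookkeeping and page fill, so treat the result as an order of magnitude only.

```python
def clustering_key_overhead_bytes(rows, nonclustered_indexes, wide_key_bytes, narrow_key_bytes):
    """Extra bytes spent carrying the wider clustering key in every non-clustered index entry."""
    return rows * nonclustered_indexes * (wide_key_bytes - narrow_key_bytes)

# Hypothetical table: 10 million rows, 4 non-clustered indexes,
# a 20-byte string clustering key versus a 4-byte INT.
extra = clustering_key_overhead_bytes(
    rows=10_000_000,
    nonclustered_indexes=4,
    wide_key_bytes=20,
    narrow_key_bytes=4,
)
extra_mb = extra / 1_000_000
print(f"{extra_mb:.0f} MB of additional index storage")
```

With these assumed inputs the difference comes to 640 MB, and the same pages also compete for space in the buffer cache.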
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc

Clustered indexes are great for range based queries. For example, a log date or order date. Putting one on Artist, Album, and Song will [probably] cause fragmentation when you insert new rows.
If your DB supports it, add a non-clustered primary key on Artist, Album, and Song and call it good. Or just add a unique key on Artist, Album, and Song.
Having an autoincrementing primary key would only really be useful if you had to have referential integrity to another table.

Without knowing the exact requirements, in general you would probably have an artist table, and possibly album table too. A song table would then be a unique combination of artist id, album id and then song. I'd enforce the uniqueness by an index or constraint depending on application, and use an id for a primary key.

First of all, there's already a problem here because the data is not normalized. Creating any sort of index on a bunch of text columns is something that should be avoided whenever possible. Even if these columns aren't text (and I suspect that they are), it still doesn't make sense to have artist, album and song in the same table. A much better design for this would be:
CREATE TABLE Artists (
    ArtistID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    ArtistName varchar(100) NOT NULL
);
CREATE TABLE Albums (
    AlbumID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    ArtistID int NOT NULL,
    AlbumName varchar(100) NOT NULL,
    CONSTRAINT FK_Albums_Artists FOREIGN KEY (ArtistID)
        REFERENCES Artists (ArtistID)
);
CREATE TABLE Songs (
    SongID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    AlbumID int NOT NULL,
    SongName varchar(100) NOT NULL,
    NumberOfListens int NOT NULL DEFAULT 0,
    CONSTRAINT FK_Songs_Albums FOREIGN KEY (AlbumID)
        REFERENCES Albums (AlbumID)
);
Once you have this design, you have the ability to search for individual albums and artists as well as songs. You can also add covering indexes to speed up queries, and the indexes will be much smaller and therefore faster than the original design.
If you don't need to do range queries (which you probably don't), then you could replace the IDENTITY key with a ROWGUID if that suits your design better. It doesn't matter much in this case, but I would stick with the simple IDENTITY.
You have to be careful with clustering keys. If you cluster on a key that is completely not even remotely sequential (and an artist, album, and song name definitely qualify as non-sequential), then you end up with page splits and other nastiness. You don't want this. And as Marc says, a copy of this key gets added to every index, and you definitely don't want this when your key is 300 or 600 bytes long.
If you want to be able to quickly query for the number of listens for a specific song by the artist, album, and song name, it's actually quite simple with the above design, you just need to index properly:
CREATE UNIQUE INDEX IX_Artists_Name ON Artists (ArtistName)
CREATE UNIQUE INDEX IX_Albums_Artist_Name ON Albums (ArtistID, AlbumName)
CREATE UNIQUE INDEX IX_Songs_Album_Name ON Songs (AlbumID, SongName)
INCLUDE (NumberOfListens)
Now this query will be fast:
SELECT ArtistName, AlbumName, SongName, NumberOfListens
FROM Artists ar
INNER JOIN Albums al
ON al.ArtistID = ar.ArtistID
INNER JOIN Songs s
ON s.AlbumID = al.AlbumID
WHERE ar.ArtistName = @ArtistName
AND al.AlbumName = @AlbumName
AND s.SongName = @SongName
If you check out the execution plan you'll see 3 index seeks - it's as fast as you can get it. We've guaranteed the exact same uniqueness as in the original design and optimized for speed. More importantly, it's normalized, so both an artist and an album have their own specific identity, which makes this a great deal easier to manage over the long term. It's much easier to search for "all albums by artist X." It's much much easier and faster to search for "all songs on album Y."
When designing a database, normalization should be your first concern, indexing should be your second. And you're likely to find that once you have a normalized design, the best indexing strategy becomes kind of obvious.

Related

Designing the primary key in associative table

Suppose I have an artist table like:
id | name
1  | John Coltrane
2  | Springsteen
and a song table like:
id | title
1  | Singing in the rain
2  | Mimosa
Now an artist can write more than one song, and a song can be written by more than one artist. We have a many-to-many relation. We need an associative table!
How to design the primary key of the associative table?
One way would be to define a composite key of the two foreign keys, like this:
CREATE TABLE artist_song_map(
artist_id INTEGER,
song_id INTEGER,
PRIMARY KEY(artist_id, song_id),
FOREIGN KEY(artist_id) REFERENCES artist(id),
FOREIGN KEY(song_id) REFERENCES song(id)
)
Another way would be to have a synthetic primary key, and impose an unique constraint on the tuple of the two foreign keys:
CREATE TABLE artist_song_map(
id INTEGER PRIMARY KEY AUTOINCREMENT,
artist_id INTEGER,
song_id INTEGER,
UNIQUE(artist_id, song_id),
FOREIGN KEY(artist_id) REFERENCES artist(id),
FOREIGN KEY(song_id) REFERENCES song(id)
)
Which design choice is better?
Unless you define the table as WITHOUT ROWID both queries will create the same table.
The column id in your 2nd way adds nothing but an alias for the column rowid, which will be created either way.
Since this is a bridge table, you only need to define the combination of the columns artist_id and song_id as UNIQUE.
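You can check the rowid point yourself with a few lines of Python and the built-in sqlite3 module. This is only a sketch: the table names map_composite and map_surrogate are made up, and I left out the foreign keys to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE map_composite (
        artist_id INTEGER,
        song_id INTEGER,
        PRIMARY KEY (artist_id, song_id)
    );
    CREATE TABLE map_surrogate (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        artist_id INTEGER,
        song_id INTEGER,
        UNIQUE (artist_id, song_id)
    );
    INSERT INTO map_composite (artist_id, song_id) VALUES (1, 2), (2, 1);
    INSERT INTO map_surrogate (artist_id, song_id) VALUES (1, 2), (2, 1);
""")

# The composite-key table still has a hidden rowid, because it is not WITHOUT ROWID.
composite_rows = conn.execute(
    "SELECT rowid, artist_id, song_id FROM map_composite ORDER BY rowid"
).fetchall()

# In the surrogate-key table, id is just another name for that same rowid.
surrogate_rows = conn.execute(
    "SELECT rowid, id FROM map_surrogate ORDER BY rowid"
).fetchall()
```

Both tables hand out a rowid per row; the second design merely gives it a visible name.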
If you want to extend your design with other tables, like a playlist table, you will have to decide how it will be linked to the existing tables:
If there is no id column in artist_song_map then you will link playlist to song and artist, just like you did with artist_song_map.
If there is an id column in artist_song_map then you can link playlist directly to that id.
I suggest that you base your decision not only on these 3 tables (song, artist and artist_song_map), but also on the tables that you plan to add.
Logically, both designs are the same. But from an administration standpoint, the identity design is more efficient: less disk fragmentation, and future redesign or maintenance will be easier.
Bridge tables normally don't require an ID (AUTO_INCREMENT) to identify the rows.
The linking columns (foreign keys) are the main point, as they link artists to songs.
Only when you need special attributes on that bridge, or you want to reference a row of that bridge table without carrying two linking columns, would you use such an ID field, but as I said, normally you never need it.
While, generally, the differences are minor, the composite/compound foreign key design sounds more natural. A separate primary key together with the associated index take additional space in the database. Further, if you use a composite primary key, you can declare the table as WITHOUT ROWID. According to the official docs, "in some cases, a WITHOUT ROWID table can use about half the amount of disk space and can operate nearly twice as fast".

Is it good to have 4 Columns as Primary key?

My database has the tables Students (PK Student_ID) and Course (PK Course_ID).
There are also two tables for saving the feedback results for each course: Questions, which stores the feedback questions (PK question_ID), and feedback.
I'm wondering if I can use the 3 foreign keys in the feedback table (course_ID, student_Id, question_ID) together with feedback_ID as the PK.
I think it's useful to have the result for each question, student, or course, but I don't know whether using 4 columns as a primary key is possible, or whether it is a good idea.
Because most people assume the Primary Key is the Clustering Key, I am going to interpret your question as "Is it good or bad to have 4 columns as a Clustering Key".
The situation you are considering is related to debates like The Clustered Index Debate and Surrogate Key vs. Natural Key.
In this situation, I would want to consider the impact of a 12 byte wide composite clustering key vs a 4 byte wide clustering key (double those if you are going to use bigint). My decision tree would look something like this:
Will we be using Hekaton (In-Memory OLTP)?
  Yes => composite key. Run away.
  No => Good call, continue...
How many rows will this table have?
  Tens of millions, maybe more! => surrogate key (probably).
    If the data length of each row will be variable and not narrow => surrogate key.
    If the data length of the row will be fixed and narrow, and it results in optimal page usage => continue...
  Less than that => continue...
How will the feedback table be queried?
  Various combinations of, and not always all of, course_id, student_Id, question_id => surrogate key.
    In this case you may want multiple supporting indexes on the combinations of course_id, student_Id, question_id for your queries.
    The clustering key is included in all non-clustered indexes, and the larger it is, the more space/pages each index entry will require. => surrogate key.
  Almost always by all three of course_id, student_Id, question_id, or almost always by course_id or by course_id, student_Id; but not by student_id without course_id, and not by question_id without course_id, student_id (zero or only a couple of non-clustered indexes on this table) => continue...
Will any other table reference this table?
  Yes: e.g. Course Instructors will be able to leave a comment or remark regarding the response to a feedback question. => surrogate key.
  Kind of... e.g. an Audit/History table will be tracking inserts/updates/deletes to rows in this table.
    A surrogate key may make change tracking less complex to review, and the audit table's clustering key would most likely be the surrogate key & datetime or surrogate key vs composite key & datetime or surrogate key.
  No => composite key is a reasonable option
Even if my first run through of the above decision tree leads me to a composite key, I would probably start my design using a surrogate key because it is easier to get rid of it (because it isn't being used) than to go back and add it and implement its use.
Just to clarify, I have had cases where I did find that the composite key was a better solution and did refactor the design to drop the surrogate key. I don't want to leave the impression that the surrogate key is always the better solution, even if it is a common default for many designers (including myself).
I would start out with something like this:
create table feedback (
feedback_id int not null identity(1,1)
, course_id int not null
, student_id int not null
, question_id int not null
, response_added datetime not null
constraint df_feedback_response_added_gd default getdate()
, response nvarchar(max) null
, constraint pk_feedback primary key clustered (feedback_id)
, constraint fk_feedback_course foreign key (course_id)
references course(course_id)
, constraint fk_feedback_students foreign key (student_id)
references student(student_id)
, constraint fk_feedback_question foreign key (question_id)
references question(question_id)
, constraint uq_feedback_course_student_question
unique (course_id, student_Id, question_id)
/* or create a unique index to use include() instead */
);
/* unique index that includes response */
create unique nonclustered index ux_feedback_course_student_question_covering
on feedback (course_id, student_Id, question_id)
include (response_added, response);
Reference:
A Simple Start – Table Creation Best Practices - Kimberly Tripp - concerning row size and page utilization
Ever-increasing clustering key – the Clustered Index Debate……….again! - Kimberly Tripp
The Clustered Index Debate Continues… - Kimberly Tripp
More considerations for the clustering key – the clustered index debate continues! - Kimberly Tripp
How much does that key cost? (plus sp_helpindex9) - Kimberly Tripp
101 Things I Wish You Knew About Sql Server - Thomas LaRock
SQL Server: Natural Key Verses Surrogate Key - Database Journal - Gregory A. Larsen
Ten Common Database Design Mistakes - Simple Talk - Louis Davidson
Hekaton (In-Memory OLTP) - dbareactions.com
If Feedback_Id uniquely identifies the record, then having it as the primary key should work just fine. Including multiple columns in the PK can create trouble in the future.
Let's say you want to persist other feedback details (like comments). You want to define a table called FeedbackComment that should have Feedback as a parent. An FK can only reference one or more columns that have a UNIQUE or PRIMARY KEY constraint defined on them. Generally, the PK is the target of an FK.
Of course, you can have all columns of the PK defined (feedback_id, course_id etc.) in the child table, but this will make joins more complicated.
Also, if you are using some kind of ORM (i.e. Entity Framework) in the service layer of the application, having a single integer primary key might be useful (e.g. have generic methods that retrieve entity based on an integer identifier).
As Gordon mentioned, there is nothing wrong with composite primary keys, but think about what you do with the table and not to complicate your life when application extensions must be made.

Choosing indexes and primary keys for performance

I am new to database design and I am having a lot of trouble on designing a PostgreSQL database for a combat game.
On this game, players will fight between them, gaining resources to buy better weapons and armors. Combats will be recorded for future review and the number of combats is expected to grow rapidly, as, for example, 1k players fighting 1k rounds will produce 500k records.
Game interactivity is limited to spending points to upgrade the fighter's equipment and abilities. Combats are resolved by the machine.
Details:
A specific type of weapon or armor can only be possessed once by each fighter.
Fighters will almost exclusively be searched by id.
I will often need to search which pieces of equipment (weapons and/or armor) are possessed by a specific fighter, but I do not expect to search which fighters possess a specific type of weapon.
Combats will be often searched by winner or loser.
Two given fighters can fight multiple times on different dates, so the tuple winner-loser is not unique on table combats
fighters table contains a lot of columns that will be often retrieved all at the same time (I create two objects of class "Fighter" with all the related information anytime a combat begins)
This is my current design:
CREATE TABLE IF NOT EXISTS weapons (
id serial PRIMARY KEY,
*** Game stuff ***
);
CREATE TABLE IF NOT EXISTS armors (
id serial PRIMARY KEY,
*** Game stuff ***
);
CREATE TABLE IF NOT EXISTS fighters (
id serial PRIMARY KEY,
preferred_weapon INT references weapons(id),
preferred_armor INT references armors(id),
*** Game stuff ***
);
CREATE TABLE IF NOT EXISTS combats (
id serial PRIMARY KEY,
winner INT references fighters(id),
loser INT references fighters(id),
*** Game stuff ***
);
CREATE TABLE IF NOT EXISTS fighters_weapons (
fighter INT NOT NULL references fighters(id),
weapon INT NOT NULL references weapons(id),
PRIMARY KEY(fighter, weapon)
);
CREATE TABLE IF NOT EXISTS fighters_armors (
fighter INT NOT NULL references fighters(id),
armor INT NOT NULL references armors(id),
PRIMARY KEY(fighter, armor)
);
My questions are:
Do you think my design is well suited?
I have seen a lot of example databases containing an id column as primary key on every table. Is there any reason for that? Should I do that instead of the multiple column primary keys I am using on fighters_weapons and fighters_armors?
PostgreSQL creates indexes automatically for each primary key, but there are several tables which I do not expect to search by it (i. e. combats). Should I remove the index for performance? PostgreSQL complains about an existing constraint.
As I will search fighters_weapons and fighters_armors by fighter, as well as combats by winner and loser, do you think I should create indexes for all of these columns on these tables?
Any performance improvement advice? The most used operations will be: insert and query fighters, query equipment for a given fighter and insert combats.
Thanks a lot :)
To address your explicit questions:
2) It can be preferable to use a "natural" value as a primary key, i.e. not a serial id, if one exists. In cases where you are unlikely to use a serial id as an identifier, I would say it's slightly better not to add it.
3) Unless you intend to insert many rows very quickly into the combats table, it probably won't hurt you too much to have the index on the id column.
4) Creating an index on {fighter} is not necessary if the index {fighter, weapon} exists, and similarly creating an index on {fighter} is not necessary if the index {fighter, armor} exists. In general, you don't benefit from creating an index that is the prefix of another multi-column index. Separately, creating {winner} and {loser} indexes on combats seems like a good idea given the access pattern you've described.
5) Beyond table design, there are a few database tuning parameters that you might want to set if you've installed the database yourself. If an experienced database administrator has set up the database, he/she has probably already done this for you.
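To illustrate point 4, here is a small sketch that asks a query planner which index it would use. I'm using Python's built-in sqlite3 module rather than PostgreSQL purely so the snippet runs without a server. That is a substitution: the planners differ, and on your real database you would run EXPLAIN on the actual queries. The helper plan is made up for this example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fighters_weapons (
        fighter INTEGER NOT NULL,
        weapon INTEGER NOT NULL,
        PRIMARY KEY (fighter, weapon)   -- creates the composite index automatically
    )
""")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Filtering on the leading column can seek into the composite index.
by_fighter = plan("SELECT weapon FROM fighters_weapons WHERE fighter = 1")

# Filtering only on the second column cannot seek; the engine has to scan.
by_weapon = plan("SELECT fighter FROM fighters_weapons WHERE weapon = 1")

print(by_fighter)
print(by_weapon)
```

The first plan should be a search on the primary key's index with a fighter=? condition, the second a scan. That is why a separate index on {fighter} buys you nothing, while a lookup by weapon alone would need its own index if you ever wanted one.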

mysql: difference between primary key and unique index? [duplicate]

At work we have a big database with unique indexes instead of primary keys and all works fine.
I'm designing new database for a new project and I have a dilemma:
In DB theory, primary key is fundamental element, that's OK, but in REAL projects what are advantages and disadvantages of both?
What do you use in projects?
EDIT: ...and what about primary keys and replication on MS SQL server?
What is a unique index?
A unique index on a column is an index on that column that also enforces the constraint that you cannot have two equal values in that column in two different rows. Example:
CREATE TABLE table1 (foo int, bar int);
CREATE UNIQUE INDEX ux_table1_foo ON table1(foo); -- Create unique index on foo.
INSERT INTO table1 (foo, bar) VALUES (1, 2); -- OK
INSERT INTO table1 (foo, bar) VALUES (2, 2); -- OK
INSERT INTO table1 (foo, bar) VALUES (3, 1); -- OK
INSERT INTO table1 (foo, bar) VALUES (1, 4); -- Fails!
Duplicate entry '1' for key 'ux_table1_foo'
The last insert fails because it violates the unique index on column foo when it tries to insert the value 1 into this column for a second time.
In MySQL a unique constraint allows multiple NULLs.
It is possible to make a unique index on multiple columns.
Primary key versus unique index
Things that are the same:
A primary key implies a unique index.
Things that are different:
A primary key also implies NOT NULL, but a unique index can be nullable.
There can be only one primary key, but there can be multiple unique indexes.
If there is no clustered index defined then the primary key will be the clustered index.
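The NULL behaviour is easy to try out. The sketch below reuses table1 from the example above but runs it through Python's built-in sqlite3 module instead of MySQL (an assumption made so it runs anywhere). SQLite happens to treat NULLs in a unique index the same way, as distinct from each other; other engines may not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (foo INT, bar INT)")
conn.execute("CREATE UNIQUE INDEX ux_table1_foo ON table1 (foo)")

conn.execute("INSERT INTO table1 (foo, bar) VALUES (NULL, 1)")
conn.execute("INSERT INTO table1 (foo, bar) VALUES (NULL, 2)")  # a second NULL is accepted
conn.execute("INSERT INTO table1 (foo, bar) VALUES (1, 3)")

try:
    conn.execute("INSERT INTO table1 (foo, bar) VALUES (1, 4)")  # duplicate non-NULL value
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_count = conn.execute(
    "SELECT COUNT(*) FROM table1 WHERE foo IS NULL"
).fetchone()[0]
```

So a nullable unique index cannot stand in for a primary key unless you also declare the column NOT NULL.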
You can see it like this:
A Primary Key IS Unique
A unique value doesn't have to be the representation of the element
Meaning? Well, a primary key is used to identify the element: if you have a "Person", you would like to have a Personal Identification Number (SSN or such) which is primary to your Person.
On the other hand, the person might have an e-mail which is unique, but doesn't identify the person.
I always have Primary Keys, even in relationship tables ( the mid-table / connection table ) I might have them. Why? Well I like to follow a standard when coding, if the "Person" has an identifier, the Car has an identifier, well, then the Person -> Car should have an identifier as well!
Foreign keys work with unique constraints as well as primary keys. From Books Online:
A FOREIGN KEY constraint does not have to be linked only to a PRIMARY KEY constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint in another table
For transactional replication, you need the primary key. From Books Online:
Tables published for transactional replication must have a primary key. If a table is in a transactional replication publication, you cannot disable any indexes that are associated with primary key columns. These indexes are required by replication. To disable an index, you must first drop the table from the publication.
Both answers are for SQL Server 2005.
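The foreign key part is not specific to SQL Server. As a rough illustration, here is the same idea in Python with the built-in sqlite3 module, chosen only because it runs without a server. The person and subscription tables are hypothetical, and the replication part can't be shown this way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE subscription (
        id INTEGER PRIMARY KEY,
        -- references the UNIQUE column, not the primary key
        person_email TEXT NOT NULL REFERENCES person (email)
    );
    INSERT INTO person (email) VALUES ('a@example.com');
    INSERT INTO subscription (person_email) VALUES ('a@example.com');
""")

try:
    conn.execute(
        "INSERT INTO subscription (person_email) VALUES ('nobody@example.com')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

subscription_count = conn.execute("SELECT COUNT(*) FROM subscription").fetchone()[0]
```

The row pointing at an unknown e-mail is refused, so the UNIQUE column works as a foreign key target here just as the primary key would.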
The choice of when to use a surrogate primary key as opposed to a natural key is tricky. Answers such as, always or never, are rarely useful. I find that it depends on the situation.
As an example, I have the following tables:
CREATE TABLE toll_booths (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
...
UNIQUE(name)
)
CREATE TABLE cars (
vin VARCHAR(17) NOT NULL PRIMARY KEY,
license_plate VARCHAR(10) NOT NULL,
...
UNIQUE(license_plate)
)
CREATE TABLE drive_through (
id INTEGER NOT NULL PRIMARY KEY,
toll_booth_id INTEGER NOT NULL REFERENCES toll_booths(id),
vin VARCHAR(17) NOT NULL REFERENCES cars(vin),
at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
amount NUMERIC(10,4) NOT NULL,
...
UNIQUE(toll_booth_id, vin)
)
We have two entity tables (toll_booths and cars) and a transaction table (drive_through). The toll_booths table uses a surrogate key because it has no natural attribute that is guaranteed not to change (the name can easily be changed). The cars table uses a natural primary key because it has a non-changing unique identifier (vin). The drive_through transaction table uses a surrogate key for easy identification, but also has a unique constraint on the attributes that are guaranteed to be unique at the time the record is inserted.
http://database-programmer.blogspot.com has some great articles on this particular subject.
There are no disadvantages of primary keys.
To add some information to @MrWiggles' and @Peter Parker's answers: when a table doesn't have a primary key, you won't be able to edit data in some applications (they will end up saying something like "cannot edit / delete data without primary key"). PostgreSQL allows multiple NULL values in a UNIQUE column, whereas a PRIMARY KEY doesn't allow NULLs. Also, some ORMs that generate code may have problems with tables without primary keys.
UPDATE:
As far as I know it is not possible to replicate tables without primary keys in MSSQL, at least without problems (details).
If something is a primary key, depending on your DB engine, the entire table gets sorted by the primary key. This means that lookups are much faster on the primary key because it doesn't have to do any dereferencing as it has to do with any other kind of index. Besides that, it's just theory.
In addition to what the other answers have said, some databases and systems may require a primary to be present. One situation comes to mind; when using enterprise replication with Informix a PK must be present for a table to participate in replication.
As long as you do not allow NULL for a value, they should be handled the same, but NULL is handled differently across databases (AFAIK MS SQL does not allow more than one NULL value in a UNIQUE column, while MySQL and Oracle do).
So you must define this column as NOT NULL with a UNIQUE INDEX.
There is no such thing as a primary key in relational data theory, so your question has to be answered on the practical level.
Unique indexes are not part of the SQL standard. The particular implementation of a DBMS will determine what are the consequences of declaring a unique index.
In Oracle, declaring a primary key will result in a unique index being created on your behalf, so the question is almost moot. I can't tell you about other DBMS products.
I favor declaring a primary key. This has the effect of forbidding NULLs in the key column(s) as well as forbidding duplicates. I also favor declaring REFERENCES constraints to enforce entity integrity. In many cases, declaring an index on the column(s) of a foreign key will speed up joins. This kind of index should in general not be unique.
There are some disadvantages of CLUSTERED INDEXES vs UNIQUE INDEXES.
As already stated, a CLUSTERED INDEX physically orders the data in the table.
This means that when you have a lot of inserts or deletes on a table containing a clustered index, every time (well, almost, depending on your fill factor) you change the data, the physical table needs to be updated to stay sorted.
In relatively small tables, this is fine, but when getting to tables that have GBs worth of data, and inserts/deletes affect the sorting, you will run into problems.
I almost never create a table without a numeric primary key. If there is also a natural key that should be unique, I also put a unique index on it. Joins are faster on integers than multicolumn natural keys, data only needs to change in one place (natural keys tend to need to be updated which is a bad thing when it is in primary key - foreign key relationships). If you are going to need replication use a GUID instead of an integer, but for the most part I prefer a key that is user readable especially if they need to see it to distinguish between John Smith and John Smith.
The few times I don't create a surrogate key are when I have a joining table that is involved in a many-to-many relationship. In this case I declare both fields as the primary key.
My understanding is that a primary key and a unique index with a not‑null constraint are the same (*), and I suppose one chooses one or the other depending on what the specification explicitly states or implies (a matter of what you want to express and explicitly enforce). If it requires uniqueness and not‑null, then make it a primary key. If it just happens that all parts of a unique index are not‑null without any requirement for that, then just make it a unique index.
The sole remaining difference is, you may have multiple not‑null unique indexes, while you can't have multiple primary keys.
(*) Excepting a practical difference: a primary key can be the default unique key for some operations, like defining a foreign key. E.g. if one defines a foreign key referencing a table and does not provide the column name, and the referenced table has a primary key, then the primary key will be the referenced column. Otherwise, the referenced column will have to be named explicitly.
Others here have mentioned DB replication, but I don't know about it.
In SQL Server, a unique index can have one NULL value. By default it creates a NON-CLUSTERED INDEX.
A primary key cannot contain a NULL value. By default it creates a CLUSTERED INDEX.
In MSSQL, Primary keys should be monotonically increasing for best performance on the clustered index. Therefore an integer with identity insert is better than any natural key that might not be monotonically increasing.
If it were up to me...
You need to satisfy the requirements of the database and of your applications.
Adding an auto-incrementing integer or long id column to every table to serve as the primary key takes care of the database requirements.
You would then add at least one other unique index to the table for use by your application. This would be the index on employee_id, or account_id, or customer_id, etc. If possible, this index should not be a composite index.
I would favor indices on several fields individually over composite indices. The database can use the single-field indices whenever the where clause includes those fields, but it will only use a composite index when the where clause supplies its leading fields - meaning it can't use the second field in a composite index unless you provide both the first and second in your where clause.
I am all for using calculated or Function type indices - and would recommend using them over composite indices. It makes it very easy to use the function index by using the same function in your where clause.
This takes care of your application requirements.
It is highly likely that other non-primary indices are actually mappings of that index's key value to a primary key value, not rowid()'s. This allows for physical sorting operations and deletes to occur without having to recreate these indices.

Primary key or Unique index?

At work we have a big database with unique indexes instead of primary keys and all works fine.
I'm designing new database for a new project and I have a dilemma:
In DB theory, primary key is fundamental element, that's OK, but in REAL projects what are advantages and disadvantages of both?
What do you use in projects?
EDIT: ...and what about primary keys and replication on MS SQL server?
What is a unique index?
A unique index on a column is an index on that column that also enforces the constraint that you cannot have two equal values in that column in two different rows. Example:
CREATE TABLE table1 (foo int, bar int);
CREATE UNIQUE INDEX ux_table1_foo ON table1(foo); -- Create unique index on foo.
INSERT INTO table1 (foo, bar) VALUES (1, 2); -- OK
INSERT INTO table1 (foo, bar) VALUES (2, 2); -- OK
INSERT INTO table1 (foo, bar) VALUES (3, 1); -- OK
INSERT INTO table1 (foo, bar) VALUES (1, 4); -- Fails!
Duplicate entry '1' for key 'ux_table1_foo'
The last insert fails because it violates the unique index on column foo when it tries to insert the value 1 into this column for a second time.
In MySQL a unique constraint allows multiple NULLs.
It is possible to make a unique index on mutiple columns.
Primary key versus unique index
Things that are the same:
A primary key implies a unique index.
Things that are different:
A primary key also implies NOT NULL, but a unique index can be nullable.
There can be only one primary key, but there can be multiple unique indexes.
If there is no clustered index defined then the primary key will be the clustered index.
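To illustrate the "only one primary key, but multiple unique indexes" point, here is a small sketch using SQLite via Python's sqlite3 module. The table and column names are invented for the example, and other engines word the error differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One primary key plus two unique columns on the same table is accepted.
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, ssn TEXT UNIQUE, email TEXT UNIQUE)"
)

# Declaring a second primary key on one table is rejected at creation time.
try:
    conn.execute("CREATE TABLE broken (a INTEGER PRIMARY KEY, b INTEGER PRIMARY KEY)")
    second_pk_rejected = False
except sqlite3.OperationalError:
    second_pk_rejected = True

# PRAGMA index_list reports one row per index; the third field is the unique flag.
unique_index_count = sum(
    1 for row in conn.execute("PRAGMA index_list('person')") if row[2]
)
print(second_pk_rejected, unique_index_count)
```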
You can see it like this:
A Primary Key IS Unique
A Unique value doesn't have to be the Representation of the Element
Meaning? Well, a primary key is used to identify the element: if you have a "Person", you would like to have a Personal Identification Number (SSN or such) which is primary to your Person.
On the other hand, the person might have an e-mail which is unique, but doesn't identify the person.
I always have Primary Keys, even in relationship tables ( the mid-table / connection table ) I might have them. Why? Well I like to follow a standard when coding, if the "Person" has an identifier, the Car has an identifier, well, then the Person -> Car should have an identifier as well!
Foreign keys work with unique constraints as well as primary keys. From Books Online:
A FOREIGN KEY constraint does not have to be linked only to a PRIMARY KEY constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint in another table.
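The quote is about SQL Server, but the same idea can be tried quickly in SQLite through Python's sqlite3 module. In this sketch (table and column names are invented), the child table references a UNIQUE column rather than the parent's primary key, and the reference is still enforced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys once this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)
# The foreign key points at the UNIQUE column, not at the primary key.
conn.execute(
    "CREATE TABLE subscription ("
    " id INTEGER PRIMARY KEY,"
    " email TEXT NOT NULL REFERENCES person(email))"
)

conn.execute("INSERT INTO person (email) VALUES ('a@example.com')")
conn.execute("INSERT INTO subscription (email) VALUES ('a@example.com')")  # parent exists

try:
    conn.execute("INSERT INTO subscription (email) VALUES ('missing@example.com')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

subscription_count = conn.execute("SELECT COUNT(*) FROM subscription").fetchone()[0]
print(orphan_rejected, subscription_count)
```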
For transactional replication, you need the primary key. From Books Online:
Tables published for transactional replication must have a primary key. If a table is in a transactional replication publication, you cannot disable any indexes that are associated with primary key columns. These indexes are required by replication. To disable an index, you must first drop the table from the publication.
Both answers are for SQL Server 2005.
The choice of when to use a surrogate primary key as opposed to a natural key is tricky. Answers such as "always" or "never" are rarely useful. I find that it depends on the situation.
As an example, I have the following tables:
CREATE TABLE toll_booths (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
...
UNIQUE(name)
)
CREATE TABLE cars (
vin VARCHAR(17) NOT NULL PRIMARY KEY,
license_plate VARCHAR(10) NOT NULL,
...
UNIQUE(license_plate)
)
CREATE TABLE drive_through (
id INTEGER NOT NULL PRIMARY KEY,
toll_booth_id INTEGER NOT NULL REFERENCES toll_booths(id),
vin VARCHAR(17) NOT NULL REFERENCES cars(vin),
at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
amount NUMERIC(10,4) NOT NULL,
...
UNIQUE(toll_booth_id, vin)
)
We have two entity tables (toll_booths and cars) and a transaction table (drive_through). The toll_booths table uses a surrogate key because it has no natural attribute that is guaranteed not to change (the name can easily be changed). The cars table uses a natural primary key because it has a non-changing unique identifier (vin). The drive_through transaction table uses a surrogate key for easy identification, but also has a unique constraint on the attributes that are guaranteed to be unique at the time the record is inserted.
http://database-programmer.blogspot.com has some great articles on this particular subject.
There are no disadvantages of primary keys.
To add just some information to @MrWiggles' and @Peter Parker's answers: when a table doesn't have a primary key, you won't be able to edit data in some applications (they will end up saying something like "cannot edit / delete data without primary key"). PostgreSQL allows multiple NULL values in a UNIQUE column, while PRIMARY KEY doesn't allow NULLs. Also, some ORMs that generate code may have problems with tables without primary keys.
UPDATE:
As far as I know it is not possible to replicate tables without primary keys in MSSQL, at least without problems.
If something is a primary key, depending on your DB engine, the entire table gets sorted by the primary key. This means that lookups are much faster on the primary key because it doesn't have to do any dereferencing as it has to do with any other kind of index. Besides that, it's just theory.
In addition to what the other answers have said, some databases and systems may require a primary key to be present. One situation comes to mind: when using enterprise replication with Informix, a PK must be present for a table to participate in replication.
As long as you do not allow NULL for a value, they should be handled the same, but NULL is handled differently across databases (AFAIK MS SQL does not allow more than one NULL value in a UNIQUE column, while MySQL and Oracle do).
So you must define this column as NOT NULL with a UNIQUE INDEX.
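A quick way to see the NULL behaviour for yourself is sketched below with SQLite via Python's sqlite3 module. SQLite happens to accept several NULLs in a UNIQUE column, so treat the first half as engine-specific; the second half shows the NOT NULL plus UNIQUE combination recommended above. Table names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# UNIQUE alone: SQLite accepts more than one NULL (other engines may differ).
conn.execute("CREATE TABLE nullable_codes (code TEXT UNIQUE)")
conn.execute("INSERT INTO nullable_codes VALUES (NULL)")
conn.execute("INSERT INTO nullable_codes VALUES (NULL)")
null_rows = conn.execute(
    "SELECT COUNT(*) FROM nullable_codes WHERE code IS NULL"
).fetchone()[0]

# NOT NULL + UNIQUE: the engine-specific NULL question no longer arises.
conn.execute("CREATE TABLE strict_codes (code TEXT NOT NULL UNIQUE)")
try:
    conn.execute("INSERT INTO strict_codes VALUES (NULL)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

print(null_rows, null_rejected)
```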
There is no such thing as a primary key in relational data theory, so your question has to be answered on the practical level.
Unique indexes are not part of the SQL standard. The particular implementation of a DBMS will determine what are the consequences of declaring a unique index.
In Oracle, declaring a primary key will result in a unique index being created on your behalf, so the question is almost moot. I can't tell you about other DBMS products.
I favor declaring a primary key. This has the effect of forbidding NULLs in the key column(s) as well as forbidding duplicates. I also favor declaring REFERENCES constraints to enforce referential integrity. In many cases, declaring an index on the column(s) of a foreign key will speed up joins. This kind of index should in general not be unique.
There are some disadvantages of CLUSTERED INDEXES vs UNIQUE INDEXES.
As already stated, a CLUSTERED INDEX physically orders the data in the table.
This means that when you have a lot of inserts or deletes on a table containing a clustered index, every time (well, almost, depending on your fill factor) you change the data, the physical table needs to be updated to stay sorted.
In relatively small tables, this is fine, but when getting to tables that have GBs worth of data, and inserts/deletes affect the sorting, you will run into problems.
I almost never create a table without a numeric primary key. If there is also a natural key that should be unique, I also put a unique index on it. Joins are faster on integers than multicolumn natural keys, data only needs to change in one place (natural keys tend to need to be updated which is a bad thing when it is in primary key - foreign key relationships). If you are going to need replication use a GUID instead of an integer, but for the most part I prefer a key that is user readable especially if they need to see it to distinguish between John Smith and John Smith.
The few times I don't create a surrogate key are when I have a joining table that is involved in a many-to-many relationship. In this case I declare both fields as the primary key.
My understanding is that a primary key and a unique index with a not-null constraint are the same (*); and I suppose one chooses one or the other depending on what the specification explicitly states or implies (a matter of what you want to express and explicitly enforce). If it requires uniqueness and not-null, then make it a primary key. If it just happens that all parts of a unique index are not-null without any requirement for that, then just make it a unique index.
The sole remaining difference is that you may have multiple not-null unique indexes, while you can't have multiple primary keys.
(*) Excepting a practical difference: a primary key can be the default unique key for some operations, like defining a foreign key. E.g. if one defines a foreign key referencing a table and does not provide the column name, and the referenced table has a primary key, then the primary key will be the referenced column. Otherwise, the referenced column will have to be named explicitly.
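The footnote's default can be checked with a small sketch in SQLite via Python's sqlite3 module (SQLite is an assumption here, and the table names are invented). The child table says only REFERENCES parent, with no column, and the check ends up being made against the parent's primary key rather than its other unique column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce foreign keys

conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE)")
# No column named after REFERENCES: the parent's primary key is used.
conn.execute("CREATE TABLE child (parent_ref INTEGER REFERENCES parent)")

conn.execute("INSERT INTO parent (id, code) VALUES (10, 'ten')")
conn.execute("INSERT INTO child (parent_ref) VALUES (10)")  # matches parent.id

try:
    conn.execute("INSERT INTO child (parent_ref) VALUES (99)")  # no parent with id 99
    missing_parent_rejected = False
except sqlite3.IntegrityError:
    missing_parent_rejected = True

child_count = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
print(missing_parent_rejected, child_count)
```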
Others here have mentioned DB replication, but I don't know about it.
In SQL Server, a unique index can contain one NULL value, and by default it is created as a non-clustered index.
A primary key cannot contain NULL values, and by default it is created as a clustered index.
In MSSQL, primary keys should be monotonically increasing for best performance on the clustered index. Therefore an integer identity column is better than any natural key that might not be monotonically increasing.
If it were up to me...
You need to satisfy the requirements of the database and of your applications.
Adding an auto-incrementing integer or long id column to every table to serve as the primary key takes care of the database requirements.
You would then add at least one other unique index to the table for use by your application. This would be the index on employee_id, or account_id, or customer_id, etc. If possible, this index should not be a composite index.
I would favor indices on several fields individually over composite indices. The database can use a single-field index whenever the where clause includes that field, but it can generally only seek on a composite index when the where clause includes its leading field(s) - meaning it can't use the second field in a composite index unless you provide both the first and second in your where clause.
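If you want to check the composite index behaviour rather than take it on faith, the query planner will tell you. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; the employees table and the ix_name index are made up, and other engines have their own planners and may decide differently. Note that what matters is which columns appear in the where clause, not the order in which you write the conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (last_name TEXT, first_name TEXT, salary INTEGER)")
# Composite index: last_name is the leading column.
conn.execute("CREATE INDEX ix_name ON employees (last_name, first_name)")

def plan(sql):
    """Return the planner's description of how SQLite would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth field is the detail text

# Leading column only: the composite index can be used.
plan_leading = plan("SELECT * FROM employees WHERE last_name = 'Smith'")
# Both columns, conditions written in the "wrong" order: still usable.
plan_both = plan(
    "SELECT * FROM employees WHERE first_name = 'John' AND last_name = 'Smith'"
)
# Second column only: no seek on the composite index, the table is scanned.
plan_second_only = plan("SELECT * FROM employees WHERE first_name = 'John'")

for description in (plan_leading, plan_both, plan_second_only):
    print(description)
```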
I am all for using calculated or Function type indices - and would recommend using them over composite indices. It makes it very easy to use the function index by using the same function in your where clause.
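Here is a rough sketch of a function (expression) index, again using SQLite via Python's sqlite3 module as a stand-in; the syntax differs between engines, and the customers table and ix_email_lower index are invented for the example. The index is only picked up when the where clause uses the same expression that was indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
# Index on an expression rather than on the raw column.
conn.execute("CREATE INDEX ix_email_lower ON customers (lower(email))")

def plan(sql):
    """Return the planner's description of how SQLite would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth field is the detail text

# Same function in the where clause: the expression index applies.
plan_with_function = plan(
    "SELECT id FROM customers WHERE lower(email) = 'a@example.com'"
)
# Raw column in the where clause: the expression index does not apply.
plan_without_function = plan(
    "SELECT id FROM customers WHERE email = 'A@example.com'"
)

print(plan_with_function)
print(plan_without_function)
```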
This takes care of your application requirements.
It is highly likely that other non-primary indices are actually mappings of that index's key value to a primary key value, not to rowid()s. This allows physical sorting operations and deletes to occur without having to recreate these indices.