Should every MySQL table have an auto-incremented primary key? - sql

I understand the value of primary keys.
I understand the value of indexes.
Should every MySQL table have an auto-incremented primary key (ideally with INT field type)?
Update
@Raj More's answer seems most efficient. The issue when I think about it, however, is how this auto-incremented primary key ID will relate to other tables. For example:
table 1
ID | firstname | lastname | email
---+-----------+----------+------------
1  | john      | doe      | 1@email.com
2  | sarah     | stow     | 2@email.com
3  | mike      | bro      | 3@email.com
table 2
ID | memberid | display         | address
---+----------+-----------------+-------------
1  | 1        | funtime zone    | 123 street
2  | 3        | silly place llc | 944 villa dr
In the example above, a consumer may come to the site and choose to register for a free product/service. If the consumer chooses, they are able to give additional information (stored in table 2) for additional mailing, etc. The problem I see is in how these tables relate to the 'primary key auto-incremented field'. In table 2, the 'memberid' relates to table 1's ID but this is not 'extremely' clear. Any new information placed into table 2 will increment by 1 whereas not all consumers will choose to participate in the table 2 required data.

I am not a huge fan of surrogate keys. I have yet to see a scenario where I would prefer to use one for every table of a database.
I would say No.
Read up on this answer: surrogate-vs-natural-business-keys
The above may be seen as sarcastic or flaming (despite the surprisingly many upvotes) so it's deleted.
In the general case, there have been many questions and answers on surrogate and natural keys so I felt this question is more like a duplicate. My view is that surrogate keys are fine and very useful, mainly because natural keys can lead to very big primary keys in the low end of a chain of connected tables - and this is not handled well by many RDBMS, clustered indexes get big, etc. But saying that "every MySQL table should have an auto-incremented primary key" is a very absolute statement and I think there are cases when they really offer little or nothing.
Since the OP updated the question, I'll try to comment on that specific topic.
I think this is exactly a case where an autoincrementing primary key is not only useless but adds negative value. Supposing that table1 and table2 are in 1:1 relationship, the memberid can be both the Primary Key and a Foreign Key to table1.
Adding an autoincrementing id column adds one index and, if it's a clustered one (like InnoDB PK indexes), increases the size of the memberid index. Even more, if you have such an auto-incrementing id, some JOINs of table2 to other tables will have to be done using this id (the JOINs to tables in 1:n relation to table2) and some using memberid (the JOINs to tables in 1:n relation to table1). If you only have memberid, both these types of JOINs can be done using memberid.
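A minimal sketch of the 1:1 layout described above, where memberid in table2 is both the primary key and a foreign key to table1. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so the column types and the PRAGMA line are assumptions of this sketch rather than MySQL syntax.

```python
import sqlite3

def build_member_tables():
    """Create table1/table2 where table2.memberid is both PK and FK."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        CREATE TABLE table1 (
            id        INTEGER PRIMARY KEY,   -- auto-assigned surrogate key
            firstname TEXT NOT NULL,
            lastname  TEXT NOT NULL,
            email     TEXT NOT NULL UNIQUE
        );
        CREATE TABLE table2 (
            memberid INTEGER PRIMARY KEY REFERENCES table1 (id),  -- PK and FK at once
            display  TEXT NOT NULL,
            address  TEXT NOT NULL
        );
        INSERT INTO table1 (firstname, lastname, email) VALUES
            ('john',  'doe',  '1@email.com'),
            ('sarah', 'stow', '2@email.com'),
            ('mike',  'bro',  '3@email.com');
        INSERT INTO table2 (memberid, display, address) VALUES
            (1, 'funtime zone',    '123 street'),
            (3, 'silly place llc', '944 villa dr');
    """)
    return conn

conn = build_member_tables()

# Members without extra details simply have no row in table2.
rows = conn.execute("""
    SELECT t1.firstname, t2.display
    FROM table1 AS t1
    LEFT JOIN table2 AS t2 ON t2.memberid = t1.id
    ORDER BY t1.id
""").fetchall()
```

With this layout there is no second counter that can drift away from table1's ID: a row in table2 is identified by the member it belongs to, and a member can have at most one such row.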

I am a huge fan of surrogate keys. I have yet to see a scenario where I would prefer not to use one.
I would say Yes.
Read up on this answer Surrogate vs. natural/business keys
Edit
I will change my answer to include the following:
There are certain scenarios that I now use the actual value as a surrogate key:
DimDate (20151031, 20151101, 20151102....)
DimZipCode (10001, 10002, 10003...)
Everything else gets Surrogate Keys.

Yes, with one exception:
A table which implements an n:m relationship between two other tables, a pure link table, does not need such a field if it has only two fields, referencing the two primary keys of the linked tables. The primary key of the link table then consists of the two fields.
As soon as the link table has extra information, it needs a simple single-field primary key.
Having said that, there may be more exceptions; database design is a very broad field...
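To illustrate the pure link table exception, here is a small sketch using Python's sqlite3 module as a stand-in database. The student, course and enrollment tables are hypothetical names chosen for the example; the point is only that the two foreign key columns together form the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Pure link table: no id column of its own.
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student (id),
        course_id  INTEGER NOT NULL REFERENCES course (id),
        PRIMARY KEY (student_id, course_id)
    );

    INSERT INTO student VALUES (1, 'amy'), (2, 'jess');
    INSERT INTO course  VALUES (10, 'sql'), (20, 'algebra');
    INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

def try_enroll(student_id, course_id):
    """Return True if the pair was inserted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO enrollment VALUES (?, ?)", (student_id, course_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The composite primary key rejects a duplicate pair, and the foreign keys reject pairs that point at a missing student or course, so an extra auto-incremented column would add nothing here.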
EDIT: Added more in the previous sentence.

Related

Which column for foreign key: id or any other column and why?

TL;DR
Should a foreign key always refer to the id column of another table? Why or why not? Is there a standard rule for this?
Is there a cost associated with using any other unique column other than id column for foreign key? Performance / storage? How significant? Is it frowned in the industry?
Example: this is the schema for my sample problem:
In my schema sometimes I use id column as the foreign key and sometimes use some other data column.
In vehicle_detail table I use a unique size column as a foreign key from vehicle_size table and unique color column as the foreign key from vehicle_color table.
But in vehicle_user I used the user_identifier_id as a foreign key which refers to id primary key column in user_identifier table.
Which is the correct way?
On a side note, I don't have id columns for the garage_level, garage_spaceid, vehicle_garage_status and vehicle_parking_status tables because they only have one column which is the primary key and the data they store is just at most 15 rows in each table and it is probably never going to change. Should I still have an id column in those ?
A foreign key has to target a primary key or unique constraint. It is normal to reference the primary key, because you typically want to reference an individual row in another table, and the primary key is the identifier of a table row.
From a technical point of view, it does not matter whether a foreign key references the primary key or another unique constraint, because in PostgreSQL both are implemented in the same way, using a unique index.
As to your concrete examples, there is nothing wrong with having the unique size column of vehicle_size be the target of a foreign key, although it begs the question why you didn't make size the primary key and omit the id column altogether. There is no need for each table to have an id column that is the automatically generated numeric primary key, except that there may be ORMs and other software that expect that.
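A short sketch of a foreign key that targets a unique column instead of the primary key, again with Python's sqlite3 module standing in for PostgreSQL. The original schema image is not available, so the columns of vehicle_size and vehicle_detail below are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE vehicle_size (
        id   INTEGER PRIMARY KEY,
        size TEXT NOT NULL UNIQUE          -- unique constraint, not the PK
    );
    CREATE TABLE vehicle_detail (
        id   INTEGER PRIMARY KEY,
        size TEXT NOT NULL REFERENCES vehicle_size (size)  -- FK to the unique column
    );
    INSERT INTO vehicle_size (size) VALUES ('compact'), ('suv');
    INSERT INTO vehicle_detail (size) VALUES ('suv');
""")

def add_vehicle(size):
    """Return True if the row was accepted, False if the FK rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO vehicle_detail (size) VALUES (?)", (size,))
        return True
    except sqlite3.IntegrityError:
        return False

# The readable value is stored in the referencing table, so no join is needed.
sizes = conn.execute("SELECT size FROM vehicle_detail ORDER BY id").fetchall()
```

Since size is already unique and not null, the id column of vehicle_size is doing no work in this sketch, which is the point the answer makes about dropping it.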
A foreign key is basically a column that refers to a column of another table (usually a different table, although a table can also reference itself). It is used to join to, or get data from, that other table. Think of a school as a database with many different tables for different aspects of a student.
Say, by using admission number 1234, you can get the fees from the accounts table and the sports he plays from the sports table.
Now, there is no rule that the foreign key should be the id column; you can use whatever you want. But to use a foreign key you need a matching column in both tables, which is why the id column is usually used. As in the example above, the only common thing between the sports table and the accounts table would be the admission number.
+---------+------------+
| admn_no | sports     |
+---------+------------+
| 1234    | basketball |
+---------+------------+
+---------+---------+
| admn_no | fees    |
+---------+---------+
| 1234    | 1000000 |
+---------+---------+
Now, using the query
select * from accounts join sports using (admn_no);
you will get:
+---------+---------+------------+
| admn_no | fees | sports |
+---------+---------+------------+
| 1234 | 1000000 | basketball |
+---------+---------+------------+
PS: sorry for bad formatting
A foreign key is a field or a column that is used to establish a link between two tables. A FOREIGN KEY is a column (or collection of columns) in one table that refers to the PRIMARY KEY (or another unique key) in another table.
There is no rule that it should refer to an id column, but the column it refers to must be a primary key or carry a unique constraint. In real scenarios it usually refers to the id column, as in most cases that is the primary key of the table.
OP question is about "correct way".
I will try to provide some kind of summary from existing comments and answers, general DO and general DONT for FKs.
What was already said
A. "A foreign key has to target a primary key or unique constraint"
Literally from Laurenz Albe answer and it was noted in comments
B. "stick with whatever you think will change the least"
It was noted by Adrian Klavier in comments.
Notes
There is no such general rule that PK or unique constraint must be defined on a single column.
So the question title itself must be corrected: "Which column(s) for foreign key: id or any other column(s) and why?"
Let's talk about "why".
Why: General DO, general DONT and an advice
Is there a cost associated with using any other unique column other than id column for foreign key? Performance / storage? How significant? Is it frowned in the industry?
General DO: Analyze requirements, use logic, use math (arithmetic is usually enough). There is no single database design that's always good for all cases. Always ask yourself: "Can it be improved?". Never be content with the design of existing FKs; if the requirements, the DBMS or the storage options change, revise the design.
General DONT: Don't think that there is a single correct rule for all cases. Don't think: "if that worked in that database/table then it will work for this case too".
Let me illustrate these points with a common example.
Example: PK on id uuid field
We look into database and see a table has a unique constraint on two fields of types integer (4 bytes) + date (4 bytes)
Additionally: this table has a field id of uuid type (16 bytes)
PK is defined on id
All FKs from other tables are targeting id field
Is this a correct design or not?
Case A. Common case - not OK
Let's use math:
unique constraint on int+date: it's 4+4=8 bytes
data is never changed
so it's a good candidate for primary key in this table
and nothing prevents using it for foreign keys in related tables
So the additional 16 bytes per row, plus the index costs, look like a mistake.
And that's a very common mistake, especially with MSSQL and CLUSTERED indexes on random uuids
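The arithmetic in Case A can be spelled out with a rough back-of-the-envelope calculation. The row counts below are made-up illustration values, and the sketch counts only the raw key columns; it ignores index structures, row headers and alignment, which depend on the DBMS.

```python
INT_BYTES, DATE_BYTES, UUID_BYTES = 4, 4, 16

def key_column_bytes(parent_rows, child_rows):
    """Raw bytes spent on key columns for the two designs (indexes excluded)."""
    natural_key = INT_BYTES + DATE_BYTES  # 8 bytes: int + date

    # Design 1: int+date is the PK, and children store it as their FK.
    natural = natural_key * parent_rows + natural_key * child_rows

    # Design 2: parent keeps int+date AND a uuid id; children store the uuid.
    surrogate = (natural_key + UUID_BYTES) * parent_rows + UUID_BYTES * child_rows

    return natural, surrogate

# Hypothetical volumes: 1 million parent rows, 5 million referencing rows.
natural, surrogate = key_column_bytes(1_000_000, 5_000_000)
```

Under these assumed volumes the uuid design spends a bit more than twice the bytes on key columns (104,000,000 against 48,000,000), before any index overhead is counted.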
Is it always a mistake?
No.
Consider the following cases.
Case B. Distributed system - OK
Suppose that you have a distributed system:
ServerA, ServerB, ServerC are sources of data
HeadServer - is data aggregator
data on ServerA-ServerC could be duplicated: the same record could exist on several instances
aggregated data must not have duplicates
data for related tables can come from different instances: data for table with PK from serverA and data for tables with FKs from serverB-serverC
you need to log where each record originated
In such a case the existence of a PK on id uuid is justified:
the unique constraint allows you to deduplicate records
the surrogate key allows related data to come from different sources
Case C. 'id' is used to expose data through API - OK
Suppose that you have an API to access data for external consumers.
There is a good unique constraint on:
client_id: incrementing integer in range 1..100000
invoice_date: dates '20100101'..'20210901'
And a surrogate key on id with random uuids.
You can create external API in forms:
/server/invoice/{client_id}/{invoice_date}
/server/invoice/{id}
From a security point of view, /{id} is superior for these reasons:
it's impossible to deduce the existence of other uuid values from one uuid value
it's easier to implement an authorization system for entities of different types, e.g. entityA has a natural key on int, entityB on bigint and entityC on int+byte+date
In such a case the surrogate key is not only justified but becomes essential.
Afterword
I hope that I was clear in explanation of main correct principle: "There is no such thing as a universal correct principle".
One additional piece of advice: avoid CASCADE UPDATE/DELETEs.
It does depend on the DBMS you use.
But in general:
"explicit is better than implicit"
CASCADEs rarely work as intended
when CASCADEs do work, they usually have performance problems
Thank you for your attention.
I hope this helps somebody.

Find primary key in star schema table

I have encountered a situation while developing a star schema. I have a table like this
Name | Email
-----+--------------
amy  | amy@gmail.com
jess | amy@gmail.com
I want to find the key column to use as the foreign key for the fact table. As you can see, values can repeat when a column is considered individually, but the rows are unique if both columns together are treated as the key.
Your help will be highly regarded
I am assuming from your question that the table is a dimension? If that is the case then it should have a surrogate key as its PK (as every dimension table should) and you use this to join to fact tables.
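As a sketch of that suggestion: the dimension gets its own surrogate key, the (Name, Email) pair stays unique through a constraint, and the fact table stores only the surrogate key. Python's sqlite3 module is used as a stand-in database, and dim_person, fact_login and login_count are hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE dim_person (
        person_key INTEGER PRIMARY KEY,   -- surrogate key of the dimension
        name       TEXT NOT NULL,
        email      TEXT NOT NULL,
        UNIQUE (name, email)              -- the natural key stays enforced
    );
    CREATE TABLE fact_login (
        person_key  INTEGER NOT NULL REFERENCES dim_person (person_key),
        login_count INTEGER NOT NULL
    );
    INSERT INTO dim_person (name, email) VALUES
        ('amy',  'amy@gmail.com'),
        ('jess', 'amy@gmail.com');
    INSERT INTO fact_login VALUES (1, 3), (2, 5), (1, 4);
""")

# Facts join to the dimension through the single surrogate key column.
totals = conn.execute("""
    SELECT d.name, SUM(f.login_count)
    FROM fact_login AS f
    JOIN dim_person AS d ON d.person_key = f.person_key
    GROUP BY d.person_key
    ORDER BY d.person_key
""").fetchall()
```

The shared email address is not a problem for the join, because the fact rows never reference Name or Email directly.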

SQL primary key on lookup table or unique constraint?

I want to create a lookup table 'orderstatus'. i.e. below, just to clarify this is to be used in a Data Warehouse. I will need to join through OrderStatus to retrieve the INT (if i create one) to be used elsewhere if need be. Like in a fact table for example, I would store the int in the fact table to link to the lookup table.
+------------------------+------------------+
| OrderStatus | ConnectionStatus |
+------------------------+------------------+
| CLOSED | APPROVE |
+------------------------+------------------+
| COMPLETED | APPROVE |
+------------------------+------------------+
| FULFILLED | APPROVE |
+------------------------+------------------+
| CANCELLED | CLOSED |
+------------------------+------------------+
| DECLINED | CLOSED |
+------------------------+------------------+
| AVS_CHECK_SYSTEM_ERROR | CLOSED |
+------------------------+------------------+
What is best practise in terms of primary key/unique key? Should i just create an OrderStatusKey INT as PrimaryKey with identity? Or create a unique constraint on order status (unique)? Thanks.
For this, I would suggest you create an Identity column, and make that the clustered primary key.
It is considered best practice for tables to have a primary key of some kind, but having a clustered index for a table like this is the fastest way to allow for the use of this table in multi table queries ( with joins ).
Here is a sample as to how to add it:
ALTER TABLE dbo.orderstatus
ADD CONSTRAINT PK_orderstatus_OrderStatusID PRIMARY KEY CLUSTERED (OrderStatusID);
GO
Article with more details MSDN
And here is another resource for explaining a primary key Primary Key Primer
If OrderStatus is unique and the primary identifier AND you will be reusing this status code directly in related tables (and not a numeric pointer to this status code) then keep the columns as is and make OrderStatus the primary clustered index.
A little explanation:
A primary key is unique across the table; a clustered index ties all record data back to that index. It is not always necessary to have the primary key also be the clustered index on the table but usually this is the case.
If you are going to be linking to the order status using something other than the status code then create another column of type int as an IDENTITY and make that the primary clustered key. Also add a unique non-clustered index to OrderStatus to ensure that no duplicates could ever be added.
Either way you go every table should have a primary key as well as a clustered index (again, usually they are the same index).
Here are some things to consider:
PRIMARY KEY ensures that there are no NULL values or duplicates in the key columns
UNIQUE KEY can contain NULL and (by the ANSI standard) any number of NULLs. (This behavior depends on SQL Server settings and possible index filters or NOT NULL constraints.)
The CLUSTERED INDEX contains all the data related to a row on the leaves.
When the CLUSTERED INDEX is not unique, SQL Server adds a hidden uniquifier value to the key of rows with duplicate key values, to distinguish the individual records.
All indexes are using either values of the key columns of the clustered index or the rowid of a heap table.
The query optimizer uses the index stats to find out the best way to execute a query
For small tables, the indexes are usually ignored, since doing an index scan and then a lookup for each value is more expensive than doing a full table scan (which will read one or two pages when you have really small tables)
Status lookup tables are usually very small and can be stored on one page.
The referencing tables will store the PK value (or unique) in their structure (this is what you'll use to do a join too). You can have a slight performance benefit if you have an integer key to use as reference (aka IDENTITY in SQL Server).
If you usually don't want to list the ConnectionStatus, then using the actual display value (OrderStatus) can be beneficial, since you don't have to join the lookup table.
You can store both values in the referencing tables, but maintaining both columns has some overhead and leaves more room for errors.
The clustered/non-clustered question depends on the use cases of this table. If you usually use the OrderStatus for filtering (using the textual form), a NON CLUSTERED IDENTITY PK and a CLUSTERED UNIQUE on the OrderStatus can be beneficial. However (as noted above), in small tables the effect/performance gain is usually negligible.
If you are not familiar with the above things and you feel it safer, then create an identity clustered PK (OrderKey or OrderID) and a unique non clustered key on the OrderStatus.
Use the PK as referencing/referenced column in foreign keys.
One more thing: if this column will be referenced by only one table, you may want to consider creating an indexed view which contains both tables' data.
Also, I would suggest adding a dummy value that you can use if there is no status set (and use it as the default for all referencing columns). Because not set is still a status, isn't it?
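The "safe" layout from this answer (integer primary key, unique constraint on OrderStatus, and a dummy status used as the default) can be sketched as follows. Python's sqlite3 module stands in for SQL Server, so there is no IDENTITY or CLUSTERED keyword here; the fact_order table, the key value 0 and the 'NOT_SET' label are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE orderstatus (
        OrderStatusKey   INTEGER PRIMARY KEY,    -- plays the role of the identity PK
        OrderStatus      TEXT NOT NULL UNIQUE,   -- no duplicate status codes
        ConnectionStatus TEXT NOT NULL
    );
    CREATE TABLE fact_order (
        OrderID        INTEGER PRIMARY KEY,
        OrderStatusKey INTEGER NOT NULL DEFAULT 0
                       REFERENCES orderstatus (OrderStatusKey)
    );
    -- Hypothetical dummy row for "no status set yet".
    INSERT INTO orderstatus VALUES (0, 'NOT_SET', 'NOT_SET');
    INSERT INTO orderstatus (OrderStatus, ConnectionStatus) VALUES
        ('CLOSED',                 'APPROVE'),
        ('COMPLETED',              'APPROVE'),
        ('FULFILLED',              'APPROVE'),
        ('CANCELLED',              'CLOSED'),
        ('DECLINED',               'CLOSED'),
        ('AVS_CHECK_SYSTEM_ERROR', 'CLOSED');
""")

def status_key(status):
    """Look up the integer key for a status code."""
    row = conn.execute(
        "SELECT OrderStatusKey FROM orderstatus WHERE OrderStatus = ?", (status,)
    ).fetchone()
    return row[0]

# One order with an explicit status, one that falls back to the default.
conn.execute(
    "INSERT INTO fact_order (OrderStatusKey) VALUES (?)", (status_key("CANCELLED"),)
)
conn.execute("INSERT INTO fact_order DEFAULT VALUES")

report = conn.execute("""
    SELECT f.OrderID, s.OrderStatus, s.ConnectionStatus
    FROM fact_order AS f
    JOIN orderstatus AS s ON s.OrderStatusKey = f.OrderStatusKey
    ORDER BY f.OrderID
""").fetchall()
```

The fact table carries only the small integer key, and the join brings back both the status text and the ConnectionStatus when they are needed.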

Composite vs Surrogate keys for Referential Integrity in 6NF

Take three layers of information:
Layer 1: Information
This layer contains data with UNIQUE natural indexes and a surrogate key that is easily transferrable.
Table Surnames:
+-----------------------------+--------------+
| ID (Auto Increment, PK) | Surname |
+-----------------------------+--------------+
| 1 | Smith |
| 2 | Edwards |
| 3 | Brown |
+-----------------------------+--------------+
Table FirstNames
+-----------------------------+--------------+
| ID (Auto Increment, PK) | FirstName |
+-----------------------------+--------------+
| 1 | John |
| 2 | Bob |
| 3 | Mary |
| 4 | Kate |
+-----------------------------+--------------+
Natural Keys
Alternatively, the two tables above can be without ID and utilize Surname and FirstName as Natural Primary Keys, as explained by Mike Sherrill. In this instance, assume the layer below references varchar rather than int.
Layer 2: People
In this layer a composite index is used. This value can be UNIQUE or PRIMARY, depending on whether a surrogate key is utilized as the Primary Key.
+-----------------+--------------+
| FirstName | LastName |
+-----------------+--------------+
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 3 | 1 |
| 4 | 2 |
| ... | ... |
+-----------------+--------------+
Layer 3: Parents
In this layer, relationships between people are explored through a ParentsOf table.
ParentsOf
+-----------------+-----------------+
| Person | PersonParent |
+-----------------+-----------------+
OR
+-----------------+-----------------+-----------------+-----------------+
| PersonFirstName | PersonSurname | ParentFirstName | ParentSurname |
+-----------------+-----------------+-----------------+-----------------+
The Question
Assuming that referential integrity is VERY important to me at its very core, and I will have FOREIGN KEYS on these indexes so that I keep the database responsible for monitoring its own integrity on this front, and that, if I were to use an ORM, it would be one like Doctrine which has native support for Compound Primary Keys...
Please help me to understand:
The list of trade-offs that take place with utilizing surrogate keys vs. natural keys on the 1st Layer.
The list of trade-offs that take place with utilizing compound keys vs. surrogate keys on the 2nd Layer which can be transferred over to the 3rd Layer.
I am not interested in hearing which is better, because I understand that there are significant disagreements among professionals on this topic and it would be sparking a religious war. Instead, I am asking, very simply and as objectively as is humanly possible, what trade-offs you will be making by passing surrogate keys to each Layer vs maintaining Primary keys (natural/composite, or surrogate/composite). Anyone will be able to find someone saying NEVER or ALWAYS use surrogate keys on SO and other websites. Instead, a reasoned analysis of trade-offs is what I will most appreciate in your answers.
EDIT: It has been pointed out that a surname example is a poor example for a use of 6NF. For the sake of keeping the question intact, I am going to leave it be. If you are having trouble imagining the use case for this, a better one might be a list of "Grocery Items". AKA:
+-----------------------------+--------------+
| ID (Auto Increment, PK) | Grocery |
+-----------------------------+--------------+
| 1 | Sponges |
| 2 | Tomato Soup |
| 3 | Ice Cream |
| 4 | Lemons |
| 5 | Strawberries |
| 6 | Whipped Cream|
+-----------------------------+--------------+
+-----------------------------+--------------+
| ID (Auto Increment, PK) | Brand |
+-----------------------------+--------------+
| 1 | Bright |
| 2 | Ben & Jerry's|
| 3 | Store Brand |
| 4 | Campbell's |
| 5 | Cool Whip |
+-----------------------------+--------------+
Natural Composite Key Example:
+-----------------------------+--------------+
| Grocery | Brand |
+-----------------------------+--------------+
| Sponges | Bright |
| Ice Cream | Ben & Jerry's|
| Ice Cream | Store Brand |
| Tomato Soup | Campbell's |
| Tomato Soup | Store Brand |
| Lemons | Store Brand |
| Whipped Cream | Cool Whip |
+-----------------------------+--------------+
Recommended Pairings
+-----------------+-----------------+-----------------+-----------------+
| Grocery1 | Brand1 | Grocery2 | Brand2 |
+-----------------+-----------------+-----------------+-----------------+
To reiterate, this is also just an example. This is not how I would recommend proceeding, but it should help to illustrate my question.
There ARE shortfalls to this method. I'll reiterate that this question was to request walking through the benefits and drawbacks of each method below, not to highlight one as better than another. I believe most people were able to look past the questionable nature of this specific example to answer the core question. This edit is for those that cannot.
There are some very good answers below and if you are curious about which direction to go, please read them.
END EDIT
Thank you!
Here's some trade-offs:
Single Surrogate (artificially created):
All child tables foreign keys only need a single column to reference the primary key.
Very easy to update the natural keys in table without needing to update every child table with foreign keys
Smaller primary/foreign key indexes (i.e. not as wide). This can make the database run faster: for example, when a record is deleted in a parent table, the child tables need to be searched to make sure this will not create orphans. Narrow indexes are faster to scan (just slightly).
You will have more indexes, because you most likely will also want to index whatever natural keys exist in the data.
Natural composite keyed tables:
fewer indexes in the database
fewer columns in the database
easier/faster to insert a ton of records as you will not need to grab the sequence generator
updating one of the keys in the compound requires that every child table also be updated.
Then there is another category: artificial composite primary keys
I've only found one instance where this makes sense. When you need to tag every record in every table for row level security.
For example, suppose you had a database which stored data for 50,000 clients and each client was not supposed to see other clients' data, which is very common in web application development.
If each record was tagged with a client_id field, you are creating a row level security environment. Most databases have the tools to enforce row level security when set up correctly.
The first thing to do is set up primary and foreign keys. Normally a table will have an id field as the primary key. By adding client_id, the key becomes a composite key, and it is necessary to carry client_id to all child tables.
The composite key is based on 2 surrogate keys and is a bulletproof way to ensure data integrity among clients and within the database as a whole.
After this you would create views (or, if using Oracle EE, set up Virtual Private Database) and other various structures to allow the database to enforce row level security (which is a topic all its own).
Granted that this data structure is no longer normalized to the nth degree. The client_id field in each pk/fk denormalizes an otherwise normal model. The benefit of the model is the ease of enforcing row level security at the database level (which is what databases should do). Every select, insert, update, delete is restricted to whatever client_id your session is currently set. The database has session awareness.
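A minimal sketch of the artificial composite key idea, using Python's sqlite3 module as a stand-in database. The invoice and invoice_line tables are hypothetical; the sketch shows only the key structure, not the views or session-level filtering mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE invoice (
        client_id INTEGER NOT NULL,
        id        INTEGER NOT NULL,
        PRIMARY KEY (client_id, id)          -- composite of two surrogate values
    );
    CREATE TABLE invoice_line (
        client_id  INTEGER NOT NULL,
        id         INTEGER NOT NULL,
        invoice_id INTEGER NOT NULL,
        PRIMARY KEY (client_id, id),
        -- client_id is carried into the child and is part of the FK
        FOREIGN KEY (client_id, invoice_id) REFERENCES invoice (client_id, id)
    );
    INSERT INTO invoice VALUES (1, 100), (2, 200);
    INSERT INTO invoice_line VALUES (1, 1, 100);
""")

def add_line(client_id, line_id, invoice_id):
    """Return True if the line was accepted, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO invoice_line VALUES (?, ?, ?)",
                (client_id, line_id, invoice_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Because client_id is part of the foreign key, a line tagged with client 2 cannot be attached to an invoice that belongs to client 1, even if the application code makes that mistake.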
Summary
Surrogate keys are always the safe bet. They require a little more work to setup and require more storage.
The biggest benefit in my opinion is:
Being able to update the PK in one table and all other child tables are instantaneously changed without ever being touched.
When data gets messed up--and it will at some point due to a programming mistake, surrogate keys make the clean up much much easier and in some cases only possible to do because there are surrogate keys.
Query performance is improved as the db is able to search attributes to locate the s.key and then join all child table by a single numeric key.
Natural Keys especially composite NKeys make writing code a pain. When you need to join 4 tables the "where clause" will be much longer (and easier to mess up) than when single SKeys were used.
Surrogate keys are the "safe" route. Natural keys are beneficial in a few places, I'd say around 1% of the tables in a db.
First of all, your second layer can be expressed at least four different ways, and they're all relevant to your question. Below I'm using pseudo-SQL, mainly with PostgreSQL syntax. Certain kinds of queries will require recursion and more than one additional index regardless of the structure, so I won't say any more about that. Using a dbms that supports clustered indexes can affect some decisions here, but don't assume that six joins on clustered indexes will be faster than simply reading values from a single, covering index; test, test, test.
Second, there really aren't many tradeoffs at the first layer. Foreign keys can reference a column declared not null unique in exactly the same way they can reference a column declared primary key. The surrogate key increases the width of the table by 4 bytes; that's trivial for most, but not all, database applications.
Third, correct foreign keys and unique constraints will maintain referential integrity in all four of these designs. (But see below, "About Cascades".)
A. Foreign keys to surrogate keys
create table people (
FirstName integer not null
references FirstNames (ID),
LastName integer not null
references Surnames (ID),
primary key (FirstName, LastName)
);
B. Foreign keys to natural keys
create table people (
FirstName varchar(n) not null
references FirstNames (FirstName),
LastName varchar(n) not null
references Surnames (Surname),
primary key (FirstName, LastName)
);
C. Foreign keys to surrogate keys, additional surrogate key
create table people (
ID serial primary key,
FirstName integer not null
references FirstNames (ID),
LastName integer not null
references Surnames (ID),
unique (FirstName, LastName)
);
D. Foreign keys to natural keys, additional surrogate key
create table people (
ID serial primary key,
FirstName varchar(n) not null
references FirstNames (FirstName),
LastName varchar(n) not null
references Surnames (Surname),
unique (FirstName, LastName)
);
Now let's look at the ParentsOf table.
A. Foreign keys to surrogate keys in A, above
create table ParentsOf (
PersonFirstName integer not null,
PersonSurname integer not null,
foreign key (PersonFirstName, PersonSurname)
references people (FirstName, LastName),
ParentFirstName integer not null,
ParentSurname integer not null,
foreign key (ParentFirstName, ParentSurname)
references people (FirstName, LastName),
primary key (PersonFirstName, PersonSurname, ParentFirstName, ParentSurname)
);
To retrieve the names for a given row, you'll need four joins. You can join directly to the "FirstNames" and "Surnames" tables; you don't need to join through the "People" table to get the names.
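The four joins for design A can be sketched like this, with Python's sqlite3 module as a stand-in for PostgreSQL. The name rows come from the question's sample data; the single ParentsOf row (Kate Edwards with parent John Edwards) is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE FirstNames (ID INTEGER PRIMARY KEY, FirstName TEXT NOT NULL UNIQUE);
    CREATE TABLE Surnames   (ID INTEGER PRIMARY KEY, Surname   TEXT NOT NULL UNIQUE);
    CREATE TABLE people (
        FirstName INTEGER NOT NULL REFERENCES FirstNames (ID),
        LastName  INTEGER NOT NULL REFERENCES Surnames (ID),
        PRIMARY KEY (FirstName, LastName)
    );
    CREATE TABLE ParentsOf (
        PersonFirstName INTEGER NOT NULL,
        PersonSurname   INTEGER NOT NULL,
        ParentFirstName INTEGER NOT NULL,
        ParentSurname   INTEGER NOT NULL,
        FOREIGN KEY (PersonFirstName, PersonSurname)
            REFERENCES people (FirstName, LastName),
        FOREIGN KEY (ParentFirstName, ParentSurname)
            REFERENCES people (FirstName, LastName),
        PRIMARY KEY (PersonFirstName, PersonSurname, ParentFirstName, ParentSurname)
    );
    INSERT INTO FirstNames VALUES (1, 'John'), (2, 'Bob'), (3, 'Mary'), (4, 'Kate');
    INSERT INTO Surnames   VALUES (1, 'Smith'), (2, 'Edwards'), (3, 'Brown');
    INSERT INTO people     VALUES (1, 2), (1, 3), (2, 3), (3, 1), (4, 2);
    INSERT INTO ParentsOf  VALUES (4, 2, 1, 2);  -- hypothetical relationship
""")

# Four joins, straight to the name tables; "people" is not needed here.
names = conn.execute("""
    SELECT pf.FirstName, ps.Surname, qf.FirstName, qs.Surname
    FROM ParentsOf AS po
    JOIN FirstNames AS pf ON pf.ID = po.PersonFirstName
    JOIN Surnames   AS ps ON ps.ID = po.PersonSurname
    JOIN FirstNames AS qf ON qf.ID = po.ParentFirstName
    JOIN Surnames   AS qs ON qs.ID = po.ParentSurname
""").fetchall()
```

Each of the four ID columns in ParentsOf is resolved by one join, which is where the count of four comes from.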
B. Foreign keys to natural keys in B, above
create table ParentsOf (
PersonFirstName varchar(n) not null,
PersonSurname varchar(n) not null,
foreign key (PersonFirstName, PersonSurname)
references people (FirstName, LastName),
ParentFirstName varchar(n) not null,
ParentSurname varchar(n) not null,
foreign key (ParentFirstName, ParentSurname)
references people (FirstName, LastName),
primary key (PersonFirstName, PersonSurname, ParentFirstName, ParentSurname)
);
This design needs zero joins to retrieve the names for a given row. Many SQL platforms won't need to read the table at all, because they can get all the data from the index on the primary key.
C. Foreign keys to surrogate keys, additional surrogate key in C, above
create table ParentsOf (
Person integer not null
references People (ID),
PersonParent integer not null
references People (ID),
primary key (Person, PersonParent)
);
To retrieve names, you must join through the "people" table. You'll need a total of six joins.
D. Foreign keys to natural keys, additional surrogate key in D, above
This design has the same structure as in C immediately above. Because the "people" table in D, farther above, has natural keys referencing the tables "FirstNames" and "Surnames", you'll only need two joins to the table "people" to get the names.
About ORMs
ORMs don't build SQL the way a SQL developer writes SQL. If a SQL developer writes a SELECT statement that needs six joins to get the names, an ORM is liable to execute seven simpler queries to get the same data. This might be a problem; it might not.
About Cascades
Surrogate ID numbers make every foreign key reference an implicit, undeclared "ON UPDATE CASCADE". For example, if you run this update statement against your table of surnames . . .
update surnames
set surname = 'Smythe'
where surname = 'Smith';
then all the Smiths will become Smythes. The only way to prevent that is to revoke update permissions on "surnames". Implicit, undeclared "ON UPDATE CASCADE" is not always a Good Thing. Revoking permissions solely to prevent unwanted implicit "cascades" is not always a Good Thing.
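The difference can be demonstrated side by side. This is a sketch using Python's sqlite3 module as a stand-in database, with simplified, hypothetical tables (people stores the first name as plain text here); the natural-key variant deliberately declares no ON UPDATE CASCADE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    -- Surrogate design: people point at an ID, not at the name itself.
    CREATE TABLE surnames (id INTEGER PRIMARY KEY, surname TEXT NOT NULL UNIQUE);
    CREATE TABLE people (
        firstname TEXT NOT NULL,
        lastname  INTEGER NOT NULL REFERENCES surnames (id)
    );
    INSERT INTO surnames VALUES (1, 'Smith'), (2, 'Edwards');
    INSERT INTO people   VALUES ('Mary', 1), ('John', 2);

    -- Natural design: people store the name; no ON UPDATE CASCADE declared.
    CREATE TABLE surnames_nat (surname TEXT PRIMARY KEY);
    CREATE TABLE people_nat (
        firstname TEXT NOT NULL,
        lastname  TEXT NOT NULL REFERENCES surnames_nat (surname)
    );
    INSERT INTO surnames_nat VALUES ('Smith'), ('Edwards');
    INSERT INTO people_nat   VALUES ('Mary', 'Smith'), ('John', 'Edwards');
""")

# Surrogate design: the update succeeds and silently renames every Smith.
conn.execute("UPDATE surnames SET surname = 'Smythe' WHERE surname = 'Smith'")
surrogate_view = conn.execute("""
    SELECT p.firstname, s.surname
    FROM people AS p
    JOIN surnames AS s ON s.id = p.lastname
    ORDER BY p.firstname
""").fetchall()

# Natural design: the same update is rejected while rows still reference 'Smith'.
try:
    conn.execute("UPDATE surnames_nat SET surname = 'Smythe' WHERE surname = 'Smith'")
    natural_update_blocked = False
except sqlite3.IntegrityError:
    natural_update_blocked = True
```

With natural keys the cascade has to be asked for explicitly, whereas with surrogate IDs the effect of a cascade is there whether it was wanted or not.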
Using natural keys can enable simpler, faster queries since one needn't join all the way up the foreign key chain to find the "natural" value e.g. for display on-screen.
I will avoid a pure academic discussion here and look at a few pragmatic considerations given a modern database design typically needs to consider scalability, mobility (disconnected operation), and conflict resolution where choice of key can have a large impact.
Things that can influence your choice are:
How do you deal with distinct records which may have the same natural key, e.g. identical first name and surname?
How does a web or mobile client persist a complex model graph if server-assigned surrogate keys are used? This requires some kind of mapping layer. The alternative is to avoid the mapping problem and use client-assigned v4 UUIDs.
Following on from the above, how do you deal with conflict resolution in temporarily disconnected environments such as mobile apps, or where clients can peer/share with one another without having to first sync with a server? Object identity is an important concept to support and solve these problems.
Scalability through sharding your database can be easy or difficult depending on the choice of key. Auto-incrementing surrogate keys are hard to shard and require choosing a fixed number of shards a priori so keys don't clash, whilst v4 UUID-based surrogate keys are easy and can be client-assigned. Composite and natural keys are hard because the key, whilst relatively stable, may still change, and this requires the ability to migrate records from one shard to another.
How do your clients manage object identity? Often user interfaces require building a local graph of models for later persistence to a "server in the cloud". Before persistence those objects need identity, and after persistence there needs to be an agreement or mapping between the server object identity and the client object identity.
Do you force everything above the database (including the application server) to deal with an identity mapping problem, or do you build it into the database key design and also help solve scalability/sharding for the db whilst you're at it?
My advice is to look at the characteristics of the system as a whole, and to look beyond the theoretical db design to what will work well for the non-trivial full stack that sits above the database. The choice you make for key design can make or break the usability of the system and help or harm development complexity, which in turn affects your time to market and the overall quality and reliability tradeoff.
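The client-assigned v4 UUID idea can be sketched as follows, with Python's uuid and sqlite3 modules standing in for a real client and a real MySQL server. The members and addresses tables, their columns, and the sample values are hypothetical, chosen only for illustration.

```python
import sqlite3
import uuid

# Client side: build a small object graph while disconnected.
# Identity is assigned locally, so the child can reference the parent
# before either object has been seen by the server.
member_id = str(uuid.uuid4())
address_id = str(uuid.uuid4())
member = {"id": member_id, "email": "john@example.com"}
address = {"id": address_id, "member_id": member_id, "address": "123 street"}

# "Server" side: an in-memory SQLite database (assumption: stand-in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE members (
        id    TEXT PRIMARY KEY,   -- client-assigned UUID, stored as text
        email TEXT NOT NULL
    );
    CREATE TABLE addresses (
        id        TEXT PRIMARY KEY,
        member_id TEXT NOT NULL REFERENCES members(id),
        address   TEXT NOT NULL
    );
""")

# The graph is persisted as-is; no mapping from temporary client IDs
# to server-generated IDs is needed.
conn.execute("INSERT INTO members (id, email) VALUES (:id, :email)", member)
conn.execute(
    "INSERT INTO addresses (id, member_id, address) "
    "VALUES (:id, :member_id, :address)",
    address,
)

stored_member_id = conn.execute(
    "SELECT member_id FROM addresses WHERE id = ?", (address_id,)
).fetchone()[0]
print(stored_member_id == member_id)
```

With an auto-increment key the client would have to wait for the server to return the parent's ID, or keep a temporary-to-permanent ID map, before it could write the child row.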
I once saw this list of criteria for a primary key. I find it a rather good starting point for this kind of discussion:
unique
stable (not necessarily immutable)
irreducible
simple
familiar
Sometimes there's a conflict between two or more criteria and we have to compromise between them. Unfortunately, many people never even reflect on how to design the key; they go with some kind of auto-generated key, be it an identity column, a GUID or whatever.
One drawback with surrogate keys is that it becomes more difficult to enforce rules declaratively (most DBMSs don't support subqueries in check constraints). I'm thinking of rules like:
CHECK ( jobtitle <> 'BOSS' OR salary > 100 )
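As a rough illustration, the rule can be declared directly when the job title is stored as a natural key in the same row. The sketch below uses Python's sqlite3 module as a stand-in; the jobtitles and employees tables and the sample rows are assumptions, and CHECK support differs between database engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE jobtitles (jobtitle TEXT PRIMARY KEY);   -- natural key
    CREATE TABLE employees (
        name     TEXT PRIMARY KEY,
        jobtitle TEXT NOT NULL REFERENCES jobtitles(jobtitle),
        salary   INTEGER NOT NULL,
        -- The title itself is in the row, so the rule needs no subquery.
        CHECK (jobtitle <> 'BOSS' OR salary > 100)
    );
    INSERT INTO jobtitles VALUES ('BOSS'), ('CLERK');
""")

conn.execute("INSERT INTO employees VALUES ('alice', 'BOSS', 150)")   # allowed
conn.execute("INSERT INTO employees VALUES ('bob', 'CLERK', 50)")     # allowed

try:
    conn.execute("INSERT INTO employees VALUES ('carol', 'BOSS', 50)")
    rejected = False
except sqlite3.IntegrityError:
    # The CHECK constraint refuses an underpaid boss.
    rejected = True

print(rejected)
```

If employees held a surrogate jobtitle_id instead, the constraint would have to look the title up in jobtitles, which is the subquery that most engines do not allow inside CHECK.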
However, I find the biggest problem with surrogate keys to be that you can get away with really weird constructions, and not even notice.
A frequent use case you can find in databases is versioning and history.
Example with a user table:
ID | Name | Value | DeletedFlag
----------------------------------------
1 | Smith | 78 | 0
2 | Martin | 98 | 0
3 | John | 78 | 1
4 | Edouard | 54 | 0
5 | John | 64 | 0
John filled in a record, then decided to delete it and enter a new one.
If you do not use a unique PK such as the ID column, you won't be able to manage this situation.
It makes it extremely easy in development and production to mark some data as deleted, and to unmark it for tests or data corrections, instead of backing up and restoring or getting into a lot of confusion.
It is also faster to rebuild indexes on integers, and they take less disk space.
If our composite key consists of the columns STUDENT and COURSE, the database will ensure that we never enter duplicate values.
Example:
With a composite key, this is not allowed; the database will prevent it.
STUDENT COURSE
1 CS101
1 CS101
But if we choose a surrogate key as the key, we will need to find another way to prevent such duplications.
Thinking about which combinations of fields are possible keys can help you discover and understand the problem better.
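The STUDENT/COURSE case above can be sketched like this, with Python's sqlite3 module standing in for MySQL. The three table names are hypothetical; a UNIQUE constraint is shown as one possible "other way" to prevent duplicates when a surrogate key is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Composite natural key: the pair itself is the primary key.
    CREATE TABLE enrolment_natural (
        student INTEGER NOT NULL,
        course  TEXT    NOT NULL,
        PRIMARY KEY (student, course)
    );
    -- Surrogate key only: nothing stops the same pair appearing twice.
    CREATE TABLE enrolment_surrogate (
        id      INTEGER PRIMARY KEY,
        student INTEGER NOT NULL,
        course  TEXT    NOT NULL
    );
    -- Surrogate key plus an explicit UNIQUE constraint on the pair.
    CREATE TABLE enrolment_surrogate_unique (
        id      INTEGER PRIMARY KEY,
        student INTEGER NOT NULL,
        course  TEXT    NOT NULL,
        UNIQUE (student, course)
    );
""")

def accepts_duplicate(table):
    """Insert (1, 'CS101') twice; return True if the second insert is accepted."""
    sql = f"INSERT INTO {table} (student, course) VALUES (1, 'CS101')"
    conn.execute(sql)
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    table: accepts_duplicate(table)
    for table in (
        "enrolment_natural",
        "enrolment_surrogate",
        "enrolment_surrogate_unique",
    )
}
print(results)
```

Only the plain surrogate-key table lets the duplicate through; adding the UNIQUE constraint restores the guarantee that the composite key gives for free.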
I think you have misunderstood something fundamental as regards data:
1) You are taking a single identifier (person name - assuming that does uniquely identify a person), splitting it into subatomic parts then, because of 6NF, putting them into separate relation variables. Often such a split is made for practical reasons and first name/last name is a common example; the decision is usually made on grounds of complexity, frequency, etc of splitting when compared to those of putting the attribute back together again. Here the split is not practical.
2) 6NF is always achievable but not always desirable. In this case, it makes it harder to define a constraint that would be able to verify the parts as being valid in combination (imagine you had split a date by time granules day, month and year and stored each part in separate relvars!).
3) For person identifiers, a compound of first and last names is rarely adequate. Identifiers are usually chosen based on the level of trust required. An employer checks references, qualifications, etc. then issues a payroll reference. A police officer might require sight of your driver licence by the roadside, but fingerprints will be taken if you are convicted of a crime. A DBMS cannot verify a person and therefore an auto-increment integer is rarely adequate either.

Can I have 2 unique columns in the same table?

I have 2 tables:
roomtypes[id(PK),name,maxAdults...]
features (example: Internet in room, satellite TV)
Can both the id and name fields be unique in the same table in MySQL MyISAM?
If the above is possible, I am thinking of changing the table to:
features[id(PK),name,roomtypeID] ==> features[id(PK),name,roomtypeNAME]
...because it saves me extra querying in the presentation layer for features, since the front-end users can't work with IDs.
Of course, you can make one of them PRIMARY and one UNIQUE. Or both UNIQUE. Or one PRIMARY and four UNIQUEs, if you like.
Yes, you can define UNIQUE constraints on columns other than the primary key in order to ensure the data is unique between rows. This means that a value can only exist in that column once; any attempt to add a duplicate will result in a unique constraint violation error.
I am thinking of changing the FEATURES table to features[id(PK), name, roomtypeNAME] because it saves me extra querying in the presentation layer for features, since the front-end users can't work with IDs.
There are two problems:
A unique constraint on the ROOM_TYPE_NAME wouldn't work - you'll have multiple instances of a given room type, and a unique constraint is designed to stop that.
Because of not using a foreign key to the ROOM_TYPES table, you risk getting values like "Double", "double", "dOUBle".
I recommend sticking with your original design for the sake of your data; your application is what translates a room type into its respective ROOM_TYPE record while the UI makes it presentable.
I would hope so, otherwise MySQL is not compliant with the SQL standard. You can only have one primary key, but you can mark other columns as unique.
In SQL, this is achieved with:
create table tbl (
    colpk    char(10) primary key,
    coluniq  char(10) unique,
    colother char(10)
);
There are other ways to do it (particularly with multi-part keys) but this is a simple solution.
Yes you can.
Also keep in mind that MySQL allows NULL values in unique columns, whereas a column that is a primary key cannot have a NULL value.
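A small sketch of how a UNIQUE column treats duplicates and NULLs, reusing the tbl definition from the answer above. It runs against SQLite through Python's sqlite3 module as a stand-in for MySQL, so treat it as an illustration of the idea, not as proof of MySQL's behaviour; the inserted values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl (
        -- NOT NULL is spelled out because SQLite, unlike MySQL,
        -- does not imply it for a non-integer primary key.
        colpk    CHAR(10) NOT NULL PRIMARY KEY,
        coluniq  CHAR(10) UNIQUE,
        colother CHAR(10)
    )
""")

conn.execute("INSERT INTO tbl VALUES ('a', 'x', NULL)")
conn.execute("INSERT INTO tbl VALUES ('b', NULL, NULL)")
conn.execute("INSERT INTO tbl VALUES ('c', NULL, NULL)")  # a second NULL is accepted

try:
    conn.execute("INSERT INTO tbl VALUES ('d', 'x', NULL)")  # 'x' already exists
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_count = conn.execute(
    "SELECT COUNT(*) FROM tbl WHERE coluniq IS NULL"
).fetchone()[0]
print(duplicate_rejected, null_count)
```

If NULLs in the unique column are not wanted, declaring it NOT NULL as well closes that gap.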
1 RoomType may have many Features
1 Feature may be assigned to many RoomTypes
So what type of relationship do I have? M:N?
You have a many-to-many relationship there, which has to be represented by an extra table.
That relationship table will have 2 fields: the PK of RoomTypes and the PK of Features.
The PK of the relationship table will be made of those 2 fields.
If that's useful, you can add extra fields like the Quantity.
I would like to encourage you to read about database normalization, which is the process of creating a correct design for a relational database. You can Google for that, or possibly look here (there are plenty of books/web pages on this).
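The extra relationship table could look roughly like this. The sketch uses Python's sqlite3 module as a stand-in for MySQL; the roomtype_features table name, the column names, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE roomtypes (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE features (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Relationship table: one row per (room type, feature) pair.
    -- Its primary key is made of the two foreign keys.
    CREATE TABLE roomtype_features (
        roomtype_id INTEGER NOT NULL REFERENCES roomtypes(id),
        feature_id  INTEGER NOT NULL REFERENCES features(id),
        PRIMARY KEY (roomtype_id, feature_id)
    );
    INSERT INTO roomtypes VALUES (1, 'Single'), (2, 'Double');
    INSERT INTO features  VALUES (1, 'Internet in room'), (2, 'Satellite TV');
    INSERT INTO roomtype_features VALUES (1, 1), (2, 1), (2, 2);
""")

def features_of(roomtype_name):
    """Return the feature names assigned to a room type, sorted by name."""
    rows = conn.execute("""
        SELECT f.name
        FROM roomtypes r
        JOIN roomtype_features rf ON rf.roomtype_id = r.id
        JOIN features f           ON f.id = rf.feature_id
        WHERE r.name = ?
        ORDER BY f.name
    """, (roomtype_name,)).fetchall()
    return [name for (name,) in rows]

print(features_of("Double"))
print(features_of("Single"))
```

The presentation layer can query by room type name and get feature names back, so end users never see an ID, and no comma-separated ID lists are needed in either table.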
Thanks again for very helpful answers.
1 roomType may have many features
1 feature may be assigned to many roomTypes
So what type of relationship do I have? M:N?
If yes the solution I see is changing table structure to
roomTypes[id,...,featuresIDs]
features[id(PK),name,roomtypeIDs] multiple roomTypesIDs separated with comma?