Natural key vs auto_increment key as the primary key

My question is about using a natural key versus an auto_increment integer as the primary key.
For example, I have tables A and B and A_B_relation. A and B hold some kind of object, and A_B_relation records the many-to-many relation between A and B.
Both A and B have their own globally unique id, such as a UUID. The UUID is visible to users, which means a user may query A or B by UUID.
There are two ways to design the tables' primary keys:
Use the auto_increment integer. A_B_relation references the integer as the FK.
Use the UUID. A_B_relation references the UUID as the FK.
For example, a user wants to query the info of all the B's associated with an A, given A's UUID.
In the first case, the query flow is:
First, query A's integer primary key by UUID from `A`.
Then, query the integer primary keys of all the B's from `A_B_relation`.
Finally, query the info of all the B's from `B`.
In the latter case, the flow is:
Query the UUIDs of all the B's from `A_B_relation` by A's UUID.
Query the info of all the B's from `B`.
So I think the latter case is more convenient. Is this right? What are the drawbacks of the latter case?
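As a rough sketch of the first design (integer primary keys, UUID kept as a unique column), the three lookup steps can be collapsed into one joined query. This uses Python's built-in sqlite3 module only so that it runs anywhere; the lowercase table names, the column names and the sample UUID strings are made up for illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Integer surrogate primary keys; the UUID stays as a unique, user-facing column.
CREATE TABLE a (id INTEGER PRIMARY KEY AUTOINCREMENT, uuid TEXT NOT NULL UNIQUE, info TEXT);
CREATE TABLE b (id INTEGER PRIMARY KEY AUTOINCREMENT, uuid TEXT NOT NULL UNIQUE, info TEXT);
CREATE TABLE a_b_relation (
    a_id INTEGER NOT NULL REFERENCES a(id),
    b_id INTEGER NOT NULL REFERENCES b(id),
    PRIMARY KEY (a_id, b_id)
);
-- Hypothetical sample data.
INSERT INTO a (uuid, info) VALUES ('a-uuid-1', 'first A');
INSERT INTO b (uuid, info) VALUES ('b-uuid-1', 'first B'), ('b-uuid-2', 'second B');
INSERT INTO a_b_relation (a_id, b_id) VALUES (1, 1), (1, 2);
""")


def b_info_for_a(connection, a_uuid):
    """Return (uuid, info) for every B related to the A with the given UUID."""
    return connection.execute(
        """
        SELECT b.uuid, b.info
        FROM a
        JOIN a_b_relation AS r ON r.a_id = a.id
        JOIN b ON b.id = r.b_id
        WHERE a.uuid = ?
        ORDER BY b.id
        """,
        (a_uuid,),
    ).fetchall()


result = b_info_for_a(conn, "a-uuid-1")
```

With a join like this, the application still issues a single query in either design; the difference is mainly in the width of the keys stored in the relation table.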

In my opinion, the convenience of using either a natural key or an auto-increment key depends on the solution you are providing. Both methods have pros and cons. So the best approach is to understand both key types properly, analyze what kind of business solution you are trying to provide, and select the appropriate primary key type.
A natural key is a column or a set of columns which can be used to uniquely identify a record in a table. These columns contain real data which has a relationship with the rest of the columns of the table.
An auto-incremented key, also called a surrogate key, is a single table column which contains unique numeric values that can be used to uniquely identify a single row of data in a table. These values are generated at run-time when a record is inserted into the table and have no relationship with the rest of the data in the row.
The main advantage of using natural keys is that a natural key has its own meaning and requires fewer joins with other tables, whereas if we used a surrogate key we would need to join to the referenced table to get the results we got with the natural key.
But say we cannot get all the data required from a single table and have to join with another table. Then it is convenient to use a surrogate key instead of a natural key, because most of the time natural keys are strings and larger in size than surrogate keys, and it takes more time to join tables using larger values.
A natural key has its own meaning. So when it comes to searching records it is more advantageous to use natural keys over surrogate keys. But say that over time our program logic changes and we have to change the natural key value. This will be difficult and will cause a cascade effect over all foreign key relationships. We can overcome this problem by using a surrogate key. Since a surrogate key does not have a relationship with the rest of the values of a row, changes to the logic won't have an effect on the surrogate key.
So, as I see it, the convenience or inconvenience of using a surrogate key or a natural key depends entirely on the solution you are providing.
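The cascade effect described above can be sketched with a small example, again using Python's sqlite3 module as a convenient stand-in for whatever DBMS you use. The `product` and `order_line` tables and the SKU values are hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FK actions without this
conn.executescript("""
-- Hypothetical tables: the natural key (sku) is also the FK target.
CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE order_line (
    line_id INTEGER PRIMARY KEY,
    sku TEXT NOT NULL REFERENCES product(sku) ON UPDATE CASCADE
);
INSERT INTO product VALUES ('OLD-1', 'widget');
INSERT INTO order_line (sku) VALUES ('OLD-1'), ('OLD-1');

-- Changing the natural key forces every referencing row to be rewritten.
UPDATE product SET sku = 'NEW-1' WHERE sku = 'OLD-1';
""")

child_skus = [row[0] for row in conn.execute("SELECT sku FROM order_line ORDER BY line_id")]
```

Here one key change in the parent rewrote every child row. With an integer surrogate key as the FK target, the same change to `sku` would touch only the `product` row.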

Related

What is the difference between a primary key and a surrogate key?

I googled a lot, but I did not find an exact, straightforward answer with an example.
Any example for this would be helpful.
The primary key is a unique key in your table that you choose that best uniquely identifies a record in the table. All tables should have a primary key, because if you ever need to update or delete a record you need to know how to uniquely identify it.
A surrogate key is an artificially generated key. They're useful when your records essentially have no natural key (such as a Person table, since it's possible for two people born on the same date to have the same name, or records in a log, since it's possible for two events to happen such that they carry the same timestamp). Most often you'll see these implemented as integers in an automatically incrementing field, or as GUIDs that are generated automatically for each record. ID numbers are almost always surrogate keys.
Unlike primary keys, not all tables need surrogate keys, however. If you have a table that lists the states in America, you don't really need an ID number for them. You could use the state abbreviation as a primary key code.
The main advantage of surrogate keys is that they're easy to guarantee as unique. The main disadvantage is that they don't have any meaning. Nothing about "28" tells you it means Wisconsin, for example, but when you see 'WI' in the State column of your Address table, you know what state you're talking about without needing to look up which state is which in your State table.
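A minimal sketch of the state example, assuming hypothetical `state` and `address` tables and using Python's sqlite3 module so it can be run as-is. The address row is readable without a join because the natural key carries its own meaning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the FK
conn.executescript("""
-- The state abbreviation is a natural key, so no ID number is needed.
CREATE TABLE state (abbr TEXT PRIMARY KEY, name TEXT NOT NULL);
-- The address table gets a surrogate id because it has no obvious natural key.
CREATE TABLE address (
    id INTEGER PRIMARY KEY,
    city TEXT NOT NULL,
    state_abbr TEXT NOT NULL REFERENCES state(abbr)
);
INSERT INTO state VALUES ('WI', 'Wisconsin');
INSERT INTO address (city, state_abbr) VALUES ('Madison', 'WI');
""")

# No join needed to see which state an address is in.
row = conn.execute("SELECT city, state_abbr FROM address").fetchone()
```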
A surrogate key is a made up value with the sole purpose of uniquely identifying a row. Usually, this is represented by an auto incrementing ID.
Example code:
CREATE TABLE Example
(
SurrogateKey INT IDENTITY(1,1) -- A surrogate key that increments automatically
)
A primary key is the identifying column or set of columns of a table. It can be a surrogate key or any other unique combination of columns (for example a compound key). It MUST be unique for every row and cannot be NULL.
Example code:
CREATE TABLE Example
(
PrimaryKey INT PRIMARY KEY -- A primary key is just a unique identifier
)
All keys are identifiers used as surrogates for the things they identify. E.F.Codd explained the concept of system-assigned surrogates as follows [1]:
Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever
displayed to them.
This is what is commonly referred to as a surrogate key. The definition is immediately problematic, however, because Codd was assuming that such a feature would be provided by the DBMS. DBMSs in general have no such feature. The keys are normally visible to at least some DBMS users as, for obvious reasons, they have to be. The concept of a surrogate has therefore morphed slightly in usage. The term is generally used in the data management profession to mean a key that is not exposed and used as an identifier in the business domain. Note that this is essentially unrelated to how the key is generated or how "artificial" it is perceived to be. All keys consist of symbols invented by humans or machines. The only possible significance of the term surrogate therefore relates to how the key is used, not how it is created or what its values are.
[1] Extending the database relational model to capture more meaning, E.F.Codd, 1979
This is a great treatment describing the various kinds of keys:
http://www.agiledata.org/essays/keys.html
A surrogate key is typically a numeric value. Within SQL Server, Microsoft allows you to define a column with an identity property to help generate surrogate key values.
The PRIMARY KEY constraint uniquely identifies each record in a database table.
Primary keys must contain UNIQUE values.
A primary key column cannot contain NULL values.
Most tables should have a primary key, and each table can have only ONE primary key.
http://www.databasejournal.com/features/mssql/article.php/3922066/SQL-Server-Natural-Key-Verses-Surrogate-Key.htm
I think Michelle Poolet describes it in a very clear way:
A surrogate key is an artificially produced value, most often a
system-managed, incrementing counter whose values can range from 1 to
n, where n represents a table's maximum number of rows. In SQL Server,
you create a surrogate key by assigning an identity property to a
column that has a number data type.
http://sqlmag.com/business-intelligence/surrogate-key-vs-natural-key
A surrogate key usually helps when you replace a composite key with an identity column.

Foreign keys vs secondary keys

I used to think that foreign key and secondary key are the same thing.
After Googling, the results are even more confusing: some consider them to be the same, while others say that a secondary key is an index that doesn't have to be unique and allows faster access to data than the primary key.
Can someone explain the difference?
Or is it indeed a case of mixed terminology?
Does it maybe differ per database type?
The definition in wiki/Foreign_key states that:
In the context of relational databases, a foreign key is a field (or
collection of fields) in one table that uniquely identifies a row of
another table. In other words, a foreign key is a column or a
combination of columns that is used to establish and enforce a link
between two tables.
The table containing the foreign key is called the referencing or
child table, and the table containing the candidate key is called the
referenced or parent table.
Take the example of the case:
A customer may place 0,1 or more orders.
From the point of the business, each customer is identified by a unique id (Primary Key) and instead of repeating the customer information with each order, we place a reference, or a pointer to that unique customer id (Customer's Primary Key) in the order table. By looking at any order, we can tell who placed it using the unique customer id.
The relationship between the parent (Customer table) and the child (Order table) is established when you set the value of the FK in the Order table after the Customer row has been inserted. Also, deleting a parent row may affect its child rows depending on the referential integrity settings (cascading rules) chosen when the FK was created. FKs help establish integrity in a relational database system.
As for the "Secondary Key", the term refers to a structure of 1 or more columns that together help retrieve 1 or more rows of the same table. The word 'key' is somewhat misleading to some. The Secondary Key does not have to be unique (unlike the PK). It is not the Primary Key of the table. It is used to locate rows in the same table it is defined within (unlike the FK). Its enforcement is only through an index (either unique or not) and its implementation is optional. A table could have 0, 1 or more Secondary Keys. For example, in an Employee table, you may use an auto-generated column as the primary key. Alternatively, you may decide to use the Employee Number or SSN to retrieve employee information.
Sometimes people mix the term "Secondary Key" with the term "Candidate Key" or "Alternate Key" (usually appears in Normalization context) but they are all different.
A foreign key is a key that references an index on some other table. For example, if you have a table of customers, one of the columns on that table may be a country column which would just contain an ID number, which would match the ID of that country in a separate Country table. That country column in the customer table would be a foreign key.
A secondary key on the other hand is just a different column in the table that you have used to create an index (which is used to speed up queries). Foreign keys have nothing to do with improving query speeds.
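To make the "secondary key as a plain index" reading concrete, here is a small sketch using Python's sqlite3 module. The `employee` table, its columns and the index name are assumptions chosen for illustration. The index is non-unique, so it only helps locate rows in its own table and enforces nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical Employee table: surrogate primary key plus a searchable column.
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    employee_number TEXT NOT NULL,
    last_name TEXT NOT NULL
);
-- A "secondary key" in the sense above: a non-unique index on last_name.
CREATE INDEX idx_employee_last_name ON employee(last_name);
INSERT INTO employee (employee_number, last_name) VALUES ('E1', 'Smith'), ('E2', 'Smith');
""")

# Duplicate last names are allowed; the index just speeds up this lookup.
smiths = conn.execute(
    "SELECT employee_number FROM employee WHERE last_name = ? ORDER BY employee_number",
    ("Smith",),
).fetchall()

# Column 1 of each PRAGMA index_list row is the index name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list(employee)")]
```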
"Secondary key" is not a term I'm familiar with. It doesn't appear in the index of Database Design for Mere Mortals and I don't remember it in Pro SQL Server 2012 Relational Database Design and Implementation (my two "go-to" books for database design). It also doesn't appear in the index of SQL for Smarties. It sounds like it's not an actual term at all.
I've always used the term "candidate key".
A candidate key is a way to uniquely identify an entity. You identify all the candidate keys during the design phase of a database system. During the implementation phase, you will decide on a primary key: either one of the candidate keys or an artificial key. The primary key will probably be implemented with a primary key constraint; the candidate keys will probably be implemented with unique constraints.
A foreign key is an instance of one entity's candidate key in another entity, representing a relationship between the two entities. It will probably be implemented with a foreign key constraint.

Should every table have a primary key?

I read somewhere saying that every table should have a primary key to fulfill 1NF.
I have a tbl_friendship table.
There are 2 fields in the table: Owner and Friend.
Both Owner and Friend are foreign keys to the auto-increment id field in tbl_user.
Should tbl_friendship have a primary key?
Should I create an auto-increment id field in tbl_friendship and make it the primary key?
Primary keys can apply to multiple columns! In your example, the primary key should be on both columns, for example (Owner, Friend), especially when Owner and Friend are foreign keys to a users table rather than actual names. (Personally, my identity columns use the "Id" naming convention, so I would have (OwnerId, FriendId).)
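A minimal sketch of that composite primary key, run through Python's sqlite3 module. The table names follow the question; the `OwnerId`/`FriendId` column names follow the convention mentioned above, and the user names are sample data invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_user (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL);
-- The pair of foreign keys is itself the primary key; no extra id column.
CREATE TABLE tbl_friendship (
    OwnerId INTEGER NOT NULL REFERENCES tbl_user(id),
    FriendId INTEGER NOT NULL REFERENCES tbl_user(id),
    PRIMARY KEY (OwnerId, FriendId)
);
INSERT INTO tbl_user (name) VALUES ('Dave'), ('Matt');
INSERT INTO tbl_friendship VALUES (1, 2);
""")

# A second identical (OwnerId, FriendId) pair violates the composite key.
try:
    conn.execute("INSERT INTO tbl_friendship VALUES (1, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Note that (1, 2) and (2, 1) are still distinct rows under this key; whether friendship should be symmetric is a separate design decision.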
Personally I believe every table should have a primary key, but you'll find others who disagree.
Here's an article I wrote on the topic of normal forms.
http://michaeljswart.com/2011/01/ridiculously-unnormalized-database-schemas-part-zero/
Yes every table should have a primary key.
Yes, you should create a surrogate key, a.k.a. an auto-increment PK field.
You should also make "Friend" an FK to that auto increment field.
If you think that you are going to "rekey" in the future you might want to look into using natural keys, which are fields that naturally identify your data. The key to this is while coding always use the natural identifiers, and then you create unique indexes on those natural keys. In the future if you have to re-key you can, because your ux guarantees your data is consistent.
I would only do this if you absolutely have to, because it increases complexity, in your code and data model.
It is not clear from your description, but are Owner and Friend foreign keys, and can there be only one relationship between any given pair? If so, the two foreign key columns are a perfect candidate for a natural primary key.
Another option is to use surrogate key (extra auto-incremented column as you suggested). Take a look here for an in-depth discussion.
A primary key can be something abstract as well. In this case, each tuple (owner, friend), e.g. ("Dave","Matt") can form a unique entry and therefore be your primary key. In that case, it would be useful not to use names, but keys referencing another table. If you guarantee, that these tuples can't have duplicates, you have a valid primary key.
For processing reasons it might be useful to introduce a special primary key, like an autoincrement field (e.g. in MySQL) or using a sequence with Oracle.
To comply with 1NF (whose exact definition is not completely agreed upon), yes, you should have a primary key identified on each table. This is necessary to provide for the uniqueness of each record.
http://en.wikipedia.org/wiki/First_normal_form
In general, you can create a primary key in many ways, one of which is to have an auto-increment column, another is to have a column with GUIDs, another is to have two or more columns that will identify a row uniquely when taken together.
Your table will be much easier to manage in the long term if it has a primary key. At the very least, you need to uniquely identify each record in the table. The field that is used to uniquely identify each record might as well be the primary key.
Yes every table should have (at least one) key. Duplicating rows in any table is undesirable for lots of reasons so put the constraint on those two columns.

Why we can't have more than one primary key?

I know there can't be more than 1 primary key in a table, but what is the technical reason?
Pulled directly from SO:
You can only have one primary key, but you can have multiple columns in your primary key.
You can also have Unique Indexes on your table, which will work a bit like a primary key in that they will enforce unique values, and will speed up querying of those values.
Primary in the context of Primary Key means that it's ranked first in importance. Therefore, there can only be one key. It's by definition.
It's also usually the key for which the index has the actual data attached to it, that is, the data is stored with the primary key index. Other indices contain only the data that's being indexed, and perhaps some Included Columns.
In fact E.F.Codd (the inventor of the Relational Database Model) [1] originated the term "primary key" to mean any number of keys of a relation - not just one. He made it clear that it was quite possible to have more than one such key. His suggestion was that the database designer could choose one key as a preferred identifier ("the primary key") - but in principle this was optional and such a choice was "arbitrary" (that was his word). Because all keys enjoy the same properties as each other there is no fundamental need to choose any one over another.
Later on [2] what Codd originally called primary keys became known as candidate keys and the one key singled out as the preferred one became known as the "primary" key. This was not really a fundamental shift however because a primary key means exactly the same as a candidate key. Since they are equivalent concepts it doesn't really mean anything important when we say there "must" only be one primary key. If you have more than one candidate key you could quite reasonably call more than one of them "primary" if you prefer because it doesn't make any logical or practical difference to the meaning and function of the database.
It has been argued (by me among others) that the idea of designating one key per table as "primary" is utterly superfluous and sometimes a positive hindrance to a good understanding of database design and data integrity issues. However, the concept is so entrenched we are probably stuck with it.
So the proper answer to your question is "convention" and "convenience". There is no good technical reason at all.
[1] A Relational Model of Data for Large Shared Data Banks (1970)
[2] E.g. in "Further Normalization of the Relational Data Base Model" (1971)
Well, it's called "primary" for a reason. As in, it's the one key used to uniquely identify the record... and there "can be only one".
You could certainly mimic a second "primary" key by placing a unique index on one or more other fields, but for the purposes of your database server that is generally only necessary if your key isn't unique enough across database servers in a merge replication situation (i.e. multi-master).
PRIMARY KEY is usually equivalent to UNIQUE INDEX NOT NULL. So you can effectively have multiple "primary keys" on a single table.
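As a sketch of that equivalence, the hypothetical `users` table below has one declared primary key plus two other columns that are `NOT NULL UNIQUE`, so each of the three can identify a row. It is run through Python's sqlite3 module; the column names and sample values are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One declared PRIMARY KEY, plus two more keys enforced as NOT NULL UNIQUE.
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE,
    ssn TEXT NOT NULL UNIQUE
);
INSERT INTO users (email, ssn) VALUES ('a@example.com', '111');
""")


def insert_ok(email, ssn):
    """Try to insert a row; return False if any key constraint rejects it."""
    try:
        conn.execute("INSERT INTO users (email, ssn) VALUES (?, ?)", (email, ssn))
        return True
    except sqlite3.IntegrityError:
        return False


dup_email = insert_ok("a@example.com", "222")  # clashes on email
dup_ssn = insert_ok("b@example.com", "111")    # clashes on ssn
fresh_row = insert_ok("b@example.com", "222")  # unique on every key
```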
The primary key is the key which uniquely identifies that record.
I'm not sure if you're asking if a) there can be a single primary key spanning multiple columns, or b) if you can have multiple keys which uniquely identify the record.
The first is possible, known as a composite primary key.
The second is possible also, but only one is called the primary key.
Because the "primary" in "primary key" denotes its, mmm, singularity(?).
But if you need more, you can define UNIQUE keys which have quite the same behaviour.
The technical reason is that there can be only one primary. Otherwise it wouldn't be called so.
However a primary key can include several columns - see 7.5.2. Multiple-Column Indexes
The primary key is the one (of possibly many) unique identifiers of a particular row in a table. The other unique identifiers, which were not designated as the primary one, are hence often referred to as secondary unique indexes.
A primary key allows us to uniquely identify each record in the table. You can have two columns in a table's primary key, in which case it is called a composite primary key. "When you define more than one column as your primary key on a table, it is called a composite primary key."
A primary key defines record uniqueness. To have two different measures of uniqueness can be problematic. For example, if you have primary keys A and B and you insert records where A is the same and B is different, then are those records the same or different? If you consider them different, then make your primary a composite of A and B. If you consider them the same record, then just use A or B as the primary key.
We can create more than one non-clustered index on a table; these are typically made on non-primary-key columns used in JOIN, WHERE, and ORDER BY clauses.
With a clustered index, however, we can have only one, and it is usually on the primary key. So if we had two primary keys there would be ambiguity.
Also, in referential integrity there would be ambiguity in selecting one of the two primary keys.
Only one primary key is possible on a table because (in SQL Server, by default) the primary key creates a clustered index on the table, which stores the data physically in the leaf nodes, ordered by the primary key column.
If we tried to create another primary key on that table, there would be a major problem with the data, because we cannot store the same table data in two different orders.

Database "key/ID" design ideas, Surrogate Key, Primary Key, etc

So I've seen several mentions of a surrogate key lately, and I'm not really sure what it is and how it differs from a primary key.
I always assumed that ID was my primary key in a table like this:
Users
ID, Guid
FirstName, Text
LastName, Text
SSN, Int
however, wikipedia defines a surrogate key as "A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data."
According to Wikipedia, it looks like ID is my surrogate key, and my primary key might be SSN+ID? Is this right? Is that a bad table design?
Assuming that table design is sound, would something like this be bad, for a table where the data didn't have anything unique about it?
LogEntry
ID, Guid
LogEntryID, Int [sql identity field +1 every time]
LogType, Int
Message, Text
No, your ID can be both a surrogate key (which just means it's not "derived from application data", e.g. an artificial key), and it should be your primary key, too.
The primary key is used to uniquely and safely identify any row in your table. It has to be stable, unique, and NOT NULL - an "artificial" ID usually has those properties.
I would normally recommend against using "natural" or real data for primary keys - are you REALLY 150% sure it's NEVER going to change?? The Swiss equivalent of the SSN for instance changes each time a woman marries (or gets divorced) - hardly an ideal candidate. And it's not guaranteed to be unique, either......
To spare yourself all that grief, just use a surrogate (artificial) ID that is system-defined, unique, and never changes and never has any application meaning (other than being your unique ID).
Scott Ambler has a pretty good article here which has a "glossary" of all the various keys and what they mean - you'll find natural, surrogate, primary key and a few more.
First, a Surrogate key is a key that is artificially generated within the database, as a unique value for each row in a table, and which has no dependency whatsoever on any other attribute in the table.
Now, the phrase Primary Key is a red herring. Whether a key is primary or an alternate doesn't mean anything. What matters is what the key is used for. Keys can serve two functions which are fundamentally inconsistent with one another.
They are first and foremost there to ensure the integrity and consistency of your data! Each row in a table represents an instance of whatever entity that table is defined to hold data for. No Surrogate Key, by definition, can ever perform this function. Only a properly designed natural Key can do this. (If all you have is a surrogate key, you can always add another row with all other attributes exactly identical to an existing row, as long as you give it a different surrogate key value.)
Secondly, they are there to serve as references (pointers) for the foreign keys in other tables which are child entities of an entity in the table with the Primary Key. A Natural Key (especially if it is a composite of multiple attributes) is not a good choice for this function because it would mean that A) the foreign keys in all the child tables would also have to be composite keys, making them very wide and thereby decreasing the performance of all constraint operations and of SQL joins, and B) if the value of the key changed in the main table, you would be required to do cascading updates on every table where the value was represented as an FK.
So the answer is simple... Always (wherever you care about data integrity/consistency) use a natural key and, where necessary, use both! When the natural key is a composite, or long, or not stable enough, add an alternate Surrogate key (as auto-incrementing integer for example) for use as targets of FKs in child tables. But at the risk of losing data consistency of your table, DO NOT remove the natural key from the main table.
To make this crystal clear let's make an example.
Say you have a table with Bank accounts in it... A natural Key might be the Bank Routing Number and the Account Number at the bank. To avoid using this twin composite key in every transaction record in the transactions table you might decide to put an artificially generated surrogate key on the BankAccount table which is just an integer. But you better keep the natural Key! If you didn't, if you did not also have the composite natural key, you could quite easily end up with two rows in the table as follows
id  BankRoutingNumber  BankAccountNumber  BankBalance
1   12345678932154     9876543210123      $123.12
2   12345678932154     9876543210123      ($3,291.62)
Now, which one is right?
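The situation above can be reproduced in a short sketch. It uses Python's sqlite3 module; the table and column names are assumptions, and balances are stored as integer cents purely to keep the example simple. The first table has only the surrogate id, the second also declares the natural key as UNIQUE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Surrogate key only: nothing stops the same account appearing twice.
CREATE TABLE account_surrogate_only (
    id INTEGER PRIMARY KEY,
    routing_number TEXT NOT NULL,
    account_number TEXT NOT NULL,
    balance_cents INTEGER NOT NULL
);
-- Surrogate key for FK targets, natural key kept as a UNIQUE constraint.
CREATE TABLE account_with_natural_key (
    id INTEGER PRIMARY KEY,
    routing_number TEXT NOT NULL,
    account_number TEXT NOT NULL,
    balance_cents INTEGER NOT NULL,
    UNIQUE (routing_number, account_number)
);
""")

# The two conflicting rows from the example, balances in cents.
rows = [
    ("12345678932154", "9876543210123", 12312),
    ("12345678932154", "9876543210123", -329162),
]
columns = "(routing_number, account_number, balance_cents) VALUES (?, ?, ?)"

conn.executemany("INSERT INTO account_surrogate_only " + columns, rows)
surrogate_only_count = conn.execute(
    "SELECT COUNT(*) FROM account_surrogate_only"
).fetchone()[0]

try:
    conn.executemany("INSERT INTO account_with_natural_key " + columns, rows)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
natural_key_count = conn.execute(
    "SELECT COUNT(*) FROM account_with_natural_key"
).fetchone()[0]
```

The surrogate-only table silently accepts both rows, while the table that keeps the natural key rejects the second one at insert time.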
To marc from comments below, What good does it do you to be able to "identify the row"?? No good at all, it seems to me, because what we need to be able to identify is which bank account the row represents! Identifying the row is only important for internal database technical functions, like joins in queries, or for FK constraint operations, which, if/when they are necessary, should be using a surrogate key anyway, not the natural key.
You are right in that a poor choice of a natural key, or sometimes even the best available choice of a natural key, may not be truly unique, or guaranteed to prevent duplicates. But any choice is better than no choice, as it will at least prevent duplicate rows for the same values in the attributes chosen as the natural key. These issues can be kept to a minimum by the appropriate choice of key attributes, but sometimes they are unavoidable and must be dealt with. But it is still better to do so than to allow incorrect, inaccurate or redundant data into the database.
As to "ease of use" If all you are using the natural key for is to constrain the insertion of duplicate rows, and you are using another, surrogate, key as the target for FK constraints, I do not see any ease of use issues of concern.
Wow, you opened a can of worms with this question. Database purists will tell you never to use surrogate keys (like you have above). On the other hand, surrogate keys can have some tremendous benefits. I use them all the time.
In SQL Server, a surrogate key is typically an auto-increment Identity value that SQL Server generates for you. It has NO relationship to the actual data stored in the table. The opposite of this is a Natural key. An example might be Social Security number. This does have a relationship to the data stored in the table. There are benefits to natural keys, but, IMO, the benefits to using surrogate keys outweigh natural keys.
I noticed in your example, you have a GUID for a primary key. You generally want to stay away from GUIDs as primary keys. They are big and bulky, and can often be inserted into your database in a random order, causing major fragmentation.
Randy
The reason that database purists get all up in arms about surrogate keys is because, if used improperly, they can allow data duplication, which is one of the evils that good database design is meant to banish.
For instance, suppose that I had a table of email addresses for a mailing list. I would want them to be unique, right? There's no point in having 2, 3, or n entries of the same email address. If I use email_address as my primary key ( which is a natural key -- it exists as data independently of the database structure you've created ), this will guarantee that I will never have a duplicate email address in my mailing list.
However, if I have a field called id as a surrogate key, then I can have any number of duplicate email addresses. This becomes bad if there are then 10 rows of the same email address, all with conflicting subscription information in other columns. Which one is correct, if any? There's no way to tell! After that point, your data integrity is borked. There's no way to fix the data but to go through the records one by one, asking people what subscription information is really correct, etc.
The reason why non-purists want it is because it makes it easy to use standardized code, because you can rely on referring to a single database row with an integer value. If you had a natural key of, say, the set ( client_id, email, category_id ), the programmer is going to hate coding around this instance! It kind of breaks the encapsulation of class-based coding, because it requires the programmer to have deep knowledge of table structure, and a delete method may have different code for each table. Yuck!
So obviously this example is over-simplified, but it illustrates the point.
Users Table
Using a Guid as a primary key for your Users table is perfect.
LogEntry table
Unless you plan to expose your LogEntry data to an external system or merge it with another database, I would simply use an incrementing int rather than a Guid as the primary key. It's easier to work with and will use slightly less space, which could be significant in a huge log stretching several years.
The primary key is whatever you make it. Whatever you define as the primary key is the primary key. Usually it's an integer ID field.
The surrogate key is also this ID field. It's a surrogate for the natural key, which defines uniqueness in terms of your application data.
The idea behind having an integer ID as the primary key (even if it doesn't really mean anything) is for indexing purposes. You would then probably define the natural key as a unique constraint on your table. This way you get the best of both worlds: fast indexing with your ID field, and each row still maintains its natural uniqueness.
That said, some people swear by just using a natural key.
There are actually three kinds of keys to talk about. The primary key is what is used to uniquely identify every row in a table. The surrogate key is an artificial key that is created with that property. A natural key is a primary key which is derived from the actual real life data.
In some cases the natural key may be unwieldy so a surrogate key may be created to be used as a foreign key, etc. For example, in a log or diary the PK might be the date, time, and the full text of the entry (if it is possible to add two entries at the exact same time). Obviously it would be a bad idea to use all of that every time that you wanted to identify a row, so you might make a "log id". It might be a sequential number (the most common) or it might be the date plus a sequential number (like 20091222001) or it might be something else. Some natural keys may work well as a primary key though, such as vehicle VIN numbers, student ID numbers (if they are not reused), or in the case of joining tables the PKs of the two tables being joined.
This is just an overview of table key selection. There's a lot to consider, although in most shops you'll find that they go with, "add an identity column to every table and that's our primary key". You then get all of the problems that go with that.
In your case I think that a LogEntryID for your log items seems reasonable. Is the ID an FK to the Users table? If not then I might question having both the ID and the LogEntryID in the same table as they are redundant. If it is, then I would change the name to user_id or something similar.