which database structure to choose for tagging system - sql

I have a database structure as follow:
tbl_products
-pk id (AutoIncrement)
-name
-description
tbl_tags (1)
-pk name
OR
tbl_tags (2)
-pk id (AutoIncrement)
-name (unique)
tbl_products_tags
-fk product_id
-fk tag_id
I have seen that most people choose the tbl_tags (2) structure. I want to ask whether I could choose tbl_tags (1) instead: since name is always unique, I want to make it the primary key. Does it have any downside?

If you make the tag name unique, you have to think about what you'll do if a name needs to be changed. For example, if I want to change "tag" to "tags".
If this is a primary key, then all the child records that refer to "tag" will also have to be updated so the constraint is valid. If you have a lot of rows referring to a given name, running this change is likely to be slow and introduce some blocking/contention into your application. Whereas if you use a surrogate primary key, you only have to update the unique name field, not all the child rows as well.
If you're certain that you'll never update a tag name then you could use it as the primary key. Beware of changing requirements however!
Natural keys generally make sense when using codes that are issued and managed by an external source (e.g. airport, currency and country codes). In these cases you can be sure that the natural key won't change and is guaranteed to be unique within the domain.
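To make the rename cost concrete, here is a minimal sketch of the tbl_tags (2) layout from the question. It uses Python's built-in sqlite3 module purely so that it runs anywhere; the sample product names and the composite key on tbl_products_tags are my own assumptions, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE tbl_products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        description TEXT
    );
    CREATE TABLE tbl_tags (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        name TEXT NOT NULL UNIQUE              -- natural key, still enforced
    );
    CREATE TABLE tbl_products_tags (
        product_id INTEGER NOT NULL REFERENCES tbl_products(id),
        tag_id INTEGER NOT NULL REFERENCES tbl_tags(id),
        PRIMARY KEY (product_id, tag_id)       -- assumed, not in the question
    );
    INSERT INTO tbl_products (name) VALUES ('widget'), ('gadget');
    INSERT INTO tbl_tags (name) VALUES ('tag');
    INSERT INTO tbl_products_tags VALUES (1, 1), (2, 1);
""")

# Renaming the tag updates exactly one row in tbl_tags.
cur = conn.execute("UPDATE tbl_tags SET name = 'tags' WHERE name = 'tag'")
rows_updated = cur.rowcount

# The link rows were never touched, yet they resolve to the new name.
tag_names = [row[0] for row in conn.execute("""
    SELECT t.name
    FROM tbl_products_tags pt
    JOIN tbl_tags t ON t.id = pt.tag_id
    ORDER BY pt.product_id
""")]
print(rows_updated, tag_names)
```

With tbl_tags (1), the same rename would also have to rewrite every tag_id value in tbl_products_tags.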

My understanding is there would be a marginal performance penalty to tbl_tags (1) in the context of a very large dataset when compared to option 2. In smaller datasets, probably not so much. The machine can process integers much more efficiently than strings.
In the bigger picture though, with modern processor speeds, the difference between the two might be negligible in all but the largest datasets.
Of course, I am speaking about relational databases here. The various flavors of NoSQL are a different animal.
Also, there is the matter of consistency. The other tables in your database all seem to be using (what I assume to be) an auto-incrementing integer ID. For that reason, I would use it on the tags table as well.
The use of auto-incrementing integer PK fields vs "Natural Keys" in designing a database is a long-standing debate. My understanding is academics largely prefer the "Natural Keys" concept, while in practice some form of generated unique key tends to be the norm.
Personally, I prefer to create generated keys which have no meaning to the end user, integers where possible. Unless I have missed something, index performance is significantly enhanced.

Related

Adding an artificial primary key versus using a unique field [duplicate]

Recently I inherited a huge app from somebody who left the company.
This app uses a SQL Server database.
The developer always defined an int-based primary key on tables. For example, even if the Users table has a unique UserName field, he always added an integer identity primary key.
This is done for every table, no matter whether other fields are unique and could serve as the primary key.
Do you see any benefit in this? That is, using UserName as the primary key vs. adding UserID (an identity column) and setting that as the primary key?
I feel like I have to add another element to my comments, which were starting to turn into an essay, so I think it is better that I post it all as an answer instead.
Sometimes there are domain specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static and ever-increasing clustered index alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that. Read this for further details.
But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.
One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.
But a natural primary key can eliminate the need for a join in the first place.
Let's put all this together with an example:
create table Countries
(
    countryCode char(2) not null primary key clustered,
    countryName varchar(64) not null
);
insert Countries values
    ('AU', 'Australia'),
    ('FR', 'France');
create table TourLocations
(
    tourLocationName varchar(64) not null,
    tourLocationId int identity(1,1) unique clustered,
    countryCode char(2) not null foreign key references Countries(countryCode),
    primary key (countryCode, tourLocationName)
);
insert TourLocations (tourLocationName, countryCode) values
    ('Bondi Beach', 'AU'),
    ('Eiffel Tower', 'FR');
I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.
On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.
But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.
By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.
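The "same fact twice" problem is easy to reproduce. The following is only a sketch: it uses Python's sqlite3 module rather than SQL Server so that it is self-contained, and the table name TourLocationsSurrogateOnly is a hypothetical variant I made up for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical variant: surrogate key only, no natural key declared.
    CREATE TABLE TourLocationsSurrogateOnly (
        tourLocationId INTEGER PRIMARY KEY AUTOINCREMENT,
        countryCode TEXT NOT NULL,
        tourLocationName TEXT NOT NULL
    );
    -- Surrogate key plus the natural key as a unique constraint.
    CREATE TABLE TourLocations (
        tourLocationId INTEGER PRIMARY KEY AUTOINCREMENT,
        countryCode TEXT NOT NULL,
        tourLocationName TEXT NOT NULL,
        UNIQUE (countryCode, tourLocationName)
    );
""")

row = ("AU", "Bondi Beach")
insert_unchecked = (
    "INSERT INTO TourLocationsSurrogateOnly (countryCode, tourLocationName) "
    "VALUES (?, ?)"
)
insert_checked = (
    "INSERT INTO TourLocations (countryCode, tourLocationName) VALUES (?, ?)"
)

# Without a natural key, the same fact is happily stored twice.
for _ in range(2):
    conn.execute(insert_unchecked, row)
duplicates = conn.execute(
    "SELECT COUNT(*) FROM TourLocationsSurrogateOnly"
).fetchone()[0]

# With the natural key declared, the second insert is rejected.
conn.execute(insert_checked, row)
try:
    conn.execute(insert_checked, row)
    second_insert_rejected = False
except sqlite3.IntegrityError:
    second_insert_rejected = True

print(duplicates, second_insert_rejected)
```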
OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.
But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.
Suppose I gave the results of this query to my users:
select countryCode, tourLocationName
from TourLocations
order by 1, 2;
Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.
When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.
If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.
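Here is a rough side-by-side of the two designs, again sketched with Python's sqlite3 module instead of SQL Server. The CountriesById and TourLocationsById tables are hypothetical; they only exist to show what foreign keying to an identity surrogate would look like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Design from the answer: the foreign key is the meaningful code.
    CREATE TABLE Countries (
        countryCode TEXT PRIMARY KEY,
        countryName TEXT NOT NULL
    );
    CREATE TABLE TourLocations (
        tourLocationId INTEGER PRIMARY KEY,
        countryCode TEXT NOT NULL REFERENCES Countries(countryCode),
        tourLocationName TEXT NOT NULL,
        UNIQUE (countryCode, tourLocationName)
    );
    -- Hypothetical alternative: the foreign key is a meaningless integer.
    CREATE TABLE CountriesById (
        countryId INTEGER PRIMARY KEY,
        countryCode TEXT NOT NULL UNIQUE,
        countryName TEXT NOT NULL
    );
    CREATE TABLE TourLocationsById (
        tourLocationId INTEGER PRIMARY KEY,
        countryId INTEGER NOT NULL REFERENCES CountriesById(countryId),
        tourLocationName TEXT NOT NULL
    );
    INSERT INTO Countries VALUES ('AU', 'Australia'), ('FR', 'France');
    INSERT INTO TourLocations (countryCode, tourLocationName) VALUES
        ('AU', 'Bondi Beach'), ('FR', 'Eiffel Tower');
    INSERT INTO CountriesById VALUES
        (1, 'AU', 'Australia'), (2, 'FR', 'France');
    INSERT INTO TourLocationsById (countryId, tourLocationName) VALUES
        (1, 'Bondi Beach'), (2, 'Eiffel Tower');
""")

# Natural-key foreign key: one table is enough.
no_join = conn.execute("""
    SELECT countryCode, tourLocationName
    FROM TourLocations
    ORDER BY 1, 2
""").fetchall()

# Surrogate foreign key: the same output needs a join.
with_join = conn.execute("""
    SELECT c.countryCode, t.tourLocationName
    FROM TourLocationsById t
    JOIN CountriesById c ON c.countryId = t.countryId
    ORDER BY 1, 2
""").fetchall()

print(no_join == with_join)
```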
You are talking about the difference between synthetic and natural keys.
In my [very] personal opinion, I would recommend to always use synthetic keys (and always call it id). The main problem is that natural keys are never unique; they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.
In database design:
Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They only serve as unique identifiers for the rows.
I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".
Yes, having a dedicated int is a good thing for PK use.
You may have multiple alternate keys; that's OK too.
Two great reasons for it:
It is performant.
It protects against key mutation (editing a name, etc.).
A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.
Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.
By meaningless I mean that apart from being the primary key it has no meaning to the users.
An int identity column has other advantages over a text field as primary key.
It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
It is faster to compare and index than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture dependent behaviors resulting in text comparisons not always working as expected.
int primary keys generated in ascending order have a superior performance in conjunction with clustered primary keys (which is a SQL-Server specialty).
Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.
But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.
This subject is much like Gulliver's Travels, with wars being fought over which end of the egg you are supposed to crack open.
However, using the SAME "id" name for all tables, with an autonumber? Yes, it is a LONG-established choice.
There are of course MANY different views on this subject, and many advantages and disadvantages.
Regardless of which choice one prefers (or even needs), this is a long-established concept in our industry. In fact, SharePoint tables use "ID" and autonumber by default. So does MS Access, and there are probably more that do this.
The simple concept?
You can build your tables with the PK, and child tables with foreign keys.
At that point you set up your relationships between the tables.
Now, you might decide to add, say, some invoice number or whatever. Rules might mean that such an invoice number is not duplicated.
But WHY do we care if you have some "user" name, or some "invoice" number, or whatever? Why should that fact affect your relational database model?
You mean I don't have a user name, or don't have an invoice number, and the whole database and its relationships don't work anymore? We don't care!
The concept of data, even required fields, or even a column having to be unique ?
That has ZERO to do with a working relational data model.
And maybe you decide that the invoice number is not generated until, say, it is sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care. You can have all kinds of business rules for those numbers, but they have ZERO to do with the fact that you designed a working relational data model based on so-called "surrogate" (sometimes called synthetic) keys.
So, once you build that data model, even with JUST the PK "id" and FKs (foreign keys), you are NOW free to start adding columns and to define what type of data you are going to put in each table. But what you shove into each table has ZERO to do with that working relational data model. They are to be thought of as separate concepts.
So, if you have a user name, add that column to the table. If you don't want user names, remove the column. The data you store in the table has ZERO to do with the automatic PK ID you are using; it is not really any different from, say, which area of memory the computer is going to allocate to load that data. The basic data operations of the system have nothing to do with having built a database with relationships that simply exist. And the data columns you add after having built those relationships are up to you, but they will not, and should not, affect the operation of the database and the relationships you built and set up. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships, as opposed to the data columns you add to such tables to store user data.
I mean, in json data, xml? We often have a master + child table relationship. We don't care how that relationship is maintained - but only that it exists.
Thus yes, all tables have that PK "ID". Even better? In code, you NEVER have to guess what the PK id is called; it is always the same!
So, the data and columns you put and toss into a table? Those columns and data have zero to do with the PK id, and while it is the database generating that PK, it could just as well be a web service call to some monkeys living in a faraway jungle eating bananas, who give you a PK value based on how many bananas they have eaten. We just really don't care about that number. It is just an internal housekeeping number, one that we don't see or even care about in most code. And thus the number one rule for such automatic PK values?
You NEVER give that auto PK number any meaning from a user and application point of view.
In summary:
Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many systems, it is not only the default, but is in fact required for such systems to operate.
It's better to use userid. The user table is referenced by many other tables.
Each referencing table would contain the primary key of the user table as a foreign key.
It's better to use userid since it is an integer value:
it takes less space than the string values of username, and
searches by the database engine would be faster.
user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)

Safe to use human readable primary keys in SQL?

I want to know if I can use human readable primary keys for a relatively small number of database objects, which will describe large metropolitan areas.
For example, using "washington_dc" as the pk for the Washington, DC metro area, or "nyc" for the New York City one.
Tons of objects will be foreign keyed to these metro area objects, and I'd like to be able to tell where a person or business is located just by looking at their database record.
I'm just worried because my gut tells me this might be a serious crime against good practices.
So, am I "allowed" to do this kind of thing?
Thanks!
It all depends on the application - natural primary keys make a good deal of sense on the surface, since they are human readable and don't require any joins when displaying data to end users.
However, natural primary keys tend to be larger than INT (or even BIGINT) surrogate primary keys and there are very few domains where there isn't some danger of having a natural primary key change. To take your example, a city changing its name is not a terribly uncommon occurrence. When a city's name changes you are then left with either an update that needs to touch every instance of city as a foreign key or with a primary key that no longer reflects reality ("The data shows Leningrad, but it really is St. Petersburg.")
So in sum, natural primary keys:
1. Take up more disk space (most of the time)
2. Are more susceptible to change (in the majority of cases)
3. Are more human readable (as long as they don't change)
Whether #1 and #2 are sufficiently counteracted by #3 depends on what you are building and what its use is.
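As an illustration of the "update that needs to touch every instance" case, here is a small sketch using Python's sqlite3 module. The metro_areas and businesses tables and their contents are hypothetical, and ON UPDATE CASCADE is just one way of handling the rename; the point is only that every child row gets rewritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE metro_areas (
        code TEXT PRIMARY KEY              -- human-readable natural key
    );
    CREATE TABLE businesses (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        metro_code TEXT NOT NULL
            REFERENCES metro_areas(code) ON UPDATE CASCADE
    );
    INSERT INTO metro_areas VALUES ('leningrad'), ('nyc');
    INSERT INTO businesses (name, metro_code) VALUES
        ('Bakery', 'leningrad'),
        ('Bookshop', 'leningrad'),
        ('Deli', 'nyc');
""")

# Renaming the key forces the engine to rewrite every referencing row.
conn.execute(
    "UPDATE metro_areas SET code = 'st_petersburg' WHERE code = 'leningrad'"
)

renamed = conn.execute(
    "SELECT COUNT(*) FROM businesses WHERE metro_code = 'st_petersburg'"
).fetchone()[0]
stale = conn.execute(
    "SELECT COUNT(*) FROM businesses WHERE metro_code = 'leningrad'"
).fetchone()[0]
print(renamed, stale)
```

With two child rows this is instant; with "tons of objects" foreign keyed to the metro area, it is the expensive update described above.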
I think that this question
What are the design criteria for primary keys?
gives a really good overview of the tradeoffs you might be making. I think the answer given is the correct one, but its brevity belies some significant thinking you actually have to do to work out what's right for you.
(From that answer)
The criteria for consideration of a primary key are:
Uniqueness
Irreducibility (no subset of the key uniquely identifies a row in the table)
Simplicity (so that relational representation & manipulation can be simpler)
Stability (should not be altered frequently)
Familiarity (meaningful to the user)
For what it's worth, the small number of times I've had problems with scaling by choosing strings as the primary key is about the same as the number of times I've had problems with redundant data using an autoincrement key. The problems that arise with autoincrement keys are worse, in my opinion, because you don't usually see them as soon.
A primary key must be unique and immutable; a human-readable string can be used as a PK so long as it meets both of those requirements.
In the example you've given, it sounds fine, given that cities don't change their names (and in the rare event they do then you can change the PK value with enough effort).
One of the main reasons you'd use numeric PKs instead of strings is performance (the other being to take advantage of automatically-incrementing IDs, see IDENTITY). If you anticipate more than a hundred queries per second on your textual PK then I would move to use int or bigint as a PK type. When you reach that level of database size and complexity you tend to stop using SSMS to edit table data directly and use your own tools, which would presumably perform a JOIN so you'd get the city name in the same resultset as the city's numeric PK.
You are allowed.
It is generally not the best practice.
Numeric, auto-incrementing keys are preferred. They are easily maintained and allow for coding of input forms and other interfaces where the user does not have to think up a new string as a key...
Imagine: should it be washington, or washington_dc, or dc, or washingtondc, etc.?

Does every table really need an auto-incrementing artificial primary key? [closed]

Almost every table in every database I've seen in my 7 years of development experience has an auto-incrementing primary key. Why is this? If I have a table of U.S. states where each state must have a unique name, what's the use of an auto-incrementing primary key? Why not just use the state name as the primary key? Seems to me like an excuse to allow duplicates disguised as unique rows.
This seems plainly obvious to me, but then again, no one else seems to be arriving at and acting on the same logical conclusion as me, so I must assume there's a good chance I'm wrong.
Is there any real, practical reason we need to use auto-incrementing keys?
This question has been asked numerous times on SO and has been the subject of much debate over the years amongst (and between) developers and DBAs.
Let me start by saying that the premise of your question implies that one approach is universally superior to the other ... this is rarely the case in real life. Surrogate keys and natural keys both have their uses and challenges - and it's important to understand what they are. Whichever choice you make in your system, keep in mind there is benefit to consistency - it makes the data model easier to understand and easier to develop queries and applications for. I also want to say that I tend to prefer surrogate keys over natural keys for PKs ... but that doesn't mean that natural keys can't sometimes be useful in that role.
It is important to realize that surrogate and natural keys are NOT mutually exclusive - and in many cases they can complement each other. Keep in mind that a "key" for a database table is simply something that uniquely identifies a record (row). It's entirely possible for a single row to have multiple keys representing the different categories of constraints that make a record unique.
A primary key, on the other hand, is a particular unique key that the database will use to enforce referential integrity and to represent a foreign key in other tables. There can only be a single primary key for any table. The essential quality of a primary key is that it be 100% unique and non-NULL. A desirable quality of a primary key is that it be stable (unchanging). While mutable primary keys are possible - they cause many problems for databases that are better avoided (cascading updates, RI failures, etc). If you do choose to use a surrogate primary key for your table(s) - you should also consider creating unique constraints to reflect the existence of any natural keys.
Surrogate keys are beneficial in cases where:
Natural keys are not stable (values may change over time)
Natural keys are large or unwieldy (multiple columns or long values)
The definition of a natural key can change over time (columns added to or removed from the key)
By providing a short, stable, unique value for every row, we can reduce the size of the database, improve its performance, and reduce the volatility of dependent tables which store foreign keys. There's also the benefit of key polymorphism, which I'll get to later.
In some instances, using natural keys to express relationships between tables can be problematic. For instance, imagine you had a PERSON table whose natural key was {LAST_NAME, FIRST_NAME, SSN}. What happens if you have some other table GRANT_PROPOSAL in which you need to store a reference to a Proposer, Reviewer, Approver, and Authorizer? You now need 12 columns (4 roles x 3 key columns) to express this information. You also need to come up with a naming convention of some kind to identify which columns belong to which kind of individual. But what if your PERSON table required 6, or 8, or 24 columns to form a natural key? This rapidly becomes unmanageable. Surrogate keys resolve such problems by divorcing the semantics (meaning) of a key from its use as an identifier.
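The column arithmetic can be spelled out in a few lines of Python. The ROLE_COLUMN naming convention and the PERSON_ID-style surrogate columns are assumptions made up for the illustration.

```python
# Natural key of PERSON and the roles GRANT_PROPOSAL must reference.
person_key = ["LAST_NAME", "FIRST_NAME", "SSN"]
roles = ["PROPOSER", "REVIEWER", "APPROVER", "AUTHORIZER"]

# With the natural key, every role needs a copy of every key column.
natural_fk_columns = [f"{role}_{col}" for role in roles for col in person_key]

# With a surrogate key, every role needs a single column.
surrogate_fk_columns = [f"{role}_ID" for role in roles]

print(len(natural_fk_columns), len(surrogate_fk_columns))
```

A wider natural key multiplies the first number; the second stays at one column per role.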
Let's also take a look at the example you described in your question.
Should the 2-character abbreviation of a state be used as the primary key of that table?
On the surface, it looks like the abbreviation field meets the requirements of a good primary key. It's relatively short, it is easy to propagate as a foreign key, it looks stable. Unfortunately, you don't control the set of abbreviations ... the postal service does. And here's an interesting fact: in 1969 the USPS changed the abbreviation of Nebraska from NB to NE to minimize confusion with New Brunswick, Canada. The moral of the story is that natural keys are often outside of the control of the database ... and they can change over time. Even when you think they cannot. This problem is even more pronounced for more complicated data like people, or products, etc. As businesses evolve, the definitions for what makes such entities unique can change. And this can create significant problems for data modelers and application developers.
Earlier I mentioned that surrogate keys can support key polymorphism. What does that mean? Well, polymorphism is the ability of one type, A, to appear as and be used like another type, B. In databases, this concept refers to the ability to combine keys from different classes of entities into a single table. Let's look at an example. Imagine for a moment that you want to have an audit trail in your system that identifies which entities were modified by which user on what date. It would be nice to create a table with the fields: {ENTITY_ID, USER_ID, EDIT_DATE}. Unfortunately, using natural keys, different entities have different keys. So now we need to create a separate linking table for each kind of entity ... and build our application in a manner where it understands the different kinds of entities and how their keys are shaped.
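One possible shape for such an audit table is sketched below with Python's sqlite3 module. The products, customers and users tables are hypothetical, and the extra entity_type column is my own assumption: per-table auto-increment ids overlap, so something has to say which table an id belongs to (a database-wide sequence or UUID keys would be another way).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical entities; each has a single-integer surrogate key.
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL UNIQUE);

    -- One audit table serves every entity because all keys have one shape.
    CREATE TABLE audit_trail (
        entity_type TEXT NOT NULL,   -- assumption: disambiguates overlapping ids
        entity_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL REFERENCES users(id),
        edit_date TEXT NOT NULL
    );

    INSERT INTO products (name) VALUES ('widget');
    INSERT INTO customers (email) VALUES ('a@example.com');
    INSERT INTO users (username) VALUES ('editor');
    INSERT INTO audit_trail VALUES
        ('products', 1, 1, '2024-01-01'),
        ('customers', 1, 1, '2024-01-02');
""")

rows = conn.execute("""
    SELECT entity_type, entity_id
    FROM audit_trail
    ORDER BY edit_date
""").fetchall()
print(rows)
```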
Don't get me wrong. I'm not advocating that surrogate keys should ALWAYS be used. In the real world, "never", "ever", and "always" are dangerous positions to adopt. One of the biggest drawbacks of surrogate keys is that they can result in tables that have foreign keys consisting of lots of "meaningless" numbers. This can make it cumbersome to interpret the meaning of a record since you have to join or look up records from other tables to get a complete picture. It also can make a distributed database deployment more complicated, as assigning unique incrementing numbers across servers isn't always possible (although most modern databases like Oracle and SQL Server mitigate this via sequence replication).
No.
In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, so can SSN's and other items of information.
In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.
A primary key must be:
unique (100% guaranteed! Not just "almost" unique)
NON NULL
A primary key should be:
stable if ever possible (not change - or at least not too frequently)
State two-letter codes definitely would offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, a two-letter CHAR(2) column just the same. I would not ever use a VARCHAR(100) field or something like that as a key - it's just too clunky, most likely will change all the time - not a good key candidate.
So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since naturally occurring data is rarely up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.
I think the use of the word "Primary" in the phrase "Primary Key" is, in a real sense, misleading.
First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table.
Then, any key serves several, often mutually inconsistent, purposes.
Purpose 1. To use as joins conditions to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables)
Purpose 2. (related) Ensuring that child records must have a parent record in the parent table (The child table FK must exist as Key in the parent table)
Purpose 3. To increase performance of queries that need to rapidly locate a specific record/row in the table.
Purpose 4. (Most important from a data consistency perspective!) To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted into the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)
Clearly, any non-meaningful, non-natural key (like a GUID or an auto-generated integer) is totally incapable of satisfying Purpose 4.
But often, with many (most) tables, a totally natural key which can provide #4 will consist of multiple attributes and be excessively wide, or so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequences.
The answer is simple. Use both. Use a simple auto-generating integral key for all joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) has an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is declared as a unique key, to avoid inconsistent duplicate data...
The only time you can get away without both is for a completely stand alone table that participates in no relationships with other tables and has an obvious and reliable natural key.
In general, a numeric primary key will perform better than a string. You can additionally create unique keys to prevent duplicates from creeping in. That way you get the assurance of no duplicates, but you also get the performance of numbers (vs. strings in your scenario).
In all likelihood, the major databases have some performance optimizations for integer-based primary keys that are not present for string-based primary keys. But that is only a reasonable guess.
Yes, in my opinion every table needs an auto incrementing integer key because it makes both JOINs and (especially) front-end programming much, much, much easier. Others feel differently, but this is over 20 years of experience speaking.
The single exception is small "code" or "lookup" tables, in which I'm willing to substitute a short (4 or 5 character) TEXT code value. I do this because I often use a lot of these in my databases and it allows me to present a meaningful display to the user without having to look up the description in the lookup table or JOIN it into a result set. Your example of a States table would fit in this category.
No, absolutely not.
Having a primary key which can't change is a good idea (UPDATE is legal for primary key columns, but in general potentially confusing and can create problems for child rows). But if your application has some other candidate which is more suitable than an auto-incrementing value, then you should probably use that instead.
Performance-wise, in general fewer columns are better, and particularly fewer indexes. If you have another column which has a unique index on it AND can never be changed by any business process, then it may be a suitable primary key.
Speaking from a MySQL (InnoDB) perspective, it's also a good idea to use a "real" column as a primary key rather than an "artificial" one, as InnoDB always clusters the primary key and includes it in secondary indexes (that is how it finds the rows in them). This gives it potential to do useful optimisation with a primary key which it can't with any other unique index. MSSQL users often choose to cluster the primary key, but it can also cluster a different unique index.
EDIT:
But if it's a small database and you don't really care about performance or size too much, adding an unnecessary auto-increment column isn't that bad.
A non auto-incrementing value (e.g. UUID, or some other string generated according to your own algorithm) may be useful for distributed, sharded, or diverse systems where maintaining a consistent auto-incrementing ID is difficult (or impossible - think of a distributed system which continues to insert rows on both sides of a network partition).
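A toy sketch of the network-partition case, in plain Python: the two "sides" are just dictionaries and insert_rows is a hypothetical helper, but it shows why random UUIDs need no coordination. Collisions between version 4 UUIDs are possible in principle, only astronomically unlikely.

```python
import uuid


def insert_rows(node_name, count):
    """Simulate one side of a partition assigning keys on its own."""
    return {str(uuid.uuid4()): f"{node_name}-row-{i}" for i in range(count)}


# Each side keeps inserting without talking to the other.
side_a = insert_rows("a", 1000)
side_b = insert_rows("b", 1000)

# After the partition heals, the rows merge without key conflicts.
merged = {**side_a, **side_b}
print(len(merged))
```

Two auto-increment counters running independently would both have handed out 1, 2, 3, ... and collided on every row.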
I think there are two things that may explain the reason why auto-incrementing keys are sometimes used:
Space consideration: OK, your state name doesn't amount to much, but the space it takes may add up. If you really want to store the state with its name as a primary key, then go ahead, but it will take more space. That may not be a problem in certain cases, and it sounds like a problem of olden days, but the habit is perhaps ingrained. And we programmers and DBAs do love habits :D
Defensive consideration: I recently had the following problem. We have users in the database where the email is the key to all identification. Why not make the email the primary key? Except suddenly border cases creep in where one guy must be there twice to have two different addresses, and nobody talked about it in the specs so the address is not normalized, and there's this situation where two different emails must point to the same person and... After a while, you stop pulling your hair out and add the damn integer id column.
I'm not saying it's a bad habit, nor a good one; I'm sure good systems can be designed around reasonable primary keys, but these two points lead me to believe fear and habit are two among the culprits.
It's a key component of relational databases. Having an integer relate to a state instead of having the whole state name saves a bunch of space in your database! Imagine you have a million records referencing your state table. Do you want to use 4 bytes for a number on each of those records or do you want to use a whole crapload of bytes for each state name?
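To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a 4-byte integer key and state names stored as single-byte characters, and it ignores any per-value length overhead or index overhead the database adds. The function name is made up for illustration.

```python
# Assumed sizes: a 4-byte integer key versus the state name stored as
# single-byte characters (no per-value length overhead counted).
INT_KEY_BYTES = 4
ROWS = 1_000_000

def referencing_bytes(key_bytes, rows=ROWS):
    """Bytes spent on the referencing column alone."""
    return key_bytes * rows

for state in ("Ohio", "California", "North Carolina"):
    as_name = referencing_bytes(len(state.encode("ascii")))
    as_int = referencing_bytes(INT_KEY_BYTES)
    print(f"{state}: {as_name:,} bytes as name, {as_int:,} bytes as int")
```

Note that a very short name such as "Ohio" costs no more than the integer under these assumptions; the saving only shows up for longer names.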
Here are some practical considerations.
Most modern ORMs (Rails, Django, Hibernate, etc.) work best when there is a single integer column as the primary key.
Additionally, having a standard naming convention (e.g. id as primary key and table_name_id for foreign keys) makes identifying keys easier.

What should I consider when selecting a data type for my primary key?

When I am creating a new database table, what factors should I take into account for selecting the primary key's data type?
Sorry to do that, but I found that the answers I gave to related questions (you can check this and this) could apply to this one. I reshaped them a little bit...
You will find many posts dealing with this issue, and each choice you'll make has its pros and cons. Arguments for these usually refer to relational database theory and database performance.
On this subject, my point is very simple: surrogate primary keys ALWAYS work, while Natural keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons: field too short, rules change, etc.
To this point, you've guessed here that I am basically a member of the uniqueIdentifier/surrogate primary key team, and even if I appreciate and understand arguments such as the ones presented here, I am still looking for the case where "natural" key is better than surrogate ...
In addition to this, one of the most important but always forgotten arguments in favor of this basic rule is related to code normalization and productivity:
each time I create a table, shall I lose time
identifying its primary key and its physical characteristics (type, size)
remembering these characteristics each time I want to refer to it in my code?
explaining my PK choice to other developers in the team?
My answer is no to all of these questions:
I have no time to lose trying to identify "the best Natural Primary Key" when the surrogate option gives me a bullet-proof solution.
I do not want to remember that the Primary Key of my Table_whatever is a 10 characters long string when I write the code.
I don't want to lose my time negotiating the Natural Key length: "well if You need 10 why don't you take 12 to be on the safe side?". This "on the safe side" argument really annoys me: If you want to stay on the safe side, it means that you are really not far from the unsafe side! Choose surrogate: it's bullet-proof!
So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_MyTable' which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, where a field combination offers a very acceptable Primary Key, I prefer to create this 'id_myManyToManyTable' field being a uniqueIdentifier, just to stick to the rule, and because, finally, it does not hurt.
The major advantage is that you don't have to care anymore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.
And if you still want to have your "Natural Key" somewhere in your table, I advise you to build it following a standard model such as
Tbl_whatever
id_whatever, unique identifier, primary key
code_whatever, whateverTypeYouWant(whateverLengthYouEstimateTheRightOne), indexed
.....
Where id_ is the prefix for primary key, and code_ is used for "natural" indexed field. Some would argue that the code_ field should be set as unique. This is true, and it can be easily managed either through DDL or external code. Note that many "natural" keys are calculated (invoice numbers), so they are already generated through code
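As a rough sketch of the model above, here it is in SQLite driven from Python. SQLite has no uniqueidentifier type, so the sketch assumes the surrogate key is a UUID stored as text; the add_whatever helper and the sample code value are made up for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tbl_whatever (
        id_whatever   TEXT PRIMARY KEY,            -- surrogate key (UUID as text)
        code_whatever VARCHAR(10) NOT NULL UNIQUE  -- "natural" key, unique via DDL
    )
    """
)

def add_whatever(code):
    """Insert a row with a freshly generated surrogate key and return that key."""
    new_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO tbl_whatever (id_whatever, code_whatever) VALUES (?, ?)",
        (new_id, code),
    )
    return new_id

first_id = add_whatever("INV-0001")

try:
    add_whatever("INV-0001")  # same natural code again
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

The UNIQUE constraint is the "managed through DDL" option; the surrogate key stays the thing other tables point at.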
I am not sure that my rule is the best one. But it is a very efficient one! If everyone was applying it, we would for example avoid time lost answering to this kind of question!
If using a numeric key, make sure the datatype is going to be large enough to hold the number of rows you might expect the table to grow to.
If using a guid, does the extra space needed to store the guid need to be considered? Will coding against guid PKs be a pain for developers or users of the application?
If using composite keys, are you sure that the combined columns will always be unique?
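For the first point, about a numeric key being large enough, a quick estimate can help. This sketch assumes signed 32-bit and 64-bit integer columns and a hypothetical insert rate of one million rows per day; the function name is made up.

```python
# Maximum values of signed integer column types of the usual widths.
INT_MAX = 2**31 - 1      # 32-bit signed
BIGINT_MAX = 2**63 - 1   # 64-bit signed

def years_until_exhausted(max_value, rows_per_day):
    """Rough number of years before an incrementing key reaches max_value."""
    return max_value / rows_per_day / 365

# Hypothetical insert rate: one million rows per day.
print(f"int:    {years_until_exhausted(INT_MAX, 1_000_000):.1f} years")
print(f"bigint: {years_until_exhausted(BIGINT_MAX, 1_000_000):.3g} years")
```

Plug in your own expected insert rate; deleted rows and rolled-back inserts can also consume key values in many databases, so leave some margin.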
I don't really like what they teach in school, that is using a 'natural key' (for example ISBN in a book database) or even having a primary key made up of 2 or more fields. I would never do that. So here's my little advice:
Always have one dedicated column in every table for your primary key.
They should all have the same column name across all tables, i.e. "ID" or "GUID"
Use GUIDs when you can (if you don't need performance), otherwise incrementing INTs
EDIT:
Okay, I think I need to explain my choices a little bit.
Having a dedicated column named the same across all tables for your primary key just makes your SQL statements a lot easier to construct, and easier for someone else (who might not be familiar with your database layout) to understand. Especially when you're doing lots of JOINs and things like that. You won't need to look up what the primary key for a specific table is; you already know, because it's the same everywhere.
GUIDs vs. INTs doesn't really matter that much most of the time. Unless you hit the performance cap of GUIDs or are doing database merges, you won't have major issues with one or the other. BUT there's a reason I prefer GUIDs. The global uniqueness of GUIDs might come in handy some day. Maybe you don't see a need for it now, but things like synchronizing parts of the database to a laptop / cell phone, or even finding data records without needing to know which table they're in, are great examples of the advantages GUIDs can provide. An integer only identifies a record within the context of one table, whereas a GUID identifies a record everywhere.
In most cases I use an identity int primary key, unless the scenario requires a lot of replication, in which case I may opt for a GUID.
I (almost) never used meaningful keys.
Unless you have an ultra-convenient natural key available, always use a synthetic (a.k.a. surrogate) key of a numeric type. Even if you do have a natural key available, you might want to consider using a synthetic key anyway and placing an additional unique index on your natural key. Consider what happened to higher-ed databases that used social security numbers as PKs when federal law changed: the costs of changing over to synthetic keys were enormous.
Also, I have to disagree with the practice of naming every primary key the same, e.g. "id". This makes queries harder to understand, not easier. Primary keys should be named after the table. For example employee.employee_id, affiliate.affiliate_id, user.user_id, and so on.
Do not use a floating point numeric type, since floating point numbers cannot be properly compared for equality.
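A short illustration of why, using Python floats (IEEE 754 doubles, the same representation most databases use for their double-precision type):

```python
import math

a = 0.1 + 0.2
b = 0.3

print(a == b)              # False: the two values differ in their last bits
print(math.isclose(a, b))  # True, but a key lookup needs exact equality

# An integer key has no such problem.
print(1 + 2 == 3)          # True
```

A key lookup is an exact-equality comparison, so a value that went through any arithmetic or conversion on the way in may never match again.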
Where do you generate it? Incrementing numbers don't fit well for keys generated by the client.
Do you want a data-dependent or independent key (sometimes you could use an ID from business data, can't say if this is always useful or not)?
How well can this type be indexed by your DB?
I have used uniqueidentifiers (GUIDs) or incrementing integers so far.
Cheers
Matthias
Numbers that have meaning in the real world are usually a bad idea, because every so often the real world changes the rules about how those numbers are used, in particular to allow duplicates, and then you've got a real mess on your hands.
I'm partial to using a generated integer key. If you expect the database to grow very large, you can go with bigint.
Some people like to use guids. The pro there is that you can merge multiple instances of the database without altering any keys but the con is that performance can be affected.
For a "natural" key, whatever datatype suits the column(s). Artificial (surrogate) keys are usually integers.
It all depends.
a) Are you fine having unique sequential numbers as your primary key? If yes, then an integer identity column will suffice. (Note that UniqueIdentifier is SQL Server's GUID type, not a sequential number.)
b) If your business demand is such that you need an alphanumeric primary key, then you have to go for varchar or nvarchar.
These are the two options I could think of.
A great factor is how much data you're going to store. I work for a web analytics company, and we have LOADS of data. So a GUID primary key on our pageviews table would kill us, due to the size.
A rule of thumb: For high performance, you should be able to store your entire index in memory. Guids could easily break this!
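A rough way to check that rule of thumb is to multiply the key size by the row count. This sketch assumes raw key sizes of 4, 8 and 16 bytes and a hypothetical table of two billion rows; real indexes add per-entry and per-page overhead, so treat the results as lower bounds.

```python
# Assumed raw key sizes in bytes; real indexes add per-entry and per-page
# overhead, so these are lower bounds.
KEY_BYTES = {"int": 4, "bigint": 8, "guid": 16}

def index_key_gib(key_type, rows):
    """Lower bound, in GiB, on the memory needed for the key values alone."""
    return KEY_BYTES[key_type] * rows / 2**30

ROWS = 2_000_000_000  # hypothetical pageviews table

for key_type in KEY_BYTES:
    print(f"{key_type}: {index_key_gib(key_type, ROWS):.1f} GiB")
```

Remember that in engines which store the primary key in every secondary index, the key size is paid once per index, not once per table.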
Use natural keys when they can be trusted. Some sources of natural keys can't be trusted. Years ago, the Social Security Administration used to occasionally mess up and assign the same SSN to two different people. They've probably fixed that by now.
You can probably trust VINs for vehicles, and ISBNs for books (but not for pamphlets, which may not have an ISBN).
If you use natural keys, the natural key will determine the datatype.
If you can't trust any natural keys, create a synthetic key. I prefer integers for this purpose. Leave enough room for reasonable expansion.
I usually go with a GUID column primary key for all tables (rowguid in MSSQL). What could be natural keys I make unique constraints. A typical example would be a product identification number that the user has to make up and ensure is unique. If I need a sequence, like in an invoice, I build a table to keep a lastnumber and a stored procedure to ensure serialized access. Or a Sequence in Oracle :-) I hate the "social security number" example for natural keys, as that number will not always be available in a registration process, resulting in a need for a scheme to generate dummy numbers.
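The "lastnumber table" idea can be sketched like this. It is not the stored procedure described above: SQLite has no stored procedures, so this assumes a Python function wrapping one write transaction plays that role. The table and function names are made up for illustration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so the
# transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE invoice_counter (lastnumber INTEGER NOT NULL)")
conn.execute("INSERT INTO invoice_counter (lastnumber) VALUES (0)")

def next_invoice_number():
    """Increment and return the counter inside one write transaction.

    BEGIN IMMEDIATE takes SQLite's write lock up front, so concurrent
    writers on other connections to the same database are serialized.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("UPDATE invoice_counter SET lastnumber = lastnumber + 1")
        (number,) = conn.execute(
            "SELECT lastnumber FROM invoice_counter"
        ).fetchone()
        conn.execute("COMMIT")
        return number
    except Exception:
        conn.execute("ROLLBACK")
        raise

numbers = [next_invoice_number() for _ in range(3)]
print(numbers)
```

Unlike an identity column, a counter handled this way leaves no gaps when the surrounding transaction rolls back, which is usually the reason for doing it with invoices.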
I usually always use an integer, but here's an interesting perspective.
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
Whenever possible, try to use a primary key that is a natural key. For instance, if I had a table where I logged one record every day, the logdate would be a good primary key. Otherwise, if there is no natural key, just use int. If you think you will use more than 2 billion rows, use a bigint. Some people like to use GUIDs, which work well, as they are unique, and you will never run out of space. However, they are needlessly long, and hard to type in if you are just doing ad hoc queries.

SQL Server normalization tactic: varchar vs int Identity

I'm just wondering what the optimal solution is here.
Say I have a normalized database. The primary key of the whole system is a varchar. What I'm wondering is should I relate this varchar to an int for normalization or leave it? It's simpler to leave it as a varchar, but an int might be more efficient.
For instance I can have
People
======================
name varchar(10)
DoB DateTime
Height int
Phone_Number
======================
name varchar(10)
number varchar(15)
Or I could have
People
======================
id int Identity
name varchar(10)
DoB DateTime
Height int
Phone_Number
======================
id int
number varchar(15)
Add several other one-to-many relationships of course.
What do you all think? Which is better and why?
I believe that the majority of people who have developed any significant sized real world database applications will tell you that surrogate keys are the only realistic solution.
I know the academic community will disagree but that is the difference between theoretical purity and practicality.
Any reasonable sized query that has to do joins between tables that use non-surrogate keys where some tables have composite primary keys quickly becomes unmaintainable.
Can you really use names as primary keys? Isn't there a high risk of several people with the same name?
If you really are so lucky that your name attribute can be used as primary key, then - by all means - use that. Often, though, you will have to make something up, like a customer_id, etc.
And finally: "NAME" is a reserved word in at least one DBMS, so consider using something else, e.g. fullname.
Using any kind of non-synthetic data (i.e. anything from the user, as opposed to generated by the application) as a PK is problematic; you have to worry about culture/localization differences, case sensitivity (and other issues depending on DB collation), can result in data problems if/when that user-entered data ever changes, etc.
Using non-user-generated data (Sequential GUIDs (or non-sequential if your DB doesn't support them or you don't care about page splits) or identity ints (if you don't need GUIDs)) is much easier and much safer.
Regarding duplicate data: I don't see how using non-synthetic keys protects you from that. You still have issues where the user enters "Bob Smith" instead of "Bob K. Smith" or "Smith, Bob" or "bob smith" etc. Duplication management is necessary (and pretty much identical) regardless of whether your key is synthetic or non-synthetic, and non-synthetic keys have a host of other potential issues that synthetic keys neatly avoid.
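A small sketch of that point, using SQLite from Python with the name itself as the primary key. It assumes SQLite's default (case-sensitive, binary) collation; the table name is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Natural-key design: the name itself is the primary key.
conn.execute("CREATE TABLE people (name TEXT PRIMARY KEY)")

variants = ["Bob Smith", "Bob K. Smith", "Smith, Bob", "bob smith"]
for name in variants:
    conn.execute("INSERT INTO people (name) VALUES (?)", (name,))

stored = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(stored)
```

All four spellings are accepted as distinct keys. A case-insensitive collation would catch "bob smith" but still not the other two, so the duplicate handling has to live somewhere else either way.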
Many projects don't need to worry about that (tightly constrained collation choices avoid many of them, for example) but in general I prefer synthetic keys. This is not to say you can't be successful with organic keys, clearly you can, but for many projects they're not the better choice.
I think if your VARCHAR was larger you would notice you're duplicating quite a bit of data throughout the database. Whereas if you went with a numeric ID column, you're not duplicating nearly the same amount of data when adding foreign key columns to other tables.
Moreover, textual data is a royal pain in terms of comparisons, your life is much easier when you're doing WHERE id = user_id versus WHERE name LIKE inputname (or something similar).
If the "name" field really is appropriate as a primary key, then do it. The database will not get more normalized by creating a surrogate key in that case. You will get some duplicate strings for foreign keys, but that is not a normalization issue, since the FK constraint guarantees integrity on strings just as it would on surrogate keys.
However you are not explaining what the "name" is. In practice it is very seldom that a string is appropriate as a primary key. If it is the name of a person, it won't work as a PK, since more than one person can have the same name, people can change names and so on.
One thing that others don't seem to have mentioned is that joins on int fields tend to perform better than joins on varchar fields.
And I definitely would always use a surrogate key over using names (of people or businesses) because they are never unique over time. In our database, for instance, we have 164 names with over 100 instances of the same name. This clearly shows the dangers of considering using name as a key field.
The original question is not one of normalization. If you have a normalized database, as you stated, then you do not need to change it for normalization reasons.
There are really two issues in your question. The first is whether ints or varchars are preferable for use as primary keys and foreign keys. The second is whether you can use the natural keys given in the problem definition, or whether you should generate a synthetic key (surrogate key) to take the place of the natural key.
ints are a little more concise than varchars, and a little more efficient for such things as index processing. But the difference is not overwhelming. You should probably not make your decision on this basis alone.
The question of whether the natural key provided really works as a natural key or not is much more significant. The problem of duplicates in a "name" column is not the only problem. There is also the problem of what happens when a person changes her name. This problem probably doesn't surface in the example you've given, but it does surface in lots of other database applications. An example would be the transcript over four years of all the courses taken by a student. A woman might get married and change her name in the course of four years, and now you're stuck.
You either have to leave the name unchanged, in which case it no longer agrees with the real world, or update it retroactively in all the courses the person took, which makes the database disagree with the printed rosters made at the time.
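To make the name-change case concrete, here is a sketch in SQLite driven from Python that sets the two designs side by side. The table names, the student and the courses are all made up, and ON UPDATE CASCADE is just one assumed way of doing the retroactive update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    -- Natural-key design: child rows store the name itself.
    CREATE TABLE student_nk (name TEXT PRIMARY KEY);
    CREATE TABLE course_nk (
        student_name TEXT REFERENCES student_nk(name) ON UPDATE CASCADE,
        course TEXT
    );
    -- Surrogate-key design: child rows store only the id.
    CREATE TABLE student_sk (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course_sk (
        student_id INTEGER REFERENCES student_sk(id),
        course TEXT
    );
    INSERT INTO student_nk VALUES ('Jane Doe');
    INSERT INTO student_sk VALUES (1, 'Jane Doe');
    """
)
for course in ["Algebra", "Physics", "History"]:
    conn.execute("INSERT INTO course_nk VALUES ('Jane Doe', ?)", (course,))
    conn.execute("INSERT INTO course_sk VALUES (1, ?)", (course,))

# The student changes her name.
conn.execute("UPDATE student_nk SET name = 'Jane Roe' WHERE name = 'Jane Doe'")
conn.execute("UPDATE student_sk SET name = 'Jane Roe' WHERE id = 1")

# Natural key: every course row was rewritten by the cascade.
nk_rewritten = conn.execute(
    "SELECT COUNT(*) FROM course_nk WHERE student_name = 'Jane Roe'"
).fetchone()[0]
# Surrogate key: the course rows still hold the same id as before.
sk_ids = [row[0] for row in conn.execute("SELECT student_id FROM course_sk")]
print(nk_rewritten, sk_ids)
```

With the natural key, all the old course rows now carry the new name, which is exactly the disagreement with the printed rosters described above. With the surrogate key only the one student row changed, and keeping the old name around becomes a separate, explicit decision.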
If you do decide on a synthetic key, you now have to decide whether or not the application is going to reveal the value of the synthetic key to the user community. That's another whole can of worms, and beyond the scope of this discussion.