Related
I have a (dummy) table structure as follows:
ticket
id: int(11) PK
name: varchar(255)
status: ?????????
The question is, what data type should I use for status? Here are my options, as I see them:
varchar representing the status - BAD because there's no integrity
enum representing the status - BAD because to change the value, I'd have to alter the table, and then any code with dropdowns for the values, etc etc etc
int FK to a status table - GOOD because it's dynamic, BAD because it's harder to inspect by sight (which may be useful)
varchar FK to a status table - GOOD because it's dynamic, and visible on inspection. BAD because the keys are meaningful, which is generally frowned upon. Interestingly, in this case it's entirely possible for the status table to have just 1 column, making it a glorified enum
Have I got an accurate read of the situation? Is having a meaningful key really that bad? Because while it does give me goosebumps, I don't have any reason for it doing so...
Update:
For option 4, the proposed structure would be status: char(4) FK, to a status table. So,
OPEN => "Open"
CLOS => "Closed"
"PEND" => "Pending Authorization"
"PROG" => "In Progress
What's the disadvantage in this case ? The only benefit I can see of using int over char in this case is slight performance.
I would go with number 4, but I'd use a char(x) column. If you're worried about performance, a char(4) takes up as much space (and, or so one would think, disk i/o, bandwidth, and processing time) as an int, which also takes 4 bytes to store. If you're really worried about performance, make it a char(2) or even char(1).
Don't think of it as "meaningful data", think of it as an abbreviation of the natural key. Yes, the data has meaning, but as you've noticed that can be a good thing when working with the data--it means you don't always have to join (even if to a trivially small table) to extract meaning from the database. And of course the foreign key constraint ensures that the data is valid, since it must be in the lookup table. (This can be done with CHECK constraints as well, but Lookup tables are generally easier to manage and maintain over time.)
The downside is that you can get caught up with trying to find meaning. char(1) has a strong appeal, but if you get to ten or more values, it can get hard to come up with good meaningful values. Less of a problem with char(4), but still a possible issue. Another downside: if the data is likely to change, then yes, your meaningful data ("PEND" = "Pending Authorization") can lose its meaning ("PEND" = "Forward to home office for initial approval"). That's a poor example; if codes like that do change, you're probably much better off refactoring your system to reflect the change in business rules. I guess my point should be, if it's a user-entered lookup value, surrogate keys (integers) will be your friend, but if they're internally defined and maintained you should definitely consider more human-friendly values. That, or you'll need post-em notes on your monitor to remind you what the heck Status = 31 is supposed to mean. (I've got three on mine, and the stickum wears out every few months. Talk about cost to maintain...)
Go with number 3. Create a view that join's in the status value if you want something inspectable.
I would use an INT, and create a foreign key relationship to the status table. An INT should definitely be safe for an enumerated status column.
May I recommend you go with a statusID field instead, and have a separate table mapping the ID to a varchar?
EDIT: I guess that's exactly what you outlined in point 3. I think that is the best option.
I'm assuming that your database has a front end of some description, and that regular users are not exposed to the status code.
So, your convenience is only for programmers and DBAs - important people, but I wouldn't optimize my design for them.
Stronger - I would be very careful of using "meaningful" abbreviations - the most egregious data foul-up I've ever seen happened when a developer was cleansing some data, and interpreted the "meaningful" key incorrectly; turns out that "PROG" does not mean "programmed", but "in progress".
Go with option 3.
I've been working with a lot of databases recently that require a lot of statuses AND I've got a few notes that might be worth adding to the conversation.
INT: One thing I found is that if an application has a lot of tracking going on, the number of reference tables can quickly get unwieldy and, as you've mentioned, make inspecting the database at a glance impractical. (Which, for some of my clients, has mattered much more than the scant milliseconds it's saved in processing time.)
VARCHAR: Terrible idea for programming, but it's important to consider if a given status is actually going to be used by the code, or just human eyes. For the latter, you get unlimited range and don't have to maintain any relationships.
CHAR(4): Using a descriptive char column can actually be a very good approach. I'd typically only consider it if the value range were going to be low and obvious, but only because I consider this a nonstandard approach (risking confusion to new devs). Realistically, you could use a CHAR value as a foreign key just the same as an INT, gain legibility and maintain performance parity.
The one thing you couldn't do that I'd miss is mathematical operations (like "<" and ">").
INT Range: A hybrid strategy I've tried out is to use INT, but adding a degree of semantics to the numbers. So, for instance,
1-10 being for initial stages,
11-20 being in progress, and
21-30 being the final stages.
60-69 for errors, rejections
The problem here is that if you discover you need more numbers, you're SOL, since the next range is already taken. So, what I ended up doing was (sort of) mimicking HTTP responses:
100-199 being for initial stages,
200-299 being in progress, and
300-399 being the final stages.
500-599 for errors, rejections
I prefer this to simple INT, and while it can be less descriptive than CHAR, it can also be less ambiguous. Whereas "PROG" could mean a number of things, good, bad or benign, if I can see something is in the 500 range, I may not known what the problem is, I will be able to tell you there is a problem.
Creating a separate table with status is a good idea when you want to show the list of the status in the HTML form. You can show the verbose description from the lookup table and it will help the user to choose status if the requirements are like that.
From the development perspective, I would like to go integer as a primary key. You can optimize it by using small/tiny integer if you know it will not exceed the limit.
If you use abbreviation as a foreign key then you have to think every time to make it unique all the time as #Philip Kelley had mentioned it as a downside of it.
Lastly, you can declare the table type MYISAM if you like.
Update:
Reflecting #Philip Kelley opinion, if there are too many status, then it's better to use integer as foreign key. If there are only couple of status, then may be use abbr as a foreign key.
I know this is subjective, but I'd like to know peoples opinions and hopefully some best practices that I can apply when designing sql server table structures.
I personally feel that keying a table on a fixed (max) length varchar is a no-no, because it means having to also propogate the same fixed length across any other tables that use this as a foreign key. Using an int, would avoid having to apply the same length across the board, which is bound to lead to human error, i.e. 1 table has varchar (10), and the other varchar (20).
This sounds like a nightmare to initially setup, plus means future maintaining of the tables is cumbersome too. For example, say the keyed varchar column suddenly becomes 12 chars instead of 10. You now have to go and update all the other tables, which could be a huge task years down the line.
Am I wrong? Have I missed something here? I'd like to know what others think of this and if sticking with int for primary keys is the best way to avoid maintainace nightmares.
When choosing the primary key usualy you also choose the clustered key. Them two are often confused, but you have to understand the difference.
Primary keys are logical business elements. The primary key is used by your application to identify an entity, and the discussion about primary keys is largely wether to use natural keys or surrogate key. The links go into much more detail, but the basic idea is that natural keys are derived from an existing entity property like ssn or phone number, while surrogate keys have no meaning whatsoever with regard to the business entity, like id or rowid and they are usually of type IDENTITY or some sort of uuid. My personal opinion is that surrogate keys are superior to natural keys, and the choice should be always identity values for local only applicaitons, guids for any sort of distributed data. A primary key never changes during the lifetime of the entity.
Clustered keys are the key that defines the physical storage of rows in the table. Most times they overlap with the primary key (the logical entity identifier), but that is not actually enforced nor required. When the two are different it means there is a non-clustered unique index on the table that implements the primary key. Clustered key values can actualy change during the lifetime of the row, resulting in the row being physically moved in the table to a new location. If you have to separate the primary key from the clustered key (and sometimes you do), choosing a good clustered key is significantly harder than choosing a primary key. There are two primary factors that drive your clustered key design:
The prevalent data access pattern.
The storage considerations.
Data Access Pattern. By this I understand the way the table is queried and updated. Remember that clustered keys determine the actual order of the rows in the table. For certain access patterns, some layouts make all the difference in the world in regard to query speed or to update concurency:
current vs. archive data. In many applications the data belonging to the current month is frequently accessed, while the one in the past is seldom accessed. In such cases the table design uses table partitioning by transaction date, often times using a sliding window algorithm. The current month partition is kept on filegroup located a hot fast disk, the archived old data is moved to filegroups hosted on cheaper but slower storage. Obviously in this case the clustered key (date) is not the primary key (transaction id). The separation of the two is driven by the scale requirements, as the query optimizer will be able to detect that the queries are only interested in the current partition and not even look at the historic ones.
FIFO queue style processing. In this case the table has two hot spots: the tail where inserts occur (enqueue), and the head where deletes occur (dequeue). The clustered key has to take this into account and organize the table as to physically separate the tail and head location on disk, in order to allow for concurency between enqueue and dequeue, eg. by using an enqueue order key. In pure queues this clustered key is the only key, since there is no primary key on the table (it contains messages, not entities). But most times the queue is not pure, it also acts as the storage for the entities, and the line between the queue and the table is blured. In this case there is also a primary key, which cannot be the clustered key: entities may be re-enqueued, thus changing the enqueue order clustered key value, but they cannot change the primary key value. Failure to see the separation is the primary reason why user table backed queues are so notoriously hard to get right and riddled with deadlocks: because the enqueue and dequeue occur interleaved trought the table, instead of localized at the tail and the head of the queue.
Correlated processing. When the application is well designed it will partition processing of correlated items between its worker threads. For instance a processor is designed to have 8 worker thread (say to match the 8 CPUs on the server) so the processors partition the data amongst themselves, eg. worker 1 picks up only accounts named A to E, worker 2 F to J etc. In such cases the table should be actually clustered by the account name (or by a composite key that has the leftmost position the first letter of account name), so that workers localize their queries and updates in the table. Such a table would have 8 distinct hot spots, around the area each worker concentrates at the moment, but the important thing is that they don't overlap (no blocking). This kind of design is prevalent on high throughput OLTP designs and in TPCC benchmark loads, where this kind of partitioning also reflects in the memory location of the pages loaded in the buffer pool (NUMA locality), but I digress.
Storage Considerations. The clustered key width has huge repercursions in the storage of the table. For one the key occupies space in every non-leaf page of the b-tree, so a large key will occupy more space. Second, and often more important, is that the clustered key is used as the lookup key by every non-clustred key, so every non-clustered key will have to store the full width of the clustered key for each row. This is what makes large clustered keys like varchar(256) and guids poor choices for clustered index keys.
Also the choice of the key has impact on the clustered index fragmentation, sometimes drastically affecting performance.
These two forces can sometimes be antagonistic, the data access pattern requiring a certain large clustered key which will cause storage problems. In such cases of course a balance is needed, but there is no magic formula. You measure and you test to get to the sweet spot.
So what do we make from all this? Always start with considering clustered key that is also the primary key of the form entity_id IDENTITY(1,1) NOT NULL. Separate the two and organize the table accordingly (eg. partition by date) when appropiate.
I would definitely recommend using an INT NOT NULL IDENTITY(1,1) field in each table as the
primary key.
With an IDENTITY field, you can let the database handle all the details of making sure it's really unique and all, and the INT datatype is just 4 bytes, and fixed, so it's easier and more suited to be used for the primary (and clustering) key in your table.
And you're right - INT is an INT is an INT - it will not change its size of anything, so you won't have to ever go recreate and/or update your foreign key relations.
Using a VARCHAR(10) or (20) just uses up too much space - 10 or 20 bytes instead of 4, and what a lot of folks don't know - the clustering key value will be repeated on every single index entry on every single non-clustered index on the table, so potentially, you're wasting a lot of space (not just on disk - that's cheap - but also in SQL Server's main memory). Also, since it's variable (might be 4, might be 20 chars) it's harder to SQL server to properly maintain a good index structure.
Marc
I'd agree that in general an INT (or identity) field type is the best choice in most "normal" database designs:
it requires no "algorithm" to generate the id/key/value
you have fast(er) joins and the optimizer can work a lot harder over ranges and such under the hood
you're following a defacto standard
That said, you also need to know your data. If you're going to blow through a signed 32-bit int, you need to think about unsigned. If you're going to blow through that, maybe 64-bit ints are what you want. Or maybe you need a UUID/hash to make syncing between database instances/shards easier.
Unfortunately, it depends and YMMV but I'd definitely use an int/identity unless you have a good reason not to.
Like you said, consistency is key. I personally use unsigned ints. You're not going to run out of them unless you are working with ludicrous amounts of data, and you can always know any key column needs to be that type and you never have to go looking for the right value for individual columns.
Based on going through this exercise countless times and then supporting the system with the results, there are some caveats to the blanket statement that INT is always better. In general, unless there is a reason, I would go along with that. However, in the trenches, here are some pros and cons.
INT
Use unless good reason not to do so.
GUID
Uniqueness - One example is the case where there is one way communication between remote pieces of the program and the side that needs to initiate is not the side with the database. In that case, setting a Guid on the remote side is safe where selecting an INT is not.
Uniqueness Again - A more far fetched scenario is a system where multiple customers are coexisting in separate databases and there is migration between customers like similar users using a suite of programs. If that user signs up for another program, their user record can be used there without conflict. Another scenario is if customers acquire entities from each other. If both are on the same system, they will often expect that migration to be easier. Essentially, any frequent migration between customers.
Hard to Use - Even an experienced programmer cannot remember a guid. When troubleshooting, it is often frustrating to have to copy and paste identifiers for queries, especially if the support is being done with a remote access tool. It is much easier to constantly refer to SELECT * FROM Xxx WHERE ID = 7 than SELECT * FROM Xxx WHERE ID = 'DF63F4BD-7DC1-4DEB-959B-4D19012A6306'
Indexing - using a clustered index for a guid field requires constant rearrangement of the data pages and is not as efficient to index as INTs or even short strings. It can kill performance - don't do it.
CHAR
Readability - Although conventional wisdom is that nobody should be in the database, the reality of systems is that people will have access - hopefully personnel from your organization. When those people are not savvy with join syntax, a normalized table with ints or guids is not clear without many other queries. The same normalized table with SOME string keys can be much more usable for troubleshooting. I tend to use this for the type of table where I supply the records at installation time so they do not vary. Things like StatusID on a major table is much more usable for support when the key is 'Closed' or 'Pending' than a digit. Using traditional keys in these areas can turn an easily resolved issue to something that requires developer assistance. Bottlenecks like that are bad even when caused by letting questionable personnel access to the database.
Constrain - Even if you use strings, keep them fixed length, which speeds indexing and add a constraint or foreign key to keep garbage out. Sometimes using this string can allow you to remove the lookup table and maintain the selection as a simple Enum in the code - it is still important to constrain the data going into this field.
For best performance, 99.999% of the time the primary key should be a single integer field.
Unless you require the primary key to be unique across multiple tables in a database or across multiple databases. I am assuming that you are asking about MS SQL-Server because that is how your question was tagged. In which case, consider using the GUID field instead. Though better than a varchar, the GUID field performance is not as good as an integer.
Use INT. Your points are all valid; I would prioritize as:
Ease of using SQL auto increment capabiity - why reinvent the wheel?
Managability - you don't want to have to change the key field.
Performance
Disk Space
1 & 2 require the developer's time/energy/effort. 3 & 4 you can throw hardware at.
If Joe Celko was on here, he would have some harsh words... ;-)
I want to point out that INTs as a hard and fast rule aren't always appropriate. Say you have a vehicle table with all types of cars trucks, etc. Now say you had a VehicleType table. If you wanted to get all trucks you might do this (with an INT identity seed):
SELECT V.Make, V.Model
FROM Vehicle as V
INNER JOIN VehicleType as VT
ON V.VehicleTypeID = VT.VehicleTypeID
WHERE VT.VehicleTypeName = 'Truck'
Now, with a Varchar PK on VehicleType:
SELECT Make, Model
FROM Vehicle
WHERE VehicleTypeName = 'Truck'
The code is a little cleaner and you avoid a join. Perhaps the join isn't the end of the world, but if you only have one tool in your toolbox, you're missing some opportunities for performance gains and cleaner schemas.
Just a thought. :-)
While INT is generally recommended, it really depends on your situation.
If you're concerned with maintainability, then other types are just as feasible. For example, you could use a Guid very effectively as a primary key. There's reasons for not doing this, but consistency is not one of them.
But yes, unless you have a good reason not to, an int is the simplest to use, and the least likely to cause you any problems.
With PostgreSQL I generally use the "Serial" or "BigSerial" 'data type' for generating primary keys. The values are auto incremented and I always find integers to be easy to work with. They are essentially equivalent to a MySQL integer field that is set to "auto_increment".
One should think hard about whether 32-bit range is enough for what you're doing. Twitter's status IDs were 32-bit INTs and they had trouble when they ran out.
Whether to use a BIGINT or a UUID/GUID in that situation is debatable and I'm not a hardcore database guy, but UUIDs can be stored in a fixed-length VARCHAR without worrying that you'll need to change the field size.
We have to keep in mind that the primary key of a table should not have "business logic" and it should be only an identity of the record it belongs. Following this simple rule an int and especially an identity int is a very good solution. By asking about varchar I guess that you mean using for example the "Full Name" as a key to the "people" table. But what if we want to change the name from "George Something" to "George A. Something" ? And what size will the field be ? If we change the size we have to change the size on all foreign tables too. So we should avoid logic on keys. Sometimes we can use the social ID (integer value) as key but I avoid that too. Now if a project has the prospects to scale up you should consider using Guids too (uniqueidentifier SQL type).
Keeping in mind that this is quite old a question, I still want to make the case for using varchar with surrogate keys fur future readers:
An environment with several replicated machines
Scenarios where it is required that the ID of a to be inserted row is known before it is actually inserted (i.e., the client assigns this ID, not the database)
The object part is misleading. My question is not specific to one type of sql.
ATM i am using sqlite but i will be switching to TSQL (It looks to be what my host is offering) and i am rewriting some tables and logic to clean things up.
One pattern i notice is i have a bigint that could possible be one of 2+ keys and sometimes if i need it a bit or byte as an id to what type it is. Two major things that come to mind is
1) If a bigint is signed and i happen to have more then 2^32 PK in a table would bigint still be able to access the keys? I'm thinking since the value will be negative and PKs are always positive? that i will get an error. mistake, i forgot bigint is 2^63, i have nothing to worry about.
2) If i have a bigint that represents the PK of 2 or more tables would this be bad practice? For whatever reason i think there is a better way of doing bigint the_id, byte the_id
1) TSQL Bigint is not limited to 2^32, it is -2^63 to +2^63 -1 - far more than you are likely to use, something like 9.2 Quintillion.
2) Id prefer to not use an ID to represent two different PK's in other tables, but without knowing the problem you are solving it is hard to say whether it is the right decision or the only one you really have.
As a rule of thumb, I always design my columns to hold one piece of data and only one type of data (by type, I don't mean data type although that is generally true as well.)
If nothing else, putting two different IDs in the same column will prevent the use of foreign keys to make sure that your data is accurate and valid.
I inherited a system that stores default values for some fields in some tables in the database. These default values are used in the application to prepopulate control values. So, essentially, every field in every table in the database can potentially have a default value. The previous developer decided to store these values in a single table that had a key/value pair combo. The key represented by the source table + field name (as a varchar) and the default value as a varchar field as well. The Business layer would then cast the varchar field to the appropriate data type.
Somehow, I feel this is brittle. Though the application works as expected, there appears to be a flaw in the design.
Any suggestions on how this requirement could have been handled earlier? Is there anything that can be done now to make it more robust?
EDIT: I should have defined what the term "default" meant. This is NOT related to the default value of a field in the table. Instead, it's a default value that will be used by the application in the front end.
That schema design is fine. I've seen it used in commercial apps and I've also used it in a few apps of my own where the users needed to be able to change the defaults or other parameters around fields in the application (limits, allowable characters etc.) or the application allowed the users to add new fields for use in the app.
Having it in a single table (not separate default tables for each table) protects it from schema changes in the tables it supports. Those schema changes become simple configuration changes in this model.
The single table makes it easy to encapsulate in a Class to serve as the "defaults" configuration object.
Some general advice:
When you inherit a working system and don't understand why something was designed the way it is - the problem is most likely your understanding, not the system. If it isn't broken, do not fix it.
Specific advice on the only improvements I would recommend (if they become necessary):
You can use the new SQLVARIANT field for the value rather than a varchar - it can hold any of the regular data types - you will need to add support for casting them to the correct data type when using the value though.
Refactoring the schema now would be risky and disruptive so I would not recommend it (unless you absolutely need to do that to fix some pressing issue, but from what you say it doesn't look like you do).
Were you doing the design from scratch, I'd recommend one defaults-table per real-table, with a single row recording the defaults with their real column names and types. Having several tiny tables scares some DBAs, but it's not really any substantial performance hit in my experience, and it sure does make the system sounder and more robust as you desire.
If you want to use SQL's own DEFAULT clauses as other answers recommend, be sure to name those explicitly, otherwise altering them when a default changes can be a doozy. Personally, I like to keep the default values separate from the schema's metadata, especially in a system where updating or tweaking a default value is a much more common and should-be-innocuous operation than the momentous undertaking of metadata/schema changes!
A better way to go would be using SQL Server's built-in DEFAULT constraint.
e.g.
CREATE TABLE Orders
(
OrderID int IDENTITY NOT NULL,
OrderDate datetime NULL CONSTRAINT DF_Orders_OrderDate DEFAULT(GETDATE()),
Freight money NULL CONSTRAINT DF_Orders_Freight DEFAULT (0) CHECK(Freight >= 0),
ShipAddress nvarchar (60) NULL DF_Orders_ShipAddress DEFAULT('NO SHIPPING ADDRESS'),
EnteredBy nvarchar (60) NOT NULL DF_Orders_EnteredBy DEFAULT(SUSER_SNAME())
)
If the requirement was that the default selection of a given control be configurable and the "application works as expected" then I don't see a problem. You didn't elaborate on the "flaw" in the design.
If you want (and should!) use default values on the database, I would strongly urge to use the built-in DEFAULT constraint that's available on any field. Only that is really guaranteed to work properly - anything else is a hack solution at best.....
CREATE TABLE
MyTable(ID INT IDENTITY(1,1),
NumericField INT CONSTRAINT DF_MyTable_Numeric DEFAULT(42),
StringID VARCHAR(20) CONSTRAINT DF_MyTable_StringID DEFAULT 'rubbish',
.......)
and so on - you get the idea.
Just learn this mantra: DRY - DON'T REPEAT YOURSELF - don't go out re-inventing stuff that's already there and has been heavily tested and used - just use it.
Marc
I think the real answer here depends heavily on how often these default values change. If default values are set once when the database is designed, then DEFAULT constraints make sense. If some non-technical person needs to change them every couple of months, I really like the design presented.
Where it becomes brittle is when you have a mismatch between the column names or data types and the default values in the Defaults table. If you code a careful interface to manage the Defaults table values, this shouldn't be a problem.
If its a case of UI defaults - the following questions come up.
How 'dynamic' or generic is your schema.? Does the same schema support multiple front-ends - i.e. the same column in the Db-table supports 2 front-ends - each with multiple-defaults?
Do multiple apps use your DB? In that case having the default defined in the DB could still help
Its possible to query the Data-dictionary to get default info for each column.
If a UI field does not have a corresponding db-column, then your current implementation will be justified in such cases
One downside is more code is needed to handle and use this table.
If it was a one-off application and this default 'intelligence' was not leveraged across multiple-apps - thats a consideration
Its more like a 'frameworky' kind of thing to do - though I'd say its quite non-standard, and would be done on the web-layer.
If the table of default values is what irks you, here's some food for thought:
Rather than stick to dogma about varchar(max) or casting strings or key/value tables - a good approach is to ask what is a better solution?
From your description, it seems like this table contains few rows, and has only two columns: key and value.
I should ask - is the data in this table controlled from an administrative UI? Perhaps this is the reason behind the original design decision to make it a table.
If type-safety is an issue, you could consider the existence of a "type" column and analyze how the code would need to be changed.
I wouldn't jump to conclusions about "good" or "bad" until you really analyze WHY the system is implemented this way.
The idea (not necessarily the implementation) makes sense if you want to keep the application defaults separate from the data, allowing different apps to have different defaults.
This is generally a good thing, because many databases inevitably spawn secondary applications (import jobs, if not anything else), where you do NOT want the same defaults (or any defaults at all); and in principle, a defaults table can support this.
What I think makes this implementation less-than-ideal is that while the defaults are MOSTLY data-driven, the calling application either needs its own set of defaults IF the defaults are not specified in the table or terminate.
If the former is employed, this could introduce a number of headaches when you're trying to track down bugs, especially if you don't have good audit tables keeping track of which user/application inserted/updated which rows on which tables.
Disclaimer: I'm generally of the thought that columns ought to be NULLable and w/o defaults, except where it absolutely makes sense from a data point of view (id/primary key, custom timestamp, etc.). If a column should never be NULL introduce a constraint forbidding NULLs, not a concrete default.
When I am creating a new database table, what factors should I take into account for selecting the primary key's data type?
Sorry to do that, but I found that the answers I gave to related questions (you can check this and this) could apply to this one. I reshaped them a little bit...
You will find many posts dealing with this issue, and each choice you'll make has its pros and cons. Arguments for these usually refer to relational database theory and database performance.
On this subject, my point is very simple: surrogate primary keys ALWAYS work, while Natural keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons: field too short, rules change, etc.
To this point, you've guessed here that I am basically a member of the uniqueIdentifier/surrogate primary key team, and even if I appreciate and understand arguments such as the ones presented here, I am still looking for the case where "natural" key is better than surrogate ...
In addition to this, one of the most important but always forgotten arguments in favor of this basic rule is related to code normalization and productivity:
each time I create a table, shall I lose time
identifying its primary key and its physical characteristics (type, size)
remembering these characteristics each time I want to refer to it in my code?
explaining my PK choice to other developers in the team?
My answer is no to all of these questions:
I have no time to lose trying to identify "the best Natural Primary Key" when the surrogate option gives me a bullet-proof solution.
I do not want to remember that the Primary Key of my Table_whatever is a 10 characters long string when I write the code.
I don't want to lose my time negotiating the Natural Key length: "well if You need 10 why don't you take 12 to be on the safe side?". This "on the safe side" argument really annoys me: If you want to stay on the safe side, it means that you are really not far from the unsafe side! Choose surrogate: it's bullet-proof!
So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_MyTable' which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, where a field combination offers a very acceptable Primary Key, I prefer to create this 'id_myManyToManyTable' field being a uniqueIdentifier, just to stick to the rule, and because, finally, it does not hurt.
The major advantage is that you don't have to care anymore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.
And if you still want to have your "Natural Key" somewhere in your table, I advise you to build it following a standard model such as
Tbl_whatever
id_whatever, unique identifier, primary key
code_whatever, whateverTypeYouWant(whateverLengthYouEstimateTheRightOne), indexed
.....
Where id_ is the prefix for primary key, and code_ is used for "natural" indexed field. Some would argue that the code_ field should be set as unique. This is true, and it can be easily managed either through DDL or external code. Note that many "natural" keys are calculated (invoice numbers), so they are already generated through code
I am not sure that my rule is the best one. But it is a very efficient one! If everyone was applying it, we would for example avoid time lost answering to this kind of question!
If using a numeric key, make sure the datatype is giong to be large enough to hold the number of rows you might expect the table to grow to.
If using a guid, does the extra space needed to store the guid need to be considered? Will coding against guid PKs be a pain for developers or users of the application.
If using composite keys, are you sure that the combined columns will always be unique?
I don't really like what they teach in school, that is using a 'natural key' (for example ISBN on a bookdatabase) or even having a primary key made up off 2 or more fields. I would never do that. So here's my little advice:
Always have one dedicated column in every table for your primary key.
They all should have the same colomn name across all tables, i.e. "ID" or "GUID"
Use GUIDs when you can (if you don't need performance), otherwise incrementing INTs
EDIT:
Okay, I think I need to explain my choices a little bit.
Having a dedicated column namend the same across all table for you primary key, just makes your SQL-Statements a lot of easier to construct and easier for someone else (who might not be familiar with your database layout) easier to understand. Especially when you're doing lots of JOINS and things like that. You won't need to look up what's the primary key for a specific table, you already know, because it's the same everywhere.
GUIDs vs. INTs doesn't really matters that much most of the time. Unless you hit the performance cap of GUIDs or doing database merges, you won't have major issues with one or another. BUT there's a reason I prefer GUIDs. The global uniqueness of GUIDs might always come in handy some day. Maybe you don't see a need for it now, but things like, synchronizing parts of the database to a laptop / cell phone or even finding datarecords without needing to know which table they're in, are great examples of the advantages GUIDs can provide. An Integer only identifies a record within the context of one table, whereas a GUID identifies a record everywhere.
In most cases I use an identity int primary key, unless the scenario requires a lot of replication, in which case I may opt for a GUID.
I (almost) never used meaningful keys.
Unless you have an ultra-convenient natural key available, always use a synthetic (a.k.a. surrogate) key of a numeric type. Even if you do have a natural key available, you might want to consider using a synthetic key anyway and placing an additional unique index on your natural key. Consider what happened to higher-ed databases that used social security numbers as PKs when federal law changed, the costs of changing over to synthetic keys were enormous.
Also, I have to disagree with the practice of naming every primary key the same, e.g. "id". This makes queries harder to understand, not easier. Primary keys should be named after the table. For example employee.employee_id, affiliate.affiliate_id, user.user_id, and so on.
Do not use a floating point numeric type, since floating point numbers cannot be properly compared for equality.
Where do you generate it? Incrementing number's don't fit well for keys generated by the client.
Do you want a data-dependent or independent key (sometimes you could use an ID from business data, can't say if this is always useful or not)?
How well can this type be indexed by your DB?
I have used uniqueidentifiers (GUIDs) or incrementing integers so far.
Cheers
Matthias
Numbers that have meaning in the real world are usually a bad idea, because every so often the real world changes the rules about how those numbers are used, in particular to allow duplicates, and then you've got a real mess on your hands.
I'm partial to using an generated integer key. If you expect the database to grow very large, you can go with bigint.
Some people like to use guids. The pro there is that you can merge multiple instances of the database without altering any keys but the con is that performance can be affected.
For a "natural" key, whatever datatype suits the column(s). Artifical (surrogate) keys are usually integers.
It all depends.
a) Are you fine having unique sequential numeric numbers as your primary key? If yes, then selecting UniqueIdentifier as your primary key will suffice.
b) If your business demand is such that you need to have alpha numeric primary key, then you got to go for varchar or nvarchar.
These are the two options I could think of.
A great factor is how much data you're going to store. I work for a web analytics company, and we have LOADS of data. So a GUID primary key on our pageviews table would kill us, due to the size.
A rule of thumb: For high performance, you should be able to store your entire index in memory. Guids could easily break this!
Use natural keys when they can be trusted. Some sources of natural keys can't be trusted. Years ago, the Social Security Administration used to occasionally mess up an assign the same SSN to two different people. Theyv'e probably fixed that by now.
You can probably trust VINs for vehicles, and ISBNs for books (but not for pamphlets, which may not have an ISBN).
If you use natural keys, the natural key will determine the datatype.
If you can't trust any natural keys, create a synthetic key. I prefer integers for this purpose. Leave enough room for reasonable expansion.
I usually go with a GUID column primary key for all tables (rowguid in mssql). What could be natural keys I make unique constraints. A typical example would be a produkt identification number that the user have to make up and ensure that is unique. If I need a sequence, like in a invoice i build a table to keep a lastnumber and a stored procedure to ensure serialized access. Or a Sequence in Oracle :-) I hate the "social security number" sample for natural keys as that number will never be alway awailable in a registration process. Resulting in a need for a scheme to generate dummy numbers.
I usually always use an integer, but here's an interesting perspective.
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
Whenever possible, try to use a primary key that is a natural key. For instance, if I had a table where I logged one record every day, the logdate would be a good primary key. Otherwise, if there is no natural key, just use int. If you think you will use more than 2 billion rows, use a bigint. Some people like to use GUIDs, which works well, as they are unique, and you will never run out of space. However, they are needlessly long, and hard to type in if you are just doing adhoc queries.