Use string as primary key to store words - sql

I am planning to create one huge table to store all words that could possibly exist for personal experimentation (whether part of the official dictionary, urban or else).
Does it make sense to use the word itself as a primary key?
It is 100% certain that words MUST be unique; moreover, they will not change.
The purpose, in the end, is also to use this PK as an FK in related tables to get more information about these words.
I am not too familiar with table scaling, so I wonder if I can get into trouble:
Performance wise
If the table becomes too large and has to be partitioned (?)
If I want to move the database to sqlite to use as embedded data store
Tagging this question with postgres (my current db), but may migrate to sqlite.

I would be surprised if there were enough words that you needed to partition the table. Of course if your "words" are really genetic sequences or something, I might be off there.
In any case, one of the primary purposes of a primary key is to support foreign key relationships. So, if there is any possibility that another table might refer to this table, then you want to take that into account.
Integer foreign keys are generally preferable, because they are a fixed length -- and that is a little more efficient for indexes. In addition, four-byte integers are probably smaller than the average word length, so they save on storage of the foreign key as well.
That would be balanced against an additional 4 bytes in the words table itself. On balance, I usually add synthetic primary keys.
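To make that concrete, here is a minimal sketch of what I mean (Postgres syntax; the table and column names are only illustrative): a synthetic integer key for joins, with the word itself kept unique as an alternate key.
-- surrogate integer key for foreign keys; the word stays unique
CREATE TABLE words (
    word_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    word    text NOT NULL UNIQUE
);
-- related tables reference the compact integer key
CREATE TABLE definitions (
    definition_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    word_id       bigint NOT NULL REFERENCES words (word_id),
    body          text NOT NULL
);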

Another Idea:
Make 2 columns
Column 1: Initial Letter
Column 2: The Word
(if the word is APPLE: Column 1 --> A, Column 2 --> Apple)
Benefits:
you can query faster for tasks like 'word count from a letter' (like, no. of words from A)
could give you simple rules for making shards (like all words with column1 as 'A', can be assigned to a particular dedicated shard)
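If you went down this road in Postgres, declarative list partitioning could express the same idea (a rough sketch assuming Postgres 11 or later; names are made up):
-- partition/shard by the initial-letter column suggested above
CREATE TABLE words (
    initial char(1) NOT NULL,
    word    text    NOT NULL,
    PRIMARY KEY (initial, word)
) PARTITION BY LIST (initial);
CREATE TABLE words_a PARTITION OF words FOR VALUES IN ('a');
CREATE TABLE words_b PARTITION OF words FOR VALUES IN ('b');
-- ...one partition per letter
-- "word count from a letter"
SELECT count(*) FROM words WHERE initial = 'a';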

Related

SQL Server, does the id change if an element gets deleted?

I wondered if I insert, let's say, 10 entries into a SQL Server table.
If I then delete one of them, will the id/index change correspondingly?
Example:
1 | Simon Cowell | 56 years
2 | Frank Lampard| 24 years
3 | Harry Bennet | 12 years
If I delete #2, will Harry Bennet's index change to 2?
Thanks :)
Since you seem to be conflating "id/index", let's talk a little bit about the primary key and indexes in the context of a relational database.
The "id" or primary key assigned to a row in a SQL database is the unique identifier for that row. It can consist of one or more columns. (When more than one column is involved it is known as a "composite" or "multi-part" key.) The primary key should really do nothing more than be a unique handle for addressing a row: the primary key should not contain any information about the entity represented by the row, especially if that info has the potential to be mutable; an example would be a part number that has a suffix that stands for the type of metal the part is made from; if that metal can possibly change from titanium to unobtainium, say, that part number would make a bad choice as a primary key; it would be better to have another column to store the type of metal than to make the metal-type suffix part of the primary key. "Meaningful" primary keys might have made some sense in legacy non-relational databases but in a relational database they are to be avoided.
When seeking to enforce the uniqueness of a primary key, a database engine can make use of an index so it can rapidly test whether the key value exists. It could use a binary search to find the value, avoiding the need to scan the actual data "brute force", row by row, looking for the value. But the index that is used behind the scenes by the engine to assist it with the primary key housekeeping is not the same as the primary key itself.
If you have a simple sequential integer as your primary key, there's an effectively unlimited supply of them, so there is no need to reuse an integer that becomes available because the row to which it was assigned has been deleted. So the relational database engine won't automatically attempt to reuse it, and it won't by any means change the primary key values that have been assigned to all other rows in the table when "gaps" in the number sequence are created by a deletion. Many other rows in other tables could be referencing those values, and having them be mutable would create either chaos or a huge inefficiency.
Hashing algorithms are another very efficient way a database engine can quickly test for the existence of a key value. It computes the location in the hashed-file where the key would be if it did exist, and then looks there for it. The rows are stored in no particular order, so such schemes are optimized for instant finding of records in a large table, not for culling records that have a common characteristic, such as all customers in zipcode 10023.
No. You can set up triggers or logic to do it if you want; however, it will not automatically do this.
No it will not change automatically
No, it won't. And hopefully that's the answer you're hoping for. For any auto-generated identifiers (such as IDENTITY columns), you should, so far as possible, ignore the data type and treat it as an opaque "blob" of identity information.
It gets assigned during insert, and you can use it for cross-referencing purposes, but the fact that it's numeric is not something you should use or rely upon. It's just a stable identifier for the row.
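A quick throwaway experiment shows this (SQL Server syntax; the table is invented):
-- surviving rows keep their original identity values after a delete
CREATE TABLE Person (
    Id   int IDENTITY(1,1) PRIMARY KEY,
    Name varchar(50) NOT NULL
);
INSERT INTO Person (Name) VALUES ('Simon Cowell'), ('Frank Lampard'), ('Harry Bennet');
DELETE FROM Person WHERE Id = 2;
SELECT Id, Name FROM Person;
-- returns rows 1 and 3; the gap at 2 stays, and Harry Bennet keeps Id 3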

Does a foreign key reduce redundancy if all the columns are referenced?

I've read that one of the benefits of normalization is to reduce redundancy in the DB. But I'm wondering: what if you end up referencing all the columns in the target table?
For example, if I have a Video table that references a Genre table, the Genre table might very likely have a single column with a dozen fairly static values like 'Horror' 'Sci-Fi' 'Romance' etc.
In a case like this, does it save any space to separate the two, or is the only benefit making it so you can update all referencing rows from one place?
Right, space saving is ONE of the benefits, not the only one.
In the case you mentioned, no, you'll save no space if you use that one column as the PK which is fine.
You could abstract that table with an autonumber/sequence and use that as the PK, and make the current column a candidate key (so it stays unique).
But leaving your design exactly as you've outlined, the benefit is in consistency. You'll have only those 12 values... you'll not accidentally enter a value for "Horrer" or "PSY-Fi"
Saving space is one benefit to separating the 2 tables. Like it was said before, putting a Genre_ID in place of an actual value such as "Horror" or "Adventure" will save space.
In my opinion, the better part of doing this is to enforce integrity. If you put the text values in the Video table, what prevents you from changing the value accidentally? Now some rows may have "Adventure" or "Action/Adventure" and so on. By having 2 tables and referencing with a foreign key, you're going to have better control over what values can be a genre.
In summary, don't worry about the fact that you reference all the columns, especially if a table has very few columns. If you decide to add an ID field, or just keep the 1 column table as a list of "acceptable values", your goal should be to enforce integrity first, and save space or I/O costs second.
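As a sketch of that "list of acceptable values" approach (names are only illustrative), the genre name itself can serve as the key and the foreign key does the policing:
-- the FK rejects anything that is not a known genre
CREATE TABLE Genre (
    GenreName varchar(20) PRIMARY KEY
);
CREATE TABLE Video (
    VideoId   int PRIMARY KEY,
    Title     varchar(200) NOT NULL,
    GenreName varchar(20)  NOT NULL REFERENCES Genre (GenreName)
);
-- this insert fails, which is exactly the point
INSERT INTO Video (VideoId, Title, GenreName) VALUES (1, 'Alien', 'Horrer');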
I would use surrogate keys (Autonumber, Identity, etc) and use that for the foreign key join instead of the actual value.
The idea is more about data quality than reducing space.
In most DBs an INT will be smaller than a VARCHAR2(20).
Yes, it will save space if you have a surrogate key (int) which you use in the video table instead of the varchar(20) or whatever the genre would be.
But you've hit the problem yourself there:
single column with a dozen fairly static values like 'Horror' 'Sci-Fi' 'Romance' etc.
With surrogate keys and normalized tables, you only have "Horror" stored once in database, but its ID number is stored in several places (a simple number is smaller than the text most of the time, and does save space). Not only does it increase the maintainability of the database, but it does indeed save raw space.
What happens if you want to ensure that your rows in the Video table have valid/predetermined entries for Genre? If you don't have a foreign key constraint you would need an enum for that column in the Video table and then you would have to change the schema every time you add a new Genre instead of just adding a new row to a Genre table.
In cases like that, your key values plus their indexes can be considerably larger than the data itself. Another model of doing simple codes like that is to have a table of codes and then an insert and update check constraint to validate them. That also avoids a join in order to get the genre data out. Which way you do it is kind of a toss up and would depend on what your application queries tend to be.
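A simplified version of that check-constraint idea (here the allowed codes are inlined rather than read from a codes table, purely to keep the sketch short; names are invented):
-- no join needed to read the genre back, and garbage values are rejected
CREATE TABLE Video (
    VideoId int PRIMARY KEY,
    Title   varchar(200) NOT NULL,
    Genre   varchar(20)  NOT NULL CHECK (Genre IN ('Horror', 'Sci-Fi', 'Romance'))
);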
Data modification anomalies
What if you add a new genre?
Is Sci-Fi the same as SciFi?
Is Sci-Fi the same as Sci-fi?
It gets worse if you add another table, say, "Books", that has the same genres.
Normalization has nothing to do with saving space. It's about eliminating potential anomalies that can occur as a result of certain kinds of redundancy. Since normalization defines the logical level only it would be quite possible for a normalized database to be physically larger or physically smaller than a denormalized or un-normalized one.
It is true that normalization generally makes designs that ought to translate efficiently into storage - but that's really down to the features of the DBMS rather than something implicit in normalization.
You will also save because 'Horror' takes 12 bytes in Unicode, while GenreId can be a single byte or char(1).

When to use an auto-incremented primary key and when not to?

I'm trying to figure out the "best practices" for deciding whether or not to add an auto-incrementing integer as the primary key to a table.
Let's say I have a table containing data about the chemical elements. The atomic number of each element is unique and will never change. So rather than adding an auto-incrementing integer column, it would probably make more sense to just use the atomic number, correct?
Would the same be true if I had a table of books? Should I use the ISBN or an auto-incrementing integer for the primary key? Or a table of employees containing each person's SSN?
There are a lot of questions already answered on Stack Overflow that can help with yours. See here, here, here and here.
The term you should be looking for: surrogate keys.
Hope it helps.
This is a highly debated question, with lots of emotion on both sides.
In my humble opinion, if there's a good, useable natural key available -- like an ISBN -- I use it. I'm going to store it on the database anyway. Yes, a natural key is usually bigger than an integer auto-increment key, but I think this issue is overblown. Disk space is cheap today. I'd worry more about it taking longer to process. If you were talking about an 80 byte text field as a primary key, I'd say no. But if you're thinking of using a 10-byte ISBN instead of an 8-byte big integer, I can't imagine that brings much of a performance penalty.
Sometimes there's a performance advantage to natural keys. Suppose, for example, I want to find how many copies of a given book have been sold. I don't care about any of the data from the Book master record. If the primary key is ISBN, I could simply write "select count(*) from sale where isbn='143573338X'". If I used an autoincrement key, I would have to do a join to look up the isbn, and the query becomes more complex and slower, like "select count(*) from book join sale using (bookid) where isbn='143573338X'". (And I can assure you that, as this particular ISBN is for my book, the number of sale records is very small, so doing the join and reading one extra record is a big percentage difference!)
Another advantage of natural keys is that when you have to work on the database and you look at records that refer back to this table by key, it's easy to see what record they're referring to.
On the other hand, if there is no good, obvious natural key, don't try to cobble together a crazy one. I've seen people try to make a natural key by concatenating the first 6 letters of the customer's first name, his year of birth, and his zip code, and then pray that that will be unique. That sort of silliness is just making trouble for yourself. Often people end up tacking on a sequence number to ensure it's unique anyway, and at that point, why bother? Why not just use the sequence number by itself as the key?
You've got the idea right there.
Auto-increment should be used as a unique key when no unique key already exists for the items you are modelling. So for elements you could use the atomic number, or for books the ISBN.
But if people are posting messages on a message board then these need a unique ID, but don't contain one naturally so we assign the next number from a list.
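For the elements case above, that might look like this (a small sketch; column names are made up):
-- the atomic number is the natural, immutable key
CREATE TABLE element (
    atomic_number smallint    PRIMARY KEY,
    symbol        varchar(3)  NOT NULL UNIQUE,
    name          varchar(20) NOT NULL UNIQUE
);
INSERT INTO element (atomic_number, symbol, name) VALUES (79, 'Au', 'Gold');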
It makes sense to use natural keys where possible; just remember to make the field the primary key and ensure that it is indexed for performance.
With regards to using ISBNs and SSNs, you really have to think about how many rows in other tables are going to reference these through foreign keys, because those ids will take up much more space than an integer and thus may lead to wasted disk space and possibly worse join performance.
The main problem that I have seen with the auto-incrementing integer approach is when you export your data to bring into another db instance, or even in an archive and restore operation. Because the integer has no relation to the data that it references, there is no way to determine if you have duplicates when restoring or adding data to an existing database. If you want no relationship between the data contained in the row and the PK, I would just use a guid. Not very user friendly to look at, but it solves the above problem.
I'm trying to figure out the "best practices" for deciding whether or not to add an auto-incrementing integer as the primary key to a table.
Use it as a unique identifier with a dataset where the PKey is not part of user managed data.
Let's say I have a table containing data about the chemical elements. The atomic number of each element is unique and will never change. So rather than adding an auto-incrementing integer column, it would probably make more sense to just use the atomic number, correct?
Yes.
Would the same be true if I had a table of books? Should I use the ISBN or an auto-incrementing integer for the primary key? Or a table of employees containing each person's SSN?
ISBNs/SS#s are assigned by third parties, and because of their large storage size they would be a highly inefficient way to uniquely identify a row. Remember, PKeys are useful when you join tables. Why use a large data format like an ISBN, which is numerous textual characters, as the unique identifier when a small and compact format like an integer is available?
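Put another way, you can still record and enforce the ISBN without making it the key (a hedged sketch; names are invented):
-- compact integer key used for joins; the ISBN stays unique as an alternate key
CREATE TABLE book (
    book_id int PRIMARY KEY,
    isbn    char(13)     NOT NULL UNIQUE,
    title   varchar(200) NOT NULL
);
CREATE TABLE sale (
    sale_id int PRIMARY KEY,
    book_id int NOT NULL REFERENCES book (book_id)
);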
Old topic I know, but one other thing to consider is that given that most RDBMSes lay out blocks on disk using the PK, using an auto-incrementing PK will simply massively increase your contention. This may not be an issue for your baby database you're mucking around with, but believe me it can cause massive performance issues at the bigger end of town.
If you must use an auto-incrementing ID, maybe consider using it as part of a PK. Tack it on the end to maintain uniqueness.....
Also, it is best to exhaust all possibilities for natural PKs before jumping to a surrogate. People are generally lazy with this.
One other thing I haven't seen mentioned is that you're potentially exposing statistics you may not want to. For example, when this question was written, it was (give or take) the 2,186,260th one posted on Stack Overflow, which I can tell by literally just looking at its ID in the URL.

Why is the Primary Key often an integer in a Relational Database Management System?

In most scenarios when developing a database design, it's habitual to make the primary key an integer type used as the unique identifier for the table. Why not use a string or a float for primary keys? Does this affect the accessibility of values, or in plain words retrieval speed? Are there any specific reasons?
An integer will use less disk space than a string, thus giving you a smaller index file to search through. This is important for large tables where you want to have as much of the index as possible cached in RAM.
Also, they can be autoincremented so you don't need to write your own routines to generate keys.
You often want to have a technical key (also called a surrogate key), a key that is only used to identify the row and not used for anything else. Most data may change sooner or later for reasons you can't control and you don't want to update it everywhere. Even such seemingly static data as a nation-assigned personal id number can change (if you get a new identity) or there may be laws prohibiting their use. A key generated by you, however, is in your own control. For such surrogate keys it's useful to have a small key that is easily generated.
As for "floats as primary keys": Don't do this. A primary key should uniquely identify a row. Floats have no equality relation, which means you cannot safely compare two float values for equality. This is an inherent shortcoming of floating-point values. If you need decimals, use a fixed-point number type instead.
The primary key is supposed to be an index that can provide a unique way to access a specific row in a table. Primary keys can be most data types (in practical applications, float/double won't work too well), and primary keys can also be compound keys (comprised of several columns.)
If you carefully examine the data in the table, you might be able to find a data item that will be unique for every row in the table, thereby eliminating the requirement that you fabricate a key like the autoincrement integer that you find in some schemas.
If you're in a manufacturing environment it might be an alphanumeric field like part number or assembly identifier. Retail or warehousing applications might have a stock number or combination of stock number/shipment/manufacturer.
Generally, if some data in your table is supposed to be a unique identifier, it will probably serve well as a primary key for your table.
Using data that exists in the table already completely eliminates the requirement to "make up" a value (such as the autoincrement column) and use it as the primary key. This saves space since it's one less column in the table and one less index on the table.
Yes, in my experience integer keys are almost always faster, since it's more efficient for the database engine to compare integers than strings. Depending on the "uniqueness" of the data (technically called cardinality, http://en.wikipedia.org/wiki/Cardinality_(SQL_statements)), the effect of character vs. integer keys may be nominal.
Character keys may degrade performance depending on the number of characters that the database needs to compare to determine whether keys are equal or not. In the pathological case, imagine a hundred-character field where keys differ only at the right-hand end: one row has 100 A's, and we need to compare it to a key with 99 A's and a B as the last character. Conceptually, databases compare character fields just like strcmp() (strncmp() if you prefer), from left to right.
good luck!
The only reason is for performance.
A logical database design should specify which "real" columns are unique, but when the logical design is transformed into a physical design, it is traditional to not use any of these "natural" keys as the primary key; instead, a meaningless integer column is added for this purpose - called a "surrogate key".
Normally the designer will add further unique constraints for the "real" uniqueness business rules as specified in the logical design.
This is because most DBMS's have trouble updating a primary key (e.g. due to performance issues when cascading the update to child tables). Some DBMS's might not be able to support non-integer primary keys at all.
Some side notes:
There's no theoretical reason why primary keys should be immutable.
This has nothing to do with normalization, which happens in the logical model (which should never have surrogate keys).
Also, note that the idea of a "primary" key is not a relational concept - it is simply a way of denoting the "preferred" uniqueness constraint, perhaps for referential integrity - but there's nothing in the RM that says you must use the same key for each child table.
I've created natural keys as "Primary Keys" in Oracle databases before, albeit rarely. I've even had them used for foreign key constraints. Admittedly, they were either immutable, or I hand-wrote the update-cascade code; and I had trouble with one front-end application where the PK included a date column.
Bottom line: there is no theoretical requirement for surrogate keys, but they're much more practical than the alternative.
I suspect that it is because we can auto-increment integer values so it's easy to generate a new unique key for every insert.
Many common ORM (Object Relational Mapping) tools either force you to use or at least recommend using an integer as the primary key.
An integer primary key also saves space compared to a string, and in some cases it is faster as well. Sequences or auto-increment fields make integer primary key generation easy, at least if you do not work with distributed databases.
These are some of the main reasons why I think we have integers/numbers as primary keys.
1. Primary keys should be able to uniquely identify your row and should be immutable. One of the problems with using real attributes (name etc.) is that they could change over time. Maintaining referential integrity in such a case would be very difficult, as the change would need to cascade to all the child records.
2. The size of the table, and thereby of the index, would be smaller if we use a number as the key for the table.
3. Since these are automatically generated using a sequence, we can be sure that the values will be unique under all circumstances.
Check this.
http://forums.oracle.com/forums/thread.jspa?messageID=3916511&#3916511

Should I design a table with a primary key of varchar or int?

I know this is subjective, but I'd like to know people's opinions and hopefully some best practices that I can apply when designing SQL Server table structures.
I personally feel that keying a table on a fixed (max) length varchar is a no-no, because it means having to propagate that same fixed length across any other tables that use it as a foreign key. Using an int avoids having to apply the same length across the board, which is bound to lead to human error, e.g. one table has varchar(10) and the other varchar(20).
This sounds like a nightmare to initially set up, and it means future maintenance of the tables is cumbersome too. For example, say the keyed varchar column suddenly becomes 12 chars instead of 10. You now have to go and update all the other tables, which could be a huge task years down the line.
Am I wrong? Have I missed something here? I'd like to know what others think of this and whether sticking with int for primary keys is the best way to avoid maintenance nightmares.
When choosing the primary key you usually also choose the clustered key. The two are often confused, but you have to understand the difference.
Primary keys are logical business elements. The primary key is used by your application to identify an entity, and the discussion about primary keys is largely whether to use natural keys or surrogate keys. The links go into much more detail, but the basic idea is that natural keys are derived from an existing entity property like ssn or phone number, while surrogate keys have no meaning whatsoever with regard to the business entity, like id or rowid, and they are usually of type IDENTITY or some sort of uuid. My personal opinion is that surrogate keys are superior to natural keys, and that the choice should always be identity values for local-only applications and guids for any sort of distributed data. A primary key never changes during the lifetime of the entity.
The clustered key is the key that defines the physical storage of rows in the table. Most times it overlaps with the primary key (the logical entity identifier), but that is not actually enforced nor required. When the two are different it means there is a non-clustered unique index on the table that implements the primary key. Clustered key values can actually change during the lifetime of the row, resulting in the row being physically moved in the table to a new location. If you have to separate the primary key from the clustered key (and sometimes you do), choosing a good clustered key is significantly harder than choosing a primary key. There are two primary factors that drive your clustered key design:
The prevalent data access pattern.
The storage considerations.
Data Access Pattern. By this I understand the way the table is queried and updated. Remember that clustered keys determine the actual order of the rows in the table. For certain access patterns, some layouts make all the difference in the world in regard to query speed or to update concurrency:
current vs. archive data. In many applications the data belonging to the current month is frequently accessed, while older data is seldom accessed. In such cases the table design uses table partitioning by transaction date, oftentimes using a sliding window algorithm. The current month partition is kept on a filegroup located on hot, fast storage, while the archived old data is moved to filegroups hosted on cheaper but slower storage. Obviously in this case the clustered key (date) is not the primary key (transaction id). The separation of the two is driven by the scale requirements, as the query optimizer will be able to detect that the queries are only interested in the current partition and not even look at the historic ones.
FIFO queue style processing. In this case the table has two hot spots: the tail where inserts occur (enqueue), and the head where deletes occur (dequeue). The clustered key has to take this into account and organize the table so as to physically separate the tail and head locations on disk, in order to allow for concurrency between enqueue and dequeue, e.g. by using an enqueue order key. In pure queues this clustered key is the only key, since there is no primary key on the table (it contains messages, not entities). But most times the queue is not pure, it also acts as the storage for the entities, and the line between the queue and the table is blurred. In this case there is also a primary key, which cannot be the clustered key: entities may be re-enqueued, thus changing the enqueue order clustered key value, but they cannot change the primary key value. Failure to see the separation is the primary reason why user-table-backed queues are so notoriously hard to get right and riddled with deadlocks: because the enqueue and dequeue occur interleaved throughout the table, instead of localized at the tail and the head of the queue.
Correlated processing. When the application is well designed it will partition processing of correlated items between its worker threads. For instance a processor is designed to have 8 worker threads (say to match the 8 CPUs on the server), so the workers partition the data amongst themselves, e.g. worker 1 picks up only accounts named A to E, worker 2 F to J, etc. In such cases the table should actually be clustered by the account name (or by a composite key whose leftmost position is the first letter of the account name), so that workers localize their queries and updates in the table. Such a table would have 8 distinct hot spots, around the area each worker concentrates on at the moment, but the important thing is that they don't overlap (no blocking). This kind of design is prevalent in high-throughput OLTP designs and in TPC-C benchmark loads, where this kind of partitioning also reflects in the memory location of the pages loaded in the buffer pool (NUMA locality), but I digress.
Storage Considerations. The clustered key width has huge repercussions on the storage of the table. For one, the key occupies space in every non-leaf page of the b-tree, so a large key will occupy more space. Second, and often more important, the clustered key is used as the lookup key by every non-clustered index, so every non-clustered index will have to store the full width of the clustered key for each row. This is what makes large clustered keys like varchar(256) and guids poor choices for clustered index keys.
Also the choice of the key has impact on the clustered index fragmentation, sometimes drastically affecting performance.
These two forces can sometimes be antagonistic, the data access pattern requiring a certain large clustered key which will cause storage problems. In such cases of course a balance is needed, but there is no magic formula. You measure and you test to get to the sweet spot.
So what do we make of all this? Always start by considering a clustered key that is also the primary key, of the form entity_id IDENTITY(1,1) NOT NULL. Separate the two and organize the table accordingly (e.g. partition by date) when appropriate.
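In SQL Server terms, separating the two might look roughly like this (a sketch only; names are invented):
-- the surrogate primary key is enforced with a non-clustered index...
CREATE TABLE dbo.Txn (
    txn_id   int IDENTITY(1,1) NOT NULL,
    txn_date datetime2         NOT NULL,
    amount   decimal(18,2)     NOT NULL,
    CONSTRAINT PK_Txn PRIMARY KEY NONCLUSTERED (txn_id)
);
-- ...while the clustered key follows the date-driven access pattern
CREATE CLUSTERED INDEX CIX_Txn_Date ON dbo.Txn (txn_date, txn_id);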
I would definitely recommend using an INT NOT NULL IDENTITY(1,1) field in each table as the primary key.
With an IDENTITY field, you can let the database handle all the details of making sure it's really unique and all, and the INT datatype is just 4 bytes, and fixed, so it's easier and more suited to be used for the primary (and clustering) key in your table.
And you're right - an INT is an INT is an INT - it will not change its size or anything, so you won't ever have to go recreate and/or update your foreign key relations.
Using a VARCHAR(10) or (20) just uses up too much space - 10 or 20 bytes instead of 4, and what a lot of folks don't know - the clustering key value will be repeated on every single index entry of every single non-clustered index on the table, so potentially you're wasting a lot of space (not just on disk - that's cheap - but also in SQL Server's main memory). Also, since it's variable (might be 4, might be 20 chars), it's harder for SQL Server to properly maintain a good index structure.
Marc
I'd agree that in general an INT (or identity) field type is the best choice in most "normal" database designs:
it requires no "algorithm" to generate the id/key/value
you have fast(er) joins and the optimizer can work a lot harder over ranges and such under the hood
you're following a defacto standard
That said, you also need to know your data. If you're going to blow through a signed 32-bit int, you need to think about unsigned. If you're going to blow through that, maybe 64-bit ints are what you want. Or maybe you need a UUID/hash to make syncing between database instances/shards easier.
Unfortunately, it depends and YMMV but I'd definitely use an int/identity unless you have a good reason not to.
Like you said, consistency is key. I personally use unsigned ints. You're not going to run out of them unless you are working with ludicrous amounts of data, and you can always know any key column needs to be that type and you never have to go looking for the right value for individual columns.
Based on going through this exercise countless times and then supporting the system with the results, there are some caveats to the blanket statement that INT is always better. In general, unless there is a reason, I would go along with that. However, in the trenches, here are some pros and cons.
INT
Use unless good reason not to do so.
GUID
Uniqueness - One example is the case where there is one-way communication between remote pieces of the program, and the side that needs to initiate is not the side with the database. In that case, setting a Guid on the remote side is safe where selecting an INT is not.
Uniqueness Again - A more far-fetched scenario is a system where multiple customers coexist in separate databases and there is migration between customers, like similar users using a suite of programs. If such a user signs up for another program, their user record can be used there without conflict. Another scenario is if customers acquire entities from each other. If both are on the same system, they will often expect that migration to be easier. Essentially, any frequent migration between customers.
Hard to Use - Even an experienced programmer cannot remember a guid. When troubleshooting, it is often frustrating to have to copy and paste identifiers for queries, especially if the support is being done with a remote access tool. It is much easier to constantly refer to SELECT * FROM Xxx WHERE ID = 7 than SELECT * FROM Xxx WHERE ID = 'DF63F4BD-7DC1-4DEB-959B-4D19012A6306'
Indexing - using a clustered index for a guid field requires constant rearrangement of the data pages and is not as efficient to index as INTs or even short strings. It can kill performance - don't do it.
CHAR
Readability - Although conventional wisdom is that nobody should be in the database, the reality of systems is that people will have access - hopefully personnel from your organization. When those people are not savvy with join syntax, a normalized table with ints or guids is not clear without many other queries. The same normalized table with SOME string keys can be much more usable for troubleshooting. I tend to use this for the type of table where I supply the records at installation time so they do not vary. Something like a StatusID on a major table is much more usable for support when the key is 'Closed' or 'Pending' than a digit. Using traditional keys in these areas can turn an easily resolved issue into something that requires developer assistance. Bottlenecks like that are bad even when caused by letting questionable personnel access the database.
Constrain - Even if you use strings, keep them fixed length, which speeds indexing and add a constraint or foreign key to keep garbage out. Sometimes using this string can allow you to remove the lookup table and maintain the selection as a simple Enum in the code - it is still important to constrain the data going into this field.
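A small sketch of that style (SQL Server flavored; names are invented): a short fixed-length, human-readable key, constrained so garbage stays out.
-- the status code is readable in support queries and still constrained
CREATE TABLE TicketStatus (
    StatusCode char(10) PRIMARY KEY
);
INSERT INTO TicketStatus (StatusCode) VALUES ('Closed'), ('Pending');
CREATE TABLE SupportTicket (
    TicketId   int IDENTITY(1,1) PRIMARY KEY,
    StatusCode char(10) NOT NULL REFERENCES TicketStatus (StatusCode)
);
-- support queries stay readable without a join
SELECT TicketId FROM SupportTicket WHERE StatusCode = 'Pending';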
For best performance, 99.999% of the time the primary key should be a single integer field.
Unless you require the primary key to be unique across multiple tables in a database or across multiple databases. I am assuming that you are asking about MS SQL-Server because that is how your question was tagged. In which case, consider using the GUID field instead. Though better than a varchar, the GUID field performance is not as good as an integer.
Use INT. Your points are all valid; I would prioritize as:
Ease of using SQL auto-increment capability - why reinvent the wheel?
Manageability - you don't want to have to change the key field.
Performance
Disk Space
1 & 2 require the developer's time/energy/effort. 3 & 4 you can throw hardware at.
If Joe Celko was on here, he would have some harsh words... ;-)
I want to point out that INTs as a hard and fast rule aren't always appropriate. Say you have a vehicle table with all types of cars, trucks, etc. Now say you had a VehicleType table. If you wanted to get all trucks you might do this (with an INT identity seed):
SELECT V.Make, V.Model
FROM Vehicle as V
INNER JOIN VehicleType as VT
ON V.VehicleTypeID = VT.VehicleTypeID
WHERE VT.VehicleTypeName = 'Truck'
Now, with a Varchar PK on VehicleType:
SELECT Make, Model
FROM Vehicle
WHERE VehicleTypeName = 'Truck'
The code is a little cleaner and you avoid a join. Perhaps the join isn't the end of the world, but if you only have one tool in your toolbox, you're missing some opportunities for performance gains and cleaner schemas.
Just a thought. :-)
While INT is generally recommended, it really depends on your situation.
If you're concerned with maintainability, then other types are just as feasible. For example, you could use a Guid very effectively as a primary key. There's reasons for not doing this, but consistency is not one of them.
But yes, unless you have a good reason not to, an int is the simplest to use, and the least likely to cause you any problems.
With PostgreSQL I generally use the "Serial" or "BigSerial" 'data type' for generating primary keys. The values are auto incremented and I always find integers to be easy to work with. They are essentially equivalent to a MySQL integer field that is set to "auto_increment".
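For example (a minimal Postgres sketch; names are invented):
-- bigserial creates and wires up the backing sequence automatically
CREATE TABLE account (
    account_id bigserial PRIMARY KEY,
    email      text NOT NULL UNIQUE
);
INSERT INTO account (email) VALUES ('someone@example.com') RETURNING account_id;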
One should think hard about whether a 32-bit range is enough for what you're doing. Twitter's status IDs were 32-bit INTs, and they had trouble when they ran out.
Whether to use a BIGINT or a UUID/GUID in that situation is debatable and I'm not a hardcore database guy, but UUIDs can be stored in a fixed-length VARCHAR without worrying that you'll need to change the field size.
We have to keep in mind that the primary key of a table should not have "business logic"; it should only be an identity for the record it belongs to. Following this simple rule, an int, and especially an identity int, is a very good solution. By asking about varchar I guess that you mean using, for example, the "Full Name" as a key to the "people" table. But what if we want to change the name from "George Something" to "George A. Something"? And what size will the field be? If we change the size, we have to change the size on all foreign tables too. So we should avoid logic in keys. Sometimes we can use the social ID (an integer value) as the key, but I avoid that too. Now, if a project has prospects to scale up, you should consider using Guids too (the uniqueidentifier SQL type).
Keeping in mind that this is quite an old question, I still want to make the case for using varchar surrogate keys, for future readers:
An environment with several replicated machines
Scenarios where the ID of a row must be known before it is actually inserted (i.e., the client assigns this ID, not the database)
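A minimal sketch of that second scenario (names are invented; the identifier here happens to be a UUID rendered as text, generated by the client before the insert):
-- the client supplies the key, so it is known before the row exists
CREATE TABLE document (
    document_id varchar(36) PRIMARY KEY,
    title       text NOT NULL
);
INSERT INTO document (document_id, title)
VALUES ('df63f4bd-7dc1-4deb-959b-4d19012a6306', 'Draft');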