Setting up a basic table, is integer auto-increment primary key a standard?

I've got a web app, I have a concept of users, which will probably go into a user table like:
table: user
username (varchar 32) | email (varchar 64) | fav_color | ...
I'd like username and email to be unique, meaning I can't allow users to have the same username, or the same email. I see example tables of this sort always introduce an integer auto-increment primary key.
Not sure why this is done, is it to somehow speed up queries by foreign keys later on? For example, let's say I have another table like:
table: grades
username (foreign key?) | grade
Is it inefficient to be using the username as a foreign key? I want to do queries like:
SELECT * FROM grades WHERE username = 'john'
so I guess it'd be faster to do an integer lookup for the database instead?:
SELECT * FROM grades WHERE fk_user_id = 20431
Thanks

What you are asking is somewhat a design decision based on the judgment of the individual data modeler. Personally, in this case, I would include the auto-incremented integer primary key. It is unusual to be able to guarantee that username (and even more so e-mail address) will be unchanging. However, you can design your software so that the same integer primary key always refers to the same user, regardless of what else may change about that user record.
What would help the performance of username lookups would be a UNIQUE constraint on username with an index that corresponds to it. If you really want e-mail addresses to be unique (mostly a business requirement decision), you could also put a UNIQUE constraint on e-mail address. Foreign keys are ignored by MySQL's MyISAM engine, which was the default before MySQL 5.5 (unfortunately), so I won't bother going into the benefits there from a data modeling perspective. InnoDB, the default since 5.5, does enforce them.
Edit:
I guess I will go into the benefits for foreign keys if they are being enforced now. Yes, there are provisions for updating all the data that depends on a foreign key (such as ON UPDATE CASCADE). However, they are often poorly understood and viewed as difficult to maintain. It is usually a better practice to have a foreign key refer to something unchanging, hence your integer primary key.
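A minimal sketch of the layout described above, using SQLite through Python's sqlite3 module so that it runs as-is. The table and column names (users, grades, user_id) and the sample values are assumptions for illustration; PostgreSQL or MySQL would spell the auto-increment column differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to
conn.executescript("""
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,          -- surrogate key, assigned by the database
    username VARCHAR(32) NOT NULL UNIQUE,  -- UNIQUE also provides the lookup index
    email    VARCHAR(64) NOT NULL UNIQUE
);
CREATE TABLE grades (
    user_id INTEGER NOT NULL REFERENCES users(id),
    grade   INTEGER NOT NULL
);
""")

user_id = conn.execute(
    "INSERT INTO users (username, email) VALUES (?, ?)",
    ("john", "john@example.com"),
).lastrowid
conn.execute("INSERT INTO grades (user_id, grade) VALUES (?, ?)", (user_id, 90))

# Renaming the user touches one row; grades still points at the same id.
conn.execute("UPDATE users SET username = 'johnny' WHERE id = ?", (user_id,))

rows = conn.execute(
    "SELECT g.grade FROM grades g JOIN users u ON u.id = g.user_id "
    "WHERE u.username = ?",
    ("johnny",),
).fetchall()
```

The username lookup still goes through the unique index on users, and the join to grades is on the integer key.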

My advice, after years of db-building:
only use chars as PK when they don't represent anything in the real world.
The real world is a chaotic place, and as soon as you use PKs from it, you're on a slippery slope.
Just trust me.
(and there's a speed gain too).
regards,
//t

It may not necessarily be a "standard" per se, but it is quick, easy, convenient and generally resistant to business key changes.
See also: Pros and cons of autoincrement keys on every table

Integers as the primary key will make your life a lot easier down the road as your application evolves. Use an index on your username and/or email for the query optimization.

I like integer keys because:
make joins faster
smaller and faster indexes
never need to change (your username and email field values may need to change)

Indexes on integer columns perform faster than indexes on large character values. Having the primary key on the narrow identity column is usually the best solution.

Using real-world data as foreign keys is very problematic, and "inefficient" mostly because every change to such a value has to be propagated to all referencing rows to keep referential integrity intact. You think that user names and emails are unique and will never change? You are almost certainly wrong. Read an earlier question on natural keys.
Integer auto-increment primary keys will be faster, but that's not why they're used. They're used because they work. Use them.

Related

Identifying primary key for a vote table

I am working on a voting table design using Postgres 9.5 (but maybe the question itself is applicable to sql in general). My vote table should be like:
-------------------------
object | user | timestamp
-------------------------
Where object and user are foreign keys to the ids corresponding to their own tables. I have a problem identifying what actually should be a primary key.
I thought at first of making a primary_key(object, user), but I use Django as a server and it just doesn't support multicolumn primary keys. I am also not sure about the performance, since I may access a row using only one of those 2 columns (i.e. object or user). The advantage of this idea is that it works automatically as a unique key, since the same user shouldn't vote twice for the same object, and I don't need any additional indexes.
The other idea is to introduce an auto or serial id field. I really can't think of any advantage of this approach, especially when the table gets bigger. I would also need to introduce at least a unique_key(object, user), which adds to the computational complexity and data storage. I'm not even sure about the performance when I select using one of the 2 columns; maybe I also need 2 additional indexes for object and user to accelerate the select operation, since I need this heavily.
Is there something I am missing here? Or is there a better idea?
Django itself recognises that the "natural primary key" in this case is not supported. So your gut feeling is right, but Django doesn't support it.
https://code.djangoproject.com/wiki/MultipleColumnPrimaryKeys
Relational database designs use a set of columns as the primary key
for a table. When this set includes more than one column, it is known
as a “composite” or “compound” primary key. (For more on the
terminology, here is an article discussing database keys).
Currently Django models only support a single column in this set,
denying many designs where the natural primary key of a table is
multiple columns. Django currently can't work with these schemas; they
must instead introduce a redundant single-column key (a “surrogate”
key), forcing applications to make arbitrary and otherwise-unnecessary
choices about which key to use for the table in any given instance.
I'm less familiar with Django personally. One option might be to form an extra column as a primary key by concatenating object and user.
Remember that there is nothing special about a primary key. You can always add a UNIQUE KEY on the pair of columns and make them both NOT NULL.
You might find this example useful.
https://thecuriousfrequency.wordpress.com/2014/11/11/make-primary-key-with-two-or-more-field-in-django/
The correct solution would be to have a PRIMARY KEY (object, user) and an additional index on user. The primary key index can also be used for searches for object alone.
From a database point of view, your problem is that you use an inadequate middleware if it does not support composite primary keys.
You'll probably have to introduce an artificial primary key constraint and in addition have a unique constraint on (object, user) and an index on user, but your gut feelings that that is not the best solution from a database perspective are absolutely true.
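The fallback described above could look roughly like this. It is a sketch in SQLite via Python's sqlite3 rather than Postgres 9.5 or a Django model, and the names (vote, object_id, user_id, voted_at, vote_user_idx) are assumptions. The foreign keys to the object and user tables are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vote (
    id        INTEGER PRIMARY KEY,   -- artificial key, present only for the ORM
    object_id INTEGER NOT NULL,
    user_id   INTEGER NOT NULL,
    voted_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (object_id, user_id)      -- one vote per user per object
);
CREATE INDEX vote_user_idx ON vote (user_id);  -- for lookups by user alone
""")

conn.execute("INSERT INTO vote (object_id, user_id) VALUES (1, 42)")
try:
    # Same user voting on the same object again must be refused.
    conn.execute("INSERT INTO vote (object_id, user_id) VALUES (1, 42)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

vote_count = conn.execute("SELECT COUNT(*) FROM vote").fetchone()[0]
```

The index behind the UNIQUE constraint leads with object_id, so it can serve searches by object alone; only user_id needs its own index.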

Using string as PK vs using GUID or int Id with Unique Constraint for Names

Hi I was wondering what is the best practice for tables in which you have a record that must be unique. I've seen the two ways of doing that: use a Primary Key or add a Unique constraint to the column.
If you use a primary key, is it bad practice to have a primary key such as "UserName" that is varchar(*)? Does that impact performance enough that it is problematic? Is it best to use an integer id with a unique constraint on the username?
I see some other factors that may impact choosing a column as PK vs Unique. Am I right about these?
PK
- Column should be one that doesn't ever need to be changed
Unique
- Column could be changed later on
Having a primary key on the UserName is not the best idea, but it isn't as bad for performance as you might think.
The best idea would be using an ID (INT) as PRIMARY KEY and the UserName as UNIQUE.
Usernames change over time, which is why they are a bad candidate for a PK, especially since it is extremely likely you have child records associated with the username. For instance, suppose my username included some variation of my real name. If I then got divorced and returned to my maiden name, the last thing I want to do is be reminded of that SOB I was married to, and so I change my username. Do you really want to change the 2 million posts I've made in the last ten years as well? I didn't think so.
Yes, string comparisons are slower, but this may or may not be an issue depending on the overall amount of action the database will get. For a small company database with fewer than 200 users, it is probably not a problem; for an Internet site with millions of users, it is much more likely to be a problem.
It may or may not be a good idea as others have already discussed. Let me just add one more detail...
I see some other factors that may impact choosing a column as PK vs Unique.
The main difference is usually related to clustering. Most DBMSes (that support clustering) automatically use the PK as the clustering index. For example, MySQL/InnoDB always clusters data and you can't even turn it off, while MS SQL Server clusters by default (you have to use special syntax to turn it off).
Should you choose to use clustering (or are forced by your DBMS), having fewer indexes is usually better (e.g. see "Disadvantages of clustering" in this article), even when leading to "fatter" foreign keys.

Use email address as primary key?

Is email address a bad candidate for primary key when compared to auto-incrementing numbers?
Our web application needs the email address to be unique in the system. So, I thought of using email address as primary key. However my colleague suggests that string comparison will be slower than integer comparison.
Is it a valid reason to not use email as primary key?
We are using PostgreSQL.
String comparison is slower than int comparison. However, this does not matter if you simply retrieve a user from the database using the e-mail address. It does matter if you have complex queries with multiple joins.
If you store information about users in multiple tables, the foreign keys to the users table will be the e-mail address. That means that you store the e-mail address multiple times.
I will also point out that email is a bad choice to make a unique field; there are people and even small businesses that share an email address. And like phone numbers, emails can get re-used. Jsmith@somecompany.com can easily belong to John Smith one year and Julia Smith two years later.
Another problem with emails is that they change frequently. If you are joining to other tables with that as the key, then you will have to update the other tables as well which can be quite a performance hit when an entire client company changes their emails (which I have seen happen.)
The primary key should be unique and constant.
Email addresses change like the seasons. Useful as a secondary key for lookup, but a poor choice for the primary key.
Disadvantages of using an email address as a primary key:
1. Slower when doing joins.
2. Any other record with a posted foreign key now has a larger value, taking up more disk space. (Given the cost of disk space today, this is probably a trivial issue, except to the extent that the record now takes longer to read. See #1.)
3. An email address could change, which forces all records using this as a foreign key to be updated. As email addresses don't change all that often, the performance problem is probably minor. The bigger problem is that you have to make sure to provide for it. If you have to write the code, this is more work and introduces the possibility of bugs. If your database engine supports "on update cascade", it's a minor issue.
Advantages of using email address as a primary key:
1. You may be able to completely eliminate some joins. If all you need from the "master record" is the email address, then with an abstract integer key you would have to do a join to retrieve it. If the key is the email address, then you already have it and the join is unnecessary. Whether this helps you any depends on how often this situation comes up.
2. When you are doing ad hoc queries, it's easy for a human being to see what master record is being referenced. This can be a big help when trying to track down data problems.
3. You almost certainly will need an index on the email address anyway, so making it the primary key eliminates one index, thus improving the performance of inserts as they now have only one index to update instead of two.
In my humble opinion, it's not a slam-dunk either way. I tend to prefer to use natural keys when a practical one is available because they're just easier to work with, and the disadvantages tend to not really matter much in most cases.
No one seems to have mentioned a possible problem that email addresses could be considered private. If the email address is the primary key, a profile page URL most likely will look something like ..../Users/my@email.com. What if you don't want to expose the user's email address? You'd have to find some other way of identifying the user, possibly by a unique integer value to make URLs like ..../Users/1. Then you'd end up with a unique integer value after all.
It is pretty bad. Assume some e-mail provider goes out of business. Users will then want to change their e-mail. If you have used e-mail as primary key, all foreign keys for users will duplicate that e-mail, making it pretty damn hard to change ...
... and I haven't even started talking about performance considerations.
I don't know if that might be an issue in your setup, but depending on your RDBMS the values of a column might be case sensitive. PostgreSQL docs say: "If you declare a column as UNIQUE or PRIMARY KEY, the implicitly generated index is case-sensitive". In other words, if you accept user input for a search in a table with email as primary key, and the user provides "John@Doe.com", you won't find "john@doe.com".
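One common way around the case-sensitivity problem is a unique index on the lower-cased value. The sketch below uses SQLite via Python's sqlite3; the table and index names are assumptions. PostgreSQL also supports expression indexes written as `(lower(email))`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL
);
-- Uniqueness is enforced on the lower-cased value, not on the raw text.
CREATE UNIQUE INDEX users_email_lower_idx ON users (lower(email));
""")

conn.execute("INSERT INTO users (email) VALUES (?)", ("john@doe.com",))
try:
    conn.execute("INSERT INTO users (email) VALUES (?)", ("John@Doe.com",))
    mixed_case_duplicate_rejected = False
except sqlite3.IntegrityError:
    mixed_case_duplicate_rejected = True

# Search with the same normalisation the index uses.
found = conn.execute(
    "SELECT id FROM users WHERE lower(email) = lower(?)", ("John@Doe.com",)
).fetchone()
```

Lookups have to apply the same lower() call, otherwise a search for "John@Doe.com" still misses the stored row.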
At the logical level, the email is the natural key.
At the physical level, given you are using a relational database, the natural key doesn't fit well as the primary key. The reason is mainly the performance issues mentioned by others.
For that reason, the design can be adapted. The natural key becomes the alternate key (UNIQUE, NOT NULL), and you use a surrogate/artificial/technical key as the primary key, which can be an auto-increment in your case.
systempuntoout asked,
What if someone wants to change his email address? Are you going to change all the foreign keys too?
That's what cascading is for.
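For readers who have not used it, cascading looks roughly like this. The sketch uses SQLite through Python's sqlite3, with assumed table names (users, orders) and made-up addresses. Note that SQLite only cascades when foreign key enforcement is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # without this SQLite neither enforces nor cascades
conn.executescript("""
CREATE TABLE users (
    email TEXT PRIMARY KEY               -- natural key used as the primary key
);
CREATE TABLE orders (
    id         INTEGER PRIMARY KEY,
    user_email TEXT NOT NULL
        REFERENCES users(email) ON UPDATE CASCADE
);
""")
conn.execute("INSERT INTO users (email) VALUES ('old@example.com')")
conn.execute("INSERT INTO orders (user_email) VALUES ('old@example.com')")

# Changing the key in the parent table rewrites every referencing row.
conn.execute(
    "UPDATE users SET email = 'new@example.com' WHERE email = 'old@example.com'"
)

order_emails = [row[0] for row in conn.execute("SELECT user_email FROM orders")]
```

The database does the propagation, but it still has to rewrite each child row, which is the cost other answers point to when a key changes.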
Another reason to use a numeric surrogate key as the primary key is related to how the indexing works in your platform. In MySQL's InnoDB, for example, every secondary index in a table stores the primary key columns alongside its own columns, so you want the PK to be as small as possible (for speed's and size's sakes). Also related to this, InnoDB is faster when rows are inserted in primary key order, and a string would not help there.
Another thing to take into consideration when using a string as an alternate key, is that using a hash of the actual string that you want might be faster, skipping things like upper and lower cases of some letters. (I actually landed here while looking for a reference to confirm what I just said; still looking...)
Yes, it is better if you use an integer instead. You can also put a unique constraint on your email column,
like this:
CREATE TABLE myTable(
id integer primary key,
email text UNIQUE
);
Yes, it is a bad primary key because your users will want to update their email addresses.
Another reason why an integer primary key is better shows up when you refer to the email address in a different table. If the address itself is the primary key, then you have to use it as the key in the other table too. So you store email addresses multiple times.
I am not too familiar with postgres. Primary Keys is a big topic. I've seen some excellent questions and answers on this site (stackoverflow.com).
I think you may have better performance by having a numeric primary key and use a UNIQUE INDEX on the email column. Emails tend to vary in length and may not be proper for primary key index.
some reading here and here.
Personally, I do not use any real information as the primary key when designing a database, because it is very likely that I will need to alter that information later. The sole reason I provide a primary key is that it is convenient for most SQL operations from the client side, and my choice for it has always been an auto-increment integer.
I know this is a bit of a late entry, but I would like to add that people abandon email accounts, and service providers recover the address, allowing another person to use it.
As @HLGEM pointed out, "Jsmith@somecompany.com can easily belong to John Smith one year and Julia Smith two years later." In this case, should John Smith want your service, you either have to refuse to use his email address or delete all your records pertaining to Julia Smith.
If you have to delete records and they relate to the financial history of the business, then depending on local law you could find yourself in hot water.
So I would never use data like email addresses, number plates, etc. as primary keys, because no matter how unique they seem, they are out of your control and can provide some interesting challenges that you may not have time to deal with.
You may need to consider any applicable data protection legislation. Email is personal information, and if your users are in the EU, for instance, then under GDPR they can ask you to delete their information from your records (remember this applies regardless of which country you are based in).
If you need to keep the record itself in the database for referential integrity or historical reasons such as audit, using a surrogate key would allow you to just NULL all the personal data fields. This obviously isn't as easy if their personal data is the primary key.
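A small sketch of that erasure pattern, again in SQLite via Python's sqlite3. The schema and names (users, invoices, amount) are assumptions for illustration, and whether blanking the fields satisfies a particular regulation is a legal question this code does not answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,   -- surrogate key, carries no personal data
    name  TEXT,
    email TEXT UNIQUE            -- nullable so it can be blanked later
);
CREATE TABLE invoices (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    amount  INTEGER NOT NULL
);
INSERT INTO users (id, name, email) VALUES (1, 'Julia Smith', 'jsmith@example.com');
INSERT INTO invoices (user_id, amount) VALUES (1, 250);
""")

# Erase the personal fields; the row and its key stay in place.
conn.execute("UPDATE users SET name = NULL, email = NULL WHERE id = ?", (1,))

user_row = conn.execute("SELECT id, name, email FROM users").fetchone()
invoice_rows = conn.execute("SELECT user_id, amount FROM invoices").fetchall()
```

The invoice keeps pointing at user 1, so the audit trail survives even though nothing personal is left in the users row.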
Your colleague is right: Use an autoincrementing integer for your primary key.
You can implement the email-uniqueness either at the application level, or you could mark your email address column as unique, and add an index on that column.
Adding the field as unique will cost you string comparison only when inserting into that table, and not when performing joins and foreign key constraint checks.
Of course, you must note that adding any constraints to your application at the database level can cause your app to become inflexible. Always give due consideration before you make any field "unique" or "not null" just because your application needs it to be unique or non-empty.
Use a GUID as a primary key... that way you can generate it from your program when you do an INSERT and you don't need to get a response from the server to find out what the primary key is. It will also be unique across tables and databases and you don't have to worry about what happens if you truncate the table some day and the auto-increment gets reset to 1.
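As a rough illustration of client-side key generation, here is a sketch using Python's uuid module with SQLite, storing the GUID as text. The table layout is an assumption; PostgreSQL has a native uuid column type that would be the more natural choice there.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL UNIQUE)")

# The key is generated by the client, so it is known before the INSERT runs.
new_id = str(uuid.uuid4())
conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (new_id, "a@example.com"))

stored_id = conn.execute(
    "SELECT id FROM users WHERE email = ?", ("a@example.com",)
).fetchone()[0]
```

The trade-off is width: a 36-character text key is much larger than an integer, which matters for the index-size arguments made in other answers.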
You can boost the performance by using an integer primary key.
You should use an integer primary key. If you need the email column to be unique, why don't you simply set a unique index on that column?
If you have a non-int value as the primary key, then insertions and retrievals will tend to be slower on large data.
A primary key should be a static attribute. Since email addresses are not static and can be shared by multiple people, it is not a good idea to use them as the primary key. Moreover, email addresses are strings that are usually longer than the unique id we would otherwise use [len(email_address) > len(unique_id)], so they require more space and, even worse, are stored multiple times as foreign keys. Consequently, performance degrades.
It depends on the table. If the rows in your table represent email addresses, then email is the best ID. If not, then email is not a good ID.
If it's simply a matter of requiring the email to be unique then you can just create a unique index with that column.
Email is a good unique index candidate, but not a good primary key: if it is the primary key, you will not be able to change a contact's email address easily, for example.
I think your join queries will be slower too.
Do not use the email address as the primary key. Keep email unique, but use a user id or username as the primary key.

Relational database design question - Surrogate-key or Natural-key?

Which one is the best practice and Why?
a) Type Table, Surrogate/Artificial Key
Foreign key is from user.type to type.id:
b) Type Table, Natural Key
Foreign key is from user.type to type.typeName:
I believe that in practice, using a natural key is rarely the best option. I would probably go for the surrogate key approach as in your first example.
The following are the main disadvantages of the natural key approach:
You might have an incorrect type name, or you may simply want to rename the type. To edit it, you would have to update all the tables that would be using it as a foreign key.
An index on an int field will be much more compact than one on a varchar field.
In some cases, it might be difficult to have a unique natural key, and this is necessary since it will be used as a primary key. This might not apply in your case.
The first one is more future proof, because it allows you to change the string representing the type without updating the whole user table. In other words you use a surrogate key, an additional immutable identifier introduced for the sake of flexibility.
A good reason to use a surrogate key (instead of a natural key like name) is when the natural key isn't really a good choice in terms of uniqueness. In my lifetime I've known no fewer than 4 "Chris Smith"s. Person names are not unique.
I prefer to use the surrogate key. People often identify and use a natural key, which will be fine for a while, until they decide they want to change the value. Then problems start.
You should probably always use an ID number: that way, if you change the type name, you don't need to update the user table. It also allows you to keep your data size down, as a table full of INTs is much smaller than one full of 45-character varchars.
If typeName is a natural key, then it's probably the preferable option, because it won't require a join to get the value.
You should only really use a surrogate key (id) when the name is likely to change.
Surrogate key for me too, please.
The other might be easier when you need to bang out some code, but it will eventually be harder. Back in the day, my tech boss decided using an email addr as a primary key was a good idea. Needless to say, when people wanted to change their addresses it really sucked.
Use natural keys whenever they work. Names usually don't work. They are too mutable.
If you are inventing your own data, you might as well invent a synthetic key. If you are building a database of data provided by other people or their software, analyze the source data to see how they identify things that need identification.
If they are managing data at all well, they will have natural keys that work for the important stuff. For the unimportant stuff, suit yourself.
Well, I think a surrogate key is helpful when you don't have any uniquely identifying key whose value is meaningful enough to serve as the primary key. Moreover, a surrogate key is easier to implement and has less overhead to maintain.
But on the other hand, a surrogate key sometimes adds extra cost through joining tables.
think about 'User' ... I have
UserId varchar(20), ID int, Name varchar(200)
as the table structure.
Now consider that I want to keep track, in many tables, of who is inserting records. If I use ID as the primary key, then [1,2,3,4,5..] etc. will be stored in the foreign tables, and whenever I need to know who inserted data I have to join the User table, because 1,2,3,4,5,6 is meaningless. But if I use UserId as the primary key, which uniquely identifies the user, then [john, annie, nadia, linda123] etc. will be saved in the foreign tables, which is easily distinguishable and meaningful, so I need not join the User table every time I query.
But mind you, it takes some extra physical space, as the varchar saved in the foreign tables takes extra bytes. And of course indexing is a significant performance issue, where int performs better than varchar.
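The trade-off in this answer can be seen side by side in a small sketch. It uses SQLite via Python's sqlite3, and the audit table names are hypothetical, made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_user (
    id      INTEGER PRIMARY KEY,
    user_id VARCHAR(20) NOT NULL UNIQUE,
    name    VARCHAR(200)
);
-- Audit table keyed by the integer id: compact, but needs a join to be readable.
CREATE TABLE audit_by_id (inserted_by INTEGER NOT NULL REFERENCES app_user(id));
-- Audit table keyed by the string UserId: readable as-is, but wider.
CREATE TABLE audit_by_name (inserted_by VARCHAR(20) NOT NULL REFERENCES app_user(user_id));

INSERT INTO app_user (id, user_id, name) VALUES (1, 'john', 'John');
INSERT INTO audit_by_id VALUES (1);
INSERT INTO audit_by_name VALUES ('john');
""")

# Integer key: the readable name only appears after joining back to app_user.
with_join = conn.execute(
    "SELECT u.user_id FROM audit_by_id a JOIN app_user u ON u.id = a.inserted_by"
).fetchall()

# String key: the audit row already carries the readable value.
without_join = conn.execute("SELECT inserted_by FROM audit_by_name").fetchall()
```

Both queries return the same answer; the difference is one join on every read against extra bytes in every row.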
Surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to the table.
The only requirement for a surrogate primary key is that it is unique for each row in the table.
It is useful because the natural primary key (e.g. Customer Number in a Customer table) can change, and this makes updates more difficult.

why we use an ID column in the table if we have a unique value

I want to ask a small question here, but I really don't know the answer.
I have an accounts table which has
Username | Password
The username is the primary key, so it's unique.
So is it necessary to add an ID column to the table? If yes, what is the benefit of that?
Thanks
Search by a numeric key is slightly faster (varies from one DB to another). Also, if you have a lot of references to the user table, you save some database space by having the numeric ID as the foreign key, as opposed to a string name.
It will make everything else easier, mainly foreign key relationships from other tables. And it allows you to change the username if you want - primary keys are not easy to change.
Faster in indexes
Consumes less disk space (and is again faster) when used as a foreign key
And, as mentioned a number of times, you can change the username without modifying a host of other tables.