Can you use username as unique identifier/primary key - sql

Just a quick question on the best practise for the below case.
I'm developing a website with accounts. The website is set up so that no two accounts can have the same username, i.e. all usernames are unique.
When persisting accounts in a database, is it OK to use the username as a primary key (unique identifier), or are there reasons I should be aware of that would require a separately generated unique ID?

Never use the username as the primary key.
Use surrogate keys (i.e. autogenerated numbers), because:
they are faster and smaller (a key is 4-8 bytes; a username is up to who-knows-how-many bytes)
you only assume right now that usernames will be unique; later you may find that you need non-unique usernames (for example, for deleted users whose transaction history you have to keep), or the requirements will change
users should be able to change their username, in case of an error, a typo, etc.
UPDATE: in distributed systems, use GUIDs.
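To illustrate the point about changing usernames, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (accounts, transactions, account_id) are made up for the example and are not from the question. The username typo is fixed with a single UPDATE, and the transaction history is untouched because it references the surrogate id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE accounts (
        id       INTEGER PRIMARY KEY,   -- surrogate key, generated by the database
        username TEXT NOT NULL UNIQUE   -- still unique, but free to change
    );
    CREATE TABLE transactions (
        id         INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES accounts(id),
        amount     INTEGER NOT NULL
    );
""")

# Create an account with a typo in the username, plus one transaction.
account_id = conn.execute(
    "INSERT INTO accounts (username) VALUES (?)", ("jonh",)
).lastrowid
conn.execute(
    "INSERT INTO transactions (account_id, amount) VALUES (?, ?)",
    (account_id, 250),
)

# Fix the typo: only the accounts row changes, no referencing row is touched.
conn.execute("UPDATE accounts SET username = ? WHERE id = ?", ("john", account_id))

history = conn.execute(
    "SELECT a.username, t.amount "
    "FROM transactions t JOIN accounts a ON a.id = t.account_id"
).fetchall()
print(history)
```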

Related

Why some tables in popular databases don't have primary keys defined?

I have seen tables in SAP database and TFS database (both configuration and collection) that don't have primary keys defined. Why is that?
In TFS, a number of tables have neither a primary key nor foreign keys due to very specific performance constraints. Plus, these databases are not supposed to be updated manually; TFS handles all changes to these tables through its own APIs. It's one of the reasons why Microsoft doesn't support direct querying against these tables.
Another reason in the case of TFS is that its cloud counterpart, Visual Studio Team Services, doesn't store all of its data in SQL Azure, but also in Table Storage, Blob storage or DocumentDB.
Nobody has attempted an answer to this yet, so here goes...
Some tables don't need a PRIMARY KEY, because they are never going to be updated, and might only have a tiny set of data in them (e.g. lookup tables). If these tables have no indexes at all then they are essentially heaps, which isn't always a bad thing.
Why should every table have a PRIMARY KEY defined anyway? If the table has a UNIQUE CLUSTERED INDEX in place then this does just about everything that a PRIMARY KEY does, with the added bonus that you can allow NULL values to be stored. Depending on the implementation (e.g. SQL Server allows only one "unique" NULL value, other RDBMSs allow multiple) this might be a much better match to your application.
For example, let's say you want a table with two columns, account number and account name. Let's assume that you make your PRIMARY KEY account number, because you want to ensure it is unique. Now you want to allow NULL account numbers, because these aren't always supplied at the point where you create an account; you have some weird 2-part process where you create a record with just a name, then backfill the account number. If you stick to the PRIMARY KEY design then you would need to do something like add an IDENTITY column, make this the PRIMARY KEY, then add a UNIQUE CONSTRAINT to prevent identical multiple account numbers.
Now you are left with a surrogate key that is going to be of little use to any queries, so you would probably end up with a performance index anyway, even if you don't care about uniqueness.
If you had no PRIMARY KEY, but instead a UNIQUE CLUSTERED INDEX then you would be able to do this without changing your table, with only one customer ever allowed to have a NULL account number at the same time (if it's SQL Server).
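A rough sketch of the two-step "create the record, backfill the account number later" process, using Python's built-in sqlite3 module. Two caveats: SQLite has no clustered indexes, and it falls in the "other RDBMSs allow multiple" group, so more than one NULL account number is accepted here (SQL Server would allow only one). The index name ux_accounts_number is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        account_number TEXT,            -- nullable: filled in later
        account_name   TEXT NOT NULL
    );
    -- No PRIMARY KEY; uniqueness is enforced by a unique index instead.
    CREATE UNIQUE INDEX ux_accounts_number ON accounts (account_number);
""")

# Step 1: create records with just a name.
conn.execute("INSERT INTO accounts VALUES (NULL, 'Acme')")
conn.execute("INSERT INTO accounts VALUES (NULL, 'Globex')")  # second NULL is accepted by SQLite

# Step 2: backfill the account number.
conn.execute("UPDATE accounts SET account_number = 'A-100' WHERE account_name = 'Acme'")

# Reusing an existing account number is rejected by the unique index.
try:
    conn.execute("UPDATE accounts SET account_number = 'A-100' WHERE account_name = 'Globex'")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_count = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE account_number IS NULL"
).fetchone()[0]
print(duplicate_rejected, null_count)
```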
I designed a database for a customer a couple of years ago that had over 200 tables, and not a single PRIMARY KEY. Although this was more about me "making a point", it's not a stretch to assume that the same is true for other database developers out there.

How to store all user settings in one database with a unique id

I'm making an app where I have a server-side SQL database to store the user settings of all users.
I'm not sure how to make each user unique, so that the database knows who is who.
The database stores this user data in each row: id, email, county, age and gender.
So I'm thinking the best way is to identify each user by their email, which is unique, so that when the settings are updated or output, the SQL knows which row to fetch.
How should I go about this?
And how would I then output the right data to the right user?
An entity in the database should have a primary key. I understand that in your design the id field is going to be the primary key. Usually this is an auto-generated integer. This is called a surrogate key. In this case you need to tell the table that the email field must be unique as well. You can do that by creating a unique index for this field. The unique index will prevent the creation of two different users with the same email. Going with this approach you can query the table checking either for id or for email.
An alternative is to have a natural key. In this case, email would be the primary key of your table, so you wouldn't have the id field. Going with this approach you query the table by email, which is the unique identifier of each user.
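The first (surrogate key) approach might look roughly like this. It is a sketch using Python's built-in sqlite3 module; the table name users, the index name and the sample values are assumptions, while the columns come from the question. The row can be fetched by id or by email, and a second user with the same email is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id     INTEGER PRIMARY KEY,     -- auto-generated surrogate key
        email  TEXT NOT NULL,
        county TEXT,
        age    INTEGER,
        gender TEXT
    );
    CREATE UNIQUE INDEX ux_users_email ON users (email);
""")

conn.execute(
    "INSERT INTO users (email, county, age, gender) VALUES (?, ?, ?, ?)",
    ("ann@example.com", "Kent", 31, "f"),
)

# Look the user up by email (e.g. at login), then by id afterwards.
by_email = conn.execute(
    "SELECT id, county FROM users WHERE email = ?", ("ann@example.com",)
).fetchone()
by_id = conn.execute(
    "SELECT email, county FROM users WHERE id = ?", (by_email[0],)
).fetchone()

# The unique index prevents a second user with the same email.
try:
    conn.execute(
        "INSERT INTO users (email, county, age, gender) VALUES (?, ?, ?, ?)",
        ("ann@example.com", "Essex", 45, "m"),
    )
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(by_email, by_id, duplicate_allowed)
```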

Username or Userid

I've seen many discussions about whether it is better to use userid or username as the primary key for a table. userid would allow the flexibility of changing the username later if desired. It is also a way to implement security. However, username is also a unique identifier.
If I choose userid as my primary key, what is the best way to enforce username to take on a unique value?
If I choose username, what problems should I be aware of?
I would declare UserId as the PRIMARY KEY as there will be other tables referencing this user record through UserId and thereby will be useful to enforce any FOREIGN KEY constraints.
If username needs to be unique, then I would declare it as a NOT NULL column and define a UNIQUE KEY constraint on it. NOT NULL rules out the single NULL value that the UNIQUE KEY constraint would otherwise allow in the column, so this setup on UserName would behave much like a PRIMARY KEY.
This is from my own point of view.
I would rather choose UserID, of data type int (or possibly string), as the primary key of the table, since it never changes. Foreign keys that reference it cause no problems, because it is unchangeable.
The reason I didn't choose Username is that, although it is unique, it can sometimes change. If foreign keys already reference it, the username can't be changed at all until those keys or records are dropped or deleted first.
Semantically, to my ear at least, userid sounds like an artificially created value, possibly an artificial primary key, while username sounds like a natural, user-friendly (component of) a natural primary key. To use either term in the opposite sense is likely to confuse programmers and users occasionally, and possibly create subtle bugs down the road.
what is the best way to enforce username to take on a unique value?
Create a unique index.
If I choose username, what problems should I be aware of?
Will you allow a user to change their user name (as long as it stays unique) whenever they want? If yes, then use a user id that you generate; otherwise use a user name that they choose.
For your first question, a UNIQUE constraint is available in most modern RDBMSs. You can implement it at the application level too.
For your second question, I don't see any obvious problems. UserNames are generally known to the end-users. However, if you have a large set of users and the max length of the username field is large, indexing may not be as efficient as on int-type userID fields.

Setting up a basic table, is integer auto-increment primary key a standard?

I've got a web app, I have a concept of users, which will probably go into a user table like:
table: user
username (varchar 32) | email (varchar 64) | fav_color | ...
I'd like username and email to be unique, meaning I can't allow users to have the same username, or the same email. I see example tables of this sort always introduce an integer auto-increment primary key.
Not sure why this is done, is it to somehow speed up queries by foreign keys later on? For example, let's say I have another table like:
table: grades
username (foreign key?) | grade
Is it inefficient to be using the username as a foreign key? I want to do queries like:
SELECT * FROM grades WHERE username = 'john'
so I guess it'd be faster to do an integer lookup for the database instead?:
SELECT * FROM grades WHERE fk_user_id = 20431
Thanks
What you are asking is somewhat a design decision based on the judgment of the individual data modeler. Personally, in this case, I would include the auto-incremented integer primary key. It is unusual to be able to guarantee that username (and even more so e-mail address) will be unchanging. However, you can design your software so that the same integer primary key always refers to the same user, regardless of what else may change about that user record.
What would help the performance of username lookups would be a UNIQUE constraint on username with an index that corresponds to it. If you really want e-mail addresses to be unique (mostly a business requirement decision), you could also put a UNIQUE constraint on e-mail address. Foreign keys are ignored by MyISAM, the default storage engine in MySQL before version 5.5 (unfortunately), so I won't bother going into the benefits there from a data modeling perspective.
Edit:
I guess I will go into the benefits for foreign keys if they are being enforced now. Yes, there are provisions for updating all the data that depends on a foreign key (such as ON UPDATE CASCADE). However, they are often poorly understood and viewed as difficult to maintain. It is usually a better practice to have a foreign key refer to something unchanging, hence your integer primary key.
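For reference, this is roughly what ON UPDATE CASCADE does when the username itself is the key. It is a sketch using Python's built-in sqlite3 module, with the table named users rather than user and otherwise following the question's layout. Renaming the user rewrites every referencing row in grades, which is exactly the extra work an unchanging integer key avoids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce and cascade
conn.executescript("""
    CREATE TABLE users (
        username TEXT PRIMARY KEY       -- natural key
    );
    CREATE TABLE grades (
        username TEXT NOT NULL
                 REFERENCES users(username) ON UPDATE CASCADE,
        grade    TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO users VALUES ('john')")
conn.execute("INSERT INTO grades VALUES ('john', 'A')")

# Changing the key in the parent table is propagated to the child rows.
conn.execute("UPDATE users SET username = 'johnny' WHERE username = 'john'")

rows = conn.execute("SELECT username, grade FROM grades").fetchall()
print(rows)
```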
My advice, after years of db-building
only use chars as PK when they don't represent anything in the real world.
The real world is a chaotic place, and as soon as you use PKs from it, you're on a slippery slope.
Just trust me.
(and there's a speed gain too).
regards,
//t
It may not necessarily be a "standard" per se, but it is quick, easy, convenient and generally resistant to business key changes.
See also: Pros and cons of autoincrement keys on every table
Integers as the primary key will make your life a lot easier down the road as your application evolves. Use an index on your username and/or email for the query optimization.
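A small sketch of that layout, using Python's built-in sqlite3 module. The column name user_id and the sample data are assumptions for the example. grades references the integer key, and the lookup by username goes through the unique index that the UNIQUE constraint creates on users.username.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE,  -- UNIQUE also gives an index for lookups
        email    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE grades (
        user_id INTEGER NOT NULL REFERENCES users(id),
        grade   TEXT NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO users (id, username, email) VALUES (?, ?, ?)",
    [(1, "john", "john@example.com"), (2, "mary", "mary@example.com")],
)
conn.executemany(
    "INSERT INTO grades (user_id, grade) VALUES (?, ?)",
    [(1, "B"), (1, "A"), (2, "C")],
)

# Find the user via the username index, then join on the integer key.
johns_grades = conn.execute(
    "SELECT g.grade FROM grades g "
    "JOIN users u ON u.id = g.user_id "
    "WHERE u.username = ? ORDER BY g.grade",
    ("john",),
).fetchall()
print(johns_grades)
```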
I like integer keys because:
make joins faster
smaller and faster indexes
never need to change (your username and email field values may need to change)
Indexes on Integer columns perform faster than on large character values. Having the primary key on the narrow identity column is the most optimal solution.
Using real-world data as foreign keys is very problematic, and "inefficient" because they violate referential integrity. You think that user names and emails are unique and will never change? You are almost certainly wrong. Read an earlier question on natural keys
Integer auto-increment primary keys will be faster, but that's not why they're used. They're used because they work. Use them.

Use email address as primary key?

Is email address a bad candidate for primary when compared to auto incrementing numbers?
Our web application needs the email address to be unique in the system. So, I thought of using email address as primary key. However my colleague suggests that string comparison will be slower than integer comparison.
Is it a valid reason to not use email as primary key?
We are using PostgreSQL.
String comparison is slower than int comparison. However, this does not matter if you simply retrieve a user from the database using the e-mail address. It does matter if you have complex queries with multiple joins.
If you store information about users in multiple tables, the foreign keys to the users table will be the e-mail address. That means that you store the e-mail address multiple times.
I will also point out that email is a bad choice for a unique field: there are people and even small businesses that share an email address. And like phone numbers, emails can get re-used. Jsmith@somecompany.com can easily belong to John Smith one year and Julia Smith two years later.
Another problem with emails is that they change frequently. If you are joining to other tables with that as the key, then you will have to update the other tables as well which can be quite a performance hit when an entire client company changes their emails (which I have seen happen.)
the primary key should be unique and constant
email addresses change like the seasons. Useful as a secondary key for lookup, but a poor choice for the primary key.
Disadvantages of using an email address as a primary key:
1. Slower when doing joins.
2. Any other record with a posted foreign key now has a larger value, taking up more disk space. (Given the cost of disk space today, this is probably a trivial issue, except to the extent that the record now takes longer to read. See #1.)
3. An email address could change, which forces all records using this as a foreign key to be updated. As email addresses don't change all that often, the performance problem is probably minor. The bigger problem is that you have to make sure to provide for it. If you have to write the code, this is more work and introduces the possibility of bugs. If your database engine supports "on update cascade", it's a minor issue.
Advantages of using email address as a primary key:
1. You may be able to completely eliminate some joins. If all you need from the "master record" is the email address, then with an abstract integer key you would have to do a join to retrieve it. If the key is the email address, then you already have it and the join is unnecessary. Whether this helps you any depends on how often this situation comes up.
2. When you are doing ad hoc queries, it's easy for a human being to see what master record is being referenced. This can be a big help when trying to track down data problems.
3. You almost certainly will need an index on the email address anyway, so making it the primary key eliminates one index, thus improving the performance of inserts as they now have only one index to update instead of two.
In my humble opinion, it's not a slam-dunk either way. I tend to prefer to use natural keys when a practical one is available because they're just easier to work with, and the disadvantages tend to not really matter much in most cases.
No one seems to have mentioned a possible problem that email addresses could be considered private. If the email address is the primary key, a profile page URL most likely will look something like ..../Users/my@email.com. What if you don't want to expose the user's email address? You'd have to find some other way of identifying the user, possibly by a unique integer value to make URLs like ..../Users/1. Then you'd end up with a unique integer value after all.
It is pretty bad. Assume some e-mail provider goes out of business. Users will then want to change their e-mail. If you have used e-mail as primary key, all foreign keys for users will duplicate that e-mail, making it pretty damn hard to change ...
... and I haven't even started talking about performance considerations.
I don't know if that might be an issue in your setup, but depending on your RDBMS the values of a column might be case sensitive. PostgreSQL docs say: "If you declare a column as UNIQUE or PRIMARY KEY, the implicitly generated index is case-sensitive". In other words, if you accept user input for a search in a table with email as primary key, and the user provides "John@Doe.com", you won't find "john@doe.com".
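One common way around the case problem is a unique index on the lowercased value, combined with lowercasing the search input. The sketch below uses Python's built-in sqlite3 module and assumes a surrogate id with the email in its own column (the names users and ux_users_email_lower are made up). As far as I know, the same CREATE UNIQUE INDEX ... (lower(email)) statement is also accepted by PostgreSQL, but treat that as something to verify against its docs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    );
    -- Uniqueness is checked on the lowercased address, not the raw text.
    CREATE UNIQUE INDEX ux_users_email_lower ON users (lower(email));
""")

conn.execute("INSERT INTO users (email) VALUES (?)", ("John@Doe.com",))

# Compare lowercased values so the search ignores the case the user typed.
found = conn.execute(
    "SELECT email FROM users WHERE lower(email) = lower(?)", ("john@doe.com",)
).fetchone()

# The same address in a different case counts as a duplicate.
try:
    conn.execute("INSERT INTO users (email) VALUES (?)", ("JOHN@DOE.COM",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(found, duplicate_rejected)
```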
At the logical level, the email is the natural key.
At the physical level, given you are using a relational database, the natural key doesn't fit well as the primary key. The reason is mainly the performance issues mentioned by others.
For that reason, the design can be adapted. The natural key becomes the alternate key (UNIQUE, NOT NULL), and you use a surrogate/artificial/technical key as the primary key, which can be an auto-increment in your case.
systempuntoout asked,
What if someone wants to change his email address? Are you going to change all the foreign keys too?
That's what cascading is for.
Another reason to use a numeric surrogate key as the primary key is related to how the indexing works in your platform. In MySQL's InnoDB, for example, every secondary index in a table stores the primary key columns alongside its own, so you want the PK to be as small as possible (for the sake of both speed and size). Also related to this, InnoDB is faster when rows are inserted in primary key order, and a string would not help there.
Another thing to take into consideration when using a string as an alternate key, is that using a hash of the actual string that you want might be faster, skipping things like upper and lower cases of some letters. (I actually landed here while looking for a reference to confirm what I just said; still looking...)
Yes, it is better to use an integer instead. You can also put a unique constraint on your email column,
like this:
CREATE TABLE myTable(
id integer primary key,
email text UNIQUE
);
Yes, it is a bad primary key because your users will want to update their email addresses.
Another reason why an integer primary key is better shows up when you refer to the email address in a different table. If the address itself is the primary key, then you have to use it as the key in the other table, so you store email addresses multiple times.
I am not too familiar with postgres. Primary Keys is a big topic. I've seen some excellent questions and answers on this site (stackoverflow.com).
I think you may have better performance by having a numeric primary key and use a UNIQUE INDEX on the email column. Emails tend to vary in length and may not be proper for primary key index.
Personally, I do not use any real information as a primary key when designing a database, because it is very likely that I will need to alter that information later. The sole reason I provide a primary key is that it is convenient for most SQL operations from the client side, and my choice for it has always been an auto-increment integer type.
I know this is a bit of a late entry, but I would like to add that people abandon email accounts, and service providers recover the addresses, allowing another person to use them.
As @HLGEM pointed out, "Jsmith@somecompany.com can easily belong to John Smith one year and Julia Smith two years later." In this case, should John Smith want your service, you either have to refuse to use his email address or delete all your records pertaining to Julia Smith.
If you have to delete records and they relate to the financial history of the business, then depending on local law you could find yourself in hot water.
So I would never use data like email addresses, number plates, etc. as primary keys, because no matter how unique they seem, they are out of your control and can present some interesting challenges that you may not have time to deal with.
You may need to consider any applicable data regulation legislation. Email is personal information, and if your users are EU citizens, for instance, then under GDPR they can instruct you to delete their information from your records (remember that this applies regardless of which country you are based in).
If you need to keep the record itself in the database for referential integrity or historical reasons such as audit, using a surrogate key would allow you to just NULL all the personal data fields. This obviously isn't as easy if their personal data is the primary key.
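A sketch of that "NULL the personal fields" idea, using Python's built-in sqlite3 module. The tables users and invoices and their columns are hypothetical. The personal columns are nullable so they can be erased, while the invoice keeps pointing at the same surrogate id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,      -- carries no personal information
        email TEXT UNIQUE,              -- nullable so it can be erased
        name  TEXT
    );
    CREATE TABLE invoices (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        total   INTEGER NOT NULL
    );
""")

user_id = conn.execute(
    "INSERT INTO users (email, name) VALUES (?, ?)", ("ann@example.com", "Ann")
).lastrowid
conn.execute("INSERT INTO invoices (user_id, total) VALUES (?, ?)", (user_id, 120))

# Erasure request: wipe the personal data but keep the row and its id.
conn.execute("UPDATE users SET email = NULL, name = NULL WHERE id = ?", (user_id,))

# The audit trail is still intact and still joins to the (now anonymous) user.
audit = conn.execute(
    "SELECT u.email, i.total FROM invoices i JOIN users u ON u.id = i.user_id"
).fetchall()
print(audit)
```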
Your colleague is right: Use an autoincrementing integer for your primary key.
You can implement email uniqueness either at the application level, or you could mark your email address column as unique and add an index on that column.
Adding the field as unique will cost you a string comparison only when inserting into that table, and not when performing joins and foreign key constraint checks.
Of course, you must note that adding any constraints to your application at the database level can cause your app to become inflexible. Always give due consideration before you make any field "unique" or "not null" just because your application needs it to be unique or non-empty.
Use a GUID as a primary key... that way you can generate it from your program when you do an INSERT and you don't need to get a response from the server to find out what the primary key is. It will also be unique across tables and databases and you don't have to worry about what happens if you truncate the table some day and the auto-increment gets reset to 1.
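A minimal sketch of generating the key in the program, using Python's built-in uuid and sqlite3 modules. Storing the GUID as TEXT is an assumption made for the example; databases with a native UUID type would normally use that instead.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id    TEXT PRIMARY KEY,         -- GUID stored as text in this sketch
        email TEXT NOT NULL UNIQUE
    )
""")

# The key is known before the INSERT, so no round trip is needed to learn it.
user_id = str(uuid.uuid4())
conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, "ann@example.com"))

stored_id = conn.execute(
    "SELECT id FROM users WHERE email = ?", ("ann@example.com",)
).fetchone()[0]
print(stored_id == user_id)
```

Note that random GUIDs are larger than integers and do not arrive in index order, which is the trade-off other answers raise about wide or non-sequential primary keys.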
you can boost the performance by using integer primary key.
you should use an integer primary key. if you need the email-column to be unique, why don't you simply set an unique-index on that column?
If you have a non int value as primary key then insertions and retrievals will be very slow on large data.
The primary key should be a static attribute. Email addresses are not static and can be shared by multiple candidates, so it is not a good idea to use them as the primary key. Moreover, email addresses are strings that are usually longer than the unique id we would like to use [len(email_address) > len(unique_id)], so they require more space and, even worse, they are stored multiple times as foreign keys. Consequently, performance degrades.
It depends on the table. If the rows in your table represent email addresses, then email is the best ID. If not, then email is not a good ID.
If it's simply a matter of requiring the email to be unique then you can just create a unique index with that column.
Email is a good candidate for a unique index, but not for the primary key. If it is the primary key, you will not be able to change a contact's email address, for example.
I think your join queries will be slower too.
Do not use the email address as the primary key. Keep email unique, but use a user id or username as the primary key.