SQL: To primary key or not to primary key?

I have a table with sets of settings for users, it has the following columns:
UserID INT
Set VARCHAR(50)
Key VARCHAR(50)
Value NVARCHAR(MAX)
TimeStamp DATETIME
UserID together with Set and Key are unique. So a specific user cannot have two of the same keys in a particular set of settings. The settings are retrieved by set, so if a user requests a certain key from a certain set, the whole set is downloaded, so that the next time a key from the same set is needed, it doesn't have to go to the database.
Should I create a primary key on all three columns (userid, set, and key), or should I create an extra field that holds the primary key (for example an auto-increment integer called SettingID, a bad idea I guess), or skip the primary key and just create a unique index?
----- UPDATE -----
Just to clear things up: This is an end-of-the-line table; it is not joined in any way. UserID is a FK to the Users table. Set is not a FK. It is pretty much a helper table for my GUI.
Just as an example: the first time users visit certain parts of the website, they get a help balloon, which they can close if they want. Once they click it away, I add a setting to the "GettingStarted" set stating that help balloon X has been disabled. The next time the user comes to the same page, the setting tells the page that help balloon X should not be shown anymore.

Composite unique keys are mostly not a good idea.
Using business-relevant data as a primary key can also cause trouble, for instance when you need to change a value. Even if the application cannot change the value today, it might need to in the future, or an upgrade script might have to change it.
It is best to create a surrogate key: an automatically generated number that has no business meaning.
Edit after your update:
In this case, you can think of the table as conceptually having no primary key, and make these three columns either the primary key or a composite unique key (to keep them changeable).

Should I create a primary key on all three columns (userid, set, and key)
Make this one.
Using surrogate primary key will result in an extra column which is not used for other purposes.
Creating a UNIQUE INDEX along with a surrogate primary key is the same as creating a non-clustered PRIMARY KEY, and will result in an extra KEY lookup, which is worse for performance.
Creating a UNIQUE INDEX without a PRIMARY KEY will result in a HEAP-organized table which will need an extra RID lookup to access the values: also not very good.
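To make the composite-key option concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module only so that it runs anywhere; it is not the asker's SQL Server setup, and clustering behaviour differs between engines. The names user_settings, set_name, key_name and updated_at are made up for the illustration (Set and Key are reserved words in many dialects).

```python
import sqlite3

# In-memory SQLite database, used here purely as a runnable stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_settings (
        user_id    INTEGER NOT NULL,
        set_name   TEXT    NOT NULL,
        key_name   TEXT    NOT NULL,
        value      TEXT,
        updated_at TEXT    DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (user_id, set_name, key_name)  -- composite primary key
    )
""")

insert_sql = (
    "INSERT INTO user_settings (user_id, set_name, key_name, value) "
    "VALUES (?, ?, ?, ?)"
)
conn.execute(insert_sql, (1, "GettingStarted", "HelpBalloonX", "disabled"))

# A second row with the same (user, set, key) violates the primary key.
try:
    conn.execute(insert_sql, (1, "GettingStarted", "HelpBalloonX", "enabled"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Fetch the whole set in one query, as the question describes.
settings = dict(conn.execute(
    "SELECT key_name, value FROM user_settings "
    "WHERE user_id = ? AND set_name = ?",
    (1, "GettingStarted"),
).fetchall())

print(duplicate_rejected, settings)
```

Because the lookup is always by user and set, the leading columns of the composite key are exactly the ones the query filters on.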

How many Keys and Sets do you have? Do these need to be varchar(50), or can they point to a lookup table? If you can convert Set and Key into SetId and KeyId, then you can create your primary key on the three integer values, which will be much faster.

I would probably try to make sure that UserID was a unique identifier, rather than having duplicates of UserID throughout the code. Composite keys tend to get confusing later on in your code's life.
I'm assuming this is a lookup table for config values of some kind, so you could probably go with the composite key if this is the case. The data is already there. You can guarantee its uniqueness using the primary key. If you change your mind and decide later that it isn't appropriate for you, you can easily add a SettingId and make the original composite key a unique index.

Create one separate primary key. No matter how the business logic changes, or what new rules have to be applied to your Key VARCHAR(50) field, having a single primary key will make you completely independent of business logic.

In my experience it all depends how many tables will be using this table as FK information. Do you want 3 extra columns in your other tables just to carry over a FK?
Personally I would create a separate single-column key and put a unique constraint over the other three columns. This makes foreign keys to this table a lot easier to swallow.

I'm not a proponent of composite keys, but in this case, as an end-of-the-line table, it might make sense. However, if you allow nulls in any of these three fields because one or more of the values is not known at the time of the insert, there can be difficulty, and a unique index might be better.

Better to make UserID a uniqueidentifier (a 128-bit GUID generated with newid()), because an int UserID gives users a hint about what other UserIDs probably are. This will also solve your issue of composite key.

Related

What does PRIMARY KEY actually signify, and does my table need one?

I have a PostgreSQL 9.3 database with a users table that stores usernames in their case-preserved format. All queries will be case insensitive, so I should have an index that supports that. Additionally, usernames must be unique, regardless of case.
This is what I have come up with:
forum=> \d users
            Table "public.users"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 name   | character varying(24) | not null
Indexes:
    "users_lower_idx" UNIQUE, btree (lower(name::text))
Expressed in standard SQL syntax:
CREATE TABLE users (
    name varchar(24) NOT NULL
);
CREATE UNIQUE INDEX "users_lower_idx" ON users (lower(name));
With this schema, I've satisfied all my constraints, albeit without a primary key. The SQL standard doesn't support functional primary keys, so I cannot promote the index:
forum=> ALTER TABLE users ADD PRIMARY KEY USING INDEX users_lower_idx;
ERROR: index "users_lower_idx" contains expressions
LINE 1: ALTER TABLE users ADD PRIMARY KEY USING INDEX users_lower_id...
^
DETAIL: Cannot create a primary key or unique constraint using such an index.
But, I already have the UNIQUE constraint, and the column is already marked "NOT NULL." If I had to have a primary key, I could construct the table like this:
CREATE TABLE users (
    name varchar(24) PRIMARY KEY
);
CREATE UNIQUE INDEX "users_lower_idx" ON users (lower(name));
But then I'll have two indexes, and that seems wasteful and unnecessary to me. So, does PRIMARY KEY mean anything special to postgres beyond "UNIQUE NOT NULL," and am I missing anything by not having one?
First off, practically every table should have a primary key.
citext
The additional module provides a data type of the same name. "ci" for case insensitive. Per documentation:
The citext module provides a case-insensitive character string type,
citext. Essentially, it internally calls lower when comparing
values. Otherwise, it behaves almost exactly like text.
It is intended for exactly the purpose you describe:
The citext data type allows you to eliminate calls to lower in SQL
queries, and allows a primary key to be case-insensitive.
Bold emphasis mine.
Be sure to read the manual about limitations first. Install it once per database with
CREATE EXTENSION citext;
text
If you don't want to go that route, I suggest you add a serial as surrogate primary key.
CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);
I would use text instead of varchar(24). Use a CHECK constraint if you need to enforce a maximum length (that may change at a later time). Details:
Any downsides of using data type "text" for storing strings?
Change PostgreSQL columns used in views
Along with the UNIQUE index in your original design (without type cast):
CREATE UNIQUE INDEX users_username_lower_idx ON users (lower(username));
The underlying integer of a serial is small and fast and does not have to waste time with lower() or the collation of your database. That's particularly useful for foreign key references. I mostly prefer that over some natural primary key with varying properties.
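As a rough, runnable illustration of the surrogate key plus expression index design, the sketch below uses Python's sqlite3 module rather than PostgreSQL. That is an assumption made only so the example is self-contained: SQLite also supports indexes on expressions, but its lower() folds ASCII letters only, and INTEGER PRIMARY KEY plays the role of serial here.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; the schema mirrors the answer above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,   -- surrogate key, auto-assigned
        username TEXT NOT NULL
    )
""")
# Uniqueness is enforced on the lower-cased name, not on the stored value.
conn.execute(
    "CREATE UNIQUE INDEX users_username_lower_idx ON users (lower(username))"
)

conn.execute("INSERT INTO users (username) VALUES ('Alice')")

# Same name in a different case collides with the expression index.
try:
    conn.execute("INSERT INTO users (username) VALUES ('ALICE')")
    case_variant_rejected = False
except sqlite3.IntegrityError:
    case_variant_rejected = True

# Case-insensitive lookup returns the case-preserved original.
stored = conn.execute(
    "SELECT username FROM users WHERE lower(username) = lower(?)",
    ("aLiCe",),
).fetchone()[0]

print(case_variant_rejected, stored)
```

The stored value keeps its original capitalisation, while both the uniqueness check and the lookup go through lower().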
Both solutions have pros and cons.
I would suggest using a primary key, since you have stated you want something unique and have demonstrated that you can put a unique constraint on the username. I will assume that, since this is a unique, not-null username, you will use it to track your users in other parts of the database, and that you will allow usernames to be changed.
This is where a primary key will come in handy, instead of having to go into all of your tables and change the value of the Username column, you will only have one place to change it.
Example
Without primary key:
Table users
Username
'Test'
Table thingsdonebyUsers
RandomColumn AnotherColumn Username
RandomValue RandomValue Test
Now assume your user wants to change his username to Test1. You now have to find everywhere you used Username and change it to the new value before you change it in your users table, since I'm assuming you will have a constraint there.
With Primary Key
Table users
PK Username
1 'Test'
Table thingsdonebyUsers
RandomColumn AnotherColumn PK_Users
RandomValue RandomValue 1
Now you can just change your users table and be done with the change.
You can still enforce unique and not null on your username column as you demonstrated.
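The "With Primary Key" layout above can be sketched as follows. This is only an illustration using Python's sqlite3 module; the table things_done_by_users and its columns are adapted from the example tables, and the snake_case spellings are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE things_done_by_users (
        random_column TEXT,
        user_id       INTEGER NOT NULL REFERENCES users (user_id)
    );
    INSERT INTO users (user_id, username) VALUES (1, 'Test');
    INSERT INTO things_done_by_users (random_column, user_id)
        VALUES ('RandomValue', 1);
""")

# Renaming the user touches exactly one row in one table.
conn.execute("UPDATE users SET username = 'Test1' WHERE user_id = 1")

# The referencing table picks up the new name through the join.
row = conn.execute("""
    SELECT t.random_column, u.username
    FROM things_done_by_users AS t
    JOIN users AS u ON u.user_id = t.user_id
""").fetchone()

print(row)
```

The child table never stores the username, so there is nothing to update there when the name changes.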
This is just one of the advantages of giving your tables a primary key that is unrelated to the data itself (the proper name for this is a surrogate key).
As for what a PK actually signifies, it is just a non-nullable unique column that identifies the row, so in this sense you already have a primary key on your table. PKs are usually INT numbers for the reason I explained above.
Short answer: No, you don't need a declarative "PRIMARY KEY", since the UNIQUE index serves the same exact purpose.
Long answer:
The idea of having Primary Keys comes from database systems where the data is physically in key order. This requires having a single, "primary" key. MySQL InnoDB is this way, as are many older databases.
However, PostgreSQL does not keep the tables in key order; it separates the indexes, including the primary key index, from the heap, which is essentially unordered. As a result, in Postgres, there is no material difference between primary keys and unique indexes. You can even create a foreign key against the columns of a unique index, as long as that index covers the whole table (that is, it is not a partial index) and is built on plain columns rather than expressions.
That being said, some tools external to PostgreSQL look for primary keys and do not regard unique indexes as being equivalent. These tools may cause you issues because of not finding a PK.

Primary key in "many-to-many" table

I have a table in a SQL database that provides a "many-to-many" connection.
The table contains id's of both tables and some fields with additional information about the connection.
CREATE TABLE SomeTable (
    f_id1 INTEGER NOT NULL,
    f_id2 INTEGER NOT NULL,
    additional_info text NOT NULL,
    ts timestamp NULL DEFAULT now()
);
The table is expected to contain 10 000 - 100 000 entries.
How is it better to design a primary key? Should I create an additional 'id' field, or to create a complex primary key from both id's?
DBMS is PostgreSQL
This is a "hard" question in the sense that there are pretty good arguments on both sides. I have a bias toward putting in auto-incremented ids in all tables that I use. Over time, I have found that this simply helps with the development process and I don't have to think about whether they are necessary.
A big reason for this is so foreign key references to the table can use only one column.
In a many-to-many junction table (aka "association table"), this probably isn't necessary:
It is unlikely that you will add a table with a foreign key relationship to a junction table.
You are going to want a unique index on the columns anyway.
They will probably be declared not null anyway.
Some databases actually store data based on the primary key. So, when you do an insert, data must be moved on pages to accommodate the new values. Postgres is not one of those databases. It treats the primary key index just like any other index. In other words, you are not incurring "extra" work by declaring one or more columns as a primary key.
My conclusion is that having the composite primary key is fine, even though I would probably have an auto-incremented primary key with separate constraints. The composite primary key will occupy less space so probably be more efficient than an auto-incremented id. However, if there is any chance that this table would be used for a foreign key relationship, then add in another id field.
A surrogate key won't protect you from adding multiple instances of (f_id1, f_id2), so you should definitely have a unique constraint or primary key for that. What would the purpose of a surrogate key be in your scenario?
Yes, that's actually what people commonly do; that key is called a surrogate key. I'm not exactly sure about PostgreSQL, but in MySQL a surrogate key lets you delete or edit the records from the user interface. Besides, this allows the database to query the single key column faster than it could multiple columns. Hope it helps.
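The point about duplicates can be checked with a small experiment. The sketch below uses Python's sqlite3 module instead of PostgreSQL purely for convenience, and the two table names (link_surrogate_only, link_composite) are invented for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate id only: nothing stops the same pair appearing twice.
    CREATE TABLE link_surrogate_only (
        id    INTEGER PRIMARY KEY,
        f_id1 INTEGER NOT NULL,
        f_id2 INTEGER NOT NULL
    );
    -- Composite primary key: the pair itself identifies the row.
    CREATE TABLE link_composite (
        f_id1           INTEGER NOT NULL,
        f_id2           INTEGER NOT NULL,
        additional_info TEXT    NOT NULL,
        PRIMARY KEY (f_id1, f_id2)
    );
""")

# Insert the same pair twice into the surrogate-only table.
for _ in range(2):
    conn.execute("INSERT INTO link_surrogate_only (f_id1, f_id2) VALUES (1, 2)")
duplicates = conn.execute(
    "SELECT COUNT(*) FROM link_surrogate_only WHERE f_id1 = 1 AND f_id2 = 2"
).fetchone()[0]

# The composite key rejects the second copy of the pair.
conn.execute("INSERT INTO link_composite VALUES (1, 2, 'first')")
try:
    conn.execute("INSERT INTO link_composite VALUES (1, 2, 'second')")
    composite_rejected = False
except sqlite3.IntegrityError:
    composite_rejected = True

print(duplicates, composite_rejected)
```

If a surrogate id is wanted anyway, adding UNIQUE (f_id1, f_id2) to the first table would close the gap.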

Should every table have a primary key?

I read somewhere saying that every table should have a primary key to fulfill 1NF.
I have a tbl_friendship table.
There are 2 fields in the table : Owner and Friend.
The Owner and Friend fields are foreign keys to the auto-increment id field in tbl_user.
Should tbl_friendship have a primary key?
Should I create an auto increment id field in tbl_friendship and make it as primary key?
Primary keys can apply to multiple columns! In your example, the primary key should be on both columns, for example (Owner, Friend), especially since Owner and Friend are foreign keys to a users table rather than actual names. (Personally, my identity columns use the "Id" naming convention, so I would have (OwnerId, FriendId).)
Personally I believe every table should have a primary key, but you'll find others who disagree.
Here's an article I wrote on the topic of normal forms.
http://michaeljswart.com/2011/01/ridiculously-unnormalized-database-schemas-part-zero/
Yes, every table should have a primary key.
Yes, you should create a surrogate key, aka an auto-increment PK field.
You should also make "Friend" an FK to that auto increment field.
If you think that you are going to "rekey" in the future, you might want to look into using natural keys, which are fields that naturally identify your data. The key to this is to always use the natural identifiers while coding, and to create unique indexes on those natural keys. In the future, if you have to re-key, you can, because your unique indexes guarantee that your data is consistent.
I would only do this if you absolutely have to, because it increases complexity, in your code and data model.
It is not clear from your description, but are owner and friend foreign keys and there can be only one relationship between any given pair? This makes two foreign key column a perfect candidate for a natural primary key.
Another option is to use surrogate key (extra auto-incremented column as you suggested). Take a look here for an in-depth discussion.
A primary key can be something abstract as well. In this case, each tuple (owner, friend), e.g. ("Dave","Matt") can form a unique entry and therefore be your primary key. In that case, it would be useful not to use names, but keys referencing another table. If you guarantee, that these tuples can't have duplicates, you have a valid primary key.
For processing reasons it might be useful to introduce a special primary key, like an autoincrement field (e.g. in MySQL) or using a sequence with Oracle.
To comply with 1NF (the exact definition of which is not completely agreed upon), yes, you should have a primary key identified on each table. This is necessary to provide for uniqueness of each record.
http://en.wikipedia.org/wiki/First_normal_form
In general, you can create a primary key in many ways, one of which is to have an auto-increment column, another is to have a column with GUIDs, another is to have two or more columns that will identify a row uniquely when taken together.
Your table will be much easier to manage in the long term if it has a primary key. At the very least, you need to uniquely identify each record in the table. The field that is used to uniquely identify each record might as well be the primary key.
Yes every table should have (at least one) key. Duplicating rows in any table is undesirable for lots of reasons so put the constraint on those two columns.

SQL - how to keep track of "simple relations"

I hope somebody can edit my title to better describe what I mean, because I don't know exactly what this would be called. However, consider this setup: I want to create a notification system, where a message is displayed to a user until he clicks "dismiss". I then need to "remember" that this user has dismissed the notification so I don't show it to him again. Here is my current solution
users table has a uid primary key and user info
notifications table has a nid primary key and notification text
notifications_seen table with two columns, uid and nid
When somebody clicks dismiss on a notification, I store their uid and the notification's nid in notifications_seen. This seems to work fine, but phpMyAdmin has giant red messages telling me that notifications_seen does not have an index. However, neither column is unique. Should I really have an extra utterly useless column in notifications_seen and call that a primary key? Is there a better way to do this?
You can use more than one column to create your primary key. In this case, you should set nid AND uid as your primary key in your notifications_seen table. The idea here is that even though neither nid nor uid will be unique within your notifications_seen table, the nid/uid PAIR is unique. You should add a primary key constraint to these two columns. This is usually what you want in this kind of situation.
There are times when you might actually want to create an auto-increment column to simplify the primary key, for example when your best candidate key consists of a lot of columns (I'm pulling this out of the air, but let's say 4 or more columns) or contains string columns, which would be slower to match when doing lookups. But for this situation, just adding the primary key constraint to the two columns should be more than fine.
Primary keys are indexed by default, which is why you should just add the primary key constraint to the two columns. This will also preserve the integrity of your data by making sure you don't accidentally insert rows with the same uid/nid pair.
You should also add a foreign key constraint on uid referencing uid in the users table, and a foreign key constraint on nid referencing nid in the notifications table. Adding the foreign key constraints will ensure you don't insert uids or nids which don't actually exist into your notifications_seen table.
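Put together, the advice above might look like the sketch below. It runs on Python's sqlite3 module rather than MySQL, so treat it as an illustration of the constraints, not as phpMyAdmin-ready DDL; the name and body columns and the try_insert helper are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE users         (uid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE notifications (nid INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE notifications_seen (
        uid INTEGER NOT NULL REFERENCES users (uid),
        nid INTEGER NOT NULL REFERENCES notifications (nid),
        PRIMARY KEY (uid, nid)            -- the pair is unique, so it is the key
    );
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO notifications VALUES (10, 'Welcome'), (11, 'New feature');
    INSERT INTO notifications_seen VALUES (1, 10);  -- alice dismissed #10
""")

def try_insert(uid, nid):
    """Hypothetical helper: returns True if the row was accepted."""
    try:
        conn.execute(
            "INSERT INTO notifications_seen (uid, nid) VALUES (?, ?)",
            (uid, nid),
        )
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert(1, 10)   # same pair again: primary key violation
orphan_ok = try_insert(99, 10)     # no user 99: foreign key violation

# Notifications that user 1 has not dismissed yet.
unseen = [row[0] for row in conn.execute("""
    SELECT n.nid
    FROM notifications AS n
    WHERE NOT EXISTS (
        SELECT 1 FROM notifications_seen AS s
        WHERE s.nid = n.nid AND s.uid = ?
    )
    ORDER BY n.nid
""", (1,))]

print(duplicate_ok, orphan_ok, unseen)
```

No extra id column is needed: the composite key supplies both the index the tool was asking for and the duplicate protection.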
You may be able to create a compound primary key (consisting of both uid and nid).
You could make an index on notifications_seen that contains both columns! Or create a separate column just for a primary key, or do both - having an index on uid and nid might speed up queries (but don't worry too much about that until you start to notice major performance problems - just remember it for the future). Having a primary key for these n:n relations isn't a terrible thing.

SQL: what exactly do Primary Keys and Indexes do?

I've recently started developing my first serious application which uses a SQL database, and I'm using phpMyAdmin to set up the tables. There are a couple optional "features" I can give various columns, and I'm not entirely sure what they do:
Primary Key
Index
I know what a PK is for and how to use it, but I guess my question with regards to that is why does one need one - how is it different from merely setting a column to "Unique", other than the fact that you can only have one PK? Is it just to let the programmer know that this value uniquely identifies the record? Or does it have some special properties too?
I have no idea what "Index" does - in fact, the only times I've ever seen it in use are (1) that my primary keys seem to be indexed, and (2) I heard that indexing is somehow related to performance; that you want indexed columns, but not too many. How does one decide which columns to index, and what exactly does it do?
edit: should one index columns one is likely to want to ORDER BY?
Thanks a lot,
Mala
Primary key is usually used to create a numerical 'id' for your records, and this id column is automatically incremented.
For example, if you have a books table with an id field, where the id is the primary key and is also set to auto_increment (under 'Extra' in phpMyAdmin), then when you first add a book to the table, its id will become '1'. The next book's id would automatically be '2', and so on. Normally, every table should have a primary key to help identify and find records easily.
Indexes are used when you need to retrieve certain information from a table regularly. For example, if you have a users table, and you will need to access the email column a lot, then you can add an index on email, and this will cause queries accessing the email to be faster.
However there are also downsides for adding unnecessary indexes, so add this only on the columns that really do need to be accessed more than the others. For example, UPDATE, DELETE and INSERT queries will be a little slower the more indexes you have, as MySQL needs to store extra information for each indexed column. More info can be found at this page.
Edit: Yes, columns that need to be used in ORDER BY a lot should have indexes, as well as those used in WHERE.
The primary key is basically a unique, indexed column that acts as the "official" ID of rows in that table. Most importantly, it is generally used for foreign key relationships, i.e. if another table refers to a row in the first, it will contain a copy of that row's primary key.
Note that it's possible to have a composite primary key, i.e. one that consists of more than one column.
Indexes improve lookup times. They're usually tree-based, so that looking up a certain row via an index takes O(log(n)) time rather than scanning through the full table.
Generally, any column in a large table that is frequently used in WHERE, ORDER BY or (especially) JOIN clauses should have an index. Since the index needs to be updated for every INSERT, UPDATE or DELETE, it slows down those operations. If you have few writes and lots of reads, then index to your heart's content. If you have both lots of writes and lots of queries that would require indexes on many columns, then you have a big problem.
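One way to see what an index changes is to ask the database for its query plan before and after creating it. The sketch below does this with SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; MySQL has its own EXPLAIN with different output, so this is only an analogy. The table contents, the index name idx_users_email and the plan helper are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(1000)],
)

QUERY = "SELECT name FROM users WHERE email = ?"

def plan(sql, params):
    """Return SQLite's human-readable plan for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the description of the step.
    return " | ".join(row[-1] for row in rows)

plan_before = plan(QUERY, ("user500@example.com",))   # no index on email yet
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(QUERY, ("user500@example.com",))    # index now available

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan describes a scan of the table and the second one names the index.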
The difference between a primary key and a unique key is best explained through an example.
We have a table of users:
USER_ID number
NAME varchar(30)
EMAIL varchar(50)
In that table the USER_ID is the primary key. The NAME is not unique - there are a lot of John Smiths and Muhammed Khans in the world. The EMAIL is necessarily unique, otherwise the worldwide email system wouldn't work. So we put a unique constraint on EMAIL.
Why then do we need a separate primary key? Three reasons:
the numeric key is more efficient when used in foreign key relationships as it takes less space
the email can change (for example swapping provider) but the user is still the same; rippling a change of a primary key value throughout a schema is always a nightmare
it is always a bad idea to use sensitive or private information as a foreign key
In the relational model, any column or set of columns that is guaranteed to be both present and unique in the table can be called a candidate key to the table. "Present" means "NOT NULL". It's common practice in database design to designate one of the candidate keys as the primary key, and to use references to the primary key to refer to the entire row, or to the subject matter item that the row describes.
In SQL, a PRIMARY KEY constraint amounts to a NOT NULL constraint for each primary key column, and a UNIQUE constraint for all the primary key columns taken together. In practice many primary keys turn out to be single columns.
For most DBMS products, a PRIMARY KEY constraint will also result in an index being built on the primary key columns automatically. This speeds up the systems checking activity when new entries are made for the primary key, to make sure the new value doesn't duplicate an existing value. It also speeds up lookups based on the primary key value and joins between the primary key and a foreign key that references it. How much speed up occurs depends on how the query optimizer works.
Originally, relational database designers looked for natural keys in the data as given. In recent years, the tendency has been to always create a column called ID, an integer as the first column and the primary key of every table. The autogenerate feature of the DBMS is used to ensure that this key will be unique. This tendency is documented in the "Oslo design standards". It isn't necessarily relational design, but it serves some immediate needs of the people who follow it. I do not recommend this practice, but I recognize that it is the prevalent practice.
An index is a data structure that allows for rapid access to a few rows in a table, based on a description of the columns of the table that are indexed. The index consists of copies of certain table columns, called index keys, interspersed with pointers to the table rows. The pointers are generally hidden from the DBMS users. Indexes work in tandem with the query optimizer. The user specifies in SQL what data is being sought, and the optimizer comes up with index strategies and other strategies for translating what is being sought into a strategy for finding it. There is some kind of organizing principle, such as sorting or hashing, that enables an index to be used for fast lookups, and certain other uses. This is all internal to the DBMS, once the database builder has created the index or declared the primary key.
Indexes can be built that have nothing to do with the primary key. A primary key can exist without an index, although this is generally a very bad idea.