I am building a database system and having trouble with the design of one of my tables.
In this system there is a users table, an object table, an item table and cost table.
A unique record in the cost table is determine by the user, object, item and year. However, there can be multiple records that have the same year if the item is different.
The hierarchy goes user->object->item->year, multiple unique years per item, multiple unique items per object, multiple unique objects per user, multiple unique users.
What would be the best way to design the cost table?
I am thinking of including the userid, objectid and itemid as foreign keys and then using a composite key consisting of userid, objecid, itemid and costyear. I have heard that composite keys are bad design, but I am unsure how to structure this to get away from using a composite key. As you can tell my database building skills are a bit rusty.
Thanks!
P.S. If it matters, this is an interbase db.
To avoid the composite key, you just define a surrogate key. This holds an artificial value, for instance an auto counter.
You still can (and should) define a unique constraint on these columns.
Btw: its not only recommended not to use composite keys, it's also recommendable to use surrogate keys. In all your tables.
Use an internally generated key field (called surrogate keys), something like CostID, that the users will never see but will uniquely identify each entry in the Cost table (in SqlServer, fields like uniqueidentifier or IDENTITY would do the trick.)
Try building your database with a composite key using exactly the columns you outlined, and see what happens. You may be pleasantly surprised. Making sure that there is no missing data in those four columns, and making sure that no two rows have the same value in all four columns will help protect the integrity of your data.
When you declare a composite primary key, the order of columns in your declaration won't affect the logical consequences of the dclaration. However the composite index that the DBMS builds for you will also have the columns in the same order, and the order of columns in a composite index does affect performance.
For queries that specify only one, two, or three of these columns, the index will be useless if the first column in the index is a column not specified in the query. If you know in advance how your queries are gonig to me, and which queries most need to run fast, this can help you declare the columns for the primary key in the right order. In rare circumstances creating two or three additional one column indexes can speed up some queries, at the cost of slowing down updates.
Related
I recently ran into a quite complex problem and after looking around a lot I couldn't find a solution to it. I've found answers to my questions many times before on stackoverflow.com, so I decided to post here.
So I'm making a user/group managment system for a web-based project, and I'm storing all related data into a postgreSQL database. This system relies on three tables:
USERS (Contains the primary key "USER_ID")
GROUPS (Contains the primary key "GROUP_ID")
GROUP_USERS
The two first tables simply define all the users and all the groups on the site, and the last table, GROUP_USERS, stores the groups every user is part of. It only has two columns:
USER_ID
GROUP_ID
Since every user can be a member of several groups, I decided to make a separate table for this purpose, rather than storing a comma separated column in the USERS-table.
Now, both columns are foreign keys, and I want to make them a composite primary key as well, this since each combination of USER_ID and GROUP_ID has to be unique. But now I am stuck with what seems to be a lot of indexes and relations to a very small table only containing numbers. In the end, I want this table to be as fast as possible, even if containing tens of thousands of rows. Size on disk shouldn't be a problem since its just all numbers anyway, but it feels quite stupid to have a full-sized index refering to a smaller table.
Should I stick with my current solution, store comma-separated values in a column in the USERS-table or is there any other solution I should be aware of. What I am looking for is best possible performance. This table could potentially (but not likely or commonly) be queried several hundreds of times on a single page load.
I don't want to use an array-column, even if they are supported by postgreSQL. I want to be as generic as possible so I can switch database later on, if necessary.
EDIT: In other words, will using a composite primary key and two foreign keys in one table with only two columns have a negative impact on performance rather than the opposite due to the size of the generated index?
EDIT2: Clarifications.
Thank you!
I believe you're in the right path right now, but didn't understand which indexes you really defined.
My suggestion is that you should have your primary key index in USERS by USER_ID, your primary key index in GROUPS by GROUP_ID, and two more indexes in GROUP_USERS. One of the indexes in GROUP_USERS should be either by the couple (USER_ID, GROUP_ID), or by the couple (GROUP_ID, USER_ID). The second index should be by the field that was left in second place in the last index defined.
Now why did I mentioned two options while defining the primary key over GROUP_USERS? That's because there is a slightly performance difference between a primary key index and any other duplicate index. It's very likely that your most common query into that table would be to find out if a user is in a certain group, and that query will perform fast in either way. What you have to consider is which of the following two queries will be more common.
Query which groups a certain user is in.
Query which users are in a certain group.
If 1 is more likely over 2, then your primary key should be (USER_ID, GROUP_ID), otherwise (GROUP_ID, USER_ID).
If I understand your question properly, what you might be missing is that Primary Keys (for that matter, Foreign Keys as well) can be what is called Composite, meaning that they contain more than one column... That's what you want here. A composite Primary Key on both UserId and GroupId, and a Foreign Key on each one indivudyally, that each points to (references) the PK in the respective parent table.
I have a table with 16 columns. It will be most frequently used table in web aplication and it will contain about few hundred tousand rows. Database is created on sql server 2008.
My question is choice for primary key. What is quicker? I can use complex primary key with two bigint-s or i can use one varchar value but i will need to concatenate it after?
There are many more factors you must consider:
data access prevalent pattern, how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and specially the first two, should drive your choice of the clustered key. Note that the primary key and clustered key are different concepts, often confused. Read up my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion on the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly and concise, and actually correct: the narrower key is always quicker (for reasons of IO). However, this response bares absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way if the row changes, the ID doesn't have to, and any tables referring to it (Foriegn Keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
Why not just a single INT auto-generated primary key? INT is 32-bit, so it can handle over 4 billion records.
CREATE TABLE Records (
recordId INT NOT NULL PRIMARY KEY,
...
);
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
What do you mean quicker? if you need to search quicker, you can create index for any column or create full text search. the primary key just make sure you do not have duplicated records.
The decision relies upon its use. If you are using the table to save data mostly and not retrieve it, then a simple key. If you are mostly querying the data and it is mostly static data where the key values will not change, your index strategy needs to optimize the data to the most frequent query that will be used. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But, it really depends upon your needs.
Lot’s of variables you haven’t mentioned; whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, if disclosure of the key via a UI poses a risk, how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it a ER Model or Dimensional Model. In ER Model, they should be separate and should not be surrogated. The entire record could have a single surrogate for easy references in URLs etc. This could be a hash of all parts of the composite key or an Identity.
In Dimensional Model, also they must be separate and they all should be surrogated.
I've recently started developing my first serious application which uses a SQL database, and I'm using phpMyAdmin to set up the tables. There are a couple optional "features" I can give various columns, and I'm not entirely sure what they do:
Primary Key
Index
I know what a PK is for and how to use it, but I guess my question with regards to that is why does one need one - how is it different from merely setting a column to "Unique", other than the fact that you can only have one PK? Is it just to let the programmer know that this value uniquely identifies the record? Or does it have some special properties too?
I have no idea what "Index" does - in fact, the only times I've ever seen it in use are (1) that my primary keys seem to be indexed, and (2) I heard that indexing is somehow related to performance; that you want indexed columns, but not too many. How does one decide which columns to index, and what exactly does it do?
edit: should one index colums one is likely to want to ORDER BY?
Thanks a lot,
Mala
Primary key is usually used to create a numerical 'id' for your records, and this id column is automatically incremented.
For example, if you have a books table with an id field, where the id is the primary key and is also set to auto_increment (Under 'Extra in phpmyadmin), then when you first add a book to the table, the id for that will become 1'. The next book's id would automatically be '2', and so on. Normally, every table should have at least one primary key to help identifying and finding records easily.
Indexes are used when you need to retrieve certain information from a table regularly. For example, if you have a users table, and you will need to access the email column a lot, then you can add an index on email, and this will cause queries accessing the email to be faster.
However there are also downsides for adding unnecessary indexes, so add this only on the columns that really do need to be accessed more than the others. For example, UPDATE, DELETE and INSERT queries will be a little slower the more indexes you have, as MySQL needs to store extra information for each indexed column. More info can be found at this page.
Edit: Yes, columns that need to be used in ORDER BY a lot should have indexes, as well as those used in WHERE.
The primary key is basically a unique, indexed column that acts as the "official" ID of rows in that table. Most importantly, it is generally used for foreign key relationships, i.e. if another table refers to a row in the first, it will contain a copy of that row's primary key.
Note that it's possible to have a composite primary key, i.e. one that consists of more than one column.
Indexes improve lookup times. They're usually tree-based, so that looking up a certain row via an index takes O(log(n)) time rather than scanning through the full table.
Generally, any column in a large table that is frequently used in WHERE, ORDER BY or (especially) JOIN clauses should have an index. Since the index needs to be updated for evey INSERT, UPDATE or DELETE, it slows down those operations. If you have few writes and lots of reads, then index to your hear's content. If you have both lots of writes and lots of queries that would require indexes on many columns, then you have a big problem.
The difference between a primary key and a unique key is best explained through an example.
We have a table of users:
USER_ID number
NAME varchar(30)
EMAIL varchar(50)
In that table the USER_ID is the primary key. The NAME is not unique - there are a lot of John Smiths and Muhammed Khans in the world. The EMAIL is necessarily unique, otherwise the worldwide email system wouldn't work. So we put a unique constraint on EMAIL.
Why then do we need a separate primary key? Three reasons:
the numeric key is more efficient
when used in foreign key
relationships as it takes less space
the email can change (for example
swapping provider) but the user is
still the same; rippling a change of
a primary key value throughout a schema
is always a nightmare
it is always a bad idea to use
sensitive or private information as
a foreign key
In the relational model, any column or set of columns that is guaranteed to be both present and unique in the table can be called a candidate key to the table. "Present" means "NOT NULL". It's common practice in database design to designate one of the candidate keys as the primary key, and to use references to the primary key to refer to the entire row, or to the subject matter item that the row describes.
In SQL, a PRIMARY KEY constraint amounts to a NOT NULL constraint for each primary key column, and a UNIQUE constraint for all the primary key columns taken together. In practice many primary keys turn out to be single columns.
For most DBMS products, a PRIMARY KEY constraint will also result in an index being built on the primary key columns automatically. This speeds up the systems checking activity when new entries are made for the primary key, to make sure the new value doesn't duplicate an existing value. It also speeds up lookups based on the primary key value and joins between the primary key and a foreign key that references it. How much speed up occurs depends on how the query optimizer works.
Originally, relational database designers looked for natural keys in the data as given. In recent years, the tendency has been to always create a column called ID, an integer as the first column and the primary key of every table. The autogenerate feature of the DBMS is used to ensure that this key will be unique. This tendency is documented in the "Oslo design standards". It isn't necessarily relational design, but it serves some immediate needs of the people who follow it. I do not recommend this practice, but I recognize that it is the prevalent practice.
An index is a data structure that allows for rapid access to a few rows in a table, based on a description of the columns of the table that are indexed. The index consists of copies of certain table columns, called index keys, interspersed with pointers to the table rows. The pointers are generally hidden from the DBMS users. Indexes work in tandem with the query optimizer. The user specifies in SQL what data is being sought, and the optimizer comes up with index strategies and other strategies for translating what is being sought into a stategy for finding it. There is some kind of organizing principle, such as sorting or hashing, that enables an index to be used for fast lookups, and certain other uses. This is all internal to the DBMS, once the database builder has created the index or declared the primary key.
Indexes can be built that have nothing to do with the primary key. A primary key can exist without an index, although this is generally a very bad idea.
When do you use each MySQL index type?
PRIMARY - Primary key columns?
UNIQUE - Foreign keys?
INDEX - ??
For really large tables, do indexed columns improve performance?
Primary
The primary key is - as the name suggests - the main key of a table and should be a column which is commonly used to select the rows of this table. The primary key is always a unique key (unique identifier). The primary key is not limited to one column, for example in reference tables (many-to-many) it often makes sense to have a primary key including two or more columns.
Unique
A unique index makes sure your DBMS doesn't accept duplicate entries for this column. You ask 'Foreign keys?' NO! That would not be useful since foreign keys are per definition prone to be duplicates, (one-to-many, many-to-many).
Index
Additional indexes can be placed on columns which are often used for SELECTS (and JOINS) which is often the case for foreign keys. In many cases SELECT (and JOIN) queries will be faster, if the foreign keys are indexed.
Note however that - as SquareCog has clarified - Indexes get updated on any modifications to the data, so yes, adding more indexes can lead to degradation in INSERT/UPDATE performance. If indexes didn't get updated, you would get different information depending on whether the optimizer decided to run your query on an index or the raw table -- a highly undesirable situation.
This means, you should carefully assess the usage of indices. One thing is sure on the basis of that: Unused indices have to be avoided, resp. removed!
I'm not that familiar with MySQL, however I believe the following to be true across most database servers. An index is a balanced tree which is used to allow the database to scan the table for given data. For example say you have the following table.
CREATE TABLE person (
id SERIAL,
name VARCHAR(20),
dob VARCHAR(20)
);
If you created an index on the 'name' field this would create in a balanced tree for that data in the table for the name column. Balanced tree data structures allow for faster searching of results (see http://www.solutionhacker.com/tag/balanced-tree/).
You should note however indexing a column only allows you to search on the data as it is stored in the database. For example:
This would not be able to search on the index and would instead do a sequential scan on the table, calling UPPER() on each of the column:name rows in the table.
select *
from person
where UPPER(name) = "BOB";
This would also have the following effect, because the index will be sorted starting with the first letter. Replacing the search term with "%B" would however use the index.
select *
from person
where name like "%B"
Indexes will improve performance on larger tables. Normally, the primary key has an index based on the key. Usually unique.
It is useful to add indexes to fields that are used to search on a lot too such as Street Name or Surname as again it will improve perfomance. Don't need to be unique.
Foreign Keys and Unique Keys are more for keeping your data integrity in order. So that you cannot have duplicate primary keys and so that your child tables don't have data for a parent that has been deleted.
PRIMARY defines a primary key, yes.
UNIQUE simply defines that the specified field has to be unique, it has nothing to do with foreign keys.
INDEX creates an index for the specified column and, yes, it improves performance for large tables, sorting and finding something in this column can be much faster if you use indexing.
The bigger the table, the bigger is gain from using an index. Do note that indexes makes insert (and probably update) operations slower so make sure you don't index too many fields.
Say you have a Many-Many table between Artists and Fans. When it comes to designing the table, do you design the table like such:
ArtistFans
ArtistFanID (PK)
ArtistID (FK)
UserID (FK)
(ArtistID and UserID will then be contrained with a Unique Constraint
to prevent duplicate data)
Or do you build use a compound PK for the two relevant fields:
ArtistFans
ArtistID (PK)
UserID (PK)
(The need for the separate unique constraint is removed because of the
compound PK)
Are there are any advantages (maybe indexing?) for using the former schema?
ArtistFans
ArtistID (PK)
UserID (PK)
The use of an auto incremental PK has no advantages here, even if the parent tables have them.
I'd also create a "reverse PK" index automatically on (UserID, ArtistID) too: you will need it because you'll query the table by both columns.
Autonumber/ID columns have their place. You'd choose them to improve certain things after the normalisation process based on the physical platform. But not for link tables: if your braindead ORM insists, then change ORMs...
Edit, Oct 2012
It's important to note that you'd still need unique (UserID, ArtistID) and (ArtistID, UserID) indexes. Adding an auto increments just uses more space (in memory, not just on disk) that shouldn't be used
Assuming that you're already a devotee of the surrogate key (you're in good company), there's a case to be made for going all the way.
A key point that is sometimes forgotten is that relationships themselves can have properties. Often it's not enough to state that two things are related; you might have to describe the nature of that relationship. In other words, there's nothing special about a relationship table that says it can only have two columns.
If there's nothing special about these tables, why not treat it like every other table and use a surrogate key? If you do end up having to add properties to the table, you'll thank your lucky presentation layers that you don't have to pass around a compound key just to modify those properties.
I wouldn't even call this a rule of thumb, more of a something-to-consider. In my experience, some slim majority of relationships end up carrying around additional data, essentially becoming entities in themselves, worthy of a surrogate key.
The rub is that adding these keys after the fact can be a pain. Whether the cost of the additional column and index is worth the value of preempting this headache, that really depends on the project.
As for me, once bitten, twice shy – I go for the surrogate key out of the gate.
Even if you create an identity column, it doesn't have to be the primary key.
ArtistFans
ArtistFanId
ArtistId (PK)
UserId (PK)
Identity columns can be useful to relate this relation to other relations. For example, if there was a creator table which specified the person who created the artist-user relation, it could have a foreign key on ArtistFanId, instead of the composite ArtistId+UserId primary key.
Also, identity columns are required (or greatly improve the operation of) certain ORM packages.
I cannot think of any reason to use the first form you list. The compound primary key is fine, and having a separate, artificial primary key (along with the unique contraint you need on the foreign keys) will just take more time to compute and space to store.
The standard way is to use the composite primary key. Adding in a separate autoincrement key is just creating a substitute that is already there using what you have. Proper database normalization patterns would look down on using the autoincrement.
Funny how all answers favor variant 2, so I have to dissent and argue for variant 1 ;)
To answer the question in the title: no, you don't need it. But...
Having an auto-incremental or identity column in every table simplifies your data model so that you know that each of your tables always has a single PK column.
As a consequence, every relation (foreign key) from one table to another always consists of a single column for each table.
Further, if you happen to write some application framework for forms, lists, reports, logging etc you only have to deal with tables with a single PK column, which simplifies the complexity of your framework.
Also, an additional id PK column does not cost you very much in terms of disk space (except for billion-record-plus tables).
Of course, I need to mention one downside: in a grandparent-parent-child relation, child will lose its grandparent information and require a JOIN to retrieve it.
In my opinion, in pure SQL id column is not necessary and should not be used. But for ORM frameworks such as Hibernate, managing many-to-many relations is not simple with compound keys etc., especially if join table have extra columns.
So if I am going to use a ORM framework on the db, I prefer putting an auto-increment id column to that table and a unique constraint to the referencing columns together. And of course, not-null constraint if it is required.
Then I treat the table just like any other table in my project.