I am using PostgreSQL 9.5 (but upgrade is possible to say 9.6).
I have a permissions table:
CREATE TABLE public.permissions
(
id integer NOT NULL DEFAULT nextval('permissions_id_seq'::regclass),
item_id integer NOT NULL,
item_type character varying NOT NULL,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT permissions_pkey PRIMARY KEY (id)
)
-- skipping indices declaration, but they would be present
-- on item_id, item_type
And 3 tables for many-to-many associations
-companies_permissions (+indices declaration)
CREATE TABLE public.companies_permissions
(
id integer NOT NULL DEFAULT nextval('companies_permissions_id_seq'::regclass),
company_id integer,
permission_id integer,
CONSTRAINT companies_permissions_pkey PRIMARY KEY (id),
CONSTRAINT fk_rails_462a923fa2 FOREIGN KEY (company_id)
REFERENCES public.companies (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_rails_9dd0d015b9 FOREIGN KEY (permission_id)
REFERENCES public.permissions (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
CREATE INDEX index_companies_permissions_on_company_id
ON public.companies_permissions
USING btree
(company_id);
CREATE INDEX index_companies_permissions_on_permission_id
ON public.companies_permissions
USING btree
(permission_id);
CREATE UNIQUE INDEX index_companies_permissions_on_permission_id_and_company_id
ON public.companies_permissions
USING btree
(permission_id, company_id);
-permissions_user_groups (+indices declaration)
CREATE TABLE public.permissions_user_groups
(
id integer NOT NULL DEFAULT nextval('permissions_user_groups_id_seq'::regclass),
permission_id integer,
user_group_id integer,
CONSTRAINT permissions_user_groups_pkey PRIMARY KEY (id),
CONSTRAINT fk_rails_c1743245ea FOREIGN KEY (permission_id)
REFERENCES public.permissions (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_rails_e966751863 FOREIGN KEY (user_group_id)
REFERENCES public.user_groups (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
CREATE UNIQUE INDEX index_permissions_user_groups_on_permission_and_user_group
ON public.permissions_user_groups
USING btree
(permission_id, user_group_id);
CREATE INDEX index_permissions_user_groups_on_permission_id
ON public.permissions_user_groups
USING btree
(permission_id);
CREATE INDEX index_permissions_user_groups_on_user_group_id
ON public.permissions_user_groups
USING btree
(user_group_id);
-permissions_users (+indices declaration)
CREATE TABLE public.permissions_users
(
id integer NOT NULL DEFAULT nextval('permissions_users_id_seq'::regclass),
permission_id integer,
user_id integer,
CONSTRAINT permissions_users_pkey PRIMARY KEY (id),
CONSTRAINT fk_rails_26289d56f4 FOREIGN KEY (user_id)
REFERENCES public.users (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_rails_7ac7e9f5ad FOREIGN KEY (permission_id)
REFERENCES public.permissions (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
CREATE INDEX index_permissions_users_on_permission_id
ON public.permissions_users
USING btree
(permission_id);
CREATE UNIQUE INDEX index_permissions_users_on_permission_id_and_user_id
ON public.permissions_users
USING btree
(permission_id, user_id);
CREATE INDEX index_permissions_users_on_user_id
ON public.permissions_users
USING btree
(user_id);
I will have to run an SQL query like this many times:
SELECT
"permissions".*,
"permissions_users".*,
"companies_permissions".*,
"permissions_user_groups".*
FROM "permissions"
LEFT OUTER JOIN
"permissions_users" ON "permissions_users"."permission_id" = "permissions"."id"
LEFT OUTER JOIN
"companies_permissions" ON "companies_permissions"."permission_id" = "permissions"."id"
LEFT OUTER JOIN
"permissions_user_groups" ON "permissions_user_groups"."permission_id" = "permissions"."id"
WHERE
(companies_permissions.company_id = <company_id> OR
permissions_users.user_id in (<user_ids> OR NULL) OR
permissions_user_groups.user_group_id IN (<user_group_ids> OR NULL)) AND
permissions.item_type = 'Topic'
Let's say we have about 10,000+ permissions and a similar number of records in each of the other tables.
Do I need to worry about performance?
I mean... I have 3 LEFT OUTER JOINs across 4 tables, and the query should return results pretty fast (say <200ms).
I was thinking about declaring 1 "polymorphic" table, something like:
CREATE TABLE public.permissables
(
id integer NOT NULL DEFAULT nextval('permissables_id_seq'::regclass),
permission_id integer,
resource_id integer NOT NULL,
resource_type character varying NOT NULL,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT permissables_pkey PRIMARY KEY (id)
)
-- skipping indices declaration, but they would be present
Then I could run a query like this:
SELECT
permissions.*,
permissables.*
FROM permissions
LEFT OUTER JOIN
permissables ON permissables.permission_id = permissions.id
WHERE
permissions.item_type = 'Topic' AND (
(permissables.resource_id IN (<user_ids>) AND permissables.resource_type = 'User') OR
(permissables.resource_id = <company_id> AND permissables.resource_type = 'Company') OR
(permissables.resource_id IN (<user_group_ids>) AND permissables.resource_type = 'UserGroup')
)
QUESTIONS:
Which option is better/faster? Maybe there is a better way to do this?
a) 4 tables (permissions, companies_permissions, permissions_user_groups, permissions_users)
b) 2 tables (permissions, permissables)
Do I need to declare an index type other than btree on permissions.item_type?
Do I need to run VACUUM ANALYZE a few times per day on these tables to keep the indices effective (both options)?
EDIT1:
SQLFiddle examples:
wildplasser suggestion (from comment), not working: http://sqlfiddle.com/#!15/9723f8/1
Original query (4 tables): http://sqlfiddle.com/#!15/9723f8/2
{ I also removed backticks in wrong places, thanks @wildplasser }
I'd recommend abstracting all access to your permissions system to a couple of model classes. Unfortunately, I've found that permission systems like this do sometimes end up being performance bottlenecks, and I've found that it is sometimes necessary to significantly refactor your data representation.
So, my recommendation is to keep the permission-related queries isolated in a few classes and to keep the interface to those classes independent of the rest of the system.
Examples of good approaches here are what you have above. You don't actually join against the topics table; you already have the topic IDs you care about when you're constructing the permissions.
Examples of bad interfaces would be class interfaces that make it easy to join the permissions tables into arbitrary other SQL.
I understand you asked the question in terms of SQL rather than a particular framework on top of SQL, but from the rails constraint names it looks like you are using such a framework, and I think taking advantage of it will be useful to your future code maintainability.
In the 10,000-row case, I think either approach will work fine.
I'm not actually sure that the two approaches will be all that different. If you think about the query plans generated, and assume you're getting a small number of rows from each table, the join might be handled with a nested loop against each table, in much the same way that the OR query might be handled when the index is likely to return a small number of rows.
I have not fed a plausible data set into Postgres to figure out whether that's what it actually does given a real data set. I have reasonably high confidence that Postgres is smart enough to do that if it makes sense to do so.
The polymorphic approach does give you a bit more control and if you run into performance problems you may want to check if moving to it will help.
If you choose the polymorphic approach, I'd recommend writing code to go through and check that your data is consistent. That is, make sure that each resource_type and resource_id pair corresponds to an actual resource that exists in your system.
I'd make that recommendation in any case where application concerns force you to denormalize your data such that database constraints are not sufficient to enforce consistency.
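A consistency check of that kind could be sketched as a single query. This assumes the permissables table from the question, the users, companies and user_groups tables its foreign keys reference, and the three type names used in the question's query; treat it as a starting point rather than a finished audit.

```sql
-- Sketch: list permissables rows whose (resource_type, resource_id) pair
-- does not match an existing row in the table that the type names.
SELECT pm.id, pm.resource_type, pm.resource_id
FROM permissables pm
WHERE (pm.resource_type = 'User'
       AND NOT EXISTS (SELECT 1 FROM users u WHERE u.id = pm.resource_id))
   OR (pm.resource_type = 'Company'
       AND NOT EXISTS (SELECT 1 FROM companies c WHERE c.id = pm.resource_id))
   OR (pm.resource_type = 'UserGroup'
       AND NOT EXISTS (SELECT 1 FROM user_groups g WHERE g.id = pm.resource_id))
   -- Also catch type names the application does not know about.
   OR pm.resource_type NOT IN ('User', 'Company', 'UserGroup');
```

An empty result means every polymorphic reference resolves; any row returned is a dangling or mistyped reference that a foreign key would have rejected.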
If you start running into performance problems, here are the sorts of things you may need to do in the future:
Create a cache in your application mapping objects (such as topics) to the set of permissions for those objects.
Create a cache in your application caching all the permissions a given user has (including the groups they are a member of) for the objects in your application.
Materialize the user group permissions. That is, create a materialized view that combines the user_group permissions with the user permissions and the user group memberships.
In my experience the thing that really kills performance of permission systems is when you add something like permitting one group to be a member of another group. At that point you very quickly get to a point where you need caching or materialized views.
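As a rough sketch of the materialized-view idea, assuming PostgreSQL 9.4 or later: the question does not show how users are linked to groups, so the membership table user_groups_users and the view name effective_user_permissions below are hypothetical.

```sql
-- Hypothetical membership table (not in the question):
--   user_groups_users (user_group_id integer, user_id integer)
CREATE MATERIALIZED VIEW effective_user_permissions AS
SELECT pu.user_id, pu.permission_id
FROM permissions_users pu
WHERE pu.user_id IS NOT NULL AND pu.permission_id IS NOT NULL
UNION  -- UNION (not UNION ALL) removes duplicate (user, permission) pairs
SELECT ugu.user_id, pug.permission_id
FROM permissions_user_groups pug
JOIN user_groups_users ugu ON ugu.user_group_id = pug.user_group_id
WHERE ugu.user_id IS NOT NULL AND pug.permission_id IS NOT NULL;

-- REFRESH ... CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX index_effective_user_permissions
ON effective_user_permissions (user_id, permission_id);

-- Re-run whenever permissions or memberships have changed.
REFRESH MATERIALIZED VIEW CONCURRENTLY effective_user_permissions;
```

Reads then hit one indexed relation instead of several joins; the cost is that the view is only as fresh as the last refresh.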
Unfortunately, it's really hard to give more specific advice without actually having your data and looking at real query plans and real performance. I think that if you prepare for future changes you'll be fine though.
Maybe it's an obvious answer, but I think the option with 3 tables should be just fine. SQL databases are good at doing join operations and you have 10,000 records - this is not a big amount of data at all, so I am not sure what makes you think there will be a performance problem.
With proper indexes (btree should be OK), it should be fast. You can also go a bit further: generate sample data for your tables and see how your query actually performs on a realistic amount of data.
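One way to try this in PostgreSQL, as a sketch for an empty test database: it assumes users with ids 1..100 already exist (the foreign key on permissions_users needs them), and the user id 42 in the final query is arbitrary.

```sql
-- 10,000 permissions, all of type 'Topic'.
INSERT INTO permissions (item_id, item_type, created_at, updated_at)
SELECT g, 'Topic', now(), now()
FROM generate_series(1, 10000) AS g;

-- One permissions_users row per permission, spread over users 1..100.
INSERT INTO permissions_users (permission_id, user_id)
SELECT p.id, (p.id % 100) + 1
FROM permissions p;

-- Refresh planner statistics so the plan reflects the new rows.
ANALYZE permissions;
ANALYZE permissions_users;

-- Show the chosen plan together with actual run times.
EXPLAIN (ANALYZE, BUFFERS)
SELECT permissions.*
FROM permissions
LEFT OUTER JOIN permissions_users
  ON permissions_users.permission_id = permissions.id
WHERE permissions_users.user_id IN (42)
  AND permissions.item_type = 'Topic';
```

The same pattern extends to the other two join tables; the reported execution time can then be compared against the 200ms target directly.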
I also don't think you'll need to worry about something like running vacuum manually.
Regarding option two, the polymorphic table: it may not be a good idea, because you now have a single resource_id field that can point to different tables, which is a source of problems (for example, due to a bug you can have a record with resource_type=User and resource_id pointing to a Company; the table structure doesn't prevent it).
One more note: you do not say anything about the relations between User, UserGroup and Company. If they are all related too, it may be possible to fetch permissions using just the user id(s), also joining groups and companies to users.
And one more: you don't need ids in many-to-many tables. Nothing bad happens if you have them, but it's enough to have permission_id and user_id and make them a composite primary key.
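For illustration, permissions_users without the surrogate id might look like the sketch below. It is an alternative to the definition in the question, not something to run alongside it.

```sql
-- Sketch: the pair of foreign keys is the primary key.
CREATE TABLE public.permissions_users
(
  permission_id integer NOT NULL REFERENCES public.permissions (id),
  user_id integer NOT NULL REFERENCES public.users (id),
  PRIMARY KEY (permission_id, user_id)
);

-- The primary key index leads with permission_id, so lookups by user_id
-- still need their own index.
CREATE INDEX index_permissions_users_on_user_id
ON public.permissions_users (user_id);
```

The composite primary key also replaces the separate unique index on (permission_id, user_id) from the original schema.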
You can try to denormalize the many-to-many relations in a permission field on each of the 3 tables (user, user_group, company).
You can use this field to store the permissions in JSON format, and use it only for reading (SELECTs). You can still use the many-to-many tables for changing the permissions of specific users, groups and companies; just write a trigger on them that updates the denormalized permission field whenever the many-to-many table changes. With this solution you still get fast query execution on the SELECTs, while keeping the relationship normalized and in compliance with database standards.
Here is an example script, that I have written for mysql for a one-to-many relation, but a similar thing can be applied for your case as well:
https://github.com/martintaleski/mysql-denormalization/blob/master/one-to-many.sql
I have used this approach several times, and it makes sense when the SELECT statements outnumber and are more important than the INSERT, UPDATE and DELETE statements.
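A PostgreSQL version of the same idea, for the users side only, might look like the following sketch. The permission_ids column and both function names are made up for illustration, and jsonb_agg requires PostgreSQL 9.5 or later.

```sql
-- Hypothetical denormalized column; not part of the original schema.
ALTER TABLE users ADD COLUMN permission_ids jsonb NOT NULL DEFAULT '[]';

-- Rebuild the cached array for one user from the many-to-many table.
CREATE FUNCTION rebuild_user_permission_ids(target_user_id integer)
RETURNS void AS $$
  UPDATE users
  SET permission_ids = COALESCE(
        (SELECT jsonb_agg(pu.permission_id ORDER BY pu.permission_id)
         FROM permissions_users pu
         WHERE pu.user_id = target_user_id),
        '[]'::jsonb)
  WHERE id = target_user_id;
$$ LANGUAGE sql;

-- Row-level trigger: refresh the old and/or new owner of the changed row.
CREATE FUNCTION permissions_users_sync() RETURNS trigger AS $$
BEGIN
  IF TG_OP <> 'INSERT' THEN
    PERFORM rebuild_user_permission_ids(OLD.user_id);
  END IF;
  IF TG_OP <> 'DELETE' THEN
    PERFORM rebuild_user_permission_ids(NEW.user_id);
  END IF;
  RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER permissions_users_sync
AFTER INSERT OR UPDATE OR DELETE ON permissions_users
FOR EACH ROW EXECUTE PROCEDURE permissions_users_sync();
```

Reading a user's permissions is then a single-row lookup such as SELECT permission_ids FROM users WHERE id = 42, while all writes still go through permissions_users.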
In case you do not often change your permissions, materialized views might speed up your search enormously. I will prepare an example based on your setting later today and will post it. Afterwards, we can do some benchmark.
Nevertheless, materialized views require an update of the materialized view after changing the data. So that solution might be fast, but will speed up your queries only if basic data are not changed so often.
I have my user table (pseudo sql, because I use an ORM and I must support several different DB types):
id: INTEGER, PK, AUTOINCREMENT
UUID : BINARY(16) (inserted by an update, it's a hash(id) )
I am currently using id for FK in all other tables.
However, in my REST API, I have to serve information keyed by the UUID, which later causes a problem when querying.
Should I:
FK on the UUID instead?
just lookup id(UUID) each time (fast thanks to cache mechanism after a while)?
In general, it is better to use the auto-incremented id for the foreign key reference rather than some other combination of unique columns.
One important reason is that indexes on a single integer are more efficient than indexes on other column types -- if for no other reason than the index being smaller, so it occupies less disk and less memory. Also, there is additional overhead to storing the longer UUID in secondary tables.
This is not the only consideration. Another consideration is that you could change the UUID, if necessary, without changing the foreign key references. For instance, you may wake up one day and say "that id has to start with AAA". You can alter the table and update the table and be done with it -- or you could worry about foreign key references as well. Or, you might add an organization column and decide that the unique key is a combination of the UUID and organization. These operations are much harder/slower if the UUID is being used as a foreign key reference.
When you have composite primary keys (more than one column), using the auto-incremented id is an even better idea. In this case, using the id for joins prevents mistakes where one of the join conditions might be left out.
As you point out, looking up the id for a given UUID should be a fast operation with the correct indexes. There may be some borderline cases where you would not want to have an id, but in general, it is a good idea.
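To make the lookup concrete, here is a small sketch in MySQL-flavoured SQL, since the question uses BINARY(16). The orders table is invented for illustration, and :uuid stands for a bind parameter supplied by the application or ORM.

```sql
-- Hypothetical tables; `orders` is not part of the question.
CREATE TABLE users (
  id   INTEGER PRIMARY KEY,
  uuid BINARY(16) UNIQUE        -- the unique index keeps the lookup cheap
);
CREATE TABLE orders (
  id      INTEGER PRIMARY KEY,
  user_id INTEGER NOT NULL,     -- the foreign key stays on the integer id
  FOREIGN KEY (user_id) REFERENCES users (id)
);

-- Resolve the UUID once at the API boundary, then join on the integer id.
SELECT o.*
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.uuid = :uuid;
```

Only the users table carries the 16-byte value; every referencing table keeps the smaller integer key.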
I've recently started developing my first serious application which uses a SQL database, and I'm using phpMyAdmin to set up the tables. There are a couple optional "features" I can give various columns, and I'm not entirely sure what they do:
Primary Key
Index
I know what a PK is for and how to use it, but I guess my question with regards to that is why does one need one - how is it different from merely setting a column to "Unique", other than the fact that you can only have one PK? Is it just to let the programmer know that this value uniquely identifies the record? Or does it have some special properties too?
I have no idea what "Index" does - in fact, the only times I've ever seen it in use are (1) that my primary keys seem to be indexed, and (2) I heard that indexing is somehow related to performance; that you want indexed columns, but not too many. How does one decide which columns to index, and what exactly does it do?
edit: should one index columns one is likely to want to ORDER BY?
Thanks a lot,
Mala
A primary key is often a numerical 'id' column for your records, and this id column is typically auto-incremented.
For example, if you have a books table with an id field, where the id is the primary key and is also set to auto_increment (under 'Extra' in phpMyAdmin), then when you first add a book to the table, its id will be '1'. The next book's id will automatically be '2', and so on. Normally, every table should have a primary key to help identify and find records easily.
Indexes are used when you need to retrieve certain information from a table regularly. For example, if you have a users table, and you will need to access the email column a lot, then you can add an index on email, and this will cause queries accessing the email to be faster.
However there are also downsides for adding unnecessary indexes, so add this only on the columns that really do need to be accessed more than the others. For example, UPDATE, DELETE and INSERT queries will be a little slower the more indexes you have, as MySQL needs to store extra information for each indexed column. More info can be found at this page.
Edit: Yes, columns that need to be used in ORDER BY a lot should have indexes, as well as those used in WHERE.
The primary key is basically a unique, indexed column that acts as the "official" ID of rows in that table. Most importantly, it is generally used for foreign key relationships, i.e. if another table refers to a row in the first, it will contain a copy of that row's primary key.
Note that it's possible to have a composite primary key, i.e. one that consists of more than one column.
Indexes improve lookup times. They're usually tree-based, so that looking up a certain row via an index takes O(log(n)) time rather than scanning through the full table.
Generally, any column in a large table that is frequently used in WHERE, ORDER BY or (especially) JOIN clauses should have an index. Since the index needs to be updated for every INSERT, UPDATE or DELETE, it slows down those operations. If you have few writes and lots of reads, then index to your heart's content. If you have both lots of writes and lots of queries that would require indexes on many columns, then you have a big problem.
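As a small MySQL sketch of the advice above, using a hypothetical users table with an email column (the index names are arbitrary):

```sql
-- Hypothetical table for illustration.
CREATE TABLE users (
  id INT NOT NULL AUTO_INCREMENT,
  email VARCHAR(255) NOT NULL,
  created_at DATETIME NOT NULL,
  PRIMARY KEY (id)              -- the primary key is indexed automatically
);

-- Secondary indexes for a column used in WHERE and one used in ORDER BY.
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX idx_users_created_at ON users (created_at);

-- EXPLAIN reports which index, if any, the optimizer picks for each query.
EXPLAIN SELECT id FROM users WHERE email = 'someone@example.com';
EXPLAIN SELECT id FROM users ORDER BY created_at LIMIT 20;
```

Checking the EXPLAIN output before and after adding an index is a reasonable way to decide whether the index earns its write overhead.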
The difference between a primary key and a unique key is best explained through an example.
We have a table of users:
USER_ID number
NAME varchar(30)
EMAIL varchar(50)
In that table the USER_ID is the primary key. The NAME is not unique - there are a lot of John Smiths and Muhammed Khans in the world. The EMAIL is necessarily unique, otherwise the worldwide email system wouldn't work. So we put a unique constraint on EMAIL.
Why then do we need a separate primary key? Three reasons:
the numeric key is more efficient when used in foreign key relationships, as it takes less space
the email can change (for example, swapping provider) but the user is still the same; rippling a change of a primary key value throughout a schema is always a nightmare
it is always a bad idea to use sensitive or private information as a foreign key
In the relational model, any column or set of columns that is guaranteed to be both present and unique in the table can be called a candidate key to the table. "Present" means "NOT NULL". It's common practice in database design to designate one of the candidate keys as the primary key, and to use references to the primary key to refer to the entire row, or to the subject matter item that the row describes.
In SQL, a PRIMARY KEY constraint amounts to a NOT NULL constraint for each primary key column, and a UNIQUE constraint for all the primary key columns taken together. In practice many primary keys turn out to be single columns.
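A short sketch of that equivalence, with hypothetical table names; the two declarations differ in that a table may carry only one PRIMARY KEY but several UNIQUE constraints.

```sql
-- Declared as a primary key:
CREATE TABLE users_pk (
  user_id INT,
  email VARCHAR(50),
  PRIMARY KEY (user_id)
);

-- Spelled out as the two constraints it amounts to:
CREATE TABLE users_unique (
  user_id INT NOT NULL,
  email VARCHAR(50),
  UNIQUE (user_id)
);

-- Accepted by both tables:
INSERT INTO users_pk (user_id, email) VALUES (1, 'a@example.com');
INSERT INTO users_unique (user_id, email) VALUES (1, 'a@example.com');

-- Rejected by both tables: a duplicate key value.
INSERT INTO users_pk (user_id, email) VALUES (1, 'b@example.com');
INSERT INTO users_unique (user_id, email) VALUES (1, 'b@example.com');
```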
For most DBMS products, a PRIMARY KEY constraint will also result in an index being built on the primary key columns automatically. This speeds up the systems checking activity when new entries are made for the primary key, to make sure the new value doesn't duplicate an existing value. It also speeds up lookups based on the primary key value and joins between the primary key and a foreign key that references it. How much speed up occurs depends on how the query optimizer works.
Originally, relational database designers looked for natural keys in the data as given. In recent years, the tendency has been to always create a column called ID, an integer as the first column and the primary key of every table. The autogenerate feature of the DBMS is used to ensure that this key will be unique. This tendency is documented in the "Oslo design standards". It isn't necessarily relational design, but it serves some immediate needs of the people who follow it. I do not recommend this practice, but I recognize that it is the prevalent practice.
An index is a data structure that allows for rapid access to a few rows in a table, based on a description of the columns of the table that are indexed. The index consists of copies of certain table columns, called index keys, interspersed with pointers to the table rows. The pointers are generally hidden from the DBMS users. Indexes work in tandem with the query optimizer. The user specifies in SQL what data is being sought, and the optimizer comes up with index strategies and other strategies for translating what is being sought into a strategy for finding it. There is some kind of organizing principle, such as sorting or hashing, that enables an index to be used for fast lookups, and certain other uses. This is all internal to the DBMS, once the database builder has created the index or declared the primary key.
Indexes can be built that have nothing to do with the primary key. A primary key can exist without an index, although this is generally a very bad idea.
I'm running into an issue with a join: getting back too many records. I added a table to the set of joins and the number of rows expanded. Usually when this happens I add a select of all the ID fields that are involved in the join. That way it's pretty obvious where the expansion is happening and I can change the ON of the join to fix it. Except in this case, the table that I added doesn't have an ID field. This is a problem. But perhaps I'm wrong.
Should every table in a database have an IDENTITY field that's used as the PK? Are there any drawbacks to having an ID field in every table? What if you're reasonably sure this table will never be used in a PK/FK relationship?
When having an identity column is not a good idea?
Surrogate vs. natural/business keys
Wikipedia Surrogate Key article
There are two concepts that are close but should not be confused: IDENTITY and PRIMARY KEY
Every table (except for the rare conditions) should have a PRIMARY KEY, that is a value or a set of values that uniquely identify a row.
See here for discussion why.
IDENTITY is a property of a column in SQL Server which means that the column will be filled automatically with incrementing values.
Due to the nature of this property, the values of this column are inherently UNIQUE.
However, no UNIQUE constraint or UNIQUE index is automatically created on an IDENTITY column, and after issuing SET IDENTITY_INSERT ON it's possible to insert duplicate values into an IDENTITY column, unless it has been explicitly UNIQUE constrained.
The IDENTITY column should not necessarily be a PRIMARY KEY, but most often it's used to fill the surrogate PRIMARY KEYs
It may or may not be useful in any particular case.
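The duplicate-value behaviour can be reproduced with a few T-SQL statements. This is a sketch for a scratch SQL Server database; the table name identity_demo is made up.

```sql
CREATE TABLE dbo.identity_demo (
  id INT IDENTITY(1, 1) NOT NULL,   -- no PRIMARY KEY or UNIQUE constraint
  label VARCHAR(20) NOT NULL
);

INSERT INTO dbo.identity_demo (label) VALUES ('generated');  -- gets id 1

-- With IDENTITY_INSERT on, an explicit id is accepted even if it repeats.
SET IDENTITY_INSERT dbo.identity_demo ON;
INSERT INTO dbo.identity_demo (id, label) VALUES (1, 'explicit');
SET IDENTITY_INSERT dbo.identity_demo OFF;

-- Lists any id that now appears more than once.
SELECT id, COUNT(*) AS copies
FROM dbo.identity_demo
GROUP BY id
HAVING COUNT(*) > 1;
```

Adding PRIMARY KEY or UNIQUE to the id column makes the second insert fail instead.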
Therefore, the answer to your question:
The question: should every table in a database have an IDENTITY field that's used as the PK?
is this:
No. There are cases when a database table should NOT have an IDENTITY field as a PRIMARY KEY.
Three cases come into my mind when it's not the best idea to have an IDENTITY as a PRIMARY KEY:
If your PRIMARY KEY is composite (like in many-to-many link tables)
If your PRIMARY KEY is natural (like, a state code)
If your PRIMARY KEY should be unique across databases (in this case you use GUID / UUID / NEWID)
All these cases imply the following condition:
You shouldn't have IDENTITY when you care for the values of your PRIMARY KEY and explicitly insert them into your table.
Update:
Many-to-many link tables should have the pair of id's to the table they link as the composite key.
It's a natural composite key which you already have to use (and make UNIQUE), so there is no point to generate a surrogate key for this.
I don't see why would you want to reference a many-to-many link table from any other table except the tables they link, but let's assume you have such a need.
In this case, you just reference the link table by the composite key.
This query:
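CREATE TABLE a (id INT PRIMARY KEY, data VARCHAR(100));
CREATE TABLE b (id INT PRIMARY KEY, data VARCHAR(100));
CREATE TABLE ab (a_id INT NOT NULL, b_id INT NOT NULL, PRIMARY KEY (a_id, b_id));
CREATE TABLE business_rule (id INT PRIMARY KEY, a_id INT, b_id INT, FOREIGN KEY (a_id, b_id) REFERENCES ab (a_id, b_id));
-- a_id is already stored in business_rule, so a single join reaches a
SELECT *
FROM business_rule br
JOIN a
ON a.id = br.a_id;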
is much more efficient than this one:
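CREATE TABLE a (id INT PRIMARY KEY, data VARCHAR(100));
CREATE TABLE b (id INT PRIMARY KEY, data VARCHAR(100));
CREATE TABLE ab (id INT PRIMARY KEY, a_id INT NOT NULL, b_id INT NOT NULL, UNIQUE (a_id, b_id));
CREATE TABLE business_rule (id INT PRIMARY KEY, ab_id INT, FOREIGN KEY (ab_id) REFERENCES ab (id));
-- a_id lives only in ab, so reaching a takes an extra join through ab
SELECT *
FROM business_rule br
JOIN ab
ON br.ab_id = ab.id
JOIN a
ON a.id = ab.a_id;
because the first query reads a_id directly from business_rule and needs one join fewer.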
Almost always yes. I generally default to including an identity field unless there's a compelling reason not to. I rarely encounter such reasons, and the cost of the identity field is minimal, so generally I include.
Only thing I can think of off the top of my head where I didn't was a highly specialized database that was being used more as a datastore than a relational database where the DBMS was being used for nearly every feature except significant relational modelling. (It was a high volume, high turnover data buffer thing.)
I'm a firm believer that natural keys are often far worse than artificial keys because you often have no control over whether they will change which can cause horrendous data integrity or performance problems.
However, there are some (very few) natural keys that make sense without being an identity field (two-letter state abbreviation comes to mind, it is extremely rare for these official type abbreviations to change.)
Any table which is a join table to model a many to many relationship probably also does not need an additional identity field. Making the two key fields together the primary key will work just fine.
Other than that I would, in general, add an identity field to most other tables unless given a compelling reason in that particular case not to. It is a bad practice to fail to create a primary key on a table or if you are using surrogate keys to fail to place a unique index on the other fields needed to guarantee uniqueness where possible (unless you really enjoy resolving duplicates).
Every table should have some set of field(s) that uniquely identify it. Whether or not there is a numeric identifier field separate from the data fields will depend on the domain you are attempting to model. Not all data easily falls into the 'single numeric id' paradigm, and as such it would be inappropriate to force it. Given that, a lot of data does easily fit in this paradigm and as such would call for such an identifier. There is no one answer to always do X in any programming environment, and this is another example.
If you have modelled, designed, normalised etc, then you will have no identity columns.
You will have identified natural and candidate keys for your tables.
You may decide on a surrogate key because of the physical architecture (eg narrow, numeric, strictly monotonically increasing), say, because using a nvarchar(100) column is not a good idea (still need unique constraint).
Or because of ideology: they appeal to OO developers I've found.
Ok, assume ID columns. As your db gets more complex, say several layers, how can you join parent and grandchild tables directly? You can't: you always need intermediate tables and well indexed PK-FK columns. With a composite key, it's all there for you...
Don't get me wrong: I use them. But I know why I use them...
Edit:
I'd be interested to collate "always ID"+"no stored procs" matches on one hand, with "use stored procs"+"IDs when they benefit" on the other...
No. Whenever you have a table with an artificial identity column, you also need to identify the natural primary key for the table and ensure that there is a unique constraint on that set of columns too so that you don't get two rows that are identical apart from the meaningless identity column by accident.
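A minimal sketch of that rule, using a hypothetical state table in which id is the surrogate key and code is the natural key:

```sql
CREATE TABLE state (
  id INT PRIMARY KEY,           -- surrogate key, meaningless on its own
  code CHAR(2) NOT NULL,        -- natural key
  name VARCHAR(50) NOT NULL,
  UNIQUE (code)                 -- keeps the natural key unique as well
);

INSERT INTO state (id, code, name) VALUES (1, 'NY', 'New York');

-- Rejected by UNIQUE (code); without that constraint this row would be
-- accepted and differ from the first only in the surrogate id.
INSERT INTO state (id, code, name) VALUES (2, 'NY', 'New York');
```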
Adding an identity column is not cost free. There is an overhead in adding an unnecessary identity column to a table - typically 4 bytes per row of storage for the identity value, plus a whole extra index (which will probably weigh in at 8-12 bytes per row plus overhead). It also takes slightly longer to work out the most cost-effective query plan because there is an extra index per table. Granted, if the table is small and the machine is big, this overhead is not critical - but for the biggest systems, it matters.
Yes, for the vast majority of cases.
Edge cases or exceptions might be things like:
two-way join tables to model m:n relationships
temporary tables used for bulk-inserting huge amounts of data
But other than that, I think there is no good reason against having a primary key to uniquely identify each row in a table, and in my opinion, using an IDENTITY field is one of the best choices (I prefer surrogate keys over natural keys - they're more reliable, stable, never changing etc.).
Marc
I can't think of any drawback to having an ID field in each table, provided the type of your ID field leaves enough room for your table to grow.
However, you don't necessarily need a single field to ensure the identity of your rows.
So no, a single ID field is not mandatory.
Primary and Foreign Keys can consist not only of one field, but of multiple fields. This is typical for tables implementing a N-N relationship.
You can perfectly have PRIMARY KEY (fa, fb) on your table:
CREATE TABLE t(fa INT , fb INT);
ALTER TABLE t ADD PRIMARY KEY(fa , fb);
Recognize the distinction between an Identity field and a key... Every table should have a key, to eliminate the data corruption of inadvertently entering multiple rows that represent the same 'entity'. If the only key a table has is a meaningless surrogate key, then this function is effectively missing.
otoh, No table 'needs' an identity, and certainly not every table benefits from one... Examples are: A table with a short and functional key, a table which does not have any other table referencing it through a foreign Key, or a table which is in a one to zero-or-one relationship with another table... none of these need an Identity
I'd say, if you can find a simple, natural key in your table (i.e. one column), use that as a key instead of an identity column.
I generally give every table some kind of unique identifier, whether it is natural or generated, because then I am guaranteed that every row is uniquely identified somehow.
Personally, I avoid IDENTITY (incrementing identity columns, like 1, 2, 3, 4) columns like the plague. They cause a lot of hassle, especially if you delete rows from that table. I use generated uniqueidentifiers instead if there is no natural key in the table.
Anyway, no idea if this is the accepted practice, just seems right to me. YMMV.
I'm currently in the process of designing the database tables for a customer & website management application. My question is in regards to the use of primary keys as functional parts of a table (and not assigning "ID" numbers to every table just because).
For example, here are four related tables from the database so far, one of which uses the traditional primary key number, the others which use unique names as the primary key:
--
-- website
--
CREATE TABLE IF NOT EXISTS `website` (
`name` varchar(126) NOT NULL,
`client_id` int(11) NOT NULL,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`notes` text NOT NULL,
`website_status` varchar(26) NOT NULL,
PRIMARY KEY (`name`),
KEY `client_id` (`client_id`),
KEY `website_status` (`website_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
--
-- website_status
--
CREATE TABLE IF NOT EXISTS `website_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `website_status` (`name`) VALUES
('demo'),
('disabled'),
('live'),
('purchased'),
('transfered');
--
-- client
--
CREATE TABLE IF NOT EXISTS `client` (
`id` int(11) NOT NULL auto_increment,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`client_status` varchar(26) NOT NULL,
`firstname` varchar(26) NOT NULL,
`lastname` varchar(46) NOT NULL,
`address` varchar(78) NOT NULL,
`city` varchar(56) NOT NULL,
`state` varchar(2) NOT NULL,
`zip` int(11) NOT NULL,
`country` varchar(3) NOT NULL,
`phone` text NOT NULL,
`email` varchar(78) NOT NULL,
`notes` text NOT NULL,
PRIMARY KEY (`id`),
KEY `client_status` (`client_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;
--
-- client_status
--
CREATE TABLE IF NOT EXISTS `client_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `client_status` (`name`) VALUES
('affiliate'),
('customer'),
('demo'),
('disabled'),
('reseller');
As you can see, 3 of the 4 tables use their 'name' as the primary key. I know that these will always be unique. In 2 of the cases (the *_status tables) I am basically using a dynamic replacement for ENUM, since status options could change in the future, and for the 'website' table, I know that the 'name' of the website will always be unique.
I'm wondering whether this is sound logic (getting rid of table IDs when I know the name is always going to be a unique identifier) or a recipe for disaster. I'm not a seasoned DBA, so any feedback, critique, etc. would be extremely helpful.
Thanks for taking the time to read this!
There are 2 reasons I would always add an ID number to a lookup / ENUM table:
1. If you are referencing a single-column table by name, then you may be better served by using a constraint.
2. What happens if you want to rename one of the client_status entries? E.g. if you wanted to change the name from 'affiliate' to 'affiliate user', you would need to update the client table, which should not be necessary. The ID number serves as the reference and the name is the description.
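To make the second point concrete, here is a minimal sketch using Python's built-in sqlite3 module. It uses SQLite rather than the MySQL from the question, purely so it runs on its own. The `id` column on `client_status` and the `client_status_id` column on `client` are assumptions for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Lookup table with an assumed surrogate id; the name is only a description.
    CREATE TABLE client_status (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Simplified client table that references the status by id (assumed column).
    CREATE TABLE client (
        id               INTEGER PRIMARY KEY,
        firstname        TEXT NOT NULL,
        client_status_id INTEGER NOT NULL REFERENCES client_status (id)
    );
    INSERT INTO client_status (id, name) VALUES (1, 'affiliate'), (2, 'customer');
    INSERT INTO client (firstname, client_status_id)
        VALUES ('Ann', 1), ('Bob', 1), ('Cy', 2);
""")

# Renaming the status touches exactly one row, in the lookup table only.
cur = conn.execute("UPDATE client_status SET name = 'affiliate user' WHERE id = 1")
renamed_rows = cur.rowcount

# The client rows were never modified, yet they resolve to the new name.
status_names = [row[0] for row in conn.execute(
    "SELECT s.name FROM client c "
    "JOIN client_status s ON s.id = c.client_status_id ORDER BY c.id"
)]
print(renamed_rows, status_names)
```

With the name as the key, the same rename would have to rewrite every referencing row in `client` as well.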
In the website table, if you are confident that the name will be unique then it is fine to use as a primary key. Personally I would still assign a numeric ID as it reduces the space used in foreign key tables and I find it easier to manage.
EDIT:
As stated above, you will run into problems if the website is renamed. By making the name the primary key you will make it very difficult, if not impossible, to change it at a later date.
When making natural PRIMARY KEYs, make sure their uniqueness is under your control.
If you're absolutely sure you will never ever have a uniqueness violation, then it's OK to use these values as PRIMARY KEYs.
Since website_status and client_status seem to be generated and used by you and only by you, it's acceptable to use them as a PRIMARY KEY, though having a long key may impact performance.
The website name seems to be under the control of the outside world, which is why I'd make it a plain field. What if they want to rename their website?
The counterexamples would be SSNs and ZIP codes: it's not you who generates them, and there is no guarantee that they won't ever be duplicated.
Kimberly Tripp has an excellent series of blog articles (GUIDs as PRIMARY KEYs and/or the clustering key and The Clustered Index Debate Continues) on the issue of creating clustered indexes and choosing the primary key (related issues, but not always exactly the same). Her recommendation is that a clustered index/primary key should be:
1. Unique (otherwise useless as a key)
2. Narrow (the key is used in all non-clustered indexes, and in foreign-key relationships)
3. Static (you don't want to have to change all related records)
4. Always Increasing (so new records always get added to the end of the table, and don't have to be inserted in the middle)
Using "Name" as your key, while it seems to satisfy #1, doesn't satisfy ANY of the other three.
Even for your "lookup" table, what if your boss decides to change all affiliates to partners instead? You'll have to modify all rows in the database that use this value.
From a performance perspective, I'm probably most concerned that a key be narrow. If your website name is actually a long URL, then that could really bloat the size of any non-clustered indexes, and all tables that use it as a foreign key.
Besides all the other excellent points that have already been made, I would add one more word of caution against using large fields as clustering keys in SQL Server (if you're not using SQL Server, then this probably doesn't apply to you).
I add this because in SQL Server, the primary key on a table is by default also the clustering key (you can change that if you want to and know about it, but in most cases it's not done).
The clustering key that determines the physical ordering of the SQL Server table is also being added to every single non-clustered index on that table. If you have only a few hundred to a few thousand rows and one or two indices, that's not a big deal. But if you have really large tables with millions of rows, and potentially lots of indices to speed up the queries, this will indeed cause a lot of disk space and server memory to be wasted unnecessarily.
E.g. if your table has 10 million rows, 10 non-clustered indices, and your clustering key is 26 bytes instead of 4 (for an INT), then you're wasting 10 million × 10 × 22 bytes, for a total of 2.2 billion bytes (approx. 2.2 GB) - that's not peanuts anymore!
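The arithmetic behind that figure, as a back-of-the-envelope sketch in Python. It counts only the extra key bytes per index entry and ignores page overhead, fill factor and compression, so treat it as a rough lower bound rather than a measurement.

```python
rows = 10_000_000
nonclustered_indexes = 10
wide_key_bytes = 26   # e.g. a varchar(26) clustering key
int_key_bytes = 4     # a plain INT clustering key

# Every non-clustered index entry carries a copy of the clustering key.
extra_bytes_per_entry = wide_key_bytes - int_key_bytes
wasted_bytes = rows * nonclustered_indexes * extra_bytes_per_entry

print(f"{wasted_bytes:,} bytes = {wasted_bytes / 10**9:.1f} GB")
# prints "2,200,000,000 bytes = 2.2 GB"
```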
Again - this only applies to SQL Server, and only if you have really large tables with lots of non-clustered indices on them.
Marc
"If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's."
If you're absolutely sure you will never ever have uniqueness violation, then don't bother to define the key.
Personally, I think you will run into trouble using this idea. As you end up with more parent-child relationships, you end up with a huge amount of work when the names change (as they always will sooner or later). There can be a big performance hit when you have to update a child table that has thousands of rows because the name of the website changed. And you have to plan for how to make sure that those changes happen. Otherwise, website name changes (oops, we let the name expire and someone else bought it) either break because of the foreign key constraint, or you need to put in an automated way (cascade update) to propagate the change through the system. If you use cascading updates, then you can suddenly bring your system to a dead halt while a large change is processed. This is not considered to be a good thing. It really is more effective and efficient to use ids for relationships and then put unique indexes on the name field to ensure they stay unique. Database design needs to consider maintenance of the data integrity and how that will affect performance.
Another thing to consider is that website names tend to be longer than a few characters. This means the performance difference between using an id field for joins and the name for joins could be quite significant. You have to think of these things at the design phase, as it is too late to change to an ID when you have a production system with millions of records that is timing out and the fix is to completely restructure the database and rewrite all of the SQL code. That is not something you can fix in fifteen minutes to get the site working again.
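As a rough illustration of the cascading-update cost, here is a small sketch using Python's built-in sqlite3 module (SQLite instead of MySQL, so it is self-contained). The `page_view` child table and the example domain names are hypothetical. Renaming one website forces every child row that references it to be rewritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE website (
        name TEXT PRIMARY KEY
    );
    -- page_view is a hypothetical child table keyed by the website name.
    CREATE TABLE page_view (
        id           INTEGER PRIMARY KEY,
        website_name TEXT NOT NULL
            REFERENCES website (name) ON UPDATE CASCADE
    );
    INSERT INTO website (name) VALUES ('old-name.example');
""")
conn.executemany(
    "INSERT INTO page_view (website_name) VALUES (?)",
    [("old-name.example",)] * 1000,
)

# One logical change to the parent row...
conn.execute(
    "UPDATE website SET name = 'new-name.example' WHERE name = 'old-name.example'"
)

# ...has rewritten every child row through the cascade.
rewritten = conn.execute(
    "SELECT COUNT(*) FROM page_view WHERE website_name = 'new-name.example'"
).fetchone()[0]
stale = conn.execute(
    "SELECT COUNT(*) FROM page_view WHERE website_name = 'old-name.example'"
).fetchone()[0]
print(rewritten, stale)
```

With an integer id as the foreign key and a unique index on the name, the same rename would modify a single row in `website`.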
This just seems like a really bad idea. What if you need to change the value of the enum? The idea is to make it a relational database and not a set of flat files. At this point, why have the client_status table? Moreover, if you are using the data in an application, by using a type like a GUID or INT you can validate the type and avoid bad data (insofar as validating the type goes). Thus, it is another of many lines of defense against hacking.
I would argue that a database that is resistant to corruption, even if it runs a little slower, is better than one that isn’t.
In general, surrogate keys (such as arbitrary numeric identifiers) undermine the integrity of the database. Primary keys are the main way of identifying rows in the database; if the primary key values are not meaningful, the constraint is not meaningful. Any foreign keys that refer to surrogate primary keys are therefore also suspect. Whenever you have to retrieve, update or delete individual rows (and be guaranteed of affecting only one), the primary key (or another candidate key) is what you must use; having to work out what a surrogate key value is when there is a meaningful alternative key is a redundant and potentially dangerous step for users and applications.
Even if it means using a composite key to ensure uniqueness, I would advocate using a meaningful, natural set of attributes as the primary key, whenever possible. If you need to record the attributes anyway, why add another one? That said, surrogate keys are fine when there is no natural, stable, concise, guaranteed-to-be-unique key (e.g. for people).
You could also consider using index key compression, if your DBMS supports it. This can be very effective, especially for indexes on composite keys (think trie data structures), and especially if the least selective attributes can appear first in the index.
I think I am in agreement with cheduardo. It has been 25 years since I took a course in database design but I recall being told that database engines can more efficiently manage and load indexes that use character keys. The comments about the database having to update thousands of records when a key is changed and on all of the added space being taken up by the longer keys and then having to be transferred across systems, assumes that the key is actually stored in the records and that it does not have to be transferred across systems anyway. If you create an index on a column(s) of a table, I do not think the value is stored in the records of the table (unless you set some option to do so).
If you have a natural key for a table, even if it is changed occasionally, creating another key creates a redundancy that could result in data integrity issues and actually creates even more information that needs to be stored and transferred across systems. I work for a team that decided to store the local application settings in the database. They have an identity column for each setting, a section name, a key name, and a key value. They have a stored procedure (another holy war) to save a setting that ensures it does not appear twice. I have yet to find a case where I would use a setting's ID. I have, however, ended up with multiple records with the same section and key name that caused my application to fail. And yes, I know that could have been avoided by defining a constraint on the columns.
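The constraint mentioned in that last sentence could look something like the following sketch, again using Python's built-in sqlite3 module. The table and column names (`app_setting`, `section`, `key_name`, `value`) are assumptions, since the actual settings schema is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE app_setting (
        id       INTEGER PRIMARY KEY,   -- the surrogate identity column
        section  TEXT NOT NULL,
        key_name TEXT NOT NULL,
        value    TEXT NOT NULL,
        UNIQUE (section, key_name)      -- the natural key, enforced by the DB
    )
""")
conn.execute(
    "INSERT INTO app_setting (section, key_name, value) VALUES ('ui', 'theme', 'dark')"
)

# A second row with the same section and key name is rejected by the database
# itself, regardless of what the stored procedure or application code does.
try:
    conn.execute(
        "INSERT INTO app_setting (section, key_name, value) VALUES ('ui', 'theme', 'light')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM app_setting").fetchone()[0]
print(duplicate_rejected, row_count)
```

The surrogate id and the natural key can coexist here: the id is there for references, and the unique constraint is what actually protects the data.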
Here are a few points to consider before deciding on the keys for a table:
A numeric key is more suitable when you use references (foreign keys). Since you are not using foreign keys, it is OK in your case to use a non-numeric key.
A non-numeric key uses more space than a numeric key, which can decrease performance.
Numeric keys make the database simpler to understand (you can easily tell the number of rows just by looking at the last row).
You NEVER know when the company you work for suddenly explodes in growth and you have to hire 5 developers overnight. Your best bet is to use numeric (integer) primary keys because they will be much easier for the entire team to work with AND will help your performance if and when the database grows. If you have to break records out and partition them, you might want to use the primary key. If you are adding records with a datetime stamp (as every table should), and there is an error somewhere in the code that updates that field incorrectly, the only way to confirm whether the record was entered in the proper sequence is to check the primary keys. There are probably 10 more TSQL or debugging reasons to use INT primary keys, not the least of which is writing a simple query to select the last 5 records entered into the table.
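For what it's worth, the "last 5 records" query might look like this. It is a sketch in Python's built-in sqlite3 module with a cut-down, hypothetical `client` table. SQLite and MySQL use `LIMIT`, whereas T-SQL would use `SELECT TOP 5` instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE client (id INTEGER PRIMARY KEY AUTOINCREMENT, firstname TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO client (firstname) VALUES (?)",
    [(f"client{n}",) for n in range(1, 9)],  # eight rows, ids 1..8
)

# The highest ids are the most recently inserted rows.
last_five = [row[0] for row in conn.execute(
    "SELECT id FROM client ORDER BY id DESC LIMIT 5"
)]
print(last_five)  # prints [8, 7, 6, 5, 4]
```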