Is composite index really necessary for a table with composite primary key? - sql

Let's say we have a table to store users' favourite pictures, with a composite primary key pair(UserId, PictureId). Books normally say in this case you need a composite index based on (UserId, PictureId), which normally appears in the WHERE clause as (UserId=103 AND PictureId=1234). But I think the dababase engine should be smart enough to use two individual indexes based on the two columns separately. Just get the set of row numbers from each of the index and find the ones that are present in both sets. That way, a composite index is not necessary.
So, in reality can database engines do that?

There'd be no advantage to using the two separate single-column indexes; the engine would be better off doing a table scan.
The point of using an index is to make access faster. If the engine used two indexes, it would have to sort at least one set of data from one of the indexes and merge the results from the two indexes. That would be a lot more work than reading just one composite index, especially since the composite index allows for an index-only scan.

Most database engines will require the composite index to enforce the primary key. As such, it's a "free" index that you're going to have anyway - why worry about it?
There may be some benefit (if the index is on UserID,PictureID) to adding a second index just on PictureID. Any query on just UserID will be able to use the composite index, whereas a query just using PictureID would be unable to do so.

I think in the use case you describe, the composite index is not necessary. That would be useful if you were doing a query on, say, a given set of user IDs plus a given set of picture IDs. But when would you ever need that? You'd be more likely to query all a user's pictures in a given date range, or lookup a specific picture by ID. This would suggest an index structure of one composite user id + date index, and another picture id only index.
It always depends on the distribution of records in your database, and the types of queries you will be running most frequently.

PRIMARY KEY or UNIQUE constraints are abstract, theoretical concepts.
An INDEX is practical physical thing that lives in the real world.
In practice, indexes can be used to enforce PK or UNIQUE constraints. But other techniques could also be used (eg for a small domain: a bitmap)

What you describe would be significantly more expensive than using composite index.
First a set of rows would need to be identified from the first index, then a set of rows from the second and finally the set-intersection performed between the two.
--- UPDATE ---
Note hat this is the price you would pay for every INSERT/UPDATE and every foreign key check, not just SELECT.
Also, there may be concurrency issues involved - depending on how the DBMS is implemented, enforcing uniqueness through a single unique composite index might require less/simpler locking than enforcing uniqueness through two non-unique, non-composite indexes.
And of course, if you intend to cluster your table, the primary index will typically also be the clustering index, and contain all columns anyway, so there isn't much purpose in leaving anything out from the "sorting" portion of the index.

Related

Is a Primary Key necessary in SQL Server?

This may be a pretty naive and stupid question, but I'm going to ask it anyway
I have a table with several fields, none of which are unique, and a primary key, which obviously is.
This table is accessed via the non-unique fields regularly, but no user SP or process access data via the primary key. Is the primary key necessary then? Is it used behind the scenes? Will removing it affect performance Positively or Negatively?
Necessary? No. Used behind the scenes? Well, it's saved to disk and kept in the row cache, etc. Removing will slightly increase your performance (use a watch with millisecond precision to notice).
But ... the next time someone needs to create references to this table, they will curse you. If they are brave, they will add a PK (and wait for a long time for the DB to create the column). If they are not brave or dumb, they will start creating references using the business key (i.e. the data columns) which will cause a maintenance nightmare.
Conclusion: Since the cost of having a PK (even if it's not used ATM) is so small, let it be.
Do you have any foreign keys, do you ever join on the PK?
If the answer to this is no, and your app never retrieves an item from the table by its PK, and no query ever uses it in a where clause, therefore you just added an IDENTITY column to have a PK, then:
the PK in itself adds no value, but does no damage either
the fact that the PK is very likely the clustered index too is .. it depends.
If you have NC indexes, then the fact that you have a narrow artificial clustered key (the IDENTITY PK) is helpful in keeping those indexes narrow (the CDX key is reproduced in every NC leaf slots). So a PK, even if never used, is helpful if you have significant NC indexes.
On the other hand, if you have a prevalent access pattern, a certain query that outweighs all the other is frequency and importance, or which is part of a critical time code path (eg. is the query run on every page visit on your site, or every second by and app etc) then that query is a good candidate to dictate the clustered key order.
And finally, if the table is seldom queried but often written to then it may be a good candidate for a HEAP (no clustered key at all) since heaps are so much better at inserts. See Comparing Tables Organized with Clustered Indexes versus Heaps.
The primary key is behind the scenes a clustered index (by default unless generated as a non clustered index) and holds all the data for the table. If the PK is an identity column the inserts will happen sequentially and no page splits will occur.
But if you don't access the id column at all then you probably want to add some indexes on the other columns. Also when you have a PK you can setup FK relationships
In the logical model, a table must have at least one key. There is no reason to arbitarily specify that one of the keys is 'primary'; all keys are equal. Although the concept of 'primary key' can be traced back to Ted Codd's early work, the mistake was picked up early on has long been corrected in relational theory.
Sadly, PRIMARY KEY found its way into SQL and we've had to live with it ever since. SQL tables can have duplicate rows and, if you consider the resultset of a SELECT query to also be a table, then SQL tables can have duplicate rows too. Relational theorists dislike SQL a lot. However, just because SQL lets you do all kinds of wacky non-relational things, that doesn't mean that you have to actually do them. It is good practice to ensure that every SQL table has at least one key.
In SQL, using PRIMARY KEY on its own has implications e.g. NOT NULL, UNIQUE, the table's default reference for foreign keys. In SQL Server, using PRIMARY KEY on its own has implications e.g. the table's clustered index. However, in all these cases, the implicit behavior can be made explicit using specific syntax.
You can use UNIQUE (constraint rather than index) and NOT NULL in combination to enforce keys in SQL. Therefore, no, a primary key (or even PRIMARY KEY) is not necessary for SQL Server.
I would never have a table without a primary key. Suppose you ever need to remove a duplicate - how would you identify which one to remove and which to keep?
A PK is not necessary.
But you should consider to place a non-unique index on the columns that you use for querying (i.e. that appear in the WHERE-clause). This will considerably boost lookup performance.
The primary key when defined will help improve performance within the database for indexing and relationships.
I always tend to define a primary key as an auto incrementing integer in all my tables, regardless of if I access it or not, this is because when you start to scale up your application, you may find you do actually need it, and it makes life a lot simpler.
A primary key is really a property of your domain model, and it uniquely identifies an instance of a domain object.
Having a clustered index on a montonically increasing column (such as an identity column) will mean page splits will not occur, BUT insertions will unbalance the index over time and therefore rebuilding indexes needs to be done regulary (or when fragmentation reaches a certain threshold).
I have to have a very good reason to create a table without a primary key.
As SQLMenace said, the clustered index is an important column for the physical layout of the table. In addition, having a clustered index, especially a well chosen one on a skinny column like an integer pk, actually increases insert performance.
If you are accessing them via non-key fields the performance probably will not change. However it might be nice to keep the PK for future enhancements or interfaces to these tables. Does your application only use this one table?

SQL primary key - complex primary or string with concatenation?

I have a table with 16 columns. It will be most frequently used table in web aplication and it will contain about few hundred tousand rows. Database is created on sql server 2008.
My question is choice for primary key. What is quicker? I can use complex primary key with two bigint-s or i can use one varchar value but i will need to concatenate it after?
There are many more factors you must consider:
data access prevalent pattern, how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and specially the first two, should drive your choice of the clustered key. Note that the primary key and clustered key are different concepts, often confused. Read up my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion on the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly and concise, and actually correct: the narrower key is always quicker (for reasons of IO). However, this response bares absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way if the row changes, the ID doesn't have to, and any tables referring to it (Foriegn Keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
Why not just a single INT auto-generated primary key? INT is 32-bit, so it can handle over 4 billion records.
CREATE TABLE Records (
recordId INT NOT NULL PRIMARY KEY,
...
);
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
What do you mean quicker? if you need to search quicker, you can create index for any column or create full text search. the primary key just make sure you do not have duplicated records.
The decision relies upon its use. If you are using the table to save data mostly and not retrieve it, then a simple key. If you are mostly querying the data and it is mostly static data where the key values will not change, your index strategy needs to optimize the data to the most frequent query that will be used. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But, it really depends upon your needs.
Lot’s of variables you haven’t mentioned; whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, if disclosure of the key via a UI poses a risk, how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it a ER Model or Dimensional Model. In ER Model, they should be separate and should not be surrogated. The entire record could have a single surrogate for easy references in URLs etc. This could be a hash of all parts of the composite key or an Identity.
In Dimensional Model, also they must be separate and they all should be surrogated.

Database Design: Alternate to composite keys?

I am building a database system and having trouble with the design of one of my tables.
In this system there is a users table, an object table, an item table and cost table.
A unique record in the cost table is determine by the user, object, item and year. However, there can be multiple records that have the same year if the item is different.
The hierarchy goes user->object->item->year, multiple unique years per item, multiple unique items per object, multiple unique objects per user, multiple unique users.
What would be the best way to design the cost table?
I am thinking of including the userid, objectid and itemid as foreign keys and then using a composite key consisting of userid, objecid, itemid and costyear. I have heard that composite keys are bad design, but I am unsure how to structure this to get away from using a composite key. As you can tell my database building skills are a bit rusty.
Thanks!
P.S. If it matters, this is an interbase db.
To avoid the composite key, you just define a surrogate key. This holds an artificial value, for instance an auto counter.
You still can (and should) define a unique constraint on these columns.
Btw: its not only recommended not to use composite keys, it's also recommendable to use surrogate keys. In all your tables.
Use an internally generated key field (called surrogate keys), something like CostID, that the users will never see but will uniquely identify each entry in the Cost table (in SqlServer, fields like uniqueidentifier or IDENTITY would do the trick.)
Try building your database with a composite key using exactly the columns you outlined, and see what happens. You may be pleasantly surprised. Making sure that there is no missing data in those four columns, and making sure that no two rows have the same value in all four columns will help protect the integrity of your data.
When you declare a composite primary key, the order of columns in your declaration won't affect the logical consequences of the dclaration. However the composite index that the DBMS builds for you will also have the columns in the same order, and the order of columns in a composite index does affect performance.
For queries that specify only one, two, or three of these columns, the index will be useless if the first column in the index is a column not specified in the query. If you know in advance how your queries are gonig to me, and which queries most need to run fast, this can help you declare the columns for the primary key in the right order. In rare circumstances creating two or three additional one column indexes can speed up some queries, at the cost of slowing down updates.

How to use MySQL index columns?

When do you use each MySQL index type?
PRIMARY - Primary key columns?
UNIQUE - Foreign keys?
INDEX - ??
For really large tables, do indexed columns improve performance?
Primary
The primary key is - as the name suggests - the main key of a table and should be a column which is commonly used to select the rows of this table. The primary key is always a unique key (unique identifier). The primary key is not limited to one column, for example in reference tables (many-to-many) it often makes sense to have a primary key including two or more columns.
Unique
A unique index makes sure your DBMS doesn't accept duplicate entries for this column. You ask 'Foreign keys?' NO! That would not be useful since foreign keys are per definition prone to be duplicates, (one-to-many, many-to-many).
Index
Additional indexes can be placed on columns which are often used for SELECTS (and JOINS) which is often the case for foreign keys. In many cases SELECT (and JOIN) queries will be faster, if the foreign keys are indexed.
Note however that - as SquareCog has clarified - Indexes get updated on any modifications to the data, so yes, adding more indexes can lead to degradation in INSERT/UPDATE performance. If indexes didn't get updated, you would get different information depending on whether the optimizer decided to run your query on an index or the raw table -- a highly undesirable situation.
This means, you should carefully assess the usage of indices. One thing is sure on the basis of that: Unused indices have to be avoided, resp. removed!
I'm not that familiar with MySQL, however I believe the following to be true across most database servers. An index is a balanced tree which is used to allow the database to scan the table for given data. For example say you have the following table.
CREATE TABLE person (
id SERIAL,
name VARCHAR(20),
dob VARCHAR(20)
);
If you created an index on the 'name' field this would create in a balanced tree for that data in the table for the name column. Balanced tree data structures allow for faster searching of results (see http://www.solutionhacker.com/tag/balanced-tree/).
You should note however indexing a column only allows you to search on the data as it is stored in the database. For example:
This would not be able to search on the index and would instead do a sequential scan on the table, calling UPPER() on each of the column:name rows in the table.
select *
from person
where UPPER(name) = "BOB";
This would also have the following effect, because the index will be sorted starting with the first letter. Replacing the search term with "%B" would however use the index.
select *
from person
where name like "%B"
Indexes will improve performance on larger tables. Normally, the primary key has an index based on the key. Usually unique.
It is useful to add indexes to fields that are used to search on a lot too such as Street Name or Surname as again it will improve perfomance. Don't need to be unique.
Foreign Keys and Unique Keys are more for keeping your data integrity in order. So that you cannot have duplicate primary keys and so that your child tables don't have data for a parent that has been deleted.
PRIMARY defines a primary key, yes.
UNIQUE simply defines that the specified field has to be unique, it has nothing to do with foreign keys.
INDEX creates an index for the specified column and, yes, it improves performance for large tables, sorting and finding something in this column can be much faster if you use indexing.
The bigger the table, the bigger is gain from using an index. Do note that indexes makes insert (and probably update) operations slower so make sure you don't index too many fields.

Where to place a primary key

To my knowledge SQL Server 2008 will only allow one clustered index per table. For the sake of this question let's say I have a list of user-submitted stories that contains the following columns.
ID (int, primary key)
Title (nvarchar)
Url (nvarchar)
UniqueName (nvarchar) This is the url slug (blah-blah-blah)
CategoryID (int, FK to Category table)
Most of the time stories will never be queried by ID. Most of the queries will be done either by the CategoryID or by the UniqueName.
I'm new to indexing so I assumed that it would be best to place 2 nonclustered indexes on this table. One on UniqueName and one on CategoryID. After doing some reading about indexes it seems like haivng a clustered index on UniqueName would be very beneficial. Considering UniqueName is... unique would it be advantageous to place the primary key on UniuqeName and get rid of the ID field? As for CategoryID I assume a nonclustered index will do just fine.
Thanks.
In the first place you can put the clustered index on unique name, it doesn't have to be onthe id field. If you do little or no joining to this table you could get rid of the id. In any event I would put a unique index on the unique name field (you may find in doing so that it isn't as unique as you thought it would be!).
If you do a lot of joining though, I would keep the id field, it is smaller and more efficient to join on.
Since you say you are new at indexing, I will point out that while primary keys have an index created automatically when they are defined, foreign keys do not. You almost always want to index your foreign key fields.
Just out of habit, I always create an Identity field "ID" like you have as the PK. It makes things consistent. If all "master" tables have a field named "ID" that is INT Identity, then it's always obvious what the PK is. Additionally, if I need to make a bridge entity, I'll be storing two (or more) columns of type INT instead of type nvarchar(). So in your example, I would keep ID as the PK and create a unique index on UniqueName.
Data is stored in order of the clustered key; if you are going to key retrievial of data by one of those fields it would be advantageous to use that assuming values aren't significantly fragmented, which can slow down insert performance.
On the other hand, if this table is joined to a lot on the ID, it probably makes more sense to keep the clustered key on the PK.
Generally it's always best to index a table on a identity key and use this as the clustered index. There's a simple rule of thumb here
Don't use a meaningful column as primary index
The reason for this is that generally using a PK on a meaningful column tends to give rise to maintenance issues. It's a rule of thumb, so can be overridden such circumstances dictate, but usually it's best to work from the assumed default position of each table indexed by a (clustered) meaningless identity column. Such tends to be more efficient for joins, and as it's usually the default design that most DBAs will adopt so won't raise any eyebrows or give any issues because they system isn't as the next DBA might assume. Meaningless PKs are invariably more flexible and can adapted more easily to changing circumstances then otherwise
When to override the rule? Only if you do envisage performance issues. For most databases with reasonable loads on modern hardware suitably indexed you will not have any trouble if you're not squeezing the last millisecond of performance out of them by clustering the optimal index. DBA and Programmer cycles are much more expensive than CPU cycles and if you'll only shave the odd millisecond or so off your queries by adopting a different strategy then it's just not worth it. However should you be looking at a table with approaching a million rows then that's a different matter. It depends very much on circumstances, but generally if I'm designing a database with tables of less than 100,000 rows I will lean heavily towards designing for flexibility, ease of writing stable queries, and along the principals any other designer would expect to see. Over a million rows then I design for performance. Between 100k and a million it's a matter of judgement.
There is no requirement or necessity to have a clustered index at all, primary key or otherwise. It's a performance optimisation tool, like all indexing strategies, and should be applied when an improvement can be gained by using it.
As already mentioned, because the table is physically sorted according to the clustered index key, it's a Highlander situation: there can only be one!
Clustered indexes are mostly useful for situations such as:
you regularly need to retrieve a set of rows whose values for a given column are in a range, so columns that are often the subject of a BETWEEN clause are interesting; or
most of your single-row hits in the table occur in an area that can be described by a subset of the values of a key.
I thought that they were particularly un-useful for situations like when you have high-volume transaction systems with very frequent inserts when a sequential key is the clustered column. You'll get a gang of processes all trying to insert at the same physical location (a "hot-spot"). Turns out, as was commented here before this edit, that I'm sadly out-of-date and showing my age. See this post on the topic by Kimberley Tripp which says it all much better.
Sequential numeric "ID" columns are generally not good candidate columns. Names can be good, dates likewise - if carefully considered.