Composite Index primary key vs Unique auto increment primary key - sql

We have a transaction table of over 111m rows that has a clustered composite primary key of...
RevenueCentreID int
DateOfSale smalldatetime
SaleItemID int
SaleTypeID int
...in a SQL 2008 R2 database.
We are going to be truncating and refilling the table soon for an archiving project, so the opportunity to get the indexes right will be once the table has been truncated.
Would it be better to keep the composite primary key or should we move to a unique auto increment primary key?
Most searches on the table are done using the DateOfSale and RevenueCentreID columns. We also often join to the SaleItemID column. We hardly ever use the SaleTypeID column; in fact, it is only included in the primary key for uniqueness. We don't care how long it takes to insert and delete sales figures (done overnight), only about the speed of returning reports.

A surrogate key serves no purpose here. I suggest a clustered primary key on the columns as listed, and an index on SaleItemID.

I have learned that you want and need both a natural key and a surrogate key.
The natural key keeps the business keys unique and is perfect for indexing, while the surrogate key helps with queries and development.
So in your case a surrogate auto-incrementing key is good in that it helps keep all the rows of data intact. A natural key of DateOfSale, RevenueID and maybe ClientID would be a great way of ensuring that no duplicate records are stored, and it would speed up querying because you can index the natural key.

If you don't care about the speed of inserts and deletions, then you probably want multiple indexes which target the queries precisely.
You could create an auto increment primary key as you suggest, but also create indexes as required to cover the reporting queries. Create a unique constraint on the columns you currently have in the key to enforce uniqueness.
The Index Tuning Wizard (called the Database Engine Tuning Advisor in SQL Server 2005 and later) will help with defining a set of indexes, but it's better to create your own.
Rule of thumb - you can define columns to index, and also "include" columns.
If your report has an ORDER BY or a WHERE clause on a column, then the index needs to be defined on that column. Any other fields returned in the SELECT should be included columns.
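As a rough illustration of the rule of thumb above, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. SQLite has no INCLUDE clause, so the returned column is simply appended to the index as a trailing key column, which is only an approximation of an included column. The SaleID and Amount columns and the index name ix_report are assumptions made up for the example.

```python
import sqlite3

# Hypothetical, simplified sales table; SaleID and Amount are assumed columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (
        SaleID          INTEGER PRIMARY KEY,
        RevenueCentreID INTEGER NOT NULL,
        DateOfSale      TEXT    NOT NULL,
        SaleItemID      INTEGER NOT NULL,
        SaleTypeID      INTEGER NOT NULL,
        Amount          REAL    NOT NULL
    );
    -- Filter columns first, then the column the report returns.
    CREATE INDEX ix_report ON Sales (DateOfSale, RevenueCentreID, Amount);
""")

def plan(sql):
    """Return the textual query plan SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Every column this query touches is in the index, so no table lookup is needed.
covered = plan("SELECT Amount FROM Sales "
               "WHERE DateOfSale = '2012-01-01' AND RevenueCentreID = 7")

# SaleItemID is not in the index, so each match needs an extra row lookup.
not_covered = plan("SELECT SaleItemID FROM Sales "
                   "WHERE DateOfSale = '2012-01-01' AND RevenueCentreID = 7")

print(covered)
print(not_covered)
```

The first plan should mention a covering index, while the second uses the same index only to locate rows and then fetches them from the table. In SQL Server the equivalent check would be done by reading the execution plan.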

Related

Primary Key Needed on Fact Tables

I am currently developing a very complicated database schema and was wondering if the fact tables should have primary keys. Each fact table has 50+ columns of data and the only way to make a primary key would be to add an auto incrementing count to each tuple. I am just not sure what this information gets us in the long term, especially since the data will be deleted after 12 months.
My dimension tables of course will have primary keys, just wanting to know what is best practice.
I am a fan of putting an identity column on all tables. This makes it easier to identify specific rows for updating and deleting.
On a fact table with lots of dimensions, of course, such a column can seem superfluous. However, there is still usually a primary key -- which is the combination of dimensions.
I would encourage you to have a primary key on the table, either an identity column or a combination of existing columns. If you use a composite primary key, you should be careful about the ordering of the keys. SQL Server defaults to using the primary key as a clustered index, and if you put the keys in the wrong order, then your table is subject to fragmentation. Identity keys don't have this issue.
It is always good to choose a clustering key that makes it easy to seek the data when needed. The clustering key is not only used for queries against the clustered index. It is also stored in every non-clustered index leaf page, so that the engine can seek back to the data pages when there is a key lookup.
Characteristics of good clustering key:
unique (no need for adding uniquefier to make value unique)
incrementing (reduces fragmentation)
narrow (less number of bytes to store in the tree pages of clustered index & in the leaf pages of non-clustered index)
Static (reduces fragmentation)
non-nullable (avoids null blocks)
fixed width (avoids variable blocks)
Read more in Kimberly Tripp's post on clustering keys.
Identity columns satisfy all of these criteria, so they are good candidates for a clustered index.
If you are going to hold data longer, you can go for Bigint and if you are going to hold for one year and purge, you can go for int datatype itself.

Primary key in "many-to-many" table

I have a table in a SQL database that provides a "many-to-many" connection.
The table contains id's of both tables and some fields with additional information about the connection.
CREATE TABLE SomeTable (
f_id1 INTEGER NOT NULL,
f_id2 INTEGER NOT NULL,
additional_info text NOT NULL,
ts timestamp NULL DEFAULT now()
);
The table is expected to contain 10 000 - 100 000 entries.
How is it better to design a primary key? Should I create an additional 'id' field, or to create a complex primary key from both id's?
DBMS is PostgreSQL
This is a "hard" question in the sense that there are pretty good arguments on both sides. I have a bias toward putting in auto-incremented ids in all tables that I use. Over time, I have found that this simply helps with the development process and I don't have to think about whether they are necessary.
A big reason for this is so foreign key references to the table can use only one column.
In a many-to-many junction table (aka "association table"), this probably isn't necessary:
It is unlikely that you will add a table with a foreign key relationship to a junction table.
You are going to want a unique index on the columns anyway.
They will probably be declared not null anyway.
Some databases actually store data based on the primary key. So, when you do an insert, data must be moved on pages to accommodate the new values. Postgres is not one of those databases. It treats the primary key index just like any other index. In other words, you are not incurring "extra" work by declaring one or more columns as a primary key.
My conclusion is that having the composite primary key is fine, even though I would probably have an auto-incremented primary key with separate constraints. The composite primary key will occupy less space so probably be more efficient than an auto-incremented id. However, if there is any chance that this table would be used for a foreign key relationship, then add in another id field.
A surrogate key won't protect you from adding multiple instances of (f_id1, f_id2), so you should definitely have a unique constraint or primary key for that. What would the purpose of a surrogate key be in your scenario?
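To make that point concrete, here is a small sketch using Python's built-in sqlite3 module instead of PostgreSQL (an assumption made so the example is self-contained). The table names WithSurrogateOnly and WithUniquePair are hypothetical; the columns follow the question's SomeTable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate key only: nothing stops the same pair being stored twice.
    CREATE TABLE WithSurrogateOnly (
        id    INTEGER PRIMARY KEY,
        f_id1 INTEGER NOT NULL,
        f_id2 INTEGER NOT NULL,
        additional_info TEXT NOT NULL
    );
    -- Surrogate key plus a unique constraint on the pair.
    CREATE TABLE WithUniquePair (
        id    INTEGER PRIMARY KEY,
        f_id1 INTEGER NOT NULL,
        f_id2 INTEGER NOT NULL,
        additional_info TEXT NOT NULL,
        UNIQUE (f_id1, f_id2)
    );
""")

def insert_pair_twice(table):
    """Insert the pair (1, 2) twice; return True if the second insert is rejected."""
    sql = f"INSERT INTO {table} (f_id1, f_id2, additional_info) VALUES (1, 2, ?)"
    conn.execute(sql, ("first",))
    try:
        conn.execute(sql, ("again",))
        return False
    except sqlite3.IntegrityError:
        return True

surrogate_only_rejected = insert_pair_twice("WithSurrogateOnly")
unique_pair_rejected = insert_pair_twice("WithUniquePair")

surrogate_only_rows = conn.execute("SELECT COUNT(*) FROM WithSurrogateOnly").fetchone()[0]
unique_pair_rows = conn.execute("SELECT COUNT(*) FROM WithUniquePair").fetchone()[0]
print(surrogate_only_rejected, surrogate_only_rows)
print(unique_pair_rejected, unique_pair_rows)
```

The table with only the auto-incremented id ends up holding the duplicate pair, while the one with the unique constraint rejects it. A composite primary key on (f_id1, f_id2) would give the same protection without the extra column.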
Yes, that's actually what people commonly do; that key is called a surrogate key. I'm not exactly sure about PostgreSQL, but in MySQL a surrogate key lets you delete or edit records from the user interface. Besides, this allows the database to query the single key column faster than it could multiple columns. Hope it helps.

When should I use primary key or index?

When should I use a primary key or an index?
What are their differences and which is the best?
Basically, a primary key is (at the implementation level) a special kind of index. Specifically:
A table can have only one primary key, and with very few exceptions, every table should have one.
A primary key is implicitly UNIQUE - you cannot have more than one row with the same primary key, since its purpose is to uniquely identify rows.
A primary key can never be NULL, so the column(s) it consists of must be NOT NULL.
A table can have multiple indexes, and indexes are not necessarily UNIQUE. Indexes exist for two reasons:
To enforce a uniqueness constraint (these can be created implicitly when you declare a column UNIQUE)
To improve performance. Comparisons for equality or "greater/smaller than" in WHERE clauses, as well as JOINs, are much faster on columns that have an index. But note that each index decreases update/insert/delete performance, so you should only have them where they're actually needed.
Differences
A table can only have one primary key, but several indexes.
A primary key is unique, whereas an index does not have to be unique. Therefore, the value of the primary key identifies a record in a table, whereas the value of an index does not necessarily do so.
Primary keys usually are automatically indexed - if you create a primary key, no need to create an index on the same column(s).
When to use what
Each table should have a primary key. Define a primary key that is guaranteed to uniquely identify each record.
If there are other columns you often use in joins or in where conditions, an index may speed up your queries. However, indexes have an overhead when creating and deleting records - something to keep in mind if you do huge amounts of inserts and deletes.
Which is best?
None really - each one has its purpose. And it's not that you really can choose the one or the other.
I recommend to always ask yourself first what the primary key of a table is and to define it.
Add indexes by your personal experience, or if performance is declining. Measure the difference, and if you work with SQL Server learn how to read execution plans.
This might help: "Back to the Basics: Difference between Primary Key and Unique Index".
The differences between the two are:
The column(s) that make up the Primary Key of a table cannot be NULL, since by definition the Primary Key uniquely identifies the record in the table. The column(s) that make up a unique index can be nullable. It is worth noting that different RDBMSs treat this differently: SQL Server and DB2 do not allow more than one NULL value in a unique index column, whereas Oracle allows multiple NULL values. That is one of the things to look out for when designing, developing or porting applications across RDBMSs.
There can be only one Primary Key defined on the table where as you can have many unique indexes defined on the table (if needed).
Also, in the case of SQL Server, if you go with the default options then a Primary Key is created as a clustered index while the unique index (constraint) is created as a non-clustered index. This is just the default behavior though and can be changed at creation time, if needed.
Keys and indexes are quite different concepts that achieve different things. A key is a logical constraint which requires tuples to be unique. An index is a performance optimisation feature of a database and is therefore a physical rather than a logical feature of the database.
The distinction between the two is sometimes blurred because often a similar or identical syntax is used for specifying constraints and indexes. Many DBMSs will create an index by default when key constraints are created. The potential for confusion between key and index is unfortunate because separating logical and physical concerns is a highly important aspect of data management.
As regards "primary" keys. They are not a "special" type of key. A primary key is just any one candidate key of a table. There are at least two ways to create candidate keys in most SQL DBMSs and that is either using the PRIMARY KEY constraint or using a UNIQUE constraint on NOT NULL columns. It is a very widely observed convention that every SQL table has a PRIMARY KEY constraint on it. Using a PRIMARY KEY constraint is conventional wisdom and a perfectly reasonable thing to do but it generally makes no practical or logical difference because most DBMSs treat all keys as equal. Certainly every table ought to enforce at least one candidate key but whether those key(s) are enforced by PRIMARY KEY or UNIQUE constraints doesn't usually matter. In principle it is candidate keys that are important, not "primary" keys.
The primary key is by definition unique: it identifies each individual row. You always want a primary key on your table, since it's the only way to identify rows.
An index is basically a dictionary for a field or set of fields. When you ask the database to find the record where some field is equal to some specific value, it can look in the dictionary (index) to find the right rows. This is very fast, because just like a dictionary, the entries are sorted in the index allowing for a binary search. Without the index, the database has to read each row in the table and check the value.
You generally want to add an index to each column you need to filter on. If you search on a specific combination of columns, you can create a single index containing all of those columns. If you do so, the same index can be used to search for any prefix of the list of columns in your index. Put simply (if a bit inaccurately), the dictionary holds entries consisting of the concatenation of the values used in the columns, in the specified order, so the database can look for entries which start with a specific value and still use efficient binary search for this.
For example, if you have an index on the columns (A, B, C), this index can be used even if you only filter on A, because that is the first column in the index. Similarly, it can be used if you filter on both A and B. It cannot, however, be used if you only filter on B or C, because they are not a prefix of the list of columns; you need another index to accommodate that.
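The prefix rule can be checked with a small sketch using Python's built-in sqlite3 module. The table t, the extra column d and the index name idx_abc are made up for the example, and query planners differ between engines (some can apply techniques such as skip scans when statistics are available), so treat this as an illustration of the general behaviour rather than a guarantee for any particular database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column d is not indexed, so SELECT * always needs the table itself.
    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d TEXT);
    CREATE INDEX idx_abc ON t (a, b, c);
""")

def plan(where):
    """Return the textual query plan for SELECT * with the given WHERE clause."""
    rows = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE " + where).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_a = plan("a = 1")             # prefix (A): index usable
plan_ab = plan("a = 1 AND b = 2")  # prefix (A, B): index usable
plan_b = plan("b = 2")             # B alone is not a prefix: full scan

for label, text in (("a", plan_a), ("a, b", plan_ab), ("b", plan_b)):
    print(label, "->", text)
```

On a fresh database with no gathered statistics, the first two plans search idx_abc, while the filter on b alone falls back to scanning the table.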
A primary key also serves as an index, so you don't need to add an index covering the same columns as your primary key.
Every table should have a PRIMARY KEY.
Many types of queries are sped up by the judicious choice of an INDEX. It may be that the best index is the primary key. My point is that the query is the main factor in whether to use the PK for its index.

Primary Key / Clustered key for Junction Tables

Let's say we have a Product table, and Order table and a (junction table) ProductOrder.
ProductOrder will have an ProductID and an OrderID.
In most of our systems these tables also have an autonumber column called ID.
What is the best practice for placing the primary key (and therefore the clustered key)?
Should I keep the primary key on the ID field and create a non-clustered index for the foreign key pair (ProductID and OrderID)?
Or should I put the primary key on the foreign key pair (ProductID and OrderID) and put a non-clustered index on the ID column (if even necessary)?
Or ... (smart remark by one of you :))
I know these words might make you cringe, but "it depends."
It is most likely that you want the order to be based on the ProductID and/or OrderId and not the autonumber (surrogate) column since the autonumber has no natural meaning in your database. You probably want to order the join table by the same field as the parent table.
First understand why and how you are using the surrogate key ID in the first place; that will often dictate how you index it. I assume you are using the surrogate key because you are using some framework that works well with single column keys. If there is no specific design reason, then for a join table, I'd simplify the problem and just remove the autonumber ID, if it brings no other benefit. The primary key becomes the (ProductID, OrderID). If not, you need to at least make sure your index on the (ProductID, OrderID) tuple is unique to preserve data integrity.
Clustered indexes are good for sequential scans/joins when the query needs the results in the same order that the index is ordered. So, look at your access patterns, figure out by which key(s) you will be doing sequential, multi-row selects / scans, and by which key you'll be doing random, individual row access, and create the clustered index on the key you'll scan most, and the non-clustered key index on the key you'll use for random access. You have to choose one or the other, since you cannot cluster both.
NOTE: If you have conflicting requirements, there is a technique ("trick") that may help. If all of the columns in a query are found in an index, then that index is a candidate for the database engine to use in place of the table to satisfy the query. You can use this fact to store data in more than one order, even if those orders conflict with one another. Just be aware of the pros and cons of adding more fields to an index, and make a conscious decision after understanding the nature and frequency of the queries that will be processed.
The correct and only answer is:
Primary key is ('orderid' , 'productid')
Another index on ('productid' , 'orderid')
Either can be clustered, but PK is by default
Because:
You don't need an index on orderid or productid alone: the optimiser will use one of the indexes
You'll most likely use the table "both" ways
You don't need a surrogate key because you already have them on the linked tables. So a third column wastes space.
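A minimal sketch of this two-index layout, using Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption for the sake of a runnable example). SQLite's WITHOUT ROWID option stores the table in primary key order, which is loosely comparable to a clustered primary key; the index name ix_product_order is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Junction table keyed on the pair, with no surrogate column.
    CREATE TABLE ProductOrder (
        orderid   INTEGER NOT NULL,
        productid INTEGER NOT NULL,
        PRIMARY KEY (orderid, productid)
    ) WITHOUT ROWID;
    -- Second index with the columns reversed, for lookups by product.
    CREATE INDEX ix_product_order ON ProductOrder (productid, orderid);
""")

def plan(sql):
    """Return the textual query plan SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# "Which products are on this order?" is served by the primary key.
by_order = plan("SELECT productid FROM ProductOrder WHERE orderid = 10")

# "Which orders contain this product?" is served by the reversed index.
by_product = plan("SELECT orderid FROM ProductOrder WHERE productid = 20")

print(by_order)
print(by_product)
```

Both lookups should show up as index searches rather than full scans, which is the sense in which the table can be used "both" ways without separate single-column indexes.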
This appears to be for a dynamic system where many orders will be added. The clustered index should therefore be on your autonumbered column.
You can make the autonumber column the primary key and put another unique index on the pair of columns. Or, you can make the pair of columns the primary (but non-clustered) key.
The choice of using the primary key or a unique index key is up to you. But I would make sure that the one that is clustered is for your autonumber column.
My preference has always been to create an autonumber for Primary Keys. Then I create a unique index on the two Foreign keys so that they are not duplicated.
The reason I do this is that the more I normalize my data, the more keys I have to use in joins. I have ended up with designs going six to seven levels deep, and if I use keys flowing from one level to another, I could potentially end up with n^2 keys in the join.
Try convincing my SQL Developers to use all of that for a single query, and they will really like me.
I keep it simple.

Is primary key always clustered?

Please clear up my doubt about this: in SQL Server (2000 and above), is the primary key automatically a clustered index, or do we have the choice of a non-clustered index on the primary key?
Nope, it can be nonclustered. However, if you don't explicitly define it as nonclustered and there is no clustered index on the table, it'll be created as clustered.
One might also add that frequently it's BAD to allow the primary key to be clustered. In particular, when the primary key is assigned by an IDENTITY, it has no intrinsic meaning, so any effort to keep the table arranged accordingly would be wasted.
Consider a table Product, with ProductID INT IDENTITY PRIMARY KEY. If this is clustered, then products that are related in some way are likely to be spread all over the disk. It might be better to cluster by something that we're likely to query based on, like the ManufacturerID or the CategoryID. In either of these cases, a clustered index would (other things being equal) make the corresponding query much more efficient.
On the other hand, the foreign key in a child table that points to this might be a good candidate for clustering (my objection is to the column that actually has the IDENTITY attribute, not its relatives). So in my example above, it's likely that ManufacturerID is a foreign key to a Manufacturer table, where it is set as an IDENTITY. That column shouldn't be clustered, but the column in Product that references it might do so to good advantage.