Is primary key always clustered? - sql

Please clear up my doubt about this: in SQL Server (2000 and above), is a primary key automatically given a clustered index, or can we choose to have a non-clustered index on the primary key?

Nope, it can be nonclustered. However, if you don't explicitly define it as nonclustered and there is no clustered index on the table, it'll be created as clustered.
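As a minimal sketch of the point above (the table and column names are hypothetical, chosen only for illustration), the NONCLUSTERED keyword on the constraint is what stops the primary key from becoming the clustered index:

```sql
-- Hypothetical table; names are illustrative only.
CREATE TABLE dbo.Customer
(
    CustomerID   INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL,
    -- Explicitly nonclustered: without this keyword, and with no other
    -- clustered index on the table, the primary key would be clustered.
    CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerID)
);

-- The table is a heap at this point; a clustered index can be placed
-- on a different column if desired.
CREATE CLUSTERED INDEX CIX_Customer_CustomerName
    ON dbo.Customer (CustomerName);
```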

One might also add that frequently it's BAD to allow the primary key to be clustered. In particular, when the primary key is assigned by an IDENTITY, it has no intrinsic meaning, so any effort to keep the table arranged accordingly would be wasted.
Consider a table Product, with ProductID INT IDENTITY PRIMARY KEY. If this is clustered, then products that are related in some way are likely to be spread all over the disk. It might be better to cluster by something that we're likely to query based on, like the ManufacturerID or the CategoryID. In either of these cases, a clustered index would (other things being equal) make the corresponding query much more efficient.
On the other hand, the foreign key in a child table that points to this might be a good candidate for clustering (my objection is to the column that actually has the IDENTITY attribute, not its relatives). So in my example above, it's likely that ManufacturerID is a foreign key to a Manufacturer table, where it is set as an IDENTITY. That column shouldn't be clustered, but the column in Product that references it might do so to good advantage.
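The Product example might look roughly like this. It is only a sketch: the ProductName column, the constraint and index names, and the literal 42 are assumptions, and the foreign key to Manufacturer is left out for brevity.

```sql
-- Sketch of the Product table described above; extra columns are hypothetical.
CREATE TABLE dbo.Product
(
    ProductID      INT IDENTITY(1,1) NOT NULL,
    ManufacturerID INT               NOT NULL,  -- assumed to reference a Manufacturer table
    CategoryID     INT               NOT NULL,
    ProductName    NVARCHAR(200)     NOT NULL,
    -- Keep the IDENTITY column as the key, but do not cluster on it.
    CONSTRAINT PK_Product PRIMARY KEY NONCLUSTERED (ProductID)
);

-- Cluster on the column the queries filter by, so that rows for one
-- manufacturer are stored together.
CREATE CLUSTERED INDEX CIX_Product_ManufacturerID
    ON dbo.Product (ManufacturerID);

-- The kind of query this layout is meant to help.
SELECT ProductID, ProductName
FROM dbo.Product
WHERE ManufacturerID = 42;
```

Because ManufacturerID is not unique, SQL Server adds a hidden uniquifier to duplicate clustered key values; clustering on (ManufacturerID, ProductID) as a unique index is one way to avoid that.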

Related

Primary Key as composite only on Primary table

There is a table in our environment. Recently, it was discovered that performance was greatly improved by sorting on a datetime column, which the DBA wanted to make the primary key. Since he can't guarantee uniqueness with the datetime alone, he added the id that was once the primary key to his new composite key.
So there is a table with the primary key as datetime / id, and the clustered index is also defined this way. All the pk / fk relationships are still set up properly and exist on the id-to-id paradigm one would expect.
What could be the possible problem with a lopsided primary key?
Performance is considerably improved with this change.
However, in the schema, the actual "primary key" is two columns. What could possibly go wrong?
Do not do that! Set up a unique index with the two fields. It does not have to be primary key. In fact if you want the original key to remain unique, then this is a terrible idea.
EDIT: This answer was assuming Sql Server. If it turns out that it is not then I will delete my answer.
You do not list details so I will have to give a very general answer. In my research I have found that most will recommend a short primary key / clustered index.
The real key here, though, is what you mean by increased performance. Is it just one query? In other words, does this change have a beneficial, or at least insignificant, performance impact on all operations on this data: user interfaces, all reports, and so on? Or is this robbing Peter to pay Paul?
If this were a reporting database or data warehouse where the majority of reports are based on date, I could see why people might recommend having the clustered index set up in such a way that it would benefit all reports, or the most important ones.
In any other situation I can think of, a non-clustered index would provide almost the same level of performance increase without increasing the size of the PK, which is used in all lookups (more bytes read = slower performance) and would also take up more space on your data pages.
EDIT:
This article explains this topic better than I could.
https://www.simple-talk.com/sql/learn-sql-server/effective-clustered-indexes/
The performance advantage you are currently seeing (if genuine) is due to the clustered index associated with the primary key, not the primary key itself. If you are happy with the current index but are concerned about uniqueness, you should keep the unique datetime / id pair as your clustered index but revert to your old unique id as the primary key.
This also addresses the problem where other tables referencing this primary key would have required the creation of a likely inappropriate datetime column to create a foreign key relationship.
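A minimal sketch of that arrangement, assuming a hypothetical table named EventLog with columns id and eventTime (the question does not give the real names):

```sql
-- Hypothetical table; the real table and column names are not given in the question.
CREATE TABLE dbo.EventLog
(
    id        INT IDENTITY(1,1) NOT NULL,
    eventTime DATETIME          NOT NULL,
    payload   NVARCHAR(400)     NULL,
    -- The primary key stays on id alone, so foreign keys keep
    -- referencing a single narrow column.
    CONSTRAINT PK_EventLog PRIMARY KEY NONCLUSTERED (id)
);

-- The physical ordering that produced the performance gain:
-- a unique clustered index on (eventTime, id).
CREATE UNIQUE CLUSTERED INDEX CIX_EventLog_eventTime_id
    ON dbo.EventLog (eventTime, id);
```

On an existing table the same end state means dropping and recreating the primary key constraint, which in turn requires dropping and recreating any foreign keys that reference it.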

Composite Index primary key vs Unique auto increment primary key

We have a transaction table of over 111m rows that has a clustered composite primary key of...
RevenueCentreID int
DateOfSale smalldatetime
SaleItemID int
SaleTypeID int
...in a SQL 2008 R2 database.
We are going to be truncating and refilling the table soon for an archiving project, so the opportunity to get the indexes right will be once the table has been truncated.
Would it be better to keep the composite primary key or should we move to a unique auto increment primary key?
Most searches on the table are done using the DateOfSale and RevenueCentreID columns. We also often join to the SaleItemID column. We hardly ever use the SaleType column; in fact, it is only included in the primary key for uniqueness. We don't care how long it takes to insert and delete sales figures (done overnight), only about the speed of returning reports.
A surrogate key serves no purpose here. I suggest a clustered primary key on the columns as listed, and an index on SaleItemID.
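A sketch of what this answer suggests, using the column names from the question. The table name Sale, the Amount column, and the constraint and index names are assumptions.

```sql
-- Table name, Amount column and constraint/index names are hypothetical.
CREATE TABLE dbo.Sale
(
    RevenueCentreID INT           NOT NULL,
    DateOfSale      SMALLDATETIME NOT NULL,
    SaleItemID      INT           NOT NULL,
    SaleTypeID      INT           NOT NULL,
    Amount          DECIMAL(12,2) NOT NULL,
    -- Clustered composite primary key on the columns as listed.
    CONSTRAINT PK_Sale PRIMARY KEY CLUSTERED
        (RevenueCentreID, DateOfSale, SaleItemID, SaleTypeID)
);

-- Separate index to support joins on SaleItemID.
CREATE NONCLUSTERED INDEX IX_Sale_SaleItemID
    ON dbo.Sale (SaleItemID);
```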
I have learned that you want and need both a natural key and a surrogate key.
The natural key keeps the business keys unique and is perfect for indexing, whereas the surrogate key helps with queries and development.
So in your case a surrogate auto-incrementing key is good in that it will help keep all the rows of data intact. A natural key of DateOfSale, RevenueID and maybe ClientID would be a great way of ensuring no duplicate records are stored, and it would speed up querying because you can index the natural key.
If you don't care about the speed of inserts and deletions, then you probably want multiple indexes which target the queries precisely.
You could create an auto increment primary key as you suggest, but also create indexes as required to cover the reporting queries. Create a unique constraint on the columns you currently have in the key to enforce uniqueness.
Index tuning wizard will help with defining the optimum set of indexes, but it's better to create your own.
Rule of thumb - you can define columns to index, and also "include" columns.
If your report has an ORDER BY or a WHERE clause on a column, then the index needs to be defined on that column. Any other fields returned in the SELECT should be included columns.
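The rule of thumb above could be sketched like this. The table name SaleWithSurrogate, the SaleID and Amount columns, and the sample query values are assumptions; the other column names come from the question. INCLUDE is available from SQL Server 2005 onward.

```sql
-- Surrogate-key variant; SaleID, Amount and all object names are hypothetical.
CREATE TABLE dbo.SaleWithSurrogate
(
    SaleID          INT IDENTITY(1,1) NOT NULL,
    RevenueCentreID INT               NOT NULL,
    DateOfSale      SMALLDATETIME     NOT NULL,
    SaleItemID      INT               NOT NULL,
    SaleTypeID      INT               NOT NULL,
    Amount          DECIMAL(12,2)     NOT NULL,
    CONSTRAINT PK_SaleWithSurrogate PRIMARY KEY CLUSTERED (SaleID),
    -- Uniqueness of the old composite key is still enforced.
    CONSTRAINT UQ_SaleWithSurrogate_NaturalKey UNIQUE NONCLUSTERED
        (RevenueCentreID, DateOfSale, SaleItemID, SaleTypeID)
);

-- Key columns are the ones used in WHERE / ORDER BY;
-- the remaining selected columns go in INCLUDE.
CREATE NONCLUSTERED INDEX IX_SaleWithSurrogate_Date_Centre
    ON dbo.SaleWithSurrogate (DateOfSale, RevenueCentreID)
    INCLUDE (SaleItemID, Amount);

-- A report query that this index can answer without touching the base table.
SELECT SaleItemID, Amount
FROM dbo.SaleWithSurrogate
WHERE DateOfSale >= '20120101'
  AND DateOfSale <  '20120201'
  AND RevenueCentreID = 7
ORDER BY DateOfSale;
```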

Should primary keys be always assigned as clustered index

I have a SQL Server table that stores employee details. The column ID is of GUID type, while the column EmployeeNumber is of INT type. Most of the time I will be using EmployeeNumber in joins and selection criteria.
My question is whether it is sensible to assign the primary key to the ID column and the clustered index to EmployeeNumber.
Yes, it is possible to have a non-clustered primary key, and it is possible to have a clustered key that is completely unrelated to the primary key. By default a primary key becomes the clustered index key too, but this is not a requirement.
The primary key is a logical concept: it is the key used in your data model to reference entities.
The clustered index key is a physical concept: it is the order in which you want the rows to be stored on disk.
Choosing a different clustered key is driven by a variety of factors. One is key width, when you want a narrower clustered key than the primary key (because the clustered key is replicated in every non-clustered index). Another is support for frequent range scans (common in time series), when the data is frequently accessed with queries like date BETWEEN '20100101' AND '20100201' (a clustered index key on date would be appropriate).
This subject has been discussed here ad nauseam before; see also "What column should the clustered index be put on?".
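For the employee table in the question, the separation described above might be sketched as follows. The FullName column and the constraint and index names are assumptions; ID and EmployeeNumber come from the question.

```sql
-- Sketch only; FullName and all constraint/index names are hypothetical.
CREATE TABLE dbo.Employee
(
    ID             UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Employee_ID DEFAULT NEWID(),
    EmployeeNumber INT              NOT NULL,
    FullName       NVARCHAR(200)    NOT NULL,
    -- Logical key: the GUID, kept out of the clustered index.
    CONSTRAINT PK_Employee PRIMARY KEY NONCLUSTERED (ID)
);

-- Physical order: the narrow INT that joins and filters actually use.
CREATE UNIQUE CLUSTERED INDEX CIX_Employee_EmployeeNumber
    ON dbo.Employee (EmployeeNumber);
```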
The ideal clustered index key is:
Sequential
Selective (no dupes, unique for each record)
Narrow
Used in Queries
In general it is a very bad idea to use a GUID as a clustered index key, since it leads to mucho fragmentation as rows are added.
EDIT FOR CLARITY:
PK and Clustered key are indeed separate concepts. Your PK does not need to be your clustered index key.
In practical applications in my own experience, the same field that is your PK should/would be your clustered key since it meets the same criteria listed above.
First, I have to say that I have misgivings about the choice of a GUID as the primary key for this table. I am of the opinion that EmployeeNumber would probably be a better choice, and something naturally unique about the employee would be better than that, such as an SSN (or ATIN), which employers must legally obtain anyway (at least in the US).
Putting that aside, you should never base a clustered index on a GUID column. The clustered index specifies the physical order of rows in the table. Since GUID values are (in theory) completely random, every new row will fall at a random location. This is very bad for performance. There is something called 'sequential' GUIDs, but I would consider this a bit of a hack.
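For reference, the "sequential GUID" approach mentioned above is usually done with NEWSEQUENTIALID(), which SQL Server (2005 and later) only allows inside a DEFAULT constraint on a uniqueidentifier column. This is a sketch with hypothetical names, shown to illustrate the technique rather than to recommend it:

```sql
-- Hypothetical table illustrating sequential GUIDs; not a recommendation.
CREATE TABLE dbo.EmployeeSequentialGuid
(
    -- NEWSEQUENTIALID() can only appear in a DEFAULT constraint.
    ID             UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_EmployeeSequentialGuid_ID DEFAULT NEWSEQUENTIALID(),
    EmployeeNumber INT              NOT NULL,
    CONSTRAINT PK_EmployeeSequentialGuid PRIMARY KEY CLUSTERED (ID)
);

-- Rows inserted without an explicit ID get increasing GUID values,
-- so new rows tend to land at the end of the clustered index.
INSERT INTO dbo.EmployeeSequentialGuid (EmployeeNumber) VALUES (1001);
INSERT INTO dbo.EmployeeSequentialGuid (EmployeeNumber) VALUES (1002);
```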
Using a clustered index on something other than the primary key will improve performance for SELECT queries that can take advantage of this index.
But you will lose performance on UPDATE queries, because in most scenarios they rely on the primary key to find the specific row you want to update.
INSERT queries could also lose performance, because when you add a new row in the middle of the index, a lot of rows have to be moved (physically). This won't happen with an auto-incrementing primary key, as new records will always be added at the end and won't move any other rows.
If you don't know which kind of operation needs the most performance, I recommend leaving the clustered index on the primary key and using nonclustered indexes on common search criteria.
Clustered indexes cause the data to be physically stored in that order. For this reason when testing for ranges of consecutive rows, clustered indexes help a lot.
GUIDs are really bad clustered index keys since their order does not follow any sensible pattern. INT IDENTITY columns aren't much better unless order of entry helps (e.g. most recent hires).
Since you're probably not looking for ranges of employees it probably doesn't matter much which is the Clustered index, unless you can segment blocks of employees that you often aren't interested in (e.g. Termination Dates)
Since EmployeeNumber is unique, I would make it the PK. In SQL Server, a PK is often a clustered index.
Joins on GUIDs are just horrible. @JNK answers this well.

Primary Key / Clustered key for Junction Tables

Let's say we have a Product table, an Order table and a (junction) table ProductOrder.
ProductOrder will have a ProductID and an OrderID.
In most of our systems these tables also have an autonumber column called ID.
What is the best practice for placing the primary key (and therefore the clustered key)?
Should I keep the primary key on the ID field and create a non-clustered index for the foreign key pair (ProductID and OrderID)?
Or should I put the primary key on the foreign key pair (ProductID and OrderID) and put a non-clustered index on the ID column (if even necessary)?
Or ... (smart remark by one of you :))
I know these words might make you cringe, but "it depends."
It is most likely that you want the order to be based on the ProductID and/or OrderId and not the autonumber (surrogate) column since the autonumber has no natural meaning in your database. You probably want to order the join table by the same field as the parent table.
First understand why and how you are using the surrogate key ID in the first place; that will often dictate how you index it. I assume you are using the surrogate key because you are using some framework that works well with single column keys. If there is no specific design reason, then for a join table, I'd simplify the problem and just remove the autonumber ID, if it brings no other benefit. The primary key becomes the (ProductID, OrderID). If not, you need to at least make sure your index on the (ProductID, OrderID) tuple is unique to preserve data integrity.
Clustered indexes are good for sequential scans/joins when the query needs the results in the same order that the index is ordered. So, look at your access patterns, figure out by which key(s) you will be doing sequential, multi-row selects / scans, and by which key you'll be doing random, individual row access, and create the clustered index on the key you'll scan most, and the non-clustered key index on the key you'll use for random access. You have to choose one or the other, since you cannot cluster both.
NOTE: If you have conflicting requirements, there is a technique ("trick") that may help. If all of the columns in a query are found in an index, then the database engine can use that index, instead of the table, to satisfy the query (a covering index). You can use this fact to store data in more than one order, even if the orders conflict with one another. Just be aware of the pros and cons of adding more fields to an index, and make a conscious decision after understanding the nature and frequency of the queries that will be processed.
The correct and only answer is:
Primary key is ('orderid' , 'productid')
Another index on ('productid' , 'orderid')
Either can be clustered, but PK is by default
Because:
You don't need an index on orderid or productid alone: the optimiser will use one of the indexes
You'll most likely use the table "both" ways
You don't need a surrogate key because you already have them on the linked tables, so a third column wastes space.
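The layout in this answer can be sketched as below. Constraint and index names are assumptions, and the foreign keys to the Product and Order tables are omitted to keep the example self-contained.

```sql
-- Junction table without a surrogate key; object names are hypothetical.
CREATE TABLE dbo.ProductOrder
(
    OrderID   INT NOT NULL,  -- assumed to reference the Order table
    ProductID INT NOT NULL,  -- assumed to reference the Product table
    -- Primary key on (OrderID, ProductID); clustered by default,
    -- stated explicitly here.
    CONSTRAINT PK_ProductOrder PRIMARY KEY CLUSTERED (OrderID, ProductID)
);

-- Second index with the columns reversed, so lookups that start
-- from ProductID are served as well.
CREATE UNIQUE NONCLUSTERED INDEX IX_ProductOrder_ProductID_OrderID
    ON dbo.ProductOrder (ProductID, OrderID);
```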
This appears to be for a dynamic system where many orders will be added. The clustered index should therefore be on your autonumbered column.
You can make the autonumber column the primary key and put another unique index on the pair of columns. Or, you can make the pair of columns the primary (but non-clustered) key.
The choice of using the primary key or a unique index key is up to you. But I would make sure that the one that is clustered is for your autonumber column.
My preference has always been to create an autonumber for Primary Keys. Then I create a unique index on the two Foreign keys so that they are not duplicated.
The reason I do this is that the more I normalize my data, the more keys I have to use in joins. I have ended up with designs going six to seven levels deep, and if I use keys flowing from one level to another, I could potentially end up with n^2 keys in the join.
Try convincing my SQL Developers to use all of that for a single query, and they will really like me.
I keep it simple.
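The surrogate-key alternative preferred in the last two answers might look like this. The table name ProductOrderWithID and the constraint and index names are assumptions; foreign keys are again omitted.

```sql
-- Junction table with an autonumber key; object names are hypothetical.
CREATE TABLE dbo.ProductOrderWithID
(
    ID        INT IDENTITY(1,1) NOT NULL,
    ProductID INT               NOT NULL,
    OrderID   INT               NOT NULL,
    -- Clustered on the autonumber, so new rows are appended at the end.
    CONSTRAINT PK_ProductOrderWithID PRIMARY KEY CLUSTERED (ID)
);

-- The foreign key pair must still be unique to prevent duplicates.
CREATE UNIQUE NONCLUSTERED INDEX UX_ProductOrderWithID_ProductID_OrderID
    ON dbo.ProductOrderWithID (ProductID, OrderID);
```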

Create a mysql primary key without a clustered index?

I'm a SQL Server guy experimenting with MySQL for a large upcoming project (due to licensing), and I'm not finding much information on creating a primary key without a clustered index. All the documentation I've read on 5.1 says that a primary key is automatically given a clustered index. Since I'm using a binary(16) for the primary key column (GUID), I'd rather not have a clustered index on it. So...
Is it possible to create a primary key without a clustered index? I could always put the clustered index on the date_created column instead, but how do I prevent MySQL from creating the clustered index on the primary key automatically?
If not possible, will I be OK performance-wise with a unique index on the GUID column and no primary key on the table? I'm planning to use NHibernate here, so I'm not sure if having no primary key is allowed (haven't got that far yet).
It depends on which storage engine you are using. MyISAM tables do not support clustered indices, so primary keys on MyISAM tables are not clustered. The primary key on an InnoDB table, however, is clustered.
You should consult the MySQL Manual for further details about the pros and cons of each storage engine.
With InnoDB you effectively always have a clustered key: if you don't define a primary key (or a suitable UNIQUE index on NOT NULL columns), the engine will create a hidden one for you. You could always just create an AUTO_INCREMENT field for the primary key (this is preferable to having MySQL keep hidden fields in your table, I think).
Considering what is said in 13.6.10.1. Clustered and Secondary Indexes, it seems you cannot really choose which column the clustered index is set on:
it's either on the PK column
or on the first UNIQUE index whose columns are all NOT NULL
or on some internal column, which is not that useful in your case
About the second question in your post (no PK on the table, and a UNIQUE index on the GUID): it might be possible, but it will not change anything about the clustered index; it will probably still be on the GUID column.
Some kind of hack might be to:
not define a primary key
place a UNIQUE index on your date_created field (if you don't create too many rows in short periods of time, it could be viable...)
Not sure you can place a second UNIQUE index on the GUID... Maybe.
And I'm not sure that would change much about the clustered index, but it might be worth a try...
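A sketch of that hack in MySQL DDL, assuming InnoDB and a hypothetical table named app_event. It relies on the documented InnoDB behaviour that, without a primary key, the first UNIQUE index whose columns are all NOT NULL is used as the clustered index. It has not been checked against NHibernate's requirements, and it fails on any two rows that share the same date_created value.

```sql
-- Hypothetical InnoDB table with no PRIMARY KEY.
CREATE TABLE app_event (
    guid         BINARY(16)   NOT NULL,
    date_created DATETIME     NOT NULL,
    payload      VARCHAR(255) NULL,
    -- Declared first, so InnoDB should pick it as the clustered index.
    UNIQUE KEY uq_app_event_date_created (date_created),
    -- A second UNIQUE index on the GUID is allowed and becomes
    -- an ordinary secondary index.
    UNIQUE KEY uq_app_event_guid (guid)
) ENGINE=InnoDB;
```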