I have a SQL Server table that stores employee details. The ID column is of GUID type, while the EmployeeNumber column is of INT type. Most of the time I will be using EmployeeNumber in joins and selection criteria.
My question is: is it sensible to make the ID column the primary key while putting the clustered index on EmployeeNumber?
Yes, it is possible to have a non-clustered primary key, and it is possible to have a clustered key that is completely unrelated to the primary key. By default a primary key becomes the clustered index key too, but this is not a requirement.
The primary key is a logical concept: it is the key used in your data model to reference entities.
The clustered index key is a physical concept: it is the order in which you want the rows to be stored on disk.
Choosing a different clustered key is driven by a variety of factors. One is key width: you may want a narrower clustered key than the primary key, because the clustered key is replicated in every non-clustered index. Another is support for frequent range scans (common in time series), when the data is often accessed with queries like date between '20100101' and '20100201'; a clustered index key on date would be appropriate there.
This subject has been discussed here ad nauseam before, see also What column should the clustered index be put on?.
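As a minimal sketch of the split described above, the table below keeps the GUID as a non-clustered primary key and clusters on EmployeeNumber. The table name, the FullName column, the constraint names, and the NEWID() default are assumptions made for illustration; only ID and EmployeeNumber come from the question.

```sql
-- Hypothetical schema: only ID and EmployeeNumber are taken from the question.
CREATE TABLE dbo.Employee
(
    ID             UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Employee_ID DEFAULT NEWID(),
    EmployeeNumber INT              NOT NULL,
    FullName       NVARCHAR(200)    NOT NULL,  -- assumed column
    -- Logical key: the GUID, enforced by a non-clustered unique index.
    CONSTRAINT PK_Employee PRIMARY KEY NONCLUSTERED (ID)
);

-- Physical order: rows are stored by the narrow INT column used in most joins.
CREATE UNIQUE CLUSTERED INDEX CIX_Employee_EmployeeNumber
    ON dbo.Employee (EmployeeNumber);
```

Declaring the clustered index as UNIQUE assumes EmployeeNumber never repeats; if that does not hold, drop the UNIQUE keyword.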
The ideal clustered index key is:
Sequential
Selective (no dupes, unique for each record)
Narrow
Used in Queries
In general it is a very bad idea to use a GUID as a clustered index key, since it leads to mucho fragmentation as rows are added.
EDIT FOR CLARITY:
PK and Clustered key are indeed separate concepts. Your PK does not need to be your clustered index key.
In practical applications in my own experience, the same field that is your PK should/would be your clustered key since it meets the same criteria listed above.
First, I have to say that I have misgivings about the choice of a GUID as the primary key for this table. I am of the opinion that EmployeeNumber would probably be a better choice, and something naturally unique about the employee would be better than that, such as an SSN (or ATIN), which employers must legally obtain anyway (at least in the US).
Putting that aside, you should never base a clustered index on a GUID column. The clustered index specifies the physical order of rows in the table. Since GUID values are (in theory) completely random, every new row will fall at a random location. This is very bad for performance. There is something called 'sequential' GUIDs, but I would consider this a bit of a hack.
Putting the clustered index on something other than the primary key will improve the performance of SELECT queries that can take advantage of that index.
But you will lose performance on UPDATE queries, because in most scenarios they rely on the primary key to find the specific row to update.
INSERT queries could also lose performance, because when a new row lands in the middle of the index, many rows may have to be moved physically. This won't happen with an auto-incrementing primary key, as a new record is always added at the end and no other row has to move.
If you don't know which kind of operation needs the most performance, I recommend leaving the clustered index on the primary key and using non-clustered indexes on common search criteria.
Clustered indexes cause the data to be physically stored in that order. For this reason when testing for ranges of consecutive rows, clustered indexes help a lot.
GUIDs are really bad clustered index keys, since their values follow no sensible pattern to order on. Int identity columns aren't much better unless order of entry helps (e.g. most recent hires).
Since you're probably not looking for ranges of employees it probably doesn't matter much which is the Clustered index, unless you can segment blocks of employees that you often aren't interested in (e.g. Termination Dates)
Since EmployeeNumber is unique, I would make it the PK. In SQL Server, a PK is often a clustered index.
Joins on GUIDs are just horrible. @JNK answers this well.
I am currently developing a very complicated database schema and was wondering if the fact tables should have primary keys. Each fact table has 50+ columns of data and the only way to make a primary key would be to add an auto incrementing count to each tuple. I am just not sure what this information gets us in the long term, especially since the data will be deleted after 12 months.
My dimension tables of course will have primary keys, just wanting to know what is best practice.
I am a fan of putting an identity column on all tables. This makes it easier to identify specific rows for updating and deleting.
On a fact table with lots of dimensions, of course, such a column can seem superfluous. However, there is still usually a primary key -- which is the combination of dimensions.
I would encourage you to have a primary key on the table, either an identity column or a combination of existing columns. If you use a composite primary key, you should be careful about the ordering of the keys. SQL Server defaults to using the primary key as a clustered index, and if you put the keys in the wrong order, then your table is subject to fragmentation. Identity keys don't have this issue.
It is always good to have a clustering key, which makes it easy to seek to the data when needed. The clustering key is not only used for queries against the clustered index. It is also stored in every non-clustered index leaf page, so the engine can seek back to the data pages when a key lookup is required.
Characteristics of good clustering key:
unique (no need for adding uniquefier to make value unique)
incrementing (reduces fragmentation)
narrow (fewer bytes to store in the tree pages of the clustered index and in the leaf pages of non-clustered indexes)
static (reduces fragmentation)
non-nullable (avoids null blocks)
fixed width (avoids variable blocks)
Read more in Kimberly Tripp's posts on the clustering key.
Identity columns satisfy all these criteria. They are good candidates for the clustered index.
If you are going to hold the data for longer, use BIGINT; if you only hold it for one year and then purge, the INT data type is enough.
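The recommendation above could look roughly like the following sketch. The table, the dimension key columns, and the measure are invented for illustration; the only point is the BIGINT identity column, which is unique, incrementing, narrow, static, non-nullable, and fixed width.

```sql
-- Hypothetical fact table: all names are assumptions, not from the question.
CREATE TABLE dbo.FactSales
(
    FactSalesID BIGINT IDENTITY(1,1) NOT NULL,  -- surrogate clustering key
    DateKey     INT NOT NULL,                   -- assumed dimension keys
    ProductKey  INT NOT NULL,
    StoreKey    INT NOT NULL,
    Quantity    INT NOT NULL,                   -- assumed measure
    CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (FactSalesID)
);
```

If the combination of dimension keys is meant to be unique, a separate unique non-clustered index on those columns would enforce that without changing the clustering key.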
I'm using newsequentialid to generate GUIDs for my primary key in a table.
According to the documentation (https://learn.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql?view=sql-server-ver15), sequential GUIDs aren't guaranteed to be generated in order.
After restarting Windows, the GUID can start again from a lower range,
but is still globally unique
Basically, they're in order until you reboot the machine.
For an auto-increment primary key, it makes sense for it to be the clustered index, because an inserted row is guaranteed to land at the end.
For a GUID primary key, it doesn't make sense for it to be the clustered index: because the value is random, an inserted row is unlikely to land at the end.
What about a sequential GUID primary key? Should the primary key be the clustered index, or should I try to find another column, like a DateCreated field? The problem is that fields like DateCreated aren't going to be unique. If I don't have any unique fields, what should I use as the clustered index?
Sequential GUIDs are much safer for clustered indexes than non-sequential GUIDs. In general, databases are not restarted particularly often. It is true that restarting can result in page splits and fragmentation, but that is usually not too big a consideration because restarting is rare.
That said, the primary key does not need to be the clustered index key. You can have an identity column or creation date/time as the clustered index, pretty much eliminating this issue.
I wrote a long post about this a while ago. The TL;DR is that using a sequential GUID as a clustered index key is fine. The GUIDs are actually inserted in the middle of the index, but having a small number of mid-index insertion points (here, one) does not cause expensive page splits or lead to harmful fragmentation.
Good Page Splits and Sequential GUID Key Generation
This same behavior applies to using a compound key as the clustered index, where the leading key column has lower cardinality, e.g. (CustomerId, TransactionId). Each CustomerId will have a half-full page with space for the next TransactionId, and when that page fills a new one is allocated.
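The two approaches discussed in these answers can be sketched side by side. Table names, the RowID column, and constraint names are assumptions for illustration. NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a uniqueidentifier column, which is how it appears here.

```sql
-- Option A (hypothetical names): cluster directly on the sequential GUID.
CREATE TABLE dbo.DocumentA
(
    DocumentID  UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_DocumentA_DocumentID DEFAULT NEWSEQUENTIALID(),
    DateCreated DATETIME2(3) NOT NULL
        CONSTRAINT DF_DocumentA_DateCreated DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_DocumentA PRIMARY KEY CLUSTERED (DocumentID)
);

-- Option B (hypothetical names): keep the GUID as a non-clustered primary key
-- and cluster on an identity column, so inserts always go to the end.
CREATE TABLE dbo.DocumentB
(
    RowID       INT IDENTITY(1,1) NOT NULL,
    DocumentID  UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_DocumentB_DocumentID DEFAULT NEWSEQUENTIALID(),
    DateCreated DATETIME2(3) NOT NULL
        CONSTRAINT DF_DocumentB_DateCreated DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_DocumentB PRIMARY KEY NONCLUSTERED (DocumentID)
);

CREATE UNIQUE CLUSTERED INDEX CIX_DocumentB_RowID
    ON dbo.DocumentB (RowID);
```

Option B costs an extra column and an extra index, so it is mostly worth considering when the reboot-related ordering gaps in Option A turn out to matter in practice.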
I have a primary key as part of my table defined as follows:
PRIMARY KEY CLUSTERED ([c_number], [property])
Does the database then keep the table sorted ascending by c_number by default? Is it possible to put an ASC clarifier on the statement to ensure this?
This is deployed on a Microsoft Azure SQL database.
In SQL Server, the data is stored in the data pages by the order of the clustered index, regardless of which index is defined as the primary key.
If you do not specify a sort order for the clustered index (or any index, for that matter), the order will be ascending, by default.
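For illustration, the sort direction can be written out explicitly in the constraint. The table name, the column data types, and the extra column are assumptions; only the two key column names come from the question.

```sql
-- Hypothetical table around the key from the question; ASC is the default,
-- so spelling it out only documents the intent.
CREATE TABLE dbo.CustomerProperty
(
    [c_number]      INT           NOT NULL,  -- assumed type
    [property]      NVARCHAR(50)  NOT NULL,  -- assumed type
    [propertyValue] NVARCHAR(200) NULL,      -- assumed column
    CONSTRAINT PK_CustomerProperty
        PRIMARY KEY CLUSTERED ([c_number] ASC, [property] ASC)
);

-- Storage order does not guarantee result order: a query that needs sorted
-- output still has to ask for it.
SELECT [c_number], [property], [propertyValue]
FROM dbo.CustomerProperty
ORDER BY [c_number], [property];
```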
The main reason you should care is if you have a large number of range queries that are likely to be able to do sequential I/O if you cluster on a key relevant to those queries, as well as to combat/avoid fragmentation of your table.
The primary key does not have to be clustered, but you absolutely should define a clustered index for each table (else it is a heap table) in most situations, and, if your primary key is sequential in any meaningful way, it usually makes sense for your primary key to be the clustered key.
But again, based on the SQL you showed, the answer to your question is yes - the data will be stored in ascending order by the primary clustered key you have specified. Depending on what that data is, you may be setting your table up for extreme fragmentation (composite clustered keys are rarely a good idea).
Your index is on c_number, property, which means that the index will be ordered by c_number ascending and then by property ascending, which means the data will also be stored that way. As a data page fills up, if you were to perform inserts in the following order:
(1,1)
(1,2)
(2,1)
(1,3)
You would cause fragmentation, as the page will have to be split to insert the (1,3) value between (1,2) and (2,1).
Unless that sort of situation can never happen, or your queries nearly always order, group, or filter on those two columns, I'd suggest that you cluster on a different column (not necessarily changing the columns of the primary key).
In any case, if you end up with a fragmented table, rebuild the clustered index during maintenance windows any time the fragmentation gets out of hand. It'll greatly improve response time due to reduced random I/O.
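A rough sketch of that maintenance step is shown below, assuming a table named dbo.CustomerProperty with a clustered primary key named PK_CustomerProperty (both names are hypothetical). What counts as fragmentation being "out of hand" is a judgment call and is deliberately left out.

```sql
-- Inspect fragmentation for every index on the (hypothetical) table.
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.CustomerProperty'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;

-- Rebuild the clustered index during a maintenance window.
ALTER INDEX PK_CustomerProperty ON dbo.CustomerProperty REBUILD;
```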
Possible Duplicate:
Performance difference between Primary Key and Unique Clustered Index in SQL Server
I searched this forum, but it seems nobody has asked this question before, and I couldn't find an answer anywhere else either.
My question is: "What's the difference between a primary key and a clustered index?"
Well, for starters, one is a key, and the other one is an index.
In most database lingo, key is something that somehow identifies the data, with no explicit relation to the storage or performance of the data. And a primary key is a piece of data that uniquely identifies that data.
An index, on the other hand, is something that describes a (faster) way to access data. It does not (generally) concern itself with the integrity and meaning of the data; it's just concerned with performance and storage. In SQL Server specifically, a clustered index is an index that dictates the physical order of storage of the rows. The things that it does are quite complex, but a useful approximation is that the rows are ordered by the value of the clustered index. This means that when you do not specify an ORDER BY clause, the data is likely to come back sorted by the value of the clustered index, although that is not guaranteed.
So, they are completely different things that kinda-sorta complement each other. That is why SQL Server, when you create a primary key via the designer, throws in a free clustered index along with it.
Before you can ask the difference between primary key and clustered index, you have to know that a key and an index are not the same thing.
A key can be a primary key or a foreign key. There can be only one primary key per table (but it might be more than one column). A key is a logical thing, it serves the business logic and defines the integrity of data. A foreign key is a reference to a primary key of another table.
Indexes help to speed up your queries, because they build references to the columns of your choice. An index is a separate structure that helps the queries that use the indexed columns.
A clustered index is a special index that defines the physical order of your table (ideally it is built on sequential data).
I tried to explain this in my own words, but you'll find all the resources you need with a Google search (and I definitely recommend that you read a lot about this!).
A primary key is a unique identifier for a record. It is responsible for keeping the values of that field unique. It is simply an existing or specially created field, or group of fields, that uniquely identifies a row.
A clustered index is a data structure that speeds up data retrieval by giving access to the records in order. A non-clustered index, by contrast, is a copy of part of the table and takes additional space on disk.
In many RDBMSs, as far as I know, when you create a PK the engine creates a clustered index behind the scenes. The PK is used for entity integrity, while the clustered index sets the data order and is used for performance.
Let's say we have a Product table, an Order table, and a junction table, ProductOrder.
ProductOrder will have a ProductID and an OrderID.
In most of our systems these tables also have an autonumber column called ID.
What is the best practice for placing the primary key (and therefore the clustered key)?
Should I keep the primary key on the ID field and create a non-clustered index for the foreign key pair (ProductID and OrderID)?
Or should I put the primary key on the foreign key pair (ProductID and OrderID) and put a non-clustered index on the ID column (if even necessary)?
Or ... (smart remark by one of you :))
I know these words might make you cringe, but "it depends."
It is most likely that you want the order to be based on the ProductID and/or OrderId and not the autonumber (surrogate) column since the autonumber has no natural meaning in your database. You probably want to order the join table by the same field as the parent table.
First understand why and how you are using the surrogate key ID in the first place; that will often dictate how you index it. I assume you are using the surrogate key because you are using some framework that works well with single column keys. If there is no specific design reason, then for a join table, I'd simplify the problem and just remove the autonumber ID, if it brings no other benefit. The primary key becomes the (ProductID, OrderID). If not, you need to at least make sure your index on the (ProductID, OrderID) tuple is unique to preserve data integrity.
Clustered indexes are good for sequential scans/joins when the query needs the results in the same order that the index is ordered. So, look at your access patterns, figure out by which key(s) you will be doing sequential, multi-row selects / scans, and by which key you'll be doing random, individual row access, and create the clustered index on the key you'll scan most, and the non-clustered key index on the key you'll use for random access. You have to choose one or the other, since you cannot cluster both.
NOTE: If you have conflicting requirements, there is a technique ("trick") that may help. If all of the columns in a query are found in an index, then that index is a candidate for the database engine to use on its own, without touching the table, to satisfy the query. You can use this fact to store data in more than one order, even if those orders conflict with one another. Just be aware of the pros and cons of adding more fields to an index, and make a conscious decision after understanding the nature and frequency of the queries that will be processed.
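As a hedged sketch of that trick, the junction table below is clustered on the autonumber ID but also carries a second index that contains every column one particular query needs. The table name, the Quantity column, and the index names are assumptions made for illustration.

```sql
-- Hypothetical junction table that keeps the autonumber ID as the clustered key.
CREATE TABLE dbo.ProductOrderWithId
(
    ID        INT IDENTITY(1,1) NOT NULL,
    ProductID INT NOT NULL,
    OrderID   INT NOT NULL,
    Quantity  INT NOT NULL,  -- assumed payload column
    CONSTRAINT PK_ProductOrderWithId PRIMARY KEY CLUSTERED (ID)
);

-- Second "copy" of the data in (ProductID, OrderID) order. INCLUDE adds
-- Quantity to the leaf level so the query below can be answered from this
-- index alone.
CREATE UNIQUE NONCLUSTERED INDEX IX_ProductOrderWithId_Product_Order
    ON dbo.ProductOrderWithId (ProductID, OrderID)
    INCLUDE (Quantity);

-- Every referenced column lives in the index above.
SELECT OrderID, Quantity
FROM dbo.ProductOrderWithId
WHERE ProductID = 42;
```

The trade-off is the one the note mentions: each extra index has to be maintained on every insert, update, and delete.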
The correct and only answer is:
Primary key is ('orderid' , 'productid')
Another index on ('productid' , 'orderid')
Either can be clustered, but PK is by default
Because:
You don't need an index on orderid or productid alone: the optimiser will use one of the indexes
You'll most likely use the table "both" ways
You don't need a surrogate key because you already have them on the linked tables. So a third column wastes space.
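The layout this answer recommends could be written as follows. The INT data types and the constraint and index names are assumptions, and the foreign keys to the Product and Order tables are left out so the sketch stands on its own.

```sql
-- Junction table without a surrogate key (types and names are assumptions).
CREATE TABLE dbo.ProductOrder
(
    OrderID   INT NOT NULL,
    ProductID INT NOT NULL,
    -- Primary key on the pair; clustered by default, stated explicitly here.
    CONSTRAINT PK_ProductOrder PRIMARY KEY CLUSTERED (OrderID, ProductID)
);

-- Reverse-order index so lookups that start from the product side are served too.
CREATE UNIQUE NONCLUSTERED INDEX IX_ProductOrder_ProductID_OrderID
    ON dbo.ProductOrder (ProductID, OrderID);
```

Swapping which of the two is clustered only requires moving the CLUSTERED and NONCLUSTERED keywords.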
This appears to be for a dynamic system where many orders will be added. The clustered index should therefore be on your autonumbered column.
You can make the autonumber column the primary key and put another unique index on the pair of columns. Or, you can make the pair of columns the primary (but non-clustered) key.
The choice of using the primary key or a unique index key is up to you. But I would make sure that the one that is clustered is for your autonumber column.
My preference has always been to create an autonumber for Primary Keys. Then I create a unique index on the two Foreign keys so that they are not duplicated.
The reason I do this is that the more I normalize my data, the more keys I have to use in joins. I have ended up with designs going six to seven levels deep, and if I use keys flowing from one level to another, I could potentially end up with n^2 keys in the join.
Try convincing my SQL Developers to use all of that for a single query, and they will really like me.
I keep it simple.