I'm using NEWSEQUENTIALID() to generate GUIDs for the primary key of a table.
According to the documentation (https://learn.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql?view=sql-server-ver15), sequential GUIDs aren't guaranteed to be generated in order.
After restarting Windows, the GUIDs can start again from a lower range, but they are still globally unique.
Basically, they're in order until you reboot the machine.
For an auto-increment primary key, it makes sense for it to be the clustered index because an inserted row is guaranteed to go at the end.
For a GUID primary key, it doesn't make sense for it to be the clustered index: the values are random, so an inserted row is unlikely to go at the end.
What about a sequential GUID primary key? Should the primary key be the clustered index, or should I try to find another column, like a DateCreated field? The problem is that fields like DateCreated aren't going to be unique. If I don't have any unique fields, what should I use as the clustered index?
Sequential GUIDs are much safer for clustered indexes than non-sequential GUIDs. In general, databases are not restarted particularly often. It is true that restarting can result in page splits and fragmentation, but that is usually not too big a consideration because restarting is rare.
That said, the primary key does not need to be the clustered index key. You can have an identity column or creation date/time as the clustered index, pretty much eliminating this issue.
I wrote a long post about this a while ago. The TL;DR is that using a sequential GUID as a clustered index key is fine. After a restart the GUIDs are actually inserted in the middle of the index, but having a small number (here, one) of mid-index insertion points does not cause expensive page splits or lead to harmful fragmentation.
Good Page Splits and Sequential GUID Key Generation
The same behavior applies when a compound key is used as the clustered index and the leading key column has lower cardinality, e.g. (CustomerId, TransactionId). Each CustomerId will have a partially filled page with space for the next TransactionId, and when that page fills a new one is allocated.
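To make the "small number of insertion points" idea concrete, here is a rough sketch in Python. It is a toy model, not SQL Server's storage engine: pages are represented only by their first key, splits are ignored, and all key values and sizes are made up for illustration. It simply counts how many existing leaf pages a batch of new keys would land on.

```python
import bisect
import random

def pages_touched(page_first_keys, new_keys):
    """Count distinct existing leaf pages that would receive at least one new key."""
    touched = set()
    for key in new_keys:
        # Target page = last page whose first key is <= the new key.
        idx = max(bisect.bisect_right(page_first_keys, key) - 1, 0)
        touched.add(idx)
    return len(touched)

# Hypothetical index: 1,000 leaf pages, page i starts at key i * 1_000_000.
page_first_keys = [i * 1_000_000 for i in range(1_000)]

# Ascending keys that restarted in a lower range: one mid-index insertion point.
sequential_keys = [500_000_000 + i for i in range(1, 501)]

# Random keys spread over the whole key range (like non-sequential GUIDs).
rng = random.Random(42)
random_keys = [rng.randrange(0, 1_000_000_000) for _ in range(500)]

sequential_pages = pages_touched(page_first_keys, sequential_keys)
random_pages = pages_touched(page_first_keys, random_keys)
print("sequential:", sequential_pages, "random:", random_pages)
```

In this model the ascending batch lands on a single page, while the random batch is scattered over a large share of the index, which is the pattern that makes random GUIDs costly.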
I am currently developing a very complicated database schema and was wondering if the fact tables should have primary keys. Each fact table has 50+ columns of data and the only way to make a primary key would be to add an auto incrementing count to each tuple. I am just not sure what this information gets us in the long term, especially since the data will be deleted after 12 months.
My dimension tables of course will have primary keys, just wanting to know what is best practice.
I am a fan of putting an identity column on all tables. This makes it easier to identify specific rows for updating and deleting.
On a fact table with lots of dimensions, of course, such a column can seem superfluous. However, there is still usually a primary key -- which is the combination of dimensions.
I would encourage you to have a primary key on the table, either an identity column or a combination of existing columns. If you use a composite primary key, you should be careful about the ordering of the keys. SQL Server defaults to using the primary key as a clustered index, and if you put the keys in the wrong order, then your table is subject to fragmentation. Identity keys don't have this issue.
It is always good to have a clustering key, which makes it easy to seek to the data when needed. The clustering key is not only used for queries on the clustered index. It is also stored in every non-clustered index leaf page, so the engine can seek back to the data pages when a key lookup is needed.
Characteristics of good clustering key:
unique (no need for adding uniquefier to make value unique)
incrementing (reduces fragmentation)
narrow (fewer bytes to store in the tree pages of the clustered index and in the leaf pages of non-clustered indexes)
static (reduces fragmentation)
non-nullable (avoids null blocks)
fixed width (avoids variable blocks)
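As a back-of-the-envelope illustration of the "narrow" point, the sketch below estimates how many bytes the clustering key alone adds to the non-clustered indexes. The row count and index count are hypothetical, and the estimate ignores row and page overhead, compression, and any uniquifier; only the datatype sizes (uniqueidentifier 16 bytes, bigint 8, int 4) are fixed.

```python
# Storage size in bytes of some candidate clustering key types.
KEY_BYTES = {"uniqueidentifier": 16, "bigint": 8, "int": 4}

def clustering_key_overhead(rows, nonclustered_indexes, key_type):
    """Bytes spent repeating the clustering key in every non-clustered index leaf row."""
    return rows * nonclustered_indexes * KEY_BYTES[key_type]

rows = 10_000_000  # hypothetical table size
indexes = 4        # hypothetical number of non-clustered indexes

guid_overhead = clustering_key_overhead(rows, indexes, "uniqueidentifier")
int_overhead = clustering_key_overhead(rows, indexes, "int")

print(guid_overhead)                 # 640000000
print(int_overhead)                  # 160000000
print(guid_overhead - int_overhead)  # 480000000
```

Under these assumptions, a GUID clustering key costs roughly 480 MB more than an int across the four indexes, before counting the wider clustered index itself.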
Read more in Kimberly Tripp's post on clustering keys.
An identity column satisfies all these criteria, so it is a good candidate for the clustered index.
If you are going to hold the data for longer, you can go for bigint; if you are going to hold it for one year and then purge, the int data type is enough.
From what I have read, indexing is like writing an index page at the front of a book so that the database doesn't have to go through all the pages.
If the primary key is indexed, wouldn't that be exactly the same as going through the entire book? The values are all unique anyway, so the index of the primary key has as many entries as there are documents. If so, what is the purpose of indexing primary keys if there is no performance benefit?
The primary key is an index -- keys are indexes! It's just a special name for a special kind of index which is always unique, and which may have an automatically assigned value.
In some databases, the rows are sometimes (or always) stored in the same order as the primary key. In these situations, the primary key may not need to be separately indexed -- the order of the rows is enough of an index on its own.
In some other databases, the primary key is not treated differently. The rows are stored in an arbitrary order -- perhaps in the order they were last modified, for example. In these situations, an index is needed on the primary key to look up the rows.
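To see why an index helps even when every key is unique, here is a small Python analogy. It is only a sketch: real databases typically use B-trees rather than a sorted list, and the table size is made up. The point is that an ordered structure lets you find one key in a handful of steps instead of reading every row.

```python
def scan_lookup(keys, target):
    """Unindexed lookup: check rows one by one, counting rows examined."""
    steps = 0
    for key in keys:
        steps += 1
        if key == target:
            break
    return steps

def indexed_lookup(sorted_keys, target):
    """Lookup in an ordered structure: halve the search range at each step."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            break
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps

keys = list(range(1_000_000))  # one million unique primary key values
target = 999_999

scan_steps = scan_lookup(keys, target)      # examines every row
index_steps = indexed_lookup(keys, target)  # at most about 20 steps
print(scan_steps, index_steps)
```

Uniqueness does not make the index useless; it is the ordering of the index, not the grouping of duplicate values, that saves the work.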
I have a primary key as part of my table defined as follows:
PRIMARY KEY CLUSTERED ([c_number], [property])
Does the database then keep the table sorted ascending by c_number by default? Is it possible to put an ASC clarifier on the statement to ensure this?
This is deployed on a Microsoft Azure SQL database.
In SQL Server, the data is stored in the data pages by the order of the clustered index, regardless of which index is defined as the primary key.
If you do not specify a sort order for the clustered index (or any index, for that matter), the order will be ascending, by default.
The main reason you should care is if you have a large number of range queries that are likely to be able to do sequential I/O if you cluster on a key relevant to those queries, as well as to combat/avoid fragmentation of your table.
The primary key does not have to be clustered, but you absolutely should define a clustered index for each table (else it is a heap table) in most situations, and, if your primary key is sequential in any meaningful way, it usually makes sense for your primary key to be the clustered key.
But again, based on the SQL you showed, the answer to your question is yes - the data will be stored in ascending order by the primary clustered key you have specified. Depending on what that data is, you may be setting your table up for extreme fragmentation (composite clustered keys are rarely a good idea).
Your index is on c_number, property, which means that the index will be ordered by c_number ascending and then by property ascending, which means the data will also be stored that way. As a data page fills up, if you were to perform inserts in the following order:
(1,1)
(1,2)
(2,1)
(1,3)
You would cause fragmentation, as the page will have to be split to insert the (1,3) value between (1,2) and (2,1).
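The insert sequence above can be replayed in a toy model. This is a sketch only: the page capacity of 3 rows, the 50/50 split, and the helper name insert_row are assumptions for illustration, and SQL Server's real split logic and page sizes differ.

```python
import bisect

PAGE_CAPACITY = 3  # hypothetical: a page holds three rows

def insert_row(pages, key):
    """Insert key into sorted pages; return True if an existing page had to be split."""
    # Target page = last page whose first key is <= key (or the first page).
    idx = 0
    for i, page in enumerate(pages):
        if page and page[0] <= key:
            idx = i
    page = pages[idx]
    if len(page) < PAGE_CAPACITY:
        bisect.insort(page, key)
        return False
    if idx == len(pages) - 1 and key > page[-1]:
        # Key goes after everything else: start a new page, move nothing.
        pages.append([key])
        return False
    # Key belongs inside a full page: move the upper half to a new page.
    mid = len(page) // 2
    new_page = page[mid:]
    del page[mid:]
    pages.insert(idx + 1, new_page)
    bisect.insort(page if key < new_page[0] else new_page, key)
    return True

out_of_order = [[]]
splits = [insert_row(out_of_order, k) for k in [(1, 1), (1, 2), (2, 1), (1, 3)]]
print(splits)        # [False, False, False, True]
print(out_of_order)  # [[(1, 1)], [(1, 2), (1, 3), (2, 1)]]

in_order = [[]]
no_splits = [insert_row(in_order, k) for k in [(1, 1), (1, 2), (1, 3), (2, 1)]]
print(no_splits)     # [False, False, False, False]
```

The same four rows inserted in key order never force existing rows to move, which is why an ever-increasing clustering key avoids this kind of split.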
Unless that sort of situation can never happen, or your queries nearly always order, group, or filter on those two columns, I'd suggest clustering on a different column (without necessarily changing the columns of the primary key).
In any case, if you end up with a fragmented table, rebuild the clustered index during maintenance windows any time the fragmentation gets out of hand. It'll greatly improve response time due to reduced random I/O.
I have a SQL Server table that stores employee details. The ID column is of GUID type, while the EmployeeNumber column is of INT type. Most of the time I will be dealing with EmployeeNumber in joins and selection criteria.
My question is: is it sensible to put the primary key on the ID column and the clustered index on EmployeeNumber?
Yes, it is possible to have a non-clustered primary key, and it is possible to have a clustered key that is completely unrelated to the primary key. By default a primary key becomes the clustered index key too, but this is not a requirement.
The primary key is a logical concept: it is the key used in your data model to reference entities.
The clustered index key is a physical concept: it is the order in which you want the rows to be stored on disk.
Choosing a different clustered key is driven by a variety of factors. One is key width, when you want a narrower clustered key than the primary key (because the clustered key gets replicated in every non-clustered index). Another is support for frequent range scans (common in time series), when the data is frequently accessed with queries like date between '20100101' and '20100201' (a clustered index key on date would be appropriate).
This subject has been discussed here ad nauseam before, see also What column should the clustered index be put on?.
The ideal clustered index key is:
Sequential
Selective (no dupes, unique for each record)
Narrow
Used in Queries
In general it is a very bad idea to use a GUID as a clustered index key, since it leads to mucho fragmentation as rows are added.
EDIT FOR CLARITY:
PK and Clustered key are indeed separate concepts. Your PK does not need to be your clustered index key.
In practical applications in my own experience, the same field that is your PK should/would be your clustered key since it meets the same criteria listed above.
First, I have to say that I have misgivings about the choice of a GUID as the primary key for this table. I am of the opinion that EmployeeNumber would probably be a better choice, and something naturally unique about the employee would be better than that, such as an SSN (or ATIN), which employers must legally obtain anyway (at least in the US).
Putting that aside, you should never base a clustered index on a GUID column. The clustered index specifies the physical order of rows in the table. Since GUID values are (in theory) completely random, every new row will fall at a random location. This is very bad for performance. There is something called 'sequential' GUIDs, but I would consider this a bit of a hack.
Using a clustered index on something other than the primary key will improve the performance of SELECT queries that can take advantage of this index.
But you will lose performance on UPDATE queries, because in most scenarios they rely on the primary key to find the specific row you want to update.
INSERT queries could also lose performance, because when you add a new row in the middle of the index, a lot of rows have to be moved (physically). This won't happen with an incrementing primary key, as new records will always be added at the end and won't move any other rows.
If you don't know which kind of operation needs the most performance, I recommend leaving the clustered index on the primary key and using nonclustered indexes on common search criteria.
Clustered indexes cause the data to be physically stored in that order. For this reason when testing for ranges of consecutive rows, clustered indexes help a lot.
GUIDs make really bad clustered indexes, since their order does not follow any sensible pattern to sort on. Int identity columns aren't much better unless order of entry helps (e.g. most recent hires).
Since you're probably not looking for ranges of employees it probably doesn't matter much which is the Clustered index, unless you can segment blocks of employees that you often aren't interested in (e.g. Termination Dates)
Since EmployeeNumber is unique, I would make it the PK. In SQL Server, a PK is often a clustered index.
Joins on GUIDs are just horrible. @JNK answers this well.
I'm a SQL Server guy experimenting with MySQL for a large upcoming project (due to licensing), and I'm not finding much information on creating a primary key without a clustered index. All the documentation I've read on 5.1 says that a primary key is automatically given a clustered index. Since I'm using a binary(16) for the primary key column (GUID), I'd rather not have a clustered index on it. So...
Is it possible to create a primary key without a clustered index? I could always put the clustered index on the date_created column instead, but how do I prevent MySQL from creating the clustered index on the primary key automatically?
If not possible, will I be OK performance-wise with a unique index on the GUID column and no primary key on the table? I'm planning to use NHibernate here, so I'm not sure if having no primary key is allowed (haven't got that far yet).
It depends on which storage engine you are using. MyISAM tables do not support clustered indices, so primary keys on MyISAM tables are not clustered. The primary key on an InnoDB table, however, is clustered.
You should consult the MySQL Manual for further details about the pros and cons of each storage engine.
You need to have a primary key; if you don't create one yourself, MySQL will create a hidden one for you. You could always just create an AUTO_INCREMENT field for the primary key (this is preferable to having MySQL have hidden fields in your table, I think).
Considering what is said in 13.6.10.1. Clustered and Secondary Indexes, it seems you cannot really choose which column the clustered index is set on:
it's either on the PK column
or on the first column with a UNIQUE index that only has non-null values
or on some internal column -- not that useful in your case ^^
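The three rules above can be written down as a small decision function. This is only a sketch of the documented selection order, not anything MySQL exposes: the Index class, the function name, and the index names are hypothetical, while GEN_CLUST_INDEX is the name the MySQL manual gives to the hidden clustered index.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Index:
    name: str
    columns: List[str]
    unique: bool = False
    primary: bool = False

def choose_clustered_index(indexes: List[Index], nullable_columns: Set[str]) -> str:
    """Return the name of the index InnoDB would cluster on, per the rules above."""
    # Rule 1: the primary key, if one is defined.
    for ix in indexes:
        if ix.primary:
            return ix.name
    # Rule 2: the first UNIQUE index whose columns are all NOT NULL.
    for ix in indexes:
        if ix.unique and not any(c in nullable_columns for c in ix.columns):
            return ix.name
    # Rule 3: a hidden clustered index on an internal row id.
    return "GEN_CLUST_INDEX"

# No primary key; UNIQUE indexes on date_created and on the GUID column.
no_pk = [
    Index("uq_date_created", ["date_created"], unique=True),
    Index("uq_guid", ["guid"], unique=True),
]
print(choose_clustered_index(no_pk, nullable_columns=set()))  # uq_date_created

with_pk = [Index("PRIMARY", ["guid"], unique=True, primary=True)] + no_pk
print(choose_clustered_index(with_pk, nullable_columns=set()))  # PRIMARY
```

Under this reading, the order in which the UNIQUE indexes are defined matters when there is no primary key, and a nullable column disqualifies its index from being chosen.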
About the second question in your post (no PK on the table, and a UNIQUE index on the GUID): it might be possible, but it will not change anything about the clustered index, which will probably still be on the GUID column.
A possible hack might be to:
not define a primary key
place a UNIQUE index on your date_created field (if you don't create too many rows in short periods of time, it could be viable...)
Not sure you can place a second UNIQUE index on the GUID... Maybe ^^
And not sure that would change much about the clustered index, but it might be worth a try...