How best to update row where composite primary key values change - sql

We have numerous tables where we have composite keys made up of MULTIPLE columns. In some cases as many as SIX values make up the primary key, for a table that is not super large, maybe a few thousand rows, and is not accessed very heavily.
A better solution to this would be to use a primary key that is a single auto-incremented ID field and in order to make sure the combination of the six different fields now used as a primary key are unique you can create an index with a unique constraint. The performance might not be quite so good, but the code complexity would be DRASTICALLY reduced.
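Concretely, what I have in mind is something like this sketch (all table and column names here are invented for illustration):

CREATE TABLE dbo.RateSchedule
(
    RateScheduleId INT IDENTITY(1,1) NOT NULL,
    RegionCode     CHAR(2)      NOT NULL,
    ProductCode    VARCHAR(10)  NOT NULL,
    CustomerType   TINYINT      NOT NULL,
    EffectiveDate  DATE         NOT NULL,
    CurrencyCode   CHAR(3)      NOT NULL,
    TierLevel      TINYINT      NOT NULL,
    Rate           DECIMAL(9,4) NOT NULL,
    -- Single-column surrogate primary key...
    CONSTRAINT PK_RateSchedule PRIMARY KEY CLUSTERED (RateScheduleId),
    -- ...while the six former key columns are still kept unique.
    CONSTRAINT UQ_RateSchedule_NaturalKey UNIQUE
        (RegionCode, ProductCode, CustomerType, EffectiveDate, CurrencyCode, TierLevel)
);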
I was told that making the primary key so complex is necessary because the primary key is the only clustered index on a table and that this enhances the performance. I can understand how this would help, but is it THAT big of a performance enhancement? It seems to be a premature optimization.
Is it common practice to use composite primary keys? I understand that if you had a very large table, with many thousands of entries, and that was hit constantly, then even a small performance enhancement could be worth adding the complexity I am seeing.
It also seems like having a primary key made up of values that can be updated/changed is just asking for trouble. If other tables are referencing it, couldn't that lead to issues?
This would mostly be for adding new tables, since changing the structure of the existing tables could be too drastic a change for them to accept. But I want to know if I am out of line before trying to push back against this practice.

Generally, using many columns to form a PRIMARY KEY is among the worst practices I find regularly in my database audits. In fact it was used in the hierarchical database model dating from the 1950s... and was dropped due to poor performance!
The relational model says that the key can be any column or group of columns, but database experts and practitioners have repeatedly demonstrated that the best way to get performance and ensure scalability is to have a key that is a single column, with a datatype that is:
the shortest in terms of bytes
the biggest in terms of range of values
asemantic (carrying no business meaning)
monotonically increasing
The only way to satisfy all these considerations is a PRIMARY KEY with an auto-increment datatype such as IDENTITY or a SEQUENCE.
Every other datatype or approach carries extra overhead or performs poorly.
In the case of a PK with compound columns, the statistics for the optimizer are accurate only for the first column of the key. Accurate statistics for combinations of multiple columns do not exist (except for an equality match on the complete set of key columns, where the estimate is of course always 1), which leads the optimizer to fall back on an averaged global selectivity or, worse, a computed correlated cardinality. In either case the execution plan will be of poor quality and sometimes catastrophic...
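You can observe this in SQL Server with DBCC SHOW_STATISTICS, here against a hypothetical table dbo.Orders with a compound primary key named PK_Orders:

-- The HISTOGRAM result set describes only the leading key column;
-- the DENSITY_VECTOR rows give only average densities for key prefixes.
DBCC SHOW_STATISTICS ('dbo.Orders', 'PK_Orders');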
For MS SQL Server, clustered indexes are the best choice for the PK, but only if all the specifications I wrote above are strictly applied.

Related

SQL Server composite unique key performance

I have 6 columns on which I want to apply a composite unique key to prevent duplication. Is it safe performance-wise?
Consider that we will perform CRUD operations using those 6 key columns.
If you are asking whether an index with six columns is "safe", then it is perfectly fine. If your data model wants the combination of the six columns to be unique, then a unique index or unique constraint (which is implemented using a unique index) is how this is enforced.
There is overhead to an index -- whether it has one column or six. Generally, the cost is low enough not to be an issue, and the gains in data integrity and faster queries outweigh the cost.
Do note that there is a limit to the overall size of the key for any index (in SQL Server, 900 bytes for a clustered index key, and 1,700 bytes for a nonclustered index key as of SQL Server 2016; 900 bytes in earlier versions). If your data values exceed that size, then you will not be able to maintain the index.
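For illustration, enforcing such a six-column business key might look like this (all names invented); behind the scenes, SQL Server backs the constraint with a unique index:

-- Unique constraint over the six business-key columns (hypothetical names).
ALTER TABLE dbo.Shipments
    ADD CONSTRAINT UQ_Shipments_BusinessKey
    UNIQUE (OriginCode, DestCode, CarrierCode, ServiceLevel, ShipDate, SequenceNo);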
composite unique key on 6 columns
It depends on the real example. The first thing that comes to mind is whether this table is normalized or denormalized.
It also depends on each column's data type, and how each combination is populated.
This will be a really very wide index. The cost of the index will increase.
There will be very high index fragmentation.
So not only will the cost of CRUD increase; SELECT queries will also suffer.
Because of the high index cost, the optimizer will often choose an index scan.
However, if you create the clustered index on an INT IDENTITY column,
this will improve your CRUD performance.
Then create the composite unique key with the most selective column first (see the sketch below).
In other words, the order of columns in a composite index is important.
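Here is a sketch of that layout (all names invented; assume SerialNo is the most selective of the six columns):

CREATE TABLE dbo.Widget
(
    WidgetId  INT IDENTITY(1,1) NOT NULL,
    SerialNo  VARCHAR(30) NOT NULL,
    PlantCode CHAR(4)     NOT NULL,
    LineNum   TINYINT     NOT NULL,
    ShiftCode CHAR(1)     NOT NULL,
    BatchDate DATE        NOT NULL,
    BinNum    SMALLINT    NOT NULL,
    -- Narrow, ever-increasing clustering key keeps CRUD cheap.
    CONSTRAINT PK_Widget PRIMARY KEY CLUSTERED (WidgetId)
);

-- Uniqueness enforced separately; the most selective column leads.
CREATE UNIQUE NONCLUSTERED INDEX UQ_Widget_BusinessKey
    ON dbo.Widget (SerialNo, PlantCode, LineNum, ShiftCode, BatchDate, BinNum);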

Do clustered indexes have to be unique?

What happens if a clustered index is not unique? Can it lead to bad performance because inserted rows flow to an "overflow" page of some sort?
Is it "made" unique and, if so, how? What is the best way to make it unique?
I am asking because I am currently using a clustered index to divide my table in logical parts, but the performance is so-so, and recently I got the advice to make my clustered indexes unique. I'd like a second opinion on that.
They don't have to be unique but it certainly is encouraged.
I haven't encountered a scenario yet where I wanted to create a CI on a non-unique column.
What happens if you create a CI on a non-unique column
If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier
Does this lead to bad performance?
Adding a uniqueifier certainly adds some overhead in calculating and in storing it.
Whether this overhead will be noticeable depends on several factors:
How much data the table contains.
What is the rate of inserts.
How often is the CI used in a select (when no covering indexes exist, pretty much always).
Edit
As has been pointed out by Remus in the comments, there do exist use cases where creating a non-unique CI would be a reasonable choice. Me not having encountered one of those scenarios merely shows my own lack of exposure or competence (pick your choice).
I like to check out what The Queen of Indexing, Kimberly Tripp, has to say on the topic:
I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which in SQL Server 2000, and fewer of which in SQL Server 2005, require that your table be offline). OK, I'll get to the rebuild stuff later.....
Let's start with the key things that I look for in a clustering key:
* Unique
* Narrow
* Static
Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique; otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will "uniquify" it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustered index rebuild (which, because of the poor choice for the clustering key, is now more likely).
Source: Ever-increasing clustering key debate - again!
Do clustered indexes have to be unique?
They don't, and there are times where it's better if they're not.
Consider a table with a semi-random, unique EmployeeId, and a DepartmentId for each employee: if your select statement is
SELECT * FROM EmployeeTable WHERE DepartmentId = @DepartmentId
then it's best for performance if the DepartmentId is the clustered index even though (or even especially because) it's not the unique index (best for performance because it ensures that all the records within a given DepartmentId are clustered).
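For example, the non-unique clustered index described above could be created like this:

-- Non-unique clustered index: SQL Server silently appends a 4-byte
-- uniqueifier to duplicate DepartmentId values.
CREATE CLUSTERED INDEX CIX_EmployeeTable_DepartmentId
    ON dbo.EmployeeTable (DepartmentId);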
Do you have any references?
There's Clustered Index Design Guidelines for example, which says,
With few exceptions, every table should have a clustered index defined on the column, or columns, that offer the following:
Can be used for frequently used queries.
Provide a high degree of uniqueness.
Can be used in range queries.
My understanding of "high degree of uniqueness", for example, is that it isn't good to choose "Country" as the clustered index if most of your queries want to select the records within a given town.
If you are tuning an old DB this is a godsend. I am working on perf issues in a 20-year-old DB. It has nonclustered PKs with 3 - 8 columns. Instead of using all 8 columns to be unique, I can pick one column with broad distribution and let SQL Server apply a uniqueifier. The uniqueifier is an INT, so a clustering key like Project ID can absorb up to 2,147,483,647 duplicate rows per project ID, which is enough for most use cases. If it is not enough, add a second or third column to the cluster.
This works without any coding modification in the App layer. 20 years in production and management doesn't have to order a major rewrite.

Sql Server Legacy Database To Clustered index or not

We have a legacy database which is a sql server db (2005, and 2008).
All of the primary keys in the tables are UniqueIdentifiers.
The tables currently have no clustered index created on them, and we are running into performance issues on tables with only 750k records. This is the first database I've worked on with unique identifiers as the sole primary key, and I've never seen SQL Server be this slow at returning data.
I don't want to create a clustered index on the uniqueidentifier as they are not sequential and will therefore slow the apps down when it comes to inserting data.
We cannot remove the uniqueidentifier as that is used for remote site record identity management purposes.
I had thought about adding a big integer identity column to the tables and creating the clustered index on this column while keeping the unique identifier column in place (see the sketch below).
i.e.
int identity - First column to maintain insert speeds
unique identifier - To ensure the application keeps working as expected.
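A minimal sketch of that idea, assuming a hypothetical dbo.Orders table. Note that a clustered index has no INCLUDE list, and since the tables are heaps today, the existing uniqueidentifier primary key is already nonclustered and can stay untouched:

-- New narrow, ever-increasing surrogate column (hypothetical names)...
ALTER TABLE dbo.Orders ADD OrderSeq BIGINT IDENTITY(1,1) NOT NULL;

-- ...which becomes the clustering key; the GUID PK stays nonclustered.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderSeq
    ON dbo.Orders (OrderSeq);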
The goal is to improve the identity query and joined table query performance.
Q1: Will this improve the query performance of the db or will it slow it down?
Q2: Is there an alternative to this that I haven't listed?
Thanks
Pete
Edit: The performance issues are on retrieving data quickly through select statements, especially if a few of the more "transactional / changing" tables are joined together.
Edit 2: The joins between tables are generally all between the primary key and foreign keys; for tables that have foreign keys, those columns are included in the non-clustered indexes to provide more covering indexes.
The tables all have no other values which would provide a good clustered index.
I'm leaning more towards adding an additional identity column on each of the high load tables and then including the current Guid PK column within the clustered index to provide the best query performance.
Edit 3:
I would estimate that 80% of the queries are performed on primary and foreign keys alone through the data access mechanism. Generally our data model has lazy-loaded objects which perform the query when accessed; these queries use the object's ID and the PK column. We have a large number of user-driven data exclusion/inclusion queries which use the foreign key columns as a filter, based on criteria of the form "for type X, exclude the following IDs". The remaining 20% are WHERE clauses on enum (int) or date-range columns; very few text-based queries are performed in the system.
Where possible I have already added covering indexes to cover the heaviest queries, but as yet I'm still disappointed by the performance. As bluefooted says, the data is being stored as a heap.
If you don't have a clustered index on the table, it is being stored as a heap rather than a b-tree. Heap data access is absolutely atrocious in SQL Server so you definitely need to add a clustered index.
I agree with your analysis that the GUID column is a poor choice for clustering, especially since you don't have the ability to use NEWSEQUENTIALID(). You could create a new artificial integer key if you like, but if there is another column or combination of columns that would make sense as a clustered index, that is fine as well.
Do you have a field that is used frequently for range scans? Which columns are used for joins? Is there a combination of columns that also uniquely identifies the row aside from the GUID? Posting a sample of the data model would help us to suggest a good candidate for clustering.
I'm not sure where your GUIDs come from, but if they're being generated during the insert, using the NEWSEQUENTIALID() in SQL Server instead of NEWID() will help you avoid fragmentation issues during insert.
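Note that NEWSEQUENTIALID() can only appear in a DEFAULT constraint - it cannot be called directly in a query. A sketch with invented names:

-- Sequential GUIDs generated server-side on insert (hypothetical names).
ALTER TABLE dbo.Orders
    ADD CONSTRAINT DF_Orders_OrderGuid
    DEFAULT NEWSEQUENTIALID() FOR OrderGuid;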
Regarding the choice of a clustered index, as Kimberly L. Tripp states here: "the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits)." A GUID falls short on the narrow requirement when compared to an INT or even BIGINT.
Kimberly also has an excellent article on GUIDs as PRIMARY KEYs and/or the clustering key.
It's not 100% clear to me: is your number 1 access pattern to query the tables by the GUID or by other columns? And when joining to other tables, what columns (and data types) are most often used?
I can't really give you any solid recommendations until I understand more about how these GUIDs are used. I realize you said they're primary keys, but that doesn't guarantee they are used as the primary conditions on queries or in joins.
UPDATE
Now that I know a little more, I have a crazy suggestion. Do cluster those tables on the GUIDs, but set the fill factor to 60%. This will ameliorate the page split problem and give you better performance querying on those puppies.
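Assuming the clustered primary key is named PK_Orders on a hypothetical dbo.Orders table, that would look like:

-- Leave 40% free space per leaf page to absorb the random GUID inserts.
ALTER INDEX PK_Orders ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 60);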
As for using Guid.NewGuid(), it seems that you can do sequentialGUIDs in C# after all. I found the following code here on SO:
[DllImport("rpcrt4.dll", SetLastError = true)]
static extern int UuidCreateSequential(out Guid guid);
public static Guid SequentialGuid()
{
const int RPC_S_OK = 0;
Guid g;
if (UuidCreateSequential(out g) != RPC_S_OK)
return Guid.NewGuid();
else
return g;
}
NEWSEQUENTIALID() is actually just a wrapper for UuidCreateSequential. I'm sure if you can't use this directly on the client, you can figure out a way to make a quick round-trip to the server to get a new sequential id from there, perhaps even with a "dispenser" table and a stored procedure to do the job.
You don't indicate what your performance issues are. If the worst performing action is an INSERT, then maybe your solution is right. If it's something else, then I'd look at how the clustered index can help that.
You might look at existing indexes on the table and the queries that use them. You may be able to select an index that, while degrades INSERTs slightly, provides a greater benefit to the current performance-problem areas.

Does clustered index on foreign key column increase join performance vs non-clustered?

In many places it's recommended that clustered indexes are best utilized when selecting a range of rows with a BETWEEN statement. When I join on a foreign key field in such a way that the clustered index is used, I would guess that the clustering should help too, because a range of rows is being selected even though they all have the same clustered key value and BETWEEN is not used.
Considering that I care only about that one select with a join and nothing else, am I wrong in my guess?
Discussing this type of issue in the absolute isn't very useful.
It is always a case-by-case situation!
Essentially, access by way of a clustered index saves one indirection, period.
Assuming the key used in the JOIN is that of the clustered index, a single read [whether from an index seek, a scan, or a partial scan - it doesn't matter] gets you the whole row (record).
One problem with clustered indexes, is that you only get one per table. Therefore you need to use it wisely. Indeed in some cases, it is even wiser not to use any clustered index at all because of INSERT overhead and fragmentation (depending on the key and the order of new keys etc.)
Sometimes one gets the equivalent benefits of a clustered index with a covering index, i.e. an index with the desired key(s) sequence, followed by the column values we are interested in. Just like a clustered index, a covering index doesn't require the indirection to the underlying table. Indeed, the covering index may be slightly more efficient than the clustered index, because it is smaller.
However, and also, just like clustered indexes, and aside from the storage overhead, there is a performance cost associated with any extra index, during INSERT (and DELETE or UPDATE) queries.
And, yes, as indicated in other answers, the "foreign-key-ness" of the key used for the clustered index has absolutely no bearing on the performance of the index. FKs are constraints aimed at easing the maintenance of the integrity of the database, but the underlying fields (columns) are otherwise just like any other field in the table.
To make wise decisions about index structure, one needs
to understand the way the various index types (and the heap) work
(and, BTW, this varies somewhat between SQL implementations)
to have a good image of the statistical profile of the database(s) at hand:
which are the big tables, which are the relations, what's the average/maximum cardinality of relations, what's the typical growth rate of the database etc.
to have good insight regarding the way the database(s) is (are) going to be used/queried
Then, and only then, can one make educated guesses about the interest [or lack thereof] of having a given clustered index.
I would ask something else: would it be wise to put my clustered index on a foreign key column just to speed a single JOIN up? It probably helps, but..... at a price!
A clustered index makes a table faster, for every operation. YES! It does. See Kim Tripp's excellent The Clustered Index Debate continues for background info. She also mentions her main criteria for a clustered index:
narrow
static (never changes)
unique
if ever possible: ever increasing
INT IDENTITY fulfills this perfectly - GUIDs do not. See GUIDs as Primary Key for extensive background info.
Why narrow? Because the clustering key is added to each and every index page of each and every non-clustered index on the same table (in order to be able to actually look up the data row, if needed). You don't want to have VARCHAR(200) in your clustering key....
Why unique?? See above - the clustering key is the item and mechanism that SQL Server uses to uniquely find a data row. It has to be unique. If you pick a non-unique clustering key, SQL Server itself will add a 4-byte uniqueifier to your keys. Be careful of that!
So those are my criteria - put your clustering key on a narrow, stable, unique, hopefully ever-increasing column. If your foreign key column matches those - perfect!
However, I would not under any circumstances put my clustering key on a wide or even compound foreign key. Remember: the value(s) of the clustering key are being added to each and every non-clustered index entry on that table! If you have 10 non-clustered indices, 100'000 rows in your table - that's one million entries. It makes a huge difference whether that's a 4-byte integer, or a 200-byte VARCHAR - HUGE. And not just on disk - in server memory as well. Think very very carefully about what to make your clustered index!
SQL Server might need to add a uniqueifier - making things even worse. And if the values will ever change, SQL Server would have to do a lot of bookkeeping and updating all over the place.
So in short:
putting an index on your foreign keys is definitely a great idea - do it all the time! (a minimal sketch follows after this list)
I would be very very careful about making that a clustered index. First of all, you only get one clustered index, so which FK relationship are you going to pick? And don't put the clustering key on a wide and constantly changing column
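For the first point, a plain nonclustered index on the FK column is usually all that is needed (names invented):

-- Nonclustered index supporting joins and FK lookups (hypothetical names).
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId);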
An index on the FK column will help the JOIN because the index itself is ordered: clustered just means that the data on disk (the leaf level) is ordered rather than just the B-tree.
If you change it to a covering index, then clustered vs non-clustered is irrelevant. What's important is to have a useful index.
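For example, a covering variant of such an FK index might look like this (the key and INCLUDE columns are invented):

-- The INCLUDE columns let the join be answered from the index alone,
-- with no lookup into the base table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, TotalAmount);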
It depends on the database implementation.
For SQL Server, a clustered index is a B-tree whose leaf level contains the actual data pages, so the table itself is stored in index order. The reason you get fast performance is that you can get to the start of the chain quickly, and ranges are an easy linked list to follow.
A non-clustered index is a separate data structure that contains pointers to the actual records, and as such has different concerns.
Refer to the documentation regarding Clustered Index Structures.
An index will not help simply by virtue of the foreign key relationship, but it will help due to the concept of a "covering" index. If your WHERE clause contains a constraint based upon the index, it will be able to generate the returned data set faster. That is where the performance comes from.
The performance gains usually come if you are selecting data sequentially within the cluster. It also depends entirely on the size of the table (data) and the conditions in your BETWEEN statement.

Which is the most common ID type in SQL Server databases, and which is better?

Is it better to use a Guid (UniqueIdentifier) as your Primary/Surrogate Key column, or a serialized "identity" integer column; and why is it better? Under which circumstances would you choose one over the other?
I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep the primary key - which is a logical construct - apart: it uniquely identifies your rows, and it has to be unique, stable and NOT NULL. A GUID works well for a primary key, too, since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct used for the physical ordering of the data, and it is a lot more difficult to get right. Typically, the Queen of Indexing on SQL Server, Kimberly Tripp, recommends that a good clustering key be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
A GUID is a really bad choice for a clustering key, since it's wide and totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
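A sketch of that split, with invented names - the GUID remains the logical primary key while a narrow INT IDENTITY drives the physical order:

CREATE TABLE dbo.Customer
(
    CustomerGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Customer_Guid DEFAULT NEWSEQUENTIALID(),
    CustomerSeq  INT IDENTITY(1,1) NOT NULL,
    CustomerName NVARCHAR(100)    NOT NULL,
    -- Logical identifier: the GUID, but explicitly NONCLUSTERED.
    CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerGuid)
);

-- Physical ordering: the narrow, ever-increasing identity column.
CREATE UNIQUE CLUSTERED INDEX CIX_Customer_Seq
    ON dbo.Customer (CustomerSeq);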
Use a GUID in a replicated system where you need to guarantee uniqueness.
Use ints where you have a non-replicated database and you want to maximise performance.
Very seldom use a GUID.
Rather use a primary key/surrogate key for storage purposes.
This will also make it easier for humans to interact with the data.
Creating indexes will be a lot more efficient too.
See
How Using GUIDs in SQL Server Affect Index Performance
Performance Effects of Using GUIDs as Primary Keys
When considering using integers, be sure to allow for the maximum possible value that might occur. You often end up with skipped numbers because of deletions, so the actual maximum ID might be much larger than the total number of records in the table.
For example, if you aren't sure that a 32-bit integer will do, use a 64-bit integer.
You might also find these other SO discussions useful:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Picking the best primary key + numbering system.
And if you search here in SO for "primary key", you'll find those and a lot more very useful discussions.
There's no single answer to this. The issues that people are quick to jump on with GUIDs (their random nature combined with the default behavior of the primary key also acting as the clustering key) can be easily mitigated. GUIDs have a much larger range than integers do, but as you start to fill that range with values you increase your risk of a collision.
GUIDs can be very useful when you have a distributed system (for example, replicated databases) where a non-trivial amount of work would have to go into a key-generation mechanism that wouldn't cause collisions between the portions of the system. Likewise, integers are useful because they're simple to use (every language has an integral type; not every language has a GUID type) and can be sequential (GUIDs can be too, but that's not their intended use).
It's all about what you're storing and how. The people who say "never use GUIDs!" are just spreading FUD, but GUIDs also aren't the answer to every problem.
I believe it is almost always a serialized identity integer, but some will disagree. It does depend on the situation.
The reasons for identity are efficiency and simplicity. It's smaller. More easily indexed. It makes a great clustered index. Less fragmentation, as new records are kept in order. Great for indexes on joins. Easier when eyeballing records in a db.
There is definitely a place for GUIDs in certain circumstances: when merging disparate data, or when records have to be created in certain places. GUIDs should be in your bag of tricks but usually will not be your first choice.
This is an oft-debated topic, but I tend to lean more towards identities for a couple of reasons. Firstly, an integer is only 4 bytes vs. a 16-byte GUID, which means narrower indexes and more efficient queries. Secondly, we make use of @@IDENTITY and SCOPE_IDENTITY() a lot in stored procs, etc., which goes out the window with GUIDs.
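The usual pattern looks like this (hypothetical table); SCOPE_IDENTITY() is generally preferred over @@IDENTITY because it ignores identity values generated in other scopes, such as triggers:

INSERT INTO dbo.Customer (CustomerName) VALUES (N'Acme');
-- Returns the identity value generated by the INSERT above, in this scope.
SELECT SCOPE_IDENTITY() AS NewCustomerId;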
Here's a nice little article by Jeff Atwood.
Use a GUID if you think you'll ever need to use the data outside the database (i.e., in other databases). Some would argue that is always the case, but it's a judgment call.