SQL Server legacy database: to clustered index or not

We have a legacy database which is a SQL Server DB (2005 and 2008).
All of the primary keys in the tables are UniqueIdentifiers.
The tables currently have no clustered index created on them, and we are running into performance issues on tables with only 750k records. This is the first database I've worked on with unique identifiers as the sole primary key, and I've never seen SQL Server this slow at returning data.
I don't want to create a clustered index on the uniqueidentifier, as the values are not sequential and will therefore slow the apps down when it comes to inserting data.
We cannot remove the uniqueidentifier as that is used for remote site record identity management purposes.
I had thought about adding a big integer identity column to the tables and creating the clustered index on this column and including the unique identifier column.
i.e.
int identity - First column to maintain insert speeds
unique identifier - To ensure the application keeps working as expected.
The goal is to improve the identity query and joined table query performance.
Q1: Will this improve the query performance of the db or will it slow it down?
Q2: Is there an alternative to this that I haven't listed?
Thanks
Pete
Edit: The performance issues are on retrieving data quickly through select statements, especially if a few of the more "transactional / changing" tables are joined together.
Edit 2: The joins between tables are generally all between the primary key and foreign keys, for tables that have foreign keys they are included in the non-clustered index to provide a more covering index.
The tables all have no other values which would provide a good clustered index.
I'm leaning more towards adding an additional identity column on each of the high load tables and then including the current Guid PK column within the clustered index to provide the best query performance.
Edit 3:
I would estimate that 80% of the queries are performed on primary and foreign keys alone through the data access mechanism. Generally our data model has lazy-loaded objects which perform the query when accessed; these queries use the object's ID and the PK column. We have a large number of user-driven data exclusion / inclusion queries which use the foreign key columns as a filter, based on criteria like "for type X, exclude the following IDs". The remaining 20% are WHERE clauses on enum (int) or date-range columns; very few text-based queries are performed in the system.
Where possible I have already added covering indexes to cover the heaviest queries, but as yet I'm still disappointed by the performance. As bluefooted says, the data is being stored as a heap.
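For reference, a minimal sketch of the plan described above, using invented object names (the real tables, columns and constraint names will differ):
-- Add an ever-increasing surrogate column.
ALTER TABLE dbo.SomeTable
    ADD RowId bigint IDENTITY(1,1) NOT NULL;

-- Cluster on the identity first (keeps inserts appending) and include the
-- existing GUID as the second key column, as described in the question.
CREATE UNIQUE CLUSTERED INDEX IX_SomeTable_RowId_Guid
    ON dbo.SomeTable (RowId, SomeTableGuid);

-- The existing uniqueidentifier primary key stays in place (non-clustered),
-- so the application and remote-site identity management are unaffected.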

If you don't have a clustered index on the table, it is being stored as a heap rather than a b-tree. Heap data access is absolutely atrocious in SQL Server so you definitely need to add a clustered index.
I agree with your analysis that the GUID column is a poor choice for clustering, especially since you don't have the ability to use NEWSEQUENTIALID(). You could create a new artificial integer key if you like, but if there is another column or combination of columns that would make sense as a clustered index, that is fine as well.
Do you have a field that is used frequently for range scans? Which columns are used for joins? Is there a combination of columns that also uniquely identifies the row aside from the GUID? Posting a sample of the data model would help us to suggest a good candidate for clustering.

I'm not sure where your GUIDs come from, but if they're being generated during the insert, using the NEWSEQUENTIALID() in SQL Server instead of NEWID() will help you avoid fragmentation issues during insert.
Regarding the choice of a clustered index, as Kimberly L. Tripp states here: "the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits)." A GUID falls short on the narrow requirement when compared to an INT or even BIGINT.
Kimberly also has an excellent article on GUIDs as PRIMARY KEYs and/or the clustering key.
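If the GUIDs really are generated at insert time on the server (which may not be the case here, since they come from remote sites), the usual way to switch is a default constraint like the following; table and column names are invented:
ALTER TABLE dbo.SomeTable
    ADD CONSTRAINT DF_SomeTable_Guid DEFAULT NEWSEQUENTIALID() FOR SomeTableGuid;
-- NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a
-- uniqueidentifier column; it cannot be called in an ordinary SELECT.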

It's not 100% clear to me: is your number 1 access pattern to query the tables by the GUID or by other columns? And when joining to other tables, what columns (and data types) are most often used?
I can't really give you any solid recommendations until I understand more about how these GUIDs are used. I realize you said they're primary keys, but that doesn't guarantee they are used as the primary conditions on queries or in joins.
UPDATE
Now that I know a little more, I have a crazy suggestion. Do cluster those tables on the GUIDs, but set the fill factor to 60%. This will ameliorate the page split problem and give you better performance querying on those puppies.
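A rough sketch of that suggestion, with invented object names (and assuming the existing non-clustered PK has been dropped so it can be re-created as clustered):
ALTER TABLE dbo.SomeTable
    ADD CONSTRAINT PK_SomeTable
    PRIMARY KEY CLUSTERED (SomeTableGuid)
    WITH (FILLFACTOR = 60);   -- leave ~40% free space per page to absorb random inserts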
As for using Guid.NewGuid(), it seems that you can do sequentialGUIDs in C# after all. I found the following code here on SO:
[DllImport("rpcrt4.dll", SetLastError = true)]
static extern int UuidCreateSequential(out Guid guid);
public static Guid SequentialGuid()
{
const int RPC_S_OK = 0;
Guid g;
if (UuidCreateSequential(out g) != RPC_S_OK)
return Guid.NewGuid();
else
return g;
}
NEWSEQUENTIALID() is actually just a wrapper for UuidCreateSequential. I'm sure if you can't use this directly on the client, you can figure out a way to make a quick round-trip to the server to get a new sequential ID from there, perhaps even with a "dispenser" table and a stored procedure to do the job.
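One possible shape of such a dispenser, purely as a sketch (the table and procedure names are invented):
CREATE TABLE dbo.GuidDispenser
(
    Id uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY
);
GO
CREATE PROCEDURE dbo.GetSequentialGuid
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @new TABLE (Id uniqueidentifier);

    -- NEWSEQUENTIALID() only works as a column default, so insert a row
    -- and capture the value it generated.
    INSERT INTO dbo.GuidDispenser
    OUTPUT inserted.Id INTO @new
    DEFAULT VALUES;

    -- The dispenser table exists only to host the default, so clean up.
    DELETE d FROM dbo.GuidDispenser AS d WHERE d.Id IN (SELECT Id FROM @new);

    SELECT Id FROM @new;
END
The client would then call the procedure per new row, or fetch a batch of IDs ahead of time.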

You don't indicate what your performance issues are. If the worst performing action is an INSERT, then maybe your solution is right. If it's something else, then I'd look at how the clustered index can help that.
You might look at the existing indexes on the table and the queries that use them. You may be able to select an index that, while degrading INSERTs slightly, provides a greater benefit to the current performance-problem areas.

Related

Searching for record(s) in a table that has over 200 Million Rows

Which type of index should be used on the table? The data is inserted (once a month) into an empty table. I then place a non-clustered composite index on two of the columns. I'm wondering if merging the two fields into one would increase performance when searching, or does it not matter? Should I be working with an identity column that has a clustered primary key index?
You should index the field(s) most likely to be used in the where clause as people query the table. Don't worry about the primary key - it already has an index.
If you can define a unique primary key that can be used when querying the table, this will be used as the clustered index and will be the fastest for selects.
If your select query has to use the two fields you mentioned, keep them separate. Performance will not be impacted and the schema is not spoiled.
"A clustered index is particularly efficient on columns that are often searched for ranges of values. After the row with the first value is found using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent."
With this in mind, you probably won't see much benefit from having a clustered index on your primary key (ID) unless it has business meaning for your application. If you have a date value that you commonly query on, then it may make more sense to add a clustered index to that:
select * from table where created > '2013-01-01' and created < '2013-02-01'
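A clustered index supporting that sort of range scan would look something like this (table name invented):
CREATE CLUSTERED INDEX IX_BigTable_Created
    ON dbo.BigTable (created);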
I have seen datawarehouses use a concatenated key approach. Whether this works for you depends on your queries. Obviously querying a single field value will be faster than multiple fields, particularly when there is one less lookup in the B-tree index.
Alternatively, if you have 200 million rows in a table you could look at breaking the data out into multiple tables if it makes sense to do so.
You're saying that you're loading all this data every month, so I have to assume that all the data is relevant. If there were data in your table that is considered "old" and not relevant to searches, you could move it out into an archive table (using the same schema) so your queries only run against "current" data.
Otherwise, you can look at a sharding approach as used by NoSQL stores like MongoDB. If MongoDB is not an option, you could implement the same shard-key logic in your application. I doubt that your database's SQL drivers will support sharding natively.

Do I need to use this many indexes in my SQL Server 2008 database?

I'd appreciate some advice from SQL Server gurus here. Let me explain...
I have an SQL Server 2008 database table that has 21 columns. Here's a quick summary of the column types:
INT Primary Key
Several other INT's that are indexes already (used to reference this and other tables)
Several NVARCHAR(64) to hold user-provided text
Several NVARCHAR(256) to hold longer user-provided text
Several DATETIME2
One BIGINT
Several UNIQUEIDENTIFIER, one is already an index
The way this table is used is that it is presented to a user as a sortable table, and the user can choose which column to sort it by. This table may contain many thousands of records (currently around 21,000, and it will keep growing).
So my question is: do I need to create an index on each column to enable faster sorting?
PS. Forgot to say. The output obviously supports pagination, so the user sees no more than 100 rows at once.
Contrary to popular belief, just having an index on a column does not guarantee that any queries will be any faster!
If you constantly use SELECT * from that table, these non-clustered indices on a single column will most likely not be used at all.
A good nonclustered index is a covering index, which means it contains all the necessary columns to satisfy one or more given queries. If you have this situation, then a nonclustered index can make sense - otherwise, more often than not, the nonclustered index is likely to be ignored by the query optimizer. The reason being: if you need all the columns anyway, the query would have to do key lookups from the nonclustered index into the actual data (the clustered index) for each row found - and the key lookup is a very expensive operation, so doing this for a lot of hits becomes overly costly, and the query optimizer will rather quickly switch to an index scan (possibly the clustered index scan) to fetch the data.
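For illustration, a covering non-clustered index (hypothetical table and column names) keys on the filtered columns and carries the selected ones in the leaf level via INCLUDE:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_OrderDate
    ON dbo.Orders (CustomerId, OrderDate)
    INCLUDE (Status, TotalAmount);
-- Covers a query like:
-- SELECT Status, TotalAmount FROM dbo.Orders
-- WHERE CustomerId = @id AND OrderDate >= @from;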
Don't over-index - use a well-designed clustered index, put indices on the foreign key columns to speed up joins - and then let it be for the time being. Observe your system, measure performance, maybe add an index here or there - but don't just overload the system with tons of indices!
Having too many indices can be worse than having none - every index must be maintained, i.e. updated for each INSERT, UPDATE and DELETE statement - and that takes time!
this table is ... presented to a user as a sortable table ... [that] may contain many thousands of records
If you're ordering many thousands of records for display, you're doing it wrong. Typical users can reasonably process at most around 500 typical records. Exceptional users can handle a couple thousand. Any more than that, and you're misleading your users into a false sense that they've seen a representative sample. This results in poor decision making and inefficient user workflow. Instead, you need to focus on a good search algorithm.
Another thing to keep in mind here is that more indexes mean slower inserts and updates. It's a balancing act. SQL Server keeps statistics on what queries and sorts it actually performs, and makes those statistics available to you. There are queries you can run that tell you exactly what indexes SQL Server thinks it could use. I would deploy without any sorting index and let it run for a week or two that way. Then look at the data, see what users actually sort on, and index just those columns.
Take a look at this link for an example and introduction on finding missing indexes:
http://sqlserverpedia.com/wiki/Find_Missing_Indexes
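The missing-index DMVs behind that technique can be queried roughly like this; treat the output as suggestions to evaluate, not indexes to create blindly:
SELECT  d.statement AS table_name,
        d.equality_columns,
        d.inequality_columns,
        d.included_columns,
        s.user_seeks,
        s.avg_user_impact
FROM    sys.dm_db_missing_index_details AS d
JOIN    sys.dm_db_missing_index_groups AS g
        ON g.index_handle = d.index_handle
JOIN    sys.dm_db_missing_index_group_stats AS s
        ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;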
Generally, indexes are used to accelerate WHERE conditions (and in some cases JOINs), so I don't think creating an index on any column other than the PRIMARY KEY will accelerate sorting. You can do your sorting on the client (if you use WinForms or WPF) or in the database for web scenarios.
Good Luck

Do clustered indexes have to be unique?

What happens if a clustered index is not unique? Can it lead to bad performance because inserted rows flow to an "overflow" page of some sorts?
Is it "made" unique and if so how? What is the best way to make it unique?
I am asking because I am currently using a clustered index to divide my table in logical parts, but the performance is so-so, and recently I got the advice to make my clustered indexes unique. I'd like a second opinion on that.
They don't have to be unique but it certainly is encouraged.
I haven't encountered a scenario yet where I wanted to create a CI on a non-unique column.
What happens if you create a CI on a non-unique column
If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier.
Does this lead to bad performance?
Adding a uniqueifier certainly adds some overhead in calculating and in storing it.
If this overhead will be noticable depends on several factors.
How much data the table contains.
What is the rate of inserts.
How often is the CI used in a select (when no covering indexes exist, pretty much always).
Edit
As has been pointed out by Remus in the comments, there do exist use cases where creating a non-unique CI would be a reasonable choice. My not having encountered one of those scenarios merely shows my own lack of exposure or competence (take your pick).
I like to check out what The Queen of Indexing, Kimberly Tripp, has to say on the topic:
I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....
Let's start with the key things that I look for in a clustering key:
* Unique
* Narrow
* Static
Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustered index rebuild (which because of the poor choice for the clustering key is now more likely).
Source: Ever-increasing clustering key debate - again!
Do clustered indexes have to be unique?
They don't, and there are times where it's better if they're not.
Consider a table with a semi-random, unique EmployeeId, and a DepartmentId for each employee: if your select statement is
SELECT * FROM EmployeeTable WHERE DepartmentId=%DepartmentValue%
then it's best for performance if the DepartmentId is the clustered index even though (or even especially because) it's not the unique index (best for performance because it ensures that all the records within a given DepartmentId are clustered).
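In that scenario the clustered index might simply be the following (using the EmployeeTable from the example above; SQL Server adds a hidden uniqueifier to duplicate DepartmentId values):
CREATE CLUSTERED INDEX IX_EmployeeTable_DepartmentId
    ON EmployeeTable (DepartmentId);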
Do you have any references?
There's Clustered Index Design Guidelines for example, which says,
With few exceptions, every table should have a clustered index defined on the column, or columns, that offer the following:
Can be used for frequently used queries.
Provide a high degree of uniqueness.
Can be used in range queries.
My understanding of "high degree of uniqueness", for example, is that it isn't good to choose "Country" as the clustered index if most of your queries want to select the records within a given town.
If you are tuning an old DB, this is a godsend. I am working on performance issues in a 20-year-old DB. It has non-clustered PKs with 3-8 columns. Instead of using all 8 columns to make the clustered index unique, I can pick one column with broad distribution and let SQL Server apply a uniqueifier. The uniqueifier is an int, so a column like ProjectID can handle up to 2,147,483,647 duplicates per key value, which is enough for most use cases. If it is not enough, add a second or third column to the clustering key.
This works without any code changes in the app layer. After 20 years in production, management doesn't have to order a major rewrite.

Changing the indexing on existing table in SQL Server 2000

Here is the scenario:
SQL Server 2000 (8.0.2055)
Table currently has 478 million rows of data. The Primary Key column is an INT with IDENTITY. There is an Unique Constraint imposed on two other columns with a Non-Clustered Index. This is a vendor application and we are only responsible for maintaining the DB.
Now the vendor has recommended doing the following "to improve performance"
Drop the PK and Clustered Index
Drop the non-clustered index on the two columns with the UNIQUE CONSTRAINT
Recreate the PK, with a NON-CLUSTERED index
Create a CLUSTERED index on the two columns with the UNIQUE CONSTRAINT
I am not convinced that this is the right thing to do. I have a number of concerns.
By dropping the PK and indexes, you will be creating a heap with 478 million rows of data. Then creating a CLUSTERED INDEX on two columns would be a really mammoth task. Would creating another table with the same structure and new indexing scheme and then copying the data over, dropping the old table and renaming the new one be a better approach?
I am also not sure how the stored procs will react. Will they continue using the cached execution plan, considering that they are not being explicitly recompiled?
I am simply not able to understand what kind of "performance improvement" this change will provide. I think that this will actually have the reverse effect.
All thoughts welcome.
Thanks in advance,
Raj
All stored procs will recompile automatically. This happens anyway when stats change and after index maintenance.
At some point, you have to reorganise 478 million rows in place (drop/create indexes) or move them (new table). Neither way is better than the other, unfortunately. I feel your pain though: we have similarly large tables with pending new columns and index changes.
Saying that, you should do it step 2-1-4-3 to avoid unnecessary non-clustered index maintenance when you drop/create the clustered index.
And drop the unique constraint: the clustered index can itself be created as unique, so a separate unique constraint would just be another, unnecessary index.
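To make the 2-1-4-3 ordering concrete, a sketch with invented object names (the real constraint and column names will differ):
-- (2) Drop the non-clustered index backing the unique constraint.
ALTER TABLE dbo.BigTable DROP CONSTRAINT UQ_BigTable_ColA_ColB;

-- (1) Drop the clustered primary key; the table becomes a heap at this point.
ALTER TABLE dbo.BigTable DROP CONSTRAINT PK_BigTable;

-- (4) Build the clustered index on the two columns, created UNIQUE so a
--     separate unique constraint is no longer needed.
CREATE UNIQUE CLUSTERED INDEX IX_BigTable_ColA_ColB
    ON dbo.BigTable (ColA, ColB);

-- (3) Re-create the primary key as non-clustered.
ALTER TABLE dbo.BigTable
    ADD CONSTRAINT PK_BigTable PRIMARY KEY NONCLUSTERED (Id);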
As for the performance benefit, perhaps ask the vendor why.
The one thing I would have a serious look at is what type those other two columns are - how big are they compared to the INT IDENTITY (4 bytes)?
The reason I ask: the clustering key will be added to all non-clustered indices on the table, too - and if you have close to 500 million rows, it will make a huge difference whether the clustering key is a single 4-byte INT or, e.g., two 16-byte GUIDs.
This is not only on disk, mind you - the pages are loaded into SQL Server's RAM in their entirety - so by potentially bloating up your clustering key, you'd incur performance penalties due to the larger number of pages on disk (and in RAM) that your non-clustered indices would need.
The only compelling reason I could see to actually go through with those changes would be if by clustering the table using those two other columns, you'd gain something in terms of query performance, e.g. if some of the most frequent queries would be faster, due to the fact that the table is now clustered by these two columns. That's really hard to know unless you know what the access and query patterns really are....
Would creating another table with the same structure and new indexing scheme and then copying the data over, dropping the old table and renaming the new one be a better approach?
I believe that this is what SQL Enterprise Manager will do behind the scenes anyways if you use the visual tools to do this. If you make a schema change such as add a column in the middle of a table, or change primary keys, there is a little button that will allow you to "Script Changes". If you view this script, you can see the steps that Enterprise Manager will take in order to do what you requested.

Uniqueidentifier PK: Is a SQL Server heap the right choice?

OK. I've read things here and there about SQL Server heaps, but nothing too definitive to really guide me. I am going to try to measure performance, but was hoping for some guidance on what I should be looking into. This is SQL Server 2008 Enterprise. Here are the tables:
Jobs
JobID (PK, GUID, externally generated)
StartDate (datetime2)
AccountId
Several more accounting fields, mainly decimals and bigints
JobSteps
JobStepID (PK, GUID, externally generated)
JobID FK
StartDate
Several more accounting fields, mainly decimals and bigints
Usage: Lots of inserts (hundreds/sec), usually 1 JobStep per Job. Estimate perhaps 100-200M rows per month. No updates at all, and the only deletes are from archiving data older than 3 months.
Do ~10 queries/sec against the data. Some join JobSteps to Jobs, some just look at Jobs. Almost all queries will range on StartDate, most of them include AccountId and some of the other accounting fields (we have indexes on them). Queries are pretty simple - the largest part of the execution plans is the join for JobSteps.
The priority is the insert performance. Some lag (5 minutes or so) is tolerable for data to appear in the queries, so replicating to other servers and running queries off them is certainly allowable.
Lookup based on the GUIDs is very rare, apart from joining JobSteps to Jobs.
Current Setup: No clustered index. The only one that seems like a candidate is StartDate. But, it doesn't increase perfectly. Jobs can be inserted anywhere in a 3 hour window after their StartDate. That could mean a million rows are inserted in an order that is not final.
Data size for a 1 Job + 1 JobStepId, with my current indexes, is about 500 bytes.
Questions:
Is this a good use of a heap?
What's the effect of clustering on StartDate, when it's pretty much non-sequential for ~2 hours/1 million rows? My guess is the constant re-ordering would kill insert perf.
Should I just add bigint PKs just to have smaller, always increasing keys? (I'd still need the guids for lookups.)
I read GUIDs as PRIMARY KEYs and/or the clustering key, and it seemed to suggest that even inventing a key will save considerable space on other indexes. Also some resources suggest that heaps have some sort of perf issues in general, but I'm not sure if that still applies in SQL 2008.
And again, yes, I'm going to try to perf test and measure. I'm just trying to get some guidance or links to other articles so I can make a more informed decision on what paths to consider.
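For what it's worth, here is roughly what option 3 would look like; the accounting columns are omitted, the AccountId type is a guess, and the constraint names are invented:
CREATE TABLE dbo.Jobs
(
    JobKey    bigint IDENTITY(1,1) NOT NULL,   -- narrow, ever-increasing clustering key
    JobID     uniqueidentifier     NOT NULL,   -- externally generated, kept for lookups
    StartDate datetime2            NOT NULL,
    AccountId int                  NOT NULL,   -- type assumed
    CONSTRAINT PK_Jobs PRIMARY KEY CLUSTERED (JobKey),
    CONSTRAINT UQ_Jobs_JobID UNIQUE NONCLUSTERED (JobID)
);

CREATE TABLE dbo.JobSteps
(
    JobStepKey bigint IDENTITY(1,1) NOT NULL,
    JobStepID  uniqueidentifier     NOT NULL,  -- externally generated
    JobID      uniqueidentifier     NOT NULL,  -- FK to Jobs, as in the question
    StartDate  datetime2            NOT NULL,
    CONSTRAINT PK_JobSteps PRIMARY KEY CLUSTERED (JobStepKey),
    CONSTRAINT UQ_JobSteps_JobStepID UNIQUE NONCLUSTERED (JobStepID),
    CONSTRAINT FK_JobSteps_Jobs FOREIGN KEY (JobID) REFERENCES dbo.Jobs (JobID)
);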
Yes, heaps have issues. Your data will logically fragment all over the show and cannot easily be defragmented.
Imagine throwing all your telephone directory into a bucket and then trying to find "bob smith". Or using a conventional telephone directory with a clustered index on lastname, firstname.
The overhead of maintaining the index is trivial.
StartDate, unless unique, is not a good choice. A clustered index requires internal uniqueness for the non-clustered indexes; if it's not declared unique, SQL Server will add a 4-byte "uniquifier".
Yes, I'd use int or bigint to make it easier. As for GUIDs: see the related questions.
Edit:
Note: PK and clustered index are two separate issues, even though SQL Server by default will make the PK clustered.
Heap fragmentation isn't necessarily the end of the world. It sounds like you'll rarely be scanning the data, so that's not the end of the world.
Your non-clustered indexes are the things that will impact your performance. Each one will need to store the address of the row in the underlying table (either a heap or a clustered index). Ideally, your queries never have to touch the underlying table itself, because the index stores all the information needed in the ideal way (including all columns, so that it's a covering index).
And yes, Kimberly Tripp's stuff is the best around for indexes.
Rob
As your own research has shown, and as all the other answerers have mentioned, using a GUID as the clustered index on a table is a bad idea.
However, having a heap also isn't really a good choice, since heaps have other issues, mostly to do with fragmentation and other things that just don't work well with a heap.
My best practice advice would always be this:
do use a primary, clustered key on any data table (unless it's a temporary table, or a table used for bulk-loading)
try to make sure the clustered key is an INT IDENTITY or BIGINT IDENTITY
I would argue that the benefits you get by adding an INT/BIGINT - even just for the sake of having a good clustered index - far outweigh the drawbacks (as Kim Tripp also argues in the blog post you cited).
Marc
As the GUID is your primary and foreign key, your database will still need to check the constraints on every insert, so you will probably need to index it. Indexing a GUID is not advisable due to its randomness. Therefore I'd say you should absolutely go down the bigint (probably identity) route for your primary key and use it as the clustered index.