Fixing Index Fragmentation in SQL Server - sql-server-2005

Here is the fragmentation status on indexed tables:
And here is my PageDefinition table:
Any suggestions on what changes I could make to reduce the fragmentation? I am doing this for the first time, so a solution with the reasoning behind it would be very helpful. Also, please let me know if I need to add more detail here.
Thanks

The primary reason for fragmentation is page splits. Inserting new records or updating existing records may have changed the way data is allocated within the pages.
Good Presentation - http://devconnections.com/updates/LasVegas_Fall10/SQL/Randal-SQL-SDB306-Fragmentation.pdf
http://blog.sqlauthority.com/2008/03/27/sql-server-2005-find-index-fragmentation-details-slow-index-performance/
You can set a fill factor depending on the updates/inserts done on the table. For example, if your table is read-only and no updates will take place, you can use a 100% fill factor. For tables that do receive inserts/updates, a lower fill factor leaves free space on each page so that those changes do not cause page splits and, in turn, fragmentation.
70% is a reasonable starting value to experiment with. Rebuilding the index is the other part of the solution: a rebuild removes the existing fragmentation and applies the fill factor you choose (see the sketch below).
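A minimal sketch, assuming the table from the question is dbo.PageDefinition and using a hypothetical index name PK_PageDefinition - adjust both to your own objects. It measures fragmentation with sys.dm_db_index_physical_stats, then rebuilds with a lower fill factor; the usual rule of thumb (reorganize between roughly 5% and 30% fragmentation, rebuild above that) is guidance, not a hard rule:

    -- 1) Check fragmentation for every index on one table
    SELECT  OBJECT_NAME(ips.object_id)          AS table_name,
            i.name                              AS index_name,
            ips.index_type_desc,
            ips.avg_fragmentation_in_percent,
            ips.page_count
    FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.PageDefinition'),
                                           NULL, NULL, 'LIMITED') AS ips
    JOIN    sys.indexes AS i
            ON  i.object_id = ips.object_id
            AND i.index_id  = ips.index_id;

    -- 2) Heavily fragmented: rebuild, leaving 30% free space per page
    --    (FILLFACTOR = 70) to absorb future inserts/updates without page splits
    ALTER INDEX PK_PageDefinition ON dbo.PageDefinition
    REBUILD WITH (FILLFACTOR = 70);

    -- 3) Lightly fragmented: a reorganize is usually enough
    ALTER INDEX PK_PageDefinition ON dbo.PageDefinition REORGANIZE;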

Kimberly Tripp - the Queen of Indexing - has a ton of great blog posts on how to select a good clustering key (in SQL Server, the primary key is - by default - your clustering key).
Check them out, read them, learn them - obey them! :-)
GUIDs as Primary and Clustering key
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap...... THAT'S NOT THE POINT!!!
Basically, just read her whole blog - everything on indexing, clustered indices and so on.
Your clustering key (and thus by default - your primary key) should be:
narrow - a 4-byte INT is great; anything as wide as a 16-byte GUID is a massive waste of space
unique - read her blog posts on why
static - never change (if possible)
ever-increasing (typically: INT IDENTITY) to avoid page splits which cause most of the fragmentation
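As a small illustration (the table and column names are made up, not from the post), a primary key that meets all four criteria simply by being an INT IDENTITY clustered key:

    -- Narrow (4-byte INT), unique, static and ever-increasing (IDENTITY):
    -- a textbook clustering key that doubles as the primary key.
    CREATE TABLE dbo.Orders
    (
        OrderID     INT IDENTITY(1,1) NOT NULL,
        CustomerID  INT               NOT NULL,
        OrderDate   DATETIME          NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
    );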

Related

Could using Sequential Guid as primary key in SQL Server lead to low performance for big data?

The biggest database that I have experience with has been a SQL Server database in which one of the tables contained 200,000 rows. I used Guid as primary key in that database and NOT sequential guid. I experienced no performance issues in that system which had about 30 concurrent users in it.
Recently I designed and developed an enterprise application development framework. To take advantage of the "Unit of Work" pattern I used a sequential GUID as the primary key, so that the records would be ordered physically. Since my experience with big databases is limited to what I just mentioned, I have a serious concern: if I use this framework to develop an enterprise application for a large organization with 1000 concurrent users that will keep millions of records of data, will the sequential GUID primary key lead to performance issues?
If yes, to what extent? And if yes, could it be resolved by improving the database server hardware (processor and RAM) - and again, to what extent?
Thanks in advance for sharing your experience and knowledge
GUID may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the primary key of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
The primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
The clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
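A hedged sketch of that split (the names are illustrative only, not from the original post): the GUID stays the logical primary key, while a separate INT IDENTITY column takes over as the clustering key:

    CREATE TABLE dbo.Customer
    (
        -- logical key, used by the application and as the foreign-key target
        CustomerGuid  UNIQUEIDENTIFIER  NOT NULL DEFAULT NEWID(),
        -- physical ordering key: narrow, static, ever-increasing
        ClusterID     INT IDENTITY(1,1) NOT NULL,
        Name          NVARCHAR(100)     NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerGuid),
        CONSTRAINT UQ_Customer_ClusterID UNIQUE CLUSTERED (ClusterID)
    );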
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as primary and clustering key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 25 MB vs. 106 MB - and that's just on a single table!
And yes - a larger size of a table or of an index automatically means more data pages that need to be loaded from disk, held in memory, transferred to the client - all negatively impacting your performance. How much impact that has really depends on lots of factors of your database design and your data distribution, so any generalized predictions are next to impossible...
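To reproduce that kind of size comparison on your own tables, a rough sketch (dbo.Customer is a placeholder) against sys.dm_db_partition_stats - pages are 8 KB each, hence the conversion to MB:

    SELECT  ISNULL(i.name, 'HEAP')               AS index_name,
            i.type_desc,
            SUM(ps.used_page_count) * 8 / 1024.0 AS used_mb
    FROM    sys.dm_db_partition_stats AS ps
    JOIN    sys.indexes AS i
            ON  i.object_id = ps.object_id
            AND i.index_id  = ps.index_id
    WHERE   ps.object_id = OBJECT_ID('dbo.Customer')
    GROUP BY i.name, i.type_desc;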
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
The issue of sequential GUID versus "regular" GUIDs comes up in the following circumstance:
The GUID column is part of a primary or clustered index (particularly the only key in the index).
Your database has "lots" of inserts on it.
For a clustered index, SQL Server adds new records to the table "in order". So, larger values go at the "end" of the table -- in this case, on the last data page. This is handy for identity columns because they are guaranteed to be larger than any previous value. And the last data page is -- by definition -- not fragmented.
GUIDs do not have this property. They end up being inserted "in the middle", causing fragmentation.
Why do you not see this as a problem? There could be various reasons:
Your application is not doing many inserts.
You are defragging the table on a regular basis.
The table is so small that this is almost irrelevant.
As for the latter point, if the records are small enough, several hundred can fit on each 8 KB page. With only a few hundred pages of data, fragmentation might not be a significant issue.
With 30 concurrent users, you might simply have no overlap of transactions. If each user modified the database once per minute, that gives you 2 seconds to complete a transaction -- typically quite enough time.
Nevertheless, I recommend using a sequential GUID or identity column. This will keep the database cleaner. However, regularly defragging the database is another option that might work.
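A minimal sketch of the sequential-GUID option (the names are illustrative). NEWSEQUENTIALID() can only appear as a column default, and the values it produces are increasing on a given machine (the sequence can restart after a reboot), so new rows land at the logical end of the clustered index rather than in the middle:

    CREATE TABLE dbo.Document
    (
        DocumentID  UNIQUEIDENTIFIER NOT NULL
                    CONSTRAINT DF_Document_ID DEFAULT NEWSEQUENTIALID(),
        Title       NVARCHAR(200)    NOT NULL,
        CONSTRAINT PK_Document PRIMARY KEY CLUSTERED (DocumentID)
    );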

when do you prefer to use non clustered index over clustered index?

I have some knowledge of clustered and non-clustered indexes, but I'm not sure when and under what conditions it would be helpful to use a non-clustered index over a clustered index. Can someone explain or provide some links, so that it would be helpful to all of us?
First of all: pick your clustered index carefully! Every "regular" data table ought to have a clustered index, since having a clustered index does indeed speed up a lot of operations - yes, speed up, even inserts and deletes! But only if you pick a good clustered index.
The clustering key is the most replicated piece of data in your SQL Server database: it will be part of each and every non-clustered index on your table, too.
You should use extreme care when picking a clustering key - it should be:
narrow (4 bytes ideal)
unique (it's the "row pointer" after all; if you don't make it unique, SQL Server will do it for you in the background by adding a 4-byte uniqueifier to every duplicate key value, stored in each such row and repeated in every non-clustered index - this can be very costly!)
static (never change - if possible)
ideally ever-increasing so you won't end up with horrible index fragmentation (a GUID is the total opposite of a good clustering key - for that particular reason)
it should be non-nullable and ideally also fixed width - a varchar(250) makes a very poor clustering key
Anything else should really be second and third level of importance behind these points ....
See some of Kimberly Tripp's (The Queen of Indexing) blog posts on the topic - anything she has written in her blog is absolutely invaluable - read it, digest it - live by it!
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
A non-clustered index is useful for columns that have some repeated values.
A clustered index is created automatically when you create the primary key for the table (unless you explicitly declare the primary key as non-clustered); non-clustered indexes have to be created explicitly.
There can be only one clustered index per table. When you create an index, SQL Server builds a B-tree on the key column(s); for a non-clustered index that B-tree is stored separately from the table data, while for the clustered index the leaf level is the table data itself. Seeking through the B-tree for a specific key value reduces the number of comparisons enormously compared to scanning the table.
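To make the difference concrete, a small illustrative example (names invented): the primary key gives you the single clustered index automatically, while a non-clustered index is created by hand for a column that is frequently filtered on and contains repeated values:

    CREATE TABLE dbo.Employee
    (
        EmployeeID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- clustered by default
        DepartmentID INT               NOT NULL,              -- repeated values
        LastName     NVARCHAR(100)     NOT NULL
    );

    -- created explicitly, to support filtering/joining on DepartmentID
    CREATE NONCLUSTERED INDEX IX_Employee_DepartmentID
        ON dbo.Employee (DepartmentID);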

What are the consequences of converting heap-indexes to clustered indexes on SQL Server?

I recently got the advice that I should convert all our tables from heaps so that each table has a clustered index. What are the consequences of pursuing this strategy? E.g. is it more important to regularly reorganize the database? Data growth? Danger of really slow inserts? Danger of page fragmentation if the PK is a GUID? A noticeable speed increase in my application? What are your experiences?
To serve as inspiration for good answers, here are some of the "facts" I've picked up from other threads here on stackoverflow
You almost certainly want to establish a clustered index on every table in your database that does not have one; performance of most common queries is better.
Clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. The INSERT speed will suffer, but the SELECT speed will be improved.
The problem with clustered indexes on a GUID field is that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the record into the middle of the table.
Clustered index on GUID is ok in situations where the GUID has a meaning and improves performance by placing related data close to each other http://randommadness.blogspot.com/2008/07/guids-and-clustered-indexes.html
Clustering doesn't affect lookup speed - a unique non-clustered index should do the job.
If your key is a GUID, then a non-clustered index on it is probably just about as effective as a clustered index on it. This is because you can never meaningfully range-scan GUIDs (what could BETWEEN 'b4e8e994-c315-49c5-bbc1-f0e1b000ad7c' AND '3cd22676-dffe-4152-9aef-54a6a18d32ac' possibly mean?). With a width of 16 bytes, a GUID clustered index key is wider than the row ID you'd get from a heap, so a NC index on a PK GUID is actually a strategy that can be defended in a discussion.
But making the primary key the clustered index key is not the only way to build a clustered index over your heap. Do you have other frequent queries that request ranges over a certain column? Typical candidates are columns like date, state or deleted. If you do, then you should consider making those columns the clustered index key (it does not have to be unique) because doing so may help queries that request ranges, like 'all records from yesterday'.
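For illustration, a hedged sketch of that idea (dbo.AuditLog and LoggedAt are made-up names): creating a non-unique clustered index on a date column - which, incidentally, also converts the heap in one step - so that an 'all records from yesterday' query reads contiguous pages:

    -- Converts the heap into a clustered table, ordered by LoggedAt
    -- (non-unique is fine; SQL Server adds a uniqueifier where needed).
    CREATE CLUSTERED INDEX CIX_AuditLog_LoggedAt
        ON dbo.AuditLog (LoggedAt);

    -- A range query this clustered index serves well (SQL Server 2005 syntax):
    DECLARE @today DATETIME;
    SET @today = DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0);  -- midnight today

    SELECT  *
    FROM    dbo.AuditLog
    WHERE   LoggedAt >= DATEADD(DAY, -1, @today)
      AND   LoggedAt <  @today;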
The only scenario where heaps have a significant performance benefit is inserts, especially bulk inserts. If your load is not insert-heavy, then you should definitely go for a clustered index. See Clustered Index Design Guidelines.
Going over your points:
You almost certainly want to establish a clustered index on every table in your database that does not have one; performance of most common queries is better.
A clustered index that can satisfy range requirements for most queries will dramatically improve performance, true. A clustered index that can satisfy order requirements can be helpful too, but nowhere as helpful as one that can satisfy a range.
Clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. The INSERT speed will suffer, but the SELECT speed will be improved.
Only probe SELECTs will be improved: SELECT ... WHERE key='someguid'. Queries by object ID and foreign-key lookups will benefit from this clustered index. A NC index can serve the same purpose just as well.
The problem with clustered indexes on a GUID field is that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the record into the middle of the table.
Wrong. Inserting into a position in an index does not have to move data. The worst that can happen is a page split. A page split is somewhat expensive, but it is not the end of the world. Your comment suggests that all data (or at least a 'significant' part of it) has to be moved to make room for the new row, and that is nowhere near true.
Clustered index on GUID is ok in situations where the GUID has a meaning and improves performance by placing related data close to each other: http://randommadness.blogspot.com/2008/07/guids-and-clustered-indexes.html
I can't possibly imagine a scenario where a GUID can have 'related data'. A GUID is the quintessentially random structure; how could two random GUIDs relate in any way? The scenario Donald gives has a better solution: Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads, which is cheaper to implement (less storage required) and works for unique keys too (the approach in the linked article only works for foreign keys, not for unique keys).
Clustering doesn't affect lookup speed - a unique non-clustered index should do the job.
For probes (looking up a specific unique key), yes. A NC index is almost as fast as the clustered index (the NC index lookup does require an additional key lookup to fetch the rest of the columns). Where the clustered index shines is range scans, since the clustered index can cover any query, while a NC index that could potentially satisfy the same range may lose on coverage and trigger the index tipping point.
I would also recommend you read Kimberly Tripp's The Clustered Index Debate Continues..., in which she details quite clearly all the benefits of having a good clustering key over having a heap.
Pretty much all operations are faster - yes! even inserts and updates!
But this requires a good clustering key, and a GUID with its very random and unpredictable nature is not considered a good candidate for a clustering key. GUIDs as clustering key are bad - whether they have application meaning or not - just avoid those.
Your best bet is a key which is narrow, stable, unique and ever-increasing - a column of type INT IDENTITY fulfills all those requirements ideally.
For a lot more background on why a GUID doesn't make a good clustering key, and on just how bad it is, see more of Kim Tripp's blog posts:
GUIDs as PRIMARY key and/or clustering key
Disk space is cheap... but that's not the point!
I can recommend the book "SQL Performance Explained" - it is a 200-page book about indexes.
It also discusses when clustered indexes perform worse than normal indexes. One of the issues is that the clustered index itself is a B-tree, so when you have other indexes on the same table, they can't point directly to a specific row; instead they point to the key of the clustered index, so "the way" to the data gets longer (see the sketch below).
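A small sketch of that effect and the usual workaround (names are illustrative): a non-clustered index carries the clustering key and normally needs an extra lookup into the clustered index for any other column, but adding those columns with INCLUDE makes the index covering, so the extra hop disappears:

    -- The index leaf rows contain CustomerID, the clustering key (OrderID)
    -- and the included OrderDate, so the query below never has to follow
    -- the clustering key back into the clustered index.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID)
        INCLUDE (OrderDate);

    SELECT  CustomerID, OrderDate
    FROM    dbo.Orders
    WHERE   CustomerID = 42;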

What are the reasons *not* to use a GUID for a primary key?

Whenever I design a database I automatically start with an auto-generating GUID primary key for each of my tables (excepting look-up tables)
I know I'll never lose sleep over duplicate keys, merging tables, etc. To me it just makes sense philosophically that any given record should be unique across all domains, and that that uniqueness should be represented in a consistent way from table to table.
I realize it will never be the most performant option, but putting performance aside, I'd like to know if there are philosophical arguments against this practice?
Based on the responses let me clarify:
I'm talking about consistently using a GUID surrogate key as a primary key- irrespective of whether and how any natural or sequential keys are designed on a table. These are my assumptions:
Data integrity based on natural keys can be designed for, but not assumed.
A primary key's function is referential integrity, irrespective of performance, sequencing, or data.
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table.
What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to. The main reason for this is indeed performance, which will come and bite you down the road... (it will, trust me - just a matter of time) - plus also a waste of resources (disk space and RAM in your SQL Server machine) which is really not necessary.
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 25 MB vs. 106 MB - and that's just on a single table!
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
Jeff Atwood talks about this in great detail:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
Guid Pros:
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
Guid Cons:
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
Adding to ewwwn:
Pros
It makes it nearly impossible for developers to "accidentally" expose the surrogate key to users (unlike integers where it happens almost all the time).
Makes merging databases several orders of magnitude simpler than dealing with identity columns.
Cons
Fatter. The real problem with it being fatter is that it eats up more space per page and more space in your indexes making them slower. The additional storage space of Guids is frankly irrelevant in today's world.
You absolutely must be careful about how new values are created. Truly random values do not index well. You are compelled to use a COMB guid or some variant that adds a sequential element to the guid.
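For reference, the COMB expression usually quoted for SQL Server looks like the sketch below; treat it as a starting point only and verify the truncation and sort behaviour on your own server before relying on it (NEWSEQUENTIALID(), available from SQL Server 2005, is the built-in alternative):

    -- Keep the random part of a GUID but overwrite its last six bytes with the
    -- current time, so consecutive values sort roughly in insert order
    -- (SQL Server compares GUIDs starting from that last byte group).
    SELECT CAST(
               CAST(NEWID()   AS BINARY(10)) +   -- first 10 bytes: random
               CAST(GETDATE() AS BINARY(6))      -- last 6 bytes: timestamp
           AS UNIQUEIDENTIFIER) AS CombGuid;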
You do still implement the natural key of each table as well, don't you? GUID keys alone obviously won't prevent duplicate data, redundancy and the consequent loss of data integrity.
Assuming that you do enforce other keys then adding GUIDs to every table without exception is probably just adding unnecessary complexity and overhead. It doesn't really make it easier to merge data in different tables because you still have to modify/de-duplicate the other key(s) of the table anyway. I suggest you should evaluate the use of a GUID surrogate on a case-by-case basis. Having a blanket rule for every table isn't necessary or helpful because every table models a different thing after all.
Simple answer: it's not relational.
The record (as defined by the GUID) may be unique, but none of the associated attributes can be said to occur uniquely with that record.
Using a GUID (or any purely surrogate key) is no more relational than declaring a flat file to be relational, on the basis that each record can be identified by its row number.
A potentially big reason, but one often not thought of, is if you might have to provide compatibility with an Oracle database in the future.
Since Oracle doesn't have a uniqueidentifier column data type, it can lead to a bit of a nightmare when you have two different data types for the same primary key across two different databases, especially when an ORM is involved.
I wonder why there's no standard "miniGUID" type? It would seem that performing a decent hash on a GUID should yield a 64-bit number which would have a trivial probability of duplication in any universe which doesn't have a billion or more things in it. Since the universe in which most GUID/miniGUID identifiers are used will never grow beyond a million things, much less a billion, I would think a smaller 8-byte miniGuid would be very useful.
That would not, of course, suggest that it should be used as a clustered index; that would greatly impede performance. Nonetheless, an 8-byte miniGUID would only waste a third the space of a full GUID (when compared to a 4-byte index).
I can see the case for a given application's or enterprise's own identifiers to be unique and be represented in a consistent way across all its own domains (i.e. because they may span more than one database), but a GUID is overkill for these purposes. I guess they are popular because they are available out of the box, while designing and implementing an 'enterprise key' takes time and effort. The rule when designing an artificial identifier is to make it as simple as possible, but no simpler. IDENTITY is too simple; a GUID isn't simple enough.
Entities that exist outside of the application/enterprise usually have their own identifiers (e.g. a car has a VIN, a book has an ISBN, etc.) maintained by an external trusted source, and in such cases the GUID adds nothing. So I guess the philosophical argument against it that I'm getting at here is that using an artificial identifier on every table is unnecessary.

INT vs Unique-Identifier for ID field in database

I am creating a new database for a web site using SQL Server 2005 (possibly SQL Server 2008 in the near future). As an application developer, I've seen many databases that use an integer (or bigint, etc.) for an ID field of a table that will be used for relationships. But lately I've also seen databases that use the unique identifier (GUID) for an ID field.
My question is whether one has an advantage over the other? Will integer fields be faster for querying and joining, etc.?
UPDATE: To make it clear, this is for a primary key in the tables.
GUIDs are problematic as clustered keys because of the high randomness. This issue was addressed by Paul Randal in the last Technet Magazine Q&A column: I'd like to use a GUID as the clustered index key, but the others are arguing that it can lead to performance issues with indexes. Is this true and, if so, can you explain why?
Now bear in mind that the discussion is specifically about clustered indexes. You say you want to use the column as an 'ID'; it is unclear whether you mean it as the clustered key or just the primary key. Typically the two overlap, so I'll assume you want to use it as the clustered index. The reasons why that is a poor choice are explained in the article linked above.
For non clustered indexes GUIDs still have some issues, but not nearly as big as when they are the leftmost clustered key of the table. Again, the randomness of GUIDs introduces page splits and fragmentation, be it at the non-clustered index level only (a much smaller problem).
There are many urban legends surrounding GUID usage that condemn them based on their size (16 bytes) compared to an int (4 bytes) and promise horrible performance doom if they are used. This is slightly exaggerated. A key of size 16 can still be a very performant key on a properly designed data model. While it is true that being 4 times as big as an int results in lower-density non-leaf pages in indexes, this is not a real concern for the vast majority of tables. The B-tree structure is a naturally well-balanced tree and the depth of tree traversal is seldom an issue, so seeking a value based on a GUID key as opposed to an INT key is similar in performance. A leaf-page traversal (i.e. a table scan) does not look at the non-leaf pages, and the impact of GUID size on the page size is typically quite small, as the record itself is significantly larger than the extra 12 bytes introduced by the GUID. So I'd take the hearsay advice based on 'it is 16 bytes vs. 4' with a rather large grain of salt. Analyze each case individually and decide whether the size impact makes a real difference: how many other columns are in the table (i.e. how much impact the GUID size has on the leaf pages) and how many references use it (i.e. how many other tables will grow because they need to store a larger foreign key).
I'm calling out all these details in a sort of makeshift defense of GUIDs because they've been getting a lot of bad press lately and some of it is undeserved. They have their merits and are indispensable in any distributed system (the moment you're talking data movement, be it via replication or the sync framework or whatever). I've seen bad decisions being made based on the GUID's bad reputation, when they were shunned without proper consideration. But it is true: if you have to use a GUID as the clustered key, make sure you address the randomness issue - use sequential GUIDs when possible.
And finally, to answer your question: if you don't have a specific reason to use GUIDs, use INTs.
The GUID is going to take up more space and be slower than an int - even if you use the newsequentialid() function. If you are going to do replication or use the sync framework you pretty much have to use a guid.
INTs are 4 bytes, BIGINTs are 8 bytes, and GUIDs are 16 bytes. The more space required to represent the data, the more resources required to process it - disk space, memory, etc. So (a) they're slower, but (b) this probably only matters if volume is an issue (millions of rows, or thousands of transactions in very, very little time).
The advantage of GUIDs is that they are (pretty much) Globally Unique. Generate a guid using the proper algorithm (and SQL Server xxxx will use the proper algorithm), and no two guids will ever be alike--no matter how many computers you have generating them, no matter how frequently. (This does not apply after 72 years of usage--I forget the details.)
If you need unique identifiers generated across multiple servers, GUIDs may be useful. If you need mondo performance and fewer than 2 billion values, INTs are probably fine. Lastly, and perhaps most importantly, if your data has natural keys, stick with them and forget the surrogate values.
if you positively, absolutely have to have a unique ID, then GUID. Meaning if you're ever gonna merge, sync, replicate, you probably should use a GUID.
For less robust things, an int, should suffice depending upon how large the table will grow.
As in most cases, the proper answer is, it depends.
Use them for replication etc, not as primary keys.
Kimberly L Tripp article
Against: Space, not strictly monotonic, page splits, bookmark/RIDs etc
For: er...
Fully agreed with JBrooks.
I want to say that when your table is large and you use SELECTs with JOINs, especially with derived tables, using GUIDs can significantly decrease performance.