SQL Server Clustered Index - Order of Index Question - sql

I have a table like so:
keyA keyB data
keyA and keyB together are unique, are the primary key of my table and make up a clustered index.
There are 5 possible values of keyB but an unlimited number of possible values of keyA. keyB generally increments.
For example, the following data can be ordered in 2 ways depending on which key column is ordered first:
keyA keyB data
A 1 X
B 1 X
A 3 X
B 3 X
A 5 X
B 5 X
A 7 X
B 7 X
or
keyA keyB data
A 1 X
A 3 X
A 5 X
A 7 X
B 1 X
B 3 X
B 5 X
B 7 X
Do I need to tell the clustered index which of the key columns has fewer possible values to allow it to order the data by that value first? Or does it not matter in terms of performance which is ordered first?

You should order your composite clustered index with the most selective column first. This means the column with the most distinct values compared to total row count.
"B*TREE Indexes improve the performance of queries that select a small percentage of rows from a table." http://www.akadia.com/services/ora_index_selectivity.html?
This article is for Oracle, but still relevant.
Also, if you have a query that runs constantly and returns few fields, you may consider creating a composite index that contains all the fields - it will not have to access the base table, but will instead pull data from the index.
ligget78's comment on making sure to mention the first column in a composite index is important to remember.

If you create an index (regardless of whether it is clustered) with (keyA, keyB), then that is how the values will be ordered, i.e. first by keyA, then by keyB (this is the second case in your question). If you want it the other way around, you need to specify (keyB, keyA).
It could matter performance-wise, depending on your queries of course. For example, if you have a (keyA, keyB) index and the query looks like WHERE keyB = ... (without mentioning keyA), then the index can't be used to seek; at best it will be scanned in full.
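A minimal sketch of the point above, assuming a hypothetical table dbo.KeyDemo with made-up column types (the question does not give them). It shows the key order being declared explicitly and the two kinds of predicate discussed; which plan the optimizer actually picks should be checked in the execution plan.

```sql
-- Hypothetical table shaped like the one in the question; types are assumptions.
CREATE TABLE dbo.KeyDemo
(
    keyA char(1)     NOT NULL,
    keyB int         NOT NULL,
    data varchar(10) NOT NULL,
    -- Key order (keyA, keyB): rows are ordered by keyA first, then keyB.
    CONSTRAINT PK_KeyDemo PRIMARY KEY CLUSTERED (keyA, keyB)
);

INSERT INTO dbo.KeyDemo (keyA, keyB, data)
VALUES ('A', 1, 'X'), ('B', 1, 'X'), ('A', 3, 'X'), ('B', 3, 'X');

-- Filters on the leading key column: eligible for a clustered index seek.
SELECT keyA, keyB, data
FROM dbo.KeyDemo
WHERE keyA = 'A';

-- Filters only on the second key column: no seek on keyB is possible
-- with this key order, so the index has to be scanned.
SELECT keyA, keyB, data
FROM dbo.KeyDemo
WHERE keyB = 3;
```

The multi-row VALUES syntax needs SQL Server 2008 or later; on older versions use one INSERT per row.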

As others have said, the ordering is based on how you specify it in the index creation script (or PK constraint). One thing about clustered indexes though is that there is a lot to keep in mind.
You may get better overall performance by using your clustered index on something other than the PK. For example, if you are writing a financial system and reports are almost always based on date and time of an activity (all activity for the past year, etc.) then a clustered index on that date column might be better. As HLGEM says, sorting can also be affected by your selection of clustered index.
Clustered indexes can also affect inserts more than other indexes. If you have a high volume of inserts and your clustered index is on something like an IDENTITY column then there could be contention problems for that particular part of the disk since all of the new rows are being inserted into the same place.
For small look-up tables I always just put the clustered index on the PK. For high-impact tables though it's a good idea to spend the time thinking about (and testing) various possible clustered indexes before choosing the best one.

I believe that SQL Server orders it exactly the way you tell it. It assumes that you know best how to access your index.
In any case, I would say it's a good idea where possible to specify what you want exactly rather than hoping the database will figure it out.
You can also try it both ways, run a bunch of representative queries and then compare the generated execution plans to determine which is best for you.
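One way to "try it both ways" could look like the sketch below. The two table names and the column types are hypothetical; the idea is to load the same data into both and compare the logical reads and timings reported in the Messages tab alongside the execution plans.

```sql
-- Two hypothetical copies of the table, differing only in clustered key order.
CREATE TABLE dbo.OrderTest_AB
(
    keyA char(1) NOT NULL,
    keyB int     NOT NULL,
    data varchar(10) NOT NULL,
    CONSTRAINT PK_OrderTest_AB PRIMARY KEY CLUSTERED (keyA, keyB)
);

CREATE TABLE dbo.OrderTest_BA
(
    keyA char(1) NOT NULL,
    keyB int     NOT NULL,
    data varchar(10) NOT NULL,
    CONSTRAINT PK_OrderTest_BA PRIMARY KEY CLUSTERED (keyB, keyA)
);

-- Load both tables with the same representative data before measuring.

-- Report I/O and CPU/elapsed time for each statement that follows.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the same representative query against each candidate.
SELECT keyA, keyB, data FROM dbo.OrderTest_AB WHERE keyB = 3;
SELECT keyA, keyB, data FROM dbo.OrderTest_BA WHERE keyB = 3;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```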

Remember that the clustered index is the physical order in which the table is stored on disk.
So if your clustered index is defined as ColA, ColB, queries will be faster when they order by the same columns in the same order as your clustered index. If SQL Server has to order by B, A, it will need an extra sort step to produce the correct order.
My suggestion is to add a second non-clustered index on B, A. Also, depending on the size of your data column, consider adding it with INCLUDE (read up on included columns) to prevent the need for key lookups. That is, of course, provided that this table is not heavily inserted into, as you must always balance query speed vs. write speed.
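The suggestion above might be sketched as follows. The table name dbo.KeyDemo2, the index name, and the column types are assumptions for illustration only.

```sql
-- Hypothetical table clustered on (keyA, keyB), as in the question.
CREATE TABLE dbo.KeyDemo2
(
    keyA char(1)     NOT NULL,
    keyB int         NOT NULL,
    data varchar(10) NOT NULL,
    CONSTRAINT PK_KeyDemo2 PRIMARY KEY CLUSTERED (keyA, keyB)
);

-- Second index in the opposite key order; INCLUDE carries the data column
-- at the leaf level so queries on keyB need no key lookup.
CREATE NONCLUSTERED INDEX IX_KeyDemo2_keyB_keyA
    ON dbo.KeyDemo2 (keyB, keyA)
    INCLUDE (data);

-- This query can be answered from the nonclustered index alone,
-- already in (keyB, keyA) order.
SELECT keyA, keyB, data
FROM dbo.KeyDemo2
WHERE keyB = 3
ORDER BY keyB, keyA;
```

The trade-off is that every insert, update, and delete now has to maintain two copies of the data column.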
Realistically, your clustered index should represent the order in which the data is most likely to be accessed as well as maintaining a delicate balance of insert\update IO cost. If your clustered index is such that you are constantly inserting into the middle of pages, you may suffer performance losses there.
Like others have said, without knowing the table length, column sizes, etc. there is no correct answer. Trial and error with a heavy dose of testing is your best bet.

Just in case this isn't obvious: the sort order of your index does not promise much about the sort order of the results in a query.
In your queries, you must still add an
ORDER BY KeyA, KeyB
or
ORDER BY KeyB, KeyA
The optimizer may be pleased to find the data already physically ordered in the index as desired and save some time, but every query that is supposed to deliver data in a particular order must have an ORDER BY clause at the end of it. Without an order by, SQL Server makes no promises with respect to the order of a recordset, or even that it will come back in the same order from query to query.

The best thing you can do is to try both solutions and measure the execution time.
In my experience, index tuning is anything but an exact science.
Maybe having keyB before keyA in the index column order would be better.

You specify the columns in the order in which you would normally want them sorted in reports and queries.
I would be wary of creating a multicolumn clustered index though. Depending on how wide it is, it could have a huge impact on the size of any other indexes you create, because all non-clustered indexes contain the clustered index value. Also, the rows have to be re-ordered if the values change frequently, and it is my experience that non-surrogate keys tend to change more frequently. Therefore creating this as a clustered rather than a nonclustered index could consume much more server resources if you have values that are likely to change. I'm not saying you shouldn't do this, as I don't know what type of data your columns actually contain (although I suspect they are more complex than A1, A2, etc.); I'm saying you need to think about the ramifications of doing it. It would probably be a good idea to thoroughly read BOL about clustered versus nonclustered indexes before committing to this.

Yes, you should specify it. Normally the query engine tries to find the best execution plan and the best index to use; however, sometimes it is better to force the query engine to use a specific index. There are some other considerations when planning an index as well as when using the index in your query, for example the column ordering in the index and the column ordering in the WHERE clause. You can refer to the following link to learn more:
http://ashishkhandelwal.arkutil.com/sql-server/quick-and-short-database-indexes/
Best Practices to use indexes
How to get best performance form indexes
Clustered index Considerations
Nonclustered Indexes Considerations
I am sure this will help you when planning for index.
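For reference, forcing a specific index as mentioned above is done with a table hint. This is only a sketch; the table dbo.HintDemo and the index name are hypothetical.

```sql
-- Hypothetical table and index used only to show the hint syntax.
CREATE TABLE dbo.HintDemo
(
    keyA char(1)     NOT NULL,
    keyB int         NOT NULL,
    data varchar(10) NOT NULL,
    CONSTRAINT PK_HintDemo PRIMARY KEY CLUSTERED (keyA, keyB)
);

CREATE NONCLUSTERED INDEX IX_HintDemo_keyB
    ON dbo.HintDemo (keyB);

-- WITH (INDEX(...)) overrides the optimizer's index choice for this table.
SELECT keyA, keyB, data
FROM dbo.HintDemo WITH (INDEX(IX_HintDemo_keyB))
WHERE keyB = 3;
```

A hint like this stays in force even when the data distribution changes, so it is worth comparing the hinted and unhinted plans before keeping it.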

Related

SQL Server indexing includes questions

I've been troubleshooting some bad SQL calls in my work's applications. I've been reading up on indexes, tweaking and benchmarking things. Here are some of the rules I've gathered (let me know if this sounds right):
For heavily used queries, boil down the query to only what is needed and rework the where statements to use the most common columns first. Then make a non-clustered index on the columns used in the where statement and INCLUDE any remaining select columns (excluding large columns of course, like nvarchar(max)).
If a query is going to return > 20% of the entries table contents, it's best to do a table scan and not use an index
Order in an index matters. You have to make sure to structure your where statement like the index is built.
Now one thing I'm having trouble finding info on is what if a query is selecting on columns that are not part of any index but is using a where statement that is? Is the index used and leaf node hits the table and looks at the associated row for it?
ex: table
Id col1 col2 col3
CREATE INDEX my_index
ON my_table (col1)
SELECT Id, col1, col2, col3
FROM my_table
WHERE col1 >= 3 AND col1 <= 6
Is my_index used here? If so, how does it resolve Id, col2, col3? Does it point back to table rows and pick up the values?
To answer your question, yes, my_index is used. And yes, your index will point back to the table rows and pick the id, col2 and col3 values there. That is what an index does.
Regarding your 'rules'
Rule 1 makes sense. Except for the fact that I usually do not 'include' other columns in my index. As explained above, the index will refer back to the table and quickly retrieve the row(s) that you need.
Rule 2, I don't really understand. You create the index and SQL Server will decide which indices to use or not use. You don't really have to worry about it.
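Rule 3, the order of the conditions in the WHERE clause does not really make a difference; the optimizer matches them to the index regardless. The order of the columns in the index key itself does matter.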
I hope this helps.
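To make the lookup behaviour concrete, here is a sketch built on the question's my_table. The column types, the clustered primary key on Id, and the second index name are assumptions.

```sql
-- Column types and the clustered primary key are assumptions.
CREATE TABLE dbo.my_table
(
    Id   int IDENTITY(1,1) NOT NULL
         CONSTRAINT PK_my_table PRIMARY KEY CLUSTERED,
    col1 int NOT NULL,
    col2 int NULL,
    col3 int NULL
);

-- The index from the question: key is col1 only. Because the table is
-- clustered on Id, each index row also carries Id as the row locator.
CREATE INDEX my_index
    ON dbo.my_table (col1);

-- If my_index is chosen, col2 and col3 are fetched from the clustered
-- index with a key lookup for every matching row.
SELECT Id, col1, col2, col3
FROM dbo.my_table
WHERE col1 >= 3 AND col1 <= 6;

-- Covering variant: col2 and col3 are stored at the leaf level,
-- so the same query needs no lookup.
CREATE INDEX my_index_covering
    ON dbo.my_table (col1)
    INCLUDE (col2, col3);
```

When a large share of the rows match, the optimizer may still prefer scanning the clustered index over doing many lookups, which is what the question's second rule is getting at.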
From dba.stackexchange.com:
There are a few concepts and terms that are important to understand
when dealing with indexes. Seeks, scans, and lookups are some of the
ways that indexes will be utilized through select statements.
Selectivity of key columns is integral to determining how effective an
index can be.
A seek happens when the SQL Server Query Optimizer determines that the
best way to find the data you have requested is by scanning a range
within an index. Seeks typically happen when a query is "covered" by
an index, which means the seek predicates are in the index key and the
displayed columns are either in the key or included. A scan happens
when the SQL Server Query Optimizer determines that the best way to
find the data is to scan the entire index and then filter the results.
A lookup typically occurs when an index does not include all requested
columns, either in the index key or in the included columns. The query
optimizer will then use either the clustered key (against a clustered
index) or the RID (against a heap) to "lookup" the other requested
columns.
Typically, seek operations are more efficient than scans, due to
physically querying a smaller data set. There are situations where
this is not the case, such as a very small initial data set, but that
goes beyond the scope of your question.
Now, you asked how to determine how effective an index is, and there
are a few things to keep in mind. A clustered index's key columns are
called a clustering key. This is how records are made unique in the
context of a clustered index. All nonclustered indexes will include
the clustered key by default, in order to perform lookups when
necessary. All indexes will be inserted to, updated to, or deleted
from for every respective DML statement. That having been said, it is
best to balance performance gains in select statements against
performance hits in insert, delete, and update statements.
In order to determine how effective an index is, you must determine
the selectivity of your index keys. Selectivity can be defined as a
percentage of distinct records to total records. If I have a [person]
table with 100 total records and the [first_name] column contains 90
distinct values, we can say that the [first_name] column is 90%
selective. The higher the selectivity, the more efficient the index
key. Keeping selectivity in mind, it is best to put your most
selective columns first in your index key. Using my previous [person]
example, what if we had a [last_name] column that was 95% selective?
We would want to create an index with [last_name], [first_name] as the
index key.
I know this was a bit of a long-winded answer, but there really are a
lot of things that go into determining how effective an index will be,
and a lot of things you must weigh any performance gains against.
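The selectivity figure described above can be measured directly. This is a sketch against a hypothetical dbo.person table; the person_id column, the types, and the index name are assumptions.

```sql
-- Hypothetical person table matching the example in the quoted answer.
CREATE TABLE dbo.person
(
    person_id  int IDENTITY(1,1) NOT NULL
               CONSTRAINT PK_person PRIMARY KEY CLUSTERED,
    first_name varchar(50) NOT NULL,
    last_name  varchar(50) NOT NULL
);

-- Selectivity as a percentage: distinct values divided by total rows.
-- NULLIF avoids a divide-by-zero error on an empty table.
SELECT
    COUNT(*) AS total_rows,
    100.0 * COUNT(DISTINCT first_name) / NULLIF(COUNT(*), 0) AS first_name_pct,
    100.0 * COUNT(DISTINCT last_name)  / NULLIF(COUNT(*), 0) AS last_name_pct
FROM dbo.person;

-- Following the quoted advice, the more selective column goes first.
CREATE INDEX IX_person_last_first
    ON dbo.person (last_name, first_name);
```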

Is a clustered index faster than a non-clustered index with includes? [duplicate]

I have a table with columns a,b,c,d,e,f,g that has roughly 500,000 rows.
There is a query that gets run very often that does a SELECT * FROM table WHERE a = @a AND b = @b AND c = @c.
Is it better to create a clustered index on a, b, and c, OR am I better off creating a non-clustered index on a, b, and c INCLUDE (d, e, f, g).
Not sure the include would help speed up the query since the select * was issued.
Any help would be appreciated!
A clustered index would be the fastest for that SELECT, but it may not necessarily be the correct choice.
A clustered index determines the order in which records are physically stored (which is why you can only have one per table). So while it would be the fastest for THAT query, it may slow down other queries and could KILL updates and inserts if one of those columns was changing, which could mean that the record would need to be physically re-located.
An INCLUDE would also speed up that query at the expense of extra storage and extra index maintenance if any of those fields (including the included fields) were updated.
I would START with a non-clustered index on a, b, and c and see if that gets your performance to a reasonable level. Anything more could just be trading speed in one area for slowness in another.
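The starting point suggested above could be sketched like this. The table name dbo.WideDemo, the index name, and all column types are hypothetical, and the table is left as a heap because the question does not describe any other key.

```sql
-- Hypothetical table with the seven columns from the question.
CREATE TABLE dbo.WideDemo
(
    a int NOT NULL,
    b int NOT NULL,
    c int NOT NULL,
    d varchar(50) NULL,
    e varchar(50) NULL,
    f varchar(50) NULL,
    g varchar(50) NULL
);

-- Start with a plain nonclustered index on the three filter columns.
CREATE NONCLUSTERED INDEX IX_WideDemo_a_b_c
    ON dbo.WideDemo (a, b, c);

DECLARE @a int = 1, @b int = 2, @c int = 3;

-- SELECT * needs d..g as well, so each matching row costs one lookup
-- back to the base table after the index seek.
SELECT *
FROM dbo.WideDemo
WHERE a = @a AND b = @b AND c = @c;
```

If the lookups turn out to be the bottleneck, adding INCLUDE (d, e, f, g) to the index removes them at the cost of storing those columns twice.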
The clustered index will be faster.
With SELECT *, both your clustered and non-clustered (with include-all) contain all the columns within each page. However, the non-clustered index ALSO contains a reference back to the clustered key - this is required in case you add more columns to the table, but really also because all indexes (except indexed views) are pointers to the data pages. The NCI will not feature the new columns (fixed include list) but the data pages will.
SQL Server may be smart enough to find out that SELECT * can be fulfilled by an INDEX SCAN on the NCI (+includes) without a bookmark lookup back to the data pages, but even then, that index scan will be one column wider than the equivalent clustered index scan.
It is normally not a good idea to have a 3-column clustering key. You may consider an alternative of using a simple single-column identity clustering key, and creating an indexed view clustered around the 3 columns.
The answer to the question as stated in your subject line in general is no. Because you generally would much prefer to have the narrowest covering (probably non-clustered) index.
But in your case you are selecting * so if the clustered index is good enough match to your seek criteria it's always going to be picked, since anything narrower will need to do a bookmark lookup.
So this raises a big question of why this query is the way it is, whether there is a better choice of clustered index in general for your app (narrow, static, increasing, unique), and whether you really need to be getting all the columns. Because neither of the two options you give is really that typical of a good design.
500000 rows is fairly small, but if performance is an issue, you want to see how many rows fit per page and whether you could improve that by being more selective in your query and having a covering non-clustered index.
Your clustered index is the order in which the data is stored in the table, so you can only have one clustered index per table. If you create a new index (non-clustered by default), make sure the columns are defined in the index in the same order they are used in the WHERE clause; that will allow SQL Server to do a direct index scan to find the record(s) you're looking for.

Clustered index dilemma - ID or sort?

I have a table with two very important fields:
id INT identity(1,1) PRIMARY KEY
identifiersortcode VARCHAR(900)
My app always sorts and pages search results in the UI based on identifiersortcode, but all table joins (and they are legion) are on the id field. (Aside: yes, the sort code really is that long. There's a strong BL reason.)
Also, due to O/RM use, most SELECT statements are going to pull almost every column.
Currently, the clustered index is on id, but I'm wondering if the TOP / ORDER BY portion of most queries would make identifiersortcode a more attractive option as the clustered key, even considering all of the table joins going on.
Inserts on the table and changes to the identifiersortcode are limited enough that changing my clustered index would not be a problem for insert/update operations.
Trying to make the sort code's non-clustered index a covering index (using INCLUDE) is not a good option. There are a number of large columns, and some of them have a lot of update activity.
Kimberly L. Tripp's criteria for a clustered index are that it be:
Unique
Narrow
Static
Ever Increasing
Based on that, I'd stick with your integer identity id column, which satisfies all of the above. Your identifiersortcode would fail most, if not all, of those requirements.
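A sketch of the layout this answer recommends: keep the clustered key on id and give the sort code its own nonclustered index. The table name dbo.SortDemo, the index name, and the page size of 25 are assumptions.

```sql
-- Hypothetical table with the two columns from the question.
CREATE TABLE dbo.SortDemo
(
    id int IDENTITY(1,1) NOT NULL
       CONSTRAINT PK_SortDemo PRIMARY KEY CLUSTERED,   -- unique, narrow, static, ever increasing
    identifiersortcode varchar(900) NOT NULL
);

-- Separate index to support the TOP / ORDER BY paging queries.
CREATE NONCLUSTERED INDEX IX_SortDemo_sortcode
    ON dbo.SortDemo (identifiersortcode);

-- First page of results in sort-code order; id is available from the
-- index because it is the clustering key.
SELECT TOP (25) id, identifiersortcode
FROM dbo.SortDemo
ORDER BY identifiersortcode;
```

Queries that also pull the other wide columns will still need lookups for the rows on the page, but with TOP that is a small, bounded number of rows.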
To correctly determine which field will benefit most from the clustered index, you need to do some homework. The first thing that you should consider is the selectivity of your joins. If your execution plans filter rows from this table FIRST, then join on the other tables, then you are not really benefiting from having the clustered index on the primary key, and it makes more sense to have it on the sort key.
If however, your joins are selective on other tables (they are filtered, then an index seek is performed to select rows from this table), then you need to compare the performance of the change manually versus the status quo.
Currently, the clustered index is on id, but I'm wondering if the TOP / ORDER BY portion of most queries would make identifiersortcode a more attractive option as the clustered key, even considering all of the table joins going on.
Making identifiersortcode a CLUSTERED KEY will only help if it is used both in filtering and ordering conditions.
This means that it is chosen as the leading table in all your joins and uses a Clustered Index Scan or Clustered Index Range Scan access path.
Otherwise, it will only make the things worse: first, all secondary indexes will be larger in size; second, inserts in non-increasing order will result in page splits which will make them run longer and result in a larger table.
Why, for God's sake, does your identifier sort code need to be 900 characters long? If you really need 900 characters to be distinct for sorting, it should probably be broken up into multiple fields.
Apart from repeating what Chris B. said, I think you should really stick to your current PK, since - as you said - all joins are on the Id.
I guess you have already indexed the identifiersortcode....
Nevertheless, IF you have performance issues, I would really think twice about this ##"%$£ identifiersortcode !-)

Table index design

I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it TASK table) is going to be the biggest table of the whole application. Expecting millions of records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT ID-s but from tables that are usually smaller than Task table (10-50 records max), example: Status table (with values like "open", "closed") or Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (self - ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a (Company) column (string!) that is something like "5 characters long all the time" and every user will be restricted using this one. If in Task there are 15 different "Companies" the logged in user would only see one. So there's always a filter on this one. Might be a good idea to add an index to this column?
DATE COLUMNS
I think they don't index these ... right? Or can / should be?
I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
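What you cannot index are key values that hold more than 900 bytes of data - anything like VARCHAR(1000) or such is a poor candidate. (From SQL Server 2016 onward the limit for nonclustered index keys is 1,700 bytes; clustered keys are still limited to 900.)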
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.
In general an index will speed up a JOIN, a sort operation and a filter.
So if the columns are in the JOIN, the ORDER BY or the WHERE clause then an index will help in terms of performance...but there is always a but...with every index that you add, UPDATE, DELETE and INSERT operations will be slowed down because the indexes have to be maintained
so the answer is...it depends
I would say start hitting the table with queries and look at the execution plans for scans, try to make those seeks by either writing SARGable queries or adding indexes if needed...don't just add indexes for the sake of adding indexes
Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you will have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations when you need to retrieve a lot of rows, you might consider using your clustered index to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme, and table partitioning was long an Enterprise edition feature, available in all editions only from SQL Server 2016 SP1, but it may be worth it in your case.)
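As an illustration of the Company and date-range points above, here is a sketch with a cut-down, hypothetical Task table. Every column name, type, and literal value below is an assumption, since the question does not list the real ones.

```sql
-- Cut-down hypothetical version of the TASK table.
CREATE TABLE dbo.Task
(
    TaskId     int IDENTITY(1,1) NOT NULL
               CONSTRAINT PK_Task PRIMARY KEY CLUSTERED,
    Company    char(5)  NOT NULL,   -- every user query filters on this
    StatusId   int      NOT NULL,
    PriorityId int      NOT NULL,
    CreatedAt  datetime NOT NULL
);

-- Compound index: the always-present equality filter first,
-- then the date column used for range restrictions.
CREATE NONCLUSTERED INDEX IX_Task_Company_CreatedAt
    ON dbo.Task (Company, CreatedAt);

-- Seeks to one company and then to a date range within it.
SELECT TaskId, CreatedAt
FROM dbo.Task
WHERE Company = 'ACME1'
  AND CreatedAt >= '20240101'
  AND CreatedAt <  '20240201';
```

With heavy bulk inserts on this table, each extra index like this one adds write cost, so it is worth measuring load times before and after adding it.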

SQL Server 2000 Index - Clustered vs Non Clustered

I have inherited a database where there are clustered indexes and additional duplicate indexes for each of the clustered indexes.
i.e.
IX_PrimaryKey is a clustered index on the column ID.
IX_ID is a non clustered index on the column ID.
I want to clean up these duplicate non clustered indexes and I wanted to check to see if anyone could think of a reason to do this.
Can anyone think of a performance benefit for doing this?
For exact same indexes, there's no performance gain. Actually, it incurs performance loss in insertion and updates. However, if there are multicolumn indexes with different column order, there might be a valid reason for them.
Maybe I'm not thinking hard enough, but I can't see any reason to do this; the nature of the clustered index is that the data is organized in the order of the index. It seems that the extra index is a complete waste.
Digging through BOL and watching this question, though ...
There seems no sensible reason for doing this, and there is a performance hit.
The only thing I could think of to do this is to create an index with an incredibly narrow row width so that the rows per page was very high, making it very quick to scan / seek. But since it contains no other fields (except the clustered key, which is the same value) I still cannot see a reason for it.
It's quite possible the original creator was not aware that the PK was defaulting to a clustered index and created an NC index without realising it was a duplicate.
I presume what would have happened is that SQL Server would have automatically created clustered index when a primary key constraint was specified (this would happen if another index (non-clustered/clustered) is not present already) and then some one might have created a non-clustered index for the primary key column.
Such a scenario would:
Have some adverse effect on performance as indexes are updated when inserts/deletes/updates happen.
Use additional disk space.
Might lead to deadlocks.
Would contribute to more time in backup/restore of database.
cheers
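To find such duplicates, one option is to list every index with its key columns and compare them by eye. This sketch uses the sys catalog views, which assumes SQL Server 2005 or later; they do not exist in SQL Server 2000, where the older sysindexes tables would be needed instead.

```sql
-- One row per index key column, in key order, for all user tables.
-- Indexes on the same table with identical column lists are duplicates.
SELECT
    t.name         AS table_name,
    i.name         AS index_name,
    i.type_desc    AS index_type,     -- CLUSTERED / NONCLUSTERED
    ic.key_ordinal AS key_position,
    c.name         AS column_name
FROM sys.indexes AS i
JOIN sys.tables AS t
    ON t.object_id = i.object_id
JOIN sys.index_columns AS ic
    ON ic.object_id = i.object_id
   AND ic.index_id  = i.index_id
JOIN sys.columns AS c
    ON c.object_id = ic.object_id
   AND c.column_id = ic.column_id
WHERE ic.is_included_column = 0       -- key columns only
ORDER BY t.name, i.name, ic.key_ordinal;
```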
It will be a waste to create a clustered primary key, unless you have queries that search for records using WHERE ID = 10.
You may want to create a clustered index on the column which will be frequently queried on WHERE City = 'Sydney'. Clustered means that SQL will group the data in the table based on the clustered index. By grouping the City values in the table means SQL can search for data quicker.
Storing two indexes over the same data is a waste of disk space and the processing needed to maintain the data.
However, I can imagine a product which depends on the existence of an index named IX_PrimaryKey. E.G.
string queryPattern = "select * from {0} as t with (index(IX_PrimaryKey))";
You can make the argument that the clustered index itself occupies much less space than the others, since the leaf is the actual data. On the other hand, the clustered index can be more susceptible to page splitting, and some indexes are better non-clustered.
Putting this together, I can definitely think of scenarios where removing the duplicate indexes would be a Bad Thing:
Code like above which depends on a known index name.
Code which can alter the clustered index to any of the non-clustered indexes.
Code which uses the presence/absence of IX_PrimaryKey to treat the table in a certain way.
I don't consider any of these good design, but I can definitely imagine someone doing it. (Have you posted this to DailyWTF?)
There are cases where it makes sense to have overlapping indexes which are not identical:
create index IX_1 on table1 (ID)
create index IX_2 on table1 (ID, TYPE, ORDER_DATE, TOTAL_CHARGES)
If you are looking up strictly by ID, SQL can optimize and use IX_1. If you are running a query based on ID, TYPE, ORDER_DATE and summing up TOTAL_CHARGES, SQL can use IX_2 as a "covering index", satisfying all the query details from the index without ever touching the table. Generally this is something you add in the course of performance tuning, after extensive testing.
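The covering-index case described above might look like the following sketch. The column types and the literal ID value are assumptions; the table and index names come from the example.

```sql
-- Column types are assumptions; [TYPE] is bracketed to avoid any keyword clash.
CREATE TABLE dbo.table1
(
    ID            int         NOT NULL,
    [TYPE]        varchar(10) NOT NULL,
    ORDER_DATE    datetime    NOT NULL,
    TOTAL_CHARGES money       NOT NULL
);

CREATE INDEX IX_1 ON dbo.table1 (ID);
CREATE INDEX IX_2 ON dbo.table1 (ID, [TYPE], ORDER_DATE, TOTAL_CHARGES);

-- Every column this query touches is in IX_2, so it can be answered
-- from the index alone without reading the base table.
SELECT ID, [TYPE], ORDER_DATE, SUM(TOTAL_CHARGES) AS total_charges
FROM dbo.table1
WHERE ID = 42
GROUP BY ID, [TYPE], ORDER_DATE;
```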
Looking at your given example of two indexes on exactly the same field, I don't see a great fit. Perhaps SQL can use IX_ID as a "covering index" when checking for the existence of a value and bypass some blocking on IX_PrimaryKey?