I have been reading in many SQL books and articles that selectivity is an important factor in creating an index. If a column has low selectivity, an index seek does more harm than good. But none of the articles explain why. Can anybody explain why that is, or provide a link to a relevant article?
From SimpleTalk article by Robert Sheldon: 14 SQL Server Indexing Questions You Were Too Shy To Ask
The ratio of unique values within a key column is referred to as index
selectivity. The more unique the values, the higher the selectivity,
which means that a unique index has the highest possible selectivity.
The query engine loves highly selective key columns, especially if
those columns are referenced in the WHERE clause of your frequently
run queries. The higher the selectivity, the faster the query engine
can reduce the size of the result set. The flipside, of course, is
that a column with relatively few unique values is seldom a good
candidate to be indexed.
Also check these articles:
Check this post by Pinal Dave
This other one on SQL Serverpedia
This forum post on SqlServerCentral can help you too.
This article on SqlServerCentral also
From the SqlServerCentral article:
In general, a nonclustered index should be selective. That is, the
values in the column should be fairly unique and queries that filter
on it should return small portions of the table.
The reason for this is that key/RID lookups are expensive operations
and if a nonclustered index is to be used to evaluate a query it needs
to be covering or sufficiently selective that the costs of the lookups
aren’t deemed to be too high.
If SQL considers the index (or the subset of the index keys that the
query would be seeking on) insufficiently selective then it is very
likely that the index will be ignored and the query executed as a
clustered index (table) scan.
It is important to note that this does not just apply to the leading
column. There are scenarios where a very unselective column can be
used as the leading column, with the other columns in the index making
it selective enough to be used.
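A hedged sketch of that last scenario (table and column names are hypothetical): Status alone is very unselective, but the combination with CustomerId is selective enough for a seek:

-- Status has only a handful of values; CustomerId makes the key selective.
CREATE NONCLUSTERED INDEX IX_Tickets_Status_CustomerId
    ON dbo.Tickets (Status, CustomerId);

-- A query filtering on both columns can still seek on this index.
SELECT TicketId FROM dbo.Tickets
WHERE Status = 'OPEN' AND CustomerId = 42;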
I'll try to give a very simple explanation (based on my current knowledge of SQL Server):
If an index has low selectivity, it means that many rows share the same value (for example, 200 of the 500 rows have the same value in the indexed column).
Usually, if the index does not contain all the column information you need, it stores a pointer to where the row is physically located. In a second step the engine then has to read that row.
So, as you can see, a search like this takes two steps. And this is where selectivity comes in:
The more results you get because of low selectivity, the more of this double work the engine has to do. Because of this, there are cases where a table scan is more efficient than an index seek with very low selectivity.
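As a hedged illustration of those two steps (dbo.People and its columns are hypothetical names):

-- A low-selectivity index: Country has only a handful of distinct values.
CREATE NONCLUSTERED INDEX IX_People_Country ON dbo.People (Country);

-- Step 1 finds every matching index entry; step 2 looks up each row to
-- fetch FirstName and LastName. If 200 of 500 rows match, those 200
-- lookups can cost more than simply scanning the table once.
SELECT FirstName, LastName FROM dbo.People WHERE Country = 'US';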
Related
I've got a stored procedure in SQL Server 2005, and when I run it and look at its execution plan I notice it's doing a Clustered Index Scan, which accounts for 84% of the cost. I've read that I have to modify some things to get a Clustered Index Seek there, but I don't know what to modify.
I'd appreciate any help with this.
Thanks,
Brian
Without any detail it is hard to guess what the problem is, or even whether there is a problem at all. The choice of a scan instead of a seek can be driven by many factors:
The query expresses a result set that covers the entire table. I.e. the query is a simple SELECT * FROM <table>. This is a trivial case that is perfectly covered by a clustered index scan, with no need to consider anything else.
The optimizer has no alternatives:
the query expresses a subset of the entire table, but the filtering predicate is on columns that are not part of the clustered key and there are no non-clustered indexes on those columns either. There is no alternative plan other than a full scan.
The query has filtering predicates on columns in the clustered index key, but they are not SARGable. The filtering predicate usually needs to be rewritten to make it SARGable; the proper rewrite varies from case to case. A more subtle problem can appear due to implicit conversion rules, e.g. the filtering predicate is WHERE column = @value but column is VARCHAR (ASCII) and @value is NVARCHAR (Unicode). See the sketch after this list.
The query has SARGable filtering predicates on columns in the clustered key, but the leftmost column is not filtered. I.e. the clustered index is on columns (foo, bar) but the WHERE clause is on bar alone.
The optimizer chooses a scan.
When the alternative is a non-clustered index seek (or range seek) but the choice is made to use the clustered index instead, the cause can usually be tracked down to the index tipping point, due to lack of non-clustered index coverage for the query projection. Note that this is not your question, since you expect a clustered index seek, not a non-clustered index seek (assuming the question is 100% accurate and documented...)
Cardinality estimates. The query cost estimate is based on the clustered index key statistics, which provide an estimate of the cardinality of the result (i.e. how many rows will match). On a simple query this cannot happen, as any estimate for a seek or range seek will be lower than the one for a scan, no matter how far off the statistics are. But on a complex query, with joins and filters on multiple tables, things are more complicated, and the plan may include a scan where a seek was expected, because the query optimizer may choose a plan in which the join evaluation order is reversed from what the observer expects. The reversed-order choice may be correct (most times) or may be problematic (usually due to obsolete statistics or to parameter sniffing).
An ordering guarantee. A scan will produce results in a guaranteed order and elements higher on the execution tree may benefit from this order (eg. a sort or spool may be eliminated, or a merge join can be used instead of hash/nested joins). Overall the query cost is better as a result of choosing an apparently slower access path.
These are some quick pointers as to why a clustered index scan may be present when a clustered index seek is expected. The question is extremely generic, and it is impossible to answer 'why' other than by relying on an 8-ball. Now, if I take your question to be properly documented and correctly articulated, then expecting a clustered index seek means you are searching for a unique record based on a clustered key value. In that case the problem has to be with the SARGability of the WHERE clause.
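To make the SARGability point from the list above concrete, here is a hedged sketch (dbo.Orders, OrderDate and Code are hypothetical names):

-- Non-SARGable: the function call on the column prevents an index seek.
SELECT OrderId FROM dbo.Orders WHERE YEAR(OrderDate) = 2024;

-- SARGable rewrite of the same predicate as an open-ended range:
SELECT OrderId FROM dbo.Orders
WHERE OrderDate >= '20240101' AND OrderDate < '20250101';

-- Implicit-conversion trap: if Code is VARCHAR, comparing it to an
-- NVARCHAR variable converts the column and again prevents a seek.
DECLARE @value NVARCHAR(20);
SET @value = N'ABC';
SELECT OrderId FROM dbo.Orders WHERE Code = @value;                       -- may scan
SELECT OrderId FROM dbo.Orders WHERE Code = CAST(@value AS VARCHAR(20));  -- can seek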
If the query includes more than a certain percentage of the rows in the table, the optimizer will elect to do a scan instead of a seek, because it predicts that the scan will require fewer disk IOs in that case (for a seek, it needs one disk IO per level in the index for each row it returns, whereas for a scan there is only one disk IO per row in the entire table).
So if there are, say, 5 levels in the B-tree index, then if the query will return more than 20% of the rows in the table, it is cheaper to read the whole table than to make 5 IOs for each of that 20% of the rows...
Can you narrow the output of the query a bit more, to reduce the number of rows returned by this step in the process? That would help it choose the seek over the scan.
I have created a script to find the selectivity of each column for every table. Some of those tables have fewer than 100 rows, but the selectivity of a column is more than 50%,
where Selectivity = Distinct Values / Total Number of Rows.
So, are those columns eligible for an index?
Or, can you tell me the minimum number of rows a table requires before a column is eligible for an index?
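For reference, the ratio described above could be computed per column with a query along these lines (table and column names are hypothetical):

-- Selectivity = distinct values / total number of rows
SELECT COUNT(DISTINCT SomeColumn) * 1.0 / NULLIF(COUNT(*), 0) AS selectivity
FROM dbo.SomeTable;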
I think I understand what you are trying to accomplish by calculating a 'Selectivity' value for your data, but you cannot apply the rule blindly.
In fact, for certain queries the 'Selectivity' value might be really low and an index will still be very beneficial. For example:
Assume an 'inbox' table with millions of rows, where the rows have a 'Read' boolean field. In this case the number of distinct values over the number of rows will be really low. But if most items are read most of the time, then finding unread items with an index on this field will be very efficient.
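For exactly this skewed-flag pattern, a filtered index (available from SQL Server 2008 onwards) is one option; a minimal sketch with hypothetical names:

-- The index only contains unread rows, so it stays tiny even if the
-- table holds millions of (mostly read) messages.
CREATE NONCLUSTERED INDEX IX_Inbox_Unread
    ON dbo.Inbox (UserId)
    INCLUDE (Subject)
    WHERE [Read] = 0;

-- A query with a matching predicate can use it efficiently:
SELECT Subject FROM dbo.Inbox WHERE UserId = 42 AND [Read] = 0;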
Creating indexes comes at a cost. Although you get the benefit for reads, you pay for writes and disk usage.
I would rather recommend you profile your queries and index accordingly. You can also look at the data from sys.dm_db_missing_index_group_stats and other dynamic management views that will give you insight into index usage (or missing indexes).
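For example, a common way to rank the missing-index suggestions from those DMVs looks roughly like this:

-- Rough sketch: rank missing-index suggestions by estimated impact.
SELECT TOP (20)
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_total_user_cost * s.avg_user_impact * s.user_seeks AS estimated_benefit
FROM sys.dm_db_missing_index_group_stats AS s
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_group_handle = s.group_handle
JOIN sys.dm_db_missing_index_details AS d
    ON d.index_handle = g.index_handle
ORDER BY estimated_benefit DESC;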
You can create an index on a table with 0 rows, 1 row or 100 million rows. You can create an index where every column has the same value or all unique values.
So you can create an index. The real question is whether you should create an index, and no tool is going to tell you that, because indexes can also span multiple columns and it depends on which queries you run. Creating indexes is something you do when performance-tuning queries, or preemptively when you know you'll be writing queries that will use them.
Every index comes with a cost in terms of the space and time required to do updates, inserts and deletes. You don't want to be creating them spuriously, so you're really going to have to do this by hand, not as the result of a script that checks how unique the values of a column are.
A general rule of thumb says that if you have a very large table (over 1 million rows), you should only use an index if a WHERE clause based on that index selects at most something in the neighborhood of 1-2% of the data.
If you have a "gender" column and roughly 50% of values are "male" and roughly 50% "female", then having an index on that really doesn't give you much - SQL Server and most other RDBMS will most likely still do a full table scan in this case, since on average, they'd have to scan at least half the table anyway, so the "detour" by using an index first and then looking up the actual full data based on that index value is just not worth it.
An index is excellent if you have something like unique keys (customer number), or a value that is quite selective. An index is not without cost - it uses up disk space, it needs to be maintained, it will slightly slow down all operations besides SELECT - so tread carefully; it's not the best idea to just blindly index everything. Having too few indices is bad - but having too many, and the wrong ones, can be even worse! :-) Nobody ever claimed getting your indices right was easy.... :-)
But there's definitely help out there - the best source I know are Kimberly Tripp's excellent blog posts on SQL Server indexing (and many other topics).
Marc
Can a select query use different indexes if I change the value in a where condition?
The two following queries use different indexes, and the only difference between them is the value in the condition: and typeenvoi='EXPORT' vs. and typeenvoi='MAIL'
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='EXPORT'
and nbessais<1
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='MAIL'
and nbessais<1
Can anyone give me an explanation?
Details on indexes are stored as statistics in a histogram-type dataset in SQL Server.
Each index is chunked into ranges, and each range contains a summary of the key values within that range, things like:
Range high value
Number of values in the range
Number of distinct values in the range (cardinality)
Number of values equal to the high value
...and so on.
You can view the statistics on a given index with:
DBCC SHOW_STATISTICS(<tablename>, <indexname>)
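For example (hypothetical table and index names), WITH HISTOGRAM restricts the output to the histogram itself:

DBCC SHOW_STATISTICS ('dbo.EnvoiAutomate', 'IX_EnvoiAutomate_typeenvoi')
    WITH HISTOGRAM;
-- Histogram columns: RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS,
-- DISTINCT_RANGE_ROWS, AVG_RANGE_ROWS.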
Each index has a couple of characteristics like density, and ultimately selectivity, that tell the query optimiser how unique each value in an index is likely to be, and how efficient this index is at quickly locating records.
As your query has three columns in the where clause, it's likely that any of these columns might have an index that could be useful to the optimiser. It's also likely that the primary key index will be considered, in the event of the selectivity of other indexes not being high enough.
Ultimately, it boils down to the optimiser making a quick judgement call on how many page reads will be necessary to read each of your non-clustered indexes plus the associated bookmark lookups, vs. doing a table scan.
The statistics that these judgements are based on can vary wildly too; SQL Server, by default, only samples a small percentage of any significant table's rows, so the selectivity of that index might not be representative of the whole. This is particularly problematic where you have highly non-unique keys in the index.
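If you suspect the sampled statistics are unrepresentative, one option is to rebuild them from every row (table name hypothetical):

UPDATE STATISTICS dbo.EnvoiAutomate WITH FULLSCAN;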
In this specific case, I'm guessing your typeenvoi index is highly non-unique. This being so, the statistics gathered probably indicate to the optimiser that one of the values is rarer than the other, and the likelihood of that index being chosen is increased.
The query optimiser in SQL Server (as in most modern DBMS platforms) uses a methodology known as 'cost based optimisation.' In order to do this it uses statistics about the tables in the database to estimate the amount of I/O needed. The optimiser will consider a number of semantically equivalent query plans that it generates by transforming a basic query plan generated by parsing the statement.
Each plan is evaluated for cost by a heuristic based on the statistics maintained about the tables. The statistics come in various flavours:
Table and index row counts
Distribution histograms of the values in individual columns.
If the occurrence of 'MAIL' vs. 'EXPORT' in the distribution histograms is significantly different, the query optimiser can come up with different optimal plans. This is probably what happened.
Probably has to do with the "cardinality", I believe the word is, of the values in the table. If there are a lot more rows that match that clause, SQL Server may decide that one query will be more efficient using an index for a different column. This is an extreme case, but if there was one row that matched 'MAIL', it would likely use that index. If every other row in the table was 'EXPORT', but only half of those 'EXPORT' rows had an etat of 0, then it would probably use the index on that column.
Say I have a table with a large number of rows and one of the columns which I want to index can have one of 20 values.
If I were to put an index on the column would it be large?
If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same.
It's not the indexes that will suck. It's putting indexes on the wrong columns that will suck.
Seriously though, why would you need a table with a single column? What would the meaning of that data be? What purpose would it serve?
And 20 tables? I suggest you read up on database design first, or otherwise explain to us the context of your question.
Indexes (or indices) don't suck. A lot of very smart people have spent a truly remarkable amount of time of the last several decades ensuring that this is so.
Your schema, however, lacking the same amount of expertise and effort, may suck very badly indeed.
Partitioning, in the case described, is equivalent to applying a clustered index. If the table is sorted otherwise (or is in arbitrary order), then the index necessarily has to occupy much more space. Depending on the platform, a non-clustered index may reduce in size as the sortedness of the rows with respect to the indexed value increases.
YMMV.
The short answer:
Do indexes suck: Yes and No
The longer answer:
They don't suck if used properly. Maybe you should start reading about how indexes work, why they can work and why they sometimes don't work.
Good starting points:
http://www.sqlservercentral.com/articles/Indexing/
No, indexes don't suck, but you have to pay attention to how you use them, or they can backfire on the performance of your queries.
First: Schema / design
Why would you create a table with only one column? That's probably taking normalization one step too far. Database design is one of the most important things to consider when optimizing performance.
Second: Indexes
In a nutshell the indexes will help the database to perform a binary search of your record. Without an index on a column (or set of columns) the database will often fall back to a table scan. A table scan is very expensive because it involves enumerating each and every record.
It doesn't really matter THAT much for index seeks how many records there are in the table. Because of the (balanced) tree search, doubling the number of records results in only one extra search step.
Determine the primary key of your table; SQL Server will automatically place a clustered index on that column (or columns). Clustered indexes perform really well. In addition, you can place non-clustered indexes on columns that are used often in SELECT, JOIN, WHERE, GROUP BY and ORDER BY clauses. Do remember that indexes overlap to a certain extent; try never to include your clustered index columns in a non-clustered index.
Also interesting is the fill factor on the indexes. Do you want to optimize your table for reads (high fill factor - less storage, less IO) or for writes (low fill factor - more storage, less rebuilding of your database pages)?
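A minimal sketch of both points (all names hypothetical): the primary key gets the clustered index by default, and a non-clustered index is added with an explicit fill factor:

-- PRIMARY KEY creates the clustered index by default.
CREATE TABLE dbo.Orders (
    OrderId    INT IDENTITY PRIMARY KEY,
    CustomerId INT      NOT NULL,
    OrderDate  DATETIME NOT NULL
);

-- Non-clustered index for a frequent filter/join column; FILLFACTOR = 80
-- leaves 20% free space per page for future inserts.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    WITH (FILLFACTOR = 80);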
Third: Partitioning
One of the reasons to use partitioning is to optimize your data access. Let's say you have 1 million records of which 500,000 records are no longer relevant but stored for archiving purposes. In this case you could decide to partition the table and store the 500,000 old records on slow storage and the other 500,000 records on fast storage.
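A hedged sketch of that setup (partition function, scheme, filegroup and table names are all hypothetical, and the filegroups must already exist):

-- Rows before the cutoff go to slow storage, the rest to fast storage.
CREATE PARTITION FUNCTION pfArchiveSplit (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20200101');

CREATE PARTITION SCHEME psArchiveSplit
    AS PARTITION pfArchiveSplit TO (SlowStorageFG, FastStorageFG);

CREATE TABLE dbo.Records (
    RecordId  INT           NOT NULL,
    CreatedAt DATETIME      NOT NULL,
    Payload   NVARCHAR(400) NULL
) ON psArchiveSplit (CreatedAt);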
To measure is to know
The best way to get insight into what happens is to measure what happens to your CPU and IO. Microsoft SQL Server has tools like the Profiler and the execution plans in Management Studio that will tell you the duration of your query, the number of reads/writes and the CPU usage. The execution plan will also tell you which indexes are being used, or whether any are used at all. To your surprise, you might see a table scan where you didn't expect one.
Say I have a table with a large number of rows, and one column which I want to index can have one of 20 values. If I were to put an index on the column, would it be large?
The index size will be proportional to the number of your rows and the length of the indexed values.
The index keeps not only the indexed value, but also some kind of pointer to the row (ROWID in Oracle, CTID in PostgreSQL, the primary key in InnoDB, etc.).
If you have 10,000 rows and 1 distinct value, you will still have 10,000 records in your index.
If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same.
In this case, you would end up with 20 indexes that in total are the same size as your original one.
In fact, this technique is sometimes used, in so-called partitioned indexes. It has its advantages and drawbacks.
Standard B-tree indexes are best suited to fairly selective columns, which this example's column would not be. You don't say which DBMS you are using; Oracle has another type of index, called a bitmap index, which is more suited to low-selectivity columns in OLAP environments (these indexes are expensive to maintain, making them unsuitable for OLTP environments).
The optimiser will decide, based on stats, whether it thinks the index will help get the data in the fastest time; if it won't, the optimiser won't use it.
Partitioning is another strategy. In Oracle you can define a table as partitioned on some set of columns, and the optimiser can automatically perform "partition elimination" like you suggest.
Sorry, I'm not quite sure what you mean by "large".
If your index is clustered, all the data for each record will be on the same leaf page, thereby creating the most efficient index available to your table as long as you write your queries against it properly.
If your index is non-clustered, then only the index-related data will be on your leaf pages. Then, depending on such things as how many other indexes you have, coupled with details like your fill factor, your index may or may not be efficient. In general, if you don't have a ton of indexes on your table, you should be safe.
The efficiency of your index will also be determined by the data type of the 20 values you're speaking of going into the column. If those are pre-defined values, then their details should probably be in a lookup table with a simple primary key datatype (like Int/Number). Then add that column to your table as a foreign key with an index on the column.
Ultimately, you could have a perfect index on a column. But its best use will be determined, for the most part, by the queries you write. So if your queries make use of the indexes, you're golden.
Indexes are purely for performance. If an index doesn't boost performance for the queries you're interested in, then it sucks.
As for disk usage, you have to weigh your concerns. Different SQL providers build indexes differently, but as a client, you generally trust that they do the best that can be done. In the case you're describing, a clustered index may be optimal for both size and performance.
It would be large enough to hold those values for all the rows, in a sorted order.
Say you have 20 different strings of 4 characters each, and 1 million rows; it would take at least 4 million bytes (or 8 million if 16-bit Unicode) just to hold those values.
I have a number of indexes on some tables, they are all similar and I want to know if the Clustered Index is on the correct column. Here are the stats from the two most active indexes:
Nonclustered
I3_Identity (bigint)
rows: 193,781
pages: 3821
MB: 29.85
user_seeks: 463,355
user_scans: 784
user_lookups: 0
updates: 256,516
Clustered Primary Key
I3_RowId (varchar(80))
rows: 193,781
pages: 24,289
MB: 189.76
user_seeks: 2,473,413
user_scans: 958
user_lookups: 463,693
updates: 2,669,261
As you can see, the PK is seeked often, but all the seeks on the I3_Identity column are doing key lookups to this PK as well, so am I really benefiting from the index on I3_Identity at all? Should I change to using I3_Identity as the clustered index? This could have a huge impact, as this table structure is repeated about 10,000 times where I work, so any help would be appreciated.
Frederik sums it up nicely, and that's really what Kimberly Tripp also preaches: the clustering key should be stable (never changes), ever increasing (IDENTITY INT), small and unique.
In your scenario, I'd much rather put the clustering key on the BIGINT column rather than the VARCHAR(80) column.
First of all, with the BIGINT column it's reasonably easy to enforce uniqueness (if you don't enforce and guarantee uniqueness yourself, SQL Server will add a 4-byte "uniqueifier" to each and every one of your rows), and it's MUCH smaller on average than a VARCHAR(80).
Why is size so important? The clustering key will also be added to EACH and every one of your non-clustered indexes - so if you have a lot of rows and a lot of non-clustered indexes, having 40-80 bytes vs. 8 bytes per row can quickly make a HUGE difference.
Also, another performance tip: in order to avoid the so-called bookmark lookups (from a value in your non-clustered index via the clustering key into the actual data leaf pages), SQL Server 2005 has introduced the notion of "included columns" in your non-clustered indexes. Those are extremely helpful, and often overlooked. If your queries often require the index fields plus just one or two other fields from the database, consider including those in order to achieve what is called "covering indexes". Again - see Kimberly Tripp's excellent article - she's the SQL Server Indexing Goddess! :-) and she can explain that stuff much better than I can...
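A minimal covering-index sketch (hypothetical names): the key column is what you seek on, and the INCLUDEd columns ride along in the leaf pages, so the query below needs no bookmark lookup:

CREATE NONCLUSTERED INDEX IX_Customers_LastName
    ON dbo.Customers (LastName)
    INCLUDE (FirstName, Email);

-- Fully covered: every referenced column lives in the index itself.
SELECT FirstName, Email FROM dbo.Customers WHERE LastName = N'Smith';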
So to sum it up: put your clustering key on a small, stable, unique column - and you'll do just fine!
Marc
quick 'n dirty:
Put the clustered index on:
a column whose values (almost) never change
a column whose values on new records increase or decrease sequentially
a column where you perform range searches
Here's the best discussion I've found about the topic. Kimberly Tripp is an MS blogger who stays on top of the debate. I could interpret it for you, but you obviously understand the basic words and concepts, and the article is highly readable. So enjoy!
Hint: you'll find that short answers are almost always too simplistic.
From what I've read in the past, two of the most important measures with regard to indexing tables are the number of queries performed against the index and the index density. By using DBCC SHOW_STATISTICS([table], [index]), you can examine index density. The idea is that you want your clustered index on the columns that provide the most distinctness per query.
In short, if you look at the "All density" measure from DBCC SHOW_STATISTICS and notice the number is very low, this is a good index to cluster. It makes logical sense to cluster on an index that provides more uniqueness, but only if it's actively queried against. Clustering on a seldom-used index will probably do more harm than good.
In the end it's a judgment call. You may want to talk with your DBA and analyze your code to see where you'll get the biggest benefit. In this limited example, your indexing seems to be clustered in the right area if you only consider usage (and even when you consider all density, given that a primary key provides the most uniqueness you can muster).
Edit: There's a pretty good article on MSDN that explains what SHOW_STATISTICS provides you. I'm certainly not an uber DBA, but most of the information I've provided here came from guidance given by our DBA :)
Here's the article: http://msdn.microsoft.com/en-us/library/ms174384.aspx
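To look at the "All density" figures directly (hypothetical names):

DBCC SHOW_STATISTICS ('dbo.YourTable', 'PK_YourTable')
    WITH DENSITY_VECTOR;
-- "All density" is 1 / (number of distinct key values); lower means
-- more selective.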
Generally, when I see key lookups to the primary key/clustered key, it means I need to INCLUDE more columns in the non-clustered index. Look at your queries and see which columns are being selected/used in those statements. If you include those columns in the non-clustered index, then it won't need to do the key lookup any more.
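A sketch using this question's names (Col1 and Col2 stand in for whatever columns your SELECT statements actually reference):

-- Once the selected columns are INCLUDEd, a seek on I3_Identity no
-- longer needs a key lookup into the clustered index.
CREATE NONCLUSTERED INDEX IX_YourTable_I3_Identity
    ON dbo.YourTable (I3_Identity)
    INCLUDE (Col1, Col2);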