SQL Server indexing includes questions

I've been troubleshooting some bad SQL calls in my work's applications. I've been reading up on indexes, tweaking and benchmarking things. Here are some of the rules I've gathered (let me know if this sounds right):
For heavily used queries, boil down the query to only what is needed and rework the WHERE clauses to use the most common columns first. Then make a nonclustered index on the columns used in the WHERE clause and INCLUDE any remaining select columns (excluding large columns, of course, like nvarchar(max)).
If a query is going to return > 20% of the entire table's contents, it's best to do a table scan and not use an index.
Order in an index matters. You have to make sure to structure your WHERE clause like the index is built.
Now, one thing I'm having trouble finding info on: what if a query selects columns that are not part of any index but uses a WHERE clause on a column that is? Is the index used, with the leaf node pointing to the table so the associated row can be read?
ex: table
Id col1 col2 col3
CREATE INDEX my_index
ON my_table (col1)
SELECT Id, col1, col2, col3
FROM my_table
WHERE col1 >= 3 AND col1 <= 6
Is my_index used here? If so, how does it resolve Id, col2, col3? Does it point back to table rows and pick up the values?

To answer your question: yes, my_index can be used (the optimizer decides based on how many rows it expects to match). And yes, the index points back to the table rows, and the Id, col2 and col3 values are picked up there. That is what an index does.
Regarding your 'rules'
Rule 1 makes sense. Except for the fact that I usually do not 'include' other columns in my index. As explained above, the index will refer back to the table and quickly retrieve the row(s) that you need.
Rule 2, I don't really understand. You create the index and SQL Server will decide which indices to use or not use. You don't really have to worry about it.
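Rule 3: the order of the conditions in your WHERE clause does not really make a difference, because the optimizer matches them to the index for you. The order of the columns in the index key does matter, though.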
I hope this helps.
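One way to see the lookup described above, rather than take it on faith, is to compare the reads for the same query with and without included columns. This is a minimal T-SQL sketch, not a benchmark: the column types, the clustered primary key on Id, and the index name my_index_covering are assumptions made for illustration, and the table needs representative data before the numbers mean anything.

```sql
-- Assumed schema: the question only names the columns, so the types are guesses.
CREATE TABLE my_table (
    Id   INT NOT NULL PRIMARY KEY,  -- clustered by default in SQL Server
    col1 INT NOT NULL,
    col2 INT NOT NULL,
    col3 INT NOT NULL
);

CREATE INDEX my_index ON my_table (col1);

-- (Load representative data here before comparing.)

SET STATISTICS IO ON;  -- report logical reads for each statement

-- With only my_index, col2 and col3 are not in the index, so each matching
-- row needs a key lookup into the clustered index. Id needs no lookup: the
-- clustering key is carried in every nonclustered index row.
SELECT Id, col1, col2, col3
FROM my_table
WHERE col1 >= 3 AND col1 <= 6;

-- Hypothetical covering variant: col2 and col3 are stored at the leaf level.
CREATE INDEX my_index_covering ON my_table (col1) INCLUDE (col2, col3);

-- Same query again; it can now be answered from the index alone.
SELECT Id, col1, col2, col3
FROM my_table
WHERE col1 >= 3 AND col1 <= 6;

SET STATISTICS IO OFF;
```

Comparing the logical reads reported for the two SELECT statements, together with the actual execution plans, shows whether the lookups were a meaningful cost for this data.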

From dba.stackexchange.com:
There are a few concepts and terms that are important to understand
when dealing with indexes. Seeks, scans, and lookups are some of the
ways that indexes will be utilized through select statements.
Selectivity of key columns is integral to determining how effective an
index can be.
A seek happens when the SQL Server Query Optimizer determines that the
best way to find the data you have requested is by scanning a range
within an index. Seeks typically happen when a query is "covered" by
an index, which means the seek predicates are in the index key and the
displayed columns are either in the key or included. A scan happens
when the SQL Server Query Optimizer determines that the best way to
find the data is to scan the entire index and then filter the results.
A lookup typically occurs when an index does not include all requested
columns, either in the index key or in the included columns. The query
optimizer will then use either the clustered key (against a clustered
index) or the RID (against a heap) to "lookup" the other requested
columns.
Typically, seek operations are more efficient than scans, due to
physically querying a smaller data set. There are situations where
this is not the case, such as a very small initial data set, but that
goes beyond the scope of your question.
Now, you asked how to determine how effective an index is, and there
are a few things to keep in mind. A clustered index's key columns are
called a clustering key. This is how records are made unique in the
context of a clustered index. All nonclustered indexes will include
the clustered key by default, in order to perform lookups when
necessary. All indexes will be inserted to, updated to, or deleted
from for every respective DML statement. That having been said, it is
best to balance performance gains in select statements against
performance hits in insert, delete, and update statements.
In order to determine how effective an index is, you must determine
the selectivity of your index keys. Selectivity can be defined as a
percentage of distinct records to total records. If I have a [person]
table with 100 total records and the [first_name] column contains 90
distinct values, we can say that the [first_name] column is 90%
selective. The higher the selectivity, the more efficient the index
key. Keeping selectivity in mind, it is best to put your most
selective columns first in your index key. Using my previous [person]
example, what if we had a [last_name] column that was 95% selective?
We would want to create an index with [last_name], [first_name] as the
index key.
I know this was a bit of a long-winded answer, but there really are a lot
of things that go into determining how effective an index will be, and
a lot of things you must weigh any performance gains against.
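The selectivity figure defined above (distinct values divided by total rows) can be measured directly. A minimal sketch, assuming a table named person with first_name and last_name columns as in the quoted example; the output column aliases are made up:

```sql
-- Assumed table and columns, taken from the [person] example above.
SELECT
    COUNT(*)                   AS total_rows,
    COUNT(DISTINCT first_name) AS distinct_first_names,
    COUNT(DISTINCT last_name)  AS distinct_last_names,
    -- 100.0 forces decimal arithmetic; NULLIF avoids division by zero
    -- on an empty table. COUNT(DISTINCT ...) ignores NULL values.
    100.0 * COUNT(DISTINCT first_name) / NULLIF(COUNT(*), 0) AS first_name_selectivity_pct,
    100.0 * COUNT(DISTINCT last_name)  / NULLIF(COUNT(*), 0) AS last_name_selectivity_pct
FROM person;
```

With the numbers from the example (100 rows, 90 distinct first names), the first_name figure works out to 90 percent.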

Related

Issue with the big tables ( no primary key available)

Table1 has around 10 lakh records (1 million) and does not contain a primary key. Retrieving data with a SELECT command (with a specific WHERE condition) takes a large amount of time. Can we reduce the retrieval time by adding a primary key to the table, or do we need to do something else? Kindly help me.
A primary key does not have a direct effect on performance. But indirectly, it does. This is because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. But you can create your own unique indexes on a table. So, strictly speaking, a primary key does not affect performance, but the index used by the primary key does.
WHEN SHOULD PRIMARY KEY BE USED?
Primary key is needed for referring to a specific record.
To make your SELECTs run fast, you should consider adding an index on the appropriate columns you're using in your WHERE clause.
E.g. to speed-up SELECT * FROM "Customers" WHERE "State" = 'CA' one should create an index on State column.
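As a minimal sketch of that suggestion (Customers and State come from the example; the index name IX_Customers_State is made up):

```sql
-- Index on the column used in the WHERE clause.
CREATE INDEX IX_Customers_State ON Customers (State);

-- The filter can now be resolved through the index. Because of SELECT *,
-- each matching row is still fetched from the table itself, so the
-- optimizer may prefer a scan if a large share of customers are in 'CA'.
SELECT * FROM Customers WHERE State = 'CA';
```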
A primary key will not help if the primary key column is not in your WHERE clause.
If you would like to make your query faster, you can create a nonclustered index on the columns in your WHERE clause. You may also want to add included columns to that index (it depends on your SELECT clause).
The SQL optimizer will seek on your indexes, which will make your query faster.
(But you should think about what happens when data is added to your table. Insert operations might take longer if you create indexes on many columns.)
It depends on the SELECT statement, and the size of each row in the table, the number of rows in the table, and whether you are retrieving all the data in each row or only a small subset of the data (and if a subset, whether the data columns that are needed are all present in a single index), and on whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers
One way you can do it is to create indexes on your table. It's always better to create a primary key, which creates a unique index that by default will reduce the retrieval time .........
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?

SQL Indexing: None, Single Column, and Multiple Columns

How does indexing work in SQL and what benefits does it provide? What reason would there be for not indexing? And what is the difference between indexing a single column vs. indexing multiple columns?
How does indexing work in SQL and what benefits does it provide?
When you index columns you express your intent to query the indexed columns in conditional expressions, such as equality or range queries. With this information the storage engine can build a structure that makes such queries faster, often arranging them in tree structures. B-trees are the most common ones, but a lot of different structures exist, such as hash indices, R-tree indices for spatial data etc. Each structure is specialized in a certain type of lookup. For instance, hash indices are very fast for equality conditions, such as:
SELECT * FROM example_table WHERE type = 'example';
SELECT * FROM example_table WHERE id = X;
B-trees are also fairly quick for equality lookups, but their main strength is that they support range queries:
SELECT * FROM example_table WHERE id > 5 AND id < 10
SELECT * FROM example_table WHERE type = 'example' and value > 25
It is VERY important, however, when you build B-tree indices to understand that the tree is ordered in a "left-to-right" manner. I.e., if you build a B-tree index (let's call it A) on {type, value}, then you NEED to have a condition on the type-column in order for the query to be able to utilize the index. The example index can NOT be used in a query where the condition solely depends on value.
Furthermore, if you mix equality and a range condition, make sure that the equality columns are listed first in the index, otherwise the index can only be partially used.
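The left-to-right rule can be sketched with the {type, value} index from the text. The table definition and the index name idx_type_value are hypothetical; only the column names come from the answer. Running each SELECT through the engine's plan viewer (for example EXPLAIN in MySQL) should show the difference.

```sql
-- Hypothetical table; only type and value come from the text above.
CREATE TABLE example_table (
    id    INT         NOT NULL PRIMARY KEY,
    type  VARCHAR(20) NOT NULL,
    value INT         NOT NULL
);

-- Equality column first, range column second.
CREATE INDEX idx_type_value ON example_table (type, value);

-- Can use the full index: equality on the leading column, then a range
-- on the next column.
SELECT * FROM example_table WHERE type = 'example' AND value > 25;

-- Can use the index on its leading column only.
SELECT * FROM example_table WHERE type = 'example';

-- No condition on the leading column: entries with value > 25 are spread
-- across every type, so the index order does not help locate them.
SELECT * FROM example_table WHERE value > 25;
```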
What reason would there be for not indexing?
If the selectivity of the index is low, then you might not gain much over a table scan. Say, for instance, that you have an index on a field called gender. Then the selectivity of that index will be low, since a lookup on that index will return half the rows of the original table. You can read a pretty simple explanation on selectivity here, and the reasoning behind it: http://mattfleming.com/node/192
Also, maintaining an index has a cost. For each data manipulation the index might need restructuring. So keeping the amount of indices to the minimum required to perform well on the queries against that table might be desirable.
What is the difference between indexing a single column vs. indexing multiple columns?
Once again, it depends on the type of queries you issue. Indexing a single column gender might not be a good idea, since the selectivity is low. When the selectivity is high then such an index makes much more sense. For instance, an index on the primary key is a very good index, since the selectivity is high (actually, it is as high as it gets: each key in the index corresponds to exactly one record), and indices on columns with unique or highly different values (such as slugs, password hashes and what not) are also good single column indices.
There is also the concept of covering indices. Basically, each leaf in an index contains a pointer into the table where the row is stored (unless the index is a clustered index, in which case the leaf is the record). So for each index hit, the query engine has to fetch the corresponding table row, increasing the number of I/O operations. Since I/O is extremely slow, you want to keep this to a minimum. Now, let's say that you often need to query for something, and also fetch some additional columns; then you can create a covering index, trading storage space for query performance. Example: Let's find the name and email of all users who have joined in the last 6 months (assuming MySQL):
With index on {joined_at}:
SELECT first_name, last_name, email
FROM users
WHERE joined_at > NOW() - INTERVAL 6 MONTH;
Query explanation:
id  select_type  table  type  possible_keys  key   key_len  ref   rows  Extra
1   SIMPLE       users  ALL   test           NULL  NULL     NULL  873   Using where
As you can see in the type-column, the query engine resorted to a full table scan, since the index selectivity was too low to be worthwhile using in this query (too many results would be returned, and thus followed into the table, costing too much in I/O)
With index on {joined_at, first_name, last_name, email}:
id  select_type  table  type   possible_keys  key    key_len  ref   rows  Extra
1   SIMPLE       users  range  test,test2     test2  8        NULL  514   Using where; Using index
Now, since all the information that is necessary to complete the query is available in the index, the query engine evaluates that it is much better to use the index (with 514 rows) instead of doing a full table scan.
So as you can see, by using covering indices we can speed up queries for partial selects of a table, even if the selectivity of the index is quite small.
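The two setups compared above could be reproduced roughly as follows, in MySQL syntax. The index names test and test2 are taken from the possible_keys column of the EXPLAIN output; the column lists are the ones the text describes. The actual plan and row counts depend on the data, and the wider index assumes the columns are short enough to fit within MySQL's index key length limits.

```sql
-- Narrow index: good for filtering, but the selected columns are not in it.
CREATE INDEX test ON users (joined_at);

-- Covering index: the filter column first, then every selected column.
CREATE INDEX test2 ON users (joined_at, first_name, last_name, email);

-- "Using index" in the Extra column means the query was answered from
-- the index without reading the table rows.
EXPLAIN
SELECT first_name, last_name, email
FROM users
WHERE joined_at > NOW() - INTERVAL 6 MONTH;
```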
How does indexing work in SQL
That's a pretty open question, but basically databases store a structure that enables faster lookup of information. That structure is dependent on the implementation, but it's typically a type of tree.
what benefits does it provide?
Queries that are SARGable can be significantly faster.*
What reason would there be for not indexing?
Some data modification queries can take longer and there is a storage cost to indexes, but generally speaking, both of these considerations are negligible.
And what is the difference between indexing a single column vs. indexing multiple columns?
There isn't much difference, but sometimes people create covering indexes** that index multiple columns to increase the performance of a specific query.
*SARGable is from Search ARGument ABLE. Basically if you do WHERE FOO > 5 it can be faster if FOO is indexed. On the other hand WHERE h(FOO) > 5 probably won't benefit from an index.
** If all the fields used in the SELECT, JOIN and WHERE of a statement are also in an index, a database can retrieve all the information it needs without going back to the base table. This is called a covering index. If all the fields were in separate indexes, it would only use the ones for the joins and where and then go back to the base table for the columns in the select.
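To make the SARGable footnote concrete, here is a small sketch with a hypothetical orders table (the table, column, and index names are invented for illustration). The first query wraps the indexed column in a function, like WHERE h(FOO) > 5 above; the second expresses the same filter as a plain range on the column.

```sql
-- Hypothetical table and index, for illustration only.
CREATE TABLE orders (
    order_id   INT  NOT NULL PRIMARY KEY,
    order_date DATE NOT NULL
);
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- Not SARGable: YEAR() has to be evaluated for every row before the
-- comparison, so the index order on order_date cannot be used to seek.
SELECT order_id FROM orders WHERE YEAR(order_date) = 2024;

-- SARGable rewrite: the bare column is compared against constants, so the
-- engine can seek to the start of the range and read forward.
SELECT order_id FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
```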

How do i optimize this query?

I have a very specific query. I tried lots of ways but I couldn't reach the performance I want.
SELECT *
FROM
items
WHERE
user_id=1
AND
(item_start < 20000 AND item_end > 30000)
I created an index on user_id, item_start, item_end.
This didn't work, so I dropped all indexes and created new ones:
user_id, (item_start, item_end)
This didn't work either.
(user_id, item_start and item_end are int)
edit: database is MySQL 5.1.44, engine is InnoDB
UPDATE: per your comment below, you need all the columns in the query (hence your SELECT *). If that's the case, you have a few options to maximize query performance:
create (or change) your clustered index to be on user_id, item_start, item_end. This will ensure that as few rows as possible are examined for each query. Per my original answer below, this approach may speed up this particular query but may slow down others, so you'll need to be careful.
if it's not practical to change your clustered index, you can create a non-clustered index on user_id, item_start, item_end and any other columns your query needs. This will slow down inserts somewhat, and will double the storage required for your table, but will speed up this particular query.
There are always other ways to increase performance (e.g. by reducing the size of each row) but the primary way is to decrease the number of rows which must be accessed and to increase the % of rows which are accessed sequentially rather than randomly. The indexing suggestions above do both.
ORIGINAL ANSWER BELOW:
Without knowing the exact schema or query plan, the main performance problem with this query is that SELECT * forces a lookup back to your clustered index for every row. If there are large numbers of matching rows for a particular user ID and if your clustered index's first column is not user_id, then this will likely be a very inefficient operation because your disk will be trying to fetch lots of randomly distributed rows from the clustered index.
In other words, even though filtering the rows you want is fast (because of your index), actually fetching the data is slower.
If, however, your clustered index is ordered by user_id, item_start, item_end then that should speed things up. Note that this is not a panacea, since if you have other queries which depend on a different ordering, or if you're inserting rows in a different order, you could end up slowing down other queries.
A less impactful solution would be to create a covering index which contains only the columns you want (also ordered by user_id, item_start, item_end, and then add the other columns you need). Then change your query to only pull back the columns you need, instead of using SELECT *.
If you could post more info about the DBMS brand and version, and the schema of your table, we can help with more details.
Do you need to SELECT *?
If not, you can create an index on user_id, item_start, item_end with the fields you need in the SELECT part as included columns. This all assumes you're using Microsoft SQL Server 2005+.

What are Covering Indexes and Covered Queries in SQL Server?

Can you explain the concepts of, and relationship between, Covering Indexes and Covered Queries in Microsoft's SQL Server?
A covering index is one which can satisfy all requested columns in a query without performing a further lookup into the clustered index.
There is no such thing as a covering query.
Have a look at this Simple-Talk article: Using Covering Indexes to Improve Query Performance.
If all the columns requested in the select list of a query are available in the index, then the query engine doesn't have to look up the table again, which can significantly increase the performance of the query. Since all the requested columns are available within the index, the index is covering the query. So, the query is called a covering query and the index is a covering index.
A clustered index can always cover a query, if the columns in the select list are from the same table.
The following links can be helpful, if you are new to index concepts:
Excellent video on advantages and disadvantages of index and covering
queries and indexes.
Indexes in SQL Server
A Covering Index is a Non-Clustered index. Both Clustered and Non-Clustered indexes use B-Tree data structure to improve the search for data, the difference is that in the leaves of a Clustered Index a whole record (i.e. row) is stored physically right there!, but this is not the case for Non-Clustered indexes. The following examples illustrate it:
Example: I have a table with three columns: ID, Fname and Lname.
However, for a Non-Clustered index, there are two possibilities: either the table already has a Clustered index or it doesn't:
As the two diagrams show, such Non-Clustered indexes do not provide good performance, because they cannot find the desired value (i.e. Lname) solely from the B-Tree. Instead they have to do an extra lookup step (either a Key or RID lookup) to find the value of Lname. And this is where the covering index comes into the picture. Here, the Non-Clustered index on ID covers the value of Lname right next to it in the leaves of the B-Tree, and there is no need for any type of lookup anymore.
A covered query is a query where all the columns in the query's result set are pulled from non-clustered indexes.
A query is made into a covered query by the judicious arrangement of indexes.
A covered query is often more performant than a non-covered query in part because non-clustered indexes have more rows per page than clustered indexes or heap indexes, so fewer pages need to be brought into memory in order to satisfy the query. They have more rows per page because only part of the table row is part of the index row.
A covering index is an index which is used in a covered query. There is no such thing as an index which, in and of itself, is a covering index. An index may be a covering index with respect to query A, while at the same time not being a covering index with respect to query B.
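The point that an index is only "covering" relative to a particular query can be sketched in T-SQL. The Person table below is hypothetical and borrows the ID / Fname / Lname columns from the earlier example; the column types and the index name IX_Person_Lname are assumptions.

```sql
-- Hypothetical table based on the ID / Fname / Lname example above.
CREATE TABLE Person (
    ID    INT          NOT NULL PRIMARY KEY,  -- clustered by default
    Fname NVARCHAR(50) NOT NULL,
    Lname NVARCHAR(50) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_Person_Lname ON Person (Lname);

-- Query A: covered. Lname is the index key, and ID is the clustering key,
-- which every nonclustered index row carries.
SELECT ID, Lname FROM Person WHERE Lname = N'Smith';

-- Query B: not covered by the same index. Fname is stored only in the
-- clustered index, so each match needs a key lookup.
SELECT ID, Fname, Lname FROM Person WHERE Lname = N'Smith';
```

The same index is a covering index for query A and not for query B; nothing about the index definition alone decides it.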
Here's an article in devx.com that says:
Creating a non-clustered index that contains all the columns used in a SQL query, a technique called index covering
I can only suppose that a covered query is a query that has an index that covers all the columns in its returned recordset. One caveat: the index and query would have to be built so as to allow the SQL server to actually infer from the query that the index is useful.
For example, a join of a table on itself might not benefit from such an index (depending on the intelligence of the SQL query execution planner):
PersonID ParentID Name
1 NULL Abe
2 NULL Bob
3 1 Carl
4 2 Dave
Let's assume there's an index on PersonID,ParentID,Name - this would be a covering index for a query like:
SELECT PersonID, ParentID, Name FROM MyTable
But a query like this:
SELECT MyTable.PersonID, MyTable.Name FROM MyTable LEFT JOIN MyTable T ON T.PersonID=MyTable.ParentID
Probably wouldn't benefit so much, even though all of the columns are in the index. Why? Because you're not really telling it that you want to use the triple index of PersonID,ParentID,Name.
Instead, you're building a condition based on two columns - PersonID and ParentID (which leaves out Name) and then you're asking for all the records, with the columns PersonID, Name. Actually, depending on implementation, the index might help the latter part. But for the first part, you're better off having other indexes.
A covering query is one where all the predicates can be matched using the indices on the underlying tables.
This is the first step towards improving the performance of the sql under consideration.
A covering index is one which provides every required column, so that SQL Server doesn't have to hop back to the clustered index to find any column. This is achieved using a non-clustered index and the INCLUDE option to cover columns.
Non-key columns can be included only in non-clustered indexes. Columns can’t be defined in both the key column and the INCLUDE list. Column names can’t be repeated in the INCLUDE list. Non-key columns can be dropped from a table only after the index that includes them is dropped first. Please see details here
When I simply recalled that a Clustered Index consists of a key-ordered non-heap list of ALL the columns in the defined table, the lights went on for me. The word "cluster", then, refers to the fact that there is a "cluster" of all the columns, like a cluster of fish in that "hot spot". If there is no index covering the column containing the sought value (the right side of the equation), then the execution plan uses a Clustered Index Seek into the Clustered Index's representation of the requested column because it does not find the requested column in any other "covering" index. The missing index will cause a Clustered Index Seek operator to appear in the proposed Execution Plan, where the sought value is within a column inside the ordered list represented by the Clustered Index.
So, one solution is to create a non-clustered index that has the column containing the requested value inside the index. In this way, there is no need to reference the Clustered Index, and the Optimizer should be able to hook that index in the Execution Plan with no hint. If, however, there is a Predicate naming the single column clustering key and an argument to a scalar value on the clustering key, the Clustered Index Seek Operator will still be used, even if there is already a covering index on a second column in the table without an index.
Page 178, High Performance MySQL, 3rd Edition
An index that contains (or "covers") all the data needed to satisfy a query is called a covering index.
When you issue a query that is covered by an index (an indexed-covered query), you'll see "Using Index" in the Extra column in EXPLAIN.

SQL Server Clustered Index - Order of Index Question

I have a table like so:
keyA keyB data
keyA and keyB together are unique, are the primary key of my table and make up a clustered index.
There are 5 possible values of keyB but an unlimited number of possible values of keyA. keyB generally increments.
For example, the following data can be ordered in 2 ways depending on which key column is ordered first:
keyA keyB data
A 1 X
B 1 X
A 3 X
B 3 X
A 5 X
B 5 X
A 7 X
B 7 X
or
keyA keyB data
A 1 X
A 3 X
A 5 X
A 7 X
B 1 X
B 3 X
B 5 X
B 7 X
Do I need to tell the clustered index which of the key columns has fewer possible values to allow it to order the data by that value first? Or does it not matter in terms of performance which is ordered first?
You should order your composite clustered index with the most selective column first. This means the column with the most distinct values compared to total row count.
"B*TREE Indexes improve the performance of queries that select a small percentage of rows from a table." http://www.akadia.com/services/ora_index_selectivity.html?
This article is for Oracle, but still relevant.
Also, if you have a query that runs constantly and returns few fields, you may consider creating a composite index that contains all the fields - it will not have to access the base table, but will instead pull data from the index.
ligget78's comment on making sure to mention the first column in a composite index is important to remember.
If you create an index (regardless of whether it is clustered or not) with (keyA, keyB) then this is how values will be ordered, i.e. first by keyA, then by keyB (this is the second case in your question). If you want it the other way around, you need to specify (keyB, keyA).
It could matter performance-wise, depending on your query of course. For example, if you have a (keyA, keyB) index and the query looks like WHERE keyB = ... (without mentioning keyA) then the index can't be used for a seek; at best it can be scanned.
As others have said, the ordering is based on how you specify it in the index creation script (or PK constraint). One thing about clustered indexes though is that there is a lot to keep in mind.
You may get better overall performance by using your clustered index on something other than the PK. For example, if you are writing a financial system and reports are almost always based on date and time of an activity (all activity for the past year, etc.) then a clustered index on that date column might be better. As HLGEM says, sorting can also be affected by your selection of clustered index.
Clustered indexes can also affect inserts more than other indexes. If you have a high volume of inserts and your clustered index is on something like an IDENTITY column then there could be contention problems for that particular part of the disk since all of the new rows are being inserted into the same place.
For small look-up tables I always just put the clustered index on the PK. For high-impact tables though it's a good idea to spend the time thinking about (and testing) various possible clustered indexes before choosing the best one.
I believe that SQL Server orders it exactly the way you tell it. It assumes that you know best how to access your index.
In any case, I would say it's a good idea where possible to specify what you want exactly rather than hoping the database will figure it out.
You can also try it both ways, run a bunch of representative queries and then compare the generated execution plans to determine which is best for you.
Remember that the clustered index is the physical order in which the table is stored on disk.
So if your clustered index is defined as ColA, ColB, queries will be faster when ordered in the same order as your clustered index. If SQL has to order by B, A it will require post-execution sorting to achieve the correct order.
My suggestion is to add a second non-clustered index on B, A. Also, depending on the size of your data column, consider adding it as an INCLUDE (included) column to prevent the need for key lookups. That is, of course, provided that this table is not heavily inserted into, as you always must balance query speed vs. write speed.
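The suggestion above might look like this in T-SQL. This is a sketch under assumptions: the question gives no column types or table name, so my_keyed_table, the types, and the constraint and index names are all invented for illustration.

```sql
-- Assumed schema: column types are not given in the question.
CREATE TABLE my_keyed_table (
    keyA CHAR(1)     NOT NULL,
    keyB INT         NOT NULL,
    data VARCHAR(50) NOT NULL,
    CONSTRAINT PK_my_keyed_table PRIMARY KEY CLUSTERED (keyA, keyB)
);

-- Second index in the opposite key order, carrying data at the leaf level
-- so that queries led by keyB do not need key lookups.
CREATE NONCLUSTERED INDEX IX_my_keyed_table_keyB_keyA
    ON my_keyed_table (keyB, keyA)
    INCLUDE (data);

-- A filter on keyB alone cannot seek on the (keyA, keyB) clustered index,
-- but it can seek on the index above.
SELECT keyA, keyB, data
FROM my_keyed_table
WHERE keyB = 3;
```

Because the index includes every column of this three-column table, it is effectively a second copy of the data, which is the write and storage cost the answer warns about.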
Realistically, your clustered index should represent the order in which the data is most likely to be accessed as well as maintaining a delicate balance of insert\update IO cost. If your clustered index is such that you are constantly inserting into the middle of pages, you may suffer performance losses there.
Like others have said, without knowing the table length, column sizes, etc. there is no correct answer. Trial and error with a heavy dose of testing is your best bet.
Just in case this isn't obvious: the sort order of your index does not promise much about the sort order of the results in a query.
In your queries, you must still add an
ORDER BY KeyA, KeyB
or
ORDER BY KeyB, KeyA
The optimizer may be pleased to find the data already physically ordered in the index as desired and save some time, but every query that is supposed to deliver data in a particular order must have an ORDER BY clause at the end of it. Without an order by, SQL Server makes no promises with respect to the order of a recordset, or even that it will come back in the same order from query to query.
The best thing you can do is to try both solutions and measure the execution time.
In my experience, index tuning is anything but an exact science.
Maybe having keyB before keyA in the index column order would be better
You specify the columns in the order in which you would normally want them sorted in reports and queries.
I would be wary of creating a multicolumn clustered index though. Depending on how wide this is, you could have a huge impact on the size of any other indexes you create because all non-clustered indexes contain the clustered index value in them. Also the rows have to be re-ordered if the values frequently change and it is my experience that non-surrogate keys tend to change more frequently. Therefore creating this as a clustered versus nonclustered index could be much more time consuming of server resources if you have values that are likely to change. I'm not saying you shouldn't do this as I don't know what type of data your columns actually contain (although I suspect they are more complex than A1, A2, etc.); I'm saying you need to think about the ramifications of doing it. It would probably be a good idea to thoroughly read BOL about clustered versus nonclustered indexes before committing to doing this.
Yes, you should suggest it. Normally the query engine tries to find the best execution plan and the index to use; however, sometimes it is better to force the query engine to use a specific index. There are some other considerations when planning an index as well as when using the index in your query, for example the column ordering in the index and the column ordering in the WHERE clause. You can refer to the following link to learn more:
http://ashishkhandelwal.arkutil.com/sql-server/quick-and-short-database-indexes/
Best Practices to use indexes
How to get best performance form indexes
Clustered index Considerations
Nonclustered Indexes Considerations
I am sure this will help you when planning for index.