What's the difference between these T-SQL queries (one uses INCLUDE)? [duplicate]

While studying for the 70-433 exam I noticed you can create a covering index in one of the following two ways.
CREATE INDEX idx1 ON MyTable (Col1, Col2, Col3)
-- OR --
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
The INCLUDE clause is new to me. Why would you use it and what guidelines would you suggest in determining whether to create a covering index with or without the INCLUDE clause?

Use INCLUDE when the column is not in the WHERE/JOIN/GROUP BY/ORDER BY, but only in the column list of the SELECT clause.
The INCLUDE clause adds the data at the lowest/leaf level, rather than in the index tree.
This keeps the upper levels of the index smaller because the included data is not part of the tree.
INCLUDE columns are not key columns in the index, so they are not ordered.
This means they aren't really useful for predicates, sorting, etc., as I mentioned above. However, they may be useful if you have a residual lookup in a few rows from the key column(s).
Another MSDN article with a worked example

You would use the INCLUDE to add one or more columns to the leaf level of a non-clustered index, if by doing so, you can "cover" your queries.
Imagine you need to query for an employee's ID, department ID, and lastname.
SELECT EmployeeID, DepartmentID, LastName
FROM Employee
WHERE DepartmentID = 5
If you happen to have a non-clustered index on (EmployeeID, DepartmentID), once you find the employees for a given department, you now have to do a "bookmark lookup" to get the actual full employee record, just to get the lastname column. That can get pretty expensive in terms of performance if you find a lot of employees.
If you had included that lastname in your index:
CREATE NONCLUSTERED INDEX NC_EmpDep
ON Employee(DepartmentID)
INCLUDE (Lastname, EmployeeID)
then all the information you need is available in the leaf level of the non-clustered index. Just by seeking in the non-clustered index and finding your employees for a given department, you have all the necessary information, and the bookmark lookup for each employee found in the index is no longer necessary --> you save a lot of time.
Obviously, you cannot include every column in every non-clustered index - but if you do have queries which are missing just one or two columns to be "covered" (and that get used a lot), it can be very helpful to INCLUDE those into a suitable non-clustered index.
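The effect described above can be reproduced in a small sketch. It uses Python's built-in sqlite3 module as a stand-in, which is an assumption of this example rather than anything from the answer: SQLite has no INCLUDE clause, so LastName is appended as a trailing key column instead, and EmployeeID rides along in every index because it is the rowid. The table contents and the index name NC_Dep are made up for illustration.

```python
import sqlite3


def plan(conn, sql):
    """Return the 'detail' strings of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    " EmployeeID INTEGER PRIMARY KEY,"
    " DepartmentID INTEGER, LastName TEXT, Notes TEXT)"
)
conn.executemany(
    "INSERT INTO Employee (DepartmentID, LastName, Notes) VALUES (?, ?, ?)",
    [(i % 10, f"Name{i}", "x" * 50) for i in range(1000)],
)

query = (
    "SELECT EmployeeID, DepartmentID, LastName "
    "FROM Employee WHERE DepartmentID = 5"
)

# Index on the search column only: every match needs a lookup into the table
# to fetch LastName.
conn.execute("CREATE INDEX NC_Dep ON Employee (DepartmentID)")
plan_lookup = plan(conn, query)

# Index that also carries LastName: the query is answered from the index alone.
conn.execute("DROP INDEX NC_Dep")
conn.execute("CREATE INDEX NC_EmpDep ON Employee (DepartmentID, LastName)")
plan_covered = plan(conn, query)

print(plan_lookup)
print(plan_covered)
```

Only the second plan should report a covering index. The exact plan wording depends on the SQLite version, and SQL Server's execution plans look different, so treat this as an illustration of the idea rather than of SQL Server's behaviour.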

This discussion misses the important point: the question is not whether the non-key columns are better added as index key columns or as included columns.
The question is how expensive it is to use the INCLUDE mechanism for columns that are not really needed in the index (typically not part of WHERE clauses, but often selected). So your dilemma is always:
Use index on id1, id2 ... idN alone or
Use index on id1, id2 ... idN plus include col1, col2 ... colN
Where:
id1, id2 ... idN are columns often used in restrictions and col1, col2 ... colN are columns often selected, but typically not used in restrictions
(The option to include all of these columns as part of the index key is just always silly, unless they are also used in restrictions, because it would always be more expensive to maintain: the index must be updated and re-sorted even when the "keys" have not changed.)
So use option 1 or 2?
Answer: If your table is rarely updated - mostly inserted into/deleted from - then it is relatively inexpensive to use the include-mechanism to include some "hot columns" (that are often used in selects - but not often used on restrictions) since inserts/deletes require the index to be updated/sorted anyway and thus little extra overhead is associated with storing off a few extra columns while already updating the index. The overhead is the extra memory and CPU used to store redundant info on the index.
If the columns you consider to add as included-columns are often updated (without the index-key-columns being updated) - or - if it is so many of them that the index becomes close to a copy of your table - use option 1 I'd suggest! Also if adding certain include-column(s) turns out to make no performance-difference - you might want to skip the idea of adding them:) Verify that they are useful!
The average number of rows per same values in keys (id1, id2 ... idN) can be of some importance as well.
Note that if an included column is used in a restriction, then as long as the index can be used at all (based on restrictions against the index key columns), SQL Server matches that restriction against the leaf-level values in the index instead of going the expensive way through the table itself.

Basic index columns are sorted, but included columns are not sorted. This saves resources in maintaining the index, while still making it possible to provide the data in the included columns to cover a query. So, if you want to cover queries, you can put the search criteria to locate rows into the sorted columns of the index, but then "include" additional, unsorted columns with non-search data. It definitely helps with reducing the amount of sorting and fragmentation in index maintenance.

One reason to prefer INCLUDE over key columns, if you don't need that column in the key, is documentation. That makes evolving indexes much easier in the future.
Considering your example:
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
That index is best if your query looks like this:
SELECT col2, col3
FROM MyTable
WHERE col1 = ...
Of course you should not put columns in INCLUDE if you can get an additional benefit from having them in the key part. Both of the following queries would actually prefer the col2 column in the key of the index.
SELECT col2, col3
FROM MyTable
WHERE col1 = ...
AND col2 = ...
SELECT TOP 1 col2, col3
FROM MyTable
WHERE col1 = ...
ORDER BY col2
Let's assume this is not the case and we have col2 in the INCLUDE clause because there is just no benefit of having it in the tree part of the index.
Fast forward some years.
You need to tune this query:
SELECT TOP 1 col2
FROM MyTable
WHERE col1 = ...
ORDER BY another_col
To optimize that query, the following index would be great:
CREATE INDEX idx1 ON MyTable (Col1, another_col) INCLUDE (Col2)
If you check what indexes you have on that table already, your previous index might still be there:
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
Now you know that Col2 and Col3 are not part of the index tree and are thus used neither to narrow the read index range nor to order the rows. It is rather safe to add another_col to the end of the key part of the index (after col1). There is little risk of breaking anything:
DROP INDEX idx1 ON MyTable;
CREATE INDEX idx1 ON MyTable (Col1, another_col) INCLUDE (Col2, Col3);
That index will become bigger, which still has some risks, but it is generally better to extend existing indexes compared to introducing new ones.
If you had an index without INCLUDE, you could not know which queries you would break by adding another_col right after Col1.
CREATE INDEX idx1 ON MyTable (Col1, Col2, Col3)
What happens if you add another_col between Col1 and Col2? Will other queries suffer?
There are other "benefits" of INCLUDE vs. key columns if you add those columns just to avoid fetching them from the table. However, I consider the documentation aspect the most important one.
To answer your question:
what guidelines would you suggest in determining whether to create a covering index with or without the INCLUDE clause?
If you add a column to the index for the sole purpose to have that column available in the index without visiting the table, put it into the INCLUDE clause.
If adding the column to the index key brings additional benefits (e.g. for order by or because it can narrow the read index range) add it to the key.
You can read a longer discussion about this here:
https://use-the-index-luke.com/blog/2019-04/include-columns-in-btree-indexes
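The "additional benefit" of a key column for ORDER BY can be made visible with a small sketch. It is written against Python's built-in sqlite3 module, which is an assumption of this example: SQLite has no INCLUDE clause and uses LIMIT 1 where T-SQL uses TOP 1, and the sample data is invented. The point is only that a column in the key part lets the engine read rows already in order, while a column outside the key forces a sort.

```python
import sqlite3


def plan(conn, sql):
    """Return the 'detail' strings of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (col1 INTEGER, col2 INTEGER, col3 TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [(i % 20, i, f"v{i}") for i in range(2000)],
)

query = "SELECT col2, col3 FROM MyTable WHERE col1 = 7 ORDER BY col2 LIMIT 1"

# col2 is not part of the key: matching rows must be sorted before LIMIT applies.
conn.execute("CREATE INDEX idx1 ON MyTable (col1)")
plan_without_key = plan(conn, query)

# col2 in the key: rows for col1 = 7 are already stored in col2 order.
conn.execute("DROP INDEX idx1")
conn.execute("CREATE INDEX idx1 ON MyTable (col1, col2)")
plan_with_key = plan(conn, query)

print(plan_without_key)
print(plan_with_key)
```

With the narrower index the plan should contain a temporary B-tree step for the ORDER BY; with col2 in the key that step disappears.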

The reasons why (including the data in the leaf level of the index) have been nicely explained. The reason that you give two shakes about this, is that when you run your query, if you don't have the additional columns included (new feature in SQL 2005) the SQL Server has to go to the clustered index to get the additional columns which takes more time, and adds more load to the SQL Server service, the disks, and the memory (buffer cache to be specific) as new data pages are loaded into memory, potentially pushing other more often needed data out of the buffer cache.

An additional consideration that I have not seen in the answers already given is that included columns can be of data types that are not allowed as index key columns, such as varchar(max).
This allows you to include such columns in a covering index. I recently had to do this to provide an NHibernate-generated query, which had a lot of columns in the SELECT, with a useful index.

There is a limit to the total size of all columns inlined into the index definition. That said though, I have never had to create index that wide.
To me, the bigger advantage is the fact that you can cover more queries with one index that has included columns, as they don't have to be defined in any particular order. Think about it as an index within the index.
One example would be the StoreID (where StoreID is low selectivity meaning that each store is associated with a lot of customers) and then customer demographics data (LastName, FirstName, DOB):
If you just inline those columns in this order (StoreID, LastName, FirstName, DOB), you can only efficiently search for customers for which you know StoreID and LastName.
On the other hand, defining the index on StoreID and including the LastName, FirstName and DOB columns would in essence let you filter in two steps: a seek predicate on StoreID and then a residual predicate on any of the included columns, evaluated within the index without touching the table. This would let you cover all possible search permutations as long as they start with StoreID.
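A rough sketch of that two-step filtering, using Python's built-in sqlite3 module as a stand-in (an assumption of this example, as are the table name Customer, the index name ix_store and the sample rows). SQLite has no INCLUDE clause, so the demographic columns are trailing key columns here; the query filters on StoreID and FirstName, skipping LastName.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer ("
    " CustomerID INTEGER PRIMARY KEY, StoreID INTEGER,"
    " LastName TEXT, FirstName TEXT, DOB TEXT, Notes TEXT)"
)
conn.executemany(
    "INSERT INTO Customer (StoreID, LastName, FirstName, DOB, Notes)"
    " VALUES (?, ?, ?, ?, ?)",
    [(i % 3, f"Last{i}", f"First{i % 50}", "1980-01-01", "x" * 40)
     for i in range(3000)],
)
conn.execute(
    "CREATE INDEX ix_store ON Customer (StoreID, LastName, FirstName, DOB)"
)

query = (
    "SELECT LastName, FirstName, DOB FROM Customer "
    "WHERE StoreID = 1 AND FirstName = 'First7'"
)
details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
matches = conn.execute(query).fetchall()

print(details)
print(len(matches))
```

The plan should show a search on StoreID only, served from the covering index: the FirstName condition is checked against the index entries inside that range rather than by a second seek.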

Related

Optimizing index update speed during bulk insert into table

When inserting large sets of data into a table (from another table, in no particular order), how do you optimize a multi-column index so that the index is updated in the fastest possible way?
Assume the index is never used in any SELECT, DELETE or UPDATE query.* Assume also that the distinct counts for the columns are as follows (for example):
COLUMN | DISTINCT COUNT
col1 | 634
col2 | 9,923
col3 | 2,357
col4 | 3
* Reason for not using the index in selecting data is this is a primary key index or a unique constraint index. The index is in place so that inserts violating the constraint should fail.
I have read that the most selective column should come first. Is that correct, and is the index then to be created as follows?
(col2, col3, col1, col4)
If that is wrong, how do you determine the best order for column in an index which will only see bulk INSERTs into the corresponding table? The goal is to speed up the updating of the index during the bulk INSERT.
The quickest way is to DROP INDEX, then do the bulk inserts and CREATE INDEX when you are done inserting.
The proper structure of the index does not have so much to do with the distribution of values in the columns as with the retrieval strategies, presumably for UPDATE and DELETE only, and then specifically when you filter on some but not always all columns of the index. The columns you filter on most frequently should come first in your index. But you probably want to reconsider your indexing strategy more radically if this is the case: it may be better to have two or more indexes to match your typical retrieval strategies.
Ignoring your call for ignorance: why would you not apply the index to SELECT statements? Indexes are useful only for selecting subsets of data from your tables, whether that is for SELECT or a qualified UPDATE or DELETE. There is no functional difference for using indexes in any of these three operations.
Addendum after comments from OP: Indexes are useful for many purposes but their maintenance is relatively expensive, where "relatively" becomes "impossibly" very quickly with increasing table size. In your case you have to compare every record from your source table with every record in your destination table, or O(m*n) order. That is unworkable with tables of a large size, even with an index. Your best bet is to drop the index, do the inserts, create an index which is not unique, find and delete all duplicates, drop the index, finally create a new unique index.
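The load-then-deduplicate sequence from the addendum could look roughly like this. The sketch uses Python's built-in sqlite3 module and an in-memory table, both assumptions of this example; the table name target, the index names and the generated rows are hypothetical, and a real bulk load would use the database's own bulk-loading facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (col1 INT, col2 INT, col3 INT, col4 INT)")

# 1. Bulk insert with no index in place (the data deliberately has duplicates).
rows = [(i % 5, i % 7, i % 3, i % 2) for i in range(1000)]
conn.executemany("INSERT INTO target VALUES (?, ?, ?, ?)", rows)

# 2. Temporary non-unique index to make the duplicate search cheaper.
conn.execute("CREATE INDEX tmp_dedupe ON target (col2, col3, col1, col4)")

# 3. Delete duplicates, keeping the first row of each key combination.
conn.execute(
    "DELETE FROM target WHERE rowid NOT IN ("
    " SELECT MIN(rowid) FROM target GROUP BY col2, col3, col1, col4)"
)

# 4. Swap the temporary index for the unique one.
conn.execute("DROP INDEX tmp_dedupe")
conn.execute("CREATE UNIQUE INDEX ux_target ON target (col2, col3, col1, col4)")

remaining = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]

# The constraint is enforced again: inserting an existing key now fails.
try:
    conn.execute("INSERT INTO target VALUES (0, 0, 0, 0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(remaining, duplicate_rejected)
```

Whether the temporary index in step 2 pays for itself depends on the data volume and the engine; it is included only because the addendum lists it.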
The order of the columns is not terribly important for uniqueness enforcement purposes. But it would be unusual for a unique index not to be useful for some queries as well, so I would order the columns to take advantage of that.
For bulk inserting into this index rapidly, I'd try inserting in index order. So add an order by (col2, col3, col1, col4) to the select part of your insert. This leads to more efficient IO.

SQL Server multiple index order optimization

I have a table with a nonclustered index1 on ID1 and ID2, in that order.
Select count(distinct(id1)) from table
returns 1
and Select count(distinct(id2)) from table has all the values of the table.
The queries against that table use ... where id1= XX and id2 = XX
Could switching the order of the fields in index1 improve performance?
I know it SHOULD be better, but maybe it makes no difference because id1 has only 1 value?
If I understand correctly, you are comparing these two statements:
where id1= XX and id2 = XX
where id2 = XX and id1= XX
Under most circumstances, this would use either an index on table(id1, id2) or table(id2, id1). The order of the comparisons in the where (or on) clauses has no impact on which indexes can be used.
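That the order of the comparisons does not matter is easy to check. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server, which is an assumption of this example, as are the table name t, the payload column and the literal values.

```python
import sqlite3


def plan(conn, sql):
    """Return the 'detail' strings of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id1 INTEGER, id2 INTEGER, payload TEXT)")
# id1 holds a single value, id2 is different on every row, as in the question.
conn.executemany("INSERT INTO t VALUES (1, ?, 'x')", [(i,) for i in range(1000)])
conn.execute("CREATE INDEX index1 ON t (id1, id2)")

plan_a = plan(conn, "SELECT payload FROM t WHERE id1 = 1 AND id2 = 42")
plan_b = plan(conn, "SELECT payload FROM t WHERE id2 = 42 AND id1 = 1")

print(plan_a)
print(plan_b)
```

Both statements should produce the same plan, an index search that uses both columns.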
Whether you should include a column that has only a single value in the unique index is a different matter. There is a minor performance effect to having a more complex index -- the tree structure has to store more bytes for each key. However, the query:
select count(distinct id2)
from table
where id1 = xx and id2 = xx
will actually run faster with a composite index than with a singleton index table(id2). The reason is that the composite index can be used to entirely satisfy the query (in the jargon, it is a "covering index for the query"). The singleton index would need to look up the value of id1 in the table data, which requires extra processing.
The order in which you define the columns in your index matters. If your column ID1 will always have only 1 value, then there is no point in putting it into the index, unless you are using it to make a non-clustered index covering (a non-clustered index being one that is not the physical ordering of the table itself). In general, the first column defined in your index should be the column with the most varying values that you need to search through. Visualize it this way: if you had a table of 1 million rows, and the first column in your index had only 1 (or a small number) of distinct values, would that index help you find the rows you want among the 1 million? Or would it be better to have ID2 first, which would be more efficient for the search and would be used more frequently? That is what you have to ask yourself. Below is also more info on your question.
SQL Server Clustered Index - Order of Index Question
If you are using a non-clustered index, it may appear not to make a difference if the first column in your index holds the same value everywhere. However, it does matter, because a non-clustered index is stored on a number of pages. The more entries you can store on a page, the faster you can search. If you include a column that adds no value to the search, the same index will span more pages, meaning more pages to flip through and longer lookups. It also means less room to add new entries to an existing page during inserts when the index is updated, causing more page splits. So there are side effects to the decision to add a column with only 1 value to the index. If you are using the column to "cover" retrieved values in common selects, you can instead use included columns in your index, which has the added benefit of not reordering your index while still acting like a covering index, if that was the original purpose of adding a column which only has 1 value.

Database covering a query [duplicate]

Trying to understand what "covering a query means" with a specific example
If I have a table with say 3 columns:
Col1 Col2 Col3
And I put an index on Col1 and Col2
is "covering a query" determined by the columns selected in the SELECT or the columns in the WHERE clause?
Thus :
1) select Col1, Col2 from MyTable where Col3=XXX
2) Select Col3 from MyTable where Col1=xxx and Col2=yyy
3) Select Col1, Col2 from MyTable where Col1=xxx and Col2=yyy
Which of these three are truly "Covered"?
Only the third example is covered. To be covered, a query must be fully satisfied from the index. Your first example produces results that are entirely within the index, but it needs information that is not part of the index to complete, and so is not covered. To match your first example, you need an index that lists Col3 first.
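The three queries can be checked against an index on (Col1, Col2) with a short sketch. It uses Python's built-in sqlite3 module as a stand-in engine, which is an assumption of this example, along with the sample rows and literal values; other engines word their plans differently.

```python
import sqlite3


def plan(conn, sql):
    """Return the 'detail' strings of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Col1 INTEGER, Col2 INTEGER, Col3 INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [(i % 10, i % 100, i) for i in range(1000)],
)
conn.execute("CREATE INDEX ix_MyTable ON MyTable (Col1, Col2)")

plans = {
    1: plan(conn, "SELECT Col1, Col2 FROM MyTable WHERE Col3 = 1"),
    2: plan(conn, "SELECT Col3 FROM MyTable WHERE Col1 = 1 AND Col2 = 2"),
    3: plan(conn, "SELECT Col1, Col2 FROM MyTable WHERE Col1 = 1 AND Col2 = 2"),
}
for number, details in plans.items():
    print(number, details)
```

Query 1 should fall back to a scan, query 2 should use the index but still visit the table for Col3, and only query 3 should be reported as using a covering index.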
One important feature of indexes is the ability to include a set of columns in the index without actually indexing those columns. So an index example for your table might look like this:
CREATE INDEX [ix_MyTable] ON [MyTable]
(
[Col1] ASC,
[Col2] ASC
)
INCLUDE ( [Col3])
Now samples 2 and 3 are both covered. Sample 1 is still not covered, because the index is still not useful for the WHERE clause.
Why INCLUDE Col3, rather than just listing it with the others? It's important to remember that as you add indexes or make them more complex, operations that change data using those indexes will require more and more work, because each change will also require updating the indexes. If you include a column in an index, without actually indexing it, an update to that column still needs to go back and update the index as well, so that the data in the index is accurate... but it doesn't also need to re-order the index based on the new value. So this saves some work for our database server. To put it another way, if a column will only be in the select list, and not in the where clause, you might get a small performance benefit by including it in an index to get the benefit of covering a query from the index, without actually indexing on the column.
It is not just the where clause and select clause. A group by clause also needs its columns to be covered by the index for it to be a covering index. Basically, to be a covering index, it needs to contain all the columns used in the query for a given table. However, if you don't include them in the right order, the index won't be used.
If the column order in the index is (col1, col2, col3), then the index can't be used for query one since you are selecting by col3. Think of it like a phone book sorted by last name, then first name, then middle initial. Finding everyone with a last name Smith is easy, finding everyone with the first name John isn't helped by the sorting, you have to read the whole phone book. Same for the index. Finding a col1 value is easy. Finding a col1 value and then col2 values is fine. Just finding col3 or just col2 is not helped by the index.
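The phone book analogy translates directly into a sketch. It uses Python's built-in sqlite3 module as a stand-in engine; the table name phone_book, its columns and the sample values are assumptions made up for this example. A plan that starts with SEARCH means the sort order of the index was used to jump to the matching range, while SCAN means everything had to be read.

```python
import sqlite3


def plan(conn, sql):
    """Return the 'detail' strings of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phone_book (last_name TEXT, first_name TEXT, middle_initial TEXT)"
)
conn.executemany(
    "INSERT INTO phone_book VALUES (?, ?, ?)",
    [(f"Last{i % 100}", f"First{i % 30}", "ABCDE"[i % 5]) for i in range(3000)],
)
conn.execute(
    "CREATE INDEX ix_book ON phone_book (last_name, first_name, middle_initial)"
)

select = "SELECT last_name, first_name, middle_initial FROM phone_book WHERE "
by_last = plan(conn, select + "last_name = 'Last7'")
by_last_first = plan(conn, select + "last_name = 'Last7' AND first_name = 'First7'")
by_first = plan(conn, select + "first_name = 'First7'")
by_initial = plan(conn, select + "middle_initial = 'A'")

for details in (by_last, by_last_first, by_first, by_initial):
    print(details)
```

The first two lookups start at the leading column and should be searches; the last two skip it and should degrade to scans.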

Decision when to create Index on table column in database?

I am not a DB guy, but I need to create tables and do CRUD operations on them. I am confused about whether I should create an index on all columns by default or not. Here is my understanding, which I rely on when creating an index.
An index basically contains the memory location range (from the starting memory location where the first value is stored to the end memory location where the last value is stored). So when we insert any value into the table, the index for the column needs to be updated as it has got one more value, but an update of a column value won't have any impact on the index value. Right? So the bottom line is: when my column is used in a join between two tables, we should consider creating an index on the column used in the join, but all other columns can be skipped, because if we create an index on them it will involve the extra cost of updating the index value when a new value is inserted in the column. Right?
Consider this scenario where table mytable contains three columns, i.e. col1, col2, col3. Now we fire this query
select col1,col2 from mytable
Now there are two cases here. In the first case we create an index on col1 and col2. In the second case we don't create any index. As per my understanding, case 1 will be faster than case 2 because in case 1 Oracle can quickly find the column memory location. So here I have not used any join columns, but the index is still helping. So should I consider creating an index here or not?
What if, in the same scenario, we fire
select * from mytable
instead of
select col1,col2 from mytable
Will index help here?
Don't create indexes on every column! It will slow things down on insert/delete/update operations.
As a simple reminder, you can create an index on columns that are common in WHERE, ORDER BY and GROUP BY clauses. You may consider adding an index on columns that are used to relate other tables (through a JOIN, for example).
Example:
SELECT col1,col2,col3 FROM my_table WHERE col2=1
Here, creating an index on col2 would help this query a lot.
Also, consider index selectivity. Simply put, create indexes on columns whose values have a "big domain", i.e. IDs, names, etc. Don't create them on Male/Female columns.
but update of column value wont have any impact on index value. Right?
No. Updating an indexed column will have an impact. The Oracle 11g performance manual states that:
UPDATE statements that modify indexed columns and INSERT and DELETE
statements that modify indexed tables take longer than if there were
no index. Such SQL statements must modify data in indexes and data in
tables. They also create additional undo and redo.
So bottom line is when my column is used in join between two tables we should consider creating index on column used in join but all other columns can be skipped because if we create index on them it will involve extra cost of updating index value when new value is inserted in column. Right?
Not just Inserts but any other Data Manipulation Language statement.
Consider this scenario . . . Will index help here?
With regards to this last paragraph, why not build some test cases with representative data volumes so that you prove or disprove your assumptions about which columns you should index?
In the specific scenario you give, there is no WHERE clause, so a table scan is going to be used or the index scan will be used, but you're only dropping one column, so the performance might not be that different. In the second scenario, the index shouldn't be used, since it isn't covering and there is no WHERE clause. If there were a WHERE clause, the index could allow the filtering to reduce the number of rows which need to be looked up to get the missing column.
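The two no-WHERE cases can be tried out in a sketch like the one below. It runs on Python's built-in sqlite3 module rather than Oracle, which is a loud assumption: each optimizer makes its own choice between a table scan and an index scan, so this only illustrates the reasoning, not what Oracle will do. The index name ix_c1c2 and the sample rows are invented.

```python
import sqlite3


def plan(conn, sql):
    """Return the 'detail' strings of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 INTEGER, col2 INTEGER, col3 TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(i, i * 2, "some wider text value " + str(i)) for i in range(1000)],
)
conn.execute("CREATE INDEX ix_c1c2 ON mytable (col1, col2)")

# Both requested columns live in the index, so it can be scanned instead of
# the wider table.
plan_two_columns = plan(conn, "SELECT col1, col2 FROM mytable")

# col3 is not in the index, so the index is not covering and cannot help.
plan_star = plan(conn, "SELECT * FROM mytable")

print(plan_two_columns)
print(plan_star)
```

In SQLite the first query is expected to scan the narrower covering index and the second to scan the table. As the answer says, with only one column dropped the difference may be small, so measuring with representative data is still the real test.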
Oracle has a number of different tables, including heap or index organized tables.
If an index is covering, it is more likely to be used, especially when selective. But note that an index organized table is not better than a covering index on a heap when there are constraints in the WHERE clause and far fewer columns in the covering index than in the base table.
Creating indexes with more columns than are actually used only helps if they are more likely to make the index covering, but adding all the columns would be similar to an index organized table. Note that Oracle does not have the equivalent of SQL Server's INCLUDE (COLUMN) which can be used to make indexes more covering (it's effectively making an additional clustered index of only a subset of the columns - useful if you want an index to be unique but also add some data which you don't want to be considered in the uniqueness but helps to make it covering for more queries)
You need to look at your plans and then determine if indexes will help things. And then look at the plans afterwards to see if they made a difference.

What is a Covered Index?

I've just heard the term covered index in some database discussion - what does it mean?
A covering index is an index that contains all of, and possibly more, the columns you need for your query.
For instance, this:
SELECT *
FROM tablename
WHERE criteria
will typically use indexes to speed up the resolution of which rows to retrieve using criteria, but then it will go to the full table to retrieve the rows.
However, if the index contained the columns column1, column2 and column3, then this sql:
SELECT column1, column2
FROM tablename
WHERE criteria
and, provided that particular index could be used to speed up the resolution of which rows to retrieve, the index already contains the values of the columns you're interested in, so it won't have to go to the table to retrieve the rows, but can produce the results directly from the index.
This can also be used if you see that a typical query uses 1-2 columns to resolve which rows, and then typically adds another 1-2 columns, it could be beneficial to append those extra columns (if they're the same all over) to the index, so that the query processor can get everything from the index itself.
Here's an article: Index Covering Boosts SQL Server Query Performance on the subject.
A covering index is just an ordinary index. It's called "covering" if it can satisfy the query without needing to read the table data.
example:
CREATE TABLE MyTable
(
ID INT IDENTITY PRIMARY KEY,
Foo INT
)
CREATE NONCLUSTERED INDEX index1 ON MyTable(ID, Foo)
SELECT ID, Foo FROM MyTable -- All requested data are covered by index
This is one of the fastest methods to retrieve data from SQL server.
Covering indexes are indexes which "cover" all columns needed from a specific table, removing the need to access the physical table at all for a given query/ operation.
Since the index contains the desired columns (or a superset of them), table access can be replaced with an index lookup or scan -- which is generally much faster.
Columns to cover:
parameterized or static conditions; columns restricted by a parameterized or constant condition.
join columns; columns dynamically used for joining
selected columns; to answer selected values.
While covering indexes can often provide good benefit for retrieval, they do add somewhat to insert/ update overhead; due to the need to write extra or larger index rows on every update.
Covering indexes for Joined Queries
Covering indexes are probably most valuable as a performance technique for joined queries. This is because joined queries are more costly and more likely than single-table retrievals to suffer high-cost performance problems.
in a joined query, covering indexes should be considered per-table.
each 'covering index' removes a physical table access from the plan & replaces it with index-only access.
investigate the plan costs & experiment with which tables are most worthwhile to replace by a covering index.
by this means, the multiplicative cost of large join plans can be significantly reduced.
For example:
select poi.title, c.name, c.address
from porderitem poi
join porder po on po.id = poi.fk_order
join customer c on c.id = po.fk_customer
where po.orderdate > ? and po.status = 'SHIPPING';
create index porder_custitem on porder (orderdate, id, status, fk_customer);
See:
http://literatejava.com/sql/covering-indexes-query-optimization/
Lets say you have a simple table with the below columns, you have only indexed Id here:
Id (Int), Telephone_Number (Int), Name (VARCHAR), Address (VARCHAR)
Imagine you have to run the query below and check whether it is using an index, and whether it performs efficiently without I/O calls. Remember, you have only created an index on Id.
SELECT Id FROM mytable WHERE Telephone_Number = '55442233';
When you check the performance of this query you will be disappointed: since Telephone_Number is not indexed, it needs to fetch rows from the table using I/O calls. So this is not a covering index, since there is a column in the query which is not indexed, which leads to frequent I/O calls.
To make it a covered index you need to create a composite index that contains both columns, with the filtered column first: (Telephone_Number, Id).
For more details, please refer to this blog:
https://www.percona.com/blog/2006/11/23/covering-index-and-prefix-indexes/