Do covering indices safely replace smaller covering indices? - sql

For example:
Given columns A,B,C,D,
IX_A is an index on 'A'
IX_AB is a covering index on 'AB'
IX_A can be safely removed, for it is redundant: IX_AB will be used in its place.
I want to know if this generalizes:
If I have:
IX_AB
IX_ABC
IX_ABCD
and so forth,
Can the lesser indices still be safely removed?
That is, does IX_ABC make IX_AB redundant, and does IX_ABCD make both IX_AB and IX_ABC redundant?

In general, and this varies from server to server, an index on several columns also serves lookups on a leading subset of those columns.
So if you have an index that covers a, b, c, that usually automatically gives you an index that covers a, and a, b.
You are not guaranteed to have, for example, a covering index of b, c.

Yes, for the most part.
However, IX_ABCD isn't terribly helpful as a replacement for, say, IX_BCD.
There is one caveat: index lookups may still require disk reads, so if C and D greatly increase the size of the index, looking up A,B in IX_ABCD will be somewhat less efficient than looking it up in IX_AB.
That difference is likely outweighed by the additional cost of maintaining IX_AB separately.

The important thing is the leading columns in the index. If you have the index IX_ABCD the following queries will use the index:
select * from table where A = 1
select * from table where A = 1 and B = 1
select * from table where A = 1 and B = 1 and C = 1
However, the following will most likely not use the index (at least not how you intended):
select * from table where B = 1
select * from table where C = 1
select * from table where B = 1 and C = 1
The important thing is that the leading columns are used. Therefore the order of the columns when the index is created does matter.
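One way to see the leading-column rule in action is to ask a database for its plan. The sketch below uses SQLite through Python's sqlite3 module purely because it runs anywhere; the table t, its extra column e, and the index name ix_abcd are made up for illustration, and other engines (SQL Server, PostgreSQL, Oracle) have their own plan output and may choose differently.

```python
import sqlite3

# In-memory database with a hypothetical table and one composite index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT, c INT, d INT, e INT)")
conn.execute("CREATE INDEX ix_abcd ON t (a, b, c, d)")

def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

leading = [
    "SELECT * FROM t WHERE a = 1",
    "SELECT * FROM t WHERE a = 1 AND b = 1",
    "SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1",
]
non_leading = [
    "SELECT * FROM t WHERE b = 1",
    "SELECT * FROM t WHERE c = 1",
    "SELECT * FROM t WHERE b = 1 AND c = 1",
]

for sql in leading + non_leading:
    print(f"{sql:50} -> {plan(sql)}")
```

In SQLite the first three plans are reported as a SEARCH using ix_abcd, while the last three fall back to a SCAN of the table. The exact wording of the plan text differs between SQLite versions.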

Not necessarily. While it is true that an index on (A, B, C) can be used for a filtering predicate on A, an ordering request on A, or a join condition on A, that does not necessarily mean that the index (A) alone is useless. If the index on (A, B, C) is considerably wider than (A), then a range scan on A alone will save significant I/O by using the narrower index, because it has to read fewer pages.
But I admit that this would be the exception rather than the rule. In general it is safe to remove an index on A if another one on (A, B) exists. Note that an index on (A,B) does not satisfy a filter on B alone, so removal is safe only if the leftmost column(s) are the same. Some databases have 'skip-scan' operators that can use an index on (A,B) for looking up B, but that is a very narrow edge case.

It is always best not to assume anything about database engine internals and to check the actual query plans being used.

Related

Composite Indexes, the “Include” Keyword, and How They Work

In SQL Server (and most other relational databases), a "Composite Index" is an index with multiple keys. Let's say we have this query that gets run a lot, and we want to create a covering index for this query to speed it up;
SELECT a, b FROM MyTable WHERE c = @val1 AND d = @val2
These are all possible composite indexes that would cover this query;
CREATE INDEX ix1 ON MyTable (c, d, a, b)
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)
CREATE INDEX ix3 ON MyTable (d) INCLUDE (a, b, c)
CREATE INDEX ix4 ON MyTable (c) INCLUDE (a, b, d)
But apparently, they don't perform equally. According to Erland Sommarskog (Microsoft MVP), the first two are faster than the 3rd and 4th, and the 4th is faster than the 3rd.
He goes on to explain;
ix2 is the "best" index, because a and b will not take up space in the higher levels of the index tree. Also, if a or b are updated, in ix2 there can be no page splits or similar as the index tree is unaffected.
However, I am having a hard time grasping what exactly is going on. I do have the general knowledge on b-tree indexes and how they work, but I don't understand the logic behind composite keys. For example;
CREATE INDEX ix1 ON MyTable (c, d, a, b)
Does the order of the columns here matter? If so, why? Also;
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)
What is the difference between this composite key and the one above? I don't understand what difference "INCLUDE" makes.
Note: I know there are a lot of posts on Composite Keys, but I believe my last two questions are specific enough to not be a duplicate.
Does the order of the columns here matter?
Considering only the query in your question with 2 equality predicates, the order of the composite index key columns doesn't matter as long as both are the leftmost key columns of the composite index. Any of the covering indexes below will optimize this query:
CREATE INDEX ix1 ON MyTable (c, d, a, b);
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b);
CREATE INDEX ix3 ON MyTable (d, c, a, b);
CREATE INDEX ix4 ON MyTable (d, c, b, a);
CREATE INDEX ix5 ON MyTable (d, c) INCLUDE (a, b);
That said, the stats histogram contains only the leftmost index key column so the general guidance is to specify the most selective column first to improve row count estimates and execution plan quality. This consideration is more important for non-trivial queries where the optimizer has many choices and row count estimates are an important factor in choosing the best plan.
Another consideration for key order, which may conflict with the above general guidance, is when the index supports different queries and only some of the key columns are specified (e.g. SELECT a, b FROM MyTable WHERE d = @val2;). In that case, it would be better to specify d as the leftmost column regardless of selectivity in order to allow a single index to optimize multiple queries instead of creating a separate index to optimize the second query.
What is the difference between this composite key and the one above? I don't understand what difference "INCLUDE" makes.
Included columns are not key columns. Key columns are maintained in logical order at every level throughout the b-tree whereas included columns are present only in the b-tree leaf nodes and not ordered. Consequently, the specified order of included columns does not matter. The only purpose of included columns is to help cover queries without adding them as key columns and incurring the associated overhead.
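The difference between key columns and included columns can be sketched with a toy model in Python. This is not how SQL Server stores an index; it only assumes that entries are kept sorted by the key (c, d) and that (a, b) ride along as unsorted payload, which is the property the explanation above relies on. The rows are invented.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows of MyTable as (a, b, c, d) tuples.
rows = [(10, 11, 1, 5), (20, 21, 2, 7), (30, 31, 1, 7), (40, 41, 2, 5), (50, 51, 1, 5)]

# Toy version of ix2: entries ordered by the key (c, d) only; (a, b) is payload.
entries = sorted((((c, d), (a, b)) for a, b, c, d in rows), key=lambda e: e[0])
keys = [key for key, _ in entries]

def seek(c, d):
    """Equality on both key columns: binary search, then read the payload."""
    lo, hi = bisect_left(keys, (c, d)), bisect_right(keys, (c, d))
    return [payload for _, payload in entries[lo:hi]]

def seek_prefix(c):
    """Equality on the leading key column only: still one contiguous range."""
    lo = bisect_left(keys, (c,))
    hi = bisect_right(keys, (c, float("inf")))
    return [payload for _, payload in entries[lo:hi]]

def filter_on_d(d):
    """No leading column: matches are scattered, so every entry is visited."""
    return [payload for (_, key_d), payload in entries if key_d == d]

print(seek(1, 5))       # both key columns
print(seek_prefix(1))   # leading column only
print(filter_on_d(5))   # second column only, full pass over the entries
```

Because (a, b) never take part in the ordering, swapping them in the payload changes nothing, which is the toy equivalent of "the specified order of included columns does not matter".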
CREATE INDEX ix1 ON MyTable (c, d, a, b)
Does the order of the columns here matter? If so, why? Also;
Yes, order is very important when creating an index, because each column (from the left) is the next level of sorting within the index. For the optimizer to seek on this index, the query must always filter on c, which is the "opener" of this set.
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)
What is the difference between this composite key and the one above? I don't understand what difference "INCLUDE" makes.
Keep in mind that every additional key column makes the index less efficient. So if you know that more than 80% of your queries will only seek by c and d and not by a and b, but you still need a and b in your SELECT (not in the WHERE), you should INCLUDE them, so that they are stored only in the leaf, at the last level of the index.
There are better explanations than mine so feel free to look at them:
INCLUDE equivalent in Oracle -> INCLUDE
How important is the order of columns in indexes? -> ORDER in INDEX set

Maximum number of useful indexes a table can have?

The Meeting
In a meeting last week the client was discussing how to make an important search page faster. The page searches on a single table (12 columns, 20 million rows) by asking for values (strings) on any field; it returns 50 rows (with pagination), starting with the specified criteria (each column can be ascending or descending). When the criteria don't match the existing indexes, the search becomes slow, and the client is not happy.
And then -- in the middle of the meeting -- the semi-technical analyst threw this one into the air: Why don't we create all possible indexes on the table to make everything fast?
I responded at once "No, there are too many and that would make the table really slow to modify, so we need to create a few cleverly chosen indexes to do it". We ended up creating the most useful ones, and the page is now much faster. Problem solved.
The Question
But still... I keep thinking about that question and I wanted to have a better understanding of it, so here it is:
In theory, how many possible useful indexes can I create on a table with N columns?
I think that by useful we should consider (I can be wrong):
Indexes not already covered by other ones: for example (a, b) should not be counted if (a, b, c) is included.
In order to show multiple rows (not just equality), ascending and descending indexes should be counted as separate ones when they are part of a composite index. That is: (a) serves the same purpose as (a DESC), but (a, b) serves a different purpose than (a DESC, b).
So, a table with a single column (a) can have only a single index:
(a)
With two columns (a, b) I can have four useful indexes:
1. (a, b)
2. (b, a)
3. (a DESC, b)
4. (b DESC, a)
The remaining combinations are not counted:
(a) -- already covered by #1
(b) -- already covered by #2
(a, b DESC) -- already covered by #3 (reading the index in reverse)
(b, a DESC) -- already covered by #4 (reading the index in reverse)
(a DESC, b DESC) -- already covered by #1 (reading the index in reverse)
(b DESC, a DESC) -- already covered by #2 (reading the index in reverse)
(a DESC) -- already covered by #3
(b DESC) -- already covered by #4
With three columns (a, b, c):
(a, b, c)
(a, c, b)
(b, c, a)
(b, a, c)
(c, a, b)
(c, b, a)
...
Let's say you have a table t with columns a, b, and c.
For the query
select a from t where b = 1 order by c;
the best index is on t(b,c,a), because you first look up values using b, then order results by c and have a in the results.
For this query:
select a from t where c = 1 order by b;
the best index is on t(c,b,a).
For this query:
select b from t where c = 1 order by a;
the best index is on t(c,a,b).
With more columns a query could look like this:
select a from t where b = 1 order by c, d, e;
and you'd best want an index on t(b,c,d,e,a).
While for
select a from t where b = 1 order by e, d, c;
you'd want an index on t(b,e,d,c,a).
So the maximal number of useful indexes for n columns is n!, i.e. all permutations.
This is for indexes on the plain columns alone. As Gordon Linoff has mentioned in the comments on your question, you may also want function indexes (e.g. on t(upper(a), lower(b))). The number of useful function indexes is theoretically unlimited. And yes, Gordon is also right about further index types.
So the final answer is that theoretically the number of useful indexes per table is unlimited.
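To put numbers on the permutation argument, here is a small Python sketch. The function names are made up; full_orderings simply enumerates column orders as the answer above does, and count_with_directions is my extrapolation of the asker's counting rule (an index read backwards serves the fully flipped ASC/DESC pattern, so only half of the direction patterns are distinct), not something the answers state.

```python
from itertools import permutations
from math import factorial

def full_orderings(columns):
    """All orderings of the given columns, each a candidate composite index."""
    return list(permutations(columns))

def count_with_directions(n):
    """Assumed rule: n! column orders times 2**(n-1) ASC/DESC patterns,
    because reading an index backwards covers the all-flipped pattern."""
    return factorial(n) * 2 ** (n - 1)

for cols in ("a", "ab", "abc"):
    n = len(cols)
    print(n, len(full_orderings(cols)), count_with_directions(n))

# Plain column orderings for the 12-column table from the question.
print(factorial(12))
```

The rule reproduces the asker's own counts (1 index for one column, 4 for two), and for twelve columns the plain orderings alone already number 479,001,600, which is why a few well-chosen indexes are the only practical option.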
All the other answers contain something valuable, but there is enough that I have to say about it to warrant a third one.
There is no exact answer to the question like you put it. In a way, it is like asking “What's the limit beyond which you would call a person crazy?” There is a large grey area.
My points are:
What would happen if you add too many indexes:
Modifying the table gets substantially slower. Even with few indexes, data manipulation will already become an order of magnitude slower. If you ever want to INSERT, UPDATE or DELETE, a table with all conceivable indexes would make such an operation glacially slow.
With many indexes, the query planner has to consider many different access paths, so planning the query will become slightly slower with any index you add. With very many indexes, it may well be that the planning overhead will make the query too slow even before the executor has started working.
What can you do to reduce the number of indexes needed:
Look at the operators. If the operators <, <=, >= and > are never used, there is no point in adding indexes with descending columns.
Remember that an index on (a, b, c) can also be used for a query that only uses a in its condition, so you don't need an extra index on (a).
What is a practical way forward for you?
I have two suggestions:
One way is to add a simple index on each of your twelve columns.
Twelve indexes are already quite a lot, but you are still not in the crazy range.
PostgreSQL can use these indexes efficiently in a query with conditions on more than one column, and even if none of the conditions alone would be selective enough to warrant an index scan.
This is because PostgreSQL has bitmap index scans. See this example from the documentation:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=25.08..60.21 rows=10 width=244)
   Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
   ->  BitmapAnd  (cost=25.08..25.08 rows=10 width=0)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
               Index Cond: (unique1 < 100)
         ->  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0)
               Index Cond: (unique2 > 9000)
Each index is scanned and a bitmap is formed that contains 1 for each row that matches the condition. Then the bitmaps are combined, and finally the rows are fetched from the table.
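The scan-combine-fetch sequence just described can be mimicked in a few lines of Python. This is only a toy model of the idea, not PostgreSQL's implementation: the 20-row table, the two sorted lists standing in for single-column indexes, and the integer used as a bitmap are all assumptions made for the sketch.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table: row id -> (unique1, unique2).
rows = [(i, (i * 7) % 20) for i in range(20)]

# One single-column "index" per column: sorted (value, row_id) pairs.
index1 = sorted((u1, rid) for rid, (u1, _) in enumerate(rows))
index2 = sorted((u2, rid) for rid, (_, u2) in enumerate(rows))

def to_bitmap(index_entries):
    """Set bit number row_id for every matching index entry."""
    bits = 0
    for _, rid in index_entries:
        bits |= 1 << rid
    return bits

# Bitmap index scans: unique1 < 8 and unique2 > 10.
bitmap1 = to_bitmap(index1[:bisect_left(index1, (8,))])
bitmap2 = to_bitmap(index2[bisect_right(index2, (10, len(rows))):])

# BitmapAnd, then the heap scan fetches only rows whose bit is set.
combined = bitmap1 & bitmap2
matches = [rows[rid] for rid in range(len(rows)) if (combined >> rid) & 1]
print(matches)
```

Neither condition needs its own composite index here: each single-column index contributes a bitmap, and the table is only touched for the rows that survive the AND.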
The other idea is to use a Bloom filter.
If the only operator in your conditions is =, you can
CREATE EXTENSION bloom;
and create a single index USING bloom over all table columns.
Such an index can be used for queries with any combination of columns in the WHERE clause. The downside is that it is a lossy index, so you will get false positive results that have to be fetched and filtered out.
It depends on your case, but this might be an elegant (and underestimated!) solution that balances query and update speed.
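What "lossy" means can be shown with a generic toy Bloom filter in Python. It is a sketch of the data structure in general and makes no claim about how the PostgreSQL bloom extension lays out its index; the class, its sizes, and the sample values are all made up.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: a bit array plus several hash positions per item."""

    def __init__(self, size_bits=64, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True can be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

stored = [f"value-{i}" for i in range(10)]
bloom = BloomFilter()
for value in stored:
    bloom.add(value)

absent = [f"other-{i}" for i in range(1000)]
false_positives = [v for v in absent if bloom.might_contain(v)]
print(all(bloom.might_contain(v) for v in stored), len(false_positives))
```

Stored values are never missed, but some absent values can pass the filter, which is why candidate rows still have to be fetched and rechecked against the real condition.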
In theory, how many possible useful indexes can I create on a table with N columns?
Rather than answering this question theoretically, a practical answer is much better.
The first point to note is that all sequential searches should be avoided (unless the table is very small). By "very small", I mean, just a few rows (say, max 10). (However, even in such a table, a primary key is encouraged, to enforce uniqueness. This would, of course, be implemented as an index.)
Therefore, if the client has a valid search path, an index is required. If an existing index serves the purpose, that's OK; else, in all probability, an additional index is needed.
One transaction table in one application in my experience had 8 indexes. The client insisted on certain search paths, and so we had no choice but to provide them. Of course, we informed the client that updates would slow down, but the client found that acceptable. In reality, the slowdown during updates wasn't appreciable.
So that is the approach suggested - warn the client accordingly.
It is important to verify, during design, that a SQL statement uses indexed search paths (for every accessed table), rather than searching sequentially. Oracle has a tool for this, called EXPLAIN PLAN. Other databases should have similar tools.

Does an extra non indexed argument with a clustered index in a where clause make it any less efficient?

I have a table with a clustered index (let's say on columns a, b, c). Would it be just as fast if my where clause looked like
WHERE a = x
AND b = y
AND c = z
vs
WHERE a = x
AND b = y
AND c = z
AND d = w
where d is a column in that table that is not indexed?
The two queries would, in fact, have remarkably similar performance under most circumstances. Both would have to scan all rows in the index with the matching a, b, and c values. Typically, both queries would also have to scan the associated data pages as well.
I can readily think of two things that affect performance, and they could go either way. If the first query only selects those three columns, then the clustered index is a covering index for the query, meaning that the data pages don't need to be accessed. Then, adding the condition on d might slow the query down because of the extra access to data page(s).
Second, the result set of the second query is (presumably) smaller than that of the first. This could speed up the query, particularly if other processing (say group by or order by) is involved.

Is an index on A, B redundant if there is an index on A, B, C?

Having years of experience as a DBA, I do believe I know the answer to the question, but I figured it never hurts to check my bases.
Using SQL Server, assuming I have a table which has an index on column A and column B, and a second index on columns A, B, and C, would it be safe to drop the first index, as the second index basically would satisfy queries that would benefit from the first index?
It depends, but the answer is often 'Yes, you could drop the index on (A,B)'.
The counter-case (where you would not drop the index on (A,B)) is when the index on (A,B) is a unique index that is enforcing a constraint; then you do not want to drop the index on (A,B). The index on (A,B,C) could also be unique, but the uniqueness is redundant because the (A,B) combination is unique because of the other index.
But in the absence of such unusual cases (for example, if both (A,B) and (A,B,C) allow duplicate entries), then the (A,B) index is logically redundant. However, if the column C is 'wide' (a CHAR(100) column perhaps), whereas A and B are small (say INTEGER), then the (A,B) index is more efficient than the (A,B,C) index because you can get more information read per page of the (A,B) index. So, even though (A,B) is redundant, it may be worth keeping. You also need to consider the volatility of the table; if the table seldom changes, the extra indexes don't matter much; if the table changes a lot, extra indexes slow up modifications to the table. Whether that's significant is difficult to guess; you probably need to do the performance measurements.
The first index covers queries that look up on A or on A,B, and the second index can be used to cover queries that look up on A, on A,B, or on A,B,C, which is clearly a superset of the first case.
If C is very wide however the index on A,B may still be useful as it can satisfy certain queries with fewer reads.
e.g. if C was a char(800) column the following query may benefit significantly from having the narrower index available.
SELECT a,b
FROM YourTable
ORDER BY a,b
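A back-of-the-envelope calculation shows why the narrow index can matter for a scan like this. The Python sketch below assumes 8 KB pages (the page size SQL Server uses), 4-byte integers for a and b, a char(800) for c, and a hypothetical one million rows; it deliberately ignores per-row and per-page overhead and the row locator, so treat the numbers as rough orders of magnitude only.

```python
from math import ceil

PAGE_BYTES = 8192      # assumed page size, overhead ignored
INT_BYTES = 4          # assumed size of columns a and b
CHAR_C_BYTES = 800     # c is char(800) as in the example above
ROWS = 1_000_000       # hypothetical table size

def leaf_pages(entry_bytes, rows=ROWS, page_bytes=PAGE_BYTES):
    """Pages needed if entries are packed back to back on each page."""
    entries_per_page = page_bytes // entry_bytes
    return ceil(rows / entries_per_page)

narrow = leaf_pages(2 * INT_BYTES)                 # index on (a, b)
wide = leaf_pages(2 * INT_BYTES + CHAR_C_BYTES)    # index on (a, b, c)
print(narrow, wide, round(wide / narrow))
```

Under these assumptions the ordered scan reads about 977 pages from the (a, b) index versus 100,000 pages from the (a, b, c) index, roughly a hundredfold difference.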
Yes, this is a common optimization. Any query that would benefit from the index on A,B can also benefit just as well from the index on A,B,C.
In the MySQL community, there's even a tool to search your whole schema for redundant indexes: http://www.percona.com/doc/percona-toolkit/pt-duplicate-key-checker.html
The possible exception case would be if the index on A,B were more compact and used much more frequently, and you wanted to control which index was kept loaded in memory.
Much of what I was thinking was already written by Jonathan in a previous answer: uniqueness and faster reads. There is one other thing I think he missed.
If the first index is defined as A desc, B asc and the second as A asc, B asc, C asc, then dropping the first index isn't really the way to go, because the second one isn't a superset of the first, and a query cannot benefit from the second index if it orders rows the way the first one does.
With the first index you can order by A desc, B asc (of course) and by A asc, B desc (reading the index backwards), and a query can also use just a leading part of that index, as in ORDER BY A desc.
But a query with ORDER BY A asc, B asc will not be 'covered' by the first index.
To sum up: you can usually drop the first index, but that depends on your table configuration and your queries (and, of course, your indexes).
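The ordering argument can be checked with a tiny Python model. It assumes only that an index can be walked forwards or backwards in its stored order; the four (A, B) rows are invented, and real optimizers may have further tricks (such as sorting only part of the result).

```python
# Hypothetical (A, B) values with ties in A so that direction matters.
rows = [(1, 1), (1, 2), (2, 1), (2, 2)]

def ordered(rows, a_desc, b_desc):
    """Rows sorted by A then B, each ascending or descending."""
    def key(row):
        a, b = row
        return (-a if a_desc else a, -b if b_desc else b)
    return sorted(rows, key=key)

# Stored order of the first index: A desc, B asc.
index_order = ordered(rows, a_desc=True, b_desc=False)

def index_serves(a_desc, b_desc):
    """True if a forward or backward walk of the index yields this order."""
    wanted = ordered(rows, a_desc, b_desc)
    return wanted == index_order or wanted == index_order[::-1]

for a_desc, b_desc in [(True, False), (False, True), (False, False), (True, True)]:
    label = f"A {'desc' if a_desc else 'asc'}, B {'desc' if b_desc else 'asc'}"
    print(f"{label:16} -> {index_serves(a_desc, b_desc)}")
```

Only the index's own order and its exact mirror image come out without a sort; A asc, B asc needs an index declared that way, such as the leading columns of the second index.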
I typically find this kind of "almost" identical index on tables that contain historical data. If column C is a date or integer column, be careful: it is most likely used to satisfy a MAX lookup, as in WHERE tblA.C = MAX(tblB.C), which skips the table altogether and uses an index-only access path.

is a db index composite by default?

when I create an index on a db2, for example with the following code:
CREATE INDEX T_IDX ON T(
A,
B)
is it a composite index?
if not: how can I then create a composite index?
if yes: in order to have two different index should I create them separately as:
CREATE INDEX T1_IDX ON T(A)
CREATE INDEX T2_IDX ON T(B)
EDIT: this discussion is not going in the direction I expected (but in a better one :)). I actually asked how, and not why, to create separate indexes; I planned to ask that in a different question, but since you anticipated me:
suppose I have a table T(A,B,C) and a search function search() that selects from the table using any of the following conditions
WHERE A = x
WHERE B = x
WHERE C = x
WHERE A = x AND B=y (and so on AC, CB, ABC)
if I create a composite index ABC, is it going to work, for example, when I select on just C?
the table is quite big, and inserts/updates are not so frequent
Yep multiple fields on create index = composite by definition: Specify two or more column names to create a composite index.
Understanding when to use composite indexes appears to be your last question...
If all columns selected by a query are in a composite index, then the database engine can return these values from the index without accessing the table, so lookups are faster.
However if one or the other are used in queries, then creating individual indexes will serve you best. It depends on the types of queries executed and what values they contain/filter/join.
If you sometimes have one, the other, or both, then creating all 3 indexes is a possibility as well. But keep in mind each additional index increases the amount of time it takes to insert, update or delete, so on heavily modified tables, more indexes are generally bad since the overhead of maintaining the indexes affects performance.
The index on A, B is a composite index, and can be used to seek on just A or a seek on A with B or for a general scan, of course.
There is usually not much of a point in having an index on A, B and an index on just A, since a partial search on A, B can be used if you only have A. That wider index will be a little less efficient, however, so if the A lookup is extremely frequent and the write requirements mean that it is acceptable to update the extra index, it could be justifiable.
Having an index on B may be necessary, since the A, B index is not very suitable for searches based on B only.
First Answer: YES
CREATE INDEX JOB_BY_DPT
ON EMPLOYEE (WORKDEPT, JOB)
Second Answer:
It depends on your query; if most of the time your query references a single column in the where clause, like select * from T where A = 'something', then a single-column index would be what you want, but if both columns A and B get referenced then you should go for a composite one.
For further reference please check
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/r0000919.htm