index with multiple columns - ok when doing query on only one column? - sql

If I have a table
create table sv ( id integer, data text )
and an index:
create index myindex_idx on sv (id, data)
would this still be useful if I did this query:
select * from sv where id = 10
My reason for asking is that I'm looking through a set of tables without any indexes and seeing different combinations of select queries. Some use just one column, others use more than one. Do I need indexes for both sets, or is an all-inclusive index OK?
I am adding the indexes for faster lookups than full table scans.
Example (based on the answer by Matt Huggins):
select * from table where col1 = 10
select * from table where col1 = 10 and col2=12
select * from table where col1 = 10 and col2=12 and col3 = 16
could all be covered by an index on table (col1, col2, col3), but
select * from table where col2=12
would need another index?

It should be useful, since an index on (id, data) indexes first by id, then by data.
If you query by id, this index will be used.
If you query by id and data, this index will be used.
If you query by data alone, this index will NOT be used.
Edit: when I say it's "useful", I mean it's useful in terms of query speed/optimization. As Sune Rievers pointed out, it will not mean you will get a unique record given just ID (unless you specify ID as unique in your table definition).
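If queries that filter on data alone also matter, you would need a separate index. A minimal sketch (the index name is illustrative):
create index myindex_data_idx on sv (data);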

Oracle supports a number of ways of using an index, and you ought to start by understanding all of them so have a quick read here: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/optimops.htm#sthref973
Your query select * from table where col2=12 could usefully leverage an index skip scan if the leading column is of very low cardinality, or a fast full index scan if it is not. These would probably be fine for running reports, however for an OLTP query it is likely that you would do better to create an index with col2 as the leading column.
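For the OLTP case, a sketch of an index with col2 as the leading column (table and index names are illustrative; col2 alone may be enough if the other columns are not needed for the lookup):
create index ix_mytable_col2_lead on mytable (col2, col1, col3);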

I assume id is the primary key. There is no point in adding the primary key to the index, as it is already unique, and adding something else to a unique column keeps the combination unique.
Add a unique index on data if you really need it; otherwise just rely on id as the table's uniqueness.
If id is not your primary key, then you are not guaranteed to get a unique result from your query.
Regarding your last example with the lookup on col2, I think you would need another index. Indexes are not a cure-all for performance problems, though; sometimes your database design or your queries need to be optimized, for instance by rewriting them as stored procedures (Oracle supports these via PL/SQL).

If the driver behind your question is that you have a table with several columns and any combination of these columns may be used in a query, then you should look at BITMAP indexes.
Looking at your example:
select * from mytable where col1 = 10 and col2=12 and col3 = 16
You could create 3 bitmap indexes:
create bitmap index ix_mytable_col1 on mytable(col1);
create bitmap index ix_mytable_col2 on mytable(col2);
create bitmap index ix_mytable_col3 on mytable(col3);
These bitmap indexes have the great benefit that they can be combined as required.
So, each of the following queries would use one or more of the indexes:
select * from mytable where col1 = 10;
select * from mytable where col2 = 10 and col3 = 16;
select * from mytable where col3 = 16;
So, bitmap indexes may be an option for you. However, as David Aldridge pointed out, depending on your particular data set a single index on (col1,col2,col3) might be preferable. As ever, it depends. Take a look at your data, the likely queries against that data, and make sure your statistics are up to date.
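Spelled out, that single-index alternative would be (index name is illustrative):
create index ix_mytable_c123 on mytable (col1, col2, col3);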
Hope this helps.

Related

db2 10.5 multi-column index explanation

This is my first time working with indexes in a database, and so far I've learned that if you have a multi-column index such as index('col1', 'col2', 'col3'), and you run a query that uses where col2='col2' and col3='col3', that index would not be used.
I've also learned that indexing a column with very low selectivity is useless.
However, from my tests, it seems neither of the above is true at all. Can someone explain this in more detail?
I have a table with more than 16 million records. Let's say claimID is the primary key, there is a historynumber column that has only 3 distinct values (1, 2, 3), and there is a storeNumber column with about 1 million distinct values.
I have an index on claimID alone, another index (historynumber, claimID), another index (historynumber, storeNumber), and finally an index (storeNumber, historynumber).
My guess was that if I do:
select * from my_table where claimId='123456' and historynumber = 1
would be much faster than
select * from my_table where historynumber = 1 and claimId = '123456'
However, the two have exactly the same performance (instant). So I thought the primary key index can work regardless of column order. Therefore, I tried the same thing with historynumber and storeNumber instead. The result was exactly the same. Then I started trying columns that have no indexes, and of course the result was the same as well.
Finally, I do a
select * from my_table where historynumber = 1
and the query takes so long I had to cancel it.
So my conclusion is that the column order in the where clause is completely irrelevant, and so is the column order in the index definition, since it seems like the database is smart enough to tell which column is the most selective.
Could someone give me an example that could prove otherwise?
Index explanation is a huge topic.
Don't worry about the order of the predicates in the SQL - it has no effect whether you specify
...where claimId='123456' and historynumber = 1
or the other way round. Each SQL statement is checked and optimized by the optimizer. To see how the data actually gets accessed, you can run EXPLAIN. Check the documentation for more details.
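For example, a hedged sketch for DB2 LUW (this assumes the explain tables have already been created, e.g. from EXPLAIN.DDL):
EXPLAIN PLAN FOR
SELECT * FROM my_table WHERE claimId = '123456' AND historynumber = 1;
-- then format the captured plan from the command line, for example:
--   db2exfmt -d <your_database> -1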
For your other problem
select * from my_table where historynumber = 1
with an index of (storeNumber, historynumber).
Have you ever tried to look up the name of a caller (having only the telephone number) in a telephone book?
Well, it is pretty much the same for an index - so the column order when creating the index matters!
There are techniques which could help - e.g. an index jump scan - but there is no guarantee.
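A hedged sketch: an index that leads with historynumber (index name is illustrative) would at least let that query be answered from an index range scan, although with only three distinct values the optimizer may still decide a table scan is cheaper:
CREATE INDEX ix_mytable_histnum ON my_table (historynumber);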
Check out following sites to learn a little bit more about DB2 indexes:
http://db2commerce.com/2013/09/19/db2-luw-basics-indexes/
http://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys

What's the difference between these T-SQL queries (one uses INCLUDE)? [duplicate]

While studying for the 70-433 exam I noticed you can create a covering index in one of the following two ways.
CREATE INDEX idx1 ON MyTable (Col1, Col2, Col3)
-- OR --
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
The INCLUDE clause is new to me. Why would you use it and what guidelines would you suggest in determining whether to create a covering index with or without the INCLUDE clause?
You use INCLUDE when the column is not in the WHERE/JOIN/GROUP BY/ORDER BY clause, but only in the column list of the SELECT clause.
The INCLUDE clause adds the data at the lowest/leaf level, rather than in the index tree.
This makes the index smaller, because the included data is not part of the tree.
INCLUDE columns are not key columns in the index, so they are not ordered.
This means it isn't really useful for predicates, sorting etc. as I mentioned above. However, it may be useful if you have a residual lookup on a few rows found via the key column(s).
You would use the INCLUDE to add one or more columns to the leaf level of a non-clustered index, if by doing so, you can "cover" your queries.
Imagine you need to query for an employee's ID, department ID, and lastname.
SELECT EmployeeID, DepartmentID, LastName
FROM Employee
WHERE DepartmentID = 5
If you happen to have a non-clustered index on (EmployeeID, DepartmentID), once you find the employees for a given department, you now have to do "bookmark lookup" to get the actual full employee record, just to get the lastname column. That can get pretty expensive in terms of performance, if you find a lot of employees.
If you had included that lastname in your index:
CREATE NONCLUSTERED INDEX NC_EmpDep
ON Employee(DepartmentID)
INCLUDE (Lastname, EmployeeID)
then all the information you need is available in the leaf level of the non-clustered index. Just by seeking in the non-clustered index and finding your employees for a given department, you have all the necessary information, and the bookmark lookup for each employee found in the index is no longer necessary --> you save a lot of time.
Obviously, you cannot include every column in every non-clustered index - but if you do have queries which are missing just one or two columns to be "covered" (and that get used a lot), it can be very helpful to INCLUDE those into a suitable non-clustered index.
This discussion is missing an important point: the question is not whether the "non-key columns" are better added as index columns or as included columns.
The question is how expensive it is to use the INCLUDE mechanism to add columns that are not really needed in the index (typically not part of WHERE clauses, but often included in SELECTs). So your dilemma is always:
Use index on id1, id2 ... idN alone or
Use index on id1, id2 ... idN plus include col1, col2 ... colN
Where:
id1, id2 ... idN are columns often used in restrictions and col1, col2 ... colN are columns often selected, but typically not used in restrictions
(The option to include all of these columns as part of the index key is almost always silly - unless they are also used in restrictions - because it would be more expensive to maintain: the index must be updated and re-sorted even when the "keys" have not changed.)
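As a rough sketch of the two options (table and index names here are placeholders, not from the question):
-- Option 1: key columns only
CREATE INDEX ix_orders_ids ON Orders (id1, id2);
-- Option 2: same key, plus often-selected columns stored only at the leaf level
CREATE INDEX ix_orders_ids_incl ON Orders (id1, id2) INCLUDE (col1, col2);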
So use option 1 or 2?
Answer: If your table is rarely updated - mostly inserted into/deleted from - then it is relatively inexpensive to use the INCLUDE mechanism for some "hot columns" (those often used in selects but not often used in restrictions), since inserts/deletes require the index to be updated and sorted anyway, so there is little extra overhead in storing a few extra columns while the index is being maintained. The overhead is the extra memory and CPU used to store redundant information in the index.
If the columns you are considering as included columns are updated often (without the index key columns being updated), or if there are so many of them that the index becomes close to a copy of your table, I'd suggest option 1! Also, if adding certain include columns turns out to make no performance difference, you might want to skip the idea of adding them. Verify that they are useful!
The average number of rows per same values in keys (id1, id2 ... idN) can be of some importance as well.
Notice that if a column that was added as an included column is used in a restriction: as long as the index as such can be used (based on the restriction against the index key columns), SQL Server matches the column restriction against the index (leaf node values) instead of taking the expensive route through the table itself.
Basic index columns are sorted, but included columns are not sorted. This saves resources in maintaining the index, while still making it possible to provide the data in the included columns to cover a query. So, if you want to cover queries, you can put the search criteria to locate rows into the sorted columns of the index, but then "include" additional, unsorted columns with non-search data. It definitely helps with reducing the amount of sorting and fragmentation in index maintenance.
One reason to prefer INCLUDE over key columns, if you don't need that column in the key, is documentation. It makes evolving indexes much easier in the future.
Considering your example:
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
That index is best if your query looks like this:
SELECT col2, col3
FROM MyTable
WHERE col1 = ...
Of course you should not put columns in INCLUDE if you can get an additional benefit from having them in the key part. Both of the following queries would actually prefer the col2 column in the key of the index.
SELECT col2, col3
FROM MyTable
WHERE col1 = ...
AND col2 = ...
SELECT TOP 1 col2, col3
FROM MyTable
WHERE col1 = ...
ORDER BY col2
Let's assume this is not the case and we have col2 in the INCLUDE clause because there is just no benefit of having it in the tree part of the index.
Fast forward some years.
You need to tune this query:
SELECT TOP 1 col2
FROM MyTable
WHERE col1 = ...
ORDER BY another_col
To optimize that query, the following index would be great:
CREATE INDEX idx1 ON MyTable (Col1, another_col) INCLUDE (Col2)
If you check what indexes you have on that table already, your previous index might still be there:
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
Now you know that Col2 and Col3 are not part of the index tree and are thus not used to narrow the read index range nor to order the rows. It is rather safe to add another_col to the end of the key part of the index (after Col1). There is little risk of breaking anything:
DROP INDEX idx1 ON MyTable;
CREATE INDEX idx1 ON MyTable (Col1, another_col) INCLUDE (Col2, Col3);
That index will become bigger, which still carries some risk, but it is generally better to extend existing indexes than to introduce new ones.
If you had an index without INCLUDE, you could not know which queries you would break by adding another_col right after Col1.
CREATE INDEX idx1 ON MyTable (Col1, Col2, Col3)
What happens if you add another_col between Col1 and Col2? Will other queries suffer?
There are other "benefits" of INCLUDE vs. key columns if you add those columns just to avoid fetching them from the table. However, I consider the documentation aspect the most important one.
To answer your question:
what guidelines would you suggest in determining whether to create a covering index with or without the INCLUDE clause?
If you add a column to the index for the sole purpose to have that column available in the index without visiting the table, put it into the INCLUDE clause.
If adding the column to the index key brings additional benefits (e.g. for order by or because it can narrow the read index range) add it to the key.
You can read a longer discussion about this here:
https://use-the-index-luke.com/blog/2019-04/include-columns-in-btree-indexes
The reasons why (including the data at the leaf level of the index) have been nicely explained. The reason you should care is this: when you run your query, if you don't have the additional columns included (a feature new in SQL Server 2005), SQL Server has to go to the clustered index to get the additional columns, which takes more time and adds more load to the SQL Server service, the disks, and the memory (the buffer cache, to be specific), as new data pages are loaded into memory, potentially pushing other, more frequently needed data out of the buffer cache.
An additional consideration that I have not seen in the answers already given is that included columns can be of data types that are not allowed as index key columns, such as varchar(max).
This allows you to include such columns in a covering index. I recently had to do this to provide an nHibernate-generated query, which had a lot of columns in the SELECT, with a useful index.
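A sketch of that situation (table and column names are assumptions, not from the original query):
CREATE NONCLUSTERED INDEX IX_Documents_DocType
ON dbo.Documents (DocType)
INCLUDE (Body); -- Body is varchar(max), which is not allowed as a key column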
There is a limit to the total size of all columns inlined into the index definition. That said though, I have never had to create index that wide.
To me, the bigger advantage is that you can cover more queries with one index that has included columns, as they don't have to be defined in any particular order. Think of it as an index within the index.
One example would be StoreID (where StoreID has low selectivity, meaning each store is associated with a lot of customers) plus customer demographic data (LastName, FirstName, DOB):
If you just put those columns in the key in this order (StoreID, LastName, FirstName, DOB), you can only efficiently search for customers for whom you know both StoreID and LastName.
On the other hand, defining the index on StoreID and including the LastName, FirstName, DOB columns lets you, in essence, filter twice: a seek predicate on StoreID, and then a residual predicate on any of the included columns. This lets you cover all possible search permutations, as long as they start with StoreID.
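A sketch of the included-column version described above (object names are illustrative):
CREATE NONCLUSTERED INDEX IX_Customer_StoreID
ON dbo.Customer (StoreID)
INCLUDE (LastName, FirstName, DOB);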

Oracle: Full text search with condition

I've created an Oracle Text index like the following:
create index my_idx on my_table (text) indextype is ctxsys.context;
And I can then do the following:
select * from my_table where contains(text, '%blah%') > 0;
But lets say we have a have another column in this table, say group_id, and I wanted to do the following query instead:
select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;
With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.
Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:
create index my_idx on my_table (group_id, text) indextype is ctxsys.context;
Kind of like a normal index, so a separate text search can be done for each group_id.
Is there a way to do something like this in Oracle (I'm using 10g if that is important)?
Edit (clarification)
Consider a table with one million rows and, among others, two numeric columns, A and B. Let's say there are 500 different values of A and 2000 different values of B, and each combination of A and B is unique.
Now let's consider select ... where A = x and B = y
Separate indexes on A and B would, as far as I can tell, do an index search on B, which will return 500 different rows, and then do a join/scan over those rows. In any case, at least 500 rows have to be looked at (unless the database gets lucky and finds the required row early).
Whereas an index on (A,B) is much more effective, it finds the one row in one index search.
Putting separate indexes on group_id and the text, I feel, leaves the query optimizer with only three options.
(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.
Whereas I want:
(4) Use the (group_id, "text") index to find the text index under the particular group_id and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).
Oracle Text
1 - You can improve performance by creating the CONTEXT index with FILTER BY:
create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;
In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.
2 - CTXCAT indexes use "sub-indexes", and seem to work similar to a multi-column index. This seems to be the option (4) you're looking for:
begin
ctx_ddl.create_index_set('my_table_index_set');
ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/
create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
parameters('index set my_table_index_set');
select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0
This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.
(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't guarantee this isn't an inappropriate use of these indexes, as @NullUserException said.)
Multi-column indexes vs. index joins
For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.
However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.
I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.
Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.
Short version: There's no need to do that. The query optimizer is smart enough to decide the best way to select your data. Just create a btree index on group_id, i.e.:
CREATE INDEX my_group_idx ON my_table (group_id);
Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.
DESC my_table;
Name      Null      Type
--------  --------  ---------
ID        NOT NULL  NUMBER(4)
GROUP_ID            NUMBER(4)
TEXT                CLOB
There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:
EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);
Here's how many rows each group_id has and the corresponding percentage:
GROUP_ID   COUNT   PCT
--------   -----   ---
       1       1     1
       2       2     1
       3       4     3
       4       8     6
       5      16    12
       6      32    24
       7      64    47
       8       9     7
Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:
SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;
You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
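If you want to inspect those plans yourself, one common way in Oracle is:
EXPLAIN PLAN FOR SELECT * FROM my_table WHERE group_id = 7;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);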
Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).
SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';
Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:
SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;
Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:
SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';
Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.
I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?
This inline view gets you the PK values of the rows in group 43:
(
select T.pkcol
from T
where group = 43
)
If group has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:
select * from T
inner join
(
select T.pkcol
from T
where group = 43
) as MyGroup
on T.pkcol = MyGroup.pkcol
where contains(text, '%blah%') > 0
Hopefully the optimizer would be able to use the PK index to optimize the join and then apply the contains predicate only to the group 43 rows.

Need to index column in AND statement?

I have to do a SELECT on a table like this:
id
username
speed
is_running
The statement is like:
SELECT *
FROM mytable
WHERE username = 'foo'
AND is_running = 1
I have an index on "username". If I'm running the above statement, do I need to also index "is_running" for best performance? Or does only the first column of the select make a difference? I'm using mysql 5.0.
It depends on the type of data you're storing. If it's bool, you may not see a gain from an index on that column alone. You may want to try to add a composite index on the two columns:
ALTER TABLE mytable ADD INDEX `IDX_USERNAME_IS_RUNNING` ( `username` , `is_running` );
It will ultimately depend on the amount of data in the table whether you require the index. In many cases, the engine might just do a table scan and skip your index altogether if it thinks that is faster. Do you have 100 users, or 100,000 users?
On a bit/bool column you are not going to utilize a ton of storage space for the index, so it probably won't hurt unless you have a really high insertion rate.
If you have a table with 1 million users and only 1 or 2 are running at any one time - sure, index by is_running and it will give you fantastic performance. This specific use case would do best with 2 separate single-column indexes - one on username, one on is_running. The reason for 2 indexes is that if you ask for is_running = 0, the username index is used instead.
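A sketch of those two single-column indexes (index names are illustrative):
CREATE INDEX idx_username ON mytable (username);
CREATE INDEX idx_is_running ON mytable (is_running);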
Otherwise, there is practically no chance of a composite index on (username, is_running) helping anything. Stick to a single index on username.
Finally, do you really need the whole record (SELECT *)? In some scenarios close to the tipping point (when MySQL thinks the index plus lookups becomes less efficient than a straight scan), you can make this query run faster than the original. Have an index on (username, id):
SELECT mytable.*
FROM (
SELECT id
FROM mytable
WHERE username = 'foo'
AND is_running = 1
) X
INNER JOIN mytable on mytable.id = X.id
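The supporting index mentioned above could be created like this (index name is illustrative):
CREATE INDEX idx_username_id ON mytable (username, id);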

Does SQLite multi column primary key need an additional index?

If I create a table like so:
CREATE TABLE something (column1, column2, PRIMARY KEY (column1, column2));
Neither column1 nor column2 are unique by themselves. However, I will do most of my queries on column1.
Does the multi column primary key create an index for both columns separately? I would think that if you specify a multi column primary key it would index them together, but I really don't know.
Would there be any performance benefit to adding a UNIQUE INDEX on column1?
There will probably not be a performance benefit, because queries against col1=xxx and col2=yyy would use the same index as queries like col1=zzz with no mention of col2. But my experience is only with Oracle, SQL Server, Ingres, and MySQL. I don't know for sure.
You certainly don't want to add a unique index on column 1 as you just stated:
Neither column1 nor column2 are unique by themselves.
If column1 comes first, it will be first in the multi-column index in most databases, and thus it is likely to be used. The second column is the one that might not use the index. I wouldn't add one on the second column unless you see problems, and again, I would add a plain index, not a unique index, based on the comment you wrote above.
But SQLite must have some way of showing what it is using, like most other databases, right? Set the PK and see if queries using just column1 are using it.
I stumbled across this question while researching this same question, so figured I'd share my findings. Note that all of the below is tested on SQLite 3.39.4. I make no guarantees about how it will hold up on old/future versions. That said, SQLite is not exactly known for radically changing behavior at random.
To give a concrete answer for SQLite specifically: an index on column1 would provide no benefits, but an index on column2 would.
Let's look at a simple SQL script:
CREATE TABLE tbl (
column1 TEXT NOT NULL,
column2 TEXT NOT NULL,
val INTEGER NOT NULL,
PRIMARY KEY (column1, column2)
);
-- Uncomment to make the final SELECT fast
-- CREATE INDEX column2_ix ON tbl (column2);
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column1 = 'column1' AND column2 = 'column2';
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column1 = 'column1';
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column2 = 'column2';
EXPLAIN QUERY PLAN is SQLite's method of allowing you to inspect what its query planner is actually going to do.
You can execute the script via something like:
$ sqlite3 :memory: < sample.sql
This gives the output
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=? AND column2=?)
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=?)
QUERY PLAN
`--SCAN tbl
So the first two queries, the ones which SELECT on (column1, column2) and (column1), will use the index to perform the search, which should be nice and fast.
Note that the last query, the SELECT on (column2) has different output, though. It says it's going to SCAN the table -- that is, go through each row one by one. This will be significantly less performant.
What happens if we uncomment the CREATE INDEX in the above script? This will give the output
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=? AND column2=?)
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=?)
QUERY PLAN
`--SEARCH tbl USING INDEX column2_ix (column2=?)
Now the query on column2 will also use an index, and should be just as performant as the others.