Suppose I create a compound primary key on col1, col2, col3. Will an index be created on each of the columns?
I know that the primary key constraint will create an index on the combination (col1, col2, col3), so that searches on all three columns are faster. But I'm not sure whether the database will also create an index on each individual column, so that a search on a single column, such as col2, is sped up.
Can anybody tell me what happens to these columns in terms of indexing?
It depends. Generally a single index is created, so only queries that filter on the leftmost columns can make use of it; in your example, "SELECT * from T where COL2='x'" will not use the index. That said, Oracle can perform an index skip scan, where the first column is effectively skipped over and the plan probes the index on the next column.
Your best approach would be to look at the manual and try it in the DBMS of your choice then look at the query plan.
Summary: it depends.
Related
I have a table mytable of 5 million records and a query that looks like
select *
from mytable
where column1 = 'value1' and column2 = 'value2' and column3 = 'value3'
So I thought about creating an index on the 3 columns, but my problem is that I have no obvious column to put in the first position of the index, because no column is really more discriminating than the others.
Therefore I would like to build something similar to a hash table, with a hash code based on these 3 columns. I tried a function-based index on the concatenation of those 3 columns, but it took so long to build that it never finished, and I believe it's the wrong way to achieve what I want. What is the correct way to achieve this?
Just create an index with three columns:
create index idx_mytable_col1_col2_col3 on mytable(column1, column2, column3)
You have equality comparisons. The order of the columns in the index does not matter in this case.
Let the database do the work for you.
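The claim about equality comparisons can be checked with a small sketch. The thread does not name the asker's DBMS, so this uses Python's built-in sqlite3 module as a stand-in; the table layout, the payload column and the index name idx_demo are assumptions made for illustration. It only shows that the planner reports an index search for either column order and says nothing about relative speed on 5 million rows.

```python
import sqlite3

QUERY = (
    "SELECT * FROM mytable "
    "WHERE column1 = 'value1' AND column2 = 'value2' AND column3 = 'value3'"
)


def plan_for(column_order):
    """Build a fresh in-memory table with one index and return the plan text."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout; 'payload' stands in for the other columns.
    conn.execute(
        "CREATE TABLE mytable "
        "(column1 TEXT, column2 TEXT, column3 TEXT, payload TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_demo ON mytable ({', '.join(column_order)})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    # The last field of each plan row is the human-readable detail string.
    return [row[-1] for row in rows]


plan_123 = plan_for(["column1", "column2", "column3"])
plan_312 = plan_for(["column3", "column1", "column2"])
print(plan_123)
print(plan_312)
```

In both cases the plan should name idx_demo, because every indexed column has an equality predicate. Other engines report plans differently, so check the plan in the DBMS you actually use.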
ASE's indexes are generally stored as b-trees, and while there's some hashing 'magic' that takes place during index searching, there's still a bit of traversal/searching required. If the first column of an index is not very selective, you can see some degradation in index search performance compared to an index where the more selective column(s) is listed first. The difference in performance depends on the selectivity of the column(s) in question and the sheer size of the index (i.e., the number of index levels and pages that have to be read/processed).
If you're running ASE 15.0.3+, and you're running ASE on Linux, you may want to take a look at virtually-hashed tables. In a nutshell ... ASE stores the PK index as a hash instead of the normal b-tree, with the net result being that index search times are reduced. There are quite a few requirements/restrictions on virtually-hashed tables, so I suggest you take a look at your Transact-SQL User's Guide for more details.
Obviously (?) there's a good bit more to table/index design than can be discussed here; certainly not something that can be addressed by looking at a single, generic query. ("Duh, Mark!" ?)
I've been asked to create an index on a database table. I've got very little info about the underlying database model, but that there shouldn't be any duplicates.
My questions are:
In this regard, does "no duplicates" mean the same as a unique index?
This index is to be on two columns; how can I easily check for duplicate values?
Any help will be greatly appreciated.
You would check for duplicates by doing a group by and checking for counts greater than 1:
select col1, col2, count(*)
from your_table
group by col1, col2
having count(*) > 1;
If you have a requirement that you cannot have more than one row with the same values in a set of columns (i.e. no duplicates), that's the same as saying those columns must be unique across all the rows.
If you want to enforce uniqueness, you would be best off creating a unique constraint (yes, a unique index will enforce the uniqueness, but creating a unique constraint on top of the index is better practice. The more information you can give to the optimizer, the greater the chance it will pick good execution paths for the queries!).
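As a rough illustration of the duplicate check and of why it has to come first, here is a sketch using Python's built-in sqlite3 module. The sample rows and the index name ux_demo are made up for the example, and other databases report the failure with their own error types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO your_table (col1, col2) VALUES (?, ?)",
    [(1, "a"), (1, "a"), (1, "b"), (2, "a")],  # (1, 'a') appears twice
)

# The duplicate check from the answer: groups that contain more than one row.
duplicates = conn.execute(
    "SELECT col1, col2, COUNT(*) FROM your_table "
    "GROUP BY col1, col2 HAVING COUNT(*) > 1"
).fetchall()
print(duplicates)

# While duplicates exist, SQLite refuses to build the unique index.
try:
    conn.execute("CREATE UNIQUE INDEX ux_demo ON your_table (col1, col2)")
    index_created = True
except sqlite3.DatabaseError:
    index_created = False
print(index_created)
```

Once the duplicate rows are cleaned up, the same CREATE UNIQUE INDEX statement goes through and later inserts of a repeated pair are rejected.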
While studying for the 70-433 exam I noticed you can create a covering index in one of the following two ways.
CREATE INDEX idx1 ON MyTable (Col1, Col2, Col3)
-- OR --
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
The INCLUDE clause is new to me. Why would you use it and what guidelines would you suggest in determining whether to create a covering index with or without the INCLUDE clause?
Use INCLUDE when the column does not appear in the WHERE/JOIN/GROUP BY/ORDER BY, but only in the column list of the SELECT clause.
The INCLUDE clause adds the data at the lowest/leaf level, rather than in the index tree.
This makes the index smaller because the included data is not part of the tree.
INCLUDE columns are not key columns in the index, so they are not ordered.
This means an included column isn't really useful for predicates, sorting, etc., as I mentioned above. However, it may be useful if you have a residual lookup in a few rows from the key column(s).
Another MSDN article with a worked example
You would use the INCLUDE to add one or more columns to the leaf level of a non-clustered index, if by doing so, you can "cover" your queries.
Imagine you need to query for an employee's ID, department ID, and lastname.
SELECT EmployeeID, DepartmentID, LastName
FROM Employee
WHERE DepartmentID = 5
If you happen to have a non-clustered index on (EmployeeID, DepartmentID), once you find the employees for a given department, you now have to do a "bookmark lookup" to get the actual full employee record, just to get the lastname column. That can get pretty expensive in terms of performance if you find a lot of employees.
If you had included that lastname in your index:
CREATE NONCLUSTERED INDEX NC_EmpDep
ON Employee(DepartmentID)
INCLUDE (Lastname, EmployeeID)
then all the information you need is available in the leaf level of the non-clustered index. Just by seeking in the non-clustered index and finding your employees for a given department, you have all the necessary information, and the bookmark lookup for each employee found in the index is no longer necessary --> you save a lot of time.
Obviously, you cannot include every column in every non-clustered index - but if you do have queries which are missing just one or two columns to be "covered" (and that get used a lot), it can be very helpful to INCLUDE those into a suitable non-clustered index.
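The effect of covering a query can be observed in a query plan. The sketch below uses Python's built-in sqlite3 module, which is an assumption of convenience: SQLite has no INCLUDE clause, so the extra column is added as a trailing key column instead, which is not the same thing as the SQL Server feature discussed here. The FirstName column and the index names ix_dep_only and ix_dep_cover are hypothetical.

```python
import sqlite3

QUERY = (
    "SELECT EmployeeID, DepartmentID, LastName "
    "FROM Employee WHERE DepartmentID = 5"
)


def plan(conn):
    """Return the detail strings from SQLite's EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "EmployeeID INTEGER PRIMARY KEY, DepartmentID INTEGER, "
    "LastName TEXT, FirstName TEXT)"
)

# Index on the search column only: LastName must be fetched from the table.
conn.execute("CREATE INDEX ix_dep_only ON Employee (DepartmentID)")
plan_narrow = plan(conn)

# SQLite has no INCLUDE, so LastName is appended to the key for this demo.
# EmployeeID is the rowid here and is stored in every index entry anyway.
conn.execute("DROP INDEX ix_dep_only")
conn.execute("CREATE INDEX ix_dep_cover ON Employee (DepartmentID, LastName)")
plan_cover = plan(conn)

print(plan_narrow)
print(plan_cover)
```

The second plan should mention a covering index, meaning the query is answered from the index alone. In SQL Server the equivalent check is whether the key lookup operator disappears from the execution plan.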
This discussion is missing out on the important point: The question is not if the "non-key-columns" are better to include as index-columns or as included-columns.
The question is how expensive it is to use the include-mechanism to include columns that are not really needed in index? (typically not part of where-clauses, but often included in selects). So your dilemma is always:
Use index on id1, id2 ... idN alone or
Use index on id1, id2 ... idN plus include col1, col2 ... colN
Where:
id1, id2 ... idN are columns often used in restrictions and col1, col2 ... colN are columns often selected, but typically not used in restrictions
(The option to include all of these columns as part of the index key is just always silly, unless they are also used in restrictions, because it would always be more expensive to maintain: the index must be updated and sorted even when the "keys" have not changed.)
So use option 1 or 2?
Answer: If your table is rarely updated - mostly inserted into/deleted from - then it is relatively inexpensive to use the include-mechanism to include some "hot columns" (that are often used in selects - but not often used on restrictions) since inserts/deletes require the index to be updated/sorted anyway and thus little extra overhead is associated with storing off a few extra columns while already updating the index. The overhead is the extra memory and CPU used to store redundant info on the index.
If the columns you consider to add as included-columns are often updated (without the index-key-columns being updated) - or - if it is so many of them that the index becomes close to a copy of your table - use option 1 I'd suggest! Also if adding certain include-column(s) turns out to make no performance-difference - you might want to skip the idea of adding them:) Verify that they are useful!
The average number of rows per same values in keys (id1, id2 ... idN) can be of some importance as well.
Notice that if a column - that is added as an included-column of index - is used in the restriction: As long as the index as such can be used (based on restriction against index-key-columns) - then SQL Server is matching the column-restriction against the index (leaf-node-values) instead of going the expensive way around the table itself.
Basic index columns are sorted, but included columns are not sorted. This saves resources in maintaining the index, while still making it possible to provide the data in the included columns to cover a query. So, if you want to cover queries, you can put the search criteria to locate rows into the sorted columns of the index, but then "include" additional, unsorted columns with non-search data. It definitely helps with reducing the amount of sorting and fragmentation in index maintenance.
One reason to prefer INCLUDE over key columns, if you don't need that column in the key, is documentation. That makes evolving indexes much easier in the future.
Considering your example:
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
That index is best if your query looks like this:
SELECT col2, col3
FROM MyTable
WHERE col1 = ...
Of course you should not put columns in INCLUDE if you can get an additional benefit from having them in the key part. Both of the following queries would actually prefer the col2 column in the key of the index.
SELECT col2, col3
FROM MyTable
WHERE col1 = ...
AND col2 = ...
SELECT TOP 1 col2, col3
FROM MyTable
WHERE col1 = ...
ORDER BY col2
Let's assume this is not the case and we have col2 in the INCLUDE clause because there is just no benefit of having it in the tree part of the index.
Fast forward some years.
You need to tune this query:
SELECT TOP 1 col2
FROM MyTable
WHERE col1 = ...
ORDER BY another_col
To optimize that query, the following index would be great:
CREATE INDEX idx1 ON MyTable (Col1, another_col) INCLUDE (Col2)
If you check what indexes you have on that table already, your previous index might still be there:
CREATE INDEX idx1 ON MyTable (Col1) INCLUDE (Col2, Col3)
Now you know that Col2 and Col3 are not part of the index tree and are thus not used to narrow the read index range nor for ordering the rows. It is rather safe to add another_col to the end of the key part of the index (after Col1). There is little risk of breaking anything:
DROP INDEX idx1 ON MyTable;
CREATE INDEX idx1 ON MyTable (Col1, another_col) INCLUDE (Col2, Col3);
That index will become bigger, which still has some risks, but it is generally better to extend existing indexes compared to introducing new ones.
If you had an index without INCLUDE, you could not know which queries you would break by adding another_col right after Col1.
CREATE INDEX idx1 ON MyTable (Col1, Col2, Col3)
What happens if you add another_col between Col1 and Col2? Will other queries suffer?
There are other "benefits" of INCLUDE vs. key columns if you add those columns just to avoid fetching them from the table. However, I consider the documentation aspect the most important one.
To answer your question:
what guidelines would you suggest in determining whether to create a covering index with or without the INCLUDE clause?
If you add a column to the index for the sole purpose to have that column available in the index without visiting the table, put it into the INCLUDE clause.
If adding the column to the index key brings additional benefits (e.g. for order by or because it can narrow the read index range) add it to the key.
You can read a longer discussion about this here:
https://use-the-index-luke.com/blog/2019-04/include-columns-in-btree-indexes
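The "additional benefit" of a key column for ORDER BY can be made visible in a plan. This is a sketch with Python's built-in sqlite3 module, chosen only because it runs anywhere; SQLite uses LIMIT rather than TOP and has no INCLUDE clause, so the comparison is between col2 being absent from the index and col2 being a key column. Column types are assumed.

```python
import sqlite3

# SQLite spelling of: SELECT TOP 1 col2 FROM MyTable WHERE col1 = ... ORDER BY col2
QUERY = "SELECT col2 FROM MyTable WHERE col1 = 1 ORDER BY col2 LIMIT 1"


def plan_with_index(index_columns):
    """Create MyTable with a single index idx1 and return the plan details."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
    conn.execute(f"CREATE INDEX idx1 ON MyTable ({index_columns})")
    details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    conn.close()
    return details


plan_col1_only = plan_with_index("col1")
plan_col1_col2 = plan_with_index("col1, col2")
print(plan_col1_only)
print(plan_col1_col2)
```

With col2 outside the key, the plan should contain an extra sort step (SQLite words it as a temp b-tree for ORDER BY); with col2 in the key, the rows come out of the index already ordered and that step is gone.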
The reasons why (including the data in the leaf level of the index) have been nicely explained. The reason that you give two shakes about this, is that when you run your query, if you don't have the additional columns included (new feature in SQL 2005) the SQL Server has to go to the clustered index to get the additional columns which takes more time, and adds more load to the SQL Server service, the disks, and the memory (buffer cache to be specific) as new data pages are loaded into memory, potentially pushing other more often needed data out of the buffer cache.
An additional consideration that I have not seen in the answers already given is that included columns can be of data types that are not allowed as index key columns, such as varchar(max).
This allows you to include such columns in a covering index. I recently had to do this to provide an NHibernate-generated query, which had a lot of columns in the SELECT, with a useful index.
There is a limit to the total size of all columns inlined into the index definition. That said, I have never had to create an index that wide.
To me, the bigger advantage is the fact that you can cover more queries with one index that has included columns, as they don't have to be defined in any particular order. Think about it as an index within the index.
One example would be the StoreID (where StoreID is low selectivity meaning that each store is associated with a lot of customers) and then customer demographics data (LastName, FirstName, DOB):
If you just inline those columns in this order (StoreID, LastName, FirstName, DOB), you can only efficiently search for customers for which you know StoreID and LastName.
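On the other hand, defining the index on StoreID and including the LastName, FirstName, DOB columns would let you seek on StoreID and then filter on any of the included columns inside the index. Because included columns are not ordered, that second step is a residual predicate over the rows for that store rather than a true seek, but it still avoids going to the base table. This would let you cover all possible search permutations as long as they start with StoreID.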
When inserting large sets of data into a table (from another table, in no particular order), how do you optimize a multi-column index so that the index is updated in the fastest possible way?
Assume the index is never used in any SELECT, DELETE or UPDATE query.* Assume also that the distinct counts for the columns are as follows (for example):
COLUMN | DISTINCT COUNT
col1 | 634
col2 | 9,923
col3 | 2,357
col4 | 3
* Reason for not using the index in selecting data is this is a primary key index or a unique constraint index. The index is in place so that inserts violating the constraint should fail.
I have read that the most selective column should come first. Is that correct, and is the index then to be created as follows?
(col2, col3, col1, col4)
If that is wrong, how do you determine the best order for column in an index which will only see bulk INSERTs into the corresponding table? The goal is to speed up the updating of the index during the bulk INSERT.
The quickest way is to DROP INDEX, then do the bulk inserts and CREATE INDEX when you are done inserting.
The proper structure of the index does not have so much to do with the distribution of values in the columns as with the retrieval strategies, presumably for UPDATE and DELETE only, and specifically with cases where you filter on some but not always all columns of the index. The columns used in those more frequent filters should come first in your index. But you probably want to reconsider your indexing strategy more radically if this is the case: it may be better to have two or more indexes to match your typical retrieval strategies.
Ignoring your call for ignorance: why would you not apply the index to SELECT statements? Indexes are useful only for selecting subsets of data from your tables, whether that is for SELECT or a qualified UPDATE or DELETE. There is no functional difference for using indexes in any of these three operations.
Addendum after comments from OP: Indexes are useful for many purposes but their maintenance is relatively expensive, where "relatively" becomes "impossibly" very quickly with increasing table size. In your case you have to compare every record from your source table with every record in your destination table, or O(m*n) order. That is unworkable with tables of a large size, even with an index. Your best bet is to drop the index, do the inserts, create an index which is not unique, find and delete all duplicates, drop the index, finally create a new unique index.
The order of the columns is not terribly important for uniqueness-enforcement purposes. But it would be unusual for a unique index not to also be useful for some queries, so I would order the columns to take advantage of that.
For bulk inserting into this index rapidly, I'd try inserting in index order. So add an order by (col2, col3, col1, col4) to the select part of your insert. This leads to more efficient IO.
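The drop, load, de-duplicate, recreate sequence suggested in this thread might look roughly like the following. It is a sketch using Python's built-in sqlite3 module with a hypothetical table target, index ux_target and four sample rows; it skips the intermediate non-unique index from the addendum and removes duplicates with a rowid-based DELETE, which is specific to SQLite. Whether this is actually faster than inserting with the index in place has to be measured on your own data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (col1 INTEGER, col2 INTEGER, col3 INTEGER, col4 INTEGER)"
)
conn.execute("CREATE UNIQUE INDEX ux_target ON target (col2, col3, col1, col4)")

# Hypothetical source rows; the third one repeats the first.
incoming = [(1, 10, 100, 0), (2, 20, 200, 1), (1, 10, 100, 0), (3, 30, 300, 2)]

# 1. Drop the index so the bulk insert does not maintain it row by row.
conn.execute("DROP INDEX ux_target")

# 2. Bulk insert without uniqueness checks.
conn.executemany("INSERT INTO target VALUES (?, ?, ?, ?)", incoming)

# 3. Remove duplicates, keeping the first copy of each combination.
conn.execute(
    "DELETE FROM target WHERE rowid NOT IN ("
    "SELECT MIN(rowid) FROM target GROUP BY col1, col2, col3, col4)"
)

# 4. Recreate the unique index once, over the cleaned data.
conn.execute("CREATE UNIQUE INDEX ux_target ON target (col2, col3, col1, col4)")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(row_count, index_names)
```

Note that between steps 1 and 4 nothing enforces uniqueness, so concurrent writers would need to be kept out for the duration of the load.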
If I create a table like so:
CREATE TABLE something (column1, column2, PRIMARY KEY (column1, column2));
Neither column1 nor column2 are unique by themselves. However, I will do most of my queries on column1.
Does the multi column primary key create an index for both columns separately? I would think that if you specify a multi column primary key it would index them together, but I really don't know.
Would there be any performance benefit to adding a UNIQUE INDEX on column1?
There will probably not be a performance benefit, because queries against col1=xxx and col2=yyy would use the same index as queries like col1=zzz with no mention of col2. But my experience is only Oracle, SQL Server, Ingres, and MySQL. I don't know for sure.
You certainly don't want to add a unique index on column 1 as you just stated:
Neither column1 nor column2 are unique by themselves.
If column one comes first, it will be first in the multicolumn index in most databases and thus it is likely to be used. The second column is the one that might not use the index. I wouldn't add one on the second column unless you see problems and again, I would add an index not a unique index based on the comment you wrote above.
But SQLite must have some way of seeing what it is using, like most other databases, right? Set the PK and see if queries using just column1 are using it.
I stumbled across this question while researching this same question, so figured I'd share my findings. Note that all of the below is tested on SQLite 3.39.4. I make no guarantees about how it will hold up on old/future versions. That said, SQLite is not exactly known for radically changing behavior at random.
To give a concrete answer for SQLite specifically: an index on column1 would provide no benefits, but an index on column2 would.
Let's look at a simple SQL script:
CREATE TABLE tbl (
column1 TEXT NOT NULL,
column2 TEXT NOT NULL,
val INTEGER NOT NULL,
PRIMARY KEY (column1, column2)
);
-- Uncomment to make the final SELECT fast
-- CREATE INDEX column2_ix ON tbl (column2);
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column1 = 'column1' AND column2 = 'column2';
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column1 = 'column1';
EXPLAIN QUERY PLAN SELECT val FROM tbl WHERE column2 = 'column2';
EXPLAIN QUERY PLAN is SQLite's method of allowing you to inspect what its query planner is actually going to do.
You can execute the script via something like:
$ sqlite3 :memory: < sample.sql
This gives the output
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=? AND column2=?)
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=?)
QUERY PLAN
`--SCAN tbl
So the first two queries, the ones which SELECT on (column1, column2) and (column1), will use the index to perform the search. Which should be nice and fast.
Note that the last query, the SELECT on (column2) has different output, though. It says it's going to SCAN the table -- that is, go through each row one by one. This will be significantly less performant.
What happens if we uncomment the CREATE INDEX in the above script? This will give the output
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=? AND column2=?)
QUERY PLAN
`--SEARCH tbl USING INDEX sqlite_autoindex_tbl_1 (column1=?)
QUERY PLAN
`--SEARCH tbl USING INDEX column2_ix (column2=?)
Now the query on column2 will also use an index, and should be just as performant as the others.