SQL Indexing: None, Single Column, and Multiple Columns - sql

How does indexing work in SQL and what benefits does it provide? What reason would there be for not indexing? And what is the difference between indexing a single column vs. indexing multiple columns?

How does indexing work in SQL and what benefits does it provide?
When you index columns you express your intent to query the indexed columns in conditional expressions, such as equality or range queries. With this information the storage engine can build a structure that makes such queries faster, often by arranging the indexed values in a tree. B-trees are the most common, but many other structures exist, such as hash indices and R-tree indices for spatial data. Each structure is specialized for a certain type of lookup. For instance, hash indices are very fast for equality conditions, such as:
SELECT * FROM example_table WHERE type = 'example';
SELECT * FROM example_table WHERE id = X;
B-trees are also fairly quick for equality lookups, but their main strength is that they support range queries:
SELECT * FROM example_table WHERE id > 5 AND id < 10;
SELECT * FROM example_table WHERE type = 'example' AND value > 25;
It is VERY important, however, when you build B-tree indices to understand that the tree is ordered in a "left-to-right" manner. I.e., if you build a B-tree index (let's call it A) on {type, value}, then you NEED to have a condition on the type column in order for the query to be able to utilize the index. The example index can NOT be used in a query where the condition depends solely on value.
Furthermore, if you mix equality and range conditions, make sure that the equality columns are listed first in the index; otherwise the index can only be partially used.
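The left-to-right rule can be observed directly in a query plan. The following is a minimal sketch that uses Python's built-in sqlite3 module rather than the MySQL setup assumed elsewhere in this answer; the `payload` column and the index name `idx_type_value` are made up for the illustration, and other engines word their plans differently.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
# 'payload' is a hypothetical extra column so the index cannot cover SELECT *.
conn.execute(
    "CREATE TABLE example_table ("
    " id INTEGER PRIMARY KEY, type TEXT, value INTEGER, payload TEXT)"
)
conn.execute("CREATE INDEX idx_type_value ON example_table (type, value)")

# Condition on the leading column (plus a range on the second): index is usable.
plan_prefix = query_plan(
    conn, "SELECT * FROM example_table WHERE type = 'example' AND value > 25"
)
# Condition only on the second column: the index order does not help.
plan_value_only = query_plan(
    conn, "SELECT * FROM example_table WHERE value > 25"
)

print(plan_prefix)
print(plan_value_only)
```

In this sketch the first plan mentions idx_type_value while the second falls back to scanning the table.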
What reason would there be for not indexing?
If the selectivity of the index is low, then you might not gain much over a table scan. Say, for instance, that you have an index on a field called gender. The selectivity of that index will be low, since a lookup on that index will return about half the rows of the original table. You can read a pretty simple explanation of selectivity, and the reasoning behind it, here: http://mattfleming.com/node/192
Also, maintaining an index has a cost. For each data manipulation the index might need restructuring. So it might be desirable to keep the number of indices to the minimum required to perform well on the queries against that table.
What is the difference between indexing a single column vs. indexing multiple columns?
Once again, it depends on the type of queries you issue. Indexing a single column such as gender might not be a good idea, since the selectivity is low. When the selectivity is high, such an index makes much more sense. For instance, an index on the primary key is a very good index, since the selectivity is high (actually, it is as high as it gets: each key in the index corresponds to exactly one record), and indices on columns with unique or highly varied values (such as slugs, password hashes and whatnot) are also good single-column indices.
There is also the concept of covering indices. Basically, each leaf in an index contains a pointer into the table where the row is stored (unless the index is a clustered index, in which case the leaf is the record). So for each index hit, the query engine has to fetch the corresponding table row, increasing the number of I/O operations. Since I/O is extremely slow, you want to keep this to a minimum. Now, let's say that you often need to query for something and also fetch some additional columns. Then you can create a covering index, trading storage space for query performance. Example: Let's find the name and email of all users who have joined in the last 6 months (assuming MySQL):
With index on {joined_at}:
SELECT first_name, last_name, email
FROM users
WHERE joined_at > NOW() - INTERVAL 6 MONTH;
Query explanation:
id  select_type  table  type  possible_keys  key   key_len  ref   rows  Extra
1   SIMPLE       users  ALL   test           NULL  NULL     NULL  873   Using where
As you can see in the type-column, the query engine resorted to a full table scan, since the index selectivity was too low to be worthwhile using in this query (too many results would be returned, and thus followed into the table, costing too much in I/O)
With index on {joined_at, first_name, last_name, email}:
id  select_type  table  type   possible_keys  key    key_len  ref   rows  Extra
1   SIMPLE       users  range  test,test2     test2  8        NULL  514   Using where; Using index
Now, since all the information that is necessary to complete the query is available in the index, the query engine evaluates that it is much better to use the index (with 514 rows) instead of doing a full table scan.
So as you can see, by using covering indices we can speed up queries for partial selects of a table, even if the selectivity of the index is quite small.
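As a rough way to reproduce the covering effect without a MySQL server, here is a sketch using Python's built-in sqlite3 module. The table layout, the `bio` column, the index name `idx_test` and the fixed cutoff date are assumptions made for the illustration. SQLite's planner is not MySQL's, so it will not reproduce the full-table-scan decision shown above; it only shows whether the index is reported as covering.

```python
import sqlite3


def plan_with_index(index_columns):
    """Build a fresh users table with one index and return the query plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,"
        " email TEXT, joined_at TEXT, bio TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_test ON users ({index_columns})")
    # A fixed cutoff stands in for NOW() - INTERVAL 6 MONTH.
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT first_name, last_name, email FROM users "
        "WHERE joined_at > '2024-01-01'"
    ).fetchall()
    conn.close()
    return " | ".join(row[3] for row in rows)


narrow_plan = plan_with_index("joined_at")
covering_plan = plan_with_index("joined_at, first_name, last_name, email")

print(narrow_plan)    # index is used, but each hit still visits the table row
print(covering_plan)  # reported as a covering index: no table visit needed
```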

How does indexing work in SQL
That's a pretty open question, but basically databases store a structure that enables faster lookup of information. That structure depends on the implementation, but it's typically a type of tree.
what benefits does it provide?
Queries that are SARGable can be significantly faster.*
What reason would there be for not indexing?
Some data modification queries can take longer and indexes have a storage cost, but in many workloads both of these costs are small compared with the gain on reads.
And what is the difference between indexing a single column vs. indexing multiple columns?
There isn't much difference, but sometimes people create covering indexes** that index multiple columns to increase the performance of a specific query.
*SARGable is from Search ARGument ABLE. Basically if you do WHERE FOO > 5 it can be faster if FOO is indexed. On the other hand WHERE h(FOO) > 5 probably won't benefit from an index.
** If all the fields used in the SELECT JOIN and WHERE of a statement are also in an index a database can retrieve all the information it needs without going back to the base table. This is called a covering index. If all the fields were in separate indexes it would only use the ones for the joins and where and then go back to the base table for the columns in the select.
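The SARGable footnote can be checked with a small experiment. This is a sketch using Python's built-in sqlite3 module; `abs()` stands in for the unspecified function h, and the table `t`, the `other` column and the index name `idx_foo` are hypothetical.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the 'detail' strings from SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, foo INTEGER, other TEXT)")
conn.execute("CREATE INDEX idx_foo ON t (foo)")

# Bare column compared with a constant: the predicate is sargable.
sargable_plan = query_plan(conn, "SELECT * FROM t WHERE foo > 5")
# Column wrapped in a function: an ordinary index on foo cannot be searched.
wrapped_plan = query_plan(conn, "SELECT * FROM t WHERE abs(foo) > 5")

print(sargable_plan)
print(wrapped_plan)
```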

Related

Is unique composite index as effective as non-composite for queries on first column?

I have a table with b-tree index on column A (non-unique). Now I want to add a check for uniqueness of column A and column B combination when inserting, so I want to add a unique composite index (A, B).
Should I drop existing non-composite index? (queries in most cases use single index, as I have read)?
Will unique composite index be as effective as non-unique non-composite one for queries only on column A?
If you have a lot of queries going for column A only in a where clause, then most likely you should keep the index on column A in addition to the new one.
The number of queries that would use the index and the difference in query cost are the two most important criteria for deciding whether or not to keep the index. Since this depends on many factors, such as the amount of data in the table and the query parameters, you can, as Frank Heikens' comment says, use EXPLAIN ANALYZE to check important queries with and without the index to confirm your hypothesis.
There is a very small probability it would make sense to keep both indexes. If the unique index is almost never exercised (because you never do inserts or non-HOT updates, or queries that benefit from both columns) and you have precisely the right amount of memory and memory usage patterns, then it is possible for the single-column index to be small enough to stay in cache while the composite would not be.
But most likely what would happen is that the composite index would be used at least enough of the time that both indexes would be fighting with each other for cache space, making it overall less effective.
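To see that a unique composite index can both enforce uniqueness on (A, B) and serve lookups on A alone, here is a minimal sketch. It uses Python's built-in sqlite3 module purely for illustration (the question appears to be about PostgreSQL, where EXPLAIN ANALYZE is the tool to use); the table `t`, the `payload` column and the index name `idx_a_b` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, payload TEXT)")
# One composite unique index; no separate single-column index on a.
conn.execute("CREATE UNIQUE INDEX idx_a_b ON t (a, b)")

conn.execute("INSERT INTO t (a, b) VALUES (1, 1)")
conn.execute("INSERT INTO t (a, b) VALUES (1, 2)")  # same a, different b: allowed

try:
    conn.execute("INSERT INTO t (a, b) VALUES (1, 2)")  # duplicate (a, b) pair
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A query on the leading column alone can still search the composite index.
plan_on_a = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1")]

print(duplicate_rejected)
print(plan_on_a)
```

Whether the composite index is as fast as the narrower one on a real table depends on index size and caching, as discussed above, so this only shows that the index is usable.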

SQL Server indexing includes questions

I've been troubleshooting some bad SQL calls in my work's applications. I've been reading up on indexes, tweaking and benchmarking things. Here are some of the rules I've gathered (let me know if this sounds right):
For heavily used queries, boil down the query to only what is needed and rework the where statements to use the most common columns first. Then make a nonclustered index on the columns used in the where statement and INCLUDE any remaining select columns (excluding large columns, of course, like nvarchar(max)).
If a query is going to return more than 20% of the table's rows, it's best to do a table scan and not use an index.
Order in an index matters. You have to make sure to structure your where statement like the index is built.
Now one thing I'm having trouble finding info on is what if a query is selecting on columns that are not part of any index but is using a where statement that is? Is the index used and leaf node hits the table and looks at the associated row for it?
ex: table
Id col1 col2 col3
CREATE INDEX my_index
ON my_table (col1)
SELECT Id, col1, col2, col3
FROM my_table
WHERE col1 >= 3 AND col1 <= 6
Is my_index used here? If so, how does it resolve Id, col2, col3? Does it point back to table rows and pick up the values?
To answer your question, yes, my_index is used. And yes, your index will point back to the table rows and pick the id, col2 and col3 values there. That is what an index does.
Regarding your 'rules'
Rule 1 makes sense. Except for the fact that I usually do not 'include' other columns in my index. As explained above, the index will refer back to the table and quickly retrieve the row(s) that you need.
Rule 2, I don't really understand. You create the index and SQL Server will decide which indices to use or not use. You don't really have to worry about it.
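Rule 3, the order of the conditions in your where statement does not really make a difference, since the optimizer rearranges them as needed. The order of the columns in the index key does matter, though.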
I hope this helps.
From dba.stackexchange.com:
There are a few concepts and terms that are important to understand
when dealing with indexes. Seeks, scans, and lookups are some of the
ways that indexes will be utilized through select statements.
Selectivity of key columns is integral to determining how effective an
index can be.
A seek happens when the SQL Server Query Optimizer determines that the
best way to find the data you have requested is by scanning a range
within an index. Seeks typically happen when a query is "covered" by
an index, which means the seek predicates are in the index key and the
displayed columns are either in the key or included. A scan happens
when the SQL Server Query Optimizer determines that the best way to
find the data is to scan the entire index and then filter the results.
A lookup typically occurs when an index does not include all requested
columns, either in the index key or in the included columns. The query
optimizer will then use either the clustered key (against a clustered
index) or the RID (against a heap) to "lookup" the other requested
columns.
Typically, seek operations are more efficient than scans, due to
physically querying a smaller data set. There are situations where
this is not the case, such as a very small initial data set, but that
goes beyond the scope of your question.
Now, you asked how to determine how effective an index is, and there
are a few things to keep in mind. A clustered index's key columns are
called a clustering key. This is how records are made unique in the
context of a clustered index. All nonclustered indexes will include
the clustered key by default, in order to perform lookups when
necessary. All indexes will be inserted to, updated to, or deleted
from for every respective DML statement. That having been said, it is
best to balance performance gains in select statements against
performance hits in insert, delete, and update statements.
In order to determine how effective an index is, you must determine
the selectivity of your index keys. Selectivity can be defined as a
percentage of distinct records to total records. If I have a [person]
table with 100 total records and the [first_name] column contains 90
distinct values, we can say that the [first_name] column is 90%
selective. The higher the selectivity, the more efficient the index
key. Keeping selectivity in mind, it is best to put your most
selective columns first in your index key. Using my previous [person]
example, what if we had a [last_name] column that was 95% selective?
We would want to create an index with [last_name], [first_name] as the
index key.
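The selectivity figure described above is easy to compute from the data. The following sketch uses Python's built-in sqlite3 module with a made-up [person] table of 100 rows; the helper name `selectivity` and the generated names are assumptions for the example.

```python
import sqlite3


def selectivity(conn, table, column):
    """Distinct values divided by total rows (identifiers must be trusted)."""
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}"
    ).fetchone()
    return distinct / total if total else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")
# 100 rows with 90 distinct first names and 95 distinct last names.
rows = [(f"first{i % 90}", f"last{i % 95}") for i in range(100)]
conn.executemany("INSERT INTO person VALUES (?, ?)", rows)

first_name_selectivity = selectivity(conn, "person", "first_name")
last_name_selectivity = selectivity(conn, "person", "last_name")

print(first_name_selectivity, last_name_selectivity)
```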
I know this was a bit of a long-winded answer, but there really are a lot
of things that go into determining how effective an index will be, and
a lot of things you must weigh any performance gains against.

Performance of SQL query with condition vs. without where clause

Which SQL-query will be executed with less time — query with WHERE-clause or without, when:
WHERE-clause deals with indexed field (e.g. primary key field)
WHERE-clause deals with non-indexed field
I suppose when we're working with indexed fields, thus query with WHERE will be faster. Am I right?
As has been mentioned there is no fixed answer to this. It all depends on the particular context. But just for the sake of an answer. Take this simple query:
SELECT first_name FROM people WHERE last_name = 'Smith';
To process this query without an index, the last_name column must be checked for every row in the table (full table scan).
With an index, you could just follow a B-tree data structure until 'Smith' was found.
Without an index the worst case is linear, O(n), whereas with a B-tree it is O(log n), hence computationally less expensive.
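To get a feel for the difference between linear and logarithmic lookup, here is a small sketch in Python that counts the elements examined. Binary search over a sorted list is only a stand-in for a B-tree descent (a real B-tree has many keys per node and is even shallower), and the function names are made up for the illustration.

```python
def linear_steps(sorted_values, target):
    """Count elements examined by a front-to-back scan (like a table scan)."""
    steps = 0
    for value in sorted_values:
        steps += 1
        if value == target:
            break
    return steps


def binary_steps(sorted_values, target):
    """Count elements examined by binary search (a stand-in for an index)."""
    lo, hi, steps = 0, len(sorted_values) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            break
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


values = list(range(1_000_000))
worst_linear = linear_steps(values, 999_999)  # target is the very last element
worst_binary = binary_steps(values, 999_999)

print(worst_linear, worst_binary)
```

For a million rows the scan examines every element, while the binary search needs at most about twenty comparisons.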
Not sure what you mean by 'query with WHERE-clause or without', but you're correct that most of the time a query with a WHERE clause on an indexed field will outperform a query whose WHERE clause is on a non-indexed field.
One instance where the performance can be the same (i.e. indexing doesn't matter) is a range-based query in your where clause (e.g. WHERE col1 > x) that matches a large share of the rows. The optimizer may then choose to scan the table, which is no faster than the same range query on a non-indexed column.
Really, it depends on the columns you reference in the where clause, the types of data in the columns, the types of queries your running, etc.
It may depend on the type of where clause you are writing. In a simple where clause, it is generally better to have an index on the field you are using (and indexes can and should be built on more than the PK). However, you have to write a sargable where clause for the index to make any difference. See this question for some guidelines on sargability:
What makes a SQL statement sargable?
There are cases where a where clause on the primary key will be slower.
The simplest is a table with one row. Using the index requires loading both the index and the data page -- two reads. No index cuts the work in half.
That is a degenerate case, but it points to the issue -- the proportion of the rows selected. Or, more accurately, the proportion of pages needed to resolve the query.
When the desired data is on all pages, using an index slows things down. For a non-primary key this can be disastrous when the table is bigger than the page cache and the accesses are random.
Since pages are ordered by a primary key, the worst case is an additional index scan -- not too bad.
Some databases use statistics on tables to decide when to use an index and when to do a full table scan. Some don't.
In short, for queries that select a small proportion of the rows, an index will improve performance. For queries that select a large proportion of the rows, using an index can result in marginally worse performance or dire performance, depending on various factors.
Some of my queries are quite complex, and applying a where clause was degrading the performance. As a workaround, I loaded the data into temp tables and then applied the where clause to them. This significantly improved the performance. It also improved the performance where I had joins, especially Left Outer Join.

Issue with the big tables ( no primary key available)

Table1 has around 10 lakh records (1 million) and does not contain any primary key. Retrieving the data with a SELECT command (with a specific WHERE condition) is taking a large amount of time. Can we reduce the retrieval time by adding a primary key to the table, or do we need to do it some other way? Kindly help me.
A primary key does not have a direct effect on performance. But indirectly, it does. This is because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. But you can create your own unique indexes on a table. So, strictly speaking, a primary key does not affect performance, but the index used by the primary key does.
WHEN SHOULD PRIMARY KEY BE USED?
Primary key is needed for referring to a specific record.
To make your SELECTs run fast you should consider adding an index on the appropriate columns you're using in your WHERE.
E.g. to speed-up SELECT * FROM "Customers" WHERE "State" = 'CA' one should create an index on State column.
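The effect of adding such an index can be seen by comparing the query plan before and after. This is a sketch using Python's built-in sqlite3 module rather than SQL Server; the "Id" and "Name" columns and the index name `idx_customers_state` are assumptions for the example.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the 'detail' strings from SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Customers" ("Id" INTEGER PRIMARY KEY, "Name" TEXT, "State" TEXT)'
)
query = """SELECT * FROM "Customers" WHERE "State" = 'CA'"""

# No index on "State" yet: the whole table has to be scanned.
plan_before = query_plan(conn, query)

conn.execute('CREATE INDEX idx_customers_state ON "Customers" ("State")')

# With the index in place the engine can search it for 'CA'.
plan_after = query_plan(conn, query)

print(plan_before)
print(plan_after)
```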
A primary key will not help if the primary key column is not in your where clause.
If you would like to make your query faster, you can create a non-clustered index on the columns in the where clause. You may also want to include additional columns in the index (it depends on your select clause).
The SQL optimizer can then seek on your indexes, which will make your query faster.
(But you should think about data being added to your table. Insert operations might take longer if you create indexes on many columns.)
It depends on the SELECT statement, and the size of each row in the table, the number of rows in the table, and whether you are retrieving all the data in each row or only a small subset of the data (and if a subset, whether the data columns that are needed are all present in a single index), and on whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers
One way you can do it is to create indexes on your table. It's always better to create a primary key, which creates a unique index that by default will reduce the retrieval time .........
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?

Do queries make use of more than one index at a time?

If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query? Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
If the cost-based query optimizer determines that it's more efficient to use more than one index, yes, it will. If it's more efficient to do a scan (and often it is), then it may not use an index, even if you think it should.
Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Again, if the optimizer thinks it's efficient to do so, yes it'll use that other index for the same query. If it determines the cost is higher with the index...it'll ignore it. It all depends on how selective (or rather, how selective the optimizer thinks it is, based off the latest statistics) as to whether it'll use the index. If it's not selective (won't narrow down the results much), it'll likely ignore it.
It depends on the optimizer and the query, but optimizers relatively seldom use two separate indexes on a single table in a single query. It is perfectly feasible to construct examples where they could, possibly even should - and some may actually do so. Consider:
A UNION query where the separate terms have filters on different columns (but a table scan may be as effective)
A self-join where the separate sides of the self-join have the different filters.
However, be wary of accusing the optimizer of not being efficient - there may still be advantages to resolving the query by other methods.
To answer your 'index on 4 columns' question: it is rather unlikely. In this scenario, it is likely that the 4-column index provides good selectivity and the query is most easily resolved by applying the extra filter condition to the rows retrieved by the index scan. (Note that the answer might be different depending on whether the extra condition is connected to the others by AND (as I assumed) or OR (where using the second index might be useful).)
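The AND versus OR distinction can be tried out in SQLite, which normally uses at most one index per table in a query but can combine two for an OR condition. The sketch below uses Python's built-in sqlite3 module; the table `t`, its columns and the index names `idx_a` and `idx_b` are hypothetical, and other optimizers may behave differently.

```python
import sqlite3


def indexes_in_plan(conn, sql, index_names):
    """Return which of the given index names appear in the query plan."""
    details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    return {name for name in index_names if any(name in d for d in details)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_a ON t (a)")
conn.execute("CREATE INDEX idx_b ON t (b)")
names = ["idx_a", "idx_b"]

# AND: one index narrows the rows, the other condition is applied as a filter.
used_for_and = indexes_in_plan(conn, "SELECT * FROM t WHERE a = 1 AND b = 2", names)
# OR: each branch can be answered from its own index and the results merged.
used_for_or = indexes_in_plan(conn, "SELECT * FROM t WHERE a = 1 OR b = 2", names)

print(used_for_and)
print(used_for_or)
```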
It depends upon the queries emitted against those tables, the size of the tables and the selectivity of the data in the columns indexed.
The optimizer uses statistics to determine whether using an index will be beneficial.
1. If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
It certainly can, for example if you have the table
EMPLOYEE(
id (index1)
name
address
date (index2) )
and the table
TASKS(
id
employee_id (index3)
date (index 4)
category
description)
If you do the query:
select
employee_id,date,category,description
from EMPLOYEE, TASKS where
EMPLOYEE.id=employee_id and
EMPLOYEE.date=TASKS.date
this will list all the tasks of each employee on each day and use index1 and index2 along with index4 and index3. It would take much more time if I were lacking either index1 or index2.
2. If I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Of course it can be done, but the query should include joins on both the 4 column index and also the single column index.