SQL query optimization based on indexes - sql

I have been given as an assignment the following queries and how to optimize them by creating indexes:
a) SELECT EmployeeID FROM Employee WHERE Name='John' AND Surname='Brown'
b) SELECT EmployeeID FROM Employee WHERE Salary=1300
c) SELECT EmployeeID FROM Employee WHERE Salary BETWEEN 1000 AND 1500
d) SELECT EmployeeID FROM Employee WHERE Salary+Bonus>1500
from a table Employee:
EmployeeID,
Name,
Surname,
Salary,
Bonus
I've stated that for the first one (a) a compound index would be best, a clustered index for the second one (b), a partitioned index for the third one (c), and some kind of clustered index for (d). I am not sure about my choices - could you please verify and correct them, as I am new to this? P.S. the indexes should preferably be for Oracle. Thanks in advance.

for the first one a) a compound index would be best
On what columns? Surname + Name, Name + Surname? The order can matter. In this case it likely doesn't matter at all, but normally you want to consider the entire application, and think about how you will be commonly doing lookups. If you have another query that looks up by surname alone, for example, you would want to make sure to put the surname column first in the index, so that this index will work for both queries. Over-indexing can be almost as bad for performance as under-indexing.
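As an illustration only, a compound index for query a) might look like the following sketch (Oracle syntax; the index name is made up, and whether Surname or Name should lead depends on your other queries):

CREATE INDEX emp_surname_name_i ON Employee (Surname, Name);
-- putting Surname first also lets a query that filters on Surname alone use this same index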
a clustered better for the second one
Again, you need to consider the entire table/application when choosing your indexes. You can only have one clustered index on a table. It's highly likely that your one clustered index will need to be on the EmployeeID column. Even if we don't see any queries using it here, that's the most common need. A regular index on Salary is probably good enough here.
a partitioned for the third one
A regular index on Salary will likely be good enough. The database will be able to go to the first record, and then "walk the index" until it no longer matches. But it depends on the table size... if the table is huge (into the 10s and 100s of millions of rows), partitioning can make sense (usually on the table itself). I don't know many businesses that have 10s of millions of employees. Again, one thing we want to do is avoid over-indexing, and so re-using the same index from b) is good.
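As a sketch (Oracle syntax; the index name is made up), one regular index on Salary can serve both the equality lookup in b) and the range scan in c):

CREATE INDEX emp_salary_i ON Employee (Salary);
-- used for: WHERE Salary = 1300
-- also used for: WHERE Salary BETWEEN 1000 AND 1500 (a range scan along the same index)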
some kind of clustered index for (d)
It depends on the database engine and version, but it's unlikely any index by itself will help this query. The reason is that expressions are very often not sargable, meaning the query optimizer won't be smart enough to know if the index will work or not. What you can do is create a computed column (called a virtual column in Oracle) and put an index on that column.
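In Oracle (11g and later) that could be done with a virtual column, roughly like this sketch (the column and index names are my own):

ALTER TABLE Employee ADD (TotalComp NUMBER GENERATED ALWAYS AS (Salary + Bonus) VIRTUAL);
CREATE INDEX emp_totalcomp_i ON Employee (TotalComp);
-- the optimizer can then match the expression Salary+Bonus in query d) to the indexed virtual column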
In all cases, since you're only requesting the EmployeeID column, you want to add EmployeeID to the index, but don't actually index on that field. Just INCLUDE the column with the index. In this way, the database will be able to entirely fulfill your query from the index alone, without needing to go back to the table. The reason for just including the column, rather than indexing on it, is to improve performance of INSERT/UPDATE statements, to avoid needing to rebuild the index.
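For example, in SQL Server that idea would look roughly like the sketch below for query b) (the index name is made up; Oracle has no INCLUDE clause, so there you would simply append EmployeeID as a trailing key column instead):

CREATE INDEX IX_Employee_Salary ON Employee (Salary) INCLUDE (EmployeeID);
-- SELECT EmployeeID FROM Employee WHERE Salary = 1300 can now be answered entirely from the index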

For d) a function based index (FBI) would be appropriate:
CREATE INDEX emp_i3 ON Employee (Salary+Bonus);

Related

Why can't I simply add an index that includes all columns?

I have a table in a SQL Server database which I want to be able to search and retrieve data from as fast as possible. I don't care how long it takes to insert into the table; I am only interested in the speed at which I can get data.
The problem is the table is accessed with 20 or more different types of queries. This makes it a tedious task to add an index specially designed for each query. I'm considering instead simply adding an index that includes ALL columns of the table. It's not something you would normally do in "good" database design, so I'm assuming there is some good reason why I shouldn't do it.
Can anyone tell me why I shouldn't do this?
UPDATE: I forgot to mention, I also don't care about the size of my database. It's OK if that means my database grows larger than it needs to.
First of all, an index entry in SQL Server is limited to at most 900 bytes. That alone makes it impossible to have an index over all columns.
Most of all: such an index makes no sense at all. What are you trying to achieve??
Consider this: if you have an index on (LastName, FirstName, Street, City), that index cannot be used to speed up queries on
FirstName alone
City
Street
That index would be useful for searches on
(LastName), or
(LastName, FirstName), or
(LastName, FirstName, Street), or
(LastName, FirstName, Street, City)
but really nothing else - certainly not if you search for just Street or just City!
The order of the columns in your index makes quite a difference, and the query optimizer can't just use any column somewhere in the middle of an index for lookups.
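To make that concrete, here is a sketch (the table and index names are made up):

CREATE INDEX IX_Person_Name_Addr ON Person (LastName, FirstName, Street, City);

-- can seek: the predicate covers a leftmost prefix of the index
SELECT * FROM Person WHERE LastName = 'Smith' AND FirstName = 'Joe';

-- cannot seek on this index (LastName is not filtered); Street would need its own index
SELECT * FROM Person WHERE Street = 'Main Street';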
Consider your phone book: it's ordered by LastName, FirstName, maybe Street. So does that indexing help you find all "Joe's" in your city? All people living on "Main Street"?? No - you have to look up by LastName first, and then you get more specific inside that set of data. Just having an index over everything doesn't speed up searching on every column.
If you want to be able to search by Street - you need to add a separate index on (Street) (and possibly another column or two that make sense).
If you want to be able to search by Occupation or whatever else - you need another specific index for that.
Just because your column exists in an index doesn't mean that'll speed up all searches for that column!
The main rule is: use as few indices as possible - too many indices can be even worse for a system than having no indices at all.... build your system, monitor its performance, and find those queries that cost the most - then optimize these, e.g. by adding indices.
Don't just blindly index every column just because you can - this is a guarantee for lousy system performance - any index also requires maintenance and upkeep, so the more indices you have, the more your INSERT, UPDATE and DELETE operations will suffer (get slower) since all those indices need to be updated.
You have a fundamental misunderstanding of how indexes work.
Read this explanation "how multi-column indexes work".
The next question you might have is why not create one index per column - but that's also a dead end if you're trying to reach top select performance.
You might feel that it is a tedious task, but I would say it's a required task to index carefully. Sloppy indexing strikes back, as in this example.
Note: I am strongly convinced that proper indexing pays off, and I know that many people have the very same questions you have. That's why I'm writing a free book about it. The links above refer to the pages that might help you answer your question. However, you might also want to read it from the beginning.
...if you add an index that contains all columns, and a query were actually able to use that index, it would scan it in the order of the primary key - which means hitting nearly every record. Average search time would be O(n/2), the same as hitting the actual table.
You need to read a lot more about indexes.
It might help if you consider an index on a table to be a bit like a Dictionary in C#.
var nameIndex = new Dictionary<String, List<int>>();
That means that the name column is indexed, and will return a list of primary keys.
var nameOccupationIndex = new Dictionary<String, List<Dictionary<String, List<int>>>>();
That means that the name column + occupation columns are indexed. Now imagine the index contained 10 different columns, nested so deep that it contains every single row in your table.
This isn't exactly how it works mind you. But it should give you an idea of how indexes could work if implemented in C#. What you need to do is create indexes based on one or two keys that are queried on extensively, so that the index is more useful than scanning the entire table.
If this is a data warehouse type operation where queries are highly optimized for READ queries, and if you have 20 ways of dissecting the data, e.g.
WHERE clause involves..
Q1: status, type, customer
Q2: price, customer, band
Q3: sale_month, band, type, status
Q4: customer
etc
And you absolutely have plenty of fast storage space to burn, then by all means create an index for EVERY single column, separately. So a 20-column table will have 20 indexes, one for each individual column. I could probably say to ignore bit columns or low cardinality columns, but since we're going so far, why bother (with that admonition). They will just sit there and churn the WRITE time, but if you don't care about that part of the picture, then we're all good.
Analyze your 20 queries, and if you have hot queries (the hottest ones) that still won't go any faster, plan one in SSMS (press Ctrl-L) with that query in the query window. It will tell you what index can help that query - just create it; create them all, remembering that each one again adds to the write cost, backup file size, db maintenance time, etc.
I think the questioner is asking
'why can't I make an index like':
create index index_name
on table_name
(
*
)
The problems with that have been addressed.
But given that it sounds like they are using MS SQL Server, it's useful to understand that you can include nonkey columns in an index, so that the values of those columns are available for retrieval from the index but are not used as selection criteria:
create index index_name
on table_name
(
foreign_key
)
include (a,b,c,d) -- every column except foreign key
I created two tables with a million identical rows
I indexed table A like this
create nonclustered index index_name_A
on A
(
foreign_key -- this is a guid
)
and table B like this
create nonclustered index index_name_B
on B
(
foreign_key -- this is a guid
)
include (id,a,b,c,d) -- (every column except the foreign key)
no surprise, table A was slightly faster to insert to.
but when I ran these queries
select * from A where foreign_key = #guid
select * from B where foreign_key = #guid
On table A, sql server didn't even use the index, it did a table scan, and complained about a missing index including id,a,b,c,d
On table B, the query was over 50 times faster with much less io
forcing the query on A to use the index didn't make it any faster
select * from A where foreign_key = #guid
select * from A with (index(index_name_A)) where foreign_key = #guid
I'm considering instead simply adding an index that includes ALL columns of the table.
This is always a bad idea. Indexes in a database are not some sort of pixie dust that works magically. You have to analyze your queries and add indexes according to what is being queried and how.
It is not as simple as "add everything to an index and have a nap".
I see only long and complicated answers here so I thought I should give the simplest answer possible.
You cannot add an entire table, or all its columns, to an index because that just duplicates the table.
In simple terms, an index is just another table with selected data ordered in the order you normally expect to query it in, and a pointer to the row on disk where the rest of the data lives.
So, a level of indirection exists. You have a partial copy of a table kept in a preordered manner (both on disk and in RAM, assuming the index is not fragmented), which is faster to query for the columns defined in the index, while the remaining columns can be fetched without scanning the whole table for them, because the index contains a reference to the position on disk where the rest of the data for each row lives.
1) Size: an index essentially builds a copy of the data in that column into some easily searchable structure, like a binary tree (I don't know the SQL Server specifics).
2) You mentioned speed: index structures are slower to add to.
That index would just be identical to your table (possibly sorted in another order).
It won't speed up your queries.

Table index design

I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it the TASK table) is going to be the biggest table of the whole application. Expecting millions of records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT IDs from tables that are usually much smaller than the Task table (10-50 records max), for example a Status table (with values like "open", "closed") or a Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (self - ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a (Company) column (string!) that is something like "5 characters long all the time", and every user will be restricted by this one. If in Task there are 15 different "Companies", the logged-in user would only see one. So there's always a filter on this column. Might it be a good idea to add an index to it?
DATE COLUMNS
I think you don't index these ... right? Or can / should they be indexed?
I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
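As a rough sketch for the Task table described above (the column and index names here are my own assumptions, not from the question):

CREATE INDEX IX_Task_Company_Created ON Task (CompanyCode, CreatedDate);
-- supports: WHERE CompanyCode = 'ACME1' AND CreatedDate >= '20120101' AND CreatedDate < '20130101'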
What you cannot index are columns that hold more than 900 bytes of data - anything like VARCHAR(1000) or such.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.
In general an index will speed up a JOIN, a sort operation and a filter.
So if the columns are in the JOIN, the ORDER BY or the WHERE clause, then an index will help in terms of performance... but there is always a but... with every index that you add, UPDATE, DELETE and INSERT operations will be slowed down because the indexes have to be maintained.
So the answer is... it depends.
I would say start hitting the table with queries and look at the execution plans for scans; try to turn those into seeks by either writing SARGable queries or adding indexes if needed... don't just add indexes for the sake of adding indexes.
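For example, here is what making a predicate SARGable might look like (hypothetical table and column names):

-- non-SARGable: the function wrapped around the column prevents an index seek
SELECT TaskID FROM Task WHERE YEAR(CreatedDate) = 2012;

-- SARGable rewrite: a range on the bare column lets an index on CreatedDate be seeked
SELECT TaskID FROM Task WHERE CreatedDate >= '20120101' AND CreatedDate < '20130101';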
Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you will have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations where you need to retrieve a lot of rows, you might consider using your clustered index to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)
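As a minimal sketch of those last two suggestions (column and index names are my assumptions), leaving partitioning aside:

CREATE INDEX IX_Task_Status ON Task (StatusId);     -- useful if, say, only a small fraction of rows are "Active" and that's all you fetch
CREATE INDEX IX_Task_Company ON Task (CompanyCode); -- every query filters on company, so this is a strong candidate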

Do queries make use of more than one index at a time?

If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query? Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
If the cost-based query optimizer determines that it's more efficient to use more than one index, yes, it will. If it's more efficient to do a scan (and often it is), then it may not use an index, even if you think it should.
Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Again, if the optimizer thinks it's efficient to do so, yes, it'll use that other index for the same query. If it determines the cost is higher with the index... it'll ignore it. It all depends on how selective the index is (or rather, how selective the optimizer thinks it is, based on the latest statistics) as to whether it'll use it. If it's not selective (won't narrow down the results much), it'll likely be ignored.
It depends on the optimizer and the query, but optimizers relatively seldom use two separate indexes on a single table in a single query. It is perfectly feasible to construct examples where they could, possibly even should - and some may actually do so. Consider:
A UNION query where the separate terms have filters on different columns (but a table scan may be as effective)
A self-join where the separate sides of the self-join have the different filters.
However, be wary of accusing the optimizer of not being efficient - there may still be advantages to resolving the query by other methods.
To answer your 'index on 4 columns' question: it is rather unlikely. In this scenario, it is likely that the 4-column index provides good selectivity and the query is most easily resolved by applying the extra filter condition to the rows retrieved by the index scan. (Note that the answer might be different depending on whether the extra condition is connected to the others by AND, as I assumed, or OR, where using the second index might be useful.)
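A small sketch of that AND vs. OR distinction (generic SQL, made-up table and index names); whether two indexes actually get combined is entirely up to the optimizer:

CREATE INDEX ix_orders_customer ON Orders (CustomerId);
CREATE INDEX ix_orders_status ON Orders (Status);

-- AND: one selective index seek plus a residual filter is usually cheapest
SELECT OrderId FROM Orders WHERE CustomerId = 42 AND Status = 'OPEN';

-- OR: some optimizers can combine both indexes (index union / bitmap OR), others fall back to a scan
SELECT OrderId FROM Orders WHERE CustomerId = 42 OR Status = 'OPEN';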
It depends upon the queries emitted against those tables, the size of the tables and the selectivity of the data in the columns indexed.
The optimizer uses statistics to determine whether using an index will be beneficial.
1. If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
It certainly can, for example if you have the table
EMPLOYEE(
id (index1)
name
address
date (index2) )
and the table
TASKS(
id
employee_id (index3)
date (index 4)
category
description)
If you do the query:
select
TASKS.employee_id, TASKS.date, TASKS.category, TASKS.description
from EMPLOYEE, TASKS where
EMPLOYEE.id = TASKS.employee_id and
EMPLOYEE.date = TASKS.date
this will list all the tasks of each employee on each day, using index1 and index2 along with index3 and index4. It would take much more time if I were lacking either index1 or index2.
2. If I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Of course it can be done, but the query should include joins on both the 4 column index and also the single column index.

SQL2005 indexes

Was scanning through a SQL2005 database and saw the following two indexes for a table:
**PK_CLUSTERED_INDEX**
USER_ID
COMPANY_ID
DEPARTMENT_ID
**NON-unique_NON-clustered_INDEX**
USER_ID
COMPANY_ID
My initial thought is to drop the last index, since the PK_CLUSTERED_INDEX already contains those columns, in the correct order and sort. Does the last index provide any gains at all?
If searching by User_ID, or User_ID and Company_ID columns, both indexes could fulfil that.
However, only the PK index would then be ideal if the Department_Id is also queried on in addition to those 2 fields.
If a query filters on User_ID and Company_ID, and needs to return other data columns then the PK index is still best as it has all the data there to hand. Whereas the nonclustered index doesn't so would probably need a Key Lookup to pull out the extra fields which is not as efficient.
It looks redundant to me, so I'd definitely be considering removing it.
To see whether an index is actually being used/get a feel for the level of usage, you can run one of the various index usage stats scripts out there. A good example is here.
In this case, drop the index. Given it's non-unique, I would bet the optimizer never hits it; the first index is unique and doesn't involve a row lookup after finding a match.
First one's all around better, you're not losing anything by dropping the second.
I would drop the NON-unique_NON-clustered_INDEX index, it is redundant and not needed.

RDBMS: when to use complex indexes for queries and when to use simple ones?

Suppose I have a table in my DB schema called TEST with fields (id, name, address, phone, comments). Now, I know that I'm going to run a large set of different queries against that table, so my question is: when and why should I create indexes like ID_NAME_INDX (an index on id and name), and when is it more efficient to create a separate index for id and another for name (by "when" I mean for what type of query)?
The general aim would be to "cover" all columns so the query only has to use the index.
-- An index on Name including ID would be ideal
SELECT
[id]
FROM
TEST
WHERE
[name] = 'bob'
Say you need name and id but have separate indexes. You'll end up with a bookmark lookup from the index to the PK to get the other columns (assuming it doesn't just scan the PK).
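That "ideal" index from the comment above would look something like this in SQL Server (a sketch; the index name is made up):

CREATE INDEX IX_TEST_Name ON TEST ([name]) INCLUDE ([id]);
-- SELECT [id] FROM TEST WHERE [name] = 'bob' can then be answered from the index alone, with no bookmark lookup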
Edit, after 1st comment:
select * from test where id='id1' and name='Name1'
For this query, the SELECT * mitigates against any covering index, so the PK would be used.
If you had:
select address from test where id='id1' and name='Name1'
then an index on ID, name including address would "cover" it.
Using "OR" creates difficulties for any strategy. However,
select address from test where id='id1' or name='Name1'
would most likely still use the "ID, name including address" index, but scan it rather than seek.
Read this: Execution Plan Basics
I'm not sure your example explains the actual question you're asking. You're asking whether you should have an index on ID and an index on Name, as opposed to a single index on both ID and Name. The thing is, I guess that ID is your primary key, so you're not likely to search on ID AND Name together.
However, in the case of a table with two IDs where you want to search on either one, or on both together, having three indexes - one on each of the IDs and one combined - will be the fastest. If you only have the two single-column indexes, then finding a record that matches both may require searching both indexes. If you have one index covering both IDs, then only that index needs to be searched.
As with all indexes though, as you add them, your database increases in size and you will get a reduction on insert / update performance. You always need to weigh up the gains / losses.
Add indexes to the absolutely obvious candidates, add indexes to the "maybe" ones as the need arises. Continue to monitor your database performance and run query analysers to see where any performance gains can be made over time.
Most database software includes some sort of tool to debug your queries. These can usually tell you which indexes the server considered and which it ended up using. This functionality is usually called EXPLAIN or something similar.
Usually you should create indexes for columns that are used in the where clause or joins.