After reading some material about indices on SQL Server and their performance advantages for selects and disadvantages for updates / inserts, I was wondering if badly used indices could actually also hurt performance for selects.
What conditions would have to be fulfilled to have an index decrease performance of a pure select query? Do such situations exist?
Thanks!
(Although I always try to include code examples, I can't think of anything that would support this question...)
Yes, albeit very slightly - so slightly that it would be justified to also answer "No".
If you have an index which might be considered for a query but is not actually usable, the optimizer will waste a short time pondering whether and how to use it (in rare cases with REALLY complicated indexes and views, and more frequently when the statistics behind an index are wrong, you might even end up with a suboptimal query plan).
Some cases would be:
a table without indexes
a table with a badly chosen index, which gets discarded
a table where TWO indexes exist, and for some reason (e.g. obsolete statistics) the existence of the second index makes the optimizer choose it, when the first would have been more efficient.
a table where the existing index (usually also thanks to obsolete statistics) tricks the optimizer into reading from the index an amount of data comparable to what could have been retrieved, more efficiently, with a full table scan; to make things worse, the index is fragmented and ordered differently from the table. What was essentially a full table scan becomes a slowed-down full table scan with lots of disk thrashing.
In the first two cases the query time is the same (and entails a full scan), but in the second you also spend a little time analyzing and discarding the index. In the third, the wrong index makes the query slower than it needed to be. In the fourth, unlikely but possible, case an already large execution time increases and becomes huge (update 2021-10-20: I have just done this to myself. Yay me).
Where an index is likelier to hurt you - where ALL indexes hurt you - is in inserts, deletes and updates. Then, any index not used by the update query, yet affected by it, will require a write to the index itself.
So you will want to have indexes, but as few as you can get away with without sacrificing SELECT performance. Actually, you might decide against indexing for a rarely used SELECT query in order to avoid having the needed index constantly updated by all the other UPDATE queries.
Edit: after reading Heinzi's answer, I'd also like to add that most DB servers have maintenance tools which analyze the tables and indexes (and sometimes query performance counters too) and keep up to date the statistics and hints Heinzi spoke of. So it's also important to periodically "maintain" the database, to keep the optimizer supplied with current information on which indexes to choose from.
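As a minimal sketch of such maintenance (the table names are hypothetical; the first two statements are MySQL syntax, the third is the SQL Server equivalent):

ANALYZE TABLE orders;            -- refreshes the index statistics the optimizer relies on
OPTIMIZE TABLE orders;           -- defragments the table and rebuilds its indexes
UPDATE STATISTICS dbo.Orders;    -- SQL Server: refresh the statistics for a table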
Update (MySQL)
There is a very nifty MySQL analysis tool that can actually suggest improvements to the existing indexing (remove unused keys, add useful keys): common_schema. It's really worth a look.
Yes, but it's very unlikely and it should not influence your decision to use indexes.
Sometimes, the SQL Server query optimizer chooses an execution plan that's not optimal. Since the number of possible execution plans is much larger than it might seem at first sight (a simple join of n tables already allows for n! possible join orders), SQL Server has to make an educated guess. It's in the nature of guesses that they are sometimes wrong.
It's a rare occurrence, but I've seen it happen a couple of times in the past years. In that case (and only in that case), a better plan would have been chosen if the index had not been there. However, removing the index is not the correct way to solve this problem, since the index usually exists for a reason. The correct way is to add a hint to this query (and only to this query), to help the optimizer choose the right plan.
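For illustration, a sketch of such a hint in T-SQL (the table and index names are hypothetical):

SELECT o.OrderId, o.Total
FROM dbo.Orders AS o WITH (INDEX(IX_Orders_CustomerId))
WHERE o.CustomerId = 42;

The hint pins this one query to the named index while leaving the optimizer free everywhere else.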
Yes, indexes can hurt performance for SELECTs. It is important to understand how database engines operate. Data is stored on disk(s) in "pages". Indexes make it possible to access the specific page that has a specific value in one or more columns in the table.
This is great if you are looking for specific values.
However, consider a query that needs to look at every row in a table. If you go through the table, you read the pages in order and, critically, you get every row on a page with a single read. The number of reads is the number of pages in the table. In addition, the page cache can optimize such reads with read-ahead, and pages no longer being used are simply overwritten.
Using an index for the same reads goes through the table one record at a time rather than one page at a time. This results in random reads through the pages. In the worst case, there is one read per record in the table -- potentially a very significant hit to performance. In addition, the index itself occupies some of the page cache, reducing memory for other operations.
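To put rough numbers on it: a table of 1,000,000 rows stored 100 to a page occupies 10,000 pages. A full scan costs about 10,000 sequential page reads, while retrieving 200,000 of those rows through an index could, in the worst case, cost up to 200,000 random page reads: twenty times as many reads, each of them slower than a sequential one.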
In general, the optimizer component of a SQL engine does a good job of distinguishing between these two situations. One of the key metrics is the selectivity of the query: how many rows does it return, relative to the number of pages in the table? If the number of rows returned is about the same as the number of pages, the optimizer will consider a full table scan rather than an index scan.
There are definitely other considerations, but in general, an index can hurt performance of even a simple select query. In general, optimizers do a good job, but there are sometimes unusual cases that trick even the best optimizers.
My guess would be: if you create indices that confuse the query plan optimiser, it may end up choosing an inefficient index for the query at hand.
This is potentially implementation-dependent, but in principle indexes should not slow down SELECT.
Obviously they can slow down INSERT and UPDATE.
I am spending some time optimizing our current database.
I am looking at indexes specifically.
There are a few questions:
Is there such a thing as too many indexes?
What will indexes speed up?
What will indexes slow down?
When is it a good idea to add an index?
When is it a bad idea to add an index?
Pros and cons of multiple indexes vs multi-column indexes?
What will indexes speed up?
Data retrieval -- SELECT statements.
What will indexes slow down?
Data manipulation -- INSERT, UPDATE, DELETE statements.
When is it a good idea to add an index?
When you need better data retrieval performance for specific queries.
When is it a bad idea to add an index?
On tables that will see heavy data manipulation -- insertion, updating...
Pros and cons of multiple indexes vs multi-column indexes?
Queries need to respect the order of columns in a composite index (an index on more than one column), from left to right in the index definition. The column order in the statement itself doesn't matter, only which columns are referenced: a statement must reference column 1 before an index on columns 1/2/3 can be used. If there's only a reference to column 2 or 3, the 1/2/3 index cannot be used.
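A sketch of that leftmost-prefix rule (the table and column names are hypothetical):

CREATE INDEX idx_c1_c2_c3 ON t (col1, col2, col3);

SELECT * FROM t WHERE col1 = 1 AND col2 = 2;   -- can use idx_c1_c2_c3
SELECT * FROM t WHERE col1 = 1;                -- can use idx_c1_c2_c3 (left prefix)
SELECT * FROM t WHERE col2 = 2 AND col3 = 3;   -- cannot use it: col1 is not referenced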
In MySQL, as a rule only one index can be used per SELECT statement (subqueries and the like are seen as separate statements); the index-merge optimization is the exception. And there's a limit to the amount of index space MySQL allows per table. Additionally, running a function on an indexed column renders the index useless for that query, for example:
WHERE DATE(datetime_column) = ...
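A common workaround, sketched with a hypothetical orders table: rewrite the function call as a range predicate so the index stays usable.

SELECT * FROM orders WHERE DATE(created_at) = '2021-06-01';      -- index on created_at is ignored

SELECT * FROM orders
WHERE created_at >= '2021-06-01' AND created_at < '2021-06-02';  -- index on created_at can be used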
I disagree with some of the answers on this question.
Is there such a thing as too many indexes?
Of course. Don't create indexes that aren't used by any of your queries. Don't create redundant indexes. Use tools like pt-duplicate-key-checker and pt-index-usage to help you discover the indexes you don't need.
What will indexes speed up?
Search conditions in the WHERE clause.
Join conditions.
Some cases of ORDER BY.
Some cases of GROUP BY.
UNIQUE constraints.
FOREIGN KEY constraints.
FULLTEXT search.
Other answers have advised that INSERT/UPDATE/DELETE are slower the more indexes you have. That's true, but consider that many uses of UPDATE and DELETE also have WHERE clauses and in MySQL, UPDATE and DELETE support JOINs too. Indexes may benefit these queries more than making up for the overhead of updating indexes.
Also, InnoDB locks rows affected by an UPDATE or DELETE. They call this row-level locking, but it's really index-level locking. If there's no index to narrow down the search, InnoDB has to lock a lot more rows than the specific row you're changing. It can even lock all the rows in the table. These locks block changes made by other clients, even if they don't logically conflict.
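A sketch of the difference, assuming a hypothetical accounts table on InnoDB:

-- Without an index on status, InnoDB scans the whole table and may lock
-- far more rows than the ones actually changed:
UPDATE accounts SET closed_at = NOW() WHERE status = 'closed';

-- With an index, only the matching index records (and rows) need to be locked:
CREATE INDEX idx_accounts_status ON accounts (status);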
When is it a good idea to add an index?
If you know you need to run a query that would benefit from an index in one of the above cases.
When is it a bad idea to add an index?
If the index is a left-prefix of another existing index, or the index doesn't help any of the queries you need to run.
Pros and cons of multiple indexes vs multi-column indexes?
In some cases, MySQL can perform index-merge optimization, and either union or intersect the results from independent index searches. But it gives better performance to define a single index so the index-merge doesn't need to be done.
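For illustration, with hypothetical names: given two single-column indexes idx_a on (a) and idx_b on (b), a query like

SELECT * FROM t WHERE a = 1 AND b = 2;   -- EXPLAIN may show "Using intersect(idx_a,idx_b)"

may be answered by intersecting the two indexes, whereas a single composite index serves it directly:

CREATE INDEX idx_a_b ON t (a, b);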
For one of my consulting customers, I defined a multi-column index on a many-to-many table where there was no index, and improved their join query by a factor of 94 million!
Designing the right indexes is a complex process, based on the queries you need to optimize. You shouldn't make broad rules like "index everything" or "index nothing to avoid slowing down updates."
See also my presentation How to Design Indexes, Really.
Is there such a thing as too many indexes?
Indexes should be informed by the problem at hand: the tables, the queries your application will run, etc.
What will indexes speed up?
SELECTs.
What will indexes slow down?
INSERTs will be slower, because you have to update the index.
When is it a good idea to add an index?
When your application needs another WHERE clause.
When is it a bad idea to add an index?
When you don't need it to query or enforce uniqueness constraints.
Pros and Cons of multiple indexes vs multi-column indexes?
I don't understand the question. If you have a uniqueness constraint that includes multiple columns, by all means model it as such.
Is there such a thing as too many indexes?
Yes. Don't go out looking to create indexes, create them as necessary.
What will indexes speed up?
Any queries against the indexed table/view.
What will indexes slow down?
Any INSERT statements against the indexed table will be slowed down, because each new record will need to be indexed.
When is it a good idea to add an index?
When a query is not running at an acceptable speed. You may be filtering on columns that are not part of the clustered PK, in which case you should add indexes based on the filters you search on (if performance warrants it).
When is it a bad idea to add an index?
When you do it just for the sake of it - i.e. over-optimization.
Pros and cons of multiple indexes vs multi-column indexes?
Depends on the queries you are trying to improve.
Is there such a thing as too many indexes?
Yup, like all things, too many indexes will slow down data manipulation.
When is it a good idea to add an index?
A good time to add an index is when your queries are too slow (e.g. when they involve many joins). You should use this optimization only after you've built a solid model, to tweak performance.
I have some MySQL queries with conditions like
WHERE field1 = val1 OR field2 = val2
and some like
WHERE fieldx = valx AND fieldy = valy AND (field1 = val1 OR field2 = val2)
How can I optimize these queries by creating indexes? My intuition is to create separate indexes for field1 and field2 for the first query: since it's an OR, a composite index probably won't do much good.
For the second query I intend to create two indexes: (fieldx, fieldy, field1) and (fieldx, fieldy, field2), again for the above stated reason.
Is this solution correct? This is a really large table, so I can't just experiment by applying indexes and explaining the query.
As with all DBMS optimisation questions, it depends on your execution engine.
I would start with the simplest scenario: four separate indexes, one on each of the columns.
This will ensure that any queries using those columns in a way you haven't anticipated will still run okay (a fieldx/fieldy/field1 index will be of zero use to a query only using fieldy).
Any decent execution engine will efficiently choose the most selective index first, so as to reduce the result set, and then perform the other filters based on that.
Then, and only if you have a performance problem, you can look into improving it with different indexes. You should test performance on production-type data, not any test databases you have built yourself (unless they mirror the attributes of production anyway).
And keep in mind that database tuning is rarely a set-and-forget operation. You should periodically re-tune because performance depends both on the schema and the data you hold.
Even if the schema never changes, the data may vary wildly. Re your comment "I can't just experiment by applying indexes and explaining the query": that's exactly what you should be doing.
If you're worried about playing in production (and you should be), you should have another environment set up with similar specs, copy the production data across to it, then fiddle around with your indexes there.
My intuition is to create separate indexes for field1 and field2 for the first query: since it's an OR, a composite index probably won't do much good.
That's correct.
For the second query I intend to create two indexes: (fieldx, fieldy, field1) and (fieldx, fieldy, field2), again for the above stated reason.
That's one option; the other would be an index on (fieldx, fieldy, field1) and an index on (field2) (the same as for your first query!). You would still have two indexes, but the second one would be much smaller. Your second query can use both indexes: the bigger one for the AND part of the query and the small one for the OR part on field2. MySQL should be smart enough nowadays.
EXPLAIN will help you out.
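A sketch of that check (the table name and values are hypothetical):

CREATE INDEX idx_xy1 ON t (fieldx, fieldy, field1);
CREATE INDEX idx_2   ON t (field2);

EXPLAIN SELECT * FROM t
WHERE fieldx = 1 AND fieldy = 2 AND (field1 = 3 OR field2 = 4);
-- Check the key, type and Extra columns; an index-merge plan shows up as
-- something like "Using sort_union(idx_xy1,idx_2)".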
How many records should there be before I consider indexing my SQL tables?
There's no good reason to forego obvious indexes (FKs, etc.) when you're creating the table. It will never noticeably affect performance to have unnecessary indexes on tiny tables, and it's good to take a first cut when your mind is into schema design. Also, some indexes serve to prevent duplicates, which can be useful regardless of table size.
I guess the proper answer to your question is that the number of records in the table should have nothing to do with when to create indexes.
I would create the indexes when I create the table. If you decide to create indexes after the table has grown to 100, 1000, or 100000 entries, it can take a lot of time and perhaps make your database unavailable while you are doing it.
Think about the table first, create the indices you think you'll need, and then move on.
In some cases you will discover that you should have indexed a column; if that's the case, fix it when you discover it.
Creating an index on a searched field is not premature optimization; it's just what should be done.
When the query time is unacceptable. Better yet, create a few indexes now that are likely to be useful, and run an EXPLAIN or EXPLAIN ANALYZE on your queries once your database is populated by representative data. If the indexes aren't helping, drop them. If there are slow queries that could benefit from more or different indexes, change the indexes.
You are not going to be locked in to an initial choice of indexes. Experiment, and make sure you measure performance!
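A sketch of that experiment loop (PostgreSQL-flavored syntax; the names are hypothetical):

CREATE INDEX idx_orders_customer ON orders (customer_id);
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
DROP INDEX idx_orders_customer;   -- drop it again if the plan never uses it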
In general I agree with the previous advice.
Always declare the referential integrity for the tables (Primary Key, Foreign Keys), column constraints (not null, check). Saves you from nightmares when apps put bad data into the tables (even in development).
I'd consider adding indexes for the common access columns (columns in your where clauses which are used in =, <> tests), as well.
Most of the modern RDBMS implementations are quite good at keeping your indexes up to date without hurting your performance. So the cost of having indexes is minimal.
Also, most RDBMSs have query plan evaluators which look at the relative costs of reaching the data rows via an index versus using some sort of table scan. So, again, the performance hits are minimal.
Two.
I'm serious. If there are two rows now, and there will always be two rows, the cost of indexing is almost zero. It's quicker to index than to ponder whether you should. It won't take the optimizer very long to figure out that scanning the table is quicker than using the index.
If there are two rows now, but there will be 200,000 in the near future, the cost of not indexing could become prohibitively high. The right time to consider indexing is now.
Having said this, remember that you get an index automatically when you declare a primary key. Creating a table with no primary key is asking for trouble in most cases. So the only time you really need to consider indexing is when you want an index other than the index on the primary key. You need to know the traffic, and the anticipated volume to make this call. If you get it wrong, you'll know, and you can reverse the decision.
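For example (a dialect-neutral sketch with a hypothetical table), declaring the primary key is what creates that automatic index:

CREATE TABLE orders (
    id          INT PRIMARY KEY,   -- backed by an automatically created (often clustered) index
    customer_id INT,
    total       DECIMAL(10, 2)
);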
I once saw a reference table that had been created with no index when it contained 20 rows. Due to a business change, this table had grown to about 900 rows, but the person who should have noticed the absence of an index didn't. The time to insert a new order had grown from about 10 seconds to 15 minutes.
As a matter of routine I perform the following on read heavy tables:
Create indexes on common join fields such as Foreign Keys when I create the table.
Check the query plan for Views or Stored Procedures and add indexes wherever a table scan is indicated.
Check the query plan for queries by my application and add indexes wherever a table scan is indicated. (and often try to make them into Stored Procedures)
On write heavy tables (like activity logs) I avoid indexes unless they are absolutely necessary. I also tend to archive such data into indexed tables at regular intervals.
It depends.
How much data is in the table? How often is data inserted? A lot of indexes can slow down insertion time. Do you always query all the rows of the table? In this case indexes probably won't help much.
Those aren't common usages, though. In most cases, you know you're going to be querying a subset of the data. On what fields? Are there common fields that are always joined on? Look at the query plans for common or typical queries; they will generally show you where the time is being spent.
If there's a unique constraint on the table (and there should be at least one), then that will usually be enforced by a unique index.
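For example (with a hypothetical table), most engines enforce such a constraint by building a unique index behind the scenes:

ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);
-- In most engines this has the same effect as a unique index on users(email).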
Otherwise, you add indexes when the query performance is bad and adding the index will demonstrably improve the performance. There are books on the subject of how to create good sets of indexes on tables, including Relational Database Index Design and the Optimizers. It will give you a lot of ideas and the reasons why they are good.
See also:
No indexes on small tables
When to create a new SQL Server index
Best Practices and Anti-Patterns in Creating Indexes
and, no doubt, a host of others.
When attempting to understand how a SQL statement is executing, it is sometimes recommended to look at the explain plan. What is the process one should go through in interpreting (making sense) of an explain plan? What should stand out as, "Oh, this is working splendidly?" versus "Oh no, that's not right."
I shudder whenever I see comments that full table scans are bad and index access is good. Full table scans, index range scans, fast full index scans, nested loops, merge joins, hash joins, etc. are simply access mechanisms that must be understood by the analyst and combined with knowledge of the database structure and the purpose of a query in order to reach any meaningful conclusion.
A full scan is simply the most efficient way of reading a large proportion of the blocks of a data segment (a table or a table (sub)partition). While it can often indicate a performance problem, that is only in the context of whether it is an efficient mechanism for achieving the goals of the query. Speaking as a data warehouse and BI guy, my number one warning flag for performance is an index-based access method and a nested loop.
So, for the mechanism of how to read an explain plan the Oracle documentation is a good guide: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/ex_plan.htm#PFGRF009
Have a good read through the Performance Tuning Guide also.
Also have a google for "cardinality feedback", a technique in which an explain plan can be used to compare the estimations of cardinality at various stages in a query with the actual cardinalities experienced during the execution. Wolfgang Breitling is the author of the method, I believe.
So, bottom line: understand the access mechanisms. Understand the database. Understand the intention of the query. Avoid rules of thumb.
This subject is too big to answer in a question like this. You should take some time to read Oracle's Performance Tuning Guide
The two examples below show a FULL scan and a FAST scan using an INDEX.
It's best to concentrate on your Cost and Cardinality. Looking at the examples the use of the index reduces the Cost of running the query.
It's a bit more complicated (and I don't have a 100% handle on it), but basically the Cost is a function of CPU and I/O cost, and the Cardinality is the number of rows Oracle expects to process. Reducing both of these is a good thing.
Don't forget that the Cost of a query can be influenced by your query itself, by the Oracle optimiser mode (e.g. COST, CHOOSE) and by how often you run your statistics.
Example 1, a FULL scan: http://docs.google.com/a/shanghainetwork.org/File?id=dd8xj6nh_7fj3cr8dx_b
Example 2, using indexes: http://docs.google.com/a/fukuoka-now.com/File?id=dd8xj6nh_9fhsqvxcp_b
And as already suggested, watch out for TABLE SCAN. You can generally avoid these.
Looking for things like sequential scans can be somewhat useful, but the reality is in the numbers... except when the numbers are just estimates! What is usually far more useful than looking at a query plan is looking at the actual execution. In Postgres, this is the difference between EXPLAIN and EXPLAIN ANALYZE. EXPLAIN ANALYZE actually executes the query, and gets real timing information for every node. That lets you see what's actually happening, instead of what the planner thinks will happen. Many times you'll find that a sequential scan isn't an issue at all, instead it's something else in the query.
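A minimal Postgres sketch (the table and filter are hypothetical):

EXPLAIN ANALYZE
SELECT * FROM orders WHERE created_at > now() - interval '1 day';
-- Compare each node's estimated "rows=" with its "actual ... rows=" value;
-- large mismatches usually point at stale or missing statistics.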
The other key is identifying what the actual expensive step is. Many graphical tools will use different sized arrows to indicate how much different parts of the plan cost. In that case, just look for steps that have thin arrows coming in and a thick arrow leaving. If you're not using a GUI you'll need to eyeball the numbers and look for where they suddenly get much larger. With a little practice it becomes fairly easy to pick out the problem areas.
Really, for issues like these, the best thing to do is ASKTOM. In particular, his answer to that question contains links to the online Oracle docs, where a lot of those sorts of rules are explained.
One thing to keep in mind, is that explain plans are really best guesses.
It would be a good idea to learn to use sqlplus, and experiment with the AUTOTRACE command. With some hard numbers, you can generally make better decisions.
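A minimal SQL*Plus sketch (the table is hypothetical):

SET AUTOTRACE ON
SELECT COUNT(*) FROM orders WHERE customer_id = 42;
-- SQL*Plus prints the execution plan and run-time statistics after the rows.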
But you should ASKTOM. He knows all about it :)
Plain EXPLAIN shows you the estimated cost of each step; EXPLAIN ANALYZE also shows how long each step actually took. The first thing is to find the steps that take a long time and understand what they mean. Things like a sequential scan tell you that you may need better indexes; the rest is mostly a matter of research into your particular database, and experience.
One "Oh no, that's not right" is often in the form of a table scan. Table scans don't utilize any special indexes and can contribute to purging of every useful in memory caches. In postgreSQL, for example, you will find it looks like this.
Seq Scan on my_table (cost=0.00..15558.92 rows=620092 width=78)
Sometimes a table scan is actually better than, say, using an index to query the rows. However, this is one of those red-flag patterns that you seem to be looking for.
Basically, you take a look at each operation and see if the operations "make sense" given your knowledge of how it should be able to work.
For example, if you're joining two tables, A and B on their respective columns C and D (A.C=B.D), and your plan shows a clustered index scan (SQL Server term -- not sure of the oracle term) on table A, then a nested loop join to a series of clustered index seeks on table B, you might think there was a problem. In that scenario, you might expect the engine to do a pair of index scans (over the indexes on the joined columns) followed by a merge join. Further investigation might reveal bad statistics making the optimizer choose that join pattern, or an index that doesn't actually exist.
Look at the percentage of time spent in each subsection of the plan, and consider what the engine is doing. For example, if it is scanning a table, consider putting an index on the field(s) it is scanning for.
I mainly look for index or table scans. This usually tells me I'm missing an index on an important column that's used in the WHERE clause or in a join condition.
From http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx:
If you see any of the following in an execution plan, you should consider them warning signs and investigate them for potential performance problems. Each of them is less than ideal from a performance perspective.
* Index or table scans: may indicate a need for better or additional indexes.
* Bookmark lookups: consider changing the current clustered index, consider using a covering index, limit the number of columns in the SELECT statement.
* Filter: remove any functions in the WHERE clause, don't include views in your Transact-SQL code, may need additional indexes.
* Sort: does the data really need to be sorted? Can an index be used to avoid sorting? Can sorting be done at the client more efficiently?
It is not always possible to avoid these, but the more you can avoid them, the faster query performance will be.
Rules of Thumb
(You probably want to read up on the details too: Oracle Docs, ASKTOM, SQL Server Docs.)
Bad
Table Scans of Several Large Tables
Good
Using a unique index
Index includes all required fields
Most Common Win
In about 90% of performance problems I have seen, the easiest win is to break up a query with lots (4 or more) of tables into 2 smaller queries and a temporary table.