How can I measure the cost of a database index?

Is there a good method for judging whether the costs of creating a database index in Postgres (slower INSERTS, time to build an index, time to re-index) are worth the performance gains (faster SELECTS)?

I am actually going to disagree with Hexist. PostgreSQL's planner is pretty good, and it supports good sequential access to table files based on physical order scans, so indexes are not necessarily going to help. Additionally, there are many cases where the planner has to pick an index. Finally, you are already creating indexes for your unique constraints and primary keys.
I think one of the good default positions with PostgreSQL (MySQL btw is totally different!) is to wait until you need an index to add one, and then only add the indexes you most clearly need. This is, however, just a starting point, and it assumes either a general lack of experience in looking at query plans or a lack of understanding of where the application is likely to go. Having experience in these areas matters.
In general, where you have tables likely to span more than 10 pages (that's 40kb of data and headers), it's a good idea to index foreign keys. These can be assumed to be clearly needed. Small lookup tables spanning 1 page should never have non-unique indexes because these indexes are never going to be used for selects (no query plan beats a sequential scan over a single page).
Beyond that point you also need to look at data distribution. Indexing boolean columns is usually a bad idea and there are better ways to index things relating to boolean searches (partial indexes being a good example). Similarly indexing commonly used function output may seem like a good idea sometimes, but that isn't always the case. Consider:
CREATE INDEX gj_transdate_year_idx ON general_journal (extract('YEAR' FROM transdate));
This will not do much. However an index on transdate might be useful if paired with a sparse index scan via a recursive CTE.
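As a rough sketch of the alternatives (only general_journal and transdate come from the example above; the voided flag is a made-up column used to illustrate the partial-index idea):
-- a plain b-tree on the raw column; the planner can use it for range scans on transdate
CREATE INDEX gj_transdate_idx ON general_journal (transdate);
-- a partial index: one of the "better ways" hinted at above for boolean-style searches,
-- assuming a hypothetical, rarely-true voided flag
CREATE INDEX gj_voided_partial_idx ON general_journal (transdate) WHERE voided;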
Once the basic indexes are in place, then the question becomes what other indexes do you need to add. This is often better left to later use case review than it is designed in at first. It isn't uncommon for people to find that performance significantly benefits from having fewer indexes on PostgreSQL.
Another major thing to consider is what sort of indexes you create, and these are often use-case specific. A b-tree index on an array record, for example, might make sense if ordinality is important to the domain and if you are frequently searching based on initial elements, but if ordinality is unimportant, I would recommend a GIN index, because a b-tree will do very little good (of course that is an atomicity red flag, but sometimes that makes sense in Pg). Even when ordinality is important, sometimes you need GIN indexes anyway because you need to be able to do commutative scans as if ordinality was not. This is true if using ip4r, for example, to store cidr blocks and using an EXCLUDE constraint to ensure that no block contains any other block (the actual scan requires using an overlap operator rather than a contain operator, since you don't know which side of the operator the violation will be found on).
Again this is somewhat database-specific. On MySQL, Hexist's recommendations would be correct, for example. On PostgreSQL, though, it's good to watch for problems.
As far as measuring goes, the best tool is EXPLAIN ANALYZE.
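For example (reusing the general_journal table from above; the date filter is arbitrary):
-- runs the query for real and reports estimated vs. actual rows and timing for every plan node
EXPLAIN ANALYZE
SELECT * FROM general_journal WHERE transdate >= DATE '2023-01-01';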

Generally speaking, unless you have a log or archive table that you won't be doing selects on very frequently (or it's OK if they take a while to run), you should index anything your select/update/delete statements will be using in a where clause.
This however is not always as simple as it seems, as just because a column is used in a where clause and is indexed doesn't mean the SQL engine will be able to use the index. Using the EXPLAIN and EXPLAIN ANALYZE capabilities of PostgreSQL, you can examine which indexes were used in selects and figure out whether having an index on a column will even help you.
This is generally true because without an index your select speed goes from some O(log n)-looking operation down to O(n), while your insert speed only improves from c·O(log n) to d·O(log n), where d is usually less than c. I.e., you may speed up your inserts a little by not having an index, but you're going to kill your select speed if the data isn't indexed, so it's almost always worth it to have an index on your data if you're going to be selecting against it.
Now, if you have some small table that you do a lot of inserts and updates on, frequently remove all the entries from, and only periodically do some selects on, it could turn out to be faster not to have any indexes. However, that would be a fairly special-case scenario, so you'd have to do some benchmarking and decide if it makes sense in your specific case.

Nice question. I'd like to add a bit more to what @hexist has already mentioned and to the info provided by @ypercube's link.
By design, the database doesn't know in which part of the table it will find data that satisfies the provided predicates. Therefore, the DB will perform a full or sequential scan of all the table's data, filtering out the needed rows.
An index is a special data structure that, for a given key, can precisely specify in which rows of the table such values will be found. The main differences when an index is involved:
there is a cost for the index scan itself, i.e. DB has to find a value in the index first;
there's an extra cost of reading specific data from the table itself.
Working with an index will lead to a random IO pattern, compared to the sequential one used in a full scan. You can google for comparison figures for random and sequential disk access, but they can differ by up to an order of magnitude (random being slower, of course).
Still, it's clear that in some cases index access will be cheaper and in others a full scan should be preferred. This depends on how many rows (out of all of them) will be returned by the specified predicate, i.e. its selectivity:
if the predicate will return a relatively small number of rows, say, less than 10% of the total, then it seems valuable to pick those directly via the index. This is a typical case for primary/unique keys or queries like: I need address information for the customer with internal number = XXX;
if the predicate has no big impact on the selectivity, i.e. if 30% (or more) of the rows are returned, then it's cheaper to do a full scan, because sequential disk access will beat random and data will be delivered faster. All reports covering big areas (like a month, or all customers) fall here;
if there's a need to obtain an ordered list of values and there's an index, then doing an index scan is the fastest option. This is a special case of #2, when you need report data ordered by some column;
if the number of distinct values in a column is relatively small compared to the total number of values, then an index will be a good choice. This case is called a Loose Index Scan, and typical queries look like: I need the 20 most recent purchases for each of the top 5 categories by number of goods.
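A hedged sketch of that last case in PostgreSQL, with made-up table and column names, assuming an index on purchases (category_id, purchased_at DESC):
-- walk the per-category index instead of scanning the whole purchases table
SELECT c.category_id, p.*
FROM top_categories AS c
CROSS JOIN LATERAL (
    SELECT *
    FROM purchases
    WHERE category_id = c.category_id
    ORDER BY purchased_at DESC
    LIMIT 20
) AS p;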
How does the DB decide what to do, an index or a full scan? This is a runtime decision based on statistics, so make sure to keep those up to date. In fact, the numbers provided above have no real-life value; you have to evaluate each query independently.
All this is a very rough description of what happens. I would very much recommend looking into How PostgreSQL Planner Uses Statistics; it's the best thing I've seen on the subject.

Related

Can Indices actually decrease SELECT performance?

After reading some stuff about indices on SQL Server and their performance advantages for selects and disadvantages for updates / inserts, I was wondering if badly used indices could actually also hurt performance for selects.
What conditions would have to be fulfilled to have an index decrease performance of a pure select query? Do such situations exist?
Thanks!
(Although I always try to include code examples, I can't think of anything that would support this question...)
Yes, albeit very slightly - so slightly that it would be justified to also answer "No".
If you have an index which might be considered for a query, but is not useable, the optimizer will waste a short time pondering whether and how to use it (in rare cases with REALLY complicated indexes and views, and more frequently when index performance hints are wrong, you might end up choosing a suboptimal query plan).
Some cases would be:
a table without indexes
a table with a badly chosen index, which gets discarded
a table where TWO indexes exist, and for some reason (e.g. obsolete statistics), the existence of the second index makes the optimizer choose it, while it would have been more convenient to use the first.
a table where the existing index (usually also thanks to obsolete statistics) tricks the optimizer into reading from the index an amount of data comparable to what could have been, more efficiently, retrieved with a full table scan; to make things worse, the index is fragmented and hashed differently than the table. What was essentially a full table scan becomes a slowed down full table scan with lots of disk thrashing.
In the first two cases the query time is the same (and entails a full scan), but in the third, you also have to analyze and discard the index. In the fourth, unlikely but possible, case an execution time that was already likely very large increases and becomes huge (update 2021-10-20: I have just done this to myself. Yay me).
Where an index is likelier to hurt you - where ALL indexes hurt you - is in inserts, deletes and updates. Then, any index not used by the update query, yet affected by same, will require a write to the index itself.
So you will want to have indexes, but as few as you can without sacrificing SELECT performance. Actually, you might decide against indexing for a rarely used SELECT query in order to avoid having the needed index constantly updated by all other UPDATE queries.
Edit: after reading Heinzi's answer, I'd also like to add that most DB servers have maintenance tools which analyze the tables and indexes (and sometimes query performance counters too), and properly update the hints of which Heinzi spoke. So it's also important to periodically "maintain" the database to keep the optimizer supplied with up-to-date information on which indexes to choose from.
Update (MySQL)
There is a very nifty MySQL analysis tool that can actually suggest improvements to the existing indexing (remove unused keys, add useful keys): common_schema. It's really worth a look.
Yes, but it's very unlikely and it should not influence your decision to use indexes.
Sometimes, the SQL Server query analyzer chooses an execution plan that's not optimal. Since the number of possible execution plans is much larger than it might seem at first sight (a simple join of n tables already produces n! possible execution plans), SQL Server has to make an educated guess. It's in the nature of guesses that they are sometimes wrong.
It's a rare occurrence, but I've seen it happen a couple of times in the past years. In that case (and only in that case), a better plan would have been chosen if the index had not been there. However, removing the index is not the correct way to solve this problem, since the index usually exists for a reason. The correct way is to add a hint to this query (and only to this query), to help the optimizer choose the right plan.
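A hedged sketch of one possible form of such a hint in SQL Server (the table, index, and predicate are made up; the point is only that the hint is scoped to this one query):
-- force this one query to use a specific index instead of the plan the optimizer guessed
SELECT OrderId, OrderDate
FROM dbo.Orders WITH (INDEX(IX_Orders_CustomerId))
WHERE CustomerId = 42;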
Yes, indexes can hurt performance for SELECTs. It is important to understand how database engines operate. Data is stored on disk(s) in "pages". Indexes make it possible to access the specific page that has a specific value in one or more columns in the table.
This is great if you are looking for specific values.
However, consider a query that needs to look at every row in a table. If you go through the table, you read the pages in order and -- critically -- you get every row on the page with a single read. The number of reads is the number of pages in the table. In addition, the page cache can optimize the reads with look-ahead reads and pages no longer being used are simply overwritten.
Using an index for the same reads goes through the table one record at a time rather than one page at a time. This results in random reads through the pages. In the worst case, there is one read per record in the table -- potentially a very significant hit to performance. In addition, the index itself occupies some of the page cache, reducing memory for other operations.
In general, the optimizer component of a SQL engine does a good job distinguishing between these two situations. One of the key metrics is the selectivity of the query. How many rows is the query returning (which the optimizer looks at with respect to the number of pages)? If the number of rows is about the same as the number of pages, the optimizer would consider a full table scan rather than an index scan.
There are definitely other considerations, but in general, an index can hurt performance of even a simple select query. In general, optimizers do a good job, but there are sometimes unusual cases that trick even the best optimizers.
My guess would be that this can happen if you create indices that confuse the query plan optimiser, which then ends up choosing an inefficient index for the query at hand.
This is potentially implementation-dependent, but in principle indexes should not slow down SELECT.
Obviously they can slow down INSERT and UPDATE.

Database indexes: A good thing, a bad thing, or a waste of time?

Adding indexes is often suggested here as a remedy for performance problems.
(I'm talking about reading & querying ONLY, we all know indexes can make writing slower).
I have tried this remedy many times, over many years, both on DB2 and MSSQL, and the results were invariably disappointing.
My finding has been that no matter how 'obvious' it was that an index would make things better, it turned out that the query optimiser was smarter, and my cleverly-chosen index almost always made things worse.
I should point out that my experiences relate mostly to small tables (<100'000 rows).
Can anyone provide some down-to-earth guidelines on choices for indexing?
The correct answer would be a list of recommendations something like:
Never/always index a table with less than/more than NNNN records
Never/always consider indexes on multi-field keys
Never/always use clustered indexes
Never/always use more than NNN indexes on a single table
Never/always add an index when [some magic condition I'm dying to learn about]
Ideally, the answer will give some instructive examples.
Indexes are kind of like chemotherapy...too much and it kills you...too little and you die...do it the wrong way and you die. You gotta know just how much, how often, and what kind to make it not kill you.
Your hardware, platform, environment, load all play a role. So to answer your questions..
Yes, possibly sometimes.
As a rule of thumb, primary keys and foreign keys need to be indexed. Usually primary keys are indexed just by defining them as such, but FKs are not in every database (they definitely are not in SQL Server; I can't really speak for other dbs). You will be using these in joins, so it is generally critical to performance to define these.
Now if you have fields you often use in where clauses, they can benefit from indexes as well providing several things:
First, the field must have a range of values. A bit field or a field with only 2 or 3 values will almost never use an index.
Second, the queries you write must be sargable. That is, they must be designed to use indexes. I suspect that if you never get performance improvements from what look like likely candidates for indexes, then you probably have queries that are not sargable. For instance, take "WHERE Name like '%Smith'" as a where clause. Without knowing the first characters, the optimizer can't use the index.
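A small illustration of the difference (the Customers table is hypothetical):
-- not sargable: the leading wildcard hides the first characters, so an index on Name can't be seeked
SELECT * FROM Customers WHERE Name LIKE '%Smith';
-- sargable: a known prefix lets the optimizer seek into an index on Name
SELECT * FROM Customers WHERE Name LIKE 'Smith%';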
Small tables rarely benefit much from indexes. If the optimizer can hold the whole thing in memory, then it is often faster to do so. If you were working with multimillion record tables, you would see that indexes are critical.
Indexing can be very complex and if you are interested in the subject, I suggest you get a good book on performance tuning your particular database and read in depth about them.
An index that's never used is a waste of disk space, as well as adding to the insert/update/delete time. It's probably best to define the clustering index first, then define additional indexes as you find yourself writing WHERE clauses.
One common index mistake I see is people wondering why a select on col2 (or col3) takes so long when the index is defined as col1 ASC, col2 ASC, col3 ASC. When you have a multiple column index, your WHERE clause must use the first column in the index, or the first and second column in the index, and so forth.
If you need to access the data by col2, then you need an additional index that's defined as col2 ASC.
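A hedged sketch of that situation (MyTable is a placeholder name):
-- composite index: usable for filters on col1, on col1 and col2, or on all three
CREATE INDEX IX_MyTable_Col1_Col2_Col3 ON MyTable (col1 ASC, col2 ASC, col3 ASC);
-- a filter on col2 alone generally cannot seek into the index above, so it needs its own index
CREATE INDEX IX_MyTable_Col2 ON MyTable (col2 ASC);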
With small domain tables, it's sometimes faster to do a table scan than it is to read rows from the table using an index. This depends on the speed of your database machine and the speed of the network.
You need indexes. Only with indexes can you access data fast enough.
To make it as short as possible:
add indexes for columns you are frequently filtering (or grouping) on (e.g. a state or name)
LIKE patterns and SQL functions can make the DBMS not use indexes.
add indexes only on columns which have many different values (e.g. not boolean fields)
It is common to add indexes to foreign keys, but it is not always needed.
don't add indexes on very short tables
never add indexes when you don't know how they should enhance performance.
Finally: look into execution plans to decide how to optimize queries.
You'll add indexes just for a single, critical query. In this case, you'll add exactly the indexes that are needed in the query in question (multi-column indexes).
Basically, as a live DB keeps collecting data, indexes have to evolve with that flow. There may be a really good index on a table, but after it grows beyond XXX records the same index on the same table is useless, and in that case it should be refactored.
The only way to have an optimized and fast DB is to monitor it all the time and refactor it over time as records come in.
A real-life example I ran into some time ago was a super fast query restricted by some time range (created_at between A and B) and a super slow query where the time range was different. Same query, same database, same application, and the only difference was the time range.
Always use clustered indexes.
In fact, you can't help but use them. The data in a table will be laid out on disk in some particular order anyway; it can't be saved as an unordered pile or something. You have the chance to specify how exactly this data will be laid out. Why waste it?
When you have a table which gets new records appended and you observe that some value in those records always grow (like StackOverflow question number), make a clustered index out of it. Then the new data will not be inserted in the middle but will basically be appended to a file on disk which is a relatively cheap operation.
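A hedged SQL Server-style sketch (table and column names are made up):
-- an ever-increasing key, so new rows are appended at the end of the clustered index
CREATE CLUSTERED INDEX IX_Questions_QuestionId ON dbo.Questions (QuestionId ASC);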
If a table is expected to be the target of a join then it is best to have a clustered index on that table so that the joins can be performed sequentially through the data pages. The columns in the clustered index will (on some DB systems) be included in all of the other indexes on that table, since those are the values that the indexes will use to reference the table data. To keep the other indexes from getting too large, the columns in the clustered index should be as narrow as possible, so it is best to use only numeric—rather than character—data types in the clustered index. In general, fewer columns are better than more columns, but notice that three int columns (12 bytes per row) are much better than one nvarchar(32) column (potentially 64 bytes per row).
If the clustered index is narrow, then a few additional indexes should not negatively impact performance very much even on very large tables.
Seems you are confusing two concepts here.
Adding indices generally can only make a read query faster, and very, very rarely (almost never) slower. Adding an index never forces the query optimizer to use it. It will only use it if it thinks it can benefit from it, and it is generally very smart about those decisions.
For inserts/updates, of course, every index hurts performance a bit more... But at the other end of the spectrum, for, say, a read-only database (like a USPS address database which is distributed monthly), in operational use there would be no inserts/updates, so the only negative impact of additional indices is the disk space they take up.
This is entirely different from specifying that the query optimizer USE an index, in effect overriding what it would do on its own... That can potentially make a query slower.
EDIT: Edited to eliminate opportunity for misinterpretation by overly literal readers.

What indexing implementations can handle arbitrary column combinations?

I am developing a little data warehouse system with a web interface where people can do filtered searches. There are currently about 50 columns that people may wish to filter on, and about 2.5 million rows. A table scan is painfully slow. The trouble is that the range of queries I'm getting have no common prefixes.
Right now I'm using sqlite3, which will only use an index if the columns required are the leftmost columns in that index. This seems to mean I'd need a lot of indexes. A quick glance at MySQL suggests it would also require many indexes for this kind of query.
My question is what indexing implementations are available for different database systems which can handle this kind of query on arbitrary combinations of columns?
I've prototyped my own indexing scheme; I store extra tables which list the integer primary keys in my big table where each value for each column occurs, and I keep enough statistics to be able to first examine the values with the smallest number of matches. It works okay; much better than a table scan but still a bit on the slow side, which is unsurprising for a first version in Python doing many SQL queries.
There are column-oriented databases that store data on a per-column basis, where every column is its own index. They are a very good fit for data warehouses, as they are extremely fast to read but fairly slow to update.
Kickfire is such an example, which is a customized MySQL engine and has held the TPC-H benchmark top crown for a number of weeks, at an impressive system cost. Note that Kickfire is an appliance, sold as a hardware box.
Infobright would be another similar example, and has a free community edition that runs on Windows and Linux.
When there are too many indexes to create for a table, I usually fall back on Full Text Search. Can't say if it will fit your scenario though.
Since data warehouses are typically optimized for reading data, not writing it, I would consider simply indexing all the columns. Yes, this will slow down putting data into the warehouse, but typically that happens during non-peak hours and only once a day or less often.
One should only consider introducing "home grown" index structures, based on SQL tables, as a last resort, i.e. if there still exist [business-wise plausible] query cases not properly handled with a traditional index setup, for example if the list of such indexes were to become too big.
A few observations
You do not necessarily need indexes that include all of the columns that may be involved in one particular query; only the [collectively] selective ones may be required.
In other words, if the query uses, for example, columns a, b, c and d, but an index on a and b exists and statistically produces only a few thousand rows, it may be acceptable not to introduce indexes with a, b and c (or d, or both), if c or d are not very plausible search criteria (used infrequently), and if their width is such that it would unduly burden the a+b index (or if there were other columns with a better fit for being "tacked on" to the a+b index).
Aside from the obvious additional demand they put on disk storage, additional indexes, while possibly helping with SELECT (read) operations, may also become an impediment with CUD (Create/Update/Delete) operations. It appears the context here is akin to a data warehouse, where few [unscheduled] CUD operations take place, but it is good to keep this in mind.
See SQLite Optimizer for valuable insight into the way SQLite determines the way a particular query is executed.
Making a list of indexes
A tentative basis for the index scheme for this application may look like this:
[A] A single column index for every column in the table (save maybe the ones which are ridiculously unselective, say a "Married" column w/ "Y/N" values in it....)
[B] A two (or three) column index for each of the likely/common use-case queries
[C] Additional two/three column indexes for the cases where some non-common query case involves a set of columns none of which is individually selective.
From this basis we then can define the actual list of indexes needed by:
Adding one (or a few) extra columns at the end of (and in a well-thought-out order...) the [B] indexes above. Typically such columns are chosen because of their relatively small width (so they do not grow the index unduly) and because they have a reasonable chance of being used in combination with the columns cited before them in the index.
Removing the [A] indexes which are generally equivalent to one or several [B] indexes, that is: indexes which start with the same column, and for which the extra columns do not burden the index much.
Reviewing the TREE of all possible (or all acceptable) cases, and marking off the branches adequately served by the indexes above. Then adding yet more indexes for the odd use cases not readily covered (if only with a partial index scan + main table lookup for an acceptable number of rows).
In this situation, I find a hand-written tree structure a useful tool to help manage the otherwise unmanageable lists of possible combinations. Assuming a maximum of 4 search criteria selected from the 50 columns indicated in the question, we have in excess of 230,000 combinations to consider... The tree helps prune this rather quickly.
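A hedged sketch of what the [A] and [B] starting points might look like (the facts table and its columns are invented for illustration):
-- [A] single-column indexes on the individually useful columns
CREATE INDEX idx_facts_customer_id ON facts (customer_id);
CREATE INDEX idx_facts_region ON facts (region);
-- [B] a two/three column index for a common query shape, with a narrow column tacked on the end
CREATE INDEX idx_facts_region_date_status ON facts (region, order_date, status);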

No indexes on small tables?

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Donald Knuth). My SQL tables are unlikely to contain more than a few thousand rows each (and those are the big ones!). SQL Server Database Engine Tuning Advisor dismisses the amount of data as irrelevant. So I shouldn't even think about putting explicit indexes on these tables. Correct?
The value of indexes is in speeding reads. For instance, if you are doing lots of SELECTs based on a range of dates in a date column, it makes sense to put an index on that column. And of course, generally you add indexes on any column you're going to be JOINing on with any significant frequency. The efficiency gain is also related to the ratio of the size of your typical recordsets to the number of records (i.e. grabbing 20/2000 records benefits more from indexing than grabbing 90/100 records). A lookup on an unindexed column is essentially a linear search.
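For instance (a minimal sketch with made-up names):
-- speeds up the date-range SELECT below, at the cost of extra work on each INSERT
CREATE INDEX idx_orders_order_date ON orders (order_date);

SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';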
The cost of indexes comes on writes, because every INSERT also requires an internal insert to each column index.
So, the answer depends entirely on your application -- if it's something like a dynamic website where the number of reads can be 100x or 1000x the writes, and you're doing frequent, disparate lookups based on data columns, indexing may well be beneficial. But if writes greatly outnumber reads, then your tuning should focus on speeding those queries.
It takes very little time to identify and benchmark a handful of your app's most frequent operations both with and without indexes on the JOIN/WHERE columns, so I suggest you do that. It's also smart to monitor your production app and identify the most expensive and most frequent queries, and focus your optimization efforts on the intersection of those two sets of queries (which could mean indexes or something totally different, like allocating more or less memory for query or join caches).
Knuth's wise words are not applicable to the creation (or not) of indexes, since by adding indexes you are not optimising anything directly: you are providing an index that the DBMSs optimiser may use to optimise some queries. In fact, you could better argue that deciding not to index a small table is premature optimisation, as by doing so you restrict the DBMS optimiser's options!
Different DBMSs will have different guidelines for choosing whether or not to index columns based on various factors including table size, and it is these that should be considered.
What is an example of premature optimisation in databases: "denormalising for performance" before any benchmarking has indicated that the normalised database actually has any performance issues.
Primary key columns will be indexed for the unique constraint. I would still index all foreign key columns. The optimizer can choose to ignore your index if it is irrelevant.
If you only have a little bit of data then the extra cost for insert/update should not be significant either.
Absolutely incorrect. 100% incorrect. Don't put a million pointless indexes, but you do want a Primary Key (in most cases), and you do want it CLUSTERED correctly.
Here's why:
SELECT * FROM MySmallTable  -- No worries... Index won't help

SELECT *
FROM MyBigTable
INNER JOIN MySmallTable ON ...  -- Ahh, now I'm glad I have my index.
Here's a good rule to go by.
"Since I have a TABLE, I'm likely going to want to query it at some time... If I'm going to query it, I'm likely going to do so in a consistent way..." <-- That's how you should index the table.
EDIT: I'm adding this line: If you have a concrete example in mind, I'll show you how to index it, and how much of a savings you'll get from doing so. Please supply a table, and an example of how you plan in using that table.
I suggest that you follow the usual rules about indexing, which approximately means "create indexes on those columns that you use in your queries".
This might sound unnecessary with such a small database. As others have already said: as long as your database stays as small as you have described, the queries will be fast enough anyway, and the indexes aren't really needed. They can even slow down insertions and updates, but unless you have very specific requirements there, it doesn't matter with such a small database.
But, if the database grows (which databases sometimes have a tendency to do) you don't have to remember to add indexes to that old database that you've probably forgotten about by then. Maybe it has even been installed at one of your customers, and you can't modify it!
I guess what I'm saying is this: indexes should be such a natural part of your database design, that it is the lack of indexes that is the optimization, premature or not.
It depends. Is the table a reference table?
There are tables of a thousand rows where the absence of an index, and the resulting table scans, can make the difference between a fairly simple operation delaying the user by 5 minutes instead of 5 seconds. I have seen exactly this problem, using a DBMS other than SQL Server.
Generally, if the table is a reference table, updates on it will be relatively rare. This means that the performance hit for updating the index will also be relatively rare. If the optimizer passes over the index, the performance hit on the optimizer will be negligible. The space needed to store the index will also be negligible.
If you declare a primary key, you should get an automatic index on that key. That automatic index will almost always do you enough good to justify its cost. Leave it in there. If you create a reference table without a primary key, there are other problems in your design methodology.
If you do frequent searches or frequent joins on some set of columns other than the primary key, an additional index might pay for itself. Don't fix that problem unless it is a problem.
Here's the general rule of thumb: go with the default behavior of the DBMS, unless you find a reason not to. Anything else is a premature preoccupation with optimization on your part.
If the rows have narrow width, and a few thousand rows fit on say 10-20 8K pages, it is unlikely that the SQL optimiser would elect to use an index even if you create one.
Put indexes ONLY if you have to :)
There are times when putting indexes can actually hurt performance, depending on what the table is used for...
So, in other words, you would think about putting indexes on tables when it is necessary as determined by profiling the application.
Indexes are often created implicitly when using UNIQUE constraints. I wouldn't try to avoid their use in that case!
As a general rule of thumb, it's good to avoid indexes on smaller tables as they typically won't be used.
But sometimes they can provide a huge boost as I outlined here.
I guess there is automatic indexing on the primary key of the table, which should be sufficient when querying a table with little data.
So yes, explicit indexes can be avoided in case there is a small data set to be worked upon.
Even if you have an index, SQL Server might not even use it, depending on the statistics for that table. And if you plan to put in an index for a report that will run at most a couple times a year, bear in mind that the INSERT/UPDATE penalties for adding the index will be in effect ALL THE TIME. Before adding an index, ask yourself if it is worth the performance penalty.
You have to understand that, depending on the query, two lookups may be done: one into the index to get the pointer to the row, and a second into the row itself. If the data being queried is in the index columns, that extra step may not be necessary.
It is entirely possible that double dipping for data may be slower even if the optimizer goes after the index. Whether or not we care is up to application profiling and eventual explain plans.
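A hedged sketch of skipping that second lookup with a covering index (SQL Server syntax; all names are made up):
-- the INCLUDEd column lets this query be answered from the index alone,
-- avoiding the extra trip to the table rows
CREATE INDEX IX_Orders_CustomerId_Covering
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate);

SELECT CustomerId, OrderDate
FROM dbo.Orders
WHERE CustomerId = 42;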

How do you interpret a query's explain plan?

When attempting to understand how a SQL statement is executing, it is sometimes recommended to look at the explain plan. What is the process one should go through in interpreting (making sense) of an explain plan? What should stand out as, "Oh, this is working splendidly?" versus "Oh no, that's not right."
I shudder whenever I see comments that full table scans are bad and index access is good. Full table scans, index range scans, fast full index scans, nested loops, merge joins, hash joins etc. are simply access mechanisms that must be understood by the analyst and combined with a knowledge of the database structure and the purpose of a query in order to reach any meaningful conclusion.
A full scan is simply the most efficient way of reading a large proportion of the blocks of a data segment (a table or a table (sub)partition), and, while it often can indicate a performance problem, that is only in the context of whether it is an efficient mechanism for achieving the goals of the query. Speaking as a data warehouse and BI guy, my number one warning flag for performance is an index based access method and a nested loop.
So, for the mechanism of how to read an explain plan the Oracle documentation is a good guide: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/ex_plan.htm#PFGRF009
Have a good read through the Performance Tuning Guide also.
Also have a google for "cardinality feedback", a technique in which an explain plan can be used to compare the estimations of cardinality at various stages in a query with the actual cardinalities experienced during the execution. Wolfgang Breitling is the author of the method, I believe.
So, bottom line: understand the access mechanisms. Understand the database. Understand the intention of the query. Avoid rules of thumb.
This subject is too big to answer in a question like this. You should take some time to read Oracle's Performance Tuning Guide
The two examples below show a FULL scan and a FAST scan using an INDEX.
It's best to concentrate on your Cost and Cardinality. Looking at the examples, the use of the index reduces the Cost of running the query.
It's a bit more complicated (and I don't have a 100% handle on it), but basically the Cost is a function of CPU and IO cost, and the Cardinality is the number of rows Oracle expects to parse. Reducing both of these is a good thing.
Don't forget that the Cost of a query can be influenced by your query and the Oracle optimiser model (eg: COST, CHOOSE etc) and how often you run your statistics.
Example 1:
SCAN http://docs.google.com/a/shanghainetwork.org/File?id=dd8xj6nh_7fj3cr8dx_b
Example 2 using Indexes:
INDEX http://docs.google.com/a/fukuoka-now.com/File?id=dd8xj6nh_9fhsqvxcp_b
And as already suggested, watch out for TABLE SCAN. You can generally avoid these.
Looking for things like sequential scans can be somewhat useful, but the reality is in the numbers... except when the numbers are just estimates! What is usually far more useful than looking at a query plan is looking at the actual execution. In Postgres, this is the difference between EXPLAIN and EXPLAIN ANALYZE. EXPLAIN ANALYZE actually executes the query, and gets real timing information for every node. That lets you see what's actually happening, instead of what the planner thinks will happen. Many times you'll find that a sequential scan isn't an issue at all, instead it's something else in the query.
The other key is identifying what the actual expensive step is. Many graphical tools will use different sized arrows to indicate how much different parts of the plan cost. In that case, just look for steps that have thin arrows coming in and a thick arrow leaving. If you're not using a GUI you'll need to eyeball the numbers and look for where they suddenly get much larger. With a little practice it becomes fairly easy to pick out the problem areas.
Really for issues like these, the best thing to do is ASKTOM. In particular his answer to that question contains links to the online Oracle doc, where a lot of those sorts of rules are explained.
One thing to keep in mind, is that explain plans are really best guesses.
It would be a good idea to learn to use sqlplus, and experiment with the AUTOTRACE command. With some hard numbers, you can generally make better decisions.
But you should ASKTOM. He knows all about it :)
The output of the explain tells you how long each step has taken. The first thing is to find the steps that have taken a long time and understand what they mean. Things like a sequential scan tell you that you need better indexes - it is mostly a matter of research into your particular database and experience.
One "Oh no, that's not right" is often in the form of a table scan. Table scans don't utilize any special indexes and can contribute to purging of every useful in memory caches. In postgreSQL, for example, you will find it looks like this.
Seq Scan on my_table (cost=0.00..15558.92 rows=620092 width=78)
Sometimes table scans are preferable to, say, using an index to query the rows. However, this is one of those red-flag patterns that you seem to be looking for.
Basically, you take a look at each operation and see if the operations "make sense" given your knowledge of how it should be able to work.
For example, if you're joining two tables, A and B on their respective columns C and D (A.C=B.D), and your plan shows a clustered index scan (SQL Server term -- not sure of the oracle term) on table A, then a nested loop join to a series of clustered index seeks on table B, you might think there was a problem. In that scenario, you might expect the engine to do a pair of index scans (over the indexes on the joined columns) followed by a merge join. Further investigation might reveal bad statistics making the optimizer choose that join pattern, or an index that doesn't actually exist.
Look at the percentage of time spent in each subsection of the plan and consider what the engine is doing. For example, if it is scanning a table, consider putting an index on the field(s) it is scanning for.
I mainly look for index or table scans. This usually tells me I'm missing an index on an important column that's in the where statement or join statement.
From http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx:
If you see any of the following in an execution plan, you should consider them warning signs and investigate them for potential performance problems. Each of them is less than ideal from a performance perspective.
* Index or table scans: May indicate a need for better or additional indexes.
* Bookmark Lookups: Consider changing the current clustered index, consider using a covering index, limit the number of columns in the SELECT statement.
* Filter: Remove any functions in the WHERE clause, don't include views in your Transact-SQL code, may need additional indexes.
* Sort: Does the data really need to be sorted? Can an index be used to avoid sorting? Can sorting be done at the client more efficiently?
It is not always possible to avoid these, but the more you can avoid them, the faster query performance will be.
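As a hedged illustration of the Filter item above (dbo.Events is a made-up table):
-- the function wrapped around the column tends to show up as a Filter and blocks an index seek
SELECT * FROM dbo.Events WHERE YEAR(EventDate) = 2024;
-- rewritten as a range predicate, an index on EventDate can be used instead
SELECT * FROM dbo.Events
WHERE EventDate >= '2024-01-01' AND EventDate < '2025-01-01';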
Rules of Thumb
(you probably want to read up on the details too:
Oracle Docs
ASKTOM
SQL Server Docs
)
Bad
Table Scans of Several Large Tables
Good
Using a unique index
Index includes all required fields
Most Common Win
In about 90% of performance problems I have seen, the easiest win is to break up a query with lots (4 or more) of tables into 2 smaller queries and a temporary table.
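A hedged sketch of that pattern in SQL Server syntax (all names are invented):
-- stage the small, selective part of the work into a temp table first
SELECT CustomerId, Region
INTO #ActiveCustomers
FROM dbo.Customers
WHERE IsActive = 1;

-- then join the large table against the much smaller temp table
SELECT o.OrderId, o.OrderDate, a.Region
FROM dbo.Orders AS o
INNER JOIN #ActiveCustomers AS a ON a.CustomerId = o.CustomerId;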