I heard this question during a job interview, and the interviewer said the answer is yes. My question is why, and could someone give an example of an index that makes a search slower instead of faster?
Yes, it can.
An additional index adds possible execution plans for a query if applicable. The Postgres query planner estimates the cost of a variety of possible plans, and the plan with the cheapest estimate wins. Since those are estimates, actual execution can always deviate: a chosen query plan using your new index can turn out to be slower than another plan without it.
If your server is configured properly (cost and resource settings, current column statistics, ...), this outcome is unlikely, but still possible. It can happen for almost any query, is more likely for complex queries, and some types of queries are notoriously hard to estimate.
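A rough way to see this for yourself (the orders table and customer_id column here are made up for illustration) is to compare the chosen plan against the rejected alternative by temporarily discouraging one scan type in your session:

-- whatever plan the planner picks by default
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- discourage the sequential scan so the index-based plan shows up
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
RESET enable_seqscan;

If the plan that uses the index reports a higher actual time than the sequential-scan plan, you have a concrete case of an index slowing a query down.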
Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Also, indexes always add write cost, so if your database is write-heavy and the machine is already saturated, more indexes can bring overall performance down.
A trivial example would be on a table with very few rows.
An index search has to load the index into memory and then look up the original data. If a table has only a few rows, they probably fit onto one data page. So, a full table scan requires loading one page.
Any index search (on a cold cache) requires loading two data pages -- one for the index and one for the data. That can be (significantly) longer than just scanning the rows on a single page.
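A minimal sketch of that case in PostgreSQL (the table, its contents, and the row count are invented for illustration):

-- a tiny table whose rows all fit on a single data page
CREATE TABLE tiny (id int PRIMARY KEY, val text);
INSERT INTO tiny SELECT g, 'row ' || g FROM generate_series(1, 20) AS g;
ANALYZE tiny;

-- despite the primary-key index on id, a sequential scan is typically chosen here
EXPLAIN SELECT * FROM tiny WHERE id = 7;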
On a large table, if the "search" returns a significant proportion of the rows in the table, then an index search ends up fetching the rows in an order different from how they are stored. If the data pages do not fit in memory, then you have a situation called thrashing, which means that there is a high probability that each new row will be a cache miss.
When creating indexes on PostgreSQL tables, prefixing an SQL command with EXPLAIN ANALYZE shows which indexes are used.
For example:
EXPLAIN ANALYZE SELECT A,B,C FROM MY_TABLE WHERE C=123;
Returns:
Seq Scan on public.my_table (cost=...) <- No index, BAD
And, after creating the index, it would return:
Index Scan using my_index_name on public.my_table (cost=...) <- Index, GOOD
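For completeness, the index in this example might be created along these lines (the name and column follow the sample query above):

CREATE INDEX my_index_name ON my_table (c);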
However, for some queries that used the same index on a table with a few hundred records, it didn't make any difference. Reading through the documentation, it is recommended to either run ANALYZE or keep the autovacuum daemon on. That way the database knows the size of the tables and can choose query plans properly.
Is this absolutely necessary in a production environment? In other words, will PostgreSQL use the index when it should, without needing ANALYZE or VACUUM as an extra task?
Short answer: "just run autovacuum." Long answer... yes, because statistics can get out of date.
Let's talk about indexes and how/when PostgreSQL decides to use them.
PostgreSQL gets a query in, parses it, and then begins the planning process. How are we going to scan the tables? How are we going to join them and in what order? These are not trivial decisions and trying to find the generally best ways to do things typically means that PostgreSQL needs to know something about the tables.
The first thing to note is that indexes are not always a win. No plan ever beats a sequential scan through a one-page table, and even a 5 page table will almost always be faster with a sequential scan than an index scan. So PostgreSQL cannot safely decide to "use all available indexes."
So the way PostgreSQL decides whether to use an index is to check statistics. Now, these go out of date, which is why you want autovacuum to be updating them. You say your table has a few hundred records and the statistics were probably out of date. If PostgreSQL cannot say that the index is a win, it won't use it. A few hundred records is going to be approaching "an index might help" territory depending on how selective the index is in weeding out records.
In your large table, there was probably no question based on existing statistics that the index would help. In your smaller table, there probably was a question and it got answered one way based on the stats it had, and a different way based on newer stats.
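If you suspect the statistics are stale, you can refresh them by hand and check when the table was last analyzed (my_table is just a placeholder name):

-- refresh the planner statistics for one table by hand
ANALYZE my_table;

-- see when statistics were last gathered, manually or by autovacuum
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'my_table';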
When is an Oracle explain plan considered good?
I'm trying to refactor a DB schema, and there are many queries in views and packages that are very slow.
For example, this is one of the most horrible queries, and it gives me this explain plan:
Plan
ALL_ROWS  Cost: 18,096  Bytes: 17  Cardinality: 1
I'm not asking how to fix the query, just how to judge whether an explain plan is good. Thanks!!
Before considering the result of an explain plan, we need to understand the following terminology:
• Cardinality – Estimate of the number of rows coming out of each of the operations.
• Access method – The way in which the data is being accessed, via either a table scan or index access.
• Join method – The method (e.g., hash, sort-merge, etc.) used to join tables with each other.
• Join type – The type of join (e.g., outer, anti, semi, etc.).
• Join order – The order in which the tables are joined to each other.
• Partition pruning – Are only the necessary partitions being accessed to answer the query?
• Parallel execution – In case of parallel execution, is each operation in the plan being conducted in parallel? Is the right data redistribution method being used?
By reviewing the four key elements of cardinality estimation, access methods, join methods, and join order, you can determine whether the execution plan is the best available plan.
This white paper will help you: http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/twp-explain-the-explain-plan-052011-393674.pdf
The cost estimate is Oracle's educated guess at how many blocks it will need to visit in order to answer your query. Is 18,096 a good number? That depends on what you are doing, how fast your server is, and how quickly you need it to run. There is little meaning in this number as an absolute value.
If you change the SQL or the indexes and the cost estimate goes down, that is a good sign, but what really matters is how long the query takes when it actually runs. Oracle can estimate badly at times.
Having said all that, it looks a bit high for something that runs while a user waits, but reasonable for a batch job.
(Screenshot of the query execution plan: http://img502.imageshack.us/img502/7245/75088152.jpg)
There are two tables that I join together; one of them is a temp table, and I create an index on it after creating the table. But the query execution plan above still reports a missing index.
What should I consider in order to convert all scan operations to seek operations? There are parts involving joins and WHERE conditions...
Regards
bk
The "Missing index" hint that is displayed is your best starting point. SQL Server has detected you would get better performance by adding the index that it tells you to.
It's difficult to be specific, as we really need to know what your SELECT statement is; a number of things could cause a scan to be done instead of a seek.
As an example, I recently blogged about how the structure of your WHERE clause for (e.g.) date-filtered queries can turn seeks into scans; in this instance, the thing to look out for is the use of functions within the WHERE clause.
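As a hedged sketch of that point (the Orders table and its columns are invented), wrapping an indexed column in a function typically forces a scan, while an equivalent range predicate lets the optimizer seek:

-- likely to scan: the function on the indexed column hides it from the optimizer
SELECT OrderID FROM Orders WHERE YEAR(OrderDate) = 2020;

-- likely to seek: an equivalent sargable range on the same column
SELECT OrderID FROM Orders
WHERE OrderDate >= '20200101' AND OrderDate < '20210101';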
When attempting to understand how a SQL statement is executing, it is sometimes recommended to look at the explain plan. What is the process one should go through in interpreting (making sense) of an explain plan? What should stand out as, "Oh, this is working splendidly?" versus "Oh no, that's not right."
I shudder whenever I see comments that full table scans are bad and index access is good. Full table scans, index range scans, fast full index scans, nested loops, merge joins, hash joins, etc. are simply access mechanisms that must be understood by the analyst and combined with a knowledge of the database structure and the purpose of a query in order to reach any meaningful conclusion.
A full scan is simply the most efficient way of reading a large proportion of the blocks of a data segment (a table or a table (sub)partition), and, while it often can indicate a performance problem, that is only in the context of whether it is an efficient mechanism for achieving the goals of the query. Speaking as a data warehouse and BI guy, my number one warning flag for performance is an index based access method and a nested loop.
So, for the mechanism of how to read an explain plan the Oracle documentation is a good guide: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/ex_plan.htm#PFGRF009
Have a good read through the Performance Tuning Guide also.
Also have a google for "cardinality feedback", a technique in which an explain plan can be used to compare the estimations of cardinality at various stages in a query with the actual cardinalities experienced during the execution. Wolfgang Breitling is the author of the method, I believe.
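One way to gather the actual row counts alongside the estimates in Oracle is the gather_plan_statistics hint together with DBMS_XPLAN; a sketch (the query itself is just a placeholder):

SELECT /*+ gather_plan_statistics */ *
FROM   my_table
WHERE  my_column = 123;

-- compare E-Rows (estimated) with A-Rows (actual) for each plan step
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));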
So, bottom line: understand the access mechanisms. Understand the database. Understand the intention of the query. Avoid rules of thumb.
This subject is too big to answer in a question like this. You should take some time to read Oracle's Performance Tuning Guide
The two examples below show a FULL scan and a FAST scan using an INDEX.
It's best to concentrate on your Cost and Cardinality. Looking at the examples, the use of the index reduces the Cost of running the query.
It's a bit more complicated (and I don't have a 100% handle on it), but basically the Cost is a function of CPU and I/O cost, and the Cardinality is the number of rows Oracle expects to return. Reducing both of these is a good thing.
Don't forget that the Cost of a query can be influenced by your query and the Oracle optimiser model (e.g. COST, CHOOSE, etc.) and how often you gather your statistics.
Example 1 (full scan): http://docs.google.com/a/shanghainetwork.org/File?id=dd8xj6nh_7fj3cr8dx_b
Example 2 (using an index): http://docs.google.com/a/fukuoka-now.com/File?id=dd8xj6nh_9fhsqvxcp_b
And as already suggested, watch out for TABLE SCAN. You can generally avoid these.
Looking for things like sequential scans can be somewhat useful, but the reality is in the numbers... except when the numbers are just estimates! What is usually far more useful than looking at a query plan is looking at the actual execution. In Postgres, this is the difference between EXPLAIN and EXPLAIN ANALYZE. EXPLAIN ANALYZE actually executes the query, and gets real timing information for every node. That lets you see what's actually happening, instead of what the planner thinks will happen. Many times you'll find that a sequential scan isn't an issue at all, instead it's something else in the query.
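For example (table and column names are arbitrary):

EXPLAIN SELECT * FROM my_table WHERE some_column = 42;          -- estimates only, query not run
EXPLAIN ANALYZE SELECT * FROM my_table WHERE some_column = 42;  -- runs the query, reports actual time and rows per node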
The other key is identifying what the actual expensive step is. Many graphical tools will use different sized arrows to indicate how much different parts of the plan cost. In that case, just look for steps that have thin arrows coming in and a thick arrow leaving. If you're not using a GUI you'll need to eyeball the numbers and look for where they suddenly get much larger. With a little practice it becomes fairly easy to pick out the problem areas.
Really for issues like these, the best thing to do is ASKTOM. In particular, his answer to that question contains links to the online Oracle docs, where a lot of those sorts of rules are explained.
One thing to keep in mind, is that explain plans are really best guesses.
It would be a good idea to learn to use sqlplus, and experiment with the AUTOTRACE command. With some hard numbers, you can generally make better decisions.
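A minimal AUTOTRACE session might look like this (the query is a placeholder; TRACEONLY suppresses the actual result rows):

SET AUTOTRACE TRACEONLY EXPLAIN STATISTICS
SELECT * FROM my_table WHERE my_column = 123;
SET AUTOTRACE OFF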
But you should ASKTOM. He knows all about it :)
The output of the explain tells you how long each step has taken. The first thing is to find the steps that have taken a long time and understand what they mean. Things like a sequential scan tell you that you need better indexes - it is mostly a matter of research into your particular database and experience.
One "Oh no, that's not right" is often in the form of a table scan. Table scans don't utilize any special indexes and can contribute to purging of every useful in memory caches. In postgreSQL, for example, you will find it looks like this.
Seq Scan on my_table (cost=0.00..15558.92 rows=620092 width=78)
Sometimes a table scan is actually preferable to, say, using an index to fetch the rows. However, this is one of the red-flag patterns that you seem to be looking for.
Basically, you take a look at each operation and see if the operations "make sense" given your knowledge of how it should be able to work.
For example, if you're joining two tables, A and B, on their respective columns C and D (A.C=B.D), and your plan shows a clustered index scan (SQL Server term -- not sure of the Oracle term) on table A, then a nested loop join to a series of clustered index seeks on table B, you might think there was a problem. In that scenario, you might expect the engine to do a pair of index scans (over the indexes on the joined columns) followed by a merge join. Further investigation might reveal bad statistics making the optimizer choose that join pattern, or an index that doesn't actually exist.
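A sketch of that scenario, using the names from the description above (SQL Server syntax; refreshing statistics is just one plausible follow-up to investigate):

-- the join from the description above; inspect its actual execution plan
SELECT *
FROM A
JOIN B ON A.C = B.D;

-- if nested loops with repeated seeks on B appear where a merge join over two
-- index scans seems more natural, stale statistics are one common culprit
UPDATE STATISTICS A;
UPDATE STATISTICS B;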
Look at the percentage of time spent in each subsection of the plan, and consider what the engine is doing. For example, if it is scanning a table, consider putting an index on the field(s) it is scanning for.
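For instance (table and column names invented), if the plan shows a scan of a table filtered on a single column:

-- the plan shows a scan of orders filtering on customer_id
CREATE INDEX ix_orders_customer_id ON orders (customer_id);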
I mainly look for index or table scans. This usually tells me I'm missing an index on an important column that's in the WHERE clause or a join condition.
From http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx:
If you see any of the following in an execution plan, you should consider them warning signs and investigate them for potential performance problems. Each of them is less than ideal from a performance perspective.
* Index or table scans: May indicate a need for better or additional indexes.
* Bookmark lookups: Consider changing the current clustered index, consider using a covering index, limit the number of columns in the SELECT statement.
* Filter: Remove any functions in the WHERE clause, don't include views in your Transact-SQL code, may need additional indexes.
* Sort: Does the data really need to be sorted? Can an index be used to avoid sorting? Can sorting be done at the client more efficiently?
It is not always possible to avoid these, but the more you can avoid them, the faster query performance will be.
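For the bookmark-lookup item in the list above, a covering index is the usual fix; a hedged sketch in SQL Server syntax (all names invented):

-- the included columns ride along in the index leaf, so the lookup back to the clustered index is avoided
CREATE NONCLUSTERED INDEX ix_orders_customer_covering
ON orders (customer_id)
INCLUDE (order_date, total_amount);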
Rules of Thumb
(You probably want to read up on the details too: Oracle Docs, ASKTOM, SQL Server Docs.)
Bad: table scans of several large tables.
Good: using a unique index; an index that includes all required fields.
Most common win:
In about 90% of performance problems I have seen, the easiest win is to break up a query with lots (4 or more) of tables into 2 smaller queries and a temporary table.
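A hedged sketch of that pattern in SQL Server syntax (all table and column names are invented):

-- step 1: materialize the selective part of the join into a temp table
SELECT o.order_id, o.customer_id, o.order_date
INTO #recent_orders
FROM orders AS o
WHERE o.order_date >= '20240101';

-- step 2: join the now-small temp table to the remaining tables
SELECT r.order_id, c.customer_name, p.product_name
FROM #recent_orders AS r
JOIN customers AS c ON c.customer_id = r.customer_id
JOIN order_items AS oi ON oi.order_id = r.order_id
JOIN products AS p ON p.product_id = oi.product_id;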