Find out which queries benefit from an existing index - sql

Is there a way to find out which queries benefit from a particular index?
I have used the DMV views and I know the index is being used in production but it would be great if there was a way to get a list of the queries positively impacted so I can make a decision if each index is worth keeping.
EDIT: I am using SQL Server
Thanks for your help!

Speaking from Oracle point of view: in Oracle, I can inspect query plan which gives enough information to guess whether an index was used or not. Remember that optimizer makes decisions based on the SQL at hand. There is no hard and fast rule re permanent use or non-use of an index. So, even if you find out that an index is being used or not, you can [almost] always modify the query so that the opposite is true!
Speaking of positive impact: again, it will only tell you how things are at this moment. For example, a table doesn't have enough records and a full table scan may be faster than using an index (due to overhead involved). But what if the situation changes (e.g. lot more records are inputted into that table)?
Bottom line: hard to make these decisions on permanent basis just by looking at what optimizer decided today, or what statistics are maintained by DB at this moment. Your knowledge of the data, its design and structure, and how it is being queried will be the real key on making these decisions.
My guess is that you are asking this question because you have lots of indices and you would like to get rid of a few. Unless the data changes rapidly, there is little overhead in maintaining indices (storage is cheap!). If that is the case, let's hope that optimizer is smart enough to make decision about using or not using an index based on cost... :-)

Related

Is a Data-filled SQL table queryable while setting up a new index?

Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but it's effects on performance is nearly impossible for us to say. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, if your system has the IO/CPU capabilities of handling both tasks at the same time. Inserting however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.

Performance Tuning PostgreSQL

Keep in mind that I am a rookie in the world of sql/databases.
I am inserting/updating thousands of objects every second. Those objects are actively being queried for at multiple second intervals.
What are some basic things I should do to performance tune my (postgres) database?
It's a broad topic, so here's lots of stuff for you to read up on.
EXPLAIN and EXPLAIN ANALYZE is extremely useful for understanding what's going on in your db-engine
Make sure relevant columns are indexed
Make sure irrelevant columns are not indexed (insert/update-performance can go down the drain if too many indexes must be updated)
Make sure your postgres.conf is tuned properly
Know what work_mem is, and how it affects your queries (mostly useful for larger queries)
Make sure your database is properly normalized
VACUUM for clearing out old data
ANALYZE for updating statistics (statistics target for amount of statistics)
Persistent connections (you could use a connection manager like pgpool or pgbouncer)
Understand how queries are constructed (joins, sub-selects, cursors)
Caching of data (i.e. memcached) is an option
And when you've exhausted those options: add more memory, faster disk-subsystem etc. Hardware matters, especially on larger datasets.
And of course, read all the other threads on postgres/databases. :)
First and foremost, read the official manual's Performance Tips.
Running EXPLAIN on all your queries and understanding its output will let you know if your queries are as fast as they could be, and if you should be adding indexes.
Once you've done that, I'd suggest reading over the Server Configuration part of the manual. There are many options which can be fine-tuned to further enhance performance. Make sure to understand the options you're setting though, since they could just as easily hinder performance if they're set incorrectly.
Remember that every time you change a query or an option, test and benchmark so that you know the effects of each change.
Actually there are some simple rules which will get you in most cases enough performance:
Indices are the first part. Primary keys are automatically indexed. I recommend to put indices on all foreign keys. Further put indices on all columns which are frequently queried, if there are heavily used queries on a table where more than one column is queried, put an index on those columns together.
Memory settings in your postgresql installation. Set following parameters higher:
.
shared_buffers, work_mem, maintenance_work_mem, temp_buffers
If it is a dedicated database machine you can easily set the first 3 of these to half the ram (just be carefull under linux with shared buffers, maybe you have to adjust the shmmax parameter), in any other cases it depends on how much ram you would like to give to postgresql.
http://www.postgresql.org/docs/8.3/interactive/runtime-config-resource.html
http://wiki.postgresql.org/wiki/Performance_Optimization
The absolute minimum I'll recommend is the EXPLAIN ANALYZE command. It will show a breakdown of subqueries, joins, et al., all the time showing the actual amount of time consumed in the operation. It will also alert you to sequential scans and other nasty trouble.
It is the best way to start.
Put fsync = off in your posgresql.conf, if you trust your filesystem, otherwise each postgresql operation will be imediately written to the disk (with fsync system call).
We have this option turned off on many production servers since quite 10 years, and we never had data corruptions.

How important is indexing and clustered indexing to database performance?

There have been several questions recently about database indexing and clustered indexing and it has been kind of new to me until the last couple weeks. I was wondering how important it is and what kind of performance gains can be expected from creating them.
Edit: What is usually the best type of fields to look at when putting in a clustered index when you are first starting out?
Very veryA(G,G) important. In my opinion, wise indexing is the absolute most important thing in DB performance optimization.
This is not an easy topic to cover in a single answer. Good indexing requires knowledge of queries going to happen on the database, making a large number of trade-offs and understanding the implication of a specific index in the specific DB engine. But it's very important nevertheless.
EDIT: Basically, clustered indexes usually should have short lengths. They should be created on queries which reflect a range. They should not have duplicate entries. But these guidelines are very general and by no means the right thing. The right thing is to analyze the queries that are gonna be executed. Carefully benchmarking and analyzing execution plans and understanding what is the best way to do it. This requires years of experience and knowledge and by no means it's something to explain in a single paragraph. It's the primary thing that makes DB experts expert (It's not the only thing, but it's primitive to other important things, such as concurrency issues, availability, ...)!
Indexing: extremely important. Having the wrong indexes makes queries harder, sometimes to the point they can't be completed in a sensible time.
Indexes also impact insert performance and disc usage (negatively), so keeping lots of superfluous indexes around on large tables is a bad idea too.
Clustering is something worth thinking about, I think it's really dependent on the behaviour of the specific database. If you can cluster your data correctly, you can dramatically reduce the amount of IOPs required to satisfy requests for rows not in memory.
Without proper indexes, you force the RDBMS to do table scans to query for anything. Terribly inefficient.
I'd also infer that you don't have primary keys, which is a cardinal sin in relational design.
Indexing is very important when the table contains many rows.
With a few rws, performance is better without indexes.
With larger tables indexes are very important to get good performance.
It is not easy to defined them. Clustered means that the data are stored in the clustered index order.
To get good hints of indexes you could use Toad
Indexing is vitally important.
The right index for a query can improve performance so dramatically it can seem like witchcraft.
As the other answers have said, indexing is crucial.
As you might infer from other answers, clustered indexing is much less crucial.
Decent indexing gives you first order performance gains - orders of magnitude are common.
Clustered indexing is a second order or incremental performance gain - usually giving small (<100%) percentages of performance increase.
(We also get into questions of 'what is a 100% performance gain'; I'm interpreting the percentage as ((oldtime - newtime)/newtime) * 100, so if the old time is 10 seconds and the new time is 5 seconds, the performance increase is 100%.)
Different DBMS have different interpretations of what a clustered index means. Beware.
In particular, some DBMS cluster the data once and thereafter, the clustering decays over time until the data is reclustered. Others take a more active view of clustering, I believe.
The clustered index is ususally but not always your primary key. One way of looking at a clustered index is to think of the data being physically ordered based on the values of the clustered index.
This may very well not be the case in reality however refrencing clustered indexes ususally gets you the following performance bonuses anyway:
All columns of the table are accessable for free when resolved from a clustered index hit as if they were contained within a covering index. (A query resolvable using just the index data without having to refrence the data pages of the table itself)
Update operations can be made directly against a clustered index without intermediate processing. If you are doing a lot of updates against a table you ususally want to be refrencing the clustered columns.
Depending on implementation there may be a sequential access benefit where data stored on disk is retreived quicker with fewer expensive disk seek operations.
Depending on implementation there may be free index benefit where a physical index is not necessary as data access can be resolved via simple guessing game algorithms.
Don't count on #3 and especially #4. #1 and #2 are ususally safe bets on most RDBMS platforms.

What are "SQL-Hints"?

I am an advocate of ORM-solutions and from time to time I am giving a workshop about Hibernate.
When talking about framework-generated SQL, people usually start talking about how they need to be able to use "hints", and this is supposedly not possible with ORM frameworks.
Usually something like: "We tried Hibernate. It looked promising in the beginning, but when we let it loose on our very very complex production database it broke down because we were not able to apply hints!".
But when asked for a concrete example, the memory of those people is suddenly not so clear any more ...
I usually feel intimidated, because the whole "hints"-topic sounds like voodoo to me...
So can anybody enlighten me? What is meant by SQL-hints or DB-Hints?
The only thing I know, that is somehow "hint-like" is SELECT ... FOR UPDATE. But this is supported by the Hibernate-API...
A SQL statement, especially a complex one, can actually be executed by the DB engine in any number of different ways (which table in the join to read first, which index to use based on many different parameters, etc).
An experienced dba can use hints to encourage the DB engine to choose a particular method when it generates its execution plan. You would only normally need to do this after extensive testing and analysis of the specific queries (because the DB engines are usually pretty darn good at figuring out the optimum execution plan).
Some MSSQL-specific discussion and syntax here:
http://msdn.microsoft.com/en-us/library/ms181714.aspx
Edit: some additional examples at http://geeks.netindonesia.net/blogs/kasim.wirama/archive/2007/12/31/sql-server-2005-query-hints.aspx
Query hints are used to guide the query optimiser when it doesn't produce sensible query plans by default. First, a small background in query optimisers:
Database programming is different from pretty much all other software development because it has a mechanical component. Disk seeks and rotational latency (waiting fora particular sector to arrive under the disk head) are very expensive in comparison to CPU. Different query resolution strategies will result in different amounts of I/O, often radically different amounts. Getting this right or wrong can make a major difference to the performance of the query. For an overview of query optimisation, see This paper.
SQL is declarative - you specify the logic of the query and let the DBMS figure out how to resolve it. A modern cost-based query optimiser (some systems, such as Oracle also have a legacy query optimiser retained for backward compatibility) will run a series of transformations on the query. These maintain semantic equivalence but differ in the order and choice of operations. Based on statistics collected on the tables (sizes, distribution histograms of keys) the optimiser computes an estimate of the amount of work needed for each query plan. It selects the most efficient plan.
Cost-based optimisation is heuristic, and is dependent on accurate statistics. As query complexity goes up the heuristics can produce incorrect plans, which can potentially be wildly inefficient.
Query hints can be used in this situation to force certain strategies in the query plan, such as a type of join. For example, on a query that usually returns very small result sets you may wish to force a nested loops join. You may also wish to force a certain join order of tables.
O/R mappers (or any tool that generates SQL) generates its own query, which will typically not have hinting information. In the case that this query runs inefficiently you have limited options, some of which are:
Examine the indexing on the tables. Possibly you can add an index. Some systems (recent versions of Oracle for example) allow you index joins across more than one table.
Some database management systems (again, Oracle comes to mind) allow you to manually associate a query plan with a specific query string. Query plans are cached by a hash value of the query. If the queries are paramaterised the base query string is constant and will resolve to the same hash value.
As a last resort, you can modify the database schema, but this is only possible if you control the application.
If you control the SQL you can hint queries. In practice it's fairly uncommon to actually need to do this. A more common failure mode on O/R mappers with complex database schemas is they can make it difficult to express complex query predicates or do complex operations over large bodies of data.
I tend to advocate using the O/R mapper for the 98% of work that it's suited for and dropping to stored procedures where they are the appropriate solution. If you really need to hint a query than this might be the appropriate strategy. Unless there is something unusual about your application (for example some sort of DSS) you should
only need to escape from the O/R mapper on a minority of situations. You might also
find (again, an example would be DSS tools working with the data in aggregate) that an O/R mapper is not really the appropriate strategy for the application.
While HINTS do as the other answers describe, you should only use them in rare, researched circumstances. 9 times out of 10 a HINT will result in a poor query plan. Unless you really know what you are doing, don't use them.
There is no such thing as "optimized SQL code", because SQL code is never executed.
SQL code is translated into an execution plan by the Optimizer. The Optimizer will use the information it has to choose (among other things).
the order in which tables are involved
the join method for each involved table (nested/merge/hash)
how to access a table's data (direct table access/ index with bookmark lookup/direct index access) (scan/seek)
should parallelism be used, and when to end parallelism (gather streams)
Query hints allow a programmer to over-ride (in most cases) or suggest politely (in other cases) the optimizer's choices.
Query hints can let you force off parallelism, force all joins to be implemented as nested loop, force one index to be used over another... as a few examples.
Since the optimizer is really good, if one over-rides the optimizer, one is generally asking for a non-optimal plan. Query hints are best served when the optimizer does not have the required information to make a good choice.
One place I use query hints is for table variables. Table variables are assumed to have 0 rows by the Optimizer, and so the Optimizer always joins table variables using nested loop (the best join implementation for small numbers of rows). If I have a large table variable - already ordered in a favorable way for merge join, I can specify a merge join be used by applying a query hint.
All modern RDBMS-es have some sort of query optimizer that calculates best query plan, which is sequence of read/write operations needed to execute SQL query.
Sometimes plans can be suboptimal, so RDBMS designers included "hints" in SQL. Hints are instructions you can embed in your SQL that affect query optimizer, With hints you can instruct query optimizer e.g. which indexes it should use, in what order data should be read from tables, ...
So, with hints you can resolve some bottlenecks that the query optimizer cannot solve by itself.
For example, here is list of Oracle hints.

Is it possible to have an indexed view in MySQL?

I found a posting on the MySQL forums from 2005, but nothing more recent than that. Based on that, it's not possible. But a lot can change in 3-4 years.
What I'm looking for is a way to have an index over a view but have the table that is viewed remain unindexed. Indexing hurts the writing process and this table is written to quite frequently (to the point where indexing slows everything to a crawl). However, this lack of an index makes my queries painfully slow.
I don't think MySQL supports materialized views which is what you would need, but it wouldn't help you in this situation anyway. Whether the index is on the view or on the underlying table, it would need to be written and updated at some point during an update of the underlying table, so it would still cause the write speed issues.
Your best bet would probably be to create summary tables that get updated periodically.
Have you considered abstracting your transaction processing data from your analytical processing data so that they can both be specialized to meet their unique requirements?
The basic idea being that you have one version of the data that is regularly modified, this would be the transaction processing side and requires heavy normalization and light indexes so that write operations are fast. A second version of the data is structured for analytical processing and tends to be less normalized and more heavily indexed for fast reporting operations.
Data structured around analytical processing is generally built around the cube methodology of data warehousing, being composed of fact tables that represent the sides of the cube and dimension tables that represent the edges of the cube.
Flexviews supports materialized views in MySQL by tracking changes to underlying tables and updating the table which functions as a materialized view. This approach means that SQL supported by the view is a bit restricted (as the change logging routines have to figure out which tables it should track for changes), but as far as I know this is the closest you can get to materialized views in MySQL.
Do you only want one indexed view? It's unlikely that writing to a table with only one index would be that disruptive. Is there no primary key?
If each record is large, you might improve performance by figuring out how to shorten it. Or shorten the length of the index you need.
If this is a write-only table (i.e. you don't need to do updates), it can be deadly in MySQL to start archiving it, or otherwise deleting records (and index keys), requiring the index to start filling (reusing) slots from deleted keys, rather than just appending new index values. Counterintuitive, but you're better off with a larger table in this case.