In Oracle specifically, and possibly other platforms, what is the difference between indexes and extended statistics? They seem to be constructed in a similar fashion and to perform the same function. There must be some core differences; can anyone provide details?
Hmmm . . . They seem quite different to me.
An index is a copy of data in one or more columns in a table (perhaps with expressions) structured to speed access or to enforce a unique constraint. The index can be directly used to return values from these columns (or expressions).
Part of the process of creating and maintaining an index provides statistics about the underlying distributions of values. The optimizer can take advantage both of the data in the index and the information about distributions. However, the main purpose of indexes is either to provide an alternative, faster access path or to enforce uniqueness constraints.
Statistics (and extended statistics) describe properties of one or more columns. These properties are used by the optimizer to choose the best algorithm for running the query. The most important property is cardinality (the number of distinct values), although skew can also be important.
Statistics are not used to directly return values in result sets. They only affect the optimizer. Indexes can be used to return values; information gathered in the creation of indexes can also be used by the optimizer to define the best execution plan.
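To make the distinction concrete, here is a minimal sketch. It assumes SQLite (bundled with Python) as a stand-in, since Oracle's own statistics machinery is not shown in this thread; the table, column and index names are made up. The index can answer the query by itself, while the gathered statistics are only summary counts for the optimizer.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO orders (status, note) VALUES (?, ?)",
    [("status_%d" % (i % 10), "n/a") for i in range(1000)],
)
conn.execute("CREATE INDEX ix_orders_status ON orders (status)")

# The index holds a copy of the column, so the query can be answered from it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT status FROM orders WHERE status = 'status_3'"
).fetchall()
plan_text = " | ".join(row[3] for row in plan)  # 4th field is the plan detail
print(plan_text)

# Statistics are summaries for the optimizer: counts, never row values.
conn.execute("ANALYZE")
stats = conn.execute(
    "SELECT idx, stat FROM sqlite_stat1 WHERE tbl = 'orders'"
).fetchall()
print(stats)
```

The plan mentions a covering index (values come straight out of the index), whereas the statistics table only stores numbers such as the row count and the average number of rows per distinct value.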
First off, at the moment I'm not looking for alternate suggestions, just a yes or a no, and if it's a yes, what the name is.
Are there any SQL DBMSs that allow you to create "spatial" indexes using arbitrary (i.e. non-geometric) data types like integers, dates, etc.? While spatial indexes are most commonly used for location data, they can also be used to properly index queries where you need to search within two or more ranges.
For example (and this is just a made-up example), suppose you had a database of customer receipts and wanted to find all transactions between $10 and $1000 that took place between 2000-01-01 and 2005-03-01. The fact that you're searching within multiple ranges means that regular b-tree indexes cannot be used to perform this lookup efficiently, at least not in a way that's scalable.
Now yes, for the specific example I provided, and probably any other case, you could likely come up with some tricks to do it efficiently using the b-tree indexes, or at the very least narrow it down; I'm well aware, but again, not looking for alternate suggestions, just a no, or a yes and the name.
Appreciate any help you all can provide
EDIT: Just to clarify; I'm using the term spatial index as this is the most common term for it as well as the most commonly implemented use case. I am however referring to any index which uses quadtrees, r-trees, etc to achieve the same or similar effect.
On one of my Informix tables, there are two indexes that have only 1 different column out of three. Here are the indexes:
CREATE INDEX informix.ix_1
ON informix.test(date, operator, rn);
CREATE INDEX informix.ix_2
ON informix.test(choice, date, operator);
date is of type DATE
operator is CHAR(3)
choice is INTEGER
rn is INTEGER
Is it smart to drop these two indexes and replace them with a single combined one, like this:
CREATE INDEX informix.ix_new
ON informix.test(date, operator, rn, choice);
Without knowing what queries you run against that table, it isn't clear whether any of those indexes is useful. The size of the table, both its width (number of columns and their types) and its length (number of rows), also factors into the equation. Since the indexes are not unique, they're not present to enforce a key constraint.
If you always specify an exact date, the ix_1 index can be used. If you also specify an operator, it will be more useful (more restrictive); if you also specify the rn, it will be most useful. If you don't specify the date, ix_1 won't be used.
Similarly, with ix_2, if you always specify a choice, the index can be used; if you also specify the operator and date, it will be more useful (more restrictive).
With the new index, the 'choice' column will really only help if you always specify the date, operator and rn.
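The leading-column behaviour described above can be checked with a query plan. This is only a sketch: it uses SQLite (bundled with Python) rather than Informix, whose optimizer may differ in detail, and the 'payload' column is a hypothetical addition so that no index covers the whole row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: SQLite stands in for Informix; 'payload' is a made-up extra column.
conn.execute(
    "CREATE TABLE test (date TEXT, operator TEXT, rn INTEGER, "
    "choice INTEGER, payload TEXT)"
)
conn.execute("CREATE INDEX ix_1 ON test (date, operator, rn)")
conn.execute("CREATE INDEX ix_2 ON test (choice, date, operator)")

def plan(where):
    """Return the query-plan text for SELECT * with the given WHERE clause."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM test WHERE " + where
    ).fetchall()
    return " | ".join(r[3] for r in rows)

by_date = plan("date = '2024-01-01'")
by_date_operator = plan("date = '2024-01-01' AND operator = 'abc'")
by_rn_only = plan("rn = 42")

print(by_date)           # searches ix_1 on its leading column
print(by_date_operator)  # searches ix_1 on two leading columns
print(by_rn_only)        # no leading column supplied: plain table scan
```

Running the same three predicates against a single index on (date, operator, rn, choice) is a quick way to see what the combined index would and would not support.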
Remember that (non-unique) indexes represent a trade-off. They have to be maintained, so when you add a new row, or update one of the indexed columns in an existing row, or delete a row, each of the indexes also has to be modified appropriately. If the indexes are frequently used in between changes, then the cost of maintenance can more than pay for itself in the speed-up of the queries. If you seldom query the table or the indexes are never used with the queries you run against the table, then the indexes are just storage overhead (and a marginal optimization overhead as they have to be studied to see if they can help with the query — but this is a second-order effect). If the indexes speed up queries, their maintenance cost is not a problem. If they're never used, they're so much wasted effort.
Unique indexes typically help enforce a database constraint, and are subject to different considerations, though there are many similarities. If some combination of columns must be unique, the index serves a purpose even if it is never used in any query (though it is likely that the index will be used).
All of this applies pretty much to any database that uses indexes. There are whole books written on the design of indexing schemes for particular database designs.
Index design is based on the queries you need the index to support.
The order of columns in the index matters, and this also has to do with the queries you need the index to support.
You can't determine if it's smart to combine your indexes until you analyze your queries.
I don't use Informix regularly, but I see there is a chapter in the Informix Performance Guide about Queries and the query optimizer. You should read that guide to get more tips about how to analyze your queries.
What is the difference between using Indexes in SQL vs Using the ORDER BY clause?
From what I understand, indexes arrange the specified column(s) in an ordered manner that helps the query engine look through the tables quickly (and hence prevents a table scan).
My question - why can't the query engine simply use the ORDER BY for improving performance?
Thanks !
You tagged this sql-server-2008, but the question has nothing to do with SQL Server specifically. It applies to all databases.
From wikipedia:
Indexing is a technique some storage engines use for improving
database performance. The many types of indexes share the common
property that they reduce the need to examine every entry when running
a query. In large databases, this can reduce query time/cost by orders
of magnitude. The simplest form of index is a sorted list of values
that can be searched using a binary search with an adjacent reference to the location of the entry, analogous to the index in the back of a book. The same data can have multiple indexes (an employee database could be indexed by last name and hire date).
From a related thread in StackExchange
In the SQL world, order is not an inherent property of a set of data.
Thus, you get no guarantees from your RDBMS that your data will come
back in a certain order -- or even in a consistent order -- unless you
query your data with an ORDER BY clause.
So why are indexes necessary?
Note the point in the Wikipedia quote about reducing the need to examine every entry. In the absence of an index, when an ORDER BY is issued in SQL, every entry needs to be examined, which increases the amount of work the query has to do.
ORDER BY is applied only when reading. A single column may be used in indexes in which case there could be several different kinds of ordering in sql query requests. It is not possible to define the indexes unless we understand how the query requests are made.
A lot of the time, indexes are added once new query patterns emerge, in order to keep the queries performant. That means index creation is driven by how you define your ORDER BY (and the rest of the query) in SQL.
The query engine, which processes your SQL with or without an ORDER BY, defines the execution plan and does not know how the data is stored. The data it retrieves may come partly from memory, if it was cached, and partly or fully from disk. When reading from disk, the storage engine uses the indexes to locate the data quickly.
ORDER BY affects the performance of a query when reading. An index affects the performance of all the Create, Read, Update and Delete operations.
A query engine may choose to use an index or totally ignore the index based on the data characteristics.
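Here is a small sketch of how the two interact. It assumes SQLite (bundled with Python) as the engine, and the table and index names are made up. ORDER BY states the order you want; an index is one way the engine can deliver that order without sorting at read time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: SQLite as a stand-in engine; names are hypothetical.
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)", [("Carol",), ("Alice",), ("Bob",)]
)

def plan(sql):
    """Return the query-plan text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[3] for r in rows)

query = "SELECT name FROM people ORDER BY name"

before = plan(query)   # no index: rows are sorted at read time
conn.execute("CREATE INDEX ix_people_name ON people (name)")
after = plan(query)    # index present: rows are read in index order

print(before)
print(after)
names = [r[0] for r in conn.execute(query)]
print(names)
```

The result is the same either way; only the work needed to produce it changes, and the index is paid for on every write.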
Is there a good method for judging whether the costs of creating a database index in Postgres (slower INSERTS, time to build an index, time to re-index) are worth the performance gains (faster SELECTS)?
I am actually going to disagree with Hexist. PostgreSQL's planner is pretty good, and it supports good sequential access to table files based on physical order scans, so indexes are not necessarily going to help. Additionally, there are many cases where the planner has to pick an index, and you are already creating indexes for unique constraints and primary keys.
I think one of the good default positions with PostgreSQL (MySQL btw is totally different!) is to wait until you need an index to add one and then only add the indexes you most clearly need. This is, however, just a starting point, and it assumes either a general lack of experience in looking at query plans or a lack of understanding of where the application is likely to go. Having experience in these areas matters.
In general, where you have tables likely to span more than 10 pages (that's 80kB of data and headers at PostgreSQL's default 8kB page size), it's a good idea to index foreign keys. These can be assumed to be clearly needed. Small lookup tables spanning 1 page should never have non-unique indexes because these indexes are never going to be used for selects (no query plan beats a sequential scan over a single page).
Beyond that point you also need to look at data distribution. Indexing boolean columns is usually a bad idea and there are better ways to index things relating to boolean searches (partial indexes being a good example). Similarly indexing commonly used function output may seem like a good idea sometimes, but that isn't always the case. Consider:
CREATE INDEX gj_transdate_year_idx ON general_journal (extract('YEAR' FROM transdate));
This will not do much. However an index on transdate might be useful if paired with a sparse index scan via a recursive CTE.
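As a sketch of the partial-index idea mentioned above: instead of indexing a boolean column, index another column only for the rows where the boolean has the interesting value. The example below assumes SQLite (bundled with Python), whose partial-index syntax resembles PostgreSQL's; the invoices table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: SQLite stands in for PostgreSQL; all names are made up.
conn.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, due_date TEXT, "
    "paid INTEGER, memo TEXT)"
)
conn.executemany(
    "INSERT INTO invoices (due_date, paid, memo) VALUES (?, ?, ?)",
    [("2024-01-%02d" % (i % 28 + 1), 0 if i % 50 == 0 else 1, "")
     for i in range(1000)],
)

# Index only the unpaid rows (2% of the table) rather than the boolean itself.
conn.execute("CREATE INDEX ix_unpaid_due ON invoices (due_date) WHERE paid = 0")

query = "SELECT id, memo FROM invoices WHERE paid = 0 AND due_date < '2024-01-15'"
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_text = " | ".join(r[3] for r in plan_rows)
unpaid_early = conn.execute(query).fetchall()

print(plan_text)
print(len(unpaid_early))
```

The index stays small because paid rows never enter it, and it is only considered for queries whose WHERE clause implies paid = 0.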
Once the basic indexes are in place, then the question becomes what other indexes do you need to add. This is often better left to later use case review than it is designed in at first. It isn't uncommon for people to find that performance significantly benefits from having fewer indexes on PostgreSQL.
Another major thing to consider is what sort of indexes you create, and these are often use-case specific. A b-tree index on an array record, for example, might make sense if ordinality is important to the domain and if you are frequently searching based on initial elements, but if ordinality is unimportant, I would recommend a GIN index, because a btree will do very little good (of course that is an atomicity red flag, but sometimes that makes sense in Pg). Even when ordinality is important, sometimes you need GIN or GiST indexes anyway because you need to be able to do commutative scans as if ordinality was not. This is true if using ip4r, for example, to store cidr blocks and using an EXCLUDE constraint to ensure that no block contains any other block (the actual scan requires using an overlap operator rather than a contain operator since you don't know which side of the operator the violation will be found on).
Again this is somewhat database-specific. On MySQL, Hexist's recommendations would be correct, for example. On PostgreSQL, though, it's good to watch for problems.
As far as measuring goes, the best tool is EXPLAIN ANALYZE.
Generally speaking, unless you have a log or archive table that you won't be doing selects on very frequently (or it's ok if they take a while to run), you should index anything your select/update/delete statements will be using in a where clause.
This, however, is not always as simple as it seems: just because a column is used in a where clause and is indexed doesn't mean the SQL engine will be able to use the index. Using the EXPLAIN and EXPLAIN ANALYZE capabilities of PostgreSQL, you can examine which indexes were used in selects, which helps you figure out whether having an index on a column will help at all.
This is generally true because without an index your select speed degrades from something like O(log n) to O(n), while your insert speed only improves from c·O(log n) to d·O(log n), where d is usually less than c. In other words, you may speed up your inserts a little by not having an index, but you're going to kill your select speed, so it's almost always worth having an index on data you're going to be selecting against.
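To put rough numbers on the O(n) versus O(log n) gap, here is a toy sketch in plain Python. It is not how a database stores data: a sorted list with binary search is just a simplified stand-in for an index (a real b-tree has a much higher fan-out and needs even fewer steps).

```python
def linear_probes(sorted_keys, target):
    """Count comparisons for a scan without an index: check entries in turn."""
    probes = 0
    for key in sorted_keys:
        probes += 1
        if key == target:
            break
    return probes

def binary_probes(sorted_keys, target):
    """Count comparisons for a binary search, a stand-in for an index lookup."""
    lo, hi, probes = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == target:
            break
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes

keys = list(range(1_000_000))
scan_cost = linear_probes(keys, keys[-1])    # worst case: the last entry
index_cost = binary_probes(keys, keys[-1])   # at most about log2(n) + 1 steps
print(scan_cost, index_cost)
```

For a million keys the scan needs a million comparisons in the worst case, while the binary search needs no more than twenty.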
Now, if you have some small table that you do a lot of inserts and updates on, and frequently remove all the entries, and only periodically do some selects, it could turn out to be faster to not have any indexes.. however that would be a fairly special case scenario, so you'd have to do some benchmarking and decide if it made sense in your specific case.
Nice question. I'd like to add a bit more to what #hexist has already mentioned and to the info provided by #ypercube's link.
By design, a database doesn't know in which part of the table it will find data that satisfies the provided predicates. Therefore, the DB will perform a full or sequential scan of all the table's data, filtering out the needed rows.
An index is a special data structure that, for a given key, can specify precisely in which rows of the table such values will be found. The main differences when an index is involved:
there is a cost for the index scan itself, i.e. DB has to find a value in the index first;
there's an extra cost of reading specific data from the table itself.
Working with an index leads to a random IO pattern, compared to the sequential one used in a full scan. You can google for comparison figures of random and sequential disk access, but they might differ by up to an order of magnitude (random being slower, of course).
Still, it's clear that in some cases index access will be cheaper and in others a full scan should be preferred. This depends on how many rows (out of all) will be returned by the specified predicate, i.e. on its selectivity:
1. if the predicate will return a relatively small number of rows, say, less than 10% of the total, then it is worthwhile to pick those directly via the index. This is a typical case for primary/unique keys or queries like: I need address information for the customer with internal number = XXX;
2. if the predicate has no big impact on the selectivity, i.e. if 30% (or more) of the rows are returned, then it's cheaper to do a full scan, because sequential disk access will beat random access and data will be delivered faster. All reports covering big areas (like a month, or all customers) fall here;
3. if there's a need to obtain an ordered list of values and there's an index, then doing an index scan is the fastest option. This is a special case of #2, when you need report data ordered by some column;
4. if the number of distinct values in the column is relatively small compared to the total number of values, then the index will be a good choice. This is a case called Loose Index Scan, and typical queries will be like: I need the 20 most recent purchases for each of the top 5 categories by number of goods.
How does the DB decide what to do, index or full scan? This is a runtime decision and it is based on the statistics, so make sure to keep those up to date. In fact, the numbers provided above have no real-life value; you have to evaluate each query independently.
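For reference, selectivity as used above is simply the fraction of rows a predicate keeps. A minimal sketch of measuring it by hand, assuming SQLite (bundled with Python) and a made-up purchases table; a real planner estimates this from statistics instead of counting:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: hypothetical table; SQLite used only so the example runs anywhere.
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "month INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases (customer_id, month) VALUES (?, ?)",
    [(i, i % 2 + 1) for i in range(1000)],
)

def selectivity(predicate):
    """Fraction of rows in 'purchases' that satisfy the predicate (0.0 to 1.0)."""
    matching = conn.execute(
        "SELECT COUNT(*) FROM purchases WHERE " + predicate
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]
    return matching / total

one_customer = selectivity("customer_id = 123")  # 1 row out of 1000
one_month = selectivity("month = 1")             # half of the table
print(one_customer, one_month)
```

The first predicate is the kind that favours an index; the second keeps half the table and would normally be served by a full scan.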
All this is a very rough description of what happens. I would very much recommend looking into How PostgreSQL Planner Uses Statistics; this is the best I've seen on the subject.
Adding indexes is often suggested here as a remedy for performance problems.
(I'm talking about reading & querying ONLY, we all know indexes can make writing slower).
I have tried this remedy many times, over many years, both on DB2 and MSSQL, and the results were invariably disappointing.
My finding has been that no matter how 'obvious' it was that an index would make things better, it turned out that the query optimiser was smarter, and my cleverly-chosen index almost always made things worse.
I should point out that my experiences relate mostly to small tables (<100'000 rows).
Can anyone provide some down-to-earth guidelines on choices for indexing?
The correct answer would be a list of recommendations something like:
Never/always index a table with less than/more than NNNN records
Never/always consider indexes on multi-field keys
Never/always use clustered indexes
Never/always use more than NNN indexes on a single table
Never/always add an index when [some magic condition I'm dying to learn about]
Ideally, the answer will give some instructive examples.
Indexes are kind of like chemotherapy...too much and it kills you...too little and you die...do it the wrong way and you die. You gotta know just how much, how often, and what kind to make it not kill you.
Your hardware, platform, environment, load all play a role. So to answer your questions..
Yes, possibly sometimes.
As a rule of thumb, primary keys and foreign keys need to be indexed. Usually primary keys are indexed just by defining them as such, but FKs are not in every database (they definitely are not in SQL Server, I can't really speak for other dbs). You will be using these in joins, so it is generally critical to performance to define these.
Now if you have fields you often use in where clauses, they can benefit from indexes as well providing several things:
First the field must have a range of values. A bit field or a field with only 2 or 3 values will almost never use an index.
Second the queries you write must be sargable. That is they must be designed to use indexes. I suspect if you never get performance improvements from what look like likely candidates for indexes, then you probably have queries that are not sargable. For instance take "WHERE Name like '%Smith'" as a where clause. Without knowing the first characters, the optimizer can't use the index.
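A quick way to see sargability is to compare query plans. The sketch below assumes SQLite (bundled with Python) as a stand-in engine and a hypothetical people table. The range predicate is what a prefix search boils down to and can seek into the index, while the leading-wildcard LIKE and the function-wrapped column cannot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: SQLite stand-in; 'city' is a made-up extra column so that the
# index does not cover the whole row.
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX ix_people_name ON people (name)")

def plan(where):
    """Return the query-plan text for SELECT * with the given WHERE clause."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM people WHERE " + where
    ).fetchall()
    return " | ".join(r[3] for r in rows)

# Sargable: a range on the bare column (names starting with 'Smith').
range_plan = plan("name >= 'Smith' AND name < 'Smiti'")
# Not sargable: the first characters are unknown.
suffix_plan = plan("name LIKE '%Smith'")
# Not sargable: the column is wrapped in a function.
func_plan = plan("upper(name) = 'SMITH'")

print(range_plan)
print(suffix_plan)
print(func_plan)
```

Only the first plan is an index search; the other two fall back to scanning every row.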
Small tables rarely benefit much from indexes. If the optimizer can hold the whole thing in memory, then it is often faster to do so. If you were working with multimillion record tables, you would see that indexes are critical.
Indexing can be very complex and if you are interested in the subject, I suggest you get a good book on performance tuning your particular database and read in depth about them.
An index that's never used is a waste of disk space, as well as adding to the insert/update/delete time. It's probably best to define the clustering index first, then define additional indexes as you find yourself writing WHERE clauses.
One common index mistake I see is people wondering why a select on col2 (or col3) takes so long when the index is defined as col1 ASC, col2 ASC, col3 ASC. When you have a multiple column index, your WHERE clause must use the first column in the index, or the first and second column in the index, and so forth.
If you need to access the data by col2, then you need an additional index that's defined as col2 ASC.
With small domain tables, it's sometimes faster to do a table scan than it is to read rows from the table using an index. This depends on the speed of your database machine and the speed of the network.
You need indexes. Only with indexes can you access data fast enough.
To make it as short as possible:
add indexes for columns you frequently filter (or group) by (e.g. a state or name).
LIKE and SQL functions can keep the DBMS from using indexes.
add indexes only on columns which have many different values (e.g. no boolean fields)
It is common to add indexes to foreign keys, but it is not always needed.
don't add indexes in very short tables
never add indexes when you don't know how they should enhance performance.
Finally: look into execution plans to decide how to optimize queries.
You'll add indexes just for a single, critical query. In this case, you'll add exactly the indexes that are needed in the query in question (multi-column indexes).
Basically, while a DB is alive and collecting data, its indexes have to evolve with that flow. There may be a really good index on a table, but once the table grows beyond XXX records the same index may become useless, in which case it should be refactored.
The only way to keep a DB optimized and fast is to monitor it all the time and refactor it as records come in.
A real-life example I ran into some time ago was a super fast query restricted to some time range (created_at between A and B) and a super slow query where the time range was different. Same query, same database, same application; the only difference was the time range.
Always use clustered indexes.
In fact, you can't help but use them. The data in a table will be laid out on disk in some particular order anyway; it can't be saved as a pile or something. You have the chance to specify exactly how this data will be laid out. Why burn it?
When you have a table which gets new records appended and you observe that some value in those records always grow (like StackOverflow question number), make a clustered index out of it. Then the new data will not be inserted in the middle but will basically be appended to a file on disk which is a relatively cheap operation.
If a table is expected to be the target of a join then it is best to have a clustered index on that table so that the joins can be performed sequentially through the data pages. The columns in the clustered index will (on some DB systems) be included in all of the other indexes on that table, since those are the values that the indexes will use to reference the table data. To keep the other indexes from getting too large, the columns in the clustered index should be as narrow as possible, so it is best to use only numeric—rather than character—data types in the clustered index. In general, fewer columns are better than more columns, but notice that three int columns (12 bytes per row) are much better than one nvarchar(32) column (potentially 64 bytes per row).
If the clustered index is narrow, then a few additional indexes should not negatively impact performance very much even on very large tables.
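The width argument is easy to put into numbers. The following back-of-the-envelope sketch uses the byte sizes from the paragraph above (4-byte ints, 2 bytes per nvarchar character); the row count and the number of secondary indexes are made-up assumptions, and real storage overhead will differ.

```python
INT_BYTES = 4             # assumption: 4-byte int column
NVARCHAR_CHAR_BYTES = 2   # assumption: 2 bytes per nvarchar character

def key_overhead_bytes(key_bytes, rows, secondary_indexes):
    """Bytes spent repeating the clustered key in every secondary index entry."""
    return key_bytes * rows * secondary_indexes

three_ints = 3 * INT_BYTES                   # 12 bytes per row
one_nvarchar32 = 32 * NVARCHAR_CHAR_BYTES    # up to 64 bytes per row

rows, secondary_indexes = 10_000_000, 4      # hypothetical table
narrow = key_overhead_bytes(three_ints, rows, secondary_indexes)
wide = key_overhead_bytes(one_nvarchar32, rows, secondary_indexes)

print(three_ints, one_nvarchar32)
print(narrow / 1e6, "MB vs", wide / 1e6, "MB")
```

Under these assumptions the narrow key costs 480 MB across the secondary indexes, while the wide key can cost up to 2,560 MB.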
Seems you are confusing two concepts here.
Adding indices generally can only make a read query faster, and very, very rarely (almost never) slower. Adding an index never forces the query optimizer to use it. It will only use it if it thinks it can benefit from it, and it is generally very smart about those decisions.
For inserts/updates, of course, every index hurts performance a bit more... But at the other end of the spectrum, for, say, a read-only database (like a USPS address database which is distributed monthly), in operational use there would be no inserts/updates, so the only negative impact of additional indices is the disk space they take up.
This is entirely different from specifying that the query optimizer USE an index, in effect overriding what it would do on its own... That can potentially make a query slower.
EDIT: Edited to eliminate opportunity for misinterpretation by overly literal readers.