There have been several questions recently about database indexing and clustered indexing and it has been kind of new to me until the last couple weeks. I was wondering how important it is and what kind of performance gains can be expected from creating them.
Edit: What is usually the best type of fields to look at when putting in a clustered index when you are first starting out?
Very, very important. In my opinion, wise indexing is the absolute most important thing in DB performance optimization.
This is not an easy topic to cover in a single answer. Good indexing requires knowledge of queries going to happen on the database, making a large number of trade-offs and understanding the implication of a specific index in the specific DB engine. But it's very important nevertheless.
EDIT: Basically, clustered indexes should usually have short keys. They should be created on columns that are queried by range. They should not have duplicate entries. But these guidelines are very general and by no means always the right thing. The right thing is to analyze the queries that are going to be executed, carefully benchmark them, study the execution plans and work out the best way to serve them. This requires years of experience and knowledge, and it is by no means something to explain in a single paragraph. It's the primary thing that makes DB experts expert (it's not the only thing, but it's a prerequisite to other important things, such as concurrency issues, availability, ...)!
Indexing: extremely important. Having the wrong indexes makes queries harder, sometimes to the point they can't be completed in a sensible time.
Indexes also impact insert performance and disc usage (negatively), so keeping lots of superfluous indexes around on large tables is a bad idea too.
Clustering is worth thinking about, though I think it depends heavily on the behaviour of the specific database. If you can cluster your data correctly, you can dramatically reduce the number of I/O operations required to satisfy requests for rows not in memory.
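To make the disk-usage cost of an index concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine. The table `events`, the index `idx_events_user` and the helper `pages_used` are made up for illustration, and the exact page counts will differ between engines and versions; only the direction of the difference is the point.

```python
import os
import sqlite3
import tempfile


def pages_used(with_index, rows=20000):
    """Build a small database file and return how many pages it occupies."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
        try:
            conn.execute(
                "CREATE TABLE events ("
                "id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
            )
            if with_index:
                # Hypothetical secondary index; every insert must maintain it.
                conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
            conn.executemany(
                "INSERT INTO events (user_id, payload) VALUES (?, ?)",
                ((i % 1000, "x" * 20) for i in range(rows)),
            )
            conn.commit()
            return conn.execute("PRAGMA page_count").fetchone()[0]
        finally:
            conn.close()


plain = pages_used(with_index=False)
indexed = pages_used(with_index=True)
print("pages without index:", plain)
print("pages with index:   ", indexed)
```

The same rows take more pages once the index exists, and that extra structure is also what every insert has to keep up to date.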
Without proper indexes, you force the RDBMS to do table scans to query for anything. Terribly inefficient.
I'd also infer that you don't have primary keys, which is a cardinal sin in relational design.
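As a rough illustration of the table-scan point, the sketch below asks SQLite (through Python's sqlite3 module, used here only as a convenient stand-in) how it plans the same lookup before and after an index exists. The `users` table, the `idx_users_email` index and the `query_plan` helper are assumptions made up for the example; other engines word their plans differently.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the readable 'detail' column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    ((f"user{i}@example.com", f"User {i}") for i in range(1000)),
)

lookup = "SELECT * FROM users WHERE email = ?"
args = ("user500@example.com",)

# Without an index on email, the only option is to read the whole table.
plan_before = query_plan(conn, lookup, args)

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index in place, the planner can search it instead of scanning.
plan_after = query_plan(conn, lookup, args)

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan reports a scan of `users`, while the second names `idx_users_email`, which is the difference the answers here keep pointing at.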
Indexing is very important when the table contains many rows.
With a few rows, performance is better without indexes.
With larger tables indexes are very important to get good performance.
It is not easy to define them. Clustered means that the data is stored in the order of the clustered index.
To get good index hints you could use Toad.
Indexing is vitally important.
The right index for a query can improve performance so dramatically it can seem like witchcraft.
As the other answers have said, indexing is crucial.
As you might infer from other answers, clustered indexing is much less crucial.
Decent indexing gives you first order performance gains - orders of magnitude are common.
Clustered indexing is a second order or incremental performance gain - usually giving small (<100%) percentages of performance increase.
(We also get into questions of 'what is a 100% performance gain'; I'm interpreting the percentage as ((oldtime - newtime)/newtime) * 100, so if the old time is 10 seconds and the new time is 5 seconds, the performance increase is 100%.)
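The percentage definition above is easy to pin down in a few lines of Python. The function name `performance_gain_percent` is just a label for this sketch.

```python
def performance_gain_percent(old_time, new_time):
    """Gain as defined above: ((oldtime - newtime) / newtime) * 100."""
    return (old_time - new_time) / new_time * 100


# The example from the text: 10 seconds down to 5 seconds.
print(performance_gain_percent(10, 5))  # 100.0

# An order-of-magnitude improvement: 10 seconds down to 1 second.
print(performance_gain_percent(10, 1))  # 900.0
```

Under this definition a "small (<100%)" gain means the query did not even get twice as fast, whereas a tenfold speedup already counts as 900%.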
Different DBMS have different interpretations of what a clustered index means. Beware.
In particular, some DBMS cluster the data once and thereafter, the clustering decays over time until the data is reclustered. Others take a more active view of clustering, I believe.
The clustered index is usually, but not always, your primary key. One way of looking at a clustered index is to think of the data as being physically ordered based on the values of the clustered index.
This may very well not be the case in reality; however, referencing clustered indexes usually gets you the following performance bonuses anyway:
1. All columns of the table are accessible for free when resolved from a clustered index hit, as if they were contained within a covering index. (A covering index is one that lets a query be resolved using just the index data, without having to reference the data pages of the table itself.)
2. Update operations can be made directly against a clustered index without intermediate processing. If you are doing a lot of updates against a table you usually want to be referencing the clustered columns.
3. Depending on implementation there may be a sequential access benefit where data stored on disk is retrieved quicker with fewer expensive disk seek operations.
4. Depending on implementation there may be a free index benefit where a physical index is not necessary as data access can be resolved via simple guessing game algorithms.
Don't count on #3 and especially #4. #1 and #2 are usually safe bets on most RDBMS platforms.
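The covering-index idea behind the first point can be seen in a small sketch. It uses SQLite through Python's sqlite3 module purely as a stand-in, so it shows what "covering" means for an ordinary secondary index, not how any particular engine implements clustering. The `orders` table, its columns and `idx_orders_customer_total` are hypothetical.

```python
import sqlite3


def plan_detail(conn, sql, params=()):
    """Return the readable 'detail' column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, note TEXT)"
)
conn.execute(
    "CREATE INDEX idx_orders_customer_total ON orders (customer_id, total)"
)

# Every column the query needs is in the index, so the table is never touched.
covered = plan_detail(
    conn, "SELECT total FROM orders WHERE customer_id = ?", (42,)
)

# 'note' is not in the index, so each hit needs an extra trip to the table row.
not_covered = plan_detail(
    conn, "SELECT note FROM orders WHERE customer_id = ?", (42,)
)

print("covered:    ", covered)
print("not covered:", not_covered)
```

A lookup through a clustered index behaves like the covered case for every column, because the index entry and the row are the same thing.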
I am new to SQL and have started learning Postgres. I want to create a database in which one of the columns of a table is email. Since I will be using email to uniquely identify the person for sign-up and logging in, I used a UNIQUE constraint while creating the table, so that if an account already exists with that email I can return an error. But will this be efficient in large databases?
Here is the screenshot of the SQL command I have used.
What do you mean by efficient? A unique constraint or index adds overhead to prevent duplicates. This overhead is needed on inserts or updates on the affected columns.
How much overhead? Well, uniqueness is validated by using an index. On most databases, looking up a value in an index is quite fast and usually does not pose a significant performance issue. Of course, such overhead can sometimes be important, particularly on very busy databases with lots of data changes (think dozens or hundreds of transactions per second).
In general, the size of the table is going to have little impact on the performance -- well, indexes do slow down a little bit (but hardly noticeably) as tables get bigger, assuming there is sufficient memory.
On the other hand, the cost of not having data integrity can be much larger -- affecting the performance and quality of queries that run on the data.
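Here is a minimal sketch of the pattern being discussed, with the database rejecting the duplicate instead of the application checking first. It uses SQLite via Python's sqlite3 module as a stand-in (the question is about Postgres, where the driver and exception class would differ), and the `accounts` table and `sign_up` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)


def sign_up(conn, email):
    """Try to create an account; return False if the email is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO accounts (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        # The unique index found an existing entry and rejected the insert.
        return False


first = sign_up(conn, "a@example.com")
second = sign_up(conn, "a@example.com")

# The constraint is backed by an index that the engine created by itself.
unique_indexes = [
    row[1] for row in conn.execute("PRAGMA index_list(accounts)") if row[2] == 1
]

print(first, second)
print(unique_indexes)
```

The duplicate check is a single index lookup performed as part of the insert, which is why it stays cheap as the table grows.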
Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but its effect on performance is nearly impossible for us to predict. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDBMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, provided your system has the IO/CPU capacity to handle both tasks at the same time. Inserting, however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.
I have to create a data warehouse (a star schema) and the customer wants it in SQL 2014 in memory. My understanding is that in-memory has a lot of limitations with FK constraints, indexes etc. These are crucial for us, because our fact table volumes are in the millions. As an alternative I was thinking of suggesting a bunch of de-normalized tables, joined in SQL for reporting, instead of going with a Kimball DWH. I have around 9 transaction and 4 master tables.
Any better suggestion or alternative to go around this?
Normally when you say in-memory with SQL Server, it means the in-memory oltp (Hekaton), which is designed for specific situations, mainly to handle bottlenecks in locking and latching. I would assume that in this case that's not what you mean.
Microsoft also uses the in-memory name for clustered columnstore, which to my mind at least makes things quite confusing. Clustered columnstore is designed for data warehouses, and instead of the normal row-based approach, it stores the data in column format. If you have Enterprise edition, it's at least worth trying for fact tables. You should get significant space savings compared to normal compressed tables (my fact tables shrunk between 75% - 90% compared to page-compressed row store) -- which of course helps a lot in terms of what fits into the cache, and performance should be better too, but of course a lot depends on your data, database structure and your queries.
There are quite a few restrictions too, the biggest ones probably being that you can't have unique indexes or primary / foreign keys. This restriction will be removed in SQL Server 2016, so if you can wait until then, or possibly upgrade once it's available, that might not be such a big issue.
You mentioned that there is no support for indexes. That is true, but you don't need other indexes with clustered columnstore, because the data is stored in column format and is highly compressed.
Is there a way to find out which queries benefit from a particular index?
I have used the DMV views and I know the index is being used in production but it would be great if there was a way to get a list of the queries positively impacted so I can make a decision if each index is worth keeping.
EDIT: I am using SQL Server
Thanks for your help!
Speaking from Oracle point of view: in Oracle, I can inspect query plan which gives enough information to guess whether an index was used or not. Remember that optimizer makes decisions based on the SQL at hand. There is no hard and fast rule re permanent use or non-use of an index. So, even if you find out that an index is being used or not, you can [almost] always modify the query so that the opposite is true!
Speaking of positive impact: again, it will only tell you how things are at this moment. For example, a table doesn't have enough records and a full table scan may be faster than using an index (due to overhead involved). But what if the situation changes (e.g. lot more records are inputted into that table)?
Bottom line: hard to make these decisions on permanent basis just by looking at what optimizer decided today, or what statistics are maintained by DB at this moment. Your knowledge of the data, its design and structure, and how it is being queried will be the real key on making these decisions.
My guess is that you are asking this question because you have lots of indices and you would like to get rid of a few. Unless the data changes rapidly, there is little overhead in maintaining indices (storage is cheap!). If that is the case, let's hope that optimizer is smart enough to make decision about using or not using an index based on cost... :-)
Keep in mind that I am a rookie in the world of sql/databases.
I am inserting/updating thousands of objects every second. Those objects are actively being queried for at multiple second intervals.
What are some basic things I should do to performance tune my (postgres) database?
It's a broad topic, so here's lots of stuff for you to read up on.
EXPLAIN and EXPLAIN ANALYZE are extremely useful for understanding what's going on in your db-engine
Make sure relevant columns are indexed
Make sure irrelevant columns are not indexed (insert/update-performance can go down the drain if too many indexes must be updated)
Make sure your postgresql.conf is tuned properly
Know what work_mem is, and how it affects your queries (mostly useful for larger queries)
Make sure your database is properly normalized
VACUUM for clearing out old data
ANALYZE for updating statistics (statistics target for amount of statistics)
Persistent connections (you could use a connection manager like pgpool or pgbouncer)
Understand how queries are constructed (joins, sub-selects, cursors)
Caching of data (e.g. memcached) is an option
And when you've exhausted those options: add more memory, faster disk-subsystem etc. Hardware matters, especially on larger datasets.
And of course, read all the other threads on postgres/databases. :)
First and foremost, read the official manual's Performance Tips.
Running EXPLAIN on all your queries and understanding its output will let you know if your queries are as fast as they could be, and if you should be adding indexes.
Once you've done that, I'd suggest reading over the Server Configuration part of the manual. There are many options which can be fine-tuned to further enhance performance. Make sure to understand the options you're setting though, since they could just as easily hinder performance if they're set incorrectly.
Remember that every time you change a query or an option, test and benchmark so that you know the effects of each change.
Actually there are some simple rules which will get you in most cases enough performance:
Indices are the first part. Primary keys are automatically indexed. I recommend putting indices on all foreign keys. Further, put indices on all columns which are frequently queried. If there are heavily used queries on a table that filter on more than one column, put a single index on those columns together.
Memory settings in your PostgreSQL installation. Set the following parameters higher:
shared_buffers, work_mem, maintenance_work_mem, temp_buffers
If it is a dedicated database machine you can easily set the first 3 of these to half the RAM (just be careful under Linux with shared buffers, you may have to adjust the shmmax parameter); in any other case it depends on how much RAM you would like to give to PostgreSQL.
http://www.postgresql.org/docs/8.3/interactive/runtime-config-resource.html
http://wiki.postgresql.org/wiki/Performance_Optimization
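For the advice about indexing columns together, a small sketch may help. It uses SQLite through Python's sqlite3 module as a stand-in for Postgres, so the plan wording will not match what Postgres prints, and the `orders` table and `idx_orders_customer_status` index are hypothetical.

```python
import sqlite3


def plan_detail(conn, sql, params=()):
    """Return the readable 'detail' column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, total REAL)"
)
# One index over both columns that the heavily used query filters on.
conn.execute(
    "CREATE INDEX idx_orders_customer_status ON orders (customer_id, status)"
)

both_columns = plan_detail(
    conn,
    "SELECT * FROM orders WHERE customer_id = ? AND status = ?",
    (42, "open"),
)

# The same index also serves queries on its leading column alone.
leading_only = plan_detail(
    conn, "SELECT * FROM orders WHERE customer_id = ?", (42,)
)

print("both columns:", both_columns)
print("leading only:", leading_only)
```

Column order matters: the index helps queries that constrain `customer_id`, with or without `status`, so the column that is filtered on most often usually goes first.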
The absolute minimum I'll recommend is the EXPLAIN ANALYZE command. It will show a breakdown of subqueries, joins, et al., all the time showing the actual amount of time consumed in the operation. It will also alert you to sequential scans and other nasty trouble.
It is the best way to start.
Put fsync = off in your postgresql.conf if you trust your filesystem; otherwise each PostgreSQL operation will be immediately written to the disk (with the fsync system call).
We have had this option turned off on many production servers for about 10 years, and we have never had data corruption.