I've got a process that inserts about 1 million records a day into a table and it's been doing that for a year. I then have a secondary table that joins to the results table and selects a count of the results grouped by an id and status for the part three months. Everything has been going on fine but the query now is performing extremely slow , I can't seem to figure out what's gone wrong. Could someone point me towards where I'd need to start with to get the performance up.
We can think of a table as of a large file on disk. To search some information in the file you need to scan it. This is expensive.
To make the process more efficient, RDBMS builds indexes - data structures, that are usually kept in memory, organised to simplify specific queries and for each row contain references where to find the row within the file. The more rows you have, the larger become indexes.
At some point the indexes become too large to fit into memory and parts of it are swapped into disk. Random access to popular indexes start causing a lot of disk IO operations, because the OS is constantly saving/loading parts of the indexes, and this is much slower than working with just memory.
What to do heavily depends on the data, there are several approaches, but the common idea behind them is to make popular indexes fit into memory again:
you can just add some memory
you can use smaller datatypes
you can delete or merge indexes and rewrite queries to use the indexes in a certain way
you can partition your data and distribute an index among several machines
...
And make sure you do use indexes, because if a table is rapidly growing, table scans of the whole table will slow down your queries very soon.
Related
I have a large table with 10 million records and is used for one of our existing applications. we are working on a new application wherein it requires only filtered result set of large table with 7000 records.
My question is will there be any performance gain going for a smaller table with 7000 records vs querying large table with filter condition(and it will joined to few other tables in the schema which are completely independant from existing application)? or should I avoid redundancy maintaining all the data in one table? This is the design in data warehouse. Please suggest!
Thank you!
For almost any database, using a sample table will be noticeably faster. This is because reading the records will require loading fewer data pages.
In addition, if the base table is being updated, then a "snapshot" is isolated from page, table, and row locks that occur on the main table. This is good from a performance perspective, but it means that the versions can get out-of-synch, which may be bad.
And, from a querying perspective, the statistics on the sample would be more accurate. This helps the optimizer choose the best query plans.
I can think of two cases where performance might not improve significantly. The first is if your database supports clustered indexes and the rows that you want are defined by a range of index keys (or a single key). These will be "adjacent", so the clustered index would scan about the same number of pages. There is a slight overhead for the actual index structure.
Similarly, if your records were so large that there was one record per data page, then the advantage of a second table would be less. It would eliminate the index access overhead, but not reduce the number of reads.
None of these considerations say whether or not you should use a separate table. You should test in your environment. The overhead of managing a separate table (and there is a cost to creating and deleting it both in terms of performance and application complexity) may outweigh small performance gains.
I got a really huge amount of data that are used to be joined anywhere just to get it (because it was really slow the team decided to gather it all into one table), but now even though they're literally right in one table (no join needed).
It's still so slow. Taking a one day range filter event will lead to time out (took more than 10s, yes that's how bad it is).
What should I suggest to my DBA?
What is the "selectivity"? That is, how many rows does your select expect to retrieve? 100% of the rows? 1% of the rows? 0.01% of the rows?
1. Low selectivity
If the selectivity is low (i.e less than 5%, ideally less than 0.5%) then good indexing is the best practice.
If so, which columns in the where clause (filtering columns) have the best (lowest) selectivity? Add these columns first in the index.
Once you have decided on the best index, you can make the table a "clustered index" table using that index. That way the heap will be presorted (fast lookup) by the index columns, for improved io since the disk blocks will be looked up sequentially.
2. High selectivity
If the selectivity is high (20% or more), there's no much you can do on your side (development). You could still get some improvement by:
Removing unneeded columns.
Make sure the select uses a FULL TABLE SCAN.
Ask the DBA to assign more resources (SGA, disk priority, paralellism, etc.)
3. Otherwise
The amount of data you have vastly exceeds the database resources you have. There's nothing you can do about it, except to tell the client about this reality, and:
Find together a way of defining smaller queries that can be achievable.
4. Finally
If you don't understanf the terms of selectivity, full table scan, indexing, database resources, heap, disk blocks, I would recommend you study them. I'm fairly sure you need to fully understand them right now!
As others have said, you need an index. However if it's really huge you can partition the data.
This allows you to drop sections of the data without using time consuming deletes. For example if you're working with some sort of historical data and want to keep 3 months worth, you can partition by month, then each month drop the oldest partition.
However on a more general note, it's rarely a good idea to take a slow multi-table query and glom it all together to improve performance. What you really need is to figure out what's wrong with the slow query and fix it.
This is a job for your DBA.
I have created more than 500 tables with geometry columns ,
it's taking more time for spatial index creation after inserting data, how can I create spatial index for all tables within a few seconds,
are there any other way to speed up for spatial index creation?
What is it that you want to speed up ? The creation of the spatial indexes (= the CREATE INDEX commands) ? Or the inserts/updates to the tables. How big are those tables ? And how long is acceptable for your business process ?
Indexes are typically created once only upon the initial creation and population of a spatial table. Unless your application is one that inserts massive quantities of data at short intervals, or one that tracks moving objects, there is no need whatsoever to rebuild the indexes after the initial creation. They are obviously maintained automatically.
As for the time to build a spatial index, it does obviously take longer (sometimes significantly so) than building an index on the same number of names. Obviously, the larger your tables, the longer that can take. Then again, if you think you can build 500 indexes (spatial or not) in a few seconds, you are deluded, even if you use very powerful hardware and the tables are empty.
There are two ways to make your index builds take less time:
1) For large tables, partition them and do the build in parallel. That is only possible if you own the partitioning option (only available in EE). Then again, that makes sense only for large tables - i.e. 50 mio rows and more. It makes little sense on small tables.
2) For a large number of smallish tables, just do the builds in parallel, i.e. run multiple index creations concurrently. Use the database's scheduler (DBMS_SCHEDULER) to orchestrate that.
If what you are after is how to accelerate data ingestion (moving objects, tracking, ...) then you need to explain a bit more about your workflow.
I have a table that has 50 fields:
10 Fields that are almost always needed.
40 Fields that are very rarely needed.
I would roughly say that the fields in (1) are needed to be accessed 1000 times more frequently than the fields in (2).
Should I split them to two tables with one-to-one relation, or keep all in the same table?
The process that you are describing is sometimes referred to as "vertical partitioning". Taken to an extreme (one column per vertical partition), this is how columnar databases store data. Unfortunately (to the best of my knowledge), Postgres does not currently have direct support for vertical partitioning.
Your idea of splitting the data into two tables is fine. I would note the following:
You will need to modify queries that use the extra columns to use the second table. (You can wrap the join into a view which you use when you want the extra columns.)
If both tables have a clustered primary key that connects them, then the join should be really fast.
If you are inserting/updating/deleting data, then you need to be careful about synchronization. I think you can handle this with an INSTEAD OF trigger on a view combining the tables.
If some records do not have extra columns, this can be a big win on the space side.
If all records and all columns are going to be loaded into the cache, then this probably is not a big win.
This can be a big performance win, under some circumstances. But there is additional manual work to keep the tables synchronized.
There's really not nearly enough information here to estimate (never mind actually quantify) what the benefits might be, but the costs are very clear -- more complex code, a more complex schema, probably greater overall space usage, and a performance overhead when adding and removing rows.
A performance improvement might come from scanning a smaller amount of data when performing a full table scan, or from an increased likelihood in finding data blocks in memory when required, and an overall smaller memory footprint, but without specific information on the types of operation commonly performed, and whether the server is under memory pressure, no reliable advice can be given.
Be very wary of making your system more complex as a side-effect of uncertain performance gains.
So, it seems to me like a query on a table with 10k records and a query on a table with 10mil records are almost equally fast if they are both fetching roughly the same number of records and making good use of simple indexes(auto increment, record id type indexed field).
My question is, will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records in to a very large indexed table can be very slow because all the indexes have to be recalculated, if I add new records only to the end of the table can I avoid that slow down, or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for a FAQs/caveats about working with very large tables, but couldn't really find one, so if anyone knows of something like that, that link would be appreciated.
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database related) depends on many factors, incuding your access patterns, ratio of Reads to Writes and size of available RAM.
If you can fit your 'hot' (i.e. frequently accessed index pages) into memory then accesses will generally be fast.
The strategy used to index very large tables, is using partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key then there will no improvement in performance over an unpartitioned table i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not binary tree which only has two children per node), so a 10k row table might have 2 levels whereas a 10B row table might have 4 levels (depending on the width of the rows). So as the table gets ridiculously large it might go to 5 levels or higher, but this only means one extra page read so is probably not noticeable.
When you insert new rows, you cant control where they are inserted in the physical layout of the table so I assume you mean "end of the table" in terms of using the maximum value being indexed. I know Oracle has some optimisations around leaf block splitting in this case, but I dont know about Postgres.
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have vast numbers of options which can allow you to index part of a table or the output of an immutable function on tuples in the table. Also size of the index, assuming it is usable, will affect speed much more slowly than will the actual scan of the table. The biggest difference is between searching a tree and scanning a list. Of course you still have disk I/O and memory overhead that goes into index usage, and so large indexes don't perform as well as they theoretically could.