I have a huge table (200 million records). About 70% of it is not needed right now (there is a column ACTIVE in the table, and those records have the value 'N'). There are a lot of multi-column indexes, but none of them includes that column. Will removing those 70% of records improve SELECT (ACTIVE='Y') performance (because Oracle has to read table blocks with no active records and then exclude them from the final result)? Is a shrink space necessary afterwards?
It's really impossible to say without knowing more about your queries.
At one extreme, access by primary key would only improve if the height of the supporting index was reduced, which would probably require deletion of the rows and then a rebuild of the index.
At the other extreme, if you're selecting nearly all active records then a full scan of the table with 70% of the rows removed (and the table shrunk) would take only 30% of the pre-deletion time.
There are many other cases in between -- for example, selecting a set of data, accessing the table via indexes, and then needing to reject 99% of the rows after reading the table because it turns out that there is a positive correlation between the required rows and an inactive status.
One way of dealing with this would be through list partitioning the table on the ACTIVE column. That would move inactive records to a partition that could be eliminated from many queries, with no need to index the column, and would keep the time for full scans of active records down.
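A minimal sketch of what that could look like, assuming hypothetical table and column names based on the question (this is not your actual DDL):
create table my_table (
    id      number primary key,
    active  char(1) not null,
    payload varchar2(100)
)
partition by list (active) (
    partition p_active   values ('Y'),
    partition p_inactive values ('N')
);
-- queries with "where active = 'Y'" then only touch p_active (partition pruning)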
If you really do not need these inactive records, why not just delete them instead of marking them inactive?
Edit: Furthermore, although indexing a column with a 70/30 split is not generally helpful, you could try a couple of other indexing tricks.
For example, if you have an indexed column which is frequently used in queries (client_id?) then you can add the active flag to that index. You could also construct a partial index:
create index my_table_active_clients
on my_table (case when active = 'Y' then client_id end);
... and then query on:
select ...
from ...
where (case when active = 'Y' then client_id end) = :client_id
This would keep the index smaller, and both indexing approaches would probably be helpful.
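For completeness, a minimal sketch of the first idea, appending the flag to an existing client_id index (the index name here is made up):
create index my_table_client_active
    on my_table (client_id, active);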
Another edit: A beneficial side effect of partitioning could be that it keeps the inactive records and active records "physically" apart, and every block read into memory from the "active" partition of course only has active records. This could have the effect of improving your cache efficiency.
Partitioning, putting the ACTIVE='N' records in a separate partition, might be a good option.
http://docs.oracle.com/cd/B19306_01/server.102/b14223/parpart.htm
Yes, it most likely will. But depending on your access patterns, the improvement will most likely not be that big. Adding an index that includes the column would be a better solution for the future, IMHO.
Most probably not, on its own: a DELETE will not reduce the size of the table's segment, so additional maintenance is needed. After the DELETE, also execute:
ALTER TABLE <tablename> ENABLE ROW MOVEMENT; -- required before SHRINK SPACE
ALTER TABLE <tablename> SHRINK SPACE COMPACT;
ALTER INDEX <indexname> SHRINK SPACE COMPACT; -- for every index on the table
Alternatively, you can use the old-school approach:
ALTER TABLE <tablename> MOVE;
ALTER INDEX <indexname> REBUILD; -- for every index on the table, since MOVE leaves them unusable
When deleting 70% of a table, also consider CTAS (CREATE TABLE ... AS SELECT) as a possible approach: copying out the 30% you want to keep will be much faster than deleting the rest.
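For illustration, a rough CTAS sketch; the object names are made up, and you would still need to recreate indexes, constraints, grants and triggers before swapping the tables:
CREATE TABLE my_table_new AS
    SELECT * FROM my_table WHERE active = 'Y';
-- Recreate indexes, constraints and grants on my_table_new, then swap the names:
RENAME my_table TO my_table_old;
RENAME my_table_new TO my_table;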
Indexing plays a vital role in SELECT queries: performance improves dramatically when the query can use the indexed columns. Deleting rows will certainly help somewhat, but not dramatically.
Related
I need to improve the performance of an INSERT INTO query on a table which has 1 billion rows. The table has a clustered primary key index.
One suggestion is to reduce the table size by deleting old records (copying them to an archive table) and keeping only the most recent records in the table. This would reduce the data from 1 billion rows to 2 million. Will this approach speed up the writes?
Are there any other ways to speed up the insert?
Note: this INSERT INTO is part of a complex stored procedure, and the execution plan points to this INSERT statement as taking a certain amount of time.
Simplistically, reducing the size of the table will not have much impact on performance. There are some cases where it could make a difference.
If new rows do not arrive in clustered primary key order, then you have a fragmentation problem. That means that inserts are likely to be splitting pages and rewriting them.
The "good" news is that in the existing table, your pages are probably already fragmented, so you probably have few full pages. So splitting is less likely. This is "good" in quotes because it means that you have lots of wasted space, which is inefficient for queries.
If you remove the excess rows and compact (defrag) the table, then you will have some advantages. The biggest is that the data will probably fit into memory -- a big performance advantage.
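As a hedged way of checking how fragmented the existing pages actually are before deciding (SQL Server; the table name t matches the sketch below and is otherwise an assumption):
select index_id, avg_fragmentation_in_percent, avg_page_space_used_in_percent
from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.t'), null, null, 'SAMPLED');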
I would recommend that you fix the table because the extra rows are probably hurting query performance. Given the volume of data, I would suggest a truncate/re-insert approach:
select t.*
into temp_t
from t
where <rows to keep logic here>;
truncate table t; -- be sure you have a backup!
insert into t
select *
from temp_t;
This will be much faster than trying to delete 99.9% of the rows (unless you happen to have a partitioned table where you can simply drop partitions).
If you want to keep the old data, you might find a way to partition the table. Of course, your queries would have to use the partitioning key to access the "valid" rows rather than the archive.
Tracking index usage and analyzing the tables we have added indexes to, we ran into some odd situations:
Some of our tables have an index, but when I execute a query with a WHERE clause on the indexed field, the corresponding idx_scan counter in the statistics views does not increase. The relname and schemaname match, so I am sure I am looking at the right row.
Testing further, I dropped and recreated the table, and after that the same query did increment idx_scan.
The same thing happened with other tables: we executed queries that should use the indexes, but only seq_scan was incremented, never idx_scan. Even when I added a new indexed field to the same table, queries on that new field did not increment idx_scan either.
What is wrong with these tables? What are we doing wrong? Only newly created tables increment idx_scan; it is just the old tables that behave wrongly.
We have migrated this database a few times; could that be the problem? It happens both on localhost and on the online server.
Another thing we noticed: some indexes had been counted before (idx_scan > 0), but when we execute a SELECT that should use them now, idx_scan no longer increases; the number stays fixed and only seq_scan goes up.
I believe these problems are related.
I would appreciate some help; it's a big mystery prowling our DB and we have no idea what the cause might be.
A couple of suggestions (and some things you should add to your question).
The first is that index scans are not always favored over sequential scans. For example, if your table is small, or the planner estimates that most pages will need to be fetched anyway, it will choose a sequential scan instead of an index scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly, if you have to retrieve, say, 50% of the pages of a relation, doing an index scan trades somewhat less total disk I/O for a great deal more random disk I/O. It might be a win if you use SSDs, but certainly not with conventional hard drives. After all, you don't really want to be waiting for platters to turn. If you are using SSDs you can tweak planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
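For illustration, a minimal sketch of that recursive trick; the table events and column event_year are made up, and an index on event_year is assumed:
WITH RECURSIVE years AS (
    SELECT min(event_year) AS yr FROM events
    UNION ALL
    SELECT (SELECT min(event_year) FROM events WHERE event_year > years.yr)
    FROM years
    WHERE years.yr IS NOT NULL
)
SELECT yr FROM years WHERE yr IS NOT NULL;
Each step finds the next year via an indexable min() over a range, so every iteration can be answered from the index rather than a sequential scan.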
tl;dr: It's complicated. You want to make sure this is really a bad plan before jumping to conclusions and then if it is a bad plan see what you can do about it depending on your configuration.
I have tables in PostgreSQL, each with millions of records and more than one hundred fields.
One of them is a date field, which we filter by in our queries. Creating an index on this date field improved the performance of queries that read a small range of dates, but for large date ranges the performance decreased...
Must I prioritize one over the other? Can the performance on small ranges be improved without hurting the large-range queries?
Queries in PostgreSQL cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is when you are actually grabbing a large portion of the data. Typically, if more than about 20% of the table will be read, it is considered faster to just access it sequentially. Sometimes the planner estimates that less than 20% will be accessed and therefore prefers the index, when in reality far more of the table is touched; that is one way adding an index can slow a query down. This may be the situation you're seeing, based on your description: if the large ranges touch more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than if you had just read the table directly, in a couple of cases. The causes of most of them show up if you run the query using EXPLAIN ANALYZE. If the "expected" row counts are very different from the "actual" numbers, this suggests the optimizer has bad statistics for the table. Another possibility is that the optimizer simply misjudged how selective the query is: it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics is the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens. That's only something to consider after the statistics information has been checked, though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
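As a rough illustration of that workflow (the table and column names here are made up, not taken from the question):
EXPLAIN ANALYZE
SELECT * FROM measurements WHERE rec_date BETWEEN '2011-01-01' AND '2011-06-30';
-- If estimated vs. actual row counts diverge badly, collect more detailed statistics:
ALTER TABLE measurements ALTER COLUMN rec_date SET STATISTICS 500;
ANALYZE measurements;
-- Only after the statistics check: random_page_cost shifts the index vs. sequential scan
-- trade-off, e.g. lower it where random reads are relatively cheap.
-- SET random_page_cost = 2.0;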
I'd try several things:
increase DB cache parameters
add the index on that date field
redesign/modify the application to work with smaller date ranges (although this suggestion might seem obvious, it is usually the first to be thrown away)
The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table getting opened on large ranges. And if so, clustering the table along that index would lead to less disk seeks.
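A minimal sketch, assuming a hypothetical table measurements with an index measurements_rec_date_idx on the date column:
CLUSTER measurements USING measurements_rec_date_idx;
ANALYZE measurements; -- refresh statistics, including the physical-order correlation
Note that CLUSTER rewrites the table and takes an exclusive lock, and the ordering is not maintained for rows inserted afterwards.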
Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then index the date on each child table. PostgreSQL is smart enough to only perform index scans on the child tables that actually contain data in the requested date range. Once a child table is "sealed" because a new month has started, run CLUSTER on it to sort the data by date (a minimal sketch follows after the index example below).
2) Look at creating a bunch of INDEX's that use WHERE clauses.
Suggestion #1 is going to be the winner long term, but will take some work to set up (and will scale/run forever); suggestion #2 may be a quick interim fix if you only care about scanning a limited date range. Remember, you can only use IMMUTABLE functions in a partial index's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date <= '2011-06-01';
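And a minimal sketch of suggestion #1 with old-style inheritance partitioning (all names here are made up):
CREATE TABLE measurements (id bigint, reading numeric, rec_date date);
CREATE TABLE measurements_2011_05 (
    CHECK (rec_date >= DATE '2011-05-01' AND rec_date < DATE '2011-06-01')
) INHERITS (measurements);
CREATE INDEX measurements_2011_05_date_idx ON measurements_2011_05 (rec_date);
SET constraint_exclusion = on; -- lets the planner skip children whose CHECK excludes the requested range
-- Once the month is "sealed", physically order it by date:
CLUSTER measurements_2011_05 USING measurements_2011_05_date_idx;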
I have a device I'm polling for lots of different fields every x milliseconds;
the device returns a list of ids and values which I need to store with a timestamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or to query the last timestamp for each value.
A simple approach would be to define a MySQL table with
id,value_id,timestamp,value
and let users select
Select value from t where value_id=x order by timestamp desc limit 1
and just push everything there, with indexes on timestamp and id. But my question is: what's the best approach, performance- and size-wise, for designing the schema? Or should I use NoSQL? Can anyone comment on the possible design trade-offs? Will such a design scale to millions of records?
When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have [timestamp] column in the WHERE clause, you do not need an index on [value] column if there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
EDIT:
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
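For illustration, a minimal sketch using the column names from the schema in the question (the index name is made up):
CREATE INDEX idx_t_valueid_ts ON t (value_id, `timestamp`);
-- The latest value for a single id then comes straight off the index:
SELECT value FROM t WHERE value_id = ? ORDER BY `timestamp` DESC LIMIT 1;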
If you rarely need to sort multiple products but are most often selecting based on a single product-code, then a non-traditional DBMS that uses a hashed-key sparse-table rather than a b-tree might be a very viable even a superior alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
YET ANOTHER EDIT SINCE THE SITE IS NOT LETTING ME ADD ANOTHER ANSWER:
The simplest way is to have two tables: one with the ongoing history, which always has new values inserted, and the other containing only 250 records, one per part, where the latest value overwrites/replaces the previous one.
Update latest
set value = x
where id = ?
You have a choice of
indexes (composite, covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes, composite and non-composite; also be aware that there are quite a few significantly different ways to get the 'max per group' (search Stack Overflow, especially the MySQL variants using variables)
triggers - you might use triggers to maintain the latest row per id in another table (best performance for subsequent selects; the data is redundant and could even be kept in memory); a sketch follows after this list
lazy statistics/triggers - since your database is updated quite often, you can save cycles by updating such derived statistics only periodically (if you can allow the stats to be y seconds old and you poll 1000 / x times a second, you potentially save y * 1000 / x updates, which can be noticeable, especially in terms of scalability)
The above is worth doing if you are looking for the last bit of performance; if not, keep it simple.
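As a rough sketch of the trigger option (MySQL; the history table readings and its columns are assumptions based on the schema in the question):
CREATE TABLE latest (
    value_id INT PRIMARY KEY,
    ts       DATETIME NOT NULL,
    value    DOUBLE NOT NULL
);
DELIMITER //
CREATE TRIGGER trg_readings_latest
AFTER INSERT ON readings
FOR EACH ROW
BEGIN
    -- keep exactly one row per value_id holding the most recent reading
    INSERT INTO latest (value_id, ts, value)
    VALUES (NEW.value_id, NEW.`timestamp`, NEW.value)
    ON DUPLICATE KEY UPDATE ts = NEW.`timestamp`, value = NEW.value;
END//
DELIMITER ;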
INFORMIX-SE 7.32:
I have a transaction table with about 5,000 rows. The column transaction.ticket_number [INT] gets updated with the next available sequential ticket number every time a specific row is updated. The column has a unique index. I'm currently using the following SELECT statement to locate max(transaction.ticket_number):
SELECT MAX(transaction.ticket_number) FROM transaction;
Since the rows are clustered according to transaction.fk_id [INT], where it joins to customer.pk_id [SERIAL], the row being updated is not physically located at the end of the transaction table; rather, it resides within the group of transaction rows belonging to that particular customer. I chose to cluster the transactions belonging to each customer because response time is faster when I scroll through each customer's transactions. Is there a faster way of locating max(transaction.ticket_number) than the above query? Would a 'unique index on transaction(ticket_number) descending' improve access, or is the index fully traversed from beginning to end regardless?
On a table of only 5000 rows on a modern machine, you are unlikely to be able to measure the difference in performance of the various techniques, especially in the single-user scenario which I believe you are facing. Even if the 5000 rows were all at the maximum permissible size (just under 32 KB), you would be dealing with 160 MB of data, which could easily fit into the machine's caches. In practice, I'm sure your rows are far smaller, and you'd never need all the data in the cache.
Unless you have a demonstrable performance problem, go with the index on the ticket number column and rely on the server (Informix SE) to do its job. If you have a demonstrable problem, show the query plans from SET EXPLAIN output. However, there are major limits on how much you can tweak SE performance - it is install-and-go technology with minimal demands on tuning.
I'm not sure whether Informix SE supports the 'FIRST n' (aka 'TOP n') notation that Informix Dynamic Server supports; I believe not.
Due to NULLABLE columns and other factors, use of indexes, etc., you can often find that the following is faster, but normally only negligibly so...
SELECT TOP 1 ticket_number FROM transaction ORDER BY ticket_number DESC
I'm also uncertain as to whether you actually have an Index on [ticket_number]? Or do you just have a UNIQUE constraint? A constraint won't help determine a MAX, but an INDEX will.
In the event that an INDEX exists with ticket_number as the first indexable column:
- An index seek/lookup would likely be used, not needing to scan the other values at all
In the event that an INDEX exists with ticket_number Not as the first indexable column:
- An index scan would likely occur, checking every single unique entry in the index
In the event that no usable INDEX exists:
- The whole table would be scanned
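If it turns out that you only have a constraint, a minimal sketch of creating the index for the first case, using the names from the question:
CREATE UNIQUE INDEX ix_transaction_ticket ON transaction (ticket_number);
-- With ticket_number as the leading (and only) indexed column, MAX() can be answered
-- by reading one end of the index instead of scanning the table:
SELECT MAX(ticket_number) FROM transaction;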