Improve performance of queries in PostgreSQL with an index - sql

I have tables in PostgreSQL, each with millions of records and more than one hundred fields.
One of these fields is a date field, which we filter by in our queries. Creating an index on this date field improved the performance of queries that read a small range of dates, but for big ranges of dates the performance decreased...
Should I prioritize one over the other? Can the performance for small ranges be improved without slowing down the big-range queries?

Queries in PostgreSQL cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is if you are actually grabbing a large portion of the data. Typically, if more than about 20% of the table is needed, it's faster to just access it sequentially. Sometimes the planner estimates that less than 20% will be accessed, so it prefers the index, but that estimate turns out to be wrong; that's one way adding an index can slow a query down. This may be the situation you're seeing, based on your description: if the large ranges touch more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than if you had just read the table directly, in a couple of cases. The causes of most of them show up if you run the query using EXPLAIN ANALYZE. If the "expected" values versus the "actual" numbers are very different, this suggests the optimizer had bad statistics on the table. Another possibility is that the optimizer simply made a mistake about how selective the query is: it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics are the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
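As a hedged illustration of reading that output (the table, column, index name, and all numbers below are made up):

EXPLAIN ANALYZE
SELECT *
FROM orders                                  -- hypothetical table
WHERE order_date >= '2011-01-01'
  AND order_date <  '2011-07-01';

-- In the output, compare the planner's estimate with reality, e.g.:
--   Index Scan using orders_order_date_idx on orders
--     (cost=0.43..12345.67 rows=50000 width=120)
--     (actual time=0.050..2345.678 rows=4800000 loops=1)
-- An estimated rows=50000 versus an actual rows=4800000 is the kind of
-- mismatch that points at stale or insufficient statistics.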
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens. That's only something to consider after the statistics information is checked, though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
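As a hedged sketch, the trade-off point can be shifted for a single session first and only persisted if it actually helps (the value 1.1 is just a common starting point for fast or well-cached storage, not a recommendation for your hardware; mydb is a hypothetical database name):

SET random_page_cost = 1.1;   -- affects the current session only
-- ...re-run the slow query with EXPLAIN ANALYZE and compare the plans...

ALTER DATABASE mydb SET random_page_cost = 1.1;   -- persist it per database if it helps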

I'd try several things:
increase DB cache parameters
add the index on that date field (a minimal example follows this list)
redesign/modify the application to work with smaller ranges (although this suggestion might seem obvious, it is usually the first to be thrown away)
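For the second item, a hedged sketch (the table and column names are hypothetical):

CREATE INDEX events_event_date_idx ON events (event_date);
-- On a busy production table, CREATE INDEX CONCURRENTLY avoids blocking writes.
ANALYZE events;   -- refresh statistics so the planner can judge the new index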

Creating an index on this date field improved the performance of queries that read a small range of dates, but for big ranges of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table being read for large ranges, and if so, clustering the table along that index would lead to fewer disk seeks.
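A hedged sketch of what that looks like (the table and index names are hypothetical):

CLUSTER events USING events_event_date_idx;
ANALYZE events;
-- Note: CLUSTER takes an exclusive lock and is a one-time physical reordering;
-- it is not maintained automatically as new rows arrive.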

Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then index the date on each child table. PostgreSQL is smart enough to only perform index scans on the child tables that actually hold data in the date range. Once a child table is "sealed" because a new month has started, run CLUSTER on it to sort the data by date.
2) Look at creating a bunch of partial indexes, i.e. indexes with a WHERE clause.
Suggestion #1 is going to be the winner long term but will take some work to set up (though it will scale/run forever), while suggestion #2 may be a quick interim fix if you have a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in an index's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date <= '2011-06-01';
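For suggestion #1, a minimal sketch of the inheritance approach (the child-table names and CHECK constraints are illustrative assumptions; newer PostgreSQL versions offer declarative partitioning for the same idea):

CREATE TABLE tbl_2011_05 (
    CHECK (date >= DATE '2011-05-01' AND date < DATE '2011-06-01')
) INHERITS (tbl);
CREATE INDEX tbl_2011_05_date_idx ON tbl_2011_05 (date);

-- With constraint exclusion enabled, queries against the parent only touch
-- the child tables whose CHECK constraints overlap the requested date range:
SET constraint_exclusion = partition;
-- Once the month is over ("sealed"), sort it physically by date:
CLUSTER tbl_2011_05 USING tbl_2011_05_date_idx;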

Related

Make query run faster - IT HAS NO JOIN

I have a really huge amount of data that used to require joins all over the place just to retrieve it (because that was really slow, the team decided to gather it all into one table), but now it's literally all in one table (no join needed).
It's still so slow. Filtering on just a one-day range of events leads to a timeout (it takes more than 10 s; yes, that's how bad it is).
What should I suggest to my DBA?
What is the "selectivity"? That is, how many rows does your select expect to retrieve? 100% of the rows? 1% of the rows? 0.01% of the rows?
1. Low selectivity
If the selectivity is low (i.e. less than 5%, ideally less than 0.5%), then good indexing is the best practice.
If so, which columns in the where clause (filtering columns) have the best (lowest) selectivity? Add these columns first in the index.
Once you have decided on the best index, you can make the table a "clustered index" table using that index. That way the heap will be presorted (fast lookup) by the index columns, giving improved I/O since the disk blocks will be read sequentially.
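A hedged sketch in PostgreSQL syntax (the names are hypothetical; other databases have an equivalent, e.g. a clustered index in SQL Server):

-- Multicolumn index with the most selective filtering column first:
CREATE INDEX big_tbl_filter_idx ON big_tbl (event_time, customer_id);
-- One-time physical reordering of the heap along that index:
CLUSTER big_tbl USING big_tbl_filter_idx;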
2. High selectivity
If the selectivity is high (20% or more), there's not much you can do on your side (development). You could still get some improvement by:
Removing unneeded columns.
Making sure the select uses a FULL TABLE SCAN.
Asking the DBA to assign more resources (SGA, disk priority, parallelism, etc.).
3. Otherwise
The amount of data you have vastly exceeds the database resources you have. There's nothing you can do about it, except to tell the client about this reality, and:
Find, together, a way of defining smaller queries that are achievable.
4. Finally
If you don't understand the terms selectivity, full table scan, indexing, database resources, heap, and disk blocks, I would recommend you study them. I'm fairly sure you need to fully understand them right now!
As others have said, you need an index. However if it's really huge you can partition the data.
This allows you to drop sections of the data without using time-consuming deletes. For example, if you're working with some sort of historical data and want to keep 3 months' worth, you can partition by month, then each month drop the oldest partition.
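A hedged sketch in PostgreSQL's declarative-partitioning syntax (the names are illustrative; most other databases have an equivalent feature):

CREATE TABLE events (
    event_time timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Ageing out a month becomes a cheap metadata operation instead of a huge DELETE:
DROP TABLE events_2024_01;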
However on a more general note, it's rarely a good idea to take a slow multi-table query and glom it all together to improve performance. What you really need is to figure out what's wrong with the slow query and fix it.
This is a job for your DBA.

SQL Server Time Series Modelling - Huge Data Collection

I have to implement data collection for replay of electrical parameters for hundreds to thousands of devices, with at least 20 parameters to monitor per device. This amounts to a huge data collection, very similar to a time series. I have to support a resolution of 1 second, so for one year that's 365*24*60*60*1000 = 31,536,000,000 rows.
I did my research but still have a few questions:
As the data will be huge, is it good to keep it all in the same table, or should the tables be split (the data structure is the same)? Or should I rely on indexes?
Data inserts will also be very frequent, though I can batch them. What is the best way to handle them? Writing directly to the same database, or writing to a temporary database and syncing with the main one?
Does SQL Server have a specific schema recommendation for time series optimization of selects, updates, and inserts? Is there any out-of-the-box help for day averages or other common aggregate functions? I can write my own, but since this is a standard problem there might be best practices and samples available.
Any help is appreciated, thanks in advance.
1) You probably want to explore the use of partitions. This will allow very efficient inserts (it's a metadata-only operation if you do the partitioning correctly) and is very fast. 2) You may want to explore columnstore indexes, because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore require a learning curve, but it's very doable. There is lots of code on the internet describing the use of date functions in SQL Server.
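A hedged T-SQL sketch of that combination (the names, types, and boundary dates are illustrative assumptions, not a prescribed schema):

-- Monthly partitions on the reading timestamp:
CREATE PARTITION FUNCTION pf_reading_month (datetime2(0))
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');
CREATE PARTITION SCHEME ps_reading_month
    AS PARTITION pf_reading_month ALL TO ([PRIMARY]);

CREATE TABLE dbo.DeviceReading (
    ReadingTime datetime2(0) NOT NULL,
    DeviceId    int          NOT NULL,
    ParamId     smallint     NOT NULL,
    Value       float        NOT NULL
) ON ps_reading_month (ReadingTime);

-- Columnstore compresses well and speeds up aggregates over large date ranges:
CREATE CLUSTERED COLUMNSTORE INDEX cci_DeviceReading ON dbo.DeviceReading;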
That is a big number, but I would start with one table and see if it holds up. If you split it into multiple tables, it is still the same amount of data.
Do you ever need to search across devices? If not you can have a separate table for each device.
I have some audit tables that are not that big but still big and have not had any problems. If the data is loaded in time order then make date the first (or only) column of the clustered index.
If the PK is (date, device) then fine, but if you can get two readings in the same second you cannot do that. If it is the PK, then load the data in that sort order if you can, even if you have to stage each second and then load it; you just cannot afford to fragment a table that big. If you cannot load in sorted order, then use a fill factor of 50%.
If you cannot have a PK, then just use date as the clustered index (but not as a PK) and put a nonclustered index on device.
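A hedged sketch of that layout (the table and column names are hypothetical):

-- Clustered index on the time column that matches the load order, not declared as a PK:
CREATE CLUSTERED INDEX cix_Reading_Time ON dbo.Reading (ReadingTime);
-- Separate nonclustered index for lookups by device:
CREATE NONCLUSTERED INDEX ix_Reading_Device ON dbo.Reading (DeviceId);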
I have some tables of 3,000,000,000 rows, and I have the luxury of loading by PK with no other indexes. There is no measurable degradation in insert performance from row 1 to row 3,000,000,000.

What is the best way to ensure consistent ordering in an Oracle query?

I have a program that needs to run queries on a number of very large Oracle tables (the largest with tens of millions of rows). The output of these queries is fed into another process which (as a side effect) can record the progress of the query (i.e., the last row fetched).
It would be nice if, in the event that the task stopped half way through for some reason, it could be restarted. For this to happen, the query has to return rows in a consistent order, so it has to be sorted. The obvious thing to do is to sort on the primary key; however, there is probably going to be a penalty for this in terms of performance (an index access) versus a non-sorted solution. Given that a restart may never happen this is not desirable.
Is there some trick to ensure consistent ordering in another way? Any other suggestions for maintaining performance in this case?
EDIT: I have been looking around and seen "order by rowid" mentioned. Is this useful or even possible?
EDIT2: I am adding some benchmarks:
With no order by: 17 seconds.
With order by PK: 46 seconds.
With order by rowid: 43 seconds.
So any ORDER BY has a savage effect on performance, and using rowid makes little difference. Accepted answer: there is no easy way to do it.
The best advice I can think of is to reduce the chance of a problem occurring that might stop the process, and that means keeping the code simple. No cursors, no commits, no trying to move part of the data, just straight SQL statements.
Unless a complete restart would be a completely unacceptable disaster, I'd go for simplicity without any part-way restart code at all.
If you want some order and the queried data is unsorted, then you need to sort it anyway and spend some resources doing so.
So, there are at least two variants for optimization:
Minimize resources spent on sorting;
Query already sorted data.
For the first variant, Oracle on its own calculates the best plan to minimize data access and overall query time. It may be possible to choose a sort order that matches a unique index the optimizer already uses, but it's a very questionable tactic.
The second variant is about index-organized tables and about forcing Oracle, with hints, to use some specific index. That seems OK if you need to process nearly all records in a specific table, but if the query selects only a small fraction of the rows it significantly slows the process, even on a single table.
Think about a table with a surrogate primary key that holds a 10-year transaction history. If you need data only for the previous year and you force ordering by the primary key, then Oracle needs to process the records of all 10 years one by one to find the records which belong to a single year.
But if you need 9 years of data from this table, then a full table scan may be faster than the index-based approach.
So the selectivity of your query is the key to choosing between a full table scan and sorted results.
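As a hedged illustration of the second variant in Oracle syntax (the table, index, and column names are hypothetical):

-- Force ordered access through a specific index with a hint:
SELECT /*+ INDEX(t trx_pk) */ *
FROM   transactions t
WHERE  trx_date >= DATE '2023-01-01'
ORDER  BY t.trx_id;

-- Or keep the table itself stored in primary-key order (index-organized table):
CREATE TABLE transactions_iot (
    trx_id   NUMBER PRIMARY KEY,
    trx_date DATE,
    amount   NUMBER
) ORGANIZATION INDEX;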
For storing results and restarting the query, a good solution is to use Oracle Streams Advanced Queuing to feed the other process.
All unprocessed messages in the queue are redirected to an exception queue, where they may be processed separately.
Because you don't specify an exact ordering for the selected messages, I suppose that you need ordering only to keep track of the unprocessed part of the records. If that's true, then with AQ you don't need ordering at all and may even process records in parallel.
So, finally, from my point of view, a buffered queue is what you really need.
You could skip ordering and just update the records you processed with something like SET is_processed = 'Y' or SET date_processed = sysdate. Complete restartability and no ordering.
For performance you can partition by is_processed. Yes, partition key changes might be slow, but it is all about trade-offs.
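A hedged sketch of that pattern (the table and column names are hypothetical; SYSDATE and the bind variable follow Oracle conventions, as in the answer above):

-- Pick up only unprocessed rows; no ORDER BY needed:
SELECT * FROM big_table WHERE is_processed = 'N';

-- After the downstream process confirms a row, mark it done:
UPDATE big_table
SET    is_processed = 'Y', date_processed = SYSDATE
WHERE  row_key = :processed_key;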

Will the query plan be changed on different data size?

Suppose the data distribution does not change. For the same query, if only the dataset is enlarged by some factor, will the time taken also grow by that factor? And if the data distribution does not change, can the query plan change, at least in theory?
Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovacuum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but it may happen for other reasons (wrap-around prevention vacuum, etc.).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
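A hedged example of raising the target for one column (the names and the value 500 are illustrative; the PostgreSQL default target is 100 in recent versions):

ALTER TABLE events ALTER COLUMN event_date SET STATISTICS 500;
ANALYZE events;   -- rebuild the histogram with the larger sample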
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
Here is an example. You have one record, stored on a single page, and an index. Consider the query:
select t.*
from tbl t
where col = x;
And, assume you have an index on col. With one record, the fastest way is to simply read the record and check the where clause. You could have 200 records on the page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.

Index not used Postgres

While tracking index usage and analyzing the tables we added indexes to, we ran into some situations:
Some of our tables have an index, but when I execute a query with a WHERE clause on the indexed field, it is not counted in the respective idx_scan field. The relname and schemaname match, so I can't be looking at the wrong row.
Testing further, I dropped and re-created the table, and after that the query was counted in idx_scan again.
That occurred with other tables too: we executed some queries that should use indexes and nothing was counted in the idx_scan field, only in seq_scan, and even if I create another indexed field in the same table, this new field doesn't count in idx_scan either.
What's the problem with these tables? What are we doing wrong? Only newly created tables with indexes are counted in idx_scan; it's only the old tables that behave wrongly.
We have migrated this database a few times; maybe that could be the problem? It happens both on localhost and on the online server.
Another thing we saw: some indexes were being counted (idx_scan > 0), but when we execute a SELECT query, idx_scan does not increase again; the number stays fixed and only seq_scan increases.
I believe these problems may be related.
I'd appreciate some help; it's a big mystery prowling our DB and we have no idea what the problem can be.
A couple of suggestions (and some things to add to your question).
The first is that index scans are not always favored over sequential scans. For example, if your table is small or the planner estimates that most pages will need to be fetched, an index scan will be skipped in favor of a sequential scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly if you have to retrieve, say, 50% of the pages of a relation, doing an index scan is going to trade somewhat less disk/IO total for a great deal more random disk/IO. It might be a win if you use SSD's but certainly not with conventional hard drives. After all you don't really want to be waiting for platters to turn. If you are using SSD's you can tweak planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
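A hedged sketch of that WITH RECURSIVE trick (it assumes a hypothetical table events with an index on event_date):

WITH RECURSIVE yrs AS (
    SELECT min(event_date) AS d FROM events           -- first row, via the index
    UNION ALL
    SELECT (SELECT min(event_date)                    -- first row of the next year
            FROM   events
            WHERE  event_date >= date_trunc('year', y.d) + interval '1 year')
    FROM   yrs y
    WHERE  y.d IS NOT NULL
)
SELECT extract(year FROM d) AS year
FROM   yrs
WHERE  d IS NOT NULL;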
tl;dr: It's complicated. You want to make sure this is really a bad plan before jumping to conclusions and then if it is a bad plan see what you can do about it depending on your configuration.