Make query run faster - IT HAS NO JOIN - sql

I got a really huge amount of data that are used to be joined anywhere just to get it (because it was really slow the team decided to gather it all into one table), but now even though they're literally right in one table (no join needed).
It's still so slow. Taking a one day range filter event will lead to time out (took more than 10s, yes that's how bad it is).
What should I suggest to my DBA?

What is the "selectivity"? That is, how many rows does your select expect to retrieve? 100% of the rows? 1% of the rows? 0.01% of the rows?
1. Low selectivity
If the selectivity is low (i.e less than 5%, ideally less than 0.5%) then good indexing is the best practice.
If so, which columns in the where clause (filtering columns) have the best (lowest) selectivity? Add these columns first in the index.
Once you have decided on the best index, you can make the table a "clustered index" table using that index. That way the heap will be presorted (fast lookup) by the index columns, for improved io since the disk blocks will be looked up sequentially.
2. High selectivity
If the selectivity is high (20% or more), there's no much you can do on your side (development). You could still get some improvement by:
Removing unneeded columns.
Make sure the select uses a FULL TABLE SCAN.
Ask the DBA to assign more resources (SGA, disk priority, paralellism, etc.)
3. Otherwise
The amount of data you have vastly exceeds the database resources you have. There's nothing you can do about it, except to tell the client about this reality, and:
Find together a way of defining smaller queries that can be achievable.
4. Finally
If you don't understanf the terms of selectivity, full table scan, indexing, database resources, heap, disk blocks, I would recommend you study them. I'm fairly sure you need to fully understand them right now!

As others have said, you need an index. However if it's really huge you can partition the data.
This allows you to drop sections of the data without using time consuming deletes. For example if you're working with some sort of historical data and want to keep 3 months worth, you can partition by month, then each month drop the oldest partition.
However on a more general note, it's rarely a good idea to take a slow multi-table query and glom it all together to improve performance. What you really need is to figure out what's wrong with the slow query and fix it.
This is a job for your DBA.

Related

What is the best way to ensure consistent ordering in an Oracle query?

I have an program that needs to run queries on a number of very large Oracle tables (the largest with tens of millions of rows). The output of these queries is fed into another process which (as a side effect) can record the progress of the query (i.e., the last row fetched).
It would be nice if, in the event that the task stopped half way through for some reason, it could be restarted. For this to happen, the query has to return rows in a consistent order, so it has to be sorted. The obvious thing to do is to sort on the primary key; however, there is probably going to be a penalty for this in terms of performance (an index access) versus a non-sorted solution. Given that a restart may never happen this is not desirable.
Is there some trick to ensure consistent ordering in another way? Any other suggestions for maintaining performance in this case?
EDIT: I have been looking around and seen "order by rowid" mentioned. Is this useful or even possible?
EDIT2: I am adding some benchmarks:
With no order by: 17 seconds.
With order by PK: 46 seconds.
With order by rowid: 43 seconds.
So any order by has a savage effect on performance, and using rowid makes little difference. Accepted answer is - there is no easy way to do it.
The best advice I can think of is to reduce the chance of a problem occurring that might stop the process, and that means keeping the code simple. No cursors, no commits, no trying to move part of the data, just straight SQL statements.
Unless a complete restart would be a completely unacceptable disaster, I'd go for simplicity without any part-way restart code at all.
If you want some order and queried data is unsorted then you need to sort it anyway, and spend some resources to do sorting.
So, there are at least two variants for optimization:
Minimize resources spent on sorting;
Query already sorted data.
For the first variant Oracle on its own calculates a best variant to minimize data access and overall query time. It may be possible to choose sorting order involved in unique index which already used by optimizer, but it's a very questionable tactic.
Second variant is about index-organized tables and about forcing Oracle with hints to use some specific index. It seems Ok if you need to process nearly all records in some specific table, but if selectivity of query is high it's significantly slows a process, even on a single table.
Think about a table with surrogate primary key which holds data with 10-year transaction history. If you need data only for previous year and you force order by primary key then Oracle need to process records in all 10 years one-by-one to find all records which belongs to a single year.
But if you need data for 9 years from this table then full table scan may be faster than index-based choice.
So selectivity of your query is a key to choose between full table scan and result sorting.
For storing results and restarting query a good solution is to use Oracle Streams Advanced Queuing to fed another process.
All unprocessed messages in queue redirected to Exception Queue where it may be processed separately.
Because you don't specify exact ordering for selected messages I suppose that you need ordering only to maintain unprocessed part of records. If it's true then with AQ you don't need ordering at all and may, even, process records in parallel.
So, finally, from my point of view Buffered Queue is what you really need.
You could skip ordering and just update the records you processed with something like SET is_processed = 'Y' or SET date_processed = sysdate. Complete restartability and no ordering.
For performance you can partition by is_processed. Yes, partition key changes might be slow, but it is all about trade-offs.

Will the query plan be changed on different data size?

Suppose the data distribution does not change, For a same query, only dataset is enlarged a time, will the time taken also becomes 1 time? If the data distribution does not change, will the query plan change if in theory?
Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovaccum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but may happen for other reasons (wrap-around prevention vacuum, etc).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
Here is an example. You have record on one page and an index. Consider the query:
select t.*
from table t
where col = x;
And, assume you have an index on col. With one record, the fastest way is to simply read the record and check the where clause. You could have 200 records on the page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.

Index not used Postgres

Tracking indexes and analyzing the tables on which index add, we encounter some situations:
some of our tables have index, but when I execute a query with a clause where on index field, doesn't account in your idx_scan field respective. Same relname and schemaname, so, I couldn't be wrong.
Testing more, I deleted and create the table again, after that the query returned to account the idx_scan.
That occurred with another tables too, we executed some queries with indexes and didn't account idx_scan field, only in seq_scan and even if I create another field in the same table with index, this new field doesn't count idx_scan.
Whats the problem with these tables? What do we do wrong? Only if I create a new table with indexes that account in idx_scan, just in an old table that has wrong.
We did migration sometimes with this database, maybe it can be the problem? Happened on localhost and server online.
Another event that we saw, some indexes were accounted, idx_scan > 0, and when execute query select, does not increase idx_scan again, the number was fixed and just increase seq_scan.
I believe those problems can be related.
I appreciate some help, it's a big mystery prowling our DB and have no idea what the problem can be.
A couple suggestions (and what to add to your question).
The first is that index scans are not always favored to to sequential scans. For example, if your table is small or the planner estimates that most pages will need to be fetched, an index scan will be omitted in favor of a sequential scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly if you have to retrieve, say, 50% of the pages of a relation, doing an index scan is going to trade somewhat less disk/IO total for a great deal more random disk/IO. It might be a win if you use SSD's but certainly not with conventional hard drives. After all you don't really want to be waiting for platters to turn. If you are using SSD's you can tweak planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
tl;dr: It's complicated. You want to make sure this is really a bad plan before jumping to conclusions and then if it is a bad plan see what you can do about it depending on your configuration.

Improve performance of querys in Postgresql with an index

I have in PostgreSQL tables, each with millions of records and more that one hundred fields.
One of them is a date field, which we filter by this in our queries. The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
I must prioritize one over the other? The performance in small ranges can be improved without decreasing the big range queries?
Queries in PostgreSQL cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is if you are actually grabbing a large portion of the data. Typically if more than 20% of the table is used, it's considered fast to just sequentially access it. Sometimes the planner thinks less than 20% will be accessed, so the index is preferred, but that's not true; that's one way adding an index can slow a query. This may be the situation you're seeing, based on your description--if the large ranges are touching more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than had you just read the table directly, in a couple of cases. The cause of most of them show up if you run the query using EXPLAIN ANALYZE. If the "expected" values versus the "actual" numbers are very different, this can suggest the optimizer had bad statistics on the table. Another possibility is that the optimizer just made a mistake about how selective the query is--it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics is the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens at. That's only something to consider after the stats information is checked though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
I'd try several things:
increase DB cache parameters
add the index on that date field
redesign/modify the application to work with smaller ranges (althogh this suggestion might seem obvious, it is usually first to be thrown away)
The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table getting opened on large ranges. And if so, clustering the table along that index would lead to less disk seeks.
Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then INDEX the date on each table. PostgreSQL is smart enough to only perform index_scan's on the child tables that have the actual data in the date range. Once the child table is "sealed" because it is a new month, run CLUSTER on the table to sort the data by date.
2) Look at creating a bunch of INDEX's that use WHERE clauses.
Suggestion #1 is going to be the winner long term but will take some work to setup (but will scale/run forever), but suggestion #2 may be a quick interim fix if you have a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in your INDEX's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date <= '2011-06-01';

Performance of returning entire tables containing blog text as opposed to selecting specific columns

I think this is a pretty common scenario: I have a webpage that's returning links and excerpts to the 10 most recent blog entries.
If I just queried the entire table, I could use my ORM mapped object, but I'd be downloading all the blog text.
If I restricted the query to just the columns that I need, I'd be defining another class that'll hold just those required fields.
How bad is the performance hit if I were to query entire rows? Is it worth selecting just what I need?
The answer is "it depends".
There are two things that affect performance as far as column selection:
Are there covering indexes? E.g. if there is an index containing ALL of the columns in the smaller query, then a smaller column set would be extremely benefifical performance wise, since the index would be read without reading any rows themselves.
Size of columns. Basically, count how big the size of the entire row is, vs. size of only the columns in smaller query.
If the ratio is significant (e.g. full row is 3x bigger), then you might have significant savings in both IO (for retrieval) and network (for transmission) cost.
If the ratio is more like 10% benefit, it might not be worth it as far as DB performance gain.
It depends, but it will never be as efficient as returning only the columns you need (obviously). If there are few rows and the row sizes are small, then network bandwidth won't be affected too badly.
But, returning only the columns you need increases the chance that there is a covering index that can be used to satisfy the query, and that can make a big difference in the time a query takes to execute.
,Since you specify that it's for 10 records, the answer changes from "It Depends" to "Don't spend even a second worrying about this".
Unless your server is in another country on a dialup connection, wire time for 10 records will be zero, regardless of how many bytes you shave off each row. It's simply not something worth optimizing for.
So for this case, you get to set your ORM free to grab you those records in the least efficient manner it can come up with. If your situation changes, and you suddenly need more than, say, 1000 records at once, then you can come back and we'll make fun of you for not specifying columns, but for now you get a free pass.
For extra credit, once you start issuing this homepage query more than 10x per second, you can add caching on the server to avoid repeatedly hitting the database. That'll get you a lot more bang for your buck than optimizing the query.