Will the query plan be changed on different data size? - sql

Suppose the data distribution does not change. For the same query, if only the dataset is enlarged by some factor, will the time taken grow by the same factor? And if the data distribution does not change, can the query plan still change, at least in theory?

Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovacuum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but may happen for other reasons (wrap-around prevention vacuum, etc).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
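A minimal sketch of that adjustment in PostgreSQL. The table `measurements` and the column `sensor_id` are hypothetical, and 500 is only an illustrative target, not a recommendation:

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE measurements (
    id        bigserial PRIMARY KEY,
    sensor_id integer NOT NULL,
    reading   numeric NOT NULL
);

-- Raise the per-column statistics target (illustrative value).
ALTER TABLE measurements ALTER COLUMN sensor_id SET STATISTICS 500;

-- Rebuild the statistics so the new target takes effect.
ANALYZE measurements;

-- Inspect what the planner now knows about the column.
SELECT n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'measurements' AND attname = 'sensor_id';
```

Setting the target per column keeps the extra planning cost limited to the columns whose estimates actually matter.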
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
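One way to watch this happen is to run the same query against the same distribution at two sizes. This is a sketch for PostgreSQL; `plan_demo` and its columns are hypothetical, and which plan you actually get depends on your version and cost settings:

```sql
-- Hypothetical table; the same distribution at two different sizes.
CREATE TABLE plan_demo (id integer PRIMARY KEY, col integer NOT NULL);
CREATE INDEX plan_demo_col_idx ON plan_demo (col);

-- 1000 rows, 100 distinct values in col.
INSERT INTO plan_demo
SELECT g, g % 100 FROM generate_series(1, 1000) AS g;
ANALYZE plan_demo;
EXPLAIN SELECT * FROM plan_demo WHERE col = 42;

-- Grow to a million rows with the same distribution.
INSERT INTO plan_demo
SELECT g, g % 100 FROM generate_series(1001, 1000000) AS g;
ANALYZE plan_demo;
EXPLAIN SELECT * FROM plan_demo WHERE col = 42;
```

Comparing the two EXPLAIN outputs shows whether the planner switched access paths as the row count grew, even though the share of rows matching `col = 42` stayed at 1%.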

Here is an example. You have one record on one page and an index. Consider the query:
select t.*
from table t
where col = x;
And, assume you have an index on col. With one record, the fastest way is to simply read the record and check the where clause. You could have 200 records on the page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.

Related

Estimate Rows vs Actual Rows, what is the impact on performance?

I have a query that performs very quickly but in production when server loads are high its performance is underwhelming. I have a suspicion that it might be the Estimated Rows being much lower than the Actual Rows in the execution plan. I know that server statistics are not stale.
I am now optimizing a new query and I worry that it will have the same problem in production. The number of rows returned and the CPU and Reads are well within the designated thresholds my data admins require. As you can see in the above SQL Sentry plan there are a few temp tables that estimate a single row but return 100 times as many rows.
My question is this, even when the number of rows are few, does a difference in rows by such a large percentage cause bottlenecks on the server's performance? Secondary question, if the problem isn't a bad cached plan or stale stats, what other issues would cause a plan to show such a discrepancy?
A difference between actual and estimated rows does not cause a "bottleneck" in the server.
The impact is on algorithms and resource allocation for the query. SQL Server has multiple algorithms that it can use for things like JOINs and GROUP BYs. The (estimated) size of the data is one of the primary items of information that it uses to choose the appropriate algorithm.
Choosing the wrong algorithm is not exactly a bottleneck, but it does slow the query down. You would need to study the execution plan to see if this is happening in your case.
If you have simple queries that select from a single table, then there are many fewer options for the execution plan. The only impact I can readily think of in this case would be using a full table scan rather than an index for filtering. For your data sizes, I don't think that would make much of a difference.
Estimate Rows vs Actual Rows, what is the impact on performance?
If there is a huge difference between estimated rows and actual rows, then you need to worry about that query.
There can be a number of reasons for this:
Stale statistics.
Skewed data distribution: the statistics are up to date, but the data is skewed. Creating filtered statistics for the affected indexes will help.
Unoptimized query: a poorly written query, for example one whose join conditions are written in the wrong manner.
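As a rough illustration of the first two points, here is a T-SQL sketch for SQL Server. The table `dbo.Orders`, its columns, and the `'Pending'` filter are all hypothetical; the right filter depends on where your data is actually skewed:

```sql
-- T-SQL (SQL Server). dbo.Orders and its columns are hypothetical.
CREATE TABLE dbo.Orders (
    OrderId INT IDENTITY PRIMARY KEY,
    Status  VARCHAR(20) NOT NULL,
    Amount  DECIMAL(10, 2) NOT NULL
);

-- Refresh the existing statistics first, to rule out staleness.
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- Filtered statistics covering only the skewed slice of the data.
CREATE STATISTICS st_Orders_Pending_Amount
ON dbo.Orders (Amount)
WHERE Status = 'Pending'
WITH FULLSCAN;
```

After this, re-run the query and compare estimated and actual rows in the plan again to see whether the gap has narrowed.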

Make query run faster - IT HAS NO JOIN

I have a really huge amount of data that used to be joined from all over the place just to retrieve it (because that was really slow, the team decided to gather it all into one table). But now, even though the data is literally all in one table (no join needed), it's still so slow.
Filtering events over a one-day range leads to a timeout (it takes more than 10s, yes, that's how bad it is).
What should I suggest to my DBA?
What is the "selectivity"? That is, how many rows does your select expect to retrieve? 100% of the rows? 1% of the rows? 0.01% of the rows?
1. Low selectivity
If the selectivity is low (i.e. less than 5%, ideally less than 0.5%) then good indexing is the best practice.
If so, which columns in the where clause (filtering columns) have the best (lowest) selectivity? Add these columns first in the index.
Once you have decided on the best index, you can make the table a "clustered index" table using that index. That way the heap will be presorted (fast lookup) by the index columns, which improves I/O since the disk blocks will be read sequentially.
2. High selectivity
If the selectivity is high (20% or more), there's not much you can do on your side (development). You could still get some improvement by:
Removing unneeded columns.
Making sure the select uses a FULL TABLE SCAN.
Asking the DBA to assign more resources (SGA, disk priority, parallelism, etc.)
3. Otherwise
The amount of data you have vastly exceeds the database resources you have. There's nothing you can do about it, except to tell the client about this reality, and:
Find together a way of defining smaller queries that can be achievable.
4. Finally
If you don't understand the terms selectivity, full table scan, indexing, database resources, heap, and disk blocks, I would recommend you study them. I'm fairly sure you need to fully understand them right now!
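The indexing advice in point 1 can be sketched as follows. The question does not name the database or the schema, so this uses PostgreSQL-flavored SQL, and the table `events`, its columns, and the choice of `customer_id` as the most selective filter are all assumptions:

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE events (
    event_id    bigint PRIMARY KEY,
    customer_id integer NOT NULL,
    event_time  timestamp NOT NULL,
    payload     varchar(200)
);

-- Most selective filtering column first (assumed to be customer_id),
-- followed by the column used for the range filter.
CREATE INDEX events_customer_time_idx ON events (customer_id, event_time);

-- A one-day range query this index can serve. It lists only the
-- columns it needs instead of using SELECT *.
SELECT event_id, event_time
FROM events
WHERE customer_id = 42
  AND event_time >= '2024-01-01'
  AND event_time <  '2024-01-02';
```

If the only filter is the date range itself, a single-column index on the time column is the equivalent starting point.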
As others have said, you need an index. However if it's really huge you can partition the data.
This allows you to drop sections of the data without using time consuming deletes. For example if you're working with some sort of historical data and want to keep 3 months worth, you can partition by month, then each month drop the oldest partition.
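A sketch of monthly partitioning, assuming PostgreSQL 10 or later (declarative partitioning); other databases have their own syntax. The table `event_log` and the dates are hypothetical:

```sql
-- Hypothetical parent table, partitioned by month on event_time.
CREATE TABLE event_log (
    event_time timestamp NOT NULL,
    detail     text
) PARTITION BY RANGE (event_time);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE event_log_2024_01 PARTITION OF event_log
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE event_log_2024_02 PARTITION OF event_log
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE event_log_2024_03 PARTITION OF event_log
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');

-- Retiring the oldest month is a quick metadata operation,
-- not a row-by-row DELETE.
DROP TABLE event_log_2024_01;
```

Queries that filter on `event_time` can also skip partitions outside the requested range, which helps the one-day filter from the question.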
However on a more general note, it's rarely a good idea to take a slow multi-table query and glom it all together to improve performance. What you really need is to figure out what's wrong with the slow query and fix it.
This is a job for your DBA.

Postgres query optimization

On Postgres 9.0, I set both enable_indexscan and enable_seqscan to off. Why does it improve query performance by 2x?
This may help some queries run faster, but is almost certain to make other queries slower. It's interesting information for diagnostic purposes, but a bad idea for a long-term "solution".
PostgreSQL uses a cost-based optimizer, which looks at the costs of all possible plans based on statistics gathered by scanning your tables (normally by autovacuum) and costing factors. If it's not choosing the fastest plan, it is usually because your costing factors don't accurately model actual costs for your environment, statistics are not up-to-date, or statistics are not fine-grained enough.
After turning enable_indexscan and enable_seqscan back on:
I have generally found the cpu_tuple_cost default to be too low; I have often seen better plans chosen by setting that to 0.03 instead of the default 0.01; and I've never seen that override cause problems.
If the active portion of your database fits in RAM, try reducing both seq_page_cost and random_page_cost to 0.1.
Be sure to set effective_cache_size to the sum of shared_buffers and whatever your OS is showing as cached.
Never disable autovacuum. You might want to adjust parameters, but do that very carefully, with small incremental changes and subsequent monitoring.
You may need to occasionally run explicit VACUUM ANALYZE or ANALYZE commands, especially for temporary tables or tables which have just had a lot of modifications and are about to be used in queries.
You might want to increase default_statistics_target, from_collapse_limit, join_collapse_limit, or some geqo settings; but it's hard to tell whether those are appropriate without a lot more detail than you've given so far.
You can try out a query with different costing factors set on a single connection. When you confirm a configuration which works well for your whole mix (i.e., it accurately models costs in your environment), you should make the updates in your postgresql.conf file.
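A sketch of such a single-connection experiment. The table `cost_demo` is hypothetical, and the cost values are simply the ones suggested above, not tuned recommendations for your hardware:

```sql
-- Hypothetical table with enough data that the plans are not trivial.
CREATE TEMP TABLE cost_demo AS
SELECT g AS id, g % 1000 AS bucket
FROM generate_series(1, 100000) AS g;
CREATE INDEX cost_demo_bucket_idx ON cost_demo (bucket);
ANALYZE cost_demo;

-- Baseline plan with the current settings.
EXPLAIN ANALYZE SELECT * FROM cost_demo WHERE bucket = 7;

-- Session-local overrides; nothing here touches postgresql.conf.
SET cpu_tuple_cost = 0.03;
SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
EXPLAIN ANALYZE SELECT * FROM cost_demo WHERE bucket = 7;

-- Return to the server defaults for this session.
RESET cpu_tuple_cost;
RESET seq_page_cost;
RESET random_page_cost;
```

Because SET only affects the current session, other connections keep the server-wide configuration while you experiment.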
If you want more targeted help, please show the structure of the tables, the query itself, and the results of running EXPLAIN ANALYZE for the query. A description of your OS and hardware helps a lot, too, along with your PostgreSQL configuration.
Why ?
The most logical answer is because of the way your database tables are configured.
Without you posting your table schemas I can only hazard a guess that your indices don't have a high cardinality.
That is to say, if your index contains too much information to be useful, then it will be far less efficient, or indeed slower.
Cardinality is a measure of how unique a row in your index is. The lower the cardinality, the slower your query will be.
A perfect example is having a boolean field in your index; perhaps you have a Contacts table in your database and it has a boolean column that records true or false depending on whether the customer would like to be contacted by a third party.
On average, if you ran 'select * from Contacts where OptIn = true', you can imagine that you'd return a lot of Contacts; imagine 50% of contacts in our case.
Now if you add this 'Optin' column to an index on that same table; it stands to reason that no matter how fine the other selectors are, you will always return 50% of the table, because of the value of 'OptIn'.
This is a perfect example of low cardinality; it will be slow because any query involving that index will have to select 50% of the rows in the table; to then be able to apply further WHERE filters to reduce the dataset again.
Long story short; If your Indices include bad fields or simply represent every column in the table; then the SQL engine has to resort to testing row-by-agonizing-row.
Anyway, the above is theoretical in your case; but it is a known common reason for why queries suddenly start taking much longer.
Please fill in the gaps regarding your data structure, index definitions and the actual query that is really slow!

Improve performance of queries in PostgreSQL with an index

I have tables in PostgreSQL, each with millions of records and more than one hundred fields.
One of them is a date field, which we filter by in our queries. Creating an index on this date field improved the performance of the queries that read a small range of dates, but for big ranges of dates the performance decreased...
Must I prioritize one over the other? Can the performance for small ranges be improved without slowing down the big-range queries?
In PostgreSQL releases before 9.2, queries cannot be answered using only the information in an index (9.2 added index-only scans, which still rely on the visibility map). Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is if you are actually grabbing a large portion of the data. Typically if more than 20% of the table is used, it's considered faster to just sequentially access it. Sometimes the planner thinks less than 20% will be accessed, so the index is preferred, but that's not true; that's one way adding an index can slow a query. This may be the situation you're seeing, based on your description--if the large ranges are touching more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than had you just read the table directly, in a couple of cases. The causes of most of them show up if you run the query using EXPLAIN ANALYZE. If the "expected" values versus the "actual" numbers are very different, this can suggest the optimizer had bad statistics on the table. Another possibility is that the optimizer just made a mistake about how selective the query is--it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics are the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
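A sketch of that check. The table `readings`, its columns, and the date range are hypothetical stand-ins for your own table and query:

```sql
-- Hypothetical table: 100000 rows spread evenly over one year.
CREATE TABLE readings (id integer PRIMARY KEY, taken_on date NOT NULL);
INSERT INTO readings
SELECT g, DATE '2011-01-01' + (g % 365)
FROM generate_series(1, 100000) AS g;
CREATE INDEX readings_taken_on_idx ON readings (taken_on);

-- Each plan node reports "rows=" twice: the planner's estimate in the
-- cost section, then the real count in the "actual" section.
EXPLAIN ANALYZE
SELECT * FROM readings
WHERE taken_on BETWEEN '2011-03-01' AND '2011-09-30';

-- If the two numbers diverge badly, refresh the statistics and compare again.
ANALYZE readings;
EXPLAIN ANALYZE
SELECT * FROM readings
WHERE taken_on BETWEEN '2011-03-01' AND '2011-09-30';
```

If the estimates are still far off after ANALYZE, raising the statistics target on the date column is the usual next step.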
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens at. That's only something to consider after the stats information is checked though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
I'd try several things:
increase DB cache parameters
add the index on that date field
redesign/modify the application to work with smaller ranges (although this suggestion might seem obvious, it is usually the first to be thrown away)
The creation of an index for this date field improved the performance of the queries that read a small range of dates, but in big range of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table getting opened on large ranges. And if so, clustering the table along that index would lead to fewer disk seeks.
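A minimal sketch of clustering in PostgreSQL (8.3 or later syntax). The table `visits` and its index are hypothetical:

```sql
-- Hypothetical table and date index.
CREATE TABLE visits (id integer PRIMARY KEY, visited_on date NOT NULL);
CREATE INDEX visits_visited_on_idx ON visits (visited_on);

-- Rewrite the table in index order. This takes an exclusive lock while
-- it runs, and the ordering is not maintained for rows inserted later.
CLUSTER visits USING visits_visited_on_idx;

-- Refresh statistics so the planner sees the new physical ordering.
ANALYZE visits;
```

Because the ordering decays as new rows arrive, CLUSTER usually has to be repeated periodically on tables that keep growing.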
Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then index the date on each table. PostgreSQL is smart enough to only perform index scans on the child tables that have the actual data in the date range. Once the child table is "sealed" because it is a new month, run CLUSTER on the table to sort the data by date.
2) Look at creating a bunch of partial indexes (indexes that use WHERE clauses).
Suggestion #1 is going to be the winner long term but will take some work to set up (but will scale/run forever), while suggestion #2 may be a quick interim fix if you have a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in your index's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date < '2011-06-01';

Does the speed of the query depend on the number of rows in the table?

Let's say I have this query:
select * from table1 r where r.x = 5
Does the speed of this query depend on the number of rows that are present in table1?
There are many factors affecting the speed of a query, one of which can be the number of rows.
Others include:
index strategy (if you index column "x", you will see better performance than if it's not indexed)
server load
data caching - once you've executed a query, the data will be added to the data cache. So subsequent reruns will be much quicker as the data is coming from memory, not disk. Until such point where the data is removed from the cache
execution plan caching - to a lesser extent. Once a query is executed for the first time, the execution plan SQL Server comes up with will be cached for a period of time, for future executions to reuse.
server hardware
the way you've written the query (often one of the biggest contributors to poor performance!). e.g. writing something using a cursor instead of a set-based operation
For databases with a large number of rows in tables, partitioning is usually something to consider (with SQL Server 2005 onwards, Enterprise Edition there is built-in support). This is to split the data down into smaller units. Generally, smaller units = smaller tables = smaller indexes = better performance.
Yes, and it can be very significant.
If there are 100 million rows, SQL Server has to go through each of them and see if it matches.
That takes a lot more time compared to there being 10 rows.
You probably want an index on the 'x' column, in which case SQL Server might check the index rather than going through all the rows - which can be significantly faster, as it might not even need to check all the values in the index.
On the other hand, if there's 100 million rows matching x = 5, it's slower than 10 rows.
Almost always yes. The real question is: what is the rate at which the query slows down as the table size increases? And the answer is: by not much if r.x is indexed, and by a large amount if not.
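To compare the two cases yourself, here is a sketch in PostgreSQL-flavored SQL (an assumption, since the question does not name the database; `generate_series` and `EXPLAIN` are PostgreSQL features). The columns of `table1` other than `x` are hypothetical:

```sql
-- Hypothetical definition of table1 with some sample data.
CREATE TABLE table1 (id integer PRIMARY KEY, x integer NOT NULL);
INSERT INTO table1
SELECT g, g % 1000 FROM generate_series(1, 100000) AS g;
ANALYZE table1;

-- Without an index on x, every row has to be examined.
EXPLAIN SELECT * FROM table1 r WHERE r.x = 5;

-- With an index on x, the planner can look up matching rows directly.
CREATE INDEX table1_x_idx ON table1 (x);
ANALYZE table1;
EXPLAIN SELECT * FROM table1 r WHERE r.x = 5;
```

Repeating the experiment at a larger row count shows how differently the estimated cost grows with and without the index.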
Not the rows (to a certain degree of course) per se, but the amount of data (columns) is what can make a query slow. The data also needs to be transferred from the backend to the frontend.
The answer is yes, but it is not the only factor.
If you have done the appropriate optimization and tuning, the performance drop will be negligible.
Main Performance factors
Indexing (clustered or nonclustered)
Data Caching
Table Partitioning
Execution Plan caching
Data Distribution
Hardware specs
There are some other factors but these are mainly considered.
Even how you designed your schema affects performance.
You should assume that your query always depends on the number of rows. In fact, you should assume the worst case (linear or O(N) for the example you provided) and exponential for more complex queries. There are database specific manuals filled with tricks to help you avoid the worst case but SQL itself is a language and doesn't specify how to execute your query. Instead, the database implementation decides how to execute any given query: if you have indexed a column or set of columns in your database then you will get O(log(N)) performance for a simple lookup; if the system has effective query caching you might get O(1) response. Here is a good introductory article: High scalability: SQL and computational complexity