Estimate Rows vs Actual Rows, what is the impact on performance?

Estimate Rows vs Actual Rows, what is the impact on performance? - sql

I have a query that performs very quickly but in production when server loads are high its performance is underwhelming. I have a suspicion that it might be the Estimated Rows being much lower than the Actual Rows in the execution plan. I know that server statistics are not stale.
I am now optimizing a new query and I worry that it will have the same problem in production. The number of rows returned and the CPU and Reads are well within the designated thresholds my data admins require. As you can see in the above SQL Sentry plan there are a few temp tables that estimate a single row but return 100 times as many rows.
My question is this, even when the number of rows are few, does a difference in rows by such a large percentage cause bottlenecks on the server's performance? Secondary question, if the problem isn't a bad cached plan or stale stats, what other issues would cause a plan to show such a discrepancy?

A difference between actual and estimated rows does not cause a "bottleneck" in the server.
The impact is on algorithms and resource allocation for the query. SQL Server has multiple algorithms that it can use for things like JOINs and GROUP BYs. The (estimated) size of the data is one of the primary items of information that it uses to choose the appropriate algorithm.
Choosing the wrong algorithm is not exactly a bottleneck, but it does slow the query down. You would need to study the execution plan to see if this is happening in your case.
If you have simple queries that select from a single table, then there are many fewer options for the execution plan. The only impact I can readily think of in this case would be using an full table scan rather than an index for filtering. For your data sizes, I don't think that would make much of a difference.

Estimate Rows vs Actual Rows, what is the impact on performance?
If there is huge difference between Estimate Rows and Actual Rows then you need to worry about that query.
There can be no of reason for this.
Stale Statistics
Skewed data distribution : Here Statistics is updated, but it is skewed.Create Filtered Statistics for those index will help.
Un-Optimize query :Poorly written query.Join condition are in wrong manner.

Related

Can I speed up a large JDBC query with Futures?

I have a Postgres table with millions of rows. One of the columns is a timestamp and I need to query rows with time greater than x and less than y. Since I am getting hundreds of thousands of rows back, the query is taking a long time even though I indexed it.
My plan is to use a list of Futures to each make a query over a small time interval concurrently and then I will aggregate the results after.
Should I expect the large speedup I'm hoping for? Is there a better approach?

Depends on where you're bottlenecked. If it's on network I/O it won't help significantly.
If it's disk I/O, it might help a bit, since it'll effectively parallelize the query. But only if your DB server benefits significantly from I/O parallelization. It's possible that you won't see much benefit, so test it by manually dispatching a bunch of queries simultaneously and timing it to see before you bother reworking your code.
Make sure you're doing forward index scans though - see EXPLAIN ANALYZE; you'll get less benefit if you're doing backward index scans.
TL;DR: Benchmark and see.

Will the query plan be changed on different data size?

Suppose the data distribution does not change, For a same query, only dataset is enlarged a time, will the time taken also becomes 1 time? If the data distribution does not change, will the query plan change if in theory?

Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovaccum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but may happen for other reasons (wrap-around prevention vacuum, etc).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.

Here is an example. You have record on one page and an index. Consider the query:
select t.*
from table t
where col = x;
And, assume you have an index on col. With one record, the fastest way is to simply read the record and check the where clause. You could have 200 records on the page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.

SQL SERVER - Execution plan

I have a VIEW in both Databases. At one database, takes less then 1 second to run and but in the other database 1 minute or more to go. I check indexes and everything is the same. The diference between the number of rows is below than 10 millions of rows from each other database.
I check de exectuion plan, and what i found is that, the database that takes more time, i have 3 Hash Match(1 aggregate and 2 right outer join) that is responssible for 100% on the query batch. On the other database i don't have this in the execution plan.
Can anyone tell me where can i begin to search the problem?
Thank you, sorry for the bad english.

You can check this link here for a quick explanation on different types of joins.
Basically, with the information you've given us, here are some of the alternatives for what might be wrong:
One DB has indexes the other doesn't.
The size difference between some of the joined tables in one DB over the other, is dramatic enough to change the type of join used.
While your indexes might be the same on both DB table groups, as you said.. it's possible the other DB has outdated / bad statistics or too much index fragmentation, resulting in sub-optimal plans.
EDIT:
Regarding your comment below, it's true that rebuilding indexes is similar to dropping & recreating indexes. And since creating indexes also creates the statistics for those indexes, rebuilding will take care of them as well. Sometimes that's not enough however.
While officially default statistics should be built with about 20% sampling rate of the actual data, in reality the sampling rate can be as low as just a few percents depending on how massive the table is. It's rarely anywhere near 20%. Because of that, many DBA's build statistics manually with FULLSCAN to obtain a 100% sampling rate.
The statistics take equally much storage space either way, so there are really no downsides to this aside from the extra time required in maintenance plans. In my current project, we have several situations where the default sampling rate for the statistics is not enough, and would still produce bad plans. So we routinely update all statistics with FULLSCAN every few weeks to make sure the performance stays top notch.

Querying Oracle table of high degree of parallelism results in full table scan

Well, the title described what I've just encountered recently with Oracle database.
Here's some background:
Table in concern in partitioned by hash into 4 partitions.
Parallel degree of the table is 4.
Hash key equals PK.
There is quite a number of rows in the table, around 200M.
PK index is also partitioned (local partition).
Parallel degree of the index is 1.
Okay now I've got a query behaves strangely as I change the parallel degree of the table.
If table degree is 4, it results in full table scan (coordinated parallel full table scan) as revealed by explain plan. Takes 30 minutes or more to complete the query.
If table degree is 1-3, it correctly make use of the PK index (range scan, single threaded) and returns result in 20 seconds.
If I set both table degree and index degree to 4, results in full table scan (same result as the first scenario in above).
This behavior, however, does not happen in another database where I have an nearly identical clone of the table. The only difference is number of records. The table in another database is of slightly smaller size (minus 1-2 million). The smaller table, also with degree of 4, does not runs into full table scan with the same query.
I've spent some time on Googling around and found the following things about parallel query:
From Oracle official doc
A high degree of parallelism for a table skews the optimizer toward full table scans over range scans. Examine the DEGREE column in ALL_TABLES for the table to determine the degree of parallelism.
And from http://www.toadworld.com/Portals/0/GuyH/Articles/Oracle%20Parallel%20SQL%20Part%201.pdf
Parallel query should be applied when
The SQL performs at least one full table, index or partition scan
And from AskTom.com
Parallel query is suitable for a certain class of large problems: very large problems
that have no other solution. Parallel query is my last path of action for solving a
performance problem; it's never my first course of action.
It seems that parallel execution is designed for processing a very large scale of data when no other better solution exists. It attempts to give better performance by running things in parallel, with each CPU (process) dedicated to work on separated portion of data (block range, table partitions or index partitions). Such that it is not designed to speed up general query, or query that does not cover a sufficient portion of the whole table.
Is my above understanding correct that parallel should not be used as a mean to speed up general query?
If yes, is that also means that the best practice to turn off parallel (degree as 0) and enable for particular query/operation through hint or parallel clause?
And in addition to all, what should be the best practice for setting up PARALLEL? If what I want to do is give best read performance through multi-threading, what should the setup be?
Lots of questions here. Lots of thanks in advance.

As a general rule I agree with Tom. Our main base table is an approx 240m rows iot, plus other indexes, with somewhere between 10 and 1,000 insert, delete, update operations happening 24 hours a day. We generally get information out of it in split seconds and then if we want a lot of information go for the full scan and deal with the 2.5 hours it takes. In answer to some of your questions, if you're going to be doing more large queries than small ones then go with the partition. If not then don't.

For your specific query, parallelism likely isn't your biggest problem. The new estimated cost and time of a query will be very roughly equal to the original cost divided by the degree of parallelism. The optimizer could be wrong here; for example, if you only have one hard drive then the new plan probably won't be any faster at all. But a 4x estimate mistake shouldn't lead to a 90x performance difference. This leads me to believe that your plan was already on the brink of failure, and this just tipped it over. How close are the estimated and actual cardinalities of your non-parallel plan? Whatever is causing those differences might be responsible for the bulk of your problem.
For your more general questions, there are no simple answers. There are several dozen things you may need to consider for parallelism, only you can know which ones will apply to your situation. Your best bet is to stop trying to Google it, and instead read the manual. The Using Parallel Execution chapter in the Data Warehousing Guide is a good place to start.

Degree of a relation or table in SQL means number of attribute in a relation.
For Example: If a relation in SQL has three rows and four columns then its degree in four. Simply we can say that number of columns of a relation called its degree.

Does the speed of the query depend on the number of rows in the table?

Let's say I have this query:
select * from table1 r where r.x = 5
Does the speed of this query depend on the number of rows that are present in table1?

The are many factors on the speed of a query, one of which can be the number of rows.
Others include:
index strategy (if you index column "x", you will see better performance than if it's not indexed)
server load
data caching - once you've executed a query, the data will be added to the data cache. So subsequent reruns will be much quicker as the data is coming from memory, not disk. Until such point where the data is removed from the cache
execution plan caching - to a lesser extent. Once a query is executed for the first time, the execution plan SQL Server comes up with will be cached for a period of time, for future executions to reuse.
server hardware
the way you've written the query (often one of the biggest contibutors to poor performance!). e.g. writing something using a cursor instead of a set-based operation
For databases with a large number of rows in tables, partitioning is usually something to consider (with SQL Server 2005 onwards, Enterprise Edition there is built-in support). This is to split the data down into smaller units. Generally, smaller units = smaller tables = smaller indexes = better performance.

Yes, and it can be very significant.
If there's 100 million rows, SQL server has to go through each of them and see if it matches.
That takes a lot more time compared to there being 10 rows.
You probably want an index on the 'x' column, in which case the sql server might check the index rather than going through all the rows - which can be significantly faster as the sql server might not even need to check all the values in the index.
On the other hand, if there's 100 million rows matching x = 5, it's slower than 10 rows.

Almost always yes. The real question is: what is the rate at which the query slows down as the table size increases? And the answer is: by not much if r.x is indexed, and by a large amount if not.

Not the rows (to a certain degree of course) per se, but the amount of data (columns) is what can make a query slow. The data also needs to be transfered from the backend to the frontend.

The Answer is Yes. But not the only factor.
if you did appropriate optimizations and tuning the performance drop will be negligible
Main Performance factors
Indexing Clustered or None clustered
Data Caching
Table Partitioning
Execution Plan caching
Data Distribution
Hardware specs
There are some other factors but these are mainly considered.
Even how you designed your Schema makes effect on the performance.

You should assume that your query always depends on the number of rows. In fact, you should assume the worst case (linear or O(N) for the example you provided) and exponential for more complex queries. There are database specific manuals filled with tricks to help you avoid the worst case but SQL itself is a language and doesn't specify how to execute your query. Instead, the database implementation decides how to execute any given query: if you have indexed a column or set of columns in your database then you will get O(log(N)) performance for a simple lookup; if the system has effective query caching you might get O(1) response. Here is a good introductory article: High scalability: SQL and computational complexity

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas