SQL SERVER - Execution plan

I have the same VIEW in two databases. In one database it takes less than 1 second to run, but in the other it takes a minute or more. I checked the indexes and everything is the same. The difference in row counts between the two databases is less than 10 million rows.
I checked the execution plan, and what I found is that in the database that takes longer there are 3 Hash Match operators (1 aggregate and 2 right outer joins) that are responsible for 100% of the cost of the query batch. The other database doesn't have these in its execution plan.
Can anyone tell me where I should begin to look for the problem?
Thank you, and sorry for the bad English.

You can check this link here for a quick explanation on different types of joins.
Basically, with the information you've given us, here are some of the alternatives for what might be wrong:
One DB has indexes the other doesn't.
The size difference between some of the joined tables in one DB and the other is dramatic enough to change the type of join used.
While your indexes might be the same on both DB table groups, as you said, it's possible the other DB has outdated / bad statistics or too much index fragmentation, resulting in sub-optimal plans (see the checks below).
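As a starting point for comparison, a quick sketch like this (the table name is a placeholder) shows when statistics were last updated and how fragmented the indexes are, so you can run it against both databases:
-- When each statistics object on the table was last updated
SELECT s.name AS stats_name, STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.YourTable');
-- Average fragmentation per index on the same table
SELECT i.name AS index_name, ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourTable'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i ON i.object_id = ps.object_id AND i.index_id = ps.index_id;
Compare the results from both databases before deciding whether to rebuild or update anything.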
EDIT:
Regarding your comment below, it's true that rebuilding indexes is similar to dropping & recreating indexes. And since creating indexes also creates the statistics for those indexes, rebuilding will take care of them as well. Sometimes that's not enough however.
While officially default statistics should be built with about a 20% sampling rate of the actual data, in reality the sampling rate can be as low as just a few percent depending on how massive the table is. It's rarely anywhere near 20%. Because of that, many DBAs build statistics manually with FULLSCAN to obtain a 100% sampling rate.
The statistics take the same amount of storage space either way, so there are really no downsides to this aside from the extra time required in maintenance plans. In my current project, we have several situations where the default sampling rate for the statistics is not enough and would still produce bad plans. So we routinely update all statistics with FULLSCAN every few weeks to make sure the performance stays top notch.
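A minimal sketch of such a maintenance step (the table name is a placeholder):
-- Update every statistics object on the table with a 100% sampling rate
UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;
-- Or refresh statistics database-wide with the built-in procedure (uses the default sampling)
EXEC sp_updatestats;
In a real maintenance plan you would normally loop over the relevant tables rather than hard-code one name.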

Related

Estimate Rows vs Actual Rows, what is the impact on performance?

I have a query that performs very quickly but in production when server loads are high its performance is underwhelming. I have a suspicion that it might be the Estimated Rows being much lower than the Actual Rows in the execution plan. I know that server statistics are not stale.
I am now optimizing a new query and I worry that it will have the same problem in production. The number of rows returned and the CPU and Reads are well within the designated thresholds my data admins require. As you can see in the above SQL Sentry plan there are a few temp tables that estimate a single row but return 100 times as many rows.
My question is this: even when the number of rows is small, does a difference in rows by such a large percentage cause bottlenecks on the server's performance? Secondary question: if the problem isn't a bad cached plan or stale stats, what other issues would cause a plan to show such a discrepancy?
A difference between actual and estimated rows does not cause a "bottleneck" in the server.
The impact is on algorithms and resource allocation for the query. SQL Server has multiple algorithms that it can use for things like JOINs and GROUP BYs. The (estimated) size of the data is one of the primary items of information that it uses to choose the appropriate algorithm.
Choosing the wrong algorithm is not exactly a bottleneck, but it does slow the query down. You would need to study the execution plan to see if this is happening in your case.
If you have simple queries that select from a single table, then there are many fewer options for the execution plan. The only impact I can readily think of in this case would be using a full table scan rather than an index for filtering. For your data sizes, I don't think that would make much of a difference.
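If you want to see exactly where the estimates and actuals diverge, capturing the actual execution plan is the usual approach; a rough sketch in T-SQL, with hypothetical table names:
-- Return the actual plan XML (including actual row counts) alongside the results
SET STATISTICS XML ON;
SELECT c.CustomerId, COUNT(*) AS OrderCount
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
GROUP BY c.CustomerId;
SET STATISTICS XML OFF;
Compare the estimated and actual row counts reported for each operator in the returned plan.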
Estimate Rows vs Actual Rows, what is the impact on performance?
If there is a huge difference between Estimated Rows and Actual Rows, then you need to worry about that query.
There can be a number of reasons for this:
Stale statistics.
Skewed data distribution: the statistics are up to date, but the data is skewed. Creating filtered statistics for those indexes will help (see the sketch below).
Un-optimized query: a poorly written query, or join conditions written in the wrong manner.
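A minimal sketch of a filtered statistics object, with hypothetical table, column, and value names:
-- Statistics restricted to the skewed slice of the data, sampled in full
CREATE STATISTICS stat_Orders_Status_Open
ON dbo.Orders (Status)
WHERE Status = 'Open'
WITH FULLSCAN;
The filter predicate should match the predicate used by the problematic query so the optimizer can pick the filtered statistics up.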

Make query run faster - IT HAS NO JOIN

I have a really huge amount of data that used to require joins all over the place just to retrieve it (because that was really slow, the team decided to gather it all into one table). But even now that it's literally all in one table (no join needed), it's still so slow.
Filtering on a one-day range of events leads to a timeout (it takes more than 10 seconds, yes, that's how bad it is).
What should I suggest to my DBA?
What is the "selectivity"? That is, how many rows does your select expect to retrieve? 100% of the rows? 1% of the rows? 0.01% of the rows?
1. Low selectivity
If the selectivity is low (i.e. less than 5%, ideally less than 0.5%), then good indexing is the best practice.
If so, which columns in the where clause (filtering columns) have the best (lowest) selectivity? Add these columns first in the index.
Once you have decided on the best index, you can make the table a "clustered index" table using that index. That way the table will be pre-sorted (fast lookup) by the index columns, for improved I/O, since the disk blocks will be read sequentially.
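As a minimal sketch (table and column names are hypothetical), assuming the one-day filter is on an event timestamp column:
-- Non-clustered index leading with the filtering column
CREATE INDEX IX_Events_EventTime ON dbo.Events (EventTime);
-- Or, if it is the best filtering column overall, make it the clustering key
-- so the rows are stored physically ordered by it (only one clustered index per table)
CREATE CLUSTERED INDEX CIX_Events_EventTime ON dbo.Events (EventTime);
The syntax above is SQL Server flavoured; other engines express the same idea with a plain index plus a clustered or index-organized table.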
2. High selectivity
If the selectivity is high (20% or more), there's not much you can do on your side (development). You could still get some improvement by:
Removing unneeded columns.
Making sure the select uses a FULL TABLE SCAN (see the hint sketch below).
Asking the DBA to assign more resources (SGA, disk priority, parallelism, etc.)
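If this is Oracle (the SGA mention above suggests it), the full-scan preference can be expressed with a hint; a sketch with hypothetical names:
-- Ask the optimizer to read the whole table sequentially instead of using an index
select /*+ FULL(e) */ e.event_time, e.event_type
from events e
where e.event_time >= date '2024-01-01'
  and e.event_time <  date '2024-01-02';
Other engines have their own ways of steering the plan (index hints, plan guides, and so on).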
3. Otherwise
The amount of data you have vastly exceeds the database resources you have. There's nothing you can do about it, except to tell the client about this reality, and:
Work together to find a way of defining smaller queries that are achievable.
4. Finally
If you don't understand the terms selectivity, full table scan, indexing, database resources, heap, and disk blocks, I would recommend you study them. I'm fairly sure you need to fully understand them right now!
As others have said, you need an index. However if it's really huge you can partition the data.
This allows you to drop sections of the data without using time consuming deletes. For example if you're working with some sort of historical data and want to keep 3 months worth, you can partition by month, then each month drop the oldest partition.
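A rough sketch of that pattern in Oracle-style syntax (names and dates are made up for illustration):
-- Monthly range partitions on the event timestamp
create table events_part (
  event_time date not null,
  payload    varchar2(4000)
)
partition by range (event_time) (
  partition p_2024_01 values less than (date '2024-02-01'),
  partition p_2024_02 values less than (date '2024-03-01'),
  partition p_2024_03 values less than (date '2024-04-01')
);
-- Dropping a month of history is a quick metadata operation, not a slow DELETE
alter table events_part drop partition p_2024_01;
Other engines (SQL Server, PostgreSQL) use different partitioning syntax, but the idea of dropping whole partitions instead of deleting rows is the same.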
However on a more general note, it's rarely a good idea to take a slow multi-table query and glom it all together to improve performance. What you really need is to figure out what's wrong with the slow query and fix it.
This is a job for your DBA.

Will the query plan be changed on different data size?

Suppose the data distribution does not change. For the same query, if only the dataset is enlarged by some factor, will the time taken grow by the same factor? And if the data distribution does not change, will the query plan change, in theory?
Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovacuum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but may happen for other reasons (wrap-around prevention vacuum, etc).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
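A minimal sketch of that adjustment (table and column names are hypothetical):
-- Raise the per-column statistics target (the global default is 100), then refresh
ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 500;
ANALYZE my_table;
A larger target means a bigger histogram and most-common-values list, which gives the planner a steadier picture of a static dataset.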
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
Here is an example. You have a table whose records fit on one page, and an index. Consider the query:
select t.*
from table t
where col = x;
And, assume you have an index on col. With one record, the fastest way is to simply read the record and check the where clause. You could have 200 records on the page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.

Postgres query optimization

On Postgres 9.0, setting both index_scan and seq_scan to Off improves query performance by 2x. Why?
This may help some queries run faster, but is almost certain to make other queries slower. It's interesting information for diagnostic purposes, but a bad idea for a long-term "solution".
PostgreSQL uses a cost-based optimizer, which looks at the costs of all possible plans based on statistics gathered by scanning your tables (normally by autovacuum) and costing factors. If it's not choosing the fastest plan, it is usually because your costing factors don't accurately model actual costs for your environment, statistics are not up-to-date, or statistics are not fine-grained enough.
After turning index_scan and seq_scan back on:
I have generally found the cpu_tuple_cost default to be too low; I have often seen better plans chosen by setting that to 0.03 instead of the default 0.01; and I've never seen that override cause problems.
If the active portion of your database fits in RAM, try reducing both seq_page_cost and random_page_cost to 0.1.
Be sure to set effective_cache_size to the sum of shared_buffers and whatever your OS is showing as cached.
Never disable autovacuum. You might want to adjust parameters, but do that very carefully, with small incremental changes and subsequent monitoring.
You may need to occasionally run explicit VACUUM ANALYZE or ANALYZE commands, especially for temporary tables or tables which have just had a lot of modifications and are about to be used in queries.
You might want to increase default_statistics_target, from_collapse_limit, join_collapse_limit, or some geqo settings; but it's hard to tell whether those are appropriate without a lot more detail than you've given so far.
You can try out a query with different costing factors set on a single connection. When you confirm a configuration which works well for your whole mix (i.e., it accurately models costs in your environment), you should make the updates in your postgresql.conf file.
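A sketch of that kind of experiment on a single connection (the query and names are hypothetical):
-- Session-only overrides; they disappear when the connection closes
SET cpu_tuple_cost = 0.03;
SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
-- Re-check the chosen plan and its actual runtime under the new costing
EXPLAIN ANALYZE
SELECT *
FROM my_table
WHERE some_column = 42;
Once a combination consistently produces better plans across your workload, promote it to postgresql.conf.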
If you want more targeted help, please show the structure of the tables, the query itself, and the results of running EXPLAIN ANALYZE for the query. A description of your OS and hardware helps a lot, too, along with your PostgreSQL configuration.
Why?
The most logical answer is because of the way your database tables are configured.
Without you posting your table schemas I can only hazard a guess that your indices don't have high cardinality.
That is to say, if your index contains too much information to be useful, then it will be far less efficient, or indeed slower.
Cardinality is a measure of how unique a row in your index is. The lower the cardinality, the slower your query will be.
A perfect example is having a boolean field in your index; perhaps you have a Contacts table in your database and it has a boolean column that records true or false depending on whether the customer would like to be contacted by a third party.
Now, if you ran 'select * from Contacts where OptIn = true', you can imagine that you'd return a lot of contacts; say 50% of them in our case.
If you then add this 'OptIn' column to an index on that same table, it stands to reason that no matter how fine the other selectors are, you will always return 50% of the table, because of the value of 'OptIn'.
This is a perfect example of low cardinality: it will be slow because any query involving that index will have to select 50% of the rows in the table, and only then apply further WHERE filters to reduce the dataset again.
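To make that concrete, here is a hypothetical pair of index definitions for the Contacts example (column names other than OptIn are made up):
-- Low-cardinality leading column: every lookup still touches ~50% of the table
create index ix_contacts_optin on Contacts (OptIn);
-- Leading with a more selective column narrows the scan before OptIn is even considered
create index ix_contacts_lastname_optin on Contacts (LastName, OptIn);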
Long story short: if your indices include bad fields or simply represent every column in the table, then the SQL engine has to resort to testing row-by-agonizing-row.
Anyway, the above is theoretical in your case; but it is a known common reason for why queries suddenly start taking much longer.
Please fill in the gaps regarding your data structure, index definitions and the actual query that is really slow!

Querying Oracle table of high degree of parallelism results in full table scan

Well, the title described what I've just encountered recently with Oracle database.
Here's some background:
The table in concern is partitioned by hash into 4 partitions.
Parallel degree of the table is 4.
Hash key equals PK.
There is quite a number of rows in the table, around 200M.
PK index is also partitioned (local partition).
Parallel degree of the index is 1.
Okay, now I've got a query that behaves strangely as I change the parallel degree of the table.
If the table degree is 4, it results in a full table scan (coordinated parallel full table scan) as revealed by explain plan. It takes 30 minutes or more to complete the query.
If the table degree is 1-3, it correctly makes use of the PK index (range scan, single threaded) and returns the result in 20 seconds.
If I set both the table degree and the index degree to 4, it results in a full table scan (same result as the first scenario above).
This behavior, however, does not happen in another database where I have a nearly identical clone of the table. The only difference is the number of records: the table in the other database is slightly smaller (by 1-2 million rows). The smaller table, also with a degree of 4, does not run into a full table scan with the same query.
I've spent some time on Googling around and found the following things about parallel query:
From Oracle official doc
A high degree of parallelism for a table skews the optimizer toward full table scans over range scans. Examine the DEGREE column in ALL_TABLES for the table to determine the degree of parallelism.
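For reference, a quick way to check that column (the table name is a placeholder):
-- DEGREE is reported as a string such as '1', '4' or 'DEFAULT'
select table_name, degree
from all_tables
where table_name = 'MY_BIG_TABLE';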
And from http://www.toadworld.com/Portals/0/GuyH/Articles/Oracle%20Parallel%20SQL%20Part%201.pdf
Parallel query should be applied when
The SQL performs at least one full table, index or partition scan
And from AskTom.com
Parallel query is suitable for a certain class of large problems: very large problems
that have no other solution. Parallel query is my last path of action for solving a
performance problem; it's never my first course of action.
It seems that parallel execution is designed for processing a very large amount of data when no better solution exists. It attempts to give better performance by running things in parallel, with each CPU (process) dedicated to working on a separate portion of data (block range, table partitions or index partitions). As such, it is not designed to speed up general queries, or queries that do not cover a sufficient portion of the whole table.
Is my above understanding correct, that parallel should not be used as a means to speed up general queries?
If yes, does that also mean that the best practice is to turn off parallel (degree as 0) and enable it for particular queries/operations through a hint or parallel clause?
And in addition to all, what should be the best practice for setting up PARALLEL? If what I want to do is give best read performance through multi-threading, what should the setup be?
Lots of questions here. Lots of thanks in advance.
As a general rule I agree with Tom. Our main base table is an approx 240M-row IOT, plus other indexes, with somewhere between 10 and 1,000 insert, delete and update operations happening 24 hours a day. We generally get information out of it in a split second, and when we want a lot of information we go for the full scan and deal with the 2.5 hours it takes. In answer to some of your questions: if you're going to be doing more large queries than small ones, then go with the partition. If not, then don't.
For your specific query, parallelism likely isn't your biggest problem. The new estimated cost and time of a query will be very roughly equal to the original cost divided by the degree of parallelism. The optimizer could be wrong here; for example, if you only have one hard drive then the new plan probably won't be any faster at all. But a 4x estimate mistake shouldn't lead to a 90x performance difference. This leads me to believe that your plan was already on the brink of failure, and this just tipped it over. How close are the estimated and actual cardinalities of your non-parallel plan? Whatever is causing those differences might be responsible for the bulk of your problem.
For your more general questions, there are no simple answers. There are several dozen things you may need to consider for parallelism, only you can know which ones will apply to your situation. Your best bet is to stop trying to Google it, and instead read the manual. The Using Parallel Execution chapter in the Data Warehousing Guide is a good place to start.
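If you do decide to keep the table serial by default and opt in per statement (as asked above), a sketch of that pattern (the table name is hypothetical):
-- Remove the table-level default degree of parallelism
alter table my_big_table noparallel;
-- Request parallelism only for the statements that benefit from it
select /*+ PARALLEL(t, 4) */ count(*)
from my_big_table t;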
The degree of a relation or table in SQL means the number of attributes in the relation.
For example: if a relation in SQL has three rows and four columns, then its degree is four. Simply put, the number of columns of a relation is called its degree.