My department was recently reprimanded (nicely) by our IT department for running queries with very high costs, on the premise that our queries have a real possibility of destabilizing or crashing the database. None of us are DBAs; we're just researchers who write and execute queries against the database, and I'm probably the only one who had ever looked at an explain plan before the reprimand.
We were told that query costs over 100 should be very rare, and queries with costs over 1000 should never be run. The problems I am running into are that cost seems to have no correlation with execution time, and I'm losing productivity while trying to optimize my queries.
As an example, I have a query that executes in under 5 seconds with a cost of 10844. I rewrote the query to use a view that contains most of the information I need, and got the cost down to 109, but the new query, which retrieves the same results, takes 40 seconds to run. I found a question here with a possible explanation:
Measuring Query Performance : "Execution Plan Query Cost" vs "Time Taken"
That question led me to parallelism hints. I tried using /*+ no_parallel */ in the cost-10844 query, but neither the cost nor the execution time changed, so I'm not sure that parallelism is the explanation for the faster execution time but higher cost. Then I tried the /*+ parallel(n) */ hint, and found that the higher the value of n, the lower the cost of the query. In the case of the cost-10844 query, /*+ parallel(140) */ dropped the cost to 97, with a very minor increase in execution time.
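For anyone wanting to reproduce this, here is a minimal sketch of how to compare the optimizer's estimated cost with and without the hint (the table name is made up; this just illustrates the mechanism):

```sql
-- Hypothetical illustration: the same statement with and without a parallel
-- hint; big_table is a made-up name.
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM big_table;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);   -- note the Cost column

EXPLAIN PLAN FOR
SELECT /*+ parallel(8) */ COUNT(*) FROM big_table;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);   -- estimated cost drops as the
                                           -- degree of parallelism rises
```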
This seemed like an ideal "cheat" to meet the requirements that our IT department set forth, but then I read this:
http://www.oracle.com/technetwork/articles/datawarehouse/twp-parallel-execution-fundamentals-133639.pdf
The article contains this sentence:
Parallel execution can enable a single operation to utilize all system resources.
So, my questions are:
Am I actually placing more strain on the server resources by using the /*+ parallel(n)*/ hint with a very high degree of parallelism, even though I am lowering the cost?
Assuming no parallelism, is execution speed a better measure of resources used than cost?
The rule your DBA gave you doesn't make a lot of sense. Worrying about the cost that is reported for a query is very seldom productive. First, you cannot directly compare the cost of two different queries: one query with a cost in the millions may run very quickly and consume very few system resources, while another with a cost in the hundreds may run for hours and bring the server to its knees. Second, cost is an estimate. If the optimizer made an accurate estimate of the cost, that strongly implies it has come up with the optimal query plan, which would mean it is unlikely you'd be able to modify the query to return the same results while using fewer resources. If the optimizer made an inaccurate estimate of the cost, that strongly implies it has come up with a poor query plan, in which case the reported cost would have no relationship to any useful metric you'd come up with. Most of the time, the queries you're trying to optimize are exactly those where the optimizer generated an incorrect plan because it misestimated the cost of various steps.
Tricking the optimizer by using hints that may or may not actually change the query plan (depending on how parallelism is configured, for example) is very unlikely to solve a problem; it's much more likely to make the optimizer's estimates less accurate and make it more likely to choose a query plan that consumes far more resources than it needs to. A parallel hint with a high degree of parallelism, for example, tells Oracle to drastically reduce the cost of a full table scan, which makes it more likely that the optimizer will choose that over an index scan. That is seldom something your DBAs would want to see.
If you're looking for a single metric that tells you whether a query plan is reasonable, I'd use the amount of logical I/O. Logical I/O correlates pretty well with actual query performance and with the amount of resources your query consumes. Looking at execution time can be problematic because it varies significantly based on what data happens to be cached (which is why queries often run much faster the second time they're executed), while logical I/O doesn't change based on what data is in the cache. It also lets you scale your expectations as the number of rows your queries need to process changes. If you're writing a query that needs to aggregate data from 1 million rows, for example, that should consume far more resources than a query that returns 100 rows from a table with no aggregation. If you're looking at logical I/O, you can easily scale your expectations to the size of the problem to figure out how efficient your queries could realistically be.
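One hedged way to see logical I/O for recently executed statements in Oracle is to query v$sql, where BUFFER_GETS is the cumulative logical-read count (the LIKE filter below is a hypothetical way to find your own statement):

```sql
-- Logical reads per execution for cached statements.
SELECT sql_id,
       buffer_gets,                                   -- cumulative logical reads
       executions,
       ROUND(buffer_gets / NULLIF(executions, 0)) AS gets_per_exec
FROM   v$sql
WHERE  sql_text LIKE '%some distinctive fragment of your query%';
```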
In Christian Antognini's "Troubleshooting Oracle Performance" (page 450), for example, he gives a rule of thumb that is pretty reasonable:
5 logical reads per row that is returned/ aggregated is probably very good
10 logical reads per row that is returned/ aggregated is probably adequate
20+ logical reads per row that is returned/ aggregated is probably inefficient and needs to be tuned
Different systems with different data models may merit tweaking the buckets a bit but those are likely to be good starting points.
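To apply that rule of thumb you need a logical-read count for a single run. In SQL*Plus, for instance, autotrace reports it as "consistent gets"; the query below is a made-up example:

```sql
SET AUTOTRACE TRACEONLY STATISTICS

-- Divide "consistent gets" in the statistics output by the number of rows
-- returned to get logical reads per row.
SELECT * FROM employees WHERE department_id = 10;

SET AUTOTRACE OFF
```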
My guess is that if you're researchers rather than developers, you're probably running queries that need to aggregate or fetch relatively large data sets, at least in comparison to those that application developers commonly write. If you're scanning a million rows of data to generate some aggregate results, your queries are naturally going to consume far more resources than those of an application developer who is reading or writing a handful of rows. Your queries may be just as efficient from a logical-I/O-per-row perspective; you may simply be looking at many more rows.
If you are running queries against the live production database, you may well be in a situation where it makes sense to start segregating workload. Most organizations reach a point where running reporting queries against the live database starts to create issues for the production system. One common solution to this sort of problem is to create a separate reporting database that is fed from the production system (either via a nightly snapshot or by an ongoing replication process) where reporting queries can run without disturbing the production application. Another common solution is to use something like Oracle Resource Manager to limit the amount of resources available to one group of users (in this case, report developers) in order to minimize the impact on higher priority users (in this case, users of the production system).
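As a rough sketch of the Resource Manager approach (all names are hypothetical, the exact procedure parameters vary by Oracle version, and this requires DBA privileges, so treat it as an outline rather than a recipe):

```sql
-- Cap CPU for an ad hoc research workload; every plan must also include a
-- directive for OTHER_GROUPS to validate.
BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'RESEARCHERS',
    comment        => 'ad hoc research queries');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'DAYTIME_PLAN',
    comment => 'limit research workload during business hours');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'RESEARCHERS',
    comment          => 'cap CPU for researchers',
    mgmt_p1          => 20);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    comment          => 'everyone else',
    mgmt_p1          => 80);
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
```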
Related
I am using very large tables containing hundreds of millions of rows, and I am measuring the performance of some queries using SQL Developer. I found an option called Unshared SQL worksheet, which allows me to execute many queries at the same time. Executing many queries at the same time suits me, especially since some of my queries or procedures take hours to execute.
My question is: does executing many queries at the same time affect performance? (By performance I mean the duration of execution of the queries.)
Every query costs something to execute. That's just physics. Your database has a fixed amount of resources - CPU, memory, I/O, temp disk - to service queries (let's leave elastic cloud instances out of the picture). Every query which is executing simultaneously is asking for resources from that fixed pot. Potentially, if you run too many queries at the same time you will run into resource contention, which will affect the performance of individual queries.
Note the word "potentially". Whether you will run into actual problems depends on many things: what resources your queries need, how efficiently your queries have been written, how much resource your database server has available, and how efficiently it's been configured to support multiple users (and whether the DBA has implemented profiles to manage resource usage). So, as with almost every database tuning question, the answer is "it depends".
This is true even for queries which hit massive tables such as you describe. Although, if you have queries which you know will take hours to run you might wish to consider tuning them as a matter of priority.
I'm working with MS-SQL Server, and we have several views that have the potential to return enormous amounts of processed data, enough to spike our servers to 100% resource usage for 30 minutes straight with a single query (if queried irresponsibly).
There is absolutely no business case in which such huge amounts of data would need to be returned from these views, so we'd like to lock it down to make sure nobody can DoS our SQL servers (intentionally or otherwise) by simply querying these particular views without proper where clauses etc.
Is it possible, via triggers or another method, to check the where clause etc. and confirm whether a given query is "safe" to execute (based on thresholds we determine), and reject the query if it doesn't meet our guidelines?
Or can we configure the server to reject given execution plans based on estimated time-to-completion etc.?
One potential way to reduce the overall cost of certain queries coming from a certain group of people is to use the resource governor. You can throttle how much CPU and/or memory is used up by a particular user/group. This is effective if you have a "wild west" kind of environment where some users submit bad queries that eat your resources alive. See here.
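A hedged sketch of what that might look like (the login, pool, and group names are all hypothetical, and classifier functions normally live in master):

```sql
-- Cap CPU for an ad hoc reporting workload.
CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 25);
CREATE WORKLOAD GROUP ReportGroup USING ReportPool;
GO

-- Classifier function routes sessions into the workload group.
CREATE FUNCTION dbo.rg_classifier()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp sysname = N'default';
    IF SUSER_SNAME() = N'report_user'   -- hypothetical login
        SET @grp = N'ReportGroup';
    RETURN @grp;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```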
Another thing to consider is setting MAXDOP (max degree of parallelism) to prevent any single query from taking all of the available CPU threads. For example, if MAXDOP is 2, then any query can use at most 2 CPU threads. This is useful to prevent a large query from starving smaller, quicker ones. See here.
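For reference, a sketch of both ways to set it (the instance-wide change requires appropriate permissions; the per-query hint is often the safer experiment):

```sql
-- Instance-wide setting; takes effect after RECONFIGURE.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 2;
RECONFIGURE;

-- Or per query, without changing the server default:
-- SELECT ... FROM dbo.BigView OPTION (MAXDOP 2);
```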
Kind of hacky, but you could put a TOP x in every view.
You cannot enforce it on the SQL side, but on the app side they could use a timeout. If they lack QC, though, they probably lack the discipline for timeouts. If some queries are running for 30 minutes, they are probably setting a value longer than the default.
I'm not convinced about Blam's TOP x in each view. Without a corresponding ORDER BY clause the data will be returned in an indeterminate order. There may be benefits to CDC's MAXDOP suggestion, not so much for the query itself, but for the other queries that want to run at the same time.
I'd be inclined to look at moving to stored procedures. Then you can require input parameters and evaluate them before the query gets run in earnest. If, for example, a date range is too big, you can restrict it. You should also find out who is running the expensive query and what they really need. Seems like they might benefit from some ETL. Just some ideas.
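A hedged sketch of the stored-procedure approach (all names and the 90-day threshold are made up for illustration):

```sql
-- Reject over-broad date ranges before running the expensive query.
CREATE PROCEDURE dbo.GetOrders
    @StartDate date,
    @EndDate   date
AS
BEGIN
    IF DATEDIFF(day, @StartDate, @EndDate) > 90
    BEGIN
        RAISERROR('Date range may not exceed 90 days.', 16, 1);
        RETURN;
    END;

    SELECT OrderId, OrderDate, Amount
    FROM   dbo.Orders
    WHERE  OrderDate >= @StartDate
      AND  OrderDate <  @EndDate;
END;
```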
How much of an impact does the number of records in a table have on the performance of a query that returns approximately 5k records using a 'date is greater than' criterion?
For example where a table might have a total of 50k records vs 100k.
In general, yes, more records to retrieve will take longer to load. However, it's impossible for us to give you a specific number, as there are a lot of factors involved, and every scenario is different.
Specifically, it could depend on whether the database is hosted locally or on a server, on indexing, on hardware (disks, specifically), possibly on network speed, and so on.
If you're pulling twice as much data, theoretically it should take about double the time, but again, YMMV. With proper indexing, it might be faster.
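To make the indexing point concrete, a hedged sketch (table and column names are made up): with an index on the date column, the 'greater than' predicate can use a range seek rather than scanning the whole table, so the total row count matters much less.

```sql
-- Index supporting a date-range predicate.
CREATE INDEX idx_orders_date ON orders (order_date);

-- The range predicate can now seek into the index instead of scanning
-- all 50k or 100k rows.
SELECT *
FROM   orders
WHERE  order_date > '2015-01-01';
```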
Incidentally, if speed (and perhaps more importantly, recovery, security, storage size and multi-user usage) might be an issue, you're probably better off switching to SQL Server.
On Oracle 10gR2, I have several SQL queries whose performance I am comparing. But after their first run, the v$sql view has the execution plan cached, so one of the queries goes from 28 seconds on the first run to 0.5 seconds afterwards.
I've tried
ALTER SYSTEM FLUSH BUFFER_CACHE;
After running this, the query consistently runs at 5 seconds, which I do not believe is accurate.
I thought maybe I could delete the line item itself from the cache:
delete from v$sql where sql_text like 'select * from....
but I get an error about not being able to delete from view.
Peter gave you the answer to the question you asked.
alter system flush shared_pool;
That is the statement you would use to "delete prepared statements from the cache".
(Prepared statements aren't the only objects flushed from the shared pool, the statement does more than that.)
As I indicated in my earlier comment (on your question), v$sql is not a table. It's a dynamic performance view, a convenient table-like representation of Oracle's internal memory structures. You only have SELECT privilege on the dynamic performance views, you can't delete rows from them.
flush the shared pool and buffer cache?
The following doesn't answer your question directly. Instead, it answers a fundamentally different (and maybe more important) question:
Should we normally flush the shared pool and/or the buffer cache to measure the performance of a query?
In short, the answer is no.
I think Tom Kyte addresses this pretty well:
http://www.oracle.com/technology/oramag/oracle/03-jul/o43asktom.html
http://www.oracle.com/technetwork/issue-archive/o43asktom-094944.html
<excerpt>
Actually, it is important that a tuning tool not do that. It is important to run the test, ignore the results, and then run it two or three times and average out those results. In the real world, the buffer cache will never be devoid of results. Never. When you tune, your goal is to reduce the logical I/O (LIO), because then the physical I/O (PIO) will take care of itself.
Consider this: Flushing the shared pool and buffer cache is even more artificial than not flushing them. Most people seem skeptical of this, I suspect, because it flies in the face of conventional wisdom. I'll show you how to do this, but not so you can use it for testing. Rather, I'll use it to demonstrate why it is an exercise in futility and totally artificial (and therefore leads to wrong assumptions). I've just started my PC, and I've run this query against a big table. I "flush" the buffer cache and run it again:
</excerpt>
I think Tom Kyte is exactly right. In terms of addressing the performance issue, I don't think that "clearing the oracle execution plan cache" is normally a step for reliable benchmarking.
Let's address the concern about performance.
You tell us that you've observed that the first execution of a query takes significantly longer (~28 seconds) compared to subsequent executions (~5 seconds), even when flushing (all of the index and data blocks from) the buffer cache.
To me, that suggests that the hard parse is doing some heavy lifting. Either it's doing a lot of work, or it's encountering a lot of waits. This can be investigated and tuned.
I'm wondering if perhaps statistics are non-existent, and the optimizer is spending a lot of time gathering statistics before it prepares a query plan. That's one of the first things I would check, that statistics are collected on all of the referenced tables, indexes and indexed columns.
If your query joins a large number of tables, the CBO may be considering a huge number of permutations for join order.
A discussion of Oracle tracing is beyond the scope of this answer, but it's the next step.
I'm thinking you are probably going to want to trace events 10053 and 10046.
Here's a link to an "event 10053" discussion by Tom Kyte you may find useful:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:63445044804318
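For reference, a hedged sketch of enabling those traces for your own session (the trace files land in the user dump directory, and the exact syntax may vary slightly by version):

```sql
-- 10053: traces cost-based optimizer decisions during a hard parse.
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';
-- ... run (hard parse) the problem query here ...
ALTER SESSION SET EVENTS '10053 trace name context off';

-- 10046: extended SQL trace; level 12 includes waits and bind values.
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';
-- ... run the query ...
ALTER SESSION SET EVENTS '10046 trace name context off';
```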
tangentially related anecdotal story re: hard parse performance
A few years back, I did see one query whose elapsed time was measured in minutes on first execution, but in seconds on subsequent executions. What we found was that the vast majority of the time for the first execution was spent on the hard parse.
This problem query was written by a CrystalReports developer who innocently (naively?) joined two humongous reporting views.
One of the views was a join of 62 tables, the other view was a join of 42 tables.
The query used Cost Based Optimizer. Tracing revealed that it wasn't wait time, it was all CPU time spent evaluating possible join paths.
Each of the vendor supplied "reporting" views wasn't too bad by itself, but when two of them were joined, it was agonizingly slow. I believe the problem was the vast number of join permutations that the optimizer was considering. There is an instance parameter that limits the number of permutations considered by the optimizer, but our fix was to re-write the query. The improved query only joined the dozen or so tables that were actually needed by the query.
(The initial short-term "band aid" fix was to schedule a run of the query earlier in the morning, before the report generation task ran. That made the report generation "faster", because it reused the already-prepared statement in the shared pool, avoiding the hard parse. The band aid wasn't a real solution; it just moved the problem to a preliminary execution of the query, where the long execution time wasn't noticed.)
Our next step would have probably been to implement a "stored outline" for the query, to get a stable query plan.
Of course, statement reuse (avoiding the hard parse, using bind variables) is the normative pattern in Oracle. It improves performance, scalability, yada, yada, yada.
This anecdotal incident may be entirely different than the problem you are observing.
HTH
It's been a while since I worked with Oracle, but I believe execution plans are cached in the shared pool. Try this:
alter system flush shared_pool;
The buffer cache is where Oracle stores recently used data in order to minimize disk io.
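Putting the two flushes together (generally only sensible on a private test instance; see the Tom Kyte excerpt quoted earlier for why flushing between runs is artificial):

```sql
ALTER SYSTEM FLUSH SHARED_POOL;   -- discards cached cursors/plans
ALTER SYSTEM FLUSH BUFFER_CACHE;  -- discards cached data blocks
```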
We've been doing a lot of work lately with performance tuning queries, and one culprit for inconsistent query performance is the file system cache that Oracle is sitting on.
It's possible that while you're flushing the Oracle cache the file system still has the data your query is asking for meaning that the query will still return fast.
Unfortunately I don't know how to clear the file system cache - I just use a very helpful script from our very helpful sysadmins.
FIND ADDRESS AND HASH_VALUE OF SQL_ID
select address,hash_value,inst_id,users_executing,sql_text from gv$sqlarea where sql_id ='7hu3x8buhhn18';
PURGE THE PLAN FROM SHARED POOL
exec sys.dbms_shared_pool.purge('0000002E052A6990,4110962728','c');
VERIFY
select address,hash_value,inst_id,users_executing,sql_text from gv$sqlarea where sql_id ='7hu3x8buhhn18';
Do you have any formal or informal standards for reasonably achievable SQL query speed? How do you enforce them? Assume a production OLTP database under full realistic production load of a couple dozen queries per second, properly equipped and configured.
Personal example for illustrative purposes (not a recommendation, highly contingent on many factors, some outside your control):
Expectation:
Each transactional unit (single statement, multiple SQL statements from beginning to end transaction boundaries, or a single stored procedure, whichever is largest) must execute in 1 second or less on average, without anomalous outliers.
Resolution:
Slower queries must be optimized to standard. Slow queries for reports and other analysis are moved to an OLAP cube (best case) or a static snapshot database.
(Obviously some execution queries (Insert/Update/Delete) can't be moved, so must be optimized, but so far in my experience it's been achievable.)
Given that you can't expect deterministic performance on a system that could (at least in theory) be subject to transient load spikes, you want your performance SLA to be probabilistic. An example of this might be:
95% of transactions to complete within 2 seconds.
95% of search queries (more appropriate for a search screen) to complete within 10 seconds.
95% of operational reports to complete within 10 seconds.
Transactional and search queries can't be moved off transactional system, so the only actions you can take are database or application tuning, or buying faster hardware.
For operational reports, you need to be ruthless about what qualifies as an operational report. Only reports that absolutely need to have access to up-to-date data should be run off the live system. Reports that do a lot of I/O are very anti-social on a production system, and normalised schemas tend to be quite inefficient for reporting. Move any reports that do not require real-time data off onto a data warehouse or some other separate reporting facility.
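A probabilistic SLA like the ones above can be checked mechanically. As a hedged sketch against a hypothetical query-timing log in SQL Server (2012+ for PERCENTILE_CONT; table and columns are made up):

```sql
-- 95th-percentile duration over the last day.
SELECT DISTINCT
       PERCENTILE_CONT(0.95)
           WITHIN GROUP (ORDER BY duration_ms)
           OVER () AS p95_duration_ms
FROM   dbo.query_log
WHERE  logged_at >= DATEADD(day, -1, GETDATE());
```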
I usually go by the one-second rule when writing/refactoring stored procedures, although my workplace doesn't have any specific rules about this; it's just my common sense. Experience tells me that if a procedure that doesn't perform any large bulk inserts takes ten seconds or more to execute, there are usually serious problems in the code that can easily be corrected.
The most common problem I encounter in SPs with poor performance is incorrect use of indexes, causing costly index seek operations.
O(N) is good; anything worse, like O(N^2), will eventually be too slow.