What is the mathematical relationship between "no. of rows affected" and "execution time" of a sql query? - sql

The query itself remains constant, i.e., it does not change.
e.g. a select query takes 30 minutes if it returns 10000 rows.
Would the same query take 1 hour if it has to return 20000 rows?
I am interested in knowing the mathematical relation between the number of rows (N) and the execution time (T), keeping the other parameters constant (K),
i.e. T = N*K, or
T = N*K + C, or
some other formula?
I am reading http://research.microsoft.com/pubs/76556/progress.pdf in case it helps. If anybody understands it before I do, please reply. Thanks...

Well, that is a good question :), but there is no exact formula, because it depends on the execution plan.
The SQL query optimizer could choose a different execution plan for a query that returns a different number of rows.
I guess that if the execution plan is the same for both queries and you have some "lab" conditions, then the time growth could be linear. You should research SQL execution plans and statistics further.
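One quick way to check whether the plan stays the same as the expected row count changes is to compare the estimated plans side by side. A minimal SQL Server sketch, where the table, column, and date ranges are made up purely for illustration:
    SET SHOWPLAN_XML ON;
    GO
    -- Nothing executes here; SQL Server returns only the estimated plan for each statement.
    SELECT * FROM dbo.Orders WHERE OrderDate >= '2015-01-01';  -- returns ~10000 rows (assumption)
    SELECT * FROM dbo.Orders WHERE OrderDate >= '2014-07-01';  -- returns ~20000 rows (assumption)
    GO
    SET SHOWPLAN_XML OFF;
    GO
If both statements come back with the same plan shape, a roughly linear extrapolation of the runtime is much more defensible.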

Take the very simple example of reading every row in a single table.
In the worst case, you will have to read every page of the table from your underlying storage, and the worst case for each page read is a random seek. The seek time will then dominate all other factors, so you can estimate the total time as
time ~= seek time x number of data pages
Assuming your rows are of a fairly regular size, then this is linear in the number of rows.
However, databases do a number of things to try and avoid this worst case. For example, in SQL Server table storage is often allocated in extents of 8 consecutive pages. A hard drive has a much faster streaming IO rate than random IO rate. If you have a clustered index, reading the pages in cluster order tends to involve a lot more streaming IO than random IO.
The best-case time, ignoring memory caching, is (8 KB being the SQL Server page size)
time ~= 8KB * number of data pages / streaming IO rate in KB/s
This is also linear in the number of rows.
As long as you do a reasonable job managing fragmentation, you could reasonably extrapolate linearly in this simple case. This assumes your data is much larger than the buffer cache. If not, you also have to worry about the cliff edge where your query changes from reading from buffer to reading from disk.
I'm also ignoring details like parallel storage paths and access.
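To plug real numbers into the formulas above, the data page count for a SQL Server table can be read from the sys.dm_db_partition_stats DMV; the table name below is just a placeholder:
    -- Data pages for the heap or clustered index of a hypothetical table.
    SELECT SUM(in_row_data_page_count)     AS data_pages,
           SUM(in_row_data_page_count) * 8 AS approx_data_kb   -- 8 KB per page
    FROM   sys.dm_db_partition_stats
    WHERE  object_id = OBJECT_ID('dbo.MyTable')
      AND  index_id IN (0, 1);              -- 0 = heap, 1 = clustered index
Multiplying data_pages by a measured seek time gives the worst-case estimate; dividing approx_data_kb by the streaming IO rate gives the best-case one.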

Related

Oracle SQL or PLSQL scale with load

Suppose I have a query (it has joins on multiple tables) and assume it is tuned and optimized. This query runs on the target database/tables with N1 records, returns R1 records, and takes time T1. Now the load gradually increases, so the target records become N2, the result becomes R2, and the time taken becomes T2. Assuming that I have allocated enough memory to Oracle, the load ratio L2/L1 (i.e. N2/N1) should be close to T2/T1; that is, a proportional increase in the load will result in a proportional increase in execution time. For this question let's say L2 = 5*L1, i.e. the load has increased five times. Then the time taken by this query would also be five times as long, or a little more, right?
So, to reduce this proportional growth in time, do we have options in Oracle, like a parallel hint etc.? In Java we split the job into multiple threads, and with 2 times the load and 2 times the worker threads we finish in almost the same time. So with increasing load we increase the worker threads and handle the scaling reasonably well. Is such a thing possible in Oracle, or does Oracle take care of this in the back end and scale by splitting the load internally into parallel processing? I have multi-core processors here. I will experiment with it, but if expert opinion is available it will help.
No. Query algorithms do not necessarily grow linearly.
You should probably learn something about algorithms and complexity. Many algorithms used in a database are super-linear. For instance, ordering a set of rows has a complexity of O(n log n), meaning that if you double the data size, the time taken for sorting more than doubles.
This is also true of index lookups and various join algorithms.
On the other hand, if your query is looking up a few rows using a b-tree index, then the complexity is O(log n) -- this is sublinear. So index lookups grow more slowly than the size of the data.
So, in general you cannot assume that increasing the size of data by a factor of n has a linear effect on the time.
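As a rough worked example of the sorting case (a back-of-the-envelope illustration that ignores constants and I/O), doubling the input from n to 2n rows changes the work of an O(n log n) sort by the factor
    \frac{2n \log(2n)}{n \log n} = 2 + \frac{2 \log 2}{\log n} > 2,
which for n = 10^6 is roughly 2.1, i.e. slightly more than double.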

Estimate Rows vs Actual Rows, what is the impact on performance?

I have a query that performs very quickly but in production when server loads are high its performance is underwhelming. I have a suspicion that it might be the Estimated Rows being much lower than the Actual Rows in the execution plan. I know that server statistics are not stale.
I am now optimizing a new query and I worry that it will have the same problem in production. The number of rows returned and the CPU and Reads are well within the designated thresholds my data admins require. As you can see in the above SQL Sentry plan there are a few temp tables that estimate a single row but return 100 times as many rows.
My question is this: even when the number of rows is small, does a difference in rows by such a large percentage cause bottlenecks in the server's performance? A secondary question: if the problem isn't a bad cached plan or stale stats, what other issues would cause a plan to show such a discrepancy?
A difference between actual and estimated rows does not cause a "bottleneck" in the server.
The impact is on algorithms and resource allocation for the query. SQL Server has multiple algorithms that it can use for things like JOINs and GROUP BYs. The (estimated) size of the data is one of the primary items of information that it uses to choose the appropriate algorithm.
Choosing the wrong algorithm is not exactly a bottleneck, but it does slow the query down. You would need to study the execution plan to see if this is happening in your case.
If you have simple queries that select from a single table, then there are many fewer options for the execution plan. The only impact I can readily think of in this case would be using a full table scan rather than an index for filtering. For your data sizes, I don't think that would make much of a difference.
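To see where the estimates diverge, the actual execution plan reports both the estimated and the actual row count for every operator. A minimal SQL Server sketch; the temp table and predicate are placeholders, not your actual objects:
    SET STATISTICS XML ON;   -- the statement below executes and also returns its actual plan
    GO
    SELECT COUNT(*) FROM #WorkTable WHERE Status = 'Open';   -- hypothetical query
    GO
    SET STATISTICS XML OFF;
    GO
Operators where the actual rows are orders of magnitude above the estimate are the places where the join/aggregate algorithm and memory grant were chosen on bad information.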
If there is a huge difference between Estimated Rows and Actual Rows, then you need to worry about that query.
There can be a number of reasons for this:
Stale statistics.
Skewed data distribution: the statistics are up to date, but the data is skewed. Creating filtered statistics for those indexes will help (see the sketch below).
Unoptimized query: a poorly written query, e.g. join conditions written the wrong way.
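A minimal sketch of the first two remedies; the table, column, and predicate are made-up names, not a recommendation for your schema:
    -- Refresh possibly stale statistics on a hypothetical table:
    UPDATE STATISTICS dbo.Orders WITH FULLSCAN;
    -- Filtered statistics for a skewed slice of the data:
    CREATE STATISTICS st_Orders_Status_Open
    ON dbo.Orders (Status)
    WHERE Status = 'Open'
    WITH FULLSCAN;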

Oracle 10g Full table scan(parallel access) 100x times faster than index access by rowid

There was a query in production which had been running for several (5-6) hours. I looked into its execution plan and found that it was ignoring a parallel hint on a huge table. The reason: it was using TABLE ACCESS BY INDEX ROWID. So after I added a /*+ full(huge_table) */ hint before the parallel(huge_table) hint, the query started running in parallel, and it finished in less than 3 minutes. What I could not fathom was the reason for this HUGE difference in performance.
The following are the advantages of parallel FTS I can think of:
Parallel operations are inherently fast if you have more idle CPUs.
Parallel operations in 10g use direct I/O, which bypasses the buffer cache, which means there is no risk of "buffer busy waits" or any other contention for the buffer cache.
Sure, there are the above advantages, but then again the following disadvantages are still there:
Parallel operations still have to do I/O, and this I/O would be more than what we have for TABLE ACCESS BY INDEX ROWID, as the entire table is scanned, which is costlier (all physical reads).
Parallel operations are not very scalable, which means that if there aren't enough free resources, they are going to be slow.
With the above knowledge at hand, I see only one reason that could have caused the poor performance of the query when it used TABLE ACCESS BY INDEX ROWID - some sort of contention like "buffer busy waits". But that doesn't show up in the AWR top 5 wait events. The top two events were "db file sequential read" and "db file scattered read". Is there something else that I have missed? Please enlighten me.
First, without knowing anything about your data volumes, statistics, the selectivity of your predicates, etc., I would guess that the major benefit you're seeing is from doing a table scan rather than trying to use an index. Indexes are not necessarily fast and table scans are not necessarily slow. If you are using a rowid from an index to access a row, Oracle is limited to doing single block reads (sequential reads in Oracle terms), and it is going to have to read the same block many times if the block has many rows of interest. A full table scan, on the other hand, can do nice, efficient multiblock reads (scattered reads in Oracle terms). Sure, an individual single block read is going to be more efficient than a single multiblock read, but the multiblock read is much more efficient per byte read. Additionally, if you're using an index, you've potentially got to read a number of blocks from the index periodically to find out the next rowid to read from the table.
You don't actually need to read all that much data from the table before a table scan is more efficient than an index. Depending on a host of other factors, the tipping point is probably in the 10-20% range (that's a very, very rough guess). Imagine that you had to get a bunch of names from the phone book and that the phone book had an index that included the information you're filtering on and the page that the entry is on. You could use an index to find the name of a single person you want to look at, flip to the indicated page, record the information, flip back to the index, find the next name, flip back, etc. Or you could simply start at the first name, scan until you find a name of interest, record the information, and continue the scan. It doesn't take too long before you're better off ignoring the index and just reading from the table.
Adding parallelism doesn't reduce the amount of work your query does (in fact, adding in parallel query coordination means that you're doing more work). It's just that you're doing that work over a shorter period of elapsed time by using more of the server's available resources. If you're running the query with 6 parallel slaves, that could certainly allow the query to run 5 times faster overall (parallel query obviously scales a bit less than linearly because of overheads). If that's the case, you'd expect that doing a table scan made the query 20 times faster and adding parallelism added another factor of 5 to get your 100x improvement.
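For reference, the hint combination the question describes would look roughly like this; the alias, columns, predicate, and degree of 4 are illustrative, not taken from the original query:
    SELECT /*+ FULL(t) PARALLEL(t 4) */ t.order_id, t.status
    FROM   huge_table t
    WHERE  t.status = 'OPEN';
The FULL hint pushes the optimizer to the multiblock-read table scan discussed above, and PARALLEL then sets the degree for that scan; as the question observed, the PARALLEL hint by itself was not enough to move the plan away from TABLE ACCESS BY INDEX ROWID.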

Querying Oracle table of high degree of parallelism results in full table scan

Well, the title describes what I've just encountered recently with an Oracle database.
Here's some background:
The table in question is partitioned by hash into 4 partitions.
Parallel degree of the table is 4.
Hash key equals PK.
There are quite a number of rows in the table, around 200M.
PK index is also partitioned (local partition).
Parallel degree of the index is 1.
Okay, now I've got a query that behaves strangely as I change the parallel degree of the table.
If the table degree is 4, it results in a full table scan (coordinated parallel full table scan), as revealed by the explain plan. It takes 30 minutes or more to complete the query.
If the table degree is 1-3, it correctly makes use of the PK index (range scan, single threaded) and returns the result in 20 seconds.
If I set both the table degree and the index degree to 4, it results in a full table scan (same result as the first scenario above).
This behavior, however, does not happen in another database where I have a nearly identical clone of the table. The only difference is the number of records: the table in the other database is slightly smaller (minus 1-2 million rows). The smaller table, also with a degree of 4, does not run into a full table scan with the same query.
I've spent some time Googling around and found the following things about parallel query.
From the official Oracle documentation:
A high degree of parallelism for a table skews the optimizer toward full table scans over range scans. Examine the DEGREE column in ALL_TABLES for the table to determine the degree of parallelism.
And from http://www.toadworld.com/Portals/0/GuyH/Articles/Oracle%20Parallel%20SQL%20Part%201.pdf
Parallel query should be applied when
The SQL performs at least one full table, index or partition scan
And from AskTom.com
Parallel query is suitable for a certain class of large problems: very large problems that have no other solution. Parallel query is my last path of action for solving a performance problem; it's never my first course of action.
It seems that parallel execution is designed for processing a very large amount of data when no better solution exists. It attempts to give better performance by running things in parallel, with each CPU (process) dedicated to working on a separate portion of the data (block range, table partitions or index partitions). As such, it is not designed to speed up general queries, or queries that do not cover a sufficient portion of the whole table.
Is my understanding above correct, that parallelism should not be used as a means to speed up general queries?
If yes, does that also mean that the best practice is to turn off parallelism (degree of 1) and enable it for particular queries/operations through a hint or parallel clause?
And in addition to all of that, what should be the best practice for setting up PARALLEL? If what I want is the best read performance through multi-threading, what should the setup be?
Lots of questions here. Lots of thanks in advance.
As a general rule I agree with Tom. Our main base table is an IOT (index-organized table) of approximately 240M rows, plus other indexes, with somewhere between 10 and 1,000 insert, delete, and update operations happening 24 hours a day. We generally get information out of it in a split second, and when we want a lot of information we go for the full scan and deal with the 2.5 hours it takes. In answer to some of your questions: if you're going to be doing more large queries than small ones, then go with partitioning; if not, then don't.
For your specific query, parallelism likely isn't your biggest problem. The new estimated cost and time of a query will be very roughly equal to the original cost divided by the degree of parallelism. The optimizer could be wrong here; for example, if you only have one hard drive then the new plan probably won't be any faster at all. But a 4x estimate mistake shouldn't lead to a 90x performance difference. This leads me to believe that your plan was already on the brink of failure, and this just tipped it over. How close are the estimated and actual cardinalities of your non-parallel plan? Whatever is causing those differences might be responsible for the bulk of your problem.
For your more general questions, there are no simple answers. There are several dozen things you may need to consider for parallelism, only you can know which ones will apply to your situation. Your best bet is to stop trying to Google it, and instead read the manual. The Using Parallel Execution chapter in the Data Warehousing Guide is a good place to start.
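If you do end up with the pattern raised in the question - objects serial by default, parallelism requested only for statements that scan enough data to benefit - a minimal sketch looks like this (object names and the degree of 4 are illustrative):
    -- Make the table and its index serial by default:
    ALTER TABLE big_table NOPARALLEL;
    ALTER INDEX big_table_pk NOPARALLEL;
    -- Request parallelism only in the statements that need it:
    SELECT /*+ PARALLEL(t 4) */ COUNT(*)
    FROM   big_table t;
Whether that trade-off suits your workload is exactly the sort of question the Data Warehousing Guide chapter mentioned above walks through.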
The degree of a relation or table in SQL means the number of attributes in the relation.
For example: if a relation in SQL has three rows and four columns, then its degree is four. Simply put, the number of columns of a relation is called its degree.

Does the speed of the query depend on the number of rows in the table?

Let's say I have this query:
select * from table1 r where r.x = 5
Does the speed of this query depend on the number of rows that are present in table1?
There are many factors affecting the speed of a query, one of which can be the number of rows.
Others include:
index strategy (if you index column "x", you will see better performance than if it's not indexed)
server load
data caching - once you've executed a query, the data will be added to the data cache, so subsequent reruns will be much quicker as the data comes from memory rather than disk, until the point where the data is removed from the cache
execution plan caching - to a lesser extent. Once a query is executed for the first time, the execution plan SQL Server comes up with will be cached for a period of time, for future executions to reuse.
server hardware
the way you've written the query (often one of the biggest contributors to poor performance!), e.g. writing something using a cursor instead of a set-based operation
For databases with a large number of rows in tables, partitioning is usually something to consider (SQL Server 2005 onwards, Enterprise Edition, has built-in support). This splits the data into smaller units. Generally, smaller units = smaller tables = smaller indexes = better performance.
Yes, and it can be very significant.
If there are 100 million rows, SQL Server has to go through each of them and see if it matches.
That takes a lot more time than if there are only 10 rows.
You probably want an index on the 'x' column, in which case SQL Server might check the index rather than going through all the rows - which can be significantly faster, as SQL Server might not even need to check all the values in the index.
On the other hand, if there are 100 million rows matching x = 5, it will be slower than if there are 10.
Almost always yes. The real question is: what is the rate at which the query slows down as the table size increases? And the answer is: by not much if r.x is indexed, and by a large amount if not.
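A minimal sketch of the index that makes that difference for the example query (the index name is arbitrary):
    CREATE INDEX ix_table1_x ON table1 (x);
With this index in place, the lookup for r.x = 5 can become an index seek - roughly O(log N) plus the matching rows - instead of a scan of the whole table.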
Not the number of rows per se (to a certain degree, of course), but the amount of data (columns) is what can make a query slow. The data also needs to be transferred from the backend to the frontend.
The answer is yes, but it is not the only factor.
If you have done appropriate optimization and tuning, the performance drop will be negligible.
Main performance factors:
Indexing (clustered or non-clustered)
Data caching
Table partitioning
Execution plan caching
Data distribution
Hardware specs
There are some other factors, but these are the main ones considered.
Even how you designed your schema affects performance.
You should assume that your query always depends on the number of rows. In fact, you should assume the worst case (linear, or O(N), for the example you provided) and much worse (e.g. quadratic or higher) for more complex queries. There are database-specific manuals filled with tricks to help you avoid the worst case, but SQL itself is a language and doesn't specify how to execute your query. Instead, the database implementation decides how to execute any given query: if you have indexed a column or set of columns in your database, then you will get O(log(N)) performance for a simple lookup; if the system has effective query caching, you might get O(1) response. Here is a good introductory article: High Scalability: SQL and computational complexity