I have a question for an assignment that I'm confused about.
The question is:
Calculate the lowest cost of executing this query:
SELECT max(profit) FROM Salesman;
What would the formula be for working this out? Would I need to use a SELECT cost formula such as the linear search cost, or would I use an aggregate cost formula? Or the two combined?
Note: The profit field is not indexed in a B-Tree
I'm just having trouble deciding what formula to use for this query.
I'm not sure what metric you're using to calculate the cost, but the question asks for the "lowest" cost. So imagine the situation that takes the least work from the system, then calculate the cost using whatever cost model your instructor or course indicates you should use.
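As a hedged illustration, using a simple block-transfer cost model (the exact formula depends on your textbook): because profit is not indexed, max(profit) can only be computed by a full (linear) scan of Salesman, and the aggregate itself just keeps a running maximum in memory, so it adds no extra I/O. The cheapest plan therefore reads every block of the relation exactly once, giving cost ≈ b_Salesman block transfers. For example, if Salesman occupies 1,000 blocks, the lowest cost would be about 1,000 block reads.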
If your data is predetermined, couldn't you just use the chosen database system itself to describe the costs?
I'm currently taking an SQL course and trying to understand efficiency of queries.
Given this query, what's the efficiency of it:
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
FROM Customers)
What I'm trying to understand is whether the subquery runs once at the beginning, in which case the query is O(n + n)?
Or does the subquery run every time you go through a customer's age, which would make it O(n^2)?
Thank you!
If you want to understand how the query optimizer interprets a query, you have to review the execution/explain plan, which almost every RDBMS makes available.
As noted in the comments, you tell the RDBMS what you want, not how to get it.
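For example (MySQL and PostgreSQL both use EXPLAIN; SQL Server exposes the same information as an estimated execution plan):

EXPLAIN
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age) FROM Customers);

For an uncorrelated scalar subquery like this one, the plan will typically show the subquery evaluated once and its result treated as a constant, so you should expect something closer to two passes over the table (or a scan plus an index lookup) than O(n^2). But the plan for your own engine and data is the authoritative answer.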
Very often it helps to have a deeper understanding of the particular database engine being used in order to write a query in the most performant way, i.e., to be able to think like the query processor.
Like any language, there's more than one way to skin a cat, so to speak, and with SQL there is usually more than one way to write a query that results in the same output - very often many ways, depending on the complexity.
How a query execution plan gets built and executed is determined by the query optimizer at compile time and depends on many factors, depending on the RDBMS, such as data cardinality, table size, row size, estimated number of rows, sargability, indexes, available resources, current load, concurrency, isolation level - just to name a few.
It often helps to write queries in the most performant way by thinking what you would have to do to accomplish the same task.
In your example, you are looking for all the rows in a table where a particular value equals another value. You have chosen to find that value by first looking for the minimum age - you would only have to do this once as it's a single scalar value, so it's reasonable to assume (but not guaranteed) the database engine would do the same.
You could also approach the problem by aggregating and limiting to the top qualifying row and including ties, if the syntax is supported by the RDBMS, and joining the results.
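As a hedged illustration of that alternative, in SQL Server syntax:

SELECT TOP (1) WITH TIES *
FROM Customers
ORDER BY Age;

Standard SQL (and, for example, PostgreSQL 13+) spells the same idea as FETCH FIRST 1 ROWS WITH TIES, while the aggregate-and-join form would be:

SELECT c.*
FROM Customers c
JOIN (SELECT MIN(Age) AS min_age FROM Customers) m
  ON c.Age = m.min_age;

Which of these the optimizer turns into the cheapest plan depends on the engine and the available indexes.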
Ultimately there is no black and white answer.
Consider I have three tables: order, sub_order and product.
The cost of a sub_order is built from a complex formula that depends on the individual costs of the products. The cost of the order is simply the sum of the sub_order costs, although this formula might change in the future.
We store the derived field order.order_cost for convenience.
The questions I have are:
Should business rules be applied at the database layer? If so, is there a way to enforce the constraint on order_cost using SQL, i.e. that order_cost is always the sum of the sub_order costs?
If you want the order cost to always be the sum of the costs from sub_order, then calculate the value on the fly. There is no need to store the value, unless you have a performance reason. Your question doesn't mention performance as a reason to duplicate the data.
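For instance, a minimal sketch of the on-the-fly approach, assuming hypothetical columns order.id, sub_order.order_id and sub_order.sub_order_cost (`order` is quoted because it is a reserved word; the quoting syntax depends on your RDBMS):

CREATE VIEW order_cost_v AS
SELECT o.id AS order_id,
       COALESCE(SUM(s.sub_order_cost), 0) AS order_cost
FROM `order` o
LEFT JOIN sub_order s ON s.order_id = o.id
GROUP BY o.id;

Any query that needs the order cost can then join to this view instead of reading a stored column that might be stale.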
If you want the order cost to be initialized to the sum, then you can use a trigger. That would be strange, though, because I would expect the order to be created before the sub_orders.
If you want to maintain the order amount as sub_orders are created, deleted, and updated, then you need triggers. Or, just calculate the values on the fly.
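A minimal sketch of the trigger approach in MySQL syntax, assuming hypothetical columns order.id, order.order_cost, sub_order.order_id and sub_order.sub_order_cost; you would need companion UPDATE and DELETE triggers that adjust the total in the same way:

CREATE TRIGGER sub_order_after_insert
AFTER INSERT ON sub_order
FOR EACH ROW
  -- keep the stored total in sync incrementally
  UPDATE `order`
  SET order_cost = COALESCE(order_cost, 0) + NEW.sub_order_cost
  WHERE id = NEW.order_id;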
Select pn.pn_level_id From phone_number pn Where pn.phone_number='15183773646'
Select pn.pn_level_id From phone_number pn Where pn.phone_number=' 15183773646'
Do you think they are the same? No. They are not the same in PL/SQL Developer.
I'm wondering why the latter one's cost is lower than that of the first SQL statement.
Any help would be appreciated!
The cost is not the same because, when the cost is calculated, the planner takes the available statistics into account. Among other things, the statistics contain the values that appear most frequently in a column, together with their frequencies. This allows the planner to better estimate the number of rows that will be fetched and to decide how best to get the data (e.g. via a sequential scan or via an index).
In your case the value 15183773646 is probably among the most frequently appearing values, which is why the planner's estimate is different for the query involving it (the planner has a better estimate of the number of such rows) compared to other values, for which it basically has to guess.
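If you want to see this for yourself, you can compare the optimizer's estimates and inspect the column statistics. A hedged sketch in Oracle syntax (these are standard data-dictionary views, but the exact output depends on the statistics that have been gathered):

EXPLAIN PLAN FOR
SELECT pn.pn_level_id FROM phone_number pn WHERE pn.phone_number = '15183773646';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- column-level statistics, including whether a histogram exists
SELECT column_name, num_distinct, num_buckets, histogram
FROM user_tab_col_statistics
WHERE table_name = 'PHONE_NUMBER';

Running the same EXPLAIN PLAN for the value with the leading space and comparing the estimated rows and cost makes the difference between the two plans visible.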
The optimizer evaluates all the ways to extract the data. Since your database doesn't contain the value ' 15183773646', that value is not stored in the index, so Oracle can skip one I/O operation; this is why the second cost is lower.
I have a MySQL SELECT query that calculates a distance using Pythagoras in the WHERE clause to restrict results to a certain radius.
I also use the exact same calculation in the ORDER BY clause to sort the results by smallest distance first.
Does MySQL calculate the distance twice (once for the WHERE, and again for the ORDER BY)?
If it does, how can I optimize the query so it is only calculated once (if possible at all)?
Does MySQL calculate the distance twice (once for the WHERE, and again for the ORDER BY)?
No, the calculation will not be performed twice if it is written in exactly the same way. However, if your aim is to improve the performance of your application, then you might want to look at the bigger picture rather than concentrating on this minor detail, which could give you at most a factor-of-two difference. A more serious problem is that your query prevents efficient usage of indexes and will result in a full scan.
I would recommend that you change your database so that you use the geometry type and create a spatial index on your data. Then you can use MBRWithin to quickly find the points that lie inside the bounding box of your circle. Once you have found those points you can run your more expensive distance test on those points only. This approach will be significantly faster if your table is large and a typical search returns only a small fraction of the rows.
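A rough sketch of that approach in MySQL, assuming a hypothetical places table with a NOT NULL POINT column named pt (spatial indexes require a non-nullable geometry column):

ALTER TABLE places ADD SPATIAL INDEX (pt);

SELECT *
FROM places
WHERE MBRWithin(pt, ST_GeomFromText('POLYGON((10 50, 20 50, 20 60, 10 60, 10 50))'));

The exact distance test then only has to be applied to the rows this returns.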
If you can't change the data model then you can still improve the performance by using a bounding box check first, for example WHERE x BETWEEN 10 AND 20 AND y BETWEEN 50 AND 60. The bounding box check will be able to use an index, but because R-Tree indexes are only supported on the geometry type you will have to use the standard B-Tree index which is not as efficient for this type of query (but still much better than what you are currently doing).
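For example, a hedged sketch with hypothetical columns x and y, a centre of (15, 55) and a radius of 5:

SELECT *
FROM places
WHERE x BETWEEN 10 AND 20                          -- cheap, index-friendly bounding box
  AND y BETWEEN 50 AND 60
  AND POW(x - 15, 2) + POW(y - 55, 2) <= 25;       -- exact radius check on the survivors

A composite index on (x, y) lets the bounding-box predicates cut the candidate set down before the distance expression is evaluated.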
You could also SELECT the calculated distance, filter on it in the HAVING clause, and use it in the ORDER BY clause; then the calculation is certainly only done once. But I suspect that would be slower, because it has to work with more data, and the calculation itself is not that expensive.
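In MySQL that looks something like the sketch below (MySQL allows HAVING to reference a select-list alias even without GROUP BY; the table and column names are hypothetical):

SELECT id,
       POW(x - 15, 2) + POW(y - 55, 2) AS dist_sq
FROM places
HAVING dist_sq <= 25
ORDER BY dist_sq;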
What is the time complexity of a function such as count, sum, avg or any other of the built in "math"-functions in mysql, sql server, oracle and others?
One would think that calling SUM(myColumn) would be linear.
But COUNT(1) isn't. How come, and what is the real time complexity?
In a perfect world I would want sum, avg and count to be O(1). But we don't live in one of those, do we?
What is the time complexity of a function such as count, sum, avg or any other of the built in "math"-functions in mysql, sql server, oracle and others?
In MySQL with MyISAM, COUNT(*) without GROUP BY is O(1) (constant)
It is stored in the table metadata.
In all systems, MAX and MIN on indexed expressions without GROUP BY are O(log(n)) (logarithmic).
They are fetched with a single index seek.
Aggregate functions are O(n) (linear) when used without GROUP BY, or when GROUP BY uses a hash.
Aggregate functions are O(n log(n)) when GROUP BY uses a sort.
All values must be fetched, processed and accumulated into the state variables (which may be kept in a hash table).
In addition, when using a sort, the values must also be sorted.
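For example, in MySQL you can see the index-seek behaviour for MIN/MAX directly (hypothetical index name; with an index on profit the plan typically reports "Select tables optimized away"):

CREATE INDEX idx_salesman_profit ON Salesman (profit);
EXPLAIN SELECT MAX(profit) FROM Salesman;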
In SQL, the math-function complexity of aggregates is totally irrelevant. The only thing that really matters is the data access complexity: what access path is chosen (table scan, index range scan, index seek etc.) and how many pages are read. There may be slight differences in the internals of each aggregate, but they all work pretty much the same way (keep a running state and compute the running aggregation for each input value), and there is absolutely NO aggregate that looks at the input twice, so they are all O(n) as internal implementations, where 'n' is the number of records fed to the aggregate (not necessarily the number of records in the table!).
Some aggregates have internal shortcuts, eg. COUNT(*) may return the count from metadata on some systems, if possible.
Note: this is speculation based on my understanding of how SQL query planners work, and may not be entirely accurate.
I believe all the aggregate functions, or at least the "math" ones you name above, should be O(n). The query will be executed roughly as follows:
1. Fetch rows matching the join predicates and filter predicates (i.e. the WHERE clause).
2. Create row-groups according to the GROUP BY clause. A single row-group is created for queries with no GROUP BY.
3. For each row-group, apply the aggregate function to the rows in the group. For things like SUM, AVG, MIN, MAX, as well as non-numeric functions like CONCAT, there are simple O(n) algorithms, and I suspect those are used. Create one row in the output set for each row-group created in step #2.
4. If a HAVING predicate is present, filter the output rows using this predicate.
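As a hedged illustration (hypothetical employees table), the clauses of a typical aggregate query map onto those steps like this:

SELECT department, AVG(salary) AS avg_salary   -- step 3: aggregate each row-group
FROM employees
WHERE hired_on >= '2020-01-01'                 -- step 1: filter rows
GROUP BY department                            -- step 2: build the row-groups
HAVING AVG(salary) > 50000;                    -- step 4: filter the output rows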
Note, however, that even though the aggregate functions are O(n), the operation might not be. If you write a query that Cartesian-joins a table to itself, you will be looking at O(n*n) at a minimum just to create the initial row set (step #1). Sorting to create row-groups (step #2) may be O(n lg n) and may require disk storage for the sort operation (as opposed to an in-memory-only operation), so your query may still perform poorly if you are manipulating many rows.
For big data-warehouse style queries, the major databases can parallelize the task, so have multiple CPUs working on it. There will therefore be threshold points where it isn't quite linear as the cost of co-ordinating the parallel threads trades off against the benefit of using the multiple CPUs.