MIN/MAX vs ORDER BY and LIMIT

MIN/MAX vs ORDER BY and LIMIT - sql

Out of the following queries, which method would you consider the better one? What are your reasons (code efficiency, better maintainability, less WTFery)...
SELECT MIN(`field`)
FROM `tbl`;
SELECT `field`
FROM `tbl`
ORDER BY `field`
LIMIT 1;

In the worst case, where you're looking at an unindexed field, using MIN() requires a single full pass of the table. Using SORT and LIMIT requires a filesort. If run against a large table, there would likely be a significant difference in percieved performance. As an anecdotal data point, MIN() took .36s while SORT and LIMIT took .84s against a 106,000 row table on my dev server.
If, however, you're looking at an indexed column, the difference is harder to notice (meaningless data point is 0.00s in both cases). Looking at the output of explain, however, it looks like MIN() is able to simply pluck the smallest value from the index ('Select tables optimized away' and 'NULL' rows) whereas the SORT and LIMIT still needs needs to do an ordered traversal of the index (106,000 rows). The actual performance impact is probably negligible.
It looks like MIN() is the way to go - it's faster in the worst case, indistinguishable in the best case, is standard SQL and most clearly expresses the value you're trying to get. The only case where it seems that using SORT and LIMIT would be desirable would be, as mson mentioned, where you're writing a general operation that finds the top or bottom N values from arbitrary columns and it's not worth writing out the special-case operation.

SELECT MIN(`field`)
FROM `tbl`;
Simply because it is ANSI compatible. Limit 1 is particular to MySql as TOP is to SQL Server.

As mson and Sean McSomething have pointed out, MIN is preferable.
One other reason where ORDER BY + LIMIT is useful is if you want to get the value of a different column than the MIN column.
Example:
SELECT some_other_field, field
FROM tbl
ORDER BY field
LIMIT 1

I think the answers depends on what you are doing.
If you have a 1 off query and the intent is as simple as you specified, select min(field) is preferable.
However, it is common to have these types of requirements change into - grab top n results, grab nth - mth results, etc.
I don't think it's too terrible an idea to commit to your chosen database. Changing dbs should not be made lightly and have to revise is the price you pay when you make this move.
Why limit yourself now, for pain you may or may not feel later on?
I do think it's good to stay ANSI as much as possible, but that's just a guideline...

Given acceptable performance I would use the first one because it is semantically closer to the intent.
If the performance was an issue, (Most modern optimizers will probalbly optimize both to the same query plan, although you have to test to verify that) then of course I would use the faster one.

user650654 said that ORDER BY with LIMIT 1 useful when one need "to get the value of a different column than the MIN column". I think, in this case we still have better performance with two single passes using MIN instead of sorting (hoping this is optimized :()
SELECT some_other_field, field
FROM tbl
WHERE field=(SELECT MIN(field) FROM tbl)

Related

What is the efficiency of a query + subquery that finds the minimum parameter of a table in SQL?

I'm currently taking an SQL course and trying to understand efficiency of queries.
Given this query, what's the efficiency of it:
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
FROM Customers)
What i'm trying to understand is if the subquery runs once at the beginning and then the query is O(n+n)?
Or does the subquery run everytime you go through a customer's age which makes it O(n^2)?
Thank you!

If you want to understand how the query optimizer interperets a query you have to review the execution / explain plan which almost every RDBMS makes available.
As noted in the comments you tell the RDBMS what you want, not how to get it.
Very often it helps to have a deeper understanding of the particular database engine being used in order to write a query in the most performant way, ie, to be able to think like the query processor.
Like any language, there's more than one way to skin a cat, so to speak, and with SQL there is usually more than one way to write a query that results in the same output - very often many ways, depending on the complexity.
How a query execution plan gets built and executed is determined by the query optimizer at compile time and depends on many factors, depending on the RDBMS, such as data cardinality, table size, row size, estimated number of rows, sargability, indexes, available resources, current load, concurrency, isolation level - just to name a few.
It often helps to write queries in the most performant way by thinking what you would have to do to accomplish the same task.
In your example, you are looking for all the rows in a table where a particular value equals another value. You have chosen to find that value by first looking for the minimum age - you would only have to do this once as it's a single scalar value, so it's reasonable to assume (but not guaranteed) the database engine would do the same.
You could also approach the problem by aggregating and limiting to the top qualifying row and including ties, if the syntax is supported by the RDBMS, and joining the results.
Ultimately there is no black and white answer.

Is ORDER BY time consuming?

I am always wondering if ORDER BY is efficient, because I believe it inevitably need a whole-database scanning, even if the ordering field is indexed.
For example, if I order by created_at and limit 10. I think, because the database cannot know I will order by created_at a priori, it has to sort the whole data and return the first 10 items. Of course if we have an index on created_at, things might be better.
However, even with index, I think we can still run into trouble. For example, I want to sort by a function of a field, say (age^2-age-10). Even if we indexed the age field, the database cannot know what function I will use a priori, so it has to calculate the sqrt on all rows.
Am I wrong? Anyway, could anyone explain to me the workflow behind ORDER BY?

If there is an index that is sorted in the same order as specified in the ORDER BY clause, the database will not need to perform a sort operation. The query optimizer looks for indexes that can speed up your query. It analyzes your SQL query and, in the case of ORDER BY clauses, looks for indexes that have the same order. See Indexing ORDER BY for more details.
Some database engines allow indexing computed columns, which would cover the case you mentioned.

In theory, the database optimizer can take into account the limit clause when determining the query plan. This is most obviously useful with a limit 1 query, which can be implemented just by keeping track of which row has the extreme value for the columns in the order by. The same idea can be extended to larger limit sizes.
In practice, I don't think that most databases implement this optimization when the limit is larger than 1. Some may for the special case of limit 1 (or top 1 or whatever the right syntax is).
An index can be used for an order by. In general, the columns in the index would need to match exactly the appropriate columns in the index. SQL optimizers are generally not smart enough to recognize simple conversions. On the other hand, people who write SQL usually don't do such transformations.

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly want to fetch all of them, that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint.
Case 2, however is a bit more tricky with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMS. So we nest our "business" query in a technical wrapper as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... USER SORTED business query ...]
) inner
)
WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated to know how many pages we will have even without fetching all data. Now this paging query is usually quite satisfying. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), this runs for 2-3minutes.
EDIT: Please note, that a potential bottleneck is not so easy to circumvent, because of sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?

The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query they will be called less.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?

You might want to trace the query that takes a lot of time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation. Oracle has to read all the data, even if you only fetch one row, this is a common problem that all RDBMS face with this type of query. No implementation of TOTAL_ROWS will get around that.
The radical way to speed up this type of query is to forego the TOTAL_ROWS calculation. Just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimation may be sufficient. That's another solution, implemented by google search for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.

A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.

It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without any sorting or joining to all the tables (join table elimination based on declared foreign key constraints). This is what we generally do in our application. For performance important statements, we write a separate query that we know will return the correct count as we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.

SQL SELECT clause tuning

Why does the sql query execute faster if I use the actual column names in the SELECT statement instead of SELECT *?

A noticeable difference at all seems odd... since I'd expect it to be a very minuscule difference and am intrigued to test it out.
Any difference might in a statement using Select * might be due to it taking extra time to find out what all of the column names are.

Because depending on the query it has to work out if there are unique names, what they all are, etc. Where as, if you specificy it, its all done for it.

Generally, the more you tell it, the less it has to calculate. This is the same for many systems.

Its possible the performance is way better when you select certain column names than Select *, one good reason, just check whether, you have used the columns which are already indexed, in this case, the optimizer will make a plan to select all the data only from index instead from actual table. But check the plan once for sure.

using distinct command

using distinct command in SQL is good practice or not? is there any drawback of distinct command?

It depends entirely on what your use case is. DISTINCT is useful in certain circumstances, but it can be overused.
The drawbacks are mainly increased load on the query engine to perform the sort (since it needs to compare the resultset to itself to remove duplicates), and it can be used to mask an issue in your data - if you are getting duplicates there may be a problem with your source data.
The command itself isn't inherently good or bad. You can use a screwdriver to hammer a nail, but that doesn't mean it's a good idea, or that screwdrivers are bad in all cases.

If you need to use it regularly to get the correct output then you have a design or JOIN issue
It's perfectly valid for use otherwise.
It is a kind of aggregate though: the equivalent to a GROUP BY on all output columns. So it is an extra step is query processing

From this http://www.mindfiresolutions.com/Think-Before-Using-Distinct-Command-Arbitarily-1050.php
Sometimes it is seen if the beginners are getting some duplicates in their resultset then they are using DISTINCT. But this has its own disadvantages.
Distinct decreases the query's performance. Because the normal procedure is sorting the results and then removing rows that
are equal to the row immediately before it.
DISTINCT compares between all fields of the record. So DISTINCT increases computation .

It is part of the language, so should be used.
Is some circumstances using DISTINCT may cause a table scan where otherwise one would not occur.
You will need to test for each of your own use cases to see if there is an impact and find a workaround if the impact is unacceptable.

If you want the work to make sure the results are distinct to happen inside the SQL server on the SQL machine, then use it. If you don't mind sending extra results to the client and doing the work there (to reduce server load) then do that. It depends on your performance requirements and the characteristics of your database.
For example, if it's extremely unlikely that distinct will reduce the result set much, and you don't have the right columns indexed to make it fast, and you need to reduce SQL Server load, and you have spare cycles on the client, and it's easy to ensure distinctness on the client -- then you might want to do that.
That's a lot of ifs, ands, and mights. If you don't know -- just use it.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas