Is ORDER BY time consuming? - sql

I have always wondered whether ORDER BY is efficient, because I believe it inevitably requires a scan of the whole table, even if the ordering field is indexed.
For example, suppose I order by created_at and limit the result to 10 rows. Because the database cannot know in advance that I will order by created_at, I think it has to sort all the data and then return the first 10 items. Of course, if we have an index on created_at, things might be better.
However, even with an index, I think we can still run into trouble. For example, suppose I want to sort by a function of a field, say (age^2 - age - 10). Even if we indexed the age field, the database cannot know in advance what function I will use, so it has to evaluate that expression on every row.
Am I wrong? Anyway, could anyone explain to me the workflow behind ORDER BY?

If there is an index that is sorted in the same order as specified in the ORDER BY clause, the database will not need to perform a sort operation. The query optimizer looks for indexes that can speed up your query. It analyzes your SQL query and, in the case of ORDER BY clauses, looks for indexes that have the same order. See Indexing ORDER BY for more details.
Some database engines allow indexing computed columns, which would cover the case you mentioned.
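For example, PostgreSQL supports expression indexes directly (a minimal sketch; the people table, its age column, and the index name are hypothetical):
CREATE INDEX people_age_expr_idx ON people ((age * age - age - 10));

-- with the expression index in place, the optimizer can walk the index in order
-- instead of computing the expression and sorting every row
SELECT *
FROM people
ORDER BY age * age - age - 10
LIMIT 10;
Note that the planner only uses such an index when the ORDER BY expression matches the indexed expression exactly.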

In theory, the database optimizer can take into account the limit clause when determining the query plan. This is most obviously useful with a limit 1 query, which can be implemented just by keeping track of which row has the extreme value for the columns in the order by. The same idea can be extended to larger limit sizes.
In practice, I don't think that most databases implement this optimization when the limit is larger than 1. Some may do so for the special case of limit 1 (or top 1 or whatever the right syntax is).
An index can be used for an order by. In general, the columns in the index need to match exactly the corresponding columns in the order by. SQL optimizers are generally not smart enough to recognize even simple transformations of those columns. On the other hand, people who write SQL usually don't apply such transformations.
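For instance (a minimal sketch; the posts table and index name are hypothetical, created_at is taken from the question):
CREATE INDEX posts_created_at_idx ON posts (created_at);

-- the optimizer can read the index in order and stop after 10 rows,
-- so no sort of the whole table is needed
SELECT *
FROM posts
ORDER BY created_at
LIMIT 10;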

Related

Unordered results in SQL

I have read over and over again that SQL, at its heart, is an unordered model. That means executing the same SQL query multiple times can return the result set in a different order, unless an "order by" clause is included. Can someone explain why a SQL query can return its result set in a different order across different executions? It may not always be the case, but it is certainly possible.
Algorithmically speaking, does the query plan not play any role in determining the order of the result set when there is no "order by" clause? I mean, when there is a query plan for some query, how does the algorithm not always return data in the same order?
Note: I am not questioning the use of order by; I am asking why there is no guarantee, i.e. I am trying to understand the challenges that make a guarantee impossible.
Some SQL Server examples where the exact same execution plan can return differently ordered results are:
An unordered index scan might be carried out in either allocation order or key order, dependent on the isolation level in effect.
The merry-go-round scanning feature allows scans to be shared between concurrent queries.
Parallel plans are often non-deterministic, and the order of results might depend on the degree of parallelism selected at runtime and the concurrent workload on the server.
If the plan has nested loops with unordered prefetch, the inner side of the join can proceed using data from whichever I/Os happened to complete first.
Martin Smith has some great examples, but the absolute dead simple way to demonstrate when SQL Server will change the plan used (and therefore the order in which a query without ORDER BY returns its rows, based on the different plan) is to add a covering index. Take this simple example:
CREATE TABLE dbo.floob
(
blat INT PRIMARY KEY,
x VARCHAR(32)
);
INSERT dbo.floob VALUES(1,'zzz'),(2,'aaa'),(3,'mmm');
This will order by the clustered PK:
SELECT x FROM dbo.floob;
Results:
x
----
zzz
aaa
mmm
Now, let's add an index that happens to cover the query above.
CREATE INDEX x ON dbo.floob(x);
The index causes a recompile of the above query when we run it again; now it orders by the new index, because that index provides a more efficient way for SQL Server to return the results to satisfy the query:
SELECT x FROM dbo.floob;
Results:
x
----
aaa
mmm
zzz
Take a look at the plans - neither has a sort operator, they are just - without any other ordering input - relying on the inherent order of the index, and they are scanning the whole index because they have to (and the cheapest way for SQL Server to scan the index is in order). (Of course even in these simple cases, some of the factors in Martin's answer could influence a different order; but this holds true in the absence of any of those factors.)
As others have stated, the ONLY WAY TO RELY ON ORDER is to SPECIFY AN ORDER BY. Please write that down somewhere. It doesn't matter how many scenarios exist where this belief can break; the fact that there is even one makes it futile to try to find some guidelines for when you can be lazy and not use an ORDER BY clause. Just use it, always, or be prepared for the data to not always come back in the same order.
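In the example above, that means (a minimal sketch continuing the dbo.floob table; use whichever column you actually care about):
SELECT x FROM dbo.floob ORDER BY x;    -- always aaa, mmm, zzz
SELECT x FROM dbo.floob ORDER BY blat; -- always zzz, aaa, mmm
Either query returns the same order regardless of which indexes exist or which plan is chosen.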
Some related thoughts on this:
Bad habits to kick : relying on undocumented behavior
Why people think some SQL Server 2000 behaviors live on… 12 years later
Quote from Wikipedia:
"As SQL is a declarative programming language, SELECT queries specify a result set, but do not specify how to calculate it. The database translates the query into a "query plan" which may vary between executions, database versions and database software. This functionality is called the "query optimizer" as it is responsible for finding the best possible execution plan for the query, within applicable constraints."
It all depends on what the query optimizer picks as a plan - table scan, index scan, index seek, etc.
Other factors that might influence picking a plan are table/index statistics and parameter sniffing to name a few.
In short, the order is never guaranteed without an ORDER BY clause.
It's simple: if you need the data ordered then use an ORDER BY. It's not hard!
It may not cause you a problem today or next week or even next month but one day it will.
I've been on a project where we needed to rewrite dozens (or maybe hundreds) of queries after an upgrade to Oracle 10g caused GROUP BY to be evaluated in a different way than it had been on Oracle 9i, meaning that the queries weren't necessarily ordered by the grouped columns anymore. Not fun, and so simple to avoid.
Remember that SQL is a declarative language, so you are telling the DBMS what you want and the DBMS is then working out how to get it. It will bring back the same results every time, but it may evaluate the query in a different way each time: there are no guarantees.
Just one simple example of where this might cause you problems: new rows appear at the end of the table when you select from it... until they don't, because you've deleted some rows and the DBMS has decided to fill in the empty space.
There are an unknowable number of ways it can go wrong unless you use ORDER BY.
Why does water boil at 100 degrees C? Because that's the way it's defined.
Why are there no guarantees about result ordering without an ORDER BY? Because that's the way it's defined.
The DBMS will probably use the same query plan the next time and that query plan will probably return the data in the same order: but that is not a guarantee, not even close to a guarantee.
If you don't specify an ORDER BY, then the order will depend on the plan it uses. For example, if the query does a table scan and uses no index, then the result would be in the "natural order" or the order of the PK. However, if the plan decides to use IndexA, which is based on columnA, then the rows would come back in that index's order. Make sense?

Should the result of GROUP BY be sorted according to the SQL standard?

Should the result of GROUP BY be sorted according to the SQL standard?
Many databases return sorted results for GROUP BY,
but is that enforced by SQL-92 or any other standard?
No. GROUP BY has no standard impact on the order of rows returned. That's what ORDER BY is designed to do.
If you're getting some kind of repeatable or predictable sort order returned by a GROUP BY, it's something being done in your DBMS that is not defined in the standards.
As a previous answer has explained, no sorting is ever implied by any basic SQL construct other than ORDER BY.
However, to compute GROUP BY, either an index scan or an in-memory sort may take place (to create the buckets), and such an index scan, or sort, implies a traversal of the data in sorted order. So it is no accident that a particular database often behaves like this. Do not rely on it, however, because with a different set of indexes, or even just a different query plan (which may be triggered by as little as a few inserts and/or a restart of your database server), the behavior could be quite different.
Notice also that reordering the column list in the ORDER BY clause will result in reliably reordering the output, whereas reordering the column list in a GROUP BY clause will likely have no effect whatsoever.
There is no performance cost of using a seemingly "redundant" ORDER BY. The query plan will likely be identical, if the original one already guaranteed sorted output.
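A minimal sketch of the difference (the orders table and its columns are hypothetical):
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;
-- row order here is an implementation detail and can change with the plan

SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY customer_id;
-- order is guaranteed; if an index on customer_id already drives the grouping,
-- the plan is typically identical and the ORDER BY costs nothing extra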
Um, sorting the output of a GROUP BY is not in the standard because there are standard algorithms for grouping that do not produce results in order.
The most common of these is the use of a hash table for doing the group by.
In addition, on a multithreaded server, the data could be sorted, but the results would be returned processor-by-processor. There is no guarantee that the lowest order processor would be the first to return data.
And also, on a parallel machine, the data may be split among the processors using a variety of methods. For instance, all strings that end in "a" may go to one processor. All that end in "b" to another. These could then be sorted locally, but the results themselves would not be sorted overall.
Databases such as MySQL that guarantee a sort after the GROUP BY are making a poor design decision. In addition to not conforming to the standard, such databases either limit the choice of algorithm or impose additional processing for ordering.

How do database servers decide which order to return rows without any "order by" statements?

Kind of a whimsical question, always something I've wondered about and I figure knowing why it does what it does might deepen my understanding a bit.
Let's say I do "SELECT TOP 10 * FROM TableName". In short timeframes, the same 10 rows come back, so it doesn't seem random. They weren't the first or last created. In my massive sample size of...one table, it isn't returning the min or max auto-incrementing primary key value.
I also figure the problem gets more complex when taking joins into account.
My database of choice is MSSQL, but I figure this might be an interesting question regardless of the platform.
If you do not supply an ORDER BY clause on a SELECT statement you will get rows back in arbitrary order.
The actual order is undefined, and depends on which blocks/records are already cached in memory, what order I/O is performed in, when threads in the database server are scheduled to run, and so on.
There's no rhyme or reason to the order and you should never base any expectations on what order rows will be in unless you supply an ORDER BY.
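For example, in T-SQL (a minimal sketch; the id column is assumed to exist on TableName):
-- arbitrary order: whatever the chosen plan happens to produce
SELECT TOP 10 * FROM TableName;

-- deterministic order: the plan may still vary, but the rows come back sorted by id
SELECT TOP 10 * FROM TableName ORDER BY id;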
If they're not ordered by the calling query, I believe they're just returned in the order they were read off disk. This may vary because of the types of joins used or the indexes that looked up the values.
You can see this if the table has a clustered index on it (and you're just selecting - a JOIN can re-order things) - a SELECT will return the rows in clustered-index-order, even without an ORDER BY clause.
There is a very detailed explanation with examples here: http://sqlserverpedia.com/blog/sql-server-bloggers/its-the-natural-order-of-things-not/
"How do database servers decide which order to return rows without any “order by” statements?"
They simply do not take any "decision" with respect to ordering. They see the user doesn't care about ordering, and so they don't care either. And thus they simply go out to find the requested rows. The order in which they find them is normally the order in which you get them. That order depends on user-unpredictable things like the chosen physical access paths, ordering of physical records inside the database's physical files, etc. etc.
Don't let yourself be misled by the ordering as you get it, in the case where you didn't explicitly specify an ordering in your query. If you don't specify an ordering in your query, no ordering in the result set is guaranteed, even if in practice results seem to suggest that some ordering appears to be adhered to by the server.

How To Reduce Number of Rows Scanned by MySQL

I have a query that pulls 5 records from a table of ~10,000. The order clause isn't covered by an index, but the where clause is.
The query scans about 7,700 rows to pull these 5 results, and that seems like a bit much. I understand, though, that the complexity of the ordering criteria complicates matters. How, if at all, can I reduce the number of rows scanned?
The query looks like this:
SELECT *
FROM `mediatypes_article`
WHERE `mediatypes_article`.`is_published` = 1
ORDER BY `mediatypes_article`.`published_date` DESC,
         `mediatypes_article`.`ordering` ASC,
         `mediatypes_article`.`id` DESC
LIMIT 5;
mediatypes_article.is_published is indexed.
How many rows apply to "is_published = 1" ?
I assume that is something like... 7,700 rows?
Either way you take it, the full result that will match the WHERE clause has to be fetched and completely ordered by all sorting criteria. Then the full list of all sorted published articles will be truncated after the first 5 results.
Maybe it will help you to look at the MySQL documentation article about ORDER BY optimization, but as a first step you should try to apply indexes to the columns that appear in the ORDER BY clause. It is very likely that this will speed things up greatly.
Executing OPTIMIZE TABLE may not help, but it doesn't hurt either.
When you have ordering, you have to traverse the whole B-tree to figure out the proper order.
10,000 records to order is not a big enough amount to worry about. Remember, with proper indexing, the RDBMS doesn't fetch the whole record to figure out the order. It has the indexed columns in B-tree pages saved on disk, and with a few page reads the whole B-tree is loaded into memory and can be traversed.
In MySQL you can make an index that includes multiple columns. I think what you probably need to do is make an index that includes is_published and published_date. You should look at the output from the EXPLAIN statement to make sure it's doing things the smart way, and add an index if it is not.
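A minimal sketch of such an index (the index name is hypothetical; the DESC key parts require MySQL 8.0 or later, since earlier versions parse but ignore the direction in index definitions):
CREATE INDEX idx_article_published
ON mediatypes_article (is_published, published_date DESC, ordering ASC, id DESC);

-- with the equality on is_published and key parts matching the ORDER BY directions,
-- MySQL can read the first 5 rows straight off the index instead of sorting ~7,700 rows
EXPLAIN SELECT *
FROM mediatypes_article
WHERE is_published = 1
ORDER BY published_date DESC, ordering ASC, id DESC
LIMIT 5;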

MIN/MAX vs ORDER BY and LIMIT

Out of the following queries, which method would you consider the better one? What are your reasons (code efficiency, better maintainability, less WTFery)...
SELECT MIN(`field`)
FROM `tbl`;
SELECT `field`
FROM `tbl`
ORDER BY `field`
LIMIT 1;
In the worst case, where you're looking at an unindexed field, using MIN() requires a single full pass of the table. Using SORT and LIMIT requires a filesort. If run against a large table, there would likely be a significant difference in perceived performance. As an anecdotal data point, MIN() took .36s while SORT and LIMIT took .84s against a 106,000 row table on my dev server.
If, however, you're looking at an indexed column, the difference is harder to notice (a meaningless data point is 0.00s in both cases). Looking at the output of EXPLAIN, however, it looks like MIN() is able to simply pluck the smallest value from the index ('Select tables optimized away' and 'NULL' rows), whereas the SORT and LIMIT still needs to do an ordered traversal of the index (106,000 rows). The actual performance impact is probably negligible.
It looks like MIN() is the way to go - it's faster in the worst case, indistinguishable in the best case, is standard SQL and most clearly expresses the value you're trying to get. The only case where it seems that using SORT and LIMIT would be desirable would be, as mson mentioned, where you're writing a general operation that finds the top or bottom N values from arbitrary columns and it's not worth writing out the special-case operation.
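For reference, the indexed case above only requires something like (a minimal sketch; the index name is hypothetical):
CREATE INDEX idx_tbl_field ON `tbl` (`field`);

-- with this index, MIN(`field`) reads a single index entry, and
-- ORDER BY `field` LIMIT 1 can walk the index in order and stop after one row,
-- which is why both are near-instant in the indexed case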
SELECT MIN(`field`)
FROM `tbl`;
Simply because it is ANSI compatible. LIMIT 1 is particular to MySQL, as TOP is to SQL Server.
As mson and Sean McSomething have pointed out, MIN is preferable.
One other case where ORDER BY + LIMIT is useful is when you want to get the value of a different column than the MIN column.
Example:
SELECT some_other_field, field
FROM tbl
ORDER BY field
LIMIT 1
I think the answer depends on what you are doing.
If you have a one-off query and the intent is as simple as you specified, SELECT MIN(field) is preferable.
However, it is common for these types of requirements to change into: grab the top n results, grab the nth through mth results, etc.
I don't think it's too terrible an idea to commit to your chosen database. Changing databases is not a decision to be made lightly, and having to revise queries is the price you pay when you make that move.
Why limit yourself now, for pain you may or may not feel later on?
I do think it's good to stay ANSI as much as possible, but that's just a guideline...
Given acceptable performance I would use the first one because it is semantically closer to the intent.
If performance were an issue (most modern optimizers will probably optimize both to the same query plan, although you have to test to verify that), then of course I would use the faster one.
user650654 said that ORDER BY with LIMIT 1 is useful when one needs "to get the value of a different column than the MIN column". I think in this case we still get better performance with two single passes using MIN instead of sorting (hoping this is optimized):
SELECT some_other_field, field
FROM tbl
WHERE field = (SELECT MIN(field) FROM tbl);