Does SELECT DISTINCT imply a sort of the results - sql

Does including DISTINCT in a SELECT query imply that the resulting set should be sorted?
I don't think it does, but I'm looking for an authoritative answer (web link).
I've got a query like this:
Select Distinct foo
From Bar
In Oracle, the results are distinct but are not in sorted order. In Jet/MS-Access there seems to be some extra work being done to ensure that the results are sorted. I'm assuming that Oracle is following the spec in this case and MS Access is going beyond it.
Also, is there a way I can give the table a hint that it should be sorting on foo (unless otherwise specified)?

From the SQL92 specification:
If DISTINCT is specified, then let TXA be the result of eliminating redundant duplicate values from TX. Otherwise, let TXA be TX.
...
4) If an <order by clause> is not specified, then the ordering of the rows of Q is implementation-dependent.
Ultimately the real answer is that DISTINCT and ORDER BY are two separate parts of the SQL statement; if you don't have an ORDER BY clause, the results by definition will not be specifically ordered.

No. There are a number of circumstances in which a DISTINCT in Oracle does not imply a sort, the most important of which is the hashing algorithm used in 10g+ for both group by and distinct operations.
Always specify ORDER BY if you want an ordered result set, even in 9i and below.

There is no "authoritative" answer link, since this is something that no SQL server guarantees.
You will often see results in order when using DISTINCT, as a side effect of the best methods of finding those results. However, any number of other things can mix up the results, and some servers may hand back results unsorted even if they had to sort to compute them.
Bottom line: if your server doesn't guarantee something you shouldn't count on it.

Not to my knowledge, no. The only reason I can think of is that SQL Server would internally sort the data in order to detect and filter out duplicates, and thus return it in a "pre-sorted" manner. But I wouldn't rely on that "side effect" :-)

No, it is not implying a sort. In my experience, it sorts by the known index, which may happen to be foo.
Why be subtle? Why not specify Select Distinct foo from Bar Order by foo?

On at least one server I've used (probably either Oracle or SQL Server, about six years ago), SELECT DISTINCT was rejected if you didn't have an ORDER BY clause. It was accepted on the "other" server (Oracle or SQL Server). Your mileage may vary.

No, the results are not sorted. If you want to give it a 'hint', you can certainly supply an ORDER BY:
select distinct foo
from bar
order by foo
But keep in mind that you might want to sort on more than just alphabetically. Instead you might want to sort on criteria on other fields. See:
http://weblogs.sqlteam.com/jeffs/archive/2007/12/13/select-distinct-order-by-error.aspx

As the answers mostly say, DISTINCT does not mandate a sort - only ORDER BY mandates that. However, one standard way of achieving DISTINCT results is to sort; the other is to hash the values (which tends to lead to semi-random sequencing). Relying on the sort effect of DISTINCT would be foolish.
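To make the hashing point concrete, here is a minimal sketch in Python rather than SQL. The function name, the bucket count, and the use of `v % num_buckets` as a toy hash are all assumptions made for illustration; no real engine is claimed to work exactly this way. It only shows why a hash-based DISTINCT hands back values in an order unrelated to their sort order.

```python
def distinct_by_hash(values, num_buckets=4):
    """Toy hash-based DISTINCT over non-negative integers.

    Each value goes into the bucket chosen by a toy hash (v % num_buckets).
    Duplicates are dropped inside their bucket, then the buckets are
    emitted one after another.
    """
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        bucket = buckets[v % num_buckets]
        if v not in bucket:  # duplicate check happens inside one bucket only
            bucket.append(v)
    # Output order follows bucket layout, not the values themselves.
    return [v for bucket in buckets for v in bucket]


if __name__ == "__main__":
    data = [7, 3, 10, 3, 7, 1]
    print(distinct_by_hash(data))          # distinct, but not sorted
    print(sorted(distinct_by_hash(data)))  # sorted only because we asked
```

The values are distinct either way; only the explicit sort (the ORDER BY equivalent) makes the order dependable.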

In my case (SQL Server), as an example, I had a list of countries with a numerical value X assigned to each. When I ran select distinct * from Table order by X, the result was ordered by X, but the countries in the result set also came back ordered, even though I had not asked for that.
From my experience, I'll say that distinct does imply an implicit sort.

Yes. Oracle does use a sort to calculate a distinct. You can see that if you look at the explain plan. The fact that it did a sort for that calculation does not in any way imply that the result set will be sorted. If you want the result set sorted, you are required to use the ORDER BY clause.

Related

How does implicit sorting in SELECT *,MIN(x) work in SQLite?

Today I had an apparently very common problem of selecting the row with the minimum value from each group of a dataset split by a group by. I found a solution that is unique to SQLite (it works incorrectly in MySQL and throws an error in PostgreSQL) and doesn't use any joins. It looks like this:
SELECT *, min(x) FROM table GROUP BY y
Here is a fiddle with an example.
However, I don't understand why this works - just by including an aggregate function each group was somehow implicitly sorted and returned the row to which the result of the aggregate function corresponds. Default SQL behavior is to select an arbitrary row. I dug through relevant SQLite documentation and found no explanation of this. This is what I'd like an explanation for.
Edit: both answers so far guess that this is a coincidence. It is not. In the actual table I have ~90 records split into ~30 groups with this method and it works as expected on every one. See for yourself.
To be compatible with MySQL, SQLite allows the use of columns that are neither aggregated nor grouped by.
MySQL does not guarantee that the values come from any specific row, and neither did SQLite before version 3.7.11. However, due to how grouping is implemented in SQLite, the values in such columns happened to come from the row that matches the min()/max() in certain cases.
Some paying customer found this useful and wanted a guarantee for this, so SQLite enforced it in all cases and documented it in the changelog of version 3.7.11, which makes it a supported feature (i.e., it's tested, and will never be removed).
While it is safe to use, this behaviour is an extension of the SQL standard that was never properly designed, and never meant to be a selling feature, so it is not mentioned in the actual documentation.
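A small sketch of the behaviour described above, run through Python's built-in sqlite3 module. The table name `t`, its columns, and the function name are made up for the example, and it assumes the bundled SQLite is 3.7.11 or newer. It runs the bare-column form next to a standard correlated subquery so the two can be compared. The sample data has one minimum per group; with ties the subquery form returns every tied row.

```python
import sqlite3


def min_rows(rows):
    """Return (bare_column_result, portable_result) for the per-group minimum."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT, x INTEGER, y TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

    # SQLite-specific: bare columns come from the row that holds min(x).
    bare = conn.execute(
        "SELECT name, y, min(x) FROM t GROUP BY y ORDER BY y"
    ).fetchall()

    # Standard SQL: pick the rows whose x equals the minimum of their group.
    portable = conn.execute(
        "SELECT t.name, t.y, t.x FROM t "
        "WHERE t.x = (SELECT MIN(t2.x) FROM t AS t2 WHERE t2.y = t.y) "
        "ORDER BY t.y"
    ).fetchall()

    conn.close()
    return bare, portable


if __name__ == "__main__":
    sample = [("a", 5, "g1"), ("b", 2, "g1"), ("c", 9, "g2"), ("d", 4, "g2")]
    bare, portable = min_rows(sample)
    print(bare)
    print(portable)
```

Only the second query is expected to behave the same way on other database engines.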
It probably works by accident. SQLite will return an arbitrary row for each group. The row does not necessarily have to have the minimum x value for the group.
Learn to express the query correctly:
SELECT t.*
FROM table t
WHERE t.x = (SELECT MIN(t2.x) FROM table t2 WHERE t2.y = t.y)
The record you see was arbitrarily chosen.
You cannot count on behaviour that merely seems fixed to you.
It can be changed due to changes in the table structure (e.g. added/removed indexes), between versions etc.
https://www.sqlite.org/lang_select.html
If the SELECT statement is an aggregate query with a GROUP BY clause
...
Each expression in the result-set is then evaluated once for each
group of rows. If the expression is an aggregate expression, it is
evaluated across all rows in the group. Otherwise, it is evaluated
against a single arbitrarily chosen row from within the group. If
there is more than one non-aggregate expression in the result-set,
then all such expressions are evaluated for the same row.
This reminds me of a famous pitfall related to Oracle's GROUP BY.
Everybody just knew that if you use GROUP BY you can skip the ORDER BY because the result set is already ordered.
The reason the result set was ordered at that time is that Oracle used a sort based algorithm for the implementation of the group by.
In version 10gR2 Oracle added an additional GROUP BY algorithm based on HASH.
You can guess the rest of the story.

Does ms access group by order the results?

I need to run a query that groups the result and orders it. When I used the following query I noticed that the results were ordered by the field name:
SELECT name, count(name)
FROM contacts
GROUP BY name
HAVING count(name)>1
Originally I planned on using the following query:
SELECT name, count(name)
FROM contacts
GROUP BY name
HAVING count(name)>1
ORDER BY name
I'm worried that order by significantly slows the running time.
Can I depend on ms-access to always order by the field I am grouping by, and eliminate the order by?
EDIT: I tried grouping different fields in other tables and it was always ordered by the grouped field.
I have found answers to this question for other SQL DBMSs, but not for Access.
How GROUP BY and ORDER BY work in general
Databases usually choose between sorting and hashing when creating groups for GROUP BY or DISTINCT operations. If they do choose sorting, you might get lucky and the sorting is stable between the application of GROUP BY and the actual result set consumption. But at some later point, this may break as the database might suddenly prefer an order-less hashing algorithm to produce groups.
In no database should you ever rely on any implicit ordering behaviour. You should always use an explicit ORDER BY. If the database is sophisticated enough, adding an explicit ORDER BY clause will hint that sorting is the better choice for the grouping operation as well, as the sort can then be re-used in the query execution pipeline.
How this translates to your observation
I tried grouping different fields in other tables and it was always ordered by the grouped field.
Have you exhaustively tried all possible queries that could ever be expressed? I.e. have you tried:
JOIN
OUTER JOIN
semi-JOIN (using EXISTS or IN)
anti-JOIN (using NOT EXISTS or NOT IN)
filtering
grouping by many many columns
DISTINCT + GROUP BY (this will certainly break your ordering)
UNION or UNION ALL (which defeats this argument anyway)
I bet you haven't. And even if you tried all of the above, can you be sure there isn't a very peculiar configuration where the above breaks, just because you've observed the behaviour in some (many) experiments?
You cannot.
MS Access specific behaviour
As far as MS Access is concerned, consider the documentation on ORDER BY
Remarks
ORDER BY is optional. However, if you want your data displayed in sorted order, then you must use ORDER BY.
Notice the wording. "You must use ORDER BY". So, MS Access is no different from other databases.
The answer
So your question about performance is going in the wrong direction. You cannot sacrifice correctness for performance in this case. Better tackle performance by using indexes.
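As a rough illustration of "keep the ORDER BY, add an index", here is a sketch that uses SQLite through Python's sqlite3 module as a stand-in, since Access itself cannot be run here. The index name `idx_contacts_name` and the function name are assumptions for the example. Whether an engine actually uses the index to avoid a separate sort is up to its optimizer.

```python
import sqlite3


def duplicate_names(names):
    """Names that occur more than once, with counts, in a guaranteed order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contacts (name TEXT)")
    # An index on the grouped/sorted column gives the engine an ordered
    # access path it may use for both GROUP BY and ORDER BY.
    conn.execute("CREATE INDEX idx_contacts_name ON contacts (name)")
    conn.executemany("INSERT INTO contacts (name) VALUES (?)",
                     [(n,) for n in names])
    rows = conn.execute(
        "SELECT name, count(name) FROM contacts "
        "GROUP BY name HAVING count(name) > 1 "
        "ORDER BY name"  # the only thing that guarantees the order
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(duplicate_names(["zed", "amy", "zed", "bob", "amy", "amy"]))
```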
Here is the MSDN documentation for the GROUP BY clause in Access SQL:
https://msdn.microsoft.com/en-us/library/bb177905(v=office.12).aspx
The page makes no reference to any implied or automatic ordering of results - if you do see desired ordering without an explicit ORDER BY then it is entirely coincidental.
The only way to guarantee the particular ordering of results in SQL is with ORDER BY.
There is a slight performance cost to using ORDER BY (in general): the DBMS has to get all of the results before it can output the first row. The DBMS is free to use an "online sort" algorithm that sorts data as it gets each row from its backing store, but it still needs to get the last row from the backing store before it can return the first row to the client, in case the last row from the backing store happens to be the first result according to the ORDER BY. However, unless you're querying tens of thousands of rows in a latency-sensitive application, this is not a problem, and as you're using Access already it's very clear that this is not a performance-sensitive application.
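The "must read the last row before returning the first" point can be sketched in plain Python. The names `counting_source`, `first_sorted_row` and `first_unsorted_row` are hypothetical, and a generator stands in for the backing store. This models the reasoning only, not any real DBMS.

```python
def counting_source(rows, counter):
    """Yield rows one at a time, counting how many have been read."""
    for row in rows:
        counter["read"] += 1
        yield row


def first_sorted_row(rows):
    """First row of a sorted result, plus how many rows had to be read."""
    counter = {"read": 0}
    ordered = iter(sorted(counting_source(rows, counter)))
    first = next(ordered)
    return first, counter["read"]


def first_unsorted_row(rows):
    """First row without any ordering, plus how many rows had to be read."""
    counter = {"read": 0}
    first = next(counting_source(rows, counter))
    return first, counter["read"]


if __name__ == "__main__":
    data = [5, 3, 9, 1]
    print(first_sorted_row(data))    # every row is read before the first is returned
    print(first_unsorted_row(data))  # a single read is enough
```

The smallest value sits last in the input here, which is exactly the case that forces the sort to consume everything first.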

Does order by in view guarantee order of select?

I have a view for which it only makes sense to use a certain ordering. What I would like to do is to include the ORDER BY clause in the view, so that all SELECTs on that view can omit it. However, I am concerned that the ordering may not necessarily carry over to the SELECT, because it didn't specify the order.
Does there exist a case where an ordering specified by a view would not be reflected in the results of a select on that view (other than an order by clause in the view)?
You can't count on the order of rows in any query that doesn't have an explicit ORDER BY clause. If you query an ordered view, but you don't include an ORDER BY clause, be pleasantly surprised if they're in the right order, and don't expect it to happen again.
That's because the query optimizer is free to access rows in different ways depending on the query, table statistics, row counts, indexes, and so on. If it knows your query doesn't have an ORDER BY clause, it's free to ignore row order in order (cough) to return rows more quickly.
Slightly off-topic . . .
Sort order isn't necessarily identical across platforms even for well-known collations. I understand that sorting UTF-8 on Mac OS X is particularly odd. (PostgreSQL developers call it broken.) PostgreSQL relies on strcoll(), which I understand relies on the OS locales.
It's not clear to me how PostgreSQL 9.1 will handle this. In 9.1, you can have multiple indexes, each with a different collation. An ORDER BY that doesn't specify a collation will usually use the collation of the underlying base table's columns, but what will the optimizer do with an index that specifies a different collation than an unindexed column in the base table?
Couldn't see how to reply further up. Just adding my reply here.
You can rely on the ordering in every case where you could rely on it if you manually wrote the query.
That's because PostgreSQL rewrites your query merging in the view.
CREATE VIEW v AS SELECT * FROM people ORDER BY surname;
-- next two are identical
SELECT * FROM v WHERE forename='Fred';
SELECT * FROM people WHERE forename='Fred' ORDER BY surname;
However, if you use the view as a sub-query then the sorting might not remain, just as the output order from a sub-query is never maintained.
So - am I saying to rely on this? No, probably better all round to specify your desired sort order in the application. You'll need to do it for every other query anyway. If it's a utility view for DBA use, that's a different matter though - I have plenty of utility views that provide sorted output.
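For completeness, a sketch of the "specify the order in the outer query anyway" advice. It uses SQLite via Python's sqlite3 module as a stand-in for PostgreSQL, reuses the `people` table and view `v` from the answer above, and the function name is made up. It says nothing about how PostgreSQL rewrites views.

```python
import sqlite3


def surnames_for(forename, people):
    """Query an ordered view, but still state the order in the outer query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (forename TEXT, surname TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", people)
    conn.execute("CREATE VIEW v AS SELECT * FROM people ORDER BY surname")
    rows = conn.execute(
        # The outer ORDER BY is what the caller can rely on, whatever
        # the engine does with the ORDER BY inside the view.
        "SELECT surname FROM v WHERE forename = ? ORDER BY surname",
        (forename,),
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]


if __name__ == "__main__":
    data = [("Fred", "Zhu"), ("Fred", "Adams"), ("Ann", "Miller")]
    print(surnames_for("Fred", data))
```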
While observations have so far been true for the following, this answer is not definitive by any means. @Catcall and I both could not find anything definitive in the documentation and I have to admit, I'm too lazy to wade through and make sense of the source code.
But for observation's sake, consider the following:
SELECT * FROM (select * from foo order by bar) foobar;
The query should return ordered.
SELECT * FROM vw_foo; -- where vw_foo is the sub-select above
The query should return ordered.
SELECT * FROM vw_foo LEFT JOIN (select * from bar) bar ON vw_foo.id = bar.id;
The query should use its own discretion and may return unordered.
Disclaimer:
Much like @Catcall said, you should never truly depend on any implicit sorting, as many times it will be left up to the database engine. Databases are designed for speed and reliability; they often interface with memory and try to pull/push data as quickly as possible. However, the ordering isn't solely based on memory management; several other factors are involved.
Unless you have something specific in mind, you should do your sorting at the end (on the outer query).
If the above observation was true, something like the following should always turn the results in the correct order:
SELECT *
FROM (select trunc(random()*999999+1) as i
from generate_series(1,1000000)
order by i
) foo;
The simple process would be: perform preprocessing and perform query identification (identify that an order exists), start loop, fetch first field (generate random number), add to output stack in sorted order. The ordering may also occur at the end of the stack generation, instead of during (eg compile the list and then do the sorting). This depends on versioning and the query.

question about aggregate function internals in SQL/Postgres

How does a function like SUM work? If I execute
select id,sum(a) from mytable group by id
does it sort by id and then sum over each range of equal id's? I am no planner expert, but it looks like that is what is happening, where mytable is maybe a hundred million rows with a few million distinct id's.
Or does it just keep a hash of id -> current_sum, and then at each row either increments the value of id or add a new key? Isn't that far faster and less memory hungry?
SQL standards try to dictate external behavior, not internal behavior. In this particular case, a SQL implementation that conforms to (one of the many) standards is supposed to act like it does things in this order.
Build a working table from all the table constructors in the FROM clause. (There's only one in your example.)
In the GROUP BY clause, partition the working table into groups. Reduce each group to one row. Replace the working table with the grouped table.
Resolve the expressions in the SELECT clause.
Query optimizers that follow SQL standards are free to rearrange things however they like, as long as the result is the same as if it had followed those steps.
You can find more details in the answers and comments to this SO question.
So, I found this, http://helmingstay.blogspot.com/2009/06/postgresql-poetry-aggregate-median-with.html, which claims that it does indeed use the accumulator pattern. Hmmm.
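The two strategies from the question (a hash of id -> current_sum versus sorting by id and summing each run) can be sketched in Python. The function names are hypothetical and this is not how PostgreSQL implements either plan. It only shows that both give the same sums, while the order of the groups differs.

```python
from itertools import groupby
from operator import itemgetter


def sum_by_hash(rows):
    """Accumulator pattern: one running total per id, single pass, no sort."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def sum_by_sort(rows):
    """Sort by id, then sum each run of equal ids."""
    totals = {}
    ordered = sorted(rows, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        totals[key] = sum(value for _, value in group)
    return totals


if __name__ == "__main__":
    rows = [(2, 10), (1, 5), (2, 1), (1, 7)]
    print(sum_by_hash(rows))  # groups appear in first-seen order
    print(sum_by_sort(rows))  # groups appear in id order
```

The hash version holds one entry per distinct id in memory. The sort version has to order all rows first, but then needs only one running total at a time.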

Select Distinct without sorting

I used a Select Distinct query, which gave me sorted data. Is there any way to get the data back unsorted?
I'll try to elaborate a bit as to what's going on and why... though I agree with @vic's comment on the question...
Without explicitly stating an order (via an order by clause) there is absolutely no guarantee of any order in the result set.
Practically speaking, many queries will return a consistent order based on the query plan and how the data is actually stored and accessed... DO NOT RELY ON THIS!
Specifically, for a distinct query, the sql engine will sort the data so that it can be sure to remove any duplicates.
In short, if the order of the result set matters (even if the desired order is "random") you must ALWAYS explicitly state it. That said, from a purely set-based-math/sql standpoint, the order of the result shouldn't matter.
Put this at the end of your query. This will effectively randomize the results which then will appear to you non-sorted ;)
ORDER BY Rnd([ID]);
Replace the ID with primary key of the table. In Access SQL it is possible to call certain VB Functions directly. In this case the Rnd function can be called in a query and fed a seed value from the data being sorted.
I think sorting may have something to do with the way DISTINCT is determined.
The easiest way to return distinct values is to sort the selection set
returned by processing the SQL predicate and then
returning only the rows where the DISTINCT columns change value from the prior row.
In short,
DISTINCT requires a sort to be performed where duplicate rows are dropped.
That said, there is no guarantee that rows are returned to you in any particular
order unless you explicitly include an ORDER BY clause.
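The sort-then-compare-with-the-prior-row approach described above looks roughly like this in Python. The function name is made up, and a real engine may pick a different strategy or deliver the rows in a different order afterwards.

```python
def distinct_via_sort(rows):
    """Sort the rows, then keep a row only when it differs from the prior one."""
    result = []
    for row in sorted(rows):
        # After sorting, duplicates are adjacent, so comparing with the
        # last kept row is enough to detect them.
        if not result or row != result[-1]:
            result.append(row)
    return result


if __name__ == "__main__":
    print(distinct_via_sort([3, 1, 3, 2, 1]))
```

The sorted output here is a by-product of the method, which is why it can look like a guarantee when it is not one.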