Performance of ISNULL() in GROUP BY clause SQL

I have been refactoring some old queries recently and noticed that a lot of them repeat ISNULL() in the GROUP BY clause, where it is used in the SELECT clause. I feel in my bones that removing the ISNULL() in the GROUP BY clause will improve performance, but I can't find any documentation on whether it is actually likely to or not. Here's the sort of thing I mean:
SELECT
ISNULL(Foo,-1) AS Foo
,ISNULL(Bar,-1) AS Bar
,SUM(This) AS This
,SUM(That) AS That
FROM
dbo.ThisThatTable AS ThisThat
LEFT JOIN dbo.FooBarTable AS FooBar ON ThisThat.FooBarId = FooBar.Id
GROUP BY
ISNULL(Foo,-1)
,ISNULL(Bar,-1);
GO
The above is the way I keep coming across - When there is grouping on the Foo column, the SELECT and the GROUP BY for selected columns match exactly. The example below is a possible alternative - some possibly unnecessary ISNULL() calls have been removed, and the SELECT and GROUP BY clauses no longer match.
SELECT
ISNULL(Foo,-1) AS Foo
,ISNULL(Bar,-1) AS Bar
,SUM(This) AS This
,SUM(That) AS That
FROM
dbo.ThisThatTable AS ThisThat
LEFT JOIN dbo.FooBarTable AS FooBar ON ThisThat.FooBarId = FooBar.Id
GROUP BY
Foo
,Bar;
GO
I suppose maybe when the SELECT and GROUP BY clauses match, the optimiser only has to do the ISNULL() calculation once to know what is going on, so it might be theoretically more performant to group by the results that are actually selected? Alternatively, maybe it is better to avoid adding a second set of ISNULL() calls that don't change the granularity of the data at all... Maybe the optimiser is clever enough to realise that the NULLs in the grouping are (in this case) -1s in the selection...?
I personally would prefer removing any unnecessary functions, especially ones that might affect index usage, but when I look online, the references to performance are all like the answers here, about using ISNULL() in the WHERE clause, which I already know to avoid.
I also suspect that any gains are going to be vanishingly small, so this is really asking for an academic or theoretical answer, but as I work, I keep wondering and it bugs me, so I thought I would ask if anyone has any thoughts.

Non-aggregated columns in SELECT clauses generally must precisely match the ones in GROUP BY clauses. If I were you, and I were dealing with tested production code, I would not make the change you propose.
Edit: the match between non-aggregated SELECT columns and GROUP BY columns is necessary for GROUP BY. If the columns in SELECT are 1:1 dependent on the columns in GROUP BY, it will work. Otherwise the results are ambiguous.

Internally, SQL does not really have two copies of each ISNULL. They are all flattened together in the internal tree used during compilation. So, this level of optimization is not useful to consider in SQL Server. A query without any ISNULL in it would probably perform a bit faster and potentially a lot faster depending on the rest of the schema and query. However, the ISNULL in the select list and the GROUP BY list are not executed twice in most cases within SQL - this level of detail can show up in showplan, but it's often below the level of detail most people would care to examine.

There are a few different aspects to consider here:
Referring to the same value multiple times in the same scope
In most situations, the optimizer is clever enough to collapse these into calculating them once. The fact that you have a GROUP BY over them makes this even more likely.
Is it faster to group when the value is guaranteed to not be null?
Possibly, although I doubt the difference is measurable.
The SELECT does not have to match exactly, it only needs to be functionally dependent on GROUP BY columns and aggregation functions. It may not be functionally dependent on any other columns.
The most important thing to consider: indexing.
This is much, much more important than the other considerations. When grouping, if you can hit an index then it will go much faster, because it can remove sorting and just use Stream Aggregate. This is not possible if you use ISNULL in the GROUP BY (barring computed columns or indexed views).
Note that your results will not be the same: the first example collapses the NULL group into the -1 group. The second example does not, so you may want to remove the ISNULL from the SELECT also, in order to differentiate them. Alternatively, put a WHERE ... IS NOT NULL instead.
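The difference in results described above can be reproduced with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the query plans will not match, but the grouping semantics are the same. The table and its rows are hypothetical, and COALESCE stands in for ISNULL because SQLite has no two-argument ISNULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FooBar (Foo INTEGER, This INTEGER);
    -- Hypothetical data: one real -1 key, one NULL key, one ordinary key.
    INSERT INTO FooBar VALUES (-1, 10), (NULL, 20), (1, 30);
""")

# Variant 1: group by the wrapped expression; NULL and -1 land in one group.
grouped_on_expression = conn.execute("""
    SELECT COALESCE(FooBar.Foo, -1) AS Foo, SUM(This) AS This
    FROM FooBar
    GROUP BY COALESCE(FooBar.Foo, -1)
    ORDER BY 1, 2
""").fetchall()

# Variant 2: group by the raw column; NULL keeps its own group but is
# displayed as -1, so two output rows carry the same label.
grouped_on_column = conn.execute("""
    SELECT COALESCE(FooBar.Foo, -1) AS Foo, SUM(This) AS This
    FROM FooBar
    GROUP BY FooBar.Foo
    ORDER BY 1, 2
""").fetchall()

print(grouped_on_expression)  # [(-1, 30), (1, 30)]
print(grouped_on_column)      # [(-1, 10), (-1, 20), (1, 30)]
```

If -1 can never occur as a real key, the second variant still returns one row per group, which is why the two forms often look interchangeable in practice.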

Related

SQL Execution Order: does it exist or not?

I am really confused about the execution order in SQL. Basically, given any query (assume it's a complex one that has multiple JOINS, WHERE clauses, etc), is the query executed sequentially or not?
From the top answer at Order Of Execution of the SQL query, it seems like "SQL has no order of execution. ... The optimizer is free to choose any order it feels appropriate to produce the best execution time."
From the top answer at What's the execute order of the different parts of a SQL select statement?, in contrast, we see a clear execution order in the form
"
FROM
ON
OUTER
WHERE
...
"
I feel like I am missing something, but it seems as though the two posts are contradicting each other, and different articles online seem to support either one or the other.
But more fundamentally, what I wanted to know initially is this: Suppose we have a complex SQL query with multiple joins, INNER JOINs and LEFT JOINS, in a specific order. Is there going to be an order to the query, such that a later JOIN will apply to the result of an earlier join rather than to the initial table specified in the top FROM clause?
It is tricky. The short answer is: the DBMS will decide what order is best such that it produces the result that you have declared (remember, SQL is declarative, it does not prescribe how the query is to be computed).
But we can think of a "conceptual" order of execution that the DBMS will use to create the result. This conceptual order might be totally ignored by the DBMS, but if we (humans) follow it, we will get the same results as the DBMS. I see this as one of the benefits of a DBMS. Even if we suck and write an inefficient query, the DBMS might say, "no, no, this query you gave me sucks in terms of performance, I know how to do better" and most of the time, the DBMS is right. Sometimes it is not, and rewriting a query helps the DBMS find the "best" approach. This is very dependent on the DBMS of course...
This conceptual order helps us (humans) understand how the DBMS executes a query. The steps are listed below.
First the order for non-aggregation:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the SELECT portion (report results, this is called projection).
If you use an aggregation function, without a group by then:
Do the FROM section. Includes any joins, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the aggregation function in the SELECT portion (converting all tuples of the result into one tuple). There is an implicit group by in this query.
If you use a group by:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Cluster subsets of the tuples according to the GROUP BY.
For each cluster of these tuples:
if there is a HAVING, do this predicate (similar to selection of the WHERE). Note that you can have access to aggregation functions.
For each cluster of these tuples output exactly one tuple such that:
Do the SELECT part of the query (similar to select in above aggregation, i.e. you can use aggregation functions).
Window functions happen during the SELECT stage (they take into consideration the set of tuples that would be output by the select at that stage).
There is one more kink:
if you have
select distinct ...
then, after everything else is done, remove DUPLICATED tuples from the results (i.e. return a set of tuples, not a list).
Finally, do the ORDER BY. The ORDER BY happens in all cases at the end, once the SELECT part has been done.
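As a rough illustration of the conceptual order for a grouped query, here is a minimal sketch using Python's built-in sqlite3 module. The sales table and its rows are made up for the example, and the numbered comments mark the conceptual steps, not what the engine physically does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 5), ('north', 50), ('south', 40), ('south', 70), ('west', 3);
""")

rows = conn.execute("""
    SELECT region, SUM(amount) AS total   -- 4. one output tuple per group
    FROM sales                            -- 1. build the input tuples
    WHERE amount >= 10                    -- 2. drop individual tuples
    GROUP BY region                       -- 3. cluster what is left
    HAVING SUM(amount) > 45               --    and drop whole clusters
    ORDER BY total DESC                   -- 5. order the final tuples
""").fetchall()

print(rows)  # [('south', 110), ('north', 50)]
```

The total for north is 50 rather than 55 because the WHERE step removed the 5 before the groups were formed, and west never forms a group at all.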
With respect to JOINS. As I mentioned above, they happen at the "FROM" part of the conceptual execution. The WHERE, GROUP BY, SELECT apply on the results of these operations. So you can think of these being the first phase of the execution of the query. If it contains a subquery, the process is recursive.
By the way, you can refer in an inner query to a relation in the outside context of the inner query, but not the other way around.
All of this is conceptual. In reality the DBMS might rewrite your query for the purpose of efficiency.
For example, assume R(a,b) and S(a,c), where S(a) is a foreign key that references R(a).
The query:
select b from R JOIN S using (a) where a > 10
can be rewritten by the DBMS to something similar to this:
select b FROM R JOIN (select a from s where a > 10) as T using (a);
or:
select b FROM (select * from R where a > 10) as T JOIN S using (a);
In fact, the DBMS does this all the time. It takes your query and creates alternative queries. Then it estimates the execution time of each one and decides which is most likely to be the fastest. And then it executes it.
This is a fundamental process of query evaluation. Note that the 3 queries are identical in terms of results. But depending on the sizes of the relations, they might have very different execution times. For example, if R and S are huge, but very few tuples have a > 10, then the join wastes time. Each query with a subselect might perform fast if that subselect matches very few tuples, but badly if it matches a lot of tuples. This is the type of "magic" that happens inside the query evaluation engine of the DBMS.
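To check that the three formulations really are interchangeable in terms of results, here is a small sketch using Python's built-in sqlite3 module. The sample rows are invented, and the sketch says nothing about which form any particular optimizer would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (a INTEGER PRIMARY KEY, b TEXT);
    CREATE TABLE S (a INTEGER REFERENCES R(a), c TEXT);
    INSERT INTO R VALUES (5, 'b5'), (11, 'b11'), (20, 'b20');
    INSERT INTO S VALUES (5, 'x'), (11, 'y'), (11, 'z'), (20, 'w');
""")

queries = [
    "select b from R JOIN S using (a) where a > 10",
    "select b FROM R JOIN (select a from S where a > 10) as T using (a)",
    "select b FROM (select * from R where a > 10) as T JOIN S using (a)",
]

# Sort each result so the comparison ignores row order, which SQL does not
# guarantee without an ORDER BY.
results = [sorted(conn.execute(q).fetchall()) for q in queries]

print(results[0])                              # [('b11',), ('b11',), ('b20',)]
print(results[0] == results[1] == results[2])  # True
```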
You are confusing Order of execution with Logical Query Processing.
I did a quick google search and found a bunch of articles referring to Logical Query Processing as "order of execution". Let's clear this up.
Logical Query Processing
Logical Query Processing details the under-the-hood processing phases of a SQL Query... First the FROM clause is evaluated for the optimizer to know where to get data from, then table operators, etc.
Understanding this will help you better design and tune queries. Logical query processing order will help you understand why you can reference a column by its alias in an ORDER BY clause but not anywhere else.
Order of Execution
Consider this WHERE clause:
WHERE t1.Col1 = 'X'
AND t2.Col2 = 1
AND t3.Col3 > t2.Col4
The optimizer is not required to evaluate these predicates in any order; it can evaluate t2.Col2 = 1 first, then t1.Col1 = 'X'.... In some cases the optimizer can also evaluate joins in a different order than you have presented in your query. When predicate logic dictates that the result will be the same, it is free to make (what it considers) the best choices for optimal performance.
Sadly there is not a lot about this topic out there. I do discuss this a little more here.
First there's the SQL query and the rules of SQL that apply to it. That's what in the other answers is referred to as "Logical query processing". With SQL you specify a result. The SQL standard does not allow you to specify how this result is reached.
Then there's the query optimizer. Based on statistics, heuristics, amount of available CPU, memory and other factors, it will determine the execution plan. It will evaluate how long the execution is expected to take. It will evaluate different execution plans to find the one that executes fastest. In that process, it can evaluate execution plans that use different indexes, and/or rearrange the join order, and/or leave out (outer) joins, etc. The optimizer has many tricks. The more expensive the best execution plan is expected to be, the more (advanced) execution plans will be evaluated. The end result is one (serial) execution plan and potentially a parallel execution plan.
All the evaluated execution plans will guarantee the correct result; the result that matches execution according to the "Logical query processing".
Finally, there's the SQL Server engine. After picking either the serial or parallel execution plan, it will execute it.
The other answers, whilst containing useful and interesting information, risk causing confusion in my view.
They all seem to introduce the notion of a "logical" order of execution, which differs from the actual order of execution, as if this is something special about SQL.
If someone asked about the order of execution of any ordinary language besides SQL, the answer would be "strictly sequential" or (for expressions) "in accordance with the rules of that language". I feel as though we wouldn't be having a long-winded exploration about how the compiler has total freedom to rearrange and rework any algorithm that the programmer writes, and distinguishing this from the merely "logical" representation in the source code.
Ultimately, SQL has a defined order of evaluation. It is the "logical" order referred to in other answers. What is most confusing to novices is that this order does not correspond with the syntactic order of the clauses in an SQL statement.
That is, a simple SELECT...FROM...WHERE...ORDER BY query would actually be evaluated by taking the table referred to in the from-clause, filtering rows according to the where-clause, then manipulating the columns (including filtering, renaming, or generating columns) according to the select-clause, and finally ordering the rows according to the order-by-clause. So clauses here are evaluated second, third, first, fourth, which is a disorderly pattern to any sensible programmer - the designers of SQL preferred to make it correspond more in their view to the structure of something spoken in ordinary English ("tell me the surnames from the register!").
Nevertheless, when the programmer writes SQL, they are specifying the canonical method by which the results are produced, same as if they write source code in any other language.
The query simplification and optimisation that database engines perform (like that which ordinary compilers perform) would be a completely separate topic of discussion, if it hadn't already been conflated. The essence of the situation on this front, is that the database engine can do whatever it damn well likes with the SQL you submit, provided that the data it returns to you is the same as if it had followed the evaluation order defined in SQL.
For example, it could sort the results first, and then filter them, despite this order of operations being clearly different to the order in which the relevant clauses are evaluated in SQL. It can do this because if you (say) have a deck of cards in random order, and go through the deck and throw away all the aces, and then sort the deck into standard order, the outcome (in terms of the final content and order of the deck) is no different than if you sort the deck into standard order first, and then go through and throw away all the aces. But the full details and rationale of this behaviour would be for a separate question entirely.
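The deck-of-cards argument is easy to check mechanically. The following is a small sketch in plain Python; the helper names (standard_order, drop_aces) are invented for the illustration, and "standard order" is assumed to mean by suit, then by rank.

```python
import random

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]

def standard_order(card):
    """Sort key: by suit, then by rank within the suit."""
    rank, suit = card
    return (SUITS.index(suit), RANKS.index(rank))

def drop_aces(cards):
    """The filtering step: keep every card that is not an ace."""
    return [card for card in cards if card[0] != "A"]

deck = [(rank, suit) for suit in SUITS for rank in RANKS]
random.Random(0).shuffle(deck)  # fixed seed so the run is repeatable

filter_then_sort = sorted(drop_aces(deck), key=standard_order)
sort_then_filter = drop_aces(sorted(deck, key=standard_order))

print(filter_then_sort == sort_then_filter)  # True
print(len(filter_then_sort))                 # 48
```

Both orders of operation give the same 48 cards in the same sequence, which is exactly the freedom a database engine exploits.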

Should I use distinct in my queries

Where I work, I was recently told that using distinct in your queries is a bad sign in a programmer. So I am wondering: I guess the only way to avoid this function is to use a group by.
It was my understanding that the distinct function works very similarly to a group by except in how it's read. A distinct function checks each individual selection criterion, whereas a group by does the same thing as a whole.
Keep in mind I only do reporting. I do not create/alter the data. So my question is: for best practices, should I be using distinct or group by? If neither, then is there an alternative? Maybe the group by should be used in more complex queries than my non-real example here, but you get the idea. I could not find an answer that really explained why or why not I should use distinct in my queries.
select distinct
spriden_user_id as "ID",
spriden_last_name as "last",
spriden_first_name as "first",
spriden_mi_name as "MI",
spraddr_street_line1 as "Street",
spraddr_street_line2 as "Street2",
spraddr_city as "city",
spraddr_stat_code as "State",
spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
VS
select
spriden_user_id as "ID",
spriden_last_name as "last",
spriden_first_name as "first",
spriden_mi_name as "MI",
spraddr_street_line1 as "Street",
spraddr_street_line2 as "Street2",
spraddr_city as "city",
spraddr_stat_code as "State",
spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
group by "ID","last","first","MI","Street","Street2","city","State","zip"
Databases are smart enough to recognize what you mean. I expect both of your queries to perform equally well. It is important for someone else maintaining your query to know what you meant. If you really meant to retrieve distinct records, use DISTINCT. If your intention was to do aggregation, use GROUP BY.
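As a quick sanity check that the two forms return the same rows, here is a minimal sketch using Python's built-in sqlite3 module. The addresses table is a made-up stand-in for the tables in the question, and the sketch compares results only, not performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addresses (person_id INTEGER, city TEXT);
    INSERT INTO addresses VALUES
        (1, 'Leeds'), (1, 'Leeds'), (2, 'York'), (2, 'Hull');
""")

with_distinct = conn.execute(
    "SELECT DISTINCT person_id, city FROM addresses"
).fetchall()

with_group_by = conn.execute(
    "SELECT person_id, city FROM addresses GROUP BY person_id, city"
).fetchall()

# Compare as sorted lists: neither query promises an output order.
print(sorted(with_distinct) == sorted(with_group_by))  # True
print(sorted(with_distinct))  # [(1, 'Leeds'), (2, 'Hull'), (2, 'York')]
```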
Take a look at this question. There are some nice answers that might help.
The answer provided by @zedfoxus is useful to understand the context.
However, I don't believe your query should require distinct records if the data is designed correctly.
It appears you are selecting the primary key of table spriden, so all that data should be unique. You're also joining onto the spraddr table; does that table really contain valid duplicate data? Or is there perhaps an additional join criterion that's required to filter out those duplicates?
This is why I get nervous about use of "distinct" - the spraddr table may include additional columns which you should use to filter out data, and "distinct" may be hiding that.
Also, you may be generating a massive result set which needs to be filtered by the "distinct" clause, which can cause performance issues. For instance, there may be 1 million rows in spraddr for each row in spriden, when you should be using the "is_current" flag to find the 2 or 3 "real" ones.
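Here is a small sketch of how that masking can look, using Python's built-in sqlite3 module. The person and address tables, their rows, and the is_current column are all hypothetical; the real spriden and spraddr schemas are not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (person_id INTEGER, city TEXT, is_current INTEGER);
    INSERT INTO person VALUES (1, 'Ann');
    -- Two stale rows for the same old address plus one current row.
    INSERT INTO address VALUES (1, 'Leeds', 0), (1, 'Leeds', 0), (1, 'York', 1);
""")

join_sql = """
    FROM person JOIN address ON address.person_id = person.id
"""

# How many rows the join produces before any de-duplication.
rows_before_distinct = conn.execute("SELECT COUNT(*)" + join_sql).fetchone()[0]

# DISTINCT removes the exact duplicate but keeps the stale address.
masked = conn.execute(
    "SELECT DISTINCT person.id, address.city" + join_sql + "ORDER BY 1, 2"
).fetchall()

# The missing filter gives the intended single row without DISTINCT.
filtered = conn.execute(
    "SELECT person.id, address.city" + join_sql + "WHERE address.is_current = 1"
).fetchall()

print(rows_before_distinct)  # 3
print(masked)                # [(1, 'Leeds'), (1, 'York')]
print(filtered)              # [(1, 'York')]
```

The DISTINCT version looks tidy but still returns an address that should have been filtered out, and it does the extra join work first.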
Finally, I get nervous when I see "group by" used as a substitute for distinct, not because it's "wrong", but because stylistically, I believe group by should be used for aggregate functions. That's just a personal preference.
In your example distinct and group by do the same thing. I think your colleagues mean that your query should not return duplicates in the first instance and that you should be able to write your query without a distinct or group by clause. You may be able to reduce the duplicates by extending your join conditions.
Ask them why it is a bad practice. A lot of people make up rules or come up with things that they consider bad practice from reading the first page of the book or the first result of a google search. If it does the job and doesn't cause any issues there is no reason to create more work by finding alternatives. From the two options you have posted I would use distinct too because it's shorter and easier to read and maintain.
Whoever told you using DISTINCT is a bad sign in itself is wrong. In reality, it all depends on what problem you are trying to solve by using DISTINCT in the first place.
If you're querying a table that is expected to have repeated values of some field or combination of fields, and you're reporting a list of the values or combinations of values (and not performing any aggregations on them), then DISTINCT is the most sensible thing to use. It doesn't really make sense in my mind to use GROUP BY instead just because somebody thinks DISTINCT shouldn't be used. Indeed, I think this is the kind of thing DISTINCT is designed for.
If OTOH you've found that your query has a bug meaning that repeated values are being returned, you shouldn't use either DISTINCT or GROUP BY to cancel out this bug. Rather, you should figure out the cause of the bug and fix it.
Using DISTINCT as a safety net is also a poor practice, as it potentially hides problems, and furthermore it can be computationally expensive (typically O(n log n) or O(n²)). In this scenario, I can't see that using GROUP BY instead would help you.
Yes, DISTINCT tends to raise a little alarm in my head when I come across it in someone's query. It is required in some cases of course, but most data models should not require it. It tends to be a last resort, or an outlier case. It may also be symptomatic of a bad application sitting on top of the database, allowing duplicate entries to be inserted, or rows to be updated into duplicates (with no corresponding database-level constraints to prevent such actions). So the first thing to check is the data. It could be a sign of bad data model design. But most likely the query should not get to that stage in a select where duplicate rows are lingering.
In constructing a large query, I would normally start with the nugget of a subquery that specifies the unique fields. Any subquery after that must inner join or left join onto it, but never add to or reduce the number of rows already defined by the nugget query, while remembering to handle the possible NULLs from the left joins.
So for example, the nugget query could select the right rows also by using Partitions to, for example, select the most recent row of a joined table, or to do some other grouping at that stage.
In your example, I would not expect duplicates. If a person can have historical addresses, fine, but then do you need to see all addresses, or only the most recent? And if there were duplicate addresses for the same person, does that mean incorrectly duplicated data, or does it mean the person left that address but returned to it later? In that case the partition select would fix it with much better control than a distinct, especially when fields are added to the query by someone else later and break the distinct-ness.
This means that all other data hangs off this nugget of a subquery: you stick the other possible fields onto the right of the core set of fields.
If DISTINCTs are a last resort, then they are typically reserved for cases where the data is known to have duplicate entries in that table for that set of fields, and it's perfectly normal. In my head though a distinct is a slow, post-select process in the plan, especially when it's a large result set being returned. I ought to verify that one of these days.
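One way to read the "partition select" idea is a ROW_NUMBER() window that keeps only the most recent row per person. The sketch below uses Python's built-in sqlite3 module and assumes an SQLite build with window-function support (3.25 or later); the address table, the valid_from column, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (person_id INTEGER, city TEXT, valid_from TEXT);
    INSERT INTO address VALUES
        (1, 'Leeds', '2019-01-01'),
        (1, 'York',  '2021-06-01'),
        (1, 'Leeds', '2023-03-01'),  -- moved back to an earlier address
        (2, 'Hull',  '2020-05-01');
""")

latest = conn.execute("""
    SELECT person_id, city
    FROM (
        SELECT person_id, city,
               ROW_NUMBER() OVER (
                   PARTITION BY person_id
                   ORDER BY valid_from DESC
               ) AS rn
        FROM address
    ) AS ranked
    WHERE rn = 1
    ORDER BY person_id
""").fetchall()

print(latest)  # [(1, 'Leeds'), (2, 'Hull')]
```

Unlike DISTINCT, this keeps exactly one row per person even if someone later adds more address columns to the select list.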
Provided your queries are correct, DISTINCT and GROUP BY provide the same result set, but your colleagues are correct in stating that DISTINCT hides problems. If you are missing a join and using a GROUP BY, you'll get back more information than you're expecting. If you are missing a join and using DISTINCT the SQL engine will perform an unbounded (or partially bounded) join, narrow the results down, and then come up with the expected answer.
Beyond the obvious performance degradation of generating more data than is necessary, you also run the risk of filling your tempdb (i.e.: running out of room on the hard drive where your tempdb lives).
Use GROUP BY in production.

Do CTEs improve performance?

with ini as
(
select ...
)
select ini.a
join ini.b
join ini.c
How many times does the SQL Server engine calculate the results from the ini table ?
My question which I'm trying to answer (with your help) is if the with statement (CTE) improves performance by aliasing the results.
The CTE ini is simply a macro that expands and this use is syntax/clarity only.
MSDN says:
Using a CTE offers the advantages of improved readability and ease in maintenance of complex queries
Nothing about performance.
It is evaluated per mention: so three times here which you can see from an execution plan.
For recursive CTEs it's somewhat different, as the CTE builds upon itself, but it will still be evaluated once per mention.
A CTE (common table expression, the part that is wrapped in the "with") is essentially a 1-time view. If you think of it in terms of a temporary view, perhaps the answer will become more clear. As far as I know, the interpreter will simply do the equivalent of copy/pasting whatever is within the CTE into the main query wherever it finds the reference.
I'm sure there are outside instances where it appears to help, but more often than not, I'd assume that the mere presence of a CTE itself is not going to improve the performance of a query. It'll help with readability and re-usability within that single select statement (i.e., you won't have to re-type the same sub-query multiple times), but I don't believe it will magically make things run faster (all things being equal). Of course, if your query is structured differently within the CTE than you would have done w/ sub-queries, then it's quite possible the CTE runs faster at that point, but you're now comparing apples to oranges.
I suppose it would also depend on whether you were using it to replace a derived table or a correlated subquery. Performance would be about the same in the first case and probably significantly better in the second if you joined to the CTE rather than just replaced the subquery code with a reference to the CTE. If you used it to replace a WHERE NOT EXISTS clause with a left join to a CTE (in order to find the records in one table but not the other), I'd expect performance to be worse, as WHERE EXISTS is usually the fastest way to do that type of task. I guess what I'm saying is that performance will still depend on how you use the CTE, not just the fact that you generated one.
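The "macro" or "copy/paste" view of a CTE can at least be checked at the level of results. The sketch below uses Python's built-in sqlite3 module with a made-up table t; it shows only that the CTE form and the pasted-inline form return the same rows. It does not show how many times either form is evaluated, which depends on the engine (SQLite's choices here are not SQL Server's).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, parent_id INTEGER, val INTEGER);
    INSERT INTO t VALUES (1, NULL, 10), (2, 1, 20), (3, 1, 5), (4, 2, 40);
""")

# The CTE is referenced twice, once per alias.
with_cte = conn.execute("""
    WITH ini AS (SELECT id, parent_id, val FROM t WHERE val >= 10)
    SELECT child.id, parent.id
    FROM ini AS child
    JOIN ini AS parent ON child.parent_id = parent.id
    ORDER BY child.id
""").fetchall()

# The same query with the CTE body pasted in at each reference.
pasted_inline = conn.execute("""
    SELECT child.id, parent.id
    FROM (SELECT id, parent_id, val FROM t WHERE val >= 10) AS child
    JOIN (SELECT id, parent_id, val FROM t WHERE val >= 10) AS parent
      ON child.parent_id = parent.id
    ORDER BY child.id
""").fetchall()

print(with_cte)                   # [(2, 1), (4, 2)]
print(with_cte == pasted_inline)  # True
```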

Why do I need to explicitly specify all columns in a SQL "GROUP BY" clause - why not "GROUP BY *"?

This has always bothered me - why does the GROUP BY clause in a SQL statement require that I include all non-aggregate columns? These columns should be included by default - a kind of "GROUP BY *" - since I can't even run the query unless they're all included. Every column has to either be an aggregate or be specified in the "GROUP BY", but it seems like anything not aggregated should be automatically grouped.
Maybe it's part of the ANSI-SQL standard, but even so, I don't understand why. Can somebody help me understand the need for this convention?
It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.
SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.
When you write a query the most important consideration is that it yields correct results. If you made a mistake, it's probably better that the SQL parser informs you, rather than making a guess about your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps how to retrieve it) already makes it easy to inadvertently make mistakes. Introducing fuzziness into the language syntax would not make this better.
In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins - where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time) - the semantics of existing queries changes with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.
One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can result in having to write more code than you may like, and gripe about why certain constructs are so verbose ... but at the end of the day - it is what it is.
The only logical reason I can think of to keep the GROUP BY clause as it is, is that you can include fields in your grouping that are NOT included in your selected columns.
For example.
Select column1, SUM(column2) AS sum
FROM table1
GROUP BY column1, column3
Even though column3 is not represented elsewhere in the query, you can still group the results by its value. (Of course once you have done that, you cannot tell from the result why the records were grouped as they were.)
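A runnable version of this example, using Python's built-in sqlite3 module and invented sample rows, shows the effect. The alias is changed to total here purely to avoid naming a column after the SUM function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column1 TEXT, column2 INTEGER, column3 TEXT);
    INSERT INTO table1 VALUES
        ('a', 1, 'x'), ('a', 2, 'x'), ('a', 4, 'y'), ('b', 8, 'x');
""")

rows = conn.execute("""
    SELECT column1, SUM(column2) AS total
    FROM table1
    GROUP BY column1, column3
    ORDER BY 1, 2
""").fetchall()

print(rows)  # [('a', 3), ('a', 4), ('b', 8)]
```

The value 'a' appears twice because column3 splits it into two groups, and nothing in the output says why.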
It does seem like a simple shortcut for the overwhelmingly most common scenario (grouping by each of the non-aggregate columns) would be a simple yet effective tool for speeding up coding.
Perhaps "GROUP BY *"
Since it is already pretty common in SQL tools to allow references to the columns by result column number (i.e. GROUP BY 1,2,3, etc.), it would seem simpler still to allow the user to automatically include all the non-aggregate fields in one keystroke.
It's as simple as this: when you ask SQL to group the results, the engine internally groups the result set by every column you list before presenting it to you. That is why it asks you to name all the non-aggregated columns as well: it cannot group only partially. Once you add a GROUP BY clause, the only way SQL can achieve your intent is to group by all of those columns too. It's a mathematical restriction.

Does SELECT DISTINCT imply a sort of the results

Does including DISTINCT in a SELECT query imply that the resulting set should be sorted?
I don't think it does, but I'm looking for an authoritative answer (web link).
I've got a query like this:
Select Distinct foo
From Bar
In Oracle, the results are distinct but are not in sorted order. In Jet/MS-Access there seems to be some extra work being done to ensure that the results are sorted. I'm assuming that Oracle is following the spec in this case and MS Access is going beyond.
Also, is there a way I can give the table a hint that it should be sorting on foo (unless otherwise specified)?
From the SQL92 specification:
If DISTINCT is specified, then let TXA be the result of eliminating redundant duplicate values from TX. Otherwise, let TXA be TX.
...
4) If an <order by clause> is not specified, then the ordering of the rows of Q is implementation-dependent.
Ultimately the real answer is that DISTINCT and ORDER BY are two separate parts of the SQL statement; If you don't have an ORDER BY clause, the results by definition will not be specifically ordered.
No. There are a number of circumstances in which a DISTINCT in Oracle does not imply a sort, the most important of which is the hashing algorithm used in 10g+ for both group by and distinct operations.
Always specify ORDER BY if you want an ordered result set, even in 9i and below.
There is no "authoritative" answer link, since this is something that no SQL server guarantees.
You will often see results in order when using distinct as a side effect of the best methods of finding those results. However, any number of other things can mix up the results, and some server may hand back results in such a way as to not give them sorted even if it had to sort to get the results.
Bottom line: if your server doesn't guarantee something you shouldn't count on it.
Not to my knowledge, no. The only reason I can think of is that SQL Server would internally sort the data in order to detect and filter out duplicates, and thus return it in a "pre-sorted" manner. But I wouldn't rely on that "side effect" :-)
No, it is not implying a sort. In my experience, it sorts by the known index, which may happen to be foo.
Why be subtle? Why not specify Select Distinct foo from Bar Order by foo?
On at least one server I've used (probably either Oracle or SQL Server, about six years ago), SELECT DISTINCT was rejected if you didn't have an ORDER BY clause. It was accepted on the "other" server (Oracle or SQL Server). Your mileage may vary.
No, the results are not sorted. If you want to give it a 'hint', you can certainly supply an ORDER BY:
select distinct foo
from bar
order by foo
But keep in mind that you might want to sort on more than just alphabetically. Instead you might want to sort on criteria on other fields. See:
http://weblogs.sqlteam.com/jeffs/archive/2007/12/13/select-distinct-order-by-error.aspx
As the answers mostly say, DISTINCT does not mandate a sort - only ORDER BY mandates that. However, one standard way of achieving DISTINCT results is to sort; the other is to hash the values (which tends to lead to semi-random sequencing). Relying on the sort effect of DISTINCT would be foolish.
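A minimal sketch of the safe pattern, using Python's built-in sqlite3 module with invented rows for the Bar table from the question: treat the DISTINCT result as an unordered set unless an ORDER BY is present.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Bar (foo TEXT);
    INSERT INTO Bar VALUES ('pear'), ('apple'), ('pear'), ('fig'), ('apple');
""")

unordered = conn.execute("SELECT DISTINCT foo FROM Bar").fetchall()
ordered = conn.execute("SELECT DISTINCT foo FROM Bar ORDER BY foo").fetchall()

# Only the set of values is guaranteed for the first query...
print(set(unordered) == {("apple",), ("fig",), ("pear",)})  # True
# ...while the second also guarantees the sequence.
print(ordered)  # [('apple',), ('fig',), ('pear',)]
```

Whatever sequence the first query happens to return on a given engine or version is an implementation detail and can change.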
In my case (SQL server), as an example I had a list of countries with a numerical value X assigned against each. When I did a select distinct * from Table order by X, it ordered it by X but at the same time result set countries were also ordered which was not directly implemented.
From my experience, I'll say that distinct does imply an implicit sort.
Yes. Oracle does use a sort to calculate a distinct. You can see that if you look at the explain plan. The fact that it did a sort for that calculation does not in any way imply that the result set will be sorted. If you want the result set sorted, you are required to use the ORDER BY clause.