Question about execution relating to derived columns in SQL query - sql

This is a more conceptual question about what happens during execution rather than anything wrong with the code. I was thinking about this in relation to the exercise here: https://www.hackerrank.com/challenges/earnings-of-employees/submissions/code/214661572
The solution I provided was:
SELECT * FROM (
SELECT (SALARY * MONTHS), COUNT(*)
FROM EMPLOYEE
GROUP BY (SALARY*MONTHS)
ORDER BY (SALARY*MONTHS) DESC
)
WHERE ROWNUM = 1;
First, looking at the subquery, we see that we are grouping on the derived column salary*months.
However, something that confuses me is that it was explained to me that the order of execution begins with the from statement, and then proceeds to joins, where, group by, etc. clauses.
My problem is that I had imagined the FROM clause as the command that tells SQL which table we are dealing with, so when we select from the EMPLOYEE table we have the columns as specified in the exercise link.
However, now the next step is to group by (SALARY * MONTHS)... which is a derived column. But, the derived column does not exist in the table that we specified in the "FROM" statement. So how does SQL know what to group by if the column isn't provided in the original table?
The order of execution explanation I am looking at is here: https://sqlbolt.com/lesson/select_queries_order_of_execution
Thank you.

The SQLBolt article is not realistic.
The article explains how a naïve database engine would execute a query. It can be helpful as a simplified way to understand the basics of query execution, but an engine that worked that way would be very slow for all but the simplest queries. Modern engines are much smarter than that.
The key concept that you must understand is that SQL is a declarative language, not an imperative one. You tell what you need, the engine decides how to produce it.
As a general rule modern databases process the query using the phases listed below:
1. Caching: Check whether the query was already executed. If found, skip to step #7.
2. Parsing: Translate the SQL statement into an internal representation.
3. Rephrasing: Simplify the internal representation. All dirty tricks are valid here: for example, using the same node for all occurrences of SALARY * MONTHS.
4. Planning/Pruning: Produce all possible execution plan trees, and prune as early as possible. Take existing indexes into account when producing plans.
5. Cost Assessment: Determine the cost of a plan. The cost algorithm must be extremely fast (typically a heuristic will do) and somewhat accurate, since it must be computed for all candidate plans.
6. Optimizing: Select the best plan according to the cost. Update the cache.
7. Executing: Execute the plan tree starting from the root node and walk the tree depth-first. This is not strictly true due to pipelining.
8. Node Pipelining or Materializing: A node can start returning rows to the parent node as soon as possible.
9. Return Result Set: Walk back to the parent nodes until the root node is reached. The root node starts returning data to the client app.
As you see, there's a lot going on behind the scenes. Mind that some engines are much more sophisticated than this (e.g. Oracle, DB2, PostgreSQL) since they have smarter shortcuts and have implemented so many dirty tricks. Yep... anything is valid as long as the returned result is correct.

Two different things are going on here.
The more important one is that what gets executed is a directed acyclic graph of data operations. It has very little to do with the SQL you write. The only guarantee is that it produces the results that you specify.
That is, SQL is a declarative language, not a procedural language. A query describes the result set.
The second thing that is going on is the scoping of identifiers: what does a column reference mean? These are defined in the FROM clause. JOINs have nothing to do with this, because they are just operators (like + or || except on tables) in the FROM clause.
Then, the references can be used in the WHERE, GROUP BY, SELECT, and other clauses. Column aliases defined in the SELECT can really only be used in the ORDER BY clause. Some databases also allow them in the HAVING and GROUP BY clauses as well but not Oracle.
As for your specific question, there is no requirement in SQL that the GROUP BY keys be present in the SELECT. Usually they are, but that is not a requirement. In fact, there are cases, such as dates displayed by name, where a grouping key is deliberately left out of the SELECT. For instance, one could write:
select to_char(datecol, 'MON'), count(*)
from t
group by to_char(datecol, 'MON'), extract(month from datecol)
order by extract(month from datecol);
The month number is functionally equivalent to the month name. To have it available for sorting, you can include it as a GROUP BY key.
Some people confuse the scoping rules with the order of execution. That is due to a misunderstanding of how SQL engines actually work.
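To make the scoping point concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine (the original question is about Oracle, so the LIMIT syntax and the sample rows are assumptions for illustration only). It shows that a GROUP BY key can be an expression over columns brought into scope by FROM, and that the key need not appear in the SELECT list at all.

```python
import sqlite3

# In-memory SQLite database with made-up sample rows (not from the exercise).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER, months INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ann", 1000, 12), ("Bob", 2000, 6), ("Cid", 500, 10), ("Dee", 3000, 4)],
)

# The grouping key is an expression over columns that FROM put in scope.
# No column named "salary * months" has to exist in the table.
top_earnings = conn.execute(
    """
    SELECT salary * months, COUNT(*)
    FROM employee
    GROUP BY salary * months
    ORDER BY salary * months DESC
    LIMIT 1
    """
).fetchall()

# The grouping key does not have to appear in the SELECT list either.
group_sizes = conn.execute(
    """
    SELECT COUNT(*)
    FROM employee
    GROUP BY salary * months
    ORDER BY COUNT(*) DESC
    """
).fetchall()

print(top_earnings)  # [(12000, 3)]
print(group_sizes)   # [(3,), (1,)]
```

Three of the four sample employees earn 12000 in total, so the top group has a count of 3 and the remaining group has a count of 1.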

This is just my opinion, but I find it fairly intuitive.
You have the query with its subquery:
SELECT * FROM
-- init_subquery
(SELECT (SALARY * MONTHS), COUNT(*) FROM EMPLOYEE
GROUP BY (SALARY*MONTHS)
ORDER BY (SALARY*MONTHS) DESC)
-- finish_subquery
WHERE ROWNUM = 1;
First, it performs the arithmetic operation and produces the result of SALARY*MONTHS. To Oracle that result is simply a column: it only needs to know that it has some values in one column...
Like: SELECT (SALARY * MONTHS) FROM EMPLOYEE
After that, it applies the GROUP BY clause to do the COUNT, because the COUNT operation depends on whether there is a GROUP BY clause.
And finally it orders the result.
I do not want to take any points from your article. It is just a good question; I am preparing for the 1Z0_071 exam, and it is interesting to me to discuss.

Related

SQL Execution Order: does it exist or not?

I am really confused about the execution order in SQL. Basically, given any query (assume it's a complex one that has multiple JOINS, WHERE clauses, etc), is the query executed sequentially or not?
From the top answer at Order Of Execution of the SQL query, it seems like "SQL has no order of execution. ... The optimizer is free to choose any order it feels appropriate to produce the best execution time."
From the top answer at What's the execute order of the different parts of a SQL select statement?, in contrast, we see a clear execution order in the form
"
FROM
ON
OUTER
WHERE
...
"
I feel like I am missing something, but it seems as though the two posts are contradicting each other, and different articles online seem to support either one or the other.
But more fundamentally, what I wanted to know initially is this: Suppose we have a complex SQL query with multiple joins, INNER JOINs and LEFT JOINS, in a specific order. Is there going to be an order to the query, such that a later JOIN will apply to the result of an earlier join rather than to the initial table specified in the top FROM clause?
It is tricky. The short answer is: the DBMS will decide what order is best such that it produces the result that you have declared (remember, SQL is declarative, it does not prescribe how the query is to be computed).
But we can think of a "conceptual" order of execution that the DBMS will use to create the result. This conceptual order might be totally ignored by the DBMS, but if we (humans) follow it, we will get the same results as the DBMS. I see this as one of the benefits of a DBMS. Even if we suck and write an inefficient query, the DBMS might say, "no, no, this query you gave me sucks in terms of performance, I know how to do better" and most of the time, the DBMS is right. Sometimes it is not, and rewriting a query helps the DBMS find the "best" approach. This is very dependent on the DBMS, of course...
This conceptual order helps us (humans) understand how the DBMS executes a query. The steps are listed below.
First the order for non-aggregation:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the SELECT portion (report results, this is called projection).
If you use an aggregation function, without a group by then:
Do the FROM section. Includes any joins, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the aggregation function in the SELECT portion (converting all tuples of the result into one tuple). There is an implicit group by in this query.
If you use a group by:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Cluster subsets of the tuples according to the GROUP BY.
For each cluster of these tuples:
if there is a HAVING, apply this predicate (similar to the selection of the WHERE). Note that you have access to aggregation functions here.
For each cluster of these tuples output exactly one tuple such that:
Do the SELECT part of the query (similar to select in above aggregation, i.e. you can use aggregation functions).
Window functions happen during the SELECT stage (they take into consideration the set of tuples that would be output by the select at that stage).
There is one more kink:
if you have
select distinct ...
then after everything else is done, then remove DUPLICATED tuples from the results (i.e. return a set of tuples, not a list).
Finally, do the ORDER BY. The ORDER BY happens in all cases at the end, once the SELECT part has been done.
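The conceptual order for a GROUP BY query can be walked through by hand. The sketch below is only an illustration under assumed data (the staff table, its columns, and the thresholds are hypothetical): it applies FROM, WHERE, GROUP BY, HAVING, SELECT, and ORDER BY step by step in plain Python, then checks that SQLite, used here as a stand-in DBMS, returns the same rows for the equivalent query, whatever order it chose internally.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data.
rows = [
    {"dept": "a", "pay": 10},
    {"dept": "a", "pay": 30},
    {"dept": "b", "pay": 5},
    {"dept": "b", "pay": 50},
    {"dept": "c", "pay": 40},
    {"dept": "c", "pay": 45},
]

# 1. FROM: start with all tuples of the relation.
tuples = list(rows)
# 2. WHERE pay >= 10: selection, removes tuples.
tuples = [t for t in tuples if t["pay"] >= 10]
# 3. GROUP BY dept: cluster the remaining tuples.
clusters = defaultdict(list)
for t in tuples:
    clusters[t["dept"]].append(t)
# 4. HAVING COUNT(*) >= 2: a predicate on each cluster.
kept = {dept: ts for dept, ts in clusters.items() if len(ts) >= 2}
# 5. SELECT dept, SUM(pay): exactly one output tuple per cluster.
projected = [(dept, sum(t["pay"] for t in ts)) for dept, ts in kept.items()]
# 6. ORDER BY total DESC: always last.
by_hand = sorted(projected, key=lambda r: r[1], reverse=True)

# The same query handed to an engine, which may evaluate it any way it likes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (dept TEXT, pay INTEGER)")
conn.executemany("INSERT INTO staff VALUES (:dept, :pay)", rows)
by_engine = conn.execute(
    """
    SELECT dept, SUM(pay) AS total
    FROM staff
    WHERE pay >= 10
    GROUP BY dept
    HAVING COUNT(*) >= 2
    ORDER BY total DESC
    """
).fetchall()

print(by_hand)    # [('c', 85), ('a', 40)]
print(by_engine)  # [('c', 85), ('a', 40)]
```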
With respect to JOINS. As I mentioned above, they happen at the "FROM" part of the conceptual execution. The WHERE, GROUP BY, SELECT apply on the results of these operations. So you can think of these being the first phase of the execution of the query. If it contains a subquery, the process is recursive.
By the way, you can refer in an inner query to a relation in the outside context of the inner query, but not the other way around.
All of this is conceptual. In reality the DBMS might rewrite your query for the purpose of efficiency.
For example, assume R(a,b) and S(a,c), where S(a) is a foreign key that references R(a).
The query:
select b from R JOIN S using (a) where a > 10
can be rewritten by the DBMS to something similar to this:
select b FROM R JOIN (select a from s where a > 10) as T using (a);
or:
select b FROM (select * from R where a > 10) as T JOIN S using (a);
In fact, the DBMS does this all the time. It takes your query and creates alternative queries, then estimates the execution time of each one and decides which is most likely to be the fastest. And then it executes it.
This is a fundamental process of query evaluation. Note that the 3 queries are identical in terms of results. But depending on the sizes of the relations, they might have very different execution times. For example, if R and S are huge, but very few tuples have a > 10, then joining first and filtering afterwards wastes time. Each query with a subselect might perform fast if that subselect matches very few tuples, but badly if it matches a lot of tuples. This is the type of "magic" that happens inside the query evaluation engine of the DBMS.
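The equivalence of the three formulations can be checked directly. This is a small sketch using SQLite through Python's sqlite3 module; the rows inserted into R and S are made up for the example, and the results are sorted before comparison because none of the queries has an ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE R (a INTEGER PRIMARY KEY, b TEXT);
    CREATE TABLE S (a INTEGER REFERENCES R(a), c TEXT);
    INSERT INTO R VALUES (5, 'five'), (10, 'ten'), (15, 'fifteen'), (20, 'twenty');
    INSERT INTO S VALUES (5, 'x'), (15, 'y'), (15, 'z'), (20, 'w');
    """
)

# The original query and the two rewrites from the text.
queries = [
    "select b from R JOIN S using (a) where a > 10",
    "select b FROM R JOIN (select a from S where a > 10) as T using (a)",
    "select b FROM (select * from R where a > 10) as T JOIN S using (a)",
]

# Row order is unspecified without ORDER BY, so compare sorted results.
results = [sorted(conn.execute(q).fetchall()) for q in queries]

for q, r in zip(queries, results):
    print(q, "->", r)
```

All three print the same rows; only the amount of work the engine might do to get there differs.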
You are confusing Order of execution with Logical Query Processing.
I did a quick google search and found a bunch of articles referring to Logical Query Processing as "order of execution". Let's clear this up.
Logical Query Processing
Logical Query Processing details the under-the-hood processing phases of a SQL Query... First the FROM clause is evaluated for the optimizer to know where to get data from, then table operators, etc.
Understanding this will help you better design and tune queries. Logical query processing order will help you understand why you can reference a column by its alias in an ORDER BY clause but not anywhere else.
Order of Execution
Consider this WHERE clause:
WHERE t1.Col1 = 'X'
AND t2.Col2 = 1
AND t3.Col3 > t2.Col4
The optimizer is not required to evaluate these predicates in any particular order; it can evaluate t2.Col2 = 1 first, then t1.Col1 = 'X'.... In some cases the optimizer can also evaluate joins in a different order than you have presented in your query. When predicate logic dictates that the result will be the same, it is free to make (what it considers) the best choices for optimal performance.
Sadly there is not a lot about this topic out there. I do discuss this a little more here.
First there's the SQL query and the rules of SQL that apply to it. That's what in the other answers is referred to as "Logical query processing". With SQL you specify a result. The SQL standard does not allow you to specify how this result is reached.
Then there's the query optimizer. Based on statistics, heuristics, amount of available CPU, memory and other factors, it will determine the execution plan. It will evaluate how long the execution is expected to take. It will evaluate different execution plans to find the one that executes fastest. In that process, it can evaluate execution plans that use different indexes, and/or rearranges the join order, and/or leave out (outer) joins, etc. The optimizer has many tricks. The more expensive the best execution plan is expected to be, the more (advanced) execution plans will be evaluated. The end result is one (serial) execution plan and potentially a parallel execution plan.
All the evaluated execution plans will guarantee the correct result; the result that matches execution according to the "Logical query processing".
Finally, there's the SQL Server engine. After picking either the serial or parallel execution plan, it will execute it.
The other answers, whilst containing useful and interesting information, risk causing confusion in my view.
They all seem to introduce the notion of a "logical" order of execution, which differs from the actual order of execution, as if this is something special about SQL.
If someone asked about the order of execution of any ordinary language besides SQL, the answer would be "strictly sequential" or (for expressions) "in accordance with the rules of that language". I feel as though we wouldn't be having a long-winded exploration about how the compiler has total freedom to rearrange and rework any algorithm that the programmer writes, and distinguishing this from the merely "logical" representation in the source code.
Ultimately, SQL has a defined order of evaluation. It is the "logical" order referred to in other answers. What is most confusing to novices is that this order does not correspond with the syntactic order of the clauses in an SQL statement.
That is, a simple SELECT...FROM...WHERE...ORDER BY query would actually be evaluated by taking the table referred to in the from-clause, filtering rows according to the where-clause, then manipulating the columns (including filtering, renaming, or generating columns) according to the select-clause, and finally ordering the rows according to the order-by-clause. So clauses here are evaluated second, third, first, fourth, which is a disorderly pattern to any sensible programmer - the designers of SQL preferred to make it correspond more in their view to the structure of something spoken in ordinary English ("tell me the surnames from the register!").
Nevertheless, when the programmer writes SQL, they are specifying the canonical method by which the results are produced, same as if they write source code in any other language.
The query simplification and optimisation that database engines perform (like that which ordinary compilers perform) would be a completely separate topic of discussion, if it hadn't already been conflated. The essence of the situation on this front, is that the database engine can do whatever it damn well likes with the SQL you submit, provided that the data it returns to you is the same as if it had followed the evaluation order defined in SQL.
For example, it could sort the results first, and then filter them, despite this order of operations being clearly different to the order in which the relevant clauses are evaluated in SQL. It can do this because if you (say) have a deck of cards in random order, and go through the deck and throw away all the aces, and then sort the deck into standard order, the outcome (in terms of the final content and order of the deck) is no different than if you sort the deck into standard order first, and then go through and throw away all the aces. But the full details and rationale of this behaviour would be for a separate question entirely.
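The card analogy is easy to verify. A minimal sketch in plain Python (the deck representation and the fixed shuffle seed are choices made for this example): filtering then sorting and sorting then filtering give the same final deck, which is why an engine may pick either order.

```python
import random

suits = ["clubs", "diamonds", "hearts", "spades"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]

# A full deck in random order (seeded so the run is repeatable).
deck = [(suit, rank) for suit in suits for rank in ranks]
random.Random(0).shuffle(deck)


def standard_order(card):
    """Sort key: by suit, then by rank, in the order listed above."""
    suit, rank = card
    return (suits.index(suit), ranks.index(rank))


# "WHERE rank <> 'A'" first, then "ORDER BY".
filter_then_sort = sorted((c for c in deck if c[1] != "A"), key=standard_order)

# "ORDER BY" first, then "WHERE rank <> 'A'".
sort_then_filter = [c for c in sorted(deck, key=standard_order) if c[1] != "A"]

print(filter_then_sort == sort_then_filter)  # True
print(len(filter_then_sort))                 # 48
```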

SQL "Order of execution" vs "Order of writing"

I am a new learner of SQL language to add knowledge to my career, I came to learn that in writing a query, there is a "Order of writing" vs "Order of execution", however I can't seem to find a full list of available SQL functions listing out the hierarchy
So far from what I learn I got this table, can someone with better knowledge help confirm if my table below is correct? And perhaps add any other functions that I might have missed, I am not sure where I should put the JOIN in the table below
Also, is there a difference (either in order or name of function) if I am using different Sql platforms?
MySql vs BigQuery for eg.
Your help is deeply appreciated, big thanks in advance for reading this post by a beginner
Order of writing    Order of execution
Select              From
Top                 Where
Distinct            Group by
From                Having
Where               Select
Group by            Window
Having              QUALIFY
Order by            Distinct
Second              Order by
QUALIFY             Top
Limit               Limit
SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.
The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.
You are confusing execution with compilation and probably you just care about scoping rules.
So, to start with, SQL has a set of clauses, and these are written in a very specific order. Your question contains this ordering -- at least for a database that supports those clauses.
The second part is the ordering for identifying identifiers. Basically, this comes down to:
Table aliases are defined in the FROM clause. So this can be considered as "first" for scoping purposes.
Column aliases are defined in the SELECT clause. By the SQL Standard, column aliases can be used in the ORDER BY. Many databases extend this to the QUALIFY (if supported), HAVING, and GROUP BY clauses. In general, databases do not support them in the WHERE clause.
If two tables in the FROM have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in a JOIN and the USING clause is used. Then the unqualified column name is fine.
If a column alias defined in the SELECT conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.
The whole point of SQL is that it is a 'whole set' language and there is no particular set order to much of it. Today's DBMS evaluates each Select query as a whole to determine the best, most efficient way to assemble the data set results, in much the same way that Google Maps might determine the best path to get you home based both on where you are and ambient traffic.
Databases will provide, under their Explain Plan command, exactly the sequence they will use to process your query. This is called the Execution Plan. Each of these steps is performed on entire table sets and, where possible, in parallel processes. The steps in each plan do not have any of the names you listed above; instead a step might say "perform an index scan on table A", or "perform a nested loops join on the prior partial result set and table B". In some cases they will filter records before joining and in other cases they won't, for example.
Within those parameters there are some tasks that always come before others. For example, all Where clause filtering takes place before aggregation and summary filtering (Having clause). But there are few absolute rules here.
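As a small illustration of what a plan looks like, the sketch below asks SQLite for its plan through Python's sqlite3 module. EXPLAIN QUERY PLAN is SQLite's counterpart to the Explain Plan command mentioned above; the tables and index are hypothetical, and the exact wording of the plan steps depends on the SQLite version, so no specific output is shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, val TEXT);
    CREATE INDEX b_a_id ON b (a_id);
    """
)

# Ask the engine how it intends to run the query instead of running it.
plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT a.val, b.val
    FROM a JOIN b ON b.a_id = a.id
    WHERE a.val = 'x'
    """
).fetchall()

# The last column of each plan row is a human-readable description of a step
# (scans and index searches), not a SQL clause name.
for step in plan:
    print(step[-1])
```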
When writing SQL, I found that the execution order of the select statement is not the same as the order of writing.
The order in which SQL query statements are written is
SELECT
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
But in fact the order of execution of the SQL statement is
FROM
WHERE
GROUP BY
HAVING
SELECT
UNION
ORDER BY
SQL first determines which tables the data comes from, including how they are combined (the JOIN type and its ON conditions).
It then applies the filter conditions, that is, the WHERE clause.
Then it groups the rows according to GROUP BY and applies the HAVING clause.
The SELECT clause is evaluated after most of the other clauses, so everything evaluated before it affects what it sees. This is especially important to keep in mind in practice.
From this order we can see that ORDER BY is evaluated last, which is why we can sort by the new column names (aliases) defined in SELECT.
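The last point, that ORDER BY can refer to a column alias defined in SELECT, can be tried out quickly. A minimal sketch with SQLite via Python's sqlite3 module; the employee rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER, months INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ann", 1000, 12), ("Bob", 2000, 3), ("Cid", 500, 10)],
)

# "earnings" exists only as an alias in the SELECT list, yet ORDER BY can
# use it because ORDER BY is logically evaluated after SELECT.
ordered = conn.execute(
    """
    SELECT name, salary * months AS earnings
    FROM employee
    ORDER BY earnings DESC
    """
).fetchall()

print(ordered)  # [('Ann', 12000), ('Bob', 6000), ('Cid', 5000)]
```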

Order of execution Oracle Select clause

Consider a query with this structure:
select ..., ROWNUM
from t
where <where clause>
group by <columns>
having <having clause>
order by <columns>;
As per my understanding, the order of processing is
The FROM/WHERE clause goes first.
ROWNUM is assigned and incremented to each output row from the FROM/WHERE clause.
GROUP BY is applied.
HAVING is applied.
ORDER BY is applied.
SELECT is applied.
I can't understand why this article in Oracle Magazine by Tom specifies:
Think of it as being processed in this order:
The FROM/WHERE clause goes first.
ROWNUM is assigned and incremented to each output row from the FROM/WHERE clause.
SELECT is applied.
GROUP BY is applied.
HAVING is applied.
ORDER BY is applied.
Can anyone explain this order?
There is not a direct relationship between the clauses in a SQL statement and the processing order. SQL queries are processed in two phases. In the first phase, the code is compiled and optimized. The optimized version is run.
For parsing purposes, the query is evaluated in a particular order. For instance, FROM is parsed first, then WHERE, then GROUP BY, and so on. This explains why a column alias defined in the SELECT is not available in the FROM.
Your description, however, is incorrect with regards to ROWNUM. ROWNUM is a special construct in Oracle . . . as explained in the documentation. It is processed before the ORDER BY.
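The practical consequence of ROWNUM being assigned before ORDER BY can be simulated without a database. The following is a simplified model in plain Python, not Oracle's actual implementation: the row list stands for whatever order the FROM/WHERE step happens to produce, and all names and values are hypothetical.

```python
# Rows in the order the FROM/WHERE step happens to produce them.
employees = [("Ann", 1200), ("Bob", 3000), ("Cid", 2500)]


def by_pay_desc(row):
    """Sort key for ORDER BY pay DESC; pay is the last field of a row."""
    return -row[-1]


# Same query block:  SELECT ... WHERE ROWNUM = 1 ORDER BY pay DESC
# ROWNUM is assigned as rows come out of FROM/WHERE, the ROWNUM filter is
# part of that step, and only then is the (single) surviving row sorted.
numbered = [(n, name, pay) for n, (name, pay) in enumerate(employees, start=1)]
same_block = sorted((r for r in numbered if r[0] == 1), key=by_pay_desc)

# Sorted subquery, then ROWNUM = 1 in the outer query:
# the inner ORDER BY runs first, so numbering happens over sorted rows.
inner_sorted = sorted(employees, key=by_pay_desc)
outer_block = [
    (n, name, pay)
    for n, (name, pay) in enumerate(inner_sorted, start=1)
    if n == 1
]

print(same_block)   # [(1, 'Ann', 1200)]  an arbitrary row, not the top earner
print(outer_block)  # [(1, 'Bob', 3000)]  the top earner
```

This is why queries that want "the first row after sorting" in Oracle put the ORDER BY in a subquery and the ROWNUM filter in the outer query.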
I think that it is not only difficult to identify an execution order for a SQL statement, it is actually harmful to your understanding of SQL to attempt to do so.
SQL is a declarative language, in which you define the result that you want, not the way in which that result is to be achieved (although it is possible to strongly affect that way). I have had many experiences of being asked, "So how does this SQL get executed?" by developers more familiar with conventional languages, and the truth is that the SQL doesn't tell you that at all, except for very simplistic cases. As soon as the case is non-simplistic, you cannot afford to be thinking about SQL in the "wrong way".
It is possibly analogous to the difference between object-oriented and non-object oriented languages, or between functional programming and procedural programming -- there is a necessarily different way of thinking involved.
In SQL, the emphasis should be on understanding the syntax and how it defines the result set, and then on understanding the way that the SQL is processed by the database in the context of the schema to which it refers.
I would focus on reading the Oracle Concepts Guide on the subject, which explains that a query submitted to the system goes through various phases of (and this is a simplistic overview):
Parsing
Transformation
Estimation
Plan generation
Execution
It's important to realise that the SQL that is executed may not actually be the SQL that you submitted, but that you can use various developer tools to get deep insight into just about all of these phases.
It's a very different world!

Faster querying with temp table creation (SQL SERVER) [duplicate]

I am re-iterating the question asked by Mongus Pong Why would using a temp table be faster than a nested query? which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple queryplan setting to make the engine just spool each subquery in turn, working from the inside out. No second guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that regarding point 1 see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re evaluated.
Even if that works however there are no statistics on the spool.
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and could potentially coax it into achieving that result in some instances. This hint will sometimes result in a more efficient plan for a complex query when the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
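For readers who want to see the materialize-it-yourself pattern end to end, here is a rough sketch. It uses SQLite through Python's sqlite3 module purely as a stand-in: SQL Server would use a #temp table (for example via SELECT ... INTO), and the table names and rows below are assumptions for the example. The sketch only shows that the nested and the materialized forms return the same rows; it says nothing about SQL Server's plans or timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE sales (store_id INTEGER, title_id INTEGER, qty INTEGER);
    INSERT INTO titles VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO sales VALUES (10, 1, 3), (10, 2, 5), (11, 1, 7), (10, 1, 4);
    """
)

# Nested form: the engine decides how and when to evaluate the subquery.
nested = conn.execute(
    """
    SELECT t.title, s.total
    FROM (SELECT title_id, SUM(qty) AS total
          FROM sales
          WHERE store_id = 10
          GROUP BY title_id) AS s
    JOIN titles t ON t.title_id = s.title_id
    ORDER BY t.title
    """
).fetchall()

# Materialized form: the intermediate result is computed once and stored,
# then joined as an ordinary table.
conn.execute(
    """
    CREATE TEMP TABLE store_totals AS
    SELECT title_id, SUM(qty) AS total
    FROM sales
    WHERE store_id = 10
    GROUP BY title_id
    """
)
materialized = conn.execute(
    """
    SELECT t.title, s.total
    FROM store_totals s
    JOIN titles t ON t.title_id = s.title_id
    ORDER BY t.title
    """
).fetchall()

print(nested)        # [('Alpha', 7), ('Beta', 5)]
print(materialized)  # [('Alpha', 7), ('Beta', 5)]
```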
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force the SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
qty smallint NOT NULL) AS
BEGIN
-- Fill the returned table variable; this forces the rows to be materialized
INSERT @t (title, qty)
SELECT t.title, s.qty
FROM sales s
JOIN titles t ON t.title_id = s.title_id
WHERE s.stor_id = @storeid
RETURN
END
GO
CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in incorrect order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to get executed first by limiting the results using TOP, however this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose since SQL Server just ignores it.

Does order by in view guarantee order of select?

I have a view for which it only makes sense to use a certain ordering. What I would like to do is to include the ORDER BY clause in the view, so that all SELECTs on that view can omit it. However, I am concerned that the ordering may not necessarily carry over to the SELECT, because it didn't specify the order.
Does there exist a case where an ordering specified by a view would not be reflected in the results of a select on that view (other than an ORDER BY clause in the select)?
You can't count on the order of rows in any query that doesn't have an explicit ORDER BY clause. If you query an ordered view, but you don't include an ORDER BY clause, be pleasantly surprised if they're in the right order, and don't expect it to happen again.
That's because the query optimizer is free to access rows in different ways depending on the query, table statistics, row counts, indexes, and so on. If it knows your query doesn't have an ORDER BY clause, it's free to ignore row order in order (cough) to return rows more quickly.
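A short sketch of the safe pattern, using SQLite through Python's sqlite3 module as a stand-in engine (the people table and its rows are hypothetical): the view carries its own ORDER BY, but the query that needs ordered output states the order again, so the result no longer depends on how the engine treats the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (forename TEXT, surname TEXT);
    INSERT INTO people VALUES
        ('Fred', 'Young'), ('Fred', 'Adams'), ('Gina', 'Brown'), ('Fred', 'Miller');
    CREATE VIEW v AS SELECT * FROM people ORDER BY surname;
    """
)

# The explicit ORDER BY on the outer query is the only thing that
# guarantees the order of the rows returned here.
rows = conn.execute(
    "SELECT surname FROM v WHERE forename = 'Fred' ORDER BY surname"
).fetchall()

print(rows)  # [('Adams',), ('Miller',), ('Young',)]
```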
Slightly off-topic . . .
Sort order isn't necessarily identical across platforms even for well-known collations. I understand that sorting UTF-8 on Mac OS X is particularly odd. (PostgreSQL developers call it broken.) PostgreSQL relies on strcoll(), which I understand relies on the OS locales.
It's not clear to me how PostgreSQL 9.1 will handle this. In 9.1, you can have multiple indexes, each with a different collation. An ORDER BY that doesn't specify a collation will usually use the collation of the underlying base table's columns, but what will the optimizer do with an index that specifies a different collation than an unindexed column in the base table?
Couldn't see how to reply further up. Just adding my reply here.
You can rely on the ordering in every case where you could rely on it if you manually wrote the query.
That's because PostgreSQL rewrites your query merging in the view.
CREATE VIEW v AS SELECT * FROM people ORDER BY surname;
-- next two are identical
SELECT * FROM v WHERE forename='Fred';
SELECT * FROM people WHERE forename='Fred' ORDER BY surname;
However, if you use the view as a sub-query then the sorting might not remain, just as the output order from a sub-query is never maintained.
So - am I saying to rely on this? No, probably better all round to specify your desired sort order in the application. You'll need to do it for every other query anyway. If it's a utility view for DBA use, that's a different matter though - I have plenty of utility views that provide sorted output.
While observations have so far been true for the following, this answer is not definitive by any means. @Catcall and I both could not find anything definitive in the documentation, and I have to admit I'm too lazy to wade through and make sense of the source code.
But for observation's sake, consider the following:
SELECT * FROM (select * from foo order by bar) foobar;
The query should return ordered.
SELECT * FROM vw_foo; -- where vw_foo is the sub-select above
The query should return ordered.
SELECT * FROM vw_foo LEFT JOIN (select * from bar) bar ON vw_foo.id = bar.id;
The query should use its own discretion and may return unordered.
Disclaimer:
Much like @Catcall said, you should never truly depend on any implicit sorting, as many times it will be left up to the database engine. Databases are designed for quickness and reliability; they often interface with memory and try to pull/push data as quickly as possible. However, the ordering isn't solely based on memory management; there are several factors involved.
Unless you have something specific in mind, you should do your sorting at the end (on the outer query).
If the above observation were true, something like the following should always return the results in the correct order:
SELECT *
FROM (select trunc(random()*999999+1) as i
from generate_series(1,1000000)
order by i
) foo;
The simple process would be: perform preprocessing and perform query identification (identify that an order exists), start loop, fetch first field (generate random number), add to output stack in sorted order. The ordering may also occur at the end of the stack generation, instead of during (eg compile the list and then do the sorting). This depends on versioning and the query.