Order of execution Oracle Select clause - sql

Consider a query with this structure:
select ..., ROWNUM
from t
where <where clause>
group by <columns>
having <having clause>
order by <columns>;
As per my understanding, the order of processing is
The FROM/WHERE clause goes first.
ROWNUM is assigned and incremented to each output row from the FROM/WHERE clause.
GROUP BY is applied.
HAVING is applied.
ORDER BY is applied.
SELECT is applied.
I cant understand why this article in Oracle magazine by TOM specifies:
Think of it as being processed in this order:
The FROM/WHERE clause goes first.
ROWNUM is assigned and incremented to each output row from the FROM/WHERE clause.
SELECT is applied.
GROUP BY is applied.
HAVING is applied.
ORDER BY is applied.
Can anyone explain this order?

There is not a direct relationship between the clauses in a SQL statement and the processing order. SQL queries are processed in two phases. In the first phase, the code is compiled and optimized. The optimized version is run.
For parsing purposes, the query is evaluated in a particular order. For instance, FROM is parsed first, then WHERE, then GROUP BY, and so on. This explains why a column alias defined in the SELECT is not available in the FROM.
Your description, however, is incorrect with regards to ROWNUM. ROWNUM is a special construct in Oracle . . . as explained in the documentation. It is processed before the ORDER BY.

I think that it is not only difficult to identify an execution order for a SQL statement, it is actually harmful to your understanding of SQL to attempt to do so.
SQL is a declarative language, in which you define the result that you want, not the way in which that result is to be achieved (although it is possible to strongly affect that way). I have had many experiences of being asked, "So how does this SQL get executed?" by developers more familiar with conventional languages, and the truth is that the SQL doesn't tell you that at all, expect for very simplistic cases. As soon as the case is non-simplistic, you cannot afford to be thinking about SQL in the "wrong way".
It is possibly analogous to the difference between object-oriented and non-object oriented languages, or between functional programming and procedural programming -- there is a necessarily different way of thinking involved.
In SQL, the emphasis should be on understanding the syntax and how it defines the result set, and then on understanding the way that the SQL is processed by the database in the context of the schema to which it refers.
I would focus on reading the Oracle Concepts Guide on the subject, which explains that a query submitted to the system goes through various phases of (and this is a simplistic overview):
Parsing
Transformation
Estimation
Plan generation
Execution
It's important to realise that the SQL that is executed may not actually be the SQL that you submitted, but that you can use various developer tools to get deep insight into just about all of these phases.
It's a very different world!

Related

SQL Execution Order: does it exist or not?

I am really confused about the execution order in SQL. Basically, given any query (assume it's a complex one that has multiple JOINS, WHERE clauses, etc), is the query executed sequentially or not?
From the top answer at Order Of Execution of the SQL query, it seems like "SQL has no order of execution. ... The optimizer is free to choose any order it feels appropriate to produce the best execution time."
From the top answer at What's the execute order of the different parts of a SQL select statement?, in contrast, we see a clear execution order in the form
"
FROM
ON
OUTER
WHERE
...
"
I feel like I am missing something, but it seems as though the two posts are contradicting each other, and different articles online seem to support either one or the other.
But more fundamentally, what I wanted to know initially is this: Suppose we have a complex SQL query with multiple joins, INNER JOINs and LEFT JOINS, in a specific order. Is there going to be an order to the query, such that a later JOIN will apply to the result of an earlier join rather than to the initial table specified in the top FROM clause?
It is tricky. The short answer is: the DBMS will decide what order is best such that it produces the result that you have declared (remember, SQL is declarative, it does not prescribe how the query is to be computed).
But we can think of a "conceptual" order of execution that the DBMS will use to create the result. This conceptual order might be totally ignored by the DBMS, but if we (humans) follow it, we will get the same results as the DBMS. I see this as one of the benefits of a DBMS. Even if we suck and write an inefficient query, the DBMS might say, "no, no, this query you gave me sucks in terms of performance, I know how to do better" and most of the time, the DBMS is right. Sometimes it is not, and rewriting a query helps the DBMS find the "best" approach. This is very dependent of the DBMS of course...
This conceptual order help us we (humans) to understand how the DBMS executes a query. These are listed below.
First the order for non-aggregation:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the SELECT portion (report results, this is called projection).
If you use an aggregation function, without a group by then:
Do the FROM section. Includes any joins, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the aggregation function in the SELECT portion (converting all tuples of the result into one tuple). There is an implicit group by in this query.
If you use a group by:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Cluster subsets of the tuples according to the GROUP BY.
For each cluster of these tuples:
if there is a HAVING, do this predicate (similar to selection of the WHERE).Note that you can have access to aggregation functions.
For each cluster of these tuples output exactly one tuple such that:
Do the SELECT part of the query (similar to select in above aggregation, i.e. you can use aggregation functions).
Window functions happen during the SELECT stage (they take into consideration the set of tuples that would be output by the select at that stage).
There is one more kink:
if you have
select distinct ...
then after everything else is done, then remove DUPLICATED tuples from the results (i.e. return a set of tuples, not a list).
Finally, do the ORDER BY. The ORDER BY happens in all cases at the end, once the SELECT part has been done.
With respect to JOINS. As I mentioned above, they happen at the "FROM" part of the conceptual execution. The WHERE, GROUP BY, SELECT apply on the results of these operations. So you can think of these being the first phase of the execution of the query. If it contains a subquery, the process is recursive.
By the way, you can refer in an inner query to a relation in the outside context of the inner query, but not the other way around.
All of this is conceptual. In reality the DBMS might rewrite your query for the purpose of efficiency.
For example, assume R(a,b) and S(a,c). WHere S(a) is a foreign key that references R(A).
The query:
select b from R JOIN S using (a) where a > 10
can be rewritten by the DBMS to something similar to this:
select b FROM R JOIN (select a from s where a > 10) as T using (a);
or:
select b FROM (select * from R where a > 10) as T JOIN S using (a);
In fact, the DBMS does this all the time. It takes your query, and creates alternates queries. then estimates the execution time of each query and decides which one is the most likely to be the fastest. And then it executes it.
This is a fundamental process of query evaluation. Note that the 3 queries are identical in terms of results. But depending on the sizes of the relations, they might have very different execution times. For example, if R and S are huge, but very few tuples have a>0, then the join wastes time. Each query with a subselect might perform fast if that subselect matches very few tuples, but badly if they match a lot of tuples. This is the type of "magic" that happens inside the query evaluation engine of the DBMS.
You are confusing Order of execution with Logical Query Processing.
I did a quick google search and found a bunch of articles referring to Logical Query Processing as "order of execution". Let's clear this up.
Logical Query Processing
Logical Query Processing details the under-the-hood processing phases of a SQL Query... First the WHERE clause is evaluated for the optimizer to know where to get data from, then table operators, etc.
Understanding this will help you better design and tune queries. Logical query processing order will help you understand why you can reference a column by it's alias in an ORDER BY clause but not anywhere else.
Order of Execution
Consider this WHERE clause:
WHERE t1.Col1 = 'X'
AND t2.Col2 = 1
AND t3.Col3 > t2.Col4
The optimizer is not required to evaluate these predicates in any order; it can evaluate t2.Col2 = 1 first, then t1.Col1 = 'X'.... The optimizer, in some cases can evaluate joins in a different order than than you have presented in your query. When predicate logic dictates that the result will be the same, it is free to make (what it considers) the best choices for optimal performance.
Sadly there is not a lot about this topic out there. I do discuss this a little more here.
First there's the SQL query and the rules of SQL that apply to it. That's what in the other answers is referred to as "Logical query processing". With SQL you specify a result. The SQL standard does not allow you to specify how this result is reached.
Then there's the query optimizer. Based on statistics, heuristics, amount of available CPU, memory and other factors, it will determine the execution plan. It will evaluate how long the execution is expected to take. It will evaluate different execution plans to find the one that executes fastest. In that process, it can evaluate execution plans that use different indexes, and/or rearranges the join order, and/or leave out (outer) joins, etc. The optimizer has many tricks. The more expensive the best execution plan is expected to be, the more (advanced) execution plans will be evaluated. The end result is one (serial) execution plan and potentially a parallel execution plan.
All the evaluated execution plans will guarantee the correct result; the result that matches execution according to the "Logical query processing".
Finally, there's the SQL Server engine. After picking either the serial or parallel execution plan, it will execute it.
The other answers, whilst containing useful and interesting information, risk causing confusion in my view.
They all seem to introduce the notion of a "logical" order of execution, which differs from the actual order of execution, as if this is something special about SQL.
If someone asked about the order of execution of any ordinary language besides SQL, the answer would be "strictly sequential" or (for expressions) "in accordance with the rules of that language". I feel as though we wouldn't be having a long-winded exploration about how the compiler has total freedom to rearrange and rework any algorithm that the programmer writes, and distinguishing this from the merely "logical" representation in the source code.
Ultimately, SQL has a defined order of evaluation. It is the "logical" order referred to in other answers. What is most confusing to novices is that this order does not correspond with the syntactic order of the clauses in an SQL statement.
That is, a simple SELECT...FROM...WHERE...ORDER BY query would actually be evaluated by taking the table referred to in the from-clause, filtering rows according to the where-clause, then manipulating the columns (including filtering, renaming, or generating columns) according to the select-clause, and finally ordering the rows according to the order-by-clause. So clauses here are evaluated second, third, first, fourth, which is a disorderly pattern to any sensible programmer - the designers of SQL preferred to make it correspond more in their view to the structure of something spoken in ordinary English ("tell me the surnames from the register!").
Nevertheless, when the programmer writes SQL, they are specifying the canonical method by which the results are produced, same as if they write source code in any other language.
The query simplification and optimisation that database engines perform (like that which ordinary compilers perform) would be a completely separate topic of discussion, if it hadn't already been conflated. The essence of the situation on this front, is that the database engine can do whatever it damn well likes with the SQL you submit, provided that the data it returns to you is the same as if it had followed the evaluation order defined in SQL.
For example, it could sort the results first, and then filter them, despite this order of operations being clearly different to the order in which the relevant clauses are evaluated in SQL. It can do this because if you (say) have a deck of cards in random order, and go through the deck and throw away all the aces, and then sort the deck into standard order, the outcome (in terms of the final content and order of the deck) is no different than if you sort the deck into standard order first, and then go through and throw away all the aces. But the full details and rationale of this behaviour would be for a separate question entirely.

SQL "Order of execution" vs "Order of writing"

I am a new learner of SQL language to add knowledge to my career, I came to learn that in writing a query, there is a "Order of writing" vs "Order of execution", however I can't seem to find a full list of available SQL functions listing out the hierarchy
So far from what I learn I got this table, can someone with better knowledge help confirm if my table below is correct? And perhaps add any other functions that I might have missed, I am not sure where I should put the JOIN in the table below
Also, is there a difference (either in order or name of function) if I am using different Sql platforms?
MySql vs BigQuery for eg.
Your help is deeply appreciated, big thanks in advance for reading this post by a beginner
Order of writing
Order of execution
Select
From
Top
Where
Distinct
Group by
From
Having
Where
Select
Group by
Window
Having
QUALIFY
Order by
Distinct
Second
Order by
QUALIFY
Top
Limit
Limit
SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.
The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.
You are confusing execution with compilation and probably you just care about scoping rules.
So, to start with SQL has a set of clauses and these are in a very specified order. Your question contains this ordering -- at least for a database that supports those clauses.
The second part is the ordering for identifying identifiers. Basically, this comes down to:
Table aliases are defined in the FROM clause. So this can be considered as "first" for scoping purposes.
Column aliases are defined in the SELECT clause. By the SQL Standard, column aliases can be used in the ORDER BY. Many databases extend this to the QUALIFY (if supported), HAVING, and GROUP BY clauses. In general, databases do not support them in the WHERE clause.
If two tables in the FROM have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in a JOIN and the USING clause is used. Then the unqualified column name is fine.
If a column alias defined in the SELECT conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.
The whole point of SQL is that it is a 'whole set' language and there is no particular set order to much of it. Today's DBMS evaluates each Select query as a whole to determine the best, most efficient way to assemble the data set results, in much the same way that Google Maps might determine the best path to get you home based both on where you are and ambient traffic.
Databases will provide, under their Explain Plan command, exactly the sequence they will use to process your query. This called the Execution Plan. Each of these steps are performed on entire table sets and where possible under parallel processes. The steps in each plan do not have any of your names listed above, instead a step might say "perform an index scan on table A", or "perform a nested loops join on the prior partial result set and table B". In some cases they will filter records before joining and in other cases they won't, for example.
Within those parameters there are some tasks that always come before others. For example, all Where clause filtering takes place before aggregation and summary filtering (Having clause). But there are few absolute rules here.
When writing SQL, I found that the execution order of the select statement is not the same as the order of writing.
The order in which SQL query statements are written is
SELECT
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
But in fact the order of execution of the SQL statement is
FROM
WHERE
GROUP BY
HAVING
SELECT
UNION
ORDER BY
SQL will first choose where my table is selected, including the table's restrictions, (such as connection mode JOIN and restrictions ON)
SQL will choose what my judgment condition is, that is, the problem of WHERE
Then it will group by grouping and execute the HAVING statement.
SELECT statement is executed after most of the statements are executed, so we must understand that the statement executed in front of it will affect it, and pay attention to the actual work. This is especially important.
With the execution order of the statement we can find that order by the last execution, so we can sort the new fields named in select.

Question about execution relating to derived columns in SQL query

This is a more conceptual question about what happens during execution rather than anything wrong with the code. I was thinking about this in relation to the exercise here: https://www.hackerrank.com/challenges/earnings-of-employees/submissions/code/214661572
The solution I provided was:
SELECT * FROM (
SELECT (SALARY * MONTHS), COUNT(*)
FROM EMPLOYEE
GROUP BY (SALARY*MONTHS)
ORDER BY (SALARY*MONTHS) DESC
)
WHERE ROWNUM = 1;
First, looking at the subquery, we see that we are grouping on the derived column salary*months.
However, something that confuses me is that it was explained to me that the order of execution begins with the from statement, and then proceeds to joins, where, group by, etc. clauses.
The problem I have was in my mind, I have imagined the from statement as the command which tells SQL what table we are dealing with - so when we invoke from the employee table we have the columns as specified in the exercise link.
However, now the next step is to group by (SALARY * MONTHS)... which is a derived column. But, the derived column does not exist in the table that we specified in the "FROM" statement. So how does SQL know what to group by if the column isn't provided in the original table?
The order of execution explanation I am looking at is here: https://sqlbolt.com/lesson/select_queries_order_of_execution
Thank you.
The article from SQLBolt, is not realistic.
The article explains how a naïve database engine would execute a query. It can be helpful as a simplistic approach to understand the basics of a query execution, but if an engine worked that way it would be very slow except for the most simple queries only. Nowadays the engines are much smarter than that.
The key concept that you must understand is that SQL is a declarative language, not an imperative one. You tell what you need, the engine decides how to produce it.
As a general rule modern databases process the query using the phases listed below:
Caching: Find if the query was already executed. If found, skip to step #7.
Parsing: Translate the SQL statement into an internal representation.
Rephrasing: Simplify the internal representation. All dirty tricks are valid here: for example using the same node for all occurrences of SALARY * MONTHS.
Planning/Prunning: Produce all possible execution plan trees, and prune as soon as possible. Include existing indexes to produce plans.
Cost Assessment: Determine the cost of a plan. The cost algorithm must be extremely fast (typically an heuristic can do), and somewhat accurate, since it must be computed for all candidate plans.
Optimizing: Select the best plan according to the cost. Update the cache.
Executing: Execute the plan tree starting from the root node and walk the tree by depth. This is not strictly true due to pipelining.
Node Pipelining or Materializing: a node can start returning rows as soon as possible to the parent node.
Return Result Set: Walk back to the parent nodes, until the root node is reached. The root node starts returning data to the client app.
As you see, there's a lot going on behind the scenes. Mind that some engines are much more sophisticated than this (e.g. Oracle, DB2, PostgreSQL) since they have smarter shortcuts and have implemented so may dirty tricks. Yep... all is valid as long as the returned result is correct.
Two different things are going on here.
The more important is that what gets executed is a directed acyclic graph of data operations. It really has (very little) to do with the SQL you write. The only guarantee is that it produces the results that you specify.
That is, SQL is a declarative language, not a procedural language. A query describes the result set.
The second thing that is going on is the scoping of identifiers: what does a column reference mean? These are defined in the FROM clause. JOINs have nothing to do with this, because they are just operators (like + or || except on tables) in the FROM clause.
Then, the references can be used in the WHERE, GROUP BY, SELECT, and other clauses. Column aliases defined in the SELECT can really only be used in the ORDER BY clause. Some databases also allow them in the HAVING and GROUP BY clauses as well but not Oracle.
As for your specific question, there is no requirement in SQL that the GROUP BY keys be present in the SELECT. Usually they are, but that is not a requirement. In fact, there are cases when using dates with string names that they might not be used. For instance, one could write:
select to_char(datecol, 'MON'), count(*)
from t
group by to_char(datecol, 'MON'), extract(month from date)
order by extract(month from date);
The month number is functionally equivalent to the month name. To have it for sorting, you could include it as a group by key.
Some people confuse the scoping rules with the order of execution. That is due to a misunderstanding of how SQL engines actually work.
I just give you my opinion, it is pretty intuitive.
You have the subquery query:
SELECT * FROM
-- init_subquery
(SELECT (SALARY * MONTHS), COUNT(*) FROM EMPLOYEE
GROUP BY (SALARY*MONTHS)
ORDER BY (SALARY*MONTHS) DESC)
-- finish_subquery
WHERE ROWNUM = 1;
First of all it do the arithmetic operation, It produces the result of SALARY*MONTHS. That result is the column for Oracle, It only understand that it has some values in one column...
Like: SELECT (SALARY * MONTHS), FROM EMPLOYEE
After this execution it gets the GROUP BY clouse to do the COUNT, because COUNT operation depends if there are any group clause.
And finally it order your table.
I do not want to get any point from your article. Just that is a good question and I am preparing for 1Z0_071 Exam, and it is interesting to me to discuse.

SQL Conceptual Order Evaluation

Can someone explain, what is conceptual order of execution of a SELECT statement and provide an example of one please?
I've searched on Google but they all seem to use the same example without thorough explanation.
A select query is evaluated, conceptually, in the following order:
The from clause
The where clause
The group by clause
The having clause
The select clause
The order by clause
This is "conceptual" processing and explains some of the scoping rules of SQL. The way queries are executed in practice may differ.
SQL Server documentation explains this ordering here.

Why do I need to explicitly specify all columns in a SQL "GROUP BY" clause - why not "GROUP BY *"?

This has always bothered me - why does the GROUP BY clause in a SQL statement require that I include all non-aggregate columns? These columns should be included by default - a kind of "GROUP BY *" - since I can't even run the query unless they're all included. Every column has to either be an aggregate or be specified in the "GROUP BY", but it seems like anything not aggregated should be automatically grouped.
Maybe it's part of the ANSI-SQL standard, but even so, I don't understand why. Can somebody help me understand the need for this convention?
It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.
SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.
When you write a query the most important consideration is that it yields correct results. If you made a mistake, it's probably better that the SQL parser informs you, rather than making a guess about your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps how to retrieve it) already makes it easy to inadvertently make mistakes. Introducing fuzziniess into the language syntax would not make this better.
In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins - where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time) - the semantics of existing queries changes with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.
One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can result in having to write more code than you may like, and gripe about why certain constructs are so verbose ... but at the end of the day - it is what it is.
The only logical reason I can think of to keep the GROUP BY clause as it is that you can include fields that are NOT included in your selection column in your grouping.
For example.
Select column1, SUM(column2) AS sum
FROM table1
GROUP BY column1, column3
Even though column3 is not represented elsewhere in the query, you can still group the results by it's value. (Of course once you have done that, you cannot tell from the result why the records were grouped as they were.)
It does seem like a simple shortcut for the overwhelmingly most common scenario (grouping by each of the non-aggregate columns) would be a simple yet effective tool for speeding up coding.
Perhaps "GROUP BY *"
Since it is already pretty common in SQL tools to allow references to the columns by result column number (ie. GROUP BY 1,2,3, etc.) It would seem simpler still to be able to allow the user to automatically include all the non-aggregate fields in one keystroke.
It's simple just like this: you asked to sql group the results by every single column in the from clause, meaning for every column in the from clause SQL, the sql engine will internally group the result sets before to present it to you. So that explains why it ask you to mention all the columns present in the from too because its not possible group it partially. If you mentioned the group by clause that is only possible to sql achieve your intent by grouping all the columns as well. It's a math restriction.