SQL Conceptual Order Evaluation - sql

Can someone explain, what is conceptual order of execution of a SELECT statement and provide an example of one please?
I've searched on Google but they all seem to use the same example without thorough explanation.

A select query is evaluated, conceptually, in the following order:
The from clause
The where clause
The group by clause
The having clause
The select clause
The order by clause
This is "conceptual" processing and explains some of the scoping rules of SQL. The way queries are executed in practice may differ.
SQL Server documentation explains this ordering here.

Related

SQL "Order of execution" vs "Order of writing"

I am a new learner of SQL language to add knowledge to my career, I came to learn that in writing a query, there is a "Order of writing" vs "Order of execution", however I can't seem to find a full list of available SQL functions listing out the hierarchy
So far from what I learn I got this table, can someone with better knowledge help confirm if my table below is correct? And perhaps add any other functions that I might have missed, I am not sure where I should put the JOIN in the table below
Also, is there a difference (either in order or name of function) if I am using different Sql platforms?
MySql vs BigQuery for eg.
Your help is deeply appreciated, big thanks in advance for reading this post by a beginner
Order of writing
Order of execution
Select
From
Top
Where
Distinct
Group by
From
Having
Where
Select
Group by
Window
Having
QUALIFY
Order by
Distinct
Second
Order by
QUALIFY
Top
Limit
Limit
SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.
The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.
You are confusing execution with compilation and probably you just care about scoping rules.
So, to start with SQL has a set of clauses and these are in a very specified order. Your question contains this ordering -- at least for a database that supports those clauses.
The second part is the ordering for identifying identifiers. Basically, this comes down to:
Table aliases are defined in the FROM clause. So this can be considered as "first" for scoping purposes.
Column aliases are defined in the SELECT clause. By the SQL Standard, column aliases can be used in the ORDER BY. Many databases extend this to the QUALIFY (if supported), HAVING, and GROUP BY clauses. In general, databases do not support them in the WHERE clause.
If two tables in the FROM have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in a JOIN and the USING clause is used. Then the unqualified column name is fine.
If a column alias defined in the SELECT conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.
The whole point of SQL is that it is a 'whole set' language and there is no particular set order to much of it. Today's DBMS evaluates each Select query as a whole to determine the best, most efficient way to assemble the data set results, in much the same way that Google Maps might determine the best path to get you home based both on where you are and ambient traffic.
Databases will provide, under their Explain Plan command, exactly the sequence they will use to process your query. This called the Execution Plan. Each of these steps are performed on entire table sets and where possible under parallel processes. The steps in each plan do not have any of your names listed above, instead a step might say "perform an index scan on table A", or "perform a nested loops join on the prior partial result set and table B". In some cases they will filter records before joining and in other cases they won't, for example.
Within those parameters there are some tasks that always come before others. For example, all Where clause filtering takes place before aggregation and summary filtering (Having clause). But there are few absolute rules here.
When writing SQL, I found that the execution order of the select statement is not the same as the order of writing.
The order in which SQL query statements are written is
SELECT
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
But in fact the order of execution of the SQL statement is
FROM
WHERE
GROUP BY
HAVING
SELECT
UNION
ORDER BY
SQL will first choose where my table is selected, including the table's restrictions, (such as connection mode JOIN and restrictions ON)
SQL will choose what my judgment condition is, that is, the problem of WHERE
Then it will group by grouping and execute the HAVING statement.
SELECT statement is executed after most of the statements are executed, so we must understand that the statement executed in front of it will affect it, and pay attention to the actual work. This is especially important.
With the execution order of the statement we can find that order by the last execution, so we can sort the new fields named in select.

Question about execution relating to derived columns in SQL query

This is a more conceptual question about what happens during execution rather than anything wrong with the code. I was thinking about this in relation to the exercise here: https://www.hackerrank.com/challenges/earnings-of-employees/submissions/code/214661572
The solution I provided was:
SELECT * FROM (
SELECT (SALARY * MONTHS), COUNT(*)
FROM EMPLOYEE
GROUP BY (SALARY*MONTHS)
ORDER BY (SALARY*MONTHS) DESC
)
WHERE ROWNUM = 1;
First, looking at the subquery, we see that we are grouping on the derived column salary*months.
However, something that confuses me is that it was explained to me that the order of execution begins with the from statement, and then proceeds to joins, where, group by, etc. clauses.
The problem I have was in my mind, I have imagined the from statement as the command which tells SQL what table we are dealing with - so when we invoke from the employee table we have the columns as specified in the exercise link.
However, now the next step is to group by (SALARY * MONTHS)... which is a derived column. But, the derived column does not exist in the table that we specified in the "FROM" statement. So how does SQL know what to group by if the column isn't provided in the original table?
The order of execution explanation I am looking at is here: https://sqlbolt.com/lesson/select_queries_order_of_execution
Thank you.
The article from SQLBolt, is not realistic.
The article explains how a naïve database engine would execute a query. It can be helpful as a simplistic approach to understand the basics of a query execution, but if an engine worked that way it would be very slow except for the most simple queries only. Nowadays the engines are much smarter than that.
The key concept that you must understand is that SQL is a declarative language, not an imperative one. You tell what you need, the engine decides how to produce it.
As a general rule modern databases process the query using the phases listed below:
Caching: Find if the query was already executed. If found, skip to step #7.
Parsing: Translate the SQL statement into an internal representation.
Rephrasing: Simplify the internal representation. All dirty tricks are valid here: for example using the same node for all occurrences of SALARY * MONTHS.
Planning/Prunning: Produce all possible execution plan trees, and prune as soon as possible. Include existing indexes to produce plans.
Cost Assessment: Determine the cost of a plan. The cost algorithm must be extremely fast (typically an heuristic can do), and somewhat accurate, since it must be computed for all candidate plans.
Optimizing: Select the best plan according to the cost. Update the cache.
Executing: Execute the plan tree starting from the root node and walk the tree by depth. This is not strictly true due to pipelining.
Node Pipelining or Materializing: a node can start returning rows as soon as possible to the parent node.
Return Result Set: Walk back to the parent nodes, until the root node is reached. The root node starts returning data to the client app.
As you see, there's a lot going on behind the scenes. Mind that some engines are much more sophisticated than this (e.g. Oracle, DB2, PostgreSQL) since they have smarter shortcuts and have implemented so may dirty tricks. Yep... all is valid as long as the returned result is correct.
Two different things are going on here.
The more important is that what gets executed is a directed acyclic graph of data operations. It really has (very little) to do with the SQL you write. The only guarantee is that it produces the results that you specify.
That is, SQL is a declarative language, not a procedural language. A query describes the result set.
The second thing that is going on is the scoping of identifiers: what does a column reference mean? These are defined in the FROM clause. JOINs have nothing to do with this, because they are just operators (like + or || except on tables) in the FROM clause.
Then, the references can be used in the WHERE, GROUP BY, SELECT, and other clauses. Column aliases defined in the SELECT can really only be used in the ORDER BY clause. Some databases also allow them in the HAVING and GROUP BY clauses as well but not Oracle.
As for your specific question, there is no requirement in SQL that the GROUP BY keys be present in the SELECT. Usually they are, but that is not a requirement. In fact, there are cases when using dates with string names that they might not be used. For instance, one could write:
select to_char(datecol, 'MON'), count(*)
from t
group by to_char(datecol, 'MON'), extract(month from date)
order by extract(month from date);
The month number is functionally equivalent to the month name. To have it for sorting, you could include it as a group by key.
Some people confuse the scoping rules with the order of execution. That is due to a misunderstanding of how SQL engines actually work.
I just give you my opinion, it is pretty intuitive.
You have the subquery query:
SELECT * FROM
-- init_subquery
(SELECT (SALARY * MONTHS), COUNT(*) FROM EMPLOYEE
GROUP BY (SALARY*MONTHS)
ORDER BY (SALARY*MONTHS) DESC)
-- finish_subquery
WHERE ROWNUM = 1;
First of all it do the arithmetic operation, It produces the result of SALARY*MONTHS. That result is the column for Oracle, It only understand that it has some values in one column...
Like: SELECT (SALARY * MONTHS), FROM EMPLOYEE
After this execution it gets the GROUP BY clouse to do the COUNT, because COUNT operation depends if there are any group clause.
And finally it order your table.
I do not want to get any point from your article. Just that is a good question and I am preparing for 1Z0_071 Exam, and it is interesting to me to discuse.

Order of execution Oracle Select clause

Consider a query with this structure:
select ..., ROWNUM
from t
where <where clause>
group by <columns>
having <having clause>
order by <columns>;
As per my understanding, the order of processing is
The FROM/WHERE clause goes first.
ROWNUM is assigned and incremented to each output row from the FROM/WHERE clause.
GROUP BY is applied.
HAVING is applied.
ORDER BY is applied.
SELECT is applied.
I cant understand why this article in Oracle magazine by TOM specifies:
Think of it as being processed in this order:
The FROM/WHERE clause goes first.
ROWNUM is assigned and incremented to each output row from the FROM/WHERE clause.
SELECT is applied.
GROUP BY is applied.
HAVING is applied.
ORDER BY is applied.
Can anyone explain this order?
There is not a direct relationship between the clauses in a SQL statement and the processing order. SQL queries are processed in two phases. In the first phase, the code is compiled and optimized. The optimized version is run.
For parsing purposes, the query is evaluated in a particular order. For instance, FROM is parsed first, then WHERE, then GROUP BY, and so on. This explains why a column alias defined in the SELECT is not available in the FROM.
Your description, however, is incorrect with regards to ROWNUM. ROWNUM is a special construct in Oracle . . . as explained in the documentation. It is processed before the ORDER BY.
I think that it is not only difficult to identify an execution order for a SQL statement, it is actually harmful to your understanding of SQL to attempt to do so.
SQL is a declarative language, in which you define the result that you want, not the way in which that result is to be achieved (although it is possible to strongly affect that way). I have had many experiences of being asked, "So how does this SQL get executed?" by developers more familiar with conventional languages, and the truth is that the SQL doesn't tell you that at all, expect for very simplistic cases. As soon as the case is non-simplistic, you cannot afford to be thinking about SQL in the "wrong way".
It is possibly analogous to the difference between object-oriented and non-object oriented languages, or between functional programming and procedural programming -- there is a necessarily different way of thinking involved.
In SQL, the emphasis should be on understanding the syntax and how it defines the result set, and then on understanding the way that the SQL is processed by the database in the context of the schema to which it refers.
I would focus on reading the Oracle Concepts Guide on the subject, which explains that a query submitted to the system goes through various phases of (and this is a simplistic overview):
Parsing
Transformation
Estimation
Plan generation
Execution
It's important to realise that the SQL that is executed may not actually be the SQL that you submitted, but that you can use various developer tools to get deep insight into just about all of these phases.
It's a very different world!

What is the semantic difference between WHERE and HAVING?

Let's put GROUP BY aside for a second. In normal queries (without GROUP BY), what is the semantic difference? Why does this answer work? (put an alias in a HAVING clause instead of WHERE)
HAVING operates on the summarized row - WHERE is operating on the entire table before the GROUP BY is applied. (You can't put GROUP BY aside, HAVING is a clause reserved for use with GROUP BY - leaving out the GROUP BY doesn't change the implicit action that is occurring behind the scenes).
It's also important to note that because of this, WHERE can use an index while HAVING cannot. (In super trivial un-grouped result sets you could theoretically use an index for HAVING, but I've never seen a query optimizer actually implemented in this way).
MySQL evaluates the query up to and including the WHERE clause, then filters it with the HAVING clause. That's why HAVING can recognize column aliases, whereas WHERE can't.
By omitting the GROUP BY clause, I believe you simply tell the query not to group any of your results.
Very broadly, WHERE filters the data going into the query (the DB tables), while HAVING filters the output of the query.
Statements in the WHERE clause can only refer to the tables (and other external data sources), while statements in the HAVING clause can only refer to the data produced by the query.

Why are recursive CTEs unable to use grouping and other clauses?

I recently learned about Recursive Common Table Expressions (CTEs) while looking for a way to build a certain view of some data. After taking a while to write out how the first iteration of my query would work, I turned it into a CTE to watch the whole thing play out. I was surprised to see that grouping didn't work, so I just replaced it with a "Select TOP 1, ORDER BY" equivalent. I was again surprised that "TOP" wasn't allowed, and came to find that all of these clauses aren't allowed in the recursive part of a CTE:
DISTINCT
GROUP BY
HAVING
TOP
LEFT
RIGHT
OUTER JOIN
So I suppose I have 2 questions:
In order to better understand my situation and SQL, why aren't these clauses allowed?
If I need to do some sort of recursion using some of these clauses, is my only alternative to write a recursive stored procedure?
Thanks.
Referring to :-
In order to better understand my situation and SQL, why aren't these clauses allowed?
Based on my understanding of CTE's, the whole idea behind creating a CTE is so that you can create a temporary result-set and use it in a named manner like a regular table in SELECT, INSERT, UPDATE, or DELETE statements.
Because a CTE is logically very much like a view, the SELECT statement in your CTE query must follow the same requirements as those used for creating a view. Refer **CTE query definitions* section in following link on MSDN
Also, because a CTE is basically a named resultset and it can be used like any table in SELECT, INSERT, UPDATE, or DELETE statements, you always have the option of using the various operators you mentioned when you use the CTE in any of those statements.
With regarding to
2.If I need to do some sort of recursion using some of these clauses, is my only alternative to write a recursive stored procedure?
Also, I am pretty sure that you can use at least some of the keywords that you have mentioned above in a view / CTE select statement. For Example: Refer to the use of GROUP BY in a CTE here in the Creating a simple common table expression example
Maybe, if you can provide a sample scenario of what you are trying to achieve, we can suggest some possible solution.