The usual SQL logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
Where does the OVER clause fall in the SQL logical processing order? I am trying to understand whether, logically, OVER is evaluated after the data is grouped (that is, after HAVING and before SELECT). I am also unsure whether DISTINCT, ORDER BY, and TOP have any impact on the data window used by the OVER clause.
Reference: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-transact-sql?view=sql-server-ver15#logical-processing-order-of-the-select-statement
A Beginner’s Guide to the True Order of SQL Operations by Lukas Eder:
The logical order of operations is the following (for “simplicity” I’m
leaving out vendor specific things like CONNECT BY, MODEL,
MATCH_RECOGNIZE, PIVOT, UNPIVOT and all the others):
FROM: ...
WHERE: ...
GROUP BY: ...
HAVING: …
WINDOW:
If you’re using the awesome window function feature, this is the step where they’re all calculated. Only now.
And the cool thing is, because we have already calculated (logically!) all the aggregate functions, we can nest aggregate functions in window functions.
It’s thus perfectly fine to write things like sum(count(*)) OVER () or row_number() OVER (ORDER BY count(*)).
Window functions being logically calculated only now also explains why you can put them only in the SELECT or ORDER BY clauses.
They’re not available to the WHERE clause, which happened before.
SELECT: ....
DISTINCT: ...
UNION, INTERSECT, EXCEPT: ...
ORDER BY: ....
OFFSET: .
LIMIT, FETCH, TOP: ...
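The point about window functions being calculated after GROUP BY and HAVING can be checked with a small experiment. This is only a sketch, assuming Python's bundled SQLite (3.25 or later, which is when window functions were added); the orders table and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10), ("a", 20), ("b", 5), ("c", 7), ("c", 8), ("c", 9)],
)

# count(*) is computed per group first. The window function then runs over
# the grouped rows that survive HAVING, so it can wrap the aggregate.
rows = conn.execute(
    """
    SELECT customer,
           count(*)              AS n,
           sum(count(*)) OVER () AS total_n
    FROM orders
    GROUP BY customer
    HAVING count(*) > 1
    ORDER BY customer
    """
).fetchall()
print(rows)
```

Customer b has a single order and is removed by HAVING, so total_n should only add up the counts of the remaining groups. That is consistent with the window step seeing the data after grouping and group filtering.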
Related:
Why no windowed functions in where clauses?
Snowflake - QUALIFY
The QUALIFY clause filters the results of window functions. QUALIFY does with window functions what HAVING does with aggregate functions and GROUP BY clauses.
You are confusing expressions with clauses.
Although called a "clause" in SQL Server documentation, OVER is part of an analytic function. It is an expression that returns a scalar result.
Analytic functions can appear in the SELECT clause and ORDER BY clause. They are parsed as part of those clauses.
SQL, by the way, is a declarative language, not a procedural language. A query does not specify the "order of execution". That is determined by the compiler and optimizer. What you are referring to is the "order of parsing", which explains how identifiers are resolved in the query.
The confusion I think is usually traced to this reference. The documentation is quite clear that this refers to the "logical processing order", ("This order determines when the objects defined in one step are made available to the clauses in subsequent steps.") But people seem confused anyway.
Related
I wonder how this query executes successfully. As we know, the 'having' clause executes before the 'select' one, so how can an alias defined in the 'select' statement be used in the 'having' condition without raising an error?
As we know 'having' clause execute before the select one
This assertion is wrong. The HAVING clause is used to filter on the results of aggregate functions (such as SUM, AVG, COUNT, MIN, and MAX). Since these need to be calculated BEFORE any such filter is applied, the SELECT statement has in fact already been processed when the HAVING clause starts to be processed.
Even if the previous paragraph were not true, it is important to consider that SQL statements are interpreted as a whole before any processing. Because of this, the order of the instructions doesn't really matter: the interpreter can link all references so that they make sense at runtime.
So it would be perfectly feasible to put the HAVING clause before the SELECT or in any part of the instruction, because this is just a syntax decision. Currently, HAVING clause is after GROUP BY clause because someone decided that this syntax makes more sense in SQL.
Finally, you should consider that allowing you to reference something by an alias is much more a language feature than a rationale for how the instruction is processed.
The order of execution is:
Getting Data (From, Join)
Row Filter (Where)
Grouping (Group by)
Group Filter (Having)
Return Expressions (Select)
Order & Paging (Order by & Limit / Offset)
I still don't understand what you are asking about. Syntactically your SELECT query is correct, but we cannot know whether it returns the correct result.
The Spark SQL engine is obviously different from a normal SQL engine because it is a distributed SQL engine. The normal SQL order of execution does not apply here because when you execute a query via Spark SQL, the engine converts it into an optimized DAG before it is distributed across your worker nodes. The worker nodes then do map, shuffle, and reduce tasks before the result is aggregated and returned to the driver node. Read more about Spark DAG here.
Therefore, there is more than one selection, filtering, and aggregation step happening before any result is returned. You can see it yourself by clicking on the Spark job view in the Databricks query result panel and then selecting Associated SQL Query.
So, when it comes to Spark SQL, I recommend referring to the Spark documentation, which clearly indicates that the HAVING clause can refer to an aggregate function by its alias.
I am new to SQL and learning it to add to my career skills. I have learned that when writing a query there is an "order of writing" versus an "order of execution", but I can't seem to find a full list of the available SQL clauses that lays out this hierarchy.
From what I have learned so far, I put together the table below. Can someone with better knowledge confirm whether it is correct, and perhaps add any other clauses that I might have missed? I am not sure where I should put JOIN in the table.
Also, is there a difference (either in the order or in the names of the clauses) if I am using different SQL platforms?
MySQL vs BigQuery, for example.
Your help is deeply appreciated. Many thanks in advance for reading this post by a beginner.
Order of writing | Order of execution
Select | From
Top | Where
Distinct | Group by
From | Having
Where | Select
Group by | Window
Having | QUALIFY
Order by | Distinct
Second | Order by
QUALIFY | Top
Limit | Limit
SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.
The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.
You are confusing execution with compilation and probably you just care about scoping rules.
So, to start with, SQL has a set of clauses, and these come in a very specific order. Your question contains this ordering, at least for a database that supports those clauses.
The second part is the ordering for resolving identifiers. Basically, this comes down to:
Table aliases are defined in the FROM clause. So this can be considered as "first" for scoping purposes.
Column aliases are defined in the SELECT clause. By the SQL Standard, column aliases can be used in the ORDER BY. Many databases extend this to the QUALIFY (if supported), HAVING, and GROUP BY clauses. In general, databases do not support them in the WHERE clause.
If two tables in the FROM have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in a JOIN and the USING clause is used. Then the unqualified column name is fine.
If a column alias defined in the SELECT conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.
The whole point of SQL is that it is a 'whole set' language and there is no particular set order to much of it. Today's DBMS evaluates each Select query as a whole to determine the best, most efficient way to assemble the data set results, in much the same way that Google Maps might determine the best path to get you home based both on where you are and ambient traffic.
Databases will provide, under their Explain Plan command, exactly the sequence they will use to process your query. This is called the Execution Plan. Each of these steps is performed on entire table sets and, where possible, in parallel processes. The steps in each plan do not carry any of the names you listed above. Instead, a step might say "perform an index scan on table A" or "perform a nested loops join on the prior partial result set and table B". In some cases the database will filter records before joining and in other cases it won't, for example.
Within those parameters there are some tasks that always come before others. For example, all Where clause filtering takes place before aggregation and summary filtering (Having clause). But there are few absolute rules here.
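The rule that WHERE filtering comes before aggregation while HAVING filters the aggregated groups can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module; the sales table and its values are hypothetical.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 5), ("north", 50), ("south", 40), ("south", 45)],
)

# WHERE removes individual rows before the groups are summed.
where_rows = conn.execute(
    """
    SELECT region, sum(amount) AS total
    FROM sales
    WHERE amount > 10
    GROUP BY region
    ORDER BY region
    """
).fetchall()

# HAVING removes whole groups after the sums are available.
having_rows = conn.execute(
    """
    SELECT region, sum(amount) AS total
    FROM sales
    GROUP BY region
    HAVING sum(amount) > 60
    ORDER BY region
    """
).fetchall()

print(where_rows)
print(having_rows)
```

In the first query the small north row is dropped before summing, so the north total changes. In the second query every row contributes to its group's sum, and only then is the north group discarded as a whole.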
When writing SQL, I found that the execution order of the select statement is not the same as the order of writing.
The order in which SQL query statements are written is
SELECT
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
But in fact the order of execution of the SQL statement is
FROM
WHERE
GROUP BY
HAVING
SELECT
UNION
ORDER BY
SQL first determines which tables the data comes from, including how they are connected (the JOIN type and the ON conditions).
SQL then applies the filter conditions, that is, the WHERE clause.
Then it groups the rows according to GROUP BY and applies the HAVING clause.
The SELECT clause is executed after most of the other clauses, so we must keep in mind that everything executed before it affects what it sees. This is especially important in real work.
From the execution order we can also see that ORDER BY is executed last, which is why we can sort by new fields named in SELECT.
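As a small illustration of that last point, the sketch below sorts by a column alias that only exists once SELECT has been evaluated. It assumes Python's built-in sqlite3 module; the items table and the line_total alias are hypothetical names.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price INTEGER, qty INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("pen", 2, 10), ("book", 12, 1), ("bag", 30, 2)],
)

# line_total is defined in SELECT and then reused in ORDER BY.
sorted_rows = conn.execute(
    """
    SELECT name, price * qty AS line_total
    FROM items
    ORDER BY line_total DESC
    """
).fetchall()
print(sorted_rows)
```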
I am looking for at what point do windowed functions happen in sql.
I know they can be used in the SELECT and ORDER BY clauses, so I am inclined to think they happen after ORDER BY, but before TOP
Window functions happen when the optimizer decides that they should happen. This is best understood looking at the query plan.
SQL Server advertises the logical processing of queries. This is used to explain scoping rules (in particular). It is not related to how the query is actually executed.
Clearly, the rules for window functions are:
The effect is after the FROM, WHERE, GROUP BY, and HAVING clauses are processed.
The effect is not related to the ORDER BY (even if you use order by (select null)).
TOP does not affect the processing.
The processing occurs before SELECT DISTINCT.
I think the conclusion is that they are parsed in the SELECT or ORDER BY as with other expressions in those clauses. There is no separate place for them.
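Two of these rules can be observed directly. The sketch below is not SQL Server: it assumes Python's bundled SQLite (3.25 or later), where LIMIT plays the role of TOP, and uses a made-up table t. It only shows the logical effect, not how any engine executes the query.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x",), ("x",), ("y",)])

# row_number() is assigned before DISTINCT is applied, so the two 'x' rows
# get different numbers and DISTINCT cannot collapse them.
distinct_rows = conn.execute(
    """
    SELECT DISTINCT v, row_number() OVER (ORDER BY v) AS rn
    FROM t
    ORDER BY rn
    """
).fetchall()

# LIMIT (standing in for TOP here) trims the output afterwards, so the
# window still counts all rows of the table.
limited_rows = conn.execute(
    """
    SELECT v, count(*) OVER () AS n
    FROM t
    ORDER BY v
    LIMIT 1
    """
).fetchall()

print(distinct_rows)
print(limited_rows)
```

The first query should still return three rows, and the second should return a single row whose window count covers the whole table.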
Some database engines like Microsoft Access support FIRST() as an aggregate function, and I have been using it in cases where I know the column will only have one value in the group.
Potentially, the database engine can optimise this: as soon as it reaches any value, it can mark the result as already calculated. So it is surprising that this is not supported in, for example, Oracle or SQL Server, and more importantly, not in the SQL standard.
In practice, people use MIN() or MAX() instead, but these require that:
The underlying data type has a natural ordering, and that ordering matters to the user;
The database engine has to compare the intermediate value with the value in each row.
So this is not optimal in many cases.
Are there any specific reasons people don't want to allow SELECT ANY(FIELD) ...? (I could think of two variants: ANY() gives any non-null value of the column in the result set; FIRST() gives the column value for the first row in the result set, or null if there are no rows.)
Regarding first/last
The syntax supported by Microsoft Access SQL doesn't make sense in standard SQL:
SELECT
First(LastName) as First,
Last(LastName) as Last
FROM Employees
(source)
In Standard SQL, grouping takes place before sorting. Normally, groups are not sorted. That means it is undefined which row comes first/last. Standard SQL generally aims to avoid constructs that have nondeterministic behaviour (exceptions exist).
Standard SQL offers so-called ordered set functions that accept a within group (order by...) clause to establish an order in a group prior to aggregation:
SELECT
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY val)
FROM ...
The range for the argument of percentile_disc is 0 to 1, where 0 is the first result and 1 the last. 0.5 is the median (this is the common use-case for percentile_disc).
However, standard SQL does not offer first/last as ordered set function, but percentile_disc with an argument of 0 is basically first, while the value 1 would basically give you the last result.
The more common SQL way to obtain the first/last value is to use a top-n query:
SELECT LastName
FROM Employees
ORDER BY ...
FETCH FIRST 1 ROW ONLY
Fetching first and last value in one go is a little bit awkward.
Other than that, standard SQL also offers the window functions first_value and last_value to pick those values out of a partition without grouping.
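A minimal sketch of the window-function route, assuming Python's bundled SQLite (3.25 or later). The Employees table follows the example above, but the HireYear column and the rows are hypothetical and exist only to give the window an explicit order.

```python
import sqlite3

# Hypothetical data; HireYear is an assumed column that defines the order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (LastName TEXT, HireYear INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Smith", 2015), ("Jones", 2010), ("Brown", 2020)],
)

# The explicit frame makes last_value look at the whole partition rather
# than only the rows up to the current one. DISTINCT collapses the
# identical per-row results into a single row.
first_last = conn.execute(
    """
    SELECT DISTINCT
           first_value(LastName) OVER (
               ORDER BY HireYear
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS first_name,
           last_value(LastName) OVER (
               ORDER BY HireYear
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_name
    FROM Employees
    """
).fetchall()
print(first_last)
```

Unlike the Access FIRST()/LAST() functions, the result here is deterministic because the ORDER BY inside OVER states which row counts as first and which as last.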
Regarding any
Standard SQL has an aggregate function any but for a different use case. Again, what you (MS Access SQL) suggest for any gives you a non-deterministic result, which is not what standard SQL encourages.
The standard SQL function any returns a boolean that is true if any of the conditions is true. It is best used in having clauses:
SELECT
*
FROM ..
GROUP BY ...
HAVING ANY(<condition>)
This removes all groups where no <condition> evaluates to true.
References:
Slides regarding WITHIN GROUP: https://www.slideshare.net/MarkusWinand/modern-sql/105
Blog post on the every function (which is similar to any): https://blog.jooq.org/2014/12/18/a-true-sql-gem-you-didnt-know-yet-the-every-aggregate-function/
FIRST (or better ANY_VALUE as MySQL calls the function) is not in the SQL standard. This is probably because it is hardly needed in a standard-compliant DBMS.
You say you use FIRST "in cases I know the column will only have one value in the Group". In a well-built database such a case should hardly ever occur. Maybe you are rather using it, because MS Access (and several other DBMS) violate the standard when it comes to aggregation with a GROUP BY clause. An example:
select department_id, d.department_name, count(*) as num_employees
from employees e
join departments d using (department_id)
group by d.department_id;
You may want to use FIRST(d.department_name) here, because you know that per department_id there will be just one department_name of course. But so does the DBMS (or better: it should!). In standard SQL the above query is valid, because the department_name is functionally dependent on the department_id. No need hence for a FIRST or ANY_VALUE function.
MySQL introduced ANY_VALUE mainly in order to deal with cases where the DBMS fails to detect the functional dependency, but again these cases should be extremely rare. I like the function, because it gives you the opportunity to say "I don't care which", e.g. give me the departments and one leader per department (i.e. in case there are two department leaders: one of them arbitrarily chosen). But well, I guess in the standard committee they decided that MIN or MAX would serve the purpose in such rare cases, too.
My research has shown that at least these SQL dialects support ANY_VALUE() as aggregate functions (and often also window functions):
BigQuery
MySQL
Oracle 21c
Redshift
SingleStore (MemSQL)
Snowflake
And this one supports FIRST_VALUE and LAST_VALUE, which are similar:
HANA
It's not as widespread as it could be, but not completely unheard of, either.
Let's put GROUP BY aside for a second. In normal queries (without GROUP BY), what is the semantic difference? Why does this answer work? (put an alias in a HAVING clause instead of WHERE)
HAVING operates on the summarized row - WHERE is operating on the entire table before the GROUP BY is applied. (You can't put GROUP BY aside, HAVING is a clause reserved for use with GROUP BY - leaving out the GROUP BY doesn't change the implicit action that is occurring behind the scenes).
It's also important to note that because of this, WHERE can use an index while HAVING cannot. (In super trivial un-grouped result sets you could theoretically use an index for HAVING, but I've never seen a query optimizer actually implemented in this way).
MySQL evaluates the query up to and including the WHERE clause, then filters it with the HAVING clause. That's why HAVING can recognize column aliases, whereas WHERE can't.
By omitting the GROUP BY clause, I believe you simply tell the query not to group any of your results.
Very broadly, WHERE filters the data going into the query (the DB tables), while HAVING filters the output of the query.
Statements in the WHERE clause can only refer to the tables (and other external data sources), while statements in the HAVING clause can only refer to the data produced by the query.