SQL order of execution - sql

I wonder how this query is executing successfully. As we know 'having' clause execute before the select one then here how alias name used in 'select' statement working in having condition and not giving any error.

As we know 'having' clause execute before the select one
This affirmation is wrong. The HAVING clause is used to apply filters in aggregation functions (such as SUM, AVG, COUNT, MIN and MAX). Since they need to be calculated BEFORE applying any filter, in fact, the SELECT statement is done when the HAVING clause start to be processed.
Even if the previous paragraph was not true, it is important to consider that SQL statements are interpreted as a whole before any processing. Due to this, it doesn't really matter the order of the instructions: the interpreter can link all references so they make sense in runtime.
So it would be perfectly feasible to put the HAVING clause before the SELECT or in any part of the instruction, because this is just a syntax decision. Currently, HAVING clause is after GROUP BY clause because someone decided that this syntax makes more sense in SQL.
Finally, you should consider that allowing you to reference something by an alias is much more a language feature than a rational on how the instruction is processed.

the order of exution is
Getting Data (From, Join)
Row Filter (Where)
Grouping (Group by)
Group Filter (Having)
Return Expressions (Select)
Order & Paging (Order by & Limit / Offset)
I still don't get, why you are asking about, syntactially your seelect qiery is correct, but if it the correct result we can not know

Spark SQL engine is obviously different than the normal SQL engine because it is a distributed SQL engine. The normal SQL order of execution does not applied here because when you execute a query via Spark SQL, the engine converts it into optimized DAG before it is distributed across your worker nodes. The worker nodes then do map, shuffle, and reduce tasks before the result is aggregated and returned to the driver node. Read more about Spark DAG here.
Therefore, there are more than just one selecting, filtering, aggregation happening before it returns any result. You can see it yourself by clicking on Spark job view on the Databricks query result panel and then select Associated SQL Query.
So, when it comes to Spark SQL, I recommend we refer to Spark document which clearly indicates that Having clause can refer to aggregation function by its alias.

Related

SQL "Order of execution" vs "Order of writing"

I am a new learner of SQL language to add knowledge to my career, I came to learn that in writing a query, there is a "Order of writing" vs "Order of execution", however I can't seem to find a full list of available SQL functions listing out the hierarchy
So far from what I learn I got this table, can someone with better knowledge help confirm if my table below is correct? And perhaps add any other functions that I might have missed, I am not sure where I should put the JOIN in the table below
Also, is there a difference (either in order or name of function) if I am using different Sql platforms?
MySql vs BigQuery for eg.
Your help is deeply appreciated, big thanks in advance for reading this post by a beginner
Order of writing
Order of execution
Select
From
Top
Where
Distinct
Group by
From
Having
Where
Select
Group by
Window
Having
QUALIFY
Order by
Distinct
Second
Order by
QUALIFY
Top
Limit
Limit
SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.
The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.
You are confusing execution with compilation and probably you just care about scoping rules.
So, to start with SQL has a set of clauses and these are in a very specified order. Your question contains this ordering -- at least for a database that supports those clauses.
The second part is the ordering for identifying identifiers. Basically, this comes down to:
Table aliases are defined in the FROM clause. So this can be considered as "first" for scoping purposes.
Column aliases are defined in the SELECT clause. By the SQL Standard, column aliases can be used in the ORDER BY. Many databases extend this to the QUALIFY (if supported), HAVING, and GROUP BY clauses. In general, databases do not support them in the WHERE clause.
If two tables in the FROM have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in a JOIN and the USING clause is used. Then the unqualified column name is fine.
If a column alias defined in the SELECT conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.
The whole point of SQL is that it is a 'whole set' language and there is no particular set order to much of it. Today's DBMS evaluates each Select query as a whole to determine the best, most efficient way to assemble the data set results, in much the same way that Google Maps might determine the best path to get you home based both on where you are and ambient traffic.
Databases will provide, under their Explain Plan command, exactly the sequence they will use to process your query. This called the Execution Plan. Each of these steps are performed on entire table sets and where possible under parallel processes. The steps in each plan do not have any of your names listed above, instead a step might say "perform an index scan on table A", or "perform a nested loops join on the prior partial result set and table B". In some cases they will filter records before joining and in other cases they won't, for example.
Within those parameters there are some tasks that always come before others. For example, all Where clause filtering takes place before aggregation and summary filtering (Having clause). But there are few absolute rules here.
When writing SQL, I found that the execution order of the select statement is not the same as the order of writing.
The order in which SQL query statements are written is
SELECT
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
But in fact the order of execution of the SQL statement is
FROM
WHERE
GROUP BY
HAVING
SELECT
UNION
ORDER BY
SQL will first choose where my table is selected, including the table's restrictions, (such as connection mode JOIN and restrictions ON)
SQL will choose what my judgment condition is, that is, the problem of WHERE
Then it will group by grouping and execute the HAVING statement.
SELECT statement is executed after most of the statements are executed, so we must understand that the statement executed in front of it will affect it, and pay attention to the actual work. This is especially important.
With the execution order of the statement we can find that order by the last execution, so we can sort the new fields named in select.

Where does OVER clause fall in the SQL logical processing order?

The usual SQL logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
Where does OVER clause fall in the SQL logical processing order? I am trying to understand logically whether the OVER happens after the data is grouped (that is - after HAVING and before SELECT). I am confused whether DISTINCT, ORDER BY and TOP have any impact on data window used by the OVER clause.
Reference: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-transact-sql?view=sql-server-ver15#logical-processing-order-of-the-select-statement
A Beginner’s Guide to the True Order of SQL Operations by Lukas Eder:
The logical order of operations is the following (for “simplicity” I’m
leaving out vendor specific things like CONNECT BY, MODEL,
MATCH_RECOGNIZE, PIVOT, UNPIVOT and all the others):
FROM: ...
WHERE: ...
GROUP BY: ...
HAVING: …
WINDOW:
If you’re using the awesome window function feature, this is the step where they’re all calculated. Only now.
And the cool thing is, because we have already calculated (logically!) all the aggregate functions, we can nest aggregate functions in window functions.
It’s thus perfectly fine to write things like sum(count(*)) OVER () or row_number() OVER (ORDER BY count(*)).
Window functions being logically calculated only now also explains why you can put them only in the SELECT or ORDER BY clauses.
They’re not available to the WHERE clause, which happened before.
SELECT: ....
DISTINCT: ...
UNION, INTERSECT, EXCEPT: ...
ORDER BY: ....
OFFSET: .
LIMIT, FETCH, TOP: ...
Related:
Why no windowed functions in where clauses?
Snowflake - QUALIFY
The QUALIFY clause filters the results of window functions. QUALIFY does with window functions what HAVING does with aggregate functions and GROUP BY clauses.
You are confusing expressions with clauses.
Although called a "clause" in SQL Server documentation, OVER is part of an analytic function. It is an expression that returns a scalar results.
Analytic functions can appear in the SELECT clause and ORDER BY clause. They are parsed as part of those clauses.
SQL, by the way, is a descriptive language not a procedural language. A query does not specify the "order of execution". That is determined by the compiler and optimizer. What you are referring to is the "order of parsing", which explains how identifiers are resolved in the query.
The confusion I think is usually traced to this reference. The documentation is quite clear that this refers to the "logical processing order", ("This order determines when the objects defined in one step are made available to the clauses in subsequent steps.") But people seem confused anyway.

How to avoid duplicated SELECT phrases in SQL (MariaDB)

I am working with a small MariaDB database. To extract time intervals per user, I use the following query:
SELECT
SUM(TIMESTAMPDIFF(SECOND,Activity.startTime,Activity.endTime)) AS seconds,
TIME_FORMAT(SEC_TO_TIME(SUM(TIMESTAMPDIFF(SECOND,Activity.startTime,Activity.endTime))),'%Hh %im %ss') AS formattedTime,
TSUser.name
FROM Activity
INNER JOIN User ON User.id = Activity.userID
GROUP BY User.id
ORDER BY seconds DESC;
I have to select the time as plain seconds (... AS seconds) to be able to order the results by it, as can be seen in my query.
However, I also want MariaDB to format the time interval, for that I use the TIME_FORMAT function. The problem is, I have to duplicate the whole SUM(...) phrase inside the TIME_FORMAT call again. This doesn't seem very elegant. Will MariaDB recognize the duplication and calculate the SUM only once? Also, is there a way to get the same result without duplicating the SUM?
I figured this should be possible with a nested query construct like so:
SELECT
innerQuery.name,
innerQuery.seconds,
TIME_FORMAT(SEC_TO_TIME(innerQuery.seconds), '%Hh %im')
FROM (
//Do the sum here, once.
) AS innerQuery
ORDER BY innerQuery.seconds DESC;
Is this the best way to do it / "ok" to do?
Note: I don't need the raw seconds in the result, only the formatted time is needed.
I'd appreciate help, thanks.
Alas. There isn't a really good solution. When you use a subquery, then MariaDb materializes the subquery (as does MySQL). Your query is rather complex, so there is a lot of I/O happening anyway, so the additional materialization may not be important.
Repeating the expression is really more an issue of aesthetics than performance. The expression will be re-executed multiple times. But, the real expense of doing aggregations is the file sort for the group by (or whatever method is used). Doing the sum() twice is not a big deal (unless you are calling a really expensive function as well as the aggregation function).
Other database engines do not automatically materialize subqueries, so using a subquery in other databases is usually the recommended approach. In MariaDB/MySQL, I would guess that repeating the expression is more efficient, although you can try both on your data and report back.
In this case, you don't need the raw values. The formatted value will work correctly in the ORDER BY.
Your subquery idea is likely to be slower because of all the overhead in having two queries.
This is a Rule of Thumb: It takes far more effort for MySQL to fetch a row than to evaluate expressions in the row. With that rule, duplicate expressions are not a burden.

What is the semantic difference between WHERE and HAVING?

Let's put GROUP BY aside for a second. In normal queries (without GROUP BY), what is the semantic difference? Why does this answer work? (put an alias in a HAVING clause instead of WHERE)
HAVING operates on the summarized row - WHERE is operating on the entire table before the GROUP BY is applied. (You can't put GROUP BY aside, HAVING is a clause reserved for use with GROUP BY - leaving out the GROUP BY doesn't change the implicit action that is occurring behind the scenes).
It's also important to note that because of this, WHERE can use an index while HAVING cannot. (In super trivial un-grouped result sets you could theoretically use an index for HAVING, but I've never seen a query optimizer actually implemented in this way).
MySQL evaluates the query up to and including the WHERE clause, then filters it with the HAVING clause. That's why HAVING can recognize column aliases, whereas WHERE can't.
By omitting the GROUP BY clause, I believe you simply tell the query not to group any of your results.
Very broadly, WHERE filters the data going into the query (the DB tables), while HAVING filters the output of the query.
Statements in the WHERE clause can only refer to the tables (and other external data sources), while statements in the HAVING clause can only refer to the data produced by the query.

Which SQL statement is faster? (HAVING vs. WHERE...)

SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
GROUP BY NR_DZIALU
HAVING NR_DZIALU = 30
or
SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
WHERE NR_DZIALU = 30
GROUP BY NR_DZIALU
The theory (by theory I mean SQL Standard) says that WHERE restricts the result set before returning rows and HAVING restricts the result set after bringing all the rows. So WHERE is faster. On SQL Standard compliant DBMSs in this regard, only use HAVING where you cannot put the condition on a WHERE (like computed columns in some RDBMSs.)
You can just see the execution plan for both and check for yourself, nothing will beat that (measurement for your specific query in your specific environment with your data.)
It might depend on the engine. MySQL for example, applies HAVING almost last in the chain, meaning there is almost no room for optimization. From the manual:
The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization. (LIMIT is applied after HAVING.)
I believe this behavior is the same in most SQL database engines, but I can't guarantee it.
The two queries are equivalent and your DBMS query optimizer should recognise this and produce the same query plan. It may not, but the situation is fairly simple to recognise, so I'd expect any modern system - even Sybase - to deal with it.
HAVING clauses should be used to apply conditions on group functions, otherwise they can be moved into the WHERE condition. For example. if you wanted to restrict your query to groups that have COUNT(DZIALU) > 10, say, you would need to put the condition into a HAVING because it acts on the groups, not the individual rows.
I'd expect the WHERE clause would be faster, but it's possible they'd optimize to exactly the same.
Saying they would optimize is not really taking control and telling the computer what to do. I would agree that the use of having is not an alternative to a where clause. Having has a special usage of being applied to a group by where something like a sum() was used and you want to limit the result set to show only groups having a sum() > than 100 per se. Having works on groups, Where works on rows. They are apples and oranges. So really, they should not be compared as they are two very different animals.
"WHERE" is faster than "HAVING"!
The more complex grouping of the query is - the slower "HAVING" will perform to compare because: "HAVING" "filter" will deal with larger amount of results and its also being additional "filter" loop
"HAVING" will also use more memory (RAM)
Altho when working with small data - the difference is minor and can absolutely be ignored
"Having" is slower if we compare with large amount of data because it works on group of records and "WHERE" works on number of rows..
"Where" restricts results before bringing all rows and 'Having" restricts results after bringing all the rows
Both the statements will be having same performance as SQL Server is smart enough to parse both the same statements into a similar plan.
So, it does not matter if you use WHERE or HAVING in your query.
But, ideally you should use WHERE clause syntactically.