What is the semantic difference between WHERE and HAVING? - sql

Let's put GROUP BY aside for a second. In normal queries (without GROUP BY), what is the semantic difference? Why does this answer work? (put an alias in a HAVING clause instead of WHERE)

HAVING operates on the summarized row - WHERE is operating on the entire table before the GROUP BY is applied. (You can't put GROUP BY aside, HAVING is a clause reserved for use with GROUP BY - leaving out the GROUP BY doesn't change the implicit action that is occurring behind the scenes).
It's also important to note that because of this, WHERE can use an index while HAVING cannot. (In super trivial un-grouped result sets you could theoretically use an index for HAVING, but I've never seen a query optimizer actually implemented in this way).

MySQL evaluates the query up to and including the WHERE clause, then filters it with the HAVING clause. That's why HAVING can recognize column aliases, whereas WHERE can't.
By omitting the GROUP BY clause, I believe you simply tell the query not to group any of your results.

Very broadly, WHERE filters the data going into the query (the DB tables), while HAVING filters the output of the query.
Statements in the WHERE clause can only refer to the tables (and other external data sources), while statements in the HAVING clause can only refer to the data produced by the query.

Related

Performance of ISNULL() in GROUP BY clause SQL

I have been refactoring some old queries recently and noticed that a lot of them repeat ISNULL() in the GROUP BY clause, where it is used in the SELECT clause. I feel in my bones that removing the ISNULL() in the GROUP BY clause will improve performance, but I can't find any documentation on whether it is actually likely to or not. Here's the sort of thing I mean:
SELECT
ISNULL(Foo,-1) AS Foo
,ISNULL(Bar,-1) AS Bar
,SUM(This) AS This
,SUM(That) AS That
FROM
dbo.ThisThatTable AS ThisThat
LEFT JOIN dbo.FooBarTable AS FooBar ON ThisThat.FooBarId = FooBar.Id
GROUP BY
ISNULL(Foo,-1)
,ISNULL(Bar,-1);
GO
The above is the way I keep coming across - When there is grouping on the Foo column, the SELECT and the GROUP BY for selected columns match exactly. The example below is a possible alternative - some possibly unnecessary ISNULL() calls have been removed, and the SELECT and GROUP BY clauses no longer match.
SELECT
ISNULL(Foo,-1) AS Foo
,ISNULL(Bar,-1) AS Bar
,SUM(This) AS This
,SUM(That) AS That
FROM
dbo.ThisThatTable AS ThisThat
LEFT JOIN dbo.FooBarTable AS FooBar ON ThisThat.FooBarId = FooBar.Id
GROUP BY
Foo
,Bar;
GO
I suppose maybe when the SELECT and GROUP BY clauses match, the optimiser only has to do the ISNULL() calculation once to know what is going on, so it might be theoretically more performative to group by the results that are actually selected? Alternatively, maybe it is better to avoid adding a second set of ISNULL() calls that don't change the granularity of the data at all... Maybe the optimiser is clever enough to realise that the NULLS in the grouping are (in this case) -1s in the selection...?
I personally would prefer removing any unnecessary functions, especially once that might affect index usage but when I look online, the references to performance are all like the answers here, about using ISNULL() in the WHERE clause, which I already know to avoid.
I also suspect that any gains are going to be vanishingly small, so this is really asking for an academic or theoretical answer, but as I work, I keep wondering and it bugs me, so I thought I would ask if anyone has any thoughts.
Non-aggregated columns in SELECT clauses generally must precisely match the ones in GROUP BY clauses. If I were you, and I were dealing with tested production code, I would not make the change you propose.
Edit the match between non-aggregated SELECT columns and GROUP BY columns is necessary for GROUP BY. If the columns in SELECT are 1:1 dependent on the columns in GROUP BY, it will work. Otherwise the results are ambiguous.
Internally, SQL does not really have two copies of each ISNULL. They are all flattened together in the internal tree used during compilation. So, this level of optimization is not useful to consider in SQL Server. A query without any ISNULL in it would probably perform a bit faster and potentially a lot faster depending on the rest of the schema and query. However, the ISNULL in the select list and the GROUP BY list are not executed twice in most cases within SQL - this level of detail can show up in showplan, but it's often below the level of detail most people would care to examine.
There are a few different aspects to consider here:
Referring to the same value multiple times in the same scope
In most situations, the optimizer is clever enough to collapse these into calculating them once. The fact that you have a GROUP BY over them makes this even more likely.
Is it faster to group when the value is guaranteed to not be null?
Possibly, although I doubt the difference is measurable.
The SELECT does not have to match exactly, it only needs to be functionally dependent on GROUP BY columns and aggregation functions. It may not be functionally dependent on any other columns.
The most important thing top consider: indexing.
This is much, much more important than the other considerations. When grouping, if you can hit an index then it will go much faster, because it can remove sorting and just use Stream Aggregate. This is not possible if you use ISNULL in the GROUP BY (barring computed columns or indexed views).
Note that your results will not be the same: the first example collapses the NULL group into the -1 group. The second example does not, so you may want to remove the ISNULL from the SELECT also, in order to differentiate them. Alternatively, put a WHERE ... IS NOT NULL instead.

SQL "Order of execution" vs "Order of writing"

I am a new learner of SQL language to add knowledge to my career, I came to learn that in writing a query, there is a "Order of writing" vs "Order of execution", however I can't seem to find a full list of available SQL functions listing out the hierarchy
So far from what I learn I got this table, can someone with better knowledge help confirm if my table below is correct? And perhaps add any other functions that I might have missed, I am not sure where I should put the JOIN in the table below
Also, is there a difference (either in order or name of function) if I am using different Sql platforms?
MySql vs BigQuery for eg.
Your help is deeply appreciated, big thanks in advance for reading this post by a beginner
Order of writing
Order of execution
Select
From
Top
Where
Distinct
Group by
From
Having
Where
Select
Group by
Window
Having
QUALIFY
Order by
Distinct
Second
Order by
QUALIFY
Top
Limit
Limit
SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.
The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.
You are confusing execution with compilation and probably you just care about scoping rules.
So, to start with SQL has a set of clauses and these are in a very specified order. Your question contains this ordering -- at least for a database that supports those clauses.
The second part is the ordering for identifying identifiers. Basically, this comes down to:
Table aliases are defined in the FROM clause. So this can be considered as "first" for scoping purposes.
Column aliases are defined in the SELECT clause. By the SQL Standard, column aliases can be used in the ORDER BY. Many databases extend this to the QUALIFY (if supported), HAVING, and GROUP BY clauses. In general, databases do not support them in the WHERE clause.
If two tables in the FROM have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in a JOIN and the USING clause is used. Then the unqualified column name is fine.
If a column alias defined in the SELECT conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.
The whole point of SQL is that it is a 'whole set' language and there is no particular set order to much of it. Today's DBMS evaluates each Select query as a whole to determine the best, most efficient way to assemble the data set results, in much the same way that Google Maps might determine the best path to get you home based both on where you are and ambient traffic.
Databases will provide, under their Explain Plan command, exactly the sequence they will use to process your query. This called the Execution Plan. Each of these steps are performed on entire table sets and where possible under parallel processes. The steps in each plan do not have any of your names listed above, instead a step might say "perform an index scan on table A", or "perform a nested loops join on the prior partial result set and table B". In some cases they will filter records before joining and in other cases they won't, for example.
Within those parameters there are some tasks that always come before others. For example, all Where clause filtering takes place before aggregation and summary filtering (Having clause). But there are few absolute rules here.
When writing SQL, I found that the execution order of the select statement is not the same as the order of writing.
The order in which SQL query statements are written is
SELECT
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
But in fact the order of execution of the SQL statement is
FROM
WHERE
GROUP BY
HAVING
SELECT
UNION
ORDER BY
SQL will first choose where my table is selected, including the table's restrictions, (such as connection mode JOIN and restrictions ON)
SQL will choose what my judgment condition is, that is, the problem of WHERE
Then it will group by grouping and execute the HAVING statement.
SELECT statement is executed after most of the statements are executed, so we must understand that the statement executed in front of it will affect it, and pay attention to the actual work. This is especially important.
With the execution order of the statement we can find that order by the last execution, so we can sort the new fields named in select.

Does the number of columns used for a CTE affects the performance of the query?

Using more columns within a CTE query affects the performance? I am currently trying to execute a query with the WITH sentence, and it seams that if I use more colum,s, it takes more time to load the data. Am I correct?
The number of columns defined in a CTE should have no effect on the actual performance of the query (it might affect the compile-time, which is generally miniscule).
Why? Because SQL Server "embeds" the code for the CTE in the query itself and then optimizes all the code together. Unused columns should be eliminated.
This might be an over generalization. There might be some cases where SQL Server doesn't eliminate the work for columns -- such as extra aggregation functions in an aggregation query or certain subqueries. But, in general, what is important is how the CTE is used, not how many columns are defined in it.
You can think of CTE as a View but it doesnt materialize to Disk.So A view expands it definition at run time ,same goes for CTE.

Why do I need to explicitly specify all columns in a SQL "GROUP BY" clause - why not "GROUP BY *"?

This has always bothered me - why does the GROUP BY clause in a SQL statement require that I include all non-aggregate columns? These columns should be included by default - a kind of "GROUP BY *" - since I can't even run the query unless they're all included. Every column has to either be an aggregate or be specified in the "GROUP BY", but it seems like anything not aggregated should be automatically grouped.
Maybe it's part of the ANSI-SQL standard, but even so, I don't understand why. Can somebody help me understand the need for this convention?
It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.
SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.
When you write a query the most important consideration is that it yields correct results. If you made a mistake, it's probably better that the SQL parser informs you, rather than making a guess about your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps how to retrieve it) already makes it easy to inadvertently make mistakes. Introducing fuzziniess into the language syntax would not make this better.
In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins - where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time) - the semantics of existing queries changes with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.
One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can result in having to write more code than you may like, and gripe about why certain constructs are so verbose ... but at the end of the day - it is what it is.
The only logical reason I can think of to keep the GROUP BY clause as it is that you can include fields that are NOT included in your selection column in your grouping.
For example.
Select column1, SUM(column2) AS sum
FROM table1
GROUP BY column1, column3
Even though column3 is not represented elsewhere in the query, you can still group the results by it's value. (Of course once you have done that, you cannot tell from the result why the records were grouped as they were.)
It does seem like a simple shortcut for the overwhelmingly most common scenario (grouping by each of the non-aggregate columns) would be a simple yet effective tool for speeding up coding.
Perhaps "GROUP BY *"
Since it is already pretty common in SQL tools to allow references to the columns by result column number (ie. GROUP BY 1,2,3, etc.) It would seem simpler still to be able to allow the user to automatically include all the non-aggregate fields in one keystroke.
It's simple just like this: you asked to sql group the results by every single column in the from clause, meaning for every column in the from clause SQL, the sql engine will internally group the result sets before to present it to you. So that explains why it ask you to mention all the columns present in the from too because its not possible group it partially. If you mentioned the group by clause that is only possible to sql achieve your intent by grouping all the columns as well. It's a math restriction.

Which SQL statement is faster? (HAVING vs. WHERE...)

SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
GROUP BY NR_DZIALU
HAVING NR_DZIALU = 30
or
SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
WHERE NR_DZIALU = 30
GROUP BY NR_DZIALU
The theory (by theory I mean SQL Standard) says that WHERE restricts the result set before returning rows and HAVING restricts the result set after bringing all the rows. So WHERE is faster. On SQL Standard compliant DBMSs in this regard, only use HAVING where you cannot put the condition on a WHERE (like computed columns in some RDBMSs.)
You can just see the execution plan for both and check for yourself, nothing will beat that (measurement for your specific query in your specific environment with your data.)
It might depend on the engine. MySQL for example, applies HAVING almost last in the chain, meaning there is almost no room for optimization. From the manual:
The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization. (LIMIT is applied after HAVING.)
I believe this behavior is the same in most SQL database engines, but I can't guarantee it.
The two queries are equivalent and your DBMS query optimizer should recognise this and produce the same query plan. It may not, but the situation is fairly simple to recognise, so I'd expect any modern system - even Sybase - to deal with it.
HAVING clauses should be used to apply conditions on group functions, otherwise they can be moved into the WHERE condition. For example. if you wanted to restrict your query to groups that have COUNT(DZIALU) > 10, say, you would need to put the condition into a HAVING because it acts on the groups, not the individual rows.
I'd expect the WHERE clause would be faster, but it's possible they'd optimize to exactly the same.
Saying they would optimize is not really taking control and telling the computer what to do. I would agree that the use of having is not an alternative to a where clause. Having has a special usage of being applied to a group by where something like a sum() was used and you want to limit the result set to show only groups having a sum() > than 100 per se. Having works on groups, Where works on rows. They are apples and oranges. So really, they should not be compared as they are two very different animals.
"WHERE" is faster than "HAVING"!
The more complex grouping of the query is - the slower "HAVING" will perform to compare because: "HAVING" "filter" will deal with larger amount of results and its also being additional "filter" loop
"HAVING" will also use more memory (RAM)
Altho when working with small data - the difference is minor and can absolutely be ignored
"Having" is slower if we compare with large amount of data because it works on group of records and "WHERE" works on number of rows..
"Where" restricts results before bringing all rows and 'Having" restricts results after bringing all the rows
Both the statements will be having same performance as SQL Server is smart enough to parse both the same statements into a similar plan.
So, it does not matter if you use WHERE or HAVING in your query.
But, ideally you should use WHERE clause syntactically.