Which SQL statement is faster? (HAVING vs. WHERE...) - sql

SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
GROUP BY NR_DZIALU
HAVING NR_DZIALU = 30
or
SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
WHERE NR_DZIALU = 30
GROUP BY NR_DZIALU

The theory (by theory I mean the SQL standard) says that WHERE restricts the result set before the rows are grouped, while HAVING restricts it after all the rows have been brought in and grouped. So WHERE is faster. On DBMSs that follow the SQL standard in this regard, use HAVING only where you cannot put the condition in a WHERE (like computed columns in some RDBMSs).
You can just see the execution plan for both and check for yourself, nothing will beat that (measurement for your specific query in your specific environment with your data.)
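A minimal sketch of that "check for yourself" step, using Python's built-in sqlite3 module. SQLite only stands in for whatever DBMS you actually run, the sample rows are invented, and the plans it prints say nothing about MySQL or SQL Server; on those you would use their own EXPLAIN or execution plan tools.

```python
import sqlite3

# In-memory SQLite database; table contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRACOWNICY (ID INTEGER PRIMARY KEY, NR_DZIALU INTEGER)")
conn.executemany(
    "INSERT INTO PRACOWNICY (NR_DZIALU) VALUES (?)",
    [(10,), (20,), (20,), (30,), (30,), (30,)],
)

having_sql = """
    SELECT NR_DZIALU, COUNT(NR_DZIALU) AS LICZ_PRAC_DZIALU
    FROM PRACOWNICY
    GROUP BY NR_DZIALU
    HAVING NR_DZIALU = 30
"""
where_sql = """
    SELECT NR_DZIALU, COUNT(NR_DZIALU) AS LICZ_PRAC_DZIALU
    FROM PRACOWNICY
    WHERE NR_DZIALU = 30
    GROUP BY NR_DZIALU
"""

# Both forms return the same rows.
having_rows = conn.execute(having_sql).fetchall()
where_rows = conn.execute(where_sql).fetchall()
print(having_rows, where_rows)

# Print the plan SQLite chooses for each form (text varies by version).
for label, sql in (("HAVING", having_sql), ("WHERE", where_sql)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", step)
```

The results match; whether the plans match is exactly the engine-specific part you have to measure.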

It might depend on the engine. MySQL for example, applies HAVING almost last in the chain, meaning there is almost no room for optimization. From the manual:
The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization. (LIMIT is applied after HAVING.)
I believe this behavior is the same in most SQL database engines, but I can't guarantee it.

The two queries are equivalent and your DBMS query optimizer should recognise this and produce the same query plan. It may not, but the situation is fairly simple to recognise, so I'd expect any modern system - even Sybase - to deal with it.
HAVING clauses should be used to apply conditions on group functions; otherwise the conditions can be moved into the WHERE clause. For example, if you wanted to restrict your query to groups that have COUNT(NR_DZIALU) > 10, say, you would need to put the condition into a HAVING because it acts on the groups, not the individual rows.
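To illustrate a condition that genuinely belongs in HAVING, here is a small sketch with Python's sqlite3 module. The department sizes are invented, and the exact error raised for an aggregate in WHERE is SQLite's behaviour; other engines reject it with their own messages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRACOWNICY (ID INTEGER PRIMARY KEY, NR_DZIALU INTEGER)")
# Hypothetical data: department 10 has 12 employees, 20 has 3, 30 has 11.
rows = [(10,)] * 12 + [(20,)] * 3 + [(30,)] * 11
conn.executemany("INSERT INTO PRACOWNICY (NR_DZIALU) VALUES (?)", rows)

# The condition is on an aggregate, so it filters groups and needs HAVING.
big_groups = conn.execute(
    """
    SELECT NR_DZIALU, COUNT(NR_DZIALU) AS LICZ_PRAC_DZIALU
    FROM PRACOWNICY
    GROUP BY NR_DZIALU
    HAVING COUNT(NR_DZIALU) > 10
    ORDER BY NR_DZIALU
    """
).fetchall()
print(big_groups)

# The same condition in WHERE is rejected: WHERE sees rows, not groups.
where_rejected = False
try:
    conn.execute(
        "SELECT NR_DZIALU FROM PRACOWNICY "
        "WHERE COUNT(NR_DZIALU) > 10 GROUP BY NR_DZIALU"
    )
except sqlite3.Error as exc:
    where_rejected = True
    print("rejected:", exc)
```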

I'd expect the WHERE clause would be faster, but it's possible they'd optimize to exactly the same.

Saying they would optimize is not really taking control and telling the computer what to do. I would agree that HAVING is not an alternative to a WHERE clause. HAVING has a special use: it is applied to a GROUP BY where something like SUM() was used and you want to limit the result set to show only groups having, say, SUM() > 100. HAVING works on groups, WHERE works on rows. They are apples and oranges. So really, they should not be compared, as they are two very different animals.

"WHERE" is faster than "HAVING"!
The more complex the grouping in the query, the slower "HAVING" will be by comparison, because the "HAVING" filter has to deal with a larger number of results and is also an additional filtering pass.
"HAVING" will also use more memory (RAM).
Although when working with small data, the difference is minor and can safely be ignored.

"HAVING" is slower when we compare on a large amount of data because it works on groups of records, while "WHERE" works on individual rows.
"WHERE" restricts results before bringing all rows and "HAVING" restricts results after bringing all the rows.

Both statements will have the same performance, as SQL Server is smart enough to compile both statements into a similar plan.
So, it does not matter if you use WHERE or HAVING in your query.
But, ideally, you should use the WHERE clause, since it is the clause intended for row conditions.

Related

SQL Execution Order: does it exist or not?

I am really confused about the execution order in SQL. Basically, given any query (assume it's a complex one that has multiple JOINS, WHERE clauses, etc), is the query executed sequentially or not?
From the top answer at Order Of Execution of the SQL query, it seems like "SQL has no order of execution. ... The optimizer is free to choose any order it feels appropriate to produce the best execution time."
From the top answer at What's the execute order of the different parts of a SQL select statement?, in contrast, we see a clear execution order in the form
"
FROM
ON
OUTER
WHERE
...
"
I feel like I am missing something, but it seems as though the two posts are contradicting each other, and different articles online seem to support either one or the other.
But more fundamentally, what I wanted to know initially is this: Suppose we have a complex SQL query with multiple joins, INNER JOINs and LEFT JOINS, in a specific order. Is there going to be an order to the query, such that a later JOIN will apply to the result of an earlier join rather than to the initial table specified in the top FROM clause?
It is tricky. The short answer is: the DBMS will decide what order is best such that it produces the result that you have declared (remember, SQL is declarative, it does not prescribe how the query is to be computed).
But we can think of a "conceptual" order of execution that the DBMS will use to create the result. This conceptual order might be totally ignored by the DBMS, but if we (humans) follow it, we will get the same results as the DBMS. I see this as one of the benefits of a DBMS. Even if we suck and write an inefficient query, the DBMS might say, "no, no, this query you gave me sucks in terms of performance, I know how to do better" and most of the time, the DBMS is right. Sometimes it is not, and rewriting a query helps the DBMS find the "best" approach. This is very dependent on the DBMS of course...
This conceptual order helps us (humans) to understand how the DBMS executes a query. The steps are listed below.
First the order for non-aggregation:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the SELECT portion (report results, this is called projection).
If you use an aggregation function, without a group by then:
Do the FROM section. Includes any joins, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the aggregation function in the SELECT portion (converting all tuples of the result into one tuple). There is an implicit group by in this query.
If you use a group by:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Cluster subsets of the tuples according to the GROUP BY.
For each cluster of these tuples:
if there is a HAVING, do this predicate (similar to selection of the WHERE). Note that you can have access to aggregation functions.
For each cluster of these tuples output exactly one tuple such that:
Do the SELECT part of the query (similar to select in above aggregation, i.e. you can use aggregation functions).
Window functions happen during the SELECT stage (they take into consideration the set of tuples that would be output by the select at that stage).
There is one more kink:
if you have
select distinct ...
then, after everything else is done, remove duplicate tuples from the results (i.e. return a set of tuples, not a list).
Finally, do the ORDER BY. The ORDER BY happens in all cases at the end, once the SELECT part has been done.
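The conceptual order above can be walked through by hand and compared with what an engine returns. This is only a sketch: the emp table, its rows, and the thresholds are invented, SQLite (via Python's sqlite3) plays the DBMS, and window functions and DISTINCT are left out to keep it short.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (dept, salary).
emp = [("A", 900), ("A", 1500), ("A", 2000), ("B", 1200),
       ("B", 800), ("C", 3000), ("C", 1100), ("C", 1300)]

# 1. FROM: start with every tuple of the table.
tuples = list(emp)
# 2. WHERE: selection on individual tuples.
tuples = [t for t in tuples if t[1] > 1000]
# 3. GROUP BY: cluster the surviving tuples by dept.
clusters = defaultdict(list)
for dept, salary in tuples:
    clusters[dept].append(salary)
# 4. HAVING: a predicate on each cluster; aggregates are available here.
clusters = {d: s for d, s in clusters.items() if len(s) >= 2}
# 5. SELECT: output exactly one tuple per remaining cluster.
projected = [(d, len(s)) for d, s in clusters.items()]
# 6. ORDER BY: always last.
manual = sorted(projected, key=lambda t: (-t[1], t[0]))

# The same query, declared in SQL and left to the engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", emp)
sql_rows = conn.execute(
    """
    SELECT dept, COUNT(*) AS n
    FROM emp
    WHERE salary > 1000
    GROUP BY dept
    HAVING COUNT(*) >= 2
    ORDER BY n DESC, dept
    """
).fetchall()

print(manual)
print(sql_rows)
```

Following the conceptual steps by hand gives the same rows as the engine, whatever order the engine actually used internally.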
With respect to JOINS. As I mentioned above, they happen at the "FROM" part of the conceptual execution. The WHERE, GROUP BY, SELECT apply on the results of these operations. So you can think of these being the first phase of the execution of the query. If it contains a subquery, the process is recursive.
By the way, you can refer in an inner query to a relation in the outside context of the inner query, but not the other way around.
All of this is conceptual. In reality the DBMS might rewrite your query for the purpose of efficiency.
For example, assume R(a,b) and S(a,c), where S(a) is a foreign key that references R(a).
The query:
select b from R JOIN S using (a) where a > 10
can be rewritten by the DBMS to something similar to this:
select b FROM R JOIN (select a from s where a > 10) as T using (a);
or:
select b FROM (select * from R where a > 10) as T JOIN S using (a);
In fact, the DBMS does this all the time. It takes your query and creates alternative queries, then estimates the execution time of each one and decides which is most likely to be the fastest. And then it executes it.
This is a fundamental process of query evaluation. Note that the 3 queries are identical in terms of results. But depending on the sizes of the relations, they might have very different execution times. For example, if R and S are huge, but very few tuples have a > 10, then the join wastes time. Each query with a subselect might perform fast if that subselect matches very few tuples, but badly if it matches a lot of tuples. This is the type of "magic" that happens inside the query evaluation engine of the DBMS.
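A quick way to convince yourself that the three forms return the same tuples is to run them side by side. The sketch below uses Python's sqlite3 module with invented contents for R and S; it checks results only and says nothing about which form a given engine would pick or how fast each is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a INTEGER PRIMARY KEY, b TEXT)")
conn.execute("CREATE TABLE S (a INTEGER REFERENCES R(a), c INTEGER)")
# Hypothetical data; every S.a value also exists in R.a.
conn.executemany("INSERT INTO R VALUES (?, ?)",
                 [(5, "x"), (11, "y"), (12, "z"), (20, "w")])
conn.executemany("INSERT INTO S VALUES (?, ?)",
                 [(5, 1), (11, 2), (11, 3), (20, 4)])

queries = [
    # The query as written.
    "SELECT b FROM R JOIN S USING (a) WHERE a > 10",
    # Filter pushed down into S before the join.
    "SELECT b FROM R JOIN (SELECT a FROM S WHERE a > 10) AS T USING (a)",
    # Filter pushed down into R before the join.
    "SELECT b FROM (SELECT * FROM R WHERE a > 10) AS T JOIN S USING (a)",
]

# Sort each result so the comparison ignores row order.
results = [sorted(row[0] for row in conn.execute(q)) for q in queries]
for q, r in zip(queries, results):
    print(r, "<-", q)
```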
You are confusing Order of execution with Logical Query Processing.
I did a quick google search and found a bunch of articles referring to Logical Query Processing as "order of execution". Let's clear this up.
Logical Query Processing
Logical Query Processing details the under-the-hood processing phases of a SQL Query... First the FROM clause is evaluated for the optimizer to know where to get data from, then table operators, etc.
Understanding this will help you better design and tune queries. Logical query processing order will help you understand why you can reference a column by its alias in an ORDER BY clause but not anywhere else.
Order of Execution
Consider this WHERE clause:
WHERE t1.Col1 = 'X'
AND t2.Col2 = 1
AND t3.Col3 > t2.Col4
The optimizer is not required to evaluate these predicates in any order; it can evaluate t2.Col2 = 1 first, then t1.Col1 = 'X'... The optimizer, in some cases, can evaluate joins in a different order than you have presented in your query. When predicate logic dictates that the result will be the same, it is free to make (what it considers) the best choices for optimal performance.
Sadly there is not a lot about this topic out there. I do discuss this a little more here.
First there's the SQL query and the rules of SQL that apply to it. That's what in the other answers is referred to as "Logical query processing". With SQL you specify a result. The SQL standard does not allow you to specify how this result is reached.
Then there's the query optimizer. Based on statistics, heuristics, amount of available CPU, memory and other factors, it will determine the execution plan. It will evaluate how long the execution is expected to take. It will evaluate different execution plans to find the one that executes fastest. In that process, it can evaluate execution plans that use different indexes, and/or rearranges the join order, and/or leave out (outer) joins, etc. The optimizer has many tricks. The more expensive the best execution plan is expected to be, the more (advanced) execution plans will be evaluated. The end result is one (serial) execution plan and potentially a parallel execution plan.
All the evaluated execution plans will guarantee the correct result; the result that matches execution according to the "Logical query processing".
Finally, there's the SQL Server engine. After picking either the serial or parallel execution plan, it will execute it.
The other answers, whilst containing useful and interesting information, risk causing confusion in my view.
They all seem to introduce the notion of a "logical" order of execution, which differs from the actual order of execution, as if this is something special about SQL.
If someone asked about the order of execution of any ordinary language besides SQL, the answer would be "strictly sequential" or (for expressions) "in accordance with the rules of that language". I feel as though we wouldn't be having a long-winded exploration about how the compiler has total freedom to rearrange and rework any algorithm that the programmer writes, and distinguishing this from the merely "logical" representation in the source code.
Ultimately, SQL has a defined order of evaluation. It is the "logical" order referred to in other answers. What is most confusing to novices is that this order does not correspond with the syntactic order of the clauses in an SQL statement.
That is, a simple SELECT...FROM...WHERE...ORDER BY query would actually be evaluated by taking the table referred to in the from-clause, filtering rows according to the where-clause, then manipulating the columns (including filtering, renaming, or generating columns) according to the select-clause, and finally ordering the rows according to the order-by-clause. So clauses here are evaluated second, third, first, fourth, which is a disorderly pattern to any sensible programmer - the designers of SQL preferred to make it correspond more in their view to the structure of something spoken in ordinary English ("tell me the surnames from the register!").
Nevertheless, when the programmer writes SQL, they are specifying the canonical method by which the results are produced, same as if they write source code in any other language.
The query simplification and optimisation that database engines perform (like that which ordinary compilers perform) would be a completely separate topic of discussion, if it hadn't already been conflated. The essence of the situation on this front, is that the database engine can do whatever it damn well likes with the SQL you submit, provided that the data it returns to you is the same as if it had followed the evaluation order defined in SQL.
For example, it could sort the results first, and then filter them, despite this order of operations being clearly different to the order in which the relevant clauses are evaluated in SQL. It can do this because if you (say) have a deck of cards in random order, and go through the deck and throw away all the aces, and then sort the deck into standard order, the outcome (in terms of the final content and order of the deck) is no different than if you sort the deck into standard order first, and then go through and throw away all the aces. But the full details and rationale of this behaviour would be for a separate question entirely.
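The deck-of-cards argument is easy to check mechanically. A small Python sketch follows; the rank and suit ordering used as "standard order" is an assumption made for the example.

```python
import random

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]


def standard_order(card):
    """Sort key: by suit first, then by rank within the suit."""
    rank, suit = card
    return (SUITS.index(suit), RANKS.index(rank))


# A full deck in random order (fixed seed so the run is repeatable).
deck = [(rank, suit) for suit in SUITS for rank in RANKS]
random.Random(0).shuffle(deck)

# "WHERE then ORDER BY": throw away the aces, then sort.
filter_then_sort = sorted((c for c in deck if c[0] != "A"), key=standard_order)

# "ORDER BY then WHERE": sort first, then throw away the aces.
sort_then_filter = [c for c in sorted(deck, key=standard_order) if c[0] != "A"]

print(filter_then_sort == sort_then_filter)
```

Both routes end with the same 48 cards in the same order, which is why an engine is free to pick whichever is cheaper.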

Does the number of columns used for a CTE affect the performance of the query?

Does using more columns within a CTE query affect performance? I am currently trying to execute a query with the WITH clause, and it seems that if I use more columns, it takes more time to load the data. Am I correct?
The number of columns defined in a CTE should have no effect on the actual performance of the query (it might affect the compile-time, which is generally minuscule).
Why? Because SQL Server "embeds" the code for the CTE in the query itself and then optimizes all the code together. Unused columns should be eliminated.
This might be an over generalization. There might be some cases where SQL Server doesn't eliminate the work for columns -- such as extra aggregation functions in an aggregation query or certain subqueries. But, in general, what is important is how the CTE is used, not how many columns are defined in it.
You can think of a CTE as a view that is not materialized to disk. A view expands its definition at run time, and the same goes for a CTE.

How to avoid duplicated SELECT phrases in SQL (MariaDB)

I am working with a small MariaDB database. To extract time intervals per user, I use the following query:
SELECT
SUM(TIMESTAMPDIFF(SECOND,Activity.startTime,Activity.endTime)) AS seconds,
TIME_FORMAT(SEC_TO_TIME(SUM(TIMESTAMPDIFF(SECOND,Activity.startTime,Activity.endTime))),'%Hh %im %ss') AS formattedTime,
User.name
FROM Activity
INNER JOIN User ON User.id = Activity.userID
GROUP BY User.id
ORDER BY seconds DESC;
I have to select the time as plain seconds (... AS seconds) to be able to order the results by it, as can be seen in my query.
However, I also want MariaDB to format the time interval, for that I use the TIME_FORMAT function. The problem is, I have to duplicate the whole SUM(...) phrase inside the TIME_FORMAT call again. This doesn't seem very elegant. Will MariaDB recognize the duplication and calculate the SUM only once? Also, is there a way to get the same result without duplicating the SUM?
I figured this should be possible with a nested query construct like so:
SELECT
innerQuery.name,
innerQuery.seconds,
TIME_FORMAT(SEC_TO_TIME(innerQuery.seconds), '%Hh %im')
FROM (
-- Do the sum here, once.
) AS innerQuery
ORDER BY innerQuery.seconds DESC;
Is this the best way to do it / "ok" to do?
Note: I don't need the raw seconds in the result, only the formatted time is needed.
I'd appreciate help, thanks.
Alas. There isn't a really good solution. When you use a subquery, then MariaDB materializes the subquery (as does MySQL). Your query is rather complex, so there is a lot of I/O happening anyway, so the additional materialization may not be important.
Repeating the expression is really more an issue of aesthetics than performance. The expression will be re-executed multiple times. But, the real expense of doing aggregations is the file sort for the group by (or whatever method is used). Doing the sum() twice is not a big deal (unless you are calling a really expensive function as well as the aggregation function).
Other database engines do not automatically materialize subqueries, so using a subquery in other databases is usually the recommended approach. In MariaDB/MySQL, I would guess that repeating the expression is more efficient, although you can try both on your data and report back.
In this case, you don't need the raw values. The formatted value will work correctly in the ORDER BY.
Your subquery idea is likely to be slower because of all the overhead in having two queries.
This is a Rule of Thumb: It takes far more effort for MySQL to fetch a row than to evaluate expressions in the row. With that rule, duplicate expressions are not a burden.

using distinct command

Is using the DISTINCT command in SQL good practice or not? Are there any drawbacks to DISTINCT?
It depends entirely on what your use case is. DISTINCT is useful in certain circumstances, but it can be overused.
The drawbacks are mainly increased load on the query engine to perform the sort (since it needs to compare the resultset to itself to remove duplicates), and it can be used to mask an issue in your data - if you are getting duplicates there may be a problem with your source data.
The command itself isn't inherently good or bad. You can use a screwdriver to hammer a nail, but that doesn't mean it's a good idea, or that screwdrivers are bad in all cases.
If you need to use it regularly to get the correct output then you have a design or JOIN issue.
It's perfectly valid for use otherwise.
It is a kind of aggregate though: the equivalent to a GROUP BY on all output columns. So it is an extra step in query processing.
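The DISTINCT / GROUP BY equivalence can be checked directly. A minimal sketch with Python's sqlite3 module; the visits table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (customer TEXT, city TEXT)")
# Hypothetical data containing duplicate rows.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("ann", "Oslo"), ("ann", "Oslo"), ("bob", "Rome"),
     ("ann", "Rome"), ("bob", "Rome")],
)

# DISTINCT over the output columns...
distinct_rows = sorted(
    conn.execute("SELECT DISTINCT customer, city FROM visits").fetchall()
)
# ...returns the same rows as grouping by every output column.
grouped_rows = sorted(
    conn.execute(
        "SELECT customer, city FROM visits GROUP BY customer, city"
    ).fetchall()
)

print(distinct_rows)
print(grouped_rows)
```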
From this http://www.mindfiresolutions.com/Think-Before-Using-Distinct-Command-Arbitarily-1050.php
Sometimes it is seen if the beginners are getting some duplicates in their resultset then they are using DISTINCT. But this has its own disadvantages.
Distinct decreases the query's performance. Because the normal procedure is sorting the results and then removing rows that are equal to the row immediately before it.
DISTINCT compares between all fields of the record. So DISTINCT increases computation.
It is part of the language, so should be used.
In some circumstances using DISTINCT may cause a table scan where otherwise one would not occur.
You will need to test for each of your own use cases to see if there is an impact and find a workaround if the impact is unacceptable.
If you want the work to make sure the results are distinct to happen inside the SQL server on the SQL machine, then use it. If you don't mind sending extra results to the client and doing the work there (to reduce server load) then do that. It depends on your performance requirements and the characteristics of your database.
For example, if it's extremely unlikely that distinct will reduce the result set much, and you don't have the right columns indexed to make it fast, and you need to reduce SQL Server load, and you have spare cycles on the client, and it's easy to ensure distinctness on the client -- then you might want to do that.
That's a lot of ifs, ands, and mights. If you don't know -- just use it.

Transact-SQL - sub query or left-join?

I have two tables containing Tasks and Notes, and want to retrieve a list of tasks with the number of associated notes for each one. These two queries do the job:
select t.TaskId,
(select count(n.TaskNoteId) from TaskNote n where n.TaskId = t.TaskId) 'Notes'
from Task t
-- or
select t.TaskId,
count(n.TaskNoteId) 'Notes'
from Task t
left join
TaskNote n
on t.TaskId = n.TaskId
group by t.TaskId
Is there a difference between them and should I be using one over the other, or are they just two ways of doing the same job? Thanks.
On small datasets they are a wash when it comes to performance. When indexed, the left outer join is a little better.
I've found on large datasets that an outer join (an inner join will work too, if tasks without notes can be left out) will outperform the subquery by a very large factor (sorry, no numbers).
In most cases, the optimizer will treat them the same.
I tend to prefer the second, because it has less nesting, which makes it easier to read and easier to maintain. I have started to use SQL Server's common table expressions to reduce nesting as well for the same reason.
In addition, the second syntax is more flexible if there are further aggregates which may be added in the future in addition to COUNT, like MIN(some_scalar), MAX(), AVG() etc.
The subquery will be slower as it is being executed for every row in the outer query. The join will be faster as it is done once. I believe that the query optimiser will not rewrite this query plan as it can't recognize the equivalence.
Normally you would do a join and group by for this sort of count. Correlated subqueries of the sort you show are mainly of interest if they have to do some grouping or more complex predicate on a table that is not participating in another join.
If you're using SQL Server Management Studio, you can enter both versions into the Query Editor and then right-click and choose Display Estimated Execution Plan. It will give you two percentage costs relative to the batch. If they're expected to take the same time, they'll both show as 50% - in which case, choose whichever you prefer for other reasons (easier to read, easier to maintain, better fit with your coding standards etc). Otherwise, you can pick the one with the lower percentage cost relative to the batch.
You can use the same technique to look at changing any query to improve performance by comparing two versions that do the same thing.
Of course, because it's a cost relative to the batch, it doesn't mean that either query is as fast as it could be - it just tells you how they compare to each other, not to some notional optimum query to get the same results.
There's no clear-cut answer on this. You should view the SQL Plan. In terms of relational algebra, they are essentially equivalent.
I make it a point to avoid subqueries wherever possible. The join will generally be more efficient.
You can use either, and they are semantically identical. In general, the rule of thumb is to use whichever form is easier for you to read, unless performance is an issue.
If performance is an issue, then experiment with rewriting the query using the other form. Sometimes the optimizer will use an index for one form, and not the other.
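For anyone who wants to try both forms, here is a small sketch that runs them against the same data with Python's sqlite3 module. The table contents are invented, the T-SQL style 'Notes' alias is written as AS Notes, and an ORDER BY is added so the two results can be compared row for row; SQLite's plans will not match SQL Server's, so treat this as a check of the results only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Task (TaskId INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE TaskNote (TaskNoteId INTEGER PRIMARY KEY, TaskId INTEGER)"
)
# Hypothetical data: task 1 has two notes, task 2 none, task 3 one.
conn.executemany("INSERT INTO Task VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO TaskNote (TaskId) VALUES (?)", [(1,), (1,), (3,)])

subquery_sql = """
    SELECT t.TaskId,
           (SELECT COUNT(n.TaskNoteId)
            FROM TaskNote n
            WHERE n.TaskId = t.TaskId) AS Notes
    FROM Task t
    ORDER BY t.TaskId
"""
join_sql = """
    SELECT t.TaskId, COUNT(n.TaskNoteId) AS Notes
    FROM Task t
    LEFT JOIN TaskNote n ON t.TaskId = n.TaskId
    GROUP BY t.TaskId
    ORDER BY t.TaskId
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(subquery_rows)
print(join_rows)
```

Note that task 2 still appears with a count of 0 in both forms, because COUNT(n.TaskNoteId) ignores the NULL produced by the left join.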