Why is GROUP BY faster than DISTINCT in SQL?

Why is GROUP BY faster than DISTINCT in SQL? - sql

This article explains that GROUP BY is faster than DISTINCT
https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
Why doesn't the implementation of DISTINCT just use the same underlying logic if it's faster?

A very key part of this article – easily overlooked – is the reference to "execution plans." As presented by the EXPLAIN *query* verb.
Every time you present an SQL query to the database engine, it "compiles it" into an "execution plan," which is what the underlying logic actually "executes" to give you the results that you want. The formulation of this plan is entirely "case by case," and it can be based on quite a number of things, including table sizes and "statistics" which the engine is constantly accumulating about the data which each table contains.
(The EXPLAIN verb will present this plan-information to you, for any query. The format, content, and purpose of it is entirely engine-dependent, but the principle is not.)
Therefore, the "take-away" that I suggest that you take from this article is not that GROUP BY is [always ...] "faster than DISTINCT," but rather, how the engine goes about its business such that [sometimes] this is the case. It goes without saying that "every engine is different," but the principles governing all of them, by now, are more-or-less the same.
Whether or not "this actually turns out to be true, for you, with your data, in your case," is less important than your understanding of what actually happens behind-the-scenes when you submit any SQL query to a database engine. There's much more to it than meets the eye ...

Related

Compound "OR" evaluation in DB2

I've searched the forums and found a few related threads but no definitive answer.
(case
when field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%'...
In the above statement, if field1 is "like" t001, will the others even be evaluated?
(case
when (field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%')...
Does adding parenthesis as shown above change the evaluation?

In general, databases short-circuit boolean operations. That is, they stop at the first value that defines the result -- the first "true" for OR, the first "false" for AND.
That said, there are no guarantees. Nor are there guarantees about the order of evaluation. So, DB2 could decide to test the middle, the last, and then the first. That said, these are pretty equivalent, so I would expect the ordering to be either first to last or last to first.
Remember: SQL is a descriptive language, not a procedural language. A SQL query describers the result set, but not the steps used to generate it.

You don't know.
SQL is a declarative language, not an imperative one. You describe what you want, the engine provides it. The database engine will decide in which sequence it will evaluate those predicates, and you don't have control over that.
If you get the execution plan today it may show one sequence of steps, but if you get it tomorrow it may show something different.

Not strictly answering your question, but if you have many of these matches, a simpler, possibly faster, and easier to maintain solution would be to use REGEXP_LIKE. The example you've posted could be written like this:
CASE WHEN REGEXP_LIKE(field1, '.*T(0|2|3)01.*') ...

Just for indication how it really works in this simple case.
select
field1
, case
when field1 LIKE '%T001%' OR RAISE_ERROR('75000', '%T201%') LIKE '%T201%' then 1 else 0
end as expr
from
(
values
'abcT001xyz'
--, '***'
) t (field1);
The query returns expected result for the statement above as is.
But if you uncomment the commented out line, you get SQLCODE=-438.
This means, that 2-nd OR operand is not evaluated, if 1-st returns true.
Note, that it's just for demonstration. There is no any guarantee, that this will work in such a way in any case.

Just to add to some points made about the difference between so-called procedural languages on the one hand, and SQL (which is variously described as a declarative or descriptive language) on the other.
SQL defines a set of relatively high-level operators for working with arrays of data. They are "high-level" in this sense because they work with arrays in a concise fashion that has not been typical of general purpose or procedural languages. Like all operators (whether array-based or not), there is typically more than one algorithm capable of implementing an operator.
In contrast to "general purpose" programming languages to which the vast majority of programmers are exposed, the existence of these array operators - in particular, the ability to combine them algebraically into an expression which defines a composite operation (or a query), and the absence of any explicit algorithms for iteration - was once a defining feature of SQL.
The distinction is less sharp nowadays, with a resurgent interest in functional languages and features, but most still view SQL as a beast of its own kind amongst commercially-popular tooling.
It is often said that in SQL you define what results you want not how to get them. But this is true for every language. It is true even for machine code operators, if you account for how the implementation in circuitry can be varied - and does vary, between CPU designs. It is certainly true for all compiled languages, where compilers employ many different machine code algorithms to actually implement operations specified in the source code - loop unrolling, for one example.
The feature which continues to distinguish SQL (and the relational databases which implement it), is that the algorithm which implements an operation is determined at the time of each execution, and it does this not just by algebraic manipulation of queries (which is not dissimilar to what compilers do), but also by continuously generating and analysing statistics about the data being operated upon and the consequences of previous executions.
Put another way, the database execution engine is engaged in a constant search for the best practical algorithms (and combinations thereof) to implement its overall workload. It is capable of accomodating not just past experience, but of reacting to changes (such as in data volumes, in degree of concurrency and transactional conflict, or in systemic constraints like available memory or overall workload).
The upshot of all this is that there is a specific order of evaluation in SQL, just like any other language. It is this order which defines a correct result. But unless written in so-called RBAR style (and even then, but to a more limited extent...), the database engine has tremendous latitude to implement shortcuts and performance optimisations, provided these do not change the final result.
Many operators fall into a class where it is possible to determine the result in many cases without evaluating all operands. I'm not sure what the formal word is to describe this property - partial evaluativity, maybe - but casually it is called short-circuiting. The OR operator has this property.
Another property of the OR operation is that it is associative. That is, the order in which a series of them are applied does not matter - it behaves like the addition operator does, where you can add numbers in any order without affecting the result.
With a series of OR conditions, these are capable of being reordered and partially evaluated, provided the evaluation of any particular operand does not cause side-effects or depend on hidden variables. It is therefore quite likely that the database engine may reorder or partially evaluate them.
If the operands do cause side-effects or depend on hidden variables (functions which get the current date or time being a prime example of the latter), these often cause problems in queries - either because the database engine does not realise they have side-effects or hidden variables, or because the database does realise it but doesn't handle the case in the way the programmer expects. In such cases, a query may have to be completely rewritten (typically, cracked into multiple statements) to force a specific evaluation order or guarantee full evaluation.

How far does optimization in sql queries go?

My background is application programming and there is a guideline that says to not try to "outthink" the compiler e.g. JIT etc when it comes to optimization.
Is this the case also with SQL queries?
I mean I have read that the SQL servers do some kind of execution plan for a query that is expected to be optimal (right?) but do the rearange/modify the actual queries?
Or is the programmer expected to make sure the queries are optimal? E.g. first select and then join etc

My experience, which includes working for a database server vendor, is as follows.
First, databases have been highly optimized, and compiled to machine code (often written in C or C++). On modern equipment most operations are so fast, that sub-optimal execution won't be noticed.
However, there are some areas to be aware of.
If you have no indexes, then the database has to do a table scan and that can be slow. Many people only put one field into an index, but you should consider multiple fields as they apply. The explain utilities are there to show you what index it found, and suggest what index would help.
Co-related queries can be slow. Also when you have a where clause with an expression, the database has to evaluate that for each record, and cannot use an index.
Opening a connection is slow, so be sure to reuse the connection and not re-open it for every operation.
However, the biggest issue today is typically the network communication between the database client and the database server. Try to minimize the network turns to the database, and have the database filter results so less data needs to be sent over the network.

There are things that you want to let the Database do, and there are things that only people can do. Database Management cannot be left up to the database itself. People have to be involved.
Database Optimization is both an art and a science. The Database does a great job of optimizing queries by selecting the best index from those that are already created. However, databases don't automatically create the best indexes. It is the job of a DBA/Programmer to determine what the best indexes are.
An index may make the query extremely fast, but it may require 1 GB of memory. That is not an index you generally want to add. A person can look at the query, though, and realize that a slight reformatting of the query is all that is needed.
A developer with knowledge of the data itself is equipped to make good decisions on what indexes to use and such. It is also good to review your indexes to see if some of them are even being used. Sometimes indexes are created and never used by the database, because a different index is always better or a search is never run that needs the index.
So, databases make great decisions on how to run queries most efficiently based on the indexes that they already have, but it is our job to analyze whether or not the databases have the right indexes and take appropriate action.

In general, the advice is good. Many more person-years of development go into the creation of the optimization engine than you are going to manage.
That said, there are definitely pitfalls with every database. In some cases, you need to express certain logic in a certain way to make it more efficient. Or, you might need to add hints to get the right execution path.
This is because optimization for SQL is generally much more difficult than optimization for other languages. It requires understanding the data and the distribution of values to arrive at the best solution.
My advice is to write the queries in a way that best expresses what you want done, to write them with naming conventions and indentation that convey the purpose of the query. That way, if you do have to modify the query, you will at least understand what it is doing.

There are situations where your own knowlege comes in handy. Here are some examples.
1 - you want everything for this month. This is straightforward
where Year(datefield) = 2013
and month(datefield) = 'February'
but this will run faster
where datefield >= '2013-02-01'
and datefield < '2013-03-01'
2 - you want boys named Pat. Sex is indexed, name is not. this is faster
where sex = 'M'
and name = 'Pat'
than this
where name = 'Pat'
and sex = 'M'
3 - in a case construct, list the situation that will occur most often first. This
case when something that almost always happens then 'yes' else 'no' end
will run faster than
case when something that almost never happens then 'no' else 'yes' end

Performance difference between MOVE and = assignment in ABAP

Is there any kind of performance gain between 'MOVE TO' vs x = y? I have a really old program I am optimizing and would like to know if it's worth it to pull out all the MOVE TO. Any other general tips on ABAP optimization would be great as well.

No, that is just the same operation expressed in two different ways. Nothing to gain there. If you're out for generic hints, there's a good book available that I'd recommend studying in detail. If you have to optimize a specific program, use the tracing tools (transaction SAT in sufficiently current releases).

The two statements are equivalent:
"
To assign the value of a data object source to a variable destination, use the following statement:
MOVE source TO destination.
or the equivalent statement
destination = source.
"

No, they're the same.
Here's a couple quick hints from my years of performance enhancement:
1) if you use move-corresponding where possible, your code can be a lot more concise, modular, and extendable (in the distant past this was frowned upon but the technical reasons for this are generally not applicable anymore).
2) Use SAT at every opportunity, and be sure to turn on internal table tracking. This is like turning on the lights versus stumbling over furniture in the dark.
3) Make the database layer do as much work as possible for you. Try to combine queries wherever possible, especially when combining result sets. Two queries linked by a join is usually much better than select > itab > select FOR ALL ENTRIES.
4) This is a bit advanced, but FOR ALL ENTRIES often has much slower performance than the equivalent select-options IN phrase. This seems to be because the latter is built as one big query to the database layer while the former requires multiple trips to the database layer. The caveat, of course, is that if you have too many records in your select-options the generated query at the database layer will exceed the allowable size on your system, but large performance gains are possible within that limitation. In general, SAP just loves select-options.
5) Index, index, index!

First of all move does not really affect much performance.
What is affecting quite a lot in the projects I worked for is following:
Nested loops (very evil). For example, loop through all documents, and for each document select single to check it company code is allowed to be displayed.
Instead, make a list of company codes, consult them all once from db and consult this results table instead.
Use hash or sorted tables where possible. Where not possible, use standard table, but sort it by keys and use "binary search".
Select from DB by all key fields. If not possible, consider creating indexes.
For small and simple selects, use joins. For bigger selects using joins will still work faster, but would be difficult to follow up.
Minor thing - use field symbols to read table line, this makes it much faster.

1) You should be careful while using SELECT statement in ABAP language.
Unnecessary database connections significantly decreases the performance of ABAP program.
2) While using internal table with functions you should call it by reference to reduce memory usage.
Call By Reference:
Passes a pointer to the memory location.Changes made to the
variable within the subroutine affects the variable outside
the subroutine.
3)Should not use internal tables with workarea.
4)While using nested loops, use sorting algorithms.

They are the same, as is the ADD keyword and + operator.
If you do want to optimize your ABAP, I have found the largest culprits to be:
Not using binary lookups and/or (internal) table keys properly.
The syntax of ABAP is brain-dead when it comes to table use. Know how
to work with tables efficiently. Basically write
better/optimal/elegant high-level code. This is always a winner!
Fewer instructions == less time. The fewer instructions you hit the
faster the program will run. This is important in tight loops... I
know this sounds obvious, but ABAP is so slow, that if you are really
trying to optimize critical programs, this will make a difference.
(We have processes that run days... and shaving off an hour or so
makes a difference!)
Don't mix types. There is a little bit of overhead to some
implicit conversions... for instance if you are initializing a
string data type, then use the correct literal string with
(backtick) quotes: `literal`. This also counts for looking up in
tables using keys... use exact match datatypes.
Function calls... I cannot stress the overhead of function calls
enough... the less you have the better. Goes against anything a real
computer programmer believes, but there you have it... ABAP is a
special case.
Loop using ASSIGNING (or REF TO - slightly slower on certain
types), avoid INTO like a plague.
PS: Also keep in mind that SWITCH statements are just glorified IF conditionals... thus move the most common conditions to the top!

You can create CDS with ADT Eclipse. Or views(se11) have good performance for selecting.

"MOVE a TO" b and "a = b" are just same in ABAP. There is no performance difference "MOVE" is just a more visible, noticeable version.
But if you talk about "MOVE-CORRESPONDING", yes, there is a performance difference. It's more practical to code, but actually runs slower then direct movement.

Why is SQL's grammar inside-out?

In just about any formally structured set of information, you start reading either from the start towards the end, or occasionally from the end towards the beginning (street addresses, for example.) But in SQL, especially SELECT queries, in order to properly understand its meaning you have to start in the middle, at the FROM clause. This can make long queries very difficult to read, especially if it contains nested SELECT queries.
Usually in programming, when something doesn't seem to make any sense, there's a historical reason behind it. Starting with the SELECT instead of the FROM doesn't make sense. Does anyone know the reason it's done that way?

I think the way in which a SQL statement is structured makes logical sense as far as English sentences are structured. Basically
I WANT THIS
FROM HERE
WHERE WHAT I WANT MEETS THESE CRITERIA
I don't think it makes much sense, In English at least, to say
FROM HERE
I WANT THIS
WHERE WHAT I WANT MEETS THESE CRITERIA

The SQL Wikipedia entry briefly describes some history:
During the 1970s, a group at IBM San Jose Research Laboratory developed the System R relational database management system, based on the model introduced by Edgar F. Codd in his influential paper, "A Relational Model of Data for Large Shared Data Banks". Donald D. Chamberlin and Raymond F. Boyce of IBM subsequently created the Structured English Query Language (SEQUEL) to manipulate and manage data stored in System R. The acronym SEQUEL was later changed to SQL because "SEQUEL" was a trademark of the UK-based Hawker Siddeley aircraft company.
The original name explicitly mentioned English, explaining the syntax.
Digging a little deeper, we find the FLOW-MATIC programming language.
FLOW-MATIC, originally known as B-0 (Business Language version 0), is possibly the first English-like data processing language. It was invented and specified by Grace Hopper, and development of the commercial variant started at Remington Rand in 1955 for the UNIVAC I. By 1958, the compiler and its documentation were generally available and being used commercially.
FLOW-MATIC was the inspiration behind the Common Business Oriented Language, one of the oldest programming languages still in active use. Keeping with that spirit, SEQUEL was designed with English-like syntax (1970s is modern, compared with 1950s and 1960s).
In perspective, "modern" programming systems still access databases using the age old ideas behind
MULTIPLY PRICE BY QUANTITY GIVING COST.

I must disagree. SQL grammar is not inside-out.
From the very first look you can tell whether the query will SELECT, INSERT, UPDATE, or DELETE data (all the rest of SQL, e.g. DDL, omitted on purpose).
Back to your SELECT statement confusion: The aim of SQL is to be declarative. Which means you express WHAT you want and not HOW you want it. So it makes every sense to first state WHAT YOU WANT (list of attributes you're selecting) and then provide the DBMS with some additional info on where that should be looked up FROM.
Placing the WHERE clause at the end makes great sense too: Imagine a funnel, wide at the top, narrow at the bottom. By adding a WHERE clause towards the end of the statement, you are choking down the amount of resulting data. Applying restrictions to your query any place else than at the bottom would require the developer to turn their head around.
ORDER BY clause at the very end: once the data has gone through the funnel, sort it.
JOINS (JOIN criteria) really belong into the FROM clause.
GROUPING: basically running data through a funnel before it gets into another funnel.
SQL sytax is sweet. There's nothing inside out about it. Maybe that's why SQL is so popular even after so many decades. It's rather easy to grasp and to make sense out of. (Although I have once faced a 7-page (A4-size) SQL statement which took me quite a while to get my head around.)

It's designed to be English like. I think that's the primary reason.
As a side note, I remember the initial previews of LINQ were directly modeled after it (select ... from ...). This was changed in later previews to be more programming language like (so that the scope goes downwards). Anders Hejlsberg specifically mentioned this weird fact about SQL (which makes IntelliSense harder and doesn't match C# scope rules) as the reason they made this decision.
Anyhow, good or bad, it's what it is and it's too late to change anything.

The order of the clauses in SQL is absolutely logical. Remember that SQL is a declarative language, where you declare what you want and the system figures out how best to get it for you. The first clause is the select clause where you list the columns that you want in the result table. This is the primary purpose of the query. Having stated what you want the result to look like, you next state where the data should come from. The where clause limits the amount of data being returned. There is no point in thinking about how to limit your data unless you know where it comes from, so it goes after the from clause. The group by clause works with the aggregation operators in the select clause and could go anywhere after the from clause however it is better to think about aggregation on the filtered data, so it comes after the where clause. The having clause has to come after the group by clause. The order by clause is about how the data is presented and could go anywhere after the select.

It's consistent with the rest of SQL's syntax of having every statement start with a verb (CREATE, DROP, UPDATE, etc.).
The major disadvantage of having the column list first is that it's inconvenient for auto-complete (as Hejlsberg has mentioned), but this wasn't a concern when the syntax was designed in the 1970s.
We could have had the best of both worlds with a syntax like SELECT FROM SomeTable: ColumnA, ColumnB, but it's too late to change it now.
Anyhow, SQL's SELECT statement order isn't unique. It exactly matches that of Python list comprehensions:
[(rec.a, rec.b) for rec in data where rec.a > 0]

History of the language aside (although it is fascinating) I think the thing you are missing is that SQL isn't about telling the system what to do, so much as what end result you want (and it figures out how to do it)
saying 'go over there to that rack, pick up the hats with hatbands, blue hats first, then green, then red, and bring them to me' is very much telling the system how to do what you want. it's programmer think where we presume the worker is very stupid and needs minutely detailed instructions.
SQL is starting with the end result first, the data you want, the order of the columns, etc.. it's very much the perspective of someone who is building a report. "I want firstname, lastname, then age, then....." That is after all the purpose of making the request. So it starts with that, the format of the results you want. Then it goes into where you expect it to find the data, what criteria to look for, the order to present it, etc.
So as an alternative to specifying in minute detail what you want the worker to do, SQL presumes the system knows how to do that, and centers more on what you want.
So instead of pedantically telling your worker to go here, get this, bring it over there.. it's more like saying "I want hats, from rack 12, which have hatbands, and please sort them by color."

Does the order of parameters in the where clause affect whether a table uses an index?

So I am wondering if there is a definitive answer to this question.
Also, does it matter if the index is clustered vs. non-clustered.
Is it the same in all RDBMS implementations or is the exact behavior going to be proprietary?

SQL is a declarative language, not a procedural one. Each SQL implementation is going to have its own quirks about implementation details like which indexes get used, how the optimizer decides which indexes to use, what the SQL programmer can do to affect the choice, and so on.

Use of indexes is not a part of the SQL standard, but rather an implementation detail of the particular DBMS.
Ideally, it shouldn't affect it, since it doesn't affect the actual rows that are returned.
But I've seen queries on an unnamed DBMS that does change index use based on the SQL query order.

The ordering of the where clause shouldn't affect the query plan or the indices used in any respectable database although I have seen at least one (non-respectable) database where this wasn't the case.

At one time (long ago, i.e. until about 1995) Oracle used to have only a "rule based optimiser", and with this it was certainly the case that the order of predicates in the WHERE clause, and the order of tables in the FROM clause (there was no JOIN syntax then), affected the query plan: this was documented to be the case. However, cost-based optimisers (which Oracle has had since then) attempt to examine all possible plans (or at least, as many as they can within some sensible parameters) and choose the most efficient.

This is tough to answer because I think no one really knows, including the DBMS engineers! LOL, that is sarcasm, but what I really mean is that it is inherently non-deterministic. I could be wrong, but it really does boil down to the DB engine implementation, since the ANSI SQL standard, or any other standard does not to my knowledge regulate this notion of index semantics. I do know, however, that the FIRST reference to any indexed field matters, as it marks the top of the decision tree for the query engine. From there and depending on the number and type of indexes, the query engine may choose to use the first and most matching index it finds, or it may decide to "optimize" and instead use another index. That, I think, is the part that makes it non-deterministic.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas