Compound "OR" evaluation in DB2 - sql

I've searched the forums and found a few related threads but no definitive answer.
(case
when field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%'...
In the above statement, if field1 is "like" t001, will the others even be evaluated?
(case
when (field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%')...
Does adding parentheses as shown above change the evaluation?

In general, databases short-circuit boolean operations. That is, they stop at the first value that defines the result -- the first "true" for OR, the first "false" for AND.
That said, there are no guarantees, nor are there guarantees about the order of evaluation, so DB2 could decide to test the middle, the last, and then the first. Still, these predicates are pretty much equivalent, so I would expect the ordering to be either first to last or last to first.
Remember: SQL is a descriptive language, not a procedural language. A SQL query describes the result set, but not the steps used to generate it.
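If you genuinely need one test performed before another (guarding a division, say), it is usually safer to state the order explicitly with a CASE expression than to rely on OR/AND short-circuiting; most engines honour the WHEN order, although even that has documented edge cases. A minimal sketch, with invented table and column names:
SELECT *
FROM accounts
WHERE CASE WHEN units = 0 THEN 0            -- checked first
           WHEN total / units > 100 THEN 1  -- only reached when units <> 0
           ELSE 0
      END = 1;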

You don't know.
SQL is a declarative language, not an imperative one. You describe what you want, the engine provides it. The database engine will decide in which sequence it will evaluate those predicates, and you don't have control over that.
If you get the execution plan today it may show one sequence of steps, but if you get it tomorrow it may show something different.
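For reference, one way to capture that plan on DB2 for LUW (a sketch; it assumes the explain tables already exist, e.g. created from ~/sqllib/misc/EXPLAIN.DDL, and the table name is invented):
EXPLAIN PLAN FOR
SELECT field1
FROM mytable
WHERE field1 LIKE '%T001%' OR field1 LIKE '%T201%';
-- then format the most recently explained statement from the command line, e.g.:
--   db2exfmt -d MYDB -1 -o plan.txt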

Not strictly answering your question, but if you have many of these matches, a simpler, possibly faster, and easier to maintain solution would be to use REGEXP_LIKE. The example you've posted could be written like this:
CASE WHEN REGEXP_LIKE(field1, '.*T(0|2|3)01.*') ...
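Since the three codes differ only in one character, a character class expresses the same intent a bit more compactly (still just a sketch against the codes shown above):
CASE WHEN REGEXP_LIKE(field1, '.*T[023]01.*') ...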

Just to show how it actually behaves in this simple case.
select
field1
, case
when field1 LIKE '%T001%' OR RAISE_ERROR('75000', '%T201%') LIKE '%T201%' then 1 else 0
end as expr
from
(
values
'abcT001xyz'
--, '***'
) t (field1);
The query returns the expected result for the statement as written above.
But if you uncomment the commented-out line, you get SQLCODE=-438.
This means the second OR operand is not evaluated when the first one returns true.
Note that this is just a demonstration. There is no guarantee that it will behave this way in every case.

Just to add to some points made about the difference between so-called procedural languages on the one hand, and SQL (which is variously described as a declarative or descriptive language) on the other.
SQL defines a set of relatively high-level operators for working with arrays of data. They are "high-level" in the sense that they work with arrays in a concise fashion that has not been typical of general purpose or procedural languages. Like all operators (whether array-based or not), there is typically more than one algorithm capable of implementing an operator.
In contrast to "general purpose" programming languages to which the vast majority of programmers are exposed, the existence of these array operators - in particular, the ability to combine them algebraically into an expression which defines a composite operation (or a query), and the absence of any explicit algorithms for iteration - was once a defining feature of SQL.
The distinction is less sharp nowadays, with a resurgent interest in functional languages and features, but most still view SQL as a beast of its own kind amongst commercially-popular tooling.
It is often said that in SQL you define what results you want, not how to get them. But this is true for every language. It is true even for machine code operators, if you account for how the implementation in circuitry can be varied - and does vary - between CPU designs. It is certainly true for all compiled languages, where compilers employ many different machine code algorithms to actually implement operations specified in the source code - loop unrolling, for one example.
The feature which continues to distinguish SQL (and the relational databases which implement it), is that the algorithm which implements an operation is determined at the time of each execution, and it does this not just by algebraic manipulation of queries (which is not dissimilar to what compilers do), but also by continuously generating and analysing statistics about the data being operated upon and the consequences of previous executions.
Put another way, the database execution engine is engaged in a constant search for the best practical algorithms (and combinations thereof) to implement its overall workload. It is capable of accommodating not just past experience, but of reacting to changes (such as in data volumes, in degree of concurrency and transactional conflict, or in systemic constraints like available memory or overall workload).
The upshot of all this is that there is a specific order of evaluation in SQL, just like any other language. It is this order which defines a correct result. But unless written in so-called RBAR style (and even then, but to a more limited extent...), the database engine has tremendous latitude to implement shortcuts and performance optimisations, provided these do not change the final result.
Many operators fall into a class where it is possible to determine the result in many cases without evaluating all operands. I'm not sure what the formal word is to describe this property - partial evaluativity, maybe - but casually it is called short-circuiting. The OR operator has this property.
Another property of the OR operation is that it is associative and commutative. That is, the order in which a series of them are grouped and applied does not matter - it behaves like the addition operator, where you can add numbers in any order without affecting the result.
With a series of OR conditions, these are capable of being reordered and partially evaluated, provided the evaluation of any particular operand does not cause side-effects or depend on hidden variables. It is therefore quite likely that the database engine may reorder or partially evaluate them.
If the operands do cause side-effects or depend on hidden variables (functions which get the current date or time being a prime example of the latter), these often cause problems in queries - either because the database engine does not realise they have side-effects or hidden variables, or because the database does realise it but doesn't handle the case in the way the programmer expects. In such cases, a query may have to be completely rewritten (typically, cracked into multiple statements) to force a specific evaluation order or guarantee full evaluation.
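To illustrate that last point, here is a hedged sketch of "cracking" a query into steps on DB2, so that a side-effecting function only ever sees rows that already passed the cheap filter. The table, column and expensive_udf names are invented, and a user temporary tablespace must exist for the declared temporary table:
DECLARE GLOBAL TEMPORARY TABLE session.prefiltered
    LIKE orders
    ON COMMIT PRESERVE ROWS NOT LOGGED;

INSERT INTO session.prefiltered
SELECT * FROM orders
WHERE order_date >= CURRENT DATE - 30 DAYS;

-- the UDF is now applied in a separate, later statement
SELECT * FROM session.prefiltered
WHERE expensive_udf(order_id) = 1;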

Related

Deterministic/Volatile function in SQL

Let's take a basic deterministic function and a non-deterministic one:
ABS(2)
NOW()
What about the third case of something that may change but we're not sure, such as:
SELECT
ABS(2) -- deterministic
, NOW() -- not
, getTableCount(otherTbl) -- function that does a 'SELECT count(1) FROM table'
FROM
table
Basically, if a row is inserted or deleted, the subselect's value will change. So would that one be considered deterministic? The result should always be the same...unless the underlying data is changed, so it's almost like a third case. Or, is volatile/non-deterministic just taken to mean 'if it ever changes, ever, ever, ever, under any circumstances, then it's volatile.' ?
There are different interpretations of determinism, even when restricted to the domain of SQL functions. It depends on what the consumer of that determinism needs and assumes.
The usual definition of determinism is that a deterministic function always returns the same value when confronted with the same input argument values for its parameters.
If the function consumes state, it would implicitly consider it as an extra input parameter. The original function(p1,...pn) would become function(p1,...pn,state). But in this case, if two different states are compared, then the inputs would not be the same, so we couldn't talk about determinism anymore. Knowing this, we will use the terms state-sensitive-determinism and state-insensitive-determinism to differentiate those cases.
Our state-insensitive-determinism is equivalent to PostgreSQL's IMMUTABLE (PostgreSQL is a good comparison as it avoids the term determinism to prevent confusion, as can be seen in the postgresql docs). In this case, the function always returns the same value no matter the state (example: select 1+2). It is the most strict form of determinism and consumers usually take it for granted - query optimizers, for example, can substitute such functions by their result (select 1+2 would become select 3). In those cases the state does not influence the result, so even if we add state as an extra parameter, the function still returns the same value.
When the result does not change for the same state but risks changing otherwise, we have our state-sensitive-determinism, or PostgreSQL's STABLE (example: select v, sum(v) over () from tbl where v>1000;). Determinism here is in a gray area. A query-optimizer consumer sees it as deterministic: since a query runs against a well-defined state, at least in transactional databases, it is fine to calculate the value only once instead of many times, because future calculations would return the same result. But a materialized calculated column or an index can't accept the same function as deterministic, because a small change in the state would invalidate all of its pre-calculated and stored values. This is where the OP's getTableCount(otherTbl) resides: for a query optimizer its determinism is enough to avoid extra calculations; for materialized calculated values it is not enough and can't be accepted as a source of values to be written. If we use the state as an extra parameter, the result may change between different states.
If we consume a value that is generated from some uncontrolled state, like random() (at least when we don't choose the seed of the pseudorandom function), then we can't achieve determinism at all. In PostgreSQL's terms, this would be VOLATILE. A VOLATILE function is non-deterministic by nature because it can produce different values even within the same table scan, as is the case with random(). (For time-related functions, see "Postgres now() timestamp doesn't change, when script works"; the time may be the transaction time or the query time, which impacts your view of what is deterministic.)
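For concreteness, this is how the three classes above are declared in PostgreSQL (a sketch only; the function bodies and the other_tbl name are invented):
-- state-insensitive: same arguments, same result, regardless of database state
CREATE FUNCTION add_ints(integer, integer) RETURNS integer
    AS 'SELECT $1 + $2;'
    LANGUAGE SQL IMMUTABLE;

-- state-sensitive: stable within one statement/snapshot, but depends on table contents
CREATE FUNCTION get_table_count() RETURNS bigint
    AS 'SELECT count(*) FROM other_tbl;'
    LANGUAGE SQL STABLE;

-- no useful determinism: may return a different value on every call
CREATE FUNCTION dice_roll() RETURNS integer
    AS 'SELECT 1 + floor(random() * 6)::integer;'
    LANGUAGE SQL VOLATILE;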
MySQL has different keywords - NOT DETERMINISTIC / DETERMINISTIC, READS SQL DATA / MODIFIES SQL DATA (similar to PostgreSQL's LEAKPROOF), NO SQL / CONTAINS SQL - as seen in the mysql docs, with the same objective as PostgreSQL: giving hints to the specific consumer, be it a query optimizer or a materialized value, so that it can adapt its behaviour depending on its interpretation of determinism. The database vendors probably leave this responsibility to the users because determining the causal graph of what influences what would be complex and problematic.
When vendors talk about determinism they will probably be talking about one of the flavors described here. In the sqlserver docs Microsoft says that the state must be the same, so they are probably talking about our state-sensitive-determinism. The sqlite docs, on the other hand, take the state-insensitive-determinism approach, where functions must return the same result even across different states to be considered deterministic, since they have to follow stricter rules. Oracle implicitly follows the same sqlite flavor in their docs.
Our transactional databases will eventually use some mechanism like MVCC to hold state within a transaction. In this case we could think of the transaction timestamp as an input to our functions. But if we take more complex cases like distributed databases, then our determinism can be harder to achieve and eventually it has to consider consensus algorithms.

How to make DBMS check all conditions in WHERE section?

I'm writing a SQL load-testing tool where the user can specify the number of conditions in the WHERE section (and some more functionality) using sliders, then press a "Start" button to start load testing the database.
The problem is: if I use the OR logical operator to join clauses, the DBMS may stop checking the WHERE section once it encounters a predicate that returns TRUE. With the AND logical operator the situation is similar: once the DBMS encounters a predicate that returns FALSE, it stops checking the rest of the WHERE section. How can I make the DBMS check all clauses in the WHERE section independently of their TRUE/FALSE values?
You can't.
SQL is a declarative language, not an imperative one. That means the database engine is absolutely free to use any and all kinds of optimizations (and dirty tricks) to get the correct result according to your specification.
Moreover, the strategy the engine chooses today may change in the future without notice, as long as it returns the correct result. The optimizer logic is typically very simple (and predictable) in low-end databases, while it's very sophisticated in high-end ones (more operations, better histograms, smarter logic, etc.). In short, the strategy constantly adapts the specific method to the existing conditions: the data present in each table, hardware and software conditions, etc.
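If the goal is only to generate load by touching every predicate, one sketch is to fold each condition into arithmetic so that the final result formally depends on all of them - with the caveat, as above, that the optimizer is still free to simplify the expression. Table and column names are invented:
SELECT *
FROM load_test_tbl
WHERE ( CASE WHEN col1 > 100 THEN 1 ELSE 0 END
      + CASE WHEN col2 LIKE 'A%' THEN 1 ELSE 0 END
      + CASE WHEN col3 IS NOT NULL THEN 1 ELSE 0 END ) >= 1;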
UPDATE: I decided to add a "Short-circuit protection mode" to my app and build the WHERE section like "WHERE (((cond1 = cond2) = cond3) = cond4)" or "WHERE cond1 = (cond2 = (cond3 = cond4))"; the latter would be easier to implement.

Does select * impact stored procedure performance?

I know this could be a trivial question but I keep hearing one of my teachers voice saying
don't use SELECT * within a stored procedure; it affects performance, and it returns data that could be breaking its clients if its schema changes, causing unknown ripple effects
I can't find any article confirming that concept, and I think that should be noticeable if true.
In most modern professional SQL implementations (Oracle, SQL Server, DB2, etc.), the use of SELECT * has a negative impact only in a top-level SELECT. In all other cases the SQL compiler should perform column-optimization anyway, eliminating any columns that are not used.
And the negative effect of * in a top-level SELECT is almost entirely related to returning all of the columns when you probably do not need all of them.
IMHO, in all other cases(**), including most ad-hoc cases, the use of * is perfectly fine and has no detrimental effects (and obvious beneficial conveniences). The widespread universal pronouncements against using * are largely an archaic holdover from the time (10-15 years ago) when most SQL compilers did not have effective column-elimination optimization techniques.
(** - one exception is in VIEW definitions in SQL Server, because it doesn't automatically notice if the bound column list changes.)
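A small T-SQL sketch of that view caveat (object names invented): the view's column list is captured when the view is created, so a later ALTER TABLE is not visible through it until it is refreshed.
CREATE TABLE dbo.Customers (Id int, Name varchar(50));
GO
CREATE VIEW dbo.vCustomers AS SELECT * FROM dbo.Customers;
GO
ALTER TABLE dbo.Customers ADD Email varchar(100);
GO
-- SELECT * FROM dbo.vCustomers still returns only Id and Name at this point
EXEC sp_refreshview 'dbo.vCustomers';  -- now the view picks up Email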
The other reason that you sometimes see for not using SELECT * is not because of any performance issue, but just as a matter of coding practices. That is, that it's generally better to write your SQL code to be explicit about what columns you (or your client code) expects and thus are dependent on. If you use * then it's implicit and someone reading your SQL code cannot easily tell if your application is truly dependent on a certain column or not. (And IMHO, this is the more valid reason.)
I found this quote in a paper about what happens when we use the SELECT * instruction:
“[…] real harm occurs when a sort is required. Every SELECTed column, with the sorting columns repeated, makes up the width of the sort work file wider. The wider and longer the file, the slower the sort is.” In http://www.quest.com/whitepapers/10_SQL_Tips.pdf
This paper is about the DB2 engine, but this likely applies to other engines too.
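As a rough illustration of that point (table and column names invented), compare what each sort has to carry:
-- the sort work file only holds two narrow columns here ...
SELECT order_id, order_date
FROM orders
ORDER BY order_date;
-- ... while every column of the row is dragged through the sort here
SELECT *
FROM orders
ORDER BY order_date;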

Performance difference between MOVE and = assignment in ABAP

Is there any kind of performance gain between 'MOVE TO' vs x = y? I have a really old program I am optimizing and would like to know if it's worth it to pull out all the MOVE TO. Any other general tips on ABAP optimization would be great as well.
No, that is just the same operation expressed in two different ways. Nothing to gain there. If you're out for generic hints, there's a good book available that I'd recommend studying in detail. If you have to optimize a specific program, use the tracing tools (transaction SAT in sufficiently current releases).
The two statements are equivalent:
"
To assign the value of a data object source to a variable destination, use the following statement:
MOVE source TO destination.
or the equivalent statement
destination = source.
"
No, they're the same.
Here's a couple quick hints from my years of performance enhancement:
1) if you use move-corresponding where possible, your code can be a lot more concise, modular, and extendable (in the distant past this was frowned upon but the technical reasons for this are generally not applicable anymore).
2) Use SAT at every opportunity, and be sure to turn on internal table tracking. This is like turning on the lights versus stumbling over furniture in the dark.
3) Make the database layer do as much work as possible for you. Try to combine queries wherever possible, especially when combining result sets. Two queries linked by a join is usually much better than select > itab > select FOR ALL ENTRIES.
4) This is a bit advanced, but FOR ALL ENTRIES often has much slower performance than the equivalent select-options IN phrase. This seems to be because the latter is built as one big query to the database layer while the former requires multiple trips to the database layer. The caveat, of course, is that if you have too many records in your select-options the generated query at the database layer will exceed the allowable size on your system, but large performance gains are possible within that limitation. In general, SAP just loves select-options.
5) Index, index, index!
First of all, MOVE does not really affect performance much.
What affected it quite a lot in the projects I worked on is the following:
Nested loops (very evil). For example, looping through all documents and, for each document, doing a SELECT SINGLE to check whether its company code is allowed to be displayed.
Instead, make a list of the company codes, fetch them all from the DB once, and look them up in that result table.
Use hashed or sorted tables where possible. Where that is not possible, use a standard table, but sort it by its keys and read it with BINARY SEARCH.
Select from the DB by all key fields. If that is not possible, consider creating indexes.
For small and simple selects, use joins. For bigger selects joins will still be faster, but the code becomes harder to follow.
Minor thing - use field symbols to read table lines; this makes it much faster.
1) You should be careful when using SELECT statements in ABAP.
Unnecessary database round trips significantly decrease the performance of an ABAP program.
2) When passing internal tables to function modules, pass them by reference to reduce memory usage.
Call by reference:
passes a pointer to the memory location. Changes made to the variable within the subroutine affect the variable outside the subroutine.
3) Avoid processing internal tables through a separate work area when you can access the rows directly.
4) When nested loops are unavoidable, sort the inner table so lookups can use a binary search.
They are the same, as is the ADD keyword and + operator.
If you do want to optimize your ABAP, I have found the largest culprits to be:
Not using binary lookups and/or (internal) table keys properly.
The syntax of ABAP is brain-dead when it comes to table use. Know how
to work with tables efficiently. Basically write
better/optimal/elegant high-level code. This is always a winner!
Fewer instructions == less time. The fewer instructions you hit the
faster the program will run. This is important in tight loops... I
know this sounds obvious, but ABAP is so slow, that if you are really
trying to optimize critical programs, this will make a difference.
(We have processes that run days... and shaving off an hour or so
makes a difference!)
Don't mix types. There is a little bit of overhead to some
implicit conversions... for instance if you are initializing a
string data type, then use the correct literal string with
(backtick) quotes: `literal`. This also counts for looking up in
tables using keys... use exact match datatypes.
Function calls... I cannot stress the overhead of function calls
enough... the less you have the better. Goes against anything a real
computer programmer believes, but there you have it... ABAP is a
special case.
Loop using ASSIGNING (or REF TO - slightly slower on certain
types), avoid INTO like a plague.
PS: Also keep in mind that SWITCH statements are just glorified IF conditionals... thus move the most common conditions to the top!
You can create a CDS view with ADT in Eclipse. Classic views (SE11) also have good performance for selecting.
"MOVE a TO" b and "a = b" are just same in ABAP. There is no performance difference "MOVE" is just a more visible, noticeable version.
But if you talk about "MOVE-CORRESPONDING", yes, there is a performance difference. It's more practical to code, but actually runs slower then direct movement.

Measuring the complexity of SQL statements

The complexity of methods in most programming languages can be measured in cyclomatic complexity with static source code analyzers. Is there a similar metric for measuring the complexity of a SQL query?
It is simple enough to measure the time it takes a query to return, but what if I just want to be able to quantify how complicated a query is?
[Edit/Note]
While getting the execution plan is useful, that is not necessarily what I am trying to identify in this case. I am not looking for how difficult it is for the server to execute the query, I am looking for a metric that identifies how difficult it was for the developer to write the query, and how likely it is to contain a defect.
[Edit/Note 2]
Admittedly, there are times when measuring complexity is not useful, but there are also times when it is. For a further discussion on that topic, see this question.
Common measures of software complexity include Cyclomatic Complexity (a measure of how complicated the control flow is) and Halstead complexity (a measure of how complex the arithmetic is).
The "control flow" in a SQL query is best related to the "and" and "or" operators in the query.
The "computational complexity" is best related to operators such as SUM or implicit JOINS.
Once you've decided how to categorize each unit of syntax of a SQL query as to whether it is "control flow" or "computation", you can straightforwardly compute Cyclomatic or Halstead measures.
What the SQL optimizer does to queries is, I think, absolutely irrelevant. The purpose of complexity measures is to characterize how hard it is for a person to understand the query, not how efficiently it can be evaluated.
Similarly, what the DDL says or whether views are involved or not shouldn't be included in such complexity measures. The assumption behind these metrics is that the complexity of the machinery inside a used abstraction isn't interesting when you simply invoke it, because presumably that abstraction does something well understood by the coder. This is why Halstead and Cyclomatic measures don't include called subroutines in their counting, and I think you can make a good case that views and DDL information are those "invoked" abstractions.
Finally, how perfectly right or how perfectly wrong these complexity numbers are doesn't matter much, as long as they reflect some truth about complexity and you can compare them relative to one another. That way you can choose which SQL fragments are the most complex, sort them all, and focus your testing attention on the most complicated ones.
I'm not sure the retrieval of the query plans will answer the question: query plans hide part of the complexity of the computation performed on the data before it is returned (or used in a filter), and they require a representative database to be relevant. In fact, complexity and length of execution are somewhat at odds - something like "Good, Fast, Cheap - pick any two".
Ultimately it's about the chances of making a mistake, or not understanding the code I've written?
Something like:
number of tables times (1
+1 per join expression (+1 per outer join?)
+1 per predicate after WHERE or HAVING
+1 per GROUP BY expression
+1 per UNION or INTERSECT
+1 per function call
+1 per CASE expression
)
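A quick worked example of that heuristic (the query and the counting are purely illustrative):
SELECT c.region, SUM(o.amount)              -- +1 function call
FROM orders o
JOIN customers c ON c.id = o.customer_id    -- +1 join expression
WHERE o.status = 'OPEN'                     -- +1 predicate
  AND o.amount > 100                        -- +1 predicate
GROUP BY c.region;                          -- +1 GROUP BY expression
-- 2 tables * (1 + 1 join + 2 predicates + 1 GROUP BY + 1 function) = 2 * 6 = 12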
Please feel free to try my script that gives an overview of the stored procedure size, the number of object dependencies and the number of parameters -
Calculate TSQL Stored Procedure Complexity
SQL queries are declarative rather than procedural: they don't specify how to accomplish their goal. The SQL engine will create a procedural plan of attack, and that might be a good place to look for complexity. Try examining the output of the EXPLAIN (or EXPLAIN PLAN) statement, it will be a crude description of the steps the engine will use to execute your query.
Well, I don't know of any tool that does such a thing, but it seems to me that what makes a query more complicated would be measured by:
the number of joins
the number of where conditions
the number of functions
the number of subqueries
the number of casts to different datatypes
the number of case statements
the number of loops or cursors
the number of steps in a transaction
However, while it is true that the more complex queries might appear to be the ones with the most possible defects, I find that the simple ones are very likely to contain defects as well, since they are more likely to be written by someone who doesn't understand the data model and thus may appear to work correctly but in fact return the wrong data. So I'm not sure such a metric would tell you much.
In the absence of any tools that will do this, a pragmatic approach would be to ensure that the queries being analysed are consistently formatted and to then count the lines of code.
Alternatively use the size of the queries in bytes when saved to file (being careful that all queries are saved using the same character encoding).
Not brilliant but a reasonable proxy for complexity in the absence of anything else I think.
In programming languages we have several methods to compute time complexity or space complexity.
Something similar can be done for SQL. For a procedure, the number of lines and loops behaves much like it does in a programming language. A query is different: besides the input, its cost depends entirely on the data in the tables/views it operates on, plus the overhead complexity of the query itself.
For example, a simple row-by-row query:
Select * from table;
-- depends entirely on the number of rows, say n, hence O(n)
Select max(input) from table;
-- here MAX adds extra work per row, so roughly t*O(n), where t is the per-row evaluation time of MAX
Here is an idea for a simple algorithm to compute a complexity score related to readability of the query:
Apply a simple lexer on the query (like the ones used for syntax coloring in text editors or here on SO) to split the query into tokens and give each token a class:
SQL keywords
SQL function names
string literals with character escapes
string literals without character escape
string literals which are dates or date+time
numeric literals
comma
parenthesis
SQL comments (--, /* ... */)
quoted user words
non quoted user words: everything else
Give a score to each token, using different weights for each class (and different weights for individual SQL keywords).
Add the scores of each token.
Done.
This should work quite well as for example counting sub queries is like counting the number of SELECT and FROM keywords.
By using this algorithm with different weight tables you can even measure the complexity in different dimensions - for example, to make nuanced comparisons between queries, or to score higher the queries which use keywords or functions specific to a particular SQL engine (e.g. GROUP_CONCAT on MySQL).
The algorithm can also be tweaked to take into account the case of SQL keywords: increase complexity if they are not consistently upper case. Or to account for indentation (carriage returns, position of keywords on a line).
Note: I was inspired by the answer from redcalx that suggested applying a standard formatter and counting lines of code. My solution is simpler, however, as it doesn't need to build a full AST (abstract syntax tree).
Toad has a built-in feature for measuring McCabe cyclomatic complexity on SQL:
https://blog.toadworld.com/what-is-mccabe-cyclomatic-complexity
Well, if you're using SQL Server I would say that you should look at the cost of the query in the execution plan (specifically the subtree cost).
Here is a link that goes over some of the things you should look at in the execution plan.
Depending on your RDBMS, there might be query plan tools that can help you analyze the steps the RDBMS will take in fetching your query.
SQL Server Management Studio Express has a built-in query execution plan. Pervasive PSQL has its Query Plan Finder. DB2 has similar tools (forgot what they're called).
A good question. The problem is that for a SQL query like:
SELECT * FROM foo;
the complexity may depend on what "foo" is and on the database implementation. For a function like:
int f( int n ) {
if ( n == 42 ) {
return 0;
}
else {
return n;
}
}
there is no such dependency.
However, I think it should be possible to come up with some useful metrics for a SELECT, even if they are not very exact, and I'll be interested to see what answers this gets.
It's reasonable enough to consider complexity as what it would be if you coded the query yourself.
If the table has N rows then,
A simple SELECT would be O(N)
An ORDER BY is O(N log N)
A JOIN is O(N*M)
A DROP TABLE is O(1)
A SELECT DISTINCT is O(N^2)
A Query1 NOT IN/IN Query2 would be O( O1(N) * O2(N) )