Deterministic/Volatile function in SQL

Let's take a basic deterministic function and a non-deterministic one:
ABS(2)
NOW()
What about the third case of something that may change but we're not sure, such as:
SELECT
    ABS(2)                  -- deterministic
  , NOW()                   -- not
  , getTableCount(otherTbl) -- function that does a 'SELECT count(1) FROM table'
FROM
    table
Basically, if a row is inserted into or deleted from the other table, the subselect's value will change. So would that one be considered deterministic? The result should always be the same... unless the underlying data changes, so it's almost like a third case. Or is volatile/non-deterministic just taken to mean 'if it ever changes, ever, under any circumstances, then it's volatile'?

There are different interpretations of determinism, even when restricted to the domain of SQL functions. It depends on what the consumer of the determinism guarantee needs and assumes.
The usual definition of determinism is that a deterministic function always returns the same value when given the same input values for its parameters.
If the function consumes state, we can implicitly treat that state as an extra input parameter: the original function(p1,...pn) becomes function(p1,...pn,state). But then, if two calls see different states, the inputs are no longer the same, so we could not talk about determinism anymore. With this in mind, we will use the terms state-sensitive determinism and state-insensitive determinism to differentiate those cases.
Our state-insensitive determinism is equivalent to PostgreSQL's IMMUTABLE (PostgreSQL is a good comparison, as it deliberately avoids the term determinism to prevent confusion, as can be seen in the postgresql docs). In this case, the function always returns the same value no matter the state (example: select 1+2). It is the strictest form of determinism and consumers usually take it for granted - query optimizers, for example, can replace such calls with their result (select 1+2 becomes select 3). Here the state does not influence the result, so even if we add state as an extra parameter, the function still returns the same value.
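As an illustration, a minimal PostgreSQL sketch of a state-insensitive (IMMUTABLE) function (the name add_two is hypothetical):
-- The result depends only on the argument, never on database state,
-- so the function can be declared IMMUTABLE and the planner is free
-- to pre-evaluate calls with constant arguments.
CREATE FUNCTION add_two(x integer) RETURNS integer AS $$
    SELECT x + 2;
$$ LANGUAGE SQL IMMUTABLE;
SELECT add_two(1);  -- the planner may fold this to the constant 3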
When the result does not change for the same state but may change otherwise, we have our state-sensitive determinism, or PostgreSQL's STABLE (example: select v, sum(v) over () from tbl where v > 1000;). Determinism here is in a gray area. A query optimizer sees such a function as deterministic: since a query runs against a well-defined state, at least in transactional databases, it is fine to calculate the value only once instead of many times, because future calculations would give the same result. But a materialized computed column or an index cannot accept this same function as deterministic, because a small change in the state would invalidate all of its precomputed, stored values. The OP's getTableCount(otherTbl) lives in this gray area: for a query optimizer its determinism is enough to avoid extra calculations; for materialized values it is not enough and cannot be accepted as a source for a stored value. If we model state as an extra parameter, the result may change between different states.
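To make the gray area concrete, here is a hedged PostgreSQL sketch of how something like the OP's getTableCount could be declared (the body is an assumption based on the question's comment, and it is written against a fixed table because a plain SQL function cannot take a table name as a parameter):
-- Reads database state, so it is only stable within one statement's
-- snapshot: safe for a query optimizer to evaluate once per query,
-- unsafe as a source for a materialized/persisted computed value.
CREATE FUNCTION getTableCount() RETURNS bigint AS $$
    SELECT count(1) FROM otherTbl;
$$ LANGUAGE SQL STABLE;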
If we consume a value generated from some uncontrolled state, like random() (at least when we don't choose the seed of a pseudorandom function), then we can't achieve determinism at all. In PostgreSQL's terms, this is VOLATILE. A VOLATILE function is non-deterministic by nature because it can produce different values even within the same table scan, as is the case for random(). (For time-related functions, see Postgres now() timestamp doesn't change, when script works; the time may be the transaction time or the statement time, which affects your view of what is deterministic.)
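A quick way to observe VOLATILE behaviour within a single scan:
-- random() is VOLATILE: every row of the same scan gets a new value,
-- even though the "inputs" never changed.
SELECT random() FROM generate_series(1, 3);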
MySQL has different keywords: DETERMINISTIC / NOT DETERMINISTIC, READS SQL DATA / MODIFIES SQL DATA (similar to PostgreSQL's LEAKPROOF), and NO SQL / CONTAINS SQL, as seen in the mysql docs. The objective is the same as PostgreSQL's: giving hints to the specific consumer, be it a query optimizer or a materialized value, so it can adapt its behaviour to its interpretation of determinism. The database vendors probably leave this responsibility to the users because determining the causal graph of what influences what would be complex and problematic.
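For comparison, a hedged MySQL sketch of declaring those characteristics (names hypothetical; MySQL does not verify the promises, it just consumes them as hints):
DELIMITER //
CREATE FUNCTION get_table_count() RETURNS BIGINT
    NOT DETERMINISTIC
    READS SQL DATA
BEGIN
    -- the declared characteristics tell the consumer how to treat us
    RETURN (SELECT COUNT(1) FROM otherTbl);
END//
DELIMITER ;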
When vendors talk about determinism, they are probably talking about one of these flavors. In the sqlserver docs, Microsoft says that the state must be the same, so they are probably talking about our state-sensitive determinism. The sqlite docs, on the other hand, take the state-insensitive approach: functions must return the same result even across different states to be considered deterministic, because they follow stricter rules. Oracle implicitly follows the same flavor as SQLite in their docs.
Our transactional databases will typically use some mechanism like MVCC to hold the state steady within a transaction. In that case we could think of the transaction timestamp as an input to our functions. But if we take more complex cases like distributed databases, then determinism becomes harder to achieve and may eventually have to consider consensus algorithms.

Related

Compound "OR" evaluation in DB2

I've searched the forums and found a few related threads but no definitive answer.
(case
when field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%'...
In the above statement, if field1 is "like" t001, will the others even be evaluated?
(case
when (field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%')...
Does adding parenthesis as shown above change the evaluation?
In general, databases short-circuit boolean operations. That is, they stop at the first value that defines the result -- the first "true" for OR, the first "false" for AND.
That said, there are no guarantees, nor are there guarantees about the order of evaluation. So, DB2 could decide to test the middle, then the last, and then the first. In practice, though, these predicates are pretty much equivalent, so I would expect the ordering to be either first to last or last to first.
Remember: SQL is a descriptive language, not a procedural language. A SQL query describes the result set, but not the steps used to generate it.
You don't know.
SQL is a declarative language, not an imperative one. You describe what you want, the engine provides it. The database engine will decide in which sequence it will evaluate those predicates, and you don't have control over that.
If you get the execution plan today it may show one sequence of steps, but if you get it tomorrow it may show something different.
Not strictly answering your question, but if you have many of these matches, a simpler, possibly faster, and easier to maintain solution would be to use REGEXP_LIKE. The example you've posted could be written like this:
CASE WHEN REGEXP_LIKE(field1, '.*T(0|2|3)01.*') ...
Just to show how it really works in this simple case:
select
    field1
  , case
        when field1 LIKE '%T001%' OR RAISE_ERROR('75000', '%T201%') LIKE '%T201%' then 1
        else 0
    end as expr
from
(
    values
        'abcT001xyz'
     --, '***'
) t (field1);
The query returns the expected result for the statement above as is.
But if you uncomment the commented-out line, you get SQLCODE=-438.
This means that the 2nd OR operand is not evaluated if the 1st returns true.
Note that this is just a demonstration. There is no guarantee that it will work this way in every case.
Just to add to some points made about the difference between so-called procedural languages on the one hand, and SQL (which is variously described as a declarative or descriptive language) on the other.
SQL defines a set of relatively high-level operators for working with arrays of data. They are "high-level" in the sense that they work with arrays in a concise fashion that has not been typical of general-purpose or procedural languages. Like all operators (whether array-based or not), there is typically more than one algorithm capable of implementing each one.
In contrast to "general purpose" programming languages to which the vast majority of programmers are exposed, the existence of these array operators - in particular, the ability to combine them algebraically into an expression which defines a composite operation (or a query), and the absence of any explicit algorithms for iteration - was once a defining feature of SQL.
The distinction is less sharp nowadays, with a resurgent interest in functional languages and features, but most still view SQL as a beast of its own kind amongst commercially-popular tooling.
It is often said that in SQL you define what results you want not how to get them. But this is true for every language. It is true even for machine code operators, if you account for how the implementation in circuitry can be varied - and does vary, between CPU designs. It is certainly true for all compiled languages, where compilers employ many different machine code algorithms to actually implement operations specified in the source code - loop unrolling, for one example.
The feature which continues to distinguish SQL (and the relational databases which implement it), is that the algorithm which implements an operation is determined at the time of each execution, and it does this not just by algebraic manipulation of queries (which is not dissimilar to what compilers do), but also by continuously generating and analysing statistics about the data being operated upon and the consequences of previous executions.
Put another way, the database execution engine is engaged in a constant search for the best practical algorithms (and combinations thereof) to implement its overall workload. It is capable not just of accommodating past experience, but of reacting to changes (such as in data volumes, in degree of concurrency and transactional conflict, or in systemic constraints like available memory or overall workload).
The upshot of all this is that there is a specific order of evaluation in SQL, just like any other language. It is this order which defines a correct result. But unless written in so-called RBAR style (and even then, but to a more limited extent...), the database engine has tremendous latitude to implement shortcuts and performance optimisations, provided these do not change the final result.
Many operators fall into a class where it is possible to determine the result in many cases without evaluating all operands. I'm not sure what the formal word is to describe this property - partial evaluativity, maybe - but casually it is called short-circuiting. The OR operator has this property.
Another property of the OR operation is that it is associative and commutative. That is, the order in which a series of them is applied, and the order of their operands, does not matter: it behaves like addition, where you can add numbers in any order without affecting the result.
With a series of OR conditions, these are capable of being reordered and partially evaluated, provided the evaluation of any particular operand does not cause side-effects or depend on hidden variables. It is therefore quite likely that the database engine may reorder or partially evaluate them.
If the operands do cause side-effects or depend on hidden variables (functions which get the current date or time being a prime example of the latter), these often cause problems in queries - either because the database engine does not realise they have side-effects or hidden variables, or because the database does realise it but doesn't handle the case in the way the programmer expects. In such cases, a query may have to be completely rewritten (typically, cracked into multiple statements) to force a specific evaluation order or guarantee full evaluation.
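As a hedged illustration (generic SQL; all names hypothetical) of cracking a query into multiple statements to pin the evaluation order:
-- Step 1: materialize the cheap filter first, guaranteeing it is
-- fully applied before anything else touches the rows.
CREATE TABLE prefiltered AS
SELECT field1
FROM source_table
WHERE field1 LIKE '%T001%';
-- Step 2: the expensive or side-effecting predicate now only ever
-- sees rows that already passed the first filter.
SELECT field1
FROM prefiltered
WHERE expensive_udf(field1) = 1;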

What's the most efficient way to do a case-insensitive like expression?

In Pervasive v13, is there a "more performant" way to perform a case-insensitive like expression than is shown below?
select * from table_name
where upper(field_name) like '%TEST%'
The UPPER function above has performance cost that I'd like to avoid.
I disagree with those who say that the performance-overhead of UPPER is minor; it is doubling the execution time compared to the exact same query without UPPER.
Background:
I was very satisfied with the execution time of this wildcard-like-expression until I realized the result set was missing records due to mismatched capitalization.
Then, I implemented the UPPER technique (above). This succeeded in including those missing records, but it doubled the execution time of my query.
This UPPER technique for case-insensitive comparison seems outlandishly intensive to me, even at a conceptual level. Instead of changing a field's case for every record in a large database table, I'm hoping that the SQL standard provides some type of syntactical flag that modifies the like-expression's behavior regarding case-sensitivity.
From there, behind the scenes, the database engine could generate a compiled regular expression (or some other optimized case-insensitive evaluator) that could well outperform this UPPER technique. This seems like a possibility that might exist.
However, I must admit, at some level there still must be a conversion to make the letter-comparisons. And perhaps, this UPPER technique is no worse than any other method that might achieve the same result set.
Regardless, I'm posting this question in hopes someone might reveal a more performant syntax I'm unaware of.
You do not need the UPPER when you define your column using CASE.
The CASE keyword causes PSQL to ignore case when evaluating restriction clauses involving a string column. CASE can be specified as a column attribute in a CREATE TABLE or ALTER TABLE statement, or in an ORDER BY clause of a SELECT statement.
(see: https://docs.actian.com/psql/psqlv13/index.html#page/sqlref%2Fsqlref.CASE_(string).htm )
CREATE TABLE table_name (field_name VARCHAR(100) CASE)
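With the column declared that way, the original query should match regardless of capitalization without UPPER (a sketch, assuming the CASE attribute behaves as documented):
-- no UPPER needed: comparisons on a CASE column ignore case
SELECT * FROM table_name
WHERE field_name LIKE '%TEST%';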

Does select * impact stored procedure performance?

I know this could be a trivial question but I keep hearing one of my teachers voice saying
don't use SELECT * within a stored procedure; that affects performance, and it returns data that could be breaking its clients if its schema changes, causing unknown ripple effects
I can't find any article confirming that concept, and I think that should be noticeable if true.
In most modern professional SQL implementations (Oracle, SQL Server, DB2, etc.), the use of SELECT * has a negative impact only in a top-level SELECT. In all other cases the SQL compiler should perform column-optimization anyway, eliminating any columns that are not used.
And the negative effect of * in a top-level SELECT is almost entirely related to returning all of the columns when you probably do not need all of them.
IMHO, in all other cases(**), including most ad-hoc cases, the use of * is perfectly fine and has no detrimental effects (and obvious beneficial conveniences). The widespread universal pronouncements against using * are largely an archaic holdover from the time (10-15 years ago) when most SQL compilers did not have effective column-elimination optimization techniques.
(** - one exception is in VIEW definitions in SQL Server, because it doesn't automatically notice if the bound column list changes.)
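To illustrate that SQL Server exception, a hedged sketch (table and view names hypothetical): a view defined with * keeps the column list bound at creation time until it is refreshed.
CREATE TABLE t (a int);
GO
CREATE VIEW v AS SELECT * FROM t;   -- column list is bound here
GO
ALTER TABLE t ADD b int;
GO
SELECT * FROM v;                    -- still returns only column a
EXEC sp_refreshview 'v';            -- rebinds the view's metadata
SELECT * FROM v;                    -- now returns a and b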
The other reason that you sometimes see for not using SELECT * is not because of any performance issue, but just as a matter of coding practices. That is, that it's generally better to write your SQL code to be explicit about what columns you (or your client code) expects and thus are dependent on. If you use * then it's implicit and someone reading your SQL code cannot easily tell if your application is truly dependent on a certain column or not. (And IMHO, this is the more valid reason.)
I found this quote in a paper about the cost of the SELECT * instruction:
“[…] real harm occurs when a sort is required. Every SELECTed column, with the sorting columns repeated, makes up the width of the sort work file wider. The wider and longer the file, the slower the sort is.” In http://www.quest.com/whitepapers/10_SQL_Tips.pdf
This paper is about the DB2 engine, but this likely applies to other engines too.

PostgreSQL query isolation with C extension using persistent data

The situation:
I need to do a procedural sort of a query result set.
The data set size/access frequency does not allow this sort to occur in application memory.
I want a shared library written in C to function as the ORDER BY parameter in the query. It should accept some fields from the row being sorted and assign a score, with the result dependent on what has been read already.
So: how to handle heap data in a PostgreSQL shared library which should persist within a query but not between them?
The DBMS will determine whether the ORDER BY clause means that the data is kept in memory or spilled to disk. It is highly unlikely that you can alter that by a stored procedure invoked in the ORDER BY clause of your query. It is also completely unclear to me whether your hypothetical procedure would try to keep the data in memory or spill it to disk. You should let the DBMS do the sorting; its sort is usually fairly well tuned. You just need to ensure that it (the DBMS) can do the comparison you need.
Unfortunately the sort is procedural; SQL just can't do it. The data in question is memo data for the procedural sort and PostgreSQL has no knowledge of its existence.
If you can write a stored procedure (or C function) that takes in the 'memo data' and generates a sortable string (or other type, but string is most plausible), then you can evaluate the function on the data in the select-list, and have SQL sort by the result value. The procedure will have to determine a stable value for the string based solely on one row at a time.
SELECT t.id, t.memo_data, magic_function(t.memo_data) AS sortable
FROM SomeTable AS T
ORDER BY sortable;
You might have to specify the function in the ORDER BY clause, or fall back on the 'ordinal position' sort (ORDER BY 3). You write the C code that SQL knows as magic_function().
Note that this function must operate on only a single value (or, more accurately, the arguments it is passed from a single row of data at a time). It is not usually feasible to make it depend on any other rows. It must be an invariant function - given the same input, it must always produce the same output. If you don't do that, you are going to get quasi-random results.
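At the SQL level, the C function might be declared like this (a sketch; the library and symbol names are assumptions):
-- IMMUTABLE (or at least STABLE) tells the planner the function is an
-- invariant of its input and safe to evaluate once per row in any order.
CREATE FUNCTION magic_function(memo_data text) RETURNS text
AS 'sortkey', 'magic_function'
LANGUAGE C STRICT IMMUTABLE;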
You may need to look up 'memory duration'. You might, conceivably, be able to allocate memory with 'statement duration', which the function could use, but you then need to consider how that is initialized and released. You might need to look at the manual on Memory Management.

Why are SQL-Server UDFs so limited?

From the MSDN docs for create function:
User-defined functions cannot be used to perform actions that modify the database state.
My question is simply - why?
Yes, a UDF that modifies data may have potentially unwanted side-effects.
Yes, there is overhead involved if a UDF is called thousands of times.
But that is the whole point of design and testing - to ensure that such issues are ironed out before deployment. So why do DB vendors insist on imposing these artificial limitations on developers? What is the point of a language construct that can essentially only be used as a wrapper for select statements?
The reason for this question is as follows: I am writing a function to return a GUID for a certain unique integer ID. If a GUID is already allocated for that ID I simply return it; otherwise I want to generate a new GUID, store that into a table, and return the newly-generated GUID. (Yes, this sounds long-winded and possibly crazy, but when you're sending data to another dev company who believes their design was handed down by God and cannot be improved upon, it's easier just to smile and nod and do what they ask).
I know that I can use a stored procedure with an output parameter to achieve the same result, but then I have to declare a new variable just to hold the result of the sproc. Not only that, I then have to convert my simple select into a while loop that inserts into a temporary table, and call the sproc for every iteration of that loop.
It's usually best to think of the available tools as a spectrum, from Views, through UDFs, out to Stored Procedures. At the one end (Views) you have a lot of restrictions, but this means the optimizer can actually "see through" the code and make intelligent choices. At the other end (Stored Procedures), you've got lots of flexibility, but because you have such freedom, you lose some abilities (e.g. because you can return multiple result sets from a stored proc, you lose the ability to "compose" it as part of a larger query).
UDFs sit in a middle ground - you can do more than you can in a view (multiple statements, for example), but you don't have as much flexibility as with a stored proc. By giving up this freedom, a UDF's outputs can be composed as part of a larger query. By not having side effects, you guarantee, for example, that it doesn't matter in which order the UDF is applied to the rows. If side effects were allowed, the optimizer might have to give an ordering guarantee.
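For instance (a hedged SQL Server sketch; all names hypothetical), an inline table-valued function can be composed directly into a larger query in a way a stored procedure cannot:
CREATE FUNCTION dbo.OrdersSince(@cutoff date)
RETURNS TABLE
AS RETURN (
    SELECT OrderId, CustomerId, Amount
    FROM dbo.Orders
    WHERE OrderDate >= @cutoff
);
GO
-- Composable: the optimizer can inline the function into the outer query.
SELECT c.Name, SUM(o.Amount) AS Total
FROM dbo.Customers AS c
JOIN dbo.OrdersSince('2024-01-01') AS o ON o.CustomerId = c.CustomerId
GROUP BY c.Name;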
I understand your issue, I think, but taking this from your comment:
I want to do something like select my_udf(my_variable) from my_table, where my_udf either selects or creates the value it returns
So you want a select that (potentially) modifies data. Can you look at that sentence on its own and tell me that that reads perfectly OK? - I certainly can't.
Reading your description of what you actually need to do:
I am writing a function to return a GUID for a certain unique integer ID. If a GUID is already allocated for that ID I simply return it; otherwise I want to generate a new GUID, store that into a table, and return the newly-generated GUID.
I know that I can use a stored procedure with an output parameter to achieve the same result, but then I have to declare a new variable just to hold the result of the sproc. Not only that, I then have to convert my simple select into a while loop that inserts into a temporary table, and call the sproc for every iteration of that loop.
from that last sentence it sounds like you have to process many rows at once, so how about a single INSERT that inserts the GUIDs for those IDs that don't already have them, followed by a single SELECT that returns all the GUIDs that (now) exist?
Sometimes if you cannot implement the solution you came up with, it may be an indication that your solution is not optimal.
Using a statement like this
INSERT INTO IntGuids(IntValue, GuidValue)
SELECT MyIntValues.IntValue, NEWID()
FROM MyIntValues
LEFT OUTER JOIN IntGuids ON MyIntValues.IntValue = IntGuids.IntValue
WHERE IntGuids.IntValue IS NULL
creates all the GUIDs you need in one statement. There is no need to SELECT+INSERT every single value.
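It can then be followed, as the previous answer suggested, by a single SELECT that reads all the GUIDs back (same hypothetical table names):
-- after the INSERT above, every IntValue has a GUID; fetch them in bulk
SELECT m.IntValue, g.GuidValue
FROM MyIntValues AS m
JOIN IntGuids AS g ON g.IntValue = m.IntValue;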