Measuring the complexity of SQL statements

Measuring the complexity of SQL statements - sql

The complexity of methods in most programming languages can be measured in cyclomatic complexity with static source code analyzers. Is there a similar metric for measuring the complexity of a SQL query?
It is simple enough to measure the time it takes a query to return, but what if I just want to be able to quantify how complicated a query is?
[Edit/Note]
While getting the execution plan is useful, that is not necessarily what I am trying to identify in this case. I am not looking for how difficult it is for the server to execute the query, I am looking for a metric that identifies how difficult it was for the developer to write the query, and how likely it is to contain a defect.
[Edit/Note 2]
Admittedly, there are times when measuring complexity is not useful, but there are also times when it is. For a further discussion on that topic, see this question.

Common measures of software complexity include Cyclomatic Complexity (a measure of how complicated the control flow is) and Halstead complexity (a measure of complex the arithmetic is).
The "control flow" in a SQL query is best related to "and" and "or" operators in query.
The "computational complexity" is best related to operators such as SUM or implicit JOINS.
Once you've decided how to categorize each unit of syntax of a SQL query as to whether it is "control flow" or "computation", you can straightforwardly compute Cyclomatic or Halstead measures.
What the SQL optimizer does to queries I think is absolutely irrelevant. The purpose of complexity measures is to characterize how hard is to for a person to understand the query, not how how efficiently it can be evaluated.
Similarly, what the DDL says or whether views are involved or not shouldn't be included in such complexity measures. The assumption behind these metrics is that the complexity of machinery inside a used-abstraction isn't interesting when you simply invoke it, because presumably that abstraction does something well understood by the coder. This is why Halstead and Cyclomatic measures don't include called subroutines in their counting, and I think you can make a good case that views and DDL information are those "invoked" abstractractions.
Finally, how perfectly right or how perfectly wrong these complexity numbers are doesn't matter much, as long they reflect some truth about complexity and you can compare them relative to one another. That way you can choose which SQL fragments are the most complex, thus sort them all, and focus your testing attention on the most complicated ones.

I'm not sure the retrieval of the query plans will answer the question: the query plans hide a part of the complexity about the computation performed on the data before it is returned (or used in a filter); the query plans require a significative database to be relevant. In fact, complexity, and length of execution are somewhat oppposite; something like "Good, Fast, Cheap - Pick any two".
Ultimately it's about the chances of making a mistake, or not understanding the code I've written?
Something like:
number of tables times (1
+1 per join expression (+1 per outer join?)
+1 per predicate after WHERE or HAVING
+1 per GROUP BY expression
+1 per UNION or INTERSECT
+1 per function call
+1 per CASE expression
)

Please feel free to try my script that gives an overview of the stored procedure size, the number of object dependencies and the number of parameters -
Calculate TSQL Stored Procedure Complexity

SQL queries are declarative rather than procedural: they don't specify how to accomplish their goal. The SQL engine will create a procedural plan of attack, and that might be a good place to look for complexity. Try examining the output of the EXPLAIN (or EXPLAIN PLAN) statement, it will be a crude description of the steps the engine will use to execute your query.

Well I don't know of any tool that did such a thing, but it seems to me that what would make a query more complicated would be measured by:
the number of joins
the number of where conditions
the number of functions
the number of subqueries
the number of casts to differnt datatypes
the number of case statements
the number of loops or cursors
the number of steps in a transaction
However, while it is true that the more comlex queries might appear to be the ones with the most possible defects, I find that the simple ones are very likely to contain defects as they are more likely to be written by someone who doesn't understand the data model and thus they may appear to work correctly, but in fact return the wrong data. So I'm not sure such a metric wouild tell you much.

In the absence of any tools that will do this, a pragmatic approach would be to ensure that the queries being analysed are consistently formatted and to then count the lines of code.
Alternatively use the size of the queries in bytes when saved to file (being careful that all queries are saved using the same character encoding).
Not brilliant but a reasonable proxy for complexity in the absence of anything else I think.

In programming languages we have several methods to compute the time complexity or space complexity.
Similarly we could compare with sql as well like in a procedure the no of lines you have with loops similar to a programming language but unlike only input usually in programming language in sql it would along with input will totally depend on the data in the table/view etc to operate plus the overhead complexity of the query itself.
Like a simple row by row query
Select * from table ;
// This will totally depend on no of
records say n hence O(n)
Select max(input) from table;
// here max would be an extra
overhead added to each
Therefore t*O(n) where t is max
Evaluation time

Here is an idea for a simple algorithm to compute a complexity score related to readability of the query:
Apply a simple lexer on the query (like ones used for syntax coloring in text editors or here on SO) to split the query in tokens and give each token a class:
SQL keywords
SQL function names
string literals with character escapes
string literals without character escape
string literals which are dates or date+time
numeric literals
comma
parenthesis
SQL comments (--, /* ... */)
quoted user words
non quoted user words: everything else
Give a score to each token, using different weights for each class (and differents weights for SQL keywords).
Add the scores of each token.
Done.
This should work quite well as for example counting sub queries is like counting the number of SELECT and FROM keywords.
By using this algorithm with different weight tables you can even measure the complexity in different dimensions. For example to have nuanced comparison between queries. Or to score higher the queries which use keywords or functions specific to an SQL engine (ex: GROUP_CONCAT on MySQL).
The algorithm can also be tweaked to take in account the case of SQL keywords: increase complexity if they are not consistently upper case. Or to account for indent (carriage return, position of keywords on a line)
Note: I have been inspired by #redcalx answer that suggested applying a standard formatter and counting lines of code. My solution is simpler however as it doesn't to build a full AST (abstract syntax tree).

Toad has a built-in feature for measuring McCabe cyclomatic complexity on SQL:
https://blog.toadworld.com/what-is-mccabe-cyclomatic-complexity

Well if you're are using SQL Server I would say that you should look at the cost of the query in the execution plan (specifically the subtree cost).
Here is a link that goes over some of the things you should look at in the execution plan.

Depending on your RDBMS, there might be query plan tools that can help you analyze the steps the RDBMS will take in fetching your query.
SQL Server Management Studio Express has a built-in query execution plan. Pervasive PSQL has its Query Plan Finder. DB2 has similar tools (forgot what they're called).

A good question. The problem is that for a SQL query like:
SELECT * FROM foo;
the complexity may depend on what "foo" is and on the database implementation. For a function like:
int f( int n ) {
if ( n == 42 ) {
return 0;
}
else {
return n;
}
}
there is no such dependency.
However, I think it should be possible to come up with some useful metrics for a SELECT, even if they are not very exact, and I'll be interested to see what answers this gets.

It's reasonably enough to consider complexity as what it would be if you coded the query yourself.
If the table has N rows then,
A simple SELECT would be O(N)
A ORDER BY is O(NlogN)
A JOIN is O(N*M)
A DROP TABLE is O(1)
A SELECT DISTINCT is O(N^2)
A Query1 NOT IN/IN Query2 would be O( O1(N) * O2(N) )

Related

What is the efficiency of a query + subquery that finds the minimum parameter of a table in SQL?

I'm currently taking an SQL course and trying to understand efficiency of queries.
Given this query, what's the efficiency of it:
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
FROM Customers)
What i'm trying to understand is if the subquery runs once at the beginning and then the query is O(n+n)?
Or does the subquery run everytime you go through a customer's age which makes it O(n^2)?
Thank you!

If you want to understand how the query optimizer interperets a query you have to review the execution / explain plan which almost every RDBMS makes available.
As noted in the comments you tell the RDBMS what you want, not how to get it.
Very often it helps to have a deeper understanding of the particular database engine being used in order to write a query in the most performant way, ie, to be able to think like the query processor.
Like any language, there's more than one way to skin a cat, so to speak, and with SQL there is usually more than one way to write a query that results in the same output - very often many ways, depending on the complexity.
How a query execution plan gets built and executed is determined by the query optimizer at compile time and depends on many factors, depending on the RDBMS, such as data cardinality, table size, row size, estimated number of rows, sargability, indexes, available resources, current load, concurrency, isolation level - just to name a few.
It often helps to write queries in the most performant way by thinking what you would have to do to accomplish the same task.
In your example, you are looking for all the rows in a table where a particular value equals another value. You have chosen to find that value by first looking for the minimum age - you would only have to do this once as it's a single scalar value, so it's reasonable to assume (but not guaranteed) the database engine would do the same.
You could also approach the problem by aggregating and limiting to the top qualifying row and including ties, if the syntax is supported by the RDBMS, and joining the results.
Ultimately there is no black and white answer.

Compound "OR" evaluation in DB2

I've searched the forums and found a few related threads but no definitive answer.
(case
when field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%'...
In the above statement, if field1 is "like" t001, will the others even be evaluated?
(case
when (field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%')...
Does adding parenthesis as shown above change the evaluation?

In general, databases short-circuit boolean operations. That is, they stop at the first value that defines the result -- the first "true" for OR, the first "false" for AND.
That said, there are no guarantees. Nor are there guarantees about the order of evaluation. So, DB2 could decide to test the middle, the last, and then the first. That said, these are pretty equivalent, so I would expect the ordering to be either first to last or last to first.
Remember: SQL is a descriptive language, not a procedural language. A SQL query describers the result set, but not the steps used to generate it.

You don't know.
SQL is a declarative language, not an imperative one. You describe what you want, the engine provides it. The database engine will decide in which sequence it will evaluate those predicates, and you don't have control over that.
If you get the execution plan today it may show one sequence of steps, but if you get it tomorrow it may show something different.

Not strictly answering your question, but if you have many of these matches, a simpler, possibly faster, and easier to maintain solution would be to use REGEXP_LIKE. The example you've posted could be written like this:
CASE WHEN REGEXP_LIKE(field1, '.*T(0|2|3)01.*') ...

Just for indication how it really works in this simple case.
select
field1
, case
when field1 LIKE '%T001%' OR RAISE_ERROR('75000', '%T201%') LIKE '%T201%' then 1 else 0
end as expr
from
(
values
'abcT001xyz'
--, '***'
) t (field1);
The query returns expected result for the statement above as is.
But if you uncomment the commented out line, you get SQLCODE=-438.
This means, that 2-nd OR operand is not evaluated, if 1-st returns true.
Note, that it's just for demonstration. There is no any guarantee, that this will work in such a way in any case.

Just to add to some points made about the difference between so-called procedural languages on the one hand, and SQL (which is variously described as a declarative or descriptive language) on the other.
SQL defines a set of relatively high-level operators for working with arrays of data. They are "high-level" in this sense because they work with arrays in a concise fashion that has not been typical of general purpose or procedural languages. Like all operators (whether array-based or not), there is typically more than one algorithm capable of implementing an operator.
In contrast to "general purpose" programming languages to which the vast majority of programmers are exposed, the existence of these array operators - in particular, the ability to combine them algebraically into an expression which defines a composite operation (or a query), and the absence of any explicit algorithms for iteration - was once a defining feature of SQL.
The distinction is less sharp nowadays, with a resurgent interest in functional languages and features, but most still view SQL as a beast of its own kind amongst commercially-popular tooling.
It is often said that in SQL you define what results you want not how to get them. But this is true for every language. It is true even for machine code operators, if you account for how the implementation in circuitry can be varied - and does vary, between CPU designs. It is certainly true for all compiled languages, where compilers employ many different machine code algorithms to actually implement operations specified in the source code - loop unrolling, for one example.
The feature which continues to distinguish SQL (and the relational databases which implement it), is that the algorithm which implements an operation is determined at the time of each execution, and it does this not just by algebraic manipulation of queries (which is not dissimilar to what compilers do), but also by continuously generating and analysing statistics about the data being operated upon and the consequences of previous executions.
Put another way, the database execution engine is engaged in a constant search for the best practical algorithms (and combinations thereof) to implement its overall workload. It is capable of accomodating not just past experience, but of reacting to changes (such as in data volumes, in degree of concurrency and transactional conflict, or in systemic constraints like available memory or overall workload).
The upshot of all this is that there is a specific order of evaluation in SQL, just like any other language. It is this order which defines a correct result. But unless written in so-called RBAR style (and even then, but to a more limited extent...), the database engine has tremendous latitude to implement shortcuts and performance optimisations, provided these do not change the final result.
Many operators fall into a class where it is possible to determine the result in many cases without evaluating all operands. I'm not sure what the formal word is to describe this property - partial evaluativity, maybe - but casually it is called short-circuiting. The OR operator has this property.
Another property of the OR operation is that it is associative. That is, the order in which a series of them are applied does not matter - it behaves like the addition operator does, where you can add numbers in any order without affecting the result.
With a series of OR conditions, these are capable of being reordered and partially evaluated, provided the evaluation of any particular operand does not cause side-effects or depend on hidden variables. It is therefore quite likely that the database engine may reorder or partially evaluate them.
If the operands do cause side-effects or depend on hidden variables (functions which get the current date or time being a prime example of the latter), these often cause problems in queries - either because the database engine does not realise they have side-effects or hidden variables, or because the database does realise it but doesn't handle the case in the way the programmer expects. In such cases, a query may have to be completely rewritten (typically, cracked into multiple statements) to force a specific evaluation order or guarantee full evaluation.

Which is faster? Multiplication in code, or multiplication in SQL?

Question:
What's faster:
Multiplying 2 doubles in a SQL table and return the table, or returning the table and multiplying two column values in code ?
You can assume that the two columns that need to be multiplied need to be returned anyway.

Multiplication is an extremely fast computation, and whether the chip is asked to do it from SQL or from another places shouldnt make any differnce. The thing that will probably make it quicker in SQL is that it can be done in a single pass (though that depends on how SQL implements it), where as if you do it in code you have to cycle through the result set, but then again you might be doing that anyway.
The real answer though is it really doesnt matter unless you plan to multiply 10's of millions of numbers at a time.

SQL is faster IF you are
using appropriate data types with efficient hardware arithmetic (more)
using join/view efficiently.
and generally
- Avoid delimited fields (more)

The only time I've seen SQL process much slower than native code is when doing excessive powers and logs. Why, I don't know, as the CPU has the same amount of work to do either way.
The only possible reason is that T-SQL is a scripted language. You never mentioned what language you're using.
As mentioned by others, test.
The real answer is to only perform math on the records that need it. This you should achieve with SQL.
Of course, in the extreme, you could use SSE (or AVX with Sandy Bridge) in native code, which will of course yield significantly faster results. I don't believe SQL Server would be able to apply such optimizations.

SQL if you're doing it right!

In this case SQL, include the Mulitply as part of the select statement when you retrieve the resultset. If you are doing in code, then you will have to iterate through the resultset seperately which will be time consuming.
On the other hand if you are just returning two values and want to multiply that, it doesn't matter if its code or sql :)

The answer is to do it in code because the DBMS might be wrong. SQL Server is known to get it wrong.
Try this and notice the scary difference between the two results:
select 1.0 * 3/2
select 3/2 * 1.0

SQL Query theory question - single-statement vs multi-statement queries

When I write SQL queries, I find myself often thinking that "there's no way to do this with a single query". When that happens I often turn to stored procedures or multi-statement table-valued functions that use temp tables (of one sort or another) and end up simply combining the results and returning the result table.
I'm wondering if anyone knows, simply as a matter of theory, whether it should be possible to write ANY query that returns a single result set as a single query (not multiple statements). Obviously, I'm ignoring relevant points such as code readability and maintainability, maybe even query performance/efficiency. This is more about theory - can it be done... and don't worry, I certainly don't plan to start forcing myself to write a single-statement query when multi-statement will better suit my purpose in all cases, but it might make me think twice or a little bit longer on whether there is a viable way to get the result from a single query.
I guess a few parameters are in order - I'm thinking of a relational database (such as MS SQL) with tables that follow common best practices (such as all tables having a primary key and so forth).
Note: in order to win 'Accepted Answer' on this, you'll need to provide a definitive proof (reference to web material or something similar.)

I believe it is possible. I've worked with very difficult queries, very long queries, and often, it is possible to do it with a single query. But most of the time, it's harder to mantain, so if you do it with a single query, make sure you comment your query carefully.
I've never encountered something that could not be done in a single query.
But sometimes it's best to do it in more than one query.

At least with the a recent version of Oracle is absolutely possible. It has a 'model clause' which makes sql turing complete. ( http://blog.schauderhaft.de/2009/06/18/building-a-turing-engine-in-oracle-sql-using-the-model-clause/ ). Of course this is all with the usual limitation that we don't really have unlimited time and memory.
For a normal sql dialect without these abdominations I don't think it is possible.
A task that I can't see how to implement in 'normal sql' would be:
Assume a table with a single column of type integer
For every row
'take the value at the current row and go that many rows back, fetch that value, go that many rows back, and continue until you fetch the same value twice consecutively and return that as the result.'

I can't prove it, but I believe the answer is a cautious yes - provided your database design is done properly. Usually being forced to write multiple statements to get a certain result is a sign that your schema may need some improvements.

I'd say "yes" but can't prove it. However, my main thought process:
Any select should be a set based operation
Your assumption is that you are dealing with mathematically correct sets (ie normalised correctly)
Set theory should guarantee it's possible
Other thoughts:
Multiple SELECT statement often load temp tables/table variables. These can be derived or separated in CTEs.
Any RBAR processing (for good or bad) now be dealt with CROSS/OUTER APPLY onto derived tables
UDFs would be classed as "cheating" in this context I feel, because it allows you to put a SELECT into another module rather than in your single one
No writes allowed in your "before" sequence of DML: this changes state from SELECT to SELECT
Have you seen some of the code in our shop?
Edit, glossary
RBAR = Row By Agonising Row
CTE = Common Table Expression
UDF = User Defined Function
Edit: APPLY: cheating?
SELECT
*
FROM
MyTable1 t1
CROSS APPLY
(
SELECT * FROM MyTable2 t2
WHERE t1.something = t2.something
) t2

In theory yes, if you use functions or a torturous maze of OUTER APPLYs or sub-queries; however, for readability and performance, we have always ended up going with temp tables and multi-statement stored procedures.
As someone above commented, this is usually a sign that your data structure is starting to smell; not that it's bad, but that maybe it's time to denormalise for performance reasons (happens to the best of us), or maybe put a denormalised querying layer in front of your normalised "real" data.

Surprising SQL speed increase

I’ve just found out that the execution plan performance between the following two select statements are massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The execution plans are 98% and 2% respectively. Bit of a difference in speed then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ generated SQL against my hand crafted SQL. I assumed the LIKE command would be slower, but is in fact much much faster.
My question is why is the LEFT() slower than the LIKE '%..'. They are afterall identical?
Also, is there a CPU hit by using LEFT()?

More generally speaking, you should never use a function on the LEFT side of a WHERE clause in a query. If you do, SQL won't use an index--it has to evaluate the function for every row of the table. The goal is to make sure that your where clause is "Sargable"
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName,4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'

It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.

There's a huge impact on using function calls in where clauses as SQL Server must calculate the result for each row. On the other hand, like is a built in language feature which is highly optimized.

If you use a function on a column with an index then the db no longer uses the index (at least with Oracle anyway)
So I am guessing that your example field 'some_string_field' has an index on it which doesn't get used for the query with 'LEFT'

Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT optimizes the test, since it already knows about the length of the prefix and etc., so in a C/C++/... program or without an index, an algorithm using LEFT to implement a certain LIKE behavior would be the fastest. But contrasted to most non-declarative languages, on a SQL database, a lot op optimizations are done for you. For example LIKE is probably implemented by first looking for the % sign and if it is noticed that the % is the last char in the string, the query can be optimized much in the same way as you did using LEFT, but directly using an index.
So, indeed I think you were right after all, they probably are identical in their approach. The only difference being that the db server can use an index in the query using LIKE because there is not a function transforming the column value to something unknown in the WHERE clause.

What happened here is either that the RDBMS is not capable of using an index on the LEFT() predicate and is capable of using it on the LIKE, or it simply made the wrong call in which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method it has first to make an estimate of the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or be using a heuristic rule (eg. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan or the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or innaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the
case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics etc)?
In my experience this is not a trivial task, and is often best left to experts. Or on the other hand, just post the problem to Stackoverflow -- some of us find this stuff fascinating, dog help us.

As #BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, there are more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas