How to test SQL statements and ensure their quality? - sql

I am working on some business intelligence reports these days. The data is fetched by ordinary SQL SELECT statements. The statements are getting more and more complex.
The data I want to report is partially business critical. So I would feel better if I could do something to prove the correctness and quality of the SQL statements.
I know there are some ways to do this for application code. But what can I do to reach these goals at the SQL level?
Thanks in advance.

I'm not aware of any SQL-level proof or QA you could do, since you are looking at the intent (semantics) of the query rather than its syntactical correctness.
I would write a small test harness that takes the SQL statement and runs it on a known test database, then compares the result with an expected set of reference data (a spreadsheet, a simple CSV file, etc.).
For bonus points wrap this in a unit test and make it part of your continuous build process.
If you use a spreadsheet or CSV for the reference data it may be possible to walk through it with the business users to capture their requirements ahead of writing the SQL (i.e test-driven development).
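For illustration, a minimal sketch of such a comparison in T-SQL, assuming the report query has been wrapped in a view and the reference CSV has been loaded into a table (all object and column names here are illustrative):

-- Fail the test if the report query and the reference data disagree in either direction.
IF EXISTS (SELECT customer_id, total_sales FROM dbo.vw_ReportUnderTest
           EXCEPT
           SELECT customer_id, total_sales FROM dbo.ExpectedReportRows)
   OR EXISTS (SELECT customer_id, total_sales FROM dbo.ExpectedReportRows
              EXCEPT
              SELECT customer_id, total_sales FROM dbo.vw_ReportUnderTest)
    RAISERROR ('Report query does not match the reference data.', 16, 1);

Running this inside a unit test means any mismatch in either direction fails the build.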

Testing the correctness of the statements would require a detailed description of the logic behind the report requirement, and then, independent of your SQL, the creation of appropriate test data sets constructed against the requirements to ensure that the correct data, and only the correct data, is selected for each test case.
Constructing these cases for more and more complex conditions will get very difficult, though - reporting is notorious for ever-changing requirements.

You could also consider capturing metrics such as the duration of running each statement. You could either do this at the application level, or by writing into an audit table at the beginning and end of each SQL statement. This is made easier if your statements are encapsulated in stored procedures, and can also be used to monitor who is calling the procedure, at what times of day and where from.
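A minimal sketch of that audit-table idea, assuming the report lives in a stored procedure (table, procedure and column names are illustrative):

CREATE TABLE dbo.ReportAudit (
    AuditId    int IDENTITY PRIMARY KEY,
    ProcName   sysname,
    CalledBy   sysname,
    StartedAt  datetime2,
    FinishedAt datetime2 NULL
);
GO
CREATE PROCEDURE dbo.rpt_MonthlySales
AS
BEGIN
    DECLARE @AuditId int;
    -- record who called the procedure and when it started
    INSERT dbo.ReportAudit (ProcName, CalledBy, StartedAt)
    VALUES (OBJECT_NAME(@@PROCID), SUSER_SNAME(), SYSDATETIME());
    SET @AuditId = SCOPE_IDENTITY();

    -- ... the actual report SELECT goes here ...

    -- record when it finished, so duration can be reported on later
    UPDATE dbo.ReportAudit SET FinishedAt = SYSDATETIME() WHERE AuditId = @AuditId;
END;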

Looking forward to reading answers on this one. It's simple to check whether the statement works or not: either it runs or it doesn't. You can also check it against the requirement: does it return these 14 columns in the specified order?
What is hard to check is whether the result set is the correct answer. When you have tables with millions of rows joined to other tables with millions of rows, you can't physically check everything to know what the results should be. I like the theory of running against a test database with known conditions, but building this, and accounting for the edge cases that might affect the data in production, is something that I think would be hard to tackle.
You can sometimes look at things in such a way as to tell whether they are right. Sometimes I add a small limiting where clause to a report query to get a result set I can manually check (say, only the records for a few days). Then I go into the database tables individually and see if they match what I have. For instance, if I know there were 12 meetings for the client in that time period, is that the result I got? If I got 14, then one of the joins probably needs more limiting data (there are two records and I only want the latest one). If I got 10, then I figure out what is eliminating the other two (usually a join that should be a left join, or a where condition) and whether those two should be missing under the business rules I've been given.
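For instance, a narrowed-down check of the meeting counts might look like this (table and column names are illustrative):

SELECT m.Client_Id, COUNT(*) AS Meeting_Count
FROM dbo.Meetings m
WHERE m.Meeting_Date >= '20240301' AND m.Meeting_Date < '20240308'  -- only a few days, small enough to verify by hand
GROUP BY m.Client_Id;  -- compare these counts with what the report returns for the same window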
Often when building, I return more columns than I actually need so that I can see the other data, this may make you realize that you forgot to filter for something that you need to filter for when an unexpected value turns up.
I look carefully at the number of results as I go through and add joins and where conditions. Did they go up or down, and if so, is that what I wanted?
If there is a place that currently returns data my users are likely to compare this report to, I will look there. For instance, if they can search on the website for the available speakers and I'm doing an export of speaker data to the client, the totals between the two had better match or be explainable by different business rules.
When doing reporting, the users of the report often have a better idea of what the data should say than the developer. I always ask one of them to look at the QA data and confirm the report is correct. They will often say things like "what happened to project XYZ, it should be on this report". Then you know to look at that particular case.
One other thing you need to test is not just the correctness of the data, but performance. If you only test against a small test database, you may have a query that works and gets you the correct data but which times out every time you try to run it on prod with the larger data set. So never test only against a limited data set, and if at all possible run load tests as well. You do not want a bad query to take down your prod system and have that be the first indicator that there is a problem.

It is rather hard to test whether a SQL statement is correct. One idea I have is to write a program that inserts semi-random data into the database. Since the program generates the data, it can also do the calculation internally and produce the expected result. Run the SQL and check whether it produces the same result as the program.
If the results from the program and the SQL are different, then this test raises a flag that there may be a logic issue in the SQL, the testing program, or both.
Writing the unit test program to calculate the result can be time consuming, but at least you have the program to do the validation.
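As a sketch of the data-seeding half of this idea in T-SQL (the independent expected-result calculation would live in the generating program itself; the table and columns here are illustrative):

INSERT dbo.Orders (CustomerId, Amount, OrderDate)
SELECT TOP (1000)
       ABS(CHECKSUM(NEWID())) % 100 + 1,                               -- semi-random customer id 1-100
       CAST(ABS(CHECKSUM(NEWID())) % 100000 AS decimal(10, 2)) / 100,  -- semi-random amount 0.00-999.99
       DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 365), GETDATE())        -- semi-random date within the last year
FROM sys.all_objects;                                                  -- used only as a convenient row source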

To add some more recent news about this subject: I have found the project https://querycert.github.io/index.html which, in its own words:
Q*cert is a query compiler: it takes some input query and generates code for execution. It can compile several source query languages, such as (subsets of) SQL and OQL. It can produce code for several target backends, such as Java, JavaScript, Cloudant and Spark.
Using Coq allows to prove properties on these languages, to verify the correctness of the translation from one to another, and to check that optimizations are correct
Also, you should take a look at Cosette: http://cosette.cs.washington.edu/
Cosette is an automated prover for checking equivalences of SQL queries. It formalizes a substantial fragment of SQL in the Coq Proof Assistant and the Rosette symbolic virtual machine. It returns either a formal proof of equivalence or a counterexample for a pair of given queries.
While Q*cert seems to be the better fit if you have a formal definition in Coq of what you are trying to build, Cosette would be better if you are rewriting a query and want to ensure it returns the same result.

SSMS 2019 large queries debugging

I have recently been involved in supporting customers with very large MSSQL databases (terabytes), which means queries can sometimes take hours to complete.
What is the best way to debug SQL queries? What I mean is: suppose you have a 1000-line SQL query with subqueries etc. that takes one hour to complete, and you end up with a "Conversion failed when converting the varchar value '2021-03-01 00:00:00' to data type int".
The live query statistics in SSMS suddenly gives this error at 88% of execution. Statements that reference that date appear all over the query. Is there any kind of step debugger in SSMS where you can see on which particular line of the query the error occurs?
I know about the stored procedure debugging functionality, but what about plain queries? How can someone get a better view of what's happening when they are executed? There is no step debugger or anything similar for SQL queries, right?
Most likely you have an implicit conversion which is throwing an error.
The easiest way to spot this, especially on a large or long query, is to generate an estimated execution plan.
Then you can look at one of the following:
On the SELECT node you will find Warnings; this will tell you which expression it is, although it doesn't tell you which part of the plan it is in
You can search the nodes one by one
You can right-click, open the XML, and search it for CONVERT_IMPLICIT, this will give you the exact node. You may have to follow it through a bit to work out which part of the query it relates to.
SentryOne Plan Explorer can show you where the XML relates to in the graphical plan.
What caused it is most likely:
An accidental comparison between the wrong columns (illustrated below), or
DATEADD or DATEDIFF with incorrect parameters, or
a UNION ALL or VALUES clause that combines two different data types
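As a contrived illustration of the first of those causes (table and column names are illustrative): Orders.StatusCode is an int, Customers.SignupDateText is a varchar holding values like '2021-03-01 00:00:00', so the comparison forces an implicit varchar-to-int conversion that fails at runtime with exactly this error once such a row is reached.

SELECT o.OrderId
FROM dbo.Orders o
JOIN dbo.Customers c ON o.CustomerId = c.CustomerId
WHERE o.StatusCode = c.SignupDateText;  -- int vs varchar: CONVERT_IMPLICIT shows up in the plan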
This is more generalized SQL advice than specific SSMS advice (since I don't use the program).
You can test your subqueries separately to see if they're valid.
You should also test your queries on a smaller, localized data set. This will give you rapid responses and allow you to test and debug more easily. You should never test queries against a live data set. (You really don't want to have the conversation with your boss about how recent the last backup is, trust me on that one.)
Performance testing should be done with a more accurate data set, or several sizes, to get an accurate feel for performance scaling.
Use EXPLAIN and ANALYZE to get a more in-depth view of what the database will do with your query.
"Conversion failed when converting the varchar value '2021-03-01 00:00:00' to data type int"
This is a horrible error to have to deal with, speaking from personal experience.
Basically, you need to fix the data or the query. And it takes a bit of work. If you are lucky, then your query has no implicit conversions. If that is the case, you can simply replace cast()/convert() with try_cast() and try_convert().
Unfortunately, there probably are implicit conversions in your query. This means that something is amiss in the data model, usually a string is used to store a number or a date -- and the user of the data doesn't realize that the wrong type is being used. And somewhere along the way, someone put an inappropriate value in the table.
This takes a lot of debugging. I would start by investigating the data types of each column being used and looking at the operators/functions to be sure they are appropriate for the data type.
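Once you suspect a particular column, a sketch of hunting down the offending rows with TRY_CAST (table and column names are illustrative):

SELECT *
FROM dbo.Payments
WHERE AmountText IS NOT NULL
  AND TRY_CAST(AmountText AS int) IS NULL;  -- the rows whose text values cannot be converted to int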

In which order does SQL Server apply filters - top down or bottom up (first to last or last to first)?

When I write code I like to make sure I'm optimizing performance. I would assume that this includes ordering the filters to have the heavy reducers (filter out lots of rows) at the top and the lighter reducers (filter out a few rows) at the bottom.
But when I have errors in my filters I have noticed that SQL Server first catches the errors in the filters at the bottom and then catches the errors in the filters at the top. Does this mean that SQL Server processes filters from the bottom up?
For example (for clarity I've put the filters - with intentional typos - in the WHERE clause rather than the JOIN clause):
select
l.Loan_Number
,l.Owner_First_Name
,l.Owner_Last_Name
,l.Street
,l.City
,l.State
,p.Balance
,p.Delinquency_Bucket
,p.Next_Due_Date
from
Location l
join Payments p on l.Account_Number = p.Account_Number
where
l.OOOOOwner_Last_Name = 'Kostoryz' -- I assume this would reduce the most, so I put it first
and p.DDDDelinquency = '90+' -- I assume this would reduce second most, so I put it second
and l.SSSState <> 'WY' -- I assume this would reduce the least, so I put it last
Yet the first error SQL Server would return would be ERROR - THERE IS NO COLUMN SSSState IN Location TABLE
The next error it would return would be ERROR - THERE IS NO COLUMN DDDDelinquency IN Payments TABLE
Does this mean that the State filter would be applied before the Delinquency filter and the Delinquency filter would be applied before the Last_Name filter?
There are roughly three stages between the moment a query is received in text form by the DBMS and the moment you get its result:
1. The text is transformed into some internal format that the DBMS can work with more easily.
2. From the internal format the DBMS tries to compute an optimal way of actually executing the query; you can think of it as a little program that is developed there.
3. That program is executed and the result is written somewhere (in memory) you can fetch it from.
(These stages can possibly be divided into even smaller substages, but that level of detail isn't needed here, I guess.)
Now with that in mind, note that the errors you mention are emitted in stage 1, when the DBMS tries to bind actual objects in the DB and cannot find them. The query is far from execution at that point, and the order in which binding is done has got nothing to do with the order in which the filters are actually applied later. After that comes stage 2: in order to find an optimal way of execution, the DBMS can and will reorder things (not necessarily only filters). So it usually doesn't matter how you ordered the filters or how the order of binding went. The DBMS will look at them and decide which one is better applied earlier and which one can wait until later.
Keep in mind that SQL is a descriptive language. Rather than telling the machine what to do -- what we'd typically do when writing programs in imperative languages -- we describe what result we want and let the machine figure out how to calculate it, and how to do so in the best possible way, or at least a good way.
(Of course, that optimization may not always work 100%. Sometimes there are tricks in queries that help the DBMS find a better solution. But with a query of the kind you posted, any DBMS should cope pretty well and find a good order to apply the filters, no matter how you ordered them.)
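You can check this yourself with the query from the question (using the non-typo column names from its select list, which I'm assuming are the real ones): run both orderings below with "Include Actual Execution Plan" in SSMS and you should see the same plan for each.

SELECT l.Loan_Number
FROM Location l
JOIN Payments p ON l.Account_Number = p.Account_Number
WHERE l.Owner_Last_Name = 'Kostoryz' AND p.Delinquency_Bucket = '90+' AND l.State <> 'WY';

SELECT l.Loan_Number
FROM Location l
JOIN Payments p ON l.Account_Number = p.Account_Number
WHERE l.State <> 'WY' AND p.Delinquency_Bucket = '90+' AND l.Owner_Last_Name = 'Kostoryz';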
Before SQL Server attempts to run the query, it creates a Query Execution Plan (QEP). The errors you are seeing are happening while the QEP is being built. You cannot infer any information about the sequence of "filters" based on the order you get these errors.
Once you have provided a valid query, SQL Server will build a QEP and that will govern the operations it uses to satisfy the query. The QEP will be based on many factors including what indexes and statistics are available on the table - though not usually the order that you specify conditions in the WHERE clause. There are ways to do this, but it is usually not recommended.
In short, NO. The order of the filters doesn't matter.
At a high level, the query goes through multiple stages before execution. The stages are:
Parsing & Normalization (where the syntax is checked and tables are validated)
Compilation & Optimization (Where the code is compiled and optimized for execution)
In the Optimization stage, the table and index statistics are checked to arrive at the optimal execution plan for executing the query. The filters are then applied in an order chosen from those statistics. So the order of filters in the query DOES NOT matter; the column statistics DO.
Read more on Stages of query execution

Keeping dynamic out of SQL while using specifications with stored procedures

A specification essentially is a text string representing a "where" clause created by an end user.
I have stored procedures that copy a set of related tables and records to other places. The operation is always the same, but dependent on some crazy user requirements like "products that are frozen and blue and on sale on Tuesday".
What if we fed the user specification (or string parameter) to a scalar function that returned true/false, and which executed the specification as dynamic SQL or just exec (@variable)?
It could tell us whether those records exist. We could add the result of the function to our copy products where clause.
It would keep us from recompiling the copy script each time our where clauses changed. Plus it would isolate the product selection in to a single function.
Anyone ever do anything like this or have examples? What bad things could come of it?
EDIT:
This is the specification I simply added to the end of each insert/select statement:
and exists (
select null as nothing
from SameTableAsOutsideTable inside
where inside.ID = outside.id and -- Join operations to outside table
inside.page in (6, 7) and -- Criteria 1
inside.dept in (7, 6, 2, 4) -- Criteria 2
)
It would be great to feed a parameter into a function that produces records based on the user criteria, so all that above could be something like:
and dbo.UserCriteria( @page="6,7", @dept="7,6,2,4")
Dynamic Search Conditions in T-SQL
When optimizing SQL the important thing is optimizing the access path to data (ie. index usage). This trumps code reuse, maintainability, nice formatting and just about every other development perk you can think of. This is because a bad access path will cause the query to perform hundreds of times slower than it should. The article linked sums up very well all the options you have, and your envisioned function is nowhere on the radar. Your options will gravitate around dynamic SQL or very complicated static queries. I'm afraid there is no free lunch on this topic.
It doesn't sound like a very good idea to me. Even supposing that you had proper defensive coding to avoid SQL injection attacks it's not going to really buy you anything. The code still needs to be "compiled" each time.
Also, it's pretty much always a bad idea to let users create free-form WHERE clauses. Users are pretty good at finding new and innovative ways to bring a server to a grinding halt.
If you or your users or someone else in the business can't come up with some concrete search requirements then it's likely that someone isn't thinking about it hard enough and doesn't really know what they want. You can have pretty versatile search capabilities without letting the users completely loose on the system. Alternatively, look at some of the BI tools out there and consider creating a data mart where they can do these kinds of ad hoc searches.
How about this:
You create another stored procedure (instead of a function) and pass the right condition to it.
Based on that condition it dumps the record ids to a temp table.
Next, your move procedure reads the ids from that table and does what it needs to do.
Or you could create a user-defined function that returns a table containing nothing but the ids of the records that match your (dynamic) criteria - see the sketch below.
If I am totally off, then please correct me.
Hope this helps.
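A sketch of that table-valued-function option, assuming SQL Server 2016+ for STRING_SPLIT (the function name follows the question's dbo.UserCriteria; the table and columns are illustrative):

CREATE FUNCTION dbo.UserCriteria (@page varchar(100), @dept varchar(100))
RETURNS TABLE
AS RETURN
(
    SELECT p.ID
    FROM dbo.Products p
    WHERE p.page IN (SELECT value FROM STRING_SPLIT(@page, ','))
      AND p.dept IN (SELECT value FROM STRING_SPLIT(@dept, ','))
);

-- then in the copy statement:
-- and exists (select 1 from dbo.UserCriteria('6,7', '7,6,2,4') u where u.ID = outside.id)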
If you are forced to use dynamic queries and you don't have any solid, predefined search requirements, it is strongly recommended to use sp_executesql instead of EXEC. It provides parameterized queries to prevent SQL injection attacks (to some extent), and it allows execution plan reuse to speed up performance. (More info)
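A minimal sketch of the sp_executesql form (table, column and parameter names are illustrative):

DECLARE @sql nvarchar(max) = N'
    SELECT ID
    FROM dbo.Products
    WHERE page = @page AND dept = @dept;';

EXEC sp_executesql
     @sql,
     N'@page int, @dept int',  -- parameter definitions
     @page = 6, @dept = 7;     -- values are passed as parameters, never concatenated into the string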

detect cartesian product or other non sensible queries

I'm working on a product which gives users a lot of "flexibility" to create SQL, i.e. they can easily set up queries that can bring the system to its knees with over-inclusive where clauses.
I would like to be able to warn users when this is potentially the case, and I'm wondering if there is any known strategy for intelligently analysing queries that can be employed to this end.
I feel your pain. I've been tasked with something similar in the past. It's a constant struggle: users demand all of the features and functionality of SQL while also complaining that it's too complicated, doesn't help them, and doesn't prevent them from doing stupid stuff.
Adding paging into the query won't stop bad queries from being executed, but it will reduce the damage. If you only show the first 50 records returned from SELECT * FROM UNIVERSE and provide the ability to page to the next 50 and so on and so forth, you can avoid out of memory issues and reduce the performance hit.
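If the backend happens to be SQL Server, the paging itself can be pushed to the server with OFFSET/FETCH (other engines have LIMIT/OFFSET equivalents); the table name here is illustrative:

SELECT *
FROM dbo.Universe
ORDER BY Id
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;  -- first page; bump OFFSET by 50 for each subsequent page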
I don't know if it's appropriate for your data/business domain; but I forcefully add table joins when the user doesn't supply them. If the query contains TABLE A and TABLE B, A.ID needs to equal B.ID; I add it.
If you don't mind writing code that is specific to a database, I know you can get data about a query from the database (Explain Plan in Oracle - http://www.adp-gmbh.ch/ora/explainplan.html). You can execute the plan on their query first, and use the results of that to prompt or warn the user. But the details will vary depending on which DB you are working with.
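A sketch of that Oracle approach (the query shown is just an illustrative user-supplied one):

EXPLAIN PLAN FOR
    SELECT * FROM orders o, customers c;     -- no join condition: a cartesian product

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);     -- look for a MERGE JOIN CARTESIAN step or a huge Rows estimate
                                             -- and warn the user before actually running the query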

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space. Joining tables is one of those, and you'll see more of a performance degradation from the second query than from the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing is the only thing that will tell you that. MySQL with moderately large tables seems to be susceptible to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it were an ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something
From foo a
Left Join foo2 b On a.foreign_key = b.key
Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, I think what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that in one database hit you have all the data you need, a single SQL statement will give better performance 99% of the time. I'm not sure whether the connections are being created dynamically in this case, but if so, doing that is expensive. Even if the process is reusing existing connections, the DBMS doesn't get to optimize the queries in the best way and isn't really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data retrieved by the foreign key is large and only needed in some cases. But in the sample you describe it just grabs the data if it exists, so that is not the case and you are not gaining any performance.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query with so many joins in it that it would take SQL Server a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables), and broke the query down into a lot of smaller single-select statements, constructing the final result set in this manner.
This update dramatically improved the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objection's sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
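A sketch of that decomposition, reusing the speaker-export example from earlier in the thread (all names are illustrative):

DECLARE @Speakers TABLE (SpeakerId int PRIMARY KEY, SpeakerName nvarchar(100));

-- a simple "one shot" to grab only the rows we care about
INSERT @Speakers (SpeakerId, SpeakerName)
SELECT SpeakerId, SpeakerName FROM dbo.Speakers WHERE IsActive = 1;

-- the final result set is then built from the pre-filtered piece instead of one giant join
SELECT s.SpeakerName, e.EventDate
FROM @Speakers s
JOIN dbo.Engagements e ON e.SpeakerId = s.SpeakerId;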
A single SQL query will usually give better performance, as the SQL server (which sometimes isn't even on the same machine) just needs to handle one request. If you use multiple SQL queries, you introduce a lot of overhead:
executing more CPU instructions,
sending a second query to the server,
creating a second thread on the server,
executing possibly more CPU instructions on the server,
destroying a second thread on the server,
sending the second results back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records, that is the biggest performance hit a database gets (in my limited experience)
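For the question's illustrative foo/foo2 tables, covering indexes for the single-query version might look like this:

CREATE INDEX IX_foo_key  ON foo ([key]) INCLUDE (blah1, blah2, foreign_key);
CREATE INDEX IX_foo2_key ON foo2 ([key]) INCLUDE (something);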
You should always try to minimize the number of queries to the database when you can. Your example is perfect for only one query. This way you will later be able to cache more easily, or handle more requests at the same time, because instead of always using 2-3 queries that each require a connection, you will have only one each time.
There are many cases that require different solutions, and it isn't possible to cover them all together.
A join scans both tables and loops to match each record from the first table in the second table. A simple select query will work faster in many cases, as it only needs the primary/unique key (if one exists) to search the data internally.