Proving SQL query equivalency - sql

How would you go about proving that two queries are functionally equivalent, e.g. that they will always both return the same result set?
As I had a specific query in mind when I was doing this, I ended up doing as @dougman suggested: running both queries over about 10% of the rows in the tables concerned and comparing the results, ensuring there were no out-of-place results.

The best you can do is compare the 2 query outputs based on a given set of inputs looking for any differences. To say that they will always return the same results for all inputs really depends on the data.
For Oracle one of the better if not best approaches (very efficient) is here (Ctrl+F Comparing the Contents of Two Tables):
http://www.oracle.com/technetwork/issue-archive/2005/05-jan/o15asktom-084959.html
Which boils down to:
select c1,c2,c3,
count(src1) CNT1,
count(src2) CNT2
from (select a.*,
1 src1,
to_number(null) src2
from a
union all
select b.*,
to_number(null) src1,
2 src2
from b
)
group by c1,c2,c3
having count(src1) <> count(src2);
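To try the counting technique without an Oracle instance, here is a minimal sketch using Python's built-in sqlite3 module. The tables a and b, their columns and the sample rows are made up for illustration, and NULL stands in for to_number(null) because SQLite does not need the typed null.

```python
import sqlite3

# Hypothetical sample data: b is missing one copy of (2, 'y') and has an extra (3, 'z').
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c1 INTEGER, c2 TEXT);
    CREATE TABLE b (c1 INTEGER, c2 TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (2, 'y');
    INSERT INTO b VALUES (1, 'x'), (2, 'y'), (3, 'z');
""")

# Tag each row with its source, then count per distinct row how often
# it came from each side. Rows with unequal counts are the differences.
diff_sql = """
    SELECT c1, c2, COUNT(src1) AS cnt1, COUNT(src2) AS cnt2
    FROM (SELECT a.*, 1 AS src1, NULL AS src2 FROM a
          UNION ALL
          SELECT b.*, NULL AS src1, 2 AS src2 FROM b)
    GROUP BY c1, c2
    HAVING COUNT(src1) <> COUNT(src2)
    ORDER BY c1
"""
diff_rows = conn.execute(diff_sql).fetchall()
print(diff_rows)
```

Because the comparison is on counts, a row that appears twice on one side and once on the other is reported as well.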

1) Real equivalency proof with Cosette:
Cosette checks (with a proof) whether 2 SQL queries are equivalent and produces counterexamples when they are not. It's the only way to be absolutely sure, well almost ;) You can even enter 2 queries on their website and check (formal) equivalence right away.
Link to Cosette:
https://cosette.cs.washington.edu/
Link to article that gives a good explanation of how Cosette works: https://medium.com/#uwdb/introducing-cosette-527898504bd6
2) Or if you're just looking for a quick practical fix:
Try this stackoverflow answer: [sql - check if two select's are equal]
Which comes down to:
(select * from query1 MINUS select * from query2)
UNION ALL
(select * from query2 MINUS select * from query1)
This query gives you all rows that are returned by only one of the queries.
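A small runnable sketch of that symmetric difference, using Python's sqlite3 module. SQLite spells MINUS as EXCEPT and does not accept bare parentheses around the members of a compound select, so each half is wrapped in a subquery. The table t and the two queries are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, val INTEGER);
    INSERT INTO t VALUES (1, 10), (2, 20), (3, 30);
""")

# Two hypothetical queries that differ only at the boundary value 20.
query1 = "SELECT id, val FROM t WHERE val >= 20"
query2 = "SELECT id, val FROM t WHERE val > 20"

# Rows returned by exactly one of the two queries.
sym_diff_sql = f"""
    SELECT * FROM (SELECT * FROM ({query1}) EXCEPT SELECT * FROM ({query2}))
    UNION ALL
    SELECT * FROM (SELECT * FROM ({query2}) EXCEPT SELECT * FROM ({query1}))
"""
only_in_one = conn.execute(sym_diff_sql).fetchall()
print(only_in_one)
```

An empty result only says the two queries agree on the data currently in the tables.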

This sounds to me like an NP-complete problem. I'm not sure there is a surefire way to prove this kind of thing.

This is pretty easy to do.
Let's assume your queries are named a and b
a
minus
b
should give you an empty set. If it does not, then the queries return different sets, and the result set shows you the rows that are different.
Then do
b
minus
a
That should also give you an empty set. If both results are empty, then the queries return the same sets.
If it is not empty, then the queries are different in some respect, and the result set shows you the rows that are different.
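One caveat worth checking for yourself: MINUS (EXCEPT in most other databases) works on sets, so it ignores how many times a row occurs. The sketch below uses Python's sqlite3 module with two invented tables standing in for the results of a and b. Both differences come back empty even though one side has a duplicate row, so comparing row counts as well is a cheap extra check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (v INTEGER);
    CREATE TABLE b (v INTEGER);
    INSERT INTO a VALUES (1), (1), (2);   -- value 1 appears twice
    INSERT INTO b VALUES (1), (2);
""")

# Set difference in both directions (EXCEPT is SQLite's spelling of MINUS).
a_minus_b = conn.execute("SELECT v FROM a EXCEPT SELECT v FROM b").fetchall()
b_minus_a = conn.execute("SELECT v FROM b EXCEPT SELECT v FROM a").fetchall()

# Row counts reveal the duplicate that the set difference cannot see.
count_a = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
count_b = conn.execute("SELECT COUNT(*) FROM b").fetchone()[0]
print(a_minus_b, b_minus_a, count_a, count_b)
```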

The DBMS vendors have been working on this for a very, very long time. As Rik said, it's probably an intractable problem, but I don't think any formal analysis on the NP-completeness of the problem space has been done.
However, your best bet is to leverage your DBMS as much as possible. All DBMS systems translate SQL into some sort of query plan. You can use this query plan, which is an abstracted version of the query, as a good starting point (the DBMS will do LOTS of optimization, flattening queries into more workable models).
NOTE: modern DBMS use a "cost-based" analyzer which is non-deterministic across statistics updates, so the query planner, over time, may change the query plan for identical queries.
In Oracle (depending on your version), you can tell the optimizer to switch from the cost based analyzer to the deterministic rule based analyzer (this will simplify plan analysis) with a SQL hint, e.g.
SELECT /*+ RULE */ * FROM yourtable
The rule-based optimizer has been deprecated since 8i but it still hangs around even thru 10g (I don't know 'bout 11). However, the rule-based analyzer is much less sophisticated: the error rate potentially is much higher.
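As a rough illustration of comparing plans, here is a sketch that uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module instead of Oracle. The orders table, its index and the two queries are assumptions made up for the example. Identical plans are a hint that the optimizer treats the two queries alike, not a proof of equivalence, and different plans do not prove the queries differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE INDEX idx_orders_customer ON orders (customer_id);
""")

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column

# Two hypothetical formulations of the same filter.
plan_a = plan("SELECT id FROM orders WHERE customer_id = 42")
plan_b = plan("SELECT id FROM orders WHERE customer_id IN (42)")

print("plan A:", plan_a)
print("plan B:", plan_b)
print("same plan text:", plan_a == plan_b)
```

The exact plan text depends on the SQLite version, so the output is best read by eye.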
For further reading of a more generic nature, IBM has been fairly prolific with their query-optimization patents. This one here on a method for converting SQL to an "abstract plan" is a good starting point:
http://www.patentstorm.us/patents/7333981.html

Perhaps you could draw (by hand) out your query and the results using Venn Diagrams, and see if they produce the same diagram. Venn diagrams are good for representing sets of data, and SQL queries work on sets of data. Drawing out a Venn Diagram might help you to visualize if 2 queries are functionally equivalent.

This will do the trick. If this query returns zero rows the two queries are returning the same results. As a bonus, it runs as a single query, so you don't have to worry about setting the isolation level so that the data doesn't change between two queries.
select * from ((<query 1> MINUS <query 2>) UNION ALL (<query 2> MINUS <query 1>))
Here's a handy shell script to do this:
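#!/bin/sh
# Usage: pass the sqlplus connect string as the first argument.
CONNSTR=$1
echo "query 1, no semicolon, eof to end:"; Q1=$(cat)
echo "query 2, no semicolon, eof to end:"; Q2=$(cat)
T="(($Q1 MINUS $Q2) UNION ALL ($Q2 MINUS $Q1))"
# Quote the statement so the shell does not expand any * inside the queries.
echo "select count(*) from $T;" | sqlplus -S -L "$CONNSTR"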

CAREFUL! Functional "equivalence" is often based on the data, and you may "prove" equivalence of 2 queries by comparing results for many cases and still be wrong once the data changes in a certain way.
For example:
SQL> create table test_tabA
(
col1 number
)
Table created.
SQL> create table test_tabB
(
col1 number
)
Table created.
SQL> -- insert 1 row
SQL> insert into test_tabA values (1)
1 row created.
SQL> commit
Commit complete.
SQL> -- Not exists query:
SQL> select * from test_tabA a
where not exists
(select 'x' from test_tabB b
where b.col1 = a.col1)
COL1
----------
1
1 row selected.
SQL> -- Not IN query:
SQL> select * from test_tabA a
where col1 not in
(select col1
from test_tabB b)
COL1
----------
1
1 row selected.
-- THEY MUST BE THE SAME!!! (or maybe not...)
SQL> -- insert a NULL to test_tabB
SQL> insert into test_tabB values (null)
1 row created.
SQL> commit
Commit complete.
SQL> -- Not exists query:
SQL> select * from test_tabA a
where not exists
(select 'x' from test_tabB b
where b.col1 = a.col1)
COL1
----------
1
1 row selected.
SQL> -- Not IN query:
SQL> select * from test_tabA a
where col1 not in
(select col1
from test_tabB b)
**no rows selected.**
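The same experiment can be replayed outside Oracle. The sketch below uses Python's sqlite3 module with the table names from the transcript above. SQLite applies the same three-valued logic to NOT IN, so it is only meant to show the effect, not to say anything about Oracle internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_tabA (col1 INTEGER);
    CREATE TABLE test_tabB (col1 INTEGER);
    INSERT INTO test_tabA VALUES (1);
""")

NOT_EXISTS = """
    SELECT * FROM test_tabA a
    WHERE NOT EXISTS (SELECT 'x' FROM test_tabB b WHERE b.col1 = a.col1)
"""
NOT_IN = """
    SELECT * FROM test_tabA a
    WHERE col1 NOT IN (SELECT col1 FROM test_tabB b)
"""

def run_both():
    """Run both formulations against the current data."""
    return conn.execute(NOT_EXISTS).fetchall(), conn.execute(NOT_IN).fetchall()

before = run_both()                                   # test_tabB is empty
conn.execute("INSERT INTO test_tabB VALUES (NULL)")   # the data changes
after = run_both()

print("before:", before)
print("after: ", after)
```

With the NULL present, `1 NOT IN (NULL)` evaluates to unknown rather than true, so the NOT IN query filters the row out while NOT EXISTS still returns it.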

You don't.
If you need a high level of confidence that a performance change, for example, hasn't changed the output of a query, then test the hell out of it.
If you need a really high level of confidence .. then errrm, test it even more.
Massive levels of testing aren't that hard to cobble together for a SQL query. Write a proc which will iterate around a large/complete set of possible parameters, call each query with each set of params, and write the outputs to respective tables. Compare the two tables and there you have it.
It's not exactly scientific, which I guess was the OP's question, but I'm not aware of a formal method to prove equivalency.
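A minimal sketch of that kind of parameter sweep, written as a Python script against SQLite rather than as a stored proc. The sales table, the two query variants and the parameter lists are all hypothetical, and the outputs are compared in memory instead of being written to tables.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 5), ('north', 15), ('south', 25), ('south', NULL);
""")

# Hypothetical "before" and "after" versions of the same query.
old_sql = "SELECT region, amount FROM sales WHERE region = ? AND amount >= ?"
new_sql = "SELECT region, amount FROM sales WHERE region = ? AND NOT amount < ?"

regions = ["north", "south", "east"]   # includes a region with no rows
thresholds = [0, 10, 20, 30]

mismatches = []
for params in itertools.product(regions, thresholds):
    # Sort so that row order (which is undefined without ORDER BY) is ignored.
    old_rows = sorted(conn.execute(old_sql, params).fetchall(), key=repr)
    new_rows = sorted(conn.execute(new_sql, params).fetchall(), key=repr)
    if old_rows != new_rows:
        mismatches.append((params, old_rows, new_rows))

print("mismatches:", mismatches)
```

An empty mismatch list only covers the parameter combinations and data that were actually tried.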

Related

Any resources for this SQL filtering?

I have 100 tables, each on the order of a few tens of GB in size. The schema of each table is the following:
A: string | B: string | C: string
In each table I would like to retain only the rows for which the (B, C) appears at least 10 times in a concatenation of all 100 tables. Is there any efficient way to achieve this?
This is a rather vague question, and omitting your DBMS isn't helpful either, as SQL comes in different dialects.
But first, you would have to join all of the tables together - there may be a faster way of doing this, but without knowing which flavor of SQL you are using it is hard to tell.
Something like this will work:
-- UNION ALL keeps duplicate rows, which the count below depends on
SELECT * FROM table_1
UNION ALL
SELECT * FROM table_2
...
UNION ALL
SELECT * FROM table_100
Once you have all of the data you do something like this:
WITH tables_with_counts as (SELECT
A,
B,
C,
COUNT(1) OVER (PARTITION BY B, C) AS bc_count
FROM
aggregated_tables)
SELECT
A,
B,
C
FROM
tables_with_counts
WHERE
bc_count >= 10
Here is my take:
Step 1 : Aggregate all tables into one. It would be bulky but if you are using Oracle database, I think it shouldn't be an issue.
Step 2: Create md5 checksum hash values for B,C columns like below :
SELECT APEX_ITEM.MD5_CHECKSUM(B,C) md5_cks,
A,B,C
FROM aggregated_tables
Step 3: Take the count based on checksum values and retain the rows where count >= 10
Step 4: Get rid of duplicate data using rank() or dense rank() in delete statement.
The short answer, which I'm sure that you don't want to hear, is "no." In the context of relational databases there is no efficient query to merge 100 tables.
It is not all bad news though. If it were just one table (let's say it was named "combined" just to have concrete examples) you could use an elegant SQL using windowed functions
select A,B,C from (select A,B,C,count(1) over (partition by B,C) as counts from combined)counted where counts>=10
Option 1. So the question is how to get a "combined" table so that the snippet above works. If we stick with ANSI (standard) SQL, you could use UNION ALL and collect it into a WITH clause to keep things neat.
Here is an example:
with
combined as (
select * from table_1
union all
select * from table_2),
counted as (
select
A,B,C,
count(1) over (partition by B,C) as counts
from
combined)
select A,B,C from counted where counts>=10;
I only included 2 tables, but the real query would extend that up to table_100. That's a lot of typing and not very efficient with the programmer's time. Also, UNIONs and UNION ALLs over this many large tables tend to perform poorly, so this is not efficient in terms of system resources or time, either. Personally I would not do it this way, but it is an answer.
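The typing, at least, can be avoided by generating the statement. Here is a sketch in Python against SQLite, with three tiny made-up tables instead of 100 large ones and a threshold of 5 instead of 10. It assumes a SQLite build with window function support (3.25 or later), and it does nothing about the performance concern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-ins for table_1 .. table_100.
table_names = [f"table_{i}" for i in range(1, 4)]
for name in table_names:
    conn.execute(f"CREATE TABLE {name} (A TEXT, B TEXT, C TEXT)")
    conn.executemany(
        f"INSERT INTO {name} VALUES (?, ?, ?)",
        [(f"{name}-row{j}", "b1", "c1") for j in range(2)],
    )
conn.execute("INSERT INTO table_1 VALUES ('rare', 'b2', 'c2')")

# Build the UNION ALL branches from the list of table names.
branches = "\n    UNION ALL\n    ".join(
    f"SELECT A, B, C FROM {name}" for name in table_names
)
sql = f"""
WITH combined AS (
    {branches}
),
counted AS (
    SELECT A, B, C, COUNT(*) OVER (PARTITION BY B, C) AS counts
    FROM combined
)
SELECT A, B, C FROM counted WHERE counts >= ?
"""

threshold = 5  # the question asks for 10; lowered to fit the toy data
kept = conn.execute(sql, (threshold,)).fetchall()
print(len(kept), "rows kept")
```

The pair (b1, c1) occurs six times across the three tables and is kept, while the single (b2, c2) row is dropped.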
Option 2 There are other options which do not exactly match your question, but may be helpful to know. Any time you are tempted to create multiple tables with exactly the same schema, you will be better off creating a single table with multiple partitions. see MySQL, Postgres, Sql Server, Oracle, Hive. Every database platform has its own syntax for partitioning tables but they are all similar. For this table, each of the original tables becomes a single partition in the table, and the table name would be a really good candidate for the string value in the partition identifier (partition column)
If you are able to stuff all of your 100 tables into 100 partitions of one table then you can run the first query after all. The advantage is that the database can optimize that query because all modern databases are optimized to manage partitioned queries.
In addition, adding a partition to a table is really no more trouble than creating a new table instead, but supporting and maintaining one table is a lot less trouble than 100 tables.
A third option, since you tagged "big data" is to use a big data engine like Spark with SparkSQL. This would be objectively best because you can actually load a dataframe with 100 combined tables very efficiently with spark, and the SQL after that is not much different from the relational database sql we have been considering. That's kind of out of scope here, but worth considering. If you submit a more specific question and specifically for spark we could go into more details.

Select top 4 with order by, but only if actually required?

I have part of a stored proc that is called thousands and thousands of times and as a result takes up the bulk of the whole thing. Having run it through the execution plan, it looks like the TOP 4 and Order By part is taking up a lot of that. The order by uses a function that, although streamlined, will still be called a fair bit.
This is an odd situation in that for 99.5% of the data there will be 4 or less results returned anyway, it's only for the 0.5% of times that we need the TOP 4. This is a requirement of the data algorithm so eliminating the TOP 4 entirely is not an option.
So let's say my syntax is
SELECT SomeField * SomeOtherField as MainField, SomeOtherField
FROM
(
SELECT TOP 4
SomeField, 1/dbo.[Myfunction](Param1, Param2, 34892) as SomeOtherField
FROM #MytempTable
WHERE
Param1 > @NextMargin1 AND Param1 < @NextMargin1End
AND Param2 > @NextMargin2 AND Param2 < @NextMargin2End
ORDER BY dbo.[MyFunction](Param1, Param2, 34892)
) d
Is there a way I can tell SQL server to do the order by if and only if there are more than 4 results returned after the where takes place? I don't need the order otherwise. Perhaps a table variable and count of the table in an if?
--- Update based on Davids Answer to try to work out why it was slower:
I did a check and can confirm that 96.5% of times there are 4 or less results so it's not a case of more data than expected.
Here is the execution plan for the insert into the #FunctionResults
And the breakdowns of the Insert and spool:
And then the execution plan for the selection of the top4 and orderby:
Please let me know if any further information or breakdowns are required, the size of #Mytemptable could typically be 28000 rows and it has index
CREATE INDEX MyIndex on #MyTempTable (Param1, Param2) INCLUDE ([SomeField])
This answer has been updated based on continued feedback from the question asker. The original suggestion was to attempt to use a table variable to store pre-calculations and select the top 4 from the results. However, in practice it appears that the optimizer was over-estimating the number of rows and choosing a bad execution plan.
In addition to the previous recommendations, I would also recommend updating statistics periodically after any change to this process to provide the query optimizer with updated information to make more informed decisions.
As this is a performance tuning process without direct access to the source environment, this answer is expected to change based on user feedback. Per the recommendation of @SteveFord above, the sample query below reflects the use of a CROSS APPLY to attempt to avoid multiple unnecessary function calls.
SELECT TOP 4
M.SomeField,
M.SomeField * 1/F.FunctionResults [SomeOtherField]
FROM #MytempTable M
CROSS APPLY (SELECT dbo.Myfunction(M.Param1, M.Param2, 34892)) F(FunctionResults)
ORDER BY F.FunctionResults
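The intent behind both the question and this rewrite (evaluate the function once per row, and only pay for ordering when more than 4 rows survive the filter) can be sketched outside the database. This is plain Python, not T-SQL: expensive_score is a hypothetical stand-in for dbo.MyFunction, and nothing here says how SQL Server itself will execute the query.

```python
import heapq

def top4_by_score(rows, score):
    """Return up to 4 (score, row) pairs, calling score() once per row."""
    scored = [(score(row), row) for row in rows]
    if len(scored) <= 4:
        # Common case: nothing to cut off, so no ordering work is needed.
        return scored
    # Rare case: keep the 4 smallest scores without sorting everything.
    return heapq.nsmallest(4, scored, key=lambda pair: pair[0])

calls = []

def expensive_score(value):
    """Hypothetical stand-in for dbo.MyFunction; records each call."""
    calls.append(value)
    return value * value

few = top4_by_score([3, 1], expensive_score)
many = top4_by_score([5, 3, 9, 1, 7, 2], expensive_score)
print(few)
print(many)
print("function calls:", len(calls))
```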

how to prove 2 sql statements are equivalent

I set out to rewrite a complex SQL statement with joins and sub-statements and obtained a more simple looking statement. I tested it by running both on the same data set and getting the same result set. In general, how can I (conceptually) prove that the 2 statements are the same in any given data set?
I would suggest studying relational algebra (as pointed out by Mchl). It is the most essential concept you need if you want to get serious about optimizing queries and designing databases properly.
However I will suggest an ugly brute force approach that helps you to ensure correct results if you have sufficient data to test with: create views of both versions (for making the comparisons easier to manage) and compare the results. I mean something like
create view original as select xxx yyy zzz;
create view new as select xxx yyy zzz;
-- If the amount differs something is quite obviously very wrong
select count(*) from original;
select count(*) from new;
-- What is missing from the new one?
select *
from original o
where not exists (
select *
from new n
where o.col1=n.col1 and o.col2=n.col2 --and so on
);
-- Did something extra appear?
select *
from new n
where not exists (
select *
from original o
where o.col1=n.col1 and o.col2=n.col2 --and so on
);
Also, as pointed out by others in comments, you might feed both queries to the optimizer of the product you are working with. Most of the time you get something that can be read by humans, such as complete drawings of the execution paths with the subqueries' impact on performance. It is most often done with something like
explain plan for
select *
from ...
where ...
etc

In SQL, 'distinct' reduces the number of result rows from one to zero

I have a SQL statement of the following structure:
select distinct ...
from table1,
(select from table2, table3, table4 where ...)
where ...
order by ...
With certain values in the where clauses, the statement returns zero rows in the result set. When I remove the 'distinct' keyword, it returns a single row. I would expect to see a single result row in both cases. Is there some property of the 'distinct' keyword that I am not aware of and that causes this behavior?
The database is Oracle 11g.
What you describe is not the expected behaviour of DISTINCT. This is:
SQL> select * from dual
2 /
D
-
X
1 row selected.
SQL> select distinct * from dual
2 /
D
-
X
1 row selected.
SQL>
So, if what you say is happening really is what is happening then it's a bug. However, you also say it's a rare occurrence which means there is a good chance it is some peculiarity in your data and/or transient conditions in your environment, and not a bug.
You need to create a reproducible test case, for two reasons. Partly, nobody will be able to investigate your problem without one. But mainly because building a test case is an investigation in its own right: attempting to isolate the precise combination of data and/or ambient factors often generates the insight which leads to a solution.
It turned out that one of the sub-selects resulted in a data set that contained, among others, a row where every column was NULL. It seems that this row influenced the evaluation of the DISTINCT in a non-obvious way (at least to me). Maybe this is due to some under-the-hood SQL optimizations. After I removed the cause of this NULL-filled row, the problem is gone and the statement evaluates to one row in the result as it should.
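For reference, the expected semantics are easy to check in isolation: DISTINCT treats NULLs as equal to each other when removing duplicates, so an all-NULL row should collapse to a single row and leave the other rows alone. The sketch below uses Python's sqlite3 module and an invented table, so it shows the standard behaviour only and does not reproduce the Oracle 11g situation described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b TEXT);
    INSERT INTO t VALUES (NULL, NULL), (NULL, NULL), (1, 'x');
""")

plain = conn.execute("SELECT a, b FROM t").fetchall()
distinct = conn.execute("SELECT DISTINCT a, b FROM t").fetchall()

print(len(plain), "rows without DISTINCT")
print(len(distinct), "rows with DISTINCT:", distinct)
```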

SQL Server UNION - What is the default ORDER BY Behaviour

If I have a few UNION Statements as a contrived example:
SELECT * FROM xxx WHERE z = 1
UNION
SELECT * FROM xxx WHERE z = 2
UNION
SELECT * FROM xxx WHERE z = 3
What is the default order by behaviour?
The test data I'm seeing essentially does not return the data in the order that is specified above. I.e. the data is ordered, but I wanted to know what are the rules of precedence on this.
Another thing is that in this case xxx is a View. The view joins 3 different tables together to return the results I want.
There is no default order.
Without an Order By clause the order returned is undefined. That means SQL Server can bring them back in any order it likes.
EDIT:
Based on what I have seen, without an Order By, the order that the results come back in depends on the query plan. So if there is an index that it is using, the result may come back in that order but again there is no guarantee.
In regards to adding an ORDER BY clause:
This is probably elementary to most here but I thought I'd add this.
Sometimes you don't want the results mixed, so you want the first query's results then the second and so on. To do that I just add a dummy first column and order by that. Because of possible issues with forgetting to alias a column in unions, I usually use ordinals in the order by clause, not column names.
For example:
SELECT 1, * FROM xxx WHERE z = 'abc'
UNION ALL
SELECT 2, * FROM xxx WHERE z = 'def'
UNION ALL
SELECT 3, * FROM xxx WHERE z = 'ghi'
ORDER BY 1
The dummy ordinal column is also useful for times when I'm going to run two queries and I know only one is going to return any results. Then I can just check the ordinal of the returned results. This saves me from having to do multiple database calls and most empty resultset checking.
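A runnable sketch of the dummy-column trick, using Python's sqlite3 module. The table xxx and its rows are invented, and the column list is spelled out instead of `1, *` to keep the example explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE xxx (z TEXT, name TEXT);
    INSERT INTO xxx VALUES
        ('ghi', 'g1'), ('abc', 'a1'), ('def', 'd1'), ('abc', 'a2');
""")

# The leading constant records which branch produced each row;
# ORDER BY 1 then keeps the branches in the order they were written.
sql = """
    SELECT 1 AS branch, z, name FROM xxx WHERE z = 'abc'
    UNION ALL
    SELECT 2, z, name FROM xxx WHERE z = 'def'
    UNION ALL
    SELECT 3, z, name FROM xxx WHERE z = 'ghi'
    ORDER BY 1
"""
rows = conn.execute(sql).fetchall()
branches = [row[0] for row in rows]
print(rows)
```

The order of rows within a single branch is still undefined unless further ORDER BY columns are added.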
Just found the actual answer.
Because UNION removes duplicates it does a DISTINCT SORT. This is done before all the UNION statements are concatenated (check out the execution plan).
To stop a sort, do a UNION ALL and this will also not remove duplicates.
If you care what order the records are returned, you MUST use an order by.
If you leave it out, it may appear organized (based on the indexes chosen by the query plan), but the results you see today may NOT be the results you expect, and it could even change when the same query is run tomorrow.
Edit: Some good, specific examples: (all examples are MS SQL server)
Pinal Dave's blog describes how two very similar queries can show a different apparent order, because different indexes are used:
SELECT ContactID FROM Person.Contact
SELECT * FROM Person.Contact
Conor Cunningham shows how the apparent order can change when the table gets larger (if the query optimizer decides to use a parallel execution plan).
Hugo Kornelis proves that the apparent order is not always based on primary key. Here is his follow-up post with explanation.
A UNION can be deceptive with respect to result set ordering because a database will sometimes use a sort method to provide the DISTINCT that is implicit in UNION , which makes it look like the rows are deliberately ordered -- this doesn't apply to UNION ALL for which there is no implicit distinct, of course.
However there are algorithms for the implicit distinct, such as Oracle's hash method in 10g+, for which no ordering will be applied.
As DJ says, always use an ORDER BY
It's very common to come across poorly written code that assumes table data is returned in insert order. 95% of the time the coder gets away with it and is never aware that this is a problem, because on many common databases (MSSQL, Oracle, MySQL) the rows often happen to come back that way. It is of course a complete fallacy and should always be corrected when you come across it. Always, without exception, use an Order By clause yourself.