How to prove 2 SQL statements are equivalent

I set out to rewrite a complex SQL statement with joins and sub-statements and obtained a simpler-looking statement. I tested it by running both on the same data set and getting the same result set. In general, how can I (conceptually) prove that the 2 statements are the same on any given data set?

I would suggest studying relational algebra (as pointed out by Mchl). It is the most essential concept you need if you want to get serious about optimizing queries and designing databases properly.
However, I will suggest an ugly brute-force approach that helps you ensure correct results if you have sufficient data to test with: create views of both versions (to make the comparisons easier to manage) and compare the results. I mean something like
create view original as select xxx yyy zzz;
create view new as select xxx yyy zzz;
-- If the amount differs something is quite obviously very wrong
select count(*) from original;
select count(*) from new;
-- What is missing from the new one?
select *
from original o
where not exists (
select *
from new n
where o.col1=n.col1 and o.col2=n.col2 --and so on
);
-- Did something extra appear?
select *
from new n
where not exists (
select *
from original o
where o.col1=n.col1 and o.col2=n.col2 --and so on
);
Also, as pointed out by others in the comments, you might feed both queries to the optimizer of the product you are working with. Most of the time you get something that can be parsed by humans: complete drawings of the execution paths, the subqueries' impact on performance, etc. It is most often done with something like
explain plan for
select *
from ...
where ...
etc

Related

Can I alias and reuse my subqueries?

I'm working with a data warehouse doing report generation. As the name would suggest, I have a LOT of data. One of the queries that pulls a LOT of data is starting to take longer than I like (these aren't performed ad hoc; these queries run every night and rebuild tables to cache the reports).
I'm looking at optimizing it, but I'm a little limited on what I can do. I have one query that's written along the lines of...
SELECT column1, column2,... columnN, (subQuery1), (subquery2)... and so on.
The problem is, the subqueries are repeated a fair amount because each statement has a case around it, such as...
SELECT
    column1
  , column2
  , columnN
  , (SELECT CASE
              WHEN (subQuery1) > 0 AND (subquery2) > 0
                THEN CAST((subQuery1)/(subquery2) AS decimal) * 100
              ELSE 0
            END) AS "longWastefulQueryResults"
Our data comes from multiple sources and there are occasional data entry errors, so this prevents potential errors when dividing by zero. The problem is, the subqueries can repeat multiple times even though the values won't change. I'm sure there's a better way to do it...
I'd love something like what you see below, but I get errors about needing sq1 and sq2 in my group by clause. I'd provide an exact sample, but it'd be painfully tedious to go over.
SELECT
    column1
  , column2
  , columnN
  , (subQuery1) AS sq1
  , (subquery2) AS sq2
  , (SELECT CASE
              WHEN sq1 > 0 AND sq2 > 0
                THEN CAST(sq1/sq2 AS decimal) * 100
              ELSE 0
            END) AS "lessWastefulQueryResults"
I'm using Postgres 9.3 but haven't been able to get a successful test yet. Is there anything I can do to optimize my query?
Yup, you can create a temp table to store your results and query them again in the same session.
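A minimal sketch of that idea in Postgres (sub_results, key_col and base_table are invented names; (subQuery1) and (subquery2) are the question's own placeholder scalar subqueries):

CREATE TEMP TABLE sub_results AS
SELECT key_col,
       (subQuery1) AS sq1,
       (subquery2) AS sq2
FROM base_table;

-- The expensive subqueries ran exactly once; reuse the stored values freely.
SELECT key_col,
       CASE WHEN sq1 > 0 AND sq2 > 0
            THEN CAST(sq1 / sq2 AS decimal) * 100
            ELSE 0
       END AS "lessWastefulQueryResults"
FROM sub_results;

The temp table disappears at the end of the session (or add ON COMMIT DROP to tie it to the transaction).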
I'm not sure how good the Postgres optimizer is, so I'm not sure whether optimizing in this way will do any good. (In my opinion, it shouldn't because the DBMS should be taking care of this kind of thing; but it's not at all surprising if it isn't.) OTOH if your current form has you repeating query logic, then you can benefit from doing something different whether or not it helps performance...
You could put the subqueries in with clauses up front, and that might help.
with subquery1 as (select ...)
, subquery2 as (select ...)
select ...
This is similar to putting the subqueries in the FROM clause as Allen suggests, but may offer more flexibility if your queries are complex.
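Applied to the question's shape, a minimal sketch (some_table and the (subQuery1)/(subquery2) placeholders stand in for the real pieces). This also sidesteps the "sq1 must appear in the GROUP BY clause" error, because the CASE runs at a separate query level:

WITH base AS (
    SELECT column1, column2, columnN,
           (subQuery1) AS sq1,
           (subquery2) AS sq2
    FROM some_table
)
SELECT column1, column2, columnN,
       CASE WHEN sq1 > 0 AND sq2 > 0
            THEN CAST(sq1 / sq2 AS decimal) * 100
            ELSE 0
       END AS "lessWastefulQueryResults"
FROM base;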
If you have the freedom to create a temp table as Andrew suggests, that too might work but could be a double-edged sword. At this point you're limiting the optimizer's options by insisting that the temp tables be populated first and then used in the way that makes sense to you, which may not always be the way that actually gets the most efficiency. (Again, this comes down to how good the optimizer is... it's often folly to try to outsmart a really good one.) On the other hand, if you do create temp or working tables, you might be able to apply useful indexes or stats (if they contain large datasets) that would further improve downstream steps' performance.
It looks like many of your subqueries might return single values. You could put the queries into a procedure and capture those individual values as variables. This is similar to the temp table approach, but doesn't require creation of objects (as you may not be able to do that) and will have less risk of confusing the optimizer by making it worry about a table where there's really just one value.
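A hedged sketch of that in PL/pgSQL (a DO block needs Postgres 9.0+; the orders table and its status column are invented stand-ins for the real expensive subqueries):

DO $$
DECLARE
    sq1 numeric;
    sq2 numeric;
BEGIN
    -- Each SELECT ... INTO runs its expensive query exactly once.
    SELECT count(*) INTO sq1 FROM orders WHERE status = 'error';  -- stands in for subQuery1
    SELECT count(*) INTO sq2 FROM orders;                         -- stands in for subquery2
    IF sq1 > 0 AND sq2 > 0 THEN
        RAISE NOTICE 'ratio: %', (sq1 / sq2) * 100;
    ELSE
        RAISE NOTICE 'ratio: 0';
    END IF;
END $$;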
Sub-queries in the column list tend to be a questionable design. The first approach I'd take to solving this is to see if you can move them down to the from clause.
In addition to allowing you to use the result of those queries in multiple columns, doing this often helps the optimizer to come up with a better plan for your query. This is because the queries in the column list have to be executed for every row, rather than merged into the rest of the result set.
Since you only included a portion of the query in your question, I can't demonstrate this particularly well, but what you should be looking for would look more like:
SELECT column1,
column2,
columnn,
subquery1.sq1,
subquery2.sq2,
(SELECT CASE
WHEN (subquery1.sq1) > 0 AND (subquery2.sq2) > 0 THEN
CAST ( (subquery1.sq1) / (subquery2.sq2) AS DECIMAL) * 100
ELSE
0
END)
AS "lessWastefulQueryResults"
FROM some_table
JOIN (SELECT *
FROM other_table
GROUP BY some_columns) subquery1
ON some_table.some_columns = subquery1.some_columns
JOIN (SELECT *
FROM yet_another_table
GROUP BY more_columns) subquery2
ON some_table.more_columns = subquery2.more_columns

can oracle hints be used to defer a (pl/sql) condition till last?

I'm trying to optimise a select (a cursor in pl/sql code, actually) that includes a pl/sql function, e.g.
select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
and myplsqlfunction(t.val) = 'X'
The myplsqlfunction() is very expensive, but is only applicable to a manageably small subset of the other conditions.
The problem is that Oracle appears to be evaluating myplsqlfunction() on more data than is ideal.
My evidence for this is if I recast the above as either
select * from (
select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
) where myplsqlfunction(val) = 'X'
or pl/sql as:
begin
for t in ( select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns ) loop
if myplsqlfunction(t.val) = 'X' then
-- process the desired subset
end if;
end loop;
end;
performance is an order of magnitude better.
I am resigned to restructuring the offending code to use either of the 2 idioms above, but I would be delighted if there were any simpler way to get the Oracle optimizer to do this for me.
You could specify a bunch of hints to force a particular plan. But that would almost assuredly be more of a pain than restructuring the code.
I would expect that what you really want to do is to associate non-default statistics with the function. If you tell Oracle that the function is less selective than the optimizer is guessing or (more likely) if you provide high values for the CPU or I/O cost of the function, you'll cause the optimizer to try to call the function as few times as possible. The oracle-developer.net article walks through how to pick reasonably correct values for the cost (or going a step beyond that how to make those statistics change over time as the cost of the function call changes). You can probably fix your immediate problem by setting crazy-high costs but you probably want to go to the effort of setting accurate values so that you're giving the optimizer the most accurate information possible. Setting costs way too high or way too low tends to cause some set of queries to do something stupid.
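A hedged sketch of the statement involved (the cost triple is CPU, I/O and network cost; the numbers below are arbitrary placeholders you would calibrate the way the article describes):

-- Tell the optimizer the function is expensive, so it calls it as late
-- and as few times as possible; selectivity is a percentage.
ASSOCIATE STATISTICS WITH FUNCTIONS myplsqlfunction
  DEFAULT COST (100000, 10, 0)
  DEFAULT SELECTIVITY 1;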
You can use a WITH clause to first evaluate all your join conditions and get a manageable subset of data, then apply the PL/SQL function to just that subset. It all depends on the volume, but you can try this. Let me know if you run into any issues.
You can use CTE like:
WITH X as
( select /*+ MATERIALIZE */ * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
)
SELECT * FROM X
where myplsqlfunction(X.val) = 'X';
Note the MATERIALIZE hint. CTEs can be either inlined or materialized (into the TEMP tablespace).
Another option would be to use the NO_PUSH_PRED hint. This is generally a better solution (it avoids materializing the subquery), but it requires some tweaking.
PS: you should not call another SQL from myplsqlfunction. This SQL might see data added after your query started and you might get surprising results.
You can also declare your function as RESULT_CACHE, to force Oracle to remember return values from the function - applicable when the number of possible parameter values is reasonably small.
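A hedged sketch of that declaration (the body is a dummy; only the RESULT_CACHE keyword and the single-argument signature matter here):

CREATE OR REPLACE FUNCTION myplsqlfunction (p_val IN VARCHAR2)
  RETURN VARCHAR2
  RESULT_CACHE
IS
BEGIN
  -- expensive work elided; Oracle caches the return value per distinct p_val
  RETURN UPPER(p_val);
END;
/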
Probably the best solution is to associate the stats, as Justin describes.

How much load database takes when I SELECT ... IN `big_array`?

Let's say I have accumulated some array of ids (for example [1, 2, 3, ..., 1000]). Is it wise to SELECT with such a big array from the database? It's no big deal to take an array of 10-20 things out of the DB, but what if it were 1000-10000?
EDIT
Somehow it seems that SELECT ... IN (SELECT ....id FROM ... BETWEEN 0 AND 100) is much slower (about 1200 ms!!!) than just forming an array and doing SELECT ... IN [array]
In general, when you need to select many (1000+) records based on an array of IDs, a better approach than using the IN operator is to load your array of IDs into a temporary table and then perform a join:
So instead of this:
SELECT * FROM MyTable WHERE Id IN (...)
Do this:
CREATE TABLE #TempIDs ( Id INT );
-- Bulk load the #TempIDs table with the ID's (DON'T issue one INSERT statement per ID!)
SELECT * FROM MyTable INNER JOIN #TempIDs ON MyTable.Id = #TempIDs.Id
Note the comment. For best performance, you need a mechanism for bulk loading the temporary table with ID's - this depends on your RDBMS and your application.
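A minimal T-SQL sketch of the whole pattern (a single multi-row INSERT keeps round trips down; a real application would use BULK INSERT, bcp, or a table-valued parameter for thousands of IDs):

CREATE TABLE #TempIDs ( Id INT PRIMARY KEY );

-- One statement, many IDs - not one INSERT per ID.
INSERT INTO #TempIDs (Id)
VALUES (1), (2), (3), (4), (5);

SELECT t.*
FROM MyTable t
INNER JOIN #TempIDs ids ON t.Id = ids.Id;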
The problem
Pressure on parser and optimizer
A query of the kind
SELECT * FROM x WHERE x.a IN (1,2,3,...,1000)
will (at least in Oracle) be transformed to
SELECT * FROM x WHERE x.a=1 OR x.a=2 OR x.a=3 OR ... OR x.a=1000
You will get a very big parse tree, and (at least in Oracle) you will hit a hard limit at more than 1000 values in the list. So you put pressure on the parser and the optimizer, and this will cost you some performance. Additionally, the database will not be able to use some optimizations.
But there is another problem:
Fixed number of bind variables
Because your query is transformed into an equivalent query using OR expressions, you cannot use a single bind variable for the IN clause (SELECT * FROM x WHERE x.a IN (:values) will not work). You can only use a bind variable for each value in the IN clause. So when you alter the number of values, you get a structurally different query. This puts pressure on the query cache and (at least in Oracle) on the cursor cache.
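To illustrate (Oracle-style positional binds; the point is only that the statement text changes with the list length):

SELECT * FROM x WHERE x.a IN (:1, :2, :3)       -- three values
SELECT * FROM x WHERE x.a IN (:1, :2, :3, :4)   -- four values: a structurally
                                                -- different statement, so a new
                                                -- hard parse and a new cursor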
Solutions
Solution: Use ranges
If you can describe your query without enumerating each value, it will usually become much faster. E.g. instead of WHERE a.x IN (1,...,1000) write WHERE a.x >= 1 AND a.x <= 1000.
Solution: Use a temporary table
This solution is already described in the answer from Dan: pump your values into an (indexed!) temporary table and use either a nested query (WHERE a.x IN (SELECT temp.x FROM temp)), a join (FROM a JOIN temp USING (x)), or a semi join (WHERE EXISTS (SELECT * FROM temp WHERE temp.x=a.x)).
Style guide: My rule of thumb is to use a nested query when I expect few results in the temp table (not much more than 1000) and joins when I expect many results (much more than 1000). With modern optimizers there should be no difference, but I think of it as a hint to the human reader of the query about the expected number of values. I use semi joins (WHERE EXISTS) when I don't care about the values of the temporary table later in the query. Again, this is more for the human reader than for the SQL optimizer.
Solution: Use your database's native collection type
When your database has a native collection type, you can also use it in your query (e.g. TYPE nested_type IS TABLE OF VARCHAR2(20) in Oracle).
This will make your code non portable (usually not a big problem, because people switch their database engine very rarely in an established project).
This might make your code hard to read (at least for developers not that experienced with your brand of SQL database).
An example for Oracle:
DECLARE
  TYPE ILIST IS TABLE OF INTEGER;
  temp ILIST := ILIST();
  result VARCHAR2(20);
BEGIN
  temp.extend(3);
  temp(1) := 1;
  temp(2) := 2;
  temp(3) := 3;
  SELECT a.y INTO result
  FROM a WHERE a.x IN (SELECT * FROM TABLE(temp));
END;
/

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results, which is a limited and ranked list of values, based on a field's data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc...) the bottom query executes MUCH faster than the top one. Sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format that would involve a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "ranking" secondary query is just fluff for this example; the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or similar, but I hope the idea is clear). Remember, this query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to the previous query that uses a temp table. I found the execution times improved in almost exactly the same fashion. WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use the static number style TOP 999999999).
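For reference, a sketch of the first query with the workaround applied (same placeholder table as above):

DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT TOP 999999999 * -- the static TOP forces the CTE to be evaluated separately
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered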
Explanation:
From what I can tell from the actual execution plan of the query in both formats (the original one with normal CTEs, and the one where each sub query has TOP 999999999):
The normal query joins everything together as if all the tables are in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the subqueries from the main query in order to apply the TOP statement's action, thus forcing creation of an in-memory "Bitmap" of the sub query that is then joined to the main query. This appears to do exactly what I wanted, and in fact it may even be more efficient, since servers with large amounts of RAM will be able to do the query execution entirely in MEMORY without any disk IO. In my case we have 280 GB of RAM, so well more than could ever really be used.
Not only can you use indexes on temp tables, but they also allow the use of statistics and the use of hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it specifically says you can't use hints.
Temp tables are often the most performant way to go when you have a large data set and the choice is between temp tables and table variables, even when you don't use indexes (possibly because the engine will use statistics to develop the plan), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ, to determine if it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a good query plan. In the second query, a good query plan can't be generated because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do, and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
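A minimal sketch of that suggestion (index name invented; run it right after the SELECT ... INTO #LargeData):

CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
ON #LargeData (valuefield DESC);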
It is quite possible that SQL Server is optimizing for the wrong value of the parameter.
There are a couple of options
Try using OPTION(RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try using OPTION(OPTIMIZE FOR (@DT=SomeRepresentativeValue)). The problem with this is that you might pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog

Proving SQL query equivalency

How would you go about proving that two queries are functionally equivalent, e.g. that they will always both return the same result set?
As I had a specific query in mind when I was doing this, I ended up doing as @dougman suggested, running both over about 10% of the rows in the tables concerned and comparing the results, ensuring there were no out-of-place results.
The best you can do is compare the 2 query outputs based on a given set of inputs looking for any differences. To say that they will always return the same results for all inputs really depends on the data.
For Oracle, one of the better if not best approaches (very efficient) is here (Ctrl+F "Comparing the Contents of Two Tables"):
http://www.oracle.com/technetwork/issue-archive/2005/05-jan/o15asktom-084959.html
Which boils down to:
select c1,c2,c3,
count(src1) CNT1,
count(src2) CNT2
from (select a.*,
1 src1,
to_number(null) src2
from a
union all
select b.*,
to_number(null) src1,
2 src2
from b
)
group by c1,c2,c3
having count(src1) <> count(src2);
1) Real equivalency proof with Cosette:
Cosette checks (with a proof) whether 2 SQL queries are equivalent, and gives counterexamples when they are not equivalent. It's the only way to be absolutely sure, well, almost ;) You can even throw 2 queries into their website and check (formal) equivalence right away.
Link to Cosette:
https://cosette.cs.washington.edu/
Link to an article that gives a good explanation of how Cosette works: https://medium.com/@uwdb/introducing-cosette-527898504bd6
2) Or if you're just looking for a quick practical fix:
Try this stackoverflow answer: [sql - check if two select's are equal]
Which comes down to:
(select * from query1 MINUS select * from query2)
UNION ALL
(select * from query2 MINUS select * from query1)
This query gives you all rows that are returned by only one of the queries.
This sounds to me like an NP-complete problem. I'm not sure there is a sure-fire way to prove this kind of thing.
This is pretty easy to do.
Let's assume your queries are named a and b.
a
minus
b
should give you an empty set. If it does not, then the queries return different sets, and the result set shows you the rows that are different.
then do
b
minus
a
that should also give you an empty set. If it does, then the queries do return the same sets.
If it is not empty, then the queries are different in some respect, and the result set shows you the rows that are different.
The DBMS vendors have been working on this for a very, very long time. As Rik said, it's probably an intractable problem, but I don't think any formal analysis on the NP-completeness of the problem space has been done.
However, your best bet is to leverage your DBMS as much as possible. All DBMS systems translate SQL into some sort of query plan. You can use this query plan, which is an abstracted version of the query, as a good starting point (the DBMS will do LOTS of optimization, flattening queries into more workable models).
NOTE: modern DBMS use a "cost-based" analyzer which is non-deterministic across statistics updates, so the query planner, over time, may change the query plan for identical queries.
In Oracle (depending on your version), you can tell the optimizer to switch from the cost based analyzer to the deterministic rule based analyzer (this will simplify plan analysis) with a SQL hint, e.g.
SELECT /*+ RULE */ * FROM yourtable
The rule-based optimizer has been deprecated since 8i but it still hangs around even through 10g (I don't know about 11). However, the rule-based analyzer is much less sophisticated: the error rate is potentially much higher.
For further reading of a more generic nature, IBM has been fairly prolific with their query-optimization patents. This one here on a method for converting SQL to an "abstract plan" is a good starting point:
http://www.patentstorm.us/patents/7333981.html
Perhaps you could draw (by hand) out your query and the results using Venn Diagrams, and see if they produce the same diagram. Venn diagrams are good for representing sets of data, and SQL queries work on sets of data. Drawing out a Venn Diagram might help you to visualize if 2 queries are functionally equivalent.
This will do the trick. If this query returns zero rows the two queries are returning the same results. As a bonus, it runs as a single query, so you don't have to worry about setting the isolation level so that the data doesn't change between two queries.
select * from ((<query 1> MINUS <query 2>) UNION ALL (<query 2> MINUS <query 1>))
Here's a handy shell script to do this:
#!/bin/sh
CONNSTR=$1
echo query 1, no semicolon, eof to end:; Q1=`cat`
echo query 2, no semicolon, eof to end:; Q2=`cat`
T="(($Q1 MINUS $Q2) UNION ALL ($Q2 MINUS $Q1));"
echo "select count(*) from $T" | sqlplus -S -L "$CONNSTR"
CAREFUL! Functional "equivalence" is often based on the data, and you may "prove" equivalence of 2 queries by comparing results for many cases and still be wrong once the data changes in a certain way.
For example:
SQL> create table test_tabA
(
col1 number
)
Table created.
SQL> create table test_tabB
(
col1 number
)
Table created.
SQL> -- insert 1 row
SQL> insert into test_tabA values (1)
1 row created.
SQL> commit
Commit complete.
SQL> -- Not exists query:
SQL> select * from test_tabA a
where not exists
(select 'x' from test_tabB b
where b.col1 = a.col1)
COL1
----------
1
1 row selected.
SQL> -- Not IN query:
SQL> select * from test_tabA a
where col1 not in
(select col1
from test_tabB b)
COL1
----------
1
1 row selected.
-- THEY MUST BE THE SAME!!! (or maybe not...)
SQL> -- insert a NULL to test_tabB
SQL> insert into test_tabB values (null)
1 row created.
SQL> commit
Commit complete.
SQL> -- Not exists query:
SQL> select * from test_tabA a
where not exists
(select 'x' from test_tabB b
where b.col1 = a.col1)
COL1
----------
1
1 row selected.
SQL> -- Not IN query:
SQL> select * from test_tabA a
where col1 not in
(select col1
from test_tabB b)
no rows selected.
You don't.
If you need a high level of confidence that a performance change, for example, hasn't changed the output of a query, then test the hell out of it.
If you need a really high level of confidence .. then errrm, test it even more.
Massive levels of testing aren't that hard to cobble together for a SQL query. Write a proc which will iterate around a large/complete set of possible parameters, call each query with each set of params, and write the outputs to respective tables. Compare the two tables and there you have it.
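A hedged T-SQL sketch of such a harness (every name here is invented; the two SELECTs stand in for the two candidate queries, parameterized by @p):

DECLARE @p INT = 1;
WHILE @p <= 1000
BEGIN
    -- run both candidates for this parameter and keep the outputs
    INSERT INTO results_a (param, val) SELECT @p, col1 FROM MyTable   WHERE id = @p; -- query 1
    INSERT INTO results_b (param, val) SELECT @p, col1 FROM MyTableV2 WHERE id = @p; -- query 2
    SET @p = @p + 1;
END;

-- zero rows here means the outputs agreed for every parameter tried
(SELECT * FROM results_a EXCEPT SELECT * FROM results_b)
UNION ALL
(SELECT * FROM results_b EXCEPT SELECT * FROM results_a);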
It's not exactly scientific, which I guess was the OP's question, but I'm not aware of a formal method to prove equivalency.