SQL EXISTS Why does selecting rownum cause inefficient execution plan?

SQL EXISTS Why does selecting rownum cause inefficient execution plan? - sql

Problem
I'm trying to understand why what seems like a minor difference in these two Oracle Syntax Update queries is causing a radically different execution plan.
Query 1:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select *
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3)
Query 2:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select rownum
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3)
As you can see the only difference between the two is that the subquery in Query 2 returns a rownum instead of the values of every row.
The execution plans for these two couldn't be more different:
Query1 - Pulls the total results from both tables and uses a sort and a hashjoin to return the results. This peforms well with a favorable 2,346 cost (despite the use of the EXISTS clause and the cohesive subquery).
Query2 - Pulls both table results as well but uses a count and a filter to accomplish the same task and returns an execution plan with an astonishing 77,789,696 cost! I should note that his query just hangs on me so I'm not actually positive this returns the same results (though I believe it should).
From my understanding of the Exists clause it is just a simple boolean check that runs per line of the main table. It doesn't matter if a single row is returned in my EXISTS condition or 100,000 rows... if any results are returned for the row that it is being run, then you've passed the exist check. So why would it matter what my subquery SELECT statement returns?
--------------------EDIT----------------------
Per request, below are the execution plans I'm running in TOAD... please note I edited the table names in my example above for ease - In these plans ALSS_SALES2 = sales above and SALESEXT_TMP = tempTABLE above.
Also should have mentioned but neither of the two tables has indices at this point.. I haven't yet added them to my tempTable and I'm testing with a cheap copy of the sales table which only contains the fields and data but no indices, constraints or security.
Thanks for the assistance everyone!
Query 1 Execution Plan
Query 2 Execution Plan
------------------------------------------------
Questions
1) Why did the call for rownum cause the execution plan to change?
2) What is it about the Filter that is so incredibally inefficient?
3) Am I missing something fundamental with the way the Exists clause works that is causing this change?

Posting the actual query plans would be quite helpful.
In general, though, when the optimizer sees a subquery with rownum, that radically limits its ability to transform the query and merge the results from the subquery with the main query because doing so potentially affects the results. That can be a quick way to force Oracle to materialize a subquery if that happens to be more efficient than the plan chosen by the optimizer. In this case, though, it is probably causing the optimizer to forego a transform step that makes the query more efficient.
Occasionally, you'll see someone take a query like
SELECT b.*
FROM (SELECT <<columns>>
FROM driving_table
WHERE <<conditions>>) a,
b
WHERE a.id = b.id
and tack on a rownum to the a subquery
SELECT b.*
FROM (SELECT <<columns>>, rownum
FROM driving_table
WHERE <<conditions>>) a,
b
WHERE a.id = b.id
in order to force the optimizer to evaluate the a subquery before executing the join. Normally, of course, the optimizer should do this by default if it is more efficient. But if the optimizer makes a mistake, adding rownum can be quicker than figuring out the right set of hints to force a plan or digging in to the underlying problem to figure out the right solution.
Of course, in the particular case that you have a subquery in a WHERE EXISTS where the only use of rownum comes in the SELECT list, we humans can detect that the rownum shouldn't prevent any query transform step that the optimizer would care to use. The optimizer, though, is probably using a more general rule that says that subqueries that reference a function like rownum must be completely executed (this may depend on the exact Oracle version and/or the optimizer settings). So the optimizer is realistically doing a bunch of extra work because it's not smart enough to recognize that the rownum you added cannot possibly affect the results of the query.

Just a question, what's the execution plan for this query:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select NULL
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3);
It visualize what is needed in an EXISTS (...) expression - actually nothing! As already stated Oracle just have to check if anything is returned, not what is returned in Sub-Query.

Related

SQL query efficiency of subquery in select vs inner join

I have a query with the following structure:
SELECT
Id,
(SELECT COUNT(1) AS [A1]
FROM [dbo].Table2 AS [Extent4]
WHERE (Table1.Id = [Extent4].Id2)) AS [C1]
FROM TPO_User
This query structure is usually used by LINQ as opposed to the following structure:
SELECT Id
FROM Table1
LEFT OUTER JOIN
(SELECT COUNT(1) AS [A1], [Extent4].Id2
FROM [dbo].Table2 AS [Extent4]
GROUP BY [Extent4].Id2) AS [C1] ON C1.Id2 = Table1.Id
When I compare them, the second query has a shorter duration. Could someone explain the exact difference in execution of such a query?
And is it worth it to ever have a subquery in your select statement instead of an inner join?

I would expect both queries to have similar performance characteristics. When doing performance comparisons, you have to be sure you do them correctly. For instance, running two queries in a row is not a good comparison, because the table data has been loaded in to memory.
To really compare the queries, you need a quiescent server and cold caches. That said, the execution plan can be a big help in understanding what is happening.
I would expect the correlated subquery to have good performance with the right indexes. For your example, you want an index on Table2(Id2).
Which has better performance in general? Well, it is simple to devise scenarios where the correlated subquery is better. For instance, if TPO_User has 1 row and Table2 has 1,000,000 rows, then the correlated subquery will be better under almost any circumstances.

In my understanding:
the FROM clause is the definition of the target.
the SELECT clause is the projection (line-by-line) definition.
So the FROM clause load the data you need in memory and after that the projection is made on each line of your select statement.
So if you do a query (or call a function...) in the SELECT clause, you say that you want this sub-job to be done for each line of your projection. Seems quite heavy ;)
A little source about the running order of an SQL request : https://www.periscopedata.com/blog/sql-query-order-of-operations
Hope this helps (and do not hesitate people to correct me if I am wrong)
(And if I remember well there is now an automatic feature to optimize queries in sql server. I think it will do the correction by itself, should it not?)

Oracle how to avoid merge cartesian join in execution plan?

The following query is sometimes resulting in a merge cartesian join in the execution plan, we're trying to rewrite the query (in the simplest fashion) in order to ensure the merge cartesian join will not occur anymore.
SELECT COL1
FROM SCHEMA.VIEW_NAME1
WHERE DATE_VAL > (SELECT DATE_VAL FROM SCHEMA.VIEW_NAME2)
After reviewing a similar question "Why would this query cause a Merge Cartesian Join in Oracle", the problem seems to be "Oracle doesn't know that (SELECT DATE_VAL FROM SCHEMA.VIEW_NAME2) returns one single result. So it's assuming that it will generate lots of rows."
Is there some way to tell the Oracle optimizer that the sub-select will only return one row?
Would using a function that returns a datetime value in place of the sub-select help, assuming that the optimizer would then know that the function can only return one value?
SELECT COL1
FROM SCHEMA.VIEW_NAME1
WHERE DATE_VAL > SCHEMA.FN_GET_DATE_VAL()
The Oracle DBA recommended using a WITH statement, which seems like it will work, but we were wondering if there were any shorter options.
with mx_dt as (SELECT DATE_VAL FROM SCHEMA.VIEW_NAME2)
SELECT COL1
FROM SCHEMA.VIEW_NAME1, mx_dt a
WHERE DATE_VAL > a.DATE_VAL

I wouldn't worry about the Cartesian join, because the subquery is only returning one row (at most). Otherwise, you would get a "subquery returns too many rows" error.
It is possible that Oracle runs that the subquery once for each comparison -- possible, but the Oracle optimizer is smart and I doubt that would happen. However, it is easy enough to phrase this as a JOIN:
SELECT n1.COL1
FROM SCHEMA.VIEW_NAME1 n1 JOIN
SCHEMA.VIEW_NAME2 n2
ON n1.DATE_VAL > n2.DATE_VAL;
However, it is possible that this execution plan is even worse, because you have not specified that n2 is only supposed to return (at most) one value.

An aggregate function in the sub-select ensures a single row is returned. Probably would be a good hint to the optimizer and if there is only 1 row in VIEW_NAME2 then the result of the sub-select is the same.
SELECT COL1
FROM SCHEMA.VIEW_NAME1
WHERE DATE_VAL > (SELECT MIN(DATE_VAL) FROM SCHEMA.VIEW_NAME2)

Try adding WHERE ROWNUM >= 1:
SELECT COL1
FROM SCHEMA.VIEW_NAME1
WHERE DATE_VAL > (SELECT DATE_VAL FROM SCHEMA.VIEW_NAME2 WHERE ROWNUM >= 1)
That predicate looks totally extraneous, or the kind of thing Oracle would just ignore, but the ROWNUM pseudo-function is special. When Oracle sees it, it thinks "these rows must be returned in order, I better not do any query transformations". Which means that it won't try to push predicates, merge views, etc. Which means VIEW_NAME1 will be run separately from VIEW_NAME2 and they now both run just as fast as before.
You'll probably still see a Cartesian product in the explain plan but hopefully only near the top, between the two view result sets. If there really is only one row returned then a Cartesian product is probably the right operation.

Join Subquery result Performance

I'm looking to optimize my SQL query and would like to know, in terms of optimization performance, if
SELECT A.this, B.another FROM A
JOIN B
ON A.this = B.that
WHERE B.another > 6
AND A.something < 3;
is better than:
SELECT A.this, B.another
FROM (SELECT this FROM A WHERE A.something < 3) AS A
JOIN (SELECT another FROM B WHERE B.another > 6) AS B
ON A.this = B.that;

The two queries will be identical when run. Try running the two queries preceded by explain analyze, and you should get the exact same query plan.
In a nutshell, Postgres will parse the query and come up with a query tree.
It'll then rewrite the query tree when appropriate, to remove redundant things such as an 1 = 1 where clause, replacing a small IN () clause or with its equivalent using ANY, or, in your case, collapsing your two subqueries into the parent query. The reason it'll do so is that it'll basically view them as select (select ...) with the inner select not being subjected to an aggregate, group by or limit clause.
Only then does it spend some time analyzing the query tree itself in search for a query plan it deems sufficiently optimal. And finally execute the query plan to return the rows you've asked for.
For the gory details you'll want to check out the Postgres manual on performance tips, as well as the explain and explain analyze syntax:
http://www.postgresql.org/docs/current/static/sql-explain.html
http://www.postgresql.org/docs/current/static/performance-tips.html
http://www.postgresql.org/docs/current/static/planner-stats-details.html
http://www.postgresql.org/docs/current/static/geqo.html
https://wiki.postgresql.org/wiki/Performance_Optimization
Also note the Postgres performance mailing list, whose archives you'll find here:
http://www.postgresql.org/list/pgsql-performance/
Studying them is well worth the effort, in the sense that doing so will give you a wealth of insights as to what is going on under the hood. Give attention to messages written by Tom Lane in particular -- he occasionally gives a first hand account of what's going on in the query rewriter and planner.

Why subselect makes the SQL-request slower?

I have the following code:
select
*
from
table_1
join
table_2
on
table_1.col1 = table_2.col1
where
table_2.col2 = 1
This query works and give me the results that I expect. Now I would like to optimize this query. The idea is that I try to reduce the second query before the joining the two table. In other words, I suppose that "removing" rows and joining smaller tables should be faster that joining big tables and then selecting from them what I need. I implement my idea in the following way:
select
*
from
table_1
join
(
select
*
from
table_2
where
table_2.col2 = 1
)
on
table_1.col1 = table_2.col1
Surprisingly the second query is significantly slower than the first one. What am I doing wrong?

You can see difference in query execution plan.
Without plan i can only assume:
In your first example, you have 2 tables. Mysql optimizer have some data statistic and can correctly choose and use index.
In your second query you don't have a table, only query result and optimizer haven't data statistic. May be in your case, optimizer execute query without index or something like this.
I think, subquery in your case, is bad practice. You have simple query, you must use your first example.

Why does IF (query) take over an hour, when query takes less than a second?

I have the following SQL:
IF EXISTS
(
SELECT
1
FROM
SomeTable T1
WHERE
SomeField = 1
AND SomeOtherField = 1
AND NOT EXISTS(SELECT 1 FROM SomeOtherTable T2 WHERE T2.KeyField = T1.KeyField)
)
RAISERROR ('Blech.', 16, 1)
The SomeTable table has around 200,000 rows, and the SomeOtherTable table has about the same.
If I execute the inner SQL (the SELECT), it executes in sub-second time, returning no rows. But, if I execute the entire script (IF...RAISERROR) then it takes well over an hour. Why?
Now, obviously, the execution plan is different - I can see that in Enterprise Manager - but again, why?
I could probably do something like SELECT #num = COUNT(*) WHERE ... and then IF #num > 0 RAISERROR but... I think that's missing the point somewhat. You can only code around a bug (and it sure looks like a bug to me) if you know that it exists.
EDIT:
I should mention that I already tried re-jigging the query into an OUTER JOIN as per #Bohemian's answer, but this made no difference to the execution time.
EDIT 2:
I've attached the query plan for the inner SELECT statement:
... and the query plan for the whole IF...RAISERROR block:
Obviously these show the real table/field names, but apart from that the query is exactly as shown above.

The IF does not magically turn off optimizations or damage the plan. The optimizer just noticed that EXISTS only needs one row at most (like a TOP 1). This is called a "row goal" and it normally happens when you do paging. But also with EXISTS, IN, NOT IN and such things.
My guess: if you write TOP 1 to the original query you get the same (bad) plan.
The optimizer tries to be smart here and only produce the first row using much cheaper operations. Unfortunately, it misestimates cardinality. It guesses that the query will produce lots of rows although in reality it produces none. If it estimated correctly you'd just get a more efficient plan, or it would not do the transformation at all.
I suggest the following steps:
fix the plan by reviewing indexes and statistics
if this didn't help, change the query to IF (SELECT COUNT(*) FROM ...) > 0 which will give the original plan because the optimizer does not have a row goal.

It's probably because the optimizer can figure out how to turn your query into a more efficient query, but somehow the IF prevents that. Only an EXPLAIN will tell you why the query is taking so long, but I can tell you how to make this whole thing more efficient... Indtead of using a correlated subquery, which is incredibly inefficient - you get "n" subqueries run for "n" rows in the main table - use a JOIN.
Try this:
IF EXISTS (
SELECT 1
FROM SomeTable T1
LEFT JOIN SomeOtherTable T2 ON T2.KeyField = T1.KeyField
WHERE SomeField = 1
AND SomeOtherField = 1
AND T2.KeyField IS NULL
) RAISERROR ('Blech.', 16, 1)
The "trick" here is to use s LEFT JOIN and filter out all joined rows by testing for a null in the WHERE clause, which is executed after the join is made.

Please try SELECT TOP 1 KeyField. Using primary key will work faster in my guess.
NOTE: I posted this as answer as I couldn't comment.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas