Why does a subselect make the SQL query slower? - sql

I have the following code:
select *
from table_1
join table_2
  on table_1.col1 = table_2.col1
where table_2.col2 = 1
This query works and gives me the results I expect. Now I would like to optimize it. The idea is to reduce the second table before joining the two tables. In other words, I suppose that "removing" rows first and joining smaller tables should be faster than joining the big tables and then selecting what I need from the result. I implemented my idea in the following way:
select *
from table_1
join (
    select *
    from table_2
    where table_2.col2 = 1
) as table_2
  on table_1.col1 = table_2.col1
Surprisingly the second query is significantly slower than the first one. What am I doing wrong?

You can see the difference in the query execution plans.
Without the plans I can only assume:
In your first example you have two real tables. The MySQL optimizer has statistics for them and can correctly choose and use an index.
In your second query the join is against a derived result, not a table, so the optimizer has no statistics for it. It may end up executing the join without an index, or something along those lines.
I think the subquery is bad practice in your case. You have a simple query; you should stick with your first example.
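If it helps, a quick way to see that difference, assuming MySQL as this answer does, is to prefix both statements with EXPLAIN and compare the output (the table_2 alias on the derived table is required by MySQL; select_type, key and rows are standard EXPLAIN columns):
EXPLAIN
select *
from table_1
join table_2 on table_1.col1 = table_2.col1
where table_2.col2 = 1;
EXPLAIN
select *
from table_1
join (
    select * from table_2 where table_2.col2 = 1
) as table_2 on table_1.col1 = table_2.col1;
If the second plan shows a DERIVED step with no usable key, that matches the explanation above; newer MySQL versions may merge the derived table and produce identical plans.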

Related

SQL query efficiency of subquery in select vs inner join

I have a query with the following structure:
SELECT
Id,
(SELECT COUNT(1) AS [A1]
FROM [dbo].Table2 AS [Extent4]
WHERE (Table1.Id = [Extent4].Id2)) AS [C1]
FROM TPO_User
This query structure is usually used by LINQ as opposed to the following structure:
SELECT Id
FROM Table1
LEFT OUTER JOIN
(SELECT COUNT(1) AS [A1], [Extent4].Id2
FROM [dbo].Table2 AS [Extent4]
GROUP BY [Extent4].Id2) AS [C1] ON C1.Id2 = Table1.Id
When I compare them, the second query has a shorter duration. Could someone explain the exact difference in execution of such a query?
And is it worth it to ever have a subquery in your select statement instead of an inner join?
I would expect both queries to have similar performance characteristics. When doing performance comparisons, you have to be sure you do them correctly. For instance, running two queries in a row is not a good comparison, because the table data has already been loaded into memory.
To really compare the queries, you need a quiescent server and cold caches. That said, the execution plan can be a big help in understanding what is happening.
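If this is SQL Server (an assumption on my part, based on the [dbo] naming), one rough way to get cold caches on a test instance between runs is the following; never run this on a production server:
CHECKPOINT;              -- write dirty pages to disk so they can be dropped
DBCC DROPCLEANBUFFERS;   -- empty the buffer pool (data cache)
DBCC FREEPROCCACHE;      -- clear all cached execution plans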
I would expect the correlated subquery to have good performance with the right indexes. For your example, you want an index on Table2(Id2).
Which has better performance in general? Well, it is simple to devise scenarios where the correlated subquery is better. For instance, if TPO_User has 1 row and Table2 has 1,000,000 rows, then the correlated subquery will be better under almost any circumstances.
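For the Table2(Id2) index mentioned above, a minimal sketch (the index name is made up):
CREATE INDEX IX_Table2_Id2 ON [dbo].Table2 (Id2);
With that index in place, each execution of the correlated subquery can become a narrow index seek rather than a scan.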
In my understanding:
the FROM clause is the definition of the target.
the SELECT clause is the projection (row-by-row) definition.
So the FROM clause loads the data you need into memory, and after that the projection is applied to each row of your select statement.
So if you run a query (or call a function...) in the SELECT clause, you are saying that you want this sub-job to be done for each row of your projection. Seems quite heavy ;)
A little source on the execution order of an SQL query: https://www.periscopedata.com/blog/sql-query-order-of-operations
Hope this helps (and please do not hesitate to correct me if I am wrong)
(And if I remember correctly, SQL Server now has an automatic query tuning feature. Shouldn't it make this kind of correction by itself?)

SQL EXISTS - Why does selecting rownum cause an inefficient execution plan?

Problem
I'm trying to understand why what seems like a minor difference between these two Oracle UPDATE queries is causing a radically different execution plan.
Query 1:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select *
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3)
Query 2:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select rownum
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3)
As you can see the only difference between the two is that the subquery in Query 2 returns a rownum instead of the values of every row.
The execution plans for these two couldn't be more different:
Query 1 - Pulls the full results from both tables and uses a sort and a hash join to return the results. This performs well, with a favorable cost of 2,346 (despite the use of the EXISTS clause and the correlated subquery).
Query 2 - Pulls both table results as well, but uses a count and a filter to accomplish the same task, and returns an execution plan with an astonishing cost of 77,789,696! I should note that this query just hangs on me, so I'm not actually positive it returns the same results (though I believe it should).
From my understanding, the EXISTS clause is just a simple boolean check that runs per row of the main table. It doesn't matter whether my EXISTS condition returns a single row or 100,000 rows... if any result is returned for the row being checked, the EXISTS test passes. So why would it matter what my subquery's SELECT list returns?
--------------------EDIT----------------------
Per request, below are the execution plans I'm running in TOAD... please note I edited the table names in my example above for ease - In these plans ALSS_SALES2 = sales above and SALESEXT_TMP = tempTABLE above.
I also should have mentioned that neither of the two tables has indexes at this point. I haven't yet added them to my tempTable, and I'm testing with a cheap copy of the sales table which contains only the fields and data, but no indexes, constraints or security.
Thanks for the assistance everyone!
Query 1 Execution Plan
Query 2 Execution Plan
------------------------------------------------
Questions
1) Why did the call for rownum cause the execution plan to change?
2) What is it about the Filter that is so incredibly inefficient?
3) Am I missing something fundamental with the way the Exists clause works that is causing this change?
Posting the actual query plans would be quite helpful.
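If it helps, a sketch of one way to capture those plans outside TOAD, assuming you can run statements in SQL*Plus or SQL Developer:
EXPLAIN PLAN FOR
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select rownum
              FROM tempTable tmp
              WHERE s.key1 = tmp.key1
                AND s.key2 = tmp.key2
                AND s.key3 = tmp.key3);
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);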
In general, though, when the optimizer sees a subquery with rownum, that radically limits its ability to transform the query and merge the results from the subquery with the main query because doing so potentially affects the results. That can be a quick way to force Oracle to materialize a subquery if that happens to be more efficient than the plan chosen by the optimizer. In this case, though, it is probably causing the optimizer to forego a transform step that makes the query more efficient.
Occasionally, you'll see someone take a query like
SELECT b.*
FROM (SELECT <<columns>>
FROM driving_table
WHERE <<conditions>>) a,
b
WHERE a.id = b.id
and tack on a rownum to the a subquery
SELECT b.*
FROM (SELECT <<columns>>, rownum
FROM driving_table
WHERE <<conditions>>) a,
b
WHERE a.id = b.id
in order to force the optimizer to evaluate the a subquery before executing the join. Normally, of course, the optimizer should do this by default if it is more efficient. But if the optimizer makes a mistake, adding rownum can be quicker than figuring out the right set of hints to force a plan or digging in to the underlying problem to figure out the right solution.
Of course, in the particular case that you have a subquery in a WHERE EXISTS where the only use of rownum comes in the SELECT list, we humans can detect that the rownum shouldn't prevent any query transform step that the optimizer would care to use. The optimizer, though, is probably using a more general rule that says that subqueries that reference a function like rownum must be completely executed (this may depend on the exact Oracle version and/or the optimizer settings). So the optimizer is realistically doing a bunch of extra work because it's not smart enough to recognize that the rownum you added cannot possibly affect the results of the query.
Just a question, what's the execution plan for this query:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select NULL
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3);
It visualizes what is actually needed in an EXISTS (...) expression - nothing! As already stated, Oracle just has to check whether anything is returned, not what is returned by the subquery.

Why does IF (query) take over an hour, when query takes less than a second?

I have the following SQL:
IF EXISTS
(
SELECT
1
FROM
SomeTable T1
WHERE
SomeField = 1
AND SomeOtherField = 1
AND NOT EXISTS(SELECT 1 FROM SomeOtherTable T2 WHERE T2.KeyField = T1.KeyField)
)
RAISERROR ('Blech.', 16, 1)
The SomeTable table has around 200,000 rows, and the SomeOtherTable table has about the same.
If I execute the inner SQL (the SELECT), it executes in sub-second time, returning no rows. But, if I execute the entire script (IF...RAISERROR) then it takes well over an hour. Why?
Now, obviously, the execution plan is different - I can see that in Enterprise Manager - but again, why?
I could probably do something like SELECT @num = COUNT(*) WHERE ... and then IF @num > 0 RAISERROR but... I think that's missing the point somewhat. You can only code around a bug (and it sure looks like a bug to me) if you know that it exists.
EDIT:
I should mention that I already tried re-jigging the query into an OUTER JOIN as per @Bohemian's answer, but this made no difference to the execution time.
EDIT 2:
I've attached the query plan for the inner SELECT statement:
... and the query plan for the whole IF...RAISERROR block:
Obviously these show the real table/field names, but apart from that the query is exactly as shown above.
The IF does not magically turn off optimizations or damage the plan. The optimizer just noticed that EXISTS only needs one row at most (like a TOP 1). This is called a "row goal" and it normally happens when you do paging. But also with EXISTS, IN, NOT IN and such things.
My guess: if you add TOP 1 to the original query, you get the same (bad) plan.
The optimizer tries to be smart here and only produce the first row using much cheaper operations. Unfortunately, it misestimates cardinality. It guesses that the query will produce lots of rows although in reality it produces none. If it estimated correctly you'd just get a more efficient plan, or it would not do the transformation at all.
I suggest the following steps:
fix the plan by reviewing indexes and statistics
if this didn't help, change the query to IF (SELECT COUNT(*) FROM ...) > 0, which will give the original plan because the optimizer does not have a row goal (see the sketch below)
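Applied to the query in the question, that rewrite would look roughly like this (a sketch):
IF (
    SELECT COUNT(*)
    FROM SomeTable T1
    WHERE SomeField = 1
      AND SomeOtherField = 1
      AND NOT EXISTS (SELECT 1 FROM SomeOtherTable T2 WHERE T2.KeyField = T1.KeyField)
) > 0
    RAISERROR ('Blech.', 16, 1)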
It's probably because the optimizer can figure out how to turn your query into a more efficient query, but somehow the IF prevents that. Only the execution plan will tell you why the query is taking so long, but I can tell you how to make this whole thing more efficient... Instead of using a correlated subquery, which is incredibly inefficient - you get "n" subqueries run for "n" rows in the main table - use a JOIN.
Try this:
IF EXISTS (
SELECT 1
FROM SomeTable T1
LEFT JOIN SomeOtherTable T2 ON T2.KeyField = T1.KeyField
WHERE SomeField = 1
AND SomeOtherField = 1
AND T2.KeyField IS NULL
) RAISERROR ('Blech.', 16, 1)
The "trick" here is to use s LEFT JOIN and filter out all joined rows by testing for a null in the WHERE clause, which is executed after the join is made.
Please try SELECT TOP 1 KeyField. Using the primary key will work faster, in my guess.
NOTE: I posted this as answer as I couldn't comment.

WHERE and JOIN order of operation

My question is similar to this SQL order of operations but with a little twist, so I think it's fair to ask.
I'm using Teradata. And I have 2 tables: table1, table2.
table1 has only an id column.
table2 has the following columns: id, val
I might be wrong but I think these two statements give the same results.
Statement 1.
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table2.val<100
Statement 2.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT *
FROM table2
WHERE val<100
) table3
ON table1.id=table3.id
My question is, will the query optimizer be smart enough to
- execute the WHERE clause first then JOIN later in Statement 1
- know that table 3 isn't actually needed in Statement 2
I'm pretty new to SQL, so please educate me if I'm misunderstanding anything.
This would depend on many, many things (table size, indexes, key distribution, etc.); you should just check the execution plan.
You don't say which database, but here are some ways:
MySql EXPLAIN
SQL Server SET SHOWPLAN_ALL (Transact-SQL)
Oracle EXPLAIN PLAN
what is explain in teradata?
Teradata Capture and compare plans faster with Visual Explain and XML plan logging
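For Teradata (the database named in the question), you can prefix the statement with EXPLAIN to get the plan as text; a sketch using Statement 1:
EXPLAIN
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table2.val<100;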
Depending on the availability of statistics and indexes for the tables in question, the query rewrite mechanism in the optimizer may or may not opt to scan Table2 for records where val < 100 before scanning Table1.
In certain situations, based on data demographics, joins, indexing and statistics, you may find that the optimizer is not eliminating records in the query plan when you feel that it should, even if you have a derived table such as the one in your example. You can force the optimizer to process a derived table by simply placing a GROUP BY in your derived table. The optimizer is then obligated to resolve the GROUP BY aggregate before it can consider resolving the join between the two tables in your example.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT table2.id, table2.val
FROM table2
WHERE val<100
GROUP BY 1,2
) table3
ON table1.id=table3.id
This is not to say that your standard approach should be to run with this throughout your code. This is typically one of my last resorts when I have a query plan that simply doesn't eliminate extraneous records early enough in the plan and results in too much data being scanned and carried around through the various SPOOL files. This is simply a technique you can put in your toolkit for when you encounter such a situation.
The query rewrite mechanism is continually being updated from one release to the next and the details about how it works can be found in the SQL Transaction Processing Manual for Teradata 13.0.
Unless I'm missing something, why do you even need Table1?
Just query Table2
Select id, val
From table2
WHERE val<100
Or are you using the rows in table1 as a filter? i.e., does table1 only contain a subset of the Ids in Table2?
If so, then this will work as well ...
Select id, val
From table2
Where val<100
And id In (Select id
From table1)
But to answer your question, yes, the query optimizer should be intelligent enough to figure out the best order in which to execute the steps necessary to translate your logical instructions into a physical result. It uses the stored statistics that the database maintains on each table to determine what to do (what type of join logic to use, for example), as well as what order to perform the operations in, in order to minimize disk IOs and processing costs.
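If the optimizer is making poor choices, refreshing those statistics is usually the first thing to check. A sketch of the Teradata syntax (the exact form varies between releases, so treat this as an assumption):
COLLECT STATISTICS ON table1 COLUMN (id);
COLLECT STATISTICS ON table2 COLUMN (id);
COLLECT STATISTICS ON table2 COLUMN (val);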
Q1. execute the WHERE clause first then JOIN later in Statement 1
The thing is, if you switch the order of the inner join, i.e. table2 INNER JOIN table1, then I guess the WHERE clause can be processed before the JOIN operation, during the preparation phase. However, even if you don't change the original query, the optimizer should be able to switch the order itself if it thinks the join would be too expensive when fetching whole rows, and apply the WHERE first. Just my guess.
Q2. know that table 3 isn't actually needed in Statement 2
Teradata will interpret your second query in such a way that the derived table is treated as necessary, so it will keep processing the operations involving table 3.

Slow query with left outer join and is null condition

I've got a simple query (postgresql if that matters) that retrieves all items for some_user excluding the ones she has on her wishlist:
select i.*
from core_item i
left outer join core_item_in_basket b on (i.id=b.item_id and b.user_id=__some_user__)
where b.on_wishlist is null;
The above query runs in ~50000ms (yep, the number is correct).
If I remove the "b.on_wishlist is null" condition or make it "b.on_wishlist is not null", the query runs in some 50ms (quite a change).
The query has more joins and conditions but this is irrelevant as only this one slows it down.
Some info on the database size:
core_items has ~10.000 records
core_user has ~5.000 records
core_item_in_basket has ~2.000 records (of which some 50% has on_wishlist = true, the rest is null)
I don't have any indexes (except for ids and foreign keys) on those two tables.
The question is: what should I do to make this run faster? I've got a few ideas myself to check out this evening, but I'd like you guys to help if possible, as well.
Thanks!
try using not exists:
select i.*
from core_item i
where not exists (select * from core_item_in_basket b where i.id=b.item_id and b.user_id=__some_user__)
Sorry for adding 2nd answer, but stackoverflow doesn't let me format comments properly, and since formatting is essential, I have to post answer.
Couple of options:
CREATE INDEX q ON core_item_in_basket (user_id, item_id) WHERE on_wishlist is null;
same index, but with the order of the columns reversed (see the sketch after this list).
SELECT i.* FROM core_item i WHERE i.id not in (select item_id FROM core_item_in_basket WHERE on_wishlist is null AND user_id = __some_user__); (this query can benefit from the index from point #1, but will not benefit from index #2.)
SELECT * from core_item where id in (select id from core_item EXCEPT select item_id FROM core_item_in_basket WHERE on_wishlist is null AND user_id = __some_user__);
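For reference, the column-order variant from point 2 would presumably look like this (the q2 name is made up):
CREATE INDEX q2 ON core_item_in_basket (item_id, user_id) WHERE on_wishlist is null;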
Let us know the results :)
You might want to explain more about the purpose of this query - some techniques make sense and some don't, depending on the use case.
How often are you running it?
Is it run for only 1 user, or you run it for all users in some kind of loop?
Do: explain analyze and put the output on explain.depesz.com so you will see why it is so slow.
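A sketch of what that looks like for the query in question (PostgreSQL's EXPLAIN ANALYZE actually runs the statement and reports the real row counts and timings):
EXPLAIN ANALYZE
select i.*
from core_item i
left outer join core_item_in_basket b on (i.id=b.item_id and b.user_id=__some_user__)
where b.on_wishlist is null;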
Have you tried adding an index on on_wishlist?
It seems that this column needs to be checked for every row in the query. If your tables are that big, this might have quite a significant impact on the query speed.
Because you put the on_wishlist condition in the WHERE clause, it will (depending on what the query planner decides) be evaluated after the join has been performed, so that comparison has to be done for potentially every row resulting from the join. Both the core_items and core_item_in_basket tables are pretty big, and you don't have an index on that column, so there is very little the query optimizer can do, which probably leads to the excessive query time.
The size of core_user should have no influence (as it is not referenced in the query).
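A minimal sketch of the index this answer suggests (the name is made up); whether it actually helps will show up in the EXPLAIN ANALYZE output mentioned earlier:
CREATE INDEX core_item_in_basket_wishlist_idx ON core_item_in_basket (on_wishlist);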