Is IN (SELECT ...) bad for performance? - sql

Suppose I have the following code:
SELECT *
FROM [myTable]
WHERE [myColumn] IN (SELECT [otherColumn] FROM [myOtherTable])
Will the subquery be executed again and again for every row?
If so, can I execute it once, store its results, and reuse them for every row instead? For example:
SELECT [otherColumn]
INTO #Results
FROM [myOtherTable]
SELECT *
FROM [myTable]
WHERE [myColumn] IN (SELECT [otherColumn] FROM #Results)

The SQL Server query optimizer is smart enough not to run the same subquery over and over again. If anything, the temp table is less optimal because of the additional steps after getting the results.
You can see this by looking at the SQL query execution plan.
Edit: After looking into this further, the subquery can in fact run more than once. The query optimizer can also do a lot of interesting things, like converting your IN to a JOIN to increase performance. There's lots of information on it here: Number of times a nested query is executed
Nonetheless, view your execution plan to see what your RDBMS's query optimizer decided to do.
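In SSMS, for example, you can capture the estimated plan as text before running the statement; a minimal sketch using the query from the question:
SET SHOWPLAN_TEXT ON;
GO
-- The plan is displayed instead of the results; look at how the subquery is handled.
SELECT *
FROM [myTable]
WHERE [myColumn] IN (SELECT [otherColumn] FROM [myOtherTable]);
GO
SET SHOWPLAN_TEXT OFF;
GO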

Have you considered using a join instead? I think that could be best in terms of performance.
SELECT * FROM [myTable] INNER JOIN [myOtherTable]
ON ([myTable].[myColumn] = [myOtherTable].[otherColumn]);
This, however, will only work if you don't expect duplicates in myOtherTable; duplicates there would make the join return one row per match, unlike IN.
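If duplicates are possible, a semi-join via EXISTS returns the same rows as the IN form while leaving the optimizer the same freedom; a sketch using the tables from the question:
-- EXISTS is a semi-join: each row of myTable is returned at most once,
-- no matter how many matching rows myOtherTable contains.
SELECT t.*
FROM [myTable] AS t
WHERE EXISTS (SELECT 1
              FROM [myOtherTable] AS o
              WHERE o.[otherColumn] = t.[myColumn]);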

Related

Why do nested select statements take longer to process than temporary tables?

Forgive me if this is a repeat and/or obvious question, but I can't find a satisfactory answer either on stackoverflow or elsewhere online.
Using Microsoft SQL Server, I have a nested select query that looks like this:
select *
into FinalTable
from
(select * from RawTable1 join RawTable2)
join
(select * from RawTable3 join RawTable4)
Instead of using nested selects, the query can be written using temporary tables, like this:
select *
into Temp1
from RawTable1 join RawTable2
select *
into Temp2
from RawTable3 join RawTable4
select *
into FinalTable
from Temp1 join Temp2
Although equivalent, the second (non-nested) query runs several orders of magnitude faster than the first (nested) query. This is true both on my development server and on a client's server. Why?
The database engine holds subqueries in memory at execution time. Since they are virtual rather than physical, the optimiser can't select the best route, or at least not until a sort appears in the plan. It also means the optimiser will be doing multiple full table scans on each operation rather than a possible index seek on a temporary table.
Consider each subquery to be a juggling ball. The more subqueries you give the db engine, the more things it's juggling at one time. If you simplify this into batches of code with a temp table, the optimiser finds a clear route, in most cases regardless of indexes too, at least on more recent versions of SQL Server.
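That is also where a temp table gives you extra control: once the intermediate result is materialized you can index it before the final join. A minimal sketch, assuming a join column named id (the question omits the real join keys):
-- Materialize the first intermediate result, then index the join column
-- so the final join can use an index seek instead of a full scan.
select r1.id, r1.col_a, r2.col_b   -- hypothetical column names
into Temp1
from RawTable1 r1 join RawTable2 r2 on r2.id = r1.id;

create nonclustered index IX_Temp1_id on Temp1 (id);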

Why does my query in oracle take longer when I use temporary tables?

I am facing a peculiar issue when using an inner query in Oracle DB. I am fetching data from a table that has a huge number of records, and the query I am using contains an inner query.
When I provide the values directly in the inner query, it is much faster.
But when I take exactly the same values from another (temporary) table, by either an inner query or a JOIN, it takes much longer.
Below is the query:
Faster performance
SELECT assembly_item_id menuItemId,
location_id restId,
bill_sequence_id,
bill_config_id
FROM zil_ibat_resolve_bmi_ai_max_v
WHERE assembly_item_id = 8321
AND location_id IN (82, 85, 116, .........)
Slower performance when a select query is used in the inner section
Without JOIN
SELECT assembly_item_id menuItemId,
location_id restId,
bill_sequence_id,
bill_config_id
FROM zil_ibat_resolve_bmi_ai_max_v
WHERE assembly_item_id = 8321
AND location_id IN (SELECT temp_id FROM global_temp_ids)
With JOIN
SELECT assembly_item_id menuItemId, location_id restId, bill_sequence_id, bill_config_id
FROM zil_ibat_resolve_bmi_ai_max_v t1
join global_temp_ids t2
on t1.location_id = t2.temp_id
WHERE t1.assembly_item_id = 8321
Note: zil_ibat_resolve_bmi_ai_max_v is a view.
What is wrong with this query? Why does it take so much time when I query the table instead of putting the IDs directly in the inner section? Is there an alternative?
Explain Plan
[Screenshots in the original post: select query used in inner section; join; numbers entered in inner query]
The second and third queries are slow because of the NESTED LOOP join between the view results and the temporary table. Changing it to a HASH join, perhaps through better optimizer statistics or a USE_HASH hint, should speed up the query.
Problem
This part at the top of the execution plan:
NESTED LOOPS
zil_ibat_resolve_bmi_ai_max_v
global_temp_ids
is similar to this pseudo-code:
for each row of zil_ibat_resolve_bmi_ai_max_v
search index of global_temp_ids
Based on the images, the execution plan for the view does not change between queries, so that part must be relatively fast. And the look-up of the temporary table uses a unique index search, which must also be fast. But it is only fast to do once, and we can tell from the Cardinality of 1 that the Oracle optimizer thinks it will only execute the inner part of the join once.
NESTED LOOPs are great when joining a small number of rows. HASH JOINs work much better when joining a large number of rows.
Solutions
There are many ways to change the join method; here are the two to try first:
1. Gather statistics. Better optimizer statistics will improve the cardinality estimates, which will usually improve execution plans. There are many ways to gather stats but usually the default settings are the best. In this case they can be gathered by running a procedure like this: exec dbms_stats.gather_schema_stats('SMART'); Repeat that for the schemas ZILADMIN and XCBAIRAG. If the statistics were missing or stale it would also be a good idea to investigate why the default statistics gathering job did not run.
2. Hint. Hints should generally be avoided in production code, but they can at least be helpful for diagnosing the problem. Run the query with the hint SELECT /*+ USE_HASH(t1 t2) */ ... and see if that improves things. If that works you can either keep the hint or consider some other form of plan management; for example, a SQL Profile may solve this and other problems in a cleaner way. Check with other developers or DBAs to find out what types of plan management features are common in your system.
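Applied to the join form of the query above, the hint would look like this (a sketch; keep it only if the plan and timings actually improve):
SELECT /*+ USE_HASH(t1 t2) */
       assembly_item_id menuItemId,
       location_id restId,
       bill_sequence_id,
       bill_config_id
FROM zil_ibat_resolve_bmi_ai_max_v t1
join global_temp_ids t2
on t1.location_id = t2.temp_id
WHERE t1.assembly_item_id = 8321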

PostgreSQL query is slow when using NOT IN

I have a PostgreSQL function that returns a query result to the pgAdmin results grid REALLY FAST.
Internally, this is a simple function that uses a dblink to connect to another database and return a query result, so that I can simply run
SELECT * FROM get_customer_trans();
And it runs just like a basic table query.
The issue is when I use the NOT IN clause. So I want to run the following query, but it takes forever:
SELECT * FROM get_customer_trans()
WHERE user_email NOT IN
(SELECT do_not_email_address FROM do_not_email_tbl);
How can I speed this up? Anything faster than a NOT IN clause for this scenario?
get_customer_trans() is not a table - probably some stored function - so the query is not really trivial. You'd need to look at what this function really does to understand why it might be slow.
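For example, you can pull up the function's definition directly (a sketch, assuming the function name is not overloaded):
-- Print the full body of the function behind get_customer_trans().
SELECT pg_get_functiondef('get_customer_trans'::regproc);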
However, regardless of the function's behavior, adding the following index should help a lot:
CREATE INDEX do_not_email_tbl_idx1
ON do_not_email_tbl(do_not_email_address);
This index lets the NOT IN query return its answer quickly. However, NOT IN is known to have issues in older PostgreSQL versions - so make sure you are running at least PostgreSQL 9.1 or later.
UPDATE. Try to change your query to:
SELECT t.*
FROM get_customer_trans() AS t
WHERE NOT EXISTS (
SELECT 1
FROM do_not_email_tbl
WHERE do_not_email_address = t.user_email
LIMIT 1
)
This query does not use NOT IN and should be fast.
I think that in PostgreSQL 9.2 this query should work as fast as the one with NOT IN, though.
Just do it this way:
SELECT t1.* FROM get_customer_trans() as t1 left join do_not_email_tbl as t2
on t1.user_email = t2.do_not_email_address
where t2.do_not_email_address is null

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results: a limited, ranked list of values from a list based on a field's data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc...) the bottom query executes MUCH faster than the top one. Sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format that offers a cleaner approach (keeping the format as close to query 1 as possible)?
Note that the "Ranking" and secondary queries are just fluff for this example; the actual operations performed don't really matter.
This is sort of what I was hoping for (or something similar, but I hope the idea is clear). Remember, the query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow-up information!
If in your subquery you add
TOP 999999999 -- improves speed dramatically
Your query will behave much like the earlier version that used a temp table. I found the execution times improved in almost exactly the same way, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform the same way (you must use the static number style TOP 999999999).
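For reference, the first query from the question rewritten with this trick looks like the following sketch:
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT TOP 999999999 * -- the TOP forces this subquery to be evaluated separately
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered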
Explanation:
From what I can tell from the actual execution plans of the query in both formats (the original with normal CTEs and the one where each subquery has TOP 999999999):
The normal query joins everything together as if all the tables were in one massive query, which is what you'd expect. The filter criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the TOP 999999999 version, the actual execution plan clearly separates the subqueries from the main query in order to apply the TOP, forcing creation of an in-memory "Bitmap" of each subquery that is then joined to the main query. This appears to do exactly what I wanted, and it may even be more efficient, since servers with large amounts of RAM can execute the query entirely in memory without any disk IO. In my case we have 280 GB of RAM, far more than could ever really be used.
Not only can you use indexes on temp tables, but they also allow the use of statistics and hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it says specifically that you can't use hints.
Temp tables are often the most performant way to go when you have a large data set and the choice is between temp tables and table variables, even when you don't use indexes (possibly because the engine uses statistics to develop the plan), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ, to determine if it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a single query plan. In the second, a good overall plan can't be generated because you're inserting the values into a new temporary table partway through. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table as you already do, and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
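For example (a sketch using the temp table from the second query):
-- After SELECT ... INTO #LargeData, index the column the ranking sorts on.
CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
ON #LargeData (valuefield DESC);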
It is quite possible that SQL Server is optimizing for the wrong value of the parameter.
There are a couple of options
Try using OPTION (RECOMPILE). There is a cost, since it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try OPTION (OPTIMIZE FOR (@DT=SomeRepresentativeValue)). The problem with this is that you might pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog
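Applied to the first query from the question, the recompile hint attaches to the outer SELECT (a sketch):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * FROM mydata WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
OPTION (RECOMPILE); -- compile a fresh plan for the actual @DT value on each run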

SQL Where Clause Against View

I have a view (actually, it's a table valued function, but the observed behavior is the same in both) that inner joins and left outer joins several other tables. When I query this view with a where clause similar to
SELECT *
FROM [v_MyView]
WHERE [Name] like '%Doe, John%'
... the query is very slow, but if I do the following...
SELECT *
FROM [v_MyView]
WHERE [ID] in
(
SELECT [ID]
FROM [v_MyView]
WHERE [Name] like '%Doe, John%'
)
it is MUCH faster. The first query takes at least 2 minutes to return, if not longer, whereas the second returns in less than 5 seconds.
Any suggestions on how I can improve this? If I run the whole thing as one SQL statement (without the view) it is very fast as well. I believe this is because a view has to behave like a table: if a view has OUTER JOINs, GROUP BYs or TOP ##, the results could differ depending on whether the WHERE clause is applied before or after the view executes. My question is why SQL wouldn't optimize my first query into something as efficient as my second query.
EDIT
So, I was working on coming up with an example and was going to use the generally available AdventureWorks database as a backbone. While replicating my situation (which is really debugging a slow process that someone else developed, aren't they all?) I was unable to reproduce the same results. Looking further into the query I am debugging, I realized the issue might be related to the extensive use of user-defined scalar-valued functions. There is heavy use of a "GetDisplayName" function that, depending on the values you pass in, formats "lastname, firstname" or "firstname lastname", etc. If I simply omit that function and do the string formatting in the main query/TVF/view or whatever, performance is great. The execution plan gave me no clue to look at this as the issue, which is why I initially ignored it.
The scalar UDFs are very likely the issue. As soon as they go into your query you've got a RBAR (row-by-agonizing-row) execution plan. It's tolerable if they're only in the SELECT, but if they're being used in a WHERE or JOIN clause....
A pity, because they can be very useful, but they're performance killers in big SELECTs. I'd suggest rewriting either the UDFs to be table-valued or the query to avoid the UDFs, if at all possible.
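For example, GetDisplayName could become an inline table-valued function used with CROSS APPLY; the parameter and column names below are hypothetical, since the original function wasn't shown:
-- Inline TVF: the optimizer can fold this into the calling query,
-- unlike a scalar UDF, which runs once per row.
CREATE FUNCTION dbo.GetDisplayName_TVF
(@FirstName nvarchar(100), @LastName nvarchar(100), @Format int)
RETURNS TABLE
AS RETURN
SELECT CASE @Format
WHEN 1 THEN @LastName + ', ' + @FirstName
ELSE @FirstName + ' ' + @LastName
END AS DisplayName;
GO
-- Usage (hypothetical Person table):
SELECT p.*, d.DisplayName
FROM [Person] AS p
CROSS APPLY dbo.GetDisplayName_TVF(p.FirstName, p.LastName, 1) AS d;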
Though I'm no SQL guru, it is most probably because in the second query the inner select fetches only one column, which makes it faster, and the ID column appears to be a key and therefore indexed. That could be why it is faster the second way.
First Query:
SELECT * FROM [v_MyView] WHERE [Name] like '%Doe, John%'
Second query:
SELECT * FROM [v_MyView] WHERE [ID] in
(SELECT [ID] FROM [v_MyView] WHERE [Name] like '%Doe, John%')