I have a query that creates several temporary tables and then inserts data into them. From what I understand this is a potential cause of Table Spool. When I look at my execution plan the bulk of my processing is spent on Table Spool. Are there any good techniques for improving these types of performance problems? Would using a view or a CTE offer me any benefits over the temp tables?
I also noticed that when I mouse over each table spool the output list is from the same temporary table.
Well, with the information you gave I can tell only that: the query optimizer has chosen the best possible plan. It uses table spools to speed up execution. Alternatives that don't use table spools would be even slower.
How about showing the query, the table(s) schema and cardinality, and the plan.
Update
I certainly understand if you cannot show us the query. But is really hard to guess why the spooling is preffered by the engine whithout knowing any specifics. I recommend you go over Craig Freedman's blog, he is an engineer in the query optimizer team and has explained a lot of the inner workings of SQL 2005/2008 optimizer. Here are some entries I could quickly find that touch the topic of spooling in one form or another:
Recursive CTEs
Ranking Functions: RANK, DENSE_RANK, and NTILE
More on TOP
SQL customer support team also has an interesting blog at http://blogs.msdn.com/psssql/
And 'sqltips' (the relational engine's team blog) has some tips, like Spool operators in query plan...
Related
Is there a way to view the query plan, then change it and set it for a particular query in Sybase DBMS. A query can have multiple query plans and the DBMS might be using one of them, which might be inefficient. Can I see it and then change it and set it, so that the DBMS starts using the new query plan for the given query
This article explains how to view an execution plan.
There is an entire chapter in the online docs called Controlling Optimization that explains how to influence the query plan. I won't repeat it all here, but in Summary you can control:
The order of tables in a
join
The number of tables evaluated at one time during join
optimization
The index used for a table
access
The I/O size
The cache strategy
The degree of parallelism
One of my jobs it to maintain our database, usually we have troubles with lack of performance while getting reports and working whit that base.
When I start looking at queries which our ERP sending to database I see a lot of totally needlessly subselect queries inside main queries.
As I am not member of developers which is creator of program we using, they do not like much when I criticize they code and job. Let say they do not taking my review as serious statements.
So I asking you few questions about subselect in SQL
Does subselect is taking a lot of more time then left outer joins?
Does exists any blog, article or anything where I subselect is recommended not to use ?
How I can prove that if we avoid subselesct in query that query is going to be faster ?
Our database server is MSSQL2005
"Show, Don't Tell" - Examine and compare the query plans of the queries identified using SQL Profiler. Particularly look out for table scans and bookmark lookups (you want to see index seeks as often as possible). The 'goodness of fit' of query plans depends on up-to-date statistics, what indexes are defined, the holistic query workload.
Execution Plan Basics
Understanding More Complex Query Plans
Using SQL Server Profiler (2005 Version)
Run the queries in SQL Server Management Studio (SSMS) and turn on Query->Include Actual Execution Plan (CTRL+M)
Think yourself lucky they're only subselects (which in some cases the optimiser will produce equivalent 'join plans') and not correlated sub-queries!
Identify a query that is performing a high number of logical reads, re-write it using your preferred technique and then show how few logicals reads it does by comparison.
Here's a tip. To get the total number of logical reads performed, wrap a query in question with:
SET STATISTICS IO ON
GO
-- Run your query here
SET STATISTICS IO OFF
GO
Run your query, and switch to the messages tab in the results pane.
If you are interested in learning more, there is no better book than SQL Server 2008 Query Performance Tuning Distilled, which covers the essential techniques for monitoring, interpreting and fixing performance issues.
One thing you can do is to load SQL Profiler and show them the cost (in terms of CPU cycles, reads and writes) of the sub-queries. It's tough to argue with cold, hard statistics.
I would also check the query plan for these queries to make sure appropriate indexes are being used, and table/index scans are being held to a minimum.
In general, I wouldn't say sub-queries are bad, if used correctly and the appropriate indexes are in place.
I'm not very familiar with MSSQL, as we are using postrgesql in most of our applications. However there should exist something like "EXPLAIN" which shows you the execution plan for the query. There you should be able to see the various steps that a query will produce in order to retrieve the needed data.
If you see there a lot of table scans or loop join without any index usage it is definitely a hint for a slow query execution. With such a tool you should be able to compare the two queries (one with the join, the other without)
It is difficult to state which is the better ways, because it really highly depends on the indexes the optimizer can take in the various cases and depending on the DBMS the optimizer may be able to implicitly rewrite a subquery-query into a join-query and execute it.
If you really want to show which is better you have to execute both and measure the time, cpu-usage and so on.
UPDATE:
Probably it is this one for MSSQL -->QueryPlan
From my own experience both methods can be valid, as for example an EXISTS subselect can avoid a lot of treatment with an early break.
Buts most of the time queries with a lot of subselect are done by devs which do not really understand SQL and use their classic-procedural-programmer way of thinking on queries. Then they don't even think about joins, and makes some awfull queries. So I prefer joins, and I always check subqueries. To be completly honnest I track slow queries, and my first try on slow queries containing subselects is trying to do joins. Works a lot of time.
But there's no rules which can establish that subselect are bad or slower than joins, it's just that bad sql programmer often do subselects :-)
Does subselect is taking a lot of more time then left outer joins?
This depends on the subselect and left outer joins.
Generally, this construct:
SELECT *
FROM mytable
WHERE mycol NOT IN
(
SELECT othercol
FROM othertable
)
is more efficient than this:
SELECT m.*
FROM mytable m
LEFT JOIN
othertable o
ON o.othercol = m.mycol
WHERE o.othercol IS NULL
See here:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
Does exists any blog, article or anything where subselect is recommended not to use ?
I would steer clear of the blogs which blindly recommend to avoid subselects.
They are implemented for a reason and, believe it or not, the developers have put some effort into optimizing them.
How I can prove that if we avoid subselesct in query that query is going to be faster ?
Write a query without the subselects which runs faster.
If you post your query here we possibly will be able to improve it. However, a version with the subselects may turn out to be faster.
Try rewriting some of the queries to elminate the sub-select and compare runtimes.
Share and enjoy.
Suddenly (but unfortunately I don't know when "suddenly" was; I know it ran fine at some point in the past) one of my queries started taking 7+ seconds instead of milliseconds to execute. I have 1 local table and 3 tables being accessed via a DB link. The 3 remote tables are joined together, and one of them is joined with my local table.
The local table's where clause only takes a few millis to execute on its own, and only returns a few (10's or 100's at the most) records. The 3 remote tables have many hundreds of thousands, possibly millions, of records between them, and if I join them appropriately I get tens or hundreds of thousands of records.
I am only joining with the remote tables so that I can pull out a few pieces of data related to each record in my local table.
What appears to be happening, however, is that Oracle joins the remote tables together first and then my local table to that mess at the end. This is always going to be a bad idea, especially given the data set that exists right now, so I added a /*+ LEADING(local_tab remote_tab_1) */ hint to my query and it now returns in milliseconds.
I compared the explain plans and they are almost identical, save for a single BUFFER SORT on one of the remote tables.
I'm wondering what might cause Oracle to approach this the wrong way? Is it an index issue? What should I be looking for?
When choosing an execution plan, oracle estimates costs for the different plans. One crucial information for that estimate is the amount of rows will get returned from a step of the execution plan. Oracle tries to estimate those using 'statistics', i.e. information about how many rows a table contains, how many different values a column contains; How evenly these values are distributed.
These statistics are just that statistics, and they might be wrong, which is one of the most important reasons for misjudgments of the oracle optimizer.
So gathering new statistics as described in a comment might help. Have a look at the documentation on that dbms_stats package. There are many different ways to call that package.
A common problem I've come across is a query that joins many tables, where the joins form a chain from one end to another, e.g.:
SELECT *
FROM tableA, tableB, tableC, tableD, tableE
WHERE tableA.ID0 = :bind1
AND tableA.ID1 = tableB.ID1
AND tableB.ID2 = tableC.ID2
AND tableC.ID3 = tableD.ID3
AND tableD.ID4 = tableE.ID4
AND tableE.ID5 = :bind2;
Notice how the optimiser might choose to drive the query from tableA (e.g. if the index on ID0 is nicely selective) or from tableE (if the index on tableE.ID5 is more selective).
The statistics on the tables might cause the choice between these two plans to balance on a knife-edge; one day it's working fine (driving from tableA), next day new stats are gathered and all of a sudden the alternative plan driving from tableE has a lower cost and is chosen.
In this circumstance, adding a LEADING hint is one way to nudge it back to the original plan (i.e. drive from tableA) without dictating too much to the optimiser (i.e. it doesn't force the optimiser to choose any particular join methods).
You're doing distributed query optimization, and that's a tricky beast. It could be that the your table's statistics are current, but now the tables at the remote system are out-of-whack or have changed. Or the remote system added/removed/modified indexes, and that broke your plan. (This is an excellent reason to consider replication -- so you can control indexes and statistics against it.)
That said, Oracle's estimate of cardinality is a primary driver in execution plan. A 10053 trace analysis (Jonathan Lewis' Cost-Based Oracle Fundamentals book has wonderful examples from 8i to 10.1) can help shed light on why your statement's now broken and how the LEADING hint fixes it.
The DRIVING_SITE hint might be a better choice if you know you always want the local tables to be joined first before going after the remote site; it clarifies your intention without driving the plan the way a LEADING hint would.
Might not be relevant but I had a similar situation once where the remote table had been replaced by a single-table view. When it was a table the distributed query optimizer 'saw' that it had an index. When it became a view it couldn't see the index anymore and couldn't cost a plan that used an index on the remote object.
That was a few years ago. I documented my analysis at the time here.
RI,
It's hard to be sure about the cause of the performance problems without seeing the SQL.
When an Oracle query was performing well before, and suddenly starts performing badly, it is usually related to one of two issues:
A) Statistics are out of date. This is the easiest and quickest thing to check, even if you have a housekeeping batch process that's supposed to take care of it ... always double-check.
B) Data volume / data pattern change.
In your case, running a distributed query across multiple databases makes it 10x harder for Oracle to manage performance between them. Is it possible to put these tables in one database, perhaps separate schema owners in one database?
Hints are notoriously fragile, as Oracle is under no obligations to follow the hint. When the data volume or pattern changes some more, Oracle may just ignore the hint and do what it thinks is best (ie. worst ;-).
If you cannot put these tables all in one database, then I recommend you look to break your query up into two statements:
INSERT on sub-SELECT to copy external data to a global temporary table in your current database.
SELECT from the global temporary table to join with your other table.
You will have complete control over performance of step 1 above without resorting to hints. This approach typically scales well, providing you take time to do the performance tuning. I've seen this approach solve many complex performance problems.
The overhead for Oracle to create a whole new table, or insert a heap of records, is much smaller than most people expect. Defining a global temporary table further reduces that overhead.
Matthew
How does one performance tune a SQL Query?
What tricks/tools/concepts can be used to change the performance of a SQL Query?
How can the benefits be Quantified?
What does one need to be careful of?
What tricks/tools/concepts can be used to change the performance of a SQL Query?
Using Indexes? How do they work in practice?
Normalised vs Denormalised Data? What are the performance vs design/maintenance trade offs?
Pre-processed intermediate tables? Created with triggers or batch jobs?
Restructure the query to use Temp Tables, Sub Queries, etc?
Separate complex queries into multiples and UNION the results?
Anything else?
How can performance be Quantified?
Reads?
CPU Time?
"% Query Cost" when different versions run together?
Anything else?
What does one need to be careful of?
Time to generate Execution Plans? (Stored Procs vs Inline Queries)
Stored Procs being forced to recompile
Testing on small data sets (Do the queries scale linearly, or square law, etc?)
Results of previous runs being cached
Optimising "normal case", but harming "worst case"
What is "Parameter Sniffing"?
Anything else?
Note to moderators:
This is a huge question, should I have split it up in to multiple questions?
Note To Responders:
Because this is a huge question please reference other questions/answers/articles rather than writing lengthy explanations.
I really like the book "Professional SQL Server 2005 Performance Tuning" to answer this. It's Wiley/Wrox, and no, I'm not an author, heh. But it explains a lot of the things you ask for here, plus hardware issues.
But yes, this question is way, way beyond the scope of something that can be answered in a comment box like this one.
Writing sargable queries is one of the things needed, if you don't write sargable queries then the optimizer can't take advantage of the indexes. Here is one example Only In A Database Can You Get 1000% + Improvement By Changing A Few Lines Of Code this query went from over 24 hours to 36 seconds
Of course you also need to know the difference between these 3 join
loop join,
hash join,
merge join
see here: http://msdn.microsoft.com/en-us/library/ms173815.aspx
Here some basic steps that need to follow:
Define business requirements first
SELECT fields instead of using SELECT *
Avoid SELECT DISTINCT
Create joins with INNER JOIN (not WHERE)
Use WHERE instead of HAVING to define filters
Proper indexing
Here are some basic steps which we can follow to increase the performance:
Check for indexes in pk and fk for the tables involved if it is still taking time index the columns present in the query.
All indexes are modified after every operation so kindly do not index each and every column
Before batch insertion delete the indexes and then recreate the indexes.
Select sparingly
Use if exists instead of count
Before accusing dba first check network connections
What techniques can be applied effectively to improve the performance of SQL queries? Are there any general rules that apply?
Use primary keys
Avoid select *
Be as specific as you can when building your conditional statements
De-normalisation can often be more efficient
Table variables and temporary tables (where available) will often be better than using a large source table
Partitioned views
Employ indices and constraints
Learn what's really going on under the hood - you should be able to understand the following concepts in detail:
Indexes (not just what they are but actually how they work).
Clustered indexes vs heap allocated tables.
Text and binary lookups and when they can be in-lined.
Fill factor.
How records are ghosted for update/delete.
When page splits happen and why.
Statistics, and how they effect various query speeds.
The query planner, and how it works for your specific database (for instance on some systems "select *" is slow, on modern MS-Sql DBs the planner can handle it).
The biggest thing you can do is to look for table scans in sql server query analyzer (make sure you turn on "show execution plan"). Otherwise there are a myriad of articles at MSDN and elsewhere that will give good advice.
As an aside, when I started learning to optimize queries I ran sql server query profiler against a trace, looked at the generated SQL, and tried to figure out why that was an improvement. Query profiler is far from optimal, but it's a decent start.
There are a couple of things you can look at to optimize your query performance.
Ensure that you just have the minimum of data. Make sure you select only the columns you need. Reduce field sizes to a minimum.
Consider de-normalising your database to reduce joins
Avoid loops (i.e. fetch cursors), stick to set operations.
Implement the query as a stored procedure as this is pre-compiled and will execute faster.
Make sure that you have the correct indexes set up. If your database is used mostly for searching then consider more indexes.
Use the execution plan to see how the processing is done. What you want to avoid is a table scan as this is costly.
Make sure that the Auto Statistics is set to on. SQL needs this to help decide the optimal execution. See Mike Gunderloy's great post for more info. Basics of Statistics in SQL Server 2005
Make sure your indexes are not fragmented. Reducing SQL Server Index Fragmentation
Make sure your tables are not fragmented. How to Detect Table Fragmentation in SQL Server 2000 and 2005
Use a with statment to handle query filtering.
Limit each subquery to the minimum number of rows possible.
then join the subqueries.
WITH
master AS
(
SELECT SSN, FIRST_NAME, LAST_NAME
FROM MASTER_SSN
WHERE STATE = 'PA' AND
GENDER = 'M'
),
taxReturns AS
(
SELECT SSN, RETURN_ID, GROSS_PAY
FROM MASTER_RETURNS
WHERE YEAR < 2003 AND
YEAR > 2000
)
SELECT *
FROM master,
taxReturns
WHERE master.ssn = taxReturns.ssn
A subqueries within a with statement may end up as being the same as inline views,
or automatically generated temp tables. I find in the work I do, retail data, that about 70-80% of the time, there is a performance benefit.
100% of the time, there is a maintenance benefit.
I think using SQL query analyzer would be a good start.
In Oracle you can look at the explain plan to compare variations on your query
Make sure that you have the right indexes on the table. if you frequently use a column as a way to order or limit your dataset an index can make a big difference. I saw in a recent article that select distinct can really slow down a query, especially if you have no index.
The obvious optimization for SELECT queries is ensuring you have indexes on columns used for joins or in WHERE clauses.
Since adding indexes can slow down data writes you do need to monitor performance to ensure you don't kill the DB's write performance, but that's where using a good query analysis tool can help you balanace things accordingly.
Indexes
Statistics
on microsoft stack, Database Engine Tuning Advisor
Some other points (Mine are based on SQL server, since each db backend has it's own implementations they may or may not hold true for all databases):
Avoid correlated subqueries in the select part of a statement, they are essentially cursors.
Design your tables to use the correct datatypes to avoid having to apply functions on them to get the data out. It is far harder to do date math when you store your data as varchar for instance.
If you find that you are frequently doing joins that have functions in them, then you need to think about redesigning your tables.
If your WHERE or JOIN conditions include OR statements (which are slower) you may get better speed using a UNION statement.
UNION ALL is faster than UNION if (And only if) the two statments are mutually exclusive and return the same results either way.
NOT EXISTS is usually faster than NOT IN or using a left join with a WHERE clause of ID = null
In an UPDATE query add a WHERE condition to make sure you are not updating values that are already equal. The difference between updating 10,000,000 records and 4 can be quite significant!
Consider pre-calculating some values if you will be querying them frequently or for large reports. A sum of the values in an order only needs to be done when the order is made or adjusted, rather than when you are summarizing the results of 10,000,000 million orders in a report. Pre-calculations should be done in triggers so that they are always up-to-date is the underlying data changes. And it doesn't have to be just numbers either, we havea calculated field that concatenates names that we use in reports.
Be wary of scalar UDFs, they can be slower than putting the code in line.
Temp table tend to be faster for large data set and table variables faster for small ones. In addition you can index temp tables.
Formatting is usually faster in the user interface than in SQL.
Do not return more data than you actually need.
This one seems obvious but you would not believe how often I end up fixing this. Do not join to tables that you are not using to filter the records or actually calling one of the fields in the select part of the statement. Unnecessary joins can be very expensive.
It is an very bad idea to create views that call other views that call other views. You may find you are joining to the same table 6 times when you only need to once and creating 100,000,00 records in an underlying view in order to get the 6 that are in your final result.
In designing a database, think about reporting not just the user interface to enter data. Data is useless if it is not used, so think about how it will be used after it is in the database and how that data will be maintained or audited. That will often change the design. (This is one reason why it is a poor idea to let an ORM design your tables, it is only thinking about one use case for the data.) The most complex queries affecting the most data are in reporting, so designing changes to help reporting can speed up queries (and simplify them) considerably.
Database-specific implementations of features can be faster than using standard SQL (That's one of the ways they sell their product), so get to know your database features and find out which are faster.
And because it can't be said too often, use indexes correctly, not too many or too few. And make your WHERE clauses sargable (Able to use indexes).