Question on how to read a SQL Execution plan - sql

I have executed a query and included the Actual Execution Plan. There is one Hash Match that is of interest to me because it's subtree uses a Index Scan instead of an index seek. When I mouse over this Hash Match there is a section called "Probe Residual". I had assumed that this is whatever values I am joining on. Am I correct here or is there a better explanation of what that means?
The second question I had is regarding the indexes it uses. In my example I am pretty sure this particular join is joining on two columns. The index that it is Scanning has both of these columns in it as well as another column that is not used in the join. I was under the impression that this would result in an Index Seek rather than a Scan. Am I mistaken on this?

A Hash Join will generally (always?) use a scan or at least a range scan. A hash join works by scanning both left and right join tables (or a range in the tables) and building an in-memory hash table that contains all values 'seen' by the scans.
What happened in your case is this: the QO noticed that it can obtain all the values of a column C from a non-clustered index that happens to contain this column (as a key or as an included column). Being a non-clustered index is probably fairly narrow, so the total amount of IO to scan the entire non-clustered index is not exaggerate. The QO also considered that the system has enough RAM to store a hash table in memory. When compared the cost of this query (a scan of a non-clustered index end-to-end for, say, 10000 pages) with the cost of a nested loop that used seeks (say 5000 probes at 2-3 pages each) the scan won as requiring less IO. Of course, is largely speculation on my part, but I'm trying to present the case from the QO point of view, and the plan is likely optimal.
Factors that contributed to this particular plan choice would be:
a large number of estimated candidates on the right side of the join
availability of the join column in a narrow non-clustered index for the left side
plenty of RAM
For a large estimate of the number of candidates, a better choice than the hash join is only the merge-join, and that one requires the input to be presorted. If both the left side can offer an access path that guarantees an order on the joined column and the right side has a similar possibility then you may end up with the merge join, which is the fastest join.

This blog post will probably answer your first question.
As for your second, index scans might be selected by the optimizer in a number of situations. Off the top of my head:
If the index is very small
If most of the rows in the index will be selected by the query
If you are using functions in the where clause of your query
For the first two cases, it's more efficient to do a scan, so the optimizer chooses it over a seek. For the third case, the optimizer has no choice.

1/ A Hash Match means that it takes a hash of columns used in an equality join, but needs to include all the other columns involved in the join (for >, etc) so that they can be checked too. This is where residual columns come in.
2/ An Index Seek can be done if it can go straight to the rows you want. Perhaps you're applying a calculation to the columns and using that? Then it will use the index as a smaller version of the data, but will still need to check every row (applying the calculation on each one).

Check out those excellent articles on execution plans on simple-talk.com:
Execution Plan Basics
SQL Server Execution Plans
Graphical Execution Plans for simple SQL queries
Understanding more complex execution plans
They also have a free e-book SQL Server execution plans for download.

Related

Optimizing a query that uses index seek based on wrong estimates

I am joining two tables called Zasilka and Kapitola. Each one has a clustered index and Kapitola has also non clustered index with a column with which I am joining.
The query uses index seek because it expects only 1 row to be returned.
The statistics on both tables are updated.
I have tried to disable the index, then it uses merge join but it has to first order about 40000 rows, which takes a lot of resources.
Index column is mostly ordered but there are some cases when not. I just try to think about what would be the best strategy to join these tables and avoid order or seek.
And I do not know why it does not use non clustered index to join using merge.
Exectuion plan
io statistics seek
io statistics merge
You are misreading the information in the showplan, I believe. The estimate is per execution of the subtree. It estimates it will return 1 row per subtree execution and that it will execute the subtree 71,000 times. (It doesn't estimate less than one). Due to the containment assumption, it believes it will find a row when seeking (assumption of the optimizer based on usual customer behavior). In actuality, you get 46,000 rows or so back. So, the optimizer is working as expected in this case.
In the future, please post the query text, schema, and whole plan shape. It is very hard to do more than guess when you take a screenshot with most of the plan shape covered up.

Oracle SQL when to use the full hint

I understand that the full hint:
/*+ FULL(alias) */
will force the optimizer to perform a full table scan.
Are there any scenarios where this would be more useful than using an index besides the obvious one where the query selects the majority of the rows in the table?
Very wide topic.
From Oracle documentation "When using hints, in some cases, you might need to specify a full set of hints in order to ensure the optimal execution plan. For example, if you have a very complex query, which consists of many table joins, and if you specify only the INDEX hint for a given table, then the optimizer needs to determine the remaining access paths to be used, as well as the corresponding join methods. Therefore, even though you gave the INDEX hint, the optimizer might not necessarily use that hint, because the optimizer might have determined that the requested index cannot be used due to the join methods and access paths selected by the optimizer."
It avoids indexes and go for a FTS.
This has exceptions, like all general concepts. A full table scan can be less expensive than an index scan followed by table access by rowid - sometimes much less expensive.
An excerpt from:
Index blocks need to be read too, in addition to the table's blocks. If the index is large (a sizable percentage of the table's size), the pressure on the SGA (and associates latches) might be noticeable
If the sort order of the index doesn't match the way the data is stored in the actual table, the number of logical I/Os necessary to fulfill the query could be (potentially much) more than the number of LIOs to do the full table scan. (The index's clustering factor is one of the indicators you can look at to estimate this.)
Essentially, if scanning via your index makes the data block I/O requests jump all over your table, the overall cost is going to be higher than if you could do large sequential reads. An FTS is more likely to use multi-block direct-path reads, bypassing the SGA entirely, which is also potentially good - no "cache thrashing", less latching.
If you have a covering index, chances are that's going to beat a full table scan all the time. If not, it's going to depend on what percentage of actual blocks (data + index) will need to be processed (the index's selectivity for that query), and how well they're "physically sorted" with respect to each-other.
As to why the optimizer is picking the "wrong" path for you here: hard to tell. Stale statistics on the index or table could be an issue like always, the estimate calculated base on the LIKE might be off for certain patterns, instance parameters could favor indexes a bit too much, ... If this is the only query that's misbehaving, and your stats are up to date, using a /*+ full */ hint doesn't sound too bad.
Also read Stack documentation

Avoiding full table scan

I have the following query which obtains transactions from the transactions table and transaction detail. Both tables have a big amount of entries, so this query takes a while to return results.
SELECT * FROM transactions t LEFT JOIN transac_detail tidts ON (tidts.id_transac = t.id);
However, I'm more worried because of the fact that Oracle does a full table scan on both tables, according to explain plan, even though t.id and tidts.id_transac have indexes.
Is there any way to optimize this without touching table structure?
I don't think it's true that the SQL neccessarily would be best served with full table scans. This would really be most obviously true where there is a foreign key between the two, but even then there could be exceptions.
I think the key question is this: "what proportion of the rows from each table are expected to be included in the result set?". If the answer is "100% from each" then you have a clear case for full table scans (and a hash join).
However consider the case where tables A and B are joined, with table A containing 5 rows with a foreign key to table B (the parent) containing a million rows. Obviously here you'd look for a full scan of A with a nested loop join to B (the join column on table B would be indexed because it would have to be a primary or unique key).
In the OP's case though it looks like you'd expect 100% rows to be returned from each table. I'd expect to see a full scan of both tables, and a hash join with TRANSACTIONS(probably the smaller table) being accessed first and built into the hash table. This would be the optimum join method, and I'd just be looking out for a situation where TRANSACTIONS is too big for a single-pass hash join. If the join spills to disk then that could be a performance worry and you'd have to look at increasing the memory allocation or equi-partitioning the two tables to reduce the memory requirement.
Since the query as given just returns everything anyway, full table scan may be actually the fastest way of arriving at the final result. Since I/O is so expensive compared to CPU time, it may be more efficient to just pull everything into memory and making the final join there, than to make an index seek in a loop over one table.
To determine whether the query could actually run any faster, you can try the following approaches:
look at query plan on only a subset of the data (for example, a range of id's)
try the query on subsets of varying sizes, see what kind of curve you're plotting here
You don't have a WHERE clause, therefore Oracle is thinking that since it has to return all records from both tables a full table scan will be most efficient.
If you would add a WHERE clause that uses the index, I think you will find that EXPLAIN PLAN will no longer use a full table scan.
Full table scans are not necessarily bad - it appears that you're not limiting the result set in any way and this might be the most efficient way to execute the query. You can always verify this by using an index hint and determining the resulting performance change.

What should I do to get an Clustered Index Seek instead of Clustered Index Scan?

I've got a Stored Procedure in SQL Server 2005 and when I run it and I look at its Execution Plan I notice it's doing a Clustered Index Scan, and this is costing it the 84%. I've read that I've got to modify some things to get a Clustered Index Seek there, but I don't know what to modify.
I'll appreciate any help with this.
Thanks,
Brian
W/o any detail is hard to guess what the problem is, and even whether is a problem at all. The choice of a scan instead of a seek could be driven by many factors:
The query expresses a result set that covers the entire table. Ie. the query is a simple SELECT * FROM <table>. This is a trivial case that would be perfectly covered by a clustred index scan with no need to consider anything else.
The optimizer has no alternatives:
the query expresses a subset of the entire table, but the filtering predicate is on columns that are not part of the clustered key and there are no non-clustred indexes on those columns either. These is no alternate plan other than a full scan.
The query has filtering predicates on columns in the clustred index key, but they are not SARGable. The filtering predicate usually needs to be rewritten to make it SARGable, the proper rewrite depends from case to case. A more subtle problem can appear due to implicit conversion rules, eg. the filtering predicate is WHERE column = #value but column is VARCHAR (Ascii) and #value is NVARCHAR (Unicode).
The query has SARGale filtering predicates on columns in the clustered key, but the leftmost column is not filtered. Ie. clustred index is on columns (foo, bar) but the WHERE clause is on bar alone.
The optimizer chooses a scan.
When the alternative is a non-clustered index then scan (or range seek) but the choice is a to use the clustered index the cause can be usually tracked down to the index tipping point due to lack of non-clustered index coverage for the query projection. Note that this is not your question, since you expect a clustered index seek, not a non-clustred index seek (assumming the question is 100% accurate and documented...)
Cardinality estimates. The query cost estimate is based on the clustered index key(s) statistics which provide an estimate of the cardinality of the result (ie. how many rows will match). On a simple query This cannot happen, as any estimate for a seek or range seek will be lower than the one for a scan, no matter how off the statistics are, but on a complex query, with joins and filters on multiple tables, things are more complex and the plan may include a scan where a seek was expected because the query optimizer may choose plan on which the join evaluation order is reversed to what the observer expects. The reverse order choice may e correct (most times) or may be problematic (usually due to statistics being obsolete or to parameter sniffing).
An ordering guarantee. A scan will produce results in a guaranteed order and elements higher on the execution tree may benefit from this order (eg. a sort or spool may be eliminated, or a merge join can be used instead of hash/nested joins). Overall the query cost is better as a result of choosing an apparently slower access path.
These are some quick pointers why a clustered index scan may be present when a clustered index seek is expected. The question is extremly generic and is impossible to give an answer 'why', other than relying on an 8 ball. Now if I take your question to be properly documented and correctly articulated, then to expect a clustered index seek it means you are searching an unique record based on a clustred key value. In this case the problem has to be with the SARGability of the WHERE clause.
If the Query incldues more than a certain percentage of the rows in the table, the optimizer will elect to do a scan instead of a seek, because it predicts that it will require fewer disk IOs in that case (For a Seek, It needs one Disk IO per level in the index for each row it returns), whereas for a scan there is only one disk IO per row in the entire table.
So if there are, say 5 levels in the b-tree Index, then if the query will generate more than 20% of the rows in the table, it is cheaper to read the whole table than make 5 IOs for each of the 20% rows...
Can you narrow the output of the query a bit more, to reduce the number of rows returned by this step in the process? That would help it choose the seek over the scan.

Which is better: Bookmark/Key Lookup or Index Seek

I know that an index seek is better than an index scan, but which is preferable in SQL Server explain plans: Index seek or Key Lookup (Bookmark in SQL Server 2000)?
Please tell me they didn't change the name again for SQL Server 2008...
Index seek, every time.
Lookups are expensive, so this is covering indexes and especially the INCLUDE clause was added to make them better.
Saying that if you expect exactly one row, for example, a seek followed a lookup can be better than trying to cover a query. We rely on this to avoid yet another index in certain situations.
Edit: Simple talk article: Using Covering Indexes to Improve Query Performance
Edit, Aug 2012
Lookups happen per row which is why they scale badly. Eventually, the optimiser will choose a clustered index scan instead of a seek+lookup because it's more efficient than many lookups.
Key lookup is very similar to a clustered index seek (pre 2005 SP2 was named 'seek with lookup'). I think the only difference is that the Key Lookup may specify an additional PRE-FETCH argument instructing the execution engine to prefetch more keys in the cluster (ie. do a clustered index seek followed by scan).
Seeing a Key Lookup should not scare you. Is the normal operator used in Nested Loops, and Nested Loops is the run-of-the-mill join operator. If you want to improve a plan, try improving on the join and see if it can use a merge join instead (ie. both sides of join can provide rows on the same key order, fastest join) or a hash-join (have enough memory for the QO to consider a hash join, or reduce the cardinality by filtering rows before the join rather than after).
This SO question mentions that key lookups are something to avoid. An index seek is going to definitely be the better performing operation.