Which is better: Bookmark/Key Lookup or Index Seek - sql

I know that an index seek is better than an index scan, but which is preferable in SQL Server explain plans: Index seek or Key Lookup (Bookmark in SQL Server 2000)?
Please tell me they didn't change the name again for SQL Server 2008...

Index seek, every time.
Lookups are expensive, which is why covering indexes (and especially the INCLUDE clause) were added to make things better.
That said, if you expect exactly one row, for example, a seek followed by a lookup can be better than trying to cover the query. We rely on this to avoid yet another index in certain situations.
Edit: Simple Talk article: Using Covering Indexes to Improve Query Performance
Edit, Aug 2012
Lookups happen per row, which is why they scale badly. Eventually, the optimiser will choose a clustered index scan instead of a seek+lookup because it's more efficient than many lookups.
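For instance, a minimal sketch of a covering index with INCLUDE, using hypothetical table and column names: the query filters on one column and selects two others, so adding those two as included columns removes the lookup entirely.
-- Hypothetical names, for illustration only.
-- Target query: SELECT OrderDate, TotalDue FROM dbo.Orders WHERE CustomerId = @id
-- Without the INCLUDE, each matching row would need a key lookup to fetch
-- OrderDate and TotalDue from the clustered index.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON dbo.Orders (CustomerId)
INCLUDE (OrderDate, TotalDue);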

Key lookup is very similar to a clustered index seek (pre-2005 SP2 the operator was named 'seek with lookup'). I think the only difference is that the Key Lookup may specify an additional PREFETCH argument, instructing the execution engine to prefetch more keys in the cluster (i.e. do a clustered index seek followed by a scan).
Seeing a Key Lookup should not scare you. It is the normal operator used in a Nested Loops join, and Nested Loops is the run-of-the-mill join operator. If you want to improve a plan, try improving the join: see if it can use a merge join instead (i.e. both sides of the join can provide rows in the same key order; this is the fastest join) or a hash join (make sure there is enough memory for the QO to consider a hash join, or reduce the cardinality by filtering rows before the join rather than after).
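If you want to experiment with the physical join type, T-SQL lets you force it with a join hint; a sketch with hypothetical tables (use hints only for testing, since they override the optimizer's own choice):
-- Force a merge join between two hypothetical tables to compare plans:
SELECT o.OrderId, c.Name
FROM dbo.Orders AS o
INNER MERGE JOIN dbo.Customers AS c   -- or INNER HASH JOIN
    ON o.CustomerId = c.CustomerId;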

This SO question mentions that key lookups are something to avoid. An index seek is definitely going to be the better-performing operation.

Related

Why "IN " query tag is so costly in sql stored procedures?

How can I fix my performance issue? I have a SQL query with 'IN', and I suspect that 'IN' is causing the costly performance problem. Do I need an index for this query?
My SQL query:
SELECT [p].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [p]
WHERE ([p].[IsDeleted] = 0)
AND (([p].[ReferencedxyzType] = @__refxyzType_0)
AND [p].[ReferencedxxxId] IN ('42342','ffsdfd','5345345345'))
My solution (but I need your help for better advice): which one is correct, a clustered or a nonclustered index?
USE [xxx]
GO
CREATE NONCLUSTERED INDEX IX_NonClusteredIndexDemo_xxxId
ON [Common].[xxxReference](xxxId)
INCLUDE ([ID],[ReferencedxxxId])
WITH (DROP_EXISTING=ON, ONLINE=ON, FILLFACTOR=90)
GO
Second:
CREATE INDEX xxxReference_ReferencedxxxId_index
ON [Common].[xxxReference] (ReferencedxxxId)
Which one is correct, or do you have a better solution?
The performance problem of this query is not the result of using the IN operator.
This operator performs very well with small lists (say, less than 1000 members).
The performance bottleneck here is the fact that SQL Server performs an index scan instead of an index seek (which is very costly), plus the key lookup, which is 20% of the query cost.
To avoid both problems, you can add an index on IsDeleted, ReferencedxyzType and ReferencedxxxId - probably in this exact order.
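A minimal sketch of that suggested index (the name is invented; adjust it to your conventions and verify the column order against your data):
CREATE NONCLUSTERED INDEX IX_xxxReference_IsDeleted_Type_Id
ON [Common].[xxxReference] (IsDeleted, ReferencedxyzType, ReferencedxxxId);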
SQL performance tuning is a science that tends to look a little like art or magic; either way you look at it, it requires good knowledge of both the theory and practice of indexing and of the relevant system's requirements.
Therefore, my suggestion is this: do not attempt to solve it yourself with the help of strangers on the internet. Get an expert for a consulting job for a couple of hours or days to analyze the system and help you fine-tune it.
Learn whatever you can during this process. Ask questions about everything that is not trivial. This will be money well spent.
A couple of things:
If you have a SELECT statement inside the IN, that should be avoided and replaced with an EXISTS clause. In your example above that is not relevant, as you have literal values inside the IN. Using EXISTS and NOT EXISTS instead of IN and NOT IN helps SQL Server avoid scanning each value of the column for every value inside the IN / NOT IN; instead, it can short-circuit the search once a match or non-match is found.
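As an illustration (the subquery and second table here are hypothetical, since your query uses literal values):
-- IN with a subquery:
SELECT [r].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [r]
WHERE [r].[ReferencedxxxId] IN (SELECT [s].[Id] FROM [Common].[SomeOtherTable] AS [s]);

-- Equivalent EXISTS form, which can short-circuit once a match is found:
SELECT [r].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [r]
WHERE EXISTS (SELECT 1 FROM [Common].[SomeOtherTable] AS [s]
              WHERE [s].[Id] = [r].[ReferencedxxxId]);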
Avoid implicit conversions. They degrade performance for several reasons, including: (i) SQL Server may not be able to find proper statistics on an index and hence cannot leverage it, instead falling back to a clustered index available on the table (which may not cover your query); (ii) the storage engine may not assign the proper amount of RAM during the memory-allocation phase of the query; (iii) cardinality estimation goes wrong, because SQL Server has no statistics on the computed value of that column, only on the column itself.
If you look at your execution plan posted above, you will see a yellow mark on the 'SELECT' operator. If you hover over it, you will see one or more warning messages. If your warning is related to implicit conversion, try to use proper datatypes in the comparison.
E.g., what is the datatype of the column [ReferencedxxxId]? If it is not NVARCHAR but rather VARCHAR, then I would suggest:
Make the values inside the IN VARCHAR (currently you are making them NVARCHAR). This way you will still be able to take full advantage of the rowstore index created on the [ReferencedxxxId] column.
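A sketch of the difference, assuming [ReferencedxxxId] is indeed VARCHAR:
-- The N prefix makes the literals NVARCHAR, which forces an implicit
-- conversion of the VARCHAR column and can turn the seek into a scan:
SELECT [p].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [p]
WHERE [p].[ReferencedxxxId] IN (N'42342', N'ffsdfd', N'5345345345');

-- Plain VARCHAR literals match the column's type, so the index stays seekable:
SELECT [p].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [p]
WHERE [p].[ReferencedxxxId] IN ('42342', 'ffsdfd', '5345345345');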
If you must have the values as NVARCHAR inside the IN clause, then you should:
CONVERT/CAST the column [ReferencedxxxId] in your IN clause. This gets rid of the implicit conversion, but you will no longer be able to take full advantage of the rowstore index on the [ReferencedxxxId] column.
To compensate, create a clustered or nonclustered columnstore index on the table covering the columns used in the query. This should significantly enhance the performance of your SELECT query.
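A minimal sketch of that columnstore option (the index name is invented; nonclustered columnstore indexes require SQL Server 2012 or later):
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_xxxReference
ON [Common].[xxxReference] (ReferencedxxxId, ReferencedxyzType, IsDeleted);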
If you decide to go the route of using a rowstore index by correcting the values inside the IN, you need to make sure that you create a clustered/nonclustered index which covers the query. That means the index covers the columns you are searching on ([ReferencedxxxId], [ReferencedxyzType], [IsDeleted]), with the columns used in the SELECT statement added under the INCLUDE clause (if it is a nonclustered index).
Also, when you are creating a composite rowstore index, try to order the columns in the index from high cardinality to low cardinality, left to right, to make the best use of that index.
Assuming an OLTP system rather than OLAP, my first pass would be an NC index. Given that IsDeleted is likely to have the least selectivity, I would place it last: ReferencedxyzType, ReferencedxxxId, IsDeleted.
In a higher-volume scenario I might even be tempted to move IsDeleted out of the index key and into an INCLUDE instead, since it provides so little selectivity to the index itself.
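A rough sketch of that variant (index name invented):
CREATE NONCLUSTERED INDEX IX_xxxReference_Type_Id
ON [Common].[xxxReference] (ReferencedxyzType, ReferencedxxxId)
INCLUDE (IsDeleted);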
There is clearly already a clustered index in place on the table (we can see it in the query plan), but we don't have the details of what is in it.
The question around clustered vs non-clustered is more complex and requires a lot more knowledge of the system and usage.

SQL indexes vs full table scan

While writing complex SQL queries, how do we ensure that we are using proper indexes and avoiding full table scans? I do it by making sure I only join on columns that have indexes (primary key, unique key, etc.). Is this enough?
Ask the database for the execution plan for your query, and proceed from there.
Don't forget to index the columns that appear in your where clause as well.
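In SQL Server, for example, you can ask for the estimated plan from T-SQL (the query below is hypothetical); in SSMS you can also press Ctrl+M to include the actual plan with execution:
-- SET SHOWPLAN_XML must be the only statement in its batch, hence the GO separators:
SET SHOWPLAN_XML ON;
GO
SELECT OrderId FROM dbo.Orders WHERE CustomerId = 42;  -- returns the plan, not rows
GO
SET SHOWPLAN_XML OFF;
GO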
Look at the execution plan of the query to see how the query optimizer thinks things must be retrieved. The plan is generally based on the statistics on the tables, the selectivity of the indices and the order of the joins. Note that the optimizer can decide that performing a full table scan is 'cheaper' than an index lookup.
Other things to look for:
avoid subqueries if possible
minimize the use of OR predicates in the WHERE clause
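To illustrate the OR point, one common rewrite (hypothetical table and columns) splits the predicate into a UNION so each branch can seek its own index:
-- OR across different columns often forces a scan:
SELECT OrderId FROM dbo.Orders
WHERE CustomerId = 42 OR SalespersonId = 7;

-- Rewritten so each branch can seek its own index (UNION removes duplicates):
SELECT OrderId FROM dbo.Orders WHERE CustomerId = 42
UNION
SELECT OrderId FROM dbo.Orders WHERE SalespersonId = 7;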
It is hard to say what the best indexing is, because there are different strategies depending on the situation. Still, there are a couple of things you should know about indexes:
Indexes SOMETIMES increase performance of SELECT statements and ALWAYS decrease performance of INSERT and UPDATE.
To index a table it is not necessary to make a key on a certain field. Also, real-life indexes almost always include several fields.
Don't create any indexes for "future purposes" if your performance is satisfactory, even if you don't have indexes at all.
Always analyze the execution plan when tuning indexes. Don't be afraid to experiment.
Also, a table scan is not always a bad thing.
That is all from me.
Use the Database Engine Tuning Advisor (SQL Server) to analyse your query. It will suggest indexes to add to tune your query's performance.

What should I do to get an Clustered Index Seek instead of Clustered Index Scan?

I've got a Stored Procedure in SQL Server 2005, and when I run it and look at its Execution Plan I notice it's doing a Clustered Index Scan, which accounts for 84% of the cost. I've read that I've got to modify some things to get a Clustered Index Seek there, but I don't know what to modify.
I'll appreciate any help with this.
Thanks,
Brian
Without any detail it is hard to guess what the problem is, or even whether there is a problem at all. The choice of a scan instead of a seek can be driven by many factors:
The query expresses a result set that covers the entire table, i.e. the query is a simple SELECT * FROM <table>. This is a trivial case that is perfectly covered by a clustered index scan, with no need to consider anything else.
The optimizer has no alternatives:
the query expresses a subset of the entire table, but the filtering predicate is on columns that are not part of the clustered key, and there are no non-clustered indexes on those columns either. There is no alternate plan other than a full scan.
The query has filtering predicates on columns in the clustered index key, but they are not SARGable. The filtering predicate usually needs to be rewritten to make it SARGable; the proper rewrite varies from case to case (see the sketch at the end of this answer). A more subtle problem can appear due to implicit conversion rules, e.g. the filtering predicate is WHERE column = @value, but column is VARCHAR (ASCII) and @value is NVARCHAR (Unicode).
The query has SARGable filtering predicates on columns in the clustered key, but the leftmost column is not filtered, i.e. the clustered index is on columns (foo, bar) but the WHERE clause is on bar alone.
The optimizer chooses a scan.
When the alternative is a non-clustered index seek (or range seek) but the optimizer chooses the clustered index instead, the cause can usually be tracked down to the index tipping point, due to lack of non-clustered index coverage for the query projection. Note that this is not your question, since you expect a clustered index seek, not a non-clustered index seek (assuming the question is 100% accurate and documented...).
Cardinality estimates. The query cost estimate is based on the clustered index key statistics, which provide an estimate of the cardinality of the result (i.e. how many rows will match). On a simple query this cannot happen, as any estimate for a seek or range seek will be lower than the one for a scan, no matter how far off the statistics are. But on a complex query, with joins and filters on multiple tables, things are more complicated, and the plan may include a scan where a seek was expected, because the query optimizer may choose a plan in which the join evaluation order is reversed from what the observer expects. The reversed order may be correct (most of the time) or may be problematic (usually due to obsolete statistics or parameter sniffing).
An ordering guarantee. A scan will produce results in a guaranteed order, and operators higher in the execution tree may benefit from this order (e.g. a sort or spool may be eliminated, or a merge join can be used instead of hash/nested loops joins). Overall the query cost is better as a result of choosing an apparently slower access path.
These are some quick pointers as to why a clustered index scan may be present when a clustered index seek is expected. The question is extremely generic, and it is impossible to answer the 'why' other than by relying on an 8-ball. Now, if I take your question to be properly documented and correctly articulated, then to expect a clustered index seek means you are searching for a unique record based on a clustered key value. In that case the problem has to be with the SARGability of the WHERE clause.
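To make the SARGability point concrete, a small sketch with hypothetical names:
-- Non-SARGable: the function wrapped around the column hides it from the
-- index, so the optimizer falls back to a scan:
SELECT OrderId FROM dbo.Orders WHERE YEAR(OrderDate) = 2009;

-- SARGable rewrite: the bare column can be matched against the index,
-- so a seek becomes possible:
SELECT OrderId
FROM dbo.Orders
WHERE OrderDate >= '20090101' AND OrderDate < '20100101';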
If the query includes more than a certain percentage of the rows in the table, the optimizer will elect to do a scan instead of a seek, because it predicts that it will require fewer disk IOs in that case (for a seek, it needs one disk IO per level in the index for each row it returns), whereas for a scan there is only one disk IO per row in the entire table.
So if there are, say, 5 levels in the B-tree index, then if the query will return more than 20% of the rows in the table, it is cheaper to read the whole table than to make 5 IOs for each of the 20% of rows...
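Plugging in some numbers to show that (admittedly rough) model at work:
-- Hypothetical: 1,000,000-row table, 5 levels in the B-tree index.
-- Seek cost for 20% of the rows: 200,000 rows x 5 IOs per row = 1,000,000 IOs
-- Scan cost (one IO per row, per the model above):             1,000,000 IOs
-- At roughly that 20% threshold the plans break even; beyond it, the scan wins.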
Can you narrow the output of the query a bit more, to reduce the number of rows returned by this step in the process? That would help it choose the seek over the scan.

Question on how to read a SQL Execution plan

I have executed a query and included the Actual Execution Plan. There is one Hash Match that is of interest to me because its subtree uses an Index Scan instead of an Index Seek. When I mouse over this Hash Match there is a section called "Probe Residual". I had assumed that this is whatever values I am joining on. Am I correct, or is there a better explanation of what that means?
The second question I had is regarding the indexes it uses. In my example I am pretty sure this particular join is on two columns. The index that it is scanning has both of these columns in it, as well as another column that is not used in the join. I was under the impression that this would result in an Index Seek rather than a Scan. Am I mistaken?
A hash join will generally (always?) use a scan, or at least a range scan. A hash join works by scanning both the left and right join tables (or a range in each table) and building an in-memory hash table containing all values 'seen' by the scans.
What happened in your case is this: the QO noticed that it can obtain all the values of a column C from a non-clustered index that happens to contain this column (as a key or as an included column). Being a non-clustered index, it is probably fairly narrow, so the total amount of IO to scan the entire non-clustered index is not excessive. The QO also considered that the system has enough RAM to store a hash table in memory. When it compared the cost of this query (a scan of a non-clustered index end-to-end for, say, 10,000 pages) with the cost of a nested loop that used seeks (say, 5,000 probes at 2-3 pages each), the scan won as requiring less IO. Of course, this is largely speculation on my part, but I'm trying to present the case from the QO's point of view, and the plan is likely optimal.
Factors that contributed to this particular plan choice would be:
a large number of estimated candidates on the right side of the join
availability of the join column in a narrow non-clustered index for the left side
plenty of RAM
For a large estimated number of candidates, the only better choice than the hash join is the merge join, and that one requires its input to be presorted. If the left side can offer an access path that guarantees an order on the joined column and the right side has a similar possibility, then you may end up with the merge join, which is the fastest join.
This blog post will probably answer your first question.
As for your second, index scans might be selected by the optimizer in a number of situations. Off the top of my head:
If the index is very small
If most of the rows in the index will be selected by the query
If you are using functions in the where clause of your query
For the first two cases, it's more efficient to do a scan, so the optimizer chooses it over a seek. For the third case, the optimizer has no choice.
1/ A Hash Match takes a hash of the columns used in an equality join, but it needs to carry all the other columns involved in the join (for >, etc.) so that they can be checked too. This is where residual columns come in.
2/ An Index Seek can be done if the engine can go straight to the rows you want. Perhaps you're applying a calculation to the columns and using that? Then it will use the index as a smaller version of the data, but will still need to check every row (applying the calculation to each one).
Check out those excellent articles on execution plans on simple-talk.com:
Execution Plan Basics
SQL Server Execution Plans
Graphical Execution Plans for simple SQL queries
Understanding more complex execution plans
They also have a free e-book SQL Server execution plans for download.

Better to use a Clustered index or a Non-Clustered index with included columns?

When I look at the execution plan for a particular query I find that 77% of my cost is in a Clustered Index seek.
Does the fact that I'm using a clustered index mean that I won't see performance issues due to the columns that I am outputting?
Would it be better for me to create a Non-Clustered version of this and include all of the columns that are being output?
UPDATE:
The clustered index uses a composite key. Not sure if this makes a difference.
The reason you use included columns on a non-clustered index is to avoid "bookmark lookups" into the clustered data. The thing is, if SQL Server could theoretically use a particular non-clustered index but the optimiser estimates there would be 'too many' bookmark lookups, that index will be ignored. However, if all selected columns are accessible directly from the index, there'll be no need for a bookmark lookup.
In your case the fact that you're accessing the data via "clustered index seek" is very promising. It will be very hard to improve on its performance. A non-clustered index including all selected columns may be slightly faster, but only because the raw data is a little less. (But don't forget the cost of increased insert/update time.)
However, you should check the detail...
If you're using a composite key and the seek is actually only on the leading part of the key, you might not be so lucky. You may find the seek only narrows down to 500,000 rows and then searches those based on the other criteria. In this case, experiment with some non-clustered indexes.
The clustered index seek itself may be fine; but if it is being done 100,000 times in your query because some other aspect is inefficiently returning too many rows - then you won't gain much by improving the performance of the clustered index seek.
Finally, to elaborate on davek's comment: "cost is relative". Just because the clustered index seek is 77% of your query cost doesn't mean there's a problem. It is possible to write a trivial one-table query that returns a single row with a clustered index seek costed at 100%. (But of course, being the only 'work' done, it will be 100% of the work... and 100% of instant is still instant.)
So: "Don't worry; be happy!"
You have a seek already, so benefits may be minimal.
You can try it though. Options:
2nd non-clustered index, keep original
move clustered index to another column(s)
Are there other good cluster candidates? Note that you always need a unique clustered index (because of uniquifiers). And of course, what is your PK please?
It depends on how many columns you are talking about. If only a couple, then yes, a non-clustered index with included columns will perform better; if you are selecting most of the columns, the clustered index is better.