Key Lookup using columns outside of the index in SQL Query

Key Lookup using columns outside of the index in SQL Query - sql-server-2012

I have a query as follows
SELECT ActivityId,
       AnotherId,
       PersonId,
       StartTime AS MyAlias
FROM Activity
WHERE DeletedStatus = 'Active'
  AND StartTime >= '2018-02-01'
  AND StartTime <= '2018-02-08'
The execution plan being used is here
Execution Plan
Index1 is defined as:
CREATE NONCLUSTERED INDEX Index1 ON Activity
(
StartTime
)
Index2 is defined as:
CREATE CLUSTERED INDEX Index2 ON Activity
(
EndTime,
StartTime
)
The optimiser is using an index seek on Index1 and then a key lookup, because ActivityId, AnotherId and PersonId are in the SELECT list but not in the index. This makes sense to me.
However, the following things puzzle me:
Why is the optimiser able to use Index1 to do an index seek when DeletedStatus is not in the index but is in the WHERE clause?
Why does the output list in Index1 include EndTime when that column is not present in Index1?
How is Index2 being used to output ActivityId,AnotherId,PersonId when none of those columns are in Index2?
Apologies, I have pseudo-anonymised the plan and the query so I hope I have done it correctly!

Why is the optimiser able to use Index1 to do an index seek when
DeletedStatus is not in the index but is in the WHERE clause?
The WHERE clause also includes StartTime, so a seek can be performed using the provided StartTime values, followed by a range scan. The key lookup applies the DeletedStatus = 'Active' predicate to filter the rows per the WHERE clause, since that column is not included in the index.
Why does the output list in Index1 include EndTime when that column is
not present in Index1?
All non-clustered indexes implicitly include the clustered index key as the row locator, similarly to explicitly included columns.
How is Index2 being used to output ActivityId,AnotherId,PersonId when
none of those columns are in Index2?
The clustered index leaf nodes are the actual data rows so all columns are available.
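One common way to avoid the key lookup discussed above is to make the non-clustered index cover the query. The following is a minimal sketch rather than anything taken from the question: the index name Index1_Covering is hypothetical, and whether the wider index is worth its extra storage and write cost depends on the workload.

```sql
-- Hypothetical covering index (the name is an assumption for illustration).
-- StartTime remains the seek key; the other columns the query touches are
-- stored at the leaf level through INCLUDE.
CREATE NONCLUSTERED INDEX Index1_Covering ON Activity
(
    StartTime
)
INCLUDE (DeletedStatus, ActivityId, AnotherId, PersonId);
```

EndTime does not need to be listed, because the clustered index key is already carried in every non-clustered index row. With an index like this, the DeletedStatus filter can be evaluated against the index itself instead of in a key lookup.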

Related

Why isn't SQL Server using my clustered index and doing a non-clustered index scan?

I have a patient table with a few columns, and a clustered index on column ID and a non-clustered index on column birth.
create clustered index CI_patient on dbo.patient (ID)
create nonclustered index NCI_patient on dbo.patient (birth)
Here are my queries:
select * from patient
select ID from patient
select birth from patient
Looking at the execution plan, the first query is a 'clustered index scan' (which is understandable because the table is a clustered table), and the third one is an 'index scan nonclustered' (which is also understandable because this column has a nonclustered index).
My question is why the second one is also an 'index scan nonclustered'. This column is supposed to have a clustered index, so shouldn't that be a clustered index scan? Any thoughts on this?

Basically, your second query wants to get all ID values from the table (no WHERE clause or anything).
SQL Server can do this two ways:
do a clustered index scan - basically a full table scan that reads all the data from all rows and extracts the ID from each row - this would work, but it loads the WHOLE table, page by page
do a scan across the non-clustered index, because each non-clustered index also includes the clustering column(s) at its leaf level. Since this index is much smaller than the full table, SQL Server needs to load fewer data pages and can therefore provide the answer - all ID values from all rows - faster than with a full table scan (clustered index scan)
The cost-based optimizer in SQL Server just picks the more efficient route to get the answer to the question you've asked with your second query.
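To check this reasoning on your own data, you can compare the size of the two indexes and the logical reads each access path needs. This is only a sketch: it assumes the table and index names from the question (dbo.patient, CI_patient, NCI_patient), and the numbers you see will depend entirely on your table.

```sql
-- Compare how many pages each index on dbo.patient occupies.
SELECT i.name                  AS index_name,
       i.type_desc             AS index_type,
       SUM(ps.used_page_count) AS used_pages
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID(N'dbo.patient')
GROUP BY i.name, i.type_desc;

-- Report logical reads per statement in the Messages output.
SET STATISTICS IO ON;

SELECT ID FROM dbo.patient;                          -- optimizer's own choice
SELECT ID FROM dbo.patient WITH (INDEX(CI_patient)); -- force the clustered index

SET STATISTICS IO OFF;
```

If the non-clustered index occupies noticeably fewer pages, the first SELECT should report fewer logical reads than the hinted one, which is the difference the optimizer is costing.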

Execution plan showing missing non-clustered index on already partitioned clustered indexes

We have a query where the table is partitioned on column Adate.
Row count: 56595943, partition scheme - yearly, no of partitions - 300
Clustered index columns : empid, Adate
Query :
select top 1 Adate
from emp
where empid = 134556 and Adate <= {ts '7485-09-01 00:00:00.0'}
order by Adate desc
The actual execution plan returns a clustered index seek operation with 93% of the total query cost on clustered index key.
But why is the optimizer recommending a missing index with 92% of cost?
missing index details: Improve query cost:92%
create nonclustered index IDX_NC on dbo.emp([empid], [Adate])
The missing index has an improvement measure of 14755268; as per Microsoft, the improvement measure baseline is 1,000,000.
Why is this happening? Do you recommend having a nonclustered index on columns that already make up the clustered index?

Well - consider this:
you do have the clustered index on (empid, adate)
the clustered index contains the whole data, e.g. the leaf level pages of the clustered index contain the whole data records (all the columns in your table)
If you are searching and the query uses the clustered index, it might still need to load much more data than is actually needed: the whole record, once for every row that matches your criteria.
If you have a non-clustered index on just (empid, Adate), and your query really only requires Adate (in its SELECT list of columns), then this index will be much smaller - it contains only those two columns (none of the overhead of all the other columns, which are not needed for your current query). So scanning this index, or loading these index pages, will load much less data compared to the clustered index.
From that point of view, yes, even having a nonclustered index on the same columns that make up your clustered index can be beneficial for certain query scenarios - that's probably what the SQL Server query optimizer picks up here.
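If you want to see where a suggestion like this comes from, the missing index DMVs can be queried directly. The sketch below computes an improvement measure with a formula that is commonly used with these views; it is an assumption that this is the same measure the question quotes, and the DMV contents are reset when the instance restarts.

```sql
-- List missing index suggestions with a commonly used improvement measure.
SELECT mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks,
       migs.avg_total_user_cost,
       migs.avg_user_impact,
       migs.avg_total_user_cost * (migs.avg_user_impact / 100.0)
           * (migs.user_seeks + migs.user_scans) AS improvement_measure
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
  ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details AS mid
  ON mid.index_handle = mig.index_handle
ORDER BY improvement_measure DESC;
```

The suggestions are generated per query and do not weigh the write and storage cost of an extra index, so treat them as candidates to test rather than as instructions.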

If I have a single nonclustered index on a table, will the number of columns I include change the slow down when writing to it?

On the exact same table, if I was to put one index on it, either:
CREATE INDEX ix_single ON MyTable (uid asc) include (columnone)
or:
CREATE INDEX ix_multi ON MyTable (uid asc) include (
columnone,
columntwo,
columnthree,
....
columnX
)
Would the second index cause an even greater lag on how long it takes to write to the table than the first one? And why?

Included columns need more disk space as well as more time on data manipulation, so the wider index will cost more on writes.
If there is also a clustered index on this table (ideally on an implicitly sorted column such as an IDENTITY column, to avoid fragmentation), it will serve as a fast lookup for all columns. Ideally create the clustered index before the other one, because adding it later forces the existing nonclustered indexes to be rebuilt.
Including columns in an index is a useful approach only when you have a serious performance problem to solve.

SQL Server multiple index order optimization

I have a table with a nonclustered index1 on ID1 and ID2, in that order.
Select count(distinct(id1)) from table
returns 1
and Select count(distinct(id2)) from table has all the values of the table.
The queries against that table use ... where id1= XX and id2 = XX
Could it make any performance improvement if I switch the order of the fields of index1 ?
I know it SHOULD be better but maybe: is it indifferent because id1 has only 1 value?

If I understand correctly, you are comparing these two statements:
where id1= XX and id2 = XX
where id2 = XX and id1= XX
Under most circumstances, this would use either an index on table(id1, id2) or table(id2, id1). The order of the comparisons in the where (or on) clauses has no impact on which indexes can be used.
Whether you should include a column that has only a single value in the unique index is a different matter. There is a minor performance effect to having a more complex index -- the tree structure has to store more bytes for each key. However, the query:
select count(distinct id2)
from table
where id1 = xx and idx = xx
will actually run faster with a composite index than with a singleton index table(id2). The reason is that the composite index can be used to entirely satisfy the query (in the jargon, it is a "covering index for the query"). The singleton index would need to look up the value of id1 in the table data, which requires extra processing.

The order in which you define the columns in your index matters. If column ID1 will only ever have one value, there is no point in putting it into the index key, unless you are using it to make a non-clustered index covering (a non-clustered index being one that is not the physical ordering of the table itself). In general, the first column defined in your index should be the column with the most distinct values that you need to search through. Visualize it this way: if you had a table of 1 million rows, and the first column in your index had only one (or a small number of) distinct values, would that index help you find the rows you want among the 1 million? Or would it be better to have ID2 first, which would be more efficient for the search and more frequently used? That is what you have to ask yourself. Below is more information on your question.
SQL Server Clustered Index - Order of Index Question
If you are using a non-clustered index, it may appear not to make a difference when the first column in your index holds the same value in every row. However, it does matter, because a non-clustered index is stored on a number of pages. The more entries you can store on a page, the faster the search. If you add a column that contributes nothing to the search, the same index has to span more pages, which means more pages to read and longer lookups. It also leaves less room to add new entries to an existing page when the index is updated during inserts, causing more page splits. So the decision to add a single-valued column to the index has side effects. If the original purpose of adding that column was to "cover" the values retrieved by common selects, you can use included columns in your index instead, which has the added benefit of not reordering your index while still acting as a covering index.
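As a rough illustration of the included-column approach described above, the index could be keyed on id2 alone with id1 carried as an included column. The table and index names here are placeholders, since the question does not give the real ones.

```sql
-- Hypothetical names: dbo.SomeTable and IX_SomeTable_id2 are placeholders.
-- id2 is the only key column, so the index is ordered by the selective value;
-- id1 is stored at the leaf level only, so it can still be read from the index.
CREATE NONCLUSTERED INDEX IX_SomeTable_id2
ON dbo.SomeTable (id2)
INCLUDE (id1);
```

A query filtering with where id1 = XX and id2 = XX could then seek on id2 and check id1 from the index row without visiting the base table.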

SQL Server Index Usage with an Order By

I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?

Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?

Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
    Id int IDENTITY,
    ReadTime datetime,
    -- ... other columns,
    CONSTRAINT PK_WorkFlow
        PRIMARY KEY CLUSTERED
        (
            Id
        )
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
    LastModifiedTime
)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, then the optimizer may conclude that the criterion > 1000 on a unique column is not selective, because > 99.997% of the Ids are > 1000 (if your identity seed started at 1). As a rule of thumb, for an index to be considered helpful, the optimizer must estimate that only a small fraction of the records (often quoted as < 5%) would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the non-clustered index that was scanned?
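The two suggestions above (refreshing statistics and forcing the index) could be tried along these lines. This is a sketch under assumptions: PK_WorkFlow is the constraint name used in the CREATE TABLE statement in this answer, and the actual primary key on your table may be named differently.

```sql
-- Refresh statistics so row estimates reflect the current data.
-- FULLSCAN reads every row, so it can take a while on 38M rows.
UPDATE STATISTICS Workflow WITH FULLSCAN;

-- Force the clustered primary key (assumed name) and compare the resulting
-- plan and run time with the plan the optimizer picks on its own.
SELECT TOP 100 ID
FROM Workflow WITH (INDEX(PK_WorkFlow))
WHERE ID > 1000
ORDER BY ID;
```

A hint is best used as a diagnostic here. If the hinted plan is clearly faster, that points to an estimation or index definition problem worth fixing at the source.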
