Why NonClustered index scan faster than Clustered Index scan? - sql

As I know, heap tables are tables without clustered index and has no physical order.
I have a heap table "scan" with 120k rows and I am using this select:
SELECT id FROM scan
If I create a non-clustered index for the column "id", I get 223 physical reads.
If I remove the non-clustered index and alter the table to make "id" my primary key (and so my clustered index), I get 515 physical reads.
If the clustered index table is something like this picture:
Why Clustered Index Scans workw like the table scan? (or worse in case of retrieving all rows). Why it is not using the "clustered index table" that has less blocks and already has the ID that I need?

SQL Server indices are b-trees. A non-clustered index just contains the indexed columns, with the leaf nodes of the b-tree being pointers to the approprate data page. A clustered index is different: its leaf nodes are the data page itself and the clustered index's b-tree becomes the backing store for the table itself; the heap ceases to exist for the table.
Your non-clustered index contains a single, presumably integer column. It's a small, compact index to start with. Your query select id from scan has a covering index: the query can be satisfied just by examining the index, which is what is happening. If, however, your query included columns not in the index, assuming the optimizer elected to use the non-clustered index, an additional lookup would be required to fetch the data pages required, either from the clustering index or from the heap.
To understand what's going on, you need to examine the execution plan selected by the optimizer:
See Displaying Graphical Execution Plans
See Red Gate's SQL Server Execution Plans, by Grant Fritchey

A clustered index generally is about as big as the same data in a heap would be (assuming the same page fullness). It should use just a little more reads than a heap would use because of additional B-tree levels.
A CI cannot be smaller than a heap would be. I don't see why you would think that. Most of the size of a partition (be it a heap or a tree) is in the data.
Note, that less physical reads does not necessarily translate to a query being faster. Random IO can be 100x slower than sequential IO.

When to use Clustered Index-
Query Considerations:
1) Return a range of values by using operators such as BETWEEN, >, >=, <, and <= 2) Return large result sets
3) Use JOIN clauses; typically these are foreign key columns
4) Use ORDER BY, or GROUP BY clauses. An index on the columns specified in the ORDER BY or GROUP BY clause may remove the need for the Database Engine to sort the data, because the rows are already sorted. This improves query performance.
Column Considerations :
Consider columns that have one or more of the following attributes:
1) Are unique or contain many distinct values
2) Defined as IDENTITY because the column is guaranteed to be unique within the table
3) Used frequently to sort the data retrieved from a table
Clustered indexes are not a good choice for the following attributes:
1) Columns that undergo frequent changes
2) Wide keys
When to use Nonclustered Index-
Query Considerations:
1) Use JOIN or GROUP BY clauses. Create multiple nonclustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
2) Queries that do not return large result sets
3) Contain columns frequently involved in search conditions of a query, such as WHERE clause, that return exact matches
Column Considerations :
Consider columns that have one or more of the following attributes:
1) Cover the query. For more information, see Index with Included Columns
2) Lots of distinct values, such as a combination of last name and first name, if a clustered index is used for other columns
3) Used frequently to sort the data retrieved from a table
Database Considerations:
1) Databases or tables with low update requirements, but large volumes of data can benefit from many nonclustered indexes to improve query performance.
2) Online Transaction Processing applications and databases that contain heavily updated tables should avoid over-indexing. Additionally, indexes should be narrow, that is, with as few columns as possible.

Try running
DBCC DROPCLEANBUFFERS
Before the queries...
If you really want to compare them.
Physical reads don't mean the same as logical reads when optimizing a query

Related

What INCLUDE() function does when creating index in MS SQL Server?

What is the difference between creating an index using INCLUDE function vs not?
What would be the difference between the following two indexes?
CREATE NONCLUSTERED INDEX SomeName ON SomeTable (
ColumnA
,ColumnB
,ColumnC
,ColumnD
) INCLUDE (
ColumnE
,ColumnF
,ColumnG
)
vs
CREATE INDEX SomeName ON SomeTable (
ColumnA
,ColumnB
,ColumnC
,ColumnD
,ColumnE
,ColumnF
,ColumnG
)
The INCLUDE clause adds the data at the lowest/leaf level, rather than in the index tree. This makes the index smaller because it's not part of the tree.
INCLUDE columns are not key columns in the index, so they are not ordered. This means it isn't really useful for predicates, sorting etc.. However, it may be useful if you have a residual lookup in a few rows from the key columns.
INCLUDE columns are not key columns in the index, so they are not ordered. This makes them not typically useful for JOINs or sorting. And because they are not key columns, they don't sit in the whole B-tree structure like key columns
By adding Include (or nonkey)columns, you can create nonclustered indexes that cover more queries. This is because the nonkey columns have the following benefits:
They can be data types not allowed as index key columns.
They are not considered by the Database Engine when calculating the number of index key columns or index key size.
An index with Included columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index; table or clustered index data is not accessed resulting in fewer disk I/O operations.
For more info refer Microsoft docs: Create Indexes with Included Columns
When an execution plan uses an index, it has access to all the columns in the index. If all the columns from a given table are in the index, there is no need to refer to the original data pages. Eliminating that data page lookup is a gain in efficiency.
However, including columns in indexes has overhead for the indexing structure itself (this is in addition to duplicating the values).
The INCLUDE keyword allows for column values to be in the index, without incurring the overhead of the additional indexing structure. The purpose is to resolve queries without having to look up the column information on the original data pages.

SQL Server non-clustered index

I have two different queries in SQL Server and I want to clarify
how the execution plan would be different, and
which of them is more efficient
Queries:
SELECT *
FROM table_name
WHERE column < 2
and
SELECT column
FROM table_name
WHERE column < 2
I have a non-clustered index on column.
I used to use Postgresql and I am not familiar with SQL Server and these kind of indexes.
As I read many questions here I kept two notes:
When I have a non-clustered index, I need one more step in order to have access to data
With a non-clustered index I could have a copy of part of the table and I get a quicker response time.
So, I got confused.
One more question is that when I have "SELECT *" which is the influence of a non-clustered index?
1st query :
Depending on the size of the data you might face lookup issues such as Key lookup and RID lookups .
2nd query :
It will be faster because it will not fetch columns that are not part of the index , though i recommend using covering index ..
I recommend you check this blog post
The first select will use the non-clustered index to find the clustering key [clustered index exists] or page and slot [no clustered index]. Then that will be used to get the row. The query plan will be different depending on your STATS (the data).
The second query is "covered" by the non-clustered index. What that means is that the non-clustered index contains all of the data that you are selecting. The clustering key is not needed, and the clustered index and/or heap is not needed to provide data to the select list.

What "Clustered Index Scan (Clustered)" means on SQL Server execution plan?

I have a query that fails to execute with "Could not allocate a new page for database 'TEMPDB' because of insufficient disk space in filegroup 'DEFAULT'".
On the way of trouble shooting I am examining the execution plan. There are two costly steps labeled "Clustered Index Scan (Clustered)". I have a hard time find out what this means?
I would appreciate any explanations to "Clustered Index Scan (Clustered)" or suggestions on where to find the related document?
I would appreciate any explanations to "Clustered Index Scan
(Clustered)"
I will try to put in the easiest manner, for better understanding you need to understand both index seek and scan.
SO lets build the table
use tempdb GO
create table scanseek (id int , name varchar(50) default ('some random names') )
create clustered index IX_ID_scanseek on scanseek(ID)
declare #i int
SET #i = 0
while (#i <5000)
begin
insert into scanseek
select #i, 'Name' + convert( varchar(5) ,#i)
set #i =#i+1
END
An index seek is where SQL server uses the b-tree structure of the index to seek directly to matching records
you can check your table root and leaf nodes using the DMV below
-- check index level
SELECT
index_level
,record_count
,page_count
,avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'),OBJECT_ID('scanseek'),NULL,NULL,'DETAILED')
GO
Now here we have clustered index on column "ID"
lets look for some direct matching records
select * from scanseek where id =340
and look at the Execution plan
you've requested rows directly in the query that's why you got a clustered index SEEK .
Clustered index scan: When Sql server reads through for the Row(s) from top to bottom in the clustered index.
for example searching data in non key column. In our table NAME is non key column so if we will search some data in the name column we will see clustered index scan because all the rows are in clustered index leaf level.
Example
select * from scanseek where name = 'Name340'
please note: I made this answer short for better understanding only, if you have any question or suggestion please comment below.
Expanding on Gordon's answer in the comments, a clustered index scan is scanning one of the tables indexes to find the values you are doing a where clause filter, or for a join to the next table in your query plan.
Tables can have multiple indexes (one clustered and many non-clustered) and SQL Server will search the appropriate one based upon the filter or join being executed.
Clustered Indexes are explained pretty well on MSDN. The key difference between clustered and non-clustered is that the clustered index defines how rows are stored on disk.
If your clustered index is very expensive to search due to the number of records, you may want to add a non-clustered index on the table for fields that you search for often, such as date fields used for filtering ranges of records.
A clustered index is one in which the terminal (leaf) node of the index is the actual data page itself. There can be only one clustered index per table, because it specifies how records are arranged within the data page. It is generally (and with some exceptions) considered the most performant index type (primarily because there is one less level of indirection before you get to your actual data record).
A "clustered index scan" means that the SQL engine is traversing your clustered index in search for a particular value (or set of values). It is one of the most efficient methods for locating a record (beat by a "clustered index seek" in which the SQL Engine is looking to match a single selected value).
The error message has absolutely nothing to do with the query plan. It just means that you are out of space on TempDB.
I have been having issues with performance and timeouts due to a clustered index scan. However another seemingly identical database did not have the same issue.
Turns out the COMPATIBILITY_LEVEL flag on the db was different... the version with COMPATIBILITY_LEVEL 100 was using the scan, the db with level 130 wasn't. Performance difference is huge (from more than 1 minute to less that 1 second for same query)
ALTER DATABASE [mydb] SET COMPATIBILITY_LEVEL = 130
If you hover over the step in the query plan, SSMS displays a description of what the step does. That will give you a baseline understanding of "Clustered Index Scan (Clustered)" and all other steps involved.

Why is there a non-clustered index scan when counting all rows in a table?

As far as I understand it, each transaction sees its own version of the database, so the system cannot get the total number of rows from some counter and thus needs to scan an index. But I thought it would be the clustered index on the primary key, not the additional indexes. If I had more than one additional index, which one will be chosen, anyway?
When digging into the matter, I've noticed another strange thing. Suppose there are two identical tables, Articles and Articles2, each with three columns: Id, View_Count, and Title. The first has only a clustered PK-based index, while the second one has an additional non-clustered, non-unique index on view_count. The query SELECT COUNT(1) FROM Articles runs 2 times faster for the table with the additional index.
SQL Server will optimize your query - if it needs to count the rows in a table, it will choose the smallest possible set of data to do so.
So if you consider your clustered index - it contains the actual data pages - possibly several thousand bytes per row. To load all those bytes just to count the rows would be wasteful - even just in terms of disk I/O.
Therefore, it there is a non-clustered index that's not filtered or restricted in any way, SQL Server will pick that data structure to count - since the non-clustered index basically contains the columns you've put into the NC index (plus the clustered index key) - much less data to load just to count the number of rows.

What's the difference between a Table Scan and a Clustered Index Scan?

Since both a Table Scan and a Clustered Index Scan essentially scan all records in the table, why is a Clustered Index Scan supposedly better?
As an example - what's the performance difference between the following when there are many records?:
declare #temp table(
SomeColumn varchar(50)
)
insert into #temp
select 'SomeVal'
select * from #temp
-----------------------------
declare #temp table(
RowID int not null identity(1,1) primary key,
SomeColumn varchar(50)
)
insert into #temp
select 'SomeVal'
select * from #temp
In a table without a clustered index (a heap table), data pages are not linked together - so traversing pages requires a lookup into the Index Allocation Map.
A clustered table, however, has it's data pages linked in a doubly linked list - making sequential scans a bit faster. Of course, in exchange, you have the overhead of dealing with keeping the data pages in order on INSERT, UPDATE, and DELETE. A heap table, however, requires a second write to the IAM.
If your query has a RANGE operator (e.g.: SELECT * FROM TABLE WHERE Id BETWEEN 1 AND 100), then a clustered table (being in a guaranteed order) would be more efficient - as it could use the index pages to find the relevant data page(s). A heap would have to scan all rows, since it cannot rely on ordering.
And, of course, a clustered index lets you do a CLUSTERED INDEX SEEK, which is pretty much optimal for performance...a heap with no indexes would always result in a table scan.
So:
For your example query where you select all rows, the only difference is the doubly linked list a clustered index maintains. This should make your clustered table just a tiny bit faster than a heap with a large number of rows.
For a query with a WHERE clause that can be (at least partially) satisfied by the clustered index, you'll come out ahead because of the ordering - so you won't have to scan the entire table.
For a query that is not satisified by the clustered index, you're pretty much even...again, the only difference being that doubly linked list for sequential scanning. In either case, you're suboptimal.
For INSERT, UPDATE, and DELETE a heap may or may not win. The heap doesn't have to maintain order, but does require a second write to the IAM. I think the relative performance difference would be negligible, but also pretty data dependent.
Microsoft has a whitepaper which compares a clustered index to an equivalent non-clustered index on a heap (not exactly the same as I discussed above, but close). Their conclusion is basically to put a clustered index on all tables. I'll do my best to summarize their results (again, note that they're really comparing a non-clustered index to a clustered index here - but I think it's relatively comparable):
INSERT performance: clustered index wins by about 3% due to the second write needed for a heap.
UPDATE performance: clustered index wins by about 8% due to the second lookup needed for a heap.
DELETE performance: clustered index wins by about 18% due to the second lookup needed and the second delete needed from the IAM for a heap.
single SELECT performance: clustered index wins by about 16% due to the second lookup needed for a heap.
range SELECT performance: clustered index wins by about 29% due to the random ordering for a heap.
concurrent INSERT: heap table wins by 30% under load due to page splits for the clustered index.
http://msdn.microsoft.com/en-us/library/aa216840(SQL.80).aspx
The Clustered Index Scan logical and physical operator scans the clustered index specified in the Argument column. When an optional WHERE:() predicate is present, only those rows that satisfy the predicate are returned. If the Argument column contains the ORDERED clause, the query processor has requested that the rows' output be returned in the order in which the clustered index has sorted them. If the ORDERED clause is not present, the storage engine will scan the index in the optimal way (not guaranteeing the output to be sorted).
http://msdn.microsoft.com/en-us/library/aa178416(SQL.80).aspx
The Table Scan logical and physical operator retrieves all rows from the table specified in the Argument column. If a WHERE:() predicate appears in the Argument column, only those rows that satisfy the predicate are returned.
A table scan has to examine every single row of the table. The clustered index scan only needs to scan the index. It doesn't scan every record in the table. That's the point, really, of indices.