Understanding how primary key columns are included in a non-clustered index - sql

Assume I have a table called 'demo' with 4 columns; 'a', 'b', 'c' and 'd'. The primary key clustered index for the 'demo' table contains columns 'a' and 'b' in that order.
The 'Actual Execution Plan' from a query referencing table 'demo' has suggested that a new non-unique non-clustered index is required for column 'b' and should include column 'a'.
If I create a non-unique non-clustered index on column 'b' do I need to include column 'a' or will it already be part of the non-clustered index because it is in the primary key?
If primary key column 'a' is already part of the non-clustered index, is column 'a' stored as an include column or is it part of the non-clustered key?

The 'Actual Execution Plan' from a query referencing table 'demo' has
suggested that a new non-unique non-clustered index is required for
column 'b' and should include column 'a'.
...
If primary key column 'a' is already part of the non-clustered index,
is column 'a' stored as an include column or is it part of the
non-clustered key?
In your case column a will be presented on all levels of non-clustered index as the part of clustered index key. The index suggested to you is non-unique so it needs uniquefier and the clustered index key will be used for this purpose.
If the offered index was unique, column a would be stored on the leaf level of this index as the part of row locator that in case of a clustered table is clustered index key.
Column a will not be stored twice if you include it explicitly as included column of your index, so I advice you to include it. It will make difference when one day someone decides to turn your clustered table to a heap (by dropping clustered index). In this case if you did not include column a explicitly in your non clustered index, it will be lost and not contained in your non-clustered index anymore

Including the column a in non-clustered index is useless since it is part of the clustered index key. Therefore, it is part of the data in leaf pages of non-clustered index. Having a query like this
SELECT a FROM tab WHERE b = <value>
then the a value will be naturally part of the leaf data in the non-clustered index.

The PK fields are always part of the key of the index, not part of the included columns.
What I'm thinking here is perhaps it wants to seek by column B; that's something that it can only do if column B is the first key in the index. If you define an index with column B first, followed by column A, perhaps it'll be able to do just that. It seems that it'll be happy as long as both keys are in the index, as you have a compound PK, though they may currently be in the wrong order (first A, then B) thereby preventing a seek.
Reference on PK fields automatically showing up in indexes: https://www.brentozar.com/archive/2013/07/how-to-find-secret-columns-in-nonclustered-indexes/

Try this and watch execution plan. You can see DB uses only INDEX. So, as far as I know, you shouldn't include column A in your index (as, as you said, Clust. index key is already included).
CREATE TABLE DEMO (COLA VARCHAR(10) NOT NULL, COLB VARCHAR(10) NOT NULL, COLC VARCHAR(10), COLD VARCHAR(10));
ALTER TABLE DEMO ADD CONSTRAINT DEMO_PK PRIMARY KEY (COLA, COLB);
CREATE INDEX DEMO_IX1 ON DEMO (COLB);
INSERT INTO DEMO VALUES ('A','B','C','D');
INSERT INTO DEMO VALUES ('A1','B1','C1','D1');
INSERT INTO DEMO VALUES ('A2','B2','C2','D2');
SELECT COLA,COLB FROM DEMO WHERE COLB='B1'

Non-clustered indexes implicitly include the clustered index keys automatically.
In the documentation you could get a lot of information about this, but especially this part explains exactly this:
Nonclustered Index Architecture
The leaf layer of a nonclustered index is made up of index pages
instead of data pages. The row locators in nonclustered index rows are
either a pointer to a row or are a clustered index key for a row.
If your table is a heap, then the row locator would point directly to the data row that contains the key value but if your table is not a heap (which is the case, because you have already a clustered key on that table) then the row locator points to the clustered index key.
Take a look at clustered and nonclustered indexes described as well.
This thread discusses the same: Necessary to include clustered index columns in non-clustered indexes?

Related

If a table has 'id' column as it's clustered index, is there any benefit in adding 'id' column as an included column in any other non-clustered index

If a table has 'id' (the primary key) column as its clustered index, is there any benefit in adding 'id' column as an included column in any other non-clustered index in Microsoft SQLServer?
eg:- Table 'xyz'
id
name
status
date
1
abc
active
2021-06-23
CREATE NONCLUSTERED INDEX [NonClusteredIndex_status_Date]
ON [xyz]
(
[status] ASC,
[date] ASC
)
INCLUDE
( [id],
[name]
)
And this non-clustered index is targeted for a query similar to bellow on a large data set. In the actual case there could be some other queries as well.
select * from xyz where status='active' and date > '2021-06-20'
The answer to your question is essentially No, there is no benefit.
When you create a non-clustered index on a table, each row in the index needs to be able to point to the row in the base table.
If the base table is a heap, each row in the index will contain a pointer to the rid (row identifier) which is what SQL Server uses to uniquely identify each row.
When the table is defined with a clustered index, every non-clustered index will automatically contain the clustered index column(s) as keys to the row in the base table.
You can see this in an execution plan, where a non-clustered index is used and SQL Server has to retrieve additional columns from the base table; if the table is a heap, it will be an RID lookup, if the table has a clustered index it will be a Key lookup.
Additionally, if the clustered index is not unique, SQL Server adds its own uniquifier value to ensure uniqueness, and this is also included in non-clustered indexes.
So when it comes to non-clustered indexes, it does not matter if you specify the clustered index column(s) - you can, and there is no harm in doing so, but it/they are always included.
This answer assumes you are planning to run the following query:
SELECT * FROM xyz WHERE status = 'active' AND date > '2021-06-20';
If you only created a non clustered index on (status, date), then it would cover the WHERE clause, but not the SELECT clause. What this means is that SQL Server might choose to use the index to find the matching records in the query. But when it gets to evaluating the SELECT clause, it would be forced to seek back to the clustered index to find the values for the columns not included in the index, other than the id clustered index column (which includes the name column in this case). There is performance penalty in doing this, and SQL Server might, depending on your data, even choose to not use the index because it does not completely cover the query.
To mitigate this, you can define the index in your question, where you include the name value in the leaf nodes. Note that id will be included by default in the index, so we do not need to explicitly INCLUDE it. By following this approach, the index itself is said to completely cover the query, meaning that SQL Server may use the index for the entire query plan. This can lead to fast performance in many cases.

Are indexes basically a copy of the tables?

today I went to a job interview and while I was there I heard that "Indexes are bascially a clones of the tables, on which they're made".
Could someone relate to this statement? Honestly I've never heard this kind of Index definition
Not really, although they could be.
Every index (including the clustered index) will be using the index keys in all of its internal nodes. What's different is what happens when we reach the leaves of the index.
In a normal, old-school non-clustered index in SQL Server, what you'll find in the leaves are the key values for the clustered index (or some form of row ID for heap tables). Whereas in the clustered index, you'll find the values for all columns, not just those which are the clustered keys and (for that index) it's specific keys.
INCLUDE in indexes muddies the water somewhat by including extra columns at the leaf level in non-clustered indexes.
If the total set of columns in (index keys, clustered-index keys, included columns) for a non-clustered index is the same as the set of all columns in the table, then to an extent the non-clustered index does seem to be a copy of the table - at least to the extent that any query making use of this index will not have to perform any table-lookups to retrieve all data.
If the set of columns above isn't the same as the set of all columns in the table then it's not a copy of the table. It's a copy of a subset of columns of the table. Of course, if this subset of columns are all of the columns required by a particular query then a table lookup can still be avoided.
If you spoke about a clustered index then it's true. Just check documentation:
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the index
definition. There can be only one clustered index per table, because
the data rows themselves can be stored in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table has
no clustered index, its data rows are stored in an unordered structure
called a heap.
But if you spoke about non-clustered index then it's false coz table store as a heap and index separate from table. In this case index is another object which looks like a data structure.
Nonclustered indexes have a structure separate from the data rows. A
nonclustered index contains the nonclustered index key values and each
key value entry has a pointer to the data row that contains the key
value.
The pointer from an index row in a nonclustered index to a data row is
called a row locator. The structure of the row locator depends on
whether the data pages are stored in a heap or a clustered table. For
a heap, a row locator is a pointer to the row. For a clustered table,
the row locator is the clustered index key.
You can add nonkey columns to the leaf level of the nonclustered index
to by-pass existing index key limits, and execute fully covered,
indexed, queries. For more information, see Create Indexes with
Included Columns. For details about index key limits see Maximum
Capacity Specifications for SQL Server.

How to calculate the Num_Key_Cols for a Non-Clustered Index with included columns in it?

I'm trying to estimate the index size using the information provided at the MSDN website.
Let us consider a table "Table1" with three columns in it.
The columns are listed below,
Id int, not null
Marks int, not null
SubmitDate Date, not null
Initially, I have created a clustered primary key on Id column and then planned to create a non-clustered index with "Id" as "index key column" where as, "Marks" and "SubmitDate" columns will be used as "Included Columns" in the index.
Based on the above plan, I was trying to estimate the Non-Clustered index key size before creating it. While going through the MSDN site , I have lot of confusions to be clarified. There are four steps to estimate the nonclustered index key size and at the first step, 1.2 and 1.3 explains about, how to calculate the Num_Key_Cols, Fixed_Key_Size, Num_Variable_Key_Cols and Max_Var_Key_Size. But at 1.2 and 1.3, we should calculate based on the Index key type, do we have clustered index key already or not and do we have included columns or not; it seems to be confusing. Can anybody help me based on the info that I have provided about the sample table (Table1) and the Non-Clustered Index Key structure that I would like to create.
In my case, I have included columns, index key column was already a primary key, and all the columns are not null fields. how to calculate the index size for it?
Thanks in advance.
Initially, I have created a clustered primary key on Id column and then planned to create a non-clustered index with "Id" as "index key column" where as, "Marks" and "SubmitDate" columns will be used as "Included Columns" in the index.
I wouldn't do this. By making Id both the primary key (i.e. unique) and the clustering key of the table (with only 3 columns), there is little point adding a further NC Index on Id - the Clustered index is adequate.
If you had a large number of (on page) columns in the table then there could be a reason to add another NC Index on Id with included columns on (Marks, SubmitDate) as the NC index density would be higher. But this is clearly not the case here.
Some clarification of the MSDN NC sizing link:
Direct index columns need to be considered in all levels of the tree.
INCLUDE columns are present only in the leaf nodes of the tree
Num_Key_Cols = Num_Key_Cols + 1 is only needed when Sql Server adds a 4 byte uniquifier if the Clustering key isn't unique. You've clustered by the Primary Key, so its unique, so not +1.
Remember to also do empirical measurements on real data
Not all tables and indexes will have fixed column widths - many have variable length index columns like (N)VARCHAR, in which case the actual storage consumed by the table and indices is highly dependent on the average lengths of such fields.
In this case, I would suggest creating the table and populating it with approximate data and then measure the actual data storage with tools like sp_spaceused and sys.dm_db_index_physical_stats, e.g.:
select * from sys.dm_db_index_physical_stats (DB_ID(),
OBJECT_ID(N'dbo.Table1'), NULL, NULL , 'DETAILED');
(Page size is 8192)
Edit
Just to illustrate, if you already have this in place:
CREATE TABLE Table1
(
Id INT IDENTITY(1,1),
Marks INT NOT NULL,
SubmitDate DATE NOT NULL
);
ALTER TABLE Table1 ADD CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Id);
Then there is no point in doing this as well:
CREATE NONCLUSTERED INDEX IX_Table1 on Table1(Id) INCLUDE (Marks, SubmitDate)
Since the Clustered Index will already seek / scan on ID just as performantly as the NC index - you will just double the storage requirement.
As an aside, also note that in general that the clustered index key is included in all non-clustered indexes automatically, and does not need to be explicitly added to Non clustered indexes.
SELECT
OBJECT_NAME(i.OBJECT_ID) AS TableName,
i.name AS IndexName,
i.index_id AS IndexID,
8 * SUM(a.used_pages) AS 'Indexsize(KB)'
FROM sys.indexes AS i
JOIN sys.partitions AS p ON p.OBJECT_ID = i.OBJECT_ID AND p.index_id = i.index_id
JOIN sys.allocation_units AS a ON a.container_id = p.partition_id
GROUP BY i.OBJECT_ID,i.index_id,i.name
ORDER BY OBJECT_NAME(i.OBJECT_ID),i.index_id

SQL: Add Primary Key to Non-Unique Index

Let's say a query is filtering on two fields and returning primary key values.
SELECT RowIdentifier
FROM Table
WHERE QualifierA = 'exampleA' AND QualifierB = 'exampleB'
Assuming the clustered index is not the PrimaryKey would a non-unique index that contains QualifierA and QualiferB be best served via the addition of the RowIdentifier(Scenario A & Scenario B). Or would it be more appropriate to simply include it(Scenario C)?
Scenario A: Non-Unique, Non-Clustered
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario B: Unique, Non-Clustered
CREATE UNIQUE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario C:
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
Finally I'm assuming that if the PrimaryKey were the clustered index that neither is necessary, is this accurate?
If there is a CLUSTERED index, it is automatically included in all indexes on the table. You can explicitly include it but it is not required.
The UNIQUE index simply enforces uniqueness. The PK should already have this constraint. You do not need to re-enforce it in every index.
If you are including the PK in your where clause, it will almost certainly use the PK index to find that row because it is guaranteed to return the fewest results, so including in your index gains you nothing for lookups. It could also potentially skew the cardinality engine and make SQL think the index is more distinct than it really is.
For the above reasons, I would select Option C
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
I would use this regardless of what column is clustered. This will give you the performance, insure the index will continue to perform regardless of the CLUSTERED INDEX, and make it explicit what the index is used for.
I'm wondering what's more appropriate? A non-clustered unique index incorporating all three fields, or a non-clustered non-unique index incorporating just the two fields(QualifierA & QualifierB) but including the PrimaryKey.
There's a third option. A non-clustered, non-unique index incorporating all three fields.
When you make an index, the fields in the index are duplicated to another place in memory so the server can go after those fields with ease. If you only have QualiferA and Qualifier B in the index it will find the rows in that index that meet your criteria and then go back to the main table to pick up the RowIdentifier. Instead, include all three in there to improve performance.
Remember, make sure you put QualifierA and QualifierB before RowIdentifier in your index. The order of the columns determine how the data is ordered.
Try it out with some test data if you like, and look at the query plan to see what it's doing.

What is advantages of non clustered index over primary key (clustered index)

i have got a table (stores data of forum, means normally no edit and update just insert) on which i have a primary key column which is as we know a clustered index.
please tell me, will i get any advantage if i creates a non-clustered index on that column (primary key column)?
EDIT: my table has got currently around 60000 records, what will be better to place non-clustered index on it or create a same new table and create index and then copy records from old to new table.
Thanks
Every table should have a clustered index
Non-clustered indexes allow INCLUDEs which is very useful
Non-clustered indexes allow filtering in SQL Server 2008+
Notes:
Primary key is a constraint which happens to be a clustered index by default
One clustered index only, many non-clustered indexes
One advantage: you can INCLUDE other columns in the index.
A clustered index specifies the physical storage order of the table data (this is why there can only be one clustered index per table).
If there is no clustered index, inserts will typically be faster since the data doesn't have to be stored in a specific order but can just be appended at the end of the table.
On the other hand, index searches on the key column will typically be slower, since the searches cannot use the advantages of the clustered index.
The only possible advantage that I can see could be from the fact that the entries on leaf pages of nonclustered index are not as wide. They only contain index columns while the clustered index' leaf pages are the actual rows of data. Therefore, if you need something like select count(your_column_name) from your_table then scanning the nonclustered index will involve considerably smaller number of data pages. Or if the number of index columns is greater than one and you run any query which does not need data from non-indexed columns then again, nonclustered index scan will be faster.