Does non-clustered index contain any data? - sql

So every resource online seems to just copy-paste the same phrase: "a non-clustered index does not contain data; instead it contains pointers to the actual data".
But in this case, how does the RDBMS know how to sort it? It does not make sense to my smooth brain.
Let's say I have an employees table with ID and LastName. Isn't it the case that if I create an NCI on LastName, the leaf page will contain that name value and a pointer to the row in the table?
What does this mysterious index page actually contain?
employee
ID, LastName
1, CrookedTeeth
2, Bob
Index
Bob, pointer ->2
CrookedTeeth, pointer -> 1

Lots of comments on this one, but the summary is:
Nonclustered indexes do contain data
The data contained in a nonclustered index are the columns the index is defined on (its key columns)
Also, any columns listed in the INCLUDE clause are persisted to the index
And finally the row locator is also contained in the nonclustered index
The row locator will be the clustered index key when the table has a clustered index (and therefore stored in a B-Tree data structure)
The larger the clustered index key, the larger the nonclustered index will be
Otherwise the row locator will be the RID (Row ID) when there is no clustered index on the table, meaning the table is stored in a Heap data structure
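As a minimal sketch of the summary above, the question's example could be set up like this. The table name, data types, and index name are assumptions made for illustration; only the ID and LastName columns come from the question.

```sql
-- Hypothetical table matching the question's example.
CREATE TABLE dbo.employee
(
    ID       INT         NOT NULL,
    LastName VARCHAR(50) NOT NULL,
    CONSTRAINT PK_employee PRIMARY KEY CLUSTERED (ID)
);

INSERT INTO dbo.employee (ID, LastName)
VALUES (1, 'CrookedTeeth'), (2, 'Bob');

-- Leaf rows of this index hold LastName (the index key) plus ID
-- (the row locator, because ID is the clustered index key).
CREATE NONCLUSTERED INDEX IX_employee_LastName
    ON dbo.employee (LastName);

-- Both requested columns are stored in the index, so the query can be
-- answered without a lookup into the clustered index.
SELECT ID, LastName
FROM dbo.employee
WHERE LastName = 'Bob';
```

Because the index is ordered by LastName, the engine can seek straight to 'Bob'; that stored key value is what lets the index be sorted at all.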

Related

If a table has 'id' column as its clustered index, is there any benefit in adding 'id' column as an included column in any other non-clustered index

If a table has 'id' (the primary key) column as its clustered index, is there any benefit in adding 'id' column as an included column in any other non-clustered index in Microsoft SQLServer?
e.g. Table 'xyz'
id, name, status, date
1, abc, active, 2021-06-23
CREATE NONCLUSTERED INDEX [NonClusteredIndex_status_Date]
ON [xyz]
(
[status] ASC,
[date] ASC
)
INCLUDE
( [id],
[name]
)
This non-clustered index is targeted at a query similar to the one below, run on a large data set. In the actual case there could be some other queries as well.
select * from xyz where status='active' and date > '2021-06-20'
The answer to your question is essentially No, there is no benefit.
When you create a non-clustered index on a table, each row in the index needs to be able to point to the row in the base table.
If the base table is a heap, each row in the index will contain a pointer to the rid (row identifier) which is what SQL Server uses to uniquely identify each row.
When the table is defined with a clustered index, every non-clustered index will automatically contain the clustered index column(s) as keys to the row in the base table.
You can see this in an execution plan, where a non-clustered index is used and SQL Server has to retrieve additional columns from the base table; if the table is a heap, it will be an RID lookup, if the table has a clustered index it will be a Key lookup.
Additionally, if the clustered index is not unique, SQL Server adds its own uniquifier value to ensure uniqueness, and this is also included in non-clustered indexes.
So when it comes to non-clustered indexes, it does not matter if you specify the clustered index column(s) - you can, and there is no harm in doing so, but it/they are always included.
This answer assumes you are planning to run the following query:
SELECT * FROM xyz WHERE status = 'active' AND date > '2021-06-20';
If you only created a non-clustered index on (status, date), then it would cover the WHERE clause, but not the SELECT clause. What this means is that SQL Server might choose to use the index to find the matching records in the query. But when it gets to evaluating the SELECT clause, it would be forced to go back to the clustered index to look up the values of the columns not stored in the index (just the name column in this case, since id, as the clustered index key, is already there). There is a performance penalty in doing this, and SQL Server might, depending on your data, even choose not to use the index because it does not completely cover the query.
To mitigate this, you can define the index in your question, where you include the name value in the leaf nodes. Note that id will be included by default in the index, so we do not need to explicitly INCLUDE it. By following this approach, the index itself is said to completely cover the query, meaning that SQL Server may use the index for the entire query plan. This can lead to fast performance in many cases.
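A sketch of that covering index, assuming a schema for xyz (the data types, constraint name, and index name are illustrative and not from the question):

```sql
-- Assumed schema; only the column names come from the question.
CREATE TABLE dbo.xyz
(
    id       INT         NOT NULL,
    name     VARCHAR(50) NOT NULL,
    [status] VARCHAR(20) NOT NULL,
    [date]   DATE        NOT NULL,
    CONSTRAINT PK_xyz PRIMARY KEY CLUSTERED (id)
);

-- id is left out of INCLUDE on purpose: as the clustered index key
-- it is carried in every non-clustered index on the table anyway.
CREATE NONCLUSTERED INDEX IX_xyz_status_date
    ON dbo.xyz ([status], [date])
    INCLUDE (name);

-- Every column the query touches is now available from the index.
SELECT id, name, [status], [date]
FROM dbo.xyz
WHERE [status] = 'active' AND [date] > '2021-06-20';
```

Comparing the actual execution plan with and without the INCLUDE (name) clause should show whether a Key Lookup appears.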

Are indexes basically a copy of the tables?

Today I went to a job interview, and while I was there I heard that "indexes are basically clones of the tables on which they're made".
Can anyone relate to this statement? Honestly, I've never heard this kind of index definition.
Not really, although they could be.
Every index (including the clustered index) will be using the index keys in all of its internal nodes. What's different is what happens when we reach the leaves of the index.
In a normal, old-school non-clustered index in SQL Server, what you'll find in the leaves are that index's own key values plus the key values for the clustered index (or some form of row ID for heap tables). Whereas in the clustered index, you'll find the values for all columns, not just those which are the clustered keys.
INCLUDE in indexes muddies the water somewhat by including extra columns at the leaf level in non-clustered indexes.
If the total set of columns in (index keys, clustered-index keys, included columns) for a non-clustered index is the same as the set of all columns in the table, then to an extent the non-clustered index does seem to be a copy of the table - at least to the extent that any query making use of this index will not have to perform any table-lookups to retrieve all data.
If the set of columns above isn't the same as the set of all columns in the table then it's not a copy of the table. It's a copy of a subset of columns of the table. Of course, if this subset of columns are all of the columns required by a particular query then a table lookup can still be avoided.
If you were talking about a clustered index, then it's true. Just check the documentation:
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the index
definition. There can be only one clustered index per table, because
the data rows themselves can be stored in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table has
no clustered index, its data rows are stored in an unordered structure
called a heap.
But if you were talking about a non-clustered index, then it's false, because the index is stored separately from the table's data rows. In this case the index is another object with its own data structure.
Nonclustered indexes have a structure separate from the data rows. A
nonclustered index contains the nonclustered index key values and each
key value entry has a pointer to the data row that contains the key
value.
The pointer from an index row in a nonclustered index to a data row is
called a row locator. The structure of the row locator depends on
whether the data pages are stored in a heap or a clustered table. For
a heap, a row locator is a pointer to the row. For a clustered table,
the row locator is the clustered index key.
You can add nonkey columns to the leaf level of the nonclustered index
to by-pass existing index key limits, and execute fully covered,
indexed, queries. For more information, see Create Indexes with
Included Columns. For details about index key limits see Maximum
Capacity Specifications for SQL Server.

Understanding how primary key columns are included in a non-clustered index

Assume I have a table called 'demo' with 4 columns; 'a', 'b', 'c' and 'd'. The primary key clustered index for the 'demo' table contains columns 'a' and 'b' in that order.
The 'Actual Execution Plan' from a query referencing table 'demo' has suggested that a new non-unique non-clustered index is required for column 'b' and should include column 'a'.
If I create a non-unique non-clustered index on column 'b' do I need to include column 'a' or will it already be part of the non-clustered index because it is in the primary key?
If primary key column 'a' is already part of the non-clustered index, is column 'a' stored as an include column or is it part of the non-clustered key?
The 'Actual Execution Plan' from a query referencing table 'demo' has
suggested that a new non-unique non-clustered index is required for
column 'b' and should include column 'a'.
...
If primary key column 'a' is already part of the non-clustered index,
is column 'a' stored as an include column or is it part of the
non-clustered key?
In your case column a will be present on all levels of the non-clustered index as part of the clustered index key. The index suggested to you is non-unique, so it needs a uniquifier, and the clustered index key will be used for this purpose.
If the suggested index were unique, column a would be stored only at the leaf level of this index as part of the row locator, which for a clustered table is the clustered index key.
Column a will not be stored twice if you include it explicitly as an included column of your index, so I advise you to include it. It will make a difference if one day someone decides to turn your clustered table into a heap (by dropping the clustered index). In that case, if you did not include column a explicitly in your non-clustered index, it will be lost and no longer contained in the index.
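The heap scenario described above can be sketched as follows. The data types and the constraint and index names are assumptions; the question only gives the column names a, b, c and d.

```sql
-- Assumed types; the clustered primary key is (a, b) as in the question.
CREATE TABLE dbo.demo
(
    a INT NOT NULL,
    b INT NOT NULL,
    c INT NULL,
    d INT NULL,
    CONSTRAINT PK_demo PRIMARY KEY CLUSTERED (a, b)
);

-- Column a is included explicitly, even though it is already present
-- as part of the clustered index key.
CREATE NONCLUSTERED INDEX IX_demo_b
    ON dbo.demo (b)
    INCLUDE (a);

-- Dropping the primary key constraint also drops its clustered index,
-- which leaves the table as a heap.
ALTER TABLE dbo.demo DROP CONSTRAINT PK_demo;

-- The row locator is now a RID, but a is still stored in IX_demo_b
-- because it was named in INCLUDE, so this query stays covered.
SELECT a
FROM dbo.demo
WHERE b = 1;
```

Without the INCLUDE (a), the same query on the heap would need an RID lookup to fetch a.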
Including the column a in non-clustered index is useless since it is part of the clustered index key. Therefore, it is part of the data in leaf pages of non-clustered index. Having a query like this
SELECT a FROM tab WHERE b = <value>
then the a value will be naturally part of the leaf data in the non-clustered index.
When the PK is the clustered index and the non-clustered index is non-unique, the PK fields are part of the key of the index, not part of the included columns.
What I'm thinking here is perhaps it wants to seek by column B; that's something that it can only do if column B is the first key in the index. If you define an index with column B first, followed by column A, perhaps it'll be able to do just that. It seems that it'll be happy as long as both keys are in the index, as you have a compound PK, though they may currently be in the wrong order (first A, then B) thereby preventing a seek.
Reference on PK fields automatically showing up in indexes: https://www.brentozar.com/archive/2013/07/how-to-find-secret-columns-in-nonclustered-indexes/
Try this and look at the execution plan. You can see that the DB uses only the index. So, as far as I know, you don't need to include column A in your index (since, as you said, the clustered index key is already included).
CREATE TABLE DEMO (COLA VARCHAR(10) NOT NULL, COLB VARCHAR(10) NOT NULL, COLC VARCHAR(10), COLD VARCHAR(10));
ALTER TABLE DEMO ADD CONSTRAINT DEMO_PK PRIMARY KEY (COLA, COLB);
CREATE INDEX DEMO_IX1 ON DEMO (COLB);
INSERT INTO DEMO VALUES ('A','B','C','D');
INSERT INTO DEMO VALUES ('A1','B1','C1','D1');
INSERT INTO DEMO VALUES ('A2','B2','C2','D2');
SELECT COLA,COLB FROM DEMO WHERE COLB='B1'
Non-clustered indexes implicitly include the clustered index keys automatically.
In the documentation you could get a lot of information about this, but especially this part explains exactly this:
Nonclustered Index Architecture
The leaf layer of a nonclustered index is made up of index pages
instead of data pages. The row locators in nonclustered index rows are
either a pointer to a row or are a clustered index key for a row.
If your table is a heap, then the row locator points directly to the data row that contains the key value, but if your table is not a heap (which is the case here, because you already have a clustered key on that table), then the row locator is the clustered index key.
Take a look at clustered and nonclustered indexes described as well.
This thread discusses the same: Necessary to include clustered index columns in non-clustered indexes?

SQL: Add Primary Key to Non-Unique Index

Let's say a query is filtering on two fields and returning primary key values.
SELECT RowIdentifier
FROM Table
WHERE QualifierA = 'exampleA' AND QualifierB = 'exampleB'
Assuming the clustered index is not the PrimaryKey, would a non-unique index that contains QualifierA and QualifierB be best served by adding the RowIdentifier to the key (Scenario A and Scenario B)? Or would it be more appropriate to simply include it (Scenario C)?
Scenario A: Non-Unique, Non-Clustered
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario B: Unique, Non-Clustered
CREATE UNIQUE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario C:
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
Finally I'm assuming that if the PrimaryKey were the clustered index that neither is necessary, is this accurate?
If there is a CLUSTERED index, its key is automatically included in all other indexes on the table. You can explicitly include it, but it is not required.
The UNIQUE index simply enforces uniqueness. The PK should already have this constraint. You do not need to re-enforce it in every index.
If you are including the PK in your WHERE clause, it will almost certainly use the PK index to find that row because it is guaranteed to return the fewest results, so adding it to your index key gains you nothing for lookups. It could also potentially skew cardinality estimation and make SQL Server think the index is more selective than it really is.
For the above reasons, I would select Scenario C:
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
I would use this regardless of which column is clustered. This will give you the performance, ensure the index will continue to perform regardless of the CLUSTERED INDEX, and make it explicit what the index is used for.
I'm wondering what's more appropriate? A non-clustered unique index incorporating all three fields, or a non-clustered non-unique index incorporating just the two fields(QualifierA & QualifierB) but including the PrimaryKey.
There's a third option. A non-clustered, non-unique index incorporating all three fields.
When you make an index, the fields in the index are copied into a separate structure so the server can get at those fields with ease. If you only have QualifierA and QualifierB in the index, it will find the rows in that index that meet your criteria and then go back to the main table to pick up the RowIdentifier. Instead, include all three in there to improve performance.
Remember, make sure you put QualifierA and QualifierB before RowIdentifier in your index. The order of the columns determine how the data is ordered.
Try it out with some test data if you like, and look at the query plan to see what it's doing.
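One way to try it out is sketched below. Everything except the column names QualifierA, QualifierB and RowIdentifier is assumed for illustration: the table name, the separate ClusterKey column, the data types, and the generated test values.

```sql
-- Hypothetical test table: the primary key (RowIdentifier) is deliberately
-- NOT the clustered index, as in the question.
CREATE TABLE dbo.QualifierTest
(
    ClusterKey    INT IDENTITY(1,1) NOT NULL,
    RowIdentifier UNIQUEIDENTIFIER  NOT NULL DEFAULT NEWID(),
    QualifierA    VARCHAR(20)       NOT NULL,
    QualifierB    VARCHAR(20)       NOT NULL,
    CONSTRAINT PK_QualifierTest PRIMARY KEY NONCLUSTERED (RowIdentifier)
);

CREATE UNIQUE CLUSTERED INDEX CIX_QualifierTest
    ON dbo.QualifierTest (ClusterKey);

-- Generate some test rows from a system catalog cross join.
INSERT INTO dbo.QualifierTest (QualifierA, QualifierB)
SELECT TOP (10000)
       CASE WHEN n.rn % 100 = 0 THEN 'exampleA' ELSE 'otherA' END,
       CASE WHEN n.rn % 50  = 0 THEN 'exampleB' ELSE 'otherB' END
FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
      FROM sys.all_objects AS o1
      CROSS JOIN sys.all_objects AS o2) AS n;

-- Scenario C from the question.
CREATE NONCLUSTERED INDEX IX_QualifierTest_AB
    ON dbo.QualifierTest (QualifierA, QualifierB)
    INCLUDE (RowIdentifier);

-- Report logical reads for the query, then inspect the actual plan too.
SET STATISTICS IO ON;

SELECT RowIdentifier
FROM dbo.QualifierTest
WHERE QualifierA = 'exampleA' AND QualifierB = 'exampleB';

SET STATISTICS IO OFF;
```

Dropping the index and recreating it as Scenario A or B, then rerunning the query, gives a like-for-like comparison of reads and plan shape.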

What are the advantages of a non-clustered index over a primary key (clustered index)

I have a table (it stores forum data, so normally no edits or updates, just inserts) with a primary key column, which as we know is a clustered index.
Please tell me, will I get any advantage if I create a non-clustered index on that column (the primary key column)?
EDIT: My table currently has around 60000 records. Which is better: adding a non-clustered index to it, or creating an identical new table, creating the index, and then copying the records from the old table to the new one?
Thanks
Every table should have a clustered index
Non-clustered indexes allow INCLUDEs which is very useful
Non-clustered indexes allow filtering (filtered indexes with a WHERE clause) in SQL Server 2008+
Notes:
Primary key is a constraint which happens to be a clustered index by default
One clustered index only, many non-clustered indexes
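The INCLUDE and filtering points in the list above can be sketched on a hypothetical forum table. All table, column, and index names here are assumptions, since the question does not show its schema.

```sql
-- Hypothetical forum table; the primary key is clustered by default,
-- stated explicitly here for clarity.
CREATE TABLE dbo.ForumPost
(
    PostId    INT IDENTITY(1,1) NOT NULL,
    ThreadId  INT               NOT NULL,
    PostedAt  DATETIME2         NOT NULL,
    IsDeleted BIT               NOT NULL DEFAULT 0,
    Body      NVARCHAR(MAX)     NOT NULL,
    CONSTRAINT PK_ForumPost PRIMARY KEY CLUSTERED (PostId)
);

-- A non-clustered index can carry extra leaf-level columns (INCLUDE)
-- and can be restricted to a subset of rows (filtered index, 2008+).
CREATE NONCLUSTERED INDEX IX_ForumPost_Thread_Live
    ON dbo.ForumPost (ThreadId, PostedAt)
    INCLUDE (IsDeleted)
    WHERE IsDeleted = 0;

-- A query whose predicate matches the filter can use the smaller index.
SELECT PostId, ThreadId, PostedAt
FROM dbo.ForumPost
WHERE ThreadId = 42 AND IsDeleted = 0
ORDER BY PostedAt;
```

Neither option is available on the clustered index itself, because its leaf level already is the full set of rows and columns.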
One advantage: you can INCLUDE other columns in the index.
A clustered index specifies the physical storage order of the table data (this is why there can only be one clustered index per table).
If there is no clustered index, inserts will typically be faster since the data doesn't have to be stored in a specific order but can just be appended at the end of the table.
On the other hand, index searches on the key column will typically be slower, since the searches cannot use the advantages of the clustered index.
The only possible advantage that I can see comes from the fact that the entries on the leaf pages of a nonclustered index are not as wide. They only contain the index columns, while the clustered index's leaf pages are the actual rows of data. Therefore, if you need something like select count(your_column_name) from your_table, then scanning the nonclustered index will involve a considerably smaller number of data pages. Or, if the number of index columns is greater than one and you run any query which does not need data from non-indexed columns, then again a nonclustered index scan will be faster.
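To see the page-count difference for yourself, a comparison along these lines may help. The table and index names are hypothetical, and the index hints are there only to force each access path for the comparison, not as a recommendation for production queries.

```sql
-- Hypothetical wide table: Body makes the clustered index rows large.
CREATE TABLE dbo.ForumMessage
(
    MessageId INT IDENTITY(1,1) NOT NULL,
    AuthorId  INT               NOT NULL,
    Body      NVARCHAR(MAX)     NOT NULL,
    CONSTRAINT PK_ForumMessage PRIMARY KEY CLUSTERED (MessageId)
);

-- Narrow non-clustered index: leaf rows hold only AuthorId and MessageId.
CREATE NONCLUSTERED INDEX IX_ForumMessage_AuthorId
    ON dbo.ForumMessage (AuthorId);

SET STATISTICS IO ON;

-- Forced scan of the clustered index (the full rows).
SELECT COUNT(AuthorId)
FROM dbo.ForumMessage WITH (INDEX(PK_ForumMessage));

-- Forced scan of the narrow non-clustered index.
SELECT COUNT(AuthorId)
FROM dbo.ForumMessage WITH (INDEX(IX_ForumMessage_AuthorId));

SET STATISTICS IO OFF;
```

After loading a realistic number of rows, the logical reads reported for the two statements show how many pages each scan touched.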