Are indexes basically a copy of the tables? - sql

Today I went to a job interview, and while I was there I heard that "indexes are basically clones of the tables on which they're made".
Could someone comment on this statement? Honestly, I've never heard this kind of index definition.

Not really, although they could be.
Every index (including the clustered index) will be using the index keys in all of its internal nodes. What's different is what happens when we reach the leaves of the index.
In a normal, old-school non-clustered index in SQL Server, what you'll find in the leaves, alongside that index's own keys, are the key values for the clustered index (or some form of row ID for heap tables). In the clustered index, by contrast, the leaves hold the values for all columns of the table, not just the clustered keys.
INCLUDE in indexes muddies the water somewhat by including extra columns at the leaf level in non-clustered indexes.
If the total set of columns in (index keys, clustered-index keys, included columns) for a non-clustered index is the same as the set of all columns in the table, then to an extent the non-clustered index does seem to be a copy of the table - at least to the extent that any query making use of this index will not have to perform any table-lookups to retrieve all data.
If the set of columns above isn't the same as the set of all columns in the table then it's not a copy of the table. It's a copy of a subset of columns of the table. Of course, if this subset of columns are all of the columns required by a particular query then a table lookup can still be avoided.
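As a rough illustration of the "copy of the table" case, here is a minimal T-SQL sketch. The table dbo.Orders, its columns and the index name are hypothetical, made up for this example only. The index keys, the included columns and the clustered key together add up to every column in the table, so the leaf level of the non-clustered index ends up holding a full copy of each row.

```sql
-- Hypothetical table: the primary key creates the clustered index by default.
CREATE TABLE dbo.Orders (
    OrderId    INT           NOT NULL PRIMARY KEY CLUSTERED,
    CustomerId INT           NOT NULL,
    OrderDate  DATE          NOT NULL,
    Amount     DECIMAL(10,2) NOT NULL
);

-- Index key: CustomerId. Included at the leaf level: OrderDate, Amount.
-- The clustered key (OrderId) is carried along as the row locator,
-- so the leaf level holds every column of the table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, Amount);

-- This query can be answered from the non-clustered index alone,
-- without a lookup into the clustered index.
SELECT OrderId, CustomerId, OrderDate, Amount
FROM dbo.Orders
WHERE CustomerId = 42;
```

Dropping Amount from the INCLUDE list would turn the index into a copy of a subset of the columns, and the same query would then need a lookup to fetch Amount.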

If you're talking about a clustered index, then it's true. Just check the documentation:
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the index
definition. There can be only one clustered index per table, because
the data rows themselves can be stored in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table has
no clustered index, its data rows are stored in an unordered structure
called a heap.
But if you're talking about a non-clustered index, then it's false, because the index is stored separately from the table's data rows. In this case the index is a separate object with its own data structure.
Nonclustered indexes have a structure separate from the data rows. A
nonclustered index contains the nonclustered index key values and each
key value entry has a pointer to the data row that contains the key
value.
The pointer from an index row in a nonclustered index to a data row is
called a row locator. The structure of the row locator depends on
whether the data pages are stored in a heap or a clustered table. For
a heap, a row locator is a pointer to the row. For a clustered table,
the row locator is the clustered index key.
You can add nonkey columns to the leaf level of the nonclustered index
to by-pass existing index key limits, and execute fully covered,
indexed, queries. For more information, see Create Indexes with
Included Columns. For details about index key limits see Maximum
Capacity Specifications for SQL Server.

Related

Does non-clustered index contain any data?

So every resource online seems to just copy-paste the same phrase: "a non-clustered index does not contain data; instead it contains pointers to the actual data".
But in that case, how does the RDBMS know how to sort it? It does not make sense to my smooth brain.
Let's say I have an employee table with ID and LastName. Isn't it the case that if I create an NCI on LastName, the leaf page will contain that name value and a pointer to the row in the table?
What does this mysterious index page actually contain?
employee
ID, LastName
1, CrookedTeeth
2, Bob
Index
Bob, pointer ->2
CrookedTeeth, pointer -> 1
There are a lot of comments on this one, but the summary is:
Nonclustered indexes do contain data
The data contained in a nonclustered index are the columns that index is defined on (covers)
Any columns listed in an INCLUDE clause are also persisted in the index
And finally the row locator is also contained in the nonclustered index
The row locator will be the clustered index key when the table has a clustered index (and therefore stored in a B-Tree data structure)
The larger the clustered index key, the larger the nonclustered index will be
Otherwise the row locator will be the RID (Row ID) when there is no clustered index on the table, meaning the table is stored in a Heap data structure
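To make the summary concrete, here is a small T-SQL sketch based on the employee table from the question. The column types, the clustered primary key on ID and the index name are assumptions added for illustration; the question does not specify them.

```sql
-- Assumed definition of the employee table from the question.
CREATE TABLE dbo.employee (
    ID       INT          NOT NULL PRIMARY KEY CLUSTERED,
    LastName VARCHAR(100) NOT NULL
);

INSERT INTO dbo.employee (ID, LastName)
VALUES (1, 'CrookedTeeth'), (2, 'Bob');

-- The index stores LastName values in sorted order. Because the table
-- has a clustered index on ID, each index row also stores the ID value
-- as its row locator.
CREATE NONCLUSTERED INDEX IX_employee_LastName
    ON dbo.employee (LastName);

-- Both requested columns live in the non-clustered index
-- (LastName as the key, ID as the row locator), so no trip to the
-- clustered index is needed to answer this query.
SELECT ID, LastName
FROM dbo.employee
WHERE LastName = 'Bob';
```

If the primary key were declared NONCLUSTERED and the table had no clustered index, the table would be a heap and the row locator would be a RID instead of the ID value.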

Are there performance differences in queries with UNIQUE NON NULL indexes and Primary keys?

I want to search a DB with either the PK or a unique non null field that is indexed. Are there any performance differences between those? I am using Postgres as my DB. But a general DB-independent answer would be good too.
In PostgreSQL, all indexes are secondary or unclustered indexes. That means the index points to the heap, the data structure holding the actual column data. So, a primary key's index doesn't have any structural advantage over a UNIQUE index: SELECTs using the index for filtering must then bounce over to the heap for the data.
In fact, it might be the other way around, because PostgreSQL indexes can have INCLUDE clauses.
For example consider a table with uniqueid, a, b, and c columns. If your workload is heavy with SELECT b FROM tbl WHERE uniqueid = something queries, you can declare this covering index.
CREATE UNIQUE INDEX uniq ON tbl(uniqueid) INCLUDE (b);
Your whole query can then be satisfied from the index. That saves the extra trip to the heap, and so saves IO and CPU time.
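One way to check whether the heap visit is actually avoided is to look at the plan. The sketch below assumes PostgreSQL 11 or later (where INCLUDE is available); the column types are made up, since the answer only names the columns.

```sql
-- Assumed column types; only the column names come from the example above.
CREATE TABLE tbl (
    uniqueid integer NOT NULL,
    a        text,
    b        text,
    c        text
);

-- Covering index: uniqueid is the key, b is stored in the leaf entries.
CREATE UNIQUE INDEX uniq ON tbl (uniqueid) INCLUDE (b);

-- Index-only scans rely on the visibility map, which VACUUM maintains.
-- Note that VACUUM cannot run inside a transaction block.
VACUUM ANALYZE tbl;

-- Look for "Index Only Scan using uniq" in the plan output.
EXPLAIN SELECT b FROM tbl WHERE uniqueid = 123;
```

On an empty or very small table the planner may still pick a different plan, so load a representative amount of data before drawing conclusions.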
MySQL (with InnoDB) and SQL Server (by default), on the other hand, use clustered indexes for their primary keys. That is, the table's data is stored in the primary key's index. So, the PK is, automatically, basically an index created like this.
CREATE UNIQUE INDEX pk ON tbl(uniqueid) INCLUDE (a, b, c);
In those databases the PK's index does have an advantage over a separate UNIQUE index, which necessarily is a secondary or unclustered index. (Note: MySQL's indexes don't have INCLUDE() clauses.)

SQL: Add Primary Key to Non-Unique Index

Let's say a query is filtering on two fields and returning primary key values.
SELECT RowIdentifier
FROM [dbo].[Table]
WHERE QualifierA = 'exampleA' AND QualifierB = 'exampleB'
Assuming the clustered index is not the primary key, would a non-unique index that contains QualifierA and QualifierB be best served by adding RowIdentifier as a key column (Scenario A and Scenario B)? Or would it be more appropriate to simply include it (Scenario C)?
Scenario A: Non-Unique, Non-Clustered
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario B: Unique, Non-Clustered
CREATE UNIQUE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario C:
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
Finally, I'm assuming that if the primary key were the clustered index, neither would be necessary. Is this accurate?
If there is a CLUSTERED index, its key is automatically included in all nonclustered indexes on the table. You can explicitly include it, but it is not required.
The UNIQUE index simply enforces uniqueness. The PK should already have this constraint. You do not need to re-enforce it in every index.
If you are including the PK in your where clause, it will almost certainly use the PK index to find that row because it is guaranteed to return the fewest results, so including it in your index gains you nothing for lookups. It could also potentially skew the cardinality estimates and make SQL Server think the index is more distinct than it really is.
For the above reasons, I would select Scenario C
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
I would use this regardless of which column is clustered. This will give you the performance, ensure the index will continue to perform regardless of the CLUSTERED INDEX, and make it explicit what the index is used for.
I'm wondering what's more appropriate? A non-clustered unique index incorporating all three fields, or a non-clustered non-unique index incorporating just the two fields (QualifierA and QualifierB) but including the primary key.
There's a third option. A non-clustered, non-unique index incorporating all three fields.
When you make an index, the fields in the index are copied into a separate structure so the server can get at those fields with ease. If you only have QualifierA and QualifierB in the index, it will find the rows in that index that meet your criteria and then go back to the main table to pick up the RowIdentifier. Instead, include all three in there to improve performance.
Remember, make sure you put QualifierA and QualifierB before RowIdentifier in your index. The order of the columns determines how the data is ordered.
Try it out with some test data if you like, and look at the query plan to see what it's doing.
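A possible test harness for that experiment is sketched below in T-SQL. The table dbo.TestTable, the CreatedAt column, the column types and all index names are hypothetical; the only things taken from the question are the three column names and the query shape. The index shown is the INCLUDE variant; swap in whichever definition you want to compare.

```sql
-- Hypothetical test table. The clustered index is deliberately NOT on
-- the primary key, matching the assumption in the question.
CREATE TABLE dbo.TestTable (
    RowIdentifier INT         NOT NULL,
    QualifierA    VARCHAR(50) NOT NULL,
    QualifierB    VARCHAR(50) NOT NULL,
    CreatedAt     DATETIME2   NOT NULL,
    CONSTRAINT PK_TestTable PRIMARY KEY NONCLUSTERED (RowIdentifier)
);

CREATE CLUSTERED INDEX CIX_TestTable_CreatedAt
    ON dbo.TestTable (CreatedAt);

-- Index under test: two key columns plus RowIdentifier at the leaf level.
CREATE NONCLUSTERED INDEX IX_TestTable_Qualifiers
    ON dbo.TestTable (QualifierA, QualifierB)
    INCLUDE (RowIdentifier);

-- Load some representative test data here before measuring.

-- Report logical reads per statement in the Messages output.
SET STATISTICS IO ON;

SELECT RowIdentifier
FROM dbo.TestTable
WHERE QualifierA = 'exampleA' AND QualifierB = 'exampleB';

SET STATISTICS IO OFF;
```

Comparing the logical reads and the actual execution plan (index seek with or without a key lookup) across the index variants shows which one the optimizer really benefits from.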

Unique vs NonUnique Clustered Index to speed searches on a NonUnique field

I have a million row dataset for which I regularly join on Column A.
To speed up joining I'm going to create a clustered index on Column A.
Column A is non unique but (Column A, Column B) is a unique pairing.
I will never use Column B in a where clause or join.
Am I better to create a non unique clustered index on just Column A or to create a unique clustered index on (Column A, Column B)?
You would create a unique index on A,B to enforce uniqueness of the values. This is enforced at the database level, so you will be prevented from inserting duplicate values into the database.
A unique index can be used for resolving queries that need the first columns in the index but not necessarily all of them. So, the unique index is fine for queries on A.
I would say create the unique index. There are two things to keep in mind. The first is if B is a large data type -- like char(500). These values are stored in the index, so including B might make the index rather large.
Second, if the data are not being inserted in A, B order, then making it a clustered index could incur performance overhead on inserts and deletes. New inserts would end up going on a random page, which would likely be filled and then require splitting (or you can use the fill factors of pages to reserve extra space for inserts, at the cost of making the table initially bigger).
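For reference, the unique clustered index with reserved page space could look roughly like this in T-SQL. The table name dbo.MyTable, the column names and the fill factor of 80 are placeholders chosen for illustration, not recommendations.

```sql
-- Hypothetical table and column names standing in for
-- "Column A" and "Column B" from the question.
CREATE UNIQUE CLUSTERED INDEX CIX_MyTable_A_B
    ON dbo.MyTable (ColumnA, ColumnB)
    WITH (FILLFACTOR = 80);  -- leave roughly 20% of each leaf page free when the index is built

-- A filter (or join) on ColumnA alone can still seek on this index,
-- because ColumnA is its leading column.
SELECT ColumnA, ColumnB
FROM dbo.MyTable
WHERE ColumnA = 123;
```

The fill factor only applies when the index is created or rebuilt, so the free space gets used up over time unless the index is rebuilt periodically.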

What are the advantages of a non-clustered index over a primary key (clustered index)?

I have a table (it stores forum data, so normally there are no edits or updates, just inserts) with a primary key column which, as we know, is a clustered index.
Please tell me, will I get any advantage if I create a non-clustered index on that column (the primary key column)?
EDIT: my table currently has around 60,000 records. Which would be better: adding a non-clustered index to it, or creating an identical new table, creating the index there, and then copying the records from the old table to the new one?
Thanks
Every table should have a clustered index
Non-clustered indexes allow INCLUDEs which is very useful
Non-clustered indexes can be filtered (defined with a WHERE clause) in SQL Server 2008+
Notes:
Primary key is a constraint which happens to be a clustered index by default
One clustered index only, many non-clustered indexes
One advantage: you can INCLUDE other columns in the index.
A clustered index specifies the physical storage order of the table data (this is why there can only be one clustered index per table).
If there is no clustered index, inserts will typically be faster since the data doesn't have to be stored in a specific order but can just be appended at the end of the table.
On the other hand, index searches on the key column will typically be slower, since the searches cannot use the advantages of the clustered index.
The only possible advantage that I can see comes from the fact that the entries on the leaf pages of a nonclustered index are not as wide. They only contain index columns, while the clustered index's leaf pages are the actual rows of data. Therefore, if you need something like select count(your_column_name) from your_table, then scanning the nonclustered index will involve a considerably smaller number of data pages. Or if the number of index columns is greater than one and you run any query which does not need data from non-indexed columns, then again, a nonclustered index scan will be faster.
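The page-count difference can be observed with a sketch like the following T-SQL. The table dbo.ForumPost, its columns and the index name are invented for illustration; the table in the question is not described in enough detail to reproduce.

```sql
-- Hypothetical forum table with a wide row and a clustered primary key.
CREATE TABLE dbo.ForumPost (
    PostId   INT            NOT NULL PRIMARY KEY CLUSTERED,
    AuthorId INT            NOT NULL,
    Title    NVARCHAR(200)  NOT NULL,
    Body     NVARCHAR(4000) NOT NULL
);

-- Narrow index: its leaf rows hold only AuthorId plus the clustered key.
CREATE NONCLUSTERED INDEX IX_ForumPost_AuthorId
    ON dbo.ForumPost (AuthorId);

-- Load some representative test data here before measuring.

SET STATISTICS IO ON;

-- Optimizer's choice; it typically scans the narrowest index that
-- contains the column.
SELECT COUNT(AuthorId) FROM dbo.ForumPost;

-- Same count, forced through the clustered index (index id 1)
-- so the logical reads can be compared.
SELECT COUNT(AuthorId) FROM dbo.ForumPost WITH (INDEX(1));

SET STATISTICS IO OFF;
```

With realistic data volumes, the first statement would be expected to report noticeably fewer logical reads than the second, which is the effect described above.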