what is index and can non-clustered index be non-unique? - sql

Subquestion to my question [1]:
All definitions of a (MS SQL Server) index that I could find are ambiguous, and all explanations based on them rely on undefined or ambiguously defined terms.
What is the definition of an index?
For example, the most common definition of an index, from Wikipedia (http://en.wikipedia.org/wiki/Index_(database) ):
1) "A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table..."
2) "SQL server creates a clustered index on a primary key by default[1]. The data is present in random order, but the logical ordering is specified by the index. The data rows may be randomly spread throughout the table. The non-clustered index tree contains the index keys in sorted order, with the leaf level of the index containing the pointer to the page and the row number in the data page"
Well, it is ambiguous. By "index" one can understand:
1) an ordered data structure, a tree, containing intermediate and leaf nodes;
2) leaf node data containing values from indexed columns + "pointer to the page and the row number in the data page"
Can a non-clustered index be non-unique, considering 2)? Or even 1)?
It doesn't seem so to me ...
But does T-SQL imply the existence of non-unique non-clustered indexes?
If yes, then what is meant by a non-clustered index in "CREATE INDEX (Transact-SQL)"[2], and what does the UNIQUE argument apply to there?
Is it:
3) leaf node data containing values from indexed columns, i.e. as in 2) but without the pointer and row number?
If it is 3), then question 1) arises again: why apply constraints to a copy of the real data in the "index" instead of to the real data in situ?
Update:
Isn't the bookmark (pointer + row number) to a real data row unique, i.e. doesn't it uniquely identify the row?
Doesn't this bookmark constitute part of the index and thereby make the index unique?
Can you give me the definition of an index instead of explaining how to use something that is left UNDEFINED? The latter I already know (or can read about myself).
[1] "UNIQUE argument for INDEX creation - what's for?"
[2] CREATE INDEX (Transact-SQL), http://msdn.microsoft.com/en-us/library/ms188783.aspx

An index is a data structure designed to optimize querying large data sets. As such, no claim is made about whether or not anything is unique at this point.
You can definitely have non-unique non-clustered indexes - how else could you index on (lastname, firstname)? That is never going to be unique (e.g. on Facebook).
You can define an index as being unique - this just adds the extra check to it that no duplicate values are allowed. If you would make your index on (lastname, firstname) UNIQUE, then the second Brad Pitt to sign up on your site couldn't do so, since that unique index would reject his data.
One exception is the primary key on any given table. The primary key is the logical identifier used to uniquely and precisely identify each single row in your database. As such, it must be unique over all rows and cannot contain any NULL values.
The clustered index in SQL Server is special in that it contains the actual data in its leaf nodes. There's no restriction up to this point. However, the clustered index is also used to uniquely (physically) locate the data in your database, and thus the clustered index key must be unique: it must be able to tell Brad Pitt #1 and Brad Pitt #2 apart. If you don't take care to provide a unique set of columns for your clustered index, SQL Server will add a "uniqueifier" (a 4-byte INT) to those rows that aren't unique, e.g. you'd get BradPitt001 and BradPitt002 (or something like that).
The clustered index is used as the "pointer" to the actual data row in your SQL Server table, so it's included in every single non-clustered index, too. So your non-clustered, non-unique index on (lastname, firstname) would not only contain these two fields, but in reality, it also contains the clustered key on that table - that's why it's important the clustered key on a SQL Server table is small, stable, and unique - typically an INT.
So your non-clustered index on (lastname, firstname) will really have (lastname, firstname, personID) and will have entries like (Pitt, Brad, 10176), (Pitt, Brad, 17665) and so forth. When you search for "Brad Pitt" in your non-clustered index, SQL Server will now find these two entries, and for both, it has the "physical pointer" to where to find the rest of the data for those two guys, so if you ask for more than just the first- and last name, SQL Server could now go grab the whole row for each of the two Brad Pitt entries and provide you with the data the query requires.
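The lookup described above can be sketched in a few lines of Python. This is only a toy model, not how SQL Server stores anything: the `clustered` dictionary stands in for the clustered index, the sorted list stands in for the non-clustered index on (lastname, firstname), and the `city` column, the third person, and the function name `find_by_name` are made up for illustration.

```python
import bisect

# Toy clustered index: the clustering key (personID) maps to the full row.
# The "city" values and the third row are hypothetical sample data.
clustered = {
    10176: {"lastname": "Pitt", "firstname": "Brad", "city": "Los Angeles"},
    17665: {"lastname": "Pitt", "firstname": "Brad", "city": "Springfield"},
    20001: {"lastname": "Jolie", "firstname": "Angelina", "city": "New York"},
}

# Toy non-clustered index on (lastname, firstname). Each entry also carries
# the clustering key, so entries stay distinct even when the names repeat.
nonclustered = sorted(
    (row["lastname"], row["firstname"], person_id)
    for person_id, row in clustered.items()
)


def find_by_name(lastname, firstname):
    """Return the full rows of every person with the given name."""
    # Seek to the first entry whose (lastname, firstname) prefix matches.
    # A 2-tuple sorts before any 3-tuple that starts with the same values.
    start = bisect.bisect_left(nonclustered, (lastname, firstname))
    rows = []
    for last, first, person_id in nonclustered[start:]:
        if (last, first) != (lastname, firstname):
            break
        # Follow the clustering key back to the full row.
        rows.append(clustered[person_id])
    return rows
```

Calling `find_by_name("Pitt", "Brad")` walks the two adjacent index entries and fetches both full rows through their personID values.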

The definition of an index is the first part of the Wikipedia definition: "A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space."
Then you have unique indexes, as a special kind of index, which ensure that indexed values are unique.
How it's implemented... depends on the DBMS.
But it does not change the definition of index, or unique index.
As an implementation detail, MS SQL Server allows non-clustered indexes (the usual kind: a tree with pointers to the actual row contents stored in a separate space, which you numbered 2.) and clustered indexes (where rows are stored in the index itself, ordered by the indexed value, which you numbered 1.).
So a non-unique non-clustered index is just (conceptually) a tree of values with, for each value, a set of pointers to the table rows containing this value.
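As a rough illustration of that conceptual picture, here is a small Python sketch. The class name `ToyIndex` and its methods are invented for this example, and a dictionary replaces the tree, so it shows only the unique versus non-unique behaviour, not the real storage layout.

```python
from collections import defaultdict


class ToyIndex:
    """Maps an indexed value to the row locators of the rows containing it."""

    def __init__(self, unique=False):
        self.unique = unique
        self.entries = defaultdict(list)  # value -> list of row locators

    def insert(self, value, row_locator):
        # A unique index adds exactly one extra check on insert.
        if self.unique and value in self.entries:
            raise ValueError(f"duplicate key {value!r} in unique index")
        self.entries[value].append(row_locator)

    def lookup(self, value):
        # A non-unique index may return several locators for one value.
        return list(self.entries.get(value, []))
```

With `unique=False` the second ("Pitt", "Brad") is simply appended to the same value's list of locators; with `unique=True` the same insert raises an error, which is the only difference between the two kinds of index in this model.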

Related

when to use clustered index and when to use nonclustered index [duplicate]

I have a limited exposure to DB and have only used DB as an application programmer. I want to know about Clustered and Non clustered indexes.
I googled and what I found was :
A clustered index is a special type of index that reorders the way
records in the table are physically
stored. Therefore table can have only
one clustered index. The leaf nodes
of a clustered index contain the data
pages. A nonclustered index is a
special type of index in which the
logical order of the index does not
match the physical stored order of
the rows on disk. The leaf node of a
nonclustered index does not consist of
the data pages. Instead, the leaf
nodes contain index rows.
What I found in SO was What are the differences between a clustered and a non-clustered index?.
Can someone explain this in plain English?
With a clustered index the rows are stored physically on the disk in the same order as the index. Therefore, there can be only one clustered index.
With a non clustered index there is a second list that has pointers to the physical rows. You can have many non clustered indices, although each new index will increase the time it takes to write new records.
It is generally faster to read from a clustered index if you want to get back all the columns. You do not have to go first to the index and then to the table.
Writing to a table with a clustered index can be slower, if there is a need to rearrange the data.
A clustered index means you are telling the database to store close values actually close to one another on the disk. This has the benefit of rapid scan / retrieval of records falling into some range of clustered index values.
For example, you have two tables, Customer and Order:
Customer
----------
ID
Name
Address

Order
----------
ID
CustomerID
Price
If you wish to quickly retrieve all orders of one particular customer, you may wish to create a clustered index on the "CustomerID" column of the Order table. This way the records with the same CustomerID will be physically stored close to each other on disk (clustered) which speeds up their retrieval.
P.S. The index on CustomerID will obviously not be unique, so you either need to add a second field to "uniquify" the index or let the database handle that for you, but that's another story.
Regarding multiple indexes. You can have only one clustered index per table because this defines how the data is physically arranged. If you wish an analogy, imagine a big room with many tables in it. You can either put these tables to form several rows or pull them all together to form a big conference table, but not both ways at the same time. A table can have other indexes, they will then point to the entries in the clustered index which in its turn will finally say where to find the actual data.
In SQL Server, with row-oriented storage, both clustered and nonclustered indexes are organized as B-trees.
The key difference between clustered indexes and non clustered indexes is that the leaf level of the clustered index is the table. This has two implications.
The rows on the clustered index leaf pages always contain something for each of the (non-sparse) columns in the table (either the value or a pointer to the actual value).
The clustered index is the primary copy of a table.
Non clustered indexes can also do point 1 by using the INCLUDE clause (Since SQL Server 2005) to explicitly include all non-key columns but they are secondary representations and there is always another copy of the data around (the table itself).
CREATE TABLE T
(
A INT,
B INT,
C INT,
D INT
)
CREATE UNIQUE CLUSTERED INDEX ci ON T(A, B)
CREATE UNIQUE NONCLUSTERED INDEX nci ON T(A, B) INCLUDE (C, D)
The two indexes above will be nearly identical, with the upper-level index pages containing values for the key columns A, B and the leaf-level pages containing A, B, C, D.
There can be only one clustered index per table, because the data rows
themselves can be sorted in only one order.
The above quote from SQL Server Books Online causes much confusion.
In my opinion, it would be much better phrased as:
There can be only one clustered index per table because the leaf level rows of the clustered index are the table rows.
The Books Online quote is not incorrect, but you should be clear that the "sorting" of both non clustered and clustered indices is logical, not physical. If you read the pages at leaf level by following the linked list, and read the rows on each page in slot array order, then you will read the index rows in sorted order, but physically the pages may not be sorted. The commonly held belief that with a clustered index the rows are always stored physically on the disk in the same order as the index key is false.
This would be an absurd implementation. For example, if a row is inserted into the middle of a 4GB table SQL Server does not have to copy 2GB of data up in the file to make room for the newly inserted row.
Instead, a page split occurs. Each page at the leaf level of both clustered and non clustered indexes has the address (File: Page) of the next and previous page in logical key order. These pages need not be either contiguous or in key order.
e.g. the linked page chain might be 1:2000 <-> 1:157 <-> 1:7053
When a page split happens a new page is allocated from anywhere in the filegroup (from either a mixed extent, for small tables or a non-empty uniform extent belonging to that object or a newly allocated uniform extent). This might not even be in the same file if the filegroup contains more than one.
The degree to which the logical order and contiguity differ from the idealized physical version is the degree of logical fragmentation.
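The page-split behaviour described above can be mimicked with a small Python model. It is a sketch under heavy simplification: `PAGE_CAPACITY`, the class name `LeafChain`, and the rule that new pages simply get the next free page id are all assumptions made for illustration, and a Python list stands in for the previous/next page links.

```python
PAGE_CAPACITY = 4  # rows per leaf page in this toy model (an assumption)


class LeafChain:
    """Leaf pages linked in logical key order; page ids reflect allocation order."""

    def __init__(self):
        self.pages = {0: []}   # page_id -> sorted keys stored on that page
        self.order = [0]       # page ids in logical (key) order
        self.next_page_id = 1

    def insert(self, key):
        # Pick the last page (in logical order) whose first key is <= key.
        pos = 0
        for i, page_id in enumerate(self.order):
            if self.pages[page_id] and self.pages[page_id][0] <= key:
                pos = i
        page_id = self.order[pos]
        page = self.pages[page_id]
        page.append(key)
        page.sort()
        if len(page) > PAGE_CAPACITY:
            # Page split: move the upper half to a newly allocated page and
            # link it in right after the old page. No other page is moved.
            half = len(page) // 2
            new_id = self.next_page_id
            self.next_page_id += 1
            self.pages[new_id] = page[half:]
            self.pages[page_id] = page[:half]
            self.order.insert(pos + 1, new_id)

    def keys_in_logical_order(self):
        # Follow the page chain, reading each page's rows in order.
        return [k for page_id in self.order for k in self.pages[page_id]]
```

Inserting 10, 20, 30, 40, 50 and then 15, 12, 17 causes two splits. Following the chain still returns the keys in sorted order, but the chain visits the pages in the order 0, 2, 1, which is a small-scale version of logical fragmentation.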
In a newly created database with a single file, I ran the following.
CREATE TABLE T
(
X TINYINT NOT NULL,
Y CHAR(3000) NULL
);
CREATE CLUSTERED INDEX ix
ON T(X);
GO
--Insert 100 rows with values 1 - 100 in random order
DECLARE @C1 AS CURSOR,
@X AS INT
SET @C1 = CURSOR FAST_FORWARD
FOR SELECT number
FROM master..spt_values
WHERE type = 'P'
AND number BETWEEN 1 AND 100
ORDER BY CRYPT_GEN_RANDOM(4)
OPEN @C1;
FETCH NEXT FROM @C1 INTO @X;
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO T (X)
VALUES (@X);
FETCH NEXT FROM @C1 INTO @X;
END
Then checked the page layout with
SELECT page_id,
X,
geometry::Point(page_id, X, 0).STBuffer(1)
FROM T
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%)
ORDER BY page_id
The results were all over the place. The first row in key order (with value 1 - highlighted with an arrow below) was on nearly the last physical page.
Fragmentation can be reduced or removed by rebuilding or reorganizing an index to increase the correlation between logical order and physical order.
After running
ALTER INDEX ix ON T REBUILD;
I got the following
If the table has no clustered index it is called a heap.
Non clustered indexes can be built on either a heap or a clustered index. They always contain a row locator back to the base table. In the case of a heap, this is a physical row identifier (RID) and consists of three components (File:Page:Slot). In the case of a clustered index, the row locator is logical (the clustered index key).
For the latter case, if the non clustered index already naturally includes the CI key column(s), either as NCI key columns or INCLUDE-d columns, then nothing is added. Otherwise, the missing CI key column(s) silently get added to the NCI.
SQL Server always ensures that the key columns are unique for both types of indexes. The mechanism in which this is enforced for indexes not declared as unique differs between the two index types, however.
Clustered indexes get a uniquifier added for any rows with key values that duplicate an existing row. This is just an ascending integer.
For non clustered indexes not declared as unique SQL Server silently adds the row locator into the non clustered index key. This applies to all rows, not just those that are actually duplicates.
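The two mechanisms can be contrasted with a short Python sketch. It models only the idea: the function names are invented, `None` stands for "no uniquifier stored", the numbering of uniquifier values is an assumption, and the File:Page:Slot strings are made-up row locators.

```python
from collections import Counter


def clustered_keys(key_values):
    """Pair each key with a uniquifier, but only for rows that duplicate
    a key that was inserted earlier (the first occurrence gets None)."""
    seen = Counter()
    result = []
    for key in key_values:
        uniquifier = seen[key] if seen[key] else None
        seen[key] += 1
        result.append((key, uniquifier))
    return result


def nonclustered_keys(rows):
    """rows: iterable of (index_key, row_locator).
    The row locator becomes part of the key for every row, duplicate or not."""
    return sorted((key, locator) for key, locator in rows)
```

In `clustered_keys` only the second and third "Pitt" receive a value, whereas `nonclustered_keys` extends every entry, which matches the distinction drawn above between the two index types.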
The clustered vs non clustered nomenclature is also used for column store indexes. The paper Enhancements to SQL Server Column Stores states
Although column store data is not really "clustered" on any key, we
decided to retain the traditional SQL Server convention of referring
to the primary index as a clustered index.
I realize this is a very old question, but I thought I would offer an analogy to help illustrate the fine answers above.
CLUSTERED INDEX
If you walk into a public library, you will find that the books are all arranged in a particular order (most likely the Dewey Decimal System, or DDS). This corresponds to the "clustered index" of the books. If the DDS# for the book you want was 005.7565 F736s, you would start by locating the row of bookshelves that is labeled 001-099 or something like that. (This endcap sign at the end of the stack corresponds to an "intermediate node" in the index.) Eventually you would drill down to the specific shelf labelled 005.7450 - 005.7600, then you would scan until you found the book with the specified DDS#, and at that point you have found your book.
NON-CLUSTERED INDEX
But if you didn't come into the library with the DDS# of your book memorized, then you would need a second index to assist you. In the olden days you would find at the front of the library a wonderful bureau of drawers known as the "Card Catalog". In it were thousands of 3x5 cards -- one for each book, sorted in alphabetical order (by title, perhaps). This corresponds to the "non-clustered index". These card catalogs were organized in a hierarchical structure, so that each drawer would be labeled with the range of cards it contained (Ka - Kl, for example; i.e., the "intermediate node"). Once again, you would drill in until you found your book, but in this case, once you have found it (i.e, the "leaf node"), you don't have the book itself, but just a card with an index number (the DDS#) with which you could find the actual book in the clustered index.
Of course, nothing would stop the librarian from photocopying all the cards and sorting them in a different order in a separate card catalog. (Typically there were at least two such catalogs: one sorted by author name, and one by title.) In principle, you could have as many of these "non-clustered" indexes as you want.
Find below some characteristics of clustered and non-clustered indexes:
Clustered Indexes
Clustered indexes are indexes that uniquely identify the rows in an SQL table.
Every table can have at most one clustered index.
You can create a clustered index that covers more than one column. For example: CREATE CLUSTERED INDEX index_name ON table_name (col1, col2, ...).
By default, a column with a primary key already has a clustered index.
Non-clustered Indexes
Non-clustered indexes are like simple indexes. They are just used for fast retrieval of data and are not guaranteed to contain unique data.
Clustered Index
A clustered index determines the physical order of data in a table. For this reason, a table has only one clustered index (often on the primary key or a composite key).
Example: a dictionary needs no other index, because it is already ordered by word.
Nonclustered Index
A non-clustered index is analogous to the index in a book. The data is stored in one place; the index is stored in another place and has pointers to the storage location. This helps in searching the data quickly. For this reason, a table can have more than one non-clustered index.
Example: a biology book has a separate index at the start that points to chapter locations, and another index at the end that points to the locations of common words.
A very simple, non-technical rule-of-thumb would be that clustered indexes are usually used for your primary key (or, at least, a unique column) and that non-clustered are used for other situations (maybe a foreign key). Indeed, SQL Server will by default create a clustered index on your primary key column(s). As you will have learnt, the clustered index relates to the way data is physically sorted on disk, which means it's a good all-round choice for most situations.
Clustered Index
A Clustered Index is basically a tree-organized table. Instead of storing the records in an unsorted Heap table space, the clustered index is actually a B+Tree index whose Leaf Nodes, which are ordered by the clustering key column value, store the actual table records.
The Clustered Index is the default table structure in SQL Server and MySQL. While MySQL adds a hidden clustered index even if a table doesn't have a Primary Key, SQL Server builds a Clustered Index by default if a table has a Primary Key column. Otherwise, the SQL Server table is stored as a Heap Table.
The Clustered Index can speed up queries that filter records by the clustered index key, like the usual CRUD statements. Since the records are located in the Leaf Nodes, there's no additional lookup for extra column values when locating records by their Primary Key values.
For example, when executing the following SQL query on SQL Server:
SELECT PostId, Title
FROM Post
WHERE PostId = ?
You can see that the Execution Plan uses a Clustered Index Seek operation to locate the Leaf Node containing the Post record, and there are only two logical reads required to scan the Clustered Index nodes:
|StmtText |
|-------------------------------------------------------------------------------------|
|SELECT PostId, Title FROM Post WHERE PostId = @P0 |
| |--Clustered Index Seek(OBJECT:([high_performance_sql].[dbo].[Post].[PK_Post_Id]), |
| SEEK:([high_performance_sql].[dbo].[Post].[PostID]=[@P0]) ORDERED FORWARD) |
Table 'Post'. Scan count 0, logical reads 2, physical reads 0
Non-Clustered Index
Since the Clustered Index is usually built using the Primary Key column values, if you want to speed up queries that use some other column, then you'll have to add a Secondary Non-Clustered Index.
The Secondary Index is going to store the Primary Key value in its Leaf Nodes, as illustrated by the following diagram:
So, if we create a Secondary Index on the Title column of the Post table:
CREATE INDEX IDX_Post_Title on Post (Title)
And we execute the following SQL query:
SELECT PostId, Title
FROM Post
WHERE Title = ?
We can see that an Index Seek operation is used to locate the Leaf Node in the IDX_Post_Title index that can provide the SQL query projection we are interested in:
|StmtText |
|------------------------------------------------------------------------------|
|SELECT PostId, Title FROM Post WHERE Title = @P0 |
| |--Index Seek(OBJECT:([high_performance_sql].[dbo].[Post].[IDX_Post_Title]),|
| SEEK:([high_performance_sql].[dbo].[Post].[Title]=[@P0]) ORDERED FORWARD)|
Table 'Post'. Scan count 1, logical reads 2, physical reads 0
Since the associated PostId Primary Key column value is stored in the IDX_Post_Title Leaf Node, this query doesn't need an extra lookup to locate the Post row in the Clustered Index.
Clustered Index
Clustered indexes sort and store the data rows in the table or view based on their key values. These are the columns included in the index definition. There can be only one clustered index per table, because the data rows themselves can be sorted in only one order.
The only time the data rows in a table are stored in sorted order is when the table contains a clustered index. When a table has a clustered index, the table is called a clustered table. If a table has no clustered index, its data rows are stored in an unordered structure called a heap.
Nonclustered
Nonclustered indexes have a structure separate from the data rows. A nonclustered index contains the nonclustered index key values and each key value entry has a pointer to the data row that contains the key value.
The pointer from an index row in a nonclustered index to a data row is called a row locator. The structure of the row locator depends on whether the data pages are stored in a heap or a clustered table. For a heap, a row locator is a pointer to the row. For a clustered table, the row locator is the clustered index key.
You can add nonkey columns to the leaf level of the nonclustered index to by-pass existing index key limits, and execute fully covered, indexed, queries. For more information, see Create Indexes with Included Columns. For details about index key limits see Maximum Capacity Specifications for SQL Server.
Reference: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described
Let me offer a textbook definition on "clustering index", which is taken from 15.6.1 from Database Systems: The Complete Book:
We may also speak of clustering indexes, which are indexes on an attribute or attributes such that all of tuples with a fixed value for the search key of this index appear on roughly as few blocks as can hold them.
To understand the definition, let's take a look at Example 15.10 provided by the textbook:
A relation R(a,b) that is sorted on attribute a and stored in that
order, packed into blocks, is surely clustered. An index on a is a
clustering index, since for a given a-value a1, all the tuples with
that value for a are consecutive. They thus appear packed into
blocks, except possibly for the first and last blocks that contain
a-value a1, as suggested in Fig.15.14. However, an index on b is
unlikely to be clustering, since the tuples with a fixed b-value
will be spread all over the file unless the values of a and b are
very closely correlated.
Note that the definition does not enforce the data blocks have to be contiguous on the disk; it only says tuples with the search key are packed into as few data blocks as possible.
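The textbook example can be checked numerically with a small Python sketch. The relation contents, the block size of 3, and the helper name `blocks_touched` are assumptions chosen to keep the numbers easy to verify by hand.

```python
def blocks_touched(tuples, block_size, attr, value):
    """Count the blocks that hold at least one tuple whose attribute
    at position `attr` equals `value`."""
    blocks = [tuples[i:i + block_size] for i in range(0, len(tuples), block_size)]
    return sum(any(t[attr] == value for t in block) for block in blocks)


# R(a, b): 12 tuples, stored sorted on a and packed 3 tuples to a block.
R = sorted((a, b) for a in range(3) for b in range(4))

# The 4 tuples with a = 1 need at least 2 blocks of size 3, and occupy 2.
blocks_for_a = blocks_touched(R, 3, attr=0, value=1)

# The 3 tuples with b = 0 would fit in 1 block, but are spread over 3.
blocks_for_b = blocks_touched(R, 3, attr=1, value=0)
```

Here the index on a is clustering in the textbook's sense (the matching tuples sit on roughly as few blocks as can hold them), while an index on b is not.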
A related concept is the clustered relation. A relation is "clustered" if its tuples are packed into roughly as few blocks as can possibly hold those tuples. In other words, from a disk block perspective, if a block contains tuples from different relations, then those relations cannot be clustered (i.e., there is a more tightly packed way to store such a relation, by swapping tuples of that relation from other disk blocks with the tuples in the current disk block that don't belong to it). Clearly, R(a,b) in the example above is clustered.
To connect the two concepts: a clustered relation can have both clustering and nonclustering indexes. However, for a non-clustered relation, a clustering index is not possible unless the index is built on the primary key of the relation.
"Cluster" as a word is used across all abstraction levels of the database storage side (three levels of abstraction: tuples, blocks, file). There is also a concept called "clustered file", which describes whether a file (an abstraction for a group of one or more disk blocks) contains tuples from one relation or from different relations. It doesn't relate to the clustering index concept, as it is defined at the file level.
However, some teaching material likes to define clustering index based on the clustered file definition. Those two types of definitions are the same on clustered relation level, no matter whether they define clustered relation in terms of data disk block or file. From the link in this paragraph,
An index on attribute(s) A on a file is a clustering index when: All tuples with attribute value A = a are stored sequentially (= consecutively) in the data file
Storing tuples consecutively is the same as saying "tuples are packed into roughly as few blocks as can possibly hold those tuples" (with the minor difference that one talks about the file and the other about disk blocks). That is because storing tuples consecutively is the way to achieve "packed into roughly as few blocks as can possibly hold those tuples".
Clustered Index:
A primary key constraint creates a clustered index automatically if no clustered index already exists on the table. The actual data of a clustered index is stored at the leaf level of the index.
Non Clustered Index:
The actual data is not found directly at the leaf nodes of a non-clustered index. Finding it takes an additional step, because the leaf nodes hold row locators pointing to the actual data rather than the data itself.
A non-clustered index does not determine the order of the table rows the way a clustered index does. There can be multiple non-clustered indexes per table; the limit depends on the SQL Server version. SQL Server 2005 allows 249 non-clustered indexes per table, and later versions such as 2008 and 2016 allow 999.
Clustered Index - A clustered index defines the order in which data is physically stored in a table. Table data can be sorted in only one way; therefore, there can be only one clustered index per table. In SQL Server, the primary key constraint automatically creates a clustered index on that particular column.
Non-Clustered Index - A non-clustered index doesn’t sort the physical data inside the table. In fact, a non-clustered index is stored at one place and table data is stored in another place. This is similar to a textbook where the book content is located in one place and the index is located in another. This allows for more than one non-clustered index per table. It is important to mention here that inside the table the data will be sorted by a clustered index. However, inside the non-clustered index data is stored in the specified order. The index contains column values on which the index is created and the address of the record that the column value belongs to. When a query is issued against a column on which the index is created, the database will first go to the index and look for the address of the corresponding row in the table. It will then go to that row address and fetch other column values. It is due to this additional step that non-clustered indexes are slower than clustered indexes.
Differences between clustered and Non-clustered index
There can be only one clustered index per table. However, you can
create multiple non-clustered indexes on a single table.
Clustered indexes only sort tables. Therefore, they do not consume
extra storage. Non-clustered indexes are stored in a separate place
from the actual table claiming more storage space.
Clustered indexes are faster than non-clustered indexes since they
don’t involve any extra lookup step.
For more information refer to this article.

SQL Server indexes still needed when primary key defined on same columns?

When I have a primary key on a column, do I also need a non-clustered index on that same column for querying purposes? Primary keys ARE indexes, aren't they?
Also, if I have an aggregate primary key on two columns, do I need to create indexes on both of those columns for querying purposes?
And, finally, if I will be commonly querying for rows specifying two columns to match, is it best to have one index that includes both columns? Or two separate indexes, one on each?
When a Primary Key is created, a Clustered Index is created automatically. If you have any JOINs or a WHERE condition on this column, the JOIN as well as the search is faster because the engine knows the physical location of the record you are looking for.
In your case, I would say that if you have a primary key which is a combination of several columns and you SEARCH/JOIN on the individual columns, you would need a non-clustered index. Otherwise, defining the primary key will do the trick.
Refer to this for more information: https://www.simple-talk.com/sql/performance/tune-your-indexing-strategy-with-sql-server-dmvs/
When I have a primary key on a column, do I also need a non-clustered index on that same column for querying purposes? Primary keys ARE indexes, aren't they?
A regular index is a sorted copy of one (or multiple) columns. Being sorted it allows for fast searching. If its underlying values change, it will be re-sorted accordingly, but physical table order stays the same.
A clustered index on the other hand defines physical table order. That's why you only can have one - if its values change, the entire table will be re-sorted accordingly.
Often the primary key also is the clustered index of the table. But not necessarily - the defining property of a primary key is its uniqueness.
Having a clustered and a non-clustered index over the same column is redundant and you should not do it. It increases workload during insert/update/delete, but it does nothing for query performance.
if I have an aggregate primary key on two columns, do I need to create indexes on both of those columns for querying purposes?
That depends whether you ever want to query the second column on its own. An index over (A, B) will do nothing for queries that search for B only, so having a second index over B will be necessary in this case.
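A sorted list of tuples in Python shows why. This is a toy stand-in for the index over (A, B): the sample entries and the function names are made up, and `seek_a` assumes integer values for A so that `a + 1` can serve as the upper bound.

```python
import bisect

# Toy index over (A, B): entries are kept sorted by A first, then by B.
index_ab = sorted([(1, 7), (1, 9), (2, 3), (2, 7), (3, 1), (3, 7)])


def seek_a(a):
    """Binary search works: all entries with the same A are adjacent."""
    lo = bisect.bisect_left(index_ab, (a,))
    hi = bisect.bisect_left(index_ab, (a + 1,))
    return index_ab[lo:hi]


def scan_b(b):
    """Entries with the same B are scattered, so every entry is examined."""
    return [entry for entry in index_ab if entry[1] == b]
```

`seek_a(2)` jumps straight to the two adjacent entries for A = 2, while `scan_b(7)` has to walk the whole list because the matching entries are interleaved with others. A separate index ordered by B would turn the second case into a seek as well.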
Include in the index any extra columns you want to return from the query. If set up smartly, a query can be satisfied by the index alone, saving the DB engine from having to look at the table at all.
Note that this applies to non-clustered indexes. Including extra columns for queries against the clustered index is not necessary, as the clustered index is the table. It naturally contains all columns.
if I will be commonly querying for rows specifying two columns to match, is it best to have one index that includes both columns? Or two separate indexes, one on each?
Have a single index that contains both columns, the most selective (highest variance on unique values) or one that you are most likely to query on its own first, the assisting value second. Sometimes it's necessary to have it both ways - (A, B) and (B, A), it entirely depends on how the table is used.

How can a Non-Clustered index output a column that is not included in the index

Viewing the execution plan, I see "column A" in the Output List. The operation is an Index Scan on a non-clustered index: "IX_name".
When I look at the definition of this index, I do not see "column A" in either the index key columns or the included columns.
How is a non-clustered index being used to output a column that is not present in the index? Shouldn't it use a Table Scan on the table, or some other index which has "column A" present in it?
If the table itself is clustered [1], then all secondary indexes contain a copy of the clustering key [2] (a key that determines the physical order of rows in the clustered table).
The reason: rows in a clustered table are physically stored within a B-tree (not table heap), and therefore can move when B-tree nodes get split or coalesced, so the secondary index cannot just contain the row "pointer" (since it would be in danger of "dangling" after the row moves).
Often, that has detrimental effect on performance3 - querying through secondary index may require double-lookup:
First, search the secondary index and get the clustering key.
Second, based on the clustering key retrieved above, search the clustered table itself (which is B-tree).
However, if all you want are the fields of the clustering key, only the first lookup is needed.
1 Aka "clustered index" under MS SQL Server.
2 Usually, but not necessarily a PRIMARY KEY under MS SQL Server.
3 It is unfortunate that clustering is on by default under MS SQL Server - people often just leave the default without fully considering its effects. When clustering is not appropriate, you should specify NONCLUSTERED keyword explicitly to turn it off.
Internally, a non-clustered index contains all key columns of the clustered index key. This supports the key lookup operation and ensures that internally each index row has a unique key. For heaps, each non-clustered index (NCI) contains the row bookmark (RID) for the same reasons.
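A small T-SQL sketch of the situation described in the question. The table dbo.Orders, its columns and the index name are made up for illustration, and the operators that appear in the plan depend on the optimizer.

```sql
-- Hypothetical table clustered on OrderId.
CREATE TABLE dbo.Orders
(
    OrderId    INT   NOT NULL,
    CustomerId INT   NOT NULL,
    Total      MONEY NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
);

-- The index definition mentions only CustomerId ...
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId);

-- ... yet this query can be answered from IX_Orders_CustomerId alone,
-- because every row of the index also carries the clustering key (OrderId).
SELECT OrderId, CustomerId
FROM dbo.Orders
WHERE CustomerId = 42;

-- Total is not stored in the index, so this one needs a key lookup
-- into the clustered index (or a scan of the clustered index instead).
SELECT OrderId, CustomerId, Total
FROM dbo.Orders
WHERE CustomerId = 42;
```

In the first query's plan, OrderId shows up in the Output List of the non-clustered index operator even though it is neither a declared key column nor an included column.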

Clustered index - multi-part vs single-part index and effects of inserts/deletes

This question is about what happens with the reorganizing of data in a clustered index when an insert is done. I assume that it should be more expensive to do inserts on a table which has a clustered index than one that does not because reorganizing the data in a clustered index involves changing the physical layout of the data on the disk. I'm not sure how to phrase my question except through an example I came across at work.
Assume there is a table (Junk) and there are two queries that are done on the table, the first query searches by Name and the second query searches by Name and Something. As I'm working on the database I discovered that the table has been created with two indexes, one to support each query, like so:
--drop table Junk1
CREATE TABLE Junk1
(
    Name char(5),
    Something char(5),
    WhoCares int
);
CREATE CLUSTERED INDEX IX_Name ON Junk1
(
    Name
);
CREATE NONCLUSTERED INDEX IX_Name_Something ON Junk1
(
    Name, Something
);
Now when I looked at the two indexes, it seems that IX_Name is redundant since IX_Name_Something can be used by any query that desires to search by Name. So I would eliminate IX_Name and make IX_Name_Something the clustered index instead:
--drop table Junk2
CREATE TABLE Junk2
(
    Name char(5),
    Something char(5),
    WhoCares int
);
CREATE CLUSTERED INDEX IX_Name_Something ON Junk2
(
    Name, Something
);
Someone suggested that the first indexing scheme should be kept since it would result in more efficient inserts/deletes (assume that there is no need to worry about updates for Name and Something). Would that make sense? I think the second indexing method would be better since it means one less index needs to be maintained.
I would appreciate any insight into this specific example or directing me to more info on maintenance of clustered indexes.
Yes, inserting into the middle of an existing table (or its page) can be expensive when you have a less than optimal clustered index. The worst case is a page split: half the rows on the page have to be moved elsewhere, and indices (including non-clustered indices on that table) need to be updated.
You can alleviate that problem by using the right clustered index - one that ideally is:
narrow (only a single field, as small as possible)
static (never changes)
unique (so that SQL Server doesn't need to add 4-byte uniquifiers to your rows)
ever-increasing (like an INT IDENTITY)
You want a narrow key (ideally a single INT) since each and every entry in each and every non-clustered index will also contain the clustering key(s) - you don't want to put lots of columns in your clustering key, nor do you want to put things like VARCHAR(200) there!
With an ever-increasing clustered index, new rows are always appended at the end, so inserts do not cause page splits in the middle of the table. The main fragmentation you could encounter is from deletes ("swiss cheese" problem).
Check out Kimberly Tripp's excellent blog posts on indexing - most notably:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues... - this one actually shows that a good clustered index will speed up all operations - including inserts, delete etc., compared to a heap with no clustered index!
Ever-increasing clustering key - the Clustered Index Debate..........again!
Assume there is a table (Junk) and there are two queries that are done on the table, the first query searches by Name and the second query searches by Name and Something. As I'm working on the database I discovered that the table has been created with two indexes, one to support each query, like so:
That's definitely not necessary. If you have one index on (Name, Something), that index can be used just as well when you search and restrict on just WHERE Name = 'abc'. A separate index on the Name column alone is not needed; it only wastes space and costs time to keep up to date.
So basically, you only need a single index on (Name, Something), and I would agree with you - if you have no other indices on this table, then you should be able to make this the clustered key. However, since that key won't be ever-increasing and could possibly change, too (right?), this might not be such a great idea.
The other option would be to introduce a surrogate ID INT IDENTITY and cluster on that - with two benefits:
it's everything a good clustering key should be, including ever-increasing, so you avoid page-split and performance issues for INSERT operations
you still get all the benefits of having a clustering key (see Kim Tripp's blog posts - clustered tables are almost always preferable to heaps)
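A minimal sketch of that surrogate-key option in T-SQL. The table name Junk3 and the constraint name PK_Junk3 are hypothetical; the other columns are taken from the question.

```sql
-- Hypothetical third variant: cluster on a narrow, static, unique,
-- ever-increasing surrogate key instead of on (Name, Something).
CREATE TABLE Junk3
(
    ID INT IDENTITY(1,1) NOT NULL,
    Name char(5),
    Something char(5),
    WhoCares int,
    CONSTRAINT PK_Junk3 PRIMARY KEY CLUSTERED (ID)
);

-- A single non-clustered index serves both queries from the question.
CREATE NONCLUSTERED INDEX IX_Name_Something ON Junk3
(
    Name, Something
);

-- Query 1: search by Name only (uses the leading column of the index).
SELECT Name, Something FROM Junk3 WHERE Name = 'abc';

-- Query 2: search by Name and Something.
SELECT Name, Something FROM Junk3 WHERE Name = 'abc' AND Something = 'xyz';
```

Queries that also return WhoCares would need a key lookup into the clustered index, or WhoCares could be added as an included column.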
Someone suggested that the first indexing scheme should be kept since it would result in more efficient inserts/deletes
That's a bogus claim. Ordered data is ordered data and the same IO would be performed.
SET STATISTICS IO ON
-- your insert statement here
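For instance, to compare the two schemes from the question, one could run the same insert against both tables and compare the logical reads reported for each. The inserted values below are made up.

```sql
SET STATISTICS IO ON;

-- Same row into both indexing schemes; compare the "logical reads"
-- figures that SQL Server reports in the Messages output.
INSERT INTO Junk1 (Name, Something, WhoCares) VALUES ('abc', 'def', 1);
INSERT INTO Junk2 (Name, Something, WhoCares) VALUES ('abc', 'def', 1);

SET STATISTICS IO OFF;
```

A single row on an empty table says little; the comparison is more meaningful on tables loaded with a realistic amount of data.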
You can create only one clustered index per table (it may cover more than one column), so choose the column or columns which your app will mostly be querying on, like wildcard queries on customer full names, etc.

How to use MySQL index columns?

When do you use each MySQL index type?
PRIMARY - Primary key columns?
UNIQUE - Foreign keys?
INDEX - ??
For really large tables, do indexed columns improve performance?
Primary
The primary key is - as the name suggests - the main key of a table and should be a column which is commonly used to select the rows of this table. The primary key is always a unique key (unique identifier). The primary key is not limited to one column, for example in reference tables (many-to-many) it often makes sense to have a primary key including two or more columns.
Unique
A unique index makes sure your DBMS doesn't accept duplicate entries for this column. You ask 'Foreign keys?' NO! That would not be useful, since foreign keys are by definition prone to contain duplicates (one-to-many, many-to-many).
Index
Additional indexes can be placed on columns which are often used for SELECTS (and JOINS) which is often the case for foreign keys. In many cases SELECT (and JOIN) queries will be faster, if the foreign keys are indexed.
Note however that - as SquareCog has clarified - Indexes get updated on any modifications to the data, so yes, adding more indexes can lead to degradation in INSERT/UPDATE performance. If indexes didn't get updated, you would get different information depending on whether the optimizer decided to run your query on an index or the raw table -- a highly undesirable situation.
This means you should carefully assess the usage of indices. One thing is certain on that basis: unused indices should be avoided or removed!
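To make the three index types concrete, here is a minimal MySQL sketch. All table, column and index names are invented for illustration.

```sql
-- PRIMARY: the main key of the table; UNIQUE: no duplicate e-mail addresses.
CREATE TABLE customer (
    id    INT NOT NULL AUTO_INCREMENT,
    email VARCHAR(100) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_customer_email (email)
);

-- INDEX: a plain, non-unique index on the foreign key column,
-- because many orders can belong to the same customer.
CREATE TABLE orders (
    id          INT NOT NULL AUTO_INCREMENT,
    customer_id INT NOT NULL,
    PRIMARY KEY (id),
    INDEX idx_orders_customer (customer_id),
    FOREIGN KEY (customer_id) REFERENCES customer (id)
);

-- A many-to-many reference table with a primary key over two columns.
CREATE TABLE order_tag (
    order_id INT NOT NULL,
    tag_id   INT NOT NULL,
    PRIMARY KEY (order_id, tag_id)
);
```

Joins from orders to customer can then use idx_orders_customer on one side and the primary key on the other.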
I'm not that familiar with MySQL, however I believe the following to be true across most database servers. An index is a balanced tree which allows the database to find given data without scanning the whole table. For example, say you have the following table.
CREATE TABLE person (
id SERIAL,
name VARCHAR(20),
dob VARCHAR(20)
);
If you created an index on the 'name' field, this would create a balanced tree for the data in the name column of the table. Balanced tree data structures allow for faster searching of results (see http://www.solutionhacker.com/tag/balanced-tree/).
Note, however, that indexing a column only allows you to search on the data as it is stored in the database. For example:
The following query would not be able to use the index and would instead do a sequential scan on the table, calling UPPER() on the name value of every row.
select *
from person
where UPPER(name) = 'BOB';
The following query has the same problem, because the index is sorted starting with the first letter and a leading wildcard gives it nothing to seek on. Replacing the search term with "B%" would, however, use the index.
select *
from person
where name like '%B';
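One way to check this yourself is with EXPLAIN, sketched here for MySQL against the person table above. The index name idx_person_name is an assumption, and on a very small table the optimizer may choose a full scan regardless.

```sql
-- Assumed index on the name column of the person table defined earlier.
CREATE INDEX idx_person_name ON person (name);

-- Prefix search: the index is sorted by the first letters, so a range
-- read on idx_person_name is possible.
EXPLAIN SELECT * FROM person WHERE name LIKE 'B%';

-- Leading wildcard: nothing to seek on, so expect a full scan.
EXPLAIN SELECT * FROM person WHERE name LIKE '%B';

-- Function applied to the column: the stored values are not what is
-- being compared, so the plain index on name cannot be used to seek.
EXPLAIN SELECT * FROM person WHERE UPPER(name) = 'BOB';
```

In the EXPLAIN output, compare the key column (which index, if any, was chosen) and the type column across the three statements.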
Indexes will improve performance on larger tables. Normally, the primary key is backed by an index on the key columns, and that index is unique.
It is also useful to add indexes to fields that are searched on a lot, such as Street Name or Surname, as again this will improve performance. These indexes don't need to be unique.
Foreign Keys and Unique Keys are more for keeping your data integrity in order, so that you cannot have duplicate key values and so that your child tables don't have data for a parent that has been deleted.
PRIMARY defines a primary key, yes.
UNIQUE simply defines that the specified field has to be unique, it has nothing to do with foreign keys.
INDEX creates an index for the specified column and, yes, it improves performance for large tables: sorting and finding something in this column can be much faster if you use indexing.
The bigger the table, the bigger the gain from using an index. Do note that indexes make insert (and probably update) operations slower, so make sure you don't index too many fields.