SQL table optimization: primary and secondary indexes

Do people usually make every column in a table a secondary index to be on the safe side, in case the customer decides to use any field to search for a record?
Does the search first go through the secondary indexes and then to the primary key? Thus narrowing down to the requested data?
What is the point of having secondary index if you already have a column that is a primary key?

(The following response applies to SQL Server. Some parts may vary for other DBMSs.)
Last question first: "What is the point of having secondary keys if you already have a column that is a primary key?" I illustrate with the example of a table "People (Id int primary key, firstname varchar(40), middlename varchar(40), lastname varchar(40))". Now consider the query "select * from people where lastname = 'flynn'". If there is no index on the lastname column, the table will be scanned sequentially to find matches. Every row must be accessed. The primary key index does not help at all here. If you index the lastname column, the result can be found much more quickly.
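The scan-versus-index difference can be observed directly. What follows is only a minimal sketch: it uses Python's built-in sqlite3 module as a runnable stand-in (SQLite, not SQL Server), and the index name idx_people_lastname and the sample rows are made up for illustration. The exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    " id INTEGER PRIMARY KEY,"
    " firstname TEXT, middlename TEXT, lastname TEXT)"
)
conn.executemany(
    "INSERT INTO people (firstname, middlename, lastname) VALUES (?, ?, ?)",
    [("Elroy", "J", "Flynn"), ("Ada", "K", "Byron"), ("Alan", "M", "Turing")],
)

QUERY = "SELECT * FROM people WHERE lastname = 'Flynn'"


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# No index on lastname yet: the primary key does not help, so the table is scanned.
plan_without_index = plan(QUERY)

# Hypothetical index name, chosen for this example only.
conn.execute("CREATE INDEX idx_people_lastname ON people (lastname)")

# Same query, now resolved through the lastname index.
plan_with_index = plan(QUERY)

print(plan_without_index)
print(plan_with_index)
```

In SQL Server the equivalent check would be to compare the execution plan before and after creating the index.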
You would normally index only those columns that would be useful to the queries your application issues. If your queries never have a join or where condition on a column named "MiddleName" then no benefit would come from indexing that column. You don't want to add unnecessary indexes because they increase the cost of data inserts and updates that involve that column.
We usually say that SQL Server uses only a single index per table instance in a query. So a query like "select * from people where firstname='Elroy' and lastname = 'Flynn'" would use at most one index, even if both firstname and lastname have indexes. SQL Server would choose one or the other index based on the statistics it has collected from the data values.
For completeness, I have to get a little advanced here and discuss clustered vs. non-clustered indexes. A table can have only one clustered index; the rest are non-clustered. The previous paragraph notwithstanding, when a non-clustered index is used to resolve a query, the index lookup produces an intermediate result, which is the full value of the key of whichever index is the clustered index (often, the primary key). That is, the leaves of every non-clustered index contain the clustered key value, not a row pointer. After finding this clustered key, the clustered index is then used to resolve the lookup to a specific database row. So, ultimately, ALL index lookups eventually use the clustered index.
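The two-step lookup described above can be pictured with plain dictionaries. This is a toy model, not how SQL Server stores anything: the names clustered, by_lastname and find_by_lastname are invented for the illustration, and real indexes are B-trees rather than hash maps.

```python
# Clustered index: the rows themselves, keyed by the clustering key (here, id).
clustered = {
    1: {"id": 1, "firstname": "Elroy", "lastname": "Flynn"},
    2: {"id": 2, "firstname": "Ada", "lastname": "Byron"},
    3: {"id": 3, "firstname": "Errol", "lastname": "Flynn"},
}

# Non-clustered index on lastname: each entry holds clustering key values,
# not pointers to rows.
by_lastname = {}
for key, row in clustered.items():
    by_lastname.setdefault(row["lastname"], []).append(key)


def find_by_lastname(lastname):
    """Resolve a lastname to full rows in two steps."""
    keys = by_lastname.get(lastname, [])   # step 1: non-clustered index lookup
    return [clustered[k] for k in keys]    # step 2: clustered index lookup per key


flynns = find_by_lastname("Flynn")
print(flynns)
```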
Still, for practical purposes, it is usually adequate and simpler to say that only a single index is used per table instance. Note that if a table is aliased in a query so that it appears more than once, a different index could be used for the different references. e.g., "select * from people p1 join people p2 on p1.firstname = p2.lastname" could use a firstname index on the p1 instance and a lastname index on the p2 instance.
see http://msdn.microsoft.com/en-us/library/aa933131(v=SQL.80).aspx

Usually you only index columns that need to be indexed. Adding additional indexes would normally be considered premature optimization.
Most optimizers will identify the fastest method to find the least number of records. This may be to use an index, but may be a full table scan. If there are multiple indexes that can be used, often only one is used, and the resulting records are compared against the remaining criteria. If multiple indexes are used, then the result sets need to be matched, and records which weren't found in both indexes eliminated.
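The two strategies in the paragraph above (one index plus a residual filter, versus matching the results of two indexes) can be sketched with sets. This is an illustrative toy only; the function names and sample data are made up, and a real optimizer picks between such strategies based on cost estimates.

```python
# Row id -> (firstname, lastname)
rows = {
    1: ("Elroy", "Flynn"),
    2: ("Elroy", "Jetson"),
    3: ("Errol", "Flynn"),
}

# Two single-column indexes, each mapping a value to the set of matching row ids.
by_first, by_last = {}, {}
for rid, (first, last) in rows.items():
    by_first.setdefault(first, set()).add(rid)
    by_last.setdefault(last, set()).add(rid)


def one_index_then_filter(first, last):
    """Use only the lastname index, then check the remaining criterion per row."""
    candidates = by_last.get(last, set())
    return {rid for rid in candidates if rows[rid][0] == first}


def intersect_two_indexes(first, last):
    """Use both indexes and keep only the ids found in both result sets."""
    return by_first.get(first, set()) & by_last.get(last, set())


print(one_index_then_filter("Elroy", "Flynn"))
print(intersect_two_indexes("Elroy", "Flynn"))
```

Both approaches return the same rows; they differ only in how much work is done to get there.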
It is common to use surrogate keys for tables where the natural key is subject to change, or very (purposely vague) long. The natural key in this case would be indexed as a secondary unique key. In some cases there may be competing natural keys, in which case all the natural keys would have unique indexes.

One other item not mentioned yet: every additional index has to be maintained. So if you have indexes covering all your columns in several different combinations, not only will they take up lots of space, but every update/insert/delete has the potential to change one or more of those indexes. This will slow those operations down considerably in many situations.
It's always a tradeoff. The more indexes you have the more work the server has to do to keep them up to date, but the more likely it is that you'll have at least one that will cover any query you throw at that table.
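To make the maintenance cost concrete, here is a rough sketch that simply counts how many index structures each insert has to touch. The IndexedTable class is hypothetical and ignores everything a real engine does (page splits, logging, locking); it only shows that the write work grows with the number of indexes.

```python
class IndexedTable:
    """Toy table that keeps one dictionary per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: {} for col in indexed_columns}
        self.index_writes = 0  # how many index updates inserts have caused

    def insert(self, row):
        position = len(self.rows)
        self.rows.append(row)
        # Every index on the table must be updated for every inserted row.
        for col, index in self.indexes.items():
            index.setdefault(row[col], []).append(position)
            self.index_writes += 1


sample = [
    {"firstname": f"F{i}", "middlename": f"M{i}", "lastname": f"L{i}"}
    for i in range(1000)
]

lean = IndexedTable(["lastname"])
heavy = IndexedTable(["firstname", "middlename", "lastname"])
for row in sample:
    lean.insert(row)
    heavy.insert(row)

print(lean.index_writes, heavy.index_writes)
```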

"On the safe side"? No.
An index trades space and insert-time for select-time. Unnecessary keys chew up disk-space and slow inserts in return for speeding up a query that never occurs.
As with all optimizations, do query optimizations last -- build the system then observe its behavior.
The primary/secondary distinction is a highly technical one. All indices exist to speed up queries and/or enforce certain integrity constraints.

Related

Searching for record(s) in a table that has over 200 Million Rows

Which type of index should be used on the table? The data is initially inserted (once a month) into an empty table. I then place a non-clustered composite index on two of the columns. I'm wondering whether merging the two fields into one would increase performance when searching, or does it not matter? Should I be working with an identity column that has a primary key clustered index?
You should index the field(s) most likely to be used in the where clause as people query the table. Don't worry about the primary key - it already has an index.
If you can define a unique primary key that can be used when querying the table, this will be used as the clustered index and will be the fastest for selects.
If your select query has to use the two fields you mentioned, keep them separate. Performance will not be impacted and the schema is not spoiled.
"A clustered index is particularly efficient on columns that are often searched for ranges of values. After the row with the first value is found using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent."
With this in mind you probably won't see much benefit from having a clustered index on your primary key (ID) unless it has business meaning for your application. If you have a date value that you are commonly querying, then it may make more sense to add a clustered index to that column, for queries such as:
select * from table where created > '2013-01-01' and created < '2013-02-01'
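As a hedged illustration of clustering on a date column, the sketch below uses SQLite through Python's sqlite3 module. SQLite has no CLUSTERED keyword; a WITHOUT ROWID table stored in primary key order is used here as the closest analogue, which is an assumption of this example and not SQL Server behavior. The table name events and its columns are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID stores the rows in primary key order, so the table is
# physically organized by (created, id), much like a clustered index.
conn.execute(
    "CREATE TABLE events ("
    " created TEXT NOT NULL,"
    " id INTEGER NOT NULL,"
    " payload TEXT,"
    " PRIMARY KEY (created, id)"
    ") WITHOUT ROWID"
)
conn.executemany(
    "INSERT INTO events (created, id, payload) VALUES (?, ?, ?)",
    [
        ("2012-12-31", 1, "a"),
        ("2013-01-15", 2, "b"),
        ("2013-01-20", 3, "c"),
        ("2013-02-01", 4, "d"),
    ],
)

RANGE_QUERY = (
    "SELECT * FROM events "
    "WHERE created > '2013-01-01' AND created < '2013-02-01'"
)

# The plan should show a range search on the primary key, not a full scan.
range_plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + RANGE_QUERY)]
january_payloads = sorted(row[2] for row in conn.execute(RANGE_QUERY))

print(range_plan)
print(january_payloads)
```

In SQL Server the corresponding step would be to create the clustered index on the date column and compare execution plans.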
I have seen data warehouses use a concatenated key approach. Whether this works for you depends on your queries. Querying a single field value will generally be faster than querying multiple fields, particularly when there is one less lookup in the B-tree index.
Alternatively, if you have 200 million rows in a table you could look at breaking the data out into multiple tables if it makes sense to do so.
You're saying that you're loading all this data every month so I have to assume that all the data is relevant. If there was data in your table that is considered "old" and not relevant to searches, then you could move that data out into an archive table (using the same schema) so your queries only run against "current" data.
Otherwise, you can look at a sharding approach as used by NoSQL databases like MongoDB. If MongoDB is not an option, you could achieve the same shard-key-like logic in your application. I doubt that your database SQL drivers will support sharding natively.

Issue with big tables (no primary key available)

Table1 has around 10 lakh records (1 million) and does not contain any primary key. Retrieving the data with a SELECT command (with a specific WHERE condition) is taking a large amount of time. Can we reduce the retrieval time by adding a primary key to the table, or do we need to do it some other way? Kindly help me.
A primary key does not have a direct effect on performance. But indirectly, it does. This is because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. But you can create your own unique indexes on a table. So, strictly speaking, a primary key does not affect performance, but the index used by the primary key does.
WHEN SHOULD PRIMARY KEY BE USED?
Primary key is needed for referring to a specific record.
To make your SELECTs run fast you should consider adding an index on the appropriate columns you're using in your WHERE clause.
E.g. to speed-up SELECT * FROM "Customers" WHERE "State" = 'CA' one should create an index on State column.
A primary key will not help if you don't have the primary key in the WHERE clause.
If you would like to make your query faster, you can create a non-clustered index on the columns in the WHERE clause. You may want to add included columns to your index (it depends on your SELECT clause).
The SQL optimizer will seek on your indexes, which will make your query faster.
(But you should think about how data is added to your table. Insert operations might take time if you create indexes on many columns.)
It depends on the SELECT statement, and the size of each row in the table, the number of rows in the table, and whether you are retrieving all the data in each row or only a small subset of the data (and if a subset, whether the data columns that are needed are all present in a single index), and on whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
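The last case (all needed columns present in a single index) is what is usually called a covering index. A small sketch, again using SQLite via Python's sqlite3 as a stand-in for SQL Server; the customers table, its columns and the index name are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY, name TEXT, state TEXT, notes TEXT)"
)
# Hypothetical composite index: filter column first, then the selected column.
conn.execute("CREATE INDEX idx_customers_state_name ON customers (state, name)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Both state and name live in the index, so the table itself is never read.
covered = plan("SELECT name FROM customers WHERE state = 'CA'")

# notes is not in the index, so each match needs an extra lookup into the table.
not_covered = plan("SELECT notes FROM customers WHERE state = 'CA'")

print(covered)
print(not_covered)
```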
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers
One way you can do it is to create indexes on your table. It's always better to create a primary key, which creates a unique index that by default will reduce the retrieval time for lookups on the key columns.
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?

Clustered index dilemma - ID or sort?

I have a table with two very important fields:
id INT identity(1,1) PRIMARY KEY
identifiersortcode VARCHAR(900)
My app always sorts and pages search results in the UI based on identifiersortcode, but all table joins (and they are legion) are on the id field. (Aside: yes, the sort code really is that long. There's a strong BL reason.)
Also, due to O/RM use, most SELECT statements are going to pull almost every column.
Currently, the clustered index is on id, but I'm wondering if the TOP / ORDER BY portion of most queries would make identifiersortcode a more attractive option as the clustered key, even considering all of the table joins going on.
Inserts on the table and changes to the identifiersortcode are limited enough that changing my clustered index would not be a problem for insert/update operations.
Trying to make the sort code's non-clustered index a covering index (using INCLUDE) is not a good option. There are a number of large columns, and some of them have a lot of update activity.
Kimberly L. Tripp's criteria for a clustered index are that it be:
Unique
Narrow
Static
Ever Increasing
Based on that, I'd stick with your integer identity id column, which satisfies all of the above. Your identifiersortcode would fail most, if not all, of those requirements.
To correctly determine which field will benefit most from the clustered index, you need to do some homework. The first thing that you should consider is the selectivity of your joins. If your execution plans filter rows from this table FIRST, then join on the other tables, then you are not really benefiting from having the clustered index on the primary key, and it makes more sense to have it on the sort key.
If however, your joins are selective on other tables (they are filtered, then an index seek is performed to select rows from this table), then you need to compare the performance of the change manually versus the status quo.
"Currently, the clustered index is on id, but I'm wondering if the TOP / ORDER BY portion of most queries would make identifiersortcode a more attractive option as the clustered key, even considering all of the table joins going on."
Making identifiersortcode a CLUSTERED KEY will only help if it is used both in filtering and ordering conditions.
This means that the table is chosen as the leading table in all your joins and is read using a Clustered Index Scan or Clustered Index Range Scan access path.
Otherwise, it will only make the things worse: first, all secondary indexes will be larger in size; second, inserts in non-increasing order will result in page splits which will make them run longer and result in a larger table.
Why, for God's sake, does your identifier sort code need to be 900 characters long? If you really need 900 characters to be distinct for sorting, it should probably be broken up into multiple fields.
Apart from repeating what Chris B. said, I think you should really stick to your current PK, since - as you said - all joins are on the Id.
I guess you already have indexed the identifiersortcode....
Nevertheless, IF you have performance issues, I would really think twice about this ##"%$£ identifiersortcode !-)

Does clustered index on foreign key column increase join performance vs non-clustered?

In many places it's said that clustered indexes are best utilized when selecting a range of rows using a BETWEEN condition. When I select by joining on a foreign key field in such a way that this clustered index is used, I guess that clustering should help too, because a range of rows is being selected even though they all have the same clustered key value and BETWEEN is not used.
Considering that I care only about that one select with join and nothing else, am I wrong with my guess ?
Discussing this type of issue in the absolute isn't very useful.
It is always a case-by-case situation !
Essentially, access by way of a clustered index saves one indirection, period.
Assuming the key used in the JOIN is that of the clustered index, in a single read [whether from an index seek or from a scan or partial scan, doesn't matter], you get the whole row (record).
One problem with clustered indexes, is that you only get one per table. Therefore you need to use it wisely. Indeed in some cases, it is even wiser not to use any clustered index at all because of INSERT overhead and fragmentation (depending on the key and the order of new keys etc.)
Sometimes one gets the equivalent benefits of a clustered index with a covering index, i.e. an index with the desired key(s) sequence, followed by the column values we are interested in. Just like a clustered index, a covering index doesn't require the indirection to the underlying table. Indeed the covering index may be slightly more efficient than the clustered index, because it is smaller.
However, just like clustered indexes, and aside from the storage overhead, there is a performance cost associated with any extra index during INSERT (and DELETE or UPDATE) queries.
And, yes, as indicated in other answers, the "foreign-key-ness" of the key used for the clustered index has absolutely no bearing on the performance of the index. FKs are constraints aimed at easing the maintenance of the integrity of the database but the underlying fields (columns) are otherwise just like any other field in the table.
To make wise decisions about index structure, one needs
to understand the way the various index types (and the heap) work
(and, BTW, this varies somewhat between SQL implementations)
to have a good image of the statistical profile of the database(s) at hand:
which are the big tables, which are the relations, what's the average/maximum cardinality of each relation, what's the typical growth rate of the database etc.
to have good insight regarding the way the database(s) is (are) going to be used/queried
Then and only then can one make educated guesses about the interest [or lack thereof] of having a given clustered index.
I would ask something else: would it be wise to put my clustered index on a foreign key column just to speed a single JOIN up? It probably helps, but..... at a price!
A clustered index makes a table faster, for every operation. YES! It does. See Kim Tripp's excellent The Clustered Index Debate continues for background info. She also mentions her main criteria for a clustered index:
narrow
static (never changes)
unique
if ever possible: ever increasing
INT IDENTITY fulfills this perfectly - GUID's do not. See GUID's as Primary Key for extensive background info.
Why narrow? Because the clustering key is added to each and every index page of each and every non-clustered index on the same table (in order to be able to actually look up the data row, if needed). You don't want to have VARCHAR(200) in your clustering key....
Why unique?? See above - the clustering key is the item and mechanism that SQL Server uses to uniquely find a data row. It has to be unique. If you pick a non-unique clustering key, SQL Server itself will add a 4-byte uniqueifier to your keys. Be careful of that!
So those are my criteria - put your clustering key on a narrow, stable, unique, hopefully ever-increasing column. If your foreign key column matches those - perfect!
However, I would not under any circumstances put my clustering key on a wide or even compound foreign key. Remember: the value(s) of the clustering key are being added to each and every non-clustered index entry on that table! If you have 10 non-clustered indices, 100'000 rows in your table - that's one million entries. It makes a huge difference whether that's a 4-byte integer, or a 200-byte VARCHAR - HUGE. And not just on disk - in server memory as well. Think very very carefully about what to make your clustered index!
SQL Server might need to add a uniquifier - making things even worse. If the values will ever change, SQL Server would have to do a lot of bookkeeping and updating all over the place.
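The arithmetic behind "HUGE" is easy to spell out. This is a back-of-the-envelope sketch only: it assumes the VARCHAR key is fully populated at 200 bytes and ignores per-row and per-page overhead, so real numbers will differ.

```python
nonclustered_indexes = 10
rows = 100_000

# Every non-clustered index entry carries a copy of the clustering key.
entries = nonclustered_indexes * rows

int_key_bytes = 4      # INT clustering key
wide_key_bytes = 200   # assumed: a fully populated 200-byte VARCHAR key

int_overhead = entries * int_key_bytes     # bytes spent on key copies with INT
wide_overhead = entries * wide_key_bytes   # bytes spent on key copies with VARCHAR
ratio = wide_overhead // int_overhead

print(f"{entries:,} entries")
print(f"INT key copies:     {int_overhead:,} bytes")
print(f"VARCHAR key copies: {wide_overhead:,} bytes ({ratio}x larger)")
```

Under these assumptions the key copies alone take roughly 4 MB with the integer key and roughly 200 MB with the wide key.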
So in short:
putting an index on your foreign keys is definitely a great idea - do it all the time!
I would be very very careful about making that a clustered index. First of all, you only get one clustered index, so which FK relationship are you going to pick? And don't put the clustering key on a wide and constantly changing column.
An index on the FK column will help the JOIN because the index itself is ordered: clustered just means that the data on disk (the leaf level) is ordered, rather than only the B-tree.
If you change it to a covering index, then clustered vs non-clustered is irrelevant. What's important is to have a useful index.
It depends on the database implementation.
For SQL Server, a clustered index is a B-tree whose leaf level holds the data pages themselves. The reason you get fast performance is that you can get to the start of a range quickly, and the rest of the range is an easy linked list of pages to follow.
A non-clustered index is a separate data structure that contains pointers to the actual records and as such has different concerns.
Refer to the documentation regarding Clustered Index Structures.
An index will not help in relation to a Foreign Key relationship, but it will help due to the concept of a "covered" index. If your WHERE clause contains a constraint based upon the index, it will be able to generate the returned data set faster. That is where the performance comes from.
The performance gains usually come if you are selecting data sequentially within the cluster. Also, it depends entirely on the size of the table (data) and the conditions in your between statement.

What column should the clustered index be put on?

Lately, I have been doing some reading on indexes of all types and the main advice is to put the clustered index on the primary key of the table, but what if the primary key is not actually used in a query (via a select or join) and is there purely for relational purposes, so that it is never queried against? For example, say I have a car_parts table and it contains 3 columns: car_part_id, car_part_no, and car_part_title. car_part_id is the unique primary key identity column. In this case car_part_no is unique as well, and most likely so is car_part_title. car_part_no is what is most queried against, so doesn't it make sense to put the clustered index on that column instead of car_part_id? The basic question is: which column should actually have the clustered index, since you are only allowed one of them?
An index, clustered or non-clustered, can be used by the query optimizer if and only if the leftmost key in the index is filtered on. So if you define an index on columns (A, B, C), a WHERE condition on B=@b, on C=@c or on B=@b AND C=@c will not fully leverage the index (see note). This applies also to join conditions. Any WHERE filter that includes A will consider the index: A=@a or A=@a AND B=@b or A=@a AND C=@c or A=@a AND B=@b AND C=@c.
So in your example if you make the clustered index on part_no as the leftmost key, then a query looking for a specific part_id will not use the index and a separate non-clustered index must exist on part_id.
Now about the question which of the many indexes should be the clustered one. If you have several query patterns that are about the same importance and frequency and contradict each other on terms of the keys needed (eg. frequent queries by either part_no or part_id) then you take other factors into consideration:
width: the clustered index key is used as the lookup key by all other non-clustered indexes. So if you choose a wide key (say two uniqueidentifier columns) then you are making all the other indexes wider, thus consuming more space, generating more IO and slowing down everything. So between equally good keys from a read point of view, choose the narrowest one as clustered and make the wider ones non-clustered.
contention: if you have specific patterns of insert and delete try to separate them physically so they occur on different portions of the clustered index. E.g. if the table acts as a queue with all inserts at one logical end and all deletes at the other logical end, try to lay out the clustered index so that the physical order matches this logical order (e.g. enqueue order).
partitioning: if the table is very large and you plan to deploy partitioning then the partitioning key must be the clustered index. A typical example is historical data that is archived using a sliding window partitioning scheme. Even though the entities have a logical primary key like 'entity_id', the clustered index is done by a datetime column that is also used for the partitioning function.
stability: a key that changes often is a poor candidate for a clustered key, as each update changes the clustered key value and forces all non-clustered indexes to update the lookup key they store. As an update of a clustered key will also likely relocate the record into a different page, it can cause fragmentation on the clustered index.
Note: not fully leverage, as sometimes the engine will choose a non-clustered index to scan instead of the clustered index simply because it is narrower and thus has fewer pages to scan. In my example if you have an index on (A, B, C) and a WHERE filter on B=@b and the query projects C, the index will likely be used, but as a scan rather than a seek, because that is still faster than a full clustered scan (fewer pages).
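The leftmost-key rule and the note about scanning a narrower index can both be tried out on a small scale. The sketch below uses SQLite via Python's sqlite3 as a stand-in, so the optimizer involved is SQLite's, not SQL Server's; the table t, its columns and the index name idx_abc are hypothetical. SQLite reports a seek as SEARCH and a scan as SCAN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INT, b INT, c INT, d TEXT)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Filters that include the leftmost key A can seek into the index.
seek_on_a = plan("SELECT d FROM t WHERE a = 1")
seek_on_a_b = plan("SELECT d FROM t WHERE a = 1 AND b = 2")

# A filter on B alone cannot seek: the leftmost key is missing.
no_seek_on_b = plan("SELECT d FROM t WHERE b = 2")

# Filter on B, project C: no seek is possible, but the engine may still
# choose to scan the narrower index instead of the whole table.
scan_for_c = plan("SELECT c FROM t WHERE b = 2")

for p in (seek_on_a, seek_on_a_b, no_seek_on_b, scan_for_c):
    print(p)
```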
Kimberly Tripp is always one of the best sources on insights on indexing.
See her blog post "Ever-increasing clustering key - the Clustered Index Debate - again!" in which she quite clearly lists and explains the main requirements for a good clustering key - it needs to be:
Unique
Narrow
Static
and best of all, if you can manage:
ever-increasing
Taking all this into account, an INT IDENTITY (or BIGINT IDENTITY if you really need more than 2 billion rows) works out to be the best choice in the vast majority of cases.
One thing a lot of people don't realize (and thus don't take into account when making their choice) is the fact that the clustering key (all the columns that make up the clustered index) will be added to each and every index entry for each and every non-clustered index on your table - thus the "narrow" requirement becomes extra important!
Also, since the clustering key is used for bookmark lookups (looking up the actual data row when a row is found in a non-clustered index), the "unique" requirement also becomes very important. So important in fact, that if you choose a (set of) column(s) that is/are not guaranteed to be unique, SQL Server will add a 4-byte uniqueifier to each row, thus making each and every one of your clustered index keys extra wide; definitely NOT a good thing.
Marc
Clustered indexes are good when you query ranges of data. For example
SELECT * FROM theTable WHERE age BETWEEN 10 AND 20
The clustered index arranges rows in the particular order on your computer disk. That's why rows with age = 10 will be next to each other, and after them there will be rows with age = 11, etc.
If you have exact select, like this:
SELECT * FROM theTable WHERE age = 20
the non-clustered index is also good. It doesn't rearrange data on your computer disk, but it builds a special tree with pointers to the rows you need.
So it strongly depends on the type of queries you perform.
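Why ordered storage suits range queries can be shown with a sorted list and binary search. This is a simplified model with invented names (clustered_rows, age_between); a real clustered index is a B-tree over pages, but the idea that a range becomes one contiguous run is the same.

```python
from bisect import bisect_left, bisect_right

# Rows stored in clustered order, i.e. sorted by age.
clustered_rows = sorted(
    [(25, "Ann"), (10, "Bob"), (15, "Cid"), (20, "Dee"), (9, "Eve"), (20, "Fay")]
)
ages = [age for age, _ in clustered_rows]


def age_between(low, high):
    """Return rows with low <= age <= high as one contiguous slice."""
    start = bisect_left(ages, low)    # find the first row of the range
    stop = bisect_right(ages, high)   # find one past the last row of the range
    return clustered_rows[start:stop]


teens = age_between(10, 20)
print(teens)
```

Once the first matching row is located, every other matching row sits right next to it, so no further lookups are needed.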
Keep in mind the usage patterns: if you are almost always querying the DB on the car_part_no, then it would probably be beneficial for it to be clustered on that column.
However, don't forget about joins: if you are most often joining to the table and the join uses the car_part_id field, then you have a good reason to keep the cluster on car_part_id.
Something else to keep in mind (less so in this case, but generally when considering clustered indexes) is that the clustered index will appear implicitly in every other index on the table. So for example, if you were to index car_part_title, that index will also include the car_part_id implicitly. This can affect whether or not an index covers a query and also affects how much disk space the index will take (which affects memory usage, etc).
The clustered index should go on the column that will be the most queried. This includes joins, as a join must access the table just like a direct query, and find the rows indicated.
You can always rebuild your indexes later on if your application changes and you find you need to optimize a table with a different index structure.
Some additional guidelines for deciding on what to cluster your table on can be found on MSDN here: Clustered Index Design Guidelines.