Due to some recent database problems, I've been looking at the state of our indexes and trying to reduce fragmentation.
The app has several tables of names, such as BoysNames and GirlsNames, which we use to set attributes on the User objects we create. The obvious attribute here is Gender. These tables can have anywhere from a few hundred to 10,000 rows.
These tables are structured like this:
Name - nvarchar(50) - PK & Clustered Index
DateCreated - datetime
When I tell SQL Server to reorganize or rebuild the indexes on all my tables, fragmentation on most tables goes down to 0%, but some of these Name tables are 50% fragmented straight away.
I only access these tables in 2 places:
The first is when I select every name from the table and store it in memory to use against new users coming into the system, so I can do something like this: if (boysNames.Contains(user.Name)) { user.Gender = "M"; }. This happens quite often.
The second is when I'm adding new names to the list: I check for the existence of a name, and if it doesn't exist, I add it. This happens rarely.
So what I need to know is:
Is this high level of fragmentation going to cause me problems? How can I reduce the index fragmentation to 0% when it's back at 50% straight after a reorganize/rebuild?
Should I have used an int as the primary key and put a separate index on Name, or was nvarchar the right choice for the primary key?
If the index pages are resident in memory, which is likely with that few rows, fragmentation does not matter. You can benchmark that with a count(*) query. Starting with the 2nd execution, you should see in-memory speeds. If you now compare the results for a 100% and a 0% fragmented table, you should see no difference.
I don't think you have a problem. If you insist, you can set a fill factor lower than 100 so that there is room for new rows when you insert rows at random places. Start with 90 and lower it in increments of 5 until you are happy with the rate fragmentation develops.
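To see whether the fill factor is helping, you can measure fragmentation before and after a rebuild; a minimal T-SQL sketch (the table name dbo.BoysNames is taken from the question, the fill factor of 90 is the suggested starting point):

```sql
-- Check current fragmentation of all indexes on the table
SELECT i.name, s.avg_fragmentation_in_percent, s.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.BoysNames'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id;

-- Rebuild, leaving 10% free space on each page for future random inserts
ALTER INDEX ALL ON dbo.BoysNames REBUILD WITH (FILLFACTOR = 90);
```

Rerun the first query after a few days of inserts to see how quickly fragmentation creeps back up, and lower the fill factor in steps of 5 as suggested above.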
Clustering on an IDENTITY field would remove the fragmentation for the clustered index, but you'd probably need an index on Name which again fragments. If you do not need any index at all, make this a heap table and be done with it. I recommend very much against this, though.
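For reference, the surrogate-key variant from the question would look something like this (a sketch of the trade-off, not a recommendation):

```sql
CREATE TABLE dbo.BoysNames (
    Id          int IDENTITY(1, 1) NOT NULL,
    Name        nvarchar(50)       NOT NULL,
    DateCreated datetime           NOT NULL,
    -- Clustered index on the identity column: new rows always go at the end,
    -- so the clustered index itself never fragments
    CONSTRAINT PK_BoysNames PRIMARY KEY CLUSTERED (Id),
    -- But lookups by name still need this index, and it fragments just as
    -- before, because names arrive in random alphabetical order
    CONSTRAINT UQ_BoysNames_Name UNIQUE NONCLUSTERED (Name)
);
```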
If I have a table with a primary key, i.e. a physically arranged, clustered index that is of type integer and has an identity value like so (pseudo-SQL-code):
MyTable
--------
Id ( int, primary key, identity(1, 1) )
MyField1
MyField2
Would an insert operation in this table take more time as the number of rows in the table grew? Why?
The only reason I can imagine it taking longer is if the table rows are stored as nodes of a linked list internally before being flushed to the disk.
I am assuming that giving a table a clustered index makes a copy of the table data and stores it as an array, so traversing that array is a lot faster (constant time per step, since you need only a single jump by one machine-word-sized offset, i.e. 32 bits on a 32-bit machine and 64 bits on a 64-bit machine) than traversing the linked list.
And would it make any difference to the differential insertion time if the table did not have an index? That is, if the primary key in the above case was missing?
Where may I read about how a relational database stores a table in the RAM and on the disk?
In general, the overhead for inserting a row consists of a few components. Off-hand, I can think of:
Finding a page to put the row.
Updating indexes.
Logging the transaction.
Any overhead for triggers and constraints.
For (1). Because of the clustered index on an identity column, a new row goes into the table at the "end" of the table -- meaning on the last page. There is not a relationship between the size of the table and finding space for the row, in this case.
For (2). There is a very small additional overhead for updating the clustered index as the table grows. But this is very small -- and fragmentation doesn't seem to be an issue.
For (3). This is not related to the table size.
For (4). You don't seem to have triggers or constraints, so this isn't an issue.
So, by my reckoning, there would be very little additional overhead for an insert as the table grows bigger.
Note: There may be other factors as well. For instance, you might need to grow the table space to support a larger table. However, that isn't really a function of the table's size alone, but of the relationship between the size of the data and the available resources.
Is there any reason I shouldn't create an index for every one of my database tables, as a means of increasing performance? It would seem there must be some reason(s) else all tables would automatically have one by default.
I use MS SQL Server 2016.
One index on a table is not a big deal. You automatically have an index on columns (or combinations of columns) that are primary keys or declared as unique.
There is some overhead to an index. The index itself occupies space on disk and memory (when used). So, if space or memory are issues then too many indexes could be a problem. When data is inserted/updated/deleted, then the index needs to be maintained as well as the original data. This slows down updates and locks the tables (or parts of the tables), which can affect query processing.
A small number of indexes on each table is reasonable. These should be designed with the typical query load in mind. If you index every column in every table, then data modifications will slow down. If your data is static, then this is not an issue. However, eating up all the memory with indexes could be an issue.
Advantage of having an index
Read speed: Faster SELECT when that column is in WHERE clause
Disadvantages of having an index
Space: Additional disk/memory space needed
Write speed: Slower INSERT / UPDATE / DELETE
As a minimum I would normally recommend at least 1 index per table; this is created automatically on your table's primary key, for example an IDENTITY column. Foreign keys also normally benefit from an index, which needs to be created manually. Other columns that frequently appear in WHERE clauses should be indexed too, especially if they contain many unique values. The benefit of indexing a low-cardinality column such as gender, which has only 2 values, is debatable.
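For example, a manually created index on a foreign key column would look like this (the table and column names here are purely illustrative):

```sql
-- The primary key gets an index automatically; foreign key columns do not,
-- so joins and lookups on them benefit from an explicit index
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON dbo.Orders (CustomerId);
```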
Most of the tables in my databases have between 1 and 4 indexes, depending on the data in the table and how this data is retrieved.
Indexing is used to improve the performance of SQL queries, but I have always found it a little difficult to decide in which situations I should use an index and in which not. I want to clarify some of my doubts regarding non-clustered indexes:
What is the non-clustered index key? Books say each index row of a non-clustered index contains the non-clustered key value, so does that mean it is the column on which the non-clustered index was created? I.e., if I created an index on empname varchar(50), will the non-clustered key be that empname?
Why is it preferable to create an index on a column with a small width? Is it because comparisons on a wider column take more time for the SQL Server engine, or because it deepens the hierarchy of intermediate nodes -- since the page size is fixed, a wider column means fewer index rows fit in a page or node?
If a table contains multiple non-clustered indexed columns, is the non-clustered key a combination of all these columns, or does SQL Server internally generate some unique id with a locator that points to the actual data row? If possible, please clarify with a real example and diagrams.
Why is it said that a column with non-repeating values is good to index? Even if the column contains repeated values, surely the index will still improve performance, since once the scan reaches a given key value its locator immediately finds the actual row.
If the column used in the index is not unique, how does it find the actual data row in the table?
Please refer me to any book or tutorial that would be useful to clear up my doubts.
First I think we need to cover what an index actually is. In an RDBMS, indexes are usually implemented using a variant of B-trees (the B+ variant is the most common). To put it shortly: think of a binary search tree optimized for being stored on disk. The result of looking up a key in the B-tree is usually the primary key of the table row. That means that if a lookup in the index completes and we need more data than is present in the index, we can do a seek into the table using that primary key.
Please remember that when we think of performance for a RDBMS we usually measure this in disk accesses (I decide to ignore locking and other issues here) and not so much CPU time.
1) Having the index be non-clustered means that the way the data in the table is stored has no relation to the index key, whereas a clustered index specifies that the data in the table will be sorted by (or clustered on) the index key -- this is why there can be only one clustered index per table.
2) Back to our model of measuring performance: if the index key has a small width (fits into a small number of bytes), then each block of disk data we retrieve can hold more keys, and as such lookups in the B-tree are much faster when measured in disk I/O.
3) I tried explaining this further up - unfortunately I don't really have any graphs or drawings to indicate this - hopefully someone else can come along and share these.
4) If you're running a query like so:
SELECT something, something_else FROM sometable t1 WHERE akey = 'some value'
On a table with an index defined like so:
CREATE INDEX idx_sometable_akey ON sometable(akey)
If sometable has a lot of rows where akey equals 'some value', this means a lot of lookups both in the index and in the actual table to retrieve the values of something and something_else. If, on the other hand, there is a good chance that this filtering returns few rows, it also means fewer disk accesses.
5) See earlier explanation
Hope this helps :)
When creating indexes for an SQL table, if I had an index on 2 columns in the table and I changed the index to be on 4 columns, what would be a reasonable increase to expect in the time taken to save, say, 1 million rows?
I know that the answer to this question will vary depending on a lot of factors, such as foreign keys, other indexes, etc., but I thought I'd ask anyway. Not sure if it matters, but I am using MS SQL Server 2005.
EDIT: Ok, so here's some more information that might help get a better answer. I have a table called CostDependency. Inside this table are the following columns:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
DistributionID as UniqueIdentifier (FK)
IsValid as Bit
At the moment there is one Unique index involving ParentPriceID, DependantPriceID, LocationID and DistributionID. The reason for this index is to guarantee that the combination of those four columns is unique. We are not doing any searching on these four columns together. I can however normalise this table and make it into three tables:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
Unique Index on ParentPriceID and DependantPriceID
and
ExtensionID as UniqueIdentifier (PK)
CostDependencyID (FK)
DistributionID as UniqueIdentifier (FK)
Unique Index on CostDependencyID and DistributionID
and
ID as UniqueIdentifier (PK)
ExtensionID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
IsValid as Bit
Unique Index on ExtensionID and LocationID
I am trying to work out if normalising this table and thus reducing the number of columns in the indexes will mean speed improvements when adding a large number of rows (i.e. 1 million).
Thanks, Dane.
With all the new info available, I'd like to suggest the following:
1) If a few of the GUID (UniqueIdentifier) columns are such that a) there are relatively few different values and b) relatively few new values are added after the initial load (for example, LocationID may represent a store, and we may only see a few new stores every day), it would be profitable to spin these off to separate lookup table(s) mapping GUID -> LocalId (an INT or some other small column), and use this LocalId in the main table.
==> Doing so will greatly reduce the overall size of the main table and its associated indexes, at the cost of slightly complicating the update logic (but not its performance), because of the lookup(s) and the need to maintain the lookup table(s) with new values.
2) Unless a particularly important/frequent search case could make [good] use of a clustered index, we could make the clustered index on the main table the one based on the 4-column unique composite key. This would avoid replicating that much data in a separate non-clustered index, and, as counterintuitive as it seems, it would save time for the initial load and for new inserts. The trick would be to use a relatively low fill factor so that node splitting and balancing etc. are infrequent. BTW, if we make the main record narrower with the use of local IDs, we can more readily afford to "waste" space in the fill factor, and more new records will fit in this space before node balancing is required.
3) link664 could provide an order of magnitude for the total number of records in the "main" table and the number of inserts expected per day/week/whenever updates are scheduled. These two parameters could confirm the validity of the approach suggested above, as well as provide hints as to the possibility of dropping the indexes (or some of them) prior to big batch inserts, as suggested by Philip Kelley. Doing so, however, would be contingent on operational considerations such as the need to continue serving searches while new data is inserted.
4) other considerations such as SQL partitioning, storage architecture etc. can also be put to work to improve load and/or retrieve performance.
It depends pretty much on whether the wider index forms a covering index for your queries (and to a lesser extent the ratio of read to writes on that table). Suggest you post your execution plan(s) for the query workload you are trying to improve.
I'm a bit confused over your goals. The (post-edit) question reads that you're trying to optimize data (row) insertion, comparing one table of six columns and a four-column compound primary key against a "normalized" set of three tables of three or four columns each, and each of the three with a two-column compound key. Is this your issue?
My first question is, what are the effects of the "normalization" from one table to three? If you had 1M rows in the single table, how many rows are you likely to have in the three normalized ones? Normalization usually removes redundant data, does it do so here?
Inserting 1M rows into a four-column-PK table will take more time than into a two-column-PK table--perhaps a little, perhaps a lot (see next paragraph). However, if all else is equal, I believe that inserting 1M rows into three two-column-PK tables will be slower than the four column one. Testing is called for.
One thing that is certain is that if the data to be inserted is not loaded in the same order as it will be stored in, it will be a LOT slower than if the data being inserted were already sorted. Multiply that by three, and you'll have a long wait indeed. The most common work-around to this problem is to drop the index, load the data, and then recreate the index (sounds like a time-waster, but for large data sets it can be faster than inserting into an indexed table). A more abstract work-around is to load the table into an unindexed partition, (re)build the index, then switch the partition into the "live" table. Is this an option you can consider?
By and large, people are not overly concerned with performance when data is being inserted--generally they sweat over data retrieval performance. If it's not a warehouse situation, I'd be interested in knowing why insert performance is your apparent bottleneck.
The query optimizer will look at the index and determine if it can use the leading column. If the first column is not in the query, then the index won't be used, period. If the index can be used, the optimizer then checks whether the second column can be used. If your query contains 'where A=? and C=?' and your index is on A,B,C,D, then only the 'A' column will be used in the query plan.
Adding columns to an index can sometimes be useful to avoid the database having to go from the index page to the data page. If your query is 'select D from table where A=? and B=? and C=?', then column 'D' can be returned from the index itself, saving you a bit of I/O by not having to visit the data page.
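In SQL Server this is done with included columns, which are stored only at the leaf level of the index rather than in its key; a sketch using the hypothetical A, B, C, D columns from above:

```sql
-- Key columns support the WHERE clause; INCLUDE makes the index "covering"
-- for queries that only read D, so no data-page lookup is needed
CREATE NONCLUSTERED INDEX IX_SomeTable_ABC
ON dbo.SomeTable (A, B, C)
INCLUDE (D);

-- This query can now be answered entirely from the index:
-- SELECT D FROM dbo.SomeTable WHERE A = @a AND B = @b AND C = @c;
```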
I try to insert millions of records into a table that has more than 20 indexes.
In the last run it took more than 4 hours per 100,000 rows, and the query was cancelled after 3½ days...
Do you have any suggestions about how to speed this up.
(I suspect the many indexes to be the cause. If you also think so, how can I automatically drop indexes before the operation, and then create the same indexes afterwards again?)
Extra info:
The space used by the indexes is about 4 times the space used by the data alone
The inserts are wrapped in a transaction per 100,000 rows.
Update on status:
The accepted answer helped me make it much faster.
You can disable and enable the indexes. Note that disabling them can have unwanted side-effects (such as having duplicate primary keys or unique indices etc.) which will only be found when re-enabling the indexes.
--Disable Index
ALTER INDEX [IXYourIndex] ON YourTable DISABLE
GO
--Enable Index
ALTER INDEX [IXYourIndex] ON YourTable REBUILD
GO
This sounds like a data warehouse operation.
It would be normal to drop the indexes before the insert and rebuild them afterwards.
When you rebuild the indexes, build the clustered index first, and conversely drop it last. They should all have fillfactor 100%.
Code should be something like this
if object_id('IndexList') is not null drop table IndexList
select name into IndexList from dbo.sysindexes where id = object_id('Fact')
if exists (select name from IndexList where name = 'id1') drop index Fact.id1
if exists (select name from IndexList where name = 'id2') drop index Fact.id2
if exists (select name from IndexList where name = 'id3') drop index Fact.id3
.
.
BIG INSERT
RECREATE THE INDEXES
As noted by another answer, disabling indexes will be a very good start.
4 hours per 100,000 rows
[...]
The inserts are wrapped in a transaction per 100,000 rows.
You should look at reducing that number: the server has to maintain a huge amount of state while inside a transaction (so it can be rolled back), and this (along with the indexes) means adding data is very hard work.
Why not wrap each insert statement in its own transaction?
Also look at the nature of the SQL you are using, are you adding one row per statement (and network roundtrip), or adding many?
Disabling and then re-enabling indices is frequently suggested in those cases. I have my doubts about this approach though, because:
(1) The application's DB user needs schema alteration privileges, which it normally should not possess.
(2) The chosen insert approach and/or index schema might be less than optimal in the first place, otherwise rebuilding complete index trees should not be faster than some decent batch inserting (e.g. the client issuing one insert statement at a time, causing thousands of server round-trips; or a poor choice of clustered index, leading to constant index node splits).
That's why my suggestions look a little bit different:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits. Usually an identity column is a good choice
Let the client insert into a temporary heap table first (heap tables don't have any clustered index); then, issue one big "insert-into-select" statement to push all that staging table data into the actual target table
Apply SqlBulkCopy
Decrease transaction logging by choosing bulk-logged recovery model
You might find more detailed information in this article.
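The staging-table idea (the third suggestion above) could be sketched like this; all table, column, and database names are placeholders, and changing the recovery model requires appropriate permissions:

```sql
-- Stage into a heap: no clustered index means cheap, append-only inserts
CREATE TABLE dbo.Staging_Fact (Col1 int, Col2 nvarchar(50));

-- ... bulk-load dbo.Staging_Fact here (e.g. via SqlBulkCopy) ...

-- Switch to bulk-logged recovery to minimize logging for the big copy
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

-- One big insert into the real table, sorted to match its clustered index
-- so the insert does not cause index node splits
INSERT INTO dbo.Fact (Col1, Col2)
SELECT Col1, Col2
FROM dbo.Staging_Fact
ORDER BY Col1;

-- Restore the normal recovery model afterwards
ALTER DATABASE MyDb SET RECOVERY FULL;
```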