I need to insert a pretty huge amount of data, and there are some indexes that I need to define on this table. So my question is: which is better, and why?
Create table --> Insert Data --> Create Indexes
or, Create table --> Create Indexes --> Insert Data
Thnx,
Vabs
Choice #1 will be faster, because it's more efficient to build an index from scratch than to add entries one record at a time. That is because the records are sorted beforehand, so the index blocks are filled up and written in an ordered manner. In Oracle you can also use CREATE INDEX ... NOLOGGING to avoid generating redo for the index build.
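For illustration, an Oracle-style sketch of that (table, column and index names are made up):

-- Build the index after the load, with minimal redo generation.
-- NOLOGGING only skips redo for the index build itself, so take a
-- backup afterwards if you need to be able to recover the index.
CREATE INDEX ix_orders_customer
    ON orders (customer_id)
    NOLOGGING;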
When you build an index, Oracle can sort the whole table in its temporary sort space and then build the index from that.
If the index is already built, then for each row inserted it has to look up the position in the index where the new value goes and then add the value to the index.
So it's a lot quicker to load the data and then build the index.
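Put together, the faster sequence is roughly this (hypothetical table and index names; the same idea applies in most databases):

-- 1. Create the table with no indexes
CREATE TABLE sales (
    sale_id   INT,
    sale_date DATE,
    amount    DECIMAL(10,2)
);

-- 2. Load all the rows into the unindexed table
INSERT INTO sales (sale_id, sale_date, amount)
SELECT sale_id, sale_date, amount FROM staging_sales;

-- 3. Build the index once, in a single sort, after the load
CREATE INDEX ix_sales_date ON sales (sale_date);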
Choice #1 will usually be faster, but there are cases where Choice #2 could be faster, e.g. where you have multiple INSERTs and one or more of the later inserts query the same table and can use one or more of the indexes.
Another consideration, of course, is whether there will be concurrent activity from other sessions before or while the data is loaded; those sessions might suffer if they have no indexes to take advantage of.
I'm currently using MS SQL Server, and I've created a table with indexes on 4 columns. I plan on appending 1MM rows every month end. Is it customary to drop the indexes and recreate them every time you add data to the table?
Don't recreate the index. Instead, you can use update statistics to compute the statistics for the given index or for the whole table:
UPDATE STATISTICS mytable myindex; -- statistics for the given index
UPDATE STATISTICS mytable; -- statistics for the whole table
I don't think it is customary, but it is not uncommon. Presumably the database would not be used for other tasks during the data load, otherwise, well, you'll have other problems.
It could save time and effort if you just disabled the indexes:
ALTER INDEX IX_MyIndex ON dbo.MyTable DISABLE
More info on this non-trivial topic can be found here. Note especially that disabling the clustered index will block all access to the table (i.e. don't do that). If the data being loaded is ordered in [clustered index] order, that can help some.
A last note, do some testing. 1MM rows doesn't seem like that much; the time you save may get used up by recreating the indexes.
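For example, a rough month-end load pattern along those lines (the index and table names are just placeholders):

-- Disable the nonclustered indexes before appending the new rows
ALTER INDEX IX_MyIndex ON dbo.MyTable DISABLE;

-- ... bulk insert the ~1MM new rows here ...

-- REBUILD re-enables the index and refreshes its statistics in one step
ALTER INDEX IX_MyIndex ON dbo.MyTable REBUILD;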
Performance-wise, does a clustered index help or not when bulk inserting hundreds of millions of rows in a table?
Later edit: after the INSERTs I have to put the database into production, so I will have to create the index (or indexes) anyway.
A clustered index specifies that the data rows are stored on the data pages in the order of the index key.
When you insert data, each new row has to be placed relative to the existing values, which incurs overhead.
The one exception is when the clustered key is an identity column that is generated during the insert. Then the database knows that the new data always goes "at the end" of the table.
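A minimal sketch of that exception (hypothetical table): with an identity column as the clustered key, every new row simply appends to the end of the clustered index.

CREATE TABLE dbo.Events (
    EventId INT IDENTITY(1,1) NOT NULL,
    Payload NVARCHAR(200)     NOT NULL,
    CONSTRAINT PK_Events PRIMARY KEY CLUSTERED (EventId)  -- clustered on the identity
);

-- New rows always get the next identity value, so they land on the
-- last page instead of causing page splits in the middle of the index.
INSERT INTO dbo.Events (Payload) VALUES (N'first'), (N'second');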
Indexes are meant for speeding up retrieval (SELECT) of rows. They only work against you for INSERT, DELETE, and UPDATE. In your case, if INSERT is the predominant operation in your system, don't go for indexes at all. Even in your production system, assess the ratio between retrieval operations and insert/update operations, and only if retrieval turns out to be dominant should you think about indexes.
Note: whenever we define a primary key on a table, a basic index structure is already created for that table. So, without any specific need for retrieval optimization, there is no actual need to design and implement further indexes.
You can know more here: https://www.geeksforgeeks.org/sql-indexes/
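For instance, on SQL Server you can see the index that a primary key constraint creates behind the scenes (table name is made up):

CREATE TABLE dbo.Customers (
    CustomerId INT NOT NULL PRIMARY KEY,   -- no explicit CREATE INDEX anywhere
    Name       NVARCHAR(100) NOT NULL
);

-- The primary key constraint already created an index (clustered by default)
SELECT name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.Customers');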
I have a temp table #Data, that I populate inside a stored procedure.
It contains like 15M rows.
Then I create a clustered index, say IX_Data, on a couple of columns of the temp table #Data.
Then I delete from #Data, which removes about 1M rows (leaving about 14M rows).
My question: at this point, should I drop IX_Data and recreate it?
#Data is referenced only once in the rest of the stored procedure.
You should not. Indexes are maintained by the DBMS automatically and are always kept in sync.
That's why it's not recommended to create more indexes than you need: each one adds a performance penalty to every DML query.
You don't specify which database you are on (maybe it's Oracle?), but your question seems to be about fragmentation, since data integrity is maintained in every database even if you delete a million rows (so there's no need to drop and recreate an index for data integrity).
So, how do you know whether your index is fragmented (and therefore needs to be recreated, or better, rebuilt)? Every database has its own method. Oracle has internal tables from which you can determine whether an object is fragmented.
But it depends on your type of database.
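If it happens to be SQL Server (the #Data temp-table syntax suggests it), one way to check is sys.dm_db_index_physical_stats; a hedged sketch, using the usual rule-of-thumb thresholds of roughly 5-30% fragmentation for a reorganize and above 30% for a rebuild:

-- Fragmentation of the indexes on the temp table (temp tables live in tempdb)
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID('tempdb'), OBJECT_ID('tempdb..#Data'), NULL, NULL, 'LIMITED') AS ps
JOIN tempdb.sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id;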
Which option is better and faster?
Inserting the data after creating the index on the empty table, or creating the unique index after inserting the data? I have around 10M rows to insert. Which option would be better so that I have the least downtime?
Insert your data first, then create your index.
Every time you do an UPDATE, INSERT or DELETE operation, any indexes on the table have to be updated as well. So if you create the index first, and then insert 10M rows, the index will have to be updated 10M times as well (unless you're doing bulk operations).
It is faster and better to insert the records first and create the index after the rows have been imported. It's faster because you don't have the overhead of index maintenance while the rows are inserted, and it's better from a fragmentation standpoint for the index.
Obviously for a unique index, be sure that the data you are importing is unique so you don't have failures when trying to create the index.
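A quick way to sanity-check that before building the index (table and column names are placeholders):

-- Any rows returned here would make CREATE UNIQUE INDEX fail
SELECT key_column, COUNT(*) AS dup_count
FROM dbo.ImportedData
GROUP BY key_column
HAVING COUNT(*) > 1;

-- If nothing comes back, the unique index can be built safely
CREATE UNIQUE INDEX UX_ImportedData_Key ON dbo.ImportedData (key_column);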
As others have said, insert first and add the index later. If the table already exists and you have to insert a pile of data like this, drop all indexes and constraints, insert the data, then re-apply your indexes first and your constraints after that. You'll certainly want to do intermediate commits to help avoid running out of rollback segment space or the like. If you're inserting this much data, it might also be worth looking at SQL*Loader to save yourself time and aggravation.
I hope this helps.
I am trying to insert millions of records into a table that has more than 20 indexes.
In the last run it took more than 4 hours per 100,000 rows, and the query was cancelled after 3½ days...
Do you have any suggestions about how to speed this up?
(I suspect the many indexes are the cause. If you agree, how can I automatically drop the indexes before the operation and then recreate the same indexes afterwards?)
Extra info:
The space used by the indexes is about 4 times the space used by the data alone.
The inserts are wrapped in a transaction per 100,000 rows.
Update on status:
The accepted answer helped me make it much faster.
You can disable and re-enable the indexes. Note that disabling them can have unwanted side effects (such as duplicates slipping past a disabled primary key or unique index), which will only be discovered when re-enabling the indexes.
--Disable Index
ALTER INDEX [IXYourIndex] ON YourTable DISABLE
GO
--Enable Index
ALTER INDEX [IXYourIndex] ON YourTable REBUILD
GO
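To answer the "automatically" part of the question: rather than naming all 20+ indexes one by one, you can generate the statements from the catalog. A sketch for SQL Server (the table name is a placeholder, and this deliberately skips the clustered index, which must stay enabled):

DECLARE @sql NVARCHAR(MAX) = N'';

-- Build one DISABLE statement per nonclustered index on the table
SELECT @sql = @sql + N'ALTER INDEX ' + QUOTENAME(name)
            + N' ON dbo.YourTable DISABLE;' + CHAR(13)
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.YourTable')
  AND type_desc = 'NONCLUSTERED';

EXEC sp_executesql @sql;

-- ... run the big insert here ...

-- Afterwards, run the same pattern again with REBUILD instead of DISABLE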
This sounds like a data warehouse operation.
It would be normal to drop the indexes before the insert and rebuild them afterwards.
When you rebuild the indexes, build the clustered index first, and conversely drop it last. They should all have fillfactor 100%.
Code should be something like this
if object_id('IndexList') is not null drop table IndexList
select name into IndexList from dbo.sysindexes where id = object_id('Fact')
if exists (select name from IndexList where name = 'id1') drop index Fact.id1
if exists (select name from IndexList where name = 'id2') drop index Fact.id2
if exists (select name from IndexList where name = 'id3') drop index Fact.id3
.
.
-- BIG INSERT
-- RECREATE THE INDEXES
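The rebuild step after the load would then look something like this (index and column names are illustrative), clustered index first so the nonclustered indexes only have to be built once on top of it:

-- Clustered index first, fill factor 100% for a read-mostly warehouse table
CREATE CLUSTERED INDEX ix_Fact_date
    ON dbo.Fact (date_key)
    WITH (FILLFACTOR = 100);

-- Then the nonclustered indexes
CREATE NONCLUSTERED INDEX id1 ON dbo.Fact (customer_key) WITH (FILLFACTOR = 100);
CREATE NONCLUSTERED INDEX id2 ON dbo.Fact (product_key)  WITH (FILLFACTOR = 100);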
As noted in another answer, disabling the indexes will be a very good start.
4 hours per 100,000 rows
[...]
The inserts are wrapped in a transaction per 100,000 rows.
You should look at reducing that number: the server has to maintain a huge amount of state while a transaction is open (so that it can be rolled back), and this, along with the indexes, means adding data is very hard work.
Why not wrap each insert statement in its own transaction?
Also look at the nature of the SQL you are using: are you adding one row per statement (and paying a network round trip per row), or many rows at once?
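For example, batching several rows into one INSERT statement already removes most of the per-row round trips (table name and values are made up; SQL Server caps a single VALUES list at 1000 rows):

-- One statement, one round trip, several rows
INSERT INTO dbo.BigTable (id, payload)
VALUES (1, 'a'),
       (2, 'b'),
       (3, 'c');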
Disabling and then re-enabling indexes is frequently suggested in such cases. I have my doubts about this approach, though, because:
(1) The application's DB user needs schema alteration privileges, which it normally should not possess.
(2) The chosen insert approach and/or index schema might be less than optimal in the first place; otherwise, rebuilding complete index trees should not be faster than some decent batch inserting (e.g. the client issuing one insert statement at a time, causing thousands of server round trips; or a poor choice of clustered index, leading to constant index node splits).
That's why my suggestions look a little bit different:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits. Usually an identity column is a good choice
Let the client insert into a temporary heap table first (heap tables don't have any clustered index); then issue one big insert-into-select statement to push all of that staging data into the actual target table (see the sketch below)
Apply SqlBulkCopy
Decrease transaction logging by choosing the bulk-logged recovery model
You might find more detailed information in this article.
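A rough sketch of the staging-heap idea from the list above (table and column names are placeholders):

-- 1. Heap staging table: no clustered index, so the client's inserts are cheap
CREATE TABLE dbo.StagingRows (
    id      INT           NOT NULL,
    payload NVARCHAR(200) NOT NULL
);

-- 2. The client bulk-loads into the heap (SqlBulkCopy, batched INSERTs, ...)

-- 3. One set-based statement pushes everything into the real target table,
--    ordered by its clustered key to keep the inserts append-friendly
INSERT INTO dbo.TargetTable (id, payload)
SELECT id, payload
FROM dbo.StagingRows
ORDER BY id;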