Slow bulk insert for table with many indexes

I am trying to insert millions of records into a table that has more than 20 indexes.
In the last run it took more than 4 hours per 100.000 rows, and the query was cancelled after 3½ days...
Do you have any suggestions about how to speed this up?
(I suspect the many indexes are the cause. If you think so too, how can I automatically drop the indexes before the operation and then recreate the same indexes afterwards?)
Extra info:
The space used by the indexes is about 4 times the space used by the data alone
The inserts are wrapped in a transaction per 100.000 rows.
Update on status:
The accepted answer helped me make it much faster.

You can disable and enable the indexes. Note that disabling them can have unwanted side-effects (such as duplicate values in primary keys or unique indices), which will only be detected when the indexes are re-enabled.
--Disable Index
ALTER INDEX [IXYourIndex] ON YourTable DISABLE
GO
--Enable Index
ALTER INDEX [IXYourIndex] ON YourTable REBUILD
GO
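To avoid listing more than 20 indexes by hand, the statements can be generated from the catalog. The following is a minimal sketch, assuming a SQL Server version that has the sys.indexes catalog view (2005 or later). dbo.YourTable is a placeholder name, and skipping primary-key and unique-constraint indexes is a choice made for this sketch, not something the answer above prescribes.

```sql
-- Generate one DISABLE statement per nonclustered index on the table.
-- dbo.YourTable is a placeholder; replace it with the real table name.
SELECT N'ALTER INDEX ' + QUOTENAME(i.name)
     + N' ON ' + QUOTENAME(OBJECT_SCHEMA_NAME(i.object_id))
     + N'.' + QUOTENAME(OBJECT_NAME(i.object_id))
     + N' DISABLE;' AS disable_statement
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.YourTable')
  AND i.type = 2                  -- nonclustered indexes only; the clustered index stays enabled
  AND i.is_primary_key = 0        -- assumption: keep key enforcement during the load
  AND i.is_unique_constraint = 0
  AND i.is_disabled = 0;
GO
-- Run the generated statements, do the bulk insert, then rebuild everything.
-- REBUILD also re-enables the disabled indexes.
ALTER INDEX ALL ON dbo.YourTable REBUILD;
GO
```

Leaving the unique indexes enabled costs some insert speed but avoids the duplicate-key surprise at rebuild time mentioned above.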

This sounds like a data warehouse operation.
It would be normal to drop the indexes before the insert and rebuild them afterwards.
When you rebuild the indexes, build the clustered index first, and conversely drop it last. They should all have fillfactor 100%.
Code should be something like this
-- Remember which indexes exist on the fact table before the load
if object_id('IndexList') is not null drop table IndexList
select name into IndexList from dbo.sysindexes where id = object_id('Fact')
-- Drop each index only if it exists
if exists (select name from IndexList where name = 'id1') drop index Fact.id1
if exists (select name from IndexList where name = 'id2') drop index Fact.id2
if exists (select name from IndexList where name = 'id3') drop index Fact.id3
-- .
-- . (repeat for the remaining indexes)
-- BIG INSERT
-- RECREATE THE INDEXES

As noted by another answer disabling indexes will be a very good start.
4 hours per 100.000 rows
[...]
The inserts are wrapped in a transaction per 100.000 rows.
You should look at reducing that number. The server has to maintain a huge amount of state while a transaction is open (so that it can be rolled back); this, along with the indexes, makes adding data very hard work.
Why not wrap each insert statement in its own transaction?
Also look at the nature of the SQL you are using: are you adding one row per statement (and per network roundtrip), or many rows at once?
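As an illustration of "many rows per statement", here is a small sketch using a multi-row VALUES list, which SQL Server supports from version 2008. The table dbo.YourTable and its columns Id and Name are hypothetical.

```sql
-- One statement and one round trip for several rows.
-- dbo.YourTable, Id and Name are placeholder names.
BEGIN TRANSACTION;
INSERT INTO dbo.YourTable (Id, Name)
VALUES (1, N'first'),
       (2, N'second'),
       (3, N'third');
COMMIT TRANSACTION;
```

A single VALUES list in an INSERT is limited to 1000 rows, so larger loads need several such statements or a bulk-load interface.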

Disabling and then re-enabling indices is frequently suggested in those cases. I have my doubts about this approach though, because:
(1) The application's DB user needs schema alteration privileges, which it normally should not possess.
(2) The chosen insert approach and/or index schema might be less than optimal in the first place; otherwise rebuilding complete index trees should not be faster than some decent batch-inserting (e.g. the client issuing one insert statement at a time, causing thousands of server roundtrips; or a poor choice of clustered index, leading to constant index node splits).
That's why my suggestions look a little bit different:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits. Usually an identity column is a good choice
Let the client insert into a temporary heap table first (heap tables don't have any clustered index); then, issue one big "insert-into-select" statement to push all that staging table data into the actual target table
Apply SqlBulkCopy
Decrease transaction logging by choosing bulk-logged recovery model
You might find more detailed information in this article.
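The staging-table suggestion from the list above could look roughly like this in T-SQL. This is only a sketch: the table names, the columns and the database name YourDatabase are assumptions, and the client-side bulk load itself is not shown.

```sql
-- Optional and subject to your backup strategy (database name is a placeholder):
-- ALTER DATABASE YourDatabase SET RECOVERY BULK_LOGGED;

-- Staging table without a clustered index (a heap).
CREATE TABLE dbo.YourTable_Staging
(
    Id   int           NOT NULL,
    Name nvarchar(100) NULL
);
GO
-- The client bulk-loads into dbo.YourTable_Staging here (for example via SqlBulkCopy).

-- Push the staged rows into the target table in a single set-based statement.
INSERT INTO dbo.YourTable (Id, Name)
SELECT Id, Name
FROM dbo.YourTable_Staging;
GO
-- Empty the staging table for the next load.
TRUNCATE TABLE dbo.YourTable_Staging;
GO
```

With this layout the application user only needs insert rights on the staging and target tables, which addresses concern (1) about schema alteration privileges.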

Related

Am I supposed to drop and recreate indexes on tables every time I add data to it?

I'm currently using MSSQL Server, I've created a table with indexes on 4 columns. I plan on appending 1mm rows every month end. Is it customary to drop the indexes, and recreate them every time you add data to the table?

Don't recreate the index. Instead, you can use update statistics to compute the statistics for the given index or for the whole table:
UPDATE STATISTICS mytable myindex; -- statistics for the table index
UPDATE STATISTICS mytable; -- statistics for the whole table

I don't think it is customary, but it is not uncommon. Presumably the database would not be used for other tasks during the data load, otherwise, well, you'll have other problems.
It could save time and effort if you just disabled the indexes:
ALTER INDEX IX_MyIndex ON dbo.MyTable DISABLE
More info on this non-trivial topic can be found here. Note especially that disabling the clustered index will block all access to the table (i.e. don't do that). If the data being loaded is ordered in [clustered index] order, that can help some.
A last note, do some testing. 1MM rows doesn't seem like that much; the time you save may get used up by recreating the indexes.

Fast deletion of many rows in data warehouse data

In SQL Server 2008 I have some million rows of data which need to be deleted. They are scattered across a handful of tables. Deletion takes up to 20 seconds, which I think is way too slow! The data to be deleted is identified by a timestamp column. Here is what I have done so far in order to optimize:
Using isolation level read uncommitted. I don't care about transactions. If we fail the user will issue the delete operation again. And new data is ensured not to have the timestamp we are deleting.
Deleting leaf tables before parent tables.
The timestamp column is part of the PK clustered index; in fact it is in the first position of the PK/index.
Each table is emptied using a loop which deletes top 200000 entries in order to reduce the transaction log overhead.
Neither I/O nor CPU is maxed out on the server
What have I overlooked?
I am also unsure about the effect of moving the timestamp column to the first position in the PK. After doing so, must I reorganize the tables, or is SQL Server smart enough to do this itself? My understanding of a clustered index is that, since it defines the physical layout of the rows, SQL Server is forced to reorganize the data. But we have had no complaints from the customer that changing the clustered index took a long time.

Please make sure that the tables you want to delete data from have a primary key explicitly declared.
Wrong: create table myTable (ID int)
Right: create table myTable (ID int PRIMARY KEY)
In addition, try adding "option (recompile)", which may help performance:
DELETE FROM myTable
WHERE timestamp in (select timestamp from other_table)
OPTION (RECOMPILE)
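For reference, the batched loop described in the question (deleting 200000 rows at a time) can be sketched as follows. The table name dbo.myTable, a datetime column named [timestamp] and the variable @cutoff are assumptions made for this example.

```sql
-- Delete in batches so that each transaction and its log usage stay small.
-- dbo.myTable, [timestamp] and @cutoff are placeholders.
DECLARE @cutoff datetime = '20100101';
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (200000)
    FROM dbo.myTable
    WHERE [timestamp] = @cutoff;

    SET @rows = @@ROWCOUNT;   -- 0 once nothing is left to delete
END;
```

Because the timestamp is the leading column of the clustered index, each batch should touch one contiguous range of the table.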

How to speed up a slow UPDATE query

I have the following UPDATE query:
UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;
Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.
The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.
The table (when uncompressed) has about 46 gigs of data in it, however the majority of that data is in a text field called html. I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html field, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back which would probably take all evening.
Ideas? I'm using Postgres 9.0.0.
UPDATE:
Here's the schema:
CREATE TABLE indexer.pages
(
id uuid NOT NULL,
url character varying(1024) NOT NULL,
firstcrawled timestamp with time zone NOT NULL,
lastcrawled timestamp with time zone NOT NULL,
recipeid uuid,
html text NOT NULL,
lasterror character varying(1024),
missingings smallint,
CONSTRAINT pages_pkey PRIMARY KEY (id ),
CONSTRAINT indexer_pages_uniqueurl UNIQUE (url )
);
I also have two indexes:
CREATE INDEX idx_indexer_pages_missingings
ON indexer.pages
USING btree
(missingings )
WHERE missingings > 0;
and
CREATE INDEX idx_indexer_pages_null
ON indexer.pages
USING btree
(recipeid )
WHERE NULL::boolean;
There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.

What @kgrittn posted as comment is the best answer so far. I am merely filling in details.
Before you do anything else, you should upgrade PostgreSQL to a current version, at least to the last security release of your major version. See guidelines on the project.
I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster - effectively removing (most of) the need for vacuuming. Related:
Redundant data in update statements
If your column is used in any index in any way, HOT UPDATEs are disabled, because it would break the index(es). If that is the case, you should be able to speed up the query a lot by deleting all of these indexes before you UPDATE and recreate them later.
In this context it would help to run multiple smaller UPDATEs:
If ...
... the updated column is not involved in any indexes (enabling HOT updates).
... the UPDATE is easily divided into multiple patches in multiple transactions.
... the rows in those patches are spread out over the table (physically, not logically).
... there are no other concurrent transactions keeping dead tuples from being reused.
Then you would not need to VACUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.
The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.
Also, your update is not likely to break any foreign key constraints. You could try to delete & recreate those, too. This does open a time slot where referential integrity would not be enforced. If the integrity is violated during the UPDATE you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table (same as with dropping / recreating indexes or triggers).
Lastly, disable & enable triggers that are not needed for the update.
Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.
So:
BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;
You cannot run VACUUM inside a transaction block. Per documentation:
VACUUM cannot be executed inside a transaction block.
You could split your operation into a few big chunks and run in between:
VACUUM ANALYZE tbl;
If you don't have to deal with concurrent transactions you could (even more effectively):
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl; -- optionally in between, or autovacuum kicks in
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
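Applied to the table from the question, the "logical slices" could be cut along the primary key. The sketch below assumes that the uuid values in id are spread roughly evenly over the value range, so that four ranges give four slices of similar size; the boundaries are illustrative only.

```sql
-- Slice 1 of 4: ids in the first quarter of the uuid range.
BEGIN;
UPDATE indexer.pages
SET    lasterror = NULL
WHERE  lasterror IS NOT NULL
AND    id < '40000000-0000-0000-0000-000000000000';
COMMIT;

-- Optional between slices; must run outside a transaction block.
VACUUM ANALYZE indexer.pages;

-- Slice 2 of 4: ids in the second quarter.
BEGIN;
UPDATE indexer.pages
SET    lasterror = NULL
WHERE  lasterror IS NOT NULL
AND    id >= '40000000-0000-0000-0000-000000000000'
AND    id <  '80000000-0000-0000-0000-000000000000';
COMMIT;

-- Slices 3 and 4 follow the same pattern, with the boundary
-- 'c0000000-0000-0000-0000-000000000000' between them.
```

Keeping the lasterror IS NOT NULL condition means a slice that is re-run after a failure only touches rows that still need the change.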

UPDATE Indexer.Pages
SET LastError=NULL
;
The where clause is not needed since the NULL fields are already NULL, so it won't harm to set them to NULL again (I don't think this would affect performance significantly).
Given your number_of_rows = 500K and your table size=46G, I conclude that your average rowsize is 90KB. That is huge. Maybe you could move {unused, sparse} columns of your table to other tables?

Your theory is probably correct. Reading the full table (and then doing anything) is probably causing the slow-down.
Why don't you just create another table that has PageId and LastError? Initialize this with the data in the table you have now (which should take less than 93 minutes). Then, use the LastError from the new table.
At your leisure, you can remove LastError from your existing table.
By the way, I don't normally recommend keeping two copies of a column in two separate tables. In this case, though, you sound like you are stuck and need a way to proceed.
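A possible shape for that side table in PostgreSQL, as a sketch only: the name indexer.page_errors and its column names are made up for this example, while the column types follow the schema posted in the question.

```sql
-- Narrow side table holding only the error text per page.
CREATE TABLE indexer.page_errors
(
    pageid    uuid PRIMARY KEY REFERENCES indexer.pages (id),
    lasterror character varying(1024) NOT NULL
);

-- Initialize it from the existing data.
INSERT INTO indexer.page_errors (pageid, lasterror)
SELECT id, lasterror
FROM   indexer.pages
WHERE  lasterror IS NOT NULL;

-- "Set LastError to NULL everywhere" then becomes a cheap operation
-- that never touches the wide html rows.
TRUNCATE indexer.page_errors;

-- At your leisure:
-- ALTER TABLE indexer.pages DROP COLUMN lasterror;
```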

Index and Insert Operations

I have one job with around 100K records to process. This job truncates the destination tables and then inserts all the records "one at a time" (not as a batch insert) into these tables.
I need to know how indexes will affect performance as these records are inserted. Will the cost of maintaining the indexes during the job outweigh the benefit of using them?
Are there any best practices or optimization hints in such situation?

This kind of question can only be answered on a case-by-case basis. However the following general considerations may be of help:
Unless some of the data for the inserts comes from additional lookups and the like, no index is useful during INSERT (i.e. for this very operation; indexes may of course be useful for other queries under other sessions/users).
[On the other hand...] The presence of indexes on a table slows down INSERT (or, more generally, UPDATE or DELETE) operations.
The order in which the new records are added may matter.
Special consideration is needed if the table has a clustered index.
Deciding whether to drop indexes (all or some of them) prior to the INSERT operation depends largely on the number of records added relative to the number already in the table.
INSERT operations often introduce index fragmentation, which is in and of itself an additional incentive to drop the index(es) prior to the data load and rebuild it (them) afterwards.
In general, adding 100,000 records is "small potatoes" for MS-SQL. Barring a particular situation such as unusually wide records or the presence of many (and possibly poorly defined) constraints of various kinds, SQL Server should handle this load in minutes rather than hours on almost any hardware configuration.

The answer to this question is very different depending on whether the indexes you're talking about are clustered or not. Clustered indexes force SQL Server to store data in sorted order, so if you try to insert a record that doesn't sort to the end of your clustered index, your insert can result in significant reshuffling of your data, as many of your records are moved to make room for your new record.
Nonclustered indexes don't have this problem; all the server has to do is keep track of where the new record is stored. So if your index is clustered (most clustered indexes are primary keys, but this is not required; run "sp_helpindex [TABLENAME]" to find out for sure), you would almost certainly be better off adding the index after all of your inserts are done.
As to the performance of inserts on nonclustered indexes, I can't actually tell you; in my experience, the slowdown hasn't been enough to worry about. The index overhead in this case would be vastly outweighed by the overhead of doing all your inserts one at a time.
Edit: Since you have the luxury of truncating your entire table, performance-wise, you're almost certainly better off dropping (or NOCHECKing) your indexes and constraints before doing all of your inserts, then adding them back in at the end.
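For a truncate-and-reload job like this one, the sequence might look as follows in T-SQL. This is a sketch under assumptions: dbo.Destination and IX_Destination_Col are placeholder names, and the table is assumed not to be referenced by a foreign key from another table (TRUNCATE TABLE is rejected in that case).

```sql
-- Empty the table first; disabling indexes on an empty table is then trivial.
TRUNCATE TABLE dbo.Destination;
GO
-- Stop checking FOREIGN KEY and CHECK constraints during the load.
ALTER TABLE dbo.Destination NOCHECK CONSTRAINT ALL;
-- Repeat for each nonclustered index on the table.
ALTER INDEX IX_Destination_Col ON dbo.Destination DISABLE;
GO
-- Run the inserts here.
GO
-- Rebuild (and thereby re-enable) the indexes, then re-validate the constraints.
ALTER INDEX ALL ON dbo.Destination REBUILD;
ALTER TABLE dbo.Destination WITH CHECK CHECK CONSTRAINT ALL;
GO
```

WITH CHECK makes SQL Server validate the loaded rows against the constraints, so bad data surfaces at this point rather than later.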

The insert statement is the only operation that cannot directly benefit from indexing because it has no where clause.
The more indexes a table has, the slower the execution becomes.
If there are indexes on the table, the database must make sure the new entry is also found via these indexes. For this reason it has to add the new entry to each and every index on that table. The number of indexes is therefore a multiplier for the cost of an insert statement.
Check here

Insertion of data after creating index on empty table or creating unique index after inserting data on oracle?

Which option is better and faster?
Insertion of data after creating index on empty table or creating unique index after inserting data. I have around 10M rows to insert. Which option would be better so that I could have least downtime.

Insert your data first, then create your index.
Every time you do an UPDATE, INSERT or DELETE operation, any indexes on the table have to be updated as well. So if you create the index first, and then insert 10M rows, the index will have to be updated 10M times as well (unless you're doing bulk operations).

It is faster and better to insert the records and then create the index after rows have been imported. It's faster because you don't have the overhead of index maintenance as the rows are inserted and it is better from a fragmentation standpoint on your indexes.
Obviously for a unique index, be sure that the data you are importing is unique so you don't have failures when trying to create the index.
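One way to check for duplicates before building the unique index is a simple grouping query. In this sketch target_table, the key column id and the index name are placeholders.

```sql
-- List key values that occur more than once.
SELECT id, COUNT(*) AS occurrences
FROM   target_table
GROUP  BY id
HAVING COUNT(*) > 1;

-- If the query above returns no rows, the unique index can be created.
CREATE UNIQUE INDEX target_table_id_ux ON target_table (id);
```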

As others have said, insert first and add the index later. If the table already exists and you have to insert a pile of data like this, drop all indexes and constraints, insert the data, then re-apply first your indexes and then your constraints. You'll certainly want to do intermediate commits to help preclude the possibility that you'll run out of rollback segment space or something similar. If you're inserting this much data it might prove useful to look at using SQL*Loader to save yourself time and aggravation.
I hope this helps.
