How to speed up a slow UPDATE query - sql

I have the following UPDATE query:
UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;
Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.
The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.
The table (when uncompressed) has about 46 gigs of data in it, however the majority of that data is in a text field called html. I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html field, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back which would probably take all evening.
Ideas? I'm using Postgres 9.0.0.
UPDATE:
Here's the schema:
CREATE TABLE indexer.pages
(
id uuid NOT NULL,
url character varying(1024) NOT NULL,
firstcrawled timestamp with time zone NOT NULL,
lastcrawled timestamp with time zone NOT NULL,
recipeid uuid,
html text NOT NULL,
lasterror character varying(1024),
missingings smallint,
CONSTRAINT pages_pkey PRIMARY KEY (id ),
CONSTRAINT indexer_pages_uniqueurl UNIQUE (url )
);
I also have two indexes:
CREATE INDEX idx_indexer_pages_missingings
ON indexer.pages
USING btree
(missingings )
WHERE missingings > 0;
and
CREATE INDEX idx_indexer_pages_null
ON indexer.pages
USING btree
(recipeid )
WHERE NULL::boolean;
There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.

What @kgrittn posted as a comment is the best answer so far. I am merely filling in details.
Before you do anything else, you should upgrade PostgreSQL to a current version, at least to the last security release of your major version. See the versioning guidelines of the project.
I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster - effectively removing (most of) the need for vacuuming. Related:
Redundant data in update statements
If your column is used in any index in any way, HOT UPDATEs are disabled, because it would break the index(es). If that is the case, you should be able to speed up the query a lot by deleting all of these indexes before you UPDATE and recreate them later.
In this context it would help to run multiple smaller UPDATEs:
If ...
... the updated column is not involved in any indexes (enabling HOT updates).
... the UPDATE is easily divided into multiple patches in multiple transactions.
... the rows in those patches are spread out over the table (physically, not logically).
... there are no other concurrent transactions keeping dead tuples from being reused.
Then you would not need to VACUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.
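To illustrate, a rough sketch of what such patches could look like, splitting on the uuid primary key from the question (the boundary values are arbitrary, and each statement is meant to run as its own transaction, e.g. in autocommit mode):
UPDATE indexer.pages SET lasterror = NULL
WHERE lasterror IS NOT NULL
AND id < '55555555-0000-0000-0000-000000000000';

UPDATE indexer.pages SET lasterror = NULL
WHERE lasterror IS NOT NULL
AND id >= '55555555-0000-0000-0000-000000000000'
AND id < 'aaaaaaaa-0000-0000-0000-000000000000';

UPDATE indexer.pages SET lasterror = NULL
WHERE lasterror IS NOT NULL
AND id >= 'aaaaaaaa-0000-0000-0000-000000000000';
With random UUIDs, each range touches rows scattered all over the table, which matches the "spread out physically" condition above.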
The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.
Also, your update is not likely to break any foreign key constraints. You could try to delete & recreate those, too. This does open a time slot where referential integrity would not be enforced. If the integrity is violated during the UPDATE, you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table - same as with dropping / recreating indexes or triggers.
Lastly, disable triggers that are not needed for the update and re-enable them afterwards.
Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.
So:
BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;
You cannot run VACUUM inside a transaction block. Per documentation:
VACUUM cannot be executed inside a transaction block.
You could split your operation into a few big chunks and run in between:
VACUUM ANALYZE tbl;
If you don't have to deal with concurrent transactions you could (even more effectively):
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl; -- optionally in between, or autovacuum kicks in
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
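For the two partial indexes from the question (assuming nothing else needs them while the job runs), the DROP / RECREATE placeholders could look like this:
DROP INDEX indexer.idx_indexer_pages_missingings;
DROP INDEX indexer.idx_indexer_pages_null;
-- ... run the UPDATE (in one statement or in slices) ...
CREATE INDEX idx_indexer_pages_missingings ON indexer.pages (missingings) WHERE missingings > 0;
CREATE INDEX idx_indexer_pages_null ON indexer.pages (recipeid) WHERE NULL::boolean;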

UPDATE Indexer.Pages
SET LastError=NULL
;
The WHERE clause is not needed, since the NULL fields are already NULL and it does no harm to set them to NULL again (I don't think this would affect performance significantly).
Given your number_of_rows = 500K and your table size = 46 GB, I conclude that your average row size is about 90 kB. That is huge. Maybe you could move {unused, sparse} columns of your table to other tables?

Your theory is probably correct. Reading the full table (and then doing anything) is probably causing the slow-down.
Why don't you just create another table that has PageId and LastError? Initialize this with the data in the table you have now (which should take less than 93 minutes). Then, use the LastError from the new table.
At your leisure, you can remove LastError from your existing table.
By the way, I don't normally recommend keeping two copies of a column in two separate tables. In this case, though, you sound like you are stuck and need a way to proceed.
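A rough sketch of that approach, using the column names from the posted schema (page_errors is just a made-up name):
CREATE TABLE indexer.page_errors AS
SELECT id, lasterror FROM indexer.pages;

ALTER TABLE indexer.page_errors ADD PRIMARY KEY (id);

-- future resets only touch the small table:
UPDATE indexer.page_errors SET lasterror = NULL WHERE lasterror IS NOT NULL;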

Related

Proper way of updating a whole table in Redshift, drop table + create table vs. truncate + insert into table

Currently I have many tables for which I have to update the information they hold, sometimes on a daily or weekly basis. So far, I've been doing this with a combination of DROP TABLE IF EXISTS some_schema.some_table_name; followed by CREATE TABLE some_schema.some_table_name AS ( SELECT ... FROM ... WHERE ...); and I would like to know what is the "best-practice" or proper way of doing so.
I've read that INSERT operations in Redshift are quite expensive, so I've been avoiding its usage, but maybe the use of TRUNCATE with INSERT is better than dropping and creating.
How can I confirm which option is better?
I've seen this article from Redshift docs, but I'm not sure if it is the best option, since I could have not only to remove records, but keeping and inserting as well.
If your desire is to completely erase a table and replace the data, then the general pattern you are following is fine. However, there are a few things you should be doing to make things safer / better.
There are 3 patterns to do this and one is clearly the lowest performance. These are Delete/Insert, Truncate/Insert, and Drop/Create/Insert. Of these, Delete/Insert is NOT what you want to do from a performance point of view. This process invalidates all the rows in the table (it does not physically remove them) and adds new valid rows. This doubles the size of the table, wasting space, and the table then needs to be vacuumed. The only upside of this approach is that it doesn't have the downsides of the other approaches, but that only matters in certain situations. Go down this path only if you have to.
Truncate/Insert is fast and maintains the same table id as the original table. Because truncate operates on the blocks of the table (unlinking them) it is fast but there is some small overhead in managing all the block links. Since the table definition is unchanged all DDL stays defined and dependent views can keep pointing to the table. The downside with truncate is that it forces a COMMIT to occur which means that until the table is repopulated with new data other users of the database can see an empty table. This can lead to incorrect results during these windows. Not good.
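For reference, a sketch of the Truncate/Insert pattern using the names from the question (the SELECT stands for whatever query feeds your CTAS today):
TRUNCATE some_schema.some_table_name;  -- note: commits immediately in Redshift
INSERT INTO some_schema.some_table_name
SELECT ... FROM ... WHERE ...;
ANALYZE some_schema.some_table_name;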
Lastly there is Drop/Create/Insert. This approach is marginally (very slightly, and only for large tables) faster than truncate in the ideal case. It just throws away the old blocks. There is some additional cost to setting up the new table (of the same name), so truncate and drop are about the same speed unless the table is large. Since DROP can be inside of a transaction block, the empty table won't be seen by third parties (if done correctly). The downside with this approach is that the old table and new table are entirely different tables (different oids) - they just happen to have the same name. This means that any dependent (regular) views will need to be dropped and recreated as well. Also, since this table is "going away", the commit of the transaction cannot complete until all uses of the table are complete. This becomes a large problem when someone leaves a transaction open in their workbench and goes home for the night. Since the table needs to be recreated, your process needs to know the complete and correct DDL for the table.
Hopefully this gives you some idea of when to use these different approaches. Two things I see that could be better in your current code: 1) You are not using a transaction block (as far as I can tell), so there is a window where others will see that the table doesn't exist or is empty. This may or may not be important to you, but be aware. 2) "CREATE TABLE AS" doesn't define the DDL of the table in a performant structure (and possibly defines it incorrectly). You should always specify your permanent tables fully. Sort and dist keys matter, as do varchar lengths, data types etc. This is a time bomb waiting to go off.
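To make point 2 concrete, a sketch of a fully specified permanent table (all column names, types, and key choices here are made up for illustration; pick ones that match your data and queries):
CREATE TABLE some_schema.some_table_name (
    id         BIGINT        NOT NULL,
    event_date DATE          NOT NULL,
    category   VARCHAR(32),
    amount     DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (id)
SORTKEY (event_date);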
Per request for an example of drop/create/insert:
As I mentioned there are lock dependency issues that can arise with this method, so I like to use a "swap & drop" approach to this path. This makes the new information visible to users at the "swap", so even if the "drop" gets blocked, things get published on time. This doesn't remove the lock risk, as a lock can still prevent the process (session) from completing; it just makes it so that the new data is visible (published) while you hunt down the offender.
(Please note that for transactions to execute properly you need to be sure that extra COMMITs are not being inserted into the process. This can happen with workbenches that are configured in "autocommit" mode.)
Create table new_table ( ... ) ...; -- make the new table but with a different name (and unique from other tables) than the existing table
Insert into new_table ... ; -- put the desired data into the new table
Analyze new_table; -- to ensure metadata is up to date
Begin; -- start transaction
Alter table perm_table rename to old_table; -- rename existing table
Alter table new_table rename to perm_table; -- complete the swap
Commit; -- publish the new data for all to see but transactions still using the original data can keep doing so
Drop table old_table; -- remove the old data to free up space
Commit;
This process is just one example. Sometimes you want to keep the old versions of the table around for a while (history / error recovery) so you will date stamp the old data and have a separate process to free up the space. This also helps with stray locks clogging up the works - only the clean up process gets stalled. You can also have the recreation of views in the process so that these are updated in the same transaction. And so on.
I think you will need to use the Update command. I understand that dropping a table is a risky move, as you might lose all of your data from your database.
Update some_table_name
Set Id = 'whatever you want to update',
Name = 'whatever you want to update',
LastName = 'whatever you want to update',
OtherTableColumn = 'whatever you want to update'
In the above code I assumed your table had four columns (1-Id, 2-Name, 3-LastName, 4-OtherTableColumn). If you have more or fewer columns then I would adjust accordingly.
I would also write an update procedure for this (and each table), so if you need to update somewhat frequently you just use the procedure; I think it's quicker. Below would be my procedure:
Create Proc sp_UpdateSome_table_name
@Id int,
@name nvarchar(255),
@lastname nvarchar(255),
@OtherTableColumn int
AS
BEGIN
Update some_table_name
Set Name = @name,
LastName = @lastname,
OtherTableColumn = @OtherTableColumn
Where Id = @Id
END
You want to make sure that each column in your table is defined with the correct data type in the procedure. For example, I assumed above that @Id was int, Name was nvarchar(255) etc. If you want to allow yourself not to enter any data (allowing null) in certain table columns when updating, then after the data type you can write Null; for example, if you write @Id int Null, then you can update it as null; but if you are not sure what this is, simply ignore this sentence for now.
Once you are sure the above paragraph is good (data types are correct), select the entire procedure and execute (F5). This will store the procedure.
Then you can call the procedure every time you want to update your table, as shown below:
Exec sp_UpdateSome_table_name 1, 'John', 'Smith', 77
If you highlight the above command and execute it (F5), it will update the row which has Id=1 and make the name John, the last name Smith and the other column 77, whatever they were before. If there is no row in the table with Id=1 then nothing will be updated.
Keep in mind that the last line of the SET list does not take a trailing comma. The code above is written correctly; I'm just pointing it out because you might add a comma out of habit.

How to avoid data fragmentation in an "INSERT once, UPDATE once" table?

I have large number of tables that are "INSERT once", and then read-only after that. ie: After the initial INSERT of a record, there are never any UPDATE or DELETE statements. Due to this, the data fragmentation on disk for the tables is minimal.
I'm now considering adding an needs_action boolean field to every table. This field will only ever get updated once, and it will be done on a slow/periodic basis. As a result of MVCC, when VACUUM comes along (with an even slower schedule) after the UPDATE, the table becomes quite fragmented as it clears out the originally inserted tuples, and they get subsequently backfilled by new insertions.
In short: The addition of this single "always updated once" field changes the table from being minimally fragmented by design, to being highly fragmented by design.
Is there some method of effectively achieving this single needs_action record flagging that can avoid the resulting table fragmentation?
<And now for some background/supplemental info...>
Some options considered so far...
At the risk of making this question massive (and therefore overlooked?), below are some options that have been considered so far:
Just add the column to each table, do the UPDATE and don't worry about resulting fragmentation until it actually proves to be an issue.
I'm conscious of premature optimization here, but with some of the tables getting large (>1M, and even > 1B) I'd rather get the design right up front.
Make an independent tracking table (for every table) only containing A) the PK from the master table, and B) the needs_action flag. Create a record in the tracking table with an AFTER INSERT trigger in the master table
This will preserve "INSERT only" minimal fragmentation levels on the master table... at the expense of adding (significant?) up-front write overhead
putting the tracking tables in a separate schema would also neatly separate the functionality from the core tables
Forcing the needs_action field to be a HOT update to avoid tuple replication
Needing an index on WHERE needs_action = TRUE seems to rule out this option, but maybe there's another way to find them quickly?
Using table fillfactor (at 50?) to leave space for the inevitable UPDATE
eg: set fillfactor to 50 to leave room for the UPDATE, and therefore keep it in the same page
But... with only one UPDATE it seems like this will leave the table packing fraction eternally at 50% and take twice the storage space? I don't 100% understand this option yet... still learning.
Find a special/magical field/bit in the master table records that can be twiddled without MVCC impact.
This does not seem to exist in postgres. Even if it did, it would need to be indexed (or have some other quick finding mechanism akin to a WHERE needs_action = TRUE partial index)
Being able to optionally suppress MVCC operation on specific columns seems like it would be nice here (although surely fraught with peril)
Storing needs_action outside of postgres (eg: as a <table_name>:needs_copying list of PKs in redis) to evade fragmentation due to mvcc.
I have concerns about keeping this atomic, though. Maybe using redis_fdw (or some other fdw?) in an AFTER INSERT trigger can keep it atomic? I need to learn more about fdw capabilities... it seems like all the fdws I can find are read-only, though.
Run a fancy view with background defrag/compacting, as described in this fantastic article
Seems like a bit much to do for all tables.
Simply track ids/PKs that need copying in a postgres table
just stash ids that need action as records to a fast inert table (eg: no PK), and DELETE the records when action is completed
similar to RPUSHing to an offline redis list (but definitely ACID)
This seems like the best option at the moment.
Are there other options to consider?
More on the specific use case driving this...
I'm interested in the general case how to avoid this fragmentation, but here's a bit more on the current use case:
Read performance is much more important than write performance on all tables (but avoiding crazy slow writes is clearly desirable)
Some tables will reach millions of rows. A few may hit billions of rows.
SELECT queries will be across wide table ranges (not just recent data) and can range from single result records to 100k+
Table design can be done from scratch... no need to worry about existing data
PostgreSQL 9.6
I would just lower the fillfactor below the default value of 100.
Depending on the size of the row, use a value like 80 or 90, so that a few new rows will still fit into the block. After the update, the old row can be “pruned” and defragmented by the next transaction so that the space can be reused.
A value of 50 seems way too low. True, this would leave room for all rows in a block being updated at the same time, but that is not your use case, right?
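A sketch of what that looks like (made-up table; only the fillfactor value comes from the advice above):
CREATE TABLE events (
    id           bigint PRIMARY KEY,
    payload      text,
    needs_action boolean NOT NULL DEFAULT true
) WITH (fillfactor = 90);

-- or for an existing table (affects pages written from now on):
ALTER TABLE events SET (fillfactor = 90);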
You can try to use inherited tables. This approach does not directly support a PK for the tables, but it might be resolved by triggers.
CREATE TABLE data_parent (a int8, updated bool);
CREATE TABLE data_inserted (CHECK (NOT updated)) INHERITS (data_parent);
CREATE TABLE data_updated (CHECK (updated)) INHERITS (data_parent);
CREATE FUNCTION d_insert () RETURNS TRIGGER AS $$
BEGIN
NEW.updated = false;
INSERT INTO data_inserted VALUES (NEW.*);
RETURN NULL;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE TRIGGER d_insert BEFORE INSERT ON data_parent FOR EACH ROW EXECUTE PROCEDURE d_insert();
CREATE FUNCTION d_update () RETURNS TRIGGER AS $$
BEGIN
NEW.updated = true;
INSERT INTO data_updated VALUES (NEW.*);
DELETE FROM data_inserted WHERE (data_inserted.*) IN (OLD);
RETURN NULL;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE TRIGGER d_update BEFORE UPDATE ON data_inserted FOR EACH ROW EXECUTE PROCEDURE d_update();
-- GRANT on d_insert to regular user
-- REVOKE insert / update to regular user on data_inserted/updated
INSERT INTO data_parent (a) VALUES (1);
SELECT * FROM ONLY data_parent;
SELECT * FROM ONLY data_inserted;
SELECT * FROM ONLY data_updated;
INSERT 0 0
a | updated
---+---------
(0 rows)
a | updated
---+---------
1 | f
(1 row)
a | updated
---+---------
(0 rows)
UPDATE data_parent SET updated = true;
SELECT * FROM ONLY data_parent;
SELECT * FROM ONLY data_inserted;
SELECT * FROM ONLY data_updated;
UPDATE 0
a | updated
---+---------
(0 rows)
a | updated
---+---------
(0 rows)
a | updated
---+---------
1 | t
(1 row)

Fast deletion of many rows in data warehouse data

In SQL Server 2008 I have several million rows of data which need to be deleted. They are scattered across a handful of tables. Deletion takes up to 20 seconds, which I think is way too slow! The data to be deleted is identified by a timestamp column. Here is what I have done so far in order to optimize:
Using isolation level read uncommitted. I don't care about transactions. If we fail the user will issue the delete operation again. And new data is ensured not to have the timestamp we are deleting.
Deleting leaf tables before parent tables.
The timestamp column is part of the PK clustered index; in fact it's in the first position of the PK/index.
Each table is emptied using a loop which deletes top 200000 entries in order to reduce the transaction log overhead.
Neither I/O nor CPU is maxed out on the server
What have I overlooked?
Also, I am in doubt about the effect of moving the timestamp column to the first position in the PK. After doing so, must I reorganize the tables, or is SQL Server smart enough to do this itself? My understanding of a clustered index is that since it defines the physical layout of the rows, the table is forced into reorganizing the data. But we have had no complaints from the customer that changing the clustered index took a long time to perform.
Please make sure the tables you want to delete data from have a primary key explicitly defined.
Wrong: create table myTable (ID int)
Correct: create table myTable (ID int PRIMARY KEY)
In addition to that, please try to add "option (recompile)", which will help the performance:
DELETE FROM myTable
WHERE timestamp in (select timestamp from other_table)
OPTION (RECOMPILE)

Optimizing Delete on SQL Server

Deletes on SQL Server are sometimes slow and I've often been in need of optimizing them in order to reduce the time needed.
I've been googling a bit looking for tips on how to do that, and I've found diverse suggestions.
I'd like to know your favorite and most effective techniques to tame the delete beast, and how and why they work.
until now:
be sure foreign keys have indexes
be sure the where conditions are indexed
use of WITH ROWLOCK
destroy unused indexes, delete, rebuild the indexes
now, your turn.
The following article, Fast Ordered Delete Operations may be of interest to you.
Performing fast SQL Server delete operations
The solution focuses on utilising a view in order to simplify the execution plan produced for a batched delete operation. This is achieved by referencing the given table once, rather than twice which in turn reduces the amount of I/O required.
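A sketch of the pattern the article describes (table, column, and batch size here are placeholders, not the article's actual code):
CREATE VIEW dbo.v_BigTable_DeleteBatch
AS
SELECT TOP (10000) *
FROM dbo.BigTable
WHERE CreatedDate < '20100101'
ORDER BY CreatedDate;
GO
-- each statement removes one batch; repeat until @@ROWCOUNT = 0
DELETE FROM dbo.v_BigTable_DeleteBatch;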
I have much more experience with Oracle, but very likely the same applies to SQL Server as well:
when deleting a large number of rows, issue a table lock, so the database doesn't have to do lots of row locks
if the table you delete from is referenced by other tables, make sure those other tables have indexes on the foreign key column(s) (otherwise the database will do a full table scan for each deleted row on the other table to ensure that deleting the row doesn't violate the foreign key constraint)
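For example (made-up names), indexing the referencing column in the child table keeps the FK check from scanning the whole child table for every deleted parent row:
CREATE INDEX ix_order_items_order_id ON order_items (order_id);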
I wonder if it's time for garbage-collecting databases? You mark a row for deletion and the server deletes it later during a sweep. You wouldn't want this for every delete - because sometimes a row must go now - but it would be handy on occasion.
Summary of Answers through 2014-11-05
This answer is flagged as community wiki since this is an ever-evolving topic with a lot of nuances, but very few possible answers overall.
The first issue is you must ask yourself what scenario you're optimizing for? This is generally either performance with a single user on the db, or scale with many users on the db. Sometimes the answers are the exact opposite.
For single user optimization
Hint a TABLELOCK
Remove indexes not used in the delete then rebuild them afterward
Batch using something like SET ROWCOUNT 20000 (or whatever, depending on log space) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (@@ROWCOUNT = 0)
If deleting a large % of table, just make a new one and delete the old table
Partition the rows to delete, then drop the partition. [Read more...]
For multi user optimization
Hint row locks
Use the clustered index
Design clustered index to minimize page re-organization if large blocks are deleted
Update "is_deleted" column, then do actual deletion later during a maintenance window
For general optimization
Be sure FKs have indexes on their source tables
Be sure WHERE clause has indexes
Identify the rows to delete in the WHERE clause with a view or derived table instead of referencing the table directly. [Read more...]
To be honest, deleting a million rows from a table scales just as badly as inserting or updating a million rows. It's the size of the rowset that's the problem, and there's not much you can do about that.
My suggestions:
Make sure that the table has a primary key and clustered index (this is vital for all operations).
Make sure that the clustered index is such that minimal page re-organisation would occur if a large block of rows were to be deleted.
Make sure that your selection criteria are SARGable.
Make sure that all your foreign key constraints are currently trusted.
(if the indexes are "unused", why are they there at all?)
One option I've used in the past is to do the work in batches. The crude way would be to use SET ROWCOUNT 20000 (or whatever) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (@@ROWCOUNT = 0).
This might help reduce the impact upon other systems.
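A sketch of that loop (table name, predicate, and delay are placeholders):
DECLARE @cutoff datetime = '20100101';
SET ROWCOUNT 20000;             -- limit each DELETE to one batch
WHILE 1 = 1
BEGIN
    DELETE FROM dbo.BigTable WHERE CreatedDate < @cutoff;
    IF @@ROWCOUNT = 0 BREAK;    -- nothing left to delete
    WAITFOR DELAY '00:00:05';   -- give other work a chance between batches
END
SET ROWCOUNT 0;                 -- reset so later statements are unaffected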
The problem is you haven't defined your conditions enough. I.e. what exactly are you optimizing?
For example, is the system down for nightly maintenance and no users are on the system? And are you deleting a large % of the database?
If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename. If deleting a small %, you likely want to batch things in as large batches as your log space allows. It entirely depends on your database, but dropping indexes for the duration of the rebuild may hurt or help -- if even possible due to being "offline".
If you're online, what's the likelihood your deletes are conflicting with user activity (and is user activity predominantly read, update, or what)? Or, are you trying to optimize for user experience or speed of getting your query done? If you're deleting from a table that's frequently updated by other users, you need to batch but with smaller batch sizes. Even if you do something like a table lock to enforce isolation, that doesn't do much good if your delete statement takes an hour.
When you define your conditions better, you can pick one of the other answers here. I like the link in Rob Sanders' post for batching things.
If you have lots of foreign key tables, start at the bottom of the chain and work up. The final delete will go faster and block fewer things if there are no child records to cascade delete (which I would NOT turn on if I had a large number of child tables, as it will kill performance).
Delete in batches.
If you have foreign key tables that are no longer being used (you'd be surprised how often production databases end up with old tables nobody will get rid of), get rid of them or at least break the FK/PK connection. No sense checking a table for records if it isn't being used.
Don't delete - mark records as deleted and then exclude marked records from all queries. This is best set up at the time of database design. A lot of people use this because it is also the fastest way to get back records accidentally deleted. But it is a lot of work to set up in an already existing system.
I'll add another one to this:
Make sure your transaction isolation level and database options are set appropriately. If your SQL server is set not to use row versioning, or you're using an isolation level on other queries where you will wait for the rows to be deleted, you could be setting yourself up for some very poor performance while the operation is happening.
On very large tables where you have a very specific set of criteria for deletes, you could also partition the table, switch out the partition, and then process the deletions.
The SQLCAT team has been using this technique on really really large volumes of data. I found some references to it here but I'll try and find something more definitive.
I think the big trap that kills delete performance is that, after each row is deleted, SQL Server updates all the related indexes for every column in that row. What about dropping all indexes before the bulk delete?
There are deletes and then there are deletes. If you are aging out data as part of a trim job, you will hopefully be able to delete contiguous blocks of rows by clustered key. If you have to age out data from a high volume table that is not contiguous it is very very painful.
If it is true that UPDATES are faster than DELETES, you could add a status column called DELETED and filter on it in your selects. Then run a proc at night that does the actual deletes.
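A sketch of that idea (made-up table and predicate):
ALTER TABLE dbo.Orders ADD Deleted bit NOT NULL DEFAULT 0;

-- "delete" during the day:
UPDATE dbo.Orders SET Deleted = 1 WHERE OrderDate < '20100101';

-- nightly job does the real delete (optionally in batches):
DELETE FROM dbo.Orders WHERE Deleted = 1;
Remember to add WHERE Deleted = 0 to the queries that should no longer see those rows.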
Do you have foreign keys with referential integrity activated?
Do you have triggers active?
Simplify any use of functions in your WHERE clause! Example:
DELETE FROM Claims
WHERE dbo.YearMonthGet(DataFileYearMonth) = dbo.YearMonthGet(@DataFileYearMonth)
This form of the WHERE clause required 8 minutes to delete 125,837 records.
The YearMonthGet function composed a date with the year and month from the input date and set day = 1. This was to ensure we deleted records based on year and month but not day of month.
I rewrote the WHERE clause to:
WHERE YEAR(DataFileYearMonth) = YEAR(@DataFileYearMonth)
AND MONTH(DataFileYearMonth) = MONTH(@DataFileYearMonth)
The result: The delete required about 38-44 seconds to delete those 125,837 records!

Slow bulk insert for table with many indexes

I am trying to insert millions of records into a table that has more than 20 indexes.
In the last run it took more than 4 hours per 100,000 rows, and the query was cancelled after 3½ days...
Do you have any suggestions about how to speed this up.
(I suspect the many indexes to be the cause. If you also think so, how can I automatically drop indexes before the operation, and then create the same indexes afterwards again?)
Extra info:
The space used by the indexes is about 4 times the space used by the data alone
The inserts are wrapped in a transaction per 100,000 rows.
Update on status:
The accepted answer helped me make it much faster.
You can disable and enable the indexes. Note that disabling them can have unwanted side-effects (such as having duplicate primary keys or unique indices etc.) which will only be found when re-enabling the indexes.
--Disable Index
ALTER INDEX [IXYourIndex] ON YourTable DISABLE
GO
--Enable Index
ALTER INDEX [IXYourIndex] ON YourTable REBUILD
GO
This sounds like a data warehouse operation.
It would be normal to drop the indexes before the insert and rebuild them afterwards.
When you rebuild the indexes, build the clustered index first, and conversely drop it last. They should all have fillfactor 100%.
Code should be something like this
if object_id('IndexList') is not null drop table IndexList
select name into IndexList from dbo.sysindexes where id = object_id('Fact')
if exists (select name from IndexList where name = 'id1') drop index Fact.id1
if exists (select name from IndexList where name = 'id2') drop index Fact.id2
if exists (select name from IndexList where name = 'id3') drop index Fact.id3
.
.
BIG INSERT
RECREATE THE INDEXES
As noted by another answer disabling indexes will be a very good start.
4 hours per 100,000 rows
[...]
The inserts are wrapped in a transaction per 100.000 rows.
You should look at reducing that number; the server has to maintain a huge amount of state while in a transaction (so it can be rolled back), and this (along with the indexes) means adding data is very hard work.
Why not wrap each insert statement in its own transaction?
Also look at the nature of the SQL you are using: are you adding one row per statement (and network roundtrip), or adding many?
Disabling and then re-enabling indices is frequently suggested in those cases. I have my doubts about this approach though, because:
(1) The application's DB user needs schema alteration privileges, which it normally should not possess.
(2) The chosen insert approach and/or index schema might be less than optimal in the first place, otherwise rebuilding complete index trees should not be faster than some decent batch-inserting (e.g. the client issuing one insert statement at a time, causing thousands of server roundtrips; or a poor choice of clustered index, leading to constant index node splits).
That's why my suggestions look a little bit different:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits. Usually an identity column is a good choice
Let the client insert into a temporary heap table first (heap tables don't have any clustered index); then, issue one big "insert-into-select" statement to push all that staging table data into the actual target table
Apply SqlBulkCopy
Decrease transaction logging by choosing bulk-logged recovery model
You might find more detailed information in this article.
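As an illustration of the staging-heap idea above, a sketch with made-up names (the heap has no indexes at all, so the bulk load is cheap, and the indexes of the real table are maintained in one set-based pass):
CREATE TABLE dbo.Staging_Fact (
    Id       bigint        NOT NULL,
    Amount   decimal(18,2) NOT NULL,
    LoadDate datetime      NOT NULL
);

-- bulk load into the heap (SqlBulkCopy, BULK INSERT, bcp, ...), then:
INSERT INTO dbo.Fact WITH (TABLOCK) (Id, Amount, LoadDate)
SELECT Id, Amount, LoadDate
FROM dbo.Staging_Fact;

TRUNCATE TABLE dbo.Staging_Fact;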