Optimizing DELETE on SQL Server

Deletes on SQL Server are sometimes slow, and I've often needed to optimize them to reduce the time they take.
I've been googling for tips on how to do that and have found various suggestions.
I'd like to know your favorite and most effective techniques to tame the delete beast, and how and why they work.
What I have so far:
make sure foreign keys have indexes
make sure the WHERE conditions are indexed
use WITH (ROWLOCK)
drop unused indexes, delete, then rebuild the indexes
Now, your turn.

The following article, Fast Ordered Delete Operations (Performing fast SQL Server delete operations), may be of interest to you.
The solution uses a view to simplify the execution plan produced for a batched delete operation. By referencing the table once rather than twice, it reduces the amount of I/O required.
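A minimal sketch of that technique, assuming a hypothetical table dbo.BigTable with an indexed CreatedDate column (all names here are illustrative, not taken from the article):

CREATE VIEW dbo.vBigTableOldest
AS
SELECT TOP (10000) *
FROM dbo.BigTable
ORDER BY CreatedDate;              -- oldest rows first
GO
-- Each execution removes at most 10,000 rows; repeat until @@ROWCOUNT = 0.
DELETE FROM dbo.vBigTableOldest;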

I have much more experience with Oracle, but very likely the same applies to SQL Server as well:
when deleting a large number of rows, issue a table lock so the database doesn't have to take lots of row locks
if the table you delete from is referenced by other tables, make sure those other tables have indexes on the foreign key column(s); otherwise the database will do a full table scan on the other table for each deleted row, to ensure that deleting the row doesn't violate the foreign key constraint
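A rough illustration of both points in SQL Server syntax (table, column, and index names are hypothetical):

-- Take an exclusive table lock for the bulk delete instead of many row locks.
DELETE FROM dbo.ParentTable WITH (TABLOCKX)
WHERE CreatedDate < '2014-01-01';

-- Index the referencing (child) foreign key column so the constraint check
-- doesn't scan the child table for every deleted parent row.
CREATE INDEX IX_ChildTable_ParentId ON dbo.ChildTable (ParentId);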

I wonder if it's time for garbage-collecting databases? You mark a row for deletion and the server deletes it later during a sweep. You wouldn't want this for every delete - because sometimes a row must go now - but it would be handy on occasion.

Summary of Answers through 2014-11-05
This answer is flagged as community wiki since this is an ever-evolving topic with a lot of nuances, but very few possible answers overall.
The first question you must ask yourself is what scenario you're optimizing for. This is generally either performance with a single user on the db, or scale with many users on the db. Sometimes the answers are the exact opposite.
For single user optimization
Hint a TABLELOCK
Remove indexes not used in the delete then rebuild them afterward
Batch using something like SET ROWCOUNT 20000 (or whatever, depending on log space) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (@@ROWCOUNT = 0)
If deleting a large % of the table, just make a new one and drop the old table
Partition the rows to delete, then drop the partition. [Read more...]
For multi user optimization
Hint row locks
Use the clustered index
Design clustered index to minimize page re-organization if large blocks are deleted
Update "is_deleted" column, then do actual deletion later during a maintenance window
For general optimization
Be sure FKs have indexes on their source tables
Be sure WHERE clause has indexes
Identify the rows to delete in the WHERE clause with a view or derived table instead of referencing the table directly. [Read more...]
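For the last point, a sketch of the derived-table variant is deleting through a CTE with TOP (table, column, and cutoff are hypothetical):

DECLARE @cutoff datetime = DATEADD(DAY, -30, GETDATE());   -- hypothetical retention window
;WITH ToDelete AS (
    SELECT TOP (10000) *
    FROM dbo.BigTable
    WHERE CreatedDate < @cutoff
    ORDER BY CreatedDate
)
DELETE FROM ToDelete;              -- run in a loop until @@ROWCOUNT = 0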

To be honest, deleting a million rows from a table scales just as badly as inserting or updating a million rows. It's the size of the rowset that's the problem, and there's not much you can do about that.
My suggestions:
Make sure that the table has a primary key and clustered index (this is vital for all operations).
Make sure that the clustered index is such that minimal page re-organisation would occur if a large block of rows were to be deleted.
Make sure that your selection criteria are SARGable.
Make sure that all your foreign key constraints are currently trusted.
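On the last point, untrusted foreign keys show up in the catalog views, and a constraint can be re-validated with WITH CHECK (the constraint and table names below are hypothetical):

-- List foreign keys the optimizer cannot trust.
SELECT name, OBJECT_NAME(parent_object_id) AS ChildTable
FROM sys.foreign_keys
WHERE is_not_trusted = 1;

-- Re-validate one of them so it becomes trusted again.
ALTER TABLE dbo.ChildTable WITH CHECK CHECK CONSTRAINT FK_ChildTable_ParentTable;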

(if the indexes are "unused", why are they there at all?)
One option I've used in the past is to do the work in batches. The crude way would be to use SET ROWCOUNT 20000 (or whatever) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (@@ROWCOUNT = 0).
This might help reduce the impact on other systems.
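A minimal sketch of that crude loop, assuming a hypothetical dbo.BigTable with an indexed CreatedDate column (note that SET ROWCOUNT for DML is deprecated; DELETE TOP (n), shown further down, is the modern equivalent):

SET ROWCOUNT 20000;                     -- cap each delete at 20,000 rows
WHILE 1 = 1
BEGIN
    DELETE FROM dbo.BigTable
    WHERE CreatedDate < '2014-01-01';
    IF @@ROWCOUNT = 0 BREAK;            -- nothing left to delete
    WAITFOR DELAY '00:00:02';           -- give other workloads some breathing room
END
SET ROWCOUNT 0;                         -- reset the session setting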

The problem is you haven't defined your conditions enough. I.e. what exactly are you optimizing?
For example, is the system down for nightly maintenance and no users are on the system? And are you deleting a large % of the database?
If offline and deleting a large %, it may make sense to just build a new table with the data to keep, drop the old table, and rename. If deleting a small %, you likely want to batch things in as large batches as your log space allows. It depends entirely on your database, but dropping indexes for the duration of the delete (and rebuilding them afterward) may hurt or help -- if it's even possible given the "offline" window.
If you're online, what's the likelihood your deletes are conflicting with user activity (and is user activity predominantly read, update, or what)? Or, are you trying to optimize for user experience or speed of getting your query done? If you're deleting from a table that's frequently updated by other users, you need to batch but with smaller batch sizes. Even if you do something like a table lock to enforce isolation, that doesn't do much good if your delete statement takes an hour.
When you define your conditions better, you can pick one of the other answers here. I like the link in Rob Sanders' post for batching things.

If you have lots of foreign key tables, start at the bottom of the chain and work up (a sketch follows this list). The final delete will go faster and block fewer things if there are no child records to cascade delete (which I would NOT turn on if I had a large number of child tables, as it will kill performance).
Delete in batches.
If you have foreign key tables that are no longer being used (you'd be surprised how often production databases end up with old tables nobody will get rid of), get rid of them or at least break the FK/PK connection. No sense checking a table for records if it isn't being used.
Don't delete - mark records as deleted and then exclude marked records from all queries. This is best set up at the time of database design. A lot of people use this because it is also the fastest way to get back records accidentally deleted. But it is a lot of work to set up in an already existing system.
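A rough sketch of working up the chain, with hypothetical Orders/OrderLines tables: delete the children first, then the parents (both in batches if the volumes are large).

DECLARE @cutoff datetime = DATEADD(YEAR, -2, GETDATE());   -- hypothetical retention rule

-- Children first, so the parent delete has no rows left to check or cascade.
DELETE ol
FROM dbo.OrderLines AS ol
JOIN dbo.Orders AS o ON o.OrderId = ol.OrderId
WHERE o.OrderDate < @cutoff;

-- Then the parents.
DELETE FROM dbo.Orders
WHERE OrderDate < @cutoff;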

I'll add another one to this:
Make sure your transaction isolation level and database options are set appropriately. If your SQL Server database is not set to use row versioning, or other queries run at an isolation level that blocks on the rows being deleted, you could be setting yourself up for some very poor performance while the operation is happening.
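For example, enabling read committed snapshot isolation lets readers see the last committed version of a row instead of blocking behind the delete (assuming you can tolerate the tempdb version-store overhead; the database name is a placeholder):

ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;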

On very large tables where you have a very specific set of criteria for deletes, you could also partition the table, switch out the partition, and then process the deletions.
The SQLCAT team has been using this technique on really really large volumes of data. I found some references to it here but I'll try and find something more definitive.
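A hedged sketch of the switch-out pattern, assuming dbo.BigTable is already partitioned by date and dbo.BigTable_Staging is an empty table with an identical structure on the same filegroup (the partition number 3 is purely illustrative):

-- Moving the partition is a metadata-only operation, so it is nearly instant.
ALTER TABLE dbo.BigTable SWITCH PARTITION 3 TO dbo.BigTable_Staging;

-- Now the old rows can be truncated (or archived) without touching the live table.
TRUNCATE TABLE dbo.BigTable_Staging;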

I think the big trap with delete that kills performance is that, after each row is deleted, SQL updates all the related indexes for every column in that row. What about dropping all the indexes before the bulk delete?

There are deletes and then there are deletes. If you are aging out data as part of a trim job, you will hopefully be able to delete contiguous blocks of rows by clustered key. If you have to age out data from a high-volume table where the rows are not contiguous, it is very, very painful.

If it is true that UPDATES are faster than DELETES, you could add a status column called DELETED and filter on it in your selects. Then run a proc at night that does the actual deletes.

Do you have foreign keys with referential integrity activated?
Do you have triggers active?

Simplify any use of functions in your WHERE clause! Example:
DELETE FROM Claims
WHERE dbo.YearMonthGet(DataFileYearMonth) = dbo.YearMonthGet(@DataFileYearMonth)
This form of the WHERE clause required 8 minutes to delete 125,837 records.
The YearMonthGet function composed a date from the year and month of the input date, with the day set to 1. This ensured we deleted records based on year and month but not day of month.
I rewrote the WHERE clause to:
WHERE YEAR(DataFileYearMonth) = YEAR(@DataFileYearMonth)
AND MONTH(DataFileYearMonth) = MONTH(@DataFileYearMonth)
The result: The delete required about 38-44 seconds to delete those 125,837 records!
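Note that neither form is strictly SARGable, since both still wrap the column in a function. A further step (my own suggestion, assuming SQL Server 2012+ for DATEFROMPARTS and an index on DataFileYearMonth) would be a range predicate that can seek the index directly:

DECLARE @monthStart date = DATEFROMPARTS(YEAR(@DataFileYearMonth), MONTH(@DataFileYearMonth), 1);

DELETE FROM Claims
WHERE DataFileYearMonth >= @monthStart
  AND DataFileYearMonth < DATEADD(MONTH, 1, @monthStart);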


To delete, Changing the flag of entity VS Moving to another table?

Should I set a deleted column to 1 to mark a record as deleted, or is it better to move the record to another table?
The flag approach is good because later selects will search fewer records.
The second approach is a little complex, isn't it?
Which approach is better?
As stated here, the comparison factors are listed below. It's up to you to make the decision based on these factors.
Ease of Setup
Soft delete is easier to implement since it merely involves updating a column, while hard delete would also involve copying the data to be deleted to an audit table.
Advantage: Soft Delete
Debugging
Soft delete makes it easy to debug data issues thanks to the deleted_flag, but debugging via the audit table is also easily possible. So it's a tie.
Advantage: N/A
Restoring data
It is extremely easy to restore data 'deleted' via soft delete, since it just involves unsetting the deleted_flag. Note, however, that restoring data is an extremely rare occurrence.
Advantage: Soft Delete
Querying for active data
From experience, I can say that many issues have come up because a developer forgot to add the 'delete_flag = 0' condition to a select query. If you are using an ORM like Doctrine with the 'soft delete' plugin enabled, this is not an issue since the ORM takes care of adding the check.
Advantage: Hard Delete
View Simplicity
Having only active data in the tables makes for view simplicity (WYSIWYG - what you see is what you get). With hard delete, all 'deleted' data is present only in the audit table while the rest of the tables in the system hold 'active' data, so the separation of concerns exists for hard delete.
Advantage: Hard Delete
Performance of operation
An update is a bit faster than a delete (microseconds), so a soft delete should technically be faster than a hard delete (which also has the audit table insert to consider).
Advantage: Soft Delete
Application Performance
Speed
To support soft deletes, ALL select queries need a 'WHERE delete_flag = 0' condition. Where JOINs are involved there will be multiple such conditions, and select queries with fewer conditions are faster than those with more.
Advantage: Hard Delete
Size
To support fast soft deletes, we need an index on the delete_flag in EVERY table. Additionally, the table size keeps increasing since the table holds 'soft deleted' data plus active data, and queries can get slower as the table grows.
Advantage: Hard Delete
Database features compatibility
Unique Index
A unique index ensures data integrity by preventing multiple occurrences of a row at the database level. Soft delete prevents the use of a unique index. Additionally, we cannot update the old soft-deleted entry (A1-B1 in the original example), since that would mean rewriting recorded data and losing history (e.g. the update date/time or a deleted_by column if it exists).
Advantage: Hard Delete
Cascading
For soft delete, we cannot make use of 'ON DELETE' cascading. The alternative is to create an 'UPDATE' trigger that propagates the deleted_flag (sketched below).
Advantage: Hard Delete
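A hedged sketch of that cascading-by-trigger idea, assuming hypothetical Parent/Child tables that each carry a deleted_flag column:

CREATE TRIGGER trg_Parent_SoftDeleteCascade
ON dbo.Parent
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- When a parent row is soft deleted, soft delete its children too.
    UPDATE c
    SET c.deleted_flag = 1
    FROM dbo.Child AS c
    JOIN inserted AS i ON i.ParentId = c.ParentId
    JOIN deleted  AS d ON d.ParentId = i.ParentId
    WHERE i.deleted_flag = 1
      AND d.deleted_flag = 0;          -- only rows whose flag just flipped
END;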

Which database operation is heaviest?

If I perform CRUD operations on the same table, which is the heaviest operation in terms of performance?
People say DELETE and then INSERT is better than UPDATE in some cases; is this true? Does that make UPDATE the heaviest operation?
Like all things in life, it depends.
SQL Server uses WAL (write ahead logging) to maintain ACID (Atomicity, Consistency, Isolation, Durability) properties.
An insert needs to log entries for data page and index page changes. If page splits occur, it takes longer. Then the data is written to the data file.
A delete marks the data and index pages for re-use. The data will still be there right after the operation.
An update may be implemented as a delete plus an insert, and therefore generate double the log entries.
What can help inserts is pre-allocating the space in the data file before running the job. Auto-growing the data files is expensive.
In summary, I would expect updates on average to be the most expensive operation.
I am by no means an expert on the storage engine.
Please check out http://www.sqlskills.com (Paul Randal's blog) and/or Kalen Delaney's SQL Server Internals book, http://sqlserverinternals.com/. These authors go in depth on all the cases that might happen.
It depends mostly on the foreign keys and indexes you have on the table. For deletes and inserts, every column that is a foreign key and part of an index has to be checked for foreign key references, and every index containing that column has to be maintained.
If you do DELETE and then INSERT, that checking and index maintenance happens twice. On a really large table maintaining the indexes can take a very long time, and in that case UPDATE will be MUCH faster.
That is, of course, assuming you have an index on the key you're searching on in the UPDATE statement and you are not updating the key itself.
For a small table with almost no indexes/foreign keys, the operations run so fast that it's not a big issue.

Benefit to keeping a record in the database, rather than deleting it, for performance issues?

So I have a client that I am building a Rails app for....I am using PostgreSQL.
He made this comment about preferring to hide records, rather than delete them, and given that this is the first time I have heard about this I figured I would ask you guys to hear your thoughts.
I'd rather hide than delete because deletions in tables eventually lead to table index havoc that causes queries to take longer than expected (much worse than Inserts or Updates). This won't be a problem in the beginning of the site (it gets exponentially worse over time), but seems like an easy issue to never encounter by just not deleting anything (yet) as part of the "everyday" web application functionality. We can always handle deletions much later as part of a Data Optimization & Maintenance process and re-index tables in that process on some (yet to be determined) scheduled basis.
In all the Rails apps I have built, I have never had an issue with records being deleted and it affecting the index.
Am I missing something? Is this a problem that used to exist, but modern RDBMS products have fixed it?
There may be functional reasons for preferring that records not be deleted, but reasons relating to some form of table index "havoc" are almost certainly bogus unless supported by some technical evidence.
You hear this sort of thing quite often in the Oracle world -- that indexes do not re-use space freed up by deletions. It's usually based on some misinterpretation of the facts (eg. that index blocks are not freed for re-use until they are completely empty). Hence you end up with people giving advice to periodically rebuild indexes. If you give these issues some thought, you wonder why the RDBMS developers would not have fixed such an issue, given that it supposedly harms the system performance.
So there may be some piece of Postgres-related, possibly obsolete, information on which this is based, but the onus is really on the person objecting to a perfectly normal type of database operation to come with evidence to support their position.
Another thought: I believe that in Postgres an update is implemented as a delete and insert, hence the advice to vacuum frequently on heavily updated tables. Based on that, updates should also cause the same index problems that are supposed to be associated with deletes.
Other reasons for not deleting the records.
you don't have to worry about cascading a delete through various other tables in the database that reference the row you are deleting
Every bit of data is useful. Debugging and auditing becomes easy.
Easier to roll back if needed.
Create a deleted column in your table and don't index it.
If you update that record with deleted = 1 or deleted = 0, only the data needs to be rewritten; the index doesn't have to be updated, which saves a lot of IO reads and writes.
This advice goes for all modern RDBMSs when they use B-tree indexes. A B-tree is a very good structure for searching, but not for updating and deleting, because of the high number of IO reads and writes needed to insert or update nodes in the tree or delete nodes from it. This is also the reason why you should not "over index" your table.
"Delete" a record like this:
UPDATE table SET deleted = 1 WHERE id = 1 -- assuming id is indexed as the primary key and deleted is not indexed, this should be fast
Check if a record is deleted:
SELECT * FROM table WHERE id = 1 AND deleted = 1 -- assuming id is indexed as the primary key, this should be fast
Check if a record is not deleted:
SELECT * FROM table WHERE id = 1 AND deleted = 0 -- assuming id is indexed as the primary key, this should be fast

T-SQL Optimize DELETE of many records

I have a table that can grow to millions of records (50 million, for example). Every 20 minutes, records that are older than 20 minutes are deleted.
The problem is that if the table has that many records, such a deletion can take a lot of time and I want to make it faster.
I can not do "truncate table" because I want to remove only records that are older than 20 minutes. I suppose that when doing the "delete" and filtering the information that needs to be deleted, the server creates a log file or something, and this takes a lot of time?
Am I right? Is there a flag or option I can turn off to speed up the delete, and then turn back on afterwards?
To expand on the batch delete suggestion, I'd suggest you do this far more regularly (every 20 seconds perhaps) - batch deletions are easy:
WHILE 1 = 1
BEGIN
    DELETE TOP ( 4000 )
    FROM YOURTABLE
    WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE())
    IF @@ROWCOUNT = 0
        BREAK
END
Your inserts may lag slightly whilst they wait for the locks to release, but they should insert rather than error.
In regards to your table though, with this much traffic I'd expect to see it on a very fast RAID 10 array, perhaps even partitioned - are your disks up to it? Are your transaction logs on different disks from your data files? They should be.
EDIT 1 - Response to your comment
To put a database into SIMPLE recovery:
ALTER DATABASE YourDatabaseName SET RECOVERY SIMPLE
This basically turns off transaction logging on the given database, meaning that in the event of data loss you would lose all data since your last full backup. If you're OK with that, this should save a lot of time when running large transactions. (Note that while a transaction is running, logging still takes place under SIMPLE - to enable rolling the transaction back.)
If there are tables within your database where you can't afford to lose data, you'll need to leave your database in FULL recovery mode (i.e. every transaction gets logged, and hopefully flushed to *.trn files by your server's maintenance plans). As I stated above though, there is nothing stopping you from having two databases, one in FULL and one in SIMPLE: the FULL database for tables where you can't afford to lose any data (i.e. you could apply the transaction logs to restore data to a specific time), and the SIMPLE database for these massive, high-traffic tables where you can accept data loss in the event of a failure.
All of this is relevant assuming you're creating full (*.bak) backups every night and flushing your log files to *.trn files every half hour or so.
In regards to your index question, it's imperative that your date column is indexed; if you check your execution plan and see a "TABLE SCAN", that is an indicator of a missing index.
Your date column, I presume, is DATETIME with a constraint setting the DEFAULT to GETDATE()?
You may find that you get better performance by replacing that with a BIGINT in the form YYYYMMDDHHMMSS and then applying a CLUSTERED index to that column. Note, however, that you can only have one clustered index per table, so if that table already has one you'll need to use a non-clustered index. (In case you didn't know, a clustered index basically tells SQL to store the information in that order, meaning that when you delete rows older than 20 minutes, SQL can literally delete them sequentially rather than hopping from page to page.)
The log problem is probably due to the number of records deleted in the transaction; to make things worse, the engine may be requesting a lock per record (or per page, which is not so bad).
The one big thing here is how you determine the records to be deleted. I'm assuming you use a datetime field; if so, make sure you have an index on that column, otherwise it's a sequential scan of the table that will really penalize your process.
There are two things you may do, depending on the concurrency of users and the time of the delete:
If you can guarantee that no one is going to read or write while you delete, you can lock the table in exclusive mode, delete (this takes only one lock from the engine), and release the lock.
You can use batch deletes: write a script with a cursor that provides the rows you want to delete, and begin a transaction and commit every X records (ideally 5000), so you keep the transactions short and don't take that many locks.
Take a look at the query plan for the delete process and see what it shows; a sequential scan of a big table is never good.
Unfortunately for the purpose of this question, and fortunately for the sake of consistency and recoverability of databases in SQL Server, putting a database into SIMPLE recovery mode DOES NOT disable logging.
Every transaction still gets logged before being committed to the data file(s); the only difference is that the log space gets released (in most cases) right after the transaction is either rolled back or committed under SIMPLE recovery. This is not going to affect the performance of the DELETE statement one way or another.
I had a similar problem when I needed to delete more than 70% of the rows from a big table with 3 indexes and a lot of foreign keys.
For this scenario, I saved the rows I wanted in a temp table, truncated the original table and reinserted the rows, something like:
SELECT * INTO #tempuser FROM [User] WHERE [Status] >= 600;
TRUNCATE TABLE [User];
INSERT [User] SELECT * FROM #tempuser;
I learned this technique from this link, which explains:
DELETE is a fully logged operation, and can be rolled back if something goes wrong
TRUNCATE removes all rows from a table without logging the individual row deletions
In the article you can explore other strategies to resolve the delay in deleting many records; that one worked for me.

SQL DELETE performance

delete from A where A.ID = 132.
Table A contains around 5000 records and A.ID is its primary key, but the delete is taking a long time; sometimes it even times out. The table has three indexes and is referenced by three foreign keys. Can anyone explain why it takes so long even though we are deleting based on the primary key? And please suggest some way to optimize this...?
Possible causes:
1) cascading delete operations
2) trigger(s)
3) the type of your primary key column is something other than an integer, thereby forcing a type conversion on each pk value to do the comparison. This requires a full table scan.
4) does your query really end in a dot, like you posted it in the question? If so, the number may be considered a floating point value instead of an integer, causing a type conversion similar to 3)
5) your delete query is waiting for some other slow query to release a lock
Obviously it should not be taking a long time. However, there isn't enough information here to figure out exactly why. I can tell you, though, that you should focus on the Foreign Keys.
These can slow things down if they impose constraints from other, much larger, tables. You may also find out that your timeouts are due to integrity checks that prevent the delete (then the question is why you aren't getting exceptions instead of a timeout).
My next step would be to remove the foreign keys and then check performance. Then add each one back, one at a time, and check performance as you go.
Are other operations (e.g. Inserts, Selects, Updates) taking a long time?
First thought: indexes on foreign keys? (A catalog query to spot missing ones follows below.)
This is related to the cascading deletes mentioned above.
All child tables must be checked, and if you have a total of 500,000 child rows, this might take some time of course...
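A rough way to spot foreign key columns with no supporting index on the child side (an approximate catalog query; it ignores composite keys and key ordinal, so treat it as a starting point):

SELECT OBJECT_NAME(fkc.parent_object_id) AS ChildTable,
       COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS FkColumn
FROM sys.foreign_key_columns AS fkc
WHERE NOT EXISTS (
    SELECT 1
    FROM sys.index_columns AS ic
    WHERE ic.object_id = fkc.parent_object_id
      AND ic.column_id = fkc.parent_column_id
);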
Second thought: Triggers firing?
On this table or on its child tables, or something trying to cascade via code, etc.
God forbid, cursor for each row in DELETED...
Try to update the statistics. 5000 rows is not a big deal. If you're doing this regularly you should schedule maintenance on that table as well (i.e. re-build indexes, update stats etc.)
As others have observed, the probable suspects are the foreign keys.
Firstly because the ON DELETE CASCADE can gather momentum if the dependent tables in turn are referenced by other tables, which in turn may be referenced, and so on.
Secondly, because other users may have locks on the rows which need to be deleted. This is the most likely cause of the timeouts. Quite how this works will depend on the flavour and version of your database. For instance, older versions of Oracle (<=8.0) needed to lock the entire dependent table unless the foreign key columns were indexed.