MS SQL: The best way to delete rows from a ginormous table

I have a very large table [X] with 170 million rows, and we need to archive data so that [X] keeps only the records that are actually used. We are doing this to keep our system fast, as it is slowing down. We only use a small fraction of the rows in the whole table (less than 10%), so we can afford to archive a lot of data into, for example, Archive.[X].
The problem is that deleting the records takes a lot of time. We have run the following troubleshooting checks to see why it takes so long:
1) The table is indexed
2) No un-indexed foreign keys
3) No triggers doing extra work in the background on delete
Have any of you ever encountered a similar scenario? What is the best procedure to follow when doing something similar? And are there any tools out there that can help?
I appreciate your help!

Options
Why not take the 10% into a new table?
Batch delete/insert not in a transaction (see below)
Partition table (aka let the engine deal with it)
To populate an archive table
SELECT 'starting' -- sets @@ROWCOUNT
WHILE @@ROWCOUNT <> 0
BEGIN
    DELETE TOP (50000) FROM dbo.Mytable
    OUTPUT DELETED.* INTO ArchiveTable
    WHERE SomeCol < <Afilter>
    -- maybe CHECKPOINT
    WAITFOR DELAY ...
END

You should go for partitioning your database/table.
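By way of illustration, here is a minimal sketch of what a partition-based purge could look like. The partition function/scheme names and boundary values are hypothetical, and per-partition TRUNCATE needs SQL Server 2016 or later; on older versions, ALTER TABLE ... SWITCH to an empty staging table achieves the same metadata-only removal.
-- Hypothetical names and boundaries; the real table [X] would have to be created (or rebuilt) ON the partition scheme.
CREATE PARTITION FUNCTION pfByYear (datetime2)
    AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

CREATE PARTITION SCHEME psByYear
    AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- Once dbo.[X] is partitioned on its date column, an old partition can be
-- emptied as a metadata operation instead of deleting rows one by one (SQL Server 2016+):
TRUNCATE TABLE dbo.[X] WITH (PARTITIONS (1));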

Related

Azure SQL server deletes

I have a SQL Server table with 16,130,000 rows. I need to delete around 20%. When I do a simple:
delete from items where jobid=12
It takes forever.
I stopped the query after 11 minutes. Selecting data is pretty fast, so why is the delete so slow? Selecting 850,000 rows takes around 30 seconds.
Is it because of table locks? And can you do anything about it? I would expect deleting rows to be faster, because you don't have to transfer data over the wire.
Best R, Thomas
Without telling us what reservation size you are using, it is hard to give feedback on whether X records in Y seconds is expected or not. I can, however, tell you how the system works so that you can make this determination with a bit more investigation yourself. The log commit rate is limited by the reservation size you purchase. Deletes are fundamentally limited by the ability to write out log records (and replicate them to multiple machines in case your main machine dies). When you select records, you don't have to go over the network to N machines, and you may not even need to go to the local disk if the records are already in memory, so selects are generally expected to be faster than inserts/updates/deletes, which all need to harden the log.
You can read about the specific limits for the different reservation sizes here:
DTU Limits and vCore Limits
One common problem customers hit is to do individual operations in a loop (like a cursor or driven from the client). This implies that each statement has a single row updated and thus has to harden each log record serially because the app has to wait for the statement to return before submitting the next statement. You are not hitting that since you are running a big delete as a single statement. That could be slow for other reasons such as:
Locking - if you have other users doing operations on the table, it could block the progress of the delete statement. You can potentially see this by looking at sys.dm_exec_requests to see if your statement is blocking on other locks.
Query Plan choice. If you have to scan a lot of rows to delete a small fraction, you could be blocked on the IO to find them. Looking at the query plan shape will help here, as will SET STATISTICS TIME ON (I suggest you change the query to do TOP 100 or similar to get a sense of whether you are doing lots of logical read IOs vs. actual logical writes). This could imply that your on-disk layout is suboptimal for this problem. The general solutions would be to either pick a better indexing strategy or to use partitioning so you can quickly drop groups of rows instead of having to delete all the rows explicitly.
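To make both checks concrete, here is a minimal sketch. The table and column names come from the question above; treat it as an illustration, not a tuned script.
-- Check whether any session is currently blocked, and by whom
SELECT session_id, status, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;

-- Gauge read vs. write work on a small sample of the delete
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
BEGIN TRAN;
DELETE TOP (100) FROM items WHERE jobid = 12;   -- small probe
ROLLBACK TRAN;                                  -- undo the probe; the statistics are still reported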
Try to use batching techniques to improve performance, minimize log usage and avoid consuming database space.
declare
    @batch_size int,
    @del_rowcount int = 1
set @batch_size = 100
set nocount on;
while @del_rowcount > 0
begin
    begin tran
    delete top (@batch_size)
    from dbo.LargeDeleteTest
    set @del_rowcount = @@rowcount
    print 'Delete row count: ' + cast(@del_rowcount as nvarchar(32))
    commit tran
end
Dropping any foreign keys, deleting the rows, and then recreating the foreign keys can also speed things up.
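A minimal sketch of that pattern, assuming a single hypothetical foreign key FK_Child_Parent on a hypothetical child table (a real schema may have several constraints to script out first):
-- Drop the constraint, run the bulk delete, then put the constraint back with a full re-check
ALTER TABLE dbo.ChildTable DROP CONSTRAINT FK_Child_Parent;

-- ... run the batched delete shown above ...

ALTER TABLE dbo.ChildTable WITH CHECK
    ADD CONSTRAINT FK_Child_Parent FOREIGN KEY (ParentId)
    REFERENCES dbo.ParentTable (ParentId);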

SQL DELETE - Maximum number of rows

What limit should be placed on the number of rows to delete in a SQL statement?
We need to delete from 1 to several hundred thousand rows and need to apply some sort of best-practice limit in order to not absolutely kill the SQL Server or fill up the logs every time we empty a waste-basket.
This question is not specific to any type of database.
That's a very very broad question that basically boils down to "it depends". The factors that influence it include:
What is your level of concurrency? A delete statement places an exclusive lock on affected rows. Depending on the database engine, deleted data distribution, etc., that could escalate to a page lock or a lock on the entire table. Can your data readers afford to be blocked for the duration of the delete?
How complex is the delete statement? How many other tables are you joining to, or are there complex WHERE clauses? Sometimes the identification of rows to delete can be more "expensive" than the delete itself, so one big delete may be "cheaper".
Are you fearful about deadlocks? As you decrease the size of your delete, your deadlock "foot print" is reduced. Ideally, single-row deletes will always succeed.
Do you care about throughput performance? As with any SQL statement, there is a generally constant amount of overhead (connection stuff, query parsing, returning results, etc.). From a single-connection point of view, a 1000-line delete will be faster than 1000 x 1-line deletes.
Don't forget about index maintenance overhead, fragmentation cleanup, or any triggers. They can also affect your system.
In general, though, I benchmark at 1000 rows per statement. Most systems I've worked with (sub-"enterprise") end up with a sweet spot between 500 and 5000 records per delete. I like to do something like this:
set rowcount 500
select 1 -- just to force @@ROWCOUNT > 0
while @@ROWCOUNT > 0
    delete from [table]
    [where ...]
Though limiting the number of rows affected by your delete using the set rowcount option and then performing a loop is very good (and I've used it many a time before), be aware that from SQL 2012 onwards this will not be an option (see BOL).
Therefore, another option may be to limit the number of rows being deleted using the TOP clause. i.e.
SELECT 1
WHILE @@ROWCOUNT > 0
BEGIN
    DELETE TOP (#)
    FROM mytable
    [WHERE ...]
END
Unless you have a lot of triggers or integrity constraints to verify, deletion shouldn't be that expensive an operation.
But if you're that concerned about performance, my initial hunch would be to mark the appropriate rows as deleted and then physically delete them later during a periodic cleanup. But I'm not a big fan of this because you'll have to change any queries on that table to exclude logically- but not physically-deleted rows.
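If you do go the logical-delete route, a rough sketch of the idea looks like this (the table, column, and key values are hypothetical):
-- One-time schema change: add a flag so readers can filter out "deleted" rows
ALTER TABLE dbo.WasteBasket ADD IsDeleted bit NOT NULL DEFAULT 0;

-- "Delete" becomes a cheap single-row update...
UPDATE dbo.WasteBasket SET IsDeleted = 1 WHERE BasketId = 42;  -- example key

-- ...and a periodic cleanup job physically removes flagged rows in batches
DELETE TOP (5000) FROM dbo.WasteBasket WHERE IsDeleted = 1;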
Whenever I see a database that routinely deletes large amounts of rows in bulk, it makes me think the data model or processing design is not optimal. Why load 1 million rows and then delete them? If you need to do something like purge historical data, then consider table partitioning.
I ran into this question and found my own answer to be quite effective: do a subselect.
delete from urls where url in ( select top 10000 url from urls)
A general answer is to drop the table and re-create it; that is a well-performing solution, but it only applies when you are deleting the full table.

How-To delete 8,500,000 Records from one table on sql server

delete activities
where unt_uid is null
would be the fastest way, but nobody can access the database/table until this statement has finished, so this is a no-go.
I defined a cursor to get this task done during working hours, but the impact on productivity is still too big.
So how do I delete these records so that normal use of the database is guaranteed?
It's a SQL Server 2005 instance on 32-bit Windows 2003. Second question: how long would you estimate this job will take (6 hours or 60 hours)? (Yes, I know that depends on the load, but assume this is a small-business environment.)
You can do it in chunks. For example, every 10 seconds execute:
delete from activities where activityid in
(select top 1000 activityid from activities where unt_uid is null)
Obviously define the row count (I arbitrarily picked 1000) and interval (I picked 10 seconds) which makes the most sense for your application.
Perhaps instead of deleting the records from your table, you could create a new identical table, insert the records you want to keep, and then rename the tables so the new one replaces the old one. This would still take some time, but the downtime on your site would be pretty minimal (just when swapping the tables).
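A hedged sketch of that swap, using the table and column from the question; in practice the new table would also need its indexes, constraints, and permissions re-created before the swap:
-- Copy only the rows worth keeping into a new table
SELECT * INTO dbo.activities_new
FROM dbo.activities
WHERE unt_uid IS NOT NULL;

-- Swap the tables; only this rename step needs a brief outage window
BEGIN TRAN;
EXEC sp_rename 'dbo.activities', 'activities_old';
EXEC sp_rename 'dbo.activities_new', 'activities';
COMMIT;

-- Once verified, drop the old table to reclaim the space
DROP TABLE dbo.activities_old;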
Who can access the table will depend on your transaction isolation mode, I'd guess.
However, you're broadly right - deleting lots of rows is bad, particularly if your WHERE clause means it cannot use an index; in that case the database probably won't be able to lock only the rows it needs to delete, so it will end up taking a big lock on the whole table.
My best recommendation would be to redesign your application so you don't need to delete these rows, or possibly any rows.
You can either do this by partitioning the table such that you can simply drop partitions instead, or use the "copy the rows you want to keep then drop the table" recipe suggested by others.
I'd use the "nibbling delete" technique. From http://sqladvice.com/blogs/repeatableread/archive/2005/09/20/12795.aspx:
DECLARE @target int
SET @target = 2000
DECLARE @count int
SET @count = 2000
WHILE @count = 2000
BEGIN
    DELETE FROM myBigTable
    WHERE targetID IN
        (SELECT TOP (@target) targetID
         FROM myBigTable WITH (NOLOCK)
         WHERE something = somethingElse)
    SELECT @count = @@ROWCOUNT
    WAITFOR DELAY '000:00:00.200'
END
I've used it for exactly this type of scenario.
The WAITFOR is important to keep, it allows other queries to do their work in between deletes.
In a small-business environment, it seems odd that you would need to delete 500,000 rows in standard operational behavior without affecting any other users. Typically for deletes that large, we're making a new table and using TRUNCATE/INSERT or sp_rename to overwrite the old one.
Having said that, in a special case, one of my monthly processes regularly can delete 200m rows in batches of around 3m at a time if it detects that it needs to re-run the process which generated those 200m rows. But this is a single-user process in a dedicated data warehouse database, and I wouldn't call it a small-business scenario.
I second the answers recommending seeking alternative approaches to your design.
I would create a task for this and schedule it to run during off-peak hours. But I would not suggest deleting in place in the table being used. Move the rows you want to keep to a new table and drop the current table that holds all the rows you want to delete.

How to quickly duplicate rows in SQL

Edit: I'm running SQL Server 2008
I have about 400,000 rows in my table. I would like to duplicate these rows until my table has 160 million rows or so. I have been using a statement like this:
INSERT INTO [DB].[dbo].[Sales]
([TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate])
SELECT [TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate]
FROM [DB].[dbo].[Sales]
This process is very slow, and I have to re-issue the query a large number of times. Is there a better way to do this?
To do this many inserts you will want to disable all indexes and constraints (including foreign keys) and then run a series of:
INSERT INTO mytable
SELECT fields FROM mytable
If you need to specify ID, pick some number like 80,000,000 and include in the SELECT list ID+80000000. Run as many times as necessary (no more than 10 since it should double each time).
Also, don't run within a transaction. The overhead of doing so over such a huge dataset will be enormous. You'll probably run out of resources (rollback segments or whatever your database uses) anyway.
Then re-enable all the constraints and indexes. This will take a long time but overall it will be quicker than adding to indexes and checking constraints on a per-row basis.
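A hedged sketch of one doubling pass under those assumptions; the column name SaleID and the offset are hypothetical, and the constraint toggles follow the advice above:
-- Disable constraint checking while bulk-copying (re-validate afterwards)
ALTER TABLE [DB].[dbo].[Sales] NOCHECK CONSTRAINT ALL;

-- If SaleID is an IDENTITY column, SET IDENTITY_INSERT [DB].[dbo].[Sales] ON would also be needed
INSERT INTO [DB].[dbo].[Sales]
    ([SaleID], [TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate])
SELECT [SaleID] + 80000000, [TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate]
FROM [DB].[dbo].[Sales];

-- Re-enable and re-check all constraints once the copies are done
ALTER TABLE [DB].[dbo].[Sales] WITH CHECK CHECK CONSTRAINT ALL;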
Since each time you run that command it will double the size of your table, you would only need to run it about 9 times (400,000 × 2^9 = 204,800,000). Yes, it might take a while because copying that much data takes some time.
The speed of the insert will depend on a number of things...the physical disk speed, indexes, etc. I would recommend removing all indexes from the table and adding them back when you're done. If the table is heavily indexed then that should help quite a bit.
You should be able to repeatedly run that query in a loop until the desired number of rows is achieved. Every time you run it you'll double the data, so you'll end up with:
400,000
800,000
1,600,000
3,200,000
6,400,000
12,800,000
25,600,000
51,200,000
102,400,000
204,800,000
After nine executions.
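As for the earlier advice to remove indexes during the load, a minimal sketch (IX_Sales_LoadDate is a hypothetical nonclustered index name; disabling the clustered index would make the table unreadable, so only disable nonclustered indexes this way):
-- Disable a nonclustered index before the mass insert...
ALTER INDEX IX_Sales_LoadDate ON [DB].[dbo].[Sales] DISABLE;

-- ...and rebuild it once the data load is finished
ALTER INDEX IX_Sales_LoadDate ON [DB].[dbo].[Sales] REBUILD;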
You don't state your SQL database, but most have a bulk loading tool to handle this scenario. Check the docs. If you have to do it with INSERTs, remove all indexes from the table first and reapply them after the data is INSERTed; this will generally be much faster than indexing during insertion.
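As one hedged example of such a tool on SQL Server, BULK INSERT can reload a previously exported file; the file path and options here are assumptions for illustration only:
-- Load a native-format export (e.g. produced by bcp ... out -n) back into the table
BULK INSERT [DB].[dbo].[Sales]
FROM 'C:\export\sales.dat'
WITH (DATAFILETYPE = 'native', TABLOCK, BATCHSIZE = 500000);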
This may still take a while to run... you might want to turn off logging while you create your data.
INSERT INTO [DB].[dbo].[Sales]
    ([TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate])
SELECT s.[TotalCost], s.[SalesAmount], s.[ETLLoadID], s.[LoadDate], s.[UpdateDate]
FROM [DB].[dbo].[Sales] s WITH (NOLOCK)
CROSS JOIN (SELECT TOP 400 totalcost FROM [DB].[dbo].[Sales] WITH (NOLOCK)) o

What is the best way to delete all of a large table in t-sql?

We've run across a slightly odd situation. Basically there are two tables in one of our databases that are fed tons and tons of logging info we don't need or care about. Partially because of this we're running out of disk space.
I'm trying to clean out the tables, but it's taking forever (there are still 57,000,000+ records after letting this run through the weekend... and that's just the first table!)
Just using DELETE on the table is taking forever and eats up drive space (I believe because of the transaction log). Right now I'm using a WHILE loop to delete records X at a time, while playing around with X to determine what's actually fastest. For instance, X=1000 takes 3 seconds, while X=100,000 takes 26 seconds... which, doing the math, is slightly faster.
But the question is whether or not there is a better way?
(Once this is done, I'm going to run a SQL Agent job to clean the table out once a day... but I need it cleared out first.)
TRUNCATE the table or disable indexes before deleting
TRUNCATE TABLE [tablename]
Truncating will remove all records from the table without logging each deletion separately.
To add to the other responses, if you want to hold onto the past day's data (or past month or year or whatever), then save that off, do the TRUNCATE TABLE, then insert it back into the original table:
SELECT *
INTO tmp_My_Table
FROM My_Table
WHERE <Some_Criteria>

TRUNCATE TABLE My_Table

INSERT INTO My_Table SELECT * FROM tmp_My_Table
The next thing to do is ask yourself why you're inserting all of this information into a log if no one cares about it. If you really don't need it at all then turn off the logging at the source.
1) Truncate the table
2) Script out the table, then drop and recreate it
TRUNCATE TABLE [tablename]
will delete all the records with minimal logging (individual row deletions are not logged).
Depending on how much you want to keep, you could just copy the records you want to a temp table, truncate the log table, and copy the temp table records back to the log table.
If you can work out the optimum X, this will loop around the delete continuously at the quickest rate. Setting ROWCOUNT limits the number of records that get deleted in each step of the loop. If the log file is getting too big, stick a counter in the loop and truncate the log every million rows or so.
set rowcount x
while 1 = 1
begin
    delete from [table]
    if @@ROWCOUNT = 0 break
end
Changing the recovery model of the database to SIMPLE or BULK_LOGGED will reduce some of the delete overhead.
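A minimal sketch of that change, assuming a database named MyDb and that point-in-time recovery can be sacrificed for the duration of the cleanup; switch back and take a backup afterwards:
ALTER DATABASE MyDb SET RECOVERY SIMPLE;

-- ... run the batched delete ...

ALTER DATABASE MyDb SET RECOVERY FULL;
-- Take a full (or differential) backup here to restart the log backup chain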
Check this article from MSDN: Delete_a_Huge_Amount_of_Data_from
Information on Recovery Models
and View or Change the Recovery Model of a Database