How to efficiently delete a small set of data from a large SQL table

I want to delete 10GB (1%) of data from a 1TB table. I have come across several articles about deleting large amounts of data from a huge table, but didn't find much on deleting a smaller percentage of data from a huge table.
Additional details:
Trying to delete bot data from the visits table. The filter condition is a combination of fields: ip in (a list of about 20 IPs) and useragent like '%SOMETHING%'.
useragent is varchar(1024).
The data can be old or new; I can't use a date filter.
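In other words, the statement in question is roughly this (table name and values are placeholders):
DELETE FROM dbo.visits
WHERE ip IN ('1.2.3.4', '5.6.7.8' /* ...about 20 in all */)
  AND useragent LIKE '%SOMETHING%';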

Here is a batch delete in chunks that I use regularly. Perhaps it would give you some ideas on how to approach your need. I create a stored proc and call the proc from a SQL Agent Job. I generally schedule it to allow a transaction log backup between executions so the log does not grow too large. You could always just run it interactively if you wish.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROC [DBA_Delete_YourTableName] AS
SET NOCOUNT ON;
---------------------------------------------------------
DECLARE @DaysHistoryToKeep INT
SET @DaysHistoryToKeep = 90
IF @DaysHistoryToKeep < 30
    SET @DaysHistoryToKeep = 30
---------------------------------------------------------
DECLARE @continue INT
DECLARE @rowcount INT
DECLARE @loopCount INT
DECLARE @MaxLoops INT
DECLARE @TotalRows BIGINT
DECLARE @PurgeThruDate DATETIME
SET @PurgeThruDate = DATEADD(dd, (-1) * (@DaysHistoryToKeep + 1), GETDATE())
SET @MaxLoops = 100
SET @continue = 1
SET @loopCount = 0
SELECT @TotalRows = (SELECT COUNT(*) FROM YourTableName (NOLOCK) WHERE CREATEDDATETIME < @PurgeThruDate)
PRINT 'Total Rows = ' + CAST(@TotalRows AS VARCHAR(20))
PRINT ''
WHILE @continue = 1
BEGIN
    SET @loopCount = @loopCount + 1
    PRINT 'Loop # ' + CAST(@loopCount AS VARCHAR(10))
    PRINT CONVERT(VARCHAR(20), GETDATE(), 120)
    BEGIN TRANSACTION
    DELETE TOP (4500) YourTableName WHERE CREATEDDATETIME < @PurgeThruDate
    SET @rowcount = @@ROWCOUNT
    COMMIT
    PRINT 'Rows Deleted: ' + CAST(@rowcount AS VARCHAR(10))
    PRINT CONVERT(VARCHAR(20), GETDATE(), 120)
    PRINT ''
    IF @rowcount = 0 OR @loopCount >= @MaxLoops
    BEGIN
        SET @continue = 0
    END
END
SELECT @TotalRows = (SELECT COUNT(*) FROM YourTableName (NOLOCK) WHERE CREATEDDATETIME < @PurgeThruDate)
PRINT 'Total Rows Remaining = ' + CAST(@TotalRows AS VARCHAR(20))
PRINT ''
GO

The filter condition is ... ip in (list of ips about 20 of them) and useragent like '%SOMETHING%'
Regarding table size, it's important to touch as few rows as possible while executing the delete.
I imagine on a table that size you already have an index on the ip column. It might help (or not) to put your list of 20 or so IPs in a table instead of in an in clause, especially if they're parameters. I'd look at my query plan to see.
I hope useragent like '%SOMETHING%' is usually true; otherwise it's an expensive test, because SQL Server has to examine every row that has an eligible ip. If not, a redesign that allows the query to avoid like would probably be beneficial.
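One possible shape for such a redesign, as a sketch only (the computed column and index names are my invention, not something the poster described):
-- Evaluate the expensive LIKE once per row, on write, instead of on every delete
ALTER TABLE dbo.visits
    ADD IsSuspectAgent AS CASE WHEN useragent LIKE '%SOMETHING%' THEN 1 ELSE 0 END PERSISTED;

-- An index on the flag (plus ip) lets the delete seek to candidate rows
CREATE INDEX IX_visits_IsSuspectAgent ON dbo.visits (IsSuspectAgent, ip);

-- The delete no longer has to scan every 1024-byte useragent value
DELETE FROM dbo.visits
WHERE IsSuspectAgent = 1
  AND ip IN ('1.2.3.4' /* , ...the rest of the ~20 ips */);
Bear in mind that adding a persisted column to a 1TB table is itself a size-of-data operation, so this only pays off if bot purges recur.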
[D]eleting a smaller percentage isn't really relevant. Using selective search criteria is (per above), as is the size of the delete transaction in absolute terms. By definition, the size of the deletion in terms of rows and row size determines the size of the transaction. Very large transactions can push against machine resources. Breaking them up into smaller ones can yield better performance in such cases.
The last server I used had 0.25 TB RAM and was comfortable deleting 1 million rows at a time, but not 10 million. Your mileage will vary; you have to try, and observe, to see.
How much you're willing to tax the machine will depend on what else is (or needs to be able to) run at the same time. The way you break up one logical action -- delete all rows where [condition] -- into "chunks" also depends on what you want the database to look like while the delete procedure is in process, when some chunks are deleted and others remain present.
If you do decide to break it into chunks, I recommend not using a fixed number of rows with TOP(n) syntax, because that's the least logical solution. Unless you use order by, you're leaving it to the server to choose arbitrarily which N rows to delete. If you do use order by, you're requiring the server to sort the result before starting the delete, possibly several times over the whole run. Bleh!
Instead, find some logical subset of rows, ideally distinguishable along the clustered index, that fall beneath your machine's threshold of an acceptable number of rows to delete at one time. Loop over that set. In your case, I would be tempted to iterate over the set of ip values in the in clause. Instead of delete ... where ip in(...), you get (roughly) for each ip delete ... where ip = @ip
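A minimal sketch of that shape, assuming the IPs are first staged in a small work table (the table and cursor names are mine):
DECLARE @ip VARCHAR(45);

DECLARE ip_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT ip FROM dbo.bot_ips;   -- the ~20 ips, staged in a table

OPEN ip_cursor;
FETCH NEXT FROM ip_cursor INTO @ip;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- One logical chunk: everything for a single ip
    DELETE dbo.visits
    WHERE ip = @ip
      AND useragent LIKE '%SOMETHING%';

    FETCH NEXT FROM ip_cursor INTO @ip;
END
CLOSE ip_cursor;
DEALLOCATE ip_cursor;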
The advantage of the latter approach is that you always know where the database stands. If you kill the procedure or it gets rolled back partway through its iteration, you can examine the database to see which ips still remain (or whatever criteria you end up using). You avoid any kind of pathological behavior, whereby some query gets a partial result because some parts of your selection criteria (determined by the server alone) are deleted and others still present. In thinking about the problem, you can say, ip 192.168.0.1 hasn't been deleted yet, without wondering which portion of its rows has already been removed.
In sum, I recommend:
Give the server every chance to touch only the rows you want to affect, and verify that's what it will do.
Construct your delete routine, if you need one, to delete logical chunks, so you can reason about the state of the database at any time.

Related

How Do I Repeatedly Run SQL Delete Statements and Shrink Transaction Log?

I have a customer who is running out of space on their drive, and it's entirely taken up by the SQL DB and transaction log. Unfortunately, moving the DB and log are not an option available to us at the current time, so I need to figure out how to delete lines from 2 massive tables. So far a coworker has spent 4 days trying to do this and has not been able to put a dent in it. There are only 13GB available for the transaction log, and deleting large quantities from these tables wipes that 13GB out really quick. Obviously the quickest thing to do would be to move what we want to keep to a temporary table, truncate the existing tables, then move them back. Unfortunately, this is an extremely busy environment within a hospital, and there are tens of thousands of lines being written to these tables every hour. So, unfortunately, we can't temporarily stop writing to this table to be able to truncate.
So we've been deleting a month of data at a time from each of these tables, then shrinking the transaction log to do it again. I feel like there's got to be a way to just get this to repeat, but I'm not entirely sure what I'm doing... I tried:
delete top (10000)
from Table1_
where CreationDate_ < '2017-06-01'
Go
delete top (10000)
from Table2_
where CreationDate_ < '2017-06-01'
Go
dbcc shrinkfile (SQL_Log,4)
Go 2
Go 2
This appears to remove 10,000 lines from each table, then runs the shrinkfile for the log twice (for some reason it doesn't fully shrink down to the 4224kb size when we only run it twice), but it does not seem to repeat. I've also tried adding () starting at the first delete statement and ending after the first "Go 2" line. When I do that it just says:
Incorrect syntax near "Go"
Anybody have any clue how to do this? If we can get this to work, I plan on increasing the delete statements to a number much larger than 10,000, and increasing the repeat on the script to something much larger than 2, but I need to get it to work before I can actually do that...
You can use a WHILE loop and control the number of iterations using a COUNT of records.
DECLARE @Chunk INT = 10000
DECLARE @Date DATE = '2017-06-01'
DECLARE @Cnt INT = (SELECT COUNT(*) FROM Table1_ WHERE CreationDate_ < @Date)
WHILE @Cnt > 0
BEGIN
    delete top (@Chunk)
    from Table1_
    where CreationDate_ < @Date
    SET @Cnt = @Cnt - @Chunk
END
--Move on to the next group
SET @Date = '2017-07-01'
SET @Cnt = (SELECT COUNT(*) FROM Table1_ WHERE CreationDate_ < @Date)
WHILE @Cnt > 0
BEGIN
    --Your delete query
    SET @Cnt = @Cnt - @Chunk
END
--and so on
You don't need to shrink the log file repeatedly. If the database is in the SIMPLE recovery model, the log file will not continue to grow, as long as you don't delete too many rows in a single transaction. Once all the rows have been purged, you can shrink the log file down to something reasonable for the environment.
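In other words, something like this once, after the purge finishes, assuming the environment can tolerate losing point-in-time restore (the log file name comes from the question; the 4096 MB target is an arbitrary example):
-- Only if point-in-time restore is not required for this database
ALTER DATABASE YourDb SET RECOVERY SIMPLE;

-- One-time shrink after the purge, back to a sensible working size (in MB)
DBCC SHRINKFILE (SQL_Log, 4096);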

Updating a large table and minimizing user impact

I have a question on general database/sql server designing:
There is a table with 3 million rows that is being accessed 24x7. I need to update all the records in the table. Can you give me some methods to do this so that the user impact is minimized while I update my table?
Thanks in advance.
Normally you'd write a single update statement to update rows. But in your case you actually want to break it up.
http://www.sqlfiddle.com/#!3/c9c75/6
Is a working example of a common pattern. You don't want a batch size of 2, maybe you want 100,000 or 25,000 - you'll have to test on your system to determine the best balance between quick completion and low blocking.
declare @min int, @max int
select @min = min(user_id), @max = max(user_id)
from users
declare @tmp int
set @tmp = @min
declare @batchSize int
set @batchSize = 2
while @tmp <= @max
begin
    print 'from ' + cast(@tmp as varchar(10)) + ' to ' + cast(@tmp + @batchSize as varchar(10)) + ' starting (' + CONVERT(nvarchar(30), GETDATE(), 120) + ')'
    update users
    set name = name + '_foo'
    where user_id >= @tmp and user_id < @tmp + @batchSize and user_id <= @max
    set @tmp = @tmp + @batchSize
    print 'Done (' + CONVERT(nvarchar(30), GETDATE(), 120) + ')'
    WAITFOR DELAY '00:00:01'
end
update users
set name = name + '_foo'
where user_id > @max
We use patterns like this to update a user table about 10x your table size. With 100,000 chunks it takes about an hour. Performance depends on your hardware of course.
To minimally impact users, I would update only a certain # of records at a time. The number to update is more dependent on your hardware than anything else in my opinion.
As with all things database, it depends. What is the load pattern (ie, are users reading mainly from the end of the table)? How are new records added, if at all? What are your index fill factor settings and actual values? Will your update force any index re-computes? Can you split up the update to reduce locking? If so, do you need robust rollback ability in case of a failure? Are you setting the same value in every row, or do you need a per row calculation, or do you have a per-row source to match up?
Go through the table one row at a time using a loop or even a cursor. Make sure each update is using row locks.
If you don't have a way of identifying rows that still have to be updated, create another table first to hold the primary key and an update indicator, copy all primary key values in there and then keep track of how far you are along in that table.
This is also going to be the slowest method. If you need it to go a little faster, update a few thousand rows at a time, still using rowlock hints.
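A sketch of that bookkeeping approach, assuming the users table from the earlier example (the tracking table and the 5,000-row batch size are mine):
-- One-time setup: snapshot the keys that still need the update
CREATE TABLE dbo.users_to_update (
    user_id INT NOT NULL PRIMARY KEY,
    updated BIT NOT NULL DEFAULT 0
);

INSERT dbo.users_to_update (user_id)
SELECT user_id FROM dbo.users;

-- Work loop: a few thousand rows per pass, with row locks
DECLARE @batch TABLE (user_id INT PRIMARY KEY);

WHILE 1 = 1
BEGIN
    DELETE @batch;

    -- Grab the next slice of keys not yet processed
    INSERT @batch (user_id)
    SELECT TOP (5000) user_id
    FROM dbo.users_to_update
    WHERE updated = 0
    ORDER BY user_id;

    IF @@ROWCOUNT = 0 BREAK;

    UPDATE u
    SET u.name = u.name + '_foo'
    FROM dbo.users AS u WITH (ROWLOCK)
    JOIN @batch AS b ON b.user_id = u.user_id;

    -- Mark this slice done so progress survives an interruption
    UPDATE t
    SET t.updated = 1
    FROM dbo.users_to_update AS t
    JOIN @batch AS b ON b.user_id = t.user_id;
END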

sql server: Is this nesting in a transaction sufficient for getting a unique number from the database?

I want to generate a unique number from a table.
It has to be thread safe, of course: when I check for the last number and get '3', and then store '4' in the database, I don't want anybody else, just in between those two actions (getting the number and storing it one higher in the database), also to get '3' back and then also store '4'.
So I thought, put it in a transaction like this:
begin transaction
declare @maxNum int
select @maxNum = MAX(SequenceNumber) from invoice
where [Year] = @year
if @maxNum is null
begin
    set @maxNum = 0
end
set @maxNum = @maxNum + 1
INSERT INTO [Invoice]
    ([Year]
    ,[SequenceNumber]
    ,[DateCreated])
VALUES
    (@year
    ,@maxNum
    ,GETUTCDATE()
    )
commit transaction
return @maxNum
But I wondered, is it enough to put it in a transaction?
My first thought was: it locks this sp for usage by other people, but is that correct? How can SQL Server know what to lock at the first step?
Will this construction guarantee that nobody else will do the select @maxNum part just when I am updating the @maxNum value, at that moment receiving the same @maxNum as I did, so I'm in trouble?
I hope you understand what I want to accomplish, and also whether I chose the right solution.
EDIT:
also described as 'How to Single-Thread a stored procedure'
If you want to have the year and a sequence number stored in the database, and create an invoice number from that, I'd use:
an InvoiceYear column (which could totally be computed as YEAR(InvoiceDate))
an InvoiceID INT IDENTITY column which you could reset every year to 1
create a computed column InvoiceNumber as:
ALTER TABLE dbo.InvoiceTable
ADD InvoiceNumber AS CAST(InvoiceYear AS VARCHAR(4)) + '-' +
RIGHT('000000' + CAST(InvoiceID AS VARCHAR(6)), 6) PERSISTED
This way, you automagically get invoice numbers:
2010-000001
......
2010-001234
......
2010-123456
Of course, if you need more than 6 digits (= 1 million invoices) - just adjust the RIGHT() and CAST() statements for the InvoiceID column.
Also, since this is a persisted computed column, you can index it for fast retrieval.
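For example (the index name is my own):
CREATE NONCLUSTERED INDEX IX_InvoiceTable_InvoiceNumber
    ON dbo.InvoiceTable (InvoiceNumber);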
This way, you don't have to worry about concurrency, stored procs, transactions and stuff like that - SQL Server will do that for you - for free!
No, it's not enough. The shared lock set by the select will not prevent anyone from reading that same value at the same time.
Change this:
select @maxNum = MAX(SequenceNumber) from invoice where [Year] = @year
To this:
select @maxNum = MAX(SequenceNumber) from invoice with (updlock, holdlock) where [Year] = @year
This way you replace the shared lock with an update lock, and two update locks are not compatible with each other.
The holdlock means that the lock is to be held until the end of the transaction. So you do still need the transaction bit.
Note that this will not help if there's some other procedure that also wants to do the update. If that other procedure reads the value without providing the updlock hint, it will still be able to read the previous value of the counter. This may be a good thing, as it improves concurrency in scenarios where the other readers do not intend to make an update later, but it also may be not what you want, in which case either update all procedures to use updlock, or use xlock instead to place an exclusive lock, not compatible with shared locks.
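Applied to the original procedure, the critical section would then look roughly like this (still assuming @year is supplied by the caller):
begin transaction

declare @maxNum int

-- updlock blocks a second reader; holdlock keeps the range locked until commit
select @maxNum = MAX(SequenceNumber)
from invoice with (updlock, holdlock)
where [Year] = @year

set @maxNum = ISNULL(@maxNum, 0) + 1

INSERT INTO [Invoice] ([Year], [SequenceNumber], [DateCreated])
VALUES (@year, @maxNum, GETUTCDATE())

commit transaction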
As it turned out, I didn't want to lock the table; I just wanted to execute the stored procedure one at a time.
In C# code I would place a lock on another object, and that's what was discussed here:
http://www.sqlservercentral.com/Forums/Topic357663-8-1.aspx
So that's what I used:
declare @Result int
EXEC @Result =
    sp_getapplock @Resource = 'holdit1', @LockMode = 'Exclusive', @LockTimeout = 10000 --Time to wait for the lock
IF @Result < 0
BEGIN
    ROLLBACK TRAN
    RAISERROR('Procedure Already Running for holdit1 - Concurrent execution is not supported.', 16, 9)
    RETURN(-1)
END
where 'holdit1' is just a name for the lock.
@Result returns 0 or 1 if it succeeds in getting the lock (0 when the lock is granted immediately, 1 when it is granted after waiting); negative values mean it failed.
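Note that with the default @LockOwner of 'Transaction', sp_getapplock must be called from within an open transaction, and the lock is released automatically when that transaction commits or rolls back. So the full pattern is roughly:
BEGIN TRAN

DECLARE @Result int
EXEC @Result = sp_getapplock @Resource = 'holdit1', @LockMode = 'Exclusive', @LockTimeout = 10000

IF @Result < 0
BEGIN
    ROLLBACK TRAN
    RAISERROR('Procedure Already Running for holdit1 - Concurrent execution is not supported.', 16, 9)
    RETURN(-1)
END

-- ...the work that must be single-threaded goes here...

COMMIT TRAN  -- releases the application lock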

Copy one column to another for over a billion rows in SQL Server database

Database : SQL Server 2005
Problem : Copy values from one column to another column in the same table with a billion+ rows.
test_table (int id, bigint bigid)
Tried 1 - an update query:
update test_table set bigid = id
fills up the transaction log and rolls back due to lack of transaction log space.
Tried 2 - a procedure on following lines
set nocount on
declare @rowcount int, @rowsupdated int
set @rowcount = 1
set @rowsupdated = 0
set rowcount 500000
while @rowcount > 0
begin
    update test_table set bigid = id where bigid is null
    set @rowcount = @@rowcount
    set @rowsupdated = @rowsupdated + @rowcount
end
print @rowsupdated
The above procedure starts slowing down as it proceeds.
Tried 3 - Creating a cursor for update.
Generally discouraged in SQL Server documentation, and this approach updates one row at a time, which is too time-consuming.
Is there an approach that can speed up the copying of values from one column to another. Basically I am looking for some 'magic' keyword or logic that will allow the update query to rip through the billion rows half a million at a time sequentially.
Any hints, pointers will be much appreciated.
I'm going to guess that you are closing in on the 2.1 billion limit of an INT datatype on an artificial key for a column. Yes, that's a pain. Much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)
Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.
Log Growth
The log blew up originally because it was trying to commit all 2b rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.
If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently during the running of your operation so that SQL can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.
Indexes and Speed
ALL of the where bigid is null answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field. You could, of course, just add an index on BIGID, but I'm not convinced that is the right answer.
The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, let's take advantage of that fact, and do a variation of Jess' idea:
declare @counter int
set @counter = 1
while @counter < 2000000000 --or whatever
begin
    update test_table set bigid = id
    where id between @counter and (@counter + 499999) --BETWEEN is inclusive
    set @counter = @counter + 500000
end
This should be extremely fast, because of the existing indexes on ID.
The ISNULL check really wasn't necessary anyway, neither is my (-1) on the interval. If we duplicate some rows between calls, that's not a big deal.
Use TOP in the UPDATE statement:
DECLARE @row_limit INT
SET @row_limit = 500000

UPDATE TOP (@row_limit) dbo.test_table
SET bigid = id
WHERE bigid IS NULL
You could try to use something like SET ROWCOUNT and do batch updates:
SET ROWCOUNT 5000;
UPDATE dbo.test_table
SET bigid = id
WHERE bigid IS NULL
GO
and then repeat this as many times as you need to.
This way, you're avoiding the RBAR (row-by-agonizing-row) symptoms of cursors and while loops, and yet, you don't unnecessarily fill up your transaction log.
Of course, in between runs, you'd have to do backups (especially of your log) to keep its size within reasonable limits.
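For example, between batches (the database name and path are placeholders):
BACKUP LOG YourDatabase TO DISK = N'X:\Backups\YourDatabase_log.trn';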
Is this a one-time thing? If so, just do it by ranges:
declare @counter int
set @counter = 500000
while @counter < 2000000000 --or whatever your max id
begin
    update test_table set bigid = id where id between (@counter - 500000) and @counter and bigid is null
    set @counter = @counter + 500000
end
I didn't run this to try it, but if you can get it to update 500k at a time I think you're moving in the right direction.
set rowcount 500000
update tt1
set bigid = (SELECT tt2.id FROM test_table tt2 WHERE tt1.id = tt2.id)
from test_table tt1
where tt1.bigid IS NULL
You can also try changing the recovery model to SIMPLE so the log space is reused rather than retained for backup (note that the update itself is still fully logged):
ALTER DATABASE db1
SET RECOVERY SIMPLE
GO
update test_table
set bigid = id
GO
ALTER DATABASE db1
SET RECOVERY FULL
GO
The first step, if there are any, would be to drop indexes before the operation. This is probably what is causing the speed degradation over time.
The other option, a little outside-the-box thinking... can you express the update in such a way that you could materialize the column values in a select? If you can do this, then you could create what amounts to a NEW table using SELECT INTO, which is a minimally logged operation (assuming in 2005 that you are set to a recovery model of SIMPLE or BULK_LOGGED). This would be pretty fast, and then you can drop the old table, rename this table to the old table name, and recreate any indexes.
select id, CAST(id as bigint) bigid into test_table_temp from test_table
drop table test_table
exec sp_rename 'test_table_temp', 'test_table'
I second the
UPDATE TOP(X) statement
Also, if you're in a loop, add in a WAITFOR delay or a COMMIT in between, to allow other processes some time to use the table if needed, rather than blocking until all the updates are completed.
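Combining the two suggestions, a sketch (the batch size and delay are arbitrary):
WHILE 1 = 1
BEGIN
    UPDATE TOP (500000) dbo.test_table
    SET bigid = id
    WHERE bigid IS NULL;

    IF @@ROWCOUNT = 0 BREAK;

    -- Let other sessions get at the table between batches
    WAITFOR DELAY '00:00:02';
END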

Batch commit on large INSERT operation in native SQL?

I have a couple of large tables (188m and 144m rows) I need to populate from views, but each view contains a few hundred million rows (pulling together pseudo-dimensionally modelled data into a flat form). The keys on each table are composites of over 50 bytes of columns. If the data was in tables, I could always think about using sp_rename to make the other new table, but that isn't really an option.
If I do a single INSERT operation, the process uses a huge amount of transaction log space, typically filling it up and prompting a bunch of hassle with the DBAs. (And yes, this is probably a job the DBAs should handle/design/architect.)
I can use SSIS and stream the data into the destination table with batch commits (but this does require the data to be transmitted over the network, since we are not allowed to run SSIS packages on the server).
Is there anything other than dividing the process up into multiple INSERT operations, using some kind of key to distribute the rows into different batches, and doing a loop?
Does the view have ANY kind of unique identifier / candidate key? If so, you could select those rows into a working table using:
SELECT key_column INTO dbo.temp FROM dbo.HugeView;
(If it makes sense, maybe put this table into a different database, perhaps with SIMPLE recovery model, to prevent the log activity from interfering with your primary database. This should generate much less log anyway, and you can free up the space in the other database before you resume, in case the problem is that you have inadequate disk space all around.)
Then you can do something like this, inserting 10,000 rows at a time, and backing up the log in between:
SET NOCOUNT ON;
DECLARE
    @batchsize INT,
    @ctr INT,
    @rc INT;
SELECT
    @batchsize = 10000,
    @ctr = 0;
WHILE 1 = 1
BEGIN
    WITH x AS
    (
        SELECT key_column, rn = ROW_NUMBER() OVER (ORDER BY key_column)
        FROM dbo.temp
    )
    INSERT dbo.PrimaryTable(a, b, c, etc.)
    SELECT v.a, v.b, v.c, etc.
    FROM x
    INNER JOIN dbo.HugeView AS v
    ON v.key_column = x.key_column
    WHERE x.rn > @batchsize * @ctr
    AND x.rn <= @batchsize * (@ctr + 1);
    IF @@ROWCOUNT = 0
        BREAK;
    BACKUP LOG PrimaryDB TO DISK = 'C:\db.bak' WITH INIT;
    SET @ctr = @ctr + 1;
END
That's all off the top of my head, so don't cut/paste/run, but I think the general idea is there. For more details (and why I backup log / checkpoint inside the loop), see this post on sqlperformance.com:
Break large delete operations into chunks
Note that if you are taking regular database and log backups, you will probably want to take a full backup afterwards to start your log chain over again.
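That is, once the batched load is finished (names and path are placeholders):
BACKUP DATABASE PrimaryDB TO DISK = N'X:\Backups\PrimaryDB_full.bak' WITH INIT;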
You could partition your data and insert it in a cursor loop. That would be nearly the same as SSIS batch inserting, but it runs on your server.
create cursor ....
select YEAR(DateCol), MONTH(DateCol) from whatever
while ....
insert into yourtable(...)
select * from whatever
where YEAR(DateCol) = year and MONTH(DateCol) = month
end
I know this is an old thread, but I made a generic version of Arthur's cursor solution:
--Split a batch up into chunks using a cursor.
--This method can be used for most any large table with some modifications
--It could also be refined further with a @Day variable (for example)
DECLARE @Year INT
DECLARE @Month INT
DECLARE BatchingCursor CURSOR FOR
SELECT DISTINCT YEAR(<SomeDateField>), MONTH(<SomeDateField>)
FROM <SomeTable>;
OPEN BatchingCursor;
FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
WHILE @@FETCH_STATUS = 0
BEGIN
    --All logic goes in here
    --Any select statements from <SomeTable> need to be suffixed with:
    --WHERE YEAR(<SomeDateField>) = @Year AND MONTH(<SomeDateField>) = @Month
    FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
END;
CLOSE BatchingCursor;
DEALLOCATE BatchingCursor;
GO
This solved the problem on loads of our large tables.
There is no pixie dust, you know that.
Without knowing specifics about the actual schema being transfered, a generic solution would be exactly as you describe it: divide processing into multiple inserts and keep track of the key(s). This is sort of pseudo-code T-SQL:
create table currentKeys ([table] sysname not null primary key, [key] sql_variant not null);
go
declare @keysInserted table ([key] sql_variant);
declare @key sql_variant;
begin transaction
while (1 = 1)
begin
    select @key = [key] from currentKeys where [table] = '<target>';
    insert into <target> (...)
    output inserted.[key] into @keysInserted ([key])
    select top (<batchsize>) ... from <source>
    where [key] > @key
    order by [key];
    if (0 = @@rowcount)
        break;
    update currentKeys
    set [key] = (select max([key]) from @keysInserted)
    where [table] = '<target>';
    commit;
    delete from @keysInserted;
    set @key = null;
    begin transaction;
end
commit
It would get more complicated if you want to allow for parallel batches and partition the keys.
You could use the BCP command to load the data and use the Batch Size parameter
http://msdn.microsoft.com/en-us/library/ms162802.aspx
Two step process
BCP OUT data from Views into Text files
BCP IN data from Text files into Tables with batch size parameter
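A sketch of the two commands (server, database, object, and file names are all placeholders; -T uses a trusted connection, -n keeps native format, and -b sets the batch size on import):
bcp MyDb.dbo.HugeView out C:\temp\HugeView.dat -S MyServer -T -n
bcp MyDb.dbo.PrimaryTable in C:\temp\HugeView.dat -S MyServer -T -n -b 10000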
This looks like a job for good ol' BCP.