Deleting data from a large table - SQL

I have a table with about 10 fields that stores GPS info for customers. Over time, as we have added more customers, that table has grown to about 14 million rows. As the GPS data comes in, a service constantly inserts a row into the table. 90% of the data is not relevant, i.e. the customer does not care where the vehicle was 3 months ago, but the most recent data is used to generate tracking reports. My goal is to write SQL to purge any data that is older than a month.
Here is my problem: I cannot use TRUNCATE TABLE, as I would lose everything.
Yesterday I wrote a DELETE statement with a WHERE clause. When I ran it on a test system, it locked up my table and the simulated GPS inserts were intermittently failing. Also, my transaction log grew to over 6GB as it attempted to log each delete.
My first thought was to delete the data a little at a time, starting with the oldest first, but I was wondering if there is a better way.

My 2 cents:
If you are using SQL 2005 or above, you can consider partitioning your table based on the date field, so the table doesn't get locked when deleting old records.
Maybe, if you are in a position to make DBA decisions, you can temporarily change your recovery model to Simple so the log doesn't grow too fast; it will still grow, but the logging won't be as detailed.
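For reference, the recovery model change is a one-liner (the database name here is just a placeholder); note that switching from Full to Simple breaks the log backup chain, so take a full backup once you switch back:
ALTER DATABASE YourGpsDatabase SET RECOVERY SIMPLE
-- ... run the purge ...
ALTER DATABASE YourGpsDatabase SET RECOVERY FULL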

Try this
WHILE EXISTS (SELECT * FROM Table WHERE (condition for deleting))
BEGIN
    SET ROWCOUNT 1000
    DELETE FROM Table WHERE (condition for deleting)
    SET ROWCOUNT 0
END
This will delete the rows in groups of 1000

A better approach is to create a temporary table and insert only the data you want to keep. Then truncate your original table and copy the kept rows back.
Oracle syntax (SQL Server is similar)
create table keep as select * from source where data_is_good = 1;
truncate table source;
insert into source select * from keep;
You'll need to disable foreign keys, if there are any on the source table.
In Oracle, index names must be unique in the entire schema, not just per table. In SQL Server, you can further optimize this by simply renaming "keep" to "source", since you can easily create indexes of the same name on both tables.
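For SQL Server, a rough sketch of that rename variant (object names are placeholders, and you would still need to recreate indexes and constraints on the new table):
SELECT *
INTO dbo.source_keep
FROM dbo.source
WHERE data_is_good = 1

EXEC sp_rename 'dbo.source', 'source_old'
EXEC sp_rename 'dbo.source_keep', 'source'
-- verify, then DROP TABLE dbo.source_old when you are satisfied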

If you're using SQL Server 2005 or 2008, sliding window partitioning is the perfect solution for this - instant archiving or purging without any perceptible locking. Have a look here for further information.
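Very roughly, a sliding-window purge looks like the sketch below (the partition function, boundary value, and archive table are invented; the archive table must be empty, have an identical structure, and sit on the same filegroup):
-- switch the oldest partition out of the live table; this is a metadata-only operation
ALTER TABLE dbo.tblGPSVehicleInfoLog SWITCH PARTITION 1 TO dbo.tblGPSVehicleInfoLog_Archive
-- then slide the window by merging away the now-empty boundary
ALTER PARTITION FUNCTION pfGpsByMonth() MERGE RANGE ('2009-03-01')
-- truncate or archive dbo.tblGPSVehicleInfoLog_Archive at your leisure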

Welcome to Data Warehousing. You need to split your data into two parts.
The actual application, with current data only.
The history.
You need to write a little "ETL" job to move data from current to history, and then delete from current the rows that were moved.
You need to run this periodically. Daily, weekly, monthly, quarterly -- it doesn't matter technically. What matters is what use the history has and who uses it.
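A minimal sketch of that job in T-SQL (table names are assumed, and the history table must have the same columns as the live one); the OUTPUT clause moves each deleted batch into history in one statement:
DECLARE @cutoff DATETIME
SET @cutoff = DATEADD(MONTH, -1, GETDATE())
WHILE 1 = 1
BEGIN
    -- move one batch: deleted rows are copied into the history table
    DELETE TOP (5000) FROM dbo.tblGPSCurrent
    OUTPUT DELETED.* INTO dbo.tblGPSHistory
    WHERE UpdateDate < @cutoff
    IF @@ROWCOUNT = 0 BREAK -- nothing older than the cutoff is left
END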

Can you copy recent data to a new table, truncate the table, then copy it back?
Of course, then you're going to need to worry about doing that again in 6 months or a year.

I would do a manual delete by day/month (whatever is the largest unit you can get away with). Once you do that first one, write a stored proc to kick off every day that deletes the oldest data you don't need.
DELETE FROM TABLENAME
WHERE datediff(day, tableDateTime, getdate()) > 90
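If tableDateTime is indexed, a sargable variant (same placeholder names as above) compares the raw column against a precomputed cutoff, so the index can actually be used instead of evaluating DATEDIFF for every row:
DELETE FROM TABLENAME
WHERE tableDateTime < DATEADD(day, -90, GETDATE())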
Personally, I hate doing stuff to production datasets where one missed key results in some really bad things happening.

I would probably do it in batches as you have already come up with. Another option would be to insert the important data into another table, truncate the GPS table, then reinsert the important data. You would have a small window where you would be missing the recent historical data. How small that window is would depend on how much data you needed to reinsert. Also, you would need to be careful if the table uses autoincrementing numbers or other defaults so that you use the original values.
Once you have the table cleaned up, a regular cleaning job should be scheduled. You might also want to look into partitioning depending on your RDBMS.

I assume you can't take the production system down (or queue up the GPS results for insertion after the purge is complete).
I'd go with your inclination of deleting a fraction of it at a time (perhaps 10%) depending on the performance you find on your test system.
Is your table indexed? That might help, but the indexing process may have similar effects on the system as doing the one great purge.

Keep in mind that most databases lock the neighboring records in an index during a transaction, so keeping your operations short will be helpful. I'm assuming that your insertions are failing on lock wait timeouts, so delete your data in small, bursty transactions. I'd suggest a single-threaded Perl script that loops through the oldest rows in 1,000-row chunks. I hope your primary key (and hopefully your clustered index, in case they somehow ended up being two different things) can be correlated to time, as that would be the best thing to delete by.
PseudoSQL:
Select max(primId) where timestamp < 3_months_ago
Delete from table where primId < maxPrimId limit 1000
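A rough T-SQL rendering of that idea (table, key, and timestamp column names are assumed), deleting by the clustered key in 1,000-row chunks:
DECLARE @maxPrimId INT
SELECT @maxPrimId = MAX(primId)
FROM dbo.tblGPS
WHERE GPSDeviceTimeStamp < DATEADD(MONTH, -3, GETDATE())
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) FROM dbo.tblGPS
    WHERE primId <= @maxPrimId
    IF @@ROWCOUNT = 0 BREAK -- nothing left under the cutoff
END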
Now, here's the really fun part: All these deletions MAY make your indexes a mess and require that they be rebuilt to keep the machine from getting slow. In that case, you'll either have to swap in an up-to-date slave, or just suffer some downtime. Make sure you test for this possible case on your test machine.

If you are using Oracle, I would set up a partition by date on your tables and the indexes. Then you delete the data by dropping the partition... the data will magically go away with the partition.
This is an easy step - and doesn't clog up your redo logs, etc.
There's a basic intro to all this here
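A rough Oracle sketch of the idea (all names invented): range-partition the GPS table by month, and purging a month becomes a metadata-only partition drop rather than a row-by-row delete:
CREATE TABLE gps_log (
    vehicle_id NUMBER,
    log_date   DATE,
    latitude   NUMBER,
    longitude  NUMBER
)
PARTITION BY RANGE (log_date) (
    PARTITION p_2009_03 VALUES LESS THAN (TO_DATE('2009-04-01', 'YYYY-MM-DD')),
    PARTITION p_2009_04 VALUES LESS THAN (TO_DATE('2009-05-01', 'YYYY-MM-DD')),
    PARTITION p_2009_05 VALUES LESS THAN (TO_DATE('2009-06-01', 'YYYY-MM-DD'))
);
-- drop the oldest month; local indexes are dropped with it
ALTER TABLE gps_log DROP PARTITION p_2009_03 UPDATE GLOBAL INDEXES;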

Does the delete statement use any of the indexes on the table? Often a huge performance improvement can be obtained by either modifying the statement to use an existing index or adding an index that helps the query the delete statement has to run.
Also, as others have mentioned, the deletes should be done in multiple chunks instead of one huge statement. This prevents the table from being locked too long and other processes from timing out while waiting for the delete to finish.

Performance is pretty fast when dropping a table - even a very large one. So here is what I would do: script out your table, complete with indexes, from Management Studio. Edit the script and run it to create a copy of your table; call it table2. Do an INSERT ... SELECT to park the data you want to retain in table2. Rename the old table to, say, tableOld, and rename table2 to the original name. Wait. If no one screams at you, drop tableOld.
There is some risk.
1) Check if there are triggers or constraints defined on the original table. They may not get included in the script generated by Management Studio.
2) If the original table has identity fields, you may have to turn on IDENTITY_INSERT before inserting into the new table.

I came up with the following T-SQL script which gets an arbitrary amount of recent data.
IF EXISTS(SELECT name FROM sys.tables WHERE name = 'tmp_xxx_tblGPSVehicleInfoLog')
BEGIN
PRINT 'Dropping temp table tmp_xxx_tblGPSVehicleInfoLog'
DROP TABLE tmp_xxx_tblGPSVehicleInfoLog
END
GO
PRINT 'Creating temp table tmp_xxx_tblGPSVehicleInfoLog'
CREATE TABLE [dbo].[tmp_xxx_tblGPSVehicleInfoLog](
[GPSVehicleInfoLogId] [uniqueidentifier] NOT NULL,
[GPSVehicleInfoId] [uniqueidentifier] NULL,
[Longitude] [float] NULL,
[Latitude] [float] NULL,
[GroundSpeed] [float] NULL,
[Altitude] [float] NULL,
[Heading] [float] NULL,
[GPSDeviceTimeStamp] [datetime] NULL,
[Milliseconds] [float] NULL,
[DistanceNext] [float] NULL,
[UpdateDate] [datetime] NULL,
[Stopped] [nvarchar](1) NULL,
[StopTime] [datetime] NULL,
[StartTime] [datetime] NULL,
[TimeStopped] [nvarchar](100) NULL
) ON [PRIMARY]
GO
PRINT 'Inserting data from tblGPSVehicleInfoLog to tmp_xxx_tblGPSVehicleInfoLog'
INSERT INTO tmp_xxx_tblGPSVehicleInfoLog
SELECT * FROM tblGPSVehicleInfoLog
WHERE tblGPSVehicleInfoLog.UpdateDate between '03/30/2009 23:59:59' and '05/19/2009 00:00:00'
GO
PRINT 'Truncating table tblGPSVehicleInfoLog'
TRUNCATE TABLE tblGPSVehicleInfoLog
GO
PRINT 'Inserting data from tmp_xxx_tblGPSVehicleInfoLog to tblGPSVehicleInfoLog'
INSERT INTO tblGPSVehicleInfoLog
SELECT * FROM tmp_xxx_tblGPSVehicleInfoLog
GO

To keep the transaction log from growing out of control, modify it in the following way:
DECLARE @i INT
SET @i = 1
SET ROWCOUNT 10000
WHILE @i > 0
BEGIN
    BEGIN TRAN
    DELETE FROM dbo.SuperBigTable
    WHERE RowDate < '2009-01-01'
    COMMIT
    SELECT @i = @@ROWCOUNT
END
SET ROWCOUNT 0
And here is a version using the preferred TOP syntax for SQL 2005 and 2008:
DECLARE @i INT
SET @i = 1
WHILE @i > 0
BEGIN
    BEGIN TRAN
    DELETE TOP (1000) FROM dbo.SuperBigTable
    WHERE RowDate < '2009-01-01'
    COMMIT
    SELECT @i = @@ROWCOUNT
END

I'm sharing my solution. I did index the date field. While the procedure was running, I tested getting record counts, inserts, and updates; they were all able to complete while it ran. On an Azure managed instance running on the absolute lowest configuration (General Purpose, 4 cores), I was able to purge 1 million rows in about a minute (roughly 55 seconds).
CREATE PROCEDURE [dbo].[PurgeRecords] (
    @iPurgeDays INT = 2,
    @iDeleteRows INT = 1000,
    @bDebug BIT = 1 --defaults to debug mode
)
AS
SET NOCOUNT ON
DECLARE @iRecCount INT = 0
DECLARE @iCycles INT = 0
DECLARE @iRowCount INT = 1
DECLARE @dtPurgeDate DATETIME = GETDATE() - @iPurgeDays
SELECT @iRecCount = COUNT(1) FROM YOURTABLE WHERE [Created] <= @dtPurgeDate
SELECT @iCycles = @iRecCount / @iDeleteRows
SET @iCycles = @iCycles + 1 --add one more cycle to get the remainder
--purge the rows in groups
WHILE @iRowCount <= @iCycles
BEGIN
    BEGIN TRY
        IF @bDebug = 0
        BEGIN
            --delete a group of records
            DELETE TOP (@iDeleteRows) FROM YOURTABLE WHERE [Created] <= @dtPurgeDate
        END
        ELSE
        BEGIN
            --display the delete that would have taken place
            PRINT 'DELETE TOP (' + CONVERT(VARCHAR(10), @iDeleteRows) + ') FROM YOURTABLE WHERE [Created] <= ''' + CONVERT(VARCHAR(25), @dtPurgeDate) + ''''
        END
        SET @iRowCount = @iRowCount + 1
    END TRY
    BEGIN CATCH
        --if there are any issues with the delete, raise error and back out
        RAISERROR('Error purging YOURTABLE Records', 16, 1)
        RETURN
    END CATCH
END
GO

Related

SQL Column ID value jumps 10000 times

The ID column value jumps by 10000 at a time.
For example:
the ID goes from 5 to 10006,
then continues 10007, 10008, 10009,
and then jumps to 20003, 20004 ....
How can I fix the ID values and put them in order again, like before?
Also, I have found something about a reseed function, but I do not know what it is or how to use it.
I'm assuming you're using an identity column:
ID INT NOT NULL IDENTITY(1,1)
There's no guarantee that an identity column will remain in sequence. It was annoying when it first became more obvious (it didn't appear to happen in older versions of SQL Server, but apparently it could), but it was also always by design. The skipping became very apparent when SQL Server 2012 (I think) was released. You're supposed to use a SEQUENCE now, I believe, if maintaining a steady sequence is required - e.g. for invoice numbers:
https://dba.stackexchange.com/questions/62151/what-could-cause-an-auto-increment-primary-key-to-skip-numbers
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-sequence-transact-sql?view=sql-server-ver15
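For illustration only (SQL Server 2012+, and the object names here are made up), a SEQUENCE-based setup looks roughly like this; note that even a sequence can leave gaps if a transaction rolls back:
CREATE SEQUENCE dbo.InvoiceNumbers
    AS INT
    START WITH 1
    INCREMENT BY 1
    NO CACHE -- avoids the cache-loss gaps IDENTITY shows after a restart
GO
CREATE TABLE dbo.Invoice
(
    InvoiceId INT NOT NULL
        CONSTRAINT DF_Invoice_Id DEFAULT (NEXT VALUE FOR dbo.InvoiceNumbers)
        CONSTRAINT PK_Invoice PRIMARY KEY,
    Amount MONEY NOT NULL
)
GO
INSERT INTO dbo.Invoice (Amount) VALUES (10.00), (25.50) -- gets InvoiceId 1 and 2 from the sequence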
It may also appear to skip if you perform an INSERT and it fails, but that will typically only skip 1. This has always happened and is by design - you need to reseed your identity to overcome it. Something like:
DBCC CHECKIDENT ("dbo.MyTable", RESEED, 10)
will make the next identity number 11, provided the other skipping doesn't also occur.
EDIT:
In relation to re-aligning your existing entries: I'm no DB expert, but I did do this the other day on a table using a fairly rudimentary approach. It's only a small table, though, and there's probably a better way to do it:
BEGIN TRAN
--CREATE TEMP TABLE
DECLARE @Tooltip TABLE
(
[TooltipId] INT NOT NULL IDENTITY(1,1),
[TooltipKey] NVARCHAR(100) NOT NULL,
[Name] NVARCHAR(255) NOT NULL
)
--INSERT EXISTING INTO TEMP TABLE
INSERT INTO @Tooltip (TooltipKey, Name)
SELECT TooltipKey, Name
FROM dbo.Tooltip
ORDER BY TooltipId
--CLEAR ACTUAL TABLE
TRUNCATE TABLE dbo.Tooltip
--RESET IDENTITY TO 1
DBCC CHECKIDENT ("dbo.Tooltip", RESEED, 1)
--REINSERT FROM TEMP TABLE INTO ACTUAL TABLE
INSERT INTO dbo.Tooltip (TooltipKey, Name )
SELECT TooltipKey, Name
FROM @Tooltip
ORDER BY TooltipId
--TEST OUTPUT
SELECT * FROM dbo.Tooltip
--DO THIS FOR TESTING
ROLLBACK TRAN
--DO THIS WHEN YOU'RE CERTAIN YOU WANT TO PERFORM THE ACTION
--COMMIT TRAN
Bear in mind that if you have foreign keys or other references, the truncate won't work and you'll have to do something more complex, particularly if you have foreign keys referencing your existing incorrect IDs.
This is not a problem. This is a performance feature of SQL Server.
SQL Server is designed to handle many concurrent transactions -- think dozens or hundreds of inserts per second. It can do this on systems with multiple processors.
In such an environment, "adding just 1" to the maximum can have a lot of overhead -- all the different processors have to agree on what the maximum is. This involves complex locking or sequencing of the transactions -- which slows things down.
To prevent performance bottlenecks, SQL Server will sometimes pre-allocate identity values. This can result in gaps if the numbers are not used.
If you don't like this feature, you can work around it by using a sequence and a trigger to assign the value. Just be warned that alternative approaches have performance implications.
Have you been running large deletes?
Delete doesn't reset the identity, so if you had rows 1-10000 and then deleted all of them, the identity would still continue from 10001 when you added a new row.
Truncate does reset the identity, but it always removes ALL rows (and is only minimally logged).
You could use the reseed function to reset the identity as well, but it wouldn't be helpful in this case, since you'd slowly increment back into IDs used by existing data.

How to properly index SQL Server table with 25 million rows

I have created a table in SQL Server 2008 R2 as follows:
CREATE TABLE [dbo].[7And11SidedDiceGame]
(
[Dice11Sides] [INT] NULL,
[Dice7Sides] [INT] NULL,
[WhoWon] [INT] NULL
)
I added the following index:
CREATE NONCLUSTERED INDEX [idxWhoWon]
ON [dbo].[7And11SidedDiceGame] ([WhoWon] ASC)
I then created a WHILE loop to insert 25 million randomly generated rows to tally the results for statistical analysis.
Once I optimized the insert (using BEGIN TRAN and COMMIT TRAN before and after the loop), the WHILE loop ran decently. However, analyzing the data takes a long time. For example, the following statement takes about 4 minutes to run:
DECLARE @TotalRows real
SELECT @TotalRows = COUNT(*)
FROM [test].[dbo].[7And11SidedDiceGame]
PRINT REPLACE(CONVERT(VARCHAR, CAST(@TotalRows AS money), 1),'.00','')
SELECT
    WhoWon, COUNT(WhoWon) AS Total,
    ((COUNT(WhoWon) * 100) / @TotalRows) AS PercentWinner
FROM
    [test].[dbo].[7And11SidedDiceGame]
GROUP BY
    WhoWon
My question is how can I better index the table to speed up retrieval of the data? Or do I need to approach the pulling of the data in a different manner?
I don't think you can do much here.
The query has to read all 25M rows from the index to count them. That said, 25M rows is not that much, and I'd expect it to take less than 4 minutes on modern hardware.
It is only about 100MB of data to read (OK, in practice it is more, say 200MB, but it still should not take 4 minutes to read 200MB off the disk).
Is the server under a heavy load? Are there a lot of inserts into this table?
You could make a small improvement by defining WhoWon column as NOT NULL in the table. Do you really have NULL values in it?
And then use COUNT(*) instead of count(WhoWon) in the query.
If this query runs often, but the data in the table doesn't change too often, you can create an indexed view that would essentially materialise/cache/pre-calculate these Counts, so the query that would run off such view would be much faster.
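For what it's worth, a sketch of such an indexed view (view and index names are made up; indexed views need SCHEMABINDING and COUNT_BIG(*)):
CREATE VIEW dbo.vWhoWonTotals
WITH SCHEMABINDING
AS
SELECT WhoWon, COUNT_BIG(*) AS Total
FROM dbo.[7And11SidedDiceGame]
GROUP BY WhoWon
GO
CREATE UNIQUE CLUSTERED INDEX IX_vWhoWonTotals ON dbo.vWhoWonTotals (WhoWon)
GO
-- read the tiny pre-aggregated result instead of scanning 25M rows
-- (NOEXPAND is needed on non-Enterprise editions to use the view's index)
SELECT WhoWon, Total FROM dbo.vWhoWonTotals WITH (NOEXPAND)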
You may be able to speed this by using window functions:
SELECT WhoWon, count(*) AS Total,
count(*) * 100.0 / sum(count(*)) over () as PercentWinner
FROM [test].[dbo].[7And11SidedDiceGame]
GROUP BY WhoWon;
This does not provide the separate print statement.
For performance, try an index on (WhoWon).

Passing a table to a stored procedure

I have a table with 20 billion rows. The table does not have any indexes, as it was created on the fly for a bulk insert operation. The table is used in a stored procedure which does the following:
Delete A
from master a
inner join (Select distinct Col from TableB ) b
on A.Col = B.Col
Insert into master
Select *
from tableB
group by col1,col2,col3
TableB is the one that has 20 billion rows. I don't want to execute the SP directly because it might take days to complete. Master is also a huge table and has a clustered index on Col.
Can I pass chunks of rows to the stored procedure and perform the operation? That might reduce the log file growth. If yes, how can I do that?
Should I create a clustered index on the table and execute the SP? That might be a little faster, but then again, creating a CI on a huge table might take 10 hours to complete.
Or is there any other way to perform this operation quickly?
I've used a method similar to this one. I'd recommend putting your DB into Bulk-Logged recovery mode instead of Full recovery mode if you can.
The blog entry is reproduced below to future-proof it.
Below is a technique used to transfer a large amount of records from
one table to another. This scales pretty well for a couple reasons.
First, this will not fill up the entire log prior to committing the
transaction. Rather, it will populate the table in chunks of 10,000
records. Second, it’s generally much quicker. You will have to play
around with the batch size. Sometimes it’s more efficient at 10,000,
sometimes 500,000, depending on the system.
If you do not need to insert into an existing table and just need a
copy of the table, it is better to do a SELECT INTO. However for this
example, we are inserting into an existing table.
Another trick you should do is to change the recovery model of the
database to simple. This way, there will be much less logging in the
transaction log.
The WITH (TABLOCK) below only works in SQL 2008.
DECLARE @BatchSize INT = 10000
WHILE 1 = 1
BEGIN
    INSERT INTO [dbo].[Destination] --WITH (TABLOCK) -- Uncomment for 2008
    (
        FirstName
        ,LastName
        ,EmailAddress
        ,PhoneNumber
    )
    SELECT TOP(@BatchSize)
        s.FirstName
        ,s.LastName
        ,s.EmailAddress
        ,s.PhoneNumber
    FROM [dbo].[SOURCE] s
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.Destination
        WHERE PersonID = s.PersonID
    )
    IF @@ROWCOUNT < @BatchSize BREAK
END
With the above example, it is important to have at least a non
clustered index on PersonID in both tables.
Another way to transfer records is to use multiple threads. Specifying
a range of records as such:
INSERT INTO [dbo].[Destination]
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 1 AND 5000
GO
INSERT INTO [dbo].[Destination]
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 5001 AND 10000
For super fast performance however, I’d recommend using SSIS.
Especially in SQL Server 2008. We recently transferred 17 million
records in 5 minutes with an SSIS package executed on the same server
as the two databases it transferred between.
SQL Server 2008
SQL Server 2008 has made changes with regard to its logging mechanism when inserting records. Previously, to do an insert that was minimally logged, you would have to perform a SELECT ... INTO.
Now, you can perform a minimally logged insert if you can lock the
table you are inserting into. The example below shows an example of
this. The exception to this rule is if you have a clustered index on
the table AND the table is not empty. If the table is empty and you
acquire a table lock and you have a clustered index, it will be
minimally logged. However if you have data in the table, the insert
will be logged. Now if you have a non clustered index on a heap and
you acquire a table lock then only the non clustered index will be
logged. It is always better to drop indexes prior to inserting
records.
To determine the amount of logging you can use the following statement
SELECT * FROM ::fn_dblog(NULL, NULL)
Credit for above goes to Derek Dieter at SQL Server Planet.
If you're dead set on passing a table to your stored procedure, you can pass a table-valued parameter to a stored procedure in SQL Server 2008. You might have better luck with some of the other approaches suggested, like partitioning. SELECT DISTINCT on a table with 20 billion rows might be part of the problem. I wonder if some very basic tuning wouldn't help, too:
Delete A
from master a
where exists (select 1 from TableB b where b.Col = a.Col)
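For completeness, a much-simplified sketch of the table-valued parameter route (the type, proc, and single Col column are assumptions; the real master/TableB have more columns): the caller fills the TVP with one chunk of distinct Col values at a time, and the proc works only on that chunk.
CREATE TYPE dbo.ColList AS TABLE (Col INT PRIMARY KEY) -- one row per distinct Col value
GO
CREATE PROCEDURE dbo.ProcessChunk
    @Batch dbo.ColList READONLY -- table-valued parameters must be READONLY
AS
BEGIN
    SET NOCOUNT ON
    -- replace existing master rows for this chunk
    DELETE a
    FROM master a
    INNER JOIN @Batch b ON a.Col = b.Col
    INSERT INTO master (Col)
    SELECT Col FROM @Batch
END
GO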

Avoiding Deadlocks in SQL with concurrent inserts

I have a process (Process A) which keeps adding records to a SQL table (Table A) via direct inserts using a stored procedure. It is a continuous process which reads requests and writes to the table. There is no pattern to how the requests come in; the maximum is around 100K requests per day.
Once the requests come in, I need to do some processing on them. This is currently done on user desktops (due to licensing issues). The way I am currently doing it is to run an executable (Process B) on each user's machine; as requests come into the table, this process reads them, does some work, and writes back to the same table. So the table is read and written by multiple processes. Process B has the following logic:
Get records that have not been processed by another user and are not currently being processed by another user.
Lock those records for this run by marking an isProcessing flag (C# LINQ through an SP). This is a single SQL transaction, i.e. locking the records and getting them for processing is wrapped in one transaction.
Process the records. This is where the calculation occurs. No DB work here.
Insert/update the records in Table A (C# LINQ through db.SubmitChanges). This is where the deadlock occurs, and it is a separate SQL transaction.
Occasionally, I see deadlocks when writing to the table. This is SQL Server 2008 (with isolation level Read Committed). Access to SQL is done by both stored procedures and direct C# LINQ queries.
The question is how to avoid the deadlocks. Is there a better overall architecture? Maybe, instead of all these child processes writing to the table independently, I should send the results to a service which queues them up and writes to the table? I know it is tough to answer without having all the code (there is just too much to show), but hopefully I have explained it, and I will be happy to answer any specific questions.
This is a representative table structure.
CREATE TABLE [dbo].[tbl_data](
[tbl_id] [nvarchar](50) NOT NULL,
[xml_data] [xml] NULL, -- where output will be stored
[error_message] [nvarchar](250) NULL,
[last_processed_date] [datetime] NULL,
[last_processed_by] [nvarchar](50) NULL,
[processing_id] [uniqueidentifier] NULL,
[processing_start_date] [datetime] NULL,
[create_date] [datetime] NOT NULL,
[processing_user] [nvarchar](50) NULL,
CONSTRAINT [PK_tbl_data] PRIMARY KEY CLUSTERED
(
[tbl_id] ASC,
[create_date] ASC
)
) ON [PRIMARY]
This is the proc that gets the data for processing.
begin tran
-- clear processing records that have been running for more than 6 minutes... they need to be reprocessed...
update tbl_data set processing_id = null, processing_start_date = null
where DATEDIFF(MINUTE, processing_start_date, GETDATE()) >=6
DECLARE @myid uniqueidentifier = NEWID();
declare @user_count int
-- The literal number below is the max any user can process. The last_processed_by and last_processed_date are updated when a record has been processed
select @user_count = 5000 - count(*) from tbl_data where last_processed_by = @user_name and DATEDIFF(dd, last_processed_date, GETDATE()) = 0
IF (@user_count > 1000)
    SET @user_count = 1000 -- no more than 1000 requests in each batch.
if (@user_count < 0)
    set @user_count = 0
--mark the records as being processed
update tbl_data set processing_id = @myid, processing_start_date = GETDATE(), processing_user = @user_name from tbl_data t1 join
(
    select top (@user_count) tbl_id from tbl_data
    where
    [enabled] = 1 and processing_id is null
    and isnull(DATEDIFF(dd, last_processed_date, GETDATE()), 1) > 0
    and isnull(DATEDIFF(dd, create_date, GETDATE()), 1) = 0
) t2 on t1.tbl_id = t2.tbl_id
-- get the records that have been marked
select tbl_id from tbl_data where processing_id = @myid
commit tran
My guess is you are deadlocking on pages as concurrent updates are attempted.
With the nature of the updates and inserts (a sliding timeframe window based on GETDATE()), it looks like a good partitioning scheme would be difficult to implement. Without one, I think your best option would be to implement an application-level lock (the SQL equivalent of a mutex) using sp_getapplock:
http://msdn.microsoft.com/en-us/library/ms189823(v=sql.100).aspx
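A rough sketch of how that could wrap the claim step in the proc above (the resource name and timeout are arbitrary; a transaction-owned lock is released automatically when the transaction ends):
BEGIN TRAN
DECLARE @lockResult INT
EXEC @lockResult = sp_getapplock
    @Resource = 'tbl_data_claim',
    @LockMode = 'Exclusive',
    @LockOwner = 'Transaction',
    @LockTimeout = 10000 -- wait up to 10 seconds for the lock
IF @lockResult >= 0
BEGIN
    -- ... the existing UPDATE that stamps rows with @myid goes here ...
    COMMIT TRAN
END
ELSE
BEGIN
    ROLLBACK TRAN -- could not get the lock; have the caller retry later
END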
I lack the time right now to analyze your workload and find a true fix, so I'm going to add a different kind of answer: you can safely retry deadlocked transactions. The problem can be handled by just re-running the entire transaction; a little delay may need to be inserted before a retry is attempted.
Be sure to rerun the entire transaction, though, including any control flow that happens in the application. In case of a retry the data that was already read might have changed.
If retries are rare this is not a performance problem. You should probably log when a retry happened.
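As a sketch only (the proc name is hypothetical), a T-SQL retry wrapper around the claim/update call could look like this; the same pattern applies equally in the C# caller:
DECLARE @attempt INT
SET @attempt = 1
WHILE @attempt <= 3
BEGIN
    BEGIN TRY
        EXEC dbo.usp_ClaimAndUpdateRecords -- the transactional work described above
        BREAK -- success, stop retrying
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() = 1205 AND @attempt < 3 -- 1205 = chosen as deadlock victim
        BEGIN
            SET @attempt = @attempt + 1
            WAITFOR DELAY '00:00:01' -- small back-off before the retry
        END
        ELSE
        BEGIN
            DECLARE @msg NVARCHAR(2048)
            SET @msg = ERROR_MESSAGE()
            RAISERROR(@msg, 16, 1) -- not a deadlock, or out of retries
            RETURN
        END
    END CATCH
END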

DELETE SQL with correlated subquery for table with 42 million rows?

I have a table cats with 42,795,120 rows.
Apparently this is a lot of rows. So when I do:
/* owner_cats is a many-to-many join table */
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
the query times out :(
(edit: I need to increase my CommandTimeout value, default is only 30 seconds)
I can't use TRUNCATE TABLE cats because I don't want to blow away cats from other owners.
I'm using SQL Server 2005 with "Recovery model" set to "Simple."
So, I thought about doing something like this (executing this SQL from an application btw):
DELETE TOP (25) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE TOP(50) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
My question is: what is the threshold of the number of rows I can DELETE in SQL Server 2005?
Or, if my approach is not optimal, please suggest a better approach. Thanks.
This post didn't help me enough:
SQL Server Efficiently dropping a group of rows with millions and millions of rows
EDIT (8/6/2010):
Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even id_cat is not a PK, because in my real life schema, it's not a unique field.
I will put indexes on:
cats.id_cat
owner_cats.id_cat
owner_cats.id_owner
I guess I'm still getting the hang of this data warehousing stuff, and obviously I need indexes on all the JOIN fields, right?
However, it takes hours for me to do this batch load process. I'm already doing it with SqlBulkCopy (in chunks, not 42 million rows all at once). I have some indexes and PKs. I read the following posts, which confirm my theory that the indexes are slowing down even a bulk copy:
SqlBulkCopy slow as molasses
What’s the fastest way to bulk insert a lot of data in SQL Server (C# client)
So I'm going to DROP my indexes before the copy and then re CREATE them when it's done.
Because of the long load times, it's going to take me awhile to test these suggestions. I'll report back with the results.
UPDATE (8/7/2010):
Tom suggested:
DELETE c
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)
And still with no indexes, for 42 million rows it took 13:21 min:sec versus 22:08 with the way described above. However, for 13 million rows it took 2:13 versus 2:10 my old way. It's a neat idea, but I still need to use indexes!
Update (8/8/2010):
Something is terribly wrong! Now, with the indexes on, my first delete query above took 1:09 hrs:min (yes, an hour!) versus 22:08 min:sec, and 13:21 min:sec versus 2:10 min:sec, for 42 mil rows and 13 mil rows respectively. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.
Update (8/9/2010):
Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes, versus 13:21 min:sec and 2:13 min:sec respectively without. Deletes are taking an order of magnitude longer on my database when I use indexes! I think I know why: my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! What am I doing wrong?
Update (8/10/2010):
For lack of any other options, I have come up with what I feel is a lackluster solution (hopefully temporary):
Increase timeout for database connection to 1 hour (CommandTimeout=60000; default was 30 sec)
Use Tom's query: DELETE FROM WHERE EXISTS (SELECT 1 ...) because it performed a little faster
DROP all indexes and PKs before running delete statement (???)
Run DELETE statement
CREATE all indexes and PKs
Seems crazy, but at least it's faster than using TRUNCATE and starting my load over from the beginning with the first owner_id, because one of my owner_ids takes 2:30 hrs:min to load, versus 17:22 min:sec for the delete process I just described with 42 mil rows. (Note: if my load process throws an exception, I start over for that owner_id, but I don't want to blow away the previous owner_ids, so I don't want to TRUNCATE the owner_cats table, which is why I'm trying to use DELETE.)
Anymore help would still be appreciated :)
There is no practical threshold. It depends on what your command timeout is set to on your connection.
Keep in mind that the time it takes to delete all of these rows is contingent upon:
The time it takes to find the rows of interest
The time it takes to log the transaction in the transaction log
The time it takes to delete the index entries of interest
The time it takes to delete the actual rows of interest
The time it takes to wait for other processes to stop using the table so you can acquire what in this case will most likely be an exclusive table lock
The last point may often be the most significant. Run sp_who2 in another query window to make sure that there isn't lock contention going on that prevents your command from executing.
Improperly configured SQL Servers will do poorly at this type of query. Transaction logs that are too small and/or share the same disks as the data files will often incur severe performance penalties when working with large numbers of rows.
As for a solution, well, like all things, it depends. Is this something you intend to be doing often? Depending on how many rows you have left, the fastest way might be to rebuild the table as another name and then rename it and recreate its constraints, all inside a transaction. If this is just an ad-hoc thing, make sure your ADO CommandTimeout is set high enough and you can just bear the cost of this big delete.
If the delete will remove "a significant number" of rows from the table, this can be an alternative to a DELETE: put the records to keep somewhere else, truncate the original table, put back the 'keepers'. Something like:
SELECT *
INTO #cats_to_keep
FROM cats
WHERE cats.id_cat NOT IN ( -- note the NOT
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
TRUNCATE TABLE cats
INSERT INTO cats
SELECT * FROM #cats_to_keep
Have you tried dropping the subquery and using a join instead?
DELETE cats
FROM
cats c
INNER JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
And if you have, have you also tried different join hints, e.g.
DELETE cats
FROM
cats c
INNER HASH JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
If you use an EXISTS rather than an IN, you should get much better performance. Try this:
DELETE c
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)
There's no threshold as such - you can DELETE all the rows from any table given enough transaction log space - which is where your query is most likely falling over. If you're getting some results from your DELETE TOP (n) PERCENT FROM cats WHERE ... then you can wrap it in a loop as below:
SELECT 1
WHILE @@ROWCOUNT <> 0
BEGIN
DELETE TOP (somevalue) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
END
As others have mentioned, when you delete 42 million rows, the db has to log 42 million deletions against the database, so the transaction log has to grow substantially. What you might try is to break the delete up into chunks. In the following query, I use the NTILE ranking function to break the rows into 100 buckets. If that is too slow, you can expand the number of buckets so that each delete is smaller. It will help tremendously if there is an index on owner_cats.id_owner, owner_cats.id_cat and cats.id_cat (which I assumed is the primary key and numeric).
Declare @Cats Cursor
Declare @CatId int --assuming an integer PK here
Declare @Start int
Declare @End int
Declare @GroupCount int
Set @GroupCount = 100
Set @Cats = Cursor Fast_Forward For
With CatHerd As
(
    Select cats.id_cat
        , NTile(@GroupCount) Over ( Order By cats.id_cat ) As Grp
    From cats
    Join owner_cats
        On owner_cats.id_cat = cats.id_cat
    Where owner_cats.id_owner = 1
)
Select Grp, Min(id_cat) As MinCat, Max(id_cat) As MaxCat
From CatHerd
Group By Grp
Open @Cats
Fetch Next From @Cats Into @CatId, @Start, @End
While @@Fetch_Status = 0
Begin
    Delete cats
    Where id_cat Between @Start And @End
    Fetch Next From @Cats Into @CatId, @Start, @End
End
Close @Cats
Deallocate @Cats
The notable catch with the above approach is that it is not transactional. Thus, if it fails on the 40th chunk, you will have deleted 40% of the rows and the other 60% will still exist.
Might be worth trying MERGE e.g.
MERGE INTO cats
USING owner_cats
ON cats.id_cat = owner_cats.id_cat
AND owner_cats.id_owner = 1
WHEN MATCHED THEN DELETE;
<Edit> (9/28/2011)
My answer performs basically the same way as Thomas' solution (Aug 6 '10). I missed it when I posted my answer because he uses an actual CURSOR, so I thought to myself "bad" because of the number of records involved. However, when I reread his answer just now, I realized that the WAY he uses the cursor is actually "good". Very clever. I just voted up his answer and will probably use his approach in the future. If you don't understand why, take a look at it again. If you still can't see it, post a comment on this answer and I will come back and try to explain in detail. I decided to leave my answer because someone may have a DBA who refuses to let them use an actual CURSOR regardless of how "good" it is. :-)
</Edit>
I realize that this question is a year old, but I recently had a similar situation. I was trying to do "bulk" updates to a large table with a join to a different, also fairly large, table. The problem was that the join resulted in so many "joined records" that it took too long to process and could have led to contention problems. Since this was a one-time update, I came up with the following "hack": I created a WHILE loop that went through the table to be updated and picked 50,000 records to update at a time. It looked something like this:
DECLARE @RecId bigint
DECLARE @NumRecs bigint
SET @NumRecs = (SELECT MAX(Id) FROM [TableToUpdate])
SET @RecId = 1
WHILE @RecId < @NumRecs
BEGIN
UPDATE [TableToUpdate]
SET UpdatedOn = GETDATE(),
SomeColumn = t2.[ColumnInTable2]
FROM [TableToUpdate] t
INNER JOIN [Table2] t2 ON t2.Name = t.DBAName
AND ISNULL(t.PhoneNumber,'') = t2.PhoneNumber
AND ISNULL(t.FaxNumber, '') = t2.FaxNumber
LEFT JOIN [Address] d ON d.AddressId = t.DbaAddressId
AND ISNULL(d.Address1,'') = t2.DBAAddress1
AND ISNULL(d.[State],'') = t2.DBAState
AND ISNULL(d.PostalCode,'') = t2.DBAPostalCode
    WHERE t.Id BETWEEN @RecId AND (@RecId + 49999)
    SET @RecId = @RecId + 50000
END
Nothing fancy, but it got the job done. Because it was only processing 50,000 records at a time, any locks that got created were short-lived. Also, the optimizer realized that it did not have to process the entire table, so it did a better job of picking an execution plan.
<Edit> (9/28/2011)
There is a HUGE caveat to the suggestion that has been mentioned here more than once, and is posted all over the web, regarding copying the "good" records to a different table, doing a TRUNCATE (or DROP and re-CREATE, or DROP and rename), and then repopulating the table.
You cannot do this if the table is the PK table in a PK-FK relationship (or other CONSTRAINT). Granted, you could DROP the relationship, do the clean up, and re-establish the relationship, but you would have to clean up the FK table, too. You can do that BEFORE re-establishing the relationship, which means more "down-time", or you can choose to not ENFORCE the CONSTRAINT on creation and clean up afterwards. I guess you could also clean up the FK table BEFORE you clean up the PK table. Bottom line is that you have to explicitly clean up the FK table, one way or the other.
My answer is a hybrid SET-based/quasi-CURSOR process. Another benefit of this method is that if the PK-FK relationship is setup to CASCADE DELETES you don't have to do the clean up I mention above because the server will take care of it for you. If your company/DBA discourage cascading deletes, you can ask that it be enabled only while this process is running and then disabled when it is finished. Depending on the permission levels of the account that runs the clean up, the ALTER statements to enable/disable cascading deletes can be tacked onto the beginning and the end of the SQL statement.
</Edit>
Bill Karwin's answer to another question applies to my situation also:
"If your DELETE is intended to eliminate a great majority of the rows in that table, one thing that people often do is copy just the rows you want to keep to a duplicate table, and then use DROP TABLE or TRUNCATE to wipe out the original table much more quickly."
Matt in this answer says it this way:
"If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename."
ammoQ in this answer (from the same question) recommends (paraphrased):
issue a table lock when deleting a large amount of rows
put indexes on any foreign key columns