How can I speed up this rather simple UPDATE query? It's been running for over 5 hours!
I'm basically replacing SourceID in a table by joining on a new table that houses the Old and New IDs. All these fields are VARCHAR(72) and must stay that way.
Pub_ArticleFaculty table has 8,354,474 rows (8.3 million). ArticleAuthorOldNew has 99,326,472 rows (99.3 million) and only the 2 fields you see below.
There are separate non-clustered indexes on all these fields. Is there a better way to write this query to make it run faster?
UPDATE PF
SET PF.SourceId = AAON.NewSourceId
FROM AA..Pub_ArticleFaculty PF WITH (NOLOCK)
INNER JOIN AA2..ArticleAuthorOldNew AAON WITH (NOLOCK)
ON AAON.OldFullSourceId = PF.SourceId
In my experience, looping your update so that it acts on a small number of rows each iteration is a good way to go. The ideal number of rows to update per iteration depends largely on your environment and the tables you're working with. I usually stick to around 1,000 - 10,000 rows per iteration.
Example
SET ROWCOUNT 1000 -- Set the batch size (number of rows to affect each time through the loop).
WHILE (1=1)
BEGIN
    UPDATE PF
    SET PF.SourceId = AAON.NewSourceId
    FROM AA..Pub_ArticleFaculty PF
        INNER JOIN AA2..ArticleAuthorOldNew AAON WITH (NOLOCK)
            ON AAON.OldFullSourceId = PF.SourceId
    WHERE PF.SourceId <> AAON.NewSourceId -- Only touch rows that still need changing (also guarantees the loop terminates).

    -- When no rows are affected, we're done!
    IF @@ROWCOUNT = 0
        BREAK
END
SET ROWCOUNT 0 -- Reset the batch size to the default (i.e. all rows).
GO
If you are resetting all or almost all of the values, then the update will be quite expensive. This is due to logging and the overhead for the updates.
One approach you can take instead is to insert into a temporary table, truncate the original table, then re-insert:
select pf.col1, pf.col2, . . . ,
coalesce(aaon.NewSourceId, pf.sourceid) as SourceId
into temp_pf
from AA..Pub_ArticleFaculty PF LEFT JOIN
AA2..ArticleAuthorOldNew AAON
on AAON.OldFullSourceId = PF.SourceId;
truncate table AA..Pub_ArticleFaculty;
insert into AA..Pub_ArticleFaculty
select * from temp_pf;
Note: You should either be sure that the columns in the original table match the temporary table or, better yet, list the columns explicitly in the insert.
I should also note that the major benefit comes when your recovery model is simple or bulk-logged. The reason is that logging for the truncate, select into, and insert . . . select is minimal (see here). This saving on logging can be very significant.
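If you re-insert into the original table rather than renaming the temporary table, a sketch with placeholder column names might look like this; the TABLOCK hint on the target is one of the usual conditions for the INSERT ... SELECT itself to be minimally logged under simple or bulk-logged recovery:
-- col1/col2 stand in for the real column list
INSERT INTO AA..Pub_ArticleFaculty WITH (TABLOCK) (col1, col2, SourceId)
SELECT col1, col2, SourceId
FROM temp_pf;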
I would, as sketched below:
Disable the index on PF.SourceId
Run the update
Then rebuild the index
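A minimal sketch of those steps, assuming the nonclustered index on SourceId is named IX_Pub_ArticleFaculty_SourceId (the name is an assumption):
ALTER INDEX IX_Pub_ArticleFaculty_SourceId ON AA..Pub_ArticleFaculty DISABLE;

-- run the UPDATE shown below

ALTER INDEX IX_Pub_ArticleFaculty_SourceId ON AA..Pub_ArticleFaculty REBUILD;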
I don't get the NOLOCK on the table you are updating
UPDATE PF
SET PF.SourceId = AAON.NewSourceId
FROM AA..Pub_ArticleFaculty PF
INNER JOIN AA2..ArticleAuthorOldNew AAON WITH (NOLOCK)
ON AAON.OldFullSourceId = PF.SourceId
AND PF.SourceId <> AAON.NewSourceId
I don't have enough points to comment on the question, so I'm adding this as an answer. Can you check the basics (a couple of sanity-check queries follow below)?
Are there any triggers on the table? If there are, there will be additional overhead when you are updating rows.
Are there indexes on the joining columns?
Otherwise, does the system perform well? Verify that the server has enough resources.
8 million records are not much; they should take no more than a minute or so to update if processed properly. An execution time of 5 hrs indicates there is a problem somewhere else.
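A couple of quick catalog checks, run in the AA database (a sketch; adjust the names as needed):
-- Any triggers on the target table?
SELECT name, is_disabled
FROM sys.triggers
WHERE parent_id = OBJECT_ID('Pub_ArticleFaculty');

-- Which indexes exist, and on which columns?
SELECT i.name AS index_name, c.name AS column_name
FROM sys.indexes i
    JOIN sys.index_columns ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id
    JOIN sys.columns c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('Pub_ArticleFaculty');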
Related
I have a database table that has 900 million records. I am in a situation where I need to update 4 different keys in that table, by joining it to a dimension and setting the key of the fact table to the key of the dimension. I have written 4 different SQL scripts (see example below) to perform the update, but the problem is that it takes far too long to execute. The query has been running for more than 20 hours and I'm not even sure how far it has got or how long it will take. Is there anything I can do to improve this so it only takes a few hours to complete? Would adding indexes help?
UPDATE f
SET f.ClientKey = c.ClientKey
FROM dbo.FactSales f
JOIN dbo.DimClient c
ON f.ClientId = c.ClientId
1. Script your foreign keys, then drop them.
2. Script the indexes on the updated columns (the ones that are not part of the join condition), then drop them.
3. Disable any triggers.
4. Stop all processes that can take locks (all of them, including SELECTs).
5. Update your keys.
6. Recreate your foreign keys and indexes, and re-enable the triggers (a sketch of steps 1-4 and 6 follows below).
7. Be happy.
And a comment on step 5: prepare a work table keyed on the destination table's primary key that holds only the new source values, and do the update in one statement. That way the join is cheaper and there is only one join.
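A rough sketch of steps 1-4 and 6; the trigger, index, and constraint names here are assumptions, so script the real ones from your database first:
-- 1 & 2: drop the scripted foreign keys and the indexes on the updated columns (names are assumptions)
ALTER TABLE dbo.FactSales DROP CONSTRAINT FK_FactSales_DimClient;
DROP INDEX IX_FactSales_ClientKey ON dbo.FactSales;

-- 3: disable triggers
ALTER TABLE dbo.FactSales DISABLE TRIGGER ALL;

-- 5: run the UPDATE here

-- 6: put everything back
ALTER TABLE dbo.FactSales ENABLE TRIGGER ALL;
CREATE INDEX IX_FactSales_ClientKey ON dbo.FactSales (ClientKey);
ALTER TABLE dbo.FactSales WITH CHECK ADD CONSTRAINT FK_FactSales_DimClient
    FOREIGN KEY (ClientKey) REFERENCES dbo.DimClient (ClientKey);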
You can use this to avoid filling up the transaction log:
select 1
while (@@ROWCOUNT > 0)
begin
    UPDATE TOP (100000) f
    SET f.ClientKey = c.ClientKey
    FROM dbo.FactSales f
        JOIN dbo.DimClient c
            ON f.ClientId = c.ClientId
            AND f.ClientKey != c.ClientKey
end
If you need to update 4 different keys, do them all at once (see the sketch below).
Most of the cost is acquiring locks.
Disable the index on f.ClientKey, run the update, and then rebuild it.
If you are sure DimClient is not going to change, you can read it WITH (NOLOCK), but you need to be sure.
If you are the only process that needs to update FactSales, take a TABLOCK, HOLDLOCK on it.
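A sketch of all four key updates in one pass; the second dimension, its columns, and the hints are assumptions you would verify against your own schema and workload:
UPDATE f
SET f.ClientKey  = c.ClientKey,
    f.ProductKey = p.ProductKey      -- ...and the other two keys the same way
FROM dbo.FactSales f WITH (TABLOCK, HOLDLOCK)
    JOIN dbo.DimClient  c WITH (NOLOCK) ON f.ClientId  = c.ClientId
    JOIN dbo.DimProduct p WITH (NOLOCK) ON f.ProductId = p.ProductId
WHERE f.ClientKey <> c.ClientKey
   OR f.ProductKey <> p.ProductKey;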
Create a new table with the correct values, add the indexes and constraints afterwards, then drop the existing table and rename the new one to the old name in one transaction, if that is possible (a sketch follows below).
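A sketch of that approach; the column list, the index/constraint step, and the assumption that nothing else needs the table during the swap are all yours to verify:
SELECT f.SalesId, f.Amount,          -- the columns you keep as-is (placeholders)
       c.ClientKey                   -- the corrected key(s)
INTO dbo.FactSales_new
FROM dbo.FactSales f
    JOIN dbo.DimClient c ON f.ClientId = c.ClientId;

-- add indexes and constraints on dbo.FactSales_new here

BEGIN TRAN;
    DROP TABLE dbo.FactSales;
    EXEC sp_rename 'dbo.FactSales_new', 'FactSales';
COMMIT;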
We're in need of yet another massive update which, as things stand, would require downtime because of the risk of extensive locking problems. Basically, we'd like to update hundreds of millions of rows during business hours.
Now, reducing the updates to manageable batch sizes of fewer than 5,000 rows helps, but I was wondering whether it is feasible to create a template that only reads and locks available rows, updates them, and moves on to the next batch. The idea is that this way we could patch some 95% of the data with minimal risk, after which the remaining set would be small enough to update all at once during a slower period while watching out for locks.
Yes, I know this sounds weird, but bear with me. How would one go about doing this?
I was thinking of something like this:
WHILE @@ROWCOUNT > 0
BEGIN
UPDATE TOP (5000) T
SET T.VALUE = 'ASD'
FROM MYTABLE T
JOIN (SELECT TOP 5000 S.ID
FROM MYTABLE S WITH (READPAST, UPDLOCK)
WHERE X = Y AND Z = W etc...) SRC
ON SRC.ID = T.ID
END
Any ideas? Basically, the last thing I want is for this query to get stuck behind other potentially long-running transactions, or to do the same to others in return. So what I'm looking for here is a script that will skip locked rows and update what it can, with minimal risk of getting involved in locks or deadlocks, so it can be safely run for the hour or so during uptime.
Just add WITH (READPAST) to the table for single-table updates:
UPDATE TOP (5000) MYTABLE WITH (READPAST)
SET VALUE = 'ASD'
WHERE X = Y AND Z = W etc...
If you are lucky enough to have a single table involved you can just add WITH (READPAST) and the UPDATE itself will add an exclusive lock on just the rows that get updated.
If more than one table is involved, it may become more complicated. Also be very careful with the WHERE clause, because it can add more load than expected: the first few batches are fine, but things get progressively worse if the whole table has to be scanned to find enough rows to satisfy the TOP. You might also want to consider a short lock timeout value for each batch (see the sketch below).
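A sketch combining READPAST batches with a short per-statement lock timeout; the 3-second value and the VALUE <> 'ASD' guard are assumptions:
SET LOCK_TIMEOUT 3000;   -- give up on any single lock wait after 3 seconds

SELECT 1;                -- prime @@ROWCOUNT so the loop starts
WHILE @@ROWCOUNT > 0
BEGIN
    UPDATE TOP (5000) MYTABLE WITH (READPAST)
    SET VALUE = 'ASD'
    WHERE X = Y AND Z = W          -- etc...
      AND VALUE <> 'ASD';          -- skip rows already done so the loop terminates
END

SET LOCK_TIMEOUT -1;     -- back to waiting indefinitely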
I have a table cats with 42,795,120 rows.
Apparently this is a lot of rows. So when I do:
/* owner_cats is a many-to-many join table */
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
the query times out :(
(edit: I need to increase my CommandTimeout value, default is only 30 seconds)
I can't use TRUNCATE TABLE cats because I don't want to blow away cats from other owners.
I'm using SQL Server 2005 with "Recovery model" set to "Simple."
So, I thought about doing something like this (executing this SQL from an application btw):
DELETE TOP (25) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE TOP(50) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
My question is: what is the threshold of the number of rows I can DELETE in SQL Server 2005?
Or, if my approach is not optimal, please suggest a better approach. Thanks.
This post didn't help me enough:
SQL Server Efficiently dropping a group of rows with millions and millions of rows
EDIT (8/6/2010):
Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even id_cat is not a PK, because in my real life schema, it's not a unique field.
I will put indexes on:
cats.id_cat
owner_cats.id_cat
owner_cats.id_owner
I guess I'm still getting the hang of this data warehousing, and obviously I need indexes on all the JOIN fields right?
However, it takes hours for me to do this batch load process. I'm already doing it as a SqlBulkCopy (in chunks, not 42 mil all at once). I have some indexes and PKs. I read the following posts, which confirm my theory that the indexes are slowing down even a bulk copy:
SqlBulkCopy slow as molasses
What’s the fastest way to bulk insert a lot of data in SQL Server (C# client)
So I'm going to DROP my indexes before the copy and then re CREATE them when it's done.
Because of the long load times, it's going to take me a while to test these suggestions. I'll report back with the results.
UPDATE (8/7/2010):
Tom suggested:
DELETE c
FROM cats c
WHERE EXISTS (SELECT 1
              FROM owner_cats o
              WHERE o.id_cat = c.id_cat
                AND o.id_owner = 1)
And still with no indexes, for 42 million rows it took 13:21 min:sec versus 22:08 with the way described above. However, for 13 million rows it took 2:13 versus 2:10 my old way. It's a neat idea, but I still need to use indexes!
Update (8/8/2010):
Something is terribly wrong! Now with the indexes on, my first delete query above took 1:09 hrs:min (yes, an hour!) versus 22:08 min:sec for 42 mil rows, and 13:21 min:sec versus 2:10 min:sec for 13 mil rows. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.
Update (8/9/2010):
Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes, versus 13:21 min:sec and 2:13 min:sec respectively without. Deletes on my database take an order of magnitude longer when I use indexes! I think I know why: my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! What am I doing wrong?
Update (8/10/2010):
For lack of any other options, I have come up with what I feel is a lackluster solution (hopefully temporary):
Increase timeout for database connection to 1 hour (CommandTimeout=60000; default was 30 sec)
Use Tom's query: DELETE ... WHERE EXISTS (SELECT 1 ...), because it performed a little faster
DROP all indexes and PKs before running delete statement (???)
Run DELETE statement
CREATE all indexes and PKs
Seems crazy, but at least it's faster than using TRUNCATE and starting my load over from the beginning with the first owner_id: one of my owner_ids takes 2:30 hrs:min to load, versus 17:22 min:sec for the delete process I just described with 42 mil rows. (Note: if my load process throws an exception, I start over for that owner_id, but I don't want to blow away the previous owner_ids, so I don't want to TRUNCATE the owner_cats table, which is why I'm trying to use DELETE.)
Any more help would still be appreciated :)
There is no practical threshold. It depends on what your command timeout is set to on your connection.
Keep in mind that the time it takes to delete all of these rows is contingent upon:
The time it takes to find the rows of interest
The time it takes to log the transaction in the transaction log
The time it takes to delete the index entries of interest
The time it takes to delete the actual rows of interest
The time it takes to wait for other processes to stop using the table so you can acquire what in this case will most likely be an exclusive table lock
The last point may often be the most significant. Do an sp_who2 command in another query window to make sure that there isn't lock contention going on, preventing your command from executing.
Improperly configured SQL Servers will do poorly at this type of query. Transaction logs that are too small and/or share the same disks as the data files will often incur severe performance penalties when working with large numbers of rows.
As for a solution, well, like all things, it depends. Is this something you intend to be doing often? Depending on how many rows you have left, the fastest way might be to rebuild the table as another name and then rename it and recreate its constraints, all inside a transaction. If this is just an ad-hoc thing, make sure your ADO CommandTimeout is set high enough and you can just bear the cost of this big delete.
If the delete will remove "a significant number" of rows from the table, this can be an alternative to a DELETE: put the records to keep somewhere else, truncate the original table, put back the 'keepers'. Something like:
SELECT *
INTO #cats_to_keep
FROM cats
WHERE cats.id_cat NOT IN ( -- note the NOT
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
TRUNCATE TABLE cats
INSERT INTO cats
SELECT * FROM #cats_to_keep
Have you tried skipping the subquery and using a join instead?
DELETE cats
FROM
cats c
INNER JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
And if you have, have you also tried different join hints, e.g.
DELETE cats
FROM
cats c
INNER HASH JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
If you use an EXISTS rather than an IN, you should get much better performance. Try this:
DELETE c
FROM cats c
WHERE EXISTS (SELECT 1
              FROM owner_cats o
              WHERE o.id_cat = c.id_cat
                AND o.id_owner = 1)
There's no threshold as such - you can DELETE all the rows from any table given enough transaction log space - which is where your query is most likely falling over. If you're getting some results from your DELETE TOP (n) PERCENT FROM cats WHERE ... then you can wrap it in a loop as below:
SELECT 1
WHILE @@ROWCOUNT <> 0
BEGIN
DELETE TOP (somevalue) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
END
As others have mentioned, when you delete 42 million rows, the db has to log 42 million deletions against the database. Thus, the transaction log has to grow substantially. What you might try is to break up the delete into chunks. In the following query, I use the NTILE ranking function to break the rows into 100 buckets. If that is too slow, you can expand the number of buckets so that each delete is smaller. It will help tremendously if there are indexes on owner_cats.id_owner, owner_cats.id_cat and cats.id_cat (which I assumed to be the primary key and numeric).
Declare @Cats Cursor
Declare @CatId int --assuming an integer PK here
Declare @Start int
Declare @End int
Declare @GroupCount int

Set @GroupCount = 100

Set @Cats = Cursor Fast_Forward For
    With CatHerd As
    (
        Select cats.id_cat
            , NTile(@GroupCount) Over ( Order By cats.id_cat ) As Grp
        From cats
            Join owner_cats
                On owner_cats.id_cat = cats.id_cat
        Where owner_cats.id_owner = 1
    )
    Select Grp, Min(id_cat) As MinCat, Max(id_cat) As MaxCat
    From CatHerd
    Group By Grp

Open @Cats
Fetch Next From @Cats Into @CatId, @Start, @End

While @@Fetch_Status = 0
Begin
    Delete cats
    Where id_cat Between @Start And @End

    Fetch Next From @Cats Into @CatId, @Start, @End
End

Close @Cats
Deallocate @Cats
The notable catch with the above approach is that it is not transactional. Thus, if it fails on the 40th chunk, you will have deleted 40% of the rows and the other 60% will still exist.
Might be worth trying MERGE e.g.
MERGE INTO cats
USING owner_cats
ON cats.id_cat = owner_cats.id_cat
AND owner_cats.id_owner = 1
WHEN MATCHED THEN DELETE;
<Edit> (9/28/2011)
My answer performs basically the same way as Thomas' solution (Aug 6 '10). I missed it when I posted my answer because he uses an actual CURSOR, so I thought to myself "bad" because of the number of records involved. However, when I reread his answer just now I realized that the WAY he uses the cursor is actually "good". Very clever. I just voted up his answer and will probably use his approach in the future. If you don't understand why, take a look at it again. If you still can't see it, post a comment on this answer and I will come back and try to explain in detail. I decided to leave my answer because someone may have a DBA who refuses to let them use an actual CURSOR regardless of how "good" it is. :-)
</Edit>
I realize that this question is a year old, but I recently had a similar situation. I was trying to do "bulk" updates to a large table with a join to a different, also fairly large, table. The problem was that the join resulted in so many "joined records" that it took too long to process and could have led to contention problems. Since this was a one-time update, I came up with the following "hack": I created a WHILE loop that went through the table to be updated and picked 50,000 records to update at a time. It looked something like this:
DECLARE @RecId bigint
DECLARE @NumRecs bigint

SET @NumRecs = (SELECT MAX(Id) FROM [TableToUpdate])
SET @RecId = 1

WHILE @RecId < @NumRecs
BEGIN
    UPDATE [TableToUpdate]
    SET UpdatedOn = GETDATE(),
        SomeColumn = t2.[ColumnInTable2]
    FROM [TableToUpdate] t
        INNER JOIN [Table2] t2 ON t2.Name = t.DBAName
            AND ISNULL(t.PhoneNumber,'') = t2.PhoneNumber
            AND ISNULL(t.FaxNumber, '') = t2.FaxNumber
        LEFT JOIN [Address] d ON d.AddressId = t.DbaAddressId
            AND ISNULL(d.Address1,'') = t2.DBAAddress1
            AND ISNULL(d.[State],'') = t2.DBAState
            AND ISNULL(d.PostalCode,'') = t2.DBAPostalCode
    WHERE t.Id BETWEEN @RecId AND (@RecId + 49999)

    SET @RecId = @RecId + 50000
END
Nothing fancy, but it got the job done. Because it was only processing 50,000 records at a time, any locks it created were short-lived. Also, the optimizer realized that it did not have to consider the entire table, so it did a better job of picking an execution plan.
<Edit> (9/28/2011)
There is a HUGE caveat to the suggestion that has been mentioned here more than once and is posted all over the place around the web regarding copying the "good" records to a different table, doing a TRUNCATE (or DROP and reCREATE, or DROP and rename) and then repopulating the table.
You cannot do this if the table is the PK table in a PK-FK relationship (or other CONSTRAINT). Granted, you could DROP the relationship, do the clean up, and re-establish the relationship, but you would have to clean up the FK table, too. You can do that BEFORE re-establishing the relationship, which means more "down-time", or you can choose to not ENFORCE the CONSTRAINT on creation and clean up afterwards. I guess you could also clean up the FK table BEFORE you clean up the PK table. Bottom line is that you have to explicitly clean up the FK table, one way or the other.
My answer is a hybrid SET-based/quasi-CURSOR process. Another benefit of this method is that if the PK-FK relationship is set up to CASCADE DELETES, you don't have to do the clean-up I mention above, because the server will take care of it for you. If your company/DBA discourages cascading deletes, you can ask that it be enabled only while this process is running and then disabled when it is finished. Depending on the permission levels of the account that runs the clean-up, the ALTER statements to enable/disable cascading deletes can be tacked onto the beginning and end of the SQL statement (see the sketch below).
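Note that cascading isn't switched on and off with a simple enable/disable; in practice you drop and recreate the foreign key with ON DELETE CASCADE for the duration of the clean-up. A sketch with an assumed constraint name (and assuming the referenced column is a declared key in your real schema):
ALTER TABLE owner_cats DROP CONSTRAINT FK_owner_cats_cats;
ALTER TABLE owner_cats ADD CONSTRAINT FK_owner_cats_cats
    FOREIGN KEY (id_cat) REFERENCES cats (id_cat) ON DELETE CASCADE;

-- run the chunked delete against cats here; the matching owner_cats rows go automatically

ALTER TABLE owner_cats DROP CONSTRAINT FK_owner_cats_cats;
ALTER TABLE owner_cats ADD CONSTRAINT FK_owner_cats_cats
    FOREIGN KEY (id_cat) REFERENCES cats (id_cat);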
</Edit>
Bill Karwin's answer to another question applies to my situation also:
"If your DELETE is intended to eliminate a great majority of the rows in that table, one thing that people often do is copy just the rows you want to keep to a duplicate table, and then use DROP TABLE or TRUNCATE to wipe out the original table much more quickly."
Matt in this answer says it this way:
"If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename."
ammoQ in this answer (from the same question) recommends (paraphrased):
issue a table lock when deleting a large amount of rows
put indexes on any foreign key columns
I have an SQL Server 2005 database, and I tried putting indexes on the appropriate fields in order to speed up the DELETE of records from a table with millions of rows (big_table has only 3 columns), but now the DELETE execution time is even longer! (1 hour versus 13 min for example)
I have a relationship between two tables, and the column that I filter my DELETE by is in the other table. For example:
DELETE FROM big_table
WHERE big_table.id_product IN (
SELECT small_table.id_product FROM small_table
WHERE small_table.id_category = 1)
Btw, I've also tried:
DELETE FROM big_table
WHERE EXISTS
(SELECT 1 FROM small_table
WHERE small_table.id_product = big_table.id_product
AND small_table.id_category = 1)
and while it seems to run slightly faster than the first, it's still a lot slower with the indexes than without.
I created indexes on these fields:
big_table.id_product
small_table.id_product
small_table.id_category
My .ldf file grows a lot during the DELETE.
Why are my DELETE queries slower when I have indexes on my tables? I thought they were supposed to run faster.
UPDATE
Okay, the consensus seems to be that indexes will slow down a huge DELETE because the index has to be updated. Although, I still don't understand why it can't DELETE all the rows at once and just update the index once at the end.
I was under the impression from some of my reading that indexes would speed up DELETE by making searches for fields in the WHERE clause faster.
Odetocode.com says:
"Indexes work just as well when searching for a record in DELETE and UPDATE commands as they do for SELECT statements."
But later in the article, it says that too many indexes can hurt performance.
Answers to bobs' questions:
55 million rows in table
42 million rows being deleted
Similar SELECT statement would not run (Exception of type 'System.OutOfMemoryException' was thrown)
I tried the following 2 queries:
SELECT * FROM big_table
WHERE big_table.id_product IN (
SELECT small_table.id_product FROM small_table
WHERE small_table.id_category = 1)
SELECT * FROM big_table
INNER JOIN small_table
ON small_table.id_product = big_table.id_product
WHERE small_table.id_category = 1
Both failed after running for 25 min with this error message from SQL Server 2005:
An error occurred while executing batch. Error message is: Exception of type 'System.OutOfMemoryException' was thrown.
The database server is an older dual-core Xeon machine with 7.5 GB of RAM. It's my toy test database :) so it's not running anything else.
Do I need to do something special with my indexes after I CREATE them to make them work properly?
Indexes make lookups faster - like the index at the back of a book.
Operations that change the data (like a DELETE) are slower, as they involve manipulating the indexes. Consider the same index at the back of the book. You have more work to do if you add, remove or change pages because you have to also update the index.
I agree with bobs' comment above: if you are deleting large volumes of data from large tables, removing the index entries can take a while on top of deleting the data itself; it's the cost of doing business, though. As the data is deleted, you are also forcing that index maintenance to happen.
With regard to the log file growth: if you aren't doing anything with your log files, you could switch to the Simple recovery model (see below), but I urge you to read up with your IT department on the impact that might have before you make the change.
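The switch itself is a one-liner (the database name here is a placeholder); just be clear about what it does to your point-in-time restore options first:
ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;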
If you need to do the delete in real time, a good workaround is often to flag the data as inactive, either directly on the table or in another table, and exclude that data from queries; then come back later and delete the data when the users aren't staring at an hourglass (a sketch follows below). There is a second reason for covering this: if you are deleting lots of data out of the table (which is what I'm supposing based on your log file issue), you will likely want to run an index defrag to reorganise the indexes; doing that out of hours is the way to go if you don't like users on the phone!
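A sketch of the flag-then-delete-later approach, assuming you can add (or already have) an IsActive flag on big_table:
-- During the day: flag instead of delete
UPDATE big_table
SET IsActive = 0
WHERE id_product IN (SELECT id_product FROM small_table WHERE id_category = 1);

-- Application queries exclude the flagged rows, e.g. ... WHERE IsActive = 1

-- Out of hours: physically delete, then tidy the indexes
DELETE FROM big_table WHERE IsActive = 0;
ALTER INDEX ALL ON big_table REORGANIZE;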
JohnB is deleting about 75% of the data. I think the following would have been a possible solution and probably one of the faster ones. Instead of deleting the data, create a new table and insert the data that you need to keep. Create the indexes on that new table after inserting the data. Now drop the old table and rename the new one to the same name as the old one.
The above of course assumes that sufficient disk space is available to temporarily store the duplicated data.
Try something like this to avoid a bulk delete (and thereby avoid log file growth):
declare @continue bit
set @continue = 1

-- keep deleting batches of 10,000 rows until no matching rows remain
while @continue = 1
begin
    set @continue = 0

    delete top (10000) u
    from <tablename> u WITH (READPAST)
    where <condition>

    if @@ROWCOUNT > 0
        set @continue = 1
end
You can also try the T-SQL extension to the DELETE syntax and check whether it improves performance:
DELETE FROM big_table
FROM big_table AS b
INNER JOIN small_table AS s ON (s.id_product = b.id_product)
WHERE s.id_category =1
I have a very large database (~100 GB) primarily consisting of two tables I want to reduce in size (both of which have approx. 50 million records). I have an archive DB set up on the same server with these two tables, using the same schema. I'm trying to determine the best conceptual way of going about removing the rows from the live DB and inserting them in the archive DB. In pseudocode this is what I'm doing now:
Declare @NextIDs Table(UniqueID)
Declare @twoYearsAgo = two years from today's date

Insert into @NextIDs
SELECT top 100 from myLargeTable Where myLargeTable.actionDate < @twoYearsAgo

Insert into myArchiveTable
<fields>
SELECT <fields>
FROM myLargeTable INNER JOIN @NextIDs on myLargeTable.UniqueID = @NextIDs.UniqueID

DELETE MyLargeTable
FROM MyLargeTable INNER JOIN @NextIDs on myLargeTable.UniqueID = @NextIDs.UniqueID
Right now this takes a horrifically slow 7 minutes to complete 1000 records. I've tested the DELETE and the INSERT, both taking approx. 3.5 minutes to complete, so it's not that one is drastically more inefficient than the other. Can anyone point out some optimization ideas here?
Thanks!
This is SQL Server 2000.
Edit: On the large table there is a clustered index on the ActionDate field. There are two other indexes, but neither are referenced in any of the queries. The Archive table has no indexes. On my test server, this is the only query hitting the SQL Server, so it should have plenty of processing power.
Code (this does a loop in batches of 1000 records at a time):
DECLARE @NextIDs TABLE(UniqueID int primary key)
DECLARE @TwoYearsAgo datetime
SELECT @TwoYearsAgo = DATEADD(d, (-2 * 365), GetDate())

WHILE EXISTS(SELECT TOP 1 UserName FROM [ISAdminDB].[dbo].[UserUnitAudit] WHERE [ActionDateTime] < @TwoYearsAgo)
BEGIN
BEGIN TRAN

--get all records to be archived
INSERT INTO @NextIDs(UniqueID)
    SELECT TOP 1000 UniqueID FROM [ISAdminDB].[dbo].[UserUnitAudit] WHERE [UserUnitAudit].[ActionDateTime] < @TwoYearsAgo

--insert into archive table
INSERT INTO [ISArchive].[dbo].[userunitaudit]
(<Fields>)
SELECT <Fields>
FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
    INNER JOIN @NextIDs AS b ON a.UniqueID = b.UniqueID

--remove from Admin DB
DELETE [ISAdminDB].[dbo].[UserUnitAudit]
FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
    INNER JOIN @NextIDs AS b ON a.UniqueID = b.UniqueID

DELETE FROM @NextIDs

COMMIT
END
You effectively have three selects which need to be run before your insert/delete commands are executed:
for the 1st insert:
SELECT top 100 from myLargeTable Where myLargeTable.actionDate < @twoYearsAgo
for the 2nd insert:
SELECT <fields> FROM myLargeTable INNER JOIN @NextIDs
    on myLargeTable.UniqueID = @NextIDs.UniqueID
for the delete:
(select *)
FROM MyLargeTable INNER JOIN @NextIDs on myLargeTable.UniqueID = @NextIDs.UniqueID
I'd try and optimize these and if they are all quick, then the indexes may be slowing down your writes. Some suggestions:
start Profiler and see what's happening with the reads/writes etc.
check index usage for all three statements.
try running the SELECTs returning only the PK, to see whether the delay is query execution or fetching the data (do you have, e.g., any full-text indexed fields, TEXT fields, etc.?)
Do you have an index on the source table for the column which you're using to filter the results? In this case, that would be the actionDate.
Also, it can often help to remove all indexes from the destination table before doing massive inserts, but in this case you're only doing 100's at a time.
You would also probably be better off doing this in larger batches. With one hundred at a time the overhead of the queries is going to end up dominating the costs/time.
Is there any other activity on the server during this time? Is there any blocking happening?
Hopefully this gives you a starting point.
If you can provide the exact code that you're using (maybe without the column names if there are privacy issues) then maybe someone can spot other ways to optimize.
EDIT:
Have you checked the query plan for your block of code? I've run into issues with table variables like this where the query optimizer couldn't figure out that the table variable would be small in size so it always tried to do a full table scan on the base table.
In my case it eventually became a moot point, so I'm not sure what the ultimate solution is. You can certainly add a condition on the actionDate to all of your select queries, which would at least minimize the effects of this.
The other option would be to use a normal table to hold the IDs.
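A sketch of that normal-table variant (a temp table here), so the optimizer has statistics to work with; the rest of the loop stays the same:
CREATE TABLE #NextIDs (UniqueID int PRIMARY KEY);

INSERT INTO #NextIDs (UniqueID)
SELECT TOP 1000 UniqueID
FROM [ISAdminDB].[dbo].[UserUnitAudit]
WHERE [ActionDateTime] < @TwoYearsAgo;

-- ...the archive INSERT and the DELETE then join to #NextIDs...

TRUNCATE TABLE #NextIDs;   -- instead of DELETE FROM the table variable each pass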
The INSERT and DELETE statements are joining on
[ISAdminDB].[dbo].[UserUnitAudit].UniqueID
If there's no index on this, and you indicate there isn't, you're doing two table scans. That's likely the source of the slowness, because a SQL Server table scan reads the entire table into a scratch table, searches the scratch table for matching rows, then drops the scratch table.
I think you need to add an index on UniqueID (a sketch follows below). The performance hit of maintaining it has got to be less than the cost of the table scans. And you can drop it after your archive is done.
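A sketch, with an assumed index name; run it in ISAdminDB:
CREATE INDEX IX_UserUnitAudit_UniqueID ON dbo.UserUnitAudit (UniqueID);

-- ...run the archive loop...

DROP INDEX UserUnitAudit.IX_UserUnitAudit_UniqueID;   -- SQL Server 2000 drop syntax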
Are there any indexes on myLargeTable.actionDate and .UniqueID?
Have you tried larger batch sizes than 100?
What is taking the most time? The INSERT, or the DELETE?
You might try doing this using the OUTPUT clause:
declare @items table (
    <field list just like the source table> )

delete top (100) from source_table
output deleted.first_field, deleted.second_field, etc.
into @items
where <conditions>

insert archive_table (<fields>)
select <fields> from @items
You might also be able to do this in a single statement, by doing the OUTPUT ... INTO directly into the archive table (eliminating the need for the table variable); a sketch follows.
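A sketch of the single-statement form; it needs SQL Server 2005 or later, the OUTPUT ... INTO target must have no triggers, foreign keys, or check constraints, and the column list here is a placeholder:
DELETE TOP (1000) FROM [ISAdminDB].[dbo].[UserUnitAudit]
OUTPUT deleted.UniqueID, deleted.UserName, deleted.ActionDateTime   -- list every column
    INTO [ISArchive].[dbo].[userunitaudit] (UniqueID, UserName, ActionDateTime)
WHERE [ActionDateTime] < @TwoYearsAgo;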