Best practice for a SQL Archiving Stored Procedure

I have a very large database (~100Gb) primarily consisting of two tables I want to reduce in size (both of which have approx. 50 million records). I have an archive DB set up on the same server with these two tables, using the same schema. I'm trying to determine the best conceptual way of going about removing the rows from the live db and inserting them in the archive DB. In pseudocode this is what I'm doing now:
Declare @NextIDs Table(UniqueID)
Declare @TwoYearsAgo = two years before today's date
Insert into @NextIDs
SELECT TOP 100 UniqueID FROM myLargeTable WHERE myLargeTable.actionDate < @TwoYearsAgo
Insert into myArchiveTable
<fields>
SELECT <fields>
FROM myLargeTable INNER JOIN @NextIDs ON myLargeTable.UniqueID = @NextIDs.UniqueID
DELETE myLargeTable
FROM myLargeTable INNER JOIN @NextIDs ON myLargeTable.UniqueID = @NextIDs.UniqueID
Right now this takes a horrifically slow 7 minutes to complete 1000 records. I've tested the DELETE and the INSERT, and each takes approx. 3.5 minutes to complete, so it's not that one is drastically more inefficient than the other. Can anyone point out some optimization ideas here?
Thanks!
This is SQL Server 2000.
Edit: On the large table there is a clustered index on the ActionDate field. There are two other indexes, but neither are referenced in any of the queries. The Archive table has no indexes. On my test server, this is the only query hitting the SQL Server, so it should have plenty of processing power.
Code (this does a loop in batches of 1000 records at a time):
DECLARE @NextIDs TABLE(UniqueID int primary key)
DECLARE @TwoYearsAgo datetime
SELECT @TwoYearsAgo = DATEADD(d, (-2 * 365), GETDATE())

WHILE EXISTS(SELECT TOP 1 UserName FROM [ISAdminDB].[dbo].[UserUnitAudit] WHERE [ActionDateTime] < @TwoYearsAgo)
BEGIN
BEGIN TRAN

--get all records to be archived
INSERT INTO @NextIDs(UniqueID)
SELECT TOP 1000 UniqueID FROM [ISAdminDB].[dbo].[UserUnitAudit] WHERE [UserUnitAudit].[ActionDateTime] < @TwoYearsAgo

--insert into archive table
INSERT INTO [ISArchive].[dbo].[UserUnitAudit]
(<Fields>)
SELECT <Fields>
FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
INNER JOIN @NextIDs AS b ON a.UniqueID = b.UniqueID

--remove from Admin DB
DELETE a
FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
INNER JOIN @NextIDs AS b ON a.UniqueID = b.UniqueID

DELETE FROM @NextIDs

COMMIT
END

You effectively have three selects which need to be run before your insert/delete commands are executed:
for the 1st insert:
SELECT TOP 100 UniqueID FROM myLargeTable WHERE myLargeTable.actionDate < @TwoYearsAgo
for the 2nd insert:
SELECT <fields> FROM myLargeTable INNER JOIN @NextIDs
ON myLargeTable.UniqueID = @NextIDs.UniqueID
for the delete:
(SELECT *)
FROM myLargeTable INNER JOIN @NextIDs ON myLargeTable.UniqueID = @NextIDs.UniqueID
I'd try to optimize these, and if they are all quick, then the indexes may be slowing down your writes. Some suggestions:
start Profiler and see what's happening with the reads/writes etc.
check index usage for all three statements.
try running the SELECTs returning only the PK, to see whether the delay is query execution or fetching the data (do you have, e.g., any full-text-indexed fields, TEXT fields, etc.?) - a quick sketch of this follows.
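For example, a minimal way to look at the read side in isolation is to wrap just the ID-selecting query in SET STATISTICS IO/TIME (a sketch against the table names used in the question; this works on SQL Server 2000 as well):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DECLARE @TwoYearsAgo datetime;
SELECT @TwoYearsAgo = DATEADD(d, -2 * 365, GETDATE());

-- just the ID-selecting query, no insert/delete
SELECT TOP 1000 UniqueID
FROM [ISAdminDB].[dbo].[UserUnitAudit]
WHERE ActionDateTime < @TwoYearsAgo;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;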

Do you have an index on the source table for the column which you're using to filter the results? In this case, that would be the actionDate.
Also, it can often help to remove all indexes from the destination table before doing massive inserts, but in this case you're only doing hundreds of rows at a time.
You would also probably be better off doing this in larger batches. With only one hundred rows at a time, the per-query overhead is going to end up dominating the cost and time.
Is there any other activity on the server during this time? Is there any blocking happening?
Hopefully this gives you a starting point.
If you can provide the exact code that you're using (maybe without the column names if there are privacy issues) then maybe someone can spot other ways to optimize.
EDIT:
Have you checked the query plan for your block of code? I've run into issues with table variables like this where the query optimizer couldn't figure out that the table variable would be small in size so it always tried to do a full table scan on the base table.
In my case it eventually became a moot point, so I'm not sure what the ultimate solution is. You can certainly add a condition on the actionDate to all of your select queries, which would at least minimize the effects of this.
The other option would be to use a normal table to hold the IDs.
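A rough sketch of that option (not the original poster's code): a real #temp table keyed on UniqueID, with the ActionDateTime predicate repeated on the join so the optimizer can take advantage of the clustered index:
DECLARE @TwoYearsAgo datetime;
SELECT @TwoYearsAgo = DATEADD(d, -2 * 365, GETDATE());

-- real temp table (with a PK and statistics) instead of a table variable
CREATE TABLE #NextIDs (UniqueID int PRIMARY KEY);

INSERT INTO #NextIDs (UniqueID)
SELECT TOP 1000 UniqueID
FROM [ISAdminDB].[dbo].[UserUnitAudit]
WHERE ActionDateTime < @TwoYearsAgo;

DELETE a
FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
INNER JOIN #NextIDs AS b ON a.UniqueID = b.UniqueID
WHERE a.ActionDateTime < @TwoYearsAgo;  -- redundant predicate, but helps the optimizer

DROP TABLE #NextIDs;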

The INSERT and DELETE statements are joining on
[ISAdminDB].[dbo].[UserUnitAudit].UniqueID
If there's no index on this, and you indicate there isn't, you're doing two table scans. That's likely the source of the slowness, because a SQL Server table scan reads the entire table into a scratch table, searches the scratch table for matching rows, then drops the scratch table.
I think you need to add an index on UniqueID. The performance hit for maintaining it has got to be less than table scans. And you can drop it after your archive is done.
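Something along these lines (the index name is illustrative):
-- support the UniqueID joins during archiving, then drop the index afterwards
CREATE NONCLUSTERED INDEX IX_UserUnitAudit_UniqueID
ON [ISAdminDB].[dbo].[UserUnitAudit] (UniqueID)

-- ... run the archive batches ...

DROP INDEX UserUnitAudit.IX_UserUnitAudit_UniqueID  -- old-style syntax, valid on SQL Server 2000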

Are there any indexes on myLargeTable.actionDate and .UniqueID?

Have you tried larger batch sizes than 100?
What is taking the most time? The INSERT, or the DELETE?

You might try doing this using the OUTPUT clause (SQL Server 2005 and later):
declare @items table (
<field list just like source table> )

delete top (100) from source_table
output deleted.first_field, deleted.second_field -- etc.
into @items
where <conditions>

insert archive_table (<fields>)
select <fields> from @items
You also might be able to do this in a single query, by doing OUTPUT ... INTO directly into the archive table (eliminating the need for the table variable).
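A minimal sketch of that single-statement form (SQL Server 2005+; it assumes the archive table's columns line up exactly with the source, and the OUTPUT ... INTO target must have no triggers or foreign keys):
DECLARE @TwoYearsAgo datetime;
SELECT @TwoYearsAgo = DATEADD(d, -2 * 365, GETDATE());

-- delete a batch and write the removed rows straight into the archive table
DELETE TOP (1000) FROM [ISAdminDB].[dbo].[UserUnitAudit]
OUTPUT deleted.*
INTO [ISArchive].[dbo].[UserUnitAudit]
WHERE ActionDateTime < @TwoYearsAgo;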

Related

Why does changing a WHERE value to a variable cause the query to be 4 times slower?

I am inserting data from the table "Tags" in the "Recovery" database into another table "Tags" in the "R3" database.
They all live on the same SQL Server instance on my laptop.
I have built the insert query, and because the Recovery..Tags table is around 180M records I decided to break it into smaller subsets (1 million records at a time).
Here is my query (let's call it Query A):
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between 13000001 and 14000000
It takes around 2 minutes, which is OK.
To make things a bit easier for me, I put the iID range in the WHERE clause into a variable, so my query looks like this (let's call it Query B):
declare @i int = 12
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between (1000000 * @i) + 1 and (@i+1)*1000000
but that causes the insert to become very slow (around 10 minutes).
So I tried Query A again and it took around 2 minutes.
Then I tried Query B again and it took around 8 minutes!
I am attaching the execution plan for each one (at a site that shows an analysis of the query plan) - Query A Plan and Query B Plan.
Any idea why this is happening?
And how to fix it?
The big difference in time is due to the very different plans that are being created to join Tags and Reps.
Fundamentally, in version A, it knows how much data is being extracted (a million rows) and it can design an efficient query for that. However, because you are using variables in B to define how much data is being imported, it has to define a more generic query - one that would work for 10 rows, a million rows, or a hundred million rows.
In the plans, here are the relevant sections of the query joining Tags and Reps...
... in A
... and B
Note that in A it takes just over a minute to do the join; in B it takes 6 and a half minutes.
The key thing that appears to take the time is that it does a table scan of the Tags table which takes 5:44 to complete. The plan has this as a table scan, as the next time you run the query you may want many more than 1 million rows.
A secondary issue is that the amount of data it reads (or expects to read) from Reps is also way out of whack. In A it expected to read 2 million rows and read 1421; in B it basically read them all (even though technically it probably only needed the same 1421).
I think you have two main approaches to fix this:
Look at indexing, to remove the table scan on Tags - ensure the indexes match what is needed and allow the query to do a scan on that index (it appears that the index at the top of @MikePetri's answer is what you need, or similar). This way, instead of doing a table scan, it can do an index scan which can start 'in the middle' of the data set (a table scan must start at either the start or end of the data set).
Separate this into two processes. The first process gets the relevant million rows from Tags and saves them in a temporary table. The second process uses the data in the temporary table to join to Reps (also try using OPTION (RECOMPILE) in the second query, so that it checks the temporary table's actual size before creating the plan).
You can even put an index or two (and/or Primary Key) on that temporary table to make it better for the next step.
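A rough sketch of that two-step variant (illustrative only, reusing the column list and batch boundaries from the question; this is not the poster's code):
DECLARE @i int = 12;

-- step 1: stage one batch of Tags rows in a temp table
SELECT T.*
INTO #TagBatch
FROM Recovery..Tags AS T
WHERE T.iID BETWEEN (1000000 * @i) + 1 AND (@i + 1) * 1000000;

CREATE CLUSTERED INDEX CIX_TagBatch_RepID ON #TagBatch (RepID);

-- step 2: join the small staged set to Reps and insert
INSERT INTO R3..Tags (iID, DT, RepID, Tag, xmiID, iBegin, iEnd, Confidence, Polarity,
Uncertainty, Conditional, Generic, HistoryOf, CodingScheme, Code, CUI, TUI,
PreferredText, ValueBegin, ValueEnd, Value, Deleted, sKey, RepType)
SELECT T.iID, T.DT, T.RepID, T.Tag, T.xmiID, T.iBegin, T.iEnd, T.Confidence, T.Polarity,
T.Uncertainty, T.Conditional, T.Generic, T.HistoryOf, T.CodingScheme, T.Code, T.CUI, T.TUI,
T.PreferredText, T.ValueBegin, T.ValueEnd, T.Value, T.Deleted, T.sKey, R.RepType
FROM #TagBatch AS T
INNER JOIN Recovery..Reps AS R ON T.RepID = R.RepID
OPTION (RECOMPILE);  -- plan is compiled knowing the temp table's actual size

DROP TABLE #TagBatch;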
The reason the first query is so much faster is that it went parallel. This means the cardinality estimator knew enough about the data it had to handle, and the query was large enough to tip the threshold for parallel execution. The engine then passed chunks of data to different processors to handle individually, which report back and repartition the streams.
With the value as a variable, it effectively becomes a scalar function evaluation, and a query cannot go parallel with a scalar function, because the value has to be determined before the cardinality estimator can figure out what to do with it. Therefore, it runs in a single thread, and is slower.
Some sort of looping mechanism might help. Create the included indexes to assist the engine in handling this request. You can probably find a better looping mechanism, since you are familiar with the identity ranges you care about, but this should get you in the right direction. Adjust for your needs.
With a loop like this, it commits the changes with each loop, so you aren't locking the table indefinitely.
USE Recovery;
GO
CREATE INDEX NCI_iID
ON Tags (iID)
INCLUDE (
DT
,RepID
,tag
,xmiID
,iBegin
,iEnd
,Confidence
,Polarity
,Uncertainty
,Conditional
,Generic
,HistoryOf
,CodingScheme
,Code
,CUI
,TUI
,PreferredText
,ValueBegin
,ValueEnd
,value
,Deleted
,sKey
);
GO
CREATE INDEX NCI_RepID ON Reps (RepID) INCLUDE (RepType);
USE R3;
GO
CREATE INDEX NCI_iID ON Tags (iID);
GO
DECLARE @RowsToProcess BIGINT
,@StepIncrement INT = 1000000;
SELECT @RowsToProcess = (
SELECT COUNT(1)
FROM Recovery..tags AS T
WHERE NOT EXISTS (
SELECT 1
FROM R3..Tags AS rt
WHERE T.iID = rt.iID
)
);
WHILE @RowsToProcess > 0
BEGIN
INSERT INTO R3..Tags
(
iID
,DT
,RepID
,Tag
,xmiID
,iBegin
,iEnd
,Confidence
,Polarity
,Uncertainty
,Conditional
,Generic
,HistoryOf
,CodingScheme
,Code
,CUI
,TUI
,PreferredText
,ValueBegin
,ValueEnd
,Value
,Deleted
,sKey
,RepType
)
SELECT TOP (@StepIncrement)
T.iID
,T.DT
,T.RepID
,T.Tag
,T.xmiID
,T.iBegin
,T.iEnd
,T.Confidence
,T.Polarity
,T.Uncertainty
,T.Conditional
,T.Generic
,T.HistoryOf
,T.CodingScheme
,T.Code
,T.CUI
,T.TUI
,T.PreferredText
,T.ValueBegin
,T.ValueEnd
,T.Value
,T.Deleted
,T.sKey
,R.RepType
FROM Recovery..tags AS T
INNER JOIN Recovery..Reps AS R ON T.RepID = R.RepID
WHERE NOT EXISTS (
SELECT 1
FROM R3..Tags AS rt
WHERE T.iID = rt.iID
)
ORDER BY
T.iID;
SET @RowsToProcess = @RowsToProcess - @StepIncrement;
END;

How to SELECT COUNT from a table currently being INSERTed into?

Hi, consider there is an INSERT statement running on a table TABLE_A which takes a long time; I would like to see how far it has progressed.
What I tried was to open up a new session (new query window in SSMS) while the long running statement is still in process, I ran the query
SELECT COUNT(1) FROM TABLE_A WITH (nolock)
hoping that it would return right away with the current number of rows every time I ran it, but in my test, even with (nolock), it only returned after the INSERT statement completed.
What have I missed? Do I add (nolock) to the INSERT statement as well? Or is this not achievable?
(Edit)
OK, I have found what I missed. If you first use CREATE TABLE TABLE_A and then INSERT INTO TABLE_A, the SELECT COUNT will work. If you use SELECT * INTO TABLE_A FROM xxx without first creating TABLE_A, then none of the following will work (not even sysindexes).
Short answer: You can't do this.
Longer answer: A single INSERT statement is an atomic operation. As such, the query has either inserted all the rows or has inserted none of them. Therefore you can't get a count of how far through it has progressed.
Even longer answer: Martin Smith has given you a way to achieve what you want. Whether you still want to do it that way is up to you of course. Personally I still prefer to insert in manageable batches if you really need to track progress of something like this. So I would rewrite the INSERT as multiple smaller statements. Depending on your implementation, that may be a trivial thing to do.
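For illustration, a batched version might look roughly like this (the table, column, and batch-size names are placeholders, assuming the source has a unique Id you can track progress by):
DECLARE @BatchSize int = 100000;

WHILE 1 = 1
BEGIN
    INSERT INTO TABLE_A (Id, Col1)
    SELECT TOP (@BatchSize) s.Id, s.Col1
    FROM SourceTable AS s          -- hypothetical source
    WHERE NOT EXISTS (SELECT 1 FROM TABLE_A a WHERE a.Id = s.Id)
    ORDER BY s.Id;

    IF @@ROWCOUNT < @BatchSize BREAK;   -- last batch done

    -- progress is now visible between batches:
    -- SELECT COUNT(*) FROM TABLE_A WITH (NOLOCK)
END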
If you are using SQL Server 2016 the live query statistics feature can allow you to see the progress of the insert in real time.
The below screenshot was taken while inserting 10 million rows into a table with a clustered index and a single nonclustered index.
It shows that the insert was 88% complete on the clustered index and this will be followed by a sort operator to get the values into non clustered index key order before inserting into the NCI. This is a blocking operator and the sort cannot output any rows until all input rows are consumed so the operators to the left of this are 0% done.
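(Related: on SQL Server 2014 and later, the same per-operator progress can also be read from another session via sys.dm_exec_query_profiles, provided the target session collects an actual plan or lightweight profiling is enabled. A sketch, with the session_id as a placeholder:)
-- per-operator progress of another session's running statement
SELECT node_id,
       physical_operator_name,
       row_count,
       estimate_row_count
FROM sys.dm_exec_query_profiles
WHERE session_id = 55   -- replace with the inserting session's id
ORDER BY node_id;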
With respect to your question on NOLOCK
It is trivial to test
Connection 1
USE tempdb
CREATE TABLE T2
(
X INT IDENTITY PRIMARY KEY,
F CHAR(8000)
);
WHILE NOT EXISTS(SELECT * FROM T2 WITH (NOLOCK))
LOOP:
SELECT COUNT(*) AS CountMethod FROM T2 WITH (NOLOCK);
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('T2');
RAISERROR ('Waiting for 10 seconds',0,1) WITH NOWAIT;
WAITFOR delay '00:00:10';
SELECT COUNT(*) AS CountMethod FROM T2 WITH (NOLOCK);
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('T2');
RAISERROR ('Waiting to drop table',0,1) WITH NOWAIT
DROP TABLE T2
Connection 2
use tempdb;
--Insert 2000 * 2000 = 4 million rows
WITH T
AS (SELECT TOP 2000 'x' AS x
FROM master..spt_values)
INSERT INTO T2
(F)
SELECT 'X'
FROM T v1
CROSS JOIN T v2
OPTION (MAXDOP 1)
Example Results - Showing row count increasing
SELECT queries with NOLOCK allow dirty reads. They don't actually take no locks and can still be blocked, they still need a SCH-S (schema stability) lock on the table (and on a heap it will also take a hobt lock).
The only thing incompatible with a SCH-S is a SCH-M (schema modification) lock. Presumably you also performed some DDL on the table in the same transaction (e.g. perhaps created it in the same tran)
For the use case of a large insert, where an approximate in flight result is fine, I generally just poll sysindexes as shown above to retrieve the count from metadata rather than actually counting the rows (non deprecated alternative DMVs are available)
When an insert has a wide update plan you can even see it inserting to the various indexes in turn that way.
If the table is created inside the inserting transaction this sysindexes query will still block though as the OBJECT_ID function won't return a result based on uncommitted data regardless of the isolation level in effect. It's sometimes possible to get around that by getting the object_id from sys.tables with nolock instead.
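A sketch of that metadata-based workaround, using the table name from the question:
-- approximate in-flight row count; avoids OBJECT_ID(), which blocks on
-- uncommitted DDL when the table was created in the inserting transaction
SELECT SUM(p.rows) AS approx_rows
FROM sys.tables AS t WITH (NOLOCK)
JOIN sys.partitions AS p WITH (NOLOCK)
    ON p.object_id = t.object_id
WHERE t.name = 'TABLE_A'
  AND p.index_id IN (0, 1);  -- heap or clustered index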
Use the query below to find the row count for any large, locked, or currently-being-inserted table in seconds. Just replace the table name you want to check.
SELECT
Total_Rows= SUM(st.row_count)
FROM
sys.dm_db_partition_stats st
WHERE
object_name(object_id) = 'TABLENAME' AND (index_id < 2)
For those who just need to see the record count while executing a long running INSERT script, I found you can see the current record count through SSMS by right clicking on the destination database table, -> Properties -> Storage, then view the "Row Count" value like so:
Close window and repeat to see the updated record count.

Passing a table to a stored procedure

I have a table with 20 billion rows. The table does not have any indexes, as it was created on the fly for a bulk insert operation. The table is used in a stored procedure which does the following operation:
Delete A
from master A
inner join (Select distinct Col from TableB) B
on A.Col = B.Col

Insert into master
Select *
from TableB
group by col1, col2, col3
TableB is the one which has 20 billion rows. I don't want to execute the SP directly because it might take days to complete. Master is also a huge table and has a clustered index on Col.
Can I pass chunks of rows to the stored procedure and perform the operation? This might reduce the log file growth. If yes, how can I do that?
Should I create a clustered index on the table and execute the SP? That might be a little faster, but then again creating a CI on a huge table might take 10 hours to complete.
Or is there any way to perform this operation fast?
I've used a method similar to this one. I'd recommend putting your DB into Bulk Logged recovery mode instead of Full recovery mode if you can.
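(A sketch of the recovery-model switch; the database name is a placeholder, and you would take a log backup afterwards since point-in-time restore isn't possible across the bulk-logged interval.)
-- Illustrative: switch to BULK_LOGGED for the load, then back to FULL
ALTER DATABASE YourDb SET RECOVERY BULK_LOGGED;

-- ... run the batched copy ...

ALTER DATABASE YourDb SET RECOVERY FULL;
BACKUP LOG YourDb TO DISK = 'D:\Backups\YourDb_log.trn';  -- path is illustrative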
Blog entry reproduced below to future proof it.
Below is a technique used to transfer a large amount of records from
one table to another. This scales pretty well for a couple reasons.
First, this will not fill up the entire log prior to committing the
transaction. Rather, it will populate the table in chunks of 10,000
records. Second, it’s generally much quicker. You will have to play
around with the batch size. Sometimes it’s more efficient at 10,000,
sometimes 500,000, depending on the system.
If you do not need to insert into an existing table and just need a
copy of the table, it is better to do a SELECT INTO. However for this
example, we are inserting into an existing table.
Another trick you should do is to change the recovery model of the
database to simple. This way, there will be much less logging in the
transaction log.
The WITH (TABLOCK) below only works in SQL 2008.
DECLARE @BatchSize INT = 10000
WHILE 1 = 1
BEGIN
INSERT INTO [dbo].[Destination] --WITH (TABLOCK) -- Uncomment for 2008
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE NOT EXISTS (
SELECT 1
FROM dbo.Destination
WHERE PersonID = s.PersonID
)
IF @@ROWCOUNT < @BatchSize BREAK
END
With the above example, it is important to have at least a non
clustered index on PersonID in both tables.
Another way to transfer records is to use multiple threads. Specifying
a range of records as such:
INSERT INTO [dbo].[Destination]
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 1 AND 5000
GO
INSERT INTO [dbo].[Destination]
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 5001 AND 10000
For super fast performance however, I’d recommend using SSIS.
Especially in SQL Server 2008. We recently transferred 17 million
records in 5 minutes with an SSIS package executed on the same server
as the two databases it transferred between.
SQL Server 2008
SQL Server 2008 has made changes with regard to its logging mechanism when inserting records. Previously, to do an insert
that was minimally logged, you would have to perform a SELECT.. INTO.
Now, you can perform a minimally logged insert if you can lock the
table you are inserting into. The example below shows an example of
this. The exception to this rule is if you have a clustered index on
the table AND the table is not empty. If the table is empty and you
acquire a table lock and you have a clustered index, it will be
minimally logged. However if you have data in the table, the insert
will be logged. Now if you have a non clustered index on a heap and
you acquire a table lock then only the non clustered index will be
logged. It is always better to drop indexes prior to inserting
records.
To determine the amount of logging you can use the following statement
SELECT * FROM ::fn_dblog(NULL, NULL)
Credit for above goes to Derek Dieter at SQL Server Planet.
If you're dead set on passing a table to your stored procedure, you can pass a table-valued parameter to a stored procedure in SQL Server 2008. You might have better luck with some other approaches suggested, like partitioning. Select distinct on a table with 20 billion rows might be part of the problem. I wonder if some very basic tuning wouldn't help, too:
Delete A
from master a
where exists (select 1 from TableB b where b.Col = a.Col)
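For completeness, a minimal table-valued parameter sketch (SQL Server 2008+); the type name, procedure name, column type, and chunk size here are illustrative, not from the original post:
-- a table type, a procedure that accepts it READONLY, and a call passing one chunk of keys
CREATE TYPE dbo.ColList AS TABLE (Col int PRIMARY KEY);
GO
CREATE PROCEDURE dbo.ArchiveChunk
    @Cols dbo.ColList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    DELETE m
    FROM master m
    WHERE EXISTS (SELECT 1 FROM @Cols c WHERE c.Col = m.Col);
END
GO
DECLARE @Chunk dbo.ColList;

INSERT INTO @Chunk (Col)
SELECT DISTINCT TOP (100000) Col FROM TableB;   -- one chunk of keys per call

EXEC dbo.ArchiveChunk @Cols = @Chunk;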

DELETE SQL with correlated subquery for table with 42 million rows?

I have a table cats with 42,795,120 rows.
Apparently this is a lot of rows. So when I do:
/* owner_cats is a many-to-many join table */
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
the query times out :(
(edit: I need to increase my CommandTimeout value, default is only 30 seconds)
I can't use TRUNCATE TABLE cats because I don't want to blow away cats from other owners.
I'm using SQL Server 2005 with "Recovery model" set to "Simple."
So, I thought about doing something like this (executing this SQL from an application btw):
DELETE TOP (25) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE TOP(50) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
My question is: what is the threshold of the number of rows I can DELETE in SQL Server 2005?
Or, if my approach is not optimal, please suggest a better approach. Thanks.
This post didn't help me enough:
SQL Server Efficiently dropping a group of rows with millions and millions of rows
EDIT (8/6/2010):
Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even id_cat is not a PK, because in my real life schema, it's not a unique field.
I will put indexes on:
cats.id_cat
owner_cats.id_cat
owner_cats.id_owner
I guess I'm still getting the hang of this data warehousing, and obviously I need indexes on all the JOIN fields right?
However, it takes hours for me to do this batch load process. I'm already doing it as a SqlBulkCopy (in chunks, not 42 mil all at once). I have some indexes and PKs. I read the following posts which confirms my theory that the indexes are slowing down even a bulk copy:
SqlBulkCopy slow as molasses
What’s the fastest way to bulk insert a lot of data in SQL Server (C# client)
So I'm going to DROP my indexes before the copy and then re CREATE them when it's done.
Because of the long load times, it's going to take me awhile to test these suggestions. I'll report back with the results.
UPDATE (8/7/2010):
Tom suggested:
DELETE c
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)
And still with no indexes, for 42 million rows it took 13:21 min:sec versus 22:08 with the way described above. However, for 13 million rows it took 2:13 versus 2:10 my old way. It's a neat idea, but I still need to use indexes!
Update (8/8/2010):
Something is terribly wrong! Now with the indexes on, my first delete query above took 1:09 hrs:min (yes, an hour!) versus 22:08 min:sec, and 13:21 min:sec versus 2:10 min:sec, for 42 mil rows and 13 mil rows respectively. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.
Update (8/9/2010):
Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes versus 13:21 min:sec and 2:13 min:sec respectively. Deletes are taking longer on my database when I use indexes by an order of magnitude! I think I know why, my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! What am I doing wrong?
Update (8/10/2010):
For lack of any other options, I have come up with what I feel is a lackluster solution (hopefully temporary):
Increase timeout for database connection to 1 hour (CommandTimeout=60000; default was 30 sec)
Use Tom's query: DELETE FROM WHERE EXISTS (SELECT 1 ...) because it performed a little faster
DROP all indexes and PKs before running delete statement (???)
Run DELETE statement
CREATE all indexes and PKs
Seems crazy, but at least it's faster than using TRUNCATE and starting over my load from the beginning with the first owner_id, because one of my owner_id takes 2:30 hrs:min to load versus 17:22 min:sec for the delete process I just described with 42 mil rows. (Note: if my load process throws an exception, I start over for that owner_id, but I don't want to blow away previous owner_id, so I don't want to TRUNCATE the owner_cats table, which is why I'm trying to use DELETE.)
Any more help would still be appreciated :)
There is no practical threshold. It depends on what your command timeout is set to on your connection.
Keep in mind that the time it takes to delete all of these rows is contingent upon:
The time it takes to find the rows of interest
The time it takes to log the transaction in the transaction log
The time it takes to delete the index entries of interest
The time it takes to delete the actual rows of interest
The time it takes to wait for other processes to stop using the table so you can acquire what in this case will most likely be an exclusive table lock
The last point may often be the most significant. Do an sp_who2 command in another query window to make sure that there isn't lock contention going on, preventing your command from executing.
Improperly configured SQL Servers will do poorly at this type of query. Transaction logs which are too small and/or share the same disks as the data files will often incur severe performance penalties when working with large numbers of rows.
As for a solution, well, like all things, it depends. Is this something you intend to be doing often? Depending on how many rows you have left, the fastest way might be to rebuild the table as another name and then rename it and recreate its constraints, all inside a transaction. If this is just an ad-hoc thing, make sure your ADO CommandTimeout is set high enough and you can just bear the cost of this big delete.
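A rough sketch of the rebuild-and-rename route (the new table name is illustrative, and indexes and constraints would need to be recreated to match the real schema):
-- keep only the rows you want, then swap the tables
BEGIN TRANSACTION;

SELECT c.*
INTO cats_keep
FROM cats c
WHERE NOT EXISTS (SELECT 1
                  FROM owner_cats o
                  WHERE o.id_cat = c.id_cat
                    AND o.id_owner = 1);

DROP TABLE cats;

EXEC sp_rename 'cats_keep', 'cats';

-- recreate indexes / constraints here to match the original table

COMMIT TRANSACTION;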
If the delete will remove "a significant number" of rows from the table, this can be an alternative to a DELETE: put the records to keep somewhere else, truncate the original table, put back the 'keepers'. Something like:
SELECT *
INTO #cats_to_keep
FROM cats
WHERE cats.id_cat NOT IN ( -- note the NOT
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
TRUNCATE TABLE cats
INSERT INTO cats
SELECT * FROM #cats_to_keep
Have you tried no Subquery and use a join instead?
DELETE cats
FROM
cats c
INNER JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
And if you have, have you also tried different join hints, e.g.
DELETE cats
FROM
cats c
INNER HASH JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
If you use an EXISTS rather than an IN, you should get much better performance. Try this:
DELETE c
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)
There's no threshold as such - you can DELETE all the rows from any table given enough transaction log space - which is where your query is most likely falling over. If you're getting some results from your DELETE TOP (n) PERCENT FROM cats WHERE ... then you can wrap it in a loop as below:
SELECT 1
WHILE @@ROWCOUNT <> 0
BEGIN
DELETE TOP (somevalue) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
END
As others have mentioned, when you delete 42 million rows, the db has to log 42 million deletions against the database. Thus, the transaction log has to grow substantially. What you might try is to break up the delete into chunks. In the following query, I use the NTile ranking function to break up the rows into 100 buckets. If that is too slow, you can expand the number of buckets so that each delete is smaller. It will help tremendously if there is an index on owner_cats.id_owner, owner_cats.id_cat and cats.id_cat (which I assumed is the primary key and numeric).
Declare @Cats Cursor
Declare @CatId int --assuming an integer PK here
Declare @Start int
Declare @End int
Declare @GroupCount int
Set @GroupCount = 100
Set @Cats = Cursor Fast_Forward For
With CatHerd As
(
Select cats.id_cat
, NTile(@GroupCount) Over ( Order By cats.id_cat ) As Grp
From cats
Join owner_cats
On owner_cats.id_cat = cats.id_cat
Where owner_cats.id_owner = 1
)
Select Grp, Min(id_cat) As MinCat, Max(id_cat) As MaxCat
From CatHerd
Group By Grp
Open @Cats
Fetch Next From @Cats Into @CatId, @Start, @End
While @@Fetch_Status = 0
Begin
Delete cats
Where id_cat Between @Start And @End
Fetch Next From @Cats Into @CatId, @Start, @End
End
Close @Cats
Deallocate @Cats
The notable catch with the above approach is that it is not transactional. Thus, if it fails on the 40th chunk, you will have deleted 40% of the rows and the other 60% will still exist.
Might be worth trying MERGE e.g.
MERGE INTO cats
USING owner_cats
ON cats.id_cat = owner_cats.id_cat
AND owner_cats.id_owner = 1
WHEN MATCHED THEN DELETE;
<Edit> (9/28/2011)
My answer performs basically the same way as Thomas' solution (Aug 6 '10). I missed it when I posted my answer because he uses an actual CURSOR, so I thought to myself "bad" because of the # of records involved. However, when I reread his answer just now I realize that the WAY he uses the cursor is actually "good". Very clever. I just voted up his answer and will probably use his approach in the future. If you don't understand why, take a look at it again. If you still can't see it, post a comment on this answer and I will come back and try to explain in detail. I decided to leave my answer because someone may have a DBA who refuses to let them use an actual CURSOR regardless of how "good" it is. :-)
</Edit>
I realize that this question is a year old but I recently had a similar situation. I was trying to do "bulk" updates to a large table with a join to a different table, also fairly large. The problem was that the join was resulting in so many "joined records" that it took too long to process and could have led to contention problems. Since this was a one-time update I came up with the following "hack." I created a WHILE LOOP that went through the table to be updated and picked 50,000 records to update at a time. It looked something like this:
DECLARE @RecId bigint
DECLARE @NumRecs bigint
SET @NumRecs = (SELECT MAX(Id) FROM [TableToUpdate])
SET @RecId = 1
WHILE @RecId < @NumRecs
BEGIN
UPDATE [TableToUpdate]
SET UpdatedOn = GETDATE(),
SomeColumn = t2.[ColumnInTable2]
FROM [TableToUpdate] t
INNER JOIN [Table2] t2 ON t2.Name = t.DBAName
AND ISNULL(t.PhoneNumber,'') = t2.PhoneNumber
AND ISNULL(t.FaxNumber, '') = t2.FaxNumber
LEFT JOIN [Address] d ON d.AddressId = t.DbaAddressId
AND ISNULL(d.Address1,'') = t2.DBAAddress1
AND ISNULL(d.[State],'') = t2.DBAState
AND ISNULL(d.PostalCode,'') = t2.DBAPostalCode
WHERE t.Id BETWEEN @RecId AND (@RecId + 49999)
SET @RecId = @RecId + 50000
END
Nothing fancy but it got the job done. Because it was only processing 50,000 records at a time, any locks that got created were short lived. Also, the optimizer realized that it did not have to do the entire table so it did a better job of picking an execution plan.
<Edit> (9/28/2011)
There is a HUGE caveat to the suggestion that has been mentioned here more than once and is posted all over the place around the web regarding copying the "good" records to a different table, doing a TRUNCATE (or DROP and reCREATE, or DROP and rename) and then repopulating the table.
You cannot do this if the table is the PK table in a PK-FK relationship (or other CONSTRAINT). Granted, you could DROP the relationship, do the clean up, and re-establish the relationship, but you would have to clean up the FK table, too. You can do that BEFORE re-establishing the relationship, which means more "down-time", or you can choose to not ENFORCE the CONSTRAINT on creation and clean up afterwards. I guess you could also clean up the FK table BEFORE you clean up the PK table. Bottom line is that you have to explicitly clean up the FK table, one way or the other.
My answer is a hybrid SET-based/quasi-CURSOR process. Another benefit of this method is that if the PK-FK relationship is setup to CASCADE DELETES you don't have to do the clean up I mention above because the server will take care of it for you. If your company/DBA discourage cascading deletes, you can ask that it be enabled only while this process is running and then disabled when it is finished. Depending on the permission levels of the account that runs the clean up, the ALTER statements to enable/disable cascading deletes can be tacked onto the beginning and the end of the SQL statement.
</Edit>
Bill Karwin's answer to another question applies to my situation also:
"If your DELETE is intended to eliminate a great majority of the rows in that table, one thing that people often do is copy just the rows you want to keep to a duplicate table, and then use DROP TABLE or TRUNCATE to wipe out the original table much more quickly."
Matt in this answer says it this way:
"If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename."
ammoQ in this answer (from the same question) recommends (paraphrased):
issue a table lock when deleting a large amount of rows
put indexes on any foreign key columns
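A short sketch combining those two suggestions (the index names are illustrative):
-- index the foreign-key columns used in the filter, then delete under a table lock
CREATE INDEX IX_owner_cats_id_owner_id_cat ON owner_cats (id_owner, id_cat);
CREATE INDEX IX_cats_id_cat ON cats (id_cat);

DELETE c
FROM cats c WITH (TABLOCK)
WHERE EXISTS (SELECT 1
              FROM owner_cats o
              WHERE o.id_cat = c.id_cat
                AND o.id_owner = 1);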

SQL stored procedure temporary table memory problem

We have the following simple stored procedure that runs as an overnight SQL Server Agent job. Usually it runs in 20 minutes, but recently the MatchEvent and MatchResult tables have grown to over 9 million rows each. This has resulted in the stored procedure taking over 2 hours to run, with all 8GB of memory on our SQL box being used up. This renders the database unavailable to the regular queries that are trying to access it.
I assume the problem is that the temp table is too large and is causing the memory and database unavailability issues.
How can I rewrite the stored procedure to make it more efficient and less memory intensive?
Note: I have edited the SQL to indicate that there is some condition affecting the initial SELECT statement. I had previously left this out for simplicity. Also, when the query runs, CPU usage is at 1-2%, but memory, as previously stated, is maxed out.
CREATE TABLE #tempMatchResult
(
matchId VARCHAR(50)
)
INSERT INTO #tempMatchResult
SELECT MatchId FROM MatchResult WHERE SOME_CONDITION
DELETE FROM MatchEvent WHERE
MatchId IN (SELECT MatchId FROM #tempMatchResult)
DELETE FROM MatchResult WHERE
MatchId In (SELECT MatchId FROM #tempMatchResult)
DROP TABLE #tempMatchResult
There's probably a lot of stuff going on here, and it's not all your query.
First, I agree with the other posters. Try to rewrite this without a temp table if at all possible.
But assuming that you need a temp table here, you have a BIG problem in that you have no PK defined on it. It will vastly expand the amount of time your queries take to run. Create your table like so instead:
CREATE TABLE #tempMatchResult (
matchId VARCHAR(50) NOT NULL PRIMARY KEY /* NOT NULL if at all possible */
);
INSERT INTO #tempMatchResult
SELECT DISTINCT MatchId FROM MatchResult;
Also, make sure that your TempDB is sized correctly. Your SQL server may very well be expanding the database file dynamically on you, causing your query to suck CPU and disk time. Also, make sure your transaction log is sized correctly, and that it is not auto-growing on you. Good luck.
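(A quick way to check the current file sizes and autogrow settings, e.g. for tempdb and your user database, is sys.master_files; a sketch, with the database name as a placeholder:)
SELECT DB_NAME(database_id) AS database_name,
       name AS logical_file_name,
       type_desc,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + '%'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth_setting
FROM sys.master_files
WHERE DB_NAME(database_id) IN ('tempdb', 'YourDatabase');  -- adjust the name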
Looking at the code above, why do you need a temp table?
DELETE FROM MatchEvent WHERE
MatchId IN (SELECT MatchId FROM MatchResult)
DELETE FROM MatchResult
-- OR Truncate can help here, if all the records are to be deleted anyways.
You probably want to process this piecewise in some way. (I assume the queries are a lot more complicated than you showed?) In that case, you'd want to try one of these:
Write your stored procedure to iterate over results. (Might still lock while processing.)
Repeatedly select the N first hits, eg LIMIT 100 and process those.
Divide work by scanning regions of the table separately, using something like WHERE M <= x AND x < N.
Run the "midnight job" more often. Seriously, running stuff like this every 5 mins instead can work wonders, especially if work increases non-linearly. (If not, you could still just get the work spread out over the hours of the day.)
In Postgres, I've had some success using conditional indices. They work magic by applying an index if certain conditions are met. This means that you can keep the many 'resolved' and the few unresolved rows in the same table, but still get that special index over just the unresolved ones. Ymmv.
Should be pointed out that this is where using databases gets interesting. You need to pay close attention to your indices and use EXPLAIN on your queries a lot.
(Oh, and remember, interesting is a good thing in your hobbies, but not at work.)
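(SQL Server's analogue of Postgres's conditional/partial indexes, from 2008 onwards, is the filtered index. A sketch, assuming a hypothetical Processed flag column that is not in the schema shown:)
-- index covering only the 'unresolved' rows
CREATE NONCLUSTERED INDEX IX_MatchResult_Unprocessed
    ON MatchResult (MatchId)
    WHERE Processed = 0;   -- hypothetical flag column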
First, indexes are a MUST here; see Dave M's answer.
Another approach that I will sometimes use when deleting very large data sets is creating a shadow table with all the data, recreating indexes, and then using sp_rename to switch it in. You have to be careful with transactions here, but depending on the amount of data being deleted this can be faster.
Note: If there is pressure on tempdb, consider using joins and not copying all the data into the temp table.
So for example
CREATE TABLE #tempMatchResult (
matchId VARCHAR(50) NOT NULL PRIMARY KEY /* NOT NULL if at all possible */
);
INSERT INTO #tempMatchResult
SELECT DISTINCT MatchId FROM MatchResult;
set transaction isolation level serializable
begin transaction
create table MatchEventT(columns... here)
insert into MatchEventT
select * from MatchEvent m
left join #tempMatchResult t on t.MatchId = m.MatchId
where t.MatchId is null
-- create all the indexes for MatchEvent
drop table MatchEvent
exec sp_rename 'MatchEventT', 'MatchEvent'
-- similar code for MatchResult
commit transaction
DROP TABLE #tempMatchResult
Avoid the temp table if possible
It's only using up memory.
You could try this:
DELETE MatchEvent
FROM MatchEvent e ,
MatchResult r
WHERE e.MatchId = r.MatchId
If you can't avoid a temp table
I'm going to stick my neck out here and say: you don't need an index on your temporary table because you want the temp table to be the smallest table in the equation and you want to table scan it (because all the rows are relevant). An index won't help you here.
Do small bits of work
Work on a few rows at a time.
This will probably slow down the execution, but it should free up resources.
- One row at a time
DECLARE @MatchId varchar(50)
SELECT @MatchId = min(MatchId) FROM MatchResult
WHILE @MatchId IS NOT NULL
BEGIN
DELETE MatchEvent
WHERE MatchId = @MatchId
SELECT @MatchId = min(MatchId) FROM MatchResult WHERE MatchId > @MatchId
END
- A few rows at a time
DECLARE @MinMatchId varchar(50)
CREATE TABLE #tmp ( MatchId Varchar(50) )
/* get list of lowest 1000 MatchIds: */
INSERT #tmp
SELECT TOP (1000) MatchId
FROM MatchResult
ORDER BY MatchId
WHILE EXISTS (SELECT 1 FROM #tmp)
BEGIN
DELETE MatchEvent
FROM MatchEvent e ,
#tmp t
WHERE e.MatchId = t.MatchId
/* get highest MatchId we've processed: */
SELECT @MinMatchId = MAX( MatchId ) FROM #tmp
/* get next 1000 MatchIds: */
TRUNCATE TABLE #tmp
INSERT #tmp
SELECT TOP (1000) MatchId
FROM MatchResult
WHERE MatchId > @MinMatchId
ORDER BY MatchId
END
This one deletes up to 1000 rows at a time.
The more rows you delete at a time, the more resources you will use but the faster it will tend to run (until you run out of resources!). You can experiment to find a more optimal value than 1000.
DELETE FROM MatchResult WHERE
MatchId In (SELECT MatchId FROM #tempMatchResult)
can be replaced with
DELETE FROM MatchResult WHERE SOME_CONDITION
Can you just turn cascading deletes on between matchresult and matchevent? Then you need only worry about identifying one set of data to delete, and let SQL take care of the other.
The alternative would be to make use of the OUTPUT clause, but that's definitely more fiddly.
Both of these would let you delete from both tables, but only have to state (and execute) your filter predicate once. This may still not be as performant as a batching approach as suggested by other posters, but worth considering. YMMV
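(For the cascading-delete route, a sketch of the foreign key setup; the constraint name is illustrative and it assumes MatchResult.MatchId is the primary key, which isn't shown in the question:)
-- MatchEvent rows disappear automatically when the parent MatchResult row is deleted
ALTER TABLE MatchEvent
    ADD CONSTRAINT FK_MatchEvent_MatchResult
    FOREIGN KEY (MatchId) REFERENCES MatchResult (MatchId)
    ON DELETE CASCADE;

-- then a single filtered delete on the parent is enough:
DELETE FROM MatchResult WHERE SOME_CONDITION;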