Batch deletion correctly formatted? - sql

I have multiple tables with millions of rows in them. To be safe and not overflow the transaction log, I am deleting in batches of 100,000 rows at a time. I first have to filter by date, then delete all rows older than a certain date.
To do this I am creating a table variable in my stored procedure which holds the IDs of the rows that need to be deleted.
I then insert into that table and delete the rows from the desired table using loops. This seems to run successfully, but it is extremely slow. Is this being done correctly? Is this the fastest way to do it?
DECLARE @FILL_ID_TABLE TABLE (
    FILL_ID varchar(16)
)

DECLARE @TODAYS_DATE date
SELECT @TODAYS_DATE = GETDATE()

-- This deletes all data older than 2 weeks ago from today
DECLARE @_DATE date
SET @_DATE = DATEADD(WEEK, -2, @TODAYS_DATE)

DECLARE @BatchSize int
SELECT @BatchSize = 100000

BEGIN TRAN FUTURE_TRAN
BEGIN TRY
    INSERT INTO @FILL_ID_TABLE
    SELECT DISTINCT ID
    FROM dbo.ID_TABLE
    WHERE CREATED < @_DATE

    SELECT @BatchSize = 100000
    WHILE @BatchSize <> 0
    BEGIN
        DELETE TOP (@BatchSize) FROM TABLE1
        OUTPUT DELETED.* INTO dbo.TABLE1_ARCHIVE
        WHERE ID IN (SELECT FILL_ID FROM @FILL_ID_TABLE)

        SET @BatchSize = @@ROWCOUNT
    END

    SELECT @BatchSize = 100000
    WHILE @BatchSize <> 0
    BEGIN
        DELETE TOP (@BatchSize) FROM TABLE2
        OUTPUT DELETED.* INTO dbo.TABLE2_ARCHIVE
        WHERE ID IN (SELECT FILL_ID FROM @FILL_ID_TABLE)

        SET @BatchSize = @@ROWCOUNT
    END

    PRINT 'Succeed'
    COMMIT TRANSACTION FUTURE_TRAN
END TRY
BEGIN CATCH
    PRINT 'Failed'
    ROLLBACK TRANSACTION FUTURE_TRAN
END CATCH

Try a join instead of a subquery:
DELETE TOP (@BatchSize) T1
OUTPUT DELETED.* INTO dbo.TABLE1_ARCHIVE
FROM TABLE1 AS T1
JOIN @FILL_ID_TABLE AS FIL ON FIL.FILL_ID = T1.ID
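For completeness, a sketch of that join-based delete inside the same batching loop, reusing the @BatchSize variable and @FILL_ID_TABLE table variable already declared in the question (illustrative only, not tested against the real schema):
SET @BatchSize = 100000
WHILE @BatchSize <> 0
BEGIN
    -- delete one batch, joining to the table variable instead of using IN (...)
    DELETE TOP (@BatchSize) T1
    OUTPUT DELETED.* INTO dbo.TABLE1_ARCHIVE
    FROM TABLE1 AS T1
    JOIN @FILL_ID_TABLE AS FIL ON FIL.FILL_ID = T1.ID

    -- @@ROWCOUNT drops to 0 once nothing is left to delete, ending the loop
    SET @BatchSize = @@ROWCOUNT
END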

Related

How to Batch INSERT SQL Server?

I am trying to batch insert rows from one table to another.
DECLARE @batch INT = 10000;

WHILE @batch > 0
BEGIN
    BEGIN TRANSACTION

    INSERT INTO table2
    SELECT TOP (@batch) *
    FROM table1

    SET @batch = @@ROWCOUNT

    COMMIT TRANSACTION
END
It runs on the first 10,000 rows and inserts them. Then I get the error message "Cannot insert duplicate key", which means it is trying to insert the same primary key, so I assume it is repeating the same batch. What logic am I missing here to loop through the batches? It is probably something simple, but I can't figure it out.
Can anyone help? Thanks.
Your code keeps inserting the same rows. You can avoid it by "paginating" your inserts:
DECLARE @batch INT = 10000;
DECLARE @page INT = 0
DECLARE @lastCount INT = 1

WHILE @lastCount > 0
BEGIN
    BEGIN TRANSACTION

    INSERT INTO table2
    SELECT col1, col2, ... -- list columns explicitly
    FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY YourPrimaryKey ) AS RowNum, *
           FROM table1
         ) AS RowConstrainedResult
    WHERE RowNum >= (@page * @batch) AND RowNum < ((@page + 1) * @batch)

    SET @lastCount = @@ROWCOUNT
    SET @page = @page + 1

    COMMIT TRANSACTION
END
You need some way to eliminate existing rows. You seem to have a primary key, so:
INSERT INTO table2
SELECT TOP (@batch) *
FROM table1 t1
WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.id = t1.id);

How do I update a SQL table in batches?

I already spent some time trying to figure this out, but I am still somewhat stuck and I can't really find the solution online, as I think I am missing the keywords.
I want to update a SQL table in batches, meaning I have a few million entries and want to update index 0-999, then 1000-1999, step by step, to avoid a huge database lock.
This is what I found:
DECLARE @Rows INT,
        @BatchSize INT;

SET @BatchSize = 2500;
SET @Rows = @BatchSize;

WHILE (@Rows = @BatchSize)
BEGIN
    UPDATE TOP (@BatchSize) db1
    SET db1.attr = db2.attr
    FROM DB1 db1
    LEFT JOIN DB2 db2
        ON db1.attr2 = db2.attr2

    SET @Rows = @@ROWCOUNT;
END;
I simplified my statement a little bit as you can see, but it should still be clear how I approached the problem.
However, this loops forever, and when looking at the output it changed far more rows than there are in the database.
I checked the same loop with a SELECT statement inside later on and found out that it simply selects the first @BatchSize rows of the table over and over, even though I thought it would progress through the table with every iteration.
How can I change this so it actually progresses by @BatchSize rows every iteration instead of targeting the same rows every time?
You need some limiting factor to decide which rows are hit each loop. Generally you will use an id field. There are lots of ways to approach it, but here is one way:
DECLARE @MinID int = 1;
DECLARE @MaxID int = 2500;
DECLARE @Rows int = 1;
DECLARE @BatchSize int = 2500;

WHILE (@Rows > 0)
BEGIN
    UPDATE db1
    SET db1.attr = db2.attr
    FROM DB1 db1
    LEFT JOIN DB2 db2 ON db1.attr2 = db2.attr2
    WHERE db1.ID BETWEEN @MinID AND @MaxID

    SET @Rows = @@ROWCOUNT
    SET @MinID = @MinID + @BatchSize
    SET @MaxID = @MaxID + @BatchSize
END
Replace db1.ID with whatever field works best in your table schema.
Note, your approach would work if you had some kind of WHERE clause on the update query that prevented the same rows from being returned.
E.g. UPDATE table SET id = 1 WHERE id = 2 won't pull the same rows in a second execution.
One way to do it is using a CTE with ROW_NUMBER():
DECLARE @BatchSize int = 2500,
        @LastRowUpdated int = 0,
        @Count int;

SELECT @Count = COUNT(*) FROM db1;

WHILE @LastRowUpdated < @Count
BEGIN
    ;WITH CTE AS
    (
        SELECT attr,
               attr2,
               ROW_NUMBER() OVER (ORDER BY attr, attr2) AS RN
        FROM db1
    )
    UPDATE c
    SET attr = db2.attr
    FROM CTE c
    LEFT JOIN DB2 db2 ON c.attr2 = db2.attr2
    WHERE c.RN > @LastRowUpdated
      AND c.RN <= @LastRowUpdated + @BatchSize

    SET @LastRowUpdated += @BatchSize
END
This will update 2500 records each step of the loop.
You are just updating the same rows. You need an AND db1.attr <> db2.attr condition so rows that have already been updated drop out of the join.
And why a LEFT JOIN? If you really want to assign NULL values, then use a separate update.
DECLARE @Rows INT,
        @BatchSize INT;

SET @BatchSize = 2500;
SET @Rows = @BatchSize;

WHILE (@Rows = @BatchSize)
BEGIN
    UPDATE TOP (@BatchSize) db1
    SET db1.attr = db2.attr
    FROM DB1 db1
    JOIN DB2 db2
        ON db1.attr2 = db2.attr2
       AND db1.attr <> db2.attr

    SET @Rows = @@ROWCOUNT;
END;
And you can do this:
SELECT 1;

WHILE (@@ROWCOUNT > 0)
BEGIN
    UPDATE TOP (2000) db1
    SET db1.attr = db2.attr
    FROM DB1 db1
    JOIN DB2 db2
        ON db1.attr2 = db2.attr2
       AND db1.attr <> db2.attr
END;

How to optimize this T-SQL script by avoiding the loop?

I use the following SQL query to update MyTable. The code takes between 5 and 15 minutes to update MyTable as long as ROWS <= 100,000,000, but when ROWS > 100,000,000 it takes exponentially longer. How can I change this code to be set-based instead of using a WHILE loop?
DECLARE @startTime DATETIME
DECLARE @batchSize INT
DECLARE @iterationCount INT
DECLARE @i INT
DECLARE @from INT
DECLARE @to INT

SET @batchSize = 10000
SET @i = 0

SELECT @iterationCount = COUNT(*) / @batchSize
FROM MyTable
WHERE LitraID = 8175
  AND id BETWEEN 100000000 AND 300000000

WHILE @i <= @iterationCount
BEGIN
    BEGIN TRANSACTION T

    SET @startTime = GETDATE()
    SET @from = @i * @batchSize
    SET @to = (@i + 1) * @batchSize - 1

    ;WITH data AS (
        SELECT DoorsReleased, ROW_NUMBER() OVER (ORDER BY id) AS Row
        FROM MyTable
        WHERE LitraID = 8175
          AND id BETWEEN 100000000 AND 300000000
    )
    UPDATE data
    SET DoorsReleased = ~DoorsReleased
    WHERE Row BETWEEN @from AND @to

    SET @i = @i + 1

    COMMIT TRANSACTION T
END
One of your issues is that the SELECT inside the loop fetches all records for LitraID = 8175, assigns row numbers, and only then filters in the UPDATE statement. This happens on every iteration.
One way around this would be to get all of the ids for the update before entering the loop and store them in a temporary table. Then you can write a similar query to the one you have, but joining to this table of ids, as sketched below.
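For illustration, a minimal sketch of that approach (the #idsToUpdate temp table name and layout are my own, not from the original post):
-- collect the target ids once, outside the loop
SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS Row
INTO #idsToUpdate
FROM MyTable
WHERE LitraID = 8175
  AND id BETWEEN 100000000 AND 300000000

DECLARE @from INT = 0, @batchSize INT = 10000

WHILE EXISTS (SELECT 1 FROM #idsToUpdate WHERE Row > @from)
BEGIN
    -- flip one batch of rows, driven by the precomputed id list
    UPDATE m
    SET DoorsReleased = ~DoorsReleased
    FROM MyTable m
    JOIN #idsToUpdate i ON i.id = m.id
    WHERE i.Row > @from
      AND i.Row <= @from + @batchSize

    SET @from = @from + @batchSize
END

DROP TABLE #idsToUpdate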
However, there is an even easier way if you know approximately how many records have LitraID = 8175 and if they are spread throughout the table, not bunched together with similar ids.
DECLARE @batchSize INT
DECLARE @minId INT
DECLARE @maxId INT

SET @batchSize = 10000 -- adjust according to how frequently LitraID = 8175; use larger numbers if infrequent
SET @minId = 100000000

WHILE @minId <= 300000000
BEGIN
    SET @maxId = @minId + @batchSize - 1
    IF @maxId > 300000000
    BEGIN
        SET @maxId = 300000000
    END

    BEGIN TRANSACTION T

    UPDATE MyTable
    SET DoorsReleased = ~DoorsReleased
    WHERE id BETWEEN @minId AND @maxId
      AND LitraID = 8175

    COMMIT TRANSACTION T

    SET @minId = @maxId + 1
END
This will use the value of id to control the loop, meaning you don't need the extra step to calculate @iterationCount. It uses small batches so that the table isn't locked for long periods, it has no unnecessary SELECT statements, and the WHERE clause in the update is efficient assuming id has an index.
It won't update exactly the same number of records in every transaction, but there's no reason it needs to.
This will eliminate the loop:
UPDATE MyTable
SET DoorsReleased = ~DoorsReleased
WHERE LitraID = 8175
  AND id BETWEEN 100000000 AND 300000000
  AND DoorsReleased IS NOT NULL -- if DoorsReleased is nullable
  -- AND DoorsReleased <> ~DoorsReleased
If you are set on looping, note that the version below will NOT work. I originally thought ~ was part of the column name, but it is the bitwise NOT operator, so the <> condition is true for every row and the loop just keeps flipping the same rows.
SELECT 1;

WHILE (@@ROWCOUNT > 0)
BEGIN
    UPDATE TOP (100000) MyTable
    SET DoorsReleased = ~DoorsReleased
    WHERE LitraID = 8175
      AND id BETWEEN 100000000 AND 300000000
      AND (DoorsReleased <> ~DoorsReleased
           OR (DoorsReleased IS NULL AND ~DoorsReleased IS NOT NULL))
END
Inside a transaction I don't think looping has value, as the transaction log cannot clear. And a batch size of 10,000 is small.
As stated in a comment, if you want to loop, then try using id rather than ROW_NUMBER(); all those loops are expensive.
You might also be able to use OFFSET; a sketch follows.
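A hedged sketch of what the OFFSET suggestion could look like; this interpretation (paging through the matching ids with OFFSET ... FETCH) is an assumption, and it requires SQL Server 2012 or later:
DECLARE @batchSize INT = 100000, @offset INT = 0, @rows INT = 1

WHILE @rows > 0
BEGIN
    -- update the next page of matching ids, ordered by id for a stable paging sequence
    UPDATE m
    SET DoorsReleased = ~DoorsReleased
    FROM MyTable m
    JOIN (SELECT id
          FROM MyTable
          WHERE LitraID = 8175
            AND id BETWEEN 100000000 AND 300000000
          ORDER BY id
          OFFSET @offset ROWS FETCH NEXT @batchSize ROWS ONLY) AS pg
      ON pg.id = m.id

    SET @rows = @@ROWCOUNT
    SET @offset = @offset + @batchSize
END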

Efficient SQL Server stored procedure

I am using SQL Server 2008 and running the following stored procedure that needs to "clean" a 70 million row table by moving about 50 million rows to another table; id_col is an integer (primary identity key).
Based on the last run I made, it is working, but it is expected to take about 200 days:
SET NOCOUNT ON

-- define the last ID handled
DECLARE @LastID integer
SET @LastID = 0

DECLARE @tempDate datetime
SET @tempDate = DATEADD(dd, -20, GETDATE())

-- define the ID to be handled now
DECLARE @IDToHandle integer
DECLARE @iCounter integer
DECLARE @watch1 nvarchar(50)
DECLARE @watch2 nvarchar(50)
SET @iCounter = 0

-- select the next to handle
SELECT TOP 1 @IDToHandle = id_col
FROM MAIN_TABLE
WHERE id_col > @LastID and DATEDIFF(DD, someDateCol, otherDateCol) < 1
  and DATEDIFF(dd, someDateCol, @tempDate) > 0
  and (some_other_int_col = 1745 or some_other_int_col = 1548 or some_other_int_col = 4785)
ORDER BY id_col

-- as long as we have s......
WHILE @IDToHandle IS NOT NULL
BEGIN
    IF ((select count(1) from SOME_OTHER_TABLE_THAT_CONTAINS_20k_ROWS where some_int_col = @IDToHandle) = 0
        and (select count(1) from A_70k_rows_table where some_int_col = @IDToHandle) = 0)
    BEGIN
        INSERT INTO SECONDERY_TABLE
        SELECT col1,col2,col3.....
        FROM MAIN_TABLE WHERE id_col = @IDToHandle

        EXEC [dbo].[DeleteByID] @ID = @IDToHandle -- deletes the rows from 2 other tables that are related to the MAIN_TABLE and then from the MAIN_TABLE

        SET @iCounter = @iCounter + 1
    END

    IF (@iCounter % 1000 = 0)
    BEGIN
        SET @watch1 = 'iCounter - ' + CAST(@iCounter AS VARCHAR)
        SET @watch2 = 'IDToHandle - ' + CAST(@IDToHandle AS VARCHAR)
        RAISERROR (@watch1, 10, 1) WITH NOWAIT
        RAISERROR (@watch2, 10, 1) WITH NOWAIT
    END

    -- set the last handled to the one we just handled
    SET @LastID = @IDToHandle
    SET @IDToHandle = NULL

    -- select the next to handle
    SELECT TOP 1 @IDToHandle = id_col
    FROM MAIN_TABLE
    WHERE id_col > @LastID and DATEDIFF(DD, someDateCol, otherDateCol) < 1
      and DATEDIFF(dd, someDateCol, @tempDate) > 0
      and (some_other_int_col = 1745 or some_other_int_col = 1548 or some_other_int_col = 4785)
    ORDER BY id_col
END
Any ideas or directions to improve this procedure's runtime will be welcome.
Yes, try this:
Declare @Ids Table (id int Primary Key Not Null)
Declare @tempDate datetime = DateAdd(dd, -20, GetDate()) -- same cutoff as in the original procedure

Insert @Ids(id)
Select id_col
From MAIN_TABLE m
Where someDateCol >= otherDateCol
  And someDateCol < @tempDate -- If there are times in these datetime fields,
                              -- then you may need to modify this condition.
  And some_other_int_col In (1745, 1548, 4785)
  And Not Exists (Select * From SOME_OTHER_TABLE_THAT_CONTAINS_20k_ROWS
                  Where some_int_col = m.id_col)
  And Not Exists (Select * From A_70k_rows_table
                  Where some_int_col = m.id_col)

Select id From @Ids -- this is to confirm the above code generates the correct list of Ids
return -- this line stops here (no inserts/deletes) until you have verified @Ids is correct
-- Once you have verified that @Ids above is correctly populated,
-- then delete or comment out the select and return lines above so the insert runs.

Begin Transaction

Delete ot -- eliminate row-by-row call to second stored proc
From OtherTable ot
Join MAIN_TABLE m On m.id_col = ot.FKCol
Join @Ids i On i.Id = m.id_col

Insert SECONDERY_TABLE (col1, col2, etc.)
Select col1,col2,col3.....
From MAIN_TABLE m Join @Ids i On i.Id = m.id_col

Delete m -- eliminate row-by-row call to second stored proc
From MAIN_TABLE m
Join @Ids i On i.Id = m.id_col

Commit Transaction
Explanation:
You had numerous filtering conditions that were not SARGable, i.e., they would force a complete table scan on every iteration of your loop instead of being able to use an existing index. Always try to avoid filter conditions that apply processing logic to a table column value before comparing it to some other value; this eliminates the opportunity for the query optimizer to use an index.
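For instance, a small sketch using the question's own columns (the two predicates are approximately equivalent; the time-of-day caveat from the code comment above still applies):
-- Not SARGable: DATEDIFF wraps the column, so an index on someDateCol cannot be seeked
SELECT id_col FROM MAIN_TABLE WHERE DATEDIFF(dd, someDateCol, @tempDate) > 0

-- SARGable: the bare column is compared to a value, so an index seek is possible
SELECT id_col FROM MAIN_TABLE WHERE someDateCol < @tempDate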
You were also executing the inserts one at a time. It is far better to generate the list of PK Ids that need to be processed (all at once) and then do all of the inserts at once, in one statement.

SQL: Query timeout expired

I have a simple query that updates a table (30 columns and about 150,000 rows).
For example:
UPDATE tblSomeTable SET F3 = @F3 WHERE F1 = @F1
This query affects about 2,500 rows.
tblSomeTable has the following trigger:
ALTER TRIGGER [dbo].[trg_tblSomeTable]
ON [dbo].[tblSomeTable]
AFTER INSERT, DELETE, UPDATE
AS
BEGIN
    declare @operationType nvarchar(1)
    declare @createDate datetime
    declare @UpdatedColumnsMask varbinary(500) = COLUMNS_UPDATED()

    -- detect operation type
    if not exists (select top 1 * from inserted)
    begin
        -- delete
        SET @operationType = 'D'
        SELECT @createDate = dbo.uf_DateWithCompTimeZone(CompanyId) FROM deleted
    end
    else if not exists (select top 1 * from deleted)
    begin
        -- insert
        SET @operationType = 'I'
        SELECT @createDate = dbo.uf_DateWithCompTimeZone(CompanyId) FROM inserted
    end
    else
    begin
        -- update
        SET @operationType = 'U'
        SELECT @createDate = dbo.uf_DateWithCompTimeZone(CompanyId) FROM inserted
    end

    -- log data to tmp table
    INSERT INTO tbl1
    SELECT
        @createDate,
        @operationType,
        @status,
        @UpdatedColumnsMask,
        d.F1,
        i.F1,
        d.F2,
        i.F2,
        d.F3,
        i.F3,
        d.F4,
        i.F4,
        d.F5,
        i.F5,
        ...
    FROM (Select 1 as temp) t
    LEFT JOIN inserted i on 1=1
    LEFT JOIN deleted d on 1=1
END
When I execute the update query, I get a timeout.
How can I optimize the logic to avoid the timeout?
Thank you.
This query:
SELECT *
FROM (
SELECT 1 AS temp
) t
LEFT JOIN
INSERTED i
ON 1 = 1
LEFT JOIN
DELETED d
ON 1 = 1
will yield 2500 ^ 2 = 6,250,000 records from a Cartesian product of INSERTED and DELETED (that is, every possible combination of records from the two tables), all of which will be inserted into tbl1.
Is that what you wanted to do?
Most probably, you want to join the tables on their PRIMARY KEY:
INSERT
INTO    tbl1
SELECT  @createDate,
        @operationType,
        @status,
        @UpdatedColumnsMask,
        d.F1,
        i.F1,
        d.F2,
        i.F2,
        d.F3,
        i.F3,
        d.F4,
        i.F4,
        d.F5,
        i.F5,
        ...
FROM    INSERTED i
FULL JOIN
        DELETED d
ON      i.id = d.id
This will treat update to the PK as deleting a record and inserting another, with a new PK.
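A tiny self-contained illustration of that behaviour (hypothetical values, with table variables standing in for the inserted and deleted pseudo-tables):
-- hypothetical example: an update changes the PK from 1 to 2
DECLARE @inserted TABLE (id int, F1 int);
DECLARE @deleted  TABLE (id int, F1 int);
INSERT @deleted  VALUES (1, 10);  -- old row image
INSERT @inserted VALUES (2, 10);  -- new row image

SELECT d.id AS old_id, i.id AS new_id
FROM @inserted i
FULL JOIN @deleted d ON i.id = d.id;
-- returns two rows, (1, NULL) and (NULL, 2),
-- i.e. the PK change is logged as a delete plus an insert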
Thanks Quassnoi, the FULL JOIN is a good idea. It helped me.
I also update the table in portions (1,000 rows at a time) to make my code run faster, because for some companyId values I need to update more than 160,000 rows.
Instead of the old code:
UPDATE tblSomeTable SET someVal = @someVal WHERE companyId = @companyId
I use the one below:
declare @rc integer = 0
declare @parts integer = 0
declare @index integer = 0
declare @portionSize int = 1000

-- select Ids for update
declare @tempIds table (id int)
insert into @tempIds
select id from tblSomeTable where companyId = @companyId

-- calculate amount of iterations
set @rc = @@rowcount
set @parts = @rc / @portionSize + 1

-- update table in portions
WHILE (@parts > @index)
begin
    UPDATE TOP (@portionSize) t
    SET someVal = @someVal
    FROM tblSomeTable t
    JOIN @tempIds t1 on t1.id = t.id
    WHERE companyId = @companyId

    delete top (@portionSize) from @tempIds

    set @index += 1
end
What do you think about this? Does it make sense? If yes, how do I choose the correct portion size?
Or is the simple update also a good solution? I just want to avoid locks in the future.
Thanks