Azure Synapse fastest way to process 20k statements in order - sql

I am designing an incremental update process for a cloud-based database (Azure). The only existing changelog is a .txt file that records every insert, delete, and update statement that the database processes. There is no change data capture table available, nor any database table that records changes, and I cannot enable watermarking on the database. The .txt file is structured as follows:
update [table] set x = 'data' where y = 'data'
go
insert into [table] values (data)
go
delete from [table] where x = data
go
I have built my process to convert the .txt file into a table in the cloud as follows:
update_id | db_operation | statement                                    | user  | processed_flag
----------|--------------|----------------------------------------------|-------|---------------
1         | 'update'     | 'update [table] set x = data where y = data' | user1 | 0
2         | 'insert'     | 'insert into [table] values (data)'          | user2 | 0
3         | 'delete'     | 'delete from [table] where x = data'         | user3 | 1
I use this code to create a temporary table of the unprocessed transactions, then loop over it, build each SQL statement, and execute it.
CREATE TABLE temp_incremental_updates
WITH
(
DISTRIBUTION = HASH ( [user] ),
HEAP
)
AS
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Sequence,
[user],
[statement]
FROM upd.incremental_updates
WHERE processed_flag = 0;
DECLARE @nbr_statements INT = (SELECT COUNT(*) FROM temp_incremental_updates),
        @i INT = 1;

WHILE @i <= @nbr_statements
BEGIN
    DECLARE @sql_code NVARCHAR(4000) = (SELECT [statement] FROM temp_incremental_updates WHERE Sequence = @i);
    EXEC sp_executesql @sql_code;
    SET @i += 1;
END
DROP TABLE temp_incremental_updates;
UPDATE upd.incremental_updates SET processed_flag = 1
This is taking a very long time, upwards of an hour. Is there a different way I can quickly process multiple SQL statements that need to occur in a specific order? Order is relevant because, for example, if I try to process a delete statement before the insert statement that created that data, Azure Synapse will throw an error.

Less than 2 hours for 20k individual statements is pretty good for Synapse!
Synapse isn't meant to do transactional processing. You need to convert individual updates to batch updates and execute statements like MERGE for big batches of rows instead of INSERT, UPDATE and DELETE for each row.
In your situation, you could:
Group all inserts/updates by table name
Create a temp table for each group. E.g. table1_insert_updates
Run a MERGE-like statement from table1_insert_updates to table1.
For deletes:
Group primary keys by table name
Run one DELETE FROM table1 WHERE key IN (primary keys) per table (see the sketch after this list).
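A minimal sketch of that flow for one table, assuming a staging table table1_insert_updates holding the parsed column values, a delete list table1_deletes, and a key column key_col on table1 (all object and column names here are illustrative):

-- Upsert one table's worth of grouped changes in a single statement
MERGE INTO table1 AS tgt
USING table1_insert_updates AS src
    ON tgt.key_col = src.key_col
WHEN MATCHED THEN
    UPDATE SET tgt.x = src.x, tgt.y = src.y
WHEN NOT MATCHED THEN
    INSERT (key_col, x, y) VALUES (src.key_col, src.x, src.y);

-- Apply all deletes for the same table in one statement
DELETE FROM table1
WHERE key_col IN (SELECT key_col FROM table1_deletes);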
Frankly, 20k is an awkward number: it's too big to process quickly row by row, yet far from big enough to play to Synapse's strengths. So even after grouping you could still have performance issues if your batch/group sizes are too small.
Synapse isn't meant for transaction processing. It'll merge a table with a million rows into a table with a billion rows in less than 5 minutes using a single MERGE statement to upsert a million rows, but if you run 1000 delete and 1000 insert statements one after the other it'll probably take longer!
EDIT: You'll also have to use PARTITION BY and RANK (or ROW_NUMBER) to de-duplicate in case there are multiple updates to the same row in a single batch. Depending on what your input looks like (whether an update contains all columns, even unchanged ones, or only the changed columns), this can become very complicated.
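For that de-duplication step, a sketch using CTAS (as above) plus ROW_NUMBER(), assuming each staged row carries the target key and the original update_id so the latest change wins (all names are illustrative):

CREATE TABLE table1_insert_updates_dedup
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
)
AS
SELECT key_col, x, y
FROM
(
    SELECT key_col, x, y,
           ROW_NUMBER() OVER (PARTITION BY key_col ORDER BY update_id DESC) AS rn
    FROM table1_insert_updates
) AS ranked
WHERE rn = 1;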
Again Synapse is not meant for transaction processing.

Try to declare a cursor for selecting all the data from temp_incremental_updates at once, instead of making multiple reads:
CREATE TABLE temp_incremental_updates
WITH
(
DISTRIBUTION = HASH ( [user] ),
HEAP
)
AS
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Sequence,
[user],
[statement]
FROM upd.incremental_updates
WHERE processed_flag = 0;
DECLARE @sql_code NVARCHAR(4000);

DECLARE cur CURSOR FOR SELECT [statement] FROM temp_incremental_updates ORDER BY Sequence;
OPEN cur;
FETCH NEXT FROM cur INTO @sql_code;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC sp_executesql @sql_code;
    FETCH NEXT FROM cur INTO @sql_code;
END
CLOSE cur;
DEALLOCATE cur;
-- Rest of the code

Related

How to delete a massive amount of data from several related tables in SQL on MS Server

Given a "main" table which has a single primary key, from which a huge number of rows need to be deleted (perhaps about 200M). In addition, there are about 30 "related" tables that are related to the main table, and related rows must also be deleted from each. It is expected that about an equivalent huge number of rows (or more) would need to be deleted from each of the related tables.
Of course it's possible to change the condition to partition the amount of data to be deleted, and run it several times, but in any case, I need an efficient solution to do this.
John Rees suggests a way to do massive deletes in a single table in Delete Large Number of Rows Is Very Slow - SQL Server, but the problem with that is that it performs several transactional deletes in a single table. This could potentially leave the db in an inconsistent state.
John Gibb suggests a way to delete from several related tables in How do I delete from multiple tables using INNER JOIN in SQL server, but it does not consider the possibility that the amount of data to be deleted from each of these tables is large.
How can these two solutions be combined into an efficient way to delete a large number of rows from several related tables? (I'm new to SQL)
Perhaps it's important to note that, in the scope of this problem, each "related" table is only related to the "main" table.
I think this is what you're after...
This will delete 4000 rows from the tables with the foreign key references (assuming 1:1) before deleting the same 4000 rows from the main table.
It will loop until done, or it hits the stop time (if enabled).
DECLARE @BATCHSIZE INT, @ITERATION INT, @TOTALROWS INT, @MAXRUNTIME VARCHAR(8), @BSTOPATMAXTIME BIT, @MSG VARCHAR(500)
SET DEADLOCK_PRIORITY LOW;
SET @BATCHSIZE = 4000
SET @MAXRUNTIME = '08:00:00' -- 8AM
SET @BSTOPATMAXTIME = 1 -- ENFORCE 8AM STOP TIME
SET @ITERATION = 0 -- LEAVE THIS
SET @TOTALROWS = 0 -- LEAVE THIS

IF OBJECT_ID('TEMPDB..#TMPLIST') IS NOT NULL DROP TABLE #TMPLIST
CREATE TABLE #TMPLIST (ID BIGINT)

WHILE @BATCHSIZE > 0
BEGIN
    -- IF @BSTOPATMAXTIME = 1, THEN WE'LL STOP THE WHOLE JOB AT A SET TIME...
    IF CONVERT(VARCHAR(8), GETDATE(), 108) >= @MAXRUNTIME AND @BSTOPATMAXTIME = 1
    BEGIN
        RETURN
    END

    TRUNCATE TABLE #TMPLIST

    INSERT INTO #TMPLIST (ID)
    SELECT TOP(@BATCHSIZE) ID
    FROM MAINTABLE
    WHERE X = Y -- DELETE CRITERIA HERE...

    SET @BATCHSIZE = @@ROWCOUNT

    DELETE T1
    FROM SOMETABLE1 T1
    WHERE EXISTS (SELECT 1 FROM #TMPLIST T WHERE T1.MAINID = T.ID)

    DELETE T2
    FROM SOMETABLE2 T2
    WHERE EXISTS (SELECT 1 FROM #TMPLIST T WHERE T2.MAINID = T.ID)

    DELETE T3
    FROM SOMETABLE3 T3
    WHERE EXISTS (SELECT 1 FROM #TMPLIST T WHERE T3.MAINID = T.ID)

    DELETE M
    FROM MAINTABLE M
    WHERE EXISTS (SELECT 1 FROM #TMPLIST T WHERE T.ID = M.ID)

    SET @ITERATION = @ITERATION + 1
    SET @TOTALROWS = @TOTALROWS + @BATCHSIZE
    SET @MSG = 'Iteration: ' + CAST(@ITERATION AS VARCHAR) + ' Total deletes: ' + CAST(@TOTALROWS AS VARCHAR)
    RAISERROR (@MSG, 0, 1) WITH NOWAIT
END

Adding/updating bulk data using SQL

We are inserting bulk data into one of our database tables using SQL Server Management Studio. Currently the data being sent to the database is added to a particular table one row at a time (this is controlled by a stored procedure). What we are finding is that a timeout occurs before the operation completes; we think the operation is slow because of the WHILE loop, but we're unsure how to write a faster equivalent.
-- Insert statements for procedure here
WHILE @i < @nonexistingTblCount
BEGIN
    INSERT INTO AlertRanking (MetricInstanceID, GreenThreshold, RedThreshold, AlertTypeID, MaxThreshold, MinThreshold)
    VALUES ((SELECT id FROM #nonexistingTbl ORDER BY id OFFSET @i ROWS FETCH NEXT 1 ROWS ONLY),
            @greenThreshold, @redThreshold, @alertTypeID, @maxThreshold, @minThreshold)

    SET @id = (SELECT ID FROM AlertRanking
               WHERE MetricInstanceID = (SELECT id FROM #nonexistingTbl ORDER BY id OFFSET @i ROWS FETCH NEXT 1 ROWS ONLY)
                 AND GreenThreshold = @greenThreshold
                 AND RedThreshold = @redThreshold
                 AND AlertTypeID = @alertTypeID);

    SET @i = @i + 1;
END
Where @nonexistingTblCount is the total number of rows inside the temp table #nonexistingTbl. The #nonexistingTbl table is declared earlier and contains all the values we want to add to the table.
Instead of using a loop, you should be able to insert all of the records with a single statement.
INSERT INTO AlertRanking(MetricInstanceID,GreenThreshold,RedThreshold,AlertTypeID,MaxThreshold,MinThreshold)
SELECT id, @greenThreshold, @redThreshold, @alertTypeID, @maxThreshold, @minThreshold FROM #nonexistingTbl ORDER BY id
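If you also need the generated AlertRanking IDs (which the @id assignment in the loop suggests), an OUTPUT clause can capture them in the same set-based insert. A sketch, assuming ID is an identity column on AlertRanking:

DECLARE @newIds TABLE (ID INT, MetricInstanceID INT);

INSERT INTO AlertRanking (MetricInstanceID, GreenThreshold, RedThreshold, AlertTypeID, MaxThreshold, MinThreshold)
OUTPUT inserted.ID, inserted.MetricInstanceID INTO @newIds (ID, MetricInstanceID)
SELECT id, @greenThreshold, @redThreshold, @alertTypeID, @maxThreshold, @minThreshold
FROM #nonexistingTbl;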

How to get the next number in a sequence

I have a table like this:
+----+-----------+------+-------+--+
| id | Part | Seq | Model | |
+----+-----------+------+-------+--+
| 1 | Head | 0 | 3 | |
| 2 | Neck | 1 | 3 | |
| 3 | Shoulders | 2 | 29 | |
| 4 | Shoulders | 2 | 3 | |
| 5 | Stomach | 5 | 3 | |
+----+-----------+------+-------+--+
How can I insert another record with the next Seq after Stomach for Model 3? Here is what the new table is supposed to look like:
+----+-----------+------+-------+--+
| id | Part | Seq | Model | |
+----+-----------+------+-------+--+
| 1 | Head | 0 | 3 | |
| 2 | Neck | 1 | 3 | |
| 3 | Shoulders | 2 | 29 | |
| 4 | Shoulders | 2 | 3 | |
| 5 | Stomach | 5 | 3 | |
| 6 | Groin | 6 | 3 | |
+----+-----------+------+-------+--+
Is there a way to craft an insert query that will give the next number after the highest Seq for Model 3 only? Also, I'm looking for something that is concurrency safe.
If you do not maintain a counter table, there are two options. Within a transaction, first select the MAX(seq) with one of the following table hints:
WITH(TABLOCKX, HOLDLOCK)
WITH(ROWLOCK, XLOCK, HOLDLOCK)
TABLOCKX + HOLDLOCK is a bit overkill. It blocks regular select statements, which can be considered heavy even though the transaction is small.
A ROWLOCK, XLOCK, HOLDLOCK table hint is probably a better idea (but read the alternative with a counter table further on). The advantage is that it does not block regular select statements, i.e. select statements that don't appear in a SERIALIZABLE transaction, or that don't provide the same table hints. Using ROWLOCK, XLOCK, HOLDLOCK will still block insert statements.
Of course you need to be sure that no other parts of your program select the MAX(seq) without these table hints (or outside a SERIALIZABLE transaction) and then use this value to insert rows.
Note that depending on the number of rows that are locked this way, it is possible that SQL Server will escalate the lock to a table lock. Read more about lock escalation here.
The insert procedure using WITH(ROWLOCK, XLOCK, HOLDLOCK) would look as follows:
DECLARE @target_model INT = 3;
DECLARE @part VARCHAR(128) = 'Spine';

BEGIN TRY
    BEGIN TRANSACTION;
    DECLARE @max_seq INT = (SELECT MAX(seq) FROM dbo.table_seq WITH(ROWLOCK,XLOCK,HOLDLOCK) WHERE model = @target_model);
    IF @max_seq IS NULL SET @max_seq = 0;
    INSERT INTO dbo.table_seq(part,seq,model) VALUES(@part, @max_seq + 1, @target_model);
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;
END CATCH
An alternative and probably a better idea is to have a counter table, and provide these table hints on the counter table. This table would look like the following:
CREATE TABLE dbo.counter_seq(model INT PRIMARY KEY, seq INT);
You would then change the insert procedure as follows:
DECLARE @target_model INT = 3;
DECLARE @part VARCHAR(128) = 'Spine';

BEGIN TRY
    BEGIN TRANSACTION;
    DECLARE @new_seq INT = (SELECT seq FROM dbo.counter_seq WITH(ROWLOCK,XLOCK,HOLDLOCK) WHERE model = @target_model);
    IF @new_seq IS NULL
        BEGIN SET @new_seq = 1; INSERT INTO dbo.counter_seq(model,seq) VALUES(@target_model, @new_seq); END
    ELSE
        BEGIN SET @new_seq += 1; UPDATE dbo.counter_seq SET seq = @new_seq WHERE model = @target_model; END
    INSERT INTO dbo.table_seq(part,seq,model) VALUES(@part, @new_seq, @target_model);
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;
END CATCH
The advantage is that fewer row locks are used (i.e. one per model in dbo.counter_seq), and lock escalation cannot lock the whole dbo.table_seq table and thus block select statements.
You can test all this and see the effects yourself, by placing a WAITFOR DELAY '00:01:00' after selecting the sequence from counter_seq, and fiddling with the table(s) in a second SSMS tab.
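For example, the blocking side of that test could look like this (run it in one tab and try the insert procedure or plain selects from a second tab; the one-minute delay and model value are arbitrary):

BEGIN TRANSACTION;
    -- take and hold the same lock the insert procedure would take
    SELECT seq FROM dbo.counter_seq WITH(ROWLOCK, XLOCK, HOLDLOCK) WHERE model = 3;
    WAITFOR DELAY '00:01:00';
ROLLBACK TRANSACTION;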
PS1: Using ROW_NUMBER() OVER (PARTITION BY model ORDER BY ID) is not a good way. If rows are deleted or added, or IDs are changed, the sequence would change (consider invoice IDs, which should never change). Also, in terms of performance, having to determine the row numbers of all previous rows when retrieving a single row is a bad idea.
PS2: I would never use outside resources to provide locking, when SQL Server already provides locking through isolation levels or fine-grained table hints.
The correct way to handle such insertions is to use an identity column or, if you prefer, a sequence and a default value for the column.
However, you have a NULL value for the seq column, which does not seem correct.
The problem with a query such as:
Insert into yourtable(id, Part, Seq, Model)
Select 6, 'Groin', max(Seq) + 1, 3
From yourtable;
is that two such queries, running at the same time, could produce the same value. The recommendation is to declare seq as a unique, identity column and let the database do all the work.
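A sketch of that recommendation using a SEQUENCE object as the column default (note this is a single global sequence, so it does not restart per Model, which is the limitation the other answers work around; object names are made up):

CREATE SEQUENCE dbo.BodyPartSeq AS INT START WITH 1 INCREMENT BY 1;

CREATE TABLE dbo.BodyParts2
(
    id    INT IDENTITY(1,1) PRIMARY KEY,
    Part  VARCHAR(32) NOT NULL,
    Seq   INT NOT NULL DEFAULT (NEXT VALUE FOR dbo.BodyPartSeq),
    Model INT NOT NULL
);

-- Seq is assigned by the database, concurrency-safe, no MAX(Seq) + 1 race
INSERT INTO dbo.BodyParts2 (Part, Model) VALUES ('Groin', 3);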
Let's first list the challenges:
We cannot use a normal constraint, as there are existing null values, and we also need to cater for duplicates as well as gaps if we look at the existing data. This is fine; we will figure it out in step 3. ;->
We require safety for concurrent operations (thus some form or mix of transactions, isolation levels and possibly a "kind of SQL mutex"). Gut feel here is a stored proc, for a couple of reasons:
2.1 It protects more easily from sql injection
2.2 We can control the isolation levels (table locking) more easily and recover from some issues which come with this kind of requirement
2.3 We can use application level db locks to control the concurrency
We must store or find the next value on every insert. The word concurrency already tells us that there will be contention and probably high throughput (otherwise, stick to single threads). So we should already be thinking: do not read from the same table you want to write to in an already complicated world.
So with that short prequel, let's attempt a solution:
As a start, we are creating your original table and then also a table to hold the sequence (BodyPartsCounter) which we are setting to the last used sequence + 1:
CREATE TABLE BodyParts
([id] int identity, [Part] varchar(9), [Seq] varchar(4), [Model] int)
;
INSERT INTO BodyParts
([Part], [Seq], [Model])
VALUES
('Head', NULL, 3),
('Neck', '1', 3),
('Shoulders', '2', 29),
('Shoulders', '2', 3),
('Stomach', '5', 3)
;
CREATE TABLE BodyPartsCounter
([id] int
, [counter] int)
;
INSERT INTO BodyPartsCounter
([id], [counter])
SELECT 1, MAX(id) + 1 AS id FROM BodyParts
;
Then we need to create the stored procedure which will do the magic. In short, it acts as a mutex, basically guaranteeing concurrency safety (if you do not do inserts or updates into the same tables elsewhere). It then gets the next seq, updates it and inserts the new row. After this has all happened it commits the transaction and releases the stored proc for the next waiting calling thread.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author: Charlla
-- Create date: 2016-02-15
-- Description: Inserts a new row in a concurrently safe way
-- =============================================
CREATE PROCEDURE InsertNewBodyPart
    @bodypart varchar(50),
    @Model int = 3
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    -- Get an application lock in your threaded calls
    -- Note: this is blocking for the duration of the transaction
    DECLARE @lockResult int;
    EXEC @lockResult = sp_getapplock @Resource = 'BodyPartMutex',
                                     @LockMode = 'Exclusive';
    IF @lockResult = -3 --deadlock victim
    BEGIN
        ROLLBACK TRANSACTION;
    END
    ELSE
    BEGIN
        DECLARE @newId int;
        --Get the next sequence and update - part of the transaction, so if the insert fails this will roll back
        SELECT @newId = [counter] FROM BodyPartsCounter WHERE [id] = 1;
        UPDATE BodyPartsCounter SET [counter] = @newId + 1 WHERE id = 1;

        -- INSERT THE NEW ROW
        INSERT INTO dbo.BodyParts(
            Part
            , Seq
            , Model
        )
        VALUES(
            @bodypart
            , @newId
            , @Model
        )
        -- END INSERT THE NEW ROW

        EXEC @lockResult = sp_releaseapplock @Resource = 'BodyPartMutex';
        COMMIT TRANSACTION;
    END;
END
GO
Now run the test with this:
DECLARE @return_value int;

EXEC @return_value = [dbo].[InsertNewBodyPart]
    @bodypart = N'Stomach',
    @Model = 4

SELECT 'Return Value' = @return_value

SELECT * FROM BodyParts;
SELECT * FROM BodyPartsCounter
This all works - but be careful - there's a lot to consider with any kind of multithreaded app.
Hope this helps!
I believe the best bet to handle this kind of sequence generation scenario is the counter table, as TT suggested. I just wanted to show here a slightly simplified version of TT's implementation.
Tables:
CREATE TABLE dbo.counter_seq(model INT PRIMARY KEY, seq INT);
CREATE TABLE dbo.table_seq(part varchar(128), seq int, model int);
Simpler version (No SELECT statement to retrieve the current seq):
DECLARE @target_model INT = 3;
DECLARE @part VARCHAR(128) = 'Otra MAS';

BEGIN TRY
    BEGIN TRANSACTION;
    DECLARE @seq int = 1
    UPDATE dbo.counter_seq WITH(ROWLOCK,HOLDLOCK) SET @seq = seq = seq + 1 WHERE model = @target_model;
    IF @@ROWCOUNT = 0 INSERT INTO dbo.counter_seq VALUES (@target_model, 1);
    INSERT INTO dbo.table_seq(part,seq,model) VALUES(@part, @seq, @target_model);
    COMMIT
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;
END CATCH
Since you want the sequence to be based on a specific model, just add that to the WHERE clause when doing the select. This will ensure the MAX(Seq) pertains only to that model series. Also, since Seq can be null, wrap it in ISNULL, so if it is null it will be treated as 0, and 0 + 1 will set the next value to 1.
The basic way to do this is:
Insert into yourtable(id, Part, Seq, Model)
Select 6, 'Groin', ISNULL(max(Seq),0) + 1, 3
From yourtable
where MODEL = 3;
I would not try to store the Seq value in the table in the first place.
As you said in the comments, your ID is an IDENTITY column, which is incremented automatically and in a concurrency-safe way by the server. Use it to determine the order in which rows were inserted and the order in which the Seq values should be generated.
Then use ROW_NUMBER to generate values of Seq partitioned by Model (the sequence restarts from 1 for each value of Model) as needed in the query.
SELECT
ID
,Part
,Model
,ROW_NUMBER() OVER(PARTITION BY Model ORDER BY ID) AS Seq
FROM YourTable
insert into tableA (id, part, seq, model)
select 6, 'Groin', MAX(seq) + 1, 3
from tableA
where model = 3
create function dbo.fncalnxt(@model int)
returns int
begin
    declare @seq int
    select @seq = case when @model = 3 then max(id) --else
                  end
    from tblBodyParts
    return @seq + 1
end
--query idea to insert values, ideal if using an SP to insert
insert into tblBodyParts values('groin', dbo.fncalnxt(@model), @model)
You can try this, I guess.
A novice shot, correct me if I'm wrong. I'd suggest using a function to get the value for the seq column based on the model;
you'll have to handle the else case though, to return whatever other value you want when model != 3 - right now it returns null.
Assuming you have following table:
CREATE TABLE tab (
id int IDENTITY(1,1) PRIMARY KEY,
Part VARCHAR(32) not null,
Seq int not null,
Model int not null
);
INSERT INTO
tab(Part,Seq,Model)
VALUES
('Head', 0, 3),
('Neck', 1, 3),
('Shoulders', 2, 29),
('Shoulders', 2, 3),
('Stomach', 5, 3);
The query below will allow you to import multiple records without ruining the per-model sequence:
INSERT INTO
    tab (Model, Part, Seq)
SELECT
    t.model,
    t.part,
    -- ensure new records receive the proper seq value
    ISNULL(t.max_seq + t.model_seq, t.model_seq) AS seq
FROM
    (
        SELECT
            -- row number for each new record, per model
            ROW_NUMBER() OVER(PARTITION BY n.model ORDER BY n.part) AS model_seq,
            n.model,
            n.part,
            MAX(tab.Seq) AS max_seq
        FROM
            -- Table-value constructor allows you to prepare the
            -- temporary data (with multiple rows),
            -- joined against the existing data
            -- to retrieve MAX(Seq) per model, if any
            (VALUES
                ('Stomach', 3),
                ('Legs', 3),
                ('Legs', 29),
                ('Arms', 1)
            ) AS n(part, model)
        LEFT JOIN
            tab
        ON
            tab.Model = n.model
        GROUP BY
            n.model, n.part
    ) AS t;
We need ROW_NUMBER() to ensure that if we import more than one value per model, the ordering will be kept. More info: ROW_NUMBER() OVER() (Transact-SQL).
A table-value constructor is used to create a table with the new values and join in the MAX Seq for each model.
You can find more about table-value constructors here: Table Value Constructor (Transact-SQL)

How to efficiently delete rows while NOT using Truncate Table in a 500,000+ rows table

Let's say we have a table Sales with 30 columns and 500,000 rows. I would like to delete 400,000 rows in the table (those where "toDelete='1'").
But I have a few constraints :
the table is read/written "often" and I would not like a long "delete" to lock the table for too long
I need to skip the transaction log (like with a TRUNCATE) while still doing a "DELETE ... WHERE..." (I need to put a condition), but I haven't found any way to do this...
Any advice would be welcome to transform a
DELETE FROM Sales WHERE toDelete='1'
to something more partitioned & possibly transaction log free.
Calling DELETE FROM TableName will do the entire delete in one large transaction. This is expensive.
Here is another option which will delete rows in batches :
deleteMore:
DELETE TOP(10000) Sales WHERE toDelete='1'
IF @@ROWCOUNT != 0
goto deleteMore
I'll leave my answer here, since I was able to test different approaches for mass delete and update (I had to update and then delete 125+ million rows; the server has 16GB of RAM, Xeon E5-2680 @ 2.7GHz, SQL Server 2012).
TL;DR: always update/delete by primary key, never by any other condition. If you can't use PK directly, create a temp table and fill it with PK values and update/delete your table using that table. Use indexes for this.
I started with the solution from above (by @Kevin Aenmey), but this approach turned out to be inappropriate, since my database was live and handles a couple of hundred transactions per second, and there was some blocking involved (there was an index for all the fields from the condition; using WITH(ROWLOCK) didn't change anything).
So, I added a WAITFOR statement, which allowed database to process other transactions.
deleteMore:
WAITFOR DELAY '00:00:01'
DELETE TOP(1000) FROM MyTable WHERE Column1 = @Criteria1 AND Column2 = @Criteria2 AND Column3 = @Criteria3
IF @@ROWCOUNT != 0
    goto deleteMore
This approach was able to process ~1.6mio rows/hour for updating and ~0.2mio rows/hour for deleting.
Turning to temp tables changed things quite a lot.
deleteMore:
SELECT TOP 10000 Id /* Id is the PK */
INTO #Temp
FROM MyTable WHERE Column1 = @Criteria1 AND Column2 = @Criteria2 AND Column3 = @Criteria3

DELETE MT
FROM MyTable MT
JOIN #Temp T ON T.Id = MT.Id
/* you can use the IN operator, it doesn't change anything
DELETE FROM MyTable WHERE Id IN (SELECT Id FROM #Temp)
*/
IF @@ROWCOUNT > 0 BEGIN
    DROP TABLE #Temp
    WAITFOR DELAY '00:00:01'
    goto deleteMore
END ELSE BEGIN
    DROP TABLE #Temp
    PRINT 'This is the end, my friend'
END
This solution processed ~25mio rows/hour for updating (15x faster) and ~2.2mio rows/hour for deleting (11x faster).
What you want is batch processing.
While (select Count(*) from sales where toDelete =1) >0
BEGIN
Delete from sales where SalesID in
(select top 1000 salesId from sales where toDelete = 1)
END
Of course you can experiment which is the best value to use for the batch, I've used from 500 - 50000 depending on the table. If you use cascade delete, you will probably need a smaller number as you have those child records to delete.
One way I have had to do this in the past is to have a stored procedure or script that deletes n records. Repeat until done.
DELETE TOP (1000) FROM Sales WHERE toDelete='1'
You should try to give it a ROWLOCK hint so it will not lock the entire table. However, if you delete a lot of rows lock escalation will occur.
Also, make sure you have a non-clustered filtered index (only for '1' values) on the toDelete column. If possible, make it a BIT column, not VARCHAR (or whatever it is now).
DELETE FROM Sales WITH(ROWLOCK) WHERE toDelete='1'
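A sketch of the filtered index suggested above (the index name is made up):

CREATE NONCLUSTERED INDEX IX_Sales_toDelete_1
ON dbo.Sales (toDelete)
WHERE toDelete = '1';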
Ultimately, you can try to iterate over the table and delete in chunks.
Updated
Since while loops and chunk deletes are the new pink here, I'll throw in my version too (combined with my previous answer):
SET ROWCOUNT 100
DELETE FROM Sales WITH(ROWLOCK) WHERE toDelete='1'
WHILE @@ROWCOUNT > 0
BEGIN
SET ROWCOUNT 100
DELETE FROM Sales WITH(ROWLOCK) WHERE toDelete='1'
END
My own take on this functionality would be as follows.
This way there is no repeated code and you can manage your chunk size.
DECLARE @DeleteChunk INT = 10000
DECLARE @rowcount INT = 1

WHILE @rowcount > 0
BEGIN
    DELETE TOP (@DeleteChunk) FROM Sales WITH(ROWLOCK)
    WHERE toDelete = '1' -- the filter is needed here, otherwise the loop would delete the whole table

    SELECT @rowcount = @@ROWCOUNT
END
I have used the below to delete around 50 million records -
DECLARE @BatchSize INT = 4999 -- see the note below about keeping this under 5000
BEGIN TRANSACTION

DeleteOperation:
DELETE TOP (@BatchSize)
FROM [database_name].[database_schema].[database_table]
IF @@ROWCOUNT > 0
    GOTO DeleteOperation

COMMIT TRANSACTION
Please note that keeping the BatchSize < 5000 is less expensive on resources.
I assume the best way to delete a huge amount of records is to delete them by primary key.
So you have to generate a T-SQL script that contains the whole list of rows to delete, and then execute that script.
For example, the code below will generate that script:
GO
SET NOCOUNT ON
SELECT 'DELETE FROM DATA_ACTION WHERE ID = ' + CAST(ID AS VARCHAR(50)) + ';' + CHAR(13) + CHAR(10) + 'GO'
FROM DATA_ACTION
WHERE YEAR(AtTime) = 2014
The output file will have records like:
DELETE FROM DATA_ACTION WHERE ID = 123;
GO
DELETE FROM DATA_ACTION WHERE ID = 124;
GO
DELETE FROM DATA_ACTION WHERE ID = 125;
GO
And now you have to use SQLCMD utility in order to execute this script.
sqlcmd -S [Instance Name] -E -d [Database] -i [Script]
You can find this approach explained here: https://www.mssqltips.com/sqlservertip/3566/deleting-historical-data-from-a-large-highly-concurrent-sql-server-database-table/
Here's how I do it when I know approximately how many iterations:
delete from Activities with(rowlock) where Id in (select top 999 Id from Activities
(nolock) where description like 'financial data update date%' and len(description) = 87
and User_Id = 2);
waitfor delay '00:00:02'
GO 20
Edit: This worked better and faster for me than selecting top:
declare @counter int = 1
declare @msg varchar(max)
declare @batch int = 499

while (@counter <= 37600)
begin
    set @msg = ('Iteration count = ' + convert(varchar, @counter))
    raiserror(@msg, 0, 1) with nowait

    delete Activities with (rowlock) where Id in (select Id from Activities (nolock) where description like 'financial data update date%' and len(description) = 87 and User_Id = 2 order by Id asc offset 1 ROWS fetch next @batch rows only)

    set @counter = @counter + 1
    waitfor delay '00:00:02'
end
Declare @counter INT
Set @counter = 10 -- (you can always obtain the number of rows to be deleted and set the counter to that value)

While @counter > 0
Begin
    Delete TOP (4000) from <Tablename> where ID in (Select ID from <sametablename> with (NOLOCK) where DateField < '2021-01-04') -- or opt for GetDate() - 1
    Set @counter = @counter - 1 -- or set @counter = @counter - 4000 if you know the number of rows to be deleted.
End

Batch commit on large INSERT operation in native SQL?

I have a couple of large tables (188m and 144m rows) I need to populate from views, but each view contains a few hundred million rows (pulling together pseudo-dimensionally modelled data into a flat form). The keys on each table are composite keys over 50 bytes wide.
If I do a single INSERT operation, the process uses a huge amount of transaction log space, typically filling it up and prompting a bunch of hassle with the DBAs. (And yes, this is probably a job the DBAs should handle/design/architect.)
I can use SSIS and stream the data into the destination table with batch commits (but this does require the data to be transmitted over the network, since we are not allowed to run SSIS packages on the server).
Is there anything other than dividing the process up into multiple INSERT operations, using some kind of key to distribute the rows into different batches, and doing a loop?
Does the view have ANY kind of unique identifier / candidate key? If so, you could select those rows into a working table using:
SELECT key_columns INTO dbo.temp FROM dbo.HugeView;
(If it makes sense, maybe put this table into a different database, perhaps with SIMPLE recovery model, to prevent the log activity from interfering with your primary database. This should generate much less log anyway, and you can free up the space in the other database before you resume, in case the problem is that you have inadequate disk space all around.)
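A sketch of that side database, if you are free to create one (database and table names are placeholders):

CREATE DATABASE StagingDB;
ALTER DATABASE StagingDB SET RECOVERY SIMPLE;
GO

-- key_columns stands for the view's candidate key column(s), as above
SELECT key_columns INTO StagingDB.dbo.temp FROM dbo.HugeView;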
Then you can do something like this, inserting 10,000 rows at a time, and backing up the log in between:
SET NOCOUNT ON;

DECLARE
    @batchsize INT,
    @ctr INT,
    @rc INT;

SELECT
    @batchsize = 10000,
    @ctr = 0;

WHILE 1 = 1
BEGIN
    WITH x AS
    (
        SELECT key_column, rn = ROW_NUMBER() OVER (ORDER BY key_column)
        FROM dbo.temp
    )
    INSERT dbo.PrimaryTable(a, b, c, etc.)
    SELECT v.a, v.b, v.c, etc.
    FROM x
    INNER JOIN dbo.HugeView AS v
        ON v.key_column = x.key_column
    WHERE x.rn > @batchsize * @ctr
      AND x.rn <= @batchsize * (@ctr + 1);

    IF @@ROWCOUNT = 0
        BREAK;

    BACKUP LOG PrimaryDB TO DISK = 'C:\db.bak' WITH INIT;

    SET @ctr = @ctr + 1;
END
That's all off the top of my head, so don't cut/paste/run, but I think the general idea is there. For more details (and why I backup log / checkpoint inside the loop), see this post on sqlperformance.com:
Break large delete operations into chunks
Note that if you are taking regular database and log backups you will probably want to take a full backup to start your log chain over again.
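For example (database name and backup path are placeholders):

BACKUP DATABASE PrimaryDB TO DISK = 'C:\PrimaryDB_full.bak' WITH INIT;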
You could partition your data and insert it in a cursor loop. That would be nearly the same as SSIS batch inserting, but it runs on your server.
create cursor ....
select YEAR(DateCol), MONTH(DateCol) from whatever
while ....
insert into yourtable(...)
select * from whatever
where YEAR(DateCol) = year and MONTH(DateCol) = month
end
I know this is an old thread, but I made a generic version of Arthur's cursor solution:
--Split a batch up into chunks using a cursor.
--This method can be used for most any large table with some modifications
--It could also be refined further with a @Day variable (for example)
DECLARE @Year INT
DECLARE @Month INT

DECLARE BatchingCursor CURSOR FOR
SELECT DISTINCT YEAR(<SomeDateField>), MONTH(<SomeDateField>)
FROM <Sometable>;

OPEN BatchingCursor;
FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
WHILE @@FETCH_STATUS = 0
BEGIN
    --All logic goes in here
    --Any select statements from <Sometable> need to be suffixed with:
    --WHERE Year(<SomeDateField>)=@Year AND Month(<SomeDateField>)=@Month
    FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
END;
CLOSE BatchingCursor;
DEALLOCATE BatchingCursor;
GO
This solved the problem on loads of our large tables.
There is no pixie dust, you know that.
Without knowing specifics about the actual schema being transferred, a generic solution would be exactly as you describe it: divide processing into multiple inserts and keep track of the key(s). This is sort of pseudo-code T-SQL:
create table currentKeys (table sysname not null primary key, key sql_variant not null);
go

declare @keysInserted table (key sql_variant);
declare @key sql_variant;

begin transaction
do while (1=1)
begin
    select @key = key from currentKeys where table = '<target>';

    insert into <target> (...)
    output inserted.key into @keysInserted (key)
    select top (<batchsize>) ... from <source>
    where key > @key
    order by key;

    if (0 = @@rowcount)
        break;

    update currentKeys
    set key = (select max(key) from @keysInserted)
    where table = '<target>';

    commit;

    delete from @keysInserted;
    set @key = null;

    begin transaction;
end
commit
It would get more complicated if you want to allow for parallel batches and partition the keys.
You could use the BCP command to load the data and use the Batch Size parameter
http://msdn.microsoft.com/en-us/library/ms162802.aspx
Two-step process:
BCP OUT data from Views into Text files
BCP IN data from Text files into Tables with the batch size parameter (see the sketch below)
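A sketch of those two steps from a command prompt, assuming a trusted connection and character-mode files (server, object, and file names are placeholders):

rem Step 1: export the view to a text file
bcp "SELECT * FROM MyDatabase.dbo.HugeView" queryout C:\data\hugeview.txt -c -S MyServer -T

rem Step 2: load the file into the table, committing every 10,000 rows
bcp MyDatabase.dbo.PrimaryTable in C:\data\hugeview.txt -c -b 10000 -S MyServer -T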
This looks like a job for good ol' BCP.