Use T-SQL Transaction for batch of delete statements?

I have a stored procedure that deletes records from multiple tables.
I wish for either all of the delete statements to complete successfully, or none. The actual purpose here is to wipe all data related to a particular user.
Note that none of this data is related in any way to any other user's data; e.g., one user's data is never referenced by another user's data. However, it is possible for concurrent clients to access one user's data simultaneously. I don't know if this is relevant.
So I've wrapped it in BEGIN TRANSACTION ... COMMIT TRANSACTION
like so:
CREATE PROCEDURE [dbo].[spDeleteData]
@MyID AS INT
AS
BEGIN TRANSACTION
DELETE FROM [Table1] WHERE myId = @MyID;
DELETE FROM [Table2] WHERE myId = @MyID;
....
COMMIT TRANSACTION
RETURN 0
My question here is what are the implications of wrapping multiple DELETE calls in a transaction? Will it create possible deadlock scenarios, or hurt performance in some way?
From what I am reading, the TRANSACTION ISOLATION LEVEL setting only applies to read operations. Is this true?

What you are guaranteeing is that either all the rows that match the conditions in all the tables are successfully deleted, or none of the rows are deleted (i.e., if there is a problem, the deletes are rolled back). There are more locks and they are held for a longer period, but if something fails you don't have to manually recreate the rows; the deletes are undone for you automatically. You probably want to add the statement:
set xact_abort on
at the beginning of the procedure, and to wrap the whole thing in a BEGIN TRY / BEGIN CATCH block.
Please see sommarskog.se/error-handling-I.html#XACT_ABORT for an excellent discussion of this setting and of error handling in T-SQL.
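A hedged sketch of what that might look like applied to the procedure above (this assumes SQL Server 2012 or later for THROW; the table names are just the placeholders from the question):
CREATE PROCEDURE [dbo].[spDeleteData]
@MyID AS INT
AS
SET XACT_ABORT ON; -- any run-time error aborts the batch and dooms the transaction
SET NOCOUNT ON;
BEGIN TRY
    BEGIN TRANSACTION;
    DELETE FROM [Table1] WHERE myId = @MyID;
    DELETE FROM [Table2] WHERE myId = @MyID;
    -- ... remaining tables ...
    COMMIT TRANSACTION;
    RETURN 0;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW; -- re-raise the original error to the caller
END CATCH;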

Related

Should I use sp_getapplock to prevent multiple instances of a stored procedure that conditionally inserts?

Hear me out! I know this use case sounds suspect, but...
I have a stored procedure which checks a table (effectively a cache) for data for a given requested ID. If it doesn't find any data for that ID, or deems it out of date, it executes a second stored procedure which will pull data from a separate database (using dynamic SQL, source DB name is based on the requested ID) and insert it into the local table. It then selects from this table.
If the data is in the table, everything returns quickly (milliseconds), but if it needs to be brought in from the other database, it takes about 10 seconds. We're seeing race conditions where two concurrent instances check the local cache, see that something is missing, and queue up sequential ingestions of the remote data into the cache. To avoid double-insertion, the cache-populating procedure clears whatever is already there for this ID, but this just means the first instance of the procedure can end up selecting no rows, because the second instance deleted the just-inserted records before re-inserting them itself.
I think I want to put a lock around the entire procedure (checking the cache, potentially populating the cache, selecting from the cache), although I'm open to other solutions. I think the overall caching approach has to remain on-demand, though; the remote databases come and go by the hundreds, and we only want to cache the ones actually requested by reporting, as needed.
BEGIN TRANSACTION;
BEGIN TRY
-- Take out a lock intended to prevent anyone else modifying the cache while we're reading and potentially modifying it
EXEC sp_getapplock @Resource = '[private].[cache_entries]', @LockOwner = 'Transaction', @LockMode = 'Exclusive', @LockTimeout = 120000;
-- Invoke a stored procedure that ingests any required data that is not already cached
EXEC [private].populate_cache @required_dbs
-- CALCULATIONS
-- ... SELECT FROM [private].cache_entries
COMMIT TRANSACTION; -- Free the lock
END TRY
BEGIN CATCH --Ensure we release our lock on failure
ROLLBACK TRANSACTION;
THROW
END CATCH;
The alternative to sp_getapplock is to use locking hints within your transaction. Both are reasonable approaches. Locking hints can be complex, but they protect the target object itself rather than a single code path, so they are sometimes necessary. sp_getapplock (with 'Transaction' as the lock owner) is simple and reliable.
You can do this without sp_getapplock, which tends to inhibit concurrency a lot.
The way to do this is to continue to do your checks within a transaction, but to apply a HOLDLOCK hint as well as an UPDLOCK hint.
HOLDLOCK, which is equivalent to the SERIALIZABLE isolation level, places a lock not only on the ID you specify but also on the absence of such data; in other words, it prevents anyone else from inserting rows for that ID.
You must use both these hints, as well as have an index that matches that SELECT, otherwise you could run into major blocking and deadlocking problems due to full table scans.
Also, you don't need a CATCH and ROLLBACK. Just use SET XACT_ABORT ON;, which ensures a rollback in the event of any error.
SET XACT_ABORT ON; -- always have this set
BEGIN TRANSACTION;
DECLARE @SomeData nvarchar(100) = (
SELECT ce.SomeColumn
FROM [private].cache_entries ce WITH (HOLDLOCK, UPDLOCK)
WHERE ce.SomeCondition = 1
);
IF @SomeData IS NULL
BEGIN
-- Invoke a stored procedure that ingests any required data that is not already cached
EXEC [private].populate_cache @required_dbs
END
-- CALCULATIONS
-- ... SELECT FROM [private].cache_entries
COMMIT TRANSACTION; -- Free the lock
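As a hedged illustration of the supporting-index point above, something along these lines keeps the HOLDLOCK/UPDLOCK range lock narrow (the column names are taken from the placeholder snippet, so adjust them to the real schema):
-- Hypothetical index whose key matches the guarded SELECT
CREATE INDEX IX_cache_entries_SomeCondition
ON [private].cache_entries (SomeCondition)
INCLUDE (SomeColumn);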

Preventing deadlocks in SQL Server

I have an application connected to a SQL Server 2014 database that combines several rows into one. There are no other connections to this database while the application is running.
First, select a chunk of rows within a specific time span. This query uses a non-clustered seek (TIME column) merged with a clustered lookup.
select ...
from FOO
where TIME >= @from and TIME < @to and ...
Then we process these rows in C# and write the changes back as a single update and multiple deletes; this happens many times per chunk. These statements also use non-clustered index seeks.
begin tran
update FOO set ...
where NON_CLUSTERED_ID = @id
delete FOO where NON_CLUSTERED_ID in (@id1, @id2, @id3, ...)
commit
I am getting deadlocks when running this with multiple parallel chunks. I tried using ROWLOCK for the update and delete but that caused even more deadlocks than before for some reason, even though there are no overlaps between chunks.
Then I tried TABLOCKX, HOLDLOCK on the update, but that means I can't perform my select in parallel so I'm losing the advantages of parallelism.
Any idea how I can avoid deadlocks but still process multiple parallel chunks?
Would it be safe to use NOLOCK on my select in this case, given there is no row overlap between chunks? Then TABLOCKX, HOLDLOCK would only block the update and delete, correct?
Or should I just accept that deadlocks will happen and retry the query in my application?
UPDATE (additional information): All deadlocks so far have happened in the update and delete phase, none in the select. I'll try to get some deadlock logs up if I can't get this solved today (the correct trace flags weren't enabled before).
UPDATE: These are the two arrangements of deadlocks that occur with ROWLOCK; they both refer only to the delete statement and the non-clustered index it uses. I'm not sure if these are the same as the deadlocks that occur without any table hints, as I wasn't able to reproduce any of those.
Ask if there's anything else needed from the .xdl; I'm a bit wary of attaching the whole thing.
The general advice regarding deadlocks: make sure you do everything in the same order, i.e., have the different processes acquire locks in the same order.
You can find the same advice in this technical article on microsoft.com regarding Minimizing Deadlocks. There's a good reason it is listed first.
Access objects in the same order.
Avoid user interaction in transactions.
Keep transactions short and in one batch.
Use a lower isolation level.
Use a row versioning-based isolation level.
Set READ_COMMITTED_SNAPSHOT database option ON to enable read-committed transactions to use row versioning (see the example after this list).
Use snapshot isolation.
Use bound connections.
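For the two row-versioning options in that list, the database-level switches look roughly like this (the database name is a placeholder; switching them is a one-time operation best done during a quiet period):
ALTER DATABASE [YourDb] SET READ_COMMITTED_SNAPSHOT ON; -- read-committed readers use row versions instead of shared locks
ALTER DATABASE [YourDb] SET ALLOW_SNAPSHOT_ISOLATION ON; -- allows SET TRANSACTION ISOLATION LEVEL SNAPSHOT in sessions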
Update after question from Cato:
How would acquiring locks in the same order apply here? Have you got any advice on how he would change his SQL to do that?
Deadlocks are always the same, no matter what environment: two processes (say A & B) acquire multiple locks (say X & Y) in a different order so that A is waiting for Y and B is waiting for X while A is holding X and B is holding Y.
It applies here because DELETE and UPDATE statements implicitly acquire locks on the rows, index ranges, or table (depending on what the engine deems appropriate).
You should analyze your process and see if there are scenarios where locks could be acquired in a different order. If that doesn't reveal anything, you can analyze deadlocks using the SQL Server Profiler:
To trace deadlock events, add the Deadlock graph event class to a trace. This event class populates the TextData data column in the trace with XML data about the process and objects that are involved in the deadlock. SQL Server Profiler can extract the XML document to a deadlock XML (.xdl) file which you can view later in SQL Server Management Studio. You can configure SQL Server Profiler to extract Deadlock graph events to a single file that contains all Deadlock graph events, or to separate files.
I'd use sp_getapplock in the updating transaction to prevent multiple instances of this code running in parallel. This will not block the selecting statement as table locking hints do.
You should still program retry logic, because acquiring the lock may take longer than the timeout parameter allows.
This is how the updating transaction can be wrapped into sp_getapplock.
BEGIN TRANSACTION;
BEGIN TRY
DECLARE @VarLockResult int;
EXEC @VarLockResult = sp_getapplock
@Resource = 'some_unique_name_app_lock',
@LockMode = 'Exclusive',
@LockOwner = 'Transaction',
@LockTimeout = 60000,
@DbPrincipal = 'public';
IF @VarLockResult >= 0
BEGIN
-- Acquired the lock
update FOO set ...
where NON_CLUSTERED_ID = @id
delete FOO where NON_CLUSTERED_ID in (@id1, @id2, @id3, ...)
END ELSE BEGIN
-- return some error code, so that the caller could retry
END;
COMMIT TRANSACTION;
END TRY
BEGIN CATCH
ROLLBACK TRANSACTION;
-- handle the error
END CATCH;
The selecting statement doesn't need any changes.
I would recommend against NOLOCK, even though you say that the IDs in the chunks do not overlap. With this hint the SELECT query can skip pages that are being changed, or read some pages twice. It is unlikely that such behavior can be tolerated.
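To go with the retry advice above, here is a hedged sketch of a retry loop that catches the deadlock error (1205); the wrapper procedure name and the retry count are assumptions, not something from the question:
DECLARE @retry int = 0;
WHILE @retry < 3
BEGIN
    BEGIN TRY
        EXEC dbo.usp_ProcessChunk; -- hypothetical procedure wrapping the update/delete transaction above
        BREAK; -- success: leave the loop
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() <> 1205 OR @retry >= 2
            THROW; -- not a deadlock, or retries exhausted: surface the error
        SET @retry += 1; -- chosen as deadlock victim: back off briefly and retry
        WAITFOR DELAY '00:00:01';
    END CATCH
END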
Kindly use sp_getapplock in the following format in your code. The stored procedure sp_getapplock puts a lock on an application resource.
EXEC sp_getapplock
@Resource = 'storeprocedurename',
@LockMode = 'Exclusive',
@LockOwner = 'Transaction',
@LockTimeout = 25000
It is very helpful. Increasing @LockTimeout can also help reduce deadlocks.

Delete and Insert Inside one Transaction SQL

I just want to ask whether the first query will always be executed first when encapsulated in a transaction. For example, if I have 500k records to be deleted and 500k to be inserted, is there a possibility of locking?
I have actually already tested this query and it works fine, but I want to make sure my assumption is correct.
Note: this will delete and insert the same records, with possible updates to other columns.
BEGIN TRAN;
DELETE FROM [OUTPUT TABLE] WHERE ID IN (1, 2, 3, 4 /* etc. */)
INSERT INTO [OUTPUT TABLE] VALUES (1, 2, 3, 4 /* etc. */)
COMMIT TRAN;
Within a transaction all write locks (all locks acquired for modifications) must obey the strict two phase locking rule. One of the consequences is that a write (X) lock acquired in a transaction cannot be released until the transaction commits. So yes, the DELETE and INSERT will execute sequentially and all locks acquired during the DELETE will be retained while executing the INSERT.
Keep in mind that deleting 500k rows in a single transaction will escalate the locks to a table lock; see Lock Escalation.
Deleting 500k rows and inserting 500k rows in a single transaction, while perhaps correct, is a bad idea. You should avoid such large units of work and long transactions if possible. Long transactions pin the log in place, create blocking and contention, increase recovery and DB startup time, and increase SQL Server resource consumption (locks require memory).
You should consider doing the operation in small batches (perhaps 10,000 rows at a time, as sketched below), use MERGE instead of DELETE/INSERT (if possible) and, last but not least, consider a partitioned sliding window implementation; see How to Implement an Automatic Sliding Window in a Partitioned Table.
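A minimal sketch of the batching idea, using the question's placeholder table and the batch size suggested above; each small DELETE commits on its own, so locks are short-lived and escalation is unlikely:
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000)
    FROM [OUTPUT TABLE]
    WHERE ID IN (1, 2, 3, 4 /* etc. */);
    SET @rows = @@ROWCOUNT; -- stop once a batch deletes nothing
END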
From the documentation on TRANSACTION (emphasis mine):
BEGIN TRANSACTION represents a point at which the data referenced by a connection is logically and physically consistent. If errors are encountered, all data modifications made after the BEGIN TRANSACTION can be rolled back to return the data to this known state of consistency. Each transaction lasts until either it completes without errors and COMMIT TRANSACTION is issued to make the modifications a permanent part of the database, or errors are encountered and all modifications are erased with a ROLLBACK TRANSACTION statement.
BEGIN TRANSACTION starts a local transaction for the connection issuing the statement. Depending on the current transaction isolation level settings, many resources acquired to support the Transact-SQL statements issued by the connection are locked by the transaction until it is completed with either a COMMIT TRANSACTION or ROLLBACK TRANSACTION statement. Transactions left outstanding for long periods of time can prevent other users from accessing these locked resources, and also can prevent log truncation.
Although BEGIN TRANSACTION starts a local transaction, it is not recorded in the transaction log until the application subsequently performs an action that must be recorded in the log, such as executing an INSERT, UPDATE, or DELETE statement. An application can perform actions such as acquiring locks to protect the transaction isolation level of SELECT statements, but nothing is recorded in the log until the application performs a modification action.

Check if table data has changed?

I am pulling the data from several tables and then passing the data to a long running process. I would like to be able to record what data was used for the process and then query the database to check if any of the tables have changed since the process was last run.
Is there a method of solving this problem that should work across all SQL databases?
One possible solution that I've thought of is having a separate table that is only used for keeping track of whether the data has changed since the process was run. The table contains a "stale" flag. When I start running the process, stale is set to false. If any creation, update, or deletion occurs in any of the tables on which the operation depends, I set stale to true. Is this a valid solution? Are there better solutions?
One concern with my solution is situations like this:
One user starts inserting a new row into one of the tables. Stale gets set to true, but the new row has not actually been added yet. Another user has simultaneously started the long running process, pulling the data from the tables and setting the flag to false. The row is finally added. Now the data used for the process is out of date but the flag indicates it is not stale. Would transactions be able to solve this problem?
EDIT:
This is some SQL for my idea. Not sure if it works, but just to give you a better idea of what I was thinking:
-- First transaction reads the data and sets the flag to false
BEGIN TRANSACTION
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
UPDATE flag SET stale = 0
SELECT * FROM DATATABLE
COMMIT TRANSACTION

-- Second transaction updates the data and sets the flag to true
BEGIN TRANSACTION
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
UPDATE data SET val = 15 WHERE ID = 10
UPDATE flag SET stale = 1
COMMIT TRANSACTION
I do not have much experience with transactions or hand-written SQL, so there are probably issues with this. From what I understand, two serializable transactions cannot be interleaved. Please correct me if I'm wrong.
Is there a way to accomplish this with only the first transaction? The process will be run rarely, but the updates to the data table will occur more frequently, so it would be nice to not lock up the data table when performing updates.
Also, is the SET TRANSACTION ISOLATION syntax specific to MS?
The stale flag will probably work, but a timestamp would be better since it provides more metadata about the age of the records which could be used to tune your queries, e.g., only pull data that is over 5 minutes old.
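As a hedged sketch of the timestamp idea (the tracking table and column names here are illustrative, not from the question):
-- Track when each dependent table last changed
CREATE TABLE process_freshness (
    table_name sysname PRIMARY KEY,
    last_changed datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Writers touch the tracking row in the same transaction as the data change
UPDATE process_freshness
SET last_changed = SYSUTCDATETIME()
WHERE table_name = 'DATATABLE';

-- The long-running process notes when it started and later checks what changed since then
DECLARE @process_started_at datetime2 = SYSUTCDATETIME(); -- captured at the start of the run
SELECT table_name
FROM process_freshness
WHERE last_changed > @process_started_at;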
To address your concern about inserting a row at the same time a query is run, transactions with an appropriate isolation level will help. For row inserts, updates, and selects, at least use a transaction with an isolation level that prevents dirty reads so that no other connections can see the updated data until the transaction is committed.
If you are strongly concerned about the case where an update happens at the same time as a record pull, you could use the REPEATABLE READ or even SERIALIZABLE isolation levels, but this will slow DB access down.
Your SQL Server sample should work. For other databases, here's an example that works in Postgres:
Transaction 1
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- run queries that update the tables, then set last_updated column
UPDATE sometable SET last_updated = now() WHERE id = 1;
COMMIT;
Transaction 2
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- select data from tables, then set last_queried column
UPDATE sometable SET last_queried = now() WHERE id = 1;
COMMIT;
If transaction 1 starts, and then transaction 2 starts before transaction 1 has completed, transaction 2 will block on the update and will then throw an error when transaction 1 is committed. If transaction 2 starts first, and transaction 1 starts before that has finished, then transaction 1 will error. Your application code or process should be able to handle those errors.
Other databases use similar syntax - MySQL (with InnoDB plugin) requires you to set the isolation level before you start the transaction.

SQL Server Race Condition Question

(Note: this is for MS SQL Server)
Say you have a table ABC with a primary key identity column, and a CODE column. We want every row in here to have a unique, sequentially-generated code (based on some typical check-digit formula).
Say you have another table DEF with only one row, which stores the next available CODE (imagine a simple autonumber).
I know logic like below would present a race condition, in which two users could end up with the same CODE:
1) Run a select query to grab next available code from DEF
2) Insert said code into table ABC
3) Increment the value in DEF so it's not re-used.
I know that two users could get stuck at step 1 and could end up with the same CODE in the ABC table.
What is the best way to deal with this situation? I thought I could just wrap a "begin tran" / "commit tran" around this logic, but I don't think that worked. I had a stored procedure like this to test, but I didn't avoid the race condition when I ran it from two different windows in Management Studio:
begin tran
declare @x int
select @x = nextcode FROM def
waitfor delay '00:00:15'
update def set nextcode = nextcode + 1
select @x
commit tran
Can someone shed some light on this? I thought the transaction would prevent another user from being able to access my NextCodeTable until the first transaction completed, but I guess my understanding of transactions is flawed.
EDIT: I tried moving the wait to after the "update" statement, and I got two different codes... but I suspected that. I have the waitfor statement there to simulate a delay so the race condition can be easily seen. I think the key problem is my incorrect perception of how transactions work.
Set the Transaction Isolation Level to Serializable.
At lower isolation levels, other transactions can read data in a row that has been read (but not yet modified) in this transaction. So two transactions can indeed read the same value. At very low isolation (Read Uncommitted), other transactions can even read data after it has been modified (but before it is committed)...
Review details about SQL Server Isolation Levels here
So the bottom line is that the isolation level is the critical piece here, controlling what level of access other transactions get to the data touched by this one.
NOTE. From the link, about Serializable
Statements cannot read data that has been modified but not yet committed by other transactions.
This is because the locks are placed when the row is modified, not when the BEGIN TRAN occurs. So what you have done may still allow another transaction to read the old value up until the point where you modify it. I would therefore change the logic to modify the value in the same statement that reads it, thereby putting the lock on it at the same time.
begin tran
declare @x int
update def set @x = nextcode, nextcode += 1
waitfor delay '00:00:15'
select @x
commit tran
As other responders have mentioned, you can set the transaction isolation level to ensure that anything you 'read' using a SELECT statement cannot change within a transaction.
Alternatively, you could take out a lock specifically on the DEF table by adding the hint WITH (HOLDLOCK) after the table name, e.g.,
SELECT nextcode FROM DEF WITH (HOLDLOCK)
It doesn't make much difference here, as your transaction is small, but it can be useful to take out locks for some SELECTs and not others within a transaction. It's a question of 'repeatability versus concurrency'.
A couple of relevant MS SQL docs:
Isolation levels
Table hints
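A minimal sketch of that approach against the tables in the question; note that pairing UPDLOCK with HOLDLOCK is my addition (it stops two sessions from both taking shared locks on the row and then deadlocking on the update), and CODE is treated as an int here for simplicity:
BEGIN TRAN;
DECLARE @x int;
-- Read and reserve the next code; the hints keep the lock until COMMIT
SELECT @x = nextcode FROM DEF WITH (UPDLOCK, HOLDLOCK);
INSERT INTO ABC (CODE) VALUES (@x);
UPDATE DEF SET nextcode = nextcode + 1;
COMMIT TRAN;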
Late answer. You want to avoid a race condition...
"SQL Server Process Queue Race Condition"
Recap:
You began a transaction. This doesn't actually "do" anything in and of itself; it modifies subsequent behavior.
You read data from a table. At the default isolation level (Read Committed), the shared lock taken by this SELECT is released as soon as the read completes, so the value is not protected for the rest of the transaction.
You then wait 15 seconds.
You then issue an update. Within the declared transaction, this takes an exclusive lock that is held until the transaction is committed.
You then commit the transaction, releasing the lock.
So, guessing you ran this simultaneously in two windows (A and B):
A read the "next" value from table def, then went into wait mode
B read the same "next" value from the table, then went into wait mode. (Since A had only done a read, its transaction was not holding any lock on the row.)
A then updated the table, and probably committed the change before B exited the wait state.
B then updated the table, after A's write was committed.
Try putting the wait statement after the update, before the commit, and see what happens.
It's not a real race condition; it's more a common problem with concurrent transactions. One solution is to take out a read lock on the table and thereby serialize access.
This is actually a common problem in SQL databases, and that is why most (all?) of them have built-in features to take care of obtaining a unique identifier. Here are some things to look into if you are using MySQL or Postgres. If you are using a different database, I bet it provides something very similar.
A good example of this is Postgres sequences, which you can check out here:
Postgres Sequences
MySQL uses something called auto-increment.
Mysql auto increment
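Since the question is about SQL Server, it is worth adding as an aside (assuming SQL Server 2012 or later) that it has a comparable built-in feature, sequence objects:
-- A sequence hands out numbers atomically, so no separate DEF table is needed
CREATE SEQUENCE dbo.CodeSeq AS int START WITH 1 INCREMENT BY 1;

DECLARE @next int;
SET @next = NEXT VALUE FOR dbo.CodeSeq;
-- apply the check-digit formula to @next, then insert the resulting CODE into ABC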
You can set the column to a computed value that is persisted. This will take care of the race condition.
Persisted Computed Columns
NOTE
Using this method means you do not need to store the next code in a table. The code column becomes the reference point.
Implementation
Give the column the following properties under computed column specification.
Formula = dbo.GetNextCode()
Is Persisted = Yes
Create Function dbo.GetNextCode()
Returns VarChar(10)
As
Begin
Declare @Return VarChar(10);
Declare @MaxId Int;
Select @MaxId = Max(Id)
From [Table];
Select @Return = Code
From [Table]
Where Id = @MaxId;
/* Generate New Code ... */
Return @Return;
End