Assume a table structure of MyTable(MyTableId NVARCHAR(MAX) PRIMARY KEY, NumberOfInserts INTEGER).
I often need to either update i.e. increment a counter of an existing record, or insert a new record if it doesn't exist with a value of 0 for NumberOfInserts.
Essentially:
IF (MyTableId exists)
run UPDATE command
ELSE
run INSERT command
My concern is losing data due to race conditions, etc.
What's the safest way to write this?
I need it to be 100% accurate if possible, and willing to sacrifice speed where necessary.
MERGE statement can perform both UPDATE and INSERT (and DELETE if needed).
Even though it is a single atomic statement, it is important to use HOLDLOCK query hint to prevent race condition. There is a blog post “UPSERT” Race Condition With MERGE by Dan Guzman where he explains in great details how it works and provides a test script to verify it.
The query itself is straight-forward:
DECLARE #NewKey NVARCHAR(MAX) = ...;
MERGE INTO dbo.MyTable WITH (HOLDLOCK) AS Dst
USING
(
SELECT #NewKey AS NewKey
) AS Src
ON Src.NewKey = Dst.[Key]
WHEN MATCHED THEN
UPDATE
SET NumberOfInserts = NumberOfInserts + 1
WHEN NOT MATCHED THEN
INSERT
(
[Key]
,NumberOfInserts
)
VALUES
(
#NewKey
,0
);
Of course, you can also use explicit two-step approach with a separate check if a row exists and separate UPDATE and INSERT statements. Just make sure to wrap them all in a transaction with appropriate table locking hints.
See Conditional INSERT/UPDATE Race Condition by Dan Guzman for details.
Related
This question already has answers here:
Only inserting a row if it's not already there
(7 answers)
SQL MERGE statement to update data
(6 answers)
Closed 9 years ago.
I have a table "INSERTIF" which looks like this -
id value
S1 s1rocks
S2 s2rocks
S3 s3rocks
Before inserting a row into this table, I wanted to check if the given id exists or not. If it does not exist, then insert. Else, just update the value. I want to do this in a thread safe way. Can you tell me if my code is correct or not ? I tried it and it worked but, I want to be sure that I am not missing anything like performance issues.
EDIT 1- I want to use this code to insert millions of rows, one at a time. Each insert statement is wrapped around the code I have shown.
EDIT 2 - I do not want to use the UPDATE part of my code, only inserting is enough.
I do NOT want to use MERGE because it works with only SQL server 2008 and above
Thanks.
Code -
-- no check insert
INSERT INTO INSERTIF(ID,VALUE)
VALUES('S1', 's1doesNOTrock')
--insert with checking
begin tran /* default read committed isolation level is fine */
if not exists
(select * from INSERTIF with (updlock, rowlock, holdlock)
where ID = 'S1')
BEGIN
INSERT INTO INSERTIF(ID,VALUE)
VALUES('S1', 's1doesNOTrock')
END
else
/* update */
UPDATE INSERTIF
SET VALUE = 's1doesNOTrock'
WHERE ID = 'S1'
commit /* locks are released here */
Code to create table -
CREATE TABLE [dbo].[INSERTIF](
[id] [varchar](50) NULL,
[value] [varchar](50) NULL
)
INSERT [dbo].[INSERTIF] ([id], [value]) VALUES (N'S1', N's1rocks')
INSERT [dbo].[INSERTIF] ([id], [value]) VALUES (N'S2', N's2rocks')
INSERT [dbo].[INSERTIF] ([id], [value]) VALUES (N'S3', N's3rocks')
Your question is about the thread-safety of your code. Succinctly, no — it is not thread-safe. (But see below where isolation is discussed.)
You have a (smallish) window of vulnerability because of the TOCTOU (Time of Check, Time of Use) issue between your 'not exists' SELECT and the corresponding action. Assuming you have a unique (primary) key constraint on the id column, you should use the 'Easier to Ask Forgiveness than Permission' paradigm rather than the 'Look Before You Leap' paradigm (see EAFP vs LBYL).
That means you should determine which of two operation sequences you're going to use:
INSERT, but UPDATE if it fails.
UPDATE, but INSERT if no rows are updated.
Either works. If the work will be mostly insert and occasionally update, then 1 is better than 2; if the work will be mostly update with the occasional insert, then 2 is better than 1. You might even work adaptively; keep a track of what happened in the last N rows (where N might be as few as 5 or as many as 500) and use a heuristic to decide which to try on the new row. There still could be a problem if the INSERT fails (because the row existed) but the UPDATE updates nothing (because someone deleted the row after the insert failed). Similarly, there could still be a problem with UPDATE and INSERT too (no row existed, but one was inserted).
Note that the INSERT option depends wholly on a unique constraint to ensure that duplicate rows are not inserted; the UPDATE option is more reliable.
You also need to consider your isolation level — which might change the original answer. If your isolation is high enough to ensure that after the 'not exists' SELECT is executed, no-one else will be able to insert the row that you determined did not exist, then you may be OK. That gets into some nitty-gritty understanding of your DBMS (and I'm not an SQL Server expert).
You'll need to think about transaction boundaries, too; how big a transaction would be appropriate, especially if the source data has a million entries.
This technique is typically called UPSERT. It can be done in SQL Server using MERGE.
It works like this:
MERGE INTO A_Table
USING
(SELECT 'data_searched' AS Search_Col) AS SRC
-- Search predicates
--
ON A_Table.Data = SRC.Search_Col
WHEN MATCHED THEN
-- Update part of the 'UPSERT'
--
UPDATE SET
Data = 'data_searched_updated'
WHEN NOT MATCHED THEN
-- INSERT part of the 'UPSERT'
--
INSERT (Data)
VALUES (SRC.Search_Col);
Also see http://www.sergeyv.com/blog/archive/2010/09/10/sql-server-upsert-equivalent.aspx
EDIT: I see you're using an older SQL Server. In that case you must use multiple statements.
This question is related with Occasionally Getting SqlException: Timeout expired. Actually, I am using IF EXISTS... UPDATE .. ELSE .. INSERT heavily in my app. But user Remus Rusanu is saying that you should not use this. Why I should not use this and what danger it include. So, if I have
IF EXISTS (SELECT * FROM Table1 WHERE Column1='SomeValue')
UPDATE Table1 SET (...) WHERE Column1='SomeValue'
ELSE
INSERT INTO Table1 VALUES (...)
How to rewrite this statement to make it work?
Use MERGE
Your SQL fails because 2 concurrent overlapping and very close calls will both get "false" from the EXISTS before the INSERT happens. So they both try to INSERT, and of course one fails.
This is explained more here: Select / Insert version of an Upsert: is there a design pattern for high concurrency? THis answer is old though and applies before MERGE was added
The problem with IF EXISTS ... UPDATE ... (and IF NOT EXISTS ... INSERT ...) is that under concurrency multiple threads (transactions) will execute the IF EXISTS part and all reach the same conclusion (eg. it does not exists) and try to act accordingly. Result is that all threads attempt to INSERT resulting in a key violation. Depending on the code this can result in constraint violation errors, deadlocks, timeouts or worse (lost updates).
You need to ensure that the check IF EXISTS and the action are atomic. On pre SQL Server 2008 the solution involved using a transaction and lock hints and was very very error prone (easy to get wrong). Post SQL Server 2008 you can use MERGE, which will ensure proper atomicity as is a single statement and the engine understand what you're trying to do.
Sample MERGE statement in this case would be:
MERGE INTO Table1 t1
USING (SELECT 'SomeValue' as Column_id FROM dual) t2 ON
(t1.column_id = t2.column_id)
WHEN MATCHED THEN
UPDATE SET(...)
WHEN NOT MATCHED THEN
INSERT (t1.column_id)
VALUES ('SomeValue');
Merge, is in the first case created to compare 2 tables, so if that is the case you could use merge.
Take a look at the following, which is also another option.
In this case unfortunately you have the possibility of getting a problem with concurrency.
--Just update a row
UPDATE Table1 SET (...) WHERE Column1='SomeValue'
-- If 0 rows are updated ...
IF ##ROWCOUNT=0
--Insert Row
INSERT INTO Table1 VALUES (...)
In this blog its explained more.
Additionally this interesting blog is about concurrency.
Given:
customer[id BIGINT AUTO_INCREMENT PRIMARY KEY, email VARCHAR(30), count INT]
I'd like to execute the following atomically: Update the customer if he already exists; otherwise, insert a new customer.
In theory this sounds like a perfect fit for SQL-MERGE but the database I am using doesn't support MERGE with AUTO_INCREMENT columns.
https://stackoverflow.com/a/1727788/14731 seems to indicate that if you execute a query or update statement against a non-existent row, the database will lock the index thereby preventing concurrent inserts.
Is this behavior guaranteed by the SQL standard? Are there any databases that do not behave this way?
UPDATE: Sorry, I should have mentioned this earlier: the solution must use READ_COMMITTED transaction isolation unless that is impossible in which case I will accept the use of SERIALIZABLE.
This question is asked about once a week on SO, and the answers are almost invariably wrong.
Here's the right one.
insert customer (email, count)
select 'foo#example.com', 0
where not exists (
select 1 from customer
where email = 'foo#example.com'
)
update customer set count = count + 1
where email = 'foo#example.com'
If you like, you can insert a count of 1 and skip the update if the inserted rowcount -- however expressed in your DBMS -- returns 1.
The above syntax is absolutely standard and makes no assumption about locking mechanisms or isolation levels. If it doesn't work, your DBMS is broken.
Many people are under the mistaken impression that the select executes "first" and thus introduces a race condition. No: that select is part of the insert. The insert is atomic. There is no race.
Use Russell Fox's code but use SERIALIZABLE isolation. This will take a range lock so that the non-existing row is logically locked (together with all other non-existing rows in the surrounding key range).
So it looks like this:
BEGIN TRAN
IF EXISTS (SELECT 1 FROM foo WITH (UPDLOCK, HOLDLOCK) WHERE [email] = 'thisemail')
BEGIN
UPDATE foo...
END
ELSE
BEGIN
INSERT INTO foo...
END
COMMIT
Most code taken from his answer, but fixed to provided mutual exclusion semantics.
Answering my own question since there seems to be a lot of confusion around the topic. It seems that:
-- BAD! DO NOT DO THIS! --
insert customer (email, count)
select 'foo#example.com', 0
where not exists (
select 1 from customer
where email = 'foo#example.com'
)
is open to race-conditions (see Only inserting a row if it's not already there). From what I've been able to gather, the only portable solution to this problem:
Pick a key to merge against. This could be the primary key, or another unique key, but it must have a unique constraint.
Try to insert a new row. You must catch the error that will occur if the row already exists.
The hard part is over. At this point, the row is guaranteed to exist and you are protected from race-conditions by the fact that you are holding a write-lock on it (due to the insert from the previous step).
Go ahead and update if needed or select its primary key.
IF EXISTS (SELECT 1 FROM foo WHERE [email] = 'thisemail')
BEGIN
UPDATE foo...
END
ELSE
BEGIN
INSERT INTO foo...
END
I have a table that keeps a count of user actions. Each time an action is done, the value needs to increase. Since the user can have multiple sessions at the same time, the process needs to be atomic to avoid multi-user issues.
The table has 3 columns:
ActionCode as varchar
UserID as int
Count as int
I want to pass ActionCode and UserID to a function that will add a new row if one doesn't already exist, and set count to 1. If the row does exist, it will just increase the count by one. ActionCode and UserID make up the primary unique index for this table.
If all I needed to do was update, I could do something simple like this (because an UPDATE query is atomic already):
UPDATE (Table)
SET Count = Count + 1
WHERE ActionCode = #ActionCode AND UserID = #UserID
I'm new to atomic transactions in SQL. This question has probably been answered in multiple parts here, but I'm having trouble finding those and also placing those parts in one solution. This needs to be pretty fast as well, without getting to complex, because these actions may occur frequently.
Edit: Sorry, this might be a dupe of MySQL how to do an if exist increment in a single query. I searched a lot but had tsql in my search, once I changed to sql instead, that was the top result. It isn't obvious if that is atomic, but pretty sure it would be. I'll probably vote to delete this as dupe, unless someone thinks there can be some new value added by this question and answer.
Assuming you are on SQL Server, to make a single atomic statement you could use MERGE
MERGE YourTable AS target
USING (SELECT #ActionCode, #UserID) AS source (ActionCode, UserID)
ON (target.ActionCode = source.ActionCode AND target.UserID = source.UserID)
WHEN MATCHED THEN
UPDATE SET [Count] = target.[Count] + 1
WHEN NOT MATCHED THEN
INSERT (ActionCode, UserID, [Count])
VALUES (source.ActionCode, source.UserID, 1)
OUTPUT INSERTED.* INTO #MyTempTable;
UPDATE Use output to select the values if necessary. The code updated.
Using MERGE in SQL Server 2008 is probably the best bet. There is also another simple way to solve it.
If the UserID/Action doesn't exist, do an INSERT of a new row with a 0 for Count. If this statement fails due to it already being present (as inserted by another concurrent session just then), simply ignore the error.
If you want to do the insert and block while performing it to eliminate any chance of error, you can add some lock hints:
INSERT dbo.UserActionCount (UserID, ActionCode, Count)
SELECT #UserID, #ActionCode, 0
WHERE NOT EXISTS (
SELECT *
FROM dbo.UserActionCount WITH (ROWLOCK, HOLDLOCK, UPDLOCK)
WHERE
UserID = #UserID
AND ActionCode = #ActionCode
);
Then do the UPDATE with + 1 as in the usual case. Problem solved.
DECLARE #NewCount int,
UPDATE UAC
SET
Count = Count + 1,
#NewCount = Count + 1
FROM dbo.UserActionCount UAC
WHERE
ActionCode = #ActionCode
AND UserID = #UserID;
Note 1: The MERGE should be okay, but know that just because something is done in one statement (and therefore atomic) does not mean that it does not have concurrency problems. Locks are acquired and released over time throughout the lifetime of a query's execution. A query like the following WILL experience concurrency problems causing duplicate ID insertion attempts, despite being atomic.
INSERT T
SELECT (SELECT Max(ID) FROM Table) + 1, GetDate()
FROM Table T;
Note 2: An article I read by people experienced in super-high-transaction-volume systems said that they found the "try-it-then-handle-any-error" method to offer higher concurrency than acquiring and releasing locks. This may not be the case in all system designs, but it is at least worth considering. I have since searched for this article several times (including just now) and been unable to find it again... I hope to find it some day and reread it.
Incase anyone else needs the syntax to use this in a stored procedure and return the inserted/updated rows (I was surprised inserted.* also returns the updated rows, but it does). Here is what I ended up with. I forgot I had an additional column in my primary key (ActionKey), it is reflected below. Can also do "output inserted.Count" if you only want to return the Count, which is more practical.
CREATE PROCEDURE dbo.AddUserAction
(
#Action varchar(30),
#ActionKey varchar(50) = '',
#UserID int
)
AS
MERGE UserActions AS target
USING (SELECT #Action, #ActionKey, #UserID) AS source (Action, ActionKey, UserID)
ON (target.Action = source.Action AND target.ActionKey = source.ActionKey AND target.UserID = source.UserID)
WHEN MATCHED THEN
UPDATE SET [Count] = target.[Count] + 1
WHEN NOT MATCHED THEN
INSERT (Action, ActionKey, UserID, [Count])
VALUES (source.Action, source.ActionKey, source.UserID, 1)
output inserted.*;
We have a status table. When the status changes we currently delete the old record and insert a new.
We are wondering if it would be faster to do a select to check if it exists followed by an insert or update.
Although similar to the following question, it is not the same, since we are changing individual records and the other question was doing a total table refresh.
DELETE, INSERT vs UPDATE || INSERT
Since you're talking SQL Server 2008, have you considered MERGE? It's a single statement that allows you to do an update or insert:
create table T1 (
ID int not null,
Val1 varchar(10) not null
)
go
insert into T1 (ID,Val1)
select 1,'abc'
go
merge into T1
using (select 1 as ID,'def' as Val1) upd on T1.ID = upd.ID --<-- These identify the row you want to update/insert and the new value you want to set. They could be #parameters
when matched then update set Val1 = upd.Val1
when not matched then insert (ID,Val1) values (upd.ID,upd.Val1);
What about INSERT ... ON DUPLICATE KEY? First doing a select to check if a record exists and checking in your program the result of that creates a race condition. That might not be important in your case if there is only a single instance of the program however.
INSERT INTO users (username, email) VALUES ('Jo', 'jo#email.com')
ON DUPLICATE KEY UPDATE email = 'jo#email.com'
You can use ##ROWCOUNT and perform UPDATE. If it was 0 rows affected - then perform INSERT after, nothing otherwise.
Your suggestion would mean always two instructions for each status change. The usual way is to do an UPDATE and then check if the operation changed any rows (Most databases have a variable like ROWCOUNT which should be greater than 0 if something changed). If it didn't, do an INSERT.
Search for UPSERT for find patterns for your specific DBMS
Personally, I think the UPDATE method is the best. Instead of doing a SELECT first to check if a record already exists, you can first attempt an UPDATE but if no rows are affected (using ##ROWCOUNT) you can do an INSERT.
The reason for this is that sooner or later you might want to track status changes, and the best way to do this would be to keep an audit trail of all changes using a trigger on the status table.