Is INSERT ... SELECT an atomic transaction?

I use a query like this:
INSERT INTO table
SELECT * FROM table2 t2
JOIN ...
...
WHERE t2.date < now() - '1 day'::INTERVAL
FOR UPDATE OF t2 SKIP LOCKED
ON CONFLICT (...)
DO UPDATE SET ...
RETURNING *;
My question is about FOR UPDATE OF t2 SKIP LOCKED. Should I use it here? Or will Postgres lock these rows automatically with INSERT SELECT ON CONFLICT until the end of the transaction?
My goal is to prevent other apps from (concurrently) capturing rows with the inner SELECT which are already captured by this one.

Yes, FOR UPDATE OF t2 SKIP LOCKED is the right approach to prevent race conditions with default Read Committed transaction isolation.
The added SKIP LOCKED also prevents deadlocks. Be aware that competing transactions might each get a partial set from the SELECT - whatever it could lock first.
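The "partial set" effect can be illustrated with a small simulation. This is only a sketch in plain Python, not Postgres itself: the shared set `locked` stands in for row locks, the two worker names are hypothetical, and locks are never released because both simulated transactions are assumed to stay open.

```python
import threading

def claim_unlocked(rows, locked, guard):
    """Return the rows this worker managed to lock, skipping rows held by others."""
    claimed = []
    for row in rows:
        with guard:  # makes the check-and-mark step atomic, like taking a row lock
            if row in locked:
                continue  # SKIP LOCKED: do not wait, move on to the next row
            locked.add(row)
        claimed.append(row)
    return claimed

rows = list(range(1, 11))
locked = set()
guard = threading.Lock()
results = {}

def worker(name):
    results[name] = claim_unlocked(rows, locked, guard)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("app_a", "app_b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each row ends up with exactly one worker; how the rows are split depends on timing.
print(results)
```

No row is captured twice, but neither worker is guaranteed to see the full set.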
While any transaction is atomic in Postgres, it would not prevent another (also atomic) transaction from selecting (and inserting - or at least trying) the same row, because SELECT without FOR UPDATE does not take an exclusive lock.
The Postgres manual about transactions:
A transaction is said to be atomic: from the point of view of other transactions, it either happens completely or not at all.
Related:
Postgres UPDATE … LIMIT 1
Clarifications:
An SQL DML command like INSERT is always automatically atomic, since it cannot run outside a transaction. But you can't say that INSERT is a transaction. Wrong terminology.
In Postgres, locks taken by SQL commands are held until the end of the current transaction and released then.

Related

PostgreSQL select for update lock, new rows

I have the following concurrency use-case: An endpoint can be called at any time and an operation is supposed to happen. The operation goes like this in pseudocode (current isolation level is READ COMMITTED):
SELECT * FROM TABLE_A WHERE IS_LATEST=true FOR UPDATE
// DO SOME APP LOGIC TO TEST VALIDITY
// ALL GOES WELL => INSERT OR UPDATE NEW ROW WITH IS_LATEST=TRUE => COMMIT
// OTHERWISE => ROLLBACK (all good not interesting)
This approach with SELECT FOR UPDATE works when two of these operations start at the same time and only update rows. Both transactions see the same set of rows; one updates them while the second waits its turn at SELECT FOR UPDATE, and the state stays valid.
The issue arises when the first transaction inserts a row. For example, when the first transaction takes its lock with SELECT FOR UPDATE there are two rows. While it is still running, the second transaction arrives, issues its own SELECT FOR UPDATE (latest) and waits for the first transaction to finish. The first transaction commits, so there is now a third row in the database, but the second transaction picks up only two rows after waiting for the row locks to be released. (This is because the snapshot taken when SELECT FOR UPDATE was called contained only two rows matching IS_LATEST=true.)
Is there a way to make this transaction such that the SELECT lock picks up the latest snapshot after waiting?
The issue is that each command only sees rows that have been committed before the query started. There are various possible solutions ...
Stricter isolation level
You can solve this with a stricter isolation level, but that's relatively expensive.
Laurenz already provided a solution for this.
Just start a new command
Keep the (cheap) default isolation level READ COMMITTED, and just start a new command.
Only a few rows to lock
When locking only a handful of rows, the dead simple solution is to repeat the same SELECT ... FOR UPDATE. The second iteration sees newly committed rows and locks them additionally.
There is a theoretical race condition with additional transactions that might lock new rows before the waiting transaction does. That would result in a deadlock. Highly unlikely, but to be absolutely sure, lock rows in consistent order:
BEGIN; -- default READ COMMITTED
SELECT FROM table_a WHERE is_latest ORDER BY id FOR UPDATE; -- consistent order
SELECT * FROM table_a WHERE is_latest ORDER BY id FOR UPDATE; -- just repeat !!
-- DO SOME APP LOGIC TO TEST VALIDITY
-- pseudo-code
IF all_good
UPDATE table_a SET is_latest = true WHERE ...;
INSERT INTO table_a (is_latest, ...) VALUES (true, ...);
COMMIT;
ELSE
ROLLBACK;
END;
A partial index on (id) WHERE is_latest would be ideal.
More rows to lock
For more than a handful of rows, I would instead create a dedicated one-row token table. A bullet-proof implementation could look like this, run as admin or superuser:
CREATE TABLE public.single_task_x (just_me bool CHECK (just_me) PRIMARY KEY DEFAULT true);
INSERT INTO public.single_task_x VALUES (true);
REVOKE ALL ON public.single_task_x FROM public;
GRANT SELECT, UPDATE ON public.single_task_x TO public; -- or just to those who need it
See:
How to allow only one row for a table?
Then:
BEGIN; -- default READ COMMITTED
SELECT FROM public.single_task_x FOR UPDATE;
SELECT * FROM table_a WHERE is_latest; -- FOR UPDATE? ①
-- DO SOME APP LOGIC TO TEST VALIDITY
-- pseudo-code
IF all_good
UPDATE table_a SET is_latest = true WHERE ...;
INSERT INTO table_a (is_latest, ...) VALUES (true, ...);
COMMIT;
ELSE
ROLLBACK;
END;
A single lock is cheaper.
① You may or may not want to lock additionally, to defend against other writes, possibly with a weaker lock ....
Either way, all locks are released at the end of the transaction automatically.
Advisory lock
Or use an advisory lock. pg_advisory_xact_lock() persists for the duration of the transaction:
BEGIN; -- default READ COMMITTED
SELECT pg_advisory_xact_lock(123);
SELECT * FROM table_a WHERE is_latest;
-- do stuff
COMMIT; -- or ROLLBACK;
Make sure to use a unique token for your particular task. 123 in my example. Consider a look-up table if you have many different tasks.
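As an alternative to a look-up table, the application could derive the token from a task name. The following is a minimal sketch under that assumption: the function name `advisory_key` and the task name are hypothetical, and the result is only shaped to fit the bigint argument of pg_advisory_xact_lock(). Hash collisions between two task names are possible in principle, so a look-up table remains the exact option.

```python
import hashlib

def advisory_key(task_name: str) -> int:
    """Map a task name to a signed 64-bit integer, the range of a Postgres bigint."""
    digest = hashlib.sha256(task_name.encode("utf-8")).digest()
    # Take the first 8 bytes and interpret them as a signed big-endian integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

key = advisory_key("archive_old_files")  # hypothetical task name
print(key)
```

The same name always yields the same key, so every application instance that uses the same task name competes for the same lock.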
To release the lock at a different point in time (not when the transaction ends), consider a session-level lock with pg_advisory_lock(). Then you can (and must) unlock manually with pg_advisory_unlock() - or close the session.
Both of these wait for the locked resource. There are alternative functions returning false instead of waiting ...
With your method, the query in the second transaction will return an empty result after the lock is gone, because it sees is_latest = FALSE on the row in question, and the new row is not yet visible. So you would have to retry the transaction in that case.
I suggest that you use REPEATABLE READ isolation level and optimistic locking instead:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM table_a WHERE is_latest; -- no locks!
/* perform your application ruminations */
UPDATE table_a SET is_latest = FALSE WHERE id = <id you found above>;
INSERT INTO table_a (is_latest, ...) VALUES (TRUE, ...);
COMMIT;
Then three things may happen:
Your query finds a row, and the transaction succeeds.
Your query finds no row, then you could insert the first row.
The query finds a row, but the update of that row causes a serialization error.
In that case you know that a concurrent transaction interfered, and you repeat the complete transaction in response.
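The "repeat the complete transaction" step usually lives in application code. Here is a rough sketch of such a loop in Python, with everything database-related simulated: `SerializationFailure` is a hypothetical stand-in for whatever error your driver raises for SQLSTATE 40001, and `fake_transaction` pretends that a concurrent writer interferes twice.

```python
class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""

def run_with_retry(transaction, max_attempts=5):
    """Run `transaction` until it commits or the attempt budget is used up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction(attempt)
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and let the caller decide
            # A real transaction would be rolled back here before retrying.

# Simulated transaction: a concurrent writer interferes on the first two attempts.
def fake_transaction(attempt):
    if attempt < 3:
        raise SerializationFailure("could not serialize access due to concurrent update")
    return f"committed on attempt {attempt}"

outcome = run_with_retry(fake_transaction)
print(outcome)
```

The important part is that the whole transaction, including the initial SELECT, runs again on each attempt, so the application logic sees fresh data.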

Deadlock involving SELECT FOR UPDATE

I have a transaction with several queries. First, a SELECT that locks rows with FOR UPDATE:
SELECT f.source_id FROM files AS f WHERE
f.component_id = $1 AND
f.archived_at IS NULL
FOR UPDATE
Next, there is an update query:
UPDATE files AS f SET archived_at = NOW()
WHERE
hw_component_id = $1 AND
f.source_id = ANY($2::text[])
And then there is an insert:
INSERT INTO files AS f (
source_id,
...
)
VALUES (..)
ON CONFLICT (component_id, source_id) DO UPDATE
SET archived_at = null,
is_valid = excluded.is_valid
I have two application instances and sometimes I see deadlock errors in PostgreSQL log:
ERROR: deadlock detected
DETAIL: Process 3992939 waits for ShareLock on transaction 230221362; blocked by process 4108096.
Process 4108096 waits for ShareLock on transaction 230221365; blocked by process 3992939.
Process 3992939: SELECT f.source_id FROM files AS f WHERE f.component_id = $1 AND f.archived_at IS NULL FOR UPDATE
Process 4108096: INSERT INTO files AS f (source_id, ...) VALUES (..) ON CONFLICT (component_id, source_id) DO UPDATE SET archived_at = null, is_valid = excluded.is_valid
CONTEXT: while locking tuple (41116,185) in relation "files"
I assume that it may be caused by the ON CONFLICT DO UPDATE statement, which may update rows that are not locked by the previous SELECT FOR UPDATE.
But I can't understand how a SELECT ... FOR UPDATE query can cause a deadlock if it is the first query in the transaction. There are no queries before it.
Can SELECT ... FOR UPDATE statement lock several rows and then wait for other rows in condition to be unlocked?
SELECT FOR UPDATE is no safeguard against deadlocks. It just locks rows. Locks are acquired along the way, in the order instructed by ORDER BY, or in arbitrary order in the absence of ORDER BY. The best defense against deadlocks is to lock rows in consistent order across the whole transaction - and doing likewise in all concurrent transactions. Or, as the manual puts it:
The best defense against deadlocks is generally to avoid them by being
certain that all applications using a database acquire locks on
multiple objects in a consistent order.
Else, this can happen (row1, row2, ... are rows numbered according to the virtual consistent order):
T1: SELECT FOR UPDATE ... -- lock row2, row3
T2: SELECT FOR UPDATE ... -- lock row4, wait for T1 to release row2
T1: INSERT ... ON CONFLICT ... -- wait for T2 to release lock on row4
--> deadlock
Adding ORDER BY to your SELECT... FOR UPDATE may already avoid your deadlocks. (It would avoid the one demonstrated above.) Or this happens and you have to do more:
T1: SELECT FOR UPDATE ... -- lock row2, row3
T2: SELECT FOR UPDATE ... -- lock row1, wait for T1 to release row2
T1: INSERT ... ON CONFLICT ... -- wait for T2 to release lock on row1
--> deadlock
Everything within the transaction must happen in consistent order to be absolutely sure.
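Why consistent ordering helps can be shown outside the database with ordinary mutexes. This is a toy model in Python, not Postgres: each `threading.Lock` stands in for a row lock, the row ids and worker names are made up, and sorting the ids plays the role of ORDER BY.

```python
import threading

row_locks = {row_id: threading.Lock() for row_id in (1, 2, 3, 4)}
finished = []

def lock_rows_in_order(name, row_ids):
    """Acquire every requested row lock in ascending id order, then release them."""
    ordered = sorted(row_ids)  # same role as ORDER BY id in the SELECT ... FOR UPDATE
    for row_id in ordered:
        row_locks[row_id].acquire()
    try:
        finished.append(name)  # the "work" done while holding all locks
    finally:
        for row_id in reversed(ordered):
            row_locks[row_id].release()

# The two workers ask for overlapping rows in opposite orders.
t1 = threading.Thread(target=lock_rows_in_order, args=("T1", [2, 3, 4]))
t2 = threading.Thread(target=lock_rows_in_order, args=("T2", [4, 3, 2]))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)
print(sorted(finished))
```

Because both workers take the lowest id first, whichever gets row 2 proceeds through the rest unhindered while the other waits without holding anything the first one needs. Without the `sorted()` call, T1 could hold row 2 while T2 holds row 4, and each would wait for the other.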
Also, your UPDATE does not seem to be in line with the SELECT FOR UPDATE. component_id <> hw_component_id. Typo?
Also, f.archived_at IS NULL in the SELECT does not guarantee that the later SET archived_at = NOW() only affects these rows. You would have to add f.archived_at IS NULL to the WHERE clause of the UPDATE to be in line. (Seems like a good idea in any case?)
I assume that it may be caused by ON CONFLICT DO UPDATE statement,
which may update rows which are not locked by previous SELECT FOR UPDATE.
As long as the UPSERT (ON CONFLICT DO UPDATE) sticks to the consistent order, that wouldn't be a problem. But that may be hard or impossible to enforce.
Can SELECT ... FOR UPDATE statement lock several rows and then wait for other rows in condition to be unlocked?
Yes, as explained above, locks are acquired along the way. It may have to stop and wait halfway through.
NOWAIT
If all that still can't resolve your deadlocks, the slow and sure method is to use Serializable Isolation Level. Then you have to be prepared for serialization failures and retry the transaction in this case. Considerably more expensive overall.
Or it might be enough to add NOWAIT:
SELECT FROM files
WHERE component_id = $1
AND archived_at IS NULL
ORDER BY id -- whatever you use for consistent, deterministic order
FOR UPDATE NOWAIT;
The manual:
With NOWAIT, the statement reports an error, rather than waiting, if a selected row cannot be locked immediately.
You may even skip the ORDER BY clause with NOWAIT if you cannot establish consistent order with the UPSERT anyway.
Then you have to catch that error and retry the transaction. Similar to catching serialization failures, but much cheaper - and less reliable. For example, multiple transactions can still interlock with their UPSERT alone. But it gets less and less likely.
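On the application side, deciding which errors deserve a retry can be reduced to a check on the SQLSTATE code. The sketch below lists the three PostgreSQL error codes relevant to this discussion; the function name `is_retryable` is hypothetical, and how your driver exposes the SQLSTATE of an error is not shown here.

```python
# SQLSTATE codes from the PostgreSQL error code list that are worth retrying here.
RETRYABLE_SQLSTATES = {
    "40001": "serialization_failure",
    "40P01": "deadlock_detected",
    "55P03": "lock_not_available",  # raised by FOR UPDATE NOWAIT on a locked row
}

def is_retryable(sqlstate: str) -> bool:
    """Return True if the failed transaction should simply be run again."""
    return sqlstate in RETRYABLE_SQLSTATES

print(is_retryable("55P03"))
print(is_retryable("23505"))  # unique_violation: retrying will not help
```

Any other error, such as a unique violation, points to a real problem and should not be retried blindly.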

Deadlock when using SELECT FOR UPDATE

I noticed that concurrent execution of simple and identical queries similar to
BEGIN;
SELECT files.data FROM files WHERE files.file_id = 123 LIMIT 1 FOR UPDATE;
UPDATE files SET ... WHERE files.file_id = 123;
COMMIT;
leads to a deadlock, which is surprising to me since it looks like such queries should not create one. Also, it usually takes only milliseconds to complete such a request. During such a deadlock situation, if I run:
SELECT blockeda.pid AS blocked_pid, blockeda.query as blocked_query,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(blockingl.transactionid=blockedl.transactionid
AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
WHERE NOT blockedl.granted;
I see both of my identical SELECT statements listed for blocked_pid and blocking_pid for the whole duration of the deadlock.
So my question is: Is it normal and expected for queries that try to select same row FOR UPDATE to cause deadlock? And if so, what is the best strategy to avoid deadlocking in this scenario?
Your commands are contradictory.
If files.file_id is defined UNIQUE (or PRIMARY KEY), you don't need LIMIT 1. And you don't need explicit locking at all. Just run the UPDATE, since only a single row is affected in the whole transaction, there cannot be a deadlock. (Unless there are side effects from triggers or rules or involved functions.)
If files.file_id is not UNIQUE (like it seems), then the UPDATE can affect multiple rows in arbitrary order and only one of them is locked, a recipe for deadlocks. The more immediate problem would then be that the query does not do what you seem to want to begin with.
The best solution depends on missing information. This would work:
UPDATE files
SET ...
WHERE primary_key_column = (
SELECT primary_key_column
FROM files
WHERE file_id = 123
LIMIT 1
-- FOR UPDATE SKIP LOCKED
);
No BEGIN; and COMMIT; needed for the single command, while default auto-commit is enabled.
You might want to add FOR UPDATE SKIP LOCKED (or FOR UPDATE NOWAIT) to either skip or report an error if the row is already locked.
And you probably want to add a WHERE clause that avoids processing the same row repeatedly.
More here:
Postgres UPDATE … LIMIT 1

Does "SELECT FOR UPDATE" prevent other connections inserting when the row is not present?

I'm interested in whether a SELECT FOR UPDATE query will lock a non-existent row.
Example
Table FooBar with two columns, foo and bar, foo has a unique index.
Issue query SELECT bar FROM FooBar WHERE foo = ? FOR UPDATE
If the first query returns zero rows, issue a query
INSERT INTO FooBar (foo, bar) values (?, ?)
Now is it possible that the INSERT would cause an index violation or does the SELECT FOR UPDATE prevent that?
Interested in behavior on SQLServer (2005/8), Oracle and MySQL.
MySQL
SELECT ... FOR UPDATE with UPDATE
Using transactions with InnoDB (auto-commit turned off), a SELECT ... FOR UPDATE allows one session to temporarily lock down a particular record (or records) so that no other session can update it. Then, within the same transaction, the session can actually perform an UPDATE on the same record and commit or roll back the transaction. This would allow you to lock down the record so no other session could update it while perhaps you do some other business logic.
This is accomplished with locking. InnoDB utilizes indexes for locking records, so locking an existing record seems easy--simply lock the index for that record.
SELECT ... FOR UPDATE with INSERT
However, to use SELECT ... FOR UPDATE with INSERT, how do you lock an index for a record that doesn't exist yet? If you are using the default isolation level of REPEATABLE READ, InnoDB will also utilize gap locks. As long as you know the id (or even range of ids) to lock, then InnoDB can lock the gap so no other record can be inserted in that gap until we're done with it.
If your id column were an auto-increment column, then SELECT ... FOR UPDATE with INSERT INTO would be problematic because you wouldn't know what the new id was until you inserted it. However, since you know the id that you wish to insert, SELECT ... FOR UPDATE with INSERT will work.
CAVEAT
On the default isolation level, SELECT ... FOR UPDATE on a non-existent record does not block other transactions. So, if two transactions both do a SELECT ... FOR UPDATE on the same non-existent index record, they'll both get the lock, and neither transaction will be able to update the record. In fact, if they try, a deadlock will be detected.
Therefore, if you don't want to deal with a deadlock, you might just do the following:
INSERT INTO ...
Start a transaction, and perform the INSERT. Do your business logic, and either commit or rollback the transaction. As soon as you do the INSERT on the non-existent record index on the first transaction, all other transactions will block if they attempt to INSERT a record with the same unique index. If the second transaction attempts to insert a record with the same index after the first transaction commits the insert, then it will get a "duplicate key" error. Handle accordingly.
SELECT ... LOCK IN SHARE MODE
If you select with LOCK IN SHARE MODE before the INSERT, if a previous transaction has inserted that record but hasn't committed yet, the SELECT ... LOCK IN SHARE MODE will block until the previous transaction has completed.
So to reduce the chance of duplicate key errors, especially if you hold the locks for awhile while performing business logic before committing them or rolling them back:
SELECT bar FROM FooBar WHERE foo = ? LOCK IN SHARE MODE
If no records returned, then
INSERT INTO FooBar (foo, bar) VALUES (?, ?)
In Oracle, SELECT ... FOR UPDATE has no effect on a non-existent row (the statement simply raises a No Data Found exception). The INSERT statement will prevent duplicates of unique/primary key values. Any other transactions attempting to insert the same key values will block until the first transaction commits (at which time the blocked transaction will get a duplicate key error) or rolls back (at which time the blocked transaction continues).
On Oracle:
Session 1
create table t (id number);
alter table t add constraint pk primary key(id);
SELECT *
FROM t
WHERE id = 1
FOR UPDATE;
-- 0 rows returned
-- this creates row level lock on table, preventing others from locking table in exclusive mode
Session 2
SELECT *
FROM t
FOR UPDATE;
-- 0 rows returned
-- there are no problems with locking here
rollback; -- releases lock
INSERT INTO t
VALUES (1);
-- 1 row inserted without problems
I wrote a detailed analysis of this thing on SQL Server: Developing Modifications that Survive Concurrency
Anyway, you need to use SERIALIZABLE isolation level, and you really need to stress test.
SQL Server only has the FOR UPDATE as part of a cursor. And, it only applies to UPDATE statements that are associated with the current row in the cursor.
So, the FOR UPDATE has no relationship with INSERT. Therefore, I think your answer is that it's not applicable in SQL Server.
Now, it may be possible to simulate the FOR UPDATE behavior with transactions and locking strategies. But, that may be more than what you're looking for.

SQL Server locks - avoid insertion of duplicate entries

After reading a lot of articles and many answers related to the above subject, I am still wondering how the SQL Server database engine works in the following example:
Let's assume that we have a table named t3:
create table t3 (a int , b int);
create index test on t3 (a);
and a query as follow:
INSERT INTO T3
SELECT -86,-86
WHERE NOT EXISTS (SELECT 1 FROM t3 where t3.a=-86);
The query inserts a line in the table t3 after verifying that the row does not already exist based on the column "a".
Many articles and answers indicate that using the above query there is no way that a row will be inserted twice.
For the execution of the above query, I assume that the database engine works as follow:
The subquery is executed first.
The database engine sets a shared(s) lock on a range.
The data is read.
The shared lock is released. According to MSDN a shared lock is released as soon as the data has been read.
If a row does not exist it inserts a new line in the table.
The new line is locked with an exclusive lock (x)
Now consider the following scenario:
The above query is executed by processor A (SPID 1).
The same query is executed by a processor B (SPID 2).
[SPID 1] The database engine sets a shared(s) lock
[SPID 1] The subquery reads the data. No rows are returned.
[SPID 1] The shared lock is released.
[SPID 2] The database engine sets a shared(s) lock
[SPID 2] The subquery reads the data. No rows are returned.
[SPID 2] The shared lock is released.
Both processes proceed with a row insertion (and we get a duplicate entry).
Am I missing something? Is the above way a correct way for avoiding duplicate entries?
A safe way to avoid duplicate entries is using the code below, but I am just wondering whether the above method is correct.
begin tran
if not exists (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
begin
INSERT INTO T3
SELECT -86,-86
end
commit
If you just have a unique constraint on the column, you'll never have duplicates.
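The effect of a unique constraint can be tried out quickly without a SQL Server instance. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, so it only shows the constraint rejecting the second insert and says nothing about SQL Server's locking; the index name `t3_a_unique` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t3 (a INTEGER, b INTEGER)")
conn.execute("CREATE UNIQUE INDEX t3_a_unique ON t3 (a)")  # hypothetical index name

conn.execute("INSERT INTO t3 (a, b) VALUES (-86, -86)")
try:
    # Second insert with the same value of "a" violates the unique index.
    conn.execute("INSERT INTO t3 (a, b) VALUES (-86, -86)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT count(*) FROM t3 WHERE a = -86").fetchone()[0]
print(duplicate_rejected, row_count)
```

Whatever the application does, the database refuses the duplicate, and the loser of a race only has to handle the error.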
The technique you've outlined will avoid you having to catch an error or an exception in the case of the (second "simultaneous") operation failing.
I'd like to add that relying on "outer" code (even T-SQL) to enforce your database consistency is not a great idea. In all cases, using declarative referential integrity at the table level is important for the database to ensure consistency and matching expectations, regardless of whether application code is written well or not. As in security, you need to utilize a strategy of defense in depth - constraints, unique indexes, triggers, stored procedures, and views can all assist in making a multi-layered approach to ensure the database presents a consistent and reliable interface to the application or system.
To keep locks between multiple statements, they have to be wrapped in a transaction. In your example:
If not exists (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
INSERT INTO T3 SELECT -86,-86
The update lock can be released before the insert is executed. This would work reliably:
begin transaction
If not exists (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
INSERT INTO T3 SELECT -86,-86
commit transaction
Single statements are always wrapped in a transaction, so this would work too:
INSERT INTO T3 SELECT -86,-86
WHERE NOT EXISTS (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
(This is assuming you have "implicit transactions" turned off, like the default SQL Server setting.)