Can an INSERT-SELECT query be subject to race conditions? - sql

I have this query that attempts to add rows to the balances table if no corresponding row exists in the totals table. The query is run in a transaction using the default isolation level on PostgreSQL.
INSERT INTO balances (account_id, currency, amount)
SELECT t.account_id, t.currency, 0
FROM balances AS b
RIGHT OUTER JOIN totals AS t USING (account_id, currency)
WHERE b.id IS NULL
I have a UNIQUE constraint on balances (account_id, currency). I'm worried about a race condition that will lead to duplicate key errors if multiple sessions execute this query concurrently. I've seen many questions on that subject, but they all seem to involve either subqueries, multiple queries or PL/pgSQL functions.
Since I'm not using any of those in my query, is it free from race conditions? If it isn't, how can I fix it?

Yes, it will fail with a "duplicate key value violates unique constraint" error. What I do is place the insertion code in a try/except block, catch the exception when it is thrown, and retry. It's that simple. Unless the application has a huge number of users, this will work flawlessly.
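For illustration, a minimal PL/pgSQL sketch of that catch-and-retry idea wrapped around the query from the question (an application-side try/except around the whole transaction amounts to the same thing):
DO $$
BEGIN
    INSERT INTO balances (account_id, currency, amount)
    SELECT t.account_id, t.currency, 0
    FROM balances AS b
    RIGHT OUTER JOIN totals AS t USING (account_id, currency)
    WHERE b.id IS NULL;
EXCEPTION WHEN unique_violation THEN
    -- a concurrent session inserted the same (account_id, currency) first;
    -- rerunning the block is safe, because the anti-join then skips that pair
    RAISE NOTICE 'duplicate key caught: %', SQLERRM;
END
$$;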
In your query, the default isolation level is enough, since it is a single INSERT statement and there is no risk of phantom reads.
Notice that even when setting the isolation level to SERIALIZABLE, the try/except block is not avoidable. From the manual about SERIALIZABLE:
like the Repeatable Read level, applications using this level must be prepared to retry transactions due to serialization failures

The default transaction isolation level is Read Committed. Phantom reads are possible at this level (see Table 13.1 in the documentation). While you are protected from seeing any weird effects in the totals table were you to update the totals, you are not protected from phantom reads in the balances table.
What this means is easiest to see with a transaction similar to yours that runs the outer-join query twice (and only queries, without inserting anything). The fact that a balance is missing is not guaranteed to stay true between the two "peeks" at the balances table. The sudden appearance of a balance that wasn't there when the same transaction looked the first time is called a "phantom read".
In your case, several concurrent statements can see that a balance is missing, and nothing prevents them all from trying to insert it, with all but one erroring out.
To rule out phantom reads (and to fix your query), you need to run at the SERIALIZABLE isolation level; set it before running your query:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
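For example (a sketch; SET TRANSACTION has to come before the first query of the transaction, and the client must still be prepared to retry on a serialization failure, SQLSTATE 40001):
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- the INSERT ... SELECT from the question goes here
COMMIT;
-- if the INSERT or the COMMIT fails with SQLSTATE 40001, roll back and rerun the whole transaction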

Related

How can this SQL command cause phantom read?

In short, my professor said the following transactions are susceptible to a phantom read if they're both left at the default isolation level (READ COMMITTED):
BEGIN;
UPDATE product SET price = 100 WHERE type='Electronics';
COMMIT;
BEGIN;
UPDATE product SET price = 100 WHERE price < 100;
COMMIT;
I can't really figure out how a phantom read could happen here.
He also said that, to fix this, you'd have to set the second transaction to REPEATABLE READ.
So... why? How could a phantom read happen here, and why does REPEATABLE READ fix it?
EDIT: could this be the case?
Say we have an initial product P that has type=Electronics AND price=1049
T1 would begin, and add P to the set of rows to consider.
T2 would begin, and ignore P (its price is below 1050).
T1 would increment its price to 1100, and COMMITs.
Now T2 should update its rows set and include P.
But since in READ COMMITTED a transaction gets an updated snapshot only if changes are made to rows within the set it is already considering, the change goes unnoticed.
T2, therefore, simply ignores P, and COMMITs.
This scenario was suggested by the example I found in the PostgreSQL docs, on the transaction isolation page, in the Read Committed paragraph.
Do you think this is a possible scenario and, hopefully, what my professor meant?
A phantom read means that if you run the same SELECT twice in a transaction, the second one could get different results than the first.
In the words of the SQL standard:
SQL-transaction T1 reads the set of rows N that satisfy some <search condition>.
SQL-transaction T2 then executes SQL-statements that generate one or more rows that satisfy the <search condition> used by SQL-transaction T1.
If SQL-transaction T1 then repeats the initial read with the same
<search condition>, it obtains a different collection of rows.
This can happen at low isolation levels as a result of concurrent data modifications like the ones you quote, because each query sees all committed data, even data that was committed after the transaction started.
You could also speak of phantom reads in the context of an UPDATE statement, since it also reads from the table. Then the same UPDATE can affect different rows if it is run twice.
However, it makes no sense to speak of phantom reads in the context of the two statements in your question: The second one modifies the column it is searching for, so the second execution will read different rows, no matter if there are concurrent data modifications or not.
Note: The SQL standard does not require that REPEATABLE READ transactions prevent phantom reads — this is only guaranteed with SERIALIZABLE isolation.
In PostgreSQL phantom reads are already impossible at REPEATABLE READ isolation, because it uses snapshots that guarantee a stable view of the database.
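A quick way to see this, assuming the product table from the question has just the type and price columns (two sessions, interleaved as commented):
-- session 1
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM product WHERE price < 100;  -- the snapshot is taken by this first query

-- session 2, in a separate connection (autocommit)
INSERT INTO product (type, price) VALUES ('Electronics', 50);

-- session 1 again
SELECT count(*) FROM product WHERE price < 100;  -- same count as before: no phantom row
COMMIT;
-- under READ COMMITTED, the second SELECT would already see the new row (the phantom)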
This might help
https://en.wikipedia.org/wiki/Isolation_(database_systems)
Non-repeatable reads
A non-repeatable read occurs, when during the course of a transaction, a row is retrieved twice and the values within the row differ between reads.
Non-repeatable reads phenomenon may occur in a lock-based concurrency control method when read locks are not acquired when performing a SELECT, or when the acquired locks on affected rows are released as soon as the SELECT operation is performed. Under the multiversion concurrency control method, non-repeatable reads may occur when the requirement that a transaction affected by a commit conflict must roll back is relaxed.
A phantom read occurs when, in the course of a transaction, new rows are added or removed by another transaction to the records being read. This can occur when range locks are not acquired on performing a SELECT ... WHERE operation. The phantom reads anomaly is a special case of Non-repeatable reads when Transaction 1 repeats a ranged SELECT ... WHERE query and, between both operations, Transaction 2 creates (i.e. INSERT) new rows (in the target table) which fulfil that WHERE clause.

What is the difference between inconsistent analysis and non-repeatable reads?

I've seen a lot of comparisons of inconsistent analysis to dirty reads and of non-repeatable reads to dirty reads, but I can't seem to grasp the difference between an inconsistent (incorrect) analysis and a non-repeatable read.
Is there a better way to explain this?
My confusion comes from the fact that both involve multiple reads within a transaction while a second (or third) transaction makes updates that are committed.
Incorrect analysis - the data read by the second transaction was committed by the transaction that made the change. Inconsistent analysis involves multiple reads (two or more) of the same row and each time the information is changed by another transaction, thus producing different results each time, and hence inconsistent.
Whereas
Non Repeatable Reads occur when one transaction attempts to access the same data twice and a second transaction modifies the data between the first transaction's read attempts. This may cause the first transaction to read two different values for the same data, causing the original read to be non-repeatable.
I can't quite figure out how they are different.
Thank you.
In my view, the following is an example of an inconsistent analysis, but not a non-repeatable read.
The example uses a table for bank accounts:
CREATE TABLE ACCOUNT(
NO NUMERIC(10) NOT NULL PRIMARY KEY,
BALANCE NUMERIC(9,2) NOT NULL);
INSERT INTO ACCOUNT VALUES (1, 100.00);
INSERT INTO ACCOUNT VALUES (2, 200.00);
For performance reasons, the bank has another table that stores the sum of all account balances redundantly (therefore, this table has always only one row):
CREATE TABLE TOTAL(AMOUNT NUMERIC(9,2) NOT NULL);
INSERT INTO TOTAL SELECT SUM(BALANCE) FROM ACCOUNT;
Suppose that transaction A wants to check whether the redundant sum is indeed correct. It first computes the sum of the account balances:
START TRANSACTION; -- PostgreSQL and SQL Standard, switches autocommit off
SELECT SUM(BALANCE) FROM ACCOUNT;
Now the owner of account 1 makes a deposit of 50 dollars. This is done in transaction B:
START TRANSACTION;
UPDATE ACCOUNT SET BALANCE = BALANCE + 50 WHERE NO = 1;
UPDATE TOTAL SET AMOUNT = AMOUNT + 50;
COMMIT;
Finally, transaction A continues and reads the redundant sum:
SELECT AMOUNT FROM TOTAL;
It will see the increased value, which is different from the sum it computed (probably causing a false alarm).
In this example, transaction A did not read any table row twice, therefore this cannot be a non-repeatable read. However, it did not see a unique database state - some part of the information was from the old state before the update of transaction B, and some part from the new state after the update.
But this is certainly very closely related to a non-repeatable read: if A had read the ACCOUNT rows again, that would have been a non-repeatable read. It seems that the same internal mechanisms that prevent non-repeatable reads also prevent this problem: one could keep read rows locked until the end of the transaction, or use multi-version concurrency control with the version from the beginning of the transaction.
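In PostgreSQL, for instance, running transaction A at REPEATABLE READ is enough, because both reads then use the same snapshot (a sketch of the scenario above):
START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT SUM(BALANCE) FROM ACCOUNT;  -- the snapshot is fixed by this first query
-- ... transaction B makes its deposit and commits here ...
SELECT AMOUNT FROM TOTAL;          -- still the pre-deposit total, consistent with the sum above
COMMIT;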
However, there is also one nice solution here, namely to get all data in one query. At least Oracle and PostgreSQL guarantee that a single query is evaluated with respect to only one state of the database:
SELECT SUM(BALANCE) AS AMOUNT, 'SUM' AS PART FROM ACCOUNT
UNION ALL
SELECT AMOUNT, 'TOTAL' AS PART FROM TOTAL;
In a formal model of transaction schedules with reads and writes this also looks very similar to a non-repeatable read: first A does read(x), then B does write(x) and write(y), and then A does read(y). If only a single object were involved, this would be a non-repeatable read.

Microsoft SQL, understand transaction locking on low level

I had a weird case when I got a deadlock between a select statement that joined two tables and a transaction that performed multiple updates on those two tables.
I am coming from the Java world, so I thought that using a transaction would lock all the tables involved in it. What I understand now is that a lock is only requested when you actually access a table from your transaction, and if someone else is doing a heavy select on that table at the same time you might get a deadlock. To be fair, I also have to say that I had multiple connections making the same sequence of calls, where each performed a heavy query on two tables and then created a transaction to update those tables, so whatever edge case you might think of, there is a big chance I've been running into it.
With that being said, can you please provide a low-level explanation of the situations in which you might get a deadlock between a select statement and a transaction?
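For concreteness, the pattern described above looks roughly like this (TableA, TableB and the column names are hypothetical stand-ins for the real schema):
-- connection 1: the "heavy" select that joins both tables
SELECT a.id, b.total
FROM TableA AS a
JOIN TableB AS b ON b.a_id = a.id;

-- connection 2: the transaction with multiple updates on the same tables
BEGIN TRANSACTION;
UPDATE TableA SET total = total + 1 WHERE id = 42;
UPDATE TableB SET total = total + 1 WHERE a_id = 42;
COMMIT;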

Isolation level for two statements on the same transaction vs. single statement

I have the following table:
DROP TABLE IF EXISTS event;
CREATE TABLE event(
kind VARCHAR NOT NULL,
num INTEGER NOT NULL
);
ALTER TABLE event ADD PRIMARY KEY (kind, num);
The idea is that I want to use the num column to maintain separate increment counters for each kind of event. So num is like a dedicated auto-increment sequence for each different kind.
Assuming multiple clients/threads writing events (potentially of the same kind) into that table concurrently, is there any difference in terms of the required level for transaction isolation between: (a) executing the following block:
BEGIN TRANSACTION;
DO
$do$
DECLARE
    nextNum INTEGER;
BEGIN
    SELECT COALESCE(MAX(num), -1) + 1 INTO nextNum FROM event WHERE kind = 'A';
    INSERT INTO event(kind, num) VALUES ('A', nextNum);
END;
$do$;
COMMIT;
... and (b) combining the select and insert into a single statement:
INSERT INTO event(kind, num)
(SELECT 'A', COALESCE(MAX(num),-1)+1 FROM event WHERE kind='A');
From some tests I ran, it seems that in both cases I need the SERIALIZABLE transaction isolation level. What's more, even with the SERIALIZABLE isolation level, my code has to be prepared to retry due to the following error in highly concurrent situations:
ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during commit attempt.
HINT: The transaction might succeed if retried.
In other words, merging the SELECT into the INSERT does not seem to confer any benefit in terms of atomicity, or to allow any lower/more lenient transaction isolation level to be set. Am I missing anything?
(This question is broadly related in that it asks for a PostgreSQL pattern to facilitate the generation of multiple sequences inside a table. So, just to be clear, I am not asking for the right pattern to do that sort of thing; I just want to understand whether the block of two statements is in any way different from a single merged INSERT/SELECT statement.)
The problem with the task is possible concurrent write access. Merging SELECT and INSERT into one statement reduces the time frame for possible conflicts to a minimum and is the superior approach in any case. The potential for conflicts is still there. And yes, serializable transaction isolation is one possible (if expensive) solution.
Typically (but that's not what you are asking), the best solution is not to attempt what you are trying. Gapless sequential numbers are a pain in databases with concurrent write access. If possible, use a serial column instead, which gives you unique ascending numbers, with possible gaps. You can eliminate the gaps later, or dynamically in a VIEW. Details:
Serial numbers per group of rows for compound key
Aside: you don't need parentheses around the SELECT:
INSERT INTO event(kind, num)
SELECT 'A', COALESCE(MAX(num) + 1, 0) FROM event WHERE kind='A';
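A sketch of that serial-plus-view alternative, with illustrative names (event2, event2_numbered): inserts never conflict, and the view derives gapless per-kind numbers on the fly.
CREATE TABLE event2 (
    id   bigserial PRIMARY KEY,  -- unique and ascending, but may contain gaps
    kind varchar NOT NULL
);

CREATE VIEW event2_numbered AS
SELECT id, kind,
       row_number() OVER (PARTITION BY kind ORDER BY id) - 1 AS num  -- 0-based, like the original
FROM event2;

-- concurrent writers never block each other or error out:
INSERT INTO event2 (kind) VALUES ('A');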

Minimizing deadlocks with purposely contrived + highly concurrent transactions?

I'm currently working on benchmarking different isolation levels in SQL Server 2008, but right now I'm stuck on what seems to be a trivial deadlocking problem that I can't figure out. Hopefully someone here can offer advice (I'm a novice at SQL).
I currently have two types of transactions (to demonstrate dirty reads, but that's irrelevant):
Transaction Type A: Select all rows from Table A.
Transaction Type B: Set value 'cost' = 0 in all rows in Table A, then rollback immediately.
I currently run a threadpool of 1000 threads and 10,000 transactions, where each thread randomly chooses between executing Transaction Type A and Transaction Type B. However, I'm getting a ton of deadlocks even with forced row locking.
I assume that the deadlocks are occurring because of the order in which the row locks are acquired -- that is, if both Type A and Type B 'scan' Table A in the same order, e.g. from top to bottom, such deadlocks cannot occur. However, I'm having trouble figuring out how to get SQL Server to maintain row ordering during SELECT and UPDATE statements.
Any tips? First time poster to stackoverflow, so please be gentle :-)
EDIT: The isolation level is purposely set to READ COMMITTED to show that it eliminates dirty reads (and it does). Deadlocks only occur at any level equal to or higher than READ COMMITTED; obviously no deadlocks occur at READ UNCOMMITTED.
EDIT 2: These transactions are being run on a fresh instance of AdventureWorks LT on SQL Server 2008R2.
If you start a transaction that updates all the rows (type B) and then roll it back, the locks on all of those rows need to be held for the entire transaction. Even though you have row-level locks, each lock is still held until the transaction ends.
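A sketch of the two transaction types (TableA stands in for the real table) that makes the lock lifetime explicit:
-- transaction type B: every lock taken by the UPDATE is kept until the transaction ends
BEGIN TRANSACTION;
UPDATE TableA SET cost = 0;
ROLLBACK;  -- the locks are released only here, not row by row

-- transaction type A, running concurrently against the same table
SELECT * FROM TableA;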
You may see fewer deadlocks if you have page-level or table-level locking, because these are easier for SQL Server to handle, but those locks still need to be held for as long as the transaction is ongoing.
When you are designing a highly concurrent system you should avoid queries that lock the whole table. I recommend the following Microsoft guide for understanding locks and reducing their impact:
http://technet.microsoft.com/en-us/library/cc966413.aspx
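One related knob, if row-versioned reads are acceptable for the benchmark (the database name below is a placeholder): turning on READ_COMMITTED_SNAPSHOT makes READ COMMITTED SELECTs read row versions instead of taking shared locks, which removes this class of reader/writer deadlocks.
-- needs a moment with no other active connections in the database
ALTER DATABASE AdventureWorksLT SET READ_COMMITTED_SNAPSHOT ON;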