What is the difference between inconsistent analysis and non-repeatable reads?

I've seen many comparisons of inconsistent analysis to dirty reads, and of non-repeatable reads to dirty reads, but I can't seem to grasp the difference between an inconsistent (incorrect) analysis and a non-repeatable read.
Is there a better way to explain this?
My confusion comes from the fact that both involve multiple reads within one transaction while a second (or third) transaction makes committed updates.
Inconsistent (incorrect) analysis - the data read by the second transaction was committed by the transaction that made the change. Inconsistent analysis involves multiple reads (two or more) of the same row, and each time the information is changed by another transaction, thus producing different results each time, hence "inconsistent".
Whereas
Non-repeatable reads occur when one transaction attempts to access the same data twice and a second transaction modifies the data between the first transaction's read attempts. This may cause the first transaction to read two different values for the same data, making the original read non-repeatable.
I can't quite figure out how they are different.
Thank you.

In my view, the following is an example of an inconsistent analysis, but not a non-repeatable read.
The example uses a table for bank accounts:
CREATE TABLE ACCOUNT(
NO NUMERIC(10) NOT NULL PRIMARY KEY,
BALANCE NUMERIC(9,2) NOT NULL);
INSERT INTO ACCOUNT VALUES (1, 100.00);
INSERT INTO ACCOUNT VALUES (2, 200.00);
For performance reasons, the bank has another table that stores the sum of all account balances redundantly (therefore, this table has always only one row):
CREATE TABLE TOTAL(AMOUNT NUMERIC(9,2) NOT NULL);
INSERT INTO TOTAL SELECT SUM(BALANCE) FROM ACCOUNT;
Suppose that transaction A wants to check whether the redundant sum is indeed correct. It first computes the sum of the account balances:
START TRANSACTION; -- PostgreSQL and SQL Standard, switches autocommit off
SELECT SUM(BALANCE) FROM ACCOUNT;
Now the owner of account 1 makes a deposit of 50 dollars. This is done in transaction B:
START TRANSACTION;
UPDATE ACCOUNT SET BALANCE = BALANCE + 50 WHERE NO = 1;
UPDATE TOTAL SET AMOUNT = AMOUNT + 50;
COMMIT;
Finally, transaction A continues and reads the redundant sum:
SELECT AMOUNT FROM TOTAL;
It will see the increased value, which is different from the sum it computed (probably causing a false alarm).
In this example, transaction A did not read any table row twice, therefore this cannot be a non-repeatable read. However, it did not see a unique database state - some part of the information was from the old state before the update of transaction B, and some part from the new state after the update.
But this is certainly very closely related to a non-repeatable read: if A had read the ACCOUNT rows again, this would be a non-repeatable read. It seems that the same internal mechanisms that prevent non-repeatable reads also prevent this problem: one could keep read rows locked until the end of the transaction, or use multi-version concurrency control with the version from the beginning of the transaction.
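For illustration (a sketch, not part of the original example; assuming PostgreSQL, where REPEATABLE READ gives the whole transaction one snapshot), transaction A could simply run at a higher isolation level:
START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT SUM(BALANCE) FROM ACCOUNT;
-- transaction B's deposit may commit here, but A's snapshot does not include it
SELECT AMOUNT FROM TOTAL;
COMMIT;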
However, there is also one nice solution here, namely to get all data in one query. At least Oracle and PostgreSQL guarantee that a single query is evaluated with respect to only one state of the database:
SELECT SUM(BALANCE) AS AMOUNT, 'SUM' AS PART FROM ACCOUNT
UNION ALL
SELECT AMOUNT, 'TOTAL' AS PART FROM TOTAL;
In a formal model of transaction schedules with reads and writes, this also looks very similar to a non-repeatable read: first A does read(x), then B does write(x), write(y), and then A does read(y). If only a single object were involved, this would be a non-repeatable read.

Is it possible to lock on a value of a column in SQL Server?

I have a table that looks like that:
Id GroupId
1 G1
2 G1
3 G2
4 G2
5 G2
It should at any time be possible to read all of the rows (committed only). When an update happens, I want a transaction that locks on GroupId, i.e. at any given time there should be only one transaction attempting to update rows of a given GroupId.
It should ideally still be possible to read all committed rows (i.e. other transactions/ordinary reads that do not try to acquire the "update per group" lock should still be able to read).
The reason I want to do this is that an update cannot rely on "outdated" data. I.e. I do some calculations in a transaction, and another transaction must not edit the row with id 1 or add a new row with the same GroupId after those rows were read by the first transaction (even though the first transaction would never modify the row itself, it depends on its value).
Another "nice to have" requirement is that sometimes I need the same guarantee "cross group", i.e. the update transaction would have to lock 2 groups at the same time. (This is not a dynamic number of groups, just 2.)
Here are some ideas. I don't think any of them are perfect - I think you will need to give yourself a set of use cases and try them. Some of the situations I tried after applying the locks:
SELECTs with the WHERE filter as another group
SELECTs with the WHERE filter as the locked group
UPDATEs on the table with the WHERE clause as another group
UPDATEs on the table where ID (not GrpID!) was not locked
UPDATEs on the table where the row was locked (e.g., IDs 1 and 2)
INSERTs into the table with that GrpId
I have the funny feeling that none of these will be 100%, but the most likely answer is the second one (setting the transaction isolation level). It will probably lock more than desired, but will give you the isolation you need.
Also one thing to remember: if you lock many rows (e.g., there are thousands of rows with the GrpId you want) then SQL Server can escalate the lock to be a full-table lock. (I believe the tipping point is 5000 locks, but not sure).
Old-school hackjob
At the start of your transaction, update all the relevant rows somehow e.g.,
BEGIN TRAN
UPDATE YourTable
SET GrpId = GrpId
WHERE GrpId = N'G1';
-- Do other stuff
COMMIT TRAN;
Nothing else can modify these rows because (bravo!) they are now a write within an open transaction.
Convenient - set isolation level
See https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-locking-and-row-versioning-guide?view=sql-server-ver15#isolation-levels-in-the-
Before your transaction, set the isolation level high e.g., SERIALIZABLE.
You may want to read all the relevant rows at the start of your transaction (e.g., SELECT GrpId FROM YourTable WHERE GrpId = N'G1') to lock them from being updated.
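A minimal sketch of that pattern (assuming the table and group values from the question; under SERIALIZABLE the SELECT takes range locks, so no other transaction can update or insert rows for that group until the commit):
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;
-- range-locks the group's rows so concurrent updates/inserts for G1 must wait
SELECT Id FROM YourTable WHERE GrpId = N'G1';
-- do the calculations and updates that depend on those rows here
COMMIT TRAN;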
Flexible but requires a lot of coding
Use resource locking with sp_getapplock and sp_releaseapplock.
These are used to lock resources, not tables or rows.
What is a resource? Well, anything you want it to be. In this case, I'd suggest 'Grp1', 'Grp2' etc. It doesn't actually lock rows. Instead, you ask (via sp_getapplock, or APPLOCK_TEST) whether you can get the resource lock. If so, continue. If not, then stop.
Any code referring to these tables needs to be reviewed and potentially modified to ask whether it's allowed to run or not. If something doesn't ask for permission and just goes ahead, there are no actual locks stopping it (except for any transactions you've explicitly specified).
You also need to ensure that errors are handled appropriately (e.g., still releasing the app_lock) and that processes that are blocked are re-tried.
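A rough sketch of the applock approach (the resource name 'G1' is just a naming convention here; with @LockOwner = 'Transaction' the lock is released automatically at COMMIT/ROLLBACK, otherwise use sp_releaseapplock):
BEGIN TRAN;
DECLARE @result int;
-- try to take an exclusive, transaction-scoped lock on the resource name 'G1'
EXEC @result = sp_getapplock
    @Resource = N'G1',
    @LockMode = N'Exclusive',
    @LockOwner = N'Transaction',
    @LockTimeout = 5000;   -- wait up to 5 seconds
IF @result >= 0
BEGIN
    -- safe to read and update rows of group G1 here
    COMMIT TRAN;           -- releases the applock
END
ELSE
    ROLLBACK TRAN;         -- could not get the lock; retry or give up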

Using XLOCK In SELECT Statements

Is using XLOCK (Exclusive Lock) in SELECT statements considered bad practice?
Let's assume a simple scenario where a customer's account balance is $40. Two concurrent $20 purchase requests arrive. The transaction includes:
Read the balance
If the customer has enough money, deduct the price of the product from the balance
So without XLOCK:
T1(Transaction1) reads $40.
T2 reads $40.
T1 updates it to $20.
T2 updates it to $20.
But there should be $0 left in the account.
Is there a way to prevent this without the use of XLOCK? What are the alternatives?
When you perform an update, you should apply the change directly to the stored value rather than computing it from a value read earlier, to prevent these issues. One safe way to do this is demonstrated in the sample code below:
CREATE TABLE #CustomerBalance (CustID int NOT NULL, Balance decimal(9,2) NOT NULL);
INSERT INTO #CustomerBalance VALUES (1, 40.00);
DECLARE @TransactionAmount decimal(9,2) = 19.00;
DECLARE @RemainingBalance decimal(9,2);
UPDATE #CustomerBalance
SET @RemainingBalance = Balance - @TransactionAmount,
    Balance = @RemainingBalance
WHERE CustID = 1;
SELECT @RemainingBalance;
(No column name)
21.00
One advantage of this method is that the row is locked as soon as the UPDATE statement starts executing. If two users update the value "simultaneously", one will start updating the data before the other because of how the database processes the statements. The first UPDATE prevents the second from manipulating the data until the first one has completed. When the second UPDATE starts processing the record, it sees the Balance value already written by the first update.
As a side effect, you will want code that checks the balance after your update and rolls the change back if you have "overdrawn" the balance, or whatever is appropriate. That is why this sample code returns the remaining balance in the variable @RemainingBalance.
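For example (a sketch of that check, not part of the original sample), wrapped in a transaction so an overdraft can be undone:
BEGIN TRAN;
DECLARE @RemainingBalance decimal(9,2);
UPDATE #CustomerBalance
SET @RemainingBalance = Balance - 20.00,
    Balance = @RemainingBalance
WHERE CustID = 1;
IF @RemainingBalance < 0
    ROLLBACK TRAN;   -- overdrawn: undo the deduction
ELSE
    COMMIT TRAN;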
Depending on how you place the queries, isolation level READ COMMITTED should do the job.
Suppose the following code is executed:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
start transaction;
update account set balance=balance-20 where accountid = 'XY';
commit;
Assume T1 executes statement update account set balance=balance-20 where accountid = 'XY'; it will place a write lock on the record with accountid='XY'.
If a second transaction T2 now executes the same statement before T1 has committed, then T2's statement is blocked until T1 commits.
Afterwards, T2 continues. In the end, the balance will have been reduced by 40.
Your question is based on the assumption that using XLOCK is a bad practice. While it is correct that putting this hint everywhere all the time is generally not the best possible approach, there is no other way to achieve the required functionality in your particular situation.
When I encountered the same problem, I found that the combination of XLOCK, HOLDLOCK placed on the verification SELECT in the same transaction usually gets the job done. (I had a stored procedure that performs all necessary validations and then updates the Accounts table only if everything is fine. Yes, in a single transaction.)
However, there is one important caveat: if your database has RCSI enabled, other readers will be able to get past the lock by reading the previous value from the version store. In that case, adding READCOMMITTEDLOCK turns off optimistic versioning for the row(s) in question and reverts the behaviour back to standard read committed.
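A minimal sketch of that pattern (table and column names are made up for illustration; XLOCK, HOLDLOCK keeps the exclusive lock on the selected row until the transaction ends):
BEGIN TRAN;
DECLARE @Balance decimal(9,2);
-- the verification SELECT takes, and holds, an exclusive lock on the row
SELECT @Balance = Balance
FROM Accounts WITH (XLOCK, HOLDLOCK)
WHERE CustomerId = 1;
IF @Balance >= 20.00
    UPDATE Accounts SET Balance = Balance - 20.00 WHERE CustomerId = 1;
COMMIT TRAN;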

How to assign a sequential number (gapless) to a record on INSERT in a transactional sql database?

Let's say we have to store orders in the database and the requirement is that the orders should be numbered as YEAR/NUM where NUM is a number like 1, 2, 3,... without any gaps starting with 1 each year.
How to implement that the right way?
The first thought is:
last_num = get_int('select max(num) from orders where year = :current_year:')
next_num = last_num + 1
execute('insert into orders (year, num) values (:current_year:, :next_num:)');
That will do it in most cases for most systems. But under very high load there is a possibility of a race condition: two threads ask for last_num simultaneously and obtain the same number. How do you solve that? Do you need to do something with the transaction? Or something with locking a database table?
The solution should be database vendor independent. Just a theoretical transactional sql database.
UPDATE 1. You can actually have a similar situation in a banking database where a field stores how much money a customer has in his account. To add money to the account you compute last_state + more_money, and you can have the same race condition when reading last_state.
You can do this at insert time, but that requires a trigger. Most databases support the ANSI standard function ROW_NUMBER(), which allows you to compute the number on output:
select t.*,
       row_number() over (partition by year order by id) as year_seqnum
from orders t;
I would recommend having an auto_increment/identity/sequence id in the table, or a creation date, to capture the sequential ordering of rows (there can be gaps in such a column). The database can populate this field automatically on insert, and you can use it to assign the per-year sequence number later.
There are important reasons why having the database implement its own sequence number is a much, much better idea than trying to get a sequence number without holes.
Basically, to really do what you want, you have to lock the database for each insert transaction in order to get exactly the correct number, with no gaps or duplicates. This is called a "serializable transaction". And, although databases support it, it comes at a very high performance cost. In addition, deletions, and perhaps updates, to transactions become a nightmare. If you delete the first transaction of the year, you basically have to lock the entire year's worth of data in order to adjust the sequence numbers, and there can be no inserts during this time.
It can be done, but consider: if you can delete rows later and do not renumber, you would end up with gaps anyway.
You can put a unique constraint on (num, year), and your first thought would work well if the chance of a race condition is low. Every collision would just mean one failed transaction, which you can retry automatically.
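A sketch of that approach (vendor-neutral in spirit; it assumes the database allows selecting from the table being inserted into, and that the application catches the unique-constraint violation and retries):
ALTER TABLE orders ADD CONSTRAINT uq_orders_year_num UNIQUE (year, num);
-- retried by the application whenever the constraint rejects a duplicate num
INSERT INTO orders (year, num)
SELECT :current_year, COALESCE(MAX(num), 0) + 1
FROM orders
WHERE year = :current_year;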
A theoretical answer to a theoretical question.
Gaps in a sequence can appear when:
transaction A begins
A allocates (reserves) a new number in the sequence (say, 1)
transaction B begins
B allocates (reserves) next number in the sequence (say, 2)
A fails and rolls back
B commits with sequence value 2
The sequence value 1 is not used in the final committed data - it is a gap
In theory, to prevent this kind of gap you need to make sure that no two such transactions run in parallel. It is relatively easy to implement in practice: just lock the whole table for the duration of the transaction and make all concurrent transactions wait in a queue, or set the transaction isolation level to serializable.
In practice, it usually reduces the throughput of the system and people don't do it.
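For example (a sketch, using PostgreSQL-style syntax; other databases have their own table-lock statements):
START TRANSACTION;
-- serialize all writers on this table for the duration of the transaction
LOCK TABLE orders IN EXCLUSIVE MODE;
INSERT INTO orders (year, num)
SELECT :current_year, COALESCE(MAX(num), 0) + 1
FROM orders
WHERE year = :current_year;
COMMIT;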
In your second example, the banking database, I would not do it the way you described. I would store a simple list of all actions (deposits and withdrawals) that happened on the account (just a date and a positive or negative amount per action), and I would not keep a permanent field containing the balance of the account.
When a statement is printed, I sum up all actions on the account up to the needed date to get the balance at that date.
So there is no room for a race condition at all with this approach.
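A minimal sketch of that design (table and column names are illustrative only):
CREATE TABLE ACCOUNT_ENTRY(
ACCOUNT_NO NUMERIC(10) NOT NULL,
ENTRY_DATE DATE NOT NULL,
AMOUNT NUMERIC(9,2) NOT NULL); -- positive = deposit, negative = withdrawal
-- the balance at a given date is an aggregate; nothing is ever read-modify-written
SELECT COALESCE(SUM(AMOUNT), 0) AS BALANCE
FROM ACCOUNT_ENTRY
WHERE ACCOUNT_NO = 1 AND ENTRY_DATE <= DATE '2024-01-31';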

Can an INSERT-SELECT query be subject to race conditions?

I have this query that attempts to add rows to the balances table if no corresponding row exists in the totals table. The query is run in a transaction using the default isolation level on PostgreSQL.
INSERT INTO balances (account_id, currency, amount)
SELECT t.account_id, t.currency, 0
FROM balances AS b
RIGHT OUTER JOIN totals AS t USING (account_id, currency)
WHERE b.id IS NULL
I have a UNIQUE constraint on balances (account_id, currency). I'm worried that I will get into a race condition that leads to duplicate key errors if multiple sessions execute this query concurrently. I've seen many questions on the subject, but they all seem to involve either subqueries, multiple queries, or pgSQL functions.
Since I'm not using any of those in my query, is it free from race conditions? If it isn't, how can I fix it?
Yes, it will fail with a "duplicate key value violates unique constraint" error. What I do is place the insertion code in a try/except block and, when the exception is thrown, catch it and retry. That simple. Unless the application has a huge number of concurrent users, it will work flawlessly.
In your query the default isolation level is enough since it is a single insert statement and there is no risk of phantom reads.
Notice that even when setting the isolation level to serializable the try/except block is not avoidable. From the manual about serializable:
like the Repeatable Read level, applications using this level must be prepared to retry transactions due to serialization failures
The default transaction level is Read Committed. Phantom reads are possible in this level (see Table 13.1). While you are protected from seeing any weird effects in the totals table were you to update the totals, you are not protected from phantom reads in the balances table.
What this means can be seen by looking at a transaction similar to yours that performs the outer join twice (and only queries, inserting nothing). The fact that a balance is missing is not guaranteed to stay true between the two "peeks" at the balances table. The sudden appearance of a balance that wasn't there when the same transaction looked the first time is called a "phantom read".
In your case, several concurrent statements can see that a balance is missing, and nothing prevents them from trying to insert it and erroring out.
To rule out phantom reads (and to fix your query), run your transaction at the SERIALIZABLE isolation level, set before you run your query:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
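Put together (a sketch; under SERIALIZABLE the statement may instead fail with a serialization error, which the application should catch and retry):
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO balances (account_id, currency, amount)
SELECT t.account_id, t.currency, 0
FROM balances AS b
RIGHT OUTER JOIN totals AS t USING (account_id, currency)
WHERE b.id IS NULL;
COMMIT;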

Difference between "read commited" and "repeatable read" in SQL Server

I think the above isolation levels look very much alike. Could someone please describe, with some nice examples, what the main difference is?
Read committed is an isolation level that guarantees that any data read was committed at the moment it is read. It simply restricts the reader from seeing any intermediate, uncommitted, 'dirty' read. It makes no promise whatsoever that if the transaction re-issues the read it will find the same data; data is free to change after it was read.
Repeatable read is a higher isolation level that, in addition to the guarantees of the read committed level, also guarantees that any data read cannot change: if the transaction reads the same data again, it will find the previously read data in place, unchanged, and available to read.
The next isolation level, serializable, makes an even stronger guarantee: in addition to everything repeatable read guarantees, it also guarantees that no new data can be seen by a subsequent read.
Say you have a table T with a column C with one row in it, say it has the value '1'. And consider you have a simple task like the following:
BEGIN TRANSACTION;
SELECT * FROM T;
WAITFOR DELAY '00:01:00'
SELECT * FROM T;
COMMIT;
That is a simple task that issues two reads from table T, with a delay of one minute between them.
Under READ COMMITTED, the second SELECT may return any data. A concurrent transaction may update the record, delete it, or insert new records. The second SELECT will see those committed changes.
Under REPEATABLE READ, the second SELECT is guaranteed to return at least the rows that were returned by the first SELECT, unchanged. New rows may be added by a concurrent transaction in that one minute, but the existing rows cannot be deleted or changed.
Under SERIALIZABLE, the second SELECT is guaranteed to see exactly the same rows as the first. No row can be changed or deleted, and no new rows can be inserted by a concurrent transaction.
If you follow the logic above, you can quickly see that SERIALIZABLE transactions, while they may make life easy for you, end up blocking every concurrent operation that touches the data they read, since they require that nobody modify, delete, or insert any of those rows. The default transaction isolation level of the .NET System.Transactions scope is Serializable, and this usually explains the abysmal performance that results.
And finally, there is also the SNAPSHOT isolation level. SNAPSHOT isolation level makes the same guarantees as serializable, but not by requiring that no concurrent transaction can modify the data. Instead, it forces every reader to see its own version of the world (its own 'snapshot'). This makes it very easy to program against as well as very scalable as it does not block concurrent updates. However, that benefit comes with a price: extra server resource consumption.
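For reference (not from the original answer; the database name MyDb is just an example), snapshot isolation has to be enabled per database before a session can use it:
-- enable the feature once per database
ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;
-- then, in a session:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT * FROM T;   -- reads row versions instead of blocking on concurrent writers
COMMIT;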
Supplemental reads:
Isolation Levels in the Database Engine
Concurrency Effects
Choosing Row Versioning-based Isolation Levels
Repeatable Read
The state of the database is maintained from the start of the transaction. If you retrieve a value in session1, then update that value in session2, retrieving it again in session1 will return the same results. Reads are repeatable.
session1> BEGIN;
session1> SELECT firstname FROM names WHERE id = 7;
Aaron
session2> BEGIN;
session2> SELECT firstname FROM names WHERE id = 7;
Aaron
session2> UPDATE names SET firstname = 'Bob' WHERE id = 7;
session2> SELECT firstname FROM names WHERE id = 7;
Bob
session2> COMMIT;
session1> SELECT firstname FROM names WHERE id = 7;
Aaron
Read Committed
Within the context of a transaction, you will always retrieve the most recently committed value. If you retrieve a value in session1, update it in session2, then retrieve it in session1 again, you will get the value as modified in session2. It reads the last committed row.
session1> BEGIN;
session1> SELECT firstname FROM names WHERE id = 7;
Aaron
session2> BEGIN;
session2> SELECT firstname FROM names WHERE id = 7;
Aaron
session2> UPDATE names SET firstname = 'Bob' WHERE id = 7;
session2> SELECT firstname FROM names WHERE id = 7;
Bob
session2> COMMIT;
session1> SELECT firstname FROM names WHERE id = 7;
Bob
Makes sense?
Simply put, based on my reading and understanding of this thread and of @remus-rusanu's answer, consider this simple scenario:
There are two transactions A and B.
Transaction B is reading Table X
Transaction A is writing in table X
Transaction B is reading again in Table X.
ReadUncommitted: Transaction B can read uncommitted data from Transaction A, so it may see different rows depending on A's uncommitted writes. No locking at all.
ReadCommitted: Transaction B can read ONLY committed data from Transaction A, so it may still see different rows, but based only on A's committed writes. Could we call it a simple lock?
RepeatableRead: Transaction B will read the same data (rows) whatever Transaction A is doing. But Transaction A can change other rows. Row-level locks.
Serializable: Transaction B will read the same rows as before, and Transaction A cannot change or insert rows in the range B has read. Range/table-level locks.
Snapshot: every transaction works on its own copy (version) of the data. Each one has its own view.
Old question which has an accepted answer already, but I like to think of these two isolation levels in terms of how they change the locking behavior in SQL Server. This might be helpful for those who are debugging deadlocks like I was.
READ COMMITTED (default)
Shared locks are taken in the SELECT and then released when the SELECT statement completes. This is how the system can guarantee that there are no dirty reads of uncommitted data. Other transactions can still change the underlying rows after your SELECT completes and before your transaction completes.
REPEATABLE READ
Shared locks are taken in the SELECT and then released only after the transaction completes. This is how the system can guarantee that the values you read will not change during the transaction (because they remain locked until the transaction finishes).
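One way to observe this while debugging (a sketch; sys.dm_tran_locks lists the locks currently held by a session):
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT * FROM T;
-- the shared (S) locks taken by the SELECT above still show up here,
-- because under REPEATABLE READ they are held until COMMIT/ROLLBACK
SELECT resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID;
COMMIT;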
Trying to explain this with two simple cases:
Read Committed: at this isolation level, Transaction T1 will read the updated value of X committed by Transaction T2.
Repeatable Read: at this isolation level, Transaction T1 will not see the changes committed by Transaction T2.
The summary chart from kudvenkat's YouTube series on isolation levels is also a useful quick reference for remembering the differences.
Please note that the "repeatable" in repeatable read applies to a tuple, not to the entire table. Under the ANSI isolation levels, the phantom read anomaly can still occur: reading a table with the same WHERE clause twice may return different result sets. Literally, it's not repeatable.
My observation on the initially accepted solution:
Under REPEATABLE READ (the MySQL default), if a transaction is open and a SELECT has been fired, another transaction can NOT delete any row belonging to the previous read's result set until the previous transaction is committed (in fact, the DELETE statement in the new transaction will just hang); however, a subsequent transaction can delete all rows from the table without any trouble. By the way, a further READ in the previous transaction will still see the old data until it is committed.