Ensuring Atomicity in SQL

I was just reading about RDBMS,
and one property of an RDBMS is
atomicity. So, if money is withdrawn
from an account and transferred to
another, either the transaction
will happen completely or not
at all. There are no partial
transactions. But how is that actually ensured?
SQL queries for the above scenario
might look like
(i) UPDATE accounts set balance = balance - amount WHERE ac_num = 101
(ii) UPDATE accounts set balance = balance + amount WHERE ac_num = 102
These statements by no means ensure atomicity on their own.
So how does it actually happen?

If you do
BEGIN TRANSACTION
UPDATE accounts set balance = balance - amount WHERE ac_num = 101
UPDATE accounts set balance = balance + amount WHERE ac_num = 102
COMMIT TRANSACTION
The database system will write notes about what it has done for the changes on account 101. Then, if the work on account 102 fails, the RDBMS uses those notes to undo the work on account 101.
Furthermore, when it starts work on account 101 it takes a lock on that data, so that no one else can come along and read the updated, but not yet committed, data in account 101.
(A lock here is basically just a note somewhere saying "I am working here, do not touch.")

To be atomic, transactions need to:
Prevent other transactions from interfering with the rows they are writing or reading.
Make sure that either all or none of the changes the transaction makes are in the database when the transaction commits.
The first is achieved by locking the rows that the transaction reads or writes during its execution.
The second is achieved by having transactions write their actions into a transaction log. This also makes the database able to recover when the server loses power during a transaction: the recovery process reads the log, makes sure that active (uncommitted) transactions are aborted, and cancels the changes they made.
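As a hedged sketch of the same transfer with the explicit failure path spelled out (the 500 is an arbitrary amount; the table and account numbers come from the question):

BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 500 WHERE ac_num = 101;
UPDATE accounts SET balance = balance + 500 WHERE ac_num = 102;
-- If either UPDATE errors, or a business check fails (for example the first account
-- going negative), issue ROLLBACK TRANSACTION instead; the transaction log is then
-- used to undo whatever partial work was already done.
COMMIT TRANSACTION;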


PostgreSQL row read lock

Let’s say I have a table called Withdrawals (id, amount, user_id, status).
Whenever a withdrawal is initiated, this is the flow:
Verify that the user has sufficient balance (calculated as the sum of amounts received minus the sum of withdrawal amounts)
Insert row with amount, user_id and status=‘pending’
Call 3rd party software through gRPC to initiate a withdrawal (actually send money), wait for a response
Update the row with status = ‘completed’ as soon as we get a positive response, or delete the entry if the withdrawal failed.
However, I have a concurrency problem in this flow.
Let’s say the user makes 2 full balance withdrawal requests within ~50 ms difference:
Request 1
User has enough balance
Create Withdrawal (balance = 0)
Update withdrawal status
Request 2 (after ~50ms)
User has enough balance (which is not true; the first insert hasn’t been stored yet)
Create Withdrawal (balance goes negative)
Update withdrawal status
Right now we are using Redis to lock withdrawals for a given user if they arrive within x ms of each other, to avoid this situation, but this is not the most robust solution. As we are now developing an API for businesses, our current approach would block legitimate withdrawals that happen to be requested at the same time.
Is there any way to lock and make sure consequent insert queries wait based on the user_id of the Withdrawals table ?
This is a property of transaction isolation. There is a lot written about it, and I would highly recommend the overview in Designing Data-Intensive Applications; I found it the most helpful description for improving my own understanding.
The default Postgres level is READ COMMITTED, which allows each of these concurrent transactions to see the same "funds available" state even though they should be dependent on each other.
One way to address this would be to run each of these transactions at the SERIALIZABLE isolation level.
SERIALIZABLE All statements of the current transaction can only see
rows committed before the first query or data-modification statement
was executed in this transaction. If a pattern of reads and writes
among concurrent serializable transactions would create a situation
which could not have occurred for any serial (one-at-a-time) execution
of those transactions, one of them will be rolled back with a
serialization_failure error.
This should enforce the correctness of your application at a cost to availability, i.e. in this case the second transaction will not be allowed to modify the records and will be rejected, which requires a retry. For a POC or a low-traffic application this is usually a perfectly acceptable first step, as you can ensure correctness right now.
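A minimal sketch of what that could look like in PostgreSQL (the Deposits table and the user_id value are assumptions; the question only names the Withdrawals table, and the application must be prepared to retry on serialization failures):

BEGIN ISOLATION LEVEL SERIALIZABLE;

-- Hypothetical balance check: amounts received minus withdrawals already requested.
SELECT COALESCE((SELECT SUM(amount) FROM Deposits    WHERE user_id = 42), 0)
     - COALESCE((SELECT SUM(amount) FROM Withdrawals WHERE user_id = 42), 0) AS available;

-- Only if the application saw a sufficient balance:
INSERT INTO Withdrawals (amount, user_id, status) VALUES (100.00, 42, 'pending');

COMMIT;  -- may fail with serialization_failure (SQLSTATE 40001); retry the whole transaction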
Also, in the book referenced above I think there was an example of how ATMs handle availability: they allow for this race condition and let the user overdraw if they are unable to connect to the centralized bank, but they bound the maximum withdrawal to minimize the blast radius!
Another architectural way to address this is to take the transactions offline and make them asynchronous, so that each user-invoked transaction is published to a queue; by having a single consumer of the queue you naturally avoid any race conditions. The tradeoff is similar: there is a fixed throughput available from a single worker, but it does help address the correctness issue for now.
Locking across machines (like using Redis alongside Postgres/gRPC) is called distributed locking, and there is a good amount written about it: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html

Using XLOCK In SELECT Statements

Is using XLOCK (Exclusive Lock) in SELECT statements considered bad practice?
Let's assume the simple scenario where a customer's account balance is $40. Two concurrent $20 purchase requests arrive. The transaction includes:
Read balance
If customer has enough money, deduct the price of the product from the balance
So without XLOCK:
T1(Transaction1) reads $40.
T2 reads $40.
T1 updates it to $20.
T2 updates it to $20.
But there should be $0 left in the account.
Is there a way to prevent this without the use of XLOCK? What are the alternatives?
When you perform an update, you should update directly into the data item to prevent these issues. One safe way to do this is demonstrated in the sample code below:
CREATE TABLE #CustomerBalance (CustID int not null, Balance decimal(9,2) not null)
INSERT INTO #CustomerBalance Values (1, 40.00)
DECLARE @TransactionAmount decimal(9,2) = 19.00
DECLARE @RemainingBalance decimal(9,2)
UPDATE #CustomerBalance
SET @RemainingBalance = Balance - @TransactionAmount,
    Balance = @RemainingBalance
SELECT @RemainingBalance
(No column name)
21.00
One advantage of this method is that the row is locked as soon as the UPDATE statement starts executing. If two users are updating the value "simultaneously", because of how the database works, one will start updating the data before the other. The first UPDATE will prevent the second UPDATE from manipulating the data until the first one is completed. When the second UPDATE starts processing the record, it will see the value that has been updated into the Balance by the first update.
As a side effect of this, you will want code that checks the balance after your update and rolls the change back if you have "overdrawn" the balance, or whatever is appropriate. That is why this sample code returns the remaining balance in the variable @RemainingBalance.
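For illustration, a sketch of that follow-up check against the same temp table (the WHERE clause and the "no negative balance" rule are assumptions about the business logic):

BEGIN TRANSACTION;

DECLARE @TransactionAmount decimal(9,2) = 19.00;
DECLARE @RemainingBalance decimal(9,2);

UPDATE #CustomerBalance
SET @RemainingBalance = Balance - @TransactionAmount,
    Balance = @RemainingBalance
WHERE CustID = 1;

IF @RemainingBalance < 0
    ROLLBACK TRANSACTION;   -- overdrawn: undo the update
ELSE
    COMMIT TRANSACTION;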
Depending on how you place the queries, isolation level READ COMMITTED should do the job.
Suppose the following code to be performed:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
start transaction;
update account set balance=balance-20 where accountid = 'XY';
commit;
Assume T1 executes statement update account set balance=balance-20 where accountid = 'XY'; it will place a write lock on the record with accountid='XY'.
If a second transaction T2 now executes the same statement before T1 has committed, then T2's statement is blocked until T1 commits.
Afterwards, T2 continues. At the end, balance will be reduced by 40.
Your question is based on the assumption that using XLOCK is bad practice. While it is true that scattering this hint everywhere all the time is generally not the best possible approach, in your particular situation there is no other way to achieve the required functionality.
When I encountered the same problem, I found that the combination of XLOCK, HOLDLOCK placed on the verification SELECT in the same transaction usually gets the job done. (I had a stored procedure that performed all the necessary validations and then updated the Accounts table only if everything was fine. Yes, in a single transaction.)
However, there is one important caveat: if your database has RCSI (read committed snapshot isolation) enabled, other readers will be able to get past the lock by reading the previous value from the version store. In this case, adding READCOMMITTEDLOCK turns off optimistic versioning for the row(s) in question and reverts the behaviour back to standard read committed.
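A minimal sketch of that pattern (the Accounts table, the CustID and the $20 amount are illustrative, not taken from the question):

BEGIN TRANSACTION;

DECLARE @Balance decimal(9,2);

-- Verification select: XLOCK takes an exclusive lock on the row and HOLDLOCK keeps it
-- until commit, so a concurrent purchase cannot read the same balance in between.
SELECT @Balance = Balance
FROM Accounts WITH (XLOCK, HOLDLOCK)
WHERE CustID = 1;

IF @Balance >= 20.00
    UPDATE Accounts SET Balance = Balance - 20.00 WHERE CustID = 1;

COMMIT TRANSACTION;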

postgresql function or query performance

I have some general questions about executing Postgres functions. I recently noticed that if I store the output of an arithmetic or business operation in a variable and then use it in the query at execution time, instead of doing the operation at execution time, it saves a lot of time.
But as I am new to Postgres, I am not aware of the general practices to follow to reduce the time taken and improve performance.
Beware of read-modify-write cycles and transaction anomalies.
It's fine to cache values locally so long as you're careful about the scope with which you cache it and with cache invalidation. Be very careful about storing values past the lifetime of the transaction you read it in.
Read-modify-write
You must also be careful not to use that cached value as an input into a calculation that you write back to the database unless you SELECT ... FOR UPDATE the value in a transaction that stays open during the write, you use a SERIALIZABLE transaction, or you use some form of optimistic concurrency control.
If you aren't careful you can get yourself into real trouble, with classics like the banking concurrency example where account id=1 transfers $100 to account id=2 in one session and $100 to account id=3 in another:
session1: begin;
session2: begin;
session1: select balance from account where id = 1;   -- returns 100
session2: select balance from account where id = 1;   -- returns 100
session1: update account set balance = balance + 100 where id = 2;   -- this is safe
session2: update account set balance = balance + 100 where id = 3;   -- this is safe
session1: update account set balance = 0 where id = 1;   -- 100 - 100 = 0, computed client-side
session2: update account set balance = 0 where id = 1;   -- 100 - 100 = 0, computed client-side
session1: commit;
session2: commit;
Whoops! You just added $100 to two people's accounts but only took $100 out of id=1's. The updates against id=2 and id=3 were OK because they did an in-place modification of the balance (balance = balance + 100). The updates to id=1 were not, because they read the value, modified it client side, and wrote a new value.
This is what I mean by a read-modify-write cycle.
It would have been safe if we'd used SELECT ... FOR UPDATE when reading the balance, because the second transaction would have been stuck until the first committed. But it would have been better still to avoid the read-modify-write cycle and just do the updates in place.
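For illustration, a sketch of the locked variant of one such transfer (same hypothetical account table as above):

BEGIN;

-- Locking the payer's row makes a concurrent transfer block at its own
-- SELECT ... FOR UPDATE until this transaction commits, so it then sees
-- the already-debited balance instead of the stale 100.
SELECT balance FROM account WHERE id = 1 FOR UPDATE;

UPDATE account SET balance = balance + 100 WHERE id = 2;

-- Better still, avoid the read-modify-write entirely and debit in place:
UPDATE account SET balance = balance - 100 WHERE id = 1;

COMMIT;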
Caching
Caching is fine - but can introduce anomalies when the underlying data is updated but your cache doesn't get flushed and refreshed.
Cache invalidation is, in general, a hard problem, but Pg has some tools that help.
In particular, listen and notify, invoked from triggers, can be used to eagerly flush data from a cache stored in memcached/redis/whatever via a helper daemon. That means you're much less likely to have to flush large chunks of cache or drop the whole cache whenever something changes.
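A rough sketch of that pattern, assuming a hypothetical account table and a cache daemon that runs LISTEN cache_invalidation (the names are made up for illustration):

CREATE OR REPLACE FUNCTION notify_cache() RETURNS trigger AS $$
BEGIN
  -- Tell the helper daemon which row changed so it can evict just that cache entry.
  PERFORM pg_notify('cache_invalidation', TG_TABLE_NAME || ':' || NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER account_cache_invalidation
AFTER UPDATE ON account
FOR EACH ROW EXECUTE PROCEDURE notify_cache();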
You also need to make decisions about how out of date it's acceptable for something to be. Sometimes you just don't care if a value is 5 seconds out of date. Or half an hour. Or a week. It depends on the application, the datum in question, etc.
There's nothing particularly wrong with storing values in variables.
If you're storing values so you can write SQL in a procedural, step-by-step way instead of a set-oriented way, then you'd probably be better off not doing that. SQL is a set-oriented language; it usually performs better if you write set-oriented SQL.
The risk of storing values and using them later is that the data underlying those values might have changed since you stored them. Whether that's a real problem is application-specific, but it's usually a problem best avoided.

race condition UPDATE modification of credit column - what happens on rollback?

OK, I tried searching and have not found an answer to this. I am curious how ROLLBACK handles race conditions. For example:
If I have a table (CompanyAccount) which keeps track of how many credits a company has available for purchase (there is only one row per company), and there are potentially multiple users from the same company who can decrement the credits from the single company account, what happens in case of an error when a ROLLBACK occurs?
Example:
Assumptions: I have written the update properly so it calculates the new "Credit" balance instead of guessing what it is (i.e. we don't tell the UPDATE statement what the new Credit value should be; we say take whatever is in the Credit column and subtract my decrement value in the UPDATE statement)...
here is an example of how the update statement is written:
UPDATE dbo.CompanyAccount
SET Credit = Credit - @DecrementAmount
WHERE CompanyAccountId = @CompanyAccountId
If the "Credit" column has 10,000 credits. User A causes a decrement of 4,000 credits and User B causes a decrement of 1000 credits. For some reason a rollback is triggered during User A's decrement (there are about a 1/2 dozen more tables with rows getting INSERTED during the TRANSACTION). If User A wins the race condition and the new balance is 6,000 (but not yet COMMIT'ed) what happens if User B's decrement occurs before the rollback is applied? does the balance column go from 6,000 to 5,000 and then gets ROLLBACK to 10,000?
I am not too clear on how the ROLLBACK will handle this. Perhaps I am over-simplifying. Can someone please tell me if I misunderstand how ROLLBACK will work or if there are other risks I need to worry about for this style.
Thanks for your input.
In the example you have given there will be no problem.
The first transaction will have an exclusive lock, meaning the second one cannot modify that row until the first has committed or rolled back. It will just have to wait (be blocked) until the lock is released.
It gets a bit more complicated if you have multiple statements. You should probably read up on different isolation levels and how they can allow or prevent such phenomena as "lost updates".
Rollback is part of the transaction and locks will be maintained during the rollback. The *A*tomic in ACID.
User B will not start until all locks are released.
What happens:
User A locks the row.
User B can't touch the row until the locks are released.
User A rolls back, releasing the locks; the changes never happened.
User B's decrement then applies: subtracting 1,000 results in 9,000.
However, if User B has already read the balance, then it may be inconsistent by the time of the UPDATE. It depends on what you're actually doing and in what order, hence the need to understand isolation levels (and the issues with phantom and non-repeatable reads).
An alternative to SERIALIZABLE or REPEATABLE READ may be to use sp_getapplock in transaction mode to semaphore parts of the transaction.
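For completeness, a hedged sketch of the sp_getapplock approach (the resource name and example values are arbitrary; @LockOwner = 'Transaction' means the lock is released automatically at COMMIT or ROLLBACK):

DECLARE @DecrementAmount int = 4000, @CompanyAccountId int = 42;

BEGIN TRANSACTION;

-- Serialize all credit changes for this company behind one application lock.
-- sp_getapplock returns a negative value on timeout or failure; check it in real code.
EXEC sp_getapplock @Resource = 'CompanyAccount-42',
                   @LockMode = 'Exclusive',
                   @LockOwner = 'Transaction',
                   @LockTimeout = 5000;

UPDATE dbo.CompanyAccount
SET Credit = Credit - @DecrementAmount
WHERE CompanyAccountId = @CompanyAccountId;

-- ... the other INSERTs that belong to the same unit of work ...

COMMIT TRANSACTION;   -- the application lock is released with the transaction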

MySQL: Transactions vs Locking Tables

I'm a bit confused about transactions vs locking tables to ensure database integrity and to make sure a SELECT and UPDATE stay in sync, with no other connection interfering. I need to:
SELECT * FROM table WHERE (...) LIMIT 1
if (condition passes) {
// Update row I got from the select
UPDATE table SET column = "value" WHERE (...)
... other logic (including INSERT some data) ...
}
I need to ensure that no other query interferes and performs the same SELECT (reading the 'old value') before this connection finishes updating the row.
I know I can default to LOCK TABLES table to just make sure that only 1 connection is doing this at a time, and unlock it when I'm done, but that seems like overkill. Would wrapping that in a transaction do the same thing (ensuring no other connection attempts the same process while another is still processing)? Or would a SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE be better?
Locking tables prevents other DB users from affecting the rows/tables you've locked. But locks, in and of themselves, will NOT ensure that your logic comes out in a consistent state.
Think of a banking system. When you pay a bill online, there are at least two accounts affected by the transaction: your account, from which the money is taken, and the receiver's account, into which the money is transferred. And the bank's account, into which they'll happily deposit all the service fees charged on the transaction. Given (as everyone knows these days) that banks are extraordinarily stupid, let's say their system works like this:
$balance = "GET BALANCE FROM your ACCOUNT";
if ($balance < $amount_being_paid) {
charge_huge_overdraft_fees();
}
$balance = $balance - $amount_being_paid;
UPDATE your ACCOUNT SET BALANCE = $balance;
$balance = "GET BALANCE FROM receiver ACCOUNT"
charge_insane_transaction_fee();
$balance = $balance + $amount_being_paid
UPDATE receiver ACCOUNT SET BALANCE = $balance
Now, with no locks and no transactions, this system is vulnerable to various race conditions, the biggest of which is multiple payments being processed against your account, or the receiver's account, in parallel. While your code has retrieved your balance and is busy with huge_overdraft_fees() and whatnot, it's entirely possible that some other payment is running the same kind of code in parallel. That payment also retrieves your balance (say, $100) and does its own deduction (the $30 they're screwing you over with) while yours deducts the $20 you're paying. Now the two code paths hold two different balances: $80 and $70. Depending on which one finishes last, you'll end up with either of those balances in your account, instead of the $50 you should have ended up with ($100 - $20 - $30). In this case, "bank error in your favor".
Now, let's say you use locks. Your bill payment ($20) hits the pipe first, so it wins and locks your account record. Now you've got exclusive use, and can deduct the $20 from the balance, and write the new balance back in peace... and your account ends up with $80 as is expected. But... uhoh... You try to go update the receiver's account, and it's locked, and locked longer than the code allows, timing out your transaction... We're dealing with stupid banks, so instead of having proper error handling, the code just pulls an exit(), and your $20 vanishes into a puff of electrons. Now you're out $20, and you still owe $20 to the receiver, and your telephone gets repossessed.
So... enter transactions. You start a transaction, you debit your account $20, you try to credit the receiver with $20... and something blows up again. But this time, instead of exit(), the code can just do rollback, and poof, your $20 is magically added back to your account.
In the end, it boils down to this:
Locks keep anyone else from interfering with any database records you're dealing with. Transactions keep any "later" errors from interfering with "earlier" things you've done. Neither alone can guarantee that things work out ok in the end. But together, they do.
in tomorrow's lesson: The Joy of Deadlocks.
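A rough sketch of both together in MySQL/InnoDB terms (the accounts table, the owner column and the $20 amount are made up to match the story above):

START TRANSACTION;

-- Lock the payer's row so a parallel payment can't read the same balance.
SELECT balance FROM accounts WHERE owner = 'you' FOR UPDATE;

-- Deduct and credit in place; if either statement fails, ROLLBACK undoes both.
UPDATE accounts SET balance = balance - 20 WHERE owner = 'you';
UPDATE accounts SET balance = balance + 20 WHERE owner = 'receiver';

COMMIT;   -- or ROLLBACK in the error handler, instead of exit()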
I started researching the same topic for the same reasons you indicated in your question. I was confused by the answers given on SO because they were partial answers and did not provide the big picture. After reading a couple of documentation pages from different RDBMS providers, these are my takeaways:
TRANSACTIONS
Statements are database commands, mainly to read and modify the data in the database. Transactions are the scope of one or more statement executions. They provide two things:
A mechanism which guarantees that all statements in a transaction are executed correctly, or, in case of a single error, that any data modified by those statements is reverted to its last correct state (i.e. rolled back). What this mechanism provides is called atomicity.
A mechanism which guarantees that concurrent read statements can view the data without the occurrence of some or all of the phenomena described below.
Dirty read: A transaction reads data written by a concurrent
uncommitted transaction.
Nonrepeatable read: A transaction re-reads data it has previously read
and finds that data has been modified by another transaction (that
committed since the initial read).
Phantom read: A transaction re-executes a query returning a set of
rows that satisfy a search condition and finds that the set of rows
satisfying the condition has changed due to another recently-committed
transaction.
Serialization anomaly: The result of successfully committing a group
of transactions is inconsistent with all possible orderings of running
those transactions one at a time.
What this mechanism provides is called isolation, and the mechanism that lets statements choose which phenomena must not occur in a transaction is called isolation levels.
As an example, this is the isolation-level / phenomena table for PostgreSQL:
Isolation level     Dirty read              Nonrepeatable read   Phantom read            Serialization anomaly
Read uncommitted    Allowed, but not in PG  Possible             Possible                Possible
Read committed      Not possible            Possible             Possible                Possible
Repeatable read     Not possible            Not possible         Allowed, but not in PG  Possible
Serializable        Not possible            Not possible         Not possible            Not possible
If any of the described promises is broken by the database system, changes are rolled back and the caller notified about it.
How these mechanisms are implemented to provide these guarantees is described below.
LOCK TYPES
Exclusive locks: When an exclusive lock is acquired over a resource, no other exclusive lock can be acquired over that resource. Exclusive locks are always acquired before a modifying statement (INSERT, UPDATE or DELETE) and are released after the transaction finishes. To explicitly acquire exclusive locks before a modifying statement you can use hints like FOR UPDATE (PostgreSQL, MySQL) or UPDLOCK (T-SQL).
Shared locks: Multiple shared locks can be acquired over a resource. However, shared locks and exclusive locks cannot be held at the same time over a resource. Shared locks might or might not be acquired before a read statement (SELECT, JOIN) depending on the database's implementation of isolation levels.
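As a hedged example of acquiring each kind explicitly (PostgreSQL syntax with a hypothetical accounts table; MySQL uses LOCK IN SHARE MODE / FOR SHARE and T-SQL uses the UPDLOCK / HOLDLOCK hints):

BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;   -- exclusive row lock, held until commit
COMMIT;

BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;    -- shared row lock: other readers allowed, writers blocked
COMMIT;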
LOCK RESOURCE RANGES
Row: single row the statements executes on.
Range: a specific range based on the condition given in the statement (SELECT ... WHERE).
Table: whole table. (Mostly used to prevent deadlocks on big statements like batch update.)
As an example, the default shared-lock behaviour of SQL Server differs per isolation level: READ UNCOMMITTED takes no shared locks at all, READ COMMITTED releases each shared lock as soon as the row has been read, REPEATABLE READ holds shared locks until the end of the transaction, and SERIALIZABLE additionally holds key-range locks until the end of the transaction.
DEADLOCKS
One of the downsides of the locking mechanism is deadlocks. A deadlock occurs when a statement enters a waiting state because a requested resource is held by another waiting statement, which in turn is waiting for a resource held by yet another waiting statement. In such a case the database system detects the deadlock and terminates one of the transactions. Careless use of locks increases the chance of deadlocks, but they can occur even without human error.
SNAPSHOTS (DATA VERSIONING)
This is an isolation mechanism which provides a statement with a copy of the data taken at a specific time.
Statement start: provides the statement with a copy of the data taken at the beginning of the statement's execution. It also supports the rollback mechanism by keeping this data until the transaction is finished.
Transaction start: provides the statement with a copy of the data taken at the beginning of the transaction.
All of those mechanisms together provide consistency.
As for optimistic and pessimistic locks, these are just names for the two broad classes of approaches to the concurrency problem.
Pessimistic concurrency control:
A system of locks prevents users from modifying data in a way that
affects other users. After a user performs an action that causes a
lock to be applied, other users cannot perform actions that would
conflict with the lock until the owner releases it. This is called
pessimistic control because it is mainly used in environments where
there is high contention for data, where the cost of protecting data
with locks is less than the cost of rolling back transactions if
concurrency conflicts occur.
Optimistic concurrency control:
In optimistic concurrency control, users do not lock data when they
read it. When a user updates data, the system checks to see if another
user changed the data after it was read. If another user updated the
data, an error is raised. Typically, the user receiving the error
rolls back the transaction and starts over. This is called optimistic
because it is mainly used in environments where there is low
contention for data, and where the cost of occasionally rolling back a
transaction is lower than the cost of locking data when read.
For example, by default PostgreSQL uses snapshots to make sure the data read hasn't changed, and rolls back if it has, which is an optimistic approach. SQL Server, however, uses read locks by default to provide these guarantees.
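At the application level, the optimistic approach often shows up as a version (or timestamp) column; here is a sketch with hypothetical names, as a complement to the built-in mechanisms described above:

-- Read the row and remember its version.
SELECT balance, version FROM accounts WHERE id = 1;

-- Write back only if nobody changed the row in the meantime.
UPDATE accounts
SET balance = 80, version = version + 1
WHERE id = 1 AND version = 7;
-- If this affects 0 rows, another user updated the row first: report the conflict or retry.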
The implementation details can differ according to the database system you choose, but according to the database standards they need to provide the stated transaction guarantees in one way or another using these mechanisms. If you want to know more about the topic, or about specific implementation details, below are some useful links.
SQL-Server - Transaction Locking and Row Versioning Guide
PostgreSQL - Transaction Isolation
PostgreSQL - Explicit Locking
MySQL - Consistent Nonlocking Reads
MySQL - Locking
Understanding Isolation Levels (Video)
You want a SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE inside a transaction, as you said, since normally SELECTs, no matter whether they are in a transaction or not, will not lock a table. Which one you choose would depend on whether you want other transactions to be able to read that row while your transaction is in progress.
http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
START TRANSACTION WITH CONSISTENT SNAPSHOT will not do the trick for you, as other transactions can still come along and modify that row. This is mentioned right at the top of the link below.
If other sessions simultaneously
update the same table [...] you may
see the table in a state that never
existed in the database.
http://dev.mysql.com/doc/refman/5.0/en/innodb-consistent-read.html
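A minimal sketch of the FOR UPDATE variant of the flow in the question (the table, column and WHERE conditions are placeholders):

START TRANSACTION;

-- Other connections running the same SELECT ... FOR UPDATE block here until we commit.
SELECT * FROM `table` WHERE status = 'pending' LIMIT 1 FOR UPDATE;

-- if the condition passes:
UPDATE `table` SET `column` = 'value' WHERE id = 123;
-- ... other logic, including INSERTs ...

COMMIT;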
Transaction concepts and locks are different things. However, transactions use locks to help follow the ACID principles.
If you want to prevent others from reading or writing the table at the same time as you are reading or writing it, you need a lock.
If you want to ensure data integrity and consistency, you had better use transactions.
I think you have mixed up the concept of isolation levels in transactions with locks.
Please look up the isolation levels of transactions; SERIALIZABLE should be the level you want.
I had a similar problem when attempting an IF NOT EXISTS ... check followed by an INSERT, which caused a race condition when multiple threads were updating the same table.
I found the solution to the problem here: How to write INSERT IF NOT EXISTS queries in standard SQL
I realise this does not directly answer your question, but the same principle of performing the check and insert as a single statement is very useful; you should be able to modify it to perform your update.
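A hedged sketch of that single-statement pattern in MySQL (the jobs table and values are placeholders, not from the question):

-- Insert the row only if no matching row exists yet, in one atomic statement.
INSERT INTO jobs (job_name, status)
SELECT 'nightly-export', 'pending'
FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM jobs WHERE job_name = 'nightly-export');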
I'd use a
START TRANSACTION WITH CONSISTENT SNAPSHOT;
to begin with, and a
COMMIT;
to end with.
Anything you do in between is isolated from the other users of your database, provided your storage engine supports transactions (which InnoDB does).
You are confusing locks with transactions. They are two different things in an RDBMS. Locks prevent concurrent operations, while transactions focus on data isolation. Check out this great article for clarification and some graceful solutions.