PostgreSQL row read lock - sql

Let’s say I have a table called Withdrawals (id, amount, user_id, status).
Whenever a withdrawal is initiated, this is the flow:
Verify that the user has sufficient balance (calculated as the sum of amounts received minus the sum of withdrawal amounts)
Insert row with amount, user_id and status=‘pending’
Call 3rd party software through gRPC to initiate a withdrawal (actually send money), wait for a response
Update the row with status = 'completed' as soon as we get a positive response, or delete the entry if the withdrawal failed.
However, I have a concurrency problem in this flow.
Let’s say the user makes 2 full balance withdrawal requests within ~50 ms difference:
Request 1
User has enough balance
Create Withdrawal (balance = 0)
Update withdrawal status
Request 2 (after ~50ms)
User has enough balance (which is not true; the other insert hasn't been committed yet)
Create Withdrawal (balance = negative )
Update withdrawal status
Right now we are using Redis to lock withdrawals for a given user if they come within x ms of each other, to avoid this situation, but this is not the most robust solution. Since we are developing an API for businesses, our current solution would also block legitimate withdrawals that happen to be requested at the same time.
Is there any way to lock and make sure subsequent insert queries wait, based on the user_id of the Withdrawals table?

This is a property of transaction isolation. There is a lot written about it and I would highly recommend the overview in Designing Data-Intensive Applications. I found it to be the most helpful description in bettering my personal understanding.
The default Postgres level is READ COMMITTED, which allows each of these concurrent transactions to see the same funds-available state even though they should be dependent on one another.
One way to address this would be to run each of these transactions at the SERIALIZABLE isolation level.
SERIALIZABLE: All statements of the current transaction can only see rows committed before the first query or data-modification statement was executed in this transaction. If a pattern of reads and writes among concurrent serializable transactions would create a situation which could not have occurred for any serial (one-at-a-time) execution of those transactions, one of them will be rolled back with a serialization_failure error.
This should enforce the correctness of your application at a cost to availability, i.e. in this case the second transaction will not be allowed to modify the records and will be rejected, which requires a retry. For a POC or a low-traffic application this is usually a perfectly acceptable first step, as it ensures correctness right now.
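A minimal sketch of the withdrawal flow at that level (the deposits table and the literal values are assumptions, not your actual schema):

BEGIN ISOLATION LEVEL SERIALIZABLE;

-- Recompute the balance inside the transaction
SELECT COALESCE(SUM(amount), 0)
       - (SELECT COALESCE(SUM(amount), 0)
          FROM withdrawals
          WHERE user_id = 42) AS balance
FROM deposits
WHERE user_id = 42;

-- Application checks balance >= requested amount, then:
INSERT INTO withdrawals (amount, user_id, status)
VALUES (100.00, 42, 'pending');

COMMIT;  -- may fail with serialization_failure; retry the whole transaction

If two of these run concurrently for the same user, one of them is rolled back instead of both seeing the same balance.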
Also, in the book referenced above I think there was an example of how ATMs handle availability. They allow for this race condition and let the user overdraw if they are unable to connect to the centralized bank, but bound the maximum withdrawal to minimize the blast radius!
Another architectural way to address this is to take the transactions offline and make them asynchronous, so that each user-invoked withdrawal is published to a queue, and then by having a single consumer of the queue you naturally avoid any race conditions. The tradeoff here is similar: there is a fixed throughput available from a single worker, but it does help to address the correctness issue for right now :P
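One way to back such a queue with the database itself, as a sketch (the withdrawal_jobs table is made up, and FOR UPDATE SKIP LOCKED needs PostgreSQL 9.5+; any message broker works just as well):

CREATE TABLE withdrawal_jobs (
    id      bigserial PRIMARY KEY,
    user_id bigint  NOT NULL,
    amount  numeric NOT NULL,
    state   text    NOT NULL DEFAULT 'queued'
);

-- Worker: atomically claim the oldest queued job
UPDATE withdrawal_jobs
SET state = 'processing'
WHERE id = (SELECT id FROM withdrawal_jobs
            WHERE state = 'queued'
            ORDER BY id
            LIMIT 1
            FOR UPDATE SKIP LOCKED)
RETURNING id, user_id, amount;

-- call the gRPC service outside any open transaction, then:
UPDATE withdrawal_jobs SET state = 'done' WHERE id = 1;  -- id returned above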
Locking across machines (like using Redis across Postgres/gRPC) is called distributed locking, and a good amount has been written about it: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html

Related

select for update waiting for network operation?

I'm implementing an online shop.
I'm wondering if it is ok to use select for update locking for an order record.
Following are the order state changes that I consider locking.
payment processing: buyer pays for the order, order goes from waiting-for-payment to paid
order cancelling: buyer or seller cancels the order
order confirm: seller confirms the order so that buyer can't cancel the order anymore
For instance, consider a buyer cancelling an order while the seller confirms it.
Without locking, it is possible that the buyer performs the cancellation and the seller confirms the order at the same time.
But with locking, either buyer cancels or seller confirms.
So far so good. My question is this: would locking an order row while waiting for a network operation (payment processing) be too big a performance overhead, even though it's only a single row in the table?
I'm using PostgreSQL.
You are using Database Locking mechanism to implement Business Logic. That is a bad idea.
Instead introduce a [Transaction State] flag field and analyse it to apply Business Logic. That would enable flexibility in business scenarios when you can have multiple different Transaction States and complex Business Rules applicable for each of the states.
Better still use Transaction Processing History table with full log of Transaction States over time.
Update:
The status should change only if it is consistent with the history. If the payment has failed there is no point in marking the order paid and then rolling it back.
For every change there should be a list of requirements verified before the action takes place. Check that there are items in the basket, the customer has confirmed the delivery address, and the total price is greater than the discounts before requesting payment.
There are infinite scenarios of order state change and having a separate code for each state that would also include the entire history is impractical. Order can become available for dispatch after payment confirmation or customer return or replacement on non-delivery or pre-ordered items arriving from the supplier or any number of other situations.
Better to keep track of the full order state history for each order, so you can analyse business scenarios and choose the best next action accordingly.
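A minimal sketch of such a history table (the names and states are illustrative, not a prescribed schema):

CREATE TABLE order_state_history (
    order_id   bigint      NOT NULL,
    state      text        NOT NULL,  -- 'waiting-for-payment', 'paid', 'cancelled', 'confirmed', ...
    changed_by text        NOT NULL,  -- buyer, seller, payment gateway, ...
    changed_at timestamptz NOT NULL DEFAULT now()
);

-- The current state of an order is simply its most recent history row
SELECT DISTINCT ON (order_id) order_id, state
FROM order_state_history
ORDER BY order_id, changed_at DESC;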

What is the best practice database design for transactions aggregation?

I am designing a database which will hold transaction level data. It will work the same way as a bank account - debits/credits to an Account Number.
What is the best / most efficient way of obtaining the aggregation of these transactions.
I was thinking about using a summary table and then adding these to a list of today's transactions in order to derive how much each account has (i.e their balance).
I want this to be scalable (i.e. 1 billion transactions), so I don't want every balance lookup to hit the main fact table, since finding all the debits/credits associated with a given account number could mean scanning potentially a billion rows.
Thanks, any help or resources would be awesome.
(Have been working in banks for almost 10 years. Here is how it is actually done.)
TLDR: your idea is good.
Every now and then you store the balance somewhere else ("carry forward balance"), e.g. every month or so, or after a given number of transactions. To calculate the actual balance (or any balance in the past) you accumulate all relevant transactions going back in time until the most recent balance you kept (the "carry forward balance"), which you need to add, of course.
The "current" balance is not kept anywhere, if only because of the locking problems you would have if you updated this balance all the time. (In real banks you hit some bank-internal accounts with almost every single transaction. There are plenty of bank-internal accounts needed to produce the figures required by law. These accounts are hit very often and would therefore cause locking issues if you updated them with every transaction. Instead, every transaction is just an insert; even the carry forward balances are just inserts.)
Also in real banks you have many use cases which make this approach more favourable:
Being able to get back-dated balances at any time.
Being able to get balances based on different dates (e.g. value date vs. transaction date).
Reversals/cancellations are a fun topic of their own. Imagine reversing a transaction from two weeks ago and still keeping all of the above working.
You see, this is a long story. However, the answer to your question is: you cannot keep accumulating an ever-increasing number of transactions; you need to keep intermediate balances to limit the number of rows to accumulate when needed. Hitting the main table for a limited number of rows should be no issue.
Make sure your main query uses an Index-Only Scan.
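A sketch of what such a balance lookup might look like (a transactions table and a carry_forward_balances table are assumed; names are illustrative):

SELECT cf.balance + COALESCE(SUM(t.amount), 0) AS current_balance
FROM carry_forward_balances cf
LEFT JOIN transactions t
       ON t.account_id = cf.account_id
      AND t.booked_at  > cf.as_of
WHERE cf.account_id = 42
  AND cf.as_of = (SELECT MAX(as_of)
                  FROM carry_forward_balances
                  WHERE account_id = 42)
GROUP BY cf.balance;

A covering index on transactions (account_id, booked_at, amount) can make the transaction lookup an index-only scan.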
Do an object-oriented design: create tables for objects, for example Account, Transaction, etc. Here's a good website for your reference, but there's a lot more on the web discussing OODBMS. The reference I gave is just what I based myself on when I started doing an OODBMS.

transactions and balance

I work on a contracting company database (SQL Server). I'm not sure what the best solution is for calculating their customers' account balances.
Balance table: create a table for balances and another for transactions. My application adds any transaction to the transactions table and calculates the balance from the balance table value.
Calculate balance using a query: in that case I'll create a transactions table only.
Note: there may be up to 2 million records each year, so I think they will need to back it up every year or something like that.
any new ideas or comments ?!
I would have a transactions table and a balances table as well, if I were you. Let's consider for example that you have 1,000,000 users. If a user has 20 transactions on average, then getting a balance from the transactions table would be roughly 20x slower than getting it from a balances table. Also, it is better to have that table than not to have it.
So, I would choose to create a balances table without thinking twice.
Comments on your 2 ways:
Good solution if you have many more reads than updates (100 times or more). So, you add a new transaction, recalculate the balance and store it. You can do it in one transaction, but that can take a lot of time and block the user action, so you can also do it later (for example, update balances once a minute/hour/day). Pros: fast reading. Cons: possible difference between the balance value and the sum of transactions, or increased user action time.
Good solution if you have many more updates than reads (for example, a trading system with a lot of transactions). Updating the current balance can take time and may be pointless, because another transaction has already come in :) so you can calculate the balance at runtime, on demand. Pros: always the actual balance. Cons: calculating the balance can take time.
As you see, it depends on your load profile (reads or writes). I'd advise you to begin with the second variant - it's easy to implement, and good DB indexes can help you get the sum very fast (2 million rows per year is not as much as it looks). But it's up to you.
Definitely you must have a separate balance table beside the transactions table. Otherwise reading the balance will get slower day by day as transactions accumulate, and reads will be costly because other users may lock the transaction table while the balance is being read.
This question would seem to have a lot of opinion, and I was tempted to close it.
But, in any environment where I've been where customers have "balances", a critical part of the business is knowing the current balance for each customer. This means having a historical transaction table, a current balance amount, and an auditing process to ensure that the two are aligned.
The current balance would be maintained whenever the database is changed. The "standard" method is to use triggers. My preferred method is to encapsulate data changes in stored procedures, and have the logic for the summarization in the same procedures used to modify the transaction data.
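A minimal sketch of the trigger approach (SQL Server syntax, since the question mentions it; transactions(customer_id, amount) and balances(customer_id, balance) are assumed names):

CREATE TRIGGER trg_transactions_balance
ON transactions
AFTER INSERT
AS
BEGIN
    -- Add each newly inserted amount to the customer's stored balance
    UPDATE b
    SET b.balance = b.balance + i.total
    FROM balances AS b
    JOIN (SELECT customer_id, SUM(amount) AS total
          FROM inserted
          GROUP BY customer_id) AS i
      ON i.customer_id = b.customer_id;
END;

The stored-procedure variant mentioned above would run an equivalent UPDATE inside the same procedure that inserts the transaction row.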

Does editing simultaneous data in a database really matter?

I worked on an inventory tracking web application. I ran into a problem when designing the user interface, especially editing the purchase list.
Here is the overview data structure of my Purchase table
+------+-------+------+------+-----------
| ID | Name | Price| QTY | Date |
+------+-------+------+------+-----------
|B03-13|Goods1 |10000 |10 |2014-3-10
|B03-14|Goods2 |10000 |20 |2014-3-10
|B03-13|Goods3 |20000 |5 |2014-4-10
the avg price table
+------+----------+----------
| ID | Avg Price| Date |
+------+----------+----------
|B03-13|1000 |2014-3-10
|B03-13|500 |2014-3-10
|B03-13|2000 |2014-4-10
to calculate the moving average price
(Price * total) / total stock
Imagine that we have 1000 rows of data and we edit, say, the third row. The application will of course have to recalculate everything again, because editing a row changes the final average price. So every time the user makes a change I have to recalculate every row after the edited one, checking the "avg price" table and updating the data related to it.
In this kind of situation, should we drop the "edit data" feature, or only allow the user to edit the last data they inserted?
Thank you for your answer, I hope it will be useful for other people who have the same problem.
There are multiple aspects to your question, and it is unfortunately very broad. I'll focus on some rather arbitrarily chosen details; others may address other aspects.
How to deal with related changes to multiple DB tables by just one user?
This is done by putting all the changes into a single transaction. That way, when you alter a purchase amount or add a new purchase, the average price of the given good will be changed at the same time. An outside observer can either see the old data or the new data, but not an inconsistent state where one table is changed and another isn't yet. Of course this has to take into account the implementation details of the database, and the type of ACID guarantees offered by its transaction system and storage backend.
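A minimal sketch of the idea using the tables from the question (the recalculation is a simplified weighted average, just to show both tables changing in one transaction):

BEGIN;

UPDATE purchase
SET price = 12000, qty = 8
WHERE id = 'B03-13' AND name = 'Goods1';

-- Recompute the dependent average inside the same transaction
UPDATE avg_price
SET avg_price = (SELECT SUM(price * qty) / NULLIF(SUM(qty), 0)
                 FROM purchase
                 WHERE id = 'B03-13')
WHERE id = 'B03-13';

COMMIT;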
How to deal with changes to the same DB table(s) by multiple users?
One way is to lock the relevant database rows, preventing others from changing them, as soon as a user interface session exposing given data for modification is started.
Another way is to add a Generation column to all tables changed by the users. When user-interface data is fetched, the generation number is fetched as well. When a change transaction is submitted, the queries are all qualified with where Purchase.Generation = ... etc. That way any change originating from stale data will fail.
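For example (a sketch; the generation value and column names are hypothetical):

-- The client previously read this row together with generation = 7
UPDATE purchase
SET qty = 8,
    generation = generation + 1
WHERE id = 'B03-13'
  AND generation = 7;
-- 0 rows affected means someone else changed the row in the meantime:
-- re-read the current data and let the user retry.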
Yet another way, interoperating with the Generation column, is to submit relevant transactions not tied to a generation, but as change transactions that don't care about existing value. For example, if you have received a certain number of widgets at a certain price, in the Purchase table you're either adding to an existing row, or creating a new row. In the AvgPrice table, you're atomically modifying the value based on relevant values in the Purchase table. Thus it doesn't matter that another user might have purchased the same item in the meantime: the database will stay consistent no matter what, and your action of getting 100 more items will result in just that: a counter somewhere will go up (either in an existing or in a new row). This is just like if this was a deposit transaction on a bank account.
Of course, from a business perspective, we may still want such a transaction to fail, since perhaps the original user doesn't need to buy those 100 items anymore.
How to present the user with a failed transaction?
This depends on the context. Sometimes, as in the case of a bank deposit, you don't do anything, just retry the same transaction, since the transaction has a "delta" semantics rather than "replace with new value" semantics. A withdrawal is similar, but not quite: it is a "delta" transaction, but it can fail in two ways:
A concurrent transaction was executing and aborted this transaction. You always retry in such case.
A transaction fails because the business constraints are not met anymore. For example, in the meantime the balance fell below the minimum needed to satisfy the withdrawal. This is a hard failure and a retry won't fix it.
Other times, the user can be presented with the updated data alongside of the data used to generate a transaction, with highlights on what has changed. This depends on how the data is presented. In an inventory table, for example, you could visually indicate the fields that were changed "in the meantime". The user would then be prompted to re-enter the desired values or deltas for those fields, to explicitly confirm that the same changes are still desired. This would force the reconsideration of a widget purchase in light of someone else already having ordered the same widgets, and so on.

MySQL: Transactions vs Locking Tables

I'm a bit confused with transactions vs locking tables to ensure database integrity and make sure a SELECT and UPDATE remain in sync and no other connection interferes with it. I need to:
SELECT * FROM table WHERE (...) LIMIT 1
if (condition passes) {
// Update row I got from the select
UPDATE table SET column = "value" WHERE (...)
... other logic (including INSERT some data) ...
}
I need to ensure that no other queries will interfere and perform the same SELECT (reading the 'old value' before that connection finishes updating the row).
I know I can default to LOCK TABLES table to just make sure that only 1 connection is doing this at a time, and unlock it when I'm done, but that seems like overkill. Would wrapping that in a transaction do the same thing (ensuring no other connection attempts the same process while another is still processing)? Or would a SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE be better?
Locking tables prevents other DB users from affecting the rows/tables you've locked. But locks, in and of themselves, will NOT ensure that your logic comes out in a consistent state.
Think of a banking system. When you pay a bill online, there's at least two accounts affected by the transaction: Your account, from which the money is taken. And the receiver's account, into which the money is transferred. And the bank's account, into which they'll happily deposit all the service fees charged on the transaction. Given (as everyone knows these days) that banks are extraordinarily stupid, let's say their system works like this:
$balance = "GET BALANCE FROM your ACCOUNT";
if ($balance < $amount_being_paid) {
charge_huge_overdraft_fees();
}
$balance = $balance - $amount_being_paid;
UPDATE your ACCOUNT SET BALANCE = $balance;
$balance = "GET BALANCE FROM receiver ACCOUNT"
charge_insane_transaction_fee();
$balance = $balance + $amount_being_paid
UPDATE receiver ACCOUNT SET BALANCE = $balance
Now, with no locks and no transactions, this system is vulnerable to various race conditions, the biggest of which is multiple payments being performed on your account, or the receiver's account, in parallel. While your code has your balance retrieved and is doing the huge_overdraft_fees() and whatnot, it's entirely possible that some other payment will be running the same type of code in parallel. They'll retrieve your balance (say, $100), do their transactions (take out the $20 you're paying, and the $30 they're screwing you over with), and now both code paths have two different balances: $80 and $70. Depending on which one finishes last, you'll end up with either of those two balances in your account, instead of the $50 you should have ended up with ($100 - $20 - $30). In this case, "bank error in your favor".
Now, let's say you use locks. Your bill payment ($20) hits the pipe first, so it wins and locks your account record. Now you've got exclusive use, and can deduct the $20 from the balance, and write the new balance back in peace... and your account ends up with $80 as is expected. But... uhoh... You try to go update the receiver's account, and it's locked, and locked longer than the code allows, timing out your transaction... We're dealing with stupid banks, so instead of having proper error handling, the code just pulls an exit(), and your $20 vanishes into a puff of electrons. Now you're out $20, and you still owe $20 to the receiver, and your telephone gets repossessed.
So... enter transactions. You start a transaction, you debit your account $20, you try to credit the receiver with $20... and something blows up again. But this time, instead of exit(), the code can just do rollback, and poof, your $20 is magically added back to your account.
In the end, it boils down to this:
Locks keep anyone else from interfering with any database records you're dealing with. Transactions keep any "later" errors from interfering with "earlier" things you've done. Neither alone can guarantee that things work out ok in the end. But together, they do.
in tomorrow's lesson: The Joy of Deadlocks.
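A sketch of the same payment with both pieces combined, i.e. a transaction plus row locks via SELECT ... FOR UPDATE (the accounts table and ids are made up for the example):

START TRANSACTION;

-- Lock both account rows so no concurrent payment can read stale balances
-- (locking rows in a consistent order, e.g. by id, also helps avoid deadlocks)
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;  -- payer
SELECT balance FROM accounts WHERE id = 2 FOR UPDATE;  -- receiver

UPDATE accounts SET balance = balance - 20 WHERE id = 1;
UPDATE accounts SET balance = balance + 20 WHERE id = 2;

-- On any error: ROLLBACK, and the half-done debit disappears with it
COMMIT;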
I've started to research the same topic for the same reasons as you indicated in your question. I was confused by the answers given on SO because they were partial answers that didn't provide the big picture. After reading a couple of documentation pages from different RDBMS providers, these are my takeaways:
TRANSACTIONS
Statements are database commands, mainly to read and modify the data in the database. Transactions are the scope of single or multiple statement executions. They provide two things:
A mechanism which guarantees that all statements in a transaction are executed correctly, or, in case of a single error, any data modified by those statements will be reverted to its last correct state (i.e. rollback). What this mechanism provides is called atomicity.
A mechanism which guarantees that concurrent read statements can view the data without the occurrence of some or all of the phenomena described below.
Dirty read: A transaction reads data written by a concurrent uncommitted transaction.
Nonrepeatable read: A transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read).
Phantom read: A transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction.
Serialization anomaly: The result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.
What this mechanism provides is called isolation, and the mechanism which lets the statements choose which phenomena should not occur in a transaction is called isolation levels.
As an example, in PostgreSQL's isolation-level / phenomena table, READ COMMITTED permits nonrepeatable reads, phantom reads and serialization anomalies; REPEATABLE READ permits only serialization anomalies; and SERIALIZABLE permits none of them (dirty reads are never possible in PostgreSQL).
If any of the described promises is broken by the database system, changes are rolled back and the caller is notified about it.
How these mechanisms are implemented to provide these guarantees is described below.
LOCK TYPES
Exclusive Locks: When an exclusive lock is acquired over a resource, no other exclusive lock can be acquired over that resource. Exclusive locks are always acquired before a modify statement (INSERT, UPDATE or DELETE) and they are released after the transaction is finished. To explicitly acquire exclusive locks before a modify statement you can use hints like FOR UPDATE (PostgreSQL, MySQL) or UPDLOCK (T-SQL).
Shared Locks: Multiple shared locks can be acquired over a resource. However, shared locks and exclusive locks cannot be acquired at the same time over a resource. Shared locks might or might not be acquired before a read statement (SELECT, JOIN), based on the database's implementation of isolation levels.
LOCK RESOURCE RANGES
Row: single row the statements executes on.
Range: a specific range based on the condition given in the statement (SELECT ... WHERE).
Table: whole table. (Mostly used to prevent deadlocks on big statements like batch update.)
As an example, for SQL Server the default shared-lock behaviour varies by isolation level: READ UNCOMMITTED takes no shared locks, READ COMMITTED releases them as soon as the data has been read, while REPEATABLE READ and SERIALIZABLE hold them until the end of the transaction (SERIALIZABLE additionally takes range locks).
DEADLOCKS
One of the downsides of the locking mechanism is deadlocks. A deadlock occurs when a statement enters a waiting state because a requested resource is held by another waiting statement, which in turn is waiting for another resource held by yet another waiting statement. In such a case the database system detects the deadlock and terminates one of the transactions. Careless use of locks can increase the chance of deadlocks, but they can occur even without human error.
SNAPSHOTS (DATA VERSIONING)
This is an isolation mechanism which provides a statement with a copy of the data taken at a specific time.
Statement beginning: provides the statement with a copy of the data taken at the beginning of the statement's execution. It also helps the rollback mechanism by keeping this data until the transaction is finished.
Transaction beginning: provides the statement with a copy of the data taken at the beginning of the transaction.
All of those mechanisms together provide consistency.
When it comes to optimistic and pessimistic locks, these are just names for classifying approaches to the concurrency problem.
Pessimistic concurrency control:
A system of locks prevents users from modifying data in a way that affects other users. After a user performs an action that causes a lock to be applied, other users cannot perform actions that would conflict with the lock until the owner releases it. This is called pessimistic control because it is mainly used in environments where there is high contention for data, where the cost of protecting data with locks is less than the cost of rolling back transactions if concurrency conflicts occur.
Optimistic concurrency control:
In optimistic concurrency control, users do not lock data when they read it. When a user updates data, the system checks to see if another user changed the data after it was read. If another user updated the data, an error is raised. Typically, the user receiving the error rolls back the transaction and starts over. This is called optimistic because it is mainly used in environments where there is low contention for data, and where the cost of occasionally rolling back a transaction is lower than the cost of locking data when read.
For example, by default PostgreSQL uses snapshots to make sure the read data hasn't changed, and rolls back if it has, which is an optimistic approach. SQL Server, however, uses read locks by default to provide these guarantees.
The implementation details might change according to the database system you choose. However, according to database standards, they need to provide the stated transaction guarantees in one way or another using these mechanisms. If you want to know more about the topic or about specific implementation details, below are some useful links.
SQL-Server - Transaction Locking and Row Versioning Guide
PostgreSQL - Transaction Isolation
PostgreSQL - Explicit Locking
MySQL - Consistent Nonlocking Reads
MySQL - Locking
Understanding Isolation Levels (Video)
You want a SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE inside a transaction, as you said, since normally SELECTs, no matter whether they are in a transaction or not, will not lock a table. Which one you choose would depend on whether you want other transactions to be able to read that row while your transaction is in progress.
http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
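A sketch of the pattern from the question with a locking read (the table, column and values are placeholders):

START TRANSACTION;

-- The row stays locked until COMMIT; concurrent FOR UPDATE / UPDATE on it will wait
SELECT * FROM mytable WHERE id = 123 LIMIT 1 FOR UPDATE;

-- if (condition passes) in application code:
UPDATE mytable SET mycolumn = 'value' WHERE id = 123;
-- ... INSERT other data ...

COMMIT;

LOCK IN SHARE MODE instead of FOR UPDATE would still let other transactions read the row while yours is in progress.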
START TRANSACTION WITH CONSISTENT SNAPSHOT will not do the trick for you, as other transactions can still come along and modify that row. This is mentioned right at the top of the link below.
If other sessions simultaneously update the same table [...] you may see the table in a state that never existed in the database.
http://dev.mysql.com/doc/refman/5.0/en/innodb-consistent-read.html
Transaction concepts and locks are different. However, transactions use locks to help them follow the ACID principles.
If you want the table to prevent others from reading/writing at the same point in time while you are reading/writing, you need a lock to do this.
If you want to make sure of data integrity and consistency, you had better use transactions.
I think you have mixed up the concept of isolation levels in transactions with locks.
Please look up the isolation levels of transactions; SERIALIZABLE should be the level you want.
I had a similar problem when attempting an IF NOT EXISTS ... check and then performing an INSERT, which caused a race condition when multiple threads were updating the same table.
I found the solution to the problem here: How to write INSERT IF NOT EXISTS queries in standard SQL
I realise this does not directly answer your question, but the same principle of performing a check and insert as a single statement is very useful; you should be able to modify it to perform your update.
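One common shape of that technique, as a sketch (MySQL flavor; the table, columns and values are placeholders):

-- Insert the row only if it isn't there yet, in a single statement
INSERT INTO mytable (id, mycolumn)
SELECT 123, 'value'
FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE id = 123);

-- A UNIQUE KEY on id plus INSERT ... ON DUPLICATE KEY UPDATE is the sturdier
-- MySQL-specific variant of the same idea.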
I'd use a
START TRANSACTION WITH CONSISTENT SNAPSHOT;
to begin with, and a
COMMIT;
to end with.
Anything you do in between is isolated from the other users of your database, provided your storage engine supports transactions (which InnoDB does).
You are confusing locks and transactions. They are two different things in an RDBMS. A lock prevents concurrent operations, while a transaction focuses on data isolation. Check out this great article for clarification and some graceful solutions.