In TypeORM with replication, am I hitting the master instance whenever I am inside an EntityManager transaction?

I am having some race condition issues with an Aurora PostgreSQL DB hosted on AWS RDS. An example similar to what is happening is:
A Group table has a column userEntranceLimits.
A UserEntrance table has a column userId and a column groupId.
The count of a UserEntrance's distinct users for a specific group may not exceed the group's userEntranceLimits.
Inside a transaction for creating a new UserEntrance, I check if the current count of UserEntrances for that group is >= the limit.
If it is not, I proceed to create a new UserEntrance.
On rare occasions, a group ends up with more UserEntrances than its limit.
I initially thought that the race condition was because of out-of-sync read replicas. However, if I am understanding this part of typeorm's code correctly, the transactions will always execute in master. Is that right? If that's the case, then I suppose the actual solution to the race condition is changing the isolation level of the transaction to something else, e.g. SERIALIZABLE. Is any of this that I'm thinking true?

If you use it the way the TypeORM docs describe, by calling entityManager.transaction() directly or using the @Transaction() decorator, and you use the connection provided by the transaction, then YES, transactions use the master.
When I debugged the TypeORM sources, entityManager.queryRunner was undefined at the point where the transaction starts. So, per the EntityManager implementation, a queryRunner is created for each transaction with no replication mode provided, and the default (master) is used.
So the problem is more likely in your logic than in the transaction configuration.
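The check-then-insert race described in the question can be replayed without a database. The following is a minimal sketch, not TypeORM code: the function names and the in-memory list are hypothetical, and the threading.Lock only stands in for serializing on the Group row (for example with SELECT ... FOR UPDATE, or with a SERIALIZABLE transaction plus retry).

```python
import threading


def unguarded_interleaving(limit, existing_users):
    """Replay the bad schedule: both transactions count before either inserts."""
    entrances = list(existing_users)
    seen_by_a = len(set(entrances))  # transaction A counts distinct users
    seen_by_b = len(set(entrances))  # transaction B counts the same rows
    if seen_by_a < limit:
        entrances.append("user_a")
    if seen_by_b < limit:
        entrances.append("user_b")
    return entrances


def guarded(limit, existing_users, new_users):
    """Serialize check-and-insert per group, as a lock on the group row would."""
    entrances = list(existing_users)
    group_lock = threading.Lock()  # assumption: stand-in for locking the Group row

    def enter(user_id):
        with group_lock:
            if len(set(entrances)) < limit:
                entrances.append(user_id)

    threads = [threading.Thread(target=enter, args=(u,)) for u in new_users]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return entrances


over_limit = unguarded_interleaving(limit=2, existing_users=["u1"])
within_limit = guarded(limit=2, existing_users=["u1"], new_users=["user_a", "user_b"])
```

In the unguarded schedule both transactions see one existing user and both insert, so the group ends up over its limit even though every query ran on the master.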

Related

How to consistently track all new rows in a SQL database table

What I am trying to do
I am developing a web service, which runs in multiple server instances, all accessing the same RDBMS (PostgreSQL). While the database is needed for persistence, it contains very little data, which is why every server instance has a cache of all the data. Further the application is really simple in that it only ever inserts new rows in rather simple tables and selects that data in a scheduled fashion from all server instances (no updates or changes... only inserts and reads).
The way it is currently implemented
basically I have a table which roughly looks like this:
id BIGSERIAL,
creation_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- further data columns...
The server is doing something like this every couple of seconds (pseudocode):
get all rows with creation_timestamp > lastMaxTimestamp
lastMaxTimestamp = max timestamp for all data just retrieved
insert new rows into application cache
The issue I am running into
The application skips certain rows when updating the caches. I analyzed the issue and found that the problem arises in the following way:
one server instance is creating a new row in the context of a transaction. An id for the new row is retrieved from the associated sequence (id=n) and the creation_timestamp (with value ts_1) is set.
another server does the same in the context of a different transaction. The new row in this transaction gets id=n+1 and a creation_timestamp ts_2 (where ts_1 < ts_2).
transaction 2 finishes before transaction 1
one of the servers executes a "select all rows with creation_timestamp > lastMaxTimestamp". It gets row n+1, but not row n. It sets lastMaxTimestamp to ts_2.
transaction 1 completes
some time later the server from step 4 executes "select all rows with creation_timestamp > lastMaxTimestamp" again. But since lastMaxTimestamp=ts_2 and ts_2>ts_1 the row n will never be read on that server.
Note: CURRENT_TIMESTAMP has the same value during a transaction, which is the transaction start time.
So the application gets inconsistent data into its cache and can't get new rows based on the insertion timestamp OR based on the sequence id. Transaction isolation levels don't really change anything about the situation, since the problem is created in essence by transaction 2 finishing before transaction 1.
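The six steps above can be replayed in a few lines. This is only a simulation under stated assumptions: the Row class, the poll function, and the integer timestamps 10 and 20 (standing in for ts_1 and ts_2) are made up for illustration, and "committed" is a flag rather than a real transaction.

```python
from dataclasses import dataclass


@dataclass
class Row:
    row_id: int
    creation_timestamp: int  # fixed at transaction start, like CURRENT_TIMESTAMP
    committed: bool = False


def poll(rows, last_max_timestamp):
    """Return the committed rows newer than the watermark, plus the new watermark."""
    visible = [
        r for r in rows
        if r.committed and r.creation_timestamp > last_max_timestamp
    ]
    new_max = max((r.creation_timestamp for r in visible), default=last_max_timestamp)
    return visible, new_max


def replay():
    row_n = Row(row_id=1, creation_timestamp=10)         # transaction 1 starts (ts_1)
    row_n_plus_1 = Row(row_id=2, creation_timestamp=20)  # transaction 2 starts (ts_2)
    rows = [row_n, row_n_plus_1]
    cache, watermark = [], 0

    row_n_plus_1.committed = True  # transaction 2 finishes before transaction 1
    seen, watermark = poll(rows, watermark)
    cache.extend(r.row_id for r in seen)

    row_n.committed = True  # transaction 1 completes afterwards
    seen, watermark = poll(rows, watermark)
    cache.extend(r.row_id for r in seen)
    return cache, watermark


cache, watermark = replay()
```

After both polls the cache holds only the second row: the watermark has already moved past ts_1, so the late-committing row is never selected.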
My question
Am I missing something? I am thinking there must be a straightforward way to get all new rows of a RDBMS, but I can't come up with a simple solution... at least with a simple solution that is consistent. Extensive locking (e.g. of tables) wouldn't be acceptable because of performance reasons. Simply trying to ensure to get all ids from that sequence seems like a) a complicated solution and b) can't be done easily, since rollbacks during transactions can happen (which would lead to sequence ids not being used).
Does anyone have a solution?
After a lot of searching, I found the right keywords to google for... "transaction commit timestamp" leads to all sorts of material on transaction timestamp tracking and system columns like xmin:
https://dba.stackexchange.com/questions/232273/is-there-way-to-get-transaction-commit-timestamp-in-postgres
This post has some more detailed information:
Questions about Postgres track_commit_timestamp (pg_xact_commit_timestamp)
In short:
you can turn on a postgresql option to track timestamps of commits and compare those instead of the current_timestamps/clock_timestamps inside the transaction
it seems, though, that it is only tracked when a transaction is completed - not when it is committed - which makes the solution not bulletproof. There are also further issues to consider, such as transaction id (xmin) rollover
logical decoding / replication is something to look into for a proper solution
Thanks to everyone trying to help me find an answer. I hope this summary is useful to someone in the future.

Global revision without locking

Given this set of rules, would it be possible to implement this in SQL?
Two transactions that don't modify the same rows should be able to run concurrently. No locks should occur (or at least their use should be minimized as much as possible).
Transactions can only read committed data.
A revision is defined as an integer value in the system.
A new transaction must be able to increment and query a new revision. This revision will be applied to every row that the transaction modifies.
No 2 transactions can share the same revision.
A transaction X that is committed before transaction Y must have a revision lower than the one assigned to transaction Y.
I want to use integer as the revision in order to optimize how I query all changes since a specific revision. Something like this:
SELECT * FROM [DummyTable] WHERE [DummyTable].[Revision] > clientRevision
My current solution uses an SQL table [GlobalRevision] with a single row [LastRevision] to keep the latest revision. The isolation level of all my transactions is set to Snapshot.
The problem with this solution is that the [GlobalRevision] table with the single row [LastRevision] becomes a point of contention. This is because I must increment the revision at the start of a transaction so that I can apply the new revision to the modified rows. This will keep a lock on the [LastRevision] row throughout the duration of the transaction, killing the concurrency. Even though two concurrent transactions modify totally different rows, they cannot be executed concurrently (Rule #1: Failed).
Is there any pattern in SQL to solve this kind of issue? One solution is to use GUIDs and keep a history of revisions (like git revisions), but this is less easy than just having an integer that we can compare to see if a revision is newer than another one.
UPDATE:
The business case for this is to create a BaaS system (Backend as a Service) with data synchronization between client and server. Here are some use cases for this kind of system:
Client while online modifies an asset, pushes the update to the server, server updates DB [this is where my question relates to], server sends update notifications to interested clients that synchronize their local data with the new changes.
Client connects to server, client requests a pull to the server, server finds all changes that were applied after client's revision and return them to the client, client applies the changes and sets its new revision.
...
As you can see, the global revision lets me put a revision on every change committed on the server, and from this revision I can determine what updates need to be sent to the clients depending on their specific revision.
This needs to scale to multiple thousands of users that can push updates in parallel and those changes must be synchronized to other connected users. So the longer it takes to execute a transaction, the longer it takes for other users to receive the change notifications.
I want to avoid as much as possible contention for this reason. I am not an expert in SQL so I just want to make sure there is not something I am missing that would let me do that easily.
Probably the easiest thing for you to try would be to use a SEQUENCE for your revision number, assuming you're at SQL 2012 or newer. This is a lighter-weight way of generating an auto-incrementing value that you can use as a revision ID per your rules. Acquiring them at scale should be far less subject to the contention issues you describe than using a full-fledged table.
You do need to know that you could end up with revision number gaps if a given transaction rolled back, because SEQUENCE values operate outside of transactional scope. From the article:
Sequence numbers are generated outside the scope of the current
transaction. They are consumed whether the transaction using the
sequence number is committed or rolled back.
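The gap behaviour in the quote can be illustrated with a toy model. This is a sketch, not SQL Server code: the Sequence class and run_transaction helper are hypothetical, and a Python counter stands in for a database SEQUENCE whose values are consumed whether or not the transaction commits.

```python
import itertools


class Sequence:
    """Toy stand-in for a database SEQUENCE: values are never handed back."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_value(self):
        return next(self._counter)


def run_transaction(sequence, committed_revisions, should_commit):
    """Take a revision up front; only record it if the transaction commits."""
    revision = sequence.next_value()  # consumed outside transactional scope
    if should_commit:
        committed_revisions.append(revision)
    return revision


sequence = Sequence()
committed_revisions = []
run_transaction(sequence, committed_revisions, should_commit=True)
run_transaction(sequence, committed_revisions, should_commit=False)  # rolled back
run_transaction(sequence, committed_revisions, should_commit=True)
```

The rolled-back transaction still used up a value, so the committed revisions are not contiguous. Clients comparing revisions with ">" are unaffected by the gap itself.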
If you can relax the requirement for an integer revision number and settle for knowing what the data was at a given point in time, you might be able to use Change Data Capture, or, in SQL 2016, Temporal Tables. Both of these technologies allow you to "turn back time" and see what the data looked like at a known timestamp.

Rails ActiveRecord - how can I lock a table for reading?

I have some Rails ActiveRecord code that looks like this:
new_account_number = Model.maximum(:account_number)
# Some processing that usually involves incrementing
# the new account number by one.
Model.create(foo: 12, bar: 34, account_number: new_account_number)
This code works fine on its own, but I have some background jobs that are processed by DelayedJob workers. There are two workers, and if they both start processing a batch of jobs that deal with this code, they end up creating new Model records that have the same account_number, because of the delay between finding the maximum and creating a new record with an even higher account number.
For now, I have solved it by adding a uniqueness constraint at database level to the models table and then retry by re-selecting the maximum in case this constraint triggers an exception.
However it feels like a hack.
Adding auto incrementing at database level to the account_number column is not an option, because the account_number assigning entails more than just incrementing.
Ideally I would like to lock the table in question for reading, so no other can execute the maximum select query against the table until I am done. However, I'm not sure how to go about that. I'm using Postgresql.
Based on the ActiveRecord::Locking docs it looks like Rails doesn't provide a built-in API for table-level locks.
But you can still do this with raw SQL. For Postgres, this looks like
ActiveRecord::Base.transaction do
  ActiveRecord::Base.connection.execute('LOCK table_name IN ACCESS EXCLUSIVE MODE')
  ...
end
The lock must be acquired within a transaction, and is automatically freed once the transaction ends.
Note that the SQL you use here will be different depending on your database.
Obviously locking the entire table is not elegant or efficient, but for small apps, for some time, it may indeed be the best solution. It's simple and easy to reason about. In general, an advisory lock is a better fit for this kind of data race.
There are already answers on how to lock the entire table, but I believe you should try to avoid that. Instead I believe you should give advisory locks a look. It makes sure the same block of code isn't executed on two machines simultaneously, while still keeping the table open for other business.
It still uses the database, but it doesn't lock your tables.
You can use the gem called "with_advisory_lock" like this:
Model.with_advisory_lock("ADVISORY_LOCK_NAME") do
  # Your code
end
https://github.com/ClosureTree/with_advisory_lock
It doesn't work with SQLite.
Setting a unique constraint IS NOT a hack. It is the thing that keeps your data consistent.
By the way you have a few more options here:
1. Lock some DB resource (e.g. it could be a unique record) using SELECT FOR UPDATE or PostgreSQL's Advisory Locks (see docs).
2. Use a sequence (docs).
The main difference between the two approaches is that #1 does not allow gaps in your numbers, because other sessions wait for the transaction to commit, while #2 does.
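The constraint-plus-retry approach from the question can be sketched as follows. This is an illustration under assumptions, not ActiveRecord code: it uses Python's sqlite3 module in place of PostgreSQL, the models table is reduced to one column, and the pick_number hook is a hypothetical parameter that exists only so a stale read can be simulated.

```python
import sqlite3


def next_account_number(conn):
    """Read the current maximum and propose the next number (the racy step)."""
    (current,) = conn.execute(
        "SELECT COALESCE(MAX(account_number), 0) FROM models"
    ).fetchone()
    return current + 1


def create_model(conn, pick_number=next_account_number, max_attempts=5):
    """Insert with a proposed number; on a uniqueness violation, re-read and retry."""
    for _ in range(max_attempts):
        number = pick_number(conn)
        try:
            with conn:  # commits on success, rolls back if the insert fails
                conn.execute(
                    "INSERT INTO models (account_number) VALUES (?)", (number,)
                )
            return number
        except sqlite3.IntegrityError:
            continue  # another worker took this number first
    raise RuntimeError("could not allocate an account number")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE models ("
    "id INTEGER PRIMARY KEY, account_number INTEGER NOT NULL UNIQUE)"
)
first = create_model(conn)

attempts = []


def stale_then_fresh(c):
    """Simulate a worker whose first read of the maximum is already stale."""
    attempts.append(1)
    return 1 if len(attempts) == 1 else next_account_number(c)


second = create_model(conn, pick_number=stale_then_fresh)
```

The database rejects the duplicate, and the retry picks up the fresh maximum. The constraint is what guarantees uniqueness; the retry loop only decides what to do when it fires.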
You don't have to lock the whole table to restrict a piece of code to a single process at a time. Locking a full table causes performance problems. You can always lock the same single row with the "with_lock" method. This way the code is fully protected and no extra gem is needed. It also creates a transaction. Like this:
m = Model.order(:id).first
m.with_lock do # acquire lock
  # some code here for a single process at a time
end # release lock
Well, technically, locking a table is the same as always locking a record of another table before accessing the table.
So you may have another table with at most one record, and always lock that record with http://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html before reading from or writing to the table you want to lock:
LockTable.last.with_lock do
  # the things that are needed for your table
end

Lock issues on large recordset

I have a database table that I use as a queue system, where separate process that talk to each other create and read entries in the table. For example, when a user initiates a search an entry is created, then another process that runs every second or two will pick up that new entry, update the status and then do a search, updating the entry again when the search is complete. This all seems to work well with thousands of searches per hour.
However, I have a master admin screen that lets me view the status of all of these 'jobs', but it runs very slowly. I basically return all entries in the table for the last hour so I can keep an eye on what's going on. I think that I am running into lock issues of some sort. I only need to read each entry, and don't really care if the data is a little bit out of date. I just use a standard 'Select * from Table' statement, so maybe it is waiting for other locks to expire before returning data, as the jobs are constantly updating the data.
Would this be handled better by a certain kind of cursor to return each row one at a time, etc? Any other ideas?
Thanks
If you really don't care if the data is a bit out of date... or if you only need the data to be 99.99% accurate, consider using WITH (NOLOCK):
SELECT * FROM Table WITH (NOLOCK);
This will instruct your query to use the READ UNCOMMITTED ISOLATION LEVEL, which has the following behavior:
Specifies that dirty reads are allowed. No shared locks are issued to
prevent other transactions from modifying data read by the current
transaction, and exclusive locks set by other transactions do not
block the current transaction from reading the locked data.
Be aware that NOLOCK may cause some inaccuracies in your data, so it probably isn't a good idea to use it throughout the rest of your system.
You need FROM yourtable WITH (NOLOCK) table hint.
You may also want to look at transaction isolation in your update process, if you aren't already
An alternative to NOLOCK (which can lead to very bad things, such as missed rows or duplicated rows) is to allow snapshot isolation at the database level (the ALLOW_SNAPSHOT_ISOLATION database option) and then issue your query with:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

MySQL: Transactions vs Locking Tables

I'm a bit confused with transactions vs locking tables to ensure database integrity and make sure a SELECT and UPDATE remain in sync and no other connection interferes with it. I need to:
SELECT * FROM table WHERE (...) LIMIT 1
if (condition passes) {
// Update row I got from the select
UPDATE table SET column = "value" WHERE (...)
... other logic (including INSERT some data) ...
}
I need to ensure that no other queries will interfere and perform the same SELECT (reading the 'old value') before that connection finishes updating the row.
I know I can default to LOCK TABLES table to just make sure that only 1 connection is doing this at a time, and unlock it when I'm done, but that seems like overkill. Would wrapping that in a transaction do the same thing (ensuring no other connection attempts the same process while another is still processing)? Or would a SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE be better?
Locking tables prevents other DB users from affecting the rows/tables you've locked. But locks, in and of themselves, will NOT ensure that your logic comes out in a consistent state.
Think of a banking system. When you pay a bill online, there's at least two accounts affected by the transaction: Your account, from which the money is taken. And the receiver's account, into which the money is transferred. And the bank's account, into which they'll happily deposit all the service fees charged on the transaction. Given (as everyone knows these days) that banks are extraordinarily stupid, let's say their system works like this:
$balance = "GET BALANCE FROM your ACCOUNT";
if ($balance < $amount_being_paid) {
    charge_huge_overdraft_fees();
}
$balance = $balance - $amount_being_paid;
UPDATE your ACCOUNT SET BALANCE = $balance;
$balance = "GET BALANCE FROM receiver ACCOUNT";
charge_insane_transaction_fee();
$balance = $balance + $amount_being_paid;
UPDATE receiver ACCOUNT SET BALANCE = $balance;
Now, with no locks and no transactions, this system is vulnerable to various race conditions, the biggest of which is multiple payments being performed on your account, or the receiver's account, in parallel. While your code has your balance retrieved and is doing the huge_overdraft_fees() and whatnot, it's entirely possible that some other payment will be running the same type of code in parallel. They'll retrieve your balance (say, $100), do their transactions (take out the $20 you're paying, and the $30 they're screwing you over with), and now the two code paths have two different balances: $80 and $70. Depending on which one finishes last, you'll end up with either of those two balances in your account, instead of the $50 you should have ended up with ($100 - $20 - $30). In this case, "bank error in your favor".
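The $80 / $70 outcome can be reproduced with a small simulation. This is a sketch under assumptions: the function names are hypothetical, the balance lives in memory, and the threading.Lock merely stands in for a row lock held for the whole read-modify-write.

```python
import threading


def racy_schedule(start_balance, debits):
    """Every payment reads the balance before any payment writes it back."""
    balance = start_balance
    snapshots = [balance for _ in debits]  # all the reads happen first
    for snapshot, amount in zip(snapshots, debits):
        balance = snapshot - amount  # each write clobbers the previous one
    return balance


def locked_schedule(start_balance, debits):
    """Each payment holds the lock across its read and its write."""
    account = {"balance": start_balance}
    row_lock = threading.Lock()  # assumption: stand-in for a lock on the account row

    def pay(amount):
        with row_lock:
            account["balance"] = account["balance"] - amount

    threads = [threading.Thread(target=pay, args=(amount,)) for amount in debits]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account["balance"]


lost_update = racy_schedule(100, [20, 30])
correct = locked_schedule(100, [20, 30])
```

In the racy schedule the last writer wins, so one of the two debits is silently lost. With the lock, the second payment reads the balance only after the first has written it.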
Now, let's say you use locks. Your bill payment ($20) hits the pipe first, so it wins and locks your account record. Now you've got exclusive use, and can deduct the $20 from the balance, and write the new balance back in peace... and your account ends up with $80 as is expected. But... uhoh... You try to go update the receiver's account, and it's locked, and locked longer than the code allows, timing out your transaction... We're dealing with stupid banks, so instead of having proper error handling, the code just pulls an exit(), and your $20 vanishes into a puff of electrons. Now you're out $20, and you still owe $20 to the receiver, and your telephone gets repossessed.
So... enter transactions. You start a transaction, you debit your account $20, you try to credit the receiver with $20... and something blows up again. But this time, instead of exit(), the code can just do rollback, and poof, your $20 is magically added back to your account.
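The rollback described here can be shown with Python's built-in sqlite3 module. This is a minimal sketch under assumptions: the accounts table, the transfer function, and the LookupError used to signal a missing receiver are all made up for illustration.

```python
import sqlite3


def transfer(conn, sender, receiver, amount):
    """Debit and credit in one transaction; undo the debit if the credit fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
            credited = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            if credited.rowcount != 1:
                raise LookupError(f"no such account: {receiver}")
        return True
    except LookupError:
        return False


def balance_of(conn, name):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("you", 100), ("receiver", 0)])
conn.commit()

failed = transfer(conn, "you", "nobody", 20)      # credit fails, debit is undone
balance_after_failure = balance_of(conn, "you")
succeeded = transfer(conn, "you", "receiver", 20)
```

When the credit step fails, the debit that already ran inside the same transaction is rolled back, so the $20 does not vanish.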
In the end, it boils down to this:
Locks keep anyone else from interfering with any database records you're dealing with. Transactions keep any "later" errors from interfering with "earlier" things you've done. Neither alone can guarantee that things work out ok in the end. But together, they do.
in tomorrow's lesson: The Joy of Deadlocks.
I've started to research the same topic for the same reasons you indicated in your question. I was confused by the answers given on SO because they were partial answers and did not provide the big picture. After reading a couple of documentation pages from different RDBMS providers, these are my takes:
TRANSACTIONS
Statements are database commands, mainly used to read and modify the data in the database. A transaction is the scope of one or more statement executions. Transactions provide two things:
A mechanism which guarantees that either all statements in a transaction are executed correctly or, in case of a single error, any data modified by those statements is reverted to its last correct state (i.e. rollback). What this mechanism provides is called atomicity.
A mechanism which guarantees that concurrent read statements can view the data without the occurrence of some or all of the phenomena described below.
Dirty read: A transaction reads data written by a concurrent
uncommitted transaction.
Nonrepeatable read: A transaction re-reads data it has previously read
and finds that data has been modified by another transaction (that
committed since the initial read).
Phantom read: A transaction re-executes a query returning a set of
rows that satisfy a search condition and finds that the set of rows
satisfying the condition has changed due to another recently-committed
transaction.
Serialization anomaly: The result of successfully committing a group
of transactions is inconsistent with all possible orderings of running
those transactions one at a time.
What this mechanism provides is called isolation, and the settings which let you choose which phenomena must not occur in a transaction are called isolation levels.
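As an example, this is the isolation-level / phenomena table for PostgreSQL:
Isolation level  | Dirty read             | Nonrepeatable read | Phantom read           | Serialization anomaly
Read uncommitted | Allowed, but not in PG | Possible           | Possible               | Possible
Read committed   | Not possible           | Possible           | Possible               | Possible
Repeatable read  | Not possible           | Not possible       | Allowed, but not in PG | Possible
Serializable     | Not possible           | Not possible       | Not possible           | Not possible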
If any of the described promises is broken by the database system, changes are rolled back and the caller notified about it.
How these mechanisms are implemented to provide these guaranties is described below.
LOCK TYPES
Exclusive Locks: When an exclusive lock is acquired over a resource, no other exclusive lock can be acquired over that resource. Exclusive locks are always acquired before a modify statement (INSERT, UPDATE or DELETE) and they are released after the transaction is finished. To explicitly acquire exclusive locks before a modify statement you can use hints like FOR UPDATE (PostgreSQL, MySQL) or UPDLOCK (T-SQL).
Shared Locks: Multiple shared locks can be acquired over a resource. However, shared locks and exclusive locks cannot be held on a resource at the same time. Shared locks might or might not be acquired before a read statement (SELECT, JOIN), depending on how the database implements isolation levels.
LOCK RESOURCE RANGES
Row: the single row the statement executes on.
Range: a specific range based on the condition given in the statement (SELECT ... WHERE).
Table: whole table. (Mostly used to prevent deadlocks on big statements like batch update.)
As an example the default shared lock behavior of different isolation levels for SQL-Server :
DEADLOCKS
One of the downsides of locking mechanisms is deadlocks. A deadlock occurs when a statement enters a waiting state because a requested resource is held by another waiting statement, which in turn is waiting for another resource held by another waiting statement. In such a case the database system detects the deadlock and terminates one of the transactions. Careless use of locks can increase the chance of deadlocks; however, they can occur even without human error.
SNAPSHOTS (DATA VERSIONING)
This is an isolation mechanism which provides a statement with a copy of the data taken at a specific time.
Statement beginning: provides the statement with a copy of the data taken at the beginning of the statement's execution. It also helps the rollback mechanism by keeping this data until the transaction is finished.
Transaction beginning: provides the statement with a copy of the data taken at the beginning of the transaction.
All of those mechanisms together provide consistency.
When it comes to Optimistic and Pessimistic locks, they are just namings for the classification of approaches to concurrency problem.
Pessimistic concurrency control:
A system of locks prevents users from modifying data in a way that
affects other users. After a user performs an action that causes a
lock to be applied, other users cannot perform actions that would
conflict with the lock until the owner releases it. This is called
pessimistic control because it is mainly used in environments where
there is high contention for data, where the cost of protecting data
with locks is less than the cost of rolling back transactions if
concurrency conflicts occur.
Optimistic concurrency control:
In optimistic concurrency control, users do not lock data when they
read it. When a user updates data, the system checks to see if another
user changed the data after it was read. If another user updated the
data, an error is raised. Typically, the user receiving the error
rolls back the transaction and starts over. This is called optimistic
because it is mainly used in environments where there is low
contention for data, and where the cost of occasionally rolling back a
transaction is lower than the cost of locking data when read.
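Optimistic concurrency control as quoted above is often implemented by hand with a version column. The following is a sketch of that pattern under assumptions, not what any particular database does internally: the items table, its version column, and the function names are hypothetical, and sqlite3 is used only because it ships with Python.

```python
import sqlite3


def read_item(conn, item_id):
    """Read the value together with the version it was read at (no lock taken)."""
    return conn.execute(
        "SELECT value, version FROM items WHERE id = ?", (item_id,)
    ).fetchone()


def update_if_unchanged(conn, item_id, new_value, expected_version):
    """Write only if nobody changed the row since it was read."""
    with conn:
        cursor = conn.execute(
            "UPDATE items SET value = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_value, item_id, expected_version),
        )
    return cursor.rowcount == 1  # False means a conflicting update got in first


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT, version INTEGER)")
conn.execute("INSERT INTO items VALUES (1, 'original', 1)")
conn.commit()

_, version_seen_by_a = read_item(conn, 1)
_, version_seen_by_b = read_item(conn, 1)  # both users read version 1

a_succeeded = update_if_unchanged(conn, 1, "from A", version_seen_by_a)
b_succeeded = update_if_unchanged(conn, 1, "from B", version_seen_by_b)
```

The second writer's update matches no row because the version has moved on. As the quote says, the caller that receives the conflict typically re-reads and starts over.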
For example, by default PostgreSQL uses snapshots to make sure the read data didn't change, and rolls back if it did, which is an optimistic approach. SQL-Server, however, uses read locks by default to provide these promises.
The implementation details might change according to the database system you choose. However, according to database standards, they need to provide the stated transaction guarantees in one way or another using these mechanisms. If you want to know more about the topic or about specific implementation details, below are some useful links for you.
SQL-Server - Transaction Locking and Row Versioning Guide
PostgreSQL - Transaction Isolation
PostgreSQL - Explicit Locking
MySQL - Consistent Nonlocking Reads
MySQL - Locking
Understanding Isolation Levels (Video)
You want a SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE inside a transaction, as you said, since normally SELECTs, no matter whether they are in a transaction or not, will not lock a table. Which one you choose would depend on whether you want other transactions to be able to read that row while your transaction is in progress.
http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
START TRANSACTION WITH CONSISTENT SNAPSHOT will not do the trick for you, as other transactions can still come along and modify that row. This is mentioned right at the top of the link below.
If other sessions simultaneously
update the same table [...] you may
see the table in a state that never
existed in the database.
http://dev.mysql.com/doc/refman/5.0/en/innodb-consistent-read.html
Transactions and locks are different concepts. However, transactions use locks to help them follow the ACID principles.
If you want to prevent others from reading or writing the table while you are reading or writing it, you need a lock.
If you want to ensure data integrity and consistency, you had better use transactions.
I think you have mixed up the concept of transaction isolation levels with locks.
Please look up transaction isolation levels; SERIALIZABLE should be the level you want.
I had a similar problem when attempting an IF NOT EXISTS ... followed by an INSERT, which caused a race condition when multiple threads were updating the same table.
I found the solution to the problem here: How to write INSERT IF NOT EXISTS queries in standard SQL
I realise this does not directly answer your question, but the same principle of performing a check and an insert as a single statement is very useful; you should be able to modify it to perform your update.
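The single-statement check-and-insert can be sketched like this. It is an illustration under assumptions: the tags table and insert_if_absent function are hypothetical, and sqlite3 is used so the example runs without a server. On other engines and under concurrent load, a unique constraint on the column is still the safety net, because two concurrent statements may both pass the NOT EXISTS check.

```python
import sqlite3


def insert_if_absent(conn, name):
    """Check and insert in one statement instead of SELECT followed by INSERT."""
    with conn:
        cursor = conn.execute(
            "INSERT INTO tags (name) "
            "SELECT ? WHERE NOT EXISTS (SELECT 1 FROM tags WHERE name = ?)",
            (name, name),
        )
    return cursor.rowcount == 1  # True only when a row was actually inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")

inserted_first = insert_if_absent(conn, "sql")
inserted_again = insert_if_absent(conn, "sql")
row_count = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
```

The second call inserts nothing, so there is no window between a separate existence check and the insert in application code.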
I'd use a
START TRANSACTION WITH CONSISTENT SNAPSHOT;
to begin with, and a
COMMIT;
to end with.
Anything you do in between is isolated from the other users of your database, provided your storage engine supports transactions (InnoDB does).
You are confusing locks with transactions. They are two different things in an RDBMS. A lock prevents concurrent operations, while a transaction focuses on data isolation. Check out this great article for clarification and some graceful solutions.