Do database transactions prevent race conditions?

It's not entirely clear to me what transactions in database systems do. I know they can be used to rollback a list of updates completely (e.g. deduct money on one account and add it to another), but is that all they do? Specifically, can they be used to prevent race conditions? For example:
// Java/JPA example
em.getTransaction().begin();
User u = em.find(User.class, 123);
u.credits += 10;
em.persist(u); // Note added in 2016: this line is actually not needed
em.getTransaction().commit();
(I know this could probably be written as a single update query, but that's not always the case.)
Is this code protected against race conditions?
I'm mostly interested in MySQL5 + InnoDB, but general answers are welcome too.

TL;DR: Transactions do not inherently prevent all race conditions. You still need locking, abort-and-retry handling, or other protective measures in all real-world database implementations. Transactions are not a secret sauce you can add to your queries to make them safe from all concurrency effects.
Isolation
What you're getting at with your question is the I in ACID: isolation. The academically pure idea is that transactions should provide perfect isolation, so that the result is the same as if every transaction executed serially. In practice that's rarely the case in real RDBMS implementations; capabilities vary by implementation, and the rules can be weakened by use of a weaker isolation level like READ COMMITTED. You cannot assume that transactions prevent all race conditions, even at SERIALIZABLE isolation.
Some RDBMSs have stronger abilities than others. For example, PostgreSQL 9.2 and newer have quite good SERIALIZABLE isolation that detects most (but not all) possible interactions between transactions and aborts all but one of the conflicting transactions. So it can run transactions in parallel quite safely.
Few, if any[3], systems have truly perfect SERIALIZABLE isolation that prevents all possible races and anomalies, including issues like lock escalation and lock ordering deadlocks.
Even with strong isolation some systems (like PostgreSQL) will abort conflicting transactions, rather than making them wait and running them serially. Your app must remember what it was doing and re-try the transaction. So while the transaction has prevented concurrency-related anomalies from being stored to the DB, it's done so in a manner that is not transparent to the application.
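To make the abort-and-retry point concrete, here is a minimal sketch of a retry loop around the question's JPA snippet (the retry limit and the exception handling are illustrative assumptions, not a definitive recipe):
import javax.persistence.EntityManager;
import javax.persistence.PersistenceException;

public class CreditService {
    private static final int MAX_RETRIES = 3;

    // Re-runs the whole transaction if the database aborts it with a
    // serialization failure (surfaced by JPA as a PersistenceException).
    public void addCredits(EntityManager em, long userId, int amount) {
        for (int attempt = 1; ; attempt++) {
            try {
                em.getTransaction().begin();
                User u = em.find(User.class, userId);
                u.credits += amount;   // managed entity: no persist() needed
                em.getTransaction().commit();
                return;                // success
            } catch (PersistenceException e) {
                if (em.getTransaction().isActive()) {
                    em.getTransaction().rollback();
                }
                if (attempt >= MAX_RETRIES) {
                    throw e;           // give up after a few attempts
                }
                // otherwise loop: re-read and redo the work from the start
            }
        }
    }
}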
Atomicity
Arguably the primary purpose of a database transaction is that it provides for atomic commit. The changes do not take effect until you commit the transaction. When you commit, the changes all take effect at the same instant as far as other transactions are concerned. No transaction can ever see just some of the changes a transaction makes[1][2]. Similarly, if you ROLLBACK, then none of the transaction's changes ever get seen by any other transaction; it's as if your transaction never existed.
That's the A in ACID.
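A minimal JDBC sketch of that all-or-nothing behaviour, using the question's opening money-transfer example (the table and column names are assumptions for illustration):
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class Transfer {
    static void transfer(Connection con, long fromId, long toId, BigDecimal amount)
            throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement debit = con.prepareStatement(
                 "UPDATE account SET balance = balance - ? WHERE id = ?");
             PreparedStatement credit = con.prepareStatement(
                 "UPDATE account SET balance = balance + ? WHERE id = ?")) {
            debit.setBigDecimal(1, amount);
            debit.setLong(2, fromId);
            debit.executeUpdate();
            credit.setBigDecimal(1, amount);
            credit.setLong(2, toId);
            credit.executeUpdate();
            con.commit();    // both updates become visible at the same instant
        } catch (SQLException e) {
            con.rollback();  // neither update is ever seen by other transactions
            throw e;
        }
    }
}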
Durability
Another is durability - the D in ACID. It specifies that when you commit a transaction it must truly be saved to storage that will survive a fault like power loss or a sudden reboot.
Consistency
See Wikipedia.
Optimistic concurrency control
Rather than using locking and/or high isolation levels, it's common for ORMs like Hibernate, EclipseLink, etc. to use optimistic concurrency control (often called "optimistic locking") to overcome the limitations of weaker isolation levels while preserving performance.
A key feature of this approach is that it lets you span work across multiple transactions, which is a big plus with systems that have high user counts and may have long delays between interactions with any given user.
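For JPA specifically, enabling this is usually just a matter of adding a version attribute to the entity. A minimal sketch, assuming an entity shaped like the one in the question:
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class User {
    @Id
    public long id;

    public int credits;

    // The provider increments this on every update and appends
    // "... AND version = ?" to the UPDATE's WHERE clause; if a concurrent
    // transaction changed the row, the update matches zero rows and the
    // provider reports an optimistic lock failure.
    @Version
    public int version;
}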
References
In addition to the in-text links, see the PostgreSQL documentation chapter on locking, isolation and concurrency. Even if you're using a different RDBMS you'll learn a lot from the concepts it explains.
[1] I'm ignoring the rarely implemented READ UNCOMMITTED isolation level here for simplicity; it permits dirty reads.
[2] As @meriton points out, the corollary isn't necessarily true: phantom reads occur at anything below SERIALIZABLE. One part of an in-progress transaction doesn't see some changes (made by a not-yet-committed transaction), then a later part of the same transaction does see those changes once the other transaction commits.
[3] Well, IIRC SQLite2 does, by virtue of locking the whole database when a write is attempted, but that's not what I'd call an ideal solution to concurrency issues.

The database tier supports isolation of transactions to varying degrees, called isolation levels. Check the documentation of your database management system for the isolation levels supported, and their trade-offs. The strongest isolation level, SERIALIZABLE, requires transactions to execute as if they were executed one by one. This is typically implemented by using exclusive locks in the database, which can cause deadlocks; the database management system detects these and fixes them by rolling back some of the involved transactions. This approach is often referred to as pessimistic locking.
Many object-relational mappers (including JPA providers) also support optimistic locking, where update conflicts are not prevented in the database but detected in the application tier, which then rolls back the transaction. If you have optimistic locking enabled, a typical execution of your example code would emit the following SQL queries:
select id, version, credits from user where id = 123;
Let's say this returns (123, 13, 100).
update user set version = 14, credits = 110 where id = 123 and version = 13;
The database tells us how many rows were updated. If it was one, there was no conflicting update. If it was zero, a conflicting update occurred, and the JPA provider will do
rollback;
and throw an exception so application code can handle the failed transaction, for instance by retrying.
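On the application side that can look roughly like this (a sketch; whether the conflict surfaces directly as an OptimisticLockException or wrapped in a RollbackException depends on the provider and on when the flush happens):
import javax.persistence.EntityManager;
import javax.persistence.OptimisticLockException;
import javax.persistence.RollbackException;

static void addCreditsOnce(EntityManager em) {
    try {
        em.getTransaction().begin();
        User u = em.find(User.class, 123);
        u.credits += 10;
        em.getTransaction().commit();   // the version check runs at flush/commit time
    } catch (RollbackException e) {
        if (e.getCause() instanceof OptimisticLockException) {
            // conflicting update detected: re-read the row and retry the change
        } else {
            throw e;
        }
    }
}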
Summary: With either approach, your statement can be made safe from race conditions.

It depends on the isolation level. SERIALIZABLE will prevent the race condition, since at that level transactions are generally processed in sequence rather than in parallel (or at least exclusive locking is used, so transactions that modify the same rows are performed in sequence).
In order to prevent the race condition at lower levels, it is better to manually lock the record (MySQL, for example, supports the SELECT ... FOR UPDATE statement, which acquires a write lock on the selected records).
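In JPA terms this maps to a pessimistic lock mode; a sketch reusing the question's snippet (on MySQL/InnoDB, Hibernate issues SELECT ... FOR UPDATE for this):
import javax.persistence.LockModeType;

// em is the EntityManager from the question's snippet
em.getTransaction().begin();
// Issues SELECT ... FOR UPDATE under the hood: the row stays write-locked
// until the transaction ends, so the read-modify-write below is race-free.
User u = em.find(User.class, 123, LockModeType.PESSIMISTIC_WRITE);
u.credits += 10;
em.getTransaction().commit();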

It depends on the specific RDBMS. Generally, transactions acquire locks as decided during the query evaluation plan. Some can request table-level locks, others column-level, others record-level; record-level locking is generally preferred for performance. The short answer to your question is yes.
In other words, a transaction is meant to group a set of queries and represent them as an atomic operation. If the operation fails, the changes are rolled back. I don't know exactly what the adapter you're using does, but if it conforms to the definition of transactions you should be fine.
While this guarantees prevention of race conditions, it doesn't explicitly prevent starvation or deadlocks. The transaction lock manager is in charge of that. Table locks are sometimes used, but they come with the hefty price of reducing the number of concurrent operations.

Related

Choosing transaction isolation levels

I don't have a specific example here, I was just trying to understand the different levels of transaction isolation and how one might go about deciding which is best for a given situation.
I'm trying to think of situations in which I would want a transaction that is not serializable, other than to possibly increase performance in situations where I'm willing to give up a little data integrity.
Can anybody provide an example of a situation in which "read uncommitted", "read committed", and/or "repeatable read" would be the preferable isolation level?
Using the serializable isolation level has not only advantages, but also disadvantages:
You have to accept increased performance overhead.
You have to handle serialization errors by redoing transactions, which complicates your application code and hurts performance if it happens often.
I'll come up with use cases for the other transaction levels. This list is of course not complete:
READ UNCOMMITTED: If you request this isolation level in PostgreSQL, you will actually get READ COMMITTED, so there this isolation level is irrelevant. On database systems that use read locks, you use this isolation level to avoid them.
READ COMMITTED: This is the best isolation level if you are ready to deal with concurrent transactions yourself by locking rows that you want to remain stable (see the JDBC sketch after this list). The big advantage is that you never have to deal with serialization errors (except when you get a deadlock).
REPEATABLE READ: This isolation level is perfect for long running read-only transactions that want to see a consistent state of the database. The prime example is pg_dump.
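For the READ COMMITTED case, "locking rows that you want to remain stable" can look like this (a JDBC sketch; the connection details and table name are made up for illustration):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ReadCommittedRowLock {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "app", "secret")) { // hypothetical
            con.setAutoCommit(false);
            con.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT credits FROM account WHERE id = ? FOR UPDATE")) {
                ps.setLong(1, 123);
                try (ResultSet rs = ps.executeQuery()) {
                    // the selected row is now locked; it cannot change under us
                    // until we commit or roll back
                }
            }
            con.commit();
        }
    }
}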

Storing transactions into a NoSQL Database

We are planning the next project and thinking about storing transactions in a NoSQL database. Basically, it is an application where the user can collect points (like a payback scheme) and later pay with those points.
The main idea was to store the transactions themselves in the NoSQL database, while the MySQL server only stores the current balance.
So my question: is this a good approach to handle that, or should I just use a MySQL database?
The reason I was thinking about using NoSQL is that we expect a lot of queries in a short time.
Using polyglot persistence increases the operation load, so if it's possible to solve the task with one data store then it's better not to introduce additional entities.
In your case it seems that you want an auditable transaction history and a consistent current balance, and you don't want to abandon transaction guarantees. It's true that almost all modern NoSQL solutions don't support ACID transactions out of the box, but most of them support primitives which allow you to implement transactions at the application level.
If a data store supports per-key linearizability and compare-and-set (document-level atomicity), then that's enough to implement client-side transactions (see the sketch after this list); moreover, you have several options to choose from:
If you need the SERIALIZABLE isolation level, then you can follow the same algorithm which Google uses for the Percolator system or Cockroach Labs for CockroachDB. I've blogged about it and created a step-by-step visualization; I hope it will help you to understand the main idea behind the algorithm.
If you expect high contention but the READ COMMITTED isolation level is fine for you, then please take a look at the RAMP transactions by Peter Bailis.
The third approach is to use compensating transactions, also known as the saga pattern. It was described in the late 80s in the Sagas paper but became more relevant with the rise of distributed systems. Please see the Applying the Saga Pattern talk for inspiration.
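To make the compare-and-set option concrete, here is an application-level sketch; DocStore is a hypothetical client interface standing in for any store with versioned reads and atomic per-document CAS, not a real library API:
// Hypothetical interface: per-key read with a version, plus atomic CAS.
interface DocStore {
    Doc get(String key);
    boolean compareAndSet(String key, long expectedVersion, String newValue);
}

record Doc(String value, long version) {}

class BalanceUpdater {
    static void addPoints(DocStore store, String accountKey, int points) {
        while (true) {
            Doc d = store.get(accountKey);
            int newBalance = Integer.parseInt(d.value()) + points;
            // Succeeds only if no one changed the document since we read it.
            if (store.compareAndSet(accountKey, d.version(),
                                    Integer.toString(newBalance))) {
                return;
            }
            // Lost the race: re-read and retry.
        }
    }
}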

2PL, Rigorous vs Strict Model, Is there any benefit?

In 2PL (two phase locking), what advantage(s) does the rigorous model have over the strict model?
I) There is no advantage over the strict model.
II) In contrast to the strict model, it guarantees that starvation cannot occur.
III) In contrast to the strict model, it guarantees that deadlock cannot occur.
IV) In contrast to the strict model, there is no need to predict data needed in the future.
My note says all of the above are false. I am a bit confused. Can someone clarify for me why all of this is false?
What is the Two-Phase Locking (2PL) protocol?
A transaction is two-phase locked if:
before reading x, it sets a read lock on x
before writing x, it sets a write lock on x
it holds each lock until after it executes the corresponding operation
after its first unlock operation, it requests no new locks
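As a schematic illustration of the two phases (using java.util.concurrent locks as a stand-in for a DBMS lock manager; this is illustration only, not how an RDBMS is implemented):
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TwoPhaseLockingDemo {
    static final ReadWriteLock lockX = new ReentrantReadWriteLock();
    static final ReadWriteLock lockY = new ReentrantReadWriteLock();
    static int x = 42, y = 0;       // the "data items"

    static void transaction() {
        // Growing phase: every lock is acquired before the first release.
        lockX.readLock().lock();     // read lock on x before reading x
        int value = x;
        lockY.writeLock().lock();    // write lock on y before writing y
        y = value + 1;
        // Shrinking phase: after the first unlock, no new locks are requested.
        lockX.readLock().unlock();
        lockY.writeLock().unlock();  // strict/rigorous 2PL would instead hold
                                     // locks until commit/abort
    }
}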
Now, what is strict two-phase locking?
Here a transaction must hold all its exclusive locks till it commits/aborts.
But what's rigorous 2PL?
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.
Going deeper:
Strict 2PL:
Same as 2PL, but hold all exclusive locks until the transaction has committed or aborted. It guarantees cascadeless recoverability.
Rigorous 2PL:
Same as strict 2PL, but hold all locks until the transaction has committed or aborted. It is used in dynamic environments where data access patterns are not known beforehand.
If a timestamp-based deadlock-prevention scheme such as wait-die is added, there is no deadlock: a younger transaction requesting an item held by an older transaction is aborted and restarted with the same timestamp, so starvation is also avoided.
I hope the explanation above has made the concept, and the advantages of rigorous 2PL over strict 2PL, clear.
Thanks
I - there is an advantage
Look at this lecture note from UCLA:
Rigorous two-phase locking has the advantages of strict 2PL. In addition
it has the property that for two conflicting transactions, their commit order
is their serializability order. In some systems users might expect this behavior.
These lecture notes have an example (the model in the example is strict - not rigorous):
Consider two transactions conducted at the same site, in which a long-running transaction T1 that reads x is ordered before a short transaction T2 that writes x. T2 returns first, showing an updated version of x long before T1, which is based on the old version, completes.
II and III - does not affect deadlocks/starvation
Rigorous 2PL means that all locks are released after the transaction ends as opposed to strict where read-only locks may be released earlier. This doesn't affect deadlocks or starvation as those occur in the expanding phase (a transaction cannot acquire the needed lock). In a deadlock both processes are always in the expanding phase.
IV - both need to know the needed data for locking in the expanding phase - shrinking phase varies
Strict: I don't know the usual implementation details of strict 2PL, but if a read lock is released before a transaction ends, there has to be knowledge (a 100%-sure prediction, if you like) that the lock is not needed later in the transaction.
Rigorous: All the read locks are released at the end of a transaction and the transaction never has to evaluate if it should release a read lock or keep it for later reads in the transaction.
Is rigorous or strict more used/preferred?
Which of these two models to use would depend on the situation. Modern DBMSs use more complex concurrency handling than simple rigorous or strict 2PL. Having said that, judging by the Wikipedia article on two-phase locking, the rigorous (SS2PL) model is more widely used:
SS2PL [rigorous] has been the concurrency control protocol of choice for most database systems and utilized since their early days in the 1970s. [...]
2PL in its general form, as well as when combined with Strictness, i.e., Strict 2PL (S2PL), are not known to be utilized in practice. The popular SS2PL does not require marking "end of phase-1" as 2PL and S2PL do, and thus is simpler to implement. Also, unlike the general 2PL, SS2PL provides, as mentioned above, the useful Strictness and Commitment ordering properties. [...]
SS2PL Vs. S2PL: Both provide Serializability and Strictness. Since S2PL is a super class of SS2PL it may, in principle, provide more concurrency. However, no concurrency advantage is typically practically noticed (exactly same locking exists for both, with practically not much earlier lock release for S2PL), and the overhead of dealing with an end-of-phase-1 mechanism in S2PL, separate from transaction-end, is not justified. Also, while SS2PL provides Commitment ordering, S2PL does not. This explains the preference of SS2PL over S2PL. [...]
Rigorous two-phase locking is similar to strict two-phase locking, with two major differences:
In strict two-phase locking the shared locks are released in the shrinking phase, but in rigorous two-phase locking all the shared and exclusive locks are kept until the end of the transaction.
In rigorous two-phase locking we do not need to know the access pattern of locks on data items beforehand, so it is more appropriate for dynamic environments, while in strict two-phase locking the access pattern of locks should be specified at the start of the transaction.
So the fourth option is the correct one.

Fastest locking strategy (isolation level) for single user batch job

We have a SQL Server 2005 database that contains preprocessed data for reports.
All data is deleted and rebuilt from scratch every night, based on data from the production database.
During the night the job is the only thing running on that server, so I have no simultaneous access to my data.
We are currently using the default READ_COMMITTED isolation level.
I understand that SQL Server will put locks on tables, for reading and writing. Since no one else is touching my tables (both the tables I read and those I write) while my job is running, would it be faster to specify WITH (NOLOCK), or to use exclusive table locks (TABLOCKX)?
Thanks for any hints,
Yves
You probably won't notice any difference from a CPU-usage standpoint. READ COMMITTED does not even take S-locks on rows that sit on pages that do not have uncommitted rows on them. This is a little-known optimization in SQL Server for the very common READ COMMITTED isolation level.
I recommend that you consider READ UNCOMMITTED (which is exactly equivalent to NOLOCK) or TABLOCK, because it allows SQL Server to scan tables in allocation order (an IAM scan) as opposed to logical index order. This is good for IO patterns and, depending on the degree of fragmentation, can make a significant impact (or none at all).
For bulk writes look into the relevant guides put out by Microsoft. Make sure you take advantage of minimally-logged writes. TABLOCKX can come into play here.
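For instance, the read side with a table hint might look like this (a JDBC sketch; the table and column names are made up for illustration):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// con is an open Connection to the SQL Server instance.
static void scanStage(Connection con) throws Exception {
    try (Statement st = con.createStatement();
         // TABLOCK lets the scan run in allocation order (an IAM scan)
         ResultSet rs = st.executeQuery(
             "SELECT col1, col2 FROM dbo.ReportStage WITH (TABLOCK)")) {
        while (rs.next()) {
            // ... build the preprocessed report rows ...
        }
    }
}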

What is the best default transaction isolation level for an ERP, if any?

Short background: We are just starting to migrate/reimplement an ERP system to Java with Hibernate, targeting a concurrent user count of 50-100 users. We use MS SQL Server as the database server, which is good enough for these loads.
Now, the old system doesn't use any transactions at all and relies for critical parts (e.g. stock changes) on setting manual locks (using flags) and releasing them. That's something like manual transaction management. But there are sometimes problems with data inconsistency. In the new system we would like to use transactions to wipe out these problems.
Now the question: What would be a good/reasonable default transaction isolation level to use for an ERP system, given a usage of about 85% OLTP and 15% OLAP? Or should I always decide on a per task basis, which transaction level to use?
And as a reminder the four transaction isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE
99 times out of 100, read committed is the right answer. That ensures that you only see changes that have been committed by the other session (and, thus, results that are consistent, assuming you've designed your transactions correctly). But it doesn't impose the locking overhead (particularly in non-Oracle databases) that repeatable read or serializable impose.
Very occasionally, you may want to run a report where you are willing to sacrifice accuracy for speed and set a read uncommitted isolation level. That's rarely a good idea, but it is occasionally a reasonably acceptable workaround to lock contention issues.
Serializable and repeatable read are occasionally used when you have a process that needs to see a consistent set of data over the entire run regardless of what other transactions are doing at the time. It may be appropriate to set a month-end reconciliation process to serializable, for example, if there is a lot of procedural code, a possibility that users are going to be making changes while the process is running, and a requirement that the process needs to ensure that it is always seeing the data as it existed at the time the reconciliation started.
Don't forget about SNAPSHOT, which is right below SERIALIZABLE.
It depends on how important it is for the data to be accurate in the reports. It really is a task-by-task thing.
It really depends a lot on how you design your application; the easy answer is to just run at READ_COMMITTED.
You can make an argument that if you design your system with it in mind that you could use READ_UNCOMMITTED as the default and only increase the isolation level when you need it. The vast majority of your transactions are going to succeed anyway so reading uncommitted data won't be a big deal.
The way isolation levels affect your queries depends on your target database. For instance, databases like Sybase and MSSQL must lock more resources when you run READ_COMMITTED than databases like Oracle.
For SQL Server (and probably most major RDBMS), I'd stick with the default. For SQL Server, this is READ COMMITTED. Anything more and you start overtaxing the DB, anything less and you've got consistency issues.
Read Uncommitted is definitely the underdog in most forums. However, there are reasons to use it that go beyond a matter of "speed versus accuracy" that is often pointed out.
Let's say you have:
Transaction T1: Writes B, Reads A, (some more work), Commit.
Transaction T2: Writes A, Reads B, (some more work), Commit.
With read committed, the transactions above won't release their locks until committing. Then you can run into a situation where T1 is waiting for T2 to release A, and T2 is waiting for T1 to release B. Here the two transactions collide in a deadlock.
You could re-write those procedures to avoid this scenario (example: acquire resources always in alphabetical order!). Still, with too many concurrent users and tens of thousands of lines of code, this problem may become both very likely and very difficult to diagnose and resolve.
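For example, the "alphabetical order" discipline could be enforced like this (a sketch using SQL Server's UPDLOCK hint; the table and key names are made up for illustration):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Both T1 and T2 must take their locks in the same order: A before B.
static void lockInOrder(Connection con) throws SQLException {
    con.setAutoCommit(false);
    try (PreparedStatement lockA = con.prepareStatement(
             "SELECT 1 FROM resource WITH (UPDLOCK) WHERE name = 'A'");
         PreparedStatement lockB = con.prepareStatement(
             "SELECT 1 FROM resource WITH (UPDLOCK) WHERE name = 'B'")) {
        lockA.executeQuery();   // always A first ...
        lockB.executeQuery();   // ... then B, in every transaction
        // ... perform the reads and writes on A and B ...
        con.commit();
    }
}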
The alternative is using Read Uncommitted. Then you design your transactions assuming that there may be dirty reads. I personally find this problem much more localized and treatable than the interlocking trainwrecks.
The issues from dirty reads can be preempted by:
(1) Rollbacks: don't. Rollback should be the last line of defense in case of hardware failure, network failure or program crash only.
(2) Application locks: use them to create a locking mechanism that operates at a higher level of abstraction, where each lock is closer to a real-world resource or action.