How to know when a transaction scheme is serializable?

I'm studying SQL and need to know whether a certain transaction scheme is serializable. I understand that the method for determining this is to build a graph with the transactions as nodes and directed edges between them; if the graph is cyclic, the scheme is not serializable. But what does that mean, and what determines whether there is a directed edge in the graph from one transaction to another? Is serialization in this case the same kind of serialization as writing objects to disk?
Thanks for any insight

Transaction serialization has nothing to do with object serialization. The serializable transaction isolation level, when fully implemented, ensures that the behavior of any set of concurrent serializable transactions is consistent with some serial (one-at-a-time) sequence of execution -- as though the transactions had been run one at a time. This means that if you can show that a database transaction will do the right thing when it is run alone, it will do the right thing in any mix of serializable transactions, or it will roll back with a serialization failure so that it can be retried from the start.
Serializable transaction isolation can be enforced in many ways. The most common scheme is strict two-phase locking (S2PL). This one is so common that you often see answers on SO which discuss things only in terms of this technique. There are also optimistic concurrency control (OCC), serializable snapshot isolation (SSI), and others.
PostgreSQL versions before 9.1, MS SQL Server in some configurations, and all versions of Oracle don't actually provide serializable transactions. They let you ask for them, but actually provide snapshot isolation. PostgreSQL versions starting with 9.1 use SSI when serializable transaction isolation is requested.
It's not possible to thoroughly discuss how any of these techniques work in an SO answer, but to summarize the techniques mentioned above:
Under S2PL every write within a transaction acquires a lock which cannot be shared with anything, and every read within the transaction acquires a lock which can be shared with other reads but cannot be shared with a write. The read locks need to cover "gaps" in scanned indexes. Locks are held until the end of the transaction and released atomically with the work of the transaction becoming visible to other transactions. If the blocking creates a cycle, this is called a "deadlock", and one of the transactions involved in the cycle is rolled back.
Under OCC a transaction keeps track of what data it has used, without locking it. When transaction commit is requested, the transaction checks whether any other transaction modified any of its data and committed. If so, the commit request fails and the work is rolled back.
Under SSI writes block each other, but reads don't block writes and writes don't block reads. There is tracking of read-write dependencies to look for patterns of visibility which would create a cycle in the apparent order of execution. If a "dangerous structure" is found, which means that a cycle in the apparent order of execution is possible, one of the transactions involved in the possible cycle is rolled back. It is more like OCC than S2PL, but doesn't have as many rollbacks under higher contention.
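As a rough illustration of the OCC commit-time check summarized above, here is a minimal single-threaded Python sketch. The Store and Transaction classes and the per-key version counter are hypothetical names invented for this example; a real engine would also have to make the validation and the write-back atomic with respect to other committers.

```python
class Store:
    """Toy versioned key-value store; every name here is hypothetical."""

    def __init__(self):
        self.data = {}      # key -> committed value
        self.version = {}   # key -> number of committed writes to that key


class Transaction:
    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # key -> version observed at first read
        self.writes = {}         # buffered writes, invisible until commit

    def read(self, key):
        if key in self.writes:   # read our own uncommitted write
            return self.writes[key]
        self.read_versions.setdefault(key, self.store.version.get(key, 0))
        return self.store.data.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validation: fail if any key we read was modified by a transaction
        # that committed after we read it.
        for key, seen in self.read_versions.items():
            if self.store.version.get(key, 0) != seen:
                return False     # caller discards the work and retries
        # No conflicts: publish the buffered writes.
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
        return True
```

If two transactions both read the same key and then both try to write it, the first commit succeeds and the second fails validation, which corresponds to the rollback described for OCC.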
Full disclosure: I teamed with Dan R.K. Ports of MIT to implement the new SSI-based serializable transactions in PostgreSQL 9.1.

Serialization here means that the transactions can be executed serially, one after the other (it has nothing to do with object serialization). A set of transactions is serializable if, however their operations are interleaved, the result is the same as if they had been executed one at a time. If the graph is cyclic, the scheme is not serializable and there is a risk of conflict. This is where your isolation level comes in: it decides whether the transactions should be executed serially, first one and then the other, or whether the database should try to interleave them in the hope that there are no conflicts.
It's not a complete answer, but I hope this helps.
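On the question of what creates a directed edge: in the usual textbook construction (the precedence graph used to test conflict serializability), there is an edge from Ti to Tj when an operation of Ti comes before an operation of Tj, both touch the same data item, and at least one of the two is a write. The following Python sketch assumes a schedule written as a list of (transaction, operation, item) tuples, which is a format made up for this example:

```python
from collections import defaultdict


def precedence_edges(schedule):
    """Return the set of edges (Ti, Tj) of the precedence graph.

    schedule: list of (txn, op, item) tuples in execution order, where op
    is "r" (read) or "w" (write). The format is an assumption of this sketch.
    """
    edges = set()
    for i, (t1, op1, item1) in enumerate(schedule):
        for t2, op2, item2 in schedule[i + 1:]:
            # Two operations conflict if they are on the same item and at
            # least one of them is a write.
            conflict = item1 == item2 and "w" in (op1, op2)
            if t1 != t2 and conflict:
                edges.add((t1, t2))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: we returned to the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))


def is_conflict_serializable(schedule):
    return not has_cycle(precedence_edges(schedule))
```

For example, the "lost update" interleaving [("T1", "r", "x"), ("T2", "r", "x"), ("T1", "w", "x"), ("T2", "w", "x")] produces edges in both directions between T1 and T2, so the graph is cyclic and the schedule is not conflict serializable.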

Related

How are serializable isolation violations detected?

Does anyone know how SQL databases detect serializable isolation violations (SIV's)? It seems like simply brute forcing every permutation of transaction executions to find a match for the concurrent execution results to verify serializability wouldn't scale.
According to this paper from a third party researcher: https://amazonredshiftresearchproject.org/white_papers/downloads/multi_version_concurrency_control_and_serialization_isolation_failure.pdf
SIV's occur when two transactions are occurring at the same time and the more recent one commits some deleted rows that the less recent transaction later tries to delete as well. This is a situation that MVCC is unable to deal with and thus has to abort with SIV.
This makes sense for detecting SIV's involving queries that delete rows in MVCC, but I don't understand how SIV's are detected when only select and insert queries are used. For example, this example in AWS docs: https://aws.amazon.com/premiumsupport/knowledge-center/redshift-serializable-isolation/
Does anyone have any idea?

Let me simplify things down as a lot of what is going on is complicated and it is easy to miss the forest for the trees.
1. Two transactions are in flight (BEGIN), and each is using its own database state, matching the database state at the time its BEGIN occurred.
2. Each transaction modifies a table that is part of the other transaction's initial state.
That's it. Redshift doesn't "know" that the changes the other transaction is making are material to the results this transaction is producing, just that they COULD be material. Since they COULD be material, the serialization hazard exists and one transaction is aborted to prevent the possibility of indeterminate results.
There's a lot of complexity and nuance to this topic that is only important if you are trying to understand why certain cases, timings, and SQL worked and others didn't. This gets into predicate locking, which is how Redshift "knows" whether some change being made somewhere else is affecting a part of the initial state that is material to this transaction, i.e. a bunch of bookkeeping. This is why the "select * from tab1" matters in the linked knowledge-center article: it creates the "predicate lock" for this transaction.

PostgreSQL detects serialization violations using a heuristic. Reading data causes predicate locks (SIReadLock) to be taken, and it checks for dangerous structures, which necessarily occur in every serialization violation. That means that you can get false positive serialization errors, but never false negatives.
This is all described in the documentation and in the scientific paper referenced there, and we can hope that Amazon didn't hack up PostgreSQL too badly in that area.

How does transaction isolation level work with respect to read/writes and read/write locks?

I understand the dirty read, non-repeatable read and phantom read issue.
Also I have read about isolation levels: read uncommitted, read committed, repeatable read, serializable.
I also understand that reading results in a shared lock. To get a shared lock there shouldn't already be an active exclusive lock, whereas insert/update/delete results in an exclusive lock. To get an exclusive lock there shouldn't be any other exclusive or shared lock active.
For each level, none of the articles I have read explain the isolation level concept with respect to:
Whether the level is applicable to a read or write transaction or both.
Whether reading/writing enforces any read/write locks different to the above explanation
A transaction is an all-or-nothing concept with regard to writes. Is the transaction isolation level a concept that applies to reads only?
If anyone can enlighten regarding these points for each level then it will be very helpful.

You might find these articles by Paul White to be very useful.
But in answer to your questions:
Firstly, Shared vs Exclusive locks define what is allowed to happen concurrently against the lock. The isolation level defines how much is locked and how long for.
Isolation level is applicable to both types of transactions. SNAPSHOT in particular has different effects depending whether a write is involved or not.
There are also Intent locks, which are equivalent versions of other locks and allow a lock to be escalated from Page or Row Lock to Table/Partition.
You also have Schema Modification locks, which prevent anyone changing the table/column (or index) definitions from underneath you (this is applicable even to NOLOCK).
The isolation level defines how much gets locked, is it a row or a range? It also indicates what happens to a lock after it has been used. Is it held until the end of the transaction, or is released as soon as a commit happens?

Combining code that relies on different transaction isolation levels in Postgres

I have two functions which both require a transaction. One is calling the other. I have code that can nest such transactions using SAVEPOINT into a single one.
If they have the same transaction isolation level there is no problem. Now, if they do not, is there still a way I could 'correctly' combine the transactions?
What would be the risk, other than decreased performance, if I ran both transactions under the more restrictive isolation level of the two?

In this situation, yes, generally you can combine the transactions under the more restrictive isolation level.
The risk is pretty much that higher isolation level is going to catch more serialisation errors (i.e. ERROR: could not serialize access due to concurrent update in REPEATABLE READ and ERROR: could not serialize access due to read/write dependencies among transactions in SERIALIZABLE). The typical way to handle these serialisation failures is to retry the transactions, but you should verify whether this makes sense within the context of your application.
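The retry approach can be sketched in Python as below. SerializationFailure is a stand-in class defined only for this example; with a real database driver you would catch whatever exception it raises for a serialization failure (PostgreSQL reports these with SQLSTATE 40001). The function name and parameters are likewise assumptions of the sketch.

```python
import time


class SerializationFailure(Exception):
    """Stand-in for the driver's serialization-failure error (hypothetical)."""


def run_with_retry(txn_body, max_attempts=5, base_delay=0.0):
    """Call txn_body() until it succeeds or the attempts run out.

    txn_body must run the whole transaction from the start each time,
    because a failed transaction's work has been rolled back.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_body()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Simple exponential backoff before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Only the serialization failure is retried here; any other exception propagates immediately, which is usually what you want for constraint violations or programming errors.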
Another possible error that might occur is a deadlock. Postgres should detect these and break the deadlock (after which the failing transaction should retry), but if you can, you should always try to write your application so that deadlocks can't occur in the first place. Generally, the main technique for avoiding deadlocks is to make sure that every application that acquires locks (implicit or explicit) acquires them in a consistent order.
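The consistent-ordering idea can be illustrated with in-process locks in Python. This is only an analogy: the lock names and the with_locks helper are invented for the example, and in a database the equivalent would be something like always touching tables or rows in the same order (for instance, by ascending primary key).

```python
import threading

# Hypothetical named resources, each protected by its own lock.
locks = {"accounts": threading.Lock(), "orders": threading.Lock()}


def with_locks(names, action):
    """Acquire the named locks in sorted order, run action, then release.

    Because every caller sorts the names first, two callers can never hold
    one lock each while waiting for the other's lock.
    """
    ordered = sorted(set(names))
    for name in ordered:
        locks[name].acquire()
    try:
        return action()
    finally:
        for name in reversed(ordered):
            locks[name].release()
```

Two threads that call with_locks(["accounts", "orders"], ...) and with_locks(["orders", "accounts"], ...) both end up taking "accounts" first, so the wait-for cycle that defines a deadlock cannot form.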
You may need to take special care if your application makes requests to an external service, as you may need to verify whether retries are going to cause unwanted duplicate requests, especially if these external requests are not idempotent.

For a long running report, do I use a read only or serializable transaction?

I have a long running report written in SQL*Plus with a couple of SELECTs.
I'd like to change the transaction isolation level to get a consistent view on the data. I found two possible solutions:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
and
SET TRANSACTION READ ONLY;
Which one do I use for a report and why? Any performance implications? Any implications on other sessions (this is a production database).
Please note that the question is specifically about the two options above, not about the various isolation levels.
Does SERIALIZABLE block changes to a table that is queried for my report?
I would naively assume that READ ONLY is a little bit less stressful for the database, as there are no data changes to be expected. Is this true, does Oracle take advantage of that?

In Oracle, the real choice is between SERIALIZABLE and READ COMMITTED.
READ ONLY is the same as serializable, in regard to the way it sees other sessions' changes, with the exception it does not allow table modifications.
With SERIALIZABLE or READ ONLY, your queries won't see changes made to the database after your transaction began.
With READ COMMITTED, each query won't see changes made to the database after that query started, i.e. during the query's lifetime.
SERIALIZABLE          READ COMMITTED        ANOTHER SESSION
(or READ ONLY)
                                            Change 1
Transaction start     Transaction start
                                            Change 2
Query1 Start          Query1 Start
...                   ...                   Change 3
Query1 End            Query1 End
Query2 Start          Query2 Start
...                   ...                   Change 4
Query2 End            Query2 End
With serializable, query1 and query2 will only see change1.
With read committed, query1 will see changes 1 and 2, and query2 will see changes 1 through 3.
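The visibility rules for the timeline above can be checked with a small Python sketch. The numeric positions are just the order of events in the timeline, and the sketch assumes each change in the other session commits immediately; it models only which changes are visible, not how Oracle implements read consistency.

```python
# Position of each event in the timeline (1 = first event).
changes = {"Change 1": 1, "Change 2": 3, "Change 3": 5, "Change 4": 8}
txn_start, q1_start, q2_start = 2, 4, 7


def visible(snapshot_time):
    """Changes committed before the given point in the timeline."""
    return sorted(c for c, t in changes.items() if t < snapshot_time)


# SERIALIZABLE / READ ONLY: every query uses the snapshot taken at
# transaction start.
serializable = {"query1": visible(txn_start), "query2": visible(txn_start)}

# READ COMMITTED: each query uses a snapshot taken when that query starts.
read_committed = {"query1": visible(q1_start), "query2": visible(q2_start)}
```

Here both serializable queries see only "Change 1", while under read committed query1 sees changes 1 and 2 and query2 sees changes 1 through 3.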

This article from the Oracle documentation gives a lot of detailed info about the different transaction isolation levels. http://docs.oracle.com/cd/B10501_01/server.920/a96524/c21cnsis.htm
In your example, it sounds like you are wanting Serializable. Oracle does not block when reading data, so using serializable in your read-only query should not block queries or crud operations in other transactions.
As mentioned in other answers, using the read only isolation level is similar to using serializable, except that read only does not allow inserts, updates, or deletes. However, since read only is not part of the SQL standard and serializable is, I would use serializable in this situation: it should accomplish the same thing, it will be clear to other developers in the future, and Oracle provides more detailed documentation about what is going on "behind the scenes" with the serializable isolation level.
Here is some info about serializable, from the article referenced above (I added some comments in square brackets for clarification):
Serializable isolation mode provides somewhat more consistency by
protecting against phantoms [reading inserts from other transactions] and nonrepeatable reads [reading updates/deletes from other transactions] and can be
important where a read/write transaction executes a query more than
once.
Unlike other implementations of serializable isolation, which lock
blocks for read as well as write, Oracle provides nonblocking queries [non-blocking reads]
and the fine granularity of row-level locking, both of which reduce
write/write contention.

.. a lot of questions.
Both isolation levels are equivalent for sessions that only use selects, so it does not make a difference which one you choose. (READ ONLY is not an ANSI standard.)
Apart from performance effects, a session with transaction isolation level SERIALIZABLE or READ ONLY has no implications for other sessions, and other sessions have none for it, unless you commit anything in this session (SERIALIZABLE only).
Performance of your select inside these two isolation levels should not differ because you don't change data there.
Performance using one of these two isolation levels compared against Oracle default READ COMMITTED is not optimal. Especially if a lot of data is changing during your SERIALIZABLE transaction, you can expect a performance downside.
I would naively assume that READ ONLY is a little bit less stressful for the database, as there are no data changes to be expected. Is this true, does Oracle take advantage of that?
=> No.
hope this helps.

Interesting question. I believe that SERIALIZABLE and READ ONLY would impose the same "tax" on the database, and that it would be greater than that of READ COMMITTED (usually the default). There shouldn't be a significant performance difference to you or other concurrent users. However, if the database can't maintain your read consistency because the UNDO tablespace is too small or undo_retention is too short (the default is 15 minutes), then your query will fail with the infamous ORA-01555. Other users shouldn't experience pain, unless they are trying to do something similar. Ask your DBA what the undo_retention parameter is set to, how big the UNDO tablespace is, and whether or not it's autoextensible.
If there's a similarly sized non-prod environment, try benchmarking your queries with different isolation levels. Check the database for user locks after the 1st query runs (or have your DBA check if you don't have enough privileges). Do this test several times, each with different isolation levels. Basically, documentation is great, but an experiment is often quicker and unequivocal.
Finally, to dodge the issue completely, is there any way you could combine your two queries into one, perhaps with a UNION ALL? This depends largely on the relationship of your two queries. If so, then the question becomes moot. The one combined query would be self-consistent.

Sql isolation levels, Read and Write locks

A bit lame question but I got confused...
Difference between isolation levels as far as I understood is how they managed their locks (http://en.wikipedia.org/wiki/Isolation_(database_systems)). So as mentioned in the article there are Read, Write and Range locks but there is no definition what they are itself.
What are you allowed to do and what not? When I googled for it there was nothing concrete, and instead I got confused by new terms like Pessimistic Lock and Optimistic Lock, Exclusive lock, Gap lock and so on. I'd be pleased if someone could give me a short overview and maybe point me to some good materials to enlighten myself.
My initial question which started the research of isolation levels was:
What happens when I have concurrent inserts (different users of web app) into one table when my transactions isolation level is READ_COMMITED. Is the whole table locked or not?
Or generally what happens down there :) ?
Thanks in advance !

What happens when I have concurrent inserts (different users of web
app) into one table when my transactions isolation level is
READ_COMMITED.
"Read committed" means that other sessions cannot see the newly inserted row until its transaction is committed. A SQL statement that runs without an explicit transaction is wrapped in an implicit one, so "read committed" affects all inserts.
Some databases implement "read committed" with locks. For example, a read lock can be placed on the inserted row, preventing other transactions from reading it. Other databases, like Oracle, use multiversion concurrency control. That means they can represent a version of the database before the insert. This allows them to implement "read committed" without locks.

With my understanding, isolation level will decide how and when the locks are to be acquired and released.
Ref: http://aboutsqlserver.com/2011/04/28/locking-in-microsoft-sql-server-part-2-locks-and-transaction-isolation-levels/

This is what I was looking for ...
http://en.wikipedia.org/wiki/Two-phase_locking
