My SQLite-based application currently uses transactions - both for being able to rollback and for improving performance. I'm considering replacing all transactions with savepoints. The reason is that the application is multi-threaded (yes, sqlite is configured to be thread-safe), and in some cases a transaction might get started by two threads in the same time (on the same db).
It there a reason NOT to do it?
Are there any pitfalls I need to be aware of?
Do I just replace BEGIN, COMMIT, ROLLBACK with SAVEPOINT xyz, RELEASE SAVEPOINT xyz, ROLLBACK TO SAVEPOINT xyz?
It there a reason NOT to do it?
Yes. It won't solve any of the problems that you outlined. Save points are primarily used to be able to do partial rollbacks of data. The outer transaction or savepoint is what actually is committed. Nothing is really fully saved until that outermost savepoint is released thus updating the DB. You are right back back to the same problem that you have with standard transactions.
Are there any pitfalls I need to be aware of?
Yes. Transactions or savepoints in a multithreaded application can deadlock fairly easily if you are update the same data in two different threads which I assume is the heart of the matter. There is no difference between the two in this regard. You should be aware of what you are updating in each thread and synchronize accordingly.
In short, unless you have the need to do partial transaction rollback, savepoints really wont give you much (other than the fact that they are named.)
There is no silver bullet here. It sounds like you need to do a serious analyses of your application and the data that may be updated in multiple threads and add some synchronization in you application if needed.
Related
Does anyone know how SQL databases detect serializable isolation violations (SIV's)? It seems like simply brute forcing every permutation of transaction executions to find a match for the concurrent execution results to verify serializability wouldn't scale.
According to this paper from a third party researcher: https://amazonredshiftresearchproject.org/white_papers/downloads/multi_version_concurrency_control_and_serialization_isolation_failure.pdf
SIV's occur when two transactions are occurring at the same time and the more recent one commits some deleted rows that the less recent transaction later tries to delete as well. This is a situation that MVCC is unable to deal with and thus has to abort with SIV.
This makes sense for detecting SIV's involving queries that delete rows in MVCC, but I don't understand how SIV's are detected when only select and insert queries are used. For example, this example in AWS docs: https://aws.amazon.com/premiumsupport/knowledge-center/redshift-serializable-isolation/
Does anyone have any idea?
Let me simplify things down as a lot of what is going on is complicated and it is easy to miss the forest for the trees.
2 transaction are in flight (BEGIN) and both are using their own
database state that matches the database state at the time the BEGIN
occurred.
Each transaction modifies a table that is part of the
other transaction's initial state.
That's it. Redshift doesn't "know" that the changes that the other transaction is material to the results this transaction is making. Just that it COULD be material. Since it COULD be material then the serialization hazard exists and one transaction is aborted to prevent the possibility of indeterminant results.
There's a lot of complexity and nuance to this topic that only is important if you are trying to understand why certain cases, timings, and SQL worked and others didn't. This gets into predicate locking which is how Redshift "knows" if some change being made somewhere else is effecting a part of the initial state that is material to this transaction. I.E. a bunch of bookkeeping. This is why the "select * from tab1" matters in the linked knowledge-center article - it creates the "predicate lock" for this transaction.
PostgreSQL detects serialization violations using a heuristics. Reading data causes predicate locks (SIReadLock) to be taken, and it checks for dangerous structures, which necessarily occur in every serialization violation. That means that you can get false positive serialization errors, but never false negatives.
This is all described in the documentation and in the scientific paper referenced there, and we can hope that Amazon didn't hack up PostgreSQL too badly in that area.
I just realized I've had a headache for years. Well, metaphorically speaking. In reality I was looking at my database structure and somehow just realized I never use transactions. Doh.
There's a lot of data on the internet about transactions (begin transaction, rollback, commit, etc.), but surprisingly not much detail about exactly why they are vital, and just exactly how vital?
I understand the concept of handling if something goes wrong. This made sense when one is doing multiple updates, for example, in multiple tables in one go, but this is bad practice as far as I know and I don't do this. All of my queries just update one table. If a query errors, it cancels, transaction or no transaction. What else could go wrong or potentially corrupt a one table update, besides my pulling the plug out of my server?
In other words, my question is,
exactly how vital is it that i implement transactions on all of my tables - I am fully blasphemous for not having them, or does it really matter that much?
UPDATE
+1 to invisal, who pointed out that queries are automatically wrapped as transactions, which I did not know. Pointed out multiple good references on the subject of my question.
This made a lot of sense when one is doing multiple updates, for
example, in multiple tables in one go. But basically all of my queries
just update one table at a time. If a query errors, it cancels,
transaction or no transaction.
In your case, it does nothing. A single statement has its own transaction itself. For more information you can read the existed question and answers:
What does a transaction around a single statement do?
Transaction necessary for single update query?
Do i need transaction for joined query?
Most important property of the database is to keep your data, reliably.
Database reliability is assured by conforming to ACID principles (Atomicity, Consistency, Isolation, Durability). In the context of databases, a single logical operation on the data is called a transaction. Without transactions, such reliability would not be possible.
In addition to reliability, using transactions properly lets you improve performance of some data operations considerably. For example, you can start transaction, insert a lot of data (say 100k rows), and only then commit. Server does not have to actually write to disk until commit is called, effectively batching data in memory. This allows to improve performance a lot.
You should be aware that every updating action against your database is performed inside a transaction, even if only 1 table (SQL server automatically creates a transaction for it).
The reason for always doing transactions is to ensure ACID as others have mentioned. Here I'd like to elaborate on the isolation point. Without transaction isolation, you may have problems with: read uncommitted, unrepeatable read, phantom read,..
it depends if you are updating one table and one row, then the only advantage is going to be in the logging... but if you update multiple row in a table at one time... without transactions you could still run into somecurruption
Well it depends, SQL is most of the times used for supporting data for some host languages like c, c++, java, php, c# and others. Well I have not worked with much technologies.. but if you are using following combinations then here is my point of view:
SQL with C / C++ : Commit Required
SQL with Java : Not Required
SQL with C# : Not Required
SQL with PHP : Not Required
And it also depends which SQL you are using. It would also depend from different flavors of SQL like Oracle SQL, SQL Server, SQLite, MySQL etc...
When you are using Oracle SQL in its console, like Oracle 11g, Oracle 10g etc... COMMIT is required.
And as far as corruption of table and data is concerned. YES it happens, I had a very bad experience with it. So, if you pull out your wire or something while you are updating in your table, then you might end up with a massive disaster.
Well concluding, I will suggest you to do commit.
I'm studying SQL and need to know whether a certain transaction scheme is serializable. I understand the method of determining this is making a graph with the transactions as nodes and direction between the nodes and if the graph is cyclic then the scheme is not serializable. But what does it mean and what determines whether there is a directed edge in the graph from one transaction to the other? Is serialization in this case the same kind of serialization as writing objects to disk?
Thanks for any insight
Transaction serialization has nothing to do with object serialization. The serializable transaction isolation level, when fully implemented, ensures that the behavior of any set of concurrent serializable transactions is consistent with some serial (one-at-a-time) sequence of execution -- as though the transactions had been run one at a time. This means that if you can show that a database transaction will do the right thing when it is run alone, it will do the right thing in any mix of serializable transactions, or it will roll back with a serialization failure so that it can be retried from the start.
Serializable transaction isolation can be enforced in many ways. The most common scheme is strict two-phase locking (S2PL). This one is so common that you often see answers on SO which discuss things only in terms of this technique. There are also optimistic concurrency control (OCC), serializable snapshot isolation (SSI), and others.
PostgreSQL versions before 9.1, MS SQL Server in some configurations, and all versions of Oracle don't actually provide serializable transactions. They let you ask for them, but actually provide snapshot isolation. PostgreSQL versions starting with 9.1 use SSI when serializable transaction isolation is requested.
It's not possible to thoroughly discuss how any of these techniques work in an SO answer, but to summarize the techniques mentioned above:
Under S2PL every write within a transaction acquires a lock which cannot be shared with anything, and every read within the transaction acquires a lock which can be shared with other reads but can not be shared with a write. The read locks need to cover "gaps" in scanned indexes. Locks are held until the end of the transaction and released atomically with the work of the transaction becoming visible to other transactions. If the blocking creates a cycle, this is called a "deadlock", and one of the transactions involved in the cycle is rolled back.
Under OCC a transaction keeps track of what data it has used, without locking it. When transaction commit is requested, the transaction checks whether any other transaction modified any of its data and committed. If so, the commit request fails and the work is rolled back.
Under SSI writes block each other, but reads don't block writes and writes don't block reads. There is tracking of read-write dependencies to look for patterns of visibility which would create a cycle in the apparent order of execution. If a "dangerous structure" is found, which means that a cycle in the apparent order of execution is possible, one of the transactions involved in the possible cycle is rolled back. It is more like OCC than S2PL, but doesn't have as many rollbacks under higher contention.
Full disclosure: I teamed with Dan R.K. Ports of MIT to implement the new SSI-based serializable transactions in PostgreSQL 9.1.
Serialization means that transaction can be executed in a serial way, one after the other (nothing to do with object serialization), basically a transaction its serializable if regardless of the order these are interleaved the result will be as if they were executed in a serial way, if the graph its cyclic then it is not serializable and there is some risk of conflict, here is where your isolation level will help to decide wheter the transaction should be executed in a serial way, meaning first one and then the other or wheter it should try to execute it in an interleaved way hoping there is no conflicts.
Its not a complete answer but i hope this will help.
We all know about techniques to prevent db deadlocks - acquire locks in the same order, etc. But at some point, systems under pressure may simply suffer from deadlocks here and there. Should we simply accept that and always be prepared to retry when a deadlock occurs or should deadlocks be considered absolutely verboten and should we do everything in our power to prevent them?
The answer is yes.
You should do everything in your power to prevent them, but are you ever going to be satisfied that you've made them impossible?
Do everything in your power to prevent them, and be prepared to retry when they occur. :)
Keep in mind that "doing everything in your power" can mean things like queueing batch updates, making inserts into temp tables and then merging those into the main tables later and other non-trivial techniques. Be sure to check your transaction isolation level and your lock escalation policy.
This will probably be closed, but the world is trending to NoSQL solutions to this problem, breaking problems up so that guaranteed consistency isn't required from the datasource meaning that locks aren't required.
Facebook would be a good example of this, it doesn't matter when everyone sees your update, or if different users around the world see different versions of your profile. As long as the update works or eventually fails, that is good enough.
I am encountering very infrequent yet annoying SQL deadlocks on a .NET 2.0 webapp running on top of MS SQL Server 2005. In the past, we have been dealing with the SQL deadlocks in the very empirical way - basically tweaking the queries until it work.
Yet, I found this approach very unsatisfactory: time consuming and unreliable. I would highly prefer to follow deterministic query patterns that would ensure by design that no SQL deadlock will be encountered - ever.
For example, in C# multithreaded programming, a simple design rule such as the locks must be taken following their lexicographical order ensures that no deadlock will ever happen.
Are there any SQL coding patterns guaranteed to be deadlock-proof?
Writing deadlock-proof code is really hard. Even when you access the tables in the same order you may still get deadlocks [1]. I wrote a post on my blog that elaborates through some approaches that will help you avoid and resolve deadlock situations.
If you want to ensure two statements/transactions will never deadlock you may be able to achieve it by observing which locks each statement consumes using the sp_lock system stored procedure. To do this you have to either be very fast or use an open transaction with a holdlock hint.
Notes:
Any SELECT statement that needs more than one lock at once can deadlock against an intelligently designed transaction which grabs the locks in reverse order.
Zero deadlocks is basically an incredibly costly problem in the general case because you must know all the tables/obj that you're going to read and modify for every running transaction (this includes SELECTs). The general philosophy is called ordered strict two-phase locking (not to be confused with two-phase commit) (http://en.wikipedia.org/wiki/Two_phase_locking ; even 2PL does not guarantee no deadlocks)
Very few DBMS actually implement strict 2PL because of the massive performance hit such a thing causes (there are no free lunches) while all your transactions wait around for even simple SELECT statements to be executed.
Anyway, if this is something you're really interested in, take a look at SET ISOLATION LEVEL in SQL Server. You can tweak that as necessary. http://en.wikipedia.org/wiki/Isolation_level
For more info, see wikipedia on Serializability: http://en.wikipedia.org/wiki/Serializability
That said -- a great analogy is like source code revisions: check in early and often. Keep your transactions small (in # of SQL statements, # of rows modified) and quick (wall clock time helps avoid collisions with others). It may be nice and tidy to do a LOT of things in a single transaction -- and in general I agree with that philosophy -- but if you're experiencing a lot of deadlocks, you may break the trans up into smaller ones and then check their status in the application as you move along. TRAN 1 - OK Y/N? If Y, send TRAN 2 - OK Y/N? etc. etc
As an aside, in my many years of being a DBA and also a developer (of multiuser DB apps measuring thousands of concurrent users) I have never found deadlocks to be such a massive problem that I needed special cognizance of it (or to change isolation levels willy-nilly, etc).
There is no magic general purpose solution to this problem that work in practice. You can push concurrency to the application but this can be very complex especially if you need to coordinate with other programs running in separate memory spaces.
General answers to reduce deadlock opportunities:
Basic query optimization (proper index use) hotspot avoidanant design, hold transactions for shortest possible times...etc.
When possible set reasonable query timeouts so that if a deadlock should occur it is self-clearing after the timeout period expires.
Deadlocks in MSSQL are often due to its default read concurrency model so its very important not to depend on it - assume Oracle style MVCC in all designs. Use snapshot isolation or if possible the READ UNCOMMITED isolation level.
I believe the following useful read/write pattern is dead lock proof given some constraints:
Constraints:
One table
An index or PK is used for read/write so engine does not resort to table locks.
A batch of records can be read using a single SQL where clause.
Using SQL Server terminology.
Write Cycle:
All writes within a single "Read Committed" transaction.
The first update in the transaction is to a specific, always-present record
within each update group.
Multiple records may then be written in any order. (They are "protected"
by the write to the first record).
Read Cycle:
The default read committed transaction level
No transaction
Read records as a single select statement.
Benefits:
Secondary write cycles are blocked at the write of first record until the first write transaction completes entirely.
Reads are blocked/queued/executed atomically between the write commits.
Achieve transaction level consistency w/o resorting to "Serializable".
I need this to work too so please comment/correct!!
As you said, always access tables in the same order is a very good way to avoid deadlocks. Furthermore, shorten your transactions as much as possible.
Another cool trick is to combine 2 sql statements in one whenever you can. Single statements are always transactional. For example use "UPDATE ... SELECT" or "INSERT ... SELECT", use "##ERROR" and "##ROWCOUNT" instead of "SELECT COUNT" or "IF (EXISTS ...)"
Lastly, make sure that your calling code can handle deadlocks by reposting the query a configurable amount of times. Sometimes it just happens, it's normal behaviour and your application must be able to deal with it.
In addition to consistent sequence of lock acquisition - another path is explicit use of locking and isolation hints to reduce time/resources wasted unintentionally acquiring locks such as shared-intent during read.
Something that none has mentioned (surprisingly), is that where SQL server is concerned many locking problems can be eliminated with the right set of covering indexes for a DB's query workload. Why? Because it can greatly reduce the number of bookmark lookups into a table's clustered index (assuming it's not a heap), thus reducing contention and locking.
If you have enough design control over your app, restrict your updates / inserts to specific stored procedures and remove update / insert privileges from the database roles used by the app (only explicitly allow updates through those stored procedures).
Isolate your database connections to a specific class in your app (every connection must come from this class) and specify that "query only" connections set the isolation level to "dirty read" ... the equivalent to a (nolock) on every join.
That way you isolate the activities that can cause locks (to specific stored procedures) and take "simple reads" out of the "locking loop".
Quick answer is no, there is no guaranteed technique.
I don't see how you can make any application deadlock proof in general as a design principle if it has any non-trivial throughput. If you pre-emptively lock all the resources you could potentially need in a process in the same order even if you don't end up needing them, you risk the more costly issue where the second process is waiting to acquire the first lock it needs, and your availability is impacted. And as the number of resources in your system grows, even trivial processes have to lock them all in the same order to prevent deadlocks.
The best way to solve SQL deadlock problems, like most performance and availability problems is to look at the workload in the profiler and understand the behavior.
Not a direct answer to your question, but food for thought:
http://en.wikipedia.org/wiki/Dining_philosophers_problem
The "Dining philosophers problem" is an old thought experiment for examining the deadlock problem. Reading about it might help you find a solution to your particular circumstance.