DBMS Transaction and Serializable - sql

We know:
A schedule in which transactions are aligned in such a way that one
transaction is executed first. When the first transaction completes
its cycle then next transaction is executed. Transactions are ordered
one after other. This type of schedule is called serial schedule as
transactions are executed in a serial manner.
and
This execution does no harm if two transactions are mutually
independent and working on different segment of data but in case these
two transactions are working on same data, results may vary.
so what is the benefit of finding that two transaction is Serialize Schedule? if the result is vary, what is the benefit ?

When transactions access the same variables but are executed serially (ie not executed simultaneously) there is a sense in which "results may vary" from when there is only one transaction ever executing (possibly repeatedly). With serial transactions we don't know what order the (non-overlapping) transactions are executed in. All we know at the start of the execution of a repeating transaction is that other transactions may have changed variables since the end of the last execution of the repeating transaction. (Although we generally know something about how they have been left.)
There is nothing wrong with such "varying results" because they just reflect that the transactions were requested at varying times.
When transactions access the same variables and are executed simultaneously (ie not serially) then then for each transaction "results may vary" (in another sense) from how we normally understand the code. That normal understanding relies on only one transaction executing at a time. Eg normally if code reads a variable twice without writing to it then we expect to get the same value. But that's not guaranteed if another transaction writes to it in between reads. Eg normally if code reads a variable then we expect to get the value that the variable actually had. But that's not guaranteed if we get some of its bytes and then another transaction writes to it and then we get the rest of the bytes from that new value.
But if transactions are serializable then they can be executed non-serially (with overlap) but with the same result as if they were executed serially (with no overlap). Then code means what it normally means when there is only one transaction executing.
So we have to make sure that the system acts as if transactions are serial or else we have no idea what our program does.
A serializable schedule is an interleaving of operations from multiple transactions that gives the same result as some serial(ized) schedule. The benefit of executing a serializable schedule that is different from just doing all of one transaction's operations after another's is improved throughput from doing multiple operations from multiple transactions at the same time.
PS
Your quotes appear on a web page that is a mess. It doesn't even define "serializable schedule". The text between your quotations is
In a multi-transaction environment, serial schedules are considered as
benchmark. The execution sequence of instruction in a transaction
cannot be changed but two transactions can have their instruction
executed in random fashion.
But the second sentence should start But in a non-serial schedule.... Because in a serial schedule "Transactions are ordered one after other." So the "results may vary" in the quotation is in a non-serial schedule.
But you did not respond to my comment:
Does "This execution" refer to a serial execution of transactions or
to a non-serial execution of transactions? (What came before your
second quote?)

Related

SQL code to avoid the unnecessary start of a calculation if it was started my some other process?

I do have the stored procedure to calculate some facts (say usp_calculate). It fills the cache-like table. The part of the table (determined by the arguments of the procedure) table must be recalculated every 20 minutes. Basically, the usp_calculate returns early if cached data is fresh, or it spends say a minute to calculate... and returns after that.
The usp_calculate is shared by more outer procedures that needs the data. How should I prevent starting the time-consuming part of the procedure if it was already started by some other process? How can I implement a kind of signaling and waiting for the result instead of starting the calculation again?
Context: I do have an SQL stored procedure named say usp_products. It finally performs a SELECT that returns the rows for a product code, a product name, and calculated information -- special price for a customer, and for the storage location. There is a lot of combinations (customers, price lists, other conditions) that prevent to precalculate the information by a separate process. It must be calculated on-demand, for the specific combination.
The third party database that is the source of the information is not designed for detecting changes. Anyway, the time condition (not older than 20 minutes) is considered a good enough to consider the data "fresh".
The building block for this is probably going to be application locks.
You can obtain an exclusive application lock using sp_getapplock. You can then determine if the data is "fresh enough" either using freshness information inherently contained in that data or a separate table that you use for tracking this.
At this point, if necessary, you refresh the data and update the freshness information.
Finally, you release the lock using sp_releaseapplock and let all of the other callers have their chance to acquire the lock and discover that the data is fresh.

Several same time requested queries execution sequence

For example one user executes query like this:
UPDATE table SET column = 100;
And second user:
UPDATE table SET column = 200;
And lets say, these two queries are requested exactly same time, same seconds, same nanoseconds (or minimal time measurement unit, which is for this DB), then which query will be executed first and which one second?
Will database in this case choose queries sequence just randomly?
p.s. I don't tag some concrete database, I think this mechanism for all major RDBMS are similar. Or may be not?
RDBMS's implement a set of properties abbreviated (and called) ACID. Wikipedia explains the concept.
Basically, ACID-compliant databases lock the data at some level (table, page, and row locks are typical). In principle, only one write lock can be acquired for the same object at the same time. So, the database will arbitrarily lock the row for one of the transactions.
What happens to the other transaction? That depends, but one of two things should happen:
The transaction waits until the lock is available. So "in the end", it will assign the value (lose the lock, win the war ;).
The transaction will "timeout" because the appropriate row(s) are not available.
Your case is rather more complicated, because all rows in a table are affected. In the end, though, all rows should have the same value in an ACID-compliant database.
I should note that major databases are (usually) ACID-compliant. However, even though they have locks and transactions and similar mechanisms, the details can and do vary among databases.
Usually, the DML operations are done by acquiring DML locks, with the help of which the operations are made atomic and consistent. So, in your case, either of the query will be given the DML lock and executed and then the second one will go ahead in the similar fashion. which one goes first and second is not known as such.

How is concurrency control implemented in any ORDBMS

I have a weird question about concurrency control in ORDBMS. This is completely theoretical.
I have two transactions T1 and T2 trying to update a particular row on a table.
Now both the transactions T1 and T2 hits the database simultaneously.
By simultaneously, I mean both hits at the same time calculated till nanoseconds.
So if both the transactions have a timestamp that is exactly same, then how does a DBMS (be it Oracle, DB2, SQL Server) identifies which transaction to process first and which transaction to process later.
I understand that a row level lock will be achieved by one transaction and the other will wait till the lock is released. But how will it identify whether T1 or T2 will acquire the lock. Is there some other parameter that is taken into account other than timestamp.
Thanks
Nirmalya
This question seems to be related more to concurrency control of DBMS in general, rather then of ORDBMS.
Anyway, as far as I know, even if two requests are issued exactly at the same time, they will be processed sequentially by the scheduler, which is responsible for acquiring locks and assigning timestamps. Obviously, only the scheduler is sequential: after scheduling, the queries can be processed parallely, if this is allowed by the locks and by timestamps ordering.
Wikipedia has an accurate explanation of timestamp-based concurrency control: https://en.wikipedia.org/wiki/Timestamp-based_concurrency_control. There you can see that there are few assumptions to be made. Look at the first two:
Every timestamp value is unique and accurately represents an instant in time.
No two timestamps can be the same.
These assumption can be guaranteed only using a thread-safe scheduler, which assings timestamps to transactions sequentially. Also in lock-based concurrency control the scheduler must be thread-safe. Otherwise when it locks a record it cannot be sure that no another transactions acquired a lock on the same record.

Batch Job Transaction

Lets say there are 100 records in the database for a batch job, when batch job runs and pick up those 100 record and then start processing. During process if error occurs at 10th record then should I rollback all 9 records which are already been processed.
How can we design this scenario??? Your suggestions are welcomed.
I believe you're asking if you should roll back successful records if an error occurs part-way through batch processing.
You want your DB updates to occur in transactions in a way that leaves database records consistent and legal (with respect to DB and business rules) after each transaction is committed or rolled back.
If each item in your list of 100 records can be processed and recorded individually, then I'd suggest using a flag of some sort (this could be a status field, as well) to indicate whether each record has been processed, then loop through the records to update each one. If you encounter an error, note that somewhere (log file, exception, error table... your call) and move on. When you're done, you'll have logged which records were successful and which failed. You should then be able to go back and fix whatever caused the problem(s) on bad records and re-process the skipped records.
If all 100 of your records must succeed or fail together, then you'll need to wrap your updates in a transaction so they all succeed or fail as one. This will work for dozens of records or (maybe) hundreds of records, but trying to scale up to thousands of records in the same transaction could create scalability problems (performance and contention issues), so you'd want a different solution for a pattern like that.
There is different transaction granularity possible. Java (JTA) allows for multiple writes in a single commit.
Open transaction
write record
write record
write record
error
roll-back
Most databases also support transactions that can handle multiple rows or record writes.
This is very common.
Take a look at SAVEPOINTS - they basically allow inner-transaction transactions. So you can do quite a lot of work, make a SAVEPOINT, and do more work, only rolling back to the last save point. If things go really wrong, you can just roll the entire transaction back.

MySQL: Transactions vs Locking Tables

I'm a bit confused with transactions vs locking tables to ensure database integrity and make sure a SELECT and UPDATE remain in sync and no other connection interferes with it. I need to:
SELECT * FROM table WHERE (...) LIMIT 1
if (condition passes) {
// Update row I got from the select
UPDATE table SET column = "value" WHERE (...)
... other logic (including INSERT some data) ...
}
I need to ensure that no other queries will interfere and perform the same SELECT (reading the 'old value' before that connection finishes updating the row.
I know I can default to LOCK TABLES table to just make sure that only 1 connection is doing this at a time, and unlock it when I'm done, but that seems like overkill. Would wrapping that in a transaction do the same thing (ensuring no other connection attempts the same process while another is still processing)? Or would a SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE be better?
Locking tables prevents other DB users from affecting the rows/tables you've locked. But locks, in and of themselves, will NOT ensure that your logic comes out in a consistent state.
Think of a banking system. When you pay a bill online, there's at least two accounts affected by the transaction: Your account, from which the money is taken. And the receiver's account, into which the money is transferred. And the bank's account, into which they'll happily deposit all the service fees charged on the transaction. Given (as everyone knows these days) that banks are extraordinarily stupid, let's say their system works like this:
$balance = "GET BALANCE FROM your ACCOUNT";
if ($balance < $amount_being_paid) {
charge_huge_overdraft_fees();
}
$balance = $balance - $amount_being paid;
UPDATE your ACCOUNT SET BALANCE = $balance;
$balance = "GET BALANCE FROM receiver ACCOUNT"
charge_insane_transaction_fee();
$balance = $balance + $amount_being_paid
UPDATE receiver ACCOUNT SET BALANCE = $balance
Now, with no locks and no transactions, this system is vulnerable to various race conditions, the biggest of which is multiple payments being performed on your account, or the receiver's account in parallel. While your code has your balance retrieved and is doing the huge_overdraft_fees() and whatnot, it's entirely possible that some other payment will be running the same type of code in parallel. They'll be retrieve your balance (say, $100), do their transactions (take out the $20 you're paying, and the $30 they're screwing you over with), and now both code paths have two different balances: $80 and $70. Depending on which ones finishes last, you'll end up with either of those two balances in your account, instead of the $50 you should have ended up with ($100 - $20 - $30). In this case, "bank error in your favor".
Now, let's say you use locks. Your bill payment ($20) hits the pipe first, so it wins and locks your account record. Now you've got exclusive use, and can deduct the $20 from the balance, and write the new balance back in peace... and your account ends up with $80 as is expected. But... uhoh... You try to go update the receiver's account, and it's locked, and locked longer than the code allows, timing out your transaction... We're dealing with stupid banks, so instead of having proper error handling, the code just pulls an exit(), and your $20 vanishes into a puff of electrons. Now you're out $20, and you still owe $20 to the receiver, and your telephone gets repossessed.
So... enter transactions. You start a transaction, you debit your account $20, you try to credit the receiver with $20... and something blows up again. But this time, instead of exit(), the code can just do rollback, and poof, your $20 is magically added back to your account.
In the end, it boils down to this:
Locks keep anyone else from interfering with any database records you're dealing with. Transactions keep any "later" errors from interfering with "earlier" things you've done. Neither alone can guarantee that things work out ok in the end. But together, they do.
in tomorrow's lesson: The Joy of Deadlocks.
I've started to research the same topic for the same reasons as you indicated in your question. I was confused by the answers given in SO due to them being partial answers and not providing the big picture. After I read couple documentation pages from different RDMS providers these are my takes:
TRANSACTIONS
Statements are database commands mainly to read and modify the data in the database. Transactions are scope of single or multiple statement executions. They provide two things:
A mechanism which guaranties that all statements in a transaction are executed correctly or in case of a single error any data modified by those statements will be reverted to its last correct state (i.e. rollback). What this mechanism provides is called atomicity.
A mechanism which guaranties that concurrent read statements can view the data without the occurrence of some or all phenomena described below.
Dirty read: A transaction reads data written by a concurrent
uncommitted transaction.
Nonrepeatable read: A transaction re-reads data it has previously read
and finds that data has been modified by another transaction (that
committed since the initial read).
Phantom read: A transaction re-executes a query returning a set of
rows that satisfy a search condition and finds that the set of rows
satisfying the condition has changed due to another recently-committed
transaction.
Serialization anomaly: The result of successfully committing a group
of transactions is inconsistent with all possible orderings of running
those transactions one at a time.
What this mechanism provides is called isolation and the mechanism which lets the statements to chose which phenomena should not occur in a transaction is called isolation levels.
As an example this is the isolation-level / phenomena table for PostgreSQL:
If any of the described promises is broken by the database system, changes are rolled back and the caller notified about it.
How these mechanisms are implemented to provide these guaranties is described below.
LOCK TYPES
Exclusive Locks: When an exclusive lock acquired over a resource no other exclusive lock can be acquired over that resource. Exclusive locks are always acquired before a modify statement (INSERT, UPDATE or DELETE) and they are released after the transaction is finished. To explicitly acquire exclusive locks before a modify statement you can use hints like FOR UPDATE(PostgreSQL, MySQL) or UPDLOCK (T-SQL).
Shared Locks: Multiple shared locks can be acquired over a resource. However, shared locks and exclusive locks can not be acquired at the same time over a resource. Shared locks might or might not be acquired before a read statement (SELECT, JOIN) based on database implementation of isolation levels.
LOCK RESOURCE RANGES
Row: single row the statements executes on.
Range: a specific range based on the condition given in the statement (SELECT ... WHERE).
Table: whole table. (Mostly used to prevent deadlocks on big statements like batch update.)
As an example the default shared lock behavior of different isolation levels for SQL-Server :
DEADLOCKS
One of the downsides of locking mechanism is deadlocks. A deadlock occurs when a statement enters a waiting state because a requested resource is held by another waiting statement, which in turn is waiting for another resource held by another waiting statement. In such case database system detects the deadlock and terminates one of the transactions. Careless use of locks can increase the chance of deadlocks however they can occur even without human error.
SNAPSHOTS (DATA VERSIONING)
This is a isolation mechanism which provides to a statement a copy of the data taken at a specific time.
Statement beginning: provides data copy to the statement taken at the beginning of the statement execution. It also helps for the rollback mechanism by keeping this data until transaction is finished.
Transaction beginning: provides data copy to the statement taken at the beginning of the transaction.
All of those mechanisms together provide consistency.
When it comes to Optimistic and Pessimistic locks, they are just namings for the classification of approaches to concurrency problem.
Pessimistic concurrency control:
A system of locks prevents users from modifying data in a way that
affects other users. After a user performs an action that causes a
lock to be applied, other users cannot perform actions that would
conflict with the lock until the owner releases it. This is called
pessimistic control because it is mainly used in environments where
there is high contention for data, where the cost of protecting data
with locks is less than the cost of rolling back transactions if
concurrency conflicts occur.
Optimistic concurrency control:
In optimistic concurrency control, users do not lock data when they
read it. When a user updates data, the system checks to see if another
user changed the data after it was read. If another user updated the
data, an error is raised. Typically, the user receiving the error
rolls back the transaction and starts over. This is called optimistic
because it is mainly used in environments where there is low
contention for data, and where the cost of occasionally rolling back a
transaction is lower than the cost of locking data when read.
For example by default PostgreSQL uses snapshots to make sure the read data didn't change and rolls back if it changed which is an optimistic approach. However, SQL-Server use read locks by default to provide these promises.
The implementation details might change according to database system you chose. However, according to database standards they need to provide those stated transaction guarantees in one way or another using these mechanisms. If you want to know more about the topic or about a specific implementation details below are some useful links for you.
SQL-Server - Transaction Locking and Row Versioning Guide
PostgreSQL - Transaction Isolation
PostgreSQL - Explicit Locking
MySQL - Consistent Nonlocking Reads
MySQL - Locking
Understanding Isolation Levels (Video)
You want a SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE inside a transaction, as you said, since normally SELECTs, no matter whether they are in a transaction or not, will not lock a table. Which one you choose would depend on whether you want other transactions to be able to read that row while your transaction is in progress.
http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
START TRANSACTION WITH CONSISTENT SNAPSHOT will not do the trick for you, as other transactions can still come along and modify that row. This is mentioned right at the top of the link below.
If other sessions simultaneously
update the same table [...] you may
see the table in a state that never
existed in the database.
http://dev.mysql.com/doc/refman/5.0/en/innodb-consistent-read.html
Transaction concepts and locks are different. However, transaction used locks to help it to follow the ACID principles.
If you want to the table to prevent others to read/write at the same time point while you are read/write, you need a lock to do this.
If you want to make sure the data integrity and consistence, you had better use transactions.
I think mixed concepts of isolation levels in transactions with locks.
Please search isolation levels of transactions, SERIALIZE should be the level you want.
I had a similar problem when attempting a IF NOT EXISTS ... and then performing an INSERT which caused a race condition when multiple threads were updating the same table.
I found the solution to the problem here: How to write INSERT IF NOT EXISTS queries in standard SQL
I realise this does not directly answer your question but the same principle of performing an check and insert as a single statement is very useful; you should be able to modify it to perform your update.
I'd use a
START TRANSACTION WITH CONSISTENT SNAPSHOT;
to begin with, and a
COMMIT;
to end with.
Anything you do in between is isolated from the others users of your database if your storage engine supports transactions (which is InnoDB).
You are confused with lock & transaction. They are two different things in RMDB. Lock prevents concurrent operations while transaction focuses on data isolation. Check out this great article for the clarification and some graceful solution.