To my understanding, database can postpone writing to table files to boost IO performance. When transaction is COMMITted, data are written to the WAL files.
I'm curious, how much delayed the writing to table files can be. In particular, when I use a simple SELECT, e.g.
SELECT * from myTable;
after COMMIT, is it possible that database has to retrieve data from the WAL files in addition to the table files?
The documentation talks about being able to postpone the flushing of data pages:
If we follow this procedure, we do not need to flush data pages to disk on every transaction commit.
What WAL files allow an RDBMS to do is keeping "dirty" data pages in memory and flushing them to disk at a later time. It does not work in a way that the data pages are modified at a later time.
So the answer to your question is "No, a SELECT is always retrieving data from the data pages, not from the WAL files".
PostgreSQL does not read from WAL for this purpose during normal operations. It would only read WAL in order to apply changes to the data files after a crash (or during replication onto another server).
When data from ordinary data files is changed, the pages of the data files are kept in shared memory (in shared_buffers) until they are written to disk. Any other processes wanting to see that data will find it in that shared memory, and it will be in its changed form. They will always look for it in shared_buffers before they try to read it from disk, so no one will ever see the stale on-disk version of the data, except for the recovery process after a crash.
How is consistency maintained in transactional systems in the event of a power outage?
For example, consider the operation below that are performed within a transaction.
UPDATE table SET someColumn = someColumn + 1 WHERE <whatever>
In the event of a sudden power outage, how can we ensure that the operation is completed or not?
From SQL server docs:
SQL Server uses a write-ahead logging (WAL) algorithm, which guarantees that no data modifications are written to disk before the associated log record is written to disk. This maintains the ACID properties for a transaction. ... The log records must be written to disk before the associated dirty page is removed from the buffer cache and written to disk.
As far as I understand, using the example of the SQL increment operation above, the following operations occur:
Write a record to the transaction log.
Flush changes to disk.
When a sudden shutdown occurs, how does the system know if the incremental operation has completed? How does it understand from when to rollback? (a shutdown can occur after adding a log to transaction log, but before flushing changes to disk, is it?)
And I have another (second) question, it is similar to this one.
How can we guarantee atomicity? For example,
UPDATE table SET someColumn = someColumn + 1 AND otherColumn = otherColumn + 2 WHERE <whatever>
How can we ensure that in the event of a sudden power off, the field otherColumn will also be updated, or no fields will be updated?
Write-Ahead Logging (WAL). See eg Transaction Log Architecture
a shutdown can occur after adding a log to transaction log, but before flushing changes to disk, is it?
Log records are added to the transaction log in memory, but when a transaction is committed SQL Server waits for confirmation that all the log records have been written to durable media. And on restart the transaction will be rolled back if all its log records haven't been flushed to disk.
And SQL Server requires the storage system to flush changes to disk, disabling any non-persistent write caching.
As I understood reading some articles in internet, SQL Server has a buffer cache where it stores pages, and when insert statement is executed, the modified data is written only to that buffer in memory, not to the disk.
And when system checkpoint comes all dirty pages are flushed to disk.
Does this mean that when we execute insert statement and get a return value that everything was ok, the data might still not be written to disk and in theory if system crash occurs before checkpoint, the dirty pages wont be saved to disk although we received information that everything was ok and the transaction is commited?
No. Because you totally ignore the 2nd part of the mechanism - the LOG FILE. The Log file keeps all changed pages and is flushed to disc. In case of a crash, upon start, the server will replay the changes from the log file.
i am interesting to know know how the isolation level"READ COMMITTED" is provided in Oracle DB implementation. I already know that DB makes records in REDO log, but for now i think that that REDO log is only used to repeat the transaction in case when some unpredictable crash will happen during the transaction. Also i know that DBWR writes the "dirty blocks" every time the REDO log file is filled. But my question is: if DBWR writes "dirty"(changed blocks) to the disk, how isolation level"READ COMMITTED" is provided. I mean during writing DBWR writes data directly to data files or in some special "place" on disk that is visible from current transaction and invisible from other transaction? So after the COMMIT this "place" becomes visible and that's all ? How this works in reality? Sorry for bad English.
In addition to the REDO log, you also have the UNDO tablespace.
When updating data, the old value is stored in the UNDO tablespace. When Oracle sees that you would be reading uncommitted data for a record, it reconstructs the old value from there.
UNDO is also used during database recovery: In addition to re-applying writes that have been committed but not made it to the database files before the crash, the opposite can also take place: rolling back uncommitted changes to database files that happened before the crash.
I read about Voltdb's command log. The command log records the transaction invocations instead of each row change as in a write-ahead log. By recording only the invocation, the command logs are kept to a bare minimum, limiting the impact the disk I/O will have on performance.
Can anyone explain the database theory behind why Voltdb uses a command log and why the standard SQL databases such as Postgres, MySQL, SQLServer, Oracle use a write-ahead log?
I think it is better to rephrase:
Why does new distributed VoltDB use a command log over write-ahead log?
Let's do an experiment and imagine you are going to write your own storage/database implementation. Undoubtedly you are advanced enough to abstract a file system and use block storage along with some additional optimizations.
Some basic terminology:
State : stored information at a given point of time
Command : directive to the storage to change its state
So your database may look like the following:
Next step is to execute some command:
Please note several important aspects:
A command may affect many stored entities, so many blocks will get dirty
Next state is a function of the current state and the command
Some intermediate states can be skipped, because it is enough to have a chain of commands instead.
Finally, you need to guarantee data integrity.
Write-Ahead Logging - central concept is that State changes should be logged before any heavy update to permanent storage. Following our idea we can log incremental changes for each block.
Command Logging - central concept is to log only Command, which is used to produce the state.
There are Pros and Cons for both approaches. Write-Ahead log contains all changed data, Command log will require addition processing, but fast and lightweight.
VoltDB: Command Logging and Recovery
The key to command logging is that it logs the invocations, not the
consequences, of the transactions. By recording only the invocation,
the command logs are kept to a bare minimum, limiting the impact the disk I/O will
have on performance.
Additional notes
SQLite: Write-Ahead Logging
The traditional rollback journal works by writing a copy of the
original unchanged database content into a separate rollback journal
file and then writing changes directly into the database file.
A COMMIT occurs when a special record indicating a commit is appended
to the WAL. Thus a COMMIT can happen without ever writing to the
original database, which allows readers to continue operating from the
original unaltered database while changes are simultaneously being
committed into the WAL.
PostgreSQL: Write-Ahead Logging (WAL)
Using WAL results in a significantly reduced number of disk writes,
because only the log file needs to be flushed to disk to guarantee
that a transaction is committed, rather than every data file changed
by the transaction.
The log file is written sequentially, and so the
cost of syncing the log is much less than the cost of flushing the
data pages. This is especially true for servers handling many small
transactions touching different parts of the data store. Furthermore,
when the server is processing many small concurrent transactions, one
fsync of the log file may suffice to commit many transactions.
Conclusion
Command Logging:
is faster
has lower footprint
has heavier "Replay" procedure
requires frequent snapshot
Write Ahead Logging is a technique to provide atomicity. Better Command Logging performance should also improve transaction processing. Databases on 1 Foot
Confirmation
VoltDB Blog: Intro to VoltDB Command Logging
One advantage of command logging over ARIES style logging is that a
transaction can be logged before execution begins instead of executing
the transaction and waiting for the log data to flush to disk. Another
advantage is that the IO throughput necessary for a command log is
bounded by the network used to relay commands and, in the case of
Gig-E, this throughput can be satisfied by cheap commodity disks.
It is important to remember VoltDB is distributed by its nature. So transactions are a little bit tricky to handle and performance impact is noticeable.
VoltDB Blog: VoltDB’s New Command Logging Feature
The command log in VoltDB consists of stored procedure invocations and
their parameters. A log is created at each node, and each log is
replicated because all work is replicated to multiple nodes. This
results in a replicated command log that can be de-duped at replay
time. Because VoltDB transactions are strongly ordered, the command
log contains ordering information as well. Thus the replay can occur
in the exact order the original transactions ran in, with the full
transaction isolation VoltDB offers. Since the invocations themselves
are often smaller than the modified data, and can be logged before
they are committed, this approach has a very modest effect on
performance. This means VoltDB users can achieve the same kind of
stratospheric performance numbers, with additional durability
assurances.
From the description of Postgres' write ahead http://www.postgresql.org/docs/9.1/static/wal-intro.html and VoltDB's command log (which you referenced), I can't see much difference at all. It appears to be the identical concept with a different name.
Both sync only the log file to the disk but not the data so that the data could be recovered by replaying the log file.
Section 10.4 of VoltDB explains that their community version does not have command log so it would not pass the ACID test. Even in the enterprise edition, I don't see the details of their transaction isolation (e.g. http://www.postgresql.org/docs/9.1/static/transaction-iso.html) needed to make me comfortable that VoltDB is as serious as Postges.
With WAL, readers read from pages from unflushed logs. No modification is made to the main DB. With command logging, you have no ability to read from the command log.
Command logging is therefore vastly different. VoltDB uses command logging to create recovery points and ensure durability, sure - but it is writing to the main db store (RAM) in real time - with all the attendant locking issues, etc.
The way I read it is as follows: (My own opinion)
Command Logging as described here logs only transactions as they occur and not what happens in or to them. Ok, so here is the magic piece... If you want to rollback you need to restore the last snapshot and then you can replay all the transactions that were applied after that (Described in the link above). So effectively you are restoring a backup and re-applying all your scripts, only VoltDB has now automated it for you.
The real difference that i see with this is that you cannot rollback to a point in time logically as with a normal transaction log. Normal transaction logs (MSSQL, MySQL etc.) can easily rollback to a point in time (in the correct setup) as the transactions can be 'reversed'.
Interresting question comes up - referring to the pos by pedz, will it always pass the ACID test even with the Command Log? Will do some more reading...
Add: Did more reading and I don't think this is a good idea for very big and busy transactional databases. A DB snapshot is automatically created when the Command Logs fill up, to save you from big transaction logs and the IO used for this? You are going to incur large IO amounts with your snapshots being done at a regular interval and you are also using your memory to the brink. Alos, in my view you lose your ability to rollback easily to a point in time before the last automatic snapshot - think this will get very tricky to manage.
I'll rather stick to Transaction Logs for Transactional systems. It's proven and it works.
Its really just a matter of granularity. They log operations at the level of stored procedures, most RDBMS log at the level of individual statements (and 'lower'). Also their blurb regarding advantages is a bit of a red herring:
One advantage of command logging over ARIES style logging is that a
transaction can be logged before execution begins instead of executing
the transaction and waiting for the log data to flush to disk.
They have to wait for the command to be logged too, its just a much smaller record.
If I'm not mistaken VoltDB's unit of transaction is a stored proc. Traditional RDBMS usually need to support ad-hoc transactions containing any number of statements, so procedure-level logging is out of the question. Furthermore stored procedures are often not truly deterministic in traditional RDBMS (i.e. given params+log+data always produce same output), which they would have to be for this to work.
Nevertheless the performance improvements would be substantial for this constrained RDBMS model.
Few terminologies before I start explaining:
Logging schemes: The database uses logging schemes such as Shadow paging, Write Ahead Log (WAL), to implement concurrency, isolation, and durability (how is a different topic).
In order to understand why WAL is better, let's see an issue with shadow paging. In shadow paging, the database uses a master version and a shadow version of the database so that if the table size is 1 billion and the buffer pool manager does not have enough memory to hold all the tuple (records) in the memory the dirty pages are not written to the master version until the transaction(s) are not committed.
Once all the transactions are committed, the flag is switched and the shadow version becomes the master version. In the diagram above there are Page 3 and Page 5 that are old and can be garbage collected.
The issue with this approach is a large number of fragmented tuples left behind which is randomly located, this is slower as compared to if the dirty pages are sequentially accessed, and this is what Write Ahead Log does.
The other advantage of using WAL is the runtime performance (as you are not doing random IO to flush out the pages) but slower recovery time. Whereas, with shadow paging, the recovery performance is faster (which is required occasionally).