Is it meaningful to monitor physical memory usage on AIX? - aix

Due to AIX's special memory management algorithm, is it meaningful to monitor the physical memory usage in order to find the memory bottleneck during performance tuning?
If not, then what kind of KPIs am I supposed to keep an eye on to determine whether we need to enlarge the RAM capacity?
Thanks

If a program requires more memory than is available as RAM, the OS will start swapping memory sections to disk as it sees fit. You'll need to monitor the output of vmstat and look for paging activity. I don't have access to an AIX machine right now to illustrate with an example, but I recall the man page is pretty good at explaining what data is represented there.
Also, this looks to be a good writeup about another AIX-specific systems monitoring tool for watching your system's overall memory (svmon):
http://www.aixhealthcheck.com/blog.php?id=255
To track the size of your individual application instance(s), there are several options, the most common being ps. Again, you'll have to check the man page for information on which options to use; there are several columns for memory size per process. You can compare those values to the overall memory available on your machine and understand, by tracking over time, whether your application only ever increases its memory use, or whether it releases memory when it is done with a task.
Finally, there's quite a body of information from IBM on performance tuning for AIX, but I was never able to find a road map for reading that information. A lot of it assumes you know facts and features that aren't explained in the current doc set, so you then have to try and find an explanation, which often leads to searching for yet another layer of explanations. :^/
IHTH.

Is it recommended to use SPI flash to run code instead of internal flash, due to memory limitations of internal flash?

We use the LPC546xx family microcontroller in our project; currently, at the initial stage, we are finalizing the software and hardware requirements. The basic firmware size (which contains the RTOS, 3rd-party stacks, libraries, etc.) is currently 480 KB. Once the full application is developed, the size will exceed the internal flash size (512 KB), and in addition we need storage that can hold a firmware update image separately.
So we planned to use a 4 MB/8 MB SPI flash (S25LP064A-JBLE, http://www.issi.com/WW/pdf/IS25LP032-064-128.pdf, serial flash memory) to boot and run the firmware from.
Is it recommended to run code from SPI flash? How can I map external flash memory directly into the CPU memory space? Can anyone give an example that contains this memory mapping (linker script, etc.) or a demo application in which the LPC546xx uses SPI flash?
Generally speaking it's not recommended, or put differently: the closer to the CPU, the better. However, both the S25LP064A and the LPC546xx support XIP (execute in place), so it is viable.
This is not a trivial issue, as many aspects come into play; i.e., the issue is best avoided and should really have been ironed out in the planning stage. Embedded systems are more about compromising than anything else, and making the right/better choices takes skill and experience.
Same question with replies on the NXP forum: link
512 KB of NVRAM is huge. There is almost certainly room for optimisations, even if 3rd-party libraries are used.
On a related note this discussion concerning XIP should give valuable insight: link.
I would strongly encourage the use of file systems if not done already, for which external storage is much better suited; the further from the computational unit, the more relevant this becomes. That's not XIP, and the penalty is copy-to-RAM either way you do it, i.e. performance will be slower. But in my experience, the need for speed has oftentimes not been thoroughly considered and is at least partially greatly overestimated.
Regarding your mention of an RTOS and FW upgrades:
Unless it's a poor RTOS, there's file-system awareness built in. Especially for FW upgrading (note: you'll need room for 3 images, factory reset included), unless it is already supported by the SoC vendor by some other means (OTA), a file system will make life much easier and less risky. If there's no FS awareness, it can be added.
FW upgrade requires a lot of extra storage; more if done simply. Simpler is, however, also safer, which matters hugely for FW upgrades in particular. In the simplest case (a flat binary image), you'll need at least twice the amount of memory you're already consuming.
All-in-all: I think the direction you're going is viable and depending on the actual situation perhaps your only choice.

How to properly assign huge heap space for JVM

I'm trying to work around an issue that has been bugging me for a while. In a nutshell: on what basis should one assign max heap space for a resource-hogging application, and is there a downside to it being too large?
I have an application used to visualize huge medical data sets, which can eat up to several gigabytes of memory if several imaging volumes are opened side by side. Caching the data to be viewed is essential for a fluent workflow. The software is supported on Windows workstations and is started with a bootloader, which assigns the heap size and launches the main application. The actual memory needed by the main application is directly proportional to the data being viewed and cannot be determined by the bootloader, because that would require reading the data, which would, ultimately, consume too much time.
So, to ensure that the JVM has enough memory during launch we set -Xmx as large as we dare, based, by current design, on the max physical memory of the workstation. However, is there any downside to this? I've read (in a post from 2008) that it is possible for native processes to hog up excess heap space, which can lead to memory errors during runtime. Should I maybe also sniff for free virtual memory or paging file size prior to assigning heap space? How would you deal with this situation?
Oh, and this is my first post to these forums. Nice to meet you all and be gentle! :)
Update:
Thanks for all the answers. I'm not sure if I put my words right, but my problem arose from the fact that I have zero knowledge of the hardware this software will be run on, but would, nevertheless, like to assign as much heap space for the software as possible.
I came to a solution of assigning a heap of 70% of physical memory IF there is a sufficient amount of virtual memory available, and less otherwise.
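For illustration, here is a minimal C sketch of how a bootloader might apply that rule on Windows, using the Win32 GlobalMemoryStatusEx call; the 70% ratio follows the approach above, while the ~25% overhead margin and the -Xmx formatting are my own assumptions:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    MEMORYSTATUSEX ms;
    ms.dwLength = sizeof ms;
    if (!GlobalMemoryStatusEx(&ms))
        return 1;

    /* Start from 70% of physical RAM. */
    unsigned long long heapMB = ms.ullTotalPhys / (1024 * 1024) * 70 / 100;

    /* If the available commit charge (physical + page file) can't cover the
     * heap plus ~25% assumed non-heap JVM overhead, back off. */
    unsigned long long commitMB = ms.ullAvailPageFile / (1024 * 1024);
    if (commitMB < heapMB + heapMB / 4)
        heapMB = commitMB * 70 / 100;

    char xmx[32];
    snprintf(xmx, sizeof xmx, "-Xmx%llum", heapMB);
    printf("launching main application with %s\n", xmx);
    /* ...then pass xmx on the java command line via CreateProcess(). */
    return 0;
}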
You can have heap sizes of around 28 GB with little impact on performance, especially if you have large objects (lots of small objects can impact GC pause times).
Heap sizes of 100 GB are possible but have downsides, mostly because they can have high pause times. If you use Azul Zing, it can handle much larger heap sizes significantly more gracefully.
The main limitation is the size of your memory. If your heap exceeds that, your application and your computer will run very slowly or be unusable.
A standard way around these issues with mapping software (which has to be able to map the whole world, for example) is to break your images into tiles. This way you only display the portions of the image which are on the screen. If you need to be able to zoom in and out, you might need to store the data at two to four levels of scale. Using this approach you can view a map of the whole world on your phone.
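To make the tiling idea concrete, here's a small illustrative C sketch of the core computation (the 256-pixel tile size and the function names are hypothetical): given a viewport, work out which tiles intersect the screen so only those need to be resident in memory.

#include <stdio.h>

#define TILE 256  /* pixels per tile edge (hypothetical) */

/* Compute the inclusive range of tile indices covering a viewport.
 * view_x/view_y: top-left of the viewport in image coordinates. */
static void visible_tiles(int view_x, int view_y, int view_w, int view_h,
                          int *tx0, int *ty0, int *tx1, int *ty1)
{
    *tx0 = view_x / TILE;
    *ty0 = view_y / TILE;
    *tx1 = (view_x + view_w - 1) / TILE;
    *ty1 = (view_y + view_h - 1) / TILE;
}

int main(void)
{
    int tx0, ty0, tx1, ty1;
    visible_tiles(1000, 2000, 1920, 1080, &tx0, &ty0, &tx1, &ty1);
    /* Only these tiles need to be in memory; for zooming, keep the same
     * grid at 2-4 precomputed scales (a mipmap-style pyramid). */
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            printf("load tile (%d,%d)\n", tx, ty);
    return 0;
}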
It's best not to set the JVM max memory to more than 60-70% of workstation memory, and in some cases even lower, for two main reasons. First, what the JVM consumes on the physical machine can be 20% or more greater than the heap, due to GC mechanics. Second, the representation of a particular data entity in the JVM heap may not be the only physical copy of that entity in the machine's RAM, as the OS keeps caches and buffers and so forth around the various IO devices from which it grabs these objects.

Embedded app and wearing out flash disks

I have an embedded app that needs to do a lot of writing to a flash disk (or other). We cannot use a hard disk due to the environment. This is an industrial system subject to vibration and explosive fuel vapour.
The trouble is, flash has a lifetime of around 100,000 write cycles. That's ample for your digital camera, but it wears out after a year in our scenario.
Any alternatives that people have found work for them?
I was thinking of using FRAM but it's been done before here and it's slow and small.
As Nils says, commercial CompactFlash cards and drive replacements (NAND) have wear levelling.
If you are using cheap onboard (NOR) flash you might have to do this yourself.
The best approach is some sort of ring buffer where you only ever append data, overwriting the oldest data once the region is full. Remember, flash can only erase a full block (page), but it can then program individual bytes into the already-erased space within that page.
Also, can you buffer a page in RAM and then write it once, or do you have to have individual bytes committed at all times?
Most application notes for embedded processors will have examples of this.
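To illustrate the ring-buffer idea, here's a hedged C sketch for a hypothetical NOR part; flash_erase_block() and flash_program() stand in for whatever your flash driver provides, and fixed-size records are assumed so that no record spans an erase block:

#include <stdint.h>

#define BLOCK_SIZE      4096u   /* erase granularity (hypothetical part) */
#define NUM_BLOCKS      256u    /* dedicate 1 MiB to the log */
#define REC_SIZE        64u     /* fixed record size, divides BLOCK_SIZE */
#define RECS_PER_BLOCK  (BLOCK_SIZE / REC_SIZE)

/* Provided by your flash driver (hypothetical HAL): */
void flash_erase_block(uint32_t block);                          /* erase one block */
void flash_program(uint32_t addr, const void *p, uint32_t len);  /* program erased bytes */

static uint32_t next_rec;  /* global record index, wraps over the whole region */

void log_append(const void *rec)
{
    uint32_t block = (next_rec / RECS_PER_BLOCK) % NUM_BLOCKS;
    uint32_t slot  = next_rec % RECS_PER_BLOCK;

    if (slot == 0)                 /* entering a (re)used block: erase it first */
        flash_erase_block(block);

    flash_program(block * BLOCK_SIZE + slot * REC_SIZE, rec, REC_SIZE);
    next_rec = (next_rec + 1) % (NUM_BLOCKS * RECS_PER_BLOCK);

    /* Each block is erased once per full pass over the region, so wear is
     * spread evenly across all NUM_BLOCKS blocks instead of hammering one spot. */
}

At boot you would recover next_rec by scanning for the first slot that still reads as erased, since erased NOR flash reads back 0xFF.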
You really need to provide much more information:
how much capacity do you need?
what costs are acceptable?
what physical form factor do you need?
what lifetime do you want?
If your storage needs aren't particularly huge and you can deal with the cost, there are battery-backed SRAM parts (up to at least 2 megabytes per part) that are as fast as RAM (that's what they are) and have no limit on the number of writes. But they cost a lot more than flash.
You could also get a drive with a SATA interface that's populated with DRAM.
This post refers to using embedded Linux. Not sure if this is what you want.
I have a not too different system, but for medical use. We use NOR flash for all parts that have a low update frequency and NAND flash for the rest. I would recommend using UBI/UBIFS for the top layer on the MTD disk. UBI/UBIFS takes care of all the underlying problems for you, provided you design your system with a lot more physical flash than you need. Example: you need 100 MB, so you design your HW with 1 GB of flash. Then the data can be shuffled around by UBI without any interaction from the layers above.
UBIFS documentation
UBI documentation
As Michael Burr pointed out, we need more info. (Please answer his questions.)
I have an additional question: What kind of interface is this? PATA? SATA? USB?
As others have pointed out, any decent Flash Drive will provide some kind of wear leveling. Look for this in the datasheet for the device. Many vendors will boast about their wear-leveling technique.
You mention 100,000 cycles. This seems pretty low to me. Most "industrial grade" flash drives can do a lot more than that (millions). Make sure you aren't using a bargain-basement device. A good flash drive will usually include an equation or a calculator tool you can use to figure out the expected lifespan of the device.
(I can say from personal experience that some brands of flash drives hold up a lot better than others, particularly the "industrial" ones. Our drives go through some pretty brutal usage scenarios.)
The other thing that can help a lot is capacity. The higher the capacity of the flash drive, the more room the wear-leveling algorithm has to work with, which means a longer lifespan.
The other thing you can look at is software techniques to minimize the wear on the flash components. Do you have a pagefile/swapfile? Maybe you don't need it. If you are creating/deleting lots of temporary files, move them to a RAM disk. Remember, it is erasure/reprogramming cycles that usually wear out a flash cell, so reducing those operations will usually help.
Use SD cards that have a built-in wear leveling controller. That way the write cycles get distributed over all the flash blocks and you get a very long life out of your flash.
I was thinking of using FRAM but it's been done before here and it's slow and small.
Compare with nvSRAM; that may provide the performance you need.
I have used a CompactFlash card in an embedded system with great success. It has an onboard controller that does all the thinking for you. Not all CompactFlash controllers are equal, so get one that is a recent design and was intended to be used as a hard drive replacement, as those have better wear-levelling algorithms.

What does programming for PS3's Cell Processor entail?

How is programming for the Cell Processor on the PS3 different than programming for any other processor found on a normal desktop?
What kind of programming paradigms, techniques, and practices are used to fully utilize the Cell Processor's potential?
All the articles I hear concerning PS3 development discuss, "Learning how to program on the Cell Processor." What does this really mean beyond some hand waving?
In addition to everything George mentions, the SPUs are really better thought of as streaming vector processors. They work best when you have an algorithm that works on long sequences of numerical data, which can be fed through the SPU's limited memory via DMA, rather than having the SPU load a chunk of memory, try to operate on it, find that it needs to follow a pointer to somewhere outside its memory, load that, keep going, find another one, and so on.
So, programming for them isn't a simple model of concurrency and threads; it's more like high performance numerical or scientific computation. It is also non-uniform memory access taken to an extreme.
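As a rough illustration of that streaming style, here's a hedged SPU-side sketch of double-buffered DMA using the Cell SDK's MFC intrinsics (mfc_get and the tag-status calls are the SDK's names; the chunk size, alignment, and the process() kernel are assumptions, and total is assumed to be a multiple of CHUNK):

#include <spu_mfcio.h>

#define CHUNK 16384   /* max size of a single MFC DMA transfer */

/* Two local-store buffers so the next transfer overlaps the current compute. */
static volatile float buf[2][CHUNK / sizeof(float)] __attribute__((aligned(128)));

extern void process(volatile float *data, int n);   /* hypothetical kernel */

/* Stream `total` bytes from effective address `ea` through local store. */
void stream(unsigned long long ea, unsigned int total)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* kick off first transfer */

    for (unsigned int off = CHUNK; off < total; off += CHUNK) {
        int nxt = cur ^ 1;
        mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0);  /* prefetch next chunk */

        mfc_write_tag_mask(1 << cur);                   /* wait only on current chunk */
        mfc_read_tag_status_all();
        process(buf[cur], (int)(CHUNK / sizeof(float)));

        cur = nxt;
    }
    mfc_write_tag_mask(1 << cur);                       /* drain the last chunk */
    mfc_read_tag_status_all();
    process(buf[cur], (int)(CHUNK / sizeof(float)));
}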
Furthermore, every processor is in-order with deep pipelines, so the programmer has to be much more aware of data hazards and instruction bubbles and all the numerous micro-optimizations that we are told the compiler "should" take care of for us (but it really doesn't). Things like mispredicted branches, load-hit-stores, cache misses, etc. hurt a lot more than they would on an out-of-order processor that could juggle the order of operations around to hide such latencies.
For concrete examples, check out Mike Acton's CellPerformance blog. Mike is my favorite old-school assembly-happy perf curmudgeon in the business, and he's really earned his chops on this issue.
The Cell part of the PS3 consists of 6 SPU processors. They each have 256 KB of non-shared memory and are connected via a high-speed ring that allows for DMA between each other and the PowerPC host processor. There is no shared memory and no cache, which makes it rather different from a multi-core x86 with shared memory and caching. Also, the SPU processors do not use the same instruction set as the PowerPC, so you've got some asymmetry there.
In short, your typical shared-memory, multithreaded program won't just drop onto the Cell without some work (with the caveat that computer science works hard at making different machines appear to be the same so some implementors try hard to automate the process).
At a high level the program will need to be broken up into tasks that fit within the Cell's hard memory limit. Those can run in parallel, and each sub-task can be scheduled onto an available SPU. At a low level, the compiler (or assembly programmer) will need to work harder to generate code that runs quickly on a processor; no run-time trickery to make things go faster is available. The theory is that those programmer/compiler-friendly features cost silicon and speed that can be better spent giving you more and faster SPUs. Of course, you're not getting any more SPUs on the PS3, but in the general case you'll get more SPUs per number of transistors available on a chip.
Completely agree with George Philips and Crashworks. The only thing I'd add is that SPU programming is fundamentally about job management. To get the best out of the SPUs you need to keep them ticking over and feeding back results. There's no point in having one SPU chewing through some complex post-processing if you're having to sit and wait for the results for a frame while the rest of your SPUs sit idle. So how you distribute your jobs requires a lot of thought, and this has a big impact on how you chunk up your data.
"All the articles I hear concerning PS3 development discuss, 'Learning how to program on the Cell Processor.' What does this really mean beyond some hand waving?"
Well, stuff you have to deal with on SPUs...
Atomic operations (lock-free try-discard style).
Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
No enforced hardware distinction between data and code. This is actually a fun thing: you can set up dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on the SPU.
Lack of hardware debugging aids.
Limited memory size.
Fast memory access.
Instruction set balanced toward SIMD operations.
Floating point "gotchas".
You ideally want to keep the SPUs doing useful work all of the time, but it's really challenging. Not only are they not well suited for handling some types of problems, but often moving a system to be efficient on SPU can involve a complete redesign. Debugging problems that would be easy to catch on the PPU can sometimes take days on SPU.
I think when people use the phrase "learning how to program the Cell" they are mostly hand waving. You can learn the basics in a week; the challenge comes in trying to apply that knowledge to real code... which often already exists and isn't in a form well suited for use on the SPU.

Are disk sector writes atomic?

Clarified Question:
When the OS sends the command to write a sector to disk, is it atomic? I.e., does the write of new data succeed fully, or is the old data left intact, should the power fail immediately following the write command? I don't care about what happens in multiple sector writes - torn pages are acceptable.
Old Question:
Say you have old data X on disk, you write new data Y over it, and a tree falls on the power line during that write. With no fancy UPS or battery backed disk controller, you can end up with a torn page, where the data on disk is part X and part Y. Can you ever end up with a situation where the data on disk is part X, part Y, and part garbage?
I've been trying to understand the design of ACID systems like databases, and to my naive thinking it seems Firebird, which does not use a write-ahead log, is relying on a given write not destroying old data (X), only failing to fully write new data (Y). That means that if part of X is being overwritten, only the part of X that is being overwritten can be changed, not the part of X we intend to keep.
To clarify, this means if you have a page sized buffer, say 4096 bytes, filled with half Y, half X that we want to keep - and we tell the OS to write that buffer over X, there is no situation short of serious disk failure where the half X that we want to keep is corrupted during the write.
The traditional (SCSI, ATA) disk protocol specifications don't guarantee that any/every sector write is atomic in the event of sudden power loss (but see below for discussion of the NVMe spec). However, it seems tacitly agreed that non-ancient "real" disks quietly try their best to offer this behaviour (e.g. Linux kernel developer Christoph Hellwig mentions this off-hand in the 2017 presentation "Failure-Atomic file updates for Linux").
When it comes to synthetic disks (e.g. network attached block devices, certain types of RAID, etc.) things are less clear, and they may or may not offer sector atomicity guarantees while legally behaving per their given spec. Imagine a RAID 1 array (without a journal) composed of one disk that offers 512-byte sectors and another disk that offers only a 4 KiB sector size, thus forcing the RAID to expose a sector size of 4 KiB. As a thought experiment, you can construct a scenario where each individual disk offers sector atomicity (relative to its own sector size) but where the RAID device does not in the face of power loss, because it would depend on whether the 512-byte-sector disk was the one being read by the RAID and how many of the 8 512-byte sectors comprising the 4 KiB RAID sector it had written before the power failed.
Sometimes specifications offer atomicity guarantees but only on certain write commands. The SCSI disk spec is an example of this and the optional WRITE ATOMIC(16) command can even give a guarantee beyond a sector but being optional it's rarely implemented (and thus rarely used). The more commonly implemented COMPARE AND WRITE is also atomic (potentially across multiple sectors too) but again it's optional for a SCSI device and comes with different semantics to a plain write...
Curiously, the NVMe spec was written in such a way to guarantee sector atomicity thanks to Linux kernel developer Matthew Wilcox. Devices that are compliant with that spec have to offer a guarantee of sector write atomicity and may choose to offer contiguous multi-sector atomicity up to a specified limit (see the AWUPF field). However, it's unclear how you can discover and use any multi-sector guarantee if you aren't currently in a position to send raw NVMe commands...
Andy Rudoff is an engineer who talks about investigations he has done on the topic of write atomicity. His presentation "Protecting SW From Itself: Powerfail Atomicity for Block Writes" (slides) has a section of video where he talks about how power failure impacts in-flight writes on traditional storage. He describes how he contacted hard drive manufacturers about the statement "a disk's rotational energy is used to ensure that writes are completed in the face of power loss" but the replies were non-committal as to whether that manufacturer actually performed such an action. Further, no manufacturer would say that torn writes never happen and while he was at Sun, ZFS added checksums to blocks which led to them uncovering cases of torn writes during testing. It's not all bleak though - Andy talks about how sector tearing is rare and if a write is interrupted then you usually get only the old sector, or only the new sector, or an error (so at least corruption is not silent). Andy also has an older slide deck Write Atomicity and NVM Drive Design which collects popular claims and cautions that a lot of software (including various popular filesystems on multiple OSes) are actually unknowingly dependent on sector writes being atomic...
(The following takes a Linux-centric view, but many of the concepts apply to general-purpose OSes that are not being deployed in tightly controlled hardware environments.)
Going back to 2013, BtrFS lead developer Chris Mason talked about how (the now defunct) Fusion-io had created a storage product that implemented atomic operations (Chris was working for Fusion-io at the time). Fusion-io also created a proprietary filesystem "DirectFS" (written by Chris) to expose this feature. The MariaDB developers implemented a mode that could take advantage of this behaviour by no longer doing double buffering, resulting in "43% more transactions per second and half the wear on the storage device". Chris proposed a patch so generic filesystems (such as BtrFS) could advertise that they provided atomicity guarantees via a new flag O_ATOMIC, but block layer changes would also be needed. Said block layer changes were also proposed by Chris in a later patch series that added a function blk_queue_set_atomic_write(). However, neither of the patch series ever entered the mainline Linux kernel and there is no O_ATOMIC flag in the (current 2020) mainline 5.7 Linux kernel.
Before we go further, it's worth noting that even if a lower level doesn't offer an atomicity guarantee, a higher level can still provide atomicity (albeit with performance overhead) to its users, so long as it knows when a write has reached stable storage. If fsync() can tell you when writes are on stable storage (technically not guaranteed by POSIX, but the case on modern Linux) then, because POSIX rename is atomic, you can use the create new file/fsync/rename dance to do atomic file updates, thus allowing applications to do double buffering/write-ahead logging themselves. Another example lower down in the stack is copy-on-write filesystems like BtrFS and ZFS. These filesystems give userspace programs a guarantee of "all the old data" or "all the new data" after a crash at sizes greater than a sector because of their semantics, even though a disk may not offer atomic writes. You can push this idea all the way down into the disk itself, where NAND-based SSDs don't overwrite the area currently used by an existing LBA and instead write the data to a new region and keep a mapping of where the LBA's data is now.
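To make the create/fsync/rename dance concrete, here's a minimal POSIX C sketch (Linux-centric, error handling abbreviated; the temp file must live on the same filesystem as the target, and the final directory fsync is what makes the rename itself durable):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Atomically replace `path` with `data`: after a crash, readers see either
 * the complete old file or the complete new file, never a mix. */
int atomic_replace(const char *dir, const char *path,
                   const char *tmp, const void *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }   /* data on stable storage */
    close(fd);

    if (rename(tmp, path) != 0) return -1;          /* POSIX rename is atomic */

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);    /* persist the rename itself */
    if (dfd < 0) return -1;
    fsync(dfd);
    close(dfd);
    return 0;
}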
Resuming our abridged timeline, in 2015 HP researchers wrote a paper Failure-Atomic Updates of Application Data in a Linux File System (PDF) (media) about introducing a new feature into the Linux port of AdvFS (AdvFS was originally part of DEC's Tru64):
If a file is opened with a new O_ATOMIC flag, the state of its application data will always reflect the most recent successful msync, fsync, or fdatasync. AdvFS furthermore includes a new syncv operation that combines updates to multiple files into a failure-atomic bundle [...]
In 2017, Christoph Hellwig wrote experimental patches to XFS to provide O_ATOMIC. In the "Failure-Atomic file updates for Linux" talk (slides) he explains how he drew inspiration from the 2015 paper (but without the multi-file support) and the patchset extends the XFS reflink work that already existed. However, despite an initial mailing list post, at the time of writing (mid 2020) this patchset is not in the mainline kernel.
During the database track of the 2019 Linux Plumbers Conference, MySQL developer Dimitri Kravtchuk asked if there were plans to support O_ATOMIC (link goes to start of filmed discussion). Those assembled mention the XFS work above, that Intel claims they can do atomicity on Optane but Linux doesn't provide an interface to expose it, and that Google claims to provide 16 KiB atomicity on GCE storage1. Another key point is that many database developers need something larger than 4 KiB atomicity to avoid having to do double writes: PostgreSQL needs 8 KiB, MySQL needs 16 KiB, and apparently the Oracle database needs 64 KiB. Further, Dr Richard Hipp (author of the SQLite database) asked if there's a standard interface to request atomicity, because today SQLite makes use of the F2FS filesystem's ability to do atomic updates via custom ioctl()s, but the ioctl is tied to one filesystem. Chris replied that for the time being there's nothing standard and nothing provides the O_ATOMIC interface.
At the 2021 Linux Plumbers Conference Darrick Wong re-raised the topic of atomic writes (link goes to start of filmed discussion). He pointed out there are two different things that people mean when they say they want atomic writes:
Hardware provides some atomicity API and this capability is somehow exposed through the software stack
Make the filesystem do all the work to expose some sort of atomic write API irrespective of hardware
Darrick mentioned that Christoph had ideas for 1. in the past, but Christoph has not come back to the topic, and further there are unanswered questions (how you make userspace aware of limits; if the feature were exposed it would be restricted to direct I/O, which may be problematic for many programs). Instead, Darrick suggested tackling 2. by proposing his FIEXCHANGE_RANGE ioctl, which swaps the contents of two files (the swap is restartable if it fails part way through). This approach doesn't have the limits (e.g. smallish contiguous size, maximum number of scatter-gather vectors, direct I/O only) that a hardware-based solution would have and could theoretically be implementable in the VFS, thus being filesystem agnostic...
TL;DR: if you are in tight control of your whole stack from the application all the way down to the physical disks (so you can control and qualify the whole lot), you can arrange to have what you need to make use of disk atomicity. If you're not in that situation, or you're talking about the general case, you should not depend on sector writes being atomic.
When the OS sends the command to write a sector to disk is it atomic?
At the time of writing (mid-2020):
When using a mainline 4.14+ Linux kernel
If you are dealing with a real disk
a sector write sent by the kernel is likely atomic (assuming a sector is no bigger than 4 KiB). In controlled cases (battery-backed controller, NVMe disk which claims to support atomic writes, SCSI disk where the vendor has given you assurances, etc.) a userspace program may be able to use O_DIRECT, so long as O_DIRECT isn't reverting to being buffered and the I/O doesn't get split apart or merged at the block layer (or you are sending device-specific commands and bypassing the block layer entirely). However, in the general case neither the kernel nor a userspace program can safely assume sector write atomicity.
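For the controlled case, here's a hedged C sketch of issuing a single aligned sector write with O_DIRECT on Linux; the 4096-byte size and the /dev/sdX path are placeholder assumptions (alignment requirements are device-dependent), and none of the general-case caveats above go away:

#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t sector = 4096;   /* assumed logical block size */
    void *buf;
    if (posix_memalign(&buf, sector, sector) != 0)   /* O_DIRECT wants aligned memory */
        return 1;
    memset(buf, 0x42, sector);

    /* O_DIRECT bypasses the page cache; O_DSYNC makes the write durable
     * before pwrite() returns (modulo the drive's own volatile cache). */
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT | O_DSYNC);
    if (fd < 0) { free(buf); return 1; }

    /* One sector at an aligned offset: nothing for the block layer to split. */
    ssize_t n = pwrite(fd, buf, sector, 0);

    close(fd);
    free(buf);
    return n == (ssize_t)sector ? 0 : 1;
}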
Can you ever end up with a situation where the data on disk is part X, part Y, and part garbage?
From a specification perspective, if you are talking about a SCSI disk doing a regular SCSI WRITE(16) and a power failure happening in the middle of that write, then the answer is yes: a sector could contain part X, part Y, and part garbage. A crash during an in-flight write means the data read from the area that was being written to is indeterminate, and the disk is free to choose what it returns as data from that region. This means all old data, all new data, some old and new, all zeros, all ones, random data, etc. are all "legal" values to return for said sector. From an old draft of the SBC-3 spec:
4.9 Write failures
If one or more commands performing write operations are in the task set and are being processed when power is lost (e.g., resulting in a vendor-specific command timeout by the application client) or a medium error or hardware error occurs (e.g., because a removable medium was incorrectly unmounted), the data in the logical blocks being written by those commands is indeterminate. When accessed by a command performing a read or verify operation (e.g., after power on or after the removable medium is mounted), the device server may return old data, new data, or vendor-specific data in those logical blocks.
Before reading logical blocks which encountered such a failure, an application client should reissue any commands performing write operations that were outstanding.
1 In 2018 Google announced it had tweaked its cloud SQL stack and that this allowed them to use 16k atomic writes with MySQL's innodb_doublewrite=0 via O_DIRECT... The underlying customisations Google performed were described as being in the virtualized storage, kernel, virtio and ext4 filesystem layers. Further, a no-longer-available beta document titled Best practices for 16 KB persistent disk and MySQL (archived copy) described what end users had to do to safely make use of the feature. Changes included: using an appropriate Google-provided VM, using specialized storage, changing block device parameters and carefully creating an ext4 filesystem with a specific layout. However, at some point in 2020 this document vanished from GCE's online guides, suggesting such end-user tuning is not supported.
I think torn pages are not the problem. As far as I know, all drives have enough power stored to finish writing the current sector when the power fails.
The problem is that everybody lies.
At least when it comes to the database knowing when a transaction has been committed to disk, everybody lies. The database issues an fsync, and the operating system only returns when all outstanding writes have been committed to disk, right? Maybe not. It's common, especially with RAID cards and/or SATA drives, for your program to be told everything has committed (that is, fsync returns) and yet there is data not yet on the drive.
You can try using Brad's diskchecker to find out if the platform you are going to use for your database can survive pulling the plug without losing data. The bottom line: if diskchecker fails, the platform is not safe for running a database. Databases with ACID rely upon knowing when a transaction has been committed to backing store and when it has not. This is true whether or not the database uses write-ahead logging (and if the database returns to the user without having done an fsync, then transactions can be lost in the event of a failure, so it should not claim that it provides ACID semantics).
There's a long thread on the Postgresql mailing list discussing durability. It starts out talking about SSDs, but then it gets into SATA drives, SCSI drives, and file systems. You may be surprised to learn how exposed your data can be to loss. It's a good thread for anyone with a database that needs durability, not just those running Postgresql.
Nobody seems to agree on this question. So I spent a lot of time trying different Google queries until I finally found an answer.
From Dr. Stephen Tweedie, Red Hat employee and Linux kernel filesystem and virtual memory developer, in a talk on ext3 (which he developed); transcript here. If anyone knows, it'd be him.
"It's not sufficient just to write the thing to the journal, because there's got to be some mark in the journal which says: well, (has this journal record actually) does this journal record actually represent a complete consistency to the disk? And the way you do that is by having some atomic operation which marks that transaction as being complete on disk" [23m, 14s]
"Now, disks these days actually make these guarantees. If you start a write operation to a disk, then even if the power fails in the middle of that sector write, the disk has enough power available, and it can actually steal power from the rotational energy of the spindle; it has enough power to complete the write of the sector that's being written right now. In all cases, the disks make that guarantee." [23m, 41s]
No, they are not. Worse yet, disks may lie and say the data is written when it is in fact in the disk cache, under default settings. For performance reasons, this may be desirable (actual durability is up to an order of magnitude slower) but it means if you lose power and the disk cache is not physically written, your data is gone.
Real durability is both hard and slow, unfortunately, since you need to make at least one full platter rotation per write, or 2+ with journalling/undo (at 7,200 RPM a platter makes only 120 revolutions per second). This limits you to a couple hundred DB transactions per second, and requires disabling write caching at a fairly low level.
For practical purposes though, the difference is not that big of a deal in most cases.
See:
How (not) to achieve durability.
FSync() may not flush to disk
People don't seem to agree on what happens during a sector write if the power fails. Maybe because it depends on the hardware being used, and even the filesystem.
From wikipedia (http://en.wikipedia.org/wiki/Journaling_file_system):
Some disk drives guarantee write atomicity during a power failure. Others, however, may stop writing midway through a sector after power is lost, leaving it mismatched against its error-correcting code. The sector is thus corrupt and its contents lost. A physical journal guards against such corruption because it holds a complete copy of the sector, which it can replay over the corruption upon next mount.
This seems to suggest that some hard drives will not finish writing the sector, but that a journaling filesystem can protect you from data loss the same way the xlog protects a database.
From the linux kernel mailing list in a discussion on ext3 journaling filesystem:
In any case bad sector checksum is hardware bug. Sector write is supposed to be atomic, it either happens or not.
I'd tend to believe that over the wiki comment. Actually, the very existence of a database (firebird) with no xlog implies that sector write is atomic, that it cannot clobber data you did not mean to change.
There's quite a bit of discussion here about the atomicity of sector writes, and again no agreement. But the people who are disagreeing seem to be talking about multiple-sector writes (which are not atomic on many modern hard drives). Those who are saying sector writes are atomic do seem to know more about what they're talking about.
The answer to your first question depends on the hardware involved. At least with some older hardware, the answer was yes: a power failure could result in garbage being written to the disk. Most current disks, however, have a bit of a "UPS" built into the disk itself, a capacitor that's large enough to power the disk long enough to write the data in the on-disk cache out to the disk platter. They also have circuitry to detect whether the power supply is still good, so when the power gets flaky, they write the data in the cache to the platter and ignore garbage they might receive.
As far as a "torn page" goes, a typical disk only accepts commands to write an entire sector at a time, so what you'll get will normally be an integral number of sectors written correctly, and others remaining unchanged. If, however, you're using a logical page size that's larger than a single sector, you can certainly end up with a page that's partially written.
That, however, mostly applies to a direct connection to a normal moving-platter type hard drive. With almost anything else, the rules can and often will be different. Just for an obvious example, if you're writing over the network, you're mostly at the mercy of the network protocol in use. If you transmit data over TCP, data that doesn't match up with the CRC will be rejected, but the same data transmitted over UDP, with the same corruption, might be accepted.
I suspect this assumption is wrong.
Modern HDDs encode the data in sectors and additionally protect it with ECC. Therefore you can end up with the entire sector content being garbage: it will just not make sense with the encoding used.
As for increasingly popular SSDs, the situation is even more gruesome: the block is cleared prior to being overwritten, so, depending on the firmware being used and the amount of free space, entirely unrelated sectors can be damaged.
By the way, an OS crash will not lead to data being damaged within a single sector.
I would expect one torn page to consist of part X, part Y, and part unreadable sector. If a head is in the middle of writing a sector when the power fails, the drive should park the heads immediately, so that the rest of the drive (aside from that one sector) will remain undamaged.
In some cases I would expect several torn pages consisting of part X and part Y, but only one torn page would include an unreadable sector. The reason for several torn pages is that the drive can buffer lots of writes internally, and the order of writing might interleave various sectors from various pages.
I've read conflicting stories about whether a new write to the unreadable sector will make it readable again. Even if the answer is yes, that will be new data Z, neither X nor Y.
When updating the disk, the only guarantee drive manufacturers make is that a single 512-byte write is atomic (i.e., it will either complete in its entirety or it won't complete at all); thus, if an untimely power loss occurs, only a portion of a larger write may complete (sometimes called a torn write).