Data distortion after using "scp" for transfer - scp

Recently I transferred a set of data from one server to hpcc (a high-performance computing cluster).
The commands were like:
scp /folder1/*.fastq.gz xxx#hpcc:/home/
scp /folder2/*.fastq.gz xxx#hpcc:/home/
scp /folder3/*.fastq.gz xxx#hpcc:/home/
I opened several terminals to transfer the data at the same time.
In total I have ~50 such fastq.gz files, each around 10 GB.
Is there any possibility that data (especially such large files) gets distorted (corrupted) when transferred in the way described above?
I ask because the data on the source server is in good shape, while some files are corrupted after being copied to hpcc.
thx

I strongly doubt that your data was corrupted in transit by scp(1).
TCP provides a (weak) 16-bit checksum of traffic streams (a ones'-complement sum rather than a true CRC). Because it is only sixteen bits long, relying upon TCP for data integrity means a corrupted packet will still validate roughly once in every 2^16 corrupted packets. I've long since lost the link (and the math), but vaguely recall that this means corrupted data will be accepted as correct once every two to four gigabytes across the public Internet, though those numbers relied upon a specific error rate at the time I read that statistic.
SSH version 2 introduced Message Authentication Codes (MACs) into the protocol. These are negotiated between peers, but I expect the weakest allowed would be MD5-based, which provides a 128-bit cryptographic hash of the data. Cryptographic hashes are far more advanced than the Cyclic Redundancy Checks that were more common for detecting data transmission errors two decades ago, and 128 bits is a significant expansion in checksum size. We might not trust MD5 enough to rely on it exclusively these days for resistance against dedicated attackers, but it should be sufficient for discovering accidental errors in all but the most incredible circumstances.
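To illustrate why a 16-bit checksum is weak, here is a minimal sketch of the ones'-complement Internet checksum (the style of checksum TCP uses, described in RFC 1071). The sample byte strings are made up for illustration; the point is only that two different payloads can share a checksum, for example when their 16-bit words are reordered.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = b"ACGTTGCA"
# Same 16-bit words ("AC", "GT", "TG", "CA") in a different order.
reordered = b"TGCAACGT"

print(original != reordered)
print(internet_checksum(original) == internet_checksum(reordered))
```

Both lines print `True`: the payloads differ, yet the checksum cannot tell them apart, because addition does not care about word order.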
I would look elsewhere for your corruption -- first and foremost, the destination drives where you stored your data.
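Whatever the cause, the quickest way to find out which copies are bad is to compare a strong hash of each file on both machines. A minimal sketch, assuming Python is available on both ends (the function name is illustrative); it reads in chunks so that 10 GB files do not need to fit in memory:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on the source file and on the copy; if the two digests differ, the copy is not identical to the source.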

I know this is an ancient question, but I don't think scp could be responsible either; my guess is filename collision.
You stated that you had several scp copies running at the same time. The commands pasted above will copy the contents of /folder1, /folder2 and /folder3 into /home. If you had two files with the same filename, e.g.
/folder1/argle.fastq.gz
/folder1/bargle.fastq.gz
/folder2/argle.fastq.gz
then you'll have a filename collision in /home. Since scp will happily overwrite files at the destination, and I don't think it locks files while it works, copying two different files with the same name to the same place at the same time could easily result in a corrupt file.
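A quick way to check for this before copying is to group the source paths by base name. A minimal sketch using the example paths above (the function name is illustrative):

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(paths):
    """Map each base name that occurs more than once to the paths sharing it."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: group for name, group in by_name.items() if len(group) > 1}


sources = [
    "/folder1/argle.fastq.gz",
    "/folder1/bargle.fastq.gz",
    "/folder2/argle.fastq.gz",
]
collisions = find_name_collisions(sources)
print(collisions)
```

Any name reported here would be written twice into the same destination directory.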


How do I compress or compact encrypted virtualbox disk image

I created a Debian VM inside VirtualBox with an encrypted partition, so the OS runs from an encrypted partition. Although I chose dynamic allocation when creating the disk image (VHD), after the OS installation the image appears to consume the entire allocated space. The image is now 20 GB. Is it possible to compress or compact it to a smaller size? I have seen the documentation on compacting disk images in VirtualBox, but I need to know whether the same can be done for an encrypted disk image.
Your help is greatly appreciated.
Thanks.
It depends on the type of encryption being used. Since you're using Debian I assume you're using LUKS, which is inflexible. The space has to be pre-allocated and therefore the image will utilise the full space allocated to it.
Yes, there is a way to do it, but it is complex.
Each time you want to compact it, you need to carry out the following steps carefully.
(This may not really be needed; try without it first.) Zero out all free space inside the mounted, decrypted ('clear') partitions. The free space is then zeroed in the 'clear' view, but not in the 'encrypted' view, since the encryption layer encrypts those zeros.
Shut down the machine and boot from a LiveCD ISO that lets you mount both the virtual HDD you are using and a new, empty 'dynamic' one.
Set up the partition scheme and encryption identically on the new one, but make sure the encryption setup does not do the 'fill' step, so it does not write all sectors. This is the most important part: this way the new virtual disk stays small while still being encrypted by LUKS, etc.
At this point only the partition scheme and encryption exist on the new 'small' disk. Now mount both the old and the new disk decrypted, so both are visible in 'plain' form at the same time.
Again, this is very important: clone from the old 'plain' view to the new 'plain' view only the sectors that hold data (most partition-cloning tools do that).
As I said, the most important things (to get a smaller virtual HDD) are:
Create a new, empty dynamic virtual disk.
Partition it and encrypt it without writing all sectors: omit the dd with random data prior to encryption (or else the dynamic file will grow to its maximum), and also omit filling the empty space, which would again grow the virtual disk to its maximum.
Clone the partitions from the plain view (mounted and decrypted on the fly), so the clone tool only writes the data areas of files, etc., but not free space.
A small part cannot be reduced: files inside encrypted partitions that have whole clusters filled with zeros (hopefully you do not have any of those). The cause is that, without encryption, such a cluster is all zeros, so the normal compact operation sees it and does not need to store it; but when it is encrypted, the cluster is not all zeros inside the real virtual disk file, so the compact method cannot reduce it.
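The effect can be seen with a toy experiment. The sketch below is only an illustration: it uses a made-up keystream built from SHA-256 in counter mode as a stand-in for real disk encryption (it is not LUKS or VeraCrypt, and not something to use for actual security). It shows that a zero-filled cluster stops being zero-filled, and stops being compressible, once it passes through an encryption layer.

```python
import hashlib
import zlib


def toy_keystream_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """XOR the data with a SHA-256 based keystream (illustration only)."""
    out = bytearray()
    for offset in range(0, len(plaintext), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = plaintext[offset:offset + 32]
        out.extend(p ^ k for p, k in zip(chunk, block))
    return bytes(out)


cluster = bytes(4096)  # one all-zero 4 KiB cluster, as seen in the 'plain' view
encrypted = toy_keystream_encrypt(b"hypothetical key", cluster)

plain_compressed = len(zlib.compress(cluster))
cipher_compressed = len(zlib.compress(encrypted))
print(any(cluster), any(encrypted))
print(plain_compressed, cipher_compressed)
```

The plain cluster compresses to a few dozen bytes, while the encrypted one stays at roughly its full size, which is why a compaction tool that looks for zero blocks finds nothing to reclaim in the encrypted view.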
The idea behind all is:
When encryption is on, to get the smallest virtual disk size, start with an empty dynamic disk and write as few clusters as possible to it when cloning the previous one.
As said, it is a lot of work, and over time, as writes occur, the disk will start growing again.
My personal recommendation is to get a 'BIG' and 'FAST' disk and use a fixed-size virtual disk. If I read correctly and your disk is only 20 GiB, you gain a lot of speed by having it fixed rather than dynamic, and you will not have to worry about 'fragmentation', etc.
Remember, if you use a USB drive for it, get one able to write at 30 MiB/s (if you only have USB 2.0 ports). If you are lucky (like me) and have at least one USB 3 port (better if it is USB 3.1 Gen 2 Type-C), search for a 2.5-inch 500 GiB SATA III HDD (with a write speed greater than 100 MiB/s; it is really cheap, less than 25 euros) and a SATA III to USB 3.1 Gen 2 Type-C enclosure (also cheap, some are under 15 euros), and avoid having to 'reduce', 'clone', etc.
I have 10 virtual machines on a 500 GiB disk (with more than 50% free space), each 20 GiB in size (with a Windows system inside taking nearly 16 GiB) and VeraCrypt encryption, so my case is quite close to yours. I opted to use a USB 3.1 Gen 2 Type-C enclosure to hold all the fixed-size VDI files. My experience is that encrypted fixed-size disks fly compared to non-encrypted dynamic ones.
Of course, make sure you do the needed tests (when encryption is involved): test the virtual HDD speed with no encryption, then benchmark the encryption algorithms in RAM, and choose a method that is faster than 1.6 times the speed of the disk, so encryption will not be a bottleneck. Otherwise you can get really bad speed caused by encryption.
Also think about this: how many cores do you expose to the guest? That makes encryption speed very different. But also consider the worst case: how much CPU will the non-encryption threads use on that guest?
As an example, if inside the guest you are doing LZMA2 compression (or H.264 video transcoding, for example), the CPU left free for encryption is very low, so encryption will slow things down a lot. Such workloads also do a lot of disk I/O, so a lot of encryption/decryption per second is needed.
Maybe a better approach would be to encrypt the 'container', not the 'system'. In other words, encrypt where the VDI files are stored, not the whole guest system; create a container per VDI if you want different passphrases, etc. That way the VDI can also be dynamic and be compacted, etc.
Of course, I could be of more help if you said which encryption scheme (without details) you are using.
This makes a really great difference in possible answers:
Are you encrypting the system partition with a tool that runs in the guest? Then use the 'clone only used clusters' trick.
Are you encrypting by turning the VDI encryption property on? Maybe the VirtualBox console will help to compact them.
Are you encrypting the container where the VDI is stored? I am quite sure this is not your case, since then compacting can be done as normal: the VDI is not encrypted at all, and neither is anything inside it.
I talk about VDI, but the same applies to the other formats: VHD, VHDX, etc.
Remember: if encryption is done in the guest and you still want to reduce (compact) the virtual HDD file, start with a new dynamic one, set up the partition scheme and encryption without filling the whole disk (at this point the virtual disk file should be small, just a few megabytes), then clone all used clusters, but not the unused ones, from the old disk to the new one.
Advice: be prepared to repeat the 'compact by cloning' procedure a few times per 100 hours of intense use of the guest. If the gain is less than 50%, it is not worth the effort, and the best you can do is use a fixed size.
Special note: with a fixed size, access speed is much higher than with a dynamic size. Having a dynamic disk that has grown to 100% of its size, as if it were fixed, is a big loss in speed. How much? You must test on your machine; it depends a lot on the CPU, the I/O speed (input/output operations per second) of your storage, the transfer speed (MiB/s), and other factors, so it is best to run some tests.
Since you are talking about 20 GiB, better test the fixed size; I am quite sure you will enjoy it a lot.
It would be another matter with a 500 GiB system partition that is only 10% full: since the space gain could be 450 GiB, the clone method is worthwhile to compact it. That is why I explain how to do it, for such people, for you, and for anyone.
P.S.: If someone does not know how to do something, that does not mean it is not possible; and if someone says something is not possible, that person had better provide a demonstration or be prepared to be called an idiot. Technology improves a lot over time, knowledge a lot more.

Can I securely delete a file this way?

Let's say I have a 10 MB file and go through these steps:
Open it in my favorite programming language for Read/Write
Erase everything in the stream
Write exactly 10 MB of random data back to the same stream
Save the changes to disk
Delete the file through normal means
Can I be certain that the new 10 MB successfully overwrote the old 10 MB on a sector level in the hard drive? Or is it possible that the "erase everything in the stream" step deletes the old file and potentially writes the new 10 MB in a new location?
The data may still be accessible by a professional who knows what they're doing and can access the raw data on the disk (i.e. without going through the filesystem).
Your program is basically equivalent to the Linux shred command, which contains the following warning:
CAUTION: Note that shred relies on a very important assumption: that the file system overwrites data in place. This is the traditional way to do things, but many modern file system designs do not satisfy this assumption. The following are examples of file systems on which shred is not effective, or is not guaranteed to be effective in all file system modes:
* log-structured or journaled file systems, such as those supplied with AIX and Solaris (and JFS, ReiserFS, XFS, Ext3, etc.)
* file systems that write redundant data and carry on even if some writes fail, such as RAID-based file systems
* file systems that make snapshots, such as Network Appliance's NFS server
* file systems that cache in temporary locations, such as NFS version 3 clients
* compressed file systems
There are other situations as well, such as SSDs with wear leveling.
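For reference, the steps from the question can be sketched as follows. This is a minimal illustration, not a secure-deletion tool: the function names are made up, and the sketch inherits every caveat above. It opens the file in "r+b" mode so the existing file is overwritten rather than truncated and recreated, but whether the bytes land on the same physical sectors is still decided by the file system and the storage device.

```python
import os


def overwrite_in_place(path: str) -> None:
    """Overwrite a file's contents with random bytes without truncating it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as handle:  # "r+b" keeps the existing file and its length
        handle.write(os.urandom(size))
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to the device


def overwrite_then_delete(path: str) -> None:
    """Overwrite once, then delete through normal means."""
    overwrite_in_place(path)
    os.remove(path)
```

On a copy-on-write or log-structured file system, or on an SSD, the old data can still survive elsewhere even after this runs successfully.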
No. Since commits are atomic on any modern file system, you can be almost 100% certain the 10 MB did not overwrite the old 10 MB, and that's before we consider journaled file systems that actually guarantee this.
Short answer: No.
This might depend on your language and OS. I have a feeling that the stream calls are passed to the OS and the OS then decides what to do, so I'd lean towards your second question being correct just to err on the safe side. Furthermore, magnetic artifacts will be present after a deletion which can still be used to recover said data. Even overwriting the same sectors with all zeros could leave behind the data in a faded state. Generally it is recommended to make several deletion passes. See here for an explanation or here for an open source C# file shredder.
For Windows you could use the SDelete command line utility which implements the Department of Defense clearing and sanitizing standard:
Secure delete applications overwrite a deleted file's on-disk data
using techniques that are shown to make disk data unrecoverable, even
using recovery technology that can read patterns in magnetic media
that reveal weakly deleted files.
Of particular note:
Compressed, encrypted and sparse files are managed by NTFS in 16-cluster
blocks. If a program writes to an existing portion of such a file NTFS
allocates new space on the disk to store the new data and after the
new data has been written, deallocates the clusters previously
occupied by the file.

Basic high performance data authenticity

(I am not a native speaker and might not be correct in terms of terminology. Sorry about that.)
I am transmitting data via radio between AVR microcontrollers for personal use and would like clients to demonstrate the authenticity of transmitted data, in that it originates from one of the authorized clients. This means I do not require non-repudiation and would be able to pre-define a shared key. I have done some research on different approaches and found that I need some assistance in choosing one that best meets my requirements.
Please understand that I do not require maximum security. I would simply like to prevent a potential script kiddie neighbor from breaking in within a matter of hours. Should breaking in with average consumer gear require a number of weeks as of today I would be OK.
The messages I am transmitting are rather small in size (no more than 30 bytes with only a few bytes payload) and the frequency would be no more than 30 messages / min.
One use case is a motion detector sending a message over the air to a processing unit which then sends another message over the air to a light switch. Please do not focus on transport. This question is only about data authenticity.
I am running the client / server software (in C) on 20 MHz AVR microcontrollers with very limited Flash and RAM. So I am looking for a solution with small code size and RAM utilization while still providing a high data rate.
I did some performance testing with an MD5 implementation (C) creating hashes from 20 bytes data and found that it might be too slow. I understand that an MD5 implementation by itself would not solve the requirement. I did the testing only for evaluating hashing performance.
Thanks for comments
I would use 128-bit AES to sign the messages. Here is an excellent source that has already implemented this for AVR, with full documentation of sizes and cycle counts, including different versions that trade off size/speed. http://avrcryptolib.das-labor.org/trac/wiki/AES
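To show the shape of such a scheme on a PC, here is a minimal sketch of a shared-key message authentication code with a truncated tag. Several things here are assumptions for illustration and not part of the answer above: it uses HMAC-SHA256 from the Python standard library instead of an AES-based MAC (the standard library has no AES), the key, the 8-byte tag length and the frame layout are made up, and the message counter used to reject replayed frames is an addition of this sketch. The real thing would be C on the AVR.

```python
import hashlib
import hmac
import struct

SHARED_KEY = b"hypothetical-key"  # assumption: pre-shared between all clients
TAG_LEN = 8                       # assumption: truncated tag to keep frames small


def make_frame(counter: int, payload: bytes, key: bytes = SHARED_KEY) -> bytes:
    """Build counter (4 bytes) + payload + truncated MAC over both."""
    header = struct.pack(">I", counter)
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()[:TAG_LEN]
    return header + payload + tag


def verify_frame(frame: bytes, last_counter: int, key: bytes = SHARED_KEY):
    """Return (counter, payload) if the frame is authentic and fresh, else None."""
    if len(frame) < 4 + TAG_LEN:
        return None
    body, tag = frame[:-TAG_LEN], frame[-TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):
        return None  # wrong key or modified frame
    counter = struct.unpack(">I", body[:4])[0]
    if counter <= last_counter:
        return None  # replayed or stale frame
    return counter, body[4:]


frame = make_frame(1, b"motion")
print(verify_frame(frame, last_counter=0))
print(verify_frame(frame, last_counter=1))
```

The first call returns `(1, b'motion')`; the second returns `None` because the same counter value has already been seen.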
If you are happy with a compromise, calculate the CRC-32 or CRC-64 of the message payload with a secret key appended to the end (of the payload, not the CRC checksum). Both ends can do this with the same secret key to get the same result. I'm not sure exactly how hackable this is, but the risk certainly isn't zero.

Grand Unified Theory of logging

Is there a Grand Unified Theory of logging? Shall we develop one? Question (just to show this is not a discussion :): how can I improve on the following? (Note that I live mainly in the embedded world, but non-embedded suggestions are also welcome.)
How do you log, when do you log, what do you log, what do you do with log files?
How do you log - I generally have macros, #ifdef TESTING, sort of thing. They write to RAM and a low priority process writes them out when the system is idle (using UDP, since I do embedded systems)
When do you log - same as voting, early and often. At every (in)significant program event, I log at varying levels. Events received, transaction succeed/fail, data updated, etc
What do you log - Fatal/Error/Warning/Info/Debug/Trace is covered in When to use the different log levels?
What do you do with log files - 1) keep them (in CVS), both pass and fail 2) capture everything and filter later in case I can't repeat a problem. I have tools to filter the log by "level" (Fatal/Error/etc), process, file, etc. And to draw message sequence charts, dump data structures, draw histograms of memory usage - what am I missing?
Hmmm, binary or ascii log file format? Ascii is bulkier, but binary requires more processing. I have done both, currently I use ascii
Question - did I miss anything, and how can I improve on this?
You could "instrument" your code in many different ways, everything from start-up/shut-down events to individual machine instruction execution (using a processor emulator). Of all the possibilities, what's worth doing? Don't just do it for the sake of completeness; have a specific goal in mind. A business case if you like, with a benefit you expect to receive. E.g.:
Insight into CPU task execution times/patterns to enable optimisation (if you need to improve performance).
Insight into other systems to resolve system integration issues (e.g. what messages is your VoIP box sending and receiving when it connects to a particular peer?)
Insight into the nature of errors (for field diagnostics)
Aid in development
Aid in validation testing
I imagine that there's no grand unified theory of logging, because what you do would depend on many details:
Quantity of data
Type of data
  Events
  Streamed audio/video
Available storage
  Storage speed
  Storage capacity
Available channels to extract data
  Bandwidth
  Cost
  Availability
    Internet connected 24×7
    Site visit required
    Need to unlock a rusty gate, climb a ladder onto a roof, to plug in a cable, after filling out OHS documentation
    Need to wait until the Antarctic winter is over and the ice sheets thaw
Random access vs linear access (e.g. if you compress it, do you need to read from the start to decompress and access some random point?)
Need to survive error conditions
  Watchdog reboots
  Possible data corruption
    Due to failing power supply
    Due to unreliable storage media
  Need to survive a plane crash
As for ASCII vs binary, I usually prefer to keep the logging simple, and put any nice presentation in a PC application that decodes the data. It's usually easier to create a user-friendly presentation in PC software (written in e.g. Python) rather than in the embedded system itself.
did I miss anything, and how can I
improve on this?
Asynchronous logging.
Using multiple log files for the same process for different logging abstractions. e.g. the process' activities are logged in a normal log file. And the process' stats (periodic statistics that you might be interested in) are logged in a separate stats log file.
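As a PC-side illustration of asynchronous logging, the sketch below uses Python's standard `logging.handlers.QueueHandler` and `QueueListener`: the code being logged only puts records on a queue, and a background thread hands them to the (possibly slow) sink. The logger name and the list-backed handler are made up for the example; on an embedded target the same role is played by a RAM buffer drained by a low-priority task.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Stand-in for a slow sink (file, UDP socket, ...): collects messages."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())


log_queue = queue.Queue()
slow_sink = ListHandler()
listener = logging.handlers.QueueListener(log_queue, slow_sink)

logger = logging.getLogger("hypothetical.async_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.handlers.QueueHandler(log_queue))

listener.start()                      # background thread drains the queue
logger.info("event received")         # returns as soon as the record is queued
logger.info("transaction %s", "ok")
listener.stop()                       # flushes remaining records, joins the thread

print(slow_sink.records)
```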
Hmmm, binary or ascii log file format?
Ascii is bulkier, but binary requires
more processing. I have done both,
currently I use ascii
ASCII is good. More often than not, logs are meant to be used for debugging purposes. A human readable form eases and speeds this up.
However, if your logs are used mostly to record information which is used later on for analysis and generation of reports (e.g. stats or latencies etc.) a binary format would be preferred. You can go one step ahead and use a custom format along with a db service which does index based sorting, where the index can be a tuple of time with the event type.
--
One thing which may be helpful is to have a "maybeLogger" object which will accept log records for an operation which may or may not succeed, and then either ditch those records if the operation succeeds or fails in an uninteresting way, or log them if it does something interesting. This is relatively easy to do in something like .net. In an embedded system, it can only be done really easily if the amount of stuff to be logged is small enough to fit in free RAM, but one could probably use a garbage-collection-based approach to hold stuff in flash (have one 'stream' of data in flash for new log entries, and another for ones that are confirmed to be interesting; periodically move data which is known to be good from the first stream to the second).
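A minimal sketch of that idea, assuming everything fits in RAM (the class and method names are illustrative, not from any library): records are buffered per operation and either dropped or written out once the outcome is known.

```python
class MaybeLogger:
    """Buffer log records for one operation; keep them only if it turns out interesting."""

    def __init__(self, sink):
        self._sink = sink      # callable that actually writes one record
        self._pending = []

    def log(self, message: str) -> None:
        self._pending.append(message)

    def discard(self) -> None:
        """Operation succeeded or failed in an uninteresting way: drop the records."""
        self._pending.clear()

    def commit(self) -> None:
        """Something interesting happened: write out everything buffered so far."""
        for message in self._pending:
            self._sink(message)
        self._pending.clear()


written = []
log = MaybeLogger(written.append)

log.log("start transaction 1")
log.discard()                      # transaction 1 was fine, nothing is written

log.log("start transaction 2")
log.log("checksum mismatch")
log.commit()                       # transaction 2 failed, keep the whole trail

print(written)
```

Only the records of the second transaction reach the sink.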
Here's my $0.02.
I only log when I'm having a problem and need to track down the source. Usually this has to do with a customer's environment, so I can't just attach the debugger. My solution is to enable the Telnet port and use that to print out statements as to where the program is and values of variables.
I do ASCII only because it's over telnet.
Another aspect of telnet is that it is pretty simple. It's a TCP port with text being thrown out. Very little processing other than the normal TCP headaches.
The log files are dumped as soon as I get them because I have not tried to capture and save a telnet session. I guess I could with WireShark, but I don't need a history of that session. I just need to find the problem and verify a fix.

Are disk sector writes atomic?

Clarified Question:
When the OS sends the command to write a sector to disk is it atomic? i.e. Write of new data succeeds fully or old data is left intact should the power fail immediately following the write command. I don't care about what happens in multiple sector writes - torn pages are acceptable.
Old Question:
Say you have old data X on disk, you write new data Y over it, and a tree falls on the power line during that write. With no fancy UPS or battery backed disk controller, you can end up with a torn page, where the data on disk is part X and part Y. Can you ever end up with a situation where the data on disk is part X, part Y, and part garbage?
I've been trying to understand the design of ACID systems like databases, and to my naive thinking, it seems Firebird, which does not use a write-ahead log, relies on a given write not destroying old data (X), only failing to fully write new data (Y). That means that if part of X is being overwritten, only the part of X that is being overwritten can be changed, not the part of X we intend to keep.
To clarify, this means if you have a page sized buffer, say 4096 bytes, filled with half Y, half X that we want to keep - and we tell the OS to write that buffer over X, there is no situation short of serious disk failure where the half X that we want to keep is corrupted during the write.
The traditional (SCSI, ATA) disk protocol specifications don't guarantee that any/every sector write is atomic in the event of sudden power loss (but see below for discussion of the NVMe spec). However, it seems tacitly agreed that non-ancient "real" disks quietly try their best to offer this behaviour (e.g. Linux kernel developer Christoph Hellwig mentions this off-hand in the 2017 presentation "Failure-Atomic file updates for Linux").
When it comes to synthetic disks (e.g. network attached block devices, certain types of RAID etc.) things are less clear and they may or may not offer sector atomicity guarantees while legally behaving per their given spec. Imagine a RAID 1 array (without a journal) made up of one disk that offers 512-byte sectors and another disk that offers 4KiB sectors, thus forcing the RAID to expose a sector size of 4KiB. As a thought experiment, you can construct a scenario where each individual disk offers sector atomicity (relative to its own sector size) but where the RAID device does not in the face of power loss. This is because it would depend on whether the 512-byte sector disk was the one being read by the RAID and how many of the eight 512-byte sectors comprising the 4KiB RAID sector it had written before the power failed.
Sometimes specifications offer atomicity guarantees but only on certain write commands. The SCSI disk spec is an example of this and the optional WRITE ATOMIC(16) command can even give a guarantee beyond a sector but being optional it's rarely implemented (and thus rarely used). The more commonly implemented COMPARE AND WRITE is also atomic (potentially across multiple sectors too) but again it's optional for a SCSI device and comes with different semantics to a plain write...
Curiously, the NVMe spec was written in such a way to guarantee sector atomicity thanks to Linux kernel developer Matthew Wilcox. Devices that are compliant with that spec have to offer a guarantee of sector write atomicity and may choose to offer contiguous multi-sector atomicity up to a specified limit (see the AWUPF field). However, it's unclear how you can discover and use any multi-sector guarantee if you aren't currently in a position to send raw NVMe commands...
Andy Rudoff is an engineer who talks about investigations he has done on the topic of write atomicity. His presentation "Protecting SW From Itself: Powerfail Atomicity for Block Writes" (slides) has a section of video where he talks about how power failure impacts in-flight writes on traditional storage. He describes how he contacted hard drive manufacturers about the statement "a disk's rotational energy is used to ensure that writes are completed in the face of power loss" but the replies were non-committal as to whether that manufacturer actually performed such an action. Further, no manufacturer would say that torn writes never happen, and while he was at Sun, ZFS added checksums to blocks, which led to them uncovering cases of torn writes during testing. It's not all bleak though: Andy talks about how sector tearing is rare and if a write is interrupted then you usually get only the old sector, or only the new sector, or an error (so at least corruption is not silent). Andy also has an older slide deck, "Write Atomicity and NVM Drive Design", which collects popular claims and cautions that a lot of software (including various popular filesystems on multiple OSes) is actually unknowingly dependent on sector writes being atomic...
(The following takes a Linux-centric view but many of the concepts apply to general-purpose OSes that are not being deployed in a tightly controlled hardware environment.)
Going back to 2013, BtrFS lead developer Chris Mason talked about how (the now defunct) Fusion-io had created a storage product that implemented atomic operation (Chris was working for Fusion-io at the time). Fusion-io also created a proprietary filesystem "DirectFS" (written by Chris) to expose this feature. The MariaDB developers implemented a mode that could take advantage of this behaviour by no longer doing double buffering resulting in "43% more transactions per second and half the wear on the storage device". Chris proposed a patch so generic filesystems (such as BtrFS) could advertise that they provided atomicity guarantees via a new flag O_ATOMIC but block layer changes would also be needed. Said block layer changes were also proposed by Chris in a later patch series that added a function blk_queue_set_atomic_write(). However, neither of the patch series ever entered the mainline Linux kernel and there is no O_ATOMIC flag in the (current 2020) mainline 5.7 Linux kernel.
Before we go further, it's worth noting that even if a lower level doesn't offer an atomicity guarantee, a higher level can still provide atomicity (albeit with performance overhead) to its users so long as it knows when a write has reached stable storage. If fsync() can tell you when writes are on stable storage (technically not guaranteed by POSIX but the case on modern Linux) then because POSIX rename is atomic you can use the create new file/fsync/rename dance to do atomic file updates, thus allowing applications to do double buffering/Write Ahead Logging themselves. Another example lower down in the stack is Copy On Write filesystems like BtrFS and ZFS. These filesystems give userspace programs a guarantee of "all the old data" or "all the new data" after a crash at sizes greater than a sector because of their semantics even though a disk may not offer atomic writes. You can push this idea all the way down into the disk itself, where NAND based SSDs don't overwrite the area currently used by an existing LBA and instead write the data to a new region and keep a mapping of where the LBA's data is now.
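The create new file/fsync/rename dance can be sketched as follows. This is a minimal illustration for a POSIX system, relying on the same assumption as the text (that fsync() really does reach stable storage); the function name is made up. The temporary file is created in the same directory as the target so that the rename stays within one filesystem.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of path so readers see all-old or all-new data."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # new file next to the target
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # new contents on stable storage first
        os.replace(tmp_path, path)      # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)         # do not leave a stray temp file behind
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                # persist the rename itself (POSIX only)
    finally:
        os.close(dir_fd)
```

Note that `tempfile.mkstemp` creates the file with owner-only permissions, so a real implementation may need to copy the original file's mode before the rename.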
Resuming our abridged timeline, in 2015 HP researchers wrote a paper, "Failure-Atomic Updates of Application Data in a Linux File System" (PDF) (media), about introducing a new feature into the Linux port of AdvFS (AdvFS was originally part of DEC's Tru64):
If a file is opened with a new O_ATOMIC flag, the state of its application data will always reflect the most recent successful msync, fsync, or fdatasync. AdvFS furthermore includes a new syncv operation that combines updates to multiple files into a failure-atomic bundle [...]
In 2017, Christoph Hellwig wrote experimental patches to XFS to provide O_ATOMIC. In the "Failure-Atomic file updates for Linux" talk (slides) he explains how he drew inspiration from the 2015 paper (but without the multi-file support) and the patchset extends the XFS reflink work that already existed. However, despite an initial mailing list post, at the time of writing (mid 2020) this patchset is not in the mainline kernel.
During the database track of the 2019 Linux Plumbers Conference, MySQL developer Dimitri Kravtchuk asked if there were plans to support O_ATOMIC (link goes to start of filmed discussion). Those assembled mentioned the XFS work above, that Intel claims it can do atomicity on Optane but Linux doesn't provide an interface to expose it, and that Google claims to provide 16KiB atomicity on GCE storage. Another key point is that many database developers need something larger than 4KiB atomicity to avoid having to do double writes: PostgreSQL needs 8KiB, MySQL needs 16KiB and apparently the Oracle database needs 64KiB. Further, Dr Richard Hipp (author of the SQLite database) asked if there's a standard interface to request atomicity, because today SQLite makes use of the F2FS filesystem's ability to do atomic updates via custom ioctl()s but the ioctl was tied to one filesystem. Chris replied that for the time being there's nothing standard and nothing provides the O_ATOMIC interface.
At the 2021 Linux Plumbers Conference Darrick Wong re-raised the topic of atomic writes (link goes to start of filmed discussion). He pointed out there are two different things that people mean when they say they want atomic writes:
1. Hardware provides some atomicity API and this capability is somehow exposed through the software stack
2. Make the filesystem do all the work to expose some sort of atomic write API irrespective of hardware
Darrick mentioned that Christoph had ideas for 1. in the past, but Christoph has not come back to the topic and there are unanswered questions (how to make userspace aware of limits; if the feature were exposed it would be restricted to direct I/O, which may be problematic for many programs). Instead, to tackle 2., Darrick suggested his FIEXCHANGE_RANGE ioctl, which swaps the contents of two files (the swap is restartable if it fails part way through). This approach doesn't have the limits (e.g. smallish contiguous size, maximum number of scatter gather vectors, direct I/O only) that a hardware-based solution would have and could theoretically be implemented in the VFS, making it filesystem agnostic...
TL;DR: if you are in tight control of your whole stack from the application all the way down to the physical disks (so you can control and qualify the whole lot) you can arrange to have what you need to make use of disk atomicity. If you're not in that situation or you're talking about the general case, you should not depend on sector writes being atomic.
When the OS sends the command to write a sector to disk is it atomic?
At the time of writing (mid-2020):
When using a mainline 4.14+ Linux kernel
If you are dealing with a real disk
a sector write sent by the kernel is likely atomic (assuming a sector is no bigger than 4KiB). In controlled cases (battery backed controller, NVMe disk which claims to support atomic writes, SCSI disk where the vendor has given you assurances etc.) a userspace program may be able to use O_DIRECT, so long as O_DIRECT isn't falling back to buffered I/O and the I/O doesn't get split apart or merged at the block layer, or you are sending device specific commands and bypassing the block layer. However, in the general case neither the kernel nor a userspace program can safely assume sector write atomicity.
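One necessary (but far from sufficient) condition for any of this is that the write does not straddle more than one atomic unit. A small sketch of that check, where the function names are hypothetical and the 4096-byte unit is just the size assumed above:

```python
def units_touched(offset: int, length: int, unit: int = 4096) -> int:
    """Number of distinct `unit`-sized blocks a write of `length` bytes at `offset` touches."""
    if length <= 0:
        return 0
    first = offset // unit
    last = (offset + length - 1) // unit
    return last - first + 1

def fits_one_atomic_unit(offset: int, length: int, unit: int = 4096) -> bool:
    """True only for a write that is aligned to `unit` and exactly one unit long."""
    return length == unit and offset % unit == 0
```

For example, a 4096-byte write at offset 512 touches two units, so even a device with perfectly atomic 4KiB sectors could leave it half applied.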
Can you ever end up with a situation where the data on disk is part X, part Y, and part garbage?
From a specification perspective if you are talking about a SCSI disk doing a regular SCSI WRITE(16) and a power failure happening in the middle of that write then the answer is yes: a sector could contain part X, part Y AND part garbage. A crash during an inflight write means the data read from the area that was being written to is indeterminate and the disk is free to choose what it returns as data from that region. This means all old data, all new data, some old and new, all zeros, all ones, random data etc. are all "legal" values to return for said sector. From an old draft of the SBC-3 spec:
4.9 Write failures
If one or more commands performing write operations are in the task set and are being processed when power is lost (e.g., resulting in a vendor-specific command timeout by the application client) or a medium error or hardware error occurs (e.g., because a removable medium was incorrectly unmounted), the data in the logical blocks being written by those commands is indeterminate. When accessed by a command performing a read or verify operation (e.g., after power on or after the removable medium is mounted), the device server may return old data, new data, or vendor-specific data in those logical blocks.
Before reading logical blocks which encountered such a failure, an application client should reissue any commands performing write operations that were outstanding.
[1] In 2018 Google announced it had tweaked its cloud SQL stack and that this allowed them to use 16k atomic writes for MySQL with innodb_doublewrite=0 via O_DIRECT... The underlying customisations Google performed were described as being in the virtualized storage, kernel, virtio and the ext4 filesystem layers. Further, a no longer available beta document titled Best practices for 16 KB persistent disk and MySQL (archived copy) described what end users had to do to safely make use of the feature. Changes included: using an appropriate Google provided VM, using specialized storage, changing block device parameters and carefully creating an ext4 filesystem with a specific layout. However, at some point in 2020 this document vanished from GCE's online guides, suggesting such end user tuning is not supported.
I think torn pages are not the problem. As far as I know, all drives have enough power stored to finish writing the current sector when the power fails.
The problem is that everybody lies.
At least when it comes to the database knowing when a transaction has been committed to disk, everybody lies. The database issues an fsync, and the operating system only returns when all outstanding writes have been committed to disk, right? Maybe not. It's common, especially with RAID cards and/or SATA drives, for your program to be told everything has committed (that is, fsync returns) and yet there is data not yet on the drive.
You can try using Brad's diskchecker to find out if the platform you are going to use for your database can survive pulling the plug without losing data. The bottom line: If diskchecker fails, the platform is not safe for running a database. Databases with ACID rely upon knowing when a transaction has been committed to backing store and when it has not. This is true whether or not the database uses write-ahead logging (and if the database returns to the user without having done an fsync, then transactions can be lost in the event of a failure, so it should not claim that it provides ACID semantics).
There's a long thread on the PostgreSQL mailing list discussing durability. It starts out talking about SSDs, but then it gets into SATA drives, SCSI drives, and file systems. You may be surprised to learn how exposed your data can be to loss. It's a good thread for anyone with a database that needs durability, not just those running PostgreSQL.
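For reference, this is roughly what "doing an fsync" looks like from a program's side. It is a minimal sketch assuming a POSIX system; `durable_write` is a hypothetical helper name. Note that a successful return only means the OS claims the data is on stable storage, and whether the hardware underneath is telling the truth is exactly what a tool like diskchecker tests.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and ask the OS to flush it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # flush the file's data and metadata
    finally:
        os.close(fd)
    # Flush the containing directory too, so the new file name survives a crash.
    # (Opening a directory like this is a POSIX assumption.)
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "commit.log")
    durable_write(target, b"txn 1 committed\n")
    with open(target, "rb") as f:
        contents = f.read()
```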
Nobody seems to agree on this question. So I spent a lot of time trying different Google queries until I finally found an answer.
From Dr. Stephen Tweedie, Red Hat employee and Linux kernel filesystem and virtual memory developer, in a talk on ext3 (which he developed), transcript here. If anyone knows, it'd be him.
"It's not sufficient just to write the thing to the journal, because there's got to be some mark in the journal which says: well, (has this journal record actually) does this journal record actually represent a complete consistency to the disk? And the way you do that is by having some atomic operation which marks that transaction as being complete on disk" [23m, 14s]
"Now, disks these days actually make these guarantees. If you start a write operation to a disk, then even if the power fails in the middle of that sector write, the disk has enough power available, and it can actually steal power from the rotational energy of the spindle; it has enough power to complete the write of the sector that's being written right now. In all cases, the disks make that guarantee." [23m, 41s]
No, they are not. Worse yet, disks may lie and say the data is written when it is in fact in the disk cache, under default settings. For performance reasons, this may be desirable (actual durability is up to an order of magnitude slower) but it means if you lose power and the disk cache is not physically written, your data is gone.
Real durability is both hard and slow unfortunately, since you need to make at least one full rotation per write, or 2+ with journalling/undo. This limits you to a couple hundred DB transactions per second, and requires disabling write caching at a fairly low level.
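The "couple hundred" figure follows from simple arithmetic on spindle speed. A quick sketch, assuming exactly one (or two) full rotations per committed write and ignoring seek time; the function name is made up for illustration:

```python
def max_commits_per_second(rpm: int, rotations_per_commit: int = 1) -> float:
    """Upper bound on commits/s if each commit costs whole platter rotations."""
    return rpm / 60 / rotations_per_commit

desktop = max_commits_per_second(7200)            # 120.0
enterprise = max_commits_per_second(15000)        # 250.0
journalled = max_commits_per_second(7200, 2)      # 60.0
```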
For practical purposes though, the difference is not that big of a deal in most cases.
See:
How (not) to achieve durability.
FSync() may not flush to disk
People don't seem to agree on what happens during a sector write if the power fails. Maybe because it depends on the hardware being used, and even the filesystem.
From wikipedia (http://en.wikipedia.org/wiki/Journaling_file_system):
Some disk drives guarantee write atomicity during a power failure. Others, however, may stop writing midway through a sector after power is lost, leaving it mismatched against its error-correcting code. The sector is thus corrupt and its contents lost. A physical journal guards against such corruption because it holds a complete copy of the sector, which it can replay over the corruption upon next mount.
Seems to suggest that some hard drives will not finish writing the sector, but that a journaling filesystem can protect you from data loss the same way the xlog protects a database.
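A journal (or xlog) only protects you if replay can tell a complete record from a half-written one. A minimal sketch of that idea follows, using a small commit mark with a checksum after each record. The record layout, the `CMIT` marker and the function names are all made up for illustration and do not correspond to ext3's or any database's real on-disk format.

```python
import struct
import zlib

COMMIT = b"CMIT"   # hypothetical 4-byte commit marker

def append_txn(journal: bytearray, payload: bytes) -> None:
    """Append a length-prefixed record followed by a commit mark carrying its CRC32."""
    journal += struct.pack(">I", len(payload)) + payload
    journal += COMMIT + struct.pack(">I", zlib.crc32(payload))

def replay(journal: bytes) -> list:
    """Return payloads of complete transactions; stop at the first incomplete one."""
    out, pos = [], 0
    while pos + 4 <= len(journal):
        (n,) = struct.unpack_from(">I", journal, pos)
        end = pos + 4 + n + 8
        if end > len(journal):
            break                               # record or commit mark truncated
        payload = journal[pos + 4 : pos + 4 + n]
        mark = journal[pos + 4 + n : end]
        if mark != COMMIT + struct.pack(">I", zlib.crc32(payload)):
            break                               # commit mark missing or wrong
        out.append(bytes(payload))
        pos = end
    return out

journal = bytearray()
append_txn(journal, b"set a=1")
append_txn(journal, b"set b=2")
crashed = bytes(journal[:-3])   # power lost while writing the last commit mark

complete = replay(bytes(journal))
after_crash = replay(crashed)
```

After the simulated crash only the first transaction is replayed; the second is discarded because its commit mark never fully reached the journal.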
From the Linux kernel mailing list in a discussion on ext3 journaling filesystem:
In any case bad sector checksum is hardware bug. Sector write is supposed to be atomic, it either happens or not.
I'd tend to believe that over the wiki comment. Actually, the very existence of a database (firebird) with no xlog implies that sector write is atomic, that it cannot clobber data you did not mean to change.
There's quite a bit of discussion here about atomicity of sector writes, and again no agreement. But the people who are disagreeing seem to be talking about multiple-sector writes (which are not atomic on many modern hard drives). Those who are saying sector writes are atomic do seem to know more about what they're talking about.
The answer to your first question depends on the hardware involved. At least with some older hardware, the answer was yes -- a power failure could result in garbage being written to the disk. Most current disks, however, have a bit of a "UPS" built into the disk itself -- a capacitor that's large enough to power the disk long enough to write the data in the on-disk cache out to the disk platter. They also have circuitry to detect whether the power supply is still good, so when the power gets flaky, they write the data in the cache to the platter, and ignore garbage they might receive.
As far as a "torn page" goes, a typical disk only accepts commands to write an entire sector at a time, so what you'll get will normally be an integral number of sectors written correctly, and others remaining unchanged. If, however, you're using a logical page size that's larger than a single sector, you can certainly end up with a page that's partially written.
That, however, mostly applies to a direct connection to a normal moving-platter type hard drive. With almost anything else, the rules can and often will be different. Just for an obvious example, if you're writing over the network, you're mostly at the mercy of the network protocol in use. If you transmit data over TCP, data that doesn't match its checksum will be rejected, but the same data transmitted over UDP with the checksum disabled (it is optional over IPv4), with the same corruption, might be accepted.
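For reference, the check that TCP and UDP use is the 16-bit ones'-complement Internet checksum described in RFC 1071, which is weaker than a true CRC. A short sketch (the function name and sample bytes are made up for illustration) shows one class of corruption it catches and one it cannot:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum of 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                   # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                    # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

good = b"\x12\x34\x56\x78"
flipped = b"\x12\x35\x56\x78"   # single bit flip: checksum changes
swapped = b"\x56\x78\x12\x34"   # two 16-bit words reordered: checksum is identical
```

Because the sum is order-independent, reordered words pass unnoticed, which is one reason not to rely on the transport checksum alone for data integrity.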
I suspect this assumption is wrong.
Modern HDDs encode the data in sectors and additionally protect it with ECC. Therefore you can end up with the entire sector content turned into garbage: it will simply not make sense under the encoding used.
As for increasingly popular SSDs, the situation is even more gruesome: the block is cleared prior to being overwritten, so, depending on the firmware being used and the amount of free space, entirely unrelated sectors can be damaged.
By the way, an OS crash will not lead to data being damaged within a single sector.
I would expect one torn page to consist of part X, part Y, and part unreadable sector. If a head is in the middle of writing a sector when the power fails, the drive should park the heads immediately, so that the rest of the drive (aside from that one sector) will remain undamaged.
In some cases I would expect several torn pages consisting of part X and part Y, but only one torn page would include an unreadable sector. The reason for several torn pages is that the drive can buffer lots of writes internally, and the order of writing might interleave various sectors from various pages.
I've read conflicting stories about whether a new write to the unreadable sector will make it readable again. Even if the answer is yes, that will be new data Z, neither X nor Y.
when updating the disk, the only guarantee drive manufacturers make is that a single 512-byte write is atomic (i.e., it will either complete in its entirety or it won’t complete at all); thus, if an untimely power loss occurs, only a portion of a larger write may complete (sometimes called a torn write).
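A tiny simulation of that guarantee: if each 512-byte sector is all-or-nothing, a crash during a larger write leaves every sector either wholly old or wholly new, but the write as a whole can be a mix. This sketch assumes sectors land in order, which is the simplest case (real drives may reorder them), and the function names are hypothetical.

```python
SECTOR = 512

def crash_during_write(old: bytes, new: bytes, sectors_done: int) -> bytes:
    """Model a multi-sector write where only the first `sectors_done` sectors landed."""
    cut = sectors_done * SECTOR
    return new[:cut] + old[cut:]

def sector_states(image: bytes, new: bytes) -> list:
    """Label each sector of `image` as 'new' or 'old' by comparing it with `new`."""
    return [
        "new" if image[i:i + SECTOR] == new[i:i + SECTOR] else "old"
        for i in range(0, len(image), SECTOR)
    ]

old = b"X" * (4 * SECTOR)
new = b"Y" * (4 * SECTOR)

torn = crash_during_write(old, new, sectors_done=3)
states = sector_states(torn, new)   # ['new', 'new', 'new', 'old']
```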