Is there a file system with a low level prepend operation? - file-io

At the lowest level, most OS file APIs include open, close, read, write, delete, seek, and append operations, yet there is no prepend operation.
The question came up because a colleague of mine was working with a large (multi-gigabyte) data log he had generated, and he realized he had not written a file header to the log file. Even though he only needed to add a hundred bytes to the front of the file, we couldn't see any way to do that without getting down to the block/sector file-allocation-table level.
Is there any historical or technical reason that a prepend operation does not exist, or that it would be more expensive than the analogous append operation?
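The usual portable fallback is a full rewrite, which is exactly what becomes painful at multi-gigabyte sizes: stream the old file into a new one behind the header, then rename over the original. A minimal C sketch, with hypothetical file names and header contents:

    /* No prepend primitive exists, so rewrite the whole file:
     * write the header first, then stream the old contents after it. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char header[] = "LOGv1: hundred-byte header...\n"; /* assumed */
        FILE *in  = fopen("data.log", "rb");      /* hypothetical names */
        FILE *out = fopen("data.log.tmp", "wb");
        if (!in || !out) { perror("fopen"); return 1; }

        fwrite(header, 1, strlen(header), out);

        char buf[1 << 16];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);

        fclose(in);
        fclose(out);
        /* Atomically replace the original (POSIX rename semantics). */
        rename("data.log.tmp", "data.log");
        return 0;
    }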

I am only aware of a single research paper describing something like this: "Supporting Insertions and Deletions in Striped Parallel Filesystems" from 1992.
The abstract is:
The dramatic improvements in the processing rates of parallel computers are turning many compute-bound jobs into IO-bound jobs. Parallel file systems have been proposed to better match IO throughput to processing power. Many parallel file systems stripe files across numerous disks; each disk has its own controller. A striped file can be appended (or prepended) to and maintain its structure. However, a block can't be inserted into or deleted from the middle of the file, since doing so would destroy the regular striping structure of the file. In this paper, we present a distributed file structure that maintains files in indexed striped extents on a message passing multiprocessor. This approach allows highly parallel random and sequential reads, and also allows insertion and deletion into the middle of the file.
You can find more information in the paper.

Distributed file system as a distributed buffer system?

The Problem
I've been developing an application which needs to support reads on a data object asynchronously with appending writes. In other words, a buffer. There will be many data objects at any given time.
I've been researching the available distributed file systems to find one that supports reading a file while it is being written to, but my search has come up with nothing. I know from experience that Amazon S3 does not support this, while I am unsure about others such as HadoopDFS.
Solution: Chunking?
I have thought of chunking the data as a solution, which would involve splitting the incoming writes into n-byte chunks, each written to the DFS as a whole file. Chunks that are no longer needed can be deleted without interfering with the new data being written, since they are separate files on the DFS.
The problem with this strategy is that it would result in pauses whenever a buffer reader consumes data faster than the buffer writer produces it. Smaller chunks would mitigate this effect, but not eliminate it.
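A minimal C sketch of the chunking idea, using local files to stand in for DFS objects; the chunk size, naming scheme, and stdin-as-source are all assumptions:

    #include <stdio.h>

    #define CHUNK_BYTES (4L * 1024 * 1024)   /* assumed chunk size n = 4 MiB */

    int main(void) {
        char buf[1 << 16];
        size_t n;
        long written = CHUNK_BYTES;           /* force a new chunk immediately */
        int chunk_no = 0;
        FILE *out = NULL;

        /* Roll over to a new "DFS object" roughly every CHUNK_BYTES;
         * finished chunks are immutable and can be read or deleted
         * independently of the chunk still being written. */
        while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
            if (written >= CHUNK_BYTES) {
                if (out) fclose(out);         /* chunk complete: readers may consume it */
                char name[64];
                snprintf(name, sizeof name, "object-chunk-%06d", chunk_no++);
                out = fopen(name, "wb");
                if (!out) { perror("fopen"); return 1; }
                written = 0;
            }
            fwrite(buf, 1, n, out);
            written += (long)n;
        }
        if (out) fclose(out);
        return 0;
    }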
Summarized Questions
Does a DFS exist which supports reading/writing an object as a buffer?
If not, is chunking data on the DFS the best way to simulate a buffer?
Lustre (lustre.org) supports concurrent coherent reads and writes, including O_APPEND writes. Concurrency is controlled by the Lustre distributed lock manager (LDLM), which grants extent locks to clients and revokes them on conflict. Multiple readers and multiple concurrent writers to the same file are supported, and they will see consistent file data, much as with a local file system.

File IO for MPI-FORTRAN

I have a FORTRAN MPI code to solve a flow field.
At the start, I want to read data from a file and distribute it to the participating processes.
The data consists of several 3-D arrays (velocities in space: x, y, z).
Every process stores only a part of the array.
So if every process reads the file (the easiest way, I think) it is not going to work, as each process will only store the first part of the file, corresponding to the portion of the arrays that the process can hold.
Can MPI_Bcast work for 3-D arrays? But then things become complex.
Or is there an easier way?
You have, broadly speaking, 2 or 3 choices, depending on your platform.
One process reads the input data and sends (parts of) it to the other processes (a sketch of this follows the three choices below). I wouldn't usually use broadcast for this since it is a collective operation and all processes have to take part. I'd usually just send the necessary information to each process. If it is convenient (and not a memory issue) you could certainly broadcast all the input data to all the processes; it's just not a pattern of operation that I use or see much.
All processes read the data that they require. This may involve a process reading an entire input file and only storing those parts it requires. But if you have very large input files you can write routines to read only the necessary part into each process's memory space. This approach may involve processes competing for disk access, which is only slow in a relative sense: if you are running large-scale and long-running parallel computations waiting a few seconds while all the processes get their data is not much of an overhead.
If you have a parallel file system then you can use MPI's parallel I/O routines so that each process reads only those parts of the input data that it requires.
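As an illustration of the first choice, here is a hedged C sketch in which rank 0 reads a flat binary array and sends each rank its slice point-to-point. The file name, element count, and even divisibility are all assumptions; a real flow solver would carve out 3-D sub-blocks rather than 1-D slices:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int ntotal = 1 << 20;            /* assumed total element count */
        int nlocal = ntotal / nprocs;          /* assume it divides evenly */
        double *local = malloc(nlocal * sizeof(double));

        if (rank == 0) {
            double *all = malloc((size_t)ntotal * sizeof(double));
            FILE *f = fopen("field.bin", "rb"); /* hypothetical input file */
            if (!f) MPI_Abort(MPI_COMM_WORLD, 1);
            fread(all, sizeof(double), ntotal, f);
            fclose(f);

            /* Keep rank 0's own slice, send everyone else theirs. */
            for (int i = 0; i < nlocal; i++) local[i] = all[i];
            for (int r = 1; r < nprocs; r++)
                MPI_Send(all + (long)r * nlocal, nlocal, MPI_DOUBLE,
                         r, 0, MPI_COMM_WORLD);
            free(all);
        } else {
            MPI_Recv(local, nlocal, MPI_DOUBLE, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* ... each rank now computes on its nlocal values ... */
        free(local);
        MPI_Finalize();
        return 0;
    }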
The canonical way of such an I/O pattern in MPI is either to
Read the data on rank 0, then use MPI_Scatter to distribute it. Or, if memory is tight, do this blockwise, or use point-to-point communication rather than MPI_Scatter.
Use MPI-I/O, and have each rank read its own subset of the data file (to be useful, this of course requires a file format where you can figure out the boundaries without first reading through the entire file).
For extreme scalability, one can combine the two approaches: a subset of processes (say, sqrt(N) as a rough rule of thumb) uses MPI I/O, and each MPI process sends its data to its own I/O process.
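A matching sketch of the MPI-I/O route, again assuming a raw, headerless binary file of doubles that divides evenly among the ranks:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_File fh;
        if (MPI_File_open(MPI_COMM_WORLD, "field.bin", /* hypothetical file */
                          MPI_MODE_RDONLY, MPI_INFO_NULL, &fh) != MPI_SUCCESS)
            MPI_Abort(MPI_COMM_WORLD, 1);

        MPI_Offset fsize;
        MPI_File_get_size(fh, &fsize);
        MPI_Offset nelems = fsize / (MPI_Offset)sizeof(double);
        MPI_Offset nlocal = nelems / nprocs;        /* assume even split */
        MPI_Offset offset = rank * nlocal * (MPI_Offset)sizeof(double);

        double *local = malloc((size_t)nlocal * sizeof(double));
        /* Collective read: each rank reads from its own offset. */
        MPI_File_read_at_all(fh, offset, local, (int)nlocal,
                             MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(local);
        MPI_Finalize();
        return 0;
    }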
If you are running your code on fewer than 1000 cores with a good file system (e.g. Lustre), then just use Fortran I/O, where each rank opens the file and reads the data it needs (skipping the rest). Yes, it takes a few minutes, but you're only reading the file once, at startup.
MPI I/O (binary only) is non-trivial, and you are usually better off using higher-level libraries such as HDF5 or Parallel NetCDF. Performance will depend on how the data is read (contiguous vs. non-contiguous, and so on). The following links may be helpful:
http://www.osc.edu/supercomputing/training/pario/parallel-io-nov04.pdf
https://support.scinet.utoronto.ca/wiki/images/0/01/Parallel_io_course.pdf

Why doesn't Hadoop file system support random I/O?

Distributed file systems like Google File System and Hadoop's HDFS don't support random I/O.
(Files that were written before can't be modified. Only writing and appending are possible.)
Why did they design the file system like this?
What are the important advantages of the design?
P.S. I know Hadoop will support modifying data that has already been written.
But they say its performance will not be very good. Why?
Hadoop distributes and replicates files. Since the files are replicated, any write operation is going to have to find each replicated section across the network and update the file. This will heavily increase the time for the operation. Updating the file could push it over the block size and require the file to be split into two blocks, and then the second block would have to be replicated. I don't know the internals and when/how it would split a block... but it's a potential complication.
What if the job failed or got killed after it had already done an update, and then gets re-run? It could update the file multiple times.
The advantage of not updating files in a distributed system is that, when you update a file, you don't know who else is using it, and you don't know where all the pieces are stored. There are potential timeouts (a node holding the block is unresponsive), so you might end up with mismatched data (again, I don't know the internals of Hadoop, and an update with a node down might be handled; it's just something I'm brainstorming).
There are a lot of potential issues (a few laid out above) with updating files on the HDFS. None of them are insurmountable, but they will require a performance hit to check and account for.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
I think it's because of the block size of the data, and because the whole idea of Hadoop is that you don't move data around; instead you move the algorithm to the data.
Hadoop is designed for non-realtime batch processing of data. If you're looking at ways of implementing something more like a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.

Does it consume CPU when reading a large file

Suppose I want to do the following operations on my 2-core machine:
Read a very large file
Compute
Does the file-reading operation need to consume one core? Previously I just created two threads, one to read the file and one to compute. Should I create an additional thread to do the computing?
Thanks.
Edit
Thanks guys. Yes, we should always consider whether the file I/O blocks the computing. Now let's just assume that the file I/O will never block the computing; you can assume the computation doesn't depend on the file's data, and we just read the file in for future processing. Given that we have 2 cores, we need to read in a file, and we need to do computing: is the best solution to create 3 threads, 1 for file reading and 2 for computing, since, as most of you have already pointed out, file reading consumes very little CPU?
It depends on how your hardware is configured. Normally, reading is not CPU-intensive, thanks to DMA. It may be very expensive though, if it initiates swap-out of other applications. But there is more to it.
Don't read a huge file at once if you can
If your file is really big, you should use mmap or sequential processing when you don't need to read the whole file at once. Try to consume it in chunks if possible.
For example, to sum all the values in a huge file, you don't need to load the file into memory. You can process it in small chunks, accumulating the sum. Memory is an expensive resource in most situations.
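For instance, a minimal C sketch of that chunked summation (the file name and a raw stream of doubles are assumptions):

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("big.dat", "rb");   /* hypothetical file */
        if (!f) { perror("fopen"); return 1; }

        double buf[4096];
        double sum = 0.0;
        size_t n;
        /* Read up to 4096 doubles at a time; fread returns 0 at EOF,
         * so memory use stays constant no matter how big the file is. */
        while ((n = fread(buf, sizeof(double), 4096, f)) > 0)
            for (size_t i = 0; i < n; i++)
                sum += buf[i];

        fclose(f);
        printf("sum = %f\n", sum);
        return 0;
    }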
Reading is sequential
Does the file reading operation need to consume 1 core?
Yes, I think most low-level read operations are implemented sequentially (consume 1 core).
You can avoid blocking on a read operation if you use asynchronous I/O, but that is just a variation of the same "read in small chunks" technique. You can launch several small asynchronous read operations at once, but you always have to check whether an operation has finished before you use the result.
See also this Stack Overflow answer to a related question.
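A hedged C sketch of that asynchronous variant using POSIX AIO (link with -lrt on Linux; the file name is hypothetical): start the read, do other work, then check for completion before touching the buffer.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 16)

    int main(void) {
        int fd = open("big.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[CHUNK];
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        /* ... do other computation here while the read is in flight ... */

        while (aio_error(&cb) == EINPROGRESS)
            ;                                /* or block in aio_suspend() */

        ssize_t got = aio_return(&cb);       /* bytes actually read */
        printf("read %zd bytes asynchronously\n", got);
        close(fd);
        return 0;
    }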
Reading and computing in parallel
Previously I just create 2 threads, one to read file and one to compute? Should I create an additional thread to do compute?
It depends. If you need all the data to start the computation, then there is no reason to start the computation in parallel; it will effectively have to wait until reading is done.
If you can start computing even with partial data, you likely don't need to read the whole file at once. And it is usually much better not to do so with huge files.
What is your bottleneck: computation or I/O?
Finally, you should know whether your task is computation-bound or input-output-bound. If it is limited by the performance of the input-output subsystem, there is little benefit in parallelizing the computation. If the computation is very CPU-intensive and the reading time is negligible, you can benefit from parallelizing the computation. Input-output is usually the bottleneck unless you are doing some number-crunching.
This is a good candidate for parallelization, because you have two types of operations here: disk I/O (for reading the file) and CPU load (for your computations). So the first step would be to write your application such that the file I/O doesn't block the computation. You could do this by reading a little bit at a time from the file and handing it off to the compute thread, as in the sketch below.
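A minimal pthreads sketch of that hand-off, assuming a single-slot buffer, a hypothetical input file, and a checksum as a stand-in computation; a real application would use a ring of buffers so I/O and compute overlap more:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define CHUNK 4096

    static char slot[CHUNK];
    static size_t slot_len;                    /* 0 means "slot empty" */
    static int done;
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    static void *reader(void *arg) {           /* I/O thread */
        FILE *f = fopen((const char *)arg, "rb");
        char buf[CHUNK];
        size_t n;
        while (f && (n = fread(buf, 1, CHUNK, f)) > 0) {
            pthread_mutex_lock(&mu);
            while (slot_len > 0)               /* wait until slot is empty */
                pthread_cond_wait(&cv, &mu);
            memcpy(slot, buf, n);
            slot_len = n;
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&mu);
        }
        pthread_mutex_lock(&mu);
        done = 1;                              /* signal end of input */
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mu);
        if (f) fclose(f);
        return NULL;
    }

    static void *computer(void *arg) {         /* compute thread */
        (void)arg;
        char local[CHUNK];
        unsigned long checksum = 0;
        for (;;) {
            pthread_mutex_lock(&mu);
            while (slot_len == 0 && !done)
                pthread_cond_wait(&cv, &mu);
            if (slot_len == 0 && done) {       /* no more data coming */
                pthread_mutex_unlock(&mu);
                break;
            }
            size_t n = slot_len;
            memcpy(local, slot, n);            /* copy out so I/O can proceed */
            slot_len = 0;
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&mu);
            for (size_t i = 0; i < n; i++)     /* stand-in "computation" */
                checksum += (unsigned char)local[i];
        }
        printf("checksum: %lu\n", checksum);
        return NULL;
    }

    int main(void) {
        pthread_t r, c;
        pthread_create(&r, NULL, reader, (void *)"big.dat"); /* hypothetical */
        pthread_create(&c, NULL, computer, NULL);
        pthread_join(r, NULL);
        pthread_join(c, NULL);
        return 0;
    }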
But now you're saying you have two cores that you want to utilize. Your second thought about parallelizing the CPU-intensive part is correct, because we can only parallelize compute tasks if we have more than one processor to use. But, it might be the case that the blocking part of your application is still the file I/O - that depends on a lot of factors, and the only way to tell what level of parallelization is appropriate is to benchmark.
SO required caveat: multithreading is hard and error-prone, and it's better to have correct code than fast code, if you can pick only one. But I don't advocate against threads, as you may find from others on the site.
I would think this depends on the computation you are performing. If you are doing very heavy computations, then I would suggest threading the application. Reading a file demands very little from your CPU, and because of this, the overhead created by threading the application might slow it down.
Another thing to consider is whether you need to load the entire file before you can compute. If so, there is no point in threading at all, as you will have to complete one action before you can perform the other.

Why would two mysql files (same table, same contents) be different in size?

I took an existing MySQL database, and set up a copy on a new host.
The file size for some tables on the new host are 1-3% smaller than their counterpart files on the old host.
I am curious why that is.
My guess is that the old host's files have grown over time, and within the b-tree structure of each file there is more fragmentation, whereas the new host, because it created the files from scratch (via a binary log), avoided such fragmentation.
Does it even make sense for there to be fragmentation within the b-tree structure itself? (Speaking within the database layer, not the OS file-system layer.) I originally thought "no", but then again, isn't such fragmentation the basis for the DBA task of compacting your database files?
I'm wondering whether this is simply an artifact of the file-system layer, i.e. the new host has a mostly empty disk drive, hence less fragmentation would result when allocating a new file. Then again, I didn't think that kind of fragmentation would show up in the reported file size (Linux).
There can certainly be fragmentation in MySQL data files or index files. This is common, even deliberate.
That is, the storage engine may deliberately leave some extra space here and there so when you change values, it can fit the rows in without having to reorder the whole data file. There are even server properties you can use to configure how much of this slop space to allocate.
I wouldn't even blink at a file discrepancy of 1-3%.
From what I understand of MySQL, it has a growth algorithm that allocates extra space as a file approaches capacity; when the copy was set up on the new host, it chose a different size, probably trimming the excess storage.