Why doesn't Hadoop file system support random I/O? - file-io

The distributed file systems which like Google File System and Hadoop doesn't support random I/O.
(It can't modify the file which were written before. Only writing and appending is possible.)
Why did they design file system like this?
What are the important advantages of the design?
P.S I know Hadoop will support modifing the data which were written.
But they said, it's performance will very not good. Why?

Hadoop distributes and replicates files. Since the files are replicated, any write operation is going to have to find each replicated section across the network and update the file. This will heavily increase the time for the operation. Updating the file could push it over the block size and require the file split into 2 blocks, and then replicating the 2nd block. I don't know the internals and when/how it would split a block... but it's a potential complication.
What if the job failed or got killed which already did an update and gets re-run? It could update the file multiple times.
The advantage of not updating files in a distributed system is that you don't know who else is using the file when you update it, you don't know where the pieces are stored. There are potential time outs (node with the block is unresponsive) so you might end up with mismatched data (again, I don't know the internals of hadoop and an update with a node down might be handled, just something I'm brainstorming)
There are a lot of potential issues (a few laid out above) with updating files on the HDFS. None of them are insurmountable, but they will require a performance hit to check and account for.
Since the HDFS's main purpose is to store data for use in mapreduce, row level update isn't that important at this stage.

I think it's because of the block size of the data and the whole idea of Hadoop is that you don't move data around but instead you move the algorithm to the data.
Hadoop is designed for non-realtime batch processing of data. If you're looking at ways of implementing something more like a traditional RDBMS in terms of response time and random access have a look at HBase which is built on top of Hadoop.

Related

Object storage for a web application

I am currently working on a website where, roughly 40 million documents and images should be served to it's users. I need suggestions on which method is the most suitable for storing content with subject to these requirements.
System should be highly available, scale-able and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd party object storage providers such as Amazon S3 and CDNs are not suitable.
File size of content can vary from 1 MB to 30 MB. (However about 90% of the files would be less than 2 MB)
Content retrieval latency is not much of a problem. Therefore indexing or caching is not very important.
I did some research and found out about the following solutions;
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for last two years, the work is still in progress. However, requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes fall in range from several bytes to several megabytes, and, the most important one, the deduplication, which should be implemented both on the document and block levels. If two different users upload the same file to the storage, the only copy of the file should be kept. Also if two different files partially intersect with each other, it's necessary to store the only copy of the common part of these files.
But let's focus on your requirements, so deduplication is not the case. First of all, high availability implies replication. You'll have to store your file in several replicas (typically 2 or 3, but there are techniques to decrease data parity) on independent machines in order to stay alive in case if one of the storage servers in your backend dies. Also, taking into account the estimation of the data amount, it's clear that all your data just won't fit into a single server, so vertical scaling is not possible and you have to consider partitioning. Finally, you need to take into account concurrency control to avoid race conditions when two different clients are trying to write or update the same data simultaneously. This topic is close to the concept of transactions (I don't mean ACID literally, but something close). So, to summarize, these facts mean that you're are actually looking for distributed database designed to store BLOBs.
On of the biggest problems in distributed systems is difficulties with global state of the system. In brief, there are two approaches:
Choose leader that will communicate with other peers and maintain global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that in this case leader becomes the single point of failure. If leader dies, either some observer must assign leader role to one of the replicas (common case for master-slave replication in RDBMS world), or remaining peers need to elect new one (algorithms like Paxos and Raft are designed to target this issue). Anyway, almost whole incoming system traffic goes through the leader. This leads to the "hot spots" in backend: the situation when CPU and IO costs are unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check etcd and consul limitations if you are interested).
Avoid global state at all. Weaken the guarantees to eventual consistency. Disable the update of files. If someone wants to edit the file, you need to save it as new file. Use the system which is organized as a peer-to-peer network. There is no peer in the cluster that keeps the full track of the system, so there is no single point of failure. This results in high write throughput and nice horizontal scalability.
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think it's a good option to store files in traditional RDBMS because they provide optimizations for structured data and strong consistency, and you don't need neither of this. Also you'll have difficulties with backups and scaling. People usually don't use RDBMS in this way.
Using GridFS to chunk and store content.
I'm not sure, but it looks like GridFS is built on the top of MongoDB. Again, this is document-oriented database designed to store JSONs, not BLOBs. Also MongoDB had problems with a cluster for many years. MongoDB passed Jepsen tests only in 2017. This may mean that MongoDB cluster is not mature yet. Make performance and stress tests, if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop object storage on your own. Consider all the problems I've mentioned above.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I used neither of these solutions, but HDFS looks like overkill, because you get dependent on Hadoop stack. Have no idea about GlusterFS performance. Always consider the design of distributed file systems. If they have some kind of dedicated "metadata" serves, treat it as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well-known outside of the russian part of the Internet, but it's mature and stable, and performance is perfect. It was developed at Yandex (russian search engine) and a lot of Yandex services (like Disk, Mail, Music, Picture hosting and so on) are built on the top of it. I used it in previous project, this may take some time for your ops to get into it, but it's worth it, if you're OK with GPL license.
Ceph. This is real object storage. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it. So get ready to a vendor lock. Also I heard that it have too complicated settings. Never used in production, so don't know about performance.
Minio. This is S3-compatible object storage, under active development at the moment. Never used it in production, but it seems to be well-designed.
You may also check wiki page with the full list of available solutions.
And the last point: I strongly recommend not to use OpenStack Swift (there are lot of reasons why, but first of all, Python is just not good for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and, "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored – e.g. using a database as an index; and (2) actually storing them, this being a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it "offhand" seems to me ...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things which you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss-over all of those messy filesystem details . . .

Storing large objects in Couchbase - best practice?

In my system, a user can upload very large files, which I need to store in Couchbase. I don't need such very large objects to persist in memory, but I want them to be always read/written from/to disk. These files are read-only (never modified). The user can upload them, delete them, download them, but never update them. For some technical constraints, my system cannot store those files in the file system, so they have to be stored into the database.
I've done some research and found an article[1] saying that storing large objects in a database is generally a bad idea, especially with Couchbase, but at the same time provides some advice: create a secondary bucket with a low RAM quota, tune up the value/full eviction policy. My concern is the limit of 20Mb mentioned by the author. My files would be much larger than that.
What's the best approach to follow to store large files into Couchbase without having them persist in memory? Is it possible to raise the limit of 20Mb in case? Shall I create a secondary bucket with a very low RAM quota and a full eviction policy?
[1]http://blog.couchbase.com/2016/january/large-objects-in-a-database
Generally, Couchbase engineers recommend that you not store large files in Couchbase. Instead, you can store the files on some file server (like AWS or Azure Blob or something) and instead store the meta-data about the files in Couchbase.
There's a couchbase blog posting that gives a pretty detailed breakdown of how to do what you want to do in Couchbase.
This is Java API specific but the general approach can work with any of the Couchbase SDKs, I'm actually in the midst of doing something pretty similar right now with the node SDK.
I can't speak for what couchbase engineers recommend but they've posted this blog entry detailing how to do it.
For large files, you'll certainly want to split into chunks. Do not attempt to store a big file all in one document. The approach I'm looking at is to chunk the data, and insert it under the file sha1 hash. So file "Foo.docx" would get split into say 4 chunks, which would be "sha1|0", "sha1|1" and so on, where sha1 is the hash of the document. This would also enable a setup where you can store the same file under many different names.
Tradeoffs -- if integration with Amazon S3 is an option for you, you might be better off with that. In general chunking data in a DB like what I describe is going to be more complicated to implement, and much slower, than using something like Amazon S3. But that has to be traded off other requirements, like whether or not you can keep sensitive files in S3, or whether you want to deal with maintaining a filesystem and the associated scaling of that.
So it depends on what your requirements are. If you want speed/performance, don't put your files in Couchbase -- but can you do it? Sure. I've done it myself, and the blog post above describes a separate way to do it.
There are all kinds of interesting extensions you might wish to implement, depending on your needs. For example, if you commonly store many different files with similar content, you might implement a blocking strategy that would allow single-store of many common segments, to save space. Other solutions like S3 will happily store copies of copies of copies of copies, and gleefully charge you huge amounts of money to do so.
EDIT as a follow-up, there's this other Couchbase post talking about why storing in the DB might not be a good idea. Reasonable things to consider - but again it depends on your application-specific requirements. "Use S3" I think would be generally good advice, but won't work for everyone.
MongoDB has an option to do this sort of thing, and it's supported in almost all drivers: GridFS. You could do something like GridFS in Couchbase, which is to make a metadata collection (bucket) and a chunk collection with fixed size blobs. GridFS allows you to change the blob size per file, but all blobs must be the same size. The filesize is stored in the metadata. A typical chunk size is 2048, and are restricted to powers of 2.
You don't need memory cache for files, you can queue up the chunks for download in your app server. You may want to try GridFS on Mongo first, and then see if you can adapt it to Couchbase, but there is always this: https://github.com/couchbaselabs/cbfs
This is the best practice: do not take couchbase database as the main database consider it as sync database because no matter how you chunk data into small pieces it will go above 20MB size which will hit you in long run, so having a strong database like MySQL in a middle will help to save those large data then use couchbase for realtime and sync only.

Is there a reverse-incremental backup solution with built-in redundancy (e.g. par2)?

I'm setting a home server primarily for backup use. I have about 90GB of personal data that must be backed up in the most reliable manner, while still preserving disk space. I want to have full file history so I can go back to any file at any particular date.
Full weekly backups are not an option because of the size of the data. Instead, I'm looking along the lines of an incremental backup solution. However, I'm aware that a single corruption in a set of incremental backups makes the entire series (beyond a point) unrecoverable. Thus simple incremental backups are not an option.
I've researched a number of solutions to the problem. First, I would use reverse-incremental backups so that the latest version of the files would have the least chance of loss (older files are not as important). Second, I want to protect both the increments and backup with some sort of redundancy. Par2 parity data seems perfect for the job. In short, I'm looking for a backup solution with the following requirements:
Reverse incremental (to save on disk space and prioritize the most recent backup)
File history (kind of a broader category including reverse incremental)
Par2 parity data on increments and backup data
Preserve metadata
Efficient with bandwidth (bandwidth saving; no copying the entire directory over for each increment). Most incremental backup solutions should work this way.
This would (I believe) ensure file integrity and relatively small backup sizes. I've looked at a number of backup solutions already but they have a number of problems:
Bacula - Simple normal incremental backups
bup - incremental and implements par2 but isn't reverse incremental and doesn't preserve metadata
duplicity - incremental, compressed, and encrypted but isn't reverse incremental
dar - incremental and par2 is easy to add, but isn't reverse incremental and no file history?
rdiff-backup - almost perfect for what I need but it doesn't have par2 support
So far I think that rdiff-backup seems like the best compromise but it doesn't support par2. I think I can add par2 support to backup increments easily enough since they aren't modified each backup but what about the rest of the files? I could generate par2 files recursively for all files in the backup but this would be slow and inefficient, and I'd have to worry about corruption during a backup and old par2 files. In particular, I couldn't tell the difference between a changed file and a corrupt file, and I don't know how to check for such errors or how they would affect the backup history. Does anyone know of any better solution? Is there a better approach to the issue?
Thanks for reading through my difficulties and for any input you can give me. Any help would be greatly appreciated.
http://www.timedicer.co.uk/index
Uses rdiff-backup as the engine. I've been looking at it, but that requires me to set up a "server" using linux or a virtual machine.
Personally, I use WinRAR to make pseudo-incremental backups (it actually makes a full backup of recent files) run daily by a scheduled task. It is similarly a "push" backup.
It's not a true incremental (or reverse-incremental) but it saves different versions of files based on when it was last updated. I mean, it saves the version for today, yesterday and the previous days, even if the file is identical. You can set the archive bit to save space, but I don't bother anymore as all I backup are small spreadsheets and documents.
RAR has it's own parity or recovery record that you can set in size or percentage. I use 1% (one percent).
It can preserve metadata, I personally skip the high resolution times.
It can be efficient since it compresses the files.
Then all I have to do is send the file to my backup. I have it copied to a different drive and to another computer in the network. No need for a true server, just a share. You can't do this for too many computers though as Windows workstations have a 10 connection limit.
So for my purpose, which may fit yours, backs up my files daily for files that have been updated in the last 7 days. Then I have another scheduled backup that backups files that have been updated in the last 90 days run once a month or every 30 days.
But I use Windows, so if you're actually setting up a Linux server, you might check out the Time Dicer.
Since nobody was able to answer my question, I'll write a few possible solutions I found while researching the topic. In short, I believe the best solution is rdiff-backup to a ZFS filesystem. Here's why:
ZFS checksums all blocks stored and can easily detect errors.
If you have ZFS set to mirror your data, it can recover the errors by copying from the good copy.
This takes up less space than full backups, even though the data is copied twice.
The odds of an error in both the original and mirror is tiny.
Personally I am not using this solution as ZFS is a little tricky to get working on Linux. Btrfs looks promising but hasn't been proven stable from years of use. Instead, I'm going with a cheaper option of simply checking hard drive SMART data. Hard drives should do some error checking/correcting themselves and by monitoring this data I can see if this process is working properly. It's not as good as additional filesystem parity but better than nothing.
A few more notes that might be interesting to people looking into reliable backup development:
par2 seems to be dated and buggy software. zfec seems like a much faster modern alternative. Discussion in bup occurred a while ago: https://groups.google.com/group/bup-list/browse_thread/thread/a61748557087ca07
It's safer to calculate parity data before even writing to disk. i.e. don't write to disk, read it, and then calculate parity data. Do it from ram, and check against the original for additional reliability. This might only be possible with zfec, since par2 is too slow.

Is there a file system with a low level prepend operation?

At the lowest levels most OS file operations include open, close, read, write, delete and seek and append operation, yet there is no prepend operation.
The question came up because a colleague of mine was working with a large (multi-gigabyte) data log he had generated and he realized he had not written the a file header to the log file. Even though he only needed to add a hundred bytes to the front of the file, we couldn't see any way to do that without getting into the block / sector file allocation table level stuff.
Is there any history or technical reason that a prepend operation does not exist, or would be more expensive then the similar append operation?
I am only aware of a single research paper describing something like this: "Supporting Insertions and Deletions in Striped Parallel Filesystems" from 1992.
The abstract is
The dramatic improvements in the processing rates of parallel computers are turning many compute-bound jobs into IO-bound jobs. Parallel file systems have been proposed to better match IO throughput to processing power. Many parallel file systems stripe files across numerous disks; each disk has its own controller. A striped file can be appended (or prepended) to and maintain its structure. However, a block can't be inserted into or deleted from the middle of the file, since doing so would destroy the regular striping structure of the file. In this paper, we present a distributed file structure that maintains files in indexed striped extents on a message passing multiprocessor. This approach allows highly parallel random and sequential reads, and also allows insertion and deletion into the middle of the file.
You can find more information in the paper.

Loading large amounts of data to an Oracle SQL Database

I was wondering if anyone had any experience with what I am about to embark on. I have several csv files which are all around a GB or so in size and I need to load them into a an oracle database. While most of my work after loading will be read-only I will need to load updates from time to time. Basically I just need a good tool for loading several rows of data at a time up to my db.
Here is what I have found so far:
I could use SQL Loader t do a lot of the work
I could use Bulk-Insert commands
Some sort of batch insert.
Using prepared statement somehow might be a good idea. I guess I was wondering what everyone thinks is the fastest way to get this insert done. Any tips?
I would be very surprised if you could roll your own utility that will outperform SQL*Loader Direct Path Loads. Oracle built this utility for exactly this purpose - the likelihood of building something more efficient is practically nil. There is also the Parallel Direct Path Load, which allows you to have multiple direct path load processes running concurrently.
From the manual:
Instead of filling a bind array buffer
and passing it to the Oracle database
with a SQL INSERT statement, a direct
path load uses the direct path API to
pass the data to be loaded to the load
engine in the server. The load engine
builds a column array structure from
the data passed to it.
The direct path load engine uses the
column array structure to format
Oracle data blocks and build index
keys. The newly formatted database
blocks are written directly to the
database (multiple blocks per I/O
request using asynchronous writes if
the host platform supports
asynchronous I/O).
Internally, multiple buffers are used
for the formatted blocks. While one
buffer is being filled, one or more
buffers are being written if
asynchronous I/O is available on the
host platform. Overlapping computation
with I/O increases load performance.
There are cases where Direct Path Load cannot be used.
With that amount of data, you'd better be sure of your backing store - the dbf disks' free space.
sqlldr is script drive, very efficient, generally more efficient than a sql script.
The only thing I wonder about is the magnitude of the data. I personally would consider several to many sqlldr processes and assign each one a subset of data and let the processes run in parallel.
You said you wanted to load a few records at a time? That may take a lot longer than you think. Did you mean a few files at a time?
You may be able to create an external table on the CSV files and load them in by SELECTing from the external table into another table. Whether this method will be quicker not sure however might be quicker in terms of messing around getting sql*loader to work especially when you have a criteria for UPDATEs.