I have a project with a large set of trees of data. I would absolutely love to keep them in the file system, but the trees are likely to get VERY deep. Many file systems claim no limit on how deeply directories can be nested, but they do have limits on path length, which effectively limits the nesting depth. On top of that, the filesystem APIs and drivers within an operating system sometimes impose even tighter constraints. So my question is: do you know of a setup (filesystem + operating system + API/library/language) that can support virtually unlimited nesting? I can't easily put a bound on how deep the nesting might get, but I'd say it would often reach the hundreds and, over the lifetime of the software, may reach the hundreds of thousands for certain branches. (And since I'm not psychic, I'd like it to support more than that.) The actual size of the data isn't a major concern and is well within the limits of every modern filesystem; it's just the nesting that is pushing the limits.
I'm aware of other ways to achieve this, but they all feel less elegant, since so much of what I want is already supported by the file system: storing files, user access control (with the same users the OS recognizes), absolute and relative addressing, mounting/symlinking to subtrees, etc. Also, a lot of tools that operate at the file system level would be very useful here.
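For what it's worth, on Linux the PATH_MAX limit applies to each single path string passed to a system call, not to the total depth of the tree, so descending one component at a time with dir_fd-relative calls sidesteps the path-length problem (though not any depth limits a particular filesystem may impose). A rough Python sketch of the idea, with a made-up root path and depth:

import os

def make_deep_tree(root, components):
    """Create a directory chain deeper than PATH_MAX by descending
    one component at a time with dir_fd-relative calls (Linux)."""
    os.makedirs(root, exist_ok=True)
    fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for name in components:
            try:
                os.mkdir(name, dir_fd=fd)
            except FileExistsError:
                pass
            child = os.open(name, os.O_RDONLY | os.O_DIRECTORY, dir_fd=fd)
            os.close(fd)
            fd = child
        return fd  # open files relative to this fd (dir_fd=...) to stay below PATH_MAX
    except Exception:
        os.close(fd)
        raise

# e.g. a 100,000-level chain of single-character directories under /tmp/trees
deep_fd = make_deep_tree("/tmp/trees", ["d"] * 100_000)
os.close(deep_fd)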
Related
I am currently working on a website that must serve roughly 40 million documents and images to its users. I need suggestions on which method is most suitable for storing the content, subject to these requirements.
The system should be highly available, scalable and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd party object storage providers such as Amazon S3 and CDNs are not suitable.
File size of content can vary from 1 MB to 30 MB. (However about 90% of the files would be less than 2 MB)
Content retrieval latency is not much of a problem. Therefore indexing or caching is not very important.
I did some research and found the following solutions:
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for the last two years; the work is still in progress. However, the requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes range from several bytes to several megabytes, and, most importantly, deduplication must be implemented at both the document and block levels. If two different users upload the same file to the storage, only one copy of the file should be kept. Likewise, if two different files partially overlap, only one copy of the common part should be stored.
But let's focus on your requirements, so deduplication is out of scope. First of all, high availability implies replication. You'll have to store each file in several replicas (typically 2 or 3, though there are techniques, such as erasure coding, that reduce the storage overhead) on independent machines in order to stay alive if one of the storage servers in your backend dies. Also, taking into account the estimated amount of data, it's clear that it won't all fit on a single server, so vertical scaling is not possible and you have to consider partitioning. Finally, you need to take concurrency control into account to avoid race conditions when two different clients try to write or update the same data simultaneously. This topic is close to the concept of transactions (not ACID literally, but something close). To summarize, these facts mean that you're actually looking for a distributed database designed to store BLOBs.
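To make the partitioning point concrete, here is a small, hypothetical sketch of consistent hashing, a common way such systems decide which storage nodes hold the replicas of a given blob (the node names and replica count are made up):

import bisect
import hashlib

class HashRing:
    """Very small consistent-hash ring: maps a blob key to N replica nodes."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (point, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def replicas(self, key, n=3):
        """Walk clockwise from the key's position, collecting n distinct nodes."""
        start = bisect.bisect(self.ring, (self._hash(key), ""))
        nodes = []
        for point, node in self.ring[start:] + self.ring[:start]:
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

ring = HashRing(["storage-1", "storage-2", "storage-3", "storage-4"])
print(ring.replicas("user42/report.pdf"))  # e.g. ['storage-3', 'storage-1', 'storage-4']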
One of the biggest problems in distributed systems is maintaining the global state of the system. In brief, there are two approaches:
Choose a leader that communicates with the other peers and maintains the global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that the leader becomes a single point of failure. If the leader dies, either some observer must assign the leader role to one of the replicas (the common case for master-slave replication in the RDBMS world), or the remaining peers need to elect a new one (algorithms like Paxos and Raft are designed to address this). Either way, almost all incoming traffic goes through the leader, which leads to "hot spots" in the backend: CPU and IO costs are unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check the etcd and consul limitations if you are interested).
Avoid global state altogether. Weaken the guarantees to eventual consistency. Disallow updating files: if someone wants to edit a file, it is saved as a new file. Use a system organized as a peer-to-peer network, where no peer keeps the full state of the cluster, so there is no single point of failure. This gives high write throughput and good horizontal scalability.
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think it's a good option to store files in a traditional RDBMS, because an RDBMS is optimized for structured data and strong consistency, and you need neither of those. You'll also have difficulties with backups and scaling. People usually don't use an RDBMS this way.
Using GridFS to chunk and store content.
I'm not sure, but it looks like GridFS is built on top of MongoDB. Again, this is a document-oriented database designed to store JSON documents, not BLOBs. Also, MongoDB had clustering problems for many years and passed the Jepsen tests only in 2017, which may mean that MongoDB clustering is not mature yet. Run performance and stress tests if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop an object storage system on your own, so consider all the problems I've mentioned above.
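To make this option concrete, here is a hedged sketch of one common layout: derive a sharded on-disk path from a content hash so that no single directory accumulates millions of files (the root directory and two-level fan-out are arbitrary illustrative choices, not a standard):

import hashlib
import os

STORAGE_ROOT = "/srv/objects"  # hypothetical mount point

def object_path(data: bytes) -> str:
    """Derive a path like /srv/objects/ab/cd/abcdef... from the content hash."""
    digest = hashlib.sha256(data).hexdigest()
    return os.path.join(STORAGE_ROOT, digest[:2], digest[2:4], digest)

def store(data: bytes) -> str:
    path = object_path(data)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path  # save this path (plus metadata) in the database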
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I have used neither of these solutions, but HDFS looks like overkill, because you become dependent on the Hadoop stack. I have no idea about GlusterFS performance. Always consider the design of a distributed file system: if it has some kind of dedicated "metadata" server, treat that as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well known outside the Russian part of the Internet, but it's mature and stable, and its performance is excellent. It was developed at Yandex (the Russian search engine) and a lot of Yandex services (Disk, Mail, Music, picture hosting and so on) are built on top of it. I used it in a previous project; it may take your ops team some time to get into it, but it's worth it if you're OK with the GPL license.
Ceph. This is a real object storage system. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it, so get ready for vendor lock-in. I've also heard that its configuration is quite complicated. I've never used it in production, so I can't speak to its performance.
Minio. This is an S3-compatible object storage system, under active development at the moment. I've never used it in production, but it seems to be well designed.
You may also check the wiki page with the full list of available solutions.
And a last point: I strongly recommend not using OpenStack Swift (there are a lot of reasons why, but first of all, Python is just not well suited for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored, e.g. using a database as an index; and (2) actually storing them, a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it seems to me, offhand) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss over all of those messy filesystem details.
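As a rough sketch of that two-stage split (the table layout and directory scheme below are made up purely for illustration):

import os
import sqlite3
import uuid

STORAGE_ROOT = "/srv/content"  # hypothetical filesystem root

db = sqlite3.connect("catalog.db")
db.execute("""CREATE TABLE IF NOT EXISTS documents (
                  id TEXT PRIMARY KEY,   -- logical identity, as the user sees it
                  title TEXT,
                  owner TEXT,
                  disk_path TEXT         -- physical location, as the filesystem sees it
              )""")

def save_document(title: str, owner: str, data: bytes) -> str:
    doc_id = uuid.uuid4().hex
    # stage 2: delegate the bytes to the filesystem
    path = os.path.join(STORAGE_ROOT, doc_id[:2], doc_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    # stage 1: record the logical object in the database index
    db.execute("INSERT INTO documents VALUES (?, ?, ?, ?)",
               (doc_id, title, owner, path))
    db.commit()
    return doc_id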
When should I store a file directly in the database?
In what situations?
I know that I can just store the path to the image in the database.
In addition to the higher cost already mentioned, several other factors must be taken into account:
Data Volume: For a low volume of data there may be no problem. On the other hand, for mass storage of data the database is practically infeasible.
Clustering: One advantage of the database is that if your system runs on multiple servers, all of them have uniform access to the files.
Scalability: If demand for volume or availability increases, can you add more capacity to the system? It is much easier to split files between different servers than to distribute records from one table to more servers.
Flexibility: Backing up, moving files from one server to another, doing some processing on the stored files, all this is easier if the files are in a directory.
There are several strategies for scaling a system in terms of both availability and volume. Basically these strategies consist of distributing the data across several different servers and redirecting each user to one of them according to some criterion. The implementation details vary: data update strategy, redundancy, distribution criteria, and so on.
One of the great difficulties in managing files outside the database is that we now have two distinct data sources that need to be kept in sync at all times.
From the security point of view, there is actually little difference. If a hacker can compromise a server, they can read both the files your system writes to disk and the files of the database system. If this concern is critical, an alternative is to store the data encrypted.
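If you do go the encryption route, here is a minimal sketch using a symmetric key; it assumes the Python cryptography package, and real key management (which is the hard part) is out of scope:

from cryptography.fernet import Fernet

# In practice the key would come from a secrets manager, not be generated inline.
key = Fernet.generate_key()
fernet = Fernet(key)

def write_encrypted(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(fernet.encrypt(data))

def read_encrypted(path: str) -> bytes:
    with open(path, "rb") as f:
        return fernet.decrypt(f.read())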
I also convert my images to byte arrays and store them in a SQL Server database, but in the long run, I am sure that someone will tell you that you should only save the (server) path of the image.
The biggest disadvantage of storing them as binary, I think, is this:
Retrieving images from the database is significantly more expensive than using the file system.
Problem
Currently I have an elasticsearch cluster that is running out of file descriptors. Checking the elasticsearch setup documentation page, I saw that it is recommended to set the number of file descriptors on the machine to 32K or even 64K, and digging a bit through search results I found some people who set this limit even higher (128K or unlimited).
The exception I'm getting is quite common to the exhaustion of file descriptors:
Caused by: org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a NativeFSLock which is held by another indexer component
Question
Is there an equation for the number of file descriptors we should expect elasticsearch / lucene to require, based on the number of Indexes, Shards, Replicas and / or Documents? Or even for the number of files across all elasticsearch indexes?
I wouldn't like to set it by trial and error, and an unlimited number of file descriptors isn't possible in my situation.
I know very little about elasticsearch but will try to answer this from a Lucene perspective.
I'm afraid there's no easy way of finding out how many descriptors you really need.
First, this depends on Directory implementation (which itself depends on the underlying OS, if you use FSDirectory.open(File)).
Secondly, it also depends on your merge policy (which may depend on Lucene version, unless elasticsearch overrides it).
Finally, it can even depend on various exotic circumstances, such as garbage collection behaviour (if certain bits depend on finalizers to free resources). We even had an instance of Lucene which was leaking file descriptors until we manually switched -d64 mode on.
That said, I would recommend setting up a monitoring script that gathers some stats over a week or so and coming up with a range that fits your typical usage. Add some headroom for unexpected cases.
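A minimal sketch of such a monitoring script on Linux, counting the open descriptors of the elasticsearch JVM via /proc (the pgrep pattern and sampling interval are assumptions; adjust for your setup):

import os
import subprocess
import time

def es_pid() -> int:
    # Assumes a single elasticsearch JVM; pgrep -f is one way to find it.
    return int(subprocess.check_output(["pgrep", "-f", "org.elasticsearch.bootstrap"]).split()[0])

def open_fds(pid: int) -> int:
    return len(os.listdir(f"/proc/{pid}/fd"))

if __name__ == "__main__":
    pid = es_pid()
    while True:
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} open_fds={open_fds(pid)}", flush=True)
        time.sleep(300)  # sample every 5 minutes; redirect output to a log file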
P.S. I am struggling to imagine a case these days where file descriptors would be a genuine problem. Is this a C10K problem? Can you elaborate on this?
My main aim is to serve a large number of XML files (>1bn, each <1 KB) via a web server. The files can be considered static, as they will be modified by external code at a relatively low frequency (about 50k updates per day). Files will be requested at a high frequency (>30 req/sec).
The current suggestion from my team is to create a dedicated Java application that implements the HTTP protocol and uses memcached to speed things up, keeping all file data in an RDBMS and getting rid of the file system.
On the other hand, I think a tweaked Apache Web Server or lighttpd should be enough. Caching can be left to the OS or the web server's default caching. There is no point in keeping the data in a DB if the same output is required and it is only ever queried by file name. I'm not sure how memcached would work here. Also, updating an external cache (memcached) whenever a file is updated by external code would add complexity.
Another question: if I choose to use files, is it possible to store them in directories like \a\b\c\d.xml and access them via abcd.xml? Or should I put all 1bn files in a single directory (I'm not sure whether the OS will allow that)?
This is NOT a website, but an API for an application on a closed network, so Cloud/CDN is of no use.
I am planning to use CentOS + Apache/lighttpd. Please suggest any alternative and the best possible solution.
This is the only public note I have found on this topic, and it is a little old too.
1bn files at 1KB each, that's about 1TB of data. Impressive. So it won't fit into memory unless you have very expensive hardware. It can even be a problem on disk if your file system wastes a lot of space for small files.
30 requests a second is far less impressive. It's certainly not the limiting factor for the network nor for any serious web server out there. It might be a little challenge for a slow harddisk.
So my advice is: Put the XML files on a hard disk and serve them with a plain vanilla web server of your choice. Then measure the throughput and optimize it, if you don't reach 50 files a second. But don't invest into anything unless you have shown it to be a limiting factor.
Possible optimizations are:
Find a better layout in the file system, i.e. distribute your files over enough directories so that you don't have too many files (more than 5,000) in a single directory.
Distribute the files over several hard disks so that they can be accessed in parallel
Use faster hard disks
Use solid state disks (SSD). They are expensive, but can easily serve hundreds of files a second.
If a large number of the files are requested several times a day, then even a slow hard disk should be enough, because your OS will have the files in the file cache. And with today's file cache sizes, a considerable amount of your daily deliveries will fit into the cache: at 30 requests a second you serve about 2.6 million requests a day, which is at most roughly 0.25% of all one billion files.
Regarding distributing your files over several directories, you can hide this with an Apache RewriteRule, e.g.:
RewriteRule ^/xml/(.)(.)(.)(.)(.*)\.xml /xml/$1/$2/$3/$4/$5.xml
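With that rule, for example, a request for /xml/abcd123.xml would be served from /xml/a/b/c/d/123.xml on disk, assuming the files have been laid out accordingly.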
Another thing you could look at is Pomegranate, which seems very similar to what you are trying to do.
I believe that a dedicated application with everything feeding off a memcache db would be the best bet.
If I have a site where users can upload as many images as they want (think Photobucket-like), what is the best way to set up file storage? (Also, all uploads get a unique random timestamp.)
site root
--username
----image1.jpg
----image2.jpg
----image3.jpg
--anotheruser
----image1.jpg
----image2.jpg
----image3.jpg
...
or
siteroot
--uploads
----image1.jpg
----image2.jpg
----image3.jpg
----image4.jpg
----image6.jpg
...
----image50000.jpg
I think the first method is more organized, but the second method (keeping all uploads in the same dir) seems to be the standard. I wonder whether retrieving an image would be slower if there are thousands of images in the same directory.
--- edit ---
Thanks for the great answers so far.
Also, I will be creating thumbnails, so I would also have to insert that directory somewhere... or create a naming convention such as thumb_whatever.jpg.
There are so many different ways to do this.
Yes, disk space will be a problem, but for now I am concerned with retrieval time. When I have to output an image to the browser, and that image sits in a directory with 10,000 other images, I am worried about how slow that could get.
The number of files in a directory should have no effect at all on the time required to read a file's data - but it can massively affect the amount of time needed to find the file before you can start to read it.
The exact breakpoints where the major issues start up will vary from filesystem type to filesystem type, but, in general, if you're talking about a few hundred files, you don't much need to worry about it. If you're talking about a few thousand, it's worth thinking about and maybe doing a little benchmarking to see how your filesystem and hardware handle it. If you're talking about tens of thousands of files, then you really need to start breaking things up. (I once had a Linux/e2fs print server where CUPS wasn't deleting its job control files after it finished printing and it got up around 100,000 files in one directory. Just getting a directory listing took over half an hour before it even started to display any filenames.)
Separating them by user name may not be the best choice, though, since you'll likely have a lot of users uploading very few images and perhaps a couple who upload hundreds or thousands of images, potentially creating access time issues in those users' storage directories. The bigger problem in that scenario is that you'd likely end up (assuming a successful site) with thousands or tens of thousands of users and a large number of subdirectories is just as bad as a large number of files for slowing down access to your data.
Since you're going to have a timestamp on them, what I would probably do is put them into subdirectories based on the last three digits of the timestamp. That will distribute the files relatively evenly across 1000 subdirectories and should keep the number of files in each directory reasonably small. (Using the first three digits would cause one directory to be filled before moving to the next instead of distributing them evenly.) If you're still ending up with too many files in each subdirectory (which would likely mean you're dealing with several million uploaded images), you could add a second level for the previous three digits, so upload-1234567890.jpg would end up at /567/890/upload-1234567890.jpg.
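A small sketch of that scheme, the two-level variant, assuming the timestamp is embedded in the file name as in the example above:

import os

UPLOAD_ROOT = "uploads"  # hypothetical root directory

def shard_path(filename: str) -> str:
    """upload-1234567890.jpg -> uploads/567/890/upload-1234567890.jpg
    using the last six digits of the timestamp, three per level."""
    stem = os.path.splitext(filename)[0]
    digits = stem.rsplit("-", 1)[1]  # "1234567890"
    return os.path.join(UPLOAD_ROOT, digits[-6:-3], digits[-3:], filename)

print(shard_path("upload-1234567890.jpg"))  # uploads/567/890/upload-1234567890.jpg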
The answer to that is "maybe". The file retrieval itself may be fine, but if you need to do any maintenance on the folder, it would be a huge headache as processes attempt to enumerate the directory listings.
What would improve the situation would be a number of sub directories under the images folder (or two levels, depending on how many images you're looking at storing), so you have a hierarchy like this:
siteroot
-- uploads
---- a
---- b
---- c
:
---- z
...and then store files based on their first letter (so all images with names starting 'a' go into the folder 'a'). You could use two or three letters instead (aa, ab, ac, ad ..., ba, bb, bc ..., zx, zy, zz), and possibly have a hierarchy under that as well, so files are split across a number of folders depending on the first four characters of the name.
If files are then assigned a random alpha-numeric name then this would ensure files are spread evenly across all the folders (given a large enough sample size).
You might want to consider a mix of your option (1) and splitting images over a hierarchy as I've described above. That would ensure that if a single user does upload lots of files, then you're covered. Similarly, if you're looking at a lot of user directories, the same principle applies to ensure you don't have 1,000,000 user directories under a single parent.
Try using MongoDB... it is a document database that can also store binary data. It's very fast and efficient, and supports sharding (distributing data over multiple machines) out of the box.
You really don't want folders upon folders full of files. Managing those folders takes forever, and changing the naming/partitioning scheme later is a nightmare. Furthermore, if you run out of disk space you have a problem. Also, for load balancing, having one hard disk full of files is not efficient.
I often use a schema like this:
uploads/(#id%1000)/img_#id.jpg
where #id is, of course, the id number (integer) of the photo stored in the database. This provides a simple scheme based only on the photo's id.
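For example, as a hypothetical helper (names are made up):

def photo_path(photo_id: int) -> str:
    """id 123456 -> uploads/456/img_123456.jpg"""
    return f"uploads/{photo_id % 1000}/img_{photo_id}.jpg"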
It depends on the file system. For example, FAT16 tends to be quite slow if you have more than 512 files in a directory. FAT32 and NTFS do not have the same limitations, but they also run much more slowly if you have an extremely large number of files. Even if you're running one of the more robust Linux file systems, you're still going to be able to traverse directories more quickly if they're smaller.
I would definitely go with the first option - splitting the images into directories by user.
I think that subdirectories under the uploads directory would be the best.
site root
--uploads
----username
------image1.jpg
------image2.jpg
------image3.jpg
----anotheruser
------image1.jpg
------image2.jpg
------image3.jpg
...
Depending on the host OS, having too many files in one directory could cause some headaches and compatibility problems. Also, depending on how you are getting the image list, it could cause performance issues.
Plus, option 2 would be a mess. :)