Which technology should be used for serving large number of static files? - apache

My main aim is to serve large number of XML files ( > 1bn each <1kb) via web server. Files can be considered as staic as those will be modified by external code, in relatively very low frequency (about 50k updates per day). Files will be requested in high frequency (>30 req/sec).
Current suggestion from my team is to create a dedicated Java application to implement HTTP protocal and use memcached to speed up the thing, keeping all file data in RDBMS and getting rid of file system.
On other hand, I think, a tweaked Apache Web Server or lighttpd should be enough. Caching can be left to OS or web server's defalt caching. There is no point in keeping data in DB if the same output is required and only queried based on file name. Not sure how memcached will work here. Also updating external cache (memcached) while updating file via external code will add complexity.
Also other question, if I choose to use files is is possible to store those in directory like \a\b\c\d.xml and access via abcd.xml? Or should I put all 1bn files in single directory (Not sure OS will allow it or not).
This is NOT a website, but for an application API in closed network so Cloud/CDN is of no use.
I am planning to use CentOS + Apache/lighttpd. Suggest any alternative and best possible solution.
This is the only public note found on such topic, and it is little old too.

1bn files at 1KB each, that's about 1TB of data. Impressive. So it won't fit into memory unless you have very expensive hardware. It can even be a problem on disk if your file system wastes a lot of space for small files.
30 requests a second is far less impressive. It's certainly not the limiting factor for the network nor for any serious web server out there. It might be a little challenge for a slow harddisk.
So my advice is: Put the XML files on a hard disk and serve them with a plain vanilla web server of your choice. Then measure the throughput and optimize it, if you don't reach 50 files a second. But don't invest into anything unless you have shown it to be a limiting factor.
Possible optimizations are:
Find a better layout in the file system, i.e. distribute your files over enough directories so that you don't have too many files (more than 5,000) in a single directory.
Distribute the files over several harddisks so that they can access the files in parallel
Use faster harddisk
Use solid state disks (SSD). They are expensive, but can easily serve hundreds of files a second.
If a large number of the files are requested several times a day, then even a slow hard disk should be enough because your OS will have the files in the file cache. And with today's file cache size, a considerable amount of your daily deliveries will fit into the cache. Because at 30 requests a second, you serve 0.25% of all files a day, at most.
Regarding distributing your files over several directories, you can hide this with an Apache RewriteRule, e.g.:
RewriteRule ^/xml/(.)(.)(.)(.)(.*)\.xml /xml/$1/$2/$3/$4/$5.xml

Another thing you could look at is Pomegranate, which seems very similar to what you are trying to do.

I believe that a dedicated application with everything feeding off a memcache db would be the best bet.

Related

Object storage for a web application

I am currently working on a website where, roughly 40 million documents and images should be served to it's users. I need suggestions on which method is the most suitable for storing content with subject to these requirements.
System should be highly available, scale-able and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd party object storage providers such as Amazon S3 and CDNs are not suitable.
File size of content can vary from 1 MB to 30 MB. (However about 90% of the files would be less than 2 MB)
Content retrieval latency is not much of a problem. Therefore indexing or caching is not very important.
I did some research and found out about the following solutions;
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for last two years, the work is still in progress. However, requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes fall in range from several bytes to several megabytes, and, the most important one, the deduplication, which should be implemented both on the document and block levels. If two different users upload the same file to the storage, the only copy of the file should be kept. Also if two different files partially intersect with each other, it's necessary to store the only copy of the common part of these files.
But let's focus on your requirements, so deduplication is not the case. First of all, high availability implies replication. You'll have to store your file in several replicas (typically 2 or 3, but there are techniques to decrease data parity) on independent machines in order to stay alive in case if one of the storage servers in your backend dies. Also, taking into account the estimation of the data amount, it's clear that all your data just won't fit into a single server, so vertical scaling is not possible and you have to consider partitioning. Finally, you need to take into account concurrency control to avoid race conditions when two different clients are trying to write or update the same data simultaneously. This topic is close to the concept of transactions (I don't mean ACID literally, but something close). So, to summarize, these facts mean that you're are actually looking for distributed database designed to store BLOBs.
On of the biggest problems in distributed systems is difficulties with global state of the system. In brief, there are two approaches:
Choose leader that will communicate with other peers and maintain global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that in this case leader becomes the single point of failure. If leader dies, either some observer must assign leader role to one of the replicas (common case for master-slave replication in RDBMS world), or remaining peers need to elect new one (algorithms like Paxos and Raft are designed to target this issue). Anyway, almost whole incoming system traffic goes through the leader. This leads to the "hot spots" in backend: the situation when CPU and IO costs are unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check etcd and consul limitations if you are interested).
Avoid global state at all. Weaken the guarantees to eventual consistency. Disable the update of files. If someone wants to edit the file, you need to save it as new file. Use the system which is organized as a peer-to-peer network. There is no peer in the cluster that keeps the full track of the system, so there is no single point of failure. This results in high write throughput and nice horizontal scalability.
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think it's a good option to store files in traditional RDBMS because they provide optimizations for structured data and strong consistency, and you don't need neither of this. Also you'll have difficulties with backups and scaling. People usually don't use RDBMS in this way.
Using GridFS to chunk and store content.
I'm not sure, but it looks like GridFS is built on the top of MongoDB. Again, this is document-oriented database designed to store JSONs, not BLOBs. Also MongoDB had problems with a cluster for many years. MongoDB passed Jepsen tests only in 2017. This may mean that MongoDB cluster is not mature yet. Make performance and stress tests, if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop object storage on your own. Consider all the problems I've mentioned above.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I used neither of these solutions, but HDFS looks like overkill, because you get dependent on Hadoop stack. Have no idea about GlusterFS performance. Always consider the design of distributed file systems. If they have some kind of dedicated "metadata" serves, treat it as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well-known outside of the russian part of the Internet, but it's mature and stable, and performance is perfect. It was developed at Yandex (russian search engine) and a lot of Yandex services (like Disk, Mail, Music, Picture hosting and so on) are built on the top of it. I used it in previous project, this may take some time for your ops to get into it, but it's worth it, if you're OK with GPL license.
Ceph. This is real object storage. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it. So get ready to a vendor lock. Also I heard that it have too complicated settings. Never used in production, so don't know about performance.
Minio. This is S3-compatible object storage, under active development at the moment. Never used it in production, but it seems to be well-designed.
You may also check wiki page with the full list of available solutions.
And the last point: I strongly recommend not to use OpenStack Swift (there are lot of reasons why, but first of all, Python is just not good for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and, "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored – e.g. using a database as an index; and (2) actually storing them, this being a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it "offhand" seems to me ...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things which you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss-over all of those messy filesystem details . . .

Is it wrong to write byte of images in the database?

When should I make this direct recording at the bank?
What are the situations?
I know I can record the path of the image in the bank.
In addition to the cost being higher as mentioned, one must take into account several factors:
Data Volume: For a low volume of data there may be no problem. On the other hand, for mass storage of data the database is practically unfeasible.
Clustering: One advantage of the database is if your system runs on multiple servers, everyone will have uniform access to the files.
Scalability: If demand for volume or availability increases, can you add more capacity to the system? It is much easier to split files between different servers than to distribute records from one table to more servers.
Flexibility: Backing up, moving files from one server to another, doing some processing on the stored files, all this is easier if the files are in a directory.
There are several strategies for scaling a system in terms of both availability and volume. Basically these strategies consist of distributing them on several different servers and redirecting the user to each of them according to some criteria. The details vary of implementation, such as: data update strategy, redundancy, distribution criteria, etc.
One of the great difficulties in managing files outside BD is that we now have two distinct data sources that need to be always in sync.
From the safety point of view, there is actually little difference. If a hacker can compromise a server, it can read both the files written to disk of your system and the files of the database system. If this question is critical, an alternative is to store the encrypted data.
I also convert my images into byte array and store them in an sql server database but in the long run, I am sure that someone will ask you and tell you that you should only save the (server) path of the image.
The biggest disadvantage of storing as binary I think is
Retrieving images from database is significantly more expensive compared to using the file system

Do files on your server influence website speed

If you have a webserver for your website, does it make a difference if there are a lot of other files on the server, even if they aren't used?
Example
An average webserver has a SSD with 500 GB of space. It's hosting a single website, but has a ton of other websites which are inactive. Though that single website is only 1GB in size, the hard drive is full for 50%. Will that influence site speed?
And does SSD vs HDD make a difference in that, apart from the speed difference between the two types.
Edit: I've read somewhere that the amount of files in your server influences it's speed, and it sounds logical due to Andrei's answer, concerning the having to search through more files. I've had a discussion about it with someone however, and he firmly states that it makes no difference.
Having other/unused files always has an impact on the performance, but the question is - how big it is. Usually not much and you will not notice it at all.
But think about how files are read from disk. First, you need to locate the file record in the file allocation table (FAT). Search in the table is similar to search in a tree-like data structure, as we have to deal with folders that contain other folders etc.
The more files you have, the bigger the FAT gets. And the search becomes slower, correspondingly.
All in all, with memory caching and other tricks, this is not an issue.
You will notice the impact when you have thousands of files in one folder. That's why picture-related services that host big amount of images usually store them in a folder structure that holds only limited amount of files per folder. For example, a file named '12345678.jpg' would be stored in '/1/2/3/4/5/12345678.jpg' path as well as other files whose names are '12345000'...'12345999'. Thus only 1000 files would be saved per folder.

Post Processing of Resized Image In clustered environment

Been playing with ImageResizer for a bit now, and trying to do something, I am having trouble understanding the way to go about it.
Mainly I would like to stick to the idea of using the pipeline, and not trying to cheat it.
So.... Let's say, I pretty standard use ImageResizer For something like:
giants_logo.jpg?w=280&h=100
The File giants_logo.jpg
Processing Request is for a resized version of 'w=280&h=100'
In a clustered environment, what will happen is if this same request is served by 3 machines.
All 3 would end up doing the resize, and then storing their cached version in a local folder on disc. I could leverage a shared drive or something, but that has it's own limitations.
What I am looking to do, is get the processed file, and then copy it back up to the DB or S3 where the main images are served from.
My thought is.... I might have to write somehting like DiscCache, but with a complelty different guts, using the DB or S3 as the back end instead of the file system.
I realize the point of caching is speed, and what I am suggesting is negating that aspect..... but that's not the case if we layer the things maybe.
Anyway, What I am focused on is trying to keep track of the files generated, as well as avoid processing on multiple servers.
Any thoughts on the route I should look at to accomplish this?
TLDR; When DiskCache actually stops working well (usually between 1 and 20 million unique images), then switch to a CDN (unless it's too expensive), or a reverse proxy (unless your data set is really too huge to be bound by mortal infrastructure).
For petabyte data sets on the cheap when performance isn't king, it's a good plan. But for most people, it's premature. Even users with upwards of 20TB (source images) still use DiskCache. Really. Terabyte drives are cheap.
Latency is the killer.
To make this work you would need a central Redis server. MSSQL won't cut it (at least not on a VM or commodity hardware, we've tried). Given a Redis server, you can track what is done and stored (and perhaps even what is in progress, to de-duplicate effort in real time, as DiskCache does).
If you can track it, you can reuse it, and you can delete it. Reuse will be slower, since you're doubling the network traffic, moving the result twice. (But also decreasing it linearly with the number of servers in the cluster for source image fetches).
If bandwidth saturation is your bottleneck (very common), this could make performance worse. In fact, unless your read/write ratio is write and CPU heavy, you'll likely see worse performance than duplicated CPU effort under individual disk caches.
If you have the infrastructure to test it, put DiskCache on a SAN or shared drive; this will give you a solid estimate of the performance you can expect (assuming said drive and your blob storage system have comparable IO perf).
However, it's a fair amount of work, and you're essentially duplicating a subset of the functionality of reverse proxy (but with worse performance, since every response has to be proxied through the unlucky cluster server, instead of being spooled directly from disk).
CDNs and Reverse proxies to the rescue
Amazon CloudFront or Varnish can serve quite well as reverse proxies/caches for a web farm or cluster. Now, you'll have a bit less control over the 'garbage collection' process, but... also less code to maintain.
There's also ARR, but I've heard neither success nor failure stories about it.
But it sounds fun!
Send me a Github link and I'll help out.
I'd love to get a Redis-coordinated, cloud-agnostic poor-man's blob cache system out there. You bring the petabytes and infrastructure, I'll help you with the integration and troublesome bits. Efficient HTTP proxying is probably the hardest part; the rest is state management and basic threading.
You might want to have a look at a modified AzureReader2 plugin at https://github.com/orbyone/Sensible.ImageResizer.Plugins.AzureReader2
This implementation stores the transformed image back to the Azure blob container on the initial requests, so subsequent requests are redirected to that copy.

Performance implications of storing 600,000+ images in the same folder (NTFS)

I need to store about 600,000 images on a web server that uses NTFS. Am I better off storing images in 20,000-image chunks in subfolders? (Windows Server 2008)
I'm concerned about incurring operating system overhead during image retrieval
Go for it. As long has you have an external index and have a direct file path to each file with out listing the contents of the directory then you are ok.
I have a folder with that is over 500 GB in size with over 4 million folders (which have more folders and files). I have somewhere in the order of 10 million files in total.
If I accidentally open this folder in windows explorer it gets stuck at 100% cpu usage (for one core) until I kill the process. But as long as you directly refer to the file/folder performance is great (meaning I can access any of those 10 million files with no overhead)
Depending on whether NTFS has directory indexes, it should be alright from the application level.
I mean, that opening files by name, deleting, renaming etc, programmatically should work nicely.
But the problem is always tools. Third party tools (such as MS explorer, your backup tool, etc) are likely to suck or at least be extremely unusable with large numbers of files per directory.
Anything which does a directory scan, is likely to be quite slow, but worse, some of these tools have poor algorithms which don't scale to even modest (10k+) numbers of files per directory.
NTFS folders store an index file with links to all its contents. With a large amount of images, that file is going to increase a lot and impact your performance negatively. So, yes, on that argument alone you are better off to store chunks in subfolders. Fragments inside indexes are a pain.