MaxOpenFiles based on Indexes X Shards X Replicas X Documents - lucene

Problem
Currently I have an Elasticsearch cluster that is running out of file descriptors. Checking the Elasticsearch setup documentation, I saw that it is recommended to set the number of file descriptors on the machine to 32K or even 64K, and digging through search results I found people who set this limit even higher (128K or unlimited).
The exception I'm getting is a common symptom of file descriptor exhaustion:
Caused by: org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a NativeFSLock which is held by another indexer component
Question
Is there an equation for the number of file descriptors that Elasticsearch / Lucene should be expected to require, based on the number of indexes, shards, replicas and/or documents? Or even for the number of files across all Elasticsearch indexes?
I would rather not set it by trial and error, and an unlimited number of file descriptors isn't possible in my situation.

I know very little about Elasticsearch, but I will try to answer this from a Lucene perspective.
I'm afraid there's no easy way of finding out how many descriptors you really need.
First, it depends on the Directory implementation (which itself depends on the underlying OS if you use FSDirectory.open(File)).
Second, it also depends on your merge policy (which may vary with the Lucene version, unless Elasticsearch overrides it).
Finally, it can even depend on various exotic circumstances, such as garbage collection behaviour (if certain parts rely on finalizers to free resources). We even had a Lucene instance that leaked file descriptors until we manually switched on -d64 mode.
That said, I would recommend setting up a monitoring script that gathers stats over a week or so and coming up with a range that fits your typical usage. Add some headroom for unexpected cases.
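For what it's worth, a minimal sketch of such a monitoring loop in Java could use the JDK's UnixOperatingSystemMXBean, as below. Note that it only reports the descriptors of the JVM it runs in, so you would either run it inside the Elasticsearch process or instead sample /proc/<pid>/fd (or the node stats API) externally; the one-minute interval is an arbitrary choice.

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    public class FdMonitor {
        public static void main(String[] args) throws InterruptedException {
            // Works on Unix-like JVMs only; elsewhere this cast will fail.
            UnixOperatingSystemMXBean os =
                    (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            while (true) {
                long open = os.getOpenFileDescriptorCount();
                long max = os.getMaxFileDescriptorCount();
                System.out.printf("open=%d max=%d%n", open, max);
                Thread.sleep(60_000L); // sample once a minute
            }
        }
    }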
P.S. I am struggling to imagine a case these days where file descriptors would be a genuine problem. Is this a C10K problem? Can you elaborate on this?

Related

Object storage for a web application

I am currently working on a website where roughly 40 million documents and images should be served to its users. I need suggestions on which method is the most suitable for storing content, subject to these requirements.
The system should be highly available, scalable and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd party object storage providers such as Amazon S3 and CDNs are not suitable.
File sizes can vary from 1 MB to 30 MB (however, about 90% of the files would be less than 2 MB).
Content retrieval latency is not much of a problem, so indexing and caching are not very important.
I did some research and found the following solutions:
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for the last two years, and the work is still in progress. However, my requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes range from several bytes to several megabytes, and, most importantly, deduplication must be implemented at both the document and block levels. If two different users upload the same file to the storage, only one copy of the file should be kept. Likewise, if two different files partially overlap, only one copy of the common part of those files should be stored.
But let's focus on your requirements, where deduplication is not a concern. First of all, high availability implies replication. You'll have to store each file in several replicas (typically 2 or 3, though there are techniques to reduce the redundancy overhead) on independent machines in order to stay alive if one of the storage servers in your backend dies. Also, given your estimate of the data volume, it's clear that all your data simply won't fit on a single server, so vertical scaling is not possible and you have to consider partitioning. Finally, you need concurrency control to avoid race conditions when two different clients try to write or update the same data simultaneously. This topic is close to the concept of transactions (I don't mean ACID literally, but something close). To summarize, you are actually looking for a distributed database designed to store BLOBs.
One of the biggest problems in distributed systems is maintaining the global state of the system. In brief, there are two approaches:
Choose a leader that communicates with the other peers and maintains the global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that the leader becomes a single point of failure. If the leader dies, either some observer must assign the leader role to one of the replicas (the common case for master-slave replication in the RDBMS world), or the remaining peers must elect a new one (algorithms like Paxos and Raft are designed to address this). Either way, almost all incoming traffic goes through the leader, which leads to "hot spots" in the backend: CPU and I/O load are unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check the etcd and Consul limitations if you are interested).
Avoid global state altogether. Weaken the guarantees to eventual consistency. Disallow updating files: if someone wants to edit a file, it has to be saved as a new file. Use a system organized as a peer-to-peer network, where no peer keeps full track of the system, so there is no single point of failure. This gives high write throughput and good horizontal scalability.
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think storing files in a traditional RDBMS is a good option, because RDBMSs are optimized for structured data and strong consistency, and you need neither of those. You will also have difficulties with backups and scaling. People usually don't use an RDBMS this way.
Using GridFS to chunk and store content.
I'm not sure, but it looks like GridFS is built on top of MongoDB. Again, that is a document-oriented database designed to store JSON documents, not BLOBs. MongoDB also had clustering problems for many years and passed the Jepsen tests only in 2017, which may mean its clustering is not mature yet. Run performance and stress tests if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you would need to develop an object storage system on your own, so consider all the problems I've mentioned above.
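As a rough illustration of that option, here is a minimal, hypothetical sketch of a hash-based layout; the SHA-256 choice, the two-level directory fan-out and the class name are all assumptions, and in a real system the returned path (or the hash itself) would be recorded in the metadata database.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    // Hypothetical sketch only: the hash function, the two-level directory
    // fan-out and the class name are assumptions, not a prescribed design.
    public class HashStore {
        private final Path root;

        public HashStore(Path root) {
            this.root = root;
        }

        /** Stores the content under a path derived from its SHA-256 hash and returns that path. */
        public Path store(byte[] content) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            String hash = hex.toString();
            // e.g. ab/cd/abcd... keeps individual directories small
            Path dir = root.resolve(hash.substring(0, 2)).resolve(hash.substring(2, 4));
            Files.createDirectories(dir);
            Path file = dir.resolve(hash);
            if (!Files.exists(file)) {
                Files.write(file, content); // identical content maps to the same path
            }
            return file; // record this path (or the hash) in the metadata database
        }
    }

A side effect of addressing files by content hash is that files are never modified in place, which fits the "save edits as a new file" approach described earlier.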
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I have used neither of these solutions, but HDFS looks like overkill, because you become dependent on the Hadoop stack, and I have no idea about GlusterFS performance. Always consider the design of a distributed file system: if it has some kind of dedicated "metadata" servers, treat them as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well known outside the Russian part of the Internet, but it's mature and stable, and its performance is excellent. It was developed at Yandex (the Russian search engine), and a lot of Yandex services (such as Disk, Mail, Music and picture hosting) are built on top of it. I used it in a previous project; it may take some time for your ops team to get into it, but it's worth it if you're OK with the GPL license.
Ceph. This is a real object storage system. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it, so be prepared for vendor lock-in. I've also heard that it has overly complicated settings. I've never used it in production, so I can't speak to its performance.
Minio. This is an S3-compatible object storage system, under active development at the moment. I've never used it in production, but it seems to be well designed.
You may also check wiki page with the full list of available solutions.
And one last point: I strongly recommend against OpenStack Swift (there are a lot of reasons why, but first of all, Python is just not good for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and, "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored – e.g. using a database as an index; and (2) actually storing them, this being a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it "offhand" seems to me ...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss over all of those messy filesystem details . . .
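If it helps to make that concrete, here is a small, hypothetical sketch of the two-stage idea; the table name, columns and storage layout are my own assumptions, not part of the answer above.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Hypothetical sketch: stage 1 is the database index, stage 2 is the file system.
    public class DocumentCatalog {

        public static void save(Connection db, String displayName, byte[] content, Path storageRoot)
                throws Exception {
            // Stage 2: let the file system do what it specializes in, storing bytes.
            Path file = Files.createTempFile(storageRoot, "doc-", ".bin");
            Files.write(file, content);

            // Stage 1: the database holds the logical, user-facing view of the object.
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO documents (display_name, storage_path) VALUES (?, ?)")) {
                ps.setString(1, displayName);
                ps.setString(2, file.toString());
                ps.executeUpdate();
            }
        }
    }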

File system and API that supports unlimited nesting of directories?

I have a project with a large set of trees of data. I would absolutely love to do this in the file system, but the trees will likely get VERY deep. Many file systems claim no limit on the nesting of directories within each other, but they do have limits on path length, which effectively limits the depth of nesting. On top of that, the APIs/drivers for the file system within an operating system sometimes put even tighter constraints on things. So, my question is: do you know of a setup (file system + operating system + API/library/language) that can support virtually infinite nesting? I can't easily put a bound on the depth that might arise, but I'd say it would often get into the hundreds and, over the lifetime of the software, may reach the hundreds of thousands for certain branches. (And since I'm not psychic, I'd like it to be able to support beyond that.) The actual size of the data isn't a major concern and is well within the limits of every modern file system; it's just the nesting that is pushing the limit.
I'm aware of other ways to achieve this, but they all feel less elegant, since so many of the things I want are supported by the file system: storing files, user access control (with the same users recognized by the OS), absolute and relative addressing, mounting/symlinking to subtrees, and so on. Also, a lot of tools that operate at the file system level would be very useful here.

RDBMS caching vs disk I/O -- comparison across vendors

I know little about how leading RDBMSs go about retrieving data, so these questions may seem a bit rudimentary:
Does each SELECT in commonly used RDBMSs such as Oracle, SQL Server, MySQL, PostgreSQL, etc. always mean a trip to the disk to read the data, or do they, to the extent allowed by the hardware, cache commonly requested data to avoid the expensive I/O operation?
How do they determine which data segments to cache?
How do they go about synchronizing the cache once an update of some of the cached data occurs by a different process?
Is there a comparison matrix on how different RDBMSs cache frequently requested data?
Thanks
I'll answer for SQL Server:
Reads are served from the cache if possible; otherwise, an I/O occurs.
From what has been written and from what I observe, it is an LRU algorithm; I don't think this is documented anywhere. The LRU items are 8 KB database pages (a toy sketch of such a page cache follows this list).
SQL Server is the only process that has access to the database files, so no other process can cause modifications. Regarding concurrent transactions: multiple transactions can modify the same page, and locking (mostly at row level, sometimes at page or table level) ensures that the transactions do not disturb each other.
I don't know.
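To illustrate the idea (not SQL Server's actual buffer manager, which is far more sophisticated), here is a toy LRU page cache; the capacity, class name and miss handling are assumptions, while the 8 KB page size matches the answer above.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Toy illustration only; a real buffer manager is far more sophisticated.
    public class PageCache extends LinkedHashMap<Long, byte[]> {
        private static final int PAGE_SIZE = 8 * 1024;
        private final int maxPages;

        public PageCache(int maxPages) {
            super(16, 0.75f, true); // access-order iteration gives LRU behaviour
            this.maxPages = maxPages;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
            return size() > maxPages; // evict the least recently used page
        }

        public byte[] readPage(long pageId) {
            byte[] page = get(pageId);
            if (page == null) {
                page = new byte[PAGE_SIZE]; // cache miss: a real disk read would happen here
                put(pageId, page);
            }
            return page;
        }
    }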
The answers for Informix are pretty similar to those given for SQL Server:
Reads and writes both use the cache if at all possible. If the page needed is not already in the cache, an appropriate sequence of I/O operations occurs (typically evicting some page from the cache, perhaps writing out a dirty page first, and then reading the new page into the freed slot).
There are various algorithms, but page size and usage are the key factors; there are LRU queues for each page size.
The DBMS as a whole is an ensemble of processes that use a buffer pool in shared memory (and, where possible, direct disk I/O instead of going through the kernel cache), and it uses various forms of locking (semaphores, spin-locks, mutexes, etc.) to handle concurrency and synchronization. (On Windows, Informix uses a single process with multiple threads; on Unix, it uses multiple processes.)
Probably not.

Speeding up Solr Indexing

I am working on speeding up my Solr indexing. I just want to know how many threads (if any) Solr uses for indexing by default. Is there a way to increase or decrease that number?
When you index a document, several steps are performed:
the document is analyzed,
data is put in the RAM buffer,
when the RAM buffer is full, data is flushed to a new segment on disk,
if there are more than ${mergeFactor} segments, segments are merged.
The first two steps run in as many threads as there are clients sending data to Solr, so if you want Solr to use three threads for these steps, all you need to do is send data to Solr from three threads.
You can configure the number of threads used for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no way to configure the maximum number of threads from Solr's configuration files, so you need to write a custom class that calls setMaxThreadCount in its constructor, as in the sketch below.
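A minimal version of such a class might look like this; the class name and thread count are assumptions, and the scheduler would then be referenced from solrconfig.xml via the mergeScheduler setting.

    import org.apache.lucene.index.ConcurrentMergeScheduler;

    // Hypothetical class name and thread count; reference it from solrconfig.xml
    // via the mergeScheduler setting so Solr picks it up.
    public class FourThreadMergeScheduler extends ConcurrentMergeScheduler {
        public FourThreadMergeScheduler() {
            setMaxThreadCount(4); // allow up to four concurrent merge threads
        }
    }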
In my experience, the main ways to improve indexing speed with Solr are:
buying faster hardware (especially I/O),
sending data to Solr from several threads (as many threads as cores is a good start),
using the Javabin format,
using faster analyzers.
Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format. Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (800 documents per request in my case, with rather small documents) using CommonsHttpSolrServer and the Javabin format.
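For reference, a bulk update with the Javabin format along those lines might look like the sketch below (SolrJ 1.4-era API); the URL, batch size and field names are assumptions for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.setRequestWriter(new BinaryRequestWriter()); // send updates as Javabin

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 800; i++) { // roughly 800 small documents per request
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("text", "document body " + i);
                batch.add(doc);
            }
            server.add(batch);  // one bulk request instead of 800 round trips
            server.commit();
        }
    }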
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. It is for Solr 4.0, which had not been released at the time this question was originally posted.
You can also store the content in external storage, such as plain files: for every field that contains a large amount of content, set stored="false" on that field in the schema and keep the content for that field in external files, organized in an efficient file system hierarchy.
In my case this reduced indexing time by 40 to 45%. However, searching becomes somewhat slower: retrieval took about 25% more time than a normal search.

Why doesn't Hadoop file system support random I/O?

Distributed file systems like the Google File System and Hadoop's HDFS don't support random I/O.
(They can't modify files that have already been written; only writing and appending are possible.)
Why did they design the file system like this?
What are the important advantages of this design?
P.S. I know Hadoop will support modifying data that has already been written.
But they said its performance will be very poor. Why?
Hadoop distributes and replicates files. Since the files are replicated, any write operation would have to find each replicated block across the network and update it, which would heavily increase the time for the operation. Updating a file could also push it over the block size, requiring the file to be split into two blocks and the second block to be replicated. I don't know the internals or when/how a block would be split... but it's a potential complication.
What if a job that had already performed an update failed or got killed and was re-run? It could update the file multiple times.
The advantage of not updating files in a distributed system is that, when you update a file, you don't know who else is using it or where all the pieces are stored. There are potential timeouts (the node holding a block is unresponsive), so you might end up with mismatched data (again, I don't know the internals of Hadoop, and an update with a node down might be handled; this is just something I'm brainstorming).
There are a lot of potential issues (a few laid out above) with updating files on HDFS. None of them are insurmountable, but checking for and accounting for them would require a performance hit.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
I think it's because of the block size of the data and the whole idea of Hadoop: you don't move data around, you move the algorithm to the data.
Hadoop is designed for non-realtime batch processing of data. If you're looking for something more like a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.