Lucene - is it the right answer for a huge index?

Is Lucene capable of indexing 500M text documents of 50KB each?
What performance can be expected from such an index, for a single-term search and for a 10-term search?
Should I be worried and move directly to a distributed index environment?
Saar

Yes, Lucene should be able to handle this, according to the following article:
http://www.lucidimagination.com/content/scaling-lucene-and-solr
Here's a quote:
Depending on a multitude of factors, a single machine can easily host a Lucene/Solr index of 5 – 80+ million documents, while a distributed solution can provide subsecond search response times across billions of documents.
The article goes into great depth about scaling to multiple servers. So you can start small and scale if needed.
A great resource about Lucene's performance is the blog of Mike McCandless, who is actively involved in the development of Lucene: http://blog.mikemccandless.com/
He often uses Wikipedia's content (25 GB) as test input for Lucene.
Also, it might be interesting that Twitter's real-time search is now implemented with Lucene (see http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html).
However, I am wondering if the numbers you provided are correct: 500 million documents x 50 KB = ~23 TB -- Do you really have that much data?
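For illustration only (not from the article above): a minimal bulk-indexing sketch against the Lucene 3.x API, with a larger-than-default RAM buffer, which is one of the usual knobs for indexing throughput. The index path and field names are placeholders.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import java.io.File;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/data/lucene-index")); // placeholder path
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
            cfg.setRAMBufferSizeMB(256); // bigger RAM buffer -> fewer flushes to disk
            IndexWriter writer = new IndexWriter(dir, cfg);

            // In a real setup this loop would stream the documents from their source.
            Document doc = new Document();
            doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", "document text ...", Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);

            writer.close();
        }
    }

Whether a single machine copes with 500M such documents depends far more on hardware and query load than on the API, which is exactly what the article above discusses.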

Related

How to index PDF / MS-Word / Excel files really fast for full text search?

We are building a real-time search feature for institutions; the index is based on user-uploaded files (mostly Word/Excel/PDF/PowerPoint, plus ASCII text files). The I/O load is expected to be only 10-20 IOPS, but it can vary depending on the day; the maximum could be around 100 IOPS. The current database size is approaching 10GB and it is 4 months old.
For the real-time search server I'm considering Solr/Lucene and possibly ElasticSearch. But the challenge is how to index these files FAST, so that the search server can query the index in real time.
I have found some similar questions on how to index .doc/.xls/.pdf, but they did not mention how to ensure indexing performance:
Search for keywords in Word documents and index them
Index Word/PDF Documents From File System To SQL Server
How to extract text from MS office documents in C#
Using full-text search with PDF files in SQL Server 2005
So my question is: how do I build the index FAST?
Any suggestions on the architecture? Should I focus on building fast infrastructure (i.e. RAID, SSD, more CPU, network bandwidth) or on the indexing tools and algorithms?
We're building a high-performance full-text search for office documents, and we can share some insights:
We use ElasticSearch. It's difficult to make it perform well on large files; we wrote several posts about it:
Highlighting Large Documents in ElasticSearch
Making ElasticSearch Perform Well with Large Text Fields
Use a microservice architecture and Docker to scale your application easily.
Do not store the original files in ElasticSearch as binary data; store them separately, for example in MongoDB.
Hope it helps!
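To make the extraction step concrete (this is my own rough sketch, not code from the posts above): Apache Tika can pull plain text out of Word/Excel/PDF/PowerPoint files, and since extraction is usually the slow part, it is worth parallelizing before the text is sent to the search server. The paths and the helper method outside Tika are placeholders.

    import org.apache.tika.Tika;

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ExtractAndIndex {
        private static final Tika TIKA = new Tika(); // auto-detects Word/Excel/PDF/PowerPoint

        public static void main(String[] args) throws Exception {
            File[] uploads = new File("/data/uploads").listFiles(); // placeholder upload folder
            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

            for (File f : uploads) {
                pool.submit(() -> {
                    try {
                        String text = TIKA.parseToString(f);   // text extraction (the slow part)
                        sendToSearchServer(f.getName(), text); // placeholder: bulk request to Solr/ES
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
        }

        private static void sendToSearchServer(String id, String text) {
            // placeholder: batch documents and send them via the bulk API of your search server
        }
    }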

ElasticSearch - Determining maximum shard size

Hopefully this question isn't out of date, but I haven't found a clear answer anywhere yet. According to one of the ES presentations from last year (http://www.elasticsearch.org/videos/big-data-search-and-analytics/), there's a "maximum" size for a shard. I'm trying to determine this for my application, but as far as I can tell, I haven't hit it yet. Does anyone know what the behavior is of a single-shard index that has reached its maximum? Do inserts fail, or does the index just become unusable?
To test this myself, I indexed all the English articles in Wikipedia (without any history information) in a single elasticsearch shard. The elasticsearch data folder grew to ~42GB at the end of the test. Lessons learned are:
Indexing speed was not affected by the size of the shard. Mind you, I did not try indexing with more than one thread at a time, but single-thread indexing speed was more or less constant for the duration of the test.
Querying speed, on the other hand, was drastically affected by shard size, especially once more than one user queries at a time. The exact numbers will depend heavily on the power of your machine, the data structure, and how many threads are querying. To give you an idea, with elasticsearch running on my dev machine, querying the Wikipedia shard with 25 concurrent users resulted in an average response time of 3.5 seconds (with peaks towards half a minute).
My conclusion is that a shard that is too large will not make elasticsearch fail just from indexing. Querying the large shard may be too slow for your needs, or, in certain situations, even break elasticsearch with an OutOfMemoryException (for example with a big faceted query).
This answer is based on my own investigation. Full story can be read on my blog:
http://blog.trifork.com/2013/09/26/maximum-shard-size-in-elasticsearch/
http://blog.trifork.com/2013/11/05/maximum-shard-size-in-elasticsearch-revisited/
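For reference, reproducing that test setup is mostly a matter of creating an index with one primary shard and no replicas. Here is a rough sketch with the Elasticsearch Java high-level REST client (a newer client than the version discussed in the posts above; the index name is arbitrary):

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.indices.CreateIndexRequest;
    import org.elasticsearch.common.settings.Settings;

    public class SingleShardIndex {
        public static void main(String[] args) throws Exception {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")));

            // One primary shard, no replicas: all documents end up in a single Lucene index,
            // which is what the shard-size experiment above measures.
            CreateIndexRequest request = new CreateIndexRequest("wiki");
            request.settings(Settings.builder()
                    .put("index.number_of_shards", 1)
                    .put("index.number_of_replicas", 0));
            client.indices().create(request, RequestOptions.DEFAULT);

            client.close();
        }
    }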

Single word lucene indexing limit?

I have a Lucene-based application and, obviously, a problem.
When the number of indexed documents is low, no problems appear. When the number of documents increases, it seems that single words are no longer being indexed: searching for a single word (a single term) returns an empty result set.
The version of Lucene is 3.1, on a 64-bit machine, and the index is 10GB.
Do you have any idea?
Thanks
According to the Lucene documentation, Lucene should be able to handle 274 billion distinct terms. I don't believe it is possible that you have reached that limitation in a 10GB index.
Without more information, it is difficult to help further. However, since you only see problems with large numbers of documents, I suspect you are running into an exceptional condition of some kind that causes the system to fail to read or respond correctly. File handle leaks or memory exhaustion, perhaps, to take a stab in the dark.
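One concrete thing worth trying (my suggestion, not something the original answer included) is Lucene's CheckIndex tool, which ships with Lucene 3.1 and can at least rule out index corruption; the index path is a placeholder:

    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    import java.io.File;

    public class VerifyIndex {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/data/lucene-index")); // placeholder path
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);          // print per-segment diagnostics
            CheckIndex.Status status = checker.checkIndex();
            System.out.println(status.clean ? "Index is clean" : "Index has problems");
            dir.close();
        }
    }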

Does the RavenDB compression bundle provide benefits with many small documents?

I am trying to better understand how RavenDB uses disk space.
My application has many small documents (approximately 140 bytes each). Presently, there are around 81,000 documents which would give a total data size of around 11MB. However, the size of the database is just over 70MB.
Is most of the actual space being used by indexes?
I had read somewhere else that there may be a minimum overhead of around 600 bytes per document. This would consume around 49MB, which is more in the ballpark of the actual use I am seeing.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
I have done some further testing on my own and determined, in answer to my own question, that:
Indexes are not the main consumer of disk space in my scenario. In this case, indexes represent < 25% of the disk space used.
Adding the compression bundle for a database with a large number of small documents does not really reduce the total amount of disk space used. This is likely due to some minimum data overhead that each document requires. Compression would benefit documents that are very large.
Is most of the actual space being used by indexes?
Yes, that's likely. Remember that Raven creates indexes for the different queries you make. You can fire up Raven Studio to see what indexes it has created for you.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
Probably wouldn't benefit your scenario of small documents. The compression bundle works on individual documents, not on indexes. But it might be worth trying to see what results you get.
Bigger question: since hard drive space is cheap and only getting cheaper, and 70MB is a speck on the map, why are you concerned about hard drive space? Databases often trade disk space for speed (e.g. multiple indexes, like Raven), and this is usually a good trade-off for most apps.

Speeding up Solr Indexing

I am working on speeding up my Solr indexing. I just want to know how many threads (if any) Solr uses for indexing by default. Is there a way to increase/decrease that number?
When you index a document, several steps are performed:
the document is analyzed,
data is put in the RAM buffer,
when the RAM buffer is full, data is flushed to a new segment on disk,
if there are more than ${mergeFactor} segments, segments are merged.
The first two steps will be run in as many threads as you have clients sending data to Solr, so if you want Solr to run three threads for these steps, all you need is to send data to Solr from three threads.
You can configure the number of threads to use for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no way to configure the maximum number of threads from the Solr configuration files, so you need to write a custom class which calls setMaxThreadCount in its constructor, as sketched below.
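A minimal sketch of such a class (my illustration against the Lucene 3.x API linked above; the class name and thread count are arbitrary, and the exact way to reference it from solrconfig.xml depends on your Solr version):

    import org.apache.lucene.index.ConcurrentMergeScheduler;

    // Merge scheduler that caps the number of merge threads.
    // Reference this class from the mergeScheduler setting in solrconfig.xml.
    public class BoundedMergeScheduler extends ConcurrentMergeScheduler {
        public BoundedMergeScheduler() {
            // There is no direct solrconfig.xml setting for this in this Solr version,
            // so the constructor sets it programmatically.
            setMaxThreadCount(4);
        }
    }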
My experience is that the main ways to improve indexing speed with Solr are:
buying faster hardware (especially I/O),
sending data to Solr from several threads (as many threads as cores is a good start),
using the Javabin format,
using faster analyzers.
Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format. Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (800 in my case, but with rather small documents) using CommonsHttpSolrServer and the Javabin format.
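For illustration, a rough SolrJ sketch of that approach (my code, not the answerer's, assuming a SolrJ version that still has CommonsHttpSolrServer; the URL and field names are placeholders, and 800 just matches the batch size mentioned above):

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class JavabinBulkIndexer {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.setRequestWriter(new BinaryRequestWriter()); // Javabin instead of XML

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 800; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("text", "document body ..."); // placeholder field names
                batch.add(doc);
            }
            server.add(batch);   // one bulk update instead of 800 single requests
            server.commit();
        }
    }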
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. This is for Solr 4.0 which hadn't been released at the time this question was originally posted.
You can store the field content in external storage, such as files. For every field that holds very large content, set stored="false" for that field in the schema and keep its content in external files, organized in an efficient file-system hierarchy.
In my tests this reduced indexing time by 40 to 45%. However, searches became somewhat slower: retrieving results took about 25% more time than a normal search.