We are upgrading our search infrastructure from Lucene 2.3.1 to Lucene 3.5. I am in the process of load testing, and I found that Lucene 2.3.1 could index 32,000 docs per second, whereas Lucene 3.5 indexes only around 17,000 docs per second.
Both use StandardAnalyzer and the default settings. Is 3.5 slower because it indexes more detail, thereby resulting in faster searches? Ours is a log management product, and indexing speed is highly important to us.
To cut a long story short: will the slower indexing of 3.5 result in higher search speed? If not, what else should I fine-tune to improve indexing speed?
Have you looked at ImproveIndexingSpeed on the Lucene wiki?
Otherwise, please share some details on your setup so that we can help you:
how often do you commit?
what MergePolicy, RAM buffer size, and mergeFactor do you use? (see the sketch after these questions)
how many indexing threads do you spawn?
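For reference, here is a minimal sketch of where those knobs live in the 3.5 API; the buffer size and merge factor values below are illustrative, not recommendations:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TunedWriter {
        public static IndexWriter open(File indexDir) throws Exception {
            Directory dir = FSDirectory.open(indexDir);
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
            // A larger RAM buffer means fewer flushes to disk (default is 16 MB).
            cfg.setRAMBufferSizeMB(256.0);
            // A higher merge factor trades search speed for indexing speed.
            LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
            mp.setMergeFactor(30);
            cfg.setMergePolicy(mp);
            return new IndexWriter(dir, cfg);
        }
    }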
Related
I have a Lucene-based application and, obviously, a problem.
When the number of indexed documents is low, no problems appear. When the number of documents increases, it seems that single words are not being indexed: searching for a single word (a single term) returns an empty result set.
The version of Lucene is 3.1 on a 64-bit machine, and the index is 10 GB.
Do you have any idea?
Thanks
According to the Lucene documentation, Lucene should be able to handle 274 billion distinct terms. I don't believe you could have reached that limit in a 10 GB index.
Without more information, it is difficult to help further. However, since you only see problems with large numbers of documents, I suspect you are running into some kind of exceptional condition that causes the system to fail to read or respond correctly: file handle leaks or memory exhaustion perhaps, to take a stab in the dark.
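If it does turn out to be a handle leak, the usual culprit is a reader or searcher opened per query and never closed. A minimal sketch of the safe pattern in the 3.x API (the class name is made up for illustration):

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SafeSearch {
        // Open once, reuse across queries, and always release handles.
        public static void searchOnce(File indexDir) throws Exception {
            Directory dir = FSDirectory.open(indexDir);
            IndexReader reader = IndexReader.open(dir);
            try {
                IndexSearcher searcher = new IndexSearcher(reader);
                try {
                    // ... run queries with the searcher ...
                } finally {
                    searcher.close(); // forgetting these leaks file handles
                }
            } finally {
                reader.close();
            }
        }
    }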
Ayende, my mails are not delivered to your mailing list, so I'll ask here; maybe someone else has a solution to my problem.
I'm testing RavenDB again and again :) and I think I found a little bug. On your documentation page I read:
Raven/MaxNumberOfParallelIndexTasks: The maximum number of indexing tasks allowed to run in parallel. Default: the number of processors in the current machine.
But despite that, it looks like RavenDB is using only one core for indexing tasks, and with a single core it takes too long to finish indexing a large dataset. I tried overriding that configuration and set MaxNumberOfParallelIndexTasks to 3, but it still uses only a single core.
Take a look at this screenshot: http://dl.dropbox.com/u/3055964/Untitled.gif
CPU utilization is only at 25%, and I have a quad-core processor. I didn't modify the affinity mask.
Am I doing something wrong, or have I hit a bug?
Thanks
Davita,
I fixed the mailing list issue.
The problem you are seeing is likely that only one index has work to do. The work for a single index is always done on a single CPU.
We spread indexing work across multiple CPUs at index boundaries.
Is Lucene capable of indexing 500 million text documents of 50 KB each?
What performance can be expected from such an index, for a single-term search and for a 10-term search?
Should I be worried and move directly to a distributed index environment?
Saar
Yes, Lucene should be able to handle this, according to the following article:
http://www.lucidimagination.com/content/scaling-lucene-and-solr
Here's a quote:
Depending on a multitude of factors, a single machine can easily host a Lucene/Solr index of 5 – 80+ million documents, while a distributed solution can provide subsecond search response times across billions of documents.
The article goes into great depth about scaling to multiple servers. So you can start small and scale if needed.
A great resource about Lucene's performance is the blog of Mike McCandless, who is actively involved in the development of Lucene: http://blog.mikemccandless.com/
He often uses Wikipedia's content (25 GB) as test input for Lucene.
Also, it might be interesting that Twitter's real-time search is now implemented with Lucene (see http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html).
However, I am wondering if the numbers you provided are correct: 500 million documents × 50 KB ≈ 23 TB. Do you really have that much data?
I know there is a post on loading an index into RAM for Lucene:
Faster search in Lucene - Is there a way to keep the whole index in RAM?
But I really need it for Solr, to improve search speed. Any pointers would be helpful :)
thanks
I don't think this is a good idea. It's like asking to improve Windows' speed by disabling its swap file. Solr implements some very efficient caching mechanisms on top of Lucene, and there's also the file system cache.
If you have speed issues with Solr this is not the solution. Please post another question detailing your problems and let us recommend you a proper solution.
See also: XY Problem
What are the various ways of optimizing Lucene performance?
Should I use a caching API to store my Lucene search queries so that I save the overhead of building the query again?
Have you looked at
Lucene Optimization Tip: Reuse Searcher
Advanced Text Indexing with Lucene
Should an index be optimised after incremental indexes in Lucene?
Quick tips:
Keep the size of the index small. Eliminate norms and term vectors when they are not needed. Set the Store flag for a field only if it is a must (see the sketch after these tips).
An obvious but oft-repeated mistake: create only one instance of Searcher and reuse it.
Keep the index on fast disks. RAM, if you are paranoid.
Cheat: use RAMDirectory to load the entire index into RAM. Afterwards, everything is blazing fast. :)
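To make the first tip concrete, here is a hedged sketch of how those flags look on a field in the 3.x API (the field names are made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class LeanDocuments {
        public static Document build(String id, String body) {
            Document doc = new Document();
            // Stored but not analyzed: just an identifier for retrieval.
            doc.add(new Field("id", id,
                    Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
            // Searchable body: not stored, no norms, no term vectors.
            doc.add(new Field("body", body,
                    Field.Store.NO, Field.Index.ANALYZED_NO_NORMS,
                    Field.TermVector.NO));
            return doc;
        }
    }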
Lots of dead links in here.
These (somewhat official) resources are where I would start:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
I have found that the best answer to a performance question is to profile it. Guidelines are great, but there are so many variables that can impact performance, such as the size of your dataset, the types of queries you run, data types, and so on.
Get the NetBeans profiler or something similar and try things out in different ways. Use the articles linked by Mitch, but make sure you actually test what helps and what (often, surprisingly) hurts.
There is also a good chance that any performance differences you can get from Lucene will be minor compared to performance improvements in your code. The profiler will point that out as well.
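For quick before/after comparisons outside a full profiler, even a crude timing loop will tell you whether a change helped. A minimal sketch (the index path, field name, and iteration counts are placeholders):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class CrudeBenchmark {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                    FSDirectory.open(new File(args[0])));
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser(Version.LUCENE_35, "body",
                    new StandardAnalyzer(Version.LUCENE_35)).parse(args[1]);

            // Warm up caches first, then measure.
            for (int i = 0; i < 100; i++) searcher.search(query, 10);
            long start = System.nanoTime();
            for (int i = 0; i < 1000; i++) searcher.search(query, 10);
            long elapsed = System.nanoTime() - start;
            System.out.printf("avg %.3f ms per search%n",
                    elapsed / 1000 / 1e6);

            searcher.close();
            reader.close();
        }
    }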
For 64-bit machines, use MMapDirectory instead of RAMDirectory, as explained very well in a blog post by one of the core Lucene committers.
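A minimal sketch of opening a searcher over MMapDirectory in the 3.x API (the class name is made up for illustration):

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;

    public class MMapSearch {
        public static IndexSearcher open(File indexDir) throws Exception {
            // Maps the index files into virtual memory; the OS page cache
            // keeps the hot parts resident without copying onto the Java heap.
            Directory dir = new MMapDirectory(indexDir);
            return new IndexSearcher(IndexReader.open(dir));
        }
    }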