I have a Lucene index (not very small, about 20 GB) and I noticed that the .tim file is about 16 GB. I know that this is the terms dictionary. Is there any way to reduce its size, or even to get rid of it entirely, and if so, what would I lose by doing that?
Thanks
Is it possible to tell Lucene to write its segments sequentially and at a fixed size? That way we would avoid merges, which are heavy for large segments. Lucene has the LogMergePolicy classes with similar functionality, which let you set a maximum segment size by document count or file size, but that is just a limit for merges.
You could use the NRTCachingDirectory to do the small segment merges in memory and only write them out to disk once they reach ~256MiB or so.
But fundamentally the merges are necessary, since data structures like the FST are write-once and can only be "modified" by creating a new one.
Maybe this can be combined with NoMergePolicy on the filesystem directory so that it performs no further merges, but that will give pretty bad query performance.
Maybe do merges manually and somehow merge them all at once (by setting TieredMergePolicy.setMaxMergeAtOnceExplicit()).
But merging is just a cost of doing business; it's probably better to get used to it and tune the MergePolicy to your workload.
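To make that tuning concrete, here is a rough sketch of wiring these pieces together, assuming Lucene 3.x-era APIs; the index path and the size/count thresholds are made-up illustration values, not recommendations:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;
    import org.apache.lucene.util.Version;

    import java.io.File;

    public class MergeTuningSketch {
        public static void main(String[] args) throws Exception {
            // Wrap the on-disk directory so small, freshly flushed/merged segments
            // stay in RAM and only reach disk once they exceed the thresholds.
            Directory disk = FSDirectory.open(new File("/path/to/index")); // made-up path
            NRTCachingDirectory cached = new NRTCachingDirectory(disk, 256.0, 512.0);

            // Tune merging rather than avoiding it: cap merged segment size and
            // let explicit (forced) merges combine more segments in one go.
            TieredMergePolicy mp = new TieredMergePolicy();
            mp.setMaxMergedSegmentMB(5 * 1024);   // illustrative ~5 GiB ceiling
            mp.setMaxMergeAtOnceExplicit(30);     // illustrative value

            IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_36,
                    new StandardAnalyzer(Version.LUCENE_36));
            conf.setMergePolicy(mp);
            // Or disable background merges entirely (many small segments, slower queries):
            // conf.setMergePolicy(NoMergePolicy.COMPOUND_FILES);

            IndexWriter writer = new IndexWriter(cached, conf);
            // ... add documents ...
            writer.close();
        }
    }

Whether the NoMergePolicy variant is tolerable depends entirely on how many segments your query load can stand.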
While using Lucene for full-text search, I want to keep the index in memory. I read that the index size can be at most 2 GB and that if it exceeds that, we will get an OutOfMemoryException. Will using a MultiSearcher serve as a solution to this? With a MultiSearcher we also create multiple indexes, don't we?
I don't believe there is a hard limit on RAM index size, other than the space allotted to the JVM. Combining indexes with a MultiReader won't help you overcome not having enough memory available to the JVM (unless you are planning to build, and subsequently discard, indexes as needed, or something like that, but I'm guessing that is not the case).
See this question: Increase heap size in java, for how to give it more space.
Also, Mike McCandless wrote a blog post on this topic that might be of interest to you.
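For what it's worth, here is a minimal sketch of holding an index in heap memory with the Lucene 3.x API (the index path is made up); the practical cap is whatever you give the JVM, e.g. via -Xmx, not a fixed 2 GB:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    import java.io.File;

    public class InMemoryIndexSketch {
        public static void main(String[] args) throws Exception {
            // Copy an existing on-disk index into the JVM heap; the only real limit
            // is heap size, so start the JVM with enough headroom (e.g. -Xmx6g).
            RAMDirectory ram = new RAMDirectory(FSDirectory.open(new File("/path/to/index"))); // made-up path

            IndexReader reader = IndexReader.open(ram);
            IndexSearcher searcher = new IndexSearcher(reader);
            // ... run queries as usual ...
            searcher.close();
            reader.close();
        }
    }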
I am learning "Lucene in Action". It is said that in order to search the contents of files you need to index the files. I am not much clear on indexing files.
How much file space does indexing 1 GB of documents (like doc,xls,pdb) take?
How long will it take to index these files?
Do we need to update the index every day?
Q> How much file space does indexing 1 GB of documents (like doc, xls, pdb) take?
A> Your question is too vague. Documents and spreadsheets can vary from virtually nothing to tens or even hundreds of megabytes. It also depends on the analyzer you are going to use and many other factors (e.g. whether fields are only indexed or indexed and stored). You can use this spreadsheet for a rough estimate, plus add some extra space for merges.
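As an illustration of the indexed-only vs. indexed-and-stored distinction, here is a sketch using the Lucene 3.x Field API (the field names and index path are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import java.io.File;

    public class FieldStorageSketch {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_36,
                    new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")), conf); // made-up path

            Document doc = new Document();
            // Indexed AND stored: searchable, and the original text is kept in the index (larger index).
            doc.add(new Field("title", "Quarterly report", Field.Store.YES, Field.Index.ANALYZED));
            // Indexed only: searchable, but the raw text is not kept, so the index stays smaller.
            doc.add(new Field("body", "Full extracted text of the document ...", Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);

            writer.close();
        }
    }

The choice mainly affects whether you can read the original text back out of the index; both variants are searchable.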
Q> How long will it take to index these files?
A> Again, it depends on how much content there is. Generally speaking, indexing is fast. At the given link it went as fast as 95.8 GB/hour, but I assume conversion from doc/xls will add some cost (which is irrelevant to Lucene, by the way).
Q> Do we need to update the index every day?
A> It is up to you. If you don't update the index, you will keep getting the same search results. There's no magic way for new or updated content to get into the index without updating it.
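If you do update, a minimal sketch of the usual pattern follows (the "id" field name is hypothetical; it assumes you index one unique-key field per document):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class IndexUpdateSketch {
        // updateDocument deletes any document whose "id" field matches the term,
        // then adds the new version, so re-running this for changed files keeps
        // the index current without creating duplicates.
        static void reindex(IndexWriter writer, String id, Document newVersion) throws Exception {
            writer.updateDocument(new Term("id", id), newVersion);
            writer.commit();
        }
    }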
I'm trying to build some real-time aggregates on Lucene as part of an experiment. The documents have their values stored in the index. This works very nicely for up to 10K documents.
For larger numbers of documents this gets kind of slow. I assume not much effort has been invested in retrieving documents in bulk, as this kind of defeats the purpose of a search engine.
However, it would be cool to be able to do this. So, basically my question is: what could I do to get documents faster from Lucene? Or are there smarter approaches?
I already only retrieve fields I need.
[edit]
The index is quite large, >50 GB. This does not fit in memory. The number of fields differs; I have several types of documents. Aggregation will mostly take place on a fixed document type, but there is no way to tell beforehand which one.
Have you put the index in memory? If the entire index fits in memory, that is a huge speedup.
Once you get the hits (which come back super quickly even for 10k records), I would open up multiple threads/readers to access them.
Another thing I have done is store only some properties in Lucene (i.e. don't store 50 attributes from a class). Sometimes you can get things faster just by getting a list of IDs and fetching the other content from a service/database.
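A minimal sketch of that ID-only approach (the "id" field name is made up); once you have the IDs, a service or database lookup, possibly spread over several threads, supplies the heavy content:

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    import java.util.ArrayList;
    import java.util.List;

    public class IdOnlyRetrievalSketch {
        // Collect only the stored "id" field for each hit; everything else
        // comes from a database/service instead of Lucene's stored fields.
        static List<String> collectIds(IndexSearcher searcher, Query query, int maxHits) throws Exception {
            TopDocs hits = searcher.search(query, maxHits);
            List<String> ids = new ArrayList<String>();
            for (ScoreDoc sd : hits.scoreDocs) {
                ids.add(searcher.doc(sd.doc).get("id"));
            }
            return ids;
        }
    }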
I have a Lucene-based application and, obviously, a problem.
When the number of indexed documents is low, no problems appear. When the number of documents increases, it seems that single words are not being indexed: searching for a single word (a single term) returns an empty result set.
The version of Lucene is 3.1 on a 64-bit machine, and the index is 10 GB.
Do you have any idea?
Thanks
According to the Lucene documentation, Lucene should be able to handle 274 billion distinct terms. I don't believe it is possible that you have reached that limitation in a 10GB index.
Without more information, it is difficult to help further. However, since you only see problems with large numbers of documents, I suspect you are running into an exceptional condition of some form, causing the system to fail to read or respond correctly. File handle leaks or memory overflow, perhaps, to take a stab in the dark.
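If it helps, here is a small diagnostic sketch (Lucene 3.x API; the index path and field name are made up) that dumps document counts and the first few terms of a field, which should quickly show whether single words are actually making it into the terms dictionary:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    import java.io.File;

    public class TermDumpSketch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index"))); // made-up path
            System.out.println("docs: " + reader.numDocs()
                    + " (deleted: " + (reader.maxDoc() - reader.numDocs()) + ")");

            // Walk the first few terms of one field to confirm they exist in the dictionary.
            TermEnum terms = reader.terms(new Term("body", "")); // "body" is a hypothetical field
            int shown = 0;
            do {
                Term t = terms.term();
                if (t == null || !"body".equals(t.field())) break;
                System.out.println(t.text() + " (docFreq=" + terms.docFreq() + ")");
                shown++;
            } while (shown < 20 && terms.next());
            terms.close();
            reader.close();
        }
    }

The Luke index browser shows the same information interactively, which may be quicker for a one-off check.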