Loading whole Solr index into RAM - Lucene

I know there is a post on loading an index into RAM for Lucene:
Faster search in Lucene - Is there a way to keep the whole index in RAM?
But I really need it for Solr, to improve search speed. Any pointers would be helpful :)
Thanks

I don't think this is a good idea. It's like asking to improve speed in Windows by disabling its swapfile. Solr implements some very efficient caching mechanisms on top of Lucene, and there's also the file system cache.
If you have speed issues with Solr, this is not the solution. Please post another question detailing your problems and let us recommend a proper solution.
See also: XY Problem
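If the real goal is faster searches, tuning Solr's built-in caches in solrconfig.xml usually gets you further than trying to force the index into RAM. A minimal fragment as a sketch (the class names are real Solr cache implementations, but the sizes below are illustrative placeholders, not recommendations):

```xml
<!-- solrconfig.xml (fragment) - sizes are illustrative only -->
<query>
  <!-- caches filter results (fq params); often the biggest win -->
  <filterCache class="solr.FastLRUCache" size="4096" initialSize="1024" autowarmCount="256"/>
  <!-- caches ordered doc-id lists for recent queries -->
  <queryResultCache class="solr.LRUCache" size="1024" initialSize="256" autowarmCount="128"/>
  <!-- caches stored fields for recently returned documents -->
  <documentCache class="solr.LRUCache" size="2048" initialSize="512"/>
</query>
```

On top of this, the operating system's file-system cache will keep the hot parts of the index in RAM automatically if the machine has memory to spare.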

Related

Is manifold cf a good option for Google Drive indexing?

I am using the Apache ManifoldCF open source project for indexing documents from Google Drive into my Solr. I have often seen it be quite inconsistent in indexing the data. It also takes time for even a small number of documents to be reflected in Solr. Do you really think it's a good option for indexing Google Drive?
It is currently a bit on the slow side, due to response-time and throttling constraints from Google Drive itself. But this limit can probably be relieved if you buy additional bandwidth from Google. With the current setup, if you are looking to index a large set of documents in Google Drive, it may not be as quick as you expect.
ManifoldCF is good for crawling a file system. You can go for Apache Nutch if you are interested in web crawling.
Yes, ManifoldCF does take a lot of time to reflect a small number of documents. It also has very little documentation. However, you can join the mailing list, where you can ask questions of the lead developer, Karl. He is very helpful and usually answers within a few hours.
P.S.: I have worked with ManifoldCF on a project for a span of 10 months.

IndexWriter in Lucene 3.5 is slower?

We are upgrading our search infrastructure from Lucene 2.3.1 to Lucene 3.5. While load testing, I found that Lucene 2.3.1 could index 32,000 docs per second, whereas Lucene 3.5 indexes only around 17,000 docs per second.
Both use the standard analyzer and the default settings. Is 3.5 slower because it indexes more details, thereby resulting in faster search? Ours is a log-management product, and indexing speed is highly important.
To cut a long story short: will the slower indexing of 3.5 result in higher search speed? If not, what else should I fine-tune to improve the indexing speed?
Have you looked at ImproveIndexingSpeed on Lucene wiki?
Otherwise, please share some details on your setup so that we can help you:
how often do you commit?
what MergePolicy, ramBuffer size, and mergeFactor do you use?
how many indexing threads do you spawn?
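To make those knobs concrete, here is a minimal sketch of where each one is set in the Lucene 3.5 API via IndexWriterConfig (the values are placeholders to experiment with, not recommendations; requires the Lucene 3.5 jars on the classpath):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TunedWriter {
    public static IndexWriter open(File path) throws Exception {
        Directory dir = FSDirectory.open(path);
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        cfg.setRAMBufferSizeMB(256.0);   // larger buffer -> fewer flushes to disk
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMergeFactor(30);           // higher -> faster indexing, slower searching
        cfg.setMergePolicy(mp);
        // commit() sparingly from the caller; every commit forces an fsync
        return new IndexWriter(dir, cfg);
    }
}
```

Multiple indexing threads can share the one IndexWriter; it is thread-safe, and adding documents from several threads usually scales better than raising any single setting.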

Next gen data indexers

Are there any new technologies for indexing and full-text + attribute data search? Better than Sphinx, Lucene, etc.?
Maybe some new products in early betas?
By better I mean faster with a HUGE amount of data (100M+ records): less memory usage, faster search, maybe with some built-in scalability features...
Thanks in advance, guys!
Could you provide more details on what disappoints you about Sphinx?
Actually, Sphinx can easily handle even a 1B+ document collection and has built-in scalability features.

Katta in production environment

According to the website, Katta is "a scalable, failure tolerant, distributed, indexed data storage."
I would like to know if it is ready to be deployed into a production environment. Is anyone already using it who has advice? Any pitfalls? Recommendations? Testimonials? Please share.
Any answer would be greatly appreciated.
We have tried using Katta and, for what it's worth, found it very stable and relatively easy to manage (compared to managing plain-vanilla Lucene).
The only pitfall I can think of is the lack of real-time updates. When we tested it (about 9-10 months back), an update meant rebuilding the index with a separate process (a Hadoop job or what have you...) and then replacing the live index; this was a deal-breaker for us.
If you are looking into distributed Lucene, you should really try out ElasticSearch or Solandra.

Optimizing Lucene performance

What are the various ways of optimizing Lucene performance?
Shall I use a caching API to store my Lucene search query so that I save the overhead of building the query again?
Have you looked at:
Lucene Optimization Tip: Reuse Searcher
Advanced Text Indexing with Lucene
Should an index be optimised after incremental indexes in Lucene?
Quick tips:
Keep the size of the index small. Eliminate norms and term vectors when not needed. Set the Store flag for a field only if it is a must.
An obvious but oft-repeated mistake: create only one instance of Searcher and reuse it.
Keep the index on fast disks. RAM, if you are paranoid.
Cheat. Use RAMDirectory to load the entire index into RAM. Afterwards, everything is blazing fast. :)
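The "one Searcher" tip is worth a sketch. A naive holder that shares a single IndexSearcher across requests might look like this (Lucene 3.x API; reopening the reader when the index changes is left out for brevity, and the class name is mine, not a Lucene API):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

public class SearcherHolder {
    private static volatile IndexSearcher searcher;

    // Call once at startup. Creating a new Searcher per query throws
    // away the field caches that warm up inside the underlying reader.
    public static void init(Directory dir) throws Exception {
        searcher = new IndexSearcher(IndexReader.open(dir));
    }

    // IndexSearcher is thread-safe, so one instance can serve all threads.
    public static IndexSearcher get() {
        return searcher;
    }
}
```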
Lots of dead links in here.
These (somewhat official) resources are where I would start:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
I have found that the best answer to a performance question is to profile it. Guidelines are great, but there are so many variables that can impact performance, such as the size of your dataset, the types of queries you are running, datatypes, etc.
Get the NetBeans profiler or something similar and try different approaches. Use the articles linked to by Mitch, but make sure you actually test what helps and what (often surprisingly) hurts.
There is also a good chance that any performance differences you can get from Lucene will be minor compared to performance improvements in your code. The profiler will point that out as well.
For 64-bit machines, use MMapDirectory instead of RAMDirectory, as explained very well here by one of the core Lucene committers.
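A minimal sketch of opening a searcher over MMapDirectory with the Lucene 3.x API (requires the Lucene jars; the OS page cache then keeps the hot parts of the index in RAM on demand, without duplicating it on the Java heap the way RAMDirectory does):

```java
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class MMapExample {
    public static IndexSearcher open(File indexDir) throws Exception {
        // Memory-maps the index files into the process's virtual address
        // space; a 64-bit JVM has room to map even very large indexes.
        MMapDirectory dir = new MMapDirectory(indexDir);
        return new IndexSearcher(IndexReader.open(dir));
    }
}
```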