What are the various ways of optimizing Lucene performance?
Should I use a caching API to store my Lucene search query so that I save the overhead of building the query again?
Have you looked at
Lucene Optimization Tip: Reuse Searcher
Advanced Text Indexing with Lucene
Should an index be optimised after incremental indexes in Lucene?
Quick tips:
Keep the size of the index small. Eliminate norms and term vectors when they are not needed. Set the Store flag for a field only if it is a must.
Obvious, but an oft-repeated mistake: create only one instance of Searcher and reuse it (see the sketch after these tips).
Keep the index on fast disks. RAM, if you are paranoid.
Cheat. Use RAMDirectory to load the entire index into the ram. Afterwards, everything is blazing fast. :)
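A minimal sketch of the "one Searcher, reused" tip in Java, assuming a recent Lucene version (5+) and a made-up index path; the RAMDirectory line is the optional "cheat" from the last tip (RAMDirectory was deprecated in later Lucene releases in favor of memory-mapping):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;

public class SharedSearcher {
    private static IndexSearcher searcher;

    // Open the index once and hand out the same IndexSearcher for every query.
    public static synchronized IndexSearcher get() throws Exception {
        if (searcher == null) {
            FSDirectory onDisk = FSDirectory.open(Paths.get("/path/to/index")); // placeholder path
            // Optional "cheat": copy the whole on-disk index into RAM at startup.
            Directory dir = new RAMDirectory(onDisk, IOContext.READ);
            searcher = new IndexSearcher(DirectoryReader.open(dir));
        }
        return searcher;
    }
}
```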
Lots of dead links in here.
These (somewhat official) resources are where I would start:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
I have found that the best answer to a performance question is to profile it. Guidelines are great, but there are so many variables that can impact performance, such as the size of your dataset, the types of queries you are doing, datatypes, etc.
Get the NetBeans profiler or something similar and try out different approaches. Use the articles linked to by Mitch, but make sure you actually test what helps and what (often surprisingly) hurts.
There is also a good chance that any performance differences you can get from Lucene will be minor compared to performance improvements in your code. The profiler will point that out as well.
For 64 bit machines use MMapDirectory instead of RAMDirectory as very well explained here by one of the core Lucene committers.
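For illustration, a small sketch of opening a searcher over MMapDirectory, assuming a recent Lucene Java version and a placeholder index path (note that FSDirectory.open() already chooses MMapDirectory on 64-bit platforms):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class MMapSearcherExample {
    public static void main(String[] args) throws Exception {
        // Memory-maps the index files; the OS page cache keeps hot parts in RAM
        // without putting the index on the Java heap.
        MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index")); // placeholder path
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        System.out.println("Docs in index: " + searcher.getIndexReader().numDocs());
    }
}
```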
Related
Are there any new technologies for indexing and full-text + attribute data search? Better than Sphinx, Lucene, etc.?
Maybe some new products in early betas?
Better - I mean faster with a HUGE amount of data (100M+ records) - less memory usage, faster search, etc., maybe with some built-in scalability features...
Thanks in advance guys!
Could you provide more details on what disappoints you about Sphinx?
Actually, Sphinx can easily handle even 1B+ record collections and has built-in scalability features.
I know there is a post on loading an index into RAM for Lucene.
Faster search in Lucene - Is there a way to keep the whole index in RAM?
But I really need it for Solr, to improve search speed. Any pointers would be helpful :)
thanks
I don't think this is a good idea. It's like asking to improve speed in Windows by disabling its swapfile. Solr implements some very efficient caching mechanisms on top of Lucene, and there's also the file system cache.
If you have speed issues with Solr this is not the solution. Please post another question detailing your problems and let us recommend you a proper solution.
See also: XY Problem
I am trying to create a Lucene index of around 2 million records. The indexing time is around 9 hours.
Could you please suggest how to increase performance?
I wrote a terrible post on how to parallelize a Lucene Index. It's truly terribly written, but you'll find it here (there's some sample code you might want to look at).
Anyhow, the main idea is that you chunk up your data into sizable pieces, and then work on each of those pieces on a separate thread. When each of the pieces is done, you merge them all into a single index.
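Not the code from that post, but a rough sketch of the chunk-and-merge idea, assuming a Lucene 5+ style API; indexChunk() is a made-up placeholder for whatever reads and indexes one slice of your data:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        int threads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Directory> parts = new ArrayList<>();

        // 1. Each worker indexes its own chunk of records into a private directory.
        for (int i = 0; i < threads; i++) {
            Directory part = FSDirectory.open(Files.createTempDirectory("part" + i));
            parts.add(part);
            final int chunk = i;
            pool.submit(() -> indexChunk(part, chunk, threads));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);

        // 2. Merge all the partial indexes into the final one.
        try (IndexWriter merged = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/final-index")), // placeholder path
                new IndexWriterConfig(new StandardAnalyzer()))) {
            merged.addIndexes(parts.toArray(new Directory[0]));
            merged.forceMerge(1); // optional: collapse to a single segment
        }
    }

    // Placeholder: a real implementation would read the records belonging to this
    // chunk (e.g. id % totalChunks == chunk) and add one Document per record.
    static void indexChunk(Directory dir, int chunk, int totalChunks) {
        try (IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("chunk", String.valueOf(chunk), Field.Store.YES));
            writer.addDocument(doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that IndexWriter is itself thread-safe, so on recent Lucene versions you can often get similar speedups by sharing one writer across several indexing threads; the separate-directories approach mainly helps when the chunks are built on different machines.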
With the approach described above, I'm able to index 4+ million records in approx. 2 hours.
Hope this gives you an idea of where to go from here.
Apart from the writing side (merge factor) and the computation aspect (parallelizing), this is sometimes due to the simplest of reasons: slow input. Many people build a Lucene index from a database. Sometimes the query that pulls this data is too complicated and slow to actually return all the (2 million?) records quickly. Try running just the query and writing the results to disk; if that alone is still in the order of 5-9 hours, you've found the place to optimize (the SQL).
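If you suspect the extraction query, time it in isolation with no Lucene work at all. A rough JDBC sketch, assuming a PostgreSQL source purely as an example; the connection details, table, and columns are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SourceQueryTimer {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        long rows = 0;
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement st = conn.createStatement()) {
            conn.setAutoCommit(false); // required for fetchSize to stream with the PostgreSQL driver
            st.setFetchSize(1000);     // fetch in batches instead of buffering the whole result set
            try (ResultSet rs = st.executeQuery("SELECT id, title, body FROM records")) {
                while (rs.next()) {
                    rows++; // just consume the rows; no document building, no indexing
                }
            }
        }
        System.out.printf("Fetched %d rows in %.1f s%n", rows, (System.nanoTime() - start) / 1e9);
    }
}
```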
The following article really helped me when I needed to speed things up:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
I found that document construction was our primary bottleneck. After optimizing data access and implementing some of the other recommendations, I was able to substantially increase indexing performance.
The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together.
http://search-lucene.blogspot.com/2008/08/indexing-speed-factors.html
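The article describes the old IndexWriter setters (setMergeFactor() and friends); on recent Lucene versions those knobs moved to IndexWriterConfig and the merge policy. A rough sketch assuming Lucene 5+, with example values rather than recommendations:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class TunedWriter {
    public static IndexWriter open(String path) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        // Buffer more documents in RAM before flushing a new segment to disk.
        cfg.setRAMBufferSizeMB(256);
        // Merge less aggressively: more segments per merge means fewer, larger merges.
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMergeFactor(30);
        cfg.setMergePolicy(mp);
        return new IndexWriter(FSDirectory.open(Paths.get(path)), cfg);
    }
}
```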
I am using the Lucene search API for a .NET web application.
What are the pros and cons of using MultiSearcher? In what scenarios should I use it?
Thanks for reading!
The main drawback to MultiSearcher is that there is some overhead in merging results from multiple searchers into a single set of hits. It's similar to the penalty you experience with an unoptimized index, although likely not as severe; it depends on how many searchers are involved.
However, MultiSearcher can be very useful if you have a lot of documents, or need to do frequent updates. If you have a huge database, it allows you to split your documents into groups to be indexed on separate machines in parallel, then searched together. If you need frequent updates, you might find a MultiSearcher that has one file-system directory and one RAM directory gives you fast index updates. New documents go into the RAM directory, and periodically the contents of the RAM directory are merged into the file-system directory.
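A sketch of that file-system-plus-RAM setup. The question is about Lucene.Net, where the class names mirror the Java ones, but this is written against the old Java 3.x API (MultiSearcher was removed in Lucene 4.0), and the index path is a placeholder:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class HybridSearch {
    public static MultiSearcher open(String mainIndexPath) throws Exception {
        // Large, rarely-changing index on disk.
        FSDirectory fsDir = FSDirectory.open(new File(mainIndexPath));

        // Small, frequently-updated index in RAM; new documents go here,
        // and its contents are periodically merged into the disk index.
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir,
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
        ramWriter.commit(); // create an empty index so it can be opened for searching

        // One searcher that sees both indexes as a single logical index.
        return new MultiSearcher(new Searchable[] {
                new IndexSearcher(IndexReader.open(fsDir)),
                new IndexSearcher(IndexReader.open(ramDir))
        });
    }
}
```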
Also consider ParallelMultiSearcher. Depending on your machine architecture and query load, this could hurt or help. If you have many cores, it is likely to help, but there is additional overhead involved with the threading, so it requires some profiling under representative loads.
Keep in mind that I am a rookie in the world of sql/databases.
I am inserting/updating thousands of objects every second. Those objects are actively being queried for at multiple second intervals.
What are some basic things I should do to performance tune my (postgres) database?
It's a broad topic, so here's lots of stuff for you to read up on.
EXPLAIN and EXPLAIN ANALYZE are extremely useful for understanding what's going on in your db engine (a small JDBC example follows this list)
Make sure relevant columns are indexed
Make sure irrelevant columns are not indexed (insert/update-performance can go down the drain if too many indexes must be updated)
Make sure your postgresql.conf is tuned properly
Know what work_mem is, and how it affects your queries (mostly useful for larger queries)
Make sure your database is properly normalized
VACUUM for reclaiming space from dead rows
ANALYZE for updating planner statistics (the statistics target controls how much is collected)
Persistent connections (you could use a connection manager like pgpool or pgbouncer)
Understand how queries are constructed (joins, sub-selects, cursors)
Caching of data (e.g. memcached) is an option
And when you've exhausted those options: add more memory, a faster disk subsystem, etc. Hardware matters, especially on larger datasets.
And of course, read all the other threads on postgres/databases. :)
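As a concrete starting point for the EXPLAIN tip, here is a small JDBC sketch; the connection details, table, and column are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainAnalyzeExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement st = conn.createStatement();
             // EXPLAIN ANALYZE actually executes the query and reports real row counts and timings.
             ResultSet rs = st.executeQuery(
                     "EXPLAIN ANALYZE SELECT * FROM objects WHERE owner_id = 42")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one line of the query plan per row
            }
        }
    }
}
```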
First and foremost, read the official manual's Performance Tips.
Running EXPLAIN on all your queries and understanding its output will let you know if your queries are as fast as they could be, and if you should be adding indexes.
Once you've done that, I'd suggest reading over the Server Configuration part of the manual. There are many options which can be fine-tuned to further enhance performance. Make sure to understand the options you're setting though, since they could just as easily hinder performance if they're set incorrectly.
Remember that every time you change a query or an option, test and benchmark so that you know the effects of each change.
Actually, there are some simple rules which will in most cases get you enough performance:
Indices are the first part. Primary keys are automatically indexed. I recommend putting indices on all foreign keys. Also put indices on all columns that are queried frequently; if there is a heavily used query on a table that filters on more than one column, put a composite index on those columns together (see the sketch at the end of this answer).
Memory settings in your PostgreSQL installation. Set the following parameters higher:
shared_buffers, work_mem, maintenance_work_mem, temp_buffers
If it is a dedicated database machine you can easily set the first three of these to half the RAM (just be careful under Linux with shared buffers; you may have to adjust the shmmax kernel parameter). In any other case it depends on how much RAM you would like to give to PostgreSQL.
http://www.postgresql.org/docs/8.3/interactive/runtime-config-resource.html
http://wiki.postgresql.org/wiki/Performance_Optimization
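To make the indexing advice concrete, a small sketch (connection details, table, and column names are made up): one index on a foreign-key column, and one composite index for a query that filters on two columns of the same table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateIndexesExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement st = conn.createStatement()) {
            // Index a foreign-key column that joins and lookups filter on.
            st.executeUpdate("CREATE INDEX idx_orders_customer_id ON orders (customer_id)");
            // Composite index for a heavily used query filtering on two columns together.
            st.executeUpdate("CREATE INDEX idx_orders_status_created ON orders (status, created_at)");
        }
    }
}
```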
The absolute minimum I'll recommend is the EXPLAIN ANALYZE command. It will show a breakdown of subqueries, joins, etc., along with the actual amount of time consumed by each operation. It will also alert you to sequential scans and other nasty trouble.
It is the best way to start.
Put fsync = off in your postgresql.conf if you trust your filesystem; otherwise each PostgreSQL operation will be immediately written to the disk (with the fsync system call).
We have had this option turned off on many production servers for about 10 years, and we have never had data corruption.