How to: Increase Lucene .net Indexing Speed - indexing

I am trying to create an lucene of around 2 million records. The indexing time is around 9 hours.
Could you please suggest how to increase performance?

I wrote a terrible post on how to parallelize a Lucene Index. It's truly terribly written, but you'll find it here (there's some sample code you might want to look at).
Anyhow, the main idea is that you chunk up your data into sizable pieces, and then work on each of those pieces on a separate thread. When each of the pieces is done, you merge them all into a single index.
With the approach described above, I'm able to index 4+ million records in approx. 2 hours.
Hope this gives you an idea of where to go from here.

Apart from the writing side (merge factor) and the computation aspect (parallelizing) this is sometimes due to the simplest of reasons: slow input. Many people build a Lucene index from a database of data. Sometimes you find that a particular query for this data is too complicated and slow to actually return all the (2 million?) records quickly. Try just the query and writing to disk, if it's still in the order of 5-9 hours, you've found a place to optimize (SQL).

The following article really helped me when I needed to speed things up:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
I found that document construction was our primary bottleneck. After optimizing data access and implementing some of the other recommendations, I was able to substantially increase indexing performance.

The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together.
http://search-lucene.blogspot.com/2008/08/indexing-speed-factors.html

Related

Is a Data-filled SQL table queryable while setting up a new index?

Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but it's effects on performance is nearly impossible for us to say. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, if your system has the IO/CPU capabilities of handling both tasks at the same time. Inserting however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.

Which optimize policy should I take on heavy non stop indexing with Elasticsearch?

I have a search engine application that parse feeds constantly and index the results in ES (Version 1.5.2).
I have an average of 3.5 million documents indexed.
The deleted documents percentage is about 40% sometimes and I am having some request timeouts while indexing (bulk).
Which optimize policy should I take?
Should I have to stop indexing once or multiple times a day to
optimize the index and reduce the percentage of deleted documents and
merge the segments?
Does the optimization process affects queries?
I would like to know which is the best solution for this use of case.
I am using a custom _id, I know it has performance issues, but it is not an option to change it sadly.
Thanks in advance
If some of your bulk index requests are timing out, that is indication that you need to lower the rate of indexing. Elasticsearch gurus advice not to use the optimize API. In the background segment merges happen which take care of getting rid of deleted documents automatically. Also never use optimize API if you have a high indexing rate. That will only cause more indexing requests to time out. And yes, optimize can also negatively affect search performance as it is a very resource intensive operation.
In a nutshell, just reduce your indexing rate. That should solve most of the problems you have mentioned here. Requests will not time out and deleted document percentage may also come down.

Lucene: fast(er) to get docs in bulk?

I try to build some real-time aggregates on Lucene as a part of experiment. Documents have their values stored in the index. This works very nice for up-to 10K documents.
For larger numbers of documents, this gets kinda slow. I assume there is not too much invested in getting bulk-amounts of documents, as this kind of defeats the purpose of a search engine.
However, it would be cool to be able to do this. So, basically my question is: what could I do to get documents faster from Lucene? Or are there smarter approaches?
I already only retrieve fields I need.
[edit]
The index is quite large >50GB. This does not fit in memory. The number of fields differ, I have several types of documents. Aggregation will mostly take place on a fixed document type; but there is no way to tell on beforehand which one.
Have you put the index in memory? If the entire index fits in memory, that is a huge speedup.
Once you get the hits (which comes back super quick even for 10k records), I would open up multiple threads/readers to access them.
Another thing I have done is store only some properties in Lucene (i.e. don't store 50 attributes from a class). You can get things faster sometimes just by getting a list of IDs and getting the other content from a service/database faster.

How many queries in one webpage?

How many queries in one webpage is good performance? If that page is home page that is viewed many times.
and how about....
$sql1 = mysql_query("SELECT * FROM a", $db1);
while($row = mysql_fetch_assoc($sql1)){
$sql2 = mysql_query("SELECT * FROM b WHERE aid='a'", $db2);
$a = mysql_fetch_assoc($sql2);
}
is it good? acctually I can combine $sql1 and $sql2 together by INNER JOIN but the problem is $sql1 is query data from database 1 and $sql2 is query data from database 2. and I use Parallels Plesk Panel that doesn't allow me to add same database user to multiple database.
If I use this code on my website, is it good? or anyway to do this?
Thanks...
Actually you have 2 questions in 1.
A general one and a particular one.
Both has obvious answers in my opinion.
How many queries in one webpage is good performance?
There is no direct connection between number of queries and performance. Database setup, architecture and tuning is responsible for the performance.
And number of queries should be caused by database architecture only. Use as many queries as many you need. Do not reduce number of queries at any cost, only in pursue of performance.
is it good?
Does it matter if you have no choice?
And another, unspoken question:
Should I be concerned about this code snippet performance?
Should you?
Do you have any performance issues at the moment?
If not - why to worry at all? Why to worry about this particular snippet, not any other one?
If yes - you have to profile your code first.
And then build your optimization strategy based on the profiling results. It may be number of queries, it may be proper indexing, clusterization, server upgrade.
Do not blind shoot. Take sensible steps.
I like to keep mine under 12.
In all seriousness though, that's pretty meaningless. If hypothetically there was a reason for you to have 800 queries in a page, then you could go ahead and do it. You'll probably find that the number of queries per page will simply be dependant on what you're doing, though in normal circumstances I'd be surprised to see over 50 (though these days, it can be hard to realise just how many you're doing if you are abstracting your DB calls away).
Slow queries matter more
I used to be frustrated at a certain PHP based forum software which had 35 queries in a page and ran really slow, but that was a long time ago and I know now that the reason that particular installation ran slow had nothing to do with having 35 queries in a page. For example, only one or two of those queries took most of the time. It just had a couple of really slow queries, that were fixed by well-placed indexes.
I think that identifying and fixing slow queries should come before identifying and eliminating unnecessary queries, as it can potentially make a lot more difference.
Consider even that 20 fast queries might be significantly quicker than one slow query - number of queries does not necessarily relate to speed. Sometimes, you can reduce load and speed up a page by splitting a slow query into multiple queries.
Try caching
There are various ways to cache parts of your application which can really cut down on the number of queries you do, without reducing functionality. Libraries like memcached make this trivially easy these days and yet run really fast. This can also help improve performance a lot more than reducing the number of queries.
If queries are really unnecessary, and the performance really is making a difference, then remove/combine them
Just consider looking for slow queries and optimizing them, or caching their results, first.
Measure it.
For the specific case outlined above, I'd combine to a join if possible.
In general, multiple queries per request is pretty normal.
Many sites have tens of requests per query and they are fairly performant.
Use a load tester like Apache bench. (If you have Apache installed, type ab to see the parameters)
I just had the same problem here.
The problem is that you use a query in a loop. If your record a has 10 rows, it makes 10 queries. If your record 'a' has 100 rows, it will make 100 queries. So the more rows your record 'a' has, the worse it gets.
The solution is to put the requests in an array and use the correct foreach loops to display the same thing with only 2 queries. I found This site which is really clear about this topic.

Performance Tuning PostgreSQL

Keep in mind that I am a rookie in the world of sql/databases.
I am inserting/updating thousands of objects every second. Those objects are actively being queried for at multiple second intervals.
What are some basic things I should do to performance tune my (postgres) database?
It's a broad topic, so here's lots of stuff for you to read up on.
EXPLAIN and EXPLAIN ANALYZE is extremely useful for understanding what's going on in your db-engine
Make sure relevant columns are indexed
Make sure irrelevant columns are not indexed (insert/update-performance can go down the drain if too many indexes must be updated)
Make sure your postgres.conf is tuned properly
Know what work_mem is, and how it affects your queries (mostly useful for larger queries)
Make sure your database is properly normalized
VACUUM for clearing out old data
ANALYZE for updating statistics (statistics target for amount of statistics)
Persistent connections (you could use a connection manager like pgpool or pgbouncer)
Understand how queries are constructed (joins, sub-selects, cursors)
Caching of data (i.e. memcached) is an option
And when you've exhausted those options: add more memory, faster disk-subsystem etc. Hardware matters, especially on larger datasets.
And of course, read all the other threads on postgres/databases. :)
First and foremost, read the official manual's Performance Tips.
Running EXPLAIN on all your queries and understanding its output will let you know if your queries are as fast as they could be, and if you should be adding indexes.
Once you've done that, I'd suggest reading over the Server Configuration part of the manual. There are many options which can be fine-tuned to further enhance performance. Make sure to understand the options you're setting though, since they could just as easily hinder performance if they're set incorrectly.
Remember that every time you change a query or an option, test and benchmark so that you know the effects of each change.
Actually there are some simple rules which will get you in most cases enough performance:
Indices are the first part. Primary keys are automatically indexed. I recommend to put indices on all foreign keys. Further put indices on all columns which are frequently queried, if there are heavily used queries on a table where more than one column is queried, put an index on those columns together.
Memory settings in your postgresql installation. Set following parameters higher:
.
shared_buffers, work_mem, maintenance_work_mem, temp_buffers
If it is a dedicated database machine you can easily set the first 3 of these to half the ram (just be carefull under linux with shared buffers, maybe you have to adjust the shmmax parameter), in any other cases it depends on how much ram you would like to give to postgresql.
http://www.postgresql.org/docs/8.3/interactive/runtime-config-resource.html
http://wiki.postgresql.org/wiki/Performance_Optimization
The absolute minimum I'll recommend is the EXPLAIN ANALYZE command. It will show a breakdown of subqueries, joins, et al., all the time showing the actual amount of time consumed in the operation. It will also alert you to sequential scans and other nasty trouble.
It is the best way to start.
Put fsync = off in your posgresql.conf, if you trust your filesystem, otherwise each postgresql operation will be imediately written to the disk (with fsync system call).
We have this option turned off on many production servers since quite 10 years, and we never had data corruptions.