I am hoping someone can point me to how to benchmark my Lucene RAMDirectory index.
I have about 300-500K documents indexed (less than 80 characters per document) and I want to benchmark how fast the in-memory RAMDirectory is.
At a very high level, should this be tens or hundreds of queries per second?
Too many variables to even guess: the kind of queries you're running, your hardware, the makeup of your index, and so on. Those can make orders of magnitude of difference, so even a high-level guess would be meaningless.
You can take a look at Lucene's nightly benchmarks though, which use the Wikipedia English export (a much larger dataset than yours, of course, but it's something).
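If you just want a rough number for your own setup, a minimal micro-benchmark is easy to put together. The sketch below assumes an older Lucene release where RAMDirectory is still available (it was later deprecated in favor of mmap/heap alternatives); the field name, document text and sample query are made up for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class RamDirBenchmark {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        RAMDirectory dir = new RAMDirectory();

        // Index some short documents (stand-in for your 300-500K records).
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (int i = 0; i < 300_000; i++) {
                Document doc = new Document();
                doc.add(new TextField("body", "sample text number " + i, Field.Store.NO));
                writer.addDocument(doc);
            }
        }

        // Fire the same query repeatedly and measure queries per second.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("sample text");
            int iterations = 10_000;
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                searcher.search(query, 10);
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("~%.0f queries/sec%n", iterations / seconds);
        }
    }
}
```

Run the timing loop a couple of times before recording numbers so the JIT has warmed up; the first pass will be misleadingly slow, and the result still depends heavily on the query types mentioned above.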
Related
There are a couple of threads discussing the scalability of OptaPlanner, and I am wondering what the recommended approach is for dealing with very large datasets, in the range of millions of rows.
As the blog post below discusses, I am already using heuristics (Simulated Annealing + Tabu Search). The search space of the cloud balancing problem is c^p, but the feasible space is unknown/NP-complete.
http://www.optaplanner.org/blog/2014/03/27/IsTheSearchSpaceOfAnOptimizationProblemReallyThatBig.html
The problem I am trying to solve is similar to cloud balancing. But the main difference is in the input data: besides a list of computers and a list of processes, there is also a big two-dimensional "score list/table" which holds the score for each possible combination and needs to be loaded into memory.
In other words, besides the constraints between computers and processes that the planning needs to satisfy, different valid combinations yield different scores, and the higher the score the better.
It's a simple problem, but with hundreds of computers, 100k+ processes and a score table with a million+ combinations, it needs a lot of memory. Even though I could allocate more memory to increase the heap size, the planning can become very slow and struggle, as the steps are sorted with custom planning variable/entity comparator classes.
A straightforward solution is to divide the dataset into smaller subsets, run each of them individually and then combine the results, so that I could have multiple machines running at the same time, each on multiple threads. The biggest drawback of this approach is that the result produced is far from optimal.
I am wondering if there are any better solutions.
The MachineReassignment example also has a big "score combination" matrix. OptaPlanner doesn't care about that - those are just problem facts, and the DRL quickly matches the combination(s) that are picked for an assignment. The Solver.solve() causes no big memory consumption or performance impact.
However, loading the problem in your code (before calling Solver.solve()) does cause a huge memory consumption: understand that if n = 20k, then n² = 400m, and an int takes up 4 bytes, so for 20 000 elements that matrix is 1.6 GB in its most efficient uncompressed form, int[][] (both in Java and C++!). So for 20k reserve 2 GB RAM, for 40k reserve 8 GB RAM, and for 80k reserve 32 GB RAM. That scales badly.
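As a back-of-the-envelope check of that reasoning (the class and method names here are just for illustration):

```java
public class MatrixMemory {
    // Rough heap needed for an n-by-n int matrix, ignoring per-row object headers.
    static long matrixBytes(long n) {
        return n * n * 4L; // 4 bytes per int
    }

    public static void main(String[] args) {
        for (long n : new long[] {20_000, 40_000, 80_000}) {
            System.out.printf("n = %,d -> %.1f GB%n", n, matrixBytes(n) / 1e9);
        }
        // n = 20,000 -> 1.6 GB
        // n = 40,000 -> 6.4 GB
        // n = 80,000 -> 25.6 GB
        // The answer above rounds these up (2/8/32 GB) to leave headroom for the rest of the heap.
    }
}
```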
As for dealing with these big problems, I use combinations of techniques such as Nearby Selection (see my blog article on that), Partitioned Search (what you described; it will be supported out of the box in 7, but I've implemented it for customers in a CustomPhase) and Limited Selection Construction Heuristics (need to research that further; the plumbing is there, but it's usually overkill). Partitioned Search does indeed exclude optimal solutions, but above 10k planning entities the quality-versus-time trade-off clearly favors Partitioned Search, given a reasonable solving time (minutes/hours/days instead of millennia). The trick is to keep the size of each partition big enough, above 1k entities (hence the use of Nearby Selection). Score calculation speed also matters a lot, of course.
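For the hand-rolled partitioning approach described in the question (before out-of-the-box Partitioned Search), a generic sketch of the idea is below. This is not the OptaPlanner API: `PartitionSolver.solve` stands in for whatever builds a Solver and calls Solver.solve() on one subset, and combining the partial results afterwards is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedSolving<P, S> {

    /** Solves each partition on its own thread; the caller merges the partial results. */
    public List<S> solveAll(List<P> partitions, PartitionSolver<P, S> solvePartition)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<S>> futures = new ArrayList<>();
            for (P partition : partitions) {
                futures.add(pool.submit(() -> solvePartition.solve(partition)));
            }
            List<S> results = new ArrayList<>();
            for (Future<S> f : futures) {
                results.add(f.get()); // blocks until that partition is solved
            }
            return results; // combine these partial solutions afterwards
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in for "build a Solver and call solve() on this subset". */
    public interface PartitionSolver<P, S> {
        S solve(P partition) throws Exception;
    }
}
```

As the answer notes, keep each partition comfortably above 1k entities; too many tiny partitions make the combined result noticeably worse than a single solve.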
I have the frequency count of all the words from a file (that I am using to analyze and index data in Elasticsearch), and the frequency of words follows Zipf's law. How can I use this knowledge to improve my search over it? Rather, how can I use it to get anything done to my benefit?
I think this is a very interesting question, and I'm sad that it's gone without answer or comment for so long. Zipfian distribution is a phenomenon that occurs not only in language, but far beyond that.
Zipf and Pareto
Zipfian distribution, or Zipf's law, is in this case a rank-frequency distribution of words. Perhaps more importantly, the Pareto distribution implies that approximately 20% of words (the cause) account for roughly 80% of word occurrences (the outcome) in any given body, or bodies, of text. Lucene, the brain behind Elasticsearch, accounts for this in multiple ways, often going beyond Zipf's law itself. It's common that your results will follow a Zipfian distribution too.
Word frequency: least is best (usually)
One of the problems here is that in most bodies of text the most common words actually bear the least context, usually being articles or words with very limited meaning. The top 3 most common words in English are "the", "of", and "to". Elasticsearch actually comes with a list of stop words, which optimizes indexing by ignoring such words (a Lucene-level sketch follows the list below).
Elasticsearch stop words:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no,
not, of, on, or, such, that, the, their, then, there, these, they,
this, to, was, will, with
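At the Lucene level, the same idea looks roughly like the sketch below, assuming a recent Lucene (where CharArraySet lives in org.apache.lucene.analysis): build an analyzer with an explicit stop-word set so those high-frequency, low-context tokens never make it into the index. The word list is the one quoted above; the field name and directory are made up.

```java
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class StopWordIndexing {
    public static void main(String[] args) throws Exception {
        CharArraySet stopWords = new CharArraySet(Arrays.asList(
                "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
                "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
                "such", "that", "the", "their", "then", "there", "these",
                "they", "this", "to", "was", "will", "with"), true);

        // Tokens in the stop set are dropped during analysis, shrinking the index.
        StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);

        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "the quick brown fox is in the yard", Field.Store.NO));
            writer.addDocument(doc); // "the", "is", "in" are not indexed
        }
    }
}
```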
It's actually a common occurrence that the words that appear least frequently bear the most context. So you're likely going to be looking for the least frequent words when doing text search.
80:20 phenomenon
The thing is, Elasticsearch and Lucene are both built with these things in mind, and well optimised for them. A simple LRU eviction policy for caching index data actually works very well, as 80% of your searches will likely touch 20% of your actual index, making cache pollution both infrequent and low impact thanks to a predictable workload. So if you allocate a cache larger than about 20% of your total index size, you should be fine. If the data is not in cache, it will be read off the disk (usually via mmap), and you can optimize performance by having a drive with fast random reads (like an SSD).
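For illustration only, the LRU behaviour described here is trivially expressible with a LinkedHashMap in access order; this is a sketch of the eviction policy, not how Elasticsearch actually implements its caches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache: the least-recently-accessed entry is evicted first. */
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true makes get() refresh recency
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a Zipf-shaped workload, a cache sized around 20% of the working set keeps the hit rate high, which is exactly the 80:20 effect described above.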
More Reading
There is an interesting article on this. It's likely that the word-rank distribution in your data set looks very similar to that of most other data sets. So optimizing performance as well as relevance comes down to those few words which occur least often but are most likely to be searched for. These may be jargon specific to the demographic/profession your application is targeting.
Conclusion
These optimizations, however, could be premature. As I stated, Lucene and Elasticsearch both do their part to increase the effectiveness and efficiency of search with these principles in mind, and a simple LRU cache works very well in this case; LRU is both common (already part of ES) and relatively simple. Cases where extra tuning might be worthwhile are usually those where you have a lot of jargon or domain-specific language, or perhaps multilingual content. For something like a news site you'll likely want a broader solution, since you cover a huge spectrum of topics that include many different words and subjects. These are things to consider when you're configuring Elasticsearch. Tinkering with the analyzer can be complicated and hard to do effectively, especially if you have a large range of subjects with different terminologies that you need to index, but it will likely have the largest effect on increasing search relevance.
I've got 15 x 100 million 32-byte records. Only sequential access and appends are needed. The key is a Long. The value is a tuple: (Date, Double, Double). Is there something in this universe which can do this? I am willing to have 15 separate databases (SQL/NoSQL) or files, one for each set of 100 million records. I only have an i7 core, 8 GB RAM and a 2 TB hard disk.
I have tried PostgreSQL, MySQL, Kyoto Cabinet (with fine tuning) with Protostuff encoding.
SQL DBs (with indices) take forever to do the silliest query.
Kyoto Cabinet's B-Tree can handle up to 15-18 million records, beyond which appends take forever.
I am fed up so much that I am thinking of falling back on awk + CSV which I remember used to work for this type of data.
If your scenario means always going through all records in sequence, then using a database may be overkill. If you start to need random lookups, replacing/deleting records, or checking whether a new record is a duplicate of an older one, a database engine would make more sense.
For the sequential access, a couple of text files or hand-crafted binary files will be easier to handle. You sound like a developer, so I would probably go for my own binary format and access it with the help of memory-mapped files to improve the sequential read/append speed. No caching, just a sliding window to read the data. I think it would perform better, even on ordinary hardware, than any DB would; I did such a data analysis once. It would also be faster than awking CSV files; however, I am not sure by how much, and whether that would justify the effort of developing the binary storage in the first place.
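A minimal sketch of such a hand-rolled format, assuming each record is exactly 32 bytes (long key + date as epoch millis + two doubles), a plain append at the end of the file, and a memory-mapped window for sequential reads; the file name and window size are arbitrary.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class RecordFile {
    static final int RECORD_SIZE = 32; // 8 (key) + 8 (date millis) + 8 + 8 (doubles)

    /** Appends one fixed-size record to the end of the file. */
    static void append(FileChannel channel, long key, long dateMillis, double a, double b)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_SIZE);
        buf.putLong(key).putLong(dateMillis).putDouble(a).putDouble(b).flip();
        channel.write(buf, channel.size());
    }

    /** Maps a window of records and scans them sequentially. */
    static void scan(FileChannel channel, long fromRecord, int recordCount) throws IOException {
        MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY,
                fromRecord * RECORD_SIZE, (long) recordCount * RECORD_SIZE);
        for (int i = 0; i < recordCount; i++) {
            long key = window.getLong();
            long dateMillis = window.getLong();
            double a = window.getDouble();
            double b = window.getDouble();
            // process (key, dateMillis, a, b) here
        }
    }

    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("records.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            append(channel, 1L, System.currentTimeMillis(), 1.5, 2.5);
            scan(channel, 0, 1);
        }
    }
}
```

Because the records are fixed size, record i always lives at offset i * 32, so the "sliding window" is just a matter of remapping the next range as you go.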
As soon as a database becomes interesting, you can have a look at MongoDB and CouchDB. They are used for storing and serving very large amounts of data. (There is a flattering evaluation that compares one of them to traditional DBs.) Databases usually need reasonably powerful hardware to perform well; maybe you could check out how those two would do with your data.
--- Ferda
Ferdinand Prantl's answer is very good. Two points:
Given your requirements, I recommend that you create a very tight binary format. This will be easy to do because your records are fixed size.
If you understand your data well, you might be able to compress it. For example, if your key is an increasing log value, you don't need to store it entirely. Instead, store the difference from the previous value (which is almost always going to be one). Then use a standard compression algorithm/library to save on data size big time.
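A hedged sketch of that delta idea: store only the differences between consecutive keys, which a general-purpose compressor such as GZIP then squeezes very well because the stream is mostly the same small value repeated.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class DeltaEncoding {
    /** Writes the first key in full, then only the difference from the previous key. */
    static byte[] deltaCompress(long[] sortedKeys) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            long previous = 0;
            for (long key : sortedKeys) {
                out.writeLong(key - previous); // almost always 1 for an increasing log key
                previous = key;
            }
        }
        return bytes.toByteArray();
    }
}
```

A variable-length encoding of the deltas would shrink things further, but even the naive version above compresses dramatically when nearly every delta is 1.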
For sequential reads and writes, leveldb will handle your dataset pretty well.
I think that's about 48 GB of data in one table (15 x 100 million records x 32 bytes = 48 GB).
When you get into large databases, you have to look at things a little differently. With an ordinary database (say, tables less than a couple million rows), you can do just about anything as a proof of concept. Even if you're stone ignorant about SQL databases, server tuning, and hardware tuning, the answer you come up with will probably be right. (Although sometimes you might be right for the wrong reason.)
That's not usually the case for large databases.
Unfortunately, you can't just throw 1.5 billion rows straight at an untuned PostgreSQL server, run a couple of queries, and say, "PostgreSQL can't handle this." Most SQL DBMSs have ways of dealing with lots of data, and most people don't know that much about them.
Here are some of the things that I have to think about when I have to process a lot of data over the long term. (For short-term or one-off processing, it's usually not worth caring a lot about speed. A lot of companies won't invest in more RAM or a dozen high-speed disks--or even a couple of SSDs--for even a long-term solution, let alone a one-time job.)
Server CPU.
Server RAM.
Server disks.
RAID configuration. (RAID 3 might be worth looking at for you.)
Choice of operating system. (64-bit vs 32-bit, BSD v. AT&T derivatives)
Choice of DBMS. (Oracle will usually outperform PostgreSQL, but it costs.)
DBMS tuning. (Shared buffers, sort memory, cache size, etc.)
Choice of index and clustering. (Lots of different kinds nowadays.)
Normalization. (You'd be surprised how often 5NF outperforms lower NFs. Ditto for natural keys.)
Tablespaces. (Maybe putting an index on its own SSD.)
Partitioning.
I'm sure there are others, but I haven't had coffee yet.
But the point is that you can't determine whether, say, PostgreSQL can handle a 48 gig table unless you've accounted for the effect of all those optimizations. With large databases, you come to rely on the cumulative effect of small improvements. You have to do a lot of testing before you can defensibly conclude that a given dbms can't handle a 48 gig table.
Now, whether you can implement those optimizations is a different question--most companies won't invest in a new 64-bit server running Oracle and a dozen of the newest "I'm the fastest hard disk" hard drives to solve your problem.
But someone is going to pay either for optimal hardware and software, for dba tuning expertise, or for programmer time and waiting on suboptimal hardware. I've seen problems like this take months to solve. If it's going to take months, money on hardware is probably a wise investment.
I am trying to create a Lucene index of around 2 million records. The indexing time is around 9 hours.
Could you please suggest how to increase performance?
I wrote a terrible post on how to parallelize a Lucene Index. It's truly terribly written, but you'll find it here (there's some sample code you might want to look at).
Anyhow, the main idea is that you chunk up your data into sizable pieces, and then work on each of those pieces on a separate thread. When each of the pieces is done, you merge them all into a single index.
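A rough sketch of that chunk-and-merge pattern, assuming a 5.x-or-later style API (this is not the code from the post; the field name, chunk loading and directories are simplified placeholders): each worker builds its own index in a separate directory, and IndexWriter.addIndexes then merges them into the final index.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        List<List<String>> chunks = loadChunks(); // split your records into N pieces
        ExecutorService pool = Executors.newFixedThreadPool(chunks.size());

        // 1. Index each chunk into its own directory on a separate thread.
        List<Directory> partDirs = new ArrayList<>();
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            Directory partDir = FSDirectory.open(Paths.get("index-part-" + i));
            partDirs.add(partDir);
            List<String> chunk = chunks.get(i);
            futures.add(pool.submit(() -> {
                try (IndexWriter writer = new IndexWriter(partDir,
                        new IndexWriterConfig(new StandardAnalyzer()))) {
                    for (String text : chunk) {
                        Document doc = new Document();
                        doc.add(new TextField("body", text, Field.Store.NO));
                        writer.addDocument(doc);
                    }
                    return null;
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for all chunks to finish
        }
        pool.shutdown();

        // 2. Merge the per-chunk indexes into one final index.
        try (IndexWriter merged = new IndexWriter(FSDirectory.open(Paths.get("index-final")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            merged.addIndexes(partDirs.toArray(new Directory[0]));
        }
        for (Directory partDir : partDirs) {
            partDir.close();
        }
    }

    static List<List<String>> loadChunks() {
        // Placeholder: replace with however you read and split your 2 million records.
        return Arrays.asList(Arrays.asList("first chunk doc"), Arrays.asList("second chunk doc"));
    }
}
```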
With the approach described above, I'm able to index 4+ million records in approx. 2 hours.
Hope this gives you an idea of where to go from here.
Apart from the writing side (merge factor) and the computation aspect (parallelizing), this is sometimes due to the simplest of reasons: slow input. Many people build a Lucene index from a database. Sometimes a particular query for this data is too complicated and slow to actually return all the (2 million?) records quickly. Try running just the query and writing the results to disk; if it's still on the order of 5-9 hours, you've found the place to optimize (the SQL).
The following article really helped me when I needed to speed things up:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
I found that document construction was our primary bottleneck. After optimizing data access and implementing some of the other recommendations, I was able to substantially increase indexing performance.
The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together.
http://search-lucene.blogspot.com/2008/08/indexing-speed-factors.html
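The article and blog post above describe the older API, where mergeFactor sat directly on IndexWriter. In more recent Lucene versions the equivalent knobs live on IndexWriterConfig (the RAM buffer size controls how much is buffered in memory before flushing a segment) and on the merge policy (the mergeFactor). A hedged sketch; the actual values are something you would tune for your own data:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class TunedWriter {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

        // Buffer more documents in RAM before flushing a segment to disk.
        config.setRAMBufferSizeMB(256.0);

        // Merge fewer, larger segments less often (the old "mergeFactor" knob).
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(30);
        config.setMergePolicy(mergePolicy);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")), config)) {
            // addDocument(...) calls go here
        }
    }
}
```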
What are the various ways of optimizing Lucene performance?
Shall I use a caching API to store my Lucene search query, so that I save the overhead of building the query again?
Have you looked at
Lucene Optimization Tip: Reuse Searcher
Advanced Text Indexing with Lucene
Should an index be optimised after incremental indexes in Lucene?
Quick tips:
Keep the size of the index small. Eliminate norms and term vectors when not needed. Set the Store flag for a field only if it is a must (see the sketch after this list).
Obvious, but an oft-repeated mistake: create only one instance of Searcher and reuse it.
Keep the index on fast disks. RAM, if you are paranoid.
Cheat. Use RAMDirectory to load the entire index into RAM. Afterwards, everything is blazing fast. :)
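For the first tip, a hedged illustration of trimming per-field overhead in a recent Lucene: norms and term vectors are switched off on a custom FieldType, and the value is not stored unless you actually need to retrieve it. The field name is made up.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

public class LeanField {
    public static Document leanDocument(String text) {
        // Start from an indexed-but-not-stored text field and strip the extras.
        FieldType lean = new FieldType(TextField.TYPE_NOT_STORED);
        lean.setOmitNorms(true);         // no norms: smaller index, no length normalization
        lean.setStoreTermVectors(false); // no term vectors unless you need highlighting etc.
        lean.freeze();

        Document doc = new Document();
        doc.add(new Field("body", text, lean));
        return doc;
    }
}
```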
Lots of dead links in here.
These (somewhat official) resources are where I would start:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
I have found that the best answer to a performance question is to profile it. Guidelines are great, but there are so many variables that can impact performance, such as the size of your dataset, the types of queries you are doing, datatypes, etc.
Get the NetBeans profiler or something similar and try things out in different ways. Use the articles linked to by Mitch, but make sure you actually test what helps and what (often surprisingly) hurts.
There is also a good chance that any performance differences you can get from Lucene will be minor compared to performance improvements in your code. The profiler will point that out as well.
For 64-bit machines, use MMapDirectory instead of RAMDirectory, as explained very well here by one of the core Lucene committers.
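A minimal example of opening an on-disk index through mmap (the index path is a placeholder):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class MmapSearch {
    public static void main(String[] args) throws Exception {
        // The OS page cache keeps the hot parts of the index in RAM for us.
        try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // reuse this searcher across queries instead of reopening per request
        }
    }
}
```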