I'm trying to build some real-time aggregates on Lucene as part of an experiment. The documents have their values stored in the index. This works very nicely for up to 10K documents.
For larger numbers of documents, this gets rather slow. I assume not much effort has been invested in retrieving documents in bulk, as that somewhat defeats the purpose of a search engine.
However, it would be cool to be able to do this. So, basically my question is: what could I do to get documents faster from Lucene? Or are there smarter approaches?
I already only retrieve fields I need.
[edit]
The index is quite large (>50GB) and does not fit in memory. The number of fields differs, as I have several types of documents. Aggregation will mostly take place on a fixed document type, but there is no way to tell beforehand which one.
Have you put the index in memory? If the entire index fits in memory, that is a huge speedup.
Once you get the hits (which come back super quickly even for 10k records), I would open up multiple threads/readers to access them.
Another thing I have done is store only some properties in Lucene (i.e. don't store all 50 attributes from a class). Sometimes it is faster to just get a list of IDs from Lucene and fetch the rest of the content from a service/database.
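As a rough illustration of the answer above (load only the fields you need, and memory-map the index), here is a minimal sketch using a Lucene 8.x-style API; the index path and the "price" field are made up for the example:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.MMapDirectory;

import java.nio.file.Paths;
import java.util.Set;

public class AggregateSketch {
    public static void main(String[] args) throws Exception {
        // MMapDirectory lets the OS page cache keep the hot parts of a >50GB index in RAM.
        try (DirectoryReader reader = DirectoryReader.open(new MMapDirectory(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10_000);

            // Load only the single stored field the aggregate needs, not the whole document.
            Set<String> fieldsToLoad = Set.of("price");
            double sum = 0;
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc, fieldsToLoad);
                sum += Double.parseDouble(doc.get("price"));
            }
            System.out.println("sum = " + sum);
        }
    }
}
```

If the aggregates only ever touch numeric values, indexing them as doc values fields and reading those per segment is usually much faster than loading stored fields at all, since doc values are laid out column-wise for exactly this kind of access.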
I found myself with a project (very relaxed, little to no consequences on failure) that I think requires a database of some sort to solve. The problem is that, while I'm still quite inexperienced in general, I've never touched any database beyond the tutorials I could dig up with Google and setting up your average home cloud. I'm stuck on not knowing what I don't know.
This is roughly the situation:
Several hundred different automated test systems will frequently write small amounts of data into a database over a slow network. A few users will then infrequently retrieve large subsets of that data from the database over a slow network. The data will then be processed, which will require a large number of reads; very high performance is desired at that point.
This will be the data (orders of magnitude):
1000 products containing
10 variants containing
100 batches containing
100 objects containing
10 test-systems containing
100 test-steps containing
10 entries
It is basically a labeled B-tree with the test-steps as leaf nodes (since their format has been standardized).
A batch will always belong to one variant, an object will always belong to the same variant (but possibly multiple batches), and a variant will always belong to one product. There are hundreds of thousands of different test-steps.
Possible queries will try to get (e.g.):
Everything from a batch (optional: and the value of an entry within a range)
Everything from a variant
All test-steps of the type X and Y from a test-system with the name Z
As far as I can tell, rows that are hundreds of thousands of columns wide (containing everything described above) do not seem like a good idea, and neither does about a trillion rows (and the middle ground between the two still seems quite extreme).
I'd really like to leverage the hierarchical nature of the data, but everything I have found on things like nested databases says they're simply not a thing.
It'd be nice if you could help me with:
What to search for
What'd be a good approach to structure and store this data
Somewhere I can learn how to avoid the SQL horror stories, of which even I have found plenty
Whether there is a good way / best practice I should know of for transmitting the queried data and caching it locally for processing
Thank you and have a lovely day
Andreas
Search for "database normalization".
A normalized relational database is a fine structure.
If you want to avoid the horrors of SQL, you could also try a No-SQL Document-oriented Database, like MongoDB. I actually prefer this kind of database in a great many scenarios.
The database will cache your query results, and of course, whichever tool you use to query the database will cache the data in the tool's memory (or it will cache at least a subset of the query results if the number of results is very large). You can also write your results to a file. There are many ways to "cache", and they are all useful in different situations.
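To make the normalization suggestion concrete, here is a rough sketch of what a schema for the hierarchy could look like. All table and column names are hypothetical, the hierarchy is simplified, and SQLite over JDBC is used only for brevity; any relational database would be structured the same way:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:tests.db");
             Statement stmt = conn.createStatement()) {
            // Each level of the hierarchy becomes its own table, linked by foreign keys.
            stmt.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)");
            stmt.execute("CREATE TABLE variant (id INTEGER PRIMARY KEY, product_id INTEGER REFERENCES product(id), name TEXT)");
            stmt.execute("CREATE TABLE batch   (id INTEGER PRIMARY KEY, variant_id INTEGER REFERENCES variant(id))");
            stmt.execute("CREATE TABLE object  (id INTEGER PRIMARY KEY, variant_id INTEGER REFERENCES variant(id))");
            // An object can appear in multiple batches, so that relationship gets a join table.
            stmt.execute("CREATE TABLE batch_object (batch_id INTEGER REFERENCES batch(id), object_id INTEGER REFERENCES object(id))");
            stmt.execute("CREATE TABLE test_step (id INTEGER PRIMARY KEY, object_id INTEGER REFERENCES object(id), "
                       + "test_system TEXT, step_type TEXT, entry_name TEXT, entry_value REAL)");
            // Index the columns the example queries filter on.
            stmt.execute("CREATE INDEX idx_step ON test_step(step_type, test_system)");
        }
    }
}
```

The example queries then become joins plus WHERE clauses (e.g. all test_step rows for a batch via batch_object and object), which is exactly the kind of access pattern a normalized schema with the right indexes handles well.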
I have an index of 10 million documents, and as per a new requirement I need to add an extra field to my Solr schema.
So my question is: what would be the ideal approach? I mean, should I simply add the field to the schema and re-index all the data, do some kind of partial index, or something else?
The data has to be submitted regardless of what you do, so the answer would probably depend on the size of the rest of your fields. If they are small, just add the field and reindex everything.
If there's a very large field present, it might be more effective to let Solr fetch the content internally (i.e. a partial update) instead of submitting it over the network, but that requires that all fields are already defined as stored or with docValues.
It's impossible to say for sure, so you'll have to run a test with a small number of documents to see how performance looks with your dataset.
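If you do go the partial (atomic) update route, a minimal SolrJ sketch looks roughly like this; the core name, document id, and field values are made up, and again it only works when all existing fields are stored or have docValues:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.Collections;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-123");                                        // unique key of the existing document
            doc.addField("new_field", Collections.singletonMap("set", "value"));  // "set" marks this as an atomic update
            client.add(doc);
            client.commit();
        }
    }
}
```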
I currently find myself needing to do fairly simple computations on several million data points. (Constructing a large list of strings from a well-defined multi-gigabyte file, sorting that list, and then comparing it to another list, a superset.) This is the sort of simple work most of us normally do with the data entirely in memory, but the size and quantity of the units of data I need to work with could make RAM an issue if I try to keep everything in memory. I quickly realized I probably need to write the data to a file at a few points to avoid exhausting my system's resources. I decided to use SQLite3 for this. (This is probably a bit much for a CSV.) It is fairly lightweight, while its storage limits seem to safely exceed my requirements.
The problem I am having is understanding exactly how the result set works. The documentation I have come across seems a little vague on this. Obviously, SQLite is not writing a whole new table to the database every time a SELECT statement is executed. Does this mean it is duplicating all the selected fields in a complete in-memory table, or does it only keep some sort of pointers in memory (rather than the actual data)? Something else altogether?
I need to be able to sort the data in question. If the result set is really just an in-memory data structure, then simply creating a new table and populating it with the help of ORDER BY could be a bad idea.
SQLite does not really have result sets. It has cursors, which allow access to only the current row, and which cannot go backwards.
SQLite computes results on the fly, so only one row needs to be in memory at a time.
When a computation needs to access multiple rows (i.e., aggregate functions, or sorting without a usable index), as much data as possible is kept in the cache, and then spilled to disk in a temporary database.
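In practice (for example through JDBC; the table and column names here are made up) that means you can stream even a very large sorted result without holding it in your own memory: rows arrive one at a time through a forward-only cursor, and SQLite handles the ORDER BY itself, spilling to a temporary database if it has to:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamRowsSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT value FROM strings ORDER BY value")) {
            while (rs.next()) {                      // each call advances the cursor to the next row
                String value = rs.getString("value");
                // compare against the superset list here, one row at a time
            }
        }
    }
}
```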
We have an extremely large database of 30+ Million products, and need to query them to create search results and ad displays thousands of times a second. We have been looking into Sphinx, Solr, Lucene, and Elastic as options to perform these constant massive searches.
Here's what we need to do: take keywords and run them through the database to find the products that match most closely. We're going to be using our OWN algorithm to decide which products are most relevant for targeting our advertisements, but we know that these engines already have their own relevancy algorithms.
So, our question is how can we use our own algorithms on top of the engine's, efficiently. Is it possible to add them to the engines themselves as a module of some sort? Or would we have to rewrite the engine's relevancy code? I suppose we could implement the algorithm from the application by executing multiple queries, but this would really kill efficiency.
Also, we'd like to know which search solution would work best for us. Right now we're leaning towards Sphinx, but we're really not sure.
Also, would you recommend running these engines over MySQL, or would it be better to run them over some type of key-value store like Cassandra? Keep in mind there are 30 million records, and that number is likely to double as we move along.
Thanks for your responses!
I can't give you an entire answer, as I haven't used all the products, but I can say some things which might help.
Lucene/Solr uses a vector space model. I'm not certain what you mean by using your "own" algorithm, but if it gets too far away from the notion of tf/idf (say, by using a neural net) you're going to have difficulty fitting it into Lucene. If by your own algorithm you just mean you want to weight certain terms more heavily than others, that will fit in fine. Basically, Lucene stores information about how important a term is to a document. If you want to redefine the calculation of how important a term is, that's easy to do. If you want to get away from the whole notion of a term's importance to a document, that's going to be a pain.
Lucene (and as a result Solr) stores things in its own custom format. You don't need to use a database. 30 million records is not a remarkably large Lucene index (depending, of course, on how big each record is). If you do want to use a db, use Hadoop.
In general, you will want to use Solr instead of Lucene.
I have found it very easy to modify Lucene. But as my first bullet point said, if you want to use an algorithm that's not based on some notion of a term's importance to a document, I don't think Lucene will be the way to go.
I actually did something similar with Solr. I can't comment on the details, but basically the proprietary analysis/relevance step generated a series of search terms with associated boosts and fed them to Solr. I think this can be done with any search engine (they all support some sort of boosting).
Ultimately it comes down to what your particular analysis requires.
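A rough sketch of that "boosted terms" approach with SolrJ might look like the following; the field names, boost values, and core name are all hypothetical, and the point is only that your own algorithm chooses the weights while the engine applies them:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoostedSearchSketch {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            // Per-term boosts computed by your own relevance step, expressed in standard query syntax.
            SolrQuery query = new SolrQuery("title:laptop^4.0 description:laptop^1.5 brand:acme^2.0");
            query.setRows(20);
            QueryResponse response = client.query(query);
            System.out.println("hits: " + response.getResults().getNumFound());
        }
    }
}
```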
I am trying to create a Lucene index of around 2 million records. The indexing time is around 9 hours.
Could you please suggest how to increase performance?
I wrote a terrible post on how to parallelize a Lucene Index. It's truly terribly written, but you'll find it here (there's some sample code you might want to look at).
Anyhow, the main idea is that you chunk up your data into sizable pieces, and then work on each of those pieces on a separate thread. When each of the pieces is done, you merge them all into a single index.
With the approach described above, I'm able to index 4+ million records in approx. 2 hours.
Hope this gives you an idea of where to go from here.
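A condensed sketch of that chunk-then-merge idea is below (the paths, analyzer, and placeholder document loop are all made up): each worker thread builds its own small index, and at the end they are merged into one with IndexWriter.addIndexes.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexSketch {
    public static void main(String[] args) throws Exception {
        int chunks = 4;
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        List<Directory> partials = new ArrayList<>();

        for (int i = 0; i < chunks; i++) {
            Directory dir = FSDirectory.open(Paths.get("/tmp/index-part-" + i));
            partials.add(dir);
            final int chunk = i;
            pool.submit(() -> {
                try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                    // Index this chunk's slice of the records (placeholder loop).
                    for (int n = 0; n < 500_000; n++) {
                        Document doc = new Document();
                        doc.add(new StringField("id", chunk + "-" + n, Field.Store.YES));
                        writer.addDocument(doc);
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // Merge the partial indexes into the final one.
        try (IndexWriter merged = new IndexWriter(FSDirectory.open(Paths.get("/tmp/index-final")),
                                                  new IndexWriterConfig(new StandardAnalyzer()))) {
            merged.addIndexes(partials.toArray(new Directory[0]));
        }
    }
}
```

Note that IndexWriter is also thread-safe, so simply sharing one writer across several indexing threads is an even simpler variant of the same idea.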
Apart from the writing side (merge factor) and the computation aspect (parallelizing), this is sometimes due to the simplest of reasons: slow input. Many people build a Lucene index from a database. Sometimes you find that a particular query for this data is too complicated and slow to actually return all the (2 million?) records quickly. Try running just the query and writing the results to disk; if that alone is still on the order of 5-9 hours, you've found the place to optimize (the SQL).
The following article really helped me when I needed to speed things up:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
I found that document construction was our primary bottleneck. After optimizing data access and implementing some of the other recommendations, I was able to substantially increase indexing performance.
The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together.
http://search-lucene.blogspot.com/2008/08/indexing-speed-factors.html
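In newer Lucene versions the equivalent knobs live on IndexWriterConfig and the merge policy rather than as an instance variable on IndexWriter; a small sketch (hypothetical path and tuning values) might look like this:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class TunedWriterSketch {
    public static void main(String[] args) throws Exception {
        LogDocMergePolicy mergePolicy = new LogDocMergePolicy();
        mergePolicy.setMergeFactor(30);                 // merge segments less often than the default of 10

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
                .setRAMBufferSizeMB(256)                // buffer more documents in memory before flushing
                .setMergePolicy(mergePolicy);

        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), config)) {
            // add documents here ...
            writer.commit();
        }
    }
}
```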