Are there any new technologies for indexing and searching full-text plus attribute data? Anything better than Sphinx, Lucene, etc.?
Maybe some new products in early betas?
By better I mean faster with a huge amount of data (100M+ records): lower memory usage, faster search, etc., and maybe some built-in scalability features...
Thanks in advance guys!
Could you provide more details about what disappoints you in Sphinx?
Actually, Sphinx can handle even 1B+ document collections with ease and has built-in scalability features.
I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but I cannot see how to associate my own custom feature vector of precomputed weights with a document. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically, a zset is a sorted set of members, each with a score.
So you can have one sorted set per feature, mapping the feature to its scored postings:
feature -> [ { docid, score }, { docid, score }, ... ]
i.e.
zadd feature score docid
Redis then has some nice operators to merge sets, extract ranges, etc.; see ZUNIONSTORE and ZRANGE (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory, etc. (though Redis is not an embedded DB).
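If you were driving Redis from Java, a minimal sketch of this layout using the Jedis client might look like the following (the key names, weights, and doc IDs are invented for illustration):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Tuple;

public class RedisInvertedIndexSketch {
    public static void main(String[] args) {
        // Assumes a Redis server running on localhost:6379.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // One sorted set per feature: member = docid, score = weight.
            jedis.zadd("feature:color_red", 0.8, "doc:1");
            jedis.zadd("feature:color_red", 0.3, "doc:2");
            jedis.zadd("feature:size_large", 0.5, "doc:1");

            // Merge two feature postings into a temporary set; scores are summed by default.
            jedis.zunionstore("tmp:query", "feature:color_red", "feature:size_large");

            // Top 10 docids by combined score.
            for (Tuple t : jedis.zrevrangeWithScores("tmp:query", 0, 9)) {
                System.out.println(t.getElement() + " -> " + t.getScore());
            }
        }
    }
}

ZUNIONSTORE sums scores by default; its WEIGHTS option lets you scale each feature's contribution, which gets you close to a weighted-sum style score.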
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
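As a rough sketch of that payload idea (assuming a Lucene 4.x-or-later analysis API; the token format "feature|weight" is an invented convention), a custom TokenFilter could move a precomputed weight into a per-term payload:

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;
import java.io.IOException;

/**
 * Expects tokens of the form "feature|weight" (e.g. "color_red|0.8") and
 * moves the weight into a per-term payload, keeping only the feature as the term.
 */
public final class WeightPayloadFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public WeightPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String token = termAtt.toString();
        int sep = token.lastIndexOf('|');
        if (sep >= 0) {
            float weight = Float.parseFloat(token.substring(sep + 1));
            // Store the weight as a 4-byte payload next to the term's posting.
            payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(weight)));
            // Keep only the feature name as the indexed term.
            termAtt.setLength(sep);
        }
        return true;
    }
}

At query time the payloads can be read back from the postings, or folded into scoring through a payload-aware query; the exact class names for that differ between Lucene versions, so treat this as a sketch of the approach rather than a drop-in.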
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just lookup the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
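To make the key-value-store option concrete, here is a minimal sketch of cosine similarity over sparse feature vectors stored as feature-to-weight maps (the class and feature names are invented):

import java.util.HashMap;
import java.util.Map;

public class SparseCosine {
    /** Cosine similarity between two sparse vectors stored as feature -> weight maps. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map so the cost is proportional to its size.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;

        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double normA = norm(a), normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> e1 = new HashMap<>();
        e1.put("color_red", 0.8);
        e1.put("size_large", 0.5);
        Map<String, Double> e2 = new HashMap<>();
        e2.put("color_red", 0.3);
        System.out.println(cosine(e1, e2));
    }
}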
If only one entity is given in advance, and you are more interested in a ranking-like scenario, Lucene may be worth a try.
The right place to look at would be
org.apache.lucene.search.Similarity
you should be able to adapt it to your needs and set your version as default with
setDefault(Similarity similarity)
I would be careful with expectations for speed gains (w.r.t. iterating through all) however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first boolean ("all of the AND terms contained? any of the OR terms?"), then scoring what passes. While for tf.idf you lose nothing along the way, for other scoring functions you might.
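For illustration, with the older Lucene 3.x API (where the static Similarity.setDefault mentioned above still exists) a custom scoring tweak could be sketched like this; newer versions instead set a Similarity on IndexWriterConfig and IndexSearcher, and the overrides here are arbitrary examples:

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

/**
 * Sketch: flatten term-frequency influence and drop idf weighting entirely.
 * Assumes the Lucene 3.x Similarity API (tf/idf hooks, static setDefault).
 */
public class FlatTfSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        // Ignore how often a feature occurs; presence is all that matters.
        return freq > 0 ? 1.0f : 0.0f;
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        // Treat all features as equally informative (no idf weighting).
        return 1.0f;
    }

    public static void main(String[] args) {
        Similarity.setDefault(new FlatTfSimilarity());
    }
}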
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing
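A very small sketch of the random-hyperplane (cosine) flavor of LSH, just to make the idea concrete (the number of bits, dimensionality, and seed below are arbitrary):

import java.util.Random;

/** Random-hyperplane LSH: vectors pointing in similar directions tend to share signature bits. */
public class CosineLsh {
    private final double[][] hyperplanes; // one random hyperplane per signature bit

    public CosineLsh(int numBits, int dim, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dim];
        for (int b = 0; b < numBits; b++) {
            for (int d = 0; d < dim; d++) {
                hyperplanes[b][d] = rnd.nextGaussian();
            }
        }
    }

    /** One bit per hyperplane: which side of the hyperplane the vector falls on. */
    public long signature(double[] v) {
        long sig = 0L;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) {
                dot += hyperplanes[b][d] * v[d];
            }
            if (dot >= 0) {
                sig |= 1L << b;
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        CosineLsh lsh = new CosineLsh(32, 4, 42L);
        double[] a = {0.8, 0.5, 0.0, 0.1};
        double[] b = {0.7, 0.6, 0.0, 0.2};
        System.out.println(Long.toBinaryString(lsh.signature(a)));
        System.out.println(Long.toBinaryString(lsh.signature(b)));
    }
}

In practice you would bucket entities by their signature (or by bands of signature bits) and only run the exact distance function within a bucket.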
I know there is a post on loading an index into RAM for Lucene.
Faster search in Lucene - Is there a way to keep the whole index in RAM?
But I really need it for Solr, to improve search speed. Any pointers would be helpful :)
thanks
I don't think this is a good idea. It's like asking to improve speed in Windows by disabling its swapfile. Solr implements some very efficient caching mechanisms on top of Lucene, and there's also the file system cache.
If you have speed issues with Solr this is not the solution. Please post another question detailing your problems and let us recommend you a proper solution.
See also: XY Problem
We have an extremely large database of 30+ Million products, and need to query them to create search results and ad displays thousands of times a second. We have been looking into Sphinx, Solr, Lucene, and Elastic as options to perform these constant massive searches.
Here's what we need to do. Take keywords and run them through the database to find products that match the closest. We're going to be using our OWN algorithm to decide which products are most related to target our advertisements, but we know that these engines already have their own relevancy algorithms.
So, our question is how can we use our own algorithms on top of the engine's, efficiently. Is it possible to add them to the engines themselves as a module of some sort? Or would we have to rewrite the engine's relevancy code? I suppose we could implement the algorithm from the application by executing multiple queries, but this would really kill efficiency.
Also, we'd like to know which search solution would work best for us. Right now we're leaning towards Sphinx, but we're really not sure.
Also, would you recommend running these engines over MySQL, or would it be better to run them over some type of key-value store like Cassandra? Keep in mind there are 30 Million records, and likely to double as we move along.
Thanks for your responses!
I can't give you an entire answer, as I haven't used all the products, but I can say some things which might help.
Lucene/Solr uses a vector space model. I'm not certain what you mean by using your "own" algorithm, but if it gets too far away from the notion of tf/idf (say, by using a neural net) you're going to have difficulties fitting it into Lucene. If by your own algorithm you just mean you want to weight certain terms more heavily than others, that will fit in fine. Basically, Lucene stores information about how important a term is to a document. If you want to redefine the calculation of how important a term is, that's easy to do. If you want to get away from the whole notion of a term's importance to a document, that's going to be a pain.
Lucene (and as a result Solr) stores things in its custom format. You don't need to use a database. 30 million records is not a remarkably large Lucene index (depending, of course, on how big each record is). If you do want to use a db, use Hadoop.
In general, you will want to use Solr instead of Lucene.
I have found it very easy to modify Lucene. But as my first bullet point said, if you want to use an algorithm that's not based on some notion of a term's importance to a document, I don't think Lucene will be the way to go.
I actually did something similar with Solr. I can't comment on the details, but basically the proprietary analysis/relevance step generated a series of search terms with associated boosts and fed them to Solr. I think this can be done with any search engine (they all support some sort of boosting).
Ultimately it comes down to what your particular analysis requires.
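For what it's worth, the "feed pre-weighted terms to the engine" approach described above can be sketched with plain Lucene as a boolean query of boosted term queries; the field names and boost values here are invented, and this uses the older Query.setBoost style (newer Lucene versions wrap queries in BoostQuery instead). In Solr the equivalent is boosted terms in the query string, e.g. title:camera^3.0.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BoostedQuerySketch {
    public static Query build() {
        // Weights produced by your own relevance algorithm.
        String[] terms = {"camera", "digital", "zoom"};
        float[] boosts = {3.0f, 1.5f, 0.5f};

        BooleanQuery bq = new BooleanQuery();
        for (int i = 0; i < terms.length; i++) {
            TermQuery tq = new TermQuery(new Term("title", terms[i]));
            tq.setBoost(boosts[i]);                   // per-term weight from your algorithm
            bq.add(tq, BooleanClause.Occur.SHOULD);   // OR semantics; matching clauses add to the score
        }
        return bq;
    }
}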
I am trying to create a Lucene index of around 2 million records. The indexing time is around 9 hours.
Could you please suggest how to increase performance?
I wrote a terrible post on how to parallelize a Lucene Index. It's truly terribly written, but you'll find it here (there's some sample code you might want to look at).
Anyhow, the main idea is that you chunk up your data into sizable pieces, and then work on each of those pieces on a separate thread. When each of the pieces is done, you merge them all into a single index.
With the approach described above, I'm able to index 4+ million records in approx. 2 hours.
Hope this gives you an idea of where to go from here.
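A stripped-down sketch of that chunk-and-merge idea (paths, analyzer, chunk contents, and thread counts are placeholders; it assumes a reasonably recent Lucene, 5.x or later, for the IndexWriterConfig and Path-based APIs):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexer {
    // Index one chunk of records into its own directory, on its own thread.
    static Directory indexChunk(List<String> chunk, String path) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(path));
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String record : chunk) {
                Document doc = new Document();
                doc.add(new TextField("body", record, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
        return dir;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("first record", "second record"),
                Arrays.asList("third record", "fourth record"));

        // One indexing task per chunk, run in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
        List<Future<Directory>> futures = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            final int idx = i;
            futures.add(pool.submit((Callable<Directory>) () ->
                    indexChunk(chunks.get(idx), "/tmp/part-" + idx)));
        }
        Directory[] parts = new Directory[futures.size()];
        for (int i = 0; i < futures.size(); i++) {
            parts[i] = futures.get(i).get();
        }
        pool.shutdown();

        // Merge the per-thread indexes into one final index.
        try (Directory merged = FSDirectory.open(Paths.get("/tmp/final-index"));
             IndexWriter writer = new IndexWriter(merged, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addIndexes(parts);
        }
    }
}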
Apart from the writing side (merge factor) and the computation aspect (parallelizing), this is sometimes due to the simplest of reasons: slow input. Many people build a Lucene index from a database of data. Sometimes you find that a particular query for this data is too complicated and slow to actually return all the (2 million?) records quickly. Try running just the query and writing the results to disk; if that alone still takes on the order of 5-9 hours, you've found the place to optimize (the SQL).
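One cheap way to check whether the database rather than Lucene is the bottleneck is to time the raw fetch on its own, with no indexing at all; a sketch (the JDBC URL, credentials, query, and fetch size are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeTheFetch {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        long rows = 0;
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/products", "user", "pass");
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(1000);          // hint to the driver; exact streaming behavior is driver-specific
            try (ResultSet rs = stmt.executeQuery("SELECT id, title, description FROM products")) {
                while (rs.next()) {
                    rs.getString("title");    // touch the data, but do no indexing at all
                    rows++;
                }
            }
        }
        System.out.println(rows + " rows fetched in " + (System.currentTimeMillis() - start) + " ms");
    }
}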
The following article really helped me when I needed to speed things up:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
I found that document construction was our primary bottleneck. After optimizing data access and implementing some of the other recommendations, I was able to substantially increase indexing performance.
The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together.
http://search-lucene.blogspot.com/2008/08/indexing-speed-factors.html
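In newer Lucene versions those same knobs live on IndexWriterConfig and the merge policy rather than on IndexWriter itself; a sketch with arbitrary values (assuming Lucene 5.x or later):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class TunedWriter {
    public static IndexWriter open(String path) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());

        // Buffer more documents in RAM before flushing a segment (value is arbitrary).
        cfg.setRAMBufferSizeMB(256);

        // Merge more segments at once, i.e. merge less often but in bigger batches.
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(30);
        cfg.setMergePolicy(mergePolicy);

        return new IndexWriter(FSDirectory.open(Paths.get(path)), cfg);
    }
}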
What are the various ways of optimizing Lucene performance?
Shall I use caching API to store my lucene search query so that I save on the overhead of building the query again?
Have you looked at:
Lucene Optimization Tip: Reuse Searcher
Advanced Text Indexing with Lucene
Should an index be optimised after incremental indexes in Lucene?
Quick tips:
Keep the size of the index small. Eliminate norms and term vectors when not needed. Set the Store flag for a field only if it is a must.
Obvious, but an oft-repeated mistake: create only one instance of Searcher and reuse it.
Keep the index on fast disks. RAM, if you are paranoid.
Cheat. Use RAMDirectory to load the entire index into RAM. Afterwards, everything is blazing fast. :) (A sketch of the last two tips follows below.)
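To make the "reuse the Searcher" and RAMDirectory tips concrete, a small sketch (the index path is a placeholder; it assumes an older Lucene where RAMDirectory still exists, since it was later deprecated in favor of MMapDirectory, as a later answer here points out):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;
import java.nio.file.Paths;

public class SearcherHolder {
    // One searcher, created once and shared by all query threads.
    private static IndexSearcher searcher;

    public static synchronized IndexSearcher get() throws Exception {
        if (searcher == null) {
            // Copy the on-disk index into RAM (the "cheat" from the tips above).
            RAMDirectory ramDir = new RAMDirectory(
                    FSDirectory.open(Paths.get("/path/to/index")), IOContext.DEFAULT);
            searcher = new IndexSearcher(DirectoryReader.open(ramDir));
        }
        return searcher;
    }
}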
Lots of dead links in here.
These (somewhat official) resources are where I would start:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
I have found that the best answer to a performance question is to profile it. Guidelines are great, but there are so many variables that can impact performance, such as the size of your dataset, the types of queries you are doing, datatypes, etc.
Get the NetBeans profiler or something similar and try things out in different ways. Use the articles linked to by Mitch, but make sure you actually test what helps and what (often surprisingly) hurts.
There is also a good chance that any performance differences you can get from Lucene will be minor compared to performance improvements in your code. The profiler will point that out as well.
For 64-bit machines, use MMapDirectory instead of RAMDirectory, as explained very well here by one of the core Lucene committers.
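For reference, the switch is a one-liner (a sketch assuming the Path-based API of Lucene 5+; note that FSDirectory.open already picks MMapDirectory automatically on most 64-bit platforms, so explicit construction is rarely necessary):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import java.nio.file.Paths;

public class MMapExample {
    public static IndexSearcher open(String path) throws Exception {
        // Memory-map the index files; the OS page cache keeps the hot parts in RAM.
        Directory dir = new MMapDirectory(Paths.get(path));
        return new IndexSearcher(DirectoryReader.open(dir));
    }
}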