Term-Document payload support in Lucene - lucene

I am using Elasticsearch 1.3.4 and as a result Lucene 4.9. I have a requirement of storing some information per term-document pair (something like term frequency but only spanning variable number of bytes). I am aware of payloads being supported by Lucene, but that information is per term-document-occurrence trio. Hence in my case, it would be an overkill to use payloads. Well, I can also try saving that information as a payload corresponding to only the first occurrence of a term in a document but that does not sound very clean.
I would like to know if there is an out of the box solution for storing term-document custom information in Lucene. If not, what are my alternatives?

Related

Speeding up Solr Indexing

I am kind of working on speeding up my Solr Indexing speed. I just want to know by default how many threads(if any) does Solr use for indexing. Is there a way to increase/decrease that number.
When you index a document, several steps are performed :
the document is analyzed,
data is put in the RAM buffer,
when the RAM buffer is full, data is flushed to a new segment on disk,
if there are more than ${mergeFactor} segments, segments are merged.
The first two steps will be run in as many threads as you have clients sending data to Solr, so if you want Solr to run three threads for these steps, all you need is to send data to Solr from three threads.
You can configure the number of threads to use for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no mean to configure the maximum number of threads to use from Solr configuration files, so what you need is to write a custom class which call setMaxThreadCount in the constructor.
My experience is that the main ways to improve indexing speed with Solr are :
buying faster hardware (especially I/O),
sending data to Solr from several threads (as many threads as cores is a good start),
using the Javabin format,
using faster analyzers.
Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format. Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (800 in my case, but with rather small documents) using CommonsHttpSolrServer and the Javabin format.
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. This is for Solr 4.0 which hadn't been released at the time this question was originally posted.
You can store the content in external storage like file;
What are all the field that contains huge size of content,in schema set stored="false" for that corresponding field and store the content for that field in external file using some efficient file system hierarchy.
It improves indexing by 40 to 45% reduced time. But when doing search, search time speed is some what increased.For search it took 25% more time than normal search.

Fast in-memory inverted index

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but cannot see how I can associate with a document my own custom feature vector with precomputed weights. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically a zset is a sorted set of key-value pairs.
So you can have a sorted set per feature where each
feature->[ { docid, score }, {docid, score} ..]
i.e.
zadd feature score docid
redis then has some nice operators to merge, extract ranges etc. See zunionstore, zrange (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allow you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index, and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just lookup the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
If only one entity is given in advance, and you are more interested in a ranking like scenario, Lucene may be worth a try.
The right place to look at would be
org.apache.lucene.search.Similarity
you should be able to adapt it to your needs and set your version as default with
setDefault(Similarity similarity)
I would be careful with expectations for speed gains (w.r.t. iterating through all) however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme, first boolean ("all of the AND terms contained? any of the OR terms?") then scoring what passes. While for tf.idf you lose nothing on the way for other scoring functions you might.
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing

How does Lucene/Solr achieve high performance in multi-field / faceted search?

Context
This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example size, speed, price of a car).
When implemented with relational database, for a large number of facets multi-field indices are not useful, since facets can be searched in any order, so a specific ordered multi-index is used with low chance, and creating all possible orderings of indices is unbearable.
Solr is advertised to cope well with the faceted search task, which if I think correctly has to be connected with Lucene (supposedly) performing well on multi-field queries (where fields of a document relate to facets of an object).
Question
The inverted index of Lucene can be stored in a relational database, and naturally taking the intersections of the matching documents can also be trivially achieved with RDBMS using single-field indices.
Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.
So the question is, what is this technique/trick? More broadly: Why can Lucene/Solr achieve better faceted search performance theoretically than RDBMS could (if so)?
Note: My first guess would be that Lucene would use some space partitioning method for partitioning a vector space built from the document fields as dimensions, but as I understand Lucene is not purely vector space based.
Faceting
There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these are faster than an RDBMS.
Enum faceting. Results of a query are a bit vector where the ith bit is 1 if the ith document was a match. The facet is also a bit vector, so intersection is just a bitwise AND. I don't think this is a novel approach, and most RDBMS's probably support it.
Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here is like:
select facet, count(*) from field_cache
where docId in query_results
group by facet
Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
Multi-term search
This is where Lucene shines. Why Lucene's approach is so good is too long to post here, but I can recommend this post on Lucene Performance, or the papers linked therein.
An explaining post can be found at: http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
The new method works by un-inverting the indexed field to be faceted, allowing quick lookup of the terms in the field for any given document. It’s actually a hybrid approach – to save memory and increase speed, terms that appear in many documents (over 5%) are not un-inverted, instead the traditional set intersection logic is used to get the counts.

Sphinx/Solr/Lucene/Elastic Relevancy

We have an extremely large database of 30+ Million products, and need to query them to create search results and ad displays thousands of times a second. We have been looking into Sphinx, Solr, Lucene, and Elastic as options to perform these constant massive searches.
Here's what we need to do. Take keywords and run them through the database to find products that match the closest. We're going to be using our OWN algorithm to decide which products are most related to target our advertisements, but we know that these engines already have their own relevancy algorithms.
So, our question is how can we use our own algorithms on top of the engine's, efficiently. Is it possible to add them to the engines themselves as a module of some sort? Or would we have to rewrite the engine's relevancy code? I suppose we could implement the algorithm from the application by executing multiple queries, but this would really kill efficiency.
Also, we'd like to know which search solution would work best for us. Right now we're leaning towards Sphinx, but we're really not sure.
Also, would you recommend running these engines over MySQL, or would it be better to run them over some type of key-value store like Cassandra? Keep in mind there are 30 Million records, and likely to double as we move along.
Thanks for your responses!
I can't give you an entire answer, as I haven't used all the products, but I can say some things which might help.
Lucene/Solr uses a vector space model. I'm not certain what you mean by you're using your "own" algorithm, but if it gets too far away from the notion of tf/idf (say, by using a neural net) you're going to have difficulties fitting it into lucene. If by your own algorithm you just mean you want to weight certain terms more heavily than others, that will fit in fine. Basically, lucene stores information about how important a term is to a document. If you want to redefine the calculation of how important a term is, that's easy to do. If you want to get away from the whole notion of a term's importance to a document, that's going to be a pain.
Lucene (and as a result Solr) stores things in its custom format. You don't need to use a database. 30 million records is not an remarkably large lucene index (depending, of course, on how big each record is). If you do want to use a db, use hadoop.
In general, you will want to use Solr instead of Lucene.
I have found it very easy to modify Lucene. But as my first bullet point said, if you want to use an algorithm that's not based on some notion of a term's importance to a document, I don't think Lucene will be the way to go.
I actually did something similar with Solr. I can't comment on the details, but basically the proprietary analysis/relevance step generated a series of search terms with associated boosts and fed them to Solr. I think this can be done with any search engine (they all support some sort of boosting).
Ultimately it comes down to what your particular analysis requires.

Lucene / Lucene.NET - Document.SetBoost() values?

I know it takes in a float, but what are some typical values for various levels of boosting within a result?
For example:
If I wanted to boost a document's weighting by 10% then I should set it 1.1?
For 20% then 1.2?
What happens if I start setting boosts to values like 75.0? or 500.0?
Edit: Fixed Formatting
Please see the Lucene Similarity Documentation for the formula. In principle, all other factors being equal, setting a document's boost to 1.1 will indeed give it a score that is 10% higher as compared to an identical document with a boost of 1.0. If you have a set of documents that should be intrinsically preferred in searches, this may be a good idea. Note that document boost is an indexing-time attribute, making it impossible to change the document's boost without reindexing it.
There are other important factors in scoring - including term match scores, norms etc.
See Debugging Relevance Issues in Search for details.
Adding to what Yuval has said. This value is function of field boost & document boost. The boost values are encoded in a single byte. So, the precision might be lost while storing this value. Debugging with Searcher.Explain() would help you get the right amount of boost.
If you want the boost value to be preserved (it's useful, for example, when you want to recreate index from current index), you may add it in a stored field.
The important thing to remember about boosting is not to approach it in isolation, you need to consider it as part of a global strategy, make a list of each criteria used to effect the relevancy and then order those criteria. Define a relationship between each of those criteria. Are you regularly re-indexing or are you just adding new documents, if you are regularly re-indexing, you can afford to tune your document boost criteria, if not you need to think it through thoroughly beforehand.