Lucene / Lucene.NET - Document.SetBoost() values?

I know it takes in a float, but what are some typical values for various levels of boosting within a result?
For example:
If I wanted to boost a document's weighting by 10%, should I set it to 1.1?
For 20%, should it be 1.2?
What happens if I start setting boosts to values like 75.0? or 500.0?

Please see the Lucene Similarity Documentation for the formula. In principle, all other factors being equal, setting a document's boost to 1.1 will indeed give it a score that is 10% higher as compared to an identical document with a boost of 1.0. If you have a set of documents that should be intrinsically preferred in searches, this may be a good idea. Note that document boost is an indexing-time attribute, making it impossible to change the document's boost without reindexing it.
There are other important factors in scoring - including term match scores, norms etc.
See Debugging Relevance Issues in Search for details.
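
To make that concrete, here is a minimal sketch of applying an index-time document boost, assuming the classic pre-4.0 Lucene/Lucene.NET-style API (the field name and helper method are made up for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Hypothetical helper: index one document with a 10% index-time boost
static void addBoostedDoc(IndexWriter writer, String title) throws Exception {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    doc.setBoost(1.1f);       // roughly a 10% score bump, all else being equal
    writer.addDocument(doc);  // index-time attribute: changing it later means reindexing
}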

Adding to what Yuval has said: this value is a function of the field boost and the document boost. The boost values are encoded in a single byte, so precision may be lost when the value is stored. Debugging with Searcher.Explain() will help you find the right amount of boost.
If you want the boost value to be preserved exactly (which is useful, for example, when you want to recreate the index from the current index), you can also add it as a stored field.
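
If you want to see how much of the boost actually survives that single-byte encoding, the explain output is the place to look; a rough sketch using the Java equivalent of Searcher.Explain() (the query and doc id are whatever hit you are investigating):

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Hypothetical helper: dump the scoring breakdown (fieldNorm folds in the boosts) for one hit
static void explainHit(IndexSearcher searcher, Query query, int docId) throws Exception {
    Explanation explanation = searcher.explain(query, docId);
    System.out.println(explanation.toString());
}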

The important thing to remember about boosting is not to approach it in isolation; you need to consider it as part of a global strategy. Make a list of each criterion used to affect relevance, order those criteria, and define the relationship between them. Are you regularly re-indexing, or are you just adding new documents? If you are regularly re-indexing, you can afford to keep tuning your document-boost criteria; if not, you need to think them through thoroughly beforehand.

Related

What are the advantages of knowing that a corpus of text follows Zipf's law?

I have the frequency count of all the words from a file (which I am using to analyze and index data with Elasticsearch), and the frequency of words follows Zipf's law. How can I use this knowledge to improve my search over it? Or rather, how can I use it to my benefit at all?
I think this is a very interesting question, and I'm sad that it's gone without an answer or comment for so long. The Zipfian distribution is a phenomenon that occurs not only in language but far beyond it.
Zipf and Pareto
The Zipfian distribution, or Zipf's law, is in this case a rank-frequency distribution of words. Perhaps more importantly, the related Pareto distribution implies that approximately 20% of words (the cause) account for roughly 80% of word occurrences (the outcome) in any given body, or bodies, of text. Lucene, the brain behind Elasticsearch, accounts for this in multiple ways, often going beyond Zipf's law alone. It is also common for your search results themselves to follow a Zipfian distribution.
Word frequency: least is best (usually)
One of the problems here is that in most bodies of text the most common words actually carry the least context, usually being articles or other words with very limited meaning on their own. The top 3 most common words in English are "the", "of", and "to". Elasticsearch actually comes with a list of stop words, which optimizes indexing by ignoring such words.
Elasticsearch stop words:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no,
not, of, on, or, such, that, the, their, then, there, these, they,
this, to, was, will, with
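
Since Lucene does the analysis under the hood, the same effect can be seen directly with a Lucene analyzer; a small sketch assuming the classic 3.x-style Java API (StandardAnalyzer ships with an English stop set much like the one above):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StopWordDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        TokenStream ts = analyzer.tokenStream("body", new StringReader("the fox is in the barn"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());  // prints "fox", "barn" - the stop words are dropped
        }
    }
}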
It's actually common that the words which appear least frequently carry the most context, so when doing text search you are usually most interested in the least frequent words.
80:20 phenomenon
The thing is, Elasticsearch and Lucene are both built with these things in mind and are well optimized for them. A simple LRU eviction policy for caching index data actually works very well, as 80% of your searches will likely touch only 20% of your index, making cache pollution both infrequent and low-impact thanks to the predictable workload. So if you allocate a cache larger than 20% of your total index size you should be fine. If the needed part of the index is not in the cache, it will be read off disk (usually via mmap), and you can improve performance by using a drive with fast random reads (such as an SSD).
More Reading
There is an interesting article on this. It's likely that the word-rank distribution in your data set looks very similar to that of most other data sets, so optimizing both performance and relevance comes down to those few words which occur least often but are the most searched for. These may be jargon specific to the demographic or profession your application is targeting.
Conclusion
These optimizations, however, could be premature. As stated above, Lucene and Elasticsearch both do their part to increase the effectiveness and efficiency of search with these principles in mind, and a simple LRU cache, which is both common (already part of ES) and relatively simple, works very well here. Cases where further tuning is worthwhile are usually those with a lot of jargon or domain-specific language, or perhaps multilingual content. For something like a news site you'll likely want a broader solution, as you cover a huge spectrum of topics with many different words and subjects. These are things to consider when configuring Elasticsearch; tinkering with the analyzer can be complicated and hard to do effectively, especially if you index a large range of subjects with different terminologies, but it will likely have the largest effect on improving search relevance.

How to determine the accuracy of a semantic search engine?

My cousin has created a semantic search engine and he claims that his search engine is the most accurate.
I've seen many semantic search engines and they all look the same to me, because they are not designed to give you results based on the keyword you type.
So if you are creating a semantic search engine, how do you determine the accuracy of its results?
Actually sarnold's suggestion is not far off the mark.
What you would typically do is to take a whole bunch of people and have them try out a bunch of standard queries. Or if you wanted to make the experiment fairer you might let each user pick their own queries to avoid any accusation of bias (because you could pick standard queries you knew your engine was good at answering).
For each query the user would look through the first 10 or so results and say whether they thought each result was relevant or not (you may want to have users score on a scale rather than just yes/no).
Then for each of the queries you can calculate accuracy scores. Depending on exactly how you set up the experiment, precision and recall may be the most appropriate measures, though these rely on having a known expected answer, which you may not necessarily have. It may be simpler and more appropriate to calculate a plain percentage accuracy.
To determine whether your search engine was better than your competitors you'd have the same people perform the same queries on those search engines scoring in the same way. Having done this you can then calculate and compare the scores for the search engines against your own.
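
To make the scoring step concrete, here is a minimal sketch of per-query precision and recall computed from binary relevance judgements on the top-k results (the helper names are made up; recall additionally needs the total number of relevant documents, which you may not know):

import java.util.List;

// Precision@k: fraction of the returned (and judged) results that the user marked relevant
static double precisionAtK(List<Boolean> judgedRelevant) {
    long hits = judgedRelevant.stream().filter(r -> r).count();
    return (double) hits / judgedRelevant.size();
}

// Recall@k: fraction of all known-relevant documents that showed up in the top k
static double recallAtK(List<Boolean> judgedRelevant, int totalRelevant) {
    long hits = judgedRelevant.stream().filter(r -> r).count();
    return totalRelevant == 0 ? 0.0 : (double) hits / totalRelevant;
}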

Fast in-memory inverted index

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but I cannot see how to associate my own custom feature vector with precomputed weights with a document. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically a zset is a sorted set of key-value pairs, so you can have one sorted set per feature, where each feature maps to
feature -> [ { docid, score }, { docid, score }, ... ]
i.e.
zadd feature score docid
Redis then has some nice operators to merge sets, extract ranges, etc. See ZUNIONSTORE and ZRANGE (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory, etc. (though Redis is not an embedded db).
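
As a rough illustration of that layout, here is a sketch using the Jedis Java client (the key names and doc ids are made up):

import redis.clients.jedis.Jedis;

public class ZsetIndexDemo {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost");  // assumes a local redis instance
        // one sorted set per feature: member = docid, score = precomputed weight
        jedis.zadd("feature:color_red", 0.8, "doc:42");
        jedis.zadd("feature:color_red", 0.3, "doc:17");
        jedis.zadd("feature:size_large", 0.9, "doc:42");
        // ZUNIONSTORE sums the scores per docid by default
        jedis.zunionstore("tmp:query", "feature:color_red", "feature:size_large");
        // top 10 docids by combined score
        System.out.println(jedis.zrevrangeWithScores("tmp:query", 0, 9));
    }
}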
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
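
A rough sketch of the payload route, assuming the classic 3.x-style API: index each feature as a "name|weight" token and let DelimitedPayloadTokenFilter store the weight as a float payload (you would then pair this with a PayloadTermQuery and a Similarity that reads the payload):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.util.Version;

public class FeaturePayloadAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream tokens = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        // the field value "color|0.8 size|0.3" becomes terms "color" and "size",
        // each carrying its weight as a payload
        return new DelimitedPayloadTokenFilter(tokens, '|', new FloatEncoder());
    }
}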
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just lookup the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
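
For the pair-wise case, something as simple as a sparse-map cosine similarity is usually enough; a minimal sketch:

import java.util.Map;

// Cosine similarity between two sparse feature vectors stored as name -> weight maps
static double cosine(Map<String, Double> a, Map<String, Double> b) {
    Map<String, Double> small = a.size() <= b.size() ? a : b;
    Map<String, Double> large = a.size() <= b.size() ? b : a;
    double dot = 0.0;
    for (Map.Entry<String, Double> e : small.entrySet()) {  // iterate the smaller vector
        Double w = large.get(e.getKey());
        if (w != null) dot += e.getValue() * w;
    }
    double normA = Math.sqrt(a.values().stream().mapToDouble(v -> v * v).sum());
    double normB = Math.sqrt(b.values().stream().mapToDouble(v -> v * v).sum());
    return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
}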
If only one entity is given in advance, and you are more interested in a ranking like scenario, Lucene may be worth a try.
The right place to look at would be org.apache.lucene.search.Similarity; you should be able to adapt it to your needs and set your version as the default with setDefault(Similarity similarity).
I would be careful with expectations of speed gains (relative to simply iterating over everything), however, as they largely depend on the sparsity of the query and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first a boolean stage ("are all of the AND terms contained? any of the OR terms?"), then scoring of whatever passes. While for tf-idf you lose nothing along the way, for other scoring functions you might.
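
A minimal sketch of that adaptation, assuming the classic 3.x-style API (the overrides shown are illustrative, not a recommendation):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

// Hypothetical custom scoring: flatten term-frequency and rarity so only your own
// weights (e.g. boosts) drive the ranking
public class FlatSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;  // ignore how often a feature occurs in a document
    }
    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f;                    // ignore how rare a feature is in the corpus
    }
}

// Usage: Similarity.setDefault(new FlatSimilarity());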
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing

How does Lucene/Solr achieve high performance in multi-field / faceted search?

Context
This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example size, speed, price of a car).
When implemented with a relational database, multi-column indices are not very useful for a large number of facets: since facets can be searched in any order, any specific ordering of a multi-column index has a low chance of being used, and creating indices for all possible orderings is infeasible.
Solr is advertised to cope well with faceted search, which, if I understand correctly, has to be connected with Lucene (supposedly) performing well on multi-field queries (where the fields of a document correspond to the facets of an object).
Question
The inverted index of Lucene can be stored in a relational database, and naturally, taking the intersections of the matching documents can also be trivially achieved with an RDBMS using single-field indices.
Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.
So the question is, what is this technique/trick? More broadly: Why can Lucene/Solr achieve better faceted search performance theoretically than RDBMS could (if so)?
Note: My first guess would be that Lucene would use some space partitioning method for partitioning a vector space built from the document fields as dimensions, but as I understand Lucene is not purely vector space based.
Faceting
There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these are faster than an RDBMS.
Enum faceting. Results of a query are a bit vector where the ith bit is 1 if the ith document was a match. The facet is also a bit vector, so intersection is just a bitwise AND. I don't think this is a novel approach, and most RDBMS's probably support it.
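
A small sketch of that intersection with plain java.util.BitSet (facetDocs would be the precomputed bit vector for one facet value):

import java.util.BitSet;

// Count how many of the query's hits fall into one facet value
static int facetCount(BitSet queryHits, BitSet facetDocs) {
    BitSet intersection = (BitSet) queryHits.clone();
    intersection.and(facetDocs);        // bitwise AND of the two document sets
    return intersection.cardinality();  // the number shown next to the facet value
}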
Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here is like:
select facet, count(*) from field_cache
where docId in query_results
group by facet
Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
Multi-term search
This is where Lucene shines. Why Lucene's approach is so good is too long to post here, but I can recommend this post on Lucene Performance, or the papers linked therein.
An explanatory post can be found at: http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
The new method works by un-inverting the indexed field to be faceted, allowing quick lookup of the terms in the field for any given document. It’s actually a hybrid approach – to save memory and increase speed, terms that appear in many documents (over 5%) are not un-inverted, instead the traditional set intersection logic is used to get the counts.
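
The counting loop behind that un-inverted approach is roughly the following sketch (docToTermOrds is a hypothetical doc-to-term-ordinals table built once when un-inverting the field):

import java.util.BitSet;

// For each matching document, bump a counter for every facet term it contains
static int[] countFacets(BitSet queryHits, int[][] docToTermOrds, int numTerms) {
    int[] counts = new int[numTerms];
    for (int doc = queryHits.nextSetBit(0); doc >= 0; doc = queryHits.nextSetBit(doc + 1)) {
        for (int ord : docToTermOrds[doc]) {
            counts[ord]++;
        }
    }
    return counts;
}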

Sphinx/Solr/Lucene/Elastic Relevancy

We have an extremely large database of 30+ Million products, and need to query them to create search results and ad displays thousands of times a second. We have been looking into Sphinx, Solr, Lucene, and Elastic as options to perform these constant massive searches.
Here's what we need to do. Take keywords and run them through the database to find products that match the closest. We're going to be using our OWN algorithm to decide which products are most related to target our advertisements, but we know that these engines already have their own relevancy algorithms.
So, our question is: how can we use our own algorithms on top of the engine's, efficiently? Is it possible to add them to the engines themselves as a module of some sort, or would we have to rewrite the engines' relevancy code? I suppose we could implement the algorithm in the application by executing multiple queries, but this would really kill efficiency.
Also, we'd like to know which search solution would work best for us. Right now we're leaning towards Sphinx, but we're really not sure.
Also, would you recommend running these engines over MySQL, or would it be better to run them over some type of key-value store like Cassandra? Keep in mind there are 30 million records, and that number is likely to double as we move along.
Thanks for your responses!
I can't give you an entire answer, as I haven't used all the products, but I can say some things which might help.
- Lucene/Solr uses a vector space model. I'm not certain what you mean by using your "own" algorithm, but if it gets too far away from the notion of tf-idf (say, by using a neural net) you're going to have difficulty fitting it into Lucene. If by your own algorithm you just mean you want to weight certain terms more heavily than others, that will fit in fine. Basically, Lucene stores information about how important a term is to a document. If you want to redefine the calculation of how important a term is, that's easy to do. If you want to get away from the whole notion of a term's importance to a document, that's going to be a pain.
- Lucene (and as a result Solr) stores things in its own custom format. You don't need to use a database. 30 million records is not a remarkably large Lucene index (depending, of course, on how big each record is). If you do want to use a db, use Hadoop.
- In general, you will want to use Solr instead of Lucene.

I have found it very easy to modify Lucene. But as my first bullet point said, if you want to use an algorithm that's not based on some notion of a term's importance to a document, I don't think Lucene will be the way to go.
I actually did something similar with Solr. I can't comment on the details, but basically the proprietary analysis/relevance step generated a series of search terms with associated boosts and fed them to Solr. I think this can be done with any search engine (they all support some sort of boosting).
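
To make the "terms with associated boosts" idea concrete, here is a minimal Lucene-style sketch (classic API; the field name is made up) that builds one boosted query out of whatever your own relevance step produces:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Turn (term, boost) pairs from an external algorithm into a single boosted query
static BooleanQuery buildBoostedQuery(String[] terms, float[] boosts) {
    BooleanQuery query = new BooleanQuery();
    for (int i = 0; i < terms.length; i++) {
        TermQuery tq = new TermQuery(new Term("product_text", terms[i]));
        tq.setBoost(boosts[i]);                 // per-term boost decided outside the engine
        query.add(tq, BooleanClause.Occur.SHOULD);
    }
    return query;
}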
Ultimately it comes down to what your particular analysis requires.