What Happens When Multiple Fields with the Same Name have Different Boosts? - lucene

Boost is set per field in Lucene and stored in the index. You can add multiple fields with the same name to a document, and at search time they behave as though you had added the union of their terms to that document under that field name.
Field field1 = new TextField("Text", "aquaculture", Field.Store.NO);
Field field2 = new TextField("Text", "fish", Field.Store.NO);
field1.setBoost(1.0f);
field2.setBoost(2.0f);
Document doc = new Document();
doc.add(field1);
doc.add(field2);
But you can give each of those fields a different boost as you add them - if you do, what happens? Do the terms in each field keep their own separate boost, or does one shared boost get used? And if it's one shared boost, is it the last one set, the first one, or something arbitrary?
(The code above is only meant to illustrate the fields getting added to a single document; the exact constructors and field options vary between Lucene versions.)

For current versions of Lucene, this question is no longer relevant: setting boosts on fields was deprecated in 6.5 and removed in 7.0. See LUCENE-6819 for discussion. The replacement, per the migration guide:
...index-time scoring factors should be indexed in a doc value field and combined with the score at query time using FunctionScoreQuery for instance.
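As a rough illustration of that replacement (a minimal sketch; the "Text_boost" field name is made up here, and FunctionScoreQuery.boostByValue is the convenience factory available in recent Lucene releases):
// Index time: store the factor as a doc value instead of calling setBoost().
Document doc = new Document();
doc.add(new TextField("Text", "aquaculture fish", Field.Store.NO));
doc.add(new FloatDocValuesField("Text_boost", 2.0f));

// Query time: fold the stored factor back into the score.
// boostByValue multiplies the wrapped query's score by the doc value;
// the plain constructor would replace the score with it instead.
Query base = new TermQuery(new Term("Text", "fish"));
Query boosted = FunctionScoreQuery.boostByValue(
        base, DoubleValuesSource.fromFloatField("Text_boost"));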
In previous versions, if multiple boosts were set on the same field name, they were multiplied together - in the example above, the effective boost for "Text" would be 1.0 × 2.0 = 2.0. This is specified by FieldInvertState, the information that gets passed to the Similarity, which translates it into what gets stored in the index:
This is the cumulative product of document boost and field boost for all field instances sharing the same field name.

Related

Getting search suggestions to work on 2 (or more) non-consecutive words (to improve search on a medical conditions list - ICD10 codes)

Context:
We are using Azure Cognitive Services in a mobile app to search patient diagnostic codes (ICD10 codes).
The ICD10 code list is approximately 94,000 items. For anyone interested, here is a list.
We currently have a standard Lucene analyser set up on the diagnostic description field.
Requirement:
We want to provide a really good search-as-you-type experience that surfaces the most relevant suggestions.
Using the Suggest method with the fuzzy parameter set to true works reasonably well for a single search term: it does well at finding partial matches and is resilient to typos.
The issue comes in when I add a second search term, e.g. I want to search for asthma that is moderate. In both of the examples we tried, there is no match.
So when searching for more than one term, requiring the user to enter the terms in the same order as they appear in the data is not a good user experience.
Using the Search method instead, we can overcome the problem of finding matches where two search terms are supplied that do not appear consecutively in the data, and this is also resilient to typos.
However, this is not good at finding partial matches (like the Suggest method is).
E.g. if the user has typed only moder, we would still want the term moderate to be picked up.
Seemingly, if we could combine a wildcard search with a fuzzy search we could solve this problem, e.g. by supplying the following search phrase: ashtma~* AND moder~*.
But from what we have seen this syntax is not supported.
Any suggestions on how to overcome this limitation so we can get the best of both worlds, i.e.:
For 2 or more search terms, it will work on partial matches
And the search terms are treated independently and do not need to appear consecutively in the data
Many thanks in advance,
Andreas.
I recommend using (or at least experimenting with) Lucene ngrams.
An example custom analyzer can use the NGramTokenFilter.
This filter splits each source token into one or more indexed tokens by chopping up the source into substrings of different lengths.
An example from the above link:
"abc" will give "a", "ab", "abc", "b", "bc", "c"
You can, as an example, set each token to be from 3 to 5 characters long (but this is one of the areas where you can experiment with different settings).
When you use this analyzer for indexing, it's going to create many more tokens (larger index) but that gives you more searching flexibility.
Use the same analyzer for searching.
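As a rough sketch of what such an analyzer could look like on the Lucene side (the class name is made up, and the NGramTokenFilter constructor shown is the recent four-argument form; older releases use a three-argument form without the preserveOriginal flag):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Splits text on word boundaries, lowercases, then emits every 3- to
// 5-character ngram of each token (plus the original token itself).
public class NGramSearchAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        stream = new NGramTokenFilter(stream, 3, 5, true); // minGram, maxGram, preserveOriginal
        return new TokenStreamComponents(source, stream);
    }
}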
If the user enters the following two words as their search values:
ashtma moder
You would convert that into the following Lucene search phrase:
ashtma~ AND moder~
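A minimal sketch of that conversion (nothing Lucene-specific, just appending ~ to each whitespace-separated term and joining with AND):
// Naive conversion of the raw user input into a fuzzy AND query string.
static String toFuzzyQuery(String userInput) {
    StringBuilder query = new StringBuilder();
    for (String term : userInput.trim().split("\\s+")) {
        if (query.length() > 0) {
            query.append(" AND ");
        }
        query.append(term).append('~');
    }
    return query.toString();
}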
This will find the following hits:
doc id = 12877
field = Moderate persistent asthma with status asthmaticus
doc id = 12874
field = Moderate persistent asthma
doc id = 12875
field = Moderate persistent asthma, uncomplicated
doc id = 12876
field = Moderate persistent asthma with (acute) exacerbation
doc id = 94210
field = Family history of asthma and oth chronic lower resp diseases
doc id = 6970
field = Xanthelasma of right lower eyelid
doc id = 6973
field = Xanthelasma of left lower eyelid
doc id = 6979
field = Chloasma of right lower eyelid and periocular area
doc id = 6982
field = Chloasma of left lower eyelid and periocular area
As you can see it does find some false positives, but the first four hits (the highest scored) are the ones you want.
You will need to evaluate how this approach performs in terms of index size and search speed.
One reason for suggesting ngrams is your point about wanting to handle misspellings: ngrams may help to isolate spelling mistakes into smaller tokens, since the ~ fuzzy search operator is fairly limited in what it can handle. But definitely experiment with different ngram lengths - and maybe also without using ngrams at all.

SOLR and Ratio of Matching Words

Using SOLR version 4.3, it appears that SOLR is valuing the percentage of matching terms more than the number of matching terms.
For example, we do a search for Dog, and a document with just the word dog and three other words returns. We have another article with hundreds of words that has the word dog in it 27 times.
I would expect the second article to return first. However, the document with a single occurrence of dog among just a few words returns first. I was hoping to find out what in SOLR controls this so I can make the appropriate modifications. I have looked at the SOLR documentation and have seen COORD mentioned, but it seems to indicate that the article with 27 references should return first. Any help would be appreciated.
In 4.x, Solr still used regular TF/IDF as its scoring formula; you can see the Lucene implementation detailed in the documentation for TFIDFSimilarity.
For your question, the two factors that affect the score are:
The length of the field, as given in norm():
norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
.. while the number of matching query terms (not their frequency) is given by coord():
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.
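For reference, the Lucene 4.x defaults (DefaultSimilarity) boil down to roughly the following (a simplified sketch; the real implementation also compresses the norm into a single byte):
// Shorter fields get a larger norm, so a 4-word field can outweigh a
// several-hundred-word field even if the long one repeats the term 27 times.
static float lengthNorm(float fieldBoost, int numTermsInField) {
    return fieldBoost * (float) (1.0 / Math.sqrt(numTermsInField));
}

// coord() rewards matching more distinct query terms, not more occurrences
// of the same term, so it does not help the 27-occurrence document here.
static float coord(int overlappingQueryTerms, int totalQueryTerms) {
    return (float) overlappingQueryTerms / totalQueryTerms;
}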
There are a few settings in your schema that can affect how Solr scores the documents in your example:
omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory)
.. this will remove the norm() part of the score.
omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field.
.. and this will remove the boost from multiple occurrences of the same term. Be aware that this will remove positions as well, making phrase queries impossible.
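For example, both switches can be set per field in schema.xml (the field and type names below are placeholders):
<!-- Disables length normalization and term-frequency counting for this field;
     note that dropping positions also makes phrase queries impossible. -->
<field name="description" type="text_general" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true"/>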
But you should also consider upgrading Solr, as the BM25 similarity that's the default from 6.x usually performs better. I can't remember whether it's available for 4.3.

Index time field level boosting in Lucene 6.6.0?

In Lucene 6.6.0 and greater, Field level index time boosting is deprecated. The documentation states:
Index-time boosts are deprecated, please index index-time scoring factors into a doc value field and combine them with the score at query time using eg. FunctionScoreQuery.
Previously one would boost a field at index time like so:
Field title = new Field(PaperDAO.LUCENE_FIELD_TITLE, titleStr, fieldType);
title.setBoost(3.00f);
document.add(title);
Field authors = new Field(PaperDAO.LUCENE_FIELD_AUTHOR, StringEscapeUtils.unescapeHtml4(this.getAuthorsForLucene()), fieldType);
authors.setBoost(10.00f);
document.add(authors);
I do not understand how the suggested FunctionScoreQuery is an appropriate replacement for field level boosting, as one constructs a FunctionScoreQuery given only an existing Query and a DoubleValuesSource representing the boost value for only one of potentially many fields:
// INDEX TIME
Field title = new Field(PaperDAO.LUCENE_FIELD_TITLE, titleStr, fieldType);
document.add(title);
document.add(new FloatDocValuesField(PaperDAO.LUCENE_FIELD_TITLE + "_boost", 3.00f));
// QUERY TIME
new FunctionScoreQuery(query, DoubleValuesSource.fromFloatField(PaperDAO.LUCENE_FIELD_TITLE + "_boost"))
Can someone please explain the appropriate replacement for Field#setBoost at index time in Lucene >= 6.6.0? Are we supposed to enumerate all possible fields at query time and apply the relevant boost to each? If so, how is that query constructed?
First of all, you still have some time to use old-style index-time boosts, since they will only be removed in Lucene 7.0 :)
Moving on to the subject: the community decided a long time ago that index-time boosting is a complex technique that is hard to get right.
As I understand the current idea, it is not to replace each per-field index-time boost with a per-field doc values field, but rather to replace all the index-time boosts for a document with one accumulated factor in a doc values field and use that during search.
please index index-time scoring factors into a doc value field and combine them with the score at query time
The quote is from the javadoc, which only strengthens me in this idea. You could index multiple factors into just one field.
The open question to me is how to combine several factors into one. That is something to test and validate (multiplication, a sum, or some linear combination).
If you want to boost different fields using FunctionScoreQuery, the suggested method is the following (taken from CustomScoreProvider):
For more complex custom scores, use the lucene-expressions library
// Bind the score of the wrapped query and the per-field boost doc values...
SimpleBindings bindings = new SimpleBindings();
bindings.add("score", DoubleValuesSource.SCORES);
bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
// ...then combine them with an arbitrary expression at query time.
Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));

Boosting Lucene Terms When Building the Index

Is it possible to specify that certain terms are more important than others when creating the index (not when querying it)?
Consider for example a synonym filter:
doc 1: "this is a nice car"
doc 2: "this is a nice vehicle"
I want to add the term vehicle to the first doc and the term car to the second doc,
but I want the first document to be scored higher than the second if the index is later queried with the word car, and the other way around if it is queried for vehicle.
Will calling setBoost on the fields before adding them to their respective documents do the trick?
Or maybe I should add the synonyms to a different field name?
Or am I looking at this from the wrong point of view?
Thanks
Setting a boost on a field affects all terms in that field, so this wouldn't work in your case.
But it should be possible using Lucene payloads (a byte array that can be set for every term). You would use them to set term-specific boosts (vehicle to 0.5 for doc 1, for example). Then you implement your own Similarity and override its scorePayload() method to decode that boost, and use PayloadTermQuery, which lets the payload of each matching term contribute to the score.
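A rough sketch of that setup against the Lucene 4.x-era payload API (class names and signatures changed in later releases; the field name "body" is illustrative, and it assumes the field was indexed through DelimitedPayloadTokenFilter so tokens written as term|weight, e.g. vehicle|0.5, carry a float payload):
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class PayloadBoostSimilarity extends DefaultSimilarity {

    // Decode each term's payload as its boost; terms without a payload keep
    // the neutral factor 1.0.
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return payload == null ? 1f : PayloadHelper.decodeFloat(payload.bytes, payload.offset);
    }

    // PayloadTermQuery runs each matching term's payload through scorePayload()
    // and folds the result into the document score.
    public static Query carQuery() {
        return new PayloadTermQuery(new Term("body", "car"), new AveragePayloadFunction());
    }
}
Set this Similarity on the IndexSearcher so the payload is picked up at scoring time.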

Bubbling up newest content in lucene search results

I am storing various articles in my lucene index.
When a user searches for articles which contain a specific term or phrase, I need to show all the articles (could be anywhere between 1,000 and 10,000 articles) but with the newest articles "bubbled up" in the search results.
I believe you can bubble up a search result in Lucene using "Date field Boosting".
Can someone please give me the details of how to go about this?
Thanks in advance!
I would implement the SortComparatorSource interface. You should write a new ScoreDocComparator, whose compare() function compares two dates. Then you will need to sort your searches using the new sorter. This advice is taken from chapter 6 of Lucene in Action.
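On newer Lucene versions, a simpler built-in route than a custom comparator is an ordinary Sort over a date field; a minimal sketch, assuming a numeric "published" field (e.g. a long timestamp) was indexed with doc values:
// true = reverse (descending), so the newest articles come back first.
Sort byNewest = new Sort(new SortField("published", SortField.Type.LONG, true));
TopDocs newestFirst = searcher.search(query, 100, byNewest);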
You can use the setBoost method to set the "boost" for a particular document in the index at index time. Since the default boost value is 1.0, setting a value less than 1.0 makes the document "less relevant" in search results. By tying a document's boost value to its age (the older the document, the lower the boost), you can make newer content appear more relevant in search results.
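A minimal sketch of that idea against the pre-4.0 Lucene API, where Document.setBoost() still exists (the field name, the publishedMillis parameter, and the one-year decay are all illustrative):
static Document buildArticleDoc(String text, long publishedMillis) {
    long ageInDays = (System.currentTimeMillis() - publishedMillis) / (1000L * 60 * 60 * 24);
    Document doc = new Document();
    doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
    // Fresh articles keep a boost near the default 1.0; older ones decay below it.
    doc.setBoost(1.0f / (1.0f + ageInDays / 365.0f));
    return doc;
}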
Note in the documentation for setBoost that the boost value set at indexing time is not available for retrieved documents (boost works, you just can't read the value back at retrieval time to see if you applied the correct value at index time).