I am using a BooleanQuery built from TermQuery clauses, all on the same field and all currently set to 'SHOULD'.
I have tried to figure out how the ranking of the resulting ScoreDoc[] works for this query, but haven't been able to find the right documentation. Maybe you can help with the following questions:
1) Will a BooleanQuery rank hits that match all terms higher than hits that only match single terms?
2) Is there a way to determine which TermQuery was matched and which was not for a resulting ScoreDoc object?
Thanks for the help!
A boolean query does rank hits on multiple query terms more highly than those that only match one, but keep in mind, that is only one part of the scoring algorithm. There are a number of other impacts that could wash that out.
Query terms combined by a boolean query have their sub-scores added together to form the final score, so documents matching more query terms will naturally be weighted more heavily. On top of that, there is a coord factor, which is larger when a larger proportion of the query terms are matched, and which is multiplied into the score.
However, multiple matches of the same query term, document length, term rarity, and boosts also impact the score, and it's quite possible for documents to get a higher score from these impacts even though they don't match all terms.
See the TFIDFSimilarity docs for details on the algorithm in use here.
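As a rough illustration of the points above, here is a minimal sketch of a classic TF-IDF sum over two SHOULD clauses. It assumes the textbook forms tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and it leaves out queryNorm, length norms and boosts. The class name, index statistics and frequencies are made up for illustration and are not taken from Lucene itself.

```java
public class TfIdfSketch {
    // Simplified classic TF-IDF term score: tf * idf^2 (no norms, no boosts).
    static double termScore(int freq, int docFreq, int numDocs) {
        if (freq == 0) {
            return 0.0; // the term does not occur, so the clause does not match
        }
        double tf = Math.sqrt(freq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        return tf * idf * idf;
    }

    // Sum of the clause scores, multiplied by coord = matched clauses / total clauses.
    static double score(int[] freqs, int[] docFreqs, int numDocs) {
        double sum = 0.0;
        int matched = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                matched++;
            }
            sum += termScore(freqs[i], docFreqs[i], numDocs);
        }
        return sum * matched / freqs.length;
    }

    public static void main(String[] args) {
        int numDocs = 1000;          // hypothetical index size
        int[] docFreqs = {99, 99};   // hypothetical: each term occurs in 99 documents
        double both = score(new int[] {1, 1}, docFreqs, numDocs);
        double oneOften = score(new int[] {100, 0}, docFreqs, numDocs);
        System.out.println("matches both terms once:    " + both);
        System.out.println("matches one term 100 times: " + oneOften);
    }
}
```

With these made-up numbers, the document that matches only one term (but 100 times) still scores higher than the one matching both terms once, even after the coord factor halves its score.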
To understand the scoring of a document for your query, you should get familiar with Explanation. You can get a human-readable explanation of why a document was scored the way it was like this:
Explanation explain = searcher.explain(myQuery, resultDocNo);
System.out.print(explain.toString());
To identify the fragments of the documents which matched the query, you can use Highlighter, a simple use of which might be:
QueryScorer scorer = new QueryScorer(myQuery);
Highlighter highlighter = new Highlighter(scorer);
String fragment = highlighter.getBestFragment(analyzer, fieldName, myDoc.get(fieldName));
I have no idea what setDisableCoord is or what value I should set for it. I understand coord in a simple query (e.g. a TF/IDF query), but I don't understand what it means in a Boolean query consisting of several queries.
To give some context, assume the following two scenarios. What value should I set in setDisableCoord for each of them?
In the first scenario I have a query with BooleanClause.Occur.FILTER (the query is used only for filtering) and another one for scoring (BooleanClause.Occur.MUST). In this scenario the first query only checks if the "year" field of the document is in a specified range and the second query uses some algorithm for ranking.
In the second scenario, I have two queries with BooleanClause.Occur.SHOULD whose scores must be combined to obtain the final retrieval score of documents.
Summary: On Lucene 6.x with the default BM25 similarity, set disableCoord to true; otherwise leave it at false (the default).
Coord is a scoring feature of BooleanQuery that counteracts some of TF/IDF's shortcomings with over-saturated terms. It is only relevant when there are multiple SHOULD clauses. In your first scenario, all sub-queries must match, so no coord factor is involved and the disableCoord parameter has no effect.
In the second scenario, with multiple SHOULD clauses, a BooleanQuery sums all the sub-scores to determine which documents are the better matches. The idea is that a document that matches more sub-queries is a better match and thus gets a better score.
Now, imagine a query x OR y and a document that has 1000 occurrences of x but none of y. With TF/IDF, due to the high termFreq(x), the sub-score of x is very high and so is the resulting score of x OR y, which can push this document ahead of others that match both terms. That is not what BooleanQuery was meant to do. This is where the coord comes into play.
The coord factor is calculated per document as (number of SHOULD clauses matched) / (total number of SHOULD clauses in the query). This gives a number in [0..1] that represents what fraction of the sub-queries match the document. The summed score of all sub-queries is then multiplied by this coord factor. A document matching all SHOULD clauses keeps its original score, the sum of all sub-scores, while a document matching only x out of x OR y has its score halved, counteracting the high score that the over-saturated x gave. If you disable coord, this factor is not calculated and the final score is simply the sum of the sub-scores.
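The arithmetic described above can be sketched in a few lines of plain Java. This is only an illustration of the formula, not Lucene's implementation; the class name, the sub-score values, and the convention that a sub-score of 0 means "clause did not match" are assumptions made for the example.

```java
public class CoordSketch {
    // coord = number of SHOULD clauses matched / total number of SHOULD clauses.
    static double coord(int matched, int total) {
        return (double) matched / total;
    }

    // Sums the sub-scores and, unless coord is disabled, multiplies by coord.
    // Assumption for this sketch: a sub-score of 0 means the clause did not match.
    static double combine(double[] subScores, boolean disableCoord) {
        double sum = 0.0;
        int matched = 0;
        for (double s : subScores) {
            if (s > 0.0) {
                matched++;
                sum += s;
            }
        }
        return disableCoord ? sum : sum * coord(matched, subScores.length);
    }

    public static void main(String[] args) {
        double[] onlyX = {8.0, 0.0}; // hypothetical: over-saturated x, no y
        double[] both = {3.0, 3.0};  // hypothetical: moderate x and y
        System.out.println("coord enabled:  onlyX=" + combine(onlyX, false)
                + " both=" + combine(both, false));
        System.out.println("coord disabled: onlyX=" + combine(onlyX, true)
                + " both=" + combine(both, true));
    }
}
```

With coord enabled, the document matching both clauses wins (6.0 against 4.0); with coord disabled, the over-saturated single-term document wins (8.0 against 6.0).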
Coord was designed with TF/IDF in mind, and other similarity formulas might not suffer from over-saturated terms. BM25, which became the default similarity in Lucene 6.0, handles such over-saturated terms much better, controlled by its k1 parameter. Instead of a score that keeps growing with increasing termFreq, BM25 approaches a limit and stops growing. It gives almost no additional boost to a document with termFreq=1000 over one with termFreq=5, but it does give a clear boost to termFreq=1 over termFreq=0. Britta Weber gave a talk at Berlin Buzzwords about this, in which she explains the saturation curve.
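To make the saturation behaviour concrete, here is a small sketch that compares the term-frequency part of BM25 with the square-root tf of classic TF/IDF. It assumes the common BM25 form freq * (k1 + 1) / (freq + k1) with length normalization switched off (b = 0) and k1 = 1.2; the class and method names are hypothetical.

```java
public class Bm25SaturationSketch {
    // Term-frequency part of BM25 without length normalization (b = 0).
    // It approaches the limit k1 + 1 as freq grows.
    static double bm25Tf(double freq, double k1) {
        return freq * (k1 + 1) / (freq + k1);
    }

    // Term-frequency part of classic TF/IDF for comparison: sqrt(freq).
    static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // assumed value for this sketch
        for (double freq : new double[] {0, 1, 5, 1000}) {
            System.out.printf("freq=%6.0f  bm25=%.3f  classic=%.3f%n",
                    freq, bm25Tf(freq, k1), classicTf(freq));
        }
    }
}
```

Under these assumptions the BM25 value never exceeds k1 + 1 = 2.2, so the step from termFreq=5 to termFreq=1000 adds far less than the step from 0 to 1, while the classic tf keeps growing.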
That means that for BM25 the coord factor is no longer necessary and might actually lead to counter-intuitive results. It has already been removed from Lucene master and will be gone in 7.0.
If you're using Lucene 6.x with the default similarity BM25, it's a good idea to always disable the coord, as BM25 does not suffer from the problem coord worked around. If you're using TF/IDF (regardless of 6.x or not), disabling coord will only give you more predictable results as long as your term frequencies are evenly distributed (which they practically never are), and setting disableCoord to false (the default) will give results that are intuitively better.
I have a Lucene indexed corpus of more than 1 million documents.
I am searching for named entities such as "Susan Witting" by using the Lucene Java API for queries.
I would like to expand my queries by also searching for "Sue Witting" for example but would like that term to have a lower weight than the main query term.
How can I go about doing that?
I found information about the boosting option in the Lucene manual, but it seems to be set at indexing time and it needs fields.
You can boost each query clause independently. See the Query Javadoc.
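One way to picture per-clause boosts without touching the Lucene API is the classic query parser syntax, where a quoted phrase can be followed by ^boost. The sketch below only builds such a query string. The helper, its name and the boost values 2.0 and 0.5 are hypothetical, and it does not escape special characters inside the phrases.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ExpandedQuerySketch {
    // Builds a query string in which every phrase carries its own boost,
    // e.g. "main name"^2.0 "variant"^0.5. Clauses are separated by spaces.
    static String build(Map<String, Float> phraseBoosts) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, Float> entry : phraseBoosts.entrySet()) {
            joiner.add("\"" + entry.getKey() + "\"^" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the main name comes first.
        Map<String, Float> phrases = new LinkedHashMap<>();
        phrases.put("Susan Witting", 2.0f); // main query term, higher weight
        phrases.put("Sue Witting", 0.5f);   // expansion, lower weight
        System.out.println(build(phrases));
    }
}
```

The resulting string could then be handed to a query parser, or the same idea could be expressed by setting a boost on each clause's Query object before adding it to a BooleanQuery.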
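If you want to give different weights to the individual words of a query, then
Query#setBoost(float)
on the query as a whole is not useful. A better way is to use the query parser's boost syntax:
Query query = new QueryParser("some_key", analyzer).parse("stand^3 firm^2 always");
This gives different weights to words in the same query. Here, the word stand is boosted by three, firm by two, and always has the default boost value. Note that a Term constructed directly from this text is not parsed, so the ^ syntax only takes effect when the string goes through the query parser (whose constructor arguments vary between Lucene versions).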
We have an application where every term position in a document is associated with an "engine score".
A term query should then be scored according to the sum of "engine scores" of the term in a document, rather than on the term frequency.
For example, term frequency of 5 with an average engine score of 100 should be equivalent to term frequency of 1 with engine score 500.
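The intended equivalence can be made concrete with a tiny sketch in plain Java (no Lucene API involved). The class and method names are hypothetical, and the per-position scores are simply the numbers from the example above.

```java
public class EngineScoreSketch {
    // Sums the per-position engine scores of one term in one document.
    // This sum is what should replace the raw term frequency in scoring.
    static double sumOfEngineScores(double[] perPositionScores) {
        double sum = 0.0;
        for (double score : perPositionScores) {
            sum += score;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] fiveOccurrences = {100, 100, 100, 100, 100}; // tf = 5, average 100
        double[] oneOccurrence = {500};                       // tf = 1, score 500
        System.out.println(sumOfEngineScores(fiveOccurrences)); // 500.0
        System.out.println(sumOfEngineScores(oneOccurrence));   // 500.0
    }
}
```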
As I understand it, if I keep the engine score per position in the payload, I will be able to use scorePayload in combination with a summing version of PayloadFunction to get the sum of engine scores of a term in a document, and so achieve my goal.
There are two issues with this solution:
Even the simplest term query would have to scan the positions file in order to get the payloads, which could be a performance issue.
We would prefer to index the sum of engine scores in advance per document, in addition to the term frequency. This is a sort of payload at the document level. Does Lucene support that, or is there any other solution for this issue?
The "engine score" of a phrase occurrence is defined as the multiplication of engine scores of the terms that compose the phrase.
So in scorePayload I need the payloads of all the terms in the phrase in order to be able to appropriately score the phrase occurrence.
As far as I understand, the current interface of scorePayload does not provide this information.
Is there another way this can be achieved in Lucene?
One workaround for a document-level payload is to create one additional Lucene document per source document that contains only the engine score for the whole document, stored in a specially named field (one that differs from all other Lucene document field names). You can then extract and combine that document during your searches. Not much of a workaround, but there it is.
Quite new to Solr 1.4 - seems to be very powerful indeed. However, I am stuck when trying to return search results in order of relevancy (score) and rating_value (a 0 to 5 star rating on each result).
I've tried ordering search results by "rating desc, score desc", and while this works, it feels a bit basic.
I would ultimately like to boost the relevancy of a search result based on how many stars it has been rated as (0 to 5). A 5-star result should therefore give the highest boost.
I did try adding 'rating_value:1^0.1 rating_value:2^0.2' etc, etc, but this seems to massively boost answers that have no keyword match, but do have a high star rating.
Any help is VERY much appreciated!
Thanks, Seb
You are on the right track with adding the "rating_value" terms with boost values. However, make sure when you are constructing your query, that the keyword terms are "MUST" terms, which will require the doc to contain that term in order for it to be returned.
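As a minimal sketch of that idea, the helper below builds a query string in which every keyword is prefixed with + (required) while the rating_value clauses stay optional and only contribute to the score. The helper, its name and the keyword "pizza" are hypothetical; the two rating boosts are the ones from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RequiredKeywordQuerySketch {
    // Keywords become required (+keyword); rating clauses stay optional
    // and carry a boost (rating_value:N^boost).
    static String build(List<String> keywords, Map<Integer, String> ratingBoosts) {
        StringBuilder sb = new StringBuilder();
        for (String keyword : keywords) {
            sb.append('+').append(keyword).append(' ');
        }
        for (Map.Entry<Integer, String> entry : ratingBoosts.entrySet()) {
            sb.append("rating_value:").append(entry.getKey())
              .append('^').append(entry.getValue()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // TreeMap keeps the rating clauses sorted by rating value.
        Map<Integer, String> boosts = new TreeMap<>();
        boosts.put(1, "0.1");
        boosts.put(2, "0.2");
        System.out.println(build(Arrays.asList("pizza"), boosts));
    }
}
```

Because the keyword clause is required, a document with a high rating but no keyword match is no longer returned, whatever boost its rating clause carries.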
From there you can play with the relative boost values for each term. If the rating boost is too high, you can give the keywords more boost, and vice versa. It's important to know that the absolute values of the boosts are not comparable across fields, i.e. giving keywords a boost of 20 and rating_value a boost of 19 does not mean that keywords will be boosted more, mainly because of length normalization. See Lucene's Similarity for more info.
If you are using the DisMax request handler, you should also consider boosting using the bq (boost query) parameter, as this boost only affects documents that are already matched by the user's query.
You would predefine the bq field in solrconfig.xml inside the request handler e.g.
<str name="bq">
rating_value:1^0.1 rating_value:2^0.2
</str>
Does TermQuery.ExtractTerms result in a higher count when termvectors/positions/offsets are turned on (assuming that there is more than one occurrence of a match)? Conversely, with the inverted file info turned off, does ExtractTerms always return one and only one term?
EDIT: How and where does turning on termvectors in the schema affect scoring?
TermQuery.ExtractTerms extracts the terms in the query, not the result. So a search for "foo:bar" will always return exactly one term, regardless of what's in the index.
It sounds to me like you want to know about highlighting, not Query.ExtractTerms.
EDIT: Based on your comment, it sounds like you are asking: "how is scoring affected by term vectors?" The answer to that is: not at all. The term frequency, norm, etc. are calculated at index time, so it doesn't matter what you store.
The major exception is PhraseQuery with slop, which uses the term positions. A minor exception is that custom scoring classes can use whatever data they want, so not only term vectors but also payloads etc. can potentially affect the score.
If you're just running TermQuery searches, though, what you store should have no effect.