Lucene 7 scoring vs Lucene 4

We had a Lucene 4 index. We successfully upgraded our index and our code to Lucene 7.7. Everything is working nicely. But when looking at the scores, we noticed that we now see large numbers like 57.24629, 51.479507 or 48.664692, whereas for the same search results we used to see values like 4.1282754 or 2.685965.
We do not have anything tricky like custom scorers or custom Similarity.
We would like to understand the difference between Lucene 7 and Lucene 4 scores. Not only are the Lucene 7 scores larger numbers, but some results have also moved up or down in the ranking. I thought the basic formula was the same. Has the basic TF-IDF formula changed?
I see that Lucene 7 is using BM25Similarity. Could that be why? Can someone explain BM25Similarity to me?
I have tried using explain(), but the results just overwhelm me. Is there an easier way of making sense of the explanation?
Thanks.
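(A note on what changed: Lucene 6.0 switched the default Similarity from the classic TF-IDF implementation, now called ClassicSimilarity, to BM25Similarity, which both rescales the absolute numbers and can reorder results. The sketch below, written against the stock Lucene 7.7 API with a hypothetical index path and field name, shows how to set ClassicSimilarity explicitly on the searcher to compare the two, and how to dump explain() for each hit; for fully consistent classic scoring the same Similarity should also be set on the IndexWriterConfig when the index is built.)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class CompareSimilarities {
    public static void main(String[] args) throws Exception {
        // Hypothetical index location and field name.
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Lucene 6+ defaults to BM25Similarity; switch back to classic TF-IDF for comparison.
            searcher.setSimilarity(new ClassicSimilarity());
            Query query = new QueryParser("body", new StandardAnalyzer()).parse("some search terms");
            TopDocs top = searcher.search(query, 10);
            for (ScoreDoc sd : top.scoreDocs) {
                System.out.println("doc=" + sd.doc + " score=" + sd.score);
                // explain() breaks the score down factor by factor (tf, idf, field length, boosts...).
                System.out.println(searcher.explain(query, sd.doc));
            }
        }
    }
}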

Related

How to index terms in lucene.net based on timestamp instead of position

I'm trying to index speech text in Lucene.net 4.8 and I would like to search for terms based on their timestamps (the moment the term was recognized). I have this data structured in JSON. My challenge is how to search for the terms with a query like "Term1 and Term2 WITHIN/~ 5 seconds". I was thinking of using a proximity query (SpanQuery) and a custom Tokenizer for this, but as far as I understand it, SpanQuery is based on term position, so this approach is not very useful for this task.
Does anyone have any good advice/guidance on how to solve this in Lucene or any other FT library for that matter?
Any help is appreciated.
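One way to make the position-based approach work anyway (sketched against the Java Lucene API, which Lucene.NET 4.8 mirrors closely): at analysis time, set each token's position increment to the number of seconds elapsed since the previous token, so that a SpanNearQuery slop of 5 roughly means "within 5 seconds". The filter below and the per-token timestamp array it consumes are hypothetical, not an existing Lucene class.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Hypothetical filter: maps recognition timestamps (in seconds) onto token positions.
// Assumes the timestamp array is aligned one-to-one with the tokens produced upstream.
final class TimestampPositionFilter extends TokenFilter {
    private final PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
    private final int[] secondsPerToken; // one entry per token, taken from the JSON transcript
    private int index;
    private int lastSeconds;

    TimestampPositionFilter(TokenStream input, int[] secondsPerToken) {
        super(input);
        this.secondsPerToken = secondsPerToken;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        int seconds = secondsPerToken[index++];
        // Advance the position by the elapsed seconds (at least 1).
        posIncr.setPositionIncrement(Math.max(1, seconds - lastSeconds));
        lastSeconds = seconds;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        index = 0;
        lastSeconds = 0;
    }
}

// Query side: slop is now measured in (approximate) seconds rather than word positions.
class WithinSecondsQuery {
    static SpanQuery termsWithinSeconds(String field, String term1, String term2, int seconds) {
        return new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term(field, term1)),
            new SpanTermQuery(new Term(field, term2))
        }, seconds, false); // false = the two terms may appear in either order
    }
}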

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of Wikipedia images with their textual descriptions in XML files (1 XML file per image). I have indexed those XMLs in Solr. Now, while retrieving them, I want to apply a threshold on score values so that documents with lower scores do not appear in the result (because they are not of much importance). For example, I want to retrieve all documents having a similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get it working. Does anyone have any idea how I can do that?
What's the motivation for wanting to do this? The reason I ask is that score is a relative thing determined by Lucene based on your index statistics. It is only meaningful for comparing the results of a specific query with a specific instance of the index. In other words, it isn't useful to filter on because there is no way of knowing what a good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general it's a bad idea to cut off at some fixed value, because you'll never know which threshold is best. For a good query it could be score=2, for a bad query score=0.5, etc.
These two links should explain why you DON'T want to do it.
P.S. If you still want to do it, take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend fixing your search queries instead, so that they search with higher precision (http://en.wikipedia.org/wiki/Precision_and_recall).
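If you decide to enforce a floor anyway, keeping the caveats above in mind, one commonly used trick is a frange filter over the query function, so that only documents whose raw score reaches the cutoff are returned. A rough SolrJ sketch, assuming SolrJ 6+ and hypothetical core URL and field names:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ScoreFloorExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core URL and description field.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/images").build()) {
            SolrQuery q = new SolrQuery();
            q.setQuery("description:mountain");
            // frange over the query function keeps only docs whose raw score for $q is >= 2.0.
            q.addFilterQuery("{!frange l=2.0}query($q)");
            q.setFields("id", "score");
            QueryResponse rsp = solr.query(q);
            rsp.getResults().forEach(doc ->
                System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("score")));
        }
    }
}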

Lucene: detecting missing spaces

I'm writing a search engine with Lucene.net for a database of ~ 2 million products. I'm using the Snowball Analyzer and so far I've been really impressed with the performance and result sets.
The one issue I can't seem to overcome is detecting missing spaces in search inputs.
For Example:
A User is looking for 'Black Diamond' brand products but they search for 'blackdiamond'.
Since the Snowball analyzer creates two separate tokens for 'Black Diamond', I get 0 results.
What approach can I take to correct this issue? I've looked a bit into the Shingle Analyzer (n-grams) but I'm not sure if that would help.
Is it possible to combine a Shingle Analyzer with the SpellChecker (and would that be an effective solution)? It would be ideal if I could just prompt people with a Did You Mean: 'Black Diamond'? link when this occurs.
How about initially running the user query as is? If there are no results (or the top score is below a certain threshold), run N additional searches (where N is the number of ways to break the word in two) and show the user the results for the split that received the highest score.
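A rough sketch of that fallback against the Lucene 4.x-style Java API (Lucene.NET 4.8 mirrors it closely); the helper name, field name, and minimum-part-length guard are assumptions, and in practice the two candidate substrings should be run through the same analyzer used at index time (e.g. Snowball) before building the queries:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Hypothetical helper: tries every way of splitting an unspaced query term in two
// and returns the split whose two-term query scores highest.
class MissingSpaceGuesser {
    static String[] bestSplit(IndexSearcher searcher, String field, String input, int minPartLength)
            throws Exception {
        String[] best = null;
        float bestScore = 0f;
        for (int i = minPartLength; i <= input.length() - minPartLength; i++) {
            // Note: analyze left/right with the index-time analyzer before querying.
            String left = input.substring(0, i);
            String right = input.substring(i);
            // Lucene 4.x-style BooleanQuery; Lucene 5+ uses BooleanQuery.Builder instead.
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term(field, left)), BooleanClause.Occur.MUST);
            query.add(new TermQuery(new Term(field, right)), BooleanClause.Occur.MUST);
            TopDocs hits = searcher.search(query, 1);
            if (hits.totalHits > 0 && hits.scoreDocs[0].score > bestScore) {
                bestScore = hits.scoreDocs[0].score;
                best = new String[] { left, right };
            }
        }
        return best; // null if no split produced any hits -> no "Did you mean" suggestion
    }
}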

Lucene: how do I assign weights to the different search terms at query time?

I have a Lucene indexed corpus of more than 1 million documents.
I am searching for named entities such as "Susan Witting" by using the Lucene Java API for queries.
I would like to expand my queries by also searching for "Sue Witting" for example but would like that term to have a lower weight than the main query term.
How can I go about doing that?
I found info about the boosting option in the Lucene manual, but it seems to be set at indexing time and to require fields.
You can boost each query clause independently. See the Query Javadoc.
If you want to give different weights to the individual words of a query, then
Query#setBoost(float)
is not enough, because it boosts the query as a whole. A better way is to use the classic QueryParser's per-word boost syntax (here queryParser is a QueryParser built for the "some_key" field):
Query query = queryParser.parse("stand^3 firm^2 always");
This gives a different weight to each word within the same parsed query: stand is boosted by 3, firm by 2, and always keeps the default boost of 1.
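For the original question, a minimal sketch assuming Lucene 6+/7, where per-clause boosts are applied by wrapping a clause in BoostQuery (older releases used Query.setBoost instead); the field name and the lowercased terms (as a lowercasing analyzer would produce) are assumptions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

class NameVariantQuery {
    static Query build(String field) {
        // Primary form, weighted higher; expanded variant, weighted lower.
        Query main = new PhraseQuery(field, "susan", "witting");
        Query variant = new PhraseQuery(field, "sue", "witting");
        return new BooleanQuery.Builder()
                .add(new BoostQuery(main, 2.0f), BooleanClause.Occur.SHOULD)
                .add(new BoostQuery(variant, 0.5f), BooleanClause.Occur.SHOULD)
                .build();
    }
}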

Lucene document score appears to be lost after search

In Lucene 3.1 I have a large boolean query that I execute like so:
IndexSearcher is = new IndexSearcher(myDir);
is.search(query, 10);
I get 10 results just fine, but they are sorted by docId and contain no score information. All the documentation I can find says that Lucene sorts by relevance/score by default, but this is not the case for me. If I ask for an explain, there is no score information, just "0.0". The funny thing is that if I execute the same query in Luke on the same index, I get results sorted by score just fine, but I can't see how to get the scores to stay and be used for sorting when launched from my app. So I believe the query is fine, seeing how it works in Luke.
What am I doing wrong? I have also tried setting is.setDefaultFieldSortScoring(true, true), but this makes no difference. I also tried using TopScoreDocCollector with no success.
Look at Lucene scoring, particularly the query norm. If one of your weights is Float.MAX_VALUE everything else will be close enough to zero that it's smaller than machine precision.
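A quick sanity check along those lines, sketched against the Lucene 3.x API: scan the boolean clauses for extreme boosts such as Float.MAX_VALUE, print the raw ScoreDoc scores, and dump explain() for the top hit (the helper name is hypothetical):

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class ScoreSanityCheck {
    static void check(IndexSearcher is, BooleanQuery query) throws Exception {
        // Look for clauses with extreme boosts that would flatten every other score.
        for (BooleanClause clause : query.getClauses()) {
            if (clause.getQuery().getBoost() >= Float.MAX_VALUE / 2) {
                System.out.println("Suspicious boost on clause: " + clause);
            }
        }
        TopDocs hits = is.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println("doc=" + sd.doc + " score=" + sd.score);
        }
        if (hits.scoreDocs.length > 0) {
            // Breaks the top score down factor by factor (tf, idf, queryNorm, boosts...).
            System.out.println(is.explain(query, hits.scoreDocs[0].doc));
        }
    }
}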