Lucene 4.*: FuzzyQuery and finding positions of hits - lucene

The various examples I see about how to find positions of the matches returned by an IndexSearcher either require retrieving the document's content and search a TokenStream or to index the positions and offsets in the term vectors, turn the query into a term and find it in the term vector.
But what happens when I use a FuzzyQuery? Is there a way to know which term(s) exactly matched in the hit so that I may look for them in the term vector of this document?
In case that's of any value, I'm new to Lucene and my goal here is to annotate a set of documents (the ones indexed in Lucene) with a set of terms, but the documents are from scanned texts and contain OCR errors, therefore I must use a FuzzyQuery. I thought about using lucene-suggest to do some spellchecking beforehand but it occured to me that it boiled down to trying to find fuzzy matches.

Related

SOLR and Ratio of Matching Words

Using SOLR version 4.3, it appears that SOLR is valuing the percentage of matching terms more than the number of matching terms.
For example, we do a search for Dog and a document with just the word dog and a three other words returns. We have another article with hundreds of words, that has the word dog in it 27 times.
I would expect the second article to return first. However, the one with one word out of three returns first. I was hoping to find out what in SOLR controls this so I can make the appropriate modifications. I have looked the SOLR documentation and have seen COORD mentioned, but it seems to indicate that the article with 27 references should return first. Any help would be appreciated.
For 4.x Solr still used regular TF/IDF as its scoring formula, and you can see the Lucene implementation detailed in the documentation for TFIDFSimilarity.
For your question, the two factors that affect the score is:
The length of the field, as given in norm():
norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
.. while the number of terms matching (not their frequency), is given by coord():
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.
There are a few settings in your schema that can affect how Solr scores the documents in your example:
omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory)
.. this will remove the norm() part of the score.
omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field.
.. and this will remove the boost from multiple occurrences of the same term. Be aware that this will remove positions as well, making phrase queries impossible.
But you should also consider upgrading Solr, as the BM25 similarity that's the default from 6.x usually performs better. I can't remember if a version is available for 4.3.

Lucene Spans: How are the Positions Useful?

Lucene Spans objects have startPosition() and endPosition() methods, which, according to their Javadoc return "position[s] in the current doc." How are these useful?
My understanding is that these positions are the indices of the start and end tokens of the span—indices after an Analyzer has processed the original text. But after digging around Javadocs for awhile, I don't know what I can do with these positions. It seems like I should be able to query a document and, say, get the tokens between startPosition and endPosition, or maybe get the offsets corresponding to the positions, but I don't see anything like that.
How can I relate these positions back to the original text?
Perhaps span positions are only useful for search queries. From Lucene Javadocs: "phrase and proximity searches rely on position info."

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, but 3 of them is "Twitter".
Elastic Search gives the higher score to Document A even though Document B has a higher frequency. But the score is "diluted" because it has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know ElasticSearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets a higher score above?
As the other answer says it would be interesting to see whether you have the same result on a single shard. I think you would and that depends on the norms for the tags field, which is taken into account when computing the score using the tf/idf similarity (default).
In fact, lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverted document frequency, in other words how the term is frequent in the index, in order to compare it with other terms in the query (in your case it doesn't make any difference if you are searching for a single term).
But there's another factor called norms, that rewards shorter fields and take into account eventual index time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason of your result enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important that the other ones that contains that tag multiple times but a lot of ther tags as well. If you don't like this behaviour you can just disable norms in your mapping for the tags field. It should be enabled by default if the field is "index":"analyzed" (default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (it usually makes sense but depends on your data and domain) or add the "omit_norms": true option in the mapping for your tags field.
Are the documents found on different shards? From Elastic search documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
You can read more here.

understanding the relationship between boosting a document in lucene at index time and its corresponding score at search time

When indexing, I boost certain documents, but they do not appear on the top of the list of retrieved documents. I looked at the score of those documents, and somehow, the score of the documents retrieved is always NaN.
What is the relationship between a boost of a document at index time and its score at retrieve time? I thought these would be correlated, and further, I thought I would get a wide range of scores in my scoredocs, not just NaN. If you can shed some light on this I would be grateful.
I have read http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html
and cant figure out what is missing.
Here is the simple boosting code:
if (myCondition)
{
myDocument.SetBoost(1.1f);
}
myIndexWriter.AddDocument(document);
I'm gonna go on a wild guess here since you havent provide a sample of you search code, but a common reason why the score of retreived docs is NaN is because you use a Sort. When sorting, most of the time the score of the documents is not used, and therefore disabled by default.
If you use a Sort for your search, and want the score, check the method setDefaultFieldSortScoring of the IndexSearcher class. This method allows you to enable scoring the documents in a search that uses a Sort.
http://lucene.apache.org/java/2_9_4/api/all/org/apache/lucene/search/IndexSearcher.html#setDefaultFieldSortScoring(boolean, boolean)

Bubbling up newest content in lucene search results

I am storing various articles in my lucene index.
When user searches for articles which contain a specific term or phrase,I need to show all th articles (could be anywhere between 1000 to 10000 articles) but with newest articles "bubbled up" in the search results.
I believe you can bubble up a search result in Lucene using "Date field Boosting".
Can someone please give me the details of how to go about this?
Thanks in advance!
I would implement the SortComparatorSource interface. You should write a new ScoreDocComparator, whose compare() function compares two dates. Then you will need to sort your searches using the new sorter. This advice is taken from chapter 6 of Lucene in Action.
You can use the setBoost method to set the "boost" for a particular document in the index at index time. Since the default boost value is 1.0, setting a value less than 1.0 will make the document "less relevant" in search results. By tying the boost value of a document to its age (lower boost the older the document gets), you can make newer content seem more relevant in search results.
Note in the documentation for setBoost that the boost value set at indexing time is not available for retrieved documents (boost works, you just can't read the value back at retrieval time to see if you applied the correct value at index time).