How to use percentage (floating) similarity fuzzy queries in Lucene? - lucene

Lucene, version: 7.3.0.
All I want is to use percentage (floating) similarity fuzzy queries (FuzzyQuery class) in Lucene.
defaultMinSimilarity is now deprecated, so I can use only defaultMaxEdits for my purposes.
As far as I can see, the maximum supported edit distance for org.apache.lucene.search.FuzzyQuery is 2:
MAXIMUM_SUPPORTED_DISTANCE = 2
What if I want to search for 55% similar strings, but for a term with a big length?
How can I do that with Lucene's FuzzyQuery?
Can I bypass that maximum-2-step edit distance restriction at all?

Can you bypass that FuzzyQuery limitation? No. Can you do it at all? Almost certainly yes, but you need to rethink the problem a bit. FuzzyQuery is not the answer.
You should consider, instead, how you could use analysis to solve your problem. Indexing n-grams would be the most direct solution for very loose, fuzzy-style matching; see NGramTokenFilter.
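A minimal sketch of what that could look like, assuming Lucene 7.x's analysis API; the gram size is illustrative:

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer trigramAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize, lowercase, then emit every trigram of each token, so two
        // strings that share many trigrams will match and score higher even
        // when their edit distance is far beyond FuzzyQuery's limit of 2.
        Tokenizer source = new StandardTokenizer();
        TokenStream filter = new LowerCaseFilter(source);
        filter = new NGramTokenFilter(filter, 3, 3);
        return new TokenStreamComponents(source, filter);
    }
};

Analyzing the query text with the same analyzer yields a BooleanQuery of trigram SHOULD clauses, so the score grows with trigram overlap; a minimum-should-match setting can then approximate a "55% similar" threshold.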

Related

How to index terms in lucene.net based on timestamp instead of position

I'm trying to index speech text in Lucene.NET 4.8, and I would like to search terms based on their timestamps (the moment each term was recognized). I have this data structured in JSON. My challenge is how to search the terms with a query like "Term1 and Term2 WITHIN/~ 5 seconds". I was thinking of using a proximity query (SpanQuery) and a custom Tokenizer for this, but as far as I understand it, SpanQuery is based on text position, so this approach is not very useful for this task.
Does anyone have any good advice/guidance on how to solve this in Lucene or any other FT library for that matter?
Any help is appreciated.
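One idea (sketched in Java; the Lucene.NET 4.8 analysis API is a close port) is to make positions mean seconds: a custom TokenFilter sets each token's position increment to the seconds elapsed since the previous token, so a SpanNearQuery with slop = 5 approximates "WITHIN 5 seconds". The parallel timestamp iterator here is an assumption about how the JSON transcript is fed in:

import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class TimestampPositionFilter extends TokenFilter {
    private final PositionIncrementAttribute posIncr =
        addAttribute(PositionIncrementAttribute.class);
    private final Iterator<Long> secondsPerToken; // hypothetical: one timestamp per token
    private long previous = -1;

    TimestampPositionFilter(TokenStream in, Iterator<Long> secondsPerToken) {
        super(in);
        this.secondsPerToken = secondsPerToken;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        long now = secondsPerToken.next();
        // Encode elapsed time as the positional gap (at least 1).
        posIncr.setPositionIncrement(previous < 0 ? 1 : (int) Math.max(1, now - previous));
        previous = now;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        previous = -1;
    }
}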

Lucene: how do I assign weights to the different search terms at query time?

I have a Lucene indexed corpus of more than 1 million documents.
I am searching for named entities such as "Susan Witting" by using the Lucene Java API for queries.
I would like to expand my queries by also searching for "Sue Witting" for example but would like that term to have a lower weight than the main query term.
How can I go about doing that?
I found information about the boost option in the Lucene manual, but it seems to be set at indexing time and to require fields.
You can boost each query clause independently. See the Query Javadoc.
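For the synonym case in the question, a minimal sketch using Lucene 6+'s BoostQuery (older versions used Query#setBoost instead); the field name and boost value are illustrative:

import org.apache.lucene.search.*;

// Main term at full weight, the expanded synonym at a lower weight.
Query main = new PhraseQuery("name", "susan", "witting");
Query synonym = new BoostQuery(new PhraseQuery("name", "sue", "witting"), 0.4f);
Query expanded = new BooleanQuery.Builder()
    .add(main, BooleanClause.Occur.SHOULD)
    .add(synonym, BooleanClause.Occur.SHOULD)
    .build();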
If you want to give different weights to the individual words of a multi-word query, then
Query#setBoost(float)
is not useful, because it boosts the query as a whole. A better way is to use the boost syntax in the query string and let the parser build the boosted clauses:
Query query = new QueryParser("some_key", analyzer).parse("stand^3 firm^2 always");
This gives a different weight to each word in the same query: stand is boosted by 3, firm by 2, and always keeps the default boost of 1. (Constructing a Term directly from a string like "stand^3 firm^2 always" would not work; the ^ syntax is only interpreted by the QueryParser.)

Improving lucene spellcheck

I have a Lucene index whose documents are in around 20 different languages, all in the same index, and I have a field 'lng' which I use to filter the results to only one language.
Based on this index I implemented a spell checker. The issue is that I get suggestions from all languages, which are irrelevant (if I am searching in English, suggestions in German are not what I need). My first idea was to create a separate spell-check index for each language and then select the index based on the language of the query, but I don't like this approach. Is it possible to add an additional column to the spell-check index and use that, or is there some better way to do this?
Another question is how I could improve suggestions for two or more terms in a search query. Currently I only do it for the first term, which could be strongly improved by using the terms in combination, but I could not find any samples or implementations which could help me solve this issue.
Thanks,
almir
As far as I know, it's not possible to add a 'language' field to the spellchecker index; I think you need to define several separate SpellCheckers to achieve this.
However, even if it were possible, it wouldn't solve the biggest problem, which is detecting the language of the query. That is a highly non-trivial task for very short texts that can include acronyms, proper nouns, and slang terms, and simple n-gram based methods can be inaccurate (e.g. the language detector from Tika). So I think the most challenging part is how to combine the confidence scores from the language detector and the spellchecker, and what threshold should be chosen to provide meaningful corrections (e.g. the language detector prefers German, but the spellchecker has a good match in Danish...).
EDIT: As it turned out in the comments that the language of the query is entered by the user as well, my answer is limited to: define multiple spellcheckers, one per language, as sketched below. As for the second question that you added, I think it was discussed before, for example here.
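A minimal sketch of that multiple-spellchecker setup (directory paths, language codes, and the queryLanguage variable are illustrative; FSDirectory.open takes a Path in recent Lucene versions):

import java.nio.file.Paths;
import java.util.*;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

// One spellchecker index per language, selected by the user-supplied language code.
Map<String, SpellChecker> byLanguage = new HashMap<>();
for (String lng : new String[] { "en", "de", "fr" }) {
    byLanguage.put(lng, new SpellChecker(FSDirectory.open(Paths.get("spell-" + lng))));
}
String[] suggestions = byLanguage.get(queryLanguage).suggestSimilar(term, 5);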
If you look at the source of SpellChecker.SuggestSimilar you can see:
BooleanQuery query = new BooleanQuery();
String[] grams;
String key;
for (int ng = GetMin(lengthWord); ng <= GetMax(lengthWord); ng++)
{
    <...>
    if (bStart > 0)
    {
        Add(query, "start" + ng, grams[0], bStart); // matches start of word
    }
    <...>
}
That is, the suggestion search is just a bunch of OR'd boolean clauses. You can certainly modify this code with something like:
query.Add(new BooleanClause(new TermQuery(new Term("Language", "German")),
BooleanClause.Occur.MUST));
which will only look for suggestions in German. There is no way to do this without modifying your code though, apart from having multiple spellcheckers.
To deal with multiple terms, use QueryTermExtractor to get an array of your terms, run the spellcheck for each, and take the Cartesian product of the suggestions, as sketched below. You may then want to run a query on each combination and sort by how frequently it occurs (similar to how the single-word spellchecker ranks its suggestions).
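A rough sketch of that combination step, assuming an initialized org.apache.lucene.search.spell.SpellChecker; queryTerms and the suggestion count are illustrative:

List<List<String>> perTerm = new ArrayList<>();
for (String term : queryTerms) {
    // Fall back to the original term when there are no suggestions.
    String[] suggestions = spellChecker.suggestSimilar(term, 5);
    perTerm.add(suggestions.length > 0 ? Arrays.asList(suggestions)
                                       : Collections.singletonList(term));
}

// Cartesian product: every combination of one suggestion per term.
List<String> combos = new ArrayList<>();
combos.add("");
for (List<String> options : perTerm) {
    List<String> next = new ArrayList<>();
    for (String prefix : combos)
        for (String option : options)
            next.add(prefix.isEmpty() ? option : prefix + " " + option);
    combos = next;
}
// Each combo can now be run as a query and ranked by how many hits it gets.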
After implementing two different search features on two different sites with both Lucene and Sphinx, I can say that Sphinx is the clear winner.
Consider using Sphinx (http://sphinxsearch.com/) instead of Lucene. It's used by Craigslist, among others.
They have a feature called morphology preprocessors:
# a list of morphology preprocessors to apply
# optional, default is empty
#
# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
# 'soundex', and 'metaphone'; additional preprocessors available from
# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
# (see libstemmer_c/libstemmer/modules.txt)
#
# morphology = stem_en, stem_ru, soundex
# morphology = libstemmer_german
# morphology = libstemmer_sv
morphology = none
There are many stemmers available, and as you can see, German is among them.
UPDATE:
Elaboration on why I feel that Sphinx has been the clear winner for me:
Speed: Sphinx is stupid fast, both at indexing time and when serving search queries.
Relevance: Though it's hard to quantify this, I felt that I was able to get more relevant results with sphinx compared to my lucene implementation.
Dependence on the filesystem: With Lucene, I was unable to break the dependence on the filesystem, and while there are workarounds, like creating a RAM disk, I felt it was easier to just select Sphinx's "run only in memory" option. This has implications for websites with more than one webserver, for adding dynamic data to the index, for reindexing, etc.
Yes, these are just points of opinion. However, they are the opinion of someone who has tried both systems.
Hope that helps...

Lucene: Setting minimum required similarity on searches

I'm having a lot of trouble dealing with Lucene's similarity factor. I want it to apply a similarity factor different from its default (which is 0.5 according to the documentation), but it doesn't seem to be working.
When I type a query that explicitly sets the required similarity factor, like [tinberland~0.5] (notice that I wrote tiNberland, with an "N", while the correct spelling has an "M"), it brings back many products by the manufacturer Timberland. But when I just type [tinberland] (no similarity factor explicitly defined) and try to set the similarity via code, it doesn't work (returns no results).
The code I wrote to set the similarity is like:
multiFieldQueryParser.SetFuzzyMinSim(0.5F);
And I didn't change the Similarity algorithm, so it is using the DefaultSimilarity class.
Isn't that the correct or recommended way of applying similarity via code? Is there a specific QueryParser for fuzzy queries?
Any help is highly appreciated.
Thanks in advance!
What you are setting is the minimum similarity, so e.g. if someone searched for foo~.1 the parser would change it to foo~.5. It does not mean "turn every query into a fuzzy query."
You can use MultiFieldQueryParser.getFuzzyQuery like so:
Query q = parser.getFuzzyQuery(field, term, minSimilarity);
but that will of course require you to call getFuzzyQuery for each field. I'm not aware of a "MultiFieldFuzzyQueryParser" class, but all it would do is combine a bunch of those getFuzzyQuery calls.
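A minimal sketch of that combination, shown here with the newer FuzzyQuery API that takes an edit distance rather than a float similarity; the field names are illustrative:

BooleanQuery.Builder builder = new BooleanQuery.Builder();
for (String field : new String[] { "name", "brand", "description" }) {
    // One fuzzy clause per field, OR'd together.
    builder.add(new FuzzyQuery(new Term(field, "tinberland"), 2),
                BooleanClause.Occur.SHOULD);
}
Query multiFieldFuzzy = builder.build();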

Search for (Very) Approximate Substrings in a Large Database

I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000-character substring that differs from its match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams can do this, but I don't know the implementation details. I have also heard that Lucene can do it, but is Lucene's Levenshtein algorithm fast enough for hundreds of edits? Perhaps something out of the world of plagiarism detection? Any advice is appreciated.
Q-grams could be one approach, but there are others, such as BLAST and BLASTP, which are used for protein and nucleotide matching, among other things.
The SimMetrics library is a comprehensive collection of string-distance approaches.
Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA, and Locality-Sensitive Hashing (LSH). I believe an efficient method should first prune the search space heavily, and only then apply more sophisticated scoring to the remaining candidates.
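A toy sketch of that prune-then-verify idea using q-gram count filtering; all names are illustrative, the pruning bound is a simplified one, and a real system would use an inverted q-gram index instead of a linear scan:

import java.util.*;

static Set<String> qgrams(String s, int q) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + q <= s.length(); i++) grams.add(s.substring(i, i + q));
    return grams;
}

// Classic two-row dynamic-programming Levenshtein distance.
static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1], curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;
        for (int j = 1; j <= b.length(); j++)
            curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                               prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
}

static List<String> approxMatches(String query, List<String> corpus, int q, int maxEdits) {
    Set<String> queryGrams = qgrams(query, q);
    // Pruning bound: a single edit destroys at most q overlapping q-grams, so
    // candidates sharing fewer q-grams than this cannot be within the budget.
    int minShared = queryGrams.size() - q * maxEdits;
    List<String> results = new ArrayList<>();
    for (String candidate : corpus) {
        Set<String> shared = qgrams(candidate, q);
        shared.retainAll(queryGrams);
        // Cheap filter first; expensive Levenshtein only on survivors.
        if (shared.size() >= minShared && levenshtein(query, candidate) <= maxEdits)
            results.add(candidate);
    }
    return results;
}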