How to configure Solr to use Levenshtein approximate string matching?

How to configure Solr to use Levenshtein approximate string matching? - lucene

Does Apaches Solr search engine provide approximate string matches, e.g. via Levenshtein algorithm?
I'm looking for a way to find customers by last name. But I cannot guarantee the correctness of the names. How can I configure Solr so that it would find the person
"Levenshtein" even if I search for "Levenstein" ?

Typically this is done with the SpellCheckComponent, which internally uses the Lucene SpellChecker by default, which implements Levenshtein.
The wiki really explains very well how it works, how to configure it and what options are available, no point repeating it here.
Or you could just use Lucene's fuzzy search operator.
Another option is using a phonetic filter instead of Levenshtein.

Great answer by Mauricio, my only "cheapo" addition is to just append the ~ character to all terms that you want to fuzzy match on the way in to solr. If you are using the default set up, this will give you fuzzy match.

Related

Azure Search - issues with Phonetic Analyzer

Our clients query on our Azure Search index, mostly for people's names. We are using the Lucene analyzer for all of our fields. We build the query string by making the client's input name into a phrase, and adding proximity rate of 3. Because we search using a phrase, we can not use the Fuzzy Search capability of the Lucene analyzer, as it only works on single words.
We were therefore in search of a solution for being able to bring back results with names that weren't spelled exactly as the client input them. We came across the phonetic analyzer, and have just implemented the Metaphone algorithm into our index. We've run some tests and while it gets us closer to what we need, we still see some issues:
The analyzer's scope is so wide, that it's bringing back a lot of false positives. For example, when searching on Kenneth Gooden, it brings back Kenneth Cotton. That's just a little too far to be considered phonetically similar, in our opinion. Can the sensitivity be tweaked in any way, or, can something be done to boost some other parameter to remedy this?
When doing a search on Barry Soper, the first and highest-scored result that comes back is "Barry Spear." The second result, scored lower, is "Soper, Barry Russell." To a certain extent, I can maybe see why it's scored that way (b/c of the 2nd one being last name first) but then... not really. The 2nd result contains both exact terms within the required proximity. Maybe Azure Search gives priority to the order of words in the phrase before applying the analyzer? Still doesn't make sense to me. (Side note - this query also brings back "Barh Super" - see issue #1 above)
I would like to know if someone could offer suggestions to tweak Azure Search's behavior to work more along the lines of what we need, OR, perhaps suggest an alternative to the phonetic analyzer. We haven't tried any of the the other available phonetic algorithms either yet, only b/c it seems Metaphone is the best and most commonly-used. But we're open to suggestions regarding the other algorithms as well.
Thanks.

You are correct that the fuzzy operator only works on single terms. In this case, you can use a custom analyzer (phonetic tokenfilter) or Synonyms feature (in preview). I am not sure what you meant by "we have just implemented the Metaphone algorithm into our index" but there are several phonetic tokenfilters you can choose from in Azure Search custom analysis stack. Synonyms is a newer feature only available in preview, you can take a look here. For synonyms, you will need to define synonyms rules, say 'Nate, Nathan, Nathaniel' for example, and at query time, searching for one automatically includes the results for the others.
Okay, then how should I use these building blocks in a way to control relevance for my search? One way to model is to use separate field for each expansion strategy. For example, instead of a single field for the name, you can have three fields, say 'name', 'name_synonym', and 'name_phonetic'. The first field 'name' is for exact matches, 'name_synonym' field has synonyms enabled and the third uses a phonetic analyzer and broadens the search the most. You can then use the scoring profile to boost scores from matches in each field. You can give the boost value of 10 for exact matches, 5 for synonyms and 1 for phonetic expansions, for example. Your search will be issued against these three internal fields.
Regarding your question as to why 'Soper, Barry Russell' is ranked lower than 'Barry Spear'. After the phonetic analysis. the words 'soper' and 'spear' reduce to the same form both at indexing and query time and treated as if they were identical terms. In computing the score and ranking, the search engine uses analyzed form of the terms and phonetic similarity makes no influence to the score. That’s why, secondary factors, like field length, will play a more significant role influencing the relevance score.
Hope this helps. I provided one example to model this but you could also take a look at term boosting in the full lucene query syntax.
Let me know if you have any additional questions.
Nate

getting the document length during query evaluation apache lucene 5.3

I am trying to change the scoring in apache lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar question, you don't have an easy way to do it. because lucene doesn't keep it at the index. so I thought maybe while indexing I will create an Map from docID to the document length, and then use it in query evaluation. But, I have no idea where I should put this map and where I will update it.

You are exactly right, storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm, that's something different). Norms provide a single value stored with the field, which is made available at query time for scoring.
In your Similarity implementation, this should go into the ComputeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.GetNormValues.
If you are extending TFIDFSimilarity, instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.

Solr spellcheck vs fuzzy search

I don't quite understand the difference between apache solr's spell check vs fuzzy search functionality.
I understand that fuzzy search matches your search term with the indexed value based on some difference expressed in distance.
I also understand that spellcheck also give you suggestions based on how close your search term is to a value in the index.
So to me those two things are not that different though I am sure that this is due to my shortcoming in understanding each feature thoroughly.
If anyone could provide an explanation preferably via an example, I would greatly appreciate it.
Thanks,
Bob

I'm not a professional in the Solr but I try to explain.
Fuzzy search is a simple instruction for Solr to use a kind of spellchecking during requests - Solr’s standard query parser supports the fuzzy search and you can use this one without any additional settings, for example: roam~ or roam~1. And this so-colled spellcheking is used a Damerau-Levenshtein Distance or Edit Distance algorithm.
To use spellchecking you need to configure it in the solrconfig.xml (please, see here). It gives you sort of flexibility how to implement spellcheking (there are a couple of OOTB implementation) so, for example, you can use another index for spellcheck thereby you decrease load on main index. Also for spellchecking you use another URL: /spell so it is not a search query like fuzzy query.
Why should I use spellcheking or fuzzy search? I guess it is depended on your server loading because the fuzzy search is more expensive and not recommended by the Solr team.
P.S. It is my understanding of fuzzy and spellcheking so if somebody has more correct and clear explanation, please, give us advice how to deal with them.

How to restrict a search to multiple sites?

Restricting a search to multiple sites - is it possible to do this?
E.g. site:www.google+www.yahoo.com?

Bing does not provide a separate method of specifying which sites to search like Google's annotations. Instead, you need to add them as parameters in your query string. To do so, use the OR logical operator along with the site: specification. Bing prioritizes AND over OR, so make sure you put parentheses around the OR'ed terms. For example,
example search terms (site:google.com OR site:yahoo.com)
If you are adding a lot of sites, keep in mind that the total URL length must be less than 2048 characters, encoded.

Use OR operator and site: keyword.
E.g.
(site:http://superuser.com/ OR site:http://stackoverflow.com/) (some query)
It should work in most of the search engines such as Google, Bing, Yahoo, DuckDuckGo, etc.
Note: It's important to use capital letters for operators such as AND and OR, otherwise it could not work.

Format your query like this:
site:dell.com OR site:ibm.com "search phrase"
So each site/domain with OR between. Tested seems to work for both Bing and Google.
Syntax shamelessly taken from searchenginewatch.com.

How can I exclude certain URLs in Solr / Lucene

I have setup new instance of Solr indexing on a website. I want Solr NOT to index certain URL patterns. Is there any way of mentioning such exclude-pattern?
Regards,
Paras

It can be done in the program, index only if the pattern does not match the exclude pattern.

You can do the filtering in Solr using an UpdateRequestProcessor. In that UpdateRequestProcessor, you could decide whether or not to index the document if it matches or not your regex.

Do you have a crawler going about and collecting data? I would lean towards doing that logic in the crawler. Solr is more of respository, and I don't think is the best place to put lots of indexing logic in.
Eric

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas