Getting search suggestions to work on 2 (or more) non-consecutive words (to improve search on a medical conditions list - ICD10 codes) - lucene

Context:
We are using Azure Cognitive Services in a mobile app to search patient diagnostic codes (ICD10 codes).
The ICD10 code list is approximately 94,000 items. For anyone interested here is a list.
We currently have a standard Lucene analyzer set up on the diagnostic description field.
Requirement:
We want to provide a really good search-as-you-type experience that returns the most relevant suggestions.
Using the Suggest method with the fuzzy parameter set to true works reasonably well for a single search term:
As you can see it does well in finding partial matches and is resilient to typos.
The issue comes in when I add a second search term. E.g. I want to search for asthma that is moderate:
In both these examples, there is no match.
So when searching for more than one term, requiring the user to express this in the sequence that this is in the data is not a good user experience.
Using the Search method instead, we can overcome the problem of finding matches where 2 search terms are supplied that do not appear consecutively in the data:
And this is resilient to typos
However, this is not good at finding partial matches (like the Suggest does).
E.g. in this search, we would still want the term moderate to be picked up:
Seemingly if we could combine a wild card search with a fuzzy search we could solve this problem. e.g. supplying the following search phrase: ashtma~* AND moder~*.
But from what we have seen this syntax is not supported.
Any suggestions on how to overcome this limitation so we can get the best of both worlds, i.e:
For 2 or more search terms, it will work on partial matches
And the search terms are treated independently and do not need to appear consecutively in the data
Many thanks in advance,
Andreas.

I recommend using (or at least experimenting with) Lucene ngrams.
An example custom analyzer can use the NGramTokenFilter.
This filter splits each source token into one or more indexed tokens by chopping up the source into substrings of different lengths.
An example from the above link:
"abc" will give "a", "ab", "abc", "b", "bc", "c"
You can, as an example, set each token to be from 3 to 5 characters long (but this is one of the areas where you can experiment with different settings).
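To make the splitting concrete, here is a minimal sketch in plain Python. It is not Lucene's implementation, only an imitation of the behavior described above; the function name `ngrams` and its parameters are made up for this illustration.

```python
def ngrams(token: str, min_gram: int = 1, max_gram: int = 3) -> list[str]:
    """Return every substring of `token` whose length is between min_gram and max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            # Skip sizes that would run past the end of the token.
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


print(ngrams("abc"))           # ['a', 'ab', 'abc', 'b', 'bc', 'c']
print(ngrams("asthma", 3, 5))  # grams of length 3 to 5
```

With lengths 3 to 5, a six-letter word such as "asthma" already produces nine tokens, which is why the index grows.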
When you use this analyzer for indexing, it's going to create many more tokens (larger index) but that gives you more searching flexibility.
Use the same analyzer for searching.
If the user enters the following two words as their search values:
ashtma moder
You would convert that into the following Lucene search phrase:
ashtma~ AND moder~
This will find the following hits:
doc id = 12877
field = Moderate persistent asthma with status asthmaticus
doc id = 12874
field = Moderate persistent asthma
doc id = 12875
field = Moderate persistent asthma, uncomplicated
doc id = 12876
field = Moderate persistent asthma with (acute) exacerbation
doc id = 94210
field = Family history of asthma and oth chronic lower resp diseases
doc id = 6970
field = Xanthelasma of right lower eyelid
doc id = 6973
field = Xanthelasma of left lower eyelid
doc id = 6979
field = Chloasma of right lower eyelid and periocular area
doc id = 6982
field = Chloasma of left lower eyelid and periocular area
As you can see it does find some false positives, but the first four hits (the highest scored) are the ones you want.
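The conversion from the user's input to the search phrase could be sketched as follows. This is only an illustration: the helper name `build_fuzzy_query` is hypothetical, and lowercasing the terms and escaping query-syntax characters are assumptions added here so that stray input cannot be read as operators.

```python
import re

# Characters with special meaning in Lucene's classic query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def build_fuzzy_query(user_input: str) -> str:
    """Turn 'ashtma moder' into 'ashtma~ AND moder~'."""
    terms = user_input.lower().split()
    # Escape each term, then append the fuzzy operator.
    escaped = [_SPECIAL.sub(r"\\\1", term) for term in terms]
    return " AND ".join(f"{term}~" for term in escaped)


print(build_fuzzy_query("ashtma moder"))  # ashtma~ AND moder~
```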
You can see how this approach performs in terms of index size and search speed.
One reason for suggesting ngrams is your point about wanting to handle misspellings: ngrams may help to isolate spelling mistakes into smaller tokens, since the ~ fuzzy search operator is fairly limited in what it can handle. But definitely experiment with different ngram lengths, and maybe also without using ngrams at all.

Related

SQL Edit Distance: How have you handled Fuzzy String Matching using SQL in the past?

I have always wanted to ask for your views on this topic, so here we go:
My team just provided me with a list of customer accounts that we need to match against other databases. The main challenge is that our list is non-standardized, so the same accounts are named similarly, but not identically, to how they appear in our databases. For example:
My_List.Customers_Name    Customers_Database.Customers_Name
----------------------    ---------------------------------
Charles Schwab            Charles Schwab Corporation
So, for example, I use the Jaro-Winkler similarity function and edit distance to gather a list of similar strings, and then manually match the accounts if needed. My question is:
Which rules/filters do you apply to the results of the fuzzy data matching in order to reduce the amount of manual match?
I am referring to rules like:
If the string has more than 20 characters and Edit Distance <= 1, then it is probably the same account, so consider it a match. If the string has fewer than 4 characters and Edit Distance > 0, then it is probably not the same account, so consider it a mismatch.
The rules I apply are completely made up on my side. I am wondering whether there is a standard convention for applying fuzzy string matching so that it retrieves only useful results and reduces the manual matching workload.
If there is not, could you share your experience and how you have handled this before?
Thank you so much
I've done this a few times. It's hugely dependent on the data sets, and the rules change every time.
My process is:
select a random set of sample records to check my rule set - large enough to be representative, small enough to be able to scan visually.
create a "match" table with "original", "match" and "confidence score" columns.
write the rules, as "insert" or "update" statements to create records in the "match" table
run the rules on my sample data set
evaluate the matches on the samples. Tweak, add, configure the rules.
rinse & repeat
The "rules" depend hugely on the data set. Commonly I use the following:
strip out punctuation
apply common substitutions (e.g. "Corp" becomes "Corporation")
split into separate words; score the fraction of exact word matches out of 10 (so "Charles Schwab" matching "Charles Schwab Corporation" would be 2/3 ≈ 7 points, and "HSBC" matching "HSBC" is 1/1 = 10 points)
split into separate words; score the fraction of close word matches out of 5 (so "Chls Schwab" matching "Charles Schwab Corporation" would be 2/3 ≈ 3 points, and "HSBC" matching "HSCB" is 1/1 = 5 points)
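The four rules above could be sketched roughly like this in Python. Everything here is an assumption made for illustration: the substitution table, the 0.7 similarity threshold, the use of `difflib.SequenceMatcher` as the "close match" test, and dividing by the longer word list are not part of the original answer.

```python
import re
from difflib import SequenceMatcher

SUBSTITUTIONS = {"corp": "corporation", "inc": "incorporated"}  # hypothetical table
CLOSE_THRESHOLD = 0.7  # assumed cut-off for a "close" word match


def normalize(name: str) -> list[str]:
    """Lowercase, strip punctuation, split into words, apply substitutions."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    return [SUBSTITUTIONS.get(word, word) for word in words]


def is_close(a: str, b: str) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= CLOSE_THRESHOLD


def exact_score(original: str, candidate: str) -> int:
    """Fraction of exactly matching words, scaled to 10 points."""
    a, b = normalize(original), normalize(candidate)
    if not a or not b:
        return 0
    matched = sum(1 for word in a if word in b)
    return round(10 * matched / max(len(a), len(b)))


def close_score(original: str, candidate: str) -> int:
    """Fraction of closely matching words, scaled to 5 points."""
    a, b = normalize(original), normalize(candidate)
    if not a or not b:
        return 0
    matched = sum(1 for word in a if any(is_close(word, other) for other in b))
    return round(5 * matched / max(len(a), len(b)))


print(exact_score("Charles Schwab", "Charles Schwab Corp."))       # 7
print(close_score("Chls Schwab", "Charles Schwab Corporation"))    # 3
```

In practice the scores would be written to the "match" table described earlier, and the threshold and substitution list tuned against the sample records.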

SOLR and Ratio of Matching Words

Using SOLR version 4.3, it appears that SOLR is valuing the percentage of matching terms more than the number of matching terms.
For example, we search for Dog. One document contains just the word dog and three other words. Another article has hundreds of words and contains the word dog 27 times.
I would expect the second article to be returned first. However, the one where dog is one of only four words is returned first. I was hoping to find out what in SOLR controls this so I can make the appropriate modifications. I have looked at the SOLR documentation and have seen COORD mentioned, but it seems to indicate that the article with 27 references should be returned first. Any help would be appreciated.
In 4.x, Solr still used regular TF/IDF as its scoring formula, and you can see the Lucene implementation detailed in the documentation for TFIDFSimilarity.
For your question, the two factors that affect the score are:
The length of the field, as given in norm():
norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
.. while the number of terms matching (not their frequency), is given by coord():
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.
There are a few settings in your schema that can affect how Solr scores the documents in your example:
omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory)
.. this will remove the norm() part of the score.
omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field.
.. and this will remove the boost from multiple occurrences of the same term. Be aware that this will remove positions as well, making phrase queries impossible.
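A rough worked example may show why the short document wins. It assumes the default formulas of Lucene's classic TF-IDF similarity, tf = sqrt(frequency) and lengthNorm = 1/sqrt(number of terms), and ignores idf (identical for both documents in a single-term query), query normalization, boosts, and the lossy encoding of norms. The field length of 300 is a made-up stand-in for "hundreds of words".

```python
import math


def classic_field_score(term_freq: int, field_length: int, omit_norms: bool = False) -> float:
    """tf * lengthNorm only; idf and boosts are deliberately left out."""
    tf = math.sqrt(term_freq)
    length_norm = 1.0 if omit_norms else 1.0 / math.sqrt(field_length)
    return tf * length_norm


short_doc = classic_field_score(term_freq=1, field_length=4)
long_doc = classic_field_score(term_freq=27, field_length=300)
print(short_doc, long_doc)  # the short document scores higher

# With norms omitted, only term frequency remains and the order flips.
print(classic_field_score(1, 4, omit_norms=True),
      classic_field_score(27, 300, omit_norms=True))
```

Under these assumptions the short document scores about 0.5 against about 0.3 for the long one, so length normalization outweighs the 27 occurrences.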
But you should also consider upgrading Solr, as the BM25 similarity that's the default from 6.x usually performs better. I can't remember if a version is available for 4.3.

Forward Index Implementation in google

I am trying to develop a search engine in my free time modeled after google.
I am using the original google research paper listed here: http://infolab.stanford.edu/~backrub/google.html
However, I am having a few problems. To be exact, I am having a problem developing the forward index.
In the paper it says:
If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words.
Now there are two problems with this statement. First, who decides which words out of the huge lexicon go into the forward barrels? Do all of them go in? Second, what does "correspond" mean here? Does it mean words that actually appear in that document after the previous word, or something else?
I am really new to search engines and would really appreciate help from any information retrieval expert. If moderators think this question belongs on some other Stack Exchange site, please move it.
First Question:
The string value of every word is mapped to an integer (by a hash function). This is because integers are far easier to handle than strings. You can then define ranges (buckets or bins or whatever else you might want to call them) over these integer values, e.g.
term ids 0 to 1000 => Bin-1
term ids 1001 to 2000 => Bin-2
and so on.
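As a small sketch of that idea in Python, following the description in this answer rather than the paper's actual data layout: the lexicon size, the use of `zlib.crc32` as the hash, and storing word positions as the "hits" are all assumptions made for the example, and hash collisions are ignored.

```python
import zlib
from collections import defaultdict

LEXICON_SIZE = 3000  # hypothetical upper bound on term ids
BIN_WIDTH = 1000     # ids 0..1000 -> Bin-1, 1001..2000 -> Bin-2, ...


def term_id(word: str) -> int:
    """Map a word to a stable integer id (collisions are ignored here)."""
    return zlib.crc32(word.lower().encode("utf-8")) % LEXICON_SIZE


def bin_number(tid: int) -> int:
    """Return the 1-based bin whose id range contains `tid`."""
    return max(0, tid - 1) // BIN_WIDTH + 1


def build_forward_index(doc_id: int, words: list[str]) -> dict:
    """Return {bin: {doc_id: {term_id: [positions]}}} for one document."""
    barrels = defaultdict(lambda: defaultdict(list))
    for position, word in enumerate(words):
        tid = term_id(word)
        barrels[bin_number(tid)][tid].append(position)
    return {b: {doc_id: dict(entries)} for b, entries in barrels.items()}


print(build_forward_index(7, ["the", "quick", "brown", "the"]))
```

Every word of the document ends up in exactly one bin, so no separate decision about which words to include is needed.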
Second question:
The context information is typically not used. A word is simply a term present in a document, such as the terms "the", "quick", "brown" etc.
Since you said you are new to IR, a good way to start would be to read an introductory book on IR, e.g. Introduction to Information Retrieval by Manning, Raghavan and Schütze.

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, but 3 of them are "Twitter".
Elasticsearch gives the higher score to Document A even though Document B has a higher frequency. The score for Document B is "diluted" because it has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know Elasticsearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets the higher score?
As the other answer says it would be interesting to see whether you have the same result on a single shard. I think you would and that depends on the norms for the tags field, which is taken into account when computing the score using the tf/idf similarity (default).
In fact, Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how frequent the term is in the index, in order to compare it with other terms in the query (in your case it doesn't make any difference, since you are searching for a single term).
But there's another factor called norms, which rewards shorter fields and takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the other ones that contain that tag multiple times but a lot of other tags as well. If you don't like this behaviour you can just disable norms in your mapping for the tags field. Norms should be enabled by default if the field is "index":"analyzed" (default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (it usually makes sense but depends on your data and domain) or add the "omit_norms": true option in the mapping for your tags field.
Are the documents found on different shards? From the Elasticsearch documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
You can read more here.

Help needed ordering search results

I have 3 records in a Lucene index.
Record 1 contains healthcare in title field.
Record 2 contains healthcare and insurance in description field but not together.
Record 3 contains healthcare insurance in company name field.
When a user searches for healthcare insurance, I want to show the records in the following order in the search results:
a. Record #3, because it contains both words of the input together (i.e. as a phrase)
b.Record #1
c.Record #2
To put it another way, exact match of all keywords should be given more weight than matches of individual keywords.
How do I achieve this in Lucene?
Thanks.
You can use phrase + slop as bajafresh4life says, but it will fail to match anything if the terms are more than slop apart.
A slightly more complicated alternative is to construct a boolean query that explicitly searches for the phrase (with or without slop) and each of the terms in the phrase. E.g.
"healthcare insurance" OR healthcare OR insurance
Normal lucene relevance sort will give you what you want, and won't fail in the way that the "big slop" approach will.
You can also boost individual fields so that, for example, title is weighted more heavily than description or company name. This needs an even more complicated query, but gives you a lot more control over the ordering...
title:"healthcare insurance"^2 OR title:healthcare^2 OR title:insurance^2
OR description:"healthcare insurance" OR ...
It can be quite tricky to get the weights right, and you may have to play around with them to get exactly what you want (e.g. in the example I just gave, you might not want to boost the individual terms for title), but when you get it working, it's pretty nice :-)
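Generating these query strings from the user's input could look something like the sketch below. The helper names `phrase_or_terms` and `multi_field_query` are hypothetical, the boost values are placeholders to tune, and the sketch does not escape special query characters.

```python
def phrase_or_terms(user_input, field=None, boost=None):
    """Build '"a b" OR a OR b', optionally prefixed with a field and suffixed with a boost."""
    terms = user_input.split()
    clauses = list(terms)
    if len(terms) > 1:
        # Put the exact phrase first, followed by the individual terms.
        clauses.insert(0, '"' + " ".join(terms) + '"')
    prefix = f"{field}:" if field else ""
    suffix = f"^{boost}" if boost is not None else ""
    return " OR ".join(f"{prefix}{clause}{suffix}" for clause in clauses)


def multi_field_query(user_input, field_boosts):
    """Combine per-field clauses; a boost of None means no boost for that field."""
    return " OR ".join(
        phrase_or_terms(user_input, field, boost)
        for field, boost in field_boosts.items()
    )


print(phrase_or_terms("healthcare insurance"))
print(multi_field_query("healthcare insurance", {"title": 2, "description": None}))
```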
Rewrite the query with a phrase + slop factor. So if the query is:
healthcare insurance
you can rewrite it as:
"healthcare insurance"~100
Documents that have the words "healthcare" and "insurance" closer in proximity to each other will be scored higher. In this case, since the slop factor is 100, documents that have both words but are more than 100 terms apart will not match.
Rewriting the query involves manipulating the Term objects in a BooleanQuery. Take all the terms, create a PhraseQuery, and set a slop factor.
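If you work with query strings rather than Term objects, the same rewrite can be sketched in a few lines of Python. The function name `phrase_with_slop` is made up for this example, and single-word input is passed through unchanged as an assumption.

```python
def phrase_with_slop(user_input, slop=100):
    """Rewrite 'healthcare insurance' as '"healthcare insurance"~100'."""
    terms = user_input.split()
    if len(terms) < 2:
        # A single term cannot form a phrase, so leave it as a plain term query.
        return " ".join(terms)
    return '"' + " ".join(terms) + '"' + f"~{slop}"


print(phrase_with_slop("healthcare insurance"))  # "healthcare insurance"~100
```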