SQL Edit Distance: How have you handled Fuzzy String Matching using SQL in the past? - sql

I have always wanted to ask for your views on this topic, so here we go:
My team just provided me with a list of customer accounts that we need to match against other databases. The main challenge is that our list is non-standardized, so the same account is often named slightly differently in our list than in the databases. For example:
My_List.Customers_Name | Customers_Database.Customers_Name
Charles Schwab | Charles Schwab Corporation
So, for example, I use the Jaro-Winkler similarity function and edit distance to gather a list of similar strings, and then manually match the accounts if needed. My question is:
Which rules/filters do you apply to the results of the fuzzy data matching in order to reduce the amount of manual matching?
I am referring to rules like:
If the string has more than 20 characters and Edit Distance <= 1, then it will probably be the same account, so consider it a match. If the string has fewer than 4 characters and Edit Distance > 0, then it will probably not be the same account, so consider it a mismatch.
These rules are completely made up on my side. I am wondering whether there is a standard convention for applying fuzzy string matching so as to retrieve only useful results and reduce the manual matching workload.
If there is not, could you share your experience and how you have handled this before?
Thank you so much

I've done this a few times. It's hugely dependent on the data sets, and the rules change every time.
My process is:
select a random set of sample records to check my rule set - large enough to be representative, small enough to be able to scan visually.
create a "match" table with "original", "match" and "confidence score" columns.
write the rules, as "insert" or "update" statements to create records in the "match" table
run the rules on my sample data set
evaluate the matches on the samples. Tweak, add, configure the rules.
rinse & repeat
The "rules" depend hugely on the data set. Commonly I use the following:
strip out punctuation
apply common substitutions (e.g. "Corp" becomes "Corporation")
split into separate words; score the fraction of words that match exactly, out of 10 (so "Charles Schwab" matching "Charles Schwab Corporation" would be 2/3 = 7 points, and "HSBC" matching "HSBC" is 1/1 = 10 points)
split into separate words; score the fraction of words that match closely, out of 5 (so "Chls Schwab" matching "Charles Schwab Corporation" would be 2/3 = 3 points, and "HSBC" matching "HSCB" is 1/1 = 5 points)
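As a rough illustration of the two word-level scoring rules above, here is a minimal Python sketch. The function names, the 0.7 closeness cut-off and the use of difflib as the "close match" test are assumptions made for this example, not part of the original rule set, and the substitution table only holds the one example mentioned.

```python
import re
from difflib import SequenceMatcher

# Illustrative substitution table (assumption: extend it for your own data).
SUBSTITUTIONS = {"corp": "corporation"}


def words(name):
    """Lower-case, strip punctuation, apply substitutions, split into words."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return [SUBSTITUTIONS.get(w, w) for w in cleaned.split()]


def is_close(a, b, cutoff=0.7):
    """Hypothetical 'close match' test; the 0.7 cut-off is an arbitrary choice."""
    return SequenceMatcher(None, a, b).ratio() >= cutoff


def word_score(left, right, points, match):
    """Fraction of words in `left` matched in `right`, scaled to `points`."""
    lw, rw = words(left), words(right)
    if not lw or not rw:
        return 0
    hits = sum(1 for w in lw if any(match(w, other) for other in rw))
    return round(points * hits / max(len(lw), len(rw)))


def exact_score(left, right):
    return word_score(left, right, 10, lambda a, b: a == b)


def close_score(left, right):
    return word_score(left, right, 5, is_close)
```

The two scores can then be written to the "confidence score" column of the match table, and the cut-off tuned against the sample set.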

Related

Getting search suggestions to work on 2 (or more) non-consecutive words (to improve search on a medical conditions list - ICD10 codes)

Context:
We are using Azure Cognitive Services in a mobile app to search patient diagnostic codes (ICD10 codes).
The ICD10 code list is approximately 94,000 items. For anyone interested here is a list.
We currently have a standard Lucene analyzer set up on the diagnostic description field
Requirement:
We want to provide a really good search as you type experience, which provides the most relevant suggestions
Using the Suggest method with the fuzzy parameter set to true works reasonably well for a single search term:
As you can see it does well in finding partial matches and is resilient to typos.
The issue comes in when I add a second search term. E.g. I want to search for asthma that is moderate:
In both these examples, there is no match.
So when searching for more than one term, requiring the user to express this in the sequence that this is in the data is not a good user experience.
Using the Search method instead, we can overcome the problem of finding matches where 2 search terms are supplied that do not appear consecutively in the data:
And this is resilient to typos
However, this is not good at finding partial matches (like the Suggest does).
E.g. in this search, we would still want the term moderate to be picked up:
Seemingly if we could combine a wild card search with a fuzzy search we could solve this problem. e.g. supplying the following search phrase: ashtma~* AND moder~*.
But from what we have seen this syntax is not supported.
Any suggestions on how to overcome this limitation so we can get the best of both worlds, i.e:
For 2 or more search terms, it will work on partial matches
And the search terms are treated independently and do not need to appear consecutively in the data
Many thanks in advance,
Andreas.
I recommend using (or at least experimenting with) Lucene ngrams.
An example custom analyzer can use the NGramTokenFilter.
This filter splits each source token into one or more indexed tokens by chopping up the source into substrings of different lengths.
An example from the above link:
"abc" will give "a", "ab", "abc", "b", "bc", "c"
You can, as an example, set each token to be from 3 to 5 characters long (but this is one of the areas where you can experiment with different settings).
When you use this analyzer for indexing, it's going to create many more tokens (larger index) but that gives you more searching flexibility.
Use the same analyzer for searching.
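To make the token explosion concrete, here is a small Python sketch of what character n-gram splitting produces. It only mimics the idea for illustration; it is not Lucene's NGramTokenFilter, and the function name is made up for this example.

```python
def char_ngrams(token, min_n, max_n):
    """Return all substrings of `token` with length between min_n and max_n."""
    grams = []
    for start in range(len(token)):
        for size in range(min_n, max_n + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


print(char_ngrams("abc", 1, 3))     # the "abc" example from above
print(char_ngrams("asthma", 3, 5))  # a 3-to-5 character setting
```

With a 3-to-5 setting, a six-letter word such as "asthma" already yields nine tokens, which is why the index grows.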
If the user enters the following two words as their search values:
ashtma moder
You would convert that into the following Lucene search phrase:
ashtma~ AND moder~
This will find the following hits:
doc id = 12877
field = Moderate persistent asthma with status asthmaticus
doc id = 12874
field = Moderate persistent asthma
doc id = 12875
field = Moderate persistent asthma, uncomplicated
doc id = 12876
field = Moderate persistent asthma with (acute) exacerbation
doc id = 94210
field = Family history of asthma and oth chronic lower resp diseases
doc id = 6970
field = Xanthelasma of right lower eyelid
doc id = 6973
field = Xanthelasma of left lower eyelid
doc id = 6979
field = Chloasma of right lower eyelid and periocular area
doc id = 6982
field = Chloasma of left lower eyelid and periocular area
As you can see it does find some false positives, but the first four hits (the highest scored) are the ones you want.
You can see how this approach performs in terms of index size and search speed.
One reason for suggesting ngrams is your point about wanting to handle mis-spellings: ngrams may help to isolate spelling mistakes into smaller tokens, since the ~ fuzzy search operator is fairly limited in what it can handle. But definitely experiment with different ngram lengths, and maybe also without using ngrams at all.

How to do Fuzzy match in Snowflake / SQL

How to do fuzzy match in Snowflake / SQL
Here is the business logic
The ABC Company INC, The north America, ABC (these two should show as a match)
The 16K LLC, 16K LLC (these two should show as a match)
I attached some test data. Thanks so much, guys!
Any matching attempt that treats string pairs like "The ABC Company INC" and "The north America, ABC" or "Preferred ABC Group" and "The Preferred Residences" as a match is probably going to give you many false positive matches, since in some of your examples there is only one word similar between the strings.
That said, Snowflake does provide a couple of functions that might help: EDITDISTANCE and JAROWINKLER_SIMILARITY.
EDITDISTANCE generates a number that represents the Levenshtein distance between two strings (basically the number of edits it would take to change one string into the other). A lower number indicates fewer edits needed and so potentially a closer match.
JAROWINKLER_SIMILARITY uses an algorithm to calculate a "similarity" score between 0 and 100 for two strings. A higher number indicates more similarity, 100 being an exact match.
You could use either or both of these functions to generate scores for each pair of strings and then decide on a threshold that best represents a match for your purposes.
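For readers who want to see what the edit-distance number means, here is a minimal Python sketch of Levenshtein distance plus a length-normalised similarity. It is an illustration of the idea only, not Snowflake's implementation, and the `similarity` helper is an assumption added for this example.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and replacements to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # replace (or keep)
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: 1.0 for identical strings, lower for more edits."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("16K LLC", "The 16K LLC"))
print(similarity("16K LLC", "The 16K LLC"))
```

A raw distance of 4 means something very different for a 7-character name than for a 40-character one, which is why a normalised score is often easier to put a threshold on.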

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.
So far I've found different solutions, but I think they are quite old and not very efficient. I'm sure some better methods exist, so if you have references for me to read, I'm sure it is a subject that may interest several people.
The solutions I found (examples are in R) :
Levenshtein distance, which equals the number of characters you have to insert, delete or change to transform one word into another.
agrep("acusait", c("accusait", "abusait"), max = 2, value = TRUE)
## [1] "accusait" "abusait"
The comparison of phonemes
library(RecordLinkage)
soundex(x<-c('accusait','acusait','abusait'))
## [1] "A223" "A223" "A123"
The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), but not very efficient on addresses I guess.
I thought about using the suggestions of Google Suggest, but likewise, it is not very efficient on personal postal addresses.
You could imagine using a supervised machine learning approach, but you need to have stored the misspelled requests of users to do so, which is not an option for me.
I look at this as a spelling-correction problem, where you need to find the nearest-matching word in some sort of dictionary.
What I mean by "near" is Levenshtein distance, except the smallest number of single-character insertions, deletions, and replacements is too restrictive.
Other kinds of "spelling mistakes" are also possible, for example transposing two characters.
I've done this a few times, but not lately.
The most recent case had to do with concomitant medications for clinical trials.
You would be amazed how many ways there are to misspell "acetylsalicylic".
Here is an outline in C++ of how it is done.
Briefly, the dictionary is stored as a trie, and you are presented with a possibly misspelled word, which you try to look up in the trie.
As you search, you try the word as it is, and you try all possible alterations of the word at each point.
As you go, you have an integer budget of how many alterations you can tolerate, which you decrement every time you put in an alteration.
If you exhaust the budget, you allow no further alterations.
Now there is a top-level loop, where you call the search.
On the first iteration, you call it with a budget of 0.
When the budget is 0, it will allow no alterations, so it is simply a direct lookup.
If it fails to find the word with a budget of 0, you call it again with a budget of 1, so it will allow a single alteration.
If that fails, try a budget of 2, and so on.
Something I have not tried is a fractional budget.
For example, suppose a typical alteration reduces the budget by 2, not 1, and the budget goes 0, 2, 4, etc.
Then some alterations could be considered "cheaper". For example, a vowel replacement might only decrement the budget by 1, so for the cost of one consonant replacement you could do two vowel replacements.
If the word is not misspelled, the time it takes is proportional to the number of letters in the word.
In general, the time it takes is exponential in the number of alterations in the word.
If you are working in R (as I was in the example above), I would have it call out to the C++ program, because you need the speed of a compiled language for this.
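The trie-plus-budget search described above can be sketched compactly. This is a minimal Python illustration written for this page, not the C++ outline the answer refers to; the class and function names are made up, and only insertions, deletions and replacements are tried (no transpositions, no fractional costs).

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set when a dictionary word ends at this node


def build_trie(dictionary):
    root = TrieNode()
    for entry in dictionary:
        node = root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.word = entry
    return root


def search(node, word, i, budget, found):
    """Walk the trie along word[i:], spending budget on alterations."""
    if i == len(word) and node.word is not None:
        found.add(node.word)
    # No alteration: follow the next letter as it is.
    if i < len(word) and word[i] in node.children:
        search(node.children[word[i]], word, i + 1, budget, found)
    if budget == 0:
        return
    for ch, child in node.children.items():
        # The input is missing a letter that the dictionary word has.
        search(child, word, i, budget - 1, found)
        # The input has a wrong letter at this position.
        if i < len(word) and ch != word[i]:
            search(child, word, i + 1, budget - 1, found)
    # The input has an extra letter: skip it.
    if i < len(word):
        search(node, word, i + 1, budget - 1, found)


def fuzzy_lookup(root, word, max_budget=3):
    """Top-level loop: try budget 0, then 1, then 2, ... until something matches."""
    for budget in range(max_budget + 1):
        found = set()
        search(root, word, 0, budget, found)
        if found:
            return budget, sorted(found)
    return None, []


root = build_trie(["accusait", "abusait", "acetylsalicylic"])
print(fuzzy_lookup(root, "acusait"))
print(fuzzy_lookup(root, "acetilsalicylic"))
```

Because the budget starts at 0 and grows, the first non-empty result is also the set of nearest dictionary words under this alteration model.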
Extending what Mike had to say, and using the string matching library stringdist in R to match a vector of addresses that errored out in ARCGIS's geocoding function:
library(stringdist)
rows <- length(unmatched$address)
# vectors to hold the matched addresses and their similarity scores
matched_add <- rep(NA, rows)
score <- rep(NA, rows)
# for instructional purposes only; you should use sapply to apply functions to vectors
for (u in seq_len(rows)) {
  # gives you the position of the closest match in the index address vector
  pos <- amatch(unmatched$address[u], index$address, maxDist = Inf)
  matched_add[u] <- index$address[pos]
  # stringsim here will give you the score to go back and adjust your parameters
  score[u] <- stringsim(unmatched$address[u], index$address[pos])
}
Stringdist has several methods you can use to find approximate matches, including Levenshtein (method="lv"). You'll probably want to tinker with these to fit your dataset as well as you can.

Algorithm for almost similar values search

I have Persons table in SQL Server 2008.
My goal is to find Persons who have almost similar addresses.
The address is described with columns state, town, street, house, apartment, postcode and phone.
Due to some specific differences in some states (not US) and human factor (mistakes in addresses etc.), address is not filled in the same pattern.
Most common mistakes in addresses
Case sensitivity
Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
Spaces, dots, commas
Differences in writing street names, like "Dr. Jones str." or "Doctor Jones street" or "D. Jon. st." or "Dr Jones st" etc.
The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.
Is there any algorithm for this kind of issue?
Thanks in advance.
UPDATE
As I mentioned, the address is separated into different columns. Should I generate a string by concatenating the columns, or apply your steps to each column?
I assume I shouldn't concatenate columns, but if I compare columns separately, how should I organize it? Should I find similarities for each column and then union them, intersect them, or something else?
Should I collect some statistics or use some kind of learning algorithm?
Suggest approaching it thus:
Create word-level n-grams (a trigram/4-gram might do it) from the various entries
Do a many x many string comparison and cluster the entries by string distance. Someone suggested Levenshtein; there are better ones for this kind of task: Jaro-Winkler distance and Smith-Waterman work better. A library such as SimMetrics would make life a lot easier
Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams i.e. D.Jones St => Davy Jones St. => DJones St.
Should not be too hard, this is an all-too-common problem.
Update: Based on your update above, here are the suggested steps
Concatenate your columns into a single string, perhaps by creating a database view. For example:
create view vwAddress
as
select top 10000
state, town, street, house, apartment, postcode,
state + town + street + house + apartment + postcode as Address
from ...
Write a separate application (say in Java or C#/VB.NET) and use an algorithm like Jaro-Winkler to estimate the string distance for the combined address, creating a many x many comparison, and write the results into a separate table:
address1 | address n | similarity
You can use SimMetrics to get the similarity thus:
JaroWinkler objJw = new JaroWinkler();
double sim = objJw.GetSimilarity(address1, addressN);
You could also trigram it so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on, and compare the trigrams (or even 4-grams) for higher accuracy.
Finally you can order by similarity to get a cluster of most similar addresses and decide on an appropriate threshold. Not sure why you are stuck.
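A minimal Python sketch of the word-level trigram idea, for illustration. The function names are made up, and comparing the trigram sets with a Jaccard ratio is an assumption chosen for this example; the answer itself does not say how the trigrams should be compared.

```python
import re


def word_ngrams(address, n=3):
    """Split an address into words and return every run of n consecutive words."""
    tokens = re.findall(r"\w+", address)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_similarity(a, b, n=3):
    """Jaccard ratio of the two word n-gram sets (assumed comparison method)."""
    grams_a = set(word_ngrams(a.lower(), n))
    grams_b = set(word_ngrams(b.lower(), n))
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(word_ngrams("1 Jones Street, Sometown, SomeCountry"))
print(ngram_similarity("1 Jones Street, Sometown, SomeCountry",
                       "1 Jones Street Sometown"))
```

Addresses shorter than n words produce no n-grams at all here, so a real run would need a fallback for short entries.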
I would try to do the following:
split up the address in multiple words, get rid of punctuation at the same time
check all the words for patterns that are typically written differently and replace them with a common name (e.g. replace apartment, ap., ... by apt, replace Doctor by Dr., ...)
put all the words back in one string alphabetically sorted
compare all the addresses using a fuzzy string comparison algorithm, e.g. Levenshtein
tweak the parameters of the Levenshtein algorithm (e.g. you want to allow more differences on longer strings)
finally do a manual check of the strings
Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.
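The steps above (split, strip punctuation, replace common variants, sort the words, fuzzy-compare with a length-dependent tolerance) could look roughly like this in Python. The replacement table, the function names and the "one edit per five characters" tolerance are all assumptions picked for the example; the final manual check is of course not covered.

```python
import re

# Illustrative replacement table (assumption: build yours from the real data).
COMMON_FORMS = {"apartment": "apt", "ap": "apt", "doctor": "dr",
                "street": "st", "str": "st"}


def normalize(address):
    """Lower-case, drop punctuation, unify variants, sort words alphabetically."""
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(sorted(COMMON_FORMS.get(t, t) for t in tokens))


def levenshtein(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def is_similar(a, b):
    """Allow more differences on longer strings (assumed: 1 edit per 5 chars)."""
    na, nb = normalize(a), normalize(b)
    allowed = max(1, max(len(na), len(nb)) // 5)
    return levenshtein(na, nb) <= allowed


print(normalize("Dr. Jones str."), "|", normalize("Doctor Jones street"))
print(is_similar("Dr. Jones str.", "Doctor Jons street"))
print(is_similar("Dr. Jones str.", "D. Jon. st."))
```

Heavily abbreviated forms such as "D. Jon. st." still fall outside the tolerance in this sketch, which is where the manual check comes in.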
The main problem I see here is to exactly define equality.
Even if someone writes "Jon." and another writes "Jone.", you will never be able to say whether they are the same (Jon, Jonethan, Joneson, Jonedoe, whatever ;)
I work at a firm where we have to handle exactly this problem. I'm afraid I have to tell you that this kind of checking of address lists for navigation systems is done "by hand" most of the time. Abbreviations are sometimes context-dependent, and there are other things that make this difficult. Of course, replacing strings etc. is done with Python, but determining the MEANING of such an abbreviation can only be done by script in a few cases. ("St." can be "Saint" or "Street". How to decide? Impossible... this is human work.)
Another big problem is, as you said: is "DJones" a street or a person? Or both? Which one is meant here? Is this DJones the same as Dr Jones or the same as Don Jones? It's impossible to decide!
You can do some work with lists, as presented in another answer here, but it will give you plenty of "false positives".
You have a postcode field!!!
So, why don't you just buy a postcode table for your country
and use that to clean up your street/town/region/province information?
I did a project like this in the last century. Basically it was a consolidation of two customer files after a merger, and it involved names and addresses from three different sources.
Firstly, as many posters have suggested, convert all the common words, abbreviations and spelling mistakes to a common form: "Apt.", "Apatment" etc. become "Apt".
Then look through the name and identify the first letter of the first name, plus the first surname. (Not that easy, consider "Dr. Med. Sir Henry de Baskerville Smythe".) But don't worry: where there are ambiguities, just take both! So if you are lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels, as that's where most spelling variations occur, so now you have HBSKRVLL and HSMTH.
You would also get these strings from "H. Baskerville", "Sir Henry Baskerville Smith" and unfortunately "Harold Smith", but we are talking fuzzy matching here!
Perform a similar exercise on the street, apartment and postcode fields. But do not throw away the original data!
You now come to the interesting bit. First you compare each of the original strings and give, say, 50 points for each string that matches exactly. Then go through your "normalised" strings and give, say, 20 points for each one that matches exactly. Then go through all the strings and give, say, 5 points for each substring of four or more characters that they have in common. For each pair compared you will end up with some scores > 150, which you can consider a certain match, some scores less than 50, which you can consider not matched, and some in between, which have some probability of matching.
You need some more tweaking to improve this by adding various rules like "subtract 20 points for a surname of 'smith'". You really have to keep running and tweaking until you get happy with the resulting matches, but, once you look at the results you get a pretty good feel which score to consider a "match" and which are the false positives you need to get rid of.
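Here is a rough Python sketch of the vowel-stripped name key and the points scheme described above. The function names are invented for the example, Y is treated as a vowel only because that is what makes SMYTHE and SMITH collide as in the answer, and counting "substrings in common" via difflib matching blocks is just one possible reading of that rule.

```python
from difflib import SequenceMatcher

VOWELS = set("AEIOUY")  # assumption: Y counts as a vowel here


def name_key(first_name, surname):
    """First initial plus surname, upper-cased, with the vowels removed."""
    raw = (first_name[:1] + surname).upper()
    return "".join(ch for ch in raw if ch.isalpha() and ch not in VOWELS)


def shared_blocks(a, b, min_len=4):
    """Count common blocks of at least min_len characters (one interpretation)."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(1 for block in blocks if block.size >= min_len)


def pair_score(original_a, original_b, normalised_a, normalised_b):
    """50 per exact original field, 20 per exact normalised field, 5 per shared block."""
    score = 0
    score += 50 * sum(x == y for x, y in zip(original_a, original_b))
    score += 20 * sum(x == y for x, y in zip(normalised_a, normalised_b))
    for x, y in zip(original_a + normalised_a, original_b + normalised_b):
        score += 5 * shared_blocks(x, y)
    return score


print(name_key("Henry", "Baskerville"), name_key("Henry", "Smythe"))
print(pair_score(["Smith", "Main Street"], ["Smith", "Main Street"],
                 ["SMTH", "MN STRT"], ["SMTH", "MN STRT"]))
```

Rules such as "subtract 20 points for a surname of 'smith'" would be extra terms in `pair_score`, tuned by rerunning against the data.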
I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.
What I did was to count the number of occurrences in other entries with the same value to make an educated guess whether it was the song name or an artist.
Perhaps you can use soundex or a similar algorithm to find entries that are similar.
EDIT: (maybe I should clarify that I assumed that artist names were more likely to be more frequently reoccurring than song names.)
One important thing that you mention in the comments is that you are going to do this interactively.
This allows you to parse user input and at the same time validate guesses on any abbreviations and correct a lot of mistakes (the way, for example, phone number entry works in some contact management systems: the system makes a best effort to parse and correct the country code, area code and number, but ultimately the user is presented with the guess and has the chance to correct the input).
If you want to do it really well, then keeping databases/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.
So, at least you would have a fully qualified address. If you can do this for all the input, you will have all the data categorized, and matches can then be strict on certain fields and less strict on others, with a matching score calculated according to the weights you assign.
After you have consistently pre-processed the input then n-grams should be able to find similar addresses.
Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx
For new input, you could call the package from .Net code, passing the value row to be checked as a set of parameters, you'd probably need to persist the token index for this to be fast enough for user interaction though.
There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.
Pre-computation - Identify address fragments with high information content.
Identify all the unique words used in each database field and count their occurrences.
Eliminate very common words and abbreviations. (Street, st., appt, apt, etc.)
When presented with an input address,
Identify the most unique word and search (Street LIKE '%Jones%') for existing addresses containing that word.
Use the pre-computed statistics to estimate how many addresses will be in the results set
If the estimated results set is too large, select the second-most unique word and combine it in the search (Street LIKE '%Jones%' AND Town LIKE '%Anytown%')
If the estimated results set is too small, select the second-most unique word and combine it in the search (Street LIKE '%Aardvark%' OR Town LIKE '%Anytown%')
If the actual results set is too large/small, repeat the query, adding further terms as before.
The idea is to find enough fragments with high information content in the address which can be searched for to give a reasonable number of alternatives, rather than to find the most optimal match. For more tolerance to misspelling, trigrams, tetra-grams or soundex codes could be used instead of words.
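The pre-computation and "most unique word" selection can be sketched in a few lines of Python. This is only an illustration: the function names and the stop-word set are assumptions, it counts words across whole addresses rather than per database field, and it leaves building the actual LIKE query to the caller.

```python
import re
from collections import Counter

# Assumed list of very common words and abbreviations to ignore.
STOP_WORDS = {"street", "st", "appt", "apt"}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_word_counts(addresses):
    """Pre-computation: in how many stored addresses does each word occur?"""
    counts = Counter()
    for address in addresses:
        counts.update(set(tokenize(address)))
    return counts


def rarest_words(address, counts, n=2):
    """Known, non-common words of the input, rarest (most informative) first."""
    candidates = {w for w in tokenize(address)
                  if w not in STOP_WORDS and counts[w] > 0}
    return sorted(candidates, key=lambda w: (counts[w], w))[:n]


stored = ["12 Jones Street Anytown", "4 Smith Street Anytown",
          "9 Aardvark Street Anytown"]
counts = build_word_counts(stored)
print(rarest_words("Jones st., Anytown", counts))
```

The counts double as the estimate of the results set size: a word that occurs in one stored address will return roughly one row, so no second term is needed.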
Obviously if you have lists of actual states / towns / streets then some data clean-up could take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information. )
As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.
(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)
I've found a great article.
By adding some DLLs as SQL user-defined functions, we can use the string comparison algorithms from the SimMetrics library.
Check it out:
http://anastasiosyal.com/archive/2009/01/11/18.aspx
The possibilities of such variations are countless, and even if such an algorithm exists, it can never be fool-proof. You can't have a spell checker for nouns, after all.
What you can do is provide a drop-down list of previously entered field values, so that users can select one if a particular name already exists.
It's better to have separate fields for each value, like apartments and so on.
You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.
One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.
Edit
After looking at the kinds of address differences you are dealing with, this may not be helpful after all.
Another idea is to use learning. For example you could learn, for each abbreviation and its place in the sentence, what the abbreviation means.
3 Jane Dr. -> Dr (in 3rd position (or last)) means Drive
Dr. Jones St -> Dr (in 1st position) means Doctor
You could, for example, use decision trees and have a user train the system. Probably few examples of each use would be enough. You wouldn't classify single-letter abbreviations like D.Jones that could be David Jones, or Dr. Jones as likely. But after a first level of translation you could look up a street index of the town and see if you can expand the D. into a street name.
Again, you would run each address through the decision tree before storing it.
It feels like there should be some commercial products doing this out there.
A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:
*Value* | *Meaning*
Apt. | Apartment
Ap. | Apartment
St. | Street
Then you run each word through the dictionary before you compare.
Edit: this alone is too naive to be practical (see comment).

Help needed ordering search results

I have 3 records in a Lucene index.
Record 1 contains healthcare in title field.
Record 2 contains healthcare and insurance in description field but not together.
Record 3 contains healthcare insurance in company name field.
When a user searches for healthcare insurance, I want to show the records in the following order in the search results:
a. Record #3, because it contains both words of the input together (i.e. as a phrase)
b. Record #1
c. Record #2
To put it another way, exact match of all keywords should be given more weight than matches of individual keywords.
How do I achieve this in Lucene?
Thanks.
You can use phrase + slop as bajafresh4life says, but it will fail to match anything if the terms are more than slop apart.
A slightly more complicated alternative is to construct a boolean query that explicitly searches for the phrase (with or without slop) and each of the terms in the phrase. E.g.
"healthcare insurance" OR healthcare OR insurance
Normal lucene relevance sort will give you what you want, and won't fail in the way that the "big slop" approach will.
You can also boost individual fields so that, for example, title is weighted more heavily than description or company name. This needs an even more complicated query, but gives you a lot more control over the ordering...
title:"healthcare insurance"^2 OR title:healthcare^2 OR title:insurance^2
OR description:"healthcare insurance" OR ...
It can be quite tricky to get the weights right, and you may have to play around with them to get exactly what you want (e.g. in the example I just gave, you might not want to boost the individual terms for title), but when you get it working, it's pretty nice :-)
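If the query is assembled as a string before being handed to the query parser, the "phrase OR each term" form could be generated with something like the Python helper below. The helper name is made up, it only produces query text in the syntax shown above, and it assumes the user input contains no characters that need escaping.

```python
def phrase_or_terms(user_input, field=None, boost=None):
    """Build '"a b" OR a OR b', optionally prefixed with a field and boosted."""
    terms = user_input.split()
    clauses = list(terms)
    if len(terms) > 1:
        # Put the whole input first as a quoted phrase.
        clauses.insert(0, '"' + " ".join(terms) + '"')
    prefix = f"{field}:" if field else ""
    suffix = f"^{boost}" if boost else ""
    return " OR ".join(prefix + clause + suffix for clause in clauses)


print(phrase_or_terms("healthcare insurance"))
print(phrase_or_terms("healthcare insurance", field="title", boost=2))
```

Per-field clauses built this way can be joined with OR to get the multi-field version, with a different boost per field.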
Rewrite the query with a phrase + slop factor. So if the query is:
healthcare insurance
you can rewrite it as:
"healthcare insurance"~100
Documents that have the words "healthcare" and "insurance" closer in proximity to each other will be scored higher. In this case, since the slop factor is 100, documents that have both words but are more than 100 terms apart will not match.
Rewriting the query involves manipulating the Term objects in a BooleanQuery. Take all the terms, create a PhraseQuery, and set a slop factor.