What is the Metaphone 3 Algorithm? - spell-checking

I want to code the Metaphone 3 algorithm myself. Is there a description? I know the source code is available for sale but that is not what I am looking for.

Since the author (Lawrence Philips) decided to commercialize the algorithm itself, it is more than likely that you will not find a description. A good place to ask would be the mailing list: https://lists.sourceforge.net/lists/listinfo/aspell-metaphone
but you can also check out the source code (i.e. the code comments) in order to understand how the algorithm works:
http://code.google.com/p/google-refine/source/browse/trunk/main/src/com/google/refine/clustering/binning/Metaphone3.java?r=2029

From Wikipedia, the Metaphone algorithm is
Metaphone is a phonetic algorithm, an algorithm published in 1990 for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar [...]
Metaphone 3 specifically
[...] achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States, having been developed according to modern engineering standards against a test harness of prepared correct encodings.
The overview of the algorithm is:
The Metaphone algorithm operates by first removing non-English letters and characters from the word being processed. Next, all vowels are also discarded unless the word begins with an initial vowel, in which case all vowels except the initial one are discarded. Finally, all consonants and groups of consonants are mapped to their Metaphone code. The rules for grouping consonants, and then mapping the groups to Metaphone codes, are fairly complicated; for a full list of these conversions, check out the comments in the source code section.
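To make the vowel-handling step concrete, here is a rough Python sketch of just that pre-processing rule; it is only an illustration of the sentence above, not any part of the real Metaphone consonant mapping:
def drop_vowels(word):
    # keep only letters and uppercase them
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    # keep an initial vowel, discard every other vowel
    head, tail = word[0], word[1:]
    return head + "".join(ch for ch in tail if ch not in "AEIOU")
print(drop_vowels("orange"))   # ORNG
print(drop_vowels("banana"))   # BNN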
Now, onto your real question:
If you are interested in the specifics of the Metaphone 3 algorithm, I think you are out of luck, short of buying the source code, understanding it, and re-creating it on your own. The whole point of not making the algorithm public (the source you can buy is an instance of it) is that you cannot recreate it without paying the author for their development effort; providing the "precise algorithm" you are looking for would be equivalent to providing the code itself. Consider the quotes above: the development of the algorithm involved a "test harness of [...] encodings". Unless you happen to have such a test harness, or are able to create one, you will not be able to replicate the algorithm.
On the other hand, implementations of the first two iterations (Metaphone and Double Metaphone) are freely available (the Wikipedia link above contains a score of links to implementations in various languages for both), which gives you a good starting point for understanding exactly what the algorithm is about; you can then improve on it as you see fit (e.g. by creating and using an appropriate test harness).

The link posted by Bo now refers to the entire source code of the (now defunct) project.
So here is a new, direct link to the source code for Metaphone 3, by Lawrence Philips:
https://searchcode.com/codesearch/view/2366000/
From the header comments of that file:
Metaphone 3 is designed to return an approximate phonetic key (and an alternate approximate phonetic key when appropriate) that should be the same for English words, and most names familiar in the United States, that are pronounced similarly. The key value is not intended to be an exact phonetic, or even phonemic, representation of the word. This is because a certain degree of 'fuzziness' has proven to be useful in compensating for variations in pronunciation, as well as misheard pronunciations. For example, although americans are not usually aware of it, the letter 's' is normally pronounced 'z' at the end of words such as "sounds".
The 'approximate' aspect of the encoding is implemented according to the following rules:
(1) All vowels are encoded to the same value - 'A'. If the parameter encodeVowels is set to false, only initial vowels will be encoded at all. If encodeVowels is set to true, 'A' will be encoded at all places in the word that any vowels are normally pronounced. 'W' as well as 'Y' are treated as vowels. Although there are differences in the pronunciation of 'W' and 'Y' in different circumstances that lead to their being classified as vowels under some circumstances and as consonants in others, for the purposes of the 'fuzziness' component of the Soundex and Metaphone family of algorithms they will always be treated here as vowels.
(2) Voiced and un-voiced consonant pairs are mapped to the same encoded value. This means that:
'D' and 'T' -> 'T'
'B' and 'P' -> 'P'
'G' and 'K' -> 'K'
'Z' and 'S' -> 'S'
'V' and 'F' -> 'F'
In addition to the above voiced/unvoiced rules, 'CH' and 'SH' -> 'X', where 'X' represents the "-SH-" and "-CH-" sounds in Metaphone 3 encoding.
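For illustration only, the two rules quoted above can be sketched in a few lines of Python; this is emphatically not Metaphone 3 itself, just the vowel-collapsing and voiced/unvoiced mapping applied letter by letter:
VOWELS = set("AEIOUWY")   # 'W' and 'Y' treated as vowels, as the comments explain
PAIRS = {"D": "T", "B": "P", "G": "K", "Z": "S", "V": "F"}
def fuzz(word, encode_vowels=True):
    out = []
    for i, ch in enumerate(word.upper()):
        if ch in VOWELS:
            if encode_vowels or i == 0:      # rule (1): every vowel encodes to 'A'
                out.append("A")
        else:
            out.append(PAIRS.get(ch, ch))    # rule (2): collapse voiced/unvoiced pairs
    return "".join(out)
print(fuzz("sounds"))   # SAANTS - the 'd' maps to 'T', the vowels to 'A'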

I thought it was wrong to have the general community be denied an algorithm (not code).
I am selling source, so the algorithm is not hidden. I am asking $40.00 for a copy of the source code, and asking other people who are charging for their software or services that use Metaphone 3 to pay me a licensing fee, and also asking that the source code not be distributed by other people (except for an exception I made for Google Refine - I can only request that you do not redistribute the copy of Metaphone 3 found there separately from the Refine package).

Actually, Metaphone 3 is an algorithm with many very specific rules that resulted from the analysis of test cases. So it is not only a pure algorithm; it also comes with extra domain knowledge. Obtaining this knowledge and these specific rules took the author a great deal of effort, which is why the algorithm is not open source.
There is, however, an open-source alternative: Double Metaphone.
See here: https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/language/DoubleMetaphone.html
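If you work in Python rather than Java, one open-source implementation is the metaphone package on PyPI (pip install metaphone); assuming that package, whose documented doublemetaphone function returns a primary and an alternate key, usage looks roughly like this (verify the API against the version you install):
from metaphone import doublemetaphone
for name in ("Smith", "Smyth", "Schmidt"):
    primary, alternate = doublemetaphone(name)
    # words that sound alike should share a primary (or alternate) key
    print(name, "->", primary, alternate or "(no alternate)")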

This is not a commercial post and I have no relationship with the owner, but it is worth saying that an implementation of Metaphone 3 is available as commercial software from its creator at amporphics.com. It looks like his personal store. It is a Java app, but I bought the Python version and it works fine.
The Why Metaphone3? page says:
One common solution to spelling variation is the database approach.
Some very impressive work has been done accumulating personal name
variations from all over the world. (Of course, we are always very
pleased when the companies that retail these databases advertise that
they also use some version of Metaphone to improve their flexibility
:-) )
But - there are some problems with this approach:
They only work well until they encounter a spelling variation or a
new word or name that is not already in their database.
Then they don't work at all.
Metaphone 3 is an algorithmic approach that will deliver a phonetic
lookup key for anything you enter into it.
Personal names, that is, first names and family names, are not the
same as company names. In fact, the name of a company or agency may
contain words of any kind, not just names. Database solutions usually
don't cover possible spelling variations, or for that matter
misspellings, for regular 'dictionary' words. Or if they do, not very
thoroughly.
Metaphone 3 was developed to account for all spelling variations
commonly found in English words, first and last names found in the
United States and Europe, and non-English words whose native
pronunciations are familiar to Americans. It doesn't care what kind of word you are trying to match.
For what it is worth, we licensed the code since it is affordable and easy to use. I can't speak to its performance yet. There are good alternatives on PyPI, but I can't find them at the moment.

Related

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text to another.
For example: “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is relative to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit to your problem, because:
The emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjunction word 'and') is much smaller, essentially giving more weight to the more document-specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation oriented methods like the one mentioned above.
Big drawback: your data, depending on the size of the corpus, may have too many dimensions (one dimension per unique word); you could use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
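A minimal sketch of that pipeline using scikit-learn (in practice, fit the vectorizer on your full corpus so the IDF weights are meaningful; the two toy sentences here are just the example from the question):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
texts = ["the cat and a dog", "a dog and the cat"]
vectorizer = TfidfVectorizer()            # add stop_words='english' to drop 'a', 'the', ...
tfidf = vectorizer.fit_transform(texts)
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(round(score, 3))                    # word order is ignored, so these two score 1.0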
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their format differs or when one of them is misspelled.
So far I've found different solutions, but I think they are quite old and not very efficient. I'm sure better methods exist, so if you have references for me to read, I'm sure it is a subject that may interest several people.
The solutions I found (examples are in R):
Levenshtein distance, which equals the number of characters you have to insert, delete or change to transform one word into another.
agrep("acusait", c("accusait", "abusait"), max = 2, value = TRUE)
## [1] "accusait" "abusait"
The comparison of phonemes
library(RecordLinkage)
soundex(x<-c('accusait','acusait','abusait'))
## [1] "A223" "A223" "A123"
The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), but I guess it is not very efficient on addresses.
I thought about using Google Suggest's suggestions, but likewise, it is not very efficient on personal postal addresses.
You could imagine using a supervised machine learning approach, but you would need to have stored users' misspelled requests to do so, which is not an option for me.
I look at this as a spelling-correction problem, where you need to find the nearest-matching word in some sort of dictionary.
What I mean by "near" is Levenshtein distance, except the smallest number of single-character insertions, deletions, and replacements is too restrictive.
Other kinds of "spelling mistakes" are also possible, for example transposing two characters.
I've done this a few times, but not lately.
The most recent case had to do with concomitant medications for clinical trials.
You would be amazed how many ways there are to misspell "acetylsalicylic".
Here is an outline in C++ of how it is done.
Briefly, the dictionary is stored as a trie, and you are presented with a possibly misspelled word, which you try to look up in the trie.
As you search, you try the word as it is, and you try all possible alterations of the word at each point.
As you go, you have an integer budget of how many alterations you can tolerate, which you decrement every time you put in an alteration.
If you exhaust the budget, you allow no further alterations.
Now there is a top-level loop, where you call the search.
On the first iteration, you call it with a budget of 0.
When the budget is 0, it will allow no alterations, so it is simply a direct lookup.
If it fails to find the word with a budget of 0, you call it again with a budget of 1, so it will allow a single alteration.
If that fails, try a budget of 2, and so on.
Something I have not tried is a fractional budget.
For example, suppose a typical alteration reduces the budget by 2, not 1, and the budget goes 0, 2, 4, etc.
Then some alterations could be considered "cheaper". For example, a vowel replacement might only decrement the budget by 1, so for the cost of one consonant replacement you could do two vowel replacements.
If the word is not misspelled, the time it takes is proportional to the number of letters in the word.
In general, the time it takes is exponential in the number of alterations in the word.
If you are working in R (as I was in the example above), I would have it call out to the C++ program, because you need the speed of a compiled language for this.
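Here is a compact Python sketch of the budgeted trie search described above: insertions, deletions, replacements and transpositions each cost one unit of budget, and the outer loop retries with a larger budget until something matches. It is a simplified illustration of the approach, not the original C++:
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                                   # end-of-word marker
    return root
def search(node, word, budget, prefix="", found=None):
    if found is None:
        found = set()
    if not word:
        if "$" in node:
            found.add(prefix)
        if budget > 0:                                     # trailing insertions
            for ch, child in node.items():
                if ch != "$":
                    search(child, "", budget - 1, prefix + ch, found)
        return found
    ch, rest = word[0], word[1:]
    if ch in node:                                         # exact step, costs nothing
        search(node[ch], rest, budget, prefix + ch, found)
    if budget > 0:
        search(node, rest, budget - 1, prefix, found)      # delete a character
        for c, child in node.items():
            if c == "$":
                continue
            search(child, word, budget - 1, prefix + c, found)   # insert a character
            search(child, rest, budget - 1, prefix + c, found)   # replace a character
        if len(word) > 1:
            search(node, word[1] + word[0] + word[2:], budget - 1, prefix, found)  # transpose
    return found
def correct(word, trie, max_budget=2):
    for budget in range(max_budget + 1):                   # budget 0 is a plain lookup
        matches = search(trie, word, budget)
        if matches:
            return budget, sorted(matches)
    return None, []
trie = build_trie(["acetylsalicylic", "aspirin", "paracetamol"])
print(correct("acetylsalicilic", trie))                    # (1, ['acetylsalicylic'])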
Extending what Mike had to say, and using the string matching library stringdist in R to match a vector of addresses that errored out in ArcGIS's geocoding function:
library(stringdist)
rows <- length(unmatched$address)
# vector to put our matched addresses in
matched_add <- rep(NA, rows)
score <- rep(NA, rows)
# for instructional purposes only; you should use sapply to apply functions to vectors
for (u in c(1:rows)) {
  # gives you the position of the closest match in an address vector
  pos <- amatch(unmatched$address[u], index$address, maxDist = Inf)
  matched_add[u] <- index$address[pos]
  # stringsim here will give you the score to go back and adjust your parameters
  score[u] <- stringsim(unmatched$address[u], index$address[pos])
}
Stringdist has several methods you can use to find approximate matches, including Levenshtein (method="lv"). You'll probably want to tinker with these to fit your dataset as well as you can.

How to convert BIC & IBAN to account and sortcode

Now that SEPA requirements are getting people used to BIC & IBAN, there are legacy systems that cannot cope with this new data. Is there an algorithm or tool available for converting BIC & IBAN back to sort code and account number?
Here is an example (an IBAN breakdown image from an external website).
Wikipedia has a list of IBAN formats by country, so it seems at least possible.
However, there is no complete algorithm for it; being a software developer, you can derive one from that input. Note that other countries might follow in the future, so you can expect more work (and hopefully not more exceptional cases of sort codes and accounts).
Regarding a tool or library, that is off-topic here on Stack Overflow, but you might want to ask on Software Recommendations. Note that they have different requirements for how to ask questions, so you might want to read the tour first. Don't forget to mention the programming language.
Well, a quick search pointed me at this page: http://www.business.hsbc.co.uk/1/2/international-business/iban-bic.
Looks to me like you can just extract the appropriate substrings, although a bit more searching seems to indicate that the format may vary a bit depending on the country.
Both sort code and account number are present inside a United Kingdom or Ireland IBAN.
You can simply take substrings. PHP example:
$iban = "GB04BARC20474473160944";
$sort = substr($iban, 8, 6);      // characters 9-14 of a GB IBAN: the sort code
$account = substr($iban, 14, 8);  // characters 15-22: the account number
print "SortCode:" . $sort;
print "AccountNumber:" . $account;
The IBAN Calculator web service has an API which digs up bank and branch information and so on. It also does check-digit validation on the IBAN and the sort code/account.
But for simply extracting the sort code and account number, the substring approach is sufficient.

Forward Index Implementation in google

I am trying to develop a search engine in my free time modeled after google.
I am using the original google research paper listed here: http://infolab.stanford.edu/~backrub/google.html
However, I am having a few problems. To be exact, I am having trouble developing the forward index.
In the paper it says:
If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words.
Now there are two problems with this statement. First, who decides which words out of the huge lexicon go into the forward barrels? Do all of them go? Second, what is the meaning of the word "corresponding"? Does it mean words that actually appear in that document after the previous word, or something else?
I am really new to search engines and would really appreciate any information retrieval expert helping me with this. If moderators think that this question belongs on some other Stack Exchange site, please move it.
First Question:
The string value of every word is mapped to an integer (by a hash function). This is because integers are far easier to handle than strings. You can then define ranges (buckets or bins or whatever else you might want to call them) over these integer values, e.g.
term ids 0 to 1000 => Bin-1
term ids 1001 to 2000 => Bin-2
and so on.
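As a toy illustration of that mapping (the hash function and the bin size of 1000 are arbitrary choices for this example, not something prescribed by the paper):
import hashlib
BIN_SIZE = 1000
def term_id(word):
    # stable hash of the word's string value to an integer id
    return int(hashlib.md5(word.lower().encode("utf-8")).hexdigest(), 16) % 1_000_000
def bin_for(word):
    # range-partition the id space: ids 0-999 -> bin 0, 1000-1999 -> bin 1, ...
    return term_id(word) // BIN_SIZE
for w in ["the", "quick", "brown", "fox"]:
    print(w, term_id(w), "-> bin", bin_for(w))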
Second question:
The context information is typically not used. A word is simply a term present in a document, such as the terms "the", "quick", "brown" etc.
Since you said you are new to IR, a good way to start would be to read an introductory book to IR, e.g. the book by Manning and Schutze.

Algorithm for almost similar values search

I have Persons table in SQL Server 2008.
My goal is to find Persons who have almost similar addresses.
The address is described with columns state, town, street, house, apartment, postcode and phone.
Due to some specific differences in some states (not the US) and the human factor (mistakes in addresses, etc.), addresses are not filled in following the same pattern.
Most common mistakes in addresses
Case sensitivity
Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
Spaces, dots, commas
Differences in writing street names, like "Dr. Jones str." or "Doctor Jones street" or "D. Jon. st." or "Dr Jones st", etc.
The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.
Is there any algorithm for this kind of issue?
Thanks in advance.
UPDATE
As I mentioned, the address is separated into different columns. Should I generate a string by concatenating the columns, or apply your steps to each column separately?
I assume I shouldn't concatenate the columns, but if I compare the columns separately, how should I organize it? Should I find similarities for each column and then union them, intersect them, or something else?
Should I collect some statistics, or use some kind of learning algorithm?
Suggest approaching it thus:
Create word-level n-grams (a trigram/4-gram might do it) from the various entries
Do a many-to-many string comparison and cluster the entries by string distance. Someone suggested Levenshtein; there are better ones for this kind of task: Jaro-Winkler distance and Smith-Waterman work better. A library such as SimMetrics would make life a lot easier
Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams i.e. D.Jones St => Davy Jones St. => DJones St.
Should not be too hard, this is an all-too-common problem.
Update: Based on your update above, here are the suggested steps
Concatenate your columns into a single string, perhaps by creating a db "view". For example:
create view vwAddress
as
select top 10000
state, town, street, house, apartment, postcode,
state + town + street + house + apartment + postcode as Address
from ...
Write a separate application (say in Java or C#/VB.NET) and use an algorithm like Jaro-Winkler to estimate the string distance for the combined address, creating a many-to-many comparison, and write the results into a separate table:
address1 | address n | similarity
You can use SimMetrics to get the similarity thus:
JaroWinkler objJw = new JaroWinkler();
double sim = objJw.GetSimilarity(address1, addressN);
You could also trigram it, so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on,
and compare the trigrams (or even 4-grams) for higher accuracy.
Finally, you can order by similarity to get a cluster of the most similar addresses and decide on an appropriate threshold. Not sure why you are stuck.
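A rough Python sketch of the word-level n-gram comparison; difflib's ratio is used here only as a stand-in metric, and in practice you would plug in Jaro-Winkler or Smith-Waterman from a library such as SimMetrics:
from difflib import SequenceMatcher
def word_ngrams(text, n=3):
    words = text.lower().replace(",", "").split()
    return [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
a = word_ngrams("1 Jones Street, Sometown, SomeCountry")
b = word_ngrams("1 Jones St Sometown SomeCountry")
best = max(SequenceMatcher(None, x, y).ratio() for x in a for y in b)
print(round(best, 2))   # score of the best-matching n-gram pair; cluster addresses on scores like this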
I would try to do the following:
split up the address into multiple words, getting rid of punctuation at the same time (a sketch of these steps follows after this list)
check all the words for patterns that are typically written differently and replace them with a common name (e.g. replace apartment, ap., ... by apt, replace Doctor by Dr., ...)
put all the words back in one string alphabetically sorted
compare all the addresses using a fuzzy string comparison algorithm, e.g. Levenshtein
tweak the parameters of the Levenshtein algorithm (e.g. you want to allow more differences on longer strings)
finally do a manual check of the strings
Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.
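A minimal sketch of those steps (the abbreviation map is obviously just an illustrative stub, and the Levenshtein routine is a plain textbook implementation):
import re
CANON = {"apartment": "apt", "ap": "apt", "doctor": "dr", "street": "st", "str": "st"}
def normalise(address):
    words = re.findall(r"[a-z0-9]+", address.lower())   # split into words, dropping punctuation
    words = [CANON.get(w, w) for w in words]            # map known variants to a common form
    return " ".join(sorted(words))                      # alphabetical order removes word-order noise
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
a = normalise("Dr. Jones str., apartment 5")
b = normalise("Doctor Jones Street ap. 5")
print(a, "|", b, "|", levenshtein(a, b))   # both collapse to the same form, so the distance is 0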
The main problem I see here is exactly defining equality.
Even if someone writes "Jon." and another writes "Jone.", you will never be able to say whether they are the same (Jon could be Jonathan, Joneson, Jonedoe, whatever ;)
I work at a firm where we have to handle exactly this problem. I'm afraid I have to tell you that this kind of checking of address lists for navigation systems is done "by hand" most of the time. Abbreviations are sometimes context dependent, and there are other things that make this difficult. Of course, replacing strings etc. is done with Python, but telling you the MEANING of such an abbreviation can only be done by a script in a few cases ("St." can be "Saint" or "Street" - how to decide? Impossible; this is human work).
Another big problem is, as you said: is "DJones" a street or a person? Or both? Which one is meant here? Is this DJones the same as Dr Jones, or the same as Don Jones? It's impossible to decide!
You can do some work with lists, as presented in another answer here, but it will give you plenty of "false positives".
You have a postcode field!!!
So, why don't you just buy a postcode table for your country
and use that to clean up your street/town/region/province information?
I did a project like this in the last century. Basically, it was a consolidation of two customer files after a merger, and it involved names and addresses from three different sources.
Firstly, as many posters have suggested, convert all the common words, abbreviations and spelling mistakes to a common form: "Apt.", "Apatment", etc. become "Apt".
Then look through the name and identify the first letter of the first name, plus the first surname. (Not that easy - consider "Dr. Med. Sir Henry de Baskerville Smythe" - but don't worry; where there are ambiguities, just take both!) So if you are lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels, as that's where most spelling variations occur, so now you have HBSKRVLL and HSMTH.
You would also get these strings from "H. Baskerville", "Sir Henry Baskerville Smith" and unfortunately "Harold Smith", but we are talking fuzzy matching here!
Perform a similar exercise on the street, and apartment and postcode fields. But do not throw away the original data!
You now come to the interesting bit: first you compare each of the original strings and give, say, 50 points for each string that matches exactly. Then go through your "normalised" strings and give, say, 20 points for each one that matches exactly. Then go through all the strings and give, say, 5 points for each substring of four characters or more that they have in common. For each pair compared, you will end up with some scores > 150, which you can consider a certain match, some scores less than 50, which you can consider not matched, and some in between, which have some probability of matching.
You need to do some more tweaking to improve this by adding various rules like "subtract 20 points for a surname of 'smith'". You really have to keep running and tweaking until you are happy with the resulting matches, but once you look at the results you get a pretty good feel for which score to consider a "match" and which are the false positives you need to get rid of.
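A toy version of that scoring scheme, with the 50/20/5 points and the 150/50 thresholds taken from the description above (everything else is an illustrative simplification, e.g. counting shared 4-grams instead of maximal common substrings):
def strip_vowels(s):
    s = s.upper()
    return (s[0] + "".join(c for c in s[1:] if c not in "AEIOU")) if s else s
def shared_4grams(a, b):
    grams = {a[i:i + 4] for i in range(len(a) - 3)}
    return sum(1 for i in range(len(b) - 3) if b[i:i + 4] in grams)
def score(rec_a, rec_b):
    total = 0
    for fa, fb in zip(rec_a, rec_b):
        if fa == fb:
            total += 50                                      # exact match on the original field
        if strip_vowels(fa) == strip_vowels(fb):
            total += 20                                      # match on the normalised form
        total += 5 * shared_4grams(fa.upper(), fb.upper())   # shared substrings of 4+ characters
    return total
a = ("HENRY BASKERVILLE", "221B BAKER STREET", "NW1")
b = ("H. BASKERVILLE", "221B BAKER ST", "NW1")
print(score(a, b))   # well over 150, so a certain match; below 50 would be a non-match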
I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.
What I did was count the number of occurrences in other entries with the same value to make an educated guess whether it was the song name or an artist.
Perhaps you can use soundex or a similar algorithm to find things that are similar.
EDIT: (maybe I should clarify that I assumed artist names were more likely to recur than song names.)
One important thing that you mention in the comments is that you are going to do this interactively.
This allows you to parse user input and at the same time validate guesses about any abbreviations and correct a lot of mistakes (the way, for example, phone number entry works in some contact management systems: the system makes a best effort to parse and correct the country code, area code and number, but ultimately the user is presented with the guess and has the chance to correct the input).
If you want to do it really well, then keeping databases/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.
So, at the very least, you would have a fully qualified address. If you can do this for all the input, you will have all the data categorized, and matching can then be strict on certain fields and less strict on others, with the matching score calculated according to the weights you assign.
After you have consistently pre-processed the input then n-grams should be able to find similar addresses.
Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx
For new input, you could call the package from .NET code, passing the row to be checked as a set of parameters; you'd probably need to persist the token index for this to be fast enough for user interaction, though.
There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.
Pre-computation - Identify address fragments with high information content.
Identify all the unique words used in each database field and count their occurrences.
Eliminate very common words and abbreviations. (Street, st., appt, apt, etc.)
When presented with an input address,
Identify the most unique word and search (Street LIKE '%Jones%') for existing addresses containing those words.
Use the pre-computed statistics to estimate how many addresses will be in the results set
If the estimated results set is too large, select the second-most unique word and combine it in the search (Street LIKE '%Jones%' AND Town LIKE '%Anytown%')
If the estimated results set is too small, select the second-most unique word and combine it in the search (Street LIKE '%Aardvark%' OR Town LIKE '%Anytown%')
If the actual results set is too large/small, repeat the query, adding further terms as before.
The idea is to find enough fragments with high information content in the address which can be searched for to give a reasonable number of alternatives, rather than to find the most optimal match. For more tolerance to misspelling, trigrams, tetra-grams or soundex codes could be used instead of words.
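A sketch of the "most informative word first" idea; word_counts would come from the pre-computation step, and the table and column names in the generated SQL are of course placeholders:
STOP_WORDS = {"street", "st", "apt", "appt"}
def rare_words(address, word_counts):
    words = [w.strip(",.").lower() for w in address.split()]
    words = [w for w in words if w and not w.isdigit() and w not in STOP_WORDS]
    # fewest occurrences in the database = highest information content
    return sorted(words, key=lambda w: word_counts.get(w, 0))
word_counts = {"jones": 5200, "anytown": 950, "aardvark": 3}
terms = rare_words("12 Aardvark Street, Anytown", word_counts)
print(terms)   # ['aardvark', 'anytown']
print("SELECT * FROM addresses WHERE " +
      " AND ".join("address LIKE '%{}%'".format(w) for w in terms[:2]))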
Obviously if you have lists of actual states / towns / streets then some data clean-up could take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information. )
As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.
(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)
I've found a great article.
By adding some DLLs as SQL user-defined functions, we can use string comparison algorithms from the SimMetrics library.
Check it out:
http://anastasiosyal.com/archive/2009/01/11/18.aspx
The possibilities for such variations are countless, and even if such an algorithm exists, it can never be fool-proof. You can't have a spell checker for nouns, after all.
What you can do is provide a drop-down list of previously entered field values, so that users can select one if a particular name already exists.
It's better to have separate fields for each value, like apartments and so on.
You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.
One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.
Edit
After looking at the kinds of address differences you are dealing with, this may not be helpful after all.
Another idea is to use learning. For example you could learn, for each abbreviation and its place in the sentence, what the abbreviation means.
3 Jane Dr. -> Dr (in 3rd position (or last)) means Drive
Dr. Jones St -> Dr (in 1st position) means Doctor
You could, for example, use decision trees and have a user train the system. Probably a few examples of each use would be enough. You wouldn't classify single-letter abbreviations like the D. in D.Jones, which could be David Jones or Dr. Jones, with any confidence. But after a first level of translation you could look up a street index of the town and see if you can expand the D. into a street name.
Again, you would run each address through the decision tree before storing it.
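A toy illustration of that position-dependent rule (a real system would learn rules like this from user-supplied examples rather than hard-coding them):
def expand_dr(address):
    words = address.replace(".", "").split()
    out = []
    for i, w in enumerate(words):
        if w.lower() == "dr":
            # last position -> 'Drive', otherwise -> 'Doctor'
            out.append("Drive" if i == len(words) - 1 else "Doctor")
        else:
            out.append(w)
    return " ".join(out)
print(expand_dr("3 Jane Dr."))     # 3 Jane Drive
print(expand_dr("Dr. Jones St"))   # Doctor Jones St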
It feels like there should be some commercial products doing this out there.
A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:
*Value* | *Meaning*
Apt. | Apartment
Ap. | Apartment
St. | Street
Then you run each word through the dictionary before you compare.
Edit: this alone is too naive to be practical (see comment).