Algorithm for almost similar values search - sql

I have Persons table in SQL Server 2008.
My goal is to find Persons who have almost similar addresses.
The address is described with columns state, town, street, house, apartment, postcode and phone.
Due to some specific differences in some states (not US) and human factor (mistakes in addresses etc.), address is not filled in the same pattern.
Most common mistakes in addresses
Case sensitivity
Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
Spaces, dots, commas
Differences in writing street names, like 'Dr. Jones str." or "Doctor Jones street" or "D. Jon. st." or "Dr Jones st" etc.
The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.
Is there any algorithm for this kind of issue?
Thanks in advance.
UPDATE
As I mentioned address is separated into different columns. Should I generate a string concatenating columns or do your steps for each column?
I assume I shouldn't concatenate columns, but if I'll compare columns separately how should I organize it? Should I find similarities for each column an union them or intersect or anything else?
Should I have some statistics collecting or some kind of educating algorithm?

Suggest approaching it thus:
Create word-level n-grams (a trigram/4-gram might do it) from the various entries
Do a many x many comparison for string comparison and cluster them by string distance. Someone suggested Levenshtein; there are better ones for this kind of task, Jaro-Winkler Distance and Smith-Waterman work better. A libraryt such as SimMetrics would make life a lot easier
Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams i.e. D.Jones St => Davy Jones St. => DJones St.
Should not be too hard, this is an all-too-common problem.
Update: Based on your update above, here are the suggested steps
Catenate your columns into a single string, perhaps create a db "view" . For example,
create view vwAddress
as
select top 10000
state town, street, house, apartment, postcode,
state+ town+ street+ house+ apartment+ postcode as Address
from ...
Write a separate application (say in Java or C#/VB.NET) and Use an algorithm like JaroWinkler to estimate the string distance for the combined address, to create a many x many comparison. and write into a separate table
address1 | address n | similarity
You can use Simmetrics to get the similarity thus:
JaroWinnkler objJw = new JaroWinkler()
double sim = objJw.GetSimilarity (address1, addres n);
You could also trigram it so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on....
and compare the trigrams. (or even 4-grams) for higher accuracy.
Finally you can order by similarity to get a cluster of most similar addresses and decide an approprite threshold. Not sure why you are stuck

I would try to do the following:
split up the address in multiple words, get rid of punctuation at the same time
check all the words for patterns that are typically written differently and replace them with a common name (e.g. replace apartment, ap., ... by apt, replace Doctor by Dr., ...)
put all the words back in one string alphabetically sorted
compare all the addresses using a fuzzy string comparison algorithm, e.g. Levenshtein
tweak the parameters of the Levenshtein algorithm (e.g. you want to allow more differences on longer strings)
finally do a manual check of the strings
Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.

The main problem I see here is to exactly define equality.
Even if someone writes Jon. and another Jone. - you will never be able to say if they are the same. (Jon-Jonethan,Joneson,Jonedoe whatever ;)
I work in a firm where we have to handle exact this problem - I'm afraid I have to tell you this kind of checking the adress lists for navigation systems is done "by hand" most of the time. Abbrevations are sometimes context dependend, and there are other things that make this difficult. Ofc replacing string etc is done with python - but telling you the MEANING of such an abbr. can only done by script in a few cases. ("St." -> Can be "Saint" and "Street". How to decide? impossible...this is human work.).
Another big problem is as you said "Is there a street "DJones" or a person? Or both? Which one is ment here? Is this DJones the same as Dr Jones or the same as Don Jones? Its impossible to decide!
You can do some work with lists as presented by another answer here - but it will give you enough "false positives" or so.

You have a postcode field!!!
So, why don't you just buy a postcode table for your country
and use that to clean up your street/town/region/province information?

I did a project like this in the last centuary. Basicly it was a consolidation of two customer files after a merger, and, involved names and addresses from three different sources.
Firstly as many posters have suggested, convert all the common words and abbreveations and spelling mistakes to a common form "Apt." "Apatment" etc. to "Apt".
Then look through the name and identifiy the first letter of the first name, plus the first surname. (Not that easy consider "Dr. Med. Sir Henry de Baskerville Smythe") but dont worry where there are amiguities just take both! So if you lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels as thats where most spelling variations occur so now you have HBSKRVLL HSMTH.
You would also get these strings from "H. Baskerville","Sir Henry Baskerville Smith" and unfortunately "Harold Smith" but we are talking fuzzy matching here!
Perform a similar exercise on the street, and apartment and postcode fields. But do not throw away the original data!
You now come to the interesting bit first you compare each of the original strings and give say 50 points for each string that matches exactly. Then go through you "normalised" strings and give say 20 points for each one that matches exactly. Then go through all the strings and give say 5 points for each four character or more substring they have in common. For each pair compared you will end up with some with scores > 150 which you can consider as a certain match, some with scores less than 50 which you can consider not matched and some inbetween which have some probability of matching.
You need some more tweaking to improve this by adding various rules like "subtract 20 points for a surname of 'smith'". You really have to keep running and tweaking until you get happy with the resulting matches, but, once you look at the results you get a pretty good feel which score to consider a "match" and which are the false positives you need to get rid of.

I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.
What I did was to count the number of occurrences on other entries with the same value to make an educated guess wether it was the song name or an artist.
Perhaps you can use soundex or similar algorithm to find stuff that are similar.
EDIT: (maybe I should clarify that I assumed that artist names were more likely to be more frequently reoccurring than song names.)

One important thing that you mention in the comments is that you are going to do this interactively.
This allows to parse user input and also at the same time validate guesses on any abbreviations and to correct a lot of mistakes (the way for example phone number entry works some contact management systems - the system does the best effort to parse and correct the country code, area code and the number, but ultimately the user is presented with the guess and has the chance to correct the input)
If you want to do it really good then keeping database/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.
So, at least you would have fully qualified address. If you can do this for all the input you will have all the data categorized and matches can then be strict on certain field and less strict on others, with matching score calculated according weights you assign.
After you have consistently pre-processed the input then n-grams should be able to find similar addresses.

Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx
For new input, you could call the package from .Net code, passing the value row to be checked as a set of parameters, you'd probably need to persist the token index for this to be fast enough for user interaction though.
There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx

I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.
Pre-computation - Identify address fragments with high information content.
Identify all the unique words used in each database field and count their occurrences.
Eliminate very common words and abbreviations. (Street, st., appt, apt, etc.)
When presented with an input address,
Identify the most unique word and search (Street LIKE '%Jones%') for existing addresses containing those words.
Use the pre-computed statistics to estimate how many addresses will be in the results set
If the estimated results set is too large, select the second-most unique word and combine it in the search (Street LIKE '%Jones%' AND Town LIKE '%Anytown%')
If the estimated results set is too small, select the second-most unique word and combine it in the search (Street LIKE '%Aardvark%' OR Town LIKE '%Anytown')
if the actual results set is too large/small, repeat the query adding further terms as before.
The idea is to find enough fragments with high information content in the address which can be searched for to give a reasonable number of alternatives, rather than to find the most optimal match. For more tolerance to misspelling, trigrams, tetra-grams or soundex codes could be used instead of words.
Obviously if you have lists of actual states / towns / streets then some data clean-up could take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information. )
As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.
(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)

I've found a great article.
Adding some dlls as sql user-defined functions we can use string comparison algorithms using SimMetrics library.
Check it
http://anastasiosyal.com/archive/2009/01/11/18.aspx

the possibilities of such variations are countless and even if such an algorithm exists, it can never be fool-proof. u can't have a spell checker for nouns after all.
what you can do is provide a drop-down list of previously entered field values, so that they can select one, if a particular name already exists.
its better to have separate fields for each value like apartments and so on.

You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.

One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.
Edit
After looking at the kinds of address differences you are dealing with, this may not be helpful after all.

Another idea is to use learning. For example you could learn, for each abbreviation and its place in the sentence, what the abbreviation means.
3 Jane Dr. -> Dr (in 3rd position (or last)) means Drive
Dr. Jones St -> Dr (in 1st position) means Doctor
You could, for example, use decision trees and have a user train the system. Probably few examples of each use would be enough. You wouldn't classify single-letter abbreviations like D.Jones that could be David Jones, or Dr. Jones as likely. But after a first level of translation you could look up a street index of the town and see if you can expand the D. into a street name.
Again, you would run each address through the decision tree before storing it.
It feels like there should be some commercial products doing this out there.

A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:
*Value* | *Meaning*
Apt. | Apartment
Ap. | Apartment
St. | Street
Then you run each word through the dictionary before you compare.
Edit: this alone is too naive to be practical (see comment).

Related

TSQL grouping on fuzzy column

I would like to group all the merchant transactions from a single table, and just get a count. The problem is, the merchant, let's say redbox, will have a redbox plus the store number added to the end(redbox 4562,redbox*1234). I will also include the category for grouping purpose.
Category Merchant
restaurant bruger king 123 main st
restaurant burger king 456 abc ave
restaurant mc donalds * 45877d2d
restaurant mc 'donalds *888544d
restaurant subway 454545
travelsubway MTA
gas station mc donalds gas
travel nyc taxi
travel nyc-taxi
The question: How can I group the merchants when they have address or store locations added on to them.All I need is a count for each merchant.
The short answer is there is no way to accurately do this, especially with just pure SQL.
You can find exact matches, and you can find wildcard matches using the LIKE operator or a (potentially huge) series of regular expressions, but you cannot find similar matches nor can you find potential misspellings of matches.
There's a few potential approaches I can think of to solve this problem, depending on what type of application you're building.
First, normalize the merchant data in your database. I'd recommend against storing the exact, unprocessed string such as Bruger King in your database. If you come across a merchant that doesn't match a known set of merchants, ask the user if it already matches something in your database. When data goes in, process it then and match it to an existing known merchant.
Store a similarity coefficient. You might have some luck using something like a Jaccard index to judge how similar two strings are. Perhaps after stripping out the numbers, this could work fairly well. At the very least, it could allow you to create a user interface that can attempt to guess what merchant it is. Also, some database engines have full-text indexing operators that can descibe things like similar to or sounds like. Those could potentially be worth investigating.
Remember merchant matches per user. If a user corrects bruger king 123 main st to Burger King, store that relation and remember it in the future without having to prompt the user. This data could also be used to help other users correct their data.
But what if there is no UI? Perhaps you're trying to do some automated data processing. I really see no way to handle this without some sort of human intervention, though some of the techniques described above could help automate this process. I'd also look at the source of your data. Perhaps there's a distinct merchant ID you can use as a key, or perhaps there exists somewhere a list of all known merchants (maybe credit card companies provide this API?) If there's boat loads of data to process, another option would be to partially automate it using a service such as Amazon's Mechanical Turk.
You can use LIKE
SELECT COUNT(*) AS "COUNT", "BURGER KING"
FROM <tables>
WHERE restaurant LIKE "%king%"
UNION ALL
SELECT COUNT(*) AS "COUNT", "JACK IN THE BOX"
FROM <tables>
Where resturant LIKE "jack in the box%"
You may have to move the wildcards around depending on how the records were spelled out.
It depends a bit on what database you use, but most have some kind of REGEXP_INSTR or other function you can use to check for the first index of a pattern. You can then write something like this
SELECT SubStr(merchant, 1, REGEXP_INSTR(merchant, '[0-9]')), count('x')
FROM Expenses
GROUP BY SubStr(merchant, 1, REGEXP_INSTR(merchant, '[0-9]'))
This assumes that the merchant name doesn't have a number and the store number does. However you still may need to strip out any special chars with a replace (like *, -, etc).

figuring out which field to look for a value in with SQL and perl

I'm not too good with SQL and I know there's probably a much more efficient way to accomplish what I'm doing here, so any help would be much appreciated. Thanks in advance for your input!
I'm writing a short program for the local school high school. At this school, juniors and seniors who have driver's licenses and cars can opt to drive to school rather than ride the bus. Each driver is assigned exactly one space, and their DLN is used as the primary key of the driver's table. Makes, models, and colors of cars are stored in a separate cars table, related to the drivers table by the License plate number field.
My idea is to have a single search box on the main GUI of the program where the school secretary can type in who/what she's looking for and pull up a list of results. Thing is, she could be typing a license plate number, a car color, make, and model, someone driver's name, some student driver's DLN, or a space number. As the programmer, I don't know what exactly she's looking for, so a couple of options come to mind for me to build to be certain I check everywhere for a match:
1) preform a couple of
SELECT * FROM [tablename]
SQL statements, one per table and cram the results into arrays in my program, then search across the arrays one element at a time with regex, looking for a matched pattern similar to the search term, and if I find one, add the entire record that had a match in it to a results array to display on screen at the end of the search.
2) take whatever she's looking for into the program as a scaler and prepare multiple select statements around it, such as
SELECT * FROM DRIVERS WHERE DLN = $Search_Variable
SELECT * FROM DRIVERS WHERE First_Name = $Search_Variable
SELECT * FROM CARS WHERE LICENSE = $Search_Variable
and so on for each attribute of each table, sticking the results into a results array to show on screen when the search is done.
Is there a cleaner way to go about this lookup without having to make her specify exactly what she's looking for? Possibly some kind of SQL statement I've never seen before?
Seems like a right application for the Sphinx full-text search engine. There's the Sphinx::Search module on CPAN which can be used as perl client for Sphinx.
First of all, you should not use SELECT * and you should definitely use bind values.
Second, the easiest way to figure out what the user is searching for is to ask the user. Have a set of checkboxes likes so:
Search among: [ ] Names
[ ] License Plate Numbers
[ ] Driver's License Numbers
Alternatively, you can note that names do not contain any digits and I have not seen any driver's license number which contains digits. There are other heuristics you can apply to partially deduce what the user was trying to search.
If you do an OK job of presenting the results, this might work out.
Finally, try to figure out what search possibilities are offered by the database you are using and leverage them so that most of the searching happens before the user interface touches the data.

Is there common street addresses database design for all addresses of the world? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
Improve this question
I am a programmer and need a practical approach to storing street address structures of the world in a database. So which is the best and common database design for storing street addresses? It should be simple to use, fast to query and dynamic to store all street addresses of the world.
It is possible to represent addresses from lots of different countries in a standard set of fields. The basic idea of a named access route (thoroughfare) which the named or numbered buildings are located on is fairly standard, except in China sometimes. Other near universal concepts include: naming the settlement (city/town/village), which can be generically referred to as a locality; naming the region and assigning an alphanumeric postcode. Note that postcodes, also known as zip codes, are purely numeric only in some countries. You will need lots of fields if you really want to be generic.
The Universal Postal Union (UPU) provides address data for lots of countries in a standard format. Note that the UPU format holds all addresses (down to the available field precision) for a whole country, it is therefore relational. If storing customer addresses, where only a small fraction of all possible addresses will be stored, its better to use a single table (or flat format) containing all fields and one address per row.
A reasonable format for storing addresses would be as follows:
Address Lines 1-4
Locality
Region
Postcode (or zipcode)
Country
Address lines 1-4 can hold components such as:
Building
Sub-Building
Premise number (house number)
Premise Range
Thoroughfare
Sub-Thoroughfare
Double-Dependent Locality
Sub-Locality
Frequently only 3 address lines are used, but this is often insufficient. It is of course possible to require more lines to represent all addresses in the official format, but commas can always be used as line separators, meaning the information can still be captured.
Usually analysis of the data would be performed by locality, region, postcode and country and these elements are fairly easy for users to understand when entering data. This is why these elements should be stored as separate fields. However, don't force users to supply postcode or region, they may not be used locally.
Locality can be unclear, particularly the distinction between map locality and postal-locality. The postal locality is the one deemed by a postal authority which may sometimes be a nearby large town. However, the postcode will usually resolve any problems or discrepancies there, to allow correct delivery even if the official post-locality is not used.
Have a look at Database Answers. Specifically, this covers many cases:
(All variable length character datatype)
AddressId
Line1
Line2
Line3
City
ZipOrPostcode
StateProvinceCounty
CountryId
OtherAddressDetails
Ask yourself what is the main purpose of storing this data? Do you intend to actually send mail to the person at the address? Track demographics, populations? Be able to ask callers for their correct address as part of some basic authentication/verification? All of the above? None of the above?
Depending on your actual need, you will determine either a) it doesn't really matter, and you can go for a free-text approach, or b) structured/specific fields for all countries, or c) country specific architecture.
Sometimes the closest you can get to a street address is the city.
I once had a project to put all the Secondary Schools in India in Google Maps. I wrote a spiffy program using the Google API and thought it would be quite easy.
Then I got the data from the client. Some school addresses were things like "Across from the market, next to the barber" or "Near old bus stand".
It made my task much harder since, unfortunately, the Google API does not support that format.
For international addresses, it is remarkably hard to find a way to format the information if it is broken down into fields. As a for instance, an Italian address uses:
<street address>
<zip> <town> <region>
<country>
Such as
Via Eroi della Repubblica
89861 Tropea VV
Italy
That is rather different from the order for US addresses - on the second line.
See also the SO questions:
How many address fields would you use for a UK database?
Do you break up addresses into street / city / state / zip?
How do you deal with duplicate street suffixes?
Best practices for storing postal addresses in a database (RDBMS)?
Also check out tag 'postal-code'.
Edit: Reverse order of region and town - per UPU
Maybe this is useful:
https://gist.github.com/259744
For a project I collected a table of informations about all countries of the world, including ISO codes, top level domain, phone code, car sign, length and regex of zip.
Country names and comments unfortunately only in German...
Differently of other answers here, I believe it's possible to have an structured address database.
Just out of the hat, I can think of the following structure:
Country
Region (State / Province)
Locality (City / Municipality)
Sub-Locality (County / other sub-division of a locality)
Street
But how to query it fast enough?
One way I always think it can be accomplished is to ask for the ZIP Code (or Postal Code) which varies from country to country, but is solid within the country.
This way you can structure your data around the information provided by the postal offices around the world.
Depends on how free-form you are prepared to go with the fields. One free-form address field will obviously always do, but be of relatively little help narrowing down geography.
The problem you'll have is that there is too much variation in the level of geographical hierarchy across countries. Heck, some countries do not even have 'street addresses' everywhere.
I recommend you don't try to make it too clever.
Len Silverston of Universal Data Model fame recommends a separate hierarchy of GEOGRAPHIC BOUNDARIES and depending on how much free-formed-ness you're willing to accept either simple STREET ADDRESS LINEs or per-country derivatives.
No, absolutely not. If you compare the way US and Japanese addresses work, you'll see that it's not possible.
UPDATE:
On second thought, anything can be done, but there's a trade-off.
One approach is to model the problem with address and address_attribute tables, with a 1:m relationship between them, anything can be modeled. The address_attribute table would have a pk, a name, a value, and an fk that points back to its address parent's pk. It's almost like using a Map with name, value pairs.
The trade-off is having to do a JOIN every time you want an address. You also have to interrogate the names of the address_attributes to figure out what you're dealing with each time.
Another approach would be to do more comprehensive research on how addresses are modeled around the world. In an object-oriented world you might have the western Address class (street1/street2/city/state/zip) and others for Japan, China, as many as needed to tile the address space. Then you'd have a master Address table and child tables to the other types with a 1:1 relationship between them.
How does Amazon or eBay do it? They ship internationally. Do they have locale-specific UI features? I've only used the US locale.
No, there are no standard addressing scheme. It usually varies from country-to-country.
Even the Universal Postal Union said on Adressing the world, an address for everyone that there is none. The best solution for this is to use the 2/3-letter country code standards known as ISO 3166 and treat everything else by country's standards.
However, if you really are desperate to use easily accessible tools for your project, you can try Google Place API.
Your design should strongly depend from your purpose. Some people have posted how to structure data. So if you simply want to send s-mail to someone, it will do. Things begin to complicate if you want to use this data for navigation. Car navigation will require additional structures to contain traffic info (eg one-way roads), while foot navigation will require a lot of additional data. Here is small example: in my city, my neighborhood is near the park. Next to the park is former airfield (in fact, one of the oldest in Europe) turned into aviation museum. Next to aviation museum is a business park. Street number for museum is 39, while business park numbers start with 39A. So it may seem that 39 and 39A are close – but it takes about a mile to walk from one to another (and even longer if going by car) .
This is just a small example taken from my city, I think you can probably find a lot of exceptions (especially in rural or wilder parts of every country).

Finding exact match using Lucene search API

I'm working on a company search API using Lucene.
My Lucene company index has got 2 companies:
1.Abigail Adams National Bancorp, Inc.
2.National Bancorp
If the user types in National Bancorp, then only company # 2(ie. National Bancorp) should be returned and not #1.....ie. only exact matches should be returned.
How do I achieve this functionality?
Thanks for reading.
You can use KeywordAnalyzer to index and search on this field. Keyword Analyzer will generate only one token for the entire string.
I googled a lot with no help for the same problem. After scratching my head for a while I found the solution. Search the string within double quotes, that will solve your problem.
National Bancorp will return both #1 and #2 but "National Bancorp" will return only #2.
This is something that may warrant the use of the shingle filter. This filter groups multiple words together. For example, Abigail Adams National Bancorp with a ShingleFilter of 3 tokens would produce (assuming a simple WhitespaceAnalyzer) [Abigail], [Abigail Adams], [Abigail Adams National], [Adams National Bancorp], [Adams National], [Adams], [National], [National Bancorp] and [Bancorp].
If a user the queries for National Bancorp, you will get an exact match on National Bancorp itself, and a lower scored exact match on Abigail Adams National Bancorp (lower scored because this one has much more tokens in the field, thus lowering the idf). I think it makes sense to return both documents on such a query.
You may want to apply the shingle filter at query time as well, depending on the use case.
You may want to reconsider your requirements, depending on whether or not I correctly understood your question. Please bear with me if I did misunderstand you.
Just a little food for thought:
If you only want exact matches returned, then why are you searching in the first place?
Are you sure that the user expects exact matches? I typically search assuming that the search engine will accommodate missing words.
Suppose the user searched for National Bank but National Bank was no longer in your index. Would you still want Abigail Adams National Bancorp, Inc to be excluded from the results simply because it was not an exact match?
In light of this, I would suggest you continue to present all possible matches (exact or not) to the user and let them decide for themselves which is most appropriate for them. I say this simply because you may not be thinking the same way as all of your users. Lucene will take care of making sure the closest matches rank highest in the results, helping them make quicker choices.
I have the same requirements of exact matching. I have used queryBuilder of org.hibernate.search.query.dsl and the query is:
query = queryBuilder.phrase().withSlop(0).onField(field)
.sentence(searchTerm).createQuery();
Its working for me.

how to determine if a record in every source, represents the same person

I have several sources of tables with personal data, like this:
SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...
SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...
SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...
So, assuming that records with ID 1, from sources 1 and 2, are the same person, my problem is how to determine if a record in every source, represents the same person. Additionally, sure not every records exists in all sources. All the names, are written in spanish, mainly.
In this case, the exact matching needs to be relaxed because we assume the data sources has not been rigurously checked against the official bureau of identification of the country. Also we need to assume typos are common, because the nature of the processes to collect the data. What is more, the amount of records is around 2 or 3 millions in every source...
Our team had thought in something like this: first, force exact matching in selected fields like ID NUMBER, and NAMES to know how hard the problem can be. Second, relaxing the matching criteria, and count how much records more can be matched, but is here where the problem arises: how to do to relax the matching criteria without generating too noise neither restricting too much?
What tool can be more effective to handle this?, for example, do you know about some especific extension in some database engine to support this matching?
Do you know about clever algorithms like soundex to handle this approximate matching, but for spanish texts?
Any help would be appreciated!
Thanks.
The crux of the problem is to compute one or more measures of distance between each pair of entries and then consider them to be the same when one of the distances is less than a certain acceptable threshold. The key is to setup the analysis and then vary the acceptable distance until you reach what you consider to be the best trade-off between false-positives and false-negatives.
One distance measurement could be phonetic. Another you might consider is the Levenshtein or edit distance between the entires, which would attempt to measure typos.
If you have a reasonable idea of how many persons you should have, then your goal is to find the sweet spot where you are getting about the right number of persons. Make your matching too fuzzy and you'll have too few. Make it to restrictive and you'll have too many.
If you know roughly how many entries a person should have, then you can use that as the metric to see when you are getting close. Or you can divide the number of records into the average number of records for each person and get a rough number of persons that you're shooting for.
If you don't have any numbers to use, then you're left picking out groups of records from your analysis and checking by hand whether they look like the same person or not. So it's guess and check.
I hope that helps.
This sounds like a Customer Data Integration problem. Search on that term and you might find some more information. Also, have a poke around inside The Data Warehousing Institude, and you might find some answers there as well.
Edit: In addition, here's an article that might interest you on spanish phonetic matching.
I've had to do something similar before and what I did was use a double metaphone phonetic search on the names.
Before I compared the names though, I tried to normalize away any name/nickname differences by looking up the name in a nick name table I created. (I populated the table with census data I found online) So people called Bob became Robert, Alex became Alexander, Bill became William, etc.
Edit: Double Metaphone was specifically designed to be better than Soundex and work in languages other than English.
SSIS , try using the Fuzzy Lookup transformation
Just to add some details to solve this issue, I'd found this modules for Postgresql 8.3
Fuzzy String Match
Trigrams
You might try to cannonicalise the names by comparing them with a dicionary.
This would allow you to spot some common typos and correct them.
Sounds to me you have a record linkage problem. You can use the references in the link.