SQL DIFFERENCE function with names bringing too many results - sql

I have a function that uses the SQL DIFFERENCE function to see if the name of a client is similar to a client already in the database
SELECT ID FROM People p
WHERE DIFFERENCE(p.FullName, #fullName) = 4
Being #fullname a variable passed to the function. The issue I'm having is that if I pass "pedro sanchez" as a parameter, the query will bring me all the Peter's in the database, or if I enter "pablo sanchez", it'll bring record "PEOPLE'S CREDIT UNION".
As I understand the DIFFERENCE function should returns 4 when the two strings are almost identical, but the results I'm having say otherwise.
Is there a way to further specify the resemblance to the DIFFERENCE function, or maybe another approach in finding similar names ?

Difference() is based on soundex(), which in turn -- to be frank -- is a lousy system for comparing strings. Let me add a caveat: it is pretty good for the purpose it was designed for, which is matching last names of people in English. You can read about the rules here and you can try it out here. Using the latter link, you can see that "Pedro" and "People" have the same code, P-140.
Soundex encodes the consonants and basically the first four matching consonants the list it cares about. (Some languages, such as Hawaiian and other Polynesian languages are rather light in consonants. One assumes the designers were not thinking about names in such languages.)
When you are looking for proximity among written strings, Levenshtein distance is a common metric. Unfortunately, SQL Server does not have this functionality built-in, but you can easily find implementations on the web. For most real applications, Levenshtein distance is too slow. Happily, the functionality of the full text search component is usually sufficient for most purposes.

Related

SQL Server Text Searching

I have a business requirement where we need to do somce crazy name matching against records stored in the database and I was wondering if there is any easy way to do it using SQL Server.
Name Stored in the DB : Austin K
Name to be Matched from UI : Austin Kierland
That's just a sample. In reality, there could be whole lot of different permutations and combinations.
If it's other way round, I could've used wild character but in this case, the name in the database is smaller than the search criteria.
Any suggestions?
Realistically - no. Databases were meant for comparing absolute values, not for messy comparisons. The way they store their data internally just isn't fit for really messy matching. Actually even a superpowerful dedicated search engine like Google, that has a LOT of messy matching features, wouldn't be able to pull off your example without prior knowledge.
I don't know how the requirement is precisely worded, but I'd either shoot the feature request with "technically impossible", or implement a rule set for which messy matches are tried - for your example, you could easily 'hard code' that multiple searches are executed when capitalized words are entered, shortening them so a single letter. No idea if that's a solution to your problem though.
You can do a normal search using the LIKE operator which determines whether a specific character string matches a specified pattern. The problem you will run into is the probability of the returning of multiple records or incorrect people. I've had similar requirement myself for a business app and the best solution to the issue is to require other qualifying values rather then just name. If you do a partial name search without other qualifying data you are certainly going to come across the false positive matches and/or multiple records. In my case I built a web service that checks eligibility allowing text search for first & last name but also added date of birth, primary person SSN, and gender which ensured the matching person was in deed the person intended to search for. If my situation was like yours in which name was the only search criteria my recommendation to the business would be we cannot perform the search until qualifying data is entered into the database otherwise there is no accurate way to query the results they are looking for.

Finding strings that differ with at most one letter from a given string in SAS with PROC SQL

First some context. I am using proc sql in SAS, and need to fetch all the entries in a data set (with a couple of million entries) that have variable "Name" equal to (let's say) "Massachusetts". Of course, since the data was once manually entered by humans, close to all conceivable spelling errors occur ("Amssachusetts", "Kassachusetts" etc.).
I have found that few entries get more than two characters wrong, so the code
Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"
would select the entries I am looking for. However, I am hoping that there must be a more convenient way to write
Name that differs by at most 2 characters from "Massachusetts";
Is there? Or is there some other strategy for fetching these entries? I tried searching both stackoverflow and the web but was unsuccesful. I am also a relative beginner with both SQL and SAS.
Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).
Thanks in advance.
(Edit: Improved the title)
SAS has built-in functions COMPGED and COMPLEV to compute distances between strings. Here is an example that shows how to select just those with a Levenshtein edit distance of less than or equal to 2.
data typo;
input name $20.;
datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;
proc sql;
select name from typo
where complev(name, "massachusetts") <= 2;
quit;
There are other phonetic algorithms like Hamming distance that should work better.
You can search on google for implementation of this algorithm for your specific DB engine.
What you are looking for is "Approximate string matching". For that one can use "Levenshtein distance computing algorithm". I am not sure, but hope that this answer will help
You could implement a stored function of this type (Oracle syntax, transform to your RDBMS):
CREATE FUNCTION distance(one VARCHAR2, two VARCHAR2) RETURN NUMBER IS
DETERMINISTIC
BEGIN
-- do some comparison here
END distance;
And then use it in SQL:
SELECT * FROM table WHERE distance(name, 'Massachusetts') <= 2
Of course, these things tend to be quite slow...
I know this is four years too late but since it might also give ideas to others who are searching this thread:
What you're considering is a semantic layered design you would need to implement some conditional logic for these different text comparisons, using Lenvenschtien distances like the Jaro-Winkler for comparing text of differing lengths and Hamming for those of the same length for which you suppose simple text trans-positioning. This is nothing new these days with all of the various text mining programs out there.
Here is a post which is very good in my view;
Jaro-Winkler string comparison function in SAS

SQL - searching database with the LIKE operator

Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and return relating records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope however, if a user mispells dinosaurs? IE:
Dinosores
(Poor sore dino). How can we search allowing for error in spelling? We can associate common misspellings we see in search with the correct spelling, and then search on the original terms + corrected term, but this is time consuming to maintain.
Any way programatically?
Edit
Appears SOUNDEX could help, but can anyone give me an example using soundex where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526) .
You can also use DIFFERENCE function (on same link as soundex) that will compare levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link on the Fuzzt Logic answer provided by #Neil Knight (+1 to that, for me!).
This stackoverflow article also details possible sources for implentations for Fuzzy Logic in TSQL. Once respondant also outlined Full text Indexing as a potential that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
Short answer, there is nothing built in to most SQL engines that can do dictionary-based correction of "fat fingers". SoundEx does work as a tool to find words that would sound alike and thus correct for phonetic misspellings, but if the user typed in "Dinosars" missing the final U, or truly "fat-fingered" it and entered "Dinosayrs", SoundEx would not return an exact match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
You can also try the SubString(), to eliminate the first 3 or so characters . Below is an example of how that can be achieved
SELECT Fname, Lname
FROM Table1 ,Table2
WHERE substr(Table1.Fname, 1,3) || substr(Table1.Lname,1 ,3) = substr(Table2.Fname, 1,3) || substr(Table2.Lname, 1 , 3))
ORDER BY Table1.Fname;

Testing phrases to see if they match each other

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing sub-string matching is AhoCorsick.
Although you will have to read the data out of the database to use it, it is tremendously fast, when compared to more naive methods.
See here for a related question on substring matching:
And here for an AhoCorsick implementation in Java:
It would be great to get a little more context as to why you need to see which phrases are subsets of others: for example, it seems strange that the DB would be built in such a way anyway: you're having to do the work now because the DB is not in an appropriate format, so it makes sense that you should 'fix' the DB or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within) you would break down the other phrase in the same way and count the number of phrases common between them.
It has the nice side effect of still matching if you were to use (for example) "see phases to testing": because the individual words would match.. but because the order is different the pairs wouldn't, so it's taking phrases (consecutive words) into account at the same time, the number of matches wouldn't be as high, good for use as a 'score' in matching.
As I say that -kind- of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What do I mean with 'WHERE' statement previously? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you'll can remove them from the comparison table, as any next test you'll do will never get a match.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
DTS or SSIS may be your friend here as well.

First Name Variations in a Database

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.
Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone and then searching on the first name using this name heirarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.
As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.
Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!
Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.
The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?
Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exists, either in a database table, or as a thesaurus. Neither have an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).
In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.
The term you are looking for is Hypocorism:
http://en.wikipedia.org/wiki/Hypocorism
And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.
I would go with a structure like this:
create table given_names (
id int primary key,
name text not null unique
);
create table hypocorisms (
id int references given_names(id),
name text not null,
primary key (id, name)
);
insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');
Then you could write a function/sproc to normalize a name:
normalize_given_name('Bill'); --returns William
One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)
I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have large amount of the data you want.
Are you using SQl Server 2005 Express with Advanced Services as to me it sounds you would benefit from the Full Text indexing and more specifically Contains and Containstable which you can use with specific instructions here is a link for the uses of Containstable:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
and here is the download link for SQL Server 2005 With Advanced Services:
http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en
Hope this helps,
Andrew
You can use the SQL Server Full Text Search and do an inflectional search.
Basically like:
SELECT ProductId, ProductName
FROM ProductModel
WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')
Check out:
http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches
http://msdn.microsoft.com/en-us/library/ms345119.aspx
http://www.mssqltips.com/tip.asp?tip=1491
Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names, women change these all the time and makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search.)
You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.
Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.
It's a known problem, and there are open source stemmers available.
-Adam
No, Full Text searches will not help to solve your problem.
I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)
SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms
Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
As a side issue, instead of returning similar results, it might prove more intuitive to the user to use a AJAX based script to deliver similar sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.
Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.
Here are some pointers.