How to find strings which are similar to a given string in SQL Server? - sql

I have a SQL Server table which contains several string columns. I need to write an application which takes a string and searches for similar strings in that table.
For example, if I give "مختار" or "مختر" as the input string, I should get these rows from the table:
1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری
I've searched the net for a solution but found nothing useful. I've read this question, but it will not help me because:
I am using MS SQL Server not MySQL
my table contents are in Persian, so I can't use Levenshtein distance and similar methods
I prefer an SQL Server only solution, not an indexing or daemon based solution.
The best solution would be one which helps us sort the results by similarity, but that is optional.
Do you have any suggestions for that?
Thanks

MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?
-- The N prefix keeps the Persian literal as Unicode (assumes an nvarchar column)
SELECT * FROM table WHERE input LIKE N'%مختار%'

Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?
--This part finds the string you want to search for
declare @MyString nvarchar(max)
set @MyString = (Select column from table
where **LOGIC TO FIND THE STRING GOES HERE**)
--This part searches for that string
select searchColumn, ABS(Len(searchColumn) - Len(@MyString)) as Similarity
from table where searchColumn LIKE '%' + @MyString + '%'
Order by Similarity, searchColumn
The similarity part is something like the thing you posted. Strings that are "more similar", meaning closer in length to the search string, appear higher in the query results.
The ABS call can obviously be avoided, but I included it just in case.
Hope that helps =-)
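To make the ordering idea above concrete outside the database, here is a minimal Python sketch of the same logic: keep the candidates that contain the search string, then sort by the absolute difference in length. The function name rank_by_length and the sample list are made up for illustration; this is not the poster's code, just the same rule restated.

```python
def rank_by_length(needle, candidates):
    """Keep candidates containing `needle`, closest in length first.

    Mirrors: WHERE col LIKE '%' + needle + '%'
             ORDER BY ABS(LEN(col) - LEN(needle)), col
    """
    hits = [c for c in candidates if needle in c]
    return sorted(hits, key=lambda c: (abs(len(c) - len(needle)), c))


# Hypothetical sample data, taken from the names in the question.
names = ["شهاب الدین مختاری", "مختاری", "شهاب مختاری", "علی"]
for name in rank_by_length("مختار", names):
    print(name)
```

Note that a plain substring test like this (and LIKE) finds nothing for the misspelled input "مختر", which is the part of the question that needs a real similarity measure.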

Besides the LIKE operator, you can use the condition WHERE instr(columnname, search) > 0 (in SQL Server the equivalent function is CHARINDEX(search, columnname)); however, this is generally slower. It returns the starting position of one string within another, so searching ABCDEFG for CD returns 3. Since 3 > 0, the record is returned. However, in the case you've described, LIKE seems to be the best solution.

The general problem is that in some languages the same letter has a different written form at the beginning, in the middle and at the end of a word, and thus different codes. We can try to use specific Persian collations, but in general this will not help.
The second option is to use SQL Server's full-text search (FTS) abilities, but again, if it has no special module for the language, it is much less useful.
The most general way is to use your own language processing, which is a very complex task. The following keywords, plus Google, can help to understand the size of the problem: DLP, words and terms, bigrams, n-grams, grammar and morphological inflection.

Try the built-in SOUNDEX() and DIFFERENCE() functions. I hope they work fine for Persian.
Look at the following reference:
http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/
The Similarity() function described there helps you sort results by similarity (as you asked in your question), and it can use algorithms other than Levenshtein edit distance, depending on the value of the @method parameter:
0 The Levenshtein edit distance algorithm
1 The Jaccard similarity coefficient algorithm
2 A form of the Jaro-Winkler distance algorithm
3 Longest common subsequence algorithm

The LIKE operator may not do what is being asked. For example, suppose a record contains the value 'please, I want to ask a question' and the query should find it as a match for the similar sentence 'Can I ask a question, please'. LIKE can attempt this with patterns such as '%[your sentence]' or '[your sentence]%', but it is not advisable for string similarity, because sentences vary and your LIKE logic may not fetch the matching records. It is advisable to use naive Bayes text classification, assigning labels to your sentences, or you can try the semantic search feature in SQL Server.

Related

Data Quality - Is SOUNDEX the solution?

I work for an organization that has a serious data quality problem with names. There are fifteen databases that contain information about people. For example:
Database 1
Name=Fre&d Blo-ggs DOB 01/01/1980
Database 2
Name=Freddy Bloggs DOB 01/01/1980
If a user searches for Fred Bloggs using my search tool then I want both records to be found. I was thinking about something like this:
SELECT * FROM Person WHERE Soundex('Fred Bloggs') = Soundex('Fre&d Blo-ggs')
Is it advisable to use Soundex like this rather than using replace statements like this:
select Replace(Replace(Replace(Name,',',''),'&',''),'#','') from Person
where Replace(Replace(Replace(Name,',',''),'&',''),'#','') = @Name
@Name is the variable passed in. Is there a better way of doing it, e.g. using regular expressions? Does Soundex affect performance?
Nice idea. I would not suggest using it though. I suppose that "John Right" is not the same as "John Write", even though they sound the same. In the end, what matters is what you want to compare. If you want to check whether the names sound the same, then SOUNDEX is fine.
However, I would suggest correcting your data somehow. This would be a real solution, although I can imagine that is not an easy one.
Hope I helped!
Whether Soundex is better than regex depends on your data. For example, there are different Soundex versions for different languages. You have to check with your data which one works better.
Of course Soundex affects performance, as does any other additional function you call. If performance becomes a problem, I would advise adding an extra column with the precomputed Soundex or normalized names and creating an index over it.
From my own experience, I think a normalized, simplified search criterion, e.g. parts of the surname, the first name and the month of the birth date, should be sufficient to get all matching persons, but not too many, so a user can decide which person (s)he really wants to choose.
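A sketch of what the "normalized names" column suggested above might hold, written in Python for brevity. The rule used here (drop everything that is not a letter or a space, lower-case, collapse spaces) is an assumption for illustration and would need tuning against the real data; normalize_name is a hypothetical helper, not something from the posts.

```python
def normalize_name(name):
    """Strip punctuation and digits, fold case, and collapse whitespace."""
    kept = "".join(ch for ch in name if ch.isalpha() or ch.isspace())
    return " ".join(kept.casefold().split())


# The dirty record from the question normalizes to the clean search term.
print(normalize_name("Fre&d Blo-ggs"))
print(normalize_name("Fred Bloggs") == normalize_name("Fre&d Blo-ggs"))
```

Storing this value once per row and indexing it avoids calling nested REPLACE functions on every search. It does not make "Freddy Bloggs" equal to "Fred Bloggs"; that still needs a prefix or distance comparison on top.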
Soundex won't help you. You will get stuck if a consonant appears in the name by mistake.
It's better to go for a string distance and specify a percentage threshold, i.e. a kind of fuzzy matching.
Have a look at the link below for fuzzy matching using the Levenshtein edit distance algorithm.
Levenshtein edit distance - MS SQL SERVER
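For readers who want to see what "string distance with a percentage" means before porting it to T-SQL, here is a minimal Python sketch of the textbook Levenshtein dynamic program plus a similarity ratio. The normalization by the longer string's length is one common choice and an assumption here, not something the linked post prescribes.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("Fred Bloggs", "Fre&d Blo-ggs"))
print(round(similarity("Fred Bloggs", "Freddy Bloggs"), 2))
```

A search could then keep rows whose similarity is above a chosen threshold such as 0.8; the right threshold depends on the data.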

Contains() function falters with strings of numbers?

For some background information, I'm creating an application that searches against a couple of indexed tables to retrieve some records. It isn't overtly complex to the point of say Google, but it's good enough for the purpose it serves, barring this strange issue.
I'm using the Contains() function, and it's going very well, except when the search contains strings of numbers. Note that I'm only passing in a string; no numerical datatypes are passed in anywhere, only characters. We're searching against a collection of emails, each appended with a custom ID when shot off from a workflow. So while testing, we decided to search via number strings.
In our test, we isolated a number 0042600006, which belongs to one and only one email subject. However, when using our query we are getting results for 0042600001, 0042600002, etc. The query is this as follows (with some generic columns standing in):
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006')
We've tried every possible combination: '0042600006*', '"0042600006"' and '"0042600006*"'.
I think it's just a limitation of the function, but I thought this would probably be the best place for answers. Thanks in advance.
Asked this same question recently. Please see the insightful answer someone left me here
Essentially what this user says to do is to turn off the noise words (Microsoft has included integers 0-9 as noise in the Full Text Search). Hope you can use this awesome tool with integers as I now am!
Try adding language 1033 as an additional parameter. That worked in my case.
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006', language 1033)
Try using
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '%0042600006%')

Finding strings that differ with at most one letter from a given string in SAS with PROC SQL

First some context. I am using proc sql in SAS, and need to fetch all the entries in a data set (with a couple of million entries) that have variable "Name" equal to (let's say) "Massachusetts". Of course, since the data was once manually entered by humans, close to all conceivable spelling errors occur ("Amssachusetts", "Kassachusetts" etc.).
I have found that few entries get more than two characters wrong, so the code
Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"
would select the entries I am looking for. However, I am hoping that there must be a more convenient way to write
Name that differs by at most 2 characters from "Massachusetts";
Is there? Or is there some other strategy for fetching these entries? I tried searching both Stack Overflow and the web but was unsuccessful. I am also a relative beginner with both SQL and SAS.
Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).
Thanks in advance.
(Edit: Improved the title)
SAS has built-in functions COMPGED and COMPLEV to compute distances between strings. Here is an example that shows how to select just those with a Levenshtein edit distance of less than or equal to 2.
data typo;
input name $20.;
datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;
proc sql;
select name from typo
where complev(name, "massachusetts") <= 2;
quit;
There are other string-distance algorithms, such as Hamming distance (for strings of equal length), that may work better.
You can search on Google for an implementation of this algorithm for your specific DB engine.
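For reference, Hamming distance only counts positions where two equal-length strings differ, which is what the chain of LIKE patterns with two underscores in the question expresses. A minimal Python sketch follows; the function name hamming is chosen for this example, and unlike COMPLEV it does not handle insertions or deletions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


target = "Massachusetts"
for name in ["Kassachusetts", "Amssachusetts", "Massachusetts"]:
    if hamming(name, target) <= 2:
        print(name)
```

A name with a dropped letter, such as "Masachusetts", has a different length and is rejected outright, so an edit distance is the safer choice when letters can go missing.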
What you are looking for is "Approximate string matching". For that one can use "Levenshtein distance computing algorithm". I am not sure, but hope that this answer will help
You could implement a stored function of this type (Oracle syntax, transform to your RDBMS):
CREATE FUNCTION distance(one VARCHAR2, two VARCHAR2) RETURN NUMBER
DETERMINISTIC IS
BEGIN
-- do the comparison here; Oracle's built-in UTL_MATCH.EDIT_DISTANCE is one option
RETURN UTL_MATCH.EDIT_DISTANCE(one, two);
END distance;
And then use it in SQL:
SELECT * FROM table WHERE distance(name, 'Massachusetts') <= 2
Of course, these things tend to be quite slow...
I know this is four years too late but since it might also give ideas to others who are searching this thread:
What you're considering is a semantic layered design. You would need to implement some conditional logic to choose between the different text comparisons, using string distances such as Levenshtein or Jaro-Winkler for comparing text of differing lengths, and Hamming for text of the same length where you suspect simple character transpositions. This is nothing new these days, with all of the various text mining programs out there.
Here is a post which is very good in my view:
Jaro-Winkler string comparison function in SAS

SQL - searching database with the LIKE operator

Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and return relating records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope, however, if a user misspells dinosaurs? I.e.:
Dinosores
(Poor sore dino). How can we search while allowing for spelling errors? We can associate common misspellings we see in searches with the correct spelling, and then search on the original terms plus the corrected term, but this is time-consuming to maintain.
Is there any way to do it programmatically?
Edit
Appears SOUNDEX could help, but can anyone give me an example using soundex where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526) .
You can also use DIFFERENCE function (on same link as soundex) that will compare levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link in the Fuzzy Logic answer provided by @Neil Knight (+1 to that, from me!).
This Stack Overflow article also details possible sources for implementations of fuzzy logic in T-SQL. One respondent also mentioned full-text indexing as an option that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
Short answer, there is nothing built in to most SQL engines that can do dictionary-based correction of "fat fingers". SoundEx does work as a tool to find words that would sound alike and thus correct for phonetic misspellings, but if the user typed in "Dinosars" missing the final U, or truly "fat-fingered" it and entered "Dinosayrs", SoundEx would not return an exact match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
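The dictionary lookup described above can be sketched in a few lines of Python with the standard library's difflib, which ranks candidates by a similarity ratio rather than by a strict edit distance. The vocabulary list and the 0.6 cutoff are assumptions for illustration; in practice the vocabulary would be built from the words that actually occur in the articles.

```python
from difflib import get_close_matches


def suggest(term, vocabulary, cutoff=0.6):
    """Return the closest known word for `term`, or None if nothing is close."""
    matches = get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def correct_query(query, vocabulary):
    """Replace each word with its suggestion, keeping words with no match."""
    return [suggest(word, vocabulary) or word for word in query.split()]


# Hypothetical vocabulary taken from the sample articles in the question.
vocabulary = ["dinosaurs", "sql", "amazing", "name", "talk"]
print(correct_query("Dinosores wrocks", vocabulary))
```

The corrected words can then be fed into the existing LIKE search, so the database query itself does not need to change.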
You can also try SubString() to compare only the first three or so characters of each name. Below is an example of how that can be achieved:
SELECT Table1.Fname, Table1.Lname
FROM Table1, Table2
WHERE substr(Table1.Fname, 1, 3) || substr(Table1.Lname, 1, 3) = substr(Table2.Fname, 1, 3) || substr(Table2.Lname, 1, 3)
ORDER BY Table1.Fname;

MySQL: select the closest match?

I want to show the closest related item for a product. So say I am showing a product and the style number is SG-sfs35s. Is there a way to select whatever product's style number is closest to that?
Thanks.
EDIT: to answer your questions. I definitely want to keep the first 2 letters, as that is the manufacturer code, but for the part after the first dash, just whatever matches closest. So, for example, SG-sfs35s would match SG-shs35s much more than SG-sht64s. I hope this makes sense. Whenever I do LIKE product_style_number, it only pulls the exact match.
There normally isn't a simple way to match product codes that are roughly similar.
A more SQL friendly solution is to create a new table that maps each product to all the products it is similar to.
This table would either need to be maintained manually, or a more sophisticated script can be executed periodically to update it.
If your product codes follow a consistent pattern (all the letters are the same for similar products, with only the numbers changing), then you should be able to use a regular expression to match the similar items. There are docs on this here...
It sounds like what you want is levenshtein distance .
Unfortunately, there isn't a built-in Levenshtein function for MySQL, but some folks have come up with a user-defined function that does it (dead link).
You will probably want to do it as a stored procedure, as I expect that the algorithm may not be trivial.
For example, you may split the term at the -, so you have two parts. You do a LIKE query on each part and use that to make a decision.
You could just loop though, replacing the last character with "%" until you get at least one result, in your stored procedure.
Sounds like you need something like Lucene, though i'm not sure if that would be overkill for your situation. But it certainly would be able to do text searches and return the ones most similar first.
If you need something simpler, I would start by searching with the full product code, then, if that doesn't work, use wildcards or remove some characters until the query returns a result.
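Here is a small Python sketch of the "shorten it until something matches" idea from the answers above, using an in-memory SQLite table as a stand-in for the MySQL one. The table and column names (products, style), the keep=3 prefix that protects the manufacturer code, and the final ranking with difflib are all assumptions added for illustration.

```python
import sqlite3
from difflib import SequenceMatcher


def closest_styles(conn, code, keep=3):
    """Shorten the LIKE prefix one character at a time until rows match,
    never cutting into the first `keep` characters (e.g. 'SG-')."""
    for length in range(len(code), keep - 1, -1):
        rows = conn.execute(
            "SELECT style FROM products WHERE style LIKE ? AND style <> ?",
            (code[:length] + "%", code),
        ).fetchall()
        if rows:
            styles = [row[0] for row in rows]
            # Rank the hits so the most similar code comes first.
            return sorted(styles,
                          key=lambda s: SequenceMatcher(None, code, s).ratio(),
                          reverse=True)
    return []


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (style TEXT)")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("SG-sfs35s",), ("SG-shs35s",), ("SG-sht64s",), ("XX-sfs35s",)])
print(closest_styles(conn, "SG-sfs35s"))
```

The prefix loop alone cannot tell SG-shs35s from SG-sht64s, since both first match at "SG-s%", which is why the sketch ranks the hits afterwards. Codes containing % or _ would need escaping before being used in a LIKE pattern.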
JD Isaacks.
This situation of yours is very simple to solve.
It's not like you need to use artificial intelligence like Google does.
http://www.w3schools.com/sql/sql_wildcards.asp
Take a look at this manual at w3schools about wildcards to use with your SELECT code.
But also you will need to create a new table with 3 columns: LeftCode, RightCode and WildCard.
Example:
Rows on Table:
LeftCode = SG | RightCode = 35s | WildCard = SG-s_s35s
LeftCode = SG | RightCode = 64s | WildCard = SG-s_t64s
SQL Code
If the user typed the code that matches the row1 of the table:
SELECT * FROM PRODUCTS WHERE CODE LIKE "$WildCard";
Where $WildCard is the PHP variable containing the column 3 of the new table.
I hope I helped, even 4 years late...