Percentage match between two strings in SQL Server - sql

I want something similar to this, two strings then the percentage match between the two, example below
string1 string2 percentage match
cali cali 100%

You'd have to write Levenstein distance function for that. Algorithm calculates number of insert+delete+substitute operations required to turn one string to another. In worst case scenario number of operations will be equal to length of longer string, so you could easily calculate 0%-100% similarity. There are T-SQL implementations available online, e.g. here (I didn't test it).

Related

How to do Fuzzy match in Snowflake / SQL

How to do fuzzy match in Snowflake / SQL
Here is the business logic
The ABC Company INC, The north America, ABC (Those two should shows a match)
The 16K LLC, 16K LLC (Those two should shows a match)
enter image description here
I attached some test data. Thank so much guys!
Any matching attempt that treats string pairs like "The ABC Company INC" and "The north America, ABC" or "Preferred ABC Group" and "The Preferred Residences" as a match is probably going to give you many false positive matches, since in some of your examples there is only one word similar between the strings.
That said, Snowflake does provide a couple of functions that might help: EDITDISTANCE and JAROWINKLER_SIMILARITY.
EDITDISTANCE generates a number that represents the Levenshtein distance between two strings (basically the number of edits it would take to change one string into the other). A lower number indicates fewer edits needed and so potentially a closer match.
JAROWINKLER_SIMILARITY uses an algorithm to calculate a "similarity" score between 0 and 100 for two strings. A higher number indicates more similarity, 100 being an exact match.
You could use either or both of these functions to generate scores for each pair of strings and then decide on a threshold that best represents a match for your purposes.

What is the worst case target number for binary search?

For simplicity, let's say the number is between 1 and n. Is there a generic formula that gives us the number that results in the maximum number of binary search iterations?
I always have to write things about by hand, but I feel there should be an analytical form that is a function of n.

SQL Query - hexadecimal vs. decimal values

I expected to find the answer to this question fairly quickly, but surprisingly, don't seem to see it anywhere.
I'm guessing that a comparison to a binary constant in an SQL query would be faster than a comparison to a decimal number, as the binary constant is probably a direct lookup while decimal numbers need to be converted, but is the performance difference measurable?
In other words, is the first query better than the second one? If so, how much better?
select *
from Cats
where Cats_Id = 0x0000000000000086
select *
from Cats
where Cats_Id = 134
There absolutely no difference: 0x0000000000000086 is an integer with a decimal value of 134. It's just written in base 16 (hexadecimal).
The two queries are exactly identical and will get exactly the same execution plan.
The one different will be if the column you are comparing to is binary(n) or varbinary(n). There the hex constant is representing a sequence of octets.
Your premise is based on a misunderstanding:
the binary constant is probably a direct lookup while decimal numbers need to be converted
The SQL you enter consists of characters in some text encoding; for simplicity, let's assume ASCII.
In the computer's memory, it's composed of a long string of binary states we normally write as 0 and 1, but could equally write as _ and |.
The binary data for the string 134 in ASCII looks something like __||___|__||__||__||_|__. The binary data for the string 0x0086 looks something like __||_____||||_____||______||______|||_____||_||_
When actually working with the data, e.g. comparing numbers, the computer will use a different representation altogether. The number "one hundred and thirty-four" will look something more like |____||_.
So whichever representation you use, there is a conversion going on.
Nonetheless, one conversion might be more efficient, by some incidental detail of its implementation, but by such a tiny margin that it would be almost impossible to measure amongst the noise of the system you're testing.
The answer is yes. In some cases, the query with the hexadecimal value is substantially better.
I hired a consultant DBA to help us with our system and after reviewing one of our queries, which was running a bit slow, he showed me that changing the value to hexadecimal improved it substantially (by about 95%).
He then showed me that, although I had an index on the field I was searching (binary foreign key), the index wasn't used when executing the query with the decimal value.
If anyone can provide a more detailed answer about the different cases in which these queries are identical or not, performance wise, I would appreciate that.

SQL Edit Distance: How have you handled Fuzzy String Matching using SQL in the past?

I have always wanted to ask for your views on this topic, so here we go:
My team just provided me with a list of customer accounts we need to match with other databases and the main challenge we face is the fact our list is non-standarized so we call similarly but differently the same accounts than in our databases. For example:
My_List.Customers_Name Customers_Database.Customers_Name
- -
Charles Schwab Charles Schwab Corporation
So for example, I use Jaro Wrinkler Similarity function and Edit Distance in order to gather a list of similar strings and then manually match the accounts if needed. My question is:
Which rules/filters do you apply to the results of the fuzzy data matching in order to reduce the amount of manual match?
I am refering to rules like:
If the string has more than 20 characters and Edit Distance <= 1 then it will probably be the same so consider it a match. If the string has less than 4 characters and Edit Distance >0 then it will probably not be the same account so consider it a mismatch.
These rules I apply are completely made up from my side, I am wondering if there are some standard convention for applying text string fuzzy matching in order to only retrieve useful results and reduce manual match workload.
If there are not, could you tell your experience and how you handled this before?
Thank you so much
I've done this a few times. It's hugely dependent on the data sets, and the rules change every time.
My process is:
select a random set of sample records to check my rule set - large enough to be representative, small enough to be able to scan visually.
create a "match" table with "original", "match" and "confidence score" columns.
write the rules, as "insert" or "update" statements to create records in the "match" table
run the rules on my sample data set
evaluate the matches on the samples. Tweak, add, configure the rules.
rinse & repeat
The "rules" depend hugely on the data set. Commonly I use the following:
strip out punctuation
apply common substitutions (e.g. "Corp" becomes "Corporation")
split into separate words; apply fraction of each exact match out of 10 (so "Charles Schwab" matching "Charles Schwab Corporeation" would be 2/3 = 7 points, "HSBC" matching "HSBC" is 1/1 = 10 points
split into separate words; apply fraction of each close match out of 5 (so "Chls Schwab" matching "Charles Schwab Corporation" would be 2/3 = 3 points, "HSBC" matching "HSCB" is 1/1 = 5 points)

Link numbers with an equation/algorithm

I am making an anagram solver in Visual Basic that gives you every possible combination when you enter a string. I need to work out how many combinations there are depending on the amount of characters in the string and how many different characters there are.
E.G.
Sample string:
abc
Total characters: 3, Different Characters: 3
Possible combinations: 6
abc, acb, bac, bca, cab, cba
I need an equation (using the number of characters and different characters) to link this to a string that contains a different amount of characters.
I've been using trial and error to try and figure is out, but I can't quite get my head around it. So far I have:
((letters - 1) ^ (different letters - 1)) + (letters - 1)
which works for a few different letter counts but now for all.
Help please???
I'll lead you to the answer, but I'll try to explain along the way. Let's say you had 10 different letters. You'd have 10 choices for the first, 9 for the second, 8 for the third, etc. Ultimately, there would be 10*9*8*7*6...*2*1 = 10! possibilities. However, sometimes you'll have multiple instances of the same letter. For example, using that for the string "aaabcd" would overcount possibilities, because it counts each of the a's as distinct letters, even though they're not. To correct for that, you would have to divide by the factorial of the number of repeated letters. A good way to calculate the total number of possibilities would be (total number of letters factorial)/ (product of the factorials of the number of repeated instances of each letter).
For example:
There are 6!/(3!) ways to arrange the letters in "aaabcd"
There are 6! ways to arrange the letters is "abcdef"
There are 6!/(3!*2!) ways to arrange the letters in "aaabbc"
There are 10!/(5!*3!*2!) ways to arrange the letters in "aaaaabbbcc"
I hope this helps.
For the possible counting number, it's exactly the same as computing Multinomial Coefficient
A simple explanation is that, for no repeating characters,
It's simply permutation = n!
(It is easy to understand if you draw a tree diagram, with first character has n choices, second character has n-1choices...etc.)
However as you may have repeating characters, you will double count many of them.
Let's see an simple example: for aaa, how many possible arrangements IF WE COUNT EVEN THE OUTCOME IS THE SAME?
Answer is 3!(aaa,aaa,aaa,aaa,aaa,aaa)
This gives us an idea that, when we have a character appearing for m times, we will count m! instead of 1
So the counting is just n!(all possible arrangements, including same outcome) / m! (a character appear for m times)
Same for more characters repeating: n!/a!b!c!.. (first character appear a times, another appear for b times...)
If you understand the concept behind, then you will find that, actually for those "non-repeating" characters, it's just dividing an 1!. For eg, character (multi)set = {a,a,a,b,b,c}, #a = 3, #b = 2, #c = 1, so the answer (without repeating count) is (3+2+1)!/3!2!1! and fraction of this format is named multinomial coefficient as stated above.
In programming point of view, you can just pre-compute all factorials (with a pretty small n though as n~30 is already too large for a variable to store) with simple for loop
declare frac = array(n);
frac[0] = 1;
FOR i=1; i<=n;i++
frac[i] = i*frac[i-1]
For a larger n, you may just calculate double/float division on the fly in the loop to avoid overflow..you may face precision problem though.
If you further need to output the different strings, you may use DFS to backtrack all the possible outcomes. Or if you could use another language like C++, you can use built-in function like next_permutation() after sort the character set.