Fuzzy string matching optimization (not checking certain words) - Excel VBA function

I have a function in Excel that calculates the Levenshtein Distance between two strings (the number of insertions, deletions, and/or substitutions needed to transform one string into another). I am using this as part of a project I'm working on that involves "fuzzy string matching."
Below you will see the code for the LevenshteinDistance function and a valuePhrase function. The latter exists for the purposes of executing the function in my spreadsheet. I have taken this from what I read in this thread.
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, cost As Long 'loop counters and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
L1 = Len(S1): L2 = Len(S2)
ReDim D(0 To L1, 0 To L2)
For i = 0 To L1: D(i, 0) = i: Next i
For j = 0 To L2: D(0, j) = j: Next j
For j = 1 To L2
For i = 1 To L1
cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
cI = D(i - 1, j) + 1
cD = D(i, j - 1) + 1
cS = D(i - 1, j - 1) + cost
If cI <= cD Then 'Insertion or Substitution
If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
Else 'Deletion or Substitution
If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
End If
Next i
Next j
LevenshteinDistance = D(L1, L2)
End Function
Public Function valuePhrase#(ByRef S1$, ByRef S2$)
valuePhrase = LevenshteinDistance(S1, S2)
End Function
I am executing this valuePhrase function in a table in one of my sheets where the column and row headers are names of insurance companies. Ideally, the smallest number in any given row (the shortest Levenshtein distance) should correspond to a column header with the name of the insurance company in the table that most closely matches the name of that insurance company in the row header.
My problem is that the code above calculates the plain Levenshtein distance and is not tailored to insurance company names. For example, the distance between two company names can be quite small if they both contain the words "insurance" and "company" (which, as you might expect, is common), even if the words unique to each name are totally different. So I may want the function to ignore those words when comparing two strings.
I am new to VBA. Is there a way I can implement this fix in the code? As a secondary question, are there other unique issues that could arise from comparing the names of insurance companies? Thank you for the help!

Your whole question can be replaced by "How do I use the Replace function in VBA?". The algorithm in the question looked interesting, so I have done this for you. Add any words you want ignored to the Array() in the function and it will work. Both the input and the listed words are lower-cased before the replacement, so matching is case-insensitive and the returned string is lower case:
Public Function removeSpecificWords(s As String) As String
Dim arr As Variant
Dim cnt As Long
arr = Array("insurance", "company", "firma", "firm", "holding")
removeSpecificWords = s
For cnt = LBound(arr) To UBound(arr)
removeSpecificWords = Replace(LCase(removeSpecificWords), LCase(arr(cnt)), vbNullString)
Next cnt
End Function
Public Sub TestMe()
Debug.Print removeSpecificWords("InsHolding")
Debug.Print removeSpecificWords("InsuranceInsHoldingStar")
End Sub
In your case:
S1 = removeSpecificWords(S1)
S2 = removeSpecificWords(S2)
valuePhrase = LevenshteinDistance(S1, S2)
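One caveat with a plain Replace is that it removes the listed text anywhere it occurs, including inside longer words ("firm" inside "confirm"). A minimal sketch of a whole-word variant follows. The name RemoveWholeWords and the stop-word list are illustrative assumptions, and it assumes words are separated by single spaces with no attached punctuation:

```vba
Public Function RemoveWholeWords(ByVal s As String) As String
    'Sketch: drops stop words only when they appear as whole,
    'space-delimited words. The stop-word list is illustrative.
    Dim stopWords As Variant
    Dim parts() As String
    Dim i As Long, j As Long
    Dim keep As Boolean
    Dim result As String

    stopWords = Array("insurance", "company", "firm", "holding")
    parts = Split(LCase$(s), " ")
    For i = LBound(parts) To UBound(parts)
        keep = (Len(parts(i)) > 0)            'skip empty pieces from double spaces
        For j = LBound(stopWords) To UBound(stopWords)
            If parts(i) = stopWords(j) Then keep = False
        Next j
        If keep Then result = result & " " & parts(i)
    Next i
    RemoveWholeWords = Mid$(result, 2)        'drop the leading space
End Function
```

With this version, "Confirm Holding" becomes "confirm" rather than "con".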

When I had a similar issue in trying to remove duplicate addresses, I approached the problem the other way and used the Longest Common Subsequence (LCS), which, unlike a common substring, does not have to be contiguous.
Function DetermineLCS(source As String, target As String) As Double
Dim results() As Long
Dim sourceLen As Long
Dim targetLen As Long
Dim counter1 As Long
Dim counter2 As Long
sourceLen = Len(source)
targetLen = Len(target)
ReDim results(0 To sourceLen, 0 To targetLen)
For counter1 = 1 To sourceLen
For counter2 = 1 To targetLen
If Mid$(source, counter1, 1) = Mid$(target, counter2, 1) Then
results(counter1, counter2) = results(counter1 - 1, counter2 - 1) + 1
Else
results(counter1, counter2) = WorksheetFunction.Max(results(counter1, _
counter2 - 1), results(counter1 - 1, counter2))
End If
Next counter2
Next counter1
'return the percentage of the LCS to the length of the source string
DetermineLCS = results(sourceLen, targetLen) / sourceLen
End Function
For addresses, I've found that a threshold of about 80% catches nearly all true matches. With insurance agency names (and I used to work in the industry, so I know the problem you face), I might suggest a 90% target, or even a mix of the Levenshtein distance and LCS, minimizing the former while maximizing the latter.
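One possible way to mix the two measures is to turn the Levenshtein distance into a similarity between 0 and 1 and average it with the LCS ratio. This is only a sketch: the function name CombinedScore and the equal weighting are assumptions, not something either answer prescribes. It takes the already-computed distance and ratio as inputs so that it stands on its own:

```vba
Public Function CombinedScore(ByVal levDistance As Long, ByVal lcsRatio As Double, _
                              ByVal len1 As Long, ByVal len2 As Long) As Double
    'Sketch: average of a normalized Levenshtein similarity and an LCS ratio.
    'len1 and len2 are the lengths of the two compared strings.
    'Equal weights are an assumption; tune them for your data.
    Dim maxLen As Long
    Dim levSimilarity As Double

    If len1 > len2 Then maxLen = len1 Else maxLen = len2
    If maxLen = 0 Then
        CombinedScore = 1                     'two empty strings are identical
        Exit Function
    End If
    'The distance never exceeds the longer length, so this lies in [0, 1]
    levSimilarity = 1 - levDistance / maxLen
    CombinedScore = (levSimilarity + lcsRatio) / 2
End Function
```

A higher score then means a closer match, so the best candidate in a row is the maximum rather than the minimum.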

Related

Split text into 80 character lines, issue with last line

I'm trying to take a body of text and add line breaks at around 80 characters per line. The issue I'm having is that the last line gets an extra, unwanted line break. For instance, this string should not have a line break on the second-to-last line:
Alice was beginning to get very tired of sitting by her sister on the bank, and
of having nothing to do: once or twice she had peeped into the book her sister
was reading, but it had no pictures or conversations in it, and what is the use
of a book, thought Alice without pictures or
conversations?
should look like this (note "conversations" has been moved up):
Alice was beginning to get very tired of sitting by her sister on the bank, and
of having nothing to do: once or twice she had peeped into the book her sister
was reading, but it had no pictures or conversations in it, and what is the use
of a book, thought Alice without pictures or conversations?
Here's the code:
Sub StringChop()
Dim OrigString As String
Dim NewString As String
Dim counter As Long
Dim length As Long
Dim LastSpace As Long
Dim LineBreak As Long
Dim TempString As String
Dim TempNum As Long
OrigString = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book, thought Alice without pictures or conversations?"
length = Len(OrigString)
counter = 1
Do While counter < length
'Extract next 80 characters from last position
TempString = Mid(OrigString, counter, 80)
'Determine last space in string
LastSpace = InStrRev(TempString, " ")
'Determine first line break in string
LineBreak = InStr(TempString, vbNewLine)
'If line break exists in sentence...
'only count characters up to line break, and set counter to that amount
Select Case LastSpace 'What to do if there are spaces in sentence
Case Is > 0 'There are spaces in sentence
Select Case LineBreak 'What to do if there are line breaks in sentence
Case Is = 0
'From last counter position,
NewString = NewString & Mid(OrigString, counter, LastSpace) & vbNewLine
counter = counter + LastSpace
Case Is <> 0
NewString = NewString & Mid(OrigString, counter, LineBreak)
counter = counter + LineBreak
End Select
Case Is = 0 'There are no more spaces left in remaining sentence
NewString = NewString & Mid(OrigString, counter)
counter = length
End Select
Loop
Debug.Print NewString
End Sub
Word wrapping is an interesting problem. I wrote the following code once as an experiment. You might find it helpful:
Option Explicit
'Implements a dynamic programming approach to word wrap
'assumes fixed-width font
'a word is defined to be a white-space delimited string which contains no
'whitespace
'the cost of a line is the square of the number of blank spaces at the end
'of a line
Const INFINITY As Long = 1000000
Dim optimalCost As Long
Function Cost(words As Variant, i As Long, j As Long, L As Long) As Long
'words is a 0-based array of strings, assumed to have no white spaces
'i, j are indices in range 0,...,n, where n is UBOUND(words)+1
'L is the maximum length of a line
'Cost returns the cost of a line which begins with words(i) and ends with
'words(j-1). It returns INFINITY if the line is too short to hold the words
'or if j <= i
Dim k As Long
Dim sum As Long
If j <= i Or Len(words(i)) > L Then
Cost = INFINITY
Exit Function
End If
sum = Len(words(i))
k = i + 1
Do While k < j And sum <= L
sum = sum + 1 + Len(words(k)) 'for space
k = k + 1
Loop
If sum > L Then
Cost = INFINITY
Else
Cost = (L - sum) ^ 2
End If
End Function
Function WordWrap(words As Variant, L As Long) As String
'returns string consisting of words with spaces and
'line breaks inserted at the appropriate places
Dim v() As Long, d() As Long
Dim n As Long
Dim i As Long, j As Long
Dim candidate As Long
n = UBound(words) + 1
ReDim v(0 To n)
ReDim d(0 To n)
v(0) = 0
d(0) = -1
For j = 1 To n
v(j) = INFINITY 'until something better is found
i = j - 1
Do
candidate = v(i) + Cost(words, i, j, L)
If candidate < v(j) Then
v(j) = candidate
d(j) = i
End If
i = i - 1
Loop While i >= 0 And candidate < INFINITY
If v(j) = INFINITY Then
MsgBox "Some words are too long for the given length"
Exit Function
End If
Next j
optimalCost = v(n)
'at this stage, optimal path has been found
'just need to follow d() backwards, inserting line breaks
i = d(n) 'beginning of current line
WordWrap = words(n - 1)
j = n - 2
Do While i >= 0
Do While j >= i
WordWrap = words(j) & " " & WordWrap
j = j - 1
Loop
If i > 0 Then WordWrap = vbCrLf & WordWrap
i = d(i)
Loop
End Function
The above function expects an array of words. You would have to split a string before using it as input:
Sub test()
Dim OrigString As String
OrigString = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book, thought Alice without pictures or conversations?"
Debug.Print WordWrap(Split(OrigString), 80)
End Sub
Output:
Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book
her sister was reading, but it had no pictures or conversations in it, and
what is the use of a book, thought Alice without pictures or conversations?
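If balanced line lengths are not needed, a simpler greedy approach also avoids the stray break on the last line, because a break is only emitted when the next word does not fit. The following is a minimal sketch; the name GreedyWrap is an assumption, and it assumes words are separated by single spaces. A word longer than the limit is placed on a line of its own rather than split:

```vba
Public Function GreedyWrap(ByVal source As String, ByVal maxLen As Long) As String
    'Sketch: put as many words on each line as fit within maxLen characters
    Dim words() As String
    Dim currentLine As String
    Dim result As String
    Dim k As Long

    If Len(source) = 0 Then Exit Function
    words = Split(source, " ")
    currentLine = words(0)
    For k = 1 To UBound(words)
        If Len(currentLine) + 1 + Len(words(k)) <= maxLen Then
            currentLine = currentLine & " " & words(k)   'word still fits
        Else
            result = result & currentLine & vbNewLine    'close the line
            currentLine = words(k)
        End If
    Next k
    GreedyWrap = result & currentLine                    'last line, no trailing break
End Function
```

It takes the string directly, for example Debug.Print GreedyWrap(OrigString, 80).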

Cells containing number that is equal or greater than

What I am currently trying to do is find and highlight cells that contain both a certain phrase and (among some other text) a number that is equal to or greater than 20 (including numbers with decimals like 25.8332). I tried using FormatConditions, but I wasn't able to make it consider two simultaneous conditions (a phrase and a number). So I decided to use a combination of If and InStr, but I wonder how to test for a number that is equal to or greater than 20?
Select the cells you wish to process and run:
Sub ColorMeYellow()
Dim r As Range, s As String, n As Double
Dim happy As String, CH As String, temp As String
Dim L As Long, i As Long
happy = "happy"
For Each r In Selection
s = r.Value
If InStr(1, s, happy) > 0 Then
L = Len(s)
temp = ""
For i = 1 To L
CH = Mid(s, i, 1)
If CH Like "[0-9]" Or CH = "." Then
temp = temp & CH
End If
Next i
If IsNumeric(temp) Then
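If CDbl(temp) >= 20 Then
r.Interior.ColorIndex = 6
End If
End If
End If
Next r
End Sub
It will look for cells containing both "happy" and a number equal to or greater than 20. Note that all digits and decimal points in the cell are concatenated before the comparison, so a cell should contain only one number.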
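If a cell can hold several separate numbers, each one can be tested on its own with a regular expression. This is a sketch under a few assumptions: the name HasNumberAtLeast is made up for illustration, it relies on the late-bound VBScript.RegExp object (available on Windows), and it only recognizes unsigned numbers that use a period as the decimal separator:

```vba
Public Function HasNumberAtLeast(ByVal s As String, ByVal threshold As Double) As Boolean
    'Sketch: True if any individual number found in s is >= threshold
    Dim re As Object
    Dim m As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True                          'find every number, not just the first
    re.Pattern = "\d+(\.\d+)?"                'digits with an optional decimal part
    For Each m In re.Execute(s)
        'Val always treats "." as the decimal separator
        If Val(m.Value) >= threshold Then
            HasNumberAtLeast = True
            Exit Function
        End If
    Next m
End Function
```

Inside the loop over the selection, the test could then read: If InStr(1, s, happy) > 0 And HasNumberAtLeast(s, 20) Then.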

Lee-Ready tick test using VBA

I am trying to build a Lee-Ready tick test for estimating trade direction from tick data using Excel. I have a dataset containing the trade prices in descending order, and I am trying to write VBA code that can loop over all the 4m+ cells in as efficient a manner as possible.
The rule for estimating trade direction goes as follows:
If Pt>Pt-1, then d=1
If Pt<Pt-1, then d=-1
If Pt=Pt-1, then d is the last value taken by d.
So to give a concrete example, I would like to transform this:
P1;P2;P3;P4
1.01;2.02;3.03;4.04
1.00;2.03;3.03;4.02
1.01;2.02;3.01;4.04
1.00;2.03;3.00;4.04
into this
d1;d2;d3;d4
1;-1;1;1
-1;1;1;-1
1;-1;1;0
0;0;0;0
Fairly straightforward nested loops suffice:
Function LeeReady(Prices As Variant) As Variant
'Given a range or 1-based, 2-dimensional variant array,
'returns an array of the same size
'of trade directions computed according to
'the Lee-Ready rule
Dim i As Long, j As Long
Dim m As Long, n As Long
Dim priceData As Variant, directions As Variant
Dim current As Variant, previous As Variant
If TypeName(Prices) = "Range" Then
priceData = Prices.Value
Else
priceData = Prices
End If
m = UBound(priceData, 1)
n = UBound(priceData, 2)
ReDim directions(1 To m, 1 To n) As Long 'implicitly fills in bottom row with 0s
For i = m - 1 To 1 Step -1
For j = 1 To n
current = priceData(i, j)
previous = priceData(i + 1, j)
If current > previous Then
directions(i, j) = 1
ElseIf current < previous And previous > 0 Then
directions(i, j) = -1
Else
directions(i, j) = directions(i + 1, j)
End If
Next j
Next i
LeeReady = directions
End Function
This can be called from a sub or used directly on the worksheet:
Here I just highlighted a block of cells of the correct size to hold the output and then used the formula =LeeReady(A2:D5) (pressing Ctrl+Shift+Enter to accept it as an array formula).
On Edit: I modified the code slightly (by adding the clause And previous > 0 to the If statement in the main loop) so that it can now handle ranges in which some of the columns have more rows than other columns. The code assumes that price data is always > 0 and fills in the return array with 0s as placeholders in the columns that end earlier than other columns.
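The three-part rule from the question can also be isolated into a small helper, which makes it easy to test on its own. This is only a sketch of the plain rule; the name TickDirection is an assumption, and it does not include the blank-cell handling described in the edit above:

```vba
Public Function TickDirection(ByVal current As Double, ByVal previous As Double, _
                              ByVal lastDirection As Long) As Long
    'Sketch of the tick rule:
    '  price up   ->  1
    '  price down -> -1
    '  unchanged  -> carry forward the last direction
    If current > previous Then
        TickDirection = 1
    ElseIf current < previous Then
        TickDirection = -1
    Else
        TickDirection = lastDirection
    End If
End Function
```

For the first column of the example, TickDirection(1.01, 1.00, 0) gives 1 and TickDirection(1.00, 1.01, 1) gives -1.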

Custom sort routine for unique string A being placed after another string B, C, D, etc. if string A is found within them

Situation
I have a UDF that works with a range that is passed to it, of variable height and 2 columns wide. The first row will contain text in column 1 and an empty column 2. The remainder of column 1 will contain unsorted text with an associated value in the same row in column 2. I need to sort the data such that if some text in column 1 also appears within some other text in column 1, it is placed below that other text.
Problem
My VBA skills are all self-taught and minimal at best. I remember a few decades ago in university we did bubble sorts and played with pointers, but I no longer remember how we achieved any of that. I do well reading code, but creating it is another story.
Objective
I need to generate a sort procedure that will place unique text towards the bottom of the list. I'll try wording this another way: if text in column 1 can be found within other text in column 1, then the original text needs to be placed below the other text it can be found in, along with its associated data in column 2. The text is case sensitive. It's not an ascending or descending sort.
I am not sure if it's a restriction of the UDF or not, but the list does not need to be written back to Excel; it just needs to be available for use in my UDF.
What I have
Public Function myFunk(rng As Range) As Variant
Dim x As Integer
Dim Datarange As Variant
Dim Equation As String
Dim VariablesLength As Integer
Dim Variable As String
Datarange = rng.Value
'insert something around here to get the list "rng or Datarange" sorted
'maybe up or down a line of code depending on how its being done.
Equation = Datarange(1, 1)
For x = 2 To UBound(Datarange, 1)
VariablesLength = Len(Datarange(x, 1)) - 1
Variable = Left$(Datarange(x, 1), VariablesLength)
Equation = Replace$(Equation, Variable, Datarange(x, 2))
Next x
myFunk = rng.Worksheet.Evaluate(Equation)
End Function
Example Data
Any help with this would be much appreciated. In that last example I should point out that the "=" is not part of the sort. I have a routine that strips that off the end of the string.
So in order to achieve what I was looking for I added a SWAP procedure and changed my code to look like this.
Public Function MyFunk(rng As Range) As Variant
Dim x As Integer
Dim y As Integer
Dim z As Integer
Dim datarange As Variant
Dim Equation As String
Dim VariablesLength As Integer
Dim Variable As String
'convert the selected range into an array
datarange = rng.Value
'verify selected range is of right shape/size
If UBound(datarange, 1) < 3 Or UBound(datarange, 2) <> 2 Then
MyFunk = CVErr(xlErrNA)
Exit Function
End If
'strip the equal sign off the end if its there
For x = 2 To UBound(datarange, 1)
If Right$(datarange(x, 1), 1) = "=" Then
datarange(x, 1) = Left$(datarange(x, 1), Len(datarange(x, 1)) - 1)
End If
Next x
'sort the array so that a variable does not get substituted into another variable
'do a top down swap and repeat? Could have sorted by length apparently.
For x = 2 To UBound(datarange, 1) - 1
For y = x + 1 To UBound(datarange, 1)
If InStr(1, datarange(y, 1), datarange(x, 1)) <> 0 Then
For z = LBound(datarange, 2) To UBound(datarange, 2)
Call swap(datarange(y, z), datarange(x, z))
Next z
y = UBound(datarange, 1)
x = x - 1
End If
Next y
Next x
'Set the Equation
Equation = datarange(1, 1)
'Replace the variables in the equation with values
For x = 2 To UBound(datarange, 1)
Equation = Replace$(Equation, datarange(x, 1), datarange(x, 2))
Next x
'rest of function here
End Function
Public Sub swap(A As Variant, B As Variant)
Dim Temp As Variant
Temp = A
A = B
B = Temp
End Sub
I sorted by checking to see if text would substitute into other text in the list. Byron Wall made a good point that I could have sorted based on text length. Since I had completed this before I saw the suggestion it did not get implemented though I think it may have been a simpler approach.
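For reference, the length-based idea might look like the sketch below. It relies on the fact that a string can only occur inside a strictly longer string (or an identical one), so ordering the rows longest first guarantees that no variable is substituted before a longer variable that contains it. The name SortRowsByLengthDesc is an assumption, and the array is assumed to be the 1-based, 2-D array returned by Range.Value:

```vba
Public Sub SortRowsByLengthDesc(ByRef data As Variant, ByVal firstRow As Long)
    'Sketch: insertion sort on rows firstRow..UBound(data, 1),
    'ordered by the text length in column 1, longest first.
    'Whole rows are swapped so column 2 stays with its text.
    Dim r As Long, k As Long, c As Long
    Dim tmp As Variant

    For r = firstRow + 1 To UBound(data, 1)
        k = r
        Do While k > firstRow
            If Len(data(k, 1)) <= Len(data(k - 1, 1)) Then Exit Do
            For c = LBound(data, 2) To UBound(data, 2)
                tmp = data(k, c)
                data(k, c) = data(k - 1, c)
                data(k - 1, c) = tmp
            Next c
            k = k - 1
        Loop
    Next r
End Sub
```

In the function above it would replace the nested swap loops with a single call such as SortRowsByLengthDesc datarange, 2, leaving the equation in row 1 untouched.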

VBA - Run WorksheetFunction on [Range derived] Variant array?

I need to run successive passes of built-in Excel functions on a single matrix of input.
The problem is that the input [range] is, I assume, a constant pointer.
So sure, I can run WorksheetFunction calculations on the [range] input and place the output into a Variant.
But I need to run more passes on the Variant data. I have a more advanced calculation that is going to run 4 transforms on the data using standard Excel functions like AVERAGE and MEDIAN.
Here's my code
Public Function RankECDF(ByRef r_values As Range, Optional ByVal zeroFlag As Boolean = 0) As Variant()
Dim i As Integer, j As Integer, N As Integer, M As Integer
Dim total As Integer
Dim y() As Variant
N = r_values.Rows.Count
M = r_values.Columns.Count
y = r_values.Value 'copy values from sheet into an array
Dim V() As Variant
Dim AltV As Variant
Dim OutV As Variant
Dim OutAltV As Variant
'quite possible to makes the Variant larger to hold the "other arrays"
ReDim V(1 To N, 1 To M)
ReDim AltV(1 To N, 1 To M)
ReDim OutV(1 To N, 1 To M)
ReDim OutAltV(1 To N, 1 To M)
'first pass just checks for zero's. Could speed this process up by implementing the zeroFlag check to skip the double loop
total = WorksheetFunction.Sum(r_values)
For R = 1 To N
For C = 1 To M
If y(R, C) = "" Then
V(R, C) = ""
AltV(R, C) = 0
Else
'would error if cell was ""
'V(R, C) = WorksheetFunction.Average(WorksheetFunction.Rank(y(R, C), r_values, 1), WorksheetFunction.CountIf(r_values, "<=" & y(R, C))) / WorksheetFunction.Count(r_values)
V(R, C) = y(R, C)
AltV(R, C) = y(R, C)
End If
Next C
Next R
'second loop does rankecdf conversions
For RA = 1 To N
For CA = 1 To M
'OutV(RA, CA) = 1
'OutV(RA, CA) = WorksheetFunction.Rank(V(RA, CA), V, 1)
'OutAltV(RA, CA) = 2
'OutAltV(RA, CA) = WorksheetFunction.Average(WorksheetFunction.Rank(y(RA, CA), r_values, 1), WorksheetFunction.CountIf(r_values, "<=" & y(RA, CA))) / WorksheetFunction.Count(r_values)
Next CA
Next RA
If (zeroFlag) Then
RankECDF = AltV
'RankECDF = OutAltV(1 to N, 1 to M)
Else
RankECDF = V
'RankECDF = OutV(N, M)
End If
End Function
The problem can be identified right around here:
OutV(RA, CA) = WorksheetFunction.Rank(V(RA, CA), V, 1)
WorksheetFunction.Rank(y(R, C), r_values, 1)
You cannot put an Array on arg1. Just do:
i = y(R, C)
Then:
WorksheetFunction.Rank(i, r_values, 1)
It worked fine for me
Updated from comments as I see the answer I initially posited misread the problem:
As a general rule, arrays and performing calculations purely in memory are faster than you might think. For one example I used to use the Application.Match function to find the index position of a value in an array, rather than simple brute force iteration. Turns out that iteration was a far faster (up to 10x faster!!!) method. Check out Tim's answer to my question about Matching values in a string array.
I suspect it is the same with rank/sorting. Worksheet functions are expensive. For/Next is not, relatively speaking.
As for the specific need to rank from an array, there are examples of custom functions which rank and sort arrays, collections, dictionaries, etc. I ultimately end up using a bunch of Chip Pearson's Array helper functions. He has a number of them, which do really cool sh!t like reversing an array, sorting an array, determining whether an array is allocated (I use this one a lot) or empty, or all numeric, etc. There are about 30 of them.
Here is the code to sort an array.
Note: I did not post his code because there is a lot of it. While it appears daunting, because it is a lot of code to re-invent the wheel, it does work, saves a lot of trouble, and is very useful. I don't even use these in Excel, since I do most of my dev in PowerPoint now. I think all of these modules ported over with zero or almost zero debugging on my end. They're really buttoned up quite nicely.
Getting the rank
Once the array is "sorted" then determining the rank of any value within it is trivial and only requires some tweaking since you may want to handle ties appropriately. One common way of dealing with ties is to "skip" the next value, so if there is a two-way tie for 2nd place, the rank would go {1, 2, 2, 4, 5, 6, etc.}
Function GetRank(arr As Variant, val As Variant)
'Assumes arr is already sorted ascending and is a one-dimensional array
Dim rank As Long, i As Long
Dim dictRank As Object
Set dictRank = CreateObject("Scripting.Dictionary")
rank = 0
For i = LBound(arr) To UBound(arr)
rank = rank + 1
If dictRank.Exists(arr(i)) Then
'Do nothing, handles ties
Else
'store the Key as your value, and the Value as the rank position:
dictRank(arr(i)) = rank
End If
If arr(i) = val Then Exit For
Next
GetRank = rank
End Function
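If sorting is not otherwise needed, the same "skip" style of ascending rank can also be computed directly from the unsorted 2-D array by counting how many values are smaller than the target. This is a sketch only; the name RankInArray is an assumption, and it assumes the array came from Range.Value, skipping blank cells, text, and error values:

```vba
Public Function RankInArray(ByVal target As Double, values As Variant) As Long
    'Sketch: ascending rank of target = 1 + count of numeric entries below it.
    'Ties share a rank and the following rank is skipped, e.g. {1, 2, 2, 4}.
    Dim r As Long, c As Long
    Dim smaller As Long

    For r = LBound(values, 1) To UBound(values, 1)
        For c = LBound(values, 2) To UBound(values, 2)
            'IsNumeric is True for Empty, so blanks are excluded explicitly
            If IsNumeric(values(r, c)) And Not IsEmpty(values(r, c)) Then
                If values(r, c) < target Then smaller = smaller + 1
            End If
        Next c
    Next r
    RankInArray = smaller + 1
End Function
```

Because it works on the in-memory array, it can be called repeatedly on intermediate results without writing anything back to the sheet.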