How to compare Strings for Percentage Match using vb.net? - vb.net

I am banging my head against the wall for a while now trying different techniques.
None of them are working well.
I have two strings.
I need to compare them and get an exact percentage of match,
ie. "four score and seven years ago" TO "for scor and sevn yeres ago"
Well, I first started by comparing every word to every word, tracking every hit, and percentage = count \ numOfWords. Nope, didn't take into account misspelled words.
("four" <> "for" even though it is close)
Then I started by trying to compare every char in each char, incrementing the string char if not a match (to count for misspellings). But, I would get false hits because the first string could have every char in the second but not in the exact order of the second. ("stuff avail" <> "stu vail" (but it would come back as such, low percentage, but a hit. 9 \ 11 = 81%))
SO, I then tried comparing PAIRS of chars in each string. If string1[i] = string2[k] AND string1[i+1] = string2[k+1], increment the count, and increment the "k" when it doesn't match (to track mispellings. "for" and "four" should come back with a 75% hit.) That doesn't seem to work either. It is getting closer, but even with an exact match it is only returns 94%. And then it really gets screwed up when something is really misspelled. (Code at the bottom)
Any ideas or directions to go?
Code
count = 0
j = 0
k = 0
While j < strTempName.Length - 2 And k < strTempFile.Length - 2
' To ignore non letters or digits '
If Not strTempName(j).IsLetter(strTempName(j)) Then
j += 1
End If
' To ignore non letters or digits '
If Not strTempFile(k).IsLetter(strTempFile(k)) Then
k += 1
End If
' compare pair of chars '
While (strTempName(j) <> strTempFile(k) And _
strTempName(j + 1) <> strTempFile(k + 1) And _
k < strTempFile.Length - 2)
k += 1
End While
count += 1
j += 1
k += 1
End While
perc = count / (strTempName.Length - 1)

Edit: I have been doing some research and I think I initially found the code from here and translated it to vbnet years ago. It uses the Levenshtein string matching algorithm.
Here is the code I use for that, hope it helps:
Sub Main()
Dim string1 As String = "four score and seven years ago"
Dim string2 As String = "for scor and sevn yeres ago"
Dim similarity As Single =
GetSimilarity(string1, string2)
' RESULT : 0.8
End Sub
Public Function GetSimilarity(string1 As String, string2 As String) As Single
Dim dis As Single = ComputeDistance(string1, string2)
Dim maxLen As Single = string1.Length
If maxLen < string2.Length Then
maxLen = string2.Length
End If
If maxLen = 0.0F Then
Return 1.0F
Else
Return 1.0F - dis / maxLen
End If
End Function
Private Function ComputeDistance(s As String, t As String) As Integer
Dim n As Integer = s.Length
Dim m As Integer = t.Length
Dim distance As Integer(,) = New Integer(n, m) {}
' matrix
Dim cost As Integer = 0
If n = 0 Then
Return m
End If
If m = 0 Then
Return n
End If
'init1
Dim i As Integer = 0
While i <= n
distance(i, 0) = System.Math.Max(System.Threading.Interlocked.Increment(i), i - 1)
End While
Dim j As Integer = 0
While j <= m
distance(0, j) = System.Math.Max(System.Threading.Interlocked.Increment(j), j - 1)
End While
'find min distance
For i = 1 To n
For j = 1 To m
cost = (If(t.Substring(j - 1, 1) = s.Substring(i - 1, 1), 0, 1))
distance(i, j) = Math.Min(distance(i - 1, j) + 1, Math.Min(distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost))
Next
Next
Return distance(n, m)
End Function

Did not work for me unless one (or both) of following are done:
1) use option compare statement "Option Compare Text" before any Import declarations and before Class definition (i.e. the very, very first line)
2) convert both strings to lowercase using .tolower

Xavier's code must be correct to:
While i <= n
distance(i, 0) = System.Math.Min(System.Threading.Interlocked.Increment(i), i - 1)
End While
Dim j As Integer = 0
While j <= m
distance(0, j) = System.Math.Min(System.Threading.Interlocked.Increment(j), j - 1)
End While

Related

VBA Function where double loop only provides one output

I am new with VBA and trying to do a trinomial option pricing tree with parameters from -n to n defined. How can I rewrite my VBA function with a double loop to provide all output as it seems to only show the value of S. In the end I want to sum all values and discount them but I cannot get past my double loop providing only one output. If I put the MsgBox between k and i loop it returns nothing. Please can someone assist getting this to work as I don't have the hang of VBA yet.
Public Function LoopTest(S As Double, u As Double, n As Integer)
Dim i As Integer
Dim j As Integer
Dim k As Integer
Dim St() As Double
ReDim St(2 * n + 1)
For k = 0 To n Step 1
For i = -k To k Step 1
St(i, k) = S * u ^ i
Debug.Print St(i, k)
MsgBox ("DebugToPrint")
' Payoff(i,k) = Application.WorksheetFunction.Max((St(i,k) - 20), 0)
Next i
Next k
' For i from n to 2 step -1
' For j from 1 to n - 2
' Payoff(i - 1, j) = Payoff(i, j) + Payoff(i, j + 1) + Payoff(i, j + 2)
' Next i
' Next k
LoopTest = St()
End Function
How do I get output
k=0 St = S*u^0
k=1 St = S*u^-1 , St = S*u^0 , St = S*u^1
k=2 St = S*u^-2 , St = S*u^-1 , St = S*u^0 , St = S*u^1 , St = S*u^2
etc and then sum everything afterwards as output?
I'm not totally following the example but from the description it sounds like you're interested in a recursive process. It can be tricky and will leave you in an endless loop if you're not careful. Does this previous post help you? VBA recursive "For loops" Permutation?

Excel VBA XIRR not working as expected

I am working a code, and I have a problem with Excel's XIRR function.
You have a matrix with 2 columns (dates and amounts), so the inputs are the matrix, a date, and a quantity. Inside the code it takes the values below the date you used as input, makes a new array with those values, and add also the date and amount you entered as inputs. And the output should be the XIRR of that array. It doesn´t seem to work. It works with IRR, with dates are an important input. Does someone know how to fix this problem? Thanks in advance!
Function Retorno(matriz As Range, dia As Date, valuacion As Double) As Double
Dim Datos As Range
Dim Aux1()
Dim Aux2()
Dim i, j, m, n As Integer
Set Datos = matriz
j = 0
For i = 1 To Datos.Rows.Count
If Datos(i, 1) <= dia Then
j = j + 1
End If
Next i
ReDim Aux1(1 To j + 1)
ReDim Aux2(1 To j + 1)
For n = 1 To j + 1
Aux1(n) = Datos(n, 2)
Next n
Aux1(j + 1) = valuacion
For m = 1 To j + 1
Aux2(m) = Datos(m, 1)
Next m
Aux2(j + 1) = dia
Retorno = WorksheetFunction.Xirr(Aux1, Aux2)
End Function
Your last Aux2(j + 1) = dia is overwriting the second date in the array with the first date, giving you two identical dates in the date array.
Possibly you want to delete that line.
The other possible answer to this problem is to convert the date to numbers if you do this: Aux2(m) = Datos(m, 1)*1 XIRR will work too.

Label a set of objects with (A->Z,AA->ZZ, AAA->ZZZ) in VBA

I have a set which has an unknown number of objects. I want to associate a label to each one of these objects. Instead of labeling each object with a number I want to label them with letters.
For example the first object would be labeled A the second B and so on.
When I get to Z, the next object would be labeled AA
AZ? then BA, BB, BC.
ZZ? then AAA, AAB, AAC and so on.
I'm working using Mapbasic (similar to VBA), but I can't seem to wrap my head around a dynamic solution. My solution assumes that there will be a max number of objects that the set may or may not exceed.
label = pos1 & pos2
Once pos2 reaches ASCII "Z" then pos1 will be "A" and pos2 will be "A". However, if there is another object after "ZZ" this will fail.
How do I overcome this static solution?
Basically what I needed was a Base 26 Counter. The function takes a parameter like "A" or "AAA" and determines the next letter in the sequence.
Function IncrementAlpha(ByVal alpha As String) As String
Dim N As Integer
Dim num As Integer
Dim str As String
Do While Len(alpha)
num = num * 26 + (Asc(alpha) - Asc("A") + 1)
alpha = Mid$(alpha, 2,1)
Loop
N = num + 1
Do While N > 0
str = Chr$(Asc("A") + (N - 1) Mod 26) & str
N = (N - 1) \ 26
Loop
IncrementAlpha = str
End Function
If we need to convert numbers to a "letter format" where:
1 = A
26 = Z
27 = AA
702 = ZZ
703 = AAA etc
...and it needs to be in Excel VBA, then we're in luck. Excel's columns are "numbered" the same way!
Function numToLetters(num As Integer) As String
numToLetters = Split(Cells(1, num).Address(, 0), "$")(0)
End Function
Pass this function a number between 1 and 16384 and it will return a string between A and XFD.
Edit:
I guess I misread; you're not using Excel. If you're using VBA you should still be able to do this will the help of an reference to an Excel Object Library.
This should get you going in terms of the logic. Haven't tested it completely, but you should be able to work from here.
Public Function GenerateLabel(ByVal Number As Long) As String
Const TOKENS As String = "ZABCDEFGHIJKLMNOPQRSTUVWXY"
Dim i As Long
Dim j As Long
Dim Prev As String
j = 1
Prev = ""
Do While Number > 0
i = (Number Mod 26) + 1
GenerateLabel = Prev & Mid(TOKENS, i, 1)
Number = Number - 26
If j > 0 Then Prev = Mid(TOKENS, j + 1, 1)
j = j + Abs(Number Mod 26 = 0)
Loop
End Function

Longest common substring large strings?

I need some help with this function. I am trying to find the longest common string between 2 strings. Here is the function that I am currently using:
Public Shared Function LCS(str1 As Char(), str2 As Char())
Dim l As Integer(,) = New Integer(str1.Length - 1, str2.Length - 1) {}
Dim lcs__1 As Integer = -1
Dim substr As String = String.Empty
Dim [end] As Integer = -1
For i As Integer = 0 To str1.Length - 1
For j As Integer = 0 To str2.Length - 1
If str1(i) = str2(j) Then
If i = 0 OrElse j = 0 Then
l(i, j) = 1
Else
l(i, j) = l(i - 1, j - 1) + 1
End If
If l(i, j) > lcs__1 Then
lcs__1 = l(i, j)
[end] = i
End If
Else
l(i, j) = 0
End If
Next
Next
For i As Integer = [end] - lcs__1 + 1 To [end]
substr += str1(i)
Next
Return substr
End Function
This works great on strings of up to around 600 words or so. If I try to compare strings with a larger word count than that it starts to throw system.outofmemoryexception. Obviously, this is hitting the memory pretty hard. Is there any way to fine tune this function or is there possibly another way of doing this that is more streamlined?

Performance difference between two implementations of the same algorithm

I'm working on an application that will require the Levenshtein algorithm to calculate the similarity of two strings.
Along time ago I adapted a C# version (which can be easily found floating around in the internet) to VB.NET and it looks like this:
Public Function Levenshtein1(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d(n, m) As Integer
Dim cost As Integer
Dim s1c As Char
For i = 1 To n
d(i, 0) = i
Next
For j = 1 To m
d(0, j) = j
Next
For i = 1 To n
s1c = s1(i - 1)
For j = 1 To m
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
Then, trying to tweak it and improve its performance, I ended with version:
Public Function Levenshtein2(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d(n, m) As Integer
Dim s1c As Char
Dim cost As Integer
For i = 1 To n
d(i, 0) = i
s1c = s1(i - 1)
For j = 1 To m
d(0, j) = j
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
Basically, I thought that the array of distances d(,) could be initialized inside of the main for cycles, instead of requiring two initial (and additional) cycles. I really thought this would be a huge improvement... unfortunately, not only does not improve over the original, it actually runs slower!
I have already tried to analyze both versions by looking at the generated IL code but I just can't understand it.
So, I was hoping that someone could shed some light on this issue and explain why the second version (even when it has fewer for cycles) runs slower than the original?
NOTE: The time difference is about 0.15 nano seconds. This don't look like much but when you have to check thousands of millions of strings... the difference becomes quite notable.
It's because of this:
For i = 1 To n
d(i, 0) = i
s1c = s1(i - 1)
For j = 1 To m
d(0, j) = j 'THIS LINE HERE
You were just initializing this array at the beginning, but now you are initializing it n times. There is a cost involved with accessing memory in an array like this, and you are doing it an extra n times now. You could change the line to say: If i = 1 Then d(0, j) = j. However, in my tests, you still basically end up with a slightly slower version than the original. And that again makes sense. You're performing this if statement n*m times. Again there is some cost. Moving it out like it is in the original version is a lot cheaper It ends up being O(n). Since the overall algorithm is O(n*m), any step you can move out into an O(n) step is going to be a win.
You can split the following line:
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
as follows:
tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + cost)
It this way you avoid one summation
Further more you can place the last "min" comparison inside the if part and avoid assigning cost:
tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
If s1c = s2(j - 1) Then
d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
Else
d(i, j) = Math.Min(tmp, d(i - 1, j - 1)+1)
End If
So you save a summation when s1c = s2(j - 1)
Not the direct answer to your question, but for faster performance you should consider either using a jagged array (array of arrays) instead of a multidimensional array. What are the differences between a multidimensional array and an array of arrays in C#? and Why are multi-dimensional arrays in .NET slower than normal arrays?
You will see that the jagged array has a code size of 7 as opposed to 10 with multidimensional arrays.
The code below is uses a jagged array, single dimensional array
Public Function Levenshtein3(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d()() As Integer = New Integer(n)() {}
Dim cost As Integer
Dim s1c As Char
For i = 0 To n
d(i) = New Integer(m) {}
Next
For j = 1 To m
d(0)(j) = j
Next
For i = 1 To n
d(i)(0) = i
s1c = s1(i - 1)
For j = 1 To m
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i)(j) = Math.Min(Math.Min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
Next
Next
Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
End Function