Performance difference between two implementations of the same algorithm - vb.net

I'm working on an application that will require the Levenshtein algorithm to calculate the similarity of two strings.
Along time ago I adapted a C# version (which can be easily found floating around in the internet) to VB.NET and it looks like this:
Public Function Levenshtein1(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d(n, m) As Integer
Dim cost As Integer
Dim s1c As Char
For i = 1 To n
d(i, 0) = i
Next
For j = 1 To m
d(0, j) = j
Next
For i = 1 To n
s1c = s1(i - 1)
For j = 1 To m
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
Then, trying to tweak it and improve its performance, I ended with version:
Public Function Levenshtein2(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d(n, m) As Integer
Dim s1c As Char
Dim cost As Integer
For i = 1 To n
d(i, 0) = i
s1c = s1(i - 1)
For j = 1 To m
d(0, j) = j
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
Basically, I thought that the array of distances d(,) could be initialized inside of the main for cycles, instead of requiring two initial (and additional) cycles. I really thought this would be a huge improvement... unfortunately, not only does not improve over the original, it actually runs slower!
I have already tried to analyze both versions by looking at the generated IL code but I just can't understand it.
So, I was hoping that someone could shed some light on this issue and explain why the second version (even when it has fewer for cycles) runs slower than the original?
NOTE: The time difference is about 0.15 nano seconds. This don't look like much but when you have to check thousands of millions of strings... the difference becomes quite notable.

It's because of this:
For i = 1 To n
d(i, 0) = i
s1c = s1(i - 1)
For j = 1 To m
d(0, j) = j 'THIS LINE HERE
You were just initializing this array at the beginning, but now you are initializing it n times. There is a cost involved with accessing memory in an array like this, and you are doing it an extra n times now. You could change the line to say: If i = 1 Then d(0, j) = j. However, in my tests, you still basically end up with a slightly slower version than the original. And that again makes sense. You're performing this if statement n*m times. Again there is some cost. Moving it out like it is in the original version is a lot cheaper It ends up being O(n). Since the overall algorithm is O(n*m), any step you can move out into an O(n) step is going to be a win.

You can split the following line:
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
as follows:
tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + cost)
It this way you avoid one summation
Further more you can place the last "min" comparison inside the if part and avoid assigning cost:
tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
If s1c = s2(j - 1) Then
d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
Else
d(i, j) = Math.Min(tmp, d(i - 1, j - 1)+1)
End If
So you save a summation when s1c = s2(j - 1)

Not the direct answer to your question, but for faster performance you should consider either using a jagged array (array of arrays) instead of a multidimensional array. What are the differences between a multidimensional array and an array of arrays in C#? and Why are multi-dimensional arrays in .NET slower than normal arrays?
You will see that the jagged array has a code size of 7 as opposed to 10 with multidimensional arrays.
The code below is uses a jagged array, single dimensional array
Public Function Levenshtein3(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d()() As Integer = New Integer(n)() {}
Dim cost As Integer
Dim s1c As Char
For i = 0 To n
d(i) = New Integer(m) {}
Next
For j = 1 To m
d(0)(j) = j
Next
For i = 1 To n
d(i)(0) = i
s1c = s1(i - 1)
For j = 1 To m
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i)(j) = Math.Min(Math.Min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
Next
Next
Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
End Function

Related

How to compare Strings for Percentage Match using vb.net?

I am banging my head against the wall for a while now trying different techniques.
None of them are working well.
I have two strings.
I need to compare them and get an exact percentage of match,
ie. "four score and seven years ago" TO "for scor and sevn yeres ago"
Well, I first started by comparing every word to every word, tracking every hit, and percentage = count \ numOfWords. Nope, didn't take into account misspelled words.
("four" <> "for" even though it is close)
Then I started by trying to compare every char in each char, incrementing the string char if not a match (to count for misspellings). But, I would get false hits because the first string could have every char in the second but not in the exact order of the second. ("stuff avail" <> "stu vail" (but it would come back as such, low percentage, but a hit. 9 \ 11 = 81%))
SO, I then tried comparing PAIRS of chars in each string. If string1[i] = string2[k] AND string1[i+1] = string2[k+1], increment the count, and increment the "k" when it doesn't match (to track mispellings. "for" and "four" should come back with a 75% hit.) That doesn't seem to work either. It is getting closer, but even with an exact match it is only returns 94%. And then it really gets screwed up when something is really misspelled. (Code at the bottom)
Any ideas or directions to go?
Code
count = 0
j = 0
k = 0
While j < strTempName.Length - 2 And k < strTempFile.Length - 2
' To ignore non letters or digits '
If Not strTempName(j).IsLetter(strTempName(j)) Then
j += 1
End If
' To ignore non letters or digits '
If Not strTempFile(k).IsLetter(strTempFile(k)) Then
k += 1
End If
' compare pair of chars '
While (strTempName(j) <> strTempFile(k) And _
strTempName(j + 1) <> strTempFile(k + 1) And _
k < strTempFile.Length - 2)
k += 1
End While
count += 1
j += 1
k += 1
End While
perc = count / (strTempName.Length - 1)
Edit: I have been doing some research and I think I initially found the code from here and translated it to vbnet years ago. It uses the Levenshtein string matching algorithm.
Here is the code I use for that, hope it helps:
Sub Main()
Dim string1 As String = "four score and seven years ago"
Dim string2 As String = "for scor and sevn yeres ago"
Dim similarity As Single =
GetSimilarity(string1, string2)
' RESULT : 0.8
End Sub
Public Function GetSimilarity(string1 As String, string2 As String) As Single
Dim dis As Single = ComputeDistance(string1, string2)
Dim maxLen As Single = string1.Length
If maxLen < string2.Length Then
maxLen = string2.Length
End If
If maxLen = 0.0F Then
Return 1.0F
Else
Return 1.0F - dis / maxLen
End If
End Function
Private Function ComputeDistance(s As String, t As String) As Integer
Dim n As Integer = s.Length
Dim m As Integer = t.Length
Dim distance As Integer(,) = New Integer(n, m) {}
' matrix
Dim cost As Integer = 0
If n = 0 Then
Return m
End If
If m = 0 Then
Return n
End If
'init1
Dim i As Integer = 0
While i <= n
distance(i, 0) = System.Math.Max(System.Threading.Interlocked.Increment(i), i - 1)
End While
Dim j As Integer = 0
While j <= m
distance(0, j) = System.Math.Max(System.Threading.Interlocked.Increment(j), j - 1)
End While
'find min distance
For i = 1 To n
For j = 1 To m
cost = (If(t.Substring(j - 1, 1) = s.Substring(i - 1, 1), 0, 1))
distance(i, j) = Math.Min(distance(i - 1, j) + 1, Math.Min(distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost))
Next
Next
Return distance(n, m)
End Function
Did not work for me unless one (or both) of following are done:
1) use option compare statement "Option Compare Text" before any Import declarations and before Class definition (i.e. the very, very first line)
2) convert both strings to lowercase using .tolower
Xavier's code must be correct to:
While i <= n
distance(i, 0) = System.Math.Min(System.Threading.Interlocked.Increment(i), i - 1)
End While
Dim j As Integer = 0
While j <= m
distance(0, j) = System.Math.Min(System.Threading.Interlocked.Increment(j), j - 1)
End While

run time error 5 in VBA excel when working with array

I use vba on excel 2007, OS: windows vista, to make calculation using kinematic wave equation in finite difference scheme. But, when it runs the run-time 5 (invalid procedure call or arguments) message appears. I really don't what is going wrong. Anyone can help?
Sub kwave()
Dim u(500, 500), yy(500, 500), alpha, dt, dx, m, n, so, r, f, X, L, K As Single
Dim i, j As Integer
dx = 0.1
dt = 0.01
L = 10
m = 5 / 3
r = 1
f = 0.5
n = 0.025
so = 0.1 'this is slope
alpha = 1 / n * so ^ 0.5
X = 0
For i = 0 To 100
Cells(i + 1, 1) = X
u(i, 1) = L - so * X
X = X + dx
Cells(i + 1, 2) = u(i, 1)
Next i
For j = 0 To 100
For i = 1 To 100
'predictor step
u(i, j + 1) = u(i, j) - alpha * dt / dx * (u(i + 1, j) ^ m - u(i, j) ^ m) + (r - f) * dt
'corrector step
K = u(i, j + 1) ^ m - u(i - 1, j + 1) ^ m '<<<<----- RUNTIME ERROR 5 HAPPENS AT THIS LINE
yy(i, j + 1) = 0.5 * ((yy(i, j) + u(i, j + 1)) - alpha * dt / dx * K + (r - f) * dt)
Next i
Next j
End Sub
You are declaring the variables wrong- the array should store a double/single but it is defaulting to a variant. See this article.
http://www.cpearson.com/excel/declaringvariables.aspx -
"Pay Attention To Variables Declared With One Dim Statement
VBA allows declaring more than one variable with a single Dim
statement. I don't like this for stylistic reasons, but others do
prefer it. However, it is important to remember how variables will be
typed. Consider the following code:
Dim J, K, L As Long You may think that all three variables are
declared as Long types. This is not the case. Only L is typed as a
Long. The variables J and K are typed as Variant. This declaration is
functionally equivalent to the following:
Dim J As Variant, K As Variant, L As Long You should use the As Type
modifier for each variable declared with the Dim statement:
Dim J As Long, K As Long, L As Long "
Additionally, when i = 99 and j = 10, u(99,11), which is j+1, produces a negative number. Note that this does not fully cause the problem though, because you can raise negative numbers to exponents. Ex, -5^3 = -125

The results of my functions when I call a spline function gives wrong values

I have a function that only call the spline function when something happens..in this case when a division is less than zero..the inputs for the function is the same that for the spline function(called CUBIC), the spline was tested and works well when I call it direct! someone can help me?...follows a party of the code
Function NDF6(T As Variant, dias As Variant, taxas As Variant)
If T <= dias(1) Then
NDF6 = taxas(1)
Exit Function
End If
If T >= dias(tam) Then
NDF6 = taxas(tam)
Exit Function
End If
For i = 1 To tam
If T <= dias(i) Then
If taxas(i) / taxas(i - 1) < 0 Then
Call CUBIC(T, dias, taxas)
Else
i0 = ((taxas(i - 1) * dias(i - 1)) / 360) + 1
i1 = ((taxas(i - 1) * dias(i - 1)) / 360) + 1
irel = i1 / i0
i2 = irel ^ ((T - dias(i - 1)) / (dias(i) - dias(i - 1)))
i2rel = i2 * i0
i2real = i2rel - 1
NDF6 = i2real * (360 / T)
End If
Public Function CUBIC(x As Variant, input_column As Variant, output_column As Variant)
The function returns a zero value when I call the cubic function. The inputs are a cell with a value with a value equivalent a day and two arrays(DUONOFF and ONOFF) equivalent a days and rates, I call the function like:
NDF6(512,DUONOFF,ONOFF)
follows the CUBIC function
Public Function CUBIC(x As Variant, input_column As Variant, output_column As Variant)
'Purpose: Given a data set consisting of a list of x values
' and y values, this function will smoothly interpolate
' a resulting output (y) value from a given input (x) value
' This counts how many points are in "input" and "output" set of data
Dim input_count As Integer
Dim output_count As Integer
input_count = input_column.Rows.Count
output_count = output_column.Rows.Count
Next check to be sure that "input" # points = "output" # points
If input_count <> output_count Then
CUBIC = "Something's messed up! The number of indeces number of output_columnues don't match!"
GoTo out
End If
ReDim xin(input_count) As Single
ReDim yin(input_count) As Single
Dim c As Integer
For c = 1 To input_count
xin(c) = input_column(c)
yin(c) = output_column(c)
Next c
values are populated
Dim N As Integer 'n=input_count
Dim i, k As Integer 'these are loop counting integers
Dim p, qn, sig, un As Single
ReDim u(input_count - 1) As Single
ReDim yt(input_count) As Single 'these are the 2nd deriv values
N = input_count
yt(1) = 0
u(1) = 0
For i = 2 To N - 1
sig = (xin(i) - xin(i - 1)) / (xin(i + 1) - xin(i - 1))
p = sig * yt(i - 1) + 2
yt(i) = (sig - 1) / p
u(i) = (yin(i + 1) - yin(i)) / (xin(i + 1) - xin(i)) - (yin(i) - yin(i - 1)) / (xin(i) - xin(i - _1))
u(i) = (6 * u(i) / (xin(i + 1) - xin(i - 1)) - sig * u(i - 1)) / p
Next i
qn = 0
un = 0
yt(N) = (un - qn * u(N - 1)) / (qn * yt(N - 1) + 1)
For k = N - 1 To 1 Step -1
yt(k) = yt(k) * yt(k + 1) + u(k)
Next k
now eval spline at one point
Dim klo, khi As Integer
Dim h, b, a As Single
first find correct interval
klo = 1
khi = N
Do
k = khi - klo
If xin(k) > x Then
khi = k
Else
klo = k
End If
k = khi - klo
Loop While k > 1
h = xin(khi) - xin(klo)
a = (xin(khi) - x) / h
b = (x - xin(klo)) / h
y = a * yin(klo) + b * yin(khi) + ((a ^ 3 - a) * yt(klo) + (b ^ 3 - b) * yt(khi)) * (h ^ 2) _/ 6
CUBIC = y
out:
End Function

My MInverse function will not work in VBA

EDIT: I fixed it, the ReDim and all starts at 0 and not 1 so I had a cell that was empty which wasn't supposed to be there!
It now works, thanks for the help!
I'm trying to take a matrix and invert it, but for some reason I get this error:
Unable to get the MInverse property of the WorksheetFunction class.
My (relevant) code is as following:
Dim covar() As Variant
ReDim covar(UBound(assetNames), UBound(assetNames))
Dim corr() As Double
ReDim corr(UBound(assetNames), UBound(assetNames))
Dim covarTmp As Double
For i = 0 To UBound(assetNames) - 1
For j = 0 To UBound(assetNames) - 1
covarTmp = 0
For t = 1 To wantedT
covarTmp = covarTmp + (Log((prices(histAmount + 1 - t, i + 1)) / (prices(histAmount - t, i + 1))) - mu(i) * dt) * (Log((prices(histAmount + 1 - t, j + 1)) / (prices(histAmount - t, j + 1))) - mu(j) * dt)
Next t
covar(i, j) = covarTmp * (1 / ((wantedT - 1) * dt))
corr(i, j) = covar(i, j) / (sigma(i) * sigma(j))
Next j
Next i
Dim covarInv() As Variant
ReDim covarInv(UBound(assetNames), UBound(assetNames))
'ReDim covar(1 To UBound(assetNames), 1 To UBound(assetNames))
covarInv = Application.WorksheetFunction.MInverse(covar)
This last row is where the error occurs.
I've tried many things, having covar and covarInv dim as double, variant etc. Different ReDims on covar and covarInv.
You don't say what version of Excel you are using, but with Excel 2010 there seems to be a Minverse maximum limit of 200 * 200 (for Excel 2003 its probably around 80 * 80): How many asset names do you have?

Simulation runs fine # 10k cycles, but gets error 13 (type mismatch) # 100k cycles

First off, here's my code:
Sub SimulatePortfolio()
Dim lambda As Double
Dim num As Integer
Dim cycles As Long
Column = 12
q = 1.5
lambda = 0.05
cycles = 100000
Dim data(1 To 100000, 1 To 10) As Integer
Dim values(1 To 10) As Double
For i = 1 To 10
values(i) = 0
Next i
temp = lambda
For i = 1 To cycles
lambda = temp
num = 10
t = 0
Dim temps(1 To 10) As Integer
For k = 1 To 10
temps(k) = 1000
Next k
Do While (t < 10 And num > 0)
t = t + tsim(lambda, num)
For j = 1 To 10
If (j > t) Then
temps(j) = temps(j) - 50
End If
Next j
num = num - 1
If (num <= 0) Then
Exit Do
End If
lambda = lambda * q
Loop
For l = 1 To 10
values(l) = values(l) + temps(l)
data(i, l) = temps(l)
Next l
Next i
For i = 1 To 10
Cells(i + 1, Column) = values(i) / cycles
'Problem occurs on this line:
Cells(i + 1, Column + 1).Value = Application.WorksheetFunction.Var(Application.WorksheetFunction.Index(data, i, 0))
Next i
End Sub
Function tsim(lambda As Double, num As Integer) As Double
Dim v As Double
Dim min As Double
Randomize
min = (-1 / lambda) * Log(Rnd)
For i = 1 To (num - 1)
Randomize
v = (-1 / lambda) * Log(Rnd)
If (min > v) Then
min = v
End If
Next i
tsim = min
End Function
When I set the value for cycles to 10000, it runs fine without a hitch. When I go to 100000 cycles, it gets an Error 13 at the indicated line of code.
Having been aware that Application.Tranpose is limited to 65536 rows with variants (throwing the same error) I tested the same issue with Index
It appears that Application.WorksheetFunction.Index also has a limit of 65536 rows when working with variants - but standard ranges are fine
So you will need to either need to dump data to a range and work on the range with Index, or work with two arrays
Sub Test()
Dim Y
Dim Z
'works in xl07/10
Debug.Print Application.WorksheetFunction.Index(Range("A1:A100000"), 1, 1)
Y = Range("A1:A65536")
`works
Debug.Print Application.WorksheetFunction.Index(Y, 1, 1)
'fails in xl07/10
Z = Range("A1:A65537")
Debug.Print Application.WorksheetFunction.Index(Z, 1, 1)
End Sub