Longest common substring large strings? - vb.net

I need some help with this function. I am trying to find the longest common substring between two strings. Here is the function that I am currently using:
Public Shared Function LCS(str1 As Char(), str2 As Char()) As String
Dim l As Integer(,) = New Integer(str1.Length - 1, str2.Length - 1) {}
Dim lcs__1 As Integer = -1
Dim substr As String = String.Empty
Dim [end] As Integer = -1
For i As Integer = 0 To str1.Length - 1
For j As Integer = 0 To str2.Length - 1
If str1(i) = str2(j) Then
If i = 0 OrElse j = 0 Then
l(i, j) = 1
Else
l(i, j) = l(i - 1, j - 1) + 1
End If
If l(i, j) > lcs__1 Then
lcs__1 = l(i, j)
[end] = i
End If
Else
l(i, j) = 0
End If
Next
Next
For i As Integer = [end] - lcs__1 + 1 To [end]
substr += str1(i)
Next
Return substr
End Function
This works great on strings of up to around 600 words. If I try to compare strings with a larger word count than that, it starts to throw System.OutOfMemoryException. Obviously, the two-dimensional array is hitting memory pretty hard. Is there any way to fine-tune this function, or is there another, more streamlined way of doing this?
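One way to cut the memory use, sketched here as a suggestion rather than an answer taken from the thread: each cell of the table only depends on the previous row, so two one-dimensional rows are enough and memory drops from O(n * m) to O(m). The module and function names below (`LcsSketch`, `LongestCommonSubstring`) are hypothetical, and the function takes `String` arguments instead of `Char()` for convenience.

```vbnet
Public Module LcsSketch
    ' Longest common substring using two rolling rows instead of a full
    ' (n x m) matrix, so memory is O(m) rather than O(n * m).
    Public Function LongestCommonSubstring(str1 As String, str2 As String) As String
        If String.IsNullOrEmpty(str1) OrElse String.IsNullOrEmpty(str2) Then
            Return String.Empty
        End If

        ' Index 0 of each row is a zero border, so no special case for i = 0 or j = 0.
        Dim prev(str2.Length) As Integer   ' match lengths for row i - 1
        Dim curr(str2.Length) As Integer   ' match lengths for row i
        Dim bestLen As Integer = 0
        Dim bestEnd As Integer = 0         ' exclusive end position in str1

        For i As Integer = 1 To str1.Length
            For j As Integer = 1 To str2.Length
                If str1(i - 1) = str2(j - 1) Then
                    curr(j) = prev(j - 1) + 1
                    If curr(j) > bestLen Then
                        bestLen = curr(j)
                        bestEnd = i
                    End If
                Else
                    curr(j) = 0
                End If
            Next
            ' Swap rows; every cell of the old row is overwritten on the next pass.
            Dim tmp As Integer() = prev
            prev = curr
            curr = tmp
        Next

        Return str1.Substring(bestEnd - bestLen, bestLen)
    End Function
End Module
```

The running time is still O(n * m), so very long inputs remain slow; only the memory footprint changes.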

Related

VB trying to call function with multiple outputs getting "Unable to cast object of type 'System.Object[,]' to type 'System.String[,]'." Error

I'm trying to use a function with multiple outputs (an array and two integers). I thought I had it working but today I am getting an error, "Unable to cast object of type 'System.Object[,]' to type 'System.String[,]'."
The function:
Public Function convArray(ByVal inputArray As String(,)) As (outputArray As String(,), outputRows As Integer, outputCol As Integer)
Dim sColumns As Integer
Dim sRows As Integer
Dim AscArray As Boolean
sRows = inputArray.GetLength(1)
sColumns = inputArray.GetLength(0)
Dim outputArray(sColumns - 1, sRows - 1)
If inputArray(0, 1) > inputArray(0, 2) Then
AscArray = False
Else
AscArray = True
End If
For k As Integer = 0 To sColumns - 1
outputArray(k, 0) = inputArray(k, 0)
Next
For i As Integer = 0 To sColumns - 1
For j As Integer = 1 To sRows - 1
If AscArray Then
outputArray(i, j) = inputArray(i, j)
Else
outputArray(i, j) = inputArray(i, sRows - j)
End If
Next
Next
Return (outputArray, sRows, sColumns)
End Function
The call:
Dim blrArray = convArray(s)
sDRows = blrArray.outputRows
sDColumns = blrArray.outputCol
ReDim sD(sDColumns - 1, sDRows - 1)
sD = blrArray.outputArray
s and sD are arrays defined elsewhere.
I'm getting the error on the "Return" line of the function.
I apologize in advance for my inefficient code; I'm still pretty new at this.
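A likely cause, offered as an assumption because the post does not show the project's Option settings: `Dim outputArray(sColumns - 1, sRows - 1)` has no `As` clause, so with Option Strict Off it is declared as `Object(,)`, and converting the returned tuple to the declared `String(,)` element then fails at run time. The sketch below uses a hypothetical, simplified stand-in (`CopyArray`) to show the typed declaration; it is not the asker's full function.

```vbnet
Public Module TypedArraySketch
    ' Hypothetical, simplified stand-in for convArray: copies a 2-D String array
    ' and returns it together with its dimensions as a named tuple.
    Public Function CopyArray(inputArray As String(,)) As (outputArray As String(,), outputRows As Integer, outputCol As Integer)
        Dim sColumns As Integer = inputArray.GetLength(0)
        Dim sRows As Integer = inputArray.GetLength(1)

        ' The "As String" clause is the point: without it the array is Object(,).
        Dim outputArray(sColumns - 1, sRows - 1) As String

        For i As Integer = 0 To sColumns - 1
            For j As Integer = 0 To sRows - 1
                outputArray(i, j) = inputArray(i, j)
            Next
        Next

        Return (outputArray, sRows, sColumns)
    End Function
End Module
```

Turning on Option Strict would make the compiler flag the untyped declaration instead of leaving the failure for run time.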

VB.NET textbox remove last dash

How can I remove the last "-" that is added after the code has been entered?
All the dashes are added automatically.
Here is my code:
Dim strKeyTextField As String = txtAntivirusCode.Text
Dim n As Integer = 5
Dim intlength As Integer = txtAntivirusCode.TextLength
While intlength > 4
If txtAntivirusCode.Text.Length = 5 Then
strKeyTextField = strKeyTextField.Insert(5, "-")
End If
Dim singleChar As Char
singleChar = strKeyTextField.Chars(n)
While (n + 5) < intlength
If singleChar = "-" Then
n = n + 6
If n = intlength Then
strKeyTextField = strKeyTextField.Insert(n, "-")
End If
End If
End While
intlength = intlength - 5
End While
'' Define total variable with dashes
txtAntivirusCode.Text = strKeyTextField
'sets focus at the end of the string
txtAntivirusCode.Select(txtAntivirusCode.Text.Length, 0)
Output is : XXXXX-XXXXX-XXXXX-XXXXX-XXXXX-
What I want : XXXXX-XXXXX-XXXXX-XXXXX-XXXXX
You could just remove the last char in the string like this:
txtAntivirusCode.Text = strKeyTextField.Substring(0, strKeyTextField.Length - 1)
or
txtAntivirusCode.Text = strKeyTextField.Remove(strKeyTextField.Length - 1)
or
txtAntivirusCode.Text = strKeyTextField.Trim({" "c, "-"c})
or
txtAntivirusCode.Text = strKeyTextField.TrimEnd("-"c)
If there is a possibility of a space at the end of the string, call .Trim() before Substring and/or Remove.
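A small sketch of the trim-then-strip idea above, wrapped in a hypothetical helper (`StripTrailingDash` is not from the thread). Unlike `Substring(0, Length - 1)`, it does not throw on an empty string and leaves the text alone when there is no trailing dash.

```vbnet
Public Module DashSketch
    ' Removes surrounding whitespace first, then any trailing dashes.
    Public Function StripTrailingDash(key As String) As String
        If String.IsNullOrEmpty(key) Then Return String.Empty
        Return key.Trim().TrimEnd("-"c)
    End Function
End Module
```

Note that `TrimEnd` removes every trailing dash, not just one, which is usually what is wanted for a formatted key.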
An alternative to removing the last "-" is not to add it in the first place, for example:
Dim s = "ABCDE-FGHIJKLMNOPQRSTUVWXYZ"
Dim batchSize = 5
Dim nBatches = 5
Dim nChars = nBatches * batchSize
' take out any dashes
s = s.Replace("-", "")
' make sure there are not too many characters
If s.Length > nChars Then
s = s.Substring(0, nChars)
End If
Dim sb As New Text.StringBuilder
For i = 1 To s.Length
sb.Append(s.Chars(i - 1))
If i Mod batchSize = 0 AndAlso i <> nChars Then
sb.Append("-")
End If
Next
Console.WriteLine(sb.ToString())
Console.ReadLine()
Outputs:
ABCDE-FGHIJ-KLMNO-PQRST-UVWXY

array without any duplicate value

The code to generate a number of arrays from one array is working. I'm trying to make some changes to it, as shown below:
Function myarray(ByVal arra1() As Integer, ByVal arran() As Integer, ByVal arrNumber As Integer) As Integer()
arran = arra1.Clone()
For i As Integer = 0 To arra1.Length - 1
If i = (arrNumber - 1) Then ' IF arrNumber is 1 then +1 to index 0, If it is 2 then +1 to index 1
arran(i) = arra1(i) + 1
'If there are two duplicate values, make one of them zero at a time
For k = 0 To arran.Length - 1
For j = k + 1 To arran.Length - 1
If arran(k) = arran(j) Then
arran(k) = 0
End If
'make any value greater than 11 zero
If arran(i) > 11 Then
arran(i) = 0
End If
Next
Next
Else
arran(i) = arra1(i)
End If
Next
'Print the array
For i = 0 To arran.Length - 1
Console.Write(arran(i) & " ")
Next
Console.WriteLine()
Return arran
End Function
What I really need is to decompose, for example, {1,4,5,5} into {1,4,0,5} and then {1,4,5,0}. The code above generates only {1,4,0,5}.
I haven't tested this, but I believe the following code will do what you want. Based on your comments, I've changed the function to return all resulting arrays as an array of arrays, rather than taking the index to change as an input and returning one array. I also ignored matches of 0, as the conditions you describe don't seem designed to handle them. Because of its recursion, I think this approach will successfully handle input such as {3, 3, 3, 3}.
Public Function jaggedArray(ByVal inputArray() As Integer) As Integer()()
If inputArray Is Nothing Then
Return Nothing
Else
Dim resultArrays()(), i, j As Integer
Dim arrayMax As Integer = inputArray.GetUpperBound(0)
If arrayMax = 0 Then 'prevents errors later if only one number passed
ReDim resultArrays(0)
If inputArray(0) > 11 Then
resultArrays(0) = {1}
ElseIf inputArray(0) = 11 Then
resultArrays(0) = {0}
Else
resultArrays(0) = {inputArray(0) + 1}
End If
Return resultArrays
End If
For i = 0 To arrayMax
Dim tempArray() As Integer = inputArray.Clone
For j = 0 To arrayMax
If tempArray(j) > 11 Then
tempArray(j) = 0
End If
Next
If tempArray(i) = 11 Then
tempArray(i) = 0
Else
tempArray(i) += 1
End If
splitArray(resultArrays, tempArray)
Next
Return resultArrays
End If
End Function
Private Sub splitArray(ByRef arrayList()() As Integer, ByVal sourceArray() As Integer)
Dim x, y As Integer 'positions of matching numbers
If isValid(sourceArray, x, y) Then
If arrayList Is Nothing Then
ReDim arrayList(0)
Else
ReDim Preserve arrayList(arrayList.Length)
End If
arrayList(arrayList.GetUpperBound(0)) = sourceArray
Else
Dim xArray(), yArray() As Integer
xArray = sourceArray.Clone
xArray(x) = 0
splitArray(arrayList, xArray)
yArray = sourceArray.Clone
yArray(y) = 0
splitArray(arrayList, yArray)
End If
End Sub
Private Function isValid(ByRef testArray() As Integer, ByRef match1 As Integer, ByRef match2 As Integer) As Boolean
For i As Integer = 0 To testArray.GetUpperBound(0) - 1
If testArray(i) > 11 Then
testArray(i) = 0
End If
For j As Integer = i + 1 To testArray.GetUpperBound(0)
If testArray(j) > 11 Then
testArray(j) = 0
End If
If testArray(i) = testArray(j) AndAlso testArray(i) > 0 Then 'added second test to prevent infinite recursion
match1 = i
match2 = j
Return False
End If
Next
Next
match1 = -1
match2 = -1
Return True
End Function

How to compare Strings for Percentage Match using vb.net?

I am banging my head against the wall for a while now trying different techniques.
None of them are working well.
I have two strings.
I need to compare them and get an exact percentage of match,
e.g. "four score and seven years ago" TO "for scor and sevn yeres ago"
Well, I first started by comparing every word to every word, tracking every hit, with percentage = count \ numOfWords. Nope, that didn't take misspelled words into account.
("four" <> "for" even though it is close)
Then I tried comparing every char in each string, advancing the char index on a mismatch (to account for misspellings). But I would get false hits, because the first string could contain every char of the second without them being in the same order. ("stuff avail" <> "stu vail", but it would come back as a hit, with a low percentage: 9 \ 11 = 81%.)
So I then tried comparing PAIRS of chars in each string: if string1[i] = string2[k] AND string1[i+1] = string2[k+1], increment the count, and increment "k" when there is no match (to allow for misspellings; "for" and "four" should come back as a 75% hit). That doesn't seem to work either. It is getting closer, but even with an exact match it only returns 94%. And it really gets screwed up when something is badly misspelled. (Code at the bottom.)
Any ideas or directions to go?
Code
count = 0
j = 0
k = 0
While j < strTempName.Length - 2 And k < strTempFile.Length - 2
' To ignore non letters or digits '
If Not strTempName(j).IsLetter(strTempName(j)) Then
j += 1
End If
' To ignore non letters or digits '
If Not strTempFile(k).IsLetter(strTempFile(k)) Then
k += 1
End If
' compare pair of chars '
While (strTempName(j) <> strTempFile(k) And _
strTempName(j + 1) <> strTempFile(k + 1) And _
k < strTempFile.Length - 2)
k += 1
End While
count += 1
j += 1
k += 1
End While
perc = count / (strTempName.Length - 1)
Edit: I have been doing some research, and I think I initially found the code from here and translated it to VB.NET years ago. It uses the Levenshtein string matching algorithm.
Here is the code I use for that; I hope it helps:
Sub Main()
Dim string1 As String = "four score and seven years ago"
Dim string2 As String = "for scor and sevn yeres ago"
Dim similarity As Single =
GetSimilarity(string1, string2)
' RESULT : 0.8
End Sub
Public Function GetSimilarity(string1 As String, string2 As String) As Single
Dim dis As Single = ComputeDistance(string1, string2)
Dim maxLen As Single = string1.Length
If maxLen < string2.Length Then
maxLen = string2.Length
End If
If maxLen = 0.0F Then
Return 1.0F
Else
Return 1.0F - dis / maxLen
End If
End Function
Private Function ComputeDistance(s As String, t As String) As Integer
Dim n As Integer = s.Length
Dim m As Integer = t.Length
Dim distance As Integer(,) = New Integer(n, m) {}
' matrix
Dim cost As Integer = 0
If n = 0 Then
Return m
End If
If m = 0 Then
Return n
End If
'init1
Dim i As Integer = 0
While i <= n
distance(i, 0) = System.Math.Max(System.Threading.Interlocked.Increment(i), i - 1)
End While
Dim j As Integer = 0
While j <= m
distance(0, j) = System.Math.Max(System.Threading.Interlocked.Increment(j), j - 1)
End While
'find min distance
For i = 1 To n
For j = 1 To m
cost = (If(t.Substring(j - 1, 1) = s.Substring(i - 1, 1), 0, 1))
distance(i, j) = Math.Min(distance(i - 1, j) + 1, Math.Min(distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost))
Next
Next
Return distance(n, m)
End Function
This did not work for me unless one (or both) of the following was done:
1) Put the statement "Option Compare Text" at the very top of the file, before any Imports statements and before the Class definition.
2) Convert both strings to lowercase using .ToLower().
Xavier's code must be corrected to use Math.Min instead of Math.Max in the initialization loops:
While i <= n
distance(i, 0) = System.Math.Min(System.Threading.Interlocked.Increment(i), i - 1)
End While
Dim j As Integer = 0
While j <= m
distance(0, j) = System.Math.Min(System.Threading.Interlocked.Increment(j), j - 1)
End While
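The `Math.Max`/`Math.Min` wrapped around `Interlocked.Increment` looks like residue from an automatic C# to VB conversion of `i++` (an assumption; the thread does not say so). A plain `For` loop expresses the same border initialization without that construct. The sketch below is a self-contained variant under the hypothetical name `ComputeDistanceSimple`; it compares `Char` values directly instead of calling `Substring`.

```vbnet
Imports System

Public Module DistanceSketch
    ' Levenshtein distance with the first column and first row filled by plain For loops.
    Public Function ComputeDistanceSimple(s As String, t As String) As Integer
        Dim n As Integer = s.Length
        Dim m As Integer = t.Length
        If n = 0 Then Return m
        If m = 0 Then Return n

        Dim distance(n, m) As Integer

        ' Border cells: turning a prefix of length k into an empty string costs k edits.
        For i As Integer = 0 To n
            distance(i, 0) = i
        Next
        For j As Integer = 0 To m
            distance(0, j) = j
        Next

        For i As Integer = 1 To n
            For j As Integer = 1 To m
                Dim cost As Integer = If(s(i - 1) = t(j - 1), 0, 1)
                distance(i, j) = Math.Min(distance(i - 1, j) + 1,
                                 Math.Min(distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost))
            Next
        Next

        Return distance(n, m)
    End Function
End Module
```

The comparison here is case-sensitive; if case should be ignored, one option is to pass both strings through `ToLowerInvariant()` before calling the function.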

Performance difference between two implementations of the same algorithm

I'm working on an application that will require the Levenshtein algorithm to calculate the similarity of two strings.
A long time ago I adapted a C# version (which can easily be found floating around on the internet) to VB.NET, and it looks like this:
Public Function Levenshtein1(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d(n, m) As Integer
Dim cost As Integer
Dim s1c As Char
For i = 1 To n
d(i, 0) = i
Next
For j = 1 To m
d(0, j) = j
Next
For i = 1 To n
s1c = s1(i - 1)
For j = 1 To m
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
Then, trying to tweak it and improve its performance, I ended up with this version:
Public Function Levenshtein2(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d(n, m) As Integer
Dim s1c As Char
Dim cost As Integer
For i = 1 To n
d(i, 0) = i
s1c = s1(i - 1)
For j = 1 To m
d(0, j) = j
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
Basically, I thought that the array of distances d(,) could be initialized inside the main For loops, instead of requiring two initial (and additional) loops. I really thought this would be a huge improvement... unfortunately, not only does it not improve on the original, it actually runs slower!
I have already tried to analyze both versions by looking at the generated IL code, but I just can't understand it.
So, I was hoping that someone could shed some light on this issue and explain why the second version (even though it has fewer For loops) runs slower than the original.
NOTE: The time difference is about 0.15 nanoseconds. This doesn't look like much, but when you have to check thousands of millions of strings... the difference becomes quite notable.
It's because of this:
For i = 1 To n
d(i, 0) = i
s1c = s1(i - 1)
For j = 1 To m
d(0, j) = j 'THIS LINE HERE
You were initializing this row of the array once, at the beginning, but now you are initializing it n times. There is a cost involved with accessing memory in an array like this, and you are now paying it on every one of the n outer iterations. You could change the line to say: If i = 1 Then d(0, j) = j. However, in my tests, you still end up with a slightly slower version than the original. And that again makes sense: you're evaluating this If statement n*m times, and that has some cost too. Moving the initialization out of the nested loops, as in the original version, is a lot cheaper: it ends up being O(n + m). Since the overall algorithm is O(n*m), any step you can move out into a linear step is going to be a win.
You can split the following line:
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
as follows:
tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + cost)
This way you avoid one addition.
Furthermore, you can place the last "min" comparison inside the If block and avoid assigning cost:
tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
If s1c = s2(j - 1) Then
d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
Else
d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
End If
So you save an addition when s1c = s2(j - 1).
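Put together, the two tweaks above could look like the following complete function. This is a sketch under a hypothetical name (`Levenshtein4`); the guard for two empty strings is an added assumption, and whether the result is actually faster than `Levenshtein1` would need to be measured.

```vbnet
Imports System

Public Module TweakSketch
    ' Levenshtein similarity (0 to 100) with the shared "+ 1" hoisted out of the
    ' inner Min and the cost variable replaced by an If block.
    Public Function Levenshtein4(s1 As String, s2 As String) As Double
        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length
        ' Assumption: two empty strings are treated as a 100% match (avoids 0 / 0).
        If Math.Max(n, m) = 0 Then Return 100.0

        Dim d(n, m) As Integer
        For i As Integer = 1 To n
            d(i, 0) = i
        Next
        For j As Integer = 1 To m
            d(0, j) = j
        Next

        For i As Integer = 1 To n
            Dim s1c As Char = s1(i - 1)
            For j As Integer = 1 To m
                ' Cheapest of deletion and insertion, with the + 1 applied once.
                Dim tmp As Integer = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
                If s1c = s2(j - 1) Then
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
                Else
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
                End If
            Next
        Next

        Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
    End Function
End Module
```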
This is not a direct answer to your question, but for faster performance you should consider using a jagged array (an array of arrays) instead of a multidimensional array. See the questions "What are the differences between a multidimensional array and an array of arrays in C#?" and "Why are multi-dimensional arrays in .NET slower than normal arrays?"
There you will see that the jagged array version has an IL code size of 7, as opposed to 10 for the multidimensional array.
The code below uses a jagged array (an array of single-dimensional arrays):
Public Function Levenshtein3(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d()() As Integer = New Integer(n)() {}
Dim cost As Integer
Dim s1c As Char
For i = 0 To n
d(i) = New Integer(m) {}
Next
For j = 1 To m
d(0)(j) = j
Next
For i = 1 To n
d(i)(0) = i
s1c = s1(i - 1)
For j = 1 To m
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i)(j) = Math.Min(Math.Min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
Next
Next
Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
End Function