I am looking to implement a VBA trie-building algorithm that is able to process a substantial English lexicon (~50,000 words) in a relatively short amount of time (less than 15-20 seconds). Since I am a C++ programmer by practice (and this is my first time doing any substantial VBA work), I built a quick proof-of-concept program that was able to complete the task on my computer in about half a second. When it came time to test the VBA port however, it took almost two minutes to do the same -- an unacceptably long amount of time for my purposes. The VBA code is below:
Node Class Module:
Public letter As String
Public next_nodes As New Collection
Public is_word As Boolean
Main Module:
Dim tree As Node
Sub build_trie()
Set tree = New Node
Dim file, a, b, c As Integer
Dim current As Node
Dim wordlist As Collection
Set wordlist = New Collection
file = FreeFile
Open "C:\corncob_caps.txt" For Input As file
Do While Not EOF(file)
Dim line As String
Line Input #file, line
wordlist.add line
Loop
For a = 1 To wordlist.Count
Set current = tree
For b = 1 To Len(wordlist.Item(a))
Dim match As Boolean
match = False
Dim char As String
char = Mid(wordlist.Item(a), b, 1)
For c = 1 To current.next_nodes.Count
If char = current.next_nodes.Item(c).letter Then
Set current = current.next_nodes.Item(c)
match = True
Exit For
End If
Next c
If Not match Then
Dim new_node As Node
Set new_node = New Node
new_node.letter = char
current.next_nodes.add new_node
Set current = new_node
End If
Next b
current.is_word = True
Next a
End Sub
My question then is simply, can this algorithm be sped up? I saw from some sources that VBA Collections are not as efficient as Dictionarys and so I attempted a Dictionary-based implementation instead but it took an equal amount of time to complete with even worse memory usage (500+ MB of RAM used by Excel on my computer). As I say I am extremely new to VBA so my knowledge of both its syntax as well as its overall features/limitations is very limited -- which is why I don't believe that this algorithm is as efficient as it could possibly be; any tips/suggestions would be greatly appreciated.
Thanks in advance
NB: The lexicon file referred to by the code, "corncob_caps.txt", is available here (download the "all CAPS" file)
There are a number of small issues and a few larger opportunities here. You did say this is your first vba work, so forgive me if I'm telling you things you already know
Small things first:
Dim file, a, b, c As Integer declares file, a and b as variants. Integer is 16 bit sign, so there may be risk of overflows, use Long instead.
DIM'ing inside loops is counter-productive: unlike C++ they are not loop scoped.
The real opportunity is:
Use For Each where you can to iterate collections: its faster than indexing.
On my hardware your original code ran in about 160s. This code in about 2.5s (both plus time to load word file into the collection, about 4s)
Sub build_trie()
Dim t1 As Long
Dim wd As Variant
Dim nd As Node
Set tree = New Node
' Dim file, a, b, c As Integer : declares file, a, b as variant
Dim file As Integer, a As Long, b As Long, c As Long ' Integer is 16 bit signed
Dim current As Node
Dim wordlist As Collection
Set wordlist = New Collection
file = FreeFile
Open "C:\corncob_caps.txt" For Input As file
' no point in doing inside loop, they are not scoped to the loop
Dim line As String
Dim match As Boolean
Dim char As String
Dim new_node As Node
Do While Not EOF(file)
'Dim line As String
Line Input #file, line
wordlist.Add line
Loop
t1 = GetTickCount
For Each wd In wordlist ' for each is faster
'For a = 1 To wordlist.Count
Set current = tree
For b = 1 To Len(wd)
'Dim match As Boolean
match = False
'Dim char As String
char = Mid$(wd, b, 1)
For Each nd In current.next_nodes
'For c = 1 To current.next_nodes.Count
If char = nd.letter Then
'If char = current.next_nodes.Item(c).letter Then
Set current = nd
'Set current = current.next_nodes.Item(c)
match = True
Exit For
End If
Next nd
If Not match Then
'Dim new_node As Node
Set new_node = New Node
new_node.letter = char
current.next_nodes.Add new_node
Set current = new_node
End If
Next b
current.is_word = True
Next wd
Debug.Print "Time = " & GetTickCount - t1 & " ms"
End Sub
EDIT:
loading the word list into a dynamic array will reduce load time to sub second. Be aware that Redim Preserve is expensive, so do it in chunks
Dim i As Long, sz As Long
sz = 10000
Dim wordlist() As String
ReDim wordlist(0 To sz)
file = FreeFile
Open "C:\corncob_caps.txt" For Input As file
i = 0
Do While Not EOF(file)
'Dim line As String
Line Input #file, line
wordlist(i) = line
i = i + 1
If i > sz Then
sz = sz + 10000
ReDim Preserve wordlist(0 To sz)
End If
'wordlist.Add line
Loop
ReDim Preserve wordlist(0 To i - 1)
then loop through it like
For i = 0 To UBound(wordlist)
wd = wordlist(i)
I'm out of practice with VBA, but IIRC, iterating the Collection using For Each should be a bit faster than going numerically:
Dim i As Variant
For Each i In current.next_nodes
If i.letter = char Then
Set current = i
match = True
Exit For
End If
Next node
You're also not using the full capabilities of Collection. It's a Key-Value map, not just a resizeable array. You might get better performance if you use the letter as a key, though looking up a key that isn't present throws an error, so you have to use an ugly error workaround to check for each node. The inside of the b loop would look like:
Dim char As String
char = Mid(wordlist.Item(a), b, 1)
Dim node As Node
On Error Resume Next
Set node = Nothing
Set node = current.next_nodes.Item(char)
On Error Goto 0
If node Is Nothing Then
Set node = New Node
current.next_nodes.add node, char
Endif
Set current = node
You won't need the letter variable on class Node that way.
I didn't test this. I hope it's all right...
Edit: Fixed the For Each loop.
Another thing you can do which will possibly be slower but will use less memory is use an array instead of a collection, and resize with each added element. Arrays can't be public on classes, so you have to add methods to the class to deal with it:
Public letter As String
Private next_nodes() As Node
Public is_word As Boolean
Public Sub addNode(new_node As Node)
Dim current_size As Integer
On Error Resume Next
current_size = UBound(next_nodes) 'ubound throws an error if the array is not yet allocated
On Error GoTo 0
ReDim next_nodes(0 To current_size) As Node
Set next_nodes(current_size) = new_node
End Sub
Public Function getNode(letter As String) As Node
Dim n As Variant
On Error Resume Next
For Each n In next_nodes
If n.letter = letter Then
Set getNode = n
Exit Function
End If
Next
End Function
Edit: And a final optimization strategy, get the Integer char value with the Asc function and store that instead of a String.
You really need to profile it, but if you think Collections are slow maybe you can try using dynamic arrays?
Related
Sorry for the terrible wording on my last question, I was half asleep and it was midnight. This time I'll try to be more clear.
I'm currently writing some code for a mini barcode scanner and stock manager program. I've got the input and everything sorted out, but there is a problem with my arrays.
I'm currently trying to extract the contents of the stock file and sort them out into product tables.
This is my current code for getting the data:
Using fs As StreamReader = New StreamReader("The File Path (Is private)")
Dim line As String = "ERROR"
line = fs.ReadLine()
While line <> Nothing
Dim pos As Integer = 0
Dim split(3) As String
pos = products.Length
split = line.Split("|")
productCodes(productCodes.Length) = split(0)
products(products.Length, 0) = split(1)
products(products.Length, 1) = split(2)
products(products.Length, 2) = split(3)
line = fs.ReadLine()
End While
End Using
I have made sure that the file path does, in fact, go to the file. I have looked through debug to find that all the data is going through into my "split" table. The error throws as soon as I start trying to transfer the data.
This is where I declare the two tables being used:
Dim productCodes() As String = {}
Dim products(,) As Object = {}
Can somebody please explain why this is happening?
Thanks in advance
~Hydro
By declaring the arrays like you did:
Dim productCodes() As String = {}
Dim products(,) As Object = {}
You are assigning size 0 to all your arrays, so during your loop, it will eventually try to access a position that haven't been previously declared to the compiler. It is the same as declaring an array of size 10 Dim MyArray(10) and try to access the position 11 MyArray(11) = something.
You should either declare it with a proper size, or redim it during execution time:
Dim productCodes(10) As String
or
Dim productCodes() As String
Dim Products(,) As String
Dim Position as integer = 0
'code here
While line <> Nothing
Redim Preserve productCodes(Position)
Redim Preserve products(2,Position)
Dim split(3) As String
pos = products.Length
split = line.Split("|")
productCodes(Position) = split(0)
products(0,Position) = split(1)
products(1,Position) = split(2)
products(2,Position) = split(3)
line = fs.ReadLine()
Position+=1
End While
i'm trying to compare these 2 sort algorithms.
I've written a vb.net console program and used excel to create a csv file of 10000 integers randomly created between 0 and 100000.
Insertion sort seems to take approx 10x longer which can't be correct can it?
can anyone point out where i'm going wrong?
module Module1
Dim unsortedArray(10000) As integer
sub main
dim startTick as long
dim endTick as long
loadDataFromFile
startTick = date.now.ticks
insertionsort
endTick = date.now.ticks
console.writeline("ticks for insertion sort = " & (endTick-startTick))
loadDataFromFile
startTick = date.now.ticks
bubblesort
endTick = date.now.ticks
console.writeline("ticks for bubble sort = " & (endTick-startTick))
end sub
sub bubbleSort
dim temp as integer
dim swapped as boolean
dim a as integer = unsortedArray.getupperbound(0)-1
do
swapped=false
for i = 0 to a
if unsortedArray(i)>unsortedArray(i+1) then
temp=unsortedArray(i)
unsortedArray(i)=unsortedArray(i+1)
unsortedArray(i+1)=temp
swapped=true
end if
next i
'a = a - 1
loop until not swapped
end sub
sub insertionSort()
dim temp as string
dim ins as integer
dim low as integer = 0
dim up as integer = unsortedArray.getupperbound(0)
console.writeline()
for i = 1 to up
temp = unsortedArray(i)
ins = i-1
while (ins >= 0) andalso (temp < unsortedArray(ins))
unsortedArray(ins+1) = unsortedArray(ins)
ins = ins -1
end while
unsortedArray(ins+1) = temp
next
end sub
sub loadDataFromFile()
dim dataItem as integer
fileopen(1,FileIO.FileSystem.CurrentDirectory & "\10000.csv", openmode.input)
'set up to loop through each row in the array
for i = 0 to 9999
input(1,dataItem)
'save that data item in correct array positon
unsortedArray(i) = dataItem
next i
fileclose(1)
end sub
dim temp as string
You've declared your temporary variable as a string instead of an integer. VB.Net is perfectly happy to allow you to do this sort of sloppy thing, and it will convert the numeric value to a string and back. This is a very expensive operation.
If you go into your project options, under "Compile", do yourself a favour and turn on "Option Strict". This will disallow implicit type conversions like this and force you to fix it, showing you exactly where you made the error.
"Option Strict" is off by default for legacy reasons, simply to allow badly written legacy VB code to be compiled without complaint in vb.net. There is otherwise no sane reason to leave it turned off.
Changing the declaration to
Dim temp As Integer
reveals that the insertion sort is indeed about 3-5 times faster than the bubble on average.
I have a Streamreader which is trowing an error after checking every line in Daycounts.txt. It is not a stable txt file. String lines in it are not stable. Count of lines increasing or decresing constantly. Thats why I am using a range 0 to 167. But
Here is the content of Daycounts.txt: Daycounts
Dim HourSum as integer
Private Sub Change()
Dim R As IO.StreamReader
R = New IO.StreamReader("Daycounts.txt")
Dim sum As Integer = 0
For p = 0 To 167
Dim a As String = R.ReadLine
If a.Substring(0, 2) <> "G." Then
sum += a.Substring(a.Length - 2, 2)
Else
End If
Next
HourSum = sum
R.Close()
End Sub
If you don't know how many lines are present in your text file then you could use the method File.ReadAllLines to load all lines in memory and then apply your logic
Dim HourSum As Integer
Private Sub Change()
Dim lines = File.ReadAllLines("Daycounts.txt")
Dim sum As Integer = 0
For Each line In lines
If line.Substring(0, 2) <> "G." Then
sum += Convert.ToInt32(line.Substring(line.Length - 2, 2))
Else
....
End If
Next
HourSum = sum
End Sub
This is somewhat inefficient because you loop over the lines two times (one to read them in, and one to apply your logic) but with a small set of lines this should be not a big problem
However, you could also use File.ReadLines that start the enumeration of your lines without loading them all in memory. According to this question, ReadLines locks writes your file until the end of your read loop, so, perhaps this could be a better option for you only if you don't have something external to your code writing concurrently to the file.
Dim HourSum As Integer
Private Sub Change()
Dim sum As Integer = 0
For Each line In File.ReadLines("Daycounts.txt")
If line.Substring(0, 2) <> "G." Then
sum += Convert.ToInt32(line.Substring(line.Length - 2, 2))
Else
....
End If
Next
HourSum = sum
End Sub
By the way, notice that I have added a conversion to an integer against the loaded line. In your code, the sum operation is applied directly on the string. This could work only if you have Option Strict set to Off for your project. This setting is a very bad practice maintained for VB6 compatibility and should be changed to Option Strict On for new VB.NET projects
I would ask if you could give me some alternatives in my problems.
basically I'm reading a .txt log file averaging to 8 million lines. Around 600megs of pure raw txt file.
I'm currently using streamreader to do 2 passes on those 8 million lines doing sorting and filtering important parts in the log file, but to do so, My computer is taking ~50sec to do 1 complete run.
One way that I can optimize this is to make the first pass to start reading at the end because the most important data is located approximately at the final 200k line(s) . Unfortunately, I searched and streamreader can't do this. Any ideas to do this?
Some general restriction
# of lines varies
size of file varies
location of important data varies but approx at the final 200k line
Here's the loop code for the first pass of the log file just to give you an idea
Do Until sr.EndOfStream = True 'Read whole File
Dim streambuff As String = sr.ReadLine 'Array to Store CombatLogNames
Dim CombatLogNames() As String
Dim searcher As String
If streambuff.Contains("CombatLogNames flags:0x1") Then 'Keyword to Filter CombatLogNames Packets in the .txt
Dim check As String = streambuff 'Duplicate of the Line being read
Dim index1 As Char = check.Substring(check.IndexOf("(") + 1) '
Dim index2 As Char = check.Substring(check.IndexOf("(") + 2) 'Used to bypass the first CombatLogNames packet that contain only 1 entry
If (check.IndexOf("(") <> -1 And index1 <> "" And index2 <> " ") Then 'Stricter Filters for CombatLogNames
Dim endCLN As Integer = 0 'Signifies the end of CombatLogNames Packet
Dim x As Integer = 0 'Counter for array
While (endCLN = 0 And streambuff <> "---- CNETMsg_Tick") 'Loops until the end keyword for CombatLogNames is seen
streambuff = sr.ReadLine 'Reads a new line to flush out "CombatLogNames flags:0x1" which is unneeded
If ((streambuff.Contains("---- CNETMsg_Tick") = True) Or (streambuff.Contains("ResponseKeys flags:0x0 ") = True)) Then
endCLN = 1 'Value change to determine end of CombatLogName packet
Else
ReDim Preserve CombatLogNames(x) 'Resizes the array while preserving the values
searcher = streambuff.Trim.Remove(streambuff.IndexOf("(") - 5).Remove(0, _
streambuff.Trim.Remove(streambuff.IndexOf("(")).IndexOf("'")) 'Additional filtering to get only valuable data
CombatLogNames(x) = search(searcher)
x += 1 '+1 to Array counter
End If
End While
Else
'MsgBox("Something went wrong, Flame the coder of this program!!") 'Bug Testing code that is disabled
End If
Else
End If
If (sr.EndOfStream = True) Then
ReDim GlobalArr(CombatLogNames.Length - 1) 'Resizing the Global array to prime it for copying data
Array.Copy(CombatLogNames, GlobalArr, CombatLogNames.Length) 'Just copying the array to make it global
End If
Loop
You CAN set the BaseStream to the desired reading position, you just cant set it to a specfic LINE (because counting lines requires to read the complete file)
Using sw As New StreamWriter("foo.txt", False, System.Text.Encoding.ASCII)
For i = 1 To 100
sw.WriteLine("the quick brown fox jumps ovr the lazy dog")
Next
End Using
Using sr As New StreamReader("foo.txt", System.Text.Encoding.ASCII)
sr.BaseStream.Seek(-100, SeekOrigin.End)
Dim garbage = sr.ReadLine ' can not use, because very likely not a COMPLETE line
While Not sr.EndOfStream
Dim line = sr.ReadLine
Console.WriteLine(line)
End While
End Using
For any later read attempt on the same file, you could simply save the final position (of the basestream) and on the next read to advance to that position before you start reading lines.
What worked for me was skipping first 4M lines (just a simple if counter > 4M surrounding everything inside the loop), and then adding background workers that did the filtering, and if important added the line to an array, while main thread continued reading the lines. This saved about third of the time at the end of a day.
How do i free up Memory?
Say I have a string
Dim TestStri As String
TestStri = "Test"
' What do i have to type up to get rid of the variable?
' I know
TestStri = Nothing
' will give it the default value, but the variable is still there.
Can I use the same Method for other variables i.e. Long, int etc.
I'm assuming you are referring to VB6 and VBA as indicated by your title, not VB.Net, as indicated by a keyword.
In VB6 and VBA the memory consumption of a string variable consists of a fixed part for the string's length and a terminator and a variable length part for the string contents itself. See http://www.aivosto.com/vbtips/stringopt2.html#memorylayout for a good explanation of this.
So, when you set the string variable to an empty string or vbNullString, you will be freeing up the variable part of the string but not the fixed part.
Other types like long, int, bool and date consume a fixed amount of memory.
You can't "free" local variables in VB completely (come to think of it, is there ANY programming language where you can do that?), and for the most part, you wouldn't care because the local variables themselves (the fixed portion) is usually very small.
The only case I can think of where the memory consumption of local varibles could get big is if you have recursive function calls with deep recursion/wide recursion.
I went a differs route :
I was hoping MemoryUsage would be useful. It wasn't, apparently...
I run a vba script that goes through multiple files (since access cannot handle anything too large); and append them to a table, transform it and then spit out a summary.
The script loops through files and runs macros against each of them.
The quick answer is to pull the memory usage from the task manager and then if it exceeds 1 GB; pause the subroutine so no corrupt records get in.
How do we do this?
Insert this memory usage Function with the readfile function.
You will need to create an if statement in your code that says:
dim memory as long
memory = memory_usage
' 1000000 ~ 1 GB
If memory > 1000000 then
End Sub
end if
=================================================
[path to file] = "C:\….\ShellOutputfile.txt"
Function Memory_Usage() as Long
Dim lines As Long
Dim linestring As String
Shell "tasklist /fi " & """IMAGENAME EQ MSACCESS.EXE""" & ">" & """[path to file]"""
'get_list_data
lines = CInt(get_listing_data("[path to file]", 1, 0))
linestring = get_listing_data("[path to file]", 2, 4)
linestring = Right(linestring, 11)
linestring = Replace(linestring, " K", "") ' K
linestring = Replace(linestring, " ", "")
lines = CLng(linestring)
Memory_Usage = lines
End Function
=============================
Public Function get_listing_data(PATH As String, Choice As Integer, typeofreading As Integer) As String
' parse in the variable, of which value you need.
Const ForReading = 1, ForWriting = 2, ForAppending = 8
Dim tmp_var_str As String
Dim fso, ts, fileObj, filename
Dim textline As String
Dim tmp_result As String
Dim TMP_PATH As String
Dim tmpchoice As Integer
Dim tor As Integer
Dim counter As Integer
' type of reading determines what loop is used
' type of reading = 0; to bypass; > 0, you are choosing a line to read.
counter = 0
TMP_PATH = PATH
tmp_var_str = var_str
tmp_result = ""
tor = typeofreading
' choice = 1 (count the lines)
' choice = 2 (read a specific line)
tmpchoice = Choice
' Create the file, and obtain a file object for the file.
If Right(PATH, 1) = "\" Then TMP_PATH = Left(PATH, Len(PATH) - 1)
filename = TMP_PATH '& "\Profit_Recognition.ini"
Set fso = CreateObject("Scripting.FileSystemObject")
Set fileObj = fso.GetFile(filename)
' Open a text stream for output.
Set ts = fileObj.OpenAsTextStream(ForReading, TristateUseDefault)
Do While ts.AtEndOfStream <> True
If tmpchoice = 1 Then
counter = counter + 1
textline = ts.ReadLine
tmp_result = CStr(counter)
End If
If tmpchoice = 2 Then
counter = counter + 1
tmp_result = ts.ReadLine
If counter = tor Then
Exit Do
End If
End If
Loop
get_listing_data = tmp_result
End Function