Loop takes forever with large count - vb.net

This loop takes forever to run once the number of items approaches or exceeds 1,000: close to 10 minutes. It needs to run fast for counts all the way up to 30-40 thousand.
'Add all Loan Record Lines
Dim loans As List(Of String) = lar.CreateLoanLines()
Dim last As Integer = loans.Count - 1
For i = 0 To last
If i = last Then
s.Append(loans(i))
Else
s.AppendLine(loans(i))
End If
Next
s is a StringBuilder. The first line there
Dim loans As List(Of String) = lar.CreateLoanLines()
Runs in only a few seconds even with thousands of records. It's the actual loop that's taking a while.
How can this be optimized???

Set the initial capacity of your StringBuilder to a large value. (Ideally, large enough to contain the entire final string.) Like so:
s = new StringBuilder(loans.Count * averageExpectedStringSize)
If you don't specify a capacity, the builder will likely end up doing a large amount of internal reallocations, and this will kill performance.
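If the lines are already in memory, the capacity doesn't have to be guessed from an average; it can be computed exactly. A minimal sketch, assuming the list holds no Nothing entries (JoinLines and the sample data are made-up names for illustration, not part of the original code):

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module CapacityDemo
    ' Joins the lines with Environment.NewLine between them (none after the last).
    Function JoinLines(loans As List(Of String)) As String
        ' Exact size: every character plus one separator between each pair of lines.
        Dim capacity As Integer = 0
        For Each line As String In loans
            capacity += line.Length
        Next
        capacity += Environment.NewLine.Length * Math.Max(loans.Count - 1, 0)

        ' Pre-sized builder, so it should not need to grow while appending.
        Dim s As New StringBuilder(capacity)
        For i As Integer = 0 To loans.Count - 1
            If i > 0 Then s.Append(Environment.NewLine)
            s.Append(loans(i))
        Next
        Return s.ToString()
    End Function

    Sub Main()
        Dim loans As New List(Of String) From {"loan 1", "loan 2", "loan 3"}
        Console.WriteLine(JoinLines(loans))
    End Sub
End Module
```

Appending the separator before every line except the first also removes the need for the "is this the last item" check inside the loop.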

You could take the special case out of the loop, so you wouldn't need to be checking it inside the loop. I would expect this to have almost no impact on performance, however.
For i = 0 To last - 1
s.AppendLine(loans(i))
Next
If last >= 0 Then s.Append(loans(last)) ' guard against an empty list

Though, internally, the code is very similar, if you're using .NET 4, I'd consider replacing your method with a single call to String.Join:
Dim result As String = String.Join(Environment.NewLine, lar.CreateLoanLines())

I can't see how the code you have pointed out could be slow unless:
The strings you are dealing with are huge (e.g. if the resulting string is 1 gigabyte).
You have another process running on your machine consuming all your clock cycles.
You haven't got enough memory in your machine.
Try stepping through the code line by line and check that the strings contain the data that you expect, and check Task Manager to see how much memory your application is using and how much free memory you have.

My guess would be that every time you're using append it's creating a new string. You seem to know how much memory you'll need, if you allocate all of the memory first and then just copy it into memory it should run much faster. Although I may be confused as to how vb.net works.

You could look at doing this another way.
Dim str As String = String.Join(Environment.NewLine, loans.ToArray)

Related

Match Words and Add Quantities vb.net

I am trying to program a way to read a text file and match all the values and their quantities. For example if the text file is like this:
Bread-10 Flour-2 Orange-2 Bread-3
I want to create a list with the total quantity of all the common words. I began my code, but I am having trouble understanding how to sum the values. I'm not asking for anyone to write the code for me, but I am having trouble finding resources. I have the following code:
Dim query = From data In IO.File.ReadAllLines("C:\User\Desktop\doc.txt")
Let name As String = data.Split("-")(0)
Let quantity As Integer = CInt(data.Split("-")(1))
Let sum As Integer = 0
For i As Integer = 0 To query.Count - 1
For j As Integer = i To
Next
Thanks
OK, let's break this down. I have not seen the LET command used for a long time (not since the GWBASIC days!).
But, that's ok.
So, first up, we are going to assume your text file is like this:
Bread-10
Flour-2
Orange-2
Bread-3
As opposed to this:
Bread-10 Flour-2 Orange-2 Bread-3
Now, we could read one line, and then process the information. Or we can read all lines of text, and THEN process the data. If the file is not huge (say a few hundred lines), then performance is not much of an issue, so let's just read in the whole file in one shot (and your code also had this idea).
Your starting code is good. So, let's keep it (well OK, very close).
A few things:
We don't need Let here. Older BASIC languages used it for assignment; in VB.NET the Let in your code is a LINQ query clause rather than an assignment statement, and a plain loop does not need it. (You will still see Let-style syntax floating around in older "class module" or "custom class" code, but let's leave that for another day.)
Now the next part? We could start building up an array, look for the existing value, and then add to it. However, this would require a few extra arrays, and a few extra loops.
However, in .NET land, we have a cool thing called a dictionary.
And that's just a fancy term for a collection VERY much like an array, but with some extra "fancy" features. The fancy feature is that it lets us put things into the handy list by a "key" name, and then pull that "value" out by the key.
This saves us a good amount of extra looping code.
And it also means we don't need an array for the results.
This key system is ALSO very fast (behind the scenes it uses a cool concept: hashing).
So, our code to do this would look like this:
Note I could have saved a few lines here or there - but that would make this code hard to read.
Given that you appear to have Fortran or older BASIC experience, let's try to keep the code style somewhat similar. It is stunning how much 40-year-old GWBASIC-style syntax VB.NET still seems to accept.
Do note that arrays in VB.NET do have some fancy "find" options, but the dictionary structure is even nicer. It also means we can often traverse the results without needing, say, a For i = 1 To end-of-array loop, and having to pull out values that way.
We can use for each.
So this would work:
Dim MyData() As String ' an array() of strings - one line per element
MyData = IO.File.ReadAllLines("c:\test5\doc.txt") ' read each line into the array
Dim colSums As New Dictionary(Of String, Integer) ' to hold our values and sum them
Dim sKey As String
Dim sValue As Integer
For Each strLine As String In MyData
    sKey = Split(strLine, "-")(0)
    sValue = CInt(Split(strLine, "-")(1)) ' convert the text after the dash to a number
    If colSums.ContainsKey(sKey) Then
        colSums(sKey) = colSums(sKey) + sValue
    Else
        colSums.Add(sKey, sValue)
    End If
Next
' display results
Dim KeyPair As KeyValuePair(Of String, Integer)
For Each KeyPair In colSums
    Debug.Print(KeyPair.Key & " = " & KeyPair.Value)
Next
The above results in this output in the debug window:
Bread = 13
Flour = 2
Orange = 2
I was tempted here to write this code using just a pure array in VB.NET, as that would give you a good idea of the "older" types of coding and syntax we could use here, an approach that harks all the way back to those older PC BASIC systems.
While the dictionary feature is more advanced, it is worth the learning curve here, and it makes this problem a lot easier. If this were for a much longer list, I would start to consider introducing some kind of database system.
However, without a database, the dictionary is a welcome approach thanks to its "key" lookup ability, which avoids having to loop. It is also very fast, so the result is not much looping code and, better yet, less code to write.
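If the file really does hold everything on one line, as in the question (Bread-10 Flour-2 Orange-2 Bread-3), the same dictionary idea still works; the text just needs to be split on spaces first. A rough sketch, assuming every entry has the form name-number and silently skipping anything else (SumItems is a made-up helper name, not something from the code above):

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module SumQuantities
    ' Sums quantities per name from text such as "Bread-10 Flour-2 Orange-2 Bread-3".
    Function SumItems(text As String) As Dictionary(Of String, Integer)
        Dim totals As New Dictionary(Of String, Integer)
        ' Split on spaces and line breaks so both file layouts are accepted.
        Dim separators() As Char = {" "c, ControlChars.Cr, ControlChars.Lf}

        For Each entry As String In text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
            Dim parts() As String = entry.Split("-"c)
            Dim quantity As Integer
            ' Skip entries that are not in name-number form.
            If parts.Length <> 2 OrElse Not Integer.TryParse(parts(1), quantity) Then Continue For

            Dim current As Integer
            totals.TryGetValue(parts(0), current) ' current is 0 when the name is new
            totals(parts(0)) = current + quantity
        Next
        Return totals
    End Function

    Sub Main()
        For Each pair As KeyValuePair(Of String, Integer) In SumItems("Bread-10 Flour-2 Orange-2 Bread-3")
            Console.WriteLine(pair.Key & " = " & pair.Value)
        Next
    End Sub
End Module
```

Using Integer.TryParse instead of CInt means a malformed entry is ignored rather than throwing; whether that is the right behavior depends on how much you trust the file.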

Mid() usage and for loops - Is this good practice?

OK, so I was in college talking to my teacher, and he said my code isn't good practice. I'm a bit confused as to why, so here's the situation. We created a For loop, but he declared his loop counter outside of the loop because he considers that good practice, even though we never used the variable later on in the code, so to me it looks like a waste of memory. We did more in the code than just show a message box, but the idea was to get each character from a string and do something with it. He also used the Mid() function to retrieve each character in the string, while I indexed the string variable directly. Here's an example of how he would write his code:
Dim i As Integer = 0
Dim justastring As String = "test"
For i = 1 To justastring.Length ' Mid() positions are 1-based
    MsgBox(Mid(justastring, i, 1))
Next
And here's an example of how I would write my code:
Dim justastring As String = "test"
For i = 0 To justastring.Length - 1 ' string indexes are 0-based
    MsgBox(justastring(i))
Next
Would anyone be able to provide the advantages and disadvantages of each method and why and whether or not I should continue how I am?
Another approach would be, to just use a For Each on the string.
This way no index variable is needed.
Dim justastring As String = "test"
For Each c As Char In justastring
MsgBox(c)
Next
I would suggest doing it your way, because otherwise you have variables hanging around consuming memory (albeit a small amount). More importantly, it is better practice to define variables with as little scope as possible. In your teacher's code, the variable i is still accessible when the loop is finished. There are occasions when this is desirable, but normally, if you're only using a variable in a limited amount of code, you should declare it within the smallest block in which it is needed.
As for your question about the Mid function: as you know, individual characters can be accessed simply by treating the string as an array of characters. After some basic benchmarking, using the Mid function takes a lot longer than accessing the character by its index. In relatively simple bits of code this doesn't make much difference, but if you're doing it millions of times in a loop, it makes a huge difference.
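For anyone who wants to repeat that comparison, here is one way such a benchmark could be set up. This is only a sketch, not the code used for the measurement mentioned above; the string length and pass count are arbitrary, and the timings will vary by machine and build configuration:

```vbnet
Imports System
Imports System.Diagnostics
Imports Microsoft.VisualBasic

Module MidVsIndex
    Sub Main()
        Dim sample As String = New String("x"c, 1000)
        Const passes As Integer = 10000
        Dim total As Long = 0 ' accumulated so the loops do real work

        Dim sw As Stopwatch = Stopwatch.StartNew()
        For pass As Integer = 1 To passes
            For i As Integer = 1 To sample.Length       ' Mid is 1-based
                total += Mid(sample, i, 1).Length       ' returns a new one-character String
            Next
        Next
        sw.Stop()
        Console.WriteLine("Mid:   " & sw.ElapsedMilliseconds & " ms")

        sw.Restart()
        For pass As Integer = 1 To passes
            For i As Integer = 0 To sample.Length - 1   ' the indexer is 0-based
                total += AscW(sample(i))                ' reads a Char directly
            Next
        Next
        sw.Stop()
        Console.WriteLine("Index: " & sw.ElapsedMilliseconds & " ms")
        Console.WriteLine("Checksum: " & total)
    End Sub
End Module
```

Run it as a Release build without the debugger attached, otherwise the numbers mostly reflect debugging overhead.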
There are other factors to consider. Such as code readability and modification of the code, but there are plenty of websites dealing with that sort of thing.
Finally I would suggest changing some compiler options in your visual studio
Option Strict to On
Option Infer to Off
Option Explicit to On
It means writing more code, but the code is safer and you'll make fewer mistakes. Have a look here for an explanation
In your code, it would mean that you have to write
Dim justastring As String = "test"
For i As Integer = 0 To justastring.Length - 1
    MsgBox(justastring(i))
Next
This way, you know that i is definitely an integer. Consider the following ..
Dim i
Have you any idea what type it is? Me neither.
The compiler doesn't know either, so it defines it as Object, which could hold anything: a string, an integer, a list...
Consider this code.
Dim i
Dim x
x = "ab"
For i = x To endcount - 1
t = Mid(s, 999)
Next
The compiler will compile it, but when it is executed you'll get a System.ArgumentException. In this case it's easy to see what is wrong, but often it isn't. And numbers in strings can be a whole new can of worms.
Hope this helps.

Unexpected OutOfMemoryException in ILNumerics

The following VB.NET code gives me an out-of-memory exception. Does anybody know why?
Dim vArray As ILArray(Of Double) = ILMath.rand(10000000)
Using ILScope.Enter(vArray)
For i As Integer = 1 To 100
vArray = ILMath.add(vArray, vArray)
Next
End Using
Thank you very much.
In this toy example you can simply remove the artificial scope and it will run fine:
Dim vArray As ILArray(Of Double) = ILMath.rand(10000000)
For i As Integer = 1 To 100
vArray = ILMath.add(vArray, vArray)
Next
Console.WriteLine("OK: " + vArray(0).ToString())
Console.ReadKey()
However, in a more serious situation, ILScope will be your friend. As stated on the ILNumerics page, an artificial scope ensures deterministic memory management:
All arrays created inside the scope are disposed once the block was left.
Otherwise one had to rely on the GC for cleanup. And, as you know, this involves a gen 2 collection for large objects – with all disadvantages in terms of performance.
In order to be able to dispose the arrays they need to be collected and tracked somehow. Whether or not this qualifies for the term 'memory leak' is rather a philosophical question. I will not go into it here. The deal is: after the instruction pointer runs out of the scope these arrays are taken care of: their memory is put into the memory pool and will be reused. As a consequence, no GC will be triggered.
The scheme is especially useful for long-running operations and for large data. Currently, the arrays are released only AFTER the scope block is left. So if you create an algorithm or loop which requires more memory than is available on your machine, you need to clean up DURING the loop already:
Dim vArray As ILArray(Of Double) = ILMath.rand(10000000)
For i As Integer = 1 To 100
Using ILScope.Enter
vArray.a = ILMath.add(vArray, vArray)
' ...
End Using
Next
Here, the scope cleans up the memory after each iteration of the loop. This affects all local arrays assigned within the loop body. If we want an array value to survive the loop iteration we can assign to its .a property as shown with vArray.a.

A faster way to read lines in text files quickly

My application is looking at huge text files (upwards of half a million lines) from a proxy server log. The problem is that a normal StreamReader iteration over the logs can take an excessive amount of time, so I'm looking for something faster.
On the form, the user picks the file they need to parse and enters up to three site filters to check for. The application then opens the file and begins to parse the date stamp and website URL from each line in the file. The average speed is about two lines per second, so for a file with 200,000 lines in it, this process will take about 28 hours to process a file.
I've been reading up on the Task class, and I'm thinking this would probably be the route to take, but Microsoft doesn't give a very good example, so how can I accomplish it?
I think you could use File.ReadLines() when reading large files.
According to MSDN :
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
For more detail, see MSDN File.ReadLines()
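As a rough illustration of how File.ReadLines could be combined with the site filters from the question: the sketch below counts, per filter, the lines that contain it. CountMatches, the sample log lines, and the case-insensitive substring match are assumptions made for the example, not details taken from the original application.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module LogScan
    ' Counts, for each filter, the lines that contain it (case-insensitive).
    Function CountMatches(logPath As String, filters As List(Of String)) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)
        For Each f As String In filters
            counts(f) = 0
        Next

        ' File.ReadLines enumerates lazily, so the whole file is never loaded at once.
        For Each line As String In File.ReadLines(logPath)
            For Each f As String In filters
                If line.IndexOf(f, StringComparison.OrdinalIgnoreCase) >= 0 Then counts(f) += 1
            Next
        Next
        Return counts
    End Function

    Sub Main()
        ' Write a tiny sample log to a temporary file so the example is self-contained.
        Dim logPath As String = Path.GetTempFileName()
        File.WriteAllLines(logPath, New String() {
            "2014-01-01 10:00 http://facebook.com/a",
            "2014-01-01 10:01 http://example.com/b"})

        Dim filters As New List(Of String) From {"facebook", "illinois"}
        For Each pair As KeyValuePair(Of String, Integer) In CountMatches(logPath, filters)
            Console.WriteLine(pair.Key & ": " & pair.Value & " matches")
        Next
        File.Delete(logPath)
    End Sub
End Module
```

Note that nothing here touches the UI inside the loop; any progress reporting is best done only every few thousand lines.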
Instead of guessing about why it is slow (is it reading the file, processing the lines, etc.?), start by measuring how long it takes to read the file line by line.
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
Dim stpw As New Stopwatch
Dim path As String = "path to your file here"
Dim sr As New IO.StreamReader(path)
Dim linect As Integer = 0
stpw.Restart()
Do While Not sr.EndOfStream
Dim s As String = sr.ReadLine
linect += 1
Loop
stpw.Stop()
sr.Close()
Debug.WriteLine(stpw.Elapsed.ToString)
Debug.WriteLine(linect)
End Sub
I ran this against a 20 MB test file I have. It is close to 3,000,000 lines long (the lines are very short). It took about 0.3 seconds to run.
After you run this you will know whether the problem is the read or the processing, or both.
Thanks, dbasnett... the results were:
00:00:00.6991336
172900
Believe it or not, I found the problem. I had the textbox inside a GroupBox and was using the GroupBox.Text property to report statistics back to the user, calling GroupBox.Refresh() to update the line x of y, matches found, etc. so the user had some idea of what was being found.
By leaving that information out and putting in a progress bar, the speed of the scans went up dramatically. Using 3 filters, I was able to parse 172900 lines in 3 minutes 19 seconds (roughly 870 lines per second, up from about two):
Scan complete!
Process complete!
Scanned 172900 lines out of 172900 lines.
Percentage (icc): 0.0052% (900 matches)
Percentage (facebook): 0.0057% (988 matches)
Percentage (illinois): 0.0005% (95 matches)
Total Matches: 1983
Elapsed Time: 00:03:19.1088851

Is Try/Catch ever LESS expensive than a hash lookup?

I'm aware that exception trapping can be expensive, but I'm wondering if there are cases when it's actually less expensive than a lookup?
For example, if I have a large dictionary, I could either test for the existence of a key:
If MyDictionary.ContainsKey(MyKey) Then _
MyValue = MyDictionary(MyKey) ' This is 2 lookups just to get the value.
Or, I could catch an exception:
Try
MyValue = MyDictionary(MyKey) ' Only doing 1 lookup now.
Catch e As Exception
' Didn't find it.
End Try
Is exception trapping always more expensive than lookups like the above, or is it less so in some circumstances?
The thing about dictionary lookups is that they happen in constant or near-constant time. It takes your computer about the same amount of time whether your dictionary holds one item or one million items. I bring this up because you're worried about making two lookups in a large dictionary, and reality is that it's not much different from making two lookups in a small dictionary. As a side note, one of the implications here is that dictionaries are not always the best choice for small collections, though I normally find the extra clarity still outweighs any performance issues for those small collections.
One of the things that determines just how fast a dictionary can make its lookups is how long it takes to generate a hash value for a particular object. Some objects can do this much faster than others. That means the answer here depends on the kind of object in your dictionary. Therefore, the only way to know for sure is to build a version that tests each method a few hundred thousand times to find out which completes the set faster.
Another factor to keep in mind here is that it's mainly just the Catch block that is slow with exception handling, and so you'll want to look for the right combination of lookup hits and misses that reasonably matches what you'd expect in production. For this reason, you can't find a general guideline here, or if you do it's likely to be wrong. If you only rarely have a miss, then I would expect the exception handler to do much better (and, by virtue of a miss being somewhat, well, exceptional, it would also be the right solution). If you miss more often, I might prefer a different approach.
And while we're at it, let's not forget about Dictionary.TryGetValue()
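For reference, TryGetValue does the existence check and the read in a single lookup and never throws for a missing key. A small sketch using the names from the question (the dictionary contents here are made up for the example):

```vbnet
Imports System
Imports System.Collections.Generic

Module TryGetValueDemo
    Sub Main()
        Dim MyDictionary As New Dictionary(Of String, Integer) From {{"One", 1}, {"Two", 2}}
        Dim MyValue As Integer

        ' One lookup: returns True and fills MyValue when the key exists.
        If MyDictionary.TryGetValue("Two", MyValue) Then
            Console.WriteLine("Found: " & MyValue)
        End If

        ' Missing key: returns False (MyValue is reset to 0) and no exception is thrown.
        If Not MyDictionary.TryGetValue("Missing", MyValue) Then
            Console.WriteLine("Didn't find it.")
        End If
    End Sub
End Module
```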
I tested the performance of ContainsKey vs. Try/Catch, both with and without the debugger attached (the result screenshots are not reproduced here).
Tested on Release build of a Console application with just the Sub Main and below code. ContainsKey is ~37000 times faster with debugger and still 355 times faster without debugger attached, so even if you do two lookups, it would not be as bad as if you needed to catch an extra exception. This is assuming you are looking for missing keys quite often.
Dim dict As New Dictionary(Of String, Integer)
With dict
.Add("One", 1)
.Add("Two", 2)
.Add("Three", 3)
.Add("Four", 4)
.Add("Five", 5)
.Add("Six", 6)
.Add("Seven", 7)
.Add("Eight", 8)
.Add("Nine", 9)
.Add("Ten", 10)
End With
Dim stw As New Stopwatch
Dim iterationCount As Long = 0
Do
stw.Start()
If Not dict.ContainsKey("non-existing key") Then 'always true
stw.Stop()
iterationCount += 1
End If
If stw.ElapsedMilliseconds > 5000 Then Exit Do
Loop
Dim stw2 As New Stopwatch
Dim iterationCount2 As Long = 0
Do
Try
stw2.Start()
Dim value As Integer = dict("non-existing key") 'always throws exception
Catch ex As Exception
stw2.Stop()
iterationCount2 += 1
End Try
If stw2.ElapsedMilliseconds > 5000 Then Exit Do
Loop
MsgBox("ContainsKey: " & iterationCount / 5 & " per second, TryCatch: " & iterationCount2 / 5 & " per second.")
If you are trying to find an item in a data structure of some kind which is not easily searched (e.g. finding an item containing the word "flabbergasted" in an unindexed string array of 100K items), then yes, letting it throw the exception would be faster because you'd only be doing the look-up once. If you check whether the item exists first and then get the item, you are doing the look-up twice. However, in your example, where you are looking up an item in a dictionary (hash table), the lookup should be very quick, so doing it twice would likely be faster than letting it fail, but it's hard to say without testing it. It all depends on how quickly the hash value for the object can be calculated and how many items in the list share the same hash value.
As others have suggested, in the case of the Dictionary, the TryGetValue would provide the best of both methods. Other list types offer similar functionality.