VB.NET (2013) - Check string against huge file

I have a text file that is 125 MB in size; it contains 2.2 million records. I have another text file which doesn't match the original, and I need to find out where it differs. Normally, with a smaller file I would read each line and process it in some way, or read the whole file into a string and do likewise. However, the two files are too big for that, so I would like to create something to achieve my goal. Here's what I currently have (excuse the mess of it):
Private Sub refUpdateBtn_Click(sender As Object, e As EventArgs) Handles refUpdateBtn.Click
    Dim refOrig As String = refOriginalText.Text 'Original Reference File
    Dim refLatest As String = refLatestText.Text 'Latest Reference
    Dim srOriginal As StreamReader = New StreamReader(refOrig) 'start stream of original file
    Dim srLatest As StreamReader = New StreamReader(refLatest) 'start stream of latest file
    Dim recOrig, recLatest, baseDIR, parentDIR, recOutFile As String
    baseDIR = vb.Left(refOrig, InStrRev(refOrig, ".ref") - 1) 'find parent folder
    parentDIR = Path.GetDirectoryName(baseDIR) & "\"
    recOutFile = parentDIR & "Updated.ref"
    Me.Text = "Processing Reference File..." 'update the application
    Update()
    If Not File.Exists(recOutFile) Then
        FileOpen(55, recOutFile, OpenMode.Append)
        FileClose(55)
    End If
    Dim x As Integer = 0
    Do While srLatest.Peek() > -1
        Application.DoEvents()
        recLatest = srLatest.ReadLine
        recOrig = srOriginal.ReadLine ' check the original reference file
        Do
            If Not recLatest.Equals(recOrig) Then
                recOrig = srOriginal.ReadLine
            Else
                FileOpen(55, recOutFile, OpenMode.Append)
                Print(55, recLatest & Environment.NewLine)
                FileClose(55)
                x += 1
                count.Text = "Record No: " & x
                count.Refresh()
                srOriginal.BaseStream.Seek(0, SeekOrigin.Begin)
                GoTo 1
            End If
        Loop
1:
    Loop
    srLatest.Close()
    srOriginal.Close()
    FileClose(55)
End Sub
It's got poor programming and scary loops, but that's because I'm not a professional coder, just a guy trying to make his life easier.
Currently, this uses a form to take the original file and the latest file and outputs each line that matches into a new file. This is less than perfect, but I don't know how to cope with the large file sizes, as StreamReader.ReadToEnd crashes the program. I also don't need the output to be a copy of the latest input, but I don't know how to output only the records it doesn't find. Here's a sample of the records each file has:
doc:ARCHIVE.346CCBD3B06711E0B40E00163505A2EF
doc:ARCHIVE.346CE683B29811E0A06200163505A2EF
doc:ARCHIVE.346CEB15A91711E09E8900163505A2EF
doc:ARCHIVE.346CEC6AAA6411E0BEBB00163505A2EF
The program I have currently works, after a fashion, but I know there are better ways of doing it, and I'm sure much better ways of using the CPU and memory; I just don't know this level of programming. All I would like is for you to take a look and offer your best answers to all or some of the code. Tell me what you think will make it better, whether that helps with one line or all of it. I have no time limit on this because the code works, albeit slowly; I would just like someone to tell me where my code could be better and what I could do to get around the huge file sizes.

Your code is slow because it is doing a lot of file IO. You're on the right track by reading one line at a time, but this can be improved.
Firstly, I've created some test files based on the data that you provided. Those files contain three million lines and are about 130 MB in size (2.2 million records came to less than 100 MB, so I've increased the number of lines to get to the file size that you state).
Reading the entire file into a single string uses up about 600 MB of memory. Do this with two files (which I assume you were doing) and you have over 1 GB of memory used, which may have been causing the crash (you don't say what error was shown, if any, when the crash occurred, so I can only assume that it was an OutOfMemoryException).
Here's a few tips before I go through your code:
Use Using Blocks
This won't help with performance, but it does make your code cleaner and easier to read.
Whenever you're dealing with a file (or anything that implements the IDisposable interface), it's always a good idea to use a Using statement. This will automatically dispose of the file (which closes the file), even if an error happens.
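For example, here's a minimal sketch (the file name is just a placeholder):
Using reader As New StreamReader("C:\input.txt")
    ' Work with the file here; reader is disposed (and the file closed)
    ' automatically when the block exits, even if an exception is thrown.
    Dim firstLine As String = reader.ReadLine()
End Using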
Don't use FileOpen
The FileOpen method is outdated (and even stated as being slow in its documentation). There are better alternatives that you are already (almost) using: StreamWriter (the cousin of StreamReader).
Opening and closing a file two million times (like you are doing inside your loop) won't be fast. This can be improved by opening the file once outside the loop.
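In sketch form (assuming recOutFile holds the output path, as in your code, and recordsToWrite is an illustrative collection of lines):
' Open the output file once, write every record, then let Using close it.
Using updated As New StreamWriter(recOutFile, True)
    For Each record As String In recordsToWrite
        updated.WriteLine(record)
    Next
End Using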
DoEvents() is evil!
DoEvents is a legacy method from back in the VB6 days, and it's something that you really want to avoid, especially when you're calling it two million times in a loop!
The alternative is to perform all of your file processing on a separate thread so that your UI is still responsive.
Using a separate thread here is probably overkill, and there are a number of intricacies that you need to be aware of, so I have not used a separate thread in the code below.
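If you do want to try a separate thread later, one rough sketch (ProcessFiles is a hypothetical method containing the file work, and this assumes a .NET 4.5 project where Async/Await and System.Threading.Tasks are available) might look like:
Private Async Sub refUpdateBtn_Click(sender As Object, e As EventArgs) Handles refUpdateBtn.Click
    ' Capture control values on the UI thread before handing off to the worker.
    Dim origPath As String = refOriginalText.Text
    Dim latestPath As String = refLatestText.Text

    refUpdateBtn.Enabled = False
    Await Task.Run(Sub() ProcessFiles(origPath, latestPath))
    refUpdateBtn.Enabled = True ' back on the UI thread after the work completes
End Sub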
So let's look at each part of your code and see what we can improve.
Creating the output file
You're almost right here, but you're doing some things that you don't need to do. GetDirectoryName works with file names, so there's no need to remove the extension from the original file name first. You can also use the Path.Combine method to combine a directory and file name.
recOutFile = Path.Combine(Path.GetDirectoryName(refOrig), "Updated.ref")
Reading the files
Since you're looping through each line in the "latest" file and finding a match in the "original" file, you can continue to read one line at a time from the "latest" file.
But instead of reading a line at a time from the "original" file, then seeking back to the start when you find a match, you will be better off reading all of those lines into memory.
Now, instead of reading the entire file into memory (which took up 600 MB as I mentioned earlier), you can read each line of the file into an array. This will use up less memory, and is quite easy to do thanks to the File class.
originalLines = File.ReadAllLines(refOrig)
This reads all of the lines from the file and returns a String array. Searching through this array for matches will be slow, so instead of reading into an array, we can read into a HashSet(Of String). This will use up a bit more memory, but it will be much faster to search through.
originalLines = New HashSet(Of String)(File.ReadAllLines(refOrig))
Searching for matches
Since we now have all of the lines from the "original" file in an array or HashSet, searching for a line is very easy.
originalLines.Contains(recLatest)
Putting it all together
So let's put all of this together:
Private Sub refUpdateBtn_Click(sender As Object, e As EventArgs)
    Dim refOrig As String
    Dim refLatest As String
    Dim recOutFile As String
    Dim originalLines As HashSet(Of String)

    refOrig = refOriginalText.Text 'Original Reference File
    refLatest = refLatestText.Text 'Latest Reference
    recOutFile = Path.Combine(Path.GetDirectoryName(refOrig), "Updated.ref")

    Me.Text = "Processing Reference File..." 'update the application
    Update()

    originalLines = New HashSet(Of String)(File.ReadAllLines(refOrig))

    Using latest As New StreamReader(refLatest),
          updated As New StreamWriter(recOutFile, True)
        Do
            Dim line As String
            line = latest.ReadLine()

            ' ReadLine returns Nothing when it reaches the end of the file.
            If line Is Nothing Then
                Exit Do
            End If

            If originalLines.Contains(line) Then
                updated.WriteLine(line)
            End If
        Loop
    End Using
End Sub
This uses around 400 MB of memory and takes about 4 seconds to run.

Related

Read of text file never being released from memory

I'm using .NET Framework 4.6.2 (VB) for a Windows Service. I'm using NLog to write a log file without issue. I'm now adding a log viewer utility which will show the last 100 lines of the log file. I've used various methods to read the file but can't seem to escape the reality that I eventually need to iterate through the entire file to get to the lines that I need. That's not a problem.
Where I'm having an issue is that after I've finished reading the file, it NEVER seems to be released from memory. When I start my application, it's using approximately 16 MB of memory. After the read (of an at most 10 MB file) it's using around 38.5 MB. Even doing things like clearing the List(Of String) or forcing a garbage collection never fully releases the memory.
I'm using probably the simplest version of a read:
Dim LogEntries As List(Of String) = System.IO.File.ReadLines(LogFile).ToList()
LogEntries.Clear()
I am performing other tasks between the ReadLines and LogEntries.Clear() steps, but the issue is present even if I use only the lines shown above.
I would expect that clearing the LogEntries list would return the memory usage to approximately 16 MB, but the lowest I've been able to get it (after a GC.Collect()) is about 22 MB. Can anyone explain this to me?
The whole point of calling ReadLines is that it doesn't read every line up front. If you then call ToList on the result then you force it to wait until all lines are read. That's silly. If you want the last 100 lines then you have no choice but to read the whole lot, but there's no point keeping it all.
Dim lines = File.ReadAllLines(filePath)
lines = lines.Skip(lines.Length - 100).ToArray()
The first line reads the entire file into a String array and then the second line creates a second array containing just the last 100 elements and discards the first array.
Another option that would reduce memory consumption at the expense of performance would be this:
Dim lines As New List(Of String)

Using reader As New StreamReader(filePath)
    Do Until reader.EndOfStream
        lines.Add(reader.ReadLine())

        If lines.Count > 100 Then
            lines.RemoveAt(0)
        End If
    Loop
End Using
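One design note on that version: RemoveAt(0) shifts every remaining element each time it's called. If that overhead matters, a Queue(Of String) keeps the same sliding window more cheaply (a sketch, using the same filePath):
Dim lines As New Queue(Of String)

Using reader As New StreamReader(filePath)
    Do Until reader.EndOfStream
        lines.Enqueue(reader.ReadLine())

        ' Discard the oldest line so that only the last 100 are kept.
        If lines.Count > 100 Then
            lines.Dequeue()
        End If
    Loop
End Using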

How to search and display a specific line from a text file in VB.NET

Hi, I am trying to search for a line which contains what the user inputs in a text box and display the whole line. My code below doesn't display a message box after the button has been clicked, and I am not sure if the record has been found.
Dim filename, sr As String
filename = My.Application.Info.DirectoryPath + "\" + "mul.txt"
Dim file As String()
Dim i As Integer = 0
file = IO.File.ReadAllLines(filename)
Dim found As Boolean
Dim linecontain As Char
sr = txtsr.ToString
For Each line As String In file
    If line.Contains(sr) Then
        found = True
        Exit For
    End If
    i += 1
    If found = True Then
        MsgBox(line(i))
    End If
Next
End Sub
You should be calling ReadLines here rather than ReadAllLines. The difference is that ReadAllLines reads the entire file contents into an array first, before you can start processing any of it, while ReadLines doesn't read a line until you have processed the previous one. ReadAllLines is good if you want random access to the whole file or you want to process the data multiple times. ReadLines is good if you want to stop processing data when a line satisfies some criterion. If you're looking for a line that contains some text and you have a file with one million lines where the first line matches, ReadAllLines would read all one million lines whereas ReadLines would only read the first.
So, here's how you display the first line that contains specific text:
For Each line In File.ReadLines(filePath)
    If line.Contains(substring) Then
        MessageBox.Show(line)
        Exit For
    End If
Next
With regards to your original code, your use of i makes no sense. You seem to be using i as a line counter but there's no point because you're using a For Each loop so line contains the line. If you already have the line, why would you need to get the line by index? Also, when you try to display the message, you are using i to index line, which means that you're going to get a single character from the line rather than a single line from the array. If the index of the line is greater than the number of characters in the line then that is going to throw an IndexOutOfRangeException, which I'm guessing is what's happening to you.
This is what comes from writing code without knowing what it actually has to do first. If you had written out an algorithm before writing the code, it would have been obvious that the code didn't implement the algorithm. If you have no algorithm, though, you have nothing to compare your code against to make sure that it makes sense.

Avoid updating textbox in real time in vb.net

I have a very simple piece of code in a VB.NET program to load all the paths in a folder into a text box. The code works great; the problem is that it adds the lines in real time, so it takes about 3 minutes to load 20k files while the interface displays line by line.
This is my code:
Dim ImageryDB As String() = IO.Directory.GetFiles("c:\myimages\")

For Each image In ImageryDB
    txtbAllimg.AppendText(image & vbCrLf)
Next
How can I force my program to load the files in chunks or update the interface every second?
Thanks in advance
Yes, you can do that. You'll need to load the file names into an off-screen data structure of some kind rather than loading them directly into the control. Then you can periodically update the control to display whatever is loaded so far. However, I think you'll find that the slowness comes only from updating the control. Once you remove that part, there will be no need to update the control periodically during the loading process since it will be nearly instantaneous.
You could just load all of the file names into a string and then only set the text box to that string after it's been fully loaded, like this:
Dim imagePaths As String = ""

For Each image As String In Directory.GetFiles("c:\myimages\")
    imagePaths &= image & Environment.NewLine
Next

txtbAllimg.Text = imagePaths
However, that's not as efficient as using a StringBuilder:
Dim imagePaths As New StringBuilder()

For Each image As String In Directory.GetFiles("c:\myimages\")
    imagePaths.AppendLine(image)
Next

txtbAllimg.Text = imagePaths.ToString()
However, since the GetFiles method is already returning the complete list of paths to you as a string array, it would be even more convenient (and likely even more efficient) to just use the String.Join method to combine all of the items in the array into a single string:
txtbAllimg.Text = String.Join(Environment.NewLine, Directory.GetFiles("c:\myimages\"))
I know that this is not an answer to your actual question, but AppendText is slow. Using a ListBox and adding the items to it is approximately 3 times faster. The ListBox also has the benefit of being able to select an item easily (at least more easily than a TextBox).
For Each image In ImageryDB
    Me.ListBox1.Items.Add(image)
Next
However, there is probably an even more useful and faster way to do this: using FileInfo.
Dim dir As New IO.DirectoryInfo("C:\myImages")
Dim fileInfoArray As IO.FileInfo() = dir.GetFiles()
Dim fileInfo As IO.FileInfo

For Each fileInfo In fileInfoArray
    Me.ListBox2.Items.Add(fileInfo.Name)
Next

Let VB read a certain area in a text file, change it and save it

I want my program to read a certain part of a huge txt file, change one value and save the file again.
The file that needs editing looks like this:
168575 = {
    name="Hidda"
    female=yes
    dynasty=9601
    religion="catholic"
    culture="german"
    father=168573
    960.1.1 = {
        birth=yes
    }
    1030.1.1 = {
        death=yes
    }
}
My VB program takes the IDs of the blocks it has to change from another textbox, like this:
31060
106551
106550
168575
40713
106523
106522
106555
As you can see, the number I want changed is in the middle of the textbox. The code I use to get the number from the line and look for it in the huge file is:
Dim strText() As String
strText = Split(chars.Text, vbCrLf)
and later
If line.Contains(strText(0) & " = {") Then
    TextBox1.AppendText(line & Environment.NewLine)
These come together to form code like:
Dim strText() As String
strText = Split(chars.Text, vbCrLf)
Label4.Text = strText(0)
Dim line As String = Nothing
Dim lines2 As Integer = 0

Using reader2 As New StreamReader("c:/dutch.txt")
    While (reader2.Peek() <> -1)
        line = reader2.ReadLine()

        If line.Contains(strText(0) & " = {") Then
            TextBox1.AppendText(line & Environment.NewLine)
        End If

        lines2 = lines2 + 1
        Label2.Text = lines2
    End While
End Using
Naturally, this only writes the line that it found into a textbox. How do I get the whole block for the IDs I take from one textbox, change the culture to another value, and save it all again? And repeat this for all the IDs in the textbox? I'm not a coding legend, but this has been bothering me for ages now :(
There are a few issues to consider here. If you're dealing with a large text file as a "database" and you wish to edit only parts of it without affecting the other parts, then you may wish to investigate editing it as a binary file instead of as a text stream. This has several downsides, however, since it means that you have to be aware of how big your records are and deal with things like padding.
If you can spare the disk IO and RAM (I don't know how huge you mean when you say huge) it would probably be vastly easier to simply load the entire file into an array or List(Of String), find the line representing the person, seek a few lines below that for the field you want (you said culture), change that field in the array or List, and then just resave the entire array or List back to a text file. This would make it fairly easy to do inserts and you wouldn't have to worry about padding, mostly you'd just have to worry about the line endings and the file encoding (and the amount of disk IO and RAM).
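To make that concrete, here's a rough sketch of the load-edit-resave approach. Everything in it is illustrative: the paths, the newCulture value, and the assumption that the culture= line appears inside the person's block before any nested sub-block.
Dim lines As List(Of String) = IO.File.ReadAllLines("c:/dutch.txt").ToList()
Dim targetId As String = "168575"              ' would come from your textbox
Dim newCulture As String = "culture=""dutch""" ' hypothetical replacement value

' Find the line that opens the person's block, e.g. "168575 = {".
Dim i As Integer = lines.FindIndex(Function(l) l.Contains(targetId & " = {"))

If i >= 0 Then
    ' Walk forward until the culture line (or the end of the block) is reached.
    Dim j As Integer = i + 1
    Do While j < lines.Count AndAlso Not lines(j).Trim().StartsWith("}")
        If lines(j).Trim().StartsWith("culture=") Then
            ' Preserve the line's original leading whitespace.
            Dim indent As String = lines(j).Substring(0, lines(j).Length - lines(j).TrimStart().Length)
            lines(j) = indent & newCulture
            Exit Do
        End If
        j += 1
    Loop
End If

IO.File.WriteAllLines("c:/dutch.txt", lines)
You would repeat the find-and-replace step for each ID from the textbox and write the file back out once at the end.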
Finally, I would suggest that using a custom format text file as a database is generally a "bad" idea in 2014 unless you have a really good reason to be doing that. Your format looks very similar to JSON - perhaps you could consider using that instead of your existing format. Then there would be libraries such as JSON.Net to help you do the right thing and you wouldn't need to do any custom IO code.

A faster way to read lines in text files quickly

My application is looking at huge text files (upwards to half a million lines) from a proxy server log. The problem is that a normal StreamRead iteration of the logs can take an excessive amount of time to process, so I'm looking for something faster.
On the form, the user picks the file they need to parse and enters up to three site filters to check for. The application then opens the file and begins to parse the date stamp and website URL from each line in the file. The average speed is about two lines per second, so a file with 200,000 lines in it will take about 28 hours to process.
I've been reading about the Task class, and I'm thinking this would probably be the route to take, but Microsoft doesn't give a very good example, so how can I accomplish it?
I think you could use File.ReadLines() when reading large files.
According to MSDN:
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
For more detail, see MSDN File.ReadLines()
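As a rough sketch of how ReadLines might be used for the log scenario (the filter values and file path are placeholders, not the asker's actual code):
Dim filters As String() = {"icc", "facebook", "illinois"}
Dim matches As New Dictionary(Of String, Integer)

For Each filter As String In filters
    matches(filter) = 0
Next

' Lines are enumerated lazily, so the whole file is never held in memory.
For Each line As String In IO.File.ReadLines("C:\logs\proxy.log")
    For Each filter As String In filters
        If line.Contains(filter) Then
            matches(filter) += 1
        End If
    Next
Next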
Instead of guessing about why it is slow (is it reading the file, processing the lines, etc.?), start by measuring how long it takes to read the file line by line.
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    Dim stpw As New Stopwatch
    Dim path As String = "path to your file here"
    Dim sr As New IO.StreamReader(path)
    Dim linect As Integer = 0

    stpw.Restart()

    Do While Not sr.EndOfStream
        Dim s As String = sr.ReadLine
        linect += 1
    Loop

    stpw.Stop()
    sr.Close()

    Debug.WriteLine(stpw.Elapsed.ToString)
    Debug.WriteLine(linect)
End Sub
I ran this against a test file I have that is 20 MB. It is close to 3,000,000 lines long (the lines are very short). It took about 0.3 seconds to run.
After you run this you will know whether the problem is the read or the processing, or both.
Thanks, dbasnett... the results were:
00:00:00.6991336
172900
Believe it or not, I found the problem. I had the textbox inside a GroupBox and was using the GroupBox.Text property to update statistics back to the user, using GroupBox.Refresh() to update the "line x of y" and matches found, etc., so the user had some idea of what was being found.
By leaving that information out and putting in a progress bar, the speed of the scans went up dramatically. Using 3 filters, I was able to parse 172900 lines in 3 minutes 19 seconds:
Scan complete!
Process complete!
Scanned 172900 lines out of 172900 lines.
Percentage (icc): 0.0052% (900 matches)
Percentage (facebook): 0.0057% (988 matches)
Percentage (illinois): 0.0005% (95 matches)
Total Matches: 1983
Elapsed Time: 00:03:19.1088851