I have an application that reads a 5 GB text file line by line and converts double-quoted, comma-delimited strings to pipe-delimited format.
For example: "Smith, John","Snow, John" --> Smith, John|Snow, John
I have provided my code below. My question is: is there a more efficient way of processing large files?
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""
Do While strRead.Peek <> -1
    line = strRead.ReadLine
    Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
    Dim replacement As String = "|"
    Dim regEx As New Regex(pattern)
    Dim newLine As String = regEx.Replace(line, replacement)
    newLine = newLine.Replace(Chr(34), "")
    strWrite.WriteLine(newLine)
Loop
strWrite.Close()
UPDATED CODE
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""
Do While strRead.Peek <> -1
    line = strRead.ReadLine
    line = line.Replace(Chr(34) + Chr(44) + Chr(34), "|")
    line = line.Replace(Chr(34), "")
    strWrite.WriteLine(line)
Loop
strWrite.Close()
I tested your code and attempted to make a speed improvement by accumulating output lines into a StringBuilder. I also moved the regex declaration outside the loop.
When that did not work, I examined the CPU usage and disk I/O with Windows Process Monitor and it turned out that the bottleneck is the CPU (even when using an HDD instead of an SSD).
That led me to try an alternative method for modifying the text: if all you need to do is replace "," with | and remove any remaining double-quotes, then
newLine = line.Replace(""",""", "|").Replace("""", "")
turns out to be much faster (roughly fourfold in my testing) than using a regex.
(Further improvement might be possible with multi-threading, as @Werdna suggested, as long as more than one processor is available and you can coordinate writing back the modified data in the correct order.)
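Putting the plain string replacement into a streaming loop could look like the sketch below. It is only an illustration under some assumptions: the module and method names (PipeConverter, ConvertFile) are made up for this example, File.ReadLines is used instead of the StreamReader/Peek loop, and the input is assumed to contain no escaped quotes inside fields.

```vbnet
Imports System.IO

Module PipeConverter
    ' Hypothetical helper: converts "a","b" style lines to a|b, one line at a time.
    Sub ConvertFile(inputPath As String, outputPath As String)
        Using writer As New StreamWriter(outputPath)
            ' File.ReadLines enumerates lazily, so the whole file is never held in memory.
            For Each line As String In File.ReadLines(inputPath)
                ' Replace "," with | first, then drop the remaining double quotes.
                writer.WriteLine(line.Replace(""",""", "|").Replace("""", ""))
            Next
        End Using ' flushes and closes the writer even if an exception is thrown
    End Sub

    Sub Main()
        ConvertFile("C:\LargeFile.csv", "C:\ProcessedFile.txt")
    End Sub
End Module
```

The Using block matters mostly for robustness: the output is flushed and closed even when processing fails partway through a multi-gigabyte file.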
This is a follow-on question to "Select block of text and merge into new document".
I have an SGM document to which start/stop comments have been added. I need to extract the text between the start and stop comments so I can put it in a temporary file for modification. Right now my code selects everything, including the start/stop comments and the data outside them.
Dim DirFolder As String = txtDirectory.Text
Dim Directory As New IO.DirectoryInfo(DirFolder)
Dim allFiles As IO.FileInfo() = Directory.GetFiles("*.sgm")
Dim singleFile As IO.FileInfo
Dim Prefix As String
Dim newMasterFilePath As String
Dim masterFileName As String
Dim newMasterFileName As String
Dim startMark As String = "<!--#start#-->"
Dim stopMark As String = "<!--#stop#-->"
searchDir = txtDirectory.Text
Prefix = txtBxUnique.Text
For Each singleFile In allFiles
    If File.Exists(singleFile.FullName) Then
        Dim fileName = singleFile.FullName
        Debug.Print("file name : " & fileName)
        ' A backup first
        Dim backup As String = fileName & ".bak"
        File.Copy(fileName, backup, True)
        ' Load lines from the source file in memory
        Dim lines() As String = File.ReadAllLines(backup)
        ' Now re-create the source file and start writing lines inside a block
        ' Evaluate all the lines in the file.
        ' Set insideBlock to false
        Dim insideBlock As Boolean = False
        Using sw As StreamWriter = File.CreateText(backup)
            For Each line As String In lines
                If line = startMark Then
                    ' start writing at the line below
                    insideBlock = True
                    ' Evaluate if the next line is <!Stop>
                ElseIf line = stopMark Then
                    ' Stop writing
                    insideBlock = False
                ElseIf insideBlock = True Then
                    ' Write the current line in the block
                    sw.WriteLine(line)
                End If
            Next
        End Using
    End If
Next
This is the example text to test on.
<chapter id="Chapter_Overview"> <?Pub Lcl _divid="500" _parentid="0">
<title>Learning how to gather data</title>
<!--#start#-->
<section>
<title>ALTERNATE MISSION EQUIPMENT</title>
<para0 verdate="18 Jan 2019" verstatus="ver">
<title>
<applicabil applicref="xxx">
</applicabil>Three-Button Trackball Mouse</title>
<para>This is the example to grab all text between start and stop comments.
</para></para0>
</section>
<!--#stop#-->
Things to note: the start and stop comments ALWAYS fall on a new line, and a document can have multiple start/stop sections.
I thought about maybe using a regex like this:
(<section>[\w+\w]+.*?<\/section>)\R(<\?Pub _gtinsert.*>\R<pgbrk pgnum.*?>\R<\?Pub /_gtinsert>)*
Or maybe use IndexOf and LastIndexOf, but I couldn't get that working.
You can read the entire file and split it on the string array {"<!--#start#-->", "<!--#stop#-->"}. For a file with one start/stop pair, that gives:
Element 0: Text before "<!--#start#-->"
Element 1: Text between "<!--#start#-->" and "<!--#stop#-->"
Element 2: Text after "<!--#stop#-->"
Take element 1 and write it to your backup. StringSplitOptions.None keeps empty elements, so element 1 is still the block even when nothing precedes the start marker.
Dim text = File.ReadAllText(backup).Split({startMark, stopMark}, StringSplitOptions.None)(1)
Using sw As StreamWriter = File.CreateText(backup)
    sw.Write(text)
End Using
Edit to address comment
I did make the original code a little compact. It can be expanded into the following, which allows you to add some validation:
Dim text = File.ReadAllText(backup)
' StringSplitOptions.None keeps empty elements, so one start/stop pair always yields exactly 3 elements
Dim split = text.Split({startMark, stopMark}, StringSplitOptions.None)
If split.Length <> 3 Then Throw New Exception("File didn't contain exactly one pair of delimiters.")
text = split(1)
Using sw As StreamWriter = File.CreateText(backup)
    sw.Write(text)
End Using
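Since the question mentions that a document can contain several start/stop sections, here is a hedged sketch of one way to collect every block rather than only the first. The module and function names (BlockExtractor, ExtractBlocks) are hypothetical, and the sketch assumes markers are never nested; an unmatched start marker at the end of the text is simply ignored.

```vbnet
Imports System
Imports System.Collections.Generic

Module BlockExtractor
    ' Hypothetical helper: returns the text of every startMark...stopMark block, in document order.
    Function ExtractBlocks(text As String, startMark As String, stopMark As String) As List(Of String)
        Dim blocks As New List(Of String)
        Dim pos As Integer = 0
        Do
            Dim startIdx As Integer = text.IndexOf(startMark, pos, StringComparison.Ordinal)
            If startIdx < 0 Then Exit Do ' no further start marker
            Dim contentStart As Integer = startIdx + startMark.Length
            Dim stopIdx As Integer = text.IndexOf(stopMark, contentStart, StringComparison.Ordinal)
            If stopIdx < 0 Then Exit Do ' unmatched start marker: ignore the tail
            blocks.Add(text.Substring(contentStart, stopIdx - contentStart))
            pos = stopIdx + stopMark.Length ' continue searching after this block
        Loop
        Return blocks
    End Function
End Module
```

Each returned block begins with the line break that follows the start marker, so a caller may want to trim it before writing, for example with String.Join(Environment.NewLine, ...) over the trimmed blocks.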
I have a text file with 8 items (separated by commas) in each row, e.g., CA,23456,aName,aType,anotherName,aWord,secondword,number. I want to create a new string consisting of the 2nd item (an Integer) of each row of the original file. I know there are many ways to do this, but I am looking for a way to do it in very few lines of code. I prefer not to use a parser.
The code below shows what I have tried.
Dim sn2 As String = ""
Dim sn2S As String = ""
Using readFile As New StreamReader(newFile1)
    Do While readFile.Peek() <> -1
        sn2S = readFile.ReadLine(1)
        sn2 = sn2 & sn2S & ","
    Loop
End Using
The code returns the second character of each row not the second item. What I hope to get is a string that looks like: 123,1345,4325,3321,3456,3211 etc. Where each number is the second item in each row of the original file.
You could split each row into cells:
Dim row As String = "CA,23456,aName,aType,anotherName,aWord,secondword,number"
Dim cells() As String = row.Split(","c)
Dim cellValue As String = cells(1)
But in your case, I would just do a search and Substring by the index of the delimiter.
Dim startPosition As Integer = row.IndexOf(",") + 1
Dim endPosition As Integer = row.IndexOf(",", startPosition)
Dim cellValue As String = row.Substring(startPosition, endPosition - startPosition)
If you have the whole file in memory, a regex could probably do the job in one pass.
As for this line
sn2 = sn2 & sn2S & ","
you might want to look into String.Join or a StringBuilder, because repeated concatenation copies the whole string on every iteration.
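As a rough illustration of the StringBuilder suggestion combined with the IndexOf/Substring approach, something like the following might work. The names SecondColumn and CollectSecondFields are invented for this sketch, and it assumes rows with fewer than two commas should be skipped.

```vbnet
Imports System.IO
Imports System.Text

Module SecondColumn
    ' Hypothetical helper: collects the second comma-separated field of every line.
    Function CollectSecondFields(path As String) As String
        Dim sb As New StringBuilder()
        For Each row As String In File.ReadLines(path)
            Dim startPosition As Integer = row.IndexOf(","c) + 1
            Dim endPosition As Integer = row.IndexOf(","c, startPosition)
            ' Skip rows that do not contain at least two commas
            If startPosition = 0 OrElse endPosition < 0 Then Continue For
            If sb.Length > 0 Then sb.Append(","c) ' separator between items, no trailing comma
            sb.Append(row, startPosition, endPosition - startPosition)
        Next
        Return sb.ToString()
    End Function
End Module
```

Unlike the concatenation in the question, this does not leave a trailing comma at the end of the result.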
You could try
Dim sn2 As String = ""
Dim sn2S() As String
Using readFile As New StreamReader(newFile1)
    Do While readFile.Peek() <> -1
        ' Read the whole line, then split it into its comma-separated items
        sn2S = readFile.ReadLine().Split(","c)
        sn2 = sn2 & sn2S(1) & ","
    Loop
End Using
In one line
Dim sn2 = String.Join(",", File.ReadAllLines(newFile1).Select(Function(s) s.Split(","c)(1)))
From the inside-out:
File.ReadAllLines(newFile1) splits the file into lines and results in a string array holding those lines, which is fed into...
...Select(Function(s) s.Split(","c)(1)) which operates on each line by splitting the line by comma s.Split(","c) and then indexing the resulting array (1) to return the second element (the index is zero-based). This is fed into...
String.Join(",", ... ) which takes those second elements and joins them together with commas.
I want to read and write the same file with StreamReader and StreamWriter. I know that my code tries to open the file twice and that this is the problem. Could anyone give me another way to do this? I am a bit confused.
As for the program, I want it to create a text file if it doesn't exist. If the file exists, the program compares each line with a ListBox to see whether the value from the ListBox appears there. If it doesn't, the value is added to the text file.
Dim SR As System.IO.StreamReader
Dim SW As System.IO.StreamWriter
SR = New System.IO.StreamReader("D:\temp\" & Cerberus.TextBox1.Text & "_deleted.txt", True)
SW = New System.IO.StreamWriter("D:\temp\" & Cerberus.TextBox1.Text & "_deleted.txt", True)
Dim strLine As String
Do While SR.Peek <> -1
    strLine = SR.ReadLine()
    For i = 0 To Cerberus.ListBox2.Items.Count - 1
        If Cerberus.ListBox2.Items.Item(i).Contains(strLine) = False Then
            SW.WriteLine(Cerberus.ListBox2.Items.Item(i))
        End If
    Next
Loop
SR.Close()
SW.Close()
SR.Dispose()
SW.Dispose()
MsgBox("Duplicates Removed!")
If your file is not that large, consider using File.ReadAllLines and File.AppendAllLines.
Dim path = "D:\temp\" & Cerberus.TextBox1.Text & "_deleted.txt"
Dim lines = File.ReadAllLines(path) 'String() -- holds all the lines in memory
Dim linesToWrite = Cerberus.ListBox2.Items.Cast(Of String).Except(lines)
File.AppendAllLines(path, linesToWrite)
If the file is large, but you only have to write a few lines, then you can use File.ReadLines:
Dim lines = File.ReadLines(path) 'IEnumerable(Of String)
'holds only a single line in memory at a time
'but the file remains open until the iteration is finished
Dim linesToWrite = Cerberus.ListBox2.Items.Cast(Of String).Except(lines).ToList
File.AppendAllLines(path, linesToWrite)
If there are a large number of lines to write, then use the answers from this question.
I have searched high and low on the internet and I can't find a straight answer to this!
I have a file that has approximately 100,000 characters in one long line.
I need to read this file in and write it out again in its entirety, in lines 102 characters long, each ending with vbCrLf. There are no delimiters.
I thought there were a number of ways to tackle issues like this in VBScript, but apparently not!
Can anyone please provide me with a pointer?
Here's something (off the top of my head - untested!) that should get you started.
Const ForReading = 1
Const ForWriting = 2
Dim fso, tsIn, tsOut, sNewLine
Set fso = CreateObject("Scripting.FileSystemObject")
Set tsIn = fso.OpenTextFile("OldFile.txt", ForReading) ' Your input file
Set tsOut = fso.OpenTextFile("NewFile.txt", ForWriting, True) ' New (output) file; True creates it if it is missing
While Not tsIn.AtEndOfStream ' While there is still text
    sNewLine = tsIn.Read(102) ' Read up to 102 characters
    tsOut.Write sNewLine & vbCrLf ' Write out to new file + CR/LF
Wend ' Loop to repeat
tsIn.Close
tsOut.Close
I won't cover the reading of files, since that is something you can find everywhere. And since it's been years since I've coded in VB or VBScript, I hope that .NET code will suffice.
In pseudocode: read the line from the file and put it in, for example, a string (which may raise performance concerns).
A simple algorithm follows; it might have performance issues (multithreading or parallel processing could be a solution):
Public Sub foo()
    Const lineLength As Integer = 102
    Dim strLine As String = "foo²" ' stands in for the single long line read from the file
    Dim strLines As List(Of String) = New List(Of String)
    ' Walk through the string in 102-character steps; the last chunk may be shorter
    For i As Integer = 0 To strLine.Length - 1 Step lineLength
        strLines.Add(strLine.Substring(i, Math.Min(lineLength, strLine.Length - i)))
    Next
    'save it to file
End Sub
I have some large CSV files (1.5 GB each) where I need to replace specific values. The method I'm currently using is terribly slow, and I'm fairly certain there should be a way to speed this up, but I'm just not experienced enough to know what I should be doing. This is my first post, and I tried searching for something relevant but didn't come across anything. Any help would be appreciated.
My other thought would be to break the file into chunks so that I can read the entire thing into memory, do all of the replacements there and then output to a consolidated file. I tried this but the way I did it actually ended up seeming slower than my current method.
Thanks!
Sub Main()
    Dim fName As String = "2009.csv"
    Dim wrtFile As String = "2009.1.csv"
    Dim lRead
    Dim lwrite As String
    Dim strRead As New System.IO.StreamReader(fName)
    Dim strWrite As New System.IO.StreamWriter(wrtFile)
    Dim bulkWrite As String
    bulkWrite = ""
    Do While strRead.Peek <> -1
        lRead = Split(strRead.ReadLine(), ",")
        If lRead(9) = "5MM+" Then lRead(9) = "5000000"
        If lRead(9) = "1MM+" Then lRead(9) = "1000000"
        lwrite = ""
        For i = LBound(lRead) To UBound(lRead)
            lwrite = lwrite & lRead(i) & ","
        Next
        strWrite.WriteLine(lwrite)
    Loop
    strRead.Close()
    strWrite.Close()
End Sub
You are splitting and then recombining each line, which can take some time.
Why not just read the line of text, replace any occurrence of "5MM+" and "1MM+" with the appropriate value, and then write the line?
Do While ...
    s = strRead.ReadLine()
    s = s.Replace("5MM+", "5000000")
    s = s.Replace("1MM+", "1000000")
    strWrite.WriteLine(s)
Loop
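One caveat: a whole-line Replace also changes "5MM+" or "1MM+" if they happen to appear in other columns, whereas the question only touches the tenth field. If that distinction matters, a sketch along the following lines keeps the change limited to that field while avoiding the slow concatenation loop. The names ColumnFix and FixFile are hypothetical, and the sketch assumes no field contains a quoted comma.

```vbnet
Imports System.IO

Module ColumnFix
    ' Hypothetical helper: rewrites only the tenth field (index 9) of each line.
    Sub FixFile(inputPath As String, outputPath As String)
        Using writer As New StreamWriter(outputPath)
            For Each line As String In File.ReadLines(inputPath)
                Dim fields() As String = line.Split(","c)
                If fields.Length > 9 Then
                    Select Case fields(9)
                        Case "5MM+"
                            fields(9) = "5000000"
                        Case "1MM+"
                            fields(9) = "1000000"
                    End Select
                End If
                ' String.Join builds the output line in one step
                writer.WriteLine(String.Join(",", fields))
            Next
        End Using
    End Sub

    Sub Main()
        FixFile("2009.csv", "2009.1.csv")
    End Sub
End Module
```

Unlike the loop in the question, String.Join does not append a trailing comma to each line; add one explicitly if downstream consumers rely on it.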