Creating Fixed Width files from strings - file-io

I have searched high and low on the internet and I can't find a straight answer to this !
I have a file that has approx 100,000 characters in one long line.
I need to read this file in and write it out again in its entirety, in lines 102 character long ending with VbCrLf. There are no delimiters.
I thought there were a number of ways to tackle issues like this in VB Script... but
apparently not !
Can anyone please provide me with a pointer ?

Here's something (off the top of my head - untested!) that should get you started.
Const ForReading = 1
Const ForWriting = 2
Dim sNewLine
Set fso = CreateObject("Scripting.FileSystemObject")
Set tsIn = fso.OpenTextFile("OldFile.txt", ForReading) ' Your input file
Set tsOut = fso.OpenTextFile("NewFile.txt", ForWriting) ' New (output) file
While Not tsIn.AtEndOfStream ' While there is still text
sNewLine = tsIn.Read(102) ' Read 120 characters
tsOut.Write sNewLine & vbCrLf ' Write out to new file + CR/LF
Wend ' Loop to repeat
tsIn.Close
tsOut.Close

I won't cover the reading of files, since that is stuff you can find everywhere. And since it's been years I've coded in vb or vbscript, I hope that .net code will suffice.
pseudo: read line from file, put it in for example a string (performance issues anyone?).
A simple algorithm would be and this might have performance issues (multithreading, parallel could be a solution):
Public Sub foo()
Dim strLine As String = "foo²"
Dim strLines As List(Of String) = New List(Of String)
Dim nrChars = strLine.ToCharArray.Count
Dim iterations = nrChars / 102
For i As Integer = 0 To iterations - 1
strLines.Add(strLine.Substring(0, 102))
strLine = strLine.Substring(103)
Next
'save it to file
End Sub

Related

VB read/write 5 gb text files

I have an application that reads a 5gb text file line by line and converts double quoted strings that are comma delimited to pipe delimited format.
i.e. "Smith, John","Snow, John" --> Smith, John|Snow, John
I have provided my code below. My question is: Is there a more efficient way of processing large files?
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""
Do While strRead.Peek <> -1
line = strRead.ReadLine
Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
Dim replacement As String = "|"
Dim regEx As New Regex(pattern)
Dim newLine As String = regEx.Replace(line, replacement)
newLine = newLine.Replace(Chr(34), "")
strWrite.WriteLine(newLine)
Loop
strWrite.Close()
UPDATED CODE
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""
Do While strRead.Peek <> -1
line = strRead.ReadLine
line = line.Replace(Chr(34) + Chr(44) + Chr(34), "|")
line = line.Replace(Chr(34), "")
strWrite.WriteLine(line)
Loop
strWrite.Close()
I tested your code and attempted to make a speed improvement by accumulating output lines into a StringBuilder. I also moved the regex declaration outside the loop.
When that did not work, I examined the CPU usage and disk I/O with Windows Process Monitor and it turned out that the bottleneck is the CPU (even when using an HDD instead of an SSD).
That led me to try an alternative method for modifying the text: if all you need to do is replace "," with | and remove any remaining double-quotes, then
newLine = line.Replace(""",""", "|").Replace("""", "")
turns out to be much faster (roughly fourfold in my testing) than using a regex.
(Further improvement might be possible with multi-threading, as #Werdna suggested, as long as more than one processor is available and you can coordinate writing back the modified data in the correct order.)

search for filenames in textfiles

I have a powershell script that pulls out lines containing ".html" ".css" and so forth
however what I need is to be able to strip out the entire filename
using a pattern.... the entire pattern is returned example
.........\.html returns
src="blank.html"
my answer came in VB (with a bunch of work and even more research) I wanted to share with you all the results, it's not pretty but it works. is there an easier way?
I have commented the code to help in understanding.
Private Sub find()
Dim reader As StreamReader = My.Computer.FileSystem.OpenTextFileReader(openWork.FileName)
Dim a As String
Dim SearchForThis As String
Dim allfilenames As New System.Text.StringBuilder
Dim first1 As String
Dim FirstCharacter As Integer
'Dim lines As Integer
SearchForThis = txtFind.Text
Do
a = reader.ReadLine 'reader.Readling
If a = "" Then
a = reader.ReadLine
End If
If a Is Nothing Then 'without this check the for loops run with bad data, but I can't check "a" without reading it first.
Else
For FirstCharacter = 2 To a.Length - SearchForThis.Length ' start at 2 to prevent errors in the ")" check
If Mid(a, FirstCharacter, SearchForThis.Length) = SearchForThis Then ' compare the line character by character to find the searchstring
If Mid(a, FirstCharacter - 1, 1) <> ")" Then ' checks for ")" just before the searchstring (a common problem with my .CSS finds)
For y = FirstCharacter To 1 Step -1
If Mid(a, y, 1) = Mid(a, FirstCharacter + SearchForThis.Length, 1) Then ' compares the character after searchstring till I find another one
Dim temp = Mid(a, y + 1, (FirstCharacter + SearchForThis.Length) - 1 - y) ' puts the entire filename into variable "temp"
allfilenames.Append(temp & Chr(13)) 'adds the contents of temp (and a carrage return) to the allfilenames stringbuilder
y = 1
Else
End If
Next
End If
End If
Next
End If
Loop Until a Is Nothing
Document.Text = allfilenames.ToString
reader.Close()
End Sub
(updating for comments... thanks for the input)
each line in the .css search file looks something like this.
addPbrDlg.html:12:<link rel=stylesheet href="swl_styles-5.0o-97581885.css" TYPE="text/css">
addPbrDlg.html:727: html(getFrame(statusFrame).strErrorMessage).css('color','red');
for this I want to return
swl_styles-5.0o-97581885.css
but not return
statusFrame).strErrorMessage).css
basically I want to strip out the file names from HTML code
but if I use a pattern like
.............................\.css
it would return something like
t href="swl_styles-5.0o-97581885.css
Finally... there are some variables that I don't need to worry about (due to my personal situation) like I know that all web pages are ".html" all images are ".gif" there are ".css" and ".js" files as well that I want to pull. But because the designers are extremely consistant I know that there aren't any surprise files (.jpg or .htm)
I can also assume that if there is a single quote after the filename, there will be a single quote before. same with double quote.
Thanks for your input so far... I appreciate your time and knowledge.
You need to use Regex and do something like this
Dim files = Regex.Matches("<your whole file text>", "Your regex pattern");
Your regex pattern will look something like this "\Asrc="".+((\.html)|(\.css))"")". This is probably wrong but when you get that straight follow with
Dim fileList as new List(of String)
For Each file as Match in files
' strip " src=" " and last " " "
fileList.Add(file.Value.Substring(5, file.Value.Length - 6))
Next

Start reading massive text file from the end

I would ask if you could give me some alternatives in my problems.
basically I'm reading a .txt log file averaging to 8 million lines. Around 600megs of pure raw txt file.
I'm currently using streamreader to do 2 passes on those 8 million lines doing sorting and filtering important parts in the log file, but to do so, My computer is taking ~50sec to do 1 complete run.
One way that I can optimize this is to make the first pass to start reading at the end because the most important data is located approximately at the final 200k line(s) . Unfortunately, I searched and streamreader can't do this. Any ideas to do this?
Some general restriction
# of lines varies
size of file varies
location of important data varies but approx at the final 200k line
Here's the loop code for the first pass of the log file just to give you an idea
Do Until sr.EndOfStream = True 'Read whole File
Dim streambuff As String = sr.ReadLine 'Array to Store CombatLogNames
Dim CombatLogNames() As String
Dim searcher As String
If streambuff.Contains("CombatLogNames flags:0x1") Then 'Keyword to Filter CombatLogNames Packets in the .txt
Dim check As String = streambuff 'Duplicate of the Line being read
Dim index1 As Char = check.Substring(check.IndexOf("(") + 1) '
Dim index2 As Char = check.Substring(check.IndexOf("(") + 2) 'Used to bypass the first CombatLogNames packet that contain only 1 entry
If (check.IndexOf("(") <> -1 And index1 <> "" And index2 <> " ") Then 'Stricter Filters for CombatLogNames
Dim endCLN As Integer = 0 'Signifies the end of CombatLogNames Packet
Dim x As Integer = 0 'Counter for array
While (endCLN = 0 And streambuff <> "---- CNETMsg_Tick") 'Loops until the end keyword for CombatLogNames is seen
streambuff = sr.ReadLine 'Reads a new line to flush out "CombatLogNames flags:0x1" which is unneeded
If ((streambuff.Contains("---- CNETMsg_Tick") = True) Or (streambuff.Contains("ResponseKeys flags:0x0 ") = True)) Then
endCLN = 1 'Value change to determine end of CombatLogName packet
Else
ReDim Preserve CombatLogNames(x) 'Resizes the array while preserving the values
searcher = streambuff.Trim.Remove(streambuff.IndexOf("(") - 5).Remove(0, _
streambuff.Trim.Remove(streambuff.IndexOf("(")).IndexOf("'")) 'Additional filtering to get only valuable data
CombatLogNames(x) = search(searcher)
x += 1 '+1 to Array counter
End If
End While
Else
'MsgBox("Something went wrong, Flame the coder of this program!!") 'Bug Testing code that is disabled
End If
Else
End If
If (sr.EndOfStream = True) Then
ReDim GlobalArr(CombatLogNames.Length - 1) 'Resizing the Global array to prime it for copying data
Array.Copy(CombatLogNames, GlobalArr, CombatLogNames.Length) 'Just copying the array to make it global
End If
Loop
You CAN set the BaseStream to the desired reading position, you just cant set it to a specfic LINE (because counting lines requires to read the complete file)
Using sw As New StreamWriter("foo.txt", False, System.Text.Encoding.ASCII)
For i = 1 To 100
sw.WriteLine("the quick brown fox jumps ovr the lazy dog")
Next
End Using
Using sr As New StreamReader("foo.txt", System.Text.Encoding.ASCII)
sr.BaseStream.Seek(-100, SeekOrigin.End)
Dim garbage = sr.ReadLine ' can not use, because very likely not a COMPLETE line
While Not sr.EndOfStream
Dim line = sr.ReadLine
Console.WriteLine(line)
End While
End Using
For any later read attempt on the same file, you could simply save the final position (of the basestream) and on the next read to advance to that position before you start reading lines.
What worked for me was skipping first 4M lines (just a simple if counter > 4M surrounding everything inside the loop), and then adding background workers that did the filtering, and if important added the line to an array, while main thread continued reading the lines. This saved about third of the time at the end of a day.

Function to count number of lines in a text file

Need a function that will accept a filename as parameter and then return the number of lines in that file.
Should be take under 30 seconds to get the count of a 10 million line file.
Currently have something along the lines of - but it is too slow with large files:
Dim objFSO, strTextFile, strData, arrLines, LineCount
CONST ForReading = 1
'name of the text file
strTextFile = "sample.txt"
'Create a File System Object
Set objFSO = CreateObject("Scripting.FileSystemObject")
'Open the text file - strData now contains the whole file
strData = objFSO.OpenTextFile(strTextFile,ForReading).ReadAll
'Split by lines, put into an array
arrLines = Split(strData,vbCrLf)
'Use UBound to count the lines
LineCount = UBound(arrLines) + 1
wscript.echo LineCount
'Cleanup
Set objFSO = Nothing
If somebody still looking for faster way, here is the code:
Const ForAppending = 8
Set fso = CreateObject("Scripting.FileSystemObject")
Set theFile = fso.OpenTextFile("C:\textfile.txt", ForAppending, Create:=True)
WScript.Echo theFile.Line
Set Fso = Nothing
Of course, the processing time depend very much of the file size, not only of the lines number. Compared with the RegEx method TextStream.Line property is at least 3 times quicker.
The only alternative I see is to read the lines one by one (EDIT: or even just skip them one by one) instead of reading the whole file at once. Unfortunately I can't test which is faster right now. I imagine skipping is quicker.
Dim objFSO, txsInput, strTemp, arrLines
Const ForReading = 1
Set objFSO = CreateObject("Scripting.FileSystemObject")
strTextFile = "sample.txt"
txsInput = objFSO.OpenTextFile(strTextFile, ForReading)
'Skip lines one by one
Do While txsInput.AtEndOfStream <> True
txsInput.SkipLine ' or strTemp = txsInput.ReadLine
Loop
wscript.echo txsInput.Line-1 ' Returns the number of lines
'Cleanup
Set objFSO = Nothing
Incidentally, I took the liberty of removing some of your 'comments. In terms of good practice, they were superfluous and didn't really add any explanatory value, especially when they basically repeated the method names themselves, e.g.
'Create a File System Object
... CreateObject("Scripting.FileSystemObject")
Too large files...
The following is the fastest-effeciently way I know of:
Dim oFso, oReg, sData, lCount
Const ForReading = 1, sPath = "C:\file.txt"
Set oReg = New RegExp
Set oFso = CreateObject("Scripting.FileSystemObject")
sData = oFso.OpenTextFile(sPath, ForReading).ReadAll
With oReg
.Global = True
.Pattern = "\r\n" 'vbCrLf
'.Pattern = "\n" ' vbLf, Unix style line-endings
lCount = .Execute(sData).Count + 1
End With
WScript.Echo lCount
Set oFso = Nothing
Set oReg = Nothing
You could try some variation on this
cnt = 0
Set fso = CreateObject("Scripting.FileSystemObject")
Set theFile = fso.OpenTextFile(filespec, ForReading, False)
Do While theFile.AtEndOfStream <> True
theFile.SkipLine
c = c + 1
Loop
theFile.Close
WScript.Echo c,"lines"
txt = "c:\YourTxtFile.txt"
j = 0
Dim read
Open txt For Input As #1
Do While Not EOF(1)
Input #1, read
j = j + 1
Loop
Close #1
If it adds an empty last line the result is (j - 1).
It works fine for one column in the txt file.
How to count all lines in the notepad
Answers:
=> Below is the code -
Set t1=createObject("Scripting.FileSystemObject")
Set t2=t1.openTextFile ("C:\temp\temp1\temp2_VBSCode.txt",1)
Do Until t2.AtEndOfStream
strlinenumber = t2.Line
strLine = t2.Readline
Loop
msgbox strlinenumber
t2.Close
I was looking for a faster way than what I already had to determine the number of lines in a text file. I searched the internet and came across 2 promising solution. One was a solution based on SQL thew other the solution I found here based on Fso by Kul-Tigin. I tested them and this is part of the result:
Number of lines Time elapsed Variant
--------------------------------------------------------
110 00:00:00.70 SQL
110 00:00:00.00 Vanilla VBA (my solution)
110 00:00:00.16 FSO
--------------------------------------------------------
1445014 00:00:17.25 SQL
1445014 00:00:09.19 Vanilla VBA (my solution)
1445014 00:00:17.73 FSO
I ran this several times with large and small numbers. Time and again the vanilla VBA came out on top. I know this is far out of date, but for anyone still looking for the fastest way to determine the number of lines in a csv/text file, down here's the code I use.
Public Function GetNumRecs(ASCFile As String) As Long
Dim InStream As Long
Dim Record As String
InStream = FreeFile
GetNumRecs = 0
Open ASCFile For Input As #InStream
Do While Not EOF(InStream)
Line Input #InStream, Record
GetNumRecs = GetNumRecs + 1
Loop
Close #InStream
End Function

VB.Net Replacing Specific Values in a Large Text File

I have some large csv files (1.5gb each) where I need to replace specific values. The method I'm currently using is terribly slow and I'm fairly certain that there should be a way to speed this up but I'm just not experienced enough to know what I should be doing. This is my first post and I tried searching through to find something relevant but didn't come across anything. Any help would be appreciated.
My other thought would be to break the file into chunks so that I can read the entire thing into memory, do all of the replacements there and then output to a consolidated file. I tried this but the way I did it actually ended up seeming slower than my current method.
Thanks!
Sub Main()
Dim fName As String = "2009.csv"
Dim wrtFile As String = "2009.1.csv"
Dim lRead
Dim lwrite As String
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim bulkWrite As String
bulkWrite = ""
Do While strRead.Peek <> -1
lRead = Split(strRead.ReadLine(), ",")
If lRead(9) = "5MM+" Then lRead(9) = "5000000"
If lRead(9) = "1MM+" Then lRead(9) = "1000000"
lwrite = ""
For i = LBound(lRead) To UBound(lRead)
lwrite = lwrite & lRead(i) & ","
Next
strWrite.WriteLine(lwrite)
Loop
strRead.Close()
strWrite.Close()
End Sub
You are splitting and the combining, which can take some time.
Why not just read the line of text. Then replace any occurance of "5MM+" and "1MM+" with the approiate value and then write the line.
Do While ...
s = strRead.ReadLine();
s = s.Replace("5MM+", "5000000")
s = s.Replace("1MM+", "1000000")
strWrite(s);
Loop