Removing stray CRLF embedded in a CSV file column - vb.net

I need some direction on how to solve a problem I am working on. The root issue is that I need to work with CSV files in another program. The source system that creates the CSV files does not strip out CRLF in any of the data fields that get exported (meaning some fields have an embedded CRLF). As a result I receive a CSV file that has malformed rows in it. My end goal is a utility that will either
check the first column of each row (which, if correct, is a GUID with a length of 36), or
count the columns in each row (which is the example below).
In the example below I am looking at the column count. If the correct count is 18 then I want it to write that row to a new file. If the column count is not correct I want to remove the CRLF from that row until the column count is correct.
Again, two ways to solve the issue that I know of:
Check the length of the first column for a length of 36 (before the first comma and excluding the first row which is the title row), or
count the columns and remove any trailing CRLF until the column count is equal to 18 (the total column count).
My issue with the code so far is being able to write out a valid row to a new file. Currently it writes out System.String[] instead of the actual row.
Public Class Form1
    Private Sub btnFixit_Click(sender As Object, e As EventArgs) Handles btnFixit.Click
        Dim iBadRowNumber As Integer = vbNull
        Dim strFixedFile As System.IO.StreamWriter = My.Computer.FileSystem.OpenTextFileWriter(Me.txtFixedFile.Text, True)
        Using MyReader As New Microsoft.VisualBasic.FileIO.TextFieldParser(Me.txtBaselineFileToProcess.Text)
            MyReader.TextFieldType = FileIO.FieldType.Delimited
            MyReader.SetDelimiters(",")
            Dim currentRow As String()
            While Not MyReader.EndOfData
                Try
                    currentRow = MyReader.ReadFields()
                    If currentRow.Count = 18 Then
                        strFixedFile.WriteLine(currentRow)
                    Else
                        ' Future code here to fix the line
                    End If
                Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException
                    MsgBox("Line " & ex.Message &
                           "is not valid and will be skipped.")
                End Try
            End While
        End Using
        strFixedFile.Close()
    End Sub
End Class
Here is an example of 2 correct rows with one incorrect row in the middle. In this example the row beginning with Sometown is really part of the prior row. I have also seen one true row broken into three or more partial rows, similar to what you see with the Sometown row.
CustomerId,CustomerName,Status,Type,CustomerNumber,DBA,Address1,Address2,City,State,ZipCode,WebAddr,EMail,SalesCode,ServiceCode,DivisionCode,BranchCode,DepartmentCode
6d0125cd-70cf-4048-9ee1-8d9682e426a5,"Smith,James",Active,Customer,8,,103 Long Dr,,AnotherTown,NJ,000000,,,!!S,!%9,!!#,!!#,"!""."
35ed375c-c226-4879-a789-469cae63383c,"Doe, John",Active,Customer,55281,,28 Short Drive,,
Sometown,CA,12345,,
email#domain.com,"!$,",!$^,!!#,!!#,!!K
a5972bce-408f-4def-b77c-4ae0148dd045,"Duck,Donald",Active,Customer,25,,236 North Main St,,Mytown,PA,11111,,,!!2,!%9,!!#,!!#,"!""."
There may be much more elegant ways to perform the specific task. I am open either to corrections to my logic above or a totally different way to solve the problem either in VB.net or PowerShell.

Normally, a CSV file can have multiline fields without a problem, but those fields need to be surrounded with quotes.
In your example this doesn't seem to be the case, but on the other hand there is no multiline field either: the field with the value Sometown starts on a new line. So I wonder if this is the original data.
If your multiline fields are surrounded with quotes, you need to inform your parser about it.
Even with single-line rows you will have problems caused by fields with a separator inside. Luckily those are quoted (as they should be), so make sure the TextFieldParser.HasFieldsEnclosedInQuotes property is set to True as well (True is its default).
If your multiline fields happen to be quoted (as they should be), that setting should solve everything.
Update
You could do something like this:
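currentRow = MyReader.ReadFields()
If currentRow.Count = 18 Then
    ' Join the fields back into one line; passing the array itself prints "System.String[]".
    ' Fields that contain a comma or a quote must be re-quoted before joining.
    strFixedFile.WriteLine(String.Join(",", currentRow))
Else
    'Write current row without newline
    'Read next line/row
    'WriteLine this row
End If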
But you'll have to take care of fields like "Smith,James" with a separator inside. Make sure your parser handles quoted fields properly (see above).
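As a minimal sketch of writing a parsed row back out: ReadFields() strips the surrounding quotes, so a field such as Smith,James has to be quoted again or it will add a column. The helper names QuoteField and ToCsvLine are hypothetical, and the quoting rule (quote only when a field contains a comma, a quote or a line break, and double any embedded quotes) is an assumption about what the consuming program expects.

```vb
Imports System
Imports System.Linq

Module CsvRowWriter
    ' Hypothetical helper: quote a field only when it contains a comma, a quote or a line break.
    Function QuoteField(field As String) As String
        Dim special As Char() = New Char() {","c, """"c, ChrW(13), ChrW(10)}
        If field.IndexOfAny(special) >= 0 Then
            ' embedded quotes are doubled, then the whole field is wrapped in quotes
            Return """" & field.Replace("""", """""") & """"
        End If
        Return field
    End Function

    ' Hypothetical helper: turn the String() returned by ReadFields() into one CSV line.
    Function ToCsvLine(fields As String()) As String
        Return String.Join(",", fields.Select(Function(f) QuoteField(f)))
    End Function

    Sub Main()
        Dim row As String() = {"6d0125cd-70cf-4048-9ee1-8d9682e426a5", "Smith,James", "Active", "!""."}
        ' prints: 6d0125cd-70cf-4048-9ee1-8d9682e426a5,"Smith,James",Active,"!""."
        Console.WriteLine(ToCsvLine(row))
    End Sub
End Module
```

In the click handler this would be used as strFixedFile.WriteLine(ToCsvLine(currentRow)).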

The most straightforward approach would probably be a variation of your first validation check:
Read the file line-by-line and keep both the current and the previous line in a buffer.
Check if the beginning of the line is a proper GUID (e.g. with a regular expression).
If the current line does not start with a GUID, append it to the previous line.
Otherwise write the previous line to the output file unless it's empty, then replace it with the current line.
I don't know VB.net, but in PowerShell it would look somewhat like this:
$reader = New-Object IO.StreamReader ('C:\path\to\input.csv')
$writer = New-Object IO.StreamWriter ('C:\path\to\output.csv', $false)
$writer.WriteLine($reader.ReadLine())  # copy CSV header
$output  = ''  # output buffer (the record being assembled)
$current = ''  # current line from input file
while ($reader.Peek() -ge 0) {
    # read the next line
    $current = $reader.ReadLine()
    $hasGUID = $current -match '^[a-f0-9]{8}(-[a-f0-9]{4}){3}-[a-f0-9]{12},'
    # append line to output buffer if it doesn't have a GUID, otherwise
    # write the output buffer to file if it contains data and move the
    # current line to the output buffer
    if (-not $hasGUID) {
        $output += $current
    } else {
        if ($output) { $writer.WriteLine($output) }
        $output = $current
    }
}
# write the last buffered record (if there is one); this also covers the
# case where the file ends with a continuation line
if ($output) { $writer.WriteLine($output) }
$reader.Close(); $reader.Dispose()
$writer.Close(); $writer.Dispose()
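For readers who want the same buffering idea in VB.NET, here is a rough sketch under the same assumptions: the first line is the header, and every real record starts with a GUID followed by a comma. JoinBrokenRows is a hypothetical helper name, and the sample rows are shortened to three columns for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CsvLineJoiner
    ' A record starts with a GUID followed by a comma (same pattern as the PowerShell version).
    Private ReadOnly GuidStart As New Regex("^[a-f0-9]{8}(-[a-f0-9]{4}){3}-[a-f0-9]{12},", RegexOptions.IgnoreCase)

    ' Hypothetical helper: merge continuation lines into the record they belong to.
    Function JoinBrokenRows(lines As IEnumerable(Of String)) As List(Of String)
        Dim result As New List(Of String)
        Dim buffer As String = Nothing
        Dim isHeader As Boolean = True
        For Each line As String In lines
            If isHeader Then
                result.Add(line)            ' copy the header row unchanged
                isHeader = False
            ElseIf GuidStart.IsMatch(line) Then
                If buffer IsNot Nothing Then result.Add(buffer)
                buffer = line               ' start a new record
            Else
                buffer &= line              ' continuation: glue it onto the current record
            End If
        Next
        If buffer IsNot Nothing Then result.Add(buffer)   ' flush the last record
        Return result
    End Function

    Sub Main()
        Dim input As String() = {
            "CustomerId,CustomerName,City",
            "35ed375c-c226-4879-a789-469cae63383c,""Doe, John"",",
            "Sometown",
            "a5972bce-408f-4def-b77c-4ae0148dd045,""Duck,Donald"",Mytown"
        }
        For Each row As String In JoinBrokenRows(input)
            Console.WriteLine(row)
        Next
    End Sub
End Module
```

With real files, the input could come from System.IO.File.ReadLines and the result could be saved with System.IO.File.WriteAllLines.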

Related

how to search and display specific line from a text file vb.net

Hi, I am trying to search for a line that contains what the user inputs in a text box and display the whole line. My code below doesn't display a message box after the button has been clicked, and I am not sure whether the record has been found.
Dim filename, sr As String
filename = My.Application.Info.DirectoryPath + "\" + "mul.txt"
Dim file As String()
Dim i As Integer = 0
file = IO.File.ReadAllLines(filename)
Dim found As Boolean
Dim linecontain As Char
sr = txtsr.ToString
For Each line As String In file
If line.Contains(sr) Then
found = True
Exit For
End If
i += 1
If found = True Then
MsgBox(line(i))
End If
Next
End Sub
You should be calling ReadLines here rather than ReadAllLines. The difference is that ReadAllLines reads the entire file contents into an array first, before you can start processing any of it, while ReadLines doesn't read a line until you have processed the previous one. ReadAllLines is good if you want random access to the whole file or you want to process the data multiple times. ReadLines is good if you want to stop processing data when a line satisfies some criterion. If you're looking for a line that contains some text and you have a file with one million lines where the first line matches, ReadAllLines would read all one million lines whereas ReadLines would only read the first.
So, here's how you display the first line that contains specific text:
For Each line In File.ReadLines(filePath)
    If line.Contains(substring) Then
        MessageBox.Show(line)
        Exit For
    End If
Next
With regard to your original code, your use of i makes no sense. You seem to be using i as a line counter, but there is no point because you are using a For Each loop, so line already contains the line. If you already have the line, why would you need to get it by index? Also, when you try to display the message, you are using i to index line, which gives you a single character from the line rather than a single line from the array, and throws an IndexOutOfRangeException if i is not less than the number of characters in the line. As written, though, that MsgBox call is never reached for a matching line: Exit For leaves the loop as soon as found is set to True, so the If found = True check only ever runs while found is still False, which is why no message box appears. Note too that, if txtsr is the TextBox, txtsr.ToString returns a description of the control rather than what the user typed; use txtsr.Text.
This is what comes from writing code without knowing what it actually has to do first. If you had written out an algorithm before writing the code, it would have been obvious that the code didn't implement the algorithm. If you have no algorithm, though, you have nothing to compare your code against to make sure that it makes sense.

VB.net Find And Replace from Data in a DataGridView in a text file

I'm sure someone out there can help. I'm totally new to coding, but I'm getting into it and really enjoying it. I know this is a simple question for you folks. I load a spreadsheet of strings (2 columns) into a DataGridView; I do this because there are over 100,000 find-and-replace pairs, and the search strings will generally sit within an existing string. From there I want to search a text file and find and replace a number of strings in it. So it would check each row in the grid, take the find string from column 1 and the replacement from column 2, and output the result to another text file once the find and replace has taken place. My current result is that it just copies what was in the first file to the second file without replacing anything.
Any assistance is gratefully received, many thanks.
Please see below my amateur code:-
Private Sub CmdBtnTestReplace_Click(sender As System.Object, e As System.EventArgs) Handles CmdBtnTestReplace.Click
Dim fName As String = "c:\backup\logs\masterUser.txt"
Dim wrtFile As String = "c:\backup\logs\masterUserFormatted.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim s As String
Dim o As String
For Each row As DataGridViewRow In DataGridView1.Rows
If Not row.IsNewRow Then
Dim Find1 As String = row.Cells(0).Value.ToString
Dim Replace1 As String = row.Cells(1).Value.ToString
Cursor.Current = Cursors.WaitCursor
s = strRead.ReadToEnd()
o = s.Replace(Find1, Replace1)
strWrite.Write(o)
End If
Next
strRead.Close()
strWrite.Close()
Cursor.Current = Cursors.Default
MessageBox.Show("Finished Replacing")
End Sub
1. What you are doing is:
creating a StreamReader, whose purpose is to read characters from a file/stream in sequence;
creating a StreamWriter, whose purpose is to add content to a file/stream;
then looping:
a) read the remaining content of the file fName and put it in s
b) replace words in s and put the result in o
c) add o to the existing content of the file wrtFile
then the usual closing of the stream reader/writer.
But that doesn't work because, on the second iteration of the loop, strRead is already at the end of the file, so there is nothing left to read, and s is an empty string from the second iteration onwards.
Furthermore, because s is empty, o will be empty as well.
Lastly, even if you did manage to re-read the content of the file and replace the words on each iteration, strWrite would not discard what it has already written; it would write each resulting string (o) after the output of the previous iterations.
2. Since you loaded the content of the file in a string (s = strRead.ReadToEnd()), why don't you :
load that s string before the For-Next block
loop the datagridview rows in a For-Next block
replace using the pair Find1/Replace1 s = s.Replace(Find1, Replace1)
then, save the content of s in the targeted file outside the For-Next block
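The steps above can be sketched roughly as follows. To keep the example self-contained, the DataGridView rows are stood in for by a list of find/replace pairs; ApplyReplacements and the sample strings are hypothetical.

```vb
Imports System
Imports System.Collections.Generic

Module ReplaceOnce
    ' Hypothetical helper: apply every find/replace pair to text that was loaded once.
    Function ApplyReplacements(text As String, pairs As IEnumerable(Of KeyValuePair(Of String, String))) As String
        For Each pair As KeyValuePair(Of String, String) In pairs
            ' keep working on the same string so that earlier replacements are not lost
            text = text.Replace(pair.Key, pair.Value)
        Next
        Return text
    End Function

    Sub Main()
        ' stand-in for the grid rows: column 1 = find, column 2 = replace
        Dim pairs As New List(Of KeyValuePair(Of String, String)) From {
            New KeyValuePair(Of String, String)("jsmith", "John Smith"),
            New KeyValuePair(Of String, String)("ddoe", "Dana Doe")
        }
        Dim content As String = "jsmith logged in; ddoe logged out"
        ' prints: John Smith logged in; Dana Doe logged out
        Console.WriteLine(ApplyReplacements(content, pairs))
    End Sub
End Module
```

In the button handler, content would be read once before the loop (for example with System.IO.File.ReadAllText(fName)) and written once after it (for example with System.IO.File.WriteAllText(wrtFile, content)).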
3. However, a full explanation of how streams work, what to consider and what to avoid is a bit outside the scope of SO, I think; such documentation can be found on MSDN or with the help of your friend, Google. The same applies to working out how you should arrange your code to achieve your goal. Let's take an example:
' Content of your file :
One Two Three Four Five Six
' Content of your DataGridView :
One | Two
Two | Three
Three | Four
Four | Five
Five | Six
Six | Seven
The resulting replacement text at the end of a similar routine as yours will be :
Seven Seven Seven Seven Seven Seven ' :/
' while the expected result would be :
Two Three Four Five Six Seven
And that's because of the iteration : already replaced portions of your file (or loaded file content) could get replaced again and again. To avoid that, either :
split the loaded content in single words, and use a "replaced" flag for each word (to avoid replacing that word more than once)
or preload all the pair Find/Replace, and parse the file content in sequence once, replacing that instance when required.
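One possible way to do the second option is a single regular-expression pass over the text, so that text which has already been replaced is never scanned again. This is only a sketch: ReplaceAllOnce is a hypothetical name, and whether one large alternation pattern performs acceptably with 100,000 pairs is an assumption that would need measuring.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module SinglePassReplace
    ' Hypothetical helper: replace every key in one pass over the text.
    Function ReplaceAllOnce(text As String, map As Dictionary(Of String, String)) As String
        If map.Count = 0 Then Return text
        ' Longest keys first, so a longer key wins over a shorter key that is its prefix.
        Dim alternatives As IEnumerable(Of String) =
            map.Keys.OrderByDescending(Function(k) k.Length).Select(Function(k) Regex.Escape(k))
        Dim pattern As String = String.Join("|", alternatives)
        ' Each match is looked up once; the replacement text itself is never re-scanned.
        Return Regex.Replace(text, pattern, Function(m As Match) map(m.Value))
    End Function

    Sub Main()
        Dim map As New Dictionary(Of String, String) From {
            {"One", "Two"}, {"Two", "Three"}, {"Three", "Four"},
            {"Four", "Five"}, {"Five", "Six"}, {"Six", "Seven"}
        }
        ' prints: Two Three Four Five Six Seven
        Console.WriteLine(ReplaceAllOnce("One Two Three Four Five Six", map))
    End Sub
End Module
```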
So, before using an interesting object in the framework:
you should know what it does and how it behaves
otherwise -> read the documentation
otherwise -> create a minimal test solution whose purpose is to brute-force test that particular object and uncover all its powers and flaws.
So, like I said in 2., start by moving the ReadToEnd() and Write() calls outside the For/Next block and have a look at the resulting output (ask specific questions in comments when Google can't answer). If you're OK with the result, even though issues like the One Two Three example above could occur, then voila! Otherwise, use Google to gather more examples on "splitting text into words" and reformatting the whole, have a few tries, then come back here if you're stuck on a precise issue.

Let VB read certain area in text file, change it and save it

I want my program to read a certain part of a huge txt file, change one value and save the file again.
The file that needs editing looks like this:
168575 = {
name="Hidda"
female=yes
dynasty=9601
religion="catholic"
culture="german"
father=168573
960.1.1 = {
birth=yes
}
1030.1.1 = {
death=yes
}
}
My VB program takes the IDs from the blocks it has to change from another textbox like this.
31060
106551
106550
168575
40713
106523
106522
106555
As you can see, the number I want changed is in the middle of the textbox, the code I use to get the number from the line and look for it in the huge file is
Dim strText() As String
strText = Split(chars.Text, vbCrLf)
and later
If line.Contains(strText(0) & " = {") Then
TextBox1.AppendText(line & Environment.NewLine)
To form a code like:
Dim strText() As String
strText = Split(chars.Text, vbCrLf)
Label4.Text = strText(0)
Dim line As String = Nothing
Dim lines2 As Integer = 0
Using reader2 As New StreamReader("c:/dutch.txt")
While (reader2.Peek() <> -1)
line = reader2.ReadLine()
If line.Contains(strText(0) & " = {") Then
TextBox1.AppendText(line & Environment.NewLine)
End If
lines2 = lines2 + 1
Label2.Text = lines2
End While
End Using
Naturally, this only writes the line that it found to a textbox. How do I get the whole block for the IDs I take from one textbox, change the culture to another value and save it again? And repeat this for all the IDs in the textbox? I'm not a coding legend, but this has been bothering me for ages now :(
There are a few issues to consider here. If you're dealing with a large text file as a "database" and you wish to edit only parts of it without affecting the other parts, then you may wish to investigate editing it as a binary file instead of as a text stream. This has several downsides, however, since it means that you have to be aware of how big your records are and deal with things like padding.
If you can spare the disk IO and RAM (I don't know how huge you mean when you say huge) it would probably be vastly easier to simply load the entire file into an array or List(Of String), find the line representing the person, seek a few lines below that for the field you want (you said culture), change that field in the array or List, and then just resave the entire array or List back to a text file. This would make it fairly easy to do inserts and you wouldn't have to worry about padding, mostly you'd just have to worry about the line endings and the file encoding (and the amount of disk IO and RAM).
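A rough sketch of that load, edit and resave approach is shown below. SetCulture is a hypothetical helper, and it assumes each block opens with a line of the form "<id> = {" and holds its culture field at the top nesting level of the block, as in the sample in the question.

```vb
Imports System
Imports System.Collections.Generic

Module BlockEditor
    ' Hypothetical helper: inside the block that starts with "<id> = {", rewrite the culture line.
    ' Returns True when a culture line was changed.
    Function SetCulture(lines As List(Of String), id As String, newCulture As String) As Boolean
        Dim start As Integer = lines.FindIndex(Function(l) l.Trim() = id & " = {")
        If start < 0 Then Return False
        Dim depth As Integer = 0
        For i As Integer = start To lines.Count - 1
            Dim current As String = lines(i)
            If depth = 1 AndAlso current.TrimStart().StartsWith("culture=") Then
                ' keep whatever indentation the original line had
                Dim indent As String = current.Substring(0, current.Length - current.TrimStart().Length)
                lines(i) = indent & "culture=""" & newCulture & """"
                Return True
            End If
            ' track nesting so that the search stops at the end of this block
            If current.Contains("{") Then depth += 1
            If current.Contains("}") Then depth -= 1
            If depth = 0 Then Exit For
        Next
        Return False
    End Function

    Sub Main()
        Dim lines As New List(Of String) From {
            "168575 = {",
            "name=""Hidda""",
            "culture=""german""",
            "960.1.1 = {",
            "birth=yes",
            "}",
            "}"
        }
        If SetCulture(lines, "168575", "dutch") Then
            Console.WriteLine(String.Join(Environment.NewLine, lines))
        End If
    End Sub
End Module
```

For the real file, the list could be built with New List(Of String)(System.IO.File.ReadAllLines(path)), SetCulture called once per ID from the textbox, and the result saved with System.IO.File.WriteAllLines.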
Finally, I would suggest that using a custom format text file as a database is generally a "bad" idea in 2014 unless you have a really good reason to be doing that. Your format looks very similar to JSON - perhaps you could consider using that instead of your existing format. Then there would be libraries such as JSON.Net to help you do the right thing and you wouldn't need to do any custom IO code.

Finding the highest value within a single field

I want to find the highest numerical value in a CSV field as this will determine the next highest number.
Dim founditem() As String = Nothing
For Each line As String In File.ReadAllLines("F:\Computing\Spelling Bee\testtests.csv")
Dim item() As String = line.Split(","c)
Do While item(8) = choice
If weeknumber < item(9) Then
weeknumber += 1
Else
Exit Do
End If
Loop
Next
I am getting an "index is out of bounds" exception. Why?
.NET arrays are zero-based. Their indexes range from 0 to the number of columns - 1.
Do you have ten columns? Because of item(9), you need at least ten columns.
Note that empty fields at the end of a line must also be separated by commas in a CSV file. If you have 10 columns, a line must always have 9 commas.
An empty line at the end of the file might also cause the problem, because it yields exactly one empty item for that line instead of ten.
Add a test for the line length:
Do While item.Length = 10 AndAlso item(8) = choice
    If weeknumber < item(9) Then
        weeknumber += 1
    Else
        Exit Do
    End If
Loop
If this does not help, set a breakpoint at the beginning of the method, step through it and inspect the variables. The Visual Studio debugger makes it very easy to find most such errors. Even the Exception tells you the line and column numbers of the faulty spot.
Parsing a CSV file is much more than splitting by comma. You may encounter embedded commas, where a comma is part of the value, and splitting will return wrong results for such data. You can also have embedded newlines, where a newline character is part of the value, so a CSV record can span multiple lines. There may be more issues out there that I don't remember off the top of my head.
You are better off using a third-party CSV parser, such as this one:
KBCsv @ CodePlex.
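As an alternative sketch that avoids Split, the TextFieldParser class that ships with VB.NET can read the fields, and the highest value can be tracked directly. HighestValue and its parameters are hypothetical, and the two-column sample data is made up for illustration; in the question the key and value columns would be 8 and 9.

```vb
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module HighestWeek
    ' Hypothetical helper: highest integer in column valueIndex among rows whose
    ' column keyIndex equals key. Returns 0 when no row qualifies.
    Function HighestValue(reader As TextReader, keyIndex As Integer, key As String, valueIndex As Integer) As Integer
        Dim highest As Integer = 0
        Using parser As New TextFieldParser(reader)
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            While Not parser.EndOfData
                Dim fields As String() = parser.ReadFields()
                Dim value As Integer
                ' skip short rows and non-numeric values instead of throwing
                If fields.Length > Math.Max(keyIndex, valueIndex) AndAlso
                   fields(keyIndex) = key AndAlso
                   Integer.TryParse(fields(valueIndex), value) Then
                    highest = Math.Max(highest, value)
                End If
            End While
        End Using
        Return highest
    End Function

    Sub Main()
        Dim csv As String = String.Join(Environment.NewLine, "alice,3", "bob,7", "alice,5")
        ' prints: 5
        Console.WriteLine(HighestValue(New StringReader(csv), 0, "alice", 1))
    End Sub
End Module
```

The next week number would then be the returned value plus one.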

Trim file after a blank line

I have a text file that has multiple blank lines, and I'm trying to return all the lines between two of them specifically.
so if I have a file that looks like this:
____________________________
1########################
2##########################
3
4########################
5##########################
6#######################
7
8#########################
9##########################
10#######################
11####################
12########################
13#########################
14
15##########################
----------------------------
I would like to grab lines 8-13. Unfortunately, it might not always be 8-13 as it could be 9-20 or 7-8, but it will however always be between the 2nd and 3rd line break.
I know how to trim characters and pull out singular lines, but I have no idea how to trim entire sections.
Any help would be appreciated, even if you just point me to a tutorial.
Thanks in advance.
The basic idea here is to get the entire thing as a string, split it into groups at the double line breaks, and then reference the group you want (in your case, the third one).
Dim value As String = File.ReadAllText("C:\test.txt")
Dim breakString As String = Environment.NewLine & Environment.NewLine
Dim groups As String() = value.Split({breakString}, StringSplitOptions.None)
Dim desiredString As String = groups(2)
MsgBox(desiredString)
Edit:
In response to the question in your comment -
Environment.NewLine is a more portable way of specifying a line break. Assuming you're running on Windows, you could use vbCrLf as well. The idea is that if you were to run the same code on Linux, Environment.NewLine would produce an LF instead. See here for more information: http://en.wikipedia.org/wiki/Newline
The reason I used Environment.NewLine & Environment.NewLine is because you want to break your information where there are two line breaks (one at the end of the last line of a paragraph, and one for the blank line before the next paragraph)
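If the file's line endings may not match Environment.NewLine (for example an LF-only file read on Windows), a line-based variant of the same idea is one option. This is a sketch only; SplitIntoGroups is a hypothetical helper, and it treats any run of blank or whitespace-only lines as a single separator.

```vb
Imports System
Imports System.Collections.Generic

Module BlankLineGroups
    ' Hypothetical helper: split lines into groups separated by one or more blank lines.
    Function SplitIntoGroups(lines As IEnumerable(Of String)) As List(Of List(Of String))
        Dim groups As New List(Of List(Of String))
        Dim current As New List(Of String)
        For Each line As String In lines
            If String.IsNullOrWhiteSpace(line) Then
                ' a blank line closes the group collected so far, if any
                If current.Count > 0 Then
                    groups.Add(current)
                    current = New List(Of String)
                End If
            Else
                current.Add(line)
            End If
        Next
        If current.Count > 0 Then groups.Add(current)
        Return groups
    End Function

    Sub Main()
        Dim lines As String() = {"a", "b", "", "c", "", "d", "e"}
        Dim groups As List(Of List(Of String)) = SplitIntoGroups(lines)
        ' the third group (index 2) holds the lines between the 2nd and 3rd blank line
        Console.WriteLine(String.Join(Environment.NewLine, groups(2)))
    End Sub
End Module
```

System.IO.File.ReadLines splits on CR, LF and CRLF alike, so it could supply the lines for a real file.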
What I ended up doing was trimming the last part and searching for what I needed in the first part (I know I didn't include the searching part in the question, but I was just trying to figure out a way to narrow down the search results, as it would otherwise have returned repeated results). I'm posting this in case anyone else stumbles upon this looking for answers.
Dim applist() As String = System.IO.File.ReadAllLines("C:\applist.txt")
Dim findICSName As String = "pid"
Dim ICSName As New Regex("\:.*?\(")   ' requires Imports System.Text.RegularExpressions
Dim x As Integer = 0
' stop at the summary line, or at the end of the file if that line is missing
Do Until x >= applist.Length OrElse applist(x).Contains("Total PSS by OOM adjustment:")
    If applist(x).Contains(findICSName) Then
        Dim app As String = ICSName.Match(applist(x)).Value
        ' strip the leading ": " and the trailing "(" that the pattern captures
        app = app.TrimStart(":"c, " "c)
        app = app.TrimEnd("("c)
        ListBox1.Items.Add(app)
    End If
    x = x + 1
Loop
How this works: it applies the regex to each line that contains "pid" until it reaches the line containing "Total PSS by OOM adjustment:".