How to read a single line out of a word document - vba

The method I used was for text files and gives gibberish as expected.
In: John Smith
Out: PK...[Content_Types].xml... (unreadable binary output, truncated)
I'm a novice at VBA, and I'm trying to read a document line by line so that I can eventually have the macro automatically remove entire lines based on their content.
Sub ayaya()
Dim TextLine As String
Open ActiveDocument.Path & "\Doc1.docm" For Input As #1
Do While Not EOF(1) ' Loop until end of file.
Line Input #1, TextLine ' Read line into variable.
Debug.Print TextLine
Loop
Close #1
End Sub
Part of me hoped that it would give "John Smith". I've seen some solutions put the entire document into a text file. Is there any way where I can delimit the data somehow? I'd like to be able to isolate a single line and remove it.

You are trying to read a docx or docm file, which is a zip archive. Word files are not plain text files, so you won't get anything meaningful treating them as such. You need to open the file with Word or another app that can read such files.
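To illustrate the point about the zip archive, here is a minimal sketch in Python (chosen only because its standard library can open zip archives; it is not part of the original answer). It does not use a real Word file: `make_fake_docx` is a hypothetical helper that builds a stripped-down stand-in archive in memory containing only a `word/document.xml` part. The sketch shows that the raw bytes begin with `PK`, as in the question's output, and that the readable text only appears once the archive is opened and the XML is parsed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def make_fake_docx(paragraphs):
    """Build a stripped-down stand-in for a .docx (NOT a complete Word file)."""
    body = "".join(
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()

def read_paragraphs(docx_bytes):
    """Return the text of each <w:p> paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    result = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        result.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return result

data = make_fake_docx(["John Smith", "Jane Doe"])
print(data[:2])               # the zip signature seen as "PK" in the question
print(read_paragraphs(data))  # one entry per paragraph
```

Inside Word itself there is no need to unpack anything: a macro can work on the open document's paragraphs through the object model instead of reading the file's bytes.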

Related

VBA - Reading PDF as String - Cannot sometimes but can other times - 'Run time error 62' [closed]

You can open PDFs in text editors to see the structure of how the PDF is written.
Using VBA, I open a PDF as a text file, read its text, and store it in a string variable. I want to search this text for a specific element, a polyline (called sTreamTrain), and get the vertices of the polyline using the InStr function.
When I add more vertices to the polyline, I can no longer extract the text of the PDF. I get 'Run time error 62', and I do not understand what it means or what has changed in the PDF to cause it.
Attached (via the link) are a PDF that I can read (Document 15) and a PDF that I cannot read (Document 16). I have checked in Excel that the vertices are present in both files. There is also a copy of my VBA script as a Notepad document and my Excel file (the script is hard to find in the Excel file: it is in "Module 6", in the function called "CoordExtractor_TestBuild01()").
Link:
https://drive.google.com/open?id=1zOhwnFWZZfy9bTAxKiQFSl7qiQLlYIJV
Code snippet of the text extraction process below to reproduce the problem (given an applicable pdf is used):
Sub CoordExtractor_TestBuild01()
'Opening the PDF and getting the coordinates
Dim TextFile As Integer
Dim FilePath As String
Dim FileContent As String
'File Path of Text File
FilePath = "C:\Users\KAllan\Documents\WorkingInformation\sTreamTrain\Document16 - Original.pdf"
'Determine the next file number available for use by the FileOpen function
TextFile = FreeFile
'Open the text file in a Read State
Open FilePath For Input As TextFile
'Store file content inside a variable
Dim Temp As Long
Temp = LOF(TextFile)
FileContent = Input(LOF(TextFile), TextFile)
'Close Text File
Close TextFile
End Sub
I would like someone to tell me what runtime error 62 means in this context and to suggest workflows for getting around it in future. I would also like to know whether there are certain characters that cannot be stored in strings. Perhaps such characters are included when I increase the number of vertices past a certain point.
I would prefer to keep the scripts simple and avoid external libraries, because I want to share the script when it is done and it is simpler for others if it works without extra dependencies. However, any and all advice is welcome, since this is only the first half of the project.
Thank you very much.
Run-time error 62 is "Input past end of file". According to the MSDN documentation, this error is caused by the file containing "...blank spaces or extra returns at the end of the file or the syntax is not correct."
Since your code works sometimes on documents with very similar names and content to documents where it doesn't work, we can rule out syntax errors in this case.
You can clean up the file contents before processing it any further by replacing the code at the top of your macro with the one below. With this I can read and extract information from your Document16.pdf:
Sub CoordExtractor_TestBuild01()
'Purpose to link together the extracting real PDF information and outputting the results onto a spreadsheet
'########################################################################################
'Opening the PDF and getting the coordinates
Dim n As Long
Dim TextFile As Integer
Dim FilePath As String
Dim FileContent As String
'File Path of Text File
FilePath = "C:\TEST\Document16.pdf" ' change path
'Determine the next file number available for use by the FileOpen function
TextFile = FreeFile
'Open the text file in a Read State
Open FilePath For Input As TextFile
Dim strTextLine As String
Dim vItem As Variant
Line Input #TextFile, strTextLine ' use the file number from FreeFile, not a hard-coded #1
vItem = Split(strTextLine, Chr(10))
' clean file of garbage information line by line
For n = LBound(vItem) To UBound(vItem)
' insert appropriate conditions here - in this case if the string "<<" is present
If InStr(1, vItem(n), "<<") > 0 Then
If FileContent = vbNullString Then
FileContent = vItem(n)
Else
FileContent = FileContent & Chr(10) & vItem(n)
End If
End If
Next n
'Close Text File
Close TextFile
' insert the rest of the code here

VBS Find/ replace double paragraph spacing with single spacing

I wasn't sure how to post a "question" that I found an answer to, but thought that it might be worth sharing my solution to save others the time I spent in figuring out how to do this.
Essentially, I have a PDF (with lots of pages/ formatting) that I want to strip the text out of, and paste into something else. However, a simple copy/paste will still leave text in its columns and automatically insert paragraph spaces that you then need to press end, delete, space, then repeat sequence indefinitely. Well, that's what programming was made for - doing repeated tasks for you so you don't have to.
My answer is posted below. If anyone has a better solution please let me know!
Below I pasted my code from a vbscript that I generated to do so. You will still need to go back through your text file and fix some bits & pieces after running the script that didn't follow the standard template that you programmed for.
Also, I'll note that I used notepad++ to determine how (in windows) Adobe reader handled carriage returns versus line feed (since the distinction is rather blurred today). I reference this article and the answer by AAT, which helped me in understanding the difference. The accepted answer is useful when specifically referencing vbs.
REM Set constants, then open file and copy into a buffer (contents)
Const ForReading = 1, ForWriting = 2
Dim fs, txt, contents
Set fs = CreateObject("Scripting.FileSystemObject")
Set txt = fs.OpenTextFile("originalTextFile.txt", ForReading)
contents = txt.ReadAll
txt.Close
REM Replace each double line break (paragraph break) with a placeholder that does not occur in the text
contents = Replace(contents, vbCrLf & vbCrLf, "$%^&")
REM Then remove the leftover single line breaks
contents = Replace(contents, vbCrLf, "")
REM Finally, restore the original paragraph breaks
contents = Replace(contents, "$%^&", vbCrLf & vbCrLf)
REM Write to file
Set txt = fs.OpenTextFile("textFileRemovedSpaces.txt", ForWriting)
txt.Write contents
txt.Close
MsgBox("Done!")
Step 1: Save pdf as a text file - this strips out the pictures/ etc. With Adobe Reader, do File -> Save as other -> Text.
Step 2: Save above as Something.vbs, and edit file names in script as appropriate. Make sure to also create the empty text file for the script to save the edited text in. Note in vbs, the text "REM" signifies a comment follows.
Step 3: Run Script.
Step 4: Profit!
I've found this useful, as for the most part it saves a lot of effort in editing a 300-page PDF that I needed to convert to a Word document.
Again, if anyone has a better solution please let me know!
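For readers who want to try the placeholder idea outside VBScript, here is a minimal sketch of the same three-step replace in Python. The function name `unwrap_paragraphs` is made up for illustration, and the sketch assumes, like the script, that the placeholder `$%^&` never occurs in the real text and that line breaks are CR+LF.

```python
PLACEHOLDER = "$%^&"  # same marker as the script; assumed never to occur in the text

def unwrap_paragraphs(text, newline="\r\n"):
    """Remove single line breaks but keep blank-line paragraph breaks."""
    text = text.replace(newline * 2, PLACEHOLDER)   # protect paragraph breaks
    text = text.replace(newline, "")                # drop the remaining line breaks
    return text.replace(PLACEHOLDER, newline * 2)   # restore paragraph breaks

sample = "first line \r\nsecond line\r\n\r\nnew paragraph"
print(repr(unwrap_paragraphs(sample)))
```

As in the script, single line breaks are replaced with nothing, so words are only kept apart if the line already ended with a space; replacing with a single space instead is an easy variation.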

line input not working as expected in VBA

I have a text file that I open and attempt to read the individual lines. I have used the same code before on other files with no problem, but for some reason, this particular file is strange. When I do the following command;
Line Input #1, read_string
the string read_string contains every line in the file concatenated together. When I look at the special characters in the file, I do see a carriage return. Just so you know what the file looks like, here are the first two lines (daniweb formatting is too strange to print text here),
k_arr[8'h1C]= {10'b001111_0100,10'b110000_1011} ;
k_arr[8'h1C]= {10'b001111_0100,10'b110000_1011} ;
Does anybody know how I can read each line? Apparently Line Input doesn't work for this file.
Try
Dim lines() As String
lines = Split(read_string, vbCr) 'splitting with Carriage Return delimiter
'did it work? Split returns a zero-based array, so these are the first two lines
Debug.Print lines(0)
Debug.Print lines(1)
Each element of the lines array should now contain one line of your text file.
If it didn't work, try with another delimiter instead of vbCr, e.g. vbLf (line feed).
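If it is unclear which delimiter the file uses, one option is to normalise all three common line endings before splitting. The following is a rough sketch of that idea in Python rather than VBA; `split_lines` is a hypothetical helper name and the sample string is invented for the demonstration.

```python
def split_lines(text):
    """Split on CR+LF, a lone CR, or a lone LF."""
    # Normalise CR+LF first so it is not counted as two separate breaks.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.split("\n")

mixed = "line one\rline two\r\nline three\nline four"
for number, line in enumerate(split_lines(mixed), start=1):
    print(number, line)
```

The same two-step replace followed by a split can be written in VBA with Replace and Split, using vbCrLf, vbCr and vbLf as the delimiters.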

Writing from multiline text box to .txt file, and then reading it back

I have a form with several text boxes and I want to write the contents of each of them to a new line in a .txt file. As in, the user fills in a form, and the info is stored in the file. Then I want to be able to retrieve the info from the file into the same text boxes. I am able to do this, so far, but I encounter problems when one of the text boxes is multiline.
Printline(1, txtBox1.text)
Printline(1, txtBox2.text) ' which is the multiline one
Printline(1, txtBox3.text)
When I read this back from the file I get the second line of the multiline text box where I want the text from txtBox3 to be.
LineInput(1, txtBox1.text)
LineInput(1, txtBox2.text)
LineInput(1, txtBox3.text)
How can I get all the lines from the multiline text box to write to one line in the file, and then read it back as separate lines in a multiline text box?
I hope I am making sense? I really would like to keep the logic of "one txtBox - one line in the file"
I guess I need to use different methods of writing and reading, but I am not that familiar with this, so any help is much appreciated.
You can rely on the Lines Property in case of having more than one line. Sample code (curTextBox is the given TextBox Control):
Using writer As System.IO.StreamWriter = New System.IO.StreamWriter("path", True)
Dim curLine As String = curTextBox.Text
If (curTextBox.Lines.Count > 1) Then
curLine = ""
For Each line As String In curTextBox.Lines
curLine = curLine & " " & line
Next
curLine = curLine.Trim()
End If
writer.WriteLine(curLine)
End Using
NOTE: this code writes all the text from the given TextBox on a single line of the file, regardless of how many lines the TextBox has. If it has more than one line, the individual lines are joined with a blank space. You can use a different separator character by replacing & " " & with the one you want.
One option would be to escape the newlines so that they aren't in the output, then unescape them on reading back in.
Here's some example code that will do this (I've never written VB before, so this probably isn't idiomatic):
' To output to a file:
Dim output As String = TextBox2.Text
' Escape all the backslashes and then the vbCrLfs
output = output.Replace("\", "\bk").Replace(vbCrLf, "\crlf")
' Write the data from output to the file
' To read data from the file:
Dim input As String = "" ' Placeholder: assign the line read from the file here
' Put vbCrLfs back for \crlf, then put \ for \bk
input = input.Replace("\crlf", vbCrLf).Replace("\bk", "\")
' Put the text back in its box
TextBox2.Text = input
Another option would be to store your data in XML, JSON, or YAML. Any of those are text-based formats that will require a library to parse, but should cleanly handle the multiline text you have, along with providing increased future flexibility.
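The escaping approach described above is easy to check with a round trip. Here is a minimal sketch in Python that uses the same `\bk` and `\crlf` markers; the function names are invented for the illustration. The order of the replacements matters: backslashes are escaped first when writing and restored last when reading.

```python
def escape_newlines(text):
    """Turn multi-line text into a single line that can be written with WriteLine."""
    # Escape backslashes first so a literal "\crlf" in the text stays distinguishable.
    return text.replace("\\", "\\bk").replace("\r\n", "\\crlf")

def unescape_newlines(line):
    """Reverse escape_newlines: restore line breaks, then backslashes."""
    return line.replace("\\crlf", "\r\n").replace("\\bk", "\\")

original = "line 1\r\nC:\\temp\\crlf\r\nline 3"
stored = escape_newlines(original)
print(stored)                                  # a single line, safe to store
print(unescape_newlines(stored) == original)   # round trip check
```

The sample deliberately contains a literal backslash followed by `crlf` to show why the backslash itself has to be escaped.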
The following simple code works for me.
Saving multiline text to a single line in a file:
str = Replace(MyTextBox.Text, Chr(13) & Chr(10), "*LineFeed*") 'something recognizable
Print #1, str 'no quotes
To get the string from the file and put it on a TextBox:
Line Input #1, str
MyTextBox.Text = Replace(str, "*LineFeed*", Chr(13) & Chr(10))
Hope this helps

How to : streamreader in csv file splits to next if lowercase followed by uppercase in line

I am using an ASP.NET MVC application to upload Excel data, saved as CSV, to a database. While reading the CSV file with a StreamReader, if a line contains a lowercase letter immediately followed by an uppercase letter, it is split into two lines. Example:
Line :"1,This is nothing but the Example to explanationIt results wrong, testing example"
This line is split into:
Line 1: 1,This is nothing but the Example to explanation"
Line 2:""
Line 3:It results wrong, testing example
whereas the CSV file itself looks right: "1,This is nothing but the Example to explanationIt results wrong, testing example"
code :
Dim csvFileReader As New StreamReader("my csv file Path")
While Not csvFileReader.EndOfStream()
Dim _line = csvFileReader.ReadLine()
End While
Why is this happening, and how can I resolve it?
When a cell in an Excel spreadsheet contains multiple lines and the sheet is saved to a CSV file, Excel separates the lines within the cell with a line-feed character (ASCII value 0x0A). Each row in the spreadsheet is terminated with the typical carriage-return/line-feed pair (0x0D 0x0A). When you open the CSV file in Notepad, it does not show the lone LF character at all, so the text appears to run together on one line. So, even though Notepad doesn't show it, the CSV file actually looks like this:
' 1,"This is nothing but the Example to explanation{LF}It results wrong",testing example{CR}{LF}
According to the MSDN documentation on the StreamReader.ReadLine method:
A line is defined as a sequence of characters followed by a line feed ("\n"), a carriage return ("\r"), or a carriage return immediately followed by a line feed ("\r\n").
Therefore, when you call ReadLine, it will stop reading at the end of the first line in a multi-line cell. To avoid this, you would need to use a different "read" method and then split on CR/LF pairs rather than on either individually.
However, this isn't the only issue you will run into with reading CSV files. For instance, you also need to properly handle the way quotation characters in a cell are escaped in CSV. In such cases, unless it's really necessary to implement it in your own way, it's better to use an existing library to read the file. In this case, Microsoft provides a class in the .NET framework that properly handles reading CSV files (including ones with multi-line cells). The name of the class is TextFieldParser and it's in the Microsoft.VisualBasic.FileIO namespace. Here's the link to a page in the MSDN that explains how to use it to read a CSV file:
http://msdn.microsoft.com/en-us/library/cakac7e6
Here's an example (it assumes Imports Microsoft.VisualBasic.FileIO at the top of the file):
Using reader As New TextFieldParser("my csv file Path")
reader.TextFieldType = FieldType.Delimited
reader.SetDelimiters(",")
While Not reader.EndOfData
Try
Dim fields() As String = reader.ReadFields()
' Process fields in this row ...
Catch ex As MalformedLineException
' Handle exception ...
End Try
End While
End Using
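The difference between line-based reading and CSV-aware parsing can be reproduced with a small experiment. This sketch uses Python's standard csv module as a stand-in for TextFieldParser (it is a different library, used here only to demonstrate the behaviour), and the second data row is invented for the demonstration.

```python
import csv
import io

# Two rows ended by CR+LF; the quoted cell in the first row contains a lone LF.
raw = (
    '1,"This is nothing but the Example to explanation\n'
    'It results wrong",testing example\r\n'
    '2,second row,ok\r\n'
)

# Line-based reading breaks the first row at the LF inside the quoted cell.
naive_lines = raw.splitlines()
print(len(naive_lines))

# A CSV-aware parser keeps the quoted multi-line cell together.
rows = list(csv.reader(io.StringIO(raw, newline="")))
print(len(rows))
print(repr(rows[0][1]))
```

Line-based splitting produces three pieces, which matches the symptom in the question, while the parser returns two rows with the line feed preserved inside the second field of the first row.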