I wasn't sure how to post a "question" that I had already found an answer to, but I thought it was worth sharing my solution to save others the time I spent figuring this out.
Essentially, I have a PDF (with lots of pages and formatting) that I want to strip the text out of and paste into something else. However, a simple copy/paste leaves the text in its columns and inserts a line break at the end of every line, so you have to press End, Delete, Space, and then repeat that sequence indefinitely. Well, that's what programming was made for: doing repeated tasks for you so you don't have to.
My answer is posted below. If anyone has a better solution please let me know!
Below is the VBScript I wrote to do this. After running the script, you will still need to go back through your text file and fix the bits and pieces that didn't follow the pattern the script expects.
Also, I used Notepad++ to determine how (on Windows) Adobe Reader handles carriage returns versus line feeds (since the distinction is rather blurred today). I referenced this article and the answer by AAT, which helped me understand the difference. The accepted answer is useful when specifically working with VBS.
REM Set constants, then open file and copy into a buffer (contents)
Const ForReading = 1, ForWriting = 2
Dim fs, txt, contents
Set fs = CreateObject("Scripting.FileSystemObject")
Set txt = fs.OpenTextFile("originalTextFile.txt", ForReading)
contents = txt.ReadAll
txt.Close
REM Replace each double line break (paragraph break) with an unlikely placeholder string
contents = Replace(contents, vbCrLf & vbCrLf, "$%^&")
REM Then replace the leftover single line breaks with a space so words don't run together
contents = Replace(contents, vbCrLf, " ")
REM Finally, restore the original double line breaks for paragraph spacing
contents = Replace(contents, "$%^&", vbCrLf & vbCrLf)
REM Write to file
Set txt = fs.OpenTextFile("textFileRemovedSpaces.txt", ForWriting)
txt.Write contents
txt.Close
MsgBox("Done!")
Step 1: Save the PDF as a text file; this strips out the pictures etc. In Adobe Reader, use File -> Save as Other -> Text.
Step 2: Save the script above as Something.vbs and edit the file names in the script as appropriate. Make sure to also create the empty text file for the script to save the edited text in. Note that in VBS, "REM" marks the rest of the line as a comment.
Step 3: Run Script.
Step 4: Profit!
I've found this useful, as for the most part it saves a lot of effort in editing a 300-page PDF that I needed to convert to a Word document.
Again, if anyone has a better solution please let me know!
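For anyone who would rather not use VBScript, the same placeholder trick can be sketched in Python. This is only a rough equivalent under my own assumptions: the function name unwrap_lines is made up for illustration, the line breaks are assumed to be CR+LF as in the script, and single line breaks are joined with a space.

```python
PLACEHOLDER = "$%^&"  # same unlikely placeholder string as in the script


def unwrap_lines(text: str, newline: str = "\r\n") -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    paragraph_break = newline + newline
    # 1. Protect paragraph breaks (two line breaks in a row).
    text = text.replace(paragraph_break, PLACEHOLDER)
    # 2. Turn every remaining single line break into a space.
    text = text.replace(newline, " ")
    # 3. Put the paragraph breaks back.
    return text.replace(PLACEHOLDER, paragraph_break)


sample = "first line\r\nsecond line\r\n\r\nnew paragraph\r\nwraps here"
print(unwrap_lines(sample))
```

As with the script, this breaks if the text itself happens to contain the placeholder string, so pick something that cannot occur in your document.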
Related
Is it possible for the following code to produce NUL values within a text file?
Dim temp_str = "123456;1234567"
My.Computer.FileSystem.WriteAllText(Path & "stats.txt", temp_str, False)
It seems simple, but the application writes quite often, and several of the files it accesses contain strings that show up as runs of NUL characters when opened in Notepad++. Some other editors show just squares; it seems like each character is represented by a block/NUL.
So far I've been unable to reproduce this on my test system. I just find the files on a COMX module's file system that's been running in the field and comes back faulty, but I've been seeing enough of these files to make it a problem that needs to be solved.
Does anyone have an idea to prevent this behaviour?
Hard to say what the problem is without more code, but try this if you want to replace the existing contents of the file:
Dim fileContent = "My UTF-8 file contents"
Using writer As IO.StreamWriter = IO.File.CreateText(fullPathIncludingExtension)
writer.Write(fileContent)
End Using
Or this if you want to append UTF-8 text:
Dim newLines = "My UTF-8 content to append"
Using writer As IO.StreamWriter = IO.File.AppendText(fullPathIncludingExtension)
writer.Write(newLines)
End Using
If you want to append Unicode (UTF-16) text, you must use a different constructor for StreamWriter:
Using writer As IO.StreamWriter = New IO.StreamWriter("full/path/to/file.txt", True, Text.Encoding.Unicode)
writer.Write(MyContentToAppend)
End Using
Note that the True argument to the constructor specifies that you want to append text.
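To make the overwrite/append/encoding distinction concrete, here is a small sketch of the same three cases in Python. The helper names and file names are invented for the example. It also prints the raw bytes, because UTF-16 stores a zero byte alongside every ASCII character, which a single-byte viewer displays as NUL. That is not necessarily what is happening in the question, just something worth ruling out.

```python
import os
import tempfile


def overwrite_text(path, text, encoding="utf-8"):
    """Replace the file's contents (mode "w" truncates first)."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


def append_text(path, text, encoding="utf-8"):
    """Add text to the end of the file (mode "a")."""
    with open(path, "a", encoding=encoding, newline="") as f:
        f.write(text)


demo_dir = tempfile.mkdtemp()

# UTF-8: overwrite, then append.
utf8_path = os.path.join(demo_dir, "stats.txt")
overwrite_text(utf8_path, "123456;1234567")
append_text(utf8_path, ";89")
with open(utf8_path, "rb") as f:
    utf8_bytes = f.read()

# UTF-16 little-endian (no byte order mark written with this codec name).
utf16_path = os.path.join(demo_dir, "stats_utf16.txt")
append_text(utf16_path, "12", encoding="utf-16-le")
with open(utf16_path, "rb") as f:
    utf16_bytes = f.read()

print(utf8_bytes)
print(utf16_bytes)
```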
I'm a novice at VBA, and I'm trying to read a document line by line so that I can eventually have the macro automatically remove entire lines based on their content.
Sub ayaya()
    Dim TextLine As String
    Open ActiveDocument.Path & "\Doc1.docm" For Input As #1
    Do While Not EOF(1) ' Loop until end of file.
        Line Input #1, TextLine ' Read line into variable.
        Debug.Print TextLine
    Loop
    Close #1
End Sub
The method I used was meant for text files and gives gibberish, as expected.
In: John Smith
Out: PK!~8ìz‡[Content_Types].xml (followed by more binary data)
Part of me hoped that it would give "John Smith". I've seen some solutions put the entire document into a text file. Is there any way I can delimit the data somehow? I'd like to be able to isolate a single line and remove it.
You are trying to read a docx or docm file, which is a zip archive. Word files are not plain text files, so you won't get anything meaningful treating them as such. You need to open the file with Word or another app that can read such files.
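To illustrate the "zip archive" point, here is a rough Python sketch. It does not open a real Word file: build_demo_docx is a made-up helper that creates a tiny stand-in archive containing only word/document.xml (a real .docx has more parts, such as [Content_Types].xml), and read_paragraphs is an equally hypothetical reader that pulls the paragraph text back out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_demo_docx(paragraphs):
    """Build a minimal stand-in for a .docx: a zip holding word/document.xml."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


def read_paragraphs(docx_bytes):
    """Return the text of each w:p paragraph in the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


data = build_demo_docx(["John Smith", "Jane Doe"])
print(data[:2])               # the zip signature, same "PK" as in the gibberish
print(read_paragraphs(data))
```

The leading "PK" in the question's output is that zip signature, which is why reading the file line by line as text cannot work.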
I have a whole bunch of text files exported from Photoshop that I need to import into an Excel document. I wrote a macro to do the job, and it seemed to work just fine on my test document. But when I tried loading some of the actual files produced by Photoshop, Excel started putting all the data except the first line in a separate column.
My code that reads the text file:
Open currentDocPath For Input As stream
Do Until EOF(stream)
Input #stream, currentLine
columnContents = Split(currentLine, vbTab)
For n = 0 To UBound(columnContents)
ActiveSheet.Cells(row, Chr(64 + colum + n)).Value = columnContents(n)
Next n
row = row + 1
Loop
Close stream
The text files I am reading look like this, only with much more data:
"Name" "Data" "Info" "blah"
"Name1" "Data1" "Info1" "blah1"
"Name2" "Data2" "Info2" "blah2"
The problem seemed pretty trivial, but when I load it into Excel, instead of looking like it does above, it looks like this:
ÿþ"Name" "Data" "Info" "blah"
Name1
Data1
Info1
blah1
Name2
Data2
Info2
blah2
Now I am not sure why this is happening. It seems like the first two characters in the first row are there because those bytes declare the text encoding. Somehow those characters keep the first row formatted correctly while the remaining rows lose their quotation marks and all get moved to new lines.
Could someone who understands UCS-2 Little Endian text encoding explain how I can work around this? When I convert the files to ASCII it works fine.
Cheers!
edit: Okay so I understand now that the encoding is UTF-16 (I don't know a whole lot about character encoding). My main issue is that it's formatting strangely and I don't understand why or how to fix it. Thanks!
As I mentioned in my comment, it appears the file you're trying to import is encoded in UTF-16.
In this vbaexpress.com article, someone suggested that the following should work:
Dim GetOpenFile As String
Dim MyData As String
Dim r As Long
GetOpenFile = Application.GetOpenFilename
r = 1
Open GetOpenFile For Input As #1
Do While Not EOF(1)
Line Input #1, MyData
Cells(r, 1).Value = MyData
r = r + 1
Loop
Close #1
Obviously I can't test it myself, but maybe it'll help you.
Why not just tell Excel to import the file? Microsoft has probably put hundreds of thousands of person-hours into that code. Record a macro of the import to get the code easily.
Remember, Excel is a tool for non-programmers to do programming things. Use it instead of trying to replace it.
These are the replacement file functions to use in new code. Add a reference to Microsoft Scripting Runtime.
OpenTextFile opens a specified file and returns a TextStream object that can be used to read from, write to, or append to the file.
object.OpenTextFile(filename[, iomode[, create[, format]]])
Arguments
object
Required. Object is always the name of a FileSystemObject.
filename
Required. String expression that identifies the file to open.
iomode
Optional. Can be one of three constants: ForReading, ForWriting, or ForAppending.
create
Optional. Boolean value that indicates whether a new file can be created if the specified filename doesn't exist. The value is True if a new file is created, False if it isn't created. If omitted, a new file isn't created.
format
Optional. One of three Tristate values used to indicate the format of the opened file. If omitted, the file is opened as ASCII.
The format argument can have any of the following settings:
Constant            Value  Description
TristateUseDefault  -2     Opens the file using the system default.
TristateTrue        -1     Opens the file as Unicode.
TristateFalse       0      Opens the file as ASCII.
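The underlying idea, telling the reader which encoding the file uses instead of relying on the ASCII default, can be sketched in Python as follows. This is only an illustration: read_utf16_tsv is a made-up name, and the sample file is generated on the spot to imitate a tab-separated UTF-16 export with the FF FE byte order mark (the "ÿþ" seen in the question).

```python
import csv
import os
import tempfile


def read_utf16_tsv(path):
    """Read a tab-separated UTF-16 file into a list of rows."""
    # encoding="utf-16" consumes the byte order mark; newline="" lets the
    # csv module handle the line endings itself.
    with open(path, encoding="utf-16", newline="") as f:
        return list(csv.reader(f, delimiter="\t", quotechar='"'))


# Build a small sample file: BOM followed by UTF-16 little-endian text.
text = '"Name"\t"Data"\t"Info"\t"blah"\r\n"Name1"\t"Data1"\t"Info1"\t"blah1"\r\n'
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xff\xfe" + text.encode("utf-16-le"))

rows = read_utf16_tsv(demo_path)
os.remove(demo_path)
print(rows)
```

Once the text is decoded correctly, each line splits into its four columns and the byte order mark no longer shows up in the first cell.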
Right, so after hours of searching, I've come up with nothing for Excel VBA, which I find surprising. I found some VBS that I tried to port over, but no luck. I have managed to import the PDF text into sheets and search it, which is good, but this obviously won't let me highlight anything in the PDF itself.
What I'm trying to do is open PDF documents, search them for keywords, highlight those words, and save. I've got Adobe Acrobat X, so there must be some sort of API that will let me do this from Excel VBA? Or am I going to have to use an open-source library like iText? I would prefer not to.
Some of the VBS that I saw involved finding text letter by letter, then drawing rectangles around it and colouring them with JavaScript, and that just seemed unnecessarily complicated (I couldn't get the port to work anyway...).
CLARIFICATION:
I don't want to highlight the text in Excel; I want to highlight it in the PDF. I am only reading it into Excel to search for the text and see if it's in the PDF, since I don't know how else to do this.
PS: It would also be nice to be able to use OCR on image pdfs.
OK, I played around a bit with the code I already had and with JS annotations.
Below you will find a VBScript that can permanently mark/highlight a word. It can easily be changed to mark more than one word. In the AcroJS help file you can find some options for the marker's appearance.
I wrote the VBS code in a VBA-like style, so you can copy it directly into your IDE.
Enjoy, Reinhard
'// Save this as xxx.vbs and start it with a double-click
'// Acrobat must already be open with an active document, otherwise you get an error
wordTF = "Reinhard" '//word to find
pdfText = ""
set WshShell = CreateObject ("Wscript.Shell")
WshShell.AppActivate("Adobe Acrobat")
WScript.Sleep 500
'// get the active Document
Set AcroApp = CreateObject("AcroExch.App")
Set AVDoc = AcroApp.GetActiveDoc
Set PDDoc = AVDoc.GetPDDoc
Set AForm = CreateObject("AFormAut.App") 'connect to Form API for later use
maxPages = PdDoc.GetNumPages
for p = 0 to maxPages - 1 '// start the page loop
Set PdfPage = PDDoc.AcquirePage(p) '// p = Pagenumber (zero based)
Set PageHL = CreateObject("AcroExch.HiliteList") '// created to get the page text
PageHLRes = PageHL.Add(0,9000) '<<--SET in FILE! (Start,END[9000=All])
Set PageSel = PdfPage.CreatePageHilite(PageHL)
for i = 0 to PageSel.Getnumtext - 1 '// start the word loop on current page
word = PageSel.getText(i) '// get one word
pdfText = pdfText & word '// gather words on page
if instr(word, wordTF) then '// used instr because the "word" you may get as "word "
msgbox("add:""" &word &"""")
Set wordToHl = CreateObject("AcroExch.HiliteList") '// created to get the word on list
wordToHl.Add i, 1 'Hilite the word Reinhard
Set wordHl = PdfPage.CreateWordHilite(wordToHl)
Set rect = wordHl.GetBoundingRect
msgbox("left:" &rect.Left &" bot:" &rect.bottom &" right:"&rect.Right &" top:" &rect.Top)
AVDoc.SetTextSelection(wordHl) '// highlight the word (not really needed)
AVDoc.ShowTextSelect() '// show highlighted text (not really needed)
'// write and execute js to mark permanently (too lazy to translate to jso)
ex = " // set annot for text selection " &vbLf _
& "var sqannot = this.addAnnot({type: ""Square"", page: " &p &", " &vbLf _
& "rect: [" &rect.left &", "& rect.top &", " &rect.right &", " &rect.bottom &"], " &vbLf _
& "name: ""p" &p &"i" &i &"""});"
msgbox(ex)
AForm.Fields.ExecuteThisJavaScript ex
end if '// word found
Next '// get next word
msgBox(pdfText)
pdfText = ""
next '// get next page
msgbox("Done!")
There are some possibilities to remote control Acrobat. On Mac, it is via AppleScript, and on Windows, it is via VB/VBS (if I remember correctly). In any case, you then have the possibility to run Acrobat JavaScript.
You might download the Acrobat SDK from the Adobe website, and look through the Documentation folder.
Despite the not so good experiences, this is kind of the way to go: loop through all pages of the document, loop through all the "words" on the actual page, read out the coordinates of the bounding box of the found word (also known as "quads"), maybe do some comparisons with other "words", to figure out whether these "words" do belong together. Finally create a Highlight Annotation using as coordinates the read out quads.
Another possibility for finding words in a PDF document would be using the markup part of the Redaction tool (stop the redaction process before the removing and writing back of the redacted document happens). Then you would run an Acrobat JavaScript enumerating all the Redaction type annotations, and replace them with similar Highlight annotations.
I have a form with several text boxes and I want to write the contents of each of them to a new line in a .txt file. As in, the user fills in a form, and the info is stored in the file. Then I want to be able to retrieve the info from the file into the same text boxes. I am able to do this, so far, but I encounter problems when one of the text boxes is multiline.
PrintLine(1, txtBox1.Text)
PrintLine(1, txtBox2.Text) ' which is the multiline one
PrintLine(1, txtBox3.Text)
When I read this back from the file I get the second line of the multiline text box where I want the text from txtBox3 to be.
txtBox1.Text = LineInput(1)
txtBox2.Text = LineInput(1)
txtBox3.Text = LineInput(1)
How can I get all the lines from the multiline text box to write to one line in the file, and then read it back as separate lines in a multiline text box?
I hope I am making sense? I really would like to keep the logic of "one txtBox - one line in the file"
I guess I need to use different methods of writing and reading, but I am not that familiar with this, so any help is much appreciated.
You can rely on the Lines property when the text box has more than one line. Sample code (curTextBox is the given TextBox control):
Using writer As System.IO.StreamWriter = New System.IO.StreamWriter("path", True)
Dim curLine As String = curTextBox.Text
If (curTextBox.Lines.Count > 1) Then
curLine = ""
For Each line As String In curTextBox.Lines
curLine = curLine & " " & line
Next
curLine = curLine.Trim()
End If
writer.WriteLine(curLine)
End Using
NOTE: this code writes all the text from the given TextBox on one line of the file, regardless of how many lines it has. If it has more than one line, a blank space separates the individual lines (all of them still end up on a single line of the file). You may want to use a different separator character instead (replace & " " & with the one you want).
One option would be to escape the newlines so that they aren't in the output, then unescape them on reading back in.
Here's some example code that will do this (I've never written VB before, so this probably isn't idiomatic):
' To output to a file:
Dim output As String = TextBox2.Text
' Escape all the backslashes and then the vbCrLfs
output = output.Replace("\", "\bk").Replace(vbCrLf, "\crlf")
' Write the data from output to the file
' To read data from the file:
Dim input As String = "" ' Replace "" with the data read from the file
' Put vbCrLfs back for \crlf, then put \ for \bk
input = input.Replace("\crlf", vbCrLf).Replace("\bk", "\")
' Put the text back in its box
TextBox2.Text = input
Another option would be to store your data in XML, JSON, or YAML. Any of those are text-based formats that will require a library to parse, but should cleanly handle the multiline text you have, along with providing increased future flexibility.
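As a quick check that the escaping scheme round-trips, here is the same idea sketched in Python. The function names are invented for the example, and like the VB code it only handles CR+LF line breaks; a lone "\n" would pass through unescaped.

```python
def escape_newlines(text):
    """Escape backslashes first, then CR+LF, so the result fits on one line."""
    return text.replace("\\", "\\bk").replace("\r\n", "\\crlf")


def unescape_newlines(line):
    """Undo escape_newlines: restore CR+LF first, then the backslashes."""
    return line.replace("\\crlf", "\r\n").replace("\\bk", "\\")


# Includes a real backslash and a literal "\crlf" to exercise the tricky cases.
original = "C:\\temp\r\nsecond line\r\nliteral \\crlf stays"
stored = escape_newlines(original)
restored = unescape_newlines(stored)

print(stored)
print(restored == original)
```

The order matters: backslashes have to be escaped before the line breaks when writing, and restored after them when reading, otherwise text that already contains a backslash sequence comes back changed.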
The following simple code works for me.
Saving multiline text to a single line in a file:
str = Replace(MyTextBox.Text, Chr(13) & Chr(10), "*LineFeed*") 'something recognizable
Print #1, str 'no quotes
To get the string from the file and put it on a TextBox:
Line Input #1, str
MyTextBox.Text = Replace(str, "*LineFeed*", Chr(13) & Chr(10))
Hope this helps