How to search and highlight text in PDF using Excel VBA - vba

Right, so after hours of searching; I've come up with nothing for excel vba, which I find surprising. Found some vbs that I tried to port over but no luck. I have managed to import the pdf text into sheets and search it, which is good; but this won't allow me to actually highlight the pdf obviously.
What I'm trying to do is open up PDF docs, search them for keywords and then highlight those words and save. I've got adobe acrobat X, so there must be some sort of API that will allow me to do this with excel vba? Am I going to have to use some sort of opensource library like iText; I would prefer not to.
Some of the vbs that I saw involved finding text letter by letter and then drawing rectangles around it and colouring with javascript and that just seemed unnecessarily complicated (couldn't get the port to work anyway...).
CLARIFICATION:
I don't want to highlight the text in excel, I want to highlight it on the PDF. I am only reading it into Excel to search for the text and see if its in the PDF, since I don't know how else to do this.
PS: It would also be nice to be able to use OCR on image pdfs.

Ok, played a little bit around with the code I already have had and js annots.
Attached you will find a VBScript which can mark/highlight a word permanent. It can easily be changed to mark also more as only one word. In the AcroJS help file you can find some options for the markers outfit.
The VBS code I wrote VBA like. So you can copy it direct into your IDE.
Enjoy, Reinhard
'// Save this as xxx.vbs and start with Double Click
'// Acrobat must be opend before with an active document!! -otherwise error-
wordTF = "Reinhard" '//word to find
pdfText = ""
set WshShell = CreateObject ("Wscript.Shell")
WshShell.AppActivate("Adobe Acrobat")
WScript.Sleep 500
'// get the active Document
Set AcroApp = CreateObject("AcroExch.App")
Set AVDoc = AcroApp.GetActiveDoc
Set PDDoc = AVDoc.GetPDDoc
Set AForm = CreateObject("AFormAut.App") 'connect to Form API for later use
maxPages = PdDoc.GetNumPages
for p = 0 to maxPages - 1 '// start the page loop
Set PdfPage = PDDoc.AcquirePage(p) '// p = Pagenumber (zero based)
Set PageHL = CreateObject("AcroExch.HiliteList") '// created to get the page text
PageHLRes = PageHL.Add(0,9000) '<<--SET in FILE! (Start,END[9000=All])
Set PageSel = PdfPage.CreatePageHilite(PageHL)
for i = 0 to PageSel.Getnumtext - 1 '// start the word loop on current page
word = PageSel.getText(i) '// get one word
pdfText = pdfText & word '// gather words on page
if instr(word, wordTF) then '// used instr because the "word" you may get as "word "
msgbox("add:""" &word &"""") Set wordToHl = CreateObject("AcroExch.HiliteList") '// created to get the word on list
wordToHl.Add i, 1 'Hilite the word Reinhard
Set wordHl = PdfPage.CreateWordHilite(wordToHl)
Set rect = wordHl.GetBoundingRect
msgbox("left:" &rect.Left &" bot:" &rect.bottom &" right:"&rect.Right &" top:" &rect.Top)
AVDoc.SetTextSelection(wordHl) '// highlight the word (not really needed)
AVDoc.ShowTextSelect() '// show highlighted text (not really needed)
'// write and execute js to mark permanent (to lazy to translate to jso)
ex = " // set annot for text selection " &vbLf _
& "var sqannot = this.addAnnot({type: ""Square"", page: 1, " &vbLf _
& "rect: [" &rect.left &", "& rect.top &", " &rect.right &", " &rect.bottom &"], " &vbLf _
& "name: ""p" &p &"i" &i &"""});"
msgbox(ex)
AForm.Fields.ExecuteThisJavaScript ex
end if '// word found
Next '// get next word
msgBox(pdfText)
pdfText = ""
next '// get next page
msgbox("Done!")

There are some possibilities to remote control Acrobat. On Mac, it is via AppleScript, and on Windows, it is via VB/VBS (if I remember correctly). In any case, you then have the possibility to run Acrobat JavaScript.
You might download the Acrobat SDK from the Adobe website, and look through the Documentation folder.
Despite the not so good experiences, this is kind of the way to go: loop through all pages of the document, loop through all the "words" on the actual page, read out the coordinates of the bounding box of the found word (also known as "quads"), maybe do some comparisons with other "words", to figure out whether these "words" do belong together. Finally create a Highlight Annotation using as coordinates the read out quads.
Another possibility for finding words in a PDF document would be using the markup part of the Redaction tool (stop the redaction process before the removing and writing back of the redacted document happens). Then you would run an Acrobat JavaScript enumerating all the Redaction type annotations, and replace them with similar Highlight annotations.

Related

How to disable autocomplete in vba?

I'm creating the program for exporting several excel sheets to pdf with watermarks and form fields etc. Everything works smooth right now but the final pdf is quite large. So I was thinking about the best way to make it smaller and I found out that the best result is by simply opening the pdf in Adobe Acrobat and then print it with "Adobe PDF" printer. This way I reduce the file size to 1/6 of the original size.
So I'm trying to do this via the VBA code and it looks like it's prety straight forward code using the JS.
sPath = "some path"
sPathFinal = "some other path"
Dim AcroApp As AcroAVDoc: Set AcroApp = CreateObject("AcroExch.AVDoc")
Dim Document As AcroPDDoc
Dim JSO, pp
AcroApp.Open sPath, ""
Set Document = AcroApp.GetPDDoc()
Set JSO = Document.GetJSObject
Set pp = JSO.getPrintParams
pp.printerName = "Adobe PDF"
pp.Filename = sPathFinal
JSO.Print (pp)
The problem is in the very last line as it should be
JSO.print(pp) - "print" with lowercase "p"
But everytime I step away from the lane, it gets autocorrected to uppercase "P". I tried to turn everything off in Tools -> Options -> Editor -> Code settings as well as on other places in options tab but had no luck so far.
Is there a way to prevent this autocorrect?
(Also I'm not native english speaking so there is quite big chance that it is called differently :)
Short answare is no you can't, because VBA editor auto-correct the case in your code.
This is because VB is case-sensitive (despite it doesn't look like it is), and the editor tries to prevent typo by changing the case of your variables.
If you want to preserve the case and avoid auto-correct, the simplest solution is to use another editor (like Notepad) and compile your code from the command-line.
Hope this help.

VBS Find/ replace double paragraph spacing with single spacing

I wasn't sure how to post a "question" that I found an answer to, but thought that it might be worth sharing my solution to save others the time I spent in figuring out how to do this.
Essentially, I have a PDF (with lots of pages/ formatting) that I want to strip the text out of, and paste into something else. However, a simple copy/paste will still leave text in its columns and automatically insert paragraph spaces that you then need to press end, delete, space, then repeat sequence indefinitely. Well, that's what programming was made for - doing repeated tasks for you so you don't have to.
My answer is posted below. If anyone has a better solution please let me know!
Below I pasted my code from a vbscript that I generated to do so. You will still need to go back through your text file and fix some bits & pieces after running the script that didn't follow the standard template that you programmed for.
Also, I'll note that I used notepad++ to determine how (in windows) Adobe reader handled carriage returns versus line feed (since the distinction is rather blurred today). I reference this article and the answer by AAT, which helped me in understanding the difference. The accepted answer is useful when specifically referencing vbs.
REM Set constants, then open file and copy into a buffer (contents)
Const ForReading = 1, ForWriting = 2
Dim fs, txt, contents
Set fs = CreateObject("Scripting.FileSystemObject")
Set txt = fs.OpenTextFile("originalTextFile.txt", ForReading)
contents = txt.ReadAll
txt.Close
REM Replace a double carriage return with un-repeatable text that as placeholder
contents = Replace(contents, vbCrLf & vbCrLf, "$%^&")
REM then replace leftover carriage returns with blank,
contents = Replace(contents, vbCrLf, "")
contents = Replace(contents, vbCrLf, "")
REM finally, restore original carriage returns for paragraph spacing
contents = Replace(contents, "$%^&", vbCrLf & vbCrLf)
contents = Replace(contents, "$%^&", vbCrLf & vbCrLf)
REM Write to file
Set txt = fs.OpenTextFile("textFileRemovedSpaces.txt", ForWriting)
txt.Write contents
txt.Close
MsgBox("Done!")
Step 1: Save pdf as a text file - this strips out the pictures/ etc. With Adobe Reader, do File -> Save as other -> Text.
Step 2: Save above as Something.vbs, and edit file names in script as appropriate. Make sure to also create the empty text file for the script to save the edited text in. Note in vbs, the text "REM" signifies a comment follows.
Step 3: Run Script.
Step 4: Profit!
I've find this useful, as it for the most part saves a lot of effort in editing a 300 page pdf that I needed to convert to a word document.
Again, if anyone has a better solution please let me know!

Converting word dialog to VB

It has been a long time since doing any type of coding.
I've been tasked to send a test page (looks like old TV test pattern page) to all of our printers prior to shift start. These pages will help our team determine if there is any physical issues with the printers (bad toner, fuser, etc)
I have found this code and used in a VBA (MS Word).
Sub Sorterprint01()
Dim sPrinter As String
Dim sPrinter1 As String
With Dialogs(wdDialogFilePrintSetup)
sPrinter = .Printer
.Printer = "\\dc999nt09\USPRT_01"
.DoNotSetAsSysDefault = True
.Execute
Application.PrintOut FileName = "\\dc999file\share\7yr\Support\IS_TEAM_LOCAL\TEST SHEETS\BARCODE TEST SORTER01.docx"
.Printer = sPrinter
.Execute
End With
End Sub
I even created a form to print to all or just a specific printer.
Management does NOT want this in a word doc and would prefer to have this in a VB app (even better would be web based).
After extensive research I have found that wdDialogFilePrintSetup is a WORD based dialog and does not work in VB6/2008/2013). I am just getting back into coding and need a quick solution.
Use Automation. If you want to print docx files then you'll need word.
Set word = CreateObject("Word.Application")
With Word.dialogs(wdDialogFilePrintSetup)
etc. Just preface your objects with word..
If using VBScript (which can't access constants - nor can VB6 if you don't add Word as a Reference) you need to specify the actual number that wdDialogFilePrintSetup is equal to.
So
With Word.dialogs(97)

How to send a same reply mail (template saved in outlook) to the mail sender (to, cc)?

I am receiving continues emails from customer (different customers) to update their asset details in database.. once process done .. I have to reply (including cc) from their mail telling like "asset details successfully stored in Database" (am using template) using VBA.
Option Explicit
Public Sub ReplyToAll()
Dim oExp As Outlook.Explorer
'for selected mails in outlook
Dim oSM As mailItem
Dim oNM As mailItem
On Error GoTo Err
Set oExp = Outlook.Application.ActiveExplorer
'Check if something is selected
If oExp.Selection.Count > 0 Then
'Get the first item selected
Set oSM = ActiveExplorer.Selection.Item(1)
'Create a Reply template
Set oNM = oSM.ReplyAll
With oNM
'Change the subject
.Subject = "RE: " & oSM.Subject
'Change the body
.Body = .Body & Chr(13) & Chr(13)
'Display the new mail before sending it
.Display
End With
End If
Exit Sub
Err:
MsgBox Err.Description, vbCritical
End Sub
section 3
Sub ReplyAll()
Dim objOutlookObject As mailItem
For Each objOutlookObject In GetCurrentOutlookItems
With objOutlookObject
.ReplyAll.Display
'prob area code does not include the template saved in the location c ..throws some error
.createitemtemplate("c:\car.jtm")
End With
Next
End Sub
Function GetCurrentOutlookItems() As Collection
Dim objApp As Outlook.Application
Dim objItem As Object
Dim colItems As New Collection
Set objApp = CreateObject("Outlook.Application")
On Error Resume Next
Select Case TypeName(objApp.ActiveWindow)
Case "Explorer"
For Each objItem In objApp.ActiveExplorer.Selection
colItems.Add objItem
Next
Case "Inspector"
colItems.Add objApp.ActiveInspector.CurrentItem
Case Else
' anything else will result in an error, which is
' why we have the error handler above
End Select
Set objApp = Nothing
Set GetCurrentOutlookItems = colItems
End Function
I am sorry my comment was so curt; 500 characters does not leave much room for a full answer.
Your question is very unclear so it is likely to be closed. Certainly, I do not see how anyone could answer it. That is why it is important that you try to solve your own problem and return as necessary with specific questions.
Below I provide links to recent posts that I believe will help you get started. As I said in my comment, look through recent posts. Use the search facility. There are some very good answers here if you look for them.
The first two posts are tutorials written by me. The early steps are the same but, because the questions were not quite the same, later steps are different. Look at both and pick out the bits relevant to you. The others posts all contain information you may find helpful.
How to import the outlook mail data to excel
update excel sheet based on outlook mail
could anyone guide me in creating an outlook macro that does the following
send an email from excel 2007 vba using an outlook template set variables
using visual basic to access subfolder in inbox
vba outlook event moving email
New section in response to new information from questioner
Except for minor modifications, the code in your question was taken from the Microsoft Help file for NewMailEx Event. This code will only work if you have the correct type of installation and if you place it in the correct place:
"The NewMailEx event will only fire for mailboxes in Microsoft Outlook that provide notification for received message such as Microsoft Exchange Server. Also, the event will fire only if Outlook is running. In other words, it will not fire for the new items that are received in the Inbox when Outlook was not open. Developers who want to access these items for customers running Outlook on an Exchange server e-mail account need to implement their code on the server. However, the NewMailEx event will fire against Cached Exchange Mode in all settings: Download Full Items, Download Headers, and Download Headers and then Full Items."
Do you have the correct type of installation? Can you place your code on the server? Even if this is the correct approach for the final version of your macro, I do not believe it is the correct approach while you are learning VBA and Outlook.
You need two things:
a detailed specification of the macro you wish to write and
more understanding of VBA and Outlook.
I doubt you can create the detailed specification yet because you do not know enough about VBA and Outlook. But we can list things you will need to know:
How do you write to your database from Outlook?
How do you identify the mail items you wish to record? In your example, you are checking for a subject of "Hello" and replying "Hi". This is fine for a first experiment but you need to identify the real method. Is it a new sender? Is there specific information in the body of the message? Does a human have to identify such mail items?
In your example, you have a folder "Personal" under "Inbox". Many people seem to have this type of folder structure and Microsoft examples tend to use folders like this. I do not. I have a folder called "!Home". Under this I have folders for "Insurance", "Shopping", "Money". Under these I have folders for my different suppliers. Once I have dealt with a message, I move it to appropriate folder. Replies go to the same folder. This is my system and it works for me. What is your system going to be? Will, for example, there be a single folder for all customers or one per customer?
The above is a starter list of questions for your specification but it is also a starter list of things you need to know.
Perhaps you have a boss who wants you to stop wasting time and start writing the macro but you do not know enough yet to plan the final macro.
Start with my tutorials. The first three steps are about the folder structure. These steps are essential if you have the kind of complex folder structures I have. Next I go through a folder displaying selected information from each mail item. I have steps in which I write message bodies to disc. I suggest you go through both tutorials and try my code. Not all of it will be immediately useful but it is all good background information.
What is your database? Is it Access or Excel? There is some help in my tutorials and in the other links above with writing to Excel which you could adapt for Access.
I think the above is enough for now. Take it slowly and it will start to make sense. I still remember the first time I tried to write an Outlook macro so I understand your confusion. I promise that it will become clear. Best of luck.
New section in response to the following comment:
"hello i have tried ..Got what i want....Removed my previous code..and tried replaced the new code .. Now little help needed from you ....is there any way to use same format like when we click the replyall button in outlook .. my code working fine ..prob is format of the mail is differ .."
Problem 1
.Body = .Body & Chr(13) & Chr(13)
You are using the text body. I think you want the HTML body. Try:
.HTMLBody = .HTMLBody & Chr(13) & Chr(13)
Problem 2
You cannot add to the HTML body in this way. The HTML body will be:
<!doctype ...><html><head> ... </head><body> ... </body></html>
You must add your text to the beginning of the body; that is, just after <body>. If you just add your text, you will be accepting whatever style, margins and colours the sender has used. The following code adds some text that looks the same in every email I have tried it with. My text is within a table with a single cell. The table covers the full width of the page. The text is blue on a white background.
Dim InsertStg As String
Dim Inx As Long
Dim Pos As Long
'Change the body step 1: Create the string to be inserted
InsertStg = "<table border=0 width=""100%"" style=""Color: #0000FF""" & _
" bgColor=#FFFFFF><tr><td><p>"
For Inx = 1 To 10
InsertStg = InsertStg & "Sentence " & Inx & " of first paragraph. "
Next
InsertStg = InsertStg & "</p><p>"
For Inx = 1 To 10
InsertStg = InsertStg & "Sentence " & Inx & " of second paragraph. "
Next
' The following adds a signature at the bottom of the message.
' "font-family" gives a list of fonts to be tried. If these are
' missing from your computer, use the names of fonts you do have.
' "serif" means that if none of the fonts exist any serif font
' that exists is to be used.
InsertStg = InsertStg & "</p><p style = ""font-family: Mistral, " & _
"Vivaldi, serif; font-size: 14px; color: " & _
"rgb(127,0,127)"">John Smith<br>5 Acacia Avenue<br>"
InsertStg = InsertStg & "</p></td></tr></table>"
'Change the body step 2: Find insertion position just after <Body> element
Pos = InStr(1, LCase(.HTMLBody), "<body")
If Pos = 0 Then
Call MsgBox("<Body> element not found in HTML body", vbCritical)
Exit Sub
End If
Pos = InStr(Pos, .HTMLBody, ">")
If Pos = 0 Then
Call MsgBox("Terminating > for <Body> element not found in HTML body", vbCritical)
Exit Sub
End If
'Change the body step 3: Insert my text into body
.HTMLBody = Mid(.HTMLBody, 1, Pos) & InsertStg & Mid(.HTMLBody, Pos + 1)

Recover URL from MS Word fields showing "Error! Hyperlink reference not valid"

I have some word documents that have place holder URL's in them. The URL's are something like "http://<URL>/service.svc". Word has figured that these have to be a valid URL and when the fields get updated, replace them with "Error! Hyperlink reference not valid".
When I mouse over that error text, word pop's up a tooltip still showing the original text. Is there some way to extract the original text? The document is over 80 pages in length. Surely there must be a programmatic way to do this?
I've tried the following code, but it does not seem to find the Hyperlinks in question.
For Each oHyperlink In ActiveDocument.Hyperlinks
If IsObjectValid(oHyperlink) Then
If Len(oHyperlink.Address) > 0 Then
If Mid(oHyperlink.Address, 8, 5) = "<ULR>" Then
oHyperlink.TextToDisplay = oHyperlink.Address
oHyperlink.Range.Font.Color = wdColorBlue
oHyperlink.Range.Font.Underline = wdUnderlineSingle
oHyperlink.Range.Font.UnderlineColor = wdColorBlue
End If
End If
End If
Next oHyperlink
If in the above code sample you remove the line which tests if the item IsObjectValid it comes back with a load of hyperlink stating "Object has been deleted". I'm assuming this could be the broken (by word) "Hyperlinks". How do I get my text back?
Note: This document has been saved and closed so Ctr+z is not an option.
The information is in there, how do I get it?
Have you tried opening the file with Notepad and searching for http? This works for me.
You can also use a program like "strings" to get all of the text strings from the Word file. See http://technet.microsoft.com/en-us/sysinternals/bb897439.aspx
How about:
For Each oHyperlink In ActiveDocument.Hyperlinks
set rng = oHyperlink.Range
rng.Collapse wdCollapseStart
rng.text = oHyperlink.TextToDisplay
oHyperlink.Range.Delete
Next oHyperlink
I just had the same problem and worked around the problem by opening the document with LibreOffice.
I'm sorry I could not find anything in options of Word to open or revert this (Word v16.61 on MacOS 12).