How to read the content of an online PDF file into a string variable using VBA?

I am wondering if anyone has dealt with this before. I have a spreadsheet with links to thousands of PDF files. I would like to load the content of each PDF into a string variable and run a few regular expressions to extract useful data. The function shown below loads the content of a PDF file into a string, but it only works for local files. In my case I am opening the PDF with IE.Navigate2 "https://www.example.com/mypdf.pdf", which opens the PDF in the browser. How can I load the content of that file into a string? The extreme solution would be to download the file, open it with the function below, and then delete it. Please let me know your thoughts. Note that the function below only works if you have Acrobat installed (not the Reader), and you also need to add a reference to the Adobe Acrobat Type Library in the VBA project.
Public Function ReadAcrobatDocument(strFileName As String) As String
    Dim AcroApp As CAcroApp, AcroAVDoc As CAcroAVDoc, AcroPDDoc As CAcroPDDoc
    Dim AcroHiliteList As CAcroHiliteList, AcroTextSelect As CAcroPDTextSelect
    Dim PageNumber, PageContent, Content, i, j
    Set AcroApp = CreateObject("AcroExch.App")
    Set AcroAVDoc = CreateObject("AcroExch.AVDoc")
    If AcroAVDoc.Open(strFileName, vbNull) <> True Then Exit Function
    ' The following While-Wend loop shouldn't be necessary but timing issues may occur.
    While AcroAVDoc Is Nothing
        Set AcroAVDoc = AcroApp.GetActiveDoc
    Wend
    Set AcroPDDoc = AcroAVDoc.GetPDDoc
    For i = 0 To AcroPDDoc.GetNumPages - 1
        Set PageNumber = AcroPDDoc.AcquirePage(i)
        Set PageContent = CreateObject("AcroExch.HiliteList")
        If PageContent.Add(0, 9000) <> True Then Exit Function
        Set AcroTextSelect = PageNumber.CreatePageHilite(PageContent)
        ' The next line is needed to avoid errors with protected PDFs that can't be read
        On Error Resume Next
        For j = 0 To AcroTextSelect.GetNumText - 1
            Content = Content & AcroTextSelect.GetText(j)
        Next j
    Next i
    ReadAcrobatDocument = Content
    AcroAVDoc.Close True
    AcroApp.Exit
    Set AcroAVDoc = Nothing: Set AcroApp = Nothing
End Function
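The download-then-read fallback mentioned in the question could be sketched roughly as follows. This is an untested sketch under several assumptions: the PDF is reachable with a plain HTTP GET (no login or cookies), the MSXML2.XMLHTTP and ADODB.Stream objects are available on the machine, and ReadAcrobatDocument is the function shown above. The names DownloadFile and ReadOnlinePdf and the temporary file name are made up for illustration.

```vba
' Hypothetical helper: saves the response body of a GET request to a local file.
Public Function DownloadFile(ByVal url As String, ByVal localPath As String) As Boolean
    Dim http As Object, stream As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", url, False              ' False = synchronous request
    http.send
    If http.Status <> 200 Then Exit Function ' leaves the return value False

    Set stream = CreateObject("ADODB.Stream")
    stream.Type = 1                          ' 1 = adTypeBinary
    stream.Open
    stream.Write http.responseBody
    stream.SaveToFile localPath, 2           ' 2 = adSaveCreateOverWrite
    stream.Close
    DownloadFile = True
End Function

' Hypothetical wrapper: download to a temp file, extract the text, delete the file.
Public Function ReadOnlinePdf(ByVal url As String) As String
    Dim tempPath As String
    tempPath = Environ$("TEMP") & "\vba_temp_download.pdf"   ' assumed temp location
    If Not DownloadFile(url, tempPath) Then Exit Function
    ReadOnlinePdf = ReadAcrobatDocument(tempPath)
    Kill tempPath
End Function
```

With thousands of links, it may be worth creating the Acrobat objects once outside the loop rather than once per file, but that would require restructuring ReadAcrobatDocument.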

Related

How to check pdf check box

I am trying to read one PDF and a VBA userform, and then fill out another PDF.
I wrote code that reads all the text in a PDF and then finds certain substrings based on tokens in that text. It is intended to populate the fields in the destination PDF from those substrings and to check the appropriate check boxes based on the userform. I can get the code to fill in the text fields and save the document, but it won't check the boxes.
The code previously used an AVDoc, but I switched to a JSObject (jso) because I don't want the PDF to pop up, and the jso avoids that problem.
I tried pdfBool.value = cBool(vbaBool), pdfBool.value = 1, pdfBool.value = "1", jso.setValue("checked"), jso.setValue("yes"), etc.
This code will run without crashing. I reduced the number of variables to one string and one bool for the sake of the example.
Sub main()
    'findString grabs all text from a pdf file. This code works.
    Dim mystr As String
    If findString(mystr) = False Then
        Application.StatusBar = "Cannot find Source PDF"
        Exit Sub
    End If
    Dim mypath As String
    mypath = ActiveWorkbook.Path & "\destination.pdf"
    Dim aApp As acrobat.AcroApp
    Dim pdfDoc As acrobat.CAcroPDDoc
    Dim jso As Object
    Set aApp = CreateObject("AcroExch.App")
    Set pdfDoc = CreateObject("AcroExch.PDDoc")
    If pdfDoc.Open(mypath) = True Then
        Set jso = pdfDoc.GetJSObject
        Dim vbaText As String
        Dim vbaBool As String
        vbaText = returnString("Token1")
        vbaBool = userForm1.checkBox1.value
        Dim pdfText As Object
        Dim pdfBool As Object
        Set pdfText = jso.getField("TextField1")
        Set pdfBool = jso.getField("CheckBox1")
        pdfText.Value = vbaText
        pdfBool.Value = vbaBool
        'save pdffile
        Dim fileSavePath As String
        fileSavePath = ActiveWorkbook.Path & "\My Save File.pdf"
        pdfDoc.Save PDSaveFull, fileSavePath
        'clean up memory
        Set pdfDoc = Nothing
        Set pdfText = Nothing
        Set pdfBool = Nothing
        Set jso = Nothing
    End If
    aApp.Exit
    Set aApp = Nothing
    Unload userForm1
End Sub
OK, after some searching I have found a solution. Basically, forms created using Adobe LiveCycle don't work well with checkboxes. I asked somebody in my organization and they confirmed that LiveCycle was used on forms for a while until we got rid of it. Honestly, I don't know what LiveCycle is, but the solution worked, so I think the issue was related to it.
The solution? Redo the PDF form. I exported the PDF to an Encapsulated PostScript file, which stripped away all the fields. After that, I used the Prepare Form tool, which automatically found all the relevant fields. Fortunately, with my PDF it found all of the fields perfectly, though there were one or two extra ones that I had to delete. The field names in the PDF and in the code need to match, so adjustments have to be made either to the PDF or to the code, but once I made that adjustment, everything was perfect.
Try jso.getField(pdfFieldName).Value = "Yes" or "No". The value is case-sensitive, so you have to use Yes or No.
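As a rough illustration of that last suggestion, the userform's Boolean could be translated into the field's string value before assigning it. This is only a sketch: it assumes the check box in the destination form really uses the values "Yes" and "No" as suggested above (other forms may use different export values), and CheckBoxFieldValue is a made-up helper name.

```vba
' Hypothetical helper: maps a Boolean to the case-sensitive string value
' suggested above. The actual values depend on how the form was built.
Public Function CheckBoxFieldValue(ByVal isChecked As Boolean) As String
    If isChecked Then
        CheckBoxFieldValue = "Yes"
    Else
        CheckBoxFieldValue = "No"
    End If
End Function
```

In the question's code this would be used as pdfBool.Value = CheckBoxFieldValue(userForm1.checkBox1.Value), instead of assigning the Boolean converted to a string.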

VBA Adobe Acrobat Sub failing after being successful previously

I have a subroutine that combines 22 PDFs into one. It opens the first PDF in the list, then loops from the second file through the last (n = 22), inserting each file's pages into the first PDF and then deleting the file at index i. The final product is one PDF containing all 22 documents, and the source PDFs are deleted so they don't clutter the folder. The crazy thing is that this script had been working the entire time, and now it doesn't! It exits the For loop without combining anything.
I've stepped through the code and noticed that the MergedDoc.GetNumPages() call (documented in Adobe's Interapplication Communication API docs) returns -1, which per the docs means it is failing. So does the "If MergedDoc.InsertPages... " condition, which exits the For loop.
But previously these things were not failing! Perhaps the document isn't being opened successfully in the .Open() call, but why would that be?
Does anybody have any idea what the issue could be? I added the Adobe Acrobat 10.0 Type Library in VBA via Tools -> References as well, and I am currently using Adobe Acrobat DC on my machine. The code is below, and I would love any input.
Thanks!
Sub MergePDFs(FileList As Variant)
    Dim i As Integer
    'Remember to include Acrobat (tools -> References)
    Dim AcroApp As Acrobat.CAcroApp
    Dim finalPath As String
    Dim numPages As Integer
    Set AcroApp = CreateObject("AcroExch.App")
    Set MergedDoc = CreateObject("AcroExch.PDDoc")
    Set DocToAdd = CreateObject("AcroExch.PDDoc")
    finalPath = FileList(0)
    'open first file in PDF Array
    'MergedDoc.Open ("C:\Users\akhawaja\Documents\_a.pdf")
    MergedDoc.Open (finalPath)
    MsgBox "Files being combined to path: " & finalPath
    For i = LBound(FileList) + 1 To UBound(FileList)
        'Loop through 2nd - last.
        '1) Open & Get # of pages
        '2)Insert pages, Save, exit
        'MsgBox FileList(i)
        DocToAdd.Open (FileList(i))
        ' Insert the pages of Part2 after the end of Part1
        numPages = MergedDoc.GetNumPages()
        'MsgBox numPages
        'MsgBox DocToAdd.GetNumPages()
        If MergedDoc.InsertPages(numPages - 1, DocToAdd, 0, DocToAdd.GetNumPages(), 0) = False Then Exit For
        'MsgBox "Cannot insert pages at doc: " & FileList(i)
        'End If
        If MergedDoc.Save(PDSaveFull, finalPath) = False Then Exit For
        'MsgBox "Cannot save the modified document"
        'End If
        DocToAdd.Close
        'Delete PDF file now that is has been added
        Kill (FileList(i))
    Next i
    MergedDoc.Close
    AcroApp.Exit
    Set AcroApp = Nothing
    Set MergedDoc = Nothing
    Set DocToAdd = Nothing
    MsgBox "Done"
End Sub
Just figured it out: the file path was a OneDrive URL. When I changed the folder to a local path on the C:\ drive, it had no issues. Weird, I know. Thanks for the help!
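A guard against this situation might look like the sketch below. It assumes the problem shows up as a path beginning with http:// or https:// (as can happen when a path is built from a workbook stored in a OneDrive-synced folder); IsUrlPath is a made-up helper name, not part of any Acrobat or Excel API.

```vba
' Hypothetical helper: True when the path looks like a web address
' rather than a local or UNC file system path.
Public Function IsUrlPath(ByVal filePath As String) As Boolean
    Dim lowered As String
    lowered = LCase$(Trim$(filePath))
    IsUrlPath = (Left$(lowered, 7) = "http://") Or (Left$(lowered, 8) = "https://")
End Function
```

MergePDFs could call this on FileList(0) before MergedDoc.Open and show a message instead of silently exiting the loop, for example: If IsUrlPath(FileList(0)) Then MsgBox "Path is a URL, not a local file": Exit Sub.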

Excel VBA code for searching PDF in Adobe Acrobat

I want to search a PDF file for a string and print the number of instances found. I've done this for Word, Excel, and PowerPoint, but never for Acrobat. There is an error when I call acroDoc.Range, so I assume this is the wrong syntax for Acrobat.
Run-time error '450': Wrong number of arguments or invalid property assignment.
I can't find answers in Adobe's documentation. What is the correct syntax for selecting the whole document and searching for a string?
Sub pdfSearch()
    Dim acroApp As Object
    Dim acroDoc As Object
    Dim aRng As Object
    Dim i As Integer
    i = 0
    Set acroApp = CreateObject("AcroExch.App")
    Set acroDoc = CreateObject("AcroExch.pddoc")
    acroDoc.Open ("C:\Documents\example.pdf")
    Set aRng = acroDoc.Range
    With aRng.Find
        Do While .Execute(FindText:="desk", MatchCase:=False)
            i = i + 1
        Loop
    End With
    acroDoc.Close 0
    Set aRng = Nothing
    Set acroDoc = Nothing
    Set acroApp = Nothing
    Debug.Print (i)
End Sub
Acrobat doesn't have a concept of a Range. FindText (a method of the AVDoc object) finds the specified text, scrolls so that it is visible, and highlights it. The return value is -1 when the text is found. Unless you also pass the parameter that resets the search, subsequent calls start where the previous one left off, so to get the count you just loop until the return value is something other than -1. I haven't used VBA in quite a while, but I think the code would look like this...
i = 0
Set acroApp = CreateObject("AcroExch.App")
Set acroDoc = CreateObject("AcroExch.AVDoc")
' AVDoc.Open takes the file path and a window title
acroDoc.Open "C:\Documents\example.pdf", ""
' FindText arguments: text, case sensitive, whole words only, reset to start
Do While acroDoc.FindText("desk", 0, 0, 0) = -1
    i = i + 1
Loop
Documentation to FindText:
http://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/index.html#t=Acro12_MasterBook%2FIAC_API_OLE_Objects%2FFindText.htm

Automation of PDF String Search using Excel VBA - OLE error

I'm getting the error "Microsoft Excel is waiting for another application to complete an OLE action" when trying to automate a PDF string search and record the findings in Excel. For certain PDFs the error does not appear. I assume this is because less optimized PDFs take longer to search while being indexed page by page.
To be more precise, I have a workbook containing two sheets. One contains a list of PDF file names and the other has a list of words that I want to search. From the file list the macro would open each PDF file and take each word from the list of words and perform a string search. If found it would record each finding in a new sheet in the same workbook with the file name and the found string.
Below is the code I'm struggling with. Any help is welcome.
Public Sub SearchWords()
    'variables
    Dim ps As Range
    Dim fs As Range
    Dim PList As Range
    Dim FList As Range
    Dim PLRow As Long
    Dim FLRow As Long
    Dim Tracker As Worksheet
    Dim gapp As Object
    Dim gAvDoc As Object
    Dim gPDFPath As String
    Dim sText As String 'String to search for
    FLRow = ActiveWorkbook.Sheets("List Files").Range("B1").End(xlDown).Row
    PLRow = ActiveWorkbook.Sheets("Prohibited Words").Range("A1").End(xlDown).Row
    Set PList = ActiveWorkbook.Sheets("Prohibited Words").Range("A2:A" & PLRow)
    Set FList = ActiveWorkbook.Sheets("List Files").Range("B2:B" & FLRow)
    Set Tracker = ActiveWorkbook.Sheets("Tracker")
    'For each PDF file list in Excel Range
    For Each fs In FList
        'Initialize Acrobat by creating App object
        Set gapp = CreateObject("AcroExch.App")
        'Set AVDoc object
        Set gAvDoc = CreateObject("AcroExch.AVDoc")
        'Set PDF file path to open in PDF
        gPDFPath = fs.Cells.Value
        ' open the PDF
        If gAvDoc.Open(gPDFPath, "") = True Then
            'Bring the PDF to front
            gAvDoc.BringToFront
            'For each word list in the range
            For Each ps In PList
                'Assign String to search
                sText = ps.Cells.Value
                'This is where the error is appearing
                If gAvDoc.FindText(sText, False, True, False) = True Then
                    'Record findings
                    Tracker.Range("A1").End(xlDown).Offset(1, 0) = fs.Cells.Offset(0, -1).Value
                    Tracker.Range("B1").End(xlDown).Offset(1, 0) = ps.Cells.Value
                End If
            Next
        End If
        'Message to display once the search is over for a particular PDF
        MsgBox (fs.Cells.Offset(0, -1).Value & " assignment complete")
    Next
    gAvDoc.Close True
    gapp.Exit
    Set gAvDoc = Nothing
    Set gapp = Nothing
End Sub
I have now found the answer to this problem.
I'm using Acrobat Pro, and whenever I open a PDF file it opens with limited features because of the Protected View settings. If I disable Protected View, or if I click Enable All Features and save the changes to the PDF files, the VBA macro runs smoothly.
It's funny, I'm posting an answer to my own problem.

VBA: Acrobat Run time error 429; ActiveX component can't create object

I have the following code to read the contents of a PDF file in Excel VBA:
'Note: A Reference to the Adobe Library must be set in Tools|References!
Public Function ReadAcrobatDocument(strFileName As String) As String
    Dim AcroApp As CAcroApp, AcroAVDoc As CAcroAVDoc, AcroPDDoc As CAcroPDDoc
    Dim AcroHiliteList As CAcroHiliteList, AcroTextSelect As CAcroPDTextSelect
    Dim PageNumber, PageContent, Content, i, j
    Set AcroApp = CreateObject("AcroExch.App")
    Set AcroAVDoc = CreateObject("AcroExch.AVDoc")
    If AcroAVDoc.Open(strFileName, vbNull) <> True Then Exit Function
    ' The following While-Wend loop shouldn't be necessary but timing issues may occur.
    While AcroAVDoc Is Nothing
        Set AcroAVDoc = AcroApp.GetActiveDoc
    Wend
    Set AcroPDDoc = AcroAVDoc.GetPDDoc
    For i = 0 To AcroPDDoc.GetNumPages - 1
        Set PageNumber = AcroPDDoc.AcquirePage(i)
        Set PageContent = CreateObject("AcroExch.HiliteList")
        If PageContent.Add(0, 9000) <> True Then Exit Function
        Set AcroTextSelect = PageNumber.CreatePageHilite(PageContent)
        ' The next line is needed to avoid errors with protected PDFs that can't be read
        On Error Resume Next
        For j = 0 To AcroTextSelect.GetNumText - 1
            Content = Content & AcroTextSelect.GetText(j)
        Next j
    Next i
    ReadAcrobatDocument = Content
    AcroAVDoc.Close True
    AcroApp.Exit
    Set AcroAVDoc = Nothing: Set AcroApp = Nothing
End Function
Sub demo()
    Dim str As String
    str = ReadAcrobatDocument("C:\Desktop\asdf.pdf")
End Sub
However, I am getting the runtime 429 error at
Set AcroApp = CreateObject("AcroExch.App")
What is wrong? I have Adobe Reader X and the references I've checked are:
Acrobat Access 3.0 Type Library
AcroBrokerLib
AcroIEHelper 1.0 Type Library
AcroIEHelperShim 1.0 Type Library
Adobe Acrobat Browser Control Type Library 1.0
Adobe Acrobat 10.0 Type Library
Adobe Reader File Preview Type Library
From the very first result in Google for the search query:
createobject acroexch.app error 429
You cannot do this with Adobe Reader, you need Adobe Acrobat:
This OLE interface is available with Adobe Acrobat, not Adobe Reader.
https://forums.adobe.com/thread/657262
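To fail with a clearer message on machines that only have Reader, the creation of the automation object can be probed first. The following is a minimal sketch, assuming that a failed CreateObject("AcroExch.App") is the only symptom you need to detect; AcrobatIsAvailable is a made-up helper name and uses late binding, so it compiles even without the Acrobat reference.

```vba
' Hypothetical helper: True when the AcroExch.App automation object
' can be created (i.e. full Acrobat, not just Reader, is installed).
Public Function AcrobatIsAvailable() As Boolean
    Dim app As Object
    On Error Resume Next                 ' CreateObject raises error 429 if the class is unavailable
    Set app = CreateObject("AcroExch.App")
    AcrobatIsAvailable = (Err.Number = 0) And Not (app Is Nothing)
    If Not app Is Nothing Then app.Exit  ' close the instance created only for the probe
    On Error GoTo 0
    Set app = Nothing
End Function
```

The demo sub could then start with If Not AcrobatIsAvailable() Then MsgBox "Adobe Acrobat (not Reader) is required": Exit Sub.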