How to search in PDF using VBA - vba

Hi, I am a rookie in VBA.
Is it possible to search within PDF files using VBA, and return the matched text together with the name of the file in which it was found?
To give you a better understanding of what I do: I have a macro that goes to a web page (http://cetatenie.just.ro/ordine/articol-11/), loops through the hyperlinks, and downloads the files (PDFs) that match my criteria. Now I need to search within them for a name and surname (e.g. BLANARI VITALIE) and find out in which file/document it is published.
The only idea that comes to mind is to import the data into Excel, but then the Excel file would get too huge.
Please help!

If you have Acrobat Professional installed, you can use its Automation interface (Acrobat.AcroPDDoc) and query the PostScript.
Here is an example I found for you (by searching on Google):
get the data from PDF file into Excel sheet(s) or text file(s)
To use this code, you need the following references:
AcroPDFLib
Acrobat
on my pc they are in:
AcroPDFLib: C:\Program Files\Common Files\Adobe\Acrobat\ActiveX\AcroPDF.dll
Acrobat: C:\Program Files\Adobe\Reader 11.0\Reader\AcroRd32.dll
If you can't find them in your VB Editor References dialog, or by searching in C:\Program Files\Adobe, then you don't have the necessary components installed on your PC to do it the easy way.
The hard way is to strip the PostScript, read it into variables, and then search the variables!
HTH
Philip
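Whichever way the text gets out of the PDFs, the "search the variables" step can be sketched as below. This is only an illustration in Python rather than VBA, and it assumes the text of each file has already been extracted into a string; the file names and the helper names (normalize, files_mentioning) are made up for the example. Collapsing whitespace before comparing helps because extracted PDF text often breaks a name across lines.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and upper-case the text."""
    return re.sub(r"\s+", " ", text).strip().upper()


def files_mentioning(name: str, texts_by_file: dict[str, str]) -> list[str]:
    """Return the names of the files whose extracted text contains the name."""
    needle = normalize(name)
    return [
        fname
        for fname, text in texts_by_file.items()
        if needle in normalize(text)
    ]


# Hypothetical extracted text, keyed by (made-up) file name.
extracted = {
    "ordin_001.pdf": "1. POPESCU ION\n2. BLANARI\n   VITALIE\n3. ...",
    "ordin_002.pdf": "1. IONESCU MARIA",
}

matches = files_mentioning("Blanari Vitalie", extracted)
print(matches)
```

Only the list of matching file names is kept, so nothing large has to be imported into a worksheet.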

Related

Obtain the "absolute path to the workbook" from an .xls file

When I use the Excel "Document Inspector" on a particular .xls file to check for "hidden properties or personal information" it says:
The following document information was found:
* Absolute path to the workbook
How can I obtain the absolute path of the workbook from the file? If it needs to be done programmatically, I could use Java (e.g. Apache POI) or VBA.
I know where the file is currently saved, but what I want to extract is the absolute path to the workbook which is saved in the file I have. This is so I can know where it was saved by the author.
Here's what has happened to the file:
1. Someone authored it, saving it at some absolute filepath unknown to me
2. They uploaded it to a website
3. I downloaded it from the website
Excel indicates that the document contains the absolute path from step 1. I'm after this path, not the place I saved it at step 3 since I know that.
I can reproduce that warning message by simply creating an empty Excel file, adding a formula, and saving it as BIFF8 (.xls). The Document Inspector then warns about the absolute path... but in my case, there was no filename inside the file.
A simple way to verify this is to open the file in a hex editor and search for a well-known save location (e.g. the location where a dummy/test file was stored). It is stored either as ASCII or as a 16-bit string, i.e. every other byte is a character.
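If you would rather script the hex-editor check, here is a rough Python sketch. It searches a byte string for a known path in both encodings mentioned above; the function name find_path_bytes and the sample bytes are made up for illustration, and a path with non-ASCII characters would need a different 8-bit codec.

```python
def find_path_bytes(data: bytes, known_path: str) -> dict[str, int]:
    """Return the offset of the path in each encoding, or -1 when absent."""
    return {
        "ascii": data.find(known_path.encode("ascii")),
        "utf-16-le": data.find(known_path.encode("utf-16-le")),
    }


# Synthetic stand-in for the raw bytes of a file; a real check would use
# something like open("file.xls", "rb").read() instead.
sample = (
    b"\x00\x01HEADER"
    + r"C:\Temp\test.xls".encode("utf-16-le")
    + b"\x00TAIL"
)

offsets = find_path_bytes(sample, r"C:\Temp\test.xls")
print(offsets)
```

An offset of -1 for both encodings means the path was not found in that form.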
If you want to use the POI developer tools, you can use the following:
To list all Excel records:
java -cp poi-3.16-beta1.jar org.apache.poi.hssf.dev.BiffViewer file.xls
To list the document and summary properties:
java -cp poi-3.16-beta1.jar org.apache.poi.hpsf.extractor.HPSFPropertiesExtractor file.xls
To list any embedded objects beside the usual suspects SummaryInformation, DocumentSummaryInformation and Workbook:
java -cp poi-3.16-beta1.jar org.apache.poi.poifs.dev.POIFSLister file.xls
So after running the tools and recording the output, you can then remove the properties via the Excel Document Inspector and execute the tools again. The output can be diffed and you might find the culprit.
Assuming it is an .xlsx file rather than an older-style .xls file, you can:
1. Rename the workbook as a .zip file
2. Look at the xl\workbook.xml "file" within the .zip file
and you will find the absolute path when last saved from Excel.
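The same lookup can be scripted. This is a minimal Python sketch: it assumes the path is stored in the url attribute of an element whose local name is absPath (which is what I have seen recent Excel versions write, so treat the element name as an assumption), and the function name find_abs_path and the sample XML are made up for the demonstration. No renaming is needed here, since the zip reader does not care about the extension.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def find_abs_path(xlsx_bytes: bytes):
    """Return the url of the first absPath element in xl/workbook.xml, or None."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    for el in root.iter():
        # Compare the local name only, ignoring the namespace prefix.
        if el.tag.rsplit("}", 1)[-1] == "absPath":
            return el.get("url")
    return None


# Tiny in-memory stand-in for a real workbook (illustrative content only).
SAMPLE_XML = (
    '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" '
    'xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac">'
    '<x15ac:absPath url="C:\\Users\\author\\Documents\\"/>'
    "</workbook>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", SAMPLE_XML)

found = find_abs_path(buf.getvalue())
print(found)
```

For a file on disk, pass the result of open(path, "rb").read() to the function.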
This is why it is not a good idea to share work-related spreadsheets with other people, unless you first clear out this sort of information.
I'm not sure how to find it in the binary format files.

docx4j word/googledocs compatibility

I'm creating a program which extracts a docx file, displays it in a JavaFX graphical interface with buttons in place of flags placed in the docx, and, when one presses a button, modifies the input docx.
I'm using the docx4j API for extracting and modifying the document.
The problem is that the program fails if I take as input a docx generated by Microsoft Word. I'm forced to use a workaround.
I take my docx made in Word, load it into Google Docs, and use the "Download in .docx format" option. If I feed the docx from Word directly into my program, it fails.
I noticed that my Word file was half the size after being passed through Google Docs. Likewise, if I take a docx file downloaded from Google Docs, open it in Word, modify one letter, and save it, it becomes twice as large. For the record, I use Word 2008.
That's it, so I'd like to know if someone knows what explains this difference.
Thanks

Word Automation Service breaks links in table of contents

I have written code which uses Word Automation Service to convert a .DOCX file to .PDF. I have noticed that when the Word document contains a table of contents, its links are removed in the PDF. This is very bad for my business case.
On the other hand, manually opening MS Word and saving the same document as PDF preserves the links in the table of contents. This is the behavior I am looking for, but I want to keep my code independent from having MS Office Word installed on the machine running my code.
Has anyone had a similar issue, and was anybody able to resolve it?
In my case, I found out that this is related to a job settings property. Try commenting out or removing this line of code if you have it:
jobSettings.UpdateFields = true;

Using Office 2007 extension (i.e. docx) for skin based On-Screen keyboard

I'm creating an on-screen keyboard for my application, and it supports skins as well.
Here's what I'm doing with the skins: I have a folder which contains some images and an XML file which maps the images to the keyboard. I want to be able to ship the folder as a zip file, as in Office 2007 (.docx) and iPhone firmware (.ipsw) files. I know I can simply zip the folder and change the extension; what I need to know is how to read the files in code.
You've got two options, either 1) just use a zip library like SharpZipLib or DotNetZip or 2) try to use the System.IO.Packaging namespace. I think option 1 would be the easiest probably.
There's nothing really magical that Office and other programs are doing, they're just reading a zip file and pulling stuff out of it as needed. Instead of pulling an image from a disk you just pull it from a MemoryStream.
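To make the "pull it from memory instead of from disk" idea concrete, here is a small sketch. It is in Python rather than .NET, purely as an illustration of the pattern, and the package layout (a skin.xml file with key elements carrying name and image attributes) and the function name load_skin are invented for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def load_skin(package_bytes: bytes) -> dict[str, bytes]:
    """Map each key name to its image bytes, read straight from the zip."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        mapping = ET.fromstring(zf.read("skin.xml"))
        return {
            key.get("name"): zf.read(key.get("image"))
            for key in mapping.iter("key")
        }


# Build a tiny hypothetical skin package in memory for the demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "skin.xml",
        '<skin><key name="A" image="images/a.png"/>'
        '<key name="B" image="images/b.png"/></skin>',
    )
    zf.writestr("images/a.png", b"fake-png-A")
    zf.writestr("images/b.png", b"fake-png-B")

skin = load_skin(buf.getvalue())
print(sorted(skin))
```

Nothing is extracted to disk: each image arrives as a byte string that can be wrapped in an in-memory stream and handed to whatever image loader the application uses.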

Extract text from a PowerPoint (.ppt or .pptx) file?

I'm currently using a combination of OpenOffice macros and a pdf2text program to extract text and would like to find an easier, more efficient way getting the text out of a PowerPoint file.
I've tried using the Apache POI library and have not had much luck: I encountered numerous exceptions within the library when trying to process the files I'm looking at, and I don't particularly want to sift through the library's source code.
Is there an easy way to do this without using the aforementioned library?
If you have MS Office and you save the PPT in the RTF (Rich Text Format), it contains just the text from the presentation. You could then open the file in any editor that understands RTF files and save it as a text (TXT) file.
I expect this to work from Open Office too.
Since you talk of API, this may not be the way to go for you but maybe it will give you newer ideas on getting there. Say, you use multiple macros to do the conversion in stages...
Edit: I got curious and did a short Google search.
This is what I found on one of the www.openoffice.org pages:
As people in this thread have pointed out, retrieving text from an OO
document isn't hard since it's just zipped xml that can be parsed with a
perl script. The problem is getting Microsoft Powerpoint documents into
a zipped XML format in the first place.
I've found that File -> Wizards -> Document Convertor does exactly that.
Just tell it you want to convert Powerpoint documents, not templates,
point it to your source directory and where you want it to spit out the
result and you're away.
I then find
unzip -p $file.sxi content.xml | perl -p -e "s/<[^>]*>/\n/g;s/ +//;s/\n\n/\n/g;" -w
works rather well for extracting the text.
Sorry, I don't have Open Office handy to try any of that out.
pptx files are relatively easy to deal with, because they are just zipped xml - you can just unzip them and then strip all the xml tags from the content of the files in the 'ppt/slides' subdirectory of the unzipped stuff, yielding most of the pertinent text.
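A rough sketch of that approach in Python, using only the standard library. Instead of stripping every tag, it collects the text runs (the a:t elements in the DrawingML namespace) from each slide file; the function name slide_texts and the sample slide are made up for the demonstration, and text stored elsewhere (notes, charts, embedded objects) is not covered.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; text runs live in <a:t> elements.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def slide_texts(pptx_bytes: bytes) -> dict[str, list[str]]:
    """Return the text runs of every slide, keyed by the slide's zip entry name."""
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as zf:
        names = [
            n for n in zf.namelist()
            if re.fullmatch(r"ppt/slides/slide\d+\.xml", n)
        ]
        # Sort numerically so slide10 comes after slide2.
        names.sort(key=lambda n: int(re.search(r"\d+", n).group()))
        return {
            n: [
                t.text
                for t in ET.fromstring(zf.read(n)).iter(f"{{{A_NS}}}t")
                if t.text
            ]
            for n in names
        }


# Minimal in-memory stand-in for a real .pptx (illustrative content only).
SLIDE = (
    '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
    f'xmlns:a="{A_NS}"><p:cSld><p:spTree><p:sp><p:txBody>'
    "<a:p><a:r><a:t>Hello</a:t></a:r><a:r><a:t> world</a:t></a:r></a:p>"
    "</p:txBody></p:sp></p:spTree></p:cSld></p:sld>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ppt/slides/slide1.xml", SLIDE)

texts = slide_texts(buf.getvalue())
print(texts)
```

For a file on disk, pass open(path, "rb").read() to slide_texts.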
ppt files are a whole other ballgame, and the process is rendered even more painful because the canonical tool, catppt from the catdoc package, is susceptible to a buffer overflow that makes it nearly useless (it segfaults on a large percentage of ppt files).
In LibreOffice 5, File - Export - HTML includes both slide contents and presenter notes.
Then open the .html file in Firefox or another browser and use File - Save Page As - Text File (or a utility such as pandoc -o file.txt file.html).