What is the best way to parse Microsoft Office and PDF documents?


I'm developing a Desktop Search Engine using VB9 (VS2008) and Lucene.NET.
The Indexer in Lucene.NET accepts only raw text, and raw text cannot be read directly from Microsoft Office (DOC, DOCX, PPT, PPTX) or PDF documents.
What is the best way to extract raw text data from such files?

You can, like the Windows Desktop Search, use components implementing the IFilter interface.
Example of its usage from .NET
Links to IFilter implementations
Description of the IFilter interface

I can only talk about MS Office documents here. There are several ways to do this:
Using COM automation
Using converters which output the document in a more accessible format
Using 3rd-party libraries
Using Microsoft's OpenXML SDK
COM automation has the disadvantage of not always being reliable, mainly because applications tend to hang due to modal popup dialogs.
Converters are available for Word. You could check out the Text Converter SDK available from Microsoft which would allow you to use the document converters coming with Word in a stand-alone application. Requires some C coding but since you are using the same conversion engines as Office you will get high-fidelity results. The SDK can be obtained from http://support.microsoft.com/kb/111716.
For the third option, using third-party libraries, you might want to have a look at Apache POI or the b2xtranslator project on SourceForge. The latter provides a C# library which allows you to extract the text from binary Word documents. PowerPoint support is still at an early stage, but text extraction should already be working.
The last option would be to use Microsoft's OpenXML SDK. This might be the preferred/easiest way. Search Google for samples. You could also handle binary documents by first converting them using the Office Compatibility Pack (download and install from Microsoft):
Word:
"C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme <input file> <output file>
Excel:
"C:\Program Files\Microsoft Office\Office12\excelcnv.exe" -oice <input file> <output file>
PowerPoint:
"C:\Program Files\Microsoft Office\Office12\ppcnvcom.exe" -oice <input file> <output file>
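Once a document is in DOCX form, the text can also be pulled out without the SDK, because a .docx file is a ZIP package whose main body is stored as XML. The following is only a minimal sketch in Python (not the OpenXML SDK, and not VB.NET), assuming the body lives in the usual word/document.xml part. The function name extract_docx_text is made up for illustration, and headers, footers, footnotes and comments are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(path_or_file):
    """Return the body text of a .docx, one line per paragraph (illustrative helper)."""
    with zipfile.ZipFile(path_or_file) as package:
        # Assumption: the main body is in the conventional part name.
        xml_bytes = package.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join all text runs (w:t) that belong to this paragraph.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string is the kind of raw text an indexer expects. The same idea carries over to .NET with any ZIP and XML reader.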

For PDF you can use my company's .NET PDF Reader component that features text extraction.
This is exactly the code you write to extract the text from a PDF:
public String ReadTextFromPages(Stream s)
{
    using (PdfTextDocument doc = new PdfTextDocument(s))
    {
        PdfTextReader rdr = doc.GetPdfTextReader();
        return rdr.ReadToEnd();
    }
}

Related

I need to convert DOC/TXT files to PDF in large batches

We are changing systems and the new system only outputs .DOC or .TXT files for reports. Several of the reports that come out need to be converted to PDF so they are available for our web users on a daily basis. Currently I am testing with about 1500 copies of a single report, and before the system is ready I will need to support at least 10 types of reports, each possibly with 1500 or so files to convert.
So far I have not found a way to convert this many reports effectively. Part of the problem is that the reports must be converted to a specific PDF page size for them to be read easily. I have tested some software, but so far I have not been able to find a solution.
I really like Batch Document Converter Pro. We have used software from this company before and it worked very well for our needs. Whenever I try it, though, it gives the error
Problem with conversion: word to pdf, check word 2007 or greater is installed and the MS PDF Addon pack for office 2007
I have tried installing different versions of Office (including 2007) on the machine and installed the addon pack with no change.

One tool to try is Libre Office since:
it can run on multiple platforms
it can be driven from the command line or programmatic API
you can use it manually to confirm whether it will do what you need before doing any scripting/programming
it does pretty good conversions
the page format of the docx files will carry over naturally to the PDF
the text files will be converted into a "normal" page layout
I would suggest you firstly install Libre Office, and open some of your documents by hand then export to PDF. If the results are good enough, then you can automate this to run in batches.
If the first step is promising, then the simplest automation is to use the command line. eg:
"c:\Program Files\LibreOffice 5\program\soffice" --convert-to pdf myDoc.docx
I hope that helps.
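For batches of 1500 files, the command line above can be wrapped in a small script. Here is a rough sketch in Python. The install path is the one from the example above and may differ on your machine. The --headless and --outdir switches are standard soffice options that I have added. The function names and the default file patterns are made up for illustration.

```python
import subprocess
from pathlib import Path

# Assumed install location (taken from the example above); adjust as needed.
SOFFICE = r"c:\Program Files\LibreOffice 5\program\soffice"


def build_convert_command(source, out_dir, soffice=SOFFICE):
    """Build the argument list that converts one document to PDF."""
    return [
        str(soffice),
        "--headless",             # do not open the GUI
        "--convert-to", "pdf",
        "--outdir", str(out_dir),  # write the PDF into this folder
        str(source),
    ]


def convert_folder(in_dir, out_dir, patterns=("*.doc", "*.txt"), soffice=SOFFICE):
    """Convert every matching file in in_dir; return the files that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pattern in patterns:
        for source in sorted(Path(in_dir).glob(pattern)):
            # Run the conversions one after another and record failures.
            result = subprocess.run(build_convert_command(source, out_dir, soffice))
            if result.returncode != 0:
                failed.append(source)
    return failed
```

Passing the arguments as a list avoids the quoting problems caused by the space in "Program Files". Something like convert_folder(r"C:\reports\in", r"C:\reports\pdf") would then process one report type per call (both folder names are placeholders).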

Import .docx contents into MS Access

I began writing a docx document to do a project of mine.
Recently, I realized that it would be easier to manage that data if it was in a database.
So, I wanted to import that data into MS Access automatically, to avoid copying and pasting the data manually.
Is there any way to do it? I have only encountered ways of opening the Word application via Access. I also know that docx has an XML structure, so I imagine that if I can open that structure, it would be easy to write a parser in VBA.

There are two basic ways information can be taken out of a Word document and put into an Access database: automating the Word object model using VBA code running in either Word or Access OR extracting the WordOpenXML that makes up the Word document. You indicate you lean towards the second option.
Here, again, there are a number of approaches available:
Use VBA in Word or Access to extract the WordOpenXML of the document open in the Word application user interface.
Use VBA in Access together with non-VBA tools to "crack open" the Zip file and extract the XML.
Use the tools available in the .NET Framework to extract the content of the ZIP file and write it to Access using an OLE DB connection.
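The second and third approaches both rely on the .docx being an ordinary ZIP archive. As a language-neutral illustration (Python here, rather than the VBA or .NET tools named above), a sketch of "cracking open" the package might look like this. The function name read_docx_parts is made up, word/document.xml is the usual name of the main part, and UTF-8 encoding of the parts is assumed.

```python
import zipfile


def read_docx_parts(path_or_file, wanted=("word/document.xml",)):
    """Return {part name: XML string} for the requested parts of a .docx."""
    parts = {}
    with zipfile.ZipFile(path_or_file) as package:
        available = set(package.namelist())
        for name in wanted:
            if name in available:
                # Assumption: the part is UTF-8 encoded, as Word normally writes it.
                parts[name] = package.read(name).decode("utf-8")
    return parts
```

The XML string for each part could then be parsed, or written to a text or memo field over whatever database connection you use.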
I understand your goal is to be able to recreate the document at a later point for printing, so you want to preserve all the formatting. In addition, you want to be able to read the content from within Access.
I believe this will require a minimum of four fields in the Access table:
ID
Title
Text of song
The complete WordOpenXML for re-creating the document
You don't mention (4) in the discussion and problem description, but if you want to store the formatting AND you want to be able to read the content I believe this is necessary. While WordOpenXML is "readable", there's a lot of mark-up in there which doesn't make reading comfortable.
All things being equal, I'd go for either VBA working on the open Word document or the .NET approach, using the Open XML SDK (free download .NET library you can reference in Visual Studio and distribute with solutions).
One important thing to keep in mind is storing the Word Open XML in the database. Unless something has changed in Access, you can't store the ZIP file - you need a "streamable" format. That would be the OOXML OPC flat-file format.
When you read the WordOpenXML from a document using VBA, that's what you get, which is why that would be an option for me. The Open XML SDK doesn't have that option, but there is code available from Eric White's blog for doing this.
When you later want to recreate and print the document it should be enough to stream the WordOpenXML to a file with the .xml extension. Or you could convert it back to a docx zip file (same blog).

Turn an ArtPro file into a PDF using Automator

I am attempting to use Automator to turn a folder of ArtPro (.ap) images into .pdf's, but I can't find any existing or downloadable actions to do anything other than open a .ap file with automator.
Does anyone know of an action I could download or a different way to automate the conversion of .ap to .pdf? Is it possible to do it using applescript instead?

It is only possible with ArtPro itself (manually) or with an Action List in Automation Engine. You can try recording your actions with "Watch Me Do" in Automator, but it's not a good idea. AppleScript will not help.
The problem is that Esko has its own file format which no other software can understand.

I could see some approaches:
a) open the document in ArtPro, then use the Print command and write out as PDF
b) (if Preview.app can read in .ap files) open the document in Preview.app and save as PDF
c) if there is no direct way (a) or b)), write out as TIFF and convert that intermediate file, for example in Acrobat or Preview

The ArtPro format is proprietary to Esko - you won't be able to open it in anything else.
Secondly, Esko favours selling its own automation solution (Automation Engine) - ArtPro will not allow you to automate it. It doesn't integrate with Automator and as far as I know it also doesn't publish AppleScript actions.
So basically I think your only option is using Automation Engine from Esko.

You need to use the task "Export ArtPro to Normalized PDF File" in Esko Automation Engine.

Automation: how to automate transforming .doc to .docx?

I have a bunch of .doc files in a folder which I need to convert to .docx.
To manually convert the .doc to .docx is pretty simple:
Open .doc in Word 2007
Click on Save As...
Save it as .docx
However, doing this for hundreds of files definitely ain't fun. =p
How would you automate this?

There is no need to automate Word, which is rather slow and brittle due to pop-up messages, or to use Microsoft's Office File Converter (ofc.exe), which has an unnecessarily complicated user interface.
The simplest and fastest way would be to install either Office 2007 or download and install the Compatibility Pack from Microsoft (if not already done). Then you can convert from .doc to .docx easily using the following command:
"C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme <input file> <output file>
where <input file> and <output file> need to be fully qualified path names.
The command can be easily applied to multiple documents using for:
for %F in (*.doc) do "C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme "%F" "%Fx"
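If you would rather drive the converter from a script, for example to log which files were converted, the same command can be built per file. This is just a sketch in Python. The wordconv.exe path and the -oice -nme switches are the ones shown above. The function names are made up, and paths are resolved to absolute ones because the converter needs fully qualified names.

```python
import subprocess
from pathlib import Path

# Converter location as given above; adjust if Office is installed elsewhere.
WORDCONV = r"C:\Program Files\Microsoft Office\Office12\wordconv.exe"


def build_wordconv_command(source, wordconv=WORDCONV):
    """Build the wordconv call that converts one .doc to a .docx next to it."""
    source = Path(source).resolve()        # fully qualified input path
    target = source.with_suffix(".docx")   # same folder, same name, new extension
    return [str(wordconv), "-oice", "-nme", str(source), str(target)]


def convert_all(folder, wordconv=WORDCONV):
    """Convert every .doc in folder; return the list of .docx paths written."""
    converted = []
    for source in sorted(Path(folder).iterdir()):
        # Skip everything except .doc (this also leaves existing .docx files alone).
        if source.suffix.lower() != ".doc":
            continue
        command = build_wordconv_command(source, wordconv)
        subprocess.run(command, check=True)  # raises if the converter reports failure
        converted.append(Path(command[-1]))
    return converted
```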

The easiest way is to use the command-line Office File Converter. Just run
ofc
and the magic happens.

Automate Word.
If you are using .NET, add a reference to the Microsoft.Office.Interop.Word assembly to your project (make sure it is version 12, which corresponds to Word 2007, so that the .docx format is available) and use it to automate the Word application to do exactly the steps you describe above. The pseudocode:
Create the application object
Use the application object to open a document (by supplying it the file name)
Use the application object to perform SaveAs by supplying to it the format and output filename
Close the current document
Loop through the above till you finish with all documents
Housekeeping code to release the Word or Doc objects
You can find plenty of examples on Google; just search for "Word Automation in C#" or something along those lines.

Read existing PDF file with all format information

I want to read an existing PDF file and get not only the text, but also the format information such as font (bold, italic...) and paragraphs. Is there a code library for doing this, and is it open source or commercial?
I am on Windows and favor C# libraries, but C/C++ is also acceptable.

I can very much recommend
pdflib (http://www.pdflib.com/).
It's commercial, but it also has a lite version which you can use for free privately. It contains a great deal of functionality and is available for all platforms.

I'd echo Mr. Meyers on this. There appear to be a number of them; search for "pdf parser library" (plus your language) in your favorite search engine.
A few top hits:
http://www.lowagie.com/iText/
http://metacpan.org/pod/PDF::Parse
http://podofo.sourceforge.net/
http://www.vicman.net/download/13733/ (several for .NET)
Note that if you're wanting to edit an existing PDF, you might want to read this:
http://1t3xt.info/tutorials/faq.php?branch=faq.pdf_in_general&node=replace_word

The Pdfium.Net SDK can also help you. Via this API you can get access to a collection of text, images and other objects and their properties.
Please note that I work at the company that develops this API.
