I have a database with tons of PDF documents embedded as OLE objects in Notes RichText fields. Those are not compatible with XPages, so I need to convert the OLE objects into file(attachment)s.
How can I do that in an automatic fashion (I know that it must run in a Notes client (must it?) - or is there a POI way to extract them?
Clarification
I can extract the blob (into memory if I want), but writing it out to disk doesn't create a PDF File since that blob is an OLE container. So I see 2 possible path:
Activate the OLE object and use a method in there
Read the blob and have something that extracts the PDF part (possibly Apache POI)
But I haven't touched any of these approaches and was wondering if some advice could save me hours of tests
Would it be possible with dxl tools? I've worked with the dxl exporter to extract embedded images from a document maybe this is also doable with ole objects?
I used a slightly changed version of the EmbeddedImage object of the lotusscript gold collection project on openntf
This library contains an object Embeddedimagelist which searches the DXL for picture tags and tries to parse its contents. Maybe this would also be applicable with embedded ole objects.
I'd think that something like searching for %PDF and then saving everything since as a file should five you PDF. Theoretically there could be a bunch of things in OLE file, but in most cases you'll get you file simply prefixed with an OLE header (or whatever it's called).
I've used this approach in one occasion (not for PDF though) and it seemed working fine.
I guess it's what openntf approach that jjtbsomhorst is talking about is based upon :-)
Related
The title sounds insane but bear with me. This is a problem that could exist with any object.
I am generating a bitmap object in memory and I would like to pass it directly to another function that wants to open a bitmap file. The simple solution is to write the file to disk, call the function against the file, and then delete the file. I don't want to do that. If I am pushing a high volume of image objects in to a Word document with a VSTO add-in it doesn't make sense to thrash my disk for no reason when the whole thing could be done in memory.
I guess I am looking for a different function to insert a picture in to a Word document that accepts a bitmap object. Or a way to pass a filesystem object that actually points to memory (Not a RAMDisk, but a RAMFile?). Or a way to wire the "Image.Save" directly to the reader of the "AddPicture" function without actually making a file on disk.
Hopefully, there is a better way of doing this.
Here is the code example:
Dim newImage = GenerateImage(InputString, SelectedFormat)
Dim imagePath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
newImage.Save(imagePath, ImageFormat.Png)
With Globals.ThisAddIn.Application
.Selection.InlineShapes.AddPicture(imagePath)
End With
File.Delete(imagePath)
Word can't "stream" (see "Background", below) content, so your choices are 1) The Clipboard or 2) wrapping the bitmap in valid, Word Open XML OPC flat-file format, which means first converting the bitmap to base64.
For the first, you can use standard .NET methods to place the information on the Clipboard in the format you want Word to use. In the Word "interop", the Paste or PasteSpecial methods will insert it. The argument against this approach is, as ever, "interfering" with the user's Clipboard.
Using Word Open XML is as close as you can get to "streaming" content into Word, using the Range.InsertXML method.
Word documents (and other Office files) are essentially "zip packages" of XML and binary files that together make up the document. It's possible to create and edit these files without opening them in the Word (Office) application, which makes the format suitable for server-side work. Any tool that can work with zip files and xml can be used for this; standard is the Microsoft Open XML SDK which offers a complete API of the Office content.
Word, alone, of all the Office applications enables the developer to read and write content in the opened Word document using the OPC flat-file standard. This "concatenates" the entire content of the zip package into an XML String. The Word object model's Range.InsertXML method is used to write content in this format to a Word document open in the Word application.
Information on how to convert a zip package into OPC flat file can be found in this blog article. Information concerning minimal Word Open XML to have a valid OPC version is described in this article; there is a section in there specifically about working with graphics.
Background
Word is based on very old technology - late 1980's. By the mid-1990's it reached a very high standard as a professional word processor and what has happened with it since has mostly been "sugar coating" - adding a bit of this and a bit of that to bring it closer to HTML / page layouting. But the core of the application remains the same... and part of that means Word isn't able to do many of the things the modern developer expects - such as "streaming" data in and out.
I began writing a docx document to do a project of mine.
Recently, I realized that it would be easier to manage that data if it was in a database.
So, I wanted to import that data into MS Access automatically, to avoid copying and pasting the data manually.
Is there anyway to do it? I have only encontered ways of opening Word application via Access. I also know that docx has a XML structure, so I imagine if I can open that structure, it would be easy to do a parser in VBA
There are two basic ways information can be taken out of a Word document and put into an Access database: automating the Word object model using VBA code running in either Word or Access OR extracting the WordOpenXML that makes up the Word document. You indicate you lean towards the second option.
Here, again, there are a number of approaches available:
Use VBA in Word or Access to extract the WordOpenXML of the document open in the Word application user interface.
Use VBA in Access together with non-VBA tools to "crack open" the Zip file and extract the XML.
Use the tools available in the .NET Framework to extract the content of the ZIP file and write it to Access using an OLE DB connection.
I understand your goal is to be able to recreate the document at a later point for printing, so you want to preserve all the formatting. In addition, you want to be able to read the content from within Access.
I believe this will require a minimum of four fields in the Access table:
ID
Title
Text of song
The complete WordOpenXML for re-creating the document
You don't mention (4) in the discussion and problem description, but if you want to store the formatting AND you want to be able to read the content I believe this is necessary. While WordOpenXML is "readable", there's a lot of mark-up in there which doesn't make reading comfortable.
All things being equal, I'd go for either VBA working on the open Word document or the .NET approach, using the Open XML SDK (free download .NET library you can reference in Visual Studio and distribute with solutions).
One important thing to keep in mind is storing the Word Open XML in the database. Unless something has changed in Access, you can't store the ZIP file - you need a "streamable" format. That would be the OOXML OPC flat-file format.
When you read the WordOpenXML from a document using VBA, that's what you get, which is why that would be an option for me. The Open XML SDK doesn't have that option, but there is code available from Eric White's blog for doing this.
When you later want to recreate and print the document it should be enough to stream the WordOpenXML to a file with the .xml extension. Or you could convert it back to a docx zip file (same blog).
Is it possible to use VBA to create a Lotus Notes email/document using an existing RTF file?
Assuming that the computer is installed with a Lotus-Notes Client
Yes, you can use the Lotus Notes/Domino COM or OLE classes. The COM classes do everything in the background. The client must be installed, but it doesn't have to be running. The OLE classes drive the client UI, so the client will start automatically if it isn't already running and the VBA code will interrupt whatever it is doing.
There is nothing in the Lotus APIs, however, to deal with automatically translating an RTF file into a nicely formatted Notes rich text field. The method AppendRTFile() in the NotesRichTextItem class does not work with a conventional RTF. It's older than RTF being considered a "standard" format, so the terminology is confusing, but it's dealing only with special files that store data in Notes' own internal rich text format. You'll be on your own with that part. Your best bet is probably going to be to find a way to convert the RTF to HTML and then use the NotesMIMEEntity class methods to create a text/html body for the document you are creating. This answer to another question on StackOverflow suggests using Word automation for the conversion, but you may want to do a more thorough review of search on 'convert rtf to html' to see if you can come up with anything better.
I'm looking for a .Net component that will allow me to generate Word and/or PDF documents.
This must work on the server without MS Office installation. Preferably free. Also, it needs to be able to generate the documents based on an existing template of some sort i.e. I don't want to generate the whole document from scratch but allow a number of different templates that all have similar content that comes from elsewhere (e.g. database, XML files etc).
My initial investigations have turned up iTextSharp (but not sure if it can work from templates).
Any help that can expedite my investigation time will be much appreciated.
Thanks
I use ActivePDF at work with .NET - give it some HTML and it will output a pdf doc. However it isn't free - but we did look at a few other ways and this was 1
http://pdfcrowd.com/html-to-pdf-api/
It doesn't do word documents but converts html (your template) to pdf
I have some .doc binary files stored in my database and i would like to now search them all (without converting them to .doc) to see which one contains the word "hello" for instance.
Is there any way to do this search in the binary file?
You could go down the route of using commercial tools. Aspose.Words can load a document from a stream and has all sorts of methods for finding text within the document.
If you have the stream from the DB, then you code would look like this:
Aspose.Words.Document doc = new Aspose.Words.Document(streamObjectFromDatabase);
if (doc.GetText().ToLower().Contains("hello world"))
MessageBox.Show("Hello World exists");
Note: The benefit of this tool is that it does not require Word objects to be installed and it can work with streams in memory.
Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
Of course you could just search for the text you're looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document meta data that's also stored in plain text? And there's always the off chance that the binary data will match your text pattern.
If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.