Fastest way to save OpenXML Visual Basic - vb.net

I'm working on a system that, using Interop, reads in a Word document (which acts as a template) and an Excel document. The program then uses OpenXML to replace keywords in the Word document based on data in the Excel document, and saves the converted document as a PDF. However, the process is slow because it performs a lot of reads and writes to storage:
It saves a copy of the template to the directory where the converted files are to go
It then reads the Excel data into a DataTable and uses OpenXML to replace the keywords in the copy
It then converts the Word file to PDF and deletes the intermediate Word document
It has to do this for each row in the Excel document. I was wondering if there is a way to keep the processing in memory? Unfortunately, the OpenXML library (a modified version that I use) needs a path to the actual Word document on disk and can't just take a Word document object.
I'm thinking of doing the deleting on a directory level instead of file level in an attempt to speed it up. I'm not sure how much of a difference this will make.
Are there any blatant optimisations I'm missing?
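One way to avoid the intermediate files, as a rough sketch in Python using only the standard library: a .docx is a zip package, so the template can be read once and each row's copy can be produced in a memory buffer. The {{KEY}} placeholder syntax, the function names and the stand-in template are all assumptions made for illustration. The sketch also assumes each keyword sits inside a single text run, which Word does not guarantee. In .NET the comparable route would be a MemoryStream passed to WordprocessingDocument.Open, provided the modified library can be changed to accept a stream. The PDF step is outside this sketch.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def fill_template(template_bytes: bytes, values: dict) -> bytes:
    """Return a new package (as bytes) with {{KEY}} placeholders replaced."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == "word/document.xml":
                xml = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so a cell value cannot break the XML.
                    xml = xml.replace("{{" + key + "}}", escape(str(value)))
                data = xml.encode("utf-8")
            dst.writestr(name, data)
    return out_buf.getvalue()


def make_demo_template() -> bytes:
    """Stand-in for the real template: main document part only, not a complete .docx."""
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Dear {{NAME}}</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document)
    return buf.getvalue()


# The template is loaded once; every row is processed without touching the disk.
template = make_demo_template()
rows = [{"NAME": "Ada"}, {"NAME": "Tom & Co"}]
results = [fill_template(template, row) for row in rows]
```

Only the final output would then need to be written to storage, once per row instead of several times.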

Related

Pass a bitmap object to Interop Word function that is expecting a filename string for a bitmap file

The title sounds insane but bear with me. This is a problem that could exist with any object.
I am generating a bitmap object in memory and I would like to pass it directly to another function that wants to open a bitmap file. The simple solution is to write the file to disk, call the function against the file, and then delete the file. I don't want to do that. If I am pushing a high volume of image objects into a Word document with a VSTO add-in, it doesn't make sense to thrash my disk for no reason when the whole thing could be done in memory.
I guess I am looking for a different function to insert a picture into a Word document that accepts a bitmap object. Or a way to pass a filesystem object that actually points to memory (not a RAMDisk, but a RAMFile?). Or a way to wire the "Image.Save" directly to the reader of the "AddPicture" function without actually making a file on disk.
Hopefully, there is a better way of doing this.
Here is the code example:
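' Requires: Imports System.IO and Imports System.Drawing.Imaging
Dim newImage = GenerateImage(InputString, SelectedFormat)
' Write the in-memory bitmap to a temporary file, because AddPicture expects a file name
Dim imagePath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
newImage.Save(imagePath, ImageFormat.Png)
With Globals.ThisAddIn.Application
    .Selection.InlineShapes.AddPicture(imagePath)
End With
' Remove the temporary file once the picture has been inserted
File.Delete(imagePath)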
Word can't "stream" content (see "Background", below), so your choices are 1) the Clipboard or 2) wrapping the bitmap in valid Word Open XML in the OPC flat-file format, which means first converting the bitmap to base64.
For the first, you can use standard .NET methods to place the information on the Clipboard in the format you want Word to use. In the Word "interop", the Paste or PasteSpecial methods will insert it. The argument against this approach is, as ever, "interfering" with the user's Clipboard.
Using Word Open XML is as close as you can get to "streaming" content into Word, using the Range.InsertXML method.
Word documents (and other Office files) are essentially "zip packages" of XML and binary files that together make up the document. It's possible to create and edit these files without opening them in the Word (Office) application, which makes the format suitable for server-side work. Any tool that can work with zip files and XML can be used for this; the standard choice is the Microsoft Open XML SDK, which offers a complete API for Office content.
Word, alone among the Office applications, enables the developer to read and write content in the opened Word document using the OPC flat-file standard. This "concatenates" the entire content of the zip package into a single XML string. The Word object model's Range.InsertXML method is used to write content in this format to a Word document open in the Word application.
Information on how to convert a zip package into OPC flat file can be found in this blog article. Information concerning minimal Word Open XML to have a valid OPC version is described in this article; there is a section in there specifically about working with graphics.
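To make the zip-to-flat-file idea concrete, here is a minimal sketch in Python (standard library only). It is not the code from the blog article; the function names and the tiny demo package are assumptions for illustration. It walks the parts of a package, looks up each part's content type in [Content_Types].xml, and emits XML parts as pkg:xmlData and everything else as base64 in pkg:binaryData. It assumes the XML parts are UTF-8 encoded.

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
REL_CT = "application/vnd.openxmlformats-package.relationships+xml"
DOC_CT = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"


def to_flat_opc(package_bytes: bytes) -> str:
    """Concatenate all parts of a zip package into one flat OPC XML string."""
    zf = zipfile.ZipFile(io.BytesIO(package_bytes))
    types = ET.fromstring(zf.read("[Content_Types].xml"))
    defaults = {d.get("Extension").lower(): d.get("ContentType")
                for d in types.iter(CT_NS + "Default")}
    overrides = {o.get("PartName"): o.get("ContentType")
                 for o in types.iter(CT_NS + "Override")}
    parts = []
    for name in zf.namelist():
        if name == "[Content_Types].xml" or name.endswith("/"):
            continue  # content types become attributes; directories are not parts
        part_name = "/" + name
        ext = name.rsplit(".", 1)[-1].lower()
        ctype = overrides.get(part_name) or defaults.get(ext, "application/octet-stream")
        data = zf.read(name)
        if ctype.endswith("xml"):
            text = data.decode("utf-8-sig")  # assumption: UTF-8 parts
            if text.startswith("<?xml"):
                text = text[text.index("?>") + 2:]  # drop the XML declaration
            body, attrs = f"<pkg:xmlData>{text.strip()}</pkg:xmlData>", ""
        else:
            encoded = base64.b64encode(data).decode("ascii")
            body, attrs = f"<pkg:binaryData>{encoded}</pkg:binaryData>", ' pkg:compression="store"'
        parts.append(f'<pkg:part pkg:name="{part_name}" pkg:contentType="{ctype}"{attrs}>{body}</pkg:part>')
    return ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            '<?mso-application progid="Word.Document"?>'
            f'<pkg:package xmlns:pkg="{PKG_NS}">' + "".join(parts) + "</pkg:package>")


def make_demo_package() -> bytes:
    """Tiny stand-in package: relationships, one document part, one fake image."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml",
                    '<?xml version="1.0" encoding="UTF-8"?>'
                    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
                    f'<Default Extension="rels" ContentType="{REL_CT}"/>'
                    '<Default Extension="png" ContentType="image/png"/>'
                    f'<Override PartName="/word/document.xml" ContentType="{DOC_CT}"/></Types>')
        zf.writestr("_rels/.rels",
                    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"/>')
        zf.writestr("word/document.xml",
                    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
                    '<w:body/></w:document>')
        zf.writestr("word/media/image1.png", b"\x89PNG fake bytes")
    return buf.getvalue()


flat = to_flat_opc(make_demo_package())
```

The resulting string is the kind of input Range.InsertXML expects, although a real document needs the complete set of parts and relationships described in the articles mentioned above.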
Background
Word is based on very old technology - late 1980s. By the mid-1990s it reached a very high standard as a professional word processor and what has happened with it since has mostly been "sugar coating" - adding a bit of this and a bit of that to bring it closer to HTML / page layout. But the core of the application remains the same... and part of that means Word isn't able to do many of the things the modern developer expects - such as "streaming" data in and out.

Import .docx contents into MS Access

I began writing a docx document to do a project of mine.
Recently, I realized that it would be easier to manage that data if it was in a database.
So, I wanted to import that data into MS Access automatically, to avoid copying and pasting the data manually.
Is there any way to do it? I have only encountered ways of opening the Word application via Access. I also know that docx has an XML structure, so I imagine that if I can open that structure, it would be easy to write a parser in VBA.
There are two basic ways information can be taken out of a Word document and put into an Access database: automating the Word object model using VBA code running in either Word or Access OR extracting the WordOpenXML that makes up the Word document. You indicate you lean towards the second option.
Here, again, there are a number of approaches available:
Use VBA in Word or Access to extract the WordOpenXML of the document open in the Word application user interface.
Use VBA in Access together with non-VBA tools to "crack open" the Zip file and extract the XML.
Use the tools available in the .NET Framework to extract the content of the ZIP file and write it to Access using an OLE DB connection.
I understand your goal is to be able to recreate the document at a later point for printing, so you want to preserve all the formatting. In addition, you want to be able to read the content from within Access.
I believe this will require a minimum of four fields in the Access table:
1. ID
2. Title
3. Text of song
4. The complete WordOpenXML for re-creating the document
You don't mention (4) in the discussion and problem description, but if you want to store the formatting AND you want to be able to read the content I believe this is necessary. While WordOpenXML is "readable", there's a lot of mark-up in there which doesn't make reading comfortable.
All things being equal, I'd go for either VBA working on the open Word document or the .NET approach, using the Open XML SDK (a free .NET library that you can reference in Visual Studio and distribute with your solutions).
One important thing to keep in mind is storing the Word Open XML in the database. Unless something has changed in Access, you can't store the ZIP file - you need a "streamable" format. That would be the OOXML OPC flat-file format.
When you read the WordOpenXML from a document using VBA, that's what you get, which is why that would be an option for me. The Open XML SDK doesn't have that option, but there is code available from Eric White's blog for doing this.
When you later want to recreate and print the document it should be enough to stream the WordOpenXML to a file with the .xml extension. Or you could convert it back to a docx zip file (same blog).
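For the readable "Text of song" field, the plain text can be pulled out of the main document part without Word. The following is only a sketch in Python (standard library), with a made-up two-paragraph document standing in for a real file; the function names are assumptions. It joins the w:t runs of each paragraph and ignores tabs, line breaks, headers and footnotes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def docx_plain_text(docx_bytes: bytes) -> str:
    """Return one line of text per paragraph of word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for para in root.iter(W + "p"):
        # A paragraph's text is split over runs; concatenate their w:t elements.
        lines.append("".join(t.text or "" for t in para.iter(W + "t")))
    return "\n".join(lines)


def make_demo_docx() -> bytes:
    """Stand-in for a real .docx: only the main document part is present."""
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Song title</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>verse</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document)
    return buf.getvalue()


song_text = docx_plain_text(make_demo_docx())
```

The extracted text would go into field 3, while the unmodified WordOpenXML goes into field 4 so the formatting can be restored later.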

Easiest way to add raw Open XML code to a Word document

I want to insert a chunk of formatted text into a Word document in VBA. Since this will be done server-side, I am advised not to take these chunks from another Word document using Office Interop (link), so I presume it would be easiest to use Open XML like this: <w:p><w:r><w:t>String from WriteToWordDoc method.</w:t></w:r></w:p> Sadly, Application.ActiveDocument.Range.InsertXML fails. So what other quick and dirty alternatives do I have?
For the "bare necessities" see this MSDN article:
https://msdn.microsoft.com/en-us/library/office/dn423225.aspx?f=255&MSPPError=-2147217396
It describes what the minimum requirements are for inserting Open XML into an open Word document. Don't worry that the discussion targets Web add-ins - the principle is the same for any API that inserts WordOpenXML into a Word document at run-time.
I may be misunderstanding where "server-side" is involved in your process, so forgive me if the following is not relevant to your situation: Note that using VBA server-side is a bad idea - Word is not designed to run in a server environment. Better would be to use the Open XML SDK both to retrieve and write the information between two Word documents.
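As an illustration of those minimum requirements, here is a sketch (in Python, purely to build the string; the function name is an assumption) that wraps a paragraph fragment in a flat OPC package with the two parts that are generally needed: the package relationships and the main document part. A bare w:p with an undeclared w: prefix is not a complete package, which is possibly why InsertXML rejects it.

```python
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC_REL = ("http://schemas.openxmlformats.org/officeDocument/2006/"
                  "relationships/officeDocument")
REL_CT = "application/vnd.openxmlformats-package.relationships+xml"
DOC_CT = ("application/vnd.openxmlformats-officedocument."
          "wordprocessingml.document.main+xml")


def wrap_fragment(body_xml: str) -> str:
    """Wrap WordprocessingML body content in a minimal flat OPC package."""
    return (
        f'<pkg:package xmlns:pkg="{PKG_NS}">'
        # Part 1: tells the consumer which part is the main document.
        f'<pkg:part pkg:name="/_rels/.rels" pkg:contentType="{REL_CT}">'
        f'<pkg:xmlData><Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOC_REL}" Target="word/document.xml"/>'
        "</Relationships></pkg:xmlData></pkg:part>"
        # Part 2: the main document, which declares the w: namespace.
        f'<pkg:part pkg:name="/word/document.xml" pkg:contentType="{DOC_CT}">'
        f'<pkg:xmlData><w:document xmlns:w="{W_NS}"><w:body>'
        f"{body_xml}"
        "</w:body></w:document></pkg:xmlData></pkg:part>"
        "</pkg:package>"
    )


fragment = "<w:p><w:r><w:t>String from WriteToWordDoc method.</w:t></w:r></w:p>"
package_xml = wrap_fragment(fragment)
```

The value of package_xml is what would be passed to Range.InsertXML, or built directly as a string in VBA.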

Search MS Word binary file for specific content

I have some .doc binary files stored in my database and I would now like to search them all (without converting them to .doc) to see which ones contain the word "hello", for instance.
Is there any way to do this search in the binary file?
You could go down the route of using commercial tools. Aspose.Words can load a document from a stream and has all sorts of methods for finding text within the document.
If you have the stream from the DB, then your code would look like this:
Aspose.Words.Document doc = new Aspose.Words.Document(streamObjectFromDatabase);
if (doc.GetText().ToLower().Contains("hello world"))
MessageBox.Show("Hello World exists");
Note: The benefit of this tool is that it does not require Word objects to be installed and it can work with streams in memory.
Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
Of course you could just search for the text you're looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document meta data that's also stored in plain text? And there's always the off chance that the binary data will match your text pattern.
If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.
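For completeness, the naive "just search the bytes" idea could be sketched like this in Python. It assumes the text is stored either as 8-bit ANSI (cp1252) or as UTF-16LE, and that the search term is plain ASCII; the function name and sample data are made up. All the caveats above still apply: a hit may come from metadata or from coincidental binary data, and a miss proves nothing if the text is split across storage blocks.

```python
def naive_contains(blob: bytes, word: str) -> bool:
    """Case-insensitive (ASCII only) byte search for a word in raw file content."""
    haystack = blob.lower()  # bytes.lower() only folds ASCII letters
    needle = word.lower()
    # Try the two encodings the text is assumed to be stored in.
    return any(needle.encode(enc) in haystack for enc in ("cp1252", "utf-16-le"))


# Made-up sample blobs standing in for database content.
sample_unicode = b"\x00\x01header" + "Say Hello".encode("utf-16-le") + b"\xff\xfe"
sample_ansi = b"\xd0\xcf\x11\xe0 HELLO there"
sample_none = b"\x00\x01\x02\x03"

hits = [naive_contains(s, "hello") for s in (sample_unicode, sample_ansi, sample_none)]
```

This is only useful as a cheap pre-filter before a proper parser or the Word libraries confirm the match.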

Storing arbitrary meta-data in Microsoft Word document

I need to store custom meta-data in a Word document. The 'document properties' are limited to 255 bytes but I have data which is >> 10k
We are using VBA to write a Word extension that interacts with our application, and we want to have our application data stored in the Word document. The idea is that the user can share just the Word document without sharing any other data files of our application.
Does anyone have ideas on how to store arbitrary meta-data efficiently in Office 2003+ documents?
Just tried this in the VBA editor immediate window:
ActiveDocument.Variables.Add "foo", String(10000, ".")
? Len(ActiveDocument.Variables("foo"))
10000
Document variables are stored with the document. They seem to go up to 65280 Unicode characters, as I just found out (tested in Word 2003; Update: This limit is the same in Word 2007 and 2010).
If you fear your data will get near the 60k limit, I'd recommend either splitting it over several variables, or compressing it before you store it. This post shows some options to do that in VBA.
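The split-and-compress idea, sketched in Python rather than VBA to show the logic only: compress the payload, encode it as base64 so it is plain text, and cut it into pieces that stay below the limit reported above. The chunk size, the function names and the "appdata_N" variable naming are assumptions for illustration.

```python
import base64
import zlib

MAX_CHARS = 60000  # assumption: a safety margin below the 65280 characters observed above


def to_chunks(payload: str, max_chars: int = MAX_CHARS) -> list:
    """Compress the payload and split the base64 text into storable pieces."""
    packed = base64.b64encode(zlib.compress(payload.encode("utf-8"))).decode("ascii")
    return [packed[i:i + max_chars] for i in range(0, len(packed), max_chars)]


def from_chunks(chunks: list) -> str:
    """Reverse of to_chunks: join, decode and decompress."""
    packed = "".join(chunks)
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")


# Each chunk would be stored under its own document variable name.
payload = "<settings>" + "<item>value</item>" * 5000 + "</settings>"
variables = {f"appdata_{i}": chunk for i, chunk in enumerate(to_chunks(payload))}
restored = from_chunks([variables[f"appdata_{i}"] for i in range(len(variables))])
```

Storing the number of chunks in one extra variable would make reading the data back straightforward.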
This doesn't exactly fit your requirements because it does not work with Word 2003. However, it still might be interesting to you (otherwise I recommend that you use document variables as suggested by Tomalak):
In Word 2007 (and later) you may embed arbitrary XML documents in a Word document as Custom XML.
An example is described in this article on MSDN: How to: Add Custom XML Parts to Documents by Using VSTO Add-Ins