Automation: how to automate transforming .doc to .docx?

I have a bunch of .doc files in a folder which I need to convert to .docx.
To manually convert the .doc to .docx is pretty simple:
Open .doc in Word 2007
Click on Save As...
Save it as .docx
However, doing this for hundreds of files definitely ain't fun. =p
How would you automate this?

There is no need to automate Word, which is rather slow and brittle due to pop-up messages, or to use Microsoft's Office File Converter (ofc.exe), which has an unnecessarily complicated user interface.
The simplest and fastest way is to either install Office 2007 or download and install the Compatibility Pack from Microsoft (if not already done). Then you can convert from .doc to .docx easily using the following command:
"C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme <input file> <output file>
where <input file> and <output file> need to be fully qualified path names.
The command can easily be applied to multiple documents using for (as typed at an interactive prompt; inside a batch file, double the percent signs and write %%F):
for %F in (*.doc) do "C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme "%F" "%Fx"
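The same loop can be sketched in Python, which makes it easy to pass fully qualified paths and to skip files that already have a .docx extension. This is a minimal sketch, not a tested tool: the converter path and the -oice -nme switches are copied from the command above, while the function names build_commands and convert_folder are made up for illustration.

```python
from pathlib import Path
import subprocess

# Path taken from the answer above; adjust if Office is installed elsewhere.
WORDCONV = r"C:\Program Files\Microsoft Office\Office12\wordconv.exe"


def build_commands(folder, converter=WORDCONV):
    """Return one wordconv command (as an argument list) per .doc file in folder."""
    commands = []
    for src in sorted(Path(folder).iterdir()):
        # Compare the suffix exactly so that existing .docx files are skipped.
        if src.is_file() and src.suffix.lower() == ".doc":
            src = src.resolve()              # fully qualified input path
            dst = src.with_suffix(".docx")   # same folder, new extension
            commands.append([converter, "-oice", "-nme", str(src), str(dst)])
    return commands


def convert_folder(folder):
    """Run the converter once per file; raises CalledProcessError on a non-zero exit."""
    for cmd in build_commands(folder):
        subprocess.run(cmd, check=True)
```

Calling convert_folder(r"C:\docs") would run the converter once per file. Building the commands separately from running them makes it possible to print and inspect them first.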

The easiest way is to use the command-line Office File Converter. Just run
ofc
and the magic happens.

Automate Word.
If you are using .NET, add a reference to the Microsoft.Office.Interop.Word assembly to your project (make sure it is version 12, which corresponds to Word 2007, so that the .docx format is available) and use it to automate the Word application to do exactly what you describe above. The pseudocode:
Create the application object
Use the application object to open a document (by supplying it the file name)
Use the application object to perform SaveAs by supplying to it the format and output filename
Close the current document
Loop through the above till you finish with all documents
Housekeeping code to release the Word or Doc objects
You can find plenty of examples on Google; just search for "Word Automation in C#" or something along those lines.
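The pseudocode above can be sketched as follows. The answer targets .NET; this sketch uses Python with the third-party pywin32 package instead, driving the same Word object model through COM, so it only runs on Windows with Word installed. The function names are illustrative, and 12 is the WdSaveFormat value for wdFormatXMLDocument (.docx).

```python
from pathlib import Path

WD_FORMAT_XML_DOCUMENT = 12  # Word's WdSaveFormat value for .docx


def start_word():
    """Start Word through COM. Requires Windows, Word, and the pywin32 package."""
    import win32com.client  # imported lazily so this module loads without pywin32
    app = win32com.client.Dispatch("Word.Application")
    app.Visible = False
    app.DisplayAlerts = 0  # wdAlertsNone: suppress prompts where possible
    return app


def convert_folder(folder, word=None):
    """Open each .doc in folder, save it as .docx, and return the new paths."""
    app = word if word is not None else start_word()
    converted = []
    try:
        for src in sorted(Path(folder).resolve().iterdir()):
            if not src.is_file() or src.suffix.lower() != ".doc":
                continue
            dst = src.with_suffix(".docx")
            doc = app.Documents.Open(str(src))
            try:
                doc.SaveAs(str(dst), FileFormat=WD_FORMAT_XML_DOCUMENT)
            finally:
                doc.Close(False)  # close without saving changes to the original
            converted.append(dst)
    finally:
        if word is None:
            app.Quit()  # only quit an instance that this function started
    return converted
```

Passing an existing application object as word lets one Word instance be reused across several folders. The try/finally blocks are the housekeeping step: they make sure documents are closed and Word is shut down even if a conversion fails.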

Related

Import .docx contents into MS Access

I began writing a docx document to do a project of mine.
Recently, I realized that it would be easier to manage that data if it was in a database.
So, I wanted to import that data into MS Access automatically, to avoid copying and pasting the data manually.
Is there any way to do it? I have only encountered ways of opening the Word application via Access. I also know that docx has an XML structure, so I imagine that if I can open that structure, it would be easy to write a parser in VBA.

There are two basic ways information can be taken out of a Word document and put into an Access database: automating the Word object model using VBA code running in either Word or Access OR extracting the WordOpenXML that makes up the Word document. You indicate you lean towards the second option.
Here, again, there are a number of approaches available:
Use VBA in Word or Access to extract the WordOpenXML of the document open in the Word application user interface.
Use VBA in Access together with non-VBA tools to "crack open" the Zip file and extract the XML.
Use the tools available in the .NET Framework to extract the content of the ZIP file and write it to Access using an OLE DB connection.
I understand your goal is to be able to recreate the document at a later point for printing, so you want to preserve all the formatting. In addition, you want to be able to read the content from within Access.
I believe this will require a minimum of four fields in the Access table:
1. ID
2. Title
3. Text of song
4. The complete WordOpenXML for re-creating the document
You don't mention (4) in the discussion and problem description, but if you want to store the formatting AND you want to be able to read the content I believe this is necessary. While WordOpenXML is "readable", there's a lot of mark-up in there which doesn't make reading comfortable.
All things being equal, I'd go for either VBA working on the open Word document or the .NET approach, using the Open XML SDK (free download .NET library you can reference in Visual Studio and distribute with solutions).
One important thing to keep in mind is storing the Word Open XML in the database. Unless something has changed in Access, you can't store the ZIP file - you need a "streamable" format. That would be the OOXML OPC flat-file format.
When you read the WordOpenXML from a document using VBA, that's what you get, which is why that would be an option for me. The Open XML SDK doesn't have that option, but there is code available from Eric White's blog for doing this.
When you later want to recreate and print the document it should be enough to stream the WordOpenXML to a file with the .xml extension. Or you could convert it back to a docx zip file (same blog).
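To illustrate the "crack open the Zip file" approach mentioned above, here is a minimal Python sketch using only the standard library. It assumes the main document part is stored as word/document.xml in UTF-8, which is what Word normally writes, and it returns the package part XML rather than the flat OPC format discussed above. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_document_xml(docx_path):
    """Return the raw XML of the main document part as a string."""
    with zipfile.ZipFile(docx_path) as package:
        return package.read("word/document.xml").decode("utf-8")


def paragraphs(document_xml):
    """Return the text of each paragraph (w:p) by joining its text runs (w:t)."""
    root = ET.fromstring(document_xml.encode("utf-8"))
    return [
        "".join(t.text or "" for t in p.iter(W + "t"))
        for p in root.iter(W + "p")
    ]
```

The paragraph list could fill the readable text field, while the raw XML string could go into the field that keeps the formatting. Writing either to Access is a separate step that this sketch leaves out.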

how do I output my code to a single file?

I need to dump all of my code for my project into a single text file. Is this possible in Visual Studio 2010? I haven't been able to find any options for this in VS. Is there a third party program that can do it? Every search I've done just turns up "how to print from VB", but does not address printing my actual code. Even if I have to do it module by module, that would be acceptable, but copying and pasting is a bit much.
Just FYI, I'm not talking about printing output from my program. I'm talking about printing the program itself.
Thanks.

This can be done outside of visual studio. Start a command prompt, cd to your project directory and:
type *.vb >filecontent.txt
If you have multiple project folders, you'll need to do this for each one as the type command doesn't have a /s subfolder type of parameter.
Alternatively, you could create a batch file that changes to each folder and performs the type command to output the file contents.
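If a scripting language is available, the subfolder limitation can be sidestepped by walking the project tree. The following Python sketch does that; the function name dump_sources and the header line written before each file are arbitrary choices, and the *.vb pattern is taken from the command above.

```python
from pathlib import Path


def dump_sources(project_dir, output_file, pattern="*.vb"):
    """Concatenate every file matching pattern under project_dir into output_file."""
    root = Path(project_dir)
    out_path = Path(output_file).resolve()
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for src in sorted(root.rglob(pattern)):  # rglob also searches subfolders
            if not src.is_file() or src.resolve() == out_path:
                continue
            # A header line makes it clear where each file starts.
            out.write(f"===== {src.relative_to(root)} =====\n")
            # utf-8-sig drops a leading byte order mark if the file has one.
            out.write(src.read_text(encoding="utf-8-sig", errors="replace"))
            out.write("\n\n")
            count += 1
    return count
```

For example, dump_sources(r"C:\MyProject", r"C:\MyProject\filecontent.txt") would return the number of files written.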

How to script Excel 2013's Spreadsheet Compare?

I'm trying to incorporate the fancy new Spreadsheet Compare function from Excel 2013's Inquire Add-in, into a VBA script.
The plan is to have a macro to automate comparison between two spreadsheets with predefined names, and to export all the differences as a new spreadsheet.
Without success, to date.
Here's what I've tried so far:
Normally, to learn how to automate some Excel functionality, I use Record Macro.
If that fails, I look down the list of addable references, to see if I'm missing something obvious.
Both of those have failed in this case. No code appeared relevant to the Spreadsheet Compare, when I recorded a macro (only the peripheral stuff like cell-select appeared). And none of the addable references looked anything like Spreadsheet Compare.
So how can I script Excel's 2013 Spreadsheet Compare, from VBA?

I opened a similar question for automating the Spreadsheet Compare tool from a .NET application, but I haven't found any other way yet than executing it from command-line.
You can do this from your VBA add-in. All you need is to locate the executable file SPREADSHEETCOMPARE.EXE (usually in C:\Program Files (x86)\Microsoft Office\Office15\DCF) and to execute it in command-line with an instruction file as input argument.
This instruction file must be an ASCII file with the two Excel file paths to compare written in separate lines.

You can't.
VBA does not cover add-ins such as this one.
Spreadsheet Compare was a third-party plugin that Microsoft acquired.
If you need a scriptable compare, you can find scripts on the net that compare workbooks cell by cell or row by row.

Create a runCompare.cmd file:
REM Execute from command line spreadsheetcompare.exe
REM
cd C:\Program Files (x86)\Microsoft Office 2013\Office15\DCF
spreadsheetcompare.exe C:\reportNames.txt
In C:\reportNames.txt, save in the same line the .xlsx files you wish to compare:
C:\fileA.xlsx C:\fileB.xlsx
Execute runCompare.cmd.
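The command-line approach can also be driven from a script. The Python sketch below writes the instruction file and launches the tool. Note that the answers above disagree on whether the two paths go on separate lines or on the same line, so the separator is left as a parameter to try both. The executable path is the one quoted in the first answer and may differ on your machine, and the function names are hypothetical.

```python
from pathlib import Path
import subprocess

# Install location quoted in the first answer; it differs between Office installs.
COMPARE_EXE = r"C:\Program Files (x86)\Microsoft Office\Office15\DCF\SPREADSHEETCOMPARE.EXE"


def write_instruction_file(path, file_a, file_b, separator="\n"):
    """Write the two workbook paths to the instruction file and return its path."""
    # The ASCII encoding follows the first answer; non-ASCII paths raise an error.
    Path(path).write_text(f"{file_a}{separator}{file_b}\n", encoding="ascii")
    return str(path)


def run_compare(instruction_file, exe=COMPARE_EXE):
    """Launch Spreadsheet Compare with the instruction file as its only argument."""
    return subprocess.run([exe, instruction_file], check=False).returncode
```

Passing separator=" " reproduces the single-line layout of the runCompare.cmd answer, while the default writes one path per line.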

mercurial version control with word

This is a followup to svn or mercurial version control of word documents
I potentially want multiple non-programmers to be able to use version control on word documents. I can configure mercurial to look at the unzipped docx files. What I want is as follows:
Read from docx files (answered in that question, using a feature of Mercurial to unzip before comparing, awesome!)
automatically merge documents whenever there are non-colliding changes. It appears from the previous answer that this is done using comparison tools.
programmatically run word on the two documents if there are collisions, comparing the two.
I have manually opened one file, then another in Word to see what it was like. On my word 2004, it seems a bit buggy, but I see from reviews that the feature is much improved in 2010.
I found this link:
http://office.microsoft.com/en-us/word-help/command-line-switches-for-microsoft-office-word-2007-HP010164010.aspx#BM1
for command lines, and now see that I can execute the command:
winword /q /f file1.docx /f file2.docx
The q is for quiet, /f specifies a file. The docs don't say if I can specify two files but I tried and it loads two in separate windows.
So the only thing I don't know is how to trigger word to compare the two.
Is the word interaction a fairly easy scripting job, or does it involve binary APIs that I don't want to know about, like DCOM, ActiveX, etc.

Digging around in the TortoiseHg directory, I found some example scripts implementing diff/merge of doc files in the diff-scripts directory. There is an [extdiff] section in Mercurial.ini that can be configured to use these scripts. This may get you started.

What is the best way to parse Microsoft Office and PDF documents?

I'm developing a Desktop Search Engine using VB9 (VS2008) and Lucene.NET.
The Indexer in Lucene.NET accepts only raw text data, and it is not possible to directly extract raw text from Microsoft Office (DOC, DOCX, PPT, PPTX) and PDF documents.
What is the best way to extract raw text data from such files?

You can, like the Windows Desktop Search, use components implementing the IFilter interface.
Example of its usage from .NET
Links to IFilter implementations
Description of the IFilter interface

I can only talk about MS Office documents here. There are several ways to do this:
Using COM automation
Using converters which output the document in a more accessible format
Using 3rd-party libraries
Using Microsoft's OpenXML SDK
COM automation has the disadvantage of not always being reliable, mainly because applications tend to hang due to modal popup dialogs.
Converters are available for Word. You could check out the Text Converter SDK available from Microsoft which would allow you to use the document converters coming with Word in a stand-alone application. Requires some C coding but since you are using the same conversion engines as Office you will get high-fidelity results. The SDK can be obtained from http://support.microsoft.com/kb/111716.
For the third option, using third-party libraries, you might want to have a look at Apache POI or the b2xtranslator project on SourceForge. The latter provides a C# library which allows you to extract the text from binary Word documents. PowerPoint support is still at an early stage, but text extraction should already be working.
The last option would be to use Microsoft's OpenXML SDK. This might be the preferred/easiest way. Search Google for samples. You could also handle binary documents by first converting them using the Office Compatibility Pack (download and install from Microsoft):
Word:
"C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme <input file> <output file>
Excel:
"C:\Program Files\Microsoft Office\Office12\excelcnv.exe" -oice <input file> <output file>
PowerPoint:
"C:\Program Files\Microsoft Office\Office12\ppcnvcom.exe" -oice <input file> <output file>

For PDF you can use my company's .NET PDF Reader component that features text extraction.
This is exactly the code you write to extract the text from a PDF:
public String ReadTextFromPages(Stream s)
{
    using (PdfTextDocument doc = new PdfTextDocument(s))
    {
        PdfTextReader rdr = doc.GetPdfTextReader();
        return rdr.ReadToEnd();
    }
}
