Extract MS Word document chapters to SQL database records?

I have a 300+ page Word document containing hundreds of "chapters" (as defined by heading formats) and currently indexed by Word. Each chapter contains a medium amount of text (typically less than a page) and perhaps an associated graphic or two. I would like to split the document up into database records for use in an iPhone program: each chapter would be a record consisting of title, id #, and content fields. I haven't decided yet whether I want the pictures to be a separate field (probably just containing a file name), or HTML or similar style links in the content text. In any case, the end result would be that I could display a searchable table of titles that the user could click on to pull up any given entry.
The difficulty I am having at the moment is getting from the Word document to the database. How can I most easily split the document up into records by chapter, while keeping the image associations? I thought of inserting some unique character between each chapter, saving to text format, and then writing a script to parse the document into a database based on that character, but I'm not sure that I can handle the graphics in this scenario. Other options?
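The delimiter idea described above could look roughly like the following. This is only a minimal sketch under assumptions not stated in the question: the marker character (`DELIM`), the table and column names, and the rule that the first line of each chunk is the chapter title are all hypothetical. As the question suspects, it carries no image information, because graphics are lost when saving to plain text.

```python
import sqlite3

# Hypothetical marker typed between chapters before saving as plain text.
DELIM = "\u241e"


def split_chapters(text, delim=DELIM):
    """Split exported text into (title, content) pairs, one per chapter."""
    records = []
    for chunk in text.split(delim):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty chunks, e.g. a marker at the very end
        # Assumption: the first line of each chunk is the chapter heading.
        title, _, body = chunk.partition("\n")
        records.append((title.strip(), body.strip()))
    return records


def store(records, conn):
    """Insert the chapters into a simple table; the id is assigned by SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chapter ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, content TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO chapter (title, content) VALUES (?, ?)", records
    )
    conn.commit()
```

Usage would be along the lines of `store(split_chapters(open("export.txt", encoding="utf-8").read()), sqlite3.connect("chapters.db"))`, where both file names are placeholders.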

To answer my own question, given a fairly simply formatted Word document:
1. Convert it to an Open Office XML document.
2. Write a Python script to parse the document into a database using the xml.sax Python module.
3. Insert images into each record as HTML, to be displayed using a web interface.
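A rough sketch of what such a script might look like is shown here. It is not the original script; it assumes the converted file is an OpenDocument `content.xml`, where headings are `text:h` elements, paragraphs are `text:p`, and pictures are `draw:image` elements with an `xlink:href` attribute. The table layout, the `ChapterHandler` and `load` names, and the sample XML are invented for illustration, and every heading is treated as a chapter boundary regardless of its outline level.

```python
import sqlite3
import xml.sax
from html import escape

# Tiny stand-in for the body of an OpenDocument content.xml (illustrative only).
SAMPLE = b"""<?xml version="1.0"?>
<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
             xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
             xmlns:xlink="http://www.w3.org/1999/xlink">
  <text:h text:outline-level="1">Knots</text:h>
  <text:p>Tie a bowline.</text:p>
  <text:p><draw:frame><draw:image xlink:href="Pictures/bowline.png"/></draw:frame></text:p>
  <text:h text:outline-level="1">Splices</text:h>
  <text:p>Use an eye splice.</text:p>
</office:text>"""


class ChapterHandler(xml.sax.ContentHandler):
    """Collects [title, html_content] pairs, starting a new one at each heading."""

    def __init__(self):
        super().__init__()
        self.chapters = []
        self.in_heading = False
        self.in_para = False

    def startElement(self, name, attrs):
        if name == "text:h":
            self.chapters.append(["", ""])
            self.in_heading = True
        elif name == "text:p" and self.chapters:
            self.chapters[-1][1] += "<p>"
            self.in_para = True
        elif name == "draw:image" and self.chapters:
            # Store the picture as an HTML link inside the chapter content.
            href = attrs.get("xlink:href", "")
            self.chapters[-1][1] += '<img src="%s">' % escape(href, quote=True)

    def endElement(self, name):
        if name == "text:h":
            self.in_heading = False
        elif name == "text:p" and self.in_para:
            self.chapters[-1][1] += "</p>"
            self.in_para = False

    def characters(self, content):
        if self.in_heading:
            self.chapters[-1][0] += content
        elif self.in_para:
            self.chapters[-1][1] += escape(content)


def load(xml_bytes, conn):
    """Parse the XML and insert one row per chapter; returns the chapter count."""
    handler = ChapterHandler()
    xml.sax.parseString(xml_bytes, handler)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chapter ("
        "id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO chapter (title, content) VALUES (?, ?)",
        [tuple(c) for c in handler.chapters],
    )
    conn.commit()
    return len(handler.chapters)
```

Calling `load(SAMPLE, sqlite3.connect(":memory:"))` stores two chapters. Text that appears before the first heading is ignored in this sketch, which may or may not suit a real document.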

Related

Retrieving reports from iManage based on the Document Name or generating list of documents with document information

I extract reports from iManage on a daily basis, and I was searching for macro code that would automate this process. After much searching in various forums, I found this link from Ed Mozley, which helped me understand the retrieval process from iManage.
Saving to iManage with VBA
To retrieve reports from iManage, Ed mentions using the GetDocument function, which takes two parameters (document number, version number). In my case, however, the document numbers change every day with the updates after the day-end process, and they are always unique.
I would like to know if there is a way to generate a list of the documents created on a particular date, containing information such as Document Name, Document Number, Version ID, Document Creation Time, and Database. If I could generate that list, I could compare my list of relevant documents against it, pull the document numbers, save them in an array, and create the relevant copies using the code suggested by Ed Mozley.
Or can we create a copy of a document based on a document name that partially matches the name of a document available in iManage?
Any advice will be of great help.
Thank you
Roshan

Word Automation (VBA): Mail Merge Rich Text Format

I'm trying to do a Word MailMerge via VBA from my Access project. I created a clsWordMerge class so I could declare the Word application WithEvents, and take advantage of Word's MailMerge events, mainly the AfterMerge event.
Everything works fine, and I get the finished Word documents created, except that the source fields containing RTF data end up in the document not as formatted text, but as the RTF codes and data:
<div><font face="Times New Roman" size=3 color=black>This is my <strong><em>test </em></strong>paragraph.</font></div>
Where I would expect to see:
This is my test paragraph
This happens whether I do a mail merge using a CSV file for my data source or an Access table.
So is there any way to correct this, and show the formatted data? I have access to all of the MailMerge events that Word provides.
Thanks..
No, there's no way to merge in RTF and have it display as Word content. RTF is not Word's native file format - a converter is required to display RTF as Word content.
Mail merge literally displays the data text, as it appears in the data source. There are no "advanced features" that enable selectively formatting the mail merge result.
Also, based on painful experience, relying on MailMergeAfterMerge is not advisable. When it was introduced, I tried it, was enthusiastic... until it started failing. The event is unpredictable and not reliable.
Given your requirements, a fully VBA-driven data transfer from Access to Word is a better investment of time and energy.
It probably can be done in certain circumstances, but I agree with Cindy Meister that the Mail Merge Events have not proven reliable (unless they have been fixed - I haven't actually used them for years). The following description of real and likely problems that I have previously encountered when trying this may help:
Not sure any of it can be done if you are merging to Email.
AFAICR the event you are likely to need (MailMergeBeforeRecordMerge) only fires each time Word processes the Main Document, not each time it processes a record in the data source. So if your Mail Merge Main Document "consumes" more than one Data Source record, e.g. because it uses { NEXT } or { NEXTIF } fields, it may be very difficult to get MailMergeBeforeRecordMerge to do what you need. If I am right about that, that would be enough to put me off making the attempt.
In order to insert your "RTF", you must either
a. Have code that can interpret the "RTF" encoding and do all the right things necessary to insert it in your document, or
b. Have code that saves the "RTF" to an external file, then uses (say) Range.InsertFile to insert it and have Word interpret its contents, or perhaps
c. Use the clipboard to help you do the conversion.
If any of your rich text fields actually contained RTF, (a) would be difficult unless you could find a suitable library to help you. But in fact your sample shows a typical Access rich text field value, which is HTML-like. In fact, I think it is all standard HTML tagging that Word can interpret, but I don't know for sure. That could be much easier to interpret, especially if you only need the plain text (at its simplest, you might be able to throw away the tagging and insert the result).
If your rich text is longer than 255 characters (including the markup), Word's Document.MailMerge.DataSource.DataFields("the case-sensitive field name as Word sees it").Value will be truncated. So if you need the whole of the text or more of it, you'll have to get it somewhere else.
The value inserted in the document using a { MERGEFIELD } field is not truncated to 255 characters, so you may be able to get the value from the document. Word MailMerge may impose another limit (can't remember, perhaps 64Kb for an OLE DB connection, perhaps less), or perhaps there is a length limit for the data as a whole.
If you can't get the data from the document, you can get it directly from Access. Probably rather easily if your code is running in Access, but it can be done by using ADODB or perhaps ADO from Word VBA code. Your Mail Merge Data Source will need to retrieve the key fields of the record if you want to do that reliably. During development, if your application is running from Access but you are using VBA code in Word, you will probably also need to make sure that you save your Access database each time you modify your Access VBA code, otherwise Access opens the database exclusively and Word won't be able to retrieve data from it.
If you need to use (b) or (c) to save your HTML to a file then you may need to surround the HTML that you get from Access with <html> tags and possibly <body> tags to get Word to recognise the HTML. You could use Scripting.FileSystemObject to save the text, or perhaps ADODB.Stream if you are already using ADODB to retrieve Access data.
You should be able to use VBA Range.InsertFile to insert it, as long as you have some placeholder that tells you where to put it. Or you could use an INCLUDETEXT field and ensure that your Event code updates that field. A snag with the INCLUDETEXT approach is that if you merge to a new document, the INCLUDETEXT fields remain in the document, so if you update them, they will all end up with the same result unless you also create a new file for each source record.
i.e. quite a lot to think about!
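To make two of the points above more concrete, here is a small sketch in Python (not VBA, so it illustrates the text handling only, not the Word automation). `strip_tags` throws away the tagging to leave plain text, and `save_for_insert` wraps a fragment in `<html>` and `<body>` tags and writes it to a file that something like Range.InsertFile could then pick up. The function names, the file name, and the choice of wrapper tags are assumptions made for illustration, and whether Word needs further hints (such as an encoding declaration) has not been checked here.

```python
from html.parser import HTMLParser
from pathlib import Path

# The Access rich text value quoted in the question.
SAMPLE = ('<div><font face="Times New Roman" size=3 color=black>This is my '
          '<strong><em>test </em></strong>paragraph.</font></div>')


class _TextOnly(HTMLParser):
    """Keeps the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return only the plain text of an HTML-like fragment."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


def wrap_fragment(fragment):
    """Surround a fragment with wrapper tags so it reads as an HTML document."""
    return "<html><body>" + fragment + "</body></html>"


def save_for_insert(fragment, directory, name="merge_fragment.html"):
    """Write the wrapped fragment to a file and return its path."""
    path = Path(directory) / name
    path.write_text(wrap_fragment(fragment), encoding="utf-8")
    return path
```

With the sample value, `strip_tags(SAMPLE)` gives the plain sentence without any formatting, which matches the simplest option described above, at the cost of losing the bold and italic.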

How to extract table data from pdf and store it in csv/excel using Automation Anywhere?

I want to extract the table data from pdf to excel/csv. How can I do this using Automation Anywhere?
Please find below the sample table from pdf document.
There are multiple ways to extract data from PDFs.
You can extract raw data, formatted data, or create form fields if the layout is consistent.
If the layout is more random, you might want to take a look at IQ Bot, where there are predefined classifications for things like Orders etc.
If you have a standard format, I would err toward using form fields when the document contains unusual characters, such as the " character for inches, since their encoding doesn't map well with the raw/formatted options.
The raw format has some quirks where you don't always get all the characters you expect, such as a missing first letter of a data item.
The formatted option is good at capturing tabular columns as they go across the line.

format structure of a ms word document

I have converted a well structured pdf document into a rich text format.
By structured document I mean that the document has well-formatted heading levels, bullets and numbering, and sections, and also has a table of contents.
After conversion from PDF, the rich text file looks almost exactly like the original PDF document, but the formatting data is not available in the document. Heading levels and numbering are not available in the outline view of MS Word. The numbering seems to be plain text typed one item after another; it does not behave like normal MS Word numbering, which increments automatically for every new line. The same goes for bullets and headings: they do not form a structure of sections.
For example, when I select a bullet character, the bullet characters of the same group should be highlighted. Instead, only the bullet character that I select gets highlighted.
It is a document with 200+ pages. I need to apply the styles and formatting supported by MS Word by default. Kindly help in finding a way to do this.
Now, I used MS Office 2013 to convert the PDF into rich text and it worked fine. It retained almost all of the heading information and table structure. Even though I had to apply some manual formatting, it is by far the best-formatted conversion I have used.

Word VBA - Matching large selection of text-based keys with data. Embedded resource/text?

I have a pretty complex VBA plugin for Word written that automatically creates a report for me, using XML input, cycling through the X objects within the report to create the output. It is currently embedded into a Word Template file .DOCM.
I need to insert into the report a static list of text, based on the name of the item within the XML. For example, within my XML I have entries with a name BLAH1, BLAH2, BLAH3. Every time I see BLAH1, I need to match it with the static INSERT1, and BLAH2 match it with INSERT2, etc.
This seems simple enough, but here lies the problem...
It appears there are no hash maps in VBA without requiring external libraries, which I can't really rely on, since I can't install items on the machines where this will be running. As a result I can't store this reference data in a hash map as far as I can tell.
I can't seem to concatenate more than about 20 lines of strings together without hitting a limit within VBA, so I can't simply build one chunk of text and parse it for what I need: there are about 1500 "lines" in my reference data, which greatly exceeds 20.
I also haven't found a way to embed a text file, or any other type of file, to hold this information within the template and then parse the data.
I really would like to have everything within the single template file, without requiring additional text or other files to be bundled with the document. If there is no other option, I will go that route, but I wanted to see what creative ideas people at Stack Overflow might have first ;-)
Have you considered using Word's Document Variables? They are name/value pairs stored invisibly within the document. Use ActiveDocument.Variables("BLAH1").Value = "INSERT1" to create one, and Debug.Print ActiveDocument.Variables("BLAH1").Value to retrieve a value (you have to use an error handler to detect non-existent indices if you go that route). Word can store (at least) hundreds of thousands of these things.