Using VBA to manage multiple styles definitions in the same Word document - vba

TL/DR: I have a game plan on how to do this below; however, I am wondering if my plan is going to prove to be too complicated, and what additional considerations I need to take into account before diving into building this project. Although I am not an experienced programmer, I am NOT asking for code; I am asking for feedback from experienced Word VBA programmers as to whether my entire idea/approach is one huge mistake.
I have a document "template" (not yet a template file type - I hope to create that as described below) for a report. The report is broken up into different sections:
Letter to the Client
Table of Contents
Section I
Title Page
Body
1.0
2.0
Section II
Title Page
Body
1.0
2.0
Appendix A
Title Page
Body
Appendix B
Title Page
Body
I want each major "metasection" (such as Letter, Section I, Section II, Appendices) to have different styling and formatting. This could be accomplished by having multiple styles for each metasection, e.g.:
Normal-Letter
Normal-SectionI
Normal-Appendices
Heading1-Letter
Heading1-SectionI
Heading1-Appendices
This would quickly become unmanageable.
In order to avoid users having to wade through a huge number of styles to find the correct one (and it is worth noting that if users of this report have to do this, they will likely not use styles AT ALL), it would be nice if I could have the same style name (e.g, Normal) be different depending on which section of the document it is found in. Or said another way, I would like for a document to have multiple style sets depending on the section.
The goals for the user experience are:
The user simply applies the Normal style, Heading1 style, etc, as necessary.
Registered section-specific style definitions are updated when styles are edited via the Modify Style dialog box, or other ways.
The styles are applied automatically and transparently when styles are changed, or when the document is opened, saved, or printed.
ALTERNATIVE: If automatic/transparent style application proves too difficult, execute the style-application routine with a simple command button.
My initial idea on how I might do this in VBA is:
Write VBA code (probably a class) such that there is a style registry of Normals and Heading1s, etc., for each document section.
Write a style-application subroutine which iterates through the registered document sections, selects all the parts with each registered style, and applies the section-specific style from the style registry (preserving any styling that deviates from the style definition).
Write a style-update subroutine that automatically and transparently updates the registered style definitions
The style-application subroutine executes any time styling is applied anywhere in one of the registered sections (so I'll need to tie into Events here).
The style-update subroutine executes any time a style definition in a registered section is changed (so here's another Event I'll need to monitor).
I previously asked a similar question about this topic on Superuser. The feedback I received has led me to believe that I can only accomplish the behavior I want using VBA, so I am now asking a follow-up question here on Stackoverflow.
My question is: am I making a mistake here? I have a feeling there is a better way to solve this problem (perhaps using VBA, perhaps not) than this.

Yes, in my opinion, you are making a mistake.
I have just recently finished a project where I have created a document template for a company. My experiences:
Users vary in knowledge level (obviously)
High level users don't like over-engineered files, because they can't use their own macros as they might conflict with the file's own macros, they can't use their doc properties or their own building blocks etc., as these likely won't be compatible with the macros (or at least they think they won't work, and fiddle around until they actually manage to break them)
low level users are intimidated by the automatisms, and keep avoiding them as long as they can (which means as long as their bosses don't order them to use the file), after which point they start hating the file and the work
Complex solutions like this one usually get abandoned after a few years. Eg. the original developer changes jobs, or moves to another department, and nobody understands the code enough to keep managing it (especially if it is not a well-documented, well-written code, which it won't be, as you are not an experienced VBA programmer).
The developer (you) will be inundated with (sometimes false) bug reports and questions and minor change requests, which gets really annoying after a few weeks (trust me on that :) ). They won't dare change even a font size without consulting you, and in the end, they will ask you to do it. Or, even worse, they try to change something, break it, and then tell you to fix your bug.
Your users would have to remember to use section brakes or other kinds of indicators to indicate the next section. This will seem too much for some, too complex, and if they accidentally remove a section indicator (which they inevitably will), all hell brakes lose, and worst of all:
Undo function will be disabled after each macro run. This, to most users, is a disaster. You don't do that to your users.
So I would say don't go down the macro route. Don't use Doc properties, that didn't work at the company I was working with. (Actually an IT company, with mostly high-level users :) ) The high-level users will create and use their own doc properties, for others, it is just a hassle. Bookmarks get constantly deleted, so no-go either.
My advice:
Use styles. Users will learn to use them quickly.
Get a decent document design. Having 4 different sets of title, heading and normal styles in one document is really unprofessional. Consistency is important, especially as this seems to be a letter to you clients. (Yes, I know, your company is different and your bosses are dumb and this is a special case and and and ... Just saying, talk to a designer, and get a professional look for your template.)
You can manage the Style gallery (Home tab, centre) drop down list on a template basis - so your template will load the used styles into the dropdown at the top, and remove everything else. This works really good, and even as much as 20 styles is manageable, if they are well-named.
Use building blocks: title pages, tables, pre-written and formatted Quick parts (legal mumbo-jumbo, company introduction, contacts, etc.), headers and footers...
And, if you want happy-happy and cooperative users:
After creating a blank template, create a full template:
Fill up a document template with texts, pre-written paragraphs, pre-written titles, so they will only have to click and rewrite, without the need to format or bother with styles and Cover pages and the lot
Educate the users: 2 sessions of 1,5 hour Word class can go a long way. It is a must.
Long post. One last thing: creating a complex Word template, you will be sailing a sea of Word bugs and annoyances. Even without writing macros, this won't be a walk in the park. (I for example gave up on making my TOC work in Office 2013, as after 3 days and 10 versions, it still kept on creating a maximum sized extra paragraph whenever it was inserted. Only in W2013. Still no idea why, but I let it go.)
Whatever you decide to do, best of luck, and have a lot of patience! :)

Related

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
All data objects in a PDF file are represented in a
visually-oriented way, as a sequence of operators which...generally
do not convey information about higher level text units such as
tokens, lines, or columns—information about boundaries between such
units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, the very lightweight and open source PDF Reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A) copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation) and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google. One can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10 page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
EDIT 7 April 2014. Cancel my last. The best extraction is achieved by MS Word. And this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word to text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this is not so important once converted to plain text.
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to be able to extract formatted text, you have to indeed look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what are lines, what are paragraphs and so on. Tables are notoriously difficult for example because they are so diverse.
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and that may seem like a lousy answer, but I believe it's actually true, this is an algorithmically difficult problem and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where - if you are clever and you have enough focus to put some of your attention on it, and especially if you have a good idea what the target market is you are writing this for - you'll get it right, while everybody else will get it mediocre.
(And no, I didn't get it right back then when I was writing that code - we never had enough focus to follow-through and make something that was really good)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not have to deal with font details on a deeper level and they might not be that well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help to excel in text extraction.

Best practice when using multiple forms - vb.net

What is the best practice for having many different menus/screens/forms in a visual basic program? Would it be to just make a new form for each menu or screen that I want? Or are there other better options?
I am not trying to make this overly complicated, I have a group project to work on and we all have different skill levels. That said it has peaked my curiosity so I figured it wouldn't hurt to ask before I got started.
I can see this question being closed pretty quickly as being too open ended so allow me to get in my key gripe on this before that happens... no .Visible property for TabControl pages? Seriously, Microsoft??
Which brings me to the key point. If the forms are in some way related but not necessarily identical I prefer to use a single form with different tabs, despite that glaring shortcoming in the control. (Which you don't have to look far to find workarounds for on SO, but a workaround is still a workaround.) Dynamically manipulating controls at run time is another side of this coin, though one that I tend to use more rarely... but that's just a personal thing.
In a recent application, for instance, I had lists of several types of objects. They were related, but performed quite different functions and the user wouldn't really need to look at more than one list at once. As a result I used one form with a tab for each object list to keep the users' display less cluttered.
Similarly when doing a GL app recently I had the journal header and journal line entries (which go to different tables in the back-end database) in separate parts of the one form. On the other hand asset creation was sufficiently different that I created a different form, despite the creation process sharing some of the underlying data. (That is, journal line data.)
I don't believe in the concept of "best practice" because what's a good practice in one situation may be a very bad one in another. However the "rules of thumb" that I use are:
- Keep the number of forms to a minimum to keep overhead low and reduce maintenance BUT
- If there is no logical "tie" between two functions, don't be afraid to make a new form because trying to maintain one form which performs 7 different roles is a guaranteed path to madness and frustration, especially if you break something inadvertently.
Yes, the two rules conflict, but in a way I see this aspect of design as being akin to database normalisation; there's a sweet spot between over-normalising (a separate form for each and every display) and under-normalising (trying to shoe-horn too many unrelated functions into one form). At the very least the rules always give me pause to think "do I need this form, or does it relate to something that I've already done?"
And the third rule of thumb is, obviously... always look at it from the point of view of your user. Are they going to feel like you're bouncing them around too much? Do all of the forms share a look and feel and, more importantly, control layout so that they always know where to find something?
All of these things will vary from app to app, and there's never one size that will fit all IMHO.
In my case, when I am dealing with multiple forms, I use MDI Parent Form to avoid multiple items in the windows task bar.
Another unusual solution is to set each forms ShowInTaskbar property to false.

Are there VB.NET UI Templates for Managing a DataSet?

Is there a quick and easy way to make a VB.NET user interface for managing the data in a normalized DataSet?
I know that is a very subjective question, so let me explain. For a brief period early in my career, I used to create user interfaces in Microsoft Access. I developed a simple, but very effective approach to user interface design. Here are some details of that approach:
Create one form per table. Put on
each form all controls necessary to
completely manage one row in the
table.
Use combo boxes for
foreign-key columns.
Give the user a
standard way to add rows and delete
rows.
Use Apply and Undo buttons.
Let
the user navigate from one row to
another with a list box.
Provide a
search box and filter options for
more efficient navigation.
Let the
user double-click on controls
representing foreign-key columns to
quickly navigate from one form to
another.
Make the state of each form
persistent (so the user always
returns to the last navigation point)
etc.
Simple, right? I found that Access encouraged this approach. It has many built-in features that make this kind of UI easy. For instance, creating a combo box to represent a foreign key relationship takes about 10 seconds.
Well, I haven't worked in Access for a while. A couple of years ago, however, I was hired to write an application in VB.NET on the NET 2.0 framework. To get a data management user interface up and running quickly, I used my Access experience to write a quick & easy prototype in Access -- that took me about one week. Then I hired a programmer to implement that same UI in VB.NET. What a nightmare! We've been working on that implementation for a year, and I'm still very unsatisfied with the results. Some of the problems we are having:
Apply and Undo buttons don't work quite right. We can't find an event that tells us when the form is "dirty" (thus making Apply and Undo relevant).
Navigation from row to row and from form to form requires surprisingly complicated code. I get the impression that we are fighting against NET's binding features, not working with them the way they were intended to be used.
The NET controls seem buggy. For instance, when the user types a value into a combo box (as opposed to choosing it from the drop down), it doesn't trigger the SelectedValueChanged event.
We seem to be repeating a lot of information. For instance, the DataSet knows there is a relationship between the columns in two tables, but we must nevertheless effectively repeat the details of that relationship when we program the combo boxes, binding, navigtation features, etc.
We still don't have good solutions for the filter and search features. There are lots of little details to work out. (For instance, what if you choose a filter that doesn't include the currently displayed row?)
We are writing many helper functions and classes to simplify the work, and I can't figure out why that effort hasn't already been done by others -- I'm certain we are reinventing the wheel.
etc.
By themselves, none of the above are a big deal -- there are effective solutions to each one. Taken together, however, these problems are making my UI development go much slower than expected.
In an ideal world, I should be able to create a small amount of code relevant to my specific data model (for instance, one user control per table establishing the layout and logic relevant to the rows in that table) then integrate that code into a template which interprets the data model and handles everything else -- navigation, adding and deleting, apply and undo, search and filter, etc.
Thus, my question: Is there anything out there which makes this type of UI development easier?
I've searched the web for various combinations of "generic forms", "UI templates", "data managment forms", etc., but I haven't found anything on topic. Perhaps I just don't know the buzzwords. Is there a specific name for this type of UI development task?
Create UCs for each table. Drop a grid control onto the UC and bind it to the tables dataset using VS's wizard. Select the options that allow for insert, update, delete. Each row on the grid will have those buttons/actions automatically added for you.

Populating PDF fields from a database

I have a PDF file (not created by me - I have no control over the design etc.) which allows users to fill in some form fields in Adobe Reader and save the result. I want to automate the process of populating the fields, using the following steps:
Fetch data from database.
Open PDF template.
Populate form fields with data.
Save modified file to a separate location on disk.
Lock modified file so that the form fields can no longer be edited.
Send file to user.
I'm happy to use PHP, Perl, Python or Java to do steps 2-5 (in descending order of preference), but whatever I use has to work under Linux (i.e. it mustn't rely on libraries which are only available on Windows for example).
The end result should be a PDF which the average user can open and print, but not modify (I'm sure advanced users could find a way to do so, but I accept that I can't guarantee complete security against modification). I don't want to change the structure of the PDF, merely populate the form fields.
Is there a standard piece of software for doing this? I've seen mentions of FDF Toolkit, but I'm not entirely sure if that's what I want and whether it will allow me to lock the file afterwards, and whether what I want to do fits in with the EULA.
Edit: Final answer is to use iText (as suggested by Mark Storer) but to implement it as a web service which allows you to pass in an array of form field names and values and the PDF file 'template'. The web service will be open source (and available on GitHub once I've written it), as per the AGPL, but anything connecting to it won't have to be.
Filling
Any number of different libraries can fill in field values. I'm partial to iText (java) or iTextSharp (c#). I wrote one in Java a number of years ago. It's not that hard). There are lots. Search SO, you'll find 'em.
Locking
There are a couple different levels of "lock the fields".
Each field has a "read only" flag. This is pretty much a courtesy as far as other libraries capable of setting field values are concerned. In fact, it's generally considered to mean "the ui cannot make changes". Form script can, regardless.
Form flattening: Draw the fields directly into the page and removing all the interactivity.
Each one has pros and cons.
Flag: None too secure. Form data still easily accessible. Scrolling fields still scroll.
Flattening: Pretty much the exact opposite. It's harder to modify (though far from impossile). The form data can only be extracted via text extraction (which is hard, but becoming increasingly common). List & text fields that contain more stuff than is visible will no longer scroll.
The ability to flatten forms is relatively rare. Again, iText can do it (as can iTextSharp), but I'm not aware of any other third party libraries that can... I'm sure they exist, I just can't name them off the top of my head.

Source core repositories and sticky notes

An interesting problem occured recently, and I've been thinking of the "best" way (for a given value of "best") to implement this.
In essence, it's one of tracking notes against source code. The example that flagged this was getting a problem fixed in live within SLAs, and how to best achieve this. Without going into all the details, it came down to finding a function that's used in a number of places which may or may not be buggy, yet the problem was being reporting only in a single location.
The fix to meet the SLAs was simply to add a check into the location where the problem was reported, rather than tweaking the common code and having to test everything that touches that function.
The interesting issue is then for upstreaming. The "correct" method would then be to go back and check the original function, validate it's correct for everywhere it's called and then make the change "properly" if its determined the library function is wrong.
The problem is this takes time, so upstreaming may simply take the workaround, etc. However if the problem occurs again (say six months later) in another location calling the same library function, there isn't an easy way to link the two problems together. You can search the bug tracking database, but this isn't guranteed to help - it depends if a note's been added saying something along the lines of "this library function needs more thorough checking, but no time to investigate now".
So the question is this: within a large team of developers (30 plus, split into teams of both support and on-going development), what methods do you use to manage (what are effectively) "sticky notes" against source code, short of adding a comment to the suspicious function's source code saying "this might be a bit dodgy"?
The problem with the commiting a comment is one of process: a change is a change, so committing a zero-change change (i.e., one where just comments are added) is not ideal; developers can make mistakes even adding a comment (hit a stray key or something) so it's always (IMO) better to commit only where actual changes are made.
Now a wiki could be used to track per-file notes, but we've got a minimum of four branches and inexcess of a few hundred files (SQL objects, source code, XML files, etc), so a wiki will get unmangable quite quickly.
This is the sort of thing that it would be nice if SCM's could support - bits of metadata against files that are simply notes, but don't add to the SCM's version history - that can be displayed when doing (say) an svn update, or manually viewed.
There may already be solutions out there -- so how do you manage this type of knowledge sharing?
Well we're now using this method: in each folder checked into SVN, we've created a .url shortcut (this is Windows we're dev'ing on) that links to a page on our development wiki about that folder. Thus we can update the Wiki info freely, and on checkout/update everyone gets a link that will take them to the appropriate Wiki page for that folder/module.
We've not long instigated it so we'll have to see how well it works long term -- but it's better than what we had before (i.e., nothing :-) ).