Is it possible to check automatically if a PDF follows the PDF/UA standard? - pdf

There are different standards for PDF files:
PDF/A: PDF/A-1a, PDF/A-2a or PDF/A-3a
PDF/UA
Potentially more
Is it possible to check if a given PDF file follows the PDF/UA standard?

Partially.
Most PDF based ISO standards (PDF/X, PDF/VT, PDF/E, PDF/VCR, PDF/A) can be automatically checked in software. I'm affiliated with a company (callas software) that has commercial software that handles all of these. The tool also has support for PDF/UA, but the verification is only partial.
The reason is subtle but important. All other PDF based ISO standards state that PDF files need to follow certain guidelines, and those guidelines are easy to verify in software. This or that property has to be set, that property can't have this value, this element must have that or that color space...
PDF/UA makes many of such claims as well, but it also makes different claims. Two examples:
For a document to be compliant with PDF/UA, all text in the document must be tagged with the "correct" language.
Similarly, the elements in the document must be tagged with their type (image, paragraph, title, footnote, table) "in a correct way".
The problem is not for software to check that language tags are present or that structure is present. The problem is for the software to verify that the language tags are indeed for the correct language. And that the structure in the document corresponds to what a human would think makes sense.
There are more examples like that in the PDF/UA standard, and they mean that while software can definitely assist in the verification process, it can't conclusively say that the file is correct. The best it can ever say is that it doesn't contradict one of the computer verifiable rules.

Related

How to check 'PDF Digital Signature' whether it is 'PAdES' standard or not?

How can we check whether our Digital Signature on PDF is according to 'PAdES' standard ?
Best Regards
Pearapon S.
Bangkok, Thailand
The OP uses third-party software (iText 5.4.2 in the case at hand) for creating PAdES signatures and now wants to check whether the resulting signatures indeed satisfy the PAdES specification. Here some ideas for that:
What do you mean by "PAdES signature"?
First of all you have to define what you mean by "PAdES signature":
Do you mean that term as defined in the upcoming PDF 2.0 specification:
The PDF signatures using the Subfilter value ETSI.CAdES.detached are referred to as PAdES signatures and they follow one of two CMS profiles created to be compatible with the corresponding CAdES profiles defined in ETSI EN 319 122.
(ISO 32000-2 FDIS section 12.8.3.4 - CAdES signatures as used in PDF)
Do you mean more specifically a signature following one of the PAdES baseline signature profiles as specified in ETSI EN 319 142-1?
Clause 6 defines four levels of PAdES baseline signatures, intended to facilitate interoperability and to encompass the life cycle of PAdES signature, namely:
a) B-B level provides requirements for the incorporation of signed and some unsigned attributes when the signature is generated.
b) B-T level provides requirements for the generation and inclusion, for an existing signature, of a trusted token proving that the signature itself actually existed at a certain date and time.
c) B-LT level provides requirements for the incorporation of all the material required for validating the signature in the signature document. This level aims to tackle the long term availability of the validation material.
d) B-LTA level provides requirements for the incorporation of electronic time-stamps that allow validation of the signature long time after its generation. This level aims to tackle the long term availability and integrity of the validation material.
(ETSI EN 319 142-1 V1.1.1 section 6 - PAdES baseline signatures)
Or would one of the additional PAdES signatures profiles from ETSI EN 319 142-2 also qualify?
The present document contains a profile for the use of PDF signatures, as described in ISO 32000-1 [1] and based on CMS digital signatures [i.6], that enables greater interoperability for PDF signatures by providing additional restrictions beyond those of ISO 32000-1 [1]. This first profile is not related to ETSI EN 319 142-1 [4].
The present document also contains a second set of profiles that extend the scope of the profile in PAdES part 1 [5], while keeping some features that enhance interoperability of PAdES signatures. These profiles define three levels of PAdES extended signatures addressing incremental requirements to maintain the validity of the signatures over the long term, in a way that a certain level always addresses all the requirements addressed at levels that are below it. These PAdES extended signatures offer a higher degree of optionality than the PAdES baseline signatures specified in ETSI EN 319 142-1 [4].
The present document also defines a third profile for usage of an arbitrary XML document signed with XAdES signatures that is embedded within a PDF file.
(ETSI EN 319 142-2 V1.1.1 section 1 - Scope)
In particular you have to know whether you are also considering the additional PAdES signatures profiles "for CMS digital signatures in PDF" or "for XAdES Signatures signing XML content in PDF" from ETSI EN 319 142-2 or not because these two profiles differ very much from the remaining ones.
I assume you don't consider these two profiles because the former more or less is the good old ISO 32000-1 CMS signature with a few restriction most signers follow anyways, and the latter are XML signatures which are a completely different animal altogether.
All the remaining profiles differ mostly in the amount and type of validation related information and time stamps added to the document to make the validation less and less dependent on data one has to retrieve online.
Options for checking
Manually: You take a PDF browser (like iText RUPS or the PDFBox PDF Debugger), a hex viewer, an ASN.1 Dump utility, the applicable base specification from the list above and the additional specifications referenced from the base specification and check all the criteria.
This requires quite some time for a single signature but in the end you can know everything there is to know about the tested signature. I only recommend this if you have some experience using those tools and working with specifications, or if you really have much time to learn.
Programmatically: You take a general purpose PDF library (if you take a different one than the one you create you signatures with, the credibility of the result might increase), a security functions library (e.g. BouncyCastle), the base specification and the additional ones, and start implementing a program testing the criteria in the specification.
This requires quite some time to develop but then can be re-used, e.g. for quality assurance and regression prevention. I only recommend this if you have some experience using those libraries and working with specifications, or if you really have much time to learn.
You use an existing software or service that evaluates the signature for you, e.g. the ETSI Signature Conformance Checker (http://signatures-conformance-checker.etsi.org/pub/index.shtml), a free online tool that performs numerous checkings in order to verify the conformity of the ETSI Advanced Electronic Signatures.
This obviously is the quickest option but you generally cannot be sure how reliable the tests were executed. Thus, it is good as a first opinion but depending on how much is at stake, I would want more security.
You can hire experts to analyse and assess the validity of your signature according to the desired profile.
This may be expensive but at least you have someone to resort to if later you get into trouble due to some non-compliance.

How can I check PDF/UA compliance?

While there are lot of posts on how to check PDF/A compliance but seemingly there are no posts about testing PDF/UA compliance.
I checked in Acrobat Pro even, we cannot verify PDF/UA compliance through that software.
The only nearest tool which I could find was PAC (PDF accessibility checker).
But I am not sure if that is sufficient for UA compliance.
Is there any tool which you know that can do this job ?
If you run the validation for the "a" variants of PDF/A (PDF/A-1a, PDF/A-2a, etc.) and it passes you'll know if the PDF meets the structural requirements for PDF/UA. However, the machine validators really can't address the semantic requirements...
Complete tagging of "real content" in logical reading order
Tags must correctly represent the document's semantic structures (headings, lists, tables, etc.)
Problematic content is prohibited, including illogical headings, the use of color/contrast to convey information, inaccessible JavaScript, and more
Meaningful graphics must include alternative text descriptions

Modular MediaWiki

I wonder if it is possible to configure MediaWiki (or other wiki tools) as a modular predefined wiki. For instance, on a regular wiki page one can freely edit sections, text, everything.
I am looking for a solution that predefines a number of sections (or modules) that can be added to each wiki page. Then users are free to edit inside those sections within their predefined formats.
Hope someone can help, thanks.
As for MediaWiki, there is at least one extension that can work that way: Semantic Forms, usually used together with Semantic MediaWiki (though that is not necessary). With SF, you will define one or more templates that receives the data entered in the form, and the form can be divided into sections.
A more lighweight solution might be using one of the many boiler plate extensions available.
Either way, with a wiki you can never force your users to follow a certain scheme. The whole philosophy, making wiki's unique among collaborative tools, is that the users, not you, create not only the content but also the structure for the content!
The former Semantic Forms is now called Page_Forms and it is not dependent on SMW https://www.mediawiki.org/wiki/Extension:Page_Forms and can also make use of the Cargo extension https://www.mediawiki.org/wiki/Extension:Cargo
I would disagree that wiki users cannot or should not be forced to follow a scheme for some types of information, though the default is that they do control the categories and namespaces and can create those at will as the data evolves. All this means though is that you manage such issues socially rather than with complex permissions structures, i.e. someone undoes your change and says "do it this way instead". So it's a different kind of forcing, but, still, someone has to make sure categories don't proliferate with bad names, capitalization, etc.
The typical use of forms data is when it must be used to satisfy some legal or professional requirement (say logging for what reason a change was made for Sarbanes-Oxley, or logging what precedents were consulted for logging legal time), or will be providing input strictly to some application (like maps). It would not be a good idea to rigorize literally every page of a wiki this way.

Application (Not a Markup Language) for Producing a User Manual [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
Can anyone recommend a program to create user manuals with? Not a markup language (like LaTeX or DocBook) but more something interactive like Scribus. As I'm not the only one that will update the manual the software should be something that's easy for a novice to pick up but still has some advanced features (like linking in text from external sources/tables, handling masterpages/themes etc.).
Regards,
Oscar
Technical Publishing Software - Views on FrameMaker and Its Alternatives
I've done spec documents with LaTeX and Framemaker, and designed a Framemaker workflow to support a team of 5 analysts producing a spec document for an insurance underwriting system. The document was expected to get to 2,000 pages or so. Many years ago (around 1992-1993) I also worked briefly as a typesetter.
Framemaker is designed for technical documentation and does it very well indeed. It also has features designed to support very large documents with multiple authors - people use this system to do documents with more than 100,000 pages. It is also more accessible than LaTeX to users familiar with word processing software.
Key features of Framemaker:
Documents consisting of multiple
files: You can pull together a
'Book' with multiple subsections in
different files. The document can
also be kept in source control.
Textual MIF format for
import/export: The importer is
somewhat finicky (I found generating
working LaTeX to be easier) but you can
generate items such as data
dictionaries and import them into
the document. The file has textual
anchors (see below) so you can
create cross-reference links that
will be stable across imports. I
find this to be a key feature for
specs as it allows cross-references
to link directly to generated items.
Powerful tagging, indexing and cross-referencing System: Everything
is based on tags in Framemaker and
it is easy to apply tags quickly.
This means that cross-referencing,
indexing, conditional text and
applying styles en-masse is easy and
just works. You can generate indexes and TOCs based on tags, so
having multiple specialised indexes
(such as a list of data field names
from screens or a data dictionary)
is easy to do. The document I
described above had 4 separate
indexes.
Stable: Framemaker is designed for
professionals so it doesn't second
guess you in the way that word does.
It is also much more stable on large
documents. Anyone who's tried to
write a document of more than 50-100
pages on Word should have a pretty
fair idea of what this implies.
Scriptable: FM has a C API and there
are various scripting plugins
(FrameScript and FMPython
being probably the most widely used)
which can be used to automate jobs
in FM. Framemaker 10 adds support
for a Javascript based scripting tool
called Extendscript, presumably
ported across from the scripting facility
in InDesign.
Single-sourcing: From a single FM
document you can produce PDF,
Windows Help (CHM), HTML and print
documents fairly easily. The
cross-references also resolve to
hyperlinks.
Global style controls: You can
easily set up styles for a document
and apply it across the whole
document. It also facilitates
running headers and footers with a
great deal of flexibility in having
them track sections, versions,
chapters etc.
Alternatives to Framemaker
LaTeX/Lout: You've already indicated
that you don't want a markup
lanaguage, but the TeX and
Lout systems are used for large
structured documents and do this
well.
Ventura Publisher: Probably the
only real alternative to Framemaker
if you want that sort of user interface
without paying bodily parts for the
privilege.
It has strong support for structured
documents and an XML-based document
interchange format. It's now owned
by Corel, who still appear to be actively promoting it.
There are a couple of other technical publishing tools on the market: Quicksilver (which used to be known as Interleaf) and ArborText. These two are powerful tools - Interleaf used to be the market leader in this field at one point - but quite expensive.
Adobe Indesign: Although Adobe
claim you can do large documents
with InDesign, the cross-referencing
and other large document features
tend to be viewed as lacking by
Framemaker afficionados. There is,
however, a text entry system for it
called InCopy that apparently
does have this sort of
functionality and quite
a large body of Third-party
plugins, some of which do
support tagging and other such facilities.
InDesign also has a scripting API and
a JavaScript interpreter for executing
scripts.
I haven't used Indesign,
so I can't really comment on how
well it works in practice.
DocBook: This is really just
a standard format for structured
documents but has a large ecosystem
of tools surrounding it for writing
and rendering documents. If you
don't want to use LaTeX you will
probably not want to use DocBook for
similar reasons. As Vinko Vrsalovic
points out (+1), This link goes to a StackOverflow
post from someone describing using
DocBook in practice.
I've never really used DocBook and I've
made so many edits to this post that it's now in Wiki mode, so
someone familiar with DocBook might
want to elaborate on this.
Word processing software: Word
has serious shortcomings as a
technical publishing tool and is not
recommended. OpenOffice has
somewhat better structured
documentation functionality than
word and may be a better choice if
politics or requirement to use .doc
as a document interchange format
preclude a better alternative.
Wordperfect is also
considerably better for
documentation-in-the-large than word
and still has a presence in several vertical markets
such as legal offices.
Madcap Software's Blaze and Flare: These
are new kids on the block and live
in roughly the same space as
Framemaker. The company was founded by former
eHelp (creators of RoboHelp) employees and is
actively developing, with multiple releases yearly. Their
offerings have greatly expanded in the past two years,
to the detriment of the quality of the individual products.
It seems focus has been on turning out new products and
by consequence there are a lot of "fit and finish" issues in
each. The authors have chosen to reinvent the wheel in many ways,
resulting in confusing and often broken implementations. Save often,
you will encounter unhandled exceptions. Source control integration
is flaky. For example, moving or deleting a group of files will result in
one source control commit for each file deletion. Big PITA when
you have source control email notifications. Hello 500 emails.
Flare can import Word and Framemaker files, but the import
is far from seamless. Expect to retain all of your content
but plan on completely re-styling from scratch.
Flare shares many of Word's tendancies to do too much behind
the scenes and assume what the user would choose. The HTML looks
like what Word outputs when you export HTML - lots of custom tags
and attributes, deeply nested inline styles, etc. The text
editor is maddening, for example, its cursor model is different
than any other software you've ever used.
Framemaker vs. LaTeX
These two are main systems I have used to produce large, presentable system documents and I've had good results with both.
Ease of Learning: TeX can give you absolute control but actually
achieving this on a complex LaTeX
document without breaking other
items isn't trivial, particularly
where a large number of macro
packages are involved. Basic LaTeX
isn't hard to learn, but making
modified versions of .sty files that
still work takes a bit of tinkering
if you're not a really deep TeX
hacker. It can be done but be
prepared to spend quite a lot of
time fiddling.
Framemaker can give you a good degree of control on the look of the document and isn't that hard to learn. Getting a house style and tweaking the layout (which you probably will have to do) will be easier with Framemaker.
Ease of Text Entry: You can use tools such as Lyx to provide a
wordprocessor-like front end to
LaTeX, and these work well if you
want to write large bodies of text.
Framemaker's DTP-like user interface
works in a way familiar to people
who are used to wordproessing
software. From this perspective
there is little practical
difference.
Templating Document Structure: Framemaker allows a document
structure to be defined in terms of
tags or an XML schema (if using
Structured Framemaker). LaTeX has a
set of canned structural elements
that are flexible enough to be
useful. Adding additional
structural elements (e.g. a data
dictionary item) can be done as a
macro, but making them auto-number
is a bit more challenging and you will
need to poke around behind the
scenes. Both can do it, but it's
considerably more technical to do it
in LaTeX in anything but trivial
cases.
Also, LaTeX does not have
the facility to template the
document structure in the way that
Structured Framemaker does.
However, you can achieve this type
of effect with DocBook and then
generate to LaTeX if desired.
Ease of Integration: I found making a generator for non-trivially
complex MIF files to be quite
fiddly. The MIF parser is quite
pernickety in FM and doesn't really
give good diagnostics. LaTeX
produces far better error messages
and is quite a bit less fussy.
Technical Publishing Software vs. Layout Software
Page layout software started with Pagemaker and the other main players in this space were its competitor Quark Xpess and now InDesign, with which Adobe is essentially trying to deprecate and replace it and Framemaker. Scribus, which you mentioned before, lives in the same space as these products.
If you are producing a manual with less than (say) 50-100 pages, one of the packages would probably do an adequate job. They are really designed for advertising and layout-heavy publication tasks such as magazines, so their support for large-document features of the sort found in Framemaker is fairly limited. The key issue with these products is scalability - they do not work well on large documents.
Just for reference I have actually typeset a 200-page book (someone's autobiography) using Pagemaker. While the fine-grained kerning and leading control helps a bit for copyfitting, it is still a highly manual process to lay out a book sized document. In this case the book was just straight text with no significant cross-referencing or structure other than chapters. Doing a complex technical spec document or manual this size with Pagemaker would have been very fiddly and probably next to impossible to get right without any mistakes.
Technical Publishing vs. Word Processing Software
This is more of a description of key shortcomings of MS-Word for large spec documents. However, it will illustrate some of the main features required for documentation-in-the-large:
Indexing and Cross-Referencing: This is a real chore in Word, and
quite unstable. Framemaker's
tagging features and LaTeX's labels
mean that you can assign a tag or
known label (in a predictable format
if necessary). The textual format
for the tag anchors is exposed in
the user interface, and is used for
the linkage. In Word, the anchors
are much more opaque and not
easily controllable in this way.
Combined with the clumsy user
interface and instability of the
product, this makes maintaining
these fiddly, and often unstable -
you often have to manually fix them
up.
Templated Layouts: Style support in word are quite basic and
numbering tends to be somewhat
unstable. FrameMaker is all about
driving from the tags and applying
styles based on the tags. Global
style changes just work in
Framemaker in a way that they do not
in Word.
Large multi-file Documents: I've never been able to make this work
well in Word, but it is a key
feature in Framemaker and LaTeX.
Again, Word's instability means that
you tend to spend a lot of time
tidying up after it. As the
document grows larger, the
proportion of time spent on this
work grows quadratically -
propensity for breakage proportional
to n (size of document) * time to
fix proportional to size n (time
to fix)
Why is Word so Unstable: Word does a lot behind the scenes to
support novice users and intervene
in layouts. It is also not really
frame-based (text flow conceptually
separate from document layout), but
the developers try to implement
various frame-like behaviours in the UI. When
the A.I. second-guesses you on a
complex document it often does the
wrong thing. Framemaker 'treats the
user as an adult' and does none of
this so things stay where you put
them.
Other word processors such as
Open Office and WordPerfect do not
misbehave in quite the same way as
Word, which is one of the reasons
that just about any word
processor other than Word will do a
better job of technical documents.
Pre-Flighting: In documentation-speak, this is the
process of checking that your
assemblage of files for the document
(image files etc.) is correct before
committing to print. The
professional systems will complain
about things that are wrong, giving
you a chance to correct it. Word
will just put on a happy face and
try to fix things behind the scenes.
A good example of this is a word
file with linked graphics. If you
copy the file and graphics to
another directory and update one of
the graphics in situ, word may well
still read the file from the old
path (I've seen it do this) and not
the new one you've just updated.
However, this behaviour is not consistent and
typifies the rampant abuse of
unstable heuristics in that product.
Pre-Press Support: A publishing system extends into the pre-press
phase of the workflow. This means
it covers preparation for print.
Word processing software tends not
to have this functionality or have
it in a very limited form.
Without getting too far into this, a key difference is that publishing software tends to treat you like a consenting adult and not get in the way when you want to scale or automate things. One can use word processing software for large scale documentation but it has many design decisions adapted to casual users writing short documents with little regard for quality. These adaptations come at the expense of fitness-for-task on large scale document preparation work. The main issues I find with Word for spec documents are the poor indexing and cross-referencing and general instability issues where I am always having to go back and fix things. However, political considerations in most environments (I'm a contractor) mean one is often stuck with it.
Some general comments on the state of technical documentation software
Framemaker would be the obvious choice if Adobe didn't keep giving off signals that they are trying to deprecate it and move its user base to InDesign. However, FM is widely used in aerospace, software and engineering circles and Adobe's management would face a lynch mob if they actually EOL'd the product without a credible migration path. From what one reads on the web, Adobe's acquisition of FM was driven by John Warnock, but he was ousted and FM became a victim of office politics. The net result is that it's been moved to maintenance mode and is quite stagnant.
Ventura Publisher has also been relegated to a niche market to some extent, but at least Corel do not have two competing product lines in the way that Adobe do. It is probably a passable substitute for FM and may be more politically acceptable to PHB types as it is marketed as a 'business publishing' system.
Quicksilver and Arbortext both seem to be viable products, but are very expensive. I've not used either, so I can't really make any real judgement on their merits.
The markup language systems are free and very powerful in many ways. Lout might be a bit easier to work with as it doesn't have quite the level of legacy baggage that LaTeX does. DocBook is also quite widely used and does have quite a bit of tool support. These technologies put a significant squeeze on the 'geek' end of Framemaker's market share and do so on their merits - they have probably taken quite a chunk out of Adobe's profit margins over the years. I would not dismiss these technologies out of hand, but they will be harder to learn in practice.
You might try evaluating InDesign and a selected set of plugins (concentrate on those for tagging and cross-ref/index management). Finally, some of the word processing software (Wordperfect and OpenOffice) give you a reasonable toolkit for structured documentation and work considerably better for this than MS-Word.
PostScript
Yes, that is a pun. I haven't touched on Pre-Press functionality of any of these products. Printing and Pre-Press are technical fields in their own right and the scope for expensive mistakes means you should probably leave this up to specialists.
Framemaker, InDesign, Ventura, QuickSilver, Arbortext and (presumably) the MadCap products all come with facilities to do pre-press preparation. By and large, word processing software does not.
Doing pre-press with LaTeX tends to involve post-processing the PS output with software like psutils or rendering to PDF and taking the pre-press workflow from there. Generally, most pre-press houses can work from PDF, so a good PDF writing tool like Distiller is the best interface for work prepared from tools that are not designed for prepress work. Note that the quality of the output from Distiller tends to be better than the Ghostscript based ones like PDFCreator.
Note that the RGB colour space of a monitor does not have a direct map to a CYMK colour space used by a printing press. Actually getting colours - especially colour photos - to come out correctly on a press is somewhat fraught if you do not have the right kit. For print production, see a specialist unless you have reason to believe you know what you're doing. For a casual user I would still recommend this 15 years after I was involved in the industry, as mistakes are very expensive to fix once they're committed to print.
If you really do want to do colour print work in-house, you will probably need to calibrate your monitor. For best results, you should get a high-fidelity monitor like this one from HP. In order to calibrate the monitor you may also need a sensor like one of the ones described in this review if the monitor does not come with one. Most professional graphics cards like these from Nvidia, AMD or Matrox have facilities to support gamma correction; many consumer ones do as well. You will also need to get calibration data for the press you are going to be using to print, although the pre-press house will probably be able to do this.
As stated before, print media is quite technical in its own right, easy to get wrong and expensive to fix once it's gone to print. If you're not 100% certain you've got your calibration right, get a colour proof like a Chromalin. This is done from the actual film separations (and is thus quite expensive), so it gives an accurate rendition of the actual colour of the final printed article. Doing this for a few sample pages will give you accurate feedback about whether your calibration is set up right.
Acknowledgements: Thanks to Aidan Ryan for expanding the section on Madcap products.
I would recommend "Help & Manual" from EC Software. You can create a printed manual, PDF, Windows help file (CHM), and HTML web based help from a single source document.
I've heard good things about FrameMaker. I've not used it myself, but have had it recommended to me for just such an application.
Adobe Framemaker indeed is the classic tool for writing user manuals. I've used it for all kinds of long documents, and it works very well. Too bad that Adobe left it to rot for years, before noticing that users wouldn't switch.
MSWord took till 2003 to get the bullet/numbering bugs out, and I don't know if they finally got master document working.
LaTeX still is a reasonable alternative. The format is easy to process, and you could generate it from a wiki.
If you want collaboration, then a language-based approach (LaTeX would be my preference although XML-based ones are also good -- Docbook being the flagship here) does make sense, especially if you are tracking files with a version control system.
Anything that does complicate things like any software with a binary or proprietary format will not help you here.
Sorry if it is not the answer you want.
I agree with Ollivier that using DocBook (or LaTEX) is the sanest approach to have easy conversion, sane formatting, nice version control.
Happily, you can try to have your cake and eat it too with a DocBook editor.
Try the ones on this list and see if any satisfies your needs (I haven't used any).
We are using "Help & Manual" from EC Software and it works quite well. Our authors are spread through the U.S. so we share our content files via a hosted SVN server to manage version control. On each workstation we use Tortoise SVN to stay in sync. The product is extremely easy to use and productive.
A VERY nice explanation on what O'Reilly (actually the ones selling all these books...) uses:
O'Reilly Toolchain
It may seem complicated, but depending on the amount of pages you are going to write you maybe should put some consideration into it.
Word (or your favorite word processor)
I make all my user manuals (not to be confused with user HELP files) in Word. Then I can determine if they need to be in PDF, RTF, DOC or even converted to HTML. To solve the multi-user updating issue, I store the file in Source Control which handles all those fun things.
See the Fastware Project blog for an in depth discussion of the tradeoffs of using DocBook etc. Scott Meyer has tried out a lot of possibilities and shares what he's thinking.
Adobe InDesign CS5.5 is much better at cross references and long documents than earlier versions. It is very powerful and relatively easy to learn and use. The feature set is very rich and the more you learn about it the more you can do with it. It supports very powerful XML features and can import and export XML as needed. It can also map Styles to Tags and Tags to styles allowing you to create your XML in an automated fashion if you simply use a full set of character and paragraph styles. I have used the program for years and produced multiple projects from books to one-off advertisements. It is a graphic design tool, but has support for many aspects of book and manual production. I recommend it if you are more concerned with graphics, images or illustrations. InDesign support a wide number of import and export formats.
InDesign CS5.5 has added and improved support for both interactive content and export for EPUB (electronic book) and Adobe's Digital Publishing Suite (DPS) electronic magazine formats.
Framemaker is an excellent tool for books, manuals and long technical documents. It is a bit harder to learn than InDesign but has a richer set of tools for building variables and running headers and footers, if you have the time and inclination to learn how to use them. It also has a very robust XML feature-set, but I have not used it personally.
Unfortunately, Framemaker suffers from lack of support for graphic design. The color system is based very kludgey and spot (PMS) colors are hard to define. Simple things like adding a stroke color and fill color are rudimentary at best. For example, you still can't select a stroke color that's different from an objects fill color. The program is intended to output to laser and inkjet printers and not really to printing presses.
One feature that is really cool is the ability to apply master pages based on the Paragraph styles appearing on the page. The paragraph/illustration numbering in Framemaker is superior to any other program that I have ever used. But it is also difficult to learn and use.
Both programs support output to PDF and PostScript file formats and can generate hyperlinks and interactive content.

Company insists on using a binary format for all our documentation [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
I work at a company that, for some reason, insists that all our development documentation should be in MS Word format. Which, being a binary format, means we cannot:
Diff versions of a document against each other (so peer reviewing them is a pain - because of the domain we work in, peer reviews for all changes are essential)
Grep a folder-full of documents for keywords
What do you use to write documentation in and why?
Please also give me ammo to change this situation with...
I recently started using DocBook XML to author my documentation.
On the upside, it's a pure text format. You can break a large document into multiple files, and use nodes to bring them all together into a single book. Table of contents and index are automatically generated. Intra-document links (within arbitrary text, pointing to chapters or sections) are very easy. And with a push of a button, I can create a single-html-file version, a chunked-html version (one file per chapter), and a PDF version.
After some tweaking and customization, I'm very happy with the output. The documents look great!!
DocBook is used extensively by real publishers (most notably, O'Reilly), and it's been around for more than fifteen years, so it's reached a certain level of maturity.
On the other hand, all of the processing is done with XSLT, using an ad-hoc collection of tools. (My own docbook pipeline includes Python, Java, Xerces, Xalan, Apache FOP, and PDF-SAM. Plus the official XSLT stylesheet distribution, and my own XSLT customizations.)
DocBook is not a turnkey solution. You won't be able to get going quickly, without reading the manual. And if you don't know anything about XSLT, you'll have to learn.
On the other hand, there are only a dozen or two XML tags that you really need to know to write the documents. (The real expertise comes into play during doc generation from the XML sources.) If one person on your team was willing to be responsible for writing the doc build script, then everyone else on the team could just learn the DTD and do a decent job contributing.
Anyhow... DocBook definitely has some faults. It's not the easiest system for tech authorship. But it's the best open source tool I know of.
The "Subversion Book" is written in DocBook. Here's a page with links to the different book versions (single-html, chunked-html, and PDF):
http://svnbook.red-bean.com/
And here's a link to the DocBook XML sources for the first chapter, so that you can get an idea for how it works:
http://sourceforge.net/p/svnbook/source/HEAD/tree/branches/1.7/en/book/ch01-fundamental-concepts.xml
For ammo, there's the trusty old Pragmatic Programmer, chapter 14: The Power of Plain Text.
As Pragmatic Programmers, our base
material isn't wood or iron, it's
knowledge. We gather requirements as
knowledge, and then express that
knowledge in our designs,
implementations, tests, and documents.
And we believe the best format for
storing knowledge persistently is
plain text. With plain text, we give
ourselves the ability to manipulate
knowledge, both manually and
programmatically, using virtually
every tool at our disposal.
We use a wiki (specifically the one provided by Trac) for the two reasons you mentioned. Plus, if we really need to we can get the text version of the markup and manipulate it in a text-only environment, too (e.g. as part of svn comments during commit).
A format that can be easily reduced to text-only (non-binary) is definitely a must. Having the ability to upconvert it to a pretty format like a PDF is, for us, not terribly important.
Word has change tracking for documents (although it only works up until you accept the changes) and you can also grep them (the text isn't encrypted). So I'm not sure either of your arguments will hold up under scrutiny. I'd love to give you the ammo to change this but I've become jaded and cynical with age.
We use MS Word for our docs (which is a huge improvement over the earlier choice (Lotus WordPro - ugh!).
We use a wiki - specifically Confluence by Atlassian.
It's a commercial product, and it's great. One of the reasons we picked it over free/open wiki engines is that it has a full-blown WYSIWYG editor and various other features that make it more easily accessible to users who are familiar with Word.
We've also come up with a neat trick where we store images, designs, wireframes, etc. in Subversion, and then embed links in the wiki documents to those resources URLs via the Apache/SVN web interface module; notes on how we do this are here if you're interested.
Like Dylan's organisation, we also use the excellent Confluence wiki. I wrote an article about why this is better approach called Wiki is my word-processor, which should give you some reasons to change the situation.
Benefits of using a wiki for internal documentation include the following.
Word-processor users get sucked into changing the layout and typography, however good your templates are, which wastes time and reduces consistency.
A wiki provides full-text search, which you are unlikely to have for your body of the MS Word documents written by everyone.
A wiki provides a document version history; I have never heard of a team successfully keeping all revisions in Word documents and always being able to compare old versions, or using a version control system (with the possible exception of SharePoint but that's whole different failure scenario).
A wiki makes hyperlinks between documents easy; it is too hard to reliably link between documents in a collection of Word documents, so new documents end up duplicating older content into new monolithic documents which means they take more time to read and write.
Separate wiki pages can be edited by different people at the same time, and Confluence can merge changes when multiple people edit the same page at the same time; collaboration is harder with a Word document that only one person can edit at a time.
A wiki like Confluence automatically generates navigation pages based on wiki structure and tags; you need a librarian and lots of discipline to make it possible to browse a large collection of Word documents.
A wiki page usually loads and displays more quickly than a Word document.
A wiki page has more automatic meta-data; you need templates and discipline to make sure that Word documents always have Title, Author and Version set in the document properties and visible in the document on-screen and in print.
If you want more ammunition than this, then there is lots of wiki-promotion on The Atlassian Blog.
You could ask for documentation to be in OOXML (.docx, in the case of Word) format. Not as ideal as using ODT, in my opinion, however, it's still just a zip file with a bunch of XML files inside. :-)
A textual format facilitates merging your documentation with generated items such as JavaDoc, API references or data dictionaries. It also scales much better than word, which is hard to use for large documents. Finally, a format that allows includes allows multiple authors to work on a document concurrently.
LaTeX and FrameMaker (the two systems I have used for this) both have vastly superior indexing and cross-referencing capabilities and have either a native textual format or a textual version of their native format that can be included (MIF in the case of Framemaker). They are also both much more stable than word.
I've built tools that read data dictionaries and generate documentation that can be included into a larger document with stable indexing and two-way cross-referencing. The functional specification for This product was done with LaTeX in this way and got me another gig with the company. I have also developed a similar process with FrameMaker.
Is the entire development team against this requirement, or is it a small group? If it's the entire team, just ignore the mandate and use a text-based format -- wouldn't be the first time employees ignored a silly rule. Works especially well if you've not made a big fuss about it in the past. If you have, management might look especially hard at your docs.
MS Word supports document changes tracking and peer review.
The new MS Office format is fully XML based (to see this, rename a MS Word .docx file to a .zip, then unpack it to see).
Maybe Office 2007 may fit both your company requirements and your concerns ?
You can at least compare Word documents, see the "Track changes" command in the "Extra" menu, or use software like DeltaView. Found via google search first link at lifehacker.com. Searching in word documents should be possible with Google Desktop Search or other similar programs that index all files they are able to read.
Do they insist that you write it in Word or only that it's available in Word format? You could write in a text format and convert it to Word automatically.
Don't you store documentation files in some kind of Version Control System, ideally together with the source code? I would recommend to do this (makes it easy to get the documentation for old software releases).
And if you do store the docs in VCS, you will notice that plain text or XML-bases files are much better for this, because you can get diffs; also, changes between text files are usually stored more efficiently than changes between binary files.
Not to defend MS products here, but MS word can diff documents.
If you use Beyond Compare as the diff tool for your source-control system (As we do, with Perforce), it will show you differences between revisions of your Word docs. Admittedly, it only shows the textual differences - formatting changes are not shown - but this is usually enough for you to see what changed.
This is just another reason to invest in Beyond Compare, as it is one of the most polished pieces of software I've ever used - and it's the best $30 dollars (Less if you buy several) I've spent on software
There are many tools for word document comparison. I currently use a python script that puts a command-line on the built-in compare and merge functionality of word.
http://nicolas.lehuen.com/index.php/post/2005/06/30/60-comparing-microsoft-word-documents-stored-in-a-subversion-repository
It should be easy to automate word to extract all text from a word document into a text file. So you could write a script creating text files from word docs, and grep, compare, version control, Review these text files.
Of course this is not an ideal solution, since you loose your pretty formatting, but it should work.
I think there are programmes that convert Word docs to plain text. Use one of them to convert the word doc to plain text and then use diff, grep etc
Also have a look into recommended toolchain(s) for DocBook.