Is there a description of the mecab (Japanese word parser) algorithm? - mecab

Is there a document somewhere that describes the Mecab algorithm?
Or could someone give a simple one-paragraph or one-page description?
I'm finding it too hard to understand the existing code, and what the databases contain.
I need this functionality in my free website and phone apps for teaching languages (www.jtlanguage.com). I also want to generalize it for other languages, and make use of the conjugation detection mechanism I've already implemented, and I also need it without license encumbrance. Therefore I want to create my own implementation (C#).
I already have a dictionary database derived from EDICT. What else is needed? A frequency-of-usage database?
Thank you.

Some thoughts that are too long to fit in a comment.
§ What license encumbrances? MeCab is dual-licensed including BSD, so that's about as unencumbered as you can get.
§ There's also a Java rewrite of Mecab called Kuromoji that's Apache licensed, also very commercial-friendly.
§ MeCab implements a machine learning technique called conditional random fields to do morphological parsing (separating free text into morphemes) and part-of-speech tagging (labeling those morphemes) of Japanese text. It can use various dictionaries as training data, which you've seen: IPADIC, UniDic, etc. Those dictionaries are compilations of morphemes and parts of speech, and represent many human-years of linguistic research. The linked paper is by the authors of MeCab.
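To give a rough feel for what such a parser does at run time: it looks up every dictionary entry that matches at each position of the input, which forms a lattice of candidate morphemes, and then picks the cheapest path through that lattice, where the cost combines a per-word cost with a cost for connecting adjacent parts of speech. The following is only a minimal Python sketch of that search under stated assumptions: the dictionary entries, the part-of-speech labels and every cost below are invented for illustration (in a real system the costs come from training), and the function names are hypothetical.

```python
# Toy lattice search. All entries and costs are made up for illustration.
# surface -> list of (part_of_speech, word_cost); lower cost = more likely.
DICTIONARY = {
    "すもも": [("noun", 10)],
    "す": [("noun", 40)],
    "もも": [("noun", 12)],
    "も": [("particle", 5), ("noun", 50)],
    "の": [("particle", 5)],
    "うち": [("noun", 12)],
}

# Cost of placing the right-hand class directly after the left-hand class.
CONNECTION_COST = {
    ("BOS", "noun"): 0, ("BOS", "particle"): 30,
    ("noun", "particle"): 0, ("noun", "noun"): 20,
    ("particle", "noun"): 0, ("particle", "particle"): 25,
    ("noun", "EOS"): 0, ("particle", "EOS"): 10,
}


def segment(text):
    """Return the lowest-cost list of (surface, pos) covering text, or None."""
    if not text:
        return []
    # best[i][pos] = (total cost, back pointer) for paths ending at offset i
    # whose last morpheme has part of speech pos.
    best = [{} for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, None)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            for pos, word_cost in DICTIONARY.get(surface, []):
                for prev_pos, (prev_cost, _) in best[start].items():
                    cost = prev_cost + CONNECTION_COST[(prev_pos, pos)] + word_cost
                    if pos not in best[end] or cost < best[end][pos][0]:
                        best[end][pos] = (cost, (start, prev_pos, surface))
    finals = best[len(text)]
    if not finals:
        return None  # some stretch of the text is not covered by the dictionary
    pos = min(finals, key=lambda p: finals[p][0] + CONNECTION_COST[(p, "EOS")])
    # Walk the back pointers from the end of the text to the beginning.
    path, offset = [], len(text)
    while offset > 0:
        _, (start, prev_pos, surface) = best[offset][pos]
        path.append((surface, pos))
        offset, pos = start, prev_pos
    return path[::-1]


if __name__ == "__main__":
    for surface, pos in segment("すもももももももものうち"):
        print(surface, pos)
```

A real dictionary also has to handle words it has never seen and carries far richer part-of-speech information, which is where most of the human effort mentioned above goes.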
§ Others have applied other powerful machine learning algorithms to the problem of Japanese parsing.
KyTea can apply both support vector machines and logistic regression to the same problem. It is written in C++, Apache licensed, and the papers are there to read.
Rakuten MA is in JavaScript, also liberally licensed (Apache again), and comes with a regular dictionary and a light-weight one for constrained apps—it won't give you readings of kanji though. You can find the academic papers describing the algorithm there.
§ Given the above, I think you can see that simple dictionaries like EDICT and JMDICT are insufficient to do the advanced analysis that these morphological parsers do. And these algorithms are likely way overkill for other, easier-to-parse languages (i.e., languages with spaces).
If you need the power of these libraries, you're probably better off writing a microservice that runs one of these systems (I wrote a REST frontend to Kuromoji called clj-kuromoji-jmdictfurigana) instead of trying to reimplement them in C#.
Though note that it appears C# bindings to MeCab exist: see this answer.
In several small projects I just shell out to MeCab, then read and parse its output. My TypeScript example using UniDic for Node.js.
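The "shell out and parse" approach can be sketched in a few lines. This is a Python illustration rather than the TypeScript code mentioned above, and it rests on assumptions: that a `mecab` executable is on the PATH, and that it uses its default output format, where each morpheme is a line of the form surface, tab, comma-separated features, and each sentence ends with a line reading `EOS`. The sample text is an IPADIC-style illustration, not captured output; the feature columns differ between dictionaries.

```python
import subprocess


def run_mecab(text):
    """Run the mecab CLI on text and return its raw output.

    Assumes a `mecab` executable is installed and on the PATH.
    """
    completed = subprocess.run(
        ["mecab"], input=text, capture_output=True, encoding="utf-8", check=True
    )
    return completed.stdout


def parse_mecab_output(raw):
    """Split default-format output into sentences of (surface, features)."""
    sentences, current = [], []
    for line in raw.splitlines():
        if line == "EOS":  # end-of-sentence marker
            sentences.append(current)
            current = []
        elif line:
            surface, _, feature_field = line.partition("\t")
            current.append((surface, feature_field.split(",")))
    return sentences


# Illustrative IPADIC-style lines (hand-written, not real captured output).
SAMPLE = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "EOS\n"
)

if __name__ == "__main__":
    for surface, features in parse_mecab_output(SAMPLE)[0]:
        print(surface, features[0])
```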
§ But maybe you don't need full morphological parsing and part-of-speech tagging? Have you ever used Rikaichamp, the Firefox add-on that uses JMDICT and other low-weight publicly-available resources to put glosses on website text? (A Chrome version also exists.) It uses a much simpler deinflector that quite frankly is awful compared to MeCab et al. but can often get the job done.
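For comparison, a rule-based deinflector of this general kind can be very small. The sketch below is not Rikaichamp's actual rule set or code; the suffix rules and the word list are toy assumptions chosen for illustration, and it applies only a single rewriting step, whereas a usable deinflector chains rules and checks the part of speech of each candidate.

```python
# Toy suffix-rewriting rules: (inflected ending, dictionary-form ending).
RULES = [
    ("ました", "る"),  # polite past of an ichidan verb: 食べました -> 食べる
    ("かった", "い"),  # past of an i-adjective: 新しかった -> 新しい
    ("って", "う"),    # te-form of some godan verbs: 買って -> 買う
    ("って", "る"),    # te-form of other godan verbs: 作って -> 作る
    ("て", "る"),      # te-form of an ichidan verb: 食べて -> 食べる
]

# Stand-in for a lookup against a dictionary such as EDICT or JMDICT.
KNOWN_WORDS = {"食べる", "新しい", "買う", "作る"}


def deinflect(word):
    """Return the dictionary forms that word could be an inflection of."""
    candidates = {word}
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidates.add(word[: len(word) - len(suffix)] + replacement)
    # Over-generate, then keep only candidates the dictionary knows.
    return sorted(c for c in candidates if c in KNOWN_WORDS)


if __name__ == "__main__":
    for word in ["食べました", "新しかった", "買って"]:
        print(word, deinflect(word))
```

The design is "generate guesses, then filter by dictionary lookup", which is why it needs nothing beyond a word list, and also why it produces wrong guesses that a statistical parser would avoid.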
§ You had a question about the structure of the dictionaries (you called them "databases"). This note from Kimtaro (the author of Jisho.org) on how to add custom vocabulary to IPADIC may clarify at least how IPADIC works: https://gist.github.com/Kimtaro/ab137870ad4a385b2d79. Other more modern dictionaries (I tend to use UniDic) use different formats, which is why the output of MeCab differs depending on which dictionary you're using.

Related

Is there a repository of grammars for CMU Sphinx?

I'm writing an (offline) voice recognition app. I have CMU Sphinx4 set up and working using some of the included demo dictionaries. However, they're of limited scope (e.g., numbers, cities, etc.).
Is there a more comprehensive grammar available? Or maybe a repository of more of these limited grammars? I'm trying to exhaust any other options before creating my own.
Thank you
Grammars are always specific to your particular goal, so it does not make sense to share them. Even a subject as simple as digits can vary between concrete applications: we use "zero" and "oh" to denote "0" in regular speech, whilst scientists also use "not" for the same purpose.
Sphinx4 supports JSGF and GRXML formats, you can easily find specifications of both.
You seem to be mistaking grammars with dictionaries. They are completely different things.
Sphinx supports not only grammars but also n-gram language models. You may find them more versatile. Such a model can be generated automatically, and it will work well if it is built from a large corpus that reflects real usage.
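To illustrate what "automatically generated" means here, the core of an n-gram model is just counting. The following is a minimal bigram sketch in Python under stated assumptions: the three-sentence corpus is invented, the function names are hypothetical, and the estimate is a plain relative frequency. Real language-model toolkits add smoothing for unseen word pairs and write the result in a format the recognizer can load.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return unigrams, bigrams


def bigram_probability(unigrams, bigrams, prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 if prev is unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


if __name__ == "__main__":
    corpus = ["call home", "call the office", "go home"]
    uni, bi = train_bigrams(corpus)
    print(bigram_probability(uni, bi, "call", "home"))
```

Unlike a grammar, nothing here forbids a word sequence outright; unlikely sequences simply get a low probability, which is what makes such models more forgiving of free-form speech.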
As for dictionaries, creating them for English is relatively simple. One could even imagine a tool that reads the phonetic representation of each word from an online dictionary and converts it to Sphinx format. The only input would be a word list.
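The output side of such a tool is simple. The sketch below assumes the plain-text pronunciation dictionary layout, one entry per line with the word followed by its space-separated phonemes and alternate pronunciations marked as `word(2)`, `word(3)` and so on. The lookup table stands in for the online dictionary, its phoneme strings are illustrative, and the function name is hypothetical.

```python
def to_sphinx_dict(words, pronunciations):
    """Format a word list as pronunciation-dictionary lines.

    pronunciations maps a word to a list of phoneme lists; words without
    an entry are skipped.
    """
    lines = []
    for word in sorted(set(words)):
        for index, phones in enumerate(pronunciations.get(word, []), start=1):
            # The first pronunciation uses the bare word, later ones word(N).
            entry = word if index == 1 else f"{word}({index})"
            lines.append(f"{entry} {' '.join(phones)}")
    return "\n".join(lines)


# Illustrative stand-in for data fetched from a phonetic dictionary.
PRONUNCIATIONS = {
    "zero": [["Z", "IH", "R", "OW"], ["Z", "IY", "R", "OW"]],
    "oh": [["OW"]],
}

if __name__ == "__main__":
    print(to_sphinx_dict(["zero", "oh"], PRONUNCIATIONS))
```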
I believe this paper will come in handy for your effort. It describes creating a grammar and dictionary for a new language, Swahili.

What tool to use for finding duplicated Ada code due to copy&paste

I'm looking for a tool for finding duplicated code due to copy&paste programming to be run over a large Ada codebase. I suppose that Ada support in the tool is important for detecting more than the trivial text similarities, that is, ignore layout or identifier difference, etc.
The tools that I have found with Ada support are the following:
Clone Doctor, commercial product with support for several languages, including Ada. http://www.semdesigns.com/Products/Clone/index.html
ConQAT: commercially supported open source product that includes a CloneDetection tool with Ada support since September 2011 http://conqat.cs.tum.edu/index.php/CloneDetectionTutorial
Have you tried these tools? Am I missing any other one of interest? Is the language support really significant or a general text tool would be enough? What is your experience with code duplication detection?
Thanks in advance.
I'm the author of CloneDR. Read the following with my bias in mind.
It is important to understand the differences in the detection methods of clone detection tools, and the quality of the results as a consequence.
ConQAT is a representative of what are called "token based" detectors. They match sequences of language tokens (operators, identifiers, brackets, keywords etc.) The good news is they are pretty fast (that isn't a big issue; you don't run clone detection every 30 seconds, once a week is enough). They will find some clones that are near-misses, in the sense that another identifier or constant is substituted for an identifier in a clone. The bad news is that they don't understand the structure of your code and consequently want to report things like
} void ID ( ID
as clones. This is defeated by making the detectors only hunt for very long sequences of tokens (typically 30 or more), which means token-based detectors cannot find small but interesting clones without also drowning you in false positives like the above.
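A token-based detector of the kind described can be sketched as follows. This is a deliberately tiny Python illustration, not ConQAT's implementation: the tokenizer, the keyword list and the function names are assumptions, and identifiers and numbers are collapsed to `ID` and `NUM` so that near-miss clones with renamed variables still match.

```python
import re
from collections import defaultdict

# Toy keyword list and tokenizer; a real tool has one per language.
KEYWORDS = {"if", "then", "else", "end", "for", "while", "return", "void", "int"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source):
    """Turn source text into tokens, hiding identifier and number spellings."""
    tokens = []
    for tok in TOKEN.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def find_clones(source, window):
    """Map each token sequence of length window seen more than once to its offsets."""
    tokens = normalize(source)
    seen = defaultdict(list)
    for i in range(len(tokens) - window + 1):
        seen[tuple(tokens[i:i + window])].append(i)
    return {seq: offsets for seq, offsets in seen.items() if len(offsets) > 1}


if __name__ == "__main__":
    code = "total = a + 1; count = b + 2;"
    print(find_clones(code, 6))       # one clone pair
    print(len(find_clones(code, 2)))  # a short window reports many more matches
```

Shrinking `window` shows the trade-off described above: short windows match fragments that are structurally meaningless, long windows miss small clones.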
CloneDR operates by parsing the code (even for Ada) just like a compiler, building abstract syntax trees, and matching the trees up to a point of difference. It cannot propose a clone that crosses structure boundaries in silly ways. It will find near misses of the same kind as the token based detectors, but it goes beyond this. CloneDR will find consistent substitutions ("anti unifiers") which means clones can be explained by a small number of parameters that have been used in many places in the clone, and it will find variations in the code in which the mismatches are larger than a single token, e.g., expressions, statements, declarations, even blocks. So it produces fewer false positives and better answers. Independent research reports that compare types of clone detectors, specifically including CloneDR, agree with this analysis.
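The difference a parser makes can be shown with a much-simplified sketch. It is not CloneDR's algorithm: it uses Python's own `ast` module as a stand-in parser (so it analyzes Python, not Ada), only compares whole functions, and only finds trees that become identical once names and constants are blanked out, with no anti-unification and no larger mismatches. The class and function names are hypothetical.

```python
import ast
from collections import defaultdict


class Normalizer(ast.NodeTransformer):
    """Blank out names and constants so only the tree shape remains."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Constant(self, node):
        node.value = 0
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)  # normalize arguments and body as well
        return node


def find_tree_clones(source):
    """Group top-level functions whose normalized syntax trees are identical."""
    groups = defaultdict(list)
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            name = node.name  # remember the name before it is blanked out
            shape = ast.dump(Normalizer().visit(node))
            groups[shape].append(name)
    return [names for names in groups.values() if len(names) > 1]


EXAMPLE = '''
def area(w, h):
    return w * h + 1

def cost(price, qty):
    return price * qty + 5

def diff(a, b):
    return a - b
'''

if __name__ == "__main__":
    print(find_tree_clones(EXAMPLE))
```

Because the unit of comparison is a subtree, a match can never start in the middle of one construct and end in the middle of another, which is the property the token-based approach lacks.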
There is more detailed discussion at the Clone Doctor link you listed above. You can see examples of detected clones for many languages (but we don't have an Ada report on the web site).
EDIT March 19, 2012:
Now you can download an eval copy of an Ada95 CloneDR.
Ira Baxter has a good description.
Token-based clone detection tools tend to be good enough for our purpose, which is usually to get a quick overview of how bad code duplication is in a body of source code we haven't seen before, and how duplication is distributed across that code.
In particular, we are happy with CCFinderX, because it has a nice visualization frontend.
However, it's buggy, unmaintained, and the code has been released but without any license statement.
It has language specific preprocessors for some languages, but we often just disable them (they are buggy as well).
If you need better accuracy, you know exactly the language you need to parse (e.g. with C or C++, this is not always the case), and you can find a tool that parses exactly that language (which is also an issue with C and C++), a parsing-based approach may be better, as Ira writes.

How I can start building wordnet for Turkish language to use in sentiment analysis

Although I have an EE background, I didn't get the chance to attend natural language processing classes.
I would like to build a sentiment analysis tool for the Turkish language. I think it is best to create a Turkish WordNet database rather than translating the text to English and analyzing the buggy translation with the existing tools. (Is it?)
So what do you recommend I do? Should I first take NLP classes from an open class website? I really don't know where to start. Could you help me, and maybe provide a step-by-step guide? I know this is an academic project, but I am interested in building skills in that area as a hobby.
Thanks in advance.
Here is the process I have used before (making Japanese, Chinese, German and Arabic semantic networks):
Gather at least two English/Turkish dictionaries. They must be independent, not derived from each other. You can use Wikipedia to auto-generate one of your dictionaries. If you need to publish your network, then you may need open source dictionaries, or license fees, or a lawyer.
Use those dictionaries to translate English Wordnet, producing a confidence rating for each synset.
Keep those with strong confidence, and manually approve or fix those with medium or low confidence.
Finish it off manually.
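The translation and confidence steps above can be sketched in Python as follows. The scoring rule is a hypothetical one chosen for illustration (it is not the rule from the paper): a candidate proposed by both dictionaries is rated high, a candidate proposed by one dictionary for two or more English words of the synset is rated medium, and anything else is rated low. The tiny dictionaries are illustrative as well.

```python
def translate_synset(english_words, dict_a, dict_b):
    """Rate each candidate translation of a synset as high, medium or low."""
    support = {}  # candidate -> set of (dictionary label, English word)
    for word in english_words:
        for label, dictionary in (("a", dict_a), ("b", dict_b)):
            for candidate in dictionary.get(word, []):
                support.setdefault(candidate, set()).add((label, word))
    ratings = {}
    for candidate, evidence in support.items():
        dictionaries = {label for label, _ in evidence}
        source_words = {word for _, word in evidence}
        if len(dictionaries) == 2:
            ratings[candidate] = "high"    # independent dictionaries agree
        elif len(source_words) >= 2:
            ratings[candidate] = "medium"  # one dictionary, several synonyms
        else:
            ratings[candidate] = "low"     # a single piece of evidence
    return ratings


if __name__ == "__main__":
    dict_a = {"car": ["araba", "otomobil"], "automobile": ["otomobil"]}
    dict_b = {"car": ["araba"], "auto": ["oto"]}
    print(translate_synset(["car", "auto", "automobile"], dict_a, dict_b))
```

This is also why the two dictionaries must be independent: if one were derived from the other, agreement between them would carry no extra evidence.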
I expanded on this in the "Automatic Translation Of WordNet" section of my 2008 paper: http://dcook.org/mlsn/about/papers/nlp2008.MLSN_A_Multilingual_Semantic_Network.pdf
(For your stated goal of a Turkish sentiment dictionary, there are other approaches, not involving a semantic network. E.g. "Sentiment Analysis and Opinion Mining", by Bing Liu, is a good round-up of research. But a semantic network approach will, IMHO, always give better results in the long run, and has so many other uses.)

How to use CFStringTokenizer with Chinese and Japanese?

I'm using the code here to split text into individual words, and it's working great for all languages I have tried, except for Japanese and Chinese.
Is there a way that code can be tweaked to properly tokenize Japanese and Chinese as well? The documentation says those languages are supported, but it does not seem to be breaking words in the proper places. For example, when it tokenizes "新しい" it breaks it into two words "新し" and "い" when it should be one (I don't speak Japanese, so I don't know if that is actually correct, but the sample I have says that those should all be one word). Other times it skips over words.
I did try creating Chinese and Japanese locales, while using kCFStringTokenizerUnitWordBoundary. The results improved, but are still not good enough for what I'm doing (adding hyperlinks to vocabulary words).
I am aware of some other tokenizers that are available, but would rather avoid them if I can just stick with core foundation.
[UPDATE] We ended up using mecab with a specific user dictionary for Japanese for some time, and have now moved over to just doing all of this on the server side. It may not be perfect there, but we have consistent results across all platforms.
If you know that you're parsing a particular language, you should create your CFStringTokenizer with the correct CFLocale (or at the very least, the guess from CFStringTokenizerCopyBestStringLanguage) and use kCFStringTokenizerUnitWordBoundary.
Unfortunately, perfect word segmentation of Chinese and Japanese text remains an open and complex problem, so any segmentation library you use is going to have some failings. For Japanese, CFStringTokenizer uses the MeCab library internally and ICU's Boundary Analysis (only when using kCFStringTokenizerUnitWordBoundary, which is why you're getting a funny break with "新しい" without it).
Also have a look at NSLinguisticTagger.
But by itself it won't give you much more.
Truth be told, these two languages (and some others) are really hard to programmatically tokenize accurately.
You should also see the WWDC videos on LSM (Latent Semantic Mapping). They cover the topic of stemming and lemmas. This is the art and science of more accurately determining how to tokenize meaningfully.
What you want to do is hard. Finding word boundaries alone does not give you enough context to convey accurate meaning. It requires looking at the context and also identifying idioms and phrases that should not be broken by word. (Not to mention grammatical forms)
After that look again at the available libraries, then get a book on Python NLTK to learn what you really need to learn about NLP to understand how much you really want to pursue this.
Larger bodies of text inherently yield better results. There's no accounting for typos and bad grammar. Much of the context needed to drive the logic of the analysis is implicit, not directly written as a word. You get to build rules and train the thing.
Japanese is a particularly tough one and many libraries developed outside of Japan don't come close. You need some knowledge of a language to know if the analysis is working. Even native Japanese people can have a hard time doing the natural analysis without the proper context. There are common scenarios where the language presents two mutually intelligible correct word boundaries.
To give an analogy, it's like doing lots of look ahead and look behind in regular expressions.

Application (Not a Markup Language) for Producing a User Manual [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 6 years ago.
Can anyone recommend a program to create user manuals with? Not a markup language (like LaTeX or DocBook) but more something interactive like Scribus. As I'm not the only one that will update the manual the software should be something that's easy for a novice to pick up but still has some advanced features (like linking in text from external sources/tables, handling masterpages/themes etc.).
Regards,
Oscar
Technical Publishing Software - Views on FrameMaker and Its Alternatives
I've done spec documents with LaTeX and Framemaker, and designed a Framemaker workflow to support a team of 5 analysts producing a spec document for an insurance underwriting system. The document was expected to get to 2,000 pages or so. Many years ago (around 1992-1993) I also worked briefly as a typesetter.
Framemaker is designed for technical documentation and does it very well indeed. It also has features designed to support very large documents with multiple authors - people use this system to do documents with more than 100,000 pages. It is also more accessible than LaTeX to users familiar with word processing software.
Key features of Framemaker:
Documents consisting of multiple files: You can pull together a 'Book' with multiple subsections in different files. The document can also be kept in source control.
Textual MIF format for import/export: The importer is somewhat finicky (I found generating working LaTeX to be easier) but you can generate items such as data dictionaries and import them into the document. The file has textual anchors (see below) so you can create cross-reference links that will be stable across imports. I find this to be a key feature for specs as it allows cross-references to link directly to generated items.
Powerful tagging, indexing and cross-referencing system: Everything is based on tags in Framemaker and it is easy to apply tags quickly. This means that cross-referencing, indexing, conditional text and applying styles en masse is easy and just works. You can generate indexes and TOCs based on tags, so having multiple specialised indexes (such as a list of data field names from screens or a data dictionary) is easy to do. The document I described above had 4 separate indexes.
Stable: Framemaker is designed for professionals so it doesn't second-guess you in the way that Word does. It is also much more stable on large documents. Anyone who's tried to write a document of more than 50-100 pages in Word should have a pretty fair idea of what this implies.
Scriptable: FM has a C API and there are various scripting plugins (FrameScript and FMPython being probably the most widely used) which can be used to automate jobs in FM. Framemaker 10 adds support for a JavaScript-based scripting tool called ExtendScript, presumably ported across from the scripting facility in InDesign.
Single-sourcing: From a single FM document you can produce PDF, Windows Help (CHM), HTML and print documents fairly easily. The cross-references also resolve to hyperlinks.
Global style controls: You can easily set up styles for a document and apply them across the whole document. It also facilitates running headers and footers with a great deal of flexibility in having them track sections, versions, chapters etc.
Alternatives to Framemaker
LaTeX/Lout: You've already indicated that you don't want a markup language, but the TeX and Lout systems are used for large structured documents and do this well.
Ventura Publisher: Probably the only real alternative to Framemaker if you want that sort of user interface without paying bodily parts for the privilege. It has strong support for structured documents and an XML-based document interchange format. It's now owned by Corel, who still appear to be actively promoting it.
There are a couple of other technical publishing tools on the market: Quicksilver (which used to be known as Interleaf) and ArborText. These two are powerful tools - Interleaf used to be the market leader in this field at one point - but quite expensive.
Adobe InDesign: Although Adobe claim you can do large documents with InDesign, the cross-referencing and other large document features tend to be viewed as lacking by Framemaker aficionados. There is, however, a text entry system for it called InCopy that apparently does have this sort of functionality, and quite a large body of third-party plugins, some of which do support tagging and other such facilities. InDesign also has a scripting API and a JavaScript interpreter for executing scripts. I haven't used InDesign, so I can't really comment on how well it works in practice.
DocBook: This is really just a standard format for structured documents but has a large ecosystem of tools surrounding it for writing and rendering documents. If you don't want to use LaTeX you will probably not want to use DocBook for similar reasons. As Vinko Vrsalovic points out (+1), this link goes to a StackOverflow post from someone describing using DocBook in practice. I've never really used DocBook and I've made so many edits to this post that it's now in Wiki mode, so someone familiar with DocBook might want to elaborate on this.
Word processing software: Word has serious shortcomings as a technical publishing tool and is not recommended. OpenOffice has somewhat better structured documentation functionality than Word and may be a better choice if politics or a requirement to use .doc as a document interchange format preclude a better alternative. WordPerfect is also considerably better for documentation-in-the-large than Word and still has a presence in several vertical markets such as legal offices.
Madcap Software's Blaze and Flare: These are new kids on the block and live in roughly the same space as Framemaker. The company was founded by former eHelp (creators of RoboHelp) employees and is actively developing, with multiple releases yearly. Their offerings have greatly expanded in the past two years, to the detriment of the quality of the individual products. It seems the focus has been on turning out new products, and as a consequence there are a lot of "fit and finish" issues in each. The authors have chosen to reinvent the wheel in many ways, resulting in confusing and often broken implementations. Save often; you will encounter unhandled exceptions. Source control integration is flaky. For example, moving or deleting a group of files will result in one source control commit for each file deletion. This is a big PITA when you have source control email notifications. Hello 500 emails.
Flare can import Word and Framemaker files, but the import is far from seamless. Expect to retain all of your content but plan on completely re-styling from scratch.
Flare shares many of Word's tendencies to do too much behind the scenes and assume what the user would choose. The HTML looks like what Word outputs when you export HTML - lots of custom tags and attributes, deeply nested inline styles, etc. The text editor is maddening; for example, its cursor model is different from that of any other software you've ever used.
Framemaker vs. LaTeX
These two are the main systems I have used to produce large, presentable system documents and I've had good results with both.
Ease of Learning: TeX can give you absolute control but actually achieving this on a complex LaTeX document without breaking other items isn't trivial, particularly where a large number of macro packages are involved. Basic LaTeX isn't hard to learn, but making modified versions of .sty files that still work takes a bit of tinkering if you're not a really deep TeX hacker. It can be done but be prepared to spend quite a lot of time fiddling.
Framemaker can give you a good degree of control on the look of the document and isn't that hard to learn. Getting a house style and tweaking the layout (which you probably will have to do) will be easier with Framemaker.
Ease of Text Entry: You can use tools such as LyX to provide a word-processor-like front end to LaTeX, and these work well if you want to write large bodies of text. Framemaker's DTP-like user interface works in a way familiar to people who are used to word processing software. From this perspective there is little practical difference.
Templating Document Structure: Framemaker allows a document structure to be defined in terms of tags or an XML schema (if using Structured Framemaker). LaTeX has a set of canned structural elements that are flexible enough to be useful. Adding additional structural elements (e.g. a data dictionary item) can be done as a macro, but making them auto-number is a bit more challenging and you will need to poke around behind the scenes. Both can do it, but it's considerably more technical to do it in LaTeX in anything but trivial cases.
Also, LaTeX does not have the facility to template the document structure in the way that Structured Framemaker does. However, you can achieve this type of effect with DocBook and then generate to LaTeX if desired.
Ease of Integration: I found making a generator for non-trivially complex MIF files to be quite fiddly. The MIF parser is quite pernickety in FM and doesn't really give good diagnostics. LaTeX produces far better error messages and is quite a bit less fussy.
Technical Publishing Software vs. Layout Software
Page layout software started with Pagemaker; the other main players in this space were its competitor QuarkXPress and, now, InDesign, with which Adobe is essentially trying to deprecate and replace both Pagemaker and Framemaker. Scribus, which you mentioned before, lives in the same space as these products.
If you are producing a manual with less than (say) 50-100 pages, one of these packages would probably do an adequate job. They are really designed for advertising and layout-heavy publication tasks such as magazines, so their support for large-document features of the sort found in Framemaker is fairly limited. The key issue with these products is scalability - they do not work well on large documents.
Just for reference I have actually typeset a 200-page book (someone's autobiography) using Pagemaker. While the fine-grained kerning and leading control helps a bit for copyfitting, it is still a highly manual process to lay out a book sized document. In this case the book was just straight text with no significant cross-referencing or structure other than chapters. Doing a complex technical spec document or manual this size with Pagemaker would have been very fiddly and probably next to impossible to get right without any mistakes.
Technical Publishing vs. Word Processing Software
This is more of a description of key shortcomings of MS-Word for large spec documents. However, it will illustrate some of the main features required for documentation-in-the-large:
Indexing and Cross-Referencing: This is a real chore in Word, and quite unstable. Framemaker's tagging features and LaTeX's labels mean that you can assign a tag or known label (in a predictable format if necessary). The textual format for the tag anchors is exposed in the user interface, and is used for the linkage. In Word, the anchors are much more opaque and not easily controllable in this way. Combined with the clumsy user interface and instability of the product, this makes maintaining the cross-references fiddly, and they are often unstable - you often have to fix them up manually.
Templated Layouts: Style support in Word is quite basic and numbering tends to be somewhat unstable. FrameMaker is all about driving from the tags and applying styles based on the tags. Global style changes just work in Framemaker in a way that they do not in Word.
Large multi-file Documents: I've never been able to make this work well in Word, but it is a key feature in Framemaker and LaTeX. Again, Word's instability means that you tend to spend a lot of time tidying up after it. As the document grows larger, the time spent on this work grows quadratically: the propensity for breakage is proportional to n (the size of the document), and the time to fix each breakage is also proportional to n.
Why is Word so Unstable: Word does a lot behind the scenes to support novice users and intervene in layouts. It is also not really frame-based (text flow conceptually separate from document layout), but the developers try to implement various frame-like behaviours in the UI. When the A.I. second-guesses you on a complex document it often does the wrong thing. Framemaker 'treats the user as an adult' and does none of this so things stay where you put them.
Other word processors such as OpenOffice and WordPerfect do not misbehave in quite the same way as Word, which is one of the reasons that just about any word processor other than Word will do a better job of technical documents.
Pre-Flighting: In documentation-speak, this is the process of checking that your assemblage of files for the document (image files etc.) is correct before committing to print. The professional systems will complain about things that are wrong, giving you a chance to correct them. Word will just put on a happy face and try to fix things behind the scenes. A good example of this is a Word file with linked graphics. If you copy the file and graphics to another directory and update one of the graphics in situ, Word may well still read the file from the old path (I've seen it do this) and not the new one you've just updated. However, this behaviour is not consistent and typifies the rampant abuse of unstable heuristics in that product.
Pre-Press Support: A publishing system extends into the pre-press phase of the workflow. This means it covers preparation for print. Word processing software tends not to have this functionality or to have it in a very limited form.
Without getting too far into this, a key difference is that publishing software tends to treat you like a consenting adult and not get in the way when you want to scale or automate things. One can use word processing software for large scale documentation but it has many design decisions adapted to casual users writing short documents with little regard for quality. These adaptations come at the expense of fitness-for-task on large scale document preparation work. The main issues I find with Word for spec documents are the poor indexing and cross-referencing and general instability issues where I am always having to go back and fix things. However, political considerations in most environments (I'm a contractor) mean one is often stuck with it.
Some general comments on the state of technical documentation software
Framemaker would be the obvious choice if Adobe didn't keep giving off signals that they are trying to deprecate it and move its user base to InDesign. However, FM is widely used in aerospace, software and engineering circles and Adobe's management would face a lynch mob if they actually EOL'd the product without a credible migration path. From what one reads on the web, Adobe's acquisition of FM was driven by John Warnock, but he was ousted and FM became a victim of office politics. The net result is that it's been moved to maintenance mode and is quite stagnant.
Ventura Publisher has also been relegated to a niche market to some extent, but at least Corel do not have two competing product lines in the way that Adobe do. It is probably a passable substitute for FM and may be more politically acceptable to PHB types as it is marketed as a 'business publishing' system.
Quicksilver and Arbortext both seem to be viable products, but are very expensive. I've not used either, so I can't really make any real judgement on their merits.
The markup language systems are free and very powerful in many ways. Lout might be a bit easier to work with as it doesn't have quite the level of legacy baggage that LaTeX does. DocBook is also quite widely used and does have quite a bit of tool support. These technologies put a significant squeeze on the 'geek' end of Framemaker's market share and do so on their merits - they have probably taken quite a chunk out of Adobe's profit margins over the years. I would not dismiss these technologies out of hand, but they will be harder to learn in practice.
You might try evaluating InDesign and a selected set of plugins (concentrate on those for tagging and cross-ref/index management). Finally, some of the word processing software (Wordperfect and OpenOffice) give you a reasonable toolkit for structured documentation and work considerably better for this than MS-Word.
PostScript
Yes, that is a pun. I haven't touched on Pre-Press functionality of any of these products. Printing and Pre-Press are technical fields in their own right and the scope for expensive mistakes means you should probably leave this up to specialists.
Framemaker, InDesign, Ventura, QuickSilver, Arbortext and (presumably) the MadCap products all come with facilities to do pre-press preparation. By and large, word processing software does not.
Doing pre-press with LaTeX tends to involve post-processing the PS output with software like psutils or rendering to PDF and taking the pre-press workflow from there. Generally, most pre-press houses can work from PDF, so a good PDF writing tool like Distiller is the best interface for work prepared from tools that are not designed for prepress work. Note that the quality of the output from Distiller tends to be better than the Ghostscript based ones like PDFCreator.
Note that the RGB colour space of a monitor does not have a direct map to the CMYK colour space used by a printing press. Actually getting colours - especially colour photos - to come out correctly on a press is somewhat fraught if you do not have the right kit. For print production, see a specialist unless you have reason to believe you know what you're doing. For a casual user I would still recommend this 15 years after I was involved in the industry, as mistakes are very expensive to fix once they're committed to print.
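To illustrate why there is no direct map, here is a minimal sketch of the naive textbook RGB-to-CMYK formula. The function name `naive_rgb_to_cmyk` is made up for this example. It knows nothing about your monitor, the inks, the paper or the press, which is exactly why it is not suitable for real print work; a proper workflow converts through colour profiles for the specific devices.

```python
def naive_rgb_to_cmyk(r, g, b):
    """Naive, profile-free conversion of 8-bit RGB to CMYK fractions in [0, 1].

    Illustration only: it ignores monitor calibration, ink behaviour and
    paper stock, so the result will not match what a press actually prints.
    """
    for value in (r, g, b):
        if not 0 <= value <= 255:
            raise ValueError("RGB components must be in the range 0..255")

    # Pure black is handled separately to avoid dividing by zero below.
    if (r, g, b) == (0, 0, 0):
        return (0.0, 0.0, 0.0, 1.0)

    rp, gp, bp = r / 255, g / 255, b / 255

    # Black (K) is whatever the brightest channel leaves over.
    k = 1 - max(rp, gp, bp)

    # The remaining ink amounts are scaled by the non-black part.
    c = (1 - rp - k) / (1 - k)
    m = (1 - gp - k) / (1 - k)
    y = (1 - bp - k) / (1 - k)
    return (c, m, y, k)


if __name__ == "__main__":
    print(naive_rgb_to_cmyk(255, 0, 0))  # pure monitor red
```

Two different monitors will display the same RGB triple differently, and two different presses will print the same CMYK values differently, so a fixed formula like this cannot be right for both ends at once.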
If you really do want to do colour print work in-house, you will probably need to calibrate your monitor. For best results, you should get a high-fidelity monitor like this one from HP. In order to calibrate the monitor you may also need a sensor like one of the ones described in this review if the monitor does not come with one. Most professional graphics cards like these from Nvidia, AMD or Matrox have facilities to support gamma correction; many consumer ones do as well. You will also need to get calibration data for the press you are going to be using to print, although the pre-press house will probably be able to do this.
As stated before, print media is quite technical in its own right, easy to get wrong and expensive to fix once it's gone to print. If you're not 100% certain you've got your calibration right, get a colour proof like a Cromalin. This is done from the actual film separations (and is thus quite expensive), so it gives an accurate rendition of the actual colour of the final printed article. Doing this for a few sample pages will give you accurate feedback about whether your calibration is set up right.
Acknowledgements: Thanks to Aidan Ryan for expanding the section on Madcap products.
I would recommend "Help & Manual" from EC Software. You can create a printed manual, PDF, Windows help file (CHM), and HTML web based help from a single source document.
I've heard good things about FrameMaker. I've not used it myself, but have had it recommended to me for just such an application.
Adobe Framemaker indeed is the classic tool for writing user manuals. I've used it for all kinds of long documents, and it works very well. Too bad that Adobe left it to rot for years, before noticing that users wouldn't switch.
MS Word took until 2003 to get the bullet/numbering bugs out, and I don't know whether they ever got master documents working.
LaTeX still is a reasonable alternative. The format is easy to process, and you could generate it from a wiki.
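As a rough sketch of what "generate it from a wiki" could look like, the following converts a tiny, assumed subset of MediaWiki-style markup (`== Heading ==`, `=== Subheading ===`, `'''bold'''`, `''italic''`) into LaTeX. The function names and the choice of markup are assumptions made for illustration; a real wiki has far more syntax than this handles.

```python
import re

# Characters that have special meaning in LaTeX and must be escaped.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape LaTeX special characters one character at a time."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def wiki_to_latex(wiki_text):
    """Convert a small subset of wiki markup to LaTeX, line by line."""
    out = []
    for line in wiki_text.splitlines():
        # "== Title ==" becomes \section, "=== Title ===" becomes \subsection.
        heading = re.fullmatch(r"(={2,3})\s*(.+?)\s*\1", line.strip())
        if heading:
            command = "section" if len(heading.group(1)) == 2 else "subsection"
            out.append("\\%s{%s}" % (command, escape_latex(heading.group(2))))
            continue

        # Escape first (apostrophes are untouched), then convert emphasis.
        escaped = escape_latex(line)
        escaped = re.sub(r"'''(.+?)'''", r"\\textbf{\1}", escaped)
        escaped = re.sub(r"''(.+?)''", r"\\emph{\1}", escaped)
        out.append(escaped)
    return "\n".join(out)


if __name__ == "__main__":
    print(wiki_to_latex("== Setup ==\nUse '''bold''' & ''italic''."))
```

Because both the input and the output are plain text, a converter like this stays small and its output can be diffed and kept under version control like any other source file.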
If you want collaboration, then a language-based approach (LaTeX would be my preference although XML-based ones are also good -- Docbook being the flagship here) does make sense, especially if you are tracking files with a version control system.
Anything that complicates things, such as software with a binary or proprietary format, will not help you here.
Sorry if it is not the answer you want.
I agree with Ollivier that using DocBook (or LaTeX) is the sanest approach if you want easy conversion, sane formatting and nice version control.
Happily, you can try to have your cake and eat it too with a DocBook editor.
Try the ones on this list and see if any satisfies your needs (I haven't used any).
We are using "Help & Manual" from EC Software and it works quite well. Our authors are spread through the U.S. so we share our content files via a hosted SVN server to manage version control. On each workstation we use Tortoise SVN to stay in sync. The product is extremely easy to use and productive.
A VERY nice explanation on what O'Reilly (actually the ones selling all these books...) uses:
O'Reilly Toolchain
It may seem complicated, but depending on the number of pages you are going to write, it may be worth considering.
Word (or your favorite word processor)
I make all my user manuals (not to be confused with user HELP files) in Word. Then I can determine if they need to be in PDF, RTF, DOC or even converted to HTML. To solve the multi-user updating issue, I store the file in Source Control which handles all those fun things.
See the Fastware Project blog for an in-depth discussion of the tradeoffs of using DocBook etc. Scott Meyers has tried out a lot of possibilities and shares what he's thinking.
Adobe InDesign CS5.5 is much better at cross references and long documents than earlier versions. It is very powerful and relatively easy to learn and use. The feature set is very rich, and the more you learn about it the more you can do with it. It supports very powerful XML features and can import and export XML as needed. It can also map styles to tags and tags to styles, allowing you to create your XML in an automated fashion if you simply use a full set of character and paragraph styles. I have used the program for years and produced multiple projects, from books to one-off advertisements. It is a graphic design tool, but has support for many aspects of book and manual production. I recommend it if you are more concerned with graphics, images or illustrations. InDesign supports a wide range of import and export formats.
InDesign CS5.5 has added and improved support for both interactive content and export for EPUB (electronic book) and Adobe's Digital Publishing Suite (DPS) electronic magazine formats.
Framemaker is an excellent tool for books, manuals and long technical documents. It is a bit harder to learn than InDesign but has a richer set of tools for building variables and running headers and footers, if you have the time and inclination to learn how to use them. It also has a very robust XML feature-set, but I have not used it personally.
Unfortunately, Framemaker suffers from a lack of support for graphic design. The color system is very kludgey and spot (PMS) colors are hard to define. Simple things like adding a stroke color and fill color are rudimentary at best. For example, you still can't select a stroke color that's different from an object's fill color. The program is intended to output to laser and inkjet printers and not really to printing presses.
One feature that is really cool is the ability to apply master pages based on the Paragraph styles appearing on the page. The paragraph/illustration numbering in Framemaker is superior to any other program that I have ever used. But it is also difficult to learn and use.
Both programs support output to PDF and PostScript file formats and can generate hyperlinks and interactive content.