Application (Not a Markup Language) for Producing a User Manual [closed] - pdf

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
Can anyone recommend a program to create user manuals with? Not a markup language (like LaTeX or DocBook) but more something interactive like Scribus. As I'm not the only one that will update the manual the software should be something that's easy for a novice to pick up but still has some advanced features (like linking in text from external sources/tables, handling masterpages/themes etc.).
Regards,
Oscar

Technical Publishing Software - Views on FrameMaker and Its Alternatives
I've done spec documents with LaTeX and Framemaker, and designed a Framemaker workflow to support a team of 5 analysts producing a spec document for an insurance underwriting system. The document was expected to get to 2,000 pages or so. Many years ago (around 1992-1993) I also worked briefly as a typesetter.
Framemaker is designed for technical documentation and does it very well indeed. It also has features designed to support very large documents with multiple authors - people use this system to do documents with more than 100,000 pages. It is also more accessible than LaTeX to users familiar with word processing software.
Key features of Framemaker:
Documents consisting of multiple
files: You can pull together a
'Book' with multiple subsections in
different files. The document can
also be kept in source control.
Textual MIF format for
import/export: The importer is
somewhat finicky (I found generating
working LaTeX to be easier) but you can
generate items such as data
dictionaries and import them into
the document. The file has textual
anchors (see below) so you can
create cross-reference links that
will be stable across imports. I
find this to be a key feature for
specs as it allows cross-references
to link directly to generated items.
Powerful tagging, indexing and cross-referencing System: Everything
is based on tags in Framemaker and
it is easy to apply tags quickly.
This means that cross-referencing,
indexing, conditional text and
applying styles en-masse is easy and
just works. You can generate indexes and TOCs based on tags, so
having multiple specialised indexes
(such as a list of data field names
from screens or a data dictionary)
is easy to do. The document I
described above had 4 separate
indexes.
Stable: Framemaker is designed for
professionals so it doesn't second
guess you in the way that word does.
It is also much more stable on large
documents. Anyone who's tried to
write a document of more than 50-100
pages on Word should have a pretty
fair idea of what this implies.
Scriptable: FM has a C API and there
are various scripting plugins
(FrameScript and FMPython
being probably the most widely used)
which can be used to automate jobs
in FM. Framemaker 10 adds support
for a Javascript based scripting tool
called Extendscript, presumably
ported across from the scripting facility
in InDesign.
Single-sourcing: From a single FM
document you can produce PDF,
Windows Help (CHM), HTML and print
documents fairly easily. The
cross-references also resolve to
hyperlinks.
Global style controls: You can
easily set up styles for a document
and apply it across the whole
document. It also facilitates
running headers and footers with a
great deal of flexibility in having
them track sections, versions,
chapters etc.
Alternatives to Framemaker
LaTeX/Lout: You've already indicated
that you don't want a markup
lanaguage, but the TeX and
Lout systems are used for large
structured documents and do this
well.
Ventura Publisher: Probably the
only real alternative to Framemaker
if you want that sort of user interface
without paying bodily parts for the
privilege.
It has strong support for structured
documents and an XML-based document
interchange format. It's now owned
by Corel, who still appear to be actively promoting it.
There are a couple of other technical publishing tools on the market: Quicksilver (which used to be known as Interleaf) and ArborText. These two are powerful tools - Interleaf used to be the market leader in this field at one point - but quite expensive.
Adobe Indesign: Although Adobe
claim you can do large documents
with InDesign, the cross-referencing
and other large document features
tend to be viewed as lacking by
Framemaker afficionados. There is,
however, a text entry system for it
called InCopy that apparently
does have this sort of
functionality and quite
a large body of Third-party
plugins, some of which do
support tagging and other such facilities.
InDesign also has a scripting API and
a JavaScript interpreter for executing
scripts.
I haven't used Indesign,
so I can't really comment on how
well it works in practice.
DocBook: This is really just
a standard format for structured
documents but has a large ecosystem
of tools surrounding it for writing
and rendering documents. If you
don't want to use LaTeX you will
probably not want to use DocBook for
similar reasons. As Vinko Vrsalovic
points out (+1), This link goes to a StackOverflow
post from someone describing using
DocBook in practice.
I've never really used DocBook and I've
made so many edits to this post that it's now in Wiki mode, so
someone familiar with DocBook might
want to elaborate on this.
Word processing software: Word
has serious shortcomings as a
technical publishing tool and is not
recommended. OpenOffice has
somewhat better structured
documentation functionality than
word and may be a better choice if
politics or requirement to use .doc
as a document interchange format
preclude a better alternative.
Wordperfect is also
considerably better for
documentation-in-the-large than word
and still has a presence in several vertical markets
such as legal offices.
Madcap Software's Blaze and Flare: These
are new kids on the block and live
in roughly the same space as
Framemaker. The company was founded by former
eHelp (creators of RoboHelp) employees and is
actively developing, with multiple releases yearly. Their
offerings have greatly expanded in the past two years,
to the detriment of the quality of the individual products.
It seems focus has been on turning out new products and
by consequence there are a lot of "fit and finish" issues in
each. The authors have chosen to reinvent the wheel in many ways,
resulting in confusing and often broken implementations. Save often,
you will encounter unhandled exceptions. Source control integration
is flaky. For example, moving or deleting a group of files will result in
one source control commit for each file deletion. Big PITA when
you have source control email notifications. Hello 500 emails.
Flare can import Word and Framemaker files, but the import
is far from seamless. Expect to retain all of your content
but plan on completely re-styling from scratch.
Flare shares many of Word's tendancies to do too much behind
the scenes and assume what the user would choose. The HTML looks
like what Word outputs when you export HTML - lots of custom tags
and attributes, deeply nested inline styles, etc. The text
editor is maddening, for example, its cursor model is different
than any other software you've ever used.
Framemaker vs. LaTeX
These two are main systems I have used to produce large, presentable system documents and I've had good results with both.
Ease of Learning: TeX can give you absolute control but actually
achieving this on a complex LaTeX
document without breaking other
items isn't trivial, particularly
where a large number of macro
packages are involved. Basic LaTeX
isn't hard to learn, but making
modified versions of .sty files that
still work takes a bit of tinkering
if you're not a really deep TeX
hacker. It can be done but be
prepared to spend quite a lot of
time fiddling.
Framemaker can give you a good degree of control on the look of the document and isn't that hard to learn. Getting a house style and tweaking the layout (which you probably will have to do) will be easier with Framemaker.
Ease of Text Entry: You can use tools such as Lyx to provide a
wordprocessor-like front end to
LaTeX, and these work well if you
want to write large bodies of text.
Framemaker's DTP-like user interface
works in a way familiar to people
who are used to wordproessing
software. From this perspective
there is little practical
difference.
Templating Document Structure: Framemaker allows a document
structure to be defined in terms of
tags or an XML schema (if using
Structured Framemaker). LaTeX has a
set of canned structural elements
that are flexible enough to be
useful. Adding additional
structural elements (e.g. a data
dictionary item) can be done as a
macro, but making them auto-number
is a bit more challenging and you will
need to poke around behind the
scenes. Both can do it, but it's
considerably more technical to do it
in LaTeX in anything but trivial
cases.
Also, LaTeX does not have
the facility to template the
document structure in the way that
Structured Framemaker does.
However, you can achieve this type
of effect with DocBook and then
generate to LaTeX if desired.
Ease of Integration: I found making a generator for non-trivially
complex MIF files to be quite
fiddly. The MIF parser is quite
pernickety in FM and doesn't really
give good diagnostics. LaTeX
produces far better error messages
and is quite a bit less fussy.
Technical Publishing Software vs. Layout Software
Page layout software started with Pagemaker and the other main players in this space were its competitor Quark Xpess and now InDesign, with which Adobe is essentially trying to deprecate and replace it and Framemaker. Scribus, which you mentioned before, lives in the same space as these products.
If you are producing a manual with less than (say) 50-100 pages, one of the packages would probably do an adequate job. They are really designed for advertising and layout-heavy publication tasks such as magazines, so their support for large-document features of the sort found in Framemaker is fairly limited. The key issue with these products is scalability - they do not work well on large documents.
Just for reference I have actually typeset a 200-page book (someone's autobiography) using Pagemaker. While the fine-grained kerning and leading control helps a bit for copyfitting, it is still a highly manual process to lay out a book sized document. In this case the book was just straight text with no significant cross-referencing or structure other than chapters. Doing a complex technical spec document or manual this size with Pagemaker would have been very fiddly and probably next to impossible to get right without any mistakes.
Technical Publishing vs. Word Processing Software
This is more of a description of key shortcomings of MS-Word for large spec documents. However, it will illustrate some of the main features required for documentation-in-the-large:
Indexing and Cross-Referencing: This is a real chore in Word, and
quite unstable. Framemaker's
tagging features and LaTeX's labels
mean that you can assign a tag or
known label (in a predictable format
if necessary). The textual format
for the tag anchors is exposed in
the user interface, and is used for
the linkage. In Word, the anchors
are much more opaque and not
easily controllable in this way.
Combined with the clumsy user
interface and instability of the
product, this makes maintaining
these fiddly, and often unstable -
you often have to manually fix them
up.
Templated Layouts: Style support in word are quite basic and
numbering tends to be somewhat
unstable. FrameMaker is all about
driving from the tags and applying
styles based on the tags. Global
style changes just work in
Framemaker in a way that they do not
in Word.
Large multi-file Documents: I've never been able to make this work
well in Word, but it is a key
feature in Framemaker and LaTeX.
Again, Word's instability means that
you tend to spend a lot of time
tidying up after it. As the
document grows larger, the
proportion of time spent on this
work grows quadratically -
propensity for breakage proportional
to n (size of document) * time to
fix proportional to size n (time
to fix)
Why is Word so Unstable: Word does a lot behind the scenes to
support novice users and intervene
in layouts. It is also not really
frame-based (text flow conceptually
separate from document layout), but
the developers try to implement
various frame-like behaviours in the UI. When
the A.I. second-guesses you on a
complex document it often does the
wrong thing. Framemaker 'treats the
user as an adult' and does none of
this so things stay where you put
them.
Other word processors such as
Open Office and WordPerfect do not
misbehave in quite the same way as
Word, which is one of the reasons
that just about any word
processor other than Word will do a
better job of technical documents.
Pre-Flighting: In documentation-speak, this is the
process of checking that your
assemblage of files for the document
(image files etc.) is correct before
committing to print. The
professional systems will complain
about things that are wrong, giving
you a chance to correct it. Word
will just put on a happy face and
try to fix things behind the scenes.
A good example of this is a word
file with linked graphics. If you
copy the file and graphics to
another directory and update one of
the graphics in situ, word may well
still read the file from the old
path (I've seen it do this) and not
the new one you've just updated.
However, this behaviour is not consistent and
typifies the rampant abuse of
unstable heuristics in that product.
Pre-Press Support: A publishing system extends into the pre-press
phase of the workflow. This means
it covers preparation for print.
Word processing software tends not
to have this functionality or have
it in a very limited form.
Without getting too far into this, a key difference is that publishing software tends to treat you like a consenting adult and not get in the way when you want to scale or automate things. One can use word processing software for large scale documentation but it has many design decisions adapted to casual users writing short documents with little regard for quality. These adaptations come at the expense of fitness-for-task on large scale document preparation work. The main issues I find with Word for spec documents are the poor indexing and cross-referencing and general instability issues where I am always having to go back and fix things. However, political considerations in most environments (I'm a contractor) mean one is often stuck with it.
Some general comments on the state of technical documentation software
Framemaker would be the obvious choice if Adobe didn't keep giving off signals that they are trying to deprecate it and move its user base to InDesign. However, FM is widely used in aerospace, software and engineering circles and Adobe's management would face a lynch mob if they actually EOL'd the product without a credible migration path. From what one reads on the web, Adobe's acquisition of FM was driven by John Warnock, but he was ousted and FM became a victim of office politics. The net result is that it's been moved to maintenance mode and is quite stagnant.
Ventura Publisher has also been relegated to a niche market to some extent, but at least Corel do not have two competing product lines in the way that Adobe do. It is probably a passable substitute for FM and may be more politically acceptable to PHB types as it is marketed as a 'business publishing' system.
Quicksilver and Arbortext both seem to be viable products, but are very expensive. I've not used either, so I can't really make any real judgement on their merits.
The markup language systems are free and very powerful in many ways. Lout might be a bit easier to work with as it doesn't have quite the level of legacy baggage that LaTeX does. DocBook is also quite widely used and does have quite a bit of tool support. These technologies put a significant squeeze on the 'geek' end of Framemaker's market share and do so on their merits - they have probably taken quite a chunk out of Adobe's profit margins over the years. I would not dismiss these technologies out of hand, but they will be harder to learn in practice.
You might try evaluating InDesign and a selected set of plugins (concentrate on those for tagging and cross-ref/index management). Finally, some of the word processing software (Wordperfect and OpenOffice) give you a reasonable toolkit for structured documentation and work considerably better for this than MS-Word.
PostScript
Yes, that is a pun. I haven't touched on Pre-Press functionality of any of these products. Printing and Pre-Press are technical fields in their own right and the scope for expensive mistakes means you should probably leave this up to specialists.
Framemaker, InDesign, Ventura, QuickSilver, Arbortext and (presumably) the MadCap products all come with facilities to do pre-press preparation. By and large, word processing software does not.
Doing pre-press with LaTeX tends to involve post-processing the PS output with software like psutils or rendering to PDF and taking the pre-press workflow from there. Generally, most pre-press houses can work from PDF, so a good PDF writing tool like Distiller is the best interface for work prepared from tools that are not designed for prepress work. Note that the quality of the output from Distiller tends to be better than the Ghostscript based ones like PDFCreator.
Note that the RGB colour space of a monitor does not have a direct map to a CYMK colour space used by a printing press. Actually getting colours - especially colour photos - to come out correctly on a press is somewhat fraught if you do not have the right kit. For print production, see a specialist unless you have reason to believe you know what you're doing. For a casual user I would still recommend this 15 years after I was involved in the industry, as mistakes are very expensive to fix once they're committed to print.
If you really do want to do colour print work in-house, you will probably need to calibrate your monitor. For best results, you should get a high-fidelity monitor like this one from HP. In order to calibrate the monitor you may also need a sensor like one of the ones described in this review if the monitor does not come with one. Most professional graphics cards like these from Nvidia, AMD or Matrox have facilities to support gamma correction; many consumer ones do as well. You will also need to get calibration data for the press you are going to be using to print, although the pre-press house will probably be able to do this.
As stated before, print media is quite technical in its own right, easy to get wrong and expensive to fix once it's gone to print. If you're not 100% certain you've got your calibration right, get a colour proof like a Chromalin. This is done from the actual film separations (and is thus quite expensive), so it gives an accurate rendition of the actual colour of the final printed article. Doing this for a few sample pages will give you accurate feedback about whether your calibration is set up right.
Acknowledgements: Thanks to Aidan Ryan for expanding the section on Madcap products.

I would recommend "Help & Manual" from EC Software. You can create a printed manual, PDF, Windows help file (CHM), and HTML web based help from a single source document.

I've heard good things about FrameMaker. I've not used it myself, but have had it recommended to me for just such an application.

Adobe Framemaker indeed is the classic tool for writing user manuals. I've used it for all kinds of long documents, and it works very well. Too bad that Adobe left it to rot for years, before noticing that users wouldn't switch.
MSWord took till 2003 to get the bullet/numbering bugs out, and I don't know if they finally got master document working.
LaTeX still is a reasonable alternative. The format is easy to process, and you could generate it from a wiki.

If you want collaboration, then a language-based approach (LaTeX would be my preference although XML-based ones are also good -- Docbook being the flagship here) does make sense, especially if you are tracking files with a version control system.
Anything that does complicate things like any software with a binary or proprietary format will not help you here.
Sorry if it is not the answer you want.

I agree with Ollivier that using DocBook (or LaTEX) is the sanest approach to have easy conversion, sane formatting, nice version control.
Happily, you can try to have your cake and eat it too with a DocBook editor.
Try the ones on this list and see if any satisfies your needs (I haven't used any).

We are using "Help & Manual" from EC Software and it works quite well. Our authors are spread through the U.S. so we share our content files via a hosted SVN server to manage version control. On each workstation we use Tortoise SVN to stay in sync. The product is extremely easy to use and productive.

A VERY nice explanation on what O'Reilly (actually the ones selling all these books...) uses:
O'Reilly Toolchain
It may seem complicated, but depending on the amount of pages you are going to write you maybe should put some consideration into it.

Word (or your favorite word processor)
I make all my user manuals (not to be confused with user HELP files) in Word. Then I can determine if they need to be in PDF, RTF, DOC or even converted to HTML. To solve the multi-user updating issue, I store the file in Source Control which handles all those fun things.

See the Fastware Project blog for an in depth discussion of the tradeoffs of using DocBook etc. Scott Meyer has tried out a lot of possibilities and shares what he's thinking.

Adobe InDesign CS5.5 is much better at cross references and long documents than earlier versions. It is very powerful and relatively easy to learn and use. The feature set is very rich and the more you learn about it the more you can do with it. It supports very powerful XML features and can import and export XML as needed. It can also map Styles to Tags and Tags to styles allowing you to create your XML in an automated fashion if you simply use a full set of character and paragraph styles. I have used the program for years and produced multiple projects from books to one-off advertisements. It is a graphic design tool, but has support for many aspects of book and manual production. I recommend it if you are more concerned with graphics, images or illustrations. InDesign support a wide number of import and export formats.
InDesign CS5.5 has added and improved support for both interactive content and export for EPUB (electronic book) and Adobe's Digital Publishing Suite (DPS) electronic magazine formats.
Framemaker is an excellent tool for books, manuals and long technical documents. It is a bit harder to learn than InDesign but has a richer set of tools for building variables and running headers and footers, if you have the time and inclination to learn how to use them. It also has a very robust XML feature-set, but I have not used it personally.
Unfortunately, Framemaker suffers from lack of support for graphic design. The color system is based very kludgey and spot (PMS) colors are hard to define. Simple things like adding a stroke color and fill color are rudimentary at best. For example, you still can't select a stroke color that's different from an objects fill color. The program is intended to output to laser and inkjet printers and not really to printing presses.
One feature that is really cool is the ability to apply master pages based on the Paragraph styles appearing on the page. The paragraph/illustration numbering in Framemaker is superior to any other program that I have ever used. But it is also difficult to learn and use.
Both programs support output to PDF and PostScript file formats and can generate hyperlinks and interactive content.

Related

Functional Specification Process Management

Developing functional specifications is never a pleasurable experience, but I kind of find a sick pleasure in planning a project well. I think I have some father issues.
Regardless of my own issues, I can find any number of articles on how to create a single functional spec in varying degrees of usefulness. There are templates and examples aplenty, and I've got a good library of my own. However I am finding it difficult to find anyone who discusses a manner in which to produce multiple functional specs with any efficiency.
Does anyone know of a source discussing how to manage the process of quickly generating disparate types of functional specs? Say a company that delivers web apps, perhaps using a rapid development tool like ColdFusion or PhoneGap or something where the experience lies within the use of the tool not the end result. So the functional specs can have a wonderful array of difference in them.
Can anyone point me towards a way of managing this process to ease the burden of building each of these from scratch?
EDIT - I really like OmniGraffle, however I'm not trying to maintain a look and feel or do anything visual (saving past screen shots might be useful if they can be indexed). Code Snippets seems closer to what I wanted. But in actuality I think I am looking for the method to archive/index past blocks of text.
So if I described a purchase order system a year ago and I am building something similar today, I want to find that functional spec from a year ago to have some example text to start from.
In my head this is liek some novel writing software where like code snippets a block of text (either a scene, chapter or blurb or whatever can be written and then moved aroudn int eh body of the whole. yWriter does this. However I need to find a way to index/search through these large chunks of text for relevance. I am hoping to learn more about that kind of system.
Fleshing out the ambiguity
If you are asking about templates that are primarily textual, then your best bet is probably just to have a 'stationary' file that you can open a copy, adding pieces that are copies of the template structure you've saved to the 'stationary', and then save out the draft spec.
If you are referring to diagrams and other visual schematic that follow a 'spec language' that is unique to your development framework, then I would suggest a tool like OmniGraffle, Visio, or LucidCharts, which have active communities that develop 'stencil libraries' (e.g. graffletopia)
I think you more mean #1, in which case you might look to examples like OmniOutliner templates which can contain sophisticated stylization of fonts and format, akin to 'type styles' in Word documents.
Code Snippets are one mechanism for solving this, but you will only get snippet libraries for programming IDEs, which generally will lack text style features. Code Snippet libraries are like text macros: short strips that expand into large blocks of text. You could create your own snippets for the different structures of project spec that related to each kind of framework.
Another solution is to leverage the file interoperability of tools like OmniGraffle and OmniOutliner (or other pairings). WhenOmniGraffle opens an Outliner file, it displays the list structure as a tree of objects/nodes. After adding more nodes, the OmniGraffle file can be re-opened in OmniOutliner and viewed as a list, with all the attached Outliner styles.
This is a nice multi-modal approach, but locks you into a toolset. Probably unavoidable until more people demand tooling to do this kind of thing.

Using flow chart or diagram for routines across programs

I have a busy set of routines to validate or download the current client application. It starts with a Windows desktop shortcut that invokes a .WSF file. This calls on several .VBS files, an .INI for settings, and potentially a .BAT file. Some of these script documents have internal functions. The final phase opens a Microsoft Access database, which entails an AutoExec macro, which kicks off some VBA, including a form which has a load routine of its own in VBA.
None of this detail is specifically important (so please don't add a VBA tag, OR criticize my precious complexity). The point is I have a variety of tools and containers and they may be functionally nested.
I need better techniques for parsing that in a flow chart. Currently I rely on any or all of the following:
a distinct color
a big box that encloses a routine
the classic 'transfer of control' symbol
perhaps an explanatory call-out
Shouldn't I increase my flow charting vocabulary? Tutorials explain the square, the diamond, the circle, and just about nothing more. Surely FC can help me deal with these sorts of things:
The plethora of script types lets me answer different needs, and I want to indicate tool/language.
A sub-routine could result in an abort of the overall task, or an error, and I want to show the handling of that by (or consequences for) higher-level "enclosing" routines.
I want to distinguish "internal" sub-routines from ones in a different script file.
Concurrent script processing could become critical, so I want to note that.
The .INI file lets me provide all routines with persistent values. How is that charted?
A function may have an argument(s) and a return value/reference ... I don't know how to effectively cite even that.
Please provide guidance or point me to a extra-helpful resource. If you recommend an analysis tool set (like UML, which I haven't gotten the hang of yet), please also tell me where I can find a good introduction.
I am not interested in software. Please consider this a white board exercise.
Discussion of the question suggests flowcharts are not useful or accurate.
Accuracy depends on how the flow charts are constructed. If they are constructed manually, they are like any other manually built document and will be out of date almost instantly; that makes hand-constructed flowcharts really useless, which is why people tend to like looking at the code.
[The rest of this response violate's the OPs requirement of "not interested in software (to produce flowcharts)" because I think that's the only way to get them in some kind of useful form.]
If the flowcharts are derived from the code by an an appropriate language-accurate analysis tool, they will be accurate. See examples at http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html These examples are semantically precise although the pages there don't provide the exact semantics, but that's just a documetation detail.
It is hard to find such tools :-} especially if you want flowcharts that span multiple languages, and multiple "execution paradigms" (OP wants his INI files included; they are some kind of implied assignment statements, and I'm pretty sure he'd want to model SQL actions which don't flowchart usefully because they tend to be pure computation over tables).
It is also unclear that such flowcharts are useful. The examples at the page I provided should be semiconvincing; if you take into account all the microscopic details (e.g., the possiblity of an ABORT control flow arc emanating from every subroutine call [because each call may throw an exception]) these diagrams get horrendously big, fast. The fact that the diagrams are space-consuming (boxes, diamonds, lines, lots of whitespace) aggravates this pretty badly. Once they get big, you literally get lost in space following the arcs. Again, a good reason for people to avoid flowcharts for entire systems. (The other reason people like text languages is they can in fact be pretty dense; you can get a lot on a page with a succinct language, and wait'll you see APL :)
They might be of marginal help in individual functions, if the function has complex logic.
I think it unlikely that you are going to get language accurate analyzers that produce flowcharts for all the languages you want, that such anlayzers can compose their flowcharts nicely (you want JavaScript invoking C# running SQL ...?)
What you might hope for is a compromise solution: display the code with various hyper links to the other artifacts referenced. You still need the ability to produce such hyperlinked code (see http://www.semanticdesigns.com/Products/Formatters/JavaBrowser.html for one way this might work), but you also need hyperlinks across the language boundaries.
I know of no tools that presently do that. And I doubt you have the interest or willpower to build such tools on your own.

Extracting information from PDFs of research papers [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it.
At the very least, the title and abstract. The list of authors and their affiliations would be good. Extracting out the references would be amazing.
Ideally this would be an open source solution.
The problem is that not all PDF's encode the text, and many which do fail to preserve the logical order of the text, so just doing pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1 etc.
I know there's a lot of libraries. It's identifying the abstract, title authors etc. on the document that I need to solve. This is never going to be possible every time, but 80% would save a lot of human effort.
I'm only allowed one link per posting so this is it:
pdfinfo Linux manual page
This might get the title and authors. Look at the bottom of the manual page, and there's a link to www.foolabs.com/xpdf where the open source for the program can be found, as well as binaries for various platforms.
To pull out bibliographic references, look at cb2bib:
cb2Bib is a free, open source, and multiplatform application for rapidly extracting unformatted, or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.
You might also want to check the discussion forums at www.zotero.org where this topic has been discussed.
We ran a contest to solve this problem at Dev8D in London, Feb 2010 and we got a nice little GPL tool created as a result. We've not yet integrated it into our systems but it's there in the world.
https://code.google.com/p/pdfssa4met/
Might be a tad simplistic but Googling "bibtex + paper title" ussualy gets you a formated bibtex entry from the ACM,Citeseer, or other such reference tracking sites. Ofcourse this is assuming the paper isn't from a non-computing journal :D
-- EDIT --
I have a feeling you won't find a custom solution for this, you might want to write to citation trackers such as citeseer, ACM and google scholar to get ideas for what they have done. There are tons of others and you might find their implementations are not closed source but not in a published form. There is tons of research material on the subject.
The research team I am part of has looked at such problems and we have come to the conclusion that hand written extraction algorithms or machine learning are the way to do it. Hand written algorithms are probably your best bet.
This is quite a hard problem due to the amount of variation possible. I suggest normalizing the PDF's to text (which you get from any of the dozens of programmatic PDF libraries). You then need to implement custom text scrapping algorithms.
I would start backward from the end of the PDF and look what sort of citation keys exist -- e.g., [1], [author-year], (author-year) and then try to parse the sentence following. You will probably have to write code to normalize the text you get from a library (removing extra whitespace and such). I would only look for citation keys as the first word of a line, and only for 10 pages per document -- the first word must have key delimiters -- e.g., '[' or '('. If no keys can be found in 10 pages then ignore the PDF and flag it for human intervention.
You might want a library that you can further programmatically consult for formatting meta-data within citations --e.g., itallics have a special meaning.
I think you might end up spending quite some time to get a working solution, and then a continual process of tuning and adding to the scrapping algorithms/engine.
In this case i would recommend TET from PDFLIB
If you need to get a quick feel for what it can do, take a look at the TET Cookbook
This is not an open source solution, but it's currently the best option in my opinion. It's not platform-dependant and has a rich set of language bindings and a commercial backing.
I would be happy if someone pointed me to an equivalent or better open source alternative.
To extract text you would use the TET_xxx() functions and to query metadata you can use the pcos_xxx() functions.
You can also use the commanline tool to generate an XML-file containing all the information you need.
tet --tetml word file.pdf
There are examples on how to process TETML with XSLT in the TET Cookbook
What’s included in TETML?
TETML output is encoded in UTF-8 (on zSeries with USS or
MVS: EBCDIC-UTF-8, see www.unicode.org/reports/tr16), and includes the following information:
general document information and metadata
text contents of each page (words or paragraph)
glyph information (font name, size, coordinates)
structure information, e.g. tables
information about placed images on the page
resource information, i.e. fonts, colorspaces, and images
error messages if an exception occurred during PDF processing
CERMINE - Content ExtRactor and MINEr
Described in the paper: TKACZYK, Dominika, et al. CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 2015, 18.4: 317-335.
Mainly written in Java and available as open source at github.
Another Java library to try would be PDFBox. PDFs are really designed to viewed and printed, so you definitely want a library to do some of the heavy lifting for you. Even so, you might have to do a little gluing of text pieces back together to get the data you want extracted. Good Luck!
Just found pdftk... it's amazing, comes in a binary distribution for Win/Lin/Mac as well as source.
In fact, I solved my other problem (look at my profile, I asked then answered another pdf question .. can't link due to 1 link limitation).
It can do pdf metadata extraction, for example, this will return the line containing the title:
pdftk test.pdf dump_data output test.txt | grep -A 1 "InfoKey: Title" | grep "InfoValue"
It can dump title, author, mod-date, and even bookmarks and page numbers (test pdf had bookmarks)... obviously a bit of work will be needed to properly grep the output, but I think this should fit your needs.
If your pdfs don't have metadata (ie, no "Abstract" metadata), you can cat the text using a different tool like pdf2text, and use some grep tricks like above. If your pdfs are not OCR'd, you have a much bigger problem, and ad-hoc querying of the pdf(s) will be painfully slow (best to OCR).
Regardless, I would recommend you build an index of your documents instead of having each query scan the file metadata/text.
Take a look at iText. It is a Java library that will let you read PDFs. You will still face the problem of finding the right data, but the library will provide formatting and layout information that might be usable to infer purpose.
PyPDF might be of help. It provides extensive API for reading and writing the content of a PDF file (un-encrypted), and its written in an easy language Python.
Have a look at this research paper - Accurate Information Extraction from Research Papers using Conditional Random Fields
You might want to use an open-source package like Stanford NER to get started on CRFs.
Or perhaps, you could try importing them (the research papers) to Mendeley. Apparently, it should extract the necessary information for you.
Hope this helps.
Here is what I do using linux and cb2bib.
Open up cb2bib and make sure that clipboard connection is ON, and that your reference database is loaded
Find your paper on google scholar
Click 'import to bibtex' underneath the paper
Select (highlight) everything on the next page (ie., the bibtex code)
It should now appear formatted in cb2bib
Optionally now press network search (the globe icon) to add additional info.
Press save in cb2bib to add the paper to your ref database.
Repeat this for all the papers. I think in the absence of a method that reliably extracts metadata from PDFs, this is the easiest solution I found.
I recommend gscholar in combination with pdftotext.
Although PDF provides meta data, it is seldomly populated with correct content. Often "None" or "Adobe-Photoshop" or other dumb strings are inplace of the title field, for example. That is why none of the above tools might derive correct information from PDFs as the title might be anywhere in the document. Another example: many papers of conference proceedings might also have the title of the conference, or the name of the editors which confuses automatic extraction tools. The results are then dead wrong when you are interested of the real authors of the paper.
So I suggest a semi-automatic approach involving google scholar.
Render the PDF to text, so you might extract: author, and title.
Second copy paste some of this info and query google scholar. To automate this, I employ the cool python script gscholar.py.
So in real life this is what I do:
me#box> pdftotext 10.1.1.90.711.pdf - | head
Computational Geometry 23 (2002) 183–194
www.elsevier.com/locate/comgeo
Voronoi diagrams on the sphere ✩
Hyeon-Suk Na a , Chung-Nim Lee a , Otfried Cheong b,∗
a Department of Mathematics, Pohang University of Science and Technology, South Korea
b Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
Received 28 June 2001; received in revised form 6 September 2001; accepted 12 February 2002
Communicated by J.-R. Sack
me#box> gscholar.py "Voronoi diagrams on the sphere Hyeon-Suk"
#article{na2002voronoi,
title={Voronoi diagrams on the sphere},
author={Na, Hyeon-Suk and Lee, Chung-Nim and Cheong, Otfried},
journal={Computational Geometry},
volume={23},
number={2},
pages={183--194},
year={2002},
publisher={Elsevier}
}
EDIT: Be careful, you might encounter captchas. Another great script is bibfetch.

Company insists on using a binary format for all our documentation [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
I work at a company that, for some reason, insists that all our development documentation should be in MS Word format. Which, being a binary format, means we cannot:
Diff versions of a document against each other (so peer reviewing them is a pain - because of the domain we work in, peer reviews for all changes are essential)
Grep a folder-full of documents for keywords
What do you use to write documentation in and why?
Please also give me ammo to change this situation with...
I recently started using DocBook XML to author my documentation.
On the upside, it's a pure text format. You can break a large document into multiple files, and use nodes to bring them all together into a single book. Table of contents and index are automatically generated. Intra-document links (within arbitrary text, pointing to chapters or sections) are very easy. And with a push of a button, I can create a single-html-file version, a chunked-html version (one file per chapter), and a PDF version.
After some tweaking and customization, I'm very happy with the output. The documents look great!!
DocBook is used extensively by real publishers (most notably, O'Reilly), and it's been around for more than fifteen years, so it's reached a certain level of maturity.
On the other hand, all of the processing is done with XSLT, using an ad-hoc collection of tools. (My own docbook pipeline includes Python, Java, Xerces, Xalan, Apache FOP, and PDF-SAM. Plus the official XSLT stylesheet distribution, and my own XSLT customizations.)
DocBook is not a turnkey solution. You won't be able to get going quickly, without reading the manual. And if you don't know anything about XSLT, you'll have to learn.
On the other hand, there are only a dozen or two XML tags that you really need to know to write the documents. (The real expertise comes into play during doc generation from the XML sources.) If one person on your team was willing to be responsible for writing the doc build script, then everyone else on the team could just learn the DTD and do a decent job contributing.
Anyhow... DocBook definitely has some faults. It's not the easiest system for tech authorship. But it's the best open source tool I know of.
The "Subversion Book" is written in DocBook. Here's a page with links to the different book versions (single-html, chunked-html, and PDF):
http://svnbook.red-bean.com/
And here's a link to the DocBook XML sources for the first chapter, so that you can get an idea for how it works:
http://sourceforge.net/p/svnbook/source/HEAD/tree/branches/1.7/en/book/ch01-fundamental-concepts.xml
For ammo, there's the trusty old Pragmatic Programmer, chapter 14: The Power of Plain Text.
As Pragmatic Programmers, our base
material isn't wood or iron, it's
knowledge. We gather requirements as
knowledge, and then express that
knowledge in our designs,
implementations, tests, and documents.
And we believe the best format for
storing knowledge persistently is
plain text. With plain text, we give
ourselves the ability to manipulate
knowledge, both manually and
programmatically, using virtually
every tool at our disposal.
We use a wiki (specifically the one provided by Trac) for the two reasons you mentioned. Plus, if we really need to we can get the text version of the markup and manipulate it in a text-only environment, too (e.g. as part of svn comments during commit).
A format that can be easily reduced to text-only (non-binary) is definitely a must. Having the ability to upconvert it to a pretty format like a PDF is, for us, not terribly important.
Word has change tracking for documents (although it only works up until you accept the changes) and you can also grep them (the text isn't encrypted). So I'm not sure either of your arguments will hold up under scrutiny. I'd love to give you the ammo to change this but I've become jaded and cynical with age.
We use MS Word for our docs (which is a huge improvement over the earlier choice (Lotus WordPro - ugh!).
We use a wiki - specifically Confluence by Atlassian.
It's a commercial product, and it's great. One of the reasons we picked it over free/open wiki engines is that it has a full-blown WYSIWYG editor and various other features that make it more easily accessible to users who are familiar with Word.
We've also come up with a neat trick where we store images, designs, wireframes, etc. in Subversion, and then embed links in the wiki documents to those resources URLs via the Apache/SVN web interface module; notes on how we do this are here if you're interested.
Like Dylan's organisation, we also use the excellent Confluence wiki. I wrote an article about why this is better approach called Wiki is my word-processor, which should give you some reasons to change the situation.
Benefits of using a wiki for internal documentation include the following.
Word-processor users get sucked into changing the layout and typography, however good your templates are, which wastes time and reduces consistency.
A wiki provides full-text search, which you are unlikely to have for your body of the MS Word documents written by everyone.
A wiki provides a document version history; I have never heard of a team successfully keeping all revisions in Word documents and always being able to compare old versions, or using a version control system (with the possible exception of SharePoint but that's whole different failure scenario).
A wiki makes hyperlinks between documents easy; it is too hard to reliably link between documents in a collection of Word documents, so new documents end up duplicating older content into new monolithic documents which means they take more time to read and write.
Separate wiki pages can be edited by different people at the same time, and Confluence can merge changes when multiple people edit the same page at the same time; collaboration is harder with a Word document that only one person can edit at a time.
A wiki like Confluence automatically generates navigation pages based on wiki structure and tags; you need a librarian and lots of discipline to make it possible to browse a large collection of Word documents.
A wiki page usually loads and displays more quickly than a Word document.
A wiki page has more automatic meta-data; you need templates and discipline to make sure that Word documents always have Title, Author and Version set in the document properties and visible in the document on-screen and in print.
If you want more ammunition than this, then there is lots of wiki-promotion on The Atlassian Blog.
You could ask for documentation to be in OOXML (.docx, in the case of Word) format. Not as ideal as using ODT, in my opinion, however, it's still just a zip file with a bunch of XML files inside. :-)
A textual format facilitates merging your documentation with generated items such as JavaDoc, API references or data dictionaries. It also scales much better than word, which is hard to use for large documents. Finally, a format that allows includes allows multiple authors to work on a document concurrently.
LaTeX and FrameMaker (the two systems I have used for this) both have vastly superior indexing and cross-referencing capabilities and have either a native textual format or a textual version of their native format that can be included (MIF in the case of Framemaker). They are also both much more stable than word.
I've built tools that read data dictionaries and generate documentation that can be included into a larger document with stable indexing and two-way cross-referencing. The functional specification for This product was done with LaTeX in this way and got me another gig with the company. I have also developed a similar process with FrameMaker.
Is the entire development team against this requirement, or is it a small group? If it's the entire team, just ignore the mandate and use a text-based format -- wouldn't be the first time employees ignored a silly rule. Works especially well if you've not made a big fuss about it in the past. If you have, management might look especially hard at your docs.
MS Word supports document changes tracking and peer review.
The new MS Office format is fully XML based (to see this, rename a MS Word .docx file to a .zip, then unpack it to see).
Maybe Office 2007 may fit both your company requirements and your concerns ?
You can at least compare Word documents, see the "Track changes" command in the "Extra" menu, or use software like DeltaView. Found via google search first link at lifehacker.com. Searching in word documents should be possible with Google Desktop Search or other similar programs that index all files they are able to read.
Do they insist that you write it in Word or only that it's available in Word format? You could write in a text format and convert it to Word automatically.
Don't you store documentation files in some kind of Version Control System, ideally together with the source code? I would recommend to do this (makes it easy to get the documentation for old software releases).
And if you do store the docs in VCS, you will notice that plain text or XML-bases files are much better for this, because you can get diffs; also, changes between text files are usually stored more efficiently than changes between binary files.
Not to defend MS products here, but MS word can diff documents.
If you use Beyond Compare as the diff tool for your source-control system (As we do, with Perforce), it will show you differences between revisions of your Word docs. Admittedly, it only shows the textual differences - formatting changes are not shown - but this is usually enough for you to see what changed.
This is just another reason to invest in Beyond Compare, as it is one of the most polished pieces of software I've ever used - and it's the best $30 dollars (Less if you buy several) I've spent on software
There are many tools for word document comparison. I currently use a python script that puts a command-line on the built-in compare and merge functionality of word.
http://nicolas.lehuen.com/index.php/post/2005/06/30/60-comparing-microsoft-word-documents-stored-in-a-subversion-repository
It should be easy to automate word to extract all text from a word document into a text file. So you could write a script creating text files from word docs, and grep, compare, version control, Review these text files.
Of course this is not an ideal solution, since you loose your pretty formatting, but it should work.
I think there are programmes that convert Word docs to plain text. Use one of them to convert the word doc to plain text and then use diff, grep etc
Also have a look into recommended toolchain(s) for DocBook.

Practices for programming in a scientific environment? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
Background
Last year, I did an internship in a physics research group at a university. In this group, we mostly used LabVIEW to write programs for controlling our setups, doing data acquisition and analyzing our data. For the first two purposes, that works quite OK, but for data analysis, it's a real pain. On top of that, everyone was mostly self-taught, so code that was written was generally quite a mess (no wonder that every PhD quickly decided to rewrite everything from scratch). Version control was unknown, and impossible to set up because of strict software and network regulations from the IT department.
Now, things actually worked out surprisingly OK, but how do people in the natural sciences do their software development?
Questions
Some concrete questions:
What languages/environments have you used for developing scientific software, especially data analysis? What libraries? (for example, what do you use for plotting?)
Was there any training for people without any significant background in programming?
Did you have anything like version control, and bug tracking?
How would you go about trying to create a decent environment for programming, without getting too much in the way of the individual scientists (especially physicists are stubborn people!)
Summary of answers thus far
The answers (or my interpretation of them) thus far: (2008-10-11)
Languages/packages that seem to be the most widely used:
LabVIEW
Python
with SciPy, NumPy, PyLab, etc. (See also Brandon's reply for downloads and links)
C/C++
MATLAB
Version control is used by nearly all respondents; bug tracking and other processes are much less common.
The Software Carpentry course is a good way to teach programming and development techniques to scientists.
How to improve things?
Don't force people to follow strict protocols.
Set up an environment yourself, and show the benefits to others. Help them to start working with version control, bug tracking, etc. themselves.
Reviewing other people's code can help, but be aware that not everyone may appreciate that.
What languages/environments have you used for developing scientific software, esp. data analysis? What libraries? (E.g., what do you use for plotting?)
I used to work for Enthought, the primary corporate sponsor of SciPy. We collaborated with scientists from the companies that contracted Enthought for custom software development. Python/SciPy seemed to be a comfortable environment for scientists. It's much less intimidating to get started with than say C++ or Java if you're a scientist without a software background.
The Enthought Python Distribution comes with all the scientific computing libraries including analysis, plotting, 3D visualation, etc.
Was there any training for people without any significant background in programming?
Enthought does offer SciPy training and the SciPy community is pretty good about answering questions on the mailing lists.
Did you have anything like version control, bug tracking?
Yes, and yes (Subversion and Trac). Since we were working collaboratively with the scientists (and typically remotely from them), version control and bug tracking were essential. It took some coaching to get some scientists to internalize the benefits of version control.
How would you go about trying to create a decent environment for programming, without getting too much in the way of the individual scientists (esp. physicists are stubborn people!)
Make sure they are familiarized with the tool chain. It takes an investment up front, but it will make them feel less inclined to reject it in favor of something more familiar (Excel). When the tools fail them (and they will), make sure they have a place to go for help — mailing lists, user groups, other scientists and software developers in the organization. The more help there is to get them back to doing physics the better.
The course Software Carpentry is aimed specifically at people doing scientific computing and aims to teach the basics and lessons of software engineering, and how best to apply them to projects.
It covers topics like version control, debugging, testing, scripting and various other issues.
I've listened to about 8 or 9 of the lectures and think it is to be highly recommended.
Edit: The MP3s of the lectures are available as well.
Nuclear/particle physics here.
Major programing work used to be done mostly in Fortran using CERNLIB (PAW, MINUIT, ...) and GEANT3, recently it has mostly been done in C++ with ROOT and Geant4. There are a number of other libraries and tools in specialized use, and LabVIEW sees some use here and there.
Data acquisition in my end of this business has often meant fairly low level work. Often in C, sometimes even in assembly, but this is dying out as the hardware gets more capable. On the other hand, many of the boards are now built with FPGAs which need gate twiddling...
One-offs, graphical interfaces, etc. use almost anything (Tcl/Tk used to be big, and I've been seeing more Perl/Tk and Python/Tk lately) including a number of packages that exist mostly inside the particle physics community.
Many people writing code have little or no formal training, and process is transmitted very unevenly by oral tradition, but most of the software group leaders take process seriously and read as much as necessary to make up their deficiencies in this area.
Version control for the main tools is ubiquitous. But many individual programmers neglect it for their smaller tasks. Formal bug tracking tools are less common, as are nightly builds, unit testing, and regression tests.
To improve things:
Get on the good side of the local software leaders
Implement the process you want to use in your own area, and encourage those you let in to use it too.
Wait. Physicists are empirical people. If it helps, they will (eventually!) notice.
One more suggestion for improving things.
Put a little time in to helping anyone you work directly with. Review their code. Tell them about algorithmic complexity/code generation/DRY or whatever basic thing they never learned because some professor threw a Fortran book at them once and said "make it work". Indoctrinate them on process issues. They are smart people, and they will learn if you give them a chance.
This might be slightly tangential, but hopefully relevant.
I used to work for National Instruments, R&D, where I wrote software for NI RF & Communication toolkits. We used LabVIEW quite a bit, and here are the practices we followed:
Source control. NI uses Perforce. We did the regular thing - dev/trunk branches, continuous integration, the works.
We wrote automated test suites.
We had a few people who came in with a background in signal processing and communication. We used to have regular code reviews, and best practices documents to make sure their code was up to the mark.
Despite the code reviews, there were a few occasions when "software guys", like me had to rewrite some of this code for efficiency.
I know exactly what you mean about stubborn people! We had folks who used to think that pointing out a potential performance improvement in their code was a direct personal insult! It goes without saying that that this calls for good management. I thought the best way to deal with these folks is to go slowly, not press to hard for changes and if necessary be prepared to do the dirty work. [Example: write a test suite for their code].
I'm not exactly a 'natural' scientist (I study transportation) but am an academic who writes a lot of my own software for data analysis. I try to write as much as I can in Python, but sometimes I'm forced to use other languages when I'm working on extending or customizing an existing software tool. There is very little programming training in my field. Most folks are either self-taught, or learned their programming skills from classes taken previously or outside the discipline.
I'm a big fan of version control. I used Vault running on my home server for all the code for my dissertation. Right now I'm trying to get the department to set up a Subversion server, but my guess is I will be the only one who uses it, at least at first. I've played around a bit with FogBugs, but unlike version control, I don't think that's nearly as useful for a one-man team.
As for encouraging others to use version control and the like, that's really the problem I'm facing now. I'm planning on forcing my grad students to use it on research projects they're doing for me, and encouraging them to use it for their own research. If I teach a class involving programming, I'll probably force the students to use version control there too (grading them on what's in the repository). As far as my colleagues and their grad students go, all I can really do is make a server available and rely on gentle persuasion and setting a good example. Frankly, at this point I think it's more important to get them doing regular backups than get them on source control (some folks are carrying around the only copy of their research data on USB flash drives).
1.) Scripting languages are popular these days for most things due to better hardware. Perl/Python/Lisp are prevalent for lightweight applications (automation, light computation); I see a lot of Perl at my work (computational EM) since we like Unix/Linux. For performance stuff, C/C++/Fortran are typically used. For parallel computing, well, we usually manually parallelize runs in EM as opposed to having a program implicitly do it (ie split up the jobs by look angle when computing radar cross sections).
2.) We just kind of throw people into the mix here. A lot of the code we have is very messy, but scientists are typically a scatterbrained bunch that don't mind that sort of thing. Not ideal, but we have things to deliver and we're severely understaffed. We're slowly getting better.
3.) We use SVN; however, we do not have bug tracking software. About as good as it gets for us is a txt file that tells you where bugs specific bugs are.
4.) My suggestion for implementing best practices for scientists: do it slowly. As scientists, we typically don't ship products. No one in science makes a name for himself by having clean, maintainable code. They get recognition from the results of that code, typically. They need to see justification for spending time on learning software practices. Slowly introduce new concepts and try to get them to follow; they're scientists, so after their own empirical evidence confirms the usefulness of things like version control, they will begin to use it all the time!
I'd highly recommend reading "What Every Computer Scientist Should Know About Floating-Point Arithmetic". A lot of problems I encounter on a regular basis come from issues with floating point programming.
I am a physicist working in the field of condensed matter physics, building classical and quantum models.
Languages:
C++ -- very versatile: can be used for anything, good speed, but it can be a bit inconvenient when it comes to MPI
Octave -- good for some supplementary calculations, very convenient and productive
Libraries:
Armadillo/Blitz++ -- fast array/matrix/cube abstractions for C++
Eigen/Armadillo -- linear algebra
GSL -- to use with C
LAPACK/BLAS/ATLAS -- extremely big and fast, but less convenient (and written in FORTRAN)
Graphics:
GNUPlot -- it has very clean and neat output, but not that productive sometimes
Origin -- very convenient for plotting
Development tools:
Vim + plugins -- it works great for me
GDB -- a great debugging tool when working with C/C++
Code::Blocks -- I used it for some time and found it quite comfortable, but Vim is still better in my opinion.
I work as a physicist in a UK university.
Perhaps I should emphasise that different areas of research have different emphasis on programming. Particle physicists (like dmckee) do computational modelling almost exclusively and may collaborate on large software projects, whereas people in fields like my own (condensed matter) write code relatively infrequently. I suspect most scientists fall into the latter camp. I would say coding skills are usually seen as useful in physics, but not essential, much like physics/maths skills are seen as useful for programmers but not essential. With this in mind...
What languages/environments have you used for developing scientific software, esp. data analysis? What libraries? (E.g., what do you use for plotting?)
Commonly data analysis and plotting is done using generic data analysis packages such as IGOR Pro, ORIGIN, Kaleidegraph which can be thought of as 'Excel plus'. These packages typically have a scripting language that can be used to automate. More specialist analysis may have a dedicated utility for the job that generally will have been written a long time ago, no-one has the source for and is pretty buggy. Some more techie types might use the languages that have been mentioned (Python, R, MatLab with Gnuplot for plotting).
Control software is commonly done in LabVIEW, although we actually use Delphi which is somewhat unusual.
Was there any training for people without any significant background in programming?
I've been to seminars on grid computing, 3D visualisation, learning Boost etc. given by both universities I've been at. As an undergraduate we were taught VBA for Excel and MatLab but C/MatLab/LabVIEW is more common.
Did you have anything like version control, bug tracking?
No, although people do have personal development setups. Our code base is in a shared folder on a 'server' which is kept current with a synching tool.
How would you go about trying to create a decent environment for programming, without getting too much in the way of the individual scientists (esp. physicists are stubborn people!)
One step at a time! I am trying to replace the shared folder with something a bit more solid, perhaps finding a SVN client which mimics the current synching tools behaviour would help.
I'd say though on the whole, for most natural science projects, time is generally better spent doing research!
Ex-academic physicist and now industrial physicist UK here:
What languages/environments have you used for developing scientific software, esp. data analysis? What libraries? (E.g., what do you use for plotting?)
I mainly use MATLAB these days (easy to access visualisation functions and maths). I used to use Fortran a lot and IDL. I have used C (but I'm more a reader than a writer of C), Excel macros (ugly and confusing). I'm currently needing to be able to read Java and C++ (but I can't really program in them) and I've hacked Python as well. For my own entertainment I'm now doing some programming in C# (mainly to get portability / low cost / pretty interfaces). I can write Fortran with pretty much any language I'm presented with ;-)
Was there any training for people without any significant background in programming?
Most (all?) undergraduate physics course will have a small programming course usually on C, Fortran or MATLAB but it's the real basics. I'd really like to have had some training in software engineering at some point (revision control / testing / designing medium scale systems)
Did you have anything like version control, bug tracking?
I started using Subversion / TortoiseSVN relatively recently. Groups I've worked with in the past have used revision control. I don't know any academic group which uses formal bug tracking software. I still don't use any sort of systematic testing.
How would you go about trying to create a decent environment for programming, without getting too much in the way of the individual scientists (esp. physicists are stubborn people!)
I would try to introduce some software engineering ideas at undergraduate level and then reinforce them by practice at graduate level, also provide pointers to resources like the Software Carpentry course mentioned above.
I'd expect that a significant fraction of academic physicists will be writing software (not necessarily all though) and they are in dire need of at least an introduction to ideas in software engineering.
What languages/environments have you used for developing scientific software, esp. data analysis? What libraries? (E.g., what do you use for plotting?)
Python, NumPy and pylab (plotting).
Was there any training for people without any significant background in programming?
No, but I was working in a multimedia research lab, so almost everybody had a computer science background.
Did you have anything like version control, bug tracking?
Yes, Subversion for version control, Trac for bug tracing and wiki. You can get free bug tracker/version control hosting from http://www.assembla.com/ if their TOS fits your project.
How would you go about trying to create a decent environment for programming, without getting too much in the way of the individual scientists (esp. physicists are stubborn people!).
Make sure the infrastructure is set up and well maintained and try to sell the benefits of source control.
I'm a statistician at a university in the UK. Generally people here use R for data analysis, it's fairly easy to learn if you know C/Perl. Its real power is in the way you can import and modify data interactively. It's very easy to take a number of say CSV (or Excel) files and merge them, create new columns based on others and then throw that into a GLM, GAM or some other model. Plotting is trivial too and doesn't require knowledge of a whole new language (like PGPLOT or GNUPLOT.) Of course, you also have the advantage of having a bunch of built-in features (from simple things like mean, standard deviation etc all the way to neural networks, splines and GL plotting.)
Having said this, there are a couple of issues. With very large datasets R can become very slow (I've only really seen this with >50,000x30 datasets) and since it's interpreted you don't get the advantage of Fortran/C in this respect. But, you can (very easily) get R to call C and Fortran shared libraries (either from something like netlib or ones you've written yourself.) So, a usual workflow would be to:
Work out what to do.
Prototype the code in R.
Run some preliminary analyses.
Re-write the slow code into C or Fortran and call that from R.
Which works very well for me.
I'm one of the only people in my department (of >100 people) using version control (in my case using git with githuib.com.) This is rather worrying, but they just don't seem to be keen on trying it out and are content with passing zip files around (yuck.)
My suggestion would be to continue using LabView for the acquisition (and perhaps trying to get your co-workers to agree on a toolset for acquisition and making is available for all) and then move to exporting the data into a CSV (or similar) and doing the analysis in R. There's really very little point in re-inventing the wheel in this respect.
What languages/environments have you used for developing scientific software, esp. data analysis? What libraries? (E.g., what do you use for plotting?)
My undergraduate physics department taught LabVIEW classes and used it extensively in its research projects.
The other alternative is MATLAB, in which I have no experience. There are camps for either product; each has its own advantages/disadvantages. Depending on what kind of problems you need to solve, one package may be more preferable than the other.
Regarding data analysis, you can use whatever kind of number cruncher you want. Ideally, you can do the hard calculations in language X and format the output to plot nicely in Excel, Mathcad, Mathematica, or whatever the flavor du jour plotting system is. Don't expect standardization here.
Did you have anything like version control, bug tracking?
Looking back, we didn't, and it would have been easier for us all if we did. Nothing like breaking everything and struggling for hours to fix it!
Definitely use source control for any common code. Encourage individuals to write their code in a manner that could be made more generic. This is really just coding best practices. Really, you should have them teaching (or taking) a computer science class so they can get the basics.
How would you go about trying to create a decent environment for programming, without getting too much in the way of the individual scientists (esp. physicists are stubborn people!)
There is a clear split between data aquisition (DAQ) and data analysis. Meaning, it's possible to standardize on the DAQ and then allow the scientists to play with the data in the program of their choice.
Another good option is Scilab. It has graphic modules à la LabVIEW, it has its own programming language and you can also embed Fortran and C code, for example. It's being used in public and private sectors, including big industrial companies. And it's free.
About versioning, some prefer Mercurial, as it gives more liberties managing and defining the repositories. I have no experience with it, however.
For plotting I use Matplotlib. I will soon have to make animations, and I've seen good results using MEncoder. Here is an example including an audio track.
Finally, I suggest going modular, this is, trying to keep main pieces of code in different files, so code revision, understanding, maintenance and improvement will be easier. I have written, for example, a Python module for file integrity testing, another for image processing sequences, etc.
You should also consider developing with the use a debugger that allows you to check variable contents at settable breakpoints in the code, instead using print lines.
I have used Eclipse for Python and Fortran developing (although I got a false bug compiling a Fortran short program with it, but it may have been a bad configuration) and I'm starting to use the Eric IDE for Python. It allows you to debug, manage versioning with SVN, it has an embedded console, it can do refactoring with Bicycle Repair Man (it can use another one, too), you have Unittest, etc. A lighter alternative for Python is IDLE, included with Python since version 2.3.
As a few hints, I also suggest:
Not using single-character variables. When you want to search appearances, you will get results everywhere. Some argue that a decent IDE makes this easier, but then you will depend on having permanent access to the IDE. Even using ii, jj and kk can be enough, although this choice will depend on your language. (Double vowels would be less useful if code comments are made in Estonian, for instance).
Commenting the code from the very beginning.
For critical applications sometimes it's better to rely on older language/compiler versions (major releases), more stable and better debugged.
Of course you can have more optimized code in later versions, fixed bugs, etc, but I'm talking about using Fortran 95 instead of 2003, Python 2.5.4 instead of 3.0, or so. (Specially when a new version breaks backwards compatibility.) Lots of improvements usually introduce lots of bugs. Still, this will depend on specific application cases!
Note that this is a personal choice, many people could argue against this.
Use redundant and automated backup! (With versioning control).
Definitely, use Subversion to keep current, work-in-progress, and stable snapshot copies of source code. This includes C++, Java etc. for homegrown software tools, and quickie scripts for one-off processing.
With the strong leaning in science and applied engineering toward "lone cowboy" development methodology, the usual practice of organizing the repository into trunk, tag and whatever else it was - don't bother! Scientists and their lab technicians like to twirl knobs, wiggle electrodes and chase vacuum leaks. It's enough of a job to get everyone to agree to, say Python/NumPy or follow some naming convention; forget trying to make them follow arcane software developer practices and conventions.
For source code management, centralized systems such as Subversion are superior for scientific use due to the clear single point of truth (SPOT). Logging of changes and ability to recall versions of any file, without having chase down where to find something, has huge record-keeping advantages. Tools like Git and Monotone: oh my gosh the chaos I can imagine that would follow! Having clear-cut records of just what version of hack-job scripts were used while toying with the new sensor when that Higgs boson went by or that supernova blew up, will lead to happiness.
What languages/environments have you
used for developing scientific
software, esp. data analysis? What
libraries? (E.g., what do you use for
plotting?)
Languages I have used for numerics and sicentific-related stuff:
C (slow development, too much debugging, almost impossible to write reusable code)
C++ (and I learned to hate it -- development isn't as slow as C, but can be a pain. Templates and classes were cool initially, but after a while I realized that I was fighting them all the time and finding workarounds for language design problems
Common Lisp, which was OK, but not widely used fo Sci computing. Not easy to integrate with C (if compared to other languages), but works
Scheme. This one became my personal choice.
My editor is Emacs, although I do use vim for quick stuff like editing configuration files.
For plotting, I usually generate a text file and feed it into gnuplot.
For data analysis, I usually generate a text file and use GNU R.
I see lots of people here using FORTRAN (mostly 77, but some 90), lots of Java and some Python. I don't like those, so I don't use them.
Was there any training for people
without any significant background in
programming?
I think this doesn't apply to me, since I graduated in CS -- but where I work there is no formal training, but people (Engineers, Physicists, Mathematicians) do help each other.
Did you have anything like version
control, bug tracking?
Version control is absolutely important! I keep my code and data in three different machines, in two different sides of the world -- in Git repositories. I sync them all the time (so I have version control and backups!) I don't do bug control, although I may start doing that.
But my colleagues don't BTS or VCS at all.
How would you go about trying to
create a decent environment for
programming, without getting too much
in the way of the individual
scientists (esp. physicists are
stubborn people!)
First, I'd give them as much freedom as possible. (In the University where I work I could chooe between having someone install Ubuntu or Windows, or install my own OS -- I chose to install my own. I don't have support from them and I'm responsible for anything that happens with my machins, including security issues, but I do whatever I want with the machine).
Second, I'd see what they are used to, and make it work (need FORTRAN? We'll set it up. Need C++? No problem. Mathematica? OK, we'll buy a license). Then see how many of them would like to learn "additional tools" to help them be more productive (don't say "different" tools. Say "additional", so it won't seem like anyone will "lose" or "let go" or whatever). Start with editors, see if there are groups who would like to use VCS to sync their work (hey, you can stay home and send your code through SVN or GIT -- wouldn't that be great?) and so on.
Don't impose -- show examples of how cool these tools are. Make data analysis using R, and show them how easy it was. Show nice graphics, and explain how you've created them (but start with simple examples, so you can quickly explain them).
I would suggest F# as a potential candidate for performing science-related manipulations given its strong semantic ties to mathematical constructs.
Also, its support for units-of-measure, as written about here makes a lot of sense for ensuring proper translation between mathematical model and implementation source code.
First of all, I would definitely go with a scripting language to avoid having to explain a lot of extra things (for example manual memory management is - mostly - ok if you are writing low-level, performance sensitive stuff, but for somebody who just wants to use a computer as an upgraded scientific calculator it's definitely overkill). Also, look around if there is something specific for your domain (as is R for statistics). This has the advantage of already working with the concepts the users are familiar with and having specialized code for specific situations (for example calculating standard deviations, applying statistical tests, etc in the case of R).
If you wish to use a more generic scripting language, I would go with Python. Two things it has going for it are:
The interactive shell where you can experiment
Its clear (although sometimes lengthy) syntax
As an added advantage, it has libraries for most of the things you would want to do with it.
I'm no expert in this area, but I've always understood that this is what MATLAB was created for. There is a way to integrate MATLAB with SVN for source control as well.