Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
With Sphinx-doc, you can create a bunch of reStructuredText files, with an index.rst file that includes a table of contents directive which auto-generates a table of contents from the other included files, and a conf.py that acts as a compilation config. You can then compile the lot into a single python-doc-style site, complete with index, navigation tools, and a search function.
Is there any comparable tool for markdown (preferably pandoc-style markdown)?
Some static site generators that work with Markdown:
Jekyll is very popular and also the engine behind GitHub pages.
Python variants: Hyde or Pelican
nanoc (used, e.g., in the GitHub API documentation)
Middlemanapp: maybe the best one?
I think none of them use pandoc (maybe because it's written in Haskell), but they all use an enhanced Markdown syntax or can be configured to use pandoc.
Other interesting ways to generate a site from markdown:
Markdown wikis that are file-based: e.g. Gollum, the wiki engine that is also used by GitHub
Telegram: commercial; written by David Pollak, the inventor of the Lift Scala framework
Engines that use Pandoc:
Gitit: Pandoc Wiki
Hakyll: Haskell library to generate static sites
Pandoc plugin for Ikiwiki
Yst static site generator
Gouda - generates a site from a directory of markdown files
Rippledoc - generates a navigable site from nested directories of markdown files
The definitive listing of Static Site Generators
A good overview of static site generators: http://staticsitegenerators.net/
Pandoc, the GNU make and sed commands, sprinkled with some CSS are all you need to fully automate the creation of a static website starting from Markdown.
Pandoc offers three command line options which can provide navigation between pages as well as the creation of a table of contents (TOC) based on the headings inside the page. With CSS you can make the navigation and TOC look and behave the way you want.
-B FILE, --include-before-body=FILE
Include contents of FILE, verbatim, at the beginning of the document body (e.g. after the <body> tag in HTML, or the \begin{document} command in LaTeX). This can be used to include navigation bars or banners in HTML documents. This option can be used repeatedly to include multiple files. They will be included in the order specified. Implies --standalone.
--toc, --table-of-contents
Include an automatically generated table of contents.
--toc-depth=NUMBER
Specify the number of section levels to include in the table of contents.
The default is 3 (which means that level 1, 2, and 3 headers will be listed in the contents).
As a matter of fact, my personal website is built this way. Check out its makefile for more details. It is all free-libre open-source licensed under the GNU GPL version 3.
If you're OK not using Pandoc, mkdocs would seem to fit your needs.
If you definitely want to use Pandoc-flavoured Markdown, you could check out pdsite. I wrote it as a way to get mkdocs-style site generation with more input formats (like emacs org-mode) and without the Python dependencies - you pass in a folder of Markdown files and get out an HTML site (including automatically-generated multi-level navigation links). It's similar to Serge Stroobandt's approach in that it's just a shell script (it does require you to have tree installed however). Unfortunately it doesn't have a search function yet, although that wouldn't be too hard to add...
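To illustrate the idea of multi-level navigation generated from a folder of Markdown files, here is a minimal Python sketch. It is not how pdsite or mkdocs work internally; the function name and the assumption that every .md page is rendered to an .html file with the same base name are made up for this example.

```python
from pathlib import PurePosixPath


def build_nav(markdown_paths):
    """Return Markdown list lines forming a nested navigation menu.

    markdown_paths holds paths relative to the site root, e.g. "guide/install.md".
    """
    lines = []
    seen_dirs = set()
    paths = sorted((PurePosixPath(p) for p in markdown_paths), key=lambda p: p.parts)
    for path in paths:
        # Emit one entry per directory level that has not been listed yet.
        for depth in range(len(path.parts) - 1):
            directory = path.parts[: depth + 1]
            if directory not in seen_dirs:
                seen_dirs.add(directory)
                lines.append("  " * depth + f"- {directory[-1]}")
        depth = len(path.parts) - 1
        target = path.with_suffix(".html").as_posix()
        lines.append("  " * depth + f"- [{path.stem}]({target})")
    return lines
```

The returned lines form a Markdown list, so they could be written to a file and converted by Pandoc like any other page fragment.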
Closed 6 years ago.
Surely, I am the 100th user who is asking this but after I have searched through similar topics here and on other websites I still cannot find what I need.
I would like to have a simple command-line tool for GNU/Linux that converts .doc(x) files to .pdf, BUT the output should look the same as the original.
LibreOffice doesn't seem like a good choice for this because it does not convert well in some cases. I have found a website freepdfconvert.com which does the job very well, but I cannot upload any sensitive files since it is a big risk. I don't say they would do anything bad with them but it is how it is.
If I can't find any good tool maybe I will have to write one myself.
Unfortunately, there are no guaranteed 1-to-1 converters from Word (doc/docx) to PDF on Linux. This is because Word, a Microsoft product, uses a proprietary format that changes slightly with every release. Because it was not traditionally a publicly documented format and Microsoft does not port Word/Office to Linux (nor ever will), you must rely on reverse-engineered third-party tools for the older format (doc) and on third-party developers' interpretation of the Office Open XML format (docx).
We found the best open source solution is LibreOffice (which was forked from OpenOffice.org, which itself was called Star Office before it was open sourced). It is much more actively developed than AbiWord, as another answer suggested.
The usage from the command line is simple and well documented with plenty of examples:
soffice --headless --convert-to pdf filename.doc
On newer versions you can also use libreoffice instead of soffice.
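If the conversion needs to be scripted, a thin Python wrapper around the same command might look like the sketch below. The function names are invented for this example, and it assumes soffice is on the PATH; --outdir selects the directory the PDF is written to.

```python
import subprocess
from pathlib import Path


def soffice_pdf_command(document, outdir="."):
    """Build the headless LibreOffice command that converts one file to PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(document),
    ]


def convert_to_pdf(document, outdir="."):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(soffice_pdf_command(document, outdir), check=True)
    return Path(outdir) / (Path(document).stem + ".pdf")
```

Calling convert_to_pdf("filename.doc", "out") would be expected to produce out/filename.pdf; the layout fidelity caveats described above still apply.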
There is also Pandoc.
Pandoc, mainly known for its Markdown-capable processing goodness (for outputting HTML, LaTeX, PDF, EPUB and what-not) in recent months has gained a rather well-working capability to process DOCX input files.
(NOTE: Pandoc only works for DOCX, not for DOC files.)
For its PDF output to work, it requires a working LaTeX installation (with any or all of pdflatex, lualatex and xelatex included). In this case the following simple command should work:
pandoc -o output.pdf -f docx input.docx
Note however, that the output layout and font styles now will not look at all similar to what it would look if you exported the DOCX from Word to PDF. It will be using the styles of a default LaTeX document.
You can influence the output style of the LaTeX-generated PDF by using a custom template file like this...
pandoc \
-o output.pdf \
-f docx \
--template=my-latex-template.tmplt \
input.docx
...but this is a feature more for Pandoc/LaTeX experts to use than for beginners.
Closed 5 years ago.
Does anyone know of a good tool to generate Google Protobuf documentation using the .proto source files?
[Update: Aug 2017. Adapted to the full Go rewrite of protoc-gen-doc, currently 1.0.0-rc]
protoc-gen-doc, created by @estan (see also his earlier answer), provides a good and easy way to generate your documentation in HTML, JSON, Markdown, PDF and other formats.
There are a number of additional things that I should mention:
estan is no longer the maintainer of protoc-gen-doc, but pseudomuto is
In contrast to what I've read on various pages it is possible to use rich inline formatting (bold/italic, links, code snippets, etc.) within comments
protoc-gen-doc has been completely rewritten in Go and now uses Docker for generation (instead of apt-get)
The repository is now here: https://github.com/pseudomuto/protoc-gen-doc
To demonstrate the second point I have created an example repository to auto-generate the Dat Project Hypercore Protocol documentation in a nice format.
You can view various html and markdown output generation options at (or look here for a markdown example on SO):
https://github.com/aschrijver/protoc-gen-doc-example
The TravisCI script that does all the automation is this simple .travis.yml file:
sudo: required

services:
  - docker

language: bash

before_script:
  # Create directory structure, copy files
  - mkdir build && mkdir build/html
  - cp docgen/stylesheet.css build/html

script:
  # Create all flavours of output formats to test (see README)
  - docker run --rm -v $(pwd)/build:/out -v $(pwd)/schemas/html:/protos:ro pseudomuto/protoc-gen-doc
  - docker run --rm -v $(pwd)/build/html:/out -v $(pwd)/schemas/html:/protos:ro -v $(pwd)/docgen:/templates:ro pseudomuto/protoc-gen-doc --doc_opt=/templates/custom-html.tmpl,inline-html-comments.html protos/HypercoreSpecV1_html.proto
  - docker run --rm -v $(pwd)/build:/out -v $(pwd)/schemas/md:/protos:ro pseudomuto/protoc-gen-doc --doc_opt=markdown,hypercore-protocol.md
  - docker run --rm -v $(pwd)/build:/out -v $(pwd)/schemas/md:/protos:ro -v $(pwd)/docgen:/templates:ro pseudomuto/protoc-gen-doc --doc_opt=/templates/custom-markdown.tmpl,hypercore-protocol_custom-template.md protos/HypercoreSpecV1_md.proto

deploy:
  provider: pages
  skip_cleanup: true          # Do not forget, or the whole gh-pages branch is cleaned
  name: datproject            # Name of the committer in gh-pages branch
  local_dir: build            # Take files from the 'build' output directory
  github_token: $GITHUB_TOKEN # Set in travis-ci.org dashboard (see README)
  on:
    all_branches: true        # Could be set to 'branch: master' in production
(PS: The hypercore protocol is one of the core specifications of the Dat Project ecosystem of modules for creating decentralized peer-to-peer applications. I used their .proto file to demonstrate concepts)
An open source protobuf plugin that generates DocBook and PDF from the proto files.
http://code.google.com/p/protoc-gen-docbook/
Disclaimer: I am the author of the plugin.
In Protobuf 2.5, the "//" comments you put into your .proto files actually make it into the generated Java source code as Javadoc comments. More specifically, with "//" comments like these:
//
// Message level comments
message myMsg {
    // Field level comments
    required string name=1;
}
the protoc compiler copies them into your generated Java source files. For some reason protoc encloses the Javadoc comments in <pre> tags. But all in all it is a nice new feature in v2.5.
In addition to askldjd's answer, I'd like to point out my own tool at https://github.com/estan/protoc-gen-doc . It is also a protocol buffer compiler plugin, but can generate HTML, Markdown or DocBook out of the box. It can also be customized using Mustache templates to generate any text-based format you like.
Documentation comments are written using /** ... */ or /// ....
The thread is old, but the question still seems relevant. I have had very good results with doxygen + proto2cpp. proto2cpp works as an input filter for doxygen.
Doxygen supports so-called input filters, which allow you to transform code into something Doxygen understands. Writing such a filter for transforming the Protobuf IDL into C++ code (for example) would allow you to use the full power of Doxygen in .proto files. See item 12 of the Doxygen FAQ.
I did something similar for CMake, the input filter just transforms CMake macros and functions to C function declarations. You can find it here.
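To make the input-filter idea concrete, here is a deliberately small Python sketch. It is not proto2cpp and handles only the simplest proto2 constructs; the function name and regular expressions are assumptions made for this example. Doxygen runs the filter program with the input file name as its argument and reads the transformed source from standard output.

```python
import re
import sys

# Matches lines such as "message Foo {".
MESSAGE_RE = re.compile(r"^(\s*)message\s+(\w+)\s*\{")
# Matches lines such as "required string name = 1;".
FIELD_RE = re.compile(
    r"^(\s*)(?:required|optional|repeated)\s+([\w.]+)\s+(\w+)\s*=\s*\d+\s*;"
)


def proto_to_cpp(text):
    """Rewrite simple proto2 declarations as C++-like struct declarations."""
    out = []
    for line in text.splitlines():
        line = MESSAGE_RE.sub(r"\1struct \2 {", line)
        line = FIELD_RE.sub(r"\1\2 \3;", line)
        # A struct definition in C++ ends with "};" rather than "}".
        if line.strip() == "}":
            line = line.replace("}", "};")
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(proto_to_cpp(handle.read()))
```

Saved as, say, proto_filter.py (a hypothetical name), it could be wired in through the INPUT_FILTER or FILTER_PATTERNS setting of the Doxyfile. Enums, nested messages, services and proto3 syntax would need additional rules.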
Since the .proto file is mostly just declaration, I usually find that the source file with inline comments is straightforward and effective documentation.
https://code.google.com/apis/protocolbuffers/docs/techniques.html
Self-describing Messages
Protocol Buffers do not contain descriptions of their own types. Thus,
given only a raw message without the corresponding .proto file
defining its type, it is difficult to extract any useful data.
However, note that the contents of a .proto file can itself be
represented using protocol buffers. The file
src/google/protobuf/descriptor.proto in the source code package
defines the message types involved. protoc can output a
FileDescriptorSet – which represents a set of .proto files – using the
--descriptor_set_out option. With this, you could define a
self-describing protocol message like so:
message SelfDescribingMessage {
  // Set of .proto files which define the type.
  required FileDescriptorSet proto_files = 1;

  // Name of the message type. Must be defined by one of the files in
  // proto_files.
  required string type_name = 2;

  // The message data.
  required bytes message_data = 3;
}
By using classes like DynamicMessage (available in C++ and Java), you
can then write tools which can manipulate SelfDescribingMessages.
All that said, the reason that this functionality is not included in
the Protocol Buffer library is because we have never had a use for it
inside Google.
Closed 5 years ago.
I've tried using LaTeX and DocBook for documenting programming tools, to get PDF output. What I've found is that these tools are excellent in some ways - easily versioned, and generating very usable PDF manuals. But there is a serious flaw. Code-snippets cannot simply be cut-and-pasted out of the PDF.
With DocBook, the problem is the loss of whitespace - mostly for indentation, but any repeated spaces seem to get stripped out. So, once you paste the snippet into a text editor, you'll need to clean up the indentation and vertical alignment. Not too much hassle for two or three lines, but it quickly gets annoying.
With LaTeX - well, it's a mess. The following was copied out of a PDF generated using the LaTeX in MiKTeX 2.8.
node myclas s
f f i e l d f i e l d 0 1 : i n t ;
f i e l d f i e l d 0 2 : ” char ” ;
g;
The intended example is...
node myclass
{
field field01 : int;
field field02 : "char*";
};
Other than the fact LaTeX plays with the quotes, the intended form is what you see in Adobe Reader - but not much like what you get from a cut-and-paste. Don't ask me what's going on with the spaces, or why the braces turned into letters, or what happened to the asterisk - I don't know!
Mostly, I've noticed these things playing with ways of keeping my own personal notes, and just went back to other ways. Some notes are in HTML or plain text, so I can version them. Others are in an old Journal program I've used for years. But I've written a tool that I may want to release soon - and I'll want to include a usable PDF manual, which will need to include examples.
So - is there a way of creating PDF documentation where the code snippets can be easily cut-and-pasted? Preferably a way that allows me to keep "sources" in versioned text files.
EDIT
Any solution must be portable. I will need to use it on Linux and on Windows XP.
EDIT
It looks like this may be impossible.
I've tried printing from Notepad++ to the Adobe Acrobat Pro 7 printer driver. The resulting document looked fine, but cutting and pasting gave the same missing whitespace problems as occur with DocBook.
I tried using the touchup text tool in Acrobat Pro to add leading spaces. These are preserved when you save and reload - but when you select text normally in Acrobat, they aren't included. You can only cut-and-paste including those spaces using the touchup text tool, so far as I can tell, which is obviously not included in Reader.
In other words, this looks like a fundamental limitation - not of the PDF format itself so much as the tools that work with it. There appears to be a general assumption at work here that whitespace is insignificant - which for my purposes obviously isn't true.
EDIT
One solution may be a "text field". I can add these fairly easily using Acrobat Pro, can set a fixed width font, enter multiple lines of text and make the field read only. In Acrobat Pro 7, the text in the field then isn't selectable - but in Reader 9 it is selectable and everything is preserved when you cut and paste.
The question is - can text fields be generated directly using some kind of markup language that is usable to create complete manuals?
I'd suggest enscript. I use it for producing archives and documentation.
You can also merge multiple source files that have been converted to PostScript with enscript into a single PDF.
If your code is kept in external files, one way would be to attach the original file(s) as PDF attachments. This could be done with DocBook, LaTeX, DITA, and a few others.
For example, if you are using this method to include code in DocBook, you can add some code to your XSL customization layer that adds the external file as an attachment to the PDF. As far as I know, this is portable (although I haven't personally tried to open PDF files with attachments in Evince, Okular, Xpdf, etc. to see what happens).
Even if you are processing the DocBook files using FOP, you should still be able to write something into your customization layer to attach files. See the section on PDF attachments. You could even output a link to the attachment below the code block in the PDF if you want to make it more discoverable to people.
A similar solution should be possible using LaTeX with the attachfile package.
Closed 7 years ago.
Is there a command-line Unix tool that will format/indent/prettify source code in different languages? I'm especially interested in Java, JavaScript, PHP, and XML, but ideally it would handle others.
(I'm not looking for something to generate syntax-highlighting markup; I already know of a few tools that do that.)
Artistic Style.
http://astyle.sourceforge.net/
If you have set your auto-formatting options as project-specific settings in Eclipse, you can do something like:
/opt/local/eclipse/eclipse -nosplash \
    -application org.eclipse.jdt.core.JavaCodeFormatter \
    -verbose \
    -config .settings/org.eclipse.jdt.core.prefs \
    src/ tests/ doc/examples/
This means that you install and configure Eclipse purely for its auto-formatting features, regardless of what editor you normally use. :)
Source: http://blogs.operationaldynamics.com/andrew/software/java-gnome/eclipse-code-format-from-command-line
Additional Notes
On Mac OS X:
/Applications/eclipse/java-oxygen/Eclipse.app/Contents/MacOS/eclipse -nosplash -application org.eclipse.jdt.core.JavaCodeFormatter -verbose -config ~/my-eclipse-workspace/.metadata/.plugins/org.eclipse.core.runtime/.settings/org.eclipse.jdt.core.prefs MyClass.java
I've always found Vim's code formatter a great option. It is aware of many languages and can be reasonably customized.
You can pipe the relevant commands into vim like this:
vim MyClass.java <<< gg=G:wq
Explanation:
gg=G formats the file
:wq saves the file and returns to the command prompt
Check out indent and enscript.
Vim generally has automatic syntax highlighting and is available on most Unix-based systems. For formatting and indentation in Vim I run :set autoindent and :set tabstop=4 when I start it. autoindent keeps the current indentation when you start a new line, and tabstop sets how many columns a tab character occupies (the width of each indentation step is set separately, with shiftwidth). To have these options applied whenever you start Vim, put them in a ~/.vimrc file.
For XML and HTML I have used htb.
If you are an Eclipse user then JTidy is another option.
For Java there is Jalopy.
So, I bring to your attention Style Revisor, a source code formatter with a GUI and a command-line interface. It will support different languages, including JavaScript and PHP. If you're interested in command-line usage, you can define your own formatting style as an add-on. Of course, you can also use many predefined styles. Example:
./Style\ Revisor --lang=PHP --style=GNU --path=~/to-your-project-root-dir
Currently, Style Revisor supports two languages: C and Objective-C. Welcome: http://style-revisor.com/
Sincerely.
Closed 2 years ago.
For a small project I have to parse PDF files and extract a specific part of them (a simple string of characters). I'd like to use Python to do this, and I've found several libraries that are capable of doing what I want in some ways.
But now, after some research, I'm wondering what the real structure of a PDF file is. Does anyone know if there is a spec or some explanation anywhere online? I've found a link on Adobe's site, but it seems to be dead :(
Here is a link to Adobe's reference material
http://www.adobe.com/devnet/pdf/pdf_reference.html
You should know though that PDF is only about presentation, not structure. Parsing will not come easy.
I found the GNU Introduction to PDF to be helpful in understanding the structure. It includes an easily readable example PDF file that they describe in complete detail.
Other helpful links:
PDF Succinctly book is longer and has helpful pictures.
Introduction to the Insides of PDF is a presentation that isn't as in-depth but gives a quick overview and has lots of pictures.
When I first started working with PDF, I found the PDF reference very hard to navigate.
It might help you to know that the overview of the file structure is found in Syntax, and what Adobe call the document structure is the object structure and not the file structure. That is also found in Syntax. The description of operators is hidden away in Appendix A - very useful for understanding what is happening in content streams. If you ever have the pain of working with colour spaces you will find that hidden in Graphics! Hopefully these pointers will help you find things more quickly than I did.
If you are using Windows, PDFTron CosEdit allows you to browse the object structure to understand it. There is a free demo available that allows you to examine the file but not save it.
Here's the raw reference of PDF 1.7, and here's an article describing the structure of a PDF file. If you use Vim, the pdftk plugin is a good way to explore the document in an ever-so-slightly less raw form, and the pdftk utility itself (and its GPL source) is a great way to tease documents apart.
I'm trying to do pretty much the same thing. The PDF reference is a very difficult document to read. This tutorial is a better start I think.
This may help shed a little light:
(from page 11 of PDF32000.book)
PDF syntax is best understood by considering it as four parts, as shown in Figure 1:
• Objects. A PDF document is a data structure composed from a small set of basic types of data objects.
Sub-clause 7.2, "Lexical Conventions," describes the character set used to write objects and other
syntactic elements. Sub-clause 7.3, "Objects," describes the syntax and essential properties of the objects.
Sub-clause 7.3.8, "Stream Objects," provides complete details of the most complex data type, the stream
object.
• File structure. The PDF file structure determines how objects are stored in a PDF file, how they are
accessed, and how they are updated. This structure is independent of the semantics of the objects. Sub-
clause 7.5, "File Structure," describes the file structure. Sub-clause 7.6, "Encryption," describes a file-level
mechanism for protecting a document’s contents from unauthorized access.
• Document structure. The PDF document structure specifies how the basic object types are used to
represent components of a PDF document: pages, fonts, annotations, and so forth. Sub-clause 7.7,
"Document Structure," describes the overall document structure; later clauses address the detailed
semantics of the components.
• Content streams. A PDF content stream contains a sequence of instructions describing the appearance of
a page or other graphical entity. These instructions, while also represented as objects, are conceptually
distinct from the objects that represent the document structure and are described separately. Sub-clause
7.8, "Content Streams and Resources," discusses PDF content streams and their associated resources.
Looks like navigating a PDF file will require a little more than a passing effort.
If you want to parse PDF using Python, please have a look at PDFMiner. In my experience it is the best library for parsing PDF files to date.
Didier Stevens has a tool to parse PDF files:
http://didierstevens.com/files/software/pdf-parser_V0_4_3.zip
or here:
http://blog.didierstevens.com/programs/pdf-tools/ which catalogs several related PDF analysis tools.
Another tool is here:
http://mshahzadlatif.wordpress.com/2011/09/28/view-pdf-structure-using-adobe-acrobat-or-a-free-tool-called-pdfxplorer/
Extracting text from PDF is a hard problem because PDF has such a layout-oriented structure. You can see the docs and source code of my barely-successful attempt on CPAN (my implementation is in Perl). The PDF data structure is very cool and well designed, but it's easier to write than read.
One way to get some clues is to create a PDF file consisting of a blank page. I have CutePDF Writer on my computer, and made a blank Wordpad document of one page. Printed to a .pdf file, and then opened the .pdf file using Notepad.
Next, use a copy of this file and eliminate lines or blocks of text that might be of interest, then reload in Acrobat Reader. You'd be surprised at how little information is needed to make a working one-page PDF document.
I'm trying to make up a spreadsheet to create a PDF form from code.
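To show how little is needed, the sketch below assembles a blank one-page PDF by hand in Python, following the file structure described in the reference (header, objects, cross-reference table, trailer). The function name, page size and PDF version string are choices made for this example; some viewers are stricter than others about what they accept.

```python
def minimal_pdf():
    """Return the bytes of a blank one-page PDF."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 heads the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the result to disk with open("blank.pdf", "wb").write(minimal_pdf()) gives a file that can be opened in a text editor next to the reference manual.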
You need the PDF Reference manual to start reading about the details and structure of PDF files. I suggest starting with version 1.7.
On Windows I used a free tool, PDF Analyzer, to see the internal structure of PDF files.
This will help in your understanding when reading the reference manual.
(I'm affiliated with PDF Analyzer, no intention to promote)
To extract text from a PDF, try this on a Linux, BSD, etc. machine, or use Cygwin if on Windows:
pdftotext -layout some_pdf_file.pdf
A plain text file named some_pdf_file.txt is created. The simpler the PDF file layout, the more straightforward the .txt file output will be.
Octal escape codes (a backslash followed by three digits) are frequently present in the .txt file output and will look strange in text editors. These codes usually represent curly single and double quotes, bullet points, hyphens, etc. in the PDF.
To see the context where the octal codes appear, run this grep command, and keep the original PDF handy to see what character each code represents in the PDF:
grep -a --color=always "\\\\[0-9][0-9][0-9]" some_pdf_file.txt
This will provide a unique list of the different octal codes in the document:
grep -ao "\\\\[0-9][0-9][0-9]" some_pdf_file.txt|sort|uniq
To convert these octal codes to their ASCII equivalents, a combination of grep, sed, and bc can be used. I'll post the procedure to do that soon.
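Until that procedure is posted, here is one possible way to do the conversion in Python rather than with grep, sed, and bc. It is a sketch under a loud assumption: that each \ddd code is a single byte in the Windows-1252 encoding (where, for example, \222 is a curly apostrophe). The real encoding depends on the PDF, so the encoding parameter may need changing.

```python
import re

# A backslash followed by three octal digits that fit in one byte (000-377).
OCTAL_ESCAPE = re.compile(r"\\([0-3][0-7]{2})")


def decode_octal_escapes(text, encoding="cp1252"):
    """Replace \\ddd octal escapes with the characters they stand for.

    The cp1252 default is an assumption, not something the PDF guarantees.
    """
    def replace(match):
        byte = bytes([int(match.group(1), 8)])
        return byte.decode(encoding, errors="replace")

    return OCTAL_ESCAPE.sub(replace, text)
```

Bytes that have no mapping in the chosen encoding come out as the Unicode replacement character instead of raising an error.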