Missing ".TH" section when converting .rst file to groff format in pandoc (reStructuredText, manpage) - documentation

What should I put in a source reStructuredText file to populate the "Title Heading" (.TH) line in the destination file when using pandoc to convert it to groff-format?
I have a Python project whose documentation is built with Sphinx. As such, most of the project's documentation is already written in reStructuredText (.rst files). I need to write a manpage, so I'd like to write it in reST format too.
Unfortunately, when I use pandoc to convert the source .rst file to man (groff format), the file doesn't render properly with man since it's missing the Title Heading.
For example, consider the following source file, source.rst:
==========
my-program
==========
----------------------
my-program description
----------------------
:manual section: 1
:manual group: John Doe
Synopsis
========
**my-program**
Description
===========
**my-program** is magical. It does what you need!
I use pandoc to convert it to groff format as follows:
user#disp117:~$ pandoc source.rst -t man > my-program.1
user#disp117:~$
user#disp117:~$ cat my-program.1
.SH my-program
.SS my-program description
.TP
manual section
1
.TP
manual group
John Doe
.SS Synopsis
.PP
\f[B]my-program\f[R]
.SS Description
.PP
\f[B]my-program\f[R] is magical.
It does what you need!
user#disp117:~$
Now, if I try to render that groff file, then it doesn't format properly.
user#disp117:~$ groffer --text my-program.1
manual section 1 manual group John Doe my‐program my‐program is
magical. It does what you need!
...
However, if I manually add a .TH line to the file, then it works as expected.
user#disp117:~$ echo -e ".TH my_program(1)\n$(cat my-program.1)" > my-program.1
user#disp117:~$
user#disp117:~$ groffer --text my-program.1
my_program(1)() my_program(1)()
my-program
my-program description
manual section
1
manual group
John Doe
Synopsis
[B]my-program
[R]
Description
[B]my-program
[R] is magical. It does what you need!
my_program(1)()
user#disp117:~$
What do I need to add to source.rst such that pandoc will produce a file in groff-format that includes the .TH line?

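Pandoc generates "snippets" by default; those snippets are intended to be integrated into a complete document. The .TH line comes from pandoc's man template, which is only applied when a complete document is requested, so a snippet never contains it. Make pandoc generate a complete document with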
pandoc --standalone ...
or
pandoc -s ...
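To check from a script whether a generated page is a complete document or only a snippet, one option is to test whether the first roff request is a .TH macro. The following is a minimal Python sketch; the helper name has_title_heading and the two sample strings are made up for illustration and are not part of pandoc.

```python
def has_title_heading(man_source: str) -> bool:
    """Return True if the first non-comment, non-blank line is a .TH macro."""
    for line in man_source.splitlines():
        stripped = line.strip()
        # Skip blank lines and roff comments such as: .\" Automatically generated ...
        if not stripped or stripped.startswith('.\\"'):
            continue
        return stripped == ".TH" or stripped.startswith(".TH ")
    return False


# Sample inputs modelled on the pandoc output shown in the question (assumed, abbreviated).
fragment = ".SH my-program\n.SS Synopsis\n"
standalone = (
    '.\\" Automatically generated by Pandoc\n'
    '.\\"\n'
    '.TH "my-program" "" "" "" ""\n'
    ".SH Synopsis\n"
)

print(has_title_heading(fragment))    # snippet output: no .TH line
print(has_title_heading(standalone))  # standalone output: .TH comes first
```

This only inspects the text; it does not run pandoc or validate the rest of the page.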

Related

How to set all arguments of the Title line in manpage written in reStructuredText, converted to groff with pandoc

How can I get pandoc to properly set all of the arguments in the "Title line" (.TH) when converting from a .rst file to a man file?
According to the documentation man man-pages, the "Title line" takes positional arguments:
Title line
The first command in a man page should be a TH command:
.TH title section date source manual
The arguments of the command are as follows:
title The title of the man page, written in all caps (e.g., MAN-PAGES).
section
The section number in which the man page should be placed (e.g., 7).
date The date of the last nontrivial change that was made to the man page. (Within the man-pages project, the necessary up‐
dates to these timestamps are handled automatically by scripts, so there is no need to manually update them as part of a
patch.) Dates should be written in the form YYYY-MM-DD.
source The source of the command, function, or system call.
For those few man-pages pages in Sections 1 and 8, probably you just want to write GNU.
For system calls, just write Linux. (An earlier practice was to write the version number of the kernel from which the
manual page was being written/checked. However, this was never done consistently, and so was probably worse than in‐
cluding no version number. Henceforth, avoid including a version number.)
For library calls that are part of glibc or one of the other common GNU libraries, just use GNU C Library, GNU, or an
empty string.
For Section 4 pages, use Linux.
In cases of doubt, just write Linux, or GNU.
manual The title of the manual (e.g., for Section 2 and 3 pages in the man-pages package, use Linux Programmer's Manual).
I haven't found any documentation on how pandoc magically translates .rst files into groff files, but I've found that I can get it to spit-out a .TH line with a reStructuredText heading in the document like so:
user#buskill:~/tmp/groff$ cat source.rst
==========
my-program
==========
Synopsis
========
**my-program**
Description
===========
**my-program** is magical. It does what you need!
user#buskill:~/tmp/groff$
user#buskill:~/tmp/groff$ pandoc -s source.rst -t man
.\" Automatically generated by Pandoc 2.9.2.1
.\"
.TH "my-program" "" "" "" ""
.hy
.SH Synopsis
.PP
\f[B]my-program\f[R]
.SH Description
.PP
\f[B]my-program\f[R] is magical.
It does what you need!
user#buskill:~/tmp/groff$
The above execution shows that pandoc extracted the first argument to .TH from the reST heading (my-program), but the remaining arguments are all blank. If I try to specify them in the heading directly, it doesn't work.
user#buskill:~/tmp/groff$ head source.rst
==============================
my-program "one" "two" "three"
==============================
Synopsis
========
**my-program**
Description
user#buskill:~/tmp/groff$ pandoc -s source.rst -t man
.\" Automatically generated by Pandoc 2.9.2.1
.\"
.TH "my-program \[dq]one\[dq] \[dq]two\[dq] \[dq]three\[dq]" "" "" "" ""
.hy
.SH Synopsis
.PP
\f[B]my-program\f[R]
.SH Description
.PP
\f[B]my-program\f[R] is magical.
It does what you need!
user#buskill:~/tmp/groff$
What do I need to add to the source.rst file such that pandoc will populate the arguments in the destination file's .TH line? And, in general, where can I find reference documentation that describes this?
You can fix this by including the section in the title, defining the date in source.rst, and setting footer & header as variables.
Solution
Update your source.rst file as follows
========
one(two)
========
:date: three
Synopsis
========
**my-program**
Description
===========
**my-program** is magical. It does what you need!
And now re-render the manpage with the following command
user#buskill:~/tmp/groff$ pandoc -s source.rst --variable header=five --variable footer=four -t man
.\" Automatically generated by Pandoc 2.9.2.1
.\"
.TH "one" "two" "three" "four" "five"
.hy
.SH Synopsis
.PP
\f[B]my-program\f[R]
.SH Description
.PP
\f[B]my-program\f[R] is magical.
It does what you need!
user#buskill:~/tmp/groff$
Why this works
I couldn't find great reference documentation from pandoc for the conversion between .rst and man, so I solved this with trial-and-error.
First I found in the pandoc documentation that you can see a default template for the destination format using the -D argument
https://pandoc.org/MANUAL.html#templates
user#buskill:~$ pandoc -D man
$if(has-tables)$
.\"t
$endif$
$if(pandoc-version)$
.\" Automatically generated by Pandoc $pandoc-version$
.\"
$endif$
$if(adjusting)$
.ad $adjusting$
$endif$
.TH "$title/nowrap$" "$section/nowrap$" "$date/nowrap$" "$footer/nowrap$" "$header/nowrap$"
$if(hyphenate)$
.hy
$else$
.nh
$endif$
$for(header-includes)$
$header-includes$
$endfor$
$for(include-before)$
$include-before$
$endfor$
$body$
$for(include-after)$
$include-after$
$endfor$
$if(author)$
.SH AUTHORS
$for(author)$$author$$sep$; $endfor$.
$endif$
user#buskill:~$
I found that you can set the title and section by setting the main heading of the document to <title>(<section>).
I also found that you can set the date with a field name (:date:) in source.rst.
For some reason the formatting of the header and footer gets messed up when they are defined as field names, so I set those on the command line with
--variable header=five --variable footer=four
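The mapping described above can be sketched in Python. This is only an illustration of the argument order in the default man template (title, section, date, footer, header), not pandoc's actual implementation; the function names split_title and th_line are hypothetical, and no roff escaping is attempted.

```python
import re


def split_title(heading: str) -> tuple[str, str]:
    """Split 'name(section)' into (name, section); section is '' if absent."""
    match = re.fullmatch(r"(.+?)\((\w+)\)", heading.strip())
    if match:
        return match.group(1), match.group(2)
    return heading.strip(), ""


def th_line(heading: str, date: str = "", footer: str = "", header: str = "") -> str:
    """Build a .TH line using the same argument order as the default man template."""
    title, section = split_title(heading)
    # Note the order: footer is the fourth argument and header is the fifth.
    fields = (title, section, date, footer, header)
    return ".TH " + " ".join(f'"{field}"' for field in fields)


print(th_line("one(two)", date="three", footer="four", header="five"))
# .TH "one" "two" "three" "four" "five"
print(th_line("my-program"))
# .TH "my-program" "" "" "" ""
```

The second call reproduces the mostly empty .TH line from the question, where only a plain heading was given.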

Pandoc URL anchor character encoding

I have a markdown document I converted to PDF using the command
pandoc --pdf-engine tectonic --from markdown --template eisvogel --listings -V linkcolor:blue --output test.pdf test.md
The conversion works well, but links with anchors in them are converted so the '#' is '%23'. How can I get round this?
[this link](https://example.com/pages/mypage#heading)
becomes
[this link](https://example.com/pages/mypage%23heading)
which gives a 404.
I recently ran into the same problem. I've settled on the following workaround with xelatex pdf-engine:
this link
Not sure the regular [this link](https://example.com/pages/mypage#heading) is a bug or I just need to dig deeper into the documentation. But for now, I'm good :)
Could you not just use a typical output format? It seems like you might need a specific format for a specific reason, though.
---
title: "Habits"
author: John Doe
date: March 22, 2005
output: pdf_document
---

How can I drop metadata fields (e.g., PageLabel fields) from PDFs?

I have used pdftk to change the "Info" metadata associated with a PDF. I currently have several PDFs with extraneous page labels and I cannot figure how to drop them. This is what I am currently doing:
$ pdftk example_orig.pdf dump_data output page_labels.orig
$ grep -v PageLabel page_labels.orig > page_labels.new
$ pdftk example_orig.pdf update_info page_labels.new output example_new.pdf
This does not remove the PageLabel* metadata which can be verified with:
$ pdftk example_orig.pdf dump_data | grep PageLabel
How can I programmatically remove this metadata from the PDF? It would be nice to do this with pdftk, but if there is another tool or way to do this on GNU/Linux, that would also work for me.
I need this because I am using LaTeX Beamer to generate presentations with the \setbeameroption{show notes on second screen} option, which generates a double-width PDF for showing notes on a second screen. Unfortunately, there seems to be a bug in pgfpages which results in incorrect and extraneous PageLabels in these files (example). If I generate a slides-only PDF, it generates the correct PageLabels (example). Since I can generate a correct set of PageLabels, one solution would be to replace the page labels in the first example with those in the second. That said, since there are extra page labels in the first example, I would need to remove them first.
Using a text editor to remove PDF metadata
If it is the first time you edit a PDF, make a backup copy first.
Open your PDF with a text editor that can handle binary blobs. vim -b will be fine.
Locate the /Info dictionary. Overwrite all the entries you do not want any more completely with blanks (an entry consists of /Key names plus the (some values) following them).
Be careful to use exactly as many spaces as there were characters initially. Otherwise the byte offsets in your xref table (the ToC of PDF objects) will be invalidated, and some viewers will report the PDF as corrupted.
For additional measure, locate the /XML string in your PDF. It should show you where your XMP/XML metadata section is (not all PDFs have them). Locate all the key values (not the <something keys>!) in there which you want to remove. Again, just overwrite them with blanks and be careful not to change the total length (neither longer, nor shorter).
In case your PDF does not make the /Info dictionary accessible, transform it with the help of qpdf.
Use this command:
qpdf --qdf --object-streams=disable orig.pdf qdf---orig.pdf
Apply the procedure outlined above. (The qdf---orig.pdf now should be much better suited for editing in a text editor.)
Re-compact your edited file:
qpdf qdf---orig.pdf edited---orig.pdf
Done! Enjoy your edited---orig.pdf. Check if it has all the data removed:
pdfinfo -meta edited---orig.pdf
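The rule "overwrite with blanks, keep the length unchanged" can also be expressed in a few lines of Python instead of a text editor. This is a minimal sketch under stated assumptions: the helper name blank_out and the sample bytes are invented for illustration, and the sample is not a real PDF.

```python
def blank_out(data: bytes, entry: bytes) -> bytes:
    """Overwrite every occurrence of `entry` with the same number of spaces."""
    return data.replace(entry, b" " * len(entry))


# Made-up fragment that resembles an /Info dictionary.
sample = b"<< /Title (Slides) /Author (John Doe) >>"
cleaned = blank_out(sample, b"/Author (John Doe)")

# The total length must not change, otherwise the xref offsets would be wrong.
assert len(cleaned) == len(sample)
print(cleaned)
```

For a real file, the bytes would be read and written in binary mode so that no newline or encoding translation changes the length.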
Update
After looking at the sample PDF files provided, it became clear to me that the /PageLabels key is not part of the /Info dictionary (PDF's Document Information Dictionary), but of the /Root object.
That's probably one reason why pdftk was unable to update it with the method the OP described.
The other reason is the following: the PDF which the OP quoted as containing the correct page labels does in fact contain incorrect ones!
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 2
3 | 2
4 | 2
5 | 2
6 | 4
The other PDF (which supposedly contains extraneous page labels) is incorrect in a different way:
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
6 | 4
My original advice about how to manually edit the classical metadata of a PDF remains valid. For the case of editing page labels you can apply the same method with a slight variation.
In the case of the OP's example files, a complication comes into play: the /Root object is not directly accessible, because it is hidden inside a compressed object stream (PDF object type /ObjStm). That means one has to decompress it with the help of qpdf first:
Use qpdf:
qpdf --qdf --object-streams=disable example_presentation-NOTES.pdf q-notes.pdf
Open the resulting file in binary mode with vim:
vim -b q-notes.pdf
Locate the 1 0 obj marker for the beginning of the /Root object, containing a dictionary named /PageLabels.
(a) To disable page labels altogether, just replace the /PageLabels string by /Pagelabels, using a lowercase 'l' (PDF is case sensitive, and will no longer recognize the keyword; you yourself could at some other time restore the original version should you need it.)
(b) To edit the page labels, first see how the consecutive labels for pages 1--6 are being referred to as
<feff0031>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
(These values are in BOM-marked hex, meaning 1, 2, 2, 2, 3, 4...)
Edit these values to read:
<feff0031>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
[....]
<feff0035>
[....]
<feff0036>
Save the file and run qpdf again in order to re-compress the PDF:
qpdf q-notes.pdf notes.pdf
These now hopefully are the page labels the OP is looking for....
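The BOM-marked hex strings used for the labels above can be decoded and produced with a short Python sketch. The function names decode_label and encode_label are hypothetical; the label list is copied from the values quoted above.

```python
def decode_label(hex_string: str) -> str:
    """Decode a PDF hex string such as '<feff0031>' (UTF-16 with BOM) to text."""
    raw = bytes.fromhex(hex_string.strip("<>"))
    return raw.decode("utf-16")  # the FE FF byte order mark selects big-endian


def encode_label(text: str) -> str:
    """Encode text as a BOM-marked UTF-16BE hex string in PDF '<...>' notation."""
    raw = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + raw.hex() + ">"


labels = ["<feff0031>", "<feff0032>", "<feff0032>", "<feff0032>", "<feff0033>", "<feff0034>"]
print([decode_label(label) for label in labels])
# ['1', '2', '2', '2', '3', '4']
print([encode_label(str(n)) for n in range(1, 7)])
# ['<feff0031>', '<feff0032>', '<feff0033>', '<feff0034>', '<feff0035>', '<feff0036>']
```

All single-digit labels encode to strings of the same length, so swapping one for another does not change the size of the file.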
Since the OP seems to be familiar with editing the output of pdftk's dump_data, he can possibly edit that output and use update_info to apply the fix to the PDF without needing to resort to qpdf and vim.
Update 2:
User #Iserni posted a very good, short and working answer, which limits itself to one command, pdftk, which the OP seems to be familiar with already, plus sed -- not needing to use a text editor to open the PDF, and not introducing an additional utility qpdf like my answer did.
Unfortunately #Iserni deleted it again after a comment of mine. I think his answer deserves to get the bounty and I call on you to vote to "undelete" his answer!
So temporarily, I'll include a copy of #Iserni's answer here, until his is undeleted again:
Not sure if I correctly understood the problem. You can try with a butcher's solution: brute force replace the /PageLabels block with a different one which will not be recognized.
# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress
# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
# Recompress
pdftk mangled.pdf output final.pdf compress
# Remove temp file
rm -f temp.pdf mangled.pdf
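For readers who prefer a script to sed, the mangling step can be sketched in Python with an explicit guard that the replacement has the same length. The function name mangle_key and the sample bytes are assumptions made for illustration; the sample is not a real PDF, and a real file would first need to be uncompressed as shown above.

```python
import re


def mangle_key(pdf_bytes: bytes, old: bytes = b"/PageLabels", new: bytes = b"/BageLapels") -> bytes:
    """Rename a key at the start of a line, refusing replacements that change the length."""
    if len(old) != len(new):
        raise ValueError("replacement must have the same length as the original key")
    pattern = re.compile(rb"^" + re.escape(old), flags=re.MULTILINE)
    # A function is used as the replacement so that `new` is inserted literally.
    return pattern.sub(lambda _match: new, pdf_bytes)


sample = b"<<\n/PageLabels 12 0 R\n/Pages 3 0 R\n>>\n"  # made-up fragment
mangled = mangle_key(sample)

assert len(mangled) == len(sample)
print(mangled)
```

The length check makes the "keep same length" requirement from the sed comment explicit, so a typo in the replacement key fails early instead of producing a file with shifted offsets.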

How to change header ("Contents") of automatic TOC when using Pandoc?

When converting Markdown to PDF with pandoc (version 1.12.1), the --toc option adds an English header: "Contents".
Since my document is in Dutch, I would like to put the Dutch equivalent of "Contents" there. Unfortunately, I couldn't find any configuration option for this, nor did I find any clues in the default.latex file.
My query:
pandoc -S --toc essay.md --biblio "MCM Essay.bib" --csl apa.csl -o mcm.pdf
I'm using Windows.
I use MiKTeX, as in the pandoc instructions.
The string "Contents" is not supplied by pandoc, but by latex (which pandoc calls to create the PDF).
Try adding
-Vlang=dutch
to your command line. This will be passed to latex in the documentclass options, and LaTeX will provide the right string.
Adding
-V toc-title="My Custom TOC Header"
to the pandoc command line will also work. See https://pandoc.org/MANUAL.html#variables-set-automatically.

How can I know which text document maps to which ID

I am learning Mahout with 'Mahout in Action', and right now I am in chapter 8. I just downloaded the Reuters-21578 file and used the following command to convert all the documents to a SequenceFile:
bin/mahout seqdirectory -c UTF-8 \
    -i examples/reuters-extracted/ -o reuters-seqfiles
and I got chunk-0 in the 'reuters-seqfiles' folder.
My question is: How can I know which document has been assigned to which ID in this sequence file?