How can I get localized quotation marks with pandoc? - pdf

I would like to get quotation marks like these „...” but when I process my Markdown text with Pandoc, it gives me “...”. It probably boils down to the question of how to make Pandoc use locale settings. Here is my command line:
pandoc -o out.pdf -f markdown -t latex in.md

You may want to specify the language for the whole document, so that it not only affects the quotes, but also ligatures, unbreakable spaces, and other locale specifics.
In order to do that, you may specify the lang option. From pandoc's manual:
lang: identifies the main language of the document, using a code according to BCP 47 (e.g. en or en-GB). For some output formats,
pandoc will convert it to an appropriate format stored in the
additional variables babel-lang, polyglossia-lang (LaTeX) and
context-lang (ConTeXt).
Moreover, the same manual states that:
if your LaTeX template or any included header file call for the csquotes package, pandoc will detect this automatically and use \enquote{...} for quoted text.
In my opinion, the best way to ensure quotes localization is thus to add:
---
lang: fr-FR
header-includes:
- \usepackage{csquotes}
---
Or even better, to edit the pandoc default template
pandoc -D latex > ~/.pandoc/templates/default.latex
and permanently add \usepackage{csquotes} to it.
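As a rough sketch of the metadata-block approach, the Python helper below prepends the YAML block shown above to a Markdown string. The function name `with_quote_metadata` and its parameters are made up for illustration; only the `lang` and `header-includes` keys and the csquotes package come from the answer.

```python
def with_quote_metadata(markdown_text: str, lang: str = "fr-FR") -> str:
    """Prepend a YAML metadata block that sets `lang` and loads csquotes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    front_matter = "\n".join([
        "---",
        f"lang: {lang}",
        "header-includes:",
        r"- \usepackage{csquotes}",
        "---",
        "",
    ])
    # A blank line separates the metadata block from the document body.
    return front_matter + "\n" + markdown_text


doc = with_quote_metadata('He said "hello".', lang="de-DE")
print(doc)
```

The resulting text can be saved as in.md and passed to the pandoc command from the question.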

Pandoc currently (Nov. 2015) only generates English type quotes. But you have two options:
You can use the --no-tex-ligatures option to turn off the --smart typography option which is turned on by default for LaTeX output. Then use the proper unicode characters (e.g. „...”) you desire and use a --latex-engine that supports unicode (lualatex or xelatex).
You could use \usepackage[danish=quotes]{csquotes} or similar in your Pandoc LaTeX template.
From the README:
If the --smart option is specified, pandoc will produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.”
Note: if your LaTeX template calls for the csquotes package, pandoc will detect this automatically and use \enquote{...} for quoted text.

Even though @mb21 already provided an answer, I would like to add that nowadays you can simply include what you need in (for example) your YAML metadata block, so it is no longer necessary to create your own template. Example:
---
title: foo
author: bar
# other stuff
header-includes:
- \usepackage[german=quotes]{csquotes}
---
# contents

Related

Cross references to headings with leading numbers in PDF

I am using Pandoc to convert a markdown file to a PDF and I have some issues with creating references to headings with leading numbers.
Here is the code:
Take me to the [first paragraph](#1-paragraph)
## 1 Paragraph
In the converted PDF the link does not work.
When I remove the leading number everything works fine.
So what's the correct way to link to this kind of heading?
A good way to go about this is to look at pandoc's “native” output, i.e., the internal representation of the document after parsing:
$ echo '## 1 Paragraph' | pandoc -t native
[ Header
2
( "paragraph" , [] , [] )
[ Str "1" , Space , Str "Paragraph" ]
]
The auto-generated ID for the heading is paragraph. The reason for that is that HTML4 doesn't allow identifiers that start with numbers, so pandoc skips those. Hence, [first paragraph](#paragraph) will work.
However, GitHub Flavored Markdown is written with HTML5 in mind, and numbers are allowed as the first id character in that case. Pandoc supports GitHub's scheme as well, and those auto-identifiers are enabled with --from=markdown+gfm_auto_identifiers.
Probably better than manual numbering of headings is to call pandoc with --number-sections (or -N); the numbering will be performed automatically.
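To see why the leading number disappears, here is a minimal Python sketch of the default identifier rule, following the steps listed in pandoc's manual for the auto_identifiers extension. It is an approximation for plain-text headings only: it ignores inline formatting, footnotes, and the numbering pandoc adds to duplicate identifiers. The function name is hypothetical.

```python
import re


def pandoc_auto_identifier(heading_text: str) -> str:
    """Approximate pandoc's default auto-identifier for a plain-text heading."""
    # Keep alphanumerics, underscores, hyphens, periods and whitespace.
    kept = "".join(
        ch for ch in heading_text
        if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Replace runs of spaces/newlines with a hyphen, then lowercase.
    slug = re.sub(r"\s+", "-", kept.strip()).lower()
    # Drop everything before the first letter.
    for index, ch in enumerate(slug):
        if ch.isalpha():
            return slug[index:]
    # Nothing left: pandoc falls back to "section".
    return "section"


print(pandoc_auto_identifier("1 Paragraph"))  # paragraph
```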

How can I properly create multilingual metadata in pdftk

pdftk lets you set the title of a PDF with the following command:
pdftk input.pdf update_info metadata.txt output output.pdf
However, if I use special characters in the metadata.txt file (such as German or Chinese characters), it doesn't seem to work.
Here's an example of changing the title:
InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengefühl is a German term.
However, the PDF ends up with a strange character in place of the ü.
In the documentation of pdftk it says that non-ASCII characters should be encoded as XML numerical entities. However, I Googled myself silly but couldn't find anything that works.
The best reference I've found is Numerical Character Reference, which is applicable to XML (and XHTML and SGML).
This is generally used to represent characters that are not directly encodable.
In your case, the character is ü (U+00FC, decimal 252), which can be written as &#252; (decimal) or &#xFC; (hexadecimal). XML defines only these two forms; there is no octal variant.
Using a decimal reference, your file should be encoded as:
InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengef&#252;hl is a German term.
Note:
If you're on a Unix-like system, you can use recode to encode the file.
% cat metadata.txt | recode ..xml
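If recode is not available, the same decimal references can be produced with Python's built-in `xmlcharrefreplace` error handler. This is a small sketch; the function name `to_xml_char_refs` is made up for illustration.

```python
def to_xml_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal XML character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


line = "InfoValue: Fingerspitzengefühl is a German term."
print(to_xml_char_refs(line))
# InfoValue: Fingerspitzengef&#252;hl is a German term.
```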
Another approach seems better, as there is no need to install extra tools. Instead, it uses PDFtk's built-in operations dump_data_utf8 and update_info_utf8:
pdftk input.pdf update_info_utf8 metadata.txt output output.pdf
It works perfectly for Chinese.

Failure to read full line including embedded zero bytes

Lua script:
i=io.read()
print(i)
Command line:
echo -e "sala\x00m" | lua ll.lua
Output:
sala
I want it to print all characters from the input, similar to this:
salam
In a hex editor:
0000000: 7361 6c61 006d 0a sala.m.
How can I print all characters from the input?
You tripped over one of the few places where the Lua standard library is still not 8-bit-clean.
Specifically, file reading line-by-line is not embedded-0 proof.
The reason it isn't yet is an unfortunate combination of:
Only standard C90 or equally portable constructs are allowed for the core, which does not provide for efficient 0-clean text parsing.
Every solution discussed to date on the mailing list under that constraint has considerable overhead.
Embedded 0-bytes in text files are quite rare.
Workarounds:
Use a modified library that fixes the "*l" and "*L" formats of file:read(...).
Parse your raw data yourself (read a block using a number, or as much as possible using "*a").
Badger the Lua developers/maintainers for a bugfix until they give in.
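The second workaround (read a raw block, then split it into lines yourself) is sketched below in Python rather than Lua, purely to show the logic. The function name is hypothetical, and the byte string stands in for whatever a block read such as "*a" would return.

```python
import io


def first_line(raw: bytes) -> bytes:
    """Return everything before the first newline, keeping embedded zero bytes."""
    line, _, _ = raw.partition(b"\n")
    return line


# Simulates the bytes produced by: echo -e "sala\x00m"
raw = io.BytesIO(b"sala\x00m\n").read()
print(first_line(raw))  # b'sala\x00m'
```

Because the split works on the raw byte block and looks only for the newline, a zero byte is treated like any other character.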

Text codepage in groff

How do I set up the correct codepage in groff?
For example, to use a Cyrillic language.
The man page mentions the -T switch, but troff -T utf8 -ms troff_file.txt
gives:
warning: invalid input character code 128
-Tutf8 only selects the output device. If you want groff to accept Unicode input, use the -K switch.
Another way to have Cyrillic is simply to use -Tlatin1 to enable eight-bit character codes, and feed groff a source file in a single-byte encoding such as CP 1251 or KOI8-R. Don't forget to prepare your hyphenation file accordingly.
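For the single-byte route described above, the source file first has to be converted from UTF-8. A minimal Python sketch, assuming the input is valid UTF-8 and contains only characters that exist in the target code page (the function name is made up for illustration):

```python
def utf8_to_single_byte(utf8_bytes: bytes, target_encoding: str = "koi8_r") -> bytes:
    """Re-encode UTF-8 text into a single-byte code page such as KOI8-R or CP 1251."""
    return utf8_bytes.decode("utf-8").encode(target_encoding)


source = ".PP\nПривет\n".encode("utf-8")
converted = utf8_to_single_byte(source)
print(len(source), len(converted))  # 17 11
```

Each Cyrillic letter shrinks from two bytes to one. Passing "cp1251" as `target_encoding` gives the other code page mentioned in the answer.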

LaTeX math in BibTeX citations not working with pandoc

Background
I am using pandoc to convert Markdown → PDF with references incorporated from a BibTeX citation database. I would like a citation in my bibliography to match the typographical conventions in the original article, namely italics and subscripts. The citation in the bibliography should look like this:
I have the following citation exported from Zotero as BibTeX.
@article{stanley_restrictions_1969,
title = {Restrictions on the Possible Values of $r_{12}$, Given $r_{13}$ and $r_{23}$},
volume = {29},
issn = {0013-1644},
url = {http://dx.doi.org/10.1177/001316446902900304},
doi = {10.1177/001316446902900304},
number = {3},
urldate = {2013-01-04},
journal = {Educational and Psychological Measurement},
author = {Stanley, J. C. and Wang, M. D.},
month = oct,
year = {1969},
pages = {579--581}
}
Zotero escapes the dollar signs, brackets, and underscores (\$r\_\{12\}\$) when I export to BibTeX format, but I just use sed to take them out before invoking pandoc. But then pandoc escapes them again. If I convert from Markdown → LaTeX, pandoc produces:
Stanley, J. C., \& Wang, M. D. (1969). Restrictions on the Possible
Values of \$r\_12\$, Given \$r\_13\$ and \$r\_23\$. \emph{Educational
and Psychological Measurement}, \emph{29}(3), 579--581.
doi:10.1177/001316446902900304
which means I get:
Question
How can one include LaTeX math in the BibTeX citations used by pandoc when converting from Markdown → PDF?
It's not supported. Here's an issue on the pandoc bug tracker. Pandoc uses bibutils to read bibtex databases, converting them to MODS XML which is then read by citeproc-hs. Unfortunately, MODS doesn't have any way of representing math. And bibutils doesn't recognize math in bibtex. So there's no clear solution at the moment -- short of writing a bibtex parser from scratch that uses pandoc to convert LaTeX in fields -- maybe not a bad idea!
The upcoming 1.12 release of pandoc will allow you to include your citation database in YAML form inside the document itself (or in a separate file). When citations are included this way, simple math will be supported, as well as some other kinds of markup. There will be a tool for converting an existing bibtex database to the YAML form, though because the tool, like pandoc, uses bibutils, it won't convert math, and you'll have to modify that later.
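The preprocessing step mentioned in the question (stripping Zotero's escapes with sed before running pandoc) could also be written in Python. This sketch reproduces only that step; it does not make pandoc render the math, and it would also unescape the same characters in other fields such as URLs. The function name is hypothetical.

```python
import re


def unescape_tex_math(bibtex_text: str) -> str:
    r"""Remove the backslash in front of $, _, { and } (e.g. \$r\_\{12\}\$)."""
    return re.sub(r"\\([$_{}])", r"\1", bibtex_text)


print(unescape_tex_math(r"\$r\_\{12\}\$"))  # $r_{12}$
```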