Pandoc and foreign characters


I've been trying to use Pandoc to convert some Markdown into a PDF file. This is a sample that Pandoc will not convert for me:
# Header!
## Sub Header
themselves derived respectively from the Greek ἀναρχία i.e. 'anarchy'
That's just something I grabbed from the top of the Wikipedia database dump. Pandoc doesn't like that at all. This is the error message it gives me:
pandoc: Error producing PDF from TeX source.
! Package inputenc Error: Unicode char \u8:ἀ not set up for use with LaTeX.
See the inputenc package documentation for explanation.
Type H <return> for immediate help.
...
l.53 ...es derived respectively from the Greek ἀ
Is there a command switch I can give it to get around this? I tried following the advice to do something like this, but it failed:
iconv -t utf-8 test.md | pandoc -o test.pdf
Update: Before following John's advice below, see this.
Update 2: This is the command that ultimately got it working. Hopefully this will help someone:
pandoc test2.md -o test2.pdf --latex-engine=xelatex --template=my.latex --variable mainfont="DejaVu Serif" --variable sansfont=Arial
And this is the contents of my.latex:
\documentclass[$if(fontsize)$$fontsize$,$endif$$if(lang)$$lang$,$endif$$if(papersize)$$papersize$,$endif$]{$documentclass$}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
% use microtype if available
\IfFileExists{microtype.sty}{\usepackage{microtype}}{}
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[utf8x]{inputenc}
\usepackage{ucs}
$if(euro)$
\usepackage{eurosym}
$endif$
\else % if luatex or xelatex
\usepackage{fontspec}
\ifxetex
\usepackage{xltxtra,xunicode}
\fi
\defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
\setromanfont{TeX Gyre Pagella}
\newcommand{\euro}{€}
$if(mainfont)$
\setmainfont{$mainfont$}
$endif$
$if(sansfont)$
\setsansfont{$sansfont$}
$endif$
$if(monofont)$
\setmonofont{$monofont$}
$endif$
$if(mathfont)$
\setmathfont{$mathfont$}
$endif$
\fi
$if(geometry)$
\usepackage[$for(geometry)$$geometry$$sep$,$endfor$]{geometry}
$endif$
$if(natbib)$
\usepackage{natbib}
\bibliographystyle{plainnat}
$endif$
$if(biblatex)$
\usepackage{biblatex}
$if(biblio-files)$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(listings)$
\usepackage{listings}
$endif$
$if(lhs)$
\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
$endif$
$if(highlighting-macros)$
$highlighting-macros$
$endif$
$if(verbatim-in-note)$
\usepackage{fancyvrb}
$endif$
$if(tables)$
\usepackage{longtable}
$endif$
$if(graphics)$
\usepackage{graphicx}
% We will generate all images so they have a width \maxwidth. This means
% that they will get their normal width if they fit onto the page, but
% are scaled down if they would overflow the margins.
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth
\else\Gin@nat@width\fi}
\makeatother
\let\Oldincludegraphics\includegraphics
\renewcommand{\includegraphics}[1]{\Oldincludegraphics[width=\maxwidth]{#1}}
$endif$
\ifxetex
\usepackage[setpagesize=false, % page size defined by xetex
unicode=false, % unicode breaks when used with xetex
xetex]{hyperref}
\else
\usepackage[unicode=true]{hyperref}
\fi
\hypersetup{breaklinks=true,
bookmarks=true,
pdfauthor={$author-meta$},
pdftitle={$title-meta$},
colorlinks=true,
urlcolor=$if(urlcolor)$$urlcolor$$else$blue$endif$,
linkcolor=$if(linkcolor)$$linkcolor$$else$magenta$endif$,
pdfborder={0 0 0}}
\urlstyle{same} % don't use monospace font for urls
$if(links-as-notes)$
% Make links footnotes instead of hotlinks:
\renewcommand{\href}[2]{#2\footnote{\url{#1}}}
$endif$
$if(strikeout)$
\usepackage[normalem]{ulem}
% avoid problems with \sout in headers with hyperref:
\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
$endif$
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
\setlength{\emergencystretch}{3em} % prevent overfull lines
$if(numbersections)$
$else$
\setcounter{secnumdepth}{0}
$endif$
$if(verbatim-in-note)$
\VerbatimFootnotes % allows verbatim text in footnotes
$endif$
$if(lang)$
\ifxetex
\usepackage{polyglossia}
\setmainlanguage{$mainlang$}
\else
\usepackage[$lang$]{babel}
\fi
$endif$
$for(header-includes)$
$header-includes$
$endfor$
$if(title)$
\title{$title$}
$endif$
\author{$for(author)$$author$$sep$ \and $endfor$}
\date{$date$}
\begin{document}
$if(title)$
\maketitle
$endif$
$for(include-before)$
$include-before$
$endfor$
$if(toc)$
{
\hypersetup{linkcolor=black}
\setcounter{tocdepth}{$toc-depth$}
\tableofcontents
}
$endif$
$body$
$if(natbib)$
$if(biblio-files)$
$if(biblio-title)$
$if(book-class)$
\renewcommand\bibname{$biblio-title$}
$else$
\renewcommand\refname{$biblio-title$}
$endif$
$endif$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(biblatex)$
\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$
$endif$
$for(include-after)$
$include-after$
$endfor$
\end{document}

Use the --pdf-engine=xelatex option.

By default, Pandoc uses the pdflatex engine when converting Markdown files to PDF. pdflatex does not handle Unicode characters as smoothly as xelatex, so you should try xelatex instead. However, merely switching to xelatex is not enough: as is often the case, you also need to choose a font which contains glyphs for the Unicode characters you want to typeset.
I am a Chinese user, so I will take Chinese as an example. If you have a test.md which contains the following content:
你好汉字
you can use the following command to compile this markdown file:
pandoc --pdf-engine=xelatex -V CJKmainfont="KaiTi" test.md -o test.pdf
In the above command, --pdf-engine=xelatex selects the LaTeX engine (in newer versions of Pandoc, the --latex-engine option is deprecated), and -V CJKmainfont="KaiTi" selects a font which supports Chinese. For other languages, you may use the flag -V mainfont="<FONT_NAME>".
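As a minimal sketch of how these flags fit together, the Python helper below assembles the argument list for the command above. The function name pandoc_pdf_args and its parameters are hypothetical, chosen only for illustration; the flags themselves (--pdf-engine, -V mainfont, -V CJKmainfont) are the ones discussed in this answer. The snippet only builds and prints the command; it does not run pandoc.

```python
import shlex


def pandoc_pdf_args(source, output, mainfont=None, cjk_mainfont=None):
    """Build a pandoc command line that renders `source` to PDF via XeLaTeX.

    Hypothetical helper for illustration only; it does not run pandoc.
    """
    args = ["pandoc", "--pdf-engine=xelatex"]
    if mainfont:
        # Main text font; it must contain the glyphs the document needs.
        args += ["-V", f"mainfont={mainfont}"]
    if cjk_mainfont:
        # Font used for Chinese, Japanese and Korean text.
        args += ["-V", f"CJKmainfont={cjk_mainfont}"]
    args += [source, "-o", output]
    return args


# Print the command in a form that can be pasted into a shell.
print(shlex.join(pandoc_pdf_args("test.md", "test.pdf", cjk_mainfont="KaiTi")))
```

Passing the resulting list to subprocess.run(args, check=True) would perform the conversion, assuming pandoc, XeLaTeX and the chosen font are installed.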
How to find a font which supports your language
In order to find a font which supports your language, you need to know your language code. Then, if you are on a Linux system, or on a Windows system with TeX Live installed, you can use the following command to find a valid font for your language:
fc-list :lang=zh  # list the fonts which support Chinese (language code is `zh`)
The output lists the installed fonts that support the language.
If you choose to use, e.g., the font Source Han Serif CN, then use the following command to compile your markdown file:
pandoc --pdf-engine=xelatex -V CJKmainfont="Source Han Serif CN" test.md -o test.pdf

UPDATE: the answer below seems to be valid for pandoc 1.x but with later versions the syntax has changed
Coming back to this post five years later, the issue is still there. The command
pandoc -s test.md -t latex -o test.pdf
fails when test.md contains text with non-Latin characters, including Greek, Cyrillic, CJK, Hebrew and Arabic.
LaTeX was designed before Unicode and its support for different character sets is robust in some areas but far from comprehensive, so the advice to use XeLaTeX is valid yet requires one to choose the main font carefully, since there is no automatic choice.
Below is a small taxonomy of possible issues and some solutions. All tested with Pandoc 1.19.
Cyrillic
Support for the Cyrillic alphabet in LaTeX is provided via the T2A font encoding.
Consider a small sample:
# Header
## Subheader
Tetris (Russian: Тетрис) quoting Wikipedia is a tile-matching puzzle
video game
Running this example with pandoc would fail with:
! Package inputenc Error: Unicode char Т (U+422)
(inputenc) not set up for use with LaTeX.
See the inputenc package documentation for explanation.
A fix is available because fontenc is a predefined variable in the default.latex template.
Running this example with
pandoc -t latex -o tetris.pdf -V fontenc=T2A cyrillic.md
would produce the correct rendering.
This however would not handle other language features correctly such as hyphenation. A better way would be to use Babel and have it select the correct font encoding.
pandoc -t latex -o tetris.pdf -V lang -V babel-lang=russian cyrillic.md
Or to switch languages with Babel commands inside Markdown
# Header
## Subheader
Tetris (Russian: \foreignlanguage{russian}{Тетрис}) quoting Wikipedia
is a tile-matching puzzle video game
And run with
pandoc -t latex -o tetris.pdf -V lang -V babel-lang=english \
-V babel-otherlangs=russian cyrillic2.md
Greek
The example in the original post contains characters both from the main and extended Greek Unicode codepages.
However, the widely used LGR Greek font encoding is not covered by the LaTeX3 project and is classified as a local encoding, i.e. according to the LaTeX Encoding Guide it may vary from site to site and from system to system.
On TeX Live the following packages need to be installed: texlive-greek-inputenc, texlive-greek-fontenc and texlive-cbfonts. Note that you need Babel 3.9 or later.
However the result of
pandoc -t latex -o anarchy.pdf -V fontenc=LGR greek.md
may appear unexpected.
To correct this, one has to set up the LaTeX Babel package correctly and insert commands to switch between the languages in the original text:
# Header!
## Sub Header
themselves derived respectively from the Greek \textgreek{ἀναρχία}
i.e. 'anarchy'
Compiling this with the following command
pandoc -s greek2.md -t latex -V fontenc=T2A -V lang -V babel-lang=english \
-V babel-otherlangs=greek -o greek.pdf
would produce exactly the output you would expect.
XeLaTeX
All of this would not be needed if we were using XeLaTeX.
Just running the original example with
pandoc -s greek.md --latex-engine=xelatex -t latex -o greek.pdf
would produce a PDF with white space where the Greek word should be, because the default font does not contain anything in the Greek character positions.
Selecting one of the popular fonts as the new mainfont would help a bit
pandoc -s greek.md --latex-engine=xelatex \
-V mainfont="Liberation Serif" -t latex -o greek.pdf
However characters from the extended Greek codepage such as the small letter alpha with psili accent are not rendered.
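To see which characters a candidate font has to cover, it can help to list the non-ASCII code points in the source text. The following Python sketch does this with the standard unicodedata module. The function name describe_non_ascii is an assumption made for illustration, and the sample string is written with escapes so that its code points are unambiguous (U+1F00 is the alpha with psili mentioned above).

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, 'U+XXXX', name) for each distinct non-ASCII character,
    in order of first appearance. Illustrative helper, not part of pandoc."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in seen
    ]


# U+1F00 lies in the Greek Extended block; U+03BD and U+03B1 are basic Greek.
sample = "Greek \u1f00\u03bd\u03b1"
for ch, code, name in describe_non_ascii(sample):
    print(code, name)
```

Characters reported in the U+1F00 to U+1FFF range belong to the Greek Extended block, which is where fonts with only basic Greek coverage tend to fall short.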
The Font Setup for Greek with XeTeX/LuaTeX Guide suggests using the DejaVu, Libertine or Free font families.
Indeed with DejaVu Serif, Linux Libertine O as well as Tempora and perhaps some other fonts, the result would be as expected. See below the rendering with XeLaTeX and Linux Libertine fonts.
pandoc -s greek.md --latex-engine=xelatex -V mainfont="Linux Libertine O" \
-t latex -o greek.pdf

Works for Cyrillic characters
pandoc myfile.md --pdf-engine=xelatex -V mainfont=Arial

You can use --latex-engine=xelatex, as said before, but the best approach I have found is to use the lang variable to specify the document language in the header, like this: lang: ru-RU. A working example on my Debian workstation:
---
title: Lady Macbeth de Mzensk (Chostakovitch, livret d'Alexandre Preis, 1934)
lang: ru-RU
---
# Acte I / Tableau 1
*[Народ ненадежный]*
Ха, ха, ха, ха, ха, ха, ха. *[...]* Чуыствуем
На кого ты нас покидаешь?
Без хозяина будет скучно,
скучно, тоскливо, безрадостно.
Не работа. Без тебя невеселье. Воз вращайся
Как можно скорей, скорей !
Then you can launch:
$ pandoc -o your-file-output.pdf your-source-file.md

If you are using LaTeX intermediate output, then you can use inline \mbox{t\'ext} to get accented characters. Without the \mbox{}, the backslash often isn't interpreted correctly by the Pandoc parser.

I had a similar issue trying to get mathematical symbols to show up in the output.
As others have mentioned, with recent pandoc versions (v2.2.3.2 in my case) the option to use is --pdf-engine=xelatex. I did not need to specify a font in this case:
pandoc -o MyDoc.pdf --pdf-engine=xelatex MyDoc.md
I did get an error that the latinmodern-math font was missing. I installed it using:
tlmgr install collection-fontsrecommended

Related

How to make "pandoc" use a specific font when converting a plaintext file to PDF?

After a lot of trouble, I was finally able to run the command without errors:
pandoc -i 1.txt -o 1.pdf
The result is a PDF with completely messed up text because it uses some other font than Courier[ New]. Some varying-width, default font.
After reading and searching for a long time, I found this: https://pandoc.org/MANUAL.html#creating-a-pdf
The option "fontfamily" is mentioned, so I tried to do:
pandoc -i 1.txt -o 1.pdf --fontfamily=Courier
However, this results in:
Unknown option --fontfamily.
Try pandoc --help for more information.
I have looked through the entire "pandoc --help" output without finding any mention of fonts.
How do I set the font to be used?
(I'm trying my very best to not also add: "and why is it so incredibly difficult/cryptic/undocumented to do the most basic imaginable thing?"...)
I'm not even sure that this will fix all the problems. I just assume that the document is all messed up because the font isn't using fixed-width letters.

Pandoc, markdown to pdf doesn't wrap long words in paragraphs

I'm trying to generate a clean PDF from markdown using Pandoc and xelatex.
When I convert :
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I end up having :
Here is the command I use to generate the PDF :
/usr/local/bin/pandoc --verbose \
--chapters --from=markdown+yaml_metadata_block -S \
--latex-engine=xelatex \
--listings -H listings-setup.tex \
--template template.pdf \
--toc --chapters \
-o test.pdf \
metadata.yml \
test.md
I use the document class : report
I have tried different things from inside the template and the extra header I'm using, but I have no idea which template Pandoc uses when generating paragraphs.
I see the following in my template.pdf (extracted from Pandoc), but it doesn't seem to apply here:
\setlength{\emergencystretch}{3em} % prevent overfull lines

You've a few possibilities. Since pandoc uses LaTeX for PDF generation, these are adapted from this LaTeX answer:
Annotate the proper language:
---
lang: en-GB
---
rest of document
Use soft hyphens inside a word to explicitly denote the allowed places to break. You can either use the Unicode character (U+00AD) or the HTML entity &shy;, which pandoc will convert automatically for LaTeX etc. For example: cryp&shy;to&shy;graphy
Specify exceptions via \hyphenation{cryp-to-graphy}
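For input like the unbroken run of letters in the question, the soft-hyphen option could be automated before the file reaches pandoc. The Python sketch below is a deliberately naive illustration under that assumption: the function name add_soft_hyphens and the max_run threshold are hypothetical, and it breaks every long run of non-whitespace characters, including URLs and code, so it would need refinement for real documents.

```python
import re

SOFT_HYPHEN = "\u00ad"  # the Unicode soft hyphen character


def add_soft_hyphens(text, max_run=20):
    """Insert a soft hyphen after every `max_run` characters of any unbroken
    run of non-whitespace characters that is longer than `max_run`."""

    def split_token(match):
        token = match.group(0)
        chunks = [token[i:i + max_run] for i in range(0, len(token), max_run)]
        return SOFT_HYPHEN.join(chunks)

    # Match only runs that exceed the threshold; shorter words are untouched.
    pattern = re.compile(r"\S{%d,}" % (max_run + 1))
    return pattern.sub(split_token, text)


long_word = "a" * 45
print(add_soft_hyphens(long_word).count(SOFT_HYPHEN))
```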

text to pdf with utf8 encoding (alternative to a2ps)

The program a2ps does not support UTF-8. At least my version only supports the latin-X encodings:
a2ps --list=encoding
Version:
GNU a2ps 4.14
How can I convert a simple utf-8 text to postscript or pdf?

If what you actually want is to use a2ps or enscript (a similar tool), and your only need is to handle a UTF-8 document, you just have to convert the document to ISO-8859-1 or another supported encoding. Various tools allow this. For instance, here is a workflow for enscript (but you can surely do the same with a2ps):
cat document.txt | iconv -c -f utf-8 -t ISO-8859-1 | enscript -o document.ps
But you may lose some characters during the conversion because such encodings have a smaller range than UTF-8.
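The loss is easy to demonstrate. The Python sketch below encodes a string to ISO-8859-1 and silently drops whatever that encoding cannot represent, which roughly mirrors what the -c flag does in the iconv command above. The function name to_latin1_lossy and the sample text are assumptions made for illustration.

```python
def to_latin1_lossy(text):
    """Encode to ISO-8859-1, silently dropping characters it cannot represent."""
    return text.encode("iso-8859-1", errors="ignore")


# "café" fits in ISO-8859-1; the seven Greek letters that follow do not.
sample = "caf\u00e9 \u1f00\u03bd\u03b1\u03c1\u03c7\u03af\u03b1"
converted = to_latin1_lossy(sample).decode("iso-8859-1")
dropped = len(sample) - len(converted)
print(repr(converted), "characters dropped:", dropped)
```

Comparing the lengths before and after, as done here, is one simple way to detect that a document would not survive the conversion intact.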
On the other hand, if UTF-8 is a requirement, you may want to look for a more recent tool that can convert UTF-8 text to PDF. I wrote a Python program called txt2pdf myself; you may find it here. Also have a look at tools like pandoc, gimli, rst2pdf or wkhtmltopdf.

You can use Vim. Open the file and execute the command :hardcopy > output.ps in normal mode. You can also do this directly from the shell. Executing
$ vim -c ":hardcopy > output.ps" -c ":quit" input.txt
in your shell will open Vim, generate the output.ps, and then close Vim.

Use paps! For instance, I use it as follows:
paps --font="Monospace 10" input.txt > output.ps
and I have no problem with UTF-8 encoding.
If you need a PDF file, then run
ps2pdf output.ps

I've gotten acceptable results (for printing code listings) from https://github.com/arsv/u2ps

https://gitlab.com/gnomify/u2ps is the replacement of gnome-u2ps.

If the text file is small, paps converts the text to PostScript, which can then be fed to ps2pdf. The problem is that the ps file from paps causes ps2pdf to create a very big PDF file. If that is OK, this approach works. Currently, I am getting large PDF files from paps.

There's a utility based on gnome libraries and named gnome-u2ps. It has less functionality than a2ps, and it seems that it is not maintained anymore.

From Markdown to PDF: how to change the font-size with Pandoc?

I'm converting some Markdown files into PDF using Pandoc like this:
pandoc input.md -V geometry:margin=1in -o output.pdf
By default, the font-size is quite small in the pdf. I'd like to make all the fonts bigger (title, sub title, text, etc.). How can I do that?

Add this to your incantation:
-V fontsize=12pt

If you want to go bigger than 12pt you can use the extsizes package.
It was already preinstalled for me, so this worked out of the box with pandoc:
---
documentclass: extarticle
fontsize: 14pt
---
…
Possible sizes are 8pt, 9pt, 10pt, 11pt, 12pt, 14pt, 17pt, 20pt.

For new users of pandoc (like myself): as an alternative to specifying variables with the -V flag, you can add them to the YAML metadata block of the markdown file. To change the font size, prepend the following to your markdown document.
---
fontsize: 12pt
---
This works with font sizes 10pt, 11pt, and 12pt. In addition to the comments on John MacFarlane's answer, there is some good info on specifying additional font sizes with LaTeX in this article (inline LaTeX can be used in pandoc markdown documents being converted to PDF).

How to change header ("Contents") of automatic TOC when using Pandoc?

When converting markdown to pdf with pandoc (version 1.12.1) the ToC option adds an english header: "Contents".
Since my document is in Dutch, I would like to be able to put the Dutch equivalent of "Contents" there. Unfortunately, I couldn't find any configuration option for this, nor did I find any clues in the default.latex file.
My query:
pandoc -S --toc essay.md --biblio "MCM Essay.bib" --csl apa.csl -o mcm.pdf
I'm using Windows.
I use MiKTeX, as suggested in the pandoc instructions.

The string "Contents" is not supplied by pandoc, but by latex (which pandoc calls to create the PDF).
Try adding
-Vlang=dutch
to your command line. This will be passed to latex in the documentclass options, and LaTeX will provide the right string.

Adding
-V toc-title="My Custom TOC Header"
to the pandoc command line will also work. See https://pandoc.org/MANUAL.html#variables-set-automatically.
