Pandoc, markdown to pdf doesn't wrap long words in paragraphs

I'm trying to generate a clean PDF from markdown using Pandoc and xelatex.
When I convert :
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I end up with a line that is not wrapped in the PDF.
Here is the command I use to generate the PDF:
/usr/local/bin/pandoc --verbose \
--chapters --from=markdown+yaml_metadata_block -S \
--latex-engine=xelatex \
--listings -H listings-setup.tex \
--template template.pdf \
--toc --chapters \
-o test.pdf \
metadata.yml \
test.md
I use the document class: report
I have tried different things from inside the template and the extra header I'm using, but I have no idea which template Pandoc uses when generating paragraphs.
I see the following in my template.pdf (extracted from Pandoc), but it doesn't seem to apply here:
\setlength{\emergencystretch}{3em} % prevent overfull lines

You've a few possibilities. Since pandoc uses LaTeX for PDF generation, these are adapted from this LaTeX answer:
Annotate the proper language:
---
lang: en-GB
---
rest of document
Use soft hyphens inside a word to explicitly mark the places where a break is allowed. You can use either the Unicode soft hyphen character (U+00AD) or the HTML entity &shy;, which pandoc will convert automatically for LaTeX etc. For example: cryp&shy;to&shy;graphy
Specify exceptions via \hyphenation{cryp-to-graphy}
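One way to apply the \hyphenation option without editing the template is to put the exceptions in a small header fragment and pass it with -H, as the question already does for listings-setup.tex. The sketch below is only an illustration: the function name write_hyphenation_header and the file name hyphenation.tex are made up for this example, and whether the exception takes effect may depend on where the template loads its language support.

```shell
#!/bin/sh
# Write one \hyphenation{...} line per pattern into a LaTeX header fragment.
# Usage: write_hyphenation_header OUTPUT_FILE PATTERN...
# (write_hyphenation_header and hyphenation.tex are hypothetical names.)
write_hyphenation_header() {
  out=$1
  shift
  : > "$out"                      # create or truncate the output file
  for pattern in "$@"; do
    printf '\\hyphenation{%s}\n' "$pattern" >> "$out"
  done
}

write_hyphenation_header hyphenation.tex cryp-to-graphy

# The fragment can then be added to the preamble with -H, for example:
#   pandoc --latex-engine=xelatex -H hyphenation.tex -o test.pdf test.md
```

Note that a hyphenation exception only helps for real words with sensible break points; it will not wrap an arbitrary unbroken string such as the row of a's in the question.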

Auto-crop pdf from command line

I need to automatically crop a pdf file (remove white margins). So far, I tried two tools which aren't perfect:
pdfcrop
Issue: it doesn't crop some pdfs.
pdf-crop-margins
Issue: sometimes it crops too much (fine details).

I had the same problem. I render very long single-page PDFs with wkhtmltopdf like this:
wkhtmltopdf \
--disable-javascript \
--print-media-type \
--zoom 2 \
--page-width 750px \
--page-height 100000px \
https://www.foobar.com \
foobar.pdf;
Then I need to trim off all the whitespace at the bottom.
I tried Briss https://formulae.brew.sh/formula/briss#default, but that did not work for me.
So I tried pdfcropmargins, the tool that did not work for you, and bingo!
On macOS 11.6 I need to invoke the command like so:
/Users/<usernamehere>/Library/Python/3.8/bin/pdfcropmargins -p 0 foobar.pdf;
That writes the cropped result to a file named foobar_cropped.pdf

Ghostscript to convert PDF 2 Back to PDF version 1.7

I need to build a PDF serverside by reading in a number of PDFs and inserting each page into a new multipage PDF. The problem is that the PDFs are provided in version 2.0 format, but my application can only read version 1.7. I would like to convert the version 2 files back into a version 1.7 file so that my application can read it.
I am using ghostscript version 9.27, and have tried several commands, but each time I end up with an empty PDF. Example:
/usr/local/bin/gs \
-q -dNOPROMPT \
-dBATCH \
-dDEVICEWIDTH=595 \
-dDEVICEHEIGHT=842 \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-sFileName=pdf-version-2.pdf \
-sOutputFile=fileout.pdf
There is no error, just an empty PDF. The "file" command does give the expected output "PDF document, version 1.7" but that's not much good when the file is blank. Any help greatly appreciated!

OK so I think the problem is your command line (as pointed out to me by one of my colleagues). You've specified -sFileName=pdf-version-2.pdf, which looks like you're trying to specify the input file.
There is no Ghostscript switch -sFileName, you specify the input filename(s) by simply putting the name on the command line. So you really want:
/usr/local/bin/gs \
-q -dNOPROMPT \
-dBATCH \
-dDEVICEWIDTH=595 \
-dDEVICEHEIGHT=842 \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-sOutputFile=fileout.pdf \
pdf-version-2.pdf
For reasons that were good 30 years ago, the Ghostscript command line switches are copied into the PostScript environment, where they can be accessed by PostScript programs. So while it's true (and possibly the source of your confusion) that some of the utility programs shipped with Ghostscript do use -sFileName, Ghostscript itself doesn't; it just defines a PostScript variable using that name, so that programs can read it.
Because you've specified BATCH and NOPROMPT, but haven't specified an input file, the interpreter starts up, erases the current page to white, then exits. Closing the pdfwrite device causes it to write out the current content of the page, which is, well, white, resulting in your empty PDF file.
The slightly modified command line above worked well for me, but as I noted in my comments specifying DEVICEWIDTHPOINTS and DEVICEHEIGHTPOINTS won't actually do anything.
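The point of the answer can be sketched as a small wrapper that takes the input file as a plain positional argument and refuses to run without one, so that Ghostscript never starts up with nothing to read. This is only a sketch: the function name downgrade_pdf is made up, the page-size switches are left out because the answer says they have no effect here, and -dNOPAUSE is used alongside -dBATCH as in the other Ghostscript command on this page.

```shell
#!/bin/sh
# Re-write a PDF through pdfwrite at compatibility level 1.7.
# Usage: downgrade_pdf INPUT.pdf OUTPUT.pdf
# (downgrade_pdf is a hypothetical helper name.)
downgrade_pdf() {
  input=$1
  output=$2
  # Without an input file, Ghostscript would just emit a blank page.
  if [ ! -f "$input" ]; then
    echo "downgrade_pdf: input file not found: $input" >&2
    return 1
  fi
  # The input file is given as a bare argument, not via a -s switch.
  gs -q -dNOPAUSE -dBATCH \
    -sDEVICE=pdfwrite \
    -dCompatibilityLevel=1.7 \
    -sOutputFile="$output" \
    "$input"
}
```

For example, downgrade_pdf pdf-version-2.pdf fileout.pdf corresponds to the corrected command above.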

How to change header ("Contents") of automatic TOC when using Pandoc?

When converting markdown to pdf with pandoc (version 1.12.1), the ToC option adds an English header: "Contents".
Since my document is in Dutch, I would like to put the Dutch equivalent of "Contents" there. Unfortunately I couldn't find any configuration option for this, nor did I find any clues in the default.latex file.
My command:
pandoc -S --toc essay.md --biblio "MCM Essay.bib" --csl apa.csl -o mcm.pdf
I'm using Windows.
I use MiKTeX, as in the pandoc instructions.

The string "Contents" is not supplied by pandoc, but by latex (which pandoc calls to create the PDF).
Try adding
-Vlang=dutch
to your command line. This will be passed to latex in the documentclass options, and LaTeX will provide the right string.

Adding
-V toc-title="My Custom TOC Header"
to the pandoc command line will also work. See https://pandoc.org/MANUAL.html#variables-set-automatically.

Pandoc and foreign characters

I've been trying to use Pandoc to convert some Markdown into a PDF file. This is a sample that Pandoc will not convert for me:
# Header!
## Sub Header
themselves derived respectively from the Greek ἀναρχία i.e. 'anarchy'
That's just something I grabbed from the top of the wikipedia database dump. Pandoc doesn't like that at all. This is the error message it gives me:
pandoc: Error producing PDF from TeX source.
! Package inputenc Error: Unicode char \u8:ἀ not set up for use with LaTeX.
See the inputenc package documentation for explanation.
Type H <return> for immediate help.
...
l.53 ...es derived respectively from the Greek ἀ
Is there a command switch I can give it to get around this? I tried following the advice to do something like this, but it failed:
iconv -t utf-8 test.md | pandoc -o test.pdf
Update Before following John's advice below, see this.
Update 2 This is the command that ultimately got it working. Hopefully this will help someone:
pandoc test2.md -o test2.pdf --latex-engine=xelatex --template=my.latex --variable mainfont="DejaVu Serif" --variable sansfont=Arial
And this is the contents of my.latex:
\documentclass[$if(fontsize)$$fontsize$,$endif$$if(lang)$$lang$,$endif$$if(papersize)$$papersize$,$endif$]{$documentclass$}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
% use microtype if available
\IfFileExists{microtype.sty}{\usepackage{microtype}}{}
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[utf]{inputenc}
\usepackage{ucs}
$if(euro)$
\usepackage{eurosym}
$endif$
\else % if luatex or xelatex
\usepackage{fontspec}
\ifxetex
\usepackage{xltxtra,xunicode}
\fi
\defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
\setromanfont{TeX Gyre Pagella}
\newcommand{\euro}{€}
$if(mainfont)$
\setmainfont{$mainfont$}
$endif$
$if(sansfont)$
\setsansfont{$sansfont$}
$endif$
$if(monofont)$
\setmonofont{$monofont$}
$endif$
$if(mathfont)$
\setmathfont{$mathfont$}
$endif$
\fi
$if(geometry)$
\usepackage[$for(geometry)$$geometry$$sep$,$endfor$]{geometry}
$endif$
$if(natbib)$
\usepackage{natbib}
\bibliographystyle{plainnat}
$endif$
$if(biblatex)$
\usepackage{biblatex}
$if(biblio-files)$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(listings)$
\usepackage{listings}
$endif$
$if(lhs)$
\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
$endif$
$if(highlighting-macros)$
$highlighting-macros$
$endif$
$if(verbatim-in-note)$
\usepackage{fancyvrb}
$endif$
$if(tables)$
\usepackage{longtable}
$endif$
$if(graphics)$
\usepackage{graphicx}
% We will generate all images so they have a width \maxwidth. This means
% that they will get their normal width if they fit onto the page, but
% are scaled down if they would overflow the margins.
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth
\else\Gin@nat@width\fi}
\makeatother
\let\Oldincludegraphics\includegraphics
\renewcommand{\includegraphics}[1]{\Oldincludegraphics[width=\maxwidth]{#1}}
$endif$
\ifxetex
\usepackage[setpagesize=false, % page size defined by xetex
unicode=false, % unicode breaks when used with xetex
xetex]{hyperref}
\else
\usepackage[unicode=true]{hyperref}
\fi
\hypersetup{breaklinks=true,
bookmarks=true,
pdfauthor={$author-meta$},
pdftitle={$title-meta$},
colorlinks=true,
urlcolor=$if(urlcolor)$$urlcolor$$else$blue$endif$,
linkcolor=$if(linkcolor)$$linkcolor$$else$magenta$endif$,
pdfborder={0 0 0}}
\urlstyle{same} % don't use monospace font for urls
$if(links-as-notes)$
% Make links footnotes instead of hotlinks:
\renewcommand{\href}[2]{#2\footnote{\url{#1}}}
$endif$
$if(strikeout)$
\usepackage[normalem]{ulem}
% avoid problems with \sout in headers with hyperref:
\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
$endif$
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
\setlength{\emergencystretch}{3em} % prevent overfull lines
$if(numbersections)$
$else$
\setcounter{secnumdepth}{0}
$endif$
$if(verbatim-in-note)$
\VerbatimFootnotes % allows verbatim text in footnotes
$endif$
$if(lang)$
\ifxetex
\usepackage{polyglossia}
\setmainlanguage{$mainlang$}
\else
\usepackage[$lang$]{babel}
\fi
$endif$
$for(header-includes)$
$header-includes$
$endfor$
$if(title)$
\title{$title$}
$endif$
\author{$for(author)$$author$$sep$ \and $endfor$}
\date{$date$}
\begin{document}
$if(title)$
\maketitle
$endif$
$for(include-before)$
$include-before$
$endfor$
$if(toc)$
{
\hypersetup{linkcolor=black}
\setcounter{tocdepth}{$toc-depth$}
\tableofcontents
}
$endif$
$body$
$if(natbib)$
$if(biblio-files)$
$if(biblio-title)$
$if(book-class)$
\renewcommand\bibname{$biblio-title$}
$else$
\renewcommand\refname{$biblio-title$}
$endif$
$endif$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(biblatex)$
\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$
$endif$
$for(include-after)$
$include-after$
$endfor$
\end{document}

Use the --pdf-engine=xelatex option.

By default, Pandoc uses the pdflatex engine when converting markdown files to pdf files. pdflatex does not handle Unicode characters as smoothly as xelatex, so you should try xelatex instead. But merely switching to xelatex is not enough. As is often the case, you also need to choose a font that contains glyphs for the Unicode characters you want to typeset.
I am a Chinese user, so take Chinese for example. If you have a test.md which contains the following content:
你好汉字
you can use the following command to compile this markdown file:
pandoc --pdf-engine=xelatex -V CJKmainfont="KaiTi" test.md -o test.pdf
In the above command, --pdf-engine=xelatex selects the LaTeX engine (in newer versions of Pandoc, the --latex-engine option is deprecated). -V CJKmainfont="KaiTi" selects a font that supports Chinese. For other languages, you may use the flag -V mainfont="<FONT_NAME>".
How to find a font which supports your language
In order to find a font which supports your language, you need to know your language code. Then, if you are on a Linux system or a Windows system with TeX Live installed, you can use the following command to find a valid font for your language:
fc-list :lang=zh #find the font which support Chinese (language code is `zh`)
The output on my Linux system is shown below
If you choose to use, e.g. the font Source Han Serif CN, then use the following command to compile your markdown file:
pandoc --pdf-engine=xelatex -V CJKmainfont="Source Han Serif CN" test.md -o test.pdf

UPDATE: the answer below seems to be valid for pandoc 1.x but with later versions the syntax has changed
Coming back to this post five years later, the issue is still there. The command
pandoc -s test.md -t latex -o test.pdf
fails when test.md contains text with non-latin characters, Greek, Cyrillic, CJK, Hebrew and Arabic included.
LaTeX was designed before Unicode and its support for different character sets is robust in some areas but far from comprehensive, so the advice to use XeLaTeX is valid yet requires one to choose the main font carefully, since there is no automatic choice.
Below is a small taxonomy of possible issues and some solutions. All tested with Pandoc 1.19.
Cyrillic
Support for Cyrillic alphabet in LaTeX is provided via T2A font encoding.
Consider a small sample:
# Header
## Subheader
Tetris (Russian: Тетрис) quoting Wikipedia is a tile-matching puzzle
video game
Running this example with pandoc would fail with:
! Package inputenc Error: Unicode char Т (U+422)
(inputenc) not set up for use with LaTeX.
See the inputenc package documentation for explanation.
A fix is available because fontenc is a predefined variable in the default.latex template.
Running this example with
pandoc -t latex -o tetris.pdf -V fontenc=T2A cyrillic.md
would produce correct rendering
This however would not handle other language features correctly such as hyphenation. A better way would be to use Babel and have it select the correct font encoding.
pandoc -t latex -o tetris.pdf -V lang -V babel-lang=russian cyrillic.md
Or to switch languages with Babel commands inside Markdown
# Header
## Subheader
Tetris (Russian: \foreignlanguage{russian}{Тетрис}) quoting Wikipedia
is a tile-matching puzzle video game
And run with
pandoc -t latex -o tetris.pdf -V lang -V babel-lang=english \
-V babel-otherlangs=russian cyrillic2.md
Greek
The example in the original post contains characters both from the main and extended Greek Unicode codepages.
Anyway, the widely used LGR greek font encoding is not covered by LaTeX 3 project and is classified as a local encoding, i.e. it may vary from site to site and from system to system according to the LaTeX Encoding Guide.
On TeX Live the following packages need to be installed: texlive-greek-inputenc, texlive-greek-fontenc and texlive-cbfonts. Note that you need Babel 3.9 or later.
However the result of
pandoc -t latex -o anarchy.pdf -V fontenc=LGR greek.md
may appear unexpected.
To correct this issue, one has to set up the LaTeX Babel package correctly and insert commands to switch between the languages in the original text:
# Header!
## Sub Header
themselves derived respectively from the Greek \textgreek{ἀναρχία}
i.e. 'anarchy'
Compiling this with the following command
pandoc -s greek2.md -t latex -V fontenc=T2A -V lang -V babel-lang=english \
-V babel-otherlangs=greek -o greek.pdf
would produce the output exactly as you would expect it to be:
XeLaTeX
All of this would not be needed if we were using XeLaTeX.
Just running the original example with
pandoc -s greek.md --latex-engine=xelatex -t latex -o greek.pdf
would produce
Because the font does not contain anything in the greek character positions the output contains some white space instead.
Selecting one of the popular fonts as the new mainfont would help a bit
pandoc -s greek.md --latex-engine=xelatex \
-V mainfont="Liberation Serif" -t latex -o greek.pdf
However characters from the extended Greek codepage such as the small letter alpha with psili accent are not rendered.
The Font Setup for Greek with XeTeX/LuaTeX Guide suggests to use DejaVu, Libertine or Free font families.
Indeed with DejaVu Serif, Linux Libertine O as well as Tempora and perhaps some other fonts, the result would be as expected. See below the rendering with XeLaTeX and Linux Libertine fonts.
pandoc -s greek.md --latex-engine=xelatex -V mainfont="Linux Libertine O" \
-t latex -o greek.pdf

Works for Cyrillic characters
pandoc myfile.md --pdf-engine=xelatex -V mainfont=Arial

You can use --latex-engine=xelatex, as said before, but the best approach I have found is to use the lang variable to specify the document language in the header, like this: lang: ru-RU. A working example on my Debian workstation:
---
title: Lady Macbeth de Mzensk (Chostakovitch, livret d'Alexandre Preis, 1934)
lang: ru-RU
---
# Acte I / Tableau 1
*[Народ ненадежный]*
Ха, ха, ха, ха, ха, ха, ха. *[...]* Чуыствуем
На кого ты нас покидаешь?
Без хозяина будет скучно,
скучно, тоскливо, безрадостно.
Не работа. Без тебя невеселье. Воз вращайся
Как можно скорей, скорей !
Then you can launch:
$ pandoc -o your-file-output.pdf your-source-file.md

If you are using LaTeX intermediate output, then you can use inline \mbox{t\'ext} to get accented characters. Without the \mbox{}, the backslash often isn't interpreted correctly by the Pandoc parser.

I had a similar issue trying to get mathematical symbols to show up in the output.
As others have mentioned, with recent pandoc versions (v2.2.3.2 in my case) the option to use is --pdf-engine=xelatex. I did not need to specify a font in this case:
pandoc -o MyDoc.pdf --pdf-engine=xelatex MyDoc.md
I did get an error that the latinmodern-math font was missing. I installed it using:
tlmgr install collection-fontsrecommended

Wrong encoding when updating PDF metadata using Ghostscript and pdfmark

I have a base PDF file and want to change its title to Chinese (UTF-8) using Ghostscript and pdfmark, with a command like the one below:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=result.pdf base.pdf pdfmarks
The pdfmarks file (encoded as UTF-8 without a BOM) is below:
[ /Title (敏捷开发)
/Author (Larry Cai)
/Producer (xdvipdfmx (0.7.8))
/DOCINFO pdfmark
The command executes successfully, but when I check the properties of result.pdf,
the title has been changed to æŁ‘æ“·å¼•å‘
Please give me hints on how to solve this. Are there any parameters for this in the gs command or in pdfmark?

The PDF Reference states that the Title entry in the document info dictionary is of type 'text string'. Text strings are defined as using either PDFDocEncoding or UTF-16BE with a Byte Order Mark (see page 158 of the 1.7 PDF Reference Manual).
So you cannot specify a Title using UTF-8 without a BOM.
I would imagine that if you replace the Title string with a string defining the content using UTF-16BE with a BOM then it will work properly. I would suggest you use a hex string rather than a regular PostScript string to specify the data, simply for ease of use.

Using the idea from Happyman Chiu, my solution is as follows. Get a UTF-16BE string with a BOM by running
echo -n '(敏捷开发)' | iconv -t utf-16 |od -x -A none | tr -d ' \n' | sed 's/./\U&/g;s/^/</;s/$/>/'
You will get <FEFF0028654F63775F0053D10029>. Substitute this for the title:
/Title <FEFF0028654F63775F0053D10029>
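A variant of the same idea that does not depend on the machine's byte order is sketched below. The pipeline above appears to rely on iconv and od -x both using the host byte order, and its hex string also encodes the surrounding parentheses (0028 and 0029) as part of the title. The sketch requests UTF-16BE explicitly, dumps single bytes, and adds the FEFF byte order mark itself. The function name pdf_text_string is made up for this example.

```shell
#!/bin/sh
# Print a PDF hex string (UTF-16BE with BOM) for a UTF-8 argument.
# Usage: pdf_text_string 'some text'
# (pdf_text_string is a hypothetical helper name.)
pdf_text_string() {
  hex=$(printf '%s' "$1" \
    | iconv -f UTF-8 -t UTF-16BE \
    | od -An -tx1 \
    | tr -d ' \n' \
    | tr 'a-f' 'A-F')
  # FEFF is the byte order mark required for UTF-16BE text strings.
  printf '<FEFF%s>\n' "$hex"
}

pdf_text_string '敏捷开发'   # prints <FEFF654F63775F0053D1>
```

The result can be used in the pdfmarks file in the same way, for example /Title <FEFF654F63775F0053D1>.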

Following "pdfmark for docinfo metadata in pdf is not accepting accented characters in Keywords or Subject",
I use this function to create the string from UTF-8 for info.txt, which is then used by the gs command.
function str_in_pdf($str){
$cmd = sprintf("echo '%s'| iconv -t utf-16 |od -x -A none",$str);
exec($cmd,$out,$ret);
return "<" . implode("",$out) .">";
}
