I am using Pandoc to convert a markdown file to a PDF and I have some issues with creating references to headings with leading numbers.
Here is the code:
Take me to the [first paragraph](#1-paragraph)
## 1 Paragraph
In the converted PDF the link does not work.
When I remove the leading number everything works fine.
So what's the correct way to link to this kind of heading?
A good way to go about this is to look at pandoc's “native” output, i.e., the internal representation of the document after parsing:
$ echo '## 1 Paragraph' | pandoc -t native
[ Header
2
( "paragraph" , [] , [] )
[ Str "1" , Space , Str "Paragraph" ]
]
The auto-generated ID for the heading is paragraph. The reason for that is that HTML4 doesn't allow identifiers that start with numbers, so pandoc skips those. Hence, [first paragraph](#paragraph) will work.
However, GitHub Flavored Markdown is written with HTML5 in mind, and numbers are allowed as the first id character in that case. Pandoc supports GitHub's scheme as well, and those auto-identifiers are enabled with --from=markdown+gfm_auto_identifiers.
Probably better than manual numbering of headings is to call pandoc with --number-sections (or -N); the numbering will be performed automatically.
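The difference between the two identifier schemes can be sketched in Python. This is a rough approximation of both algorithms, not pandoc's actual code, and it ignores edge cases like underscores and Unicode:

```python
import re

def pandoc_auto_id(heading):
    # Approximation of pandoc's default auto_identifiers:
    # lowercase, punctuation dropped, spaces turned into hyphens,
    # then leading non-letters stripped (HTML4 ids must start with a letter).
    s = re.sub(r'[^\w\s-]', '', heading.lower())
    s = re.sub(r'\s+', '-', s.strip())
    s = re.sub(r'^[^a-z]+', '', s)
    return s or 'section'

def gfm_auto_id(heading):
    # Approximation of gfm_auto_identifiers: a digit may come first.
    s = re.sub(r'[^\w\s-]', '', heading.lower())
    return re.sub(r'\s+', '-', s.strip())

print(pandoc_auto_id('1 Paragraph'))  # paragraph
print(gfm_auto_id('1 Paragraph'))     # 1-paragraph
```

This shows why `[first paragraph](#paragraph)` works with the defaults, while `#1-paragraph` requires `+gfm_auto_identifiers`.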
I am writing markdown files in Obsidian.md and trying to convert them via Pandoc and LaTeX to PDF. Plain text converts fine this way; however, in Obsidian I use ==equal signs== to highlight something, and this doesn't work in LaTeX.
So I'd like to create a filter that either removes the equal signs entirely, or replaces them with something LaTeX can render, e.g. \hl{something}. I think this would be the same process.
I have a filter that looks like this:
return {
  {
    Str = function (elem)
      if elem.text == "hello" then
        return pandoc.Emph {pandoc.Str "hello"}
      else
        return elem
      end
    end,
  }
}
This works: it replaces any instance of "hello" with an italicized version of the word. However, it only works on whole words; if "hello" were part of a word, it wouldn't be touched. Since the equal signs are read as part of a word, the filter won't touch those.
How do I modify this (or, please, suggest another filter) so that it can replace and change parts of a word?
Thank you!
A string like Hello, World! becomes a list of inlines in pandoc: [ Str "Hello,", Space, Str "World!" ]. Lua filters don't make matching on that particularly convenient: the best method is currently to write a filter for Inlines and then iterate over the list to find matching items.
For a complete example, see https://gist.github.com/tarleb/a0646da1834318d4f71a780edaf9f870.
Assume we have already found the highlighted text and converted it to a Span with class mark. Then we can convert that to LaTeX with
function Span (span)
  if span.classes:includes 'mark' then
    return {pandoc.RawInline('latex', '\\hl{')} ..
      span.content ..
      {pandoc.RawInline('latex', '}')}
  end
end
Note that the current development version of pandoc, which will become pandoc 3 at some point, supports highlighted text out of the box when called with
pandoc --from=markdown+mark ...
E.g.,
echo '==Hi Mom!==' | pandoc -f markdown+mark -t latex
⇒ \hl{Hi Mom!}
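If upgrading pandoc is not an option, the same rewrite could also be done as a plain-text pre-processing step before pandoc runs. Here is a minimal Python sketch; note that a regex like this is a blunt instrument and will not respect code blocks or escaped equal signs:

```python
import re

def mark_to_hl(text):
    # Rewrite ==...== as \hl{...} before handing the file to pandoc.
    # Non-greedy match so adjacent highlights stay separate.
    return re.sub(r'==(.+?)==', r'\\hl{\1}', text)

print(mark_to_hl('Say ==Hi Mom!== loudly'))  # Say \hl{Hi Mom!} loudly
```

A Lua filter operating on the parsed document, as in the linked gist, remains the more robust approach.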
I use online SHA256 converters to calculate a hash for a given file. There, I have seen an effect I don't understand.
For testing purposes, I wanted to calculate the hash for a very simple file. I named it "test.txt", and its only content is the string "abc", followed by a new line (I just pressed enter).
Now, when I put "abc" and newline into a SHA256 generator, I get the hash
edeaaff3f1774ad2888673770c6d64097e391bc362d7d6fb34982ddf0efd18cb
But when I put the complete file into the same generator, I get the hash
552bab6864c7a7b69a502ed1854b9245c0e1a30f008aaa0b281da62585fdb025
Where does the difference come from? I used this generator (in fact, I tried several ones, and they always yield the same result):
https://emn178.github.io/online-tools/sha256_checksum.html
Note that this difference does not arise without newlines. If the file just contains the string "abc", the hash is
ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
for the file as well as just for the content.
As noted in my comment, the difference is caused by how newline characters are represented across different operating systems (see details here):
On UNIX and UNIX-like systems, newlines are represented by a line feed character (\n).
On DOS and Windows systems, newlines are represented by a carriage return followed by a line feed character (\r\n).
Compare the following two commands and their output, corresponding to the SHA256 values in your question:
echo -en "abc\n" | sha256sum
edeaaff3f1774ad2888673770c6d64097e391bc362d7d6fb34982ddf0efd18cb
echo -en "abc\r\n" | sha256sum
552bab6864c7a7b69a502ed1854b9245c0e1a30f008aaa0b281da62585fdb025
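The same three digests can be reproduced with Python's hashlib; only the trailing bytes differ between the inputs:

```python
import hashlib

# Hash the same byte strings as above; only the line ending differs.
print(hashlib.sha256(b"abc").hexdigest())      # no newline
print(hashlib.sha256(b"abc\n").hexdigest())    # UNIX line ending (LF)
print(hashlib.sha256(b"abc\r\n").hexdigest())  # DOS/Windows line ending (CRLF)
```

The second and third lines print the two hashes from the question, confirming that the editor saved the file with a CRLF line ending.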
The issue you are having could come from the character encoding of the newline.
On Windows the newline is represented by \r\n, and on Linux by \n.
These two have different decimal values (\r is 13 and \n is 10).
You can find more info here:
https://en.wikipedia.org/wiki/Newline
https://en.wikipedia.org/wiki/List_of_Unicode_characters
I faced the same issue, but providing the data in hex mode helped me understand the actual behavior.
Canonicalization of the data needs to be performed before the SHA calculation, which eliminates such issues. Canonicalization needs to happen both at the generation side and at the verification side.
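One way to sketch such canonicalization in Python (an illustrative normalization for text data, not a standard; it assumes line-ending differences carry no meaning):

```python
import hashlib

def canonical_sha256(data: bytes) -> str:
    # Normalize Windows (\r\n) and old-Mac (\r) line endings to \n
    # before hashing, so the digest is platform independent.
    canonical = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.sha256(canonical).hexdigest()

# Both variants now produce the same digest:
print(canonical_sha256(b"abc\n"))
print(canonical_sha256(b"abc\r\n"))
```

Both sides of a verification must apply the identical normalization, otherwise the digests will still disagree.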
This is probably a rather basic question, but I'm having a bit of trouble figuring it out, and it might be useful for future visitors.
I want to get at the raw data inside a PDF file, and I've managed to decode a page using the Python library PyPDF2 with the following commands:
import PyPDF2
with open('My PDF.pdf', 'rb') as infile:
    mypdf = PyPDF2.PdfFileReader(infile)
    raw_data = mypdf.getPage(1).getContents().getData()
    print(raw_data)
Looking at the raw data provided, I have begun to suspect that ASCII characters preceding carriage returns are significant: every carriage return that I've seen is preceded by one. It seems like they might be some kind of token identifier. I've already figured out that /RelativeColorimetric is associated with the sequence ri\r. I'm currently looking through the PDF 1.7 standard Adobe provides, and I know an explanation is in there somewhere, but I haven't been able to find it yet in that 756-page behemoth of a document.
The defining thing here is not the \r (it is just inserted instead of a regular space, for readability) but the fact that ri is an operator.
A PDF content stream uses a stack-based postfix (reverse Polish) notation: value1 value2 ... valueN operator
The full syntax of your ri, for example, is explained in Table 57 on p.127:
intent ri (PDF 1.1) Set the colour rendering intent in the graphics state (see 8.6.5.8, "Rendering Intents").
and the idea is that this indeed appears in this order inside a content stream. (I tried to find an appropriate example of your ri in use but could not find one; not even in the ISO PDF itself that you referred to.)
A random stream snippet from elsewhere:
q
/CS0 cs
1 1 1 scn
1.5 i
/GS1 gs
0 -85.0500031 -14.7640076 0 287.0200043 344.026001 cm
BX
/Sh0 sh
EX
Q
(the indentation comes courtesy of my own PDF reader) shows operands (/CS0, 1 1 1, 1.5 etc.), with the operators (cs, scn, i etc.) at the end of each line for clarity.
This is explained in 7.8.2 Content Streams:
...
A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical Conventions." It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See EXAMPLE 4 in 7.4, "Filters," for an example of a content stream.
(my emphasis)
7.2.2 Character Set specifies that inside a content stream, whitespace characters such as tab, newline, and carriage return, are just that: separators, and may occur anywhere and in any number (>= 1) between operands and operators. It mentions
NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
– to which I can add that most PDF creating software indeed attempts to delimit 'lines' consisting of an operands-operator sequence with returns.
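The operand/operator structure can be illustrated with a toy Python tokenizer. This is a deliberately simplified sketch, not a real content-stream parser (a real one must also handle strings, arrays, dictionaries, and comments, where whitespace splitting breaks down), and the operator set here is just the tiny subset from the snippet above:

```python
# Any run of PDF whitespace (space, tab, \r, \n, ...) separates tokens,
# and each operator consumes the operands pushed before it.
stream = b"/CS0 cs 1 1 1 scn 1.5 i"
tokens = stream.split()  # splits on any whitespace run

OPERATORS = {b"cs", b"scn", b"i"}  # tiny subset, for this demo only
stack, ops = [], []
for tok in tokens:
    if tok in OPERATORS:
        ops.append((tok, stack[:]))  # operator takes the pending operands
        stack.clear()
    else:
        stack.append(tok)

print(ops)
# [(b'cs', [b'/CS0']), (b'scn', [b'1', b'1', b'1']), (b'i', [b'1.5'])]
```

Replacing the single spaces in `stream` with `\r` or `\n` produces exactly the same token list, which is why the carriage returns in the question carry no meaning of their own.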
I cannot find any "official" documentation on whether the keywords and keyword phrases in the metadata of a PDF file are to be separated by a comma or by a comma with a space.
The following example demonstrates the difference:
keyword,keyword phrase,another keyword phrase
keyword, keyword phrase, another keyword phrase
Any high-quality references?
The online sources I found are of low quality.
E.g., an Adobe press web page says "keywords must be separated by commas or semicolons", but in its example we see a semicolon followed by a space before the first keyword and between each pair of neighboring keywords. The example contains no keyword phrases.
The keywords metadata field is a single text field - not a list. You can choose whatever is visually pleasing to you. The search engine which operates on the keyword data may have other preferences, but I would imagine that either comma or semicolon would work with most modern search engines.
Reference: PDF 32000-1:2008, page 550 (available from Adobe and the Internet Archive)
ExifTool, for example, parses for comma-separated values, but if it does not find a comma it will split on spaces:
# separate tokens in comma or whitespace delimited lists
my @values = ($val =~ /,/) ? split /,+\s*/, $val : split ' ', $val;
foreach $val (@values) {
    $et->FoundTag($tagInfo, $val);
}
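The same splitting logic can be expressed in Python (an approximation of the ExifTool behavior, for illustration):

```python
import re

def split_keywords(val):
    # Split on commas (plus any trailing spaces) when a comma is present,
    # otherwise fall back to splitting on whitespace.
    return re.split(r',+\s*', val) if ',' in val else val.split()

print(split_keywords('keyword,keyword phrase,another keyword phrase'))
print(split_keywords('keyword, keyword phrase, another keyword phrase'))
```

Both comma styles from the question's example yield the same three keyword phrases, so as far as this parser is concerned the space after the comma is irrelevant.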
I don't have a "high-quality reference", but if I generate a PDF using LaTeX, I do it the following way.
I add the following line to my main.tex:
\usepackage[a-1b]{pdfx}
Then I write a file main.xmpdata and add these lines:
\Title{My Title}
\Author{My Name}
\Copyright{Copyright \copyright\ 2018 "My Name"}
\Keywords{KeywordA\sep
KeywordB\sep
KeywordC}
\Subject{My Short Description}
After generating the PDF with pdflatex, I used a Python script based on pdfminer.six to extract the metadata:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fp = open('main.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)
if 'Metadata' in doc.catalog:
    metadata = resolve1(doc.catalog['Metadata']).get_data()
    print(metadata)  # The raw XMP metadata
The part with the keywords then looks like this:
...<rdf:Bag><rdf:li>KeywordA</rdf:li>\n <rdf:li>KeywordB...
Looking at the properties of main.pdf with Adobe Acrobat Reader DC, I find the following entry in the Keywords section:
;KeywordA;KeywordB;KeywordC
CommonLook claim to be "a global leader in electronic document accessibility, providing software products and professional services enabling faster, more cost-efficient, and more reliable processes for achieving compliance with the leading PDF and document accessibility standards, including WCAG, PDF/UA and Section 508."
They provide the following advice on PDF metadata:
Pro Tip: When you’re entering Keywords into the metadata, separate
them with semicolons as opposed to commas.
although they give no further reasoning as to why this is the preferred choice.
I would like to get quotation marks like these „...”, but when I process my markdown text with Pandoc, it gives me “...”. It probably boils down to the question of how to make Pandoc use locale settings. Here is my command line:
pandoc -o out.pdf -f markdown -t latex in.md
You may want to specify the language for the whole document, so that it not only affects the quotes, but also ligatures, unbreakable spaces, and other locale specifics.
In order to do that, you may specify the lang option. From pandoc's manual:
lang: identifies the main language of the document, using a code according to BCP 47 (e.g. en or en-GB). For some output formats,
pandoc will convert it to an appropriate format stored in the
additional variables babel-lang, polyglossia-lang (LaTeX) and
context-lang (ConTeXt).
Moreover, the same manual states that:
if your LaTeX template or any included header file call for the csquotes package, pandoc will detect this automatically and use \enquote{...} for quoted text.
In my opinion, the best way to ensure quotes localization is thus to add:
---
lang: fr-FR
header-includes:
- \usepackage{csquotes}
---
Or even better, to edit the pandoc default template
pandoc -D latex > ~/.pandoc/templates/default.latex
and permanently add \usepackage{csquotes} to it.
Pandoc currently (Nov. 2015) only generates English-style quotes. But you have two options:
You can use the --no-tex-ligatures option to turn off the --smart typography option which is turned on by default for LaTeX output. Then use the proper unicode characters (e.g. „...”) you desire and use a --latex-engine that supports unicode (lualatex or xelatex).
You could use \usepackage[danish=quotes]{csquotes} or similar in your Pandoc LaTeX template. From the README:
If the --smart option is specified, pandoc will produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.”
Note: if your LaTeX template calls for the csquotes package, pandoc will detect this automatically and use \enquote{...} for quoted text.
Even though @mb21 already provided an answer, I would like to add that nowadays it's possible to simply include what you need in (for example) your YAML metadata block, so it is no longer necessary to create your own template. Example:
---
title: foo
author: bar
# other stuff
header-includes:
- \usepackage[german=quotes]{csquotes}
---
# contents