Background
I am using pandoc to convert Markdown → PDF with references incorporated from a BibTeX citation database. I would like a citation in my bibliography to match the typographical conventions of the original article, namely italics and subscripts: the correlation symbols in the title below should render in the bibliography as an italic r with real subscripts (r₁₂, r₁₃, r₂₃), not as literal dollar-and-underscore markup.
I have the following citation exported from Zotero as BibTeX.
@article{stanley_restrictions_1969,
title = {Restrictions on the Possible Values of $r_{12}$, Given $r_{13}$ and $r_{23}$},
volume = {29},
issn = {0013-1644},
url = {http://dx.doi.org/10.1177/001316446902900304},
doi = {10.1177/001316446902900304},
number = {3},
urldate = {2013-01-04},
journal = {Educational and Psychological Measurement},
author = {Stanley, J. C. and Wang, M. D.},
month = oct,
year = {1969},
pages = {579--581}
}
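As described in the next paragraph, Zotero escapes the math characters on export; stripping those escapes with sed might look like the following sketch (file names are placeholders, and the first command only simulates one escaped line from an export):

```shell
# Simulate one escaped line from a Zotero BibTeX export (placeholder file).
printf '%s\n' 'title = {Values of \$r\_\{12\}\$},' > zotero.bib

# Strip the escaping Zotero adds around math: \$ -> $, \_ -> _, \{ -> {, \} -> }.
sed -e 's/\\\$/$/g' -e 's/\\_/_/g' -e 's/\\{/{/g' -e 's/\\}/}/g' zotero.bib > clean.bib

cat clean.bib
```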
Zotero escapes the dollar signs, braces, and underscores (\$r\_\{12\}\$) when I export to BibTeX format, but I just use sed to strip the escapes before invoking pandoc. Pandoc then escapes them again. If I convert from Markdown → LaTeX, pandoc produces:
Stanley, J. C., \& Wang, M. D. (1969). Restrictions on the Possible
Values of \$r\_12\$, Given \$r\_13\$ and \$r\_23\$. \emph{Educational
and Psychological Measurement}, \emph{29}(3), 579--581.
doi:10.1177/001316446902900304
which means the math markup comes through literally (escaped dollar signs and underscores) in the PDF instead of rendering as subscripts.
Question
How can one include LaTeX math in the BibTeX citations used by pandoc when converting from Markdown → PDF?
It's not supported. Here's an issue on the pandoc bug tracker. Pandoc uses bibutils to read bibtex databases, converting them to MODS XML which is then read by citeproc-hs. Unfortunately, MODS doesn't have any way of representing math. And bibutils doesn't recognize math in bibtex. So there's no clear solution at the moment -- short of writing a bibtex parser from scratch that uses pandoc to convert LaTeX in fields -- maybe not a bad idea!
The upcoming 1.12 release of pandoc will allow you to include your citation database in a YAML form inside the document itself (or in a separate file). When citations are included this way, simple math will be supported, as well as some other kinds of markup. There will be a tool for converting an existing bibtex database to the YAML form, though because that tool, like pandoc, uses bibutils, it won't convert math, and you'll have to add the math back by hand afterwards.
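For reference, a 1.12-style YAML entry for the citation above might look roughly like this (field names follow CSL-YAML conventions; treat it as a hand-written sketch, not the converter's actual output):

```yaml
---
references:
- id: stanley_restrictions_1969
  type: article-journal
  author:
  - family: Stanley
    given: J. C.
  - family: Wang
    given: M. D.
  issued:
    year: 1969
  title: Restrictions on the possible values of $r_{12}$, given $r_{13}$ and $r_{23}$
  container-title: Educational and Psychological Measurement
  volume: 29
  issue: 3
  page: 579-581
  DOI: 10.1177/001316446902900304
---
```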
Related
I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR).
I tried many packages and tools and none of them worked: Python packages, PDFBox, the Adobe API, and many others all failed to extract the text correctly, either reading the text left-to-right or decoding it wrongly.
Here are two samples from different tools:
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
and yes I can copy it and get the same rendered text.
Are there any tools that can extract Arabic text correctly?
the book link can be found here
The text in a PDF is not the same as the text used for its construction; we can see that in your example, where the page number is shown in Arabic form on the surface but is coded as the plain digit 7 in the extracted text.
A greater problem, however, is language support in fonts: in Notepad I had to accept a script font just to see something similar, and that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result of
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will, at best, look much like the samples above.
Thus, in summary, your sample 1 is as good as, if not better than, any other simple attempt.
Later Edit from B.A. comment below
I found a way to work around this: after extracting the text, I open the txt file and normalize its content using Python's unicodedata module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose the NFKC form).
Most programming languages have a normalization function.
Check here for more info about normalization:
https://unicode.org/reports/tr15/
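The normalization step described above can be sketched in a few lines of Python; NFKC folds the Arabic presentation forms that PDF fonts typically use back to their base letters:

```python
import unicodedata

def normalize_extracted(text: str) -> str:
    # NFKC compatibility normalization: presentation forms such as the
    # Arabic ligatures emitted by PDF extraction decompose to base letters.
    return unicodedata.normalize("NFKC", text)

# U+FEF7 (the lam-alef-with-hamza ligature, isolated form) becomes the
# two-letter sequence U+0644 U+0623.
print(normalize_extracted("\ufef7") == "\u0644\u0623")
```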
This is probably a rather basic question, but I'm having a bit of trouble figuring it out, and it might be useful for future visitors.
I want to get at the raw data inside a PDF file, and I've managed to decode a page using the Python library PyPDF2 with the following commands:
import PyPDF2

with open('My PDF.pdf', 'rb') as infile:
    mypdf = PyPDF2.PdfFileReader(infile)
    raw_data = mypdf.getPage(1).getContents().getData()

print(raw_data)
Looking at the raw data provided, I have begun to suspect that ASCII characters preceding carriage returns are significant: every carriage return that I've seen is preceded by one. It seems like they might be some kind of token identifier. I've already figured out that /RelativeColorimetric is associated with the sequence ri\r. I'm currently looking through the PDF 1.7 standard Adobe provides, and I know an explanation is in there somewhere, but I haven't been able to find it yet in that 756-page behemoth of a document.
The defining thing here is not the \r – it is just inserted instead of a regular space for readability – but the fact that ri is an operator.
A PDF content stream uses a stack-based postfix (reverse Polish) notation: value1 value2 ... valueN operator
The full syntax of your ri, for example, is explained in Table 57 on p.127:
intent ri (PDF 1.1) Set the colour rendering intent in the graphics state (see 8.6.5.8, "Rendering Intents").
and the idea is that it indeed appears in this order inside a content stream. (I tried to find a real example of your ri in use but could not find one; not even in the ISO PDF standard itself that you referred to.)
A random stream snippet from elsewhere:
q
/CS0 cs
1 1 1 scn
1.5 i
/GS1 gs
0 -85.0500031 -14.7640076 0 287.0200043 344.026001 cm
BX
/Sh0 sh
EX
Q
(the indentation comes courtesy of my own PDF reader) shows operands (/CS0, 1 1 1, 1.5 etc.), with the operators (cs, scn, i etc.) at the end of each line for clarity.
This is explained in 7.8.2 Content Streams:
...
A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical Conventions." It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See EXAMPLE 4 in 7.4, "Filters," for an example of a content stream.
(my emphasis)
7.2.2 Character Set specifies that inside a content stream, whitespace characters such as tab, newline, and carriage return are just that: separators, which may occur anywhere and in any number (>= 1) between operands and operators. It mentions:
NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
– to which I can add that most PDF creating software indeed attempts to delimit 'lines' consisting of an operands-operator sequence with returns.
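To make the operand/operator grouping concrete, here is a rough Python sketch (not a real PDF tokenizer: it ignores strings, comments, and inline images, and the operator set is just the handful seen in the snippet above):

```python
# Hypothetical operator set: only the operators from the snippet above.
OPERATORS = {"q", "Q", "cs", "scn", "i", "gs", "cm", "BX", "sh", "EX", "ri"}

def group_tokens(stream: str):
    """Group whitespace-separated tokens into operand...operator runs."""
    groups, current = [], []
    for token in stream.split():
        current.append(token)
        if token in OPERATORS:
            # An operator closes the current operand group.
            groups.append(current)
            current = []
    return groups

print(group_tokens("/CS0 cs 1 1 1 scn 1.5 i"))
# → [['/CS0', 'cs'], ['1', '1', '1', 'scn'], ['1.5', 'i']]
```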
I'm trying to convert a rather long docx file to a PDF via pandoc, and was wondering if there's any way to get pandoc to automatically use multi-line tables in the output PDF. I know that there's a pandoc multiline table markdown extension, so would there be any way to convert a docx file into a multiline-markdown-compliant format, or to convert it to a PDF directly like this?
On Windows you would normally print the DOCX to a PDF printer.
That is the easiest and most precise way to convert from DOCX to PDF.
I wanted to have RST as the main format.
Therefore I converted some DOCX to restructuredText (RST) with pandoc.
Table entries that were not multi-line sometimes came out multi-line, with words broken across lines.
I converted them to list-table with this script: listtable.py
(a 0 in the join parameter string joins the lines in that position/column without a space)
From RST to HTML, PDF, DOCX I use this script:
dcx.py
In the Makefile you will see the pandoc command to convert to pdf.
I doubt a direct conversion using pandoc would give good results.
Use an intermediate (light) markup text format and edit it until the conversion to PDF meets your expectations.
This only makes sense if you want a plain-text version anyway.
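If you do want to try the direct route the question asks about, the two-step invocation might look like this (an untested sketch; multiline_tables is enabled by default in pandoc's own markdown, so spelling it out is mostly documentation):

```shell
pandoc input.docx -t markdown+multiline_tables -o intermediate.md
# hand-edit intermediate.md until the tables look right, then:
pandoc intermediate.md -o output.pdf
```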
I would like to get quotation marks like these „...” but when I process my Markdown text with Pandoc, it gives me “...”. It probably boils down to the question of how to make Pandoc use locale settings. Here is my command line:
pandoc -o out.pdf -f markdown -t latex in.md
You may want to specify the language for the whole document, so that it not only affects the quotes, but also ligatures, unbreakable spaces, and other locale specifics.
In order to do that, you may specify the lang option. From pandoc's manual:
lang: identifies the main language of the document, using a code according to BCP 47 (e.g. en or en-GB). For some output formats, pandoc will convert it to an appropriate format stored in the additional variables babel-lang, polyglossia-lang (LaTeX) and context-lang (ConTeXt).
Moreover, the same manual states that:
if your LaTeX template or any included header file call for the csquotes package, pandoc will detect this automatically and use \enquote{...} for quoted text.
In my opinion, the best way to ensure quotes localization is thus to add:
---
lang: fr-FR
header-includes:
- \usepackage{csquotes}
---
Or even better, to edit the pandoc default template
pandoc -D latex > ~/.pandoc/templates/default.latex
and permanently add \usepackage{csquotes} to it.
Pandoc currently (Nov. 2015) only generates English type quotes. But you have two options:
You can use the --no-tex-ligatures option to turn off the --smart typography option which is turned on by default for LaTeX output. Then use the proper unicode characters (e.g. „...”) you desire and use a --latex-engine that supports unicode (lualatex or xelatex).
You could use \usepackage[danish=quotes]{csquotes} or similar in your Pandoc LaTeX template. From the README:
If the --smart option is specified, pandoc will produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.”
Note: if your LaTeX template calls for the csquotes package, pandoc will detect this automatically and use \enquote{...} for quoted text.
Even though @mb21 already provided an answer, I would like to add that nowadays it's possible to simply include what you need in (for example) your YAML metadata block, so it is no longer necessary to create your own template. Example:
---
title: foo
author: bar
# other stuff
header-includes:
- \usepackage[german=quotes]{csquotes}
---
# contents
This may be a stupid question, but I can't figure it out.
I have made some tables in SPSS. Now I want to get them into my LaTeX document.
What I do is right-click the table in SPSS and press export.
There I can choose between PDF and .doc. BUT the PDF file created puts the table at the top of a full page (A4 size, with "page 1" at the bottom). I do not want this; I only want the table.
Example of how it turns out:
Example of how I want it to turn out:
If I export to word, I can further save as PDF, but same problem occurs.
A screenshot works, but does not give me the picture quality I'd prefer.
Does anyone have any tips for me?
Thanks :)
Unfortunately SPSS does not provide native table export to LaTeX. It does provide table export to HTML and XLS, which can be converted post hoc to TeX tables. PDF output always forces export of the full page (very annoying for graphics as well), but you probably don't want to insert an image of the table anyway (you could crop the PDF if need be); you want a TeX table, in the same font as your document.
One thing I have done in the past to export text tables with specific markup is to use the PRINT or LIST commands to print the text table to the output (or to a text file) in a form closer to the end goal. In this NABBLE post I have some syntax that makes pandoc-flavored pipe-style markdown tables; it should be pretty clear how the same approach could be used for TeX tables (TeX tables should actually be much simpler).
Here is an example of some code using LIST to make the markup closer to TeX tables.
DATA LIST FREE / Variable (A1) Mean Median (2F4.2).
BEGIN DATA
A 3.25 2.00
B 2.56 2.50
C 9.87 10.20
END DATA.
*Using LIST to make Latex style table.
STRING Mid (A1) End (A2).
COMPUTE Mid = "&".
COMPUTE End = "\\".
LIST /VARIABLES = Variable Mid Mean Mid Median End.
And here is a screen shot of the produced output on my machine.
So here I would still have to copy-paste the text output into my TeX document (and make the header row).
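The copy-paste step could itself be scripted; here is a hypothetical Python sketch that wraps LIST-style rows in a tabular environment and adds the header row (the column alignment, header names, and the "\\" row terminator are all assumptions):

```python
# Hypothetical post-processing: wrap LIST-style output rows in a LaTeX
# tabular environment, mirroring the example data above.
rows = [
    r"A & 3.25 & 2.00 \\",
    r"B & 2.56 & 2.50 \\",
    r"C & 9.87 & 10.20 \\",
]

def to_tabular(rows, header=r"Variable & Mean & Median \\"):
    # 'lrr' alignment is an assumption: one text column, two numeric ones.
    body = "\n".join([header, r"\hline", *rows])
    return "\\begin{tabular}{lrr}\n" + body + "\n\\end{tabular}"

print(to_tabular(rows))
```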
You can also use OMS to save designed items in a variety of formats, including XML and then use an xml-to-Latex tool such as xmltex. You could probably even generate such a conversion with XSLT from the XML.
From the Viewer, you could also retrieve the table with Python scripting and use a Python-based converter tool.