How to make "pandoc" use a specific font when converting a plaintext file to PDF? - pdf

After a lot of trouble, I was finally able to run the command without errors:
pandoc -i 1.txt -o 1.pdf
The result is a PDF with completely messed up text because it uses some other font than Courier[ New]. Some varying-width, default font.
After reading and searching for a long time, I found this: https://pandoc.org/MANUAL.html#creating-a-pdf
The option "fontfamily" is mentioned, so I tried to do:
pandoc -i 1.txt -o 1.pdf --fontfamily=Courier
However, this results in:
Unknown option --fontfamily.
Try pandoc --help for more information.
I have looked through the entire "pandoc --help" output without finding any mention of fonts.
How do I set the font to be used?
(I'm trying my very best to not also add: "and why is it so incredibly difficult/cryptic/undocumented to do the most basic imaginable thing?"...)
I'm not even sure that this will fix all the problems. I just assume that the document is all messed up because the font isn't using fixed-width letters.

Related

How to identify the pdf object in raw pdf file?

I want to remove certain objects using programs.
Using cpdf I can get the objects, if I can somehow identify the objects that I want to delete, then I should be able to modify pdf files with programs.
$ cpdf in.pdf -output-json -output-json-parse-content-streams -o out.json
$ cpdf -j out.json -o out.pdf
However, I can not find out the object corresponding to my target text. For example, text search does not work on a raw pdf file. What is the best way to identify the target object of a text?
EDIT: Here is a test pdf. Please remove XYZ from the top of each page. Note that the test is a significant simplification of the real pdf file. So the solution should not be so simple so that it can not be applied to real complicated pdf files.
curl -s https://i.stack.imgur.com/whsnm.gif | tail -c +43 > test.pdf
The output of cpdf -output-json -output-json-parse-content-streams may or may not contain text which is recognisable to you. This depends on the font encodings in use, and the way in which text is layed out. In your file, for example, the painting of the string "XYZ" is represented as
[ "\u0000;\u0000<\u0000=", "Tj" ]
This is a string representing three codepoints indexing into the font. Cpdf presently has no way to show you what actual text this corresponds to; a future version will.
So I don't think your task can be done via cpdf -output-json in the general case, or indeed in this specific case.

Inkscape "PDF + Latex" export

I'm using inkscape to produce vector figures, save them in SVG format to export them later as "PDF + Latex" much in the vein of TUG inkscape+pdflatex guide.
Trying to produce a simple figure, however, turns out to be extremely frustating.
The first figure
is an example of the figure I would like to export in the form of "PDF + Latex" (shown here in PNG format).
If I export this to a PDF figure without latex macros the PDF produced looks exactly the same, except for some minor differences with the fonts used to render the text.
When I try to export this using the "PDF + Latex" option the PDF file produced consists on a PDF document of 2 pages (again as .png here):
This, of course, does not looks good when compiling my latex document. So far the guide at TUG has been very helpful, but I still can't produce a working "PDF + Latex" export from inkscape.
What am I doing wrong?
I worked around this by putting all the text in my drawing at the top
select text and then Object -> Raise to top
Inkscape only generates the separate pages if the text is below another object.
I asked this question on the Inkscape online discussion page and got some very helpful guidance from one of the users there.
This is a known bug https://bugs.launchpad.net/ubuntu/+bug/1417470 which was inadvertently introduced in Inkscape 0.91 in an attempt to fix a previous bug https://bugs.launchpad.net/inkscape/+bug/771957.
It seems this bug does two things:
The *.pdf_tex file will have an extra \includegraphics statement which needs to be deleted manually as described in the link to the bug above.
The *.pdf file may be split into multiple pages, regardless of the size of the image. In my case the line objects were split off onto their own page. I worked around this by turning off the text objects (opacity to zero) and then doing a standard PDF export.
If you can execute linux commands, this works:
# Generate the .pdf and .pdf_tex files
inkscape -z -D --file="$SVGFILE" --export-pdf="$PDFFILE" --export-latex
# Fix the number of pages
sed -i 's/\\\\/\n/g' ${PDFFILE}_tex;
MAXPAGE=$(pdfinfo $PDFFILE | grep -oP "(?<=Pages:)\s*[0-9]+" | tr -d " ");
sed -i "/page=$(($MAXPAGE+1))/,\${/page=/d}" ${PDFFILE}_tex;
with:
$SVGFILE: path of the svg
$PDF_FILE: path of the pdf
It is possible to include these commands in a script and execute it automatically when compiling your tex file (so that you don't have to manually export from inkscape each time you modify your svg).
Try it with an illustration that is less wide.
Alternatively, use a wider paperwidth setting.

text to pdf with utf8 encoding (alternative to a2ps)

The programm a2ps does not support utf-8. At least my version does only
support the latin-X encodings:
a2ps --list=encoding
Version:
GNU a2ps 4.14
How can I convert a simple utf-8 text to postscript or pdf?
If what you actually want is to use a2ps or enscript (which is a similar tool), and if your single need is to use them with some UTF-8 document, you only have to convert your document to ISO-8859-1 or some supported encoding. Various tools allow this. For instance, here is a workflow for enscript (but you can surely do the same with a2ps):
cat document.txt | iconv -c -f utf-8 -t ISO-8859-1 | enscript -o document.ps
But you may lose some characters during the conversion because such encodings have a smaller range than UTF-8.
On the other hand, if UTF-8 is a requirement, you may rather have to look for some recent tool allowing to convert UTF-8 to PDF. I wrote myself a Python program called txt2pdf; you may find it here. Have also a look at tools like pandoc, gimli, rst2pdf or wkhtmltopdf.
You can use Vim. Open the file and execute the command :hardcopy > output.ps in normal mode. You can also do this directly from the shell. Executing
$ vim -c ":hardcopy > output.ps" -c ":quit" input.txt
in your shell will open Vim, generate the output.ps, and then close Vim.
Use paps! For instance I use it as follow:
paps --font="Monospace 10" input.txt > output.ps
and I have no problem with utf encoding.
If you need a pdf file then
pdf2ps output.ps
I've gotten acceptable results (for printing code listings) from https://github.com/arsv/u2ps
https://gitlab.com/gnomify/u2ps is the replacement of gnome-u2ps.
If the text file is small, paps converts to text to ps, which then can be fed to ps2pdf. The problem is ps file from paps causes ps2pdf to create a very big pdf file. If that is ok, this is possible. Currently, I am having a large file size pdf from paps.
There's a utility based on gnome libraries and named gnome-u2ps. It has less functionality than a2ps, and it seems that it is not maintained anymore.

How to change header ("Contents") of automatic TOC when using Pandoc?

When converting markdown to pdf with pandoc (version 1.12.1) the ToC option adds an english header: "Contents".
Since my document is in Dutch, I would like to be able to put the Dutch equivalent of contents there. But unfortunately I couldn't find any configuration options for this, neither did I found clues in the default.latex file.
My query:
pandoc -S --toc essay.md --biblio "MCM Essay.bib" --csl apa.csl -o mcm.pdf
I'm using windows
I use MIKTex, like in the pandoc instructions
The string "Contents" is not supplied by pandoc, but by latex (which pandoc calls to create the PDF).
Try adding
-Vlang=dutch
to your command line. This will be passed to latex in the documentclass options, and LaTeX will provide the right string.
Adding
-V toc-title="My Custom TOC Header"
to the pandoc command line will also work. See https://pandoc.org/MANUAL.html#variables-set-automatically.

convert pdf to svg

I want to convert PDF to SVG please suggest some libraries/executable that will be able to do this efficiently. I have written my own java program using the apache PDFBox and Batik libraries -
PDDocument document = PDDocument.load( pdfFile );
DOMImplementation domImpl =
GenericDOMImplementation.getDOMImplementation();
// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);
// Ask the test to render into the SVG Graphics2D implementation.
for(int i = 0 ; i < document.getNumberOfPages() ; i++){
String svgFName = svgDir+"page"+i+".svg";
(new File(svgFName)).createNewFile();
// Create an instance of the SVG Generator.
SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx,false);
Printable page = document.getPrintable(i);
page.print(svgGenerator, document.getPageFormat(i), i);
svgGenerator.stream(svgFName);
}
This solution works great but the size of the resulting svg files in huge.(many times greater than the pdf). I have figured out where the problem is by looking at the svg in a text editor. it encloses every character in the original document in its own block even if the font properties of the characters is the same. For example the word hello will appear as 6 different text blocks. Is there a way to fix the above code? or please suggest another solution that will work more efficiently.
Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least, it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.
To use Inkscape's command-line interface to convert a PDF to an SVG, use:
inkscape -l out.svg in.pdf
Which you can then probably call using:
Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29
I think exec() is synchronous and only returns after the process completes (although I'm not 100% sure on that), so you shoudl be able to just read "out.svg" after that. In any case, Googling "java system call" will yield more info on how to do that part correctly.
Take a look at pdf2svg (also on on github):
To use
pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]
When using all give a filename with %d in it (which will be replaced by the page number).
pdf2svg input.pdf output_page%d.svg all
And for some troubleshooting see:
http://www.calcmaster.net/personal_projects/pdf2svg/
pdftocairo can be used to convert pdf to svg. pdfcairo is part of poppler-utils.
For example to convert 2nd page of a pdf, following command can be run.
pdftocairo -svg -f 1 -l 1 input.pdf
pdftk 82page.pdf burst
sh to-svg.sh
contents of to-svg.sh
#!/bin/bash
FILES=burst/*
for f in $FILES
do
inkscape -l "$f.svg" "$f"
done
I have encountered issues with the suggested inkscape, pdf2svg, pdftocairo, as well as the not suggested convert and mutool when trying to convert large and complex PDFs such as some of the topographical maps from the USGS. Sometimes they would crash, other times they would produce massively inflated files. The only PDF to SVG conversion tool that was able to handle all of them correctly for my use case was dvisvgm. Using it is very simple:
dvisvgm --pdf --output=file.svg file.pdf
It has various extra options for handling how elements are converted, as well as for optimization. Its resulting files can further be compacted by svgcleaner if necessary without perceptual quality loss.
inkscape (#jbeard4) for me produced svgs with no text in them at all, but I was able to make it work by going to postscript as an intermediary using ghostscript.
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdf2ps -dFirstPage=$page -dLastPage=$page -dNoOutputFonts $1.pdf $1_$page.ps
inkscape -z -l $1_$page.svg $1_$page.ps
rm $1_$page.ps
done
However this is a bit cumbersome, and the winner for ease of use has to go to pdf2svg (#Koen.) since it has that all flag so you don't need to loop.
However, pdf2svg isn't available on CentOS 8, and to install it you need to do the following:
git clone https://github.com/dawbarton/pdf2svg.git && cd pdf2svg
#if you dont have development stuff specific to this project
sudo dnf config-manager --set-enabled powertools
sudo dnf install cairo-devel poppler-glib-devel
#git repo isn't quite ready to ./configure
touch README
autoreconf -f -i
./configure && make && sudo make install
It produces svgs that actually look nicer than the ghostscript-inkscape one above, the font seems to raster better.
pdf2svg $1.pdf $1_%d.svg all
But that installation is a bit much, too much even if you don't have sudo. On top of that, pdf2svg doesn't support stdin/stdout, so the readily available pdftocairo (#SuperNova) worked a treat in these regards, and here's an example of "advanced" use below:
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdftocairo -svg -f $page -l $page $1.pdf - | gzip -9 >$1_$page.svg.gz
done
Which produces files of the same quality and size (before compression) as pdf2svg, although not binary-identical (and even visually, jumping between output of the two some pixels of letters shift, but neither looks wrong/bad like inkscape did).
Inkscape does not work with the -l option any more. It said "Can't open file: /out.svg (doesn't exist)". The long form that option is in the man page as --export-plain-svg and works but shows a deprecation warning. I was able to fix and update the command by using the -o option on Inkscape 1.1.2-3ubuntu4:
inkscape in.pdf -o out.svg