Convert text file with ansi colours to pdf

I have a plain text file containing ansi escape codes for colouring text. If it matters, I generated this file using the python package colorama.
How can I convert this file to pdf with the colours properly rendered? I was hoping for a simple solution using e.g. pandoc, but any other command-line or programmatic solution would be fine as well. Example:
echo -e '\e[0;31mHello World\e[0m' > example.txt
# Convert to pdf
pandoc example.txt -o example.pdf # Gives error
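One possible route, offered as a sketch rather than a tested recipe: translate the ANSI colour codes to HTML first, then hand the HTML to an HTML-to-PDF converter such as wkhtmltopdf. The Python below handles only the eight basic foreground colours (SGR codes 30-37) and the reset codes 0 and 39. The names ansi_to_html and _wrap and the choice of CSS colour names are assumptions made for illustration; bold, background, bright and 256-colour codes are ignored.

```python
import html
import re

# SGR (Select Graphic Rendition) sequences have the form ESC [ <params> m
SGR_RE = re.compile(r"\x1b\[([0-9;]*)m")

# Basic foreground colours (SGR 30-37); the CSS names are an illustrative choice
FG_COLOURS = {
    30: "black", 31: "red", 32: "green", 33: "olive",
    34: "blue", 35: "purple", 36: "teal", 37: "silver",
}


def _wrap(chunk, colour):
    """HTML-escape a run of text and wrap it in a coloured span if needed."""
    escaped = html.escape(chunk)
    if colour is None:
        return escaped
    return f'<span style="color:{colour}">{escaped}</span>'


def ansi_to_html(text):
    """Convert basic ANSI foreground colours to HTML inside a <pre> block."""
    out = []
    colour = None  # currently active foreground colour, if any
    pos = 0
    for match in SGR_RE.finditer(text):
        chunk = text[pos:match.start()]
        if chunk:
            out.append(_wrap(chunk, colour))
        pos = match.end()
        # An empty parameter is treated as 0 (reset)
        for code in (int(p) if p else 0 for p in match.group(1).split(";")):
            if code in (0, 39):
                colour = None
            elif code in FG_COLOURS:
                colour = FG_COLOURS[code]
            # every other code is ignored in this sketch
    tail = text[pos:]
    if tail:
        out.append(_wrap(tail, colour))
    return "<pre>" + "".join(out) + "</pre>"
```

Writing the result of ansi_to_html() for the contents of example.txt to example.html and then running wkhtmltopdf example.html example.pdf should give a coloured pdf, assuming wkhtmltopdf is installed.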

Related

Batch convert svg to pdf page size

I have a number of svg files created with inkscape that contain text in non-standard fonts. As far as I understand, in order to have them printed I need to convert the text to paths. It seems that if I just use
convert input.svg output.pdf
the text is automatically converted to paths. Is this correct?
However, my problem is with the page size. The input svg files have a page size of A5, landscape, but the converted pdf files seem to be cut off on the right and bottom by about 5% of the image width/height.
Why is that? How do I fix it?
As long as you have Inkscape on your system, ImageMagick convert actually delegates the PDF export to Inkscape. You can use it directly on the command line as
inkscape -zA output.pdf input.svg
Quote from man:
Used fonts are subset and embedded.
There are some options to manipulate the export area. -C explicitly sets the page area, -D the drawing bounding box.
You could even preserve the SVG format by using
inkscape -Tl output.svg input.svg
which would convert text to path.
Lastly, since you have to batch-process multiple files, you should open a shell with
inkscape --shell
and process all files in one go. Otherwise, startup time of inkscape would be 1-3 seconds for every file. Something like:
ls -1 *.svg | awk -F. \
'{ print "-AC " $1 ".pdf " $0 }
END { print "quit" }' | \
inkscape --shell
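The same batch idea can be sketched in Python, assuming the legacy Inkscape 0.9x options used above (-A for pdf export, -C for the page area); Inkscape 1.x changed its command-line options, so this would need adapting there. The function names shell_commands and convert_all are made up for this example, and the flags are written separately so nothing depends on how bundled short options are parsed. File names containing spaces are not handled.

```python
import subprocess
from pathlib import Path


def shell_commands(svg_files):
    """Build one `inkscape --shell` command line per svg file, ending with quit."""
    lines = []
    for svg in svg_files:
        # Swap only the final extension, so "a.b.svg" becomes "a.b.pdf"
        pdf = Path(svg).with_suffix(".pdf")
        # -C: export the page area, -A: write a pdf (legacy 0.9x options)
        lines.append(f"-C -A {pdf} {svg}")
    lines.append("quit")
    return "\n".join(lines) + "\n"


def convert_all(directory="."):
    """Feed every *.svg in `directory` to a single Inkscape process."""
    svgs = sorted(str(p) for p in Path(directory).glob("*.svg"))
    subprocess.run(
        ["inkscape", "--shell"],
        input=shell_commands(svgs),
        text=True,
        check=True,
    )
```

Calling convert_all() in the directory holding the svg files starts Inkscape once, which is the point of the shell mode: the startup cost is paid a single time instead of once per file.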

Find a word with strike-through in a PDF file

Is there a programmatic way to find a specific word containing a strike-through in a PDF file?
For example: test (shown with a strike-through in the PDF)
I've tried converting the PDF to txt using the following (ghostscript) command:
gswin64.exe -sDEVICE=txtwrite -o output.txt "C:\docs\Letter.pdf"
But the word 'test' just appears as normal in output.txt
Any suggestions would be greatly appreciated

How to get the hidden text layout that tesseract creates for pdf files?

I don't have much experience with ocr. Here's what I try:
tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf
The result is a perfectly structured hidden text layout - the words are on their exact places when searching the pdf.
My question is: can I get this layout as a file (hocr or html)?
(Config parameters preferred, not API.)
What I've tried:
tesseract -l eng -psm 1 image_str007_0001.jpg output hocr
and
hocr2pdf -i image_str007_001 -o output.pdf < output.hocr
In the file output.pdf the words are badly misplaced when searching through the text. Is the second command not the correct way to create the tesseract hocr layout file, or does the hocr2pdf app not create the pdf correctly?

text to pdf with utf8 encoding (alternative to a2ps)

The program a2ps does not support utf-8. At least my version only supports the latin-X encodings:
a2ps --list=encoding
Version:
GNU a2ps 4.14
How can I convert a simple utf-8 text to postscript or pdf?
If what you actually want is to use a2ps or enscript (a similar tool), and your only need is to process a UTF-8 document, you just have to convert the document to ISO-8859-1 or another supported encoding. Various tools can do this. For instance, here is a workflow for enscript (you can surely do the same with a2ps):
cat document.txt | iconv -c -f utf-8 -t ISO-8859-1 | enscript -o document.ps
But you may lose some characters during the conversion because such encodings have a smaller range than UTF-8.
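To make the loss concrete, here is a small Python sketch of what dropping unconvertible characters means. It imitates the behaviour of iconv -c with Python's own codec machinery rather than calling iconv; the function names to_latin1 and lost_characters are invented for this example.

```python
def to_latin1(text):
    """Encode to ISO-8859-1, silently dropping characters it cannot represent."""
    return text.encode("iso-8859-1", errors="ignore")


def lost_characters(text):
    """List the distinct characters that to_latin1() would drop, in code point order."""
    # ISO-8859-1 covers exactly the code points 0x00 to 0xFF
    return sorted({ch for ch in text if ord(ch) > 0xFF})
```

For example, to_latin1("café €5") returns b'caf\xe9 5': the é survives because it exists in ISO-8859-1, while the euro sign is dropped without any warning. Running lost_characters() over a document beforehand shows what would go missing.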
On the other hand, if UTF-8 is a requirement, you may have to look for a more recent tool that can convert UTF-8 text to PDF. I wrote a Python program called txt2pdf myself; you may find it here. Also have a look at tools like pandoc, gimli, rst2pdf or wkhtmltopdf.
You can use Vim. Open the file and execute the command :hardcopy > output.ps in normal mode. You can also do this directly from the shell. Executing
$ vim -c ":hardcopy > output.ps" -c ":quit" input.txt
in your shell will open Vim, generate the output.ps, and then close Vim.
Use paps! For instance I use it as follows:
paps --font="Monospace 10" input.txt > output.ps
and I have no problem with utf encoding.
If you need a pdf file then
ps2pdf output.ps
I've gotten acceptable results (for printing code listings) from https://github.com/arsv/u2ps
https://gitlab.com/gnomify/u2ps is the replacement of gnome-u2ps.
If the text file is small, paps converts the text to ps, which can then be fed to ps2pdf. The problem is that the ps file from paps causes ps2pdf to create a very big pdf file. If that is acceptable, this approach works. Currently, I am getting a large pdf file from paps.
There's a utility based on gnome libraries and named gnome-u2ps. It has less functionality than a2ps, and it seems that it is not maintained anymore.

How to change header ("Contents") of automatic TOC when using Pandoc?

When converting markdown to pdf with pandoc (version 1.12.1) the ToC option adds an English header: "Contents".
Since my document is in Dutch, I would like to put the Dutch equivalent of "Contents" there. Unfortunately I couldn't find any configuration option for this, nor did I find any clues in the default.latex file.
My query:
pandoc -S --toc essay.md --biblio "MCM Essay.bib" --csl apa.csl -o mcm.pdf
I'm using Windows
I use MiKTeX, as in the pandoc instructions
The string "Contents" is not supplied by pandoc, but by latex (which pandoc calls to create the PDF).
Try adding
-Vlang=dutch
to your command line. This will be passed to latex in the documentclass options, and LaTeX will provide the right string.
Adding
-V toc-title="My Custom TOC Header"
to the pandoc command line will also work. See https://pandoc.org/MANUAL.html#variables-set-automatically.
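When pandoc is driven from a script, the two variables above can be passed as separate list items, which avoids shell quoting problems with titles or file names that contain spaces. A minimal Python sketch, in which the helper name pandoc_toc_command is invented for illustration and only options already shown in this thread are used:

```python
import subprocess


def pandoc_toc_command(source, output, toc_title=None, lang=None):
    """Build a pandoc argument list that requests a table of contents."""
    cmd = ["pandoc", "--toc", source, "-o", output]
    if lang:
        # Passed on to LaTeX, which then supplies the localized heading
        cmd += ["-V", f"lang={lang}"]
    if toc_title:
        # Overrides the heading text directly
        cmd += ["-V", f"toc-title={toc_title}"]
    return cmd


def build_pdf(source, output, **options):
    """Run pandoc with the generated arguments; raises if pandoc fails."""
    subprocess.run(pandoc_toc_command(source, output, **options), check=True)
```

For the question above, build_pdf("essay.md", "mcm.pdf", toc_title="Inhoud") would be one way to call it, assuming pandoc and a LaTeX installation are available.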