IPython/ Jupyter download as PDF styling - pdf

Imagine editing a typical IPython (4.x) notebook, notebook.ipynb, in the Jupyter editor. The code, graphs, and markdown get rendered exactly how you like them when previewed in the browser.
But then you "Download as PDF via LaTeX" and get something slightly different:
A centered title/ date header has been added.
The font is now serif instead of sans serif.
Section headers are numbered.
I'd like to change the default output to be a little more "what you see is what you get". In particular: I don't want a title header; I don't want numbering on my section headers; and I want sans serif font (code blocks look better with sans IMHO). How can I do this using the LaTeX custom template.tplx files and/ or the jupyter_nbconvert_config.py configuration?
I don't mind having to use the jupyter nbconvert command, but my first choice would be a one-click solution from the browser.
Thanks!

You can run the following on your notebook file from the command line (in the same directory):
ipython nbconvert --to latex notebook.ipynb
This will generate a tex file, which you can then open with a LaTeX editor such as Texmaker. There you can edit the LaTeX code to conform to any style you want (e.g. changing font, changing margins, changing numbering, etc.). Finally, convert the tex to pdf (most LaTeX editors have tools for this).
Of course, this isn't an automated solution, but it allows for detailed changes and customization, so your final pdf comes out exactly as you want.
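If you would rather not make those edits by hand each time, the same three changes from the question can be scripted. This is only a minimal sketch: the function name restyle_tex is made up for illustration, and it assumes the generated tex file contains a \maketitle line and a single \begin{document}, which may not hold for every template.

```python
import re


def restyle_tex(tex: str) -> str:
    """Return LaTeX source with no title header, unnumbered sections, sans-serif text.

    Hypothetical helper: assumes the source calls \\maketitle on its own line
    and has one \\begin{document}.
    """
    # Drop the \maketitle call so no centered title/date header is typeset.
    tex = re.sub(r"^[ \t]*\\maketitle[ \t]*\n?", "", tex, flags=re.MULTILINE)

    preamble = (
        "\\setcounter{secnumdepth}{0}  % no numbers on section headers\n"
        "\\renewcommand{\\familydefault}{\\sfdefault}  % sans-serif body font\n"
    )
    # Put the extra preamble lines just before the first \begin{document}.
    return tex.replace("\\begin{document}", preamble + "\\begin{document}", 1)
```

You would read notebook.tex, pass its contents through restyle_tex, write the result back, and then compile it as before.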

What you are looking for is to use a different latex template.
See this post for more details.
Changing style of PDF-Latex output through IPython Notebook conversion
Basically, you will need to edit your tplx files in your /nbconvert/templates/latex directory.
I'm still learning LaTeX, but I did manage to change the default font for my documents to sans serif by adding \renewcommand{\familydefault}{\sfdefault} to my article.tplx file.
Like so:
((* block docclass *))
\renewcommand{\familydefault}{\sfdefault}
\documentclass{article}
((* endblock docclass *))

Related

Export Jupyter Notebook to PDF With Same Markdown Rendering

Problem
My notebook is solely Markdown and I would like to export it to a PDF with the same Markdown rendering that JupyterLab displays. However, the regular PDF export converts it to LaTeX and then to a PDF, and it looks nothing like how I want it formatted. I would rather not have to manually edit a TeX file every time I want to export a notebook to a PDF, especially since it is very time-consuming for large files.
Exporting to WebPDF looks much closer to the result I desire, however, the page size is all over the place and I would like it to be Letter size (8.5 x 11 inches).
Question
How can I control the page size on the WebPDF export?
Bonus Question
Is it possible to get the PDF to look the way it does on JupyterLab Markdown rendering, including the dark theme? (printing the page to PDF does a terrible job and makes all the text an image)
Okay, I am a little confused by the question, but I will do my best to answer this.
First, I would like to introduce you to pandoc. Pandoc is a document conversion system. This will let you control how your markdown is converted into a pdf or any other desired format that pandoc converts to. For additional formatting control, pandoc has support for templates. Which will allow you to customize exactly how that document is treated on export.
Now to address your page size question. I do not think that you can control this from markdown alone, however you can if you use pandoc. This can be done by adding some LaTeX code into your markdown file. You can find the information on how to control page size using LaTeX here. Once you add this LaTeX code, you can convert to pdf using pandoc and a pandoc template. Pandoc provides a number of default templates which will work fine. Here is an example of the command used to do this conversion:
pandoc /filepath/doc_name.md -o doc_name.pdf --template /file_path/pandoc-templates/default.latex
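As an alternative to pasting LaTeX into the markdown file, pandoc's default LaTeX template accepts variables for the paper size and margins. The sketch below builds such a command from Python; the helper names are hypothetical, the 1in margin is just an example value, and it assumes pandoc and a LaTeX engine are installed and on your PATH.

```python
import subprocess


def pandoc_pdf_command(source, output, papersize="letter", margin="1in"):
    """Build a pandoc command line that sets the page size via template variables."""
    return [
        "pandoc", source,
        "-o", output,
        "-V", f"papersize={papersize}",      # e.g. letter or a4
        "-V", f"geometry:margin={margin}",   # passed to the LaTeX geometry package
    ]


def convert(source, output):
    """Run the conversion; raises CalledProcessError if pandoc fails."""
    subprocess.run(pandoc_pdf_command(source, output), check=True)
```

Calling convert("doc_name.md", "doc_name.pdf") should then produce a Letter-size PDF, provided your template honours those variables.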
Bonus question:
You can make a custom pandoc template to replicate any formatting and rendering that is done in JupyterLab Markdown. I am not too familiar with JupyterLab, but making pandoc templates is not too bad and pandoc provides great documentation, available here.

How to convert unusual unicode characters (UTF-8) to PDF?

I would like to convert a text file containing Unicode characters in UTF-8 to a PDF file. When I cat the file or look at it with vim, everything is great, but when I open the file with LibreOffice, the formatting is off. I have tried various fonts, none of which have worked. Is there a font file somewhere on my Ubuntu 16.04 system which is used for display in a terminal window? It seems that would be the font to tell LibreOffice to use.
I am not attached to LibreOffice. Any app that will convert the text file into a PDF file is fine. I have tried txt2pdf and pandoc without success.
This is what the file looks like
To be more specific about the problem, below is an example of what the above lines look like in LibreOffice using Liberation Mono font (no mono font does better):
I answered you by mail, but here is the answer. You are using some very specific characters, the most difficult to find being those in the Miscellaneous Symbols Unicode block; for instance the SESQUIQUADRATE, which is on your second line as ⚼.
A quick search led me to the following two candidates (for monospace fonts):
Everson Mono
GNU Unifont
As you can see, the block is partially covered by PragmataPro, which is a very good font. I tried an old version of it and found all of your characters, but an issue occurred: the Sun character (rendered as ☉) seems to be printed twice as wide as the other characters. My version of this font is rather old and perhaps buggy, though.
Once you have chosen the font suiting your needs, you may be able to render your documents as PDF with various tools. I made all my experiments with txt2pdf which I use daily for many documents.

Cannot display Unicode Characters (like λ) in PDF output of Jupyter

I'm using Julia in a jupyter notebook.
I want to generate a pdf for the results of my work. However, when generating the pdf, the λ of the mathematical expression λ=3 is lost so that the output in the pdf is =3.
Here is the jupyter notebook code
In[1]: λ=3
Out[1]: 3
Here is the pdf generated with the jupyter notebook
In[1]: =3
Out[1]: 3
This is not the case with the pdf generated with nteract, where the expression λ=3 is fully printed out. However, the overall appearance of the pdf generated with nteract is not as nice as the pdf generated with jupyter notebook.
Here is printed pdf generated with nteract (looks exactly the same as the code itself):
In[1]: λ=3
Out[1]: 3
Does anybody know how to print such characters with jupyter notebook?
Many thanks in advance
The issue is related to how Jupyter itself generates and compiles the LaTeX file. By default, Jupyter compiles the file with xelatex to support Unicode. My guess, however, is that xelatex requires some configuration in the file, and Jupyter does not generate a file that works out of the box with the plain xelatex command.
You can change the configuration of Jupyter so that the generated LaTeX file is compiled with the pdflatex or latex command instead.
Solution:
Find your Jupyter configuration directory (i.e. the output of jupyter --config-dir; on Linux usually ~/.jupyter). To find out which jupyter IJulia uses, run using IJulia; IJulia.jupyter and then find the config directory of that jupyter.
Create the file jupyter_notebook_config.py in this directory if there is not one already.
Put the following line of code at the end of this file and save it:
c.PDFExporter.latex_command = ['pdflatex', '{filename}']
Then you can export the PDF file using the notebook, as usual. The characters should appear, just right. This should work provided that pdflatex command is found in your shell.
If you do not have pdflatex but have latex you can also use the following line instead of the code above:
c.PDFExporter.latex_command = ['latex', '--output-format=pdf', '{filename}']
If you are not able to change the configuration of Jupyter, download the latex file and compile it with the command latex --output-format=pdf filename.tex.
Hope it works!
My solution to this is ugly, and I will get to it below, but first it is important to understand why this is happening.
Why is this happening
The intermediate .tex file that is generated indirectly calls for the Latin Modern fonts. Latin Modern is a fine choice for math fonts, but it is sucky for monospaced. The Latin Modern mono font does not include Greek.
Latin Modern is set by the unicode-math LaTeX package, which is loaded in the generated LaTeX around line 43.
\ifPDFTeX
\usepackage[T1]{fontenc}
\IfFileExists{alphabeta.sty}{
\usepackage{alphabeta}
}{
\usepackage[mathletters]{ucs}
\usepackage[utf8x]{inputenc}
}
\else
\usepackage{fontspec}
\usepackage{unicode-math}
\fi
So the unicode-math package will be loaded if you are using XeLaTeX (which is a good default) or LuaTeX or any other LaTeX engine for which fontspec is available.
The unicode-math package very reasonably uses Latin Modern for math, but if nothing is set otherwise, it will also use Latin Modern for monospaced fonts. From the documentation
Once the package is loaded, traditional TFM-based maths fonts are no longer supported; you can only switch to a different OpenType maths font using the \setmathfont command. If you do not load an OpenType maths font before \begin{document}, Latin Modern Math will be loaded automatically.
The creators of unicode-math assume that you will set up your non-math fonts after you have loaded unicode-math, but that isn't done in the .tex generated by jupyter nbconvert. (I don't know if this is a jupyter thing or a Pandoc thing, but either way we end up with a document that uses Latin Modern for more than just math.)
So one solution is to set some other mono font after unicode-math is loaded and before \begin{document}.
My solution
My solution is tuned for what I already had set up. It may not be the right approach for you, and it certainly will need some adjusting.
My Makefile used to have a simple jupyter nbconvert --to=pdf in it. But now I need to edit the intermediate .tex file, so I have this for a notebook named computation-examples. You will need to use your own file name or do some Make rule magic.
# sed on macOS is just weird. Resorting to perl
computation-examples.tex: computation-examples.ipynb
	jupyter nbconvert --to=latex $<
	perl -pi -e '/^([ \t]*)\\usepackage{unicode-math}/ and $$_.="$$1\\usepackage[default]{fontsetup}\n"' $@
The perl adds the line \usepackage[default]{fontsetup} immediately after the line with \usepackage{unicode-math}. There are probably nicer ways to do that; I started with sed, but gave up. So the .tex file that is then processed to PDF by XeLaTeX has this.
\else
\usepackage{fontspec}
\usepackage{unicode-math}
\usepackage[default]{fontsetup}
\fi
The fontsetup package preserves all of the goodness of unicode-math while setting up the non-math fonts. The default is to use the OpenType (.otf) Computer Modern Unicode fonts that will be a part of any TeX distribution that has xelatex on it.
Another approach
Probably a cleaner approach, but one I haven't experimented with, would be to create a fontspec.cfg file which lies about (overrides) what font files to load for what we are calling Latin Modern Mono. I would need to reread the fontspec documentation for the hundredth time to do that.
Make magic
Since writing the above, I have set up a more general Makefile rule,
%.tex: %.ipynb
	jupyter nbconvert --to=latex $<
	perl -pi -e '/^([ \t]*)\\usepackage{unicode-math}/ and $$_.="$$1\\usepackage[default]{fontsetup}\n"' $@
which sits alongside rules to make a PDF from a .tex file.
But if you aren't using make and Makefiles, then you can wrap up that perl monstrosity into a script of your choosing.
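For instance, a rough Python equivalent of the perl one-liner could look like the sketch below. The function name add_fontsetup is made up for illustration, and unlike the perl version it returns the source unchanged if fontsetup is already loaded, so running it twice does not add the line twice.

```python
import re

# A \usepackage{unicode-math} line, capturing its leading indentation.
UNICODE_MATH = re.compile(
    r"^([ \t]*)\\usepackage\{unicode-math\}[^\n]*\n", re.MULTILINE
)


def add_fontsetup(tex: str) -> str:
    """Insert \\usepackage[default]{fontsetup} right after the unicode-math line."""
    if "{fontsetup}" in tex:
        # Already patched (or the template loads fontsetup itself).
        return tex

    def _append(match):
        indent = match.group(1)
        # Keep the matched line and add the new one with the same indentation.
        return match.group(0) + indent + "\\usepackage[default]{fontsetup}\n"

    return UNICODE_MATH.sub(_append, tex)
```

You would apply it to the contents of the .tex file produced by jupyter nbconvert --to=latex before running XeLaTeX.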

Pandoc disable figure stretching from Markdown to PDF conversion

I create PDF documents from Markdown documents using the simplest pandoc command:
pandoc my.md -o my.pdf
The figures inside the PDF are all stretched, i.e: 100% width.
Which configuration should I give to pandoc to leave the figures as is without changing figure size.
Currently you cannot control that feature directly from Markdown.
In recent months there have been some discussions going on in the Pandoc developer + user community about how to best implement it and create an easy-to-use syntax, for example
![Image Caption](./path/to/image.jpg "Image Comment"){width="60%", height="150px"}
(Warning: Example only, made up on the fly and drawn out of thin air by myself -- can't remember the latest state of the discussion...) This is designed to then transfer to all the supported output formats which can contain images, not just PDF.
So this is planned to be a major new feature for the next major release of Pandoc.
As you may or may not know, Pandoc doesn't create the PDFs itself. It produces LaTeX and employs LaTeX technology (by default its pdflatex command) to convert the LaTeX to PDF (then deleting the intermediate LaTeX files).
To execute some (limited) control about how the LaTeX/PDF pages (or other outputs) look like, Pandoc uses template files. You can look at the exact template definitions your own Pandoc version uses for LaTeX/PDF output by running
pandoc -D latex
So if you are a LaTeX hacker (or know one), you are able to modify that or create your own template from scratch.
In the current release of Pandoc (v1.13.2.1), there is this code snippet in the LaTeX template:
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
This should keep the original image sizes if they fit into the page width, and scale them down to the page width if they don't.
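To make that rule concrete, here is a small Python sketch of the arithmetic as I understand it. The function name fitted_size is hypothetical, all lengths are assumed to be in the same unit (points, say), and the 345 x 550 values in the usage note are only example page dimensions.

```python
def fitted_size(nat_width, nat_height, linewidth, textheight):
    """Size an image ends up with under width=\\maxwidth,height=\\maxheight,keepaspectratio."""
    max_width = min(nat_width, linewidth)      # what \maxwidth expands to
    max_height = min(nat_height, textheight)   # what \maxheight expands to
    # keepaspectratio: one scale factor, small enough to respect both bounds.
    scale = min(max_width / nat_width, max_height / nat_height)
    return nat_width * scale, nat_height * scale
```

With a 345 x 550 text area, a 200 x 100 image stays at 200 x 100, while a 690 x 100 image is scaled down to 345 x 50. An image is never enlarged by this rule.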
If this is not the behavior you experience with your PDF output, I suspect you are on a rather old version of Pandoc.
For using your own template instead of the builtin internal one, you can add
--template=/path/to/myown-template.latex
to the Pandoc command line.
@KurtPfeifle Thanks for your help. I updated the LaTeX template to set a static width and height for the images using the tip.
In my latex template I have:
\setkeys{Gin}{width=128pt,height=192pt,keepaspectratio}
This works great for the mobile images. But I also have a cover page, where the cover figure is now small sized.
I tried creating 2 different latex files and combining them but the figure sizes are back to being stretched:
pandoc _cover_page.md -o _cover_page.tex
pandoc ... --template=mobile_images.latex -o remaining.tex
pandoc _cover_page.tex remaining.tex -o out.pdf
Is there an easy way to combine LaTeX files which obey the templates in Pandoc?
I can create 2 pdf files: cover.pdf and remaining.pdf, and combine them too. Is there an easy tool that you know?

How to get style information of elements in PDF using Apache Tika?

I am playing around with Apache Tika to extract text from PDF files. I would like to know how to get style information like font size, text color, whether specific piece of text (few words) are in Italics, Bold, etc. using Apache Tika?
Is it even possible to get this type of information?
Also, I would like to know if it is possible to get table information using Apache Tika: information like start of table, start of first row, first cell, etc.
It is probably more convenient to use another api like PDFTextStream. Tika extracts raw textual information from a pdf, while PDFTextStream gives you structured text with correlated info such as character encoding, height, region of the text etc.
I used PDF Clown (https://pdfclown.org, v.0.2.0) to stream text blocks and extract font heights.
Converting the pdf to the Scalable Vector Graphics (svg) xml format with mupdf will give you the information you want.
Download the mupdf tool here:
http://artifex.com/developers-mupdf-download/mupdf-download-resources/
and choose the GNU AGPL LICENSE
Or here:
https://mupdf.com/downloads/
Details:
https://mupdf.com/index.html
After you download the executable you should add the path to the mupdf executable to your PATH Environment Variable.
You can then use the following from a command line interface (CLI) to convert the pdf (note - there will be a separate svg file for each page):
mutool convert -F svg -O text=text -o "your_pdf_pg.svg" your_pdf.pdf
More CLI details:
https://mupdf.com/docs/manual-mutool-convert.html
In all of the cases I have seen, the font, size, style, color, and page coordinates are given for each line of text where that information is the same. The exceptions are underlines and strikeouts, which are included in the svg file as <path> elements in the same coordinate system as the text. So you can develop some code to parse the xml and tag the text with the respective <u> </u> or <del> </del> accordingly.
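A starting point for that parsing, using only Python's standard library, might look like the sketch below. It rests on assumptions: that the text is stored in <text> and <tspan> elements in the SVG namespace, and that style information is carried in attributes such as font-family, font-size and fill. Check a file produced by your mupdf version and adjust the attribute names accordingly. The function name extract_text_runs is made up for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
# Attribute names assumed to carry the style information; adjust as needed.
STYLE_KEYS = ("font-family", "font-size", "font-weight", "font-style", "fill")


def extract_text_runs(svg_source):
    """Return a list of dicts, one per text run, with its text and style attributes."""
    root = ET.fromstring(svg_source)
    runs = []
    for text_el in root.iter(SVG_NS + "text"):
        # Style attributes set on <text> apply to its <tspan> children.
        inherited = {k: text_el.get(k) for k in STYLE_KEYS if text_el.get(k) is not None}
        spans = list(text_el.iter(SVG_NS + "tspan"))
        for el in spans or [text_el]:
            content = "".join(el.itertext()).strip()
            if not content:
                continue
            style = dict(inherited)
            style.update({k: el.get(k) for k in STYLE_KEYS if el.get(k) is not None})
            runs.append({"text": content, **style})
    return runs
```

Pass it the contents of one of the per-page svg files. Underlines and strikeouts would still need separate handling, since they are not part of the text elements.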