ocrmypdf 13.4.1 command line works, but API missing text layers when using output_type="pdf" - ocrmypdf

I recently upgraded from ocrmypdf 9.0.3/tesseract 4.x to ocrmypdf 13.4.1/tesseract 5.1.
When using ocrmypdf 9.x or 13.x, this works on the CLI:
$ ocrmypdf --output-type pdf sample-file.pdf output-file.pdf
However, when I use the API within my app,
import ocrmypdf
ocrmypdf.ocr("path/to/inputfile.pdf", "path/to/outputfile.pdf", output_type="pdf")
The text layers are added only when I use ocrmypdf 9.x and no text is searchable when I use 13.4.1.
However, if I use:
ocrmypdf.ocr("inputfile.pdf", "outputfile.pdf", output_type="pdfa")
then appropriate text layers are set when using either 9.x or 13.4.1
I feel like I'm missing something very basic... any help here?

This turned out to be a non-issue.
There was a post-processing step involved that subsequently changed the output.
13.4.x works fine.

Using matplotlib with latex mode with non-default fonts

I am using Windows 10 with Anaconda and Spyder 4. When using matplotlib, I would like to use the font Proxima Nova and render with LaTeX.
If in my matplotlibrc file I specify
font.family : Proxima Nova
then the figure renders with the font Proxima Nova. This means that the font is installed on my system (as it is) and matplotlib can use it. However, if in the matplotlibrc file I also specify
text.usetex: True
then, even though I have specified Proxima Nova as the font, the figure renders in the default LaTeX font, which I guess is Computer Modern.
I have tried
matplotlib.font_manager._rebuild()
in the source code file, and I have also tried specifying the fonts in the source code file rather than in the matplotlibrc file. However, I always get the same result. I have also followed all the advice on this help page, including making sure that latex, dvipng and ghostscript are all on the PATH variable. However, nothing seems to work.
I would like to note that I can use Proxima Nova separately when compiling Latex documents, so that should not be an issue either.
How can I get matplotlib to use a non-default font and render with LaTeX at the same time?
After some further investigation, I was able to use Proxima Nova with LaTeX, although there are still some outstanding issues.
The main issue is that if the font Proxima Nova is used with LaTeX, one needs to use LuaLaTeX and not plain LaTeX. Here is the Matplotlib instruction on using matplotlib with LuaLaTeX.
The key to getting things to work was this post.
At the very beginning of my .py file, I have the following code:
import matplotlib as mpl
mpl.use("pgf")
mpl.rcParams.update({
'font.family': 'sans-serif',
'text.usetex': True,
'pgf.rcfonts': False,
'pgf.texsystem': 'lualatex',
'pgf.preamble': r'\usepackage{fontspec} \setmainfont{Proxima Nova}',
})
The code above should be placed at the very top of the code, above any other imports.
The problem, however, is that this solution works only after performing the following steps:
Delete the .matplotlib/tex.cache folder and restart spyder
Replace 'font.family': 'sans-serif' and \setmainfont{Proxima Nova} with 'font.family': 'serif' and \setmainfont{Times New Roman} respectively. Run python once.
Revert back to 'font.family': 'sans-serif' and
\setmainfont{Proxima Nova} and run python again.
The output with the correct font is produced.
Unless the steps above are performed, the output is compiled with the default DejaVu Sans font and not with Proxima Nova. I am not sure why.
After getting help on the matplotlib github forum, I was pointed to the following solution:
mpl.rcParams.update({
'font.family': 'sans-serif',
'text.usetex': True,
'pgf.rcfonts': False,
'pgf.texsystem': 'lualatex',
'pgf.preamble': r'\usepackage{fontspec} \setsansfont{Proxima Nova}',
})
In other words, you need to use \setsansfont instead of \setmainfont, because 'font.family' is set to 'sans-serif'. You can see the matplotlib forum page here.
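To make the relationship between 'font.family' and the fontspec command explicit, here is a minimal sketch. It assumes the usual fontspec convention (\setmainfont for the serif/roman family, \setsansfont for sans-serif, \setmonofont for monospace); pgf_font_rc is a hypothetical helper name, not part of matplotlib.

```python
# Sketch: choose the fontspec command that matches matplotlib's font.family.
# pgf_font_rc and FONTSPEC_COMMANDS are hypothetical names for illustration.
FONTSPEC_COMMANDS = {
    "serif": r"\setmainfont",
    "sans-serif": r"\setsansfont",
    "monospace": r"\setmonofont",
}


def pgf_font_rc(family, font_name, texsystem="lualatex"):
    """Return an rcParams dict for the pgf backend using font_name for family."""
    try:
        command = FONTSPEC_COMMANDS[family]
    except KeyError:
        raise ValueError(f"unsupported font.family: {family!r}") from None
    return {
        "font.family": family,
        "text.usetex": True,
        "pgf.rcfonts": False,
        "pgf.texsystem": texsystem,
        # Doubled braces are literal braces inside an f-string.
        "pgf.preamble": rf"\usepackage{{fontspec}} {command}{{{font_name}}}",
    }


rc = pgf_font_rc("sans-serif", "Proxima Nova")
print(rc["pgf.preamble"])
# To apply it, after mpl.use("pgf"): mpl.rcParams.update(rc)
```

Building the preamble from the family name keeps the two settings from drifting apart, which is the mismatch that produced the DejaVu Sans fallback described above.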

VuePress - markdown content filled with special characters

Attempting to follow along with the get-started tutorial for VuePress. Upon first running the application, this is what I got:
��#� �H�e�l�l�o� �V�u�e�P�r�e�s�s� � �
I tried changing the text in case it was something weird with PowerShell. This is what I get:
��A�n�y�t�h�i�n�g�
Running node v8.11.4 on Windows 10, just trying to play with the technology.
Your system may have created the file with a different encoding (in my case, Windows PowerShell was using UTF-16). Simply change the encoding to UTF-8 with your favorite editor.
See https://github.com/vuejs/vuepress/issues/276
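If you would rather not open an editor, the conversion can also be scripted. This is a minimal sketch, assuming the file was written as UTF-16 with a byte order mark (which is what the garbled output above suggests); reencode_to_utf8 is a hypothetical helper name and the README.md path in the usage comment is only an example.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="utf-16"):
    """Rewrite a text file in place as UTF-8 without a BOM.

    Works on bytes so that the original line endings are preserved.
    Returns the decoded text.
    """
    path = Path(path)
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return text


# Example usage (path is an assumption):
# reencode_to_utf8("README.md")
```

Run it once per affected file. Running it again on a file that is already UTF-8 will usually raise a UnicodeError rather than rewrite the file, because the UTF-16 decode fails.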

Converting IPython notebook mhchem markdown to pdf

Having figured out how to insert mhchem into MathJax expressions in an IPython notebook, (from this question) I was pretty pleased until I wanted to convert the notebook to PDF. My markdown cell contains:
$\require{mhchem}$
$\ce{CH2O2->[k_1]H2 + CO2}$
But attempting to convert the notebook to PDF using the following terminal command
ipython nbconvert untitled.ipynb --to pdf
results in an ! Undefined Control Sequence error - similar to this post. However the solution proposed (\texorpdfstring) isn't recognized by the MathJax interpreter in the markdown cell. Any ideas or workarounds to resolve the issue?
IPython 3.0.0 /MiKTeX 2.9 /win7x86
To include the mhchem package you need to create a custom template. Moreover it is necessary to handle the \require mathjax command which is not available in pdflatex.
A possible template (chemtempl.tplx) could look like
((*- extends 'article.tplx' -*))
((* block packages *))
((( super() )))
\usepackage{mhchem}
((* endblock packages *))
((* block commands *))
((( super() )))
% disable require command
\newcommand{\require}[1]{}
((* endblock commands *))
This template extends the default article template by overriding the packages and commands blocks. The super call is similar to the Python command and includes the original block content.
The \require command is tackled by defining a new command which takes one argument and does nothing -> this way it can remain in the notebook.
The template is used like
jupyter nbconvert --template chemtempl.tplx --to pdf notebook.ipynb
If you are using IPython 3.x, just replace jupyter with ipython.
To use the template also with the download as pdf feature in the notebook,
a config file has to be used. To this end create a config file (.ipython/profile_default/ipython_config.py for IPython 3.x or .jupyter/jupyter_notebook_config.py for IPython 4.x (Jupyter)) with the following content:
c = get_config()
c.LatexExporter.template_path = ['/home/path/to/template/']
c.LatexExporter.template_file = 'chemtempl.tplx'
After restarting Jupyter the new config should be loaded and the template be used.
(I guess for IPython 3.x the file could also be named ipython_notebook_config.py but currently I have no 3.x version to test this.)
After failing to follow the answer above, I discovered an easy workaround.
You can simply edit the default LaTeX template used by nbconvert to include
\usepackage{mhchem}
in the preamble. This means that all chem equations will work just like any math equations in your notebook.
For me (macOS) the template was found at /Users/USER/opt/anaconda3/share/jupyter/nbconvert/templates/latex/base.tex.j2
I'm not sure if this was related, but I got a few compile errors once I modified the preamble related to spaces next to my $ in math blocks. It was all fixed after making sure there was no space within the $'s and one space outside them.

Transparent inline matplotlibs in IPython

I'd like the background of my matplotlib plots to be transparent in my IPython notebook. This may sound silly because the notebook itself defaults to a white background but:
1) I use a solarized background and
2) more importantly, I want them to be transparent for when I embed the notebook directly into my blog via nbconvert.
It's easy enough to use something like savefig('file', transparent=True), but I'm not saving the figures, I am displaying them inline (by calling IPython with ipython notebook --matplotlib inline).
I've been playing around with the IPython notebook configuration file, especially with c.InlineBackend.rc. For example, I upgraded to the dev version of matplotlib to get access to its new savefig.transparent rcParam, and tried configuring that with c.InlineBackend.rc = {'savefig.transparent': True}, but as expected it only affects plots saved with savefig.
Note that I am using the recent IPython 2.0 release. This must be possible somehow, right? Any light that you can shed would be appreciated.
Just to follow up, the issue opened on Github by tillsten has been patched so something like this:
rcParams['figure.facecolor'] = (0,0,0,0)
should work now after you update IPython. Three cheers for open source.
The inline plots are html objects (<img>) with class ui-resizable. So you can change their default behavior by customizing the CSS for your notebooks:
locate your settings for notebooks: in a terminal, type
ipython locate
in the indicated directory, go to subdir profile_default\static\custom (or any profile you want to use instead)
edit or create a CSS file named custom.css
put this in it:
img.ui-resizable
{
opacity:0.4;
}
Close your notebooks, kill IPython and restart it (so that it recreates the served files).
It should work with exported notebooks, as long as you export them as html and you change the css there too.
It's not exactly what you want, but it does the job.

How do I get dojo.currency.format to use the correct currency symbol when using a custom dojo build?

When I use my custom build of dojo, dojo.currency.format doesn't use the correct currency symbol.
This is the statement I use:
dojo.currency.format(1234.567, {currency: "USD"});
This is the result when I use the standard dojo release:
"$1,234.57"
This is the result when I use my custom build of dojo:
"¤1,234.57"
How can I get my custom dojo build to produce the correct results?
I encountered this issue when first trying to use the dojo build. It has to do with the character encoding of the files. Check the character encoding of an unzipped release (non-source). Compare that to the character encoding of the files in the unbuilt source and of the files in a custom build. To see whether this is the problem, you can force the browser (Chrome, for example) to render the contents in a given encoding.
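One way to compare the encodings without a browser is a small script. This is only a rough sketch: it looks at the byte order mark and at whether the bytes decode as UTF-8, which cannot identify every encoding. describe_encoding is a hypothetical helper name and the paths in the usage comment are placeholders.

```python
import codecs
from pathlib import Path


def describe_encoding(path):
    """Roughly classify a file's encoding from its BOM and UTF-8 validity."""
    data = Path(path).read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid utf-8 (possibly a legacy 8-bit encoding)"
    return "utf-8 (or plain ascii)"


# Example usage (paths are placeholders):
# for name in ("release/dojo/dojo.js", "build/dojo/dojo.js"):
#     print(name, describe_encoding(name))
```

If the release files and the custom build report different results, the encoding mismatch is a likely cause.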
The easy solution to this (for me at least) was to set the charset on the dojo script tags
<script type="text/javascript" src="/path/to/dojo" charset="UTF-8"></script>
Dojo has a couple of pages on encoding that are worth taking a look at.
If you are using shrinksafe in the build, you may also need to specify the encoding there:
java -jar -Dfile.encoding=UTF8 shrinksafe.jar
Does your build have access to the dojo/cldr/nls directory, which holds the localization files for your locale? Check in Firebug whether it attempts, but fails, to load currency.js from that directory.