docsplit conversion to PDF mangles non-ASCII characters in docx on Linux

My documentation management app involves converting a .docx file containing non-ASCII Unicode characters (Japanese) to PDF with docsplit (via the Ruby gem, if it matters). It works fine on my Mac. On my Ubuntu machine, the resulting PDF has square boxes where the characters should be, whether invoked through Ruby or directly on the command line. The odd thing is, when I open up the .docx file directly in LibreOffice and do a PDF export, it works fine. So it would seem there is some aspect to how docsplit invokes LO that causes the Unicode characters to be handled improperly. I have scoured various parts of the documentation and code for options that I might need to specify, with no luck. Any ideas of why this could be happening?
FWIW, docsplit invokes LO with the following options line in pdf_extractor.rb:
options = "--headless --invisible --norestore --nolockcheck --convert-to pdf --outdir #{escaped_out} #{escaped_doc}"
I notice that the output format can optionally be followed by an output filter, as in pdf:output_filter_name. Is this something I need to think about using?
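For what it's worth, I believe that syntax would look something like the following with Writer's standard PDF filter (writer_pdf_Export is the stock filter name for Writer documents), though I have not confirmed whether specifying a filter makes any difference here:
# hedged example of the filter syntax; paths are placeholders
soffice --headless --convert-to "pdf:writer_pdf_Export" --outdir /tmp input.docx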

I have tracked this down to the --headless option which docsplit passes to LibreOffice. That invokes a non-X version of LO, which apparently does not have the necessary Japanese fonts. Unfortunately, there appears to be no way to pass options to docsplit to tell it to omit the --headless option to LO, so I will end up patching or forking the code somehow.
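Before patching, it may be worth confirming whether the headless process can see any Japanese fonts at all. The commands below are what I would try on Ubuntu; the package names are the common ones and the file paths are placeholders:
# list the Japanese fonts fontconfig can see
fc-list :lang=ja
# install a Japanese font set (Ubuntu package names; adjust for your release)
sudo apt-get install fonts-takao fonts-ipafont
# re-run the conversion docsplit performs, minus --headless, to compare output
soffice --invisible --norestore --nolockcheck --convert-to pdf --outdir /tmp input.docx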

Related

How to convert unusual unicode characters (UTF-8) to PDF?

I would like to convert a text file containing Unicode characters in UTF-8 to a PDF file. When I cat the file or look at it with vim, everything is great, but when I open the file with LibreOffice, the formatting is off. I have tried various fonts, none of which have worked. Is there a font file somewhere on my Ubuntu 16.04 system which is used for display in a terminal window? It seems that would be the font to tell LibreOffice to use.
I am not attached to LibreOffice. Any app that will convert the text file into a PDF file is fine. I have tried txt2pdf and pandoc without success.
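In case it helps, this is how I checked which font file the generic "monospace" alias resolves to on this system (assuming fontconfig, which Ubuntu uses):
fc-match monospace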
To be more specific about the problem: the file displays correctly in a terminal, but in LibreOffice, using the Liberation Mono font (and no mono font does better), some of the characters are not displayed properly.
I answered you by mail, but here is the answer. You are using some very specific characters, the most difficult to find being in the Miscellaneous Symbols Unicode block, for instance the SESQUIQUADRATE, which should appear on your second line as ⚼.
A quick search led me to the following two candidates (for monospace fonts):
Everson Mono
GNU Unifont
The block is also partially covered by PragmataPro, which is a very good font. I tried with an old version and found all of your characters, but there was an issue: the Sun character (rendered as ☉) seems to be printed twice as wide as the other characters. My version of this font is rather old, though, and perhaps buggy.
Once you have chosen the font suiting your needs, you should be able to render your documents as PDF with various tools. I did all my experiments with txt2pdf, which I use daily for many documents.
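If you end up using pandoc rather than txt2pdf, this is roughly how I would force a specific Unicode-capable font; the font name is only an example, and any installed font covering the block should do:
# pandoc 2.x syntax; older versions use --latex-engine instead of --pdf-engine
pandoc input.txt -o output.pdf --pdf-engine=xelatex -V mainfont="GNU Unifont" -V monofont="GNU Unifont"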

Cannot display Unicode Characters (like λ) in PDF output of Jupyter

I'm using Julia in a jupyter notebook.
I want to generate a pdf for the results of my work. However, when generating the pdf, the λ of the mathematical expression λ=3 is lost so that the output in the pdf is =3.
Here is the jupyter notebook code
In[1]: λ=3
Out[1]: 3
Here is the pdf generated with the jupyter notebook
In[1]: =3
Out[1]: 3
This is not the case with the pdf generated with nteract, where the expression λ=3 is fully printed out. However, the overall appearance of the pdf generated with nteract is not as nice as the pdf generated with jupyter notebook.
Here is printed pdf generated with nteract (looks exactly the same as the code itself):
In[1]: λ=3
Out[1]: 3
Does somebody know how to print such characters with jupyter notebook?
Many thanks in advance
The issue is related to how Jupyter itself generates and compiles the LaTeX file. Jupyter, by default, compiles the file with xelatex to support Unicode. My guess is, however, that xelatex requires some configuration in the file, and Jupyter does not generate a file that works out of the box with a plain xelatex command.
You can change the configuration of Jupyter so that the generated LaTeX file is compiled with the pdflatex or latex command instead.
Solution:
Find your Jupyter configuration directory (i.e. the output of jupyter --config-dir; on Linux it is usually ~/.jupyter). To find out which jupyter IJulia uses, run using IJulia; IJulia.jupyter, then find the config directory of that jupyter.
Create the file jupyter_notebook_config.py in this directory if there is not one already.
Put the following line of code at the end of this file and save it:
c.PDFExporter.latex_command = ['pdflatex', '{filename}']
Then you can export the PDF file using the notebook as usual. The characters should appear just right. This should work provided that the pdflatex command can be found from your shell.
If you do not have pdflatex but do have latex, you can use the following line instead of the code above:
c.PDFExporter.latex_command = ['latex', '--output-format=pdf', '{filename}']
If you are not able to change the configuration of Jupyter, download the latex file and compile it with the command latex --output-format=pdf filename.tex.
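For completeness, the manual route is then a two-step affair (the notebook name is a placeholder):
jupyter nbconvert --to latex notebook.ipynb
latex --output-format=pdf notebook.tex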
Hope it works!
My solution to this is ugly, and I will get to it below, but first it is important to understand why this is happening.
Why is this happening
The intermediate .tex file that is generated indirectly calls for the Latin Modern fonts. Latin Modern is a fine choice for math fonts, but it is sucky for monospaced. The Latin Modern mono font does not include Greek.
Latin Modern is set by the unicode-math LaTeX package, which is loaded in the generated LaTeX around line 43.
\ifPDFTeX
\usepackage[T1]{fontenc}
\IfFileExists{alphabeta.sty}{
\usepackage{alphabeta}
}{
\usepackage[mathletters]{ucs}
\usepackage[utf8x]{inputenc}
}
\else
\usepackage{fontspec}
\usepackage{unicode-math}
\fi
So the unicode-math package will be loaded if you are using XeLaTeX (which is a good default) or LuaTeX or any other LaTeX engine for which fontspec is available.
The unicode-math package very reasonably uses Latin Modern for math, but if nothing is set otherwise, it will also use Latin Modern for monospaced fonts. From the documentation:
Once the package is loaded, traditional TFM-based maths fonts are no longer supported; you can only switch to a different OpenType maths font using the \setmathfont command. If you do not load an OpenType maths font before \begin{document}, Latin Modern Math will be loaded automatically.
The creators of unicode-math assume that you will set your non-math fonts up after you have loaded unicode-math, but that isn't done in the .tex generated by jupyter nbconvert. (I don't know if this is a jupyter thing or a Pandoc thing, but either way we end up with a document that uses Latin Modern for more than just math.)
So one solution is to set some other mono font after unicode-math is loaded and before \begin{document}.
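For example, something like this should do it; the font name is just an assumption, and any installed OpenType mono font with Greek coverage works:
% placed after \usepackage{unicode-math} and before \begin{document}
\setmonofont{DejaVu Sans Mono}  % assumed font; must be installed and cover Greek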
My solution
My solution is tuned for what I already had set up. It may not be the right approach for you, and it certainly will need some adjusting.
My Makefile used to have a simple jupyter nbconvert --to=pdf in it. But now I need to edit the intermediate .tex file. So I have this for a notebook named computation-examples. You will need to use your own file name or do some Make rule magic.
# sed on macOS is just weird. Resorting to perl
computation-examples.tex: computation-examples.ipynb
jupyter nbconvert --to=latex $<
perl -pi -e '/^([ \t]*)\\usepackage{unicode-math}/ and $$_.="$$1\\usepackage[default]{fontsetup}\n"' $@
The perl adds the line \usepackage[default]{fontsetup} immediately after the line with \usepackage{unicode-math}. There are probably nicer ways to do that. I started with sed, but gave up. So the .tex file that is then processed to PDF by XeLaTeX has this.
\else
\usepackage{fontspec}
\usepackage{unicode-math}
\usepackage[default]{fontsetup}
\fi
The fontsetup package preserves all of the goodness of unicode-math while setting up the non-math fonts. The default is to use the OpenType (.otf) Computer Modern Unicode fonts, which will be a part of any TeX distribution that has xelatex on it.
Another approach
Probably a cleaner approach, but one I haven't experimented with, would be to create a fontspec.cfg file which lies about (overrides) what font files to load for what we are calling Latin Modern Mono. I would need to reread the fontspec documentation for the hundredth time to do that.
Make magic
Since writing the above, I have set up a more general Makefile rule,
%.tex: %.ipynb
jupyter nbconvert --to=latex $<
perl -pi -e '/^([ \t]*)\\usepackage{unicode-math}/ and $$_.="$$1\\usepackage[default]{fontsetup}\n"' $@
which sits alongside rules to make a PDF from a .tex file.
But if you aren't using make and Makefiles, then you can wrap up that perl monstrosity into a script of your choosing.
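If it helps, here is roughly what such a wrapper could look like; the script name and its single notebook argument are placeholders, and it assumes xelatex is what takes you from .tex to PDF:
#!/bin/sh
# hypothetical wrapper: ./nb2pdf.sh notebook.ipynb
set -e
nb="$1"
tex="${nb%.ipynb}.tex"
jupyter nbconvert --to=latex "$nb"
# add \usepackage[default]{fontsetup} right after \usepackage{unicode-math}
perl -pi -e '/^([ \t]*)\\usepackage\{unicode-math\}/ and $_ .= "$1\\usepackage[default]{fontsetup}\n"' "$tex"
xelatex "$tex"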

Why do I obtain countless 'programming' pages of characters/numbers when printing pdf/png files using lpr?

I've got a silly problem which is literally driving me mad:
When I try to print a file using lpr file.pdf, depending on the file I obtain one of the following issues:
the printer does not recognise the A4 format
the file is printed, but together with countless pages of programming code (the 'real' face of a PDF file, I guess), characters, and numbers.
The same happens also for PNG files.
I'm using Mac OS X El Capitan and a Xerox ColorQube printer.
Clearly, if I open the file with Acrobat or Preview and just print manually, I have no problem at all.
I hope you can give me some clues because I couldn't find anything useful on the web.
PS: If I use the -l option, the printer prints a sheet saying that it is not configured to print PDF files directly.
lpr sends the file directly to the printer, which may not understand PDF as-is. But since PDF is a successor to PostScript, it can contain familiar commands, so something gets printed; the rest (probably the embedded preview and so on) gets printed as raw text.
Try using Ghostscript to convert to PostScript before sending to the printer:
gs -dSAFER -dNOPAUSE -sDEVICE=(your printer name) -sOutputFile=\|lpr file.pdf
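If you would rather not look up the Ghostscript device for your printer model, an alternative is to convert to generic PostScript first and then print the result; ps2write is the device name in current Ghostscript releases (very old ones called it pswrite), and the file names are placeholders:
# convert the PDF to plain PostScript, then hand that to lpr
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=ps2write -sOutputFile=file.ps file.pdf
lpr file.ps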

Headless convert-to PDF: soft hyphen replaced with zero-width whitespace

I'm working on a webapp creating LibreOffice documents that I want to convert to PDFs with unoconv and a headless LibreOffice.
There is just one problem I can't solve: the soft hyphens I include in the .odt are replaced with zero-width whitespaces in the resulting PDF. The problem is not related to unoconv; I tried it directly with a headless LibreOffice (same result). I tried both v4.1.4.2 and v4.2.5.2.
I tried another font (Ubuntu) (I use Arial as the body font), as I expected that the missing Arial font on Linux was causing the problem (I have the problem on the production server with Debian 7 as well as on a VirtualBox with Ubuntu 12.04).
I even installed the Arial font, in the hope that the problem was caused by LibreOffice's inability to calculate where to set the "real" hyphens without the font file at hand.
Strange thing: using LO 4.1.4.2 on my Mac (headless, of course) produces flawless PDFs. So the problem must be related to either Linux or some missing "graphical" package in my server setup. I installed the hyphen-de package, which results in hyphens based on the dictionary, but the specified soft hyphens are still replaced with zero-width whitespaces.
The problem affects both body text and text boxes that are used for annotations.
I'd appreciate any hint very much!
I had a similar problem.
I had to install the hyphenation package for the language that matches the document's language.
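On Debian/Ubuntu the hyphenation packages follow the hyphen-<language> naming scheme; for example (package names may vary by release):
apt-cache search ^hyphen-        # list the available hyphenation pattern packages
sudo apt-get install hyphen-de   # e.g. for German documents, as in the question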

TeXnicCenter - spelling not working correctly

I have installed the 2.02 Stable 64-bit version of TeXnicCenter and have the following problem with the spell check. In one of my existing LaTeX documents, English text is checked correctly and all typos are underlined; in this file, German is not recognised even though I change the language setting in the spelling options. However, in another of my existing LaTeX documents, the spelling tool does not recognise English text, but it does recognise German.
Here is a hint: it could be that the other LaTeX file was created within a German Windows environment, whereas I now have an English Windows 7 environment. Is it possible that this is connected with the text formatting? Is it possible to change it? Or is there a different cause?
Another hint: when I generate a new LaTeX file, the spell check works fine for both English and German. So it is just a problem with the existing document.
Good hint from your side towards text encoding, Phil. The solution is a bit different, though. Apparently TeXnicCenter saves .tex files with ANSI encoding by default. As soon as the .tex files are saved with UTF-8 encoding, the spell check works fine. There are no options to be set in the program; one has to go through File -> Save As and set the encoding while saving.
I know this is an old topic, but here is what solved my issue: manually change the project language. Go to Project > Properties and then change the language there.