How to convert unusual Unicode characters (UTF-8) to PDF?

I would like to convert a text file containing Unicode characters in UTF-8 to a PDF file. When I cat the file or look at it with vim, everything is great, but when I open the file with LibreOffice, the formatting is off. I have tried various fonts, none of which have worked. Is there a font file somewhere on my Ubuntu 16.04 system which is used for display in a terminal window? It seems that would be the font to tell LibreOffice to use.
I am not attached to LibreOffice. Any app that will convert the text file into a PDF file is fine. I have tried txt2pdf and pandoc without success.
This is what the file looks like
To be more specific about the problem, below is an example of what the above lines look like in LibreOffice using Liberation Mono font (no mono font does better):

I answered you by mail, but here is the answer. You are using some very specific characters, the hardest to find being those in the Miscellaneous Symbols Unicode block. One example is SESQUIQUADRATE, which appears on your second line as ⚼.
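Before hunting for a font, it can help to list exactly which unusual characters the file contains. The following is a minimal sketch using only Python's standard `unicodedata` module; the function name `describe_non_ascii` is made up for this illustration.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, 'U+XXXX', name) for each distinct non-ASCII
    character, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in seen
    ]


if __name__ == "__main__":
    # Sample string containing the two characters discussed in this answer.
    for ch, code_point, name in describe_non_ascii("Sun \u2609 and \u26bc"):
        print(ch, code_point, name)
```

The resulting code points can then be compared against the coverage tables that font vendors publish.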
A quick search led me to the following two candidates (for monospace fonts):
Everson Mono
GNU Unifont
As you can see, the block is partially covered by PragmataPro, which is a very good font. I tried an old version and found all of your characters, but one issue occurred: the Sun character (rendered as ☉) seems to be printed twice as wide as the other characters. My version of this font is rather old, though, and perhaps buggy.
Once you have chosen the font suiting your needs, you may be able to render your documents as PDF with various tools. I made all my experiments with txt2pdf which I use daily for many documents.

Related

How to export text document containing astral Unicode characters to PDF

I regularly create documents that need Unicode characters above U+FFFF. Unfortunately, OpenOffice and LibreOffice are both unable to correctly export these characters when creating a PDF. The actual data gets mangled by a completely asinine algorithm, while the display just consists of various overlapping question mark boxes.
This is not a font issue. I embed all used fonts in the PDF and all characters below U+FFFF work perfectly fine.
Until now I have been working around this issue by mapping the glyphs I need to a custom PUA font. This solves the display problems, but obviously makes the actual content of the text unsearchable and quite fragile. I haven’t been able to find any settings that might affect the handling of Unicode characters in PDF.
Therefore I have three questions:
Is there a way to make OpenOffice/LibreOffice handle astral characters correctly on PDF export?
If not, is there an external tool that can convert .odt files to PDF while preserving astral characters?
If not, is there another good rich-text editor using a different file format that can deal with astral characters in PDFs?
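To narrow down which characters are affected, a small script can list everything above U+FFFF in a piece of text. This is only a sketch, and the function names are made up for illustration. The UTF-16 view is included on the assumption, which the question does not confirm, that the exporter mishandles surrogate pairs.

```python
def astral_characters(text):
    """Return the distinct characters above U+FFFF, in order of first use."""
    found = []
    for ch in text:
        if ord(ch) > 0xFFFF and ch not in found:
            found.append(ch)
    return found


def utf16_units(ch):
    """Return the UTF-16 code units of one character as a list of ints.

    Characters above U+FFFF need two units (a surrogate pair)."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    for ch in astral_characters("A \U0001F600 B"):
        units = " ".join(f"{u:04X}" for u in utf16_units(ch))
        print(f"U+{ord(ch):04X} -> UTF-16 units {units}")
```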

GhostScript PS to PDF conversion - No Color

Referring to this post, GhostScript Conversion Font Issues, is it safe to assume that GhostScript's PS-to-PDF conversions still do not guarantee cut-&-paste text from the converted document? Because I too am getting garbled copy-&-paste results with formatted documents, although it works with plain text files.
sample Word document .DOC
printed to PostScript by MS PS Driver
converted to PDF by GhostScript
On the color issue, I am using the Microsoft PS Class Driver to print documents to PostScript format files, and then convert them to PDF format with the GhostScript v9.20 DLL (sample source and outputs attached above). The options used are as follows:
-dNOPAUSE
-dBATCH
-dSAFER
-sDEVICE=pdfwrite
-sColorConversionStrategy=/RGB
-dProcessColorModel=/DeviceRGB
However, it is converted without color. Have I missed some option?
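For reference, the same option set can be written as a command-line invocation. This sketch assumes the `gswin64c` console executable rather than the DLL used in the question, and the file names are placeholders. The option values are copied from the list above unchanged.

```python
import subprocess


def build_gs_args(input_ps, output_pdf, gs_exe="gswin64c"):
    """Assemble the Ghostscript argument list from the options in the question.

    gs_exe is an assumption: the question calls the Ghostscript DLL directly."""
    return [
        gs_exe,
        "-dNOPAUSE",
        "-dBATCH",
        "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=/RGB",
        "-dProcessColorModel=/DeviceRGB",
        f"-sOutputFile={output_pdf}",
        input_ps,
    ]


def convert(input_ps, output_pdf, gs_exe="gswin64c"):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(build_gs_args(input_ps, output_pdf, gs_exe), check=True)
```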
You can never guarantee getting a PDF file with text you can cut and paste from a PostScript program. There is no guarantee that there is any ToUnicode information in the PostScript program, and without that, if the font is subset as here, then there is no way to know what the Unicode code point for a given glyph is.
Regarding colour, the PostScript file you have supplied contains no colour, so it's not Ghostscript; the problem is in the way you have produced the PostScript. At a guess, you have used a PostScript Printer Description (PPD file) for a monochrome printer.
You might be able to improve the text by playing with the options for downloading fonts. The basic problem is that your PostScript program doesn't contain the information we need to construct a ToUnicode CMap. Without that we are forced to assume that the character codes are ASCII, and in your case, because the fonts are subset, they are not ASCII.
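A toy model may make the ToUnicode point concrete. In this sketch a hypothetical producer numbers glyphs 1, 2, 3, ... in order of first use. Real subset encodings vary, so the numbering scheme and both function names are assumptions for illustration only.

```python
def subset_codes(text):
    """Assign character codes 1, 2, 3, ... in order of first use and
    return (codes, to_unicode), where to_unicode maps code -> character."""
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
        codes.append(code_for[ch])
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode


def extract_text(codes, to_unicode=None):
    """Recover text from character codes.

    With a ToUnicode-style map the original text comes back. Without one,
    the codes are assumed to be ASCII, which yields garbage for a subset."""
    if to_unicode is None:
        return "".join(chr(code) for code in codes)
    return "".join(to_unicode[code] for code in codes)


if __name__ == "__main__":
    codes, to_unicode = subset_codes("Hello")
    print(repr(extract_text(codes, to_unicode)))  # original text
    print(repr(extract_text(codes)))              # control characters
```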
For some reason the content of your PostScript appears to be downloading the font as bitmaps. This is ugly, doesn't scale well, and might be the source of your inability to get ToUnicode data inserted. It may also be caused by the fonts you are using; you might try some standard system fonts (if you aren't already) like TimesNewRoman.
While it's great that you supplied an example to look at, I'd suggest that in future you make the example smaller, much smaller. There's really no need for 13 pages of repeated content in this case. More content means it takes more time to decipher, so try to keep example files to the minimum required to demonstrate the problem.
In short, it looks like both your problems are due to the way you (or the application) are generating the PostScript.

Ghostscript PDF to PDF/A conversion font issues

I am exploring tools to convert PDF documents to PDF/A. Ghostscript seems to support such a conversion out of the box. One issue is that some TrueType fonts that are part of the original PDF document are not converted correctly: if I copy text from the converted PDF/A document and paste it into Notepad, the copied text is garbled.
The original document text can be copied to notepad just fine.
I am using the following script:
gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf
I have uploaded a sample 1 page source PDF in Google Drive:
SampleInput
A sample output PDF/A document generated from the command is in Google drive here:
SampleOutput
Running the above command on this PDF on a Windows machine will reproduce the issue.
Are there any settings or commands that make the PDF/A conversion work properly?
Copy and paste from a PDF is not guaranteed. Subset fonts will not have a usable Encoding (such as ASCII or UTF-8), in which case they will only be amenable to cut/paste/search if they have an associated ToUnicode CMap. Many PDF files do not contain ToUnicode CMaps.
Of course, the PDF/A specification states (oddly in my opinion) that you should not use subset fonts, but it's not always possible to tell whether a font is subset (not all creators follow the XXXXXX+ convention), and even if the font isn't subset there still isn't any guarantee that its Encoding is one that is usable.
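The subset-name convention mentioned above is, in the PDF specification, a tag of six uppercase letters followed by a plus sign in front of the font name. A minimal check might look like the sketch below. The helper name is made up, and font names are assumed to be passed without the leading slash of a PDF name object. As noted, a missing tag proves nothing, since not all creators follow the convention.

```python
import re

# Six uppercase letters, a plus sign, then the base font name.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name) if the name carries a subset tag,
    otherwise (None, font_name)."""
    match = _SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


if __name__ == "__main__":
    # "ABCDEF" is a made-up tag used only for this example.
    for name in ("ABCDEF+FreeSansBold", "Arial,Bold"):
        print(name, "->", split_subset_tag(name))
```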
Looking at the file you have posted, it does not contain one of the fonts it uses (Arial,Bold) and so Ghostscript substitutes with DroidSansFallback, and the font it does contain (FreeSansBold) is a subset (FWIW this font doesn't actually seem to be used....). The fallback font is a CIDFont, so there is no real prospect of the text being 'correct'.
I believe that if you make a real font available to Ghostscript to replace Arial,Bold then it will probably work correctly. This would also fix the rather more obvious problem of the spacing of the characters being incorrect (in one place, wildly incorrect), which is caused by the fallback font having different widths to the original.
NB: as the warning messages have already told you, don't use -dUseCIEColor.
The fact that you cannot copy/paste/search a PDF does not mean that it is not a valid PDF/A-1b file though, so this does not mean that the creation (NOT conversion) of the PDF/A-1b is not 'proper'.

Copy text from PDF with custom FONT

I am trying to copy some text from a PDF, but when I paste it into a Word file, it is just garbage. Something like മുഖവുര. The PDF is in Malayalam. Under File->Properties->Fonts, it says BRHMalayalam (Embedded Subset), as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones in the document, because the document will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file, and confirming the answer already given by @KenS, the problem with this PDF document is in fact how it's constructed, or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set to WinAnsiEncoding, which makes no sense for a Malayalam font.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no choice when copying and pasting but to assume the characters in the PDF are indeed WinAnsi encoded, which means you end up with (garbled) Latin characters.
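The effect can be sketched in a few lines. The glyph table below is entirely hypothetical (it is not BRHMalayalam's real layout), and Python's `cp1252` codec is used as an approximation of WinAnsiEncoding. The same character codes render as Malayalam through the font's glyphs but copy out as Latin letters.

```python
# Hypothetical layout of a legacy non-Unicode font: Latin code positions
# hold Malayalam glyphs. Not taken from the real BRHMalayalam font.
HYPOTHETICAL_GLYPHS = {
    0x61: "\u0d2e",  # code for 'a' draws MALAYALAM LETTER MA
    0x62: "\u0d16",  # code for 'b' draws MALAYALAM LETTER KHA
}


def rendered(codes, glyphs):
    """What the reader sees: each code drawn with the font's own glyph."""
    return "".join(glyphs[code] for code in codes)


def copied_as_winansi(codes):
    """What copy/paste yields when the viewer, lacking a ToUnicode map,
    decodes the codes as WinAnsi (approximated here by cp1252)."""
    return bytes(codes).decode("cp1252", errors="replace")


if __name__ == "__main__":
    codes = [0x61, 0x62]
    print("on screen:", rendered(codes, HYPOTHETICAL_GLYPHS))
    print("pasted:   ", copied_as_winansi(codes))
```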
Just convert the PDF file to a Word file, then edit, copy, or modify the text in the file.
When you are done, go to File -> Save As and change the format from DOC to PDF.

How to confirm a TrueType PDF font is missing glyphs

I have a PDF which renders fine in Acrobat but fails to print during the PDF to PS conversion process on our printer's RIP. After uncompressing with pdftk and editing I've found if I replace the usage of a certain font it will print.
The font is a strange one, a TrueType subset with a single character (space).
If I pass the PDF through Ghostscript it reports no errors, however an Acrobat pre-flight check will report a missing glyph for space. This error is not reported for the original file. I'm just using a basic command: gswin32c -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -o gs.pdf original_sample.pdf
I've pulled out the font data from the original PDF and saved it. Running TTFDUMP.exe produces an interesting result where it seems that the 'glyf' table is missing:
4. 'glyf' - chksm = 0x00000000, off = 0x00000979, len = 0
5. 'head' - chksm = 0xE463EA67, off = 0x00000979, len = 54
Just wondering, am I interpreting this result correctly? Is it valid to run TTFDUMP like this on extracted data from a PDF? I think a 'glyf' table is required based on the spec, at least for the first 4 necessary characters.
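The table directory that TTFDUMP prints can also be read directly from the extracted bytes, which gives a second opinion on the zero-length 'glyf' entry. The sketch below parses only the sfnt header and table records with the standard `struct` module. It assumes a single font (not a TrueType collection) and does no validation, and the function names are made up for illustration.

```python
import struct


def read_table_directory(font_bytes):
    """Parse an sfnt table directory into {tag: (checksum, offset, length)}.

    Layout: a 12-byte header (uint32 version, then uint16 numTables,
    searchRange, entrySelector, rangeShift) followed by numTables records
    of 16 bytes each (4-byte tag, uint32 checksum, offset, length)."""
    _version, num_tables = struct.unpack_from(">IH", font_bytes, 0)
    tables = {}
    for index in range(num_tables):
        tag, checksum, offset, length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * index
        )
        tables[tag.decode("latin-1")] = (checksum, offset, length)
    return tables


def empty_tables(font_bytes):
    """Return the tags of all tables whose recorded length is zero."""
    directory = read_table_directory(font_bytes)
    return [tag for tag, (_sum, _off, length) in directory.items() if length == 0]
```

Calling `empty_tables(open("extracted_font.ttf", "rb").read())` on the saved font data (the file name is a placeholder) should list `glyf` if the directory really records a zero-length table.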
TTFDUMP run on the ghostscript PDF produces a similar result but with a 1-byte 'glyf' table.
If so it seems that Acrobat doesn't particularly care about the missing space while other programs (including the printer) do. It's odd it isn't reported as missing though until it runs through Ghostscript.
The PDF is created by Adobe InDesign and the font is copyrighted like most so I can't share it.
Edit - I've accepted Ken's answer as he helped me on the Ghostscript bug tracker. In summary, it seems the font is broken as suspected due to the missing glyf table. Until I hear otherwise I'll have to suppose this is a bug in InDesign, and will continue investigating.
Yes, you can run ttfdump on an embedded subset font; it's still a perfectly valid font.
A missing glyph is not specifically a problem, because the .notdef glyph is used instead. A missing .notdef, however, means the font isn't legal.
I think you are mistaken about the legality of sharing the PDF file (from the point of view of font embedding). Practically every PDF file you see will contain copyright fonts, but these are permitted to be embedded and distributed as part of a PDF (or indeed PostScript) file. TrueType fonts contain flags which control the DRM of the font, and which can deny embedding in PDF (or other formats). Ghostscript honours these embedding flags in the font, as do Acrobat Distiller and other Adobe products.
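The embedding flags referred to here live in the `fsType` field of the font's OS/2 table. As a rough sketch, the bit values below follow the OpenType specification, while the function name and label wording are my own. The field can be decoded like this once its 16-bit value has been read from the font:

```python
# fsType bits from the OS/2 table, per the OpenType specification.
FSTYPE_FLAGS = {
    0x0002: "restricted license embedding",
    0x0004: "preview and print embedding",
    0x0008: "editable embedding",
    0x0100: "no subsetting",
    0x0200: "bitmap embedding only",
}


def describe_fstype(fs_type):
    """Return human-readable labels for an OS/2 fsType value."""
    labels = []
    # With none of the three usage-permission bits set, embedding is unrestricted.
    if fs_type & 0x000E == 0:
        labels.append("installable embedding")
    labels += [label for bit, label in FSTYPE_FLAGS.items() if fs_type & bit]
    return labels


if __name__ == "__main__":
    for value in (0x0000, 0x0002, 0x0104):
        print(f"0x{value:04X}:", ", ".join(describe_fstype(value)))
```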
There were some fonts which inadvertently shipped with DRM which prevented embedding, and there's a list somewhere of these, along with an explicit statement from the font foundry that it's permissible to embed these fonts. I think this was somewhere on the Adobe web site a few years back.
So if you have a PDF file with the font embedded in it (especially if it was produced by an Adobe application) then I would be comfortable that it's legal to share.
I'm having some trouble figuring out what the problem actually is, and how you are using Ghostscript. If you are running the PDF->PS and then back to PDF then all bets are off frankly. Round-tripping files will often provoke problems.
In any event I'm happy to look at the file but you will have to make it available.