Wrong Character Ordering of Unicode Font in PDF - pdf

I have a PDF file containing some Burmese text in it. The PDF is generated with wkhtmltopdf with embedded fonts (Myanmar Census 3 [Pyidaungsu] for the Burmese text).
Some of the Burmese text is displayed correctly, while other parts are not: some characters are rendered in the wrong order.
However, when I copy the text from the PDF over to another program, it looks correct (indicating that the Unicode mapping was embedded correctly in the PDF).
Any pointers as to why the rendering of the font is out of order (in Burmese or any other language)?
Versions tried:
- wkhtmltopdf 0.12.2.1 and 0.9.9.1
- ghostscript 9.15
- OS: Mac OSX and Linux

Related

Problem showing a font with license restriction to pdf

I'm writing a program on a Mac that converts a file to PDF. The file contains Chinese text
set in STFangsong, a font that has license restrictions and is not embeddable. I
tried many CMaps to encode it, but the root cause seems to be that the PDF viewers (both
Mac Preview and Acrobat Reader) do not recognize the font: the PDF
file properties show the Actual Font as "Unknown", and a pop-up message says the viewer can't find or
create the font.
PDF 32000-1:2008, section 9.6.6.4, gives a guideline that when encoding a TrueType font,
the font program should be embedded, though without a specific explanation. From my
understanding, embedding guarantees that the PDF is readable everywhere, but I do not
need that since the font is licensed; I just want it to display on my computer.
So my question is: do these PDF viewers have limitations with CJK characters
when embedding is forbidden?
By the way, I used Microsoft Word to write some text in this font and saved it as
PDF, and the properties show the font as an embedded subset. Does that mean Microsoft has bought the
license?

Ghostscript converting pdf to text file, output is unreadable

I was trying to convert a PDF document into a text file. Everything works until I open the output file and find it unreadable: the characters appear as Chinese glyphs
" 琀攀猀琀 "
This is my command line:
gswin64c.exe -ps2ascii -sDEVICE=txtwrite -sOutputFile=outputtext.txt test.pdf
Am I doing something wrong?
You haven't posted the file, so it's not possible to be absolutely certain, however...
Almost certainly the text in your PDF file is not encoded using an ASCII encoding scheme (it possibly contains subset fonts), and does not contain a ToUnicode CMap for the font in question. Additionally, the glyph names are not standard names (or it's a TrueType font, which doesn't have named glyphs).
Without any of the above information, txtwrite doesn't have any clue what the character codes represent, and so simply emits them verbatim.
Given that you are seeing Chinese glyphs, I would suspect that the original font is a CIDFont, probably a TrueType font, that is subset and has no ToUnicode CMap.
In this case, the only way to get the text out will be to use OCR.
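Before resorting to OCR, one quick check may be worth trying. This is a separate observation, not part of the answer above: the sample characters in the question happen to be exactly what ASCII text looks like when UTF-16 is read with the wrong byte order. The sketch below only demonstrates that byte swap on the quoted sample; the function name is made up for illustration, and it says nothing definite about what txtwrite wrote to the file.

```python
def swap_utf16_byte_order(text: str) -> str:
    """Re-read text as if its UTF-16 code units had their two bytes swapped."""
    # Encode as big-endian UTF-16, then decode the same bytes as little-endian.
    return text.encode("utf-16-be").decode("utf-16-le")


# The unreadable sample quoted in the question (surrounding spaces omitted).
sample = "琀攀猀琀"

# Show each code point next to the swapped result.
for ch in sample:
    print(f"U+{ord(ch):04X} -> {swap_utf16_byte_order(ch)!r}")

print(swap_utf16_byte_order(sample))
```

If the swapped output is readable, the file probably holds usable text and the editor is guessing the wrong encoding; if it is still gibberish, the explanation above about the missing ToUnicode CMap is the more likely cause.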

Ghostscript generated pdf content is not able to copy

I am trying to convert a PostScript file which contains a Telugu font (Vani Bold). After converting the file to PDF, I am not able to copy the text from the generated PDF file. When I look at the properties of the PDF file in the CentOS document viewer, it shows as below.
I am using the command below to convert the PostScript file to PDF:
bin/gs -dBATCH -sDEVICE=pdfwrite -sNOPAUSE -dQUITE -sOutputFile=/home/cloudera/Desktop/PrintTest/telugu.pdf /home/cloudera/Desktop/PrintTest/VirtualPrinter_27_09_2016_19_11_41_691.ps
I tried Ghostscript 9.19 and 9.20 as well, but nothing changed.
Following is the link to my postscript file which I am trying to convert into pdf.
click here for postscript file
I have been struggling with this for 10 days. Please provide some solution for this.
I can tell you why you can't copy & paste the text, but I'm not sure I can provide an acceptable solution.
First, not all PDF viewers can deal with Unicode characters (for example, xpdf can't, it just ignores them, while mupdf and qpdfview work).
Second, to be able to convert font glyphs to Unicode characters, the font object in the PDF file must contain a /ToUnicode entry. If you look at the generated PDF after decompression (mutool clean -d), you can see that the Vani font in object 8 0 doesn't have one, while both the Arial font in object 10 0 and the Calibri font in object 12 0 do.
So very likely the Vani font is missing this Unicode translation information. You need to either add this information (e.g. with fontforge) or choose a different font that has it.
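The manual inspection described above can be roughly automated. The following is a minimal sketch, assuming the PDF has already been decompressed (for instance with mutool clean -d) so that font dictionaries are plain text. The SAMPLE fragment, its font names, and the function name are made up for illustration and are not taken from the asker's file. It uses a regex scan, not a real PDF parser, so it can be fooled by unusual files.

```python
import re

# Hypothetical, hand-written fragment imitating decompressed PDF objects.
SAMPLE = b"""
8 0 obj
<< /Type /Font /Subtype /TrueType /BaseFont /ABCDEF+Vani-Bold >>
endobj
10 0 obj
<< /Type /Font /Subtype /TrueType /BaseFont /GHIJKL+ArialMT /ToUnicode 11 0 R >>
endobj
11 0 obj
<< /Length 0 >>
endobj
"""

# "N G obj ... endobj" blocks; non-greedy so each object is matched separately.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# Matches /Type /Font but not /Type /FontDescriptor.
FONT_RE = re.compile(rb"/Type\s*/Font\b")
BASEFONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")


def fonts_with_tounicode(pdf_bytes):
    """Map (object number, base font name) to True if /ToUnicode is present."""
    result = {}
    for num, _gen, body in OBJ_RE.findall(pdf_bytes):
        if not FONT_RE.search(body):
            continue  # not a font dictionary
        match = BASEFONT_RE.search(body)
        name = match.group(1).decode("latin-1") if match else "?"
        result[(int(num), name)] = b"/ToUnicode" in body
    return result


for (num, name), ok in sorted(fonts_with_tounicode(SAMPLE).items()):
    print(f"object {num} 0: {name}: {'has' if ok else 'missing'} /ToUnicode")
```

To try it on a real file, read the decompressed PDF in binary mode and pass the bytes to fonts_with_tounicode. Note that for composite (Type0) fonts the /ToUnicode entry sits on the top-level font dictionary, so a descendant CIDFont dictionary reported as missing it is not necessarily a problem.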
Related question:
https://superuser.com/questions/1124583/text-in-pdf-turns-gibberish-on-copying-but-displays-fine/1124617#1124617

Which Chinese font is commonly supported by PDF readers of Chinese people?

I am generating PDF files which contain English and Chinese characters (using the Ruby Prawn library). I don't want to embed a Chinese font file in the generated PDF files, because these files need to stay small. So I'm wondering if I could just mention a Chinese font name in my PDF files, and have the PDF readers correctly render the Chinese characters because the readers would already have the Chinese font file.
Is that sensible? If so, is there any commonly used Chinese font that one can expect to be installed in most of the PDF readers used by Chinese people?
The best way to ensure that a PDF file can be displayed on any reader is to use partially embedded fonts (also known as font subsets). In PDF, you don't need to include the whole font with your document; a subset with just the glyphs that were used in the file is enough for the file to be portable.

Detect embedded characters in PDF using PdfBox

I am extracting text from a PDF file using PdfBox. When the PDF does not contain any embedded fonts, everything works fine. The problem occurs when there are embedded TrueType fonts. I discovered that in some cases the embedded fonts replace the shapes of default characters with other shapes. For example, the char code for 'ï' is used to encode 'ł'. I am aware that I cannot get the real shape of the character without a mapping or OCR. I would like to know which characters might be redefined by the embedded fonts. My question is: how can I know which characters in the PDF stream are defined by the embedded fonts?