PDFBox changes font name from Japanese characters to English characters - pdfbox

When merging two PDFs with PDFBox, or opening a PDF in the PDFBox debugger, a font whose name contains non-Latin characters, such as MSゴシック or MS明朝, shows up with a garbled name such as ",l,rfsfvfbfn" or ",l,r-3/4c".
If the PDF uses the font under its English name, "MS Gothic", the documents merge properly.
The changed font name causes spacing problems in the merged document and makes it unreadable, because some characters overlap.
[Image: difference in font when opened in PDFDebugger]
[Image: difference in font after merging]
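One possible explanation, which the question does not confirm, is that the font name is stored as Shift_JIS bytes and is being read as a Western single-byte encoding. The sketch below assumes exactly that (Shift_JIS in, windows-1252 out) and uses a hypothetical helper named misread; it does not call PDFBox. Under that assumption the Gothic name comes out as "‚l‚rƒSƒVƒbƒN", which resembles ",l,rfsfvfbfn" once lower-cased and reduced to ASCII.

```java
import java.nio.charset.Charset;

// Sketch only: reproduces the garbling under the assumption that the font
// name bytes are Shift_JIS and are being decoded as windows-1252.
final class FontNameMojibake {

    // Encode the name as Shift_JIS, then decode those bytes as windows-1252.
    static String misread(String name) {
        byte[] raw = name.getBytes(Charset.forName("Shift_JIS"));
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Full-width "MS" followed by the katakana spelling of "Gothic".
        String gothic = "\uFF2D\uFF33\u30B4\u30B7\u30C3\u30AF";
        // Full-width "MS" followed by the kanji spelling of "Mincho".
        String mincho = "\uFF2D\uFF33\u660E\u671D";
        System.out.println(misread(gothic));
        System.out.println(misread(mincho));
    }
}
```

If the output matches what the debugger shows, the name itself is probably intact in the file and only its decoding differs.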

Related

how do I extract the Arabic text of this PDF file correctly?

Today I tried to search for an Arabic word in a PDF file that contains Arabic content.
No PDF reader software can find any Arabic word in this PDF file.
So I dragged the PDF file into the Firefox browser, selected an area containing a few words with Inspect Element, and saw this:
hw ½oiC instead of آخرین سخن
What type of encoding is used in this PDF file?
How can I convert this to normal text?
It's difficult to comment on the file you are looking at without seeing it, but a good starting point is to open it in Acrobat. Copying the text and pasting it into a text editor, or searching for the text content, will reveal whether it can be extracted correctly.
If it can't be extracted properly, there's a good chance the font lacks a ToUnicode entry (see Section 9.10.1 of the ISO 32000-1:2008 PDF specification for more information).
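To illustrate what a ToUnicode entry provides, here is a minimal sketch of a lookup from character codes to Unicode text. The codes and the table are invented for illustration and are not taken from the file in the question; real ToUnicode CMaps are more involved. When no entry exists for a code, the sketch passes the code through unchanged, which is roughly how unreadable text ends up in copy and search results.

```java
import java.util.Map;

// Sketch only: the codes and mappings below are invented for illustration
// and are not taken from the PDF discussed in the question.
final class ToUnicodeSketch {

    // Translate each character code through the table; codes without an
    // entry are appended unchanged.
    static String decode(String codes, Map<Character, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (char code : codes.toCharArray()) {
            out.append(toUnicode.getOrDefault(code, String.valueOf(code)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical table: codes 'A', 'B', 'C' stand for three Arabic letters.
        Map<Character, String> table = Map.of(
                'A', "\u0633",  // ARABIC LETTER SEEN
                'B', "\u062E",  // ARABIC LETTER KHAH
                'C', "\u0646"); // ARABIC LETTER NOON
        System.out.println(decode("ABC", table));    // codes mapped to Arabic letters
        System.out.println(decode("ABC", Map.of())); // no table: codes pass through
    }
}
```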

Pdf partial font embedding with iText

I have been asked to embed a partial font (a subset) in a PDF.
I think I will use iText. I found out how to embed a font, but I found no clue about partial embedding.
Does anybody know whether partial embedding is automatic? Or does iText perhaps not have this feature?
Thank you.
When does iText embed the full font, a subset, or no font?
In this answer, it is assumed that you use the BaseFont class and the Font class like this:
BaseFont bf = BaseFont.createFont(pathToFont, encoding, embedded);
Font font = new Font(bf, 12);
In this snippet:
pathToFont is the path to a font file (.ttf, .ttc, .otf, .afm),
encoding is an encoding such as "winansi", BaseFont.IDENTITY_H,...
embedded is a boolean: true or false.
Will iText embed the font or not?
That's determined by the embedded parameter:
If it is false, the font isn't embedded.
If it is true, the font is embedded, except in the case of the Standard Type 1 fonts or Type 1 fonts for which the .pfb file is missing or CJK fonts.
Regarding the exceptions:
The Standard Type 1 fonts are 4 flavors of Helvetica (regular, bold, italic, bold-italic), 4 flavors of Times Roman (...), 4 flavors of Courier (...), Symbol and ZapfDingbats. iText ships with 14 Adobe Font Metrics (AFM) files. These files contain the metrics that are needed to calculate the widths of glyphs and words. iText doesn't have the Printer Font Binary (PFB) files that are required to embed these fonts.
Type 1 fonts are stored in two files: an AFM file and a PFB file. If you provide an AFM file, iText will look for the PFB file in the same directory. If iText doesn't find any PFB file, the font can't be embedded.
CJK stands for a series of Chinese, Japanese and Korean fonts that are available in downloadable font packs. These are a special type of Asian font; Asian fonts in .ttf, .otf or .ttc files can be embedded.
Will iText subset the font or not?
iText will always try to embed a subset of the font, not the whole font, unless you provide a Type 1 font (AFM and PFB file). If a Type 1 font is provided, the full font is embedded.
Can iText embed the full font?
Yes, you can force iText to embed the full font by adding this line:
bf.setSubset(false);
However, this value will be ignored if you use the encoding Identity-H because that's how it's described in ISO 32000-1. iText will only embed full fonts that are stored inside the PDF as a simple font (256 characters); iText will never embed full fonts that are stored as a composite font (up to 65,535 characters).
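The rules in this answer can be summarized as a small decision function. This is a plain-Java sketch of the logic as described above, not iText code: the class, enum and method names are hypothetical, and the four parameters stand for the font type, the embedded flag, the setSubset value and whether the encoding is Identity-H.

```java
// Sketch only: a plain-Java summary of the rules described in the answer.
// It does not call iText; all names here are hypothetical.
final class EmbeddingRules {

    enum FontKind {
        STANDARD_TYPE1,       // one of the 14 built-in fonts (AFM only)
        TYPE1_AFM_ONLY,       // Type 1 font whose PFB file is missing
        TYPE1_AFM_AND_PFB,    // Type 1 font with both files present
        CJK_FONT_PACK,        // CJK font from a downloadable font pack
        TRUETYPE_OR_OPENTYPE  // .ttf, .otf or .ttc file
    }

    enum Result { NOT_EMBEDDED, SUBSET, FULL_FONT }

    static Result decide(FontKind kind, boolean embedded, boolean subset, boolean identityH) {
        if (!embedded) {
            return Result.NOT_EMBEDDED;
        }
        switch (kind) {
            case STANDARD_TYPE1:
            case TYPE1_AFM_ONLY:
            case CJK_FONT_PACK:
                return Result.NOT_EMBEDDED;   // the listed exceptions
            case TYPE1_AFM_AND_PFB:
                return Result.FULL_FONT;      // Type 1 fonts are never subset
            default:
                // Identity-H overrides setSubset(false): always a subset.
                return (subset || identityH) ? Result.SUBSET : Result.FULL_FONT;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(FontKind.TRUETYPE_OR_OPENTYPE, true, false, true));
        System.out.println(decide(FontKind.TRUETYPE_OR_OPENTYPE, true, false, false));
    }
}
```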

Which Chinese font is commonly supported by PDF readers of Chinese people?

I am generating PDF files that contain English and Chinese characters (using the Ruby Prawn library). I don't want to embed a Chinese font file in the generated PDF files, because these files need to stay small. So I'm wondering whether I could just mention a Chinese font name in my PDF files and have the PDF readers render the Chinese characters correctly, because the readers would already have the Chinese font file.
Is that sensible? If so, is there a commonly used Chinese font that one can expect to be installed with most of the PDF readers used by Chinese people?
The best way to ensure that a PDF file can be displayed in any reader is to use partially embedded fonts (also known as font subsets). In PDF, you don't need to include the whole font with your document; a subset with just the glyphs used in the file is enough for the file to be portable.
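As a rough illustration of why a subset stays small, the sketch below counts the distinct characters in a short mixed English and Chinese string. It is only a proxy: real subsetting works on glyphs rather than code points, and the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch only: distinct code points as a rough proxy for the glyphs a
// subset would need. Real subsetters operate on glyph IDs.
final class UsedGlyphs {

    // Collect every distinct Unicode code point that occurs in the text.
    static Set<Integer> usedCodePoints(String text) {
        Set<Integer> used = new TreeSet<>();
        text.codePoints().boxed().forEach(used::add);
        return used;
    }

    public static void main(String[] args) {
        // "ni hao, hello ni hao" with the Chinese words written in hanzi.
        String text = "\u4F60\u597D, hello \u4F60\u597D";
        System.out.println(usedCodePoints(text).size() + " distinct characters");
    }
}
```

A full Chinese font typically contains thousands of glyphs, so a document that uses only a few of them needs only a small part of the font data.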

PdfBox - change font of text in PDF

Is it possible to change the fonts of text in an existing PDF with PdfBox? If so, how? I have problems with some special fonts in a PDF and I want to change them to a font that is widely supported.
Thanks

How to find the used characters in a subsetted font?

I have PDF files which are dynamically generated, with text, vectors, and subsetted fonts. I can see which fonts are used in various viewers - is there a way of displaying the actual subsetted characters of those fonts?
For example, I see that the document contains the subsetted fonts "AAAAAC+FreeMono" and "AAAAAD+DejaVuSans". How do I find out how many characters were subsetted from these fonts, and which characters they were?
(I tried loading the fonts in FontForge, but it just crashes while opening the file.)
The solution is to save the font data to a file and load it into a font editor. A subset font file is still a valid font file, but FontForge may expect some data in the font that is not there. I have also seen many fonts that are not properly subset, and this could cause loading problems in a font editor as well.
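As a side note on the names in the question: in PDF, a subset font name carries a tag of six uppercase letters followed by a plus sign, as in "AAAAAC+FreeMono". The sketch below only recognizes that tag and strips it; it does not list the characters in the subset, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: recognizes the six-letter subset tag in a PDF font name.
final class SubsetTag {

    // Six uppercase letters, a plus sign, then the underlying font name.
    private static final Pattern TAGGED = Pattern.compile("^([A-Z]{6})\\+(.+)$");

    static boolean isSubsetName(String fontName) {
        return TAGGED.matcher(fontName).matches();
    }

    // Return the name without its subset tag, or the name unchanged if untagged.
    static String baseName(String fontName) {
        Matcher m = TAGGED.matcher(fontName);
        return m.matches() ? m.group(2) : fontName;
    }

    public static void main(String[] args) {
        System.out.println(baseName("AAAAAC+FreeMono"));
        System.out.println(baseName("AAAAAD+DejaVuSans"));
    }
}
```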