PDF font mapping error - pdf

While rendering a PDF file generated by PDFCreator 0.9.x. I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about, Acrobat does wonders in rendering faulty PDF files hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I trief to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (See stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The char is referenced by char #1. Given the Encoding of the font it contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now in the embedded subset font there is a cmap that matches no glyph at character 65 (eg capital A) the cmap section of the font does define the character in exactly the order in the PDF file Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs..
Solution
In the ISO32000 file there is a remark that symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the encoding is not allowed and you should IGNORE it, using a simple 1on1 encoding always. SO all in all, if it is a symbolic font, I ignore the Encoding object altogether and this solves the problem.

The first point is that the file opens and renders correctly in Acrobat, so its almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.

Related

Nbspace not available

I am using pdfbox 2.0.9
I have a pdf with acrofrom only and I want set nbspace character to a field:
field.setValue("\u00A0");
But I get error:
java.lang.IllegalArgumentException: U+00A0 ('nbspace') is not available in this font Courier encoding: WinAnsiEncoding
I understand font on current field is not supporting these character.
How can I with pdfbox2.0.14 get pdf fonts list available on my pdf?
This topic might be related How to print `Non-breaking space` to a pdf using apache pdf box?
The text fields in your PDF use the font Helv.
The AcroForm resources font Helv is defined with the following encoding:
5 0 obj
<<
/Type/Encoding
/Differences[
24/breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
39/quotesingle
96/grave
128/bullet/dagger/daggerdbl/ellipsis/emdash/endash/florin/fraction
/guilsinglleft/guilsinglright/minus/perthousand/quotedblbase/quotedblleft
/quotedblright/quoteleft/quoteright/quotesinglbase/trademark/fi/fl/Lslash
/OE/Scaron/Ydieresis/Zcaron/dotlessi/lslash/oe/scaron/zcaron
160/Euro
164/currency
166/brokenbar
168/dieresis/copyright/ordfeminine
172/logicalnot/.notdef/registered/macron/degree/plusminus/twosuperior
/threesuperior/acute/mu
183/periodcentered/cedilla/onesuperior/ordmasculine
188/onequarter/onehalf/threequarters
192/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla
/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex
/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis
/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn
/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae
/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute
/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde
/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute
/thorn/ydieresis
]
>>
endobj
As there is no font program embedded for this font, this encoding is based on the StandardEncoding. This base encoding does not contain a non-breaking space. Furthermore your Differences array does not add nbspace either.
Thus, you cannot draw a non-breaking space using that encoding and, therefore, also not using that Helv font.
As far as I know, PDFBox does not supply replacement fonts in such a case, i.e. if asked to create a new text field appearance while setting a value which contains a character not supported in the form field default appearance font encoding.
One work-around might be to not ask PDFBox to generate an appearance to start with, instead mark the AcroForm with a NeedAppearances value true, and hope a later PDF processor / viewer does use a replacement font in such a case. There is no guarantee this works, probably the next processor needing appearances also doesn't supply replacement fonts. Nonetheless, there at least is a chance it does...
Depending on the exact version of PDFBox, though,
field.setValue(value);
may always trigger appearance generation. If that is the case for you, you have to set the field value like this
field.getCOSObject().setString(COSName.V, value);

Writing Unicode into PDF

I have Unicode text (a sequence of Unicode codes) and a TTF font (bytes of a TTF file). I would like to write that text into a PDF file using that font.
I understand PDF quite well. I don't mind using two bytes per character. I would like to attach the TTF file as it is (charcode-to-glyf map should be used from a TTF file).
What font Subtype and Encoding value should I use? Is it possible to avoid having ToUnicode record?
I tried to use Subtype = "/TrueType", but it requires to specify FirstChar, LastChar and Widths (which are already inside TTF).
You cannot use Unicode with a Font, at all (except in the limited case of Latin, or nearly Latin, languages), because Fonts use an Encoding, and an Encoding is a single byte array. So you can't reference more than 256 characters from a Font, and a character code can't be more than a single byte.
The first problem with 'using Unicode' is that Unicode is not a simple 2-byte Encoding, its a multi-byte format, with variable lengths and sometimes a single glyph is represented by multiple Unicode code points.
So, in order to deal with this you need to use a CIDFont, not a Font. You cannot 'use the charcode-to-glyf map', by which I assume you mean the CMAP subtable in the TTF font. You must compose the CIDFont with a CMap in order to map the multiple bytes in the text string into the character codes for lookup in the CMap, which gives you the CID to reference the precise character program in the font.
It may be possible to construct a single CMap which would cover every Unicode code point, but I have my doubts, it would certainly be a huge task. However certain CMaps already exist. Adobe publish a standard list on their web site which includes CMaps such as UniCNS-UCS2-H and UniCNS-UCS2-V or UniGB-UTF8-H etc.
You can probably use one of the standard CMaps.
Note that it doesn't matter that the FirstChar, LastChar etc are already stored in the TrueType font, you still need to specify them in the PDF Font object. That's because a PDF consumer might not be rendering the text at all, it could (for example) be extracting the text, in which case it doesn't need to interpret the font provided this information is available.

PDF toUnicode cmap table restore

I have multiple pdf files without 'toUnicode' cmap table. Absence of cmap table restricts me from copying the text from pdf files.
As far as I know, there is a possibility to add 'toUnicode' mapping in pdf file, but in my case adding static values is not an option, different files have different glyph codes.
So the question is the following. Is there any possibility to restore 'toUnicode' cmap table, perhaps with the help of Ghostscript, or are there any options at all?
Thanks.
No, you cannot add ToUnicode CMaps to an existing PDF file using Ghostscript.
In the general case, you can't do it at all, except manually. As you note in the question, different files will be constructed to use different character code->Glyph mappings, which means that the character code to Unicode mapping will also be different.
Since the character code selection is often based on the order in which glyphs are used in a file (so the fist glyph is character code 1, the second is character code 2 etc) you can see that there is no prospect of identifying a 'one size fits all' solution.
You could use some kind of OCR to scan the rendered output, identify each glyph and find the Unicode code point for it. Then you could construct a CMap by identifying the character code for the glyph and mapping it to the Unicode value.
You could, then, add the ToUnicode CMap to the PDF file, and update the Font Descriptor with the object number of the ToUnicode CMap.
Ghostscript won't do any of that for you, and I haven't heard of any tool which will.

How can I change type 3 font using ghostscript?

I have a postscript file which contains Type 3 Font.After converting that postscript to pdf using "gs" command ,I am unable to extract the text from pdf file.Is there any possibility to avoid change Type 3 Fonts to some other font, by substituting or some other way ,so that I can copy the text?
This is another case of miscomprehension regarding type 3 fonts. The fact that a font is a type 3 font has little to do with whether a PostScript program or PDF file using the font is 'searchab;e' or not.
Fonts in PostScript and PDF have an 'Encoding' which maps the character codes 0-255 to a named procedure in the font. Executing that procedure draws the glyph. The character codes can be anything, but are often (for Latin fonts) chosen to match the ASCII encoding.
PDF has the additional concept of a ToUnicode CMap, additional information which maps a character code in a font to a Unicode code point. PostScript has no such analogue, that's not what PostScript is for (its also not what PDF was originally for, which is why ToUnicode CMaps are a later addition to the PDF standard).
In the absence of a ToUnicode CMap Acrobat uses undocumented heuristics to try and guess what the text is. The obvious one (and the only one we know of) is that it treats the character codes as ASCII.
Now, if your original PostScript program has an encoding that maps the character codes as if they were ASCII< then provided you do not subset the font, the resulting PDF file should also contain ASCII character codes. If you do subset the font then the pdfwrite device will reorder the glyphs and the character codes will no longer be ASCII.
If your original PostScirpt file does not order the glyphs in the font using ASCII character codes then there is nothing you can do other than apply OCR, the information simply is not present.
But forget about altering the font type, not only is it not likely to be possible, it isn't the problem.

extract information from tables in truetype font file

While parsing a pdf file, my parser encounter a Tf operator with the value of the SubType entry in the font dictionary set to TrueType. The Encoding entry is not present, the symblic flag is set.
My question is : how do I suppose to map the character codes to characters with no encoding ?
The PDF reference section 5.5.5 Character Encoding states that TrueType font has internal data represented in tables in the font files. It seems that those tables would help me map the character codes. Am I getting it right ? How can I extract those information from the font file ?
The font file extracted from the PDF gave something like :
I read Apple's documentation The True Type Font File but still don't get how can I extract those informations from those tables.
Any help, links or reading suggestion would be greatly appreciated.
Symblic flag means that encoding is set to [0..255] range. Every character code must be in the this range. Font presents glyphs only for these codes.
Here is a great set of resources regarding TrueType and OpenType font formats.
You can use freetype library function FT_Get_Char_Index for going from a character code to a glyph index. See FT_Get_Char_Index
You'll have to dump the truetype font to file and load it with freetype to get an FT_Face first.