The PDF Reference says:
A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions, using an internal data structure called a “cmap”
It goes on to explain that the behaviour of a PDF processor depends on which cmap subtables are present in the font file.
I am trying to analyze a .ttf font file extracted using fontforge from a PDF that was generated by LibreOffice. The PDF embeds this font file as a simple font, using single-byte codes. When I look at the .ttf file in fontdrop.info, it tells me the "glyphIndexMap" is as follows:
{"0":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0,"10":0,"11":0,"12":0,"13":0,"14":0,"15":0,"16":0,"17":0,"18":0,"19":0,"20":0,"21":0,"22":0,"23":0,"24":0,"25":0,"26":0,"27":0,"28":0,"29":0,"30":0,"31":0,"32":0,"33":0,"34":0,"35":0,"36":0,"37":0,"38":0,"39":0,"40":0,"41":0,"42":0,"43":0,"44":0,"45":0,"46":0,"47":0,"48":0,"49":0,"50":0,"51":0,"52":0,"53":0,"54":0,"55":0,"56":0,"57":0,"58":0,"59":0,"60":0,"61":0,"62":0,"63":0,"64":0,"65":0,"66":0,"67":0,"68":0,"69":0,"70":0,"71":0,"72":0,"73":0,"74":0,"75":0,"76":0,"77":0,"78":0,"79":0,"80":0,"81":0,"82":0,"83":0,"84":0,"85":0,"86":0,"87":0,"88":0,"89":0,"90":0,"91":0,"92":0,"93":0,"94":0,"95":0,"96":0,"97":0,"98":0,"99":0,"100":0,"101":0,"102":0,"103":0,"104":0,"105":0,"106":0,"107":0,"108":0,"109":0,"110":0,"111":0,"112":0,"113":0,"114":0,"115":0,"116":0,"117":0,"118":0,"119":0,"120":0,"121":0,"122":0,"123":0,"124":0,"125":0,"126":0,"127":0,"160":0,"161":0,"162":0,"163":0,"165":0,"167":0,"168":0,"169":0,"170":0,"171":0,"172":0,"174":0,"175":0,"176":0,"177":0,"180":0,"181":0,"182":0,"183":0,"184":0,"186":0,"187":0,"191":0,"192":0,"193":0,"194":0,"195":0,"196":0,"197":0,"198":0,"199":0,"200":0,"201":0,"202":0,"203":0,"204":0,"205":0,"206":0,"207":0,"209":0,"210":0,"211":0,"212":0,"213":0,"214":0,"216":0,"217":0,"218":0,"219":0,"220":0,"223":0,"224":0,"225":0,"226":0,"227":0,"228":0,"229":0,"230":0,"231":0,"232":0,"233":0,"234":0,"235":0,"236":0,"237":0,"238":0,"239":0,"241":0,"242":0,"243":0,"244":0,"245":0,"246":0,"247":0,"248":0,"249":0,"250":0,"251":0,"252":0,"255":0,"305":0,"338":0,"339":0,"376":0,"402":0,"675":3,"710":0,"711":0,"728":0,"729":0,"730":0,"731":0,"732":0,"733":0,"916":0,"937":0,"960":0,"8211":0,"8212":0,"8216":0,"8217":0,"8218":0,"8220":0,"8221":0,"8222":0,"8224":0,"8225":0,"8226":0,"8230":0,"8240":0,"8249":0,"8250":0,"8260":0,"8364":0,"8482":0,"8706":0,"8719":0,"8721":0,"8730":0,"8734":0,"8747":0,"8776":0,"8800":0,"8804":0,"8805":0,"9674":0,"57374":0,"64257":0,"64258":0}
(the interesting part is "675":3)
I can understand this insofar as the font contains 4 glyphs, and the glyph at index 3 is the ʣ character (decimal Unicode point 675 / U+02A3).
But in the PDF, this character is used in text strings as <01>, and no other encoding is given - so according to the PDF Reference, the mapping from <01> to the glyph at index 3 must be done according to a mapping within the .ttf file:
If no Encoding entry is specified in the font dictionary, the “cmap” subtable with platform ID 1 and encoding 0 will be used to map directly from character codes to glyph descriptions, without any consideration of character names. This is the normal convention for symbolic fonts.
I have confirmed that no Encoding entry is specified within the PDF. Here are the /Font and /FontDescriptor objects extracted using qpdf:
18 0 obj
<<
/BaseFont /BAAAAA+LiberationSerif
/FirstChar 0
/FontDescriptor 20 0 R
/LastChar 1
/Subtype /TrueType
/ToUnicode 21 0 R
/Type /Font
/Widths [
777
802
]
>>
endobj
20 0 obj
<<
/Ascent 891
/CapHeight 981
/Descent -216
/Flags 4
/FontBBox [
-543
-303
1277
981
]
/FontFile2 23 0 R
/FontName /BAAAAA+LiberationSerif
/ItalicAngle 0
/StemV 80
/Type /FontDescriptor
>>
endobj
So how can I investigate the .ttf file to confirm that "the “cmap” subtable with platform ID 1 and encoding 0" is in place and contains the mappings I think it does?
Edit: the PDF in question
How do I inspect the cmap table and subtables in a TrueType font?
OT Master Light, from Dutch Type Library, is a free tool that's quite handy for inspecting internal font tables.
Using OT Master Light one can see that the (1,0) cmap subtable maps character code 0x01 to glyph index 1 (first screenshot, second entry in the list), which is the 'dz' glyph (second screenshot).
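If you prefer to check this programmatically, the cmap subtables can also be listed with a few lines of code. Here is a minimal sketch using Apache FontBox (the TrueType parser bundled with PDFBox), assuming FontBox 2.x; the file name font.ttf stands for the font file extracted from the PDF, and exact method names may differ slightly between versions:

import java.io.File;
import org.apache.fontbox.ttf.CmapSubtable;
import org.apache.fontbox.ttf.TTFParser;
import org.apache.fontbox.ttf.TrueTypeFont;

public class DumpCmap {
    public static void main(String[] args) throws Exception {
        // "font.ttf" is a placeholder for the extracted font file
        TrueTypeFont ttf = new TTFParser().parse(new File("font.ttf"));

        // list every cmap subtable with its platform and encoding ID
        for (CmapSubtable sub : ttf.getCmap().getCmaps()) {
            System.out.printf("cmap subtable: platform %d, encoding %d%n",
                    sub.getPlatformId(), sub.getPlatformEncodingId());

            // for the (1,0) subtable, show where the single-byte codes go
            if (sub.getPlatformId() == 1 && sub.getPlatformEncodingId() == 0) {
                for (int code = 0; code < 256; code++) {
                    int gid = sub.getGlyphId(code);
                    if (gid != 0) {
                        System.out.printf("  code 0x%02X -> glyph %d%n", code, gid);
                    }
                }
            }
        }
        ttf.close();
    }
}

For the font described in the question, the (1,0) subtable should then report the mapping from code 0x01 to the glyph index discussed above.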
I have a PDF 1.3 from which I want to extract the text.
In the content stream, however, there are two different kinds of text: some plain text and some character-coded text with escape sequences.
Here is an example:
/TextClip BMC
BT
/T1_2 1 Tf
0 Tc 0 Tw 7 Tr 16.2626 0 0 16.2626 37.2512 581.738 Tm
(Test Test)Tj
ET
EMC
q
/GS0 gs
67.6799985 0 0 -13.4399997 37.439994 594.2399583 cm
/Im47 Do
Q
Q
Q
q
37.499 569.52 179.713 8.34 re
W n
q
/GS0 gs
180.959996 0 0 -9.5999998 36.959999 578.3999755 cm
/Im48 Do
Q
Q
q
37.499 569.52 179.713 8.34 re
W n
q
/TextClip BMC
BT
0 Tc 0 Tw 7 Tr 9.899 0 0 9.899 37.2512 569.7178 Tm
[(\000E\000V\000d\000e\000\003\000E\000V\000d\000e)]TJ
ET
EMC
In this example the text "Test Test" appears twice: once as plain text and once as the escaped sequence \000E\000V\000d\000e\000\003\000E\000V\000d\000e.
I only knew that three digits after a backslash form an octal character code. But in my example there sometimes seem to be four digits and sometimes three.
The fourth character after the escape sequence is 15 positions away from the correct ASCII code (\000E is the character "T"). So what is the correct conversion?
The block \000\003 should be a space character. What is the conversion trick there?
Regards
The encoding of the string arguments of text-showing operators like Tj and TJ depends on the PDF font in question, cf. the specification:
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(section 9.4.3 - Text-Showing Operators - in ISO 32000-1)
The font used for the first text showing operation
(Test Test)Tj
probably is a simple font with an ASCII'ish encoding, probably WinAnsiEncoding. The font itself is selected two lines above in
/T1_2 1 Tf
so you only have to look up the font resource T1_2 in the associated resources (the resources of the page, if you are showing us an excerpt of a page content stream) to verify.
The font used in the second text showing operation
[(\000E\000V\000d\000e\000\003\000E\000V\000d\000e)]TJ
appears to be a composite font with a double-byte encoding, probably Identity-H, and the underlying font program appears to have the glyph codes most often found in TrueType fonts. You should look for a ToUnicode mapping in that PDF font for easy decoding.
The instruction in which this font is selected is not among the instructions you posted but must be somewhere above. That selection has been saved as part of the graphics state (by one of the early q instructions) and restored again (by a Q instruction between the two text-showing instructions you shared).
three digits after a backslash form an octal character code. But in my example there sometimes seem to be four digits and sometimes three.
No, in your example there always are escape sequences with three octal digits. The character thereafter is a separate byte, i.e. you have the bytes '\000', 'E', '\000', 'V', '\000', 'd', '\000', 'e', '\000', '\003', '\000', 'E', '\000', 'V', '\000', 'd', '\000', and 'e'.
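To see those bytes explicitly, here is a small self-contained sketch in plain Java that decodes the octal escapes of such a literal string the way a PDF parser would; it only handles the backslash-octal case and ignores the other escape sequences a full parser must support:

public class PdfLiteralDemo {
    public static void main(String[] args) {
        byte[] bytes = decodePdfLiteral(
                "\\000E\\000V\\000d\\000e\\000\\003\\000E\\000V\\000d\\000e");
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }

    // Decode backslash-octal escapes of a PDF literal string into raw bytes.
    // The parameter is the content between the parentheses of the string.
    static byte[] decodePdfLiteral(String s) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()
                    && s.charAt(i + 1) >= '0' && s.charAt(i + 1) <= '7') {
                // one to three octal digits form a single byte
                int value = 0, digits = 0;
                while (digits < 3 && i + 1 < s.length()
                        && s.charAt(i + 1) >= '0' && s.charAt(i + 1) <= '7') {
                    value = value * 8 + (s.charAt(++i) - '0');
                    digits++;
                }
                out.write(value);
            } else {
                out.write(c); // any other character is a byte of its own
            }
        }
        return out.toByteArray();
    }
}

It prints 00 45 00 56 00 64 00 65 00 03 00 45 00 56 00 64 00 65, i.e. the two-byte codes 0045, 0056, 0064, 0065, 0003, ... which the composite font's CMap then maps to glyphs.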
As mentioned above, this looks like a double-byte encoding with in particular the mappings
\000E -> 'T'
\000V -> 'e'
\000d -> 's'
\000e -> 't'
\000\003 -> ' ' (space)
This appears to be a glyph encoding often found in TrueType fonts which for Latin letters merely means a constant offset to their Unicode codes.
But there are many different multi-byte encodings in common use; sometimes they are even ad-hoc encodings created only for the font on the page at hand.
Thus, if you seriously want to do text extraction from PDFs, you really have to study the PDF specification and implement along its requirements instead of hoping for some conversion hack.
Adobe has published a copy of the old PDF specification ISO 32000-1 on their web page at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
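That said, if the goal is merely to get the text out rather than to learn the format, you can use a library that already implements those requirements. A minimal sketch with Apache PDFBox (2.x assumed; input.pdf is a placeholder for your file):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            // PDFTextStripper applies the font encodings and ToUnicode CMaps for you
            String text = new PDFTextStripper().getText(document);
            System.out.println(text);
        }
    }
}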
I've built a PDF using PDFBox, and it has a visible signature too. I write some text like this:
...
builderSting.append("Tm\n");
builderSting.append(" /F1 " + fontSize + "\n");
builderSting.append("Tf\n");
builderSting.append("(hello world)");
builderSting.append("Tj\n");
builderSting.append("ET");
...
PDStream stream= ...;
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
Everything works well, but if I write some Unicode characters into builderString, there are "???"s instead of the text.
Here is the sample PDF: link here
QUESTION 1) When I look at the PDF structure, there are question marks instead of the text. I don't know how to write Unicode characters. How can I do that?
9 0 obj
<<
/Type /XObject
/Subtype /Form
/BBox [100 50 0 0]
/Matrix [1 0 0 1 0 0]
/Resources <<
/Font 11 0 R
/XObject <<
/img0 12 0 R
>>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
>>
/FormType 1
/Length 13 0 R
>>
stream
q 93.70079 0 0 50 0 0 cm /img0 Do Q
BT
1 0 0 1 93.70079 25 Tm
/F1 2
Tf
(????)Tj
ET
endstream
endobj
I have a font with WinAnsiEncoding. Can I use another encoding in PDFBox?
PDFont font = PDTrueTypeFont.loadTTF(template, new File("//fontName.ttf"));
font.setFontEncoding(new WinAnsiEncoding());
QUESTION 2) I've embedded the font in the PDF, but the text is not written with this font (in the visible signature rectangle). Why?
Question 3) When I remove the font, the text is still there (when the text was in English). What is the default font? /F1 - is that the first font?
Question 4) How do I calculate the width of my text in the visible signature? Any ideas?
QUESTION 1) When I look at the PDF structure, there are question marks instead of the text. I don't know how to write Unicode characters. How can I do that?
I assume that with unicode characters you mean characters present in Unicode but not in e.g. Latin-1. (Because the letter 'a' for example does have a Unicode representation, too, but most likely won't cause you trouble.)
You call getBytes("ISO-8859-1") on your StringBuilder result. Your unicode characters most likely are not in ISO 8859-1. Thus, String.getBytes returns the ASCII code for a question mark in their respective place.
If the question was merely how to write unicode characters to an output stream in Java, the answer would be easy: choose an encoding which contains all your characters, e.g. UTF-8, which all consumers of your program support, and call String.getBytes for that encoding.
The case at hand is different, though, as you want to serialize that information as a PDF form XObject stream. In this context your whole approach is somewhere along the route from highly questionable to completely wrong:
In PDFs, each font might come along with its own encoding which might be similar to a common encoding, e.g. /WinAnsiEncoding, or completely custom. These encodings, furthermore, in many cases are restricted to one byte per character, but in case of composite fonts they can also be multi-byte-encodings.
As a corollary, not all elements of the stream need to be encoded using the same encoding. E.g. the operator names Tm, Tf, and Tj are encoded using their ASCII codes, while the characters of a string to be displayed have to be encoded using the respective font's encoding (and may thereafter be hex-encoded again if written between angle brackets <>).
Thus, creating the stream as a string and then converting it to bytes with a single encoding only works if all used fonts use the same encoding (for the actually used code points), which furthermore needs to be ASCII'ish to correctly represent the operators.
Essentially, you should directly construct the stream in some byte buffer and for each inserted element use the appropriate encoding. In case of characters to be displayed, therefore, you have to be aware of the encoding used by the currently selected font.
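To illustrate that, here is a rough, hypothetical sketch in plain Java (deliberately not using any PDFBox helper classes) of building such a content stream in a byte buffer, writing the operators with their ASCII codes and the shown string in the encoding of the selected font; it assumes that font uses WinAnsiEncoding, which for the characters used here coincides with windows-1252, and it omits the escaping of (, ) and \ that a real implementation needs:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class BuildTextStream {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream content = new ByteArrayOutputStream();

        // operators and operands are written with their ASCII codes
        content.write("BT\n1 0 0 1 93.7 25 Tm\n/F1 12 Tf\n(".getBytes(StandardCharsets.US_ASCII));

        // the shown text is written in the encoding of the font selected by /F1
        content.write("héllo wörld".getBytes("windows-1252"));

        content.write(") Tj\nET".getBytes(StandardCharsets.US_ASCII));

        // these bytes are what belongs into the PDF stream object
        byte[] streamBytes = content.toByteArray();
        System.out.println(streamBytes.length + " bytes");
    }
}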
If you want to do it right, first study the PDF specification ISO 32000-1, especially the sections on general syntax and chapter 9 Text.
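If you would rather not assemble the content stream by hand at all, PDFBox can build it and apply the font encoding for you. The following is only a minimal, hypothetical sketch of writing Unicode text onto a plain page (not your signature appearance XObject), assuming PDFBox 2.x; PDType0Font embeds the TrueType font as a composite font and should also create a ToUnicode CMap:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class UnicodeText {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);

            // "font.ttf" is a placeholder for a font containing the needed glyphs
            PDFont font = PDType0Font.load(doc, new File("font.ttf"));

            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                cs.beginText();
                cs.setFont(font, 12);
                cs.newLineAtOffset(50, 700);
                cs.showText("héllo wörld");   // encoded according to the font by PDFBox
                cs.endText();
            }
            doc.save(new File("unicode-text.pdf"));
        }
    }
}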
QUESTION 2) I've embedded the font in the PDF, but the text is not written with this font (in the visible signature rectangle). Why?
In the resources of the stream xobject in question there is exactly one embedded font associated to the name /F0. In your stream, though, you have /F1 2 Tf, i.e. you select a font /F1 at size 2.
Question 3) When I remove the font, the text is still there (when the text was in English). What is the default font?
According to the specification, section 9.3.1,
font shall be the name of a font resource in the Font subdictionary of the current
resource dictionary [...]
There is no initial value for either font or size
Most likely, though, PDF viewers for the sake of compatibility with old or broken documents use some default font.
Question 4) How do I calculate the width of my text in the visible signature? Any ideas?
The width obviously depends on the metrics of the font used (glyph widths in this case) and the graphics state you set (font size, character spacing, word spacing, current transformation matrix, text transformation matrix, ...).
In your case you hardly do anything in the graphics state and, therefore, only the selected font size from it is of interest. So the more interesting part is the character widths from the font metrics. As long as you use the standard 14 fonts, you can find the metrics here. As soon as you start using other, custom fonts, you have to read them from the font definition files yourself.
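With PDFBox, for example, the glyph widths are exposed on the font object, so a rough sketch of the calculation under the assumptions above (only the font size matters, no character or word spacing, no scaling) could look like this; Helvetica and the size 12 are merely placeholders:

import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class TextWidth {
    public static void main(String[] args) throws Exception {
        PDFont font = PDType1Font.HELVETICA;   // one of the standard 14 fonts
        float fontSize = 12;                    // hypothetical font size from your Tf
        String text = "hello world";

        // getStringWidth returns the width in 1/1000 of text space units,
        // so divide by 1000 and multiply by the font size
        float width = font.getStringWidth(text) / 1000f * fontSize;
        System.out.println("width: " + width + " user space units");
    }
}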
Ad 1)
Could it be that
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
should be
stream.createOutputStream().write(builderString.toString().getBytes("UTF-8"));
The conversion to ISO-8859-1 in getBytes turns every special character that is missing from ISO-8859-1 into a ?.
In Adobe Acrobat X I was inserting text objects, and when the file is opened in Adobe Reader 10 it opens properly. But in Adobe Reader 11, when I click on that PDF file the text objects get deleted. Why does this happen? How can I solve it?
The source PDF file: click here
The PDF file which has the problem when double-clicking on it in Adobe Reader 11: click here
In a nutshell:
You try to change the contents of a free text annotation by changing its normal appearance stream.
This is insufficient: a compliant PDF viewer may ignore this entry and provide its own appearance. So it is mere luck that older Adobe Reader versions chose not to ignore your change.
Thus, you also need to change the information a PDF viewer is expected to create their own appearance from, i.e. foremost the rich text value of RC (in the free text annotation dictionary) that shall be used to generate the appearance of the annotation, and also the Contents value which is the Text that shall be displayed for the annotation.
Furthermore there are defects in your PDFs:
the cross reference table in your first attempt result.pdf was broken;
the intent (IT value) of the free text annotation in your source files is spelled incorrectly.
In detail:
Your result.pdf is broken. Different PDF viewers may display broken PDFs differently.
Some details:
It has been created from your Src.pdf in append mode, but additionally the following change has been made to the /Pages object of the original revision:
In the source:
6 0 obj
<</Count 6
/Type /Pages
/Kids [ 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R ]
>>
endobj
In the result:
6 0 obj
<</Count 3
/Type /Pages
/Kids [ 7 0 R 8 0 R 9 0 R 12 0 R 11 0 R 10 0 R ]
>>
endobj
So the order of the last three pages was changed (which is ok) and the /Count was reduced from 6 to 3. This is inconsistent as there still are 6 child objects but according to the PDF specification ISO 32000-1, Count is
The number of leaf nodes (page objects) that are descendants of this node within the page tree.
Furthermore, the cross reference table of the appended revision is broken.
xref
0 1
0000000000 65535 f
24 1
0001465240 00000 n
57 1
0001466075 00000 n
66 1
0001466909 00000 n
73 1
0001467744 00000 n
93 1
0001473484 00000 n
131 1
0001478703 00000 n
The entries are 19 bytes long, including their respective terminating single-byte newline character. According to the spec, though,
Each entry shall be exactly 20 bytes long, including the end-of-line marker.
The format of an in-use entry shall be: nnnnnnnnnn ggggg n eol
where [...] eol shall be a 2-character end-of-line sequence
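For illustration, a conforming entry for object 24 above would be exactly 20 bytes, e.g.
0001465240 00000 n<CR><LF>
i.e. a ten-digit offset, a space, a five-digit generation number, a space, the keyword n, and a two-byte end-of-line marker (CR LF, or a space followed by a single CR or LF) instead of the single line feed your file writes.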
There may be more errors in the PDF but you may want to start fixing these.
EDIT
Now with the new PDF Pay-in.pdf with a proper cross reference at hand, let's look at it more in-depth.
Adobe Preflight complains about many occurrences of:
[...]
An unexpected value is associated with the key
Key: IT
Value: /FreeTextTypewriter
Type: CosName
Formal Representation: Annot.AnnotFreeText
Cos ID: 86
Traversal Path: ->Pages->Kids->[0]->Annots->[13]
[...]
Ok, let's look at that object 86:
86 0 obj
<< /P 8 0 R
/Type /Annot
/CreationDate (D:20130219194939+05'30')
/T (winman)
/NM (0f202782-2274-44b8-9081-af4010be86d4)
/Subj (Typewritten Text)
/M (D:20130219195100+05'30')
/F 4
/Rect [ 53.2308 33.488 552.088 826.019 ]
/DS (font: Helv 12.0pt;font-stretch:Normal; text-align:left; color:#000000 )
/AP <</N 107 0 R >>
/Contents (wwww)
/IT /FreeTextTypewriter
/BS 108 0 R
/Subtype /FreeText
/Rotate 90
/DA (16.25 TL /Cour 12 Tf)
/RC (<?xml version="1.0"?>
<body xmlns="http://www.w3.org/1999/xhtml"
xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/"
xfa:APIVersion="Acrobat:10.0.0"
xfa:spec="2.0.2"
style="font-size:12.0pt;text-align:left;color:#000000;font-weight:normal;
font-style:normal;font-family:Helv;font-stretch:normal">
<p dir="ltr">
<span style="line-height:16.3pt;font-family:Helvetica">wwww</span>
</p>
</body>)
>>
endobj
Preflight stated that it is unhappy about the line /IT /FreeTextTypewriter. Looking at the PDF specification again, one finds for annotations with /Subtype /FreeText, i.e. free text annotations as specified in section 12.5.6.6:
IT name
(Optional; PDF 1.6) A name describing the intent of the free text annotation (see also the IT entry in Table 170). The following values shall be valid:
FreeText The annotation is intended to function as a plain free-text annotation. A plain free-text annotation is also known as a text box comment.
FreeTextCallout The annotation is intended to function as a callout. The callout is associated with an area on the page through the callout line specified in CL.
FreeTextTypeWriter The annotation is intended to function as a click-to-type or typewriter object and no callout line is drawn.
Default value: FreeText
Thus, your value FreeTextTypewriter is invalid (remember, PDF names are case sensitive!). Therefore, the annotation is (slightly) broken which may already result in all kinds of problems.
But there are other important entries here, too, for understanding your issue: all you do in your appended changes is replace the appearance stream of this annotation, object 107 (as per /AP <</N 107 0 R >>), with a different one. But this annotation also contains an RC value, which according to the specification is
A rich text string (see 12.7.3.4, “Rich Text Strings”) that shall be used to generate the appearance of the annotation.
Thus, any PDF viewer may regenerate the appearance from that rich text description, especially as the specification in section 12.5.2 says about the content of the AP dictionary
Individual annotation handlers may ignore this entry and provide their own appearances.
Thus, simply replacing the normal appearance stream does not suffice to permanently change the appearance of that annotation; you have to change the appearance dictionary and at least remove any alternative source for the appearance.
Furthermore the entry /Contents (wwww) is not replaced by your appended changes either. So a PDF viewer trying to decide whether to use the appearance stream or not will feel tempted to somehow create a new appearance as your appearance stream in no way represents that value.
Especially when starting to manipulate the free text (e.g. when clicking into the PDF in your case), the PDF viewer knows it will eventually have to create a new appearance anyway, and unless the current appearance is exactly as it would have created it, the viewer may prefer to begin anew, starting with an appearance derived from the rich text or even the contents value.
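If you apply your changes programmatically, e.g. with Apache PDFBox as used elsewhere on this page, the kind of update meant here could look roughly like the following sketch (PDFBox 2.x assumed; the file names, the new text, and the rich text markup are placeholders you would have to adapt):

import java.io.File;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;

public class UpdateFreeText {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("Pay-in.pdf"))) {
            for (PDAnnotation annot : doc.getPage(0).getAnnotations()) {
                if ("FreeText".equals(annot.getSubtype())) {
                    // keep Contents and the rich text RC in sync with the intended appearance
                    annot.setContents("new text");
                    COSDictionary dict = annot.getCOSObject();
                    dict.setString(COSName.getPDFName("RC"),
                            "<?xml version=\"1.0\"?><body xmlns=\"http://www.w3.org/1999/xhtml\"><p dir=\"ltr\">new text</p></body>");
                    // optionally drop the stale appearance so viewers regenerate it from RC/Contents
                    dict.removeItem(COSName.AP);
                }
            }
            doc.save(new File("Pay-in-fixed.pdf"));
        }
    }
}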