Decoding a FlateDecoded section of text in a PDF document - pdf

Using peepdf I am analyzing two simple pdf files. Both files contain a single line of text ("ZYXWVUTSRQQRSTUVWXYZ") and were created on Mac OS X.
The first file was created with TextEdit. There are only three streams, and looking at the first one (automatically decoded with peepdf) shows the text clearly.
PPDF> stream 4
q Q q 72 707.272 468 12.72803 re W n /Cs1 cs 0 sc q 0.9790795 0 0 -0.9790795 72 720
cm BT 0.0001 Tc 11 0 0 -11 5 10 Tm /TT1 1 Tf (ZYXWVUTSRQQRSTUVWXYZ) Tj ET
Q Q
The second file was created with MS Word. There are four streams, but the decoded text is nowhere to be found. Looking at the corresponding stream in the Word-generated PDF does not reveal the string:
PPDF> stream 4
q Q q 18 40 576 734 re W n /Cs1 cs 0 0 0 sc q 0.24 0 0 0.24 90 708.72 cm BT
-0.0004 Tc 50 0 0 50 0 0 Tm /TT2 1 Tf [ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\))
-1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ] TJ ET Q q 0.24 0 0 0.24 239.168 708.72
cm BT 50 0 0 50 0 0 Tm /TT2 1 Tf (+) Tj ET Q Q
It's not apparent to me where the string is in the file or what the information in this stream means. Any insights?

It's not apparent to me where the string is in the file
In general you won't see the clear text in the content stream, because the encoding used there need not be a standard encoding or anything ASCII'ish.
[ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\)) -1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ] TJ
This operation in its array operand contains your ZYXWVUTSRQQRSTUVWXYZ with some kerning corrections for certain pairs of characters.
It looks like an ad hoc encoding using the bytes from 33 (= 0x21 = '!') onwards. '!' is used for the first glyph needed, the Z, '"' for the second one needed Y, '#' for the third one X, etc. Your test string not only starts with these chars but also ends with them, and so does the array above, (!") -1 (#) ... (#) -1 (") -1 (!).
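To illustrate the guess above, here is a minimal Python sketch of such an ad hoc encoding. It assumes (this is not confirmed by the file) that the producer hands out codes starting at 33 in order of first appearance; the function name ad_hoc_encode is made up for this example.

```python
def ad_hoc_encode(text, first_code=33):
    """Assign codes in order of first appearance, starting at first_code.

    This only mimics the pattern seen in the content stream; how Word
    really picks the codes is an assumption.
    """
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = first_code + len(codes)
    return bytes(codes[ch] for ch in text), codes


encoded, table = ad_hoc_encode("ZYXWVUTSRQQRSTUVWXYZ")
print(encoded)  # the bytes to compare with the strings in the TJ array
```

Joining the string pieces of the TJ array (and dropping the backslashes that escape the parentheses) gives the same byte sequence, which supports the guess.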
Inspect the definition of the font used (TT2). It may (or may not) include information that helps you decode this encoding.
or what the information in this stream means. Any insights?
To understand the contents of PDF content streams, you should read the relevant sections of the PDF specification ISO 32000-1, especially chapters 8 Graphics and 9 Text.
As your question is focused on the recognition of text content, e.g. read section 9.10.2 Mapping Character Codes to Unicode Values:
A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):
If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):
a) Map the character code to a character name according to Table D.1 and the font’s Differences array.
b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
a) Map the character code to a character identifier (CID) according to the font’s CMap.
b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).
d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).
e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
NOTE Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version. (Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
Edit: Concerning the comment
One of the objects gave some font info. It is 'JJOWGO+Cambria' and references object 16 as the 'font file' which was also unreadable. I'll review the manual. Can't find anything online about 'JJOWGO'.
You won't find anything specific about JJOWGO because it most likely is a random letter sequence prefixed to Cambria to indicate that not all of that font is embedded but only a subset. Cf. section 9.6.4 Font Subsets of ISO 32000-1:
PDF documents may include subsets of Type 1 and TrueType fonts. The font and font descriptor that describe a font subset are slightly different from those of ordinary fonts. These differences allow a conforming reader to recognize font subsets and to merge documents containing different subsets of the same font. (For more information on font descriptors, see 9.8, "Font Descriptors".)
For a font subset, the PostScript name of the font—the value of the font’s BaseFont entry and the font descriptor’s FontName entry— shall begin with a tag followed by a plus sign (+). The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file shall have different tags.
EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font.
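The naming rule quoted above can be checked mechanically. A small Python sketch follows; the helper name split_subset_tag is made up for this example.

```python
import re

# Six uppercase letters, a plus sign, then the actual font name.
SUBSET_NAME = re.compile(r"([A-Z]{6})\+(.+)")


def split_subset_tag(base_font):
    """Return (tag, font_name); tag is None if the name has no subset tag."""
    match = SUBSET_NAME.fullmatch(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font


print(split_subset_tag("JJOWGO+Cambria"))
```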
<<
/FontBBox [ -1475 -2463 2867 3117 ]
/StemV 0
/FontFile2 16 0 R
/Descent -222
/XHeight 467
/Flags 4
/Ascent 950
/FontName /JJOWGO+Cambria
/Type /FontDescriptor
/ItalicAngle 0
/AvgWidth 615
/MaxWidth 2919
/CapHeight 667
>>
This font descriptor contains no obvious encoding information. Have a look at the actual Font dictionary and look for a ToUnicode entry, cf. the quotation of section 9.10.2 above.

Comments from @mkl made it clear what is happening. The text in the PDF produced by MS Word was using a character map.
I tracked down the font dictionary by searching for objects with a ToUnicode entry:
<< /FirstChar 33
/Widths [ 538 570 571 921 604 648 593 496 621 653 220 ]
/Type /Font
/BaseFont /JJOWGO+Cambria
/LastChar 43
/Subtype /TrueType
/FontDescriptor 13 0 R
/ToUnicode 14 0 R >>
The ToUnicode entry referenced object 14, so I looked at that next:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
1 beginbfchar
<2b><0009 000d 0020 00a0>
endbfchar
10 beginbfrange
<21><21><005a>
<22><22><0059>
<23><23><0058>
<24><24><0057>
<25><25><0056>
<26><26><0055>
<27><27><0054>
<28><28><0053>
<29><29><0052>
<2a><2a><0051>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
Section 9.10.3 of ISO 32000-1 explains how a beginbfrange block maps ranges of character codes to Unicode values. The "range" 21-21 contains a single character code, 0x21 ("!"), which is mapped to U+005A ("Z"). The mapping contains one line for every character in my test document, from Z to Q (codes "!" to "*").
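For illustration, here is a rough Python sketch that reads the bfchar and bfrange entries of the CMap above and applies them to the bytes from the TJ array. It is not a full CMap parser: it assumes single values as bfrange targets (the array form [<...> <...>] is not handled), and the names parse_tounicode and CMAP_EXCERPT are made up for this example.

```python
import re

CMAP_EXCERPT = """
1 beginbfchar
<2b><0009 000d 0020 00a0>
endbfchar
10 beginbfrange
<21><21><005a>
<22><22><0059>
<23><23><0058>
<24><24><0057>
<25><25><0056>
<26><26><0055>
<27><27><0054>
<28><28><0053>
<29><29><0052>
<2a><2a><0051>
endbfrange
"""


def utf16(hex_digits):
    # White space inside a PDF hex string is ignored.
    return bytes.fromhex(re.sub(r"\s+", "", hex_digits)).decode("utf-16-be")


def parse_tounicode(cmap_text):
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F\s]+)>", block):
            mapping[int(src, 16)] = utf16(dst)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        entries = re.findall(
            r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F\s]+)>", block)
        for low, high, dst in entries:
            base = utf16(dst)
            for offset in range(int(high, 16) - int(low, 16) + 1):
                # Only the last character of the target is incremented.
                mapping[int(low, 16) + offset] = base[:-1] + chr(ord(base[-1]) + offset)
    return mapping


mapping = parse_tounicode(CMAP_EXCERPT)
codes = b"!\"#$%&'()**)('&%$#\"!"  # strings of the TJ array, kerning numbers dropped
decoded = "".join(mapping[code] for code in codes)
print(decoded)
```

Each byte is treated as one character code because TT2 is a simple TrueType font; a composite font would need the codespace ranges to split the string into codes first.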

Related

How do I inspect the cmap table and subtables in a TrueType font?

The PDF Reference says:
A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions, using an internal data structure called a “cmap”
It goes on to explain that the behaviour of a PDF processor depends on which cmap subtables are present in the font file.
I am trying to analyze a .ttf font file extracted using fontforge from a PDF that was generated by LibreOffice. The PDF embeds this font file as a simple font, using single-byte codes. When I look at the .ttf file in fontdrop.info, it tells me the "glyphIndexMap" is as follows:
{"0":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0,"10":0,"11":0,"12":0,"13":0,"14":0,"15":0,"16":0,"17":0,"18":0,"19":0,"20":0,"21":0,"22":0,"23":0,"24":0,"25":0,"26":0,"27":0,"28":0,"29":0,"30":0,"31":0,"32":0,"33":0,"34":0,"35":0,"36":0,"37":0,"38":0,"39":0,"40":0,"41":0,"42":0,"43":0,"44":0,"45":0,"46":0,"47":0,"48":0,"49":0,"50":0,"51":0,"52":0,"53":0,"54":0,"55":0,"56":0,"57":0,"58":0,"59":0,"60":0,"61":0,"62":0,"63":0,"64":0,"65":0,"66":0,"67":0,"68":0,"69":0,"70":0,"71":0,"72":0,"73":0,"74":0,"75":0,"76":0,"77":0,"78":0,"79":0,"80":0,"81":0,"82":0,"83":0,"84":0,"85":0,"86":0,"87":0,"88":0,"89":0,"90":0,"91":0,"92":0,"93":0,"94":0,"95":0,"96":0,"97":0,"98":0,"99":0,"100":0,"101":0,"102":0,"103":0,"104":0,"105":0,"106":0,"107":0,"108":0,"109":0,"110":0,"111":0,"112":0,"113":0,"114":0,"115":0,"116":0,"117":0,"118":0,"119":0,"120":0,"121":0,"122":0,"123":0,"124":0,"125":0,"126":0,"127":0,"160":0,"161":0,"162":0,"163":0,"165":0,"167":0,"168":0,"169":0,"170":0,"171":0,"172":0,"174":0,"175":0,"176":0,"177":0,"180":0,"181":0,"182":0,"183":0,"184":0,"186":0,"187":0,"191":0,"192":0,"193":0,"194":0,"195":0,"196":0,"197":0,"198":0,"199":0,"200":0,"201":0,"202":0,"203":0,"204":0,"205":0,"206":0,"207":0,"209":0,"210":0,"211":0,"212":0,"213":0,"214":0,"216":0,"217":0,"218":0,"219":0,"220":0,"223":0,"224":0,"225":0,"226":0,"227":0,"228":0,"229":0,"230":0,"231":0,"232":0,"233":0,"234":0,"235":0,"236":0,"237":0,"238":0,"239":0,"241":0,"242":0,"243":0,"244":0,"245":0,"246":0,"247":0,"248":0,"249":0,"250":0,"251":0,"252":0,"255":0,"305":0,"338":0,"339":0,"376":0,"402":0,"675":3,"710":0,"711":0,"728":0,"729":0,"730":0,"731":0,"732":0,"733":0,"916":0,"937":0,"960":0,"8211":0,"8212":0,"8216":0,"8217":0,"8218":0,"8220":0,"8221":0,"8222":0,"8224":0,"8225":0,"8226":0,"8230":0,"8240":0,"8249":0,"8250":0,"8260":0,"8364":0,"8482":0,"8706":0,"8719":0,"8721":0,"8730":0,"8734":0,"8747":0,"8776":0,"8800":0,"8804":0,"8805":0,"9674":0,"57374":0,"64257":0,"64258":0}
(the interesting part is "675":3)
I can understand this insofar as the font contains 4 glyphs, and the glyph at index 3 is the ʣ character (decimal Unicode point 675 / U+02A3).
But in the PDF, this character is used in text strings as <01>, and no other encoding is given - so according to the PDF Reference, the mapping from <01> to the glyph at index 3 must be done according to a mapping within the .ttf file:
If no Encoding entry is specified in the font dictionary, the “cmap” subtable with platform ID 1 and encoding 0 will be used to map directly from character codes to glyph descriptions, without any consideration of character names. This is the normal convention for symbolic fonts.
I have confirmed that no Encoding entry is specified within the PDF. Here are the /Font and /FontDescriptor objects extracted using qpdf:
18 0 obj
<<
/BaseFont /BAAAAA+LiberationSerif
/FirstChar 0
/FontDescriptor 20 0 R
/LastChar 1
/Subtype /TrueType
/ToUnicode 21 0 R
/Type /Font
/Widths [
777
802
]
>>
endobj
20 0 obj
<<
/Ascent 891
/CapHeight 981
/Descent -216
/Flags 4
/FontBBox [
-543
-303
1277
981
]
/FontFile2 23 0 R
/FontName /BAAAAA+LiberationSerif
/ItalicAngle 0
/StemV 80
/Type /FontDescriptor
>>
endobj
So how can I investigate the .ttf file to confirm that "the “cmap” subtable with platform ID 1 and encoding 0" is in place and contains the mappings I think it does?
Edit: the PDF in question
How do I inspect the cmap table and subtables in a TrueType font?
OT Master Light, from Dutch Type Library, is a free tool that's quite handy for inspecting internal font tables.
Using OT Master Light it can be seen that the cmap 1:0 maps character code 0x01 to glyph index 1 (1st image, 2nd entry in the list) which is the 'dz' symbol (2nd image).
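If you prefer a script over a GUI tool, the subtable list can also be read straight from the file. Below is a minimal Python sketch using only the standard library. It assumes a plain TrueType file (not a .ttc collection), lists every cmap subtable with its platform ID, encoding ID and format, and decodes only format 0 (the byte encoding table, which is what a 1:0 subtable often uses). The function name list_cmap_subtables is made up for this example.

```python
import struct


def list_cmap_subtables(font):
    """Return [(platform_id, encoding_id, format, mapping_or_None), ...]."""
    num_tables = struct.unpack_from(">H", font, 4)[0]
    cmap_offset = None
    for i in range(num_tables):
        # Table record: tag, checksum, offset, length (16 bytes each).
        tag, _checksum, offset, _length = struct.unpack_from(">4sLLL", font, 12 + 16 * i)
        if tag == b"cmap":
            cmap_offset = offset
            break
    if cmap_offset is None:
        raise ValueError("font has no cmap table")

    _version, count = struct.unpack_from(">HH", font, cmap_offset)
    subtables = []
    for i in range(count):
        platform_id, encoding_id, sub_offset = struct.unpack_from(
            ">HHL", font, cmap_offset + 4 + 8 * i)
        start = cmap_offset + sub_offset
        fmt = struct.unpack_from(">H", font, start)[0]
        mapping = None
        if fmt == 0:
            # Format 0: 6-byte header, then 256 one-byte glyph indices.
            glyph_ids = font[start + 6:start + 6 + 256]
            mapping = {code: gid for code, gid in enumerate(glyph_ids) if gid}
        subtables.append((platform_id, encoding_id, fmt, mapping))
    return subtables
```

Usage would be along the lines of list_cmap_subtables(open("extracted.ttf", "rb").read()), where extracted.ttf is a placeholder file name; an entry starting with (1, 0, 0, ...) is the subtable the PDF Reference refers to.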

PDF 1.3 text conversion with escape sequence and character code

I have a PDF 1.3 file from which I want to extract the text.
But in the stream there are two different types of text:
some plain text and some character-coded text with escape sequences.
Here is an example:
/TextClip BMC
BT
/T1_2 1 Tf
0 Tc 0 Tw 7 Tr 16.2626 0 0 16.2626 37.2512 581.738 Tm
(Test Test)Tj
ET
EMC
q
/GS0 gs
67.6799985 0 0 -13.4399997 37.439994 594.2399583 cm
/Im47 Do
Q
Q
Q
q
37.499 569.52 179.713 8.34 re
W n
q
/GS0 gs
180.959996 0 0 -9.5999998 36.959999 578.3999755 cm
/Im48 Do
Q
Q
q
37.499 569.52 179.713 8.34 re
W n
q
/TextClip BMC
BT
0 Tc 0 Tw 7 Tr 9.899 0 0 9.899 37.2512 569.7178 Tm
[(\000E\000V\000d\000e\000\003\000E\000V\000d\000e)]TJ
ET
EMC
In this example the text "Test Test" appears twice: once as plain text and once with the escape sequences \000E\000V\000d\000e\000\003\000E\000V\000d\000e.
All I know is that three digits after a backslash form an octal character code. But in my example there seem to be sometimes four and sometimes three characters after the backslash.
The fourth character after the backslash is offset by 15 from the correct ASCII code (\000E is the character "T"). But what is the correct conversion?
The text block \000\003 should be a space. What is the conversion hack there?
Regards
The encoding of the string arguments of text showing instructions like TJ and Tj depends on the PDF font in question, cf. the specification
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(section 9.4.3 - Text-Showing Operators - in ISO 32000-1)
The font used for the first text showing operation
(Test Test)Tj
probably is a simple font with an ASCII'ish encoding, probably WinAnsiEncoding. The font itself is selected two lines above in
/T1_2 1 Tf
so you only have to look up the font resource T1_2 in the associated resources (the resources of the page if you are showing us an excerpt of a page content stream) to verify.
The font used in the second text showing operation
[(\000E\000V\000d\000e\000\003\000E\000V\000d\000e)]TJ
appears to be a composite font with a double-byte encoding, probably Identity-H, and the underlying font program appears to have the glyph codes most often found in TrueType fonts. You should look for a ToUnicode mapping in that PDF font for easy decoding.
The instruction in which this font is selected, is not among the instructions you posted but instead must be somewhere above. This selection has been saved as part of the graphics state (in some early q instructions) and restored again (in some Q instruction between the two text showing instructions you shared).
if there are after an escape sequence 3 digits, that this is an octal character code. But in my example there are some time 4 and some times 3 digits.
No, in your example there always are escape sequences with three octal digits. The character thereafter is a separate byte, i.e. you have the bytes '\000', 'E', '\000', 'V', '\000', 'd', '\000', 'e', '\000', '\003', '\000', 'E', '\000', 'V', '\000', 'd', '\000', and 'e'.
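To make the byte view concrete, here is a small Python sketch that turns the body of a PDF literal string (the part between the outer parentheses) into raw bytes, following the escape rules for literal strings in ISO 32000-1. It is a sketch only: normalisation of unescaped end-of-line markers is left out, and the function name decode_literal_string is made up for this example.

```python
def decode_literal_string(body):
    """Resolve backslash escapes in the body of a PDF literal string."""
    simple = {ord("n"): 10, ord("r"): 13, ord("t"): 9, ord("b"): 8,
              ord("f"): 12, ord("("): 40, ord(")"): 41, ord("\\"): 92}
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:  # not a backslash: copy the byte as it is
            out.append(c)
            i += 1
            continue
        i += 1
        if i >= len(body):
            break
        c = body[i]
        if c in simple:
            out.append(simple[c])
            i += 1
        elif 0x30 <= c <= 0x37:
            # One to three octal digits form a single byte.
            value = 0
            digits = 0
            while digits < 3 and i < len(body) and 0x30 <= body[i] <= 0x37:
                value = value * 8 + (body[i] - 0x30)
                i += 1
                digits += 1
            out.append(value & 0xFF)
        elif c in (0x0A, 0x0D):
            # Backslash followed by end-of-line: line continuation.
            i += 1
            if c == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:
            out.append(c)  # unknown escape: the backslash is ignored
            i += 1
    return bytes(out)


raw = decode_literal_string(rb"\000E\000V\000d\000e\000\003\000E\000V\000d\000e")
print(len(raw), raw)
```

The result has 18 bytes, i.e. nine two-byte codes, which matches the nine characters of "Test Test".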
As mentioned above, this looks like a double-byte encoding with in particular the mappings
\000E -> 'T'
\000V -> 'e'
\000d -> 's'
\000e -> 't'
\000\003 -> ' ' (space)
This appears to be a glyph encoding often found in TrueType fonts which for Latin letters merely means a constant offset to their Unicode codes.
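As a sketch of how the double-byte reading works, the following Python snippet splits the bytes into two-byte codes and applies only the five mappings observed above. The table is taken from this one example and is not a general rule; a real extractor would take the mapping from the font's ToUnicode CMap. The names OBSERVED and decode_two_byte are made up for this example.

```python
# Mappings observed in this one content stream, not a general table.
OBSERVED = {0x0045: "T", 0x0056: "e", 0x0064: "s", 0x0065: "t", 0x0003: " "}


def decode_two_byte(data, table):
    """Split data into big-endian two-byte codes and look each one up."""
    codes = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    # U+FFFD marks codes that are missing from the table.
    return "".join(table.get(code, "\ufffd") for code in codes)


data = b"\x00E\x00V\x00d\x00e\x00\x03\x00E\x00V\x00d\x00e"
print(decode_two_byte(data, OBSERVED))
```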
But there also are many different multi-byte encodings in common use; sometimes they even are ad hoc encodings created only for the font on the page at hand.
Thus, if you seriously want to do text extraction from PDFs, you really have to study the PDF specification and implement along its requirements instead of hoping for some conversion hack.
Adobe has published a copy of the old PDF specification ISO 32000-1 on their web page at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

How to get Unicode Hex Values from a Type 1 text in a pdf file?

I am trying to write a pdf parser in c++. I have some problems to read some texts that are written in languages that do not use the Latin alphabet.
For example I have a text which is described as
T1_0 257 0 R
/T1_0 1 Tf
40.2614 0 0 47.4187 120.4995 595.2451 Tm
[(\037\036)3(\035)21(\034)-8(\033)5(\032\031)]TJ
257 0 obj
<</BaseFont/HVTZBF+MyriadPro-Regular/Encoding 269 0 R/FirstChar 25/FontDescriptor 270 0 R/LastChar 31/Subtype/Type1/Type/Font/Widths[417 555 472 551 457 236 553]>>
endobj
269 0 obj
<</BaseEncoding/WinAnsiEncoding/Differences[25/uni03C2/eta/lambda/alpha/chi/iota/uni03BC]/Type/Encoding>>
endobj
I am not interested in getting the font details, but I am really interested in getting the symbols of this text in Unicode. In the "Differences" array there is a name for each symbol of the text. The first and the last symbols are given as Unicode hex values, but the rest are described by their names from Adobe's "Symbol Set and Encoding" table.
e.g. "uni03C2" is "ς", "eta" is "η", "lambda" is "λ" etc
How can I get the Unicode hexadecimal value for each of the symbols of my text?
p.s.: I have also tried to decode the FontFile3 program, but I cannot see its content, apart from some information about the font's license.
p.s.2: Here is a link to the file.
Thanks in advance.
You can find the names in the "Adobe Glyph List".
Names with the uni prefix can be translated by removing the prefix; what remains is the UTF-16 code unit in hexadecimal. Could you share a link to this type of document?
The full specification of the AGL is available here.
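Applied to the Differences array from the question, the lookup could be sketched in Python as follows. The AGL_EXCERPT table contains only the five names needed here, copied from the Adobe Glyph List; a real parser would load the whole list. All function and variable names are made up for this example.

```python
import re

# Tiny excerpt of the Adobe Glyph List, just the names used in the question.
AGL_EXCERPT = {"eta": "\u03b7", "lambda": "\u03bb", "alpha": "\u03b1",
               "chi": "\u03c7", "iota": "\u03b9"}


def glyph_name_to_unicode(name):
    """Map a glyph name to a string, or None if the name is unknown."""
    match = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", name)
    if match:
        digits = match.group(1)
        return "".join(chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4))
    return AGL_EXCERPT.get(name)


def differences_to_map(differences):
    """Expand a Differences array into a {code: glyph name} dictionary."""
    code_to_name = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item          # a number restarts the running code
        else:
            code_to_name[code] = item
            code += 1
    return code_to_name


names = differences_to_map(
    [25, "uni03C2", "eta", "lambda", "alpha", "chi", "iota", "uni03BC"])
codes = bytes([0o37, 0o36, 0o35, 0o34, 0o33, 0o32, 0o31])  # strings of the TJ array
text = "".join(glyph_name_to_unicode(names[code]) for code in codes)
print(text, [hex(ord(ch)) for ch in text])
```

Codes that are not covered by the Differences array would fall back to the base encoding (WinAnsiEncoding here), which this sketch does not implement.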

write in unicode text on visible signature - pdfbox

I've built a PDF using PDFBox. It has a visible signature, too. I write some text like this:
...
builderSting.append("Tm\n");
builderSting.append(" /F1 " + fontSize + "\n");
builderSting.append("Tf\n");
builderSting.append("(hello world)");
builderSting.append("Tj\n");
builderSting.append("ET");
...
PDStream stream= ...;
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
Everything works well, but if I write some Unicode characters in builderString, there are "???"s instead of the text.
Here is a sample PDF: link here
QUESTION 1) When I look at the PDF structure, there are question marks instead of the text. How do I write Unicode characters?
9 0 obj
<<
/Type /XObject
/Subtype /Form
/BBox [100 50 0 0]
/Matrix [1 0 0 1 0 0]
/Resources <<
/Font 11 0 R
/XObject <<
/img0 12 0 R
>>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
>>
/FormType 1
/Length 13 0 R
>>
stream
q 93.70079 0 0 50 0 0 cm /img0 Do Q
BT
1 0 0 1 93.70079 25 Tm
/F1 2
Tf
(????)Tj
ET
endstream
endobj
My font uses the encoding WinAnsiEncoding. Can I use another encoding in PDFBox?
PDFont font = PDTrueTypeFont.loadTTF(template, new File("//fontName.ttf"));
font.setFontEncoding(new WinAnsiEncoding());
QUESTION 2) I've embedded a font in the PDF, but the text is not written with this font (in the visible signature rectangle). Why?
Question 3) When I removed the font, the text was still there (when the text was in English). What is the default font? /F1 - which one is the first font?
Question 4) How do I calculate the width of my text in the visible signature? Any ideas?
QUESTION 1) When I look at the PDF structure, there are question marks instead of the text. How do I write Unicode characters?
I assume that with unicode characters you mean characters present in Unicode but not in e.g. Latin-1. (Because the letter 'a' for example does have a Unicode representation, too, but most likely won't cause you trouble.)
You call getBytes("ISO-8859-1") on your StringBuilder result. Your Unicode characters most likely are not in ISO 8859-1. Thus, String.getBytes returns the ASCII code for a question mark in their respective places.
If the question was merely how to write to an output stream with Unicode characters in Java, the answer would be easy: Choose an encoding which contains all your characters, e.g. UTF-8, which all consumers of your program support, and call String.getBytes for that encoding.
The case at hand is different, though, as you want to serialize that information as a PDF form xobject stream. In this context your whole approach is somewhere along the route from highly questionable to completely wrong:
In PDFs, each font might come along with its own encoding which might be similar to a common encoding, e.g. /WinAnsiEncoding, or completely custom. These encodings, furthermore, in many cases are restricted to one byte per character, but in case of composite fonts they can also be multi-byte-encodings.
As a corollary, not all elements of the stream need to be encoded using the same encoding. E.g. the operator names Tm, Tf, and Tj are encoded using their ASCII codes while the characters of a string to be displayed have to be encoded using the respective font's encoding (and may thereafter be yet again hex-encoded if added in angle brackets <>).
Thus, creating the stream as a string and then converting it to bytes with a single encoding only works if all used fonts use the same encoding (for the actually used code points) which furthermore needs to be ASCII'ish to correctly represent the operators.
Essentially, you should directly construct the stream in some byte buffer and for each inserted element use the appropriate encoding. In case of characters to be displayed, therefore, you have to be aware of the encoding used by the currently selected font.
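The idea can be sketched independently of PDFBox. The following Python snippet (Python only for brevity, it is not PDFBox code) builds the content stream as bytes: operators are written as ASCII, and the shown string is encoded with the codec matching the font's encoding. It assumes a simple font with WinAnsiEncoding and uses Python's cp1252 codec as an approximation of it; the function names are made up for this example.

```python
def pdf_literal(text, font_encoding):
    """Encode text for the current font and wrap it as a PDF literal string."""
    # Raises UnicodeEncodeError instead of silently writing '?'.
    raw = text.encode(font_encoding)
    out = bytearray(b"(")
    for byte in raw:
        if byte in b"()\\":
            out += b"\\"  # parentheses and backslashes must be escaped
        out.append(byte)
    out += b")"
    return bytes(out)


def show_text_stream(font_name, size, x, y, text, font_encoding="cp1252"):
    """Build a tiny text-showing content stream as bytes."""
    parts = [
        b"BT",
        f"/{font_name} {size} Tf".encode("ascii"),
        f"1 0 0 1 {x} {y} Tm".encode("ascii"),
        pdf_literal(text, font_encoding) + b" Tj",
        b"ET",
    ]
    return b"\n".join(parts) + b"\n"


print(show_text_stream("F0", 2, 93.70079, 25, "héllo (world)"))
```

A character that the font's encoding cannot represent makes the sketch fail loudly; in that case you need a font (and encoding) that actually contains the glyph, e.g. a composite font.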
If you want to do it right, first study the PDF specification ISO 32000-1, especially the sections on general syntax and chapter 9 Text.
QUESTION 2) I've embedded font in PDF. but text is not written with this font (in visible signature Rectangle). Why?
In the resources of the stream xobject in question there is exactly one embedded font associated to the name /F0. In your stream, though, you have /F1 2 Tf, i.e. you select a font /F1 at size 2.
Question 3) when I remove font, text was still there (when the text was in english). what is the default font?
According to the specification, section 9.3.1,
font shall be the name of a font resource in the Font subdictionary of the current
resource dictionary [...]
There is no initial value for either font or size
Most likely, though, PDF viewers for the sake of compatibility with old or broken documents use some default font.
Question 4) How to calculate width of my text in visible signature ? Any ideas?
The width obviously depends on the metrics of the font used (glyph widths in this case) and the graphics state you set (font size, character spacing, word spacing, current transformation matrix, text transformation matrix, ...).
In your case you hardly change anything in the graphics state, so only the selected font size is of interest there. The more interesting part is the character widths from the font metrics. As long as you use the standard 14 fonts, you find the metrics here. As soon as you start using other, custom fonts, you have to read them from the font definition files yourself.
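As a rough sketch in Python, the horizontal advance of a string in a simple font can be summed up per glyph following the displacement formula of section 9.4.4 of ISO 32000-1: glyph width (in thousandths of a text space unit) times font size, plus character spacing, plus word spacing for code 32, all times horizontal scaling. The widths in the usage line are made-up numbers, and the result is in text space, so the text matrix and the current transformation matrix still have to be applied.

```python
def text_width(codes, widths, first_char, font_size,
               char_spacing=0.0, word_spacing=0.0, horizontal_scaling=1.0):
    """Sum the horizontal displacements of the glyphs of a simple font.

    widths and first_char correspond to the Widths and FirstChar entries
    of the font dictionary.
    """
    total = 0.0
    for code in codes:
        glyph = widths[code - first_char] * font_size / 1000
        # Word spacing applies only to the single-byte code 32.
        extra = char_spacing + (word_spacing if code == 32 else 0.0)
        total += (glyph + extra) * horizontal_scaling
    return total


# Hypothetical widths for the codes 65 ('A') and 66 ('B').
print(text_width(b"AB", [500, 600], first_char=65, font_size=10))
```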
Ad 1)
Could it be that
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
should be
stream.createOutputStream().write(builderString.toString().getBytes("UTF-8"));
The conversion to ISO-8859-1 in getBytes turns any special character missing from ISO-8859-1 into a ?.

Text object gets deleted in Adobe Reader 11

In Adobe Acrobat X I inserted text objects, and when the file is opened in Adobe Reader 10 it displays properly. But in Adobe Reader 11, when I click on that PDF file, the text objects get deleted. Why does this happen? How can I solve it?
The source PDF file: click here
The PDF file which shows the problem when double-clicking on it in Adobe Reader 11: click here
In a nutshell:
You try to change the contents of a free text annotation by changing its normal appearance stream.
This is insufficient: A compliant PDF viewer may ignore this entry and provide their own appearances. So it's mere luck that older Adobe Reader versions chose to not ignore your change.
Thus, you also need to change the information a PDF viewer is expected to create their own appearance from, i.e. foremost the rich text value of RC (in the free text annotation dictionary) that shall be used to generate the appearance of the annotation, and also the Contents value which is the Text that shall be displayed for the annotation.
Furthermore there are defects in your PDFs:
the cross reference table in your first attempt result.pdf was broken;
the intent (IT value) of the free text annotation in your source files is spelled incorrectly.
In detail:
Your result.pdf is broken. Different PDF viewers may display broken PDFs differently.
Some details:
It has been created based on your Src.pdf in append mode but additionally the following change in the original revision has been made to its /Pages object:
In the source:
6 0 obj
<</Count 6
/Type /Pages
/Kids [ 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R ]
>>
endobj
In the result:
6 0 obj
<</Count 3
/Type /Pages
/Kids [ 7 0 R 8 0 R 9 0 R 12 0 R 11 0 R 10 0 R ]
>>
endobj
So the order of the last three pages was changed (which is ok) and the /Count was reduced from 6 to 3. This is inconsistent as there still are 6 child objects but according to the PDF specification ISO 32000-1, Count is
The number of leaf nodes (page objects) that are descendants of this node within the page tree.
Furthermore the cross reference table of the appended revision is broken.
xref
0 1
0000000000 65535 f
24 1
0001465240 00000 n
57 1
0001466075 00000 n
66 1
0001466909 00000 n
73 1
0001467744 00000 n
93 1
0001473484 00000 n
131 1
0001478703 00000 n
The entries are 19 bytes long, including the single-byte newline character that ends each of them. According to the spec, though,
Each entry shall be exactly 20 bytes long, including the end-of-line marker.
The format of an in-use entry shall be: nnnnnnnnnn ggggg n eol
where [...] eol shall be a 2-character end-of-line sequence
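A quick way to spot this kind of defect is to check every entry of a subsection against the 20-byte layout quoted above. Here is a small Python sketch; it assumes you have already cut out the bytes following a subsection header line (e.g. "24 1"), and the function name check_xref_entries is made up for this example.

```python
import re

# nnnnnnnnnn ggggg n|f followed by a two-character end-of-line:
# space + CR, space + LF, or CR + LF.
ENTRY = re.compile(rb"\d{10} \d{5} [nf](?: \r| \n|\r\n)")


def check_xref_entries(section, count):
    """Return the indices of entries that are not valid 20-byte entries."""
    bad = []
    for index in range(count):
        entry = section[20 * index:20 * index + 20]
        if not ENTRY.fullmatch(entry):
            bad.append(index)
    return bad


print(check_xref_entries(b"0001465240 00000 n\n", 1))   # 19-byte entry
print(check_xref_entries(b"0001465240 00000 n \n", 1))  # 20-byte entry
```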
There may be more errors in the PDF but you may want to start fixing these.
EDIT
Now with the new PDF Pay-in.pdf with a proper cross reference at hand, let's look at it more in-depth.
Adobe Preflight complains about many occurrences of:
[...]
An unexpected value is associated with the key
Key: IT
Value: /FreeTextTypewriter
Type: CosName
Formal Representation: Annot.AnnotFreeText
Cos ID: 86
Traversal Path: ->Pages->Kids->[0]->Annots->[13]
[...]
Ok, let's look at that object 86:
86 0 obj
<< /P 8 0 R
/Type /Annot
/CreationDate (D:20130219194939+05'30')
/T (winman)
/NM (0f202782-2274-44b8-9081-af4010be86d4)
/Subj (Typewritten Text)
/M (D:20130219195100+05'30')
/F 4
/Rect [ 53.2308 33.488 552.088 826.019 ]
/DS (font: Helv 12.0pt;font-stretch:Normal; text-align:left; color:#000000 )
/AP <</N 107 0 R >>
/Contents (wwww)
/IT /FreeTextTypewriter
/BS 108 0 R
/Subtype /FreeText
/Rotate 90
/DA (16.25 TL /Cour 12 Tf)
/RC (<?xml version="1.0"?>
<body xmlns="http://www.w3.org/1999/xhtml"
xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/"
xfa:APIVersion="Acrobat:10.0.0"
xfa:spec="2.0.2"
style="font-size:12.0pt;text-align:left;color:#000000;font-weight:normal;
font-style:normal;font-family:Helv;font-stretch:normal">
<p dir="ltr">
<span style="line-height:16.3pt;font-family:Helvetica">wwww</span>
</p>
</body>)
>>
endobj
Preflight stated that it is unhappy about the line /IT /FreeTextTypewriter. Looking at the PDF specification again uncovers for annotations with /Subtype /FreeText, i.e. Free Text Annotations specified in section 12.5.6.6:
IT name
(Optional; PDF 1.6) A name describing the intent of the free text annotation (see also the IT entry in Table 170). The following values shall be valid:
FreeText The annotation is intended to function as a plain free-text annotation. A plain free-text annotation is also known as a text box comment.
FreeTextCallout The annotation is intended to function as a callout. The callout is associated with an area on the page through the callout line specified in CL.
FreeTextTypeWriter The annotation is intended to function as a click-to-type or typewriter object and no callout line is drawn.
Default value: FreeText
Thus, your value FreeTextTypewriter is invalid (remember, PDF names are case sensitive!). Therefore, the annotation is (slightly) broken which may already result in all kinds of problems.
But there are other important entries here, too, to understand your issue: All you do in your appended changes is to replace the appearance stream in object 107 (as per /AP <</N 107 0 R >>) of this annotation by a different one. But this annotation contains an RC value, too, which according to the specification is
A rich text string (see 12.7.3.4, “Rich Text Strings”) that shall be used to generate the appearance of the annotation.
Thus, any PDF viewer may regenerate the appearance from that rich text description, especially as the specification in section 12.5.2 says about the content of the AP dictionary
Individual annotation handlers may ignore this entry and provide their own appearances.
Thus, simply replacing the normal appearance stream does not suffice to permanently change the appearance of that annotation, you have to change the appearance dictionary and at least remove any alternative source for the appearance.
Furthermore the entry /Contents (wwww) is not replaced by your appended changes either. So a PDF viewer trying to decide whether to use the appearance stream or not will feel tempted to somehow create a new appearance as your appearance stream in no way represents that value.
Especially when starting to manipulate the free text (e.g. when clicking into the PDF in your case), the PDF viewer knows it eventually will have to create a new appearance anyways, and unless the current appearance is as it would have created it anyway, the viewer may prefer to begin anew starting with an appearance derived from the rich text or even the contents value.