Problems with calculating byte offsets - PDF
I'm trying to understand PDF structure right now, but I have a little problem with calculating the byte offset of a string. The offsets of the objects are counted from the beginning of the file to the position of the object header (e.g. 6 0 obj).
I have a working hello world PDF file, but when I count the offsets I get different values than the ones in the xref table.
If anybody understands how this is counted, please let me know!
Example:
6 0 obj: xref says 9, I count 17
1 0 obj: xref says 60, I count 72
4 0 obj: xref says 145, I count 187
(I count "\r\n" (2 bytes) as the line break.)
Adobe standard: http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
%PDF-1.4
%%EOF
6 0 obj
<<
/Type /Catalog
/Pages 5 0 R
>>
endobj
1 0 obj
<<
/Type /Page
/Parent 5 0 R
/MediaBox [ 0 0 612 792 ]
/Resources 3 0 R
/Contents 2 0 R
>>
endobj
4 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont/Helvetica
>>
endobj
2 0 obj
<<
/Length 53
>>
stream
BT
/F1 24 Tf
1 0 0 1 260 600 Tm
(Hello World)Tj
ET
endstream
endobj
5 0 obj
<<
/Type /Pages
/Kids [ 1 0 R ]
/Count 1
>>
endobj
3 0 obj
<<
/ProcSet[/PDF/Text]
/Font <</F1 4 0 R >>
>>
endobj
xref
0 7
0000000000 65535 f
0000000060 00000 n
0000000228 00000 n
0000000424 00000 n
0000000145 00000 n
0000000333 00000 n
0000000009 00000 n
trailer
<<
/Size 7
/Root 6 0 R
>>
startxref
488
%%EOF
This is a very interesting file, and reading the PDF specification initially just confused me more :-). In such cases (I'll madden some people with this) I would simply save the example PDF file and do as @KenS suggests in his previous answer: open it in Acrobat, and if Acrobat reports it as damaged or asks you to save when you close the file, it doesn't like it and you can assume you've gotten something wrong.
The reason this file is interesting is the second line:
%%EOF
I don't agree with KenS that having this line automatically invalidates the file - I can find no text in ISO 32000 that states this. The text says that the %%EOF line at the end of a file has syntactical meaning (and explains why it is there), and it states that any line beginning with a percent character (%) is a comment and what that means. But nowhere does it state that %%EOF is not allowed as a comment somewhere else in the file (though I consider it a dumb thing to do, but that is something different).
If that %%EOF line isn't there, the xref table is correct. If it is there, it's wrong. Some more explanation of what I read in the documentation:
1) As far as I understand it, the offset starts from the first byte of the file (it's a byte offset, not a character offset), which is byte 0, and counts up from there. The idea behind this is that you can open a file, set the file read position to a given offset and start reading. So if you open the file in a binary editor that shows the real bytes, the offset should match what you see there. If your %%EOF line isn't there, that means the first object (6 0 obj) effectively begins at offset 9 (if your line endings here are single bytes). At that point it matches the example given in the PDF specification itself, so I'm confident that an offset of 9 is correct, provided that second line (%%EOF) is not in the PDF file.
2) That second line starts with a percent sign, which makes it a comment. The PDF specification states that a comment (everything from the % sign up to but not including the line end character) shall be interpreted as a single whitespace character. That's interesting and could lead to all kinds of speculation about what that means for the offset of the object following it, but frankly all of that speculation is out of order and irrelevant because of what I stated before:
The idea behind this is that you can open a file, set the file read position to a given offset and start reading.
That's exactly what the cross-reference table is for, and it should be taken literally. In other words, assuming single-byte line-ending characters, object 6 in your example file starts at offset 15, and that's the number that should be in the xref table for that object.
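To make that arithmetic concrete, here is a quick sanity check (my own sketch, not part of the original answer) of the 9 and 15 byte figures, assuming single-byte "\n" line endings:
# Quick check of the offsets discussed above, assuming single-byte "\n" line endings.
header  = b"%PDF-1.4\n"   # 8 characters plus one line-ending byte
comment = b"%%EOF\n"      # the questionable second line: 5 characters plus one line-ending byte
print(len(header))                  # 9  -> offset of "6 0 obj" if the %%EOF comment line is absent
print(len(header) + len(comment))   # 15 -> offset of "6 0 obj" if the %%EOF comment line is present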
Again, take @KenS's comment into account: you cannot just assume the line ending is two bytes, you have to know what the line endings actually are (and they can be mixed, so you can't even assume all lines use the same one). If this file had two-byte line endings for all lines, your count of 17 would be the correct one.
You cannot assume a line ending is a \r\n pair; it could be \r, \n or \r\n, and you need to use a binary editor to be certain. We also cannot tell you which value is correct without access to the original file; a cut/paste as above simply isn't good enough, sorry. Though 9 cannot be correct unless that %%EOF is spurious...
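Since the only reliable way to settle this is to look at the real bytes, here is a rough sketch (my addition, not from the answers above) that scans a file for "N G obj" headers and reports both their byte offsets and the line-ending byte(s) immediately preceding them; "hello.pdf" is a placeholder name, and the pattern only suits a simple hand-written file like this one (it ignores object streams and "obj" tokens inside stream data):
import re

with open("hello.pdf", "rb") as f:   # placeholder file name
    data = f.read()

# Report the byte offset of every "N G obj" header and the byte(s) just before it,
# so the actual line-ending convention (\r, \n or \r\n) becomes visible.
for match in re.finditer(rb"(\d+)\s+(\d+)\s+obj", data):
    offset = match.start()
    preceding = data[max(0, offset - 2):offset]
    print(f"object {match.group(1).decode()} {match.group(2).decode()} at offset {offset}, preceded by {preceding!r}")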
Your quoted PDF file isn't correct anyway: you should not have %%EOF as the second line. That line 'ought' to be a binary sequence of bytes with the high bit set, to make sure PDF files are transferred as binary files.
How do you know the xref in your PDF file is correct? If you open it with Acrobat, does it offer to save changes on exit? That's a sure sign that Acrobat rebuilt the xref for you, because it was incorrect...
[edited for clarity and because too long for comments]
I should be clearer in my explanation. I contracted two statements into one in my initial answer and intended to clarify it in the comments. The file is incorrect: the xref offsets appear to me to be wrong, and on initial inspection they appear to be off by the %%EOF line (but see later), which looks like it is bogus in some way (inserted by cut/paste, by an editor, or something).
Technically you can have any text which begins with % anywhere in a PDF file (outside strings and streams), provided you account for it properly and don't break the PDF syntax. But I still would not put two %%EOF comments in a PDF file; it's too likely to confuse simple-minded PDF consumers.
I don't think that having a comment before the 'x y obj' statement is necessarily wrong (I wouldn't do it, but that's not the same thing). It also doesn't totally invalidate David's point about:
The idea behind this is that you can open a file, set the file read position to a given offset and start reading.
Provided that the PDF consumer is prepared to read a comment and doesn't expect whitespace or an 'x y obj' statement (and I have seen PDF files which have preceding whitespace here). That is debatable, and while I would read the spec as saying that the xref offset ought to point precisely to the first byte of the 'x y obj' line, it doesn't actually say that in so many words. And a PDF consumer does need to be able to deal with comments during the course of the object definition itself. For example, I think that this:
1 0 obj
%% Here's a comment
<<
/Type /Page
/Parent 5 0 R
/MediaBox [ 0 0 612 792 ]
/Resources 3 0 R
/Contents 2 0 R
>>
would be legal. The line begins with a '%', it's not in a string or stream context, and it doesn't break the PDF syntax, so the consumer should simply skip right over it.
This isn't very much different from:
%% Here's a comment
1 0 obj
<<
/Type /Page
/Parent 5 0 R
/MediaBox [ 0 0 612 792 ]
/Resources 3 0 R
/Contents 2 0 R
>>
Again, I wouldn't do that myself (or if I did, I would point the xref at the start of the '1 0 obj' line), but I think it's arguable either way.
But in the original example my binary editor says that (with 2-byte line endings) object 4 starts at offset 187; if I use 1-byte endings that comes down to 170. To get object 1's xref entry to be correct I have to assume that the line endings are 1 byte, and I remove the "%%EOF\n". But subtracting those 6 bytes from 170 still comes out at 164, nowhere near the 145 the xref contains. I cannot see any way to get object 4 to be at offset 145 without removing real PDF operators/structure.
Related
How do I inspect the cmap table and subtables in a TrueType font?
The PDF Reference says: A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions, using an internal data structure called a “cmap” It goes on to explain that the behaviour of a PDF processor depends on which cmap subtables are present in the font file. I am trying to analyze a .ttf font file extracted using fontforge from a PDF that was generated by LibreOffice. The PDF embeds this font file as a simple font, using single-byte codes. When I look at the .ttf file in fontdrop.info, it tells me the "glyphIndexMap" is as follows: {"0":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0,"10":0,"11":0,"12":0,"13":0,"14":0,"15":0,"16":0,"17":0,"18":0,"19":0,"20":0,"21":0,"22":0,"23":0,"24":0,"25":0,"26":0,"27":0,"28":0,"29":0,"30":0,"31":0,"32":0,"33":0,"34":0,"35":0,"36":0,"37":0,"38":0,"39":0,"40":0,"41":0,"42":0,"43":0,"44":0,"45":0,"46":0,"47":0,"48":0,"49":0,"50":0,"51":0,"52":0,"53":0,"54":0,"55":0,"56":0,"57":0,"58":0,"59":0,"60":0,"61":0,"62":0,"63":0,"64":0,"65":0,"66":0,"67":0,"68":0,"69":0,"70":0,"71":0,"72":0,"73":0,"74":0,"75":0,"76":0,"77":0,"78":0,"79":0,"80":0,"81":0,"82":0,"83":0,"84":0,"85":0,"86":0,"87":0,"88":0,"89":0,"90":0,"91":0,"92":0,"93":0,"94":0,"95":0,"96":0,"97":0,"98":0,"99":0,"100":0,"101":0,"102":0,"103":0,"104":0,"105":0,"106":0,"107":0,"108":0,"109":0,"110":0,"111":0,"112":0,"113":0,"114":0,"115":0,"116":0,"117":0,"118":0,"119":0,"120":0,"121":0,"122":0,"123":0,"124":0,"125":0,"126":0,"127":0,"160":0,"161":0,"162":0,"163":0,"165":0,"167":0,"168":0,"169":0,"170":0,"171":0,"172":0,"174":0,"175":0,"176":0,"177":0,"180":0,"181":0,"182":0,"183":0,"184":0,"186":0,"187":0,"191":0,"192":0,"193":0,"194":0,"195":0,"196":0,"197":0,"198":0,"199":0,"200":0,"201":0,"202":0,"203":0,"204":0,"205":0,"206":0,"207":0,"209":0,"210":0,"211":0,"212":0,"213":0,"214":0,"216":0,"217":0,"218":0,"219":0,"220":0,"223":0,"224":0,"225":0,"226":0,"227":0,"228":0,"229":0,"230":0,"231":0,"232":0,"233":0,"234":0,"235":0,"236":0,"237":0,"238":0,"239":0,"241":0,"242":0,"243":0,"244":0,"245":0,"246":0,"247":0,"248":0,"249":0,"250":0,"251":0,"252":0,"255":0,"305":0,"338":0,"339":0,"376":0,"402":0,"675":3,"710":0,"711":0,"728":0,"729":0,"730":0,"731":0,"732":0,"733":0,"916":0,"937":0,"960":0,"8211":0,"8212":0,"8216":0,"8217":0,"8218":0,"8220":0,"8221":0,"8222":0,"8224":0,"8225":0,"8226":0,"8230":0,"8240":0,"8249":0,"8250":0,"8260":0,"8364":0,"8482":0,"8706":0,"8719":0,"8721":0,"8730":0,"8734":0,"8747":0,"8776":0,"8800":0,"8804":0,"8805":0,"9674":0,"57374":0,"64257":0,"64258":0} (the interesting part is "675":3) I can understand this insofar as the font contains 4 glyphs, and the glyph at index 3 is the ʣ character (decimal Unicode point 675 / U+02A3). But in the PDF, this character is used in text strings as <01>, and no other encoding is given - so according to the PDF Reference, the mapping from <01> to the glyph at index 3 must be done according to a mapping within the .ttf file: If no Encoding entry is specified in the font dictionary, the “cmap” subtable with platform ID 1 and encoding 0 will be used to map directly from character codes to glyph descriptions, without any consideration of character names. This is the normal convention for symbolic fonts. I have confirmed that no Encoding entry is specified within the PDF. 
Here are the /Font and /FontDescriptor objects extracted using qpdf:
18 0 obj
<<
/BaseFont /BAAAAA+LiberationSerif
/FirstChar 0
/FontDescriptor 20 0 R
/LastChar 1
/Subtype /TrueType
/ToUnicode 21 0 R
/Type /Font
/Widths [ 777 802 ]
>>
endobj
20 0 obj
<<
/Ascent 891
/CapHeight 981
/Descent -216
/Flags 4
/FontBBox [ -543 -303 1277 981 ]
/FontFile2 23 0 R
/FontName /BAAAAA+LiberationSerif
/ItalicAngle 0
/StemV 80
/Type /FontDescriptor
>>
endobj
So how can I investigate the .ttf file to confirm that "the “cmap” subtable with platform ID 1 and encoding 0" is in place and contains the mappings I think it does?
Edit: the PDF in question
OT Master Light, from Dutch Type Library, is a free tool that's quite handy for inspecting internal font tables.
Using OT Master Light it can be seen that the cmap 1:0 maps character code 0x01 to glyph index 1 (1st image, 2nd entry in the list) which is the 'dz' symbol (2nd image).
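If you prefer a programmatic check to a GUI tool, the same subtables can be listed with the fontTools library; this is just an alternative sketch of mine (not part of the answer above), and "extracted.ttf" is a placeholder for the font file pulled out of the PDF:
from fontTools.ttLib import TTFont

font = TTFont("extracted.ttf")   # placeholder path for the extracted font

# List every cmap subtable with its platform and encoding IDs.
for table in font["cmap"].tables:
    print(f"platformID={table.platformID} platEncID={table.platEncID} format={table.format}")

# Dump the (1, 0) subtable, i.e. "platform ID 1 and encoding 0" as referenced by the PDF Reference.
mac_cmap = font["cmap"].getcmap(1, 0)
if mac_cmap is not None:
    for code, glyph_name in sorted(mac_cmap.cmap.items()):
        print(f"char code {code:#04x} -> glyph {glyph_name!r}")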
Decoding a FlateDecoded section of text in a PDF document
Using peepdf I am analyzing two simple PDF files. Both files contain a single line of text ("ZYXWVUTSRQQRSTUVWXYZ") and were created on Mac OS X.
The first file was created with TextEdit. There are only three streams, and looking at the first one (automatically decoded with peepdf) shows the text clearly:
PPDF> stream 4
q Q q 72 707.272 468 12.72803 re W n /Cs1 cs 0 sc
q 0.9790795 0 0 -0.9790795 72 720 cm BT 0.0001 Tc 11 0 0 -11 5 10 Tm
/TT1 1 Tf (ZYXWVUTSRQQRSTUVWXYZ) Tj ET Q Q
The second file was created with MS Word. There are four streams, but the decoded text is nowhere to be found. Looking at the corresponding stream in the Word doc does not reveal the decoded string:
PPDF> stream 4
q Q q 18 40 576 734 re W n /Cs1 cs 0 0 0 sc
q 0.24 0 0 0.24 90 708.72 cm BT -0.0004 Tc 50 0 0 50 0 0 Tm /TT2 1 Tf
[ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\)) -1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ] TJ
ET Q
q 0.24 0 0 0.24 239.168 708.72 cm BT 50 0 0 50 0 0 Tm /TT2 1 Tf (+) Tj ET Q Q
It's not apparent to me where the string is in the file or what the information in this stream means. Any insights?
It's not apparent to me where the string is in the file
In general you won't see the clear text in the content stream, because the encoding used there need not be a standard encoding, nothing ASCII'ish.
[ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\)) -1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ] TJ
This operation, in its array operand, contains your ZYXWVUTSRQQRSTUVWXYZ with some kerning corrections for certain pairs of characters. It looks like an ad hoc encoding using the bytes from 33 (= 0x21 = '!') onwards: '!' is used for the first glyph needed, the Z, '"' for the second one needed, Y, '#' for the third one, X, etc. Your test string not only starts with these chars but also ends with them, and so does the array above: (!") -1 (#) ... (#) -1 (") -1 (!).
Inspect the definition of the font used (TT2). It may (or may not) include information helping you decode this encoding.
or what the information in this stream means. Any insights?
To understand the contents of PDF content streams, you should read the relevant sections of the PDF specification ISO 32000-1, especially chapters 8 Graphics and 9 Text. As your question is focused on the recognition of text content, read e.g. section 9.10.2 Mapping Character Codes to Unicode Values:
A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):
If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):
a) Map the character code to a character name according to Table D.1 and the font's Differences array.
b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
a) Map the character code to a character identifier (CID) according to the font's CMap.
b) Obtain the registry and ordering of the character collection used by the font's CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).
d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).
e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
NOTE Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version.
(Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents, in which case a conforming reader may choose a character code of their choosing.
Edit: Concerning the comment
One of the objects gave some font info. It is 'JJOWGO+Cambria' and references object 16 as the 'font file' which was also unreadable. I'll review the manual. Can't find anything online about 'JJOWGO'.
You won't find anything specific about JJOWGO because it most likely is a random key sequence prefixed to Cambria to indicate that not all of that font is embedded but only a subset. Cf. section 9.6.4 Font Subsets of ISO 32000-1:
PDF documents may include subsets of Type 1 and TrueType fonts. The font and font descriptor that describe a font subset are slightly different from those of ordinary fonts. These differences allow a conforming reader to recognize font subsets and to merge documents containing different subsets of the same font. (For more information on font descriptors, see 9.8, "Font Descriptors".)
For a font subset, the PostScript name of the font—the value of the font's BaseFont entry and the font descriptor's FontName entry—shall begin with a tag followed by a plus sign (+). The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file shall have different tags.
EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font.
<<
/FontBBox [ -1475 -2463 2867 3117 ]
/StemV 0
/FontFile2 16 0 R
/Descent -222
/XHeight 467
/Flags 4
/Ascent 950
/FontName /JJOWGO+Cambria
/Type /FontDescriptor
/ItalicAngle 0
/AvgWidth 615
/MaxWidth 2919
/CapHeight 667
>>
This font descriptor contains no obvious encoding information. Have a look at the actual Font dictionary and look for a ToUnicode entry, cf. the quotation of section 9.10.2 above.
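As a small illustration of the subset-tag rule quoted above (exactly six uppercase letters followed by a plus sign), a name such as JJOWGO+Cambria can be recognised and stripped with a sketch along these lines; strip_subset_tag is my own hypothetical helper, not anything from the answer:
import re

def strip_subset_tag(base_font: str) -> str:
    # A font subset name starts with exactly six uppercase letters and a '+' (ISO 32000-1, 9.6.4).
    match = re.match(r"^[A-Z]{6}\+(.+)$", base_font)
    return match.group(1) if match else base_font

print(strip_subset_tag("JJOWGO+Cambria"))   # Cambria   - only a subset of Cambria is embedded
print(strip_subset_tag("EOODIA+Poetica"))   # Poetica   - the example from the specification
print(strip_subset_tag("Helvetica"))        # Helvetica - no subset tag, name is unchanged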
Comments from @mkl made it clear what is happening. The text in the PDF produced by MS Word was using a character map. I tracked down the font dictionary by searching for objects with a ToUnicode entry:
<<
/FirstChar 33
/Widths [ 538 570 571 921 604 648 593 496 621 653 220 ]
/Type /Font
/BaseFont /JJOWGO+Cambria
/LastChar 43
/Subtype /TrueType
/FontDescriptor 13 0 R
/ToUnicode 14 0 R
>>
The ToUnicode entry referenced object 14, so I looked at that next:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
1 beginbfchar
<2b><0009 000d 0020 00a0>
endbfchar
10 beginbfrange
<21><21><005a>
<22><22><0059>
<23><23><0058>
<24><24><0057>
<25><25><0056>
<26><26><0055>
<27><27><0054>
<28><28><0053>
<29><29><0052>
<2a><2a><0051>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
Section 9.10.3 of ISO 32000-1 explains how beginbfrange maps ranges of character codes to Unicode values. The "range" 21-21 contains a single character, "!", which is mapped to U+005A ("Z"). The mapping contains a line for every character in my test document, from Z to Q (! to *).
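To spell out how that CMap turns the content-stream bytes back into text, here is a small hand-built sketch of the bfchar/bfrange entries shown above (my addition; the kerning numbers from the TJ array are simply dropped here):
# The beginbfrange entries <21>..<2a> map linearly down from U+005A ("Z") to U+0051 ("Q"),
# and the beginbfchar entry maps <2b> to a sequence of whitespace code points.
to_unicode = {code: chr(0x5A - (code - 0x21)) for code in range(0x21, 0x2B)}
to_unicode[0x2B] = "\u0009\u000D\u0020\u00A0"

# The string bytes used by the TJ operator in the Word-generated content stream.
encoded = bytes([0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A,
                 0x2A, 0x29, 0x28, 0x27, 0x26, 0x25, 0x24, 0x23, 0x22, 0x21])
print("".join(to_unicode[b] for b in encoded))   # ZYXWVUTSRQQRSTUVWXYZ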
Text object gets deleted in Adobe Reader 11
In Adobe Acrobat X I was inserting text objects, and when the file is opened in Adobe Reader 10 it opens properly. But in Adobe Reader 11, when I click on that PDF file the text objects get deleted. Why does this happen? How do I solve it?
The source PDF file: click here
The PDF file which has the problem when double-clicking on it in Adobe Reader 11: click here
In a nutshell: You try to change the contents of a free text annotation by changing its normal appearance stream. This is insufficient:
A compliant PDF viewer may ignore this entry and provide their own appearances.
So it's mere luck that older Adobe Reader versions chose to not ignore your change. Thus, you also need to change the information a PDF viewer is expected to create its own appearance from, i.e. foremost the rich text value RC (in the free text annotation dictionary) that shall be used to generate the appearance of the annotation, and also the Contents value, which is the text that shall be displayed for the annotation.
Furthermore there are defects in your PDFs:
the cross reference table in your first attempt result.pdf was broken;
the intent (IT value) of the free text annotation in your source files is spelled incorrectly.
In detail:
Your result.pdf is broken. Different PDF viewers may display broken PDFs differently. Some details:
It has been created based on your Src.pdf in append mode, but additionally the following change has been made to the /Pages object of the original revision.
In the source:
6 0 obj
<</Count 6 /Type /Pages /Kids [ 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R ] >>
endobj
In the result:
6 0 obj
<</Count 3 /Type /Pages /Kids [ 7 0 R 8 0 R 9 0 R 12 0 R 11 0 R 10 0 R ] >>
endobj
So the order of the last three pages was changed (which is ok) and the /Count was reduced from 6 to 3. This is inconsistent, as there still are 6 child objects, but according to the PDF specification ISO 32000-1, Count is
The number of leaf nodes (page objects) that are descendants of this node within the page tree.
Furthermore the cross reference table of the appended revision is broken:
xref
0 1
0000000000 65535 f
24 1
0001465240 00000 n
57 1
0001466075 00000 n
66 1
0001466909 00000 n
73 1
0001467744 00000 n
93 1
0001473484 00000 n
131 1
0001478703 00000 n
The entries are 19 bytes long, including their single ending newline byte. According to the spec, though:
Each entry shall be exactly 20 bytes long, including the end-of-line marker. The format of an in-use entry shall be: nnnnnnnnnn ggggg n eol where [...] eol shall be a 2-character end-of-line sequence
There may be more errors in the PDF, but you may want to start by fixing these.
EDIT
Now, with the new PDF Pay-in.pdf with a proper cross reference at hand, let's look at it in more depth. Adobe Preflight complains about many occurrences of:
[...]
An unexpected value is associated with the key
Key: IT
Value: /FreeTextTypewriter
Type: CosName
Formal Representation: Annot.AnnotFreeText
Cos ID: 86
Traversal Path: ->Pages->Kids->[0]->Annots->[13]
[...]
Ok, let's look at that object 86:
86 0 obj
<<
/P 8 0 R
/Type /Annot
/CreationDate (D:20130219194939+05'30')
/T (winman)
/NM (0f202782-2274-44b8-9081-af4010be86d4)
/Subj (Typewritten Text)
/M (D:20130219195100+05'30')
/F 4
/Rect [ 53.2308 33.488 552.088 826.019 ]
/DS (font: Helv 12.0pt;font-stretch:Normal; text-align:left; color:#000000 )
/AP <</N 107 0 R >>
/Contents (wwww)
/IT /FreeTextTypewriter
/BS 108 0 R
/Subtype /FreeText
/Rotate 90
/DA (16.25 TL /Cour 12 Tf)
/RC (<?xml version="1.0"?> <body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:10.0.0" xfa:spec="2.0.2" style="font-size:12.0pt;text-align:left;color:#000000;font-weight:normal; font-style:normal;font-family:Helv;font-stretch:normal"> <p dir="ltr"> <span style="line-height:16.3pt;font-family:Helvetica">wwww</span> </p> </body>)
>>
endobj
Preflight stated that it is unhappy about the line /IT /FreeTextTypewriter. Looking at the PDF specification again uncovers, for annotations with /Subtype /FreeText, i.e. free text annotations specified in section 12.5.6.6:
IT name (Optional; PDF 1.6) A name describing the intent of the free text annotation (see also the IT entry in Table 170). The following values shall be valid:
FreeText The annotation is intended to function as a plain free-text annotation. A plain free-text annotation is also known as a text box comment.
FreeTextCallout The annotation is intended to function as a callout. The callout is associated with an area on the page through the callout line specified in CL.
FreeTextTypeWriter The annotation is intended to function as a click-to-type or typewriter object and no callout line is drawn.
Default value: FreeText
Thus, your value FreeTextTypewriter is invalid (remember, PDF names are case sensitive!). Therefore, the annotation is (slightly) broken, which may already result in all kinds of problems.
But there are other important entries here, too, to understand your issue: All you do in your appended changes is to replace the appearance stream in object 107 (as per /AP <</N 107 0 R >>) of this annotation by a different one. But this annotation contains an RC value, too, which according to the specification is
A rich text string (see 12.7.3.4, "Rich Text Strings") that shall be used to generate the appearance of the annotation.
Thus, any PDF viewer may regenerate the appearance from that rich text description, especially as the specification in section 12.5.2 says about the content of the AP dictionary
Individual annotation handlers may ignore this entry and provide their own appearances.
Thus, simply replacing the normal appearance stream does not suffice to permanently change the appearance of that annotation; you have to change the appearance dictionary and at least remove any alternative source for the appearance.
Furthermore, the entry /Contents (wwww) is not replaced by your appended changes either. So a PDF viewer trying to decide whether to use the appearance stream or not will feel tempted to somehow create a new appearance, as your appearance stream in no way represents that value. Especially when starting to manipulate the free text (e.g. when clicking into the PDF in your case), the PDF viewer knows it will eventually have to create a new appearance anyway, and unless the current appearance is as it would have created it, the viewer may prefer to begin anew, starting with an appearance derived from the rich text or even the Contents value.
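Coming back to the 19-byte cross-reference entries found in result.pdf: a quick way to check the 20-byte rule for a classic (uncompressed) xref table is a sketch like the following. It assumes the newest revision uses a plain xref table rather than a cross-reference stream, and the file name is just the one from the question:
import re

with open("Pay-in.pdf", "rb") as f:   # file name taken from the question; adjust as needed
    data = f.read()

# Follow the startxref pointer at the end of the file to the newest xref section.
startxref = int(re.search(rb"\d+", data[data.rfind(b"startxref"):]).group(0))
table = data[startxref:data.find(b"trailer", startxref)]

# Every entry must be exactly 20 bytes long, including a two-character end-of-line marker.
for match in re.finditer(rb"\d{10} \d{5} [nf](\r\n| \r| \n|\r|\n)", table):
    entry = match.group(0)
    verdict = "ok" if len(entry) == 20 else f"BAD ({len(entry)} bytes)"
    print(entry.rstrip().decode(), verdict)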
Create PDF file without dictionary data type (angle brackets)
I want to create a PDF file which does not contain angle brackets in its source. This apparently implies not using the dictionary data type, as that involves << and >>. Is it possible to completely avoid angle brackets and still create a PDF file with formatted content? Can it be done by hiding them in a stream, using a character encoding technique, or with an alternative dictionary notation? The solution is needed for an obfuscation technique; the bracket problem cannot be circumvented.
I think this is not possible. Every element in a PDF file is contained in some dictionary: the document catalog (the root dictionary), the page objects, the page content streams, all of them are or are described by dictionaries that require the character sequences << and >>.
Sample catalog dictionary:
1 0 obj
<<
/Pages 2 0 R
/Type /Catalog
>>
endobj
If you want an "instruction sequence only" presentation format, you may try using PostScript instead.
Edit after comments: Using a stream object with some filter encoding will not solve your problem, since you still need to specify the filter type in the stream dictionary. Example:
5 0 obj
<</Length 6 0 R /Filter /FlateDecode>>
stream
***illegible characters***
endstream
endobj
How do I add a PostScript XObject to a PDF?
I am demoing an idea I have been playing around with, and while the Adobe specification says that including PS XObjects is not a good idea, some PDF readers should still support this functionality. Anyway, that is beside the point. I have been using the Adobe PDF specification and have the following PDF object. It merely uses PostScript to generate a pseudo-random value and then print it to the page. Ideally, each time this page is rendered a new value should be displayed:
5 0 obj
<<
/Type /XObject
/Subtype /PS
/Length 103
>>
stream
/Times findfont
10 scalefont
setfont
/str 32 string def
10 20 moveto
rand str cvs show
endstream
endobj
Each time any PDF viewer I have tested this against reads this object, I get errors such as "Error (741): Missing 'endstream'", and similarly for every token in that stream. I am sure my offsets are correct. And while I know my PDF viewer does support some PS for forms and such, is there anything obviously incorrect? If anyone has a sample PDF I can go from, that would be nice. The form examples that I tested my reader against have not been too helpful. If I run just the PS code from GhostView it works fine. Thanks for any insight.
I've scoured my back collection of PDF files and come up with 2 which contain PS XObjects (this really is deprecated). I can't, unfortunately, share them as they are customer data files :-( However, here is an extract from one of them:
74 0 obj
<<
/Type /XObject
/Subtype /PS
/Filter /FlateDecode
/Length 77 0 R
/Name /Ps1
>>
stream
....endstream
Note 1: there is no EOL between the end of the data and the 'endstream' token.
77 0 obj
4480
endobj
The offset of the 0x0A following the 'stream' token is 0xdab15, the offset of the 'e' in endstream is 0xdbc96. That is 4481 bytes. So it looks to me like the /Length should contain all the bytes after the EOL for the 'stream' token, right up to the last byte before the 'e' in the endstream token.
I think it would be OK to insert a 0x0A after the stream data and before the endstream. That would come down to a whitespace after the stream data before the token, and PDF is supposed to be tolerant of whitespace. This is consistent with the description of the /Length entry for stream dictionaries in Table 3.4 (p. 62 of the PDF 1.7 reference):
The number of bytes from the beginning of the line following the keyword stream to the last byte just before the keyword endstream. (There may be an additional EOL marker, preceding endstream, that is not included in the count and is not logically part of the stream data.) See "Stream Extent," above, for further discussion.
I think (if I've counted correctly) that the /Length in your example should be 87, assuming one-byte line terminators in the PostScript fragment.
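For completeness, the arithmetic behind those offsets (using only the figures quoted above) works out like this:
# Offsets quoted above for the PS XObject stream in 74 0 obj.
newline_after_stream = 0xdab15   # the 0x0A that follows the 'stream' keyword
endstream_e          = 0xdbc96   # the 'e' of the 'endstream' keyword

# Counting from that newline up to (but not including) 'endstream' gives 4481 bytes;
# the stream data proper starts on the byte after the newline, which is why the
# /Length value (4480, via the indirect reference 77 0 R) is one byte less.
print(endstream_e - newline_after_stream)         # 4481
print(endstream_e - (newline_after_stream + 1))   # 4480, matching /Length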