PDF Cross Reference Streams

I'm developing a PDF parser/writer, but I'm stuck at generating cross reference streams.
My program reads this file and then removes its linearization, and decompresses all objects in object streams. Finally it builds the PDF file and saves it.
This works really well when I use the normal cross reference & trailer, as you can see in this file.
When I try to generate a cross reference stream object instead (which results in this file), Adobe Reader can't view it.
Does anyone have experience with PDFs and can help me find what the problem is?
Note that the cross reference is the ONLY difference between file 2 and file 3. The first 34127 bytes are the same.
If someone needs the content of the decoded reference stream, download this file and open it in a hex editor. I've checked this reference table again and again but could not find anything wrong. The dictionary seems to be OK, too.
Thanks so much for your help!!!
Update
I've now completely solved the problem. You can find the new PDF here.

Two problems I see (without looking at the stream data itself):
"Size integer (Required) The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary."
Your size should be... 14.
"Index array (Optional) An array containing a pair of integers for each subsection in this section. The first integer shall be the first object number in the subsection; the second integer shall be the number of entries in the subsection
The array shall be sorted in ascending order by object number. Subsections cannot overlap; an object number may have at most one entry in a section.
Default value: [0 Size]."
Your index should probably skip around a bit. You have no objects 2-4 or 7. The index array needs to reflect that.
Your data ain't right either (and I just learned how to read an xref stream. Yay me.)
00 00 00
01 00 0a
01 00 47
01 01 01
01 01 70
01 02 fd
01 76 f1
01 84 6b
01 84 a1
01 85 4f
According to this data, which because of your missing /Index is interpreted as covering object numbers 0 through 9, the objects have the following offsets:
0 is unused. Fine.
1 is at 0x0a. Yep, sure is
2 is at 0x47. Nope. That lands near the beginning of "1 0"'s stream. This probably isn't a coincidence.
3 is at 0x101. Nope. 0x101 is still within "1 0"'s stream.
4 is at 0x170. Ditto
5 is at 0x2fd. Ditto
6 is at 0x76f1. Nope, and this time buried inside that image's stream.
I think you get the idea. So even if you had a correct /Index, your offsets are all wrong (and completely different from what's in resultNormal.pdf, even allowing for dec-hex confusion).
What you want can be found in resultNormal's xref:
xref
0 2
0000000000 65535 f
0000000010 00000 n
5 2
0000003460 00000 n
0000003514 00000 n
8 5
0000003688 00000 n
0000003749 00000 n
0000003935 00000 n
0000004046 00000 n
0000004443 00000 n
So your index should be (if I'm reading this right): /Index [0 2 5 2 8 5]. And the data:
0 0 0
1 0 a
1 3460 (that's decimal)
1 3514 (ditto)
1 3688
etc
Interestingly, the PDF spec says that the size must be BOTH the number of entries in this and all previous XRefs AND the number one higher than the highest object number in use.
I don't think the latter part is ever enforced, but I wouldn't be surprised to find that xref streams are more retentive than the normal cross reference tables. Might be the same code handling both, might not.
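If it helps, here is a minimal sketch (Python, purely illustrative, not your code) that builds the entry data and the /Index array for a cross reference stream from the offsets quoted above. It uses /W [1 4 2] and no predictor for simplicity; your file uses /W [1 2 0] with a PNG predictor, which would additionally need the row filtering described by /DecodeParms. Note that the xref stream object itself (13 here, at the startxref offset 34127) needs its own entry too, which is why the last subsection grows to 8 6 and /Size ends up as 14.

import zlib

# object number -> (type, offset, generation); offsets copied from the
# classic xref table of resultNormal.pdf quoted above
entries = {
    0: (0, 0, 65535),        # free entry
    1: (1, 0x0A, 0),
    5: (1, 3460, 0),
    6: (1, 3514, 0),
    8: (1, 3688, 0),
    9: (1, 3749, 0),
    10: (1, 3935, 0),
    11: (1, 4046, 0),
    12: (1, 4443, 0),
    13: (1, 34127, 0),       # the cross reference stream object itself
}

w = (1, 4, 2)                # /W: width in bytes of the three fields
index, data = [], b""
nums = sorted(entries)
run_start = prev = None
for n in nums:
    if prev is None or n != prev + 1:          # start a new /Index subsection
        if run_start is not None:
            index += [run_start, prev - run_start + 1]
        run_start = n
    prev = n
    t, offset, gen = entries[n]
    data += t.to_bytes(w[0], "big") + offset.to_bytes(w[1], "big") + gen.to_bytes(w[2], "big")
index += [run_start, prev - run_start + 1]

print("/Index", index)                 # [0, 2, 5, 2, 8, 6]
print("/Size", max(nums) + 1)          # 14
print("/Length", len(zlib.compress(data)))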
#mtraut:
Here's what I see:
13 0 obj <</Size 10/Length 44/Filter /FlateDecode/DecodeParms <</Columns 3/Predictor 12>>/W [1 2 0]/Type /XRef/Root 8 0 R>>
stream
...
endstream
endobj

The "resultstream.pdf" does not have a valid cross ref stream.
If I open it in my viewer, it tries to read object "13 0" as a cross ref stream, but it's a plain dictionary (the stream keywords and data are missing).
A little off topic: What language are you developing in? At least in Java I know of three valuable choices (PDFBox, iText and jPod, where I personally, as one of the developers, opt for jPod; very clean implementation :-). If this does not fit your platform, maybe you can at least have a look at the algorithms and data structures.
EDIT
Well - if "resultstream.pdf" is the document in question, then this is what my editor (SciTE) sees:
...
13 0 obj
<</Size 0/W [1 2 0]/Type /XRef/Root 8 0 R>>
endobj
startxref
34127
%%EOF
There is no stream.

Related

Fill and sign for PDF file not working using Acrobat Reader DC

I'm asking this here because given the searches I've done, it appears Adobe's support is next to non-existent. I have, according to this online validation tool:
https://www.pdf-online.com/osa/validate.aspx
A perfectly valid PDF, which is generated from code. However, when using Acrobat Reader DC I am unable to use Fill And Sign - when attempting to sign, it throws this error:
The operation failed because Adobe Acrobat encountered an unknown error
This is the offending PDF:
https://github.com/DelphiWorlds/MiscStuff/blob/master/Test/PDF/SigningNoWork.pdf
This is one which is very similar, where Fill and Sign works:
https://github.com/DelphiWorlds/MiscStuff/blob/master/Test/PDF/SigningWorks.pdf
Foxit Reader has no issue with either of them - Fill and Sign works without fail.
I would post the source of the files, however because they have binary data, I figure links to them is better.
The question is: why does the first one fail to work, but not the second?
In your non-working file all the fonts are defined with
/FirstChar 30
/LastChar 255
i.e. having 226 glyphs. Their respective Widths arrays only have 224 entries, though, so they are incomplete.
After adding two entries to each Widths array, Adobe Reader here does not run into that unknown error anymore during Fill And Sign.
As the OP inquired how exactly I changed those widths arrays:
I wanted the change to have as few side effects as possible, so I was glad to see that there was some empty space in the font dictionaries in question; a trivial hex edit sufficed, with no need to shift indirect objects and update cross references:
In each of those font definitions in the objects 5, 7, 9, and 11 the Widths array is the last dictionary entry value and ends with some white space, after the last width we have these bytes:
20 0D 0A 5D 0D 0A 3E 3E --- space CR NL ']' CR NL '>' '>'
I added two 0 values using the white space:
20 30 20 30 20 5D 3E 3E --- space '0' space '0' space ']' '>' '>'
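For completeness, here is a small consistency check that uncovers the problem (Python with pypdf, purely illustrative and not the OP's toolchain): it flags any simple font whose /Widths array does not cover /FirstChar../LastChar.

from pypdf import PdfReader

reader = PdfReader("SigningNoWork.pdf")
for page in reader.pages:
    fonts = page["/Resources"]["/Font"]
    for name, ref in fonts.items():
        font = ref.get_object()
        if "/FirstChar" in font and "/Widths" in font:
            expected = int(font["/LastChar"]) - int(font["/FirstChar"]) + 1
            widths = font["/Widths"]
            if len(widths) != expected:
                print(name, font.get("/BaseFont"), len(widths), "widths, expected", expected)

For the file above this reports 224 widths where 226 (character codes 30..255) are expected for each of the four fonts.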
Acrobat Reader DC (the free version) does not allow you to do Fill and Sign anymore if your document has metadata attached to it.
You need to purchase the Pro DC version, which is like $14.99, in order to continue using the fill and sign on here.
I just got done with a 4 months support exchange of emails with Adobe, and that was their final answer.

How to calculate the dPermissions parameter in the Ghostscript command line?

I am looking for an online calculator, a tool or at least an understandable article which lets me work out the value of the dPermissions parameter of the Ghostscript command line.
Please advise!
It's documented in VectorDevices.htm, where it says it's a bit field and directs you to the PDF Reference Manual. The actual values are defined by Adobe.
The various access permissions are described under the Standard Security Handler (on p121 of the 1.7 PDF Reference) and the individual bits are described in Table 3.20 (p124 in the 1.7 PDF Reference Manual).
Bits 1 and 2 (the lowest 2 bits) are always defined as 0, as (currently) are bits 13-32. Bits 7 & 8, annoyingly, are reserved and must be 1.
So let's say you want to grant permission to print the document; to do that you need to set bit 3. So bits 1-2 are 0 and bits 4-32 are also 0, except that bits 7 and 8 must be 1. In binary that corresponds to:
00000000 00000000 00000000 11000100
Which in hex is 00 00 00 C4, which in decimal is 196. So you would set -dPermissions=196.
To take a more complex example, we might also want to set bit 12 to allow a high quality print (for revision 3 or better of the security handler). Now we want to set bits 3 and 12, in binary:
00000000 00000000 00001000 11000100
In hex that is 00 00 08 C4, which is decimal 2244, so you would set -dPermissions=2244.
The Windows calculator, when set to programmer mode, has a binary entry configuration. If you enter the bitfield in binary, and then switch to decimal it'll convert it for you. Alternatively there's an online conversion tool here.
Just write out the bits you want set as binary, set bits 7 & 8, then convert to decimal, simples!
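If you prefer to script the arithmetic, a tiny sketch (Python, just for illustration):

def permissions(*bits):
    # bits 7 and 8 (1-based, counted from the least significant bit) are
    # reserved and must always be 1
    value = (1 << 6) | (1 << 7)
    for bit in bits:
        value |= 1 << (bit - 1)
    return value

print(permissions(3))       # printing allowed                   -> 196
print(permissions(3, 12))   # printing + high-quality printing   -> 2244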
--EDIT--
So as Vsevolod Azovsky pointed out, the bits 12-32 should be 1. Using the same tool I pointed at above you can get the decimal signed 2's complement of the binary representation, which you can use as the value for Permissions.
However, if you do that, then Ghostscript's pdfwrite device will produce a warning. The reason is that some of the bits I've set above (anything above bit 8) are only compatible with the revision 3 (or better) security handler, and the default for pdfwrite is to use the revision 2 security encryption.
So if you want to use the bits marked in the Adobe documentation as 'revision 3' then you (obviously) need to set the revision to 3 using -dEncryptionR=3. This requires that the output PDF file be a 1.4 or greater file, you can't use revision 3 with a PDF 1.3 file.
Note that for the revision 2 security handler all the bits 1-2 and 7-32 must be 1.
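For illustration, converting such a value to the signed form is a one-liner; which reserved high bits you raise depends on the handler revision as discussed above, so the example value is only an assumption:

def signed_permissions(value):
    # reinterpret the 32-bit permission flags as a signed integer, which is
    # how /P is commonly written once the high reserved bits are set to 1
    return value - (1 << 32) if value & (1 << 31) else value

print(signed_permissions(0xFFFFF0C4))   # bit 3, bits 7-8 and bits 13-32 set -> -3900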
Hopefully that also answers the questions in the last comment.

How to make sense of DICT data in CFF font format

Problem
I'm trying to parse an OTF/CFF font and am struggling with the top DICT part, more specifically the top DICT data part.
CFF File
The beginning of the CFF table looks like this in a hex editor:
The top DICT starts on the second line at offset 0xC2 with 00 01 (top DICT INDEX count), 01 (top DICT INDEX offset size), 01 77 (top DICT INDEX offsets).
The large yellow section is the data part for the DICT, but I simply cannot make sense of it. I referenced: https://typekit.files.wordpress.com/2013/05/5176.cff.pdf
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/T1_SPEC.pdf
Things I tried
Since the top DICT starts with version, Notice and Copyright, which are SIDs, I tried to look up the corresponding strings, but the offsets were way off.
I tried to decode the data using Table 3 on page 10 of the CFF reference PDF, essentially taking two bytes, b0 and b1, and calculating the value, but the values seemed unrelated.
Further Information
It seems I'm having difficulty understanding Table 3 and Table 4. So the DICT data is supposed to be 1 or 2 byte operators and variable sized operands, and these are concatenated throughout the data? Some examples would be helpful.
I misunderstood the encoding procedure. You need to start from the beginning and, based on the first byte, determine which encoding it uses: integer encodings, real number encoding, operators, etc.
Btw, this font has CIDFont operator extensions, e.g. F8 1B F8 1C 8D 0C 1E, meaning it is a CID font. So it doesn't have an encoding offset; don't waste time like me trying to find one!
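To make the procedure concrete, here is a minimal DICT decoder sketch following Tables 3 and 4 of the CFF reference (Python, purely illustrative; real number operands are skipped over rather than converted):

def parse_dict(data):
    # returns a list of (operands, operator) pairs decoded from DICT data
    ops, operands, i = [], [], 0
    while i < len(data):
        b0 = data[i]
        if 32 <= b0 <= 246:                       # one-byte integer
            operands.append(b0 - 139); i += 1
        elif 247 <= b0 <= 250:                    # two-byte positive integer
            operands.append((b0 - 247) * 256 + data[i + 1] + 108); i += 2
        elif 251 <= b0 <= 254:                    # two-byte negative integer
            operands.append(-(b0 - 251) * 256 - data[i + 1] - 108); i += 2
        elif b0 == 28:                            # 16-bit integer
            operands.append(int.from_bytes(data[i + 1:i + 3], "big", signed=True)); i += 3
        elif b0 == 29:                            # 32-bit integer
            operands.append(int.from_bytes(data[i + 1:i + 5], "big", signed=True)); i += 5
        elif b0 == 30:                            # real number, nibble-encoded, 0xF terminates
            j = i + 1
            while j < len(data) and (data[j] & 0x0F) != 0x0F and (data[j] >> 4) != 0x0F:
                j += 1
            operands.append("<real>"); i = j + 1
        elif b0 == 12:                            # two-byte operator (12 x)
            ops.append((operands, (12, data[i + 1]))); operands = []; i += 2
        elif b0 <= 21:                            # one-byte operator
            ops.append((operands, b0)); operands = []; i += 1
        else:
            raise ValueError("reserved byte %d at offset %d" % (b0, i))
    return ops

# the ROS sequence mentioned above:
print(parse_dict(bytes.fromhex("F81BF81C8D0C1E")))
# -> [([391, 392, 2], (12, 30))], i.e. operator 12 30 (ROS) with registry SID,
#    ordering SID and supplement as operands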

Decoding a FlateDecoded section of text in a PDF document

Using peepdf I am analyzing two simple pdf files. Both files contain a single line of text ("ZYXWVUTSRQQRSTUVWXYZ") and were created on Mac OS X.
The first file was created with TextEdit. There are only three streams, and looking at the first one (automatically decoded with peepdf) shows the text clearly.
PPDF> stream 4
q Q q 72 707.272 468 12.72803 re W n /Cs1 cs 0 sc q 0.9790795 0 0 -0.9790795 72 720
cm BT 0.0001 Tc 11 0 0 -11 5 10 Tm /TT1 1 Tf (ZYXWVUTSRQQRSTUVWXYZ) Tj ET
Q Q
The second file was created with MS Word. There are four streams but the decoded text is nowhere to be found. Looking at the corresponding stream in the Word doc does not reveal the decoded string:
PPDF> stream 4
q Q q 18 40 576 734 re W n /Cs1 cs 0 0 0 sc q 0.24 0 0 0.24 90 708.72 cm BT
-0.0004 Tc 50 0 0 50 0 0 Tm /TT2 1 Tf [ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\))
-1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ] TJ ET Q q 0.24 0 0 0.24 239.168 708.72
cm BT 50 0 0 50 0 0 Tm /TT2 1 Tf (+) Tj ET Q Q
It's not apparent to me where the string is in the file or what the information in this stream means. Any insights?
It's not apparent to me where the string is in the file
In general you won't see the clear text in the content stream because the encoding used there need not be a standard encoding, nothing ASCII-ish.
[ (!") -1 (#) -1 ($) -1 (%&'\() -1 (\)) -1 (*) -1 (*) -1 (\)) -1 (\() -1 ('&%$) -1 (#) -1 (") -1 (!) ] TJ
This operation in its array operand contains your ZYXWVUTSRQQRSTUVWXYZ with some kerning corrections for certain pairs of characters.
It looks like an ad hoc encoding using the bytes from 33 (= 0x21 = '!') onwards. '!' is used for the first glyph needed, the Z, '"' for the second one needed, Y, '#' for the third one, X, etc. Your test string not only starts with these chars but also ends with them, and so does the array above, (!") -1 (#) ... (#) -1 (") -1 (!).
Inspect the definition of the font used (TT2). It may (or may not) include information helping you decoding this encoding.
or what the information in this stream means. Any insights?
To understand the contents of PDF content streams, you should read the relevant sections of the PDF specification ISO 32000-1, especially chapters 8 Graphics and 9 Text.
As your question is focused on the recognition of text content, e.g. read section 9.10.2 Mapping Character Codes to Unicode Values:
A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):
If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):
a) Map the character code to a character name according to Table D.1 and the font’s Differences array.
b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
a) Map the character code to a character identifier (CID) according to the font’s CMap.
b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).
d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).
e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
NOTE Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version. (Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
Edit: Concerning the comment
One of the objects gave some font info. It is 'JJOWGO+Cambria' and references object 16 as the 'font file' which was also unreadable. I'll review the manual. Can't find anything online about 'JJOWGO'.
You won't find anything specific about JJOWGO because it most likely is a random key sequence prefixed to Cambria to indicate that not all of that font is embedded but only a subset. Cf. section 9.6.4 Font Subsets of ISO 32000-1:
PDF documents may include subsets of Type 1 and TrueType fonts. The font and font descriptor that describe a font subset are slightly different from those of ordinary fonts. These differences allow a conforming reader to recognize font subsets and to merge documents containing different subsets of the same font. (For more information on font descriptors, see 9.8, "Font Descriptors".)
For a font subset, the PostScript name of the font—the value of the font’s BaseFont entry and the font descriptor’s FontName entry— shall begin with a tag followed by a plus sign (+). The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file shall have different tags.
EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font.
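(As an aside, that naming convention is easy to test for programmatically; a trivial sketch, assuming the name is already available as a string:)

import re

def subset_tag(base_font):
    # returns the six-letter subset tag of names like "JJOWGO+Cambria", else None
    m = re.match(r"^([A-Z]{6})\+", base_font)
    return m.group(1) if m else None

print(subset_tag("JJOWGO+Cambria"))   # JJOWGO
print(subset_tag("Cambria"))          # None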
<<
/FontBBox [ -1475 -2463 2867 3117 ]
/StemV 0
/FontFile2 16 0 R
/Descent -222
/XHeight 467
/Flags 4
/Ascent 950
/FontName /JJOWGO+Cambria
/Type /FontDescriptor
/ItalicAngle 0
/AvgWidth 615
/MaxWidth 2919
/CapHeight 667
>>
This font descriptor contains no obvious encoding information. Have a look at the actual Font dictionary and look for a ToUnicode entry, cf. the quotation of section 9.10.2 above.
Comments from #mkl made it clear what is happening. The text in the pdf produced by MS Word was using a character map.
I tracked down the font dictionary by searching for objects with a ToUnicode entry:
<< /FirstChar 33
/Widths [ 538 570 571 921 604 648 593 496 621 653 220 ]
/Type /Font
/BaseFont /JJOWGO+Cambria
/LastChar 43
/Subtype /TrueType
/FontDescriptor 13 0 R
/ToUnicode 14 0 R >>
The ToUnicode entry referenced object 14, so I looked at that next:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
1 beginbfchar
<2b><0009 000d 0020 00a0>
endbfchar
10 beginbfrange
<21><21><005a>
<22><22><0059>
<23><23><0058>
<24><24><0057>
<25><25><0056>
<26><26><0055>
<27><27><0054>
<28><28><0053>
<29><29><0052>
<2a><2a><0051>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
Section 9.10.3 of ISO 32000-1 explains how beginbfrange maps character ranges to each other. Ranges of character codes are mapped to Unicode values. The "range" 21-21 contains a single character, which is "!". It is mapped to U+005a ("Z"). The mapping contains a line for every character in my test document, from Z to Q. (! to *)
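As an illustration, a small sketch (Python, not peepdf) that rebuilds the code-to-Unicode map from the bfchar/bfrange entries above and decodes the character codes used in the TJ array:

import re

bfranges = """
<21><21><005a> <22><22><0059> <23><23><0058> <24><24><0057> <25><25><0056>
<26><26><0055> <27><27><0054> <28><28><0053> <29><29><0052> <2a><2a><0051>
"""

code_to_uni = {}
for lo, hi, dst in re.findall(r"<(..)><(..)><(....)>", bfranges):
    lo, hi, dst = int(lo, 16), int(hi, 16), int(dst, 16)
    for code in range(lo, hi + 1):
        code_to_uni[code] = chr(dst + (code - lo))
code_to_uni[0x2B] = " "   # the beginbfchar entry maps <2b> to whitespace code points

codes = '!"#$%&\'()**)(\'&%$#"!'   # the codes from the TJ array, concatenated
print("".join(code_to_uni[ord(c)] for c in codes))   # ZYXWVUTSRQQRSTUVWXYZ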

Text object gets deleted in Adobe Reader 11

In Adobe Acrobat X I was inserting text objects, and when the PDF was opened in Adobe Reader 10 it displayed properly. But in Adobe Reader 11, when I click on that PDF file, the text objects get deleted. Why does this happen? How can I solve it?
The source PDF file: click here
The PDF file which has the problem when double-clicking on it in Adobe Reader 11:
click here
In a nutshell:
You try to change the contents of a free text annotation by changing its normal appearance stream.
This is insufficient: A compliant PDF viewer may ignore this entry and provide their own appearances. So it's mere luck that older Adobe Reader versions chose to not ignore your change.
Thus, you also need to change the information a PDF viewer is expected to create their own appearance from, i.e. foremost the rich text value of RC (in the free text annotation dictionary) that shall be used to generate the appearance of the annotation, and also the Contents value which is the Text that shall be displayed for the annotation.
Furthermore there are defects in your PDFs:
the cross reference table in your first attempt result.pdf was broken;
the intent (IT value) of the free text annotation in your source files is spelled incorrectly.
In detail:
Your result.pdf is broken. Different PDF viewers may display broken PDFs differently.
Some details:
It has been created based on your Src.pdf in append mode, but additionally the following change has been made to the /Pages object of the original revision:
In the source:
6 0 obj
<</Count 6
/Type /Pages
/Kids [ 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R ]
>>
endobj
In the result:
6 0 obj
<</Count 3
/Type /Pages
/Kids [ 7 0 R 8 0 R 9 0 R 12 0 R 11 0 R 10 0 R ]
>>
endobj
So the order of the last three pages was changed (which is ok) and the /Count was reduced from 6 to 3. This is inconsistent as there still are 6 child objects but according to the PDF specification ISO 32000-1, Count is
The number of leaf nodes (page objects) that are descendants of this node within the page tree.
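A quick way to check this (a sketch in Python with pypdf, purely for illustration) is to count the page leaves under the /Pages node and compare that with the declared /Count:

from pypdf import PdfReader

def count_leaves(node):
    # number of page objects (leaf nodes) below a page tree node
    if node["/Type"] == "/Page":
        return 1
    return sum(count_leaves(kid.get_object()) for kid in node["/Kids"])

reader = PdfReader("result.pdf")
pages_root = reader.trailer["/Root"]["/Pages"]
print(pages_root["/Count"], "declared,", count_leaves(pages_root), "actual")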
Furthermore the cross reference table of the appended revision is broken.
xref
0 1
0000000000 65535 f
24 1
0001465240 00000 n
57 1
0001466075 00000 n
66 1
0001466909 00000 n
73 1
0001467744 00000 n
93 1
0001473484 00000 n
131 1
0001478703 00000 n
The entries are 19 bytes long, including their respective single-byte newline character at the end. According to the spec, though,
Each entry shall be exactly 20 bytes long, including the end-of-line marker.
The format of an in-use entry shall be: nnnnnnnnnn ggggg n eol
where [...] eol shall be a 2-character end-of-line sequence
There may be more errors in the PDF but you may want to start fixing these.
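For reference, emitting spec-conforming entries is straightforward; a tiny sketch (Python, illustration only) that pads every entry to exactly 20 bytes with a two-character end-of-line:

def xref_entry(offset, gen=0, in_use=True):
    # nnnnnnnnnn ggggg n|f followed by a two-character end-of-line (space + LF)
    entry = "%010d %05d %s \n" % (offset, gen, "n" if in_use else "f")
    assert len(entry) == 20
    return entry.encode("ascii")

print(xref_entry(1465240))                   # b'0001465240 00000 n \n'
print(xref_entry(0, 65535, in_use=False))    # b'0000000000 65535 f \n'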
EDIT
Now, with the new PDF Pay-in.pdf with a proper cross reference at hand, let's look at it in more depth.
Adobe Preflight complains about many occurrences of:
[...]
An unexpected value is associated with the key
Key: IT
Value: /FreeTextTypewriter
Type: CosName
Formal Representation: Annot.AnnotFreeText
Cos ID: 86
Traversal Path: ->Pages->Kids->[0]->Annots->[13]
[...]
Ok, let's look at that object 86:
86 0 obj
<< /P 8 0 R
/Type /Annot
/CreationDate (D:20130219194939+05'30')
/T (winman)
/NM (0f202782-2274-44b8-9081-af4010be86d4)
/Subj (Typewritten Text)
/M (D:20130219195100+05'30')
/F 4
/Rect [ 53.2308 33.488 552.088 826.019 ]
/DS (font: Helv 12.0pt;font-stretch:Normal; text-align:left; color:#000000 )
/AP <</N 107 0 R >>
/Contents (wwww)
/IT /FreeTextTypewriter
/BS 108 0 R
/Subtype /FreeText
/Rotate 90
/DA (16.25 TL /Cour 12 Tf)
/RC (<?xml version="1.0"?>
<body xmlns="http://www.w3.org/1999/xhtml"
xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/"
xfa:APIVersion="Acrobat:10.0.0"
xfa:spec="2.0.2"
style="font-size:12.0pt;text-align:left;color:#000000;font-weight:normal;
font-style:normal;font-family:Helv;font-stretch:normal">
<p dir="ltr">
<span style="line-height:16.3pt;font-family:Helvetica">wwww</span>
</p>
</body>)
>>
endobj
Preflight stated that it is unhappy about the line /IT /FreeTextTypewriter. Looking at the PDF specification again uncovers, for annotations with /Subtype /FreeText, i.e. free text annotations as specified in section 12.5.6.6:
IT name
(Optional; PDF 1.6) A name describing the intent of the free text annotation (see also the IT entry in Table 170). The following values shall be valid:
FreeText The annotation is intended to function as a plain free-text annotation. A plain free-text annotation is also known as a text box comment.
FreeTextCallout The annotation is intended to function as a callout. The callout is associated with an area on the page through the callout line specified in CL.
FreeTextTypeWriter The annotation is intended to function as a click-to-type or typewriter object and no callout line is drawn.
Default value: FreeText
Thus, your value FreeTextTypewriter is invalid (remember, PDF names are case sensitive!). Therefore, the annotation is (slightly) broken which may already result in all kinds of problems.
But there are other important entries here, too, to understand your issue: All you do in your appended changes is to replace the appearance stream in object 107 (as per /AP <</N 107 0 R >>) of this annotation by a different one. But this annotation contains an RC value, too, which according to the specification is
A rich text string (see 12.7.3.4, “Rich Text Strings”) that shall be used to generate the appearance of the annotation.
Thus, any PDF viewer may regenerate the appearance from that rich text description, especially as the specification in section 12.5.2 says about the content of the AP dictionary
Individual annotation handlers may ignore this entry and provide their own appearances.
Thus, simply replacing the normal appearance stream does not suffice to permanently change the appearance of that annotation, you have to change the appearance dictionary and at least remove any alternative source for the appearance.
Furthermore the entry /Contents (wwww) is not replaced by your appended changes either. So a PDF viewer trying to decide whether to use the appearance stream or not will feel tempted to somehow create a new appearance as your appearance stream in no way represents that value.
Especially when starting to manipulate the free text (e.g. when clicking into the PDF in your case), the PDF viewer knows it eventually will have to create a new appearance anyway, and unless the current appearance is exactly as it would have created it, the viewer may prefer to begin anew with an appearance derived from the rich text or even the Contents value.
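To summarize in code: a minimal sketch (Python with pypdf, purely illustrative, not the OP's toolchain) of changing the annotation consistently instead of only swapping the appearance stream. It updates Contents, drops the rich text RC (alternatively, rewrite it to match), and fixes the IT spelling; regenerating or removing the /AP appearance would then be the remaining step.

from pypdf import PdfReader, PdfWriter
from pypdf.generic import NameObject, TextStringObject

reader = PdfReader("Pay-in.pdf")
new_text = "wwww (replaced)"                     # whatever the new content should be

for page in reader.pages:
    annots = page.get("/Annots")
    if annots is None:
        continue
    for ref in annots.get_object():
        annot = ref.get_object()
        if annot.get("/Subtype") == "/FreeText":
            annot[NameObject("/Contents")] = TextStringObject(new_text)
            if "/RC" in annot:                   # remove the alternative appearance source
                del annot["/RC"]
            annot[NameObject("/IT")] = NameObject("/FreeTextTypeWriter")   # correct spelling

writer = PdfWriter()
writer.append(reader)                            # clone the modified document
writer.write("Pay-in-fixed.pdf")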