How to map links in a PDF to the bookmark format?

A pdf file contains links like the following.
<</05_preface [15 0 R /XYZ 27.999998 763.99994 0]
/sec1 [343 0 R /XYZ 27.999998 393.1575 0]
/fn1 [343 0 R /XYZ 204.5918 254.82751 0]
...
I want to set bookmarks pointing to these locations using cpdf.
I guess 343 is a page number. What is the meaning of 0 after 343? What does "R" mean?
How can I convert this data to a format understandable by cpdf to set pdf bookmarks? Solution in pdftk is also welcome.
EDIT: mkl commented on my error about "343". I see the following contents in the PDF referring to "343 0". How can I figure out the page number of "343 0" then? What is the meaning of R?
343 0 obj
<</Type /Annot
/Subtype /Link
/F 4
/Border [0 0 0]
/Rect [34.005386 592.09595 226.92856 604.85718]
/Dest /sec1
/StructParent 100300>>
endobj
...
100300 0 obj
<</Type /StructElem
/S /NonStruct
/P 100299 0 R
/K [<</Type /MCR
/Pg 16055 0 R
/MCID 28>>]
/ID (node00083500)>>
endobj
...
16055 0 obj
<</Type /Page
/Resources <</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
/ExtGState <</G3 3 0 R>>
/Font <</F7 7 0 R>>>>
/MediaBox [0 0 612 792]
/Contents 16056 0 R
/StructParents 1308
/Parent 16810 0 R>>
endobj
...
16810 0 obj
<</Type /Pages
/Count 8
/Kids [16047 0 R 16049 0 R 16051 0 R 16053 0 R 16055 0 R 16057 0 R 16059 0 R 16061 0 R]
/Parent 16868 0 R>>
endobj
...
16868 0 obj
<</Type /Pages
/Count 64
/Kids [16807 0 R 16808 0 R 16809 0 R 16810 0 R 16811 0 R 16812 0 R 16813 0 R 16814 0 R]
/Parent 16875 0 R>>
endobj
...
16875 0 obj
<</Type /Pages
/Count 512
/Kids [16864 0 R 16865 0 R 16866 0 R 16867 0 R 16868 0 R 16869 0 R 16870 0 R 16871 0 R]
/Parent 16877 0 R>>
endobj
...
16877 0 obj
<</Type /Pages
/Count 1604
/Kids [16873 0 R 16874 0 R 16875 0 R 16876 0 R]>>
endobj
...
I tried to use cpdf to get all objects of type /Pages. Why is only the last one printed?
$ cpdf -print-dict-entry /Pages chrome2pdf.pdf
{"/Type":{"N":"/Pages"},"/Count":{"I":1604},"/Kids":[16873,16874,16875,16876]}

Deep PDF surgery, this.
You say that you want to use cpdf to add some bookmarks which point to some destinations which already exist in the file. For example...
[15 0 R /XYZ 27.999998 763.99994 0]
... is a destination on the page whose page tree node has internal PDF object number 15.
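On the notation itself: in 15 0 R, 15 is the object number, 0 is the generation number, and R marks the triple as an indirect reference to that object. A minimal Python sketch for pulling those parts out of destination lines like the ones quoted in the question follows. The function name parse_dest and the regular expression are illustrative assumptions, not part of cpdf, and only the /XYZ form shown above is handled.

```python
import re

# Matches lines such as: /sec1 [343 0 R /XYZ 27.999998 393.1575 0]
# Only the /XYZ destination form from the question is handled.
DEST_RE = re.compile(
    r"/(?P<name>\S+)\s*\[\s*(?P<obj>\d+)\s+(?P<gen>\d+)\s+R\s+/XYZ\s+"
    r"(?P<left>\S+)\s+(?P<top>\S+)\s+(?P<zoom>[^\s\]]+)\s*\]"
)

def parse_dest(line):
    """Split one named-destination line into its parts, or return None."""
    m = DEST_RE.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),      # destination name, e.g. "sec1"
        "obj": int(m.group("obj")),   # object number of the referenced object
        "gen": int(m.group("gen")),   # generation number (almost always 0)
        "left": float(m.group("left")),
        "top": float(m.group("top")),
        "zoom": m.group("zoom"),      # kept as text; may be a number or null
    }
```

The object number returned here is what still has to be translated into a page number.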
So it sounds like you want a correspondence between page numbers and page tree node object numbers. This is not trivial. When you use -list-bookmarks or -list-bookmarks-json in cpdf, the output replaces the internal PDF object number (say, 15 here) with the actual page number (say, 10). It does this to make roundtripping possible, i.e. save the bookmarks, modify the file in some way, then add the bookmarks back. So you are not exposed to the internal object numbers, which are subject to change.
Use
cpdf -output-json in.pdf -o out.json
to output the PDF file as JSON. Now you can read the page tree from that JSON by following /Pages in the document catalog, which is the object referenced by /Root in the trailer. The leaf pages of the page tree, visited depth-first with each /Kids array read left to right, are pages 1...n of the file.
Now you have a mapping from object numbers back to page numbers, e.g. [(15, 1), (27, 2), (9, 3) ... (m, n)].
Now you can build a cpdf bookmarks file using those page numbers and the destination coordinates, e.g. /XYZ 27.999998 763.99994 0, that you have already extracted.
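The page tree walk described above can be sketched in Python as follows. This is only an illustration: the objects table is a hypothetical, simplified structure (object number mapped to a small dict), not the actual shape of cpdf's JSON output, so the lookups would need adapting to the real file.

```python
def page_numbers(objects, root_pages):
    """Map page object number -> 1-based page number.

    `objects` is a hypothetical, simplified object table:
    {objnum: {"Type": "Pages", "Kids": [objnum, ...]}} for tree nodes and
    {objnum: {"Type": "Page"}} for leaves. `root_pages` is the object
    number that /Pages in the catalog points to.
    """
    mapping = {}
    stack = [root_pages]
    while stack:
        num = stack.pop()
        node = objects[num]
        if node["Type"] == "Page":
            # Leaves are reached in document order, so count them as we go.
            mapping[num] = len(mapping) + 1
        else:
            # Push kids reversed so the leftmost kid is visited first.
            stack.extend(reversed(node["Kids"]))
    return mapping
```

With the mapping in hand, each destination's object number can be looked up to get the page number that goes into the bookmarks file.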
(Side note: the -print-dict-entry example you give does look like it might be a bug. Can you report it, and supply the file, please?)

Related

Trying to embed simple UTF16 character into manually created PDF but failing

I'm trying to manually create a PDF document (using the PDFGen C code on github). This is on a small footprint device with limited storage.
All works fine until I want to embed (say) the Unicode Ohms character (U+2126).
Below is the test file I'm using, which should show "Hello" with an Ohms symbol after the 'H'.
However, it actually shows "H!&ello".
%PDF-1.4
<hex chars removed>
1 0 obj
<< /Pages 2 0 R /Type /Catalog >>
endobj
2 0 obj
<< /Count 1 /Kids [ 3 0 R ] /Type /Pages >>
endobj
3 0 obj
<< /Contents 4 0 R /MediaBox [ 0 0 500 800 ] /Parent 2 0 R /Resources 5 0 R /Type /Page >>
endobj
4 0 obj
<< /Length 57 >>
stream
BT /F1 24 Tf 175 720 Td <FEFF004821260065006C006C006F> Tj ET
endstream
endobj
5 0 obj
<< /Font << /F1 6 0 R >> >>
endobj
6 0 obj
<< /BaseFont /Courier /Subtype /Type1 /Type /Font >>
endobj
xref
0 7
0000000000 65535 f
0000000015 00000 n
0000000064 00000 n
0000000123 00000 n
0000000229 00000 n
0000000335 00000 n
0000000378 00000 n
trailer << /Root 1 0 R /Size 7 /ID [<89311a609a751f1666063e6962e79bd5><89311a609a751f1666063e6962e79bd5>] >>
startxref
448
%%EOF
I can only assume my Unicode hex string <FEFF004821260065006C006C006F> is badly formatted.
Or is the Font definition incorrect?
Or is my understanding of how to embed Unicode wrong?
I'm ultimately not wanting to embed any fonts as I don't have the storage space or processing power. I just want to add Unicode characters and rely on the PDF renderer to work out how to display them using the default Courier font.
Is that even possible?
Thanks in advance for any help/advice/comments.
UPDATE
After some useful advice below, I've now managed to achieve what I needed.
I modded my code to switch fonts on a per-character basis between Courier and Symbol and now support (nearly) all the standard characters.
I also added some character scaling to keep the Symbol characters aligned with the Courier font but the end result works for me :)
Here's an image of my test PDF ...
Oddly, the original IBM PC code page 437 included Ω (U+03A9, at position 234), but it did not make it into Courier.
You could encode the few characters you need in an embedded, subsetted symbol font, quite possibly using 7-bit ASCII or 8-bit ANSI codes, but the overhead would be tremendous for so few characters.
It is simpler to switch to the Symbol font for just the characters that need it. It could look like this:
P.S. The codes don't need to be two-byte "words"; there are only 256 character codes.
<< /BaseFont /Symbol /Subtype /Type1 /Type /Font >>
BT /F2 24 Tf 175 720 Td <4857657C7C6F20766FC27C64> Tj ET
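To see why the original hex string showed up as "H!&ello", it helps to read it the way a simple one-byte font does: every byte is its own character code, so the UTF-16 byte order mark and the zero high bytes are not skipped as part of an encoding. The short Python sketch below does that; the function name is made up for illustration, and dropping the non-printable codes is an assumption about how the viewer's Courier treats them.

```python
def simple_font_view(hex_string):
    """Show how a one-byte-per-code font sees a PDF hex string."""
    data = bytes.fromhex(hex_string)
    # Keep only printable ASCII; other codes (0x00, 0xFE, 0xFF here) are
    # assumed to have no visible glyph in the viewer's Courier.
    return "".join(chr(b) for b in data if 0x20 <= b <= 0x7E)

print(simple_font_view("FEFF004821260065006C006C006F"))  # H!&ello
```

The two bytes of U+2126 (0x21 and 0x26) are simply the codes for "!" and "&".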
By alternating Courier and Symbol you will get your desired result.
In your code it could look something like this (with transforms included):
BT
/F0 24 Tf 1 0 0 1 0 .0675 Tm (H) Tj
ET
BT
/F1 24 Tf 1 0 0 1 14.4 .0675 Tm <003a> Tj
ET
BT
/F0 24 Tf 1 0 0 1 32.832 .0675 Tm (ello) Tj
ET
Note that my editor used F0 for Courier and F1 for Symbol (numbering from 0 is more common).
It also used a slightly different way of encoding Omega, as <003a>.
Here I am tweaking the text in Windows Notepad and saving (Ctrl+S) to watch the Omega character slide sideways live in the previewer as its spacing changes. Also note that uppercase Omega is W in the raw Symbol font!
So my replacement fix for your code looks like this (you can easily make it look closer to yours, and leaner, by removing white space and line feeds):
%PDF-1.4
%µ¶
1 0 obj
<<
/Pages 2 0 R
/Type /Catalog
>>
endobj
2 0 obj
<<
/Count 1
/Kids [ 3 0 R ]
/Type /Pages
>>
endobj
3 0 obj
<<
/Contents 4 0 R
/MediaBox [ 0 0 500 800 ]
/Parent 2 0 R
/Resources <<
/Font <<
/F1 5 0 R
/F2 6 0 R
>>
>>
/Type /Page
>>
endobj
4 0 obj
<<
/Length 133
>>
stream
q
BT
/F1 24 Tf
1 0 0 1 175 720 Tm
(H) Tj
ET
BT
/F2 24 Tf
1 0 0 1 189 720 Tm
(W) Tj
ET
BT
/F1 24 Tf
1 0 0 1 206 720 Tm
(ello) Tj
ET
Q
endstream
endobj
5 0 obj
<<
/BaseFont /Courier
/Subtype /Type1
/Type /Font
>>
endobj
6 0 obj
<<
/BaseFont /Symbol
/Subtype /Type1
/Type /Font
>>
endobj
xref
0 7
0000000000 65535 f
0000000016 00000 n
0000000070 00000 n
0000000136 00000 n
0000000307 00000 n
0000000494 00000 n
0000000569 00000 n
trailer
<<
/Size 7
/Root 1 0 R
/ID [ <89311A609A751F1666063E6962E79BD5> <EE408A115072E92E3A34C8BB8BDC6AE6> ]
>>
startxref
643
%%EOF
You cannot do it.
Note: you want to insert a Unicode character (not "a UTF-16"; UTF-16 is just one of many encodings of Unicode).
No font includes all glyphs, and as far as I know only a few Latin-1 fonts are safe (and required) for PDF. Such fonts require a Latin-1 encoding (unlike all other fonts; this is a portability holdover from the pre-Unicode era). An additional problem: Type1 uses glyph indices, which may not be the same as Unicode code points (in fact, I think they are always different). IIRC Adobe has some documentation about this. And Type1 is nearly out of support, so it may be better not to use it in programs written in 2021.
You may assume people will have Microsoft Windows, so you can use the Symbol font (using Omega instead of Ohm, which may be represented by the same glyph). But in that case you are creating a "non-portable" Portable Document Format (PDF) file.

PDF generation — How to merge multiple stream objects?

I'm currently generating PDF documents without the use of an external library, and it has been going well so far. I wrote the document shown below with a text editor (vim), and it renders as expected in at least two distinct PDF viewers (evince and gsview, running on Linux).
This document produces three squares at top of the page, coming in different sizes, widths and colors.
My question is: is there a way to merge two stream objects into a single new one? In other words, is there a way to compose sophisticated objects from simple ones, so that we can easily refer to these composite objects, multiple times if needed?
In the given example, object 5 0 obj is drawing a square, and following ones are just applying colors and coordinates transformations (through a matrix).
The PDF reference manual states that multiple stream contents passed as an array to page object's /Contents parameter are concatenated and processed as a single continuous stream, which totally does the trick… as long as the document remains small and simple!
In this same example, the /Contents array is indirectly passed through object 4 0 obj, which refers three times to 5 0 R, to draw the squares.
The ideal here would be to define three different objects, each referring to 5 0 R by itself, and then invoke only these objects, once each, from the /Contents array.
I tried adding subarrays inside it, which could in turn be embedded into dedicated objects and referenced indirectly, but unfortunately that doesn't work. :-(
Many thanks to anyone who can help!
PS: I'm doing this because I'm interested in the format itself and would like to produce some autogenerated documents from small scripts. Also, I'll probably embed them into a low-powered appliance, and I cannot afford dozens of megabytes of dependencies.
Before this, I also tried doing it with PHP and TCPDF. If there are already facilities dedicated to this that I have missed, I'd be interested in those too. :-)
Small.pdf (hand made PDF file)
%PDF-1.7
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 3 0 R ]
>>
endobj
3 0 obj
<<
/Type /Page
/MediaBox [ 0.000000 0.000000 1000.000000 1414.213562 ]
/Contents 4 0 R
>>
endobj
4 0 obj
% A simple array, just to avoid embedding it directly in /Page object (3 0 R here)
[
6 0 R 5 0 R % Red square
7 0 R 5 0 R % Green square
8 0 R 5 0 R % Blue square (tilted)
]
endobj
5 0 obj
% Draws a square, centered by default on lower left corner
<<
/Length 43
>>
stream
+20 +20 m
+20 -20 l
-20 -20 l
-20 +20 l s Q
endstream
endobj
6 0 obj
<<
/Length 63
>>
stream
/DeviceRGB CS
q
1.0 0.0 0.0 SC
2.0 w
1 0 0 -1 60 1354.213562 cm
endstream
endobj
7 0 obj
<<
/Length 49
>>
stream
q
0.0 1.0 0.0 SC
1.0 w
2 0 0 -2 190 1334.213562 cm
endstream
endobj
8 0 obj
<<
/Length 83
>>
stream
q
0.0 0.0 1.0 SC
5.0 w
0.707106781 0.707106781 -0.707106781 0.707106781 110 1250 cm
endstream
endobj
xref
0 9
0000000000 65535 f
0000000010 00000 n
0000000079 00000 n
0000000168 00000 n
0000000296 00000 n
0000000513 00000 n
0000000674 00000 n
0000000796 00000 n
0000000905 00000 n
trailer
<<
/Size 9
/Root 1 0 R
/ID [ <0000000000> <0000000001> ]
>>
startxref
01047
%%EOF
What you are looking for are form XObjects.
The pdf specification ISO 32000-1 characterizes them like this:
A form XObject is a PDF content stream that is a self-contained description of any sequence of graphics objects. A form XObject may be painted multiple times - either on several pages or at several locations on the same page - and produces the same results each time, subject only to the graphics state at the time it is invoked.
For details please read section 8.10 of the specification.
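As a rough illustration of how the square from the question could become a form XObject, here is a Python sketch that serializes the form stream and a page content stream that paints it three times with the Do operator. The resource name /Sq, the /BBox values and the helper names are assumptions made for this example; object numbering, the xref table and the rest of the file are left out.

```python
def pdf_stream(dict_entries, body):
    """Serialize a stream object body; /Length is computed from the data."""
    data = body.encode("ascii")
    header = "<< %s /Length %d >>\nstream\n" % (dict_entries, len(data))
    return header.encode("ascii") + data + b"\nendstream"

# Form XObject: the square path from object 5, now self-contained.
# The /BBox is an assumption, chosen to leave room for a 5-unit stroke.
square_form = pdf_stream(
    "/Type /XObject /Subtype /Form /BBox [-25 -25 25 25]",
    "20 20 m 20 -20 l -20 -20 l -20 20 l s",
)

def paint(rgb, width, matrix, name="Sq"):
    """Content-stream fragment: set up graphics state, then paint the form."""
    return "q %s RG %s w %s cm /%s Do Q\n" % (rgb, width, matrix, name)

# One content stream replaces the six-element /Contents array.
page_content = pdf_stream(
    "",
    paint("1 0 0", "2", "1 0 0 -1 60 1354.213562")
    + paint("0 1 0", "1", "2 0 0 -2 190 1334.213562")
    + paint("0 0 1", "5",
            "0.707106781 0.707106781 -0.707106781 0.707106781 110 1250"),
)

# The page dictionary would additionally need (assuming the form is object 5):
#   /Resources << /XObject << /Sq 5 0 R >> >>
```

Each q ... Q pair isolates the color, line width and matrix for one square, so the form itself no longer needs the dangling Q that the original object 5 relied on.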

(Manually created Simple PDF using PDF Reference-1.7 )Adobe Reader XI asking to save when closing the PDF?

I have generated a PDF using the following PDF code. It works fine, but when I try to close it, Adobe Reader asks me to save. I analyzed my PDF code to find the problem and identified an issue with the startxref offset and the xref offset positions. I have made many changes, but I couldn't solve the problem (Do you want to save changes 'xxx.pdf' before closing).
Here is my PDF code:
%PDF-1.4
%âãÏÓ
1 0 obj
<<
/Type/Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type/Pages
/MediaBox[0 0 612.0 792.0]
/Count 1
/Kids [ 3 0 R ]
>>
endobj
3 0 obj
<<
/Type/Page
/Parent 2 0 R
/Resources 4 0 R
/Contents 5 0 R
>>
endobj
4 0 obj
<<
/ExtGState <</GS1 7 0 R>>
/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]
/Font<< /F1 8 0 R >>
>>
>>
endobj
5 0 obj
<</Length 44>>
stream
BT
/F1 18 Tf
0 g
1 0 0 1 100.0 400.0 Tm
(kersom) Tj
ET
endstream
endobj
6 0 obj<</Producer(Xxxxxxxx XXX Xxxxxxxx - 1.1)>>
endobj
7 0 obj
<</ca 0.35/CA 0.35>>
endobj
8 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
>>
endobj
xref
0 9
0000000000 65535 f
0000000015 00000 n
0000000063 00000 n
0000000148 00000 n
0000000228 00000 n
0000000340 00000 n
0000000442 00000 n
0000000499 00000 n
0000000535 00000 n
trailer
<<
/Info 6 0 R
/Root 1 0 R
/Size 9
>>
startxref
606
%%EOF
Having received the sample PDF in its original form, the issue immediately becomes clear: The offsets in the cross reference table are correct but that table itself is incorrectly built.
Let's look at a hex dump:
Obviously each entry in the cross reference table is 19 bytes in size.
Now let's look at the PDF specification:
Each entry shall be exactly 20 bytes long, including the end-of-line marker. [...] The format of an in-use entry shall be:
nnnnnnnnnn ggggg n eol
where:
nnnnnnnnnn shall be a 10-digit byte offset in the decoded stream
ggggg shall be a 5-digit generation number
n shall be a keyword identifying this as an in-use entry
eol shall be a 2-character end-of-line sequence
[...] a 2-character end-of-line sequence consisting of one of the following: SP CR, SP LF, or CR LF. Thus, the overall length of the entry shall always be exactly 20 bytes
(section 7.5.4 Cross-Reference Table of ISO 32000-1)
Thus, the issue in the OP's PDF is that each cross reference table entry has an end-of-line sequence of only one byte, a LF, while it must have a 2-byte end-of-line sequence, either SP CR, SP LF, or CR LF.
This makes each entry one byte too short which in turn results in look-ups from that table returning utterly broken byte sequences.
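A generator can avoid this by formatting every entry through one helper that enforces the 20-byte rule. The following Python sketch shows the idea; the function names are made up for illustration, and it uses "SP LF" as the end-of-line sequence, which is one of the three variants the specification allows.

```python
def xref_entry(offset, generation=0, in_use=True):
    """One cross-reference entry: always exactly 20 bytes."""
    kind = b"n" if in_use else b"f"
    # "SP LF" is one of the three permitted 2-byte end-of-line sequences.
    entry = b"%010d %05d %s \n" % (offset, generation, kind)
    assert len(entry) == 20
    return entry

def xref_table(offsets):
    """Build an xref section for objects 1..n at the given byte offsets."""
    lines = [b"xref\n", b"0 %d\n" % (len(offsets) + 1)]
    lines.append(xref_entry(0, 65535, in_use=False))  # head of the free list
    lines.extend(xref_entry(off) for off in offsets)
    return b"".join(lines)
```

The offsets passed in would be recorded while writing each object, for example by noting the output length just before every "N 0 obj" line.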
Save the form with Adobe Reader and compare it at a binary level. You will discover a slight difference. For instance: the cross-reference table was rebuilt because you didn't take into account 'carriage return' characters, there was white space where you didn't expect it, etc...
Adobe Reader also fixes errors such as this one:
4 0 obj
<<
/ExtGState <</GS1 7 0 R>>
/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]
/Font<< /F1 8 0 R >>
>>
>>
You have a doubled dictionary ending here (remove one >>). That's at least one error in the PDF you've copy/pasted.
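A crude way to catch this kind of slip in hand-written objects is to count the dictionary delimiters. The Python sketch below does only that; it is an illustrative helper, not a PDF parser, and it ignores strings, comments and stream data, where these character pairs may legitimately appear.

```python
def dict_balance(text):
    """Return (#'<<' - #'>>'); 0 means balanced, negative means extra '>>'.

    A rough check only: it ignores strings, comments and stream data.
    """
    return text.count("<<") - text.count(">>")

obj4 = """<<
/ExtGState <</GS1 7 0 R>>
/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]
/Font<< /F1 8 0 R >>
>>
>>"""
print(dict_balance(obj4))  # -1: one '>>' too many
```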

PDF document is modified by another revision?

I use PDFBox to sign PDFs. It works very well: I can add several signatures to one document without problems.
Now someone has sent me a document signed with other software, and that signature validated too. But when I add another revision (with PDFBox) to that document, Adobe Reader tells me that the PDF was modified.
This is the original document: this
This is the signed document, which was produced by the other software: link
When I add another revision to the signed PDF, I get this document, which has problems: link
If I add another revision to a PDF that was signed by my software, there is no problem: link
In Short:
Your code applies unnecessary changes to existing PDF objects.
Some changes are merely structural and do not change the actual content. Acrobat Reader might or might not ignore those structural changes. But in the process you introduce rounding errors, and they definitely constitute a change.
The structural changes are probably caused by a quirk of PDFBox: it forces its own preference for which kinds of objects should be direct or indirect onto existing objects it touches.
And the rounding errors, while hardly relevant in practice, are definitely a no-go where security features are concerned.
When you sign a document twice with PDFBox, the initial signing process already forces PDFBox' preferences into the document and, thus, the second signing process does not destroy anything by again forcing the same preferences into its result.
The Details:
The original from original-signed - old.pdf:
3 0 obj
<<
/DefaultGray 11 0 R
/Type/Catalog
/DefaultRGB 12 0 R
/AcroForm
<<
/Fields[15 0 R]
/DR<</Font<</Helv 16 0 R/ZaDb 17 0 R>>>>
/DA(/Helv 0 Tf 0 g )
/SigFlags 3
>>
/Pages 5 0 R>>
endobj
11 0 obj
[
/CalGray
<<
/WhitePoint [0.9505 1 1.0891 ]
/Gamma 0.2468
>>
]
endobj
12 0 obj
[
/CalRGB
<<
/WhitePoint [0.9505 1 1.0891 ]
/Gamma [0.2468 0.2468 0.2468 ]
/Matrix [0.4361 0.2225 0.0139 0.3851 0.7169 0.0971 0.1431 0.0606 0.7141 ]
>>
]
endobj
Your re-signed original-signed-signed -old new.pdf
3 0 obj
<<
/DefaultGray [/CalGray 18 0 R]
/Type /Catalog
/DefaultRGB [/CalRGB 19 0 R]
/AcroForm
<<
/Fields [15 0 R 20 0 R]
/DA (/Helv 0 Tf 0 g )
/SigFlags 3
>>
/Pages 5 0 R
>>
endobj
18 0 obj
<<
/WhitePoint [0.9505000114 1 1.0891000032]
/Gamma 0.2468000054
>>
endobj
19 0 obj
<<
/WhitePoint [0.9505000114 1 1.0891000032]
/Gamma [0.2468000054 0.2468000054 0.2468000054]
/Matrix [0.4361000061 0.2224999964 0.0138999997 0.3851000071 0.716899991 0.0970999971 0.1430999935 0.0606000014 0.7141000032]
>>
endobj
So in essence your code changed indirect arrays (objects 11 and 12) of direct dictionaries into direct arrays (in your new object 3) of indirect dictionaries (your new objects 18 and 19). This is unnecessary and, therefore, there is no need for Adobe Reader to accept it. But it probably would accept this (I don't know, one has to check) if the replacements were identical.
But they indeed are not identical! Your code introduces rounding errors in these color definitions. And, therefore, it changes the content.
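The changed digits are consistent with the numbers having passed through a single-precision (32-bit) float on the way. That is an inference from the values, not something stated above, but it is easy to reproduce with a small Python sketch that round-trips a few of the original values through a 4-byte float:

```python
import struct

def via_float32(x):
    """Round-trip a Python float (double) through a 4-byte IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

for original in (0.9505, 1.0891, 0.2468):
    # Print with ten decimals, as in the re-signed file.
    print(original, "->", "%.10f" % via_float32(original))
```

The printed values match the /WhitePoint and /Gamma entries in the new objects 18 and 19, which is why the rewritten color spaces are no longer byte-for-byte or value-for-value the same as the signed originals.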
Additionally your code also introduces structural changes to
4 0 obj
<<
/Parent 5 0 R
/Contents 9 0 R
/Type/Page
/Resources<</ProcSet 2 0 R/Font<</F0 6 0 R/F1 7 0 R>>>>
/MediaBox[0 0 612 792]
/Annots[15 0 R]
>>
endobj
2 0 obj
[ /PDF /Text ]
endobj
which you change to
4 0 obj
<<
/Parent 5 0 R
/Contents 9 0 R
/Type /Page
/Resources<</ProcSet [/PDF /Text] /Font 23 0 R >>
/MediaBox [0 0 612 792]
/Annots [15 0 R 20 0 R]
>>
endobj
23 0 obj
<<
/F0 6 0 R
/F1 7 0 R
>>
endobj
Here you change an indirect array of names into a direct one and a direct dictionary into an indirect one.

Offset values in cross reference table

I am new to PDF creation and I don't understand an issue with the offset values in the cross-reference table.
This is a very basic PDF:
%PDF-1.5
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj<</Type /Page /Parent 2 0 R /Resources 4 0 R /MediaBox [0 0 500 700] /Contents 6 0 R>>
endobj
4 0 obj<</Font <</F1 5 0 R>>>>
endobj
5 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
6 0 obj
<</Length 44>>
stream
BT /F1 24 Tf 100 100 Td (This is test)Tj ET
endstream
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000056 00000 n
0000000111 00000 n
0000000212 00000 n
0000000250 00000 n
0000000317 00000 n
trailer <</Size 7/Root 1 0 R>>
startxref
406
%%EOF
No matter what values I set in the cross-reference table, the PDF still opens without any error.
Why?
In general PDF viewers try to be quite lax and correct errors without mentioning it. There simply are very many broken PDFs in the wild which the respective producer claims to be correct, and PDF viewers don't want to argue on that with those producers.
BTW, I just copied your data into an editor, saved it as a PDF, and opened that file in Adobe Reader. It did not complain while opening, but while closing it asks whether it should save the changes. These "changes" are the aforementioned corrections done under the hood.
Having repaired the original, Adobe Reader uses its preferred way to construct PDFs, so the document structure may look very different...
If you want to check a PDF file for correctness, you should look for a preflight tool (e.g. the one included with Adobe Acrobat).
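For hand-written files like the one in the question, a quick self-check of the offsets is also possible before reaching for a preflight tool. The Python sketch below is only a rough aid under stated assumptions: it expects a single classic xref table with one subsection starting at object 0, no incremental updates or xref streams, and object headers of the form "N G obj" at the start of a line. The function name check_xref is made up for this example.

```python
import re

def check_xref(pdf_bytes):
    """Compare classic xref offsets with where 'N G obj' really starts.

    Returns a list of (object number, listed offset, actual offset) for
    every in-use entry that does not match. Actual offset is None when
    no such object header was found.
    """
    # Byte position of every object header found at the start of a line.
    actual = {int(m.group(1)): m.start(1)
              for m in re.finditer(rb"(?m)^(\d+) (\d+) obj", pdf_bytes)}
    table = re.search(rb"(?m)^xref\s+0\s+(\d+)\s+", pdf_bytes)
    if table is None:
        raise ValueError("no classic xref table found")
    count = int(table.group(1))
    entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", pdf_bytes[table.end():])
    problems = []
    for num, (offset, _gen, kind) in enumerate(entries[:count]):
        if kind == b"n" and actual.get(num) != int(offset):
            problems.append((num, int(offset), actual.get(num)))
    return problems
```

Reading the file with open(path, "rb").read() and passing the bytes to check_xref gives an empty list when every listed offset points at its object header.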