PostScript - Preserve internal hyperlinks in PDF

With my original question, "ps2pdf - Unable to open initial device", thankfully answered by @KenS, I ran into another problem: my internal hyperlinks (e.g. "see Figure 1") are lost when converting my PDF using gswin64. This is my command:
gswin64 -dPDFSETTINGS=/ebook -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o output.pdf input.pdf
I uploaded a minimal example of the original PDF here (will be deleted after 2 weeks) and the converted version here. I found this answer, also from @KenS, on the possibility of extracting or preserving the links, but it just says "some PostScript programming". Is there another or "simple" way of achieving this? I found online PDF converters which are able to do so, so there must be a way.

The 'some PostScript programming' refers to extracting the Link information. As that answer says, the PDF interpreter already does this for the benefit of the pdfwrite device.
Your problem is that the Link annotation uses a Named Destination:
30 0 obj
<<
/Type /Annot
/Subtype /Link
/Border [ 0 0 1 ]
/H /I
/C [ 1 0 0 ]
/Rect [ 387.470001 700.413025 394.916992 713.314026 ]
/A <<
/S /GoTo
/D (figure.caption.1)
>>
>>
endobj
The Names tree contains :
52 0 obj
<<
/Names [ (Doc-Start) 34 0 R (figure.caption.1) 36 0 R (page.1)
33 0 R ]
/Limits [ (Doc-Start) (page.1) ]
>>
endobj
The named destination figure.caption.1 points to object 36:
36 0 obj
<<
/D [ 29 0 R /XYZ 117.828003 696.228027 null ]
>>
endobj
Now that could instead have been written much more simply by putting the content of object 36 in place of the figure.caption.1 in the original Destination, eg:
30 0 obj
<<
/Type /Annot
/Subtype /Link
/Border [ 0 0 1 ]
/H /I
/C [ 1 0 0 ]
/Rect [ 387.470001 700.413025 394.916992 713.314026 ]
/A <<
/S /GoTo
/D [ 29 0 R /XYZ 117.828003 696.228027 null ]
>>
>>
endobj
I think that the latter, simpler construct would work, but indirection through the Names tree does not. I think this is because the pdfwrite device doesn't preserve the Names tree, so it can't preserve any links which rely on the Names tree.
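To make the indirection concrete, here is a minimal sketch in Python of what "putting the content of object 36 in place of the name" amounts to. The dictionaries are a hypothetical in-memory model of the objects quoted above (not the API of any PDF library), and the helper names lookup_named_dest and inline_destination are made up for illustration.

```python
def lookup_named_dest(names, name):
    """Search a flat [key, value, key, value, ...] list, laid out like the
    /Names array of a names tree leaf, and return the value for `name`."""
    for key, value in zip(names[0::2], names[1::2]):
        if key == name:
            return value
    return None


def inline_destination(annot, names, objects):
    """Return a copy of a Link annotation whose GoTo action carries an
    explicit destination array instead of a named destination."""
    action = annot.get("A", {})
    dest = action.get("D")
    if action.get("S") == "GoTo" and isinstance(dest, str):
        ref = lookup_named_dest(names, dest)
        if ref is not None:
            explicit = objects[ref]["D"]
            return dict(annot, A=dict(action, D=explicit))
    return annot  # nothing to resolve, leave the annotation alone


# Hypothetical model of objects 30, 52 and 36 from the file above.
link = {"Type": "Annot", "Subtype": "Link",
        "A": {"S": "GoTo", "D": "figure.caption.1"}}
names = ["Doc-Start", 34, "figure.caption.1", 36, "page.1", 33]
objects = {36: {"D": ["29 0 R", "XYZ", 117.828003, 696.228027, None]}}

resolved = inline_destination(link, names, objects)
print(resolved["A"]["D"])
```

The original annotation is left untouched, and a name that is missing from the list simply leaves the link as it was.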
In fact, I'm not convinced the current code preserves Link annotations at all, which it should, so I'm looking at that now. I'll edit this answer when I know more.
[EDIT]
OK so this is a wrinkle I had forgotten....
The PDF interpreter has to treat annotations in two different ways, depending on whether the PDF is being printed or not. See the PDF 1.7 Reference, section 8.4.2 Annotation Flags, bit position 3.
If the file is being 'Printed' then there is no point in preserving Link annotations (how on earth would you click a link on the printed output ?).
So when Printed is true, which is the default value, then the PDF interpreter doesn't preserve certain kinds of annotations. You can alter this quite easily by setting -dPrinted=false on the command line.
NOTE Some annotations have the 'Print' flag set, which is what this is all about. If you set Printed to 'false' then annotations which have the 'Print' flag set will not be preserved. If you set Printed to true then those annotations will be preserved, but annotations which have the Print flag set to 0 will not be preserved. There is currently no way to have the PDF interpreter preserve both annotations with Print true and ones with Print false. This is likely to be changed in a future release because people do ask for it.
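For reference, the 'Print' flag is just one bit of the integer stored under /F in the annotation dictionary. A small Python sketch of the test, assuming the spec's convention that bit position 1 is the low-order bit (the function name is my own):

```python
PRINT_BIT_POSITION = 3  # from the Annotation Flags table; position 1 is the low-order bit


def has_print_flag(f_value):
    """Return True if the Print bit is set in an annotation's /F value."""
    return bool(f_value & (1 << (PRINT_BIT_POSITION - 1)))


# The Link annotation in the question has no /F entry, so its flags default to 0.
print(has_print_flag(0))  # False
print(has_print_flag(4))  # True
```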
If you set -dPrinted=false, your Link annotation will be preserved. I should note that it will not be the same construction as was in your original PDF file. It will use the simpler construction where the destination is explicitly stated in the Link annotation itself, rather than indirecting through the Names tree.
The effect is the same, but it's an example of the kind of thing which is described in the documentation. I presume this won't be a problem for you though.
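Put together, the command from the question only needs the one extra switch. If you drive Ghostscript from a script, a sketch like the following builds the argument list. The function name and its defaults are my own; the switches are the ones from the question plus -dPrinted=false.

```python
def build_gs_command(src, dst, printed=False, exe="gswin64"):
    """Build the Ghostscript argument list used in the question,
    with -dPrinted added so Link annotations are kept."""
    return [
        exe,
        "-dPDFSETTINGS=/ebook",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPrinted=" + ("true" if printed else "false"),
        "-o", dst,
        src,
    ]


cmd = build_gs_command("input.pdf", "output.pdf")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```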
Given the way the original file is constructed, I'm not surprised that the pdfwrite output is smaller! For some reason this file contains eight Forms, eight shadings and two colour spaces (one of which is empty) none of which appear to actually be used....

Related

PDF Signature: "Expected a dict object"

I'm creating a library for digitally signing a PDF document. During my quest I stumbled upon another problem.
In Acrobat I'm getting the error:
Error during signature verification.
Adobe Acrobat error.
Expected a dict object.
I know it expects a dictionary object somewhere. But I have no idea where.
This problem shows up when I add the image to the AP of the signature.
For this I'm basing my implementation on the spec, and on "Insert multiple digital approval signatures without invalidating the previous one".
Most of this seems to work correctly, but when the image is present it results in the error. The image is correctly visible.
Current working:
(This is a very short overview of the part where the error is, it might be slightly different, but hope this helps)
I update the signature annotation. Add link to object that contains normal appearance.
16 0 obj
<<
/Type/Annot
/Subtype/Widget
...snip...
/AP<<
/N 21 0 R
>>
>>
Add image as XObject
20 0 obj
<<
/Type/XObject
/Subtype/Image
...snip...
/Length 29569
>>
stream
...snip...
endstream
endobj
Add XObject (Normal appearance)
21 0 obj
<<
/Type/XObject
/Subtype/Form
/Resources<<
/XObject<<
/UserSignature272 20 0 R
>>
>>
/BBox[0 0 135 37.5]
/Length 44
>>stream
q
135 0 0 37.5 0 0 cm
/UserSignature272 Do
Q
endstream
endobj
I think the problem happens somewhere in obj (21 0), but I'm not sure.
Here is a minimal file that can be used for testing.
https://drive.google.com/file/d/17sdz2xJy3VhN6i9YiuPrJ6x2s5kU2sra/view?usp=sharing
Any help, or hints would be welcome.
(This post is a continuation of PDF Digital Signature has "Bad parameter" in Acrobat, but is about a different problem, same subject area.)

You're running into a bug of Adobe Acrobat here: If you display an XObject from inside your signature appearance stream, it expects that XObject to have a Resources entry. This may make sense in the case of form XObjects, but it doesn't for image XObjects like in your case.
A workaround is to add an empty Resources dictionary to your image XObject.
I checked this by replacing the /BBox[1 0 0 1 0 0] in your image XObject (which is not needed there anyway) by /Resources<< >>.
When Adobe Acrobat creates its own signature appearances, it creates a hierarchy of form XObjects here, with Resources dictionaries all over, including those for the "layers". I assume Adobe Reader, seeing the Do operator, attempts to collect information on such "layers", not expecting to immediately be confronted with an image XObject.

How to correctly declare FontSet in PSD EngineData?

I'm trying to generate a PSD file in my application, with some text layers (TySh). Its EngineData format is pretty simple, but unfortunately it has no documentation, and I got stuck on the FontSet field:
/FontSet [
<<
/Name (ADomIno)
/Script 8
/FontType 1
/Synthetic 3
>>
<<
/Name (ADomIno)
/Script 8
/FontType 1
/Synthetic 0
>>
<<
/Name (AdobeInvisFont)
/Script 0
/FontType 0
/Synthetic 0
>>
<<
/Name (MyriadPro-Regular)
/Script 0
/FontType 0
/Synthetic 0
>>
]
This is Photoshop-generated data; I only omitted the UTF-16 encoding for easier reading.
So... I can't understand why the font a_DomIno is written as ADomIno, why some fonts have a "-Regular" suffix but others do not, what the "Script", "FontType" and "Synthetic" fields mean, or why some fonts have two records with different fields while others have only one.
There's no info in the Adobe PSD format documentation, the Photoshop Scripting Reference or the Photoshop SDK. Projects like psd.rb or psd.js aim at parsing the file and have no useful info either.
Maybe someone knows?

How do I crop pages 3&4 in a multipage pdf using ghostscript

I would like to crop just some pages in a multipage pdf keeping all pages, some cropped, others not. I tried the following but it "deletes" the non cropped pages...
gswin64.exe -o cropped.pdf -sDEVICE=pdfwrite -dFirstPage=3 -dLastPage=4 -c "[/CropBox [24 72 1559 1794]" -c " /PAGES pdfmark" -f input.pdf
I've seen the posts on different cropping on odd and even pages, but I could not figure out how to apply this to a certain page in a multipage document.
gswin64.exe -o cropped.pdf -sDEVICE=pdfwrite -c "<</EndPage {0 eq {2 mod 0 eq {[/CropBox [0 0 1612 1792] /PAGE pdfmark true}{[/CropBox [500 500 612 792] /PAGE pdfmark true} ifelse}{false}ifelse}>> setpagedevice" -f input.pdf
This does crop all pages according to the settings of the second CropBox. If anybody is wondering about the large margins... I apply this to large drawings.
I have also tried to substitute some operators to only apply the crop to a certain page number: "sub 4" instead of "2 mod" was one attempt to attain the " 0 eq" condition only when the current page number reaches 4.

OK first things first, Ghostscript and the pdfwrite device do not 'modify' an input PDF file. For regular readers; standard lecture here, if you've read it before you can skip the following paragraph.
The way this works is that the input file is completely interpreted into a sequence of graphics primitives which are sent to the device. Rendering devices then call the graphics library to render the primitives to a bitmap, which is then output. High level (vector) devices, such as pdfwrite, translate the primitives into equivalent operations in some high level page description language, and emit that.
So, when you select -dFirstPage and -dLastPage, those are only pages for the input file you are choosing to process. So pdfwrite isn't 'deleting' your pages, you never sent them to the device in the first place.
Now, Ghostscript is a PostScript interpreter, and therefore its action can be affected by writing PostScript programs. In your case you probably want to actually process all the pages (so drop -dFirstPage and -dLastPage), but only write the pdfmark on selected pages.
The way to do this is via a BeginPage or EndPage procedure. If you search here or in the PostScript tag you'll find a number of examples. Fundamentally, both procedures are called with a count of pages so far, and EndPage additionally gets a reason code.
From memory you will want to check the reason code is 0 (the page is being emitted by showpage). If it is, then you want to check the count of pages, and if it matches your criteria (in the case here, count is 3 or 4), execute the /PAGE pdfmark. In any case you want to return 'true' so that the page is emitted.
[EDIT added here]
Hmm, OK I see the problem. What's happening is that the PDF interpreter is calling 'setpagedevice' to set the page size for each page, in case the page size has altered. The problem is that this resets the page count back to 0 each time.
Now, I wouldn't normally suggest the following, because it relies on some undocumented aspects of Ghostscript's PDF interpreter. However, I happen to know that the PDF interpreter tracks the page number internally using a named object called /Page#.
So, if I take the code you wrote, and modify it slightly:
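<<
  /EndPage {
    % Operand stack on entry: pagecount reason
    0 eq {                       % reason 0: showpage
      pop                        % discard the (reset) page count
      /Page# where {
        /Page# get
        3 eq {
          (page 3) == flush
          [/CropBox [0 0 1612 1792] /PAGE pdfmark
          true
        } {
          (not page 3) == flush
          [/CropBox [500 500 612 792] /PAGE pdfmark
          true
        } ifelse
      } {
        true                     % /Page# not defined: leave the page alone
      } ifelse
    } {
      pop false                  % other reason codes: discard the page count
    } ifelse
  }
>> setpagedevice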
Couple of things to note; there's some debug in there, the lines with '== flush' print out some stuff on the back channel so you know how each page is being handled. If /Page# isn't defined, then the code simply leaves everything alone, this is just some basic safety-first stuff.
Rather than type all this on the command line (which also loses indenting and is hard to read) I stuck it in a file, called test.ps, then invoked GS as:
gswin32c -sDEVICE=pdfwrite -sOutputFile=out.pdf test.ps input.pdf
It's not the neatest solution in the world, but it works for me.
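If you need this for an arbitrary set of pages (3 and 4 in the question), you can generate the PostScript file rather than editing it by hand. The following Python sketch writes an EndPage procedure that applies the CropBox only to the listed pages and leaves every other page untouched. It relies on the same undocumented /Page# name as above, so treat it as an assumption that may not hold for other Ghostscript versions; the function name make_crop_script is my own.

```python
def make_crop_script(pages, cropbox):
    """Return PostScript text that installs an EndPage procedure which
    emits a /PAGE pdfmark with the given CropBox on the listed pages only."""
    # Each test ORs "current page == p" into a running boolean on the stack.
    tests = " ".join("1 index %d eq or" % p for p in sorted(pages))
    box = " ".join(str(v) for v in cropbox)
    return f"""<<
  /EndPage {{
    0 eq {{
      pop
      /Page# where {{
        /Page# get
        false {tests} exch pop
        {{ [/CropBox [{box}] /PAGE pdfmark }} if
      }} if
      true
    }} {{
      pop false
    }} ifelse
  }}
>> setpagedevice
"""


script = make_crop_script({3, 4}, (24, 72, 1559, 1794))
print(script)
# Save it as test.ps and run Ghostscript as shown above.
```

The generated procedure returns true for every page emitted by showpage, so no pages are dropped, and only pages 3 and 4 receive the pdfmark.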

PostScript PDF (1.7), manually writing code

I'm trying to manually write a simple PDF file that contains a title, some text, and an image. I found one example of a manually written "Hello world" and managed to change some things, but I can't get it working for another text object. I have looked for help on the internet but with no luck; I guess not many people write their own PDF files.
This is what I have so far:
%PDF-1.7
1 0 obj % entry point
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/MediaBox [ 0 0 200 200 ]
/Count 1
/Kids [ 3 0 R ]
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources <<
/Font <<
/F1 4 0 R
>>
>>
/Contents 4 0 R
>>
endobj
4 0 obj % page content
<<
/Length 20
>>
stream
BT
80 180 TD
/F1 14 Tf
(PDF) Tj
ET
endstream
endobj
5 0 obj % page content
<<
/Length 20
>>
stream
BT
50 70 TD
/F1 14 Tf
(this is a pdf) Tj
ET
endstream
endobj
trailer
<<
/Size 6
/Root 1 0 R
>>
startxref
492
%%EOF
I have tried adding another text object with the text "this is a pdf" but it won't show up. I don't know what could be wrong; I tried changing a few things but with no luck. I don't have the image part either, so some help with that would be nice.
This is a wiki about the "hello world" pdf I found:
http://www.gnupdf.org/Introduction_to_PDF
Adobe offers some explanation of how PDF works, but I can't find anything that would fix my problem:
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

This is not a valid PDF. If Acrobat opens it at all it's because it's given up on the xref table and done a full scan of the file, but your PDF is invalid. 4 0 obj is not a font, as you specified, and 5 0 obj is not accessed from anywhere.
PDF specification requires an xref table which points to the exact position in the file for each object. You can't realistically write this by hand unless you intend to manually update the entire xref table every time you add or remove even 1 byte from the file.
You can write a PDF from scratch like this from code easily enough but it will not work to just open a PDF in notepad and start changing things because the index (xref) immediately becomes corrupt.
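As a rough illustration of "from code", here is a minimal Python sketch that assembles a file like the one in the question and computes the xref offsets as it goes. It also gives the page a real font object and lists both content streams under /Contents. The helper names (stream, build_pdf) and the choice of Helvetica are my own assumptions, not anything taken from the question.

```python
def stream(data):
    """Wrap raw content bytes in a stream object body with a correct /Length."""
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"


def build_pdf(bodies):
    """Number the bodies 1..n, write them out, and record each byte offset
    so the xref table and startxref value are exact."""
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte position of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    size = len(bodies) + 1  # plus the free entry for object 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


bodies = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /MediaBox [0 0 200 200] /Count 1 /Kids [3 0 R] >>",
    b"<< /Type /Page /Parent 2 0 R /Resources << /Font << /F1 4 0 R >> >>"
    b" /Contents [5 0 R 6 0 R] >>",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    stream(b"BT /F1 14 Tf 80 180 Td (PDF) Tj ET"),
    stream(b"BT /F1 14 Tf 50 70 Td (this is a pdf) Tj ET"),
]
pdf = build_pdf(bodies)

# To write it out: open("hello.pdf", "wb").write(pdf)
```

Because the offsets are taken from the buffer length at the moment each object is written, adding or changing an object never leaves the xref table stale.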
I'd also advise against putting comments throughout the file unless the comments start on new lines. Otherwise some PDF parsers will get confused as this is generally not expected. Usually PDF files do not contain comments (with the exception of the second line, which is recommended by Adobe to be a comment of some non-ASCII characters so FTP recognizes the file as binary) seeing as they are virtually impossible to write manually anyway.
http://www.adobe.com/devnet/pdf/pdf_reference.html

A few years ago, I wrote a book which covers exactly this sort of thing:
http://www.amazon.com/PDF-Explained-John-Whitington/dp/1449310028/
No free online version, I'm afraid. You can get all the same information from Adobe's own documentation, which is free, but it's a rather long document!

pdf, appearance streams, checkbox not shown correctly after focus lost

I'm working on a program generating interactive forms into PDF files.
The generated file is here (source is readable). The checkbox is at the bottom of the page. After it gets focus it is rendered correctly (white square with red/blue border); after it loses focus the square disappears and the default appearance is shown (that's incorrect for me).
This happens in Acrobat 9, X and XI.
In the built-in Chrome PDF viewer it works fine.
Adobe XI Pro preflight shows the warning "Form field has multiple appearances".
I can not find the mistake.
Thanks for your help.
The same (or a similar) problem is described here:
http://forums.adobe.com/message/5144579#5144579
---- here is the part of the PDF file where I expect the mistake
2 0 obj
<<
/Type /Catalog
/Pages 1 0 R
/OutputIntents [7 0 R]
/Metadata 8 0 R
/PageLabels 10 0 R
/AcroForm 14 0 R
>>
endobj
14 0 obj
<<
/Fields [13 0 R]
>>
endobj
13 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [20.0 20.0 120.0 120.0]
/FT /Btn
/F 4
/T (name)
/AS /Yes
/V /Yes
/AP <<
/N <<
/Yes 11 0 R
/Off 12 0 R >>
>>
>>
endobj
11 0 obj
<<
/Type /XObject
/SubType /Form
/BBox [20.0 20.0 120.0 120.0]
/Length 19 0 R
>>
stream
....
endstream
endobj
12 0 obj
<<
/Type /XObject
/SubType /Form
/BBox [20.0 20.0 120.0 120.0]
/Length 20 0 R
>>
stream
....
endstream
endobj

My observations with your PDF are somewhat different but interesting nonetheless:
Adobe Acrobat 9 Pro v9.5.4 (with PDF/A r/o view disabled) here does exactly what you originally seem to have expected: It only uses the red or blue framed box. If one toggled the check box, though, even if toggling back on again, it wants to save a new revision with some changes to your field.
Adobe Reader XI v11.0.2 starts in PDF/A read-only mode and displays the red frame. After leaving that r/o mode, though, it shows the default cross appearance. When it gets the focus it again uses the red and blue frames. When it loses focus, it goes back to the default appearances.
The behavior I observed in Adobe Reader XI seems to be what you observed in more cases.
Thus in essence the issue is that under certain circumstances (for me: not in PDF/A r/o mode, focus not on form field) some PDF viewers (for me: Adobe Reader XI) don't use your custom check box appearances but some standard ones, and you think that this is incorrect.
Unfortunately there is a hint in the PDF specification ISO 32000-1:2008 according to which viewers may (perhaps even shall) act just so. Table 189 in section 12.5.6.19 Widget Annotations explains the entries in an appearance characteristics dictionary (value of /MK in the widget dictionary; you do not provide one, thus defaults apply), among them /CA:
text string (Optional; button fields only) The widget annotation’s normal caption,
which shall be displayed when it is not interacting with the user.
Unlike the remaining entries listed in this Table, which apply only to
widget annotations associated with pushbutton fields (see Pushbuttons in
12.7.4.2, “Button Fields”), the CA entry may be used with any type of
button field, including check boxes (see Check Boxes in 12.7.4.2, “Button
Fields”) and radio buttons (Radio Buttons in 12.7.4.2, “Button Fields”).
In particular check boxes, therefore, whenever not interacting with the user, shall be displayed using their normal captions, not their appearances.
When there is no focus on a form field, Adobe Reader seems to think that the form is not interacting with the user, and therefore switches to display of caption instead of appearance.
Unfortunately the normal caption you can define for a button is but a text string which by default seems to be interpreted in the context of the Zapf Dingbats font (try /MK<</CA(1)>> for example). This is, though, where you should continue looking, maybe you can make it use some Type 3 font of your design containing a blue and a red square frame.
