Build a TOC when Concatenating PDFs


I have a dozen essays as PDFs which I want to combine into one concatenated master PDF with a table of contents, where each entry is a clickable link to the first page of the corresponding essay. The TOC could be either a page with internal links or a proper PDF TOC.
Ideally this would be a command-line solution that works on Linux and macOS. So far I have used QPDF, which works great for concatenating the essay PDFs, but it does not build a TOC.
It is a one-off problem, so I am happy to write some (bash, Python or other) scripting code to generate this TOC. For utility it is important that the links are clickable.
Any idea how to do this?

As I already noted, you can create a TOC page manually and append or prepend it to the file.
To make the TOC clickable, you need to add link annotations to it. After some quick googling I put together the following example using Ghostscript:
gs -o output.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf an.txt
And an.txt file contains the following:
[ /Subtype /Link
/SrcPg 1
/Rect [10 10 50 50]
/Page 2
/ANN pdfmark
Here SrcPg is the number of the page to put the annotation on, Rect is the area to make clickable, and Page is the destination page number.
You can find more details on annotation syntax here and here. Hope it helps.
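
For a dozen essays, the annotation file can be generated rather than typed by hand. The following Python sketch assumes that a one-page TOC is prepended to the merged file, that each TOC entry occupies a fixed-height row, and that the page count of each essay is already known (for example from qpdf --show-npages). The function names and the layout numbers are illustrative and are not part of Ghostscript; Rect values are in PDF points measured from the lower-left corner of the page.

```python
def start_pages(page_counts, toc_pages=1):
    """Return the 1-based first page of each essay in the merged PDF.

    Assumes the TOC occupies the first `toc_pages` pages of the output.
    """
    starts = []
    next_page = toc_pages + 1
    for count in page_counts:
        starts.append(next_page)
        next_page += count
    return starts


def link_pdfmarks(starts, left=72, right=540, top=700, row_height=20, toc_page=1):
    """Build one /ANN link pdfmark per TOC row.

    The row geometry (left, right, top, row_height) is a made-up layout;
    adjust it to match where the entries sit on the real TOC page.
    """
    marks = []
    for i, dest in enumerate(starts):
        y_top = top - i * row_height
        y_bottom = y_top - row_height
        marks.append(
            f"[ /Subtype /Link\n"
            f"  /SrcPg {toc_page}\n"
            f"  /Rect [{left} {y_bottom} {right} {y_top}]\n"
            f"  /Page {dest}\n"
            f"  /ANN pdfmark"
        )
    return "\n".join(marks) + "\n"


if __name__ == "__main__":
    # Hypothetical page counts for three essays.
    print(link_pdfmarks(start_pages([3, 5, 2])))
```

Saving the printed text as an.txt and passing it to gs after the merged input PDF, as in the command above, should attach one clickable rectangle per entry.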

Related

ghostscript settings to open PDF as full page

I know from some colleagues, who design our leaflets in InDesign and save them as PDFs, that there is a setting to show a file in full-page mode when it is opened.
I wrote a script to "merge" some of these documents using Ghostscript's pdfwrite device and the option -dPDFFitPage (edited after KenS' answer).
Here is my full command:
gs -dBATCH -sDEVICE=pdfwrite -dNO_PDFMARK_OUTLINES -dPDFFitPage -o output.pdf cover.pdf input1.pdf input2.pdf input3.pdf pdfmarks
But "-dPDFFitPage" does not do what I expect. The page width is fitted to the screen, but I would like the whole page to fit on screen. I also heard that using "/FIT" in the pdfmarks would help, but it doesn't either.
If anybody can help me, I would be very thankful.
Best regards
Mike

OK, after a few more hours of reading the pdfmark reference and asking Google various questions, I arrived at a solution that fully satisfies me:
[ /PageMode /UseOutlines /Page 1 /View [ /Fit] /PageLayout /SinglePage /DOCVIEW pdfmark
So I simply added /PageLayout /SinglePage. The file now opens in full-page mode in the reader window, showing bookmarks (/UseOutlines), and it scrolls page by page, so every step of the mouse wheel moves one page. This works perfectly now.

I found a solution to my problem; perhaps it will help others, so I am posting it as an answer. KenS' answer was a big help in solving my problem. Thanks to him.
[ /PageMode /UseOutlines
/Page 1 /View [/Fit]
/DOCVIEW pdfmark
This sets the magnification of the PDF file to "window size", so that the whole page fits in the viewer window. It works well with Acrobat Reader and Acrobat Standard. Other readers have not been tested.
Best regards
Mike

There is no option -dpdfwriter. The fact that PDFFitPage doesn't do what you want isn't surprising: it has no effect on what a PDF viewer will do. This option (which is described in the documentation) only has an effect when used with a predefined fixed media size. It creates a new PDF in which the content of the original PDF is scaled so that it fits onto the fixed media size.
If you want to include directions to PDF viewers on how to open PDF documents, then you need to look at the pdfmark operator. Specifically, you will need to construct a DOCVIEW pdfmark as described on pages 29 and 30 of the version 9 pdfmark reference.
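
If the merge is scripted anyway, the DOCVIEW pdfmark file can be written by the same script. This Python sketch only formats text and an argument list; it reuses the keys from the pdfmarks posted in this thread, and the function names and file names are made up for illustration. It does not run Ghostscript itself.

```python
def docview_pdfmark(page=1, page_mode="UseOutlines",
                    page_layout="SinglePage", view="Fit"):
    """Format a DOCVIEW pdfmark using the keys shown in this thread."""
    return (
        f"[ /PageMode /{page_mode}\n"
        f"  /Page {page} /View [/{view}]\n"
        f"  /PageLayout /{page_layout}\n"
        f"  /DOCVIEW pdfmark\n"
    )


def gs_merge_command(output, inputs, pdfmark_file):
    """Return the gs argument list; the pdfmark file goes after the PDFs."""
    return ["gs", "-sDEVICE=pdfwrite", "-o", output, *inputs, pdfmark_file]


if __name__ == "__main__":
    print(docview_pdfmark())
    # Hypothetical file names; pass the list to subprocess.run to execute it.
    print(gs_merge_command("output.pdf", ["cover.pdf", "input1.pdf"], "pdfmarks"))
```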

Is there a way to fix cut-off text in a PDF file?

I have a library manual whose creator changed some of the LaTeX code, altering the page position and size, but didn't check the result before compiling, distilling and sending it off. He is currently unavailable, so if I want to print it I have to fix it myself.
I was able to use a Ghostscript command to push the entire text down to something approaching centered on the page. The command is shown below:
/usr/bin/gs -sDEVICE=pdfwrite -o /home/user/shiftdown.pdf -dPDFSETTINGS=/prepress -c "<</PageOffset [0 -35]>> setpagedevice" -f /home/user/brokendoc.pdf
The issue is that while the page is now printable without hitting hardware margins, the chapter titles are still halfway cut off at the top. If I open the PDF in Acrobat or Reader, I can select the chapter title and copy it and it pastes the full text in the program of my choosing. When I tried printing it on a Xerox MFP with a partially incompatible driver it printed the header, but it wouldn't duplex and I didn't want to print 700+ pages and then use the copy to 1 -> 2 function.
Does anyone know of a way to fix these cut off headers such that they either appear correctly in the PDF file or at least reliably print correctly? I have ghostscript easily available, TeX relatively easily available and the standard version of Acrobat X.
[update:]
After downloading the demo of Acrobat Pro XI, I was able to go to the "Print Production" tab and click on "Edit Object". When I clicked on the cut off chapter titles it showed me two bounding boxes that covered the entire page with one just a little taller than the other. When I right clicked on it I got the option to Add Clip and Delete Clip. When I click on Delete Clip it shows the entire chapter title. If I click on Add Clip it says, "One or more of the selected regions already have a clipping region. Proceed with setting the clipping regions for the selected objects? [No] [Yes]"
With that added information, I know there has to be a way to fix the issue in batch mode. Does anyone know which command corresponds to this "Delete Clip" operation?

Without seeing the 'brokendoc.pdf' it's hard to know. If I see the file, I can tell you what's going on, and (probably) how to fix it or work around it.
I don't need the entire file; a shortened version with just a few pages that show the problem will suffice. You might be able to get this from the complete brokendoc.pdf using:
gs -sDEVICE=pdfwrite -o part.pdf -dLastPage=10 brokendoc.pdf
Also, you may want to try:
gs -sDEVICE=pdfwrite -o fitted.pdf -dPDFFitPage -sPAPERSIZE=letter -dFIXEDMEDIA brokendoc.pdf
The above will scale (and center) the page onto the specified page size. You can specify 'letter' or 'a4', or use -dDEVICEWIDTHPOINTS=_ -dDEVICEHEIGHTPOINTS=_ to get a specific output page size. The -dFIXEDMEDIA option causes gs to ignore the MediaBox in the file.
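
To get a feel for what "scale and center" means for a given page, the arithmetic can be worked through by hand. The Python sketch below shows generic fit-and-center geometry, assuming uniform scaling that preserves the aspect ratio; it is an illustration and not necessarily the exact computation Ghostscript performs. All sizes are in points (1/72 inch), and the function name is made up.

```python
def fit_and_center(src_w, src_h, dst_w, dst_h):
    """Return (scale, dx, dy) that fits a source page inside a target page.

    The scale is the largest uniform factor at which the source still fits;
    dx and dy are the margins that center the scaled page on the target.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    dx = (dst_w - src_w * scale) / 2
    dy = (dst_h - src_h * scale) / 2
    return scale, dx, dy


if __name__ == "__main__":
    # A4 (595 x 842 pt) placed on US letter (612 x 792 pt).
    scale, dx, dy = fit_and_center(595, 842, 612, 792)
    print(f"scale={scale:.4f} dx={dx:.2f} dy={dy:.2f}")
```

A scale below 1 means the content shrinks, so text near the physical edge of the page moves inward, which is why this approach can help with hardware margins but will not restore content hidden by a clipping region.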

Undo Pdfnup Operation

I have a PDF file which contains several slides per page, including text (not only images).
This pdf was probably created using pdfnup.
Can I revert the pdfnup operation so that each slide is shown on one page?

As far as I know, there is no simple, ready-to-use 'undo' operation.
However, the following answers show the general approach for achieving the equivalent of an undo using Ghostscript:
Convert PDF 2 sides per page to 1 side per page (Superuser)
How can I split a PDF's pages down the middle? (Superuser)
Cropping a PDF using Ghostscript 9.01 (Stackoverflow)
PDF - Remove White Margins (Stackoverflow)
(If these do not help you find the final solution, ask again. To come up with a fully working command line, though, I would first need the complete output of the following command: pdfinfo -f 1 -l 100 -box your.pdf.)

Adding internal hyperlink to a pdf

I have a PDF document to which I want to add internal hyperlinks.
Specifically, page 1 contains a table of contents which I want to make clickable.
My idea is to create rectangular boxes in predetermined locations on page 1, which should link to pages 2, 3, ...
I found this post which talks about adding internal hyperlinks using the method I described above.
http://bugs.ghostscript.com/show_bug.cgi?id=691531
However, when I try to use this technique in my file, the script just ADDS pages with the rectangle and hyperlink.
I need it to overlay the hyperlink on the existing contents of my first page.

You can do this with Ghostscript, using the pdfmark operator.
For some introduction to the pdfmark topic, see also Thomas Merz's PDFmark Primer.
For an example to achieve a similar thing, see this answer: Merge PDF's with PDFTK with Bookmarks?
Alternatively, you could...
...use qpdf to expand all (compressed) internal PDFstreams into ASCII,
...edit the PDF source code (using the knowhow acquired from the PDFmark Primer),
...use qpdf again to re-compress the PDF streams.

This is what I used:
Ghostscript function call from MATLAB:
-o output.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress original.pdf script.ps
Postscript code saved in script.ps:
[ /Rect [10 10 50 50]
/Page 2
/SrcPg 1
/Subtype /Link
/ANN pdfmark

There is currently (as of 2020) a piece of freeware for Windows that allows adding hyperlinks. PDF X-Change Editor, which has a free demo version, allows manually drawing hyperlinks on the page (arbitrary rectangles) and setting the target location (page). It is offered at no cost but it is not "free as in libre" software.

Combining PDF with GhostScript: Using Original Bookmarks with corrected page numbers

I am using
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=book.pdf -f front-matter.pdf fulltext-0.pdf fulltext-1.pdf back-matter.pdf
to create a single PDF document from a series of PDF documents. I was going to make up a new table of contents and include it using the pdfmark mechanism. Then I noticed that the original files already have bookmarks in them. They are, however, referenced to the original page numbers, not the ones in the combined document.
I am looking for one of two possible solutions: remove the original bookmarks, or make use of the original bookmarks but somehow update their page references...

As so often the case, someone has walked the same path before you...
The unfolding disasters blog has worked out a solution to this very problem. Its Python script pdf-merge.py first invokes pdftk with its dump_data switch to retrieve all the bookmark information. It then keeps track of the page count of each merged document and offsets the page number in each pdfmark instruction by the total page count of all the PDF documents included before the current one. So it is close to, but not the same as, the 2-pass approach of KenS: it first discovers the bookmarks using pdftk and then creates a new bookmark file with correct page numbers. It also manages to turn the original pdfmark instructions (which would normally be preserved by gs) into no-ops. I won't pretend I understand how that last part works ...
However, the script does all I need including the option of tweaking the bookmark file before the final writing. Very neat and hat tip to Trevor King.

In general pdfwrite doesn't know you are appending files, so it preserves bookmark and other 'metadata' information on the assumption that you will want it in the output.
However, when you are combining PDF files, preserving the information won't work, as the page numbers for the second and subsequent files will be incorrect.
So you need a 2-pass approach, first merge all the files, discarding the bookmarks, then 'convert' the merged file and add pdfmarks to set the correct bookmarks.
There is currently no option (with pdfwrite) to not preserve bookmarks. You will need to modify the Ghostscript PDF interpreter PostScript files to achieve this I think. You might try setting -dDOPDFMARKS=false, but I doubt that will work.
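
The page-number correction for the second pass is simple arithmetic: each bookmark's page is shifted by the total page count of all files merged before it. A minimal Python sketch of that step follows, assuming the per-file page counts and bookmarks have already been extracted by some other tool. The data layout and function names are hypothetical, and only flat (non-nested) bookmarks with ASCII titles are handled.

```python
def ps_string(text):
    """Wrap text as a PostScript string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def merged_bookmarks(documents):
    """Shift each bookmark by the pages of all preceding documents.

    `documents` is a list of (page_count, [(title, page), ...]) tuples,
    in merge order, with 1-based page numbers local to each file.
    """
    offset = 0
    merged = []
    for page_count, bookmarks in documents:
        for title, page in bookmarks:
            merged.append((title, page + offset))
        offset += page_count
    return merged


def outline_pdfmarks(bookmarks):
    """Emit one /OUT pdfmark line per bookmark."""
    return "".join(
        f"[ /Title {ps_string(title)} /Page {page} /OUT pdfmark\n"
        for title, page in bookmarks
    )


if __name__ == "__main__":
    # Hypothetical input: a 4-page front matter and a 10-page chapter file.
    docs = [
        (4, [("Preface", 1)]),
        (10, [("Chapter 1", 1), ("Section (a)", 3)]),
    ]
    print(outline_pdfmarks(merged_bookmarks(docs)), end="")
```

The resulting text can be saved to a file and passed to gs after the merged PDF in the second pass.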
