Creating Acrobat links in source document

I've used Acrobat Pro for many years but have only scratched the surface of its capabilities.
Here's the problem I'm facing. I'm sure that Acrobat can deal with it; I just can't figure out how...
So... I have a wiki-based content management system from which I need to regularly produce packs of content (consisting of up to 250 articles), packaged as a single PDF.
I have the WebKit HTML-to-PDF conversion engine up and running and have written code (JavaScript/jQuery) that selects the relevant articles (based on user input), sorts them alphabetically and converts them into a single Acrobat document. The process creates simple bookmarks, one for each article, based on its title.
So far, so good...
I also dynamically create a second document that contains a Table of Contents and index.
An example... The wiki contains pages:
All about Foo
All about Bar
All about Baz
These are exported into a single PDF, in the order
All about Bar
All about Baz
All about Foo
And bookmarks are created that link to each 'page'
The ToC and index are then created as temporary wiki pages (using JavaScript and jQuery) and exported to PDF, e.g.:
1 All about Bar
2 All about Baz
3 All about Foo
And
Bar 1
Baz 2
Foo 3
Systems 1,2,3
Kittens 2,3
Etc etc
This has worked fine and everyone has been happy. However, in the nature of users everywhere, they want more. Specifically, they want to merge the ToC and Index into the PDF (that part's easy) and link on the article numbers.
That's where my knowledge and the results I can pull out of Google come to a halt. I can do it manually of course but with three or four of these packs per week, each containing between 20 and 350 articles and well over a thousand index entries - that would be a world of pain.
As the ToC and Index are generated by scripts, I can wrap the numbers in any sort of html markup.
I wondered about something like
4
But I don't see how to pick up on that at merge time so that the HTML links are converted to PDF ones.
Any ideas or pointers?
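To make the numbering scheme above concrete, here is a minimal sketch of how the ToC and index lines could be derived from the sorted titles. It is in Python rather than the JavaScript/jQuery the actual scripts use, and the input shape (a mapping of title to index terms), the function name and the `wrap` parameter are all assumptions for illustration. It only shows where link markup would be wrapped around each number; it does not settle whether the converter or the merge step turns that markup into PDF links.

```python
from collections import defaultdict


def build_toc_and_index(articles, wrap=str):
    """Return (toc_lines, index_lines) for a pack of articles.

    articles: dict mapping article title -> list of index terms
              (hypothetical input shape, not taken from the wiki).
    wrap:     callable turning an article number into the markup to emit;
              the default just prints the bare number.
    """
    ordered = sorted(articles)  # articles are exported alphabetically
    number_of = {title: n for n, title in enumerate(ordered, start=1)}

    toc_lines = [f"{wrap(number_of[title])} {title}" for title in ordered]

    index = defaultdict(list)
    for title in ordered:
        for term in articles[title]:
            index[term].append(number_of[title])
    index_lines = [
        f"{term} {','.join(wrap(n) for n in numbers)}"
        for term, numbers in index.items()
    ]
    return toc_lines, index_lines


pack = {
    "All about Foo": ["Foo", "Systems", "Kittens"],
    "All about Bar": ["Bar", "Systems"],
    "All about Baz": ["Baz", "Systems", "Kittens"],
}
toc, index = build_toc_and_index(pack)
print("\n".join(toc))
print("\n".join(index))
```

Passing a different `wrap` callable is where whatever HTML markup ends up working would go, so the numbering logic and the link markup stay separate.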

Related

Does pandoc have an option for putting multiple "pages" per "sheet"?

I've been using a shell script with pandoc to create multi-page PDF files. I can specify the size of the page... but if we consider this "page" to be a sheet of paper (the PDF gets printed onto paper), I actually want multiple pages per sheet: 2 pages on each side of the paper. So 4 pages get printed on each sheet of paper (and the paper gets folded in half).
Depending on the style of printing, the ordering of pages is different:
For "booklet" printing, where all the sheets are stacked together and then folded once together, the page ordering of an 8-page (2 sheets of paper) document would be pages 8 & 1, on the back of that pages 2 & 7, on the next sheet pages 6 & 3, and on its back pages 4 & 5.
For "book" style printing, where each sheet gets folded on its own and the folded sheets are then placed together, the order is different: pages 4 & 1 with 2 & 3 on the back, and on the next sheet pages 8 & 5 on one side with 6 & 7 on the other.
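The two orderings can be written down as a small sketch. This is plain Python that only computes the page sequence; it is not a pandoc feature, and the function names are made up for illustration. It assumes the page count is already a multiple of 4 (pad with blank pages otherwise).

```python
def booklet_order(n_pages):
    """Page order when all sheets are stacked and folded together."""
    if n_pages % 4:
        raise ValueError("page count must be a multiple of 4")
    order = []
    low, high = 1, n_pages
    while low < high:
        # Front of the sheet (last, first), then its back (next, next-to-last).
        order += [high, low, low + 1, high - 1]
        low += 2
        high -= 2
    return order


def book_order(n_pages):
    """Page order when each sheet is folded on its own, then stacked."""
    if n_pages % 4:
        raise ValueError("page count must be a multiple of 4")
    order = []
    for first in range(1, n_pages + 1, 4):
        # Each sheet carries four consecutive pages.
        order += [first + 3, first, first + 1, first + 2]
    return order


print(booklet_order(8))  # [8, 1, 2, 7, 6, 3, 4, 5]
print(book_order(8))     # [4, 1, 2, 3, 8, 5, 6, 7]
```

The resulting list is the order in which pages would need to be fed, two per side, to whatever tool does the actual 2-up layout.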
The desktop publishing program Scribus (among other design-to-print software) has functionality for ordering the pages like this (for reference, this article describes the situation I mean). But I don't want to use a GUI like Scribus. I'm writing the pages in Markdown in Vim and generating the PDF from the command line.
Does pandoc have a way of ordering pages like this?
As you state in the comments, "It might be that pandoc can NOT do this", at least not without some TeX fettling.
I previously gave an answer on TeX Stack Exchange, with visuals, showing a versatile way that 32-up page imposition can be done in LaTeX (so an 8-page document should not be a problem if you trim the answer down):
see https://tex.stackexchange.com/questions/494047/a-tex-script-to-impose-multiple-layout-signatures-optionally-saddle-stitch-fro/494232#494232
Simple imposition (booklets and n-up print compositing) can be done by other PDF-oriented CLI tools that pandoc uses, but the last time I built a LaTeX program (linked above) it was a pain to get right.
More recently I wrote a similar function to work inside the SumatraPDF reader, and for that I used simple CMD batch commands via either N-Up-PDF or cPDF, as they do the basic stuff like rotating, joining two pages and reordering into booklets, but you may need to adapt them to suit your own use.
My RTFManual (Rich Text Format) for install/usage is available as a PDF:
https://github.com/GitHubRulesOK/MyNotes/raw/master/AppNotes/SumatraPDF/Addins/N-Up-PDF/N-Up-PDF.pdf

Split by bookmark in middle of PDF page

For a little open-source project I am trying to split PDFs based on the bookmark information. I found several ways to split them by page, but sadly the PDFs I need to split have bookmarks in the middle of the page. So if I have 3 pages and two bookmarks, one on the first page and one in the middle of the second page, I need to extract 2 PDFs with 2 pages each, with no text appearing in both PDFs.
Does something like that exist?
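As a rough sketch of the page-range half of the problem: given each bookmark's page and whether it points at the very top of that page, the pages each output PDF needs can be computed as below. The input shape and the function name are assumptions for illustration, and this only works out which pages go where. Removing the other section's text from a shared page (by cropping or otherwise) is the separate, harder part and is not covered here.

```python
def split_ranges(bookmarks, n_pages):
    """Return one (first_page, last_page) pair per bookmark.

    bookmarks: list of (page, at_top) tuples in document order; page is
               1-based and at_top is True when the bookmark targets the
               very start of the page (hypothetical input shape).
    n_pages:   total number of pages in the source PDF.
    """
    ranges = []
    for i, (start_page, _) in enumerate(bookmarks):
        if i + 1 < len(bookmarks):
            next_page, next_at_top = bookmarks[i + 1]
            # A section ends on the page before the next bookmark, unless
            # that bookmark sits mid-page, in which case the page is shared.
            end_page = next_page - 1 if next_at_top else next_page
        else:
            end_page = n_pages
        ranges.append((start_page, max(start_page, end_page)))
    return ranges


# 3 pages, bookmarks at the top of page 1 and in the middle of page 2.
print(split_ranges([(1, True), (2, False)], 3))  # [(1, 2), (2, 3)]
```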

How can I add an interactive "table of contents" to a scanned pdf?

I'm trying to go from a paper document to a searchable pdf with a table of contents.
Sometimes you will download a PDF book or document (for example, the Intel Manual, which can be seen below). This document is searchable and it also has a table of contents. Now, when you put this same document on Google Drive and then open it with PDF Expert on an iPad, it is still searchable and still has a table of contents. This is what I'd like to do with all my scanned PDFs.
Now a more concrete example. Shown below is a document that I've scanned with the Fujitsu ScanSnap. It's also searchable thanks to some software that comes with the ScanSnap. So now I have a searchable PDF that can be opened locally or on my iPad, but it doesn't have a table of contents. So my main question is: how can I add a table of contents like the one in the Intel Manual to a scanned PDF?
It seems like there are tons of people doing different things with a "table of contents". For example, people who are designing documents use InDesign. I think that what I'm trying to do must be simpler than that. I'm thinking that there has to be an easy way to do this using, say, Adobe Acrobat Pro? Something about adding "bookmarks" or "links" or "tags" to the existing table of contents. Do you know of a clear and concise way to do this using Acrobat or some other software?
thanks for the help
JPdfBookmarks can work for scanned books.
Step 1: Prepare the table of content
Save the TOC in a .txt file in this format:
Chapter 1. The Beginning/23
Para 1.1 Child of The Beginning/25,FitWidth,96
Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
Para 2.1 Child of The Beginning/32,FitPage
You can OCR the TOC and use regex to fix it.
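As an illustration of the regex clean-up, here is a minimal Python sketch that turns OCR'd contents lines such as "Chapter 1. The Beginning ...... 23" into the Title/page form shown above. The shape of the OCR output (a title, optional dot leaders, then a trailing page number) is an assumption, and the function name is made up. It does not add the optional view settings (FitWidth and so on) or any nesting.

```python
import re

# Title, then optional leaders (spaces, dots, middle dots, underscores),
# then the page number at the end of the line.
TOC_LINE = re.compile(r"^(?P<title>.*?)[\s.·_]*(?P<page>\d+)\s*$")


def to_bookmark_line(ocr_line):
    """Convert one OCR'd contents line to 'Title/page', or None if no page."""
    match = TOC_LINE.match(ocr_line.strip())
    if not match:
        return None
    title = match.group("title").strip()
    if not title:
        return None
    return f"{title}/{match.group('page')}"


ocr_lines = [
    "Chapter 1. The Beginning ........ 23",
    "Para 1.1 Child of The Beginning . . . 25",
    "Preface",
]
for line in ocr_lines:
    print(to_bookmark_line(line))
```

Lines that come back as None (here "Preface", which has no page number) are the ones worth checking by hand.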
Step 2: Load that TOC
Step 3: Prepare for step 4
This sounds dumb, but if you miss it you will be frustrated and have to do it again. Expand all bookmarks (Ctrl + E), select all of them, then go to Tools → Apply Page Offset
Step 4: Apply page offset
This step should be self-explanatory. Don’t forget to save.
That’s it. You are done. For more information, you can read its manual. The program has a command-line mode and can work on Linux and Mac.
If there are non-Roman characters, be sure to use the same encoding when dumping and applying bookmarks.
I also have a complete guide to process scanned books, you may want to check it out: The ultimate guide to process scanned books.
FYI:
• How to OCR tables of contents to proper outputs?
• How can I split in half a double-page scanned PDF in a single pass?
I have done this before by combining multiple "booklets". Each "Chapter" was a series of pages combined in Adobe Acrobat Pro. I would combine chapters into separate "booklets" and then name them a chapter name, and then combine all chapters into a new booklet.

Is there a way to programmatically remove all blank pages from a PDF file?

Nowadays it is more practical to purchase an ebook than the dead-tree version. But the PDFs frequently contain the blank pages used by the print edition. I typically see between 10 and 30 blank pages (or pages with the text "This page intentionally left blank.") per ebook. Is it possible to programmatically remove these blank pages? Currently I manually identify the blank pages and then run it through this:
pdftops orig.pdf - | psselect "$range_of_non_blank_pages" | ps2pdf - new.pdf
So the hard part is identifying the blank pages. pdftotext would work for the most part, except where the page has only images and no text.
Also, even after removing many pages and seeing the resulting file size is smaller, after shrinking both the original file and the new version (using various methods found on the internets), the original file is usually smaller by several hundred KB or more. So it appears the method I'm using to remove the blank pages doesn't create an optimal pdf. I've also tried various gui programs and see the same results in this respect.
Partial answer: you don't need to go via PostScript (this is probably the reason why you get a bigger file). One possibility is
pdftk orig.pdf cat "$range_of_non_blank_pages" output new.pdf
To identify blank pages, you'd need to use a tool that can go beyond selecting and reassembling pages. Try a library for a scripting language, for example CAM::PDF or PDF::API2 in Perl.
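As a rough sketch of the text-based approach mentioned in the question (in Python rather than Perl): pdftotext separates pages with form feed characters, so its output can be split per page and any page with no text, or only the "intentionally left blank" notice, treated as blank. The function names are made up for illustration. As the question notes, image-only pages will look blank to this check, so treat the result as a candidate list rather than a final answer.

```python
import re

BLANK_NOTICE = re.compile(
    r"this\s+page\s+(is\s+)?intentionally\s+left\s+blank\.?", re.IGNORECASE
)


def non_blank_pages(pdftotext_output):
    """Return 1-based numbers of pages that contain real text."""
    pages = pdftotext_output.split("\f")
    if pages and pages[-1] == "":
        # Either residue after a final form feed or an empty last page;
        # dropping it is harmless in both cases.
        pages.pop()
    keep = []
    for number, text in enumerate(pages, start=1):
        stripped = text.strip()
        if stripped and not BLANK_NOTICE.fullmatch(stripped):
            keep.append(number)
    return keep


def to_ranges(page_numbers):
    """Collapse [1, 4, 5] into ['1', '4-5']."""
    ranges = []
    for n in page_numbers:
        if ranges and ranges[-1][1] == n - 1:
            ranges[-1][1] = n
        else:
            ranges.append([n, n])
    return [str(a) if a == b else f"{a}-{b}" for a, b in ranges]


sample = "Title\f\fThis page intentionally left blank.\fBody\fMore\f"
print(to_ranges(non_blank_pages(sample)))  # ['1', '4-5']
```

Each resulting range can then be passed as a separate page-range argument to the pdftk cat command above.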
I don't know of an open-source solution that can detect and remove blank pages. However, Apago's commercial PDF Enhancer can automatically remove blank pages, both vector and scanned. For scanned pages, it can remove scan artifacts such as black edges, hole punches and noise before determining whether a page is blank.

Batch Decollating Directory of PDFs and Bar Code Imposing

I have a directory of PDFs. They are all different, but they all have 5 pages. I need to insert a bar code on each page for each PDF. After this process I need to combine and decollate every PDF. Essentially there would be 5 different PDFs created. The first would contain all page ones from every PDF, the second the second page, etc.
I need to find a tool, or a toolset, that would allow me to accomplish this. I'm willing to program my own solution but I'm not even sure what would be the most efficient language to attack it with.
What I ended up doing was using Perl with the PDF::Reuse and PDF::Reuse::BarCode libraries to read all the PDFs in the directory, pull the pages I wanted, add the barcode and save the result out to a new PDF.
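The decollation step itself is just a regrouping of pages, which can be sketched independently of any PDF library. The snippet below is Python, not the Perl/PDF::Reuse code actually used, and it works on placeholder page labels; with a real library the list items would be page objects, and the barcode step is left out entirely.

```python
def decollate(documents):
    """Regroup pages so output k holds page k of every input document.

    documents: list of per-document page lists, all of the same length
               (5 pages each in the scenario described above).
    """
    lengths = {len(pages) for pages in documents}
    if len(lengths) > 1:
        raise ValueError("all documents must have the same number of pages")
    # zip(*documents) pairs up the first pages, then the second pages, etc.
    return [list(group) for group in zip(*documents)]


docs = [
    [f"a-p{n}" for n in range(1, 6)],
    [f"b-p{n}" for n in range(1, 6)],
    [f"c-p{n}" for n in range(1, 6)],
]
outputs = decollate(docs)
print(len(outputs))  # 5
print(outputs[0])    # ['a-p1', 'b-p1', 'c-p1']
```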