A Table of Contents Page for a Scanned PDF

I was recently given some really old but very useful hand-written notes, and in a bid to preserve them I had them scanned into a PDF file. What I have is a 35-page PDF, but I want to add a contents page at the beginning so that I can use the first page to click my way to a specific topic.
More precisely,
I want a page which says
Topic 1
Topic 2
Topic 3
...
Each one should be linked to a page of my choosing.
I've explored a lot of standard tools out there to help me with this, like LibreOffice, pdftk, etc., but the solution does not appear to be in the form of a simple application and a few clicks. My hunch is that this will require a program written in a suitable language. The way I'd want this program to work is as follows:
ProgramName Input.pdf CustomTOC.txt
where CustomTOC.txt could be a simple ASCII table containing two columns, one being the title and the other being the page number. The output of this program would be another PDF file consisting of Input.pdf with one page added at the beginning: a table of contents with hyperlinks to the right pages.

I have managed to solve this problem, though I don't think this is the best way to do it. I have written a Python program that accepts two mandatory inputs: the input PDF file and a '|'-separated ASCII table containing titles and page numbers. A third, optional argument is the name of the output PDF file; if this is not provided, the original input file is overwritten.
How the code works: it uses a system call to pdftk to burst the PDF file into its constituent pages, then writes a .tex file whose first page is a \listoffigures command, with the hyperref package ensuring the list links to the figures. The rest of the .tex code contains one figure-insertion statement per page, inserting the single-page PDF corresponding to each page and giving captions only to those pages for which there is an entry in the provided TOC table.
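For reference, the .tex-generation step can be sketched roughly like this. This is a minimal sketch of the approach described above, not the actual program; it assumes pdftk and pdflatex are on the PATH and that each line of the TOC file looks like `Topic 1|5`:

    import glob
    import subprocess

    def build_toc_pdf(input_pdf, toc_file):
        # Read the '|'-separated table: "title|page" per line
        toc = {}
        with open(toc_file) as f:
            for line in f:
                if "|" in line:
                    title, page = line.rsplit("|", 1)
                    toc[int(page)] = title.strip()

        # Burst the input into single-page PDFs: pg_0001.pdf, pg_0002.pdf, ...
        subprocess.run(["pdftk", input_pdf, "burst", "output", "pg_%04d.pdf"], check=True)
        num_pages = len(glob.glob("pg_*.pdf"))

        # One figure per page; only captioned figures show up in \listoffigures
        lines = [r"\documentclass{article}",
                 r"\usepackage{graphicx}",
                 r"\usepackage[margin=0pt]{geometry}",
                 r"\usepackage[hidelinks]{hyperref}",
                 r"\pagestyle{empty}",
                 r"\begin{document}",
                 r"\listoffigures",
                 r"\clearpage"]
        for i in range(1, num_pages + 1):
            lines += [r"\begin{figure}", r"\centering",
                      r"\includegraphics[width=\textwidth]{pg_%04d.pdf}" % i]
            if i in toc:
                lines.append(r"\caption{%s}" % toc[i])
            lines += [r"\end{figure}", r"\clearpage"]
        lines.append(r"\end{document}")

        with open("toc_build.tex", "w") as f:
            f.write("\n".join(lines))

        # Two passes so \listoffigures picks up the freshly written .lof file
        for _ in range(2):
            subprocess.run(["pdflatex", "-interaction=batchmode", "toc_build.tex"],
                           check=True)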
Why the code is not ideal: it relies on too many dependencies. It needs a system call to pdftk, and it requires that LaTeX also be installed on the machine along with the graphics package. In the current version of the code, the PDFs on each page have some offset, which I am trying to fix using the geometry package with custom margin settings. I will try to post the code once this problem is solved.
A more ideal solution: one that does not require LaTeX and can use some PDF library within Python to achieve the same effect. Comments and suggestions are welcome!
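In the spirit of that more ideal solution, here is a rough sketch that stays entirely within Python, using pypdf to assemble the document and reportlab to draw the contents page. The exact link-annotation helpers have moved around between pypdf versions, so treat the API details as assumptions rather than a definitive implementation:

    import io
    from pypdf import PdfReader, PdfWriter
    from pypdf.annotations import Link
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import letter

    def add_toc(input_pdf, toc_entries, output_pdf):
        """toc_entries: list of (title, page_number) with 1-based page numbers."""
        # 1. Draw the contents page with reportlab
        buf = io.BytesIO()
        c = canvas.Canvas(buf, pagesize=letter)
        y, positions = 700, []
        for title, page in toc_entries:
            c.drawString(72, y, f"{title} ........ {page}")
            positions.append((title, page, y))   # remember where each line was drawn
            y -= 20
        c.save()
        buf.seek(0)

        # 2. Prepend the contents page to the original document
        writer = PdfWriter()
        writer.append(PdfReader(buf))            # the new TOC page becomes index 0
        writer.append(PdfReader(input_pdf))

        # 3. Add a clickable link annotation over each TOC line
        for title, page, y in positions:
            writer.add_annotation(
                page_number=0,
                annotation=Link(rect=(72, y - 4, 400, y + 12),
                                target_page_index=page),  # original page N now sits at index N
            )

        with open(output_pdf, "wb") as f:
            writer.write(f)

Called as `add_toc("Input.pdf", [("Topic 1", 3), ("Topic 2", 10)], "Output.pdf")`; every original page shifts down by one because of the prepended page, which is why `target_page_index` is simply the 1-based page number from the table.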

Related

ghostscript extract pages containing a text string

I need to programmatically extract, from a multipage PDF, only the pages containing a text string. Is it possible, or do I need some other tools? I'm working on AIX.
Thanks in advance.
OK, firstly, Ghostscript doesn't extract pages from PDF files. It creates brand new PDF files whose visual appearance should be the same as the original, but whose content will be different.
There is no way to do this with Ghostscript in a single pass. You could use the txtwrite device to extract the text then grep through the output files for the text you want, note the page numbers and then run another pass to get those pages into new files.
Be aware that extracting text from a PDF file is far from guaranteed to work! That was not the intent of the original PDF format.
Also note that Ghostscript currently only allows for handling a single range of pages, First->Last, so if you have a discontinuous set (e.g. pages 1, 3, 5, 7, etc.) then you will have to run this step multiple times.
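A rough sketch of that two-pass approach, driving Ghostscript from Python; the gs executable name, file names, and output naming are assumptions (on Windows the binary is usually gswin64c):

    import glob
    import subprocess

    SEARCH = "the text string"

    # Pass 1: dump the text of every page to per-page files with the txtwrite device
    subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=txtwrite",
                    "-sOutputFile=page_%d.txt", "input.pdf"], check=True)

    # Grep the per-page text files for the search string
    matches = []
    for name in glob.glob("page_*.txt"):
        page_no = int(name.split("_")[1].split(".")[0])
        if SEARCH in open(name, encoding="utf-8", errors="ignore").read():
            matches.append(page_no)

    # Pass 2: one pdfwrite run per matching page, since gs only takes one First->Last range
    for page_no in sorted(matches):
        subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
                        f"-dFirstPage={page_no}", f"-dLastPage={page_no}",
                        f"-sOutputFile=page_{page_no}.pdf", "input.pdf"], check=True)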

How to delete the first page from multiple PDFs

I have a collection of PDFs that sometimes have an info page as the first page of the document, which I want to remove.
Is there a quick way to delete this info page from all of my PDFs, or at least a way to list all PDFs that have more than one page so I can better find the ones that need to be fixed?
Do you know of any program that can do this? Or way to do this with python?
Note: The info page has text on it that always remains the same: "LAND TITLE OFFICE"
Using Windows 7 OS
Thanks
Some research turned up the following:
http://www.python.org/workshops/2002-02/papers/17/index.htm
http://www.unixuser.org/~euske/python/pdfminer/index.html
https://pypi.org/project/pypdf/
You can try these two ways:
PdfTK is a utility to manipulate PDFs. Check this link, they are doing something similar to what you need (in the comments someone also posted a script for Windows).
PDFsam is a powerful graphical tool to manipulate PDFs in bulk. The split and merge sections should do the trick.
Both of them are free. I'd suggest studying the first if you want to write a "recipe" that you can use often, but the latter if you only have to do it once.
You can use the open-source PDFBox as a command-line utility to split PDFs.
The link for PDFBox is here: link
The documentation for splitting a PDF using PDFBox is here: link
You could use the PDFBox extract-text functionality from a batch script and combine it with grep to identify pages that contain the text you are looking for. The extract-text documentation is here: link
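Since the question also asks about a Python route, here is a rough sketch with pypdf (the last of the research links above) that drops the first page only when it carries the marker text. The directory path and output naming are placeholders, and text extraction can fail on scanned pages:

    import glob
    from pypdf import PdfReader, PdfWriter

    MARKER = "LAND TITLE OFFICE"

    for path in glob.glob(r"C:\pdfs\*.pdf"):
        reader = PdfReader(path)
        # Only multi-page files whose first page carries the marker need fixing
        if len(reader.pages) > 1 and MARKER in (reader.pages[0].extract_text() or ""):
            writer = PdfWriter()
            for page in reader.pages[1:]:        # keep everything except the info page
                writer.add_page(page)
            with open(path.replace(".pdf", "_fixed.pdf"), "wb") as f:
                writer.write(f)
            print("stripped info page:", path)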

Extract "cover image" from CHM and EPUB files

How can I programmatically and reliably create PNG images from CHM and EPUB files? The page that's needed is only the first one, as in "cover image thumbnail generation".
Could this even be done just from the command line?
I have already looked at the open-source CHM QuickLook plug-in for MacOSX for source that does this and at Calibre, the latter to no avail.
In CHM, the default page is a webpage (a .html file). Of course it can only contain one page.
An extractor program is easy to do based on chmlib or Free Pascal's libs, but it will need to parse the HTML to find the names of the other files to extract. Roughly the algorithm would be:
use some "list" function of a cmdline extract tool to get the default page's name. (this is stored in an internal record)
extract it, and parse it for img and other referencing tags.
extract those.
The biggest picture extracted in the previous step is probably "it"!
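For the EPUB side of the question, an EPUB is just a zip archive, so the "biggest picture is probably it" heuristic can be sketched directly in Python. A more robust tool would read the cover reference out of the OPF manifest instead, and Pillow is assumed here for the PNG conversion:

    import zipfile
    from io import BytesIO
    from PIL import Image   # Pillow, for the PNG thumbnail

    IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif")

    def epub_cover_thumbnail(epub_path, out_png, size=(256, 256)):
        with zipfile.ZipFile(epub_path) as z:
            images = [i for i in z.infolist() if i.filename.lower().endswith(IMAGE_EXTS)]
            if not images:
                raise ValueError("no images found in " + epub_path)
            # "Biggest picture is probably it": pick the largest image in the archive
            best = max(images, key=lambda i: i.file_size)
            img = Image.open(BytesIO(z.read(best.filename)))
        img.thumbnail(size)
        img.save(out_png, "PNG")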

Is there a way to programmatically remove all blank pages from a PDF file?

Nowadays it is more practical to purchase an ebook than the dead-tree version. But the PDFs frequently contain the blank pages used by the print edition. I typically see between 10 and 30 blank pages (or pages with the text "This page intentionally left blank.") per ebook. Is it possible to programmatically remove these blank pages? Currently I manually identify the blank pages and then run it through this:
pdftops orig.pdf - | psselect "$range_of_non_blank_pages" | ps2pdf - new.pdf
So the hard part is identifying the blank pages. pdftotext would work for the most part, except where the page has only images and no text.
Also, even though removing many pages makes the resulting file smaller, after shrinking both the original file and the new version (using various methods found on the internet), the original file usually ends up smaller by several hundred KB or more. So it appears the method I'm using to remove the blank pages doesn't create an optimal PDF. I've also tried various GUI programs and see the same results in this respect.
Partial answer: you don't need to go via PostScript (this is probably the reason why you get a bigger file). One possibility is
pdftk orig.pdf cat "$range_of_non_blank_pages" output new.pdf
To identify blank pages, you'd need to use a tool that can go beyond selecting and reassembling pages. Try a library for a scripting language, for example CAM::PDF or PDF::API2 in Perl.
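The same page-by-page test can be sketched in Python with pypdf instead of a Perl library (my substitution, not the modules named above). In this sketch a page counts as blank only if it has no images and its extracted text is empty or just the "intentionally left blank" notice:

    from pypdf import PdfReader, PdfWriter

    def strip_blank_pages(src, dst):
        reader = PdfReader(src)
        writer = PdfWriter()
        for page in reader.pages:
            text = (page.extract_text() or "").strip()
            is_blank = (not text or "intentionally left blank" in text.lower()) \
                       and len(page.images) == 0   # image-only pages are not blank
            if not is_blank:
                writer.add_page(page)
        with open(dst, "wb") as f:
            writer.write(f)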
I don't know of an open-source solution that can detect and remove blank pages. However, Apago's commercial PDF Enhancer can automatically remove blank pages -- both vector and scanned. For scanned pages, it can remove scan artifacts such as black edges, hole punches and noise prior to determining whether a page is blank.

Batch Decollating Directory of PDFs and Bar Code Imposing

I have a directory of PDFs. They are all different, but they all have 5 pages. I need to insert a bar code on each page of each PDF. After this process I need to combine and decollate every PDF. Essentially there would be 5 different PDFs created: the first would contain all the page ones from every PDF, the second all the page twos, and so on.
I need to find a tool, or a toolset, that would allow me to accomplish this. I'm willing to program my own solution but I'm not even sure what would be the most efficient language to attack it with.
What I ended up doing was using Perl with the PDF::Reuse and PDF::Reuse::BarCode libraries to get all PDFs in the directory, pull the pages I wanted, add the barcode, and save out to a new PDF.
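For a Python take on the decollation half (the barcode stamping aside), a rough pypdf sketch follows; the directory name and the fixed five pages per file are assumptions taken from the question:

    import glob
    from pypdf import PdfReader, PdfWriter

    PAGES_PER_DOC = 5
    writers = [PdfWriter() for _ in range(PAGES_PER_DOC)]

    # Page N of every input PDF goes into output file N
    for path in sorted(glob.glob("input_dir/*.pdf")):
        reader = PdfReader(path)
        for n in range(PAGES_PER_DOC):
            writers[n].add_page(reader.pages[n])

    for n, writer in enumerate(writers, start=1):
        with open(f"collated_page_{n}.pdf", "wb") as f:
            writer.write(f)

Barcode stamping could be layered on top by drawing the barcode onto an overlay page (reportlab ships a barcode module) and merging that overlay onto each page with pypdf's merge_page, analogous to what PDF::Reuse::BarCode does in the Perl version.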