How can I add an interactive "table of contents" to a scanned PDF?

I'm trying to go from a paper document to a searchable pdf with a table of contents.
Sometimes you download a PDF book or document (for example, the Intel Manual, which can be seen below). This document is searchable and also has a table of contents. When you put this same document on Google Drive and then open it with PDF Expert on an iPad, it is still searchable and still has a table of contents. This is what I'd like to do with all my scanned PDFs.
Now a more concrete example. Shown below is a document that I've scanned with the Fujitsu ScanSnap. It's also searchable, thanks to some software that comes with the ScanSnap. So now I have a searchable PDF that can be opened locally or on my iPad, but it doesn't have a table of contents. So my main question is: how can I add a table of contents like the one in the Intel Manual to a scanned PDF?
It seems like there are tons of people doing different things with a "table of contents". For example, people who design documents use InDesign. I think that what I'm trying to do must be simpler than that. I'm thinking there has to be an easy way to do this using, say, Adobe Acrobat Pro: something about adding "bookmarks", "links", or "tags" to the existing table of contents. Do you know of a clear and concise way to do this using Acrobat or some other software?
Thanks for the help.

JPdfBookmarks can work for scanned books.
Step 1: Prepare the table of contents
Save the TOC in a .txt file in this format:
Chapter 1. The Beginning/23
	Para 1.1 Child of The Beginning/25,FitWidth,96
		Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
	Para 2.1 Child of The Beginning/32,FitPage
You can OCR the TOC and use regex to fix it.
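As a rough illustration of that regex clean-up, here is a minimal Python sketch. It assumes each OCR'd contents line looks like a title followed by dot leaders or spaces and then the printed page number; the pattern and the function name ocr_line_to_bookmark are hypothetical, and real OCR output will probably need extra rules. It does not try to recover the nesting of child entries.

```python
import re

# Assumed shape of an OCR'd contents line: a title, then dot leaders
# and/or spaces, then the printed page number at the end of the line.
TOC_LINE = re.compile(r"^(?P<title>.*?\S)[\s.]+(?P<page>\d+)\s*$")


def ocr_line_to_bookmark(line):
    """Return 'Title/page' for a recognisable line, or None otherwise."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None  # leave this line for manual fixing
    return f"{match.group('title')}/{match.group('page')}"


if __name__ == "__main__":
    for raw in ["Chapter 1. The Beginning ........ 23", "Preface"]:
        print(repr(raw), "->", ocr_line_to_bookmark(raw))
```

Lines that return None are the ones to fix by hand before saving the .txt file.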
Step 2: Load that TOC
Step 3: Prepare for step 4
This sounds dumb, but if you miss it you will be frustrated and have to do it again. Expand all bookmarks (Ctrl + E), select all of them, then go to Tools → Apply Page Offset
Step 4: Apply page offset
This step should be self-explanatory. Don’t forget to save.
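If you would rather shift the page numbers in the .txt file before loading it, the same offset can be applied with a few lines of Python. This is only a sketch that assumes every bookmark line has the "Title/page" or "Title/page,view,..." shape shown in Step 1; shift_bookmark_pages is a hypothetical helper, not part of the program.

```python
def shift_bookmark_pages(lines, offset):
    """Add `offset` to the page number of each 'Title/page[,view...]' line.

    Leading whitespace (used for nesting) stays intact because it is part
    of the text before the last '/'.
    """
    shifted = []
    for line in lines:
        line = line.rstrip("\n")
        head, slash, tail = line.rpartition("/")
        page, comma, view = tail.partition(",")
        if not slash or not page.strip().isdigit():
            shifted.append(line)  # unrecognised line: keep it unchanged
            continue
        shifted.append(f"{head}/{int(page) + offset}{comma}{view}")
    return shifted


if __name__ == "__main__":
    sample = ["Chapter 1. The Beginning/23",
              "Para 1.1 Child of The Beginning/25,FitWidth,96"]
    # Example: the scan has 12 unnumbered pages before printed page 1.
    print("\n".join(shift_bookmark_pages(sample, 12)))
```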
That’s it. You are done. For more information, you can read its manual. The program has a command-line mode and can work on Linux and Mac.
If there are non-Roman characters, be sure to use the same encoding when dumping and applying bookmarks.
I also have a complete guide to process scanned books, you may want to check it out: The ultimate guide to process scanned books.
FYI:
• How to OCR tables of contents to proper outputs?
• How can I split in half a double-page scanned PDF in a single pass?

I have done this before by combining multiple "booklets". Each "Chapter" was a series of pages combined in Adobe Acrobat Pro. I would combine chapters into separate "booklets" and then name them a chapter name, and then combine all chapters into a new booklet.

Related

How can I edit the search text of a searchable PDF?

I have access to a scanner at my library which can create "searchable PDFs." These are PDFs that show the exact image of a scanned document, but there is a kind of hidden text in the PDF that can be selected when you try to select a portion of the image that contains text. In this way you can copy and paste text or search for text in the scanned document. This is VERY useful. It's an awesome improvement over raw scanned images. I also have several apps on my Mac that can create this kind of searchable PDF from a scanned document or a raw image.
Now it's obvious to anyone who has ever used OCR that the process of converting images to text is not 100% accurate, so the text that you search or copy will not be correct in some places.
So I have searched for quite some time to find an application that would load a searchable PDF and allow me to repair the hidden searchable text without reformatting or modifying the original scanned image.
Does anyone know of a tool (or library API) that would allow this?
It's worth saying here that I tried the latest version of Adobe Acrobat DC for Mac, and it doesn't seem to even allow me to view the hidden searchable text, much less edit it. It does allow me to replace the scanned image with the results of its own OCR process so that I could edit and save the document. But this would produce horrible results for any of the scanned documents that I am using. It seems designed for editing a "native PDF", not for editing a scanned document.
I have also tried ABBYY FineReader with no luck.
I'm using ABBYY FineReader 12 Professional (not open source).
Just open a scanned image or scanned PDF and press Verify Text (or Ctrl + F7), then go over all the spelling errors or low-confidence characters and fix them.
The program is very good: it shows you the exact place in the image/PDF to correct and the OCR guess side by side for convenience. It iterates through all of them.
[By the way, I'm using the shortcuts to speed up things:
Alt+Enter to add the unrecognized word to dictionary.
Ctrl+Delete to skip word or confirm in case you fixed it.]
Then save the document as a PDF file (Menu: File > Save Document As > PDF File), and you can search it in any PDF reader. The saved file looks the same as the scanned one, but 'behind' it there is text.
It's weird that you tried ABBYY with no luck... it's working great for me. Maybe you did not try the Professional version.
Hope it helps you.
The poster is not after creating a searchable PDF from images; he wants to start with an already searchable PDF and modify its text (e.g. because a searchable PDF was made initially, but an overlooked recognition error was found later and needs correction). I see no way and no tool that assists in doing this.

Adobe Acrobat X Pro - Export PDF to Text accessible

I'm working on a PDF-reading project in C# 2012 with Aspose, but I'm struggling to read pages that have several columns.
For example, a PDF can have one column on page 1, two columns on page 3, five columns on another page, and only one column from the sixth page onwards.
I cannot know in advance how many columns a given page will have.
I thought about using the Adobe Acrobat X Pro SDK to export to accessible .txt (with page-break markers), but I do not know if that is the best solution.
Using Save As in Acrobat (not via the SDK), I noticed that it does not export correctly. Am I doing something wrong?
Can anyone guide me?
PS: I used iTextSharp initially; however, Aspose proved to be a far better tool than iTextSharp.

A Table of Contents Page for a Scanned PDF

I was given some really old but very useful hand-written notes recently and in a bid to preserve them, I had them scanned into a file in the PDF format. What I have is a 35 page PDF but I want to add a contents page at the beginning so that I can use the first page to click my way to a specific topic.
More precisely,
I want a page which says
Topic 1
Topic 2
Topic 3
...
Each one should be linked to a page of my choosing.
I've explored a lot of standard tools out there to help me with this, like LibreOffice, pdftk, etc., but the solution does not appear to be in the form of a simple application and a few clicks. My hunch is that this will require a program written in a suitable language. I'd want this program to work as follows:
ProgramName Input.pdf CustomTOC.txt
Where CustomTOC.txt could be a simple ASCII table containing two columns, one column being the title and the second column being the page number. The output of this program will be another PDF file which contains one page appended at the beginning of Input.pdf containing a table of contents with hyperlinks to the right pages.
I have managed to solve this problem, though I don't think this is the best way to do it. I have written a Python program that accepts two mandatory inputs: the input PDF file and a '|'-separated ASCII table containing titles and page numbers. A third, optional argument is the name of the PDF file to write the output to. If this is not provided, the original input file is overwritten.
How the code works? Uses a system call to 'pdftk' for bursting the PDF file into its constituent pages. Writes a .tex file which contains a \listoffigures command for the first page with the package hyperref ensuring it links to the figures. The later part of the .tex code contains several figure insertion statements where the PDF file corresponding to each page is inserted, providing captions only to those PDFs for which there is an entry in the provided TOC table.
Why the code is not ideal? It relies on too many dependencies. It relies on a system call to the pdftk package, it requires that LaTeX be also installed on the machine with the graphics package. In the current version of the code, the PDFs on each page do have some offset which I am trying to solve using geometry package with custom margin settings. I will try to post the code once this problem is solved.
A more ideal solution would not require LaTeX and would use some PDF library within Python to achieve the same effect. Comments and suggestions welcome!
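One possible direction, offered as a sketch rather than a finished answer: the third-party pypdf package can attach outline entries (bookmarks) to an existing PDF without LaTeX or pdftk. Note that this is a different technique from the contents page described above. It produces a clickable outline in the viewer's sidebar, not a printed first page with hyperlinks. The '|'-separated table format follows the description above; parse_toc and add_outline are hypothetical names.

```python
def parse_toc(text):
    """Parse 'Title|page' lines into (title, page) tuples; pages are 1-based."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        title, _, page = line.rpartition("|")
        entries.append((title.strip(), int(page)))
    return entries


def add_outline(input_pdf, toc_text, output_pdf):
    """Copy input_pdf to output_pdf and add one outline entry per TOC line."""
    # pypdf is a third-party package (pip install pypdf); it is imported
    # here so that parse_toc stays usable without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    for title, page in parse_toc(toc_text):
        # add_outline_item expects a 0-based page index.
        writer.add_outline_item(title, page - 1)
    writer.write(output_pdf)


if __name__ == "__main__":
    print(parse_toc("Topic 1|3\nTopic 2|10\n"))
```

A call such as add_outline("Input.pdf", open("CustomTOC.txt").read(), "Output.pdf") would then write the bookmarked copy, assuming those files exist.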

Accessibility concerns for website providing massive amounts of PDFs

I am working on a website that provides a massive number of PDFs for download, and I am trying to improve the website's accessibility. All I can think of is:
1. Provide equivalent content for the PDFs when possible (text or HTML, for example).
2. Provide a description of each PDF document before the user can download it.
3. Make it possible to search within the PDF files when users use the website search.
4. Make the links to the PDFs labelled by a nice icon.
5. Inform the users that they will need a third party application (Acrobat or other PDF viewers) in order to open the documents.
Are there other ways to improve it?
Like Jared said, assistive technology works decently with PDFs. The question is what kind of quality control you have. There are a few different ways of putting together a PDF. One way is scanning a document, and the result is a PDF made out of images. When assistive technology hits it, all it says is image image image, great help right?
The second way is the Optical Character Recognition ability that Adobe built in, which has improved over the years but is still far from perfect. For example, I was given a PDF that had OCR on it. One of the first lines had the word Articles, in italics, and the OCR spit out Art/e5. The third way is to produce PDFs containing actual text. Office 2007 and 2010 have the ability to save as a PDF. Before hitting save, click the options button and ensure the "document tags for accessibility" box is checked.
PDFs have a tag structure, like HTML, found via the Tags panel/pane. The output in 2010 is a bit cleaner than in 2007, but I still recommend something like CommonLook Office to create your PDFs.
4. Make the links to the PDFs labelled by a nice icon.
You could put an icon within the link. Some people do:
Link text <img src=".." alt="PDF icon"/>
Some people using assistive tech just browse via links, so they won't know it is a PDF before they open it. So, it is better to do:
Link text <img src="" alt="PDF"/>
5. Inform the users that they will need a third party application (Acrobat or other PDF viewers) in order to open the documents.
It is a good idea to do this, in fact Section 508 requirements say to do this. I recommend linking to Adobe Reader for two reasons.
1- If the person does not have a PDF viewer, they'll probably call their "computer expert", who has probably heard of Adobe Reader and knows the site isn't pushing some ad-ware.
2- Adobe Reader has the most built-in accessibility of the readers out there, to my knowledge. So why would you not give the best?
There are several things you can do to improve the accessibility of the PDFs themselves.
Provide "Alternate Descriptions" for images
Provide "Replacement Text" for items such as equations or abbreviations
Replacement Text can also be used to hint at the pronunciation of names
Mark the language, especially if it is mixed
This will assist a screen reader in properly understanding the PDF. This isn't crucial for pages that contain only text in regular paragraph layout - the reader can usually figure things out. If there are pictures, captions, jargon, names, etc, this will greatly improve the reader's performance.

How to delete first page from multiple PDFs

I have a collection of PDFs that sometimes have an info page as the first page of the document, which I want to remove.
Is there a quick way to delete this info page from all of my PDFs, or at least a way to show all PDFs that have more than one page so I can better find the ones that need to be fixed?
Do you know of any program that can do this? Or a way to do this with Python?
Note: The info page has text on it that always remains the same: "LAND TITLE OFFICE"
Using Windows 7 OS
Thanks
Some Research turned up the following:
http://www.python.org/workshops/2002-02/papers/17/index.htm
http://www.unixuser.org/~euske/python/pdfminer/index.html
https://pypi.org/project/pypdf/
You can try these two ways:
PdfTK is a utility to manipulate PDFs. Check this link, they are doing something similar to what you need (in the comments someone also posted a script for Windows).
PDFsam is a powerful graphical tool to manipulate PDFs in bulk. The split+merge sections should do the trick.
Both of them are free. I'd suggest studying the first if you want to write a "recipe" that you can use often, but the latter if you only have to do it once.
You can use the open-source PDFBox as a command line utility to split PDFs.
The link for PDFBox is here: link
The documentation for splitting a PDF using PDFBox is here: link
You could use the PDFBox extract text functionality from a batch script and combine with grep to identify pages that contain the text you are looking for. The extract text documentation is here: link
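Since the question also asks about Python, the same idea (extract the text of page one, check for the fixed phrase, drop the page) can be sketched with the third-party pypdf package. This is a minimal sketch under some assumptions: the PDFs contain extractable text (a pure image scan without OCR would yield no text), the extracted text contains the phrase exactly as written, and the helper names has_info_page and strip_info_pages are hypothetical. It writes cleaned copies to a separate folder rather than overwriting the originals.

```python
from pathlib import Path

MARKER = "LAND TITLE OFFICE"  # phrase the question says is always on the info page


def has_info_page(first_page_text, page_count, marker=MARKER):
    """True when the document has more than one page and page 1 carries the marker."""
    return page_count > 1 and marker in (first_page_text or "")


def strip_info_pages(folder, out_folder):
    """Write copies of matching PDFs, minus their first page, to out_folder."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    out_dir = Path(out_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        reader = PdfReader(pdf_path)
        page_count = len(reader.pages)
        if page_count < 2:
            continue  # single-page (or empty) files are left alone
        if not has_info_page(reader.pages[0].extract_text(), page_count):
            continue
        writer = PdfWriter()
        for index in range(1, page_count):  # skip page index 0, the info page
            writer.add_page(reader.pages[index])
        writer.write(out_dir / pdf_path.name)
        print(f"stripped info page from {pdf_path.name}")


if __name__ == "__main__":
    print(has_info_page("LAND TITLE OFFICE\nParcel 12", 3))
```

Files that are skipped stay untouched in the source folder, so the output folder doubles as a list of the documents that had an info page.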