How to delete first page from multiple PDFs - pdf

I have a collection of PDFs that sometimes have an info page as the first page of the document, which I want to remove.
Is there a quick way to delete this info page from all of my PDFs, or at least a way to list all PDFs that have more than one page so I can more easily find the ones that need to be fixed?
Do you know of any program that can do this? Or a way to do this with Python?
Note: The info page has text on it that always remains the same: "LAND TITLE OFFICE"
Using Windows 7 OS
Thanks
Some research turned up the following:
http://www.python.org/workshops/2002-02/papers/17/index.htm
http://www.unixuser.org/~euske/python/pdfminer/index.html
https://pypi.org/project/pypdf/

You can try these two ways:
PdfTK is a utility for manipulating PDFs. Check this link; they are doing something similar to what you need (in the comments someone also posted a script for Windows).
PDFsam is a powerful graphical tool for manipulating PDFs in bulk. The split and merge sections should do the trick.
Both of them are free. I'd suggest studying the first if you want to write a "recipe" that you can reuse often, and the latter if you only have to do it once.

You can use the open-source PDFBox as a command-line utility to split PDFs.
The link for PDFBox is here: link
The documentation for splitting a PDF using PDFBox is here: link
You could call PDFBox's text-extraction functionality from a batch script and combine it with grep to identify pages that contain the text you are looking for. The text-extraction documentation is here: link
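If you would rather stay in Python, the same idea can be sketched with pypdf (one of the libraries linked in the question). This is a minimal sketch, not a tested tool: it assumes pypdf is installed, that the info page has a real text layer (a scanned image would need OCR first), and that the extracted text contains the marker exactly as written. The function names and the folder layout are made up for illustration.

```python
from pathlib import Path

MARKER = "LAND TITLE OFFICE"  # text the question says is always on the info page


def should_drop_first_page(first_page_text, page_count, marker=MARKER):
    """Drop page 1 only if it carries the marker and other pages remain."""
    return page_count > 1 and marker in (first_page_text or "")


def strip_info_page(src, dst, marker=MARKER):
    """Copy src to dst without its first page when that page is the info page.

    Returns True if a page was removed, False if nothing was written.
    """
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    pages = list(reader.pages)
    text = pages[0].extract_text() if pages else ""
    if not should_drop_first_page(text, len(pages), marker):
        return False
    writer = PdfWriter()
    for page in pages[1:]:
        writer.add_page(page)
    with open(dst, "wb") as fh:
        writer.write(fh)
    return True


def process_folder(folder, out_folder):
    """Run strip_info_page on every *.pdf in folder (hypothetical layout)."""
    out = Path(out_folder)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(folder).glob("*.pdf")):
        changed = strip_info_page(pdf, out / pdf.name)
        print(f"{pdf.name}: {'first page removed' if changed else 'unchanged'}")
```

Something like `process_folder(r"C:\pdfs", r"C:\pdfs\fixed")` would then write the trimmed copies to a separate folder and leave the originals untouched, which makes it easy to check the result before deleting anything.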

Related

Does anyone know of a technology that allows one to edit the tags on pdfs?

I am looking to programmatically edit the tags in a PDF document. In particular I would like to be able to copy tags from one document to another, and edit them as I copy them over.
I have looked at Coherent PDF, Python's pdfrw and Python's pdfedit and have not been successful. I am creating the PDFs in LaTeX, so any LaTeX-based solution would be amazing, but I have not come up with anything that allows me to create tags.
Any advice?

A Table of Contents Page for a Scanned PDF

I was given some really old but very useful hand-written notes recently and in a bid to preserve them, I had them scanned into a file in the PDF format. What I have is a 35 page PDF but I want to add a contents page at the beginning so that I can use the first page to click my way to a specific topic.
More precisely,
I want a page which says
Topic 1
Topic 2
Topic 3
...
Each one should be linked to a page of my choosing.
I've explored a lot of standard tools out there to help me with this, like LibreOffice, pdftk etc., but the solution does not appear to come in the form of a simple application and a few clicks. My hunch is that this will require a program written in a suitable language. I'd want this program to work as follows:
ProgramName Input.pdf CustomTOC.txt
Where CustomTOC.txt could be a simple ASCII table containing two columns, one column being the title and the second column being the page number. The output of this program will be another PDF file which contains one page appended at the beginning of Input.pdf containing a table of contents with hyperlinks to the right pages.
I have managed to solve this problem, though I don't think this is the best way to do it. I have written a Python program that accepts two mandatory inputs: the input PDF file and a '|'-separated ASCII table containing titles and page numbers. A third, optional argument can be the name of the output PDF file. If it is not provided, the original input file is overwritten.
How does the code work? It uses a system call to 'pdftk' to burst the PDF file into its constituent pages. It then writes a .tex file which contains a \listoffigures command for the first page, with the hyperref package ensuring it links to the figures. The latter part of the .tex code contains several figure insertion statements where the PDF file corresponding to each page is inserted, providing captions only to those PDFs for which there is an entry in the provided TOC table.
Why is the code not ideal? It relies on too many dependencies. It relies on a system call to the pdftk package, and it requires that LaTeX also be installed on the machine with the graphics package. In the current version of the code, the PDFs on each page have some offset, which I am trying to fix using the geometry package with custom margin settings. I will try to post the code once this problem is solved.
A more ideal solution would not require LaTeX and would use some PDF library within Python to achieve the same effect. Comments and suggestions welcome!
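As a rough idea of what a LaTeX-free version could look like, here is a sketch that reads the same '|'-separated table and uses pypdf to add clickable bookmarks (the PDF outline) instead of a printed contents page. This is a different technique from the \listoffigures approach described above, not a reimplementation of it. It assumes pypdf is installed and that page numbers in the table are 1-based. The function names are hypothetical.

```python
def parse_toc(text):
    """Parse 'Title|page' lines into (title, page) tuples; pages are 1-based."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blank or malformed lines
        title, _, page = line.rpartition("|")
        entries.append((title.strip(), int(page)))
    return entries


def add_bookmarks(src, toc_path, dst):
    """Copy src to dst and add one outline entry per row of the TOC table."""
    # Imported here so parse_toc can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    with open(toc_path, encoding="utf-8") as fh:
        entries = parse_toc(fh.read())
    for title, page in entries:
        # pypdf counts pages from 0, the table counts from 1
        writer.add_outline_item(title, page - 1)
    with open(dst, "wb") as fh:
        writer.write(fh)
```

Called as `add_bookmarks("Input.pdf", "CustomTOC.txt", "Output.pdf")`, this gives a bookmark pane that most viewers show in a sidebar. A real first-page contents list with link annotations would need more work.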

How can I add an interactive "table of contents" to a scanned pdf?

I'm trying to go from a paper document to a searchable pdf with a table of contents.
Sometimes you will download a PDF book or document (for example the Intel Manual, which can be seen below). This document is searchable and it also has a table of contents. Now, when you put this same document on Google Drive and then open it up with PDF Expert on an iPad, it is still searchable with a table of contents. This is what I'd like to do with all my scanned PDFs.
Now a more concrete example. Shown below is a document that I've scanned with the Fujitsu ScanSnap. It's also searchable thanks to some software that comes with the ScanSnap. So now I have a searchable PDF that can be opened up locally or on my iPad, but it doesn't have a table of contents. So my main question is: How can I add a table of contents like the one in the Intel Manual to a scanned PDF?
It seems like there are tons of people doing different things with "table of contents". For example, people who are designing documents use InDesign. I think that what I'm trying to do must be simpler than that. I'm thinking that there has to be an easy way to do this using, say, Adobe Acrobat Pro? Something about adding "bookmarks" or "links" or "tags" to the existing table of contents. Do you know of a clear and concise way to do this using Acrobat or some other software?
Thanks for the help
JPdfBookmarks can work for scanned books.
Step 1: Prepare the table of contents
Save the TOC in a .txt file in this format:
Chapter 1. The Beginning/23
Para 1.1 Child of The Beginning/25,FitWidth,96
Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
Para 2.1 Child of The Beginning/32,FitPage
You can OCR the TOC and use regex to fix it.
Step 2: Load that TOC
Step 3: Prepare for step 4
This sounds dumb, but if you miss it you will be frustrated and have to do it again. Expand all bookmarks (Ctrl + E), select all of them, then go to Tools → Apply Page Offset
Step 4: Apply page offset
This step should be self-explanatory. Don't forget to save.
That's it. You are done. For more information, you can read its manual. The program has a command-line mode and also works on Linux and Mac.
If there are non-Roman characters, be sure to use the same encoding when dumping and applying bookmarks.
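If you prefer to fix the page numbers in the .txt file before loading it, the offset step can be approximated with a few lines of Python. This is only a sketch of the idea, assuming every bookmark line looks like the examples above (title, a slash, the page number, then optional comma-separated view settings). It is not how the program itself applies the offset, and the function names are made up.

```python
def shift_line(line, offset):
    """Add offset to the page number in one 'Title/page[,view,...]' line."""
    head, sep, tail = line.rpartition("/")  # the last '/' separates title and page
    if not sep:
        return line  # no page part, leave the line untouched
    page, comma, rest = tail.partition(",")
    if not page.strip().isdigit():
        return line  # not a page number after all
    return f"{head}/{int(page) + offset}{comma}{rest}"


def shift_toc(text, offset):
    """Apply shift_line to every line of a bookmarks dump."""
    return "\n".join(shift_line(line, offset) for line in text.splitlines())


if __name__ == "__main__":
    sample = "Chapter 1. The Beginning/23\nPara 1.1 Child of The Beginning/25,FitWidth,96"
    # e.g. the scanned book has 10 unnumbered front-matter pages
    print(shift_toc(sample, 10))
```

Leading whitespace in a line is kept as-is, so any indentation in the file survives the rewrite.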
I also have a complete guide to process scanned books, you may want to check it out: The ultimate guide to process scanned books.
FYI:
• How to OCR tables of contents to proper outputs?
• How can I split in half a double-page scanned PDF in a single pass?
I have done this before by combining multiple "booklets". Each "Chapter" was a series of pages combined in Adobe Acrobat Pro. I would combine chapters into separate "booklets" and then name them a chapter name, and then combine all chapters into a new booklet.

Get text from a pdf in NSString

I am trying to make an iOS app which would extract plain text from a PDF file and display it in a UITextView. It is not simply a PDF reader for viewing a PDF file; I would later like to perform certain operations on that text.
I have already googled a lot but am still not able to find an exact solution.
I already tried using https://github.com/zachron/pdfiphone
but the files use the ARMv6 architecture, which seems obsolete with Xcode 4.5.
If anyone can suggest some exact and non-confusing code using the Quartz 2D framework of iOS, that would be great.
Here is some sample code to extract text from a PDF; I hope it helps.
https://github.com/zachron/pdfiphone
This is a library to get the text out of a PDF for the iPhone.
Another demo uses OCR technology; find the link below.
https://github.com/nolanbrown/Tesseract-iPhone-Demo
Also check this page of the Quartz 2D Programming Guide; it covers everything you need to open and parse a PDF file in iOS. Note that it is not a simple task, since there's no method to extract the full text in one line. You have to work with the data as an input stream, using a CGPDFScanner.
Two other libraries:
https://github.com/KurtCode/PDFKitten/
https://github.com/mobfarm/FastPdfKit
This question comes up all the time. It is VERY hard to extract text from PDF in general. The PDF specification is not designed with text extraction in mind. There are many libraries that try to do the job, essentially by reconstructing the text from the geometric placement of the individual glyphs. These libraries have varying degrees of success, but will all fail on certain PDF documents. In fact, some PDF documents have glyphs but no way to associate a glyph with a character. For these documents it is simply not possible to extract text, short of using some kind of OCR approach.
PDF is designed as a read-only format that is portable in the sense that a PDF document will be rendered identically on any platform. That is what it is best at, and what it should be used for.
If text is to be edited, do not use PDF.
Here (Extracting text from pdf using objective-c) I found an answer to your question, and it works, but not as well as I need :(
It can extract only ASCII.
It returns only one paragraph.
Good luck.

web based form to collect data and populate to a fillable PDF

Is there a script that anyone can suggest that would allow me to create an HTML or PHP web-based form to collect data and save that data, then call the data to be populated in a fillable PDF?
If you have an existing PDF that you want to populate, and that PDF just has text fields (no checkboxes or radio buttons) then CAM::PDF may be able to help you. You can use it as a Perl library directly, or use its command-line interface. CAM::PDF is not useful for generating PDFs from scratch, however. Furthermore, if you have embedded fonts, then you need to ensure that all of the characters you plan to insert are represented in the embedded font.
Use a normal web page to get the data. If you are not sure how to do it, look for "PHP forms" on Google; there are plenty of tutorials.
Then use a PHP PDF generator, like this one, to create the PDF file. If you look hard enough, you will probably find a PDF generator that will let you use a template with placeholders where the entered data should go.
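For the "existing fillable PDF" case, the fill-in step can also be sketched in Python with pypdf rather than Perl or PHP. This is a minimal sketch under several assumptions: pypdf is installed, the PDF only has plain text fields, and the submitted form keys match the PDF field names. The function names and the `submitted` dictionary are hypothetical, and collecting the data from the web form is left out.

```python
def select_field_values(submitted, field_names):
    """Keep only submitted values whose key is a known PDF field; stringify them."""
    return {name: str(submitted[name]) for name in field_names if name in submitted}


def fill_pdf_form(template, submitted, dst):
    """Write a copy of template to dst with its text fields filled from submitted."""
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(template)
    writer = PdfWriter()
    writer.append(reader)  # copies the pages together with the form definition

    # Names of the text fields that actually exist in the template
    fields = reader.get_form_text_fields() or {}
    values = select_field_values(submitted, fields)

    for page in writer.pages:
        if "/Annots" in page:  # only pages that carry form widgets
            writer.update_page_form_field_values(page, values)

    with open(dst, "wb") as fh:
        writer.write(fh)
```

Some viewers only show the new values after the field appearances are regenerated, so it is worth opening the output in the viewer your users will have before relying on this.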