ghostscript extract pages containing a text string - pdf

I need to programmatically extract, from a multipage PDF, only the pages containing a given text string. Is this possible with Ghostscript, or do I need some other tool? I'm working on AIX.
Thanks in advance.

First, Ghostscript doesn't extract pages from PDF files. It creates brand new PDF files whose visual appearance should be the same as the original, but whose content will be different.
There is no way to do this with Ghostscript in a single pass. You could use the txtwrite device to extract the text, grep through the output files for the text you want, note the page numbers, and then run another pass to write those pages into new files.
Be aware that extracting text from a PDF file is far from guaranteed to work! That was not the intent of the original PDF format.
Also note that Ghostscript currently only allows for handling a single range of pages, First->Last, so if you have a discontinuous set (e.g. pages 1, 3, 5, 7) then you will have to run this step multiple times.
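The two-pass approach described above could be sketched in Python roughly as follows. This is only a sketch under assumptions: that the gs executable is on the PATH, that the txtwrite output for your file actually contains the string, and that the helper names and the page-%d.txt / match-N-M.pdf file names (all made up for this example) suit your setup. Matching pages are grouped into consecutive runs so that each run needs only one First->Last pass.

```python
import re
import subprocess
from pathlib import Path


def txtwrite_cmd(pdf, out_pattern):
    """Ghostscript command writing one text file per page (%d = page number)."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=txtwrite",
            f"-sOutputFile={out_pattern}", pdf]


def pdfwrite_cmd(pdf, first, last, out_pdf):
    """Ghostscript command rewriting pages first..last into a new PDF."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-dFirstPage={first}", f"-dLastPage={last}",
            f"-sOutputFile={out_pdf}", pdf]


def group_ranges(pages):
    """Collapse page numbers into consecutive (first, last) ranges."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


def matching_pages(text_dir, needle):
    """Return the page numbers whose extracted text contains needle."""
    hits = []
    for path in Path(text_dir).glob("page-*.txt"):
        number = int(re.search(r"page-(\d+)\.txt$", path.name).group(1))
        if needle in path.read_text(errors="replace"):
            hits.append(number)
    return sorted(hits)


def extract(pdf, needle, work_dir="gs_pages"):
    """Pass 1: dump text per page. Pass 2: one pdfwrite run per page range."""
    Path(work_dir).mkdir(exist_ok=True)
    subprocess.run(txtwrite_cmd(pdf, f"{work_dir}/page-%d.txt"), check=True)
    for first, last in group_ranges(matching_pages(work_dir, needle)):
        out = f"{work_dir}/match-{first}-{last}.pdf"
        subprocess.run(pdfwrite_cmd(pdf, first, last, out), check=True)
```

Calling extract("input.pdf", "some string") would leave one PDF per run of matching pages in the work directory. A string that is split across lines or pages in the extracted text will not be found by the plain substring test.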

Related

How to get text from all pages in a PDF using textract?

I have to use Python to automate getting text from files, which could have any number of different file extensions that I don't know ahead of time. Textract seems to be very functional for this purpose, as it accepts a lot of file extensions, but when I use it to read PDFs, it only returns text from the first page. How do I get it to return the text from all pages in the PDF in one string?

A Table of Contents Page for a Scanned PDF

I was given some really old but very useful hand-written notes recently and in a bid to preserve them, I had them scanned into a file in the PDF format. What I have is a 35 page PDF but I want to add a contents page at the beginning so that I can use the first page to click my way to a specific topic.
More precisely,
I want a page which says
Topic 1
Topic 2
Topic 3
...
Each one should be linked to a page of my choosing.
I've explored a lot of standard tools out there to help me with this, like LibreOffice, pdftk etc., but the solution does not appear to be in the form of a simple application and a few clicks. My hunch is that this will require a program written in a suitable language. I'd want this program to work as follows:
ProgramName Input.pdf CustomTOC.txt
Where CustomTOC.txt could be a simple ASCII table containing two columns, one column being the title and the second column being the page number. The output of this program will be another PDF file which contains one page appended at the beginning of Input.pdf containing a table of contents with hyperlinks to the right pages.
I have managed to solve this problem, though I don't think this is the best way to do it. I have written a Python program that accepts two mandatory inputs: the input PDF file and a '|'-separated ASCII table containing titles and page numbers. A third, optional argument is the name of the PDF file to write the output to. If it is not provided, the original input file is overwritten.
How does the code work? It uses a system call to 'pdftk' to burst the PDF file into its constituent pages. It then writes a .tex file which contains a \listoffigures command for the first page, with the hyperref package ensuring it links to the figures. The later part of the .tex code contains several figure insertion statements where the PDF file corresponding to each page is inserted, providing captions only to those PDFs for which there is an entry in the provided TOC table.
Why is the code not ideal? It relies on too many dependencies. It relies on a system call to the pdftk package, and it requires that LaTeX also be installed on the machine with the graphics package. In the current version of the code, the PDFs on each page have some offset, which I am trying to solve using the geometry package with custom margin settings. I will try to post the code once this problem is solved.
A more ideal solution would be one that does not require LaTeX and can use some PDF library within Python to achieve the same effect. Comments and suggestions welcome!
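Whatever PDF library ends up being used, the input side described above (a '|'-separated table of title and page number) can be handled with the standard library alone. The following is a minimal sketch, not the poster's actual program; the function names are made up for illustration, and it assumes one entry per line with the page number after the last '|'. It also shows the one piece of arithmetic that is easy to forget: once a contents page is prepended, every target page number shifts by one.

```python
def parse_toc(text):
    """Parse 'title|page' lines into (title, page) tuples, skipping blank lines."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        # rpartition splits on the LAST '|', so titles may contain '|' themselves.
        title, sep, page = line.rpartition("|")
        if not sep or not page.strip().isdigit():
            raise ValueError(f"line {lineno}: expected 'title|page', got {line!r}")
        entries.append((title.strip(), int(page)))
    return entries


def shift_for_toc_pages(entries, toc_pages=1):
    """Page numbers in the output move down by the number of prepended TOC pages."""
    return [(title, page + toc_pages) for title, page in entries]
```

For example, a CustomTOC.txt containing "Topic 1|3" would give ("Topic 1", 3) from parse_toc, and ("Topic 1", 4) as the link target in the output file after one contents page has been added.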

Recover text from PDF file when normal methods fail

I have a few hundred PDF files from which I need to extract sections of text. For many, pdftotext works fine, but for others, it misses large sections of text. If I open the PDF in Acrobat and select that text by hand and copy/paste into emacs and then view the file without an encoding, I get stuff like this:
Husband \364\200\200\272\364\200\201\213\364 etc.
How can I extract the text correctly?
I should mention that I've tried saving as text from Acrobat; also tried applying Acrobat's Document=>OCR feature before copying.
Why not convert the PDF to doc or txt first? See the guide:
http://www.aolor.com/pdf-converter/user-guide.html
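As a diagnostic aid, the escaped bytes quoted in the question can be inspected directly. The sketch below assumes the \364\200... sequences are octal escapes of UTF-8 bytes (which is how emacs normally displays raw bytes) and uses only the two complete sequences from the sample; the describe helper is made up for this example.

```python
import unicodedata


def describe(raw):
    """Decode UTF-8 bytes and report each code point with its Unicode category."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch))
            for ch in raw.decode("utf-8")]


# The two complete 4-byte sequences from the pasted sample, as octal escapes.
sample = b"\364\200\200\272\364\200\201\213"

for code_point, category in describe(sample):
    print(code_point, category)
```

Under that assumption both characters decode to code points above U+100000 with category Co (private use). That would suggest the PDF maps those glyphs to private-use code points rather than to real characters, in which case no text extractor can recover them from the file's text layer alone, and rendering the pages and running OCR on the images may be the more promising route.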

OCR within an x,y window of a pdf

I need to find an open source or Linux-based utility that allows me to set an x,y coordinate in a setup file. I would like to then sequentially open PDFs, look in the documents for first name, last name and account number, and save each file with a file name consisting of last name and file number.
You may want to read some of these answers first:
A Java Library for text extraction from PDF documents preserving empty spaces and lines
How to extract text from a PDF?
How-to extract text from a pdf doc within a specific rectangular region?
The answers above are not Linux specific.
Most PDF documents do not need to be OCR'ed as the text is contained within the PDF. The hard part is extracting it. The Java version of iText (http://itextpdf.com/) is probably the best toolkit under Linux to extract the PDF text strings. Another option may be http://pdfbox.apache.org/
If the text you need to extract is actually an image, then you will probably need to convert the whole PDF page to an image format such as TIFF and pass that into an OCR engine such as Google Tesseract OCR.
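For the render-then-OCR route, the x,y window from the setup file has to be converted from PDF units to image pixels before cropping. A rough sketch follows, assuming the window is given in PDF points (72 per inch) with the origin at the bottom-left of the page, that Ghostscript (gs) does the rendering, and that the tesseract command-line program does the OCR. The function names are hypothetical, and the cropping itself is left to whatever image tool you use.

```python
def rect_points_to_pixels(rect, page_height_pt, dpi):
    """Map a PDF rectangle (x0, y0, x1, y1 in points, origin bottom-left) to an
    image box (left, top, right, bottom) in pixels with the origin top-left."""
    x0, y0, x1, y1 = rect

    def px(value_pt):
        return round(value_pt * dpi / 72)  # 72 PDF points per inch

    left, right = sorted((px(x0), px(x1)))
    # Image rows count down from the top, so flip the y axis.
    top, bottom = sorted((px(page_height_pt - y0), px(page_height_pt - y1)))
    return (left, top, right, bottom)


def render_cmd(pdf, page, dpi, out_tif):
    """Ghostscript command rendering a single page to a greyscale TIFF."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffgray", f"-r{dpi}",
            f"-dFirstPage={page}", f"-dLastPage={page}",
            f"-sOutputFile={out_tif}", pdf]


def ocr_cmd(tif, out_base):
    """Tesseract command; the recognised text is written to out_base + '.txt'."""
    return ["tesseract", tif, out_base]
```

On a US Letter page (792 points high) rendered at 300 dpi, the window (72, 720, 288, 756) becomes the pixel box (300, 150, 1200, 300). The command lists can be passed to subprocess.run once the cropped image has been saved.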

create two pdfs from one .ps file?

I need to reformat a text file into a PDF. Using Perl, I am modifying an existing PostScript template file based on what is in the text file. Sometimes this text file will be long enough to require a two page PDF.
Can I create a two-page PDF file from one .ps file using Ghostscript? If so, what tells Ghostscript where the page break should occur?
Maybe I need to use two template files: one for a one-page PDF and another for a two-page PDF.
PostScript doesn't directly have the concept of text flows or page breaks. The showpage operator renders the page to the device, clears the page and starts a new one. PS to PDF conversion will create a new page in the PDF on this operator. If you want to chop up a PostScript file into pages, psutils is a series of programs for manipulating PostScript files.
It's down to whatever is converting your text file to create appropriate PostScript commands to handle the page break.
A page break will happen if (and only if) your PostScript template invokes showpage.
I would guess it depends on what's in your PostScript template. A PostScript file is a computer program, and page breaks are determined by the logic in the PostScript. If the two-page format is substantially the same as the one-page format, you could have your Perl script split the data up, then create two single-page files concatenated together. Ghostscript should render that file correctly.
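To make the role of showpage concrete, here is a minimal sketch of a generator that splits text into pages itself and emits one showpage per page. It is written in Python rather than Perl purely for illustration, it does not use the asker's template, and the font, margins and lines-per-page values are arbitrary assumptions.

```python
def ps_escape(text):
    """Escape backslashes and parentheses for a PostScript string literal."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_to_postscript(lines, lines_per_page=60, top=756, left=72, leading=12):
    """Emit PostScript that invokes showpage after every lines_per_page lines."""
    out = ["%!PS", "/Courier findfont 10 scalefont setfont"]
    for start in range(0, len(lines), lines_per_page):
        y = top
        for line in lines[start:start + lines_per_page]:
            out.append(f"{left} {y} moveto ({ps_escape(line)}) show")
            y -= leading
        # Ends the current page; anything drawn afterwards lands on a new one.
        out.append("showpage")
    return "\n".join(out) + "\n"
```

With the defaults, 61 input lines produce two showpage calls, so converting the resulting file with Ghostscript's pdfwrite device should give a two-page PDF, while 60 lines or fewer give a single page.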