How to get text from all pages in a PDF using textract? - pdf

I have to use python to automate getting text from files, which could have any number of different file extensions, which I don't know ahead of time. Textract seems to be very functional for this purpose, as it accepts a lot of file extensions, but when I use it to read PDFs, it only returns text from the first page. How do I get it to return the text from all pages in the PDF in one string?
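One thing that may be worth trying is a different PDF backend. The textract documentation describes a `method` keyword for PDFs (for example `method='pdfminer'` instead of the default `pdftotext`). Whether that cures the first-page-only behaviour here is an assumption, not something confirmed in this thread. A minimal sketch, with `extract_text` and `pdf_method` as hypothetical names:

```python
def extract_text(path, pdf_method="pdfminer"):
    """Return all text textract finds in `path` as a single str.

    Assumption: choosing another PDF backend changes which pages come back;
    this is a guess at the cause, not a confirmed fix.
    """
    import textract  # third-party package, imported lazily

    # Only pass the backend choice for PDFs; other formats use their defaults.
    kwargs = {"method": pdf_method} if path.lower().endswith(".pdf") else {}
    raw = textract.process(path, **kwargs)  # textract returns bytes
    return raw.decode("utf-8")
```

Because the file extension decides which parser textract uses, the same call should still work for the other file types mentioned in the question.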

Related

Generating dynamic hyperlinks in pdfs with xdocreport, odt and velocity

Good day,
I'm trying to convert some OpenOffice .odt files to PDF, and I need to fill in some elements dynamically. I use normal input fields, which work great for plain text. However, I have some text that needs to be clickable and point to a certain URL. Both the text and the URL need to be inserted dynamically; they can't be hard-coded in the .odt.
I haven't been able to find any documentation that lets me do this. There were some references on how to do it with .docx files, but none regarding .odt.
Is it even possible to dynamically create hyperlinks in an .odt that gets converted to PDF?

Using PDFBox or something else, is it possible to know if a pdf contains no scanned pages?

I'm looking for a way to detect whether a PDF document contains non-searchable text. The scenario is a multi-page PDF in which some pages contain plain text (with or without images, it doesn't matter) and one or more pages contain non-searchable text.
So I would like a method returning true/false that detects whether a PDF contains any non-searchable text (or vice versa). In your opinion, is this possible with PDFBox or something else?
Thanks
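One possible heuristic, sketched in Python rather than PDFBox: extract the text page by page and flag the pages that yield none. The sketch assumes the text comes from a tool such as pdftotext, which puts a form feed character after each page. `pages_without_text` and `min_chars` are hypothetical names, and a scanned page that carries a hidden OCR layer would not be flagged.

```python
def pages_without_text(extracted, min_chars=1):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty.

    Assumption: every page, including the last, is followed by a form feed,
    as in pdftotext output.
    """
    pages = extracted.split("\f")
    # The form feed after the last page leaves one empty trailing chunk.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [number for number, text in enumerate(pages, start=1)
            if len(text.strip()) < min_chars]
```

A true/false answer is then just `bool(pages_without_text(text))`. Raising `min_chars` helps when scanned pages still carry a short header or page number.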

ghostscript extract pages containing a text string

I need to programmatically extract, from a multi-page PDF, only the pages containing a text string. Is this possible with Ghostscript, or do I need some other tool? I'm working on AIX.
Thanks in advance.
OK, firstly, Ghostscript doesn't extract pages from PDF files. It creates brand new PDF files whose visual appearance should be the same as the original, but whose content will be different.
There is no way to do this with Ghostscript in a single pass. You could use the txtwrite device to extract the text, grep through the output files for the text you want, note the page numbers, and then run another pass to get those pages into new files.
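The two passes described above could be driven from a small script. This is only a sketch: it assumes the executable is called `gs`, that one text file per page is written using a `%d`-style output name, and that a plain substring test stands in for grep. The function names and the `page-%04d.txt` pattern are made up for illustration.

```python
import re
from pathlib import Path

def text_dump_command(pdf, out_pattern="page-%04d.txt"):
    """Pass 1: write one text file per page using the txtwrite device."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=txtwrite",
            f"-sOutputFile={out_pattern}", str(pdf)]

def matching_pages(text_dir, needle, glob="page-*.txt"):
    """Return sorted page numbers whose dumped text contains `needle`."""
    hits = []
    for path in Path(text_dir).glob(glob):
        # The page number is the digit run in the file name, e.g. page-0003.
        number = int(re.search(r"(\d+)", path.stem).group(1))
        if needle in path.read_text(errors="replace"):
            hits.append(number)
    return sorted(hits)

def page_range_command(pdf, first, last, out_pdf):
    """Pass 2: write pages first..last into a new PDF using pdfwrite."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-dFirstPage={first}", f"-dLastPage={last}",
            f"-sOutputFile={out_pdf}", str(pdf)]
```

Each command list can be handed to `subprocess.run(cmd, check=True)`. Text that spans a page boundary, or that txtwrite cannot recover, will be missed by the substring test.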
Be aware that extracting text from a PDF file is far from guaranteed to work! That was not the intent of the original PDF format.
Also note that Ghostscript currently only allows for handling a single range of pages, First->Last, so if you have a discontinuous set (e.g. pages 1, 3, 5, 7) then you will have to run this step multiple times.
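To keep the number of runs down, the matching page numbers can be collapsed into consecutive runs first, so that each run needs only one First->Last invocation. A small sketch (`contiguous_ranges` is a hypothetical helper name):

```python
def contiguous_ranges(pages):
    """Collapse page numbers into (first, last) runs of consecutive pages."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Extend the current run by one page.
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Start a new run at this page.
            ranges.append((page, page))
    return ranges

# contiguous_ranges([1, 3, 5, 6, 7]) -> [(1, 1), (3, 3), (5, 7)]
```

Each pair maps onto one Ghostscript run with `-dFirstPage` and `-dLastPage`, producing one output file per run.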

Recover text from PDF file when normal methods fail

I have a few hundred PDF files from which I need to extract sections of text. For many, pdftotext works fine, but for others, it misses large sections of text. If I open the PDF in Acrobat and select that text by hand and copy/paste into emacs and then view the file without an encoding, I get stuff like this:
Husband \364\200\200\272\364\200\201\213\364 etc.
How can I extract the text correctly?
I should mention that I've tried saving as text from Acrobat; also tried applying Acrobat's Document=>OCR feature before copying.
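The octal escapes can at least be identified. Read as UTF-8 bytes, the two complete sequences quoted above decode to code points in a Unicode private-use area. That suggests (as an assumption, not something confirmed in this thread) that the PDF's fonts carry no usable mapping from glyphs to real characters, so copy/paste and pdftotext both have nothing meaningful to emit. A quick check, with `decode_octal_escapes` as a hypothetical helper:

```python
import re
import unicodedata

def decode_octal_escapes(s):
    """Turn backslash-octal escapes such as \\364\\200 back into raw bytes."""
    return bytes(int(digits, 8) for digits in re.findall(r"\\([0-7]{3})", s))

# Only the two complete sequences from the post; the trailing \364 is cut off.
sample = r"\364\200\200\272\364\200\201\213"

for ch in decode_octal_escapes(sample).decode("utf-8"):
    # Category "Co" means "private use": no standard character is assigned.
    print(f"U+{ord(ch):06X}", unicodedata.category(ch))
```

If every missing character falls in the private-use range, the text probably cannot be recovered from the file's own encoding, and rendering the pages to images and running OCR on them may be the remaining option.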
Why not convert the PDF to doc or txt first? See the guide:
http://www.aolor.com/pdf-converter/user-guide.html

create two pdfs from one .ps file?

I need to reformat a text file into a PDF. Using Perl, I am modifying an existing PostScript template file based on what is in the text file. Sometimes this text file will be long enough to require a two-page PDF.
Can I create a two-page PDF file from one .ps file using Ghostscript? If so, what tells Ghostscript where the page break should occur?
Maybe I need to use two template files: one for a one-page PDF and another for a two-page PDF.
PostScript doesn't directly have the concept of text flows or page breaks. The showpage operator renders the page to the device, clears the page and starts a new one. PS to PDF conversion will create a new page in the PDF on this operator. If you want to chop up a PostScript file into pages, psutils is a series of programs for manipulating PostScript files.
It's down to whatever is converting your text file to create appropriate PostScript commands to handle the page break.
A page break will happen if (and only if) your PostScript template invokes showpage.
I would guess it depends on what's in your PostScript template. A PostScript file is a computer program, and page breaks are determined by the logic in the PostScript. If the two-page format is substantially the same as the one-page format, you could have your Perl script split the data up, then create two single-page files concatenated together. Ghostscript should render that file correctly.
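To illustrate the idea of splitting the data and emitting one showpage per page, here is a rough sketch in Python (the question uses Perl, but the logic carries over directly). The font, margins, and `lines_per_page` value are assumptions for illustration, not taken from the asker's template.

```python
def render_pages(lines, lines_per_page=50, left=72, top=720, leading=14):
    """Build a PostScript program with one showpage per chunk of lines."""
    def escape(text):
        # Backslashes and parentheses are special inside PostScript strings.
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    out = ["%!PS"]
    for start in range(0, len(lines), lines_per_page):
        # Select the font again on every page so each page stands alone.
        out.append("/Courier findfont 10 scalefont setfont")
        y = top
        for line in lines[start:start + lines_per_page]:
            out.append(f"{left} {y} moveto ({escape(line)}) show")
            y -= leading
        out.append("showpage")  # renders this page and starts a new one
    return "\n".join(out) + "\n"
```

Writing the returned string to a .ps file and converting it should give one PDF page per showpage, so a text file of 51 to 100 lines becomes a two-page PDF with these defaults.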