Using PDFBox or something else, is it possible to know if a pdf contains no scanned pages? - pdf

I'm looking for a way to detect whether a PDF document contains non-searchable text. The scenario I have in mind is a multi-page PDF in which some pages contain plain text (with or without images, it doesn't matter) and one or more pages contain non-searchable text.
So I would like a method returning true/false that detects whether a PDF contains any non-searchable text (or vice versa). In your opinion, is this possible with PDFBox or something else?
Thx

Related

What is a format for static documents like PDF but not divided into pages?

PDF is for static documents, so a document is shown the same in different applications, even if it has an unusual layout. But PDF documents are divided into pages because the format is designed for documents to be printed.
I would like to have a document with static content but with no page breaks. Which document format can do that? I guess that it could be achieved with PDF with a single page as long as it needs to be, but I don't know that any software could do that, and it seems like abuse of PDF.
I create PDF documents in LaTeX, and they are almost never printed, and the page layout gets in the way when they are read on a screen. So I'm looking for a way to have documents where the layout is fixed because of hyphenation, mathematics and graphics, but which are more suitable for reading on screens.

Generating dynamic hyperlinks in pdfs with xdocreport, odt and velocity

Good day,
I'm trying to convert some OpenOffice .odt files to PDF, and I need to fill out some elements dynamically. I use normal input fields, which work great for plain text. However, I have some text that needs to be clickable and point to a certain URL. Both the text and the URL need to be inserted dynamically; they can't be hard-coded in the .odt.
I haven't been able to find any documentation that lets me do this. There were some references on how to do it with .docx files, but none regarding .odt.
Is it even possible to dynamically create hyperlinks in an odt that gets converted to pdf?

Itext: insert PDFs *while* generating a PDF

All the examples I can find seem to assume you are merging existing PDFs, and in particular sticking a bunch of PDFs at the end of another PDF.
In my situation I am generating something analogous to a bibliography but if the reference is to a file of PDF format, they want the contents of that PDF inserted inline, immediately after the citation, not out of place at the end of the file.
Note that there is no guarantee that the external PDFs use the same page size, rotation, etc, as the PDF document I'm generating.
Is there a way to do this? I've tried modifying the itext example on how NOT to do this (with a PdfWriter) but I get an unbalanced save/restore state message. I'm also considering doing this by post-processing as all the examples do, but I'm not quite sure how I'd go about looping through the bibliography PDF and determining where to insert the pages of the external PDFs.
Thanks.

Parse Body Text from PDF

I have recently been experimenting with parsing the text data from a PDF document using iTextSharp in a VB2010 app. The document doesn't contain any images or other fancy elements, just text. I've read some articles and used some code snippets, and it looks promising. However, what I've been trying to do is parse out just the body of each page, minus any header or footer. I haven't found any guidance for that particular function.
I'm currently using the snippet found in "Reading PDF content with itextsharp dll in VB.NET or C#", but it parses all the text on a page. There's got to be a way to get just the body. Or at least I hope so.
PDFs generally do not contain information about logical structure of contained text.
So there are no headers, footers, body text, paragraphs, or anything like that in a PDF. There is only a bunch of operations like "draw this glyph here" or "move to this position and draw that group of glyphs there". I wrote "glyph" and not "character" because PDFs are not required to contain readable text; only the visual appearance is required to be specified.
One exception is Tagged PDF, but most PDFs in the wild are not tagged.
Given all of the above, you are probably left with the following approach:
Extract all text from each page
Analyze text and find similar parts at the beginning / end of each page
Remove similar parts
This is a heuristic-based detection, so it probably won't always give excellent results.
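The three steps above can be sketched roughly as follows. This is only an illustration in Python, not iTextSharp code: it assumes the text of each page has already been extracted into a list of strings, and the function name `strip_repeated_edges`, the `min_share` threshold, and the digit normalization (so that "Page 1" and "Page 2" count as the same footer) are all assumptions made for the example. It only looks at the first and last line of each page.

```python
import re
from collections import Counter


def _normalize(line):
    # Treat lines that differ only in numbers (e.g. page numbers) as equal.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, min_share=0.6):
    """Drop a first/last line that repeats on at least min_share of the pages."""
    split = [page.splitlines() for page in pages]
    non_empty = [lines for lines in split if lines]
    if len(non_empty) < 2:
        # Nothing to compare against, so leave the text untouched.
        return ["\n".join(lines) for lines in split]

    threshold = min_share * len(non_empty)
    first = Counter(_normalize(lines[0]) for lines in non_empty)
    last = Counter(_normalize(lines[-1]) for lines in non_empty)
    headers = {text for text, count in first.items() if count >= threshold}
    footers = {text for text, count in last.items() if count >= threshold}

    result = []
    for lines in split:
        if lines and _normalize(lines[0]) in headers:
            lines = lines[1:]
        if lines and _normalize(lines[-1]) in footers:
            lines = lines[:-1]
        result.append("\n".join(lines))
    return result


pages = [
    "ACME Report\nIntro text\nPage 1",
    "ACME Report\nMore text\nPage 2",
    "ACME Report\nFinal text\nPage 3",
]
body_only = strip_repeated_edges(pages)
```

Multi-line headers, or headers that change per chapter, would need a more tolerant comparison than this exact match on one line.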

How can I overlay text on a TIFF image, creating something like a searchable pdf?

I would like to have an application where a user views an image of a document in TIFF Format.
Suppose the words "foo" and "bar" appear on the page. If a selection is made on the image that contains only "foo", then I would like only the word "foo" to be selected.
Is there a format that lends itself to storing both the location of text and the text of an image?
Since you know about searchable PDF, and it implements exactly what you are suggesting, I assume that there is some reason why you can't use it. If not, you should use PDF: the format supports mixing images and text and overlaying one on the other. All of the viewers that your users are likely to have will understand what to do with text beneath the image.
The TIFF format does not support this directly, but if you are making the viewer, and it only needs to work there, then you could try to store the text and positions in a custom tag.
Then your viewer would need to read this tag, interpret mouse positions, and look up the text that is being selected on the image. No other viewer would support your text tag, but they would show the TIFF.
For either of these mechanisms, you will need OCR and a way to encode the data you get either into PDF or the custom TIFF tag. For open source OCR, take a look at Tesseract from Google.
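As a rough sketch of the custom-tag idea, the viewer side might look something like the following. Everything here is an assumption made for illustration: the `WordBox` type, the JSON payload layout, and the rule that a word is selected only when its box lies entirely inside the selection rectangle. The word boxes would come from OCR, and actually writing the payload into a TIFF tag and reading it back is not shown.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WordBox:
    """One recognized word and its bounding box in image pixel coordinates."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def encode_words(words):
    # Serialize the boxes into a string that could be stored in a custom tag.
    return json.dumps([asdict(word) for word in words])


def decode_words(payload):
    # Rebuild the boxes from the stored string.
    return [WordBox(**item) for item in json.loads(payload)]


def words_in_selection(words, left, top, right, bottom):
    """Return the words whose boxes lie entirely inside the selection rectangle."""
    return [
        word.text
        for word in words
        if word.left >= left and word.top >= top
        and word.right <= right and word.bottom <= bottom
    ]


page_words = [WordBox("foo", 10, 10, 40, 25), WordBox("bar", 50, 10, 80, 25)]
payload = encode_words(page_words)
selected = words_in_selection(decode_words(payload), 5, 5, 45, 30)
```

A real viewer would probably also want partial-overlap selection and a mapping from screen coordinates to image coordinates when the image is zoomed or scrolled.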
Disclaimer: I work at Atalasoft. Our imaging SDK, DotImage, has add-ons for OCR that can make searchable PDF, and can add and edit TIFF tags.