How can I overlay text on a TIFF image, creating something like a searchable pdf? - pdf

I would like to have an application where a user views an image of a document in TIFF Format.
If the words "foo" and "bar" appear on the page. And a selection is made on the image that only contains "foo", then I would like to only select the word "foo".
Is there a format that lends itself to storing both the location of text and the text of an image?

Since you know about searchable PDF, and it perfectly implements what you are suggesting, I assume that there is some reason why you can't use it. If not, you should use PDF -- the format supports mixed-content and overlaying them. All of the viewers that your users are likely to have will understand what to do with text beneath the image.
The TIFF format does not support this directly, but if you are making the viewer, and it only needs to work there, then you could try to store the text and positions in a custom tag.
Then your viewer would need to read this tag, interpret mouse positions, and look up the text that is being selected on the image. No other viewer would support your text tag, but they would show the TIFF.
For either of these mechanisms, you will need OCR and a way to encode the data you get either into PDF or the custom TIFF tag. For open source OCR, take a look at Tesseract from Google.
Disclaimer: I work at Atalasoft. Our imaging SDK, DotImage, has add-ons for OCR that can make searchable PDF, and can add and edit TIFF tags.

Related

How to replace a specific image within a pdf?

I have a pdf with 3 images
I want to find each image and replace it with another image
I saw in the pdf the original paths under xmpMM:Ingredients:
I tried to change it via notepad++ but it looks like the images are already embedded and changing the path does nothing.
How can I find each image and replace it with another image?
The xmp stuff is information only. The actual images are embedded streams in the pdf file. Finding the correct streams to replace and replacing them isn't a simple problem, and can't be done with notepad. You'll need a library / toolkit that can modify PDFs, like https://pdf-lib.js.org/ or similar.
The PDF file looks like an Illustrator file, which adds another layer of weirdness - Illustrator can write PDFs that have both PDF and Illustrator versions of the content, and you see one in Acrobat and the other in Illustrator.
It's probably easier to recreate the PDF from whatever source produced it.

Text changed to graphics, still selectable in PDF?

I have this PDF ebook with selectable text - the handwriting - but there is no such font embedded and the letters are all different, so it's not actually a font. How is this possible?
I've worked with CorelDraw and Adobe Acrobat, but I can't understand how this works.
The left side of the picture shows the document properties, the right side shows a page of the PDF file and I selected the last 3 rows. I can copy and paste that to a text file, no problem. How was this achieved?
There are a few possibilities but the most likely is the text is being converted to outlines/paths or vectors. Some software such as Adobe InDesign and other print design apps allow you to 'flatten' a font based text into vector or paths, meaning the original font isn't required to be embedded or installed on the system. The original text data is however still present and able to be copied into a text field or word processor.

What's the best way to extract text from pdf in python without changing the layout and format?

I want text with exact format and layout from pdf.
If pdf to text is not the direct choice, is it possible to do pdf -> xml -> text?
I have already tried PyPDF2, pdfminer and pdftotxt. Even I've tried using AWS textract and got incorrect layout.
Basically if I can construct sentence from the text extracted from pdf, that's enough.
I used Zamzar API which gives exact output but they're quiet expensive.
Any possible solution?
If you are looking to keep the structure of the PDF but not the font, colour, size etc., then try the pdftables_api library. This should hold the layout of your PDF. Convert PDF to CSV as a CSV file is just a comma seperated text file.
If you are looking to keep font, colour etc., Zamzar API is probably your best option.

Creating RTF text for clipboard and sharing DataPackage in WinRT

I'm sure this is just a google search away, but I can't find the right search terms to find what I'm looking for.
I've created a DataPackage that has both HTML annd plain text content. I've used this in my copy and my sharing code and it works fine. I now want to create RTF output as some apps don't seem to accept HTML clipboard content.
I'm looking for a good guide to making RTF text that can be added to the DataPackage. I just need simple formatting including changing the font family, font size, font weight and adding newlines. The data comes from a list of objects taht I want to serialise as RTF, not from a text control on the screen.
WordPad outputs fairly clean RTF and some other text editors do as well. If that's not enough, you can download the RTF Specification 1.9.1 although like any specification that's probably overkill for what you're doing.
You can also use the SaveToStream method on the Document property of a RichEditBox from a Metro style app to share out as well.

Programmatically replace text in PDF

I have PDF files with text that should be replaced. More specificly, the text should be translated and replaced with the translated version.
It's important that the rest of the PDF structure stays intact. Note that the text is available in the PDFs and techniques like OCr are not needed. Also, it would be nice if font and other text attributes are kept.
Which libraries would you recommend for extracting the text to an easy to edit format (such as CSV) and put the new text back in again?
Assuming you are replacing text with a different language, you will have to choose a different font in most cases, and the font choice is non-trivial. I've used the Foxit libraries to change text or create PDFs with success.