How to find whether the text overlaps with the overlay in a PDF - pdfbox

I am using PDFBox version 2.0.25 to manipulate PDF files. I add overlays to the PDF files. These overlays consist of various texts, images and shapes, and I want to check whether they overlap with the existing text and images of the PDF.
I have found a way to extract the images and text locations of the PDF. But I am unable to find a way to extract overlay information.
Can you please help?
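One possible approach, sketched here under assumptions: if the overlay content is drawn by your own code, its bounding boxes may already be known at the time you add it, so they could be recorded then rather than extracted afterwards. With both sets of boxes in the same page coordinate system, the check itself reduces to rectangle intersection. The class and method names below (OverlapCheck, findOverlaps) are hypothetical and not part of PDFBox; only the standard java.awt.geom classes are used.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class. All boxes are assumed to be
// expressed in the same page coordinate system.
final class OverlapCheck {

    /** Returns every existing box that the overlay box intersects. */
    static List<Rectangle2D> findOverlaps(Rectangle2D overlay, List<Rectangle2D> existing) {
        List<Rectangle2D> hits = new ArrayList<>();
        for (Rectangle2D box : existing) {
            // intersects() is true only when the interiors share some area,
            // so boxes that merely touch along an edge are not reported.
            if (overlay.intersects(box)) {
                hits.add(box);
            }
        }
        return hits;
    }
}
```

Calling findOverlaps once per overlay element against the extracted text and image boxes would give the list of collisions. Rotated content would need its boxes transformed first, which this sketch does not cover.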

Related

Text changed to graphics, still selectable in PDF?

I have this PDF ebook with selectable text - the handwriting - but there is no such font embedded and the letters are all different, so it's not actually a font. How is this possible?
I've worked with CorelDraw and Adobe Acrobat, but I can't understand how this works.
The left side of the picture shows the document properties, the right side shows a page of the PDF file and I selected the last 3 rows. I can copy and paste that to a text file, no problem. How was this achieved?
There are a few possibilities, but the most likely is that the text has been converted to outlines (vector paths). Some software, such as Adobe InDesign and other print design apps, lets you 'flatten' font-based text into vector paths, meaning the original font doesn't need to be embedded or installed on the system. The original text data is, however, still present and can be copied into a text field or word processor.

PDFClown image extraction images inverted

I'm working with PDFClown and I'm trying to extract images from a PDF file. I use the example code, ImageExtractionSample.java, that ships with the source code available at http://pdfclown.org.
The problem is the images are negative and flipped horizontally. Does anyone know how to resolve this problem?
Check whether other PDF files also give rotated or flipped images. ImageExtractionSample.java does not check rotation or matrix-defined transformations for the image object; it just writes the content to a file as is (so it will work for JPG images but not for CCITT-encoded images, for example).
So there are things to consider when you extract an image from a PDF:
the image can be rotated using the attached transformation matrix (CTM);
the image can be rotated/transformed as part of a form that is itself transformed;
the image can be placed without transformation on a page, but the page itself is rotated;
the image may have a mask overlaid on top of it (and the mask can be rotated and transformed);
a JPG image is stored pretty much as is, but PDF supports other formats too, such as CCITT compression, LZW-compressed images, etc.
But the general suggestion is that when you extract a JPG image from a PDF using PDFClown, you should just flip and rotate the extracted images, as suggested on the SourceForge project discussion page.
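As a rough illustration of that kind of post-processing, the sketch below mirrors an already decoded image left to right and inverts its colours, which matches the symptoms described in the question. It uses only the standard java.awt.image classes; the names ImageFix and flipAndInvert are hypothetical and not part of PDFClown. It assumes the inversion and mirroring are the only problems, which may not hold for every file.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFClown: post-processes an image that
// has already been decoded into a BufferedImage.
final class ImageFix {

    /** Returns a copy mirrored left-to-right with each RGB channel inverted. */
    static BufferedImage flipAndInvert(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int argb = src.getRGB(x, y);
                // Keep the alpha byte, invert the red, green and blue bytes.
                int inverted = (argb & 0xFF000000) | (~argb & 0x00FFFFFF);
                // Write to the mirrored column to undo the horizontal flip.
                out.setRGB(w - 1 - x, y, inverted);
            }
        }
        return out;
    }
}
```

The extracted file could be loaded with javax.imageio.ImageIO.read, passed through flipAndInvert, and saved again with ImageIO.write.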
If you could point to a particular sample PDF file, it would be easier to suggest a solution.
If you're on Windows then you may use this free PDF Multitool utility to compare non-transformed and transformed images from PDF using "Extract raw images (without transformation)" option in images extraction dialog.
Disclaimer: I work for ByteScout, the PDF Multitool utility is free for both commercial and non-commercial purposes.

How to clip and concatenate a page region in multiple pdf files with one page each?

I have a lot of pdf files each one with an image inside. I want to clip a rectangular region in each of these files and concatenate them into a single pdf file. Is it possible with ghostscript or similar?
I'll have a go at this. Try Briss if you want to crop rectangular regions in PDF files. It's a free, cross-platform GUI tool.
If you have multiple pdf files you can concatenate/merge them first online using http://www.pdfmerge.com/ Then use Briss to crop the images out into a new pdf file. Or vice-versa depending on the location of your images inside the pdf files.
After you fire up Briss, load the merged PDF file containing the images. When you're asked if you want to exclude anything, just click "cancel" if you want to include all pages.
If your file has many pages, similar pages may be overlapping each other so you can draw a rectangle over the region you want to crop. Click Action -> Preview for previewing the output. Click Action -> Crop PDF to finalize your output pdf file. Cheers.

Apache POI - Word Files

I have successfully read a Word document containing images using POI.
I have even been able to extract a section from the Word document, including the images.
I am writing the extracted portion containing images to a new Word document.
My problem is that I have to display this extracted portion (containing text, fonts, colors, images) on the screen using any standard Java Swing component.
Please advise how I can do that.
I tried JText, Panel, editor but all would take only text, and I lose my formatting and images.
Regards

How can I overlay text on a TIFF image, creating something like a searchable pdf?

I would like to have an application where a user views an image of a document in TIFF Format.
Suppose the words "foo" and "bar" appear on the page. If a selection is made on the image that contains only "foo", then I would like to select only the word "foo".
Is there a format that lends itself to storing both the location of text and the text of an image?
Since you know about searchable PDF, and it perfectly implements what you are suggesting, I assume that there is some reason why you can't use it. If not, you should use PDF -- the format supports mixed-content and overlaying them. All of the viewers that your users are likely to have will understand what to do with text beneath the image.
The TIFF format does not support this directly, but if you are making the viewer, and it only needs to work there, then you could try to store the text and positions in a custom tag.
Then your viewer would need to read this tag, interpret mouse positions, and look up the text that is being selected on the image. No other viewer would support your text tag, but they would show the TIFF.
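The lookup step in such a viewer could be sketched roughly as follows. This assumes the word texts and their bounding boxes have already been read from wherever they are stored, and that the selection rectangle has been converted from mouse coordinates to image pixel coordinates. The names WordSelector, Word and select are hypothetical, and reading or writing the TIFF tag itself is not shown.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical viewer helper: maps a selection rectangle to the words it covers.
final class WordSelector {

    /** A recognized word and its bounding box in image pixel coordinates. */
    static final class Word {
        final String text;
        final Rectangle2D box;

        Word(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Returns the text of every word whose box lies entirely inside the selection. */
    static List<String> select(List<Word> words, Rectangle2D selection) {
        List<String> selected = new ArrayList<>();
        for (Word w : words) {
            // contains() requires the whole word box to be inside the selection.
            if (selection.contains(w.box)) {
                selected.add(w.text);
            }
        }
        return selected;
    }
}
```

Requiring full containment is a design choice. Using intersects instead of contains would also pick up words that the selection only partly covers.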
For either of these mechanisms, you will need OCR and a way to encode the data you get either into PDF or the custom TIFF tag. For open source OCR, take a look at Tesseract from Google.
Disclaimer: I work at Atalasoft. Our imaging SDK, DotImage, has add-ons for OCR that can make searchable PDF, and can add and edit TIFF tags.