Will PDFToImage also extract embedded images? (PDFBox)

Does the PDFToImage command also extract embedded images?
I want to convert PDF files to images using PDFBox.
I am using the PDFToImage command for the conversion, but I do not get any of the images embedded in the PDF when I run PDFToImage.
Or do I need to run ExtractImages separately to extract the images from the PDF files?
Is there any other way to achieve this?
Thanks in advance.

PDFToImage converts PDF pages into images and outputs one image per page.
You are looking for ExtractImages, which extracts all embedded images of a PDF document.
More information about ExtractImages can be found here:
http://pdfbox.apache.org/apidocs/org/apache/pdfbox/ExtractImages.html
To answer your question more specifically, these two programs do different things. You can write a single program that combines the two functionalities, or run them separately.
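As a rough illustration of the "run them separately" option, the sketch below only builds the two command lines; it does not execute anything. It assumes the PDFBox command-line app jar accepts the tool names used in this thread (PDFToImage and ExtractImages) as its first argument, which may not hold for every PDFBox release. The jar path and the helper names are hypothetical.

```python
from typing import List

# Hypothetical placeholder: path to the PDFBox command-line app jar.
PDFBOX_APP_JAR = "pdfbox-app.jar"


def pdfbox_command(tool: str, pdf_path: str) -> List[str]:
    """Build the argument list that runs one PDFBox command-line tool on a PDF."""
    return ["java", "-jar", PDFBOX_APP_JAR, tool, pdf_path]


def render_and_extract_commands(pdf_path: str) -> List[List[str]]:
    """Return the two separate runs: page rendering, then image extraction."""
    return [
        pdfbox_command("PDFToImage", pdf_path),     # one image per page
        pdfbox_command("ExtractImages", pdf_path),  # embedded images only
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` once the jar path points at a real file.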

Related

How to replace a specific image within a pdf?

I have a pdf with 3 images
I want to find each image and replace it with another image
I saw in the pdf the original paths under xmpMM:Ingredients:
I tried to change it in Notepad++, but it looks like the images are already embedded, and changing the path does nothing.
How can I find each image and replace it with another image?
The xmp stuff is information only. The actual images are embedded streams in the pdf file. Finding the correct streams to replace and replacing them isn't a simple problem, and can't be done with notepad. You'll need a library / toolkit that can modify PDFs, like https://pdf-lib.js.org/ or similar.
The PDF file looks like an Illustrator file, which adds another layer of weirdness - Illustrator can write PDFs that have both PDF and Illustrator versions of the content, and you see one in Acrobat and the other in Illustrator.
It's probably easier to recreate the PDF from whatever source produced it.

How to convert text pdf to image pdf using ghostscript

I need to convert text in pdf file to images, so users cannot copy it from the pdf etc.
This should be equivalent to converting the entire pdf to a set of images and then merging them into one single document. I did so, but it seems slow. Is there any way to do it with Ghostscript options?
Welp, looks like I only need to specify option -dNoOutputFonts.
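For reference, a minimal sketch of what such an invocation might look like, assuming the pdfwrite device and a Ghostscript binary named `gs` on the PATH (the function name is hypothetical). Note that, as far as I know, -dNoOutputFonts makes pdfwrite draw the glyphs as shapes instead of embedding fonts, so the text stops being selectable but the pages are not turned into raster images.

```python
from typing import List


def outline_text_command(src_pdf: str, dst_pdf: str, gs_binary: str = "gs") -> List[str]:
    """Build a Ghostscript command that rewrites a PDF without output fonts."""
    return [
        gs_binary,
        "-o", dst_pdf,         # output file; also implies batch, no-pause mode
        "-sDEVICE=pdfwrite",   # write a new PDF
        "-dNoOutputFonts",     # the option named above
        src_pdf,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`; the binary name differs on some platforms (for example `gswin64c` on Windows).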

ghostscript extract pages containing a text string

I need to programmatically extract, from a multipage PDF, only the pages containing a text string. Is it possible, or do I need some other tools? I'm working on AIX.
Thanks in advance.
OK firstly Ghostscript doesn't extract pages from PDF files. It creates brand new PDF files whose visual appearance should be the same as the original, but whose content will be different.
There is no way to do this with Ghostscript in a single pass. You could use the txtwrite device to extract the text then grep through the output files for the text you want, note the page numbers and then run another pass to get those pages into new files.
Be aware that extracting text from a PDF file is far from guaranteed to work! That was not the intent of the original PDF format.
Also note that Ghostscript currently only allows for handling a single range of pages, First->Last, so if you have a discontinuous set (e.g. pages 1, 3, 5, 7 etc.) then you will have to run this step multiple times.
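The two-pass approach described in this answer could be sketched roughly as follows. This is only an outline under some assumptions: the txtwrite pass writes one text file per page (via `%d` in the output name), the per-page text has already been read into a dictionary keyed by page number, and the search is a plain substring match, which will miss strings that the extraction splits or reorders. All function names and output file names are hypothetical.

```python
from typing import Dict, List, Tuple


def txtwrite_command(src_pdf: str, gs_binary: str = "gs") -> List[str]:
    """First pass: dump the text of each page to page-1.txt, page-2.txt, ..."""
    return [gs_binary, "-o", "page-%d.txt", "-sDEVICE=txtwrite", src_pdf]


def pages_containing(page_texts: Dict[int, str], needle: str) -> List[int]:
    """Return the sorted page numbers whose extracted text contains needle."""
    return sorted(page for page, text in page_texts.items() if needle in text)


def contiguous_ranges(pages: List[int]) -> List[Tuple[int, int]]:
    """Group page numbers into (first, last) runs of consecutive pages."""
    ranges: List[Tuple[int, int]] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], page)  # extend the current run
        else:
            ranges.append((page, page))         # start a new run
    return ranges


def range_commands(src_pdf: str, ranges: List[Tuple[int, int]],
                   gs_binary: str = "gs") -> List[List[str]]:
    """Second pass: one pdfwrite run per contiguous range of pages."""
    return [
        [gs_binary, "-o", f"pages-{first}-{last}.pdf", "-sDEVICE=pdfwrite",
         f"-dFirstPage={first}", f"-dLastPage={last}", src_pdf]
        for first, last in ranges
    ]
```

Grouping the hits into runs keeps the number of second-pass invocations down when matching pages happen to be adjacent.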

PDFClown image extraction images inverted

I'm working with PDFClown and I'm trying to extract images from a pdf file. I use the example code that ships with the source, which can be found at http://pdfclown.org:
ImageExtractionSample.java.
The problem is the images are negative and flipped horizontally. Does anyone know how to resolve this problem?
Check with other PDF files to see whether they also give rotated or flipped images. ImageExtractionSample.java does not check the rotation or matrix-defined transformations of the image object; it just writes the content to a file as is (so it will work for JPG images but not for CCITT-encoded images, for example).
So there are things to consider when you extract an image from a PDF:
image can be rotated using the attached transformation matrix (CTM);
image can be rotated/transformed as part of the form which is transformed;
image can be placed without transformation on a page but the page itself is rotated;
image may contain the overlaid Mask on top of it (and the Mask can be rotated and transformed);
JPG image is stored pretty much as is but there are other formats supported by PDF like CCITT compression, LZW compressed images etc.
But the general suggestion is that when you extract a JPG image from a PDF using PDFClown, you should just flip and rotate the extracted images, as suggested on the SourceForge project discussion page.
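To make the "flip it back" idea concrete, here is a minimal sketch that undoes a negative, horizontally mirrored image. It assumes the extracted image has already been decoded into rows of 8-bit grayscale values; it does not read the PDF's transformation matrix or any mask, and the function names are hypothetical. Whether both corrections are needed depends on the particular file.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale samples, 0..max_value


def undo_negative(pixels: Gray, max_value: int = 255) -> Gray:
    """Invert every sample so that a negative image becomes positive."""
    return [[max_value - value for value in row] for row in pixels]


def mirror_horizontally(pixels: Gray) -> Gray:
    """Reverse each row, which flips the image left to right."""
    return [list(reversed(row)) for row in pixels]


def fix_extracted(pixels: Gray) -> Gray:
    """Apply both corrections; the order does not matter for these two steps."""
    return mirror_horizontally(undo_negative(pixels))
```

With an imaging library the same two steps are usually single calls; the nested lists here only keep the sketch dependency-free.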
If you could point to the particular PDF sample file, it would be easier to suggest a solution.
If you're on Windows then you may use this free PDF Multitool utility to compare non-transformed and transformed images from PDF using "Extract raw images (without transformation)" option in images extraction dialog.
Disclaimer: I work for ByteScout, the PDF Multitool utility is free for both commercial and non-commercial purposes.

OCR within an x,y window of a pdf

I need to find an open source or Linux-based utility that allows me to set an x,y coordinate in a setup file. I would like to then sequentially open PDFs, look in the documents for first name, last name and account number, and save each file with a file name consisting of last name and file number.
You may want to read some of these answers first:
A Java Library for text extraction from PDF documents preserving empty spaces and lines
How to extract text from a PDF?
How-to extract text from a pdf doc within a specific rectangular region?
The answers above are not Linux specific.
Most PDF documents do not need to be OCR'ed as the text is contained within the PDF. The hard part is extracting it. The Java version of iText (http://itextpdf.com/) is probably the best toolkit under Linux to extract the PDF text strings. Another option may be http://pdfbox.apache.org/
If the text you need to extract is actually an image then you will probably need to convert the whole PDF page to image format such as TIFF and pass that into an OCR engine such as Google Tesseract OCR.
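If the page is rendered to an image first, the x,y window from the setup file has to be translated into pixel coordinates before cropping and OCR. The sketch below shows that conversion under a few assumptions: the window is given in PDF points (72 per inch) with the origin at the bottom-left of the page, the page is not rotated, its lower-left corner is at (0, 0), and the rendered image has its origin at the top-left. The function name and parameters are hypothetical.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # PDF default user space unit


def pdf_rect_to_pixel_box(x: float, y: float, width: float, height: float,
                          page_height_pt: float, dpi: float) -> Tuple[int, int, int, int]:
    """Convert a bottom-left-origin PDF rectangle to a top-left-origin pixel box.

    Returns (left, top, right, bottom) in pixels of the rendered page image.
    """
    scale = dpi / POINTS_PER_INCH
    left = round(x * scale)
    right = round((x + width) * scale)
    # The vertical axis is flipped: the PDF top edge (y + height) becomes the pixel top.
    top = round((page_height_pt - (y + height)) * scale)
    bottom = round((page_height_pt - y) * scale)
    return (left, top, right, bottom)
```

The resulting box has the (left, upper, right, lower) ordering that many imaging libraries expect for a crop, so the cropped region can then be handed to the OCR engine.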