PDFClown image extraction: images inverted

I'm working with PDFClown and I'm trying to extract images from a PDF file. I'm using the example code provided with the source code at http://pdfclown.org, ImageExtractionSample.java.
The problem is the images are negative and flipped horizontally. Does anyone know how to resolve this problem?

Check other PDF files to see whether they also produce rotated or flipped images. ImageExtractionSample.java does not check rotation or matrix-defined transformations for the image object; it just writes the content to a file as is (so it works for JPEG images but not, for example, for CCITT-encoded images).
So there are several things to consider when you extract an image from a PDF:
the image can be rotated via the current transformation matrix (CTM) attached to it;
the image can be rotated/transformed as part of a form XObject that is itself transformed;
the image can be placed on a page without any transformation, but the page itself can be rotated;
the image may have a mask overlaid on top of it (and that mask can itself be rotated and transformed);
a JPEG image is stored pretty much as is, but PDF supports other encodings such as CCITT and LZW compression.
But the general suggestion is that when you extract a JPEG image from a PDF using PDFClown, you should simply flip and rotate the extracted images yourself, as suggested on the project's SourceForge discussion page.
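For instance, here is a minimal sketch of that post-processing step (the class and method names are hypothetical), assuming the extracted bytes have already been decoded into a java.awt.image.BufferedImage, e.g. via ImageIO.read; whether you need a horizontal or a vertical flip depends on the particular file:

import java.awt.image.BufferedImage;

// Minimal sketch (not part of ImageExtractionSample.java): mirror and colour-invert
// an image that has already been decoded into a BufferedImage.
public final class ImageFixup {
    public static BufferedImage flipAndInvert(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int inverted = ~rgb & 0x00FFFFFF;   // invert the R, G and B channels
                // Mirror horizontally; write to (x, h - 1 - y) instead for a vertical flip.
                dst.setRGB(w - 1 - x, y, inverted);
            }
        }
        return dst;
    }
}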
If you could point to the particular PDF sample file, it would be easier to suggest a solution.
If you're on Windows, you can use the free PDF Multitool utility to compare non-transformed and transformed images from a PDF using the "Extract raw images (without transformation)" option in the image extraction dialog.
Disclaimer: I work for ByteScout; the PDF Multitool utility is free for both commercial and non-commercial purposes.

Related

Extract Image from PDF correctly

I have a PDF file that contains an image, and the image is displayed correctly in the PDF. When I try to extract the image using the itextsharp or pdfsharp libraries, I get the bytes and decode them successfully (the stream has /Filter /FlateDecode). But when I try to convert these bytes to an image using various libraries, an exception occurs (it looks like the bytes are not actually an image). As far as I understand, the problem is in processing these bytes; the image in the PDF itself is not corrupted, because it is displayed correctly. The PDF is here.
The image data is most likely stored in the raw PDF image format, which is documented in the PDF specification.
It is rather simple to convert it to the Windows BMP format, but you still have to do the conversion and build a BMP header from the image attributes given in the PDF file.
In PDF each image row is byte-aligned; in Windows BMP each row is DWORD-aligned (padded to a multiple of 4 bytes).
Don't forget to extract the colour table as well, if there is one.
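As a rough illustration of the row-alignment point, here is a minimal sketch (the class name, method name and parameters are hypothetical) that repacks byte-aligned rows into the DWORD-aligned layout BMP expects; building the BMP file/info headers and the palette from the PDF image attributes is still up to you:

// Minimal sketch: repack byte-aligned PDF image rows into DWORD-aligned BMP rows.
// srcRowBytes is the decoded row length from the PDF (width * bits per pixel,
// rounded up to whole bytes); BMP rows must be padded to a multiple of 4 bytes.
public final class BmpRowAligner {
    public static byte[] toDwordAlignedRows(byte[] src, int srcRowBytes, int height) {
        int dstRowBytes = (srcRowBytes + 3) & ~3;   // round up to the next multiple of 4
        byte[] dst = new byte[dstRowBytes * height];
        for (int row = 0; row < height; row++) {
            System.arraycopy(src, row * srcRowBytes, dst, row * dstRowBytes, srcRowBytes);
            // the remaining (dstRowBytes - srcRowBytes) bytes stay zero as padding
        }
        return dst;
    }
}

Note also that a BMP with a positive height stores its rows bottom-up, so you may need to reverse the row order relative to the PDF data as well.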

How to transfer OCR text from one PDF to another PDF?

I have two versions of the same scanned PDF. One of them has an OCR layer. How can I transfer that layer to the other one? I have already installed Ghostscript (see "How to Use Ghostscript"), but I don't know what to do next.
There's no such thing as an 'OCR layer' in PDF.
Most likely what you have is a PDF file containing a scanned image, with the text extracted from that image by OCR drawn on top of it as 'invisible' text (text rendering mode 3).
In general you can't copy and paste text between PDF files, so it's very hard to do what you are asking. I don't know of any tools that will help you here, but I can say for certain that Ghostscript absolutely will not help you at all.
Most likely you would also need to copy the Font (or CIDFont) from the PDF file as well, and if it has a ToUnicode CMap you'll definitely also want that, or searching won't work (and there's little point in this sort of OCR otherwise).
Since you have a PDF file which already includes the OCR'ed text, why not simply use that PDF? I can't see any reason why you would want to 'transfer' it to another PDF file.
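To make the 'invisible text' mechanism mentioned above concrete, here is a minimal, hypothetical sketch of how mode-3 text is typically drawn over a scanned page using iText 5 (a library not mentioned in this answer); the file names, position and string are made up, and in real OCR output the positions and font metrics would come from the recognition engine:

import java.io.FileOutputStream;

import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public final class InvisibleTextDemo {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("scan.pdf");                         // hypothetical input
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("scan-ocr.pdf"));
        PdfContentByte over = stamper.getOverContent(1);                      // draw on page 1
        BaseFont font = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED);
        over.beginText();
        over.setFontAndSize(font, 12);
        over.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE); // text rendering mode 3
        over.setTextMatrix(100, 700);                                         // made-up position in points
        over.showText("recognised word");                                     // made-up OCR string
        over.endText();
        stamper.close();
        reader.close();
    }
}

This only illustrates the mechanism; it does not solve the problem of copying existing invisible text (plus its font and ToUnicode CMap) from one file to another.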

ImageMagick-generated PDFs are blurry/hazy

I have an image of a website which I'm trying to convert to PDF. I have the image in several formats: PSD, PNG, JPG, TIFF, all saved losslessly.
I'm using the following command to convert the image to PDF:
convert -density 93 foo.jpg bar.pdf
Here is part of the original image:
And here is the same part, after converting to PDF:
As you can see, the second one is ever so slightly hazy. What's causing this, and how can I eliminate it? I've seen PDFs with crisp graphics, so I know it's possible.
If you are seeing the same results with multiple input types, the fuzziness is most likely caused by the anti-aliasing feature of your PDF viewer. If you're using Acrobat, you can turn off image anti-aliasing as follows:
Go to Edit-->Preferences-->Page Display
Untick the option "Smooth Images" and hit "OK".
The crisp graphics you see in other PDFs are most likely vector graphics. ImageMagick is creating a PDF and embedding your raster image inside it, and that embedded image may be subject to compression.
Also:
When using JPEG as input, add "-quality 100" to your ImageMagick call to retain the highest quality possible.
Use a higher value for the "-density" parameter (I would recommend at least 150) to generate a higher-resolution PDF.

PDF real cropping

I need to crop a PDF document using the Linux shell and then extract the text from just that cropped PDF.
My idea was to crop the PDF using the pdfcrop Linux tool and then use a txt2pdf text-extraction tool to get the text from just the cropped area, but I've realized I was thinking in terms of images: when I try this, the result is the same as running it over the original, uncropped PDF.
I guess it's a layer problem. Since the PDF format works with layers, if I don't "crop" all the layers, the result is going to include the information from all of them, which I don't want.
I would really appreciate it if someone had any idea how I could do a real "all layers" crop of a PDF, whether that's even possible, or whether I should start thinking about another solution.
TY
It's not layers; it's the fact that cropping a PDF usually involves simply setting the CropBox, which doesn't alter the actual contents of the PDF (other than the CropBox) at all. Most text-extraction code will ignore the CropBox and extract all the text.
You could, with some effort, use Ghostscript to produce a genuinely cropped PDF (though note that partially cropped glyphs will still be included) and then extract the text from that. But that's pretty ugly.
Alternatively, Ghostscript and MuPDF can both extract text with coordinate information, which may be enough for your needs.
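The answer suggests Ghostscript or MuPDF; purely as an illustration of the same "extract only the text inside a region" idea, here is a minimal sketch using iText 5's region filter (a library not mentioned in the answer; the file name and rectangle coordinates are made up, expressed in PDF points from the lower-left corner of the page):

import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public final class RegionTextDump {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("document.pdf");        // hypothetical input file
        Rectangle region = new Rectangle(70, 80, 490, 580);      // made-up crop area in PDF points
        RenderFilter filter = new RegionTextRenderFilter(region);
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            TextExtractionStrategy strategy =
                new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            // Print only the text whose coordinates fall inside the region on this page.
            System.out.println(PdfTextExtractor.getTextFromPage(reader, page, strategy));
        }
        reader.close();
    }
}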

Import vector graphics from PDF to GIMP

I need to extract vector graphics from a PDF image and import them into GIMP, either as paths or as high-resolution raster images. Specifically, I need to get contour lines from USGS topographical maps and overlay them on satellite images. Any suggestions?
So far I have tried:
--Using GIMP's native PDF import function to import them as raster images. Problem: doing so at high resolution crashes my computer. A possible solution would be to import only a selected area of the PDF, but as far as I can tell this is not possible.
--Using ImageMagick to convert the PDF to a raster image. Problem: used with the "-scale" parameter, "convert" appears to rasterize the PDF and then upscale it, leading to a choppy image.
--Using Inkscape to extract the necessary vector elements from the PDF. Problem: Inkscape freezes when I try to open a moderately large (25 MB) PDF.
Any other ideas?
Many thanks,
treacl
One option you didn't mention above is to use the ghostscript program directly to render your output. Ghostscript is used internally by GIMP to import PDF files, so you likely have it installed already.
There are dozens of command-line switches you can pass to ghostscript to render a file into another format; the ones you need control the output size, the resolution, and which page to render. I didn't find any switch to select only a portion of the page to be rendered, so if your document is a single page, the generated file may still be too big for GIMP, but you should at least be able to crop it with ImageMagick.
I guess the relevant command line for you would be something along the lines of:
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -sOutputFile=page.png -dFirstPage=<pagenumber> -dLastPage=<pagenumber> -r<dpiresolution> -f<filename.pdf>
If the resulting image is still too large to generate or operate on, you can try changing the output format to one with a smaller color depth (png16m uses 3 bytes per pixel). It should be possible to pass PostScript commands to transform the device so that the area of interest is scaled up to your page size (and the remaining parts are cropped out of the rendering); that would be the definitive fix for you, but off the top of my head I don't know how to do that with ghostscript.
Alternatively, you can try passing ImageMagick the -density parameter as suggested in the comments.