Getting the cropping and rotation information of an image in a PDF - pdf

I have a PDF with a page containing an image. I'm using a command line tool to extract this image. The page in the PDF shows only a part of the image: the extracted image has a lot more "content", and it is slightly rotated. I assume this happens because some sort of cropping and/or rotation was applied to the image when the PDF was built.
Is there any way, using iText, to figure out the offset and rotation applied to the image? That would allow me to crop the extracted image in the same way and end up with something similar to what's visible on the PDF page.

Related

Why the size of file with cropped image is the same as of initial one?

I have scanned my copybook and want to crop out extra white regions with Inkscape.
To achieve this, I import the initial image (PDF) into Inkscape, draw an appropriate rectangle, and use Object->Clip->Set to cut out the needed region. Then I resize the page to the drawing and save the resulting page as a new PDF file through File->Save a Copy.
I expected the new PDF file (with the cropped image) to be smaller than the initial PDF (with the uncropped image), but they are the same size.
What is the reason for this, and can it be worked around?
I use Inkscape 0.91 at Linux Mint 18.2.
Thank you in advance.
Because the original image is still there, fully intact and with all its contents. The cropping rectangle is just an instruction to the PDF viewer to hide those regions when rendering the image.
However, in Inkscape you can bake in the crop rectangles and, when exporting to PDF, "apply raster effects", which should actually alter the contained image(s).

Zooming a picture vs zooming a pdf

I'm rendering a PDF using the PDF.js library, where I can specify a zoom (scale) property. That works fine: I can set a pretty high zoom, say 8x, and still get decent quality in the rendered PDF. However, if I convert the same PDF to a raster image format like JPEG and then render it at high zoom, the quality is very bad. Why is that so?
You are describing the difference between vector graphics and raster graphics. A vector graphic format contains commands telling how to draw an image. A raster format is an array that tells what the color is at each position in the image.
PDF is largely a vector format (although, yes, you can embed a raster image in a PDF). A PDF that has an instruction to draw a line or draw a character can be zoomed to any degree and the drawing will be correct.
In a raster format, if you zoom, eventually you see the individual pixels in the array and they cannot be zoomed any more without distortion. Text in a JPEG or PNG file becomes jagged as you zoom.
On the other hand, try to create a photographic quality image just with drawing commands and you would get huge files.
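The raster side of this comparison can be illustrated with a small sketch. It is not how PDF.js or any image viewer is implemented; the function name `upscale_nearest` and the tiny 2x2 "image" are made up for illustration. It only shows that zooming an array of pixels can do no better than repeat (or blend) the values already stored, which is why edges turn blocky.

```python
def upscale_nearest(pixels, factor):
    """Zoom a raster image by an integer factor using nearest-neighbor sampling.

    `pixels` is a list of rows, each row a list of color values.
    Every source pixel simply becomes a factor x factor block.
    """
    zoomed = []
    for row in pixels:
        # Repeat each pixel horizontally.
        wide_row = [row[x // factor] for x in range(len(row) * factor)]
        # Repeat the widened row vertically (copy so rows are independent).
        for _ in range(factor):
            zoomed.append(list(wide_row))
    return zoomed


# A hypothetical 2x2 checkerboard: 0 = black, 1 = white.
image = [[0, 1],
         [1, 0]]

for row in upscale_nearest(image, 2):
    print(row)
```

No new detail appears in the zoomed result, whereas a vector drawing command can be re-executed at the larger scale.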

Creating PDF from a single JPG file using Ghostscript - image placement issue inside PDF

I'm trying to output a PDF file from a JPG file using Ghostscript. The following command works fine:
gs -sDEVICE=pdfwrite -sPAPERSIZE=a4 -o /pdf_from_image.pdf /path/to/viewjpeg.ps -c \(/source_image.jpg\) viewJPEG
Based on existing threads and the Ghostscript documentation, I'm using -sPAPERSIZE=a4 to generate the output in A4 format. The PDF generates fine, but the problem is that when the image dimensions don't match those of A4, GS puts the image at the bottom of the page with a best "width" fit. I think it actually tries to put it in the lower left corner. On top of that, the image is sometimes auto-rotated.
My question is:
1) Is there any option to put the image on top left corner of the page.
2) Stop GS auto rotating the image.
Any help to put me in the right direction would be greatly appreciated. Thanks.
PDF and PostScript use a coordinate system with the origin (0,0) in the lower left corner, so Ghostscript is actually doing the 'correct' thing: putting the image at the origin. To place the image at the top, you'd have to subtract the image height from the page height and translate the image upwards by that amount.
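The arithmetic described above can be sketched as follows. This is only the offset calculation, not a change to viewjpeg.ps: the function name `top_left_offset` is hypothetical, the A4 size is taken as 595 x 842 points (rounded), and the image height is assumed to be the height the image occupies on the page, in points, after any scaling.

```python
# A4 page size in PostScript points (1 pt = 1/72 inch), rounded to whole points.
A4_WIDTH_PT = 595
A4_HEIGHT_PT = 842


def top_left_offset(page_height, image_height):
    """Return the (tx, ty) translation that moves an image drawn at the
    origin (lower left corner) so its top edge meets the top of the page.

    Both heights must be in the same unit, here points.
    """
    return (0, page_height - image_height)


# Assumed example: the image ends up 300 pt tall on the page.
tx, ty = top_left_offset(A4_HEIGHT_PT, 300)
print(tx, ty)  # prints: 0 542
```

In PostScript terms, the resulting pair would be the argument of a `translate` applied before the image is drawn.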
As for why some images are being rotated, I can't say for sure. Some JPGs contain metadata that indicates the intended orientation of the image--however, not all software gets the value right. I don't know if Ghostscript respects that metadata, but you could check if your 'bad' images have the correct orientation tag (you can use Exif or similar to inspect them).

Discrepancy between PDF cropbox and SVG created out of a PDF page

I am trying to extract the background image of a PDF page to an SVG (using the xpdf library). The problem I am facing is that the PDF contains additional images/graphics (presumably outside the cropbox) that are not rendered by PDF readers, but the corresponding SVG contains these images/graphics. I tried setting the viewBox attribute of the SVG to correspond to the cropBox bounds of that PDF page, but the resulting SVG still displays some of the graphics objects that are not rendered in the PDF. I also tried adding a clip path to the SVG, a rectangular clipping region with bounds corresponding to the PDF cropbox, but this too did not eliminate some of the additional graphics elements not seen in the PDF. Any idea what the problem could be? What is the right way to carry the PDF cropbox over to SVG? By the way, the SVGs generated in both cases mentioned above (viewBox and clipping region approaches) were fairly close in dimensions to the viewable area of the PDF page, and the additional elements were seen only close to the edges. Is it that cropbox dimensions obtained from the PDF should not be used directly in SVG?
Turns out that the problem was due to my code not transforming the PDF cropbox attribute (as given by xpdf) to user coordinates using CTM matrix (also obtainable through xpdf). After applying the transformation, the resulting SVG matches the rendered portion of the PDF page.
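A minimal sketch of that kind of transformation is shown below. It does not use xpdf: the function names and the example matrix are made up for illustration, and it assumes the matrix is given in the usual six-number PDF form [a b c d e f], where x' = a*x + c*y + e and y' = b*x + d*y + f. All four corners of the box are transformed, because a rotation or flip can change which corner ends up as the minimum.

```python
def apply_matrix(m, x, y):
    """Apply a PDF-style matrix (a, b, c, d, e, f) to the point (x, y)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_box(m, box):
    """Transform a box (x0, y0, x1, y1) and return the axis-aligned
    bounding box of its four transformed corners."""
    x0, y0, x1, y1 = box
    corners = [apply_matrix(m, x, y)
               for x, y in ((x0, y0), (x1, y0), (x1, y1), (x0, y1))]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Assumed example: a matrix that flips the y axis on an 842 pt tall page,
# as is common when going from PDF (origin bottom left) to a top-left
# origin such as SVG's.
flip_y = (1, 0, 0, -1, 0, 842)
crop_box = (10, 20, 200, 400)
print(transform_box(flip_y, crop_box))  # prints: (10, 442, 200, 822)
```

The transformed box, rather than the raw cropbox values, is what would go into the SVG viewBox or clip rectangle.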

Parse PDF to get the boundaries of an image in the PDF page in Objective-C

I have an iPad app that displays PDF pages. I need to add annotations on the image (if one exists on the PDF page), for which I need the coordinates at which the image is situated in the PDF page. I am able to get the image data from the XObject and the image width and height, but I also need the x and y coordinates of the image. Any idea how to obtain the coordinates of the image by parsing the PDF page?
I'm assuming you have seen this Apple developer page describing how to parse XObjects: http://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/drawingwithquartz2d/dq_pdf_scan/dq_pdf_scan.html
XObjects do not contain any position data as they just describe image data that can be reused through the pdf.
From http://itext-general.2136553.n4.nabble.com/finding-the-position-of-xobject-in-an-existing-pdf-td2157152.html
"An XObject is a stream that can be reused in many different
other streams. For instance: you could have an image XObject
of a logo that appears on every page in the document.
Suppose that you have some pages in landscape and some in portrait.
Then the logo will have different coordinates on these different
pages. Therefore the position of the XObject IS NEVER STORED with
the XObject, the position can be found in the stream that refers
to the XObject.
Maybe your reaction is: "Oh right, then it's simple: I have to
look in the content stream of the pages using the XObject."
Yes and no. That's indeed where you should look, but it's not
simple. Because the actual position depends on the current
transformation matrix of the state at the moment the image is
added. It's quite some programming work to parse the content
stream and calculate the position of an XObject. "
I think you should find another option and avoid this altogether.
If you're still determined, you will have to use CGPDFScanner and track the transforms through the page.
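Once the current transformation matrix at the moment the image is drawn is known, turning it into a position, size and rotation is comparatively simple. The sketch below covers only that last step, not the content stream parsing, and the function name and example values are made up for illustration. It relies on the PDF convention that an image XObject is painted into the unit square, so the matrix [a b c d e f] maps (0,0), (1,0) and (0,1) to the image's lower left corner and the ends of its bottom and left edges. It assumes the matrix has no skew.

```python
import math


def describe_image_placement(ctm):
    """Derive position, size and rotation of an image from the matrix
    (a, b, c, d, e, f) in effect when the image is painted.

    Assumes a combination of scaling, rotation and translation only.
    """
    a, b, c, d, e, f = ctm
    return {
        "origin": (e, f),                   # where the lower left corner lands
        "width": math.hypot(a, b),          # length of the transformed bottom edge
        "height": math.hypot(c, d),         # length of the transformed left edge
        "rotation_degrees": math.degrees(math.atan2(b, a)),
    }


# Assumed example: a 200 x 100 pt image rotated 90 degrees counterclockwise,
# with its lower left corner placed at (300, 50).
placement = describe_image_placement((0.0, 200.0, -100.0, 0.0, 300.0, 50.0))
print(placement)
```

The same numbers give the rectangle on the page where an annotation over the image would go.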