I'm using the following extension method I built on top of itext7's com.itextpdf.layout.Document type to apply images to PDF documents in my application:
import com.itextpdf.io.image.ImageDataFactory
import com.itextpdf.layout.Document
import com.itextpdf.layout.element.Image
import java.io.InputStream

fun Document.writeImage(imageStream: InputStream, page: Int, x: Float, y: Float, width: Float, height: Float) {
    val imageData = ImageDataFactory.create(imageStream.readBytes())
    val image = Image(imageData)
    val pageHeight = pdfDocument.getPage(page).pageSize.height
    image.scaleAbsolute(width, height)
    // Convert from a top-left origin (x, y) to PDF's bottom-left origin.
    val lowerLeftX = x
    val lowerLeftY = pageHeight - y - image.imageScaledHeight
    image.setFixedPosition(page, lowerLeftX, lowerLeftY)
    add(image)
}
Overall, this works -- but with one exception! I've encountered a subset of documents where the images are placed as if the document origin is rotated 90 degrees. Even though the content of the document is presented properly oriented underneath.
Here is a redacted copy of one of the PDFs I'm experiencing this issue with. I'm wondering if anyone would be able to tell me why itext7 is having difficulties writing to this document, and what I can do to fix it -- or alternatively, if it's a potential bug in the higher level functionality of com.itextpdf.layout in itext7?
Some Additional Notes
I'm aware that drawing on a PDF works via a series of instructions concatenated to the PDF. The code above works on other PDFs we've had issues with in the past, so com.itextpdf.layout.Document does appear to be normalizing the coordinate space prior to drawing. Thus, the issue I describe above seems to be going undetected by itext?
The rotation metadata in the PDF that itext7 reports from a "good" PDF without this issue seems to be the same as the rotation metadata in PDFs like the one I've linked above. This means I can't perform some kind of brute-force fix through detection.
I would love any solution to not require me to flatten the PDF through any form of broad operation.
I can only talk about the document you've shared.
It contains 4 pages.
The /Rotate property of the first page is 0; for the other pages it is 270 (which amounts to a 90 degree counterclockwise rotation).
iText does indeed try to normalize the coordinate space for each page.
That's why, when you add an image to pages 2-4 of the document, it is rotated by 270 degrees (90 degrees counterclockwise).
... Even though the content of the document is presented properly oriented underneath.
The content of pages 2-4 looks like this:
q
0 -612 792 0 0 612 cm
/Im0 Do
Q
This is an image with a transformation applied to it.
0 -612 792 0 0 612 cm represents the composite transformation matrix.
From ISO 32000
A transformation matrix in PDF shall be specified by six numbers,
usually in the form of an array containing six elements. In its most
general form, this array is denoted [a b c d e f]; it can represent
any linear transformation from one coordinate system to another.
We can extract the rotation from that matrix.
You can find how to decompose the matrix here:
https://math.stackexchange.com/questions/237369/given-this-transformation-matrix-how-do-i-decompose-it-into-translation-rotati
The rotation is defined by the following matrix
0 -1
1 0
This is a rotation by -90 (270) degrees.
Important note: in this case positive angle means counterclockwise rotation.
ISO 32000
Rotations shall be produced by [rc rs -rs rc 0 0], where rc = cos(q)
and rs = sin(q) which has the effect of rotating the coordinate system
axes by an angle q counter clockwise.
So the image has been rotated by the same angle as the page, but in the opposite direction.
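The decomposition described above can be sketched in a few lines of Python. This is only an illustration, not part of iText: the function name `decompose` and its return layout are made up here, and the sketch assumes the matrix contains no shear (which holds for 0 -612 792 0 0 612).

```python
import math

def decompose(a, b, c, d, e, f):
    """Split a PDF matrix [a b c d e f] into translation, rotation and scale.

    Hypothetical helper; assumes the matrix has no shear component.
    """
    scale_x = math.hypot(a, b)
    scale_y = math.hypot(c, d)
    # ISO 32000 writes rotations as [cos q, sin q, -sin q, cos q, 0, 0],
    # so the counterclockwise angle q is atan2(b, a).
    angle = math.degrees(math.atan2(b, a))
    return {
        "translate": (e, f),
        "rotate_ccw_deg": angle,
        "scale": (scale_x, scale_y),
    }

# The matrix from the content stream of pages 2-4.
parts = decompose(0, -612, 792, 0, 0, 612)
print(parts)
```

For this matrix the counterclockwise angle comes out as -90 degrees, with scale factors 612 and 792, which agrees with the manual reading of the rotation matrix above.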
Consider the following PostScript file
[1 0 0.5 0.866 150 550] concat
<<
/ShadingType 2
/Coords [ 0 0 100 100]
/BBox [ 0 0 100 100]
/ColorSpace [ /DeviceRGB ]
/Function
<<
/FunctionType 0
/Domain [0 1]
/Range [0 1 0 1 0 1]
/BitsPerSample 8
/Size [2]
/DataSource <FFA0A0FFE0E0>
>>
/Extend [false false]
>>
shfill
Suppose we convert that file to PDF with Ghostscript (ps2pdf) or Adobe Distiller.
The resulting PDF does not render the same way in different PDF viewers:
In Adobe Reader or Firefox (which uses PDF.js), we have a parallelogram (not a rectangle).
In SumatraPDF (which uses MuPDF) and Chrome (which uses PDFium), we have a rectangle.
Who is right?
In my opinion Adobe Acrobat is right but the specification could be read differently, too.
Your PDF contains the following content stream:
/GS1 gs
q
1 0 .5 .866 150 550 cm
/Sh1 sh
Q
I.e. first the current transformation matrix is changed, it is sheared and squished a bit, and then the shading Sh1 is painted. That shading in turn is defined as
<</BBox[0 0 100 100]/ColorSpace/DeviceRGB/Coords[0 0 100 100]/Function 15 0 R/ShadingType 2>>
I.e. with a 100×100 square bounding box (interpreted as a temporary additional clipping path) and an axial shading along its (0, 0) to (100, 100) diagonal, matching your postscript definition.
The shading operator sh is specified as
Operands: name
Operator: sh
Description: (PDF 1.3) Paint the shape and colour shading described by a shading dictionary, subject to the current clipping path. The current colour in the graphics state is neither used nor altered. The effect is different from that of painting a path using a shading pattern as the current colour. name is the name of a shading dictionary resource in the Shading subdictionary of the current resource dictionary (see 7.8.3, "Resource dictionaries"). All coordinates in the shading dictionary are interpreted relative to the current user space. (By contrast, when a shading dictionary is used in a Type 2 pattern, the coordinates are expressed in pattern space.) All colours are interpreted in the colour space identified by the shading dictionary’s ColorSpace entry (see "Table 77 — Entries common to all shading dictionaries"). The Background entry, if present, is ignored. This operator should be applied only to bounded or geometrically defined shadings. If applied to an unbounded shading, it paints the shading’s gradient fill across the entire clipping region, which may be time-consuming.
(ISO 32000-2:2017, Table 76 — Shading operator)
In particular: All coordinates in the shading dictionary are interpreted relative to the current user space.
Thus, the square bounding box / temporary clip path is squished and sheared by the current transformation matrix to a non-rectangular parallelogram as can be viewed in Adobe Acrobat:
I mentioned above that the specification can be read differently, too: If one considers the BBox entry as the coordinates of two points, the lower left corner and the upper right corner of the box, and applied the transformation before making the result a box, one would get a squished, elongated rectangle as can be viewed in Chrome:
But the BBox here is specified as an array of four numbers giving the left, bottom, right, and top coordinates, respectively, of the shading’s bounding box (ibidem, Table 77 — Entries common to all shading dictionaries) and not as the coordinates of two endpoints of a diagonal. Thus, I'd favor the first interpretation also implemented by Adobe.
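To make the two readings concrete, here is a small Python sketch that applies the matrix from the content stream (1 0 .5 .866 150 550) to the BBox [0 0 100 100] both ways. The helper name `apply` and the variable names are invented for this illustration; no PDF library is involved.

```python
def apply(ctm, x, y):
    """Map a user space point through a PDF matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)

CTM = (1, 0, 0.5, 0.866, 150, 550)
LEFT, BOTTOM, RIGHT, TOP = 0, 0, 100, 100

# Reading 1: the BBox is a rectangle in user space, so all four
# corners are transformed and the result is a parallelogram.
corners = [(LEFT, BOTTOM), (RIGHT, BOTTOM), (RIGHT, TOP), (LEFT, TOP)]
parallelogram = [apply(CTM, x, y) for x, y in corners]

# Reading 2: only two opposite corners are transformed, and an
# axis-aligned box is rebuilt from them afterwards.
(x0, y0) = apply(CTM, LEFT, BOTTOM)
(x1, y1) = apply(CTM, RIGHT, TOP)
rectangle = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

print(parallelogram)
print(rectangle)
```

Under the first reading the bottom edge runs from x = 150 to 250 while the top edge runs from x = 200 to 300, i.e. a sheared shape. Under the second reading the box simply spans x = 150 to 300, which is the elongated rectangle.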
I don't have a copy of ISO 32000-2:2020 yet, so maybe this has been clarified one way or the other.
The situation would be different if the shading had been used in a pattern that served as the current color during a fill instruction. In that case the specification says:
A pattern’s appearance is described with respect to its own internal coordinate system. Every pattern has a pattern matrix, a transformation matrix that maps the pattern’s internal coordinate system to the default coordinate system of the pattern’s parent content stream (the content stream in which the pattern is defined as a resource). The concatenation of the pattern matrix with that of the parent content stream establishes the pattern coordinate space, within which all graphics objects in the pattern shall be interpreted.
(ISO 32000-2:2017, Section 8.7.2 — General properties of patterns)
In this case the square bounding box with the diagonal axial shading would not have been subject to the current transformation matrix.
The width of a line in PDF is defined in terms of distances in the user space. In my use case, the aspect ratio of the device space (e.g. 4:3) is different from the aspect ratio of the user space (e.g. 1:1), which causes the line widths in the device space to be different in vertical and horizontal directions.
For example, in this picture the horizontal and vertical lines should be of the same width, but they're not:
I would like to perform scaling that only results in line width uniformity and does not affect anything else.
I asked a similar question regarding PostScript here: How to ensure line widths are the same vertically and horizontally in PostScript?. A solution based in part on the answer to this question works for PostScript, but does not work in PDF after what seems to be an almost one-to-one translation.
I tried changing the stroke command S to q 1 0 0 1.5 0 0 cm S Q h, where q saves the graphics state, 1 0 0 1.5 0 0 cm scales the current transformation matrix, Q restores the graphics state, and h closes the current subpath. However, in addition to correctly scaling the line widths, this also scales the y-coordinates of the line endpoints by 1.5.
This is what I need to get:
But with q 1 0 0 1.5 0 0 cm S Q h, I get this instead:
How to make the line width uniform in the device space in PDF without affecting anything else?
Using ImageMagick or GhostScript or any PHP code how can I get the DPI value of PDF files?
Here is the link for two demo files
http://jmp.sh/O5g5wL4 -- of 72 DPI
http://jmp.sh/RxrnYrY -- of 300 DPI
I have used
$image = new Imagick();
$image->readImage('xyz.pdf');
$resolutions = $image->getImageResolution();
It gives the same result for two different PDF files having different DPI.
I have also used
pdfimages -list xyz.pdf
It gives a list of all information but how to fetch the DPI value from the list.
How to get the exact DPI value of a PDF?
As fmw42 says, PDF files themselves have no resolution. However, in your case both files consist of nothing but an image. In one case the image is ~48 MB and in the other it's around 200 MB.
The reason is that the images have a different effective resolution.
In PDF the image is simply a bitmap, a sequence of coloured pixels. These are then drawn onto the underlying media. At this point there is no resolution; the pixels are laid down in a specific media size. In your case, 22.5 inches by 81.5 inches.
The effective resolution is given by dividing the number of pixels in the image in a given dimension by the size of the area it covers in that dimension.
So if I have an image which is 1000x1000 pixels, and I draw it in a 1 inch square, then the effective resolution of the image is 1000 dpi. If I change my mind and draw it in a square 4 inches by 4 inches, then the effective resolution is 250 dpi.
The image hasn't changed, just the area it covers.
Now consider that I have two images drawn in 1 inch squares. The first image is 1000x1000, the second is 500x500. The effective resolution of the first image is 1000 dpi; the effective resolution of the second is 500 dpi.
So you can see that, in PDF, the effective resolution of the image is a combination of the dimensions of the image, and the dimensions of the media it covers.
That's a difficult thing to measure in a PDF file. The area covered is calculated using matrix algebra and can be a combination of several different matrices.
The actual dimensions of the image, by contrast, are quite easy to determine: they are given in the image dictionary. Your images are 1620x5868 and 3372x12225. In both cases the media is the same size: 22.5x81.5 inches.
Since the images cover the entire media, the effective resolutions are:
1620/22.5 = 72 by 5868/81.5 = 72
3372/22.5 = 149.866 by 12225/81.5 = 150
I think MuPDF will give you image dimensions and media dimensions, assuming all your PDF files are constructed like this you can then simply perform the maths, but note that this won't be so simple for ordinary PDF files where images don't cover the entire media.
Using mutool info -I -M 150-dpi.pdf gives:
Retrieving info from pages 1-1...
Mediaboxes (1):
1 (6 0 R): [ 0 0 1620 5868 ]
Images (1):
1 (6 0 R): [ DCT ] 3375x12225 8bpc DevCMYK (12 0 R)
So there's your image dimensions and your media size. All you need to do is apply the division of one by the other.
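As a rough sketch, the division can be written out in Python using the figures from the mutool output above. The function name `effective_dpi` is made up for this example, and it assumes that the image covers the whole MediaBox and that the MediaBox is in default user space units of 1/72 inch (no UserUnit entry).

```python
POINTS_PER_INCH = 72  # default PDF user space unit, assuming no UserUnit

def effective_dpi(pixels, media_points):
    """Pixels per inch when `pixels` are spread over `media_points` points."""
    return pixels * POINTS_PER_INCH / media_points

# Figures taken from the mutool output for 150-dpi.pdf.
media_w, media_h = 1620, 5868      # MediaBox width and height in points
image_w, image_h = 3375, 12225     # image width and height in pixels

dpi_x = effective_dpi(image_w, media_w)
dpi_y = effective_dpi(image_h, media_h)
print(dpi_x, dpi_y)
```

For a page where the image does not fill the media, the divisor would have to be the size of the area the image is actually drawn into, which this sketch does not try to work out.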
Note: In Debian and related distros, mutool is contained in the mupdf-tools package, not in the mupdf package itself. It can therefore be installed with sudo apt install mupdf-tools.
I use pdfimages -list from the poppler library; it gives you all the information about the images.
I have a PDF document with many pages of 595x420 pt, but I need to squeeze these pages into 595x210 pt, and all the text must remain visible.
So, can I change the scale of PDF pages non-proportionally (not a zoom) to fit a custom page size with Ghostscript, or must I use some other program?
If you want scaling applied to one axis and not the other, then you will have to do some PostScript programming. In /ghostpdl/Resource/Init/pdf_main.ps is the code which calculates the matrix required:
/pdf_PDF2PS_matrix { % <pdfpagedict> -- matrix
matrix currentmatrix matrix setmatrix exch
% stack: savedCTM <pdfpagedict>
dup get_any_box
% stack: savedCTM <pdfpagedict> /Trim|Crop|Art|MediaBox <Trim|Crop|Art|Media Box>
oforce_elems normrect_elems fix_empty_rect_elems 4 array astore
//systemdict /PDFFitPage known {
PDFDEBUG { (Fiting PDF to imageable area of the page.) = flush } if
That code calculates the x and y scale values and makes them the same. If you want them to differ, that's what you will have to modify. Note you will also have to set a specific media size using -dDEVICEHEIGHTPOINTS and -dDEVICEWIDTHPOINTS and set -dFIXEDMEDIA to prevent the PDF file resizing the media.
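For reference, the differing x and y scale values themselves are easy to work out. The Python sketch below only computes them for the page sizes mentioned in the question (595x420 to 595x210); it does not touch pdf_main.ps, and the function names are invented for this illustration. It assumes both pages have their lower-left corner at the origin.

```python
def anamorphic_matrix(src_w, src_h, dst_w, dst_h):
    """Return a matrix (a, b, c, d, e, f) that scales x and y independently."""
    sx = dst_w / src_w
    sy = dst_h / src_h
    return (sx, 0, 0, sy, 0, 0)

def apply(matrix, x, y):
    """Map a point through the matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

matrix = anamorphic_matrix(595, 420, 595, 210)
top_right = apply(matrix, 595, 420)
print(matrix, top_right)
```

Here the x scale stays at 1 and the y scale becomes 0.5, so the top-right corner of the old page lands on the top-right corner of the new 595x210 media.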
I'm trying to highlight some text that has a glyph width of 1000 (which corresponds to 1 unit of text space) and a font size of 1; the transformation matrix is [50 0 0 50 0 0]. My calculation gives text that is too big. But this is not the case: the text that is being displayed is not big at all; it's a normal size.
Any PDF reader I open the file with has no problems highlighting the word, which means that I'm missing something somewhere.
Currently I'm checking for the default font and the font array in the fonts dictionary, the font size, and the transformation matrix. Is there any other way to scale text in a PDF besides the ones I just mentioned?
This answer combines the comments to the original question:
Currently I'm checking for the default font and the font array in the fonts dictionary, the font size, and the transformation matrix. Is there any other way to scale text in a PDF besides the ones I just mentioned?
A few possibilities that come to mind immediately:
A new transformation matrix (the argument to cm) does not replace the old one; instead, the old one is multiplied by it (from the left).
With q ... Q pairs, you have to take into account that Q resets the current transformation matrix to the value saved by q.
(The current transformation matrix, line widths, colors, overprint settings, and much, much more are part of the graphics state. To get an impression, have a look at the entries in tables 57 and 58 of the PDF specification ISO 32000-1. At least all the properties described there are part of the graphics state and, therefore, saved during q and restored during Q.)
Furthermore there is the text matrix to consider.
Finally the UserUnit entry of the page might change the rules.
So there's more to look at than the text positioning operators.
For a good overview have a look at section 9.4.4 Text Space Details of the PDF specification, especially Note 2 therein. (Thanks to #plinth.)
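How these factors combine can be sketched in Python. The formula for the text rendering matrix follows section 9.4.4 (text state parameters, then the text matrix, then the current transformation matrix); everything else is an assumption made for illustration: the helper names, the reading of [50 0 0 50 0 0] as the CTM rather than the text matrix, and the second cm with a factor of 0.2, which is purely hypothetical.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)

def multiply(m, n):
    """Return m x n for PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (a1 * a2 + b1 * c2,
            a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2,
            c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2,
            e1 * b2 + f1 * d2 + f2)

def text_rendering_matrix(font_size, tm, ctm, h_scale=1.0, rise=0.0):
    """Trm = [Tfs*Th 0 0 Tfs 0 Trise] x Tm x CTM."""
    text_state = (font_size * h_scale, 0, 0, font_size, 0, rise)
    return multiply(multiply(text_state, tm), ctm)

# Values from the question: font size 1, [50 0 0 50 0 0] taken as the CTM.
ctm = multiply((50, 0, 0, 50, 0, 0), IDENTITY)
trm = text_rendering_matrix(1, IDENTITY, ctm)

# Hypothetical: an earlier "0.2 0 0 0.2 0 0 cm" is still in effect, so the
# new cm matrix is multiplied onto the existing CTM from the left.
outer = multiply((0.2, 0, 0, 0.2, 0, 0), IDENTITY)
nested_ctm = multiply((50, 0, 0, 50, 0, 0), outer)
nested_trm = text_rendering_matrix(1, IDENTITY, nested_ctm)

print(trm[3], nested_trm[3])
```

With the question's values alone, one text space unit maps to 50 user space units. With the hypothetical outer scaling still active it maps to roughly 10, which is the kind of discrepancy that an overlooked cm, text matrix, or UserUnit entry can cause.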