PDF Rectangle[re] display position differs from object position in PDF document - pdf

There are two rectangles on the page.
Page Contents:
/OC /MC0 BDC
0.087 0.963 0.488 0.002 k
0 0 0 0 K
/GS0 gs
118.442 63.791 61.046 133.721 re
B
92.977 141.837 21.744 55.674 re
B
EMC
The actual Y position of the left (small) rectangle [141.837] is higher than that of the right (big) rectangle [63.791].
Why are they displayed as if they had similar Y positions?
P.S.: The transformation matrix [CTM] of the left rectangle is standard.
I tried taking the actual coordinates (from the PDF page content stream) and putting them into a new file. The result is
I would like to know why the left rectangle is displayed at Y=53.988 and not at Y=141.837.

In PDF, the origin of the default coordinate system is in the bottom left corner; Y is measured from the bottom edge of the page, not from the top.
63.791 + 133.721 = 197.512 and 141.837 + 55.674 = 197.511, so both rectangles have practically the same top Y.
Glad to see you are using our XFINIUM.PDF Inspector to look inside the PDF files. The PDF bounds are relative to the standard PDF coordinate system; the Display bounds are relative to the top left corner of the visible page area.
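The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and page_top (the Y of the top edge of the visible page area) is a hypothetical parameter because the question does not state the page size.

```python
def top_edge(x, y, width, height):
    """Top Y of an `x y width height re` rectangle in PDF user space (Y grows upward)."""
    return y + height


def display_y(x, y, width, height, page_top):
    """Distance from the top of the visible page area down to the rectangle's top edge.

    page_top is an assumption of this sketch; it is not given in the question.
    """
    return page_top - top_edge(x, y, width, height)


# Operands of the two `re` operators from the content stream above.
big = (118.442, 63.791, 61.046, 133.721)
small = (92.977, 141.837, 21.744, 55.674)

print(round(top_edge(*big), 3), round(top_edge(*small), 3))
```

Both top edges differ by about 0.001 units, which is why a viewer that measures from the top of the page reports nearly identical Y values for the two rectangles.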

Related

How to find the " PDF page origin " for an existing Page...?

Hi, I am trying to find the origin, i.e. the x and y coordinates, of a page. Are there any code examples "Using PDFBOX", and also theory, that would help to find the origin of a page in a PDF?
By that I mean we need to find out whether the origin is
left bottom? right bottom? right top? left top? or in the middle of the page?
First of all, I assume we are talking about user space coordinates, not device space coordinates. When rendering a PDF, coordinates eventually are translated to the device space of the rendering target. But device space coordinates are device dependent and, therefore, not really appropriate for generic PDF processing tasks.
The default user space coordinate system of a page
The default user space coordinate system is in particular used for positioning annotations and is the initial user space coordinate system when starting to process the instructions of the page content stream.
This coordinate system is specified by the effective crop box of the page (which defaults to its media box):
The user space coordinate system shall be initialised to a default state for each page of a document. The CropBox entry in the page dictionary shall specify the rectangle of user space corresponding to the visible area of the intended output medium (display window or printed page). The positive x axis extends horizontally to the right and the positive y axis vertically upward, as in standard mathematical practice (subject to alteration by the Rotate entry in the page dictionary).
(ISO 32000-2, section 8.3.2.3 "User space")
Thus, even without considering the page rotation, the origin may be anywhere inside, on the edge, or outside the visible page area, e.g. for the following CropBox values:
[ 0 0 612 792 ] - origin in the lower left
[ 0 -792 612 0 ] - origin in the upper left
[ -306 -396 306 396 ] - origin in the center of the page
[ -1612 1000 -1000 1792 ] - origin off page to the right and below
If you also take page rotation into account, the origin rotates with the page:
Key: Rotate
Type: integer
Value: (Optional; inheritable) The number of degrees by which the page shall be rotated clockwise when displayed or printed. The value shall be a multiple of 90. Default value: 0.
(ISO 32000-2, Table 31 "Entries in a page object")
So e.g. for the crop box [ 0 0 612 792 ] for the following Rotate values:
0 - origin in the lower left
90 - origin in the upper left
180 - origin in the upper right
270 - origin in the lower right
and for the crop box [ -1612 1000 -1000 1792 ]:
0 - origin off page to the right and below
90 - origin off page to the left and below
180 - origin off page to the left and above
270 - origin off page to the right and above
Of course the directions of the coordinate axes also change to match the rotation:
0 - x coordinates increase to the right, y coordinates upwards
90 - x coordinates increase downwards, y coordinates to the right
180 - x coordinates increase to the left, y coordinates downwards
270 - x coordinates increase upwards, y coordinates to the left
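The lists above can be reproduced with a small helper. This is a minimal sketch, not PDFBox code: to_display is a made-up name, and "display coordinates" here means coordinates measured from the top left corner of the page as shown on screen, with x to the right and y downward.

```python
def to_display(point, crop_box, rotate=0):
    """Map a point in default user space to display coordinates.

    crop_box is (llx, lly, urx, ury); rotate is the clockwise /Rotate value.
    """
    llx, lly, urx, ury = crop_box
    width, height = urx - llx, ury - lly
    u, v = point[0] - llx, point[1] - lly  # offsets from the lower left corner
    rotate %= 360
    if rotate == 0:
        return (u, height - v)
    if rotate == 90:
        return (v, u)
    if rotate == 180:
        return (width - u, v)
    if rotate == 270:
        return (height - v, width - u)
    raise ValueError("Rotate must be a multiple of 90")


# Where does the origin (0, 0) end up for the crop box [ 0 0 612 792 ]?
for angle in (0, 90, 180, 270):
    print(angle, to_display((0, 0), (0, 0, 612, 792), angle))
```

For rotation 0 the origin lands at (0, 792), the lower left corner; for 90 at (0, 0), the upper left; for 180 at (612, 0), the upper right; and for 270 at (792, 612), the lower right of the 792 x 612 displayed page. A result with a negative coordinate, or one larger than the displayed width or height, means the origin lies off the page.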
The current user space coordinate system of a page
While processing the instructions of a page content stream, the user space may be transformed, in particular by the cm instruction:
Operands: a b c d e f
Operator: cm
Description: Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.
(ISO 32000-2, Table 56 "Graphics state operators")
One use case for this is to have the current coordinate system "the right side up" after rotation.
For example for the crop box [ 0 0 612 792 ] and the page rotation 90, the coordinate system has its origin in the upper left, x coordinates increase downwards, and y coordinates increase to the right. To straighten this out, you'll often find a cm instruction like this at the start of the page content stream:
0 1 -1 0 612 0 cm
After this instruction the origin on the rotated page in our example is again in the lower left, and x coordinates increase to the right and y coordinates upwards.
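To see why this particular matrix straightens things out, the mapping can be traced numerically. The sketch below assumes the crop box [ 0 0 612 792 ] and Rotate 90 from the example; apply_cm and display_rotate_90 are illustrative names, and display coordinates are again measured from the top left corner of the page as shown, with y downward.

```python
def apply_cm(matrix, x, y):
    """Map a point from the new coordinate system into the previous one.

    PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f.
    """
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def display_rotate_90(point):
    """Default user space -> display coordinates for crop box [0 0 612 792], Rotate 90."""
    x, y = point
    return (y, x)


CM = (0, 1, -1, 0, 612, 0)

# Corners of the 792 x 612 area as addressed after the cm instruction.
for corner in [(0, 0), (792, 0), (0, 612)]:
    print(corner, "->", display_rotate_90(apply_cm(CM, *corner)))
```

The point (0, 0) ends up at display position (0, 612), the lower left corner of the displayed page; (792, 0) ends up at (792, 612), the lower right; and (0, 612) at (0, 0), the upper left. So x grows to the right and y grows upward again.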

Images rotated when added to PDF in itext7

I'm using the following extension method I built on top of itext7's com.itextpdf.layout.Document type to apply images to PDF documents in my application:
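import com.itextpdf.io.image.ImageDataFactory
import com.itextpdf.layout.Document
import com.itextpdf.layout.element.Image
import java.io.InputStream

// (x, y) is the image's top left corner, measured from the top left of the page.
fun Document.writeImage(imageStream: InputStream, page: Int, x: Float, y: Float, width: Float, height: Float) {
    val imageData = ImageDataFactory.create(imageStream.readBytes())
    val image = Image(imageData)
    val pageHeight = pdfDocument.getPage(page).pageSize.height
    image.scaleAbsolute(width, height)
    val lowerLeftX = x
    // Convert the top-based y into PDF's bottom-based y.
    val lowerLeftY = pageHeight - y - image.imageScaledHeight
    image.setFixedPosition(page, lowerLeftX, lowerLeftY)
    add(image)
}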
Overall, this works -- but with one exception! I've encountered a subset of documents where the images are placed as if the document origin is rotated 90 degrees. Even though the content of the document is presented properly oriented underneath.
Here is a redacted copy of one of the PDFs I'm experiencing this issue with. I'm wondering if anyone would be able to tell me why itext7 is having difficulties writing to this document, and what I can do to fix it -- or alternatively, if it's a potential bug in the higher level functionality of com.itextpdf.layout in itext7?
Some Additional Notes
I'm aware that drawing on a PDF works via a series of instructions concatenated to the PDF. The code above works on other PDFs we've had issues with in the past, so com.itextpdf.layout.Document does appear to be normalizing the coordinate space prior to drawing. Thus, the issue I describe above seems to be going undetected by itext?
The rotation metadata in the PDF that itext7 reports from a "good" PDF without this issue seems to be the same as the rotation metadata in PDFs like the one I've linked above. This means I can't perform some kind of brute-force fix through detection.
I would love any solution to not require me to flatten the PDF through any form of broad operation.
I can talk only about the document you've shared.
It contains 4 pages.
The /Rotate property of the first page is 0; for the other pages it is 270 (which is equivalent to a 90 degree counterclockwise rotation).
iText indeed tries to normalize the coordinate space for each page.
That's why an image added to pages 2-4 of the document is rotated by 270 degrees (90 counterclockwise).
... Even though the content of the document is presented properly oriented underneath.
Content of pages 2-4 looks like
q
0 -612 792 0 0 612 cm
/Im0 Do
Q
This is an image with applied transformation.
0 -612 792 0 0 612 cm represents the composite transformation matrix.
From ISO 32000
A transformation matrix in PDF shall be specified by six numbers,
usually in the form of an array containing six elements. In its most
general form, this array is denoted [a b c d e f]; it can represent
any linear transformation from one coordinate system to another.
We can extract the rotation from that matrix.
How to decompose the matrix is explained here:
https://math.stackexchange.com/questions/237369/given-this-transformation-matrix-how-do-i-decompose-it-into-translation-rotati
The rotation is defined by the following matrix:
0 -1
1 0
This is a rotation by -90 (270) degrees.
Important note: in this case positive angle means counterclockwise rotation.
ISO 32000
Rotations shall be produced by [rc rs -rs rc 0 0], where rc = cos(q)
and rs = sin(q) which has the effect of rotating the coordinate system
axes by an angle q counter clockwise.
So the image has been rotated by the same angle as the page, but in the opposite direction.
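The decomposition can be sketched in a few lines of Python. This is an illustration, not iText code: decompose_cm is a made-up name, and the sketch assumes the matrix contains only rotation, scaling and translation (no skew or mirroring), which holds for the matrix above.

```python
import math


def decompose_cm(a, b, c, d, e, f):
    """Split a PDF matrix [a b c d e f] into rotation, scale and translation.

    Assumes no skew and no reflection. The angle is in degrees, with
    positive values meaning counterclockwise, as in ISO 32000.
    """
    scale_x = math.hypot(a, b)
    scale_y = math.hypot(c, d)
    angle = math.degrees(math.atan2(b, a))
    return angle, (scale_x, scale_y), (e, f)


angle, scale, translation = decompose_cm(0, -612, 792, 0, 0, 612)
print(angle, scale, translation)
```

For 0 -612 792 0 0 612 cm this yields an angle of -90 degrees, scale factors 612 and 792, and a translation of (0, 612), which matches the rotation matrix shown above.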

How to identify which clip paths apply to a path or fill in PDF vector graphics?

I am trying to extract vector graphics from a PDF file and create corresponding SVG files. I am using SVGOutputDev (https://github.com/immateriel/pdf2svg/blob/master/SVGOutputDev.cc) with the xpdf library for this purpose. SVGOutputDev hasn't implemented clip path extraction, and I am trying to implement it. While I am able to extract the clip path definitions themselves, I am unable to determine which of these definitions apply to a normal stroke or fill region.
For instance, please refer to http://pastebin.com/jTdzv3YZ for the SVG I extracted from a page of a PDF, and the corresponding dump of the sequence of PDF graphics commands as seen during extraction. As seen from that SVG, there are multiple clip paths and one rectangular fill region. Even though there are multiple clip paths defined before the filled rectangle is defined, only the circular clip paths defined just before the rectangle definition are expected to be associated with the rectangle (going by how the PDF page was rendered on various PDF readers, which show only 2 black-filled circles in a white background).
The question is how does one know which clip paths are associated with a regular fill/stroke region defined in a PDF? FYI, I went through the relevant section of the PDF specification document but it wasn't very clear to me ("A clipping path operation may appear after the last path construction operator and before the path-painting operator that terminates a path object. Although the clipping path operator appears before the painting operator, it does not alter the clipping path at the point where it appears. Rather, it modifies the effect of the succeeding painting operator"). Can someone explain how to identify the relevant clip paths to apply to any normal path?
The question is how does one know which clip paths are associated with a regular fill/stroke region defined in a PDF?
In a nutshell: The effective clip area is the intersection of all clip paths that have been set by the time the fill or stroke operation is executed, except for those that were discarded by an intervening Q (restore state) operator.
Thus, your analysis for your sample file
Even though there are multiple clip paths defined before the filled rectangle is defined, only the circular clip paths defined just before the rectangle definition are expected to be associated with the rectangle (going by how the PDF page was rendered on various PDF readers, which show only 2 black-filled circles in a white background)
is wrong: Not the last clip area but the intersection of all clip areas before the rectangle definition defines the current one. As each of those clip areas is contained in the preceding one, the result of the intersection indeed consists of those two circles.
In the documentation:
The graphics state shall contain a current clipping path that limits the regions of the page affected by painting operators. The closed subpaths of this path shall define the area that can be painted.
The initial clipping path shall include the entire page.
[Clipping Path Operators] modify the current clipping path by intersecting it with the current path, using the [nonzero winding number rule / even-odd rule] to determine which regions lie inside the clipping path.
There is no way to enlarge the current clipping path or to set a new clipping path without reference to the current one. However, since the clipping path is part of the graphics state, its effect can be localized to specific graphics objects by enclosing the modification of the clipping path and the painting of those objects between a pair of q and Q operators (see 8.4.2, "Graphics State Stack"). Execution of the Q operator causes the clipping path to revert to the value that was saved by the q operator before the clipping path was modified.
(section 8.5.4 in the current PDF specification ISO 32000-1)
In action: Let's look at the content stream of the page of your document (which has a Mediabox [0, 0, 595, 842]):
q
q
Twice push the graphics state.
0 842 m
0 0 l
595 0 l
595 842 l
h
W
n
Defines a clip path equivalent with the whole media box.
1 w
2 J
0 j
10 M
[]0 d
Defines general graphics state properties (line width, line cap style, line join style, miter limit, and dash pattern).
q
Pushes the graphics state again, this time with the explicitly set clip path and those other graphics properties.
0 718.5 m
595 718.5 l
595 123.5 l
0 123.5 l
0 718.5 l
h
W
n
Defines a clip path which contains a rectangle as wide as the whole media box but cutting off the top and bottom stripes of 123.5 user space units height. As this clip path is completely contained in the clip path set before, the intersection equals this clip path here. Thus, the currently effective clip area is this smaller rectangle.
0 718.5 m
595 718.5 l
595 123.5 l
0 123.5 l
0 718.5 l
h
W
n
Defines a clip path which is identical to the former one. Thus, intersecting them changes nothing.
148.75 668.92 m
93.98 668.92 49.58 624.52 49.58 569.75 c
49.58 514.98 93.98 470.58 148.75 470.58 c
203.52 470.58 247.92 514.98 247.92 569.75 c
247.92 624.52 203.52 668.92 148.75 668.92 c
h
347.08 470.58 m
292.32 470.58 247.92 426.18 247.92 371.42 c
247.92 316.65 292.32 272.25 347.08 272.25 c
401.85 272.25 446.25 316.65 446.25 371.42 c
446.25 426.18 401.85 470.58 347.08 470.58 c
h
W
n
Defines a clip path consisting of two circle subpaths. These two circles don't intersect; thus we don't have to deal with the differences between the "Nonzero Winding Number Rule" and the "Even-Odd Rule". Furthermore, the circles are contained inside the present clip area. Thus, the new clip area consists of these two circles.
0 0 0 rg
49.58 668.92 m
545.42 668.92 l
545.42 173.08 l
49.58 173.08 l
49.58 668.92 l
h
f
This draws a filled black rectangle which contains the current clipping area. Thus, the whole clipping area (i.e. the two circles) is painted black.
Q
q
This restores the graphics state to the last pushed one. I.e. the clipping path for any following operations is the first one which encompassed the whole media box. This graphics state is pushed again.
0 718.5 m
0 123.5 l
595 123.5 l
595 718.5 l
h
W
n
Once again the clipping path clipping off bars at the top and the bottom is defined...
Q
q
... and immediately dropped by a restore state operation; the state is again pushed.
0 718.5 m
0 123.5 l
595 123.5 l
595 718.5 l
h
W
n
Q
q
The same again...
0 718.5 m
0 123.5 l
595 123.5 l
595 718.5 l
h
W
n
Q
q
... and again.
0 842 m
0 0 l
595 0 l
595 842 l
h
W
n
This once again defines a clipping path encompassing the whole media box. As this is the current clipping path anyhow, nothing changes by intersecting.
Q
Q
Q
All graphics states formerly pushed onto the stack are removed again.
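The bookkeeping in this walk-through can be sketched as a tiny state machine. This is a simplified illustration, not xpdf or SVGOutputDev code: clips_at_paint and the event names are made up, each "clip" event stands for a complete path followed by W n, and clip regions are represented only by labels, so the actual geometric intersection is left to the caller.

```python
def clips_at_paint(events):
    """Return, for every painted object, the clip regions that must be intersected.

    events is a list of tuples: ("q",), ("Q",), ("clip", label) or ("paint", label).
    """
    current = []   # clip regions in effect; their intersection is the clip area
    stack = []     # one saved copy of `current` per unmatched q
    result = {}
    for kind, *args in events:
        if kind == "q":
            stack.append(list(current))
        elif kind == "Q":
            current = stack.pop()       # clips set since the matching q are dropped
        elif kind == "clip":
            current.append(args[0])     # W never replaces, it only intersects
        elif kind == "paint":
            result[args[0]] = list(current)
        else:
            raise ValueError(f"unknown event {kind!r}")
    return result


# The content stream discussed above, reduced to its state changes.
events = [
    ("q",), ("q",), ("clip", "page"),
    ("q",), ("clip", "band"), ("clip", "band"), ("clip", "circles"),
    ("paint", "rectangle"),
    ("Q",), ("q",), ("clip", "band"),
    ("Q",), ("q",), ("clip", "band"),
    ("Q",), ("q",), ("clip", "band"),
    ("Q",), ("q",), ("clip", "page"),
    ("Q",), ("Q",), ("Q",),
]
print(clips_at_paint(events))
```

For the rectangle this reports the regions page, band, band and circles. Because each of them lies inside the previous one, their intersection is just the two circles, as described above. In an SVG export, one way to express this is to nest one clipped group per entry in the list.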

Text rotation in PDF

So I have this situation:
using pdftoxml.exe from sourceforge.net I got text tokens and their coordinates. If the pdf file was rotated (i.e. it has a /Rotate 90 written in its source) pdftoxml.exe swaps height and width of a given page and also x and y coordinates of any given object. That is what I understand.
I was happy with it until I came across a PDF file which used re to draw thick lines. That is, for a thick line, 4 thin lines are drawn and the space is filled, like in this picture. On the left you see two thin lines (not colored), which are part of a bigger rectangle (highly zoomed in). I emptied the space in between, which was actually filled with black, to see the lines:
Additionally, the above PDF is rotated. So to get B upright in the end, this text matrix was used: 0 1 -1 0 90.72 28.3705 Tm. The thin lines were drawn like this: 83.04 27.891 0.48 0.48 re (coordinates may vary here, but it was some re operation like that; the operand order is x y width height re, and re stands for rectangle, see page 133 of Adobe's PDF 1.7 reference). What is relevant here is the calculation 27.891 + 0.48 = 28.371, which is not rounded or altered by floating-point issues. It is the exact value for the line's x and, unfortunately, it is bigger than the hard-coded x of B, which is 28.3705:
83.52 27.891 m 92.39999999999999 27.891 l s
92.39999999999999 27.891 m 92.39999999999999 28.371 l s
92.39999999999999 28.371 m 83.52 28.371 l s
83.52 28.371 m 83.52 27.891 l s
The page measures 842 x 595.2 according to PDFXChange viewer, counted from the upper left corner. That seems natural since the page is rotated. Unrotated, it would be the lower left corner, so that ought to be OK.
When the text is put back into its original orientation with 1 0 0 1 90.72 28.3705 Tm, one can see its bottom line coinciding with the line on the left:
which is what I would expect, since B's y is 28.3705 and the line's horizontal position is 28.371 (as can be seen in the second of the code lines above). So probably B's bottom line falls beyond 28.371, but I could not zoom in that far.
Now where does the gap between the line and the B come from in the first picture? This is important to me because I was trying to figure out which is the closest line on the left to B and was surprised by the two values, namely the supposed x value of the text I get from pdftoxml.exe, which is 28.3705, and the line's horizontal value 28.371. Since I knew the line is actually far to the left of the B, that could not be correct, at least not in the sense of "take x position of line, take x position of B, compare, and if the line's x is less than B's x, the line is on the left".
I can't locate the correct line with the x values. Instead I get the other line on the very left, as if the text were falling in between the two.
This is the text drawing code:
BT
%0 7.5 -7.5 0 90.72 28.3705 Tm
0 1 -1 0 90.72 28.3705 Tm
%1 0 0 1 90.72 28.3705 Tm
/F1 1 Tf
1 Tr
q
0.01 w
(B) Tj
Q
ET
so, there is nothing fancy happening with the B's size or line thickness.
Can you help me figure out?
This is an updated picture with two I glyphs drawn on the same page: the upper I uses 0 1 -1 0 90.72 28.3705 Tm (rotated 90 degrees mathematically), the lower one 1 0 0 1 90.72 28.3705 Tm. So I don't get it: how is the lower I rotated by +90 and ends up being the upper one?
Here is the pdf code. It is rather big, but you should be able to copy it into your file and name it sth.pdf.
PDF Sample ( you have to actually zoom into the upper left corner real big to see the I )
EDIT
I actually found some interesting information about finding the glyph bounding box, but I could not yet put the pieces together.
Please have a look at
The glyph origin is the point (0, 0) in the glyph coordinate system. Tj and other text-showing operators shall position the origin of the first glyph to be painted at the origin of text space.
(shamelessly copied from Figure 39, section 9.2.4 of ISO 32000-1).
As you can see, the point at which the glyph is positioned, the glyph origin, is not necessarily where the actual glyph bounding box starts. This may explain the gap in your first image.
Thus, when you are trying to figure out which is the closest line on the left to B optically, it does not suffice to take x position of line, take x position of B, compare, and if the line's x is less than B's x, the line is on the left. Instead you also have to take the font data themselves into account and factor in the gap between the glyph origin and the glyph bounding box of the glyph represented by B.
For a more in-depth analysis please supply the font data.
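Until the font data are available, the effect can at least be illustrated with made-up numbers. In the sketch below the bounding box (80, 0, 620, 700) is purely hypothetical, the font is assumed to use the usual 1000 units per em, and character spacing, horizontal scaling and text rise are ignored; glyph_bbox_corners and text_to_user are illustrative names.

```python
def text_to_user(tm, x, y):
    """Apply a text matrix [a b c d e f] to a point given in text space."""
    a, b, c, d, e, f = tm
    return (a * x + c * y + e, b * x + d * y + f)


def glyph_bbox_corners(tm, font_size, bbox, units_per_em=1000):
    """Corners of a glyph bounding box after scaling by the font size and applying Tm."""
    llx, lly, urx, ury = (v * font_size / units_per_em for v in bbox)
    corners = ((llx, lly), (urx, lly), (urx, ury), (llx, ury))
    return [text_to_user(tm, x, y) for x, y in corners]


HYPOTHETICAL_BBOX = (80, 0, 620, 700)  # not taken from the real font

upright = glyph_bbox_corners((1, 0, 0, 1, 90.72, 28.3705), 1, HYPOTHETICAL_BBOX)
rotated = glyph_bbox_corners((0, 1, -1, 0, 90.72, 28.3705), 1, HYPOTHETICAL_BBOX)

print(min(x for x, _ in upright))   # left edge of the upright glyph
print(min(y for _, y in rotated))   # edge of the rotated glyph nearest the origin
```

With these assumed numbers the rotated glyph's outline starts at about 28.4505 rather than at the origin value 28.3705, so it clears a line at 28.371 even though the origin itself lies just before that line. The real gap depends on the actual side bearing of the B glyph in the embedded font.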
EDIT concerning your double-I question... in your comment above you say you actually expected to see a common point - the rotation point - in both I characters, so you can get hands on a reliable horizontal coordinate for the left bounding box side of a character.
Isn't the point where the red lines cross your rotation point? It should be the glyph origin for both Tj operations, and the I glyphs have their origins there. Now you can measure from there on.

PDF Low-Level: Adding text as an invisible layer with each letter in specific position

I'm writing a PDF file directly from code, it's all working nicely, but I don't know how to add text into the content object of a page with each letter at a specific position.
I have the coordinates of each letter, something like this:
x0 y0 x1 y1
a = 345,200,350,210
n = 352,201,360,209
d = 365,200,371,212
I want to be able to put this onto the PDF page as an invisible layer so it can be searched or selected, but with each letter in the exact correct coordinates.
Alternatively I could do it with only the coordinates for each word, if this is better.
What is the format for writing this into the content object?
Thank you very much for your help!
There are many ways of doing this. You'll need to use a text block:
BT
%..you need to set a font...
/f1 10 Tf
%..you need to set the text matrix to include Tx and Ty (if not already done)..
1 0 0 1 345 200 Tm
(a) Tj % or (and) Tj to display the word in one go (position of chars depends on font selected)
1 0 0 1 352 201 Tm
(n) Tj
% etc.
ET
You also mentioned that you wanted the text to be invisible. If you are in complete control of the page content, you can set the text stroke and fill colour to be the same as the background colour (which will probably be white):
1 1 1 RG
1 1 1 rg
Otherwise you can paint over the text; it will still be selectable.
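Generating such a text block from a list of letter coordinates could look like the following sketch. Several things here are assumptions: the font resource name f1 must exist in the page's resources, the x0 y0 values are taken to be the lower left corner of each letter in PDF user space, and instead of the white colour trick the sketch uses text rendering mode 3 (3 Tr), which the PDF specification defines as "neither fill nor stroke", i.e. invisible text. The function names are made up.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_stream(letters, font_name="f1", font_size=10):
    """Build a BT ... ET block that places each letter at its own position.

    letters is a list of (character, (x0, y0, x1, y1)) tuples; only x0 and y0
    are used for positioning in this sketch.
    """
    lines = ["BT", f"/{font_name} {font_size} Tf", "3 Tr"]
    for char, (x0, y0, _x1, _y1) in letters:
        lines.append(f"1 0 0 1 {x0} {y0} Tm")
        lines.append(f"({escape_pdf_string(char)}) Tj")
    lines.append("ET")
    return "\n".join(lines)


letters = [
    ("a", (345, 200, 350, 210)),
    ("n", (352, 201, 360, 209)),
    ("d", (365, 200, 371, 212)),
]
stream = invisible_text_stream(letters)
print(stream)
```

The x1 and y1 values could additionally be used to pick a font size or horizontal scaling per letter so that the selection highlight matches the visible glyphs; that refinement is left out here.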