Text rotation in PDF - pdf

So I have this situation:
using pdftoxml.exe from sourceforge.net I got text tokens and their coordinates. If the pdf file was rotated (i.e. it has a /Rotate 90 written in its source) pdftoxml.exe swaps height and width of a given page and also x and y coordinates of any given object. That is what I understand.
I was happy with it until I came across a pdf file which used re to draw thick lines. That is, for a thick line, 4 thin lines are drawn and the space is filled, like in this picture. On the left you see two thin lines (not colored), which are part of a bigger rectangle (highly zoomed in). I emptied the space in between, which was actually filled with black, to see the lines:
Additionally, the above pdf is rotated. So to get the B upright in the end, this text matrix was used: 0 1 -1 0 90.72 28.3705 Tm. The thin lines were drawn by an operation like 83.04 27.891 0.48 0.48 re (the coordinates may vary here, but it was some re operation like that; the operands go x y width height re, and re stands for rectangle, see Adobe's pdf 1.7 reference, page 133). What is relevant here is the calculation 27.891 + 0.48 = 28.371, which is not rounded or altered because of floating-point issues. It is the exact value for the line's x and, unfortunately, it is bigger than the hard-coded B's x, which is 28.3705:
83.52 27.891 m 92.39999999999999 27.891 l s
92.39999999999999 27.891 m 92.39999999999999 28.371 l s
92.39999999999999 28.371 m 83.52 28.371 l s
83.52 28.371 m 83.52 27.891 l s
The page's dimensions are 842 x 595.2 according to PDFXChange viewer, measured from the upper left corner, which seems natural since the page is rotated. Unrotated, it would be the lower left corner, so that ought to be ok.
When the text is put back into its original orientation with 1 0 0 1 90.72 28.3705 Tm, one can see the B's bottom line colliding with the line on the left:
which is what I would expect, since B's y is 28.3705 and the line's horizontal position is 28.371 (as can be seen on the second line of the code lines above). So probably B's bottom line falls beyond the 28.371, but I could not zoom in that far.
Now where does the gap between the line and the B come from in the first picture? This is important to me because I was trying to figure out which is the closest line on the left of the B, and I was surprised by the two values, namely the supposed x value of the text I get from pdftoxml.exe, which is 28.3705, and the line's horizontal value 28.371. Since I knew the line is actually far to the left of the B, that could not be correct, at least not in the sense of "take x position of line, take x position of B, compare, and if the line's x is less than B's x, the line is on the left".
I can't locate the correct line with the x values. Instead I get the other line on the very left, as if the text were falling in between the two.
This is the text drawing code:
BT
%0 7.5 -7.5 0 90.72 28.3705 Tm
0 1 -1 0 90.72 28.3705 Tm
%1 0 0 1 90.72 28.3705 Tm
/F1 1 Tf
1 Tr
q
0.01 w
(B) Tj
Q
ET
so, there is nothing fancy happening with the B's size or line thickness.
Can you help me figure out?
This is an updated picture with two I glyphs drawn on the same page: the upper I uses 0 1 -1 0 90.72 28.3705 Tm (rotated 90 degrees mathematically), the lower one uses 1 0 0 1 90.72 28.3705 Tm. So I don't get it: how is the lower I rotated +90 and ends up being the upper one?
Here is the pdf code. It is rather big, but you should be able to copy it into your file and name it sth.pdf.
PDF Sample ( you have to actually zoom into the upper left corner real big to see the I )
EDIT
I actually found some interesting information about finding the glyph bounding box, but I could not yet put the pieces together.

Please have a look at
The glyph origin is the point (0, 0) in the glyph coordinate system. Tj and other text-showing operators shall position the origin of the first glyph to be painted at the origin of text space.
(shamelessly copied from Figure 39, section 9.2.4 of ISO 32000-1).
As you can see, the coordinates where the glyph is positioned, the glyph origin, is not necessarily where the actual glyph bounding box starts. This may explain the gap in your first image.
Thus, when you are trying to figure out which is the closest line on the left of the B optically, it does not suffice to take x position of line, take x position of B, compare, and if the line's x is less than B's x, the line is on the left. Instead you also have to take the font data themselves into account and factor in the gap between the glyph origin and the glyph bounding box of the glyph represented by B.
For a more in-depth analysis please supply the font data.
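To make the difference between glyph origin and glyph bounding box concrete, here is a minimal Python sketch that maps a bounding box through the text matrix from the question and compares it with the line coordinate. The bounding box values (80 0 620 720) are purely hypothetical, since the real font data was not supplied. The sketch also assumes a simple font with 1000 glyph units per text space unit and ignores the CTM, text rise, and horizontal scaling.

```python
def glyph_bbox_in_user_space(tm, glyph_bbox, font_size=1.0):
    """Map a glyph bounding box (in 1/1000 glyph units) through a text matrix.

    tm is (a, b, c, d, e, f) as written before the Tm operator.
    Returns (min_x, min_y, max_x, max_y) in the space the text matrix maps to.
    """
    a, b, c, d, e, f = tm
    llx, lly, urx, ury = (v * font_size / 1000.0 for v in glyph_bbox)
    corners = [(llx, lly), (urx, lly), (urx, ury), (llx, ury)]
    # PDF row-vector convention: [x' y' 1] = [x y 1] x [[a b 0] [c d 0] [e f 1]]
    mapped = [(a * x + c * y + e, b * x + d * y + f) for x, y in corners]
    xs = [p[0] for p in mapped]
    ys = [p[1] for p in mapped]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical bounding box for "B"; the real one comes from the font program.
HYPOTHETICAL_B_BBOX = (80, 0, 620, 720)

rotated_tm = (0, 1, -1, 0, 90.72, 28.3705)  # text matrix from the question
box = glyph_bbox_in_user_space(rotated_tm, HYPOTHETICAL_B_BBOX)

line_coordinate = 27.891 + 0.48             # edge of the thin rectangle
gap = box[1] - line_coordinate              # positive: outline starts past the line
print(box, gap)
```

With the rotated matrix the writing direction runs along the user space y axis, so it is the second coordinate of the box that has to be compared with 28.371. Any positive left side bearing in the font pushes the visible outline past the line even though the glyph origin itself lies slightly before it.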
EDIT concerning your double-I question... in your comment above you say you actually expected to see a common point - the rotation point - in both I characters, so you can get hands on a reliable horizontal coordinate for the left bounding box side of a character.
Isn't the point where the red lines cross, your rotation point? It should be the glyph origin for both Tj operations, and the I-glyphs have their origins there. Now you can measure from there on.

Related

How to find the " PDF page origin " for an existing Page...?

Hi, I am trying to find the origin, i.e. the x and y coordinates, of a page. Are there any code examples "using PDFBOX", and also theory, that will help to find the origin of the page in the PDF?
By saying that I mean we need to find whether the origin is
left bottom? right bottom? right top? left top? or in the middle of the page?
First of all, I assume we are talking about user space coordinates, not device space coordinates. When rendering a PDF, coordinates eventually are translated to the device space of the rendering target. But device space coordinates are device dependent and, therefore, not really appropriate for generic PDF processing tasks.
The default user space coordinate system of a page
The default user space coordinate system is in particular used for positioning annotations and is the initial user space coordinate system when starting to process the instructions of the page content stream.
This coordinate system is specified by the effective crop box of the page (which defaults to its media box):
The user space coordinate system shall be initialised to a default state for each page of a document. The CropBox entry in the page dictionary shall specify the rectangle of user space corresponding to the visible area of the intended output medium (display window or printed page). The positive x axis extends horizontally to the right and the positive y axis vertically upward, as in standard mathematical practice (subject to alteration by the Rotate entry in the page dictionary).
(ISO 32000-2, section 8.3.2.3 "User space")
Thus, even without considering the page rotation, the origin may be anywhere inside, on the edge, or outside the visible page area, e.g. for the following CropBox values:
[ 0 0 612 792 ] - origin in the lower left
[ 0 -792 612 0 ] - origin in the upper left
[ -306 -396 306 396 ] - origin in the center of the page
[ -1612 1000 -1000 1792 ] - origin off page to the right and below
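The four cases above can be checked mechanically. The following is a small Python sketch, not PDFBox code; the function name origin_position and the returned labels are made up for illustration, and the crop box is assumed to be normalised (lower left corner first).

```python
def origin_position(crop_box):
    """Describe where the point (0, 0) lies relative to a crop box.

    crop_box is (llx, lly, urx, ury) with llx < urx and lly < ury.
    Returns a (horizontal, vertical) pair of labels; page rotation is ignored.
    """
    llx, lly, urx, ury = crop_box

    def classify(lo, hi, before, after, lo_edge, hi_edge):
        if 0 < lo:
            return before      # the whole range lies beyond 0
        if 0 > hi:
            return after       # the whole range lies before 0
        if 0 == lo:
            return lo_edge
        if 0 == hi:
            return hi_edge
        return "inside"

    horizontal = classify(llx, urx, "left of page", "right of page",
                          "left edge", "right edge")
    vertical = classify(lly, ury, "below page", "above page",
                        "bottom edge", "top edge")
    return horizontal, vertical


for box in [(0, 0, 612, 792), (0, -792, 612, 0),
            (-306, -396, 306, 396), (-1612, 1000, -1000, 1792)]:
    print(box, origin_position(box))
```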
If you also take page rotation into account, the origin rotates with the page:
Key | Type | Value
Rotate | integer | (Optional; inheritable) The number of degrees by which the page shall be rotated clockwise when displayed or printed. The value shall be a multiple of 90. Default value: 0.
(ISO 32000-2, Table 31 "Entries in a page object")
So e.g. for the crop box [ 0 0 612 792 ] for the following Rotate values:
0 - origin in the lower left
90 - origin in the upper left
180 - origin in the upper right
270 - origin in the lower right
and for the crop box [ -1612 1000 -1000 1792 ]:
0 - origin off page to the right and below
90 - origin off page to the left and below
180 - origin off page to the left and above
270 - origin off page to the right and above
Of course the directions of the coordinate axes also change to match the rotation:
0 - x coordinates increase to the right, y coordinates upwards
90 - x coordinates increase downwards, y coordinates to the right
180 - x coordinates increase to the left, y coordinates downwards
270 - x coordinates increase upwards, y coordinates to the left
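The effect of Rotate on the origin and the axes can be sketched as a mapping from default user space to the page as displayed. This is an illustrative Python sketch; the function name to_displayed and the choice to measure the result from the lower left corner of the displayed page are assumptions made for this example.

```python
def to_displayed(point, crop_box, rotate=0):
    """Locate a default-user-space point on the page as displayed.

    The result is measured from the lower left corner of the displayed
    (already rotated) page, x to the right and y upwards.
    """
    llx, lly, urx, ury = crop_box
    w, h = urx - llx, ury - lly              # unrotated page size
    u, v = point[0] - llx, point[1] - lly    # offset from the crop box corner
    rotate %= 360
    if rotate == 0:
        return (u, v)
    if rotate == 90:     # clockwise: x runs downwards, y to the right
        return (v, w - u)
    if rotate == 180:    # x runs to the left, y downwards
        return (w - u, h - v)
    if rotate == 270:    # x runs upwards, y to the left
        return (h - v, u)
    raise ValueError("Rotate must be a multiple of 90")


letter = (0, 0, 612, 792)
for angle in (0, 90, 180, 270):
    print(angle, to_displayed((0, 0), letter, angle))
```

For Rotate 90 the origin comes out at (0, 612), which is the upper left corner because the displayed page is 792 wide and 612 high.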
The current user space coordinate system of a page
While processing the instructions of a page content stream, the user space may be transformed, in particular by the cm instruction:
Operands | Operator | Description
a b c d e f | cm | Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.
(ISO 32000-2, Table 56 "Graphics state operators")
One use case for this is to have the current coordinate system "the right side up" after rotation.
For example for the crop box [ 0 0 612 792 ] and the page rotation 90, the coordinate system has its origin in the upper left, x coordinates increase downwards, and y coordinates increase to the right. To straighten this out, you'll often find a cm instruction like this at the start of the page content stream:
0 1 -1 0 612 0 cm
After this instruction the origin on the rotated page in our example is again in the lower left, and x coordinates increase to the right and y coordinates upwards.
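As a quick check of that claim, this Python sketch applies the matrix 0 1 -1 0 612 0 to the new origin and to one step along each new axis, giving their positions in the default user space. The helper name apply_matrix is made up for this example.

```python
def apply_matrix(m, point):
    """Apply a PDF matrix (a, b, c, d, e, f) to a point, row-vector convention."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


CM = (0, 1, -1, 0, 612, 0)

new_origin = apply_matrix(CM, (0, 0))   # where the new origin sits in default user space
x_step = apply_matrix(CM, (1, 0))       # one unit along the new x axis
y_step = apply_matrix(CM, (0, 1))       # one unit along the new y axis
print(new_origin, x_step, y_step)
```

The new origin is the point (612, 0) of the default user space, i.e. the lower right corner of the unrotated crop box, which the 90 degree clockwise rotation displays in the lower left. A step along the new x axis increases the old y (displayed to the right), and a step along the new y axis decreases the old x (displayed upwards).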

How to get a uniform line width in PDF regardless of the device space aspect ratio?

The width of a line in PDF is defined in terms of distances in the user space. In my use case, the aspect ratio of the device space (e.g. 4:3) is different from the aspect ratio of the user space (e.g. 1:1), which causes the line widths in the device space to be different in vertical and horizontal directions.
For example, in this picture the horizontal and vertical lines should be of the same width, but they're not:
I would like to perform scaling that only results in line width uniformity and does not affect anything else.
I asked a similar question regarding PostScript here: How to ensure line widths are the same vertically and horizontally in PostScript?. A solution based in part on the answer to this question works for PostScript, but does not work in PDF after what seems to be an almost one-to-one translation.
I tried changing the stroke command S to q 1 0 0 1.5 0 0 cm S Q h, where q saves the graphics state, 1 0 0 1.5 0 0 cm scales the current transformation matrix, Q restores the graphics state, and h closes the current subpath. However, in addition to correctly scaling the line widths, this also scales the y-coordinates of the line endpoints by 1.5.
This is what I need to get:
But with q 1 0 0 1.5 0 0 cm S Q h, I get this instead:
How to make the line width uniform in the device space in PDF without affecting anything else?

PDF Rectangle[re] display position differs from object position in PDF document

There are two rectangles
on the page.
Page Contents:
/OC /MC0 BDC
0.087 0.963 0.488 0.002 k
0 0 0 0 K
/GS0 gs
118.442 63.791 61.046 133.721 re
B
92.977 141.837 21.744 55.674 re
B
EMC
The actual Y position of the left (little) rectangle [141.837] is higher than that of the right (big) rectangle.
Why are they displayed as if they had a similar Y position?
P.S.: the transformation matrix [CTM] of the left rectangle is the standard one.
I tried to get the actual coordinates (from the pdf page content stream) and then put them into a new file. The result is
I wish to know why the left rectangle is displayed at Y=53.988 and not at Y=141.837.
In PDF the origin of the default coordinate system is located in the bottom left corner; Y is relative to the bottom margin, not the top.
63.791 + 133.721 ≈ 141.837 + 55.674 ≈ 197.51 (same top Y)
Glad to see you are using our XFINIUM.PDF Inspector to look inside the PDF files. The PDF bounds are relative to standard PDF coordinate system, the Display bounds are relative to top left corner of the visible page area.
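The arithmetic can be spelled out in a few lines of Python. The rectangle operands are taken from the content stream above; ASSUMED_PAGE_HEIGHT is an assumption chosen only so that the result lands near the value reported in the question, since the real height of the visible page area is not given here.

```python
# Operands of the two re operators: x, y (lower left corner), width, height
big = (118.442, 63.791, 61.046, 133.721)
small = (92.977, 141.837, 21.744, 55.674)

# Assumption for illustration; the real value is the height of the crop box.
ASSUMED_PAGE_HEIGHT = 251.5


def top_edge(rect):
    """Top edge in PDF coordinates (y measured upwards from the bottom)."""
    x, y, width, height = rect
    return y + height


def display_y(rect, page_height):
    """Top edge measured downwards from the top of the visible page area."""
    return page_height - top_edge(rect)


print(top_edge(big), top_edge(small))
print(display_y(big, ASSUMED_PAGE_HEIGHT), display_y(small, ASSUMED_PAGE_HEIGHT))
```

Both top edges come out at about 197.51, so a viewer that reports positions from the top left corner shows nearly the same Y for both rectangles, whatever the page height is.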

Calculating the exact positions of (Td, TD, Tm, cm, T*) content stream in pdf?

Getting or calculating the exact positions of (Td, TD, Tm, cm, T*) content stream in pdf?
As a human I am able to work out the positions of the operators in a pdf content stream (whether a value replaces the last Td, is added to the last Td, or is multiplied by the font size) by comparing where the glyphs are located in the pdf with the position values in the content stream. But I am unable to calculate the exact positions of the glyphs programmatically. Please see the screenshot.
In the above image, the left box shows the glyphs in the pdf UI and the right box contains the related content stream. In the content stream I highlighted two Td positions.
In first circle
3.321 -6.475999832 Td
The Td operands should be added to the last Td position. Assume that position is x1, y1:
Current_x_pos = x1+3.321
Current_y_pos = y1-6.475999832
Then we get the exact position of the glyph "t".
In the second highlighted circle, the new Td operands (231.544 366.377990 Td) completely replace the position:
Current_x_pos = 231.544
Current_y_pos = 366.377990
On top of that, sometimes the preceding operator is Tm, in which case the formula might look like this:
Current_x_pos = x1+(tdx1*font_size)
Current_y_pos = y1+(tdy1*font_size)
Sometimes we need to multiply like above, and sometimes to add. How can I know programmatically which one applies, so that I can parse exact positions? (New screenshot added for the multiplication case.)
Any help ?
Thanks.
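Sometimes we need to multiply like above, and sometimes to add. How can I know programmatically which one applies, so that I can parse exact positions?
It's quite simple: for a Td operation you always multiply. According to the specification ISO 32000-1 (similarly ISO 32000-2), Td sets Tm = Tlm = M × Tlm, where M is the matrix with the rows 1 0 0, 0 1 0, and tx ty 1.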
For a freshly initialized (i.e. identity) text line matrix Tlm this matrix multiplication looks like replacing its bottom row with tx ty 1.
For a text line matrix Tlm with only changes in the bottom row against an identity this matrix multiplication looks like an addition to the bottom row, e.g. x y 1 becomes x+tx y+ty 1.
For a text line matrix Tlm like in your second example
a 0 0
0 a 0
x y 1
this matrix multiplication looks like a multiplication with a followed by an addition to the bottom row, i.e. x y 1 becomes x+a·tx y+a·ty 1. If the font size parameter of the preceding Tf operation was 1, then a would effectively be the resultant font size giving rise to your assumption the font size is part of the formula.
In general, for an arbitrary, non-degenerate text line matrix Tlm
a b 0
c d 0
x y 1
this matrix multiplication looks even more complex, x y 1 becomes x+a·tx+c·ty y+b·tx+d·ty 1.
Thus, concerning your question
How can I know programmatically which one applies, so that I can parse exact positions?
your program should simply always use matrix multiplication and ignore what it looks like on the level of the separate coordinates.
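A minimal Python sketch of that approach could look as follows. The class and method names are made up for this example, and it only tracks the text line matrix for BT, Tm and Td; glyph advances from Tj, the leading set by TD, T*, and the CTM are deliberately left out. The Tm values 7.5 0 0 7.5 100 200 are hypothetical.

```python
class TextLineState:
    """Track the text line matrix Tlm as (a, b, c, d, e, f)."""

    IDENTITY = (1, 0, 0, 1, 0, 0)

    def __init__(self):
        self.tlm = self.IDENTITY

    def begin_text(self):
        """BT: reset the text line matrix to the identity."""
        self.tlm = self.IDENTITY

    def set_matrix(self, a, b, c, d, e, f):
        """Tm: replace the text line matrix."""
        self.tlm = (a, b, c, d, e, f)

    def move(self, tx, ty):
        """Td: Tlm = [[1 0 0] [0 1 0] [tx ty 1]] x Tlm, always a multiplication."""
        a, b, c, d, e, f = self.tlm
        self.tlm = (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)

    @property
    def position(self):
        """Origin of the current line, i.e. the bottom row of Tlm."""
        return self.tlm[4], self.tlm[5]


state = TextLineState()
state.begin_text()
state.move(231.544, 366.37799)        # looks like a replacement after BT
print(state.position)
state.move(3.321, -6.475999832)       # looks like an addition
print(state.position)

state.set_matrix(7.5, 0, 0, 7.5, 100, 200)   # hypothetical Tm with scale 7.5
state.move(2, -1)                            # looks like multiply, then add
print(state.position)
```

The same move method covers all three cases from the question; whether it looks like a replacement, an addition, or a multiplication depends only on what the matrix held before.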
What makes the second circled instruction look like a mere replacement, is that the prior text line matrix is the identity matrix. This is not due to the restore-state operation as assumed by François, though, but more simply to the start of text object operation BT:
As the text matrix and the text line matrix are reset at the start of a text object and the graphics state cannot be saved or restored in a text object, the save and restore graphics state operations are not to blame in this case.
(Screen shots are from the ISO 32000-1 copy shared by Adobe.)
When you say:
In the second highlighted circle, the new Td operands (231.544 366.377990 Td) completely replace the position
Actually, the positions Current_x_pos and Current_y_pos are not replaced. This Td command does exactly what it always does:
Current_x_pos = x1 + 231.544
Current_y_pos = y1 + 366.377990
It is the Q from 3 lines above that reloads the previous graphics state, right after the current graphics state has been saved with q.

PDF Low-Level: Drawing a line in the content object?

I have searched extensively online and I have the PDF specification in which I have looked, yet I still can't figure out how to draw a simple black line on a PDF page from the content object's instructions (stream).
Let's say I just want to draw a 1-pixel thickness (assuming 72 dpi) black line at x 400, y 100-300.
This should in theory be a very simple operation, but the PDF spec goes on and on about all kinds of fancy things and appears to forget to explain how I would go about performing this simple operation.
Please can someone point me in the right direction?
In the PDF specification, have a look at chapter 8 (Graphics) and in there section 8.5, Path Construction and Painting.
To draw a simple straight path, you need a "move to" operation followed by a "line to" operation:
400 100 m
400 300 l
You can then stroke the path using the S operator so your code becomes
400 100 m
400 300 l
S
By default the color is black, so you've already gotten a black line :-) But if you want to make sure, you have to set some parameters in the graphics state.
0 G
1 w
400 100 m
400 300 l
S
The first line now sets the stroking color space to "gray" (DeviceGray) and sets the shade of gray to 0 (black). The following line sets the line width of your stroked line to 1 user unit (what this comes out as depends on your current transformation matrix).
You can apply a neat trick if you really want 1 pixel (please don't do this for production files, though!), and that is to set the width to zero:
0 w
This gives you "the thinnest line that can be rendered at device resolution: 1 device pixel wide".
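To try the operators from this answer in a viewer, they have to sit inside a complete file. The following Python sketch wraps the content stream in the smallest PDF structure I would expect a viewer to accept: catalog, page tree, one page, one content stream, plus the cross-reference table. The function name build_pdf and the 612 x 792 media box are assumptions for this example, and the output has not been validated against any particular viewer.

```python
def build_pdf(content: bytes) -> bytes:
    """Wrap a page content stream in a minimal single-page PDF file."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))              # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # entry for object 0, always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset   # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


# Black stroke colour, 1 user unit wide, vertical line from (400, 100) to (400, 300)
LINE_CONTENT = b"0 G\n1 w\n400 100 m\n400 300 l\nS"
pdf_bytes = build_pdf(LINE_CONTENT)
```

Writing pdf_bytes to a file opened in binary mode gives something to open in a viewer; swapping 1 w for 0 w in LINE_CONTENT is an easy way to compare the two line widths.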