I have some text that I want to edit (justified text, really annoying), so I was wondering if this:
BT /FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
(Some) Tj
36.77199936 0 Td
(text) Tj
38.4280014 0 Td
(stuff) Tj
33.42799759 0 Td
...
is equivalent to this:
BT
/FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
[(Some)-36.77199936*1000(text)-38.4280014*1000(stuff)-33.42799759*1000] TJ
...
Assuming horizontal text, we determined in my answer to your previous question that the horizontal displacement tx corresponding to a number Tj in a TJ array can be calculated as
tx = (−Tj / 1000) × Tfs × Th
where Tfs is the current font size and Th is the current horizontal scaling factor.
Thus, if you have a horizontal displacement tx and want to calculate the corresponding number Tj for a TJ array, you simply solve the equation above for Tj:
Tj = -1000 × tx / (Tfs × Th)
BUT this is not exactly the situation in your case because Td does not simply shift the text matrix by its parameters but instead shifts the text line matrix by them and sets the text matrix to the new text line matrix value:
tx ty
Td
Move to the start of the next line, offset from the start of the current line by (tx, ty). tx and ty shall denote numbers expressed in unscaled text space units. More precisely, this operator shall perform these assignments:
            1  0  0
Tm = Tlm =  0  1  0  × Tlm
            tx ty 1
(ISO 32000-1, Table 108 – Text-positioning operators)
Thus, the tx parameter of Td is not the tx to put into the equation above but you instead have to subtract the width of the text drawn since the last setting of the text line matrix.
So to transform your example
BT /FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
(Some) Tj
36.77199936 0 Td
(text) Tj
38.4280014 0 Td
(stuff) Tj
33.42799759 0 Td
into a
BT
/FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
[(Some) NUM1 (text) NUM2 (stuff) NUM3] TJ
form, you calculate the numeric values NUM1, NUM2, and NUM3 like this:
NUM1 = -1000 × (36.77199936 - width("Some")) / (Tfs × Th)
NUM2 = -1000 × (38.4280014 - width("text")) / (Tfs × Th)
NUM3 = -1000 × (33.42799759 - width("stuff")) / (Tfs × Th)
When calculating the widths of those strings remember to take the font size, the character spacing, and the horizontal scaling into account!
And even then the two forms are not identical because the text line matrix at the end differs.
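The calculation above can be sketched in Python as follows. This is only an illustration: the function name and the string widths are hypothetical, because the real widths depend on the glyph metrics of /FAAABA, which are not shown in the question. The widths are assumed to be in unscaled text space units and to already include font size, character spacing, and horizontal scaling.

```python
def td_offsets_to_tj_numbers(td_offsets, widths, font_size, h_scale=1.0):
    """Convert Td x-offsets into TJ adjustment numbers.

    td_offsets[i] is the tx operand of the Td that follows string i.
    widths[i] is the rendered width of string i in unscaled text space units.
    """
    return [
        -1000.0 * (tx - width) / (font_size * h_scale)
        for tx, width in zip(td_offsets, widths)
    ]


# Hypothetical widths of "Some", "text", "stuff"; look up the real ones in the font.
widths = [30.0, 20.0, 25.0]
td_offsets = [36.77199936, 38.4280014, 33.42799759]

nums = td_offsets_to_tj_numbers(td_offsets, widths, font_size=10)
print(" ".join(f"{n:.3f}" for n in nums))
```

With real widths, the printed values would be the NUM1, NUM2, and NUM3 entries of the TJ array.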
I have a basic example from a PDF I'm editing.
The code
/P <</MCID 0>> BDC q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 12 Tf
1 0 0 1 56.64 759.96 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[(n)4(a)4(m)4(e)] TJ
ET
Q
q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 12 Tf
1 0 0 1 109.7 759.96 Tm
0 g
0 G
[( )] TJ
ET
Q
works perfectly, producing "name" without quotes when I open the PDF.
Sadly, if I change the n to a c, something goes wrong:
The same thing happens if I write [(N)4(a)4(m)4(e)] TJ (capital N) or [\(Name)] TJ.
What am I doing wrong?
Perhaps your font is subset, and so does not have a glyph for c. Your PDF viewer may be substituting a glyph from another standard font, but obeying the given metadata width for the c glyph in the font dictionary for your subset font, which will be 0 for a missing glyph. Hence the overwriting.
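To illustrate the guess above, here is a small Python sketch of how a viewer might look up glyph widths for a simple font. The font data is entirely hypothetical (a subset covering only the glyphs of "name"); the fallback to a missing width of 0 follows the default for MissingWidth in the font descriptor.

```python
def glyph_width(code, first_char, widths, missing_width=0):
    """Look up a character code in a simple font's /Widths array.

    Codes outside FirstChar..LastChar fall back to missing_width.
    """
    index = code - first_char
    if 0 <= index < len(widths):
        return widths[index]
    return missing_width


# Hypothetical subset font: only n, a, m, e carry real widths.
used = {"n": 556, "a": 556, "m": 833, "e": 556}
first_char, last_char = ord("a"), ord("n")
widths = [used.get(chr(code), 0) for code in range(first_char, last_char + 1)]


def advance(text):
    """Sum of glyph widths in glyph space units, ignoring TJ adjustments."""
    return sum(glyph_width(ord(ch), first_char, widths) for ch in text)


print(advance("name"), advance("came"))
```

In this made-up font the c contributes no advance, so whatever glyph the viewer substitutes for it would be drawn at the same position as the following a.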
Edit: this should have been a comment, not an answer. Sorry.
I am trying to calculate the coordinates of the displayed image. The actual image is larger than what is shown below (fig1); only part of the image is visible. I want to understand how the matrices transform it, i.e. how to calculate the coordinates of the visible part of the image.
fig1
The content stream looks like this:
The coordinates I get when I multiply the cm matrix of the first q block by the cm matrix of the second q block are
[-122.196, 356.535, 484.061, 759.372]
But these are the coordinates of the full image. How does the 're' change the calculation for the visible part of the image?
File
original pdf
After removing the 're' and 'W*'
I need another answer for the same scenario.
second file
What I tried
0.24 0 0
0 -0.24 0
0 850 1
Applying the 're' to the above CTM gives
[595.92 0 0 7.05]
The next cm instruction becomes the CTM and looks like
1 0 0
0 1 0
262 404 1
What will the resulting matrix be, and how can I calculate it?
For the sake of brevity I'm going to round the numbers a bit here.
Case 1
Let's simply analyze your content stream excerpt:
Let's assume that the preceding instructions left the user space coordinate system and the clip path in its default state, so we can assume an identity current transformation matrix (CTM) and a clip path encompassing the whole page.
The first instruction
.24 0 0 -.24 0 850 cm
then changes the CTM to
0.24 0 0
0 -0.24 0
0 850 1
Thus, the rectangle path defined and used as a clip path thereafter
169 349.49 1038.37 1670.15 re
has, in the default user space, the coordinates (lower left, upper right):
[40.56 365.29 289.77 766.12]
Then the next cm instruction
2517.74 0 0 -1670.15 -504.99 2019.64 cm
changes the CTM to
604.26 0 0
0 400.84 0
-121.2 365.29 1
So the following bitmap image
/Im0 Do
is drawn, in the default user space, at the coordinates (lower left, upper right):
[-121.2 365.29 483.06 766.13]
This area is partially outside the clip path, so we get the visible image area in user space coordinates by intersecting those coordinates with the clip path, resulting in
[40.56 365.29 289.77 766.12]
So these are the coordinates you appear to be looking for.
Beware, in general clip paths can have arbitrary forms, and the CTM at image drawing time may not only scale, mirror, or translate (resulting in a rectangle parallel to the axes) but also rotate or skew (resulting in a rhomboid or something not parallel to the axes). Thus, calculating the intersection and making sense of the result is more complicated in general.
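The Case 1 steps can be retraced with a few lines of Python. This is a minimal sketch under the same assumptions as above (identity CTM at the start, CTMs that only scale, mirror, and translate); the helper names are made up for this example.

```python
def concat(m, ctm):
    """Return m x ctm for PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(ctm, x, y):
    """Map the point (x, y) into default user space."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def bbox(ctm, x, y, w, h):
    """Normalized box of a transformed rectangle (no rotation or skew)."""
    x1, y1 = apply(ctm, x, y)
    x2, y2 = apply(ctm, x + w, y + h)
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def intersect(r, s):
    """Intersection of two boxes given as (llx, lly, urx, ury)."""
    return (max(r[0], s[0]), max(r[1], s[1]), min(r[2], s[2]), min(r[3], s[3]))


ctm = concat((0.24, 0, 0, -0.24, 0, 850), (1, 0, 0, 1, 0, 0))
clip = bbox(ctm, 169, 349.49, 1038.37, 1670.15)   # the re operands
ctm = concat((2517.74, 0, 0, -1670.15, -504.99, 2019.64), ctm)
image = bbox(ctm, 0, 0, 1, 1)                      # images fill the unit square
visible = intersect(image, clip)
print([round(v, 2) for v in visible])
```

The printed rectangle should match the visible area given above, [40.56, 365.29, 289.77, 766.12].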
In a comment you ask
But still I need bit more clarity to, after 're' how you got [40.56 365.29 289.77 766.12] this. How the calculation is happening.
I got those by applying the CTM to two diagonally opposed corners of the rectangle.
To get two such corners of
169 349.49 1038.37 1670.15 re
I first took the anchor point at 169 349.49 and, as the second point, the anchor point with width and height added, i.e. 1207.37 2019.64.
Then I applied the CTM to those two points
                        0.24  0      0
[169 349.49 1]       ×  0     -0.24  0  = [40.56 766.12 1]
                        0     850    1

                        0.24  0      0
[1207.37 2019.64 1]  ×  0     -0.24  0  = [289.77 365.29 1]
                        0     850    1
So I get the transformed corners at 40.56 766.12 and 289.77 365.29.
Due to the mirroring the resulting points were not lower-left to upper-right but instead upper-left to lower-right. Thus, I normalized the rectangle to [40.56 365.29 289.77 766.12].
Beware, this calculation makes use of the fact that the CTM only scaled, mirrored, and translated. If it also rotated or skewed, I would have had to apply the CTM to all corners of the rectangle (or at least three of them) and then work with the rhomboid spanned by them.
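For the general case mentioned in the last paragraph, one could transform all four corners instead of two. The following Python sketch does that; the function names are hypothetical, and the example reuses the CTM and rectangle from above.

```python
def apply(ctm, x, y):
    """Map the point (x, y) through a PDF matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def transformed_corners(ctm, x, y, w, h):
    """Transform all four corners of the rectangle defined by x y w h re."""
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    return [apply(ctm, px, py) for px, py in corners]


def bounding_box(points):
    """Smallest axis-parallel box (llx, lly, urx, ury) around the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


ctm = (0.24, 0, 0, -0.24, 0, 850)
corners = transformed_corners(ctm, 169, 349.49, 1038.37, 1670.15)
box = bounding_box(corners)
print([round(v, 2) for v in box])
```

For this CTM the bounding box is the transformed rectangle itself. With rotation or skew the four corners span a rhomboid, and the bounding box is only an outer approximation of it.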
Case 2
In an edit to your question you added another case:
This example shows that one has to inspect the XObject in question first.
If one assumed that Fm0 was an image XObject, the image would be drawn in a .24×.24 default user space units square, a tiny dot.
But Fm0 is not an image XObject, instead it is a form XObject which in turn shows an image XObject from its own resources. Thus, here is another step in the calculations:
The first instruction
.24 0 0 -.24 0 850 cm
then changes the CTM to
0.24 0 0
0 -0.24 0
0 850 1
Thus, the rectangle path defined and used as a clip path thereafter
0 0 2483.33 3512.32 re
has, in the default user space, the coordinates (lower left, upper right):
[0 7.04 596 850]
Then the next cm instruction
1 0 0 1 262 404 cm
changes the CTM to
0.24 0 0
0 -0.24 0
62.88 753.04 1
Due to
/Fm0 Do
we then have to continue with the XObject Fm0. First of all it has a bounding box entry
[ 0 0 1959 1306 ]
Applying the CTM to this we get a bounding box in the default user space of
[62.88 439.6 533.04 753.04]
which has to be intersected with the clip path.
The relevant content of the Fm0 is
0.72 196.505 1957.892 913.266 re
W* n
q
/GS0 gs
1957.8926 0 0 -1304.21912 0.7203979 1305.13342 cm
/Im0 Do
Q
Thus, the rectangle path defined and intersected with the clip path thereafter
0.72 196.51 1957.89 913.27 re
has, in the default user space, the coordinates (lower left, upper right):
[63.05 486.69 532.95 705.88]
Then the next cm instruction
1957.89 0 0 -1304.22 0.72 1305.13 cm
changes the CTM to
469.89 0 0
0 313.01 0
63.05 439.81 1
So the following bitmap image
/Im0 Do
is drawn, in the default user space, at the coordinates (lower left, upper right):
[63.05 439.81 532.94 752.82]
The effective clip path at this time is the intersection of the rectangles
[0 7.04 596 850]
[62.88 439.6 533.04 753.04]
[63.05 486.69 532.95 705.88]
so it is the rectangle
[63.05 486.69 532.95 705.88]
Thus, the visible area of that drawn image is
[63.05 486.69 532.94 705.88]
(Well, I hope it is, but maybe I somewhere along the path erred in some calculation...)
I'm trying to write a PDF parser in C# but I've run into an issue where I'm unsure how to interpret the specification.
Unless otherwise specified user space in a PDF document is 1/72 of an inch (i.e. 1pt).
The scale provided by the Tf operator scales the font from the standard size (generally 1 unit of user space / 1pt) to the correct display size.
I have the following page content:
1 0 0 -1 0 792 cm
q
0 0 612 792 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1056 re
f
0 0 816 1056 re
f
0 0 816 1056 re
f
Q
Q
q
0 0 612 791.25 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1055 re
f
0 96 816 960 re
f
0 0 0 RG 0 0 0 rg
BT
/F0 21.33 Tf
1 0 0 -1 0 140 Tm
96 0 Td <0037> Tj
13.0280762 0 Td <004B> Tj
11.8616943 0 Td <004C> Tj
4.7384338 0 Td <0056> Tj
ET
BT
/F1 21.33 Tf
1 0 0 -1 0 140 Tm
136.292267 0 Td <0001> Tj
ET
...
I know that the font size in points of the 2 text operations defined in the sample is 16pt; however, the Tf operator is using a size of 21.33. In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:
21.33 * 0.75 = 15.9975
However I could find nothing in the PDF specification supporting this conversion and none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.
Should I use the CTM (as defined by the cm operator) to scale the font size back to the correct scale or is this just pure chance?
The pdf file is here: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.Tests/Integration/Documents/Single%20Page%20Simple%20-%20from%20google%20drive.pdf
First of all, your comparison with other text extractors is based on a misunderstanding:
none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.
The "font size" parameter returned by all those libraries simply is the size argument of the Tf instruction, not the effective font size you observe in the final document, which is what you are trying to determine. So your comparison with other libraries does not make sense.
Now, concerning your approach:
In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:
21.33 * 0.75 = 15.9975
While some libraries call it so, calling the fourth cm parameter "scale (y)" is misleading. E.g. in case of text rotated by 90° it usually is zero while the graphic representation usually is not reduced to zero height.
Thus, merely using the "scale (y)" parameter does not work, you have to take the whole transformation into account.
Finally, let's discuss what you actually are after.
As long as the combined transformation matrix (current transformation matrix + text matrix + horizontal scaling) is orthogonal and text lines are following this orthogonality, the meaning of your notion of font size is fairly obvious.
But as soon as there is a shearing in that combined matrix, the meaning of "font size" is not obvious anymore.
You might mean the length of what an originally vertical line (one unit high) is transformed into.
You might mean the length of the projection of that transformed line onto a line at a right angle to the transformed font base line.
Or you might mean the length of the projection of that transformed line onto a line at a right angle to an observed base line.
The first two numbers are trivial to calculate using simple linear algebra. The third number may be more difficult because you have to determine the base line observed by humans in the resulting PDF. In case of innovative use of transformations this might be non-trivial.
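As a rough illustration of the first two candidates, here is a Python sketch using the matrices from the question. The function names are made up, and horizontal scaling is left out because it only stretches the text along the base line.

```python
import math


def concat(m1, m2):
    """Return m1 x m2 for PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def effective_font_sizes(font_size, text_matrix, ctm):
    """Return (length of the transformed vertical line,
    its component at a right angle to the transformed base line)."""
    a, b, c, d, _, _ = concat(text_matrix, ctm)
    vx, vy = c * font_size, d * font_size      # image of (0, font_size)
    length = math.hypot(vx, vy)
    perpendicular = abs(a * vy - b * vx) / math.hypot(a, b)
    return length, perpendicular


# CTM after "1 0 0 -1 0 792 cm" followed by ".75 0 0 .75 0 0 cm"
ctm = concat((0.75, 0, 0, 0.75, 0, 0), (1, 0, 0, -1, 0, 792))
tm = (1, 0, 0, -1, 0, 140)

length, perpendicular = effective_font_sizes(21.33, tm, ctm)
print(round(length, 4), round(perpendicular, 4))
```

For this document both numbers come out at about 15.9975, i.e. the 16pt observed in the question, because there is no rotation or shearing involved. The Td operators only translate and therefore do not affect the result.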
Consider the following operator sequence:
Tf: R8 9.96
Tm: 0 1.00057 -1 0 105.12 60.3506
TJ: line 1:
Tf: R8 9.96
Tm: 0 1.00057 -1 0 105.12 95.9906
TJ: value 1
Tm: 0 1.00057 -1 0 116.16 60.3505
TJ: line 2:
Tf: R8 9.96
Tm: 0 1.00057 -1 0 116.16 124.551
TJ: value 2
Tm: 0 1.00057 -1 0 127.2 60.3507
TJ: line 3:
Tf: R8 9.96
Tm: 0 1.00057 -1 0 127.2 106.671
TJ: value 3
Tm: 0 1.00057 -1 0 138.24 60.3508
TJ: line 4:
Tf: R8 9.96
Tm: 0 1.00057 -1 0 138.24 112.791
TJ: value 4
PDF displays it as:
line 1: value 1
line 2: value 2
line 3: value 3
line 4: value 4
According to the PDF documentation, a matrix consists of [a b c d e f], where e = Tx and f = Ty.
From the first two command blocks (which produce the first line of text) I noticed that Tx and Ty have effectively switched places: 105.12 stays the same, so it must determine the vertical position.
PDF reference also says about rotation:
Rotations are produced by [ cos θ sin θ −sin θ cos θ 0 0 ], which has
the effect of rotating the coordinate system axes by an angle θ
counterclockwise.
This seems to be why Tx changes the vertical position and Ty the horizontal one, since sin(90°) = 1 and cos(90°) = 0, meaning a rotation by 90° counterclockwise.
Questions:
Why do the lines appear in the correct order in the actual PDF document although e (Tx), which changes the vertical position given the rotation, increases from line to line? According to the translation, e (Tx) should decrease.
Why are the letters and words not rotated? Only e (Tx) and f (Ty) are switched, and that is all.
You only consider text matrix settings. You don't tell us about the current transformation matrix at the time of those text objects, and neither do you tell us about the page rotation value.
Considering your observations I would assume the page globally is rotated 90° clockwise.
This would explain why your 90° counterclockwise rotated text appears upright (your second question).
Furthermore with that page rotation the x axis would be vertical with coordinate values rising downwards answering your first question.
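This assumption can be checked numerically. The Python sketch below combines the first text matrix with a 90° clockwise page rotation; it assumes an identity CTM and a hypothetical page width of 612 units, neither of which is known from the question.

```python
def concat(m1, m2):
    """Return m1 x m2 for PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


PAGE_WIDTH = 612  # hypothetical; take the real value from the MediaBox

# Display transformation for /Rotate 90: (x, y) -> (y, PAGE_WIDTH - x)
rotate_90_clockwise = (0, -1, 1, 0, 0, PAGE_WIDTH)

tm_line1 = (0, 1.00057, -1, 0, 105.12, 60.3506)
tm_line2 = (0, 1.00057, -1, 0, 116.16, 60.3505)

shown1 = concat(tm_line1, rotate_90_clockwise)
shown2 = concat(tm_line2, rotate_90_clockwise)
print([round(v, 4) for v in shown1])
print([round(v, 4) for v in shown2])
```

Under these assumptions the displayed matrices have the form [1.00057 0 0 1 f W-e]: the text is upright, f becomes the horizontal position, and a larger e gives a smaller displayed y value, i.e. a line further down the page.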
Some references
Rotate - integer - (Optional; inheritable) The number of degrees by which the page
shall be rotated clockwise when displayed or printed. The value
shall be a multiple of 90. Default value: 0.
(Table 30 – Entries in a page object - ISO 32000-1)
CTM - array - The current transformation matrix, which maps positions from
user coordinates to device coordinates (see 8.3, "Coordinate
Systems"). This matrix is modified by each application of the
coordinate transformation operator, cm. Initial value: a matrix
that transforms default user coordinates to device
coordinates.
(Table 52 – Device-Independent Graphics State Parameters - ISO 32000-1)
I have a PDF document made by LaTeX which contains a table.
Which PDF operators represent this table? I think LaTeX draws the table, right?
I ask because I want to extract it using the PDFBox library.
When I decoded the PDF table I found these lines related to graphical objects and text.
Do the lines between q and Q draw the lines for the table?
stream
q
1 0 0 1 139.746 715.892 cm
[]0 d 0 J 0.398 w 0 0 m 100.9 0 l S
Q
q
1 0 0 1 139.746 703.738 cm
[]0 d 0 J 0.398 w 0 0 m 0 11.955 l S
Q
BT
/F8 9.9626 Tf 148.795 707.324 Td [(aaaa)]TJ
ET
q
1 0 0 1 186.626 703.738 cm
[]0 d 0 J 0.398 w 0 0 m 0 11.955 l S
Q
BT
/F8 9.9626 Tf 198.277 707.324 Td [(bbbb)]TJ
ET
The explanation for the commands can easily be found in Adobe's PDF Reference 1.7.
One command at a time, and remembering that PDF has postfix notation, we can find in Chapter 4 "Graphics":
q % save graphics state (§4.2.1)
1 0 0 1 139.746 715.892 cm % set transform matrix (§4.2.3)
% --this is a simple 'translate' to (139.746,715.892)
[]0 d % set dash pattern to solid (§4.3.3)
0 J % set line cap to Butt
0.398 w % set line width to 0.398 units
0 0 m % move "current point" (§4.4.1)
100.9 0 l % append straight line
S % stroke the path (§4.4.2)
Q % restore the graphics state
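To see which lines the q ... Q blocks actually draw, the operators explained above can be replayed with a tiny interpreter. The Python sketch below is a simplification made up for this answer: it splits the stream on whitespace and handles only q, Q, cm, m, l, and S, which is enough for the excerpt in the question but not for arbitrary content streams.

```python
def line_segments(content):
    """Collect stroked straight segments in page coordinates."""
    ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    stack, operands, path, segments = [], [], [], []
    current = None

    def apply(x, y):
        a, b, c, d, e, f = ctm
        return (a * x + c * y + e, b * x + d * y + f)

    for token in content.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator
        if token == "q":
            stack.append(ctm)
        elif token == "Q":
            ctm = stack.pop()
        elif token == "cm":
            a1, b1, c1, d1, e1, f1 = operands[-6:]
            a2, b2, c2, d2, e2, f2 = ctm
            ctm = (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
                   c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
                   e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)
        elif token == "m":
            current = apply(*operands[-2:])
        elif token == "l":
            point = apply(*operands[-2:])
            path.append((current, point))
            current = point
        elif token == "S":
            segments.extend(path)
            path = []
        operands = []  # operands belong to the operator just seen
    return segments


stream = """
q 1 0 0 1 139.746 715.892 cm []0 d 0 J 0.398 w 0 0 m 100.9 0 l S Q
q 1 0 0 1 139.746 703.738 cm []0 d 0 J 0.398 w 0 0 m 0 11.955 l S Q
"""
segments = line_segments(stream)
for start, end in segments:
    print([round(v, 3) for v in start], "->", [round(v, 3) for v in end])
```

For the first two blocks of the question this yields one horizontal and one vertical segment, which is consistent with the top rule and a cell border of the table.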