PDF Specification - Get Font Size in Points - pdf

I'm trying to write a PDF parser in C# but I've run into an issue where I'm unsure how to interpret the specification.
Unless otherwise specified user space in a PDF document is 1/72 of an inch (i.e. 1pt).
The scale provided by the Tf operator scales the font from the standard size (generally 1 unit of user space / 1pt) to the correct display size.
I have the following page content:
1 0 0 -1 0 792 cm
q
0 0 612 792 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1056 re
f
0 0 816 1056 re
f
0 0 816 1056 re
f
Q
Q
q
0 0 612 791.25 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1055 re
f
0 96 816 960 re
f
0 0 0 RG 0 0 0 rg
BT
/F0 21.33 Tf
1 0 0 -1 0 140 Tm
96 0 Td <0037> Tj
13.0280762 0 Td <004B> Tj
11.8616943 0 Td <004C> Tj
4.7384338 0 Td <0056> Tj
ET
BT
/F1 21.33 Tf
1 0 0 -1 0 140 Tm
136.292267 0 Td <0001> Tj
ET
...
I know that the font size of the 2 text operations in the sample is 16pt; however, the Tf operator uses a size of 21.33. To convert from this font size back to points, I was intending to use the scale (y) of the cm operator, making the point size:
21.33 * 0.75 = 15.9975
However I could find nothing in the PDF specification supporting this conversion and none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.
Should I use the CTM (as defined by the cm operator) to scale the font size back to the correct scale or is this just pure chance?
The pdf file is here: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.Tests/Integration/Documents/Single%20Page%20Simple%20-%20from%20google%20drive.pdf

First of all, your comparison with other text extractors is based on a misunderstanding:
none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.
The "font size" parameter returned by all those libraries is simply the size argument of the Tf instruction, not the effective font size you observe in the final document and are trying to determine. So your comparison with other libraries does not make sense.
Now, concerning your approach:
In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:
21.33 * 0.75 = 15.9975
While some libraries call it that, calling the fourth cm parameter "scale (y)" is misleading. For example, for text rotated by 90° it usually is zero, while the graphic representation usually is not reduced to zero height.
Thus, merely using the "scale (y)" parameter does not work; you have to take the whole transformation into account.
Eventually let's discuss what you actually are after.
As long as the combined transformation matrix (current transformation matrix + text matrix + horizontal scaling) is orthogonal and text lines are following this orthogonality, the meaning of your notion of font size is fairly obvious.
But as soon as there is a shearing in that combined matrix, the meaning of "font size" is not obvious anymore.
You might mean the length of what an originally vertical line (one unit high) is transformed into.
You might mean the length of the projection of that transformed line onto a line at a right angle to the transformed font base line.
Or you might mean the length of the projection of that transformed line onto a line at a right angle to an observed base line.
The first two numbers are trivial to calculate using simple linear algebra. The third number may be more difficult because you have to determine the base line observed by humans in the resulting PDF. In case of innovative use of transformations this might be non-trivial.
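As a minimal sketch of the first two interpretations, assuming horizontal text and the usual row-vector matrix convention of the PDF specification, one can combine the font size, the text matrix, and the CTM and then measure the image of a vertical unit vector. The helper names `mul` and `effective_font_sizes` are made up for this illustration; the numbers are taken from the content stream in the question.

```python
import math

def mul(m1, m2):
    """Multiply two PDF matrices (a, b, c, d, e, f), row-vector convention."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)

def effective_font_sizes(tf_size, text_matrix, ctm, h_scale=1.0):
    """Return (length of transformed vertical unit line,
               its projection at a right angle to the transformed base line)."""
    font = (tf_size * h_scale, 0, 0, tf_size, 0, 0)
    a, b, c, d, _, _ = mul(mul(font, text_matrix), ctm)
    up_length = math.hypot(c, d)                        # image of (0, 1)
    perpendicular = abs(a * d - b * c) / math.hypot(a, b)
    return up_length, perpendicular

# CTM built from the two cm instructions in effect for the text objects.
ctm = (1, 0, 0, 1, 0, 0)
for cm in [(1, 0, 0, -1, 0, 792), (0.75, 0, 0, 0.75, 0, 0)]:
    ctm = mul(cm, ctm)          # cm pre-multiplies the current matrix

sizes = effective_font_sizes(21.33, (1, 0, 0, -1, 0, 140), ctm)
print(sizes)
```

Both values are expressed in default user space units. Without shearing the two numbers coincide; the Td offsets in the sample only translate and therefore do not affect either of them.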

Related

Some letters overlap using TJ

I've a basic example from a PDF i'm editing.
The code
/P <</MCID 0>> BDC q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 12 Tf
1 0 0 1 56.64 759.96 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[(n)4(a)4(m)4(e)] TJ
ET
Q
q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 12 Tf
1 0 0 1 109.7 759.96 Tm
0 g
0 G
[( )] TJ
ET
Q
works perfectly, producing "name" without quotes when I open the PDF.
Sadly, if I change the n to a c, something happens:
The same thing happens if I write [(N)4(a)4(m)4(e)] TJ (capital N) or [\(Name)] TJ.
What am I doing wrong?
Perhaps your font is subset, and so does not have a glyph for c. Your PDF viewer may be substituting a glyph from another standard font, but obeying the given metadata width for the c glyph in the font dictionary for your subset font, which will be 0 for a missing glyph. Hence the overwriting.
Edit: this should have been a comment, not an answer. Sorry.

PDF commands - does the q - Q save the path?

My PDF file contains following commands:
1.0 0 0 -1.0 0 810.0 cm
1.0 0 0 1.0 0 0 cm
1.0 0 0 1.0 9.0 9.0 cm
-9.0 -9.0 m 621.0 -9.0 l 621.0 801.0 l -9.0 801.0 l h
q
1.0 1.0 1.0 rg f
Q
q
1.0 0 0 1.0 0 0 cm
/Div <</MCID 0>> BDC
q
/GS_0-0-0-0x0 gs
q
q
q
1.0 0 0 1.0 -25.98 -17.82 cm
Q
q
1.0 0 0 1.0 -25.98 -17.82 cm
0 0 m 666.0 0 l 666.0 144.2 l 0 144.2 l h
q
.2392 .2784 .3215 rg f
Q
Q
Q
Q
Q
A rectangle is drawn at lines 4 and 20. There is a fill command "f" at lines 6 and 22.
Calling "f" at line 6 clears the current vector path; however, "Q" on line 7 should bring it back. So line 22 should paint two rectangles, but the PDF viewer draws only one. My question is: which command exactly clears the first rectangle before line 22?
On one hand you can see that q does not save the current path by looking into the specification as KenS has proposed in his comment. If you use the ISO norm (either ISO 32000-1 or ISO 32000-2), you'll find the tables for "Device-Independent Graphics State Parameters" and "Device-Dependent Graphics State Parameters" in section 8.4.1. You'll see that the only path stored there is the clipping path.
But you can also see that q does not save the current path by considering when a q instruction is allowed: You'll find in particular that after defining a path the next instruction must be either a path painting instruction or a clipping path instruction immediately followed by a path painting instruction. Thus, a path is already used (and, therefore, discarded) before the next q instruction! So q has no chance to save a current path. (For a normative reference have a look at Figure 9 – Graphics Objects – in ISO 32000-1 or ISO 32000-2).
As an aside, the second paragraph above tells you that your examples are invalid: neither the q nor the rg instructions you put between path definition and path painting are allowed.
So concerning your question:
My question is, which command exactly clears the first rectangle before the line 22?
As your example instruction sequence is invalid, the result is undefined. But as there is no element for the current path in the graphics state, you in particular should not expect the path to re-appear after Q.
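To illustrate the point, here is a toy model of a lenient interpreter, written under the assumption that it simply tolerates the invalid q and rg between path construction and painting. The class `TinyInterpreter` and its reduced graphics state are invented for this sketch; real viewers may behave differently because the result is undefined. The current path is kept outside the saved state, so Q cannot bring it back.

```python
import copy

class TinyInterpreter:
    """Toy model: q/Q save and restore the graphics state, not the path."""

    def __init__(self):
        # Heavily reduced graphics state; the current path is NOT in here.
        self.state = {"ctm": (1, 0, 0, 1, 0, 0), "fill": (0, 0, 0), "clip": None}
        self.stack = []
        self.path = []       # current path, lives outside the graphics state
        self.painted = []    # number of rectangles painted by each fill

    def run(self, ops):
        for op, *args in ops:
            if op == "q":
                self.stack.append(copy.deepcopy(self.state))
            elif op == "Q":
                self.state = self.stack.pop()
            elif op == "rg":
                self.state["fill"] = tuple(args)
            elif op == "re":
                self.path.append(tuple(args))
            elif op == "f":
                self.painted.append(len(self.path))
                self.path = []           # painting ends (discards) the path
        return self.painted

# Simplified version of the sequence in the question (paths written as re).
ops = [("re", -9, -9, 630, 810), ("q",), ("rg", 1, 1, 1), ("f",), ("Q",),
       ("q",), ("re", 0, 0, 666, 144.2), ("q",),
       ("rg", .2392, .2784, .3215), ("f",), ("Q",), ("Q",)]
painted = TinyInterpreter().run(ops)
print(painted)
```

In this model each fill paints exactly one rectangle, because the first path is gone after the first f regardless of the following Q.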

How to calculate viewing BBox of an image in pdf?

I am trying to calculate the coordinates of the visible part of an image. The actual image is bigger than what is shown below (fig1); only part of it is visible. I want to understand how the matrices transform it, i.e. how to calculate the coordinates of the part that is shown.
fig1
The content stream looks like below
The coordinates I get when I multiply the first cm matrix with the second cm matrix are
[-122.196, 356.535, 484.061, 759.372]
But these are the coordinates of the full image. How does the 're' change the calculation for the visible part of the image?
File
original pdf
After removing the 're' and 'W*'
I need another answer for the same scenario:
second file
What I tried
0.24 0 0
0 -0.24 0
0 850 1
Applying the above CTM to the 're' rectangle gives
[595.92 0 0 7.05]
The next cm instruction becomes the CTM and looks like
1 0 0
0 1 0
262 404 1
What will the resulting matrix be, and how can I calculate it?
For the sake of brevity I'm going to round the numbers a bit here.
Case 1
Let's simply analyze your content stream excerpt:
Let's assume that the preceding instructions left the user space coordinate system and the clip path in its default state, so we can assume an identity current transformation matrix (CTM) and a clip path encompassing the whole page.
The first instruction
.24 0 0 -.24 0 850 cm
then changes the CTM to
0.24 0 0
0 -0.24 0
0 850 1
Thus, the rectangle path defined and used as a clip path thereafter
169 349.49 1038.37 1670.15 re
has, in the default user space, the coordinates (lower left, upper right):
[40.56 365.29 289.77 766.12]
Then the next cm instruction
2517.74 0 0 -1670.15 -504.99 2019.64 cm
changes the CTM to
604.26 0 0
0 400.84 0
-121.2 365.29 1
So the following bitmap image
/Im0 Do
is drawn, in the default user space, at the coordinates (lower left, upper right):
[-121.2 365.29 483.06 766.13]
This area partially is outside the clip path, so we get the visible image area in user space coordinates by intersecting those coordinates with the clip path, resulting in
[40.56 365.29 289.77 766.12]
So these are the coordinates you appear to be looking for.
Beware, in general clip paths can have arbitrary forms, and the CTM at image drawing time may not only scale, mirror, or translate (resulting in a rectangle parallel to the axis) but also rotate or skew (resulting in a rhomboid or something not parallel to the axis). Thus, calculating the intersection and making sense out of the result in general is more complicated.
In a comment you ask
But still I need bit more clarity to, after 're' how you got [40.56 365.29 289.77 766.12] this. How the calculation is happening.
I got those by applying the CTM to two diagonally opposed corners of the rectangle.
To get two such corners of
169 349.49 1038.37 1670.15 re
I first took the anchor point at 169 349.49 and as second point the anchor point with width and height added 1207.37 2019.64.
Then I applied the CTM to those two points
                 | 0.24   0     0 |
[169 349.49 1] × | 0     -0.24  0 | = [40.56 766.12 1]
                 | 0      850   1 |

                      | 0.24   0     0 |
[1207.37 2019.64 1] × | 0     -0.24  0 | = [289.77 365.29 1]
                      | 0      850   1 |
So I get the transformed corners at 40.56 766.12 and 289.77 365.29.
Due to the mirroring the resulting points were not lower-left to upper-right but instead upper-left to lower-right. Thus, I normalized the rectangle to [40.56 365.29 289.77 766.12].
Beware, this calculation makes use of the fact that the CTM only scaled, mirrored, and translated. If it also rotated or skewed, I would have had to apply the CTM to all corners of the rectangle (or at least three of them) and then worked with the rhomboid spanned by them.
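The Case 1 calculation can be sketched in a few lines, again using the rounded numbers from above. The helper names (`mul`, `apply`, `bbox`, `intersect`) are made up for this illustration, and the sketch assumes a rectangular clip path; `bbox` transforms all four corners, so for a rotating or skewing CTM it would only return the axis-parallel bounding box of the resulting rhomboid, not the rhomboid itself.

```python
def mul(m1, m2):
    """Multiply two PDF matrices (a, b, c, d, e, f), row-vector convention."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)

def apply(m, x, y):
    """Transform the point (x, y) by matrix m."""
    a, b, c, d, e, f = m
    return (x * a + y * c + e, x * b + y * d + f)

def bbox(m, x, y, w, h):
    """Normalized (lower left, upper right) box of a transformed re rectangle."""
    pts = [apply(m, px, py)
           for px, py in ((x, y), (x + w, y), (x + w, y + h), (x, y + h))]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))

def intersect(r1, r2):
    """Intersection of two normalized rectangles (assumes they overlap)."""
    return (max(r1[0], r2[0]), max(r1[1], r2[1]),
            min(r1[2], r2[2]), min(r1[3], r2[3]))

ctm = (0.24, 0, 0, -0.24, 0, 850)                  # after the first cm
clip = bbox(ctm, 169, 349.49, 1038.37, 1670.15)    # the re used for W*
ctm = mul((2517.74, 0, 0, -1670.15, -504.99, 2019.64), ctm)  # second cm
image = bbox(ctm, 0, 0, 1, 1)                      # images fill the unit square
visible = intersect(clip, image)
print([round(v, 2) for v in visible])
```

Because the inputs are rounded, the printed values can differ from the exact ones in the PDF in the last digit.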
Case 2
In an edit to your question you added another case:
This example shows that one has to inspect the XObject in question first.
If one assumed that Fm0 was an image XObject, the image would be drawn in a .24×.24 default user space units square, a tiny dot.
But Fm0 is not an image XObject, instead it is a form XObject which in turn shows an image XObject from its own resources. Thus, here is another step in the calculations:
The first instruction
.24 0 0 -.24 0 850 cm
then changes the CTM to
0.24 0 0
0 -0.24 0
0 850 1
Thus, the rectangle path defined and used as a clip path thereafter
0 0 2483.33 3512.32 re
has, in the default user space, the coordinates (lower left, upper right):
[0 7.04 596 850]
Then the next cm instruction
1 0 0 1 262 404 cm
changes the CTM to
0.24 0 0
0 -0.24 0
62.88 750.04 1
Due to
/Fm0 Do
we then have to continue with the XObject Fm0. First of all it has a bounding box entry
[ 0 0 1959 1306 ]
Applying the CTM to this we get a bounding box in the default user space of
[62.88 436.6 533.04 750.04]
which has to be intersected with the clip path.
The relevant content of the Fm0 is
0.72 196.505 1957.892 913.266 re
W* n
q
/GS0 gs
1957.8926 0 0 -1304.21912 0.7203979 1305.13342 cm
/Im0 Do
Q
Thus, the rectangle path defined and intersected with the clip path thereafter
0.72 196.51 1957.89 913.27 re
has, in the default user space, the coordinates (lower left, upper right):
[63.05 483.69 532.95 702.88]
Then the next cm instruction
1957.89 0 0 -1304.22 0.72 1305.13 cm
changes the CTM to
469.89 0 0
0 313.01 0
63.05 436.81 1
So the following bitmap image
/Im0 Do
is drawn, in the default user space, at the coordinates (lower left, upper right):
[63.05 436.81 532.94 749.82]
The effective clip path at this time is the intersection of the rectangles
[0 7.04 596 850]
[62.88 436.6 533.04 750.04]
[63.05 483.69 532.95 702.88]
so it is the rectangle
[63.05 483.69 532.95 702.88]
Thus, the visible area of that drawn image is
[63.05 483.69 532.94 702.88]
(Well, I hope it is, but maybe I somewhere along the path erred in some calculation...)

How would this Post Scripts commands from a PDF Stream paint?

I have this PostScript code from this PDF's first page:
0 804 624 -654 re
W* n
0 792 612 -792 re
0 792 m
W n
0 792.06 612 -792 re
W n
I'm trying to understand why a rectangle would have a negative height and how that would affect the painting of the path. I know W* and W are for clipping and n is just a no-op, but I don't get why you would paint a negative-height rectangle.
That's not PostScript, it's PDF; the two are different. I've removed the PostScript tag.
The content you've posted here will not paint anything at all, since (as you note) it consists entirely of clip operations applied to rectangular paths.
Most probably the path is required to be constructed that way in order to get the winding correct (this is especially important since one of the clips uses the even-odd rule).
To put it more simply, the operands to the first re are :
0 804 624 -654 re
That could be constructed from paths as:
0 804 m
624 804 l
624 150 l
0 150 l
h
The code could have used :
0 150 624 654 re
But then the equivalent path would be:
0 150 m
624 150 l
624 804 l
0 804 l
h
If you draw those rectangles (including the direction of travel) you'll see that one proceeds clockwise, while the other proceeds anti-clockwise.
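If you would rather compute than draw, the direction of travel can be checked with the signed area (shoelace formula). This is only a sketch with invented helper names (`re_to_path`, `signed_area`, `direction`); it assumes the default PDF coordinate system with the y axis pointing up and expands re into the corner sequence shown above.

```python
def re_to_path(x, y, w, h):
    """Corner points in the order visited by 'x y w h re'."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]

def signed_area(points):
    """Shoelace formula; positive for counterclockwise travel (y axis up)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2

def direction(points):
    return "counterclockwise" if signed_area(points) > 0 else "clockwise"

negative_height = re_to_path(0, 804, 624, -654)
positive_height = re_to_path(0, 150, 624, 654)
print(direction(negative_height), direction(positive_height))
```

Both rectangles cover the same area; only the sign of the result, and thus the winding, differs.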

TJ and Td offset difference

I have some text that I want to edit (justified text, really annoying), so I was wondering if this:
BT /FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
(Some) Tj
36.77199936 0 Td
(text) Tj
38.4280014 0 Td
(stuff) Tj
33.42799759 0 Td
...
is equivalent to this:
BT
/FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
[(Some)-36.77199936*1000(text)-38.4280014*1000(stuff)-33.42799759*1000] TJ
...
Assuming horizontal text we determined in my answer to your previous question that the horizontal displacement tx corresponding to a number Tj in a TJ array can be calculated as
tx = (−Tj / 1000) × Tfs × Th
where Tfs is the current font size and Th is the current horizontal scaling factor.
Thus, if you have a horizontal displacement tx and want to calculate the corresponding number Tj for a TJ array, you simply resolve the equation above to:
Tj = -1000 × tx / (Tfs × Th)
BUT this is not exactly the situation in your case because Td does not simply shift the text matrix by its parameters but instead shifts the text line matrix by them and sets the text matrix to the new text line matrix value:
tx ty Td
Move to the start of the next line, offset from the start of the current line by (tx, ty). tx and ty shall denote numbers expressed in unscaled text space units. More precisely, this operator shall perform these assignments:
Tm = Tlm = [1 0 0; 0 1 0; tx ty 1] × Tlm
(ISO 32000-1, Table 108 – Text-positioning operators)
Thus, the tx parameter of Td is not the tx to put into the equation above but you instead have to subtract the width of the text drawn since the last setting of the text line matrix.
So to transform your example
BT /FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
(Some) Tj
36.77199936 0 Td
(text) Tj
38.4280014 0 Td
(stuff) Tj
33.42799759 0 Td
into a
BT
/FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
[(Some) NUM1 (text) NUM2 (stuff) NUM3] TJ
form, you calculate the numeric values NUM1, NUM2, and NUM3 like this:
NUM1 = -1000 × (36.77199936 - width("Some")) / (Tfs × Th)
NUM2 = -1000 × (38.4280014 - width("text")) / (Tfs × Th)
NUM3 = -1000 × (33.42799759 - width("stuff")) / (Tfs × Th)
When calculating the widths of those strings remember to take the font size, the character spacing, and the horizontal scaling into account!
And even then the two forms are not identical because the text line matrix at the end differs.
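The conversion can be sketched as a small function. The string widths used below are hypothetical placeholders, not values from your font; in practice you would compute them from the glyph widths in the font, already taking font size, character spacing, and horizontal scaling into account. The function name `td_to_tj_number` is invented for this illustration.

```python
def td_to_tj_number(td_tx, drawn_width, font_size, h_scale=1.0):
    """TJ number replacing a 'tx 0 Td' that follows text of the given width.

    td_tx       -- tx operand of Td (unscaled text space units)
    drawn_width -- width of the text drawn since the text line matrix was set
    font_size   -- Tfs, the size operand of Tf
    h_scale     -- Th, the horizontal scaling factor (1.0 means 100%)
    """
    return -1000.0 * (td_tx - drawn_width) / (font_size * h_scale)

# Hypothetical widths in text space units; replace with real font metrics.
widths = {"Some": 30.0, "text": 32.0, "stuff": 28.0}
offsets = [("Some", 36.77199936), ("text", 38.4280014), ("stuff", 33.42799759)]

numbers = [td_to_tj_number(tx, widths[s], font_size=10) for s, tx in offsets]
print(numbers)
```

A negative result moves the following text to the right, which is what a Td offset larger than the drawn width requires.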