How do Text Objects in PDF work?

I have a PDF document from which I would like to remove watermarks as automatically as possible to get better results from pdftotext.
After uncompressing it with pdftk I see the watermark almost in plain text:
BT
1 0 0 1 277.40012 755.2005 Tm
0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf
[()]TJ
0 0 Td
[(Abc)30(defghi K)30(lm)-40(no)]TJ
-5.423981 -9.600038 Td
[()]TJ
0 0 Td
[(Apr 01, 2017 12:34)]TJ
ET
The watermark is
Abcdefghi Klmno
Apr 01, 2017 12:34
After skimming through Document management — Portable document format (especially page 248f), I found the following:
BT: Begin Text
Tm: Text matrix - what is that?
x y Td: Move to the start of the next line with an offset of (x, y)
TJ: Text showing
Tf: Text state
ET: End Text
What I don't understand is all the numbers and why
[(Abc)30(defghi K)30(lm)-40(no)]TJ
Does it increase the space between Abc and defghi K and decrease the space between lm and no (seems so, looking at Figure 46 on page 259)? By what unit?
What does Tf do?
Could somebody please explain that?

What I don't understand is all the numbers and why
[(Abc)30(defghi K)30(lm)-40(no)]TJ
Does it increase the space between Abc and defghi K and decrease the space between lm and no (seems so, looking at Figure 46 on page 259)?
Nearly so, the positive value decreases and the negative value increases, cf. Table 109 – Text-showing operators in the PDF specification:
array
TJ :
Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. Figure 46 shows an example of the effect of passing offsets to TJ.
The figure is misleading, obviously some type-setting program scrambled up the effect the author wanted to show. The actual source of the figure looks like this:
BT
/T1_2 1 Tf
0 Tc 8.7503 0 0 8.7503 118.989 450.2115 Tm
[([ \()11(A)53(W)57(A)79(Y again\) ] )41(T)43(J)]TJ
40.0016 0 0 40.0015 296.9949 440.2111 Tm
[(A)53(W)57(A)79(Y again)]TJ
8.7503 0 0 8.7503 118.989 403.2097 Tm
[([ \()11(A)9(\) 120 \()-50(W)-55(\) 120 \()11(A)9(\) 95 \()-41(Y again\) ] )41(T)43(J)]TJ
40.0016 0 0 40.0015 296.9949 392.2093 Tm
(AWAY again)Tj
ET
By what unit?
thousandths of a unit of text space, cf. the quote above.
Text space is the coordinate system in which text is shown. It shall be defined by the text matrix, Tm, and the text state parameters Tfs, Th, and Trise, which together shall determine the transformation from text space to user space.
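A thousandth of a text space unit often coincides with a single unit in glyph space: for all font types except Type 3, glyph space is defined as 1/1000 of text space.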
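As a rough worked example of the unit, here is what the numbers in the watermark's TJ array amount to. It assumes the font size 8 from the question (/R2 8 Tf) and the default horizontal scaling (Th = 1); the function name tj_shift is made up for illustration.

```python
def tj_shift(number, font_size, h_scale=1.0):
    """Horizontal shift in text space units caused by one TJ number.

    The number is given in thousandths of a text space unit and is
    subtracted, so positive values move the next glyph to the left.
    """
    return -number / 1000.0 * font_size * h_scale


# Numbers from [(Abc)30(defghi K)30(lm)-40(no)]TJ at font size 8
for n in (30, -40):
    print(n, round(tj_shift(n, 8), 4))
```

So the 30 pulls the following glyph 0.24 units to the left, and the -40 pushes it 0.32 units to the right.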
What does Tf do?
According to Table 105 – Text state operators in the PDF specification
font size Tf :
Set the text font, Tf, to font and the text font size, Tfs, to size. font shall be the name of a font resource in the Font subdictionary of the current resource dictionary; size shall be a number representing a scale factor. There is no initial value for either font or size; they shall be specified explicitly by using Tf before any text is shown.
The only thing I don't understand now is the line
0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf
Can you explain that, too?
The instruction
0.501961 0.501961 0.501961 rg
sets the fill color to a medium gray in an RGB color space.
Then
/R1 gs
sets additional graphics state parameters from the ExtGState resource named R1; probably here some transparency effect is defined.
Finally
/R2 8 Tf
sets the font to one defined by the Font resource named R2 and the font size to 8.
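To see how that line breaks down into three operations, a very small tokenizer is enough. This is only a sketch: it assumes operands are plain numbers or /Names and does not handle strings or arrays, and split_operations is a hypothetical helper, not part of any PDF library.

```python
def split_operations(line):
    """Group whitespace-separated tokens into (operator, operands) pairs.

    Simplification: any token that is neither a number nor a /Name is
    treated as an operator. Strings and arrays are not handled.
    """
    operations, operands = [], []
    for token in line.split():
        if token.startswith("/"):
            operands.append(token)
            continue
        try:
            operands.append(float(token))
        except ValueError:
            # Not a number and not a name: this token is the operator.
            operations.append((token, operands))
            operands = []
    return operations


ops = split_operations("0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf")
for operator, operands in ops:
    print(operator, operands)

# The rg components are fractions of full intensity; scaled to 8 bits:
gray = round(ops[0][1][0] * 255)
print(gray)
```

Scaled to 8 bits, each component comes out as 128, i.e. a 50% gray.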

Partial answer
Tf
font size Tf
sets the font and the size (see page 244)
gs
dictName gs sets the graphic state:
(PDF 1.2) Set the specified parameters in the graphics state.
dictName shall be the name of a graphics state parameter
dictionary in the ExtGState subdictionary of the current resource
dictionary (see the next sub-clause).
I am not too sure what /R1 means.
rg
1.0 1.0 0.0 rg % Set nonstroking colour to yellow
Hence 0.501961 0.501961 0.501961 rg sets the color to some gray value.
Text matrix
The Text matrix is an affine transformation matrix as explained in this answer.
Hence
1 0 0 1 0 0 Tm
doesn't change anything.
The matrix 1 0 0 1 277.40012 755.2005 Tm places the text origin 277.40012 units to the right of and 755.2005 units above the origin (in the default coordinate system y increases upward), without scaling or rotating the text.
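A minimal sketch of how the six numbers act on a point, assuming the usual PDF convention that a b c d e f maps (x, y) to (a*x + c*y + e, b*x + d*y + f). The helper name transform is made up.

```python
def transform(matrix, x, y):
    """Map a point through a PDF matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


identity = (1, 0, 0, 1, 0, 0)
watermark_tm = (1, 0, 0, 1, 277.40012, 755.2005)

print(transform(identity, 3, 4))       # the identity leaves the point alone
print(transform(watermark_tm, 0, 0))   # where the text space origin ends up
```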

Related

PDF glyph spacing and TJ operator

I am new to PDF, and I want to manipulate the space between the characters in a file.
I have read the PDFReference and understood some of it.
Now, the problem I have is how to calculate the spaces for the text rendering.
I have for example:
1 0 0 1 0 188.28799438 cm
BT
/F2 11.04 Tf
1 0 0 -1 0 9.38000011 Tm
(Some)Tj ( )Tj
21.24200058 0 Td
(text)Tj ( )Tj
Which I want to turn into this:
1 0 0 1 0 188.28799438 cm
BT
/F2 11.04 Tf
1 0 0 -1 0 9.38000011 Tm
[(S)10(o)10(m)10(e)( )]TJ
21.24200058 0 Td
[(t)10(e)10(x)10(t)( )]TJ
To add the spaces and then be able to manipulate them. However, I was wondering how to calculate the CTM and the line matrix with those added values.
I know that we concatenate cm with the previous one.
cm2 x cm1
The Tms are not concatenated; Tm2 replaces Tm1.
I am stuck with the Td operator and the new spaces I added. Any clue?

If you're working with horizontal text and only want to control the spacing between glyphs with the TJ operator, you don't need to worry about adding those values to the current transformation matrix or line matrix.
The CTM (current transformation matrix) is a master matrix that maps user space coordinates to output device coordinates; for each glyph, it is concatenated with other parameters to make a temporary text rendering matrix to position the glyph, but the CTM does not accumulate changes as glyphs are positioned (see 9.4.4 'Text Space Details' in the PDF 32000 reference)
The line matrix captures the value of the initial text matrix at the beginning of a line of a text; it's really only used for matching the vertical position of lines of text and isn't affected by the spacing between glyphs (see 9.4.2 'Text Positioning Operators')

As clarified in comments, the OP is not asking for effects of the TJ numbers on the current transformation matrix or text line matrix but instead on the text matrix Tm.
This is explained in the specification ISO 32000-1 (and equivalently in ISO 32000-2) in section 9.4.4 Text Space Details: After drawing a glyph (probably followed by a number in a TJ instruction array argument), the text matrix shall be updated as follows:
Tm = [1 0 0; 0 1 0; tx ty 1] × Tm
In horizontal mode tx is the displacement and ty is zero, in vertical mode tx is zero and ty is the displacement. The applicable value is calculated as
tx = ((w0 − Tj / 1000) × Tfs + Tc + Tw) × Th
ty = (w1 − Tj / 1000) × Tfs + Tc + Tw
I.e. if you do this calculation while processing a TJ instruction and there is a number following the character code for the currently drawn glyph, that number is considered here as the parameter Tj.
Thus, if you want to determine the displacement caused by a number element of a TJ array argument alone — e.g. if the first element in the TJ array argument is a number or if there are multiple consecutive number elements in the TJ array argument and you want to know the effect of each one — the above reduces to
tx = (−Tj / 1000) × Tfs × Th
ty = (−Tj / 1000) × Tfs
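For horizontal text, the update can be sketched as follows. This is an illustration under assumptions, not a full interpreter: matrices are kept as 6-tuples (a, b, c, d, e, f), the names multiply and advance are made up, and the sample values are taken from the question (/F2 11.04 Tf, the 10 in the TJ array).

```python
def multiply(m, n):
    """Product m x n of two PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def advance(tm, tj, font_size, h_scale=1.0, w0=0.0, tc=0.0, tw=0.0):
    """Return the text matrix after one horizontal displacement.

    w0 is the glyph width in text space units; leave it (and tc, tw)
    at 0 to get the effect of a bare TJ number.
    """
    tx = ((w0 - tj / 1000.0) * font_size + tc + tw) * h_scale
    return multiply((1, 0, 0, 1, tx, 0), tm)


tm = (1, 0, 0, -1, 0, 9.38000011)   # 1 0 0 -1 0 9.38000011 Tm
tm = advance(tm, 10, 11.04)         # the 10 in [(S)10(o)...]TJ
print(tm)
```

Only the e component changes here: each 10 shifts the text position by 10 / 1000 × 11.04 = 0.1104 units against the writing direction. The CTM and the text line matrix are untouched.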

Clipping path seems to be outside of text

Recently I wanted to construct a PDF document which should have text clipping: With 4 Tr I tried to define the text as clipping area. But when I wanted to fill the lower part of the text with red color, the result was reversed.
Does anyone know why?
Thanks for any answer!
stream
BT
4 8 Td
0.8 0.2 0.7 rg % Write in purple.
4 Tr % Fill & Use text as clipping area.
/TR 32 Tf
(Hallo Welt) Tj
1 0 0 rg % Fill in red.
0 0 200 20 re F % <- Mistake?
ET
What I wanted to have:
What I got:

Have a look at the specification ISO 32000-1:
The behaviour of the clipping modes requires further explanation. Glyph outlines shall begin accumulating if a BT operator is executed while the text rendering mode is set to a clipping mode or if it is set to a clipping mode within a text object. Glyphs shall accumulate until the text object is ended by an ET operator; the text rendering mode shall not be changed back to a nonclipping mode before that point.
(section 9.3.6 Text Rendering Mode )
In your sample you don't wait until the ET for the clipping path to take effect. So, when you are painting the red rectangle, your special clipping path is not yet in effect.
Furthermore, your operation sequence actually is invalid! Neither path construction nor path painting operators (i.e. neither your 0 0 200 20 re nor your F) are allowed inside a text object, cf. Figure 9 – Graphics Objects in the specification.
Thus, strictly speaking your PDF viewer had better refuse to draw your content stream at all.
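One possible reordering that follows from the two points above, written as a small Python helper that assembles the content stream. It is a sketch only: the function name is made up, the font resource /TR and all coordinates are copied from the question, and the q/Q pair is my addition so that the text clip does not leak into later content.

```python
def clipped_fill_stream():
    """Build a content stream that fills a rectangle clipped to text.

    The rectangle is painted only after ET, when the accumulated glyph
    outlines have become part of the clipping path.
    """
    lines = [
        "q",                 # save graphics state (including clipping path)
        "BT",
        "4 8 Td",
        "0.8 0.2 0.7 rg",
        "4 Tr",              # fill text, then add it to the clipping path
        "/TR 32 Tf",
        "(Hallo Welt) Tj",
        "ET",                # the text clip takes effect here
        "1 0 0 rg",
        "0 0 200 20 re f",   # path operators, now outside the text object
        "Q",                 # restore the previous clipping path
    ]
    return "\n".join(lines) + "\n"


print(clipped_fill_stream())
```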

Text rotation in PDF

So I have this situation:
using pdftoxml.exe from sourceforge.net I got text tokens and their coordinates. If the pdf file was rotated (i.e. it has a /Rotate 90 written in its source) pdftoxml.exe swaps height and width of a given page and also x and y coordinates of any given object. That is what I understand.
I was happy with it until I came across a PDF file which used re to draw thick lines. That is, for a thick line, 4 thin lines are drawn and the space is filled, like in this picture. On the left you see two thin lines (not colored), which are part of a bigger rectangle (highly zoomed in). I emptied the space in between, which was actually filled with black, to see the lines:
Additionally, the above PDF is rotated. So to get B upright in the end, this text matrix was used: 0 1 -1 0 90.72 28.3705 Tm. The thin lines were drawn with something like 83.04 27.891 0.48 0.48 re (coordinates may vary here, but it was some re operation like that; the operands are x y width height, and re stands for rectangle, see page 133 of Adobe's PDF 1.7 reference). What is relevant here is the calculation 27.891 + 0.48 = 28.371, which is not rounded or altered because of floating-point issues. It is the exact value for the line's x and, unfortunately, it is bigger than the hard-coded x of B, which is 28.3705:
83.52 27.891 m 92.39999999999999 27.891 l s
92.39999999999999 27.891 m 92.39999999999999 28.371 l s
92.39999999999999 28.371 m 83.52 28.371 l s
83.52 28.371 m 83.52 27.891 l s
The page's coordinates go like 842 x 595.2 according to PDFXChange viewer, measured from the upper left corner. This seems natural since the page is rotated. Unrotated, it would be the lower left corner, so that ought to be OK.
When the text is altered with 1 0 0 1 90.72 28.3705 Tm into its original orientation, one can see the collapsing bottom line with the line on the left:
which is what I would expect, since B's y is 28.3705 and the line's horizontal position is 28.371 (as can be seen on the second line of the code lines above). So probably B's bottom line falls beyond the 28.371, but I could not zoom in that far.
Now where does the gap between the line and the B come from in the first picture? This is important to me because I was trying to figure out which is the closest line on the left to B and was surprised by the two values, namely the supposed x value of the text I get from pdftoxml.exe, which is 28.3705, and the line's horizontal value 28.371. Since I knew the line is actually far beyond the left of the B, that could not be correct, at least not in the sense of "take x position of line, take x position of B, compare, and if the line's x is less than B's x, the line is on the left".
I can't locate the correct line with the x values. Instead I get the other line on the very left, as if the text were falling in between the two.
This is the text drawing code:
BT
%0 7.5 -7.5 0 90.72 28.3705 Tm
0 1 -1 0 90.72 28.3705 Tm
%1 0 0 1 90.72 28.3705 Tm
/F1 1 Tf
1 Tr
q
0.01 w
(B) Tj
Q
ET
so, there is nothing fancy happening with the B's size or line thickness.
Can you help me figure out?
This is an updated picture with two I drawn on the same page, for the upper I using 0 1 -1 0 90.72 28.3705 Tm (rotated 90 degrees mathematically), for the lower one 1 0 0 1 90.72 28.3705 Tm. So I don't get it, how is the lower I rotated +90 and ends up being the upper one?
Here is the pdf code. It is rather big, but you should be able to copy it into your file and name it sth.pdf.
PDF Sample ( you have to actually zoom into the upper left corner real big to see the I )
EDIT
I actually found some interesting information about finding the glyph bounding box, but I could not yet put the pieces together.

Please have a look at the following:
The glyph origin is the point (0, 0) in the glyph coordinate system. Tj and other text-showing operators shall position the origin of the first glyph to be painted at the origin of text space.
(shamelessly copied from Figure 39, section 9.2.4 of ISO 32000-1).
As you can see, the coordinates where the glyph is positioned, the glyph origin, is not necessarily where the actual glyph bounding box starts. This may explain the gap in your first image.
Thus, when you are trying to figure out which is the closest line on the left to B optically, it does not suffice to take x position of line, take x position of B, compare, and if the line's x is less than B's x, the line is on the left. Instead you also have to take the font data themselves into account and factor in the gap between glyph origin and glyph bounding box of the glyph represented by B.
For a more in-depth analysis please supply the font data.
EDIT concerning your double-I question... in your comment above you say you actually expected to see a common point - the rotation point - in both I characters, so you can get hands on a reliable horizontal coordinate for the left bounding box side of a character.
Isn't the point where the red lines cross, your rotation point? It should be the glyph origin for both Tj operations, and the I-glyphs have their origins there. Now you can measure from there on.
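To illustrate the difference between glyph origin and glyph bounding box, here is a sketch that maps a bounding box through the rotated text matrix from the question. The bounding box values are invented for illustration, since the actual font data are not available; the sketch also assumes a non-Type-3 font (glyph space = 1/1000 text space), font size 1 as in /F1 1 Tf, and ignores the CTM.

```python
def transform(matrix, x, y):
    """Map a point through a PDF matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def bbox_corners_in_user_space(tm, glyph_bbox, font_size=1.0):
    """Map a glyph bounding box (glyph space, 1/1000 units) through tm."""
    llx, lly, urx, ury = (v / 1000.0 * font_size for v in glyph_bbox)
    return [transform(tm, x, y) for x, y in
            ((llx, lly), (urx, lly), (urx, ury), (llx, ury))]


rotated_tm = (0, 1, -1, 0, 90.72, 28.3705)   # 0 1 -1 0 90.72 28.3705 Tm
hypothetical_b_bbox = (80, 0, 620, 700)      # made-up values for "B"

corners = bbox_corners_in_user_space(rotated_tm, hypothetical_b_bbox)
for corner in corners:
    print(corner)
```

With these made-up metrics the visible glyph starts 0.08 units past the origin coordinate 28.3705, which would already put it on the far side of a line at 28.371. The real offset depends on the font.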

PDF Low-Level: Adding text as an invisible layer with each letter in specific position

I'm writing a PDF file directly from code, it's all working nicely, but I don't know how to add text into the content object of a page with each letter at a specific position.
I have the coordinates of each letter, something like this:
x0 y0 x1 y1
a = 345,200,350,210
n = 352,201,360,209
d = 365,200,371,212
I want to be able to put this onto the PDF page as an invisible layer so it can be searched or selected, but with each letter in the exact correct coordinates.
Alternatively I could do it with only the coordinates for each word, if this is better.
What is the format for writing this into the content object?
Thank you very much for your help!

There are many ways of doing this. You'll need to use a text block:
BT
%..you need to set a font...
/f1 10 Tf
%..you need to set the text matrix to include Tx and Ty (if not already done)..
1 0 0 1 345 200 Tm
(a) Tj % or (and) Tj to display the word in one go (position of chars depends on font selected)
1 0 0 1 352 201 Tm
(n) Tj
% etc.
ET
You also mentioned that you wanted the text to be invisible. If you are in complete control of the page content you can set the text stroke and fill colour to be the same as the background colour (which will probably be white)
1 1 1 RG
1 1 1 rg
Otherwise you can paint over the text, it will still be selectable.
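If the content stream is generated from code, the per-letter placement could be sketched like this. The assumptions: the x0, y0 values from the question are taken as the glyph origin in PDF user space, the font resource /f1 exists in the page resources, and the characters are encodable in that font. As an alternative to white text, the sketch uses text rendering mode 3 (3 Tr: neither fill nor stroke), which is a common way to make text invisible. The function name is made up.

```python
def invisible_text_layer(letters, font_name="/f1", font_size=10):
    """Build a text object placing each letter at its own (x, y).

    letters: iterable of (char, x, y) tuples.
    Uses text rendering mode 3, i.e. invisible text.
    """
    def escape(char):
        # Backslash and parentheses must be escaped in PDF literal strings.
        return (char.replace("\\", "\\\\")
                    .replace("(", "\\(")
                    .replace(")", "\\)"))

    lines = ["BT", f"{font_name} {font_size} Tf", "3 Tr"]
    for char, x, y in letters:
        lines.append(f"1 0 0 1 {x} {y} Tm")
        lines.append(f"({escape(char)}) Tj")
    lines.append("ET")
    return "\n".join(lines) + "\n"


layer = invisible_text_layer([("a", 345, 200), ("n", 352, 201), ("d", 365, 200)])
print(layer)
```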

Determine translate/position from PDF Tm operators

I'm trying to extract some textual data from a PDF file. To do this, I need a sense of where some text is printed on the page, so I can correlate locations of different pieces of data. However, I'm getting stuck because I don't fully understand the behavior of the text matrix set by the Tm operator.
Tm (0.0, -5.28, 5.28, 0.0, 429.7006, 803.9603)
rg (0.617, 0.098, 0.043)
Tj '\x01'
Tm (0.0, -9.0, 9.0, 0.0, 428.1406, 784.8203)
rg (0.0, 0.219, 0.512)
Tc (2.4756,)
Tj '4567'
This is some of the stream content. As you can see, it has two Tm calls, closely together. All the normal text is printed in the Tm (0.0, -9.0, 9.0, 0.0) space -- it appears like the -5.28/5.28 space is just used to print some special characters. Now, I know that the latter two parameters to Tm are used to set the current location to a new one, but it appears these numbers are dependent on more context (probably the 5.28 and 9.0 scales, somehow). I can't seem to figure out how all this fits together, though, and the spec (page 250 has the Tm "explanation") seems spectacularly unhelpful to me.
EDIT: extended example, why this has me flummoxed:
Tm 0 -27 27 0 545.5606 817.2203
(rg, Tc, Tw, Tj, Tf omitted)
TD 0.0156 -1.2556
Tm 0 -9 9 0 441.9406 677.4803
TD 10.6733 0 # more omitted, including other TD ops with second param 0
TD -82.7267 -1.5333 # start of a new line
Tc 0
Tj (3)
Tf /F2 1
Tm 0 -5.28 5.28 0 429.7006 803.9603
Tj ()
Tf /TT2 1
Tm 0 -9 9 0 428.1406 784.8203
Tc 2.4756
Tj (4567) # these appear on the same line as before the double Tm
In my initial code I assumed that the e and f parameters to Tm and the parameters to TD were in the same space, leading to organized coordinates. However, that fails here: the 4567 in the last Tj shows up on the same line as the earlier 3, yet my calculation puts the y coordinate of the 3 at 677.4803 + -1.5333 = 675.947, whereas the final Tm sets the y coordinate to 784.8203, suggesting that "4567" should be drawn above the 3.

The text matrix is combined with the current transformation matrix in order to set the text position. Your text is placed at (429.7006, 803.9603) and at (428.1406, 784.8203). The text size is 5.28 and 9 points. It is a common technique to set the font size to 1 using the Tf operator and set the actual font size by scaling the text matrix. Your text is also rotated.
A correct calculation of the text position requires parsing the entire content stream and executing all q, Q, cm, Tf, Tm and all the other text-related operators.
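One detail worth spelling out: the operands of Td/TD are in text space, so they pass through the current text line matrix before they reach the page. A sketch with the numbers from the extended example, assuming (as the question states) that the omitted TD operators all have 0 as their second operand and that the CTM is the identity; the helper name is made up.

```python
def translate_line_matrix(tlm, tx, ty):
    """Apply 'tx ty Td' (or TD): Tlm = [1 0 0 1 tx ty] x Tlm."""
    a, b, c, d, e, f = tlm
    return (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)


tlm = (0, -9, 9, 0, 441.9406, 677.4803)              # Tm 0 -9 9 0 441.9406 677.4803
tlm = translate_line_matrix(tlm, 10.6733, 0)         # TD 10.6733 0
tlm = translate_line_matrix(tlm, -82.7267, -1.5333)  # TD -82.7267 -1.5333
print(round(tlm[4], 4))
```

Because the text is rotated, the second TD operand moves the line along the page's x axis, scaled by 9: 441.9406 + 9 × (-1.5333) = 428.1409. That is within rounding of the 428.1406 in the final Tm, which fits the observation that the 4567 sits on the same line as the 3. The f component is not meaningful here, since it depends on the omitted operators.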

Generally it's w,0,0,h,x,y where
x,y is the starting coordinate in PDF coordinates and w,h are the font size (you can technically have different values for scaling effects).
You can also have 0,w,h,0,x,y - the minus values and the different positions of w,h show there is a shear/rotation on the text. It's all matrix maths.
You may also need to factor in the CTM to get the final text start location (text could be scaled within a Form element).
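As a small sketch of that matrix maths, the scale factors and the rotation angle can be read off the first four numbers, assuming the matrix is a pure rotation plus scaling (no shear). The function name describe is made up.

```python
import math


def describe(matrix):
    """Return (x scale, y scale, rotation in degrees) of a PDF matrix.

    Assumes the matrix has no shear, i.e. it is a rotation plus scaling.
    """
    a, b, c, d, _e, _f = matrix
    return (math.hypot(a, b), math.hypot(c, d), math.degrees(math.atan2(b, a)))


print(describe((0, -9, 9, 0, 428.1406, 784.8203)))
print(describe((0, -5.28, 5.28, 0, 429.7006, 803.9603)))
```

For the two matrices from the question this gives scales of 9 and 5.28 and a rotation of -90 degrees, i.e. the text runs down the page.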
