PDF Table Text Extraction - pdf

I know that table text extraction is not everyone's cup of tea. But while reading the PDF stream data for a table, there are certain things that I don't understand.
The PDF Code Stream of Table is:
q % Graphic State Starting Point
0 292.5 595.3 442.8 re % Rectangle x y Width Height
W* % Clipping Even Odd Rule
n % End without Filling
0 0 0 rg % Fill (nonstroking) color: black
161 735 m % Move to New Path
160.8 734.7 l 89.3 734.7 l 89 735 l 88.8 735.3 l 161.3 735.3 l
161 735 l % straight line
h % Close the Current Path
f* % Fill Path with Even Odd Rule
Q
And Underline is:
q % Graphic State Starting Point
1 0 0 1 451.5 759.5 cm % Current matrix
0.5 w % Width of Stroke
0 0 0 RG % color
0 -0.8 m % Move to New Path
72 -0.8 l % Straight Line
S % Stroke Line
Q % End of Graphic State
In the underline, cm moves the origin to (451.5, 759.5), and the straight line is drawn from the current point, x = 451.5, over 72 points to x = 523.5, at 0.8 below the cm offset, i.e. y = 758.7.
What I don't understand is how the table line is drawn: from which point to which point?

Consider the line drawing section, where m is the move to operator and l is line-to:
% command coordinates
% =====================
161 735 m % Move-to a(161, 735)
160.8 734.7 l % line-to b(161, 735 -.3)
89.3 734.7 l % line-to c(90, 735 -.3)
89 735 l % line-to d(90, 735)
88.8 735.3 l % line-to e(90, 735 +.3)
161.3 735.3 l % line to f(161, 735 +.3)
161 735 l % line to g(161, 735)
h % close-path
f % fill
(There are some strange minor variations in x around 160 and 89 which are too small to register visually - rendering quirks?).
[At very high resolution the line will have arrowheads at its ends:
<============ ... =======>
]
Other than that it's drawing a very thin long box with corners (89, 734.7), (161, 734.7), (161, 735.3), and (89, 735.3). The effect of the ±0.3 points on the y-axis is most likely to give a slightly thickened line rather than a visible rectangle.
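For extraction purposes it can help to collapse such a thin filled shape back into "a line from A to B with thickness t". Here is a minimal Python sketch of that idea; the helper names are made up for illustration, and it assumes no cm operator is in effect around the path (as in the stream above), so the numbers are already in the page's user space.

```python
# Vertices of the filled path, copied from the content stream above.
points = [
    (161.0, 735.0), (160.8, 734.7), (89.3, 734.7), (89.0, 735.0),
    (88.8, 735.3), (161.3, 735.3), (161.0, 735.0),
]

def bounding_box(pts):
    """Smallest axis-aligned box containing all vertices."""
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)

def as_rule(pts):
    """Describe a thin filled shape as a centre line plus a thickness."""
    x0, y0, x1, y1 = bounding_box(pts)
    if (x1 - x0) >= (y1 - y0):      # wider than tall: horizontal rule
        mid = (y0 + y1) / 2
        return (x0, mid), (x1, mid), y1 - y0
    mid = (x0 + x1) / 2             # otherwise: vertical rule
    return (mid, y0), (mid, y1), x1 - x0

start, end, thickness = as_rule(points)
print(start, end, round(thickness, 3))
```

So the table line runs roughly from x = 88.8 to x = 161.3 at y = 735, about 0.6 points thick. Using the bounding box is a simplification; it ignores the slightly pointed ends.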

Related

Drawing boundary of shape entirely inside the shape in PDF

I am using Path Construction in PDF to draw a shape, say a rectangle. For example:
0 0 m 0 1 l 1 1 l 1 0 l 0 0 l B
But now, the line connecting (0,0) and (0,1) has (0,0) and (0,1) in the center. Therefore, the boundary "leaves" the rectangle by half of the line width.
Is there a parameter, so that the boundary is drawn entirely inside the rectangle?
This is just the normal behaviour of the line drawing operation.
The thickness of the line is spread equally to both sides of the line. So if you have a 10pt thick line from (0,0) to (10,0) and use the butt cap line style, you will have a filled rectangular area with the corners (0,-5), (10,-5), (10,5), (0,5).
Have a look at this PDF file - you can see this effect in the second row, second column. The inner white lines and the outer black lines have the same start and end points.
So if you want to have everything inside that rectangle, either use a clip path as mkl suggested, or calculate the necessary end points, taking the line width and line cap/join style into account.
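As a rough illustration of the "calculate the necessary end points" route, here is a small Python sketch. Both function names are hypothetical. The second one assumes the rectangle is drawn as a closed path with the re operator and the default miter join, so that every corner is a sharp join and the stroke sticks out by exactly half the line width on each side.

```python
def stroke_area_butt(x0, x1, y, width):
    """Corners of the area painted by a horizontal butt-capped line."""
    half = width / 2
    return [(x0, y - half), (x1, y - half), (x1, y + half), (x0, y + half)]

def inset_rect_ops(x, y, w, h, line_width):
    """Content-stream text for a rectangle whose stroke stays inside (x, y, w, h).

    The path is moved inwards by half the line width on every side.
    """
    if w < line_width or h < line_width:
        raise ValueError("rectangle is too small for this line width")
    half = line_width / 2
    return (f"{line_width:g} w "
            f"{x + half:g} {y + half:g} {w - line_width:g} {h - line_width:g} re B")

print(stroke_area_butt(0, 10, 0, 10))
print(inset_rect_ops(0, 0, 1, 1, 0.1))
```

With other join styles, or with an open path built from m and l only, the corners behave differently, so the inset has to be adapted.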
As already mentioned in a comment, using a clip path the size of that rectangle is an option.
As your path only consists of the rectangle in question, you can do so very easily: simply add the clipping path operator W before the path painting operator B:
0 0 m 0 1 l 1 1 l 1 0 l 0 0 l W B
If you don't want to keep the clip path, enclose all this in save-state/restore-state:
q
0 0 m 0 1 l 1 1 l 1 0 l 0 0 l W B
Q

How would these PostScript commands from a PDF stream paint?

I have this PostScript code from this PDF's first page:
0 804 624 -654 re
W* n
0 792 612 -792 re
0 792 m
W n
0 792.06 612 -792 re
W n
I'm trying to understand why a rectangle would have a negative height and how that would affect the painting of the path. I know W* and W are for clipping and n is just a no-op, but I don't get why you would paint a negative-height rectangle.
That's not PostScript, it's PDF; the two are different. I've removed the PostScript tag.
The content you've posted here will not paint anything at all, since (as you note) it consists entirely of clip operations applied to rectangular paths.
Most probably the path is required to be constructed that way in order to get the winding correct (this is especially important since one of the clips uses the even-odd rule).
To put it more simply, the operands to the first re are :
0 804 624 -654 re
That could be constructed from paths as:
0 804 m
624 804 l
624 150 l
0 150 l
h
The code could have used :
0 150 624 654 re
But then the equivalent path would be:
0 150 m
624 150 l
624 804 l
0 804 l
h
If you draw those rectangles (including the direction of travel) you'll see that one proceeds clockwise, while the other proceeds anti-clockwise.
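If you'd rather check the direction numerically than on paper, a short Python sketch along these lines works. The function names are my own; the expansion of re into four vertices follows the operator's definition, and the clockwise/anti-clockwise labels assume the default user space with y pointing up (a flipping cm would reverse them).

```python
def re_to_path(x, y, w, h):
    """Vertices visited by `x y w h re`, in drawing order."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]

def signed_area(pts):
    """Shoelace formula: positive means anti-clockwise when y points up."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2

def direction(pts):
    return "anti-clockwise" if signed_area(pts) > 0 else "clockwise"

negative_height = re_to_path(0, 804, 624, -654)
positive_height = re_to_path(0, 150, 624, 654)
print(negative_height, direction(negative_height))
print(positive_height, direction(positive_height))
```

Both rectangles cover the same area; only the sign of the shoelace sum, and hence the direction of travel, differs.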

TJ and Td offset difference

I have some text that I want to edit (justified text, really annoying), so I was wondering if this:
BT /FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
(Some) Tj
36.77199936 0 Td
(text) Tj
38.4280014 0 Td
(stuff) Tj
33.42799759 0 Td
...
is equivalent to this:
BT
/FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
[(Some)-36.77199936*1000(text)-38.4280014*1000(stuff)-33.42799759*1000] TJ
...
Assuming horizontal text, we determined in my answer to your previous question that the horizontal displacement tx corresponding to a number Tj in a TJ array can be calculated as
tx = (−Tj / 1000) × Tfs × Th
where Tfs is the current font size and Th is the current horizontal scaling factor.
Thus, if you have a horizontal displacement tx and want to calculate the corresponding number Tj for a TJ array, you simply resolve the equation above to:
Tj = -1000 × tx / (Tfs × Th)
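The two formulas can be written as a pair of small Python helpers. This is only a sketch: the function names are invented here, and h_scale stands for Th as a plain factor (i.e. the Tz value divided by 100).

```python
def tj_number_to_tx(tj, font_size, h_scale=1.0):
    """Horizontal displacement produced by a number in a TJ array."""
    return (-tj / 1000.0) * font_size * h_scale

def tx_to_tj_number(tx, font_size, h_scale=1.0):
    """Number to put into a TJ array to get the displacement tx."""
    return -1000.0 * tx / (font_size * h_scale)

# A displacement of 5 units at font size 10 and back again.
number = tx_to_tj_number(5, 10)
print(number, tj_number_to_tx(number, 10))
```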
BUT this is not exactly the situation in your case because Td does not simply shift the text matrix by its parameters but instead shifts the text line matrix by them and sets the text matrix to the new text line matrix value:
tx ty Td
Move to the start of the next line, offset from the start of the current line by (tx, ty). tx and ty shall denote numbers expressed in unscaled text space units. More precisely, this operator shall perform these assignments:
Tm = Tlm = [1 0 0; 0 1 0; tx ty 1] × Tlm
(ISO 32000-1, Table 108 – Text-positioning operators)
Thus, the tx parameter of Td is not the tx to put into the equation above but you instead have to subtract the width of the text drawn since the last setting of the text line matrix.
So to transform your example
BT /FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
(Some) Tj
36.77199936 0 Td
(text) Tj
38.4280014 0 Td
(stuff) Tj
33.42799759 0 Td
into a
BT
/FAAABA 10 Tf
1 0 0 -1 0 9.38000011 Tm
[(Some) NUM1 (text) NUM2 (stuff) NUM3] TJ
form, you calculate the numeric values NUM1, NUM2, and NUM3 like this:
NUM1 = -1000 × (36.77199936 - width("Some")) / (Tfs × Th)
NUM2 = -1000 × (38.4280014 - width("text")) / (Tfs × Th)
NUM3 = -1000 × (33.42799759 - width("stuff")) / (Tfs × Th)
When calculating the widths of those strings remember to take the font size, the character spacing, and the horizontal scaling into account!
And even then the two forms are not identical because the text line matrix at the end differs.
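To make the recipe concrete, here is a Python sketch of the conversion. The function name is made up, and the string widths in the usage example are placeholder values, not measurements from your font: you have to compute the real widths from the font's glyph widths, the font size, the character spacing, and the horizontal scaling. The sketch also does not escape parentheses or backslashes inside the strings.

```python
def td_offsets_to_tj(pieces, font_size, h_scale=1.0):
    """Build a TJ operation from (text, width, td_tx) triples.

    width is the advance of the string in text space units and td_tx is
    the tx operand of the Td that followed the string in the original.
    """
    parts = []
    for text, width, td_tx in pieces:
        gap = td_tx - width   # what is left of the Td offset after the text itself
        number = -1000.0 * gap / (font_size * h_scale)
        parts.append(f"({text}) {number:.3f}")
    return "[" + " ".join(parts) + "] TJ"

# Placeholder widths (24.0, 17.0, 21.0) chosen for illustration only.
pieces = [
    ("Some", 24.0, 36.77199936),
    ("text", 17.0, 38.4280014),
    ("stuff", 21.0, 33.42799759),
]
print(td_offsets_to_tj(pieces, font_size=10))
```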

PDF Specification - Get Font Size in Points

I'm trying to write a PDF parser in C# but I've run into an issue where I'm unsure how to interpret the specification.
Unless otherwise specified user space in a PDF document is 1/72 of an inch (i.e. 1pt).
The scale provided by the Tf operator scales the font from the standard size (generally 1 unit of user space / 1pt) to the correct display size.
I have the following page content:
1 0 0 -1 0 792 cm
q
0 0 612 792 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1056 re
f
0 0 816 1056 re
f
0 0 816 1056 re
f
Q
Q
q
0 0 612 791.25 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1055 re
f
0 96 816 960 re
f
0 0 0 RG 0 0 0 rg
BT
/F0 21.33 Tf
1 0 0 -1 0 140 Tm
96 0 Td <0037> Tj
13.0280762 0 Td <004B> Tj
11.8616943 0 Td <004C> Tj
4.7384338 0 Td <0056> Tj
ET
BT
/F1 21.33 Tf
1 0 0 -1 0 140 Tm
136.292267 0 Td <0001> Tj
ET
...
I know that the font size in points of the 2 text operations defined in the sample is 16pt; however, the Tf operator is using a size of 21.33. In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:
21.33 * 0.75 = 15.9975
However I could find nothing in the PDF specification supporting this conversion and none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.
Should I use the CTM (as defined by the cm operator) to scale the font size back to the correct scale or is this just pure chance?
The pdf file is here: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.Tests/Integration/Documents/Single%20Page%20Simple%20-%20from%20google%20drive.pdf
First of all, your comparison with other text extractors is based on a misunderstanding:
none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.
The "font size" parameter returned by all those libraries simply is the size argument of the Tf instruction, not the effective font size you observe in the final document, which is what you are trying to determine. So your comparison with other libraries does not make sense.
Now, concerning your approach:
In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:
21.33 * 0.75 = 15.9975
While some libraries call it so, calling the fourth cm parameter "scale (y)" is misleading. E.g. in case of text rotated by 90° it usually is zero while the graphic representation usually is not reduced to zero height.
Thus, merely using the "scale (y)" parameter does not work, you have to take the whole transformation into account.
Eventually let's discuss what you actually are after.
As long as the combined transformation matrix (current transformation matrix + text matrix + horizontal scaling) is orthogonal and text lines are following this orthogonality, the meaning of your notion of font size is fairly obvious.
But as soon as there is a shearing in that combined matrix, the meaning of "font size" is not obvious anymore.
You might mean the length of what an originally vertical line (one unit high) is transformed into.
You might mean the length of the projection of that transformed line onto a line at a right angle to the transformed font base line.
Or you might mean the length of the projection of that transformed line onto a line at a right angle to an observed base line.
The first two numbers are trivial to calculate using simple linear algebra. The third number may be more difficult because you have to determine the base line observed by humans in the resulting PDF. In case of innovative use of transformations this might be non-trivial.
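For the first two numbers, the linear algebra could look like the following Python sketch. The helper names are my own, and text rise, character spacing, and the translation part of Td are ignored because they do not change lengths. Matrices are written as the six PDF operands [a b c d e f] in the row-vector convention.

```python
import math

def mul(m, n):
    """Return m × n for PDF matrices given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    g, h, i, j, k, l = n
    return (a*g + b*i, a*h + b*j, c*g + d*i, c*h + d*j,
            e*g + f*i + k, e*h + f*j + l)

def font_size_measures(size, tm, ctm, h_scale=1.0):
    """Return (length of the transformed vertical unit, its height
    measured at a right angle to the transformed base line)."""
    a, b, c, d, _, _ = mul(tm, ctm)
    base = (size * h_scale * a, size * h_scale * b)  # image of the base line direction
    up = (size * c, size * d)                        # image of the vertical direction
    length = math.hypot(*up)
    base_len = math.hypot(*base)
    cross = base[0] * up[1] - base[1] * up[0]
    perpendicular = abs(cross) / base_len if base_len else 0.0
    return length, perpendicular

# CTM of the sample page: "1 0 0 -1 0 792 cm" followed by ".75 0 0 .75 0 0 cm".
ctm = mul((0.75, 0, 0, 0.75, 0, 0), (1, 0, 0, -1, 0, 792))
tm = (1, 0, 0, -1, 0, 140)
print(font_size_measures(21.33, tm, ctm))

# Text rotated by 90 degrees: the fourth matrix entry is 0, the size is not.
print(font_size_measures(12, (1, 0, 0, 1, 0, 0), (0, 1, -1, 0, 0, 0)))
```

For the sample page both measures come out at about 15.9975, i.e. the 16pt you expected, and they only start to differ once the combined matrix contains shearing.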

PDF : How a PDF table made by Latex is represented using PDF operators?

I have a PDF document made by LaTeX which contains a table.
Which PDF operators represent this table? I think LaTeX draws the table, right?
I want to extract it using the PDFBox library.
When I decoded the PDF table I found these lines related to graphical objects and text.
Do the lines between q and Q draw the rules of the table?
stream
q
1 0 0 1 139.746 715.892 cm
[]0 d 0 J 0.398 w 0 0 m 100.9 0 l S
Q
q
1 0 0 1 139.746 703.738 cm
[]0 d 0 J 0.398 w 0 0 m 0 11.955 l S
Q
BT
/F8 9.9626 Tf 148.795 707.324 Td [(aaaa)]TJ
ET
q
1 0 0 1 186.626 703.738 cm
[]0 d 0 J 0.398 w 0 0 m 0 11.955 l S
Q
BT
/F8 9.9626 Tf 198.277 707.324 Td [(bbbb)]TJ
ET
The explanation for the commands can easily be found in Adobe's PDF Reference 1.7.
One command at a time, and remembering that PDF has postfix notation, we can find in Chapter 4 "Graphics":
q % save graphics state (§4.2.1)
1 0 0 1 139.746 715.892 cm % set transform matrix (§4.2.3)
% --this is a simple 'translate' to (139.746,715.892)
[]0 d % set dash pattern to solid (§4.3.3)
0 J % set line cap to Butt
0.398 w % set line width to 0.398 units
0 0 m % move "current point" (§4.4.1)
100.9 0 l % append straight line
S % stroke the path (§4.4.2)
Q % restore the graphics state
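To see where those rules end up on the page, you can replay the operators yourself. The following Python sketch is a toy interpreter written for this answer, not PDFBox code: it only understands q, Q, cm, w, m, l, S, n, f and f*, splits the stream on whitespace (so it breaks on strings containing spaces), and reports the line width as given, which is only the visible width while the matrix is a pure translation as it is here.

```python
def mul(m, n):
    """Return m × n for PDF matrices given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    g, h, i, j, k, l = n
    return (a*g + b*i, a*h + b*j, c*g + d*i, c*h + d*j,
            e*g + f*i + k, e*h + f*j + l)

def apply(m, x, y):
    """Transform the point (x, y) by the matrix m."""
    a, b, c, d, e, f = m
    return (x*a + y*c + e, x*b + y*d + f)

def stroked_segments(stream):
    """Return [(start, end, line_width), ...] for every stroked line."""
    ctm, width = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0), 1.0
    saved, operands, path, out = [], [], [], []
    current = None
    for tok in stream.replace("[", " [ ").replace("]", " ] ").split():
        try:
            operands.append(float(tok))   # numbers are operands
            continue
        except ValueError:
            pass
        if tok in ("[", "]"):             # array brackets carry no meaning here
            continue
        if tok == "q":
            saved.append((ctm, width))
        elif tok == "Q" and saved:
            ctm, width = saved.pop()
        elif tok == "cm" and len(operands) >= 6:
            ctm = mul(tuple(operands[-6:]), ctm)
        elif tok == "w" and operands:
            width = operands[-1]
        elif tok == "m" and len(operands) >= 2:
            current = apply(ctm, *operands[-2:])
        elif tok == "l" and len(operands) >= 2 and current is not None:
            nxt = apply(ctm, *operands[-2:])
            path.append((current, nxt))
            current = nxt
        elif tok == "S":
            out.extend((a, b, width) for a, b in path)
            path = []
        elif tok in ("n", "f", "f*"):
            path = []                     # path ends without being stroked
        operands = []                     # every operator consumes its operands
    return out

sample = """
q 1 0 0 1 139.746 715.892 cm []0 d 0 J 0.398 w 0 0 m 100.9 0 l S Q
q 1 0 0 1 139.746 703.738 cm []0 d 0 J 0.398 w 0 0 m 0 11.955 l S Q
"""
for segment in stroked_segments(sample):
    print(segment)
```

The first segment is the horizontal rule at y = 715.892 and the second is a vertical rule starting at the same x; its upper end at about 715.693 sits half a line width (0.199) below the centre of the horizontal rule.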