Has anyone written a library (or just a program) that optimizes the contents of PDF page streams? I am talking about things like "delete q...Q blocks that have no overall effect", "merge adjacent BT...ET blocks", "track the graphics state and delete operators that set something to the value it already has", maybe even "reorder drawing operations to minimize graphics state changes, when this can be done without changing the appearance of the page". I ain't picky as to implementation language, but open source is very much preferred, as I may need to hack it up for my particular needs.
Here is a small fragment of an example of what I would like done. R's "grid" graphics + its PDF backend generate ridiculous numbers of pointless operations, like this:
1 J 1 j q
Q q
Q q
Q q
Q q
Q q
Q q
Q q
Q q
Q q
BT
0.000 0.000 0.000 rg
/F2 1 Tf 12.00 0.00 -0.00 12.00 168.43 14.40 Tm [(T) 120 (ask)] TJ
ET
Q q
BT
0.000 0.000 0.000 rg
/F2 1 Tf 0.00 12.00 -12.00 0.00 19.42 205.26 Tm
[(Quer) -15 (ies per min) 10 (ute)] TJ
ET
Q q
Q q 23.02 489.60 26.53 0.00 re W n
Q q
Q q 23.02 489.60 26.53 0.00 re W n
Q q
Q q
Q q
[...]
This could be crushed down to just
1 J 1 j
BT
/F2 1 Tf
12 0 0 12 168.43 14.40 Tm [(T) 120 (ask)] TJ
0 12 -12 0 19.42 205.26 Tm [(Quer) -15 (ies per min) 10 (ute)] TJ
ET
and possibly even further with more sophisticated use of the text operators, which I can't do in my head.
The Java "Multivalent Tools" has a "compress" tool that will do this:
http://multivalent.sourceforge.net/Tools/pdf/Compress.html
The Compress tool was removed from the latest Multivalent jar, but you can download an older version from the following location:
http://code.google.com/p/pdfsizeopt/downloads/detail?name=Multivalent20060102.jar&can=2&q=
That looks remarkably like the PDF output of iText's PdfGraphics2D interface, in a worse-than-usual case. The usual case isn't so hot either, but it's not THAT bad.
If I'm right, the answer is still no, but you can write one yourself, as you clearly have no fear of content streams:
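// ByteBuffer here is iText's own class, not java.nio.ByteBuffer
ByteBuffer internalBuf = myPdfContentByte.getInternalBuffer();
// magic() is the content stream rewriting step you have to write yourself
String newContents = magic( internalBuf.toString() );
// throw away the old operators and put the cleaned-up ones in their place
internalBuf.reset();
internalBuf.append( newContents );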
magic() is a tad nebulous, but writing code to remove "q Q" pairs should be trivial. Yanking clipping regions with nothing inside them (a path followed by W n, then Q) shouldn't be all that much harder with a bit of regex.
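To make the "trivial" part concrete, here is a rough Python sketch of that kind of pass. It is not taken from iText or any existing tool: the tokenizer regex and the name strip_empty_blocks are made up for illustration, and it assumes no inline images, no nested parentheses in strings, and that a clip block always looks like "q x y w h re W n Q".

```python
import re

# Hypothetical, simplified tokenizer for content streams (an assumption,
# not a full PDF lexer: no inline images, no nested parentheses).
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)    # literal string
  | <<|>>                   # dictionary delimiters
  | <[0-9A-Fa-f\s]*>        # hex string
  | [\[\]]                  # array delimiters
  | [^\s\[\]()<>]+          # numbers, names, operators
""", re.VERBOSE)

NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def strip_empty_blocks(stream: str) -> str:
    """Drop 'q Q' pairs and 'q x y w h re W n Q' blocks that paint nothing."""
    out = []
    for tok in TOKEN.findall(stream):
        out.append(tok)
        if tok != "Q":
            continue
        if len(out) >= 2 and out[-2] == "q":
            # Empty save/restore pair.
            del out[-2:]
        elif (len(out) >= 9 and out[-9] == "q"
              and all(NUMBER.fullmatch(t) for t in out[-8:-4])
              and out[-4] == "re" and out[-3] in ("W", "W*")
              and out[-2] == "n"):
            # A clip rectangle that is restored before anything is drawn.
            del out[-9:]
    return " ".join(out)
```

Because the output list is used like a stack, nested empty pairs such as "q q Q Q" collapse as well. Working on tokens rather than on the raw text is what keeps a string like (q Q) from being mangled.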
Getting rid of the line cap/line join settings (j & J) when they aren't used would be Harder. Ditto with combining text blocks or dumping redundant changes to the fill/stroke colors, font&size etc.
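For the "redundant changes to the fill color" case, the bookkeeping might look roughly like the Python sketch below. The function name drop_redundant_rg and the token-list input are my own assumptions; it only understands rg, treats every other fill-color operator as "now unknown", and leaves it to the caller to decide whether the page's default black counts as already set.

```python
# Operators that change the fill colour in ways this sketch does not track.
OTHER_FILL_OPS = {"g", "k", "cs", "sc", "scn"}


def drop_redundant_rg(tokens, initial=None):
    """Remove 'r g b rg' when the fill colour already has that value.

    tokens  -- content stream split into tokens (assumed pre-tokenized)
    initial -- known fill colour at the start, or None if unknown
    """
    out = []
    current = initial   # last known fill colour as (r, g, b), None = unknown
    saved = []          # colours pushed by q and restored by Q
    for tok in tokens:
        if tok == "q":
            saved.append(current)
        elif tok == "Q":
            current = saved.pop() if saved else None
        elif tok == "rg" and len(out) >= 3:
            try:
                colour = tuple(float(t) for t in out[-3:])
            except ValueError:
                colour = None   # operands were not plain numbers
            if colour is not None and colour == current:
                del out[-3:]    # drop the three operands and the operator
                continue
            current = colour
        elif tok in OTHER_FILL_OPS:
            current = None
        out.append(tok)
    return out
```

Passing initial=(0.0, 0.0, 0.0) tells it to treat the default black as equivalent to "0 0 0 rg", which is what lets the rg lines in the question's fragment disappear entirely.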
"Sophisticated use of the text operators" is going to start looking like black magic compiler optimization in short order.
And if this does happen to be iText, we'd all appreciate it if you'd share your code. We'll cheerfully accept just about any PdfGraphics2D output clean up, I assure you.
I have a basic example from a PDF I'm editing.
The code
/P <</MCID 0>> BDC q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 12 Tf
1 0 0 1 56.64 759.96 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[(n)4(a)4(m)4(e)] TJ
ET
Q
q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 12 Tf
1 0 0 1 109.7 759.96 Tm
0 g
0 G
[( )] TJ
ET
Q
works perfectly, producing "name" without quotes when I open the PDF.
Sadly, if I replace the n with a c, something happens:
The same thing happens if I write [(N)4(a)4(m)4(e)] TJ (capital N) or [(Name)] TJ.
What am I doing wrong?
Perhaps your font is subset, and so does not have a glyph for c. Your PDF viewer may be substituting a glyph from another standard font, but obeying the given metadata width for the c glyph in the font dictionary for your subset font, which will be 0 for a missing glyph. Hence the overwriting.
Edit: this should have been a comment, not an answer. Sorry.
My PDF file contains following commands:
1.0 0 0 -1.0 0 810.0 cm
1.0 0 0 1.0 0 0 cm
1.0 0 0 1.0 9.0 9.0 cm
-9.0 -9.0 m 621.0 -9.0 l 621.0 801.0 l -9.0 801.0 l h
q
1.0 1.0 1.0 rg f
Q
q
1.0 0 0 1.0 0 0 cm
/Div <</MCID 0>> BDC
q
/GS_0-0-0-0x0 gs
q
q
q
1.0 0 0 1.0 -25.98 -17.82 cm
Q
q
1.0 0 0 1.0 -25.98 -17.82 cm
0 0 m 666.0 0 l 666.0 144.2 l 0 144.2 l h
q
.2392 .2784 .3215 rg f
Q
Q
Q
Q
Q
A rectangle is drawn at lines 4 and 20. There is a fill command "f" at lines 6 and 22.
Calling "f" at line 6 clears the current path; however, "Q" on line 7 should bring it back. So line 22 should paint two rectangles, but the PDF viewer draws only one rectangle. My question is, which command exactly clears the first rectangle before the line 22?
On one hand you can see that q does not save the current path by looking into the specification as KenS has proposed in his comment. If you use the ISO norm (either ISO 32000-1 or ISO 32000-2), you'll find the tables for "Device-Independent Graphics State Parameters" and "Device-Dependent Graphics State Parameters" in section 8.4.1. You'll see that the only path stored there is the clipping path.
But you can also see that q does not save the current path by considering when a q instruction is allowed: You'll find in particular that after defining a path the next instruction must be either a path painting instruction or a clipping path instruction immediately followed by a path painting instruction. Thus, a path is already used (and, therefore, discarded) before the next q instruction! So q has no chance to save a current path. (For a normative reference have a look at Figure 9 – Graphics Objects – in ISO 32000-1 or ISO 32000-2).
As an aside, the second paragraph above tells you that your examples are invalid: neither the q nor the rg instructions you put between path definition and path painting are allowed.
So concerning your question:
My question is, which command exactly clears the first rectangle before the line 22?
As your example instruction sequence is invalid, the result is undefined. But as there is no element for the current path in the graphics state, you in particular should not expect the path to re-appear after Q.
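If you want to check instruction sequences for this kind of problem yourself, a minimal Python sketch of the rule might look as follows. It is only my reading of the Figure 9 state diagram reduced to three states; the function name find_path_violations is made up, it takes operator names without operands, and it does not check anything else (for example an l without a preceding m).

```python
PATH_BEGIN = {"m", "re"}                                   # start a path object
PATH_BUILD = {"m", "l", "c", "v", "y", "h", "re"}          # path construction
PATH_PAINT = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"}
CLIP = {"W", "W*"}


def find_path_violations(operators):
    """Return (position, operator) pairs that appear where only path
    construction, clipping or painting operators are allowed.

    Positions are 1-based indexes into the operator list.
    """
    state = "page"
    violations = []
    for index, op in enumerate(operators, start=1):
        if state == "page":
            if op in PATH_BEGIN:
                state = "path"
        elif state == "path":
            if op in PATH_PAINT:
                state = "page"          # path is used up and discarded
            elif op in CLIP:
                state = "clip"
            elif op not in PATH_BUILD:
                violations.append((index, op))
        else:  # state == "clip": only a painting operator may follow W / W*
            if op in PATH_PAINT:
                state = "page"
            else:
                violations.append((index, op))
    return violations
```

Fed the operators of the first twelve instructions from the question (cm cm cm m l l l h q rg f Q), it flags the q and the rg that sit between the path definition and the f.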
I have this PostScript code from this PDF's first page:
0 804 624 -654 re
W* n
0 792 612 -792 re
0 792 m
W n
0 792.06 612 -792 re
W n
I'm trying to think why would a rectangle have negative height and how would that affect the painting of the path. I know W* and W is for clipping and n is just a no-op but I don't get why would you paint a negative height rectangle.
That's not PostScript, it's PDF; the two are different. I've removed the PostScript tag.
The content you've posted here will not paint anything at all, since (as you note) it consists entirely of clip operations applied to rectangular paths.
Most probably the path is required to be constructed that way in order to get the winding correct (this is especially important since one of the clips uses the even-odd rule)
To put it more simply, the operands to the first re are :
0 804 624 -654 re
That could be constructed from paths as:
0 804 m
624 804 l
624 150 l
0 150 l
h
The code could have used :
0 150 624 654 re
But then the equivalent path would be:
0 150 m
624 150 l
624 804 l
0 804 l
h
If you draw those rectangles (including the direction of travel) you'll see that one proceeds clockwise, while the other proceeds anti-clockwise.
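If you'd rather compute the direction than draw it, the signed area (shoelace formula) of the expanded path tells you. A small Python sketch, with helper names of my own choosing, assuming default PDF user space where y increases upwards:

```python
def re_to_path(x, y, w, h):
    """Corner points visited by 'x y w h re', in drawing order."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def signed_area(points):
    """Shoelace formula: positive for counter-clockwise, negative for
    clockwise travel (with the y axis pointing up)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2.0


clockwise = signed_area(re_to_path(0, 804, 624, -654))      # negative height
anticlockwise = signed_area(re_to_path(0, 150, 624, 654))   # positive height
```

For a rectangle the signed area is simply w * h, so a negative height (or a negative width, but not both) flips the direction of travel while covering the same region.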
I'm trying to write a PDF parser in C# but I've run into an issue where I'm unsure how to interpret the specification.
Unless otherwise specified user space in a PDF document is 1/72 of an inch (i.e. 1pt).
The scale provided by the Tf operator scales the font from the standard size (generally 1 unit of user space / 1pt) to the correct display size.
I have the following page content:
1 0 0 -1 0 792 cm
q
0 0 612 792 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1056 re
f
0 0 816 1056 re
f
0 0 816 1056 re
f
Q
Q
q
0 0 612 791.25 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1055 re
f
0 96 816 960 re
f
0 0 0 RG 0 0 0 rg
BT
/F0 21.33 Tf
1 0 0 -1 0 140 Tm
96 0 Td <0037> Tj
13.0280762 0 Td <004B> Tj
11.8616943 0 Td <004C> Tj
4.7384338 0 Td <0056> Tj
ET
BT
/F1 21.33 Tf
1 0 0 -1 0 140 Tm
136.292267 0 Td <0001> Tj
ET
...
I know that the font size in points of the 2 text operations defined in the sample is 16pt; however, the Tf operator is using a size of 21.33. In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:
21.33 * 0.75 = 15.9975
However I could find nothing in the PDF specification supporting this conversion and none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.
Should I use the CTM (as defined by the cm operator) to scale the font size back to the correct scale or is this just pure chance?
The pdf file is here: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.Tests/Integration/Documents/Single%20Page%20Simple%20-%20from%20google%20drive.pdf
First of all, your comparison with other text extractors is based on a misunderstanding:
none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.
The "font size" parameter returned by all those libraries simply is the size argument of the Tf instruction, not the effective font size you observe in the final document which you are trying to determine. So your comparison with other libraries does not make sense.
Now, concerning your approach:
In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:
21.33 * 0.75 = 15.9975
While some libraries call it so, calling the fourth cm parameter "scale (y)" is misleading. E.g. in case of text rotated by 90° it usually is zero while the graphic representation usually is not reduced to zero height.
Thus, merely using the "scale (y)" parameter does not work, you have to take the whole transformation into account.
Eventually let's discuss what you actually are after.
As long as the combined transformation matrix (current transformation matrix + text matrix + horizontal scaling) is orthogonal and text lines are following this orthogonality, the meaning of your notion of font size is fairly obvious.
But as soon as there is a shearing in that combined matrix, the meaning of "font size" is not obvious anymore.
You might mean the length of what an originally vertical line (one unit high) is transformed into.
You might mean the length of the projection of that transformed line onto a line at a right angle to the transformed font base line.
Or you might mean the length of the projection of that transformed line onto a line at a right angle to an observed base line.
The first two numbers are trivial to calculate using simple linear algebra. The third number may be more difficult because you have to determine the base line observed by humans in the resulting PDF. In case of innovative use of transformations this might be non-trivial.
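To illustrate the first two numbers, here is a minimal Python sketch. The function names are mine and it ignores text rise; it combines font size, text matrix and CTM in the row-vector order PDF uses and then measures what a one-unit vertical line turns into.

```python
import math


def concat(m1, m2):
    """Matrix product m1 x m2 for PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def effective_sizes(tfs, tm, ctm, th=1.0):
    """Return (length of the transformed unit vertical,
    its projection onto the normal of the transformed base line)."""
    a, b, c, d, _, _ = concat(concat((tfs * th, 0, 0, tfs, 0, 0), tm), ctm)
    vertical = math.hypot(c, d)
    perpendicular = abs(a * d - b * c) / math.hypot(a, b)
    return vertical, perpendicular


# The question's sample: two cm instructions, then Tf 21.33 and the Tm.
ctm = concat((0.75, 0, 0, 0.75, 0, 0), (1, 0, 0, -1, 0, 792))
sizes = effective_sizes(21.33, (1, 0, 0, -1, 0, 140), ctm)

# Text rotated by 90 degrees: the fourth matrix entry is 0, the size is not.
rotated = effective_sizes(12, (1, 0, 0, 1, 0, 0), (0, 1, -1, 0, 0, 0))
```

For the sample page both numbers come out at about 15.9975, matching 21.33 * 0.75, because there is no rotation or shear involved. In the rotated case both are 12 even though the "scale (y)" entry of the matrix is 0.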
I have a PDF document made by LaTeX which contains a table.
What are the PDF operators that represent this table? I think LaTeX draws the table, right?
I want to extract it using the PDFBox library.
When I decoded the PDF table I found these lines related to graphical objects and text.
Do the lines between q and Q draw the lines for the table?
stream
q
1 0 0 1 139.746 715.892 cm
[]0 d 0 J 0.398 w 0 0 m 100.9 0 l S
Q
q
1 0 0 1 139.746 703.738 cm
[]0 d 0 J 0.398 w 0 0 m 0 11.955 l S
Q
BT
/F8 9.9626 Tf 148.795 707.324 Td [(aaaa)]TJ
ET
q
1 0 0 1 186.626 703.738 cm
[]0 d 0 J 0.398 w 0 0 m 0 11.955 l S
Q
BT
/F8 9.9626 Tf 198.277 707.324 Td [(bbbb)]TJ
ET
The explanation for the commands can easily be found in Adobe's PDF Reference 1.7.
One command at a time, and remembering that PDF has postfix notation, we can find in Chapter 4 "Graphics":
q % save graphics state (§4.2.1)
1 0 0 1 139.746 715.892 cm % set transform matrix (§4.2.3)
% --this is a simple 'translate' to (139.746,715.892)
[]0 d % set dash pattern to solid (§4.3.3)
0 J % set line cap to Butt
0.398 w % set line width to 0.398 units
0 0 m % move "current point" (§4.4.1)
100.9 0 l % append straight line
S % stroke the path (§4.4.2)
Q % restore the graphics state
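To show how those commands add up to table rules, here is a rough Python sketch (not PDFBox, and not a real content stream parser) that tracks q, Q, cm, m, l and S and reports each stroked line in page coordinates. The function names are hypothetical, tokens are split on whitespace only, and every other painting operator is simply treated as discarding the pending path.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
OTHER_PAINT = {"s", "f", "F", "f*", "B", "B*", "b", "b*", "n"}


def concat(m1, m2):
    """Matrix product m1 x m2 for PDF matrices (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def transform(m, x, y):
    """Apply matrix m to the point (x, y)."""
    return (x * m[0] + y * m[2] + m[4], x * m[1] + y * m[3] + m[5])


def stroked_segments(content):
    """Return [((x0, y0), (x1, y1)), ...] for every line stroked with S."""
    ctm, saved = IDENTITY, []
    operands, pending, current = [], [], None
    segments = []
    for tok in content.split():
        if tok == "q":
            saved.append(ctm)
        elif tok == "Q":
            if saved:
                ctm = saved.pop()
        elif tok == "cm" and len(operands) >= 6:
            ctm = concat(tuple(operands[-6:]), ctm)
        elif tok == "m" and len(operands) >= 2:
            current = transform(ctm, *operands[-2:])
        elif tok == "l" and len(operands) >= 2 and current is not None:
            end = transform(ctm, *operands[-2:])
            pending.append((current, end))
            current = end
        elif tok == "S":
            segments.extend(pending)
            pending, current = [], None
        elif tok in OTHER_PAINT:
            pending, current = [], None   # simplification: path is dropped
        else:
            try:
                operands.append(float(tok))
                continue                  # keep collecting operands
            except ValueError:
                pass                      # some other operator or operand
        operands = []
    return segments
```

Run on the stream from the question, the first q...Q block yields a horizontal segment starting at (139.746, 715.892) and 100.9 units long, and the second a vertical segment 11.955 units high: one horizontal rule and one cell border of the table.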