Ghostscript txtwrite bbox limits

When I use Ghostscript with the txtwrite device, I get an XML file that describes my PDF, e.g.:
<page>
<block>
<line>
<span bbox="95 97 357 97" font="..." size="9.0000">
<char bbox="95 97 106 97" c="a"/>
<char bbox="106 97 117 97" c="b"/>
<char bbox="117 97 126 97" c="c"/>
...
</span>
</line>
</block>
<block>
...
My question is whether there is a known scale for the bbox (bounding box) coordinates (X1, Y1, X2, Y2), or whether they are page dependent. In either case, can I retrieve the page dimensions somehow, so that I know its height and width?
My main goal is to derive features such as whether a character is positioned beyond the center of the page.
My full command to convert the PDF to XML:
ghostscript -q -sPAPERSIZE=a4 -r200 -sDEVICE=txtwrite -sOutputFile=<output-path.xml> -dTextFormat=1 -dBATCH -dNOPAUSE <input-path.pdf>

The bounding box is in PostScript/PDF units, i.e. 1/72 inch. Note that the output isn't really XML; it's only XML-like.
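As a rough illustration of using those units, here is a minimal Python sketch that checks whether each character lies beyond the horizontal center of the page. It assumes the page is A4 (about 595 x 842 points), which is only a guess based on -sPAPERSIZE=a4; a PDF normally carries its own page size, so that value should be replaced with the real one. The function name chars_with_side is made up for this example, and a regular expression is used instead of an XML parser because the output is not guaranteed to be well-formed XML.

```python
import re

# Assumption: the page is A4, roughly 595 x 842 points (1 point = 1/72 inch).
A4_WIDTH_PT = 595.0

# Matches lines such as: <char bbox="95 97 106 97" c="a"/>
CHAR_RE = re.compile(
    r'<char bbox="(-?[\d.]+) (-?[\d.]+) (-?[\d.]+) (-?[\d.]+)" c="(.*?)"/>'
)

def chars_with_side(text, page_width=A4_WIDTH_PT):
    """Return (char, relative_x, is_right_of_center) for each <char> element."""
    result = []
    for match in CHAR_RE.finditer(text):
        x1, _y1, x2, _y2 = (float(match.group(i)) for i in range(1, 5))
        mid_x = (x1 + x2) / 2.0          # horizontal midpoint of the glyph box
        result.append((match.group(5), mid_x / page_width, mid_x > page_width / 2.0))
    return result

sample = '<char bbox="95 97 106 97" c="a"/>\n<char bbox="400 97 410 97" c="z"/>'
for char, rel_x, right_side in chars_with_side(sample):
    print(char, round(rel_x, 3), right_side)
```

Dividing a coordinate by 72 gives inches, so the same numbers can also be compared against a physical page size.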

Related

Equivalent of shape-rendering="crispEdges" in PDF

Consider the following SVG:
<svg xmlns="http://www.w3.org/2000/svg" version="1.1" viewBox="0 0 40 20">
<g shape-rendering="crispEdges">
<rect x="0" y="0" width="20" height="20" fill="#b4b4b4"/>
<rect x="20" y="0" width="20" height="20" fill="#b4c4b4"/>
</g>
</svg>
The intended effect of the shape-rendering="crispEdges" annotation is to prevent there being a visible seam between the two rectangles, no matter how the rendering is scaled. This works as intended when viewing the SVG file in both Firefox and Chromium. However, when I convert the SVG into a PDF using inkscape -A and view the PDF, I can still see a visible seam at some zoom levels, e.g. as in this screen shot:
Moreover, the PDF page stream produced by Inkscape is identical with and without shape-rendering="crispEdges":
1 0 0 -1 0 15 cm
q
0.705882 0.705882 0.705882 rg /a0 gs
0 0 15 15 re f
0.705882 0.768627 0.705882 rg 15 0 15 15 re f
Q
and the /ExtGState dictionary referenced as /a0 is also identical:
/ExtGState <<
/a0 <<
/CA 1
/ca 1
>>
>>
This could mean that there is no equivalent in PDF of this SVG feature, or it could mean that Inkscape's PDF exporter doesn't implement the equivalent. I'm not having any luck finding anything that sounds like this SVG feature in the PDF specification, which is an argument in favor of "no equivalent", but the PDF spec is gigantic and I could easily have missed something.
So the question is: Is or isn't there an equivalent in PDF of this SVG feature, and if there is, how do I use it? I am prepared to edit my exported PDF by hand if I have to.
Note 1: The example is minimal; I originally noticed the problem with a much more complicated figure from an academic paper, in which there are many such rectangles aligned to a grid, but some grid positions are empty. I tried enlarging the rectangles in the original figure so they would overlap, and I was not able to find an amount of enlargement that eliminated all visible seams without also visibly causing the rectangles to bleed into the empty spaces.
Note 2: With the original figure, the problem is visible with Evince, pdf.js, and two printers manufactured by different companies.

The closest equivalent in PDF would be to use shading meshes (e.g. tensor-product patch, lattice-form, and free-form meshes). This will remove the slivers in most viewers.
Some PDF viewers (such as Acrobat and Xodo/PDFTron) have options that minimize the appearance of these slivers, but in general this is not handled well across implementations.

Adobe Illustrator 19 SVG Viewbox?

We produce icons in Illustrator as SVGs and then generate font-based icons with fontcustom. Out of nowhere, they started coming in too low. In the SVG I found this odd viewBox with a negative 49 in it. How is this controlled in Illustrator? I don't want any viewBox, I just want a perfectly centered icon. I also see that Illustrator thinks the artwork is grouped. My only fix is to ungroup and then set X and Y to zero, and it works. It does put a transform on the layer to compensate for the -49. Something is causing this odd offset.
<svg version="1.1" id="Layer_1" xmlns:x="&ns_extend;" xmlns:i="&ns_ai;" xmlns:graph="&ns_graphs;"
xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px" viewBox="-49 141 512 512"
style="enable-background:new -49 141 512 512;" xml:space="preserve">

The viewBox in Illustrator is based on the artboard, I believe, and it's a pain to get it to match the artwork exactly. The easy way to zero it all out is to select the object in Illustrator and copy it, then go to your code editor, create a new document, and paste. This pastes the SVG code with the viewBox set correctly (starting at 0,0). Then just save the file as an SVG and you're done.

Precise bounding box for glyphs in a PDF?

I'm trying to calculate the exact bounding box of every text glyph in a vector PDF.
This involves keeping track of the CTM, drawing/positioning PDF instructions, etc., but also calculating the boundaries of every specific glyph in "glyph space" (using the information from the GLYF tables in the embedded fonts).
I realize the PDF FontDescriptor includes a rough bounding box for each embedded font, but that's a composite of all glyphs in the font -- i.e., the smallest bounding box that fits all glyphs in the font. For my purposes, I need more precise positioning.
My specific application is extracting the musical semantics from a vector PDF of sheet music. As such, one nice constraint is that I can assume glyphs aren't drawn together in the same Tj/TJ operator. Each glyph is drawn independently.
Also, note that I'm defining bounding box as "the smallest box that can contain all the drawn parts of the glyph." There's no need to ignore the ascenders/descenders/etc. that might be considered "outside" the bounding box in other applications.
There are many moving parts here, and I've found it's quite hard to debug. So here's what I'd love help with:
This example PDF I've created has 10 glyphs. What is the "ground truth" bounding box positioning for these 10 glyphs, in device space? My current code produces the following, but it's incorrect. I know it's incorrect because it says the first glyph ("&") horizontally intersects the second ("\u02d9"), which you can see isn't true when you view the PDF in a PDF reader.
'&' ( 57.2799755477664, 600.7092061684704, 86.7452642315424, 677.1570718099680)
'\u02d9' ( 82.0030393188000, 633.6851606704608, 96.3090818379936, 644.6969866323168)
'\u0153' (144.7841941848000, 623.9630080194528, 158.6735558539200, 634.5581702962656)
'\u0153' (181.6778111184000, 619.0027260546528, 195.5671727875200, 629.5978883314656)
'w' (226.1671727148000, 611.3638918288608, 245.0765465300448, 622.3161944071392)
'w' (320.1063822180000, 631.2050196880608, 339.0157560332448, 642.1573222663392)
'\u0153' (414.0455917212000, 641.3239948962528, 427.9349533903200, 651.9191571730656)
'\u0153' (450.9392086548000, 636.3637129314528, 464.8285703239200, 646.9588752082656)
'\u0153' (487.9878407856000, 631.4034309666528, 501.8772024547200, 641.9985932434656)
'\u0153' (524.8814577192000, 628.9232899842528, 538.7708193883200, 639.5184522610656)
How did you calculate those positions? (I realize this is a lot to ask, given complexity of PDF.) It would be a huge help to have a walkthrough, and I'm sure it would help others in the future.
Is there a tool that does this off the shelf?

I believe the only way to get truly accurate information is to actually render the glyphs at the given point size and collect the extents of the resulting bitmap.
Even extracting the path describing the glyph won't give you completely accurate information, because hinting can subtly (or in some cases, not so subtly) alter the way the glyph is rendered. In any event, extracting the path is as much work as rendering the bitmap, possibly more.
There are broadly three categories of font in PDF:
Fonts with PostScript outlines
Fonts with TrueType outlines
User defined fonts.
You can use FreeType to render glyphs from fonts with PostScript and TrueType outlines (you can also have it return the path if you would rather use that).
User defined (type 3) fonts you have to treat as a series of PDF operations, scaled by the text matrix. So you need to do that yourself.
Note that fonts can be organised in two ways, as regular fonts and as CIDFonts, and the means of getting the glyph data corresponding to a character code differ between the two. I assume you're already prepared to deal with that in your existing code.
It's possible that in your case you have a workflow which limits the kinds of fonts you might see, so you may not need a full implementation of all this. For example, I see that you are using CIDFonts with TrueType outlines, but the CIDToGIDMap is /Identity, which reduces the scope of the problem.
For additional complexity, you will need to consider what represents the 'bounding box' of your glyph. Do you consider the advance width and left side-bearing to be part of the bounding box, or just the areas marked?
Remember that PDF can specify different Widths for glyphs to those defined in the font, and both your fonts include /W arrays which modify the widths defined in the font.
If you consider the left side-bearing and advance width as part of the glyph, but have a /Widths array with a value smaller than the advance width it may be that two glyphs will appear to 'collide', but actually still have white space between them. All the /Widths has done is reduce the white space from the advance width so that the glyphs are closer together than would normally be the case.
I had a quick bash at this using MuPDF which gave the answers:
<span bbox="39.21884 163.68216 42.53509 163.99687" font="PlantinMTStd-Regular" size="11.935925">
<char bbox="39.21884 163.68216 42.53509 163.99687" x="39.21884" y="163.99687" c=" "/>
<span bbox="57.200607 163.69899 73.08967 165.2394" font="OpusStd" size="19.841537">
<char bbox="57.200607 163.69899 73.08967 165.2394" x="57.200607" y="165.2394" c="&"/>
<char bbox="82.003044 151.29828 90.63545 152.83868" x="82.003044" y="152.83868" c="˙"/>
<char bbox="144.7842 161.21884 153.1744 162.75925" x="144.7842" y="162.75925" c="œ"/>
<char bbox="181.67781 166.17912 190.06801 167.71953" x="181.67781" y="167.71953" c="œ"/>
<char bbox="226.16718 173.61955 236.8826 175.15996" x="226.16718" y="175.15996" c="w"/>
<char bbox="320.10638 153.77843 330.8218 155.31883" x="320.10638" y="155.31883" c="w"/>
<char bbox="414.0456 143.85785 422.4358 145.39825" x="414.0456" y="145.39825" c="œ"/>
<char bbox="450.9392 148.81815 459.3294 150.35855" x="450.9392" y="150.35855" c="œ"/>
<char bbox="487.98785 153.77843 496.37805 155.31883" x="487.98785" y="155.31883" c="œ"/>
<char bbox="524.8815 156.25856 533.27167 157.79897" x="524.8815" y="157.79897" c="œ"/>
And for completeness, here's the same information from Ghostscript using the txtwrite device with -dTextFormat=0:
<page>
<span bbox="39 164 43 164" font="PlantinMTStd-Regular" size="11.9357">
<char bbox="39 164 39 164" c=" "/>
</span>
<span bbox="57 165 73 165" font="OpusStd" size="19.8411">
<char bbox="57 165 57 165" c="&"/>
</span>
<span bbox="82 153 91 153" font="OpusStd" size="19.8411">
<char bbox="82 153 82 153" c="˙"/>
</span>
<span bbox="145 163 153 163" font="OpusStd" size="19.8411">
<char bbox="145 163 145 163" c="œ"/>
</span>
<span bbox="182 168 190 168" font="OpusStd" size="19.8411">
<char bbox="182 168 182 168" c="œ"/>
</span>
<span bbox="226 175 237 175" font="OpusStd" size="19.8411">
<char bbox="226 175 226 175" c="w"/>
</span>
<span bbox="320 155 331 155" font="OpusStd" size="19.8411">
<char bbox="320 155 320 155" c="w"/>
</span>
<span bbox="414 145 422 145" font="OpusStd" size="19.8411">
<char bbox="414 145 414 145" c="œ"/>
</span>
<span bbox="451 150 459 150" font="OpusStd" size="19.8411">
<char bbox="451 150 451 150" c="œ"/>
</span>
<span bbox="488 155 496 155" font="OpusStd" size="19.8411">
<char bbox="488 155 488 155" c="œ"/>
</span>
<span bbox="525 158 533 158" font="OpusStd" size="19.8411">
<char bbox="525 158 525 158" c="œ"/>
</span>
</page>
It does look like there's a bug there, though: the urx value is incorrect in the char bbox but correct in the span bbox.

You may also want to look at this Adobe GitHub repository:
github.com/adobe-type-tools
The afdko subdirectory contains a lot of command-line tools which can be useful for testing, checking and converting font files. I used the tx tool from this repo to print some information about the font file extracted from your PDF sample with mutool extract:
$ mutool extract pdf_example.pdf
extracting font QNAAAA+PlantinMTStd-Regular-0013.ttf
extracting font QSAAAA+OpusStd-0018.ttf
Then:
$ tx -mtx QSAAAA+OpusStd-0018.ttf
tx: --- QSAAAA+OpusStd-0018.ttf
tx: (ttr) cmap table missing
### glyph[tag] {gname,enc,width,{left,bottom,right,top}}
glyph[0] {.notdef,-,0,{0,0,0,0}}
glyph[1] {g1,-,1640,{4,-1313,1489,2540}}
glyph[2] {g2,-,891,{0,-276,721,279}}
glyph[3] {g3,-,866,{0,-266,700,268}}
glyph[4] {g4,-,1106,{0,-276,953,276}}
Maybe this, or one of the other 28 command line tools in this repo, could also be useful to you...
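To show how the per-glyph boxes printed by tx could be turned into positions on the page, here is a minimal Python sketch. It makes several assumptions that are not confirmed by the PDF itself: the text is unrotated and unskewed, the effective font size already accounts for the text matrix and CTM, and the font uses 2048 units per em. That last value is common for TrueType fonts and appears consistent with the MuPDF output above (866 / 2048 x 19.84 is about 8.39, the width reported for the third glyph type), but it should really be read from the font's head table. The function name glyph_bbox_in_user_space is made up for this example.

```python
def glyph_bbox_in_user_space(origin, font_size, glyph_bbox, units_per_em=2048):
    """Scale a glyph-space box {left, bottom, right, top} to user space.

    origin       -- (x, y) of the glyph origin on the baseline, y pointing up
    font_size    -- effective font size in points (assumed to include any scaling)
    glyph_bbox   -- box in font units, as printed by `tx -mtx`
    units_per_em -- assumption: 2048; read the real value from the font
    """
    x0, y0 = origin
    left, bottom, right, top = glyph_bbox
    scale = font_size / units_per_em      # font units -> points
    return (x0 + left * scale, y0 + bottom * scale,
            x0 + right * scale, y0 + top * scale)

# glyph[3] from the tx output, placed at an illustrative origin of (0, 0)
box = glyph_bbox_in_user_space((0.0, 0.0), 19.8411, (0, -266, 700, 268))
print([round(v, 3) for v in box])
```

If the result comes out roughly twice as large as what a viewer shows, a units-per-em value of 1000 (typical for PostScript-flavoured fonts) may have been applied to a font that actually uses 2048.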

How to correctly crop PDF with uneven text margins

I have a PDF like this:
where the margins around the text content differ from page to page.
Is there any tool that can correct this for me?
I know Scan Tailor can do this on bitmaps, but this is a PDF with just a text layer, so I'm not after a solution that would involve bitmaps at any stage.
Update:
OK, for me there is no need to try to run PDFCrop on Windows, as main feature is provided by ghostscript. This command (taken from pdfcrop perl script):
gswin32c.exe -dSAFER -dNOPAUSE -dBATCH -q -r72 -sDEVICE=bbox -f input.pdf 2> bbox.txt
produces bbox.txt file, with text content dimensions, as if there are no margins (bounding box). It looks like this:
%%BoundingBox: 91 259 474 757
%%HiResBoundingBox: 91.000000 259.000000 474.000000 757.000000
%%BoundingBox: 85 224 470 768
%%HiResBoundingBox: 85.000000 224.000000 469.375000 768.000000
%%BoundingBox: 102 217 489 768
%%HiResBoundingBox: 102.000000 217.000000 488.457031 768.000000
...
where the first two numbers are the x,y values of the lower left corner and the last two are those of the upper right corner, measured from the lower left edge of the page (in points, which equal pixels at -r72).
This file can be read in the language of your choice, the bounding boxes adjusted as desired, and the result passed back to Ghostscript, e.g. as described here: Cropping a PDF using Ghostscript 9.01
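As one possible way to read that file, here is a minimal Python sketch that parses the %%HiResBoundingBox lines and adds a uniform margin around each box. The function names read_bboxes and padded are made up for this example, and the margin value is arbitrary. The sketch does not clamp the result to the page, since the page size is not part of bbox.txt.

```python
import re

# Matches lines such as: %%HiResBoundingBox: 85.000000 224.000000 469.375000 768.000000
HIRES_RE = re.compile(r"^%%HiResBoundingBox:\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)")

def read_bboxes(text):
    """Return one (llx, lly, urx, ury) tuple of floats per page."""
    boxes = []
    for line in text.splitlines():
        match = HIRES_RE.match(line)
        if match:
            boxes.append(tuple(float(value) for value in match.groups()))
    return boxes

def padded(box, margin):
    """Grow a box by the same margin (in points) on all four sides."""
    llx, lly, urx, ury = box
    return (llx - margin, lly - margin, urx + margin, ury + margin)

sample = """%%BoundingBox: 91 259 474 757
%%HiResBoundingBox: 91.000000 259.000000 474.000000 757.000000
%%BoundingBox: 85 224 470 768
%%HiResBoundingBox: 85.000000 224.000000 469.375000 768.000000
"""
for page_number, box in enumerate(read_bboxes(sample), start=1):
    print(page_number, padded(box, 20.0))
```

Each padded box can then be used as the crop rectangle for the corresponding page.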

If you are sure that only text is involved (and not images with text drawn on it or paths drawing symbols), you can quite easily build such a tool in Java using iText (or most likely also some .NET language using iTextSharp) using the parser package functionality.
The book iText in Action, 2nd edition, in chapter 15.3.4 shows how to find the text margins, and the sample code can be found in ShowTextMargins.java in the SourceForge iText SVN repository.
By manipulating the MediaBox entries of the individual pages you can then adapt the margins as desired.

Combining SVG images programmatically

I have a program that generates multiple SVG files in batch, which I then need to be able to combine (tiled) into one file, with a set whitespace and set width in cm (or mm).
I need either an existing script or a pointer to which libraries and languages I can use to accomplish this. Any suggestions where to start?

Here are some tools which can help you to create a SVG sprite sheet from your svg files:
SVG STACK
SVG UTILS
And then you can clean up your svg when all done with a tool like
SCOUR

Yes, as #victor-henriquez noted, you can use montage, but it's a bit tricky. I got into it by enabling the -verbose output, which showed that montage builds an inkscape command; analysing that command solved the issue for me.
montage -version
# Version: ImageMagick 7.0.7-31 Q16 x86_64 20180506
I wanted …
… to label desktop icons: use -label and -pointsize (tricky to get font size correct via pointsize but depending on density)
… to increase -density (it’s tricky to find a suitable number for the output)
… to stack and tile them orderly: use -tile 15x30 (here 15 columns x 30 rows)
… to add a margin on each sub-image: use -geometry '+40+0' (adds 40px horizontally but 0px vertically)
The resulting command was (add -verbose to get detailed processing information):
montage -label '%f' -pointsize 2 -density 300 *.svg \
-tile 15x30 \
-geometry '+40+0' \
./papirus-icons-mimetypes.png
If you additionally specify the desired output size in pixels, e.g. 96 by 96 pixels with -geometry '96x96+40+0', it becomes even harder to understand what role -density plays. I failed to figure it out in depth ;-)

I used Victor gem https://github.com/DannyBen/victor
require 'victor'
# Read each file and drop its first and last lines, which are assumed to be
# the opening <svg ...> tag and the closing </svg> tag.
first_content = File.read("first.svg").split("\n")[1..-2].join("\n")
second_content = File.read("second.svg").split("\n")[1..-2].join("\n")
# Append both bodies to a new SVG document and write it out.
svg = Victor::SVG.new width: "100%", height: "100%"
svg << first_content
svg << second_content
svg.save 'final.svg'

You can have a look at montage, from ImageMagick: http://www.imagemagick.org/Usage/montage/
You can build your script around it.
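If you would rather keep the result as vector SVG instead of a raster montage, the tiling itself can be sketched with only the Python standard library. This is an illustration under assumptions, not a tested tool: each input is taken to be well-formed XML with an <svg> root that has a viewBox (so it scales into its tile), every tile is square, and the function name tile_svgs and its parameters are made up for this example. Each input is nested as a child <svg> positioned with x and y in mm.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)   # serialize with a default xmlns, no prefix

def tile_svgs(svg_texts, columns, tile_mm, gap_mm):
    """Lay out SVG documents on a grid; sizes and gaps are in millimetres."""
    if not svg_texts:
        raise ValueError("need at least one SVG")
    rows = -(-len(svg_texts) // columns)             # ceiling division
    total_w = columns * tile_mm + (columns - 1) * gap_mm
    total_h = rows * tile_mm + (rows - 1) * gap_mm
    root = ET.Element(f"{{{SVG_NS}}}svg",
                      {"width": f"{total_w}mm", "height": f"{total_h}mm"})
    for index, text in enumerate(svg_texts):
        child = ET.fromstring(text)                  # the input's own <svg> root
        col, row = index % columns, index // columns
        child.set("x", f"{col * (tile_mm + gap_mm)}mm")
        child.set("y", f"{row * (tile_mm + gap_mm)}mm")
        child.set("width", f"{tile_mm}mm")
        child.set("height", f"{tile_mm}mm")
        root.append(child)
    return ET.tostring(root, encoding="unicode")

icon = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
        '<rect width="10" height="10"/></svg>')
print(tile_svgs([icon, icon, icon], columns=2, tile_mm=30, gap_mm=5))
```

Element ids that repeat across the input files (gradients, clip paths) may clash once everything lives in one document, so they might need renaming first.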
