PDFBox - How to convert Xheight to points - pdf

I am trying to find the x-height of a font using PDFBox.
font is of type PDFont.
println(font.name + ": " + font.fontDescriptor.xHeight)
For font size 16pt the output is:
TimesNewRomanPS-BoldMT: 546.0
But I cannot work out how to convert this 546.0 into points, pixels, or mm.

When you shared the PDF you took your information from, the cause became clear: The information in the font at hand simply is broken.
Details
As an example you refer to CourierNew in your example file font-list-1.pdf.
This font is used on page 2, the associated FontDescriptor is this object:
44 0 obj
<<
/StemV 42
/FontName/CourierNewPSMT
/FontStretch/Normal
/FontWeight 400
/Flags 34
/Descent -300
/FontBBox[-21 -680 638 1021]
/Ascent 832
/FontFamily(Courier New)
/CapHeight 578
/XHeight -578
/Type/FontDescriptor
/ItalicAngle 0
>>
endobj
So the font's XHeight value is -578. Which means it is rubbish in multiple ways:
It is negative. According to the specification the XHeight value is the vertical coordinate of the top of flat nonascending lowercase letters (like the letter x), measured from the baseline (ISO 32000-1, Table 122 – Entries common to all font descriptors). A negative value, therefore, means that all those flat nonascending lowercase letters are drawn entirely below the baseline.
This obviously is nonsense for a fairly normal font like CourierNew.
When loading the font descriptor, PDFBox executes a sanity check and takes the absolute value here which is why you have not seen the negative sign.
The absolute value of XHeight equals the CapHeight value which is specified as the vertical coordinate of the top of flat capital letters, measured from the baseline (ibidem).
Ignoring the negative XHeight sign (which is nonsense, see above), therefore, the font claims that flat nonascending lowercase letters and flat capital letters reach up to the same top coordinate.
This obviously is nonsense for CourierNew.
(The XHeight values of many other fonts in your sample file are similarly broken.)
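The two observations above can be turned into a small plausibility test. This is only a sketch: the class and method names are made up for illustration and are not part of PDFBox, and the rule "x-height above the baseline and below the cap height" is a heuristic for ordinary Latin text fonts, not a requirement of the PDF specification.

```java
/** Hypothetical helper, not part of PDFBox. */
final class XHeightSanity {

    /**
     * Returns true if the descriptor values look plausible for a Latin text font:
     * the x-height lies above the baseline and strictly below the cap height.
     */
    static boolean isPlausible(double xHeight, double capHeight) {
        return xHeight > 0 && capHeight > 0 && xHeight < capHeight;
    }

    public static void main(String[] args) {
        // Values from the CourierNewPSMT font descriptor quoted above.
        System.out.println(isPlausible(-578, 578)); // false: raw value is negative
        System.out.println(isPlausible(578, 578));  // false: absolute value equals CapHeight
    }
}
```

A value that fails such a test should not be used for layout decisions.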
How else to get a sensible x height value
If you really need an x-height value for your fonts, you should inspect the drawing instructions for the flat nonascending lowercase letters in them and derive an x-height value from their respective heights.
(This won't always succeed because those fonts may be available as embedded subsets only, and such subsets might be void of flat nonascending lowercase letters.)
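As for the unit conversion asked about in the question: font descriptor metrics such as XHeight are given in glyph space, which for all font types except Type 3 is 1/1000 of a text space unit. A trustworthy value can therefore be scaled roughly as sketched below. The helper names are hypothetical, and the sketch assumes that neither the text matrix nor the current transformation matrix adds further scaling, so that one text space unit at font size 1 equals one point.

```java
/** Hypothetical helper for converting glyph space values; not part of PDFBox. */
final class GlyphUnits {

    /** Glyph space units (1/1000 em) to points at the given font size. */
    static double toPoints(double glyphUnits, double fontSize) {
        return glyphUnits / 1000.0 * fontSize;
    }

    /** Points (1/72 inch) to millimetres. */
    static double pointsToMillimetres(double points) {
        return points * 25.4 / 72.0;
    }

    /** Points to device pixels at the given resolution in dots per inch. */
    static double pointsToPixels(double points, double dpi) {
        return points * dpi / 72.0;
    }

    public static void main(String[] args) {
        // The value from the question: 546 units at font size 16, about 8.74 pt.
        double pt = toPoints(546.0, 16.0);
        System.out.printf("%.3f pt, %.3f mm, %.1f px at 300 dpi%n",
                pt, pointsToMillimetres(pt), pointsToPixels(pt, 300.0));
    }
}
```

Of course this only helps if the XHeight entry itself is sane, which it is not for the sample file discussed above.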

Related

PDF TJ operator

is it possible to determine if a number in TJ operator represents space between words?
Example: [(Sta)28(ry)-333(Plzenec,)]TJ
Number 28 is not enough for a space, whereas 333 should be a space given the actual font size. The font size is 9.96.
First of all please be aware that there is no absolute limit number separating numbers for spaces between words from spaces for kerning. All you can do is develop heuristics which will fail for some documents, usually for very tightly set ones.
Now remember how those numbers are applied when calculating the text displacement tx or ty from the origin of the last character before the number to the origin of the first character thereafter:
tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th
ty = (w1 - Tj / 1000) * Tfs + Tc + Tw
(ISO 32000-1, section 9.4.4 Text Space Details)
Thus, first of all such a number only widens the gap to the next character if it's negative.
Furthermore, the number is applied before the font size is multiplied; thus, one does not have to take the font size into account as I incorrectly claimed in a comment to the question.
The number (scaled by 1/1000) is directly subtracted from the glyph's displacement. So one can compare it with the glyph displacements of the font in question to get an impression of the meaning of the number.
The glyph displacements essentially are the numbers from the corresponding font's Widths or W array (defaulting to the MissingWidth / DW value) scaled by 1/1000. As both the TJ numbers and the widths are scaled by 1/1000, you can directly compare them.
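To make the arithmetic concrete, here is a sketch of the horizontal case of the formula from section 9.4.4, tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th. The class and parameter names are assumptions made for this example; Th is passed as a factor (1.0 for the default horizontal scaling of 100), and the calls in main set w0 to 0 to isolate the contribution of the TJ number alone.

```java
/** Hypothetical helper implementing the horizontal text displacement formula. */
final class TextDisplacement {

    /**
     * @param w0  horizontal glyph displacement in text space units (width / 1000)
     * @param tj  number from the TJ array (0 if none follows the glyph)
     * @param tfs font size
     * @param tc  character spacing
     * @param tw  word spacing (0 unless the glyph is the single-byte code 32)
     * @param th  horizontal scaling as a factor, 1.0 meaning 100 %
     */
    static double tx(double w0, double tj, double tfs, double tc, double tw, double th) {
        return ((w0 - tj / 1000.0) * tfs + tc + tw) * th;
    }

    public static void main(String[] args) {
        // Numbers from [(Sta)28(ry)-333(Plzenec,)]TJ at font size 9.96.
        double kern = tx(0.0, 28.0, 9.96, 0.0, 0.0, 1.0);   // negative: shifts left
        double gap = tx(0.0, -333.0, 9.96, 0.0, 0.0, 1.0);  // positive: widens the gap
        System.out.printf("28 -> %.3f, -333 -> %.3f%n", kern, gap);
    }
}
```

Note that the font size scales the glyph width and the TJ number alike, which is why the two can be compared without it.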
Thus, an obvious option would be to compare the absolute value of negative TJ numbers to the width of the space glyph in the font in question. This differs from font to font, e.g. it's 600 for Courier, 278 for Helvetica, and 250 for Times-Roman.
Spaces between words created by TJ numbers don't necessarily have to be as wide as the full space glyph of the font, but a relevant fraction of it, e.g. half its value (YMMV), can be used as minimum for interpreting a TJ number as a space between words.
Unfortunately, though, if a PDF generator creates all spaces between words by TJ numbers and none by space glyphs, and if the font is embedded as a subset only, there is no need to embed the space glyph at all. In that case you might want to use other glyphs to compare with; often the width of a capital 'M' is used as a measure for the widths of a font, so you might want to use a relevant fraction thereof, e.g. one fifth (YMMV again).
You can improve your heuristics
by also taking the character spacing value Tc into account: If Tc / Tfs is negative with a relevant absolute value, the text is tightly set. In that case you might want to lessen the limit number determined as above. Or
by an analysis of all the TJ numbers in your text or those in the surrounding text. Here I can only guess, though, what might be acceptable heuristics...
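The basic comparison against a reference glyph width could be sketched as follows. Everything here is an assumption for illustration: the class and method names are made up, the fraction 0.5 is just the "half the space width" suggestion from above, and the space width 250 is the Times-Roman value because the question does not say which font is used.

```java
/** Hypothetical heuristic; it will misjudge some documents, as explained above. */
final class WordGapHeuristic {

    /**
     * @param tjNumber       number taken from the TJ array
     * @param referenceWidth width of the reference glyph in glyph space units,
     *                       e.g. the space glyph or a capital 'M'
     * @param fraction       fraction of the reference width that counts as a word gap
     */
    static boolean isWordGap(double tjNumber, double referenceWidth, double fraction) {
        // Only negative numbers widen the gap (horizontal writing mode).
        return tjNumber < 0 && -tjNumber >= referenceWidth * fraction;
    }

    public static void main(String[] args) {
        double spaceWidth = 250; // assumed: Times-Roman space glyph
        // Numbers from [(Sta)28(ry)-333(Plzenec,)]TJ
        System.out.println(isWordGap(28, spaceWidth, 0.5));   // false
        System.out.println(isWordGap(-333, spaceWidth, 0.5)); // true
    }
}
```

When the space glyph is missing from a subset font, the same method can be called with the width of 'M' and a smaller fraction such as 0.2.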

How do I calculate word spacing in a PDF document?

For example:
20 0 0 48 20 500.0 Tm
[(H)6(ello)54(Wor)7(ld)] TJ
0 -1.1075 TD
There is no space (32) character in this array of text.
But somehow viewers understand that 54 is a space, while 6 and 7 are character spacing (kerning). Any ideas?
The TJ operator is documented in the PDF specification PDF 32000-1:2008 - Table 109 – Text-showing operators as follows:
Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space [...]. This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. [...]

iTextSharp - Incorrect text position

When extracting the position of the words in this example:
http://www.dertour.de/static/agb/2015/sommer/DER_Deutschland_So15.pdf
with iTextSharp 5.5.8
I'm getting 'incorrect' coordinates for some words. For example on line 17 of the first paragraph: 'gehen oder im Widerspruch zur Reiseaus-'
the x-values of the left-top position of the words are 118, 217, 296, 350, 524, 587. Only the first value seems correct; the expected values are 118, 208, 277, 320, 487, 540. The x-value of the right-bottom point of the space character between 'gehen' and 'oder' is 208, which seems correct and also seems to be the correct x-position for the word 'oder'. Maybe it has something to do with the fill mode of the paragraph, but I'm not sure which actions I should perform to get the right coordinates.
I'm using LocationTextExtractionStrategy and calculate the word-positions to a 300 dpi coordinate system.
public override void RenderText(TextRenderInfo renderInfo)
{
    // for the provided example
    // uUnit = 1
    // originX = 33.862
    // originY = 33.555
    // dpi = 300
    // above values were calculated with code:
    // PdfNumber userUnit = pageDict.GetAsNumber(PdfName.USERUNIT);
    // if (userUnit != null)
    // {
    //     uUnit = userUnit.FloatValue;
    // }
    // Rectangle dim = reader.GetPageSize(i);
    // float originX = dim.Left;
    // float originY = dim.Bottom;
    // calculate coordinates:
    renderInfo.GetText();
    LineSegment segment = renderInfo.GetBaseline();
    List<TextRenderInfo> charInfo = renderInfo.GetCharacterRenderInfos().ToList();
    foreach (TextRenderInfo item in charInfo)
    {
        LineSegment char_segment = item.GetBaseline();
        int char_left = (int)Math.Round((char_segment.GetStartPoint()[0] - originX) * dpi * uUnit / 72.0f);
        int char_top = (int)Math.Round((item.GetAscentLine().GetEndPoint()[1] - originY) * dpi * uUnit / 72.0f);
        int char_right = (int)Math.Round((char_segment.GetEndPoint()[0] - originX) * dpi * uUnit / 72.0f);
        int char_bottom = (int)Math.Round((item.GetDescentLine().GetStartPoint()[1] - originY) * dpi * uUnit / 72.0f);
    }
}
This indeed is a bug in iText & iTextSharp:
The lines with the extremely inaccurate x coordinates are those for which a large wordspacing value is set, e.g. your line:
0.2861 Tw T*
[<0047004500480045004E0000>-286<004F0044004500520000>-286<0049004D0000>-231<003700490044004500520053005000520055004300480000>-286<005A005500520000>-286<00320045004900530045004100550053000D>]TJ
(That 0.2861 argument for Tw is large.)
According to the ToUnicode map of the font in question the 0000 at the end of each word maps to the space character. Thus, iText here adds the word spacing value when calculating the x coordinates because according to the PDF specification ISO 32000-1:
Word spacing works the same way as character spacing but shall apply only to the ASCII SPACE character
(First sentence of section 9.3.3 Word Spacing)
Unfortunately it does not take into account
Word spacing shall be applied to every occurrence of the single-byte character code 32 in a string when using
a simple font or a composite font that defines code 32 as a single-byte code. It shall not apply to occurrences of
the byte value 32 in multiple-byte codes.
(Last sentence of section 9.3.3 Word Spacing)
At the 0000 above, therefore, word spacing must not be applied even though it is mapped to the space character because
the font encoding in question is purely multi-byte and
even in case of single-byte encoded space characters the word spacing is applied only at the single-byte code 32, not at a code which merely maps to the space character with ASCII code 32.
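Expressed as code, the rule from section 9.3.3 is very short. The following is a sketch with made-up names, assuming the string has already been split into character codes according to the font's encoding or CMap; it is not how iText implements it.

```java
/** Hypothetical helper illustrating when word spacing (Tw) applies. */
final class WordSpacingRule {

    /**
     * Word spacing applies only to the single-byte character code 32,
     * regardless of which Unicode character a code maps to.
     */
    static boolean applies(byte[] characterCode) {
        return characterCode.length == 1 && characterCode[0] == 32;
    }

    public static void main(String[] args) {
        System.out.println(applies(new byte[] {0x20}));       // true: single-byte code 32
        System.out.println(applies(new byte[] {0x00, 0x00})); // false: the 0000 from the PDF above
        System.out.println(applies(new byte[] {0x00, 0x20})); // false: byte 32 inside a two-byte code
    }
}
```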
Usually this is not a problem during text extraction: PDF generators which encode space characters using multi-byte encodings are usually aware that word spacing does not apply to them and, therefore, don't change the word spacing from its default value 0, so the iText bug does no harm here. Usage of word spacing instructions usually indicates that fonts are used which do map the single-byte code 32 to the space character.
Your PDF, on the other hand, seems not to have been created with that fact in mind. It looks like the word spacing was set first (0.2861 Tw) and, after recognizing that it made no difference, explicit gaps were added (-286 in the TJ instruction). (Or that was part of the development history of the PDF generator in question.)
Please be aware that positive values in the TJ argument mean a shift to the left, so negative values (as claimed for the -286 above) indeed widen or add gaps:
array TJ Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm . The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. Figure 46 shows an example of the effect of passing offsets to TJ.
(Table 109 – Text-showing operators in ISO 32000-1)

StemV value of the TrueType font

I'm embedding a TrueType font into a PDF and thus need to create a font descriptor dictionary for it.
Among the required fields is StemV, and I haven't found where in the ttf this info is stored.
I think I saw a hint somewhere that it is part of the CVT program, but nothing specific.
So, my question is how to find out the StemV value for a given TrueType font. I want to read this value from the ttf file directly (as opposed to using, e.g., the Windows API) as I want to write a cross-platform solution.
Update:
Grep-ed LibreOffice 5.1.0.3 source and it seems that when exporting to pdf, the FontDescriptor is generated in vcl/source/gdi/pdfwriter_impl.cxx, method PDFWriterImpl::emitFontDescriptor(). There, around line 3888 is following code:
// According to PDF reference 1.4 StemV is required
// seems a tad strange to me, but well ...
aLine.append( "\n"
"/StemV 80\n" );
The question is now why is it 80, not 42? Seriously though, if a project like LibreOffice uses a hardcoded constant, it seems to indicate that the value is either not stored in the font file or reading it is extremely costly (i.e. requires implementing a TrueType font engine to interpret the font program).
BTW, for those who are wondering what this StemV is - in the "PDF Reference
sixth edition" it is described as "The thickness, measured horizontally, of the dominant vertical stems of glyphs in the font".
According to ISO 32000-1:2008, while StemH is optional, StemV is required (see Table 122). Alas, there doesn't seem to be a clear consensus on where to get this data from.
The value is probably derived from Adobe's original Type 1 font format (and its compact form, CFF):
The entry StdVW is an array with only one real number entry
expressing the dominant width of vertical stems (measured horizontally
in character space units). Typically, this will be the width
of straight stems in lower case letters. (For an italic font program,
give the width of the vertical stem measured at an angle perpendicular
to the stem direction.) For example:
/StdVW [85] def
(Adobe Type 1 Font Format, February 1993, Version 1.1, p. 42)
This is an optional entry in the /Private Dictionary of a CFF font.
However, Werner Lemberg states (http://blog.gmane.org/gmane.comp.fonts.freetype.devel/month=20130601)
The StemV value is not used by the PDF engine if the embedded font
is either a Type 1 or CFF font; in that case the value from the
private dictionary gets used. For a CID font, the value associated
with the glyph's font DICT gets used.
In case there is no StemV value in the PDF, the following algorithm
applies ...
which adds to the confusion, since it is marked "Required" in the PDF specs.
Some other toolkits' attempts
Apache FOP notes in its 'goals' under Fonts
.. if [important], parse the .pfb file to extract it when building the FOP xml metric file ..
(http://www.cs.helsinki.fi/group/xmltools/formatters/fop/fop-0.20.5/build/site/dev/fonts.html)
PDFLib uses FreeType, and the header file ft_font.h contains a list:
+---------------------------------------------------------------------------+
Copyright (c) 1997-2006 Thomas Merz and PDFlib GmbH. All rights reserved. |
+---------------------------------------------------------------------------+
(.. omitted..)
/*
* these defaults are used when the stem value
* must be derived from the name (unused)
*/
#define FNT_STEMV_MIN 50 /* minimum StemV value */
#define FNT_STEMV_LIGHT 71 /* light StemV value */
#define FNT_STEMV_NORMAL 109 /* normal StemV value */
#define FNT_STEMV_MEDIUM 125 /* mediumbold StemV value */
#define FNT_STEMV_SEMIBOLD 135 /* semibold StemV value */
#define FNT_STEMV_BOLD 165 /* bold StemV value */
#define FNT_STEMV_EXTRABOLD 201 /* extrabold StemV value */
#define FNT_STEMV_BLACK 241 /* black StemV value */
Note the "unused". This list also only appears in older versions of FreeType.
PrawnPDF just says (http://prawnpdf.org/docs/0.11.1/Prawn/Font/TTF.html)
stemV()
not sure how to compute this for true-type fonts...
The TrueType Embedder in Apache FontBox makes an educated guess:
// StemV - there's no true TTF equivalent of this, so we estimate it
fd.setStemV(fd.getFontBoundingBox().getWidth() * .13f);
(https://pdfbox.apache.org/download.cgi) - where I feel I must add that it's better than nothing, but only by a very narrow margin. For most fonts, the relationship between stem width and bounding box is not this simple. There are also some famous fonts that fatten "inwards" and so their bounding boxes actually have the exact same values.
Further searching led me all the way back to a 1998 UseNet post:
.ttf tables, and PDF's StemV value
From: John Bley
Date: Tue, 16 Jun 1998 17:09:19 GMT
When embedding a TrueType font in PDF, I require a vertical stem width value - I can get all the other values (ascent, descent, italic angle, etc.) that I need from various .ttf tables, but I can't seem to locate or calculate the average or normal vertical (or horizontal) stem width anywhere. By watching an embedded PDF font, I know that the "hint" in the 'OS/2' table is not enough - it's a highly precise value, not a 1-10 kind of scale. Any clues? Thanks for your time!
The value is not in TrueType fonts. You have to calculate it by analysis of, say, the cap I glyph. Don't worry too much about putting in a precise value: the value will only ever be used if the font is not present with the PDF file, when a vaguely similar font will be used instead. -- Laurence
(http://www.truetype-typography.com/ttqa_1998.htm)
The "'OS/2' table" hint, presumably, is usWeightClass. While its values are defined in the range from 100 to 900, this is not a continuous range. Only whole multiples of 100 are used, and so it's a scale from 1-9 (not 1-10 as mentioned in the question above). The scale is derived from Microsoft's font definitions, which only have these 9 distinct values. (Note that the ft_font.h file only lists 8 predefined stem values. Another problem, there.)
An (inconclusive) InDesign test
Using Adobe InDesign CS4, I created a small test PDF using the font Aller in Light, Regular, and Bold, and Arial in Regular, Bold, and Black weights (these are both TTF fonts) and found InDesign writes out the StemV's as
Aller-Light 68
Aller-Regular 100
Aller-Bold 144
Arial 88
Arial-Bold 136
Arial-Black 200
This shows InDesign uses some kind of heuristics to calculate the stem width for each individual font and does not rely on a fixed weight based table. It is not as simple as "the width of an uppercase 'I'", which are 69, 102, 147 (Aller) and 94.7, 144.5, 221.68 (Arial) design units, respectively. I tested deliberately with sans serif fonts, as the serifs on a serif font would need estimating the width somewhere halfway the glyph.
I exported the same document using InDesign CC 2014 and got the exact same values. I have no further ideas on how to find out where InDesign gets these values from.
(Later addition:) Minion Pro is a CFF flavour OpenType font and so it may contain a valid StdVW value. After testing, I found it does: 79 StdVW. Quite noteworthy: InDesign does not use this value but exports it as /StemV 80 instead. The value for Minion Pro Bold, 128, is correct but, at this point, I am positive this could be pure coincidence. With these two already different, I did not have further incentive to check either Minion Pro Semibold or Minion Black.
TL;DR Summary:
If you are embedding a Type 1 or CFF font, you could fill in whatever you want, and the actual value will be read from the font data
... except when it's not in there.
If you are embedding a TrueType font, you need to supply a good value.
The least worst solution seems to be to read usWeightClass out of the OS/2 header and map this directly to a reasonable value.
This is what PDFLib actually uses:
(from: https://fossies.org/dox/PDFlib-Lite-7.0.5p3/ft__font_8c_source.html)
#define FNT_STEMV_WEIGHT 65.0
#define FNT_STEMV_MIN 50
int
fnt_weight2stemv(int weight)
{
    double w = weight / FNT_STEMV_WEIGHT;
    return (int) (FNT_STEMV_MIN + w * w + 0.5);
}
Presumably, the 'weight' argument passed in is 'OS/2'.usWeightClass.
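For a cross-platform solution, both steps can be done without a font engine. The following sketch reads usWeightClass straight from the font bytes and applies a Java port of the fnt_weight2stemv formula above. The class and method names are my own, and the reader assumes a well-formed single-font sfnt file (plain .ttf or .otf, not a TrueType Collection): the table directory starts at offset 12 with 16-byte records, and usWeightClass is the unsigned 16-bit value at offset 4 of the OS/2 table.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: estimate StemV from OS/2.usWeightClass. */
final class StemVFromWeight {

    /** Returns usWeightClass, or -1 if the font has no OS/2 table. */
    static int readUsWeightClass(byte[] font) {
        ByteBuffer buf = ByteBuffer.wrap(font); // big-endian by default, like sfnt data
        int numTables = buf.getShort(4) & 0xFFFF;
        for (int i = 0; i < numTables; i++) {
            int record = 12 + 16 * i; // tag, checksum, offset, length (4 bytes each)
            String tag = new String(font, record, 4, StandardCharsets.US_ASCII);
            if (tag.equals("OS/2")) {
                int tableOffset = buf.getInt(record + 8);
                return buf.getShort(tableOffset + 4) & 0xFFFF;
            }
        }
        return -1;
    }

    /** Port of the fnt_weight2stemv formula quoted above. */
    static int weightToStemV(int weight) {
        double w = weight / 65.0;
        return (int) (50 + w * w + 0.5);
    }

    public static void main(String[] args) {
        System.out.println(weightToStemV(400)); // 88
        System.out.println(weightToStemV(700)); // 166
    }
}
```

In real use the byte array would come from something like java.nio.file.Files.readAllBytes on the font file. The result remains an estimate derived from a nine-step weight scale, not a measured stem width.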

write in unicode text on visible signature - pdfbox

I've built a PDF using PDFBox, and it has a visible signature, too. I write some text like this:
...
builderString.append("Tm\n");
builderString.append(" /F1 " + fontSize + "\n");
builderString.append("Tf\n");
builderString.append("(hello world)");
builderString.append("Tj\n");
builderString.append("ET");
...
PDStream stream= ...;
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
Everything works well, but if I put some Unicode characters into builderString, "???" appears instead of the text.
Here is a sample PDF: link here
QUESTION 1) When I look at the PDF structure, there are question marks instead of text. How do I write Unicode characters?
9 0 obj
<<
/Type /XObject
/Subtype /Form
/BBox [100 50 0 0]
/Matrix [1 0 0 1 0 0]
/Resources <<
/Font 11 0 R
/XObject <<
/img0 12 0 R
>>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
>>
/FormType 1
/Length 13 0 R
>>
stream
q 93.70079 0 0 50 0 0 cm /img0 Do Q
BT
1 0 0 1 93.70079 25 Tm
/F1 2
Tf
(????)Tj
ET
endstream
endobj
I have a font with the encoding WinAnsiEncoding. Can I use another encoding in PDFBox?
PDFont font = PDTrueTypeFont.loadTTF(template, new File("//fontName.ttf"));
font.setFontEncoding(new WinAnsiEncoding());
QUESTION 2) I've embedded a font in the PDF, but the text is not written with this font (in the visible signature rectangle). Why?
Question 3) When I remove the font, the text is still there (when the text is in English). What is the default font? /F1 - which font is that?
Question 4) How do I calculate the width of my text in the visible signature? Any ideas?
QUESTION 1) When I look at the PDF structure, there are question marks instead of text. How do I write Unicode characters?
I assume that with unicode characters you mean characters present in Unicode but not in e.g. Latin-1. (Because the letter 'a' for example does have a Unicode representation, too, but most likely won't cause you trouble.)
You call getBytes("ISO-8859-1") on your StringBuilder result. Your unicode characters most likely are not in ISO 8859-1. Thus, String.getBytes returns the ASCII code for a question mark in their respective place.
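This replacement is easy to reproduce in isolation. In the following sketch the class name is made up and the Georgian letter is just an arbitrary example of a character outside ISO 8859-1:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Demonstrates that String.getBytes silently replaces unmappable characters. */
final class ReplacementDemo {

    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 'a' and 'e' with acute accent are both part of ISO 8859-1.
        System.out.println(Arrays.toString(toLatin1("a\u00E9"))); // [97, -23]
        // U+10D0 (a Georgian letter) is not, so it becomes 63, the code of '?'.
        System.out.println(Arrays.toString(toLatin1("\u10D0")));  // [63]
    }
}
```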
If the question was merely how to write to an output stream with unicode characters in Java, the answer would be easy: Choose an encoding which contains all your characters, e.g. UTF-8, which all consumers of your program support, and call String.getBytes for that encoding.
The case at hand is different, though, as you want to serialize that information as a PDF form xobject stream. In this context your whole approach is somewhere along the route from highly questionable to completely wrong:
In PDFs, each font might come along with its own encoding which might be similar to a common encoding, e.g. /WinAnsiEncoding, or completely custom. These encodings, furthermore, in many cases are restricted to one byte per character, but in case of composite fonts they can also be multi-byte-encodings.
As a corollary, not all elements of the stream need to be encoded using the same encoding. E.g. the operator names Tm, Tf, and Tj are encoded using their ASCII codes while the characters of a string to be displayed have to be encoded using the respective font's encoding (and may thereafter be yet again hex-encoded if added in angle brackets <>).
Thus, creating the stream as a string and then converting it to bytes with a single encoding only works if all used fonts use the same encoding (for the actually used code points) which furthermore needs to be ASCII'ish to correctly represent the operators.
Essentially, you should directly construct the stream in some byte buffer and for each inserted element use the appropriate encoding. In case of characters to be displayed, therefore, you have to be aware of the encoding used by the currently selected font.
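A rough sketch of what "use the appropriate encoding for each element" can look like follows. It uses only the Java standard library, all names are made up, and ISO-8859-1 merely stands in for the encoding of the selected font; in a real document that encoding has to be taken from the font dictionary, and characters outside any single-byte encoding require a composite font. The font name /F1 and size 12 are placeholders, too.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: PDF syntax as ASCII, shown text in the font's encoding. */
final class ContentStreamSketch {

    /** Writes operators and operands that are plain PDF syntax. */
    static void writeSyntax(ByteArrayOutputStream out, String syntax) {
        byte[] bytes = syntax.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
    }

    /** Writes text as a hex string; fails instead of silently writing '?'. */
    static void writeShowText(ByteArrayOutputStream out, String text, Charset fontEncoding) {
        ByteBuffer encoded;
        try {
            encoded = fontEncoding.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Font encoding cannot represent: " + text, e);
        }
        StringBuilder hex = new StringBuilder("<");
        while (encoded.hasRemaining()) {
            hex.append(String.format("%02X", encoded.get() & 0xFF));
        }
        writeSyntax(out, hex.append("> Tj\n").toString());
    }

    /** Builds a tiny text object showing the given text. */
    static byte[] build(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeSyntax(out, "BT\n/F1 12 Tf\n");
        writeShowText(out, text, StandardCharsets.ISO_8859_1); // stand-in for the font's encoding
        writeSyntax(out, "ET\n");
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.print(new String(build("h\u00E9"), StandardCharsets.US_ASCII));
    }
}
```

Writing the text as a hex string keeps the stream itself pure ASCII, and the explicit error makes an unsupported character visible immediately instead of producing question marks in the PDF.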
If you want to do it right, first study the PDF specification ISO 32000-1, especially the sections on general syntax and chapter 9 Text.
QUESTION 2) I've embedded font in PDF. but text is not written with this font (in visible signature Rectangle). Why?
In the resources of the stream xobject in question there is exactly one embedded font associated to the name /F0. In your stream, though, you have /F1 2 Tf, i.e. you select a font /F1 at size 2.
Question 3) when I remove font, text was still there (when the text was in english). what is the default font?
According to the specification, section 9.3.1,
font shall be the name of a font resource in the Font subdictionary of the current
resource dictionary [...]
There is no initial value for either font or size
Most likely, though, PDF viewers for the sake of compatibility with old or broken documents use some default font.
Question 4) How to calculate width of my text in visible signature ? Any ideas?
The width obviously depends on the metrics of the font used (glyph widths in this case) and the graphics state you set (font size, character spacing, word spacing, current transformation matrix, text transformation matrix, ...).
In your case you hardly do anything in the graphics state and, therefore, only the selected font size from it is of interest. So the more interesting part is the character widths from the font metrics. As long as you use the standard 14 fonts, you find the metrics here. As soon as you start using other, custom fonts, you have to read them from the font definition files yourself.
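For the simple case described here (only a font size, no character or word spacing, no additional scaling by the matrices), the calculation boils down to the following sketch. The names are made up; the widths are passed in as glyph space units, and the example uses Courier, in which every glyph is 600 units wide.

```java
import java.util.Arrays;

/** Hypothetical helper: text width from glyph widths and font size only. */
final class TextWidth {

    /**
     * @param glyphWidths width of each glyph of the text, in 1/1000 text space units
     * @param fontSize    font size selected with Tf
     */
    static double width(int[] glyphWidths, double fontSize) {
        long sum = 0;
        for (int w : glyphWidths) {
            sum += w;
        }
        return sum / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        // "hello world" has 11 glyphs; in Courier each one is 600 units wide.
        int[] widths = new int[11];
        Arrays.fill(widths, 600);
        // Font size 2 as in "/F1 2 Tf" from the question: roughly 13.2 units.
        System.out.println(width(widths, 2.0));
    }
}
```

If I recall correctly, PDFBox's PDFont.getStringWidth returns the summed width in the same 1/1000 units, so the manual summation is only needed when working without the library.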
Ad 1)
Could it be that
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
should be
stream.createOutputStream().write(builderString.toString().getBytes("UTF-8"));
The conversion to ISO-8859-1 in getBytes turns every character missing from ISO-8859-1 into a ?.