I'm embedding a TrueType font into a PDF and thus need to create a font descriptor dictionary for it.
Among the required fields is StemV, and I haven't found where in the TTF file this information is stored.
I think I saw a hint somewhere that it is part of the CVT program, but nothing specific.
So, my question is how to find out the StemV value for a given TrueType font. I want to read this value from the TTF file directly (as opposed to using e.g. the Windows API) because I want to write a cross-platform solution.
Update:
I grepped the LibreOffice 5.1.0.3 source, and it seems that when exporting to PDF, the FontDescriptor is generated in vcl/source/gdi/pdfwriter_impl.cxx, in the method PDFWriterImpl::emitFontDescriptor(). There, around line 3888, is the following code:
// According to PDF reference 1.4 StemV is required
// seems a tad strange to me, but well ...
aLine.append( "\n"
"/StemV 80\n" );
The question now is why it is 80 and not 42. Seriously though, if a project like LibreOffice uses a hardcoded constant, that seems to indicate that the value is either not stored in the font file or that reading it is extremely costly (i.e. it requires implementing a TrueType font engine to interpret the font program).
BTW, for those who are wondering what this StemV is: in the "PDF Reference, sixth edition" it is described as "The thickness, measured horizontally, of the dominant vertical stems of glyphs in the font".
According to ISO 32000-1:2008, while StemH is optional, StemV is required (see Table 122). Alas, there doesn't seem to be a clear consensus on where to get this data from.
The value is probably derived from Adobe's original Type 1 font format (CFF, its compact successor, carries the same entry):
The entry StdVW is an array with only one real number entry
expressing the dominant width of vertical stems (measured horizontally
in character space units). Typically, this will be the width
of straight stems in lower case letters. (For an italic font program,
give the width of the vertical stem measured at an angle perpendicular
to the stem direction.) For example:
/StdVW [85] def
(Adobe Type 1 Font Format, February 1993, Version 1.1, p. 42)
This is an optional entry in the /Private Dictionary of a CFF font.
However, Werner Lemberg states (http://blog.gmane.org/gmane.comp.fonts.freetype.devel/month=20130601)
The StemV value is not used by the PDF engine if the embedded font
is either a Type 1 or CFF font; in that case the value from the
private dictionary gets used. For a CID font, the value associated
with the glyph's font DICT gets used.
In case there is no StemV value in the PDF, the following algorithm
applies ...
which adds to the confusion, since it is marked "Required" in the PDF specs.
Some other toolkits' attempts
Apache FOP notes in its 'goals' under Fonts
.. if [important], parse the .pfb file to extract it when building the FOP xml metric file ..
(http://www.cs.helsinki.fi/group/xmltools/formatters/fop/fop-0.20.5/build/site/dev/fonts.html)
PDFLib uses FreeType, and the header file ft_font.h contains a list:
+---------------------------------------------------------------------------+
Copyright (c) 1997-2006 Thomas Merz and PDFlib GmbH. All rights reserved. |
+---------------------------------------------------------------------------+
(.. omitted..)
/*
* these defaults are used when the stem value
* must be derived from the name (unused)
*/
#define FNT_STEMV_MIN 50 /* minimum StemV value */
#define FNT_STEMV_LIGHT 71 /* light StemV value */
#define FNT_STEMV_NORMAL 109 /* normal StemV value */
#define FNT_STEMV_MEDIUM 125 /* mediumbold StemV value */
#define FNT_STEMV_SEMIBOLD 135 /* semibold StemV value */
#define FNT_STEMV_BOLD 165 /* bold StemV value */
#define FNT_STEMV_EXTRABOLD 201 /* extrabold StemV value */
#define FNT_STEMV_BLACK 241 /* black StemV value */
Note the "unused". This list also only appears in older versions of FreeType.
PrawnPDF just says (http://prawnpdf.org/docs/0.11.1/Prawn/Font/TTF.html)
stemV()
not sure how to compute this for true-type fonts...
The TrueType Embedder in Apache FontBox makes an educated guess:
// StemV - there's no true TTF equivalent of this, so we estimate it
fd.setStemV(fd.getFontBoundingBox().getWidth() * .13f);
(https://pdfbox.apache.org/download.cgi) - where I feel I must add that it's better than nothing, but only by a very narrow margin. For most fonts, the relationship between stem width and bounding box is not this simple. There are also some famous fonts that fatten "inwards" and so their bounding boxes actually have the exact same values.
Further searching led me all the way back to a 1998 UseNet post:
.ttf tables, and PDF's StemV value
From: John Bley
Date: Tue, 16 Jun 1998 17:09:19 GMT
When embedding a TrueType font in PDF, I require a vertical stem width value - I can get all the other values (ascent, descent, italic angle, etc.) that I need from various .ttf tables, but I can't seem to locate or calculate the average or normal vertical (or horizontal) stem width anywhere. By watching an embedded PDF font, I know that the "hint" in the 'OS/2' table is not enough - it's a highly precise value, not a 1-10 kind of scale. Any clues? Thanks for your time!
The value is not in TrueType fonts. You have to calculate it by analysis of, say, the cap I glyph. Don't worry too much about putting in a precise value: the value will only ever be used if the font is not present with the PDF file, when a vaguely similar font will be used instead. -- Laurence
(http://www.truetype-typography.com/ttqa_1998.htm)
The "'OS/2' table" hint, presumably, is usWeightClass. While its values are defined in the range from 100 to 900, this is not a continuous range. Only whole multiples of 100 are used, so it is a scale from 1 to 9 (not 1-10 as mentioned in the question above). The scale is derived from Microsoft's font definitions, which only have these 9 distinct values. (Note that the ft_font.h file lists only 8 predefined stem values. Another problem there.)
An (inconclusive) InDesign test
Using Adobe InDesign CS4, I created a small test PDF using the font Aller in Light, Regular, and Bold, and Arial in Regular, Bold, and Black weights (these are both TTF fonts) and found InDesign writes out the StemV's as
Aller-Light 68
Aller-Regular 100
Aller-Bold 144
Arial 88
Arial-Bold 136
Arial-Black 200
This shows that InDesign uses some kind of heuristic to calculate the stem width for each individual font and does not rely on a fixed, weight-based table. It is not as simple as "the width of an uppercase 'I'", which is 69, 102, 147 (Aller) and 94.7, 144.5, 221.68 (Arial) design units, respectively. I deliberately tested with sans serif fonts, as the serifs on a serif font would require estimating the width somewhere halfway up the glyph.
I exported the same document using InDesign CC 2014 and got the exact same values. I have no further ideas on how to find out where InDesign gets these values from.
(Later addition:) Minion Pro is a CFF flavour OpenType font and so it may contain a valid StdVW value. After testing, I found it does: 79 StdVW. Quite noteworthy: InDesign does not use this value but exports it as /StemV 80 instead. The value for Minion Pro Bold, 128, is correct but, at this point, I am positive this could be pure coincidence. With these two already different, I did not have further incentive to check either Minion Pro Semibold or Minion Black.
TL;DR summary:
If you are embedding a Type 1 (CFF) font, you could fill in whatever you want, and the actual value will be read from the font data
... except when it's not in there.
If you are embedding a TrueType font, you need to supply a good value.
The least worst solution seems to be to read usWeightClass out of the OS/2 header and map this directly to a reasonable value.
This is what PDFLib actually uses:
(from: https://fossies.org/dox/PDFlib-Lite-7.0.5p3/ft__font_8c_source.html)
#define FNT_STEMV_WEIGHT 65.0
#define FNT_STEMV_MIN 50
int fnt_weight2stemv(int weight)
{
double w = weight / FNT_STEMV_WEIGHT;
return (int) (FNT_STEMV_MIN + w * w + 0.5);
}
Presumably, the 'weight' argument passed in is the usWeightClass value from the 'OS/2' table.
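A minimal Python sketch of this "least worst" approach, assuming the PDFlib formula quoted above and a single-font TTF file with a well-formed 'OS/2' table. The helper names weight_to_stemv and read_us_weight_class are mine, not part of any library. The table directory layout is the one from the TrueType/OpenType specification. There is no checksum validation and no TrueType Collection support.

```python
import struct

FNT_STEMV_WEIGHT = 65.0  # constants copied from the PDFlib snippet above
FNT_STEMV_MIN = 50


def weight_to_stemv(weight: int) -> int:
    """Port of fnt_weight2stemv(): quadratic map from weight class to StemV."""
    w = weight / FNT_STEMV_WEIGHT
    return int(FNT_STEMV_MIN + w * w + 0.5)


def read_us_weight_class(font_bytes: bytes) -> int:
    """Return OS/2.usWeightClass from the raw bytes of a single-font sfnt file.

    Hypothetical helper for illustration only.
    """
    # Offset table: sfnt version (4 bytes), numTables (uint16), 3 more uint16s.
    (num_tables,) = struct.unpack_from(">H", font_bytes, 4)
    for i in range(num_tables):
        # Table record: tag (4 bytes), then checksum, offset, length (uint32 each).
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sLLL", font_bytes, 12 + 16 * i
        )
        if tag == b"OS/2":
            # OS/2 starts with version (uint16), xAvgCharWidth (int16),
            # usWeightClass (uint16), so the weight sits at byte offset 4.
            (weight,) = struct.unpack_from(">H", font_bytes, offset + 4)
            return weight
    raise ValueError("font has no 'OS/2' table")
```

With this formula, a usWeightClass of 400 maps to a StemV of 88 and 700 maps to 166. Usage would be along the lines of weight_to_stemv(read_us_weight_class(open(path, "rb").read())).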
Related
I am trying to find the x-height of a font using PDFBox.
Here font is of type PDFont:
println(font.name + ": " + font.fontDescriptor.xHeight)
The output for a font size of 16pt is:
TimesNewRomanPS-BoldMT: 546.0
But I cannot work out how to convert this 546.0 into points, pixels, or mm.
When you shared the PDF you took your information from, the cause became clear: The information in the font at hand simply is broken.
Details
As an example you refer to CourierNew in your example file font-list-1.pdf.
This font is used on page 2, the associated FontDescriptor is this object:
44 0 obj
<<
/StemV 42
/FontName/CourierNewPSMT
/FontStretch/Normal
/FontWeight 400
/Flags 34
/Descent -300
/FontBBox[-21 -680 638 1021]
/Ascent 832
/FontFamily(Courier New)
/CapHeight 578
/XHeight -578
/Type/FontDescriptor
/ItalicAngle 0
>>
endobj
So the font's XHeight value is -578. Which means it is rubbish in multiple ways:
It is negative. According to the specification, the XHeight value is the vertical coordinate of the top of flat nonascending lowercase letters (like the letter x), measured from the baseline (ISO 32000-1, Table 122 – Entries common to all font descriptors). A negative value, therefore, means that all those flat nonascending lowercase letters are drawn entirely below the baseline.
This obviously is nonsense for a fairly normal font like CourierNew.
When loading the font descriptor, PDFBox executes a sanity check and takes the absolute value here which is why you have not seen the negative sign.
The absolute value of XHeight equals the CapHeight value which is specified as the vertical coordinate of the top of flat capital letters, measured from the baseline (ibidem).
Ignoring the negative XHeight sign (which is nonsense, see above), therefore, the font claims that flat nonascending lowercase letters and flat capital letters reach up to the same top coordinate.
This obviously is nonsense for CourierNew.
(The XHeight values of many other fonts in your sample file are similarly broken.)
How else to get a sensible x height value
If you really need an x-height value for your fonts, you should inspect the drawing instructions for the flat nonascending lowercase letters in them and derive an x-height value from their respective heights.
(This won't always succeed because those fonts may be available as embedded subsets only, and such subsets might not contain any flat nonascending lowercase letters.)
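On the original unit question: font descriptor metrics such as XHeight are expressed in glyph space, which for everything except Type 3 fonts is 1/1000 of the font size. Here is a small Python sketch of the conversion, assuming the descriptor value can be trusted (which, as shown above, is not always the case). The function names are illustrative only.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def glyph_units_to_points(value: float, font_size_pt: float) -> float:
    """Scale a font descriptor metric (1000 units per em) to points."""
    return value / 1000.0 * font_size_pt


def points_to_mm(points: float) -> float:
    """Convert PDF points (1/72 inch) to millimetres."""
    return points * MM_PER_INCH / POINTS_PER_INCH


def points_to_pixels(points: float, dpi: float) -> float:
    """Convert PDF points to device pixels at the given resolution."""
    return points * dpi / POINTS_PER_INCH
```

For the 546.0 reported for TimesNewRomanPS-BoldMT at 16pt, glyph_units_to_points(546.0, 16) gives 8.736 points, which is roughly 3.08 mm.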
I am using PDFBox 2.0.9.
I have a PDF that contains only an AcroForm, and I want to set a non-breaking space (nbspace) character as the value of a field:
field.setValue("\u00A0");
But I get error:
java.lang.IllegalArgumentException: U+00A0 ('nbspace') is not available in this font Courier encoding: WinAnsiEncoding
I understand that the font of the current field does not support this character.
How can I get the list of fonts available in my PDF with PDFBox 2.0.14?
This topic might be related: How to print `Non-breaking space` to a pdf using apache pdf box?
The text fields in your PDF use the font Helv.
The AcroForm resources font Helv is defined with the following encoding:
5 0 obj
<<
/Type/Encoding
/Differences[
24/breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
39/quotesingle
96/grave
128/bullet/dagger/daggerdbl/ellipsis/emdash/endash/florin/fraction
/guilsinglleft/guilsinglright/minus/perthousand/quotedblbase/quotedblleft
/quotedblright/quoteleft/quoteright/quotesinglbase/trademark/fi/fl/Lslash
/OE/Scaron/Ydieresis/Zcaron/dotlessi/lslash/oe/scaron/zcaron
160/Euro
164/currency
166/brokenbar
168/dieresis/copyright/ordfeminine
172/logicalnot/.notdef/registered/macron/degree/plusminus/twosuperior
/threesuperior/acute/mu
183/periodcentered/cedilla/onesuperior/ordmasculine
188/onequarter/onehalf/threequarters
192/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla
/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex
/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis
/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn
/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae
/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute
/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde
/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute
/thorn/ydieresis
]
>>
endobj
As there is no font program embedded for this font, this encoding is based on the StandardEncoding. This base encoding does not contain a non-breaking space. Furthermore your Differences array does not add nbspace either.
Thus, you cannot draw a non-breaking space using that encoding and, therefore, also not using that Helv font.
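To check a Differences array like the one above mechanically, here is a rough Python sketch. It assumes the body of the array has already been extracted from the PDF as plain text. expand_differences is a made-up helper, not a PDFBox API. It applies the rule that each number sets the current code and each following name takes the next consecutive code.

```python
import re


def expand_differences(differences: str) -> dict:
    """Expand the text of a /Differences array into {code: glyph_name}."""
    mapping = {}
    code = None
    # Tokens are either integers or /names; names swallow any digits they contain.
    for number, name in re.findall(r"(\d+)|/([^\s/\[\]]+)", differences):
        if number:
            code = int(number)      # a number restarts the running code
        elif code is not None:
            mapping[code] = name    # each name is assigned the current code...
            code += 1               # ...which then advances by one
    return mapping


def has_glyph(differences: str, glyph_name: str) -> bool:
    """True if the Differences array assigns some code to glyph_name."""
    return glyph_name in expand_differences(differences).values()
```

Applied to the array above, code 160 expands to Euro and has_glyph(..., "nbspace") is false, which matches the exception PDFBox reports.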
As far as I know, PDFBox does not supply replacement fonts in such a case, i.e. if asked to create a new text field appearance while setting a value which contains a character not supported in the form field default appearance font encoding.
One work-around might be to not ask PDFBox to generate an appearance to start with, instead mark the AcroForm with a NeedAppearances value true, and hope a later PDF processor / viewer does use a replacement font in such a case. There is no guarantee this works, probably the next processor needing appearances also doesn't supply replacement fonts. Nonetheless, there at least is a chance it does...
Depending on the exact version of PDFBox, though,
field.setValue(value);
may always trigger appearance generation. If that is the case for you, you have to set the field value like this
field.getCOSObject().setString(COSName.V, value);
I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (the list below is not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
Many software applications convert fonts to so-called 'outlines'. The reasons for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name                       type         encoding       emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10             Type 1       Builtin        yes yes no      96  0
VPAFLY+CMBX12              Type 1       Builtin        yes yes no      97  0
CWAIXW+CMTI12              Type 1       Builtin        yes yes no      98  0
OBMDLT+CMR12               Type 1       Builtin        yes yes no      99  0
In this case, text extraction (and your method of grepping for strings) should work:
Even though the column named uni (telling if a toUnicode map is embedded in the PDF file) says no for every single font, the encoding column does not contain custom, but builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1).
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!
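As a first triage step, the pdffonts check described above can be scripted. The following Python sketch assumes the output layout shown in the example (two header lines, font names without spaces) and encodes only the rule of thumb from this answer: a font with a Custom encoding and no toUnicode map is a likely cause of failed extraction. Both function names are made up for illustration.

```python
def parse_pdffonts(output: str) -> list:
    """Parse pdffonts output into a list of dicts, one per font."""
    fonts = []
    for line in output.splitlines()[2:]:  # skip the header and the dashed rule
        tokens = line.split()
        if len(tokens) < 8:
            continue  # not a font row
        fonts.append({
            "name": tokens[0],
            # The type may contain spaces ("Type 1"), so count from the right:
            # the last two tokens are the object ID, before that uni, sub, emb.
            "type": " ".join(tokens[1:-6]),
            "encoding": tokens[-6],
            "emb": tokens[-5],
            "sub": tokens[-4],
            "uni": tokens[-3],
        })
    return fonts


def extraction_looks_plausible(font: dict) -> bool:
    """Heuristic only: a toUnicode map or a non-custom encoding is present."""
    return font["uni"] == "yes" or font["encoding"].lower() != "custom"
```

Any font for which extraction_looks_plausible returns False is worth a closer look. A True result is no guarantee, since the other causes listed above (scanned images, outlined text) do not show up in pdffonts at all.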
Is there any way to use special characters like 'rcaron'(U+0159, ř) in TJ operator in base14 fonts (Helvetica)?
Something like [(\rcaron)] TJ ?
Is it present in the font?
I went through Helvetica.afm and it seems that this character is present in the font. Also when I use this character in an interactive textfield in PDF it seems to be present.
I tried pdfbox to generate a sample file, but it fails - it uses TJ and the character is not correct.
Thanks a lot.
Concerning the character set PDF viewers must support for un-embedded base14 fonts, the PDF specification ISO 32000-1 states in section 9.6.2.2:
The character sets and encodings for these fonts are listed in Annex D.
and in annex D.1:
D.2, "Latin Character Set and Encodings", describes the entire character set for the Adobe standard Latin-text fonts. This character set shall be supported by the Times, Helvetica, and Courier font families, which are among the standard 14 predefined fonts; see 9.6.2.2, "Standard Type 1 Fonts (Standard 14 Fonts)".
If you inspect the tables in D.2, you'll see that rcaron is not explicitly supported, only scaron, zcaron, and a naked caron. The latter indicates that you can construct an rcaron. Unfortunately, though, the table states that the naked caron is not available in WinAnsiEncoding, which is the standard encoding assumed in PDFBox.
Thus, to draw the unembedded base14 Helvetica rcaron you essentially will have to use a Helvetica font object with a non-WinAnsiEncoding encoding, e.g. MacRomanEncoding.
Furthermore you have to adapt the encoding of the strings added to your content streams. If you e.g. used to use PDPageContentStream.drawString(String), you'll have to change that because that method uses the COSString(String) constructor which implicitly assumes other encodings ("ISO-8859-1" or "UTF-16BE") not appropriate for the task at hand.
While rendering a PDF file generated by PDFCreator 0.9.x, I noticed that it contains an error in the character mapping. Now, an error in a PDF file is nothing to wonder about: Acrobat does wonders in rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not fully adhere to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) embeds a font program with a single glyph. So far so good. The glyph is referenced by character code 1. The Encoding of the font contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now, the cmap in the embedded subset font has no glyph at character 65 (i.e. capital A); instead, the cmap section of the font defines the character in exactly the order of the Font -> Encoding -> Differences array in the PDF file.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: is this correct (or did I make a mistake, and is the PDF correct), and what would you do to detect this situation in order to solve the rendering problem?
Note: I do need to be able to render these PDFs.
Solution
ISO 32000 contains a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) an Encoding entry is not allowed and should be ignored, so that a simple one-to-one encoding is always used. So, all in all, if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
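The check described in this solution can be sketched in a few lines of Python. The names is_symbolic and encoding_to_apply are hypothetical. The only facts relied upon are that font descriptor flag bits are numbered from 1 at the low-order end, so bit 3 (Symbolic) has the value 4 and bit 6 (Nonsymbolic) has the value 32.

```python
SYMBOLIC = 1 << 2      # flag bit 3 in the font descriptor /Flags value
NONSYMBOLIC = 1 << 5   # flag bit 6


def is_symbolic(flags: int) -> bool:
    """True if the Symbolic bit is set in the /Flags value."""
    return bool(flags & SYMBOLIC)


def encoding_to_apply(flags: int, encoding):
    """Return the Encoding to honour, or None when it should be ignored.

    For a symbolic font the renderer would then feed the character code
    straight into the font program's own cmap.
    """
    return None if is_symbolic(flags) else encoding
```

For comparison, the /Flags 34 seen in the CourierNew descriptor earlier on this page is 32 + 2, so is_symbolic(34) is false and its Encoding would be honoured.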
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact, it opens and renders correctly in a wide range of PDF consumers, so it is indeed correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.