I need the metrics for the 14 standard PDF fonts.
I've downloaded the following from Adobe, but it appears to use ISO-8859-1 encoding rather than CP1252:
https://partners.adobe.com/public/developer/en/pdf/Core14_AFMs.zip
So it's missing code points 127 to 142 (for example, the ellipsis character).
Where can I download CP1252 versions of these Type1 fonts? Thanks.
The 'Core 14' PDF fonts don't know of 'CP1252' encoding (nor of 'ISO-8859-1').
They use their own encodings and encoding names, called: StandardEncoding, MacRomanEncoding, WinAnsiEncoding and PDFDocEncoding (where the WinAnsiEncoding largely maps to CP1252).
The font metric files you linked to are all for the Extended Roman character set (except the two symbol fonts Symbol and ZapfDingbats, which use a 'Special' character set) and the AdobeStandardEncoding encoding scheme (again except the two fonts mentioned before, which use a font specific scheme each).
The metrics for the ellipsis character are not missing: they are contained in 12 of these 14 AFM files (again, the two symbol fonts don't contain this glyph and therefore don't list its metrics).
To learn more about the encodings and character sets used by the 14 core PDF fonts, please refer to Annex D (normative), titled 'Character Sets and Encodings', of the PDF-1.7 specification.
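Since the AFM files key each metric by glyph name, a CP1252-indexed width table can be derived from them rather than downloaded. Below is a minimal Python sketch of that idea. The sample lines follow the AFM CharMetrics layout, but the numeric values are illustrative and should be checked against the real Helvetica.afm. GLYPH_NAMES is a hypothetical three-entry table standing in for a full character-to-glyph-name list, and both function names are made up for this example.

```python
# Illustrative lines in AFM CharMetrics layout (values are assumptions;
# verify them against the actual Helvetica.afm from the Core14 archive).
SAMPLE_AFM = """\
StartCharMetrics 3
C 32 ; WX 278 ; N space ; B 0 0 0 0 ;
C 65 ; WX 667 ; N A ; B 14 0 654 718 ;
C 188 ; WX 1000 ; N ellipsis ; B 115 0 885 106 ;
EndCharMetrics
"""

# Hypothetical subset of a character -> glyph name table.
GLYPH_NAMES = {" ": "space", "A": "A", "\u2026": "ellipsis"}


def parse_char_metrics(afm_text):
    """Return {glyph_name: (code_in_font_encoding, width)} from AFM text."""
    metrics = {}
    in_metrics = False
    for line in afm_text.splitlines():
        if line.startswith("StartCharMetrics"):
            in_metrics = True
            continue
        if line.startswith("EndCharMetrics"):
            break
        if not in_metrics:
            continue
        # Each entry is a ';'-separated list of "KEY value..." fields.
        fields = {}
        for part in line.split(";"):
            tokens = part.split()
            if tokens:
                fields[tokens[0]] = tokens[1:]
        if "N" in fields and "WX" in fields and "C" in fields:
            name = fields["N"][0]
            metrics[name] = (int(fields["C"][0]), float(fields["WX"][0]))
    return metrics


def width_for_cp1252_byte(byte, metrics):
    """Look up a glyph width by CP1252 byte value, going through the glyph name."""
    char = bytes([byte]).decode("cp1252")
    return metrics[GLYPH_NAMES[char]][1]


metrics = parse_char_metrics(SAMPLE_AFM)
ellipsis_width = width_for_cp1252_byte(0x85, metrics)  # 0x85 is the ellipsis in CP1252
```

The code stored in the AFM entry (the C field) is the position in the font's own encoding scheme, which is why the lookup goes through the glyph name instead of the code.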
Related
The PDF 1.7 reference mentions that there are 14 fonts that do not require embedding:
PDF prescribes a set of 14 standard fonts that can be used without
prior definition. These include four faces each of three Latin text
typefaces (Courier, Helvetica*, and Times*), as well as two symbolic
fonts (Symbol and ITC Zapf Dingbats ® ). These fonts, or suitable
substitute fonts with the same metrics, are required to be available
in all PDF consumer applications
The same reference document additionally enumerates different "font types" defined in PDF (/Subtype being /Type0,/Type1,/Type3,/CIDFontType0,/CIDFontType2,/MMType1,/TrueType).
The problem, and hence the reason for this question, is that the font type has implications for the way text string data is mapped to the corresponding glyphs of the font. The reference document broadly distinguishes "simple" and "composite" fonts. Only "composite" fonts are described as being able to have a multi-byte character encoding. The "simple" font types are essentially single-byte encoded: 1 byte = 1 glyph.
It would therefore be really interesting to know whether the 14 base/standard fonts are supposed to be simple fonts, or whether they can be used as CID-keyed fonts.
Or plainly: what "font type" are those 14 standard fonts?
The already linked reference lists the following font types (type, /Subtype value, description):
Type 0, /Subtype Type0: (PDF 1.2) A composite font, that is, a font composed of glyphs from a descendant CIDFont (see Section 5.6, “Composite Fonts”)
Type 1, /Subtype Type1: A font that defines glyph shapes using Type 1 font technology (see Section 5.5.1, “Type 1 Fonts”)
Type 1, /Subtype MMType1: A multiple master font, an extension of the Type 1 font that allows the generation of a wide variety of typeface styles from a single font (see “Multiple Master Fonts” on page 416)
Type 3, /Subtype Type3: A font that defines glyphs with streams of PDF graphics operators (see Section 5.5.4, “Type 3 Fonts”)
TrueType, /Subtype TrueType: A font based on the TrueType font format (see Section 5.5.2, “TrueType Fonts”)
CIDFont, /Subtype CIDFontType0: (PDF 1.2) A CIDFont whose glyph descriptions are based on Type 1 font technology (see Section 5.6.3, “CIDFonts”)
CIDFont, /Subtype CIDFontType2: (PDF 1.2) A CIDFont whose glyph descriptions are based on TrueType font technology (see Section 5.6.3, “CIDFonts”)
The standard 14 PDF fonts are Type1 fonts. The AFM files needed to get the necessary meta information like glyph width can be obtained freely from Adobe. As for the encoding: Most applications use MacRomanEncoding or WinAnsiEncoding.
I am having issues with previewing PDF in Gmail. It doesn't recognize some of the international characters that I am using (it doesn't show letters like ą ć ś, but it shows for example ł). I am encoding the pdf with Cp1250.
Any ideas on what's going on?
It looks like you are using the Standard 14 Fonts without embedding them in your PDF. PDF readers are required to provide these fonts, but only with a limited character set that does not include ą, ć, or ś but does include ł, which matches your observation:
(it doesn't show letters like ą ć ś, but it shows for example ł)
For details on these fonts, see the PDF specification:
9.6.2.2 Standard Type 1 Fonts (Standard 14 Fonts)
The PostScript names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique
These fonts, or their font metrics and suitable substitution fonts, shall be available to the conforming reader.
NOTE The character sets and encodings for these fonts are listed in Annex D. The font metrics files for the standard 14 fonts are available from the ASN Web site (see the Bibliography). For more information on font metrics, see Adobe Technical Note #5004, Adobe Font Metrics File Format Specification.
In Annex D you'll find ł but not ą, ć, or ś.
Is there any way to use special characters like 'rcaron'(U+0159, ř) in TJ operator in base14 fonts (Helvetica)?
Something like [(\rcaron)] TJ ?
Is it present in the font?
I went through Helvetica.afm and it seems that this character is present in the font. Also when I use this character in an interactive textfield in PDF it seems to be present.
I tried pdfbox to generate a sample file, but it fails - it uses TJ and the character is not correct.
Thanks a lot.
Concerning the character set PDF viewers must support for un-embedded base14 fonts, the PDF specification ISO 32000-1 states in section 9.6.2.2:
The character sets and encodings for these fonts are listed in Annex D.
and in annex D.1:
D.2, "Latin Character Set and Encodings", describes the entire character set for the Adobe standard Latin-text fonts. This character set shall be supported by the Times, Helvetica, and Courier font families, which are among the standard 14 predefined fonts; see 9.6.2.2, "Standard Type 1 Fonts (Standard 14 Fonts)".
If you inspect the tables in D.2, you'll see that rcaron is not explicitly supported, only scaron, zcaron, and a naked caron. The latter means that you can construct an rcaron by combining r with the caron. Unfortunately, though, the table states that the naked caron is not available in WinAnsiEncoding, which is the standard encoding assumed in PDFBox.
Thus, to draw the unembedded base14 Helvetica rcaron you essentially will have to use a Helvetica font object with a non-WinAnsiEncoding encoding, e.g. MacRomanEncoding.
Furthermore you have to adapt the encoding of the strings added to your content streams. If you e.g. used to use PDPageContentStream.drawString(String), you'll have to change that because that method uses the COSString(String) constructor which implicitly assumes other encodings ("ISO-8859-1" or "UTF-16BE") not appropriate for the task at hand.
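The composition idea above can be sketched as follows. This is a rough Python illustration, not PDFBox code, and it rests on several assumptions: the font object uses MacRomanEncoding, the caron sits at octal 377 in that encoding (as listed in Annex D), and r has an advance width of 333 in Helvetica (taken from the AFM file, to be verified). The function name is hypothetical. In a TJ array a positive number moves the next glyph back by that many thousandths of a text space unit, so the accent is drawn starting at the origin of the base glyph.

```python
def overlay_accent_tj(base_char, accent_code, base_width):
    """Build a TJ operation that shows base_char, backs up over it, then shows the accent.

    base_char   -- a single ASCII character other than '(', ')' or '\\'
    accent_code -- character code of the accent in the font's encoding
    base_width  -- advance width of base_char in 1/1000 text space units
    """
    # The accent is written as a three-digit octal escape inside a literal string.
    return "[({}) {} (\\{:03o})] TJ".format(base_char, base_width, accent_code)


# Assumed values: caron at code 0xFF (octal 377) in MacRomanEncoding,
# Helvetica 'r' advance width 333.
rcaron_op = overlay_accent_tj("r", 0xFF, 333)
print(rcaron_op)
```

The horizontal offset will likely need fine-tuning per base glyph, since the accent's own side bearings determine where it lands.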
I am trying to extract the UTF-8 character value from an embedded true type font file contained in a PDF. Is anyone aware of a method of doing this? The values in the PDF might be something like '2%dd! w!|<~' and this would end up as 'Hello World' in the PDF represented by the corresponding glyphs from the TTF. I'd like to be able to extract the wchar values here. Is this possible? Does the UTF-8 value for each character exist in the TTF?
Glyph IDs do not always correspond to Unicode character values, especially with non-Latin scripts that use a lot of ligatures and variant glyph forms, where there is no one-to-one correspondence between glyphs and characters.
Only Tagged PDF files store the Unicode text - otherwise you may have to reconstruct the characters from the glyph names in the fonts. This is possible if the fonts used have glyphs named according to Adobe's Glyph Naming Convention or Adobe Glyph List Specification - but many fonts, including the standard Windows fonts, don't follow this naming convention.
UTF-8 is an encoding that represents a sequence of Unicode code points as bytes. In any case, PDF does not encode text using UTF-8. For text in a simple TrueType font, each glyph is selected by a single 8-bit character code.
To decode:
1. Read the Differences array and encoding from the font definition.
2. Read 8 bits at a time and produce an "AdobeGlyphId" using the encoding and Differences array read in step 1.
3. Use the Adobe glyph ID to look up the Unicode value.
This is detailed in section 9.10 of the PDF specification.
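The steps above can be sketched roughly as below, treating the "AdobeGlyphId" as a glyph name. This is only an illustration under stated assumptions: BASE_ENCODING and GLYPH_TO_UNICODE are hypothetical, tiny stand-ins for a real base encoding table and the Adobe Glyph List, and the Differences array is assumed to be already parsed into a Python list of integers and names.

```python
# Hypothetical, heavily truncated tables for illustration only.
BASE_ENCODING = {65: "A", 72: "H", 101: "e", 108: "l", 111: "o"}
GLYPH_TO_UNICODE = {
    "A": "A", "H": "H", "e": "e", "l": "l", "o": "o",
    "ellipsis": "\u2026",
}


def apply_differences(base, differences):
    """Step 1: overlay a Differences array on a base encoding.

    A Differences array is a flat list: an integer sets the current code,
    and each following name is assigned to consecutive codes.
    """
    table = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            table[code] = item
            code += 1
    return table


def decode(data, table):
    """Steps 2 and 3: map each byte to a glyph name, then to Unicode."""
    chars = []
    for byte in data:
        name = table.get(byte, ".notdef")
        chars.append(GLYPH_TO_UNICODE.get(name, "\ufffd"))
    return "".join(chars)


table = apply_differences(BASE_ENCODING, [1, "H", "e", "l", "o", 10, "ellipsis"])
text = decode(bytes([1, 2, 3, 3, 4, 10]), table)
```

Codes with no known glyph name come out as U+FFFD here, which makes gaps in the lookup tables visible instead of silently dropping characters.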
While rendering a PDF file generated by PDFCreator 0.9.x, I noticed that it contains an error in the character mapping. An error in a PDF file is nothing unusual: Acrobat does wonders in rendering faulty PDF files, so a lot of PDF generators create PDFs that do not fully adhere to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) contains an embedded font program with a single glyph. So far so good. The glyph is referenced by character code 1. The font's Encoding contains a Differences array: [ 1 /A ]. Thus character code 1 maps to the glyph name /A. However, the cmap in the embedded subset font has no glyph at character 65 (i.e. capital A); instead, the cmap defines its characters in exactly the order of the Differences array in the PDF file (Font -> Encoding -> Differences).
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: is this correct (or did I make a mistake and the PDF is correct), and what would you do to detect this situation in order to solve the rendering problem?
Note: I do need to be able to render these PDFs.
Solution
ISO 32000 contains a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) an Encoding is not allowed and should be ignored, so a simple one-to-one encoding is always used. All in all, if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
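A minimal sketch of that check, assuming the font descriptor's /Flags value has already been read as an integer. The function names and the dict standing in for the parsed Encoding are hypothetical, not part of any particular PDF library.

```python
SYMBOLIC = 1 << 2  # font descriptor flag bit 3 (bit positions are counted from 1)


def is_symbolic(flags):
    """True if the Symbolic flag is set in the font descriptor's /Flags."""
    return bool(flags & SYMBOLIC)


def glyph_lookup_key(code, flags, encoding_table):
    """Decide what to look up in the embedded TrueType font.

    For symbolic fonts the Encoding is ignored and the raw character code
    is used directly; otherwise the code is first mapped to a glyph name.
    """
    if is_symbolic(flags):
        return code
    return encoding_table.get(code, ".notdef")


# With the Differences array [ 1 /A ] from the example file:
key_symbolic = glyph_lookup_key(1, 4, {1: "A"})      # Symbolic flag set
key_nonsymbolic = glyph_lookup_key(1, 32, {1: "A"})  # Nonsymbolic flag (bit 6) set
```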
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact, it opens and renders correctly in a wide range of PDF consumers, so it is indeed correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe that in your case you simply need to look up the character code (0x01) directly in the cmap subtable. This will give you a GID of 36.