I am having issues with previewing PDF in Gmail. It doesn't recognize some of the international characters that I am using (it doesn't show letters like ą ć ś, but it shows for example ł). I am encoding the pdf with Cp1250.
Any ideas on what's going on?
It looks like you are using the standard 14 fonts without embedding them in your PDF. PDF readers are required to provide these fonts, but only with a limited character set. That set does not include ą, ć, or ś, but it does include ł, which matches your observation:
(it doesn't show letters like ą ć ś, but it shows for example ł)
For details on these fonts, see the PDF specification:
9.6.2.2 Standard Type 1 Fonts (Standard 14 Fonts)
The PostScript names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique
These fonts, or their font metrics and suitable substitution fonts, shall be available to the conforming reader.
NOTE The character sets and encodings for these fonts are listed in Annex D. The font metrics files for the standard 14 fonts are available from the ASN Web site (see the Bibliography). For more information on font metrics, see Adobe Technical Note #5004, Adobe Font Metrics File Format Specification.
In Annex D you'll find ł but not ą, ć, or ś.
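To make that observation concrete, here is a small Python sketch. The set below is only a hand-copied excerpt of glyph names from the Latin character set in Annex D (an assumption for illustration, not the full table), and the function name is hypothetical. The character-to-name mapping uses the AGL glyph names of the four letters from the question.

```python
# Hand-copied excerpt of glyph names from Annex D.2 (NOT the full table).
ANNEX_D_EXCERPT = {
    "lslash", "Lslash", "scaron", "Scaron",
    "zcaron", "Zcaron", "caron", "ogonek",
}

# AGL glyph names for the letters mentioned in the question.
GLYPH_NAMES = {"ł": "lslash", "ą": "aogonek", "ć": "cacute", "ś": "sacute"}


def guaranteed_without_embedding(ch: str) -> bool:
    """True if the glyph for ch appears in the excerpt of the standard Latin set."""
    return GLYPH_NAMES[ch] in ANNEX_D_EXCERPT


for ch in "łąćś":
    print(ch, guaranteed_without_embedding(ch))
```

Only ł passes the check, which is consistent with what Gmail's preview shows.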
Related
Adobe Glyph List (AGL) is described as
is a mapping of 4,281 glyph names to one or more Unicode characters.
As far as I understand, these are PDF names such as /Adieresis, which stand for the respective Unicode character (here U+00C4). If my understanding is correct, those 4,281 names can be used to specify a mapping, as done here for the font named /F1 in a page's /Resources dictionary:
<<
/Type /Page
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
/Encoding <<
/Differences [ 1 /Adieresis /adieresis ]
>>
>>
>>
>>
>>
The key issue, which I cannot get to wrap my head around is that via the /Differences Array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/builtin/standard set of PDF fonts, wouldn't I?
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
I am also confused that PDF has a /ToUnicode feature, which associates the glyphs/CMaps of embedded fonts with the Unicode characters those glyphs represent (so somebody evidently thought about Unicode), yet I cannot find a way to use Unicode code points or a reasonably workable encoding (e.g. UTF-8) with the built-in fonts in PDF.
So is my assumption correct that, without going to the length of generating a font to embed in the PDF file, the text can only ever draw on that set of 4,281 characters?
To support all 65,536 code points in Unicode's Basic Multilingual Plane, it would be necessary to generate a font containing the glyphs used in the text, since apart from those 4,281 AGL glyphs there seems to be no way to reference those Unicode characters, correct?
Motivation
It would be nice to have a way in PDF that would be the equivalent to HTML5's
<meta charset="utf-8">. Allowing text to be encoded in one simple compatible encoding for unicode, and not having to deal with complicated things as CID/GID/Postscript Glyph Names etc.
This answer first discusses the use of non-AGL names in differences arrays and the more encompassing encodings of composite fonts. Then it discusses which fonts a viewer actually does have to have available. Finally it considers all this in light of the clarifications accompanying your bounty offer.
AGL names and Differences arrays
First let's consider the focal point of your original question,
The key issue, which I cannot get to wrap my head around is that via the /Differences Array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/builtin/standard set of PDF fonts, wouldn't I?
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
i.e. your assumption is that only those 4,281 AGL glyph names can be used in the Differences array of the encoding entry of a simple font.
This is not the case: you can also use arbitrary names not found in the AGL. For example, with this font
7 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /Arial
/FirstChar 32
/LastChar 32
/Widths [500]
/FontDescriptor 8 0 R
/Encoding 9 0 R
>>
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Arial
/FontFamily (Arial)
/Flags 32
/FontBBox [-665.0 -325.0 2000.0 1040.0]
/ItalicAngle 0
/Ascent 1040
/Descent -325
/CapHeight 716
/StemV 88
/XHeight 519
>>
endobj
9 0 obj
<<
/Type /Encoding
/BaseEncoding /WinAnsiEncoding
/Differences [32 /uniAB55]
>>
endobj
the instruction
( ) Tj
shows you a ꭕ ('LATIN SMALL LETTER CHI WITH LOW LEFT SERIF' U+AB55 which if I saw correctly is not on the AGL) on a system with Arial (ArialMT.ttf) installed.
Thus, to display an arbitrary glyph, you merely need a font you know containing that glyph with a name you know available to the PDF viewer in question. The name doesn't have to be an AGL name, it can be arbitrary!
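As a rough sketch of how a generator could produce such a Differences array, the following Python snippet assigns consecutive character codes to a list of characters and names each glyph in the uniXXXX form used above. The function names and the choice of starting code are assumptions for illustration. Whether a viewer actually finds a glyph under that name depends on the font and the viewer.

```python
def uni_name(ch: str) -> str:
    """Glyph name in the 'uniXXXX' form, as in /uniAB55 above."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("the uniXXXX form only covers the BMP")
    return f"uni{cp:04X}"


def differences_array(chars: str, first_code: int = 32) -> str:
    """Assign consecutive codes, starting at first_code, to the given characters."""
    if first_code + len(chars) > 256:
        raise ValueError("a simple font has at most 256 character codes")
    names = " ".join("/" + uni_name(c) for c in chars)
    return f"/Differences [{first_code} {names}]"


print(differences_array("\uab55"))  # the character from the example above
```

Because a simple font has only 256 codes, a text with more distinct characters would need several font objects, each with its own Differences array.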
Encodings of composite fonts
Furthermore, with composite fonts you often aren't even required to enumerate the characters you need, as long as they are all covered by the same named encoding!
Here the Encoding shall be
The name of a predefined CMap, or a stream containing a CMap that maps character codes to font numbers and CIDs. If the descendant is a Type 2 CIDFont whose associated TrueType font program is not embedded in the PDF file, the Encoding entry shall be a predefined CMap name (see 9.7.4.2, "Glyph Selection in CIDFonts").
And among the predefined CMaps there are numerous CJK ones. As long as the viewer in question has access to a matching font, you can use a composite font with such an encoding to get access to a lot of CJK glyphs.
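For predefined CMaps whose character codes are UTF-16BE (for example UniCNS-UTF16-H), the string operand of a text-showing operator is simply the UTF-16BE encoding of the text. The Python sketch below writes it as a PDF hex string. The function name is hypothetical, and the snippet assumes that a composite font with such an encoding has already been selected with Tf.

```python
def show_text_utf16(text: str) -> str:
    """Return a Tj instruction whose hex string is the text in UTF-16BE."""
    return f"<{text.encode('utf-16-be').hex().upper()}> Tj"


print(show_text_utf16("你好"))
```

For the two characters U+4F60 and U+597D this yields `<4F60597D> Tj`. The viewer still needs a matching font installed for the glyphs to appear.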
Which fonts does a viewer have to have available?
Thus, if the viewer in question has appropriate fonts available, you don't need to embed font programs to display any glyph. But which fonts does a viewer have available?
Usually a viewer will allow access to all fonts registered with the operating system it is running on, but strictly speaking it only has to have very few fonts accessible: PDF processors supporting PDF 1.0 to PDF 1.7 files only need to know the so-called standard 14 fonts, and pure PDF 2.0 processors need to know none.
Annex D of the specification clarifies the character ranges to support:
All characters listed in D.2, "Latin character set and encodings" shall be supported for the Times, Helvetica, and Courier font families, as listed in 9.6.2.2, "Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)" by a PDF processor that supports PDF 1.0 to 1.7.
D.4, "Symbol set and encoding" and D.5, "ZapfDingbats set and encoding" describe the character sets and built-in encodings for the Symbol and ZapfDingbats (ITC Zapf Dingbats) font programs, which belong to the standard 14 predefined fonts.
D.2 essentially is a table describing the StandardEncoding, MacRomanEncoding, WinAnsiEncoding, and PDFDocEncoding. These all are very similar single byte encodings.
D.4 and D.5 contain a single table each describing additional single byte encodings.
Thus, all you can actually expect from a PDF 1.x viewer are these fewer than 1000 characters!
(You wondered about this in comments to this answer to another question of yours.)
Concerning your clarifications
In your text accompanying your bounty offer you expressed a desire for
being enabled to create a "no frills" program that is able to generate pdf files, where the input data are UTF-8 unicode strings. "No frills" being a reference to the fact that such a software would ideally be able to skip handling font program data (such as creating a subset font program for inclusion into the pdf).
As explained above, you can do so, either by customized encodings of a number of simple fonts or by the more encompassing named encodings of composite fonts. If you know that the target PDF viewer has these fonts available, that is!
sketch a way that actually would allow to have characters from at least the Adobe-GB1 charset as referenced via "UniCNS-UTF16-H" to be rendered in pdf-viewers, without the pdf file having any font program embedded to achieve this.
"UniCNS-UTF16-H" just happens to be one of the predefined encodings allowable for composite fonts. Thus, you can use a composite font with this encoding without embedding the font program as long as the viewer has the appropriate fonts accessible. As far as Adobe Reader is concerned, this usually amounts to having the Extended Asian Language Pack installed.
the limitations to use anything other than the WinAnsiEncoding, MacRomanEncoding, MacExpertEncoding with those 14 standard fonts.
As explained above, you can merely count on fewer than 1000 glyphs being available for sure in an arbitrary PDF 1.x viewer. In a pure PDF 2.0 viewer you actually cannot count on even that!
The specification quotes above are from ISO 32000-2; similar requirements can already be found in ISO 32000-1.
Without embedded fonts, is PDF limited to only 4,281 characters (of AGL)?
No. Though you should embed fonts to help ensure that the PDF looks the same everywhere.
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
It is possible, yes, though you would ideally stick with a "standard" encoding, such as one of the Orderings. See the "Predefined CMaps" in the PDF specification for these.
If you start making changes to the encoding, such as using Differences, then you are making run time font substitution for the PDF processing program much more difficult.
Regarding /ToUnicode, that is just for text extraction and has nothing to do with rendering. If you stick with a standard encoding as recommended above, it is not needed.
There is no 4,281-glyph limit inherent in PDF. I think you are a bit confused: you don't have to embed fonts in a PDF. Besides the standard 14 fonts, which all PDF viewers should be able to handle, PDF software looks for fonts installed on the system when a font is not embedded, so having no embedded fonts does not mean you lose the ability to display glyphs at all.
You would define a different encoding with the Differences array if the base encoding doesn't reflect what is in the font.
ToUnicode comes into play for text extraction vs text showing.
The PDF 1.7 reference mentions that there are 14 fonts that do not require embedding:
PDF prescribes a set of 14 standard fonts that can be used without prior definition. These include four faces each of three Latin text typefaces (Courier, Helvetica, and Times), as well as two symbolic fonts (Symbol and ITC Zapf Dingbats®). These fonts, or suitable substitute fonts with the same metrics, are required to be available in all PDF consumer applications.
The same reference document additionally enumerates the different "font types" defined in PDF (/Subtype being /Type0, /Type1, /Type3, /CIDFontType0, /CIDFontType2, /MMType1, /TrueType).
The problem, and hence the reason for this question, is that the font type has implications for the way text string data is mapped to the corresponding glyphs of the font. The reference broadly categorizes fonts as "simple" and "composite". Only composite fonts are described as being able to have a multi-byte character encoding. The simple font types are basically encoded with single bytes: 1 byte = 1 glyph.
It would therefore be really interesting to know whether the 14 base/standard fonts are supposed to be simple fonts, or whether they can be used as CID-keyed fonts.
Or, put plainly: what "font type" are those 14 standard fonts?
The already linked reference lists the following font types
Type 0, subtype Type0: (PDF 1.2) A composite font, a font composed of glyphs from a descendant CIDFont (see Section 5.6, “Composite Fonts”)
Type 1, subtype Type1: A font that defines glyph shapes using Type 1 font technology (see Section 5.5.1, “Type 1 Fonts”)
Type 1, subtype MMType1: A multiple master font, an extension of the Type 1 font that allows the generation of a wide variety of typeface styles from a single font (see “Multiple Master Fonts” on page 416)
Type 3, subtype Type3: A font that defines glyphs with streams of PDF graphics operators (see Section 5.5.4, “Type 3 Fonts”)
TrueType, subtype TrueType: A font based on the TrueType font format (see Section 5.5.2, “TrueType Fonts”)
CIDFont, subtype CIDFontType0: (PDF 1.2) A CIDFont whose glyph descriptions are based on Type 1 font technology (see Section 5.6.3, “CIDFonts”)
CIDFont, subtype CIDFontType2: (PDF 1.2) A CIDFont whose glyph descriptions are based on TrueType font technology (see Section 5.6.3, “CIDFonts”)
The standard 14 PDF fonts are Type1 fonts. The AFM files needed to get the necessary meta information like glyph width can be obtained freely from Adobe. As for the encoding: Most applications use MacRomanEncoding or WinAnsiEncoding.
I have a PostScript file which contains a Type 3 font. After converting that PostScript file to PDF using the "gs" command, I am unable to extract the text from the PDF file. Is there any way to change the Type 3 fonts to some other font, by substitution or some other means, so that I can copy the text?
This is another case of miscomprehension regarding type 3 fonts. The fact that a font is a type 3 font has little to do with whether a PostScript program or PDF file using the font is 'searchable' or not.
Fonts in PostScript and PDF have an 'Encoding' which maps the character codes 0-255 to a named procedure in the font. Executing that procedure draws the glyph. The character codes can be anything, but are often (for Latin fonts) chosen to match the ASCII encoding.
PDF has the additional concept of a ToUnicode CMap, additional information which maps a character code in a font to a Unicode code point. PostScript has no such analogue; that's not what PostScript is for (it's also not what PDF was originally for, which is why ToUnicode CMaps are a later addition to the PDF standard).
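To illustrate what such a ToUnicode CMap contains, here is a minimal Python sketch that generates one for a simple font with one-byte character codes. The function name is hypothetical and the layout follows the common boilerplate for these CMaps. It is a sketch under those assumptions, not what pdfwrite or any particular tool emits.

```python
def to_unicode_cmap(mapping: dict) -> str:
    """Build a ToUnicode CMap mapping one-byte codes to Unicode strings."""
    if len(mapping) > 100:
        # A bfchar section holds at most 100 entries; a fuller generator
        # would split the mapping into several sections.
        raise ValueError("this sketch handles at most 100 entries")
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
        f"{len(mapping)} beginbfchar",
    ]
    for code, text in sorted(mapping.items()):
        # The target side is the Unicode text in UTF-16BE, written as hex.
        target = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:02X}> <{target}>")
    lines += [
        "endbfchar",
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


print(to_unicode_cmap({0x41: "A", 0x42: "B"}))
```

The resulting text would go into a stream object referenced by the font's /ToUnicode entry.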
In the absence of a ToUnicode CMap Acrobat uses undocumented heuristics to try and guess what the text is. The obvious one (and the only one we know of) is that it treats the character codes as ASCII.
Now, if your original PostScript program has an encoding that maps the character codes as if they were ASCII, then provided you do not subset the font, the resulting PDF file should also contain ASCII character codes. If you do subset the font, then the pdfwrite device will reorder the glyphs and the character codes will no longer be ASCII.
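A conversion with font subsetting switched off might be invoked as sketched below in Python. The helper names and file names are placeholders. The snippet assumes that the Ghostscript executable is called gs and that the pdfwrite switch -dSubsetFonts=false has the effect described above for the fonts in your file.

```python
import subprocess


def pdfwrite_command(ps_path: str, pdf_path: str) -> list:
    """Ghostscript argument list for PS to PDF conversion without font subsetting."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dSubsetFonts=false",  # keep the original character codes
        "-o", pdf_path,         # -o sets the output file
        ps_path,
    ]


def convert(ps_path: str, pdf_path: str) -> None:
    """Run the conversion; raises CalledProcessError if gs fails."""
    subprocess.run(pdfwrite_command(ps_path, pdf_path), check=True)
```

Calling convert("input.ps", "output.pdf") would then run the conversion. Whether the text becomes extractable still depends on the encoding used in the original PostScript.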
If your original PostScript file does not order the glyphs in the font using ASCII character codes, then there is nothing you can do other than apply OCR; the information simply is not present.
But forget about altering the font type, not only is it not likely to be possible, it isn't the problem.
Is there any way to use special characters like 'rcaron'(U+0159, ř) in TJ operator in base14 fonts (Helvetica)?
Something like [(\rcaron)] TJ ?
Is it present in the font?
I went through Helvetica.afm and it seems that this character is present in the font. Also when I use this character in an interactive textfield in PDF it seems to be present.
I tried pdfbox to generate a sample file, but it fails - it uses TJ and the character is not correct.
Thanks a lot.
Concerning the character set PDF viewers must support for un-embedded base14 fonts, the PDF specification ISO 32000-1 states in section 9.6.2.2:
The character sets and encodings for these fonts are listed in Annex D.
and in annex D.1:
D.2, "Latin Character Set and Encodings", describes the entire character set for the Adobe standard Latin-text fonts. This character set shall be supported by the Times, Helvetica, and Courier font families, which are among the standard 14 predefined fonts; see 9.6.2.2, "Standard Type 1 Fonts (Standard 14 Fonts)".
If you inspect the tables in D.2, you'll see that rcaron is not explicitly supported, only scaron, zcaron, and a naked caron. The latter indicates that you can construct a rcaron. Unfortunately, though, the table states that the naked caron is not available in WinAnsiEncoding which is the standard encoding assumed in PDFBox.
Thus, to draw the unembedded base14 Helvetica rcaron you essentially will have to use a Helvetica font object with a non-WinAnsiEncoding encoding, e.g. MacRomanEncoding.
Furthermore you have to adapt the encoding of the strings added to your content streams. If you e.g. used to use PDPageContentStream.drawString(String), you'll have to change that because that method uses the COSString(String) constructor which implicitly assumes other encodings ("ISO-8859-1" or "UTF-16BE") not appropriate for the task at hand.
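A rough idea of the resulting content stream instruction, sketched in Python rather than PDFBox code: the base letter is shown first, the text position is moved back by the letter's width, and then the naked caron is shown at its MacRomanEncoding code. The function name is hypothetical, and the width 333 is assumed to be the value listed for r in Helvetica.afm. Accent placement obtained this way is approximate.

```python
def overlay_accent(base: str, accent: str, base_width: int) -> str:
    """Build a TJ instruction that draws accent over base (MacRomanEncoding)."""

    def literal(ch: str) -> str:
        code = ch.encode("mac_roman")[0]
        # Printable ASCII can be written directly, except the string delimiters
        # and the backslash; everything else becomes an octal escape.
        if 0x20 <= code < 0x7F and ch not in "()\\":
            return ch
        return f"\\{code:03o}"

    # A positive number in a TJ array moves the next glyph to the left
    # by that many thousandths of the font size.
    return f"[({literal(base)}) {base_width} ({literal(accent)})] TJ"


print(overlay_accent("r", "\u02c7", 333))
```

This prints `[(r) 333 (\377)] TJ`, where \377 is the octal code of the caron in MacRomanEncoding.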
I need the metrics for the 14 standard PDF fonts.
I've downloaded the following from Adobe, but it appears to use ISO-8859-1 encoding, rather than CP1252:
https://partners.adobe.com/public/developer/en/pdf/Core14_AFMs.zip
So it's missing code points 127 to 142 (for example, the ellipsis character).
Where can I download CP1252 versions of these Type1 fonts? Thanks.
The 'Core 14' PDF fonts don't know of 'CP1252' encoding (nor of 'ISO-8859-1').
They use their own encodings and encoding names, called: StandardEncoding, MacRomanEncoding, WinAnsiEncoding and PDFDocEncoding (where the WinAnsiEncoding largely maps to CP1252).
The font metric files you linked to are all for the Extended Roman character set (except the two symbol fonts Symbol and ZapfDingbats, which use a 'Special' character set) and the AdobeStandardEncoding encoding scheme (again except the two fonts mentioned before, which use a font specific scheme each).
The metrics for the ellipsis character are NOT MISSING; they ARE contained in 12 of these 14 AFM files (again, the two symbol fonts don't contain this glyph and therefore don't list its metrics).
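Since the code in an AFM line only reflects the font's built-in encoding, it is more robust to look metrics up by glyph name. The following Python sketch does that. The sample text is written in AFM syntax purely for illustration and should be checked against the real files, and the function name is an assumption.

```python
# Illustrative lines in AFM syntax; verify the values against the real file.
SAMPLE_AFM = """\
StartCharMetrics 2
C 32 ; WX 278 ; N space ; B 0 0 0 0 ;
C 188 ; WX 1000 ; N ellipsis ; B 115 0 885 106 ;
EndCharMetrics
"""


def widths_by_name(afm_text: str) -> dict:
    """Map glyph names to their WX widths from the CharMetrics section."""
    widths = {}
    in_metrics = False
    for line in afm_text.splitlines():
        if line.startswith("StartCharMetrics"):
            in_metrics = True
        elif line.startswith("EndCharMetrics"):
            in_metrics = False
        elif in_metrics:
            # Each line is a series of "KEY value ..." fields separated by ';'.
            fields = {}
            for part in line.split(";"):
                tokens = part.split()
                if tokens:
                    fields[tokens[0]] = tokens[1:]
            if "N" in fields and "WX" in fields:
                widths[fields["N"][0]] = float(fields["WX"][0])
    return widths


print(widths_by_name(SAMPLE_AFM)["ellipsis"])
```

With a name-keyed table like this, the width for any code of WinAnsiEncoding can be found by first mapping the code to its glyph name via Annex D.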
To learn more about the encodings and character sets used by the 14 core PDF fonts, please refer to Annex D (normative), titled 'Character Sets and Encodings', of the PDF-1.7 specification.