PDF special char in TJ operator, base 14 fonts - pdfbox

Is there any way to use special characters like 'rcaron'(U+0159, ř) in TJ operator in base14 fonts (Helvetica)?
Something like [(\rcaron)] TJ ?
Is it present in the font?
I went through Helvetica.afm and it seems that this character is present in the font. Also when I use this character in an interactive textfield in PDF it seems to be present.
I tried pdfbox to generate a sample file, but it fails - it uses TJ and the character is not correct.
Thanks a lot.

Concerning the character set PDF viewers must support for un-embedded base14 fonts, the PDF specification ISO 32000-1 states in section 9.6.2.2:
The character sets and encodings for these fonts are listed in Annex D.
and in annex D.1:
D.2, "Latin Character Set and Encodings", describes the entire character set for the Adobe standard Latin-text fonts. This character set shall be supported by the Times, Helvetica, and Courier font families, which are among the standard 14 predefined fonts; see 9.6.2.2, "Standard Type 1 Fonts (Standard 14 Fonts)".
If you inspect the tables in D.2, you'll see that rcaron is not explicitly supported, only scaron, zcaron, and a naked caron. The latter indicates that you can construct a rcaron. Unfortunately, though, the table states that the naked caron is not available in WinAnsiEncoding which is the standard encoding assumed in PDFBox.
Thus, to draw the unembedded base14 Helvetica rcaron you essentially will have to use a Helvetica font object with a non-WinAnsiEncoding encoding, e.g. MacRomanEncoding.
Furthermore you have to adapt the encoding of the strings added to your content streams. If you e.g. used to use PDPageContentStream.drawString(String), you'll have to change that because that method uses the COSString(String) constructor which implicitly assumes other encodings ("ISO-8859-1" or "UTF-16BE") not appropriate for the task at hand.

Related

Writing Unicode into PDF

I have Unicode text (a sequence of Unicode codes) and a TTF font (bytes of a TTF file). I would like to write that text into a PDF file using that font.
I understand PDF quite well. I don't mind using two bytes per character. I would like to attach the TTF file as it is (charcode-to-glyf map should be used from a TTF file).
What font Subtype and Encoding value should I use? Is it possible to avoid having ToUnicode record?
I tried to use Subtype = "/TrueType", but it requires to specify FirstChar, LastChar and Widths (which are already inside TTF).
You cannot use Unicode with a Font, at all (except in the limited case of Latin, or nearly Latin, languages), because Fonts use an Encoding, and an Encoding is a single byte array. So you can't reference more than 256 characters from a Font, and a character code can't be more than a single byte.
The first problem with 'using Unicode' is that Unicode is not a simple 2-byte Encoding, its a multi-byte format, with variable lengths and sometimes a single glyph is represented by multiple Unicode code points.
So, in order to deal with this you need to use a CIDFont, not a Font. You cannot 'use the charcode-to-glyf map', by which I assume you mean the CMAP subtable in the TTF font. You must compose the CIDFont with a CMap in order to map the multiple bytes in the text string into the character codes for lookup in the CMap, which gives you the CID to reference the precise character program in the font.
It may be possible to construct a single CMap which would cover every Unicode code point, but I have my doubts, it would certainly be a huge task. However certain CMaps already exist. Adobe publish a standard list on their web site which includes CMaps such as UniCNS-UCS2-H and UniCNS-UCS2-V or UniGB-UTF8-H etc.
You can probably use one of the standard CMaps.
Note that it doesn't matter that the FirstChar, LastChar etc are already stored in the TrueType font, you still need to specify them in the PDF Font object. That's because a PDF consumer might not be rendering the text at all, it could (for example) be extracting the text, in which case it doesn't need to interpret the font provided this information is available.

Characters not displayed when viewing PDF in Gmail

I am having issues with previewing PDF in Gmail. It doesn't recognize some of the international characters that I am using (it doesn't show letters like ą ć ś, but it shows for example ł). I am encoding the pdf with Cp1250.
Any ideas on whats going on?
It looks like you are using the Standard 14 Fonts and don't embed them into your PDF. PDF readers are required to bring along these fonts but only with a limited character set which does not include ą, ć, or ś but which does include ł which matches your observation
(it doesn't show letters like ą ć ś, but it shows for example ł)
For details on these fonts confer the PDF specification
9.6.2.2 Standard Type 1 Fonts (Standard 14 Fonts)
The PostScript names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique
These fonts, or their font metrics and suitable substitution fonts, shall be available to the conforming reader.
NOTE The character sets and encodings for these fonts are listed in Annex D. The font metrics files for the standard 14 fonts are available from the ASN Web site (see the Bibliography). For more information on font metrics, see Adobe Technical Note #5004, Adobe Font Metrics File Format Specification.
In Annex D you'll find ł but not ą, ć, or ś.

Programmatic extraction of Unicode character values from True type font file in C/C++

I am trying to extract the UTF-8 character value from an embedded true type font file contained in a PDF. Is anyone aware of a method of doing this? The values in the PDF might be something like '2%dd! w!|<~' and this would end up as 'Hello World' in the PDF represented by the corresponding glyphs from the TTF. I'd like to be able to extract the wchar values here. Is this possible? Does the UTF-8 value for each character exist in the TTF?
Glyph ID's do not always correspond to Unicode character values - especially with non latin scripts that use a lot of ligatures and variant glyph forms where there is not a one-to-one correspondance between glyphs and characters.
Only Tagged PDF files store the Unicode text - otherwise you may have to reconstruct the characters from the glyph names in the fonts. This is possible if the fonts used have glyphs named according to Adobe's Glyph Naming Convention or Adobe Glyph List Specification - but many fonts, including the standard Windows fonts, don't follow this naming convention.
UTF-8 is an encoding that allows UTF8 encoded streams to be decoded to reveal a sequence of unicode char points. In any case, PDF does not encode using UTF-8. For true type text, each glyph is encoded using 8 bits.
To decode:
Read the differences array and encoding from the font definition
Read 8 bits at a time and produce an "AdobeGlyphId" using the encoding and differences array read in step 1.
Use the adobe glyph id to look up the unicode value
This is detailed in section 9.10 of the PDF Specification

Adobe Font Metrics for Standard PDF Fonts in CP1252

I need the metrics for the 14 standard PDF fonts.
I've download the following from Adobe, but it appears to use ISO-8859-1 encoding, rather than CP1252:
https://partners.adobe.com/public/developer/en/pdf/Core14_AFMs.zip
So it's missing code points 127 to 142 (for example, the ellipsis character).
Where can I download CP1252 versions of these Type1 fonts? Thanks.
The 'Core 14' PDF fonts don't know of 'CP1252' encoding (nor of 'ISO-8859-1').
They use their own encodings and encoding names, called: StandardEncoding, MacRomanEncoding, WinAnsiEncoding and PDFDocEncoding (where the WinAnsiEncoding largely maps to CP1252).
The font metric files you linked to are all for the Extended Roman character set (except the two symbol fonts Symbol and ZapfDingbats, which use a 'Special' character set) and the AdobeStandardEncoding encoding scheme (again except the two fonts mentioned before, which use a font specific scheme each).
The metrics for the ellipsis character is NOT MISSING, but it IS contained in 12 of these 14 AFM files (again, the two symbol fonts don't contain this glyph, and therefor also don't list its metrics).
To learn more about the encodings and character sets used by the 14 core PDF fonts, please refer to Annex D (normative), titled 'Character Sets and Encodings', of the PDF-1.7 specification.

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x. I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about, Acrobat does wonders in rendering faulty PDF files hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I trief to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (See stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The char is referenced by char #1. Given the Encoding of the font it contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now in the embedded subset font there is a cmap that matches no glyph at character 65 (eg capital A) the cmap section of the font does define the character in exactly the order in the PDF file Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs..
Solution
In the ISO32000 file there is a remark that symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the encoding is not allowed and you should IGNORE it, using a simple 1on1 encoding always. SO all in all, if it is a symbolic font, I ignore the Encoding object altogether and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so its almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.