Unicode in PDF - pdf

My program generates relatively simple PDF documents on request, but I'm having trouble with unicode characters, like kanji or odd math symbols. To write a normal string in PDF, you place it in brackets:
(something)
There is also the option to escape a character with octal codes:
(\527)
but this only goes up to 512 characters. How do you encode or escape higher characters? I've seen references to byte streams and hex-encoded strings, but none of the references I've read seem to be willing to tell me how to actually do it.
Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (which I've fixed several bugs in, since the original author appears to have gone AWOL), that allows you to program against an AWT Graphics interface, and ideally any replacement should do the same.
The alternatives seem to be either HTML -> PDF, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. This would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in laying out.
Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.
Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.

In the PDF reference in chapter 3, this is what they say about Unicode:
Text strings are encoded in
either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a
superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode
is described in the Unicode Standard by the Unicode Consortium (see the Bibliography).
For text strings encoded in Unicode, the first two bytes must be 254 followed by
255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating
that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified
in the Unicode standard. (This mechanism precludes beginning a string using
PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to
be a meaningful beginning of a word or phrase).

The simple answer is that there's no simple answer. If you take a look at the PDF specification, you'll see an entire chapter — and a long one at that — devoted to the mechanisms of text display. I implemented all of the PDF support for my company, and handling text was by far the most complex part of exercise. The solution you discovered — use a 3rd party library to do the work for you — is really the best choice, unless you have very specific, special-purpose requirements for your PDF files.

Algoman's answer is wrong in many things. You can make a PDF document with Unicode in it and it's not rocket science, though it needs some work.
Yes he is right, to use more than 255 characters in one font you have to create a composite font (CIDFont) pdf object.
Then you just mention the actual TrueType font you want to use as a DescendatFont entry of CIDFont.
The trick is that after that you have to use glyph indices of a font instead of character codes. To get this indices map you have to parse cmap section of a font - get contents of the font with GetFontData function and take hands on TTF specification.
And that's it! I've just did it and now I have a Unicode PDF!
Sample Code for parsing cmap section is here: https://web.archive.org/web/20150329005245/http://support.microsoft.com/en-us/kb/241020
And yes, don't forget /ToUnicode entry as #user2373071 pointed out or user will not be able to search your PDF or copy text from it.

As dredkin pointed out, you have to use the glyph indices instead of the Unicode character value in the page content stream. This is sufficient to display Unicode text in PDF, but the Unicode text would not be searchable. To make the text searchable or have copy/paste work on it, you will also need to include a /ToUnicode stream. This stream should translate each glyph in the document to the actual Unicode character.

See Appendix D (page 995) of the PDF specification. There is a limited number of fonts and character sets pre-defined in a PDF consumer application. To display other characters you need to embed a font that contains them. It is also preferable to embed only a subset of the font, including only required characters, in order to reduce file size. I am also working on displaying Unicode characters in PDF and it is a major hassle.
Check out PDFBox or iText.
http://www.adobe.com/devnet/pdf/pdf_reference.html

I have worked several days on this subject now and what I have learned is that unicode is (as good as) impossible in pdf. Using 2-byte characters the way plinth described only works with CID-Fonts.
seemingly, CID-Fonts are a pdf-internal construct and they are not really fonts in that sense - they seem to be more like graphics-subroutines, that can be invoked by addressing them (with 16-bit addresses).
So to use unicode in pdf directly
you would have to convert normal fonts to CID-Fonts, which is probably extremely hard - you'd have to generate the graphics routines from the original font(?), extract character metrics etc.
you cannot use CID-Fonts like normal fonts - you cannot load or scale them the way you load and scale normal fonts
also, 2-byte characters don't even cover the full Unicode space
IMHO, these points make it absolutely unfeasible to use unicode directly.
What I am doing instead now is using the characters indirectly in the following way:
For every font, I generate a codepage (and a lookup-table for fast lookups) - in c++ this would be something like
std::map<std::string, std::vector<wchar_t> > Codepage;
std::map<std::string, std::map<wchar_t, int> > LookupTable;
then, whenever I want to put some unicode-string on a page, I iterate its characters, look them up in the lookup-table and - if they are new, I add them to the code-page like this:
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
if(LookupTable[fontname].find(*i) == LookupTable[fontname].end())
{
LookupTable[fontname][*i] = Codepage[fontname].size();
Codepage[fontname].push_back(*i);
}
}
then, I generate a new string, where the characters from the original string are replaced by their positions in the codepage like this:
static std::string hex = "0123456789ABCDEF";
std::string result = "<";
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
int id = LookupTable[fontname][*i] + 1;
result += hex[(id & 0x00F0) >> 4];
result += hex[(id & 0x000F)];
}
result += ">";
for example, "H€llo World!" might become <01020303040506040703080905>
and now you can just put that string into the pdf and have it printed, using the Tj operator as usual...
but you now have a problem: the pdf doesn't know that you mean "H" by a 01. To solve this problem, you also have to include the codepage in the pdf file. This is done by adding an /Encoding to the Font object and setting its Differences
For the "H€llo World!" example, this Font-Object would work:
5 0 obj
<<
/F1
<<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
/Encoding
<<
/Type /Encoding
/Differences [ 1 /H /Euro /l /o /space /W /r /d /exclam ]
>>
>>
>>
endobj
I generate it with this code:
ObjectOffsets.push_back(stream->tellp()); // xrefs entry
(*stream) << ObjectCounter++ << " 0 obj \n<<\n";
int fontid = 1;
for(std::list<std::string>::iterator i = Fonts.begin(); i != Fonts.end(); i++)
{
(*stream) << " /F" << fontid++ << " << /Type /Font /Subtype /Type1 /BaseFont /" << *i;
(*stream) << " /Encoding << /Type /Encoding /Differences [ 1 \n";
for(std::vector<wchar_t>::iterator j = Codepage[*i].begin(); j != Codepage[*i].end(); j++)
(*stream) << " /" << GlyphName(*j) << "\n";
(*stream) << " ] >>";
(*stream) << " >> \n";
}
(*stream) << ">>\n";
(*stream) << "endobj \n\n";
Notice that I use a global font-register - I use the same font names /F1, /F2,... throughout the whole pdf document. The same font-register object is referenced in the /Resources Entry of all pages. If you do this differently (e.g. you use one font-register per page) - you might have to adapt the code to your situation...
So how do you find the names of the glyphs (/Euro for "€", /exclam for "!" etc.)? In the above code, this is done by simply calling "GlyphName(*j)". I have generated this method with a BASH-Script from the list found at
http://www.jdawiseman.com/papers/trivia/character-entities.html
and it looks like this
const std::string GlyphName(wchar_t UnicodeCodepoint)
{
switch(UnicodeCodepoint)
{
case 0x00A0: return "nonbreakingspace";
case 0x00A1: return "exclamdown";
case 0x00A2: return "cent";
...
}
}
A major problem I have left open is that this only works as long as you use at most 254 different characters from the same font. To use more than 254 different characters, you would have to create multiple codepages for the same font.
Inside the pdf, different codepages are represented by different fonts, so to switch between codepages, you would have to switch fonts, which could theoretically blow your pdf up quite a bit, but I for one, can live with that...

dredkin's answer has worked fine for me in the forward direction (unicode text to PDF representation).
I was writing an increasingly convoluted comment there about the reverse direction (PDF representation to text, when copying from the PDF document), explained by user2373071. The method referred to throughout this thread is the definition of a /ToUnicode map (which, incidentally, is optional). I found it simplest to map from glyphs to characters using the beginbfrange srcCode1 srcCode2 [ dstString1 m ] endbfrange construct.
This seems to work OK in Adobe Reader, but two glyphs (0x100 and 0x1ef) cause the mapping for cyrillic characters to fail in browsers and SumatraPDF (the copy/paste provides the glyph IDs instead of the characters. By excluding those two glyphs I made it work there. (I really can't see what's special about these glyphs, and it's independent of font (i.e. it's the same glyphs, but different characters, in Times/Georgia/Palatino, and these values are afaik identically mapped in UTF-16. Any ideas welcome!)
However, and more importantly,
I have reached the conclusion that the whole /ToUnicode mechanism is fundamentally flawed in concept, because many fonts re-use glyphs for multiple characters. Consider simple ones like 0x20 and 0xa0 (ordinary and non-breaking space); 0x2d and 0xad (hyphen and soft hyphen); these two are in the 8-bit character range. Slightly beyond that are 0x3b and 0x37e (semi-colon and greek question mark). And it would be quite reasonable to re-use cyrillic small a and latin small a, and similar homoglyphs. So the point is, in the non-ASCII world that prompts us to worry about Unicode at all, we will encountering a one-to-many mapping from glyphs to characters, and will therefore be bound to pick up the wrong character at some point - which rather removes the point of being able to extract the text in the first place.
The other method in the (1.7) PDF reference is to use /ActualText instead of /ToUnicode. This is better in principle, because completely avoids the homoglyph problem I've mentioned above, and the overhead is probably bearable, but it only seems to be implemented in Adobe Reader (i.e. I haven't got anything consistent or meaningful from SumatraPdf or four browsers).

I'm not a PDF expert, and (as Ferruccio said) the PDF specs at Adobe should tell you everything, but a thought popped up in my mind:
Are you sure you are using a font that supports all the characters you need?
In our application, we create PDF from HTML pages (with a third party library), and we had this problem with cyrillic characters...

Related

asking for the unicode of letter conjunctions

I occasionally encounter some special character while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look like conjuncted and they actually exist in PDF string as one character.
I checked the 'ToUnicode' for these characters, but I just found the 'ToUnicode' CMap table are disrupted, therefore I cannot find their unicode.
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Could anybody let me know their unicode code point? Possible to find it from the corresponding font program?
Thanks for any advice.
fi:
tt:
ti:
First of all, what you call letter conjunctions usually is known as ligatures. Thus, I will use that term here from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to attempt to find the correct ToUnicode mapping for ligature glyphs. For this simply remember that the values of ToUnicode mappings do not need to be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
If you want to use ligature code points nonetheless, query the Wikipedia article on Orthographic Ligatures, it lists some ligature code points. In particular <FB01> for fi, so for your example:
<012E> <FB01>
But remember, their use is discouraged.

Without embeded fonts, is PDF limited to only 4281 characters (of AGL)? How to display more glyphs?

Adobe Glyph List (AGL) is described as
is a mapping of 4,281 glyph names to one or more Unicode characters.
For what I understand those are PDF Names like /Adieresis allow to specify the respective unicode character U+00C4 and if my understanding is correct those 4,281 Names can be used to specify a mapping like done here for the font named /F1 in a pages /Resources dictionary:
<<
/Type /Page
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
/Encoding <<
/Differencs [ 1 /Adiaresis /adiaresis ]
>>
>>
>>
>>
The key issue, which I cannot get to wrap my head around is that via the /Differences Array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/builtin/standard set of PDF fonts, wouldn't I?
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
Also I am confused that there is a /toUnicode feature in PDF allowing to associate glyphs/cmaps of embedded fonts with the unicode characters they those glyphs should represent (hence there was some thinking about "unicode") yet I cannot seem to find the way to use any reasonable unicode codepoints or half-way working encoding (i.e. UTF-8) to make use of the built-in fonts in PDF.
So am is my assumption correct that without going the length to generate a font to embed within a pdf file, the text can only ever be at most from the set of those 4,281 characters only?
In order to support all 65,557 characters within Unicode's Basic Multilingual Plane, it would be required to generate a font containing the used glyphs in the text, since except those 4,281 AGL glyph there seems to be no way to reference to those unicode characters, correct?
Motivation
It would be nice to have a way in PDF that would be the equivalent to HTML5's
<meta charset="utf-8">. Allowing text to be encoded in one simple compatible encoding for unicode, and not having to deal with complicated things as CID/GID/Postscript Glyph Names etc.
This answer first discusses the use of non-AGL names in differences arrays and the more encompassing encodings of composite fonts. Then it discusses which fonts a viewer actually does have to have available. Finally it considers all this in light of the clarifications accompanying your bounty offer.
AGL names and Differences arrays
First let's consider the focal point of your original question,
The key issue, which I cannot get to wrap my head around is that via the /Differences Array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/builtin/standard set of PDF fonts, wouldn't I?
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
i.e. your assumption is that only those 4,281 AGL glyph names can be used in the Differences array of the encoding entry of a simple font.
This is not the case, you can also use arbitrary names not found on the AGL. E.g. using this font
7 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /Arial
/FirstChar 32
/LastChar 32
/Widths [500]
/FontDescriptor 8 0 R
/Encoding 9 0 R
>>
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Arial
/FontFamily (Arial)
/Flags 32
/FontBBox [-665.0 -325.0 2000.0 1040.0]
/ItalicAngle 0
/Ascent 1040
/Descent -325
/CapHeight 716
/StemV 88
/XHeight 519
>>
endobj
9 0 obj
<<
/Type /Encoding
/BaseEncoding /WinAnsiEncoding
/Differences [32 /uniAB55]
>>
endobj
the instruction
( ) Tj
shows you a ꭕ ('LATIN SMALL LETTER CHI WITH LOW LEFT SERIF' U+AB55 which if I saw correctly is not on the AGL) on a system with Arial (ArialMT.ttf) installed.
Thus, to display an arbitrary glyph, you merely need a font you know containing that glyph with a name you know available to the PDF viewer in question. The name doesn't have to be an AGL name, it can be arbitrary!
Encodings of composite fonts
Furthermore, you often aren't even required to enumerate the characters you need as long as your required characters are in the same named encoding for composite fonts!
Here the Encoding shall be
The name of a predefined CMap, or a stream containing a CMap that maps character codes to font numbers and CIDs. If the descendant is a Type 2 CIDFont whose associated TrueType font program is not embedded in the PDF file, the Encoding entry shall be a predefined CMap name (see 9.7.4.2, "Glyph Selection in CIDFonts").
And among the predefined CMaps there are numerous CJK ones. As long as the viewer in question has access to a matching font, you can use a composite font with such an encoding to get access to a lot of CJK glyphs.
Which fonts does a viewer have to have available?
Thus, if the viewer in question has appropriate fonts available, you don't need to embed font programs to display any glyph. But which fonts does a viewer have available?
Usually a viewer will allow access to all fonts registered with the operation system it is running on, but strictly speaking it only has to have very few fonts accessible, PDF processors supporting PDF 1.0 to PDF 1.7 files only need to know the so-called standard 14 fonts and pure PDF 2.0 processors need to know none.
Annex D of the specification clarifies the character ranges to support:
All characters listed in D.2, "Latin character set and encodings" shall be supported for the Times, Helvetica, and Courier font families, as listed in 9.6.2.2, "Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)" by a PDF processor that supports PDF 1.0 to 1.7.
D.4, "Symbol set and encoding" and D.5, "ZapfDingbats set and encoding" describe the character sets and built-in encodings for the Symbol and ZapfDingbats (ITC Zapf Dingbats) font programs, which belong to the standard 14 predefined fonts.
D.2 essentially is a table describing the StandardEncoding, MacRomanEncoding, WinAnsiEncoding, and PDFDocEncoding. These all are very similar single byte encodings.
D.4 and D.5 contain a single table each describing additional single byte encodings.
Thus, all you can actually expect from a PDF 1.x viewer are these less than 1000 characters!
(You wondered about this in comments to this answer to another question of yours.)
Concerning your clarifications
In your text accompanying your bounty offer you expressed a desire for
being enabled to create a "no frills" program that is able to generate pdf files, where the input data are UTF-8 unicode strings. "No frills" being a reference to the fact that such a software would ideally be able to skip handling font porgam data (such as createing a subset font pogram for inclusion into the pdf).
As explained above, you can do so, either by customized encodings of a number of simple fonts or by the more encompassing named encodings of composite fonts. If you know that the target PDF viewer has these fonts available, that is!
sketch a way that actually would allow to have characters from at least the Adobe-GB1 charset as referenced via "UniCNS−UTF16−H" to be rendered in pdf-viewers, while the pdf file not having any font program embedded for that achieving this.
"UniCNS−UTF16−H" just happens to be one of the predefined encodings allowable for composite fonts. Thus, you can use a composite font with this encoding without embedding the font program as long as the viewer has the appropriate fonts accessible. As far as Adobe Reader is concerned, this usually amounts to having the Extended Asian Language Pack installed.
the limitations to use anything else the WinAnsiEncoding, MacRomanEncoding, MacExpertEncoding with those 14 standard fonts.
As explained above you can merely count on less than 1000 glyphs being available for sure in an arbitrary PDF 1.x viewer. In a pure PDF 2.0 viewer you actually cannot count on even that!
The specification quotes above are from ISO 32000-2; similar requirements can already be found in ISO 32000-1.
Without embeded fonts, is PDF limited to only 4281 characters (of AGL)?
No. Though you should embed fonts to help ensure that the PDF looks the same everywhere.
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
It is possible yes, though you would ideally stick with a "standard" encoding, such one of the Orderings. See the "Predefined CMaps" in the PDF specification for these.
If you start making changes to the encoding, such as using Differences, then you are making run time font substitution for the PDF processing program much more difficult.
Regarding /ToUnicode that is just for text extraction, and has nothing to do with rendering. If you stick with a standard encoding as recommended above this is not needed.
There is no 4,281 font glyph limit inherent in PDF. I think you are a bit confused, you don't have to embed fonts in a PDF. Besides the Standard 14 fonts all PDF viewers should be able to handle, PDF software is going to look for fonts installed on the system when not embedded otherwise so it's not as if you have no embedded fonts you lose the ability to display glyphs at all.
You would define a different encoding with the Differences array if the base encoding doesn't reflect what is in the font.
ToUnicode comes into play for text extraction vs text showing.

PDF extracted text seems to be unreadable

Situation: I've a PDF using version 1.6. In that PDF, there are several streams. There were compressed text (Flate) in that streams, so I decompressed these streams. After that, I extracted the Tj-parts of the corresponding, decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual Question: As I have no idea, what I've got thre, I would like to know what type of content it is. Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
Further research: The PDFs, which I try to analyze where generated by iTextSharp (seems to be an C# Library for generating PDFs). Don't know whether it is a relevant information, but it might be that that Library uses a special way of encrypt it's text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.

write in unicode text on visible signature - pdfbox

I'we build PDF, using PDFBox. I've visible signature too. I write some text like that:
...
builderSting.append("Tm\n");
builderSting.append(" /F1 " + fontSize + "\n");
builderSting.append("Tf\n");
builderSting.append("(hello world)");
builderSting.append("Tj\n");
builderSting.append("ET");
...
PDStream stream= ...;
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
everything works well. but if I write some unicode characters in builderString, there is "???"s instead of text.
that's sample PDF: link here
QUESTION 1) when I see PDF structure , there is Question-Marks instead of text. Yes. and I dont know how to write with unicode characters?
9 0 obj
<<
/Type /XObject
/Subtype /Form
/BBox [100 50 0 0]
/Matrix [1 0 0 1 0 0]
/Resources <<
/Font 11 0 R
/XObject <<
/img0 12 0 R
>>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
>>
/FormType 1
/Length 13 0 R
>>
stream
q 93.70079 0 0 50 0 0 cm /img0 Do Q
BT
1 0 0 1 93.70079 25 Tm
/F1 2
Tf
(????)Tj
ET
endstream
endobj
I've font with Encoding WinAsciEncoding. can i use another encoding in pdfbox?
PDFont font = PDTrueTypeFont.loadTTF(template, new File("//fontName.ttf"));
font.setFontEncoding(new WinAnsiEncoding());
QUESTION 2) I 've embedded font in PDF. but text is not written with this font (in visible singature Rectangle). Why?
Question 3) when I remove font, text was still there (when the text was in english). what is the default font? /F1 - which is is 1st font?
Question 4) How to calculate width of my text in visible signature ? Any ideas?
QUESTION 1) when I see PDF structure , there is Question-Marks instead of text. Yes. and I dont know how to write with unicode characters?
I assume that with unicode characters you mean characters present in Unicode but not in e.g. Latin-1. (Because the letter 'a' for example does have a Unicode representation, too, but most likely won't cause you trouble.)
You call getBytes("ISO-8859-1") on your StringBuilder result. Your unicode characters most likely are not in ISO 8859-1. Thus, String.getBytes returns the ASCII code for a question mark in their respective place.
If the question was merely how to write to an output stream with unicode characters in Java, the answer would be easy: Choose an encoding which contains all you characters, e.g. UTF-8, which all consumers of your program support, and call String.getBytes for that encoding.
The case at hand is different, though, as you want to serialize those information as a PDF form xobject stream. In this context your whole approach is somewhere along the route from highly questionable to completely wrong:
In PDFs, each font might come along with its own encoding which might be similar to a common encoding, e.g. /WinAnsiEncoding, or completely custom. These encodings, furthermore, in many cases are restricted to one byte per character, but in case of composite fonts they can also be multi-byte-encodings.
As a corollary, not all elements of the stream elements need to be encoded using the same encoding. E.g. the operator names Tm, Tf, and Tj are encoded using their ASCII codes while the characters of a string to be displayed have to be encoded using the respective font's encoding (and may thereafter be yet again hex-encoded if added in sharp brackets <>).
Thus, creating the stream as a string and then converting them to bytes with a single encoding only works if all used fonts use the same encoding (for the actually used code points) which furthermore needs to be ASCII'ish to correctly represent the operators.
Essentially, you should directly construct the stream in some byte buffer and for each inserted element use the appropriate encoding. In case of characters to be displayed, therefore, you have to be aware of the encoding used by the currently selected font.
If you want to do it right, first study the PDF specification ISO 32000-1, especially the sections on general syntax and chapter 9 Text.
QUESTION 2) I've embedded font in PDF. but text is not written with this font (in visible signature Rectangle). Why?
In the resources of the stream xobject in question there is exactly one embedded font associated to the name /F0. In your stream, though, you have /F1 2 Tf, i.e. you select a font /F1 at size 2.
Question 3) when I remove font, text was still there (when the text was in english). what is the default font?
According to the specification, section 9.3.1,
font shall be the name of a font resource in the Font subdictionary of the current
resource dictionary [...]
There is no initial value for either font or size
Most likely, though, PDF viewers for the sake of compatibility with old or broken documents use some default font.
Question 4) How to calculate width of my text in visible signature ? Any ideas?
The widths obviously depends on the metrics of the font used (glyph widths in this case) and the graphics state you set (font size, character spacing, word spacing, current transformation matrix, text transformation matrix, ...).
In your case you hardly do anything in the graphics state and, therefore, only the selected font size from it is of interest. so the more interesting part are the character widths from the font metrics. As long as you use the standard 14 fonts, you find the metrics here. As soon as you start using other, custom fonts, you have to read them from the font definition files yourself.
Ad 1)
Could it be that
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
should be
stream.createOutputStream().write(builderString.toString().getBytes("UTF-8"));
The conversion in getBytes to ISO-8859-1 would make some special characters missing in ISO-8859-1 a ?.

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x. I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about, Acrobat does wonders in rendering faulty PDF files hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I trief to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (See stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The char is referenced by char #1. Given the Encoding of the font it contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now in the embedded subset font there is a cmap that matches no glyph at character 65 (eg capital A) the cmap section of the font does define the character in exactly the order in the PDF file Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs..
Solution
In the ISO32000 file there is a remark that symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the encoding is not allowed and you should IGNORE it, using a simple 1on1 encoding always. SO all in all, if it is a symbolic font, I ignore the Encoding object altogether and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so its almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.