PDF to Text for Urdu and Arabic using Ghostscript

I have a few PDF files which are in the Urdu language, and some of the PDF files are in the Arabic language.
I want to convert the PDF files to text format. I have issued the following Ghostscript code from the command line in my Windows 10 system:
gswin64c.exe -sDEVICE=txtwrite -o output.txt new.pdf
The text file is generated; however, its contents are not in Urdu or Arabic.
This is what it looks like (I have pasted only a portion of the output, as it is huge):
ی첺جⰧ�� ہ셈ے
How can I properly convert PDF to text using Ghostscript?

Well basically the answer is that the PDF files you have supplied have 'not terribly good' ToUnicode CMap tables.
Looking at your first file we see that it uses one font:
26 0 obj
<<
/BaseFont /CCJSWK+JameelNooriNastaleeq
/DescendantFonts 28 0 R
/Encoding /Identity-H
/Subtype /Type0
/ToUnicode 29 0 R
/Type /Font
>>
endobj
That has a ToUnicode CMap in object 29; a ToUnicode CMap is meant to map character codes to Unicode code points. Looking at the first piece of text as an example we see:
/C2_0 1 Tf
13 0 0 13 39.1302 561.97 Tm
<0003>Tj
/Span<</ActualText<FEFF0645062A>>> BDC
<38560707>Tj
So that's character code 0x0003 (notice there is no marked content for the first character). Looking at the ToUnicode CMap we see:
<0003> <0020>
So character code 0x0003 maps to Unicode code point U+0020, a space. The next two character codes are 0x3856 and 0x0707. Again consulting the ToUnicode CMap we see:
<3856> <062A0645>
So that single character code maps to two Unicode code points, U+062A and U+0645, which are 'Teh' ت and 'Meem' م.
So far so good. The next code is 0x0707; when we look that up in the ToUnicode CMap it comes up as U+FFFD, which is the 'replacement character' �. Obviously that's meaningless.
We then have this:
0.391 0 Td
[<011C07071FEE>1 <0003>243.8 <2E93>]TJ
/Span<</ActualText<FEFF0644>>> BDC
<0707>Tj
EMC
So that's character codes 0x011C, 0x0707, 0x1FEE, 0x0003, 0x2E93 followed by 0x0707. Notice that the final <0707> is associated with a Marked Content definition which says the ActualText is U+0644, which is the 'Lam' glyph ل
So clearly the ToUnicode CMap should be mapping 0707 to U+0644, and it doesn't.
Now when given a ToUnicode CMap the text extraction code trusts it. So your problem with this file is that the ToUnicode CMap is 'wrong', and that's why the text is coming out incorrect.
I haven't tried to debug further through the file, it is possible there are other errors.
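To make the lookup concrete, here is a minimal Python sketch of what a text extractor does with the entries quoted above. It is not Ghostscript's actual code; the function names are made up for illustration, and it treats a code with no entry the same way the file's own table treats 0707, by emitting U+FFFD.

```python
import re

# Two entries copied from the first file's ToUnicode CMap as quoted above.
CMAP_EXCERPT = """
<0003> <0020>
<3856> <062A0645>
"""

def parse_bfchar(text):
    """Return {character code: Unicode string} from '<code> <utf16be-hex>' pairs."""
    table = {}
    for code, target in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", text):
        table[int(code, 16)] = bytes.fromhex(target).decode("utf-16-be")
    return table

def decode_hex_string(hex_string, table, code_bytes=2):
    """Split a PDF hex string into fixed-width codes and look each one up."""
    digits = code_bytes * 2
    codes = [int(hex_string[i:i + digits], 16)
             for i in range(0, len(hex_string), digits)]
    # Codes without a usable mapping become U+FFFD, the replacement character.
    return "".join(table.get(code, "\ufffd") for code in codes)

def decode_actual_text(hex_string):
    """Decode an /ActualText hex string that starts with the FEFF byte order mark."""
    return bytes.fromhex(hex_string).decode("utf-16")

table = parse_bfchar(CMAP_EXCERPT)
print(repr(decode_hex_string("0003", table)))      # the space
print(repr(decode_hex_string("38560707", table)))  # Teh, Meem, then U+FFFD
print(repr(decode_actual_text("FEFF0644")))        # Lam, from the marked content
```

The last call shows that the ActualText does carry the right answer for 0707; it is just not the source the extraction code consults first.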
Your second file has this ToUnicode CMap:
26 0 obj
<<
/Length 606
>>
stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (AABACF+TT1+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /AABACF+TT1+0 def
/CMapType 2 def
1 begincodespacerange <0003> <0707> endcodespacerange
15 beginbfchar
<0003> <0020>
<0011> <002E>
<00e7> <062A>
<00ec> <062F>
<00ee> <0631>
<00f3> <0636>
<00f8> <0641>
<00fa> <0644>
<00fc> <0646>
<00fe> <0648>
<0119> <0647>
<011a> <064A>
<0134> <0066>
<013b> <006D>
<0707> <2423>
endbfchar
2 beginbfrange
<00e4> <00e5> <0627>
<011f> <0124> <0661>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
The first text in the file is:
<3718>Tj
And again, that's not in the CMap. Because the text extraction code prioritises the CMap (because it's usually reliable), the missing entries cause the extraction to basically fail.
In addition to the fact that the ToUnicode CMaps are incorrect, the embedded fonts are subset and use an Identity-H CMap for drawing. That eliminates another source of information we could use.
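As a rough check of the second file, the sketch below expands the two bfrange entries from the CMap above and then looks up the first code used in the page content. Only a few of the bfchar entries are copied in; the helper name is hypothetical.

```python
# A few bfchar entries and both bfrange entries from the second file's CMap.
BFCHAR = {0x0003: "\u0020", 0x00FA: "\u0644", 0x0707: "\u2423"}
BFRANGES = [(0x00E4, 0x00E5, 0x0627), (0x011F, 0x0124, 0x0661)]

def build_table(bfchar, bfranges):
    """Merge bfchar entries with expanded bfrange entries."""
    table = dict(bfchar)
    for low, high, start in bfranges:
        # A range maps consecutive codes to consecutive Unicode code points.
        for offset in range(high - low + 1):
            table[low + offset] = chr(start + offset)
    return table

table = build_table(BFCHAR, BFRANGES)
print(0x00E5 in table)  # covered by the first range
print(0x3718 in table)  # the first code in the page content
```

Even with the ranges expanded, 0x3718 has no entry, so there is nothing for the extractor to emit for it.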
Fundamentally the only way you're going to get text out of that PDF file is manual transcription or OCR software.
Since you are using Ghostscript on Windows, the distributed binary includes Tesseract so you could try using that with pdfwrite and an Urdu training file to produce a PDF file with a possibly better ToUnicode CMap. You could then extract the text from that PDF file.
You would have to tell the pdfwrite device not to use the embedded ToUnicode CMaps, see the UseOCR switch documented here https://ghostscript.com/doc/9.56.1/VectorDevices.htm#PDF
And information on setting up the OCR engine and getting output here https://ghostscript.com/doc/9.56.1/Devices.htm#OCR-Devices
You may get better results by using an 'image' OCR output and then using the text extraction on that file to get the text out.
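A sketch of that two-step route, written as Python helpers that only build the argument lists. The switch spellings (-sUseOCR=Always, -sOCRLanguage) are my reading of the pages linked above and "urd" is assumed to be the name of the Tesseract training file you installed, so check both against the documentation for your Ghostscript version before relying on them.

```python
def ocr_rewrite_args(src_pdf, dst_pdf, language="urd", gs="gswin64c.exe"):
    """Step 1: rewrite the PDF with pdfwrite, asking it to OCR the glyphs
    instead of trusting the embedded ToUnicode CMaps (assumed switch names)."""
    return [
        gs,
        "-sDEVICE=pdfwrite",
        "-sUseOCR=Always",             # assumption: ignore existing ToUnicode data
        f"-sOCRLanguage={language}",   # assumption: matches <language>.traineddata
        "-o", dst_pdf,
        src_pdf,
    ]

def extract_text_args(src_pdf, dst_txt, gs="gswin64c.exe"):
    """Step 2: the same txtwrite command as in the question, run on the new PDF."""
    return [gs, "-sDEVICE=txtwrite", "-o", dst_txt, src_pdf]

# To run the steps, pass each list to subprocess.run(args, check=True).
print(" ".join(ocr_rewrite_args("new.pdf", "ocr.pdf")))
print(" ".join(extract_text_args("ocr.pdf", "output.txt")))
```

Whether the result is usable depends entirely on how well the OCR engine copes with the Nastaleeq glyphs, so inspect the output rather than assuming it is correct.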

Related

How do I inspect the cmap table and subtables in a TrueType font?

The PDF Reference says:
A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions, using an internal data structure called a “cmap”
It goes on to explain that the behaviour of a PDF processor depends on which cmap subtables are present in the font file.
I am trying to analyze a .ttf font file extracted using fontforge from a PDF that was generated by LibreOffice. The PDF embeds this font file as a simple font, using single-byte codes. When I look at the .ttf file in fontdrop.info, it tells me the "glyphIndexMap" is as follows:
{"0":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0,"10":0,"11":0,"12":0,"13":0,"14":0,"15":0,"16":0,"17":0,"18":0,"19":0,"20":0,"21":0,"22":0,"23":0,"24":0,"25":0,"26":0,"27":0,"28":0,"29":0,"30":0,"31":0,"32":0,"33":0,"34":0,"35":0,"36":0,"37":0,"38":0,"39":0,"40":0,"41":0,"42":0,"43":0,"44":0,"45":0,"46":0,"47":0,"48":0,"49":0,"50":0,"51":0,"52":0,"53":0,"54":0,"55":0,"56":0,"57":0,"58":0,"59":0,"60":0,"61":0,"62":0,"63":0,"64":0,"65":0,"66":0,"67":0,"68":0,"69":0,"70":0,"71":0,"72":0,"73":0,"74":0,"75":0,"76":0,"77":0,"78":0,"79":0,"80":0,"81":0,"82":0,"83":0,"84":0,"85":0,"86":0,"87":0,"88":0,"89":0,"90":0,"91":0,"92":0,"93":0,"94":0,"95":0,"96":0,"97":0,"98":0,"99":0,"100":0,"101":0,"102":0,"103":0,"104":0,"105":0,"106":0,"107":0,"108":0,"109":0,"110":0,"111":0,"112":0,"113":0,"114":0,"115":0,"116":0,"117":0,"118":0,"119":0,"120":0,"121":0,"122":0,"123":0,"124":0,"125":0,"126":0,"127":0,"160":0,"161":0,"162":0,"163":0,"165":0,"167":0,"168":0,"169":0,"170":0,"171":0,"172":0,"174":0,"175":0,"176":0,"177":0,"180":0,"181":0,"182":0,"183":0,"184":0,"186":0,"187":0,"191":0,"192":0,"193":0,"194":0,"195":0,"196":0,"197":0,"198":0,"199":0,"200":0,"201":0,"202":0,"203":0,"204":0,"205":0,"206":0,"207":0,"209":0,"210":0,"211":0,"212":0,"213":0,"214":0,"216":0,"217":0,"218":0,"219":0,"220":0,"223":0,"224":0,"225":0,"226":0,"227":0,"228":0,"229":0,"230":0,"231":0,"232":0,"233":0,"234":0,"235":0,"236":0,"237":0,"238":0,"239":0,"241":0,"242":0,"243":0,"244":0,"245":0,"246":0,"247":0,"248":0,"249":0,"250":0,"251":0,"252":0,"255":0,"305":0,"338":0,"339":0,"376":0,"402":0,"675":3,"710":0,"711":0,"728":0,"729":0,"730":0,"731":0,"732":0,"733":0,"916":0,"937":0,"960":0,"8211":0,"8212":0,"8216":0,"8217":0,"8218":0,"8220":0,"8221":0,"8222":0,"8224":0,"8225":0,"8226":0,"8230":0,"8240":0,"8249":0,"8250":0,"8260":0,"8364":0,"8482":0,"8706":0,"8719":0,"8721":0,"8730":0,"8734":0,"8747":0,"8776":0,"8800":0,"8804":0,"8805":0,"9674":0,"57374":0,"64257":0,"64258":0}
(the interesting part is "675":3)
I can understand this insofar as the font contains 4 glyphs, and the glyph at index 3 is the ʣ character (decimal Unicode point 675 / U+02A3).
But in the PDF, this character is used in text strings as <01>, and no other encoding is given - so according to the PDF Reference, the mapping from <01> to the glyph at index 3 must be done according to a mapping within the .ttf file:
If no Encoding entry is specified in the font dictionary, the “cmap” subtable with platform ID 1 and encoding 0 will be used to map directly from character codes to glyph descriptions, without any consideration of character names. This is the normal convention for symbolic fonts.
I have confirmed that no Encoding entry is specified within the PDF. Here are the /Font and /FontDescriptor objects extracted using qpdf:
18 0 obj
<<
/BaseFont /BAAAAA+LiberationSerif
/FirstChar 0
/FontDescriptor 20 0 R
/LastChar 1
/Subtype /TrueType
/ToUnicode 21 0 R
/Type /Font
/Widths [
777
802
]
>>
endobj
20 0 obj
<<
/Ascent 891
/CapHeight 981
/Descent -216
/Flags 4
/FontBBox [
-543
-303
1277
981
]
/FontFile2 23 0 R
/FontName /BAAAAA+LiberationSerif
/ItalicAngle 0
/StemV 80
/Type /FontDescriptor
>>
endobj
So how can I investigate the .ttf file to confirm that "the “cmap” subtable with platform ID 1 and encoding 0" is in place and contains the mappings I think it does?
Edit: the PDF in question

How do I inspect the cmap table and subtables in a TrueType font?
OT Master Light, from Dutch Type Library, is a free tool that's quite handy for inspecting internal font tables.

Using OT Master Light it can be seen that the cmap 1:0 maps character code 0x01 to glyph index 1 (1st image, 2nd entry in the list) which is the 'dz' symbol (2nd image).
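If you would rather check this without a GUI tool, the cmap table is simple enough to read by hand. The following is a minimal sketch using only Python's struct module; it assumes a plain TrueType file (not a collection), decodes only subtable formats 0 and 6, and the function name is my own.

```python
import struct

def list_cmap_subtables(font_bytes):
    """Return [(platform_id, encoding_id, format, mapping_or_None), ...]."""
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    cmap_offset = None
    for i in range(num_tables):
        # Each table record: tag, checksum, offset, length (16 bytes).
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sLLL", font_bytes, 12 + 16 * i)
        if tag == b"cmap":
            cmap_offset = offset
    if cmap_offset is None:
        raise ValueError("font has no cmap table")

    _version, count = struct.unpack_from(">HH", font_bytes, cmap_offset)
    result = []
    for i in range(count):
        platform_id, encoding_id, sub_offset = struct.unpack_from(
            ">HHL", font_bytes, cmap_offset + 4 + 8 * i)
        start = cmap_offset + sub_offset
        fmt = struct.unpack_from(">H", font_bytes, start)[0]
        mapping = None
        if fmt == 0:
            # Format 0: 256 one-byte glyph indices after a 6-byte header.
            glyph_ids = font_bytes[start + 6:start + 262]
            mapping = {code: gid for code, gid in enumerate(glyph_ids) if gid}
        elif fmt == 6:
            # Format 6: firstCode, entryCount, then 16-bit glyph indices.
            first, n = struct.unpack_from(">HH", font_bytes, start + 6)
            gids = struct.unpack_from(f">{n}H", font_bytes, start + 10)
            mapping = {first + k: gid for k, gid in enumerate(gids) if gid}
        result.append((platform_id, encoding_id, fmt, mapping))
    return result

# Usage (the file name is a placeholder):
#   with open("extracted.ttf", "rb") as f:
#       for entry in list_cmap_subtables(f.read()):
#           print(entry)
```

A subtable reported as (1, 0, ...) is the platform ID 1, encoding 0 subtable the PDF Reference talks about; its mapping shows which glyph index each single-byte code selects.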

Without embedded fonts, is PDF limited to only 4281 characters (of AGL)? How to display more glyphs?

Adobe Glyph List (AGL) is described as
is a mapping of 4,281 glyph names to one or more Unicode characters.
From what I understand, those are PDF names such as /Adieresis that specify the respective Unicode character (U+00C4), and if my understanding is correct, those 4,281 names can be used to specify a mapping, as done here for the font named /F1 in a page's /Resources dictionary:
<<
/Type /Page
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
/Encoding <<
/Differences [ 1 /Adieresis /adieresis ]
>>
>>
>>
>>
>>
The key issue, which I cannot get to wrap my head around is that via the /Differences Array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/builtin/standard set of PDF fonts, wouldn't I?
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
I am also confused that there is a /ToUnicode feature in PDF, allowing the glyphs/cmaps of embedded fonts to be associated with the Unicode characters those glyphs should represent (hence there was some thinking about "unicode"), yet I cannot seem to find a way to use any reasonable Unicode code points or a half-way working encoding (i.e. UTF-8) with the built-in fonts in PDF.
So is my assumption correct that, without going to the length of generating a font to embed within a PDF file, the text can only ever come from the set of those 4,281 characters?
In order to support all 65,536 code points within Unicode's Basic Multilingual Plane, it would be required to generate a font containing the glyphs used in the text, since apart from those 4,281 AGL glyphs there seems to be no way to reference those Unicode characters, correct?
Motivation
It would be nice to have a way in PDF that would be the equivalent to HTML5's
<meta charset="utf-8">. Allowing text to be encoded in one simple compatible encoding for unicode, and not having to deal with complicated things as CID/GID/Postscript Glyph Names etc.

This answer first discusses the use of non-AGL names in differences arrays and the more encompassing encodings of composite fonts. Then it discusses which fonts a viewer actually does have to have available. Finally it considers all this in light of the clarifications accompanying your bounty offer.
AGL names and Differences arrays
First let's consider the focal point of your original question,
The key issue, which I cannot get to wrap my head around is that via the /Differences Array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/builtin/standard set of PDF fonts, wouldn't I?
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
i.e. your assumption is that only those 4,281 AGL glyph names can be used in the Differences array of the encoding entry of a simple font.
This is not the case; you can also use arbitrary names not found on the AGL. E.g. using this font
7 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /Arial
/FirstChar 32
/LastChar 32
/Widths [500]
/FontDescriptor 8 0 R
/Encoding 9 0 R
>>
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Arial
/FontFamily (Arial)
/Flags 32
/FontBBox [-665.0 -325.0 2000.0 1040.0]
/ItalicAngle 0
/Ascent 1040
/Descent -325
/CapHeight 716
/StemV 88
/XHeight 519
>>
endobj
9 0 obj
<<
/Type /Encoding
/BaseEncoding /WinAnsiEncoding
/Differences [32 /uniAB55]
>>
endobj
the instruction
( ) Tj
shows you a ꭕ ('LATIN SMALL LETTER CHI WITH LOW LEFT SERIF' U+AB55 which if I saw correctly is not on the AGL) on a system with Arial (ArialMT.ttf) installed.
Thus, to display an arbitrary glyph, you merely need a font you know containing that glyph with a name you know available to the PDF viewer in question. The name doesn't have to be an AGL name, it can be arbitrary!
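To illustrate how a Differences array is read, here is a small Python sketch. It only models the bookkeeping (a number sets the next code, each following name takes the next code in turn) and the "uniXXXX" naming convention; how a given viewer actually resolves a name against an installed font is up to that viewer. The function names are my own.

```python
import re

def apply_differences(base, differences):
    """Overlay a /Differences array onto a {code: glyph name} base encoding."""
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item              # a number restarts the running code
        else:
            if code is None:
                raise ValueError("Differences must start with a code")
            encoding[code] = item    # a name is assigned to the running code
            code += 1
    return encoding

def uni_name_to_text(name):
    """Translate a 'uniXXXX' style glyph name to text, or return None."""
    match = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", name)
    if match is None:
        return None
    digits = match.group(1)
    return "".join(chr(int(digits[i:i + 4], 16))
                   for i in range(0, len(digits), 4))

encoding = apply_differences({32: "space"}, [32, "uniAB55"])
print(encoding[32], repr(uni_name_to_text(encoding[32])))
```

With the font above, code 32 no longer means "space" but the glyph named uniAB55, which is why ( ) Tj draws the chi.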
Encodings of composite fonts
Furthermore, you often aren't even required to enumerate the characters you need as long as your required characters are in the same named encoding for composite fonts!
Here the Encoding shall be
The name of a predefined CMap, or a stream containing a CMap that maps character codes to font numbers and CIDs. If the descendant is a Type 2 CIDFont whose associated TrueType font program is not embedded in the PDF file, the Encoding entry shall be a predefined CMap name (see 9.7.4.2, "Glyph Selection in CIDFonts").
And among the predefined CMaps there are numerous CJK ones. As long as the viewer in question has access to a matching font, you can use a composite font with such an encoding to get access to a lot of CJK glyphs.
Which fonts does a viewer have to have available?
Thus, if the viewer in question has appropriate fonts available, you don't need to embed font programs to display any glyph. But which fonts does a viewer have available?
Usually a viewer will allow access to all fonts registered with the operating system it is running on, but strictly speaking it only has to have very few fonts accessible: PDF processors supporting PDF 1.0 to PDF 1.7 files only need to know the so-called standard 14 fonts, and pure PDF 2.0 processors need to know none.
Annex D of the specification clarifies the character ranges to support:
All characters listed in D.2, "Latin character set and encodings" shall be supported for the Times, Helvetica, and Courier font families, as listed in 9.6.2.2, "Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)" by a PDF processor that supports PDF 1.0 to 1.7.
D.4, "Symbol set and encoding" and D.5, "ZapfDingbats set and encoding" describe the character sets and built-in encodings for the Symbol and ZapfDingbats (ITC Zapf Dingbats) font programs, which belong to the standard 14 predefined fonts.
D.2 essentially is a table describing the StandardEncoding, MacRomanEncoding, WinAnsiEncoding, and PDFDocEncoding. These all are very similar single byte encodings.
D.4 and D.5 contain a single table each describing additional single byte encodings.
Thus, all you can actually expect from a PDF 1.x viewer are these fewer than 1000 characters!
(You wondered about this in comments to this answer to another question of yours.)
Concerning your clarifications
In your text accompanying your bounty offer you expressed a desire for
being enabled to create a "no frills" program that is able to generate pdf files, where the input data are UTF-8 unicode strings. "No frills" being a reference to the fact that such a software would ideally be able to skip handling font program data (such as creating a subset font program for inclusion into the pdf).
As explained above, you can do so, either by customized encodings of a number of simple fonts or by the more encompassing named encodings of composite fonts. If you know that the target PDF viewer has these fonts available, that is!
sketch a way that actually would allow to have characters from at least the Adobe-GB1 charset as referenced via "UniCNS-UTF16-H" to be rendered in pdf-viewers, while the pdf file not having any font program embedded for that achieving this.
"UniCNS-UTF16-H" just happens to be one of the predefined encodings allowable for composite fonts. Thus, you can use a composite font with this encoding without embedding the font program as long as the viewer has the appropriate fonts accessible. As far as Adobe Reader is concerned, this usually amounts to having the Extended Asian Language Pack installed.
the limitations to use anything else the WinAnsiEncoding, MacRomanEncoding, MacExpertEncoding with those 14 standard fonts.
As explained above you can merely count on less than 1000 glyphs being available for sure in an arbitrary PDF 1.x viewer. In a pure PDF 2.0 viewer you actually cannot count on even that!
The specification quotes above are from ISO 32000-2; similar requirements can already be found in ISO 32000-1.

Without embedded fonts, is PDF limited to only 4281 characters (of AGL)?
No. Though you should embed fonts to help ensure that the PDF looks the same everywhere.
Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?
It is possible, yes, though ideally you would stick with a "standard" encoding, such as one of the Orderings. See the "Predefined CMaps" in the PDF specification for these.
If you start making changes to the encoding, such as using Differences, then you are making run time font substitution for the PDF processing program much more difficult.
Regarding /ToUnicode that is just for text extraction, and has nothing to do with rendering. If you stick with a standard encoding as recommended above this is not needed.

There is no 4,281 glyph limit inherent in PDF. I think you are a bit confused: you don't have to embed fonts in a PDF. Besides the Standard 14 fonts, which all PDF viewers should be able to handle, PDF software will look for fonts installed on the system when they are not embedded, so having no embedded fonts does not mean you lose the ability to display glyphs at all.
You would define a different encoding with the Differences array if the base encoding doesn't reflect what is in the font.
ToUnicode comes into play for text extraction vs text showing.

Where can I get the Adobe-Identity-UCS cmap file?

I have a PDF file from which text cannot be extracted by pdfbox or itext7. The font is encoded with Identity-H and uses Adobe-Identity-UCS. The details of the ToUnicode CMap are given below.
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
The ToUnicode CMap is invalid. Is there any way to fix it?
I tried to download an intact Adobe-Identity-UCS cmap file to replace it, but after a lot of Google searching I can't find one.
Any help? Thanks.
Edit:
Chinese-cidmap-broken.pdf

The ToUnicode CMap you show corresponds to the example ToUnicode CMap in the PDF specification ISO 32000 (either part), merely without any bfrange or bfchar section.
Thus, what you have essentially is a template into which one can put arbitrary mappings.
Concerning your question, therefore:
Is there any way to fix it?
Yes and no.
Yes, you can fix it by adding the appropriate bfrange or bfchar sections with the correct mappings.
BUT... to do so you need to know which codes map to which Unicode strings for the font at hand; the name Adobe-Identity-UCS by itself usually does not imply the mapping. So also:
No, not without additional information.
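Once you do have the mappings from somewhere, producing the missing section is mechanical. Here is a minimal Python sketch that turns a code-to-text dictionary into bfchar sections; the two sample entries are invented purely for illustration and say nothing about the real font in your file.

```python
def bfchar_sections(mapping):
    """Render {character code: Unicode text} as ToUnicode bfchar sections."""
    entries = [
        f"<{code:04X}> <{text.encode('utf-16-be').hex().upper()}>"
        for code, text in sorted(mapping.items())
    ]
    lines = []
    # A single bfchar section may hold at most 100 entries, so chunk the list.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    return "\n".join(lines)

# Hypothetical mappings, for illustration only.
print(bfchar_sections({0x0001: "\u4e2d", 0x0002: "\u6587"}))
```

The printed block goes between endcodespacerange and endcmap in the template shown in the question.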
@Tilman, in his comment on your question, referenced one of his answers in which he showed how to add a missing ToUnicode map using information on the actual mappings gathered from different sources.

Why a pdf document with embedded fonts can be copied but is not searchable in pdf reader

I am writing PDF files with embedded subset fonts. As required, I am including the ToUnicode and CIDSet objects. To test, I created a simple PDF with two Hebrew characters. I can select the two characters, copy them to the clipboard, and paste them properly into another application such as Word. But I am not able to search for a word containing these two characters. Adobe Reader (or Acrobat) displays the message that the word was not found. So in essence, I have created a PDF document which can be copied properly, but is not searchable. Any idea what I might be missing when creating the document?
Additional information:
1. The file in question is a minimal file with just two characters. I have tested with many such files in many different languages including English. None of the files are searchable.
2. Curiously, if I search for the letter 'e', Adobe Reader highlights an incorrect word, even if the letter 'e' does not exist in the file.
3. Adobe Acrobat is also not able to search within this file; however, when I save the file to another disk file, the saved file is searchable. I confirmed that the major objects such as the font file, ToUnicode object, CID object, and the font descriptor objects are the same in the saved file. However, one of the font objects is moved closer to the top of the file.
4. FoxIt is able to search these files properly.
Relevant PDF objects:
5 0 obj
<>
stream
q 0.750000 0 0 0.750000 0.000000 792.000000 cm
q q q 0.160000 0.000000 0.000000 0.160000 0.000000 0.000000 cm
BT /F0 100.000000 Tf 0 g 750.000000 -690 Td[<02B0>] TJ 35.000000 0 Td[<02B9>] TJ ET Q
Q
Q
Q
endstream
endobj
10 0 obj
<>
endobj
11 0 obj
<> /FontDescriptor 10 0 R/Subtype/CIDFontType2/Type/Font>>
endobj
12 0 obj
<>
endobj
8 0 obj
<>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0000> <0000>
<02B0> <05E0>
<02B9> <05E9>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj

In short
The problem is due to identical PDF IDs used for different documents.
In detail
Adobe Reader / Acrobat seem to cache search information for documents identifying the document by its ID. Some of the OP's documents seem to have the same ID, at least the two sample files do:
/ID[<754DC77D28E62763C4916970D595A10F><754DC77D28E62763C4916970D595A10F>]
Thus, search information from earlier viewed PDFs with that ID was used when the OP tried to search his test.pdf. Considering this description from one of his comments:
What happens if you search for the English letter 'e'. For me, the two Hebrew letters can selected. The same happens when I search for one of these English letters: d, i, n, o, p, r, t, y, I, N, R, T and Y.
the search information seems to have been cached for a document with Latin glyphs. Furthermore, considering this comment on test_en.pdf (a document sharing the same ID, too):
It has one English line: 'This is a test line'. When I search for 'This', I find it. But I cannot find the other words.
the text of the original document seems to have started with "This" but continued differently.
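A quick way to check whether your generated files share an ID is to pull the /ID entry out of each file. This is a minimal Python sketch, not a PDF parser: it assumes the ID strings are written in hex form and sit in a plain (uncompressed) trailer or cross-reference stream dictionary, as in the line quoted above.

```python
import re

ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]")

def find_ids(pdf_bytes):
    """Return every (first, second) ID pair found in the raw bytes."""
    return [(a.decode("ascii").upper(), b.decode("ascii").upper())
            for a, b in ID_PATTERN.findall(pdf_bytes)]

def share_id(pdf_a, pdf_b):
    """True if the two files have at least one ID pair in common."""
    return bool(set(find_ids(pdf_a)) & set(find_ids(pdf_b)))

sample = (b"trailer\n<</Size 13/Root 1 0 R/ID"
          b"[<754DC77D28E62763C4916970D595A10F>"
          b"<754DC77D28E62763C4916970D595A10F>]>>")
print(find_ids(sample))
```

If two different documents report the same pair, give each generated file its own freshly computed ID.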

write in unicode text on visible signature - pdfbox

I've built a PDF using PDFBox, and it has a visible signature too. I write some text like this:
...
builderString.append("Tm\n");
builderString.append(" /F1 " + fontSize + "\n");
builderString.append("Tf\n");
builderString.append("(hello world)");
builderString.append("Tj\n");
builderString.append("ET");
...
PDStream stream= ...;
stream.createOutputStream().write(builderString.toString().getBytes("ISO-8859-1"));
Everything works well, but if I put some Unicode characters into builderString, "???" appears instead of the text.
that's sample PDF: link here
QUESTION 1) When I look at the PDF structure, there are question marks instead of the text. How do I write Unicode characters?
9 0 obj
<<
/Type /XObject
/Subtype /Form
/BBox [100 50 0 0]
/Matrix [1 0 0 1 0 0]
/Resources <<
/Font 11 0 R
/XObject <<
/img0 12 0 R
>>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
>>
/FormType 1
/Length 13 0 R
>>
stream
q 93.70079 0 0 50 0 0 cm /img0 Do Q
BT
1 0 0 1 93.70079 25 Tm
/F1 2
Tf
(????)Tj
ET
endstream
endobj
My font uses WinAnsiEncoding. Can I use another encoding in PDFBox?
PDFont font = PDTrueTypeFont.loadTTF(template, new File("//fontName.ttf"));
font.setFontEncoding(new WinAnsiEncoding());
QUESTION 2) I've embedded a font in the PDF, but the text is not written with this font (in the visible signature rectangle). Why?
Question 3) When I remove the font, the text is still there (when the text is in English). What is the default font? /F1 - which font is that, the 1st one?
Question 4) How do I calculate the width of my text in the visible signature? Any ideas?

QUESTION 1) When I look at the PDF structure, there are question marks instead of the text. How do I write Unicode characters?
I assume that with unicode characters you mean characters present in Unicode but not in e.g. Latin-1. (Because the letter 'a' for example does have a Unicode representation, too, but most likely won't cause you trouble.)
You call getBytes("ISO-8859-1") on your StringBuilder result. Your unicode characters most likely are not in ISO 8859-1. Thus, String.getBytes returns the ASCII code for a question mark in their respective place.
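The same effect is easy to reproduce outside Java. The Python sketch below is only an analogue of String.getBytes: passing errors="replace" mimics the substitution of '?' for characters the target encoding cannot represent.

```python
def to_latin1(text):
    """Encode to ISO 8859-1, substituting '?' for unencodable characters."""
    return text.encode("iso-8859-1", errors="replace")

print(to_latin1("hello"))                      # unchanged
print(to_latin1("\u05e9\u05dc\u05d5\u05dd"))   # four Hebrew letters
```

The four Hebrew letters come back as four question marks, which is exactly what ends up in the (????)Tj of your stream.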
If the question was merely how to write to an output stream with unicode characters in Java, the answer would be easy: Choose an encoding which contains all your characters, e.g. UTF-8, which all consumers of your program support, and call String.getBytes for that encoding.
The case at hand is different, though, as you want to serialize that information as a PDF form xobject stream. In this context your whole approach is somewhere along the route from highly questionable to completely wrong:
In PDFs, each font might come along with its own encoding which might be similar to a common encoding, e.g. /WinAnsiEncoding, or completely custom. These encodings, furthermore, in many cases are restricted to one byte per character, but in case of composite fonts they can also be multi-byte-encodings.
As a corollary, not all elements of the stream need to be encoded using the same encoding. E.g. the operator names Tm, Tf, and Tj are encoded using their ASCII codes while the characters of a string to be displayed have to be encoded using the respective font's encoding (and may thereafter be yet again hex-encoded if added in sharp brackets <>).
Thus, creating the stream as a string and then converting it to bytes with a single encoding only works if all used fonts use the same encoding (for the actually used code points) which furthermore needs to be ASCII'ish to correctly represent the operators.
Essentially, you should directly construct the stream in some byte buffer and for each inserted element use the appropriate encoding. In case of characters to be displayed, therefore, you have to be aware of the encoding used by the currently selected font.
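As a language-neutral illustration of that idea (in Python rather than PDFBox's Java API), the sketch below builds the stream in a byte buffer. Operators are written as ASCII, while the text goes through a per-font lookup and is hex-encoded. The glyph_codes table is hypothetical: for a real font you would have to derive it from the font program and the encoding you declare for it.

```python
def show_text_operator(text, glyph_codes):
    """Hex-encode text for a two-byte font encoding and append Tj."""
    # A missing character raises KeyError instead of silently becoming '?'.
    hex_codes = "".join(f"{glyph_codes[ch]:04X}" for ch in text)
    return f"<{hex_codes}> Tj\n".encode("ascii")

def build_stream(font_name, size, x, y, text, glyph_codes):
    """Assemble a minimal text object; each element uses its own encoding."""
    buf = bytearray()
    buf += b"BT\n"
    buf += f"/{font_name} {size} Tf\n".encode("ascii")
    buf += f"1 0 0 1 {x} {y} Tm\n".encode("ascii")
    buf += show_text_operator(text, glyph_codes)
    buf += b"ET\n"
    return bytes(buf)

# Hypothetical character-to-code table for the selected font.
codes = {"\u05e0": 0x02B0, "\u05e9": 0x02B9}
print(build_stream("F0", 12, 10, 20, "\u05e0\u05e9", codes).decode("ascii"))
```

Failing loudly on an unmapped character is a deliberate choice here: it surfaces an encoding gap at generation time instead of in the finished PDF.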
If you want to do it right, first study the PDF specification ISO 32000-1, especially the sections on general syntax and chapter 9 Text.
QUESTION 2) I've embedded font in PDF. but text is not written with this font (in visible signature Rectangle). Why?
In the resources of the stream xobject in question there is exactly one embedded font associated to the name /F0. In your stream, though, you have /F1 2 Tf, i.e. you select a font /F1 at size 2.
Question 3) when I remove font, text was still there (when the text was in english). what is the default font?
According to the specification, section 9.3.1,
font shall be the name of a font resource in the Font subdictionary of the current
resource dictionary [...]
There is no initial value for either font or size
Most likely, though, PDF viewers for the sake of compatibility with old or broken documents use some default font.
Question 4) How to calculate width of my text in visible signature ? Any ideas?
The width obviously depends on the metrics of the font used (glyph widths in this case) and the graphics state you set (font size, character spacing, word spacing, current transformation matrix, text transformation matrix, ...).
In your case you hardly do anything in the graphics state and, therefore, only the selected font size from it is of interest. So the more interesting part is the character widths from the font metrics. As long as you use the standard 14 fonts, you find the metrics here. As soon as you start using other, custom fonts, you have to read them from the font definition files yourself.
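For the simple case described here, the arithmetic can be sketched as follows. Glyph widths are expressed in thousandths of a text space unit, so each one is scaled by the font size; the width table below is made up for illustration, and horizontal scaling and the transformation matrices are deliberately left out.

```python
def text_width(text, widths, font_size, char_spacing=0.0, word_spacing=0.0):
    """Sum the advances of each character in text space units."""
    total = 0.0
    for ch in text:
        total += widths[ch] / 1000.0 * font_size + char_spacing
        if ch == " ":
            total += word_spacing  # word spacing applies to the space character
    return total

# Illustrative widths only; read the real ones from your font's metrics.
sample_widths = {"h": 500, "i": 250, " ": 250}
print(text_width("hi", sample_widths, 10))
```

With the identity matrices used in your stream, the result is directly the width in user space units of the form xobject.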

Ad 1)
Could it be that
stream.createOutputStream().write(builderString.toString().getBytes("ISO-8859-1"));
should be
stream.createOutputStream().write(builderString.toString().getBytes("UTF-8"));
The conversion to ISO-8859-1 in getBytes would turn any character missing from ISO-8859-1 into a ?.
