DITA OT printing '#' instead of Chinese characters in PDF

I am very new to DITA OT. I downloaded DITA-OT1.5.4_full_easy_install_bin and am playing around with it. I'm trying to print a few characters in Simplified Chinese (zh-CN) into a PDF. The characters are printed correctly in XHTML, but in the PDF they are printed as "#".
On the command line I see this: Warning: Glyph "?" (0x611f) not available in font "Helvetica".
Here are the things I have tried so far:
In demo\fo\fop\conf\fop.xconf :
<fonts>
<font kerning="yes"
embed-url="file:///C:/Windows/Fonts/simsun.ttc"
embedding-mode="subset" encoding-mode="cid">
<font-triplet name="SimSun" style="normal" weight="normal"/>
</font>
<auto-detect/>
<directory recursive="true">C:\Windows\Fonts</directory>
</fonts>
In demo\fo\cfg\fo\attrs\custom.xsl :
<xsl:attribute-set name="__fo__root">
<xsl:attribute name="font-family">SimSun</xsl:attribute>
</xsl:attribute-set>
In demo\fo\cfg\fo\font-mappings.xml I added this block for the Sans, Serif and Monospaced logical fonts:
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
In samples\concepts\garageconceptsoverview.xml :
<shortdesc xml:lang="zh_CN">職業道德感.</shortdesc>
And this is the command I am using to generate the PDF:
ant -Dargs.input=samples\hierarchy.ditamap -Dtranstype=pdf
Any help would be appreciated. Thanks.
[EDIT]
I see that the topic.fo file generated in the temp folder does contain the Chinese characters correctly, like this:
<fo:block font-size="10pt" keep-with-next.within-page="5" start-indent="25pt">職業道德感.</fo:block>
But I do not see any font-related information anywhere in this document.

First of all, you should set the xml:lang="zh_CN" attribute on the root elements of all DITA topics and maps. This helps the DITA OT publishing decide which language to use for static text like "Table X" and which charset to use for the font mappings.
Then run the publishing with the "clean.temp" parameter set to "no".
After publishing, look in the temporary files folder for a file called "topic.fo" and check which font families it uses. Even if you set a font on the root element, other places in the XSL-FO file set font families explicitly.
So instead of setting a font on the XSL-FO root element, edit the font mappings XML file and, for each of the logical fonts "Sans" and "Serif", configure the actual font family to use for the Chinese charset, something like:
<logical-font name="Sans">
.........
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
......
</logical-font>
More about how the font mappings work:
https://www.oxygenxml.com/doc/versions/17.0/ug-editor/#topics/DITA-map-set-font-Apache-FOP.html
Update:
If you insist on keeping the XSLT customization that sets "SimSun" as the font family on the root element, then you need to define a new mapping for your alias in font-mappings.xml:
<aliases>
<alias name="SimSun">SimSun</alias>
</aliases>
and then map the logical font to a physical one in the same font-mappings.xml:
<logical-font name="SimSun">
<physical-font char-set="Simplified Chinese">
<font-face>SimSun</font-face>
</physical-font>
</logical-font>

0x611f is a Chinese character (感). Helvetica is a European font, so it does not contain this character. Search for the place where the Helvetica font is set: at that position your content (ditamap/dita) should use a Chinese font, not a European one. Find the attribute that contains font-family="Helvetica" and, in your own plugin, change its value to "SimSun, Helvetica".
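A small diagnostic sketch (not part of DITA-OT or FOP) for reading such warnings: it turns the hex value from the message into the actual character and reports whether it is a Han ideograph, which a Latin font like Helvetica should not be expected to contain. The class and method names are made up for illustration.

```java
final class GlyphWarning {
    private GlyphWarning() {}

    /** Turns a hex code point such as "0x611f" (as printed in the warning) into the character itself. */
    static String charFromHex(String hex) {
        // Accept the value with or without a leading "0x".
        String digits = (hex.startsWith("0x") || hex.startsWith("0X")) ? hex.substring(2) : hex;
        int codePoint = Integer.parseInt(digits, 16);
        return new String(Character.toChars(codePoint));
    }

    /** True if the code point belongs to the Han (CJK ideograph) script. */
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }
}
```

Calling GlyphWarning.isHan(0x611f) on the value from the warning tells you that the text at that point needs a CJK-capable font such as SimSun.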

Sorry, I cannot answer your question, but you should definitely try a newer DITA-OT from http://dita-ot.github.io/. Your DITA-OT version is not supported anymore. Maybe your problem goes away with the latest release.

Related

Nbspace not available

I am using PDFBox 2.0.9.
I have a PDF that contains only an AcroForm, and I want to set a non-breaking space (nbspace) character as the value of a field:
field.setValue("\u00A0");
But I get an error:
java.lang.IllegalArgumentException: U+00A0 ('nbspace') is not available in this font Courier encoding: WinAnsiEncoding
I understand that the font of the current field does not support this character.
How can I get the list of fonts available in my PDF with PDFBox 2.0.14?
This topic might be related: How to print `Non-breaking space` to a pdf using apache pdf box?

The text fields in your PDF use the font Helv.
The AcroForm resources font Helv is defined with the following encoding:
5 0 obj
<<
/Type/Encoding
/Differences[
24/breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
39/quotesingle
96/grave
128/bullet/dagger/daggerdbl/ellipsis/emdash/endash/florin/fraction
/guilsinglleft/guilsinglright/minus/perthousand/quotedblbase/quotedblleft
/quotedblright/quoteleft/quoteright/quotesinglbase/trademark/fi/fl/Lslash
/OE/Scaron/Ydieresis/Zcaron/dotlessi/lslash/oe/scaron/zcaron
160/Euro
164/currency
166/brokenbar
168/dieresis/copyright/ordfeminine
172/logicalnot/.notdef/registered/macron/degree/plusminus/twosuperior
/threesuperior/acute/mu
183/periodcentered/cedilla/onesuperior/ordmasculine
188/onequarter/onehalf/threequarters
192/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla
/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex
/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis
/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn
/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae
/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute
/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde
/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute
/thorn/ydieresis
]
>>
endobj
As there is no font program embedded for this font, this encoding is based on the StandardEncoding. This base encoding does not contain a non-breaking space. Furthermore your Differences array does not add nbspace either.
Thus, you cannot draw a non-breaking space using that encoding and, therefore, also not using that Helv font.
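The Differences check above can be reproduced with a few lines of Java. This is only a sketch, assuming the body of the array has been copied out of the PDF as plain text; DifferencesArray and its parse method are hypothetical helpers, not PDFBox API. It follows the usual reading of a Differences array: a number sets the current code, and each glyph name that follows is assigned to that code, which then increments by one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DifferencesArray {
    private DifferencesArray() {}

    /** Parses the body of a /Differences array, e.g. "39/quotesingle 96/grave", into code -> glyph name. */
    static Map<Integer, String> parse(String body) {
        Map<Integer, String> codeToName = new LinkedHashMap<>();
        int code = 0;
        // Put a space in front of every '/' so that numbers and names split cleanly on whitespace.
        for (String token : body.replace("/", " /").trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input
            }
            if (token.startsWith("/")) {
                codeToName.put(code++, token.substring(1)); // glyph name for the current code
            } else {
                code = Integer.parseInt(token); // a number restarts the code sequence
            }
        }
        return codeToName;
    }
}
```

Parsing the array shown above and calling containsValue("nbspace") on the result returns false, and code 160 comes out mapped to Euro rather than to a non-breaking space.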
As far as I know, PDFBox does not supply replacement fonts in such a case, i.e. if asked to create a new text field appearance while setting a value which contains a character not supported in the form field default appearance font encoding.
One workaround might be to not ask PDFBox to generate an appearance in the first place: instead, mark the AcroForm with a NeedAppearances value of true and hope that a later PDF processor or viewer uses a replacement font in such a case. There is no guarantee this works; the next processor needing appearances probably doesn't supply replacement fonts either. Nonetheless, there is at least a chance it does.
Depending on the exact version of PDFBox, though,
field.setValue(value);
may always trigger appearance generation. If that is the case for you, you have to set the field value like this
field.getCOSObject().setString(COSName.V, value);

No glyph for U+000D in font Helvetica

How can I solve this for PDFBox with boxable?
In table.draw I am getting:
No glyph for U+000D in font Helvetica
What should I do? I am building a table with boxable.

That error tells you that the strings you use to fill the tables contain CR (carriage return) characters.
Do not use control characters (like CR, LF, TAB, ...) in those strings: your software stack does not interpret them as something like a line break; instead it tries to look them up as glyphs in the font, which fails.
If you need to break lines in boxable tables, try using <p> or <br> instead. According to their README, they support
HTML tags in cell content (not all! <p>,<i>,<b>,<br>,<ul>,<ol>,<li>)
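One way to apply this advice is to clean each string before it goes into a cell. The following is a minimal sketch using only the Java standard library; CellText.sanitize is a hypothetical helper, not part of boxable or PDFBox, and turning tabs into spaces is an assumption about what you want.

```java
final class CellText {
    private CellText() {}

    /** Replaces line breaks with <br> and drops the remaining ASCII control characters. */
    static String sanitize(String raw) {
        // Normalise CRLF, lone CR and lone LF to a single HTML line break.
        String withBreaks = raw.replaceAll("\\r\\n|\\r|\\n", "<br>");
        // Turn tabs into spaces, then remove any other control characters.
        return withBreaks.replace('\t', ' ').replaceAll("\\p{Cntrl}", "");
    }
}
```

The result of CellText.sanitize(text) can then be passed to the cell instead of the raw string, so that no CR or LF ever reaches the font lookup.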

Extract information from tables in TrueType font file

While parsing a PDF file, my parser encounters a Tf operator where the Subtype entry in the font dictionary is set to TrueType. The Encoding entry is not present, and the symbolic flag is set.
My question is: how am I supposed to map the character codes to characters with no encoding?
The PDF Reference, section 5.5.5 "Character Encoding", states that a TrueType font has internal data represented in tables in the font file. It seems that those tables would help me map the character codes. Am I getting it right? How can I extract that information from the font file?
The font file extracted from the PDF gave something like:
I read Apple's documentation, The TrueType Font File, but I still don't understand how to extract that information from those tables.
Any help, links or reading suggestion would be greatly appreciated.

The symbolic flag means that the encoding is restricted to the [0..255] range. Every character code must be in this range, and the font provides glyphs only for these codes.
Here is a great set of resources regarding TrueType and OpenType font formats.

You can use the FreeType library function FT_Get_Char_Index to go from a character code to a glyph index.
You'll have to dump the TrueType font to a file and load it with FreeType to get an FT_Face first.

While rendering a PDF file generated by PDFCreator 0.9.x, I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to wonder about: Acrobat does wonders in rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) contains an embedded font with a single glyph. So far so good. The character is referenced as char #1. The Encoding of the font contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now, in the embedded subset font there is a cmap that matches no glyph at character 65 (i.e. capital A); instead, the cmap section of the font defines the characters in exactly the order of the PDF file's Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only files from PDFCreator 0.9.x seem to be affected.
My question is: is this correct (or did I make a mistake and is the PDF correct), and what would you do to detect this situation in order to solve the rendering problem?
Note: I do need to be able to render these PDFs.
Solution
In the ISO 32000 specification there is a remark that for symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the Encoding is not allowed and you should IGNORE it, always using a simple one-to-one encoding. So, all in all, if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.

The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact, it opens and renders correctly in a wide range of PDF consumers, so it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe that in your case you simply need to use the character code (0x01) directly in the cmap subtable. This will give you a GID of 36.
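The rule discussed in this thread can be sketched in Java as follows. This is a simplified illustration, not code from any PDF library: TrueTypeCodes, cmapKey and the encodingToUnicode map (character code to Unicode value, derived from the glyph names in the Encoding) are hypothetical, and the sketch ignores details such as which cmap subtable is selected.

```java
import java.util.Map;

final class TrueTypeCodes {
    private TrueTypeCodes() {}

    // Symbolic is flag bit 3 of the font descriptor's /Flags entry (bits counted from 1), i.e. value 4.
    static final int SYMBOLIC = 1 << 2;

    static boolean isSymbolic(int flags) {
        return (flags & SYMBOLIC) != 0;
    }

    /** Returns the value to look up in the embedded font's cmap for a given character code. */
    static int cmapKey(int charCode, int flags, Map<Integer, Integer> encodingToUnicode) {
        if (isSymbolic(flags)) {
            // Symbolic font: ignore the Encoding and use the character code directly.
            return charCode;
        }
        // Nonsymbolic font: go through the Encoding (code -> glyph name -> Unicode value).
        Integer unicode = encodingToUnicode.get(charCode);
        return unicode != null ? unicode : charCode;
    }
}
```

For the example file, a symbolic font with Differences [ 1 /A ] yields cmap key 1, whereas applying the Encoding would yield 65, which the subset font's cmap does not contain.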

TCPDF font conversion results in missing glyphs

I'm using the TCPDF library to generate server-side PDFs daily in a cronjob. This library takes UTF8 strings from the DB and writes them into a PDF using the Arial Unicode MS font (also embedding it in the PDF).
To be able to use this font, I had to convert it to a PHP-friendly format following these instructions: http://www.tcpdf.org/fonts.php
However, while most of the languages seem right (glyphs are correct in Hebrew, Chinese, Japanese, Portuguese, etc.), Korean glyphs appear as squared boxes in the PDF.
I noticed many (hundreds of) errors while running the ttf2ufm binary described in the link above:
Previous entry type: M
Warning: **** closepath on empty path in glyph "_d_8235" ****
I'm suspecting this has to do with this issue (not being able to correctly convert those couple of hundred glyphs, thus resulting in an invalid font file).
Am I doing something wrong? Or is this just a limitation of this library?

The latest TCPDF version automatically converts fonts into TCPDF format using the addTTFfont() method. The old font programs and scripts were removed.
For example:
// convert TTF font to TCPDF format and store it on the fonts folder
$fontname = $pdf->addTTFfont('/path-to-font/FreeSerifItalic.ttf', 'TrueTypeUnicode', '', 96);
// use the font
$pdf->SetFont($fontname, '', 14, '', false);
