Nbspace not available - pdfbox

I am using PDFBox 2.0.9.
I have a PDF with an AcroForm only, and I want to set the nbspace character as a field's value:
field.setValue("\u00A0");
But I get this error:
java.lang.IllegalArgumentException: U+00A0 ('nbspace') is not available in this font Courier encoding: WinAnsiEncoding
I understand that the font on the current field does not support this character.
How can I get, with PDFBox 2.0.14, the list of fonts available in my PDF?
This topic might be related: How to print `Non-breaking space` to a pdf using apache pdf box?

The text fields in your PDF use the font Helv.
In the AcroForm default resources, the font Helv is defined with the following encoding:
5 0 obj
<<
/Type/Encoding
/Differences[
24/breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
39/quotesingle
96/grave
128/bullet/dagger/daggerdbl/ellipsis/emdash/endash/florin/fraction
/guilsinglleft/guilsinglright/minus/perthousand/quotedblbase/quotedblleft
/quotedblright/quoteleft/quoteright/quotesinglbase/trademark/fi/fl/Lslash
/OE/Scaron/Ydieresis/Zcaron/dotlessi/lslash/oe/scaron/zcaron
160/Euro
164/currency
166/brokenbar
168/dieresis/copyright/ordfeminine
172/logicalnot/.notdef/registered/macron/degree/plusminus/twosuperior
/threesuperior/acute/mu
183/periodcentered/cedilla/onesuperior/ordmasculine
188/onequarter/onehalf/threequarters
192/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla
/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex
/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis
/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn
/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae
/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute
/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde
/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute
/thorn/ydieresis
]
>>
endobj
As there is no font program embedded for this font, this encoding is based on StandardEncoding. That base encoding does not contain a non-breaking space, and the Differences array shown above does not add nbspace either.
Thus, you cannot draw a non-breaking space using that encoding and, therefore, also not using that Helv font.
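If you want to check programmatically which fonts the AcroForm default resources provide, and whether a given font can encode the character you need, a minimal sketch along these lines may help (form.pdf is a placeholder file name; this assumes the PDFBox 2.0.x API):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

public class ListAcroFormFonts {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("form.pdf"))) {
            PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
            PDResources defaultResources = acroForm.getDefaultResources();
            for (COSName fontName : defaultResources.getFontNames()) {
                PDFont font = defaultResources.getFont(fontName);
                try {
                    font.encode("\u00A0");   // throws if the font's encoding has no nbspace
                    System.out.println(fontName.getName() + ": can encode U+00A0");
                } catch (IllegalArgumentException e) {
                    System.out.println(fontName.getName() + ": cannot encode U+00A0");
                }
            }
        }
    }
}
```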
As far as I know, PDFBox does not supply replacement fonts in such a case, i.e. when it is asked to create a new text field appearance for a value that contains a character not supported by the encoding of the form field's default appearance font.
One work-around might be to not ask PDFBox to generate an appearance to start with: instead, mark the AcroForm with a NeedAppearances value of true and hope that a later PDF processor or viewer uses a replacement font in such a case. There is no guarantee this works; the next processor needing appearances probably doesn't supply replacement fonts either. Nonetheless, there at least is a chance it does...
Depending on the exact version of PDFBox, though,
field.setValue(value);
may always trigger appearance generation. If that is the case for you, you have to set the field value like this instead:
field.getCOSObject().setString(COSName.V, value);
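Putting the two together, a sketch of the NeedAppearances work-around could look like this (continuing from a loaded document; "myTextField" and out.pdf are placeholders):

```java
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
acroForm.setNeedAppearances(true);                    // let a later viewer generate the appearances

PDField field = acroForm.getField("myTextField");     // placeholder field name
field.getCOSObject().setString(COSName.V, "\u00A0");  // set the value without triggering appearance generation

document.save("out.pdf");
```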

Related

Writing Unicode into PDF

I have Unicode text (a sequence of Unicode codes) and a TTF font (bytes of a TTF file). I would like to write that text into a PDF file using that font.
I understand PDF quite well. I don't mind using two bytes per character. I would like to attach the TTF file as it is (charcode-to-glyf map should be used from a TTF file).
What font Subtype and Encoding value should I use? Is it possible to avoid having ToUnicode record?
I tried to use Subtype = "/TrueType", but it requires specifying FirstChar, LastChar and Widths (which are already inside the TTF).
You cannot use Unicode with a Font, at all (except in the limited case of Latin, or nearly Latin, languages), because Fonts use an Encoding, and an Encoding is a single byte array. So you can't reference more than 256 characters from a Font, and a character code can't be more than a single byte.
The first problem with 'using Unicode' is that Unicode is not a simple 2-byte encoding; it's a multi-byte format with variable lengths, and sometimes a single glyph is represented by multiple Unicode code points.
So, in order to deal with this you need to use a CIDFont, not a Font. You cannot 'use the charcode-to-glyf map', by which I assume you mean the CMAP subtable in the TTF font. You must compose the CIDFont with a CMap which maps the multi-byte character codes in the text string to CIDs, and those CIDs reference the precise character programs in the font.
It may be possible to construct a single CMap which would cover every Unicode code point, but I have my doubts; it would certainly be a huge task. However, certain CMaps already exist. Adobe publishes a standard list on their web site which includes CMaps such as UniCNS-UCS2-H, UniCNS-UCS2-V, UniGB-UTF8-H, etc.
You can probably use one of the standard CMaps.
Note that it doesn't matter that FirstChar, LastChar etc. are already stored in the TrueType font; you still need to specify them in the PDF Font object. That's because a PDF consumer might not be rendering the text at all; it could (for example) be extracting the text, in which case it doesn't need to interpret the font, provided this information is available.
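The question is not library-specific, but since this thread otherwise is, here is a minimal PDFBox sketch of the CIDFont route (font.ttf and out.pdf are placeholders; imports from java.io and org.apache.pdfbox.pdmodel omitted). PDType0Font.load embeds the TTF as a composite font, i.e. Subtype /Type0 with Encoding /Identity-H and a CIDFontType2 descendant, and also writes a ToUnicode CMap so text extraction keeps working:

```java
try (PDDocument document = new PDDocument()) {
    PDPage page = new PDPage();
    document.addPage(page);

    // Embeds the TTF as a composite font: Subtype /Type0, Encoding /Identity-H,
    // with a CIDFontType2 descendant referencing the embedded font program.
    PDType0Font font = PDType0Font.load(document, new File("font.ttf"));

    try (PDPageContentStream cs = new PDPageContentStream(document, page)) {
        cs.beginText();
        cs.setFont(font, 12);
        cs.newLineAtOffset(50, 700);
        cs.showText("Some Unicode text");   // any characters the TTF has glyphs for
        cs.endText();
    }

    document.save("out.pdf");
}
```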

PDF extracted text seems to be unreadable

Situation: I have a PDF using version 1.6. In that PDF there are several streams. There was compressed text (Flate) in those streams, so I decompressed them. After that, I extracted the Tj parts of the corresponding, decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual question: As I have no idea what I've got there, I would like to know what type of content it is. Furthermore: is it possible to get plain text out of these strings, or do I need further information to extract plain text?
Further research: The PDFs which I try to analyze were generated by iTextSharp (seems to be a C# library for generating PDFs). I don't know whether it is relevant information, but it might be that this library uses a special way of encrypting its text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular, in the case of embedded font subsets, you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.
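If the fonts in the PDF carry usable Encoding or ToUnicode information, a library can do this bookkeeping (tracking the current font, decoding the codes, reordering the strings) for you. A minimal PDFBox sketch, with document.pdf as a placeholder file name:

```java
try (PDDocument document = PDDocument.load(new File("document.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);   // reorder glyphs by position rather than drawing order
    System.out.println(stripper.getText(document));
}
```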

How can I change type 3 font using ghostscript?

I have a PostScript file which contains a Type 3 font. After converting that PostScript to PDF using the "gs" command, I am unable to extract the text from the PDF file. Is there any way to avoid or change Type 3 fonts to some other font, by substitution or some other means, so that I can copy the text?
This is another case of miscomprehension regarding Type 3 fonts. The fact that a font is a Type 3 font has little to do with whether a PostScript program or PDF file using the font is 'searchable' or not.
Fonts in PostScript and PDF have an 'Encoding' which maps the character codes 0-255 to a named procedure in the font. Executing that procedure draws the glyph. The character codes can be anything, but are often (for Latin fonts) chosen to match the ASCII encoding.
PDF has the additional concept of a ToUnicode CMap, additional information which maps a character code in a font to a Unicode code point. PostScript has no such analogue; that's not what PostScript is for (it's also not what PDF was originally for, which is why ToUnicode CMaps are a later addition to the PDF standard).
In the absence of a ToUnicode CMap Acrobat uses undocumented heuristics to try and guess what the text is. The obvious one (and the only one we know of) is that it treats the character codes as ASCII.
Now, if your original PostScript program has an encoding that maps the character codes as if they were ASCII, then, provided you do not subset the font, the resulting PDF file should also contain ASCII character codes. If you do subset the font then the pdfwrite device will reorder the glyphs and the character codes will no longer be ASCII.
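In that case, a hedged suggestion: disabling font subsetting when producing the PDF may keep the original character codes (input.ps and output.pdf are placeholders):

```
gs -sDEVICE=pdfwrite -dSubsetFonts=false -o output.pdf input.ps
```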
If your original PostScript file does not order the glyphs in the font using ASCII character codes, then there is nothing you can do other than apply OCR; the information simply is not present.
But forget about altering the font type, not only is it not likely to be possible, it isn't the problem.

PDF special char in TJ operator, base 14 fonts

Is there any way to use special characters like 'rcaron'(U+0159, ř) in TJ operator in base14 fonts (Helvetica)?
Something like [(\rcaron)] TJ ?
Is it present in the font?
I went through Helvetica.afm and it seems that this character is present in the font. Also when I use this character in an interactive textfield in PDF it seems to be present.
I tried to generate a sample file with PDFBox, but it fails: it uses TJ and the character is not correct.
Thanks a lot.
Concerning the character set PDF viewers must support for un-embedded base14 fonts, the PDF specification ISO 32000-1 states in section 9.6.2.2:
The character sets and encodings for these fonts are listed in Annex D.
and in annex D.1:
D.2, "Latin Character Set and Encodings", describes the entire character set for the Adobe standard Latin-text fonts. This character set shall be supported by the Times, Helvetica, and Courier font families, which are among the standard 14 predefined fonts; see 9.6.2.2, "Standard Type 1 Fonts (Standard 14 Fonts)".
If you inspect the tables in D.2, you'll see that rcaron is not explicitly supported, only scaron, zcaron, and a naked caron. The latter indicates that you can construct an rcaron. Unfortunately, though, the table states that the naked caron is not available in WinAnsiEncoding, which is the standard encoding assumed by PDFBox.
Thus, to draw an rcaron with the unembedded base14 Helvetica, you essentially will have to use a Helvetica font object with a non-WinAnsiEncoding encoding, e.g. MacRomanEncoding.
Furthermore you have to adapt the encoding of the strings added to your content streams. If you e.g. used to use PDPageContentStream.drawString(String), you'll have to change that because that method uses the COSString(String) constructor which implicitly assumes other encodings ("ISO-8859-1" or "UTF-16BE") not appropriate for the task at hand.
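For illustration, a Helvetica font dictionary of that kind could look like this (the object number is arbitrary); MacRomanEncoding does contain the naked caron glyph that WinAnsiEncoding lacks:

```
10 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
  /Encoding /MacRomanEncoding
>>
endobj
```

Strings shown with this font can then address the caron by its MacRomanEncoding code and overprint it onto a plain 'r' to construct the rcaron, as suggested above.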

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x, I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about; Acrobat does wonders in rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The glyph is referenced by character code 1. The Encoding of the font contains a Differences part: [ 1 /A ], thus char 1 -> character /A. Now, in the embedded subset font, the cmap matches no glyph at character 65 (e.g. capital A); instead, the cmap section of the font defines the characters in exactly the order given by the PDF file's Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs.
Solution
In ISO 32000 there is a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) the Encoding is not allowed and you should IGNORE it, always using a simple one-to-one encoding. So, all in all: if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
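A minimal sketch of that check, assuming flags holds the value of the /Flags entry of the font descriptor (the constant and method names are illustrative):

```java
// Font descriptor flag bits as defined in ISO 32000-1 (font flags).
static final int SYMBOLIC    = 1 << 2;  // bit 3, value 4
static final int NONSYMBOLIC = 1 << 5;  // bit 6, value 32

// For symbolic TrueType fonts, ignore the /Encoding entry and look the
// character code up directly in the font's built-in cmap.
static boolean ignoreEncoding(int flags) {
    return (flags & SYMBOLIC) != 0 && (flags & NONSYMBOLIC) == 0;
}
```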
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so it is in fact correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.