When I type "Col·laborador" I get exactly this with other letter types:
but with Merryweather Sans, I get "Coll·laborador":
I use ttf-merriweather-sans 1.006-3 in ArchLinux
The problem lies in that font's OpenType definition.
Its author included a locl OpenType feature for the language code CAT (Catalan) that was supposed to replace the combination /l' /cdot' /l with /ldot. The ' characters after the first two names indicate what characters are to be replaced by the new one (http://www.adobe.com/devnet/opentype/afdko/topic_feature_file_syntax.html#5.e). However, the author made a mistake there:
replace /periodcentered after /l before /l with /ldot
and so it replaces just the dot character with ŀ. That means that you get this
with an OpenType-aware program such as Adobe InDesign (and only with the applied language set to "Catalan"). Note the double l, as in your own findings.
The output from another program confirms this; characters that do not take part of a replacement are in gray, replaced characters are in black.
Both /ls are in gray, only the center dot is black. A correct definition should have shown the left hand l and L in black as well.
The font you link to is version 1.006; an older Merriweather Sans that I found (1.003) does not contain this particular OpenType feature and so does not display this error.
To fix it, adjust the OpenType code and regenerate it from its source, or contact its author. Contact details are on the Google Fonts page for this font. (I have left a message in there, referring to this question.)
Related
I occasionally encounter some special character while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look like conjuncted and they actually exist in PDF string as one character.
I checked the 'ToUnicode' for these characters, but I just found the 'ToUnicode' CMap table are disrupted, therefore I cannot find their unicode.
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Could anybody let me know their unicode code point? Possible to find it from the corresponding font program?
Thanks for any advice.
fi:
tt:
ti:
First of all, what you call letter conjunctions usually is known as ligatures. Thus, I will use that term here from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to attempt to find the correct ToUnicode mapping for ligature glyphs. For this simply remember that the values of ToUnicode mappings do not need to be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
If you want to use ligature code points nonetheless, query the Wikipedia article on Orthographic Ligatures, it lists some ligature code points. In particular <FB01> for fi, so for your example:
<012E> <FB01>
But remember, their use is discouraged.
I am using pdfbox 2.0.9
I have a pdf with acrofrom only and I want set nbspace character to a field:
field.setValue("\u00A0");
But I get error:
java.lang.IllegalArgumentException: U+00A0 ('nbspace') is not available in this font Courier encoding: WinAnsiEncoding
I understand font on current field is not supporting these character.
How can I with pdfbox2.0.14 get pdf fonts list available on my pdf?
This topic might be related How to print `Non-breaking space` to a pdf using apache pdf box?
The text fields in your PDF use the font Helv.
The AcroForm resources font Helv is defined with the following encoding:
5 0 obj
<<
/Type/Encoding
/Differences[
24/breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
39/quotesingle
96/grave
128/bullet/dagger/daggerdbl/ellipsis/emdash/endash/florin/fraction
/guilsinglleft/guilsinglright/minus/perthousand/quotedblbase/quotedblleft
/quotedblright/quoteleft/quoteright/quotesinglbase/trademark/fi/fl/Lslash
/OE/Scaron/Ydieresis/Zcaron/dotlessi/lslash/oe/scaron/zcaron
160/Euro
164/currency
166/brokenbar
168/dieresis/copyright/ordfeminine
172/logicalnot/.notdef/registered/macron/degree/plusminus/twosuperior
/threesuperior/acute/mu
183/periodcentered/cedilla/onesuperior/ordmasculine
188/onequarter/onehalf/threequarters
192/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla
/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex
/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis
/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn
/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae
/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute
/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde
/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute
/thorn/ydieresis
]
>>
endobj
As there is no font program embedded for this font, this encoding is based on the StandardEncoding. This base encoding does not contain a non-breaking space. Furthermore your Differences array does not add nbspace either.
Thus, you cannot draw a non-breaking space using that encoding and, therefore, also not using that Helv font.
As far as I know, PDFBox does not supply replacement fonts in such a case, i.e. if asked to create a new text field appearance while setting a value which contains a character not supported in the form field default appearance font encoding.
One work-around might be to not ask PDFBox to generate an appearance to start with, instead mark the AcroForm with a NeedAppearances value true, and hope a later PDF processor / viewer does use a replacement font in such a case. There is no guarantee this works, probably the next processor needing appearances also doesn't supply replacement fonts. Nonetheless, there at least is a chance it does...
Depending on the exact version of PDFBox, though,
field.setValue(value);
may always trigger appearance generation. If that is the case for you, you have to set the field value like this
field.getCOSObject().setString(COSName.V, value);
I've been looking into a PDF file to understand how it is built.
I noticed that InDesign has created PDFs with text as below (after decompression using pdftk).
0 Tc /Span<</ActualText<FEFF0009>>> BDC
4.018 -0.2 Td
( )Tj
I understand the role of ActualText (for copy/paste/searching) but I'm wondering exactly how I should be interpreting the FEFF0009. It looks like a UTF-16 string with BOM chars to represent a tab character. This seems incorrect as it's really a space. I'm wondering if there is a special meaning here?
.. This seems incorrect as it's really a space.
No, it's really a tab.
14.9.4 Replacement Text
NOTE 1: Just as alternate descriptions can be provided for images and other items that do not translate naturally into text (as described in the preceding sub-clause), replacement text can be specified for content that does translate into text but that is represented in a nonstandard way.
(PDF 32000-1:2008)
The PDF text engine does not support the concept of 'tabs'. In this case, InDesign mimicked the function of a tab character by inserting a space in the text stream, and it could set the space width to match the distance spanned by the original tab or use a large relative positioning for the rest of the text (which it did here: the horizontal displacement of 4.018 in your code snippet).
The general idea is that a space is rendered on the position of the tab, but when you copy this text and paste somewhere else you get a tab character. I suppose the 'space' is only inserted to have something to copy.
You can have 80-characters / right margin line in Netbeans, Text Mate and probably many, many more other IDEs. Is it possible to have it in Sublime Text 3 as well? Any option, plugin etc.?
Yes, it is possible in Sublime Text 2, ST3, and ST4 (which you should really upgrade to if you haven't already). Select View → Ruler → 80 (there are several other options there as well). If you like to actually wrap your text at 80 columns, select View → Word Wrap Column → 80. Make sure that View → Word Wrap is selected.
To make your selections permanent (the default for all opened files or views), open Preferences → Settings and use any of the following rules in the right-side pane:
{
// set vertical rulers in specified columns.
// Use "rulers": [80] for just one ruler
// default value is []
"rulers": [80, 100, 120],
// turn on word wrap for source and text
// default value is "auto", which means off for source and on for text
"word_wrap": true,
// set word wrapping at this column
// default value is 0, meaning wrapping occurs at window width
"wrap_width": 80
}
These settings can also be used in a .sublime-project file to set defaults on a per-project basis, or in a syntax-specific .sublime-settings file if you only want them to apply to files written in a certain language (Python.sublime-settings vs. JavaScript.sublime-settings, for example). Access these settings files by opening a file with the desired syntax, then selecting Preferences → Settings—Syntax Specific.
As always, if you have multiple entries in your settings file, separate them with commas , except for after the last one. The entire content should be enclosed in curly braces { }. Basically, make sure it's valid JSON.
If you'd like a key combo to automatically set the ruler at 80 for a particular view/file, or you are interested in learning how to set the value without using the mouse, please see my answer here.
Finally, as mentioned in another answer, you really should be using a monospace font in order for your code to line up correctly. Other types of fonts have variable-width letters, which means one 80-character line may not appear to be the same length as another 80-character line with different content, and your indentations will look all messed up. Sublime has monospace fonts set by default, but you can of course choose any one you want. Personally, I really like Liberation Mono. It has glyphs to support many different languages and Unicode characters, looks good at a variety of different sizes, and (most importantly for a programming font) clearly differentiates between 0 and O (digit zero and capital letter oh) and 1 and l (digit one and lowercase letter ell), which not all monospace fonts do, unfortunately. Version 2.0 and later of the font are licensed under the open-source SIL Open Font License 1.1 (here is the FAQ).
For this to work, your font also needs to be set to monospace.
If you think about it, lines can't otherwise line up perfectly perfectly.
This answer is detailed at sublime text forum:
http://www.sublimetext.com/forum/viewtopic.php?f=3&p=42052
This answer has links for choosing an appropriate font for your OS,
and gives an answer to an edge case of fonts not lining up.
Another website that lists great monospaced free fonts for programmers.
http://hivelogic.com/articles/top-10-programming-fonts
On stackoverflow, see:
Michael Ruth's answer here:
How to make ruler always be shown in Sublime text 2?
MattDMo's answer here:
What is the default font of Sublime Text?
I have rulers set at the following:
30
50 (git commit message titles should be limited to 50 characters)
72 (git commit message details should be limited to 72 characters)
80 (Windows Command Console Window maxes out at 80 character width)
Other viewing environments that benefit from shorter lines:
github: there is no word wrap when viewing a file online
So, I try to keep .js .md and other files at 70-80 characters.
Windows Console: 80 characters.
While rendering a PDF file generated by PDFCreator 0.9.x. I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about, Acrobat does wonders in rendering faulty PDF files hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I trief to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (See stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The char is referenced by char #1. Given the Encoding of the font it contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now in the embedded subset font there is a cmap that matches no glyph at character 65 (eg capital A) the cmap section of the font does define the character in exactly the order in the PDF file Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs..
Solution
In the ISO32000 file there is a remark that symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the encoding is not allowed and you should IGNORE it, using a simple 1on1 encoding always. SO all in all, if it is a symbolic font, I ignore the Encoding object altogether and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so its almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.