Page 17 of the PDF 1.7 spec indicates that /lime#20Green should produce Lime Green. Is this an errata? I see nothing in the spec about capitalizing the first character of a name, and the example just below (paired#28#29parenthesis) does not correct capitalization. One of those examples is incorrect, but I'm not confident about which.
Do PDF name objects require capitalization?
No. The table entry in ISO 32000-1:2008 you refer to is erroneous. In ISO 32000-2:2017 you find the Lime Green line fixed:
Related
I occasionally encounter some special character while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look like conjuncted and they actually exist in PDF string as one character.
I checked the 'ToUnicode' for these characters, but I just found the 'ToUnicode' CMap table are disrupted, therefore I cannot find their unicode.
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Could anybody let me know their unicode code point? Possible to find it from the corresponding font program?
Thanks for any advice.
fi:
tt:
ti:
First of all, what you call letter conjunctions usually is known as ligatures. Thus, I will use that term here from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to attempt to find the correct ToUnicode mapping for ligature glyphs. For this simply remember that the values of ToUnicode mappings do not need to be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
If you want to use ligature code points nonetheless, query the Wikipedia article on Orthographic Ligatures, it lists some ligature code points. In particular <FB01> for fi, so for your example:
<012E> <FB01>
But remember, their use is discouraged.
When I type "Col·laborador" I get exactly this with other letter types:
but with Merryweather Sans, I get "Coll·laborador":
I use ttf-merriweather-sans 1.006-3 in ArchLinux
The problem lies in that font's OpenType definition.
Its author included a locl OpenType feature for the language code CAT (Catalan) that was supposed to replace the combination /l' /cdot' /l with /ldot. The ' characters after the first two names indicate what characters are to be replaced by the new one (http://www.adobe.com/devnet/opentype/afdko/topic_feature_file_syntax.html#5.e). However, the author made a mistake there:
replace /periodcentered after /l before /l with /ldot
and so it replaces just the dot character with ŀ. That means that you get this
with an OpenType-aware program such as Adobe InDesign (and only with the applied language set to "Catalan"). Note the double l, as in your own findings.
The output from another program confirms this; characters that do not take part of a replacement are in gray, replaced characters are in black.
Both /ls are in gray, only the center dot is black. A correct definition should have shown the left hand l and L in black as well.
The font you link to is version 1.006; an older Merriweather Sans that I found (1.003) does not contain this particular OpenType feature and so does not display this error.
To fix it, adjust the OpenType code and regenerate it from its source, or contact its author. Contact details are on the Google Fonts page for this font. (I have left a message in there, referring to this question.)
I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (listings below not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications, which do convert fonts to so-called 'outlines'. The reason for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
In this case, text extraction (and your method of grepping for strings) should work:
Even though the column named uni (telling if a toUnicode map is embedded in the PDF file)
says no for each single font, the encoding column does not contain custom, but builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1.
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!
I am working with FontNames from PDF documents and wish to convert them to one of the 14 standard fonts if possible. An example is:
KAIKCD+Helvetica-Oblique
Are there standards for the punctuation and values of the pre/suf/fixes? I have found both -Oblique .I and -Italic as suffixes all - presumably - meaning Italic. Or are the names semantically void?
The prefix is as defined in the PDF specification:
For a font subset, the PostScript name of the font — the value of the
font’s BaseFont entry and the font descriptor’s FontName entry — shall
begin with a tag followed by a plus sign (+). The tag shall consist of
exactly six uppercase letters; the choice of letters is arbitrary, but
different subsets in the same PDF file shall have different tags.
Otherwise there merely are some common naming patterns; Jon Tan has tried to catalogue some here.
While rendering a PDF file generated by PDFCreator 0.9.x. I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about, Acrobat does wonders in rendering faulty PDF files hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I trief to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (See stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The char is referenced by char #1. Given the Encoding of the font it contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now in the embedded subset font there is a cmap that matches no glyph at character 65 (eg capital A) the cmap section of the font does define the character in exactly the order in the PDF file Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs..
Solution
In the ISO32000 file there is a remark that symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the encoding is not allowed and you should IGNORE it, using a simple 1on1 encoding always. SO all in all, if it is a symbolic font, I ignore the Encoding object altogether and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so its almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.