Is the /Widths array of a PDF font object redundant information? - pdf

A PDF reference states that the dictionary defining a font resource needs to contain a /Widths entry, described as follows:
(Required except for the standard 14 fonts; indirect reference
preferred) An array of ( LastChar − FirstChar + 1) widths, each
element being the glyph width for the character code that equals
FirstChar plus the array index. For character codes outside the range
FirstChar to LastChar , the value of MissingWidth from the
FontDescriptor entry for this font is used. The glyph widths are
measured in units in which 1000 units corresponds to 1 unit in text
space. These widths must be consistent with the actual widths given in
the font program. (See implementation note 61 in Appendix H.)
emphasis added.
What good is it to provide the widths again if they are obviously included in the font program?
Plainly: can somebody confirm or reject whether the information one is supposed to provide here, the glyph widths, is blatantly redundant, considering it is even mentioned as being contained in the font program?
Or do some font programs include glyphs without specifying their widths?
Is it because there are font programs that do not include the widths, or is this merely an exercise in patience, intended to complicate the generation of PDF files in the hope that people then stick to Adobe software?
Are the /Widths entries required to test whether a referenced font that is not embedded is "correct" (i.e. is the PDF viewer supposed to check, by comparing the /Widths, whether the font program found on the platform is the one the PDF wants)?

The Widths array is documented as being present so that application programs can determine the metrics of glyphs without being required to decode a font. This might be of use (for example) when drawing a selection box around text, or highlighting text in some manner.
See pages 393 and 394 of the PDF 1.7 specification:
The width information for each glyph is stored both in the font
dictionary and in the font program itself. (The two sets of widths
must be identical; storing this information in the font dictionary,
although redundant, enables a consumer application to determine glyph positioning without having to look inside
the font program.)
I should also mention that many PDF producers regard abusing the Widths array as a convenient way to alter the spacing of a font. Where the Widths array of the font does not match the metrics of the glyphs in the font program, Acrobat uses the Widths array values (this is what the implementation note in Appendix H, referred to by the text you quoted, describes). I also seem to recall that the latest version of the specification lifts the exception for the base 14 fonts; all fonts are supposed to have a /Widths array now.
We've got numerous examples of PDF files where the Widths array does not match the metrics in the font program.
Note that the Preflight checker in Acrobat Pro, when checking for PDF/A compatibility, will throw an error if the Widths and metrics differ.
So while it is technically true that the /Widths array is redundant, because the same information can be retrieved from the font, it is convenient for some applications to have the information in a more readily accessible form, and if (as a PDF consumer) you hope to match the rendering from Acrobat, you need to use it.
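As a rough illustration of how a consumer can position glyphs from the font dictionary alone, here is a minimal Python sketch of the lookup rule quoted in the question. The function and parameter names are hypothetical; the values are assumed to have already been read from /FirstChar, /LastChar, /Widths and the descriptor's /MissingWidth. Character spacing, word spacing and horizontal scaling are deliberately ignored.

```python
def glyph_width(code, first_char, last_char, widths, missing_width=0):
    """Width of one character code, in units of 1/1000 of text space.

    Codes inside FirstChar..LastChar index into the Widths array;
    anything outside that range falls back to MissingWidth.
    """
    if first_char <= code <= last_char:
        return widths[code - first_char]
    return missing_width


def string_advance(codes, font_size, first_char, last_char, widths, missing_width=0):
    """Horizontal advance of a run of character codes, in text-space units.

    Simplification: only the glyph widths and the font size are applied.
    """
    total = sum(
        glyph_width(c, first_char, last_char, widths, missing_width) for c in codes
    )
    return total / 1000.0 * font_size
```

No font program is decoded at any point, which is the convenience the specification text describes.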

Related

How Adobe Acrobat does break words in PDF documents when copying text?

PDF documents don't require space characters to be present in the page content streams to visually break words. As a consequence, a glyph for the space character may be missing from font programs as well. PDF-compliant viewers appear to use font metrics and text state to infer an appropriate word spacing width and check it against character positioning to add missing spaces when selecting or copying text. Unfortunately, the PDF specification does not appear to say enough about how the word spacing width can be computed in such cases. While pdf.js appears to hard-code a size for tracking word breaks, my empirical tests suggest that Acrobat Reader/Pro uses a different approach. What could that heuristic be?
The question is very technical, and answering it requires either insider knowledge of Adobe Acrobat internals or experience implementing text extraction from PDF documents with a robust set of test cases compared against Adobe's results. For anyone interested: assuming a robust word-break algorithm for text extraction can be implemented by inferring an arbitrary spacing width and comparing it against glyph locations, the heuristic I'm currently testing is the following:
unscaledSpacingWidth = (average of non zero glyph widths obtained from /W or /Widths arrays) / 7
Here 7 is an arbitrary constant that seems to work well and matches Adobe Acrobat's results closely enough in the limited set of samples I tested. This compares against the solution in pdf.js, which just picks a hard-coded value of 0.1 PDF points.
The resulting spacing width is then scaled according to font size and other text state context.
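The heuristic above can be sketched in Python as follows. This is only an illustration under stated assumptions: the widths are taken to be in 1/1000 text-space units, the scaling is reduced to font size times a horizontal scale factor, and the helper names are made up for the example. It is not a description of what Acrobat actually does.

```python
def unscaled_spacing_width(widths, divisor=7):
    """Average of the non-zero glyph widths divided by an arbitrary constant."""
    nonzero = [w for w in widths if w != 0]
    if not nonzero:
        return 0.0
    return (sum(nonzero) / len(nonzero)) / divisor


def is_word_break(prev_end_x, next_start_x, widths, font_size, horizontal_scale=1.0):
    """Report a word break when the gap between two glyphs exceeds the threshold.

    Assumption: both x positions are already expressed in the same
    (text-space) units as the scaled threshold.
    """
    threshold = unscaled_spacing_width(widths) / 1000.0 * font_size * horizontal_scale
    return (next_start_x - prev_end_x) > threshold
```

With widths averaging 700 units and a 10 pt font, the threshold works out to 1 unit, so a gap of 1.5 counts as a word break and a gap of 0.5 does not.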

Thai character not rendered correctly in PDF

My app should be able to output a PDF file containing the user guide in several supported languages. (I'm using pdfkit)
I had some trouble finding a suitable font for Thai: some fonts that supposedly support Thai (including Noto Thai from Google) would output squares, question marks or, even worse, unreadable stuff.
After a bit of research, I found one that seemed to work reasonably well, until our Thai guy noted that the characters
ต่ำ
were rendered as in the picture below, basically with the two elements above the first character collapsed, one covering the other.
I'm using the Nimbus Sans Thai family downloaded from myfonts.com, which, by the way, seems able to render those characters correctly, as you can see by copy-pasting ต่ำ into the preview input.
Any hints?
Your font is incomplete in a certain way. It lacks some glyphs that usually reside in Private Use Area (PUA) of Unicode.
Some applications (I'm aware of Microsoft Word) can manually overcome this problem, but your rendering app (and Adobe Acrobat Viewer) does not.
You should either find a font in which these glyphs are present or find an application that displaces the existing glyphs manually.
Many fonts, even though they claim to support Thai (and do indeed contain the "regular" Thai glyphs), can be incomplete.
Besides canonical glyphs, a well-formed font should contain a "Private Use Area" (PUA) subrange that contains glyphs in non-canonical forms. Those glyphs include:
Tone marks shifted to the upper position for use in combination with upper vowels (SARA_I, SARA_UE, etc.), and shifted to a lower position for Consonant + Tone Mark with no upper vowel;
Tone marks and upper vowels slightly shifted to the left for use in combination with PO_PLA, FO_FAN, etc. (otherwise they would overlap with the consonant's upper tail);
Both effects combined, e.g. a tone mark shifted down and left at the same time;
Special glyphs for YO_YING and THO_THAN (with no tail) for use in combination with under-vowels;
Several more.
Normally, when a rendering app finds the symbol combinations mentioned above, it looks for substitute glyphs in the PUA. If none are found, it simply falls back to the default glyph, which is what happens in your case.
Here are two screenshots of the PUA areas of Arial Unicode and FreeSerif, which are self-explanatory: FreeSerif's PUA is empty. I think the same problem occurs with your Nimbus font.
A final observation: incorrect fonts can be incorrect in different ways. Above I have described the more canonical case, where the standard positions of tone marks are the upper positions, while the non-standard positions are shifted down (or are absent, which constitutes an incomplete font).
There are, however, fonts that behave the opposite way; they (only) contain tone marks in lower positions. This is what you seem to observe.
The problem is that PDFKit does not perform complex script rendering.
Several scripts, such as Arabic and Thai, require glyph substitution and repositioning depending on context (position in the string, neighboring characters), and PDFKit does not seem to do this.
PDF viewer applications display exactly what is defined in the PDF file. The Nimbus Sans Thai font probably includes all the required glyphs but what bytebuster explains in his answer needs to be performed by PDFKit and not by the viewer application.
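To make the context dependence concrete, the following Python sketch flags Thai clusters in which a consonant carries two above-base components, such as ต่ำ, where the tone mark has to sit above the upper part of SARA AM. It performs no shaping itself; it only points out where a shaping step would have to substitute or reposition glyphs. The code point sets are a simplified assumption, not a complete description of Thai.

```python
THAI_CONSONANTS = range(0x0E01, 0x0E2F)  # KO KAI .. HO NOKHUK
# Marks drawn above the consonant: upper vowels, MAITAIKHU, tone marks,
# THANTHAKHAT and NIKHAHIT (simplified selection).
ABOVE_MARKS = {0x0E31, 0x0E34, 0x0E35, 0x0E36, 0x0E37, 0x0E47,
               0x0E48, 0x0E49, 0x0E4A, 0x0E4B, 0x0E4C, 0x0E4D}
BELOW_VOWELS = {0x0E38, 0x0E39, 0x0E3A}
SARA_AM = 0x0E33  # contributes a NIKHAHIT above the consonant


def stacked_mark_clusters(text):
    """Return the clusters where one consonant carries two or more above-base parts."""
    clusters, current = [], None
    for ch in text:
        cp = ord(ch)
        if cp in THAI_CONSONANTS:
            current = [ch]
            clusters.append(current)
        elif current is not None and (cp in ABOVE_MARKS or cp == SARA_AM):
            current.append(ch)
        elif cp not in BELOW_VOWELS:
            current = None  # anything else ends the cluster
    return ["".join(c) for c in clusters if len(c) >= 3]
```

For the string from the question, `stacked_mark_clusters("\u0e15\u0e48\u0e33")` reports the whole cluster, which is exactly the case where unshaped output collapses the two marks onto each other.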

How to detect Font Family in PDF?

I have a PDF which contains many fonts. What is the best way to check whether it contains a font that belongs to the Arial font family?
Is this possible in any language?
I couldn't find any library or language that could do this.
So I tried converting the PDF to an image using ImageMagick and segmented all the letters present in the image. Then I tried to compare the segmented letters to segmented images of Arial font family letters, which worked fine.
I created all datasets using MS Word. But the Arial font family looks different in different editors. By "looks different", I mean the segmented image of the same letter has different pixel values in different editors. Also, a letter at 10pt size has different dimensions in different editors. So this method doesn't work.
Any suggestions on how to do this? Maybe using an SVG or PS file?
I also came to learn that in PDFs, letters are rendered using Bézier curves, where each Bézier curve is drawn using some control points and nodes.
Are these control points the same for all letters that belong to one font family? If yes, how can I extract the control points of letters in a PDF, since these could be used to detect the font family?
There can be three types of text in your document:
text that isn't real text, but part of a raster image,
vector text drawn by PDF syntax without using a real font,
vector text using a real font.
The answer to your question depends on the type of text you're confronted with:
There is no way to extract font information if the text is not real text, but part of a raster image. You need an OCR tool to convert the pixels into characters, but you won't get any info about the font family. You could try to compare pixels, but you've already tried that and you've discovered that this isn't trivial (one might consider your current solution as a bad workaround / bad design).
You describe text that is drawn on a page using Bézier curves. Although it's possible to draw text like this, you won't find many PDFs that are drawn like this. The reason is obvious: every time you'd need a specific glyph, let's say A, you would need to add the syntax to draw that glyph on the page, leading to plenty of redundant PDF syntax.
PDFs usually work with fonts. A font is stored in a PDF file using a font dictionary. The syntax that makes up a page refers to that font using a name that can be chosen by the PDF producer, but that corresponds with an entry in the page resources that contains a reference to the font dictionary. Each font has an encoding mapping characters to glyphs. In the page content we use characters; based on these characters, glyphs will be selected in the font.
You are asking about the font family. This information is stored in the font dictionary. Take a look at my answer to the question What are the ways of checking if piece of text in PDF documernt is bold using iTextSharp and you'll get an idea of what such a font dictionary looks like.
Do you see the /BaseFont entry in the font dictionary? It has values such as JOJJAH+TT116t00. In this case, the name of the font is "TT116t00", but what is "JOJJAH"? That is explained in my answer to the question What are the extra characters in the font name of my PDF?
Not all fonts are embedded. Sometimes the name of the font is sufficient for the viewer to know what the glyphs look like. For instance: there are 14 Standard Type 1 fonts that every viewer should be able to render.
Arial isn't one of those fonts, so if you want to be sure that Arial is rendered correctly, that font needs to be embedded. The font dictionary will refer to a font descriptor, which in turn points to the embedded font program where you'll find the syntax to draw glyphs using linear paths, Bézier curves, etc. Suppose that you need the character A, then the font program will contain some syntax that knows how to draw that character. The font dictionary will also have a map that maps the character A to the glyph A. Now when you need that glyph in your content, you can just use the character A and that will refer to the syntax that draws the glyph A. That syntax is stored inside the PDF only once.
Suppose that a PDF has the full Arial font embedded, then the value of /BaseFont would be Arial. However, if we'd embed the full Arial font, the PDF would be bloated. There are way too many characters in Arial; we don't need them all. That's why we'll only embed one or more subsets. When you see 6 characters followed by a + sign in the /BaseFont entry, you have discovered a font subset.
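Once the /BaseFont value has been obtained with whatever library you use, splitting off the subset prefix and guessing the family is simple string handling. The Python sketch below assumes the value is already available as a plain string; the function names and the list of accepted Arial spellings are illustrative assumptions. Keep in mind that the name is chosen by the PDF producer, so a match is a hint rather than proof.

```python
import re

# Six uppercase letters followed by "+" mark a font subset.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def split_base_font(base_font):
    """Return (subset_tag, font_name); subset_tag is None for a non-subset font."""
    name = base_font.lstrip("/")
    if SUBSET_PREFIX.match(name):
        return name[:6], name[7:]
    return None, name


def looks_like_arial(base_font):
    """Heuristic: compare the part before any style suffix with known Arial names."""
    _, name = split_base_font(base_font)
    family = re.split(r"[,\-]", name, maxsplit=1)[0]
    return family.lower() in {"arial", "arialmt"}
```

For example, `split_base_font("JOJJAH+TT116t00")` yields the subset tag `JOJJAH` and the name `TT116t00`, which also shows the limit of the approach: a producer-generated name like that says nothing about the family.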
Getting the /BaseFont entry of a font dictionary can be done using different libraries. On the official iText site, we have different Q&As that explain how to Inspect a PDF. There's also an example that lists the fonts used in a PDF. Maybe that can be helpful.
NOTE: as explained in the help section, more specifically on the page What topics can I ask about here?, you will find rule #4: Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam.
I have given you the general information about where to find font information inside a PDF, but it's not allowed for you to ask questions to recommend the best tool to do this. Sorry for that.

How to convert italic font to normal font in pdf using some library?

Is there any way to convert italic or bold fonts in my PDF to a normal font using a library such as ImageMagick or Ghostscript?
Basically the answer is 'no' though there are several levels of caveat in there.
The most common scenario for a PDF file is that it contains an embedded font, and that font is subset. In this case the font will use a custom Encoding, so that when you see 'Hello' on your monitor, the actual character codes might be 'Axtte' or similar gibberish. If the font also contains a ToUnicode table you could, technically, create an embedded subset of the regular font from the same family as the bold or italic and embed that, and it would work. This would be an immense amount of work.
If the font isn't subset then it may not contain a custom Encoding, which would make that task easier, because you wouldn't have to re-encode the replacement.
If the font isn't embedded, then you need only change the font name in the Font object, because the PDF consumer will have to find a substitute anyway.
Note that, because PDF is a binary format, with an index (xref) containing the offset of every object in the file, any changes will mean that the xref table has to be reconstructed, again a considerable task.
I'm not aware of any tools which would do any of this for you automatically, you'd have to write your own, though some things could be done automatically. MuPDF for example will 'fix' a PDF file which has an incorrect xref table for you.
And even after all that, the likelihood is that the spacing would be different for the italic or bold font compared to the regular font anyway, and would look peculiar if you replaced them with a regular font.
So, fundamentally, no.
In low-level PDF you can apply some rendering flags in front of a text stream, such as the "Rendering Mode" Tr operator. For instance, you can enable rendering of the text outline and increase the outline width with the command sequence 0.4 w 2 Tr, which will cause normal text to become more "bold" (there are other, better ways to accomplish this using the font descriptor dictionary). One can also employ this tactic to slim down bold text using a clipped thicker outline, but this may not be ideal.
As for italic, most fonts contain a metric indicating their italic angle, and you can use this to add a faux italic using a shear CTM transformation matrix with the cm operation. Once again, this may work better to add an italic shear, but may also have some success in removing it.
See the PDF Reference.
This will require a library with lower level PDF building and you would have to do it manually, but it is possible technically.
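The operator sequences mentioned above can be generated with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, the italic angle is taken to follow the /ItalicAngle convention of the font descriptor (degrees counterclockwise from vertical, so negative for a right-leaning font), and the `cm` line is meant to be emitted before the text object begins. The shear is applied about the origin of the current coordinate system, so text away from the origin will also shift horizontally.

```python
import math


def faux_bold(stroke_width=0.4):
    """Operators that fill and stroke the text outline (text rendering mode 2)."""
    return f"{stroke_width} w 2 Tr"


def shear_cm(italic_angle_deg, remove=False):
    """Build a `cm` operator that shears x in proportion to y.

    With remove=False the shear slants upright text by the given angle;
    with remove=True the opposite shear is produced, to counteract an
    existing slant of that angle.
    """
    c = math.tan(math.radians(-italic_angle_deg))
    if remove:
        c = -c
    if c == 0:
        c = 0.0  # avoid printing "-0.0000"
    return f"1 0 {c:.4f} 1 0 0 cm"
```

For a typical angle of -12 degrees, `shear_cm(-12)` gives `1 0 0.2126 1 0 0 cm`, and `shear_cm(-12, remove=True)` gives the inverse shear.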

What sizes are valid for TrueType fonts?

I am trying to use true type fonts, but finding that when I load different ones, some do not display at various font sizes while others do. It seems, although I am not certain, that some simply do not show below a certain font size. Is there any way to easily tell what sizes are valid for a ttf by examining the file and not simply trying it out in my app?
UPDATE: Considering smilingthax's answer, it would seem like these must be bitmap glyphs since the smaller sizes are not showing up (but using some system font instead I think). Does someone know how I can determine the valid sizes I can use with them using the iOS sdk (iPad)?
TTFs can contain bitmap glyphs and/or outline glyphs. Outline glyphs are scalable to any size (e.g. "31.4592 pt"), although they are often optimized (hinted) for certain sizes (8, 10, 12, 16, ..., 72). You can get these sizes via your font API (Win32, FreeType, ...). Bitmap glyphs are only available in certain sizes.
For example, FreeType's FT_Face object contains a .face_flags member which can have FT_FACE_FLAG_FIXED_SIZES set. The sizes are also accessible via .num_fixed_sizes and the .available_sizes array.
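If no font API is at hand, a quick way to see whether a file carries outlines, bitmap strikes or both is to read the table directory at the start of the file. The Python sketch below uses only the standard library; the function names are hypothetical, and it assumes a single-font sfnt file (not a .ttc collection). It only tells you which kinds of glyph data are present, not which bitmap sizes exist.

```python
import struct

# Table tags that hold scalable outlines and embedded bitmap strikes.
OUTLINE_TABLES = {"glyf", "CFF ", "CFF2"}
BITMAP_TABLES = {"EBLC", "EBDT", "CBLC", "CBDT", "bloc", "bdat", "sbix"}


def table_tags(data):
    """List the table tags from the sfnt table directory of a font file."""
    num_tables = struct.unpack(">H", data[4:6])[0]
    tags = []
    for i in range(num_tables):
        start = 12 + 16 * i  # 12-byte header, then 16 bytes per table record
        tags.append(data[start:start + 4].decode("latin-1"))
    return tags


def glyph_kinds(data):
    """Report whether the font contains outline and/or bitmap glyph tables."""
    tags = set(table_tags(data))
    return {
        "outline": bool(tags & OUTLINE_TABLES),
        "bitmap": bool(tags & BITMAP_TABLES),
    }
```

Usage would look like `glyph_kinds(open("font.ttf", "rb").read())`, where the path is a placeholder. A font that reports bitmap data but no outlines can only be expected to render at its fixed strike sizes.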