My app should be able to output a PDF file containing the user guide in several supported languages. (I'm using pdfkit)
I had some trouble finding a suitable font for Thai: some fonts that supposedly support Thai (including Noto Thai from Google) would output squares, question marks or, even worse, unreadable stuff.
After a bit of research, I found one that seemed to work reasonably well, until our Thai guy noted that the characters
ต่ำ
were rendered as in the picture below: the two marks above the first character collapse onto each other, with one covering the other.
I'm using the Nimbus Sans Thai family downloaded from myfonts.com, which, by the way, seems able to render those characters correctly, as you can see by pasting ต่ำ into the preview input on that site.
Any hints?
Your font is incomplete in a specific way: it lacks some glyphs that usually reside in the Private Use Area (PUA) of Unicode.
Some applications (Microsoft Word is one I'm aware of) can work around this problem themselves, but your rendering app (and Adobe Acrobat Viewer) does not.
You should either find a font that contains these glyphs, or find an application that repositions the existing glyphs itself.
Many fonts, although they claim to support Thai (and do indeed contain the "regular" Thai glyphs), can be incomplete.
Besides the canonical glyphs, a well-formed font should contain a "Private Use Area" (PUA) subrange that contains glyphs in non-canonical forms. Those glyphs include:
Tone marks shifted to the upper position for use in combination with upper vowels (SARA_I, SARA_UE, etc.), and shifted to a lower position for a consonant + tone mark with no upper vowel;
Tone marks and upper vowels shifted slightly to the left for use in combination with PO_PLA, FO_FAN, etc. (otherwise they would overlap the consonant's upper tail);
Both effects combined, e.g. a tone mark shifted down and to the left at the same time;
Special glyphs for YO_YING and THO_THAN (with no tail) for use in combination with lower vowels;
Several more.
Normally, when a rendering app finds one of the above-mentioned symbol combinations, it looks for substitute glyphs in the PUA. If none is found, it simply falls back to the default glyph, which is what happens in your case.
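To make the selection rules above concrete, here is a minimal Python sketch of how a renderer might decide which tone-mark variant it needs. The function name, the variant labels and the exact character sets are assumptions made for illustration; a real shaper would map the result to a font-specific PUA code point or use the font's OpenType tables.

```python
# Consonants with a tall upper tail; marks above them need the left-shifted form.
# PO PLA, FO FA, FO FAN (assumed set for this sketch).
TALL_CONSONANTS = {"\u0e1b", "\u0e1d", "\u0e1f"}

# Characters that put something directly above the consonant (assumed set):
# MAI HAN-AKAT, SARA I, SARA II, SARA UE, SARA UEE, NIKHAHIT, and SARA AM
# (whose ring is drawn above the preceding consonant).
UPPER_VOWELS = {"\u0e31", "\u0e34", "\u0e35", "\u0e36", "\u0e37", "\u0e4d", "\u0e33"}

# Tone marks MAI EK, MAI THO, MAI TRI, MAI CHATTAWA.
TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}


def tone_mark_variant(cluster: str) -> str:
    """Return a hypothetical variant label for the tone mark in one cluster.

    `cluster` is a base consonant followed by its vowels and marks.
    """
    if not any(ch in TONE_MARKS for ch in cluster):
        return "none"
    # Upper vowel present: the tone mark stacks above it (upper position).
    height = "high" if any(ch in UPPER_VOWELS for ch in cluster) else "low"
    # Tall consonant: shift the mark to the left to clear the tail.
    shift = "-left" if cluster[0] in TALL_CONSONANTS else ""
    return height + shift


# The cluster from the question: TO TAO + MAI EK + SARA AM.
print(tone_mark_variant("\u0e15\u0e48\u0e33"))
```

If the font has no glyph for the requested variant, the renderer is left with the default glyph, which is the collapse described in the question.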
Here are two screenshots of the PUA areas of Arial Unicode and FreeSerif, which are self-explanatory: FreeSerif's PUA is empty. I think the same problem occurs with your Nimbus font.
And a final observation: incorrect fonts can be incorrect in different ways. Above I described the more canonical case, in which the standard positions of tone marks are the upper ones, while the non-standard positions are shifted down (or are absent, which constitutes an incomplete font).
There are, however, fonts that behave the opposite way; they (only) contain tone marks in lower positions. This is what you seem to observe.
The problem is that PDFKit does not perform complex script rendering.
Several scripts, such as Arabic and Thai, require glyph substitution and repositioning depending on context (position in the string, neighboring characters), and PDFKit does not seem to do this.
PDF viewer applications display exactly what is defined in the PDF file. The Nimbus Sans Thai font probably includes all the required glyphs but what bytebuster explains in his answer needs to be performed by PDFKit and not by the viewer application.
PDF documents don't require space characters to be present in the page content streams to visually break words. As a consequence, a glyph for the space character may be missing from font programs as well. PDF-compliant viewers appear to use font metrics and the text state to infer an appropriate word-spacing width, and check it against character positions to add the missing spaces when selecting or copying text. Unfortunately, the PDF specification does not say clearly how the word-spacing width should be computed in such cases. While pdf.js appears to hard-code a size for tracking word breaks, my empirical tests suggest that Acrobat Reader/Pro uses a different approach. What could such a heuristic be?
The question is very technical, and answering it requires either insider knowledge of Adobe Acrobat internals or experience implementing text extraction from PDF documents with a robust set of test cases compared against Adobe's results. For anyone interested: assuming a robust word-break algorithm for text extraction can be implemented by inferring an arbitrary spacing width and comparing it against glyph locations, the heuristic I'm currently testing is the following:
unscaledSpacingWidth = (average of non zero glyph widths obtained from /W or /Widths arrays) / 7
Here 7 is an arbitrary constant that seems to work well and matches Adobe Acrobat's results closely enough in the limited set of samples I tested. Compare this with the solution in pdf.js, which just picks a hard-coded value of 0.1 PDF points.
The resulting spacing width is then scaled according to the font size and other text state parameters.
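The heuristic above can be sketched in a few lines of Python. The function names are made up for this example, and the scaling step (dividing by 1000 and multiplying by font size and horizontal scaling) is my assumption about what "scaling" means for a simple font. It is not a description of what Acrobat does.

```python
def unscaled_spacing_width(widths, divisor=7.0):
    """Average of the non-zero glyph widths from /W or /Widths, divided by 7."""
    nonzero = [w for w in widths if w != 0]
    if not nonzero:
        return 0.0
    return (sum(nonzero) / len(nonzero)) / divisor


def scaled_spacing_width(widths, font_size, horizontal_scaling=1.0):
    """Scale the threshold into text space (assumes 1000 glyph units per unit)."""
    return unscaled_spacing_width(widths) / 1000.0 * font_size * horizontal_scaling


def is_word_break(gap, widths, font_size):
    """Treat a horizontal gap between two glyphs as a word break if it
    is at least as wide as the inferred spacing width."""
    return gap >= scaled_spacing_width(widths, font_size)
```

A gap smaller than the threshold would then be treated as kerning or tracking rather than as a missing space.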
The PDF reference states that a dictionary defining a font resource needs to contain a /Widths entry giving this information:
(Required except for the standard 14 fonts; indirect reference
preferred) An array of ( LastChar − FirstChar + 1) widths, each
element being the glyph width for the character code that equals
FirstChar plus the array index. For character codes outside the range
FirstChar to LastChar , the value of MissingWidth from the
FontDescriptor entry for this font is used. The glyph widths are
measured in units in which 1000 units corresponds to 1 unit in text
space. These widths must be consistent with the actual widths given in
the font program. (See implementation note 61 in Appendix H.)
emphasis added.
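The lookup rule in the quoted passage can be written out as a small Python sketch. The function and parameter names are mine; only FirstChar, LastChar, Widths and MissingWidth come from the specification text, and the default of 0 for MissingWidth is used here as the fallback.

```python
def glyph_width(code, first_char, last_char, widths, missing_width=0):
    """Return the width (in 1/1000 text space units) for a character code."""
    expected = last_char - first_char + 1
    if len(widths) != expected:
        raise ValueError(f"Widths has {len(widths)} entries, expected {expected}")
    if first_char <= code <= last_char:
        # The array index is the character code minus FirstChar.
        return widths[code - first_char]
    # Codes outside FirstChar..LastChar use MissingWidth from the FontDescriptor.
    return missing_width


# Example with three codes, 65..67.
print(glyph_width(66, 65, 67, [722, 667, 722]))
```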
What good is it to provide the widths again if they are obviously included in the font program?
Plainly: can somebody confirm or refute that the information one is supposed to provide here, the glyph width, is blatantly redundant, considering that the text itself says it is contained in the font program?
Or do some font programs include glyphs without specifying their widths?
Is it because there are font programs that do not include the widths, or is this merely an exercise in patience, intended to complicate the generation of PDF files in the hope that people will stick to Adobe software?
Are the /Widths entries required to test if a referenced font (being not embedded), is "correct" (i.e. the pdf viewer is supposed to check if the font-program wanted by the pdf, might be the one found on the platform, comparing the /Widths)?
The Widths array is documented as being present so that application programs can determine the metrics of glyphs without being required to decode a font. This might be of use (for example) when drawing a selection box around text, or highlighting text in some manner.
See pages 393 and 394 of the PDF 1.7 specification:
The width information for each glyph is stored both in the font
dictionary and in the font program itself. (The two sets of widths
must be identical; storing this information in the font dictionary,
although redundant, enables a consumer application to determine glyph positioning without having to look inside
the font program.)
I should also mention that many PDF producers regard abusing the Widths array as a convenient way to alter the spacing of a font. Where the values in the font dictionary's Widths array do not match the metrics of the glyphs in the font program, Acrobat uses the Widths array values (this is the implementation note in Appendix H referred to by the text you quoted). I also seem to recall that the latest version of the specification lifts the exception for the base 14 fonts; all fonts are supposed to have a /Widths array now.
We've got numerous examples of PDF files where the Widths array does not match the metrics in the font program.
Note that the Preflight checker in Acrobat Pro, when checking for PDF/A compatibility, will throw an error if the Widths and metrics differ.
So while it is technically true that the /Widths array is redundant, because the same information can be retrieved from the font, it is convenient for some applications to have the information in a more readily accessible form, and if (as a PDF consumer) you hope to match the rendering from Acrobat, you need to use it.
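As an illustration of the kind of consistency check mentioned above, here is a rough Python sketch that compares the /Widths values against widths taken from the font program. It assumes both lists start at FirstChar and are already expressed in 1/1000 text space units. The function name and the tolerance are hypothetical and do not reflect how Acrobat's Preflight is implemented.

```python
def width_mismatches(dict_widths, program_widths, first_char, tolerance=1.0):
    """Return (char_code, declared, actual) for every width that differs
    by more than `tolerance` units between /Widths and the font program."""
    mismatches = []
    for index, (declared, actual) in enumerate(zip(dict_widths, program_widths)):
        if abs(declared - actual) > tolerance:
            mismatches.append((first_char + index, declared, actual))
    return mismatches


# Code 66 is declared as 600 units but the font program says 650.
print(width_mismatches([500, 600, 700], [500, 650, 700.5], first_char=65))
```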
I have a PDF which contains many fonts. What is the best way to check whether it contains a font that belongs to the Arial font family?
Is this possible in any language?
I couldn't find any library or language that could do this.
So I tried converting the PDF to an image using ImageMagick and segmenting all the letters present in the image. Then I compared the segmented letters to segmented images of letters from the Arial font family, which worked fine.
I created all the datasets using MS Word. But the Arial font family looks different in different editors. By "looks different", I mean that the segmented image of the same letter has different pixel values in different editors. Also, a letter at 10pt has different dimensions in different editors. So this method doesn't work.
Any suggestions on how to do this? Maybe using an SVG file or a PS file?
I also came to learn that in PDFs, letters are rendered using Bézier curves, where each curve is drawn using some control points and nodes.
Are these control points the same for all letters that belong to one font family? If yes, how can I extract the control points of the letters in a PDF, so that they can be used to detect the font family?
There can be three types of text in your document:
text that isn't real text, but part of a raster image,
vector text drawn by PDF syntax without using a real font,
vector text using a real font.
The answer to your question depends on the type of text you're confronted with:
There is no way to extract font information if the text is not real text, but part of a raster image. You need an OCR tool to convert the pixels into characters, but you won't get any info about the font family. You could try to compare pixels, but you've already tried that and you've discovered that this isn't trivial (one might consider your current solution as a bad workaround / bad design).
You describe text that is drawn on a page using Bézier curves. Although it's possible to draw text like this, you won't find many PDFs that are drawn like this. The reason is obvious: every time you need a specific glyph, let's say A, you would need to add the syntax to draw that glyph on the page, leading to plenty of redundant PDF syntax.
PDFs usually work with fonts. A font is stored in a PDF file using a font dictionary. The syntax that makes up a page refers to that font using a name that can be chosen by the PDF producer, but that corresponds to an entry in the page resources that contains a reference to the font dictionary. Each font has an encoding that maps characters to glyphs. In the page content we use characters; based on these characters, glyphs are selected in the font.
You are asking about the font family. This information is stored in the font dictionary. Take a look at my answer to the question What are the ways of checking if piece of text in PDF document is bold using iTextSharp and you'll get an idea of what such a font dictionary looks like.
Do you see the /BaseFont entry in the font dictionary? It has values such as JOJJAH+TT116t00. In this case, the name of the font is "TT116t00", but what is "JOJJAH"? That is explained in my answer to the question What are the extra characters in the font name of my PDF?
Not all fonts are embedded. Sometimes the name of the font is sufficient for the viewer to know what the glyphs look like. For instance: there are 14 Standard Type 1 fonts that every viewer should be able to render.
Arial isn't one of those fonts, so if you want to be sure that Arial is rendered correctly, that font needs to be embedded. The font dictionary will refer to a Font Descriptor where you'll find the syntax to draw glyphs using linear paths, Bézier curves, etc. Suppose that you need the character A, then the font descriptor will contain some syntax that knows how to draw that character. The font dictionary will also have a map that maps the character A to the glyph A. Now when you need that glyph in your content, you can just use the character A and that will refer to the syntax that draws the glyph A. That syntax is stored inside the PDF only once.
Suppose that a PDF has the full Arial font embedded, then the value of /BaseFont would be Arial. However, if we'd embed the full Arial font, the PDF would be bloated. There are way too many characters in Arial; we don't need them all. That's why we'll only embed one or more subsets. When you see 6 characters followed by a + sign in the /BaseFont entry, you have discovered a font subset.
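As a small illustration of the naming convention just described, the Python sketch below splits a /BaseFont value into its subset tag and font name, and then applies a deliberately naive "is this Arial?" test. Both function names are made up for this example, and the prefix match is an assumption: a subset name such as TT116t00 tells you nothing about the family, so a real check would also have to look at the font descriptor or the embedded font program.

```python
import re

# Six uppercase letters followed by "+" mark a font subset.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_base_font(base_font):
    """Return (subset_tag, font_name); subset_tag is None for a full font."""
    name = base_font.lstrip("/")
    match = _SUBSET_TAG.match(name)
    if match:
        return match.group(1), match.group(2)
    return None, name


def looks_like_arial(base_font):
    """Naive family test on the name only (assumption: names start with 'Arial')."""
    _, name = split_base_font(base_font)
    return name.lower().startswith("arial")


print(split_base_font("JOJJAH+TT116t00"))
print(looks_like_arial("ABCDEF+Arial-BoldMT"))
```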
Getting the /BaseFont entry of a font dictionary can be done using different libraries. On the official iText site, we have different Q&As that explain how to Inspect a PDF. There's also an example that lists the fonts used in a PDF. Maybe that can be helpful.
NOTE: as explained in the help section, more specifically on the page What topics can I ask about here?, you will find rule #4: Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam.
I have given you the general information about where to find font information inside a PDF, but it's not allowed for you to ask questions to recommend the best tool to do this. Sorry for that.
I'm just curious to know how text is handled in PDF under the hood. Does it contain a high-level layout system like HTML that does things like breaking paragraphs into lines, or does it only support low-level operations like putting each character at an absolute position?
In PDF, text is represented by Glyphs. Individual glyphs can be positioned exactly on a page, or a sequence of glyphs can be laid out, following some rules for spacing between them. There is no concept of words, lines, paragraphs, blocks or anything similar. The PDF specification does allow some descriptive information (like the number of columns on a page), but generally speaking such information is not reliable.
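To show what this means in practice, here is a minimal Python sketch that writes the text operators for a few lines of text. The producer has to break the lines itself and give every line an explicit position, because the content stream has no notion of a paragraph. The function names are hypothetical, the font resource name F1 is an assumption, and the sketch assumes text that fits the font's simple encoding (plain ASCII here).

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_content_stream(lines, x, y, leading, font_name="F1", font_size=12):
    """Emit one positioned text-showing operation per pre-broken line."""
    ops = ["BT", f"/{font_name} {font_size} Tf"]  # begin text, select font
    for index, line in enumerate(lines):
        baseline = y - index * leading
        # Tm sets the text matrix: here a pure translation to (x, baseline).
        ops.append(f"1 0 0 1 {x} {baseline} Tm")
        # Tj shows the string; glyph advances come from the font widths.
        ops.append(f"({escape_pdf_string(line)}) Tj")
    ops.append("ET")  # end text
    return "\n".join(ops)


print(text_content_stream(["Hello", "World (2)"], x=72, y=720, leading=14))
```

Nothing in the output says that the two strings belong to the same paragraph; a reader can only infer that from their positions.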
How is text direction for right-to-left languages, like Arabic, encoded in PDF? My understanding is that since PDF is fundamentally a graphical format, the concept of text-direction doesn't need to really be encoded. Rather, the glyphs simply need to be painted on-screen from right to left. However, the PDF reference manual mentions an attribute called WritingMode, where you can specify combinations left-to-right, right-to-left and top-to-bottom, bottom-to-top.
So my questions are:
(1) If my understanding is correct, and RTL or LTR is merely expressed by the way the glyphs are painted on-screen, what is the point of the WritingMode attribute?
(2) If there is no actual directionality information encoded in the PDF file, other than the order the glyphs are painted, how does a PDF-to-Text program know if a given line is supposed to be read right-to-left or left-to-right? (I suppose the PDF program could just check if the Unicode codepoints extracted from a ToUnicode map fall into a range that corresponds to an RTL language.)
WritingMode is only for Tagged PDF, if I'm reading the spec correctly. If a PDF doesn't contain the appropriate logical structure, you don't get WritingMode.
The general answer, as I understand it, is "it depends". In R-L writing, you probably have the text advance info encoded in the font and a single text placement will advance the text to the right place. I say 'probably' because it might be that the actual generation software ignores this and places each glyph on its own, regardless of the text advance in the font. Then you get fun languages like Arabic and Hebrew which aren't strictly R-L, as numbers are still L-R within a R-L line.
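The idea floated in question (2), looking at the extracted Unicode code points, can be sketched with Python's standard `unicodedata` module. This is only a rough majority vote over strongly directional characters, under the assumption that the text has already been mapped to Unicode (for example via a ToUnicode map). It is not a full implementation of the Unicode bidirectional algorithm, and the function name is made up.

```python
import unicodedata


def guess_direction(text):
    """Guess 'rtl', 'ltr' or 'neutral' from the strong bidi classes in text."""
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            rtl += 1
        elif bidi == "L":  # strongly left-to-right letters
            ltr += 1
        # Digits, spaces and punctuation are weak or neutral and are skipped.
    if rtl == 0 and ltr == 0:
        return "neutral"
    return "rtl" if rtl > ltr else "ltr"


# Hebrew letters followed by digits: the digits do not change the verdict.
print(guess_direction("\u05e9\u05dc\u05d5\u05dd 123"))
```

Skipping digits matters for exactly the reason given above: numbers run left to right even inside a right-to-left line.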
Text direction will be set in the Trm