Does PDF contain a high-level layout system for text? - pdf

I'm just curious to know how text is handled in PDF under the hood. Does it contain a high-level layout system like HTML that does things like breaking paragraphs into lines, or does it only support low-level operations like putting each character at an absolute position?

In PDF, text is represented by glyphs. Individual glyphs can be positioned exactly on a page, or a sequence of glyphs can be laid out following some rules for the spacing between them. There is no concept of words, lines, paragraphs, blocks or anything similar. The PDF specification does allow some descriptive information (like the number of columns on a page), but generally speaking such information is not reliable.
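Since the format has no notion of lines or paragraphs, whatever generates the PDF has to break lines itself and emit one text-showing operation per line. Here is a minimal sketch of that idea. It assumes a monospaced font such as Courier (every glyph 600/1000 of the font size wide), a font resource named /F1, and made-up helper names (text_width, layout_paragraph); a real producer would use per-glyph widths and would also write the rest of the file.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_width(text, font_size, glyph_width=600):
    """Width in points, assuming every glyph is glyph_width/1000 em wide."""
    return len(text) * glyph_width / 1000 * font_size


def layout_paragraph(text, font_size=12, max_width=200, leading=14, x=72, y=720):
    """Greedy line breaking done by the producer, emitted as content stream operators."""
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and text_width(candidate, font_size) > max_width:
            lines.append(current)   # the candidate would overflow: close the line
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)

    ops = ["BT", f"/F1 {font_size:g} Tf", f"{leading:g} TL", f"{x:g} {y:g} Td"]
    for index, line in enumerate(lines):
        if index:
            ops.append("T*")        # move to the start of the next line
        ops.append(f"({escape_pdf_string(line)}) Tj")
    ops.append("ET")
    return "\n".join(ops)


print(layout_paragraph("aaaa bbbb cccc", font_size=10, max_width=60))
```

The resulting stream only records where each run of glyphs goes; nothing in it says that the two runs once formed a single paragraph.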

Related

How does Adobe Acrobat break words in PDF documents when copying text?

PDF documents don't require space characters to be present in the page content streams to visually break words. As a consequence, a glyph for the space character may be missing from the font program as well. PDF-compliant viewers appear to use font metrics and the text state to infer an appropriate word spacing width, then check it against glyph positions to add missing spaces when selecting or copying text. Unfortunately, the PDF specification does not seem to explain clearly how the word spacing width should be computed in such cases. While pdf.js appears to hard-code a size for tracking word breaks, my empirical tests suggest that Acrobat Reader/Pro uses a different approach. What could that heuristic be?
The question is very technical, and answering it requires either insider knowledge of Adobe Acrobat internals or experience implementing PDF text extraction with a robust set of test cases compared against Adobe's results. For anyone interested: assuming a robust word-break algorithm for text extraction can be implemented by inferring an arbitrary spacing width and comparing it against glyph locations, the heuristic I'm currently testing is the following:
unscaledSpacingWidth = (average of non zero glyph widths obtained from /W or /Widths arrays) / 7
Here 7 is an arbitrary constant that seems to work well and matches Adobe Acrobat's results closely enough in the limited set of samples I tested. By comparison, the solution in pdf.js simply picks a hard-coded value of 0.1 PDF points.
The resulting spacing width is then scaled according to the font size and the rest of the text state.
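The heuristic above can be sketched as follows. This is only an illustration: it assumes the widths are given in the usual 1/1000 text-space units, that scaling involves just the font size and horizontal scaling, and the function names are made up here. It is not a description of what Acrobat or pdf.js actually does.

```python
from statistics import mean


def unscaled_spacing_width(widths, divisor=7.0):
    """Average of the non-zero entries of a /W or /Widths array, divided by 7."""
    non_zero = [w for w in widths if w != 0]
    if not non_zero:
        return 0.0
    return mean(non_zero) / divisor


def scaled_spacing_width(widths, font_size, horizontal_scaling=100.0):
    """Scale the threshold to text space (assumes widths in 1/1000 units)."""
    unscaled = unscaled_spacing_width(widths)
    return unscaled / 1000.0 * font_size * (horizontal_scaling / 100.0)


def needs_space(prev_glyph_end_x, next_glyph_start_x, threshold):
    """Insert a space when the gap between two glyphs exceeds the threshold."""
    return (next_glyph_start_x - prev_glyph_end_x) > threshold


widths = [0, 500, 700, 900]
threshold = scaled_spacing_width(widths, font_size=12)
print(threshold, needs_space(100.0, 103.0, threshold))
```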

What governs the text selection order of PDFs, how can it be improved when generating PDFs?

A number of PDFs, particularly those exported by presentation software, desktop publishing tools or LaTeX typesetting, seem to have an illogical text selection marquee order.
For example, selecting parts of a math equation in one of my documents seems to randomly select another large block of equations elsewhere on the page, even though they are separated by body text. Is this a problem in the PDF viewer (Mac Preview) or in the PDF file itself? What procedures should be followed when programmatically generating PDFs to ensure a logical ordering for text selection?
Text selection in PDF viewers is determined by an algorithm in the viewer. Different viewers will have different algorithms and yield different results. Some viewers will leverage the structure tags if they are present, others will ignore the tags even when present.
Unfortunately, there is nothing you can do as the PDF author to influence how any particular viewer software interprets the text rendering instructions into words then into blocks of text into page regions and finally into a text selection.
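To make the "algorithm in the viewer" concrete, here is a deliberately naive sketch of one way a viewer could order text fragments: group them into lines by vertical position, then read each line left to right. The Fragment type, the tolerance value and the function name are assumptions for illustration only; this is not how Preview, Acrobat or any other specific viewer works.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF y grows upward)


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then sort each line by x."""
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [f for line in lines for f in sorted(line, key=lambda f: f.x)]


# Fragments in the order they happen to appear in the content stream.
stream_order = [
    Fragment("world", 100.0, 700.0),
    Fragment("second", 72.0, 686.0),
    Fragment("Hello", 72.0, 700.5),
]
print([f.text for f in reading_order(stream_order)])
```

A heuristic like this is easily confused by multi-column pages or by equations whose parts sit on slightly different baselines, which would be consistent with the jumps described in the question.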

Thai character not rendered correctly in PDF

My app should be able to output a PDF file containing the user guide in several supported languages. (I'm using pdfkit)
I had some trouble finding a suitable font for Thai: some fonts that supposedly support Thai (including Noto Thai from Google) would output squares, question marks or, even worse, unreadable text.
After a bit of research, I found one that seemed to work reasonably well, until our Thai guy noted that the characters
ต่ำ
were rendered as in the picture below, basically with the two marks above the first character collapsed, one covering the other.
I'm using the Nimbus Sans Thai family downloaded from myfonts.com, which, by the way, seems able to render those characters correctly, as you can see by copy-pasting ต่ำ into the preview input.
Any hints?
Your font is incomplete in a certain way. It lacks some glyphs that usually reside in the Private Use Area (PUA) of Unicode.
Some applications (Microsoft Word is one I'm aware of) can work around this problem on their own, but your rendering app (and Adobe Acrobat Viewer) does not.
You should either find a font that includes these glyphs or find an application that repositions the existing glyphs itself.
Many fonts, although they claim to support Thai (and do indeed contain the "regular" Thai glyphs), can be incomplete.
Besides the canonical glyphs, a well-formed font should contain a "Private Use Area" (PUA) subrange that holds glyphs in non-canonical forms. Those glyphs include:
Tone marks shifted to the upper position for use in combination with upper vowels (SARA_I, SARA_UE, etc.), and shifted to a lower position for a consonant plus tone mark with no upper vowel;
Tone marks and upper vowels shifted slightly to the left for use in combination with PO_PLA, FO_FAN, etc. (otherwise they would overlap the consonant's upper tail);
Both effects combined, e.g. a tone mark shifted down and to the left at the same time;
Special glyphs for YO_YING and THO_THAN (with no tail) for use in combination with under-vowels;
Several more.
Normally, when a rendering app finds one of the combinations mentioned above, it looks for substitute glyphs in the PUA. If none is found, it simply falls back to the default glyph, which is what happens in your case.
Here are two screenshots of the PUA areas of Arial Unicode and FreeSerif, which are self-explanatory: FreeSerif's PUA is empty. I think the same problem occurs with your Nimbus font.
One final observation: fonts can be incorrect in different ways. Above I described the more canonical case, where the standard positions of the tone marks are the upper ones and the non-standard positions are shifted down (or are absent, which makes the font incomplete).
There are, however, fonts that behave the opposite way: they contain tone marks (only) in the lower positions. This is what you seem to observe.
The problem is that PDFKit does not perform complex script rendering.
Several scripts, such as Arabic and Thai, require glyph substitution and repositioning depending on context (position in the string, neighboring characters), and PDFKit does not seem to do it.
PDF viewer applications display exactly what is defined in the PDF file. The Nimbus Sans Thai font probably includes all the required glyphs, but what bytebuster explains in his answer needs to be performed by PDFKit, not by the viewer application.
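To see why the two marks end up competing for the same spot, it can help to look at the code points involved. The sketch below uses only Python's standard unicodedata module. The helper name describe is made up, and the word is assumed to consist of the three code points written out in WORD. It inspects character data only and says nothing about what PDFKit or a shaping engine does internally.

```python
import unicodedata

# The word from the question, assumed to be these three code points.
WORD = "\u0e15\u0e48\u0e33"


def describe(text):
    """Return (code point, Unicode name, general category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


# Compatibility decomposition splits SARA AM into NIKHAHIT + SARA AA.
decomposed = unicodedata.normalize("NFKD", WORD)

for row in describe(decomposed):
    print(row)
```

After decomposition, two of the characters are nonspacing marks (category Mn) that both belong above the same base consonant, so something has to stack them. As the answer above says, that step has to happen when the PDF is generated.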

How Does a PDF Store Text

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the resulting document isn't OCR'ed in this particular scenario the way it would be if the original PDF document had been created from an image.
To get a bit more detailed, I am trying to understand how text extractors for PDFs work. My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the "image" of the document to a page or a printer, and that there was no actual text contained within the document itself. Subsequently, I was thinking that a text extractor might reverse-engineer such instructions to generate the text that the PDF would otherwise generate. I am not confident of this, though.
PDF contains several different types of objects, not only vector or raster drawing instructions. Text in particular is represented by text elements. These include a string of characters that should be drawn at certain positions using a specific font.
Text extraction from PDFs can be a complicated affair because the file format is oriented for page layout. A text element may be an entire paragraph, or a single character. Even a single word may consist of several text elements if different typefaces are mixed. Also, the characters are not necessarily encoded in a standard encoding such as Unicode. They may be encoded in a way specific to a particular font.
If you are lucky enough to deal with Tagged PDF files, such as PDF/UA or the level A conformance variants of PDF/A, text extraction can be a lot easier because text spans are identified as such and a mapping to Unicode characters is defined.
Wikipedia doesn't have the complete specification but does serve as an introduction: http://en.wikipedia.org/wiki/Portable_Document_Format#Text
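As a rough illustration of what a text extractor starts from, the sketch below pulls the strings out of the Tj and TJ text-showing operators of an already decompressed content stream. It is heavily simplified: it assumes literal strings without nested parentheses, octal escapes or "]" characters, ignores hex strings and positioning, and treats the bytes as readable text, which the answer above warns is not guaranteed. The names are made up for this example.

```python
import re

# A PDF literal string without nested parentheses (a simplification).
STRING = r"\((?:\\.|[^\\()])*\)"
# Either "(text) Tj" or "[(te) -20 (xt)] TJ".
SHOW = re.compile(rf"({STRING})\s*Tj|\[([^\]]*)\]\s*TJ")


def unescape(literal):
    """Strip the parentheses and resolve \\( \\) \\\\ escapes only."""
    return re.sub(r"\\(.)", r"\1", literal[1:-1])


def extract_text_elements(content_stream):
    """Return one string per text-showing operator, in stream order."""
    elements = []
    for match in SHOW.finditer(content_stream):
        if match.group(1):  # simple Tj operator
            elements.append(unescape(match.group(1)))
        else:               # TJ array: strings interleaved with kerning numbers
            parts = re.findall(STRING, match.group(2))
            elements.append("".join(unescape(p) for p in parts))
    return elements


stream = "BT /F1 12 Tf 72 720 Td (Hello) Tj [(W) 80 (orld)] TJ ET"
print(extract_text_elements(stream))
```

Nothing in the stream says that the two elements are separate words. An extractor has to infer that from their positions, and it has to map the string bytes to Unicode through the font's encoding or a ToUnicode map when the bytes are not plain text.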

PDF Text Direction

How is text direction for right-to-left languages, like Arabic, encoded in PDF? My understanding is that since PDF is fundamentally a graphical format, the concept of text-direction doesn't need to really be encoded. Rather, the glyphs simply need to be painted on-screen from right to left. However, the PDF reference manual mentions an attribute called WritingMode, where you can specify combinations left-to-right, right-to-left and top-to-bottom, bottom-to-top.
So my questions are:
(1) If my understanding is correct, and RTL or LTR is merely expressed by the way the glyphs are painted on-screen, what is the point of the WritingMode attribute?
(2) If there is no actual directionality information encoded in the PDF file, other than the order the glyphs are painted, how does a PDF-to-Text program know if a given line is supposed to be read right-to-left or left-to-right? (I suppose the PDF program could just check if the Unicode codepoints extracted from a ToUnicode map fall into a range that corresponds to an RTL language.)
WritingMode is only for Tagged PDF, if I'm reading the spec correctly. If a PDF doesn't contain the appropriate logical structure, you don't get WritingMode.
The general answer, as I understand it, is "it depends". In R-L writing, you probably have the text advance info encoded in the font and a single text placement will advance the text to the right place. I say 'probably' because it might be that the actual generation software ignores this and places each glyph on its own, regardless of the text advance in the font. Then you get fun languages like Arabic and Hebrew which aren't strictly R-L, as numbers are still L-R within a R-L line.
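The guess in question (2), looking at the extracted Unicode code points to decide whether a line is right-to-left, could be sketched like this. It is a crude majority vote over Unicode bidirectional classes using Python's standard unicodedata module, not the full Unicode bidirectional algorithm, and the function name is made up for illustration.

```python
import unicodedata


def dominant_direction(text):
    """Guess 'rtl', 'ltr' or 'neutral' from the bidi classes of the characters."""
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in ("R", "AL"))
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) == "L")
    if rtl == 0 and ltr == 0:
        return "neutral"  # only digits, spaces, punctuation, ...
    return "rtl" if rtl > ltr else "ltr"


print(dominant_direction("\u05e9\u05dc\u05d5\u05dd 123"))  # Hebrew letters plus digits
print(dominant_direction("Hello 123"))
```

Digits count for neither side here, which matches the remark above that numbers still run left to right inside a right-to-left line.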
Text direction will be set in the Trm