What governs the text selection order of PDFs, and how can it be improved when generating PDFs?

A number of PDFs, particularly those exported by presentation software, desktop publishing tools, or LaTeX typesetting, seem to have an illogical text selection marquee order.
For example, selecting parts of a math equation in one of my documents seems to randomly select another large block of equations elsewhere on the page, even though they are separated by body text. Is this a problem in the PDF viewer (macOS Preview) or in the PDF file itself? What procedures should be followed when programmatically generating PDFs to ensure a logical ordering for text selection?

Text selection in PDF viewers is determined by an algorithm in the viewer. Different viewers will have different algorithms and yield different results. Some viewers will leverage the structure tags if they are present, others will ignore the tags even when present.
Unfortunately, there is nothing you can do as the PDF author to control how any particular viewer groups the text rendering instructions into words, then into blocks of text, then into page regions, and finally into a text selection.

Related

How does Adobe Acrobat break words in PDF documents when copying text?

PDF documents don't require space characters to be present in the page content streams to visually break words. As a consequence, a glyph for the space character may be missing from font programs as well. PDF-compliant viewers appear to use font metrics and text state to infer an appropriate word spacing width and check it against character positions to add missing spaces when selecting or copying text. Unfortunately, the PDF specification does not appear to say clearly enough how the word spacing width should be computed in such cases. While pdf.js appears to hard-code a size for tracking word breaks, my empirical tests suggest that Acrobat Reader/Pro uses a different approach. What could that heuristic be?
The question is very technical, and answering it requires either insider knowledge of Adobe Acrobat internals or experience implementing text extraction from PDF documents with a robust set of test cases compared against Adobe's results. For anyone interested: assuming a robust word-break algorithm for text extraction can be implemented by inferring a spacing width and comparing it against glyph locations, the heuristic I'm currently testing is the following:
unscaledSpacingWidth = (average of non zero glyph widths obtained from /W or /Widths arrays) / 7
Here 7 is an arbitrary constant that seems to work well and matches Adobe Acrobat's results closely enough in the limited set of samples I tested. By comparison, the solution in pdf.js just picks a hard-coded value of 0.1 PDF points.
The resulting spacing width is then scaled according to the font size and other text state parameters.
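As a rough illustration of the heuristic above, here is a minimal Python sketch. The function names, the `(char, x_start, x_end)` glyph tuples, and the scaling step are assumptions made for this example: the scaling assumes widths expressed in thousandths of a text space unit (as in simple fonts) and only accounts for font size and horizontal scaling. It is not how Acrobat or pdf.js is known to work.

```python
from typing import List, Sequence, Tuple

# Arbitrary constant taken from the heuristic described above.
SPACING_DIVISOR = 7


def unscaled_spacing_width(widths: Sequence[float]) -> float:
    """Average of the non-zero glyph widths (from /W or /Widths) divided by 7."""
    nonzero = [w for w in widths if w != 0]
    if not nonzero:
        return 0.0
    return (sum(nonzero) / len(nonzero)) / SPACING_DIVISOR


def scaled_spacing_width(widths: Sequence[float], font_size: float,
                         horizontal_scaling: float = 100.0) -> float:
    """Scale the threshold to text space.

    Assumption: widths are in 1/1000 of a text space unit and only font
    size and horizontal scaling (Tz, in percent) are taken into account.
    """
    unscaled = unscaled_spacing_width(widths)
    return unscaled / 1000.0 * font_size * (horizontal_scaling / 100.0)


def insert_spaces(glyphs: List[Tuple[str, float, float]], threshold: float) -> str:
    """Join glyphs on one baseline, adding a space where the gap exceeds threshold.

    Each glyph is a hypothetical (char, x_start, x_end) tuple, sorted left to right.
    """
    out = []
    prev_end = None
    for char, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > threshold:
            out.append(" ")
        out.append(char)
        prev_end = x_end
    return "".join(out)
```

A caller would compute the threshold once per font and text state, then pass it to `insert_spaces` for each run of glyphs that share a baseline.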

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm looking for some details about PDF files. It's most important for me that the files will be usable for a very long time, and if possible the OCR should be applied automatically to new files (which does not really seem to be possible with Adobe Acrobat...).
For that I've been looking at different solutions for OCRing my PDF files. I found three candidates that seem to do what they should (more or less). All three variants have their pros and cons, and they seem to take different approaches to storing data in PDF files. Let me explain:
A file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer), and after running a preflight script I'm able to see the text that is stored hidden:
A file OCRed with ABBYY FineReader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers:
But as far as I was able to tell, these files seem to have a background text layer that contains the OCRed text and lies underneath the image that is ultimately shown to the user. Unfortunately, this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
A file OCRed with Tesseract 4 (alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You have correctly found out that the approach of ABBYY FineReader differs from that of Adobe Acrobat and Tesseract:
ABBYY creates a page content stream in which the text is first drawn normally on the page and is then covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste. They are used only in the invisible rendering mode anyway.
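To make the two drawing orders concrete, here is a small Python sketch that only builds illustrative content stream fragments as strings. The resource names `Im0` and `F1`, the coordinates, and the helper functions are hypothetical, and the text is assumed to be encodable as a simple literal string. Real output from any of these tools is more involved.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def image_op(name: str, width: float, height: float) -> str:
    """Paint an image XObject scaled to width x height at the origin."""
    return f"q {width} 0 0 {height} 0 0 cm /{name} Do Q"


def text_op(font: str, size: float, x: float, y: float,
            text: str, invisible: bool) -> str:
    """Show a string; '3 Tr' selects text rendering mode 3 (draws nothing)."""
    mode = "3 Tr " if invisible else ""
    return f"BT {mode}/{font} {size} Tf {x} {y} Td ({escape_pdf_string(text)}) Tj ET"


def text_covered_by_image(text: str) -> str:
    """Visible text first, then the scan painted on top of it."""
    return "\n".join([
        text_op("F1", 12, 72, 700, text, invisible=False),
        image_op("Im0", 595, 842),
    ])


def image_then_invisible_text(text: str) -> str:
    """Scan first, then the text drawn in invisible rendering mode."""
    return "\n".join([
        image_op("Im0", 595, 842),
        text_op("F1", 12, 72, 700, text, invisible=True),
    ])
```

In both variants the text remains searchable and selectable because the text-showing operators are still present in the content stream; only what ends up visible on screen differs.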

Is there a way to change the order/way Acrobat selects text of a PDF?

I have a Visual Basic program that extracts text from a PDF and imports the text into Excel. It relies on reading the text like a human would, left to right across the page. However, there are places in this particular PDF where, if I select the text with my mouse by clicking and dragging straight across, Adobe starts to select/highlight words on the lines above and below before continuing to highlight across the page. This gives me data that I do not want or need. The page has renderable text and is not from a scanned document.
Is there a way to "reset" the way Adobe interprets the text on the PDF? Since the information on the left is far from the information on the right, it treats them almost like separate columns.
I've tried saving the PDF in different formats such as a txt or postscript and distilling to another PDF but they all seem to result in the same outcome. This is weird to me because I have other similar PDFs where this isn't an issue.
Any help or thoughts would be greatly appreciated, thanks.
As PDF (in its basic form) essentially means placing strings on a canvas, the concept of "sentence" or "reading order" is not built in.
In order to extract text, you would have to read out the bounding box of each piece of text, and then use some logic and heuristics to assemble your text based on the coordinates of the bounding boxes.
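One possible version of such a heuristic, as a minimal Python sketch: group the pieces of text into lines by their vertical position, then read each line left to right. The `TextBox` type, the `line_tolerance` value, and the assumption of a single-column page in default PDF user space (y growing upward) are all choices made for this example, not part of any particular library.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """Hypothetical piece of text with its bounding box in PDF user space."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def reading_order(boxes: List[TextBox], line_tolerance: float = 3.0) -> List[str]:
    """Return one string per visual line, top to bottom, left to right.

    Boxes whose bottoms lie within line_tolerance of a line's first box
    are treated as belonging to that line.
    """
    lines: List[List[TextBox]] = []
    # Highest y first, because y grows upward in default PDF user space.
    for box in sorted(boxes, key=lambda b: -b.y0):
        for line in lines:
            if abs(line[0].y0 - box.y0) <= line_tolerance:
                line.append(box)
                break
        else:
            lines.append([box])
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x0))
            for line in lines]
```

With this grouping, text on the far left and far right of the same row ends up on one line instead of being treated as separate columns; a multi-column layout would need an additional column-detection step.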
Things can be easier if the PDF is a structured (tagged) PDF, where the text content is additionally organized in a logical structure within the document. This is also the prime requirement for an accessible document. So, if your document is accessible, you can rely on the structure for the correct reading order.

How Does a PDF Store Text

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the resulting document isn't OCR'ed in this particular scenario the way it would be if the original PDF document had been created from an image.
To get a bit more detailed, I am trying to understand how text extractors for PDFs work. My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the "image" of the document to a page or a printer, and that there was no actual text contained within the document itself. Subsequently, I was thinking that a text extractor might reverse-engineer such instructions to generate the text that the PDF would otherwise generate. I am not confident of this, though.
PDF contains several different types of objects, not only vector or raster drawing instructions. Text in particular is represented by text elements. These include a string of characters that should be drawn at certain positions using a specific font.
Text extraction from PDFs can be a complicated affair because the file format is oriented for page layout. A text element may be an entire paragraph, or a single character. Even a single word may consist of several text elements if different typefaces are mixed. Also, the characters are not necessarily encoded in a standard encoding such as Unicode. They may be encoded in a way specific to a particular font.
If you are lucky enough to deal with Tagged PDF files, such as PDF/UA files or PDF/A files at conformance level A, text extraction can be a lot easier because text spans are identified as such, and a mapping to Unicode characters is defined.
Wikipedia doesn't have the complete specification but does serve as an introduction: http://en.wikipedia.org/wiki/Portable_Document_Format#Text

Enabling select and copy of text content in PDF

What can prevent PDF-1.4 document's content from being selectable and copyable?
I'm generating PDF-1.4 documents using TTF fonts, which are successfully embedded in it (see screenshot below).
Yet I can't select and copy the text from the document. I have studied the PDF-1.4 spec and found only one mention of copy-protecting the document, which has a prerequisite of first encrypting it. And I don't encrypt the document.
So, ideally, I'd like to discover an exhaustive list of reasons that can prevent the PDF text from being copied, and ways to control that.
There is only one reason: you are embedding your fonts partially. The information you are storing there is the minimum required for drawing the glyphs, but it is not enough to allow text extraction. For example, in Acrobat Professional, optimizing a file to reduce its size will have this effect, since everything that is not strictly required for presenting the content will be discarded.