How is hidden text stored in OCR-enhanced PDF files?

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF files. It's most important for me that the files remain usable for a very long time, and if possible the OCR should be applied automatically to new files (which does not seem to be really possible with Adobe Acrobat...).
For that I've been looking at different solutions for OCRing my PDF files. I found three candidates which seem to do what they should (more or less), but all three variants have their pros and cons, and there seem to be different approaches for how the data is stored in the PDF files for the three variants... Let me explain:
A file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer), and after running a preflight script I'm able to see the text which is stored hidden:
A file OCRed with ABBYY FineReader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers:
But as far as I was able to reproduce it, these files seem to have a background text layer which contains the OCRed text and which lies underneath the image that is shown to the user in the end. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
A file OCRed with Tesseract 4 (alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused... Does anyone know how these programs really store their hidden text information?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/

Does anyone know how these programs really store their hidden text information?
You have correctly found out that the approach of ABBYY FineReader differs from that of Adobe Acrobat and Tesseract:
ABBYY creates a page content stream in which the text is first drawn normally on the page and is then covered by the scanned image.
Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly on top of it (using text rendering mode 3, which draws nothing).
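In page description terms this boils down to the order of the operations and the text rendering mode. A minimal sketch (operand values simplified; the resource names /Im0 and /F0 and the coordinates are placeholders, not taken from the sample files):
% ABBYY-style: text drawn normally, then covered by the scanned image
BT /F0 10 Tf 72 700 Td (recognized text) Tj ET
q 612 0 0 792 0 0 cm /Im0 Do Q
% Acrobat/Tesseract-style: image first, then invisible text (3 Tr = text rendering mode 3)
q 612 0 0 792 0 0 cm /Im0 Do Q
BT 3 Tr /F0 10 Tf 72 700 Td (recognized text) Tj ET
Either way the text is present in the content stream, which is why search and "Remove hidden information" find it in all three files.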
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts, for which a PDF viewer has its own font programs to render them as normal glyphs.
Tesseract uses a font called GlyphLessFont, for which it embeds a font program into the result file. When rendered, the glyphs of this font do not show as normal Latin glyphs but merely as empty space.
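In the file itself that difference shows up in the font dictionaries. A simplified sketch (object numbers are invented for illustration; the real files carry more keys, and Tesseract's font is actually a composite font, shown here in reduced form):
% standard 14 font: no embedded font program, the viewer supplies Helvetica itself
3 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> endobj
% embedded font: the font descriptor points to a FontFile2 stream carried inside the PDF
4 0 obj << /Type /Font /Subtype /TrueType /BaseFont /GlyphLessFont /FontDescriptor 5 0 R >> endobj
5 0 obj << /Type /FontDescriptor /FontName /GlyphLessFont /Flags 4 /FontFile2 6 0 R >> endobj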
Considering the visual effect you observed for the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste; they are only used in the invisible rendering mode anyway.

Related

PDF cannot display Chinese fonts in table of contents

I made a PDF file from LaTeX (using Texmaker).
Acrobat Reader is able to display BOTH the text and the table of contents in Linux.
But Acrobat Reader is unable to display the table of contents in Windows XP (the Chinese characters came out as boxes). However, the text is displayed correctly.
I tried to embed the fonts into the PDF, but the various methods are not 100% successful, so I'm not sure whether the fonts are embedded correctly or not. In any case, the table of contents remains unreadable in Windows.
I wonder whether it is really a font embedding problem? Or do I need to install these "Adobe Reader X Font Packs":
https://www.adobe.com/support/downloads/detail.jsp?ftpID=4883
My concern is that I'd like my PDF to be readable in Windows, including the table of contents (and preferably without further installations). If this is possible...
I suspect you are talking about "bookmarks" and not saying that part of the text in the document is OK and part is not. PDF bookmarks are part of the UI of the application and are not rendered from embedded fonts. Therefore, the system you are running on needs to know how to handle fonts in the language(s) of choice.
See https://forums.adobe.com/thread/1144972?start=0&tstart=0
Embedding the fonts will have no effect on the bookmarks.
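For illustration (object numbers invented), a bookmark lives in the document's outline tree as a plain text string, not as glyphs typeset with a document font:
10 0 obj << /Type /Outlines /First 11 0 R /Last 11 0 R /Count 1 >> endobj
11 0 obj << /Title (chapter title as a plain text string) /Parent 10 0 R /Dest [3 0 R /Fit] >> endobj
The /Title value is just a text string (PDFDocEncoding or UTF-16), and the viewer draws it with its own UI fonts, which is why embedding fonts in the document has no effect on how the bookmarks are displayed.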

How to troubleshoot badly rendered PDF file

I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advice to do this and to do that, which helps some people and not others (nothing helped me). It looks like shooting in the dark to me.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World?
EDIT
A bit of context: I am trying to convert a trivial HTML file to PDF using the wkhtmltopdf tool. It is an absolute frustration, because according to Internet searches the tool is supposed to work, and do it quite well. But the thing does not work for me and nothing I do changes this! Unfortunately, this tool seems to be the only free tool to convert HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Adobe, and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this; Acrobat is very, very smart in the way it handles PDF files and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. Its goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now, and to ensure that, it tests a whole range of properties of the file itself and of the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct, for example), but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless, I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google Docs - I really have no clue what that might do) that apparently went berserk.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes and with the PDF specification (made available by Adobe here) at hand in a text viewer (for a first impression) and (if the cause of the issue at hand is not immediately visible) then in a PDF browsing tool (for in-depth analysis).
This step is quite cumbersome at first but after some time you learn your way around in the PDFs.
One example of such a PDF browsing tool is RUPS, but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed contains only vector drawing operations, which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is 1,570 bytes, so this task does not look overly huge.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3,875 bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines contain 12 different groups of drawing commands, where each one represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator that actually fills black color into the (closed) path constructed so far. Nothing in the other lines (which I didn't analyze in detail) does any drawing that may resemble the shapes of any glyphs.
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1.
Assuming your "free" is meant as free as in liberty, then an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job in converting one or multiple HTML documents into a single PDF book containing chapters, page-numbering, page headers and footers and more. For all options available, see its full documentation.
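For example (file names are placeholders), converting a single HTML page or building a multi-chapter book looks roughly like this on the command line:
htmldoc --webpage -f output.pdf input.html
htmldoc --book --toclevels 3 -f book.pdf chapter1.html chapter2.html
The --webpage mode formats each input file as a continuous page sequence, while --book adds a title page and a table of contents built from the document headings.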
2.
Assuming your "free" is meant as free as in beer, then an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job when it comes to supporting almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.
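For example (again, file names are placeholders), a basic conversion with Prince is a one-liner, optionally with an extra stylesheet applied:
prince input.html -o output.pdf
prince -s print.css input.html -o output.pdf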

Enabling select and copy of text content in PDF

What can prevent PDF-1.4 document's content from being selectable and copyable?
I'm generating PDF-1.4 documents using TTF fonts, which are successfully embedded in it (see screenshot below).
Yet I can't select and copy the text from the document. I have studied the PDF-1.4 spec and found only one mention of copy-protecting the document, which has a prerequisite of first encrypting it. And I don't encrypt the document.
So, ideally, I'd like to discover an exhaustive list of reasons, that can prevent the PDF text from being copied, and ways to control that.
There is only one reason: you are embedding your fonts only partially. The information you are storing there is the minimum required for drawing the glyphs, but it is not enough to allow text extraction. For example, in Acrobat Professional, optimizing a file to reduce its file size will have this effect, since everything that is not strictly required for presenting the content will be discarded.
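One way to see this in the file is to look at the font dictionaries. A simplified sketch (object numbers and the subset prefix are invented; real dictionaries carry more keys): a font that supports text extraction typically carries a /ToUnicode CMap and/or a standard /Encoding, while a minimal drawing-only subset does not:
% extraction-friendly: glyph codes can be mapped back to Unicode
7 0 obj << /Type /Font /Subtype /TrueType /BaseFont /ABCDEF+SomeFont /Encoding /WinAnsiEncoding /ToUnicode 8 0 R /FontDescriptor 9 0 R >> endobj
% drawing-only: no /ToUnicode and no usable /Encoding, so viewers can paint the glyphs but cannot tell which characters they represent
7 0 obj << /Type /Font /Subtype /TrueType /BaseFont /ABCDEF+SomeFont /FontDescriptor 9 0 R >> endobj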

How can I programmatically verify that a PDF file is first-generation?

I'm working on a project that involves the Fannie Mae/Freddie Mac Uniform Appraisal Dataset. The specification requires that the embedded appraisal PDF file be first-generation.
I understand conceptually what a first-generation PDF file is (printing of a document directly to PDF, rather than a scanned copy or printed and scanned copy). However, I've done some research and haven't found anything that specifies the properties of a first-generation PDF that could be verified programmatically.
I found a product that allows one to check if a PDF contains text, images, or both: Aspose.Pdf.Kit for .NET, but I'm looking for a way to program this myself, for budgetary and other reasons. Also, I'm not sure that determining that the file contains text will be sufficient to verify that it's first-generation.
Given that this is an industry requirement of a very large industry, I feel like someone must have already tackled this issue, but I'm having a hard time finding anything.
Thanks in advance for any help.
There is no way to know for certain if a PDF is "first generation". Technically, a scanned PDF is just a PDF that contains images and perhaps OCR'ed text on top of that. A "first generation" PDF could easily have the same characteristics, so you have to use some heuristics.
For example, a PDF that contains only images and invisible text (from OCR) is likely to be scanned, a PDF that has visible text or vector graphics is probably "first generation" (OCR for scanned PDFs works by overlaying invisible text on top of the original image, so that text selection works, but the original document's fidelity is preserved).
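In content-stream terms (a rough sketch with simplified operands; /Im0, /F0 and the coordinates are placeholders), the scanned-and-OCRed case typically shows one page-sized image followed by invisible text, while a "first generation" page contains visible text and/or vector drawing:
% typical scanned + OCR page: full-page image, then text in rendering mode 3 (invisible)
q 612 0 0 792 0 0 cm /Im0 Do Q
BT 3 Tr /F0 10 Tf 72 700 Td (ocr result) Tj ET
% typical "first generation" page: visible text (default rendering mode 0) and vector operators
BT /F0 12 Tf 72 700 Td (visible report text) Tj ET
72 600 m 540 600 l S
Scanning the page content streams (and the image XObjects they reference) for these patterns is one practical way to implement the heuristic described above.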
Open the PDF, press Ctrl+F and search for "Appraisal". If you get a hit for the word, you have a first-generation appraisal; or rather, the dataset exists.

Processing Illustrator or pdf files into XAML

What are the alternatives for processing Illustrator files or PDFs into XAML? My current workflow works like this:
Open the PDF file in Adobe illustrator
Save the file as .ai (Adobe Illustrator) file
Open in Expression Design
Do some processing, mainly separating elements to layers and removing unneeded parts.
Save as XAML
Add XAML to Blend project
My only problem is that this way the text gets converted to paths. I would like to keep my text in XAML as well instead of paths.
Is there any other way to do this, so I keep the text? Any other tools?
I think what you want is to have Glyphs elements instead of Paths.
The problem is that Glyphs elements require you to specify the URI of the font file. Also, Glyphs elements reference glyphs by their index into a font file (it may happen that a converter that generates Glyphs elements - like the Microsoft XPS Document Writer - uses indices into font subset files: so these indices may not be the right indices to the same glyphs as defined in the original font file). I have been able to "solve" this problem in two ways with my own PDF to XAML conversion tools.
First approach: embed the font subset file, BASE64 coded, in the generated XAML code and have the application implement a class that, upon loading, extracts and decodes the embedded font subset file to a temporary location and hands a valid URI to that temporary file back to the XAML loader.
Or, second approach: have most font files already installed along with my application and, again, add some support in my application that replaces the font name with a URI to the installed font file upon loading of the XAML code. The problem with this second approach is that the glyph indices need to be correctly mapped to the installed font file, which may not be all that trivial to do. (You can find a link to an example file that has been generated for this way of loading on my blog: in particular take a peek at the file truncatedcone-xaml.txt)
In short: both solutions require a special PDF to XAML converter and support by the loading application. The reason I wanted to do it this way instead of just having my PDFs converted to Paths only is that my application is a shared whiteboard: thus I want my vector graphics to be as small as possible. (Conversion to paths tends to blow up the XAML code by a factor of 10 or more in most cases).
I am contemplating the implementation of a third approach: this would consist in generating the outline for every glyph that is used only once, and then adding support in my application to transform and position these glyph outlines in a way closely analogous to what the otherwise-generated Glyphs elements would do. The advantage would be that the generated XAML would still be relatively small (comparable to the second approach described above) without requiring the relevant font files to be installed along with the application and without having to map glyph indices from a subset file to the installed font file. The reason I have not yet tried to implement this in earnest is twofold: first, my current (second) approach already works very well for what I currently need; second, there might be performance problems with this third approach as regards loading and/or rendering.
There's a (free) Adobe Illustrator plugin to export to XAML. Not sure it does exactly what you are looking for, though.
Find it at http://www.mikeswanson.com/XAMLExport/
Well an XPS file is actually a ZIP file. So if you open it with a ZIP-archiver or if you rename its extension to ZIP you can see what is inside. It already contains the pages as XAML code (those files have the form [pagenumber].fpage). However, that XAML code may refer to other files (like raster images and font subset files, those are typically odttf files - basically encrypted true type files) that are included in that ZIP archive as well. Which means, that the XAML code that you find in an XPS document may not be directly usable as pure XAML in your application. I have written python scripts to do the conversion of XAML taken from XPS documents (generated by the Microsoft XPS Document Writer) to get XAML files that my application can load (see approaches 1 and 2 above). I could send you copies of those python scripts (they are not particularly great code, which is no problem for me since I am now using a different approach to convert PDFs to XAML anyway).
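For example (the file name is a placeholder, and this assumes a command-line unzip is available), you can list the parts of an XPS file without even renaming it, which shows the .fpage XAML parts alongside the .odttf font parts and any raster images:
unzip -l document.xps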
#gyurisc: Keeping the font file should work, but keeping the text might turn out to be a problem because, you see, glyphs are not characters. It might be that you could figure out the character by examining the font file that a given glyph is part of, but that would involve parsing the font file. If you are unlucky, your PDF to XPS converter does not even keep enough information in the font subset files to figure out the character a given glyph (very likely) represents.
For example: If I convert a PDF file to XPS with the help of Microsoft's XPS Document Writer, and then try to select a piece of text from that XPS document, I can (only apparently) copy it to the clipboard. However, if I then paste it back into a Word document, I get garbage. Whereas if I select that same piece of text in the original PDF document and paste it into the same Word document, I get reasonably meaningful text. So Microsoft's XPS Document Writer apparently does not care about the interpretation of a "glyph run" as text, and thus it seems very likely to me that the link between the glyph indices that one finds in the generated XPS code and the characters they are meant to represent is already broken at that point. (But, admittedly, that's just a guess.)
A representation of text (as opposed to a run of glyphs) would be a TextBlock element in XAML, I suppose. However, my guess is that a typical PDF to XPS converter is unlikely to generate TextBlock elements. XPS is mainly meant to be rendered - on screen or on paper - it doesn't suggest itself as a file format that is particularly suitable for data exchange (exchange of text in your case).