What indicates that the font is embedded in the PDF file? - pdf

My problem is the following:
I have a PDF file, and I check for numerous things about its text, pictures, etc. But I still can't manage to find an answer how to figure it out, if the actual font, which is used is embedded or not.
What indicates it in a PDF file? How are the extractors, or the PDF readers checking it? (Since they can tell in the properties of the file that it has xy embedded fonts.)
Any help or tip would be much appreciated, since I am getting lost in between these PDF technology pages without any luck.

Every font in the PDF file (except base 14 fonts) must have a FontDescriptor dictionary. This dict contains a lot of information about the font, and amongst other things can contain keys that contain embedded font files ("FontFile", "FontFile2" and "FontFile3"). The presence of these keys indicates whether the font is embedded or not.

Related

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Who wants to continue my work can have a look on my source-files https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF Files. It's most important for me that the files will be usable for a very long time and if possible the OCR should be automatically applied for new files (which seems to be not really possible with Adobe Acrobat...).
For that I've been looking for different solutions how to OCR my PDF Files. I found three candidates which seems to be doing what they should do... (more or less). But all three variants have their pro&cons... But there seem to be different approaches how to store data in PDF Files.... for all three Variants... Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
a File OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default adobe preflight-script as it does not display any additional layers:
But far as I was able to reproduce these Files seems to have a Background-Text-Layer, which contains the OCRed Text, which is the underlying layer for the Image that is shown to the user at the end. Unfortunately this seems to be loaded separately and this is confusing while opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You correctly have found out that the approach of Abby Finereader is different from that of Adobe Acrobat and of Tesseract:
Abby creates a page content stream in which first the text is drawn normally on the page and eventually covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font GlyphLessFont it embeds a font program for into the result file. When rendered the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.

Copy text from PDF with custom FONT

I am trying to copy some text from a PDF. But When I paste it in a word file, it is just some garbage. Something like മുഖവുര. The PDF is in Malayalam language. When I see File->Properties->Fonts, It says BRHMalayalam (Embedded Subset) as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones on the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file, and confirming the answer already given by #KenS, the problem with this PDF document is in fact how it's constructed. Or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set as WinAnsiEncoding, which completely doesn't make sense.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no other choice when copying and pasting to assume the characters in the PDF are indeed Win Ansi encoding which means you end up with (garbled) latin characters.
Just convert the pdf file to word file and then edit or copy or modify the text present in the file simple :)
and after completion go to file -> save as -> and change the format of doc to pdf ..hope u understood :)

Enabling select and copy of text content in PDF

What can prevent PDF-1.4 document's content from being selectable and copyable?
I'm generating PDF-1.4 documents using TTF fonts, which are successfully embedded in it (see screenshot below).
Yet I can't select and copy the text from the document. I have studied the PDF-1.4 spec and found only one mention of copy-protecting the document, which has a prerequisite of first encrypting it. And I don't encrypt the document.
So, ideally, I'd like to discover an exhaustive list of reasons, that can prevent the PDF text from being copied, and ways to control that.
There is only one reason, you are embedding your fonts partially. The information you are storing there is the minimum required for drawing the glyphs, but it is not enough for allowing text extraction. For example, in Acrobat Professional, optimizing a file for reducing file size will have this effect, since everything that is not strictly required for presenting the content will be discarded.

How can I programmatically verify that a PDF file is first-generation?

I'm working on a project that involves the Fannie Mae/Freddie Mac Uniform Appraisal Dataset. The specification requires that the embedded appraisal PDF file be first-generation.
I understand conceptually what a first-generation PDF file is (printing of a document directly to PDF, rather than a scanned copy or printed and scanned copy). However, I've done some research and haven't found anything that specifies the properties of a first-generation PDF that could be verified programmatically.
I found a product that allows one to check if a PDF contains text, images, or both: Apose.Pdf.Kit for .NET, but I'm looking for a way to program this myself, for budgetary and other reasons. Also, I'm not sure that determining that the file contains text will be sufficient to verify that it's first-generation.
Given that this is an industry requirement of a very large industry, I feel like someone must have already tackled this issue, but I'm having a hard time finding anything.
Thanks in advance for any help.
There is no way to know for certain if a PDF is "first generation". Technically, a scanned PDF is just a PDF that contains images and perhaps OCR'ed text on top of that. A "first generation" PDF could easily have the same characteristics, so you have to use some heuristics.
For example, a PDF that contains only images and invisible text (from OCR) is likely to be scanned, a PDF that has visible text or vector graphics is probably "first generation" (OCR for scanned PDFs works by overlaying invisible text on top of the original image, so that text selection works, but the original document's fidelity is preserved).
Open pdf, ctrl "f" type in Appraisal. If you have a hit for the word, you have a first generation apprsl. Rather, the dataset exist.

Processing Illustrator or pdf files into XAML

What are the alternatives to process illustrator files or PDFs into XAML. My Current workflow works like this:
Open the PDF file in Adobe illustrator
Save the file as .ai (Adobe Illustrator) file
Open in Expression Design
Do some processing, mainly separating elements to layers and removing unneeded parts.
Save as XAML
Add XAML to Blend project
My only problem is that this way the text gets converted to paths. I would like to keep my text in XAML as well instead of paths.
Is there any other way to do this, so I keep the text? Any other tools?
I think what you want is to have Glyphs elements instead of Paths.
The problem is that Glyphs elements require you to specify the URI of the font file. Also, Glyphs elements reference glyphs by their index into a font file (it may happen that a converter that generates Glyphs elements - like the Microsoft XPS Document Writer - uses indices into font subset files: so these indices may not be the right indices to the same glyphs as defined in the original font file). I have been able to "solve" this problem in two ways with my own PDF to XAML conversion tools.
1. approach: Embed the font-subset file, BASE64 coded, in the generated XAML code and have the application implement a class that, upon loading, extracts and decodes an embedded font-subset file to a temporary location and hands a valid URI to that temporary file back to the XAML loader.
or, 2. approach: Have most font files already installed along with my application and, again, adding some support by my application that replaces the font name by an URI to the installed font file upon loading of the XAML code. The problem with this second approach is that glyph indices need to be correctly mapped to the installed font file, which may not be all that trivial to do. (You can find a link to an example file that has been generated for this way of loading on my blog: in particular take a peek at the file truncatedcone-xaml.txt)
In short: both solutions require a special PDF to XAML converter and support by the loading application. The reason I wanted to do it this way instead of just having my PDFs converted to Paths only is that my application is a shared whiteboard: thus I want my vector graphics to be as small as possible. (Conversion to paths tends to blow up the XAML code by a factor of 10 or more in most cases).
I am contemplating the implementation of a third approach: this would consist in generating the outline for every glyph that is used only once and then add support by my application to transform and position these glyph outlines in a way closely analogous to what Glyphs elements do that would otherwise have to be generated. The advantage would be that the generated XAML would still be relatively small (comparable to the second approach described above) without requiring the relevant font files to be installed along with the application and without having to map glyph indices from a subset file to the installed font file. The reason I have not yet tried to implement this in earnest is twofold: first, my current (second) approach already works very well for what I currently need; second, there might be performance problems with this third approach as reagards loading and / or rendering.
There's a (free) Adobe Illustrator plugin to export to XAML. Not sure it does exactly what you are looking for, though.
Find it at http://www.mikeswanson.com/XAMLExport/
Well an XPS file is actually a ZIP file. So if you open it with a ZIP-archiver or if you rename its extension to ZIP you can see what is inside. It already contains the pages as XAML code (those files have the form [pagenumber].fpage). However, that XAML code may refer to other files (like raster images and font subset files, those are typically odttf files - basically encrypted true type files) that are included in that ZIP archive as well. Which means, that the XAML code that you find in an XPS document may not be directly usable as pure XAML in your application. I have written python scripts to do the conversion of XAML taken from XPS documents (generated by the Microsoft XPS Document Writer) to get XAML files that my application can load (see approaches 1 and 2 above). I could send you copies of those python scripts (they are not particularly great code, which is no problem for me since I am now using a different approach to convert PDFs to XAML anyway).
#gyurisc: Keeping the font file should work but keeping the text might turn out to be a problem, because, you see, glyphs are not characters. It might be that you could figure out the character by examining the font file that a given glyph is part of, but that would involve parsing the font file. If you are unlucky, your PDF to XPS converter does even not keep enough information in the font subset files to figure out the character a given glyph (very likely) represents.
For example: If I convert a PDF file to XPS with the help of Microsoft's XPS Document Writer, and then try to select a piece of text from that XPS document, I can (only apparently) copy it to the clipboard. However, if I then paste it back into a Word document, I get garbage. Whereas if I select that same piece of text in the original PDF document and paste it into the same Word document, I get reasonably meaningful text. So Microsoft's XPS Document Writer apparently does not care about the interpretation of a "glyph run" as text, and thus it seems very likely to me that the link between the glyph indices that one finds in the generated XPS code and the characters they are meant to represent is already broken at that point. (But, admittedly, that's just a guess.)
A representation of text (as opposed to a run of glyphs) would be a TextBlock element in XAML, I suppose. However, my guess is that a typical PDF to XPS converter is unlikely to generate TextBlock elements. XPS is mainly meant to be rendered - on screen or on paper - it doesn't suggest itself as a file format that is particularly suitable for data exchange (exchange of text in your case).