Is there a way to change the order/way Acrobat selects text of a PDF? - vba

I have a Visual Basic program that extracts text from a PDF and imports the text into Excel. It relies on reading the text like a human, left to right across the page. However, there are places in this particular PDF where, if I select text with my mouse by clicking and dragging straight across, Adobe starts to select/highlight words on the lines above and below before continuing to highlight across the page. This gives me data that I do not want or need. The page has renderable text and is not from a scanned document.
Is there a way to "reset" the way Adobe interprets the text on the PDF? Since the information on the left is far from the information on the right, it treats them almost like separate columns.
I've tried saving the PDF in different formats, such as TXT or PostScript, and distilling it to another PDF, but they all seem to produce the same outcome. This is odd to me because I have other similar PDFs where this isn't an issue.
Any help or thoughts would be greatly appreciated, thanks.

As PDF (in its basic form) essentially means placing strings on a canvas, the concept of "sentence" or "reading order" is not built in.
In order to extract text, you would have to read out the bounding box of the piece of text, and then use some logic and heuristics to assemble your text based on the coordinates of the bounding box.
Things can be easier if the PDF is a structured (tagged) PDF, where the text content is embedded together with its logical structure. This is also the prime requirement for an accessible document. So, if your document is accessible, you can rely on the structure for the correct reading order.
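A minimal sketch of the coordinate-based heuristic described above, assuming the text fragments and their bounding-box positions have already been extracted by some PDF library. The `Fragment` type, the `assemble_lines` function, and the `y_tolerance` value are hypothetical names and settings chosen for illustration, not part of any real API.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the bounding box
    y: float  # vertical position; larger y is higher on the page (PDF convention)


def assemble_lines(fragments, y_tolerance=2.0):
    """Group fragments into rows by y position, then read each row left to right."""
    rows = []
    # Walk the fragments from the top of the page to the bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        # A fragment joins the current row if it is close to that row's first fragment.
        if rows and abs(rows[-1][0].y - frag.y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order fragments by their left edge.
    return [" ".join(f.text for f in sorted(row, key=lambda f: f.x)) for row in rows]


if __name__ == "__main__":
    sample = [
        Fragment("Total", 50, 700),
        Fragment("Name", 50, 720),
        Fragment("42.00", 400, 700.5),
        Fragment("Amount", 400, 719),
    ]
    for line in assemble_lines(sample):
        print(line)
```

The tolerance would probably need tuning per document, since text on the "same" line can sit at slightly different y values, which may be what makes a viewer treat widely separated left and right parts as separate columns.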

Related

Text changed to graphics, still selectable in PDF?

I have this PDF ebook with selectable text - the handwriting - but there is no such font embedded and the letters are all different, so it's not actually a font. How is this possible?
I've worked with CorelDraw and Adobe Acrobat, but I can't understand how this works.
The left side of the picture shows the document properties, the right side shows a page of the PDF file and I selected the last 3 rows. I can copy and paste that to a text file, no problem. How was this achieved?
There are a few possibilities, but the most likely is that the text was converted to outlines/paths or vectors. Some software, such as Adobe InDesign and other print design apps, allows you to 'flatten' font-based text into vectors or paths, meaning the original font doesn't need to be embedded or installed on the system. The original text data is, however, still present and can be copied into a text field or word processor.

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Who wants to continue my work can have a look on my source-files https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF Files. It's most important for me that the files will be usable for a very long time and if possible the OCR should be automatically applied for new files (which seems to be not really possible with Adobe Acrobat...).
To that end I've been looking at different ways to OCR my PDF files. I found three candidates that seem to do what they should (more or less), and each has its pros and cons. However, the three variants seem to take different approaches to storing data in the PDF file. Let me explain:
A file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
A file OCRed with ABBYY FineReader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default adobe preflight-script as it does not display any additional layers:
But as far as I was able to reproduce, these files seem to have a background text layer that contains the OCRed text and lies underneath the image that is ultimately shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
A file OCRed with Tesseract 4 (alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You have correctly found out that the approach of ABBYY FineReader is different from that of Adobe Acrobat and of Tesseract:
ABBYY FineReader creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the ABBYY FineReader result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.
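The two drawing orders described above can be told apart with a rough sketch like the following, assuming you already have a page's content stream decoded into plain text by some PDF library. The function name `describe_hidden_text` and its return labels are hypothetical. Splitting on whitespace is a simplification that would misread string literals containing spaces or operator names, and `Do` can also paint a form XObject rather than an image.

```python
def describe_hidden_text(content: str) -> str:
    """Classify a decoded page content stream by how its text is hidden."""
    tokens = content.split()
    first_image = None  # index of the first "Do" (paint XObject) operator
    first_text = None   # index of the first "BT" (begin text) operator
    invisible = False   # True once "3 Tr" (text rendering mode 3) is seen
    for i, tok in enumerate(tokens):
        if tok == "Do" and first_image is None:
            first_image = i
        elif tok == "BT" and first_text is None:
            first_text = i
        elif tok == "Tr" and i > 0 and tokens[i - 1] == "3":
            invisible = True
    if first_text is None:
        return "no text"
    if invisible:
        return "invisible text (rendering mode 3)"
    if first_image is not None and first_text < first_image:
        return "text drawn first, then covered by image"
    return "visible text"


if __name__ == "__main__":
    image_then_text = "q 595 0 0 842 0 0 cm /Im0 Do Q BT 3 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET"
    text_then_image = "BT /F1 12 Tf 72 700 Td (Hello) Tj ET q 595 0 0 842 0 0 cm /Im0 Do Q"
    print(describe_hidden_text(image_then_text))
    print(describe_hidden_text(text_then_image))
```

The two sample streams are hand-written illustrations of the patterns, not excerpts from the linked files.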

Batch check Adobe Acrobat .pdf's for files containing rotated text

Does anybody know if there is a way to check whether a list of Adobe Acrobat .pdf files contain rotated text (any text not at 0 degrees)?
I thought this would be simple, but I'm struggling to find an answer.
I am using ABBYY Recognition Server to OCR thousands of files and the results are quite poor where the text is rotated. I need to get a list of files that have rotated text to allow me to perform some pre-processing on them.
I usually use iTextSharp for .pdf automation and modification but don't seem to be able to find anything for checking text rotation.
Thanks
You could achieve your goal by extracting all words from these PDFs and checking if any of the words is rotated.
I would recommend using a PDF library with higher-level abilities for the task. The Docotic.Pdf library is a good choice (of course, I am one of the developers of the library).
Here is an article that shows how to extract words from PDFs with extra info about their position etc.
Each extracted word comes in a PdfTextData object. PdfTextData has an IsTransformed property for checking whether the word is rotated, scaled, and/or flipped. You can also analyze PdfTextData.TransformationMatrix for more information about the transformation.
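As a library-independent illustration of what analyzing a transformation matrix can look like, here is a small sketch in Python. It assumes the matrix is available as the six PDF values (a, b, c, d, e, f), where a pure rotation by θ is (cos θ, sin θ, -sin θ, cos θ, 0, 0). The function names and the tolerance are hypothetical, and this is not the Docotic.Pdf API.

```python
import math


def rotation_degrees(a: float, b: float) -> float:
    """Angle of the transformed x axis, from the first two matrix entries."""
    return math.degrees(math.atan2(b, a))


def is_rotated(matrix, tolerance=0.5) -> bool:
    """True if the matrix turns text by more than `tolerance` degrees."""
    a, b = matrix[0], matrix[1]
    return abs(rotation_degrees(a, b)) > tolerance


if __name__ == "__main__":
    upright = (1, 0, 0, 1, 72, 700)         # translation only
    scaled = (2, 0, 0, 2, 72, 700)          # scaled but not rotated
    quarter_turn = (0, 1, -1, 0, 100, 100)  # rotated by 90 degrees
    for m in (upright, scaled, quarter_turn):
        print(m, is_rotated(m))
```

Only the direction of the x axis is examined here, so uniform scaling is not reported as rotation, while a horizontally mirrored matrix would show up as 180 degrees.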

How to find the total number of annotated boxes present in a pdf?

I have a PDF file that contains several annotations. If you look at the image, there are several boxes in yellow and beige. These boxes can be edited in Adobe Reader. Could anyone help me find out the total number of these boxes present in the PDF file using VBA?
Also, I tried converting the PDF to Word using VBA, but those boxes weren't present in the Word file, so it didn't work out.
Here is the pdf file: https://drive.google.com/file/d/0B7uN4B3mxUlZMjB1T3BuM0o1VGs/view?usp=sharing
The text in those boxes is always blue while other text is black. Maybe that could be used.
Another way would be to use pdfseparate from http://www.foolabs.com/xpdf/download.html, and count how often the string <</AP <</N occurs in the generated file.
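The string-counting idea could be sketched in Python as follows. This assumes the marker appears in the file in plain (uncompressed) form and that each match corresponds to one box; both are assumptions about this particular PDF. The function name `count_appearance_markers` is hypothetical, and the pattern tolerates optional whitespace between the tokens.

```python
import re

# Matches "<</AP <</N" with optional whitespace between the tokens.
MARKER = re.compile(rb"<<\s*/AP\s*<<\s*/N")


def count_appearance_markers(data: bytes) -> int:
    """Count occurrences of the appearance-dictionary marker in raw PDF bytes."""
    return len(MARKER.findall(data))


if __name__ == "__main__":
    # Hand-written sample bytes standing in for the contents of a generated file.
    sample = (
        b"1 0 obj <</AP <</N 5 0 R>>/Subtype/FreeText>> endobj "
        b"2 0 obj <</AP<</N 6 0 R>>/Subtype/FreeText>> endobj"
    )
    print(count_appearance_markers(sample))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. Annotations stored inside compressed object streams would not be found this way.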
Or you could convert the pdf to an image and then count the number of colored rectangles.
You could also use one of the commercial tools available for creating/editing pdfs e.g. http://www.pdflib.com/, I believe that one supports VBA.

How can I programmatically verify that a PDF file is first-generation?

I'm working on a project that involves the Fannie Mae/Freddie Mac Uniform Appraisal Dataset. The specification requires that the embedded appraisal PDF file be first-generation.
I understand conceptually what a first-generation PDF file is (printing of a document directly to PDF, rather than a scanned copy or printed and scanned copy). However, I've done some research and haven't found anything that specifies the properties of a first-generation PDF that could be verified programmatically.
I found a product that allows one to check if a PDF contains text, images, or both: Aspose.Pdf.Kit for .NET, but I'm looking for a way to program this myself, for budgetary and other reasons. Also, I'm not sure that determining that the file contains text will be sufficient to verify that it's first-generation.
Given that this is an industry requirement of a very large industry, I feel like someone must have already tackled this issue, but I'm having a hard time finding anything.
Thanks in advance for any help.
There is no way to know for certain if a PDF is "first generation". Technically, a scanned PDF is just a PDF that contains images and perhaps OCR'ed text on top of that. A "first generation" PDF could easily have the same characteristics, so you have to use some heuristics.
For example, a PDF that contains only images and invisible text (from OCR) is likely to be scanned, a PDF that has visible text or vector graphics is probably "first generation" (OCR for scanned PDFs works by overlaying invisible text on top of the original image, so that text selection works, but the original document's fidelity is preserved).
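The heuristic above can be written down as a small decision function. This sketch assumes that some PDF library has already told you, per file or per page, which kinds of content are present; `PageSummary` and `guess_origin` are hypothetical names, and the result is a guess rather than a proof.

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    has_images: bool
    has_visible_text: bool
    has_invisible_text: bool  # e.g. OCR text drawn in an invisible rendering mode
    has_vector_graphics: bool


def guess_origin(page: PageSummary) -> str:
    """Apply the heuristic: visible text or vectors suggest a first-generation PDF."""
    if page.has_visible_text or page.has_vector_graphics:
        return "probably first-generation"
    if page.has_images:
        # Only images, possibly with invisible OCR text on top.
        return "likely scanned"
    return "undetermined"


if __name__ == "__main__":
    scanned_with_ocr = PageSummary(True, False, True, False)
    printed_to_pdf = PageSummary(False, True, False, True)
    print(guess_origin(scanned_with_ocr))
    print(guess_origin(printed_to_pdf))
```

Treating an image-only page without OCR text as scanned is an extension of the rule stated above, and a first-generation PDF with the same characteristics would be misclassified.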
Open the PDF, press Ctrl+F, and type in "Appraisal". If you get a hit for the word, you have a first-generation appraisal. Or rather, the dataset exists.