Batch check Adobe Acrobat PDFs for files containing rotated text

Does anybody know if there is a way to check whether a list of Adobe Acrobat .pdf files contain rotated text (any text not at 0 degrees)?
I thought this would be simple, but I'm struggling to find an answer.
I am using ABBYY Recognition Server to OCR thousands of files and the results are quite poor where the text is rotated. I need to get a list of files that have rotated text to allow me to perform some pre-processing on them.
I usually use iTextSharp for .pdf automation and modification but don't seem to be able to find anything for checking text rotation.
Thanks

You could achieve your goal by extracting all words from these PDFs and checking if any of the words is rotated.
I would recommend using a PDF library with higher-level abilities for the task. The Docotic.Pdf library is a good choice (of course, I am one of the developers of the library).
Here is an article that shows how to extract words from PDFs with extra info about their position etc.
Each extracted word comes as a PdfTextData object. PdfTextData has an IsTransformed property that tells you whether the word is rotated, scaled, and/or flipped. You can also analyze PdfTextData.TransformationMatrix for more information about the transformation.
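A transformation matrix alone is enough to tell rotation apart from plain scaling. The following is a minimal Python sketch of that check, not Docotic.Pdf's API: it assumes some library has already given you, for each file, the six-number PDF matrices [a b c d e f] of its words. The names is_rotated and files_with_rotated_text and the tolerance value are made up for illustration.

```python
import math

def rotation_degrees(matrix):
    """Angle of the text baseline for a PDF matrix (a, b, c, d, e, f)."""
    a, b = matrix[0], matrix[1]
    # The x-axis (1, 0) maps to (a, b); its direction gives the rotation.
    return math.degrees(math.atan2(b, a))

def is_rotated(matrix, tolerance=0.5):
    """True if the baseline deviates from 0 degrees by more than tolerance.

    Pure scaling and translation leave the angle at 0. A horizontal mirror
    (a < 0, b == 0) reports 180 degrees and is therefore flagged as well.
    """
    return abs(rotation_degrees(matrix)) > tolerance

def files_with_rotated_text(matrices_by_file, tolerance=0.5):
    """Return the sorted names of files that have at least one rotated word."""
    return sorted(
        name
        for name, matrices in matrices_by_file.items()
        if any(is_rotated(m, tolerance) for m in matrices)
    )
```

For example, files_with_rotated_text({"a.pdf": [(1, 0, 0, 1, 72, 700)], "b.pdf": [(0, 1, -1, 0, 72, 700)]}) lists only "b.pdf", whose single word is rotated by 90 degrees.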

Related

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm looking for some details about PDF files. It's most important to me that the files remain usable for a very long time and, if possible, that OCR is applied automatically to new files (which doesn't really seem to be possible with Adobe Acrobat...).
For that reason I've been looking at different ways to OCR my PDF files. I found three candidates which seem to do what they should (more or less). All three variants have their pros and cons, and they seem to take different approaches to storing data in PDF files. Let me explain:
A file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
A file OCRed with ABBYY FineReader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers:
But as far as I was able to tell, these files seem to have a background text layer, which contains the OCRed text and lies underneath the image that is finally shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
A file OCRed with Tesseract 4 (alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You have correctly found out that the approach of ABBYY FineReader differs from that of Adobe Acrobat and Tesseract:
ABBYY creates a page content stream in which the text is first drawn normally on the page and is then covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.
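To check which of the two approaches a given page uses, you can look at the order of operators in its uncompressed content stream. The following Python sketch is only a rough illustration: it splits the stream on whitespace, so it ignores q/Q graphics-state save and restore and can be fooled by operator-like words inside string literals. It also assumes that every Do operator paints an image. The function name hidden_text_strategy is made up for this example.

```python
TEXT_SHOWING = {"Tj", "TJ", "'", '"'}

def hidden_text_strategy(content):
    """Guess how a page hides its OCR text from an uncompressed content stream."""
    tokens = content.split()
    mode = 0            # text rendering mode, 0 (fill) is the default
    text_modes = []     # rendering mode in effect at each text-showing operator
    last_text = -1      # token index of the last text-showing operator
    last_image = -1     # token index of the last Do operator
    for i, tok in enumerate(tokens):
        if tok == "Tr" and i > 0 and tokens[i - 1].isdigit():
            mode = int(tokens[i - 1])
        elif tok in TEXT_SHOWING:
            text_modes.append(mode)
            last_text = i
        elif tok == "Do":
            last_image = i
    if not text_modes:
        return "no text"
    if all(m == 3 for m in text_modes):
        return "invisible text (rendering mode 3)"
    if last_image > last_text:
        return "text possibly covered by a later image"
    return "visible text"
```

A stream that paints the image first and then shows text after "3 Tr" is reported as invisible text (the Acrobat and Tesseract pattern), while a stream that shows text in the default mode and paints an image afterwards is reported as possibly covered (the ABBYY pattern).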

How to troubleshoot badly rendered PDF file

I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advice to do this or that, which helps some people and not others (nothing helped me). It feels like shooting in the dark.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World?
EDIT
A bit of the context. I am trying to convert a trivial HTML to PDF using the wkhtmltopdf tool. It is an absolute frustration, because according to the Internet searches the tool is supposed to work and do it quite well. But the thing does not work for me and nothing I do changes this! Unfortunately, this tool seems the only free tool to convert HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Acrobat and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this; Acrobat is very, very smart in the way it handles PDF files and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. Its goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now, and to ensure that, it tests a whole range of properties of the file itself and of the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct, for example), but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google docs - I really have no clue what that might do) that went berserk apparently.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes, with the PDF specification (made available by Adobe here) at hand: first in a text viewer for a first impression and then, if the cause of the issue is not immediately visible, in a PDF browsing tool for in-depth analysis.
This step is quite cumbersome at first, but after some time you learn your way around PDFs.
An example of such a PDF browsing tool is RUPS, but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed does only contain vector drawing operations which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is only 1,570 bytes, so this task does not look overly large.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3,875 bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines contain 12 different groups of drawing commands, each of which represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator that actually fills the (closed) paths constructed so far with black. Nothing in the other lines (which I didn't analyze in detail) draws anything that resembles the shapes of glyphs.
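If you do not want to count the path groups by hand, a few lines of Python can do it for you. This is only a sketch under simple assumptions: the content stream has already been expanded (for example by qpdf as above), tokens are separated by whitespace, and only the m, l and h operators build paths. The function names closed_subpaths and bounding_box are made up for this example.

```python
def closed_subpaths(content):
    """Return one list of (x, y) points per subpath that is closed with 'h'."""
    paths, current, operands = [], [], []
    for tok in content.split():
        if tok in ("m", "l"):
            if tok == "m":
                current = []          # moveto starts a new subpath
            if len(operands) >= 2:
                current.append((float(operands[-2]), float(operands[-1])))
            operands = []
        elif tok == "h":
            if current:
                paths.append(current)  # closepath finishes the subpath
            current, operands = [], []
        else:
            operands.append(tok)       # collect operands for the next operator
    return paths

def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) of a list of points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

sample = """
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
f
"""
boxes = [bounding_box(p) for p in closed_subpaths(sample)]
```

Run against the whole page content instead of the single group shown here, the number of entries in boxes tells you how many closed shapes the page draws, and each bounding box tells you where the shape sits.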
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1. Assuming your "free" is meant as free as in liberty, an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job of converting one or multiple HTML documents into a single PDF book containing chapters, page numbering, page headers and footers and more. For all options available, see its full documentation.
2. Assuming your "free" is meant as free as in beer, an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job of supporting almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.

Get text from a pdf in NSString

I am trying to make an iOS app which would extract plain text from a PDF file and display it in a UITextView. It is not simply a PDF reader for viewing a file; later I would like to perform certain operations on that text.
I have already googled a lot but have not been able to find an exact solution.
I already tried using https://github.com/zachron/pdfiphone
but the files use the ARMv6 architecture, which seems to be obsolete with Xcode 4.5.
If anyone can suggest some exact and non-confusing code using the Quartz 2D framework of iOS, that would be great.
Here is some sample code for extracting text from a PDF. I hope it helps.
https://github.com/zachron/pdfiphone
This is a library to get the text out of a PDF for the iPhone.
There is also another demo, which uses OCR technology. You can find it at the link below:
https://github.com/nolanbrown/Tesseract-iPhone-Demo
Also check this page of the Quartz 2D Programming Guide; it covers everything you need to open and parse a PDF file in iOS. Note that it is not a simple task, since there's no method to extract the full text in one line. You have to work with the data as an input stream, using a CGPDFScanner.
Two other libraries:
https://github.com/KurtCode/PDFKitten/
https://github.com/mobfarm/FastPdfKit
This question comes up all the time. It is VERY hard to extract text from PDF in general. The PDF specification is not designed with text extraction in mind. There are many libraries that try to do the job, essentially by reconstructing the text from the geometric placement of the individual glyphs. These libraries have varying degrees of success, but will all fail on certain PDF documents. In fact, some PDF documents have Glyphs but no way to associate the glyph with a character. For these documents it is simply not possible to extract text, short of using some kind of OCR approach.
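To give an idea of what "reconstructing the text from the geometric placement of the individual glyphs" involves, here is a deliberately small Python sketch. It is not how any particular library works. It assumes you already have, for every glyph, its character, its x and y position and its width, and that the text is horizontal. The function name reconstruct_text and both threshold values are made up for this example.

```python
def reconstruct_text(glyphs, line_tolerance=2.0, space_gap=3.0):
    """Rebuild text from glyphs given as (char, x, y, width) tuples.

    Glyphs whose baselines differ by at most line_tolerance form one line.
    A horizontal gap larger than space_gap is treated as a space.
    """
    lines = []  # each entry: [baseline_y, [glyphs on that line]]
    # PDF coordinates grow upwards, so the highest y is the first line.
    for g in sorted(glyphs, key=lambda g: -g[2]):
        if lines and abs(lines[-1][0] - g[2]) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append([g[2], [g]])
    out = []
    for _, row in lines:
        row.sort(key=lambda g: g[1])  # left to right
        text = row[0][0]
        for prev, cur in zip(row, row[1:]):
            if cur[1] - (prev[1] + prev[3]) > space_gap:
                text += " "
            text += cur[0]
        out.append(text)
    return "\n".join(out)
```

Even this toy version needs two thresholds that depend on the font size, and it breaks on rotated text, multiple columns or glyphs that cannot be mapped to characters, which is why real extractors fail on some documents.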
PDF is designed as a read-only format that is portable in the sense that a PDF document will be rendered identically on any platform. That is what it is best at, and what it should be used for.
If text is to be edited, do not use PDF.
Here (Extracting text from pdf using objective-c) I found an answer to your question, and it works, but not as well as I need it to :(
It can extract only ASCII.
It returns only one paragraph.
Good luck.

How can I programmatically verify that a PDF file is first-generation?

I'm working on a project that involves the Fannie Mae/Freddie Mac Uniform Appraisal Dataset. The specification requires that the embedded appraisal PDF file be first-generation.
I understand conceptually what a first-generation PDF file is (printing of a document directly to PDF, rather than a scanned copy or printed and scanned copy). However, I've done some research and haven't found anything that specifies the properties of a first-generation PDF that could be verified programmatically.
I found a product that allows one to check if a PDF contains text, images, or both: Aspose.Pdf.Kit for .NET, but I'm looking for a way to program this myself, for budgetary and other reasons. Also, I'm not sure that determining that the file contains text will be sufficient to verify that it's first-generation.
Given that this is an industry requirement of a very large industry, I feel like someone must have already tackled this issue, but I'm having a hard time finding anything.
Thanks in advance for any help.
There is no way to know for certain if a PDF is "first generation". Technically, a scanned PDF is just a PDF that contains images and perhaps OCR'ed text on top of that. A "first generation" PDF could easily have the same characteristics, so you have to use some heuristics.
For example, a PDF that contains only images and invisible text (from OCR) is likely to be scanned, while a PDF that has visible text or vector graphics is probably "first generation". (OCR for scanned PDFs works by overlaying invisible text on top of the original image, so that text selection works but the original document's fidelity is preserved.)
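Written down as code, that heuristic could look like the Python sketch below. It assumes you have already counted, per page, the images, the visible and invisible text-showing operators and the vector painting operators with a PDF library of your choice. The PageSummary type and the function names are made up for this example, and the result is a guess, not proof.

```python
from dataclasses import dataclass

@dataclass
class PageSummary:
    images: int               # image XObjects painted on the page
    visible_text_ops: int     # text shown in a visible rendering mode
    invisible_text_ops: int   # text shown in rendering mode 3 (typical for OCR)
    vector_paint_ops: int     # fill/stroke operators applied to paths

def classify_page(page):
    """Apply the heuristic to one page."""
    if page.visible_text_ops > 0 or page.vector_paint_ops > 0:
        return "probably first-generation"
    if page.images > 0:
        return "probably scanned"
    return "undetermined"

def classify_document(pages):
    """A document gets a label only if all of its pages agree."""
    labels = {classify_page(p) for p in pages}
    if len(labels) == 1:
        return labels.pop()
    return "mixed or undetermined"
```

Keep in mind that this can be wrong in both directions: a first-generation page may consist of nothing but an image, and some OCR tools draw the recognized text visibly and then cover it with the scan, which this sketch would count as visible text.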
Open the PDF, press Ctrl+F and type in "Appraisal". If you get a hit for the word, you have a first-generation appraisal. Or rather, the dataset exists.

PDF Colo(u)r Analysis (without Acrobat itself ?)

Is there a library/tool which would list all colours used in a PDF document ?
I'm sure Acrobat itself would do this but I would like an alternative (ideally something that could be scripted).
So the idea is if you have a very simple PDF document with four colours in it the output might say :
RGB(100,0,0)
RGB(105,0,0)
CMYK(0,0,0,1)
CMYK(1,1,1,1)
You could explore the insides with pdfbox, but you would have to write some code to find and catalog all those colors.
Most PDF tools have access to this information but no API to expose it. You could take any such tool and add one.
Apago PDFspy generates an XML file containing all kinds of metadata extracted from PDF files. It reports color usage including spot colors.
We recently added a function called GetPageColorSpaces(0) to the Quick PDF Library - www.quickpdflibrary.com to retrieve much of the ColorSpace info used in the document.
Here is some sample output.
Resource,"QuickPDFCS2eb0f578",Separation,"HKS 52 E",DeviceCMYK,0.95,0,0.55,0
Resource,"QuickPDFCSb7b05308",Separation,"Black",DeviceCMYK,0,0,0,1
Resource,"QuickPDFCSd9f10810",Separation,"Pantone 117 C",DeviceCMYK,0,0.18,1,0.15
Resource,"QuickPDFCS9314518c",Separation,"All",DeviceCMYK,0,1,0,0.5
Resource,"QuickPDFCS333d463d",Separation,"noplate",DeviceCMYK,1,0,0,0
Resource,"QuickPDFCSb41cafc4",Separation,"noprint",DeviceCMYK,0,1,0,0
Resource,"Cs10",DeviceN,Black,Colorant,-1,-1,-1,-1
Resource,"Cs10",DeviceN,P1495,Colorant,-1,-1,-1,-1
Resource,"Cs10",DeviceN,CalRGB,Colorant,-1,-1,-1,-1
Resource,"Cs10",Separation,"P1495",DeviceCMYK,0,0.31,0.69,0
XObject,"R29",Image,,DeviceRGB,-1,-1,-1,-1
Disclaimer: I work at Atalasoft.
Our product, DotImage with the PDF Reader add-on, can do this. The easiest way is to rasterize the page and then just use any of our image analysis tools to get the colors.
This example shows how to do it if you want to group similar colors -- the deployed example will only work for PNG and JPEG, but if you download the code, it's trivial to include the add-on and get PDF as well (let me know if you need help).
Source here:
http://www.atalasoft.com/cs/blogs/31appsin31days/archive/2008/05/30/color-scheme-generator.aspx
Run it here:
http://www.atalasoft.com/31apps/ColorSchemeGenerator
If you are working with specific and simple PDF documents from a constrained source then you may be able to find the colors by reading through the content stream. However this cannot be a generic solution.
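For such simple, constrained documents, reading through the content stream could be sketched in Python as follows. The sketch assumes an uncompressed content stream and only looks at the device color operators g/G (gray), rg/RG (RGB) and k/K (CMYK), with component values between 0 and 1 as PDF stores them. It ignores named color spaces, images and shadings, and its whitespace tokenizer can be fooled by string literals. The function name device_colours is made up for this example.

```python
COLOUR_OPERATORS = {
    "g": ("Gray", 1), "G": ("Gray", 1),    # fill / stroke gray
    "rg": ("RGB", 3), "RG": ("RGB", 3),    # fill / stroke RGB
    "k": ("CMYK", 4), "K": ("CMYK", 4),    # fill / stroke CMYK
}

def device_colours(content):
    """Return the sorted, de-duplicated device colors set in a content stream."""
    tokens = content.split()
    found = set()
    for i, tok in enumerate(tokens):
        if tok not in COLOUR_OPERATORS:
            continue
        space, count = COLOUR_OPERATORS[tok]
        if i < count:
            continue  # not enough operands in front of the operator
        try:
            values = tuple(float(t) for t in tokens[i - count:i])
        except ValueError:
            continue  # operands are not numbers, so skip this occurrence
        found.add((space, values))
    return sorted(found)
```

For the stream "1 0 0 rg 10 10 100 50 re f 0 0 0 1 k" this returns one RGB entry with the values (1.0, 0.0, 0.0) and one CMYK entry with the values (0.0, 0.0, 0.0, 1.0).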
For example PDF documents can contain gradients or transparency. If your document contains this type of construct then you are likely to end up with a wide range of colors rather than a specific set.
Similarly, many PDF documents contain bitmapped images. Given that these will need to be interpolated to be displayed at different resolutions, the set of colors in a displayed PDF may be larger than or different from (though obviously broadly similar to) that of the embedded bitmap.
Similarly many PDF documents contain constructs in multiple color spaces that are rendered into different color spaces. For example a PDF might contain a DeviceRGB bitmap, a line in an ICC based CMYK color and a Lab based rectangle. The displayed version might be in sRGB for display or CMYK for print. Each of these will influence the precise set of colors.
So the only 100% valid answer is going to be related to a particular render of a PDF at a particular resolution to a particular color space. From the resultant bitmap you can determine the colors that have been used.
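Once you have such a bitmap, collecting the colors is the easy part. The Python sketch below assumes the renderer has given you the pixels as a sequence of (r, g, b) tuples with values from 0 to 255. The rendering itself is not shown, and the function names and the step parameter are made up for this example.

```python
from collections import Counter

def quantise(pixel, step=1):
    """Round every channel down to a multiple of step to merge similar colors."""
    return tuple((channel // step) * step for channel in pixel)

def colour_histogram(pixels, step=1):
    """Count how often each (optionally quantised) color occurs."""
    return Counter(quantise(p, step) for p in pixels)
```

With step=1 every distinct pixel value is counted separately, so anti-aliased edges and interpolated images will add many near-duplicates. A larger step, such as 16, groups them at the cost of precision.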
There are a variety of PDF libraries that will do this type of render including DotImage (referenced in another answer) and ABCpdf .NET (on which I work).