Extract text from PDF using iTextSharp returns empty string

I have a PDF file. The text can be extracted in the Edge browser or in Adobe Reader after installing some fonts. Please let me know how to extract the text with iTextSharp (latest 5.x version). I use the code below, but an empty string is returned even though the file has 8 pages of text.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

var text = string.Empty;
var reader = new PdfReader(bytes);   // bytes contains the PDF file
var pages = reader.NumberOfPages;
for (int i = 1; i <= pages; i++)
{
    var t = PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy());
    text += t;
}

The PDF
The PDF at first glance appears to be OCR'ed by an OCR program that did not realize that the pages are rotated by 180°.
For example, on the second page the OCR program started in what a PDF viewer displays as the bottom left corner, and there recognized:
epnq eoⅢ9時u ez `9P...
押印S ’句OP JuP9A...
eA I臥O9叩Od n^Z小no...
This is not that bad; e.g. epnq eoⅢ... is not really unlike ...mce bude... rotated by 180°.
The OCR software appears to have a certain affinity to CJK glyphs; this impression is reinforced by the fact that it uses fonts with an Adobe-Japan1-2 ROS and a 90ms-RKSJ-H encoding.
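For anyone who wants to verify these observations themselves, here is a minimal sketch using iTextSharp 5.x (the library from the question); it simply dumps each page font's /Subtype and /Encoding and, for composite fonts, the Registry/Ordering/Supplement of the descendant CIDFont. The file name is a placeholder.

using System;
using iTextSharp.text.pdf;

class DumpFontInfo
{
    static void Main()
    {
        var reader = new PdfReader("input.pdf");   // placeholder path
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            var resources = reader.GetPageN(page).GetAsDict(PdfName.RESOURCES);
            var fonts = resources?.GetAsDict(PdfName.FONT);
            if (fonts == null) continue;

            foreach (PdfName key in fonts.Keys)
            {
                var font = fonts.GetAsDict(key);
                if (font == null) continue;
                Console.WriteLine("page {0} {1}: subtype={2} encoding={3}",
                    page, key, font.Get(PdfName.SUBTYPE), font.Get(PdfName.ENCODING));

                // Composite (Type0) fonts carry the ROS in their descendant CIDFont.
                var descendants = font.GetAsArray(PdfName.DESCENDANTFONTS);
                var cidSystemInfo = descendants?.GetAsDict(0)?.GetAsDict(PdfName.CIDSYSTEMINFO);
                if (cidSystemInfo != null)
                    Console.WriteLine("  ROS: {0}-{1}-{2}",
                        cidSystemInfo.Get(PdfName.REGISTRY),
                        cidSystemInfo.Get(PdfName.ORDERING),
                        cidSystemInfo.Get(PdfName.SUPPLEMENT));
            }
        }
        reader.Close();
    }
}

For the PDF discussed here, this should report an /Encoding of 90ms-RKSJ-H and a ROS of Adobe-Japan1-2 for the fonts in question.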
Text extraction
All the information above considered, though, I have some doubt that
The text can be extracted in the Edge browser or in Adobe Reader after installing some fonts.
At least I doubt that anything similar to the actual text can be extracted, no matter how many fonts are installed. On the other hand both Adobe Reader and Edge out-of-the-box here extract the weird text recognized from the rotated letters.
iText
My observation with iText differs: while the OP reports that
an empty string is returned,
I get a lot of CJK glyphs (I have added the Asian jar, which might make a difference), unfortunately not the ones found by inspecting the PDF.
As far as I remember, though, text extraction by Encoding + ROS has never been a focus of iText development up to version 5.5.x (inclusive); in particular, the mixed single-byte/double-byte encoding of 90ms-RKSJ-H might not be supported.

Related

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF files. It's most important for me that the files remain usable for a very long time and, if possible, that OCR is applied automatically to new files (which does not really seem possible with Adobe Acrobat...).
For that I've been looking at different solutions for OCRing my PDF files. I found three candidates which seem to do what they should (more or less), but all three variants have their pros and cons, and each seems to take a different approach to storing data in the PDF file. Let me explain:
a file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no separate loading of any background layer), and after running a preflight script I'm able to see the text that is stored hidden.
a file OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers.
But as far as I was able to reproduce, these files seem to have a background text layer containing the OCRed text, lying underneath the image that is shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
a file OCRed with Tesseract 4 (alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
also does some weird magic with the hidden text part.
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text".
I'm seriously confused... Does anyone know how these programs really store their hidden text information?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs really store their hidden text information?
You have correctly found out that the approach of Abby Finereader differs from that of Adobe Acrobat and Tesseract:
Abby creates a page content stream in which the text is first drawn normally on the page and is then covered by the scanned image.
Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses the regular standard 14 fonts, for which a PDF viewer has font programs to render them as normal glyphs.
Tesseract uses a font called GlyphLessFont, for which it embeds a font program into the result file. When rendered, the glyphs of this font do not show as normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste; they are only used in the invisible rendering mode anyway.
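To make the Acrobat/Tesseract approach concrete, here is a minimal, hedged sketch using iTextSharp 5.x (the library used in the first question on this page; it is not what either product actually runs). It draws a scanned page image and then writes recognized text on top in text rendering mode 3, so the text is searchable and copyable but never visible. File names, the font and the text position are placeholders.

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class InvisibleOcrTextSketch
{
    static void Main()
    {
        var doc = new Document(PageSize.A4);
        var writer = PdfWriter.GetInstance(doc, new FileStream("ocr-layer.pdf", FileMode.Create));
        doc.Open();
        var cb = writer.DirectContent;

        // 1. Draw the scanned page image so that it covers the whole page.
        var scan = Image.GetInstance("scan.png");          // placeholder scan image
        scan.SetAbsolutePosition(0, 0);
        scan.ScaleAbsolute(PageSize.A4.Width, PageSize.A4.Height);
        cb.AddImage(scan);

        // 2. Write the recognized text invisibly: rendering mode 3 draws nothing,
        //    but the text still ends up in the content stream and stays searchable.
        var font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED);
        cb.BeginText();
        cb.SetTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
        cb.SetFontAndSize(font, 12);
        cb.SetTextMatrix(72, 720);                          // placeholder word position
        cb.ShowText("recognized word");
        cb.EndText();

        doc.Close();
    }
}

The Abby variant would essentially swap the two steps: show the text first, normally, and then paint the scanned image over it.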

PDF cannot display Chinese fonts in table of contents

I made a PDF file from Latex (using TexMaker).
Acrobat Reader is able to display BOTH the text and the table of contents in Linux.
But Acrobat Reader is unable to display the table of contents in Windows XP (the Chinese characters came out as boxes). However, the text is displayed correctly.
I tried to embed the fonts into the PDF, but the various methods were not 100% successful, so I'm not sure whether the fonts are embedded correctly or not. Anyway, the table of contents remains unreadable in Windows.
I wonder if it is really a font embedding problem? Or do I need to install these "Adobe Reader X Font Packs":
https://www.adobe.com/support/downloads/detail.jsp?ftpID=4883
My concern is that I'd like my PDF to be readable in Windows, including the table of contents (and preferably without further installations). If this is possible...
I suspect you are talking about "bookmarks" and not saying that part of the text in the document is OK and part is not. PDF bookmarks are part of the UI of the application and are not rendered from fonts embedded in the document. Therefore, the system you are running on needs to know how to handle fonts for the language(s) in question.
See https://forums.adobe.com/thread/1144972?start=0&tstart=0
Embedding the fonts will have no effect on the bookmarks.
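To illustrate why embedding makes no difference here, a small hedged sketch with iTextSharp 5.x (the library used elsewhere on this page; file name and title are placeholders): the bookmark title is stored as a plain Unicode text string in the document's outline, not rendered from any font embedded in the PDF, so how it displays depends on the viewer and the fonts available on the operating system.

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class BookmarkSketch
{
    static void Main()
    {
        var doc = new Document();
        var writer = PdfWriter.GetInstance(doc, new FileStream("bookmarked.pdf", FileMode.Create));
        doc.Open();
        doc.Add(new Paragraph("Body text of the page"));

        // The title below ends up as a text string in the outline dictionary.
        // No font is embedded for it; the viewer has to supply one that can
        // display the characters, which is why boxes appear on systems without
        // suitable CJK fonts.
        var cb = writer.DirectContent;
        new PdfOutline(cb.RootOutline, new PdfDestination(PdfDestination.FIT), "目录 (table of contents)");

        doc.Close();
    }
}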

How to confirm a TrueType PDF font is missing glyphs

I have a PDF which renders fine in Acrobat but fails to print during the PDF to PS conversion process on our printer's RIP. After uncompressing with pdftk and editing, I've found that if I replace the usage of a certain font, it will print.
The font is a strange one, a TrueType subset with a single character (space).
If I pass the PDF through Ghostscript it reports no errors; however, an Acrobat preflight check reports a missing glyph for the space character. This error is not reported for the original file. I'm just using a basic command:
gswin32c -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -o gs.pdf original_sample.pdf
I've pulled out the font data from the original PDF and saved it. Running TTFDUMP.exe produces an interesting result where it seems that the 'glyf' table is missing:
4. 'glyf' - chksm = 0x00000000, off = 0x00000979, len = 0
5. 'head' - chksm = 0xE463EA67, off = 0x00000979, len = 54
Just wondering, am I interpreting this result correctly? Is it valid to run TTFDUMP like this on extracted data from a PDF? I think a 'glyf' table is required based on the spec, at least for the first 4 necessary characters.
TTFDUMP run on the ghostscript PDF produces a similar result but with a 1-byte 'glyf' table.
If so, it seems that Acrobat doesn't particularly care about the missing space while other programs (including the printer) do. It's odd, though, that it isn't reported as missing until the file has been run through Ghostscript.
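For reference, here is a small C# sketch (the file name is a placeholder) that reads the sfnt table directory of a font program extracted from a PDF and prints each table's checksum, offset and length, which is essentially the part of the TTFDUMP output quoted above; a zero-length 'glyf' entry shows up directly.

using System;
using System.IO;

class SfntTableDump
{
    static uint ReadUInt32BE(BinaryReader r) =>
        ((uint)r.ReadByte() << 24) | ((uint)r.ReadByte() << 16) |
        ((uint)r.ReadByte() << 8) | r.ReadByte();

    static ushort ReadUInt16BE(BinaryReader r) =>
        (ushort)((r.ReadByte() << 8) | r.ReadByte());

    static void Main()
    {
        using (var r = new BinaryReader(File.OpenRead("extracted-font.ttf")))   // placeholder path
        {
            uint sfntVersion = ReadUInt32BE(r);        // 0x00010000 for TrueType outlines
            ushort numTables = ReadUInt16BE(r);
            r.BaseStream.Seek(6, SeekOrigin.Current);  // skip searchRange/entrySelector/rangeShift

            for (int i = 0; i < numTables; i++)
            {
                string tag = new string(r.ReadChars(4));
                uint checksum = ReadUInt32BE(r);
                uint offset = ReadUInt32BE(r);
                uint length = ReadUInt32BE(r);
                Console.WriteLine("'{0}' - chksm = 0x{1:X8}, off = 0x{2:X8}, len = {3}",
                    tag, checksum, offset, length);
            }
        }
    }
}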
The PDF is created by Adobe InDesign and the font is copyrighted like most so I can't share it.
Edit - I've accepted Ken's answer as he helped me on the Ghostscript bug tracker. In summary, it seems the font is broken as suspected due to the missing glyf table. Until I hear otherwise I'll have to suppose this is a bug in InDesign, and will continue investigating.
Yes, you can run ttfdump on an embedded subset font; it's still a perfectly valid font.
A missing glyph is not specifically a problem, because the .notdef glyph is used instead; a missing .notdef means the font isn't legal.
I think you are mistaken about the legality of sharing the PDF file (from the point of view of font embedding). Practically every PDF file you see will contain copyrighted fonts, but these are permitted to be embedded and distributed as part of a PDF (or indeed PostScript) file. TrueType fonts contain flags which control the DRM of the font and which can deny embedding in PDF (or other formats). Ghostscript honours these embedding flags in the font, as do Acrobat Distiller and other Adobe products.
There were some fonts which inadvertently shipped with DRM which prevented embedding, and there's a list somewhere of these, along with an explicit statement from the font foundry that it's permissible to embed these fonts. I think this was somewhere on the Adobe web site a few years back.
So if you have a PDF file with the font embedded in it (especially if it was produced by an Adobe application) then I would be comfortable that it's legal to share.
I'm having some trouble figuring out what the problem actually is, and how you are using Ghostscript. If you are converting the PDF to PostScript and then back to PDF, then frankly all bets are off. Round-tripping files will often provoke problems.
In any event I'm happy to look at the file but you will have to make it available.

Fix PDF with unreadable characters

Example PDF page: https://db.tt/qRcF000k
This is a sample page from a document where copied text shows as question marks in my favorite reader, SumatraPDF (MuPDF), just the same as in Adobe Acrobat. But my main problem is that, because of this, I can neither search this document nor index it.
OTOH, xpdf's pdftotext extracts correct text.
In Adobe Acrobat, if I use "Copy as formatted text", the correct text is written to the clipboard, although I still can't search from Acrobat.
Also if I open the linked page in Firefox's built-in PDF reader I can correctly copy the text.
Can Ghostscript perhaps be instructed to correct this issue, which I cannot describe other than as 'unreadable characters'?
The PDF file uses subset fonts with non-standard Encodings and no ToUnicode CMaps. So no, you can't have Ghostscript 'correct' this file.
In fact I can't see how anything can possibly extract sensible text from this; indeed, my versions of Acrobat (Pro X and Reader XI) can't copy meaningful text and don't appear to have a "Copy as formatted text" menu item. Can you tell me where to find this?
However, I notice that the PDF file has actually been created by Ghostscript (version 9.14), so possibly you mean 'starting with a different input file, which I haven't given you, could I have generated a PDF file where the text could be copied', to which I can only say 'I don't know'; it depends on what was in the original input file.
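If anyone wants to check this kind of diagnosis themselves, here is a minimal sketch using iTextSharp 5.x (the library from the first question on this page; the file name is a placeholder) that lists, for every page font, whether a /ToUnicode CMap is present. Fonts with non-standard encodings and no ToUnicode CMap are the ones whose text a generic extractor cannot reliably map back to Unicode.

using System;
using iTextSharp.text.pdf;

class ToUnicodeCheck
{
    static void Main()
    {
        var reader = new PdfReader("sample.pdf");   // placeholder path
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            var fonts = reader.GetPageN(page).GetAsDict(PdfName.RESOURCES)?.GetAsDict(PdfName.FONT);
            if (fonts == null) continue;
            foreach (PdfName key in fonts.Keys)
            {
                var font = fonts.GetAsDict(key);
                if (font == null) continue;
                // A missing /ToUnicode entry (combined with a non-standard Encoding)
                // leaves extractors with no reliable mapping from codes to Unicode.
                Console.WriteLine("page {0} {1}: BaseFont={2}, ToUnicode={3}",
                    page, key, font.Get(PdfName.BASEFONT),
                    font.Get(PdfName.TOUNICODE) != null ? "present" : "missing");
            }
        }
        reader.Close();
    }
}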

How to find word coordinates using CGPDFScanner in a PDF page on iPhone?

I am parsing a PDF page using CGPDFScanner.
But I am not able to find the coordinates of the search result.
In void Tm1(CGPDFScannerRef scanner, void *info), I am only getting coordinates for some words but not for every word of the PDF.
How can I find the coordinates, e.g. (x, y), of every word on a PDF page?
You're drastically underestimating the complexity of converting PDF to text. I made that mistake as well, and it took months to write a text extraction engine that works with most PDFs. My code is commercial, but just to give you an idea:
Text is shown by the Tj, TJ, ' and " operators, and where it lands depends on the text state set by Td, TD, Tm and T*; d0 and d1 appear in the glyph procedures of Type3 fonts, which are less common, but Microsoft Word really likes them. Text can also appear inside XObjects (also recursively). But you also need to parse the fonts, since many PDFs have CMaps attached to fonts that translate "random numbers" to the character (or characters; PDF can have ligatures as well). Beware, XObjects might also contain fonts, and it's critical to parse them in the right order, since fonts can have parent fonts.
Adobe's ToUnicode documentation gives you some idea of how to start, but just a warning: the spec is very incomplete. There's a bit more in the official PDF reference, but you will still find documents that should not work (going by the spec) but still DO work (when you try them in Adobe Acrobat).