I am trying to find the language of a PDF document and categorize it. The major problem I face is the document is scanned PDF document. There is no clue of fonts or Unicode.
So Apache Tikka Doesn't do much help here.
I tried using tesseract to convert the document from PDF to text then pass the extracted text to google service it works fine. But there are three problems:
Tesseract is only able to convert high quality images.
It is able to do languages similar to English like Spanish , french but fails for Japanese, Chinese etc.
Document text are confidential and all manipulations should be done within.
Now I am in search of a standalone Language detection component which works across scanned PDF documents.
Related
I'm trying to extract text from PDF files using the Google Cloud Vision API. It works most of the times, but I get gibberish in a few cases. I tried both DOCUMENT_TEXT_DETECTION and TEXT_DETECTION, I tried forcing the language in the languageHints but it didn't help.
Then I tried with a screenshot saved as tiff and this did work, so I'm guessing that Google tries to use the text in the PDF if it's not just a picture. Indeed, when I select all "text" in the PDF, I get gibberish.
When I print the tiff back into PDF, text extraction works. So it's really something weird with the PDF. But other extraction software (such as abbyy) work well with the original PDF.
Has anyone had the same kind of issues?
One thing that could help would be an option to force treat the PDF as an "image PDF". Is there such an option?
Thanks for your help!
FYI, I am unfortunately not allowed to show the PDF, and I use the dotnet library.
Edit:
The info on the PDF is:
Creator: "PScript5.dll Version 5.2.2"
Producer: Acrobat Distiller 10.1.16 (Windows)
I made a PDF file from Latex (using TexMaker).
Acrobat Reader is able to display BOTH the text and the table of contents in Linux.
But Acrobat Reader is unable to display the table of contents in Windows XP (the Chinese characters came out as boxes). However, the text is displayed correctly.
I tried to embed the fonts into the PDF but the various methods are not 100% successful, so I'm not sure if the fonts are embedded correctly or not. Anyway, the table of contents remain unreadable in Windows.
I wonder if it is really an font embedding problem? Or do I need to install these "Adobe Reader X Font Packs":
https://www.adobe.com/support/downloads/detail.jsp?ftpID=4883
My concern is that I'd like my PDF to be readable in Windows, including the table of contents (and preferably without further installations). If this is possible...
I suspect you are talking about "bookmarks" and not saying part of the text in the document is ok and part is not. PDF Bookmarks are part of the UI of the application and are not selected from embedded fonts. Therefore, the system you are running on needs to know how to handle fonts in the language(s) of choice.
See https://forums.adobe.com/thread/1144972?start=0&tstart=0
Embedding the fonts will have no effect on the bookmarks.
I am generating PDF files which contain English and Chinese characters (using the Ruby Prawn library). I don't want to embed a Chinese font file in the generated PDF files, because these files need to stay small. So I'm wondering if I could just mentioning a Chinese font name in my PDF files, and have the PDF readers correctly rendering the Chinese characters because the PDF readers would already have the Chinese font file.
Is that something sensible? If so is there any commonly used Chinese font that one can expect to be installed in most of the PDF readers used by Chinese people?
The best way to ensure that a PDF file can be displayed on a any reader, is to use partially embedded fonts (also known as font subset). In PDF, you don't need to include the whole font with your document, having a subset with just the glyphs that were used in the file is enough for the file to be portable.
I want convert an electronic book for Kindle. I tried to convert large, two languages text-based PDF ebook with complex formatting styles and images into AZW3 book for Kindle using Calibre, and also tried amazon service, but conversion results poor quality with multiple errors. I can no convert PDF ebook into plain text which best suits for kindle, because ebook contains multiple formatting styles, programming code and images.
What is the best way automatically crop the white margins of text-based multi-page PDF document and convert it to image-based PDF document, resized to fit 600 x 800 pixels?
There are several tools exists for this kind of job:
http://www.willus.com/k2pdfopt/
http://code.google.com/p/papercrop/
http://briss.sourceforge.net/
https://sites.google.com/site/pdfscissors/
I'd check them out before trying to reinvent the wheel. Or you had not meant programming a new solution but then your question is off topic on this site, see the FAQ.
I am trying to make an iOS app which would extract plain text from a pdf file and display it in a UITextView. Its simply not a pdf reader to view a pdf file but i would later wish to perform certain operations on that text.
I have already googled a lot but still not able to get an exact solution.
i already tried using https://github.com/zachron/pdfiphone
but the files are using ARMV6 architecture which seems obsolete with xcode 4.5
And if anyone can suggest some exact and non-confusing code using Quartz-2d framework of iOS then it would be great.
Here is An Sample code to Extract text from PDF Hope this Might Help You.
https://github.com/zachron/pdfiphone
This is a library to get the text out of a PDF for the iPhone.
Another Demo is there Which uses OCR technology find the link below
https://github.com/nolanbrown/Tesseract-iPhone-Demo
Also Check this page of the Quartz 2D Programming Guide, it covers everything you need to open and parse a PDF file in iOS. Note that it is not a simple task, since there's no method to extract the full text in one line. You have to work with the data as an input stream, using a CGPDFScanner
Two Other Libraries
https://github.com/KurtCode/PDFKitten/
https://github.com/mobfarm/FastPdfKit
This question comes up all the time. It is VERY hard to extract text from PDF in general. The PDF specification is not designed with text extraction in mind. There are many libraries that try to do the job, essentially by reconstructing the text from the geometric placement of the individual glyphs. These libraries have varying degrees of success, but will all fail on certain PDF documents. In fact, some PDF documents have Glyphs but no way to associate the glyph with a character. For these documents it is simply not possible to extract text, short of using some kind of OCR approach.
PDF is designed as a read-only format that is portable in the sense that a PDF document will be rendered identically on any platform. That is what it is best at, and what it should be used for.
If text is to be edited, do not use PDF.
Here (Extracting text from pdf using objective-c), I found an answer to your question and it works. But not so fine as i need it :(
it can extract only ascii
it return me only one paragraph
Good luck.