I'm trying to parse journal article PDFs.
Parsing plain text in a PDF normally works fine,
but parsing Korean article PDFs causes some problems.
The other problems are not a big deal, but the incorrect text order is.
Here is a sample.
Original text:
열 노출에 의한 IN738LC의 기계적 특성 및 미세조직 변화
(roughly: "Changes in the mechanical properties and microstructure of IN738LC due to thermal exposure")
Output from the PDF parsing library:
열 노출에 의한의 기계적 특성 및 미세조직 변화IN738LC
The English text between the Korean comes out in the wrong order.
As a test, I opened this PDF with pdf.js in a browser and got the same result,
so I was about to give up.
But the Chrome PDF viewer and macOS Preview both show the sentence in the correct order.
I have no idea what the problem is; I vaguely suspect it's an encoding problem.
So I want to know why this problem occurs.
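For what it's worth: viewers like Chrome's and Preview's paint each glyph at its page coordinates, so the order in which the content stream draws text doesn't matter to them, while extraction libraries (pdf.js included) typically emit text in content-stream order, which is likely what you are seeing. A minimal coordinate-based workaround sketch, assuming the Python library pdfminer.six (pip install pdfminer.six) and using "article.pdf" as a placeholder filename:

# reorder_sketch.py -- re-sort characters by position instead of trusting
# the order in which the PDF content stream happens to draw them.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

for page_layout in extract_pages("article.pdf"):
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            # Characters can appear in the stream in a different order than
            # they sit on the page; re-sort each line by x position.
            # (Spaces synthesized by pdfminer's layout analysis are dropped.)
            chars = [c for c in line if isinstance(c, LTChar)]
            chars.sort(key=lambda c: c.x0)
            print("".join(c.get_text() for c in chars))

This assumes a left-to-right, single-column layout; a two-column journal page would need to be split into columns first.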
I have access to a scanner at my library which can create "searchable PDFs." These are PDFs that show the exact image of a scanned document, but there is a kind of hidden text in the PDF that can be selected when you try to select a portion of the image that contains text. In this way you can copy and paste text or search for text in the scanned document. This is VERY useful; it's an awesome improvement over raw scanned images. I also have several apps on my Mac that can create this kind of searchable PDF from a scanned document or a raw image.
Now, as is obvious to anyone who has ever used OCR, the process of converting images to text is not 100% accurate, so the text that you search or copy will be incorrect in some places.
So I have searched for quite some time for an application that would load a searchable PDF and allow me to repair the hidden searchable text without reformatting or modifying the original scanned image.
Does anyone know of a tool (or library API) that would allow this?
It's worth saying here that I tried the latest version of Adobe Acrobat DC for Mac, and it doesn't seem to even allow me to view the hidden searchable text, much less edit it. It does allow me to replace the scanned image with the results of its own OCR process so that I could edit and save the document, but this would produce horrible results for the scanned documents I am using. It seems designed for editing a "native PDF," not a scanned document.
I have also tried ABBYY FineReader with no luck.
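For inspecting what is actually in that hidden layer (though not editing it in place), note that the OCR text in a searchable PDF is ordinary PDF text drawn in an invisible text rendering mode behind or over the page image, so a normal text extractor can dump it even though no viewer displays it. A minimal sketch, assuming the Python package pdfminer.six is installed and using "scan.pdf" as a placeholder filename:

# dump_ocr_layer.py -- print the hidden OCR text layer of a searchable PDF.
from pdfminer.high_level import extract_text

print(extract_text("scan.pdf"))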
I'm using ABBYY FineReader 12 Professional (not open source).
Just open a scanned image or scanned PDF and press Verify Text (or Ctrl + F7); then you go over all the spelling errors or low-confidence characters and fix them.
The program is very good: it shows you the exact place in the image/PDF to correct and the OCR guess side by side for convenience, and it iterates over all of them.
[By the way, I'm using these shortcuts to speed things up:
Alt+Enter to add the unrecognized word to the dictionary.
Ctrl+Delete to skip the word, or to confirm it in case you fixed it.]
Then save the document as a PDF file (Menu: File > Save Document As > PDF File), and you can search it in every PDF reader. The saved file looks the same as the scanned one, but 'behind' it there is text.
It's weird that you tried ABBYY with no luck... it's working great for me. Maybe you weren't using the Professional version.
Hope it helps you.
Creating a searchable PDF from images is not what the poster is after; he wants to start with an already-searchable PDF and modify its text (e.g., because a searchable PDF was made initially, but an overlooked recognition error was found later and needs correction). I see no way and no tool that assists in doing this.
I'm having difficulties filling in a form using pdftk when the text fields use TrueType fonts.
The font files (.ttf) are installed in /Library/Fonts (OS X Mavericks)
The form is created with Adobe Acrobat Pro
The form includes normal (non-form) text using these fonts
The form text fields also use these fonts
The form can successfully be filled and printed using Adobe Acrobat Pro and even Preview
However, pdftk throws an error when trying to fill it using the command:
pdftk ./my_form.pdf fill_form my_data.fdf output ./the_output.pdf
The output is:
Unhandled Java Exception in create_output():
java.lang.ArrayIndexOutOfBoundsException: 0
at pdftk.com.lowagie.text.pdf.DocumentFont.fillEncoding(pdftk)
at pdftk.com.lowagie.text.pdf.DocumentFont.doType1TT(pdftk)
at pdftk.com.lowagie.text.pdf.DocumentFont.<init>(pdftk)
at pdftk.com.lowagie.text.pdf.AcroFields.getAppearance(pdftk)
at pdftk.com.lowagie.text.pdf.AcroFields.setField(pdftk)
at pdftk.com.lowagie.text.pdf.AcroFields.setFields(pdftk)
If I change the font of the text inputs to Helvetica, Times Roman, or Courier, pdftk will successfully create a PDF. Oddly, though, Arial and Georgia also throw the same error.
I have tried, to no avail, to embed the fonts in the PDF using Ghostscript, as suggested in this question: How to repair a PDF file and embed missing fonts. gs may have embedded the fonts, but it removes the form fields, so the resulting PDF can't be fed back into pdftk.
A working resolution would be greatly appreciated.
I was getting the same java.lang.ArrayIndexOutOfBoundsException: 0 error using pdftk to fill forms on an Adobe Acrobat-generated PDF. This question is super old, but I couldn't find a consistent answer on Stack Overflow or elsewhere, so I figured I'd post my fix.
What ended up working for me:
Opening the PDF in the OS X app Preview
Clicking into a form field, adding text, then deleting that text (so nothing is actually changed)
Saving it
Running the PDF through pdftk again
I'm not that familiar with encoding or PDFs in general, but saving the PDF with Preview seems to fix the encoding, or at least get it to a place where pdftk can work with it. Good luck.
This was causing a huge headache for me for two days. It turns out I was focusing on the wrong end of the problem.
A nice alternative that isn't as manual and only has to be done once is to enter some text in a field of the source PDF form, in your case ./my_form.pdf. I don't know EXACTLY why this works, but it does. That way, if you want to create a new file at any time, you don't have to go through this trouble :)
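If massaging the file doesn't help, another option is to bypass pdftk's filling step altogether: the crash is inside pdftk's bundled iText fork (the DocumentFont frames in the trace), and a library with a different font-handling path may simply not hit it. A minimal sketch using the Python library pypdf (my suggestion, not something from this thread); the field name "name_field" is a hypothetical placeholder:

# fill_form.py -- fill AcroForm text fields with pypdf instead of pdftk.
# Assumes pypdf is installed (pip install pypdf); "name_field" is a
# hypothetical field name -- print reader.get_fields() to find real ones.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("my_form.pdf")
print(list(reader.get_fields().keys()))  # discover the actual field names

writer = PdfWriter()
writer.append(reader)  # copies the pages and the AcroForm into the writer
for page in writer.pages:
    writer.update_page_form_field_values(page, {"name_field": "Jane Doe"})

with open("the_output.pdf", "wb") as fh:
    writer.write(fh)

Whether the filled-in text picks up the embedded TrueType font depends on how the field appearances are regenerated, so check the output visually.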
I am trying to detect the language of a PDF document and categorize it. The major problem I face is that the document is a scanned PDF, so there is no clue from fonts or Unicode.
So Apache Tika doesn't help much here.
I tried using Tesseract to convert the document from PDF to text and then passing the extracted text to a Google service, and that works fine. But there are three problems:
Tesseract can only handle high-quality images.
It works for languages similar to English, like Spanish and French, but fails for Japanese, Chinese, etc.
The document text is confidential, and all processing must be done in-house.
Now I am looking for a standalone language-detection component that works on scanned PDF documents.
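Since everything has to stay in-house, one possibility is to keep both the OCR and the language detection local. A minimal sketch, assuming poppler plus the Python packages pdf2image, pytesseract, and langdetect are installed, and that the Tesseract language data for the scripts you expect (e.g. jpn, chi_sim) has been downloaded; "scanned.pdf" is a placeholder filename:

# detect_lang.py -- local-only language detection for a scanned PDF.
# Nothing leaves the machine, which matters for confidential documents.
from pdf2image import convert_from_path
import pytesseract
from langdetect import detect

pages = convert_from_path("scanned.pdf", dpi=300)  # higher DPI helps Tesseract
text = "\n".join(
    # OCR against several candidate scripts at once; Tesseract picks the
    # best match per word.
    pytesseract.image_to_string(page, lang="eng+jpn+chi_sim")
    for page in pages
)
print(detect(text))  # e.g. "en", "ja", "zh-cn"

Tesseract's accuracy on CJK scripts depends heavily on having the right traineddata files and a clean, reasonably high-resolution rendering, which is also the fix for the image-quality problem above.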
I am trying to make an iOS app that extracts plain text from a PDF file and displays it in a UITextView. It's not simply a PDF reader for viewing a PDF file; I want to perform certain operations on that text later.
I have already googled a lot but still haven't found an exact solution.
I already tried https://github.com/zachron/pdfiphone
but those files use the ARMv6 architecture, which is obsolete with Xcode 4.5.
If anyone can suggest some exact and non-confusing code using the Quartz 2D framework of iOS, that would be great.
Here is some sample code to extract text from a PDF; I hope this might help you:
https://github.com/zachron/pdfiphone
This is a library to get the text out of a PDF on the iPhone.
There is another demo which uses OCR technology; find the link below:
https://github.com/nolanbrown/Tesseract-iPhone-Demo
Also check this page of the Quartz 2D Programming Guide; it covers everything you need to open and parse a PDF file in iOS. Note that it is not a simple task, since there's no method to extract the full text in one line. You have to work with the data as an input stream, using a CGPDFScanner.
Two other libraries:
https://github.com/KurtCode/PDFKitten/
https://github.com/mobfarm/FastPdfKit
This question comes up all the time. It is VERY hard to extract text from a PDF in general. The PDF specification is not designed with text extraction in mind. There are many libraries that try to do the job, essentially by reconstructing the text from the geometric placement of the individual glyphs. These libraries have varying degrees of success, but all will fail on certain PDF documents. In fact, some PDF documents have glyphs but no way to associate a glyph with a character. For these documents it is simply not possible to extract text, short of using some kind of OCR approach.
PDF is designed as a read-only format that is portable in the sense that a PDF document will be rendered identically on any platform. That is what it is best at, and what it should be used for.
If text is to be edited, do not use PDF.
Here (Extracting text from pdf using objective-c), I found an answer to your question, and it works, but not as well as I need it to :(
it can extract only ASCII
it returns only one paragraph
Good luck.
I have a piece of software called PDF2XL, which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file, though, gives me gibberish output that I can't even copy and paste into this text area correctly. All sorts of Unicode weirdness.
If I copy and paste as normal into Excel/Notepad, I get the same issue.
I assume it's something to do with a messed-up character-encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of software to do it.
Thanks!
An increasing number of PDF files use subsetted fonts, which amounts to a custom encoding. Normally the font descriptor in the PDF should include a ToUnicode table that allows text extraction to decode the font encoding and return the correct text.
Some PDF producers omit this on purpose to prevent easy text extraction, for things such as financial reports. If there is only one font, you could decode it manually, but in my experience I have seen PDFs with multiple random encodings, which makes it nearly impossible to decode automatically.
One way to test for these types of PDFs is to open the file in Acrobat, select some text, copy it, and paste it into Notepad. If the text is garbled, then the PDF is using a subsetted font and there is not much more you can do. If Acrobat can't extract the text correctly, then nothing else can; it may as well be a page of hieroglyphs.
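A way to run the same test without Acrobat is to look for the ToUnicode entries directly. A minimal diagnostic sketch, assuming the Python library pypdf is installed and using "suspect.pdf" as a placeholder filename:

# check_tounicode.py -- list each page's fonts and whether they carry a
# ToUnicode CMap; a missing map is the "garbled extraction" case above.
from pypdf import PdfReader

reader = PdfReader("suspect.pdf")
for page_num, page in enumerate(reader.pages, start=1):
    resources = page.get("/Resources")
    if resources is None:
        continue
    fonts = resources.get_object().get("/Font")
    if fonts is None:
        continue
    for name, ref in fonts.get_object().items():
        font = ref.get_object()
        has_map = "/ToUnicode" in font
        print(f"page {page_num}: {name} ({font.get('/BaseFont')}) "
              f"ToUnicode: {'yes' if has_map else 'MISSING'}")

A font reported as MISSING, often with a BaseFont name like ABCDEF+SomeFont (the subset prefix), is exactly the situation described above: the glyph codes have no declared mapping back to Unicode.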