This is not a programming question.
I would like to know whether anyone knows of an open-source PDF editor based on MuPDF.
In fact, I just need the following features:
Highlight a selected rectangle (and delete a highlighted rectangle).
Draw a line through some text to indicate that the text should be deleted or ignored (and delete this line again).
Rotate a page or an entire PDF file.
Add a comment (annotation).
Thank you in advance.
I don't think there is a useful PDF editor based on MuPDF (see (1) Editing a PDF with MuPDF and (2) the MuPDF Wikipedia article), though there are other PDF editors that are free software:
PDFedit
Pdftk
Hope that this will be of some help :)
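Of the features asked for, page rotation is the one that Pdftk handles from the command line. Here is a minimal sketch, assuming a pdftk 2.x-style binary on the PATH that accepts the `north`/`east`/`south`/`west` rotation keywords (older releases used single letters such as `E`). The function name and file names are hypothetical, and the code only builds the command without running it:

```python
import shlex

# Rotation keywords accepted by pdftk 2.x after a page range (assumption:
# older pdftk releases spell these N, E, S, W instead).
ROTATIONS = {0: "north", 90: "east", 180: "south", 270: "west"}


def pdftk_rotate_command(src, dst, degrees, pages="1-end"):
    """Build a pdftk command line that sets `pages` of `src` to an absolute rotation."""
    if degrees % 360 not in ROTATIONS:
        raise ValueError("PDF pages can only be rotated in multiples of 90 degrees")
    keyword = ROTATIONS[degrees % 360]
    # Example result: pdftk in.pdf cat 1-endeast output out.pdf
    return ["pdftk", src, "cat", pages + keyword, "output", dst]


cmd = pdftk_rotate_command("in.pdf", "out.pdf", 90)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. As far as I know, Pdftk does not add highlight, strike-out or comment annotations, so this covers only the rotation item on the list.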
Related
I have two versions of the same scanned PDF. One of them has an OCR layer. How can I transfer the layer to the other one? I have already installed Ghostscript, but I don't know what to do next.
How to Use Ghostscript
There's no such thing as an 'OCR layer' in PDF.
Most likely what you have is a PDF file which has a scanned image and the text extracted from that image using OCR which has been drawn as 'invisible' text (text rendering mode 3).
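To make "text rendering mode 3" concrete, here is a rough sketch of the kind of content-stream operators that sit on top of the scanned image. It is an illustration under assumptions, not what any particular OCR tool emits: the font resource name `/F1`, the coordinates and the helper names are hypothetical, and the string handling assumes a simple single-byte font encoding.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, font="/F1", size=12):
    """Return content-stream operators that place `text` without painting it."""
    return "\n".join([
        "BT",                               # begin text object
        f"{font} {size} Tf",                # select font and size
        "3 Tr",                             # rendering mode 3: neither fill nor stroke
        f"{x} {y} Td",                      # move to the text position
        f"({escape_pdf_string(text)}) Tj",  # show the string (invisibly)
        "ET",                               # end text object
    ])


print(invisible_text_ops("Scanned (OCR) words", 72, 700))
```

The text is positioned and "shown" like any other text, so viewers can select and search it, but with `3 Tr` nothing is painted over the image.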
In general you can't copy and paste text between PDF files, so it's very hard to do what you are asking. I don't know of any tools which will help you here, and I can say for certain that Ghostscript will not help you at all.
Most likely you will also need to copy the Font (or CIDFont) from the PDF file as well, and if it has a ToUnicode CMap you'll definitely also want that or search won't work (and there's little point in this sort of OCR otherwise).
Since you have a PDF file which includes the OCR'ed text, why not simply use that PDF? I can't see any reason why you would want to 'transfer' it to another PDF file.
I am looking for a method (of the Word OLE object) that can open a PDF in Microsoft Word.
I want to copy all pages of the PDF into a doc/docx file and add footers there.
Could anybody give me a hint on how to import a PDF?
PS: any sample code for this problem would be great.
Thanks,
Lilya
You need an OCR (Optical Character Recognition) engine to convert a PDF to a document. PDF is a generic format and can contain text as images, so converting a PDF to a document is very hard. SAP does not have any OCR function for this. OpenText (if the customer uses it) may have this functionality, but I don't have detailed information about OpenText. You need third-party tools for this. If the PDF contains real text, you can easily convert it to a text file with online services or command-line utilities; otherwise you need a professional SDK (for example ABBYY FineReader).
I used Foxit PDF Reader to save the PDF file as a text file and wrote a macro to read the text file. Of course, by doing so, you only get the text and nothing else.
I have access to a scanner at my library which can create "searchable PDFs." These are PDFs that show the exact image of a scanned document, but there is a kind of hidden text in the PDF that can be selected when you try to select a portion of the image that contains text. In this way you can copy and paste text or search for text in the scanned document. This is VERY useful. It's an awesome improvement over raw scanned images. I also have several apps on my mac that can create this kind of searchable PDF from a scanned document or a raw image.
Now it's obvious to anyone who has ever used OCR that the process of converting images to text is not 100% accurate, so the text that you search or copy will not be correct in some places.
So I searched for quite some time to find an application that would load a searchable PDF and allow me to repair the hidden searchable text without reformatting or modifying the original scanned image.
Does anyone know of a tool (or library API) that would allow this?
It's worth saying here that I tried the latest version of Adobe Acrobat DC for Mac, and it doesn't seem to even allow me to view the hidden searchable text, much less edit it. It does allow me to replace the scanned image with the results of its own OCR process so that I could edit and save the document. But this would produce horrible results for any of the scanned documents that I am using. It seems designed for editing a "native PDF", not for editing a scanned document.
I have also tried ABBYY FineReader with no luck.
I'm using ABBYY FineReader 12 Professional (not open source).
Just open a scanned image or scanned PDF and press Verify Text (or Ctrl+F7), then go over all the spelling errors and low-confidence characters and fix them.
The program is very good: it shows you the exact place in the image/PDF to correct and the OCR guess side by side for convenience. It iterates through all of them.
[By the way, I'm using the shortcuts to speed up things:
Alt+Enter to add the unrecognized word to dictionary.
Ctrl+Delete to skip word or confirm in case you fixed it.]
Then save the document as a PDF file (Menu: File > Save Document As > PDF File), and you can search it in every PDF reader. The saved file looks the same as the scanned one, but 'behind' it there is text.
It's weird that you tried ABBYY with no luck... it's working great for me. Maybe you didn't try the Professional version.
Hope it helps you.
Creating a searchable PDF from images is not what the poster is after; he wants to start with an already searchable PDF and modify its text (e.g. because a searchable PDF was made initially, but an overlooked recognition error was found later and needs correcting). I see no way and no tool that assists in doing this.
Example PDF page: https://db.tt/qRcF000k
This is a sample page from a document where copied text shows as question marks in my favorite reader, SumatraPDF (MuPDF), just the same as in Adobe Acrobat. But my main problem is that because of this I cannot search this document, nor can I index it.
OTOH, xpdf's pdftotext extracts correct text.
In Adobe Acrobat if I use "Copy as formatted text", correct text is written to clipboard, although I still can't search from Acrobat.
Also if I open the linked page in Firefox's built-in PDF reader I can correctly copy the text.
Can Ghostscript perhaps be instructed to correct this issue, which I cannot describe other than as 'unreadable characters'?
The PDF file uses subset fonts with non-standard Encodings and no ToUnicode CMaps. So no, you can't have Ghostscript 'correct' this file.
In fact I can't see how anything can possibly be extracting sensible text from this, and indeed my versions of Acrobat (Pro X and Reader XI) can't copy meaningful text and don't appear to have a 'copy as formatted text' menu item. Can you tell me where to find this?
However, I notice that the PDF file has actually been created by Ghostscript (version 9.14), so possibly you mean 'starting with a different input file, which I haven't given you, could I have generated a PDF file where the text could be copied', to which I can only say 'I don't know'; it depends on what was in the original input file.
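To illustrate why the missing ToUnicode CMaps matter, here is a small sketch of how a text extractor turns font character codes into Unicode. The CMap fragment, the codes and the helper names are made up for illustration, only `bfchar` sections are handled, and the `?` fallback is just one possible behaviour chosen to mirror the symptom described in the question:

```python
import re

# A hypothetical fragment of a ToUnicode CMap (assumption: bfchar entries only;
# real CMaps may also use beginbfrange sections, which this sketch ignores).
SAMPLE_CMAP = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Map character codes to Unicode strings from bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code, dest in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(code, 16)] = bytes.fromhex(dest).decode("utf-16-be")
    return mapping


def decode_codes(codes, mapping):
    """Translate font character codes to text; unmapped codes become '?'."""
    return "".join(mapping.get(code, "?") for code in codes)


to_unicode = parse_bfchar(SAMPLE_CMAP)
print(decode_codes([1, 2], to_unicode))  # codes covered by the CMap -> Hi
print(decode_codes([1, 2], {}))          # no ToUnicode CMap at all -> ??
```

With a subset font and a non-standard Encoding, the codes `1` and `2` mean nothing outside that font, so without the mapping a reader has nothing reliable to search or copy.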
We all know that we can highlight certain text in a PDF file using either Adobe Acrobat or Preview on Mac. I'm wondering how I can extract all these highlights from a PDF file and generate a summary (a kind of note).
The following post
PDF: standard format for highlights?
points out that there are multiple ways to do highlighting. Will it be a challenge to distinguish the original content of the file from the user-added highlights if shapes with transparency are used to achieve highlights?
Details about this can be found in open-source PDF parsing and rendering libraries; you just have to read the code or the documentation, if available.
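When a highlight is stored as a real annotation (`/Subtype /Highlight`), it carries a `/QuadPoints` array with 8 numbers per highlighted quadrilateral, and the highlighted text can be recovered by comparing those areas with word positions. The following is only a sketch of that geometric step: it assumes some PDF library has already given you the `/QuadPoints` values and a list of words with bounding boxes (both inputs below are invented), and it will not find highlights that were drawn as transparent shapes in the page content.

```python
def quads_to_rects(quad_points):
    """Split a flat /QuadPoints array into (x0, y0, x1, y1) bounding boxes."""
    if len(quad_points) % 8 != 0:
        raise ValueError("QuadPoints must hold 8 numbers per quadrilateral")
    rects = []
    for i in range(0, len(quad_points), 8):
        xs = quad_points[i:i + 8:2]      # x coordinates of the four corners
        ys = quad_points[i + 1:i + 8:2]  # y coordinates of the four corners
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def highlighted_words(words, quad_points):
    """Return words whose box centre lies inside any highlighted rectangle."""
    rects = quads_to_rects(quad_points)
    picked = []
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if any(rx0 <= cx <= rx1 and ry0 <= cy <= ry1 for rx0, ry0, rx1, ry1 in rects):
            picked.append(text)
    return picked


# Hypothetical page data: three words on one line, one highlight over the middle word.
words = [
    ("open", (72, 700, 100, 712)),
    ("source", (104, 700, 140, 712)),
    ("editor", (144, 700, 180, 712)),
]
quad = [102, 714, 142, 714, 102, 698, 142, 698]
print(highlighted_words(words, quad))
```

Using the word centre rather than full containment is a design choice that tolerates highlights drawn slightly tighter than the word boxes.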