In this question, mkl provides a fantastic answer to pnj's predicament. Unfortunately, we are facing a very similar issue (with a different font, Lohit - Devanagari, but still a Devanagari font). The second comment outlines the steps of the non-OCR solution beautifully, but there is a large gap in my understanding of PDFs and their structure. It would therefore be great to get some direction on the following:
"overwrite the ToUnicode map in this PDF using a general purpose PDF library with a low-level object access API for a programming language of your choice": Which Python library can I use to do this?
"traversing the PDF object structure, finding the ToUnicode map stream, replacing its content, and saving the result": Is there an example that shows exactly how this is done, for any font?
I hope this isn't too broad. Thank you!
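As a rough illustration of the "replacing its content" step, here is a minimal sketch of how the text of a replacement ToUnicode CMap could be generated. It uses only the Python standard library and deliberately leaves out locating and rewriting the stream in the PDF, since that part depends on the library chosen. The function name `build_tounicode_cmap`, the sample character codes, and the assumption of 2-byte codes are all hypothetical; the real code-to-text mapping has to be worked out for the font in question.

```python
def build_tounicode_cmap(mapping, code_bytes=2):
    """Return ToUnicode CMap text for a {character code: Unicode text} mapping.

    Assumption: every code in the font is `code_bytes` bytes wide
    (2 for Identity-H composite fonts, 1 for simple fonts).
    """
    width = code_bytes * 2  # hex digits per character code
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        f"<{'00' * code_bytes}> <{'FF' * code_bytes}>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # A single bfchar block may hold at most 100 entries, so emit in chunks.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # The destination is the UTF-16BE encoding of the text, in hex.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:0{width}X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical codes: one glyph for a single letter, one for a conjunct.
    print(build_tounicode_cmap({0x0024: "\u0915", 0x0025: "\u0915\u094d\u0937"}))
```

One glyph may map to several Unicode characters, which is what makes this approach workable for Devanagari conjuncts. The resulting text would then be written, uncompressed or re-compressed, into the stream that the font's /ToUnicode entry points to.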
I am having trouble finding a solution to the problem described below.
Annotate a PDF file when the user clicks on a specific location in it, and then save the PDF so that in future it opens at the annotated location.
How to approach this?
What I have tried:
I looked for libraries regardless of programming language (the language is not a constraint) and found a few relevant ones, such as minipdf in Python and PDFBox in Java. I finally selected PDFBox, since it seemed mature enough to get close to a solution.
The hurdle now is how to get the location the user clicked. Once I have the location, I can annotate at that point and then save the PDF so that it reopens at the same location.
It seems I would have to write all of this in PDF JavaScript, but again, how would I do that?
I had a similar problem and solved it a different way. In my case I am opening the PDF in a browser rather than in Adobe Reader, so I converted the PDF to HTML using Python libraries (let me know if you are interested and I will share the library names with their pros and cons).
That HTML can then be edited easily. Since we have the source, we can add hyperlinks, highlights, and so on.
This workaround may be applicable to you if your front end is web-based.
PS: I wanted to post this workaround as a comment, but couldn't because I don't have enough reputation yet. I hope it won't be downvoted :)
I am trying to create a PDF viewer using the iTextSharp library, but there doesn't seem to be any documentation anywhere about how I can accomplish this. I don't need to create a PDF file, just display one and give users the option to save the file or export it to a CSV file.
Can somebody please point me in the right direction?
Neither iText nor iTextSharp is a PDF viewer, but either can be used to examine a PDF document. See, for instance, iText RUPS, a tool that lets you look under the hood of a PDF, more specifically at the PDF objects stored in it as well as at the content streams.
This would be the first step towards writing a PDF viewer. However, iTextSharp doesn't interpret the content stream of a page, nor the resources that belong to that page (such as image streams, glyph descriptions, etc). If that's what you want to build, you need to consult ISO-32000-1. Note that it will probably take several man years to create a decent viewer.
As for the requirement to export a PDF document to a CSV, this may be possible if your original PDF is a Tagged PDF, but it will be impossible for the majority of PDF documents, including documents that consist of scanned images and documents with no machine-recognizable structure.
Please understand that this is a general answer. A more specific answer cannot be given, since your question is too broad for Stack Overflow. All the answers you need can be found by using iText RUPS and reading ISO-32000-1 (there's a copy of ISO-32000-1 available on Adobe's web site).
I am using tools such as PDFBox to interpret PDF files (including text, strokes, glyphs and images) and can access the streams and dictionaries. I am not clear on how these components link together and how to interpret them. In particular I would like to know how to access fonts from the streams.
NOTE: I am not interested in tutorials on how to create PDF documents.
You should probably start by reading the PDF Reference. It's a huge document, but you can read only the relevant parts.
To understand font streams you basically need to read about the TrueType and Type1 font formats (not easy reading either). A PDF may contain other font types, but TrueType and Type1 are probably the most widely used.
Fiddling with fonts can get complicated, so you will probably find it easier to use a font library such as FreeType to extract information from PDF font streams.
There are lots of good articles on planetpdf.com, and many PDF developers run blogs with useful generic articles. We have published a whole load on our blog (http://www.jpedal.org/PDFblog/)
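Before handing an embedded font program to a font library, it can help to check what kind of data it is. The following is a minimal sketch, assuming you have already pulled the decoded (filter-free) bytes of the font stream out of the PDF with your library of choice. The function name `sniff_font_program` and the returned labels are hypothetical, and the checks are heuristics based on the well-known leading bytes of each format, not a full validation.

```python
def sniff_font_program(data: bytes) -> str:
    """Guess the format of an embedded font program from its first bytes."""
    head = data[:4]
    if head in (b"\x00\x01\x00\x00", b"true"):
        return "TrueType"
    if head == b"OTTO":
        return "OpenType (CFF)"
    if head == b"ttcf":
        return "TrueType collection"
    if data.startswith((b"%!PS-AdobeFont", b"%!FontType1")):
        return "Type 1"
    if data[:2] == b"\x80\x01":
        # PFB segment marker; unusual inside a PDF but seen in font files.
        return "Type 1 (PFB wrapper)"
    # Bare CFF data: major version 1 and an offset size between 1 and 4.
    if len(data) >= 4 and data[0] == 1 and 1 <= data[3] <= 4:
        return "CFF"
    return "unknown"


if __name__ == "__main__":
    print(sniff_font_program(b"\x00\x01\x00\x00" + b"\x00" * 8))
    print(sniff_font_program(b"%!PS-AdobeFont-1.0: Example 001.000"))
```

In a PDF, the key in the font descriptor also tells you what to expect: FontFile normally holds Type 1 data, FontFile2 holds TrueType data, and FontFile3 holds CFF or OpenType data depending on its Subtype.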
I have PDF files with embedded OCR data (they have already been OCR'd), so they are searchable. Now I want to extract this OCR data, because I want to put it into my Tomcat 6 search server. To do this, I need the plain OCR data.
So my question is: is it possible to extract this embedded OCR data from the PDF files?
It would be nice to get files with coordinates. But it would also be sufficient to get plaintext files.
You should be able to do this with iText or iTextSharp. iTextSharp has virtually no documentation, however, and a good number of its functions are not equivalent to those found in iText.
PDFSharp does not support iref streams. Those are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they have some pretty extensive PDF libraries as well.
What I need is to read a PDF, make some transformations (generate TOC bookmarks) and write it back.
I found http://hackage.haskell.org/package/HPDF, but it only mentions generating PDFs, not parsing them (although I could have missed it).
Haskell is chosen purely for (self)educational purposes.
There are a few tools for PDF manipulation, though they seem to be biased towards generation rather than parsing:
http://johnmacfarlane.net/pandoc/
Pandoc is a great cross-markup library, but doesn't support PDF parsing (it does support PDF generation from a variety of formats).
There's also:
http://hackage.haskell.org/package/HsHaruPDF
http://hackage.haskell.org/package/pdf2line -- tool for extracting text from pdf
http://hackage.haskell.org/package/HPDF -- another pdf generation library
I'm not sure we have a good parsing tool yet.
Also as a learning exercise, I started a PDF parsing library in Haskell, but it's incomplete and has been languishing a bit from lack of attention. I'd be happy to share it with you, and would love feedback, improvements, etc. It's not currently hosted on Hackage, but if you're interested in working with an incomplete implementation, let me know and I'll ask some colleagues for advice on getting it up there.
Here's a Haskell binding to parts of xpdf:
http://hackage.haskell.org/package/pdf2line
Check out the pdf-toolbox library. Its support for generating PDF files is low-level, but powerful enough for your task.
Here is an example of how to change the title of an existing PDF file using the incremental update feature.
Another package to consider is rakhana, which is also on Hackage.
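To show what an incremental update amounts to at the file level, here is a minimal, library-free sketch in Python. It is not the pdf-toolbox example and does not use that library's API. The function names `set_title_incrementally` and `_pdf_string` are hypothetical. It assumes an unencrypted PDF with a classic `xref` table and `trailer` keyword (not a cross-reference stream), a title that fits in Latin-1, and it replaces the whole Info dictionary rather than merging with an existing one.

```python
import re


def _pdf_string(text: str) -> bytes:
    """Encode text as a PDF literal string, escaping backslash and parentheses."""
    raw = text.encode("latin-1")
    raw = raw.replace(b"\\", b"\\\\").replace(b"(", b"\\(").replace(b")", b"\\)")
    return b"(" + raw + b")"


def set_title_incrementally(pdf: bytes, title: str) -> bytes:
    """Append an update section that adds a new Info dictionary with /Title."""
    # Read what we need from the last trailer of the original file.
    prev = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])
    trailer = pdf[pdf.rindex(b"trailer"):]
    size = int(re.search(rb"/Size\s+(\d+)", trailer).group(1))
    root = re.search(rb"/Root\s+(\d+\s+\d+\s+R)", trailer).group(1)

    out = bytearray(pdf)  # the original bytes are kept untouched
    if not out.endswith(b"\n"):
        out += b"\n"

    # New Info object, numbered `size` (the first unused object number).
    obj_offset = len(out)
    out += b"%d 0 obj\n<< /Title %s >>\nendobj\n" % (size, _pdf_string(title))

    # Cross-reference section covering only the new object (20-byte entry).
    xref_offset = len(out)
    out += b"xref\n%d 1\n%010d 00000 n \n" % (size, obj_offset)

    # The new trailer points back to the previous xref section via /Prev.
    out += b"trailer\n<< /Size %d /Root %s /Prev %d /Info %d 0 R >>\n" % (
        size + 1, root, prev, size)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Nothing in the original file is rewritten: the update is appended after the old `%%EOF`, and a reader follows `/Prev` to find the earlier objects. A real tool would also carry over entries such as /ID from the old trailer.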