Haskell: parsing PDF

What I need is to read a PDF, apply some transformations (generate TOC bookmarks), and write it back.
I found HPDF (http://hackage.haskell.org/package/HPDF), but its documentation only mentions generating PDF, not parsing it (although I could have missed it).
Haskell is chosen purely for (self)educational purposes.

There are a few tools for PDF manipulation, though they seem biased towards generation rather than parsing:
http://johnmacfarlane.net/pandoc/
Pandoc is a great cross-markup library, but doesn't support PDF parsing (it does support PDF generation from a variety of formats).
There's also:
http://hackage.haskell.org/package/HsHaruPDF
http://hackage.haskell.org/package/pdf2line -- tool for extracting text from pdf
http://hackage.haskell.org/package/HPDF -- another pdf generation library
I'm not sure we have a good parsing tool yet.

Also as a learning exercise, I started a PDF parsing library in Haskell, but it's incomplete and has been languishing a bit from lack of attention. I'd be happy to share it with you, and would love feedback, improvements, etc. It's not currently hosted on hackage, but if you're interested in working with an incomplete implementation, let me know and I'll ask some colleagues for advice on getting it up there.

Here's a Haskell binding to parts of xpdf:
http://hackage.haskell.org/package/pdf2line

Check out the pdf-toolbox library. Its support for generating PDF files is low level, but powerful enough for your task.
Here is an example of how to change the title of an existing PDF file using the incremental update feature.

Another package to consider is rakhana, which is also on Hackage.

Related

How do I translate my IDML file using the Okapi Rainbow application?

As the question suggests, I have an IDML file along with an XLIFF file for translation. I'm using the Okapi Rainbow application for translation, but I'm having a hard time understanding how to translate my IDML file.
P.S. If anybody feels this question does not belong in the programming section, please move it to an appropriate section rather than downvoting. Thank you.
Also, I could not create new tags such as okapi or okapi-rainbow as I do not have the required reputation.
Your question is not related to programming, but it may be relevant to everybody who develops software that is intended for users that speak different languages, so I will try to answer it here.
You mention that you have an IDML and an XLIFF file and you want to translate the IDML file. This approach does not really make sense: a typical translation workflow would be
1. extract translatables from native format (IDML in your case) into XLIFF
2. send XLIFF to a translator
3. get translated XLIFF back
4. import translated XLIFF into native format
5. do post-translation processing (adapt fonts, resize text boxes, update tables of content, to name just a few)
Okapi Rainbow is a tool that you can use for steps 1 and 4 in the process above. It will not perform automatic translation for you.
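To give a rough idea of what the XLIFF exchanged with the translator carries, here is a minimal Python sketch that lists the segments still missing a translation. It assumes XLIFF 1.2 (the namespace below) and uses a made-up sample document; the function name `untranslated_units` is hypothetical, and real Rainbow output will contain more metadata and inline codes than this.

```python
import xml.etree.ElementTree as ET

# Assumption: the file is XLIFF 1.2, which uses this namespace.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Made-up sample for illustration only, not actual Rainbow output.
SAMPLE_XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="sample.idml" source-language="en" target-language="de" datatype="xml">
    <body>
      <trans-unit id="1">
        <source>Hello</source>
        <target>Hallo</target>
      </trans-unit>
      <trans-unit id="2">
        <source>Goodbye</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def untranslated_units(xliff_text):
    """Return (id, source text) for every trans-unit with no or empty target."""
    root = ET.fromstring(xliff_text)
    pending = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        target = unit.find("x:target", XLIFF_NS)
        target_text = "".join(target.itertext()).strip() if target is not None else ""
        if not target_text:
            source = unit.find("x:source", XLIFF_NS)
            source_text = "".join(source.itertext()) if source is not None else ""
            pending.append((unit.get("id"), source_text))
    return pending


if __name__ == "__main__":
    for unit_id, text in untranslated_units(SAMPLE_XLIFF):
        print(unit_id, text)
```

A check like this is only a sanity test before importing the translated XLIFF back into the native format; it does not replace Rainbow's own merge step.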
Actually the first thing you should do is ask your translator if he can process IDML. Some translation environments have built-in support for IDML, in which case you don't have to bother with providing XLIFF.
This might be a bit off-topic here.
I'm not sure how to get from IDML to XLIFF, but maybe there is a filter for it.
Once you have the XLIFF, you can use OmegaT to translate it,
or pretty much any other professional translation tool, e.g. SDL Trados.

DocBook to PDF with Corporate Identity (Linux)

We have a wiki where we document our work and projects. From the wiki you can download articles in different formats: plain text, HTML, or DocBook.
Now I need to transform the DocBook into a PDF. This part works; I did it with dblatex:
dblatex doc.xml
But the PDF document needs our corporate identity (header and footer). I have no idea how to do that.
Any suggestions?
I have done a lot of work with the DocBook XSLT stylesheets that produce Formatting Objects (FO). That's a different way to publish PDF from DocBook source.
There is an excellent set of documentation that explains how to customize PDF output if you're using the FO workflow. Here's the section about that:
http://www.sagehill.net/docbookxsl/PrintHeaders.html
Learning how to customize the dblatex conversion might be a great choice. I have never used it so I'm offering the FO conversion as an alternative.
Good luck!

How does Wikipedia generate PDFs?

I would like to know how Wikipedia (http://en.wikipedia.org/) creates PDFs. It seems to be using some application on the back end. Could anyone please let me know how this is done?
Thanks
Srikanth
Wikipedia runs MediaWiki.
A quick Google search tells me that there are two PDF extensions for it.
The one that is still maintained is PDF_Writer.
It doesn't use a PHP HTML→PDF generator (though some exist).
It actually does something trickier and more clever:
"The PDF Writer uses the Python ReportLab libraries to generate PDF based on a DOM derived from parsing mediawiki-markup using the mwlib parser."
To confirm ZJR's answer, these are the document properties:

How to extract embedded OCR data from a PDF?

I have PDF files with embedded OCR data (I have already OCRed them, so they are searchable). Now I want to extract this OCR data because I want to put it in my Tomcat 6 search server. For that, I need the plain OCR data.
So my question is: is it possible to extract this embedded OCR data from the PDF files?
It would be nice to get files with coordinates. But it would also be sufficient to get plaintext files.
You should be able to do this with iText or iTextSharp. iTextSharp has no documentation, however, and a good number of its functions are not equivalent to those found in iText.
PDFSharp does not support iref streams. Those are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they have some pretty extensive PDF libraries as well.

A technology for reading pdfs online with annotations?

Is there an open source solution that displays PDFs for online reading? It has to be searchable, much like Google Books, and if possible be able to display annotations.
By "online reading" I'll assume you mean without a PDF reader plugin on the client. In that case you'll need to convert to HTML:
http://pdftohtml.sourceforge.net/
If you don't mind losing the ability to copy text, then converting to PNG may give you a more accurate rendering:
http://www.imagemagick.org/
Regardless of the output format you can manage your searching using the original PDF data. One technology for this is mnogosearch
http://www.mnogosearch.org/
mnogosearch uses pdftotext internally, which you may find useful if you want to write your own search routines. pdftotext is part of the Xpdf suite of utilities:
http://www.foolabs.com/xpdf/about.html
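If you do write your own search routine on top of pdftotext, a minimal Python sketch might look like the following. It assumes pdftotext is installed and on the PATH, and the helper names (`pdftotext_command`, `extract_text`, `find_matches`) are hypothetical. It relies on pdftotext writing to stdout when the output file is given as `-` and on pages being separated by form feed characters in the output.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_matches(text, query):
    """Return (page_number, line) pairs for case-insensitive matches.

    Pages are assumed to be separated by form feed characters.
    """
    needle = query.lower()
    matches = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line in page.splitlines():
            if needle in line.lower():
                matches.append((page_no, line.strip()))
    return matches


if __name__ == "__main__":
    # "document.pdf" is a placeholder path.
    for page_no, line in find_matches(extract_text("document.pdf"), "annotation"):
        print(page_no, line)
```

The page numbers make it possible to link a search hit back to the right page of whichever rendering (HTML or PNG) you serve.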
All of the tools listed above are available on Windows and Linux.
You may also be interested in the Vuzit DocuPub Platform: http://vuzit.com/products/docupub_platform
The display technology itself is not open source, but they provide an API to access their service, so perhaps it is worth investigating.
I don't know whether you are looking for software to install or a service to pay for...
I've read a lot about www.getbackboard.com (this is not advertising, only reporting something I've read about, that maybe fits your needs.. ;)
Not sure if they do annotations, but both of these will show PDFs quite well:
http://pdfmenot.com
http://docs.google.com
ICEPdf recently released their code as open source. It is Java based.
PyPdf is really nice. It supports reading the text as well as encryption, which as far as I know iTextSharp does not.
Of course you'd have to program in python as IronPython's class libraries aren't quite to the point where you can ref them from another language and use them. (But I imagine they will be someday soon)
PyPdf
This is not open source, but check it out anyway. You can download a free trial of their SDK to try it out. Reading PDFs and their annotations is not simple, and I wouldn't trust a production app to open source decoders.
Here is an online demo.
http://www.atalasoft.com/ajaxannotations/default.aspx
Another good pdf reader is FoxitReader.