Read existing PDF file with all format information - pdf

I want to read an existing PDF file and get not only the text but also the formatting information, such as font attributes (bold, italic, ...) and paragraphs. Is there a code library for doing this, and is it open source or commercial?
I am on Windows and favor C# libraries, but C/C++ is also acceptable.

I can very much recommend
pdflib (http://www.pdflib.com/).
It's commercial, but it also has a lite version which you can use for free privately. It has a lot of functionality and is available for all platforms.

I'd echo Mr. Meyers on this. There appear to be a number of them; search for "pdf parser library" (plus your language) in your favorite search engine.
A few top hits:
http://www.lowagie.com/iText/
http://metacpan.org/pod/PDF::Parse
http://podofo.sourceforge.net/
http://www.vicman.net/download/13733/ (several for .NET)
Note that if you want to edit an existing PDF, you might want to read this:
http://1t3xt.info/tutorials/faq.php?branch=faq.pdf_in_general&node=replace_word

The Pdfium.Net SDK can also help you. Via this API you can get access to a collection of text, images and other objects and their properties.
Please note I work at the company that develops this API.

Related

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of Windows builds available.
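As a rough illustration of the per-page conversion mentioned above, here is a minimal Python sketch that builds and runs a pdftops command line. It assumes poppler's pdftops is installed and on PATH; `-f` and `-l` are its first-page and last-page options. The helper names `pdftops_command` and `convert_page` are made up for this example.

```python
import shutil
import subprocess


def pdftops_command(pdf_path, ps_path, first_page=None, last_page=None):
    """Build the argument list for poppler's pdftops (hypothetical helper).

    -f and -l select the first and last page to convert.
    """
    cmd = ["pdftops"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, ps_path]
    return cmd


def convert_page(pdf_path, ps_path, page):
    """Convert a single page to PostScript; needs pdftops on PATH."""
    if shutil.which("pdftops") is None:
        raise RuntimeError("pdftops not found on PATH")
    subprocess.run(pdftops_command(pdf_path, ps_path, page, page), check=True)
```

Note that this yields the PostScript for a whole page, not for a selected area; narrowing it down further would still be up to you.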
The open source tool from iText called iText RUPS does what you want: it shows you all the PDF commands for a particular PDF and lets you visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/

Editing `ods` file in C++ code

I need to edit a LibreOffice Calc document programmatically in C++. I know that there is the odfkit library, which uses webodf, but it looks like it doesn't support editing .ods files.
Is there any alternative that provides this feature?
LibreOffice has an API, called UNO, for controlling it from another process. So if you need something more complicated, that would be the simplest route.
If you just need some simple transformation, the other option is to unpack the file with a plain old zip library (libzip, libarchive, ...) and modify the XML manually.
The OpenDocument site also mentions lpOD, but its website seems defunct, and while a search comes up with something that looks relevant, I am not sure whether there is anything usable.
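To make the unpack-and-edit route concrete, here is a minimal sketch of the idea in Python rather than C++ (the same steps would apply with libzip plus an XML parser). It assumes the cell text you want to change lives in `text:p` elements of `content.xml`; the function name `replace_cell_text` is made up for the example. Be aware that ElementTree may rewrite namespace prefixes on output, so a more faithful XML library may be preferable for real documents.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the text:p elements that hold visible cell text.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def replace_cell_text(ods_bytes, old, new):
    """Return a copy of an .ods archive with `old` replaced by `new`
    in every text:p element of content.xml (hypothetical helper)."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(ods_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                root = ET.fromstring(data)
                for p in root.iter(f"{{{TEXT_NS}}}p"):
                    if p.text and old in p.text:
                        p.text = p.text.replace(old, new)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # The ODF package format expects "mimetype" to be stored
            # uncompressed; entry order is preserved by iterating in order.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)
    return out_buf.getvalue()
```

This only touches the first text node of each paragraph; text inside nested spans, formulas and typed cell values (the `office:value` attributes) would need extra handling.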
See the SDK documentation, which has many examples.

I need a (preferably free) PDF/Word generator .Net component that can work from a document template

I'm looking for a .Net component that will allow me to generate Word and/or PDF documents.
This must work on the server without MS Office installation. Preferably free. Also, it needs to be able to generate the documents based on an existing template of some sort i.e. I don't want to generate the whole document from scratch but allow a number of different templates that all have similar content that comes from elsewhere (e.g. database, XML files etc).
My initial investigations have turned up iTextSharp (but not sure if it can work from templates).
Any help that can expedite my investigation time will be much appreciated.
Thanks
I use ActivePDF at work with .NET - give it some HTML and it will output a pdf doc. However it isn't free - but we did look at a few other ways and this was 1
http://pdfcrowd.com/html-to-pdf-api/
It doesn't do Word documents, but it converts HTML (your template) to PDF.

How to open PDF and read it?

How can I open a PDF file and read some of its contents with Python (this language is preferred, but Ruby, Perl or PHP are fine too), provided the text is recognized (not just an image), or report that it's impossible without OCR? TIA
Update: thanks for the solutions, I'm sure some of them will suit me fine.
#RichH, I have a PDF file and don't know whether it is image- or text-based. I'm looking for a tool to help me find that out and, in case it's text-based, extract some of its contents.
For Perl, check out these modules:
PDF::API2
CAM::PDF
Parsing a PDF and making something useful out of it is hard, because the format is focused on preserving the layout. Text can be stored so that each letter is positioned individually, and depending on the font, the text might also be stored as graphics.
Libraries I know of for reading PDFs include the Zend Framework, whose PDF component includes a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which gives quite usable results and offers bindings for several languages.
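For the narrower question of whether a file is text-based at all, a very rough heuristic can be sketched with the Python standard library alone: look through the content streams for text-showing operators (`Tj`, `TJ`) inside a `BT ... ET` text object. This is only a sketch under strong assumptions: it handles uncompressed and Flate-compressed streams only, ignores object streams and other filters, and the function name `looks_text_based` is made up. A real PDF library will be far more reliable.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A string or array operand followed by the Tj / TJ text-showing operators.
TEXT_OP_RE = re.compile(rb"\)\s*Tj|\]\s*TJ")


def looks_text_based(pdf_bytes):
    """Heuristic: True if any stream appears to draw text (hypothetical helper)."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing bytes after the zlib stream.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate-compressed; scan the bytes as they are
        if b"BT" in data and TEXT_OP_RE.search(data):
            return True
    return False
```

A `False` result suggests, but does not prove, that the pages are images and OCR would be needed; a `True` result says nothing about how easy the text is to reassemble into words and paragraphs.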

Generate PDF from structured data

I want to be able to generate a highly graphical (with lots of text content as well) PDF file from data that I might have in a database or xml or any other structured form.
Currently our graphic designer creates these PDF files in Photoshop manually after getting the content as a MS Word Document. But usually, there are more than 20 revisions of the content; small changes here and there, spelling corrections, etc.
The 2 disadvantages are:
1) The graphic designer's time is unnecessarily occupied. The first version is the only one he/she should have to work on.
2) The PDF file becomes the document which now has the final revised content, and the initial content is out of sync with it. So if the initial content needs to be somewhere else (like on a website), we need to recreate it from the PDF file.
Generating the PDF file will help me solve both these problems. Perhaps some way in which the graphic designer creates a "Template" and then puts in tags/holders and maps these tags/holders to the relevant data.
Thanks :-)
There are some tools out there for doing this. XSL-FO is useful. Here is a tutorial for creating a pdf from xml (or xhtml) with cocoon. Also see Apache FOP.
You could format your SQL data as XML and still use the same templates this way.
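As an illustration of the "format your SQL data as XML" step, here is a small Python sketch using sqlite3 and ElementTree. The element names (`rows`, `row`) and the function name `rows_to_xml` are arbitrary choices for the example, and it assumes the column names are valid XML element names. The resulting XML is what you would then feed to an XSL-FO stylesheet.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, query, root_tag="rows", row_tag="row"):
    """Serialise a query result as XML, one child element per column
    (hypothetical helper; tag names are arbitrary)."""
    cur = conn.execute(query)
    columns = [d[0] for d in cur.description]
    root = ET.Element(root_tag)
    for record in cur:
        row = ET.SubElement(root, row_tag)
        for name, value in zip(columns, record):
            # NULL becomes an empty element; everything else is stringified.
            ET.SubElement(row, name).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")
```

ElementTree takes care of escaping characters such as `&` and `<` in the data.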
I use the ReportLab python library for this. It could perhaps solve your problem, but you will need to do some work...
In the past I have written scripts that spit out LaTeX then used texi2pdf to solve this kind of problem.
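A minimal sketch of that kind of script, in Python: fill a LaTeX skeleton from data, escaping the characters LaTeX treats specially. The document skeleton and the names `latex_escape` and `render` are invented for the example; the output would be written to a .tex file and then handed to texi2pdf.

```python
from string import Template

# Characters with special meaning in LaTeX and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text):
    """Escape each character individually so replacements never interact."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


# Hypothetical skeleton; a designer-made template would go here instead.
DOCUMENT = Template(r"""\documentclass{article}
\begin{document}
\section*{$title}
$body
\end{document}
""")


def render(title, body):
    """Return LaTeX source with the escaped title and body filled in."""
    return DOCUMENT.substitute(title=latex_escape(title), body=latex_escape(body))
```

Keeping the layout in the skeleton and the content in the data means a content revision only requires re-running the script.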
Take a look at iReport and JasperReports at http://jasperforge.org.
iReport lets you design reports, and then you can either programmatically fill them with the JasperReports library (Java), or just use iReport to create the report manually.
I have only used it for tabular data, but I don't think there would be any problem for other types of documents.
You could create a form and populate the entries programmatically using a pdf library like iText (Java).
You could look at doing the workflow in PostScript which is plain text that you can easily compose from fragments. Then you can use any free tool to convert to PDF.
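To show what composing PostScript from fragments could look like, here is a small Python sketch that assembles a one-page file from text lines. The coordinates, font choice and function names are arbitrary assumptions for the example, and it handles plain ASCII text only. The result could then be converted with a tool such as Ghostscript's ps2pdf.

```python
def ps_string(text):
    """Escape backslashes and parentheses for a PostScript string literal."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


# Fragment 1: file header and font setup (arbitrary font and size).
HEADER = "%!PS\n/Helvetica findfont 12 scalefont setfont\n"


def text_line(x, y, text):
    """Fragment 2: draw one line of text at (x, y), in points."""
    return f"{x} {y} moveto {ps_string(text)} show\n"


def build_page(lines, left=72, top=720, leading=16):
    """Compose the fragments into a single-page PostScript program."""
    parts = [HEADER]
    for i, line in enumerate(lines):
        parts.append(text_line(left, top - i * leading, line))
    parts.append("showpage\n")
    return "".join(parts)
```

For example, `build_page(["Hello (world)", "Second line"])` yields a program whose first text line reads `72 720 moveto (Hello \(world\)) show`.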
Take a look at Prince XML. This tool lets you generate PDF from XML or HTML and CSS.
A possible way is to use a template engine, like FreeMarker or StringTemplate: these are often used to generate HTML, but they are flexible enough to output any format, actually.
The problem is to make a PDF template, I suppose. Perhaps you can take a sample output and edit it to replace data with placeholders to be filled by the template engine. Might not be trivial!
Sounds like a job that SQL Server Reporting Services can handle quite easily.
Reporting Services allows you to query the data, define the layout, and export to PDF without any intervention. The PDF output can be distributed via email, stored on a file share, and accessed via a page on the report server.
It can handle XML data sources too.
Another approach to generating a PDF file from data is to use Prawn, which is based on Ruby. I was very pleasantly surprised by how much functionality is included in Prawn. It may take some investment up front, but this approach will give you a lot of flexibility.
You can combine CSStoXSLFO with XEP from RenderX for high quality output. With this solution you can merge XML data into an XHTML template, which is decorated with CSS. It can also generate charts with the fantastic JFreeChart library. CSS3 page media features are supported.