How can I open a PDF file and read some of its contents with Python (preferred, though Ruby, Perl or PHP are fine too), provided it contains recognized text rather than just an image, or else report that it's impossible without OCR? TIA
Update: thanks for the solutions, I'm sure some of them will suit me fine.
#RichH, I have a PDF file and don't know whether it is image- or text-based. I'm looking for a tool to help me find that out and, if it's text-based, extract some of its contents.
For Perl, check out these modules:
PDF::API2
CAM::PDF
Parsing PDF and making something useful out of it is hard because the format is focused on preserving layout: text can be stored so that each letter is positioned individually and, depending on the font, it might even be stored as graphics.
Libraries I know of for reading PDFs include the Zend Framework, whose PDF component includes a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which gives quite usable results and offers bindings for different languages.
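For the Python side of the question, here is a rough sketch of the "text-based or image-only?" check. It assumes Poppler's pdftotext command-line tool is installed and on the PATH, and it simply counts how many letters and digits come back from the first few pages. The function names and the min_chars threshold are made up for illustration; a scanned PDF with a hidden OCR layer would still count as text-based.

```python
import subprocess


def extract_text(pdf_path, last_page=3):
    """Run pdftotext on pages 1..last_page and return whatever text it finds."""
    result = subprocess.run(
        ["pdftotext", "-l", str(last_page), pdf_path, "-"],  # "-" means stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_text_based(text, min_chars=20):
    """Guess whether the extracted text is real content.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    return sum(1 for ch in text if ch.isalnum()) >= min_chars


def report(pdf_path):
    """Return some extracted text, or a note that OCR is probably needed."""
    text = extract_text(pdf_path)
    if looks_text_based(text):
        return text[:500]
    return "No usable text layer found; OCR is probably required."
```

Calling report() on a file path then gives either the first 500 characters of text or the OCR message.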
Related
I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of windows builds available.
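If you want to drive these utilities from a script, something like the following might do. It is only a sketch: it builds the command lines for a single page using the -f (first page) and -l (last page) options of pdftops and pdfimages, and the helper names are invented for this example.

```python
import subprocess


def pdftops_page_cmd(pdf_path, page, ps_path):
    """Command line that converts a single page to PostScript."""
    return ["pdftops", "-f", str(page), "-l", str(page), pdf_path, ps_path]


def pdfimages_page_cmd(pdf_path, page, image_prefix):
    """Command line that extracts the bitmapped images of a single page."""
    return ["pdfimages", "-f", str(page), "-l", str(page), pdf_path, image_prefix]


def run(cmd):
    """Execute one of the commands above; needs the poppler utilities on PATH."""
    subprocess.run(cmd, check=True)
```

Keeping the command construction separate from run() makes it easy to print the command first and check it by hand.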
The open source tool from iText called iText RUPS does what you want: it shows you all the PDF commands for a particular PDF and lets you visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/
I have to extract text from PDF files of invoices and bills.
The file layouts can get complex, though they mostly consist of tables.
I've already read a few dozen articles about the PDF format, about how easy it is for our brains to grasp and how hard it is for a machine to understand its structure.
I also downloaded a few tools, like Python's pdfminer and some Java tools; some even have rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.
Adobe also has an online service called exportPdf, but it can't be customized.
Bottom line, I understand that in order to extract text from structured pdf files and convert it to XML for example, there should be some level of manual work.
I also found A-PDF Form Data Extractor, a non-free tool with the ability to set extraction rules that claims to do the job, though it's hard to find a proper manual and it runs only on Windows.
I thought I might even try to convert those files to images and run tesseract-ocr on them, but decided to ask for advice here before I spend more time on it.
I'd be very grateful if someone with such experience could give me a hint.
I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be draw 'world' at coord 20, then draw 'hello' at coord 10. Most PDF creators don't do this, but still there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.), the harder the text is likely to be to get out. And once a designer starts messing with fonts too much, some programs will actually output words one character at a time, changing the font just slightly each time.
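To make the ordering problem concrete, here is a small sketch of how you might put fragments back into reading order once some library has given you their positions. It assumes each fragment is an (x, y, text) tuple in PDF-style coordinates (y grows upward); the function name and the line_tolerance value are hypothetical, and real documents with columns or rotated text need more than this.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right.

    Fragments whose y differs from the previous fragment's line by at most
    line_tolerance are treated as belonging to the same line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the pieces by their x coordinate.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]
```

With the example from above, reading_order([(20, 700, "world"), (10, 700, "hello")]) gives back a single line with "hello" placed before "world", regardless of the order the fragments were drawn in.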
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last one, OCR, makes me cringe. I just can't imagine that jumping through those hoops would get you a better text representation than parsing the PDF. But then again, PDF is a visual format, so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
It is not clear whether you are looking for a development tool to automate data extraction from bills and invoices, or just for a one-time tool (utility) that can be used by a non-developer.
Anyway here are some specialized tools including engines they use:
Tabula (open-source, especially designed to extract data from tables in PDF. Can export shell scripts for batch processing, runs as the localhost web service, powered by JRuby Tabula engine)
Viet OCR (open-source .NET desktop utility for text extraction from PDF and images, based on the Tesseract OCR engine)
Bytescout PDF Viewer (freeware closed source .NET utility, detects and extracts tables, including scanned invoices, powered by PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.
I've been working on an app to create various document formats for a while now, and I've had limited success.
Ideally, I'd like to dynamically create a fairly simple ODT/PDF/DOC file. I've been focusing my efforts on ODT, because it is editable, and open enough that there are several tools which will convert it to any of the other formats I need.
The problem is that the ODT XML files are NOT simple, and there aren't any good-quality APIs I could find (especially in Python). So far, I've had the most success creating a template ODT file and then manipulating the DOM in Python as needed. This is OK generally, but it is quickly becoming inadequate and requires too much tweaking every single time I need to alter one of the templates.
The requirements are:
1) Produce a simple document that will have lists, paragraphs, and the ability to draw simple graphics on the page (boxes, circles, etc...)
2) The ability to specify page size, and the different formats should generally print the exact same output when sent to a printer
My questions:
1) Are there any other ways I can produce ODT/PDF/DOC files?
2) Would LaTeX be acceptable? I've never really used it, does anyone have experience converting LaTeX files into other formats?
3) Would it be possible to use HTML? There are a lot of converters online. Technically you can specify dimensions in mm/cm, etc..., but I am worried that the printed output will differ between browsers/converters....
Any other ideas?
Have you tried pandoc? I've been using it with good success for converting between different formats. Why reinvent the wheel?
I suppose that to be successful, you'd have to define how you want to input everything. Why don't you just use OpenOffice? It will save to ODT (duh...), PDF, and HTML (though it's not clean HTML, it's actually quite ugly).
In my recent experience, I've had success going from LaTeX -> XHTML via LaTeXML (I had to compile it from source). LaTeX is seeming more and more like a terminal format. It's great for PDF, but once you need some flexibility, it kind of fails. I should also note that there is no LaTeX -> DVI step in my workflow, so I can't comment on things like tex4ht that read from a DVI file (I have too many graphics that don't work with DVI to switch them now).
Shortly I'll be moving everything into DocBook 4.5. I like the docbook-utils package, which supports LaTeX and HTML, and I even saw a converter to ODT. DocBook is super-heavy on the markup, which is annoying, but it will provide me with the flexibility I need going forward.
Since you're using Python, have you considered just using reStructuredText?
I've also really enjoyed publishing from emacs' orgmode, which is a super lightweight markup that exports to a bunch of different formats.
Like the title says. Reason I ask is that we're converting PDFs to formatted ASCII text (using pdftotext) and only want to display the ones that look reasonably sane.
PPT files tend to have text over images, diagonal text and other things that don't translate to ASCII very well, so we'd like to filter them out if we can.
The creating application of a PDF is listed in its XMP metadata. You can see this quite easily in Acrobat 9 (and I believe earlier): go to File > Properties, click Additional Metadata..., then go to Advanced and it's listed under both XMP Core Properties and PDF Properties:
xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint
I'm guessing you want to find this programmatically, so you'll need to find a library to read this metadata that works with your language. Here is a list of some XMP tools.
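If you just want a quick first pass without a full XMP library, a crude sketch in Python is to scan the raw bytes for the xmp:CreatorTool value or the /Creator entry of the document info dictionary. This assumes the metadata is stored uncompressed and unencrypted (often true for the XMP packet, not guaranteed), and it does not handle UTF-16 strings or escaped parentheses; the function names are made up for this example.

```python
import re

# XMP can carry the value as an element or as an attribute.
XMP_ELEMENT = re.compile(rb"<xmp:CreatorTool>(.*?)</xmp:CreatorTool>", re.S)
XMP_ATTRIBUTE = re.compile(rb'xmp:CreatorTool="(.*?)"', re.S)
# Simple literal-string form of the info dictionary entry.
INFO_CREATOR = re.compile(rb"/Creator\s*\((.*?)\)", re.S)


def find_creator(pdf_bytes):
    """Return the first creator string found in the raw bytes, or None."""
    for pattern in (XMP_ELEMENT, XMP_ATTRIBUTE, INFO_CREATOR):
        match = pattern.search(pdf_bytes)
        if match:
            return match.group(1).decode("utf-8", errors="replace").strip()
    return None


def made_by_powerpoint(pdf_bytes):
    """True if a creator string was found and it mentions PowerPoint."""
    creator = find_creator(pdf_bytes)
    return creator is not None and "powerpoint" in creator.lower()
```

You would feed it the file contents read in binary mode. A None result only means the scan found nothing, not that the file has no metadata.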
Short answer:
No, I don't think so.
Long answer:
No, I don't think so, because there are many ways to convert a PowerPoint file to PDF, for example Adobe Acrobat, PDFCreator and many, many others. It's up to the converter to embed specific information in the PDF file; even if you find a way to detect PowerPoint-sourced PDFs from one converter, the same method may not work for another.
Even longer answer:
No, I don't think so, for the reasons described in the "long answer". And I don't think detecting the source of the PDF is the best approach to the problem you are trying to solve. PowerPoint is not the only program that produces overlapping text and images. I think it's much better to detect the actual layout of the PDF file. If images and text overlap, you can then do some filtering or pre-processing to cater for that.
Your reasoning is very arbitrary - there are surely plenty of PPT files without the features you describe, and plenty of PDF files with them, that were generated from another source.
In theory a better method would just be to detect when these "unwanted" situations occur. However, even though the PDF format is partly open (only for reading, apparently, so it's not truly an open format), extracting complex data like that would be incredibly difficult.
All PDFs can have this problem regardless of their source. Most desktop publishing suites are capable of outputting PDF and are often sold boasting their high quality and flashier PDF presentations ...
A "saner" method would be to use a PDF parser such as iTextSharp or pdfNet. Using the library of your choice, find all image rectangles and all text rectangles, SORT THE RECTANGLES, and then see if there is substantial overlap between text and image rects, ignoring image-to-image overlaps. If so, reject the page and/or document.
That won't be perfect, but at least it's going to catch many PDFs that aren't sane, regardless of source. Other heuristics to add would include color analysis. (i.e. are the colors in the overlapping region sufficiently different to allow "sane" results?)
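As a rough illustration of the overlap test, here is a Python sketch that works on plain (x0, y0, x1, y1) tuples, however your PDF library hands you the rectangles. Note that it compares every text rect against every image rect instead of sorting first, which should be fine for the handful of rectangles on a typical page. The function names and the max_ratio threshold are assumptions for this example.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) rectangles."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def text_over_image_ratio(text_rects, image_rects):
    """Fraction of the total text area that sits on top of an image."""
    total = covered = 0.0
    for t in text_rects:
        area = (t[2] - t[0]) * (t[3] - t[1])
        total += area
        # Cap at the text area so overlapping images are not counted twice over.
        covered += min(area, sum(overlap_area(t, i) for i in image_rects))
    return covered / total if total else 0.0


def page_is_sane(text_rects, image_rects, max_ratio=0.25):
    """Accept the page unless too much of its text overlaps images."""
    return text_over_image_ratio(text_rects, image_rects) <= max_ratio
```

The threshold is something you would tune against a sample of your own documents.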
Best of luck to you
It might put its name in the creator or producer info, but I don't have a copy to check this theory with.
In general, it is not an easy task to programmatically determine (reliably) where a file came from or how it was generated based on its contents. After all, a file is just a collection of bits.
Unless you have a lot of resources to expend building the heuristics to determine whether a file looks "reasonably sane" according to your needs, I would consider this a task for human beings.
Some PPT-to-PDF converters preserve the creator in comments at the beginning of the PDF.
I think that PDFs generated from most applications look much the same. They may have some metadata that you can read from the file...
I want to be able to generate a highly graphical (with lots of text content as well) PDF file from data that I might have in a database or xml or any other structured form.
Currently our graphic designer creates these PDF files in Photoshop manually after getting the content as a MS Word Document. But usually, there are more than 20 revisions of the content; small changes here and there, spelling corrections, etc.
The 2 disadvantages are:
1) The graphic designer's time is unnecessarily occupied. The first version is the only one he/she should have to work on.
2) The PDF file becomes the document which now has the final revised content, and the initial content is out of sync with it. So if the initial content needs to be somewhere else (like on a website), we need to recreate it from the PDF file.
Generating the PDF file will help me solve both these problems. Perhaps some way in which the graphic designer creates a "Template" and then puts in tags/holders and maps these tags/holders to the relevant data.
Thanks :-)
There are some tools out there for doing this. XSL-FO is useful. Here is a tutorial for creating a pdf from xml (or xhtml) with cocoon. Also see Apache FOP.
You could format your SQL data as XML and still use the same templates this way.
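For the SQL-to-XML step, a minimal sketch in Python could look like this. It uses sqlite3 only because it ships with Python; the tag names and the function name are made up, and it assumes the column names are valid XML element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(connection, query, root_tag="rows", row_tag="row"):
    """Run a query and return its result set as an XML string."""
    cursor = connection.execute(query)
    columns = [col[0] for col in cursor.description]
    root = ET.Element(root_tag)
    for row in cursor:
        element = ET.SubElement(root, row_tag)
        # One child element per column, named after the column.
        for name, value in zip(columns, row):
            ET.SubElement(element, name).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")
```

The resulting XML can then be fed to the same XSL-FO stylesheet as any other XML source.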
I use the ReportLab python library for this. It could perhaps solve your problem, but you will need to do some work...
In the past I have written scripts that spit out LaTeX then used texi2pdf to solve this kind of problem.
Take a look at iReport and JasperReports at http://jasperforge.org.
iReport lets you design reports, and then you can either fill them programmatically with the JasperReports library (Java), or just use iReport to create the report manually.
I have only used it for tabular data, but I don't think there would be any problem for other types of documents.
You could create a form and populate the entries programmatically using a pdf library like iText (Java).
You could look at doing the workflow in PostScript which is plain text that you can easily compose from fragments. Then you can use any free tool to convert to PDF.
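To show what composing from fragments might look like, here is a small Python sketch that builds a one-page PostScript document from (x, y, text) tuples. The function names are invented, the font and size are just defaults for the example, and it only handles plain text placement.

```python
def ps_escape(text):
    """Escape the characters that are special inside a PostScript string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_fragment(x, y, text):
    """One fragment: move to (x, y) in points and draw the text there."""
    return f"{x} {y} moveto ({ps_escape(text)}) show\n"


def build_page(lines, font="Helvetica", size=12):
    """Compose a one-page PostScript document from (x, y, text) tuples."""
    header = f"%!PS\n/{font} findfont {size} scalefont setfont\n"
    body = "".join(text_fragment(x, y, t) for x, y, t in lines)
    return header + body + "showpage\n"
```

Write the returned string to a .ps file and convert it with whichever tool you prefer, for example Ghostscript's ps2pdf.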
Take a look at Prince XML. This tool lets you generate PDF from XML or HTML and CSS.
A possible way is to use a template engine, like FreeMarker or StringTemplate: these are often used to generate HTML, but they are flexible enough to output any format, actually.
The problem is to make a PDF template, I suppose. Perhaps you can take a sample output and edit it to replace data with placeholders to be filled by the template engine. Might not be trivial!
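The placeholder idea itself is easy to try out. The sketch below uses Python's built-in string.Template rather than FreeMarker or StringTemplate, purely to illustrate the principle; the template text and field names are hypothetical, and the output here is HTML that would still need a separate HTML-to-PDF step.

```python
from string import Template

# Hypothetical template text; in practice the designer would maintain this file.
TEMPLATE = Template(
    "<h1>$title</h1>\n"
    "<p>$body</p>\n"
    '<p class="footer">Revision $revision</p>\n'
)


def render(record):
    """Fill the placeholders from a dict; raises KeyError if a field is missing."""
    return TEMPLATE.substitute(record)
```

Because substitute() fails loudly on a missing field, a revision that forgets a value is caught before anything gets rendered.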
Sounds like a job that SQL Server Reporting Services can handle quite easily.
Reporting Services allows you to query the data, define the layout, and export to PDF without any intervention. The PDF output can be distributed via email, stored on a file share, and accessed via a page on the report server.
It can handle XML data sources too.
Another approach to generating a PDF file from data is to use prawn, which is based on ruby. I was very pleasantly surprised by how much functionality is included in prawn. It may take some investment up front but this approach will give you a lot of flexibility.
You can combine CSStoXSLFO with XEP from RenderX for high quality output. With this solution you can merge XML data into an XHTML template, which is decorated with CSS. It can also generate charts with the fantastic JFreeChart library. CSS3 page media features are supported.