ReadTheDocs generates PDFs without my HTML tables - pdf

We are converting a sizeable document for hosting on ReadTheDocs. We weren't happy with the simple presentation enabled by Markdown table syntax, so we coded our tables as HTML. Very nice in the HTML viewer (e.g., the end of http://manual.cytoscape.org/en/latest/Command_Line_Arguments.html).
In the PDF version generated by ReadTheDocs, each of our tables is completely missing (see page 9 on https://media.readthedocs.org/pdf/cytoscape-working-copy/latest/cytoscape-working-copy.pdf).
Have we made a mistake by coding tables as HTML? Could we have taken a different route and gotten nice tables in both HTML and PDF?
Any advice would be helpful ...
Thanks!

I have not used ReadTheDocs myself, but from reading their Getting Started guide, I assume you are using Sphinx? While Markdown supports embedding raw HTML, Sphinx does not support converting it to other formats.
You should consider moving to reStructuredText (Sphinx's native markup format), as it is much more advanced than Markdown. It can even be extended with custom directives and roles, should you need this. But be sure to first check whether reStructuredText tables offer the flexibility you require. Pandoc can convert your Markdown files to reStructuredText.
I see you are using a table to document command line options. reStructuredText supports documenting command line options using option lists. In theory, you could change how option lists are represented in the output document, but this might not be easy to accomplish, especially for PDF output using LaTeX (shameless plug: using rinohtype for PDF output should make this much easier in the future).

Related

Understanding the PDF DOM

I am writing an application that has to read and interpret data stored in some PDF files. The reading part is done but I am only able to get a dump of all the words on a page and not the format of the words. What I mean is that if I have to extract a table, I am getting the numbers in the table but not the markup which defines the table.
Further, there is some formatting used which displays a few of these numbers within parentheses (meaning that those numbers are negative) but the parentheses themselves are not part of the text. Hence, I am not able to distinguish between positive and negative numbers present in the PDF table!
How do you get the PDF markup along with the text? Is a PDF similar in structure to an XML with tags used to markup tables etc.? If not, then, is there a resource which describes the salient features of the PDF DOM?
I am using VBA and the Acrobat library (AcroExch etc.)
There is no such thing as "PDF markup" in the sense of HTML etc. A table in PDF cannot be distinguished from line art, other than by using OCR, which can be error-prone if the layout is complex. It is simply drawn using geometrical shapes, like in a vector-based graphics program.
"Is a PDF similar in structure to an XML with tags used to markup tables etc.?"
No, not at all.
And there is no such thing as a 'DOM' either. Google for a file named *PDF32000_2008.pdf*. The current PDF specification for v1.7 (ISO spec) is that file. You should be able to locate it on the Adobe website.
As omz stated, text inside PDF does not really have a structure. You can take a look on the specification here. However, for some very specific files, there is something called PDF Tags, or PDF Marked Content, which is fairly new, and it aims to give PDF documents some kind of structure. If you target this kind of files specifically, you might be able to achieve something. Take a look on chapter 10 (Document Interchange) of the Adobe's specification for further details.
Maybe what you want to achieve can be done with less effort and faster by using TET, the Text Extraction Toolkit made by the fine folks from pdflib.com ( http://www.pdflib.com/products/tet/ ) ??
AFAIR, the TET has some (limited) support for table detection as well....

Creating ODT and PDF files as end result

I've been working on an app to create various document formats for a while now, and I've had limited success.
Ideally, I'd like to dynamically create a fairly simple ODT/PDF/DOC file. I've been focusing my efforts on ODT, because it is editable, and open enough that there are several tools which will convert it to any of the other formats I need.
The problem is that the ODT XML files are NOT simple, and there aren't any good-quality API's I could find (especially in python). So far, I've had the most success creating a template ODT file, and then manipulating the DOM in python as needed. This is ok generally, but is quickly becoming inadequate and requires too much tweaking every single time I need to alter one of the templates.
The requirements are:
1) Produce a simple document that will have lists, paragraphs, and the ability to draw simple graphics on the page (boxes, circles, etc...)
2) The ability to specify page size, and the different formats should generally print the exact same output when sent to a printer
My questions:
1) Are there any other ways I can produce ODT/PDF/DOC files?
2) Would LaTeX be acceptable? I've never really used it, does anyone have experience converting LaTeX files into other formats?
3) Would it be possible to use HTML? There are a lot of converters online. Technically you can specify dimensions in mm/cm, etc..., but I am worried that the printed output will differ between browsers/converters....
Any other ideas?
have you tried pandoc? i've been using it with good success for the conversion of different formats into each other. why try to invent the wheel twice?
I suppose to be successful, you'd have to define how you want to input everything. Why don't you just use openoffice? it will save to ODT (duh...), PDF, and HTML (though it's not clean HTML, it's actually quite ugly).
In my recent experience, I've had success going from latex -> xhtml via LaTeXML (i had to compile from source). LaTeX is seeming more and more like a terminal format. It's great for PDF, but once you need some flexibility, it kind of fails. I should also note that there is no latex -> dvi in my workflow, so I can't comment on things like tex4ht that reads out of a dvi file (I have too many graphics that don't work with DVI to switch them now).
Shortly I'll be moving everything into docbook 4.5-- i like the docbook-utils package which supports latex, html, and i even saw a converter to ODT. But docbook is super-heavy on the markup, which is annoying, but it will provide me with the flexibility i need going forward.
Since you're using python, have you just considered using ReStructured Text?
I've also really enjoyed publishing from emacs' orgmode, which is a super light weight markup that goes into a bunch of different formats.

Generating PDF documents from LISP

I want to generate a technical report from lisp (AllegroCL in my case) and I studied various packages/project to help me do this.
Requirements:
Need to generate a PDF
May create an intermediate format like RTF, Restructured TEXT, HTML, Word DOC or Latex
Need to be flexible to be able to add content throughout my application
Need to handle Multi-Page, Headers, Footers, Tables, inclusion of Images.
Possibilities:
cl-pdf and cl-typesetting: I checked this one out and it works for now, but is there a better alternative?
Some Latex generator, but ???
Question:
Do you know alternatives to easily generate (PDF) reports from lisp. What is the best workflow to go for?
we are using cl-pdf and cl-typesetting for the last 3 years and it has numerous issues... (like its confusion around encodings, or silently not rendering things that don't fit, or...) so, i don't recommend new development based on them.
currently we are in the process of moving all our export mechanisms to open document format. openoffice is all happy with it, and there's a plugin for ms office, too.
there's .fodt, the so called flat open document text format, which is a mere xml file describing a document. generating it is as easy as generating xml files.
you can also make parts of your document read-only with a password (insert a section and mark it read-only and protected by a password. when generating the xml, you can generate random hashes as password...).

How to extract data from a PDF file while keeping track of its structure?

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few of different things, but I did not get very far in any of them:
Convert PDF to text. It does not work for me as I lose images and the structure of the document.
Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation wise, but I haven't been able to successfully parse the HTML.
Convert PDF to XML. Same as above.
Anyone has any suggestions on how to tackle this problem?
There is essentially not an easy cut-and-paste solution because PDF isn't really very interested in structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the majority of control over the process), you'll have to loop over all text on pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc...).
On top of that you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept for a "word", let alone "lines" or "paragraphs".
To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what us humans would consider to be proper reading order).
PDF parsing for headers and its sub contents are really very difficult (It doesn't mean its impossible ) as PDF comes in various formats. But I recently encountered with tool named GROBID which can helps in this scenario. I know it's not perfect but if we provide proper training it can accomplish our goals.
Grobid available as a opensource on github.
https://github.com/kermitt2/grobid
You may do use the following approach like this with iTextSharp or other open source libraries:
Read PDF file with with iTextSharp or similar open source tools and collect all text objects into an array (or convert PDF to HTML using the tool like pdftohtml and then parse HTML)
Sort all text objects by coordinates so you will have them all together
Then iterate through objects and check the distance between them to see if 2 or more objects can be merged into one paragraph or not
Or you may use the commercial tool like ByteScout PDF Extractor SDK that is capable of doing exactly this:
extract text and images along with analyzing the layout of the text
XML or CSV where text objects are merged or splitted into paragraphs inside a virtual layout grid
access objects via special API that makes it possible to address each object via its "virtual" row and column index disregarding how it is stored inside the original PDF.
Disclaimer: I am affiliated with ByteScout
PDF files can be parsed with tabula-py, or tabula-java.
I made a full tutorial on how to use tabula-py on this article. You can tabula in a web-browser too as long as you have installed Java.
Unless its is Marked Content, PDF does not have a structure.... You have to 'guess' it which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
As mentioned in the answers above, PDF's aren't very easy to parse. However, if you have certain additional information regarding the text that you want to parse, you can pull it off.
If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.
If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
PDFBox is a PDF parsing tool that you can use for extracting text and images on top of which you can define your custom rules for parsing.
However, for parsing PDFs you need to have some prior knowledge of the general format of the PDF file. You can check out the following blogpost Document parsing for more information regarding document parsing.
Disclaimer:I was involved in writing the blogpost.
iText api:
PdfReader pr=new PdfReader("C:\test.pdf");
References:
PDFReader

Generate PDF from structured data

I want to be able to generate a highly graphical (with lots of text content as well) PDF file from data that I might have in a database or xml or any other structured form.
Currently our graphic designer creates these PDF files in Photoshop manually after getting the content as a MS Word Document. But usually, there are more than 20 revisions of the content; small changes here and there, spelling corrections, etc.
The 2 disadvantages are:
1) The graphic designer's time is unnecessarily occupied. The first version is the only one he/she should have to work on.
2) The PDF file becomes the document which now has the final revised content, and the initial content is out of sync with it. So if the initial content needs to be somewhere else (like on a website), we need to recreate it from the PDF file.
Generating the PDF file will help me solve both these problems. Perhaps some way in which the graphic designer creates a "Template" and then puts in tags/holders and maps these tags/holders to the relevant data.
Thanks :-)
There are some tools out there for doing this. XSL-FO is useful. Here is a tutorial for creating a pdf from xml (or xhtml) with cocoon. Also see Apache FOP.
You could format your SQL data as XML and still use the same templates this way.
I use the ReportLab python library for this. It could perhaps solve your problem, but you will need to do some work...
In the past I have written scripts that spit out LaTeX then used texi2pdf to solve this kind of problem.
Take a look at iReport and JasperReports at http://jasperforge.org.
iReport lets you design reports, and then you can either programatically fill it with the JasperReports library (Java), or just use iReport to manually create the report.
I have only used it for tabular data, but I don't think there would be any problem for other types of documents.
You could create a form and populate the entries programmatically using a pdf library like iText (Java).
You could look at doing the workflow in PostScript which is plain text that you can easily compose from fragments. Then you can use any free tool to convert to PDF.
Take a look at Prince XML. This tool allows to generate PDF based on XML or HTML and CSS.
A possible way is to use a template engine, like FreeMarker or StringTemplate: these are often used to generate HTML, but they are flexible enough to output any format, actually.
The problem is to make a PDF template, I suppose. Perhaps you can take a sample output and edit it to replace data with placeholders to be filled by the template engine. Might not be trivial!
Sounds like a job that SQL Server Reporting Services can handle quite easily.
Reporting Services allows you to query the data, define the layout, and export to PDF without any intervention. The PDF output can be distributed via email, stored on a file share, and accessed via a page on the report server.
It can handle XML data sources too.
Another approach to generating a PDF file from data is to use prawn, which is based on ruby. I was very pleasantly surprised by how much functionality is included in prawn. It may take some investment up front but this approach will give you a lot of flexibility.
You can combine CSStoXSLFO with XEP from RenderX for high quality output. With this solution you can merge XML data into an XHTML template, which is decorated with CSS. It can also generate charts with the fantastic JFreeChart library. CSS3 page media features are supported.