Headless LibreOffice or OpenOffice as a PDF report generator?

I hope it’s OK to post a completely naive question here for LO or OO experts.
I’m looking for advice on whether scripting LibreOffice or OpenOffice would be suitable for the following:
General Question
I’m looking to generate PDF reports, based on a combination of a “template” and a set of data (currently in JSON format) and inserted images.
This would run as a headless service on Linux, invoked from a web server whenever a user requests a PDF report.
We frequently need to modify, customise, or create new templates, hence the reluctance to go down the route of something like ReportLab (plus I don't know ReportLab at all, so I would face a huge learning curve that way).
Background
This is in contrast to using a PDF library like ReportLab directly within the web server and having to build up the template/report programmatically.
As LibreOffice/OpenOffice is obviously a lot faster for producing good-looking report "templates", this is a question about doing both the template design and the final template + data -> PDF generation directly within LibreOffice.
Some more specifics
The data values would mostly be substituted into fields in the template, with little or no processing of the values required.
However, in some cases part of the data comes in “sets” that would be shown in a table-like view, and the number of fields (and so the number of table rows, for instance) would need to vary per report, depending on how many values that particular JSON data contains.
Additionally, I’d need to be able to include (“import”) images in the report. Some of the JSON data would be paths to image files, and I’d like to include those. Again, the number of images may vary between reports.
This wouldn't be high frequency at all, so LO/OO would not need to run as a service; it could simply be invoked when required with a system call. Conceptually, something like "LibreOffice --template 'make_fancy.report' <data.json> <output_file.pdf>".
If this approach is reasonable in either LO or OO, which language is best to script in? (Hopefully Python 3.)
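For what it's worth, the "invoke when required with a system call" part can be sketched in Python 3. This covers only the final document -> PDF step, using LibreOffice's --headless, --convert-to and --outdir options; merging the JSON data and images into the template is not shown. The function names and the file name filled_report.odt are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(document: Path, out_dir: Path, binary: str = "soffice") -> list:
    """Return the argv for a one-shot headless conversion of `document` to PDF."""
    return [
        binary,
        "--headless",            # no GUI, suitable for a server
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(document),
    ]


def convert_to_pdf(document: Path, out_dir: Path, binary: str = "soffice") -> Path:
    """Run the conversion and return the path where the PDF is expected."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    subprocess.run(build_convert_command(document, out_dir, binary),
                   check=True, timeout=120)
    # LibreOffice names the output after the input file's stem.
    return out_dir / (document.stem + ".pdf")


# Hypothetical usage: filled_report.odt is a template already merged with data.
print(build_convert_command(Path("filled_report.odt"), Path("out")))
```

Starting a fresh process per report is slow compared with a long-running service, but at low request rates that is usually acceptable.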

Related

How to create your own package for interaction with word, pdf etc

I know that there are a lot of packages around which allow you to create or read e.g. PDF, Word and other files.
What I'm interested in (and never learned at the university) is how you create such a package? Are you always relying on source code being given by the original company (such as Adobe or Microsoft), or is there another clever way of working around it? Should I analyze the individual bytes I see in e.g. PDF files?
It varies.
Some companies provide an SDK ("Software Development Kit") for their own data format; others provide only a specification (e.g., Adobe for PDF, Microsoft for Word), and it's up to the software developer to write a correct implementation.
Since that can be a lot of work – the PDF specification, for example, runs to over 700 pages and doesn't go deep into practically required material such as LZW, JPEG/JPEG2000, color theory, and math transformations – and you need a huge set of data to test against, it's way easier to use the work that others have done on it.
If you are interested in writing a support library for a certain file format which
is not legally protected,
has no, or only sparse (official) documentation,
and is not already under deconstruction elsewhere [a],
then yes: you need to
gather as many different files as possible;
from as many different sources as possible;
(ideally, you should have at least one program that can both read and create the files)
inspect them on the byte level;
create a 'reader' which works on all of the test files;
if possible, interesting, and/or required, create a 'writer' that can create a new file in that format from scratch or can convert data in another format to this one.
There is 'cleverness' involved, mainly in #3, as you need to be very well versed in how data representation works in general. You should be able to tell code from data, and string data from floating point, and UTF8 encoded strings from MacRoman-encoded strings (and so on).
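As a small illustration of telling string data from floating point, here is a sketch in Python that decodes the same four bytes several ways, so that the plausible reading stands out. The helper name interpretations is made up; real files also need other widths and alignments checked.

```python
import struct


def interpretations(chunk: bytes) -> dict:
    """Decode the same 4 bytes in several ways for side-by-side comparison."""
    if len(chunk) != 4:
        raise ValueError("expected exactly 4 bytes")
    return {
        "hex": chunk.hex(" "),
        "uint32_le": struct.unpack("<I", chunk)[0],   # little-endian integer
        "uint32_be": struct.unpack(">I", chunk)[0],   # big-endian integer
        "float32_le": struct.unpack("<f", chunk)[0],  # little-endian float
        "latin1": chunk.decode("latin-1"),            # raw bytes as characters
    }


# The bytes of the float 1.0 look like a huge integer or garbage text,
# while a four-letter tag such as b"RIFF" looks like a nonsense float.
for sample in (struct.pack("<f", 1.0), b"RIFF"):
    for name, value in interpretations(sample).items():
        print(f"{name:>10}: {value!r}")
    print()
```

Running this over every offset of a test file, and watching which column yields "round" values, is often the quickest way to guess a field's type.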
I've done this a couple of times, primarily to inspect the data of various games, mainly because it's huge fun! (Fair warning: it can also be incredibly frustrating.) See the Reverse Engineering question "Reverse engineering file containing sprites" for an example approach; notably, at the bottom of my answer in there I admit defeat and start using the phrases "possibly" and "may" and "probably", which is an indication I did not get any further on that.
[a] Not necessarily, of course. You can cooperate with others whose expertise lies elsewhere, or even do "grunt work" for existing projects – finding out and codifying fairly trivial subcases.
There are also advantages of working independently on existing projects. For example, with the experience of my own PDF reader (written from scratch), I was able to point out a bug in PDFBox.

PDFBOX - Unknown number of pages

I am investigating a replacement for iText and have been looking at the API and example code for PDFBox. I am slightly confused by its usage, though: it seems I need to create the page objects manually, which implies I need to know the number of pages beforehand, or at least work out when it's time to create a new page.
I generally use PDF generation for reports based on user configurable parameters which call stored procedures which can return varying amounts of data.
My question is quite simple: is it down to me to work out how much data will fit onto a page and create the pages programmatically?
The API seems to state that each page object represents a single page. With iText, in my experience, I do not need to worry about this: I simply write my data to the document and the pages are created for me based on the content I place into it.
I recently made the switch from iText to PDFBox and ran into a similar issue. I asked this question and eventually worked out what I needed to do to generate reports with an unknown number of pages.
This model works well for generating reports containing lines of data generated from a ResultSet...though that's the only way I've been using it thus far. I may run into limitations, but for now, it's getting the job done.
And I guess I should state that I am still laying out each page manually, but this method does at least generate my pages dynamically depending on the number of results that return.
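The answer's own code is not reproduced here, so the following is only a sketch of the general idea: work out how many rows fit on a page, then start a new page each time that limit is reached. It is written in Python rather than Java and does not use the PDFBox API. The function name paginate, the margin and the line height are assumptions for illustration; 792 points is the height of a US Letter page.

```python
from typing import Iterator, List, Sequence


def paginate(rows: Sequence[str], page_height: float, margin: float,
             line_height: float) -> Iterator[List[str]]:
    """Yield successive pages, each a list of the rows that fit on it."""
    usable = page_height - 2 * margin          # space between top and bottom margins
    rows_per_page = int(usable // line_height)
    if rows_per_page < 1:
        raise ValueError("page too small for even one row")
    for start in range(0, len(rows), rows_per_page):
        yield list(rows[start:start + rows_per_page])


# 130 result rows on a Letter-height page with 36 pt margins and 12 pt lines.
rows = [f"row {i}" for i in range(1, 131)]
pages = list(paginate(rows, page_height=792, margin=36, line_height=12))
for number, page in enumerate(pages, start=1):
    print(f"page {number}: {len(page)} rows ({page[0]} .. {page[-1]})")
```

In PDFBox terms, each yielded list would correspond to creating one new page object and drawing those rows onto it.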

Populating PDF fields from a database

I have a PDF file (not created by me - I have no control over the design etc.) which allows users to fill in some form fields in Adobe Reader and save the result. I want to automate the process of populating the fields, using the following steps:
Fetch data from database.
Open PDF template.
Populate form fields with data.
Save modified file to a separate location on disk.
Lock modified file so that the form fields can no longer be edited.
Send file to user.
I'm happy to use PHP, Perl, Python or Java to do steps 2-5 (in descending order of preference), but whatever I use has to work under Linux (i.e. it mustn't rely on libraries which are only available on Windows for example).
The end result should be a PDF which the average user can open and print, but not modify (I'm sure advanced users could find a way to do so, but I accept that I can't guarantee complete security against modification). I don't want to change the structure of the PDF, merely populate the form fields.
Is there a standard piece of software for doing this? I've seen mentions of FDF Toolkit, but I'm not entirely sure if that's what I want and whether it will allow me to lock the file afterwards, and whether what I want to do fits in with the EULA.
Edit: Final answer is to use iText (as suggested by Mark Storer) but to implement it as a web service which allows you to pass in an array of form field names and values and the PDF file 'template'. The web service will be open source (and available on GitHub once I've written it), as per the AGPL, but anything connecting to it won't have to be.
Filling
Any number of different libraries can fill in field values. I'm partial to iText (Java) or iTextSharp (C#). (I wrote one in Java a number of years ago. It's not that hard.) There are lots. Search SO, you'll find 'em.
Locking
There are a couple different levels of "lock the fields".
Each field has a "read only" flag. This is pretty much a courtesy as far as other libraries capable of setting field values are concerned. In fact, it's generally considered to mean "the UI cannot make changes". Form script can, regardless.
Form flattening: Drawing the fields directly into the page and removing all the interactivity.
Each one has pros and cons.
Flag: None too secure. Form data still easily accessible. Scrolling fields still scroll.
Flattening: Pretty much the exact opposite. It's harder to modify (though far from impossible). The form data can only be extracted via text extraction (which is hard, but becoming increasingly common). List & text fields that contain more stuff than is visible will no longer scroll.
The ability to flatten forms is relatively rare. Again, iText can do it (as can iTextSharp), but I'm not aware of any other third party libraries that can... I'm sure they exist, I just can't name them off the top of my head.
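To show why the read-only flag is only a courtesy: in the PDF specification a field's flags live in a single integer (the /Ff entry), and ReadOnly is its lowest bit. Any library that can write that integer can also clear it. The helper names below are made up for illustration and are not part of iText or any other library.

```python
READ_ONLY = 1 << 0  # bit position 1 of the /Ff field-flags integer


def set_read_only(ff: int) -> int:
    """Return the flags with the ReadOnly bit switched on."""
    return ff | READ_ONLY


def clear_read_only(ff: int) -> int:
    """Return the flags with the ReadOnly bit switched off."""
    return ff & ~READ_ONLY


def is_read_only(ff: int) -> bool:
    """Report whether the ReadOnly bit is set."""
    return bool(ff & READ_ONLY)


flags = 0                       # a field with no flags set
locked = set_read_only(flags)
print(is_read_only(locked))     # the viewer's UI should now refuse edits
unlocked = clear_read_only(locked)
print(is_read_only(unlocked))   # ...until someone simply clears the bit again
```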

What data generators?

I'm about to release a FOSS data generator that can generate random yet meaningful data in CSV format. Rather belatedly, I guess, I need to poll the state of the art for such products - because if there is a well known and useful existing tool, I can write my work off to experience. I am aware of a couple of SQL Server specific tools, but mine is not database specific.
So, links? And if you have used such a product, what features did you find it was missing?
Edit: To add a bit more info on my tool (Ooh, Matron!) it is intended to allow generation of any kind of random data from existing data files, and supports weighting. It is XML based (sorry, folks) and lets you say things like:
<pick distribute="20,80" >
<datafile file="femalenames.dat"/>
<datafile file="malenames.dat"/>
</pick>
to select female names about 20% of the time and male names 80% of the time.
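For readers unfamiliar with weighted picking, the behaviour of that snippet can be sketched in a few lines of Python. This is not how the tool itself is implemented; the function name pick and the short name lists standing in for the .dat files are made up for illustration.

```python
import random


def pick(sources, weights, rng):
    """Choose one source by weight, then one value uniformly from that source."""
    chosen = rng.choices(sources, weights=weights, k=1)[0]
    return rng.choice(chosen)


female_names = ["Alice", "Beth", "Carla"]   # stand-in for femalenames.dat
male_names = ["Dan", "Ed", "Frank"]         # stand-in for malenames.dat

rng = random.Random(42)                     # seeded so runs are repeatable
sample = [pick([female_names, male_names], [20, 80], rng) for _ in range(10_000)]
female_share = sum(name in female_names for name in sample) / len(sample)
print(f"female share: {female_share:.2f}")  # should land close to 0.20
```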
But the purpose of this question is not to describe my product but to get info on other tools.
Latest: If anyone is interested, they can get the alpha of my data generator at http://code.google.com/p/csvtest
That can be a one-liner in R where I use the littler scripting front-end:
# generate the data as a one-liner from the command-line
# we set the RNG seed, and draw from a bunch of distributions
# indented just to fit the box here
edd@ron:~$ r -e'set.seed(42); write.csv(data.frame(y=runif(10), x1=rnorm(10),
x2=rt(10,4), x3=rpois(10, 0.4)), file="/tmp/neil.csv",
quote=FALSE, row.names=FALSE)'
edd@ron:~$ cat /tmp/neil.csv
y,x1,x2,x3
0.914806043496355,-0.106124516091484,0.830735621223563,0
0.937075413297862,1.51152199743894,1.6707628713402,0
0.286139534786344,-0.0946590384130976,-0.282485683052060,0
0.830447626067325,2.01842371387704,0.714442314565005,0
0.641745518893003,-0.062714099052421,-1.08008578470128,0
0.519095949130133,1.30486965422349,2.28674786332467,0
0.736588314641267,2.28664539270111,-0.73270267483628,1
0.134666597237810,-1.38886070111234,-1.45317770550920,1
0.656992290401831,-0.278788766817371,-1.01676025893376,1
0.70506478403695,-0.133321336393658,0.404860813371462,0
edd@ron:~$
You have not said anything about your data-generating process, but rest assured that R can probably cope with just about any requirement, including multivariate normal, t, skew-t, and more. The (six different) random-number generators in R are also of very high quality.
R can also write to DBs, or read parameters from them, and if it needs to be on Windoze then the Rscript front-end could be used instead of littler.
I asked a similar question some months ago:
Tools for Generating Mock Data?
I got some sincere suggestions, but most were not suitable for my needs. Either expensive (non-free) software, or else not flexible enough w.r.t. data types and database structure, or range of mock data, or way too slow (e.g. the Rails ActiveRecord solution).
Features I was looking for were:
Generate mock data to fill existing database tables
Quick to generate > 1 million rows
Produce either SQL script format or flat file suitable for importing
Scriptable command-line interface, not a GUI
Not dependent on Microsoft Windows environment
Nice-to-have features:
Extensible/configurable
Open-source, free license
Written in a dynamic language like Perl/PHP/Python
Point it at a database and let it "discover" the metadata
Integrated with testing tools (e.g. DbUnit)
Option to fill directly into the database as it generates data
The answer I accepted was Databene Benerator. Though since asking the question, I admit I haven't used it very much.
I was surprised that even when asking the community, the range of tools for generating mock data was so thin. This seems like a niche waiting to be filled! I'll be interested to see what you release.

optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned (image-only) plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
Sample Protocol http://sert.homedns.org/img/btp12001.png
I would love to read your answers to the following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents have already been scanned by the parliament o_O (sample: same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go and fetch print copies of the same documents, then cut and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends on what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
I use OmniPage 17 for such things. It also has a batch mode: you put the documents in one folder, it picks them up from there and writes the results to another.
It recognizes the layout automatically, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try the demo first to check that it works correctly. At the moment I have problems with ligatures in some of my documents: words like "fliegen" come out as "fl iegen", so you have to spell-check them.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
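The last-resort splitting step could look roughly like the sketch below. It assumes the page has already been loaded and binarized into a list of pixel rows (1 for ink, 0 for background); in practice an image processing library would do the loading and cropping. It looks for the emptiest pixel column near the middle of the page and cuts there. The function names and the 20% search window are assumptions for illustration.

```python
def find_gutter(image, search_fraction=0.2):
    """Return the x index of the emptiest pixel column near the horizontal centre.

    `image` is a list of rows; each pixel is 1 for ink and 0 for background.
    """
    width = len(image[0])
    centre = width // 2
    half_window = max(1, int(width * search_fraction / 2))
    lo = max(0, centre - half_window)
    hi = min(width, centre + half_window + 1)
    # Count ink pixels in each candidate column (a vertical projection profile).
    ink = {x: sum(row[x] for row in image) for x in range(lo, hi)}
    # Prefer the least ink; break ties by closeness to the centre.
    return min(ink, key=lambda x: (ink[x], abs(x - centre)))


def split_columns(image):
    """Cut the page at the gutter and return (left, right) sub-images."""
    gutter = find_gutter(image)
    left = [row[:gutter] for row in image]
    right = [row[gutter:] for row in image]
    return left, right


# Tiny made-up "page": two blocks of ink separated by an empty strip.
page = [
    [1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
]
left, right = split_columns(page)
print(find_gutter(page), left[0], right[0])
```

If the gutter really is in the same place on every page, the search can be skipped and a fixed x position used instead.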