How to create your own package for interacting with Word, PDF, etc. - pdf

I know that there are a lot of packages around which allow you to create or read e.g. PDF, Word and other files.
What I'm interested in (and never learned at the university) is how you create such a package? Are you always relying on source code being given by the original company (such as Adobe or Microsoft), or is there another clever way of working around it? Should I analyze the individual bytes I see in e.g. PDF files?

It varies.
Some companies provide an SDK ("Software Development Kit") for their own data format; others only provide a specification (e.g., Adobe for PDF, Microsoft for Word), and it's up to the software developer to write a correct implementation.
Since that can be a lot of work – the PDF specification, for example, runs to over 700 pages and doesn't go deep into practically required material such as LZW, JPEG/JPEG2000, color theory, and math transformations – and you need a huge set of data to test against, it's way easier to use the work that others have done on it.
If you are interested in writing a support library for a certain file format which
- is not legally protected,
- has no, or only sparse, (official) documentation,
- and is not already under deconstruction elsewhere (see note a),
then yes: you need to
1. gather as many different files as possible;
2. from as many sources as possible (ideally, you should have at least one program that can both read and create the files);
3. inspect them on the byte level;
4. create a 'reader' which works on all of the test files;
5. if possible, interesting, and/or required, create a 'writer' that can create a new file in that format from scratch or can convert data in another format to this one.
There is 'cleverness' involved, mainly in #3, as you need to be very well versed in how data representation works in general. You should be able to tell code from data, string data from floating point, and UTF-8-encoded strings from MacRoman-encoded strings (and so on).
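To give an idea of what byte-level inspection looks like in practice, here is a minimal Python sketch. The sample bytes and the helper names (hexdump, interpretations, decode_candidates) are made up for illustration; a real format will need many more candidate interpretations than these.

```python
import struct

def hexdump(data, width=16):
    """Classic offset / hex / printable-ASCII view of a byte string."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk).ljust(width * 3 - 1)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part}  {text_part}")
    return "\n".join(lines)

def interpretations(chunk):
    """Decode the same 4 bytes in several ways so implausible readings stand out."""
    return {
        "uint32_le": struct.unpack("<I", chunk)[0],
        "uint32_be": struct.unpack(">I", chunk)[0],
        "float32_le": struct.unpack("<f", chunk)[0],
    }

def decode_candidates(raw, codecs=("utf-8", "mac_roman")):
    """Try each codec; None marks a codec that rejects the bytes."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results

# Made-up sample: a 4-byte tag, a little-endian count, a little-endian float.
sample = b"SPRT" + struct.pack("<I", 3) + struct.pack("<f", 1.5)
print(hexdump(sample))
print(interpretations(sample[4:8]))
print(decode_candidates(b"caf\x8e"))
```

Looking at the same four bytes as a small integer, a huge integer and a near-zero float usually makes it obvious which reading the file's author intended; the same goes for a string that one codec rejects and another turns into a sensible word.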
I've done this a couple of times, primarily to inspect the data of various games, mainly because it's huge fun! (Fair warning: it can also be incredibly frustrating.) See Reverse Engineering's Reverse engineering file containing sprites for an example approach; notably, at the bottom of my answer in there I admit defeat and start using the phrases "possibly" and "may" and "probably", which is an indication I did not get any further on that.
Note a: Not necessarily, of course. You can cooperate with others whose expertise lies elsewhere, or even do "grunt work" for existing projects – finding out and codifying fairly trivial subcases.
There are also advantages of working independently on existing projects. For example, with the experience of my own PDF reader (written from scratch), I was able to point out a bug in PDFBox.

Related

inserting an entire pdf into another by raw text manipulation

I need to include a pdf into another pdf that is being created by text manipulation, not through a package. (In particular, I'm using livecode, which is well suited to the generation of the information I need, and can easily do text manipulation).
Once included, I will be adding additional objects (primarily text, but also a few small squares).
I only need to be able to access the included pdf by page and area, such as (200,200) to (400,400) of page 5; I don't need any access to its objects.
Simply appending to the pdf won't do the job, as I'll actually be including multiple source pdfs into a single pdf output with my addition.
I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it. In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
Can this be done? Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
Something that could be used on a one-time basis to convert an entire multi-page pdf would also be a usable (but inferior) solution.
I've found that search engines aren't yielding usable results, as they are swamped with solutions for individual products, and not the pdf itself.
First of all, PDFs in general are not text data, they are binary. They may look textual as they contain identifiers built from ASCII values of words, but treating them as text, unless one and one's tools are extremely cautious, is a sure way to damage them.
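As a small illustration of how text handling damages a PDF, here is a Python sketch. The byte string is a made-up fragment, not a complete PDF, and text_round_trip stands in for whatever a text-oriented tool might do (decode, normalise line endings, encode again).

```python
# Made-up fragment: header, binary comment line, and an object with a stream.
pdf_bytes = (
    b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\r\n"
    b"1 0 obj\n<< /Length 4 >>\nstream\n\x00\xff\r\n\nendstream\nendobj\n"
)

def text_round_trip(data):
    """What a text-oriented tool may do: decode, normalise newlines, re-encode."""
    text = data.decode("utf-8", errors="replace")
    text = text.replace("\r\n", "\n")
    return text.encode("utf-8")

def latin1_round_trip(data):
    """Latin-1 maps every byte to one code point, so this round trip is lossless."""
    return data.decode("latin-1").encode("latin-1")

damaged = text_round_trip(pdf_bytes)
print(len(pdf_bytes), len(damaged))                          # lengths differ
print(pdf_bytes.find(b"1 0 obj"), damaged.find(b"1 0 obj"))  # offsets differ
print(latin1_round_trip(pdf_bytes) == pdf_bytes)
```

Because the cross-reference data stores byte offsets, any change in length before an object invalidates the entries pointing at it, and the stream content itself is corrupted as well. A byte-preserving mapping avoids that particular damage, but only if every tool in the chain leaves line endings and byte values alone.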
But even if we assume such caution, unless your input PDFs are internally of a very simple and similar structure, writing code that can merge them and manipulate their content is, complexity-wise, essentially akin to creating a generic PDF library/package.
> I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it.
Putting them into one indirect object each would work if you needed them merely as an unchanged attachment. But you want to change them.
> In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
You will at least have to parse ("disassemble") the objects related to the pages you want to manipulate, add the manipulated versions thereof, and add cross references for the changed objects.
And you only mention cross reference tables. Don't forget that in case of a general solution you also have to be able to handle cross reference streams and object streams.
> Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
If you really want to merge the source PDFs into a target one, you'll indeed need to renumber the objects from most source PDFs.
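To show what renumbering means at the simplest level, here is a Python sketch that shifts object numbers in a fragment of uncompressed PDF syntax. It rests on strong assumptions: no object streams, no cross-reference streams, and no stream or string content that happens to look like "n g obj" or "n g R". The function name renumber is made up, and the cross-reference table still has to be rebuilt afterwards.

```python
import re

# "12 0 obj" headers and "12 0 R" references in plain (uncompressed) PDF syntax.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+) (\d+) obj\b")
REFERENCE = re.compile(rb"(?<!\d)(\d+) (\d+) R\b")

def renumber(body, offset):
    """Add `offset` to every object number; generation numbers are kept."""
    def shift(match, keyword):
        number = int(match.group(1)) + offset
        generation = int(match.group(2))
        return b"%d %d %s" % (number, generation, keyword)

    body = OBJ_HEADER.sub(lambda m: shift(m, b"obj"), body)
    return REFERENCE.sub(lambda m: shift(m, b"R"), body)

fragment = b"1 0 obj\n<< /Type /Pages /Kids [2 0 R 3 0 R] /Count 2 >>\nendobj\n"
print(renumber(fragment, 10).decode("ascii"))
```

For generic PDFs this is not sufficient: the same byte patterns can occur inside binary streams, and objects stored in object streams are not visible to a pattern match at all, which is why a real parser is needed.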
If as a target a portable collection (aka portfolio) of the source PDFs would suffice, you might not need to do that. In that case you merely have to apply the changes you want to the source PDFs (by means of incremental updates, if you prefer), and then combine all those manipulated sources in a result portfolio.
> I've found that search engines aren't yielding usable results
The most likely cause is that you underestimate the complexity of the PDF format. Combining and manipulating arbitrary existing PDFs usually requires you to use a third-party library or to create the equivalent of such a library yourself.
Only manipulating existing PDFs is a bit easier, and so is combining PDFs in a portfolio. Nonetheless, even in this case you should have studied the PDF specification quite a bit.
Restricting oneself to string manipulations to implement this makes the task much more complex - I'd say impossible for generic PDFs, daring for PDFs of simple and similar build.

Headless LibreOffice or OpenOffice as a PDF report generator?

I hope it’s OK to post a completely naive question here for LO or OO experts.
I’m looking for advice on whether scripting LibreOffice or OpenOffice would be suitable for the following:
General Question
I’m looking to generate PDF reports, based on a combination of a “template” and a set of data (currently in JSON format) and inserted images.
This would act as a headless service that gets invoked when necessary from a web server, when a user requests a PDF report (on linux).
We frequently need to modify, customise, or generate new templates, hence the reluctance to go down the route of using something like ReportLab (plus I don't know ReportLab at all, so I would face a huge learning curve that way).
Background
This is in contrast to using a PDF library like ReportLab directly within the web server and having to build up the template/report programmatically.
As LibreOffice/OpenOffice is obviously a lot faster for generating good looking report "templates", this is a question about doing both the template generation, plus final template + data -> PDF generation all directly within LibreOffice.
Some more specifics
The data values would mostly either be substituted into fields in the template, with no to minimal processing of values required.
However, there would be situations where some of the data is in “sets” that would be shown in a table type view, and the number of fields (and so number of table rows for instance) would need to vary per report, based on the number of values in that particular JSON data.
Additionally, I’d need to be able to include (“import”) images into the report. Some of the JSON data would be paths to image files, and I’d like to include those. Again, the number of images may vary from report to report.
This wouldn't be high frequency at all, so there would be no need to run LO/OO as a service; it could simply be invoked when required with a system call. Conceptually something like "LibreOffice --template 'make_fancy.report' <data.json> <output_file.pdf>"
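For reference, the conversion half of such a pipeline can be sketched in Python with LibreOffice's existing command-line options (--headless, --convert-to, --outdir). The "--template" call above is conceptual; this sketch assumes the template has already been filled with the JSON data by some earlier step and saved as a document, and that the binary is on the PATH as "soffice". The function names are made up.

```python
import subprocess
from pathlib import Path

def build_convert_command(document, out_dir, binary="soffice"):
    """Command line for a one-off headless conversion of `document` to PDF."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(document),
    ]

def convert_to_pdf(document, out_dir, timeout=120):
    """Run the conversion and return the path LibreOffice is expected to write."""
    document, out_dir = Path(document), Path(out_dir)
    subprocess.run(build_convert_command(document, out_dir), check=True, timeout=timeout)
    return out_dir / (document.stem + ".pdf")

print(build_convert_command(Path("filled_report.odt"), Path("out")))
```

Calling convert_to_pdf("filled_report.odt", "out") from the web server would then block until the PDF has been written or the timeout expires.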
If this approach would be reasonable in either LO or OO, what languages are best to script in? (Hopefully python3).

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
> All data objects in a PDF file are represented in a visually-oriented way, as a sequence of operators which...generally do not convey information about higher level text units such as tokens, lines, or columns—information about boundaries between such units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, the very lightweight and open source PDF Reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A) copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation) and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google. One can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10 page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
EDIT 7 April 2014. Cancel my last. The best extraction is achieved by MS Word. And this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word to text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this is not so important once converted to plain text.
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to be able to extract formatted text, you have to indeed look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what are lines, what are paragraphs and so on. Tables are notoriously difficult for example because they are so diverse.
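A very reduced version of such a white-space engine can be sketched in Python as follows. It assumes the text fragments have already been extracted with their positions; the Fragment type, the tolerance values and the sample data are all made up for illustration, and real documents need thresholds derived from font sizes.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    x0: float   # left edge
    x1: float   # right edge
    y: float    # baseline (PDF coordinates: larger y is higher on the page)
    text: str

def build_lines(fragments, y_tolerance=2.0, space_gap=1.5):
    """Group fragments into lines by baseline, then join them left to right."""
    lines = []  # each entry: [baseline, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x0)
        text = frags[0].text
        for prev, cur in zip(frags, frags[1:]):
            if cur.x0 - prev.x1 > space_gap:   # a visible gap means a word break
                text += " "
            text += cur.text
        result.append(text)
    return result

page = [
    Fragment(35, 60, 700, "world"),
    Fragment(10, 40, 686, "again"),
    Fragment(10, 20, 700, "Hel"),
    Fragment(20, 30, 700, "lo"),
]
print(build_lines(page))
```

Note that "Hel" and "lo" are joined without a space because they touch, while "world" is separated by a gap; the order in which the fragments appear in the file plays no role.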
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and that may seem like a lousy answer, but I believe it's actually true, this is an algorithmically difficult problem and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where - if you are clever and you have enough focus to put some of your attention on it, and especially if you have a good idea what the target market is you are writing this for - you'll get it right, while everybody else will get it mediocre.
(And no, I didn't get it right back then when I was writing that code - we never had enough focus to follow-through and make something that was really good)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
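For the third point, one simple heuristic (by no means the one any particular product uses) is to look for a vertical gutter that no text box crosses and to read the left side before the right. The Python sketch below assumes line boxes given as (x0, x1, y, text) tuples with y growing upward; the function names and the minimum gutter width are made up.

```python
def find_gutter(boxes, min_width=10.0):
    """Return (left, right) of the widest horizontal gap no box crosses, or None."""
    if not boxes:
        return None
    intervals = sorted((box[0], box[1]) for box in boxes)
    best = None
    reach = intervals[0][1]          # rightmost x covered so far
    for x0, x1 in intervals[1:]:
        gap = x0 - reach
        if gap >= min_width and (best is None or gap > best[1] - best[0]):
            best = (reach, x0)
        reach = max(reach, x1)
    return best

def reading_order(boxes):
    """Top-to-bottom within each column, left column first."""
    gutter = find_gutter(boxes)
    if gutter is None:
        return [box[3] for box in sorted(boxes, key=lambda b: -b[2])]
    middle = (gutter[0] + gutter[1]) / 2
    left = [box for box in boxes if box[1] <= middle]
    right = [box for box in boxes if box[0] >= middle]
    ordered = sorted(left, key=lambda b: -b[2]) + sorted(right, key=lambda b: -b[2])
    return [box[3] for box in ordered]

two_columns = [
    (50, 280, 700, "L1"), (320, 550, 700, "R1"),
    (50, 270, 686, "L2"), (320, 540, 686, "R2"),
]
print(reading_order(two_columns))
```

A header or footer that spans the full page width closes the gutter in this sketch, so such lines would have to be detected and set aside first.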
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not had to deal with font details on a deeper level, and they might not be as well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help to excel in text extraction.

Migrating RMS to RDB

We're approaching the migration of legacy OpenVMS RMS files into a relational database (both MS SQL 2012 and Oracle 10g are available).
I wonder if there are:
Tools to retrieve schema of indexed files
Tools to parse indexed files
Tools to deal with custom RMS data formats (zoned decimals etc)
as a bundle/API/Library
Perhaps I should change the approach?
There are several tools available, notably through ODBC vendors (I work for one: Attunity).
1 >> Tools to retrieve schema of indexed files
Please clarify. Are you looking for just the record/column layout and indexes within the files, or also for relationships between files?
1a) How are the files currently being used? Cobol, Basic, Fortran programs? Datatrieve?
They will be using some data definition method, so you want a tool which can exploit that.
CONNX and Attunity Connect can 'import' CDD definitions, BASIC MAP files, and COBOL copybooks. Variants are typically covered as well. I have written many a (Perl/awk) script to convert special definitions to XML.
1b) Analyze/RMS, or a program calling RMS with XABs, can get the available index information. Attunity Connect will know how to map those onto the fields from 1a).
1c) There is no formal, stored relationship between (indexed) files on OpenVMS. That's all in the program logic. However, a modestly smart Perl/Awk/DCL script can often generate a table of likely foreign/primary keys by looking for matching field names and datatypes.
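The kind of matching script meant in 1c) can be sketched as follows, here in Python rather than Perl/Awk/DCL. The layout dictionary, the file and field names, and the datatype strings are made up; in practice they would come from the definitions found in 1a).

```python
from itertools import combinations

def candidate_keys(layouts):
    """layouts maps file name -> {field name: datatype}.

    Returns (file_a, file_b, field, datatype) for every field that occurs in
    two files with the same name and the same datatype."""
    candidates = []
    for (file_a, fields_a), (file_b, fields_b) in combinations(sorted(layouts.items()), 2):
        for name in sorted(fields_a.keys() & fields_b.keys()):
            if fields_a[name] == fields_b[name]:
                candidates.append((file_a, file_b, name, fields_a[name]))
    return candidates

layouts = {
    "CUSTOMER.DAT": {"CUST_ID": "zoned(8)", "NAME": "text(30)"},
    "ORDERS.DAT": {"ORDER_ID": "zoned(10)", "CUST_ID": "zoned(8)", "NAME": "text(40)"},
}
for row in candidate_keys(layouts):
    print(row)
```

The output is only a list of likely relationships; since the real ones live in the program logic, each candidate still has to be confirmed by someone who knows the application.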
How many files / layouts / gigabytes are we talking about?
2 >> Tools to parse indexed files
Please clarify? Once the structure is known (question 1), the parsing is done by reading using that structure right? You never ever want to understand the indexed file internals. Just tell RMS to fetch records.
3 >> Tools to deal with custom RMS data formats (zoned decimals etc) as a bundle/API/Library
Again, please clarify. Once the structure is known just use the 'right' tool to read using that structure and surely it will honor the detailed data definitions.
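For readers who have not met zoned decimals: each digit occupies one byte, and the sign is carried in the high nibble of the last byte. The Python sketch below assumes an ASCII-based layout in which a high nibble of 0x7 in the last byte marks a negative value; that sign convention and the function name decode_zoned are assumptions and must be checked against the actual data definition.

```python
from decimal import Decimal

def decode_zoned(raw, scale=0):
    """Decode a zoned-decimal field with `scale` implied decimal places.

    ASSUMPTION: low nibble of each byte is the digit; a high nibble of 0x7
    in the last byte means negative, anything else means positive."""
    if not raw:
        raise ValueError("empty field")
    digits = []
    for byte in raw:
        digit = byte & 0x0F
        if digit > 9:
            raise ValueError(f"byte 0x{byte:02x} does not hold a decimal digit")
        digits.append(str(digit))
    negative = (raw[-1] >> 4) == 0x7
    value = Decimal("".join(digits)).scaleb(-scale)
    return -value if negative else value

print(decode_zoned(b"0012345", scale=2))
print(decode_zoned(b"001234u", scale=2))   # 'u' is 0x75: digit 5, negative zone
```

A tool that reads the data definitions does this per field automatically, which is the point of using the 'right' tool.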
> (I know it is quite simple to write one yourself, just thought there would be something in the industry)
Famous last words... 'quite simple'. Entire companies have been built and thrive doing just that for general cases. I admit that for specific cases it can be relatively straightforward, but 'the devil is in the details'.
In the Attunity Connect case we have a UDT (User Defined data Type) to handle the 'odd' cases, often involving DATES. Dates in integers, in strings, as units since xxx are all available out of the box, but for example some have -1 meaning 'some high date' which needs some help to be stored in a DB.
All the databases have some bulk load tool (BCP, SQL*Loader).
As long as you can deliver data conforming to what those expect (tabular, comma-separated, quoted-or-not, escapes-or-not) you should be in good shape.
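As an illustration of delivering loader-friendly output, here is a Python sketch that writes already-decoded records as comma-separated text with quoting only where needed. The record contents and the function name are made up, and the quoting and line-ending settings are assumptions to be aligned with the control file or format file of the loader in use.

```python
import csv
import io

def records_to_csv(records, columns):
    """Serialise dict records in a fixed column order, quoting only when needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(columns)
    for record in records:
        writer.writerow([record[name] for name in columns])
    return buffer.getvalue()

rows = [{"CUST_ID": 1001, "NAME": 'Smith, "Jo"', "BALANCE": "123.45"}]
print(records_to_csv(rows, ["CUST_ID", "NAME", "BALANCE"]))
```

Embedded commas and quotes are where hand-rolled exports usually go wrong, so it pays to let a library handle the escaping.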
The EGH tool Vselect may be a handy, high-performance way to bulk read indexed files, filter and format some, and spit out sequential files for the DB loaders. It can read RMS indexed files faster than RMS can! (It has its own metadata language, though!)
Attunity offers full access and replication services.
They include CDC (change data capture) to not only load the data, but also keep it up to date in near-real-time. That's useful for 'evolution' versus 'revolution'.
Check out Attunity 'Replicate'. Once you have a data dictionary, just point to the tables desired (include, exclude filters), point to a target DB and click to replicate. Of course there are options for (global or per-table) transformations (like an AREA-CODE+EXCHANGE+NUMBER to a single phone number, or adding a modified-date column).
Will this be a single big switch conversion, or is there desire to migrate the data and keep the old systems alive for days, months, years perhaps, all along keeping the data in close sync?
Hope this helps some,
Hein van den Heuvel.
OP: Perhaps I should change the approach? Probably.
You might consider finding data migration vendors, some of which likely have off-the-shelf solutions, if not as a COTS tool, more likely packaged as a service (I don't think this is a big market).
What this won't help you with is what I think of as a much bigger problem with the application code: who is going to change all the code that is making RMS calls into the corresponding code that makes relational DB calls? How will the entity ("Joe Programmer", or some tool) know where the data migrated to, so that he can write the correct call? What are you going to do about the fact that the data representation is likely to change?
Ideally you'd like an automated migration tool that will move the data itself (and therefore knows the data layouts and representation changes) and will make the code changes that correspond. You can look for these kinds of vendors, too.

optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned, image-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
Sample Protocol http://sert.homedns.org/img/btp12001.png
I would love to read your answer to my following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents are already scanned by the parliament o_O: sample (same as the image above) and there are lots of them and I want to deliver on the contract ASAP so I can't go fetch print copies of the same documents, cut and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends on what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
I use OmniPage 17 for such things. It also has a batch mode: you put the documents in one folder, from which they are picked up, and the results are written to another.
It recognizes the layout automatically, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo first to see whether it works correctly for you. At the moment I have problems with ligatures in some of my documents: words like "fliegen" come out as "fl iegen", so you have to spell-check them.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
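The splitting program mentioned as a last resort can be sketched in plain Python. The sketch assumes the scan has already been binarised and deskewed into rows of 0 (background) and 1 (ink), which in practice an image processing library would do; the function names and the "central band" limits are made up. It picks the emptiest vertical strip near the middle of the page as the cut.

```python
def find_split_column(page):
    """Return the x position in the middle of the emptiest central strip.

    `page` is a list of equal-length rows holding 0 (background) or 1 (ink)."""
    width = len(page[0])
    lo, hi = width * 3 // 10, width * 7 // 10      # only search the central band
    if hi <= lo:
        raise ValueError("page too narrow to search for a gutter")
    ink = [sum(row[x] for row in page) for x in range(width)]
    least = min(ink[lo:hi])

    best_start, best_len, run_start = lo, 0, None
    for x in range(lo, hi + 1):
        if x < hi and ink[x] == least:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            if x - run_start > best_len:
                best_start, best_len = run_start, x - run_start
            run_start = None
    return best_start + best_len // 2

def split_page(page):
    """Cut every row at the detected gutter, giving a left and a right image."""
    cut = find_split_column(page)
    return [row[:cut] for row in page], [row[cut:] for row in page]

# Tiny synthetic page: 8 columns of ink, a 4-column gutter, 8 columns of ink.
page = [[1] * 8 + [0] * 4 + [1] * 8 for _ in range(5)]
left, right = split_page(page)
print(find_split_column(page), len(left[0]), len(right[0]))
```

Because the cut is computed per page rather than fixed, this also tolerates scans where the gutter drifts a little from page to page.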