Populating PDF fields from a database - pdf

I have a PDF file (not created by me - I have no control over the design etc.) which allows users to fill in some form fields in Adobe Reader and save the result. I want to automate the process of populating the fields, using the following steps:
Fetch data from database.
Open PDF template.
Populate form fields with data.
Save modified file to a separate location on disk.
Lock modified file so that the form fields can no longer be edited.
Send file to user.
I'm happy to use PHP, Perl, Python or Java to do steps 2-5 (in descending order of preference), but whatever I use has to work under Linux (i.e. it mustn't rely on libraries which are only available on Windows for example).
The end result should be a PDF which the average user can open and print, but not modify (I'm sure advanced users could find a way to do so, but I accept that I can't guarantee complete security against modification). I don't want to change the structure of the PDF, merely populate the form fields.
Is there a standard piece of software for doing this? I've seen mentions of FDF Toolkit, but I'm not entirely sure if that's what I want and whether it will allow me to lock the file afterwards, and whether what I want to do fits in with the EULA.
Edit: Final answer is to use iText (as suggested by Mark Storer) but to implement it as a web service which allows you to pass in an array of form field names and values and the PDF file 'template'. The web service will be open source (and available on GitHub once I've written it), as per the AGPL, but anything connecting to it won't have to be.

Filling
Any number of different libraries can fill in field values. I'm partial to iText (java) or iTextSharp (c#). I wrote one in Java a number of years ago. It's not that hard). There are lots. Search SO, you'll find 'em.
Locking
There are a couple different levels of "lock the fields".
Each field has a "read only" flag. This is pretty much a courtesy as far as other libraries capable of setting field values are concerned. In fact, it's generally considered to mean "the ui cannot make changes". Form script can, regardless.
Form flattening: Draw the fields directly into the page and removing all the interactivity.
Each one has pros and cons.
Flag: None too secure. Form data still easily accessible. Scrolling fields still scroll.
Flattening: Pretty much the exact opposite. It's harder to modify (though far from impossile). The form data can only be extracted via text extraction (which is hard, but becoming increasingly common). List & text fields that contain more stuff than is visible will no longer scroll.
The ability to flatten forms is relatively rare. Again, iText can do it (as can iTextSharp), but I'm not aware of any other third party libraries that can... I'm sure they exist, I just can't name them off the top of my head.

Related

Setting text to be read column-wise by screen-reader in iText 7

I have a page in my PDF that consists of several columns. I would like the screen-reader to read each column individually before moving on to the next column. Currently it just reads the text that appears from left to right. Is there any way to do this in iText 7?
The answer depends on whether you create this document by yourself with iText or you want to fix this issue in already existing PDF document.
In the first case you simply need to specify that you want to create document logical structure along with document content. In order to achieve this, you need to call PdfDocument#setTagged() method upon creation of PdfDocument instance. Document logical structure is something that tools like screen readers would rely on in order to get the correct logical order of the contents.
In the second scenario, when you already have a document with several columns, however it's reading order is messed up, it is most likely that this document doesn't have proper logical structure in it (or in other words it is not tagged properly). The task of fixing the issue you described in already existing PDF document (this task is sometimes called structure recognition) is extremely difficult in general case and cannot be performed automatically as of nowadays. There are several tools that would allow you to fix such documents manually or semi-automatically (like Adobe Acrobat) but iText 7 doesn't provide structure recognition functionality right now.

PDF file search then display that page only

I create a PDF file with 20,000 pages. Send it to a printer and individual pages are printed and mailed. These are tax bills to homeowners.
I would like to place the PDF file my web server.
When a customer inputs a unique bill number on a search page, a search for that specific page is started.
When the page within the PDF file is located, only that page is displayed to the requester.
There are other issues with security, uniqueness of bill number to search that can be worked out.
The main question is... 1: Can this be done 2: Is there third party program that is required.
I am a novice programmer and would like to try and do this myself.
Thank you
It is possible but I would strongly recommend a different route. Instead of one 20,000 page document which might be great for printing, can you instead make 20,000 individual documents and just name them with something unique (bill number or whatever)? PDFs are document presentations and aren't suited for searching or even text information storage. There's no "words" or "paragraphs" and there's even no guarantee that text is written letter after letter. "Hello World" could be written "Wo", "He", "llo", "rld". Your customer's number might be "H1234567" but be written "1234567", "H". Text might be "in-page" but it also might be in form fields which adds to the complexity. There are many PDF libraries out there that try to solve these problems but if you can avoid them in the first your life will be much easier.
If you can't re-make the main document then I would suggest a compromise. Take some time now and use a library like iText (Java) or iTextSharp (.Net) to split the giant document into smaller documents arbitrarily named. Then try to write your text extraction logic using the same libraries to find your uniqueifiers in the documents and rename each document accordingly. This is really the only way that you can prove that your logic worked on every possible scenario.
Also, be careful with your uniqueifiers. If you have accounts like "H1234" and "H12345" you need to make sure that your search algorithm is aware that one is a subset (and therefore a match) of the other.
Finally, and this depends on how sensitive your client's data is, but if you're transporting very sensitive material I'd really suggest you spot-check every single document. Sucks, I know, I've had to do it. I'd get a copy of Ghostscript and convert all of the PDFs to images and then just run them through a program that can show me the document and the file name all at once. Google Picasa works nice for this. You could also write a Photoshop action that cropped the document to a specific region and then just use Windows Explorer.

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
All data objects in a PDF file are represented in a
visually-oriented way, as a sequence of operators which...generally
do not convey information about higher level text units such as
tokens, lines, or columns—information about boundaries between such
units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, the very lightweight and open source PDF Reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A) copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation) and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google. One can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10 page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
EDIT 7 April 2014. Cancel my last. The best extraction is achieved by MS Word. And this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word to text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this is not so important once converted to plain text.
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to be able to extract formatted text, you have to indeed look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what are lines, what are paragraphs and so on. Tables are notoriously difficult for example because they are so diverse.
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and that may seem like a lousy answer, but I believe it's actually true, this is an algorithmically difficult problem and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where - if you are clever and you have enough focus to put some of your attention on it, and especially if you have a good idea what the target market is you are writing this for - you'll get it right, while everybody else will get it mediocre.
(And no, I didn't get it right back then when I was writing that code - we never had enough focus to follow-through and make something that was really good)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not have to deal with font details on a deeper level and they might not be that well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help to excel in text extraction.

Concatenate PDFs and preserve Extended Features in Acrobat Reader

We are using iText to automatically fill in form fields on a number of documents and then concatenating those documents into one resulting PDF.
Adobe has introduced the Extend Features in Acrobat Reader option to allow users of Acrobat Reader to save the PDF with changes to the form fields.
This is a proprietary Adobe feature that iText can only work around.
I have been able to execute the work around for one specific document using the PdfStamper class in append mode. Since the PDF's contain form fields, we use the PdfCopyFields class to perform the concatenation. PdfCopyFields does not have an append mode.
Is there another way to do an append of a PDF into a preexisting PDF with iText (any version)?
It's possible, but would require you to know enough to modify PdfCopyFields so that it saves in append mode.
You could duplicate the functionality and use it on top of PdfStamper (in your own class or otherwise), subclass PdfCopyFields, or modify PdfCopyFields directly.
Big Stumbling Block
All fields with the same name in a PDF share the same value as well. If you have two copies of the same form in your resulting PDF, then you have two views of the same data.
Even with different forms, if you happen to have a name collision ("City" over here might be part of a current address, while over there it might be the city they were born in), they'll glom together the same value.
If you have a Comprehensive System such that all your naming collisions will be deliberate, that's great, go for broke. If "FirstName" is always referring to the same person, and changing it SHOULD change the value across all the forms in question, you're golden. If not... that's why PdfStamper's flattening ability is so popular.
The alternative becomes "rename all your fields before gluing the forms together" to avoid such collisions.
Even with a Comprehensive System, I still suggest whipping up a little tool that'll go through the forms you propose to merge and look for collisions. Maybe list them along with their values in some test data. You might catch something along the lines of "Fly: House, Common" vs "Fly: Southwest Airlines".
Probably not that particular example, but who knows? ;)

Generate dynamic PDF from ASP content

I'm trying to dynamically create a PDF from a background image and some text from a database.
When a user visits our site, they can click on a button to print a voucher that can be redeemed in store. The voucher consists of a background image and some text printed over the top of it; Name, Voucher Code (generated from a hash algorithm) and Email Address.
Currently, the user can print this out right away or have it emailed. However, I want to turn the voucher into a PDF so that the user can save the voucher for later.
The site is built in classic ASP, so i would need a solution that can create a PDF of the voucher image and the text from the database.
I wondered if anyone had any suggestions on how I could approach this?
Thank you.
User ABCpdf
You can create an Acrobat form using Adobe Acrobat (i.e. a PDF containing form fields), and then populate those form fields from the database. That way, your "form" PDF can have as much complex artwork etc. as you like, and you can create it in pretty much any tool you choose (Word, Illustrator, Quark, etc.). You'll need Acrobat (not just Reader) to actually create the form fields.
As for populating the form fields, there are many tools that do this, but I think the best (and certainly the cheapest - it's free) is iText or its C# equivalent iTextSharp. You can populate the fields and flatten the document (so that the fields are no longer fields) in a dozen lines of code.
See the iText site for details.
EDIT: oops, I see now that you said classic ASP, so iTextSharp may not be ideal for you unless you're prepared to make a COM wrapper for it. There are, however, other COM-based tools to do the same job. Adobe used to supply an FDF toolkit but I believe they no longer support it.
One of the approach is to save the snapshot in a word-document [ it can be done using ASP ] and then use software like CutePdf, which converts the word document into PDF. Doing that, you can easily achieve your requirement.
The other approach would be, using open source libraries, which given flexibility to generate PDF documents. But that requires, a lot of understanding [ in terms of usage of the library and its API's ]
I think you need to weigh this requirement carefully. Either you are going to need to buy in some additional software or just possiblly there may be away to do it for free but that'll come with some significant hassle (which in turn translates into cost).
I know of no simple free ways to acheive this in classic ASP. I've used many such sites which offer a printable "ticket", most use the "print this web page now or print the email we've just sent you" approach.
Unless you believe the download a PDF approach is so appealing to your customers that it creates a significant differentiator over your competition it might not be worth bothering.