My company has requested a Java web service implementation of extracting data from PDF forms to initiate straight through processing capabilities for client operations using Apache PDFBox. Easy enough. The tough part is that forms are being submitted from clients of my firm on behalf of end customers, but the end customer signature has to be validated.
The business case for signing these forms is through informal electronic signature (digital representation of a wet signature) processes like the signature "stamp" in Adobe Reader with an image of the customer's signature, or touch screen drawing on an iPad. So far, I have been unable to consistently validate this type of signature, and even been unable to consistently maintain the PDF state such that it can still be read by PDFBox after this type of signature ceremony.
Validating signatures through the digital signature form field is trivial, and I have communicated that to our business. However, the signer in those cases is typically the owner of the digital cert on whatever machine is being used, and the assumption is that most of these interactions will take place in the client office.
I've got a few choices here:
Figure out how to identify electronic signatures consistently and reproduce the lossless signing ceremony for client education.
Change the digital signature form field to accept electronic signatures, if that's even possible.
I have a slight workaround using the most recent release of Acrobat in putting an image form field over the signature area, which works great except for one thing: all the software I've tried reads this form field type as a button. Is there any way to force it to recognize an image, or any PDF reading software that is more up to date and can detect those fields?
I would like to upload a couple sample PDFs, but of course they're all company proprietary information. Suffice it to say that we don't have any wizards doing amazing things with the forms... they are all your basic AcroForms and I'm trying to figure out how to configure the signature area.
Thank you.
Concerning your actual question:
I have a slight workaround using the most recent release of Acrobat in putting an image form field over the signature area, which works great except for one thing: all the software I've tried reads this form field type as a button. Is there any way to force it to recognize an image, or any PDF reading software that is more up to date and can detect those fields?
Any PDF reading software that recognizes those fields as a button is up-to-date, at least in that respect, because... there are no "image form fields" in the PDF file format!
Some PDF creators emulate image form fields using a button form field which by means of JavaScript gets the behavior of an image form field. This emulation is incomplete, of course. In particular the image in such a field is not the value of the form field but merely its appearance.
Thus, if you want to implement reading the value from such an emulated image form field, you have to extract the appearance of the button.
Some remarks on the whole scenario:
... the end customer signature has to be validated.
The business case for signing these forms is through informal electronic signature (digital representation of a wet signature) processes
In contrast to certificate based digital signatures you can hardly do anything with such signatures that deserves to be called "Validation".
Ok, you can look for an image in the PDF in some emulated image field, but you have no guarantee that the person whose wet signature can be seen on that image backs the data in that form let alone has indeed signed it personally. Just as likely someone else simply has scanned that person's signature from some different hand-signed document and filled the form using that scan...
So far, I have been unable to consistently validate this type of signature
It should be possible to extract most such wet signatures
either as bitmap images added directly or indirectly to the page content,
or as bitmap images added directly or indirectly to some annotation appearance (e.g. a button),
or as an InkList or Path of an Ink annotation,
or as the Vertices or Path of a PolyLine annotation.
In case of bitmap images don't forget to also extract the image mask if applicable. Numerous applications fill the base image with the pen color and contain the actual signature graph in the mask.
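As a rough illustration of the mask remark above, the following sketch combines an already extracted base image and its mask into a single transparent image using only java.awt. The class and method names are made up for this example. It assumes soft-mask semantics (a mask sample of 255 means fully opaque) and equal image sizes; stencil masks or masks with a Decode array may need their values inverted, and a mask with a different resolution would have to be scaled first.

```java
import java.awt.image.BufferedImage;

/** Sketch: merge an extracted base image with its mask into one ARGB image. */
final class SignatureMaskCompositor {

    /**
     * Uses the gray level of each mask pixel as the alpha of the matching base pixel.
     * Assumption: both images have the same size and 255 in the mask means opaque.
     */
    static BufferedImage applySoftMask(BufferedImage base, BufferedImage mask) {
        if (base.getWidth() != mask.getWidth() || base.getHeight() != mask.getHeight()) {
            throw new IllegalArgumentException("base and mask must have the same size");
        }
        BufferedImage out = new BufferedImage(
                base.getWidth(), base.getHeight(), BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < base.getHeight(); y++) {
            for (int x = 0; x < base.getWidth(); x++) {
                int rgb = base.getRGB(x, y) & 0x00FFFFFF; // colour only, e.g. the pen colour
                int alpha = mask.getRGB(x, y) & 0xFF;      // blue channel of a gray mask pixel
                out.setRGB(x, y, (alpha << 24) | rgb);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny demo: a blue "pen colour" base and a mask that keeps only the left pixel.
        BufferedImage base = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        BufferedImage mask = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        base.setRGB(0, 0, 0x0000FF);
        base.setRGB(1, 0, 0x0000FF);
        mask.setRGB(0, 0, 0xFFFFFF);
        mask.setRGB(1, 0, 0x000000);
        BufferedImage out = applySoftMask(base, mask);
        System.out.println("left alpha = " + (out.getRGB(0, 0) >>> 24)
                + ", right alpha = " + (out.getRGB(1, 0) >>> 24));
    }
}
```

The result shows the signature strokes in the pen colour on a transparent background, which is usually easier to inspect or compare than the uniformly filled base image alone.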
and even been unable to consistently maintain the PDF state such that it can still be read by PDFBox after this type of signature ceremony.
That sounds like a misbehavior of the software that executed that signing ceremony. Unless you share examples for that, though, one can hardly help you analyze the problem.
Related
I want to place same externally signed signature container (signature value) at multiple places in a PDF.
I have referred to the page 'How to place the Same Digital signatures to Multiple places in PDF using itextsharp.net'.
While working with the above-mentioned workaround, I observed that whenever I tried to place multiple signatures on a single page, e.g. 4-5 times, it never worked. It always shows only one valid signature field and the other fields as unsigned (unsigned PDF form fields). I couldn't understand the problem.
Now I would like to know whether any reference material is available that explains how PdfLiteral and PdfIndirectReference work. I have gone through the itextsharp reference document but couldn't get enough information. In addition, is there any limitation on how many annotations/signature fields one can add to a PDF?
And if I have to use BlankSignatureContainer and MakeSignature.SignDeferred, how will the signature get attached to all the fields? After all, in
MakeSignature.SignDeferred(pdfreader, "Sig", output, externalcontainer)
we have to pass only one signature field name.
Thank you.
You are asking for something of which mkl wrote:
Beware: While this procedure creates something which does not violate
the letter of the PDF specifications (which only forbid the cases
where the same field object is referenced from multiple pages, be it
via the same or via distinct widgets), it clearly does violate its
intent, its spirit. Thus, this procedure might also become forbidden as part of a Corrigenda document for the specification.
Actually, what you are asking does violate the specification. See section 12.7.5.5 of the ISO standard for PDF:
Allow me to repeat the last line of this screen shot:
signature fields shall never refer to more than one annotation.
There is a shall in this sentence, not a should. A should isn't normative: it means that you should or shouldn't do something, but that you are not in violation of the spec if you don't or do. Not respecting a shall results in a PDF document that violates the PDF specification and that, in the strict sense, isn't a real PDF file.
That is a path you don't want to go down, because violating the PDF specification voids your right to use a series of PDF patents owned by Adobe. Adobe owns patents that can be used by everyone for free (perpetual, non-exclusive, royalty-free, ...) on condition that you respect the ISO specification.
For that reason, please do not expect an answer to your question, except for the recommendation to abandon your requirement. PDF viewers that comply with the PDF specification won't expect a single signature to be placed at different locations because that's not allowed by the spec, so even if you would adapt your software to create more than one widget annotation / appearance for a single signature field, there is no guarantee that a PDF viewer will understand what you're trying to do.
I am verifying a PDF with two signatures (Adobe Acrobat), both valid. One of them carries the text "cambio(s) varios" (my Adobe Acrobat is in Spanish), which translates to English as "change(s) various". My question is what this means. The signatures are valid and the PDF is correct.
Thanks in advance
First of all, to outline what this is about, the Adobe Acrobat Reader signature panel looks like this for the document at hand
and the question is about the
1 Miscellaneous Change(s)
in-between.
According to Adobe Documentation
In a number of documents Adobe enumerates possible modification entries and characterizes "Miscellaneous Change(s)" like this:
Miscellaneous: Some changes which occur in memory or cannot be explicitly listed are labelled miscellaneous.
(e.g. in "Digital Signatures Workflow Guide for the Adobe® Acrobat Family of Products")
Now this documentation obviously is no help at all...
According to Adobe Acrobat
Fortunately Adobe Acrobat can be asked to show "Document Integrity Properties":
(Adobe Acrobat 9.5 output on "Signature Properties" - "Legal" - "View Document Integrity Properties...")
I assume it is this detail that makes Adobe Reader warn about miscellaneous changes.
In Your Document
Looking for transfer function use in your document, one indeed quickly finds one in an ExtGState resource of page 1:
The TR entry in that graphics state dictionary sets the transfer function here.
Interestingly the transfer function used is the Identity function! I assume that in most normal use cases setting the transfer function to Identity changes nothing...
What to Do
Thus, I would propose you change your original document creation to not include transfer functions, in particular not Identity transfer functions. Alternatively pre-process your documents before applying the first signature and remove such functions.
I have a PDF whose content stream looks like image1.
But once I opened the PDF in Adobe DC and tried to change the reading order, the entire content stream changed. (Please see image2)
And here is the link to source pdf https://drive.google.com/file/d/1V2K3-2GdWG5DuTUv1fyfIIT54en70kI2/view
Is there a way to do the same programmatically (convert content stream of graphical text to proper stream)?
Thanks in advance !
Is there a way to do the same programmatically (convert content stream of graphical text to proper stream)?
First of all, both streams are proper. There merely are different (and in the case at hand considerably different) ways to create the same text on screen, each as valid as the other, and different PDF processors use different ways.
The processor that created your original PDF appears to have approached the task by dividing the text into small pieces (less than a text line) and drawing these pieces as independently as possible, i.e. as separate text objects (BT..ET) with text properties set in each (Tm, Tf, Tc), positioned and rescaled by transformation matrix changes (cm), enveloped in save/restore graphics state instructions (q..Q).
Adobe Acrobat, on the other hand, appears to prefer the page main text to be contained in a single text object with text properties only set when they change and no text object or graphics state switches in-between.
Neither of these is more "proper" or more "graphical" than the other. If anything, these structures mirror how these instructions are stored or processed internally by the respective PDF processor.
That being said, you do want to convert from the former style into the latter.
The main problem is that the latter style is not standardized (at least there is no published document normatively describing it). So, while you can surely attempt to follow the lead of the example you have, you can never be sure that you understood the style exactly. Thus, you always have to expect differences emerging in special, not yet encountered situations. Furthermore, there is no guarantee Adobe will meticulously adhere to that style across software versions.
Nonetheless, you can of course attempt to follow the style (as you perceive it) as well as possible.
An implementation will have to walk through the respective content stream, keeping track of the current graphics state, and transform the text drawing (and related) instructions into a single text object for as long as possible.
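To illustrate one small part of such a rewrite, here is a minimal sketch that drops font selections (Tf) which merely repeat the font already in effect, while following q/Q so that restored states are respected. The Instruction type and the class name are hypothetical stand-ins for whatever your parser produces. The sketch only handles Tf, treats gs as possibly changing the font, and does not merge text objects or touch any matrices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: emit Tf only when the font selection actually changes. */
final class RedundantFontRemover {

    /** Hypothetical parsed instruction: an operator plus its operands as raw tokens. */
    static final class Instruction {
        final String operator;
        final List<String> operands;

        Instruction(String operator, String... operands) {
            this.operator = operator;
            this.operands = Arrays.asList(operands);
        }
    }

    static List<Instruction> removeRedundantTf(List<Instruction> input) {
        List<Instruction> output = new ArrayList<>();
        List<String> savedFonts = new ArrayList<>(); // stack mirroring q/Q; null entries allowed
        String currentFont = null;                   // null means "unknown or not set"
        for (Instruction ins : input) {
            switch (ins.operator) {
                case "q":   // save graphics state, which includes the text font
                    savedFonts.add(currentFont);
                    break;
                case "Q":   // restore: the font reverts to what it was at the matching q
                    currentFont = savedFonts.isEmpty()
                            ? null : savedFonts.remove(savedFonts.size() - 1);
                    break;
                case "gs":  // an ExtGState may carry a Font entry, so forget what we know
                    currentFont = null;
                    break;
                case "Tf":
                    String key = String.join(" ", ins.operands);
                    if (key.equals(currentFont)) {
                        continue; // same font and size already active: drop the instruction
                    }
                    currentFont = key;
                    break;
                default:
                    break;
            }
            output.add(ins);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Instruction> in = Arrays.asList(
                new Instruction("BT"), new Instruction("Tf", "/F1", "12"),
                new Instruction("Tj", "(A)"), new Instruction("ET"),
                new Instruction("BT"), new Instruction("Tf", "/F1", "12"),
                new Instruction("Tj", "(B)"), new Instruction("ET"));
        System.out.println(in.size() + " instructions in, "
                + removeRedundantTf(in).size() + " out");
    }
}
```

A full converter would apply the same bookkeeping to the other text state parameters and to the text and transformation matrices, which is where the graphics state tracking of the library classes mentioned below becomes valuable.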
You have tagged your question both itext and pdfbox. Thus, you appear to be undecided about which PDF library to implement this with. Here are some ideas for both choices:
For processing content streams and keeping track of the current graphics state, iText offers a parser API: in iText 5.x the com.itextpdf.text.pdf.parser package, in particular the PdfContentStreamProcessor, and in iText 7.x the com.itextpdf.kernel.pdf.canvas.parser package, in particular the PdfCanvasProcessor.
You can extend these processors so that, in addition to analyzing the current contents, they also replace the content stream in question with an updated version, e.g. like I did in this answer for iText 5 or in this answer for iText 7.
PDFBox for the same task offers the class hierarchy based on the PDFStreamEngine. Based on these classes it should similarly be possible to create a graphics state aware content stream editor.
Both libraries also offer simpler classes for parsing the content streams into sequences of instructions, but those classes don't keep track of the graphics state, leaving that for you to implement.
I have a theoretical question about PAdES. I want to know if it is possible to revoke a signature in PDF or remove it?
I don't know what exactly you technically mean by revoking a signature.
But it clearly is possible to remove a signature: An integrated PDF signature usually consists of a signature form field with a value that contains a CMS signature container.
You have the choice of either removing only that value or the whole field with the value.
The former option leaves an empty signature field, which can easily be used for a new signature with a visualization at the same location as your original signature (if it has any to start with).
The latter option removes your signature completely.
Two caveats, though:
If you don't merely want the signature not to appear anymore, make sure that
you don't save this edit as an incremental update - if it was done as an incremental update, the document version with your signature could easily be restored;
you don't merely remove the reference to the value from the signature field but that you actually clear the value object - the signature value object might be referenced from other locations in the PDF, too, so if you don't clear it, its information might remain accessible inside the PDF.
If your PDF contains multiple signatures or document timestamps, and if the signature you want to remove is not the newest one, manipulating it will break at least all newer signatures / time stamps. This is due to the way multiple signatures are applied to PDFs:
As you can recognize in this sketch, the bytes signed by newer signatures contain all older signatures.
In such a situation, therefore, don't only implement "remove a single target signature" but instead "remove all signatures starting at a single target signature".
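The dependency between signatures can be made concrete with a small sketch. It assumes you have already read each signature's ByteRange array (two offset/length pairs) with your PDF library; the SignatureInfo holder and all names are invented for this example. Since a newer signature covers more of the file than an older one, every signature whose signed range ends at or after the end of the target's range is treated as affected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: decide which signatures must go when one target signature is removed. */
final class SignatureRemovalPlanner {

    /** Hypothetical holder for a signature field name and its ByteRange entries. */
    static final class SignatureInfo {
        final String fieldName;
        final long[] byteRange; // [offset1, length1, offset2, length2]

        SignatureInfo(String fieldName, long... byteRange) {
            if (byteRange.length != 4) {
                throw new IllegalArgumentException("ByteRange needs four numbers");
            }
            this.fieldName = fieldName;
            this.byteRange = byteRange;
        }

        /** End of the signed part of the file, i.e. offset2 + length2. */
        long signedEnd() {
            return byteRange[2] + byteRange[3];
        }
    }

    /** Returns the target plus every signature that signs at least as much of the file. */
    static List<String> signaturesToRemove(List<SignatureInfo> all, String targetField) {
        SignatureInfo target = null;
        for (SignatureInfo s : all) {
            if (s.fieldName.equals(targetField)) {
                target = s;
            }
        }
        if (target == null) {
            throw new IllegalArgumentException("unknown signature field: " + targetField);
        }
        List<String> result = new ArrayList<>();
        for (SignatureInfo s : all) {
            if (s.signedEnd() >= target.signedEnd()) {
                result.add(s.fieldName);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<SignatureInfo> all = Arrays.asList(
                new SignatureInfo("Sig1", 0, 100, 200, 300),
                new SignatureInfo("Sig2", 0, 600, 700, 400),
                new SignatureInfo("Sig3", 0, 1200, 1300, 700));
        System.out.println(signaturesToRemove(all, "Sig2"));
    }
}
```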
For some more technical backgrounds on integrated PDF signatures cf. this answer and documents referenced from there.
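Regarding the first caveat above, a quick way to see why an incremental update does not really remove anything is to list where the file's revisions end. The sketch below is a heuristic with made-up names: it simply searches the raw bytes for %%EOF markers. It assumes a non-linearized file and ignores the possibility that the marker text also occurs inside stream data, so treat its output as a hint only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch: locate candidate revision boundaries in a PDF by its %%EOF markers. */
final class RevisionScanner {

    /** Returns the offset just past each %%EOF marker found in the file bytes. */
    static List<Integer> revisionEnds(byte[] pdf) {
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        List<Integer> ends = new ArrayList<>();
        for (int i = 0; i + marker.length <= pdf.length; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (pdf[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                ends.add(i + marker.length);
                i += marker.length - 1; // continue searching after this marker
            }
        }
        return ends;
    }

    public static void main(String[] args) {
        byte[] fake = "A%%EOF\nB%%EOF\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("candidate revision ends: " + revisionEnds(fake));
    }
}
```

If more than one boundary shows up after your edit, cutting the file off at an earlier boundary would typically bring back the earlier revision, signature included, which is exactly what a full (non-incremental) save avoids.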
Background
The idea is this:
Person provides contact information for online book purchase
Book, as a PDF, is marked with a unique hash
Person downloads book
PDF passwords are easy to circumvent, or share
The ideal process would be something like:
Generate hash based on contact information
Store contact information and hash in database
Acquire book lock
Update an "include" file with hash text
Generate book as PDF (using pdflatex)
Apply hash to book
Release book lock
Send email with book download link
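The first step of that process could be sketched as follows. This is only one possible reading of "generate hash based on contact information": it assumes a keyed hash (HMAC-SHA256) with a secret kept on the server, so that a customer cannot recompute or forge a mark from contact data alone. The class name, method name, and the way the contact fields are joined are all invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Sketch: derive a per-purchase mark from contact information. */
final class PurchaseMark {

    /**
     * Returns a 64-character hex string derived from the contact information.
     * Assumption: serverSecret is a private key known only to the shop.
     */
    static String markFor(String contactInfo, byte[] serverSecret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(serverSecret, "HmacSHA256"));
            byte[] digest = mac.doFinal(contactInfo.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            // HmacSHA256 is a standard algorithm name, so this is not expected in practice.
            throw new IllegalStateException("HMAC-SHA256 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] secret = "example-secret-change-me".getBytes(StandardCharsets.UTF_8);
        System.out.println(markFor("Jane Doe|jane@example.com", secret));
    }
}
```

The resulting string is what would be stored in the database and written to the "include" file before running pdflatex. It identifies a purchase but does nothing by itself about collusion resistance or robustness against conversion.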
Technologies
The following technologies can be used (other programming languages are possible, but libraries will likely be limited to those supplied by the host):
C, Java, PHP
LaTeX files
PDF files
Linux
Question
What programming techniques (or open source software) should I investigate to:
Embed a unique hash (or other mark) to a PDF
Create a collusion-attack resistant mark
Develop a non-fragile (e.g., PDF -> EPS -> PDF still contains the mark) solution
Research
I have looked at the following possibilities:
Steganography
Natural Language Processing (NLP)
Convert blank pages in PDF to images; mark those images; reassemble PDF
LaTeX watermark package
ImageMagick
Issues
The possible solutions I have researched have the following issues:
Steganography. (a) Requires a master copy of the images, which are converted to EPS, which is CPU-intensive and time-consuming; (b) would the watermark survive PDF -> EPS -> PDF, or other types of conversion; (c) most images are drawings or screen captures, not photographs in PNG format.
LaTeX. Creates an image cache; any steganographic solution would have to intercept that process somehow.
NLP. Introduces grammatical errors; could change meaning of technical words.
Blank Pages. Immediately suspect; it is easy to replace suspicious blank pages.
Watermark Package. Draws visible marks.
ImageMagick. Draws visible marks.
What other solutions are possible?
Related Links
http://www.tcpdf.org/
invisible watermarks in images
Thank you!
I've done this for another project with PDFlib. We needed traceability for the generated PDFs in case the file was leaked. Basically:
Created a source template PDF with the content in place and set the document master password with the required options (no edit, no print, no screen-reader, etc...)
At runtime, we applied a few watermarks (imposed page footer saying "This document checked out to user #12345", set a few of the metadata fields with user ID, download IP, download date/time, added a "this document copyright by..." cover page, etc...)
Optionally attach a user password to force a PW prompt when document is opened.
Since the latest PDF versions use AES-128 for their encryption, we just set a suitable randomly generated 128char high-entropy password - no one would ever be typing it in by hand so hard-to-typedness was irrelevant to us and actually preferable. The master password prevented end-users from making any changes to the document. The various noprint/no screen read options are actually enforced by the PDF reader and therefore bypassable, but can't hurt to set them anyways.
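Generating such a password is straightforward. The sketch below is not the code we used; it is a minimal Java illustration with made-up names, assuming an alphanumeric alphabet and the platform's SecureRandom as the entropy source.

```java
import java.security.SecureRandom;

/** Sketch: generate a long random master password that nobody has to type. */
final class RandomPassword {

    // Assumption: plain ASCII letters and digits, to stay clear of encoding issues.
    static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    private static final SecureRandom RANDOM = new SecureRandom();

    static String generate(int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate(128));
    }
}
```

How many of these characters actually contribute to the encryption key depends on the security handler revision of the PDF, so check what your library does with long passwords before relying on the full length.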
The downside to this is that PDFlib's licensing is fairly steep. I don't know if any of the free PHP PDF libraries support the latest PDF encryption schemes, especially the master password stuff, but if your budget can support it, PDFlib's the way to go for secure document production.