How to interpret signatureCoversWholeDocument() == false?

When trying to validate a certain pdf's signature, I used the following code.
Validation results look good, apart from one thing: SignatureUtil#signatureCoversWholeDocument returns false.
It's obvious what that means. But I'm not sure about how to interpret this.
How can I determine which parts of the document aren't covered by the signature?
Can some evil guy change the document's content (if it's uncovered) while still keeping a valid signature?
In other words: how can I make sure this is nothing to worry about?

You say that it's obvious what it means that SignatureUtil#signatureCoversWholeDocument returns false, but just to be sure, first some background.
What Does It Mean When a PDF Signature Does Not Cover the Whole Document
At the moment they are applied, PDF signatures cover their respective whole document (except, of course, the embedded signature container itself, or more exactly a placeholder for it which might be a bit larger):
The ranges of signed bytes (from the file start to the placeholder start and from after the placeholder end to the file end) are specified in the signed part of the PDF.
Now the PDF format allows adding to a PDF document not only by re-building the whole document from scratch but alternatively also by adding changes after its end in so called incremental updates.
As the byte ranges signed by a PDF signature are specified in the document, this mechanism can even be used to add changes to a signed PDF without cryptographically breaking the signature.
This mechanism can be used for example to apply multiple PDF signatures to a document:
But the mechanism can also be used for a myriad other kinds of changes.
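In code, the coverage check the question asks about can be sketched as follows. This is a hypothetical helper, not iText's actual implementation: it regex-scans the raw bytes for a single /ByteRange array, whereas real validators parse the PDF object structure and handle multiple signatures.

```python
import re

def signature_covers_whole_document(pdf_bytes: bytes) -> bool:
    """Sketch: a signature covers the whole document if the first signed
    range starts at offset 0 and the second signed range ends exactly at
    the end of the file. (Hypothetical helper; real validators parse the
    signature dictionary instead of regex-scanning the raw bytes.)"""
    m = re.search(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]",
                  pdf_bytes)
    if m is None:
        raise ValueError("no /ByteRange found")
    start1, len1, start2, len2 = (int(g) for g in m.groups())
    return start1 == 0 and start2 + len2 == len(pdf_bytes)
```

If bytes have been appended after the second signed range (an incremental update), this returns False even though the cryptographic validation of the signed ranges still succeeds.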
Allowed and Disallowed Changes
If one can add arbitrary changes to a signed PDF without breaking the signature cryptographically, one may wonder what the value of the signature is to start with.
Of course one can always extract and display/process the PDF revision a PDF signature covers (simply take the partial document from its start to the end of the second signed byte range). Thus, it is clear what the original PDF fully covered by the signature looked like. So a signed PDF can be considered a collection of logical documents, loosely based on one another: for each signature the document covered by it plus (if there are additional unsigned additions) the full document.
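The extraction just described can be sketched like this (again a hypothetical helper assuming a single /ByteRange; real code would parse the signature dictionary properly):

```python
import re

def extract_signed_revision(pdf_bytes: bytes) -> bytes:
    """Sketch: return the document revision covered by the (first)
    signature, i.e. everything from the file start to the end of the
    second signed byte range."""
    m = re.search(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]",
                  pdf_bytes)
    if m is None:
        raise ValueError("no /ByteRange found")
    _, _, start2, len2 = (int(g) for g in m.groups())
    return pdf_bytes[:start2 + len2]
```

The result is itself a complete PDF file and can be displayed or processed like any other.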
There actually are use cases where this makes sense, for example a document being created by a number of authors, each signing off on their respectively edited version of the document.
But the number of use cases in which that view is too diffuse is larger (or at least more important) still. In particular there are numerous use cases with multiple signatures in which one wants a PDF to represent a single logical document signed by multiple persons, at most with a few extra form fill-ins after the first signature.
To support such use cases the PDF specification defines a number of sets of allowed changes. Such a set can be selected by the first signature of the document. For details on these sets of allowed changes see this answer. In particular such allowed changes may encompass
adding or editing annotations,
supplying form field values, or
digitally signing.
Determining the Changes in a PDF and Checking Whether They Are Allowed in Practice
In light of the previous section, the question of the OP boils down to how one can determine the nature of the changes in incremental updates and how one can determine whether they are allowed or disallowed.
Determining which low-level objects in a PDF have changed actually is not that difficult; see the PdfCompare and PdfRevisionCompare classes in this answer.
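As a rough illustration of how one can enumerate the revisions to compare: each incremental update terminates with its own %%EOF marker, so the file prefixes ending at those markers approximate the successive revisions. This is a sketch only; robust tools follow the cross-reference /Prev chain rather than scanning for markers (which can also appear inside streams).

```python
def split_revisions(pdf_bytes: bytes) -> list:
    """Sketch: collect the file prefixes ending at each %%EOF marker as
    an approximation of the document's successive revisions."""
    revisions = []
    pos = 0
    while True:
        idx = pdf_bytes.find(b"%%EOF", pos)
        if idx == -1:
            break
        end = idx + len(b"%%EOF")
        # include the end-of-line characters after the marker, if present
        while end < len(pdf_bytes) and pdf_bytes[end:end + 1] in (b"\r", b"\n"):
            end += 1
        revisions.append(pdf_bytes[:end])
        pos = end
    return revisions
```

Comparing the parsed objects of consecutive revisions then yields the low-level changes of each incremental update.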
The real problem is checking whether these changes in low-level objects can be considered to only serve an allowed change as specified (or do not change the document semantically at all)!
Here even the "gold standard" (i.e. Adobe Acrobat) errs again and again, both in failing to recognize disallowed changes (see e.g. the "Attacks on PDF Certification" on pdf-insecurity.org for examples that have since been fixed) and in failing to recognize allowed changes (see e.g. here and here).
Essentially this is a very difficult task. And it is very unlikely you will find a good implementation in some open source project.
In particular iText 7 does not include such an analysis. If you want it, therefore, you'll have to implement it yourself.
You can simplify the task a bit if you expect the changes to be applied by a specific software. In that case you can analyze how (in terms of low-level objects) that software applies allowed changes and only accept such low-level changes. For example Adobe Acrobat is really good at recognizing allowed changes applied by Adobe software to PDFs created by Adobe software.

Related

inserting an entire pdf into another by raw text manipulation

I need to include a pdf into another pdf that is being created by text manipulation, not through a package. (In particular, I'm using livecode, which is well suited to the generation of the information I need, and can easily do text manipulation).
Once included, I will be adding additional objects (primarily text, but also a few small squares).
I only need to be able to access the included pdf by page and area, such as (200,200) to (400,400) of page 5; I don't need any access to its objects.
Simply appending to the pdf won't do the job, as I'll actually be including multiple source pdfs into a single pdf output with my addition.
I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it. In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
Can this be done? Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
Something that could be used on a one-time basis to convert an entire multi-page pdf would also be a usable (but inferior) solution.
I've found that search engines aren't yielding usable results, as they are swamped with solutions for individual products, and not the pdf itself.
First of all, PDFs in general are not text data, they are binary. They may look textual as they contain identifiers built from ASCII values of words, but treating them as text, unless one and one's tools are extremely cautious, is a sure way to damage them.
But even if we assume such caution, unless your input PDFs are internally of a very simple and similar structure, creating code that allows merging them and manipulating their content is, complexity-wise, akin to creating a generic PDF library/package.
I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it.
Putting them into one indirect object each would work if you needed them merely as an unchanged attachment. But you want to change them.
In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
You will at least have to parse ("disassemble") the objects related to the pages you want to manipulate, add the manipulated versions thereof, and add cross references for the changed objects.
And you only mention cross reference tables. Don't forget that in case of a general solution you also have to be able to handle cross reference streams and object streams.
Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
If you really want to merge the source PDFs into a target one, you'll indeed need to renumber the objects from most source PDFs.
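As a sketch of what that renumbering amounts to, assuming uncompressed, simply built source PDFs (a naive regex pass like this breaks on object streams and on string or stream contents that happen to look like references, which is exactly why a real library is usually needed):

```python
import re

def renumber_objects(body: bytes, offset: int) -> bytes:
    """Sketch: shift every object number in 'N G obj' definitions and
    'N G R' indirect references by a fixed offset, so that a second
    source PDF's objects no longer collide with the first one's."""
    def shift(match):
        num, gen, keyword = match.groups()
        return b"%d %s %s" % (int(num) + offset, gen, keyword)
    return re.sub(rb"(\d+)\s+(\d+)\s+(obj|R)\b", shift, body)
```

After renumbering, the cross-reference table must of course be rebuilt with the new byte offsets of the shifted objects.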
If as a target a portable collection (aka portfolio) of the source PDFs would suffice, you might not need to do that. In that case you merely have to apply the changes you want to the source PDFs (by means of incremental updates, if you prefer), and then combine all those manipulated sources in a result portfolio.
I've found that search engines aren't yielding usable results
The cause most likely is that you underestimate the complexities of the format PDF. Combining and manipulating arbitrary existing PDFs usually requires you to use a third-party library or create the equivalent of such a library yourself.
Only manipulating existing PDFs is a bit easier, and so is combining PDFs in a portfolio. Nonetheless, even in this case you should have studied the PDF specification quite a bit.
Restricting oneself to string manipulations to implement this makes the task much more complex - I'd say impossible for generic PDFs, risky even for PDFs of a simple and similar build.

Setting text to be read column-wise by screen-reader in iText 7

I have a page in my PDF that consists of several columns. I would like the screen-reader to read each column individually before moving on to the next column. Currently it just reads the text that appears from left to right. Is there any way to do this in iText 7?
The answer depends on whether you create this document yourself with iText or you want to fix this issue in an already existing PDF document.
In the first case you simply need to specify that you want to create the document's logical structure along with the document content. To achieve this, call the PdfDocument#setTagged() method upon creation of the PdfDocument instance. The document's logical structure is what tools like screen readers rely on to get the correct logical order of the contents.
In the second scenario, when you already have a document with several columns but its reading order is messed up, it is most likely that this document doesn't have a proper logical structure in it (in other words, it is not tagged properly). The task of fixing this issue in an already existing PDF document (sometimes called structure recognition) is extremely difficult in the general case and cannot currently be performed automatically. There are several tools that allow you to fix such documents manually or semi-automatically (like Adobe Acrobat), but iText 7 doesn't provide structure recognition functionality right now.

How to create your own package for interaction with word, pdf etc

I know that there are a lot of packages around which allow you to create or read e.g. PDF, Word and other files.
What I'm interested in (and never learned at the university) is how you create such a package? Are you always relying on source code being given by the original company (such as Adobe or Microsoft), or is there another clever way of working around it? Should I analyze the individual bytes I see in e.g. PDF files?
It varies.
Some companies provide an SDK ("Software Development Kit") for their own data format, others only a specification (e.g., Adobe for PDF, Microsoft for Word), and it's up to the software developer to make sure to write a correct implementation.
Since that can be a lot of work – the PDF specification, for example, runs to over 700 pages and doesn't go deep into practically required material such as LZW, JPEG/JPEG2000, color theory, and math transformations – and you need a huge set of data to test against, it's way easier to use the work that others have done on it.
If you are interested in writing a support library for a certain file format which
is not legally protected,
has no, or only sparse (official) documentation,
and is not already under deconstruction elsewhere,^a
then yes: you need to
gather as many different files as possible;
from as many different sources as possible;
(ideally, you should have at least one program that can both read and create the files)
inspect them on the byte level;
create a 'reader' which works on all of the test files;
if possible, interesting, and/or required, create a 'writer' that can create a new file in that format from scratch or can convert data in another format to this one.
There is 'cleverness' involved, mainly in #3, as you need to be very well versed in how data representation works in general. You should be able to tell code from data, and string data from floating point, and UTF8 encoded strings from MacRoman-encoded strings (and so on).
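Byte-level inspection typically starts with magic numbers plus a crude text/binary heuristic; a minimal sketch (the covered formats are just illustrative examples, and real deconstruction goes much further):

```python
def sniff(data: bytes) -> str:
    """Sketch: guess a file's nature from well-known magic numbers,
    falling back to a crude printable-ASCII heuristic."""
    magics = {
        b"%PDF-": "pdf",
        b"PK\x03\x04": "zip-based (docx, xlsx, jar, ...)",
        b"\x89PNG\r\n\x1a\n": "png",
        b"\xff\xd8\xff": "jpeg",
    }
    for magic, name in magics.items():
        if data.startswith(magic):
            return name
    # mostly printable ASCII -> probably a text format
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in data)
    return "text?" if data and printable / len(data) > 0.95 else "unknown binary"
```

Telling code from data, and one string encoding from another, then requires looking at larger structures than a single header, but the first classification step usually looks like this.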
I've done this a couple of times, primarily to inspect the data of various games, mainly because it's huge fun! (Fair warning: it can also be incredibly frustrating.) See Reverse Engineering's Reverse engineering file containing sprites for an example approach; notably, at the bottom of my answer in there I admit defeat and start using the phrases "possibly" and "may" and "probably", which is an indication I did not get any further on that.
^a Not necessarily, of course. You can cooperate with others whose expertise lies elsewhere, or even do "grunt work" for existing projects - finding out and codifying fairly trivial subcases.
There are also advantages to working independently of existing projects. For example, with the experience of my own PDF reader (written from scratch), I was able to point out a bug in PDFBox.

embed serial number to PDF file?

To prevent the casual distribution of a PDF document, is there any way to do something like embedding a serial number in the file?
My idea is to embed an id bound to the user, making it possible to find out who distributed the file.
I know this doesn't prevent distribution, but it may discourage casual distribution to a certain degree.
Any other solution is also welcome.
Thanks!
The common way is placing metadata, but you can easily remove that.
Let's look for hideouts (most of them low-level):
Non-marking text
Text under overlapping objects
Objects from older revisions (not noticed by the reader, but still there, carrying redundant information)
Marks in streams between BX and EX (weird information from the reader's point of view)
Information before %PDF-X
Information after %%EOF
Substitution of names for some elements (like font names)
Steganography
Manipulation of the embedded fonts
Whitespacing
Images with steganography
My favorites are steganography and a BX-EX block within a stream; with proper compression and/or encryption it is hard to find (if you do not know where it is). To make the search harder, wrap some normal blocks in BX-EX as well.
Some of these ways are easy to remove, some harder, but a determined attacker will be able to find and sanitize them all. Think about copy-pasting the text or printing through a PDF printer.
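As a concrete sketch of one of the listed hideouts, data after %%EOF (helper names are made up; most viewers ignore trailing data, and any rewrite of the file strips it, so this is only a low-effort deterrent):

```python
def embed_after_eof(pdf_bytes: bytes, marker: bytes) -> bytes:
    """Sketch: append an identifying marker as a comment line after
    the final %%EOF. Viewers generally ignore trailing data."""
    return pdf_bytes.rstrip(b"\r\n") + b"\n%" + marker + b"\n"

def recover_marker(pdf_bytes: bytes) -> bytes:
    """Read the comment line following the last %%EOF, if any."""
    idx = pdf_bytes.rfind(b"%%EOF")
    if idx == -1:
        return b""
    tail = pdf_bytes[idx + len(b"%%EOF"):].strip()
    return tail.lstrip(b"%") if tail.startswith(b"%") else b""
```

For anything meant to survive even a sloppy attacker, the marker would additionally need to be encrypted and replicated in several of the other hideouts.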
You can render transparent text. You can write text outside the media box of a page. You can add custom document property. There are plenty of ways to do this.
Why not create a digital id on the documents?

Populating PDF fields from a database

I have a PDF file (not created by me - I have no control over the design etc.) which allows users to fill in some form fields in Adobe Reader and save the result. I want to automate the process of populating the fields, using the following steps:
Fetch data from database.
Open PDF template.
Populate form fields with data.
Save modified file to a separate location on disk.
Lock modified file so that the form fields can no longer be edited.
Send file to user.
I'm happy to use PHP, Perl, Python or Java to do steps 2-5 (in descending order of preference), but whatever I use has to work under Linux (i.e. it mustn't rely on libraries which are only available on Windows for example).
The end result should be a PDF which the average user can open and print, but not modify (I'm sure advanced users could find a way to do so, but I accept that I can't guarantee complete security against modification). I don't want to change the structure of the PDF, merely populate the form fields.
Is there a standard piece of software for doing this? I've seen mentions of FDF Toolkit, but I'm not entirely sure if that's what I want and whether it will allow me to lock the file afterwards, and whether what I want to do fits in with the EULA.
Edit: Final answer is to use iText (as suggested by Mark Storer) but to implement it as a web service which allows you to pass in an array of form field names and values and the PDF file 'template'. The web service will be open source (and available on GitHub once I've written it), as per the AGPL, but anything connecting to it won't have to be.
Filling
Any number of different libraries can fill in field values. I'm partial to iText (Java) or iTextSharp (C#). I wrote one in Java a number of years ago; it's not that hard. There are lots. Search SO, you'll find 'em.
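Besides full PDF libraries, the field data itself can also travel as an FDF file, which is simple enough to emit without any library; the FDF Toolkit mentioned in the question (or tools like pdftk) can then merge it with the template. A minimal sketch, where the field names and template path are made-up examples:

```python
def make_fdf(fields: dict, template: str) -> bytes:
    """Sketch: build a minimal FDF file carrying field name/value
    pairs plus a reference to the PDF form it belongs to."""
    def esc(s: str) -> str:
        # escape characters that are special inside PDF literal strings
        return s.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")
    entries = "".join("<< /T (%s) /V (%s) >>\n" % (esc(k), esc(v))
                      for k, v in sorted(fields.items()))
    fdf = ("%FDF-1.2\n"
           "1 0 obj\n<< /FDF << /Fields [\n" + entries +
           "] /F (" + esc(template) + ") >> >>\nendobj\n"
           "trailer\n<< /Root 1 0 R >>\n%%EOF\n")
    return fdf.encode("latin-1")
```

This only covers the filling step; locking or flattening still needs a library that can rewrite the PDF itself.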
Locking
There are a couple different levels of "lock the fields".
Each field has a "read only" flag. This is pretty much a courtesy as far as other libraries capable of setting field values are concerned. In fact, it's generally considered to mean "the ui cannot make changes". Form script can, regardless.
Form flattening: drawing the fields directly into the page and removing all the interactivity.
Each one has pros and cons.
Flag: None too secure. Form data still easily accessible. Scrolling fields still scroll.
Flattening: Pretty much the exact opposite. It's harder to modify (though far from impossible). The form data can only be extracted via text extraction (which is hard, but becoming increasingly common). List & text fields that contain more stuff than is visible will no longer scroll.
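For the read-only flag: per the PDF specification it is bit 1 of the field's /Ff integer, so "locking" a field is a single bit operation, and unlocking it is just as easy, which is why it only counts as a courtesy. A sketch of the flag arithmetic:

```python
READ_ONLY = 1  # bit 1 of the field flags (/Ff) entry

def set_read_only(ff: int) -> int:
    """Sketch: mark a form field read-only by setting bit 1 of its /Ff
    value; any library that can rewrite the field dictionary can clear
    it again just as easily."""
    return ff | READ_ONLY
```

A library merely has to write the updated integer back into the field dictionary, which is why this level of locking offers no real protection.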
The ability to flatten forms is relatively rare. Again, iText can do it (as can iTextSharp), but I'm not aware of any other third party libraries that can... I'm sure they exist, I just can't name them off the top of my head.