Difference between the ID of a pdf read from iTextSharp and pdf.js - pdf

I am trying to parse the ID of a particular pdf (this) using iTextSharp as mentioned in this answer. But I get a null array for the ID, whereas I can see that another PDF reader (pdf.js) can read the id as 77a2a5c4fc17dc3a91a072c46fe69ec0. Why is this behaviour different? Am I expected to read the ID field from some place other than the trailer?

Open a regular PDF with an ID in a text editor like this:
Right before where it says startxref, you see a dictionary (it starts with <<). That's the trailer dictionary of the PDF. One of the (optional) entries is the /ID which is an array containing two PDF strings.
If your PDF has such an entry, then the answer to the question Extract ID of a PDF document using iTextSharp won't return null.
Now open your PDF in a text editor:
Again you see a dictionary (the trailer dictionary) before startxref. However, in this case, the dictionary only has three entries: /Size (the number of objects in the cross-reference table), /Info (a reference to the dictionary containing the metadata) and /Root (a reference to the catalog dictionary).
There is no /ID entry, hence iText (and iTextSharp) should return null (and you confirmed that they do).
Now search for the value 77a2a5c4fc17dc3a91a072c46fe69ec0 in the PDF you've opened in a text editor. You won't find that value anywhere because it's just not there!
Summarized: your question Am I expected to read the ID field from some place other than the trailer? is wrong. You are asking how to read something that isn't there. Your question should be: Why is pdf.js creating an ID for PDFs that don't have one, and how do I retrieve it? The answer to the first part is reasonable: even iText tries to create an /ID when you manipulate a PDF because it's good practice for a PDF to have an ID. The answer to the second part is: look in the trailer (but you already knew that).
Conclusion: based on feedback in the comments, it turns out that the OP is using the fingerprint() method in pdf.js. This method returns the first element of the ID if an ID is present. If no ID is found, an MD5 hash is returned. See the source code of the fingerprint() method in pdf.js.
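The fallback described above can be sketched in Java. This is only an illustration of the logic, not the pdf.js implementation: the class and method names are hypothetical, and since it is not stated here which bytes of the file pdf.js feeds into the MD5 hash, the sketch simply hashes whatever bytes the caller passes in.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FingerprintSketch {
    // Hypothetical helper (not pdf.js or iText API): returns the hex form of the
    // first /ID element when one exists, otherwise the MD5 hash of the given bytes.
    static String fingerprint(byte[] firstIdElement, byte[] bytesToHash) {
        byte[] source = firstIdElement;
        if (source == null || source.length == 0) {
            try {
                // Assumption: the caller decides which part of the file is hashed.
                source = MessageDigest.getInstance("MD5").digest(bytesToHash);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : source) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] content = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        // An /ID is present: its first element is returned as hex.
        System.out.println(fingerprint(new byte[] {0x77, (byte) 0xa2}, content));
        // No /ID: a 32-character MD5 hex string is returned instead.
        System.out.println(fingerprint(null, content));
    }
}
```

A value obtained through the fallback branch will never be found by searching the PDF in a text editor, which matches what was observed above.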

Related

How do PDF readers validate form fields?

I was looking at the source code of several pdf files which were digitally signed (and had annotations and form fields as well).
I noticed that each "Annot" dictionary has a "M" value which stores the latest time it was modified - which can then be checked with the "M" value for the "Sig" dictionary which stores when the pdf file was digitally signed.
However, I noticed that dictionaries with type "XObject" and subtype "Form" do not have an "M" value - i.e. do not store the time at which said form was modified. In such cases, how do pdf readers validate whether a change to form field is allowed (for eg, in a digital sign where no changes are allowed, no form field can be changed after the digital sign is done - how is this verified?)?
I just attached an example pdf at this link:
https://www.mediafire.com/file/q8ed9rkf35kgxgq/output.txt/file
Some Misconceptions
There apparently are a number of misconceptions to clear up here.
I noticed that each "Annot" dictionary has a "M" value which stores the latest time it was modified - which can then be checked with the "M" value for the "Sig" dictionary which stores when the pdf file was digitally signed.
The M entry of annotation dictionaries is optional, so you cannot count on it being there.
The format of the annotation M value essentially only is a String; it is merely recommended to contain dates as specified in the PDF specification but not required, so you might find a value like "my mother's 42nd birthday" in it.
The annotation M value is not backed by a digital timestamp, so a forger could put anything there.
Furthermore, the M entry of a signature dictionary also is optional, and by itself it also cannot be trusted.
Thus, no, this represents no means to validate anything.
However, I noticed that dictionaries with type "XObject" and subtype "Form" do not have an "M" value - i.e. do not store the time at which said form was modified. In such cases, how do pdf readers validate whether a change to form field is allowed (for eg, in a digital sign where no changes are allowed, no form field can be changed after the digital sign is done - how is this verified?)?
First of all, as explained above, the M values cannot be used at all for modification detection, so whether some objects do or do not have them, is irrelevant.
Furthermore, a form XObject by itself is not a form field meant by the document modification detection and prevention settings of a signature. The form fields meant are AcroForm form fields (or, in a deprecated special case, XFA form fields). A form XObject may be used as appearance of an AcroForm form field but in that case the pivotal check point is the form field itself.
How to Validate Changes
(For some background on PDF signatures you may want to read this answer first.)
Depending on the document modification detection and prevention (DocMDP) settings of the signatures of a document only certain changes are allowed to a document, see this answer.
But even the allowed changes may not be applied by changing the original objects in the PDF. That would after all change the digest over the originally signed byte ranges and so invalidate the signature. Thus, the changed and added objects are appended to the PDF, capped off by a cross reference table or streams for these objects.
Thus, what you have to do for DocMDP validation is determine the extent of the signed revision in the PDF file, find out that way what has been appended, and analyze whether those additions change the signed revision in allowed or disallowed ways.
While this may sound simple at first, it is not, in particular because "allowed" and "disallowed" changes are characterized by their effects on document appearance and behavior, not by the actual PDF objects that may be affected.
Here currently ETSI working groups are attempting to transform those characterizations into criteria for PDF objects; the results are to be published as ETSI TS 119 102-3, probably in multiple parts.
Some Details
In comments you asked
how do you tell from the appearance of a modified object, whether it was added before or after the digital sign?
Well, as mentioned above:
First you determine the extent of the signed revision in the PDF file.
I.e. you take the ByteRange entry of the signature dictionary and take the start of the lower range and the end of the higher range. E.g. if that entry is
[ 0 67355 72023 6380 ]
then the signed revision starts at offset 0 and ends at offset 72023+6380-1=78402 inclusive.
(Obviously some sanity checks are indicated, in particular that the start offset is 0, that the gap in the signed byte ranges exactly contains the signature dictionary Contents value, that the signed revision as a whole is a valid PDF and all its cross references point to indirect objects completely contained in that signed revision, and that that signed revision indeed is a previous revision of the whole PDF, i.e. that the chain of cross reference streams or tables contains the cross reference stream or table of that revision.)
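The offset arithmetic can be sketched in Java as follows. The class and method names are hypothetical, and looksSane covers only the simplest of the sanity checks listed above (start at 0, a non-empty gap, ranges inside the file); it does not check the Contents value or the cross references.

```java
final class ByteRangeSketch {
    // byteRange = [start1, length1, start2, length2], as in the /ByteRange entry.
    static long lastSignedOffset(long[] byteRange) {
        return byteRange[2] + byteRange[3] - 1;
    }

    // The gap between the two ranges is where the Contents value has to sit.
    static long gapStart(long[] byteRange) {
        return byteRange[0] + byteRange[1];
    }

    static long gapLength(long[] byteRange) {
        return byteRange[2] - gapStart(byteRange);
    }

    // Partial sanity check only; a real validator has to do much more.
    static boolean looksSane(long[] r, long fileLength) {
        return r.length == 4
                && r[0] == 0
                && r[1] > 0
                && r[2] > r[0] + r[1]
                && r[3] > 0
                && r[2] + r[3] <= fileLength;
    }

    public static void main(String[] args) {
        long[] byteRange = {0, 67355, 72023, 6380};
        System.out.println(lastSignedOffset(byteRange)); // 78402
        System.out.println(gapStart(byteRange));         // 67355
        System.out.println(gapLength(byteRange));        // 4668
    }
}
```

Everything in the file after the last signed offset belongs to later revisions and is what has to be analyzed.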
Then you find out that way what has been appended.
I.e. you compare the cross reference stream or table of the whole file and the cross reference stream or table of the signed revision.
If some object is referenced for an object number now but was not referenced for that object number in the signed revision, you found a change to check.
(Strictly speaking you should iterate along the chain of cross reference streams or tables from the signed revision to the whole file, i.e. revision by revision, and check the changes in each revision.)
For this procedure you obviously have to use the original file, not some version uncompressed by tools like qpdf, otherwise you cannot do the offset tests.
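As a rough illustration of the comparison step, the following Java sketch models each set of cross references as a map from object number to byte offset. This is a strong simplification and an assumption of the sketch: generation numbers, free entries and object streams are ignored, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class XrefDiffSketch {
    // Returns the object numbers whose entry in the whole file differs from,
    // or does not exist in, the cross references of the signed revision.
    static SortedSet<Integer> changedObjects(Map<Integer, Long> signedRevision,
                                             Map<Integer, Long> wholeFile) {
        SortedSet<Integer> changed = new TreeSet<>();
        for (Map.Entry<Integer, Long> entry : wholeFile.entrySet()) {
            Long offsetWhenSigned = signedRevision.get(entry.getKey());
            // A missing entry (null) or a different offset means a change to check.
            if (!entry.getValue().equals(offsetWhenSigned)) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<Integer, Long> signed = Map.of(1, 15L, 2, 80L, 3, 200L);
        Map<Integer, Long> whole = Map.of(1, 15L, 2, 80L, 3, 78500L, 4, 78900L);
        System.out.println(changedObjects(signed, whole)); // prints [3, 4]
    }
}
```

The objects reported this way are only the candidates; deciding whether each change is allowed is the hard part described above.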
Is it possible for an attacker to add a new annotation object before the xref table corresponding to the digital signature, and adjust the previous xref table values, so that a broken document passes as an accepted document?
No. The signed revision including its cross references is covered by the signature. Manipulating those bytes will invalidate the signature mathematically.

PDFBOX2.X - Read wrong Data from acroForm

We are facing an issue while reading data from an AcroForm.
We are using PDFBox 2.x to create the PDF file. Our goal was to make the PDF "executable", meaning that users can download a PDF file that contains an AcroForm; we collect data in it and can later upload it to sync with the database.
The issue shows up in the PDFBox debugger, or rather in the uploaded file: a text box value changes by itself.
Please see more details in the image below.
We used the PDF Debugger tool to check the PDF content. You can see that the totalBuyoffCount value is 0, but it should be 3.
I have used iTextDebugger to check same field
The behavior is totally random, and we have noticed the following things:
Sometimes 0 or 1 value became N or Y
Very few fields are affected, but it causes a NumberFormatException when the String value is converted.
It makes our whole file corrupted.
If it cannot be fixed, could you tell us where to look so that we can understand and debug why the value changes, or where the value is retrieved from? If we find that, we can change or override this behavior.
Looking at the PDF object in question (1499, gen 0) one finds
1499 0 obj
<<
/FT /Tx
/Q 0
/V (0)
/Ff 1
/Rect [0.0 1.0 1.0 1.0]
/Type /Annot
/Subtype /Widget
/T <77A2A671303FC282631C0C903EA8F40F>
/DA <2C85B77C2A5D81D53C5A3EB571EDBA1C>
/F 6
>>
endobj
So one might be tempted to say you see "/V (0)", and not 3.
While this is correct, it unfortunately does not mean a lot because the file is encrypted!
Thus, the question boils down to whether the string 0 in object 1499, generation 0, decrypts to "0" or to "3".
I have not implemented a PDF decrypter myself, so I cannot claim to check this with my own code.
The second best I can do is check against what Adobe decrypts that value to. My good old Adobe Acrobat 9.5 Preflight shows:
Apparently Adobe just like iText decrypts this 0 to "3". Additional checks with an online PDF decrypter or two support this decryption result.
Thus, it appears that PDFBox does not properly decrypt this 0 string.
Considering the OP's further observation "Sometimes 0 or 1 value became N or Y Very few fields are affected" it looks like PDFBox sometimes does not correctly decrypt single character strings.
An alternative option would be that there is some issue in the encryption parameters of the file in question. I don't really believe this but I cannot preclude it.
The bug
As Tilman already hinted at in his comments to PDFBOX-4453, the bug is due to the way PDFBox and in particular the SecurityHandler keeps track of which objects already have been decrypted and which still have to be: The already decrypted objects are stored in the HashSet SecurityHandler.objects; when asked to decrypt an object, SecurityHandler.decrypt first checks whether that object is in that set, and only if it is not, it is actually decrypted and added to the set.
Thus, if a still encrypted object equals an already decrypted one, a call to decrypt this encrypted object won't do anything at all.
In the file at hand there has been a string before that has been decrypted to "0". Thus, when the encrypted value of totalBuyoffCount, 0, is sent to the SecurityHandler for decryption, the value falsely is assumed to already be decrypted, so it remains as it is.
For longer strings this usually is no issue as their encrypted versions usually are completely garbled, so they won't be found among the already decrypted objects. Short strings, in particular single-character ones, on the other hand might have encrypted versions that make sense, so collisions may happen.
Options to fix this are discussed in the referenced Apache Jira issue. One option would be to replace the mentioned set by a flag of the individual objects in question but other options also are possible.
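The collision can be reproduced with a toy model in Java. Nothing here is PDFBox code: PdfStr, toyDecrypt and the other names are hypothetical stand-ins, and the "cipher" is a fixed mapping chosen only so that an encrypted "0" should decrypt to "3". An identity-based set is used for the contrast case; it is just one way to avoid the collision and is not the per-object flag mentioned above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

final class DecryptedTrackingSketch {
    // Stand-in for a PDF string object whose equality is based on its content.
    static final class PdfStr {
        String value;
        PdfStr(String value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof PdfStr && ((PdfStr) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
    }

    // Toy cipher: the encrypted text "0" decrypts to "3", anything else to "0".
    static String toyDecrypt(String cipherText) {
        return cipherText.equals("0") ? "3" : "0";
    }

    // Mirrors the described logic: skip objects the tracker already contains.
    static void decrypt(PdfStr s, Set<Object> alreadyDecrypted) {
        if (alreadyDecrypted.contains(s)) {
            return;
        }
        s.value = toyDecrypt(s.value);
        alreadyDecrypted.add(s);
    }

    // Decrypts an earlier string first, then the still-encrypted "0".
    static String run(Set<Object> tracker) {
        PdfStr earlier = new PdfStr("x"); // decrypts to "0"
        PdfStr count = new PdfStr("0");   // encrypted; should decrypt to "3"
        decrypt(earlier, tracker);
        decrypt(count, tracker);
        return count.value;
    }

    public static void main(String[] args) {
        // Equality-based set: the encrypted "0" looks already decrypted.
        System.out.println(run(new HashSet<>())); // prints 0
        // Identity-based set: only the very same object counts as decrypted.
        System.out.println(run(Collections.newSetFromMap(new IdentityHashMap<>()))); // prints 3
    }
}
```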

iText doesn't process correct signature fields on strange situations

I have a PDF and I'm processing it with iText 5.4.4 (I have tried the same with 5.5.5 and got the same errors). While processing the file, I try to verify whether all signatures cover the full document, using this code:
boolean resp = false;
InputStream inputStream = null;
PdfReader versionReader = null;
PdfReader originalReader = null;
String signatureName = "Signature1";
// ... loading original pdf and signature names ...
try {
    inputStream = reader.getAcroFields().extractRevision(signatureName);
    versionReader = new PdfReader(inputStream);
} catch (IOException e) {
    log.warn("unable to get revision for signature FIELD ", e);
}
// After that I tried to use the actual one, but it still fails...
if (versionReader == null) {
    versionReader = originalReader;
}
resp = versionReader.getAcroFields().signatureCoversWholeDocument(signatureName);
My first problem occurs while creating the versionReader: it fails on parsing bytes.
Opening the PDF with RUPS, it looks like it has two signature fields with the same field name.
- The first one contains byte range /ByteRange: [0, 160, 9634, 121571]
- and the second one /ByteRange: [0, 131726, 1131728, 3904]
iText just recovers the first of them and after that fails.
While debugging the code I found that the method
com.itextpdf.text.pdf.AcroFields.fill()
contains this code at line 241 of com.itextpdf.text.pdf.AcroFields:
if (fields.containsKey(name))
    continue;
So it's clearly discarding this information. I don't know, but is it possible that iText has a bug? Or am I doing something wrong while reading the PDF file?
The point is that Adobe Acrobat Reader validates all signatures without problems...
This is the PDF with the problem.
All help is well received, thank you in advance.
The PDF issue
As the OP already found out himself, his PDF contains two signature form fields with the same name:
Both fields have a partial name Signature1 and no parent, so that also is the fully qualified name of both fields.
ISO 32000-1, the PDF specification, states on such fields:
In particular, field dictionaries with the same fully qualified field name shall have the same field type (FT), value (V), and default value (DV).
(section 12.7.3.2 Field Names)
In case of the OP's PDF the respective values of those fields are clearly different. Thus, the PDF form field structure is not valid.
A possible iText issue
While in a situation like this obviously a PDF library may ignore any extra field with a name it already has found a field for, the search order of iText is unfortunate:
PDF form fields without parents are expected to be referenced from the PDF's AcroForm dictionary Fields array. Old PDF forms often had form fields only referenced from the page Annots array and not from the AcroForm/Fields. To also support such forms, iText (just like Adobe Reader) considers such fields, too.
In case of the PDF at hand, there only is one field in the AcroForm dictionary Fields array:
The signature field referenced from here is the field ignored by iText. It should be the other way around, iText should prefer fields referenced from the AcroForm dictionary Fields array over those only referenced from page Annots arrays.
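The suggested precedence can be sketched in plain Java. This is not iText code: the maps stand in for the two places a field can be found, the string values stand in for the field dictionaries, and all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldPrecedenceSketch {
    // Collects fields by fully qualified name. Fields referenced from the
    // AcroForm Fields array win; fields found only in page Annots arrays
    // are added afterwards, and only for names not seen yet.
    static Map<String, String> collect(Map<String, String> acroFormFields,
                                       Map<String, String> pageAnnotFields) {
        Map<String, String> result = new LinkedHashMap<>(acroFormFields);
        for (Map.Entry<String, String> entry : pageAnnotFields.entrySet()) {
            result.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fromAcroForm = Map.of("Signature1", "[0 131726 1131728 3904]");
        Map<String, String> fromAnnots = Map.of("Signature1", "[0 160 9634 121571]");
        // The field referenced from AcroForm/Fields is the one that is kept.
        System.out.println(collect(fromAcroForm, fromAnnots).get("Signature1"));
    }
}
```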
Checking the other signature
The OP in a comment said
if i force, by debugging, to get the second one Field, the problem is that the digest is not the correct one,
As Adobe Reader verifies that other signature (at least it does not complain about its hash, merely about missing trust in its signer), this would have meant that iText or Adobe Reader has a bug in hash verification. Thus, I wrote a small test checking the signatures directly or indirectly referenced from the AcroForm dictionary Fields array. You can find the source here. The result for your file:
* A named entry: Signature1
FQP: Signature1
Type: /Sig
Value: present
Signed range: 0 131726 1131728 3904 (covers whole file)
Validity: true
So the newer signature (the only one in AcroForm/Fields, ignored by the iText AcroFields) covers the whole file and verifies ok. So your digest mismatch seems to have been a debugging artifact.
As Micheal says, if you comment out the following code at line 268 of AcroFields
// if (this.fields.containsKey(name)) {
// continue;
// }
then the PDF is fully verified, just as Adobe Acrobat Reader reports. The downside is that you must override
com.itextpdf.text.pdf.AcroFields
or go through low-level iText code.

Special characters in PDF form fields and global and fieldbased DR

I have a question regarding a weird form field behaviour.
Two pdf documents, both have textfield(s) using Helvetica as a font
Both are filled with values using the same iText logic (cp. below)
The field value (/V) is correct for both PDFs however the field appearance is not.
One PDF works fine; the other scrambles special characters like the euro symbol € or German characters like üöäß.
I tried to define a substitute font (as described in the book) but never got € and ß to work.
The only difference I could find is that a /DR dictionary is defined on field level for the non-working PDF (in addition to the global one). But if I remove it, the € sign still doesn't work. Please note that I am not talking about Asian or some exotic Unicode characters here - all are part of the standard Helvetica font (as the other PDF proves).
Question(s):
Any ideas how to get the non working PDF to correctly display the characters?
Or does the PDF violate the PDF spec somehow? (It was created using Acrobat, which makes that unlikely but not impossible.)
If you suggest replacing the form field font - how can I differentiate between working and non-working PDF files, since I don't want to do that for perfectly valid and working files?
Update: The code is not the problem (I am certain of that since its the same code for both) however for the sake of completeness here it is:
AcroFields acroFields = stamper.getAcroFields();
try {
boolean successful = acroFields.setField("Mitarbeiter", "öäü߀#");
if (!successful) {
//throw some exception
}
}
catch (DocumentException de) {
//some exceptionhandling
}
I didn't find any clues in the PDF reference about this, but the font that is used for the field doesn't define an encoding. However: an encoding is defined at the level of the resource dictionary (/DR). If you use that encoding, then the appearance of the field is created correctly. Note that the ISO specification doesn't say anything about the existence of an /Encoding entry at the level of the resource dictionary.
I've made a small update to iText. You can check the changes in revision 6693. This way, iText will now check if the /DR dictionary has encoding values in case no encoding is defined at the level of the font. With this fix, your form is filled out correctly.

How to obtain PDF table of contents (outline) data in iOS (iPad)?

I am building an iPad application that displays PDFs, and I'd like to be able to display the table of contents and let the user navigate to the relevant pages.
I have invested several hours in research at this point, and it appears that since PDFKit is [not supported in iOS], my only option is to parse the PDF meta data manually.
I have looked at several solutions, but all of them are silent on one point - how to associate a page in the "outline" metadata with the real page number of the item. I have examined my PDF document with [the Voyeur tool] and I can see the outline in the tree.
[This solution] helped me figure out how to navigate down the Outline/A/S/D tree to find the "Dest" object, but it performs some kind of object comparison using [self.pages indexOfObjectIdenticalTo:destPageDic] that I don't understand.
I have read the [official PDF spec from adobe], and section "12.3.2.3 Named Destinations" describes the way that an outline entry can point to a page:
Instead of being defined directly with the explicit syntax shown in Table 151, a destination may be referred to indirectly by means of a name object (PDF 1.1) or a byte string (PDF 1.2).
And continues with this line which is utterly incomprehensible to me:
The value of this entry shall be a dictionary in which each key is a destination name and the corresponding value is either an array defining the destination, using the syntax shown in Table 151, or a dictionary with a D entry whose value is such an array.
This refers to page 366, "12.3.2.2 Explicit Destinations" where a table describes a page: "In each case, page is an indirect reference to a page object"
So is the result of CGPDFDocumentGetPage or CGPDFPageGetDictionary an "indirect reference to a page object"?
I found a [thread on lists.apple.com] that discusses this. [This comment] implies that you can take the address (in memory?) of a CGPDFPageGetDictionary object for a given page and compare it to the pages in the "Outline" tree of the PDF meta data.
However, when I look at the addresses of page objects in the Outline tree and compare them to those addresses, they are never the same. The line used in that thread "TTDPRINT(@"%d => %p", k+1, dict);" is printing "dict" as a pointer in memory.. there's no reason to believe that an object returned there would be the same as one returned somewhere else.. they'd be in different places in memory!
My last hope was to look at the source code from apple's command line "outline" tool [mentioned in this book] (as [suggested by this thread]), but I can't find it anywhere.
Bottom line - does anyone have some insight into how PDF outlines work, or know of some open source code (preferably objective-c) that reads PDF outlines?
ARGG: I had all kinds of links posted here, but apparently a new user can only post one link at a time
The dictionary of a page returned by CGPDFDocumentGetPage (which you get via CGPDFPageGetDictionary) is the same as the indirect page reference you get when resolving a destination in an outline item. Both are essentially dictionaries and you can compare them using ==. When you have a CGPDFDictionaryRef that you want to know the page number of, you can do something like this:
CGPDFDocumentRef doc = ...;
CGPDFDictionaryRef outlinePageRef = ...;
for (size_t p = 1; p <= CGPDFDocumentGetNumberOfPages(doc); p++) {
    CGPDFPageRef page = CGPDFDocumentGetPage(doc, p);
    // Compare the page's dictionary, not the CGPDFPageRef itself
    if (CGPDFPageGetDictionary(page) == outlinePageRef) {
        printf("found the page number: %zu", p);
        break;
    }
}
An explicit destination however is not a page, but an array with the first element being the page. The other elements are the scroll position on the page etc.