How do PDF readers validate form fields?

I was looking at the source code of several pdf files which were digitally signed (and had annotations and form fields as well).
I noticed that each "Annot" dictionary has a "M" value which stores the latest time it was modified - which can then be checked with the "M" value for the "Sig" dictionary which stores when the pdf file was digitally signed.
However, I noticed that dictionaries with type "XObject" and subtype "Form" do not have an "M" value - i.e. do not store the time at which said form was modified. In such cases, how do pdf readers validate whether a change to form field is allowed (for eg, in a digital sign where no changes are allowed, no form field can be changed after the digital sign is done - how is this verified?)?
I just attached an example pdf at this link:
https://www.mediafire.com/file/q8ed9rkf35kgxgq/output.txt/file

Some Misconceptions
There apparently are a number of misconceptions to clear up here.
I noticed that each "Annot" dictionary has a "M" value which stores the latest time it was modified - which can then be checked with the "M" value for the "Sig" dictionary which stores when the pdf file was digitally signed.
The M entry of annotation dictionaries is optional, so you cannot count on it being there.
The annotation M value essentially is only a string; it is merely recommended, not required, to contain a date as specified in the PDF specification, so you might find a value like "my mother's 42nd birthday" in it.
The annotation M value is not backed by a digital timestamp, so a forger could put anything there.
Furthermore, the M entry of a signature dictionary also is optional, and by itself it also cannot be trusted.
Thus, no, this provides no means to validate anything.
However, I noticed that dictionaries with type "XObject" and subtype "Form" do not have an "M" value - i.e. do not store the time at which said form was modified. In such cases, how do pdf readers validate whether a change to form field is allowed (for eg, in a digital sign where no changes are allowed, no form field can be changed after the digital sign is done - how is this verified?)?
First of all, as explained above, the M values cannot be used at all for modification detection, so whether some objects do or do not have them is irrelevant.
Furthermore, a form XObject by itself is not a form field as meant by the document modification detection and prevention settings of a signature. The form fields meant are AcroForm form fields (or, in a deprecated special case, XFA form fields). A form XObject may be used as the appearance of an AcroForm form field, but in that case the pivotal check point is the form field itself.
How to Validate Changes
(For some background on PDF signatures you may want to read this answer first.)
Depending on the document modification detection and prevention (DocMDP) settings of the signatures of a document only certain changes are allowed to a document, see this answer.
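To illustrate where that setting lives, here is a minimal sketch in Java using iText 5's low-level API (file name, null handling, and output are purely illustrative): the certification signature is referenced from the catalog's /Perms /DocMDP entry, and its /Reference array carries the /TransformParams dictionary whose /P value encodes the permission level (1 = no changes at all, 2 = form filling and signing, 3 = additionally annotations).

import com.itextpdf.text.pdf.*;

public class DocMdpLevel {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("certified.pdf"); // hypothetical input file
        // The catalog's /Perms /DocMDP entry points to the certification signature dictionary.
        PdfDictionary perms = reader.getCatalog().getAsDict(PdfName.PERMS);
        if (perms != null && perms.getAsDict(PdfName.DOCMDP) != null) {
            PdfDictionary sig = perms.getAsDict(PdfName.DOCMDP);
            // The signature's /Reference array holds the DocMDP transform parameters.
            PdfDictionary transformParams =
                    sig.getAsArray(PdfName.REFERENCE).getAsDict(0).getAsDict(PdfName.TRANSFORMPARAMS);
            PdfNumber p = transformParams.getAsNumber(PdfName.P);
            System.out.println("DocMDP permission level: " + (p == null ? 2 : p.intValue()));
        } else {
            System.out.println("Not a certified document.");
        }
        reader.close();
    }
}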
But even the allowed changes may not be applied by changing the original objects in the PDF. That would, after all, change the digest over the originally signed byte ranges and so invalidate the signature. Thus, the changed and added objects are appended to the PDF, capped off by a cross reference table or stream for these objects.
Thus, what you have to do for DocMDP validation is determine the extent of the signed revision in the PDF file, find out that way what has been appended, and analyze whether those additions change the signed revision in allowed or disallowed ways.
While this may sound simple at first, it is not, in particular because "allowed" and "disallowed" changes are characterized by their effects on document appearance and behavior, not by the actual PDF objects that may be affected.
ETSI working groups are currently attempting to transform those characterizations into criteria for PDF objects; the results are to be published as ETSI TS 119 102-3, probably in multiple parts.
Some Details
In comments you asked
how do you tell from the appearance of a modified object, whether it was added before or after the digital sign?
Well, as mentioned above:
First you determine the extent of the signed revision in the PDF file.
I.e. you take the ByteRange entry of the signature dictionary and take the start of the lower range and the end of the higher range. E.g. if that entry is
[ 0 67355 72023 6380 ]
the signed revision starts at offset 0 and ends at offset 72023+6380-1 = 78402, inclusive.
(Obviously some sanity checks are indicated, in particular that the start offset is 0, that the gap in the signed byte ranges exactly contains the signature dictionary Contents value, that the signed revision as a whole is a valid PDF and all its cross references point to indirect objects completely contained in that signed revision, and that that signed revision indeed is a previous revision of the whole PDF, i.e. that the chain of cross reference streams or tables contains the cross reference stream or table of that revision.)
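As a small sketch (plain Java; the values are the example entry from above), this is the arithmetic for the extent of the signed revision:

public class SignedRevisionExtent {
    public static void main(String[] args) {
        // ByteRange as read from the signature dictionary: [ start1 length1 start2 length2 ]
        long[] byteRange = { 0, 67355, 72023, 6380 };
        long end = byteRange[2] + byteRange[3] - 1; // 72023 + 6380 - 1 = 78402, inclusive
        System.out.println("Signed revision: offsets " + byteRange[0] + " to " + end);
        // The gap in between (offsets 67355 to 72022) must contain exactly the
        // signature dictionary's /Contents value, cf. the sanity checks above.
    }
}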
Then you find out that way what has been appended.
I.e. you compare the cross reference stream or table of the whole file and the cross reference stream or table of the signed revision.
If some object is referenced for an object number now but was not referenced for that object number in the signed revision, you found a change to check.
(Strictly speaking you should iterate along the chain of cross reference streams or tables from the signed revision to the whole file, i.e. revision by revision, and check the changes in each revision.)
For this procedure you obviously have to use the original file, not some version uncompressed by tools like qpdf, otherwise you cannot do the offset tests.
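As a hedged sketch of that comparison step, here each revision's cross references are reduced to a map from object number to byte offset (a simplification: real cross reference tables and streams also carry generation numbers, free entries, and entries for objects stored in object streams):

import java.util.*;

public class XrefDiff {
    // Returns the object numbers whose entries were added or replaced after the signed revision.
    static Set<Integer> changedObjects(Map<Integer, Long> signedXref, Map<Integer, Long> currentXref) {
        Set<Integer> changed = new TreeSet<>();
        for (Map.Entry<Integer, Long> entry : currentXref.entrySet()) {
            Long signedOffset = signedXref.get(entry.getKey());
            if (signedOffset == null || !signedOffset.equals(entry.getValue())) {
                changed.add(entry.getKey()); // object added or overridden by an incremental update
            }
        }
        return changed;
    }
}

Each object number found by such a comparison then has to be analyzed for whether the change it represents is allowed under the DocMDP settings.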
Is it possible for an attacker to add a new annotation object before the xref table corresponding to the digital signature, and adjust the previous xref table values, so that a broken document passes as an accepted document?
No. The signed revision including its cross references is covered by the signature. Manipulating those bytes will invalidate the signature mathematically.

Related

PDFBOX2.X - Read wrong Data from acroForm

We are facing an issue while reading data from an AcroForm.
We are using PDFBox 2.x to create the PDF file. Our goal was to make the PDF "executable", meaning we can download a PDF file which contains an AcroForm, collect data in it, and later upload it to sync with the database.
The issue shows up in the PDFBox Debugger as well as in the uploaded file: our text box value is getting changed automatically.
Please see more details in the image below.
We have used the PDF Debugger tool to check the PDF content. You can see that the totalBuyoffCount value is 0. But it should be 3.
I have used iTextDebugger to check the same field.
It is totally random behavior and we have noticed the following things:
Sometimes a 0 or 1 value becomes N or Y.
Very few fields are affected, but it causes a NumberFormatException when the value is converted.
It makes our whole file corrupted.
If it cannot be fixed, could you tell me which area we need to look at, so that we can understand and debug why the value changes or where the value is retrieved from; if we find it, we can then change or override this behavior.
Looking at the PDF object in question (1499, gen 0) one finds
1499 0 obj
<<
/FT /Tx
/Q 0
/V (0)
/Ff 1
/Rect [0.0 1.0 1.0 1.0]
/Type /Annot
/Subtype /Widget
/T <77A2A671303FC282631C0C903EA8F40F>
/DA <2C85B77C2A5D81D53C5A3EB571EDBA1C>
/F 6
>>
endobj
So one might be tempted to say: you see /V (0), and not 3.
While this is correct, it unfortunately does not mean a lot because the file is encrypted!
Thus, the question boils down to whether the string 0 in object 1499, generation 0, decrypts to "0" or to "3".
I have not implemented a PDF decrypter myself, so I cannot claim to check this with my own code.
The second best I can do is check against what Adobe decrypts that value to. My good old Adobe Acrobat 9.5 Preflight shows:
Apparently Adobe, just like iText, decrypts this 0 to "3". Additional checks with an online PDF decrypter or two support this decryption result.
Thus, it appears that PDFBox does not properly decrypt this 0 string.
Considering the OP's further observation "Sometimes 0 or 1 value became N or Y. Very few fields are affected", it looks like PDFBox sometimes does not correctly decrypt single-character strings.
An alternative option would be that there is some issue in the encryption parameters of the file in question. I don't really believe this but I cannot preclude it.
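For completeness, this is roughly how such a value is read with PDFBox 2.x (a hedged sketch; the file name is illustrative, the field name is the one from the question). The decryption of the string happens while the document is parsed, which is where the problem described below shows up:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class ReadField {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("form.pdf"))) {
            PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
            PDField field = acroForm.getField("totalBuyoffCount");
            // With the bug described below this may print "0" instead of the expected "3".
            System.out.println(field.getValueAsString());
        }
    }
}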
The bug
As Tilman already hinted at in his comments to PDFBOX-4453, the bug is due to the way PDFBox, and in particular the SecurityHandler, keeps track of which objects have already been decrypted and which still have to be: the already decrypted objects are stored in the HashSet SecurityHandler.objects; when asked to decrypt an object, SecurityHandler.decrypt first checks whether that object is in that set, and only if it is not is the object actually decrypted and added to the set.
Thus, if a still encrypted object equals an already decrypted one, a call to decrypt this encrypted object won't do anything at all.
In the file at hand there has been a string before that was decrypted to "0". Thus, when the encrypted value of totalBuyoffCount, 0, is sent to the SecurityHandler for decryption, the value is falsely assumed to already be decrypted, so it remains as it is.
For longer strings this usually is no issue as their encrypted versions usually are completely garbled, so they won't be found among the already decrypted objects. Short strings, in particular single-character ones, on the other hand might have encrypted versions that make sense, so collisions may happen.
Options to fix this are discussed in the referenced Apache Jira issue. One option would be to replace the mentioned set with a flag on the individual objects in question, but other options are also possible.
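A simplified illustration of the problematic pattern (this is not the actual PDFBox source; it merely mimics the "skip if already seen" logic with plain strings and a toy cipher in which "0" and "3" map to each other):

import java.util.HashSet;
import java.util.Set;

public class DecryptOnceDemo {
    private final Set<String> alreadyDecrypted = new HashSet<>();

    String decrypt(String value) {
        if (alreadyDecrypted.contains(value)) {
            return value; // assumed to be decrypted already -- the flawed shortcut
        }
        String plain = reallyDecrypt(value);
        alreadyDecrypted.add(plain);
        return plain;
    }

    // Toy stand-in for the real cipher: "0" decrypts to "3" and vice versa.
    private String reallyDecrypt(String value) {
        return value.equals("0") ? "3" : value.equals("3") ? "0" : value;
    }

    public static void main(String[] args) {
        DecryptOnceDemo demo = new DecryptOnceDemo();
        System.out.println(demo.decrypt("3")); // "0" -- and "0" is now marked as already decrypted
        System.out.println(demo.decrypt("0")); // should become "3", but is skipped and stays "0"
    }
}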

How can I test the equality of two MediaStreams?

I'm wondering if there is a way to determine if two MediaStreams are equal.
What do you mean by "equal"?
I'd like to determine if the two streams are using the same hardware sources (Same microphone and camera are being used).
Acquiring streamB with the exact same constraints as streamA would mean they are equal.
Here is what I've tried so far:
comparing via the MediaStream id e.g.: streamA.id == streamB.id
This falls away since according to the spec:
When a MediaStream object is created, the User Agent must generate an identifier string, and must initialize the object's id attribute to that string. A good practice is to use a UUID [rfc4122], which is 36 characters long in its canonical form. To avoid fingerprinting, implementations should use the forms in section 4.4 or 4.5 of RFC 4122 when generating UUIDs.
Compare the id's of the MediaStreamTracks - same story, a UUID is generated per track.
Compare the tracks' labels, which in current Chrome contain names/identifiers of the hardware. This is very close to what I'm looking for, however (emphasis mine):
User Agents may label audio and video sources (e.g., "Internal microphone" or "External USB Webcam"). The label attribute must return the label of the object's corresponding source, if any. If the corresponding source has or had no label, the attribute must instead return the empty string
Is there a different approach I could take? Should I never end up in a situation where I compare two media streams? Would you say I can trust the label attribute?
Thanks for your time.
groupId together with kind is probably the closest thing you will get. Until you get multiple mics/cams on the same device...

Difference between the ID of a pdf read from iTextSharp and pdf.js

I am trying to parse the ID of a particular pdf (this) using iTextSharp as mentioned in this answer. But I get null array for ID whereas I can see that another pdfReader (pdf.js) can read the id as 77a2a5c4fc17dc3a91a072c46fe69ec0. Why is this behaviour different? Am I expected to read the ID field from some place other than the trailer?
Open a regular PDF with an ID in a text editor like this:
Right before where it says startxref, you see a dictionary (it starts with <<). That's the trailer dictionary of the PDF. One of the (optional) entries is the /ID which is an array containing two PDF strings.
If your PDF has such an entry, then the answer to the question Extract ID of a PDF document using iTextSharp won't return null.
Now open your PDF in a text editor:
Again you see a dictionary (the trailer dictionary) before startxref. However, in this case, the dictionary only has three entries: /Size (the number of objects in the cross-reference table), /Info (a reference to the dictionary containing the metadata) and /Root (a reference to the catalog dictionary).
There is no /ID entry, hence iText (and iTextSharp) should return null (and you confirmed that they do).
Now search for the value 77a2a5c4fc17dc3a91a072c46fe69ec0 in the PDF you've opened in a text editor. You won't find that value anywhere because it's just not there!
Summarized: your question Am I expected to read the ID field from some place other than the trailer? is wrong. You are asking how to read something that isn't there. Your question should be: Why is pdf.js creating an ID for PDFs that don't have one, and how do I retrieve it? The answer to the first part is reasonable: even iText tries to create an /ID when you manipulate a PDF because it's good practice for a PDF to have an ID. The answer to the second part is: look in the trailer (but you already knew that).
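In iText 5 (Java) that trailer lookup is short; a hedged sketch (file name illustrative), along the lines of the answer referenced above:

import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

public class ReadId {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("document.pdf");
        // /ID, if present, is an array of two strings in the trailer dictionary.
        PdfArray id = reader.getTrailer().getAsArray(PdfName.ID);
        System.out.println(id == null ? "no /ID entry" : id.getAsString(0));
        reader.close();
    }
}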
Conclusion: based on feedback in the comments, it turns out that the OP is using the fingerprint() method in pdf.js. This method returns the first element of the ID if an ID is present. If no ID is found, an MD5 hash is returned. See the source code of the fingerprint() method in pdf.js.

Special characters in PDF form fields and global and fieldbased DR

I have a question regarding a weird form field behaviour.
Two PDF documents, both have text field(s) using Helvetica as the font
Both are filled with values using the same iText logic (cf. below)
The field value (/V) is correct for both PDFs however the field appearance is not.
One PDF works fine; the other scrambles special characters like the euro symbol € or German characters like üöäß.
I tried to define a substitute font (as described in the book) however never got € and ß to work.
The only difference I could find is that a /DR dictionary is defined on field level for the non-working PDF (in addition to the global one). But if I remove it, the € sign still doesn't work. Please note that I am not talking about Asian or some exotic Unicode characters here - all are part of the standard Helvetica font (as the other PDF proves).
Question(s):
Any ideas how to get the non-working PDF to correctly display the characters?
Or does the PDF violate the PDF spec somehow? (It was created using Acrobat, which makes that unlikely but not impossible.)
If you suggest replacing the form field font: how can I differentiate between working and non-working PDF files, since I don't want to do that for perfectly valid and working files?
Update: The code is not the problem (I am certain of that since it's the same code for both); however, for the sake of completeness here it is:
AcroFields acroFields = stamper.getAcroFields();
try {
    boolean successful = acroFields.setField("Mitarbeiter", "öäü߀#");
    if (!successful) {
        // throw some exception
    }
}
catch (DocumentException de) {
    // some exception handling
}
I didn't find any clues in the PDF reference about this, but the font that is used for the field doesn't define an encoding. However: an encoding is defined at the level of the resource dictionary (/DR). If you use that encoding, then the appearance of the field is created correctly. Note that the ISO specification doesn't say anything about the existence of an /Encoding entry at the level of the resource dictionary.
I've made a small update to iText. You can check the changes in revision 6693. This way, iText will now check if the /DR dictionary has encoding values in case no encoding is defined at the level of the font. With this fix, your form is filled out correctly.
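For reference, a hedged sketch of the substitute-font approach the question mentions, using iText 5's AcroFields.setFieldProperty (field name taken from the question, font choice illustrative; as described above, this alone did not fix € and ß for the file in question before the iText change):

import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.BaseFont;

// inside the stamping code from the question
AcroFields acroFields = stamper.getAcroFields();
BaseFont helvetica = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
// "textfont" replaces the font used to build the field's appearance stream
acroFields.setFieldProperty("Mitarbeiter", "textfont", helvetica, null);
acroFields.setField("Mitarbeiter", "öäü߀#");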

How do [0] and [3] work in ASN.1?

I'm decoding ASN.1 (as used in X.509 for HTTPS certificates). I'm doing pretty well, but there is one thing that I just cannot find any understandable documentation for.
In this JS ASN1 parser you see a [0] and a [3] under a SEQUENCE element, the first looking like this in data: A0 03 02 01 02 .... I want to know what this means and how to decode it.
Another example is Anatomy of an X.509 v3 Certificate, there is a [0] right after the first two SEQUENCE elements.
What I don't understand is how A0 fits with the scheme where the first 2 bits of the tag byte are a class, the next a primitive/constructed bit and the remaining 5 are supposed to be the tag type. A0 is 10100000 which means that the tag type value would be zero.
It sounds like you need an introduction to ASN.1 tagging. There are two angles to approach this from. X.690 defines BER/CER/DER encoding rules. As such, it answers the question of how tags are encoded. X.680 defines ASN.1 itself. As such, it defines the syntax and rules for tagging. Both specifications can be found on the ITU-T website. I'll give you a quick overview.
Tags are used in BER/DER/CER to identify types. They are especially useful for distinguishing the components of a SEQUENCE and the alternatives of a CHOICE.
A tag combines a tag class and a tag number. The tag classes are UNIVERSAL, APPLICATION, PRIVATE, and CONTEXT-SPECIFIC. The UNIVERSAL class is basically used for the built-in types. APPLICATION is typically used for user-defined types. CONTEXT-SPECIFIC is typically used for the components inside constructed types (SEQUENCE, CHOICE, SEQUENCE OF). Syntactically, when tags are specified in an ASN.1 module, they are written inside brackets: [ tag_class tag_number ]; for CONTEXT-SPECIFIC, the tag_class is omitted. Thus, [APPLICATION 10] or [0].
While every ASN.1 type has an associated tag, syntactically, there is also the "TaggedType", which is used by an ASN.1 author to specify the tag to encode a type with. Basically, a TaggedType puts a tag prefix ahead of a type. For example:
MyType ::= SEQUENCE {
    field_with_tagged_type [0] UTF8String
}
The tag in a TaggedType is either explicit or implicit. If explicit, this means that I want the original tag to be explicitly encoded. If implicit, this means I am happy to have only the tag that I specified be encoded. In the explicit case, the BER encoding results in a nested TLV (tag-length-value): the outer tag ([0] in the example above), the length, and then another TLV as the value. In the example, this inner TLV would have a tag of [UNIVERSAL 12] for the UTF8String.
Whether the tag is explicit or implicit depends upon how you write the tag and the tagging environment. For example:
MyType2 ::= SEQUENCE {
    field_with_explicit_tag [0] EXPLICIT UTF8String OPTIONAL,
    field_with_implicit_tag [1] IMPLICIT UTF8String OPTIONAL,
    field_with_tag [2] UTF8String OPTIONAL
}
If you specify neither IMPLICIT nor EXPLICIT, there are some rules that define whether the tag is explicit or implicit (see X.680 31). These rules take into consideration the tagging environment defined for the ASN.1 module. The ASN.1 module may specify the tagging environment as IMPLICIT TAGS, EXPLICIT TAGS, or AUTOMATIC TAGS. Roughly speaking, if you don't specify IMPLICIT or EXPLICIT for a tag, the tag will be explicit if the tagging environment is EXPLICIT and implicit if the tagging environment is IMPLICIT or AUTOMATIC. An automatic tagging environment is basically the same as an IMPLICIT tagging environment, except that unique tags are automatically assigned for members of SEQUENCE and CHOICE types.
Note that in the above example, the three components of MyType2 are all optional. In BER/CER/DER, a decoder will know what component is present based on the encoded tag (which obviously better be unique).
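To make the explicit/implicit difference concrete, here is a hedged sketch of the resulting bytes for a hypothetical [0] tag on the UTF8String "Hi" (hand-encoded byte arrays, shown in Java only for illustration):

public class TagEncodingDemo {
    public static void main(String[] args) {
        // EXPLICIT [0]: a context-specific constructed tag wrapping the complete inner TLV.
        byte[] explicitEncoding = {
            (byte) 0xA0, 0x04,       // [0], constructed, length 4
            0x0C, 0x02, 0x48, 0x69   // UTF8String (UNIVERSAL 12), length 2, "Hi"
        };
        // IMPLICIT [0]: the context-specific tag replaces the UTF8String tag entirely.
        byte[] implicitEncoding = {
            (byte) 0x80, 0x02, 0x48, 0x69  // [0], primitive, length 2, "Hi"
        };
        System.out.println(explicitEncoding.length + " bytes explicit vs "
                + implicitEncoding.length + " bytes implicit");
    }
}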
ASN.1 BER and DER use ASN.1 tags to unambiguously identify certain components in an encoded stream. There are 4 classes of ASN.1 tags: UNIVERSAL, APPLICATION, PRIVATE, and context-specific. The [0] is a context-specific tag since there is no tag class keyword in front of it. UNIVERSAL is reserved for built-in types in ASN.1. Most often you see context-specific tags to eliminate potential ambiguity in a SEQUENCE which contains OPTIONAL elements.
If you know you are receiving two items that are not optional, one after the other, you know which is which even if their tags are the same. However, if the first one is optional, the two must have different tags, or you would not be able to tell which one you had received if only one was present in the encoding.
Most often today, ASN.1 specifications use "AUTOMATIC TAGS" so that you don't have to worry about this kind of disambiguation in messages, since components of SEQUENCE, SET and CHOICE will automatically get context-specific tags starting with [0], [1], [2], etc. for each component.
You can find more information on ASN.1 tags at http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html where two free downloadable books are available.
Another excellent resource is http://asn1-playground.oss.com where you can try variations of ASN.1 specifications with different tags in an online compiler and encoder/decoder. There you can see the effects of tag changes on encodings.
I finally worked through this and thought that I would provide some insight for anyone still trying to understand this. In my example, as in the one above, I was using an X.509 certificate in DER format. I came across the "A0 03 02 01 02" sequence and could not figure out how that translated to a version number of 2. So if you are having the same problem, here is how that works.
The A0 tells you it is a "Context-Specific" field, a "Constructed" tag, and has the type value 0x00. Immediately, the context-specific class tells you not to use the normal type fields for DER/BER. Instead, given that this is an X.509 certificate, the type value is labeled in RFC 5280, p. 116. There you will see four fields with markers on them of [0], [1], [2], and [3], standing for "version", "issuerUniqueID", "subjectUniqueID", and "extensions", respectively. So in this case, a value of A0 tells you that this is one of the X.509 context-specific fields, specifically the "version" type. That takes care of the "A0" value.
The "03" value is just your length, as you might expect.
Since this was identified as "Constructed", the data should represent a normal DER/BER object. The "02 01 02" is the actual version number you are looking for, expressed as an Integer. "02" is the standard BER encoding of Integer, "01" is your length, and "02" is your value, or in this case, your version number.
So given that X.509 defines 4 context-specific types, you should expect to see "A0", "A1", "A2", and "A3" anywhere in the certificate. Hopefully the information provided above will now make more sense and help you better understand what those markers represent.
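For readers who want to verify the byte-level walk-through above, here is a minimal hand-decoding sketch (it only walks this one fixed byte sequence, not a general BER parser):

public class VersionFieldDemo {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xA0, 0x03, 0x02, 0x01, 0x02 };
        int tag = data[0] & 0xFF;
        int tagClass = tag >> 6;                  // 0b10 = context-specific
        boolean constructed = (tag & 0x20) != 0;  // true
        int tagNumber = tag & 0x1F;               // 0, i.e. [0] = "version" in a TBSCertificate
        int length = data[1];                     // 3 content bytes follow
        int version = data[4];                    // inner TLV 02 01 02 = INTEGER, length 1, value 2
        System.out.println("class=" + tagClass + " constructed=" + constructed
                + " number=" + tagNumber + " length=" + length + " version=" + version);
    }
}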
[0] is a context-specific tagged type, meaning that to figure out what meaning it gives to the fields (if the "Constructed" flag is set) or to the data value (if the "Constructed" flag is not set) that it wraps, you have to know in which context it appears.
In addition, you also need to know what kind of object the sender and receiver are exchanging in the DER stream, i.e. the "ASN.1 module".
Let's say they're exchanging a Certificate Signing Request, and [0] appears as the 4th field inside a SEQUENCE inside the root SEQUENCE:
SEQUENCE {
    SEQUENCE {
        INTEGER 0
        SEQUENCE { ... }
        SEQUENCE { ... }
        [0] { ... }
    }
}
Then according to RFC 2986, which defines the DER contents for Certification Signing Requests, in Appendix A, which defines the ASN.1 module, the meaning of that particular field is sneakily defined as "Attributes" and "should have the Constructed flag set":
attributes [0] Attributes{{ CRIAttributes }}
You can also go the other way and see that "attributes" must be the 4th field inside the first sequence inside the root sequence and tagged as [0], by looking at the root sequence definition (section 4: "the top-level type CertificationRequest"), finding the CertificationRequestInfo placement inside that, finding where the "attributes" item is located inside the CertificationRequestInfo, and finally seeing how it is tagged.