PDFBox 2.x reads wrong data from AcroForm

We are facing an issue while reading data from an AcroForm.
We are using PDFBox 2.x to create PDF files. Our goal is to make the PDF "executable": users download a PDF file which contains an AcroForm, fill in data, and later upload it again so we can sync it with the database.
The issue shows up in the uploaded file, also when it is inspected in the PDFBox debugger: the value of a text field changes by itself.
Please see more details in the image below.
We have used the PDF Debugger tool to check the PDF content. You can see that the totalBuyoffCount value is 0. But it should be 3.
I have also used iTextDebugger to check the same field.
The behavior appears totally random, and we have noticed the following things:
Sometimes a 0 or 1 value becomes N or Y.
Very few fields are affected, but this causes a NumberFormatException when the string value is converted into a number.
It makes our whole file corrupted.
If it cannot be fixed, could you tell us where to look so that we can understand and debug why the value changes or where the value is retrieved from? If we find that, we can change or override this behavior.

Looking at the PDF object in question (1499, gen 0) one finds
1499 0 obj
<<
/FT /Tx
/Q 0
/V (0)
/Ff 1
/Rect [0.0 1.0 1.0 1.0]
/Type /Annot
/Subtype /Widget
/T <77A2A671303FC282631C0C903EA8F40F>
/DA <2C85B77C2A5D81D53C5A3EB571EDBA1C>
/F 6
>>
endobj
So one might be tempted to say you see "/V (0)". And not 3.
While this is correct, it unfortunately does not mean a lot because the file is encrypted!
Thus, the question boils down to whether the string 0 in object 1499, generation 0, decrypts to "0" or to "3".
I have not implemented a PDF decrypter myself, so I cannot claim to check this with my own code.
The second best I can do is check against what Adobe decrypts that value to. My good old Adobe Acrobat 9.5 Preflight shows:
Apparently Adobe just like iText decrypts this 0 to "3". Additional checks with an online PDF decrypter or two support this decryption result.
Thus, it appears that PDFBox does not properly decrypt this 0 string.
Considering the OP's further observation "Sometimes 0 or 1 value became N or Y Very few fields are affected" it looks like PDFBox sometimes does not correctly decrypt single character strings.
An alternative option would be that there is some issue in the encryption parameters of the file in question. I don't really believe this but I cannot preclude it.
The bug
As Tilman already hinted at in his comments to PDFBOX-4453, the bug is due to the way PDFBox and in particular the SecurityHandler keeps track of which objects already have been decrypted and which still have to be: The already decrypted objects are stored in the HashSet SecurityHandler.objects; when asked to decrypt an object, SecurityHandler.decrypt first checks whether that object is in that set, and only if it is not, it is actually decrypted and added to the set.
Thus, if a still encrypted object equals an already decrypted one, a call to decrypt this encrypted object won't do anything at all.
In the file at hand there has been a string before that has been decrypted to "0". Thus, when the encrypted value of totalBuyoffCount, 0, is sent to the SecurityHandler for decryption, the value falsely is assumed to already be decrypted, so it remains as it is.
For longer strings this usually is no issue as their encrypted versions usually are completely garbled, so they won't be found among the already decrypted objects. Short strings, in particular single-character ones, on the other hand might have encrypted versions that make sense, so collisions may happen.
Options to fix this are discussed in the referenced Apache Jira issue. One option would be to replace the mentioned set by a flag of the individual objects in question but other options also are possible.
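To illustrate the collision, here is a minimal toy model in Python. It is not PDFBox code: ToyString, toy_cipher, SetBasedHandler and FlagBasedHandler are hypothetical names, and a single-byte XOR stands in for the real decryption with per-object keys. It merely sketches how a value-based set can mistake a still encrypted string for an already decrypted one, and how a per-object flag would avoid that.

```python
class ToyString:
    """Hypothetical stand-in for a PDF string object; equality is by content."""

    def __init__(self, value: bytes):
        self.value = value
        self.decrypted = False  # only used by the flag-based handler

    def __eq__(self, other):
        return isinstance(other, ToyString) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


def toy_cipher(data: bytes, key: int) -> bytes:
    """Single-byte XOR; an assumption standing in for the real cipher."""
    return bytes(b ^ key for b in data)


class SetBasedHandler:
    """Tracks decrypted objects in a set that compares objects by value."""

    def __init__(self, key: int):
        self.key = key
        self.objects = set()

    def decrypt(self, s: ToyString) -> None:
        if s in self.objects:
            return  # looks "already decrypted" because an equal value is known
        s.value = toy_cipher(s.value, self.key)
        self.objects.add(s)


class FlagBasedHandler:
    """Tracks the decryption state on each individual object instead."""

    def __init__(self, key: int):
        self.key = key

    def decrypt(self, s: ToyString) -> None:
        if s.decrypted:
            return
        s.value = toy_cipher(s.value, self.key)
        s.decrypted = True


if __name__ == "__main__":
    for handler in (SetBasedHandler(3), FlagBasedHandler(3)):
        earlier = ToyString(b"3")  # encrypted form of "0" under key 3
        later = ToyString(b"0")    # encrypted form of "3" under key 3
        handler.decrypt(earlier)
        handler.decrypt(later)
        print(type(handler).__name__, earlier.value, later.value)
```

In this toy setup the set-based handler leaves the second string untouched because it equals the already decrypted first one, while the flag-based handler decrypts both.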

Related

How do PDF readers validate form fields?

I was looking at the source code of several pdf files which were digitally signed (and had annotations and form fields as well).
I noticed that each "Annot" dictionary has a "M" value which stores the latest time it was modified - which can then be checked with the "M" value for the "Sig" dictionary which stores when the pdf file was digitally signed.
However, I noticed that dictionaries with type "XObject" and subtype "Form" do not have an "M" value - i.e. do not store the time at which said form was modified. In such cases, how do pdf readers validate whether a change to form field is allowed (for eg, in a digital sign where no changes are allowed, no form field can be changed after the digital sign is done - how is this verified?)?
I just attached an example pdf at this link:
https://www.mediafire.com/file/q8ed9rkf35kgxgq/output.txt/file
Some Misconceptions
There apparently are a number of misconceptions to clear up here.
I noticed that each "Annot" dictionary has a "M" value which stores the latest time it was modified - which can then be checked with the "M" value for the "Sig" dictionary which stores when the pdf file was digitally signed.
The M entry of annotation dictionaries is optional, so you cannot count on it being there.
The format of the annotation M value essentially only is a String; it is merely recommended to contain dates as specified in the PDF specification but not required, so you might find a value like "my mother's 42nd birthday" in it.
The annotation M value is not backed by a digital timestamp, so a forger could put anything there.
Furthermore, the M entry of a signature dictionary also is optional, and by itself it also cannot be trusted.
Thus, no, this represents no means to validate anything.
However, I noticed that dictionaries with type "XObject" and subtype "Form" do not have an "M" value - i.e. do not store the time at which said form was modified. In such cases, how do pdf readers validate whether a change to form field is allowed (for eg, in a digital sign where no changes are allowed, no form field can be changed after the digital sign is done - how is this verified?)?
First of all, as explained above, the M values cannot be used at all for modification detection, so whether some objects do or do not have them, is irrelevant.
Furthermore, a form XObject by itself is not a form field meant by the document modification detection and prevention settings of a signature. The form fields meant are AcroForm form fields (or, in a deprecated special case, XFA form fields). A form XObject may be used as appearance of an AcroForm form field but in that case the pivotal check point is the form field itself.
How to Validate Changes
(For some background on PDF signatures you may want to read this answer first.)
Depending on the document modification detection and prevention (DocMDP) settings of the signatures of a document only certain changes are allowed to a document, see this answer.
But even the allowed changes may not be applied by changing the original objects in the PDF. That would after all change the digest over the originally signed byte ranges and so invalidate the signature. Thus, the changed and added objects are appended to the PDF, capped off by a cross reference table or streams for these objects.
Thus, what you have to do for DocMDP validation is determine the extent of the signed revision in the PDF file, find out that way what has been appended, and analyze whether those additions change the signed revision in allowed or disallowed ways.
While this may sound simple at first, it is not, in particular because "allowed" and "disallowed" changes are characterized by their effects on document appearance and behavior, not by the actual PDF objects that may be affected.
Here currently ETSI working groups are attempting to transform those characterizations into criteria for PDF objects; the results are to be published as ETSI TS 119 102-3, probably in multiple parts.
Some Details
In comments you asked
how do you tell from the appearance of a modified object, whether it was added before or after the digital sign?
Well, as mentioned above:
First you determine the extent of the signed revision in the PDF file.
I.e. you take the ByteRange entry of the signature dictionary and take the start of the lower range and the end of the higher range. E.g. if that entry is
[ 0 67355 72023 6380 ]
then the signed revision starts at offset 0 and ends at offset 72023+6380-1=78402 inclusively.
(Obviously some sanity checks are indicated, in particular that the start offset is 0, that the gap in the signed byte ranges exactly contains the signature dictionary Contents value, that the signed revision as a whole is a valid PDF and all its cross references point to indirect objects completely contained in that signed revision, and that that signed revision indeed is a previous revision of the whole PDF, i.e. that the chain of cross reference streams or tables contains the cross reference stream or table of that revision.)
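The offset arithmetic can be sketched in a few lines of Python. The function name signed_revision_extent is hypothetical, and the sketch covers only the arithmetic plus the most basic sanity checks, not the full set of checks listed above.

```python
def signed_revision_extent(byte_range):
    """Return (first_offset, last_offset, gap) for a ByteRange array.

    byte_range is [start1, length1, start2, length2]. The gap is the
    half-open interval [gap_start, gap_end) between the two signed ranges,
    which should hold exactly the Contents value of the signature dictionary.
    """
    if len(byte_range) != 4:
        raise ValueError("expected two (offset, length) pairs")
    start1, length1, start2, length2 = byte_range
    if start1 != 0:
        raise ValueError("signed revision should start at offset 0")
    gap_start = start1 + length1
    if start2 < gap_start:
        raise ValueError("signed ranges overlap")
    last_offset = start2 + length2 - 1  # inclusive end of the signed revision
    return start1, last_offset, (gap_start, start2)


if __name__ == "__main__":
    first, last, gap = signed_revision_extent([0, 67355, 72023, 6380])
    print("signed revision:", first, "to", last, "inclusive; gap:", gap)
```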
Then you find out that way what has been appended.
I.e. you compare the cross reference stream or table of the whole file and the cross reference stream or table of the signed revision.
If some object is referenced for an object number now but was not referenced for that object number in the signed revision, you found a change to check.
(Strictly speaking you should iterate along the chain of cross reference streams or tables from the signed revision to the whole file, i.e. revision by revision, and check the changes in each revision.)
For this procedure you obviously have to use the original file, not some version uncompressed by tools like qpdf, otherwise you cannot do the offset tests.
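As a rough sketch of that comparison in Python: assume each cross reference table or stream has already been parsed into a dictionary mapping object number to an (offset, generation) pair. That representation and the name changed_objects are assumptions made for illustration; a real validator also has to deal with free entries, object streams and intermediate revisions.

```python
def changed_objects(signed_xref, full_xref):
    """Return the object numbers whose entry in the whole file differs from
    (or does not exist in) the signed revision.

    Both arguments are dicts: object number -> (offset, generation).
    """
    return sorted(
        number
        for number, entry in full_xref.items()
        if signed_xref.get(number) != entry
    )


if __name__ == "__main__":
    # Hypothetical entries, not taken from any real file.
    signed = {1: (15, 0), 2: (120, 0)}
    whole = {1: (15, 0), 2: (80000, 0), 3: (80100, 0)}
    print(changed_objects(signed, whole))  # objects 2 and 3 need checking
```

Each object number returned this way is a change that still has to be analyzed for whether it is allowed.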
Is it possible for an attacker to add a new annotation object before the xref table corresponding to the digital signature, and adjust the previous xref table values, so that a broken document passes as an accepted document?
No. The signed revision including its cross references is covered by the signature. Manipulating those bytes will invalidate the signature mathematically.

ftell/fseek fail when near end of file

Reading a text file (which happens to be a PDS Member FB 80)
hFile = fopen(filename,"r");
and have reached up to the point in the file where there is only an empty line left.
FilePos = ftell(hFile);
Then read the last line, which only contains a '\n' character.
fseek(hFile, FilePos, SEEK_SET);
fails with:-
errno=(27) EDC5027I The position specified to fseek() was invalid.
The position specified to fseek() was returned by ftell() a few lines earlier. It has the value 841 in the specific error case I have seen. Checking through the debugger, this is also the value returned by ftell a few lines earlier. It has not been corrupted.
The same code works at other positions in the file, and only fails at the point where there is a single empty line left to read when the position is remembered.
My understanding of how ftell/fseek should work is succinctly captured by another answer on SO.
The value returned from ftell on a text stream has no predictable relationship to the number of characters you have read so far. The only thing you can rely on is that you can use it subsequently as the offset argument to fseek or fseeko to move back to the same file position.
It would seem that I cannot rely on the one thing I should be able to rely on.
My question is, why does fseek fail in this way?
As z/OS has some file formats that are unique you might find the answer in this Knowledge Center article.
Given that you are processing a PDS member I would suspect that this is record level I/O which is handled differently than stream I/O which is more common in distributed implementations.
I do not know why fseek fails in this way, but if your common usage pattern is to use ftell to get the position and then fseek to go to that position, I strongly suggest using fgetpos and fsetpos instead for data set I/O. Not only will you avoid this problem that you are finding, but it is also better performing for certain data set characteristics.

google authenticator vs vbscript

I have implemented this http://jacob.jkrall.net/totp/ in vbscript.
My code given the same hex gives the right 6-digit otp, so that part is working.
I've also verified the HMAC-SHA-1 encoding against an online generator, http://www.freeformatter.com/hmac-generator.html#ad-output, same input gives same output.
My time is the same as http://www.currenttimestamp.com/
I've generated a qrcode at http://www.qr-koder.dk/ with the string otpauth://totp/$LABEL?secret=$SECRET and the google authenticator app reads the code and starts outputting the 6 digit code changing every 30 seconds.
BUT THE CODES FROM THE APP DO NOT MATCH THE 6-DIGIT CODE THE VBSCRIPT GENERATES!
I've even tried trunc(time/30) +/-7500 steps to see if it was a timezone/daylight saving problem, to no avail.
As the other parts of the routine to generate the 6 digits seem to work I've come to the conclusion I don't understand this:
the url on the qr-code is
otpauth://totp/$LABEL?secret=$SECRET
with the explanation
LABEL can be used to describe the key in your app, while SECRET is the
16-character base32-encoded shared secret, which is now known to both
the client and the server.
So when I calculate HMAC-SHA-1(SECRET, time()/30)
should the SECRET be the same string given to both the app and the calculation?
If I select a secret of 1234567890, the base32 is GEZDGNBVGY3TQOJQ according to http://emn178.github.io/online-tools/base32_encode.html.
Should I then take
HMAC-SHA-1("1234567890", time()/30)
or
HMAC-SHA-1("GEZDGNBVGY3TQOJQ", time()/30)
?
I believe I've tried both, and neither works.
The system unix time is correct.
I guess the problem might be with the secret in your HMAC-SHA-1 function. It very much depends on what the HMAC-SHA-1 expects.
Your string "1234567890" might be a binary string. Is it an ASCII or UTF-8 representation, or UTF-16? I.e. is this string 10 bytes or 20 bytes long?
I recommend getting the input string in your VBScript right.
On the other hand, instead of writing your own VBScript, you can also use a ready made solution like the privacyIDEA authentication server, which is open source and also comes with TOTP.
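For comparison, here is a minimal Python sketch of TOTP as specified in RFC 6238 and RFC 4226. It is not a translation of your VBScript; the function name totp and its parameters are chosen for illustration. The two points it makes are that the HMAC key is the raw secret (the base32 text from the otpauth URL decoded back to bytes, so the ten bytes "1234567890" for GEZDGNBVGY3TQOJQ) and that the message is the time step counter packed as an 8-byte big-endian integer, not a decimal string.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(base32_secret, unix_time=None, step=30, digits=6):
    """Compute a TOTP code from the base32 secret as shown in the QR code."""
    # The app and the server both decode the base32 text to raw key bytes.
    # b32decode expects the length to be a multiple of 8 characters
    # (pad with "=" otherwise).
    key = base64.b32decode(base32_secret.upper())
    if unix_time is None:
        unix_time = time.time()
    counter = int(unix_time) // step
    # The counter is hashed as an 8-byte big-endian integer.
    message = struct.pack(">Q", counter)
    digest = hmac.new(key, message, hashlib.sha1).digest()
    # Dynamic truncation as described in RFC 4226.
    offset = digest[-1] & 0x0F
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # Key bytes are the ASCII string "1234567890".
    print(totp("GEZDGNBVGY3TQOJQ"))
```

If the VBScript feeds either the base32 text itself or a 20-byte wide-character version of the secret into the HMAC, the codes will differ from the app even though every individual building block checks out.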

Difference between the ID of a pdf read from iTextSharp and pdf.js

I am trying to parse the ID of a particular pdf (this) using iTextSharp as mentioned in this answer. But I get null array for ID whereas I can see that another pdfReader (pdf.js) can read the id as 77a2a5c4fc17dc3a91a072c46fe69ec0. Why is this behaviour different? Am I expected to read the ID field from some place other than the trailer?
Open a regular PDF with an ID in a text editor like this:
Right before where it says startxref, you see a dictionary (it starts with <<). That's the trailer dictionary of the PDF. One of the (optional) entries is the /ID which is an array containing two PDF strings.
If your PDF has such an entry, then the answer to the question Extract ID of a PDF document using iTextSharp won't return null.
Now open your PDF in a text editor:
Again you see a dictionary (the trailer dictionary) before startxref. However, in this case, the dictionary only has three entries: /Size (the number of objects in the cross-reference table), /Info (a reference to the dictionary containing the metadata) and /Root (a reference to the catalog dictionary).
There is no /ID entry, hence iText (and iTextSharp) should return null (and you confirmed that they do).
Now search for the value 77a2a5c4fc17dc3a91a072c46fe69ec0 in the PDF you've opened in a text editor. You won't find that value anywhere because it's just not there!
Summarized: your question Am I expected to read the ID field from some place other than the trailer? is wrong. You are asking how to read something that isn't there. Your question should be: Why is pdf.js creating an ID for PDFs that don't have one, and how do I retrieve it? The answer to the first part is reasonable: even iText tries to create an /ID when you manipulate a PDF because it's good practice for a PDF to have an ID. The answer to the second part is: look in the trailer (but you already knew that).
Conclusion: based on feedback in the comments, it turns out that the OP is using the fingerprint() method in pdf.js. This method returns the first element of the ID if an ID is present. If no ID is found, an MD5 hash is returned. See the source code of the fingerprint() method in pdf.js.
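The described behaviour can be sketched in Python as follows. This is not the pdf.js source: the function signature is made up for illustration, and the assumption that the hash covers only the first 1024 bytes of the file should be checked against the actual fingerprint() implementation.

```python
import hashlib


def fingerprint(trailer_id, file_bytes, first_bytes=1024):
    """Return a hex fingerprint for a PDF.

    trailer_id is the /ID array from the trailer as a list of byte strings,
    or None if the trailer has no /ID entry.
    """
    if trailer_id:
        # An ID is present: use its first element.
        return trailer_id[0].hex()
    # No ID: fall back to an MD5 hash over the start of the file.
    # The 1024-byte window is an assumption, see the note above.
    return hashlib.md5(file_bytes[:first_bytes]).hexdigest()


if __name__ == "__main__":
    print(fingerprint([b"\x01\x02", b"\x03\x04"], b"%PDF-1.4 ..."))
    print(fingerprint(None, b"%PDF-1.4 ..."))
```

This is why such a value can show up in pdf.js although it appears nowhere in the file itself.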

Special characters in PDF form fields and global and fieldbased DR

I have a question regarding a weird form field behaviour.
Two pdf documents, both have textfield(s) using Helvetica as a font
Both are filled with values using the same iText logic (cp. below)
The field value (/V) is correct for both PDFs however the field appearance is not.
One PDF works fine; the other scrambles special characters like the euro symbol € or German characters like üöäß.
I tried to define a substitute font (as described in the book) however never got € and ß to work.
The only difference I could find is that a /DR dictionary is defined on field level for the non-working PDF (in addition to the global one). But if I remove it, the € sign still doesn't work. Please note that I am not talking about Asian or some exotic Unicode characters here; all are part of the standard Helvetica font (as the other PDF proves).
Question(s):
Any ideas how to get the non working PDF to correctly display the characters?
Or does the PDF violate the PDF spec somehow? (It was created using Acrobat, which makes that unlikely but not impossible.)
If you suggest replacing the form field font: how can I differentiate between working and non-working PDF files, since I don't want to do that for perfectly valid and working files?
Update: The code is not the problem (I am certain of that since it's the same code for both); however, for the sake of completeness, here it is:
AcroFields acroFields = stamper.getAcroFields();
try {
    boolean successful = acroFields.setField("Mitarbeiter", "öäü߀#");
    if (!successful) {
        // throw some exception
    }
}
catch (DocumentException de) {
    // some exception handling
}
I didn't find any clues in the PDF reference about this, but the font that is used for the field doesn't define an encoding. However: an encoding is defined at the level of the resource dictionary (/DR). If you use that encoding, then the appearance of the field is created correctly. Note that the ISO specification doesn't say anything about the existence of an /Encoding entry at the level of the resource dictionary.
I've made a small update to iText. You can check the changes in revision 6693. This way, iText will now check if the /DR dictionary has encoding values in case no encoding is defined at the level of the font. With this fix, your form is filled out correctly.