I'm trying to timestamp a PDF document using a library that we're developing. I added a new incremental section to the PDF document: a new annotation object for the signature, a signature object containing the actual signature, and a new xref table for the new section. When I check the xref entries, everything seems right.
When I try to verify my signature in Acrobat, I get the following error message: "There are errors in the formatting of information contained in this signature (The signature byte range is invalid)".
However, when I check the byte range, everything seems right. It goes from the beginning of the document to the opening bracket of the Contents part, and from the end of it to the end of the document. I compared it with a document that has a valid signature, and the byte ranges look the same.
I really don't understand what's wrong and why Acrobat is showing this error.
Here's the link to the signed file if anybody wants to have a look: https://ufile.io/mckajk9h
PS: I can share parts of the code, but the actual question is about Adobe Acrobat and how it interprets PDF signatures, not my code. So the relevant part should be the resulting PDF file that I shared.
There are some issues in your PDF.
The Major Issue
The major one which results in the non-intuitive error message: The first entry in your incremental update cross reference table is one byte too short:
As you can see, the first cross reference table entry (0000000000 65535 f\n) is one byte too short: according to the specification it has to be exactly 20 bytes long, but yours is only 19 bytes long.
Whenever Adobe Acrobat sees structurally broken cross reference tables, it internally repairs the file. In the repaired file the objects are rearranged which renders the signature byte range invalid.
As soon as I had fixed this by adding a space between f and \n, Adobe Acrobat did not complain about a formatting error anymore. Of course it claimed that the document had been altered or corrupted, after all I had altered it. But at least it accepted the signature structure.
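For writers, the easiest way to avoid this class of bug is to build every entry through a single helper that enforces the 20-byte format. A minimal Python sketch (the helper name is mine, not from the library in question):

```python
def xref_entry(offset: int, gen: int, in_use: bool) -> bytes:
    """Build one cross-reference table entry: a 10-digit offset, a space,
    a 5-digit generation number, a space, 'n' or 'f', and a two-byte
    end-of-line marker (here: space + LF), 20 bytes in total."""
    entry = b"%010d %05d %s \n" % (offset, gen, b"n" if in_use else b"f")
    assert len(entry) == 20, "xref entries must be exactly 20 bytes long"
    return entry

# The broken entry from the question is b"0000000000 65535 f\n" (19 bytes);
# the helper always emits the padded 20-byte form instead:
print(xref_entry(0, 65535, False))  # b'0000000000 65535 f \n'
```
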
Minor Issues
Some of your offsets are incorrect. In your incremental update you have:
xref
0 2
0000000000 65535 f
0000003029 00000 n
12 2
0000003144 00000 n
0000003265 00000 n
Thus, object 12 should start at 3144 but it actually starts at 3145.
Furthermore, you have
startxref
19548
%%EOF
So the xref keyword should start at 19548 but it starts at 19549.
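Off-by-one offsets like these are easy to catch mechanically. Below is a rough Python sketch (the function name and the regex-based scanning are my own simplifications; it only understands classic cross-reference tables, not cross-reference streams) that compares the offsets claimed in xref tables against where each `N G obj` header actually sits:

```python
import re

def check_offsets(pdf: bytes):
    """Compare each in-use entry of every classic xref table against the
    actual position of the matching 'N G obj' header. Returns a list of
    (object_number, claimed_offset, actual_offset) mismatches."""
    # Actual offsets: position of every "N G obj" header in the raw bytes.
    actual = {}
    for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b", pdf):
        actual.setdefault(int(m.group(1)), m.start())
    mismatches = []
    # Claimed offsets: walk every classic cross-reference table.
    for t in re.finditer(rb"(?<!start)xref\s", pdf):
        pos = t.end()
        while True:  # one or more "start count" subsection headers
            m = re.match(rb"(\d+)\s+(\d+)\s*[\r\n]+", pdf[pos:pos + 32])
            if not m:
                break
            start, count = int(m.group(1)), int(m.group(2))
            pos += m.end()
            for i in range(count):
                entry = pdf[pos:pos + 20]  # entries are exactly 20 bytes
                pos += 20
                if entry[17:18] == b"n":   # in-use entry
                    claimed = int(entry[:10])
                    num = start + i
                    if num in actual and actual[num] != claimed:
                        mismatches.append((num, claimed, actual[num]))
    return mismatches
```
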
Related
I'm trying to create an open source library for digital signing of PDF files.
Bad parameter
I got most of it done, but I have a problem: the signature shows the following error:
Error during signature verification.
Adobe Acrobat error.
Bad parameter.
I tried to find the problem, but until now have not found it.
I have created two PDF files that are stripped of almost all other data, except the needed info.
Does anyone know where this error might originate from?
I have already tried different online and offline validators, but none of them pointed me in the right direction.
Does anyone know if this error might originate from the certificate and not the PDF structure itself?
Invalid byte range
While creating this post I also tested it on another PDF file, but got this error:
Error during signature verification.
Unexpected byte range values defining scope of signed data.
Details: The signature byte range is invalid
Note: a slice of the PDF describes it as:
...
/SubFilter/adbe.pkcs7.detached
/ByteRange[0 4197 22193 30080 ]
/Contents<30820...
I have recalculated the ByteRange attribute multiple times and even tried changing it by one byte in each direction, but that always results in a "Signature processing error".
I don't know what else could be incorrect about the ByteRange. (The added spaces are the same as how Acrobat pads the ByteRange.)
If anyone has an idea on what the problem might be, let me know.
Files
Here are my resulting files:
result_bad_param_with_image.pdf (mirror1) (mirror2)
result_bad_param_no_image.pdf (mirror1) (mirror2)
result2_invalid_byte_range_with_image.pdf (mirror1) (mirror2)
result2_invalid_byte_range_no_image.pdf (mirror1) (mirror2)
A signature file (the same as the Contents field in a PDF, except directly in a separate file):
signature.der (mirror1) (mirror2)
The content of the signature.der is printed here: https://pastebin.com/W4EGJ2fX
(using openssl cms -inform DER -in signature.der -cmsout -print command)
(I know the signature is self-signed and does not contain a lot of info, but I think this should not matter here; it was just to create these examples.)
Edit: New links after solving some problems and added some extra files:
result.pdf
signature.der
signed_content.der
There are some errors in your signature and an uncommon structure which in the context of digital signatures may result in rejection by a validator.
Incorrect Signed Hash Value Inside Signature Container
Signing in CMS signature containers with signed attributes makes use of two hash values:
the hash value of the signed byte ranges of the PDF; that value is correct in your example files;
the hash value of the signed attributes in the SignerInfo of the signature container; that value is not correct in your example files.
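For reference, the first of these hashes is computed over exactly the two spans listed in /ByteRange, excluding only the <...> hex string of the /Contents entry. A Python sketch (the helper name is mine) with the sanity checks a validator effectively performs:

```python
import hashlib

def byterange_digest(pdf: bytes, byte_range, algo="sha256"):
    """Hash the two spans named by /ByteRange [o1 l1 o2 l2]. The gap
    between the spans must hold exactly the <...> hex string of the
    /Contents entry, delimiters included."""
    o1, l1, o2, l2 = byte_range
    assert o1 == 0, "first span must start at byte 0"
    assert o2 + l2 == len(pdf), "second span must end at end of file"
    gap = pdf[o1 + l1:o2]
    assert gap.startswith(b"<") and gap.endswith(b">"), \
        "only the Contents placeholder may be left out"
    h = hashlib.new(algo)
    h.update(pdf[o1:o1 + l1])
    h.update(pdf[o2:o2 + l2])
    return h.hexdigest()
```
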
PS: Looking into the mismatch once again, it turns out that your signed attributes are not DER encoded: The DER encoding in particular sorts the elements of a SET in a specific order, and in your case the attribute order is not the DER order. The specification requires the signed attributes to be DER encoded, though.
PPS: In a comment you argued
I just checked the order of the SET and I can not find anything that is wrong with it. Here is my reasoning, let me know what part is incorrect. The items should be ordered according to the 'key', which in my case is an OID.
First of all, this reasoning is flawed: The type in question is a set type (more exactly an ASN.1 SET OF), not some map type; and the DER encoding rule set only knows the ASN.1 base types. Thus, that OID (which just is an arbitrary part of the attribute structure) cannot be the generic ordering key.
And indeed, a quick glance at the specification shows:
11.6 Set-of components
The encodings of the component values of a set-of value shall appear in ascending order, the encodings being compared
as octet strings with the shorter components being padded at their trailing end with 0-octets.
NOTE – The padding octets are for comparison purposes only and do not appear in the encodings.
(ISO/IEC 8825-1 / ITU-T Rec. X.690, section 11 "Restrictions on BER employed by both CER and DER")
Thus, in case of the signed attributes essentially you first DER encode the attribute elements and then sort the resulting byte arrays as described above.
As an aside, this issue of your signature merely causes problems in probably half the validators around. Some validators do not check or DER re-encode the signed attributes, so they get the same hash as you get. Others either check the encoding up front (and, therefore, throw an error because of the issue) or simply re-encode the attributes in DER (and, therefore, get a different hash than you get).
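In code, the section 11.6 rule boils down to: DER-encode each attribute separately, then sort the resulting byte strings, padding shorter ones with trailing zero octets for the comparison only. A small Python sketch (the function name is mine):

```python
def der_sort_set_of(encoded_elements):
    """Order the DER encodings of SET OF components as X.690 section 11.6
    requires: ascending as octet strings, with shorter encodings
    conceptually padded at the trailing end with 0x00 octets for the
    comparison only (the padding never appears in the output)."""
    longest = max((len(e) for e in encoded_elements), default=0)
    return sorted(encoded_elements,
                  key=lambda e: e + b"\x00" * (longest - len(e)))
```

For the signed attributes of a SignerInfo, each attribute structure is DER-encoded on its own and the resulting byte arrays are passed through such a sort before the SET OF is assembled and hashed.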
Problematic Extended Key Usage of Signer Certificate
Your signer certificate has an extended key usage value 1.3.6.1.4.1.311.80.1 (Microsoft's OID for Document Encryption) and only that. Adobe validation used to only support certificates with either no extended key usage or one or more of the following:
emailProtection
codeSigning
anyExtendedKeyUsage
1.2.840.113583.1.1.5 (Adobe Authentic Documents Trust)
See Enterprise Toolkit » Digital Signatures Guide for IT » A: Changes Across Releases.
Incorrect Incremental Updates
You sign in an incremental update to the original PDF. This in general is a good idea, as it allows extracting the unsigned original document.
But one needs to add the incremental update correctly, and in the case of result2_invalid_byte_range_no_image.pdf and result2_invalid_byte_range_with_image.pdf it is done incorrectly: the original revision there is created using cross reference tables, but your incremental updates use pure cross reference streams. This is incorrect; you have to continue with the same kind of cross references.
When opening documents with a mix of cross reference tables and pure cross reference streams, Adobe Acrobat internally repairs this which in particular relocates signatures and so makes byte ranges incorrect.
Uncommon Signature Field Structure
You use an uncommon signature field structure in your example PDFs: you separate the widget from the field and only update the field, not the widget, when signing.
While this strictly speaking is ok, I would implement the common structures while making the code work at all, and deviate only thereafter.
PS: In a comment you asked whether I could elaborate on this.
Your signing implementation in a first step adds an incremental update with an empty signature field and a widget as indirectly referenced kid, e.g.:
16 0 obj
<<
/Type /Annot
/F 4
/Subtype /Widget
/BS << /Type /Border /S /S /W 0 >>
/Parent 17 0 R
/P 2 0 R
/Rect [141.75 664.89 276.75 702.39]
/AP << /N 18 0 R >>
/MK << /BC [.1882353 .1882353 .1882353] /BG [1.00 1.00 1.00] /R 0 >>
/DA (/TiRo 0 Tf 0 0 0 rg\r
)
>>
endobj
17 0 obj
<<
/Kids [16 0 R]
/FT /Sig
/T (eyJ1c2VySWQiOiIyNzIifQ==)
>>
In another incremental update you then sign the field with a direct signature value but don't change the widget, e.g.:
17 0 obj
<<
/Kids [16 0 R]
/FT /Sig
/T (eyJ1c2VySWQiOiIyNzIifQ==)
/V << /Type/Sig ... >>
>>
This is uncommon in some ways:
Usually for signatures the option to merge field and widget is used.
Usually for signatures (except for usage right signatures) the signature dictionary is not a direct but an indirect value of the key V.
Usually the appearance of a signature field is updated together with the signature dictionary if there is an appearance at all.
Also, unless other form fill-ins shall happen between adding an empty signature field and signing it, fields usually are added and filled in the same document update.
Thus, more common would be a single incremental update (or even full re-save) containing something like this:
92 0 obj
<<
/AP << /N 94 0 R >>
/DA (/MyriadPro-Regular 0 Tf 0 Tz 0 g)
/F 132
/FT /Sig
/MK <<>>
/P 1 0 R
/Rect [117.575 499.561 515.968 520.938]
/Subtype /Widget
/T (Signature3)
/Type /Annot
/V 93 0 R
>>
endobj
93 0 obj
<<
/ByteRange [ 0 3227714 5751810 2789]
...
>>
As said above, though, your structure strictly speaking is ok, too. But the "Bad parameter" error only occurs when validating from the widget in the document or from the signature panel; it does not occur when validating your signature using the "Validate Signature" button of the "Signature Properties" dialog. Because of that, I think it's possible that Adobe is irritated by the uncommon structure.
This is probably a rather basic question, but I'm having a bit of trouble figuring it out, and it might be useful for future visitors.
I want to get at the raw data inside a PDF file, and I've managed to decode a page using the Python library PyPDF2 with the following commands:
import PyPDF2
with open('My PDF.pdf', 'rb') as infile:
    mypdf = PyPDF2.PdfFileReader(infile)
    # getPage() is zero-based, so index 1 is the second page;
    # getData() decodes the content stream's filters.
    raw_data = mypdf.getPage(1).getContents().getData()
print(raw_data)
Looking at the raw data provided, I have begun to suspect that ASCII characters preceding carriage returns are significant: every carriage return that I've seen is preceded by one. It seems like they might be some kind of token identifier. I've already figured out that /RelativeColorimetric is associated with the sequence ri\r. I'm currently looking through the PDF 1.7 standard Adobe provides, and I know an explanation is in there somewhere, but I haven't been able to find it yet in that 756-page behemoth of a document.
The defining thing here is not that \r – it is just inserted instead of a regular space for readability – but the fact that ri is an operator.
A PDF content stream uses a stack-based postfix (reverse Polish) syntax: value1 value2 ... valuen operator
The full syntax of your ri, for example, is explained in Table 57 on p.127:
intent ri (PDF 1.1) Set the colour rendering intent in the graphics state (see 8.6.5.8, "Rendering Intents").
and the idea is that this indeed appears in this order inside a content stream. (... I tried to find an appropriate example of your ri in use but could not find one; not even in the ISO PDF itself that you referred to.)
A random stream snippet from elsewhere:
q
/CS0 cs
1 1 1 scn
1.5 i
/GS1 gs
0 -85.0500031 -14.7640076 0 287.0200043 344.026001 cm
BX
/Sh0 sh
EX
Q
(the indentation comes courtesy of my own PDF reader) shows operands (/CS0, 1 1 1, 1.5 etc.), with the operators (cs, scn, i etc.) at the end of each line for clarity.
This is explained in 7.8.2 Content Streams:
...
A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical Conventions." It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See EXAMPLE 4 in 7.4, "Filters," for an example of a content stream.
(my emphasis)
7.2.2 Character Set specifies that inside a content stream, whitespace characters such as tab, newline, and carriage return, are just that: separators, and may occur anywhere and in any number (>= 1) between operands and operators. It mentions
NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
– to which I can add that most PDF creating software indeed attempts to delimit 'lines' consisting of an operands-operator sequence with returns.
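The operands-precede-operator rule makes a naive tokenizer almost trivial. A deliberately simplified Python sketch (it ignores strings, arrays, dictionaries, and inline images, all of which a real content stream parser must handle):

```python
def split_operations(content: str):
    """Split a decoded content stream into (operands, operator) pairs.
    Deliberately minimal: handles only name objects, numbers, and
    keyword operators separated by whitespace."""
    ops, operands = [], []
    for token in content.split():
        if token.startswith("/"):            # a name object is an operand
            operands.append(token)
        else:
            try:
                operands.append(float(token))    # a numeric operand
            except ValueError:
                ops.append((operands, token))    # otherwise: an operator
                operands = []
    return ops

# The asker's example: /RelativeColorimetric is the operand of ri.
print(split_operations("/RelativeColorimetric ri 1 1 1 scn"))
```
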
What do the following decoding parameters mean?
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<4DC888EB77E2D649AEBD54CA55A09C54><227DCAC2C364E84A9778262D41602AD4>]/Info 37 0 R/Length 69/Root 39 0 R/Size 38/Type/XRef/W[1 2 1]>>
I know that /Filter /FlateDecode is the filter which was used to compress the stream. But what are ID, Info, Length, Root, and Size? Are these parameters related to compression/decompression?
Please consult ISO-32000-1:
You are showing the dictionary of a compressed cross reference table (/Type/XRef):
7.5.8 Cross-Reference Streams
Cross-reference streams are stream objects, and contain a dictionary and a data stream.
FlateDecode: the way the stream is compressed.
Length: This is the number of bytes in the stream. Your PDF is at least a PDF 1.5 file and it has a compressed xref table.
DecodeParms: contains information about the way the stream is encoded.
A Cross-reference stream has some typical dictionary entries:
W: An array of integers representing the size of the fields in a single cross-reference entry. In your case [1 2 1].
Size: The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary.
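Putting the pieces together, decoding the entries of such a cross-reference stream means inflating the data, undoing the PNG "Up" predictor implied by /Predictor 12 with /Columns 4, and slicing each row by /W. A rough Python sketch (it only handles the Up predictor; /Predictor can also name other PNG filters):

```python
import zlib

def decode_xref_stream(raw: bytes, w=(1, 2, 1), columns=4):
    """Inflate a cross-reference stream and undo the PNG 'Up' predictor
    (/Predictor 12) over /Columns bytes per row, then split each row
    into the field widths given by /W. With /W [1 2 1] that is a 1-byte
    entry type, a 2-byte offset/object number, and a 1-byte index."""
    data = zlib.decompress(raw)
    rowlen = columns + 1                  # +1 for the per-row filter tag
    prev = bytearray(columns)
    entries = []
    for i in range(0, len(data), rowlen):
        tag, row = data[i], bytearray(data[i + 1:i + rowlen])
        assert tag == 2, "this sketch only handles the PNG 'Up' filter"
        for j in range(columns):          # Up: add the byte of the row above
            row[j] = (row[j] + prev[j]) & 0xFF
        prev = row
        fields, pos = [], 0
        for width in w:
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        entries.append(tuple(fields))
    return entries
```
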
I also see some entries that belong in the trailer dictionary of a PDF file (in a file with cross-reference streams, the stream dictionary also serves as the trailer; its /Root entry points to the Catalog):
14.4 File Identifiers
File identifiers shall be defined by the optional ID entry in a PDF
file’s trailer dictionary. The ID entry is optional but should be
used. The value of this entry shall be an array of two byte strings.
The first byte string shall be a permanent identifier based on the
contents of the file at the time it was originally created and shall
not change when the file is incrementally updated. The second byte
string shall be a changing identifier based on the file’s contents at
the time it was last updated. When a file is first written, both
identifiers shall be set to the same value.
14.3.3 Document Information Dictionary
What you see is a reference to another indirect object, a dictionary called the Info dictionary:
The optional Info entry in the trailer of a PDF file shall hold a
document information dictionary containing metadata for the document.
Note: this question isn't really suited for Stack Overflow. Stack Overflow is a forum where you can post programming problems. Your question isn't a programming problem. You are merely asking us to copy/paste quotes from ISO-32000-1.
I'm reading PDF specs and I have a few questions about the structure it has.
First of all, the file signature is %PDF-n.m (8 bytes).
After that, the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I were trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed in there? Where should I stop parsing?
After that, there should be a body, a xref table and a trailer and an %%EOF.
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
Third and last question: if there were more than one body+xref+trailer section, where would the offset just before the %%EOF be pointing? To the first or the last xref table?
First of all, the file signature is %PDF-n.m (8 bytes). After that, the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I were trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed in there? Where should I stop parsing?
Which docs do you have? The PDF specification ISO 32000-1 says:
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater.
Thus, those at least 4 bytes of binary data are not immediately following the file signature without any structure but they are on a comment line! This implies that they are
preceded by a % (which starts a comment, i.e. data you have to ignore while parsing anyways) and
followed by an end-of-line, i.e. CR, LF, or CR LF.
So it is easy to recognize while parsing. In particular it merely is a special case of a comment line and nothing to treat specially.
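A parser can therefore handle the header plus the optional binary comment line in a few lines. A Python sketch (the function name is mine):

```python
import re

def read_header(pdf: bytes):
    """Parse the %PDF-n.m header line and skip the optional comment line
    of binary bytes that may follow it. Returns (version, body_offset)."""
    m = re.match(rb"%PDF-(\d+\.\d+)[\r\n]+", pdf)
    if not m:
        raise ValueError("not a PDF: missing %PDF-n.m header")
    pos = m.end()
    if pdf[pos:pos + 1] == b"%":          # a comment: skip to end of line
        while pos < len(pdf) and pdf[pos] not in b"\r\n":
            pos += 1
        while pos < len(pdf) and pdf[pos] in b"\r\n":
            pos += 1
    return m.group(1).decode(), pos
```
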
(sigh, I just saw you and @Jongware cleared that in comments while I wrote this...)
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
If there are no objects, you don't have a PDF file as certain objects are required in a PDF file, in particular the catalog. So do you mean a minimal valid PDF file?
As you commented you indeed mean a minimal valid PDF.
Please have a look at the question What is the smallest possible valid PDF? on Stack Overflow; there are some attempts to create minimal PDFs adhering more or less strictly to the specification. Reading e.g. @plinth's answer you will see stuff that is not PDF anymore but still accepted by Adobe Reader.
Third and last question: If there were more than one body+xref+trailer section, where would the offset just before the %%EOF be pointing?
Normally it would be the last cross reference table/stream as the usual use case is
you start with a PDF which has but one cross reference section;
you append an incremental update with a cross reference section pointing to the original as previous, and the new offset before %%EOF points to that new cross reference;
you append yet another incremental update with a cross reference section pointing to the cross references from the first update as previous, and the new offset before %%EOF points to that newest cross reference;
etc...
The exception is the case of linearized documents in which the offset before the %%EOF points to the initial cross references which in turn point to the section at the end of the file as previous. For details cf. Annex F of ISO 32000-1.
And as you can of course apply incremental updates to a linearized document, you can have mixed forms.
In general it is best for a parser to be able to parse any order of partial cross references. And don't forget, there are not only cross reference sections but also alternatively cross reference streams.
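A sketch of that chain-walking for classic tables only (the regex-based shortcuts are my own; a real parser must locate the actual trailer dictionary and also handle cross-reference streams, where /Prev lives in the stream dictionary):

```python
import re

def xref_offsets(pdf: bytes):
    """Follow the chain of cross-reference sections: start at the offset
    given by the last startxref keyword, then keep following /Prev
    entries. Crude but illustrative of the /Prev chain."""
    last = None
    for last in re.finditer(rb"startxref\s+(\d+)", pdf):
        pass                                  # keep the last occurrence
    if last is None:
        raise ValueError("no startxref found")
    offsets, offset = [], int(last.group(1))
    while offset is not None and offset not in offsets:  # guard cycles
        offsets.append(offset)
        section = pdf[offset:offset + 2048]   # crude: trailer is nearby
        prev = re.search(rb"/Prev\s+(\d+)", section)
        offset = int(prev.group(1)) if prev else None
    return offsets
```
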
I'm having an issue with a filter program I wrote. It detects whether a file is a PDF document by reading the first 5 bytes of the file and comparing them to a fixed buffer:
25 50 44 46 2D
This works fine, except that I'm seeing a few files that start with a byte order mark instead:
EF BB BF 25 50 44 46 2D
^-------^
I'm wondering if that is actually allowed by the PDF specs. If I check section 7.5 of that documentation, I read it as "no":
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7
Yet, I see these documents in the wild, and users get confused because PDF reader programs can open these documents but my filter rejects them.
So: are BOM markers allowed at the start of PDF documents ? (I'm NOT talking about string objects here but the PDF file itself)
So: are BOM markers allowed at the start of PDF documents ?
No, just like you read in the specification, nothing is allowed before the "%PDF" bytes.
But Adobe Reader has a long history of accepting files in spite of some leading or trailing trash bytes.
Cf. the implementation notes in Appendix H of Adobe's pdf_reference_1-7:
3.4.1, “File Header”
Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m
...
3.4.4, “File Trailer”
Acrobat viewers require only that the %%EOF marker appear somewhere
within the last 1024 bytes of the file.
And since people have a tendency to think that a PDF that Adobe Reader displays as desired is valid, there are many PDFs in the wild that do have trash bytes up front.
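If you want your filter to match Acrobat's tolerance rather than the letter of the spec, a lenient check could look like this (a sketch, mirroring the 1024-byte rule quoted above):

```python
def looks_like_pdf(data: bytes) -> bool:
    """Lenient check in the spirit of Acrobat: accept the file if the
    %PDF- marker appears anywhere within the first 1024 bytes instead
    of requiring it at offset 0."""
    return b"%PDF-" in data[:1024]
```

This accepts the BOM-prefixed files from the question while still rejecting arbitrary non-PDF data.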
No, a BOM is not valid at the front of a PDF file.
A PDF is a binary file format so a BOM wouldn't actually make sense, it would be like having a BOM at the front of a ZIP file or a JPEG.
I'm guessing the PDFs that you are consuming come from misconfigured applications that either have something at the front of their output buffer already or, more likely, are created with the incorrect assumption that a PDF is a text-based format.