How to tell what encryption is used in a PDF?

I'm trying to read PDFs in my app. I've stumbled upon AES-encrypted PDFs.
As far as I understood from the PDF reference, in the AES scheme, strings and streams should be padded to a multiple of 16 bytes in length. The first 16 bytes are the IV in unencoded form.
However, in my PDF, strings have lengths that are not multiples of 16. Decryption of such strings predictably fails.
Here are the details of the PDF:
Header:
%PDF-1.6
Encrypt dictionary:
<</Length 128/Filter/Standard/O(...)/P -1340/R 3/U(...)/V 2>>
So the PDF version is 1.6, the encryption version is 2, and the revision is 3. Under these conditions, the PDF reference states that the AES scheme is used.
Example of encoded string:
Author(Í¡c$5N8 ¶ŽÜß*绫ÈÙÏ)
which has the length of 29.
Note that I'm able to verify user password as stated in algorithm "Authenticating the user password", and it shows to be valid.
EDIT:
The only logical answer would be "Because it is not AES-encrypted". Indeed, when forcing encryption to RC4, PDF decrypts just fine.
So the question is rather "How to tell if AES or RC4 is used?"
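As far as the encryption dictionary tables in ISO 32000-1 go, the cipher is not determined by the PDF version or by /R but by /V: values 1 and 2 mean RC4, while AES only enters with /V 4 (or /V 5 in PDF 2.0), where a crypt filter named by /StrF or /StmF carries a /CFM entry of /V2 (RC4), /AESV2 (AES-128) or /AESV3 (AES-256). The dictionary quoted above has /V 2, which is consistent with RC4 working. Below is a minimal sketch of that decision. It assumes the encryption dictionary has already been parsed into a plain Python dict with name keys and name values written without the leading slash; the function name string_cipher is made up for illustration.

```python
def string_cipher(encrypt: dict) -> str:
    """Guess the cipher used for strings from a parsed /Encrypt dictionary.

    Assumption: `encrypt` is a plain dict such as {"V": 2, "R": 3, ...},
    with PDF names represented as Python strings without the slash.
    """
    v = encrypt.get("V", 0)
    if v in (1, 2):
        # V 1: RC4 with a 40-bit key; V 2: RC4 with a longer key (/Length).
        return "RC4"
    if v in (4, 5):
        # Crypt filters: /StrF names the filter used for strings.
        filter_name = encrypt.get("StrF", "Identity")
        if filter_name == "Identity":
            return "none"
        cfm = encrypt.get("CF", {}).get(filter_name, {}).get("CFM", "None")
        return {
            "V2": "RC4",
            "AESV2": "AES-128",
            "AESV3": "AES-256",
            "None": "handler-defined",
        }.get(cfm, "unknown")
    return "unknown"


# The dictionary from the question: /V 2 /R 3 /Length 128
question_dict = {"Filter": "Standard", "V": 2, "R": 3, "Length": 128, "P": -1340}
print(string_cipher(question_dict))
```

For streams the same lookup would use /StmF instead of /StrF, and an individual stream can override it with its own /Crypt filter, which this sketch does not handle.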

Related

Can you add a timestamped no-tamper-proof to a PDF without "signing" it?

When signing a PDF using digital signature, one can use a trusted timestamping service to add a time-stamp token that is signed by the timestamping authority. When viewing the signature of the PDF then, it will say that it contains a signed timestamp and that it has not been tampered with since that time (if it hasn't).
Technically, what happens is that the hash of the PDF content gets sent to the TSA (RFC 3161); that hash is put into a structure together with the current timestamp (as determined by the timestamping authority) plus some metadata, and that structure is then signed and sent back. This then provides proof that the PDF has not been changed since that point in time.
Technically it should be possible therefore to create such a timestamp proof without signing the document itself with an additional signature. Is that somehow supported though by the PDF standard (and also in terms of Acrobat Reader then being able to show this timestamp somehow)?
Of course I could just do it manually: take the SHA-256 hash of the file's binary representation, send it to the TSA service, and store the received token in an external file. But preferably I'd like to embed the no-tamper proof into the PDF in such a way that Acrobat Reader can display it.
Is this possible? If so, how?
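For reference, the manual route mentioned above (hash the file, send the hash to a TSA) could look roughly like the sketch below. It hand-encodes a minimal RFC 3161 TimeStampReq in DER: version 1, a SHA-256 message imprint, no nonce, and certReq set to TRUE. The function name timestamp_request is made up for illustration, and no TSA is contacted here.

```python
import hashlib

# DER encoding of AlgorithmIdentifier { OID 2.16.840.1.101.3.4.2.1 (SHA-256), NULL }
SHA256_ALG_ID = bytes.fromhex("300d06096086480165030402010500")


def timestamp_request(data: bytes) -> bytes:
    """Build a minimal DER-encoded RFC 3161 TimeStampReq for `data`.

    All lengths stay below 128 bytes, so single-byte DER lengths suffice.
    """
    digest = hashlib.sha256(data).digest()
    # MessageImprint ::= SEQUENCE { hashAlgorithm, hashedMessage OCTET STRING }
    imprint_body = SHA256_ALG_ID + b"\x04\x20" + digest
    imprint = b"\x30" + bytes([len(imprint_body)]) + imprint_body
    # TimeStampReq ::= SEQUENCE { version INTEGER 1, messageImprint, certReq TRUE }
    body = b"\x02\x01\x01" + imprint + b"\x01\x01\xff"
    return b"\x30" + bytes([len(body)]) + body


request = timestamp_request(b"bytes of the file, read in binary mode")
print(len(request), request.hex())
```

Such a request is normally sent to the TSA as an HTTP POST with the content type application/timestamp-query. Note that this yields a detached token over the whole file, which is not the same thing as a timestamp embedded in the PDF.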
You can embed pure RFC 3161 time stamps in a PDF. This construct is called a document timestamp.
This structure was originally specified in ETSI TS 102 778-4 (Annex A.2) in 2009 as a means to purely timestamp a previously signed PDF with some validation-related information added in revisions after the signed one. As PAdES developed, this specification finally found its way into ETSI EN 319 142-1 (section 5.4.3).
While ETSI could only specify the structure as extension to ISO 32000-1 (PDF 1.7), the responsible ISO committee added it to the core ISO 32000-2 (PDF 2) in 2017.
Concerning your questions in comments:
Is this compatible with PDF/A?
I think they are not compatible with PDF/A-1, PDF/A-2, and PDF/A-3. As PDF/A-4 is based on ISO 32000-2, though, I assume it will be compatible. (I have not yet had a look at ISO 19005-4...)
Is there a way to create those with Acrobat Reader?
It should be possible with some Adobe Acrobat version. It is (currently) not possible with the base Adobe Acrobat Reader version. Probably, though, Adobe Acrobat Reader with some of its fee-based, built-in tools can create them.
optimally I'd like to have a cli tool or be able to do it through some library
Any general-purpose PDF signing library that is not outdated should support the creation of document time stamps.
but first I want to test how they are displayed later in Acrobat Reader
Like this:
The first entry is a Signature with an embedded signature timestamp, the second entry is a document time stamp.

Unknown encoding used in PDF strings

I am writing code to extract URLs from PDF files. In most files, the URLs appear as plain ASCII. However, in some PDF files, such as the PDF specification itself (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf), the URLs appear in hexadecimal form with seemingly no structure.
For example, in the above file, in the main metadata, the author appears as:
/Author <F240D629CD72348F>
This is decoded by Atril and other PDF viewers as "Jim King". The hexadecimal strings are double the length of the literal value as expected, but scrambled beyond recognition. Assuming a 1:1 mapping of byte value to characters, the "i" is encoded both as 0x40 and 0x72.
Actual URL value:
<EB345AA632781A90E90781A4A0BF42680D1F1AD67910B293798B0AFFED8407CE12684F21B7F471D96DCE4864CAB970A98E7F911C207A12C6E6900D789BC13AE87E76A9D6B8EDDADE7A53EAA521E6421295EA31305C>
Should decode to:
http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=51502
I have also looked at PyPDF2 source code which manages to decode these strings, but I have not found the answer.
How do I find the encoding used for annotations in a PDF document?
The example pdf is encrypted as you can determine by looking at its trailer which contains an Encrypt entry:
/Encrypt 126988 0 R
Thus, all strings and streams in that pdf (with a very few exceptions) are encrypted.
(If you wonder why you don't have to enter a password when opening the file: the pdf standard defines a default password which a pdf processor tries before asking the user to supply a password. This default password is used here.)
Thus, before analyzing the strings you have to decrypt them. If you don't want to implement the decryption yourself, you can use a tool like qpdf to do that in preparation for your code.
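A quick way to check for this situation before trying to interpret any strings is to look for the Encrypt entry in the raw bytes. The following is only a heuristic sketch: it assumes the entry is an indirect reference like the one shown above (it can also be a direct dictionary), and the byte pattern could in principle also occur inside stream data. The function name find_encrypt_reference is made up for illustration.

```python
import re

# Matches an indirect reference such as "/Encrypt 126988 0 R"
ENCRYPT_REF = re.compile(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R")


def find_encrypt_reference(pdf_bytes: bytes):
    """Return (object number, generation) of the last /Encrypt reference, or None."""
    matches = ENCRYPT_REF.findall(pdf_bytes)
    if not matches:
        return None
    # The last occurrence belongs to the most recently written trailer.
    obj, gen = matches[-1]
    return int(obj), int(gen)


sample = b"trailer\n<</Size 126990/Encrypt 126988 0 R/Root 1 0 R>>\nstartxref\n0\n%%EOF"
print(find_encrypt_reference(sample))
```

If this returns a reference, the file can be decrypted up front, for example with qpdf --decrypt in.pdf out.pdf, and the URL extraction can then run on the decrypted copy.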

We receive signed PDF documents with ulterior modifications

Maybe this one would fit better on so security? I'm not sure...
These are the facts:
We have a web app where users download a PDF document with a form, fill in the form, sign it with their electronic certificate, and upload it back to our environment.
We've seen cases where the uploaded document is signed but shows some fields that have been altered after the signature. If we check the integrity of the PDF signatures, the check shows that data has been altered after the signature, but that the signature itself is fine and valid.
If we right-click on the signature and select "See signed version" we see the real data loaded on the moment of the signature.
Now, this goes against my general perception of electronic signature functionality. If any change is made to the document (or the data loaded into it) after I make a signature, this signature should become invalid, as the document has been altered.
The behaviour of the PDF seems to be different, as not only the signature still is valid, also the "default version" that you see when you open the document is the last one, not the signed one.
Now I'm wondering
Is this some kind of bug or is it expected behaviour?
Is there any place where info on the matter can be found? (Google keeps redirecting me again and again to "how to sign a PDF" articles.)
If this is a defined behaviour, how do you deal with it?
Now, this goes against my general perception of electronic signature functionality. If any change is made to the document (or the data loaded into it) after I make a signature, this signature should become invalid, as the document has been altered.
The behaviour of the PDF seems to be different, as not only the signature still is valid, also the "default version" that you see when you open the document is the last one, not the signed one.
Is this some kind of bug or is it expected behaviour?
It is expected behavior.
You have to be aware of two special factors here:
A PDF signature field contains the information on the byte ranges signed. Obviously not the whole file can be signed, as the signature itself is embedded and cannot be part of the signed bytes. Thus, the signed byte ranges need to be recorded somewhere. Cf. this answer on Information Security Stack Exchange.
Additions to a PDF can be made by appending to the existing document, a process called an incremental update. These updates can again be signed etc.; also cf. the answer referenced above.
Thus, when changes are made to a PDF by means of an incremental update, the existing integrated signatures in the document still correctly sign their respective signed byte ranges. They are still mathematically valid in spite of the added changes.
Furthermore the current contents of a PDF are defined in particular by the newest incremental update, so when you open the document it shows the content including the last changes, not the signed one.
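A rough way to spot the situation described above is to compare where each signature's signed byte ranges end with the actual file length: if bytes follow the end of the last signed range, something was appended after that signature was applied. The sketch below is a heuristic that scans the raw bytes for /ByteRange arrays instead of properly parsing the PDF; the function name signature_coverage is made up for illustration.

```python
import re

# Matches e.g. "/ByteRange [0 840 960 240]" (padding before "]" is tolerated)
BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def signature_coverage(pdf_bytes: bytes):
    """For each /ByteRange found, return (end_of_signed_bytes, covers_whole_file)."""
    results = []
    for match in BYTE_RANGE.finditer(pdf_bytes):
        start1, len1, start2, len2 = (int(g) for g in match.groups())
        end = start2 + len2
        results.append((end, end == len(pdf_bytes)))
    return results


# Synthetic 50-byte "file" whose signature covers everything up to byte 50
signed = b"/ByteRange [0 10 20 30]".ljust(50, b" ")
print(signature_coverage(signed))                # signature covers the whole file
print(signature_coverage(signed + b"appended"))  # bytes were added afterwards
```

Whether the appended bytes contain only allowed changes is a separate question that this check does not answer.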
Now, while this sounds like PDF signatures have no meaning, this is not the case. The specification ISO 32000-1 clearly defines which changes are allowed to be made in an incremental update to a certified (= signed with some special flags) base version of a document, and Adobe in their Acrobat and Reader software have extrapolated restrictions from this for signed but not certified documents, cf. this answer on Stack Overflow.
In particular at most the following changes are allowed:
Adding signature fields
Adding or editing annotations
Supplying form field values
Digitally signing
If this is a defined behaviour, how do you deal with it?
As the documents originate from you, you can start by applying a certification signature to the document which allows as few changes as possible in your use case.
Then you can define signature lock information for the signature fields your users are to sign. In this lock information you can e.g. prescribe that after signing the given signature field, a number of form fields shall be read-only.
Finally you only accept back PDFs which still contain your certification signature and to which no disallowed changes were added.
There actually are numerous PDFs which are certified and contain a number of fields for additional approval signatures, and each of the approval signature fields is coupled with some form fields which will not be editable anymore after signing. After all the signature fields are signed, all fields are read-only.
Is there any place where info on the matter can be found? (Google keeps redirecting me again and again to "how to sign a PDF" articles.)
You should in particular look at the PDF specification ISO 32000-1 and some Adobe documents on the behavior of their software. You'll find links at the bottom of the Stack Overflow documentation page the above-mentioned links point to.

Verifying digital signatures in PDF documents

I'm trying to verify PDF's digital signatures.
I know that when a PDF is signed, a byte range is defined, the certificates get embedded, and from what I've read, the signed message digest and the timestamp are also stored in the PDF.
I already can extract the certificates and validate them.
Now I'm trying to validate the pdf's integrity and my problem is I don't know where the signed message digest is located.
In this sample signed PDF from Adobe (http://blogs.adobe.com/security/SampleSignedPDFDocument.pdf), I can clearly identify the digest since it is down below the embedded certificates: /DigestMethod/MD5/DigestValue/ (line 1520).
But that PDF sample seems to be from 2009, and I suspect the message digest is stored in a different way now, because I signed a PDF with Adobe Reader and also with iText, and I can't find any message digest field like the previous one.
Can someone tell if the digests are now stored in a different way? Where are they located?
Anyway, for now I'm using that sample document from Adobe, and trying to verify its integrity.
I'm getting the document's bytes to be signed according to the specified byte range and digesting them with the MD5 algorithm, but the digest value I get doesn't match the one from the message digest field...
Am I doing something wrong? Is the digest also signed with the signer's private key?
I appreciate any help.
There are numerous details to get right when calculating the hash for integrated PDF signatures, among them:
Extract the correct bytes from the PDF to hash. The ByteRange tells you exactly which byte ranges are signed. To be accepted in modern signing contexts, the ranges must cover the whole PDF file revision with the exception of the value of Contents.
Beware, the value of Contents includes the leading '<' and the trailing '>' brackets.
Don't use a regular text editor or text processing instructions (like readln or writeln) to process PDFs. PDFs are binary in nature, even if they look textual to the naked eye. Copying PDF parts using such text related operations most likely changes them in details, definitively breaking the signature hash value.
When in doubt, don't guess but read the specification. A copy of ISO 32000-1 has been made available by Adobe here, and much of what you need to know about the PDF format to start processing PDFs can be found there and in other public standards referenced in there. A very short introduction to integrated PDF signatures can be found in this answer and documents referenced from there.
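The first points above can be sketched as follows: work on bytes only, take exactly the two ranges named by ByteRange, and check that the gap between them is the Contents value including its '<' and '>'. This is a minimal sketch, assuming the four ByteRange numbers have already been extracted from the signature dictionary; the function name signed_digest and the default hash algorithm are assumptions, and how the resulting digest is compared depends on the signature type and is not covered here.

```python
import hashlib


def signed_digest(pdf_bytes: bytes, byte_range, algorithm: str = "sha256") -> bytes:
    """Hash the signed byte ranges of a PDF held in memory as bytes.

    `byte_range` is the ByteRange array as four integers:
    (start1, length1, start2, length2).
    """
    start1, len1, start2, len2 = byte_range
    # The unsigned gap must be exactly the Contents hex string, brackets included.
    gap = pdf_bytes[start1 + len1:start2]
    if not (gap.startswith(b"<") and gap.endswith(b">")):
        raise ValueError("gap between the signed ranges is not a hex string <...>")
    h = hashlib.new(algorithm)
    h.update(pdf_bytes[start1:start1 + len1])
    h.update(pdf_bytes[start2:start2 + len2])
    return h.digest()


# Tiny synthetic example: "AAAA" and "BBBB" are signed, "<00ff>" is the gap
pdf = b"AAAA<00ff>BBBB"
print(signed_digest(pdf, (0, 4, 10, 4)).hex())
```

When reading a real file, open it in binary mode (open(path, "rb")) so that no line-ending or character-set conversion touches the bytes.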

Hash computation in Google safe browsing V2 implementation

I am trying to test my implementation of the Google Safe Browsing API version 2.
To test a part of my code that sends requests for full hashes for a given prefix, I captured a short session of traffic where I visited a known, currently blacklisted URL, "utfvq.portrelay.com"; Firefox sent a request to Google for full hashes, and Google responded with a list of hashes.
The prefixes Firefox sent are (hex encoded): 2e2e372e,2e26382e,2e2e382e,6545382e
The 4 matching full hashes it received are :
2e26382e2e2e436d2e2e2e2e322e3b2e2e2e2e2e4a2e2e2e7b2e2e2e6a492e2e
6545382e2e2a5b792e652e2e2e2e2e2e2e2e70442e7d2e2e2e222e2e502e2e2e
2e2e382e6c36252e2e522e2e592e2e2e2e2e3f592e2e2e782e2e572e4e2e2e2e
2e2e372e2e2e2e2e55682e542e51622e552e2e68352e2e2e2e2e2e2e2ed2755
In my implementation, however, the hash prefixes I generate do not seem to match the hash prefixes that Firefox sent. Hence, I am not getting any full hash matches in my client.
I have followed Google's description of the API closely and made sure the previous steps, such as URL canonicalization, are implemented properly.
The URLs and the SHA-256 hashes I get are:
utfvq.portrelay.com/ : 5c2383012676e63656c13167e1cc4f55309c4e1b73c22556e36ec1487e8b8697
portrelay.com/ : 842638fe92ee436da7808d0232d03bcaa0f5c8b64ad5eee97bf28dbb6a49f8ae
Can someone point out why the hashes do not match? I have followed the API guide to the best of my knowledge. Is there some implementation detail I am missing?
It turned out to be a basic character encoding error on my part.
The SHA hashes I compute in my code are correct. The way I looked at the hashes that Firefox sent was wrong. I copied the characters from a text file in which every byte outside the regular ASCII range was stored as a dot (.). Then I converted these characters to hex values, which makes for a kind of "lossy" encoding. This is why there were so many "2e" hex values in the hashes. Now I am using just the original bytes, and they match.
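The effect can be reproduced with the numbers from the question. The sketch below takes the SHA-256 value listed above for portrelay.com/, replaces every byte that is not printable ASCII with a dot (0x2e), and hex-encodes the result; the outcome is the first of the four "full hashes" from the capture. That the text dump kept exactly the range 0x20 to 0x7e is an assumption, although it is consistent with the data shown; the function name mask_non_printable is made up for illustration.

```python
def mask_non_printable(raw: bytes) -> bytes:
    """Replace every byte outside printable ASCII (0x20..0x7e) with '.' (0x2e)."""
    return bytes(b if 0x20 <= b <= 0x7E else 0x2E for b in raw)


# SHA-256 of "portrelay.com/" as given in the question
real_hash = bytes.fromhex(
    "842638fe92ee436da7808d0232d03bcaa0f5c8b64ad5eee97bf28dbb6a49f8ae"
)

masked = mask_non_printable(real_hash)
print(masked.hex())
# → 2e26382e2e2e436d2e2e2e2e322e3b2e2e2e2e2e4a2e2e2e7b2e2e2e6a492e2e
```

The real 4-byte prefix of that hash is 842638fe, which the text dump turned into 2e26382e.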