Unknown encoding used in PDF strings - pdf

I am writing code to extract URLs from PDF files. In most files, the URLs appear as plain ASCII. However, in some PDF files, such as the PDF specification itself (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf), the URLs appear in hexadecimal form with seemingly no structure.
For example, in the above file, in the main metadata, the author appears as:
/Author <F240D629CD72348F>
This is decoded by Atril and other PDF viewers as "Jim King". The hexadecimal strings are double the length of the literal value as expected, but scrambled beyond recognition. Assuming a 1:1 mapping of byte value to characters, the "i" is encoded both as 0x40 and 0x72.
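The observation about the repeated "i" can be checked with a short Python sketch (the variable names are illustrative). It pairs each byte of the hex string with the corresponding displayed character and collects the byte values seen per character; a character that maps to two different bytes rules out a simple byte-to-character table:

```python
from collections import defaultdict

# Values taken from the /Author entry quoted above.
hex_string = "F240D629CD72348F"
displayed = "Jim King"

raw = bytes.fromhex(hex_string)
assert len(raw) == len(displayed)  # one byte per displayed character

# Collect every byte value observed for each displayed character.
seen = defaultdict(set)
for byte, char in zip(raw, displayed):
    seen[char].add(byte)

# Keep only the characters that map to more than one byte value.
ambiguous = {c: sorted(hex(b) for b in bs) for c, bs in seen.items() if len(bs) > 1}
print(ambiguous)  # {'i': ['0x40', '0x72']}
```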
Actual URL value:
<EB345AA632781A90E90781A4A0BF42680D1F1AD67910B293798B0AFFED8407CE12684F21B7F471D96DCE4864CAB970A98E7F911C207A12C6E6900D789BC13AE87E76A9D6B8EDDADE7A53EAA521E6421295EA31305C>
Should decode to:
http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=51502
I have also looked at the PyPDF2 source code, which manages to decode these strings, but I have not found the answer there.
How do I find the encoding used for annotations in a PDF document?

The example PDF is encrypted, as you can tell from its trailer, which contains an Encrypt entry:
/Encrypt 126988 0 R
Thus, all strings and streams in that PDF (with very few exceptions) are encrypted.
(If you wonder why you don't have to enter a password when opening the file: the PDF standard defines a default password which a PDF processor tries before asking the user to supply one. This default password is used here.)
So before analyzing the strings you have to decrypt them. If you don't want to implement the decryption yourself, you can use a tool like qpdf to decrypt the file before your code processes it.
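As a rough illustration of the check described above, the following Python sketch searches the raw file bytes for an indirect Encrypt reference. The function name is made up for this example, and the regular expression is only a heuristic: it assumes the entry is written as an indirect reference (as in the example file) and does not parse the PDF structure.

```python
import re

# Matches an indirect reference such as "/Encrypt 126988 0 R".
# Assumption: the entry is an indirect reference; a file that embeds the
# encryption dictionary directly ("/Encrypt << ... >>") is not detected.
_ENCRYPT_REF = re.compile(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R")

def find_encrypt_reference(pdf_bytes):
    """Return (object_number, generation) of the Encrypt entry, or None."""
    match = _ENCRYPT_REF.search(pdf_bytes)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Synthetic trailer fragment modelled on the entry quoted above.
sample_trailer = b"trailer\n<< /Size 12 /Encrypt 126988 0 R /Root 1 0 R >>"
print(find_encrypt_reference(sample_trailer))  # (126988, 0)
```

If the entry is present, a command along the lines of `qpdf --decrypt in.pdf out.pdf` should produce a copy whose strings can be read directly.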

Related

TCPDF SetProtection method is not working as expected

I started writing here:
PHP PDF password protection (no open without password)
But I can't add comments there due to my reputation here (I'm better on AskUbuntu but I can't take my rep points from there). I also started a bounty there, and if someone answers here within two days with an acceptable solution, I will award it there.
Now, the problem: SetProtection method is not working as expected.
Wanted behaviour: create a protected/encrypted PDF document with TCPDF library so that the document view is always granted to everyone without asking any password, but if one tries to edit, a password is requested.
I use the following syntax:
$pdf->SetProtection(array('modify', 'copy', 'annot-forms', 'fill-forms', 'extract', 'assemble'), null, 'mypwd', 1);
I can open the file with a pdf viewer as expected.
If I try to open the file with LibreOffice Draw, the password is requested (as expected), but I'm able to edit the document BOTH with mypwd (expected) AND with a blank password (NOT expected).
What is the right syntax, if any, to have pdf readable by everyone BUT editable ONLY with "mypwd" provided?
EDIT:
Here is a file with a blank user password and a strong master password. Ilovepdf.com reports it as UNLOCKED, and LibreOffice Draw can edit it.
This is NOT the expected behaviour.
https://www.dropbox.com/s/864p8xjh1ue041z/tracking_12750_16.pdf?dl=0
As far as I can see your example PDF is encrypted just the way you wanted, with an empty user password and a non-empty owner password. Thus, TCPDF does just what it was asked to do.
Most likely the problem is that your expectation is too strong: If a program can open a PDF for reading, that program can do anything with the PDF, no matter how restricted it is configured to be. The permissions and different owner and user roles require the cooperation of the software in question, they are not technically enforced.
This already is clear from the specification:
Once the document has been opened and decrypted successfully, a PDF reader technically has access to the entire contents of the document. There is nothing inherent in PDF encryption that enforces the document permissions specified in the encryption dictionary. PDF readers shall respect the intent of the document creator by restricting user access to an encrypted PDF file according to the permissions contained in the file.
(ISO 32000-2, section 7.6.4 Standard security handler)
Apparently LibreOffice Draw simply does not behave as required by the PDF specification, i.e. it is not properly restricting user access to an encrypted PDF file according to the permissions contained in the file. This may be by design, or it may just be a programming glitch.
You should simply be aware that your expectation to
create a protected/encrypted PDF document with TCPDF library so that the document view is always granted to everyone without asking any password, but if one tries to edit, a password is requested.
cannot be implemented using standard PDF encryption facilities for arbitrary PDF processors, merely for those that follow the PDF specification requirement quoted above.
There are some providers of PDF DRM software solutions which are not so easy to circumvent, but I doubt any of them can withstand a determined hacker. (Unless the solution in question is not giving the PDF to the user at all but only images in a custom, webservice-based viewer; but this is not your use case.)
Depending on your actual requirements, you might want to look into using digital signatures instead of encryption; if your objective is to let any recipient be sure that they received your document contents and not something someone else edited into it, signatures appear more appropriate.
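To illustrate why the permissions are only advisory: in the standard security handler they are stored as flag bits in the P entry of the encryption dictionary, which any program that has decrypted the file can read or simply ignore. The Python sketch below decodes such a value. The bit positions are the ones I recall from the permissions table of ISO 32000; the labels reuse the names from the SetProtection call above where they exist, and 'print' and 'print-high' are labels chosen for this example. The value -1340 is purely illustrative.

```python
# Bit positions (1-based, bit 1 is the lowest bit) of the user access
# permissions in the /P entry. Labels are illustrative names, not spec terms.
PERMISSION_BITS = {
    3: "print",
    4: "modify",
    5: "copy",
    6: "annot-forms",
    9: "fill-forms",
    10: "extract",
    11: "assemble",
    12: "print-high",
}

def granted_permissions(p_value):
    """Decode a signed 32-bit /P value into the set of granted permissions."""
    unsigned = p_value & 0xFFFFFFFF  # two's-complement view of the signed integer
    return {name for bit, name in PERMISSION_BITS.items()
            if unsigned & (1 << (bit - 1))}

# Illustrative value: only printing and accessibility extraction are granted.
print(sorted(granted_permissions(-1340)))  # ['extract', 'print', 'print-high']
```

Nothing in the decryption process depends on these bits, which is why a tool that chooses not to check them can still edit the document.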

How to make a generated pdf non editable in Objective C?

I have a banking client for whom I have designed an iOS app that populates all the client details onto the account-opening application PDF forms and generates the final PDF with those details. I am generating the PDF using CoreGraphics. But the PDF is editable in Adobe Acrobat Pro, so they are able to edit the contents of the application form. Is there any way to restrict editing of the PDF after it is generated with CoreGraphics? I have encrypted the PDF with a password, but the client needs the PDF to be non-editable.
See Protecting PDF Content
Reading between the lines a bit, because the docs are not overly clear, I think that when you create the PDF context using CGPDFContextCreate(), you pass a dictionary as its auxiliaryInfo, using the key kCGPDFContextOwnerPassword and a value that's some arbitrary password string. This encrypts the document so that only the owner (the people with that password) can work with the contents. It doesn't explicitly say that it prevents editing, but I'm guessing that's implied, because it lists special keys to block printing and copying (preventing editing seems like the thing one would always want when encrypting a PDF).

PDF cannot be read

I programmatically created a PDF with object streams and encryption. Several PDF viewers can read it, but some fail.
PDF readers that can read it:
Foxit
Google Chrome
Nuance
Nitro
pdf.js
PDF readers that cannot read it:
Adobe Reader
PDF X/Change
I currently can't see what the problem inside the PDF is. Can anyone help? The PDF can be downloaded at https://www.doxisafe.me/#!/retrieve/ivqkli
The PDF is encrypted with an owner password "owner" and no user password.
Today I found a solution: Adobe simply requires that the Catalog dictionary not be in an object stream when the file is encrypted. This does not follow the PDF spec, which states that only the following objects shall not be stored in an object stream:
Stream objects
Objects with a generation number other than zero
A document’s encryption dictionary (see 7.6, "Encryption")
An object representing the value of the Length entry in an object stream dictionary
In linearized files (see Annex F), the document catalog, the linearization dictionary, and page objects shall not appear in an object stream.
My file is not linearized, so the last condition shall not apply.

How to tell what encryption is used in PDF?

I'm trying to read PDF in my app. I've stumbled upon AES-encrypted PDFs.
As far as I understood from the PDF reference, in the AES scheme, strings and streams are padded to a multiple of 16 bytes. The first 16 bytes are the IV in unencrypted form.
However, in my PDF, strings have lengths that are not multiples of 16. Decryption of such strings predictably fails.
Here are the details of the PDF:
Header:
%PDF-1.6
Encrypt dictionary:
<</Length 128/Filter/Standard/O(...)/P -1340/R 3/U(...)/V 2>>
So, PDF version is 1.6, encryption version is 2 and revision is 3 - under this condition, PDF reference states that AES scheme is used.
Example of encoded string:
Author(Í¡c$5N8 ¶ŽÜß*绫ÈÙÏ)
which has the length of 29.
Note that I'm able to verify user password as stated in algorithm "Authenticating the user password", and it shows to be valid.
EDIT:
The only logical answer would be "Because it is not AES-encrypted". Indeed, when forcing encryption to RC4, PDF decrypts just fine.
So the question is rather "How to tell if AES or RC4 is used?"
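As a hedged sketch of how that decision could be made: to my understanding of the PDF reference, the V entry selects the scheme (1 and 2 mean RC4; 4 and later mean that crypt filters apply, with the method named by the CFM entry of the filter referenced by StrF for strings). The function below assumes the encryption dictionary has already been parsed into a plain Python dict with keys written without the leading slash; that representation and the function name are assumptions of this example, not the API of any library.

```python
def string_cipher(encrypt_dict):
    """Guess the cipher used for strings from a parsed encryption dictionary."""
    v = encrypt_dict.get("V", 0)
    if v in (1, 2):
        # V 1: RC4 with a 40-bit key; V 2: RC4 with the key length from /Length.
        return "RC4"
    if v in (4, 5):
        # Crypt filters: /StrF names an entry in /CF whose /CFM gives the method.
        filter_name = encrypt_dict.get("StrF", "Identity")
        if filter_name == "Identity":
            return "none"  # strings are not encrypted
        cfm = encrypt_dict.get("CF", {}).get(filter_name, {}).get("CFM")
        methods = {"V2": "RC4", "AESV2": "AES-128", "AESV3": "AES-256"}
        return methods.get(cfm, "unknown")
    return "unknown"

# The dictionary quoted in the question (O and U values omitted).
question_dict = {"Filter": "Standard", "Length": 128, "P": -1340, "R": 3, "V": 2}
print(string_cipher(question_dict))  # RC4
```

For the dictionary in the question this yields RC4, which matches the observation in the edit that forcing RC4 makes decryption work.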

Verifying digital signatures in PDF documents

I'm trying to verify PDF's digital signatures.
I know that when a PDF is signed, a byte range is defined, the certificates get embedded, and from what I've read, the signed message digest and the timestamp are also stored in the PDF.
I already can extract the certificates and validate them.
Now I'm trying to validate the pdf's integrity and my problem is I don't know where the signed message digest is located.
In this sample signed PDF from Adobe (http://blogs.adobe.com/security/SampleSignedPDFDocument.pdf), I can clearly identify the digest, since it appears below the embedded certificates: /DigestMethod/MD5/DigestValue/ (line 1520).
But that PDF sample seems to be from 2009, and I suspect the message digest is stored in a different way now, because I signed a PDF with Adobe Reader and also with iText, and I can't find any message digest field like the previous one.
Can someone tell if the digests are now stored in a different way? Where are they located?
Anyway, for now I'm using that sample document from Adobe, and trying to verify its integrity.
I'm getting the document's bytes to be signed according to the specified byte range and digesting them with the MD5 algorithm, but the digest value I get doesn't match the one in the message digest field...
Am I doing something wrong? Is the digest also signed with the signer's private key?
I appreciate any help.
There are numerous details to get right when calculating the hash for integrated PDF signatures, among them:
Extract the correct bytes from the PDF to hash. The ByteRange tells you exactly which byte ranges are signed. To be accepted in modern signing contexts, the ranges must cover the whole PDF file revision with the exception of the value of Contents.
Beware, the value of Contents includes the leading '<' and the trailing '>' brackets.
Don't use a regular text editor or text processing instructions (like readln or writeln) to process PDFs. PDFs are binary in nature, even if they look textual to the naked eye. Copying PDF parts using such text related operations most likely changes them in details, definitively breaking the signature hash value.
When in doubt, don't guess but read the specification. A copy of ISO 32000-1 has been made available by Adobe here, and much you need to know about the PDF format to start processing them can be found there and in other public standards referenced in there. A very short introduction to integrated PDF signatures can be found in this answer and documents referenced from there.
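The first two points can be sketched in Python as follows. This is a minimal illustration under several assumptions: the function names are made up, only the first ByteRange in the file is considered, and the digest algorithm is a parameter because the algorithm actually used depends on the signature and is not determined here.

```python
import hashlib
import re

_BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")

def parse_byte_range(pdf_bytes):
    """Return the first /ByteRange found as (start1, length1, start2, length2)."""
    match = _BYTE_RANGE.search(pdf_bytes)
    if match is None:
        raise ValueError("no /ByteRange entry found")
    return tuple(int(group) for group in match.groups())

def signed_bytes(pdf_bytes, byte_range):
    """Concatenate the two signed ranges, skipping the Contents value."""
    start1, length1, start2, length2 = byte_range
    gap = pdf_bytes[start1 + length1:start2]
    # The skipped gap should be exactly the Contents value, brackets included.
    if not (gap.startswith(b"<") and gap.endswith(b">")):
        raise ValueError("gap between the ranges is not a hex string")
    return pdf_bytes[start1:start1 + length1] + pdf_bytes[start2:start2 + length2]

def digest_signed_ranges(path, algorithm="sha256"):
    """Hash the signed ranges of the file at `path` (algorithm is an assumption)."""
    with open(path, "rb") as handle:  # binary mode, never text mode
        data = handle.read()
    return hashlib.new(algorithm, signed_bytes(data, parse_byte_range(data))).hexdigest()
```

Reading the file in binary mode and slicing the bytes directly avoids the text-processing pitfalls mentioned above.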