Tool / Command for decrypting the source code of a PDF file?

I am using the qpdf command to view the raw code (source code) of PDF files. Specifically I am using the command:
qpdf --qdf original.pdf unpacked.pdf
However, much of the PDF metadata in this unpacked file appears to be encrypted, and the file contains a lot of unprintable characters. I am interested in some data in PDF files that is actually encrypted. Assuming that I have the password for the PDF file (say pwd="passwd"), how can I get output similar to that of the qpdf command, but with the data decrypted?
Edit:
An example file is attached in the link. Please check lines 1841 - 3258. Specifically, in the whole file I am not able to find the TransformParams dictionary, although I have added permissions. I believe it may be inside this encrypted text.
Link:
https://www.mediafire.com/file/b7rf383zxdevgmx/unpacked.txt/file

As already assumed in a comment to the question, the PDF file is not encrypted at all.
Please check lines 1841 - 3258
The lines 1841 - 3258 are part of a stream running from line 1739 (OTTO...) to line 3258 and contain an embedded OpenType font; compare the preceding stream dictionary
57 0 obj
<<
/Subtype /OpenType
/Length 58 0 R
>>
and the font descriptor referring to it:
<<
/Ascent 952
/CapHeight 674
/CharSet (/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quotesingle/parenleft/parenright/asterisk/plus/comma/hyphen/period/slash/zero/one/two/three/four/five/six/seven/eight/nine/colon/semicolon/less/equal/greater/question/at/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/U/V/W/X/Y/Z/bracketleft/backslash/bracketright/asciicircum/underscore/grave/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/braceleft/bar/braceright/asciitilde/bullet/Euro/bullet/quotesinglbase/florin/quotedblbase/ellipsis/dagger/daggerdbl/circumflex/perthousand/Scaron/guilsinglleft/OE/bullet/Zcaron/bullet/bullet/quoteleft/quoteright/quotedblleft/quotedblright/bullet/endash/emdash/tilde/trademark/scaron/guilsinglright/oe/bullet/zcaron/Ydieresis/space/exclamdown/cent/sterling/currency/yen/brokenbar/section/dieresis/copyright/ordfeminine/guillemotleft/logicalnot/hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior/acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine/guillemotright/onequarter/onehalf/threequarters/questiondown/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute/thorn/ydieresis)
/Descent -250
/Flags 32
/FontBBox [
-157
-250
1126
952
]
/FontFamily (Myriad Pro)
/FontFile3 57 0 R
/FontName /MyriadPro-Regular
/FontStretch /Normal
/FontWeight 400
/ItalicAngle 0
/StemV 88
/Type /FontDescriptor
/XHeight 484
>>
Specifically, in the whole file I am not able to find the TransformParams dictionary, although I have added permissions.
Well, the shared version of the file neither is encrypted (so no permissions have to be applied) nor is it digitally signed (so in particular there are no signature transform methods applied, so no TransformParams are there).
Maybe the information you are looking for was removed when qpdf uncompressed the PDF, or maybe it was never there to start with. Thus, you should probably analyze the original file instead. Or you may want to explain your expectations more thoroughly; maybe there is an error in them.
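For a file that really is encrypted, qpdf can be given the password and asked to decrypt while writing QDF output. The following is a minimal sketch, not a definitive tool: the helper names qpdf_decrypt_args and find_names are made up for this illustration, and the byte scan only finds names that are stored uncompressed (as in QDF output), so a miss does not prove the entry is absent from the original file.

```python
import re

def qpdf_decrypt_args(src, dst, password):
    """Build a qpdf command line that decrypts and writes QDF output."""
    return ["qpdf", f"--password={password}", "--decrypt", "--qdf", src, dst]

def find_names(pdf_bytes, names=(b"/Encrypt", b"/TransformParams")):
    """Report which PDF names occur as whole tokens in the raw bytes."""
    found = {}
    for name in names:
        # Require a non-name character next, so /Encrypt does not
        # match the longer name /EncryptMetadata.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9_.+-])"
        found[name.decode()] = re.search(pattern, pdf_bytes) is not None
    return found
```

The argument list can be passed to subprocess.run, and find_names can be applied to the bytes of the resulting unpacked.pdf to check for an /Encrypt entry or a TransformParams dictionary.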

Related

PDF Signature: "Expected a dict object"

I'm creating a library for digitally signing a PDF document. During my quest I stumbled upon another problem.
In Acrobat I'm getting the error:
Error during signature verification.
Adobe Acrobat error.
Expected a dict object.
I know it expects a dictionary object somewhere. But I have no idea where.
This problem shows up when I add the image to the AP of the signature.
For this I'm basing my implementation on the spec and on "Insert multiple digital approval signatures without invalidating the previous one".
Most of this seems to work correctly, but when the image is present it results in the error. The image is correctly visible.
Current working:
(This is a very short overview of the part where the error is, it might be slightly different, but hope this helps)
I update the signature annotation. Add link to object that contains normal appearance.
16 0 obj
<<
/Type/Annot
/Subtype/Widget
...snip...
/AP<<
/N 21 0 R
>>
>>
Add image as XObject
20 0 obj
<<
/Type/XObject
/Subtype/Image
...snip...
/Length 29569
>>
stream
...snip...
endstream
endobj
Add XObject (Normal appearance)
21 0 obj
<<
/Type/XObject
/Subtype/Form
/Resources<<
/XObject<<
/UserSignature272 20 0 R
>>
>>
/BBox[0 0 135 37.5]
/Length 44
>>stream
q
135 0 0 37.5 0 0 cm
/UserSignature272 Do
Q
endstream
endobj
I think the problem happens somewhere in obj (21 0), but I'm not sure.
Here is a minimal file that can be used for testing.
https://drive.google.com/file/d/17sdz2xJy3VhN6i9YiuPrJ6x2s5kU2sra/view?usp=sharing
Any help, or hints would be welcome.
(This post is a continuation of PDF Digital Signature has "Bad parameter" in Acrobat, but is about a different problem, same subject area.)
You're running into a bug of Adobe Acrobat here: if you display an XObject from inside your signature appearance stream, it expects that XObject to have a Resources entry. This may make sense for form XObjects, but it doesn't for image XObjects like yours.
A work around is to add an empty Resources dictionary to your image XObject.
I checked this by replacing the /BBox[1 0 0 1 0 0] in your image XObject (which is not needed there anyways) by /Resources<< >>.
When Adobe Acrobat creates its own signature appearances, it creates a hierarchy of form XObjects with Resources dictionaries all over, including those for the "layers". I assume Adobe Reader, seeing the Do operator, attempts to collect information on such "layers" and does not expect to be confronted immediately with an image XObject.
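In a signing library the workaround amounts to emitting one extra entry when the image XObject dictionary is written. A minimal sketch, assuming the dictionary is available as text before the file is assembled; ensure_resources is a hypothetical helper, and its string handling is deliberately naive:

```python
import re

def ensure_resources(dict_src: str) -> str:
    """Add an empty /Resources entry to a dictionary that has none."""
    if re.search(r"/Resources(?![A-Za-z0-9])", dict_src):
        return dict_src  # already present, leave untouched
    body = dict_src.rstrip()
    if not body.endswith(">>"):
        raise ValueError("expected a dictionary ending in '>>'")
    # Insert the entry just before the closing '>>' of the dictionary.
    return body[:-2] + "/Resources<< >>\n>>"
```

Because this changes the size of the object, it has to happen before cross-reference offsets and the signature byte range are computed.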

How to test ClamAV service for potential threats

As part of an enterprise software project, our application connects to an antivirus service backed by ClamAV, using ICAP as the communication protocol. I would like to test the antivirus service's response to malicious documents but, of course, I cannot use a document which is actually infected with something malicious. I found the EICAR Anti Malware Testfile, but it only seems to come as either a .txt or a .zip, and the system only allows upload of Word or PDF documents. The antivirus service only recognizes EICAR if it is sent "as-is", not when it is embedded inside a Word or PDF document.
My question is: how can I create a Word and/or PDF document that ClamAV recognizes as a threat even though it is actually harmless?
I initially suggested
Since docx is a zip, you could try renaming eicar.zip to eicar.docx. It proves only that a docx is reviewed/scanned similarly to a zip, not that the AV can detect malicious VBA macros, which would be a different payload.
However, the uploading step, involving Apache Tika file verification, blocked that simplistic approach, as the file type was not as expected.
My second suggestion was
Take a valid docx, rename it to zip, drop the eicar text into it with Explorer (or use zip add), and rename it back to docx, as that's likely to bypass Tika checking.
Apparently that worked.
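The same steps can be scripted instead of done by hand in Explorer. This is only a sketch under assumptions: add_member is a hypothetical helper, the payload (for example the contents of eicar.txt) is passed in by the caller rather than embedded here, and the existing members are copied first so the document parts stay in their original order.

```python
import io
import zipfile

def add_member(docx_bytes: bytes, name: str, payload: bytes) -> bytes:
    """Return a copy of a .docx (ZIP) archive with one extra member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        # Copy the original parts unchanged and in the same order.
        for item in src.infolist():
            dst.writestr(item, src.read(item.filename))
        # Store the payload uncompressed so its bytes appear verbatim.
        dst.writestr(name, payload, compress_type=zipfile.ZIP_STORED)
    return out.getvalue()
```

Reading a valid .docx from disk, calling add_member(data, "eicar.txt", payload), and writing the result back with a .docx extension gives a file equivalent to the manual procedure.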
Likewise, it should be possible to embed eicar.txt inside a PDF. However, detection again would not mean the AV is scanning for JavaScript exploitation, just that the plain-text signature is seen in a PDF file, so it only hints that PDFs are scanned.
This is more difficult because PDF content is usually encoded or encrypted, but a text file attachment hand-crafted in an editor need not be encoded; it can simply be stored as plain text, which is sufficient for the EICAR trigger to be seen.
It could look something like the following, but copying and pasting this binary file shown as text will likely fail when saved as eicar.pdf because of line-ending and character-encoding changes, so grab a binary copy from the link below.
%PDF-1.4
%µ¶
1 0 obj
<</Pages 2 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[3 0 R]/Type/Pages>>
endobj
3 0 obj
<</Contents 4 0 R/MediaBox[0 0 500 800]/Parent 2 0 R/Resources<</Font<</F1 5 0 R>>>>/Type/Page>>
endobj
4 0 obj
<</Length 57>>
stream
q BT /F1 24 Tf 1 0 0 1 50 720 Tm (Hello World!) Tj ET Q
endstream
endobj
5 0 obj
<</BaseFont/Courier/Subtype/Type1/Type/Font>>
endobj
xref
0 6
0000000000 65536 f
0000000016 00000 n
0000000062 00000 n
0000000114 00000 n
0000000227 00000 n
0000000333 00000 n
trailer
<</Size 6/Root 1 0 R/ID[<89311A609A751F1666063E6962E79BD5><FDDAE606D8247DFCBA7D13E1833DEDE3>]>>
startxref
395
%%EOF
%X5O!P%#AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
%%EOF
A binary copy is temporarily available from https://gofile.io/d/53fylg (assuming your antivirus allows the download :-). Try saving the download as text; otherwise I will need to upload it as a RAR archive.
However, those two "positives" would serve just as well as telltales that an AV is searching those file types for currently known exploits.
I recommend downloading the live, script-running version at the bottom of this article for deeper testing.
https://blog.didierstevens.com/2015/08/28/test-file-pdf-with-embedded-doc-dropping-eicar/
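Since pasting the listing above tends to break the byte offsets, one alternative is to generate such a file and let the code compute the cross-reference table. The following is a rough sketch, not the tool used for the sample: build_pdf is a hypothetical helper, the test string is supplied by the caller as trailing_comment, and the output omits the /ID entry and the second %%EOF seen in the listing.

```python
def build_pdf(objects, trailing_comment=b""):
    """Assemble a minimal PDF from object bodies numbered 1..n."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<</Size %d/Root 1 0 R>>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    if trailing_comment:
        # A '%' comment line after the end-of-file marker, stored as plain bytes.
        out += b"%" + trailing_comment + b"\n"
    return bytes(out)

content = b"q BT /F1 24 Tf 1 0 0 1 50 720 Tm (Hello World!) Tj ET Q"
OBJECTS = [
    b"<</Pages 2 0 R/Type/Catalog>>",
    b"<</Count 1/Kids[3 0 R]/Type/Pages>>",
    b"<</Contents 4 0 R/MediaBox[0 0 500 800]/Parent 2 0 R"
    b"/Resources<</Font<</F1 5 0 R>>>>/Type/Page>>",
    b"<</Length %d>>\nstream\n" % len(content) + content + b"\nendstream",
    b"<</BaseFont/Courier/Subtype/Type1/Type/Font>>",
]
```

Writing build_pdf(OBJECTS, payload) to disk in binary mode ("wb") avoids the line-ending problem mentioned above.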

Why does ghostscript replace fontnames to "CairoFont"?

I use ghostscript to optimize pdf files (mostly with respect to size), for which it does a great job. The command that I use is:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
-dCompatibilityLevel=1.4 -sOutputFile=out.pdf in.pdf
However, it seems that this replaces the fonts (or subsets them) and does not preserve their names. It replaces them with CairoFont. How could I get ghostscript to preserve the font names?
Example:
A simple pdf file (created with Inkscape), with a single text element in it (Nimbus Roman) as an input (in.pdf):
for which pdffonts reports:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
PMLNBT+NimbusRomanNo9L Type 1 yes yes yes 5 0
However, after running ghostscript over the file pdffonts reports:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
OEPSCM+CairoFont-0-0 Type 1C yes yes no 8 0
So, is there a way to have ghostscript (or libcairo?) preserve the name of the font?
The input file is uploaded here.
Ghostscript doesn't change the font name, but there are, in fact, several different font 'names' in a PDF file.
In the case of your file the PDF FontDescriptor object has a name
<<
/Type /FontDescriptor
/FontName /PMLNBT+NimbusRomanNo9L
/Flags 4
/FontBBox [ -168 -281 1031 924 ]
/ItalicAngle 0
/Ascent 924
/Descent -281
/CapHeight 924
/StemV 80
/StemH 80
/FontFile 7 0 R
>>
which refers to a FontFile stream
/FontFile 7 0 R
That stream contains the following:
%!PS-AdobeFont-1.0: NimbusRomNo9L-Regu 1.06
%%Title: NimbusRomNo9L-Regu
%Version: 1.06
%%CreationDate: Thu Aug 2 13:14:49 2007
%%Creator: frob
%Copyright: Copyright (URW)++,Copyright 1999 by (URW)++ Design &
%Copyright: Development; Cyrillic glyphs added by Valek Filippov (C)
%Copyright: 2001-2005
% Generated by FontForge 20070723 (http://fontforge.sf.net/)
%%EndComments
FontDirectory/NimbusRomNo9L-Regu known{/NimbusRomNo9L-Regu findfont dup/UniqueID known pop false {dup
/UniqueID get 5020931 eq exch/FontType get 1 eq and}{pop false}ifelse
{save true}{false}ifelse}{false}ifelse
11 dict begin
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0 ]readonly def
/FontName /CairoFont-0-0 def
Do you see the FontName in the actual font? It's called CairoFont-0-0.
This brings me back to a point which I reiterate frequently here and elsewhere; when you process a PDF file with Ghostscript and emit a new PDF file using the pdfwrite device you are not 'optimising', 'converting', 'subsetting' or in a general sense manipulating the content of the original PDF file.
What Ghostscript does is interpret the PDF file; this produces a set of marking operations (such as 'stroke', 'fill', 'image' etc.) which it sends to the selected Ghostscript device. Most Ghostscript devices will then use the graphics library to render the operations to a bitmap and, when the page is complete, will write the bitmap to a file. The 'high level' or 'vector' devices instead repackage the operations into another Page Description Language. In the case of pdfwrite, that's a PDF file.
What this means in practice is that the emitted PDF file has nothing (apart from appearance) in common with the original PDF file. In particular the description of the objects may be different.
So in your case, the pdfwrite device doesn't know what the font was called in the original PDF object. It does know that the font that was defined was called CairoFont-0-0, so that's what it calls the font when it emits it.
Frankly this is another piss-poor example from Cairo, to go along with defining each page as containing transparency whether it does or not. The FontName in the Font object is supposed to be the same as the name in the Font stream.
It's pretty clear that the FontName has been altered, given the rest of the boilerplate there.
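The mismatch between the two names can be checked mechanically. A small sketch, assuming the FontDescriptor and the clear-text header of the decoded FontFile stream are already available as strings; the three function names are made up for this illustration:

```python
import re

# A PDF/PostScript name runs until whitespace or a delimiter character.
NAME = r"/FontName\s*/([^\s/<>\[\]()]+)"

def descriptor_font_name(descriptor: str):
    """FontName as written in the PDF FontDescriptor dictionary."""
    m = re.search(NAME, descriptor)
    return m.group(1) if m else None

def program_font_name(font_program: str):
    """FontName as defined inside the Type 1 font program header."""
    m = re.search(NAME + r"\s+def", font_program)
    return m.group(1) if m else None

def strip_subset_tag(name: str) -> str:
    """Drop a six-letter subset prefix such as 'PMLNBT+'."""
    return re.sub(r"^[A-Z]{6}\+", "", name)
```

For the file in question the descriptor yields PMLNBT+NimbusRomanNo9L while the font program yields CairoFont-0-0, and it is the latter that pdfwrite sees when it interprets the font.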

How is password removed from a pdf file programmatically?

One password-protected PDF I encountered has the following trailer and encryption dictionary:
Trailer Dictionary:
trailer
<<
/Encrypt 64 0 R
/Info 65 0 R
/Root 63 0 R
/Size 66
/ID [xxxxxxxx]>>
Encryption Dictionary:
64 0 obj
<<
/R 3
/P -3904
/O (xxxxxxxxxxxxx)
/Filter /Standard
/Length 128
/V 2
/U (/xxxxxxxxxxxxx) >>
endobj
In comments the OP clarified that by not using any software he meant
Any software is also a code by which we remove password. I want internal working of that code i.e how that software is removing password, what it is actually doing internally.
Thus, this question is not about manually removing PDF password protection but about understanding how PDF password protection is removed programmatically.
PDF passwords are applied by encrypting nearly all strings and streams in the PDF and adding the information the OP already identified. Consequently, PDF passwords are removed by decrypting the formerly encrypted strings and streams in the PDF and removing the added information.
The details of this are explained in section 7.6 Encryption in the PDF specification ISO 32000-1 and are too extensive for an answer on Stack Overflow. Fortunately, Adobe has provided a free copy of that specification, missing only the ISO logo and copyright notices, here, in which one can study the section in question and more.
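To give a flavour of what section 7.6 describes, here is a condensed sketch for the case shown in the question (/Filter /Standard, /V 2, /R 3, /Length 128, i.e. RC4). It is an illustration under assumptions, not a complete implementation: the function names are made up, the password is not verified against /U, and parsing the file, walking its objects, and removing /Encrypt from the trailer are left out.

```python
import hashlib
import struct

# Fixed 32-byte padding string from ISO 32000-1, section 7.6.3.3.
PAD = bytes.fromhex(
    "28BF4E5E4E758A4164004E56FFFA0108"
    "2E2E00B6D0683E802F0CA9FE6453697A"
)

def file_key(password: bytes, o: bytes, p: int, first_id: bytes,
             key_bits: int = 128) -> bytes:
    """Derive the file encryption key (Algorithm 2, revision 3)."""
    n = key_bits // 8
    padded = (password + PAD)[:32]
    # /P is hashed as a 4-byte little-endian value.
    p_bytes = struct.pack("<I", p & 0xFFFFFFFF)
    digest = hashlib.md5(padded + o + p_bytes + first_id).digest()
    for _ in range(50):  # revision 3 re-hashes the digest 50 times
        digest = hashlib.md5(digest[:n]).digest()
    return digest[:n]

def object_key(key: bytes, obj_num: int, gen_num: int) -> bytes:
    """Derive the per-object key (Algorithm 1, RC4 case)."""
    material = key + struct.pack("<I", obj_num)[:3] + struct.pack("<H", gen_num)
    return hashlib.md5(material).digest()[:min(len(key) + 5, 16)]

def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)

def decrypt_object_data(key, obj_num, gen_num, data):
    """Decrypt one string or stream belonging to object obj_num gen_num."""
    return rc4(object_key(key, obj_num, gen_num), data)
```

With the user password, the /O string, the /P value (-3904 above) and the first element of /ID, file_key yields the key from which every string and stream is decrypted object by object.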

Vertically backward JPEG (or other) images in PDF streams

I am trying to understand the PDF format, and so far I have seen that images can appear in JPG format (but apparently without the "JFIF" bytes).
Or, is it something different but very similar to JPEG?
I am seeing that images appear flipped vertically.
What is the purpose of this flipping?
Can images appear normally in a PDF stream?
Here is an example from a PDF. As far as I know, this uploaded file is identical to the one contained in the PDF file, packed like this (using NASM assembly data definitions to make it simple):
db "19 0 obj",0x0D
db "<< /Type /XObject /Subtype /Image /Width 367 /Height 475 /BitsPerComponent 8 ",0x0D
db "/ColorSpace 12 0 R /Length 37575 /Filter /DCTDecode >> ",0x0D
db "stream",0x0D,0x0A
incbin "19_0_obj.bin.jpg"
db "endstream",0x0D
db "endobj",0x0D
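Whether the bytes of such a stream form an ordinary JPEG can be checked directly. This is a small sketch under assumptions: the helper names are made up, the object must carry a direct /Length (not an indirect reference), and the file must not be encrypted. A /DCTDecode stream is expected to begin with the JPEG start-of-image marker FF D8, while the JFIF APP0 segment is optional, which would explain a JPEG without the "JFIF" bytes.

```python
import re

def image_stream(pdf_bytes: bytes, obj_num: int) -> bytes:
    """Return the raw stream bytes of object `obj_num` generation 0."""
    head = re.search(rb"(?<!\d)%d 0 obj" % obj_num, pdf_bytes)
    if head is None:
        raise ValueError("object not found")
    rest = pdf_bytes[head.end():]
    start = re.search(rb"stream\r?\n", rest)
    if start is None:
        raise ValueError("object has no stream")
    # Only look inside this object's dictionary, and skip indirect lengths.
    length = re.search(rb"/Length\s+(\d+)(?!\d)(?!\s+\d+\s+R)",
                       rest[:start.start()])
    if length is None:
        raise ValueError("no direct /Length entry")
    begin = head.end() + start.end()
    return pdf_bytes[begin:begin + int(length.group(1))]

def starts_like_jpeg(data: bytes) -> bool:
    """True if the data begins with the JPEG start-of-image marker."""
    return data[:2] == b"\xff\xd8"

def has_jfif_header(data: bytes) -> bool:
    """True if the first segment is an APP0 segment tagged 'JFIF'."""
    return data[2:4] == b"\xff\xe0" and data[6:11] == b"JFIF\x00"
```

Calling image_stream(pdf_bytes, 19) on the file described above and writing the result to disk should give the same bytes as 19_0_obj.bin.jpg.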