PAC2021 reports "Error while parsing the PDF document (Operator 'cm' is not allowed in this current state)" - but why?

I am using the PAC2021 application to check some PDFs for compliance with the PDF/UA standard.
For some files (all generated by the software of my company) the following error is reported:
Error while parsing the PDF document (Operator 'cm' is not allowed in this current state)
There are no further explanations, just this phrase. An example of PDF page content that raises the above error is:
/P << /MCID 0 >> BDC
BT
0 g
1 0 0 1 81 639.75 cm
/F0 10 Tf
1 0 0 1 0 4.45 Tm
(Hello, world!) Tj
ET
EMC
Apparently PAC2021 does not like the "cm" in the fourth line, but why?
I went through the PDF specification documents and could not find an explanation of why this should be considered a syntax error. None of the PDF readers I have tried complains about such content. I also ran the same document through Adobe Preflight for PDF/UA, and it reported the document as fully compliant.
So I'm wondering: is this content violating a special restriction of the PDF/UA format? If so, where can I find its definition? Or is it an error in the PAC2021 report?
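A possible explanation, offered as a sketch rather than a definitive reading of PAC2021's parser: ISO 32000-1 groups q, Q and cm as "special graphics state" operators, and its graphics-objects diagram (section 8.2) does not list that group among the operators permitted inside a BT ... ET text object. Under that reading, the cm on the fourth line is simply in the wrong place. The Python below is a naive check for this pattern; the function name and the simplified tokenizer are my own assumptions, and it ignores comments, hex strings and inline images.

```python
import re

# Operators that ISO 32000-1 classifies as "special graphics state".
SPECIAL_GRAPHICS_STATE = {"q", "Q", "cm"}


def find_misplaced_operators(content: str):
    """Return (token_index, operator) pairs for q/Q/cm found between BT and ET."""
    # Blank out literal strings first so text such as "(cm)" is not
    # mistaken for an operator, then split on whitespace.
    without_strings = re.sub(r"\((?:\\.|[^\\()])*\)", "()", content)
    in_text = False
    hits = []
    for index, token in enumerate(without_strings.split()):
        if token == "BT":
            in_text = True
        elif token == "ET":
            in_text = False
        elif in_text and token in SPECIAL_GRAPHICS_STATE:
            hits.append((index, token))
    return hits


sample = """/P << /MCID 0 >> BDC
BT
0 g
1 0 0 1 81 639.75 cm
/F0 10 Tf
1 0 0 1 0 4.45 Tm
(Hello, world!) Tj
ET
EMC"""

print([op for _, op in find_misplaced_operators(sample)])  # ['cm']
```

If this is indeed what PAC2021 objects to, emitting the cm line before BT (or folding the translation into the Tm matrix) should avoid the message.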

Related

PDF Signature: "Expected a dict object"

I'm creating a library for digitally signing a PDF document. During my quest I stumbled upon another problem.
In Acrobat I'm getting the error:
Error during signature verification.
Adobe Acrobat error.
Expected a dict object.
I know it expects a dictionary object somewhere. But I have no idea where.
This problem shows up when I add the image to the AP of the signature.
For this I'm basing my implementation on the spec and on "Insert multiple digital approval signatures without invalidating the previous one".
Most of this seems to work correctly, but when the image is present it results in the error. The image is correctly visible.
Current working:
(This is a very short overview of the part where the error is, it might be slightly different, but hope this helps)
I update the signature annotation. Add link to object that contains normal appearance.
16 0 obj
<<
/Type/Annot
/Subtype/Widget
...snip...
/AP<<
/N 21 0 R
>>
>>
Add image as XObject
20 0 obj
<<
/Type/XObject
/Subtype/Image
...snip...
/Length 29569
>>
stream
...snip...
endstream
endobj
Add XObject (Normal appearance)
21 0 obj
<<
/Type/XObject
/Subtype/Form
/Resources<<
/XObject<<
/UserSignature272 20 0 R
>>
>>
/BBox[0 0 135 37.5]
/Length 44
>>stream
q
135 0 0 37.5 0 0 cm
/UserSignature272 Do
Q
endstream
endobj
I think the problem happens somewhere in obj (21 0), but I'm not sure.
Here is a minimal file that can be used for testing.
https://drive.google.com/file/d/17sdz2xJy3VhN6i9YiuPrJ6x2s5kU2sra/view?usp=sharing
Any help, or hints would be welcome.
(This post is a continuation of PDF Digital Signature has "Bad parameter" in Acrobat, but is about a different problem, same subject area.)
You're running into a bug of Adobe Acrobat here: If you display an XObject from inside your signature appearance stream, it expects that XObject to have a Resources entry. This may make sense in the case of form XObjects, but it doesn't for image XObjects like the one in your case.
A workaround is to add an empty Resources dictionary to your image XObject.
I checked this by replacing the /BBox[1 0 0 1 0 0] in your image XObject (which is not needed there anyways) by /Resources<< >>.
When Adobe Acrobat creates its own signature appearances, it creates a hierarchy of form XObjects here, with Resources dictionaries all over, including those for the "layers". I assume Adobe Reader, on seeing the Do operator, attempts to collect information on such "layers" and does not expect to be confronted immediately with an image XObject.

How to test ClamAV service for potential threats

As part of an enterprise software project, our application connects to an antivirus service backed by ClamAV, using ICAP as the communication protocol. I would like to test the antivirus service's response to malicious documents but, of course, I cannot use a document which is actually infected with something malicious. I found the EICAR Anti Malware Testfile, but it only seems to come as either a .txt or a .zip, and the system only allows upload of Word or PDF. The antivirus service only recognizes EICAR if it is sent to it "as-is", but not when it is embedded inside a Word or PDF document.
My question is: how can I create a Word and/or PDF document that is recognized by ClamAV as a threat even though it is actually not harmful at all?
I initially suggested
Since a docx is a zip, you could try renaming eicar.zip to eicar.docx. It proves only that a docx is reviewed/scanned similarly to a zip, not that the AV can detect malicious VBA macros, which would be a different payload.
However, the uploading step, involving Apache Tika file verification, blocked that simplistic approach, as the file type was not as expected.
My second suggestion was
Take a valid docx, rename it to zip, drop the EICAR text file into it with Explorer (or use zip add), and rename it back to docx, as that's likely to bypass the Tika checking.
Apparently that worked.
Likewise, it should be possible to embed eicar.txt inside a PDF. Again, detection would not mean the AV is scanning for JavaScript exploitation, just that the plain-text signature is seen in a PDF file, which only hints that a PDF is scanned.
This is more difficult due to PDF encryption, but a text file attachment hand-crafted in an editor need not be encoded; it can simply be stored as plain text, which is basic but sufficient for the EICAR trigger to be seen.
It could look something like this, but copying and pasting this binary file shown as text will likely fail when stored as eicar.pdf because of ANSI line-ending encoding, so grab a binary copy from the link below.
%PDF-1.4
%µ¶
1 0 obj
<</Pages 2 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[3 0 R]/Type/Pages>>
endobj
3 0 obj
<</Contents 4 0 R/MediaBox[0 0 500 800]/Parent 2 0 R/Resources<</Font<</F1 5 0 R>>>>/Type/Page>>
endobj
4 0 obj
<</Length 57>>
stream
q BT /F1 24 Tf 1 0 0 1 50 720 Tm (Hello World!) Tj ET Q
endstream
endobj
5 0 obj
<</BaseFont/Courier/Subtype/Type1/Type/Font>>
endobj
xref
0 6
0000000000 65535 f
0000000016 00000 n
0000000062 00000 n
0000000114 00000 n
0000000227 00000 n
0000000333 00000 n
trailer
<</Size 6/Root 1 0 R/ID[<89311A609A751F1666063E6962E79BD5><FDDAE606D8247DFCBA7D13E1833DEDE3>]>>
startxref
395
%%EOF
%X5O!P%#AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
%%EOF
Temporarily available from https://gofile.io/d/53fylg
assuming your antivirus allows the download :-) Try saving the download as text; otherwise I will need to upload it as a RAR.
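If the download is blocked, the file can also be generated locally, which avoids the copy-and-paste problem because the xref offsets are computed from the actual bytes written. The following Python is only a sketch under my own assumptions: the function name is made up, the binary marker line and the /ID entry are omitted for brevity, and the payload is passed in by the caller as bytes (use the test string from the listing above; a placeholder is shown here). Whether a particular scanner flags the result is not guaranteed.

```python
def build_pdf_with_trailing_comment(payload: bytes) -> bytes:
    """Assemble a one-page PDF and append `payload` as a trailing comment line."""
    content = b"q BT /F1 24 Tf 1 0 0 1 50 720 Tm (Hello World!) Tj ET Q"
    objects = [
        b"<</Pages 2 0 R/Type/Catalog>>",
        b"<</Count 1/Kids[3 0 R]/Type/Pages>>",
        b"<</Contents 4 0 R/MediaBox[0 0 500 800]/Parent 2 0 R"
        b"/Resources<</Font<</F1 5 0 R>>>>/Type/Page>>",
        b"<</Length %d>>\nstream\n" % len(content) + content + b"\nendstream",
        b"<</BaseFont/Courier/Subtype/Type1/Type/Font>>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<</Size %d/Root 1 0 R>>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n" % xref_position
    out += b"%%EOF\n"
    # The payload goes after the first end-of-file marker as a comment line.
    out += b"%" + payload + b"\n" + b"%%EOF\n"
    return bytes(out)


if __name__ == "__main__":
    data = build_pdf_with_trailing_comment(b"PUT-THE-TEST-STRING-HERE")
    with open("eicar.pdf", "wb") as handle:  # binary mode keeps line endings intact
        handle.write(data)
```

Writing in binary mode is the point of the exercise: no editor or clipboard gets a chance to change line endings or character encoding.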
However those two "Positives" would be just as good a detection as telltales that any AV is searching those file types for current known exploits.
For deeper testing, I recommend downloading the live, script-running version at the bottom of this article:
https://blog.didierstevens.com/2015/08/28/test-file-pdf-with-embedded-doc-dropping-eicar/

PDF that renders in Chrome but not in Acrobat

%PDF-1.7
4 0 obj
<</Type/ObjStm/N 3/First 14/Length 139>>
stream
1 0 2 41 3 76 <</Type/Catalog/Version/1.7/Pages 2 0 R>><</Type/Pages/Kids[3 0 R]/Count 1>><</Type/Page/MediaBox[0 0 200 200]/Parent 2 0 R>>
endstream
endobj
5 0 obj
<<
/Root 1 0 R
/ID[<7F1FE2C507E6DB4CB0787E660F2B0C65><2450E4E8FF5FC84380428886C0DD4C2F>]
/Size 6
/Index[1 5]
/W[1 4 1]
/Type/XRef
/Length 68
/Filter[/ASCIIHexDecode]
>>
stream
020000000400
020000000401
020000000402
010000000A00
01000000E500
endstream
endobj
startxref
229
%%EOF
The PDF above opens in Chrome (or Edge), but in Adobe Acrobat (Reader) it crashes. Ghostscript regards it as fine too. Note that it assumes CRLF for line breaks.
I read the parts of the PDF spec that are relevant for a basic PDF, and it seems that the above syntax follows it. Why doesn't Adobe like it?
Here is a link to the PDF. Notice how it opens in Chrome, but crashes in Adobe Acrobat. (This PDF uses LF for line breaks, and has a Resources dictionary on the page, based on the comments.)
Acrobat has the following 2 quirks, both of which do not follow the specs:
If the XRef Stream has a single filter, an array must not be used. So /Filter[/FlateDecode] won't work, and /Filter/FlateDecode will. This may apply to any Stream Object, not sure.
An XRef Stream must use the FlateDecode filter. ASCIIHexDecode won't work. A predictor is not required.
Here is a link to the above PDF, fixed up for Acrobat.
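To make the two quirks concrete, here is a small Python sketch (my own illustration, not taken from the fixed file) that decodes the hex rows of the XRef stream above using /W[1 4 1] and then re-encodes the same bytes with zlib, which is the format FlateDecode expects. The helper name is an assumption.

```python
import binascii
import zlib


def decode_xref_rows(raw: bytes, widths):
    """Split decoded xref-stream bytes into rows of integers using the /W widths."""
    row_length = sum(widths)
    rows = []
    for start in range(0, len(raw), row_length):
        row, position = [], start
        for width in widths:
            row.append(int.from_bytes(raw[position:position + width], "big"))
            position += width
        rows.append(tuple(row))
    return rows


# The five rows of the stream above; /Index[1 5] means they describe objects 1..5.
hex_rows = "020000000400 020000000401 020000000402 010000000A00 01000000E500"
raw = binascii.unhexlify(hex_rows.replace(" ", ""))

for object_number, row in enumerate(decode_xref_rows(raw, [1, 4, 1]), start=1):
    # Type 2: (2, object stream number, index within it)
    # Type 1: (1, byte offset, generation)
    print(object_number, row)

# Acrobat-friendly variant: same bytes, Flate-compressed, and the dictionary
# would then carry /Filter/FlateDecode (a name, not an array).
flate_body = zlib.compress(raw)
print(len(flate_body))  # value for /Length when this body is written out
```

The decoded rows say that objects 1 to 3 live in object stream 4, object 4 starts at byte 10, and object 5 (the XRef stream itself) starts at byte 229, which matches the startxref value.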

How can I convert a PDF from Google Docs to images? [or: GoogleDocs' PDF export is horrible!]

I exported a document from Google Docs as PDF (just simple pages and one of the pre-defined themes) and, like I do usually, I used ImageMagick's convert to get pages converted to images, but it failed (even with the latest version) and showed no errors.
GhostScript also failed.
Other tools such as pdfinfo, mutool or qpdf don't report any error, yet it still fails even if rebuild or clean commands are applied.
Only pdfimages complains and gives me Syntax Error: Missing or invalid Coords in shading dictionary
Ok, I tried to reproduce some bugs, using Google Slides.
However, my bugs are different from yours. Read on for some details...
Google Docs does indeed create a horrible PDF syntax today. I say 'today', because I gave up with Google Docs years ago. The reason: it was always very unstable for me in the past. GoogleDocs' developers seem to change the code they activate for users all the time, and debugging the created PDFs for me was always a moving target.
When I exported the slideshow I created to PDF and then ran the tools you mentioned on it,...
... I got 4 different results within 20 minutes!
In one case, Mac OS X's Preview.app was unable to render anything else but 3 white pages, while Adobe's Acrobat Pro rendered it (without error message) somehow garbled and different from the GoogleDocs web preview.
In another case, Acrobat Pro showed 3 white pages, while Preview.app rendered it in a garbled way!
Unfortunately, I didn't save the different versions for closer inspection. However, the latest PDF I analysed gave the following details.
Ghostscript:
pdfkungfoo#mbp:> gs -o PDFExportBug-%03d.jpg -sDEVICE=jpeg PDFExportBug.pdf
GPL Ghostscript 9.10 (2013-08-30)
Copyright (C) 2013 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 3.
Page 1
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
Page 2
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
Page 3
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
ImageMagick:
convert creates white-only images from the PDF pages.
(That's no wonder, because it does not process the PDFs directly but employs Ghostscript as its delegate to convert the PDF to a raster format first, which is then familiar ground for ImageMagick to continue processing... You can see details of this process by adding -verbose to your ImageMagick command line.)
qpdf
Using qpdf --check yields this result:
pdfkungfoo#mbp:> qpdf --check PDFExportBug.pdf
qpdf --check PDFExportBug.pdf
checking GoogleSlidesPDFExportBug.pdf
PDF Version: 1.4
File is not encrypted
File is not linearized
PDFExportBug.pdf (file position 9269):
unknown token while reading object (0.0000-11728996)
pdfimages:
Unlike what you discovered, my error message was this:
pdfkungfoo#mbp:> pdfimages -list PDFExportBug.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
Syntax Warning (9276): Badly formatted number
Syntax Warning (9292): Badly formatted number
Syntax Warning (9592): Badly formatted number
Syntax Warning (9608): Badly formatted number
Syntax Warning (4907): Badly formatted number
Syntax Warning (4907): Badly formatted number
Syntax Warning (9908): Badly formatted number
Syntax Warning (9924): Badly formatted number
Syntax Warning (8212): Badly formatted number
Syntax Warning (8212): Badly formatted number
When I check with a text editor the file-offsets of 9276, 9292, ... 8212 for numbers, I indeed do find the following lines in the PDF code:
Line 412: 0.0000-11728996
Line 413: 0.0000-11728996
Line 466: 0.0000-11728996
Line 467: 0.0000-11728996
Line 522: 0.0000-11728996
Line 523: 0.0000-11728996
PDF code in text editor:
Looking at the context of these lines, one sees the following:
32
0
obj
<<
/ShadingType
2
/ColorSpace
/DeviceRGB
/Function
<<
/FunctionType
2
/Domain
[
0
1
]
/Range
[
0
1
0
1
0
1
]
/C0
[
0.5882353
0.05882353
0.05882353
]
/C1
[
0.78431374
0.1254902
0.03529412
]
/N
1
>>
/Coords
[
0.000000000000053689468
0.0000
-11728996
0.0000
-11728996
26.832815
]
/Extend
[
true
true
]
>>
endobj
That's true! GoogleDocs gave me a PDF that created a newline after each single token!
PDF code, if Google had formatted it less horribly:
These lines are part of a code snippet that should probably be formatted like this, if the Google PDF export wasn't as horrible as it in fact is:
32 0 obj
<<
/ShadingType 2
/ColorSpace /DeviceRGB
/Function << /FunctionType 2
/Domain [ 0 1 ]
/Range [ 0 1 0 1 0 1 ]
/C0 [ 0.5882353 0.05882353 0.05882353 ]
/C1 [ 0.78431374 0.1254902 0.03529412 ]
/N 1
>>
/Coords [ 0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815 ]
/Extend [ true true ]
>>
endobj
PDF code compared to the PDF specification:
So the Google Docs PDF uses /ShadingType 2 (for axial shading). This shading type requires a 'shading dictionary' with an entry for the /Coords key whose value should be an array of 4 numbers [x0 y0 x1 y1]. These numbers specify the starting and ending coordinates of the axis (expressed in the shading’s target coordinate space).
However, instead of a /Coords array of 4 numbers it uses one of 6 numbers: [0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815].
But Coords arrays with 6 numbers are to be used by /ShadingType 3 (radial shading).
The 6 numbers [x0 y0 r0 x1 y1 r1] then represent, according to ISO 32000:
"[...] the centres and radii of the starting and ending circles, expressed in the shading’s target coordinate space. The radii r0 and r1 shall both be greater than or equal to 0. If one radius is 0, the corresponding circle shall be treated as a point; if both are 0, nothing shall be painted."
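As a rough illustration of the rule just quoted, the following Python sketch checks a /Coords array against its /ShadingType. It only covers types 2 and 3 as described above; the function name and the returned messages are my own.

```python
# Number of /Coords entries expected per shading type, as described above:
# axial (2): x0 y0 x1 y1; radial (3): x0 y0 r0 x1 y1 r1.
EXPECTED_COORDS = {2: 4, 3: 6}


def check_coords(shading_type: int, coords: list) -> str:
    """Return 'ok' or a short description of what is wrong with /Coords."""
    expected = EXPECTED_COORDS.get(shading_type)
    if expected is None:
        return "not checked"  # other shading types are outside this sketch
    if len(coords) != expected:
        return f"expected {expected} numbers, got {len(coords)}"
    if shading_type == 3 and (coords[2] < 0 or coords[5] < 0):
        return "negative radius"  # r0 and r1 shall both be >= 0
    return "ok"


google_coords = [0.000000000000053689468, 0.0, -11728996, 0.0, -11728996, 26.832815]
print(check_coords(2, google_coords))  # expected 4 numbers, got 6
print(check_coords(3, google_coords))  # negative radius
```

Note that, read as a radial shading, the third number would be the radius r0, and a negative radius does not satisfy the quoted requirement either.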
15 minutes later, I exported the PDF again, but now I got these lines:
/Coords
[
0.000000000000053689468
0.0000-11728996
0.0000-11728996
26.832815
]
As you'll notice, now indeed the /Coords array has 4 entries -- but 0.0000-11728996 isn't a valid number!
In any case, the particular numbers in my objects 32, 33 and 34 do look funny somehow:
Either they are meant to be 6 numbers:
[0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815]
Then they can only be meant for a /ShadingType 3 (radial shading)
But they are noted in the context of /ShadingType 2 (axial shading)
Or they are meant to be 4 numbers:
[0.000000000000053689468 0.0000-11728996 0.0000-11728996 26.832815]
Then 0.0000-11728996 are not valid numbers.
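The second case can be checked mechanically. PDF numbers are an optional sign followed by digits with at most one decimal point (no exponents), so a minimal Python sketch of a token check, with a regular expression of my own devising, looks like this:

```python
import re

# Optional sign, then either digits with an optional fractional part ("34.5", "4.")
# or a leading decimal point followed by digits (".002"). No exponent notation.
PDF_NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def is_pdf_number(token: str) -> bool:
    """True if the whole token is a syntactically valid PDF integer or real."""
    return PDF_NUMBER.fullmatch(token) is not None


for token in ["0.000000000000053689468", "0.0000", "-11728996", "0.0000-11728996"]:
    print(token, is_pdf_number(token))
```

Only the last token is rejected, which is consistent with the "Badly formatted number" warnings from pdfimages and the "unknown token" message from qpdf.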
Fix
So the fix could be in...
...either change the /ShadingType 2 to /ShadingType 3 and keep the array of 6 numbers
...or keep the /ShadingType 2 and throw away 2 of the 6 numbers to keep only 4 (but which?)
I decided (arbitrarily, by chance) to try with ShadingType 2 first and delete these two numbers: -11728996 0.0000.
I was lucky: the PDF now lets convert process the PDF pages into JPEGs (which means the Ghostscript command called by convert was also working correctly).
Good luck with your continued use of Google Docs for creating PDFs...
...but don't count me in!
Update
Here is a link to a GoogleDoc currently exhibiting one of the bug variants explained above:
To see the bug, save it as a PDF. Then open it in a text editor.
Should the doc from this link stop exporting buggy PDFs and stop exhibiting the details I've described above, then Google has applied a fix... (until they break it again?!?)

PDF Flag annotations

I am trying to (programmatically) write page numbers on all pages of a PDF file.
The object I use to write looks like this:
493 0 obj
<</Length 96>>
stream
Q
/2 12 Tf
/DeviceRGB cs
0 0 0 scn
q
1 0 -0 1 298 32 cm
BT
1 0 0 1 -3.6 1.884 Tm
(2) Tj
ET
Q
endstream
endobj
It worked fine until I tried to do it on a page which uses the "/Rotate" entry:
23 0 obj
<</Parent 2 0 R /Rotate 180 /Contents [492 0 R 24 0 R 493 0 R ] ... >>
...
When I tried to do so, the number I wrote came out upside down (and at the top of the page instead of the bottom).
I read about this in the PDF manual and found that I can use the annotation flags to indicate that I want the written number to be fixed and not affected by page rotation.
For that, I tried to add to the 493 obj dictionary the corresponding flag (NoRotate):
493 0 obj
<</Length 96 /F 16>>
stream
...
The only thing that actually happens is that the number I try to write doesn't show at all.
I tried to load different numbers into the "/F", but they all lead to an invisible number.
I tried to look for examples in the manual and on the net, but didn't find any.
What am I doing wrong?
Maybe I place the "/F" in the wrong location??
According to Adobe's PDF Reference v1.7 (link to PDF), 8.4.2 Annotation Flags, the flag /F only applies to annotations -- objects with a /Type of /Annot, and appearing in a PDF as sticky notes, text edits, and clickable rectangles.
It seems you have to provide the rotation yourself, using the Tm operator.
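As a sketch of what "providing the rotation yourself" could look like, the Python below computes a Tm line that cancels out a page's /Rotate value, so that text placed at a given point of the page as displayed comes out upright. Everything here is an assumption for illustration: the function name, a MediaBox with its origin at (0, 0), the 612 x 792 page size (the question does not give one), and /Rotate being a multiple of 90.

```python
# Rotation part (a b c d) of the text matrix that counteracts each /Rotate value.
ROTATION = {
    0: (1, 0, 0, 1),
    90: (0, 1, -1, 0),
    180: (-1, 0, 0, -1),
    270: (0, -1, 1, 0),
}


def upright_tm(rotate: int, page_width: float, page_height: float,
               x: float, y: float) -> str:
    """Return a Tm line placing upright text at (x, y) of the page as displayed.

    page_width and page_height are the unrotated MediaBox dimensions.
    """
    angle = rotate % 360
    a, b, c, d = ROTATION[angle]
    if angle == 0:
        e, f = x, y
    elif angle == 90:
        e, f = page_width - y, x
    elif angle == 180:
        e, f = page_width - x, page_height - y
    else:  # 270
        e, f = y, page_height - x
    return f"{a} {b} {c} {d} {e:g} {f:g} Tm"


# Page number near the bottom centre of a /Rotate 180 page (assumed 612 x 792).
print(upright_tm(180, 612, 792, 298, 32))  # -1 0 0 -1 314 760 Tm
```

The resulting line would replace the plain "1 0 0 1 ... Tm" inside BT ... ET, with the separate cm translation dropped, since the position is already part of the matrix.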