Internal structure of PDF file: decode params - pdf

What do the following decode parameters mean?
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<4DC888EB77E2D649AEBD54CA55A09C54><227DCAC2C364E84A9778262D41602AD4>]/Info 37 0 R/Length 69/Root 39 0 R/Size 38/Type/XRef/W[1 2 1]>>
I know that Filter/FlateDecode is the filter that was used to compress the stream. But what are ID, Info, Length, Root, Size? Are these parameters related to compression/decompression?

Please consult ISO-32000-1:
You are showing the dictionary of a compressed cross reference table (/Type/XRef):
7.5.8 Cross-Reference Streams
Cross-reference streams are stream objects, and contain a dictionary and a data stream.
FlateDecode: the way the stream is compressed.
Length: This is the number of bytes in the stream. Your PDF is at least a PDF 1.5 file and it has a compressed xref table.
DecodeParms: contains information about the way the stream is encoded.
A Cross-reference stream has some typical dictionary entries:
W: An array of integers representing the size of the fields in a single cross-reference entry. In your case [1 2 1].
Size: The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary.
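As a rough illustration of how DecodeParms, Filter and W work together, here is a minimal Python sketch that inflates a cross-reference stream, undoes the PNG "Up" predictor implied by /Predictor 12 with /Columns 4, and splits the result into fields according to W. The function names are hypothetical, raw_stream is assumed to be the bytes between the stream and endstream keywords, and only PNG filter type 2 is handled. Zero-width W fields (which take default values) are not treated specially.

```python
import zlib

def undo_png_up(data: bytes, columns: int) -> bytes:
    """Undo the PNG 'Up' predictor row by row (hypothetical helper)."""
    row_len = columns + 1              # every row starts with a filter-type byte
    prev = bytes(columns)              # the row above the first row is all zeros
    out = bytearray()
    for i in range(0, len(data), row_len):
        row = data[i:i + row_len]
        if row[0] != 2:                # 2 = "Up"; other filter types are not handled here
            raise ValueError("unsupported PNG filter type %d" % row[0])
        cur = bytes((b + p) & 0xFF for b, p in zip(row[1:], prev))
        out += cur
        prev = cur
    return bytes(out)

def parse_xref_entries(raw_stream: bytes, columns: int, widths):
    """Inflate, un-predict and split a cross-reference stream into field tuples."""
    data = undo_png_up(zlib.decompress(raw_stream), columns)
    entry_len = sum(widths)
    entries = []
    for i in range(0, len(data), entry_len):
        fields, pos = [], i
        for w in widths:
            # each field is a big-endian unsigned integer of w bytes
            fields.append(int.from_bytes(data[pos:pos + w], "big"))
            pos += w
        entries.append(tuple(fields))
    return entries
```

For the dictionary in the question this would be called as parse_xref_entries(raw_stream, 4, [1, 2, 1]). In each resulting tuple the first field is the entry type (0 free, 1 in use, 2 stored in an object stream), and the meaning of the other two fields depends on that type.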
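I also see some entries that belong in the trailer dictionary of a PDF file; with a cross-reference stream, the stream dictionary takes over that role. Root is the reference to the document catalog; ID and Info are described below: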
14.4 File Identifiers
File identifiers shall be defined by the optional ID entry in a PDF
file’s trailer dictionary. The ID entry is optional but should be
used. The value of this entry shall be an array of two byte strings.
The first byte string shall be a permanent identifier based on the
contents of the file at the time it was originally created and shall
not change when the file is incrementally updated. The second byte
string shall be a changing identifier based on the file’s contents at
the time it was last updated. When a file is first written, both
identifiers shall be set to the same value.
14.3.3 Document Information Dictionary
What you see is a reference to another indirect object that is a dictionary called the Info dictionary:
The optional Info entry in the trailer of a PDF file shall hold a
document information dictionary containing metadata for the document.
Note: this question isn't really suited for StackOverflow. StackOverflow is a forum where you can post programming problems. Your question isn't a programming problem. You are merely asking us to copy/paste quotes from ISO-32000-1.

Related

The signature byte range is invalid in Acrobat

I'm trying to timestamp a PDF document using a library that we're developing. I added a new section to the PDF document: a new annotation object for the signature, the signature object containing the actual signature, and a new xref table for the new section. When I check the xref entries, everything seems right.
When I try to verify my signature in Acrobat I get the following error message "there are errors in the formatting of information contained in this signature (The signature byte range is invalid)".
However, when I check the byte range, everything seems right. It goes from the beginning of the document to the opening bracket of the Contents part, and from the end of it to the end of the document. I compared it with a document that has a valid signature, and the byte ranges look the same.
I really don't understand what's wrong and why Acrobat is showing this error.
Here's the link to the signed file if anybody wants to have a look: https://ufile.io/mckajk9h
PS: I can share some part of the code, but the actual question is about the Acrobat reader and how it interprets PDF signatures, not my code. So, the relevant part should be the resulting PDF file that I shared.
There are some issues in your PDF.
The Major Issue
The major one, which results in the non-intuitive error message: the first entry in your incremental update cross reference table (0000000000 65535 f\n) is one byte too short. According to the specification, each entry has to be exactly 20 bytes long, but yours is only 19 bytes long.
Whenever Adobe Acrobat sees structurally broken cross reference tables, it internally repairs the file. In the repaired file the objects are rearranged which renders the signature byte range invalid.
As soon as I had fixed this by adding a space between f and \n, Adobe Acrobat did not complain about a formatting error anymore. Of course it claimed that the document had been altered or corrupted, after all I had altered it. But at least it accepted the signature structure.
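The 20-byte rule can be checked mechanically. The following Python sketch assumes you have already cut a single entry out of a classic cross-reference table; the function name is hypothetical. It accepts the three two-byte line endings that make an entry exactly 20 bytes long (space + CR, space + LF, or CR + LF).

```python
import re

# 10-digit offset, space, 5-digit generation, space, n or f, then a 2-byte end-of-line
XREF_ENTRY = re.compile(rb"\d{10} \d{5} [nf](?: \r| \n|\r\n)")

def is_valid_xref_entry(entry: bytes) -> bool:
    """Return True if entry is a well-formed 20-byte cross-reference table entry."""
    return len(entry) == 20 and XREF_ENTRY.fullmatch(entry) is not None
```

With this check, the entry from the file in question (ending in f directly followed by \n) is rejected, while the repaired one with a space before \n is accepted.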
Minor Issues
Some of your offsets are incorrect, in your incremental update you have
xref
0 2
0000000000 65535 f
0000003029 00000 n
12 2
0000003144 00000 n
0000003265 00000 n
Thus, object 12 should start at 3144 but it actually starts at 3145.
Furthermore, you have
startxref
19548
%%EOF
So the xref keyword should start at 19548 but it starts at 19549.
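Off-by-one offsets like these are easy to detect with a small script. The Python sketch below is only an illustration: the function names are hypothetical, it assumes the whole file is in memory as bytes, and it only covers classic cross-reference tables (a cross-reference stream would start with an object header instead of the xref keyword).

```python
import re

def startxref_points_at_xref(pdf: bytes) -> bool:
    """Check that the last startxref offset lands exactly on an 'xref' keyword."""
    pos = pdf.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    m = re.compile(rb"startxref\s+(\d+)").match(pdf, pos)
    if m is None:
        raise ValueError("startxref is not followed by an offset")
    offset = int(m.group(1))
    return pdf[offset:offset + 4] == b"xref"

def object_starts_at(pdf: bytes, obj_num: int, gen: int, offset: int) -> bool:
    """Check that 'obj_num gen obj' begins exactly at the given byte offset."""
    pattern = re.compile(rb"%d\s+%d\s+obj" % (obj_num, gen))
    return pattern.match(pdf, offset) is not None
```

For the file discussed here, object_starts_at(pdf, 12, 0, 3144) would be expected to return False and object_starts_at(pdf, 12, 0, 3145) True, given the offsets described above.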

Encoding of PDF dictionaries

I need to know the encoding of the values of PDF dictionaries (not the text displayed to the user but the "code behind").
I plan not to use any library for that.
Where can I find it?
the encoding of the values of PDF dictionaries
Values of PDF dictionaries are PDF objects.
You should take a look at the PDF specification ISO 32000-1, in particular chapter 7 Syntax, to find out about PDF objects. You will find:
The tokens that delimit objects and that describe the structure of a PDF file shall use the ASCII character
set. In addition all the reserved words and the names used as keys in PDF standard dictionaries and
certain types of arrays shall be defined using the ASCII character set.
Thus, most of the time you have to deal with ASCII values.
The situation is tricky with strings, though, because there are several types of strings which use the same string syntax options, so you have to interpret their contents according to their context.
Table 35 – String Object Types
Type: Description
text string: Shall be used for human-readable text, such as text annotations, bookmark names, article names, and document information. These strings shall be encoded using either PDFDocEncoding or UTF-16BE with a leading byte-order marker. This type is described in 7.9.2.2, "Text String Type."
PDFDocEncoded string: Shall be used for characters and glyphs that are represented in a single byte, using PDFDocEncoding. This type is described in 7.9.2.3, "PDFDocEncoded String Type."
ASCII string: Shall be used for characters that are represented in a single byte using ASCII encoding.
byte string: Shall be used for binary data represented as a series of bytes, where each byte can be any value representable in 8 bits. The string may represent characters but the encoding is not known. The bytes of the string need not represent characters. This type shall be used for data such as MD5 hash values, signature certificates, and Web Capture identification values. This type is described in 7.9.2.4, "Byte String Type."
If a string is the value e.g. of the Author metadata, it is a text string, so it is encoded using either PDFDocEncoding or UTF-16BE with a leading byte-order marker.
If on the other hand a string is the value e.g. of Contents in a signature dictionary, it is a byte string holding a binary object, any attempt to interpret it according to some encoding will fail.
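To make the text string case concrete, here is a minimal Python sketch of the decision described above. The function name is hypothetical, and raw is assumed to be the string's bytes after the literal or hexadecimal string syntax has been unescaped (and after decryption, if any). Python has no built-in PDFDocEncoding codec, so Latin-1 is used here as a stand-in. That is an approximation: it agrees with PDFDocEncoding for printable ASCII and for most codes above 0xA0, but not for every code.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string: UTF-16BE if it starts with a BOM, else PDFDocEncoding."""
    if raw.startswith(b"\xfe\xff"):
        # leading byte-order marker: the rest is UTF-16BE
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 as an approximation of PDFDocEncoding (not an exact mapping)
    return raw.decode("latin-1")
```

A byte string such as the Contents of a signature dictionary should not be passed through a function like this at all; it has to be handled as raw bytes.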
The situation is even more tricky with streams.
First of all the stream content may be somehow processed, e.g. it may be compressed. To get to the actual stream contents, you first have to undo this processing.
Then the content may either be binary, e.g. a font program, or text, e.g. JavaScript, or it may be a content stream, e.g. the page contents.
A content stream is a PDF stream object whose data consists of a sequence of instructions describing the
graphical elements to be painted on a page. The instructions shall be represented in the form of PDF objects,
using the same object syntax as in the rest of the PDF document.
Thus, they are mostly ASCII values. The exception, again, is string arguments to text drawing instructions. Their encoding depends entirely on the font currently selected when the string is drawn, and fonts may use standard encodings, but they may also use completely chaotic, ad-hoc encodings.
PS: If you happen to try and analyze an encrypted PDF, you will find that encryption applies to all strings and streams in the document's PDF file, with very few exceptions. In particular encryption does not apply to dictionary and array structures, numbers and names. Thus, someone not aware of this might not recognize that the PDF is encrypted but instead assume that strings and streams are encoded in a very weird way.
You find that in the PDF specification (http://www.adobe.com/devnet/pdf/pdf_reference.html). To elaborate a bit on the most important points in your question...
1) PDF dictionaries can contain a variety of value types (booleans, numbers, strings...). The encoding you are going to encounter depends on the type of value.
2) Mostly, the interesting and complex case is that where the type of object is a string.
3) For a string, read section 7.9.2 in the PDF specification. That explains what encodings can be used for such strings (PDFDocEncoding, Unicode encoding...) and how to recognise what encoding you have for a particular string.
To complement @mkl's and @DavidvanDriessche's excellent answers...
Here are three open-source command line tools which can help you transform any PDF into a form in which object streams are expanded, uncompressed and decoded (note that there is no single correct way to do this, so the output of each tool will be different):
pdftk
mutool
qpdf
Each of these should be available via your favorite operating system's package manager.
pdftk example usage:
pdftk in.pdf cat output out1.pdf uncompress
mutool example usage:
mutool clean -d in.pdf out2.pdf
qpdf example usage (my favorite tool for this purpose):
qpdf --qdf --object-streams=disable in.pdf out3.pdf
You should try each of these, compare their outputs for different input PDFs and then decide which one is your favorite (but keep the other tools in mind for cases where your favorite shows unexpected results).

Minimal PDF size according to specs

I'm reading PDF specs and I have a few questions about the structure it has.
First of all, the file signature is %PDF-n.m (8 bytes).
After that the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I was trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed in there? Where should I stop parsing?
After that, there should be a body, an xref table, a trailer and an %%EOF.
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
Third and last question: If there were more than one body+xref+trailer section, where would the offset just before the %%EOF point to? The first or the last xref table?
First of all, the file signature is %PDF-n.m (8 bytes). After that the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I was trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed in there? Where should I stop parsing?
Which docs do you have? The PDF specification ISO 32000-1 says:
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater.
Thus, those four or more bytes of binary data do not follow the file signature without any structure; they are on a comment line! This implies that they are
preceded by a % (which starts a comment, i.e. data you have to ignore while parsing anyways) and
followed by an end-of-line, i.e. CR, LF, or CR LF.
So it is easy to recognize while parsing. In particular it merely is a special case of a comment line and nothing to treat specially.
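A minimal Python sketch of this, assuming the file starts directly with the header and is available as bytes (the function name and the returned tuple are hypothetical choices for illustration): it reads the %PDF-n.m line, then looks at the next line and reports whether it is a comment containing at least four bytes with codes of 128 or greater.

```python
import re

def parse_header(pdf: bytes):
    """Return (version, has_binary_marker, offset_after_header)."""
    m = re.match(rb"%PDF-(\d\.\d)[ \t]*(\r\n|\r|\n)", pdf)
    if m is None:
        raise ValueError("not a PDF header")
    version = m.group(1).decode("ascii")
    pos = m.end()
    has_marker = False
    # A comment runs from % up to the next end-of-line (CR, LF, or CR LF)
    c = re.compile(rb"%([^\r\n]*)(\r\n|\r|\n)").match(pdf, pos)
    if c is not None:
        binary = sum(1 for b in c.group(1) if b >= 128)
        if binary >= 4:
            has_marker = True
            pos = c.end()       # skip the marker line like any other comment
    return version, has_marker, pos
```

A general-purpose tokenizer would not need this special case at all, since skipping comments already covers it; the sketch only makes the structure of the two lines explicit.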
(sigh, I just saw you and #Jongware cleared that in comments while I wrote this...)
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
If there are no objects, you don't have a PDF file as certain objects are required in a PDF file, in particular the catalog. So do you mean a minimal valid PDF file?
As you commented you indeed mean a minimal valid PDF.
Please have a look at the question What is the smallest possible valid PDF? on stackoverflow, there are some attempts to create minimal PDFs adhering more or less strictly to the specification. Reading e.g. #plinth's answer you will see stuff that is not PDF anymore but still accepted by Adobe Reader.
Third and last question: If there were more than one body+xref+trailer section, where would the offset just before the %%EOF point to?
Normally it would be the last cross reference table/stream as the usual use case is
you start with a PDF which has but one cross reference section;
you append an incremental update with a cross reference section pointing to the original as previous, and the new offset before %%EOF points to that new cross reference;
you append yet another incremental update with a cross reference section pointing to the cross references from the first update as previous, and the new offset before %%EOF points to that newest cross reference;
etc...
The exception is the case of linearized documents in which the offset before the %%EOF points to the initial cross references which in turn point to the section at the end of the file as previous. For details cf. Annex F of ISO 32000-1.
And as you can of course apply incremental updates to a linearized document, you can have mixed forms.
In general it is best for a parser to be able to parse any order of partial cross references. And don't forget, there are not only cross reference sections but also alternatively cross reference streams.
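The chain of incremental updates described above can be followed with a short script. This Python sketch is a simplification under several assumptions: the function name is hypothetical, the file is held in memory as bytes, only classic cross-reference tables with a trailer keyword are handled (not cross-reference streams), and the Prev entry is found with a plain regular expression rather than a real dictionary parser.

```python
import re

def xref_chain(pdf: bytes) -> list:
    """Return the offsets of the cross-reference sections, starting at the last startxref."""
    pos = pdf.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    m = re.compile(rb"startxref\s+(\d+)").match(pdf, pos)
    if m is None:
        raise ValueError("startxref is not followed by an offset")
    offset = int(m.group(1))
    chain = []
    while offset not in chain:                 # guard against Prev loops
        chain.append(offset)
        t = pdf.find(b"trailer", offset)       # trailer belonging to this section
        if t < 0:
            break
        end = pdf.find(b"startxref", t)
        if end < 0:
            end = len(pdf)
        prev = re.search(rb"/Prev\s+(\d+)", pdf[t:end])
        if prev is None:                       # no Prev: end of the chain
            break
        offset = int(prev.group(1))
    return chain
```

Because the function simply follows Prev entries, it does not assume that earlier sections sit at lower offsets, which matters for the linearized case mentioned above.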

PDF format. function of %-started sequence

What is the function of the hex sequence "25 E2 E3 CF D3", found at the beginning of some documents? It should be a comment as far as I understand, but its content is not any meaningful text and the same sequence occurs in many documents.
It identifies the PDF file as containing binary data.
From the freely available PDF Reference (section 7.5.2, p. 40):
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the
beginning of a file to determine whether to treat the file’s contents as text or as binary.

dll files compared to gzip files

Okay, the title isn't very clear.
Given a byte array (read from a database blob) that represents EITHER the sequence of bytes contained in a .dll or the sequence of bytes representing the gzip'd version of that dll, is there a (relatively) simple signature that I can look for to differentiate between the two?
I'm trying to puzzle this out on my own, but I've discovered I can save a lot of time by asking for help. Thanks in advance.
Check if its first two bytes are the gzip magic number 0x1f8b (see RFC 1952). Or just try to gunzip it; the operation will fail if the DLL is not gzip'd.
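A minimal Python sketch of the magic-number check (the function name and return values are hypothetical). It additionally relies on the fact that a DLL is a PE file and therefore starts with the two bytes "MZ", so the two cases cannot be confused by their first two bytes.

```python
def classify_blob(data: bytes) -> str:
    """Distinguish a gzip'd blob from a raw DLL by its first two bytes."""
    if data[:2] == b"\x1f\x8b":      # gzip magic number (RFC 1952)
        return "gzip"
    if data[:2] == b"MZ":            # DOS/PE header signature of a DLL
        return "dll"
    return "unknown"
```

Anything classified as "unknown" is neither of the two expected forms and is probably worth logging rather than guessing about.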
A gzip file should be fairly straightforward to identify, as it ought to consist of a header, a footer and some other distinguishable elements in between.
From Wikipedia:
"gzip" is often also used to refer to
the gzip file format, which is:
a 10-byte header, containing a magic number, a version number and a time stamp
optional extra headers, such as the original file name
a body, containing a DEFLATE-compressed payload
an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data
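The fixed-size parts of that layout can be read with a few lines of Python. This is a sketch under the assumption that the blob holds a single gzip member; the function name and dictionary keys are hypothetical. The fields are unpacked in the order RFC 1952 lists them (magic bytes, compression method, flags, modification time, extra flags, operating system), all little-endian.

```python
import struct
import zlib

def parse_gzip_header_footer(data: bytes) -> dict:
    """Read the 10-byte header and 8-byte footer of a single-member gzip blob."""
    if len(data) < 18:
        raise ValueError("too short to be a gzip member")
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", data[:10])
    if magic != b"\x1f\x8b":
        raise ValueError("gzip magic number not found")
    crc32, isize = struct.unpack("<II", data[-8:])
    return {
        "method": method,    # 8 means DEFLATE
        "flags": flags,      # bit flags announcing the optional extra headers
        "mtime": mtime,      # Unix time stamp, 0 if not set
        "crc32": crc32,      # CRC-32 of the uncompressed data
        "isize": isize,      # uncompressed length modulo 2**32
    }
```

After decompressing, comparing zlib.crc32 of the result with the stored crc32 value gives an extra sanity check that the blob really was a gzip'd DLL.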
You might also try determining if the gzip contains any records/entries as each will also have their own header.
You can find specific information on this file format (specifically the member header which is linked) here.