PDF parsing file trailer - pdf

It is not clear from the PDF ISO standard document (PDF32000-2008) whether a comment may follow the startxref keyword:
startxref
Byte_offset_of_last_cross-reference_section
%%EOF
The standard does seem to imply that comments may appear anywhere:
7.2.3 Comments
Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces a comment. The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line, including regular, delimiter, SPACE (20h), and HORZONTAL TAB characters (09h). A conforming reader shall ignore comments, and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.
EXAMPLE The PDF fragment in this example is syntactically equivalent to just the tokens abc and 123.
abc% comment ( /%) blah blah blah
123
Comments (other than the %PDF–n.m and
%%EOF comments described in 7.5, "File
Structure") have no semantics. They
are not necessarily preserved by
applications that edit PDF files.
If they are allowed to appear after the startxref, parsing the file becomes more difficult because you do not know how far to back up from the %%EOF comment to start parsing to find the byte offset.
Any ideas?

ISO 32000 says the lines shall contain 'startxref' and the byte offset to the xref keyword. So, comments are not permitted. I checked the source for several PDF parsers (itext, Xpdf and commercial library) and all of them expected the byte offset immediately after startxref + whitespace.

Related

What do the ASCII characters preceding a carriage return represent in a PDF page?

This is probably a rather basic question, but I'm having a bit of trouble figuring it out, and it might be useful for future visitors.
I want to get at the raw data inside a PDF file, and I've managed to decode a page using the Python library PyPDF2 with the following commands:
import PyPDF2
with open('My PDF.pdf', 'rb') as infile:
mypdf = PyPDF2.PdfFileReader(infile)
raw_data = mypdf.getPage(1).getContents().getData()
print(raw_data)
Looking at the raw data provided, I have began to suspect that ASCII characters preceding carriage returns are significant: every carriage return that I've seen is preceded with one. It seems like they might be some kind of token identifier. I've already figured out that /RelativeColorimetric is associated with the sequence ri\r. I'm currently looking through the PDF 1.7 standard Adobe provides, and I know an explanation is in there somewhere, but I haven't been able to find it yet in that 756 page behemoth of a document
The defining thing here is not that \r – it is just inserted instead of a regular space for readability – but the fact that ri is an operator.
A PDF content stream uses a stack based Polish notation syntax: value1 value2 ... valuen operator
The full syntax of your ri, for example, is explained in Table 57 on p.127:
intent ri (PDF 1.1) Set the colour rendering intent in the graphics state (see 8.6.5.8, "Rendering Intents").
and the idea is that this indeed appears in this order inside a content stream. (... I tried to find an appropriate example of your ri in use but cannot find one; not even any in the ISO PDF itself that you referred to.)
A random stream snippet from elsewhere:
q
/CS0 cs
1 1 1 scn
1.5 i
/GS1 gs
0 -85.0500031 -14.7640076 0 287.0200043 344.026001 cm
BX
/Sh0 sh
EX
Q
(the indentation comes courtesy of my own PDF reader) shows operands (/CS0, 1 1 1, 1.5 etc.), with the operators (cs, scn, i etc.) at the end of each line for clarity.
This is explained in 7.8.2 Content Streams:
...
A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical Conventions." It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See EXAMPLE 4 in 7.4, "Filters," for an example of a content stream.
(my emphasis)
7.2.2 Character Set specifies that inside a content stream, whitespace characters such as tab, newline, and carriage return, are just that: separators, and may occur anywhere and in any number (>= 1) between operands and operators. It mentions
NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
– to which I can add that most PDF creating software indeed attempts to delimit 'lines' consisting of an operands-operator sequence with returns.

PDF extracted text seems to be unreadable

Situation: I've a PDF using version 1.6. In that PDF, there are several streams. There were compressed text (Flate) in that streams, so I decompressed these streams. After that, I extracted the Tj-parts of the corresponding, decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual Question: As I have no idea, what I've got thre, I would like to know what type of content it is. Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
Further research: The PDFs, which I try to analyze where generated by iTextSharp (seems to be an C# Library for generating PDFs). Don't know whether it is a relevant information, but it might be that that Library uses a special way of encrypt it's text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.

Minimal PDF size according to specs

I'm reading PDF specs and I have a few questions about the structure it has.
First of all, the file signature is %PDF-n.m (8 bytes).
After that the docs says there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I was trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) where placed in there? Where should I stop parsing?
After that, there should be a body, a xref table and a trailer and an %%EOF.
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
Third and last question: If there were more than one body+xref+trailer sections, where would be offset just before the %%EOF be pointing to? The first or the last xref table?
First of all, the file signature is %PDF-n.m (8 bytes). After that the docs says there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I was trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) where placed in there? Where should I stop parsing?
Which docs do you have? The PDF specification ISO 32000-1 says:
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater.
Thus, those at least 4 bytes of binary data are not immediately following the file signature without any structure but they are on a comment line! This implies that they are
preceded by a % (which starts a comment, i.e. data you have to ignore while parsing anyways) and
followed by an end-of-line, i.e. CR, LF, or CR LF.
So it is easy to recognize while parsing. In particular it merely is a special case of a comment line and nothing to treat specially.
(sigh, I just saw you and #Jongware cleared that in comments while I wrote this...)
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
If there are no objects, you don't have a PDF file as certain objects are required in a PDF file, in particular the catalog. So do you mean a minimal valid PDF file?
As you commented you indeed mean a minimal valid PDF.
Please have a look at the question What is the smallest possible valid PDF? on stackoverflow, there are some attempts to create minimal PDFs adhering more or less strictly to the specification. Reading e.g. #plinth's answer you will see stuff that is not PDF anymore but still accepted by Adobe Reader.
Third and last question: If there were more than one body+xref+trailer sections, where would be offset just before the %%EOF be pointing to?
Normally it would be the last cross reference table/stream as the usual use case is
you start with a PDF which has but one cross reference section;
you append an incremental update with a cross reference section pointing to the original as previous, and the new offset before %%EOF points to that new cross reference;
you append yet another incremental update with a cross reference section pointing to the cross references from the first update as previous, and the new offset before %%EOF points to that newest cross reference;
etc...
The exception is the case of linearized documents in which the offset before the %%EOF points to the initial cross references which in turn point to the section at the end of the file as previous. For details cf. Annex F of ISO 32000-1.
And as you can of course apply incremental updates to a linearized document, you can have mixed forms.
In general it is best for a parser to be able to parse any order of partial cross references. And don't forget, there are not only cross reference sections but also alternatively cross reference streams.

The PDF file starts with the header %pdf. Can a valid pdf file have more than 1 such headers?

The PDF file starts with the header %pdf. Can a valid pdf file have more than 1 such headers?
The Pdf specification says that it can have more than 1 trailers. But it does not talk about multiple headers. Can it have multiple headers?
It can have as many as you want, since the symbol % is used to represent comments inside a PDF file, so it's not actually a "header".
From the PDF specification:
7.2.3 Comments
Any occurrence of the PERCENT SIGN (25h) outside a string or stream
introduces a comment. The comment consists of all characters after the
PERCENT SIGN and up to but not including the end of the line,
including regular, delimiter, SPACE (20h), and HORZONTAL TAB
characters (09h). A conforming reader shall ignore comments, and treat
them as single white-space characters. That is, a comment separates
the token preceding it from the one following it.
EXAMPLE:
The PDF fragment in this example is syntactically equivalent to just
the tokens abc and 123. abc% comment ( /%) blah blah blah 123

How do you comment out code snippets within a seam pdf file

How do you comment out pieces of code within a seam PDF generation file. XML style comments don't seem to work and the commented out code appears as it is in the pdf file.
<p:font name="times-roman" size="12" style="bold normal">
<p:text value="Full Name."/>
</p:font>
<p:font name="times-roman" size="9" style="normal">
<p:text value=" #{abc.firstName} "/>
</p:font>
According to Adobe PDF Reference:
Any occurrence of the percent sign
character (%) outside a string or
stream introduces a comment. The
comment consists of all characters
between the percent sign and the end
of the line, including regular,
delimiter, space, and tab characters.
PDF ignores comments, treating them as
if they were single white-space
characters. That is, a comment
separates the token preceding it from
the one following it; thus, the PDF
fragment
abc% comment { /% ) blah blah blah
123
is syntactically equivalent
to just the tokens abc and 123.
I have not used JBoss Seam myself, but I guess you could try combining the two comment styles (xml and PDF) in your xml input file so that your comments are not visible in the resulting pdf file.