How are PDFs built? - pdf

Where can I find information about how a PDF is structured?
For example: a PDF I created named Dokname, containing the string TEST, looks like this when opened in a text editor:
(I replaced the parts the text editor couldn't decode with [...])
%PDF-1.4
%Óëéá
1 0 obj
<</Title (Dokname)
/Producer (Skia/PDF m102 Google Docs Renderer)>>
endobj
3 0 obj
<</ca 1
/BM /Normal>>
endobj
5 0 obj
<</Filter /FlateDecode
/Length 160>> stream
[...]
endstream
endobj
2 0 obj
<</Type /Page
/Resources <</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
/ExtGState <</G3 3 0 R>>
/Font <</F4 4 0 R>>>>
/MediaBox [0 0 596 842]
/Contents 5 0 R
/StructParents 0
/Parent 6 0 R>>
endobj
6 0 obj
<</Type /Pages
/Count 1
/Kids [2 0 R]>>
endobj
7 0 obj
<</Type /Catalog
/Pages 6 0 R>>
endobj
8 0 obj
<</Length1 14972
/Filter /FlateDecode
/Length 7164>> stream
[...]
endstream
endobj
9 0 obj
<</Type /FontDescriptor
/FontName /AAAAAA+ArialMT
/Flags 4
/Ascent 905.27344
/Descent -211.91406
/StemV 45.898438
/CapHeight 715.82031
/ItalicAngle 0
/FontBBox [-664.55078 -324.70703 2000 1005.85938]
/FontFile2 8 0 R>>
endobj
10 0 obj
<</Type /Font
/FontDescriptor 9 0 R
/BaseFont /AAAAAA+ArialMT
/Subtype /CIDFontType2
/CIDToGIDMap /Identity
/CIDSystemInfo <</Registry (Adobe)
/Ordering (Identity)
/Supplement 0>>
/W [0 [750] 40 54 666.99219 55 [610.83984]]
/DW 0>>
endobj
11 0 obj
<</Filter /FlateDecode
/Length 243>> stream
[...]
endstream
endobj
4 0 obj
<</Type /Font
/Subtype /Type0
/BaseFont /AAAAAA+ArialMT
/Encoding /Identity-H
/DescendantFonts [10 0 R]
/ToUnicode 11 0 R>>
endobj
xref
0 12
0000000000 65535 f
0000000015 00000 n
0000000365 00000 n
0000000098 00000 n
0000008721 00000 n
0000000135 00000 n
0000000573 00000 n
0000000628 00000 n
0000000675 00000 n
0000007925 00000 n
0000008159 00000 n
0000008407 00000 n
trailer
<</Size 12
/Root 7 0 R
/Info 1 0 R>>
startxref
8860
%%EOF
What do these obj elements represent? Where is my TEST? Why did it get scrambled?
What I am looking for can probably all be found in Adobe's documentation, but that runs to hundreds of pages, which is very overwhelming. I get that this is a very complex topic and I am not trying to understand it completely; I am just looking for an introduction or an overview. Unfortunately I didn't find anything like that on YouTube or elsewhere.

This is too complex for comments, and yes, you will only find snippets here and there, including this one and bits in my and others' answers.
Here is a quick overview of the code sample you provided.
A PDF is a collection of objects that need not be stored in sequential order. So you start at the end: just before the last %%EOF (potentially one of many!) you find startxref 8860, where 8860 is the decimal byte offset of the cross-reference (xref) table, i.e. the file's index.
There are many abbreviations (too many to list) and, as in a stack language, most things may appear (literally) backwards. The xref table gives the byte position of each object in the file.
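To make the "start at the end" step concrete, here is a minimal Python sketch that follows startxref to a classic xref table and lists the object offsets. The function name read_xref and the tiny sample file are made up for illustration; the sketch assumes a single, uncompressed xref table and does not handle incremental updates (/Prev) or cross-reference streams.

```python
import re

def read_xref(pdf: bytes) -> dict[int, int]:
    """Return {object number: byte offset} from the last classic xref table."""
    # The pointer to the xref table sits just before the final %%EOF.
    tail = pdf[pdf.rfind(b"startxref"):]
    xref_pos = int(tail.split()[1])
    lines = pdf[xref_pos:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic xref table")
    offsets, i = {}, 1
    # Each subsection starts with "first_obj count", followed by count entries.
    while i < len(lines) and re.fullmatch(rb"\d+ \d+\s*", lines[i]):
        first, count = map(int, lines[i].split())
        for n in range(count):
            offset, _gen, kind = lines[i + 1 + n].split()
            if kind == b"n":  # "n" = in use, "f" = free
                offsets[first + n] = int(offset)
        i += 1 + count
    return offsets

# Hypothetical one-object file, just enough to exercise the parser.
body = b"%PDF-1.4\n1 0 obj\n<</Type /Catalog>>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<</Size 2 /Root 1 0 R>>\nstartxref\n"
    + str(len(body)).encode()
    + b"\n%%EOF\n"
)
print(read_xref(sample))
```

Each offset returned should point at the "N 0 obj" keyword of the corresponding object, which is how a reader jumps straight to the catalog named by /Root.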
The prime target in this case is 7 0 obj <</Type /Catalog /Pages 6 0 R>> endobj, since the catalog tells us where the page tree is found: object 6, with /Type /Pages /Count 1 /Kids [2 0 R]. So there is one page, further defined in 2 0 obj.
There we see references to image procsets and font(s), and /MediaBox [0 0 596 842], which is roughly (a tad wider than) a standard A4 page, since 595/72" is closer to 210 mm.
There is too much to describe about that one item alone, so skipping to "Where is your text?": we see /Contents 5 0 R, so that compressed stream of data, which you need to decode, is most likely your text. Note that its length (/Length 160) is that of the binary Flate-encoded stream, which holds positioning operators as well, not just your raw plain text.
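As a rough illustration of decoding such a stream, the Python sketch below inflates the data between stream and endstream with zlib. The helper name inflate_stream and the sample content are assumptions, not taken from your file. The hex glyph codes are illustrative, picked to match the CIDs 40, 54 and 55 listed in the /W array: with a Type0 font and /Identity-H encoding the page text is likely stored as 2-byte glyph codes rather than the letters TEST, and the /ToUnicode stream (object 11) maps them back.

```python
import zlib

def inflate_stream(obj_bytes: bytes) -> bytes:
    """Decompress the Flate data between 'stream' and 'endstream' of one object."""
    start = obj_bytes.index(b"stream") + len(b"stream")
    # The stream keyword is followed by CRLF or LF before the data begins.
    if obj_bytes[start:start + 2] == b"\r\n":
        start += 2
    elif obj_bytes[start:start + 1] == b"\n":
        start += 1
    end = obj_bytes.rindex(b"endstream")
    # decompressobj() stops at the end of the zlib data and ignores the trailing EOL.
    return zlib.decompressobj().decompress(obj_bytes[start:end])

# Hypothetical content stream standing in for the real object 5.
content = b"BT /F4 12 Tf <0037002800360037> Tj ET"
packed = zlib.compress(content)
obj = (
    b"5 0 obj\n<</Filter /FlateDecode\n/Length %d>> stream\n" % len(packed)
    + packed
    + b"\nendstream\nendobj\n"
)
print(inflate_stream(obj))
```

This only works on the original binary file; a copy that has been through a text editor will usually have corrupted stream bytes.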
The amount of data for the subset font seems odd and excessive for just 4 letters (a font like Helvetica would not need embedding, nor breaking out as the CID font ArialMT), and without the full file it's hard to say why the /Image* entries are there, but it is the Google Docs Renderer!
My suspicion is that we may see characteristics of OCR in that stream.

Related

PDF sign and resign is not recognized when the signatures are visible

The idea is to be able to sign a PDF file multiple times with my own PDF parser.
Reference files: here.
When the signature isn't going to be visible, everything works. I sign 1.pdf once (2.pdf) and then twice (3.pdf), and Adobe Acrobat recognizes the signatures.
The problem arises when the signature should be visible. The first signing works correctly (2.pdf). However, the second (3.pdf) fails: Acrobat says the first signature is invalidated and the second is not recognized.
As far as I can tell, the only difference between visible and invisible is the addition of the text object. Why does Adobe invalidate the first signature, and why isn't the second recognized?
28 0 obj
<</BaseFont/Helvetica/Type/Font/Subtype/Type1/Encoding/WinAnsiEncoding/Name/Helv>>
endobj
29 0 obj
<</BaseFont/ZapfDingbats/Type/Font/Subtype/Type1/Name/ZaDb>>
endobj
31 0 obj<</Font 32 0 R>>
endobj
32 0 obj<</FAdESFont2 33 0 R>>
endobj
33 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
34 0 obj
<</Length 90>>stream
BT
6 760 TD
/FAdESFont2 6 Tf
(m#turboirc.com MICHAIL CHOURDAKIS 1/23/2023 17:24:10) Tj
ET
endstream
endobj
26 0 obj
<</Type/XObject/Resources<</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]>>/Subtype/Form/BBox[0 0 0 0]/Matrix [1 0 0 1 0 0]/Length 8/FormType 1/Filter/FlateDecode>>stream
xœ
endstream
endobj
3 0 obj
<</Contents[34 0 R 24 0 R 12 0 R]/CropBox[0.0 0.0 612.0 792.0]/MediaBox[0.0 0.0 612.0 792.0]/Parent 2 0 R/Resources 13 0 R/Rotate 0/Type/Page/Annots[17 0 R 27 0 R]>>
endobj
2 0 obj
<</Count 1/Kids[3 0 R]/Type/Pages>>
endobj
1 0 obj
<</AcroForm<</Fields[17 0 R 27 0 R]/DR<</Font<</Helv 28 0 R/ZaDb 29 0 R>>>>/DA(/Helv 0 Tf 0 g )/SigFlags 3>>/AcroForm<</Fields[17 0 R]/DR<</Font<</Helv 18 0 R/ZaDb 19 0 R>>>>/DA(/Helv 0 Tf 0 g )/SigFlags 3>>/Pages 2 0 R/Type/Catalog>>
endobj
14 0 obj
<</Producer(AdES Tools https://www.turboirc.com)/ModDate(D:20230123152410+00'00')>>
endobj
xref
Why does Adobe invalidate the first signature, and why isn't the second recognized?
Because you add the visualizations of the signatures in an inappropriate way.
You add visualizations of the signatures by adding to the static page content (the page content streams). This is the wrong approach if you want to be able to add signatures to already signed PDFs, because manipulation of the static page content after signing is a forbidden change, see this answer.
The appropriate way to add visualizations of PDF signatures is by adding an appearance stream to the respective signature field widget.
For details you may want to study the PDF specification ISO 32000.

Visible signature in PDF file (2)

Continuing from this question, the PDF is now constructed as such:
8 0 obj
<</F 132/Type/Annot/Subtype/Widget/Rect[2 198 100 190]/FT/Sig/DR<<>>/T(Signature1)/V 6 0 R/P 3 0 R/AP<</N 7 0 R>>>>
endobj
6 0 obj
<</Contents <...>/Type/Sig/SubFilter/ETSI.CAdES.detached/M(D:20230128131946+00'00')/ByteRange [0 830 60832 1714]/Filter/Adobe.PPKLite>>
endobj
9 0 obj
<</BaseFont/Helvetica/Type/Font/Subtype/Type1/Encoding/WinAnsiEncoding/Name/Helv>>
endobj
10 0 obj
<</BaseFont/ZapfDingbats/Type/Font/Subtype/Type1/Name/ZaDb>>
endobj
12 0 obj<</Font 13 0 R>>
endobj
13 0 obj<</FAdESFont1 14 0 R>>
endobj
14 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
15 0 obj
<</Length 90>>stream
BT
2 194 TD
/FAdESFont1 5 Tf
(m#turboirc.com MICHAIL CHOURDAKIS 1/28/2023 15:19:46) Tj
ET
endstream
endobj
7 0 obj
<</Type/XObject/Resources<</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]>>/Subtype/Form/BBox[2 198 100 190]/Length 90/FormType 1/Filter/FlateDecode>>stream
BT
2 194 TD
/FAdESFont1 5 Tf
(m#turboirc.com MICHAIL CHOURDAKIS 1/28/2023 15:19:46) Tj
ET
endstream
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 4 0 R>>>>/Contents 5 0 R/Annots[8 0 R]>>
endobj
2 0 obj
<</Type/Pages/MediaBox[0 0 200 200]/Count 1/Kids[3 0 R]>>
endobj
1 0 obj
<</AcroForm<</Fields[8 0 R]/DR<</Font<</Helv 9 0 R/ZaDb 10 0 R>>>>/DA(/Helv 0 Tf 0 g )/SigFlags 3>>/Type/Catalog/Pages 2 0 R>>
endobj
11 0 obj
<</Producer(AdES Tools https://www.turboirc.com)/ModDate(D:20230128131946+00'00')>>
endobj
xref
0 4
0000000000 65535 f
0000061862 00000 n
0000061787 00000 n
0000061681 00000 n
6 10
0000000810 00000 n
0000061409 00000 n
0000000679 00000 n
0000060958 00000 n
0000061056 00000 n
0000062004 00000 n
0000061133 00000 n
0000061165 00000 n
0000061203 00000 n
0000061271 00000 n
trailer
<</Root 1 0 R/Prev 492/Info 11 0 R/Size 20/ID[<6BD3BF95416A5C19FFBC464EC610875C><54ACC00AA74869363131BCC04E65417F>]>>
startxref
62104
%%EOF
The idea is:
Create the annotation object (8), which refers to the signature via /V (6) and to something to show via /AP /N (7).
The appearance object (7) is a stream containing the text?
7 0 obj <</Type/XObject/Resources<</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]>>/Subtype/Form/BBox[2 198 100 190]/Length 90/FormType 1/Filter/FlateDecode>>stream
BT
2 194 TD
/FAdESFont1 5 Tf
(m#turboirc.com MICHAIL CHOURDAKIS 1/28/2023 15:19:46) Tj
ET
endstream
endobj
This time Adobe accepts the signature and shows a "box" that I can click to display the signature information, but the text (mail, name, date) is not displayed.
What am I missing?
In the previous approach I was changing the content of the original root, but I learned from this question that this is an incorrect way of adding a visible signature and will not work for re-signing.
Your appearance stream in object 7 has some errors, in particular:
Its resources dictionary does not contain a fonts section; so how should the text in it be rendered?
It claims to be flate-encoded but obviously is not.
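A minimal Python sketch of addressing both points when generating the object: the resources get a /Font entry, and the data is really deflated, with /Length set to the compressed size. The function form_xobject, the bounding box, the text position and the sample text are all assumptions for illustration, not your parser's API; the font reference 14 0 R is borrowed from your listing.

```python
import zlib

def form_xobject(obj_num: int, content: bytes, font_name: str,
                 font_ref: str, bbox: str) -> bytes:
    """Serialize a Flate-compressed form XObject that declares its own font."""
    data = zlib.compress(content)  # actually apply the filter named in /Filter
    header = (
        f"{obj_num} 0 obj\n"
        f"<</Type/XObject/Subtype/Form/FormType 1/BBox[{bbox}]"
        f"/Resources<</Font<</{font_name} {font_ref}>>>>"
        f"/Filter/FlateDecode/Length {len(data)}>>stream\n"
    ).encode("ascii")
    return header + data + b"\nendstream\nendobj\n"

# Hypothetical appearance content; coordinates are relative to the assumed BBox.
content = b"BT\n/FAdESFont1 5 Tf\n2 2 Td\n(Example signer) Tj\nET"
obj = form_xobject(7, content, "FAdESFont1", "14 0 R", "0 0 98 8")
print(obj[:120])
```

Alternatively, dropping /Filter and writing the text uncompressed with a matching /Length would also make the dictionary truthful.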

Set User Units when generating a PDF

I need to change the default user unit in a generated pdf file. Here is a minimal example which displays, but without the correct document size.
%PDF-1.7
1 0 obj
<< /Type /Catalog
/Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages
/Kids [ 3 0 R ]
/Count 1 >>
endobj
3 0 obj
<< /Type /Page
/Parent 2 0 R
/UserUnit 2.83
/MediaBox [0 0 2440 1220]
/Contents 4 0 R >>
endobj
4 0 obj
<< /Length 44 >>
stream
0.3 0.5 0.2 0.1 k
100 100 400 400 re
f
endstream
endobj
xref
0 5
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000117 00000 n
0000000221 00000 n
trailer
<< /Size 5
/Root 1 0 R >>
startxref
309
%%EOF
If you open this file in a PDF viewer, it's as if the UserUnit default has not been changed.
I need to get the user units as close to millimetres as possible. The graphics in this file are to be printed onto board then cut out with a CNC machine so there needs to be some level of accuracy with the printing.
How do you set the UserUnit value correctly?
Never assume Apple Preview does the correct thing with PDF files.
If you open this in Adobe Acrobat, the reported page size is 2436 x 1218mm, which I believe is correct for your UserUnit value.
The box looks the same size proportionally as what is shown in Preview, so I'm going to assume that one is drawn correctly as well.
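The arithmetic behind that reported size can be checked with a few lines of Python. This is only a sketch: page_size_mm is a made-up helper, and it assumes the usual definition of the default user-space unit as 1/72 inch, scaled by /UserUnit.

```python
POINT_IN_MM = 25.4 / 72  # one default user-space unit (1/72 inch) in millimetres

def page_size_mm(media_box, user_unit=1.0):
    """Physical page size in mm for a MediaBox scaled by UserUnit."""
    llx, lly, urx, ury = media_box
    width = (urx - llx) * user_unit * POINT_IN_MM
    height = (ury - lly) * user_unit * POINT_IN_MM
    return width, height

box = [0, 0, 2440, 1220]

# With the value from the question, the page comes out slightly short of 2440 mm.
w, h = page_size_mm(box, 2.83)
print(round(w), round(h))

# A UserUnit of 72/25.4 (about 2.834646) makes one unit exactly one millimetre.
exact_unit = 72 / 25.4
print(exact_unit, page_size_mm(box, exact_unit))
```

So 2.83 is simply a rounded value; writing more decimals in /UserUnit should bring the page closer to 2440 x 1220 mm in viewers that honour the key.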

Writing multiline text in pdf page

I want to write multiline text. I've tried this:
6 0 obj
<</Length 59>>
stream
BT /F1 24 Tf 100 520 Td (This is test\n This is test)Tj ET
endstream
endobj
But I am not getting a new line. Is there a simple way to achieve that, or must I create another stream with the position of the next line?
This is the full code:
%PDF-1.5
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj<</Type /Page /Parent 2 0 R /Resources 4 0 R /MediaBox [0 0 500 700] /Contents 6 0 R>>
endobj
4 0 obj<</Font <</F1 5 0 R>>>>
endobj
5 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
6 0 obj
<</Length 75>>
stream
BT
/F1 24 Tf
100 520 Td
(This is test) Tj
T*
(This is test) Tj
ET
endstream
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000059 00000 n
0000000116 00000 n
0000000219 00000 n
0000000259 00000 n
0000000328 00000 n
trailer <</Size 7/Root 1 0 R>>
startxref
454
%%EOF
You may want to do something like this:
BT
/F1 24 Tf
30 TL
100 520 Td
(This is test) Tj
T*
(This is test) Tj
ET
or the shorter form:
BT
/F1 24 Tf
30 TL
100 520 Td
(This is test) Tj
(This is test) '
ET
You might want to read up on section 9.4.3 Text-Showing Operators in the PDF specification ISO 32000-1.
P.S.: Added text leading TL operators.
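If you generate the content stream from code, the same pattern can be sketched in Python as below. The helper names and default values are assumptions taken from the example above, not part of any library. Note that \n inside a PDF string is just a line-feed character in the string; Tj does not turn it into a line break, which is why T* (with a leading set by TL) is needed.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def multiline_text(lines, font="/F1", size=24, leading=30, x=100, y=520) -> str:
    """Build a text object that shows each entry of lines on its own line."""
    ops = ["BT", f"{font} {size} Tf", f"{leading} TL", f"{x} {y} Td"]
    for i, line in enumerate(lines):
        if i:
            ops.append("T*")  # move to the start of the next line, using TL
        ops.append(f"({escape_pdf_string(line)}) Tj")
    ops.append("ET")
    return "\n".join(ops)

stream = multiline_text(["This is test", "This is test"])
print(stream)
print(len(stream.encode("latin-1")))  # candidate value for /Length
```

The printed length only matches /Length if the stream is written to the file with exactly these bytes.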

Minimal PDF example in PDF specification

I took the minimal PDF example from the PDF specification, copied it into Notepad, and renamed the file to have the extension .pdf.
I can open it with other PDF viewers (PDF-XChange, SumatraPDF, MuPDF). But when I open it with Adobe Reader, it says the file is broken.
I am not sure whether the other viewers treat this "broken" file as a blank file or not.
The file is supposed to display one blank page, since it is a minimal example.
In fact, I modified the minimal example, because when I copy it from the PDF specification to Notepad and open the .txt file in a hex editor, I see that each line break in the .txt file shows up as 2 characters. For example,
1 0 obj
<< /Type /Catalog
gives me (in Hex Editor)
1 0 obj << /Type /Catalog
which is (in hex values)
31 20 30 20 6F 62 6A 0D 0A 3C 3C 20 2F 54 79 70
65 20 2F 43 61 74 61 6C 6F 67
The 2 characters between j and < are 0D 0A (CR LF).
Hence I don't use line breaks in Notepad, and I modified the values in the xref part.
Below is the full code.
Do you know what's wrong with this example? Why does Adobe Reader say it is broken? Is this because I gave the wrong values in xref?
%PDF-1.4 1 0 obj << /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >> endobj 2 0 obj << /Type Outlines /Count 0 >> endobj 3 0 obj << /Type /Pages /Kids [4 0 R] /Count 1 >> endobj 4 0 obj << /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet 6 0 R >> >> endobj 5 0 obj << /Length 35 >> stream … Page-marking operators … endstream endobj 6 0 obj [/PDF] endobj xref 0 7 0000000000 65535 f 0000000009 00000 n 0000000074 00000 n 0000000119 00000 n 0000000176 00000 n 0000000295 00000 n 0000000373 00000 n trailer << /Size 7 /Root 1 0 R >> startxref 395 %%EOF
First: when you 'copied' the example from the PDF specification, very likely a few things happened which caused your copy not to work as expected:
...you didn't 'copy' by re-typing the example in a text editor, but
...you used copy'n'paste, using a PDF as the source file.
Depending on your text editor, that method probably changed the newline convention from [cr]+[lf] to [cr] or vice versa. This in turn means that the byte offsets in the object 'table of contents' (the 'xref' table) are no longer valid.
Another problem with the PDF source code you posted is that it no longer contains any line breaks at all. Some viewers may still be able to parse it silently, but not all are. And it certainly is against the spec, because chapter 7.5.2 clearly spells out that
"The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7."
Your header violates that rule.
Also, the 'stream' in 5 0 obj isn't valid PDF code; it is just placeholder text (… Page-marking operators …). Some viewers may choke when they come across such 'garbage'.
Lastly, your startxref value wasn't correct.
So here is a file that works. I repaired it in a text editor, and I put your original code as a comment after the %%EOF for comparison and reference:
%PDF-1.4
1 0 obj
<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>
endobj
2 0 obj
<< /Type Outlines /Count 0 >>
endobj
3 0 obj
<< /Type /Pages /Kids [4 0 R] /Count 1 >>
endobj
4 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet 6 0 R >> >>
endobj
5 0 obj
<< /Length 35 >>
stream
… Page-marking operators …
endstream
endobj
6 0 obj
[/PDF]
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000119 00000 n
0000000176 00000 n
0000000295 00000 n
0000000376 00000 n
trailer
<< /Size 7 /Root 1 0 R >>
startxref
394
%%EOF
%% %PDF-1.4 1 0 obj << /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >> endobj 2 0 obj << /Type Outlines /Count 0 >> endobj 3 0 obj << /Type /Pages /Kids [4 0 R] /Count 1 >> endobj 4 0 obj << /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet 6 0 R >> >> endobj 5 0 obj << /Length 35 >> stream … Page-marking operators … endstream endobj 6 0 obj [/PDF] endobj xref 0 7 0000000000 65535 f 0000000009 00000 n 0000000074 00000 n 0000000119 00000 n 0000000176 00000 n 0000000295 00000 n 0000000373 00000 n trailer << /Size 7 /Root 1 0 R >> startxref 395
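Counting byte offsets by hand is error-prone, so here is a Python sketch that assembles a one-page blank PDF and computes the xref offsets and the startxref value itself. The function build_pdf and the three object bodies are made up for illustration (they are a simplified variant, not the spec's example), and the sketch assumes plain ASCII object bodies with no streams.

```python
def build_pdf(objects: list[bytes], eol: bytes = b"\n") -> bytes:
    """Assemble numbered objects, an xref table and a trailer; object 1 is the catalog."""
    out = bytearray(b"%PDF-1.4" + eol)
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj" % num + eol + body + eol + b"endobj" + eol
    xref_at = len(out)
    out += b"xref" + eol + b"0 %d" % (len(objects) + 1) + eol
    # Every xref entry is exactly 20 bytes including its end-of-line marker.
    pad = b" " + eol if len(eol) == 1 else eol
    out += b"0000000000 65535 f" + pad
    for off in offsets:
        out += b"%010d 00000 n" % off + pad
    out += b"trailer" + eol + b"<< /Size %d /Root 1 0 R >>" % (len(objects) + 1) + eol
    out += b"startxref" + eol + b"%d" % xref_at + eol + b"%%EOF" + eol
    return bytes(out)

# Hypothetical object bodies for a single blank US Letter page.
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
]
lf = build_pdf(objects)
crlf = build_pdf(objects, eol=b"\r\n")
print(lf.decode("ascii"))
```

Comparing lf and crlf shows the effect described above: switching the end-of-line convention shifts every offset, so a table written for one convention no longer matches a file saved with the other.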