Out of simple curiosity, having seen the smallest GIF, what is the smallest possible valid PDF file?
This is an interesting problem. Taking it by the book, you can start off with this:
%PDF-1.0
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj 3 0 obj<</Type/Page/MediaBox[0 0 3 3]>>endobj
xref
0 4
0000000000 65535 f
0000000010 00000 n
0000000053 00000 n
0000000102 00000 n
trailer<</Size 4/Root 1 0 R>>
startxref
149
%EOF
which is 291 bytes of PDF joy. Acrobat opens it, but it complains somewhat. There is one page in it and it is 3/72" square, the minimum allowed by the spec.
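For anyone who wants to check the numbers in the xref table, here is a rough Python sketch that assembles the same three objects and records where each one starts. It assumes Windows (CRLF) newlines, and the function name build_minimal_pdf is just something I made up for illustration.

```python
def build_minimal_pdf():
    """Assemble the one-page file and record its xref offsets (CRLF newlines assumed)."""
    eol = b"\r\n"
    objects = [
        b"1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj",
        b"2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj",
        b"3 0 obj<</Type/Page/MediaBox[0 0 3 3]>>endobj",
    ]
    out = bytearray(b"%PDF-1.0" + eol)
    offsets = []
    for index, obj in enumerate(objects):
        offsets.append(len(out))  # byte position where "N 0 obj" starts
        separator = eol if index == len(objects) - 1 else b" "
        out += obj + separator

    xref_pos = len(out)  # this is the value that goes after startxref
    out += b"xref" + eol + b"0 %d" % (len(objects) + 1) + eol
    out += b"0000000000 65535 f" + eol  # entry for the free object 0
    for offset in offsets:
        out += b"%010d 00000 n" % offset + eol  # each entry is exactly 20 bytes
    out += b"trailer<</Size %d/Root 1 0 R>>" % (len(objects) + 1) + eol
    out += b"startxref" + eol + b"%d" % xref_pos + eol + b"%%EOF"
    return bytes(out), offsets, xref_pos


pdf, offsets, xref_pos = build_minimal_pdf()
print(offsets, xref_pos)  # [10, 53, 102] 149
```

The offsets come out as 10, 53 and 102, with the xref table at 149, which matches the table above. With a different newline convention all of these numbers shift.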
However, Acrobat X doesn't even bother with the cross reference table anymore, so we can take that out:
%PDF-1.0
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj 3 0 obj<</Type/Page/MediaBox[0 0 3 3]>>endobj
trailer<</Size 4/Root 1 0 R>>
Acrobat complains, but opens it. Now we're at 178 bytes.
Turns out that you don't need that /Size in the trailer. Now we're at 172:
%PDF-1.0
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj 3 0 obj<</Type/Page/MediaBox[0 0 3 3]>>endobj
trailer<</Root 1 0 R>>
Turns out you don't need all those pesky /Type elements in your dictionaries:
%PDF-1.0
1 0 obj<</Pages 2 0 R>>endobj 2 0 obj<</Kids[3 0 R]/Count 1>>endobj 3 0 obj<</MediaBox[0 0 3 3]>>endobj
trailer<</Root 1 0 R>>
Now we're at 138 bytes.
It also turns out that when the spec says "shall be an indirect reference" and /Count is required, and the header "must" be %PDF-1.0, they're making loose suggestions. This is the smallest I could make it and have it openable in Acrobat X:
%PDF-1.
trailer<</Root<</Pages<</Kids[<</MediaBox[0 0 3 3]>>]>>>>>>
70 bytes.
Now, my editor uses Windows newlines, but Acrobat accepts Windows, Mac, or Unix conventions. Using a hex editor, I replaced the \r\n with \r and removed the last newline altogether, which leaves me with 67 bytes:
25 50 44 46 2D 31 2E 0D 74 72 61 69 6C 65 72 3C
3C 2F 52 6F 6F 74 3C 3C 2F 50 61 67 65 73 3C 3C
2F 4B 69 64 73 5B 3C 3C 2F 4D 65 64 69 61 42 6F
78 5B 30 20 30 20 33 20 33 5D 3E 3E 5D 3E 3E 3E
3E 3E 3E
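If you don't have a hex editor handy, a few lines of Python can turn the dump back into a file. This is only a sketch; the variable names are mine.

```python
HEX_DUMP = """
25 50 44 46 2D 31 2E 0D 74 72 61 69 6C 65 72 3C
3C 2F 52 6F 6F 74 3C 3C 2F 50 61 67 65 73 3C 3C
2F 4B 69 64 73 5B 3C 3C 2F 4D 65 64 69 61 42 6F
78 5B 30 20 30 20 33 20 33 5D 3E 3E 5D 3E 3E 3E
3E 3E 3E
"""

# Strip all whitespace first so the dump can keep its line layout.
pdf_bytes = bytes.fromhex("".join(HEX_DUMP.split()))

print(len(pdf_bytes))  # 67
print(pdf_bytes)

# To save it (binary mode, so no newline translation happens):
# with open("tiny.pdf", "wb") as handle:
#     handle.write(pdf_bytes)
```

Writing in binary mode matters here, because a text-mode write on Windows could turn the bare \r back into something else.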
I tried taking off the last end dictionary (>>), but Acrobat wouldn't have that. The PDF reader built into Google Chrome (Foxit) won't open it.
As a PostScript (HA! See what I did there?), if you consent to Acrobat "repairing" the file, it bumps up to 3550 bytes, most of it optional metadata, but it leaves behind a number of clear spec violations.
I could not get the hello world example to open.
For a small-ish file with text content:
%PDF-1.2
9 0 obj
<<
>>
stream
BT/ 9 Tf(Test)' ET
endstream
endobj
4 0 obj
<<
/Type /Page
/Parent 5 0 R
/Contents 9 0 R
>>
endobj
5 0 obj
<<
/Kids [4 0 R ]
/Count 1
/Type /Pages
/MediaBox [ 0 0 99 9 ]
>>
endobj
3 0 obj
<<
/Pages 5 0 R
/Type /Catalog
>>
endobj
trailer
<<
/Root 3 0 R
>>
%%EOF
Based on all the answers here, here's the smallest PDF with text:
SMALL_PDF = (
b"%PDF-1.2 \n"
b"9 0 obj\n<<\n>>\nstream\nBT/ 32 Tf( YOUR TEXT HERE )' ET\nendstream\nendobj\n"
b"4 0 obj\n<<\n/Type /Page\n/Parent 5 0 R\n/Contents 9 0 R\n>>\nendobj\n"
b"5 0 obj\n<<\n/Kids [4 0 R ]\n/Count 1\n/Type /Pages\n/MediaBox [ 0 0 250 50 ]\n>>\nendobj\n"
b"3 0 obj\n<<\n/Pages 5 0 R\n/Type /Catalog\n>>\nendobj\n"
b"trailer\n<<\n/Root 3 0 R\n>>\n"
b"%%EOF"
)
As base64. Copy this and test in Chrome:
data:application/pdf;base64,JVBERi0xLjIgCjkgMCBvYmoKPDwKPj4Kc3RyZWFtCkJULyAzMiBUZiggIFlPVVIgVEVYVCBIRVJFICAgKScgRVQKZW5kc3RyZWFtCmVuZG9iago0IDAgb2JqCjw8Ci9UeXBlIC9QYWdlCi9QYXJlbnQgNSAwIFIKL0NvbnRlbnRzIDkgMCBSCj4+CmVuZG9iago1IDAgb2JqCjw8Ci9LaWRzIFs0IDAgUiBdCi9Db3VudCAxCi9UeXBlIC9QYWdlcwovTWVkaWFCb3ggWyAwIDAgMjUwIDUwIF0KPj4KZW5kb2JqCjMgMCBvYmoKPDwKL1BhZ2VzIDUgMCBSCi9UeXBlIC9DYXRhbG9nCj4+CmVuZG9iagp0cmFpbGVyCjw8Ci9Sb290IDMgMCBSCj4+CiUlRU9G
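If you change the text and need a fresh URI, something like the following should regenerate it. It is a sketch using only the standard library; pdf_to_data_uri is a name I made up, and the sample bytes are just the first line of the file above.

```python
import base64


def pdf_to_data_uri(pdf_bytes: bytes) -> str:
    """Wrap raw PDF bytes in a base64 data: URI."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return "data:application/pdf;base64," + encoded


# Pass the whole SMALL_PDF bytes object here; this sample is only the header line.
sample = b"%PDF-1.2 \n"
print(pdf_to_data_uri(sample))
```

Every URI produced this way for a file starting with %PDF-1 begins with JVBERi0x, which is a quick sanity check when pasting.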
To make the page bigger, adjust the MediaBox dimensions :)
/MediaBox [ 0 0 250 50 ]
I thought I'd make a smallest pdf that displays "Hello World". The text is in the lower left corner. Sorry about the 9-point font, any larger would cost an extra byte :)
172 bytes for Adobe Reader X (if saved with linefeed-only newlines and no trailing newline or null-byte):
%PDF-1.
1 0 obj<</Kids[<</Parent 1 0 R/Resources<<>>/Contents 2 0 R>>]>>endobj 2 0 obj<<>>stream
BT/ 9 Tf(Hello World)' ET
endstream
endobj trailer<</Root<</Pages 1 0 R>>>>
120 bytes for Chrome's builtin PDF viewer:
%PDF 1 0 obj<</Pages<</Kids[<</Contents<<>>stream
BT 9 Tf(Hello World)' ET endstream>>]>>>>endobj trailer<</Root 1 0 R>>
To easily see this in Chrome, paste this URI in the address bar (SO won't let me link to it, and it won't work at all in other browsers):
data:application/pdf,%25PDF%201%200%20obj%3C%3C%2FPages%3C%3C%2FKids%5B%3C%3C%2FContents%3C%3C%3E%3Estream%0ABT%209%20Tf(Hello%20World)'%20ET%20endstream%3E%3E%5D%3E%3E%3E%3Eendobj%20trailer%3C%3C%2FRoot%201%200%20R%3E%3E
I was going to give an example of what I thought was the minimal valid "universal" PDF, until I remembered that the whole ethos of PDF is to render exactly the same on all devices and in all PDF readers. On cross-checking my "perfectly small, well-formed PDF" I spotted a problem. TL;DR: it is fixed in my personal minimal text template (at the end).
So the ground rule was "smallest possible valid PDF", but I consider that a file with this shortcoming should count as invalid, since it is not "fit for purpose". The minimum PDF must therefore contain at least one means of fixing a working font.
To explain my proposed solution and why it's less than perfect, here it is in rough form (rough because of cut and paste).
%PDF-1.0
%µ¶
1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj
2 0 obj
<</Kids[3 0 R]/Count 1/Type/Pages/MediaBox[0 0 595 792]>>
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/Contents 4 0 R/Resources<<>>>>
endobj
4 0 obj
<</Length 58>>
stream
q
BT
/ 96 Tf
1 0 0 1 36 684 Tm
(Hello World!) Tj
ET
Q
endstream
endobj
xref
0 5
0000000000 65536 f
0000000016 00000 n
0000000062 00000 n
0000000136 00000 n
0000000209 00000 n
trailer
<</Size 5/Root 1 0 R>>
startxref
316
%%EOF
Although not required by the rules of the question, I have drawn on some past experience of user problems.
The first difference you might note is the media box in the 2nd object: MediaBox[0 0 595 792] is a hybrid that takes the narrower A4 width and the shorter US Letter height. Otherwise the "universal page" would, in most countries, force a second sheet when printing at 100% scale, because the page definition is either too wide or too high for the locale's default paper.
The current problem is evident in the 3rd object: no fonts have been set in the resources. I contend that, in aiming for the minimal PDF, a file without a defined font is invalid.
Thus none of the answers so far, including my own, appears to produce a PDF that will "WORK" as a "VALID" means of producing the same printout regardless of platform or viewer.
Turning to libraries, I found a 3 MB zip with an exceptionally versatile Windows .exe (a single file that can do most PDF functions: split, merge, import, stamp, export attachments, etc.) which can take "Hello World!" on the command line and produce a good working file. This is the page centre, zoomed in:
It uses a stream for the text and its positioning, and has other conforming data such as the producer, so I offer this as a potentially good minimal file to pare down. Note that as presented here this file will appear blank, because the binary stream was corrupted when pasted as text.
%PDF-1.7
%µ¶
1 0 obj
<</Pages 2 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[5 0 R]/MediaBox[0 0 595 792]/Type/Pages>>
endobj
3 0 obj
<</BaseFont/Helvetica/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>
endobj
4 0 obj
<</Filter/FlateDecode/Length 101>>
stream
xœ*Tp
QÐw3P04Ò30PISp
Q01
à˜kdf¢ga¬`bhâ%ç‚ô(„”#©Aîè"EéÚlA
HW‘‚†GjNN¾Bx~QNŠ¢¦BHÈÞ## ÿÿFå
endstream
endobj
5 0 obj
<</Contents 4 0 R/CropBox[0 0 595 792]/MediaBox[0 0 595 792]/Parent 2 0 R/Resources<</Font<</F0 3 0 R>>>>/Type/Page>>
endobj
6 0 obj
<</CreationDate(D:20220600600709+01'00')/ModDate(D:20220600600709+01'00')/Producer(me 2)>>
endobj
xref
0 7
0000000000 65536 f
0000000016 00000 n
0000000062 00000 n
0000000136 00000 n
0000000225 00000 n
0000000395 00000 n
0000000529 00000 n
trailer
<</Size 7/Info 6 0 R/Root 1 0 R/ID[<A2A0CE5CCD9D0DABD5845AD574BF0A5C><09BF9D281BE12CB5B5933BB2B62B0D4D>]>>
startxref
636
%%EOF
P.S. I deliberately added a non-valid item, so this is intentionally not the minimum working answer. See if you can work out what's clearly wrong :-)
My personal offering
I am often asked how to write plain-text templated PDFs, so I need the font to be static (Helvetica or Courier should do) and a structure that is easy to modify from the Windows CMD line. This suits my purpose: it is now 698 bytes as shown, with two placeholders to show multi-line text. If needed, you can find and replace Helvetica with Courier (note the 2 intentional spaces after Courier, which keep the byte count).
%PDF-1.1
%âã
1 0 obj
<</Type/Catalog/Pages<</Type/Pages/Count 1/Kids[2 0 R]>>>>
endobj
2 0 obj
<</Type/Page/Parent 1 0 R/MediaBox[0 0 594 792]/Resources<</Font<</F1 3 0 R>>/ProcSet[/PDF/Text]>>/Contents 4 0 R>>
endobj
3 0 obj
<</Type/Font/Subtype/Type1/Name/F1/BaseFont/Helvetica>>
endobj
4 0 obj
<</Length 5 0 R>>
stream
BT
/F1 36 Tf
1 0 0 1 255 752 Tm
48 TL
( Hello)'
(World!)'
ET
endstream
endobj
5 0 obj
78
endobj
xref
0 6
0000000000 65536 f
0000000017 00000 n
0000000094 00000 n
0000000228 00000 n
0000000302 00000 n
0000000425 00000 n
trailer
<</Size 6/Info <</CreationDate(D:2023)/Producer(cmd2pdf)/Title(mini.pdf)>>/Root 1 0 R>>
startxref
446
%%EOF
To see how this approach works on the Windows command line, RIGHT CLICK and download as text https://github.com/GitHubRulesOK/MyNotes/raw/master/MAKE-PDF.cmd (now 200 lines long!). NOTE: browser security may ask you to trust a .cmd download, so use a .txt extension; you will still need to change the file properties to UNBLOCK it once you are happy it should do no harm to run it!
@mkl, are you up for producing your best shot?
According to this Ange Albertini lecture, the smallest possible valid PDF is 36 bytes:
%PDF-(NULL)trailer<</Root<</Pages<<>>>>>>
Where (NULL) is the unprintable ASCII 0 character.
However, as Ange notes, while this PDF is technically valid, most PDF reader apps will regard it as invalid based on the size alone, thus failing to open it.
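Since the NULL byte can't be typed in most editors, here is a small Python sketch that spells the file out as a bytes literal. The variable name is mine.

```python
# \x00 is the unprintable NUL byte that stands in for the version number.
TINY_PDF = b"%PDF-\x00trailer<</Root<</Pages<<>>>>>>"

print(len(TINY_PDF))  # 36

# To save it for testing in a viewer:
# with open("tiny36.pdf", "wb") as handle:
#     handle.write(TINY_PDF)
```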
I needed a PDF version which is usable by a PDF converter (A4 format issue.. all the above constructs worked with Adobe Reader and Chrome, but not with the PDF converter which required DIN A4).
I found this site and this PDF worked fine with the PDF converter I'm using: https://help.callassoftware.com/m/73261/l/798383-how-to-create-a-simple-pdf-file
Working for a PDF-related company, I know that the following content works pretty well. This is a valid empty page (the 612 x 792 pt MediaBox is US Letter; an A4 page would be about 595 x 842):
%PDF-1.4
%âãÏÓ
5 0 obj
<<
/Length 1
>>
stream
endstream
endobj
4 0 obj
<<
/Type /Page
/MediaBox [0 0 612 792]
/Resources <<
>>
/Contents 5 0 R
/Parent 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [4 0 R]
/Count 1
>>
endobj
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
3 0 obj
<<
/Creator (PDF Creator http://www.pdf-tools.com)
/CreationDate (D:20150701112447+02'00')
/ModDate (D:20220607183602+02'00')
/Producer (3-Heights\222 PDF Optimization Shell 6.0.0.0 \(http://www.pdf-tools.com\))
>>
endobj
xref
0 6
0000000000 65535 f
0000000226 00000 n
0000000169 00000 n
0000000275 00000 n
0000000065 00000 n
0000000015 00000 n
trailer
<<
/Size 6
/Root 1 0 R
/Info 3 0 R
/ID [<1C3500CA9F7232B97E0EF3F789E8B7F2> <254C8D153F655D49945EAD68D801E011>]
>>
startxref
505
%%EOF
Now, using JavaScript, you can embed this into your JS bundle. First base64-encode the content above, then create a Blob from the encoded string:
const str = 'JVBERi0xLjQKJcOiw6PDj8OTCjUgMCBvYmoKPDwKL0xlbmd0aCAxCj4+CnN0cmVhbQogCmVuZHN0cmVhbQplbmRvYmoKNCAwIG9iago8PAovVHlwZSAvUGFnZQovTWVkaWFCb3ggWzAgMCA2MTIgNzkyXQovUmVzb3VyY2VzIDw8Cj4+Ci9Db250ZW50cyA1IDAgUgovUGFyZW50IDIgMCBSCj4+CmVuZG9iagoyIDAgb2JqCjw8Ci9UeXBlIC9QYWdlcwovS2lkcyBbNCAwIFJdCi9Db3VudCAxCj4+CmVuZG9iagoxIDAgb2JqCjw8Ci9UeXBlIC9DYXRhbG9nCi9QYWdlcyAyIDAgUgo+PgplbmRvYmoKMyAwIG9iago8PAovQ3JlYXRvciAoUERGIENyZWF0b3IgaHR0cDovL3d3dy5wZGYtdG9vbHMuY29tKQovQ3JlYXRpb25EYXRlIChEOjIwMTUwNzAxMTEyNDQ3KzAyJzAwJykKL01vZERhdGUgKEQ6MjAyMjA2MDcxODM2MDIrMDInMDAnKQovUHJvZHVjZXIgKDMtSGVpZ2h0c1wyMjIgUERGIE9wdGltaXphdGlvbiBTaGVsbCA2LjAuMC4wIFwoaHR0cDovL3d3dy5wZGYtdG9vbHMuY29tXCkpCj4+CmVuZG9iagp4cmVmCjAgNgowMDAwMDAwMDAwIDY1NTM1IGYKMDAwMDAwMDIyNiAwMDAwMCBuCjAwMDAwMDAxNjkgMDAwMDAgbgowMDAwMDAwMjc1IDAwMDAwIG4KMDAwMDAwMDA2NSAwMDAwMCBuCjAwMDAwMDAwMTUgMDAwMDAgbgp0cmFpbGVyCjw8Ci9TaXplIDYKL1Jvb3QgMSAwIFIKL0luZm8gMyAwIFIKL0lEIFs8MUMzNTAwQ0E5RjcyMzJCOTdFMEVGM0Y3ODlFOEI3RjI+IDwyNTRDOEQxNTNGNjU1RDQ5OTQ1RUFENjhEODAxRTAxMT5dCj4+CnN0YXJ0eHJlZgo1MDUKJSVFT0Y=';
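// atob() returns a binary string; convert it to bytes so the Blob does not re-encode it as UTF-8.
const bytes = Uint8Array.from(atob(str), (c) => c.charCodeAt(0));
const blob = new Blob([bytes], { type: 'application/pdf' });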
In Java, use this:
private static String samplepdf = "255044462D312E0D747261696C65723C3C2F526F6F743C3C2F50616765733C3C2F4B6964735B3C3C2F4D65646961426F785B302030203320335D3E3E5D3E3E3E3E3E3E";
and then
byte[] bytes = hexStringToByteArray(samplepdf);
...
public byte[] hexStringToByteArray(String s) {
    int len = s.length();
    byte[] data = new byte[len / 2];
    for (int i = 0; i < len; i += 2) {
        data[i / 2] = (byte) ((Character.digit(s.charAt(i), 16) << 4)
                + Character.digit(s.charAt(i + 1), 16));
    }
    return data;
}
Related
The idea is to be able to sign a PDF file multiple times with my own PDF parser.
Reference files: here.
When the signature isn't going to be visible, everything works. I sign 1.pdf once (2.pdf) and then twice (3.pdf), and Adobe Acrobat recognizes the signatures.
The problem arises when the signature should be visible. The first signing works correctly (2.pdf). However, the second (3.pdf) fails: Acrobat says the first signature is invalidated and the second is not recognized.
As far as I can tell, the only difference between visible and invisible is the addition of the text object. Why does Adobe invalidate the first signature, and why isn't the second recognized?
28 0 obj
<</BaseFont/Helvetica/Type/Font/Subtype/Type1/Encoding/WinAnsiEncoding/Name/Helv>>
endobj
29 0 obj
<</BaseFont/ZapfDingbats/Type/Font/Subtype/Type1/Name/ZaDb>>
endobj
31 0 obj<</Font 32 0 R>>
endobj
32 0 obj<</FAdESFont2 33 0 R>>
endobj
33 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
34 0 obj
<</Length 90>>stream
BT
6 760 TD
/FAdESFont2 6 Tf
(m#turboirc.com MICHAIL CHOURDAKIS 1/23/2023 17:24:10) Tj
ET
endstream
endobj
26 0 obj
<</Type/XObject/Resources<</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]>>/Subtype/Form/BBox[0 0 0 0]/Matrix [1 0 0 1 0 0]/Length 8/FormType 1/Filter/FlateDecode>>stream
xœ
endstream
endobj
3 0 obj
<</Contents[34 0 R 24 0 R 12 0 R]/CropBox[0.0 0.0 612.0 792.0]/MediaBox[0.0 0.0 612.0 792.0]/Parent 2 0 R/Resources 13 0 R/Rotate 0/Type/Page/Annots[17 0 R 27 0 R]>>
endobj
2 0 obj
<</Count 1/Kids[3 0 R]/Type/Pages>>
endobj
1 0 obj
<</AcroForm<</Fields[17 0 R 27 0 R]/DR<</Font<</Helv 28 0 R/ZaDb 29 0 R>>>>/DA(/Helv 0 Tf 0 g )/SigFlags 3>>/AcroForm<</Fields[17 0 R]/DR<</Font<</Helv 18 0 R/ZaDb 19 0 R>>>>/DA(/Helv 0 Tf 0 g )/SigFlags 3>>/Pages 2 0 R/Type/Catalog>>
endobj
14 0 obj
<</Producer(AdES Tools https://www.turboirc.com)/ModDate(D:20230123152410+00'00')>>
endobj
xref
Why does Adobe invalidate the first signature, and why isn't the second recognized?
Because you add the visualizations of the signatures in an inappropriate way.
You add visualizations of the signatures by adding to the static page content (the page content streams). This is the wrong approach if you want to be able to add signatures to already signed PDFs, because manipulation of the static page content after signing is a forbidden change, see this answer.
The appropriate way to add visualizations of PDF signatures is by adding an appearance stream to the respective signature field widget.
For details you may want to study the PDF specification ISO 32000.
Where do I find information about how a pdf is made up?
For example: a PDF I created named Dokname, containing the string TEST, looks like this when opened in a text editor:
(I replaced the parts the text editor couldn't decode with [...])
%PDF-1.4
%Óëéá
1 0 obj
<</Title (Dokname)
/Producer (Skia/PDF m102 Google Docs Renderer)>>
endobj
3 0 obj
<</ca 1
/BM /Normal>>
endobj
5 0 obj
<</Filter /FlateDecode
/Length 160>> stream
[...]
endstream
endobj
2 0 obj
<</Type /Page
/Resources <</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
/ExtGState <</G3 3 0 R>>
/Font <</F4 4 0 R>>>>
/MediaBox [0 0 596 842]
/Contents 5 0 R
/StructParents 0
/Parent 6 0 R>>
endobj
6 0 obj
<</Type /Pages
/Count 1
/Kids [2 0 R]>>
endobj
7 0 obj
<</Type /Catalog
/Pages 6 0 R>>
endobj
8 0 obj
<</Length1 14972
/Filter /FlateDecode
/Length 7164>> stream
[...]
endstream
endobj
9 0 obj
<</Type /FontDescriptor
/FontName /AAAAAA+ArialMT
/Flags 4
/Ascent 905.27344
/Descent -211.91406
/StemV 45.898438
/CapHeight 715.82031
/ItalicAngle 0
/FontBBox [-664.55078 -324.70703 2000 1005.85938]
/FontFile2 8 0 R>>
endobj
10 0 obj
<</Type /Font
/FontDescriptor 9 0 R
/BaseFont /AAAAAA+ArialMT
/Subtype /CIDFontType2
/CIDToGIDMap /Identity
/CIDSystemInfo <</Registry (Adobe)
/Ordering (Identity)
/Supplement 0>>
/W [0 [750] 40 54 666.99219 55 [610.83984]]
/DW 0>>
endobj
11 0 obj
<</Filter /FlateDecode
/Length 243>> stream
[...]
endstream
endobj
4 0 obj
<</Type /Font
/Subtype /Type0
/BaseFont /AAAAAA+ArialMT
/Encoding /Identity-H
/DescendantFonts [10 0 R]
/ToUnicode 11 0 R>>
endobj
xref
0 12
0000000000 65535 f
0000000015 00000 n
0000000365 00000 n
0000000098 00000 n
0000008721 00000 n
0000000135 00000 n
0000000573 00000 n
0000000628 00000 n
0000000675 00000 n
0000007925 00000 n
0000008159 00000 n
0000008407 00000 n
trailer
<</Size 12
/Root 7 0 R
/Info 1 0 R>>
startxref
8860
%%EOF
What do these obj-elements represent? Where is my TEST? Why did it get scrambled?
What I am searching for can probably all be found in Adobe's documentation, but that runs to hundreds of pages, which is very overwhelming. I get that this is a very complex topic and I am not trying to understand it completely. I'm just looking for an introduction or an overview. Unfortunately I didn't find anything like that on YouTube or elsewhere.
This is too complex for comments, and yes, you will only find snippets here and there, including this one and bits in my and others' answers.
For a quick overview of the code sample you provided:
A PDF is a collection of objects which need not be stored in sequential order. So you start at the end, just before the last %%EOF (potentially one of many!), with startxref 8860, where 8860 is the decimal byte offset of the cross-reference (xref) table, i.e. the file's index.
There are many abbreviations (too many to list) and, as in a stack language, most things may appear (literally) backwards. The xref table points to each object's position in the file.
The prime target in this case is 7 0 obj <</Type /Catalog /Pages 6 0 R>> endobj, which the trailer names via /Root 7 0 R. The catalog tells us where the page tree will be found: object 6 has /Type /Pages /Count 1 /Kids [2 0 R], so there is one page, further defined in 2 0 obj.
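The "start at the end" walk can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles a classic xref table (not the xref streams of newer files), it only looks at the last startxref, and find_root is a made-up name.

```python
import re


def find_root(pdf: bytes):
    """Return (xref_offset, root_object_number) for a file with a classic xref table."""
    marker = pdf.rfind(b"startxref")
    if marker < 0:
        raise ValueError("no startxref keyword found")
    # The number after the last startxref is the byte offset of the xref table.
    xref_offset = int(pdf[marker + len(b"startxref"):].split()[0])
    if not pdf.startswith(b"xref", xref_offset):
        raise ValueError("startxref does not point at an xref table")
    # The trailer dictionary follows the table and names the catalog via /Root.
    trailer = pdf.find(b"trailer", xref_offset)
    root = re.search(rb"/Root\s+(\d+)\s+\d+\s+R", pdf[trailer:])
    if trailer < 0 or root is None:
        raise ValueError("no trailer /Root entry found")
    return xref_offset, int(root.group(1))


# A cut-down stand-in for the file in the question (only the catalog object is kept).
head = b"%PDF-1.4\n7 0 obj\n<</Type /Catalog\n/Pages 6 0 R>>\nendobj\n"
sample = (head + b"xref\n0 1\n0000000000 65535 f \ntrailer\n<</Root 7 0 R>>\n"
          + b"startxref\n" + str(len(head)).encode() + b"\n%%EOF")
print(find_root(sample))
```

From the root object number you would then look up that object's offset in the xref table, read its /Pages entry, and repeat for /Kids.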
We now see references to images and font(s), placed within /MediaBox [0 0 596 842], which is roughly (a tad wider than) a standard A4 page, since 595/72" is closer to 210 mm.
There is too much to describe about that one item alone, so skipping to "Where is your text?": we see /Contents 5 0 R, so that compressed stream of data, which you need to decode, most likely holds your text. But the /Length 160 is the size of the binary Flate-encoded stream, which includes placement operators, not just your raw plain text.
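To peek inside such a stream you can inflate it with zlib. The sketch below is deliberately crude: it ignores /Length and /Filter and simply tries to inflate whatever sits between stream and endstream, and inflate_streams is a name I made up. Note that with an Identity-H font like the one in this file, the decoded content will most likely show glyph IDs rather than the literal letters TEST.

```python
import re
import zlib


def inflate_streams(pdf: bytes):
    """Return the decompressed bytes of every stream that zlib can inflate."""
    results = []
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf, re.DOTALL):
        # decompressobj() tolerates the end-of-line bytes left before endstream.
        inflater = zlib.decompressobj()
        try:
            results.append(inflater.decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data (or damaged), skip it
    return results


# Stand-in object built for the demo; a real file would be read with open(path, "rb").
content = b"BT /F4 12 Tf (TEST) Tj ET"
packed = zlib.compress(content)
demo = (b"5 0 obj\n<</Filter /FlateDecode\n/Length " + str(len(packed)).encode()
        + b">> stream\n" + packed + b"\nendstream\nendobj\n")
print(inflate_streams(demo))
```

This only works on the original binary file; a copy pasted through a text editor (with [...] in place of the data) has already lost the stream contents.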
The quantity of data for subsetting the font seems odd and excessive for just 4 letters (if it were the similar Helvetica, it would not need embedding, nor breaking the font up as CID ArialMT), and without the full file it's hard to say why the /Image* names are there, but it is the Google Docs renderer!
My suspicion is we may see characteristics of OCR in that stream.
I need to change the default user unit in a generated pdf file. Here is a minimal example which displays, but without the correct document size.
%PDF-1.7
1 0 obj
<< /Type /Catalog
/Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages
/Kids [ 3 0 R ]
/Count 1 >>
endobj
3 0 obj
<< /Type /Page
/Parent 2 0 R
/UserUnit 2.83
/MediaBox [0 0 2440 1220]
/Contents 4 0 R >>
endobj
4 0 obj
<< /Length 44 >>
stream
0.3 0.5 0.2 0.1 k
100 100 400 400 re
f
endstream
endobj
xref
0 5
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000117 00000 n
0000000221 00000 n
trailer
<< /Size 5
/Root 1 0 R >>
startxref
309
%%EOF
If you open this file in a PDF viewer, it's as if the UserUnit default has not been changed.
I need to get the user units as close to millimetres as possible. The graphics in this file are to be printed onto board then cut out with a CNC machine so there needs to be some level of accuracy with the printing.
How do you set the UserUnit value correctly?
Never assume Apple Preview does the correct thing with PDF files.
If you open this in Adobe Acrobat, the reported page size is 2436 x 1218mm, which I believe is correct for your UserUnit value.
The box looks the same size proportionally as what is shown in Preview, so I'm going to assume that one is drawn correctly as well.
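The arithmetic behind that page size is easy to check. Here is a small Python sketch (the function name is mine) that converts a MediaBox to millimetres, given that one default user unit is 1/72 inch and one inch is 25.4 mm:

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def page_size_mm(media_box, user_unit=1.0):
    """Convert a MediaBox [x0 y0 x1 y1] to (width, height) in millimetres."""
    x0, y0, x1, y1 = media_box
    scale = user_unit / POINTS_PER_INCH * MM_PER_INCH  # mm per MediaBox unit
    return (x1 - x0) * scale, (y1 - y0) * scale


width, height = page_size_mm([0, 0, 2440, 1220], user_unit=2.83)
print(round(width), round(height))  # 2436 1218

# The UserUnit that would make one unit exactly one millimetre:
print(round(POINTS_PER_INCH / MM_PER_INCH, 4))  # 2.8346
```

So 2.83 is slightly short of a true millimetre; a value nearer 2.8346 would bring the page closer to 2440 x 1220 mm, assuming the viewer or printer honours UserUnit at all.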
The document contains only text, no images. The relevant portions of the PDF are as follows:
trailer
<</Root 1 0 R>>
1 0 obj
<</Type/Catalog/Pages 3 0 R>>
endobj
3 0 obj
<</Type/Pages/Kids[4 0 R]/Count 1/Rotate 0/ITXT(5.0.6)>>
endobj
4 0 obj
<</Type/Page
/MediaBox[0 0 612 1008]
/Rotate 0
/Parent 3 0 R
/Resources<<
/ProcSet[/PDF/Text]
/ExtGState 12 0 R
/Font 13 0 R>>
/Contents 5 0 R
/Annots[24 0 R]>>
endobj
12 0 obj
<</R7 7 0 R>>
endobj
7 0 obj
<</Type/ExtGState /OPM 1>>
endobj
13 0 obj
<</R8 8 0 R
/R10 10 0 R>>
endobj
8 0 obj
<</BaseFont /LRSXWR+TimesNewRoman
/FontDescriptor 9 0 R
/Type/Font
/FirstChar 1
/LastChar 41
/Widths[
333 722 250 611 722 611 722 667 722 722 667 556 556 389
722 667 722 722 500 333 444 389 500 278 278 500 333 500
444 500 278 250 889 250 500 500 444 500 278 778 500]
/Encoding 16 0 R
/Subtype/TrueType>>
endobj
16 0 obj
<</Type/Encoding
/BaseEncoding/WinAnsiEncoding
/Differences[
1/I/N/space/T/H/E/G/C/O/U/R/F/P/J/A/B
/D/Y/asterisk/r/e/s/n/t/colon/o/f/h/a/p/l/period
/M/comma/d/v/c/two/i/m/u]
>>
endobj
The above information is provided for reference; the content object which I want to decode is:
5 0 obj
<</Length 5950>>
stream
q 0.12 0 0 0.12 0 0 cm
/R7 gs
0 0 0 RG
0 0 0 rg
q
8.33333 0 0 8.33333 0 0 cm BT
/R8 14.0388 Tf
0.997231 0 0 1 90.1533 922.927 Tm
[
(SOH)-0.762768(STX)10.3078(ETX)10.019(EOT)10.888
(ENQ)-6.34593(ACK)10.888(ETX)-7.12126(ENQ)2.22552
(SOH)7.32006(BEL)-6.34489(ENQ)10.797(ETX)-7.1223
(BS)7.04592( )-6.34489(\n)10.797(VT)49.899
(EOT)28.0288(ETX)-7.12126( )2.22552(FF)-0.944827
(ETX)10.0196(\r)-0.945874(\n)-5.8573(STX)10.3083
(SQ)-13.6649(SI)10.798(DLE)-10.097(ETX)52.8727
(SI)11.2835(STX)-6.83247(DC1)2.22657(ETX)10.0175
(ENQ)-6.34489(SI)10.798(VT)49.8969(DC2)105.076
(SI)11.2856(STX)-6.83457(SI)53.6511(ETX)61.442
(SI)105.076(EOT)28.0288(ETX)-7.12335(BS)-1.52554
(ENQ)2.22657(SI)11.2835(STX)-6.83247(DC1)10.798
(SOH)-9.82286(BEL)2.22657(SI)
]TJ
412.949 0 Td
[(VT)-1.52763(ENQ)722.166]TJ
.......
.......
Decoding a PDF stream into text is not very simple, because you don't have anything like text there.
You have a series of glyphs with highly variable meaning. In your case, you use the font /R8 (object 8 0, listed in the font resources 13 0), which consists of 41 characters of /LRSXWR+TimesNewRoman with the changes defined in obj 16 0, which explains the meanings of the glyphs. You need a translation table, e.g. from "space" to " " (I'm quite surprised that there is a glyph for space in your case). This may not be so simple in other cases: I've often seen an embedded font with glyphs sorted by usage, where there was nothing other than visual evidence of what each glyph represents.
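For this particular file the translation table can be built from the /Differences array in obj 16 0. Here is a rough Python sketch; the glyph-name lookup only covers the names that occur in this array (a real decoder would need a full glyph list), and the function names are mine. The 13 codes decoded at the end are the first ones shown in the TJ array (SOH = 1, STX = 2, and so on).

```python
# /Differences from obj 16 0: a start code followed by consecutive glyph names.
DIFFERENCES = [
    1, "I", "N", "space", "T", "H", "E", "G", "C", "O", "U", "R", "F", "P", "J", "A", "B",
    "D", "Y", "asterisk", "r", "e", "s", "n", "t", "colon", "o", "f", "h", "a", "p", "l",
    "period", "M", "comma", "d", "v", "c", "two", "i", "m", "u",
]

# Only the multi-letter glyph names used above; single letters map to themselves.
GLYPH_TO_CHAR = {"space": " ", "asterisk": "*", "colon": ":",
                 "period": ".", "comma": ",", "two": "2"}


def build_code_map(differences):
    """Map each character code to the character its glyph name stands for."""
    code_map = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # a number resets the running character code
        else:
            code_map[code] = GLYPH_TO_CHAR.get(item, item)
            code += 1
    return code_map


def decode(codes, code_map):
    return "".join(code_map.get(code, "?") for code in codes)


CODE_MAP = build_code_map(DIFFERENCES)
first_codes = bytes([1, 2, 3, 4, 5, 6, 3, 5, 1, 7, 5, 3, 8])
print(decode(first_codes, CODE_MAP))  # IN THE HIGH C
```

The numbers between the strings in the TJ array are kerning adjustments and can be ignored for plain text extraction.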
Are you sure you want to read the text from pdf files?
I took the minimal PDF example from the PDF specification, copied it into Notepad, and renamed the file to have the extension .pdf.
I can open it with other PDF viewers (PDF-XChange, SumatraPDF, MuPDF). But when I open it with Adobe Reader, it says the file is broken.
I am not sure whether the other viewers treat this "broken" file as a blank file or not.
The file is supposed to display one blank page, since it is a minimal example.
In fact, I modified the minimal example. When I copy it from the PDF specification to Notepad and open the .txt file in a hex editor, I see that each new line in the .txt file gives me 2 spaces. For example,
1 0 obj
<< /Type /Catalog
gives me (in Hex Editor)
1 0 obj << /Type /Catalog
which is (in hex values)
31 20 30 20 6F 62 6A 0D 0A 3C 3C 20 2F 54 79 70
65 20 2F 43 61 74 61 6C 6F 67
The 2 spaces between j and < are 0D 0A.
Hence I don't make new lines in Notepad, and I modified the values in the xref part.
Below is the full code.
Do you know what's wrong with this example? Why does Adobe Reader say it is broken? Is this because I gave the wrong values in xref?
%PDF-1.4 1 0 obj << /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >> endobj 2 0 obj << /Type Outlines /Count 0 >> endobj 3 0 obj << /Type /Pages /Kids [4 0 R] /Count 1 >> endobj 4 0 obj << /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet 6 0 R >> >> endobj 5 0 obj << /Length 35 >> stream … Page-marking operators … endstream endobj 6 0 obj [/PDF] endobj xref 0 7 0000000000 65535 f 0000000009 00000 n 0000000074 00000 n 0000000119 00000 n 0000000176 00000 n 0000000295 00000 n 0000000373 00000 n trailer << /Size 7 /Root 1 0 R >> startxref 395 %%EOF
First: when you 'copied' the example from the PDF specification, very likely a few things happened which stopped your copy from working as expected:
...you didn't 'copy' by re-typing the example in a text editor, but
...you used copy'n'paste, using a PDF as the source file.
Depending on your text editor, that method probably changed the newline convention from [cr]+[lf] to [cr] or vice versa. This in turn means that the byte offsets in the object 'table of contents' (the 'xref' table) are no longer valid.
Another problem with the PDF source code you posted is that it no longer contains any line breaks at all. Some viewers may still be able to parse the thing silently, but not all are. And it certainly is against the spec, because chapter 7.5.2 clearly spells out that
"The first line of a PDF file shall be a header consisting of the 5 characters %PDF– followed by a version number of the form 1.N, where N is a digit between 0 and 7."
Your header violates that rule, because it is not on a line of its own.
Also, the 'stream' in 5 0 obj isn't valid PDF code; it is just placeholder text (… Page-marking operators …). Some viewers may choke when they come across such 'garbage'.
Lastly, your startxref value wasn't correct.
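If you want to check offsets yourself rather than count bytes by hand, a small script can compare the xref table against where the objects really are. The following Python is only a sketch with made-up function names: it assumes a single classic xref section and finds objects with a simple pattern that could be fooled by binary stream data.

```python
import re


def actual_object_offsets(pdf: bytes) -> dict:
    """Map object number -> byte offset of its 'N G obj' header."""
    pattern = rb"(?<!\S)(\d+)\s+\d+\s+obj\b"
    return {int(m.group(1)): m.start() for m in re.finditer(pattern, pdf)}


def listed_offsets(pdf: bytes) -> dict:
    """Read the in-use entries of the (single) xref table."""
    xref_pos = pdf.rfind(b"xref", 0, pdf.rfind(b"startxref"))
    tokens = pdf[xref_pos:].split()  # works with any whitespace or newline style
    first, count = int(tokens[1]), int(tokens[2])
    entries = {}
    for i in range(count):
        offset, _generation, kind = tokens[3 + 3 * i: 6 + 3 * i]
        if kind == b"n":
            entries[first + i] = int(offset)
    return entries


def xref_mismatches(pdf: bytes) -> dict:
    """Return {object number: (listed offset, actual offset)} for every wrong entry."""
    actual = actual_object_offsets(pdf)
    return {num: (listed, actual.get(num))
            for num, listed in listed_offsets(pdf).items()
            if actual.get(num) != listed}


# Usage: read the file in binary mode so newlines are left untouched.
# with open("minimal.pdf", "rb") as handle:
#     print(xref_mismatches(handle.read()))
```

An empty result means every listed offset points at its object; after a newline conversion you would typically see every entry off by the number of line breaks before it.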
So here is a file that works. I repaired it in a text editor, and I put your original code as a comment after the %%EOF for comparison and reference:
%PDF-1.4
1 0 obj
<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>
endobj
2 0 obj
<< /Type Outlines /Count 0 >>
endobj
3 0 obj
<< /Type /Pages /Kids [4 0 R] /Count 1 >>
endobj
4 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet 6 0 R >> >>
endobj
5 0 obj
<< /Length 35 >>
stream
… Page-marking operators …
endstream
endobj
6 0 obj
[/PDF]
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000119 00000 n
0000000176 00000 n
0000000295 00000 n
0000000376 00000 n
trailer
<< /Size 7 /Root 1 0 R >>
startxref
394
%%EOF
%% %PDF-1.4 1 0 obj << /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >> endobj 2 0 obj << /Type Outlines /Count 0 >> endobj 3 0 obj << /Type /Pages /Kids [4 0 R] /Count 1 >> endobj 4 0 obj << /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R /Resources << /ProcSet 6 0 R >> >> endobj 5 0 obj << /Length 35 >> stream … Page-marking operators … endstream endobj 6 0 obj [/PDF] endobj xref 0 7 0000000000 65535 f 0000000009 00000 n 0000000074 00000 n 0000000119 00000 n 0000000176 00000 n 0000000295 00000 n 0000000373 00000 n trailer << /Size 7 /Root 1 0 R >> startxref 395