How to open PDF raw? - pdf

I've been wanting to see the insides of a PDF for a while, like, the raw source code of it so I can look at it. Any way of doing that?

Looking at the raw code of PDFs will not serve you much unless you also have an idea about its internal structure. You should get yourself a copy of the official PDF reference (download PDF), and you should have read some introductionary article such as this [gone] or this to begin with.
Even after such a preparation, you'll not discover much useful when staring at the raw code. Because PDFs usually will contain parts which are "filtered" (that means: compressed).
How to look at the real PDF source behind the 'raw' binary parts
Jay Birkenbilt's qpdf is a very useful commandline tool (available for Linux, Mac OSX, Windows, and as source code, under the open source Artistic License), which can unpack most filtered content and re-organize the internal structure in a way that gives you much more insight into it (all objects are numerically ordered, etc.). The commandline to achieve this is:
qpdf --qdf original.pdf unpacked.pdf
Another useful and free tool (GPL licensed, but Linux-only AFAIK) to look into PDFs is of course PDFEdit. This one even comes with a GUI (if you prefer that), while still allowing you access to the internal structure and "raw" PDF code.

If the purpose is just to look into the file, then any simple text editor will do, ex, Notepad. PDF is just a text based format, including embedded content byte streams. Raw PDF looks like this:
>>
/Border [0 0 0]
/Rect [121.02 332.48 363.24 343.64]
/StructParent 1321
/Subtype /Link
/Type /Annot
>>
endobj
64579 0 obj
<<
/Filter /FlateDecode
/Length 5771
>>
stream
Ũn0x/�+�}�ǹ����\֛ bYO�5[��X��W��L��(�������V�A3�C���������u큋_�a��ךm2N�6� ��A��8
�d���NQ⺢GI��G�[��)�̉Y��R�y{R����&�&�;��g�k1���ҋeTC�(W��`���*��(;�AEc<= mnZ+��|T��v
�.��зe�aޞ��V4�b���L����k�Oj.ֿ�y�����kc|I�� ��C�0��Hf�7d�/�z���m��o��A��B��IJ�%�.
!�%f�б���&�ޒ�4Ύ7�l�3���3`�
endstream
endobj
64580 0 obj
<<
/Border [0 0 0]
/Dest <E4AE7DD2769553EF1668>
/Rect [219 648.5 256.8 659.66]
/StructParent 1323
/Subtype /Link
/Type /Annot
>>
What you see are basic COS objects like name, dictionary, stream and so on. All objects are described in PDF 32000 standard, see section 7.3 Objects.

Use a Hex editor. Of course, unless you know the PDF specification (PDF, 8.6 MB), you won't recognize much.

In addition to the qpdf tool conversion into postscript might be helpful.
PDF is a subset of PS. Usually its quite easy to figure out, e.g. where the labels of a graph are. You can either use pdf2ps or invoke ghostscript
gs -sDEVICE=pswrite some.pdf -sOutputFile=some.ps -dNOPAUSE -c quit
When you generate your PDFs using pdflatex you can disable compression with an option. This makes the PDF more readable.

Some more recent observations on the other answers.
Adobe keep moving about their Open Sourced copy of the 2008 standard so currently that is here https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
The Web Archive have currently a copy here https://ia601003.us.archive.org/5/items/pdf320002008/PDF32000_2008.pdf
They should be identical 22,491,828 bytes so beware neither includes any errata.
A pdf CAN be plain mime "text/pdf" as perfectly ? annotated generated from a console keyboard or command line (too slow) or a batch file. I wont bore you with the whole file but it starts like this:-
REM Start with File "Magic" Signatures for a PDF
echo %%PDF-1.0>!Fname!
echo %%âãÏÓ>>!Fname!
echo %%01) Prepare file references>>!Fname!
for %%Z in (!Fname!) do set "FZ1=%%~zZ"
echo 1 0 obj>>!Fname!
echo ^<^</Names^<^</Dests 2 0 R^>^>/Outlines 3 0 R>>/PageLayout/OneColumn/PageMode/UseOutlines>>!Fname!
REM ToDo add files
REM /Lang (ga-IE)/MarkInfo^<^</Marked true^>^>/Names ^<^<^/EmbeddedFiles [(file.ext) 3 0 R]^>^>>>!Fname!
echo /Pages 4 0 R/Type/Catalog/ViewerPreferences^<^</DisplayDocTitle true^>^>^>^>>>!Fname!
echo endobj>>!Fname!
echo %%02) Prepare Named Destinations>>!Fname!
Thus the annotated RAW PDF (note I had edited the order in the cmd file in preparation for an XMP data section, so not identical) could looks like :-
%PDF-1.3
%âãÏÓ
%01) Prepare file references
1 0 obj
<</Lang(ga-IE)/Names<</Dests 3 0 R>>/Outlines 4 0 R/PageLayout/OneColumn/PageMode/UseOutlines
/PageLabels<</Nums[0<</S/A>>]>>/Pages 5 0 R/Type/Catalog/ViewerPreferences<</DisplayDocTitle true>>>>
endobj
%02) Reserved for big meta data
2 0 obj
<< >>
endobj
%03) Prepare Named Destinations
3 0 obj
<</Names [(Page1) [6 0 R /XYZ 0 792 null] (QRCode) [6 0 R /XYZ 25.0 317.0 1]]>>
endobj
%04) Prepare Outline / Bookmarks
...
...
Many suggestions by others for decompress binary application/PDF into text/PDF and some may be a hybrid thus still have binarized application text.
The 3 most common designed for the task are qpdf (already mentioned, but uses a hybrid QDF) PDFtk (uncompress) and Mutool (different CLI options), that's the one I play with most, as its easy in GL GUI to change the output settings. The output can be modified in MS Notepad, whilst previewing result.
So any text editing script can write or edit a PDF even with graphics, And several applications can convert RAW "binary" PDF into RAW "textual" PDF. However never attempt to edit PDF whilst temporarily in its textual base64 RePrEx (possible, but totally impractical)

Related

How to modify the display settings of a pdf file in double-page view?

I have converted a pdf-file using ghostscript to pdf/A-2 using
gs -dPDFA=2 -dPrinted=false -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=file_PDFA-2.pdf file.pdf
I have used ghostscript version 9.26 on Ubuntu 18.04.5 LTS.
I use Acrobat Reader to display the pdf-document in double page view. Before using ghostscript, the first page is correctly show to the right (see figure). The ghostscript command apparently modifies this, which makes the document harder to read.
Is there a way in ghostscript to modify this setting? Unfortunately, I couldn't find an option. Using other freely available tools to modify the settings after using ghostscript would also work.
Edit: link to the pdf-files before and after the conversion:
link to the pdf-files
Thank you.
So looking at the before PDF file I see, in the document Catalog (after decompressing the file):
28 0 obj
<<
/Type /Catalog
/Pages 12 0 R
/Names 27 0 R
/PageMode /UseOutlines
/PageLayout /TwoPageRight
/OpenAction 4 0 R
>>
endobj
And in the after file I see the Catalog only contains:
1 0 obj
<<
/Type /Catalog
/Pages 3 0 R
/Metadata 24 0 R
>>
endobj
So it seems likely that the '/PageLayout' key having the value '/TwoPagesRight' is what causes Acrobat to display the way it does. Noe the /PageMode /UseOutlines which appears to be pointless since the PDF file doesn't actually seem to contain any Outlines.
We can use a /PUT pdfmark to add entries to dictionaries, and that is supported by both the pdfwrite device and the interpreter.
[ {Catalog} << /PageLayout /TwoPageRight >> /PUT pdfmark
Should work, but unfortunately does not for me.
Disclaimer,
I support SumatraPDF and am taking the question at face value
"How to modify the display settings of a pdf file in double-page
view?"
plus the stated desire
"Using other freely available tools to modify the settings after using
ghostscript would also work."
So without rotation the pages can be displayed 4 different layouts in a viewer. here are the main 3
Facing Manga (RTL) Cover (Book)

PDF with an external image using XObject

I'm trying to build a PDF file with a link to an external file.
I'm using the spec https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
On page 348 there is an example of image with an alternate image loaded remotely. When I create a document with the example from the doc, the reader (using acrobat reader XI) doesn't fetch the image. There is no error message but no request is being made (checked using wireshark).
Can I have only a remote image (ie no "normal" image and alternate image).
Is there an example somewhere of a full document using that /FS /URL syntax (ie not just the objects)? I couldn't find any that actually does the request.
Thanks
Edit:
I used LibreOffice to create the base document with a single 1x1 pixel.
http://pastebin.com/5GqCYgMp
I initially created my test document with Acrobat but the output was really messy.
Then replaced the stream with the example from the pdf spec, and tried to fix the startxref offset, but not sure it's correct.
http://pastebin.com/BT42g02P
This document is currently not opening correctly, but I tried to get a minimum test case. My previous attempts were displayed with no errors only by luck (but the remote image didn't work anyway).
Is there any tool that actually allows the creation of XObject with /URL? I don't know the file format enough to create them reliably by hand.
First of all,
I'm using the spec https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
I would recommend not using a PDF reference but instead the ISO standard. The Adobe PDF references are not normative in nature while the ISO standard is. (The actual content differences are minute but if there is a normative spec, one should use it.) Adobe also publishes a copy of the ISO standard with merely the header exchanged.
Then, please don't treat PDFs as text documents. E.g. by sharing them on pastebin, you make them subject to treatment as text which essentially destroys the content.
That all been said, let's look at your actual issue:
In your sample PDF you have:
4 0 obj
<</Type/XObject/Subtype/Image/Width 1/Height 1/BitsPerComponent 8/Length 0/F << /FS /URL
/F ( https://upload.wikimedia.org/wikipedia/commons/c/ca/1x1.png )
>>/Filter/FlateDecode/ColorSpace/DeviceRGB
>>
stream
endstream
endobj
This indicates that the PDF viewer shall find at the URL https://upload.wikimedia.org/wikipedia/commons/c/ca/1x1.png a file containing an array of 1 (/Width 1/Height 1) RGB (/ColorSpace/DeviceRGB) sample with 1 byte per color (/BitsPerComponent 8), cf. section 8.9.5 Image Dictionaries of ISO 32000-1.
I doubt your file fulfills that, I assume it actually is a PNG file in particular with a PNG structure, not the structure explained above.
PDF does not support the PNG format as is, you have to transform the data. It does support, though, the JPEG format using the /FFilter /DCTDecode which is why the sample from the specification
16 0 obj
<< /Type /XObject
/Subtype /Image
/Width 1000
/Height 2000
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Length 0 % This is an external stream
/F << /FS /URL
/F (http://www.myserver.mycorp.com/images/exttest.jpg)
>>
/FFilter /DCTDecode
>>
stream
endstream
endobj
makes it look so easy.

can I create a PDF from ImageMagick which will always open at zoom 100%? [duplicate]

I am running into an issue with PNG to PDF conversion.
Actually I have big PNG files not in size but in contents.
In PDF conversion it creates a big PDF files. I don't have any issue with its quality, but whenever I try to open this PDF in PDF viewer, it opens in "Fit to Page" mode.
So, I can't see the created PDF in the initial view, but I need to zoom it up to 100%.
My question is: can I create a PDF which will always open at zoom 100% ?
You can possibly achieve what you want with the help of Ghostscript.
Ghostscript supports to insert PostScript snippets into its command line parameters via -c "...[PostScript code here]...".
PostScript has a special operator called pdfmark. This operator is not understood by most PostScript interpreters, but is understood by Acrobat Distiller and (for most of its parameters) also by Ghostscript when generating PDFs.
So you could try to insert
-c "[ /PageMode /UseNone /Page 1 /View [/XYZ null null 1] \
/PageLayout /SinglePage /DOCVIEW pdfmark"
into a PDF->PDF conversion Ghostscript command line.
Please take note about various basic things concerning this snippet:
The contents of the command line snippet appears to be 'unbalanced' regarding the [ and ] operators/keywords. But it is not! The initial [ is balanced by the final pdfmark keyword. (Don't ask -- I did not define this syntax...)
The 'inner' [ ... ] brackets delimit an array representing the page /View settings you desire.
Not all PDF viewers do respect the view settings embedded in the PDF file (Acrobat software does!).
Most PDF viewers allow users to override the view settings embedded in PDF files (Acrobat software also does this). That is, you can tell your viewer to never respect any settings from the PDF files it opens, but f.e. to always open it with "fit to width".
Some specific things about this snippet:
The page mode /UseNone means: the document displays without bookmarks or thumbnails. It could be replaced by
/UseOutlines (to display bookmarks also, not just the pages)
/UseThumbs (to display thumbnail images of the pages, not just the pages
/FullScreen (to open document in full screen mode)
The array for the view mode constructed as [/XYZ <left> <top> <zoom>] means: The zoom factor is 1 (=100%), the left distance from the page origin is the special 'null' value, which means to keep the previously user-set value; the top distance from the page origin is also 'null'. This array could be replaced by
/Fit (to adapt the page to the current window size)
/FitB (to adapt the visible page content to the current window size)
/FitH <top>' (to adapt the page width to the current window width);` indicates the required distance from page origin to upper edge of window.
...plus several others I cannot remember right now.
So to change the settings of an existing PDF file, you could do the following:
gs \
-o out.pdf \
-sDEVICE=pdfwrite \
-c "[ /PageMode /UseNone /Page 1 /View [ /XYZ null null 1 ] " \
-c " /PageLayout /SinglePage /DOCVIEW pdfmark" \
-f in.pdf
To check if the Ghostscript command worked, open the PDF in a text editor which is capable of handling binary files. Search for the /View or the /PageMode keywords and check if they are there, inserted as values into the PDF root object.
If it worked, check if your PDF viewer honors the settings. If it doesn't honor them, see if there is an overriding setting within the viewers preference settings.
I did a quick test run on a sample PDF of mine. Here is how the PDF root object's dictionary looks now, checked with the help of pdf-parser.py:
pdf-parser-beta.py -s Catalog a.pdf
obj 1 0
Type: /Catalog
Referencing: 3 0 R, 9 0 R
<<
/Type /Catalog
/Pages 3 0 R
/PageMode /UseNone
/Page 1
/View [/XYZ null null 1]
/PageLayout /SinglePage
/Metadata 9 0 R
>>
To learn more about the pdfmark operator, google for 'pdfmark reference filetype:pdf'. You should be able to find it on the Adobe website and elsewhere:
https://www.google.de/search?q=pdfmark%20reference%20filetype%3Apdf&oq=pdfmark%20reference%20filetype%3Apdf
In order to let ImageMagick create a PDF as you want it, you may be able to hack the file defining your delegate settings. For more help about this topic see for example here:
http://www.imagemagick.org/Usage/files/#delegates
PDF specification supports this functionality in this way: create a GoTo action that goes to first page and sets the zoom level to 100% and then set the action as the document open action.
How exactly you implement it in real life depends very much on the tool you use to create the PDF file. I do not know if ImageMagick can create such actions.

Steganography between PDF objects or after the %%EOF

1) I saw that some persons are worked to hide data between PDF objects.
They told that this method works but the big disadvantage is that acrobat reader asks to re-save the file when closing window.
I don't understand what they mean about concealing information between PDF objects.
Please i need your help :)
2) I also saw that some person's are worked to conceal information after the %%EOF and are told that is not a solution because signing is not applied to the metadata, which needed a feature.
Also i don't understand what they mean about metadata in this topic ?
i refered to this link How to hide text in an PDF file?
Best regards,
Liszt.
1) I saw that some persons are worked to hide data between PDF objects. They told that this method works but the big disadvantage is that acrobat reader asks to re-save the file when closing window.
I don't understand what they mean about concealing information between PDF objects.
Normally your PDF is a sequence of PDF objects preceded by identifying numbers and a cross reference mapping those numbers to their position in the PDF:
...
2 0 obj
/WinAnsiEncoding
endobj
3 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Courier
/Name /F001
/Encoding 2 0 R
>>
endobj
4 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Courier-Bold
/Name /F002
/Encoding 2 0 R
>>
....
xref
0 17
0000000000 65535 f
0000014476 00000 n
0000000017 00000 n
0000000052 00000 n
0000000205 00000 n
...
When PDF parsers parse an object (e.g. object 2), they usually only look up the associated value in the cross references (here in case of object 2 it's 17) and start reading the file at byte 17, first expecting the object and generation numbers (2 0) and then the tag obj; they parse everything after that tag up to the matching endobj tag and then stop. (Actually in some cases it's a bit more twisted, but this is the general idea.)
Thus, some people think it a good idea to add their secret data between the endobj of one PDF object and the object number of the next, like this:
2 0 obj
/WinAnsiEncoding
endobj
HERE ARE MY VERY SECRET VERY HIDDEN DATA, PROBABLY ENCRYPTED ETC
3 0 obj
Now some PDF readers do recognize that there are some trash bytes and offer to save the file without them.
2) I also saw that some person's are worked to conceal information after the %%EOF and are told that is not a solution because signing is not applied to the metadata, which needed a feature.
Most PDF readers ignore some trash data after the marker because in times long gone some PDF generation or transport processes left some additional trash there.
...
%%EOF
AGAIN SOME SECRET DATA
When themselves manipulating the PDFs, though, e.g. when signing them, the PDF readers may just go ahead and throw out everything that is not there according to PDF specification. Or in case of signing, they may leave the trailing bytes where they are and then integrate the signature after them. Some program expecting those extra data at the end-of-file may not find them anymore afterwards as they now are somewhere inside.
Also i don't understand what they mean about metadata in this topic ?
Some people actually use such mechanisms to add information required in later processing steps. E.g. the process creating some PDF invoice may add the address to send the PDF to and the amount to pay at the end of the file, then the PDF is processed some more, e.g. reviewed or archived, and in some final process it is send out to the addressee.
The review step might be handled differently depending on the amount added at the end; maybe sales worth more than $1000 must be cleared by special personal.
The sending process may also use the extra data after the end of file for sending the file to the recipient.
Such data about some document sometimes are called metadata.

Export PDF page labels on command line

I'd like to export the page-labels stored in some PDF documents for easy parsing. I know I could dig into the PDF document after having it converted with qpdf, but this seems like overkill.
Is there no commandline tool that will simply print the page label for each page (or together with other meta-data)? I know that PDFSpy will export the label, but $300 isn't an option, preferably the solution should be free.
Short answer:
I am not aware of any (free) tool that can 'simply print' the page label for each page.
Also, you'll not be able to evade the expansion compressed objects and object streams, using a tool like qpdf or one with equivalent capabilities.
Long answer:
There's no such tool because these are the only a few things you can safely rely on when it comes to page labels. These are the following:
Each PDF document must contain a root object.
That root object must be of /Type /Catalog.
The document's trailer will show where to find the object using the key /Root followed by the indirect object number reference.
IF a PDF document uses non-standard page labels, then the document root object must have an entry named /PageLabels.
Here is where it stops to be relatively easy. Because the object the /PageLabels key refers to may be contained in a compressed object stream. This means that you'd have to expand that object stream.
If you really succeeded to get the description of the page labels as ASCII, you'll discover that it's not an easily parseable flat list (like a dictionary is): it is a number tree.
I'll not go into the details of these complexities, because it would take a very long article to describe all possible variations. You better read it up directly in the official ISO PDF-1.7 specification.
But instead I'll give you an example in ASCII PDF code:
213 0 obj
<< /Type /Catalog
/PageLabels
<<
/Nums
[
0 << % start labeling from page no. 1
/S /r % label with lowercase roman numbers
>>
7 << % start new labeling from page no. 8
/S /D % label with standard decimal numbers
>>
11 << % start labeling page no. 12
/S /D % label with decimal numbers...
/P (ABCD-) % ...but using label prefix 'ABCD-'...
/St 3 % ...followed by '3' as the start decimal.
>>
]
>>
%%...........................
%%...more root object keys...
%%...........................
>>
endobj
The above example will label the pages number 1, 2, 3, ... (last) like this:
i
ii
iii
iv
v
vi
1
2
3
4
ABCD-3
ABCD-4
ABCD-5
ABCD-6
...and so on until last page...
As you can see, the PDF method of labeling pages (mapping page numbers to page names) is completely non-intuitive. You can only understand it by studying the PDF specification.
I've written a small command-line utility based on Poppler that does just this task: https://github.com/HeimMatthias/pdfpagelabels
Disclaimer: I'm the OP and created the original post under a different account. I have been using the solution via pdftk (listed in a comment above) successfully for years in my implementation. However, last year it was time to reimplement our system from scratch and we've had numerous instances where the pdf-tk output could not be parsed by our implementation.
The new command-line tool follows the philosophy of doing just one thing, but doing it well, and simply prints the page labels of all or selected pages of a pdf-file. If anyone finds this useful, and stumbles upon it here, all the better for it.