1) I saw that some people have worked on hiding data between PDF objects.
They said that this method works, but the big disadvantage is that Acrobat Reader asks to re-save the file when closing the window.
I don't understand what they mean by concealing information between PDF objects.
Please, I need your help :)
2) I also saw that some people have tried to conceal information after the %%EOF marker and were told that this is not a solution because signing is not applied to the metadata, which was a required feature.
I also don't understand what they mean by metadata in this context.
I referred to this question: How to hide text in an PDF file?
Best regards,
Liszt.
1) I saw that some people have worked on hiding data between PDF objects. They said that this method works, but the big disadvantage is that Acrobat Reader asks to re-save the file when closing the window.
I don't understand what they mean by concealing information between PDF objects.
Normally your PDF is a sequence of PDF objects preceded by identifying numbers and a cross reference mapping those numbers to their position in the PDF:
...
2 0 obj
/WinAnsiEncoding
endobj
3 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Courier
/Name /F001
/Encoding 2 0 R
>>
endobj
4 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Courier-Bold
/Name /F002
/Encoding 2 0 R
>>
....
xref
0 17
0000000000 65535 f
0000014476 00000 n
0000000017 00000 n
0000000052 00000 n
0000000205 00000 n
...
When PDF parsers parse an object (e.g. object 2), they usually only look up the associated value in the cross references (here in case of object 2 it's 17) and start reading the file at byte 17, first expecting the object and generation numbers (2 0) and then the tag obj; they parse everything after that tag up to the matching endobj tag and then stop. (Actually in some cases it's a bit more twisted, but this is the general idea.)
Thus, some people think it a good idea to add their secret data between the endobj of one PDF object and the object number of the next, like this:
2 0 obj
/WinAnsiEncoding
endobj
HERE ARE MY VERY SECRET VERY HIDDEN DATA, PROBABLY ENCRYPTED ETC
3 0 obj
Now some PDF readers do recognize that there are some trash bytes and offer to save the file without them.
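To make the idea concrete, here is a minimal Python sketch that looks for such stray bytes. It is only an illustration under simplifying assumptions: the function name find_gaps is made up, the scan is purely textual (it does not follow the cross references), and it would be fooled by stream contents that happen to contain the word endobj. Ordinary PDF comments between objects are reported as well.

```python
import re

# The next structural token we expect after an "endobj":
# another object header ("3 0 obj") or the start of the xref/trailer part.
NEXT_TOKEN = re.compile(rb"\d+\s+\d+\s+obj\b|xref|trailer|startxref")

def find_gaps(pdf_bytes):
    """Return (offset, data) pairs for non-whitespace bytes that sit
    between an endobj and the next structural token."""
    gaps = []
    for m in re.finditer(rb"endobj", pdf_bytes):
        nxt = NEXT_TOKEN.search(pdf_bytes, m.end())
        end = nxt.start() if nxt else len(pdf_bytes)
        between = pdf_bytes[m.end():end].strip()
        if between:
            gaps.append((m.end(), between))
    return gaps

sample = (b"2 0 obj\n/WinAnsiEncoding\nendobj\n"
          b"HERE ARE MY VERY SECRET DATA\n"
          b"3 0 obj\n<< /Type /Font >>\nendobj\n"
          b"xref\n")

for offset, data in find_gaps(sample):
    print(offset, data)
```

A viewer that notices such bytes has the same information and can decide to drop them when it rewrites the file.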
2) I also saw that some people have tried to conceal information after the %%EOF marker and were told that this is not a solution because signing is not applied to the metadata, which was a required feature.
Most PDF readers ignore some trash data after the %%EOF marker because in times long gone some PDF generation or transport processes left some additional trash there.
...
%%EOF
AGAIN SOME SECRET DATA
When manipulating the PDFs themselves, though, e.g. when signing them, the PDF readers may just go ahead and throw out everything that should not be there according to the PDF specification. Or, in case of signing, they may leave the trailing bytes where they are and then integrate the signature after them. A program expecting those extra data at the end of the file may not find them anymore afterwards, as they are now somewhere inside the file.
I also don't understand what they mean by metadata in this context.
Some people actually use such mechanisms to add information required in later processing steps. E.g. the process creating some PDF invoice may add the address to send the PDF to and the amount to pay at the end of the file, then the PDF is processed some more, e.g. reviewed or archived, and in some final process it is sent out to the addressee.
The review step might be handled differently depending on the amount added at the end; maybe sales worth more than $1000 must be cleared by special personnel.
The sending process may also use the extra data after the end of file for sending the file to the recipient.
Such data about a document are sometimes called metadata.
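The fragility described here can be shown with a small Python sketch. The function name trailing_data and the sample bytes are made up for illustration, and the code assumes the hidden data itself does not contain the string %%EOF.

```python
def trailing_data(pdf_bytes):
    """Return whatever follows the last %%EOF marker (whitespace stripped)."""
    marker = b"%%EOF"
    pos = pdf_bytes.rfind(marker)
    if pos == -1:
        raise ValueError("no %%EOF marker found")
    return pdf_bytes[pos + len(marker):].strip()

original = b"%PDF-1.4\n...\nstartxref\n123\n%%EOF\nAGAIN SOME SECRET DATA\n"
print(trailing_data(original))

# Simulate a tool that appends an incremental update (e.g. a signature)
# after the trailing bytes: the data are now inside the file, not at its end.
updated = original + b"9 0 obj\n<<>>\nendobj\nstartxref\n456\n%%EOF\n"
print(trailing_data(updated))
```

After the simulated update, a program that only looks behind the last %%EOF no longer finds anything.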
I have uncompressed a PDF and I am able to view the streams and know where the data for the 3D model is kept. I have also extracted the section that describes the PRC part of the file. I have used SumatraPDF in order to view the file as it supports PRC, but it shows an error. How can I extract the 3D model from the PDF?
My goal is to extract the 3D model from the PDF in any format, so that I can ultimately convert it into a format which can be viewed and edited if need be.
Currently SumatraPDF does not run 3D models, other embeds or attachments (one of the reasons it's considered the safest PDF reader). As .PDF it will accept the Adobe-only supported 3D PRC / U3D content, and it also opens the other kind of PRC (Palm Reader Compressed eBook), which is readable as plain text.
It can show any static 3D cover image (and as work in progress in some versions allow extract some types of attachment) or potentially delete attached annotations (not what you want). https://github.com/GitHubRulesOK/MyNotes/raw/master/Acrobat%203D%20test%20-%20Laughing%20Porpoise.pdf
It is possible in PDF editors to run, view and extract such contents, possibly via their API/SDK.
<<
/AN 5 0 R
/Filter /FlateDecode
/Length 335276
/Subtype /U3D
/Type /3D
/VA 6 0 R
>>
stream
xœì¼ 8Tßÿ~{B–¢0d×fŒuÈ.KciU„lI¶ÅZ$KeI²e§´ªˆ¢ÍRÙµEÖìd™ÿÌŒQÚ>ßßÿÿüïóÜgfÎyÏû¼Þçõ:çžsî½ÇBVà `70wÏØúoßÁ“šÎhzìÝãµ×n£Ïz#ÖÎÉ~ýaY;iEq
»6{4l=ØìñÕ0?pÀeŸ“— l½ôz©õRø¿³¦zß-\¼<özíõ𤤠€‡ÃÑ•Ï þÎ#|§´¸¸9:¹åãñ‚6&ä6Tß~ž›ú$++{«xÆüŽÀ§d>}îš`F š oJ¾—,¨æJø+ HñÓlœ÷Úz‘c· Ë'qùÝú:‚{²cÁ4|RÊ 88<þB
.......
I'm trying to build a PDF file with a link to an external file.
I'm using the spec https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
On page 348 there is an example of image with an alternate image loaded remotely. When I create a document with the example from the doc, the reader (using acrobat reader XI) doesn't fetch the image. There is no error message but no request is being made (checked using wireshark).
Can I have only a remote image (i.e. no "normal" image and alternate image)?
Is there an example somewhere of a full document using that /FS /URL syntax (ie not just the objects)? I couldn't find any that actually does the request.
Thanks
Edit:
I used LibreOffice to create the base document with a single 1x1 pixel.
http://pastebin.com/5GqCYgMp
I initially created my test document with Acrobat but the output was really messy.
Then replaced the stream with the example from the pdf spec, and tried to fix the startxref offset, but not sure it's correct.
http://pastebin.com/BT42g02P
This document is currently not opening correctly, but I tried to get a minimum test case. My previous attempts were displayed with no errors only by luck (but the remote image didn't work anyway).
Is there any tool that actually allows the creation of XObject with /URL? I don't know the file format enough to create them reliably by hand.
First of all,
I'm using the spec https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
I would recommend not using a PDF reference but instead the ISO standard. The Adobe PDF references are not normative in nature while the ISO standard is. (The actual content differences are minute but if there is a normative spec, one should use it.) Adobe also publishes a copy of the ISO standard with merely the header exchanged.
Then, please don't treat PDFs as text documents. E.g. by sharing them on pastebin, you make them subject to treatment as text which essentially destroys the content.
That all being said, let's look at your actual issue:
In your sample PDF you have:
4 0 obj
<</Type/XObject/Subtype/Image/Width 1/Height 1/BitsPerComponent 8/Length 0/F << /FS /URL
/F ( https://upload.wikimedia.org/wikipedia/commons/c/ca/1x1.png )
>>/Filter/FlateDecode/ColorSpace/DeviceRGB
>>
stream
endstream
endobj
This indicates that the PDF viewer shall find at the URL https://upload.wikimedia.org/wikipedia/commons/c/ca/1x1.png a file containing an array of 1 (/Width 1/Height 1) RGB (/ColorSpace/DeviceRGB) sample with 1 byte per color (/BitsPerComponent 8), cf. section 8.9.5 Image Dictionaries of ISO 32000-1.
I doubt your file fulfills that. I assume it actually is a PNG file, with a PNG structure in particular, not the structure explained above.
PDF does not support the PNG format as is; you have to transform the data. It does support, though, the JPEG format using the /FFilter /DCTDecode which is why the sample from the specification
16 0 obj
<< /Type /XObject
/Subtype /Image
/Width 1000
/Height 2000
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Length 0 % This is an external stream
/F << /FS /URL
/F (http://www.myserver.mycorp.com/images/exttest.jpg)
>>
/FFilter /DCTDecode
>>
stream
endstream
endobj
makes it look so easy.
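To illustrate what "an array of samples" means for the 1x1 case, here is a small Python sketch. The helper names are made up, and the last step assumes the external file would be declared with /FFilter /FlateDecode; whether a given viewer actually fetches such a file is a separate question.

```python
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data):
    """True if the bytes start with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)

def expected_sample_bytes(width, height, components, bits_per_component):
    """Size of the raw sample data an image dictionary describes.
    Each row of samples is padded to a whole number of bytes."""
    row_bits = width * components * bits_per_component
    row_bytes = (row_bits + 7) // 8
    return row_bytes * height

# One red pixel: /Width 1 /Height 1 /ColorSpace /DeviceRGB /BitsPerComponent 8
raw_samples = bytes([255, 0, 0])
print(len(raw_samples), expected_sample_bytes(1, 1, 3, 8))

# What the external file could contain if it were declared Flate-compressed.
external_file = zlib.compress(raw_samples)
print(looks_like_png(external_file))
```

A PNG download starts with the PNG signature and carries its own chunk structure, so it is neither of these.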
I'd like to export the page-labels stored in some PDF documents for easy parsing. I know I could dig into the PDF document after having it converted with qpdf, but this seems like overkill.
Is there no commandline tool that will simply print the page label for each page (or together with other meta-data)? I know that PDFSpy will export the label, but $300 isn't an option, preferably the solution should be free.
Short answer:
I am not aware of any (free) tool that can 'simply print' the page label for each page.
Also, you'll not be able to avoid expanding compressed objects and object streams, using a tool like qpdf or one with equivalent capabilities.
Long answer:
There's no such tool because there are only a few things you can safely rely on when it comes to page labels. These are the following:
Each PDF document must contain a root object.
That root object must be of /Type /Catalog.
The document's trailer will show where to find the object using the key /Root followed by the indirect object number reference.
IF a PDF document uses non-standard page labels, then the document root object must have an entry named /PageLabels.
Here is where it stops being relatively easy, because the object the /PageLabels key refers to may be contained in a compressed object stream. This means that you'd have to expand that object stream.
If you do succeed in getting the description of the page labels as ASCII, you'll discover that it's not an easily parseable flat list (like a dictionary is): it is a number tree.
I'll not go into the details of these complexities, because it would take a very long article to describe all possible variations. You had better read up on it directly in the official ISO PDF-1.7 specification.
But instead I'll give you an example in ASCII PDF code:
213 0 obj
<< /Type /Catalog
/PageLabels
<<
/Nums
[
0 << % start labeling from page no. 1
/S /r % label with lowercase roman numbers
>>
7 << % start new labeling from page no. 8
/S /D % label with standard decimal numbers
>>
11 << % start labeling page no. 12
/S /D % label with decimal numbers...
/P (ABCD-) % ...but using label prefix 'ABCD-'...
/St 3 % ...followed by '3' as the start decimal.
>>
]
>>
%%...........................
%%...more root object keys...
%%...........................
>>
endobj
The above example will label the pages number 1, 2, 3, ... (last) like this:
i
ii
iii
iv
v
vi
vii
1
2
3
4
ABCD-3
ABCD-4
ABCD-5
ABCD-6
...and so on until last page...
As you can see, the PDF method of labeling pages (mapping page numbers to page names) is completely non-intuitive. You can only understand it by studying the PDF specification.
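For readers who prefer code to prose, the mapping can be sketched in Python. This is an illustration only: it assumes the /Nums array has already been extracted and flattened into a plain dict (start page index mapped to the /S, /P and /St entries), and the function names are made up. Walking a real number tree with /Kids nodes is not covered.

```python
import bisect

def to_roman(n):
    """Lowercase roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, numeral in pairs:
        while n >= value:
            out += numeral
            n -= value
    return out

def to_letters(n):
    """a..z for 1..26, then aa..zz for 27..52, and so on."""
    return chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)

def page_labels(nums, page_count):
    """nums maps a zero-based start index to a dict with optional
    keys 'S' (style), 'P' (prefix) and 'St' (start value, default 1)."""
    starts = sorted(nums)
    labels = []
    for index in range(page_count):
        pos = bisect.bisect_right(starts, index) - 1
        if pos < 0:               # no range covers this page
            labels.append("")
            continue
        start = starts[pos]
        rng = nums[start]
        value = rng.get("St", 1) + (index - start)
        style = rng.get("S")
        numeral = {"D": str(value),
                   "r": to_roman(value), "R": to_roman(value).upper(),
                   "a": to_letters(value), "A": to_letters(value).upper(),
                   }.get(style, "")  # no /S means prefix only
        labels.append(rng.get("P", "") + numeral)
    return labels

# The ranges from the example catalog above.
nums = {0: {"S": "r"}, 7: {"S": "D"}, 11: {"S": "D", "P": "ABCD-", "St": 3}}
print(page_labels(nums, 14))
```

Note that the keys in /Nums are zero-based page indices, which is why the entry 7 describes the eighth page.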
I've written a small command-line utility based on Poppler that does just this task: https://github.com/HeimMatthias/pdfpagelabels
Disclaimer: I'm the OP and created the original post under a different account. I have been using the solution via pdftk (listed in a comment above) successfully for years in my implementation. However, last year it was time to reimplement our system from scratch and we've had numerous instances where the pdf-tk output could not be parsed by our implementation.
The new command-line tool follows the philosophy of doing just one thing, but doing it well, and simply prints the page labels of all or selected pages of a pdf-file. If anyone finds this useful, and stumbles upon it here, all the better for it.
I've been wanting to see the insides of a PDF for a while, like, the raw source code of it so I can look at it. Any way of doing that?
Looking at the raw code of PDFs will not serve you much unless you also have an idea about its internal structure. You should get yourself a copy of the official PDF reference (download PDF), and you should have read some introductory article such as this [gone] or this to begin with.
Even after such a preparation, you won't discover much that is useful when staring at the raw code, because PDFs usually contain parts which are "filtered" (that means: compressed).
How to look at the real PDF source behind the 'raw' binary parts
Jay Berkenbilt's qpdf is a very useful commandline tool (available for Linux, Mac OSX, Windows, and as source code, under the open source Artistic License), which can unpack most filtered content and re-organize the internal structure in a way that gives you much more insight into it (all objects are numerically ordered, etc.). The commandline to achieve this is:
qpdf --qdf original.pdf unpacked.pdf
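If you only want to peek at Flate-compressed streams without installing anything, a rough Python sketch like the following can help. It is a naive text scan, not a PDF parser: it ignores /Length, other filters, predictors, encryption and object streams, and the function name inflate_streams is made up. For real work, qpdf remains the better choice.

```python
import re
import zlib

# Everything between "stream<EOL>" and "endstream" (non-greedy).
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(pdf_bytes):
    """Try to inflate every stream body; None where zlib cannot decode it."""
    results = []
    for m in STREAM.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes after the data
            results.append(zlib.decompressobj().decompress(m.group(1)))
        except zlib.error:
            results.append(None)
    return results

content = b"BT /F1 12 Tf (Hello) Tj ET"
packed = zlib.compress(content)
pdf = (b"5 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(packed)
       + packed + b"\nendstream\nendobj\n")

print(inflate_streams(pdf))
```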
Another useful and free tool (GPL licensed, but Linux-only AFAIK) to look into PDFs is of course PDFEdit. This one even comes with a GUI (if you prefer that), while still allowing you access to the internal structure and "raw" PDF code.
If the purpose is just to look into the file, then any simple text editor will do, e.g. Notepad. PDF is basically a text-based format that includes embedded byte streams for content. Raw PDF looks like this:
>>
/Border [0 0 0]
/Rect [121.02 332.48 363.24 343.64]
/StructParent 1321
/Subtype /Link
/Type /Annot
>>
endobj
64579 0 obj
<<
/Filter /FlateDecode
/Length 5771
>>
stream
Ũn0x/�+�}�ǹ����\֛ bYO�5[��X��W��L��(�������V�A3�C���������u큋_�a��ךm2N�6� ��A��8
�d���NQ⺢GI��G�[��)�̉Y��R�y{R����&�&�;��g�k1���ҋeTC�(W��`���*��(;�AEc<= mnZ+��|T��v
�.��зe�aޞ��V4�b���L����k�Oj.ֿ�y�����kc|I�� ��C�0��Hf�7d�/�z���m��o��A��B��IJ�%�.
!�%f�б���&�ޒ�4Ύ7�l�3���3`�
endstream
endobj
64580 0 obj
<<
/Border [0 0 0]
/Dest <E4AE7DD2769553EF1668>
/Rect [219 648.5 256.8 659.66]
/StructParent 1323
/Subtype /Link
/Type /Annot
>>
What you see are basic COS objects like name, dictionary, stream and so on. All objects are described in the ISO 32000 standard, see section 7.3 Objects.
Use a Hex editor. Of course, unless you know the PDF specification (PDF, 8.6 MB), you won't recognize much.
In addition to the qpdf tool, conversion into PostScript might be helpful.
PDF is closely related to PS (both share the same imaging model). Usually it's quite easy to figure out, e.g., where the labels of a graph are. You can either use pdf2ps or invoke ghostscript
gs -sDEVICE=ps2write -sOutputFile=some.ps -dNOPAUSE -dBATCH some.pdf
When you generate your PDFs using pdflatex you can disable compression with an option (for example \pdfcompresslevel=0 in the preamble). This makes the PDF more readable.
Some more recent observations on the other answers.
Adobe keeps moving its open-sourced copy of the 2008 standard around; currently it is here: https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
The Web Archive currently has a copy here: https://ia601003.us.archive.org/5/items/pdf320002008/PDF32000_2008.pdf
They should be identical (22,491,828 bytes), so beware: neither includes any errata.
A PDF CAN be plain text (MIME "text/pdf"), perfectly well annotated, and generated from a console keyboard or command line (too slow) or a batch file. I won't bore you with the whole file, but it starts like this:
REM Start with File "Magic" Signatures for a PDF
echo %%PDF-1.0>!Fname!
echo %%âãÏÓ>>!Fname!
echo %%01) Prepare file references>>!Fname!
for %%Z in (!Fname!) do set "FZ1=%%~zZ"
echo 1 0 obj>>!Fname!
echo ^<^</Names^<^</Dests 2 0 R^>^>/Outlines 3 0 R>>/PageLayout/OneColumn/PageMode/UseOutlines>>!Fname!
REM ToDo add files
REM /Lang (ga-IE)/MarkInfo^<^</Marked true^>^>/Names ^<^<^/EmbeddedFiles [(file.ext) 3 0 R]^>^>>>!Fname!
echo /Pages 4 0 R/Type/Catalog/ViewerPreferences^<^</DisplayDocTitle true^>^>^>^>>>!Fname!
echo endobj>>!Fname!
echo %%02) Prepare Named Destinations>>!Fname!
Thus the annotated RAW PDF (note I had edited the order in the cmd file in preparation for an XMP data section, so it is not identical) could look like this:
%PDF-1.3
%âãÏÓ
%01) Prepare file references
1 0 obj
<</Lang(ga-IE)/Names<</Dests 3 0 R>>/Outlines 4 0 R/PageLayout/OneColumn/PageMode/UseOutlines
/PageLabels<</Nums[0<</S/A>>]>>/Pages 5 0 R/Type/Catalog/ViewerPreferences<</DisplayDocTitle true>>>>
endobj
%02) Reserved for big meta data
2 0 obj
<< >>
endobj
%03) Prepare Named Destinations
3 0 obj
<</Names [(Page1) [6 0 R /XYZ 0 792 null] (QRCode) [6 0 R /XYZ 25.0 317.0 1]]>>
endobj
%04) Prepare Outline / Bookmarks
...
...
Others have made many suggestions for decompressing binary application/pdf into text/pdf; some of them produce a hybrid which still contains binary sections.
The 3 most common tools designed for the task are qpdf (already mentioned, but it uses a hybrid QDF), PDFtk (uncompress) and Mutool (different CLI options). Mutool is the one I play with most, as it's easy to change the output settings in the GL GUI. The output can be modified in MS Notepad whilst previewing the result.
So any text editing script can write or edit a PDF, even one with graphics, and several applications can convert RAW "binary" PDF into RAW "textual" PDF. However, never attempt to edit a PDF whilst it is temporarily in its textual base64 RePrEx (possible, but totally impractical).
Has anyone had any success with adding additional information to a PDF file?
We have an electronic medical record system which produces medical documents for our users. In the past, those documents have been Print-To-File (.prn) files which we have fed to a system that displayed them as part of an enterprise medical record.
Now the hospital's enterprise medical record vendor wants to receive the documents as PDF, but still wants all of the same information stored in the header.
Honestly, we can't figure out how to put information into a PDF file that doesn't break the PDF file.
Here is the start of one of our PDFs...
%PDF-1.4
%âãÏÓ
6 0 obj
<<
/Type /XObject
/Subtype /Image
/BitsPerComponent 8
/Width 854
/Height 130
/ColorSpace /DeviceRGB
/Filter /DCTDecode
/Length 17734>>
stream
In our PRN files, we would insert information like this:
%MRN% TEST000001
%ACCT% TEST0000000000001
%DATE% 01/01/2009^16:44
%DOC_TYPE% Clinical
%DOC_NUM% 192837475
%DOC_VER% 1
My question is, can I insert this information into a PDF in a manner which allows the document server to perform post-processing, yet is NOT visible to the doctor who views the PDF?
Thank you,
David Walker
Yes, you can. Any line in a PDF file that starts with a percent sign is a comment and as such ignored (the first two lines of the PDF actually are comments as well). So you can pretty much insert your information into the PDF as you did into the PRN.
However:
The PDF format works with byte position references, so if you insert data into a finished PDF file, this will push the rest of the data away from their original position and thus break the file. You also cannot simply append it to the file, because a PDF file has to end with
startxref
123456
%%EOF
(the 123456 is an example). You could insert your data right before these three lines. The byte position of the "startxref" part is never referenced anywhere, so you won't break anything if you push this final part towards the end.
Edit: This of course assumes there is no checksumming, signing or encryption going on. That would make things more complicated.
Edit 2: As Javier pointed out correctly, you can also just add your data to the end and just add a copy of the three lines to the end of that. Boils down to the same thing, but it's a little easier.
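The approach of inserting comment lines right before the final startxref can be sketched in Python as follows. The function names insert_comments and read_comments are made up, the %KEY% VALUE layout simply mirrors the PRN header from the question, and the sketch assumes plain ASCII values without line breaks and a file that is neither signed nor encrypted.

```python
import re

def insert_comments(pdf_bytes, fields):
    """Insert '%KEY% VALUE' comment lines right before the last startxref.
    All objects and xref tables lie before that point, so no referenced
    byte offset changes."""
    pos = pdf_bytes.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref found")
    lines = b"".join(
        b"%" + key.encode("ascii") + b"% " + value.encode("ascii") + b"\n"
        for key, value in fields.items())
    return pdf_bytes[:pos] + lines + pdf_bytes[pos:]

def read_comments(pdf_bytes):
    """Collect all '%KEY% VALUE' comment lines into a dict."""
    found = re.findall(rb"^%(\w+)% (.*)$", pdf_bytes, re.MULTILINE)
    return {k.decode("ascii"): v.decode("ascii").strip() for k, v in found}

sample = (b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
          b"xref\n0 1\n0000000000 65535 f \n"
          b"trailer\n<< /Size 1 >>\nstartxref\n29\n%%EOF\n")

tagged = insert_comments(sample, {"MRN": "TEST000001", "DOC_VER": "1"})
print(read_comments(tagged))
```

The document server can read the fields back with the second function and, if desired, strip the lines again before the file reaches the viewer.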
PDFs are supposed to have multiple versions just appending at the end; but the very end must have the offset to the main reference table. Just read the last three lines, append your data and reattach the original ending.
You can either remove the original ending or leave it there. PDF readers will just go to the end and use the second-to-last line to find the reference table.
Have you ever thought to embed your additional info inside the PDF as a separate file?
The generic PDF specification allows you to "attach files" to PDFs. Attached files can be anything: *.txt, *.doc, *.xsl, *.html or even *.pdf. Attached files are contained in the PDF "container" file without corrupting the container's own content. (Special-purpose PDF specifications such as PDF/A-* and PDF/X-* may impose some restrictions on embedded/attached files.)
That allows you to tie additional info and/or data to PDF files and allow for common storage and processing. Attached files are supposed to not disturb any PDF viewer's rendering.
I've used that feature frequently, for various purposes:
store the parent document (like .doc) inside the .pdf from which the .pdf was created in the first place;
tag a job ticketing information to a printfile that is sent to the printshop;
etc.pp.
Of course, recently discovered and published flaws in PDF processing software (and in the PDF spec itself) suggest to stay away from embedding/attaching binary files to PDF files --
because more and more Readers will by default stop you from easily extracting/detaching the embedded/attached files.
However, there is no reason why you shouldn't be able to put your additional info into a medical-record-info.txt file of arbitrary length and internal format and attach it to the PDF:
MRN TEST000001
ACCT TEST0000000000001
DATE 2009-01-01
TIME 16:44:33.76
DOC_TYPE Clinical
DOC_NUM 192837475
DOC_VER 1
MORE_INFO blah blah
Hi, guys,
can you please process this file faster than usual? If you don't,
someone will be dying.
Seriously, David.
FWIW, the commandline tools pdftk.exe (Windows) and pdftk (Linux) are able to attach and detach embedded files from their container PDF. Acrobat Reader can also handle attachments.
You could setup/program/script your document server handling the PDF to automatically detach the embedded .txt file and trigger actions according to its content.
Of course, the doctor who views the PDF would be able to see there is a file attachment in the PDF. But it wouldn't appear in his "normal" viewing. He'd have to take specific additional actions in order to extract and view it. (And then there is the option to set a password on the PDF to protect it from un-authorized file detachments. And/or encode, obscure, rot13 the .txt. Not exactly rock-solid methods, but 99% of doctors wouldn't be able to accomplish it even if you teach them how to...)
You can still insert comments into a PDF file using the % character. But anyone would be able to access them with a text editor.
Your vendor could remove these comments after post-processing, so it doesn't actually get to the doctors.
You can store the data as real PDF metadata. For example, with CAM::PDF you can write metadata like this:
use CAM::PDF;
my $pdf = CAM::PDF->new('temp.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
$info->{PRN} = CAM::PDF::Node->new('dictionary', {
DOC_TYPE => CAM::PDF::Node->new('string', 'Clinical'),
DOC_NUM => CAM::PDF::Node->new('number', 192837475),
DOC_VER => CAM::PDF::Node->new('number', 1),
});
$pdf->cleanoutput('out.pdf');
The Info node of the PDF then looks like this:
8 0 obj
<< /CreationDate (D:20080916083455-04'00')
/ModDate (D:20080916083729-04'00')
/PRN << /DOC_NUM 192837475 /DOC_TYPE (Clinical) /DOC_VER 1 >> >>
endobj
You can read the PRN data back out like so (simplistic code...)
my $pdf = CAM::PDF->new('out.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
my $prn = $info->{PRN};
if ($prn) {
my $prndict = $pdf->getValue($prn);
for my $key (sort keys %{$prndict}) {
print "$key = ", $pdf->getValue($prndict->{$key}), "\n";
}
}
Which makes output like this:
DOC_NUM = 192837475
DOC_TYPE = Clinical
DOC_VER = 1
PDF supports arbitrarily nested arrays, dictionaries and references so just about any data can be represented. For example, I built an entire filesystem embedded in a PDF just for fun!
At one point we were changing some Acrobat JS code by doing a text replace in a plain (unencrypted) PDF. The trick was that the lengths of each PDF block were hard coded in the document. So, we could not change the number of characters. We would just add extra spaces.
It worked great, the JS code executed and all.
Have you thought about using XMP?