Extracting 3D model from a pdf - pdf

I have uncompressed a PDF and I am able to view the streams and know where the data for the 3d model is kept. I have also extracted the section that describes the PRC part of the file. I have used SumatraPDF in order to view the file as it supports PRC. But it shows an error. How can extract the 3D model from the PDF?
That I will be able to extract the 3D model from the PDF in any format so I can ultimately convert it into a format which can be viewed and edited if need be.

Currently SumatraPDF does not RUN 3D models nor other embeds nor attachments (one of the reasons its considered SAFEST PDF reader). It will accept as .PDF, Adobe ONLY supported 3D PRC / U3D, or other PRCs (Palm Reader Compressed eBook) which are runnable as plain text.
It can show any static 3D cover image (and as work in progress in some versions allow extract some types of attachment) or potentially delete attached annotations (not what you want). https://github.com/GitHubRulesOK/MyNotes/raw/master/Acrobat%203D%20test%20-%20Laughing%20Porpoise.pdf
It is possible in PDF editors to run, view and extract such contents, possibly via their API/SDK.
<<
/AN 5 0 R
/Filter /FlateDecode
/Length 335276
/Subtype /U3D
/Type /3D
/VA 6 0 R
>>
stream
xœì¼ 8Tßÿ~{B–¢0d×fŒuÈ.KciU„lI¶ÅZ$KeI²e§´ªˆ¢ÍRÙµEÖìd™ÿÌŒQÚ>ßßÿÿüïóÜgfÎyÏû¼Þçõ:çžsî½ÇBVà `70wÏØúoßÁ“šÎhzìÝãµ×n£Ïz#ÖÎÉ~ýaY;iEq
»6{4l=ØìñÕ0?pÀeŸ“— l½ôz©õRø¿³¦zß-\¼<özíõ𤤠€‡ÃÑ•Ï þÎ#|§´¸¸9:¹åãñ‚6&ä6Tß~ž›ú$++{«xÆüŽÀ§d>}îš`F š oJ¾—,¨æJø+ HñÓlœ÷Úz‘c· Ë'qùÝú:‚{²cÁ4|RÊ 88<þB
.......

Related

Crop PDF Content

I have a pdf that I would like to impose. It has 8.5x11" pages, media box, and crop box. I want the pdf to have 17x11" pages, by merging adjacent pages. Unfortunately, most pages have content either completely outside or straddling the crop box. Because each page can only have a single stream and crop box, when imposed, the overlapping content becomes visible. This is bad.
I don't want to rasterize my pdf because that would fix the DPI ahead-of-time. So I won't consider exporting pages as images, appending the images (imagemagick), then embedding these paired images into a new pdf.
I've also had problems imposing in postscript - issues with transparency, font rasterization, and other visual glitches during the pdf->ps->pdf conversions.
The answer should be scriptable.
So far I've tried:
podofo imposition scripts (lua)
PyPDF2 (python)
ghostscript
latex
The question "Ghostscript removes content outside the crop box?" suggests that ghostscript's pdfwrite module, when generating an output pdf file, will rasterize and crop content according to the crop box. So I'd only have to pipe my pdf through ghostscript's pdfwrite module. Unfortunately, this doesn't work.
I was about to give up when I tried printing the pdf to another pdf through evince. It works perfectly - text & vector elements within the crop box are not rasterized, and elements outside the crop box are removed (I haven't tested straddling elements yet). The quality is high - resolution (page size) and appearance are identical. In fact, everything seems to be the same except for the metadata.
So:
the question is possible
the answer already exists
How can I access it?
I think this functionality might be provided by cup's pdftopdf binary. I don't have any problems calling an external binary.... but can't figure out how to use pdftopdf.
Edit: Link to test pdf. It contains raster, vector, and text items - some partially occluded by partially transparent items - that span as well as abut adjacent pages. Once again, printing this PDF through cups appears to crop all content outside the crop box. However, opening the filtered pdf in inkscape shows that the off-page items are individually masked, not cropped - except text, which is trimmed.
The trick is to use Form XObjects to impose multiple pages within a single page. Form XObjects can reference entire PDF pages, and maintain independent clips. PyPDF2 doesn't support Form XObjects, so merging unifies the stream of all input pages such that they share the clip/media box of the output page. I've been successful in using both pdflatex and pdfrw (python) - test programs are inlined below. Since Form XObjects are derived from a similar postscript level 2 feature, as suggested by KenS it should be possible to achieve the same goal in ghostscript using "page clips". In fact he shared a ghostscript 2x1 imposition script in another answer, but it appears horrendously complicated. Combined with the font rasterization issues of poppler's pdftops (even with compatibility level > 1.4), I've abandoned the ghostscript approach.
Latex script derived from How to stitch two PDF pages together as one big page?. Requires pdflatex:
\documentclass{article}
\usepackage{pdfpages}
\usepackage[paperwidth=8.5in, paperheight=11in]{geometry}
\usepackage[multidot]{grffile}
\pagestyle{plain}
\begin{document}
\setlength\voffset{+0.0in}
\setlength\hoffset{+0.0in}
\includepdf[ noautoscale=true
, frame=false
, pages={1}
]
{<file.pdf>}
\eject \paperwidth=17in \pdfpagewidth=17in \paperheight=11in \pdfpageheight=11in
\includepdf[ nup=2x1
, noautoscale=true
, frame=false
, pages={2-,}
]
{<file.pdf>}
\end{document}
pdfrw (python script) derived from pdfrw:examples:booklet. Requires pdfrw >= 0.2:
#!/usr/bin/env python3
# Copyright:
# Yclept Nemo
# 2016
# License:
# GPLv3
import itertools
import argparse
import pdfrw
# from itertool recipes in the python documentation
def grouper(iterable, n, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
args = [iter(iterable)] * n
return itertools.zip_longest(*args, fillvalue=fillvalue)
def pagemerge(page, *pages):
merged = pdfrw.PageMerge() + page
for page in reversed(list(itertools.takewhile(lambda i: i is not None, reversed(pages)))):
merged = merged + page
merged[-1].x = merged[-2].x + merged[-2].w
return merged.render()
parser = argparse.ArgumentParser(description='Impose PDF files using Form XOBjects')
parser.add_argument\
( "source"
, help="PDF, source path"
, type=pdfrw.PdfReader
)
parser.add_argument\
( "-s", "--spacer"
, help="PDF, spacer path"
, type=lambda fp: next(iter(pdfrw.PdfReader(fp).pages), None)
)
parser.add_argument\
( "target"
, help="PDF, target path"
)
args = parser.parse_args()
pages = args.source.pages[:1]
for pair in grouper(args.source.pages[1:], 2):
assert pair[0] is not None
pages.append(pagemerge(pair[0], args.spacer, pair[1]))
# include metadata in target
target = pdfrw.PdfWriter()
target.addpages(pages)
target.trailer.Info = args.source.Info
target.write(args.target)
Some idiosyncrasies as of pdfrw 0.2:
Note that the operations +=, append and extend are not defined for pdfrw.PageMerge, even though it behaves like a list. Furthermore + acts like += in that it modifies the left-hand-side object.
Ghostscript and the pdfwrite device do not, in general, rasterise the content of input PDF files (the caveat is for cases involving transparent input and the output being < PDF 1.4).
Object which are entirely clipped out are not preserved into the output.
So the short answer is that this should be entirely feasible using Ghostscript and the pdfwrite device, with the advantage that its possible to impose the pages as well in a single operation. I do have an open bug report about clipping in a similar situation (reverse imposition) but have not yet had time to address it.
Note that Ghostscript normally uses the MediaBox for the clip region, if you want to use the CropBox then you need to add -dUseCropBox to the command line.

PDF with an external image using XObject

I'm trying to build a PDF file with a link to an external file.
I'm using the spec https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
On page 348 there is an example of image with an alternate image loaded remotely. When I create a document with the example from the doc, the reader (using acrobat reader XI) doesn't fetch the image. There is no error message but no request is being made (checked using wireshark).
Can I have only a remote image (ie no "normal" image and alternate image).
Is there an example somewhere of a full document using that /FS /URL syntax (ie not just the objects)? I couldn't find any that actually does the request.
Thanks
Edit:
I used LibreOffice to create the base document with a single 1x1 pixel.
http://pastebin.com/5GqCYgMp
I initially created my test document with Acrobat but the output was really messy.
Then replaced the stream with the example from the pdf spec, and tried to fix the startxref offset, but not sure it's correct.
http://pastebin.com/BT42g02P
This document is currently not opening correctly, but I tried to get a minimum test case. My previous attempts were displayed with no errors only by luck (but the remote image didn't work anyway).
Is there any tool that actually allows the creation of XObject with /URL? I don't know the file format enough to create them reliably by hand.
First of all,
I'm using the spec https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
I would recommend not using a PDF reference but instead the ISO standard. The Adobe PDF references are not normative in nature while the ISO standard is. (The actual content differences are minute but if there is a normative spec, one should use it.) Adobe also publishes a copy of the ISO standard with merely the header exchanged.
Then, please don't treat PDFs as text documents. E.g. by sharing them on pastebin, you make them subject to treatment as text which essentially destroys the content.
That all been said, let's look at your actual issue:
In your sample PDF you have:
4 0 obj
<</Type/XObject/Subtype/Image/Width 1/Height 1/BitsPerComponent 8/Length 0/F << /FS /URL
/F ( https://upload.wikimedia.org/wikipedia/commons/c/ca/1x1.png )
>>/Filter/FlateDecode/ColorSpace/DeviceRGB
>>
stream
endstream
endobj
This indicates that the PDF viewer shall find at the URL https://upload.wikimedia.org/wikipedia/commons/c/ca/1x1.png a file containing an array of 1 (/Width 1/Height 1) RGB (/ColorSpace/DeviceRGB) sample with 1 byte per color (/BitsPerComponent 8), cf. section 8.9.5 Image Dictionaries of ISO 32000-1.
I doubt your file fulfills that, I assume it actually is a PNG file in particular with a PNG structure, not the structure explained above.
PDF does not support the PNG format as is, you have to transform the data. It does support, though, the JPEG format using the /FFilter /DCTDecode which is why the sample from the specification
16 0 obj
<< /Type /XObject
/Subtype /Image
/Width 1000
/Height 2000
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Length 0 % This is an external stream
/F << /FS /URL
/F (http://www.myserver.mycorp.com/images/exttest.jpg)
>>
/FFilter /DCTDecode
>>
stream
endstream
endobj
makes it look so easy.

can I create a PDF from ImageMagick which will always open at zoom 100%? [duplicate]

I am running into an issue with PNG to PDF conversion.
Actually I have big PNG files not in size but in contents.
In PDF conversion it creates a big PDF files. I don't have any issue with its quality, but whenever I try to open this PDF in PDF viewer, it opens in "Fit to Page" mode.
So, I can't see the created PDF in the initial view, but I need to zoom it up to 100%.
My question is: can I create a PDF which will always open at zoom 100% ?
You can possibly achieve what you want with the help of Ghostscript.
Ghostscript supports to insert PostScript snippets into its command line parameters via -c "...[PostScript code here]...".
PostScript has a special operator called pdfmark. This operator is not understood by most PostScript interpreters, but is understood by Acrobat Distiller and (for most of its parameters) also by Ghostscript when generating PDFs.
So you could try to insert
-c "[ /PageMode /UseNone /Page 1 /View [/XYZ null null 1] \
/PageLayout /SinglePage /DOCVIEW pdfmark"
into a PDF->PDF conversion Ghostscript command line.
Please take note about various basic things concerning this snippet:
The contents of the command line snippet appears to be 'unbalanced' regarding the [ and ] operators/keywords. But it is not! The initial [ is balanced by the final pdfmark keyword. (Don't ask -- I did not define this syntax...)
The 'inner' [ ... ] brackets delimit an array representing the page /View settings you desire.
Not all PDF viewers do respect the view settings embedded in the PDF file (Acrobat software does!).
Most PDF viewers allow users to override the view settings embedded in PDF files (Acrobat software also does this). That is, you can tell your viewer to never respect any settings from the PDF files it opens, but f.e. to always open it with "fit to width".
Some specific things about this snippet:
The page mode /UseNone means: the document displays without bookmarks or thumbnails. It could be replaced by
/UseOutlines (to display bookmarks also, not just the pages)
/UseThumbs (to display thumbnail images of the pages, not just the pages
/FullScreen (to open document in full screen mode)
The array for the view mode constructed as [/XYZ <left> <top> <zoom>] means: The zoom factor is 1 (=100%), the left distance from the page origin is the special 'null' value, which means to keep the previously user-set value; the top distance from the page origin is also 'null'. This array could be replaced by
/Fit (to adapt the page to the current window size)
/FitB (to adapt the visible page content to the current window size)
/FitH <top>' (to adapt the page width to the current window width);` indicates the required distance from page origin to upper edge of window.
...plus several others I cannot remember right now.
So to change the settings of an existing PDF file, you could do the following:
gs \
-o out.pdf \
-sDEVICE=pdfwrite \
-c "[ /PageMode /UseNone /Page 1 /View [ /XYZ null null 1 ] " \
-c " /PageLayout /SinglePage /DOCVIEW pdfmark" \
-f in.pdf
To check if the Ghostscript command worked, open the PDF in a text editor which is capable of handling binary files. Search for the /View or the /PageMode keywords and check if they are there, inserted as values into the PDF root object.
If it worked, check if your PDF viewer honors the settings. If it doesn't honor them, see if there is an overriding setting within the viewers preference settings.
I did a quick test run on a sample PDF of mine. Here is how the PDF root object's dictionary looks now, checked with the help of pdf-parser.py:
pdf-parser-beta.py -s Catalog a.pdf
obj 1 0
Type: /Catalog
Referencing: 3 0 R, 9 0 R
<<
/Type /Catalog
/Pages 3 0 R
/PageMode /UseNone
/Page 1
/View [/XYZ null null 1]
/PageLayout /SinglePage
/Metadata 9 0 R
>>
To learn more about the pdfmark operator, google for 'pdfmark reference filetype:pdf'. You should be able to find it on the Adobe website and elsewhere:
https://www.google.de/search?q=pdfmark%20reference%20filetype%3Apdf&oq=pdfmark%20reference%20filetype%3Apdf
In order to let ImageMagick create a PDF as you want it, you may be able to hack the file defining your delegate settings. For more help about this topic see for example here:
http://www.imagemagick.org/Usage/files/#delegates
PDF specification supports this functionality in this way: create a GoTo action that goes to first page and sets the zoom level to 100% and then set the action as the document open action.
How exactly you implement it in real life depends very much on the tool you use to create the PDF file. I do not know if ImageMagick can create such actions.

How to open PDF raw?

I've been wanting to see the insides of a PDF for a while, like, the raw source code of it so I can look at it. Any way of doing that?
Looking at the raw code of PDFs will not serve you much unless you also have an idea about its internal structure. You should get yourself a copy of the official PDF reference (download PDF), and you should have read some introductionary article such as this [gone] or this to begin with.
Even after such a preparation, you'll not discover much useful when staring at the raw code. Because PDFs usually will contain parts which are "filtered" (that means: compressed).
How to look at the real PDF source behind the 'raw' binary parts
Jay Birkenbilt's qpdf is a very useful commandline tool (available for Linux, Mac OSX, Windows, and as source code, under the open source Artistic License), which can unpack most filtered content and re-organize the internal structure in a way that gives you much more insight into it (all objects are numerically ordered, etc.). The commandline to achieve this is:
qpdf --qdf original.pdf unpacked.pdf
Another useful and free tool (GPL licensed, but Linux-only AFAIK) to look into PDFs is of course PDFEdit. This one even comes with a GUI (if you prefer that), while still allowing you access to the internal structure and "raw" PDF code.
If the purpose is just to look into the file, then any simple text editor will do, ex, Notepad. PDF is just a text based format, including embedded content byte streams. Raw PDF looks like this:
>>
/Border [0 0 0]
/Rect [121.02 332.48 363.24 343.64]
/StructParent 1321
/Subtype /Link
/Type /Annot
>>
endobj
64579 0 obj
<<
/Filter /FlateDecode
/Length 5771
>>
stream
Ũn0x/�+�}�ǹ����\֛ bYO�5[��X��W��L��(�������V�A3�C���������u큋_�a��ךm2N�6� ��A��8
�d���NQ⺢GI��G�[��)�̉Y��R�y{R����&�&�;��g�k1���ҋeTC�(W��`���*��(;�AEc<= mnZ+��|T��v
�.��зe�aޞ��V4�b���L����k�Oj.ֿ�y�����kc|I�� ��C�0��Hf�7d�/�z���m��o��A��B��IJ�%�.
!�%f�б���&�ޒ�4Ύ7�l�3���3`�
endstream
endobj
64580 0 obj
<<
/Border [0 0 0]
/Dest <E4AE7DD2769553EF1668>
/Rect [219 648.5 256.8 659.66]
/StructParent 1323
/Subtype /Link
/Type /Annot
>>
What you see are basic COS objects like name, dictionary, stream and so on. All objects are described in PDF 32000 standard, see section 7.3 Objects.
Use a Hex editor. Of course, unless you know the PDF specification (PDF, 8.6 MB), you won't recognize much.
In addition to the qpdf tool conversion into postscript might be helpful.
PDF is a subset of PS. Usually its quite easy to figure out, e.g. where the labels of a graph are. You can either use pdf2ps or invoke ghostscript
gs -sDEVICE=pswrite some.pdf -sOutputFile=some.ps -dNOPAUSE -c quit
When you generate your PDFs using pdflatex you can disable compression with an option. This makes the PDF more readable.
Some more recent observations on the other answers.
Adobe keep moving about their Open Sourced copy of the 2008 standard so currently that is here https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
The Web Archive have currently a copy here https://ia601003.us.archive.org/5/items/pdf320002008/PDF32000_2008.pdf
They should be identical 22,491,828 bytes so beware neither includes any errata.
A pdf CAN be plain mime "text/pdf" as perfectly ? annotated generated from a console keyboard or command line (too slow) or a batch file. I wont bore you with the whole file but it starts like this:-
REM Start with File "Magic" Signatures for a PDF
echo %%PDF-1.0>!Fname!
echo %%âãÏÓ>>!Fname!
echo %%01) Prepare file references>>!Fname!
for %%Z in (!Fname!) do set "FZ1=%%~zZ"
echo 1 0 obj>>!Fname!
echo ^<^</Names^<^</Dests 2 0 R^>^>/Outlines 3 0 R>>/PageLayout/OneColumn/PageMode/UseOutlines>>!Fname!
REM ToDo add files
REM /Lang (ga-IE)/MarkInfo^<^</Marked true^>^>/Names ^<^<^/EmbeddedFiles [(file.ext) 3 0 R]^>^>>>!Fname!
echo /Pages 4 0 R/Type/Catalog/ViewerPreferences^<^</DisplayDocTitle true^>^>^>^>>>!Fname!
echo endobj>>!Fname!
echo %%02) Prepare Named Destinations>>!Fname!
Thus the annotated RAW PDF (note I had edited the order in the cmd file in preparation for an XMP data section, so not identical) could looks like :-
%PDF-1.3
%âãÏÓ
%01) Prepare file references
1 0 obj
<</Lang(ga-IE)/Names<</Dests 3 0 R>>/Outlines 4 0 R/PageLayout/OneColumn/PageMode/UseOutlines
/PageLabels<</Nums[0<</S/A>>]>>/Pages 5 0 R/Type/Catalog/ViewerPreferences<</DisplayDocTitle true>>>>
endobj
%02) Reserved for big meta data
2 0 obj
<< >>
endobj
%03) Prepare Named Destinations
3 0 obj
<</Names [(Page1) [6 0 R /XYZ 0 792 null] (QRCode) [6 0 R /XYZ 25.0 317.0 1]]>>
endobj
%04) Prepare Outline / Bookmarks
...
...
Many suggestions by others for decompress binary application/PDF into text/PDF and some may be a hybrid thus still have binarized application text.
The 3 most common designed for the task are qpdf (already mentioned, but uses a hybrid QDF) PDFtk (uncompress) and Mutool (different CLI options), that's the one I play with most, as its easy in GL GUI to change the output settings. The output can be modified in MS Notepad, whilst previewing result.
So any text editing script can write or edit a PDF even with graphics, And several applications can convert RAW "binary" PDF into RAW "textual" PDF. However never attempt to edit PDF whilst temporarily in its textual base64 RePrEx (possible, but totally impractical)

Programmatically add comments to PDF header

Has anyone had any success with adding additional information to a PDF file?
We have an electronic medical record system which produces medical documents for our users. In the past, those documents have been Print-To-File (.prn) files which we have fed to a system that displayed them as part of an enterprise medical record.
Now the hospital's enterprise medical record vendor wants to receive the documents as PDF, but still wants all of the same information stored in the header.
Honestly, we can't figure out how to put information into a PDF file that doesn't break the PDF file.
Here is the start of one of our PDFs...
%PDF-1.4
%âãÏÓ
6 0 obj
<<
/Type /XObject
/Subtype /Image
/BitsPerComponent 8
/Width 854
/Height 130
/ColorSpace /DeviceRGB
/Filter /DCTDecode
/Length 17734>>
stream
In our PRN files, we would insert information like this:
%MRN% TEST000001
%ACCT% TEST0000000000001
%DATE% 01/01/2009^16:44
%DOC_TYPE% Clinical
%DOC_NUM% 192837475
%DOC_VER% 1
My question is, can I insert this information into a PDF in a manner which allows the document server to perform post-processing, yet is NOT visible to the doctor who views the PDF?
Thank you,
David Walker
Yes, you can. Any line in a PDF file that starts with a percent sign is a comment and as such ignored (the first two lines of the PDF actually are comments as well). So you can pretty much insert your information into the PDF as you did into the PRN.
However:
The PDF format works with byte position references, so if you insert data into a finished PDF file, this will push the rest of the data away from their original position and thus break the file. You can also not append it to the file, because a PDF file has to end with
startxref
123456
%%EOF
(the 123456 is an example). You could insert your data right before these three lines. The byte position of the "startxref" part is never referenced anywhere, so you won't break anything if you push this final part towards the end.
Edit: This of course assumes there is no checksumming, signing or encryption going on. That would make things more complicated.
Edit 2: As Javier pointed out correctly, you can also just add your data to the end and just add a copy of the three lines to the end of that. Boils down to the same thing, but it's a little easier.
PDFs are supposed to have multiple versions just appending at the end; but the very end must have the offset to the main reference table. Just read the last three lines, append your data and reattach the original ending.
You can either remove the original ending or let it there. PDF readers will just go to the end and use the second-to-last line to find the reference table.
Have you ever thought to embed your additional info inside the PDF as a separate file?
The generic PDF specification allows to "attach files" to PDFs. Attached files can be anything: *.txt, *.doc, *.xsl, *.html or even .pdf. Attached files are contained in the PDF "container" file without corrupting the container's own content. (Special-purpose PDF specifications such as PDF/A- and PDF/X-* may impose some restrictions about embedded/attached files.)
That allows you to tie additional info and/or data to PDF files and allow for common storage and processing. Attached files are supposed to not disturb any PDF viewer's rendering.
I've used that feature frequently, for various purposes:
store the parent document (like .doc) inside the .pdf from which the .pdf was created in the first place;
tag a job ticketing information to a printfile that is sent to the printshop;
etc.pp.
Of course, recently discovered and published flaws in PDF processing software (and in the PDF spec itself) suggest to stay away from embedding/attaching binary files to PDF files --
because more and more Readers will by default stop you from easily extracting/detaching the embedded/attached files.
However, there is no reason why you shouldn't be able to put your additional info into a medical-record-info.txt file of arbitrary lenght and internal format and attach it to the PDF:
MRN TEST000001
ACCT TEST0000000000001
DATE 2009-01-01
TIME 16:44:33.76
DOC_TYPE Clinical
DOC_NUM 192837475
DOC_VER 1
MORE_INFO blah blah
Hi, guys,
can you please process this file faster than usual? If you don't,
someone will be dying.
Seriously, David.
FWIW, the commandline tools pdftk.exe (Windows) and pdftk (Linux) are able to attach and detach embedded files from their container PDF. Acrobat Reader can also handle attachments.
You could setup/program/script your document server handling the PDF to automatically detach the embedded .txt file and trigger actions according to its content.
Of course, the doctor who views the PDF would be able to see there is a file attachment in the PDF. But it wouldn't appear in his "normal" viewing. He'd have to take specific additional actions in order to extract and view it. (And then there is the option to set a password on the PDF to protect it from un-authorized file detachments. And/or encode, obscure, rot13 the .txt. Not exactly rock-solid methods, but 99% of doctors wouldn't be able to accomplish it even if you teach them how to...)
You can still insert comments into a PDF file using the % character. But anyone would be able to access with a text editor.
Your vendor could remove these comments after post-processing, so it doesn't actually get to the doctors.
You can store the data as real PDF metadata. For example, with CAM::PDF you can write metadata like this:
use CAM::PDF;
my $pdf = CAM::PDF->new('temp.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
$info->{PRN} = CAM::PDF::Node->new('dictionary', {
DOC_TYPE => CAM::PDF::Node->new('string', 'Clinical'),
DOC_NUM => CAM::PDF::Node->new('number', 192837475),
DOC_VER => CAM::PDF::Node->new('number', 1),
});
$pdf->cleanoutput('out.pdf');
The Info node of the PDF then looks like this:
8 0 obj
<< /CreationDate (D:20080916083455-04'00')
/ModDate (D:20080916083729-04'00')
/PRN << /DOC_NUM 192837475 /DOC_TYPE (Clinical) /DOC_VER 1 >> >>
endobj
You can read the PRN data back out like so (simplistic code...)
my $pdf = CAM::PDF->new('out.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
my $prn = $info->{PRN};
if ($prn) {
my $prndict = $pdf->getValue($prn);
for my $key (sort keys %{$prndict}) {
print "$key = ", $pdf->getValue($prndict->{$key}), "\n";
}
}
Which makes output like this:
DOC_NUM = 192837475
DOC_TYPE = Clinical
DOC_VER = 1
PDF supports arbitrarily nested arrays, dictionaries and references so just about any data can be represented. For example, I built an entire filesystem embedded in a PDF just for fun!
At one point we were changing some Acrobat JS code by doing a text replace in a plain (unencrypted) PDF. The trick was that the lengths of each PDF block were hard coded in the document. So, we could not change the number of characters. We would just add extra spaces.
It worked great, the JS code executed an all.
Have you thought about using XMP?