Export PDF page labels on command line - pdf

I'd like to export the page-labels stored in some PDF documents for easy parsing. I know I could dig into the PDF document after having it converted with qpdf, but this seems like overkill.
Is there no commandline tool that will simply print the page label for each page (or together with other meta-data)? I know that PDFSpy will export the label, but $300 isn't an option, preferably the solution should be free.

Short answer:
I am not aware of any (free) tool that can 'simply print' the page label for each page.
Also, you'll not be able to evade the expansion compressed objects and object streams, using a tool like qpdf or one with equivalent capabilities.
Long answer:
There's no such tool because these are the only a few things you can safely rely on when it comes to page labels. These are the following:
Each PDF document must contain a root object.
That root object must be of /Type /Catalog.
The document's trailer will show where to find the object using the key /Root followed by the indirect object number reference.
IF a PDF document uses non-standard page labels, then the document root object must have an entry named /PageLabels.
Here is where it stops to be relatively easy. Because the object the /PageLabels key refers to may be contained in a compressed object stream. This means that you'd have to expand that object stream.
If you really succeeded to get the description of the page labels as ASCII, you'll discover that it's not an easily parseable flat list (like a dictionary is): it is a number tree.
I'll not go into the details of these complexities, because it would take a very long article to describe all possible variations. You better read it up directly in the official ISO PDF-1.7 specification.
But instead I'll give you an example in ASCII PDF code:
213 0 obj
<< /Type /Catalog
/PageLabels
<<
/Nums
[
0 << % start labeling from page no. 1
/S /r % label with lowercase roman numbers
>>
7 << % start new labeling from page no. 8
/S /D % label with standard decimal numbers
>>
11 << % start labeling page no. 12
/S /D % label with decimal numbers...
/P (ABCD-) % ...but using label prefix 'ABCD-'...
/St 3 % ...followed by '3' as the start decimal.
>>
]
>>
%%...........................
%%...more root object keys...
%%...........................
>>
endobj
The above example will label the pages number 1, 2, 3, ... (last) like this:
i
ii
iii
iv
v
vi
1
2
3
4
ABCD-3
ABCD-4
ABCD-5
ABCD-6
...and so on until last page...
As you can see, the PDF method of labeling pages (mapping page numbers to page names) is completely non-intuitive. You can only understand it by studying the PDF specification.

I've written a small command-line utility based on Poppler that does just this task: https://github.com/HeimMatthias/pdfpagelabels
Disclaimer: I'm the OP and created the original post under a different account. I have been using the solution via pdftk (listed in a comment above) successfully for years in my implementation. However, last year it was time to reimplement our system from scratch and we've had numerous instances where the pdf-tk output could not be parsed by our implementation.
The new command-line tool follows the philosophy of doing just one thing, but doing it well, and simply prints the page labels of all or selected pages of a pdf-file. If anyone finds this useful, and stumbles upon it here, all the better for it.

Related

Find tagged content in PDF/A-1a using pdfbox

I have what I presume to be a PDF/A-1a file that was generated by apache fop and has an overlay letterhead put on using OverlayPDF from pdfbox. preflight recognizes the file as ok (but obviously only PDF/A-1b) and Acroreader says it is "PDF/A" mode and "Tagged: yes" in the document properties. I would like to see how that looks so I could maybe tweak fop into some small improvements.
My question is, where can I look to see the tagged content (i.e. the text representation of what in PDF is a kerned sequence of char outputs), preferably without coding myself, e.g. using the debugger/PDFReader from pdfbox? I'm a little lost there - is there an alternative way getting a textual output of the document structure e.g. into an xml file to search it using an editor? - TIA!
Edit
The letterhead(s) itself is originally postscript and converted to PDF/A-1b using ghostscript, then overlayed with
java -jar pdfbox-app-2.0.0-RC3.jar OverlayPDF letter_plain.pdf \
followingpages_letterhead.pdf -first firstpage_letterhead.pdf \
letter_with_head.pdf
The letter_plain.pdf is generated with fop using
fop -pdfprofile 'PDF/A-1a' -v -d -c my_fop_config.cfg -xml letter.xml \
-xsl letter_to_fo.xsl -pdf letter_plain.pdf
The versions used are pdfbox 2.0 and fop 1.1.
In case the letter_with_head.pdf would no longer be PDF/A-1a then the question would apply to the letter_plain.pdf which should be 1a as per the fop call, would have to choose a different solution (like svg) to get the letterhead in then.
Edit 2
Example pdfs can be found here: https://www.magentacloud.de/share/j9qk7jfzyv - there is no need for a separate followingpages_letterhead.pdf as the sample is only one page.
Edit 3
I have suspicion that text is buried somewhere below Root/StructTreeRoot/ParentTree/Nums/[1]/[3]/P/P/P/P/P/P (assuming that the P's somehow map the fo:block's) but can't get nowhere showing text from the pdf.
The structure tree entries in the PDF at hand maps to marked content in the pages content stream. As an example the entry in
Root/StructTreeRoot/K/[0]/K/[0]/K/[1]/K/[0]/K/[0]/K/[0]/K/[0]
maps to this part of the pages content stream
/Span << /MCID 0 >> BDC
BT
/F15 11 Tf
1 0 0 -1 0 9.163 Tm
[ (Bes) 15 (tell-Nr) 48 (. 1) 34 (23) 6 (456) 29 (7) 40 (8) ] TJ
ET
EMC
As can be seen there is no additional definition so there is no easily displayable text other than parsing the TJoperator in this example sequence. So the tagging is used to define the structure of the document pointing to different building blocks only.
In addition there is some information for Accessibility Support. But that's limited to specifying the Langattribute in the structure tree.

can I create a PDF from ImageMagick which will always open at zoom 100%? [duplicate]

I am running into an issue with PNG to PDF conversion.
Actually I have big PNG files not in size but in contents.
In PDF conversion it creates a big PDF files. I don't have any issue with its quality, but whenever I try to open this PDF in PDF viewer, it opens in "Fit to Page" mode.
So, I can't see the created PDF in the initial view, but I need to zoom it up to 100%.
My question is: can I create a PDF which will always open at zoom 100% ?
You can possibly achieve what you want with the help of Ghostscript.
Ghostscript supports to insert PostScript snippets into its command line parameters via -c "...[PostScript code here]...".
PostScript has a special operator called pdfmark. This operator is not understood by most PostScript interpreters, but is understood by Acrobat Distiller and (for most of its parameters) also by Ghostscript when generating PDFs.
So you could try to insert
-c "[ /PageMode /UseNone /Page 1 /View [/XYZ null null 1] \
/PageLayout /SinglePage /DOCVIEW pdfmark"
into a PDF->PDF conversion Ghostscript command line.
Please take note about various basic things concerning this snippet:
The contents of the command line snippet appears to be 'unbalanced' regarding the [ and ] operators/keywords. But it is not! The initial [ is balanced by the final pdfmark keyword. (Don't ask -- I did not define this syntax...)
The 'inner' [ ... ] brackets delimit an array representing the page /View settings you desire.
Not all PDF viewers do respect the view settings embedded in the PDF file (Acrobat software does!).
Most PDF viewers allow users to override the view settings embedded in PDF files (Acrobat software also does this). That is, you can tell your viewer to never respect any settings from the PDF files it opens, but f.e. to always open it with "fit to width".
Some specific things about this snippet:
The page mode /UseNone means: the document displays without bookmarks or thumbnails. It could be replaced by
/UseOutlines (to display bookmarks also, not just the pages)
/UseThumbs (to display thumbnail images of the pages, not just the pages
/FullScreen (to open document in full screen mode)
The array for the view mode constructed as [/XYZ <left> <top> <zoom>] means: The zoom factor is 1 (=100%), the left distance from the page origin is the special 'null' value, which means to keep the previously user-set value; the top distance from the page origin is also 'null'. This array could be replaced by
/Fit (to adapt the page to the current window size)
/FitB (to adapt the visible page content to the current window size)
/FitH <top>' (to adapt the page width to the current window width);` indicates the required distance from page origin to upper edge of window.
...plus several others I cannot remember right now.
So to change the settings of an existing PDF file, you could do the following:
gs \
-o out.pdf \
-sDEVICE=pdfwrite \
-c "[ /PageMode /UseNone /Page 1 /View [ /XYZ null null 1 ] " \
-c " /PageLayout /SinglePage /DOCVIEW pdfmark" \
-f in.pdf
To check if the Ghostscript command worked, open the PDF in a text editor which is capable of handling binary files. Search for the /View or the /PageMode keywords and check if they are there, inserted as values into the PDF root object.
If it worked, check if your PDF viewer honors the settings. If it doesn't honor them, see if there is an overriding setting within the viewers preference settings.
I did a quick test run on a sample PDF of mine. Here is how the PDF root object's dictionary looks now, checked with the help of pdf-parser.py:
pdf-parser-beta.py -s Catalog a.pdf
obj 1 0
Type: /Catalog
Referencing: 3 0 R, 9 0 R
<<
/Type /Catalog
/Pages 3 0 R
/PageMode /UseNone
/Page 1
/View [/XYZ null null 1]
/PageLayout /SinglePage
/Metadata 9 0 R
>>
To learn more about the pdfmark operator, google for 'pdfmark reference filetype:pdf'. You should be able to find it on the Adobe website and elsewhere:
https://www.google.de/search?q=pdfmark%20reference%20filetype%3Apdf&oq=pdfmark%20reference%20filetype%3Apdf
In order to let ImageMagick create a PDF as you want it, you may be able to hack the file defining your delegate settings. For more help about this topic see for example here:
http://www.imagemagick.org/Usage/files/#delegates
PDF specification supports this functionality in this way: create a GoTo action that goes to first page and sets the zoom level to 100% and then set the action as the document open action.
How exactly you implement it in real life depends very much on the tool you use to create the PDF file. I do not know if ImageMagick can create such actions.

How to annotate PS or PDF from (Linux) command line without losing quality?

Is there any command line tool for Linux that will allow me to annotate a PS or PDF file with text or a particular font, color, and size with no loss of quality? I have tried ImageMagick's convert, and the resulting PDF is of pretty poor quality.
I have a template originally authored in Adobe Illustrator, and I would like to generate PDFs from it with names in certain places. I have a huge list of names, so I would like to do this in a batch (not interactively).
If anyone has any ideas I'd appreciate hearing them.
Thanks,
Carl
Another way to accomplish this would be to hack the postscript file itself. It used to be that AI files were postscript files, and you could modify them directly; I don't know if that's true anymore. So you may have to export it.
For simplicity, I assume there's a single page. Therefore, at the very end there will be a single call to showpage (perhaps through another name). Any drawing commands performed before showpage will show up on the page.
You may need to reinitialize the graphics state (initgraphics), as the rest of the document may have left it all funny, expecting showpage to clean up before anyone notices.
To place text, you'll need to set a new font (the old one was invalidated by initgraphics) measure the location in points (72 points/inch, 28.3465 points/cm).
/Palatino-Roman 17 selectfont %so much prettier than Times
x y moveto
(new text) show
To do the merging, you can use perl: emit the beginning of the document as a HERE-document, construct some text-writing lines by program, emit the tail of the document. Here's an example of generating postscript with PERL
Or you can take data from the command-line (with ghostscript) by using the -- option ($gs -q -- program.ps arg1 arg2 ... argn). These arguments are accessible to the program through an array named /ARGUMENTS.
So, say you have a nice graphic of a scary clown holding a blank sign about 1 inch wide, 3 inches tall, top left corner at 4 inches from the left, 4 inches from the bottom. You can insert this code into the ps program, just before showpage.
initgraphics
/Palatino-Roman 12 selectfont
4 72 mul 4 72 mul moveto
ARGUMENTS {
gsave show grestore 0 -14 rmoveto
} forall
Now you can make him say funny things ($gs -- clown.ps "On a dark," "and stormy night...").
I think it's better to create PDF form and fill it with pdftk fill_form in batch:
$ pdftk form.pdf fill_form data.fdf output out.pdf flatten
Form data should be in Forms Data Format (it's just XML file with field names and values specified).
Note the flatten command. It is required to convert filled form to plain document.
Another way is to create set of PDF documents "with names in certain places" and transparent background, and pdftk stamp each of them over the template:
$ pdftk template.pdf stamp words.pdf output out.pdf

How to open PDF raw?

I've been wanting to see the insides of a PDF for a while, like, the raw source code of it so I can look at it. Any way of doing that?
Looking at the raw code of PDFs will not serve you much unless you also have an idea about its internal structure. You should get yourself a copy of the official PDF reference (download PDF), and you should have read some introductionary article such as this [gone] or this to begin with.
Even after such a preparation, you'll not discover much useful when staring at the raw code. Because PDFs usually will contain parts which are "filtered" (that means: compressed).
How to look at the real PDF source behind the 'raw' binary parts
Jay Birkenbilt's qpdf is a very useful commandline tool (available for Linux, Mac OSX, Windows, and as source code, under the open source Artistic License), which can unpack most filtered content and re-organize the internal structure in a way that gives you much more insight into it (all objects are numerically ordered, etc.). The commandline to achieve this is:
qpdf --qdf original.pdf unpacked.pdf
Another useful and free tool (GPL licensed, but Linux-only AFAIK) to look into PDFs is of course PDFEdit. This one even comes with a GUI (if you prefer that), while still allowing you access to the internal structure and "raw" PDF code.
If the purpose is just to look into the file, then any simple text editor will do, ex, Notepad. PDF is just a text based format, including embedded content byte streams. Raw PDF looks like this:
>>
/Border [0 0 0]
/Rect [121.02 332.48 363.24 343.64]
/StructParent 1321
/Subtype /Link
/Type /Annot
>>
endobj
64579 0 obj
<<
/Filter /FlateDecode
/Length 5771
>>
stream
Ũn0x/�+�}�ǹ����\֛ bYO�5[��X��W��L��(�������V�A3�C���������u큋_�a��ךm2N�6� ��A��8
�d���NQ⺢GI��G�[��)�̉Y��R�y{R����&�&�;��g�k1���ҋeTC�(W��`���*��(;�AEc<= mnZ+��|T��v
�.��зe�aޞ��V4�b���L����k�Oj.ֿ�y�����kc|I�� ��C�0��Hf�7d�/�z���m��o��A��B��IJ�%�.
!�%f�б���&�ޒ�4Ύ7�l�3���3`�
endstream
endobj
64580 0 obj
<<
/Border [0 0 0]
/Dest <E4AE7DD2769553EF1668>
/Rect [219 648.5 256.8 659.66]
/StructParent 1323
/Subtype /Link
/Type /Annot
>>
What you see are basic COS objects like name, dictionary, stream and so on. All objects are described in PDF 32000 standard, see section 7.3 Objects.
Use a Hex editor. Of course, unless you know the PDF specification (PDF, 8.6 MB), you won't recognize much.
In addition to the qpdf tool conversion into postscript might be helpful.
PDF is a subset of PS. Usually its quite easy to figure out, e.g. where the labels of a graph are. You can either use pdf2ps or invoke ghostscript
gs -sDEVICE=pswrite some.pdf -sOutputFile=some.ps -dNOPAUSE -c quit
When you generate your PDFs using pdflatex you can disable compression with an option. This makes the PDF more readable.
Some more recent observations on the other answers.
Adobe keep moving about their Open Sourced copy of the 2008 standard so currently that is here https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
The Web Archive have currently a copy here https://ia601003.us.archive.org/5/items/pdf320002008/PDF32000_2008.pdf
They should be identical 22,491,828 bytes so beware neither includes any errata.
A pdf CAN be plain mime "text/pdf" as perfectly ? annotated generated from a console keyboard or command line (too slow) or a batch file. I wont bore you with the whole file but it starts like this:-
REM Start with File "Magic" Signatures for a PDF
echo %%PDF-1.0>!Fname!
echo %%âãÏÓ>>!Fname!
echo %%01) Prepare file references>>!Fname!
for %%Z in (!Fname!) do set "FZ1=%%~zZ"
echo 1 0 obj>>!Fname!
echo ^<^</Names^<^</Dests 2 0 R^>^>/Outlines 3 0 R>>/PageLayout/OneColumn/PageMode/UseOutlines>>!Fname!
REM ToDo add files
REM /Lang (ga-IE)/MarkInfo^<^</Marked true^>^>/Names ^<^<^/EmbeddedFiles [(file.ext) 3 0 R]^>^>>>!Fname!
echo /Pages 4 0 R/Type/Catalog/ViewerPreferences^<^</DisplayDocTitle true^>^>^>^>>>!Fname!
echo endobj>>!Fname!
echo %%02) Prepare Named Destinations>>!Fname!
Thus the annotated RAW PDF (note I had edited the order in the cmd file in preparation for an XMP data section, so not identical) could looks like :-
%PDF-1.3
%âãÏÓ
%01) Prepare file references
1 0 obj
<</Lang(ga-IE)/Names<</Dests 3 0 R>>/Outlines 4 0 R/PageLayout/OneColumn/PageMode/UseOutlines
/PageLabels<</Nums[0<</S/A>>]>>/Pages 5 0 R/Type/Catalog/ViewerPreferences<</DisplayDocTitle true>>>>
endobj
%02) Reserved for big meta data
2 0 obj
<< >>
endobj
%03) Prepare Named Destinations
3 0 obj
<</Names [(Page1) [6 0 R /XYZ 0 792 null] (QRCode) [6 0 R /XYZ 25.0 317.0 1]]>>
endobj
%04) Prepare Outline / Bookmarks
...
...
Many suggestions by others for decompress binary application/PDF into text/PDF and some may be a hybrid thus still have binarized application text.
The 3 most common designed for the task are qpdf (already mentioned, but uses a hybrid QDF) PDFtk (uncompress) and Mutool (different CLI options), that's the one I play with most, as its easy in GL GUI to change the output settings. The output can be modified in MS Notepad, whilst previewing result.
So any text editing script can write or edit a PDF even with graphics, And several applications can convert RAW "binary" PDF into RAW "textual" PDF. However never attempt to edit PDF whilst temporarily in its textual base64 RePrEx (possible, but totally impractical)

Programmatically add comments to PDF header

Has anyone had any success with adding additional information to a PDF file?
We have an electronic medical record system which produces medical documents for our users. In the past, those documents have been Print-To-File (.prn) files which we have fed to a system that displayed them as part of an enterprise medical record.
Now the hospital's enterprise medical record vendor wants to receive the documents as PDF, but still wants all of the same information stored in the header.
Honestly, we can't figure out how to put information into a PDF file that doesn't break the PDF file.
Here is the start of one of our PDFs...
%PDF-1.4
%âãÏÓ
6 0 obj
<<
/Type /XObject
/Subtype /Image
/BitsPerComponent 8
/Width 854
/Height 130
/ColorSpace /DeviceRGB
/Filter /DCTDecode
/Length 17734>>
stream
In our PRN files, we would insert information like this:
%MRN% TEST000001
%ACCT% TEST0000000000001
%DATE% 01/01/2009^16:44
%DOC_TYPE% Clinical
%DOC_NUM% 192837475
%DOC_VER% 1
My question is, can I insert this information into a PDF in a manner which allows the document server to perform post-processing, yet is NOT visible to the doctor who views the PDF?
Thank you,
David Walker
Yes, you can. Any line in a PDF file that starts with a percent sign is a comment and as such ignored (the first two lines of the PDF actually are comments as well). So you can pretty much insert your information into the PDF as you did into the PRN.
However:
The PDF format works with byte position references, so if you insert data into a finished PDF file, this will push the rest of the data away from their original position and thus break the file. You can also not append it to the file, because a PDF file has to end with
startxref
123456
%%EOF
(the 123456 is an example). You could insert your data right before these three lines. The byte position of the "startxref" part is never referenced anywhere, so you won't break anything if you push this final part towards the end.
Edit: This of course assumes there is no checksumming, signing or encryption going on. That would make things more complicated.
Edit 2: As Javier pointed out correctly, you can also just add your data to the end and just add a copy of the three lines to the end of that. Boils down to the same thing, but it's a little easier.
PDFs are supposed to have multiple versions just appending at the end; but the very end must have the offset to the main reference table. Just read the last three lines, append your data and reattach the original ending.
You can either remove the original ending or let it there. PDF readers will just go to the end and use the second-to-last line to find the reference table.
Have you ever thought to embed your additional info inside the PDF as a separate file?
The generic PDF specification allows to "attach files" to PDFs. Attached files can be anything: *.txt, *.doc, *.xsl, *.html or even .pdf. Attached files are contained in the PDF "container" file without corrupting the container's own content. (Special-purpose PDF specifications such as PDF/A- and PDF/X-* may impose some restrictions about embedded/attached files.)
That allows you to tie additional info and/or data to PDF files and allow for common storage and processing. Attached files are supposed to not disturb any PDF viewer's rendering.
I've used that feature frequently, for various purposes:
store the parent document (like .doc) inside the .pdf from which the .pdf was created in the first place;
tag a job ticketing information to a printfile that is sent to the printshop;
etc.pp.
Of course, recently discovered and published flaws in PDF processing software (and in the PDF spec itself) suggest to stay away from embedding/attaching binary files to PDF files --
because more and more Readers will by default stop you from easily extracting/detaching the embedded/attached files.
However, there is no reason why you shouldn't be able to put your additional info into a medical-record-info.txt file of arbitrary lenght and internal format and attach it to the PDF:
MRN TEST000001
ACCT TEST0000000000001
DATE 2009-01-01
TIME 16:44:33.76
DOC_TYPE Clinical
DOC_NUM 192837475
DOC_VER 1
MORE_INFO blah blah
Hi, guys,
can you please process this file faster than usual? If you don't,
someone will be dying.
Seriously, David.
FWIW, the commandline tools pdftk.exe (Windows) and pdftk (Linux) are able to attach and detach embedded files from their container PDF. Acrobat Reader can also handle attachments.
You could setup/program/script your document server handling the PDF to automatically detach the embedded .txt file and trigger actions according to its content.
Of course, the doctor who views the PDF would be able to see there is a file attachment in the PDF. But it wouldn't appear in his "normal" viewing. He'd have to take specific additional actions in order to extract and view it. (And then there is the option to set a password on the PDF to protect it from un-authorized file detachments. And/or encode, obscure, rot13 the .txt. Not exactly rock-solid methods, but 99% of doctors wouldn't be able to accomplish it even if you teach them how to...)
You can still insert comments into a PDF file using the % character. But anyone would be able to access with a text editor.
Your vendor could remove these comments after post-processing, so it doesn't actually get to the doctors.
You can store the data as real PDF metadata. For example, with CAM::PDF you can write metadata like this:
use CAM::PDF;
my $pdf = CAM::PDF->new('temp.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
$info->{PRN} = CAM::PDF::Node->new('dictionary', {
DOC_TYPE => CAM::PDF::Node->new('string', 'Clinical'),
DOC_NUM => CAM::PDF::Node->new('number', 192837475),
DOC_VER => CAM::PDF::Node->new('number', 1),
});
$pdf->cleanoutput('out.pdf');
The Info node of the PDF then looks like this:
8 0 obj
<< /CreationDate (D:20080916083455-04'00')
/ModDate (D:20080916083729-04'00')
/PRN << /DOC_NUM 192837475 /DOC_TYPE (Clinical) /DOC_VER 1 >> >>
endobj
You can read the PRN data back out like so (simplistic code...)
my $pdf = CAM::PDF->new('out.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
my $prn = $info->{PRN};
if ($prn) {
my $prndict = $pdf->getValue($prn);
for my $key (sort keys %{$prndict}) {
print "$key = ", $pdf->getValue($prndict->{$key}), "\n";
}
}
Which makes output like this:
DOC_NUM = 192837475
DOC_TYPE = Clinical
DOC_VER = 1
PDF supports arbitrarily nested arrays, dictionaries and references so just about any data can be represented. For example, I built an entire filesystem embedded in a PDF just for fun!
At one point we were changing some Acrobat JS code by doing a text replace in a plain (unencrypted) PDF. The trick was that the lengths of each PDF block were hard coded in the document. So, we could not change the number of characters. We would just add extra spaces.
It worked great, the JS code executed an all.
Have you thought about using XMP?