How to decode PDF file and encode it back? - pdf

My overall goal is to make some PDF files conform to the PDF/A standard for archival purposes. They fail one requirement, namely that some glyph mappings map to 0, which they should not.
My usual strategy was to use an old software called "Pdfedit" that could decode PDF-Files, all the byte-streams would then be human-readable, edit the relevant part of the PDF containing the glyph mappings, and open the file with Adobe Acrobat that automatically re-encoded it.
Now I have some PDFs that cause "Pdfedit" to crash upon opening. I tried using PDF-Parser but its output cannot be re-encoded by Adobe Acrobat.
Also, the relevant parts used to look like this decoded:
/CMapType 2 def
1 begincodespacerange
<00><04>
endcodespacerange
5 beginbfchar
<00><0000>
<01><0000>
<02><263A>
<03><0000>
<04><0000>
endbfchar
endcmap
But now I use the following command python3 pdf-parser.py -f -n /path/to/file.pdf > dump.txt and inside dump.txt the relevant part looks like this:
b'/CMapType 2 def\n1 begincodespacerange\n<00><04>\nendcodespacerange\n5 beginbfchar\n<00><0000>\n<01><0000>\n<02><263A>\n<03><0000>\n<04><0000>\nendbfchar\nendcmap\nCMapName currentdict/CMap defineresource pop end end'
So it is a bytestring and any linebreak is rendered literally as \n. The txt file that contains this cannot be interpreted as a PDF by Adobe Acrobat.
I have now also realized that many elements such as %%EOF are delimited by ''.
The true issue is how to get an Acrobat-readable output from pdf-parser.py, as the shell-command > does not work and stdout in the shell is also faulty.
I will try out a few things but could really need some help on this!

Answering my own question in case this is relevant for someone down the line.
Didier Stevens, the dev behind the pdf-parser, answered that his tool is not made for this. He recommended qpdf instead.
That was indeed the solution. Make sure you use the flag --stream-data=uncompress so that compressed parts are also accessible in the output. The command to use with qpdf is:
qpdf old_file.pdf --stream-data=uncompress --decode-level=all new_file.txt
You can output new_file also as .pdf. In any case you will be able to open it in the text editor. Once you're done applying the changes you wish to apply, you can change the ending to pdf and process it further with acrobat or any other conversion program.

Related

Inkscape "PDF + Latex" export

I'm using inkscape to produce vector figures, save them in SVG format to export them later as "PDF + Latex" much in the vein of TUG inkscape+pdflatex guide.
Trying to produce a simple figure, however, turns out to be extremely frustating.
The first figure
is an example of the figure I would like to export in the form of "PDF + Latex" (shown here in PNG format).
If I export this to a PDF figure without latex macros the PDF produced looks exactly the same, except for some minor differences with the fonts used to render the text.
When I try to export this using the "PDF + Latex" option the PDF file produced consists on a PDF document of 2 pages (again as .png here):
This, of course, does not looks good when compiling my latex document. So far the guide at TUG has been very helpful, but I still can't produce a working "PDF + Latex" export from inkscape.
What am I doing wrong?
I worked around this by putting all the text in my drawing at the top
select text and then Object -> Raise to top
Inkscape only generates the separate pages if the text is below another object.
I asked this question on the Inkscape online discussion page and got some very helpful guidance from one of the users there.
This is a known bug https://bugs.launchpad.net/ubuntu/+bug/1417470 which was inadvertently introduced in Inkscape 0.91 in an attempt to fix a previous bug https://bugs.launchpad.net/inkscape/+bug/771957.
It seems this bug does two things:
The *.pdf_tex file will have an extra \includegraphics statement which needs to be deleted manually as described in the link to the bug above.
The *.pdf file may be split into multiple pages, regardless of the size of the image. In my case the line objects were split off onto their own page. I worked around this by turning off the text objects (opacity to zero) and then doing a standard PDF export.
If you can execute linux commands, this works:
# Generate the .pdf and .pdf_tex files
inkscape -z -D --file="$SVGFILE" --export-pdf="$PDFFILE" --export-latex
# Fix the number of pages
sed -i 's/\\\\/\n/g' ${PDFFILE}_tex;
MAXPAGE=$(pdfinfo $PDFFILE | grep -oP "(?<=Pages:)\s*[0-9]+" | tr -d " ");
sed -i "/page=$(($MAXPAGE+1))/,\${/page=/d}" ${PDFFILE}_tex;
with:
$SVGFILE: path of the svg
$PDF_FILE: path of the pdf
It is possible to include these commands in a script and execute it automatically when compiling your tex file (so that you don't have to manually export from inkscape each time you modify your svg).
Try it with an illustration that is less wide.
Alternatively, use a wider paperwidth setting.

How can I drop metadata fields (e.g., PageLabel fields) from PDFs?

I have used pdftk to change the "Info" metadata associated with a PDF. I currently have several PDFs with extraneous page labels and I cannot figure how to drop them. This is what I am currently doing:
$ pdftk example_orig.pdf dump_data output page_labels.orig
$ grep -v PageLabel page_labels.orig > page_labels.new
$ pdftk example_orig.pdf update_info page_labels.new output example_new.pdf
This does not remove the PageLabel* metadata which can be verified with:
$ pdftk example_orig.pdf dump_data | grep PageLabel
How can I programmatically remove this metadata from the PDF? It would be nice to do with with pdftk but if there another tool or way to do this on GNU/Linux, that would also work for me.
I need this because I am using LaTeX Beamer to generate presentations with the \setbeameroption{show notes on second screen} option which generates a double-width PDF for showing notes on a second screen. Unfortunately, there seems to be a bug in pgfpages which results in incorrect and extraneous PageLabels in these files (example). If I generate a slides only PDF, it will generates the correct PageLabels (example). Since I can generate a correct set of PageLabels, one solution would be to replace the pagelabels in the first examples with those in the second. That said, since there are extra pagelabels in the first example, I would need to remove them first.
Using a text editor to remove PDF metadata
If it is the first time you edit a PDF, make a backup copy first.
Open your PDF with a text editor that can handle binary blobs. vim -b will be fine.
Locate the /Info dictionary. Overwrite all the entries you do not want any more completely with blanks (an entry consists of /Key names plus the (some values) following them).
Be careful to not use more spaces than there were characters initially. Otherwise your xref table (ToC of PDF objects will be invalidated, and some viewers will indicate the PDF as corrupted).
For additional measure, locate the /XML string in your PDF. It should show you where your XMP/XML metadata section is (not all PDFs have them). Locate all the key values (not the <something keys>!) in there which you want to remove. Again, just overwrite them with blanks and be careful not to change the total length (neither longer, nor shorter).
In case your PDF does not make the /Info dictionary accessible, transform it with the help of qpdf.
Use this command:
qpdf --qdf --object-streams=disable orig.pdf qdf---orig.pdf
Apply the procedure outlined above. (The qdf---orig.pdf now should be much better suited for
Re-compact your edited file:
qpdf qdf---orig.pdf edited---orig.pdf
Done! Enjoy your edited---orig.pdf. Check if it has all the data removed:
pdfinfo -meta edited---orig.pdf
Update
After looking at the sample PDF files provided, it became clear to me that the /PageLabel key is not part of the /Info dictionary (PDF's Document Information Dictionary), but of the /Root object.
That's probably one reason why pdftk was unable to update it with the method the OP described.
The other reason is the following: the PDF which the OP quoted as containing the correct page labels does in fact contain incorrect ones!
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 2
3 | 2
4 | 2
5 | 2
6 | 4
The other PDF (which supposedly contains extraneous page labels) is incorrect in a different way:
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
6 | 4
My original advice about how to manually edit the classical metadata of a PDF remains valid. For the case of editing page labels you can apply the same method with a slight variation.
In the case of the OP's example files, the complication comes into play: the /Root object is not directly accessible, because it is hidden inside a compressed object stream (PDF object type /ObjStm). That means one has to decompress it with the help of qpdf first:
Use qpdf:
qpdf --qdf --object-streams=disable example_presentation-NOTES.pdf q-notes.pdf
Open the resulting file in binary mode with vim:
vim -b q-notes.pdf
Locate the 1 0 obj marker for the beginning of the /Root object, containing a dictionary named /PageLabels.
(a) To disable page labels altogether, just replace the /PageLabels string by /Pagelabels, using a lowercase 'l' (PDF is case sensitive, and will no longer recognize the keyword; you yourself could at some other time restore the original version should you need it.)
(b) To edit the page labels, first see how the consecutive labels for pages 1--6 are being referred to as
<feff0031>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
(These values are in BOM-marked hex, meaning 1, 2, 2, 2, 3, 4...)
Edit these values to read:
<feff0031>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
[....]
<feff0035>
[....]
<feff0036>
Save the file and run qpdf again in order to re-compress the PDF:
qpdf q-notes.pdf notes.pdf
These now hopefully are the page labels the OP is looking for....
Since the OP seems to be familiar with editing pdftk's output of dump_data output, he can possibly edit the output and use update_data to apply the fix to the PDF without needing to resort to qpdf and vim.
Update 2:
User #Iserni posted a very good, short and working answer, which limits itself to one command, pdftk, which the OP seems to be familiar with already, plus sed -- not needing to use a text editor to open the PDF, and not introducing an additional utility qpdf like my answer did.
Unfortunately #Iserni deleted it again after a comment of mine. I think his answer deserves to get the bounty and I call you to vote to "undelete" his answer!
So temporarily, I'll include a copy of #Iserni's answer here, until his is undeleted again:
Not sure if I correctly understood the problem. You can try with a butcher's solution: brute force replace the /PageLabels block with a different one which will not be recognized.
# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress
# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
# Recompress
pdftk mangled.pdf output final.pdf compress
# Remove temp file
rm -f temp.pdf mangled.pdf
Not sure if I correctly understood the problem. You can try with a butcher's solution: brute force replace the /PageLabels block with a different one which will not be recognized.
# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress
# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
# Recompress
pdftk mangled.pdf output final.pdf compress
rm -f temp.pdf mangled.pdf

Rendering the whole media box of a pdf page into a png file using ghostscript

I'm trying to render Pdfs pages into png files using Ghostscript v9.02. For that purpose I'm using the following command line:
gswin32c.exe -sDEVICE=png16m -o outputFile%d.png mypdf.pdf
This is working fine when the pdf crop box is the same as the media box, but if the crop box is smaller than the media box, only the media box is displayed and the border of the pdf page is lost.
I know usually pdf viewers only display the crop box but I need to be able to see the whole media page in my png file.
Ghostscript documentation says that per default the media box of a document is rendered, but this does not work in my case.
As anyone an idea how I could achieve rendering the whole media box using ghostscript?Could it be that for png file device, only the crop box is rendered? Am I maybe forgetting a specific command?
For example, this pdf contains some registration marks outside of the crop box, which are not present in the output png file. Some more information about this pdf:
media box:
width: 667
height: 908 pts
crop box:
width: 640
height: 851
OK, now that revers has re-stated his problem into that he is looking for "generic code", let me try again.
The problem with a "generic code" is that there are many "legal" formal representations of "CropBox" statements which could appear in a PDF. All of the following are possible and correct and set the same values for the page's CropBox:
/CropBox[10 20 500 700]
/CropBox[ 10 20 500 700 ]
/CropBox[10 20 500 700 ]
/CropBox [10 20 500 700]
/CropBox [ 10 20 500 700 ]
/CropBox [ 10.00 20.0000 500.0 700 ]
/CropBox [
10
20
500
700
]
The same is true for ArtBox, TrimBox, BleedBox, CropBox and MediaBox. Therefor you need to "normalize" the *Box representation inside the PDF source code if you want to edit it.
First Step: "Normalize" the PDF source code
Here is how you do that:
Download qpdf for your OS platform.
Run this command on your input PDF:
qpdf --qdf input.pdf output.pdf
The output.pdf now will have a kind of normalized structure (similar to the last example given above), and it will be easier to edit, even with a stream editor like sed.
Second Step: Remove all superfluous *Box statements
Next, you need to know that the only essential *Box is MediaBox. This one MUST be present, the others are optional (in a certain prioritized way). If the others are missing, they default to the same values as MediaBox. Therefor, in order to achieve your goal, we can simply delete all code that is related to them. We'll do it with the help of sed.
That tool is normally installed on all Linux systems -- on Windows download and install it from gnuwin32.sf.net. (Don't forget to install the named "dependencies" should you decide to use the .zip file instead of the Setup .exe).
Now run this command:
sed.exe -i.bak -e "/CropBox/,/]/s#.# #g" output.pdf
Here is what this command is supposed to do:
-i.bak tells sed to edit the original file inline, but to also create a backup file with a.bak suffix (in case something goes wrong).
/CropBox/ states the first address line to be processed by sed.
/]/ states the last address line to be processed by sed.
s tells sed to do substitutions for all lines from first to last addressed line.
#.# #g tells sed which kind of substitution to do: replace each arbitrary character ('.') in the address space by blanks (''), globally ('g').
We substitute all characters by blanks (instead of by 'nothing', i.e. deleting them) because otherwise we'd get complaints about "PDF file corruption", since the object reference counting and the stream lengths would have changed.
Third step: run your Ghostscript command
You know that already well enough:
gswin32c.exe -sDEVICE=png16m -o outputImage_%03d.png output.pdf
All the three steps from above can easily be scripted, which I'll leave to you for your own pleasure.
First, let's get rid of a misunderstanding. You wrote:
"This is working fine when the pdf crop box is the same as the media box, but if the crop box is smaller than the media box, only the media box is displayed and the border of the pdf page is lost."
That's not correct. If the CropBox is smaller than the MediaBox, then only the CropBox should be displayed (not the MediaBox). And that is exactly how it was designed to work. This is the whole idea behind the CropBox concept...
At the moment I cannot think of a solution that works automatically for each PDF and all possibly values that can be there (unless you want to use payware).
To manually process the PDF you linked to:
Open the PDF in a good text editor (one that doesn't mess with existing EOL conventions, and doesn't complain about binary parts in the file).
Search for all spots in the file that contain the /CropBox keyword.
Since you have only one page in the PDF, it should find only one spot.
This could read like /CropBox [12.3456 78.9012 345.67 890.123456].
Now edit this part, carefully avoiding to add to (or lose from) the number of already existing characters:
Set the value to your wanted one: /CropBox [0.00000 0.00000 667.00 908.000000]. (You can use spaces instead of my .0000.. parts, but if I do, the SO editor will eat them and you'll not see what I originally typed...)
Save the file under a new name.
A PDF viewer should now show the full MediaBox (as of your specification).
When you convert the new file with Ghostscript to PNG, the bigger page will be visible.

Why does pdftk chop the tops off the characters in my form fields?

When I populate acrobat form fields by importing an FDF file into NitroPDF, things look fine. When I type data into the form fields manually in Acrobat 8, things look fine. When I use pdftk (on Windows XP or 2K), the tops of the characters in each form field are chopped off. Is there a parameter I'm missing somewhere? There aren't that many settings in pdftk...
Here's what I'm running:
pdftk form.pdf fill_form data.fdf output out.pdf flatten
Digging deeper, it appears supplied text:
<</T (A) /V (123)>>
Gets reworked to:
<</T (A) /V ([fe][ff][nul]1[nul]2[nul]3)>>
(I determined this by loading an "un-flattened" out.pdf into NitroPDF and exporting the FDF).
I ended up proofing the document using pdftk and Acrobat Reader as I worked, instead of the import in NitroPDF. It appears that the baseline for the characters is different. To get the results I was after, I had to make each field about twice the height required by NitroPDF and overlap fields.
I would still recommend NitroPDF for the rest of its capabilities.
I believe PDFTK only supports earlier versions of the PDF standard (up to 1.4 I think) so maybe it's just starting to show its age?

Programmatically add comments to PDF header

Has anyone had any success with adding additional information to a PDF file?
We have an electronic medical record system which produces medical documents for our users. In the past, those documents have been Print-To-File (.prn) files which we have fed to a system that displayed them as part of an enterprise medical record.
Now the hospital's enterprise medical record vendor wants to receive the documents as PDF, but still wants all of the same information stored in the header.
Honestly, we can't figure out how to put information into a PDF file that doesn't break the PDF file.
Here is the start of one of our PDFs...
%PDF-1.4
%âãÏÓ
6 0 obj
<<
/Type /XObject
/Subtype /Image
/BitsPerComponent 8
/Width 854
/Height 130
/ColorSpace /DeviceRGB
/Filter /DCTDecode
/Length 17734>>
stream
In our PRN files, we would insert information like this:
%MRN% TEST000001
%ACCT% TEST0000000000001
%DATE% 01/01/2009^16:44
%DOC_TYPE% Clinical
%DOC_NUM% 192837475
%DOC_VER% 1
My question is, can I insert this information into a PDF in a manner which allows the document server to perform post-processing, yet is NOT visible to the doctor who views the PDF?
Thank you,
David Walker
Yes, you can. Any line in a PDF file that starts with a percent sign is a comment and as such ignored (the first two lines of the PDF actually are comments as well). So you can pretty much insert your information into the PDF as you did into the PRN.
However:
The PDF format works with byte position references, so if you insert data into a finished PDF file, this will push the rest of the data away from their original position and thus break the file. You can also not append it to the file, because a PDF file has to end with
startxref
123456
%%EOF
(the 123456 is an example). You could insert your data right before these three lines. The byte position of the "startxref" part is never referenced anywhere, so you won't break anything if you push this final part towards the end.
Edit: This of course assumes there is no checksumming, signing or encryption going on. That would make things more complicated.
Edit 2: As Javier pointed out correctly, you can also just add your data to the end and just add a copy of the three lines to the end of that. Boils down to the same thing, but it's a little easier.
PDFs are supposed to have multiple versions just appending at the end; but the very end must have the offset to the main reference table. Just read the last three lines, append your data and reattach the original ending.
You can either remove the original ending or let it there. PDF readers will just go to the end and use the second-to-last line to find the reference table.
Have you ever thought to embed your additional info inside the PDF as a separate file?
The generic PDF specification allows to "attach files" to PDFs. Attached files can be anything: *.txt, *.doc, *.xsl, *.html or even .pdf. Attached files are contained in the PDF "container" file without corrupting the container's own content. (Special-purpose PDF specifications such as PDF/A- and PDF/X-* may impose some restrictions about embedded/attached files.)
That allows you to tie additional info and/or data to PDF files and allow for common storage and processing. Attached files are supposed to not disturb any PDF viewer's rendering.
I've used that feature frequently, for various purposes:
store the parent document (like .doc) inside the .pdf from which the .pdf was created in the first place;
tag a job ticketing information to a printfile that is sent to the printshop;
etc.pp.
Of course, recently discovered and published flaws in PDF processing software (and in the PDF spec itself) suggest to stay away from embedding/attaching binary files to PDF files --
because more and more Readers will by default stop you from easily extracting/detaching the embedded/attached files.
However, there is no reason why you shouldn't be able to put your additional info into a medical-record-info.txt file of arbitrary lenght and internal format and attach it to the PDF:
MRN TEST000001
ACCT TEST0000000000001
DATE 2009-01-01
TIME 16:44:33.76
DOC_TYPE Clinical
DOC_NUM 192837475
DOC_VER 1
MORE_INFO blah blah
Hi, guys,
can you please process this file faster than usual? If you don't,
someone will be dying.
Seriously, David.
FWIW, the commandline tools pdftk.exe (Windows) and pdftk (Linux) are able to attach and detach embedded files from their container PDF. Acrobat Reader can also handle attachments.
You could setup/program/script your document server handling the PDF to automatically detach the embedded .txt file and trigger actions according to its content.
Of course, the doctor who views the PDF would be able to see there is a file attachment in the PDF. But it wouldn't appear in his "normal" viewing. He'd have to take specific additional actions in order to extract and view it. (And then there is the option to set a password on the PDF to protect it from un-authorized file detachments. And/or encode, obscure, rot13 the .txt. Not exactly rock-solid methods, but 99% of doctors wouldn't be able to accomplish it even if you teach them how to...)
You can still insert comments into a PDF file using the % character. But anyone would be able to access with a text editor.
Your vendor could remove these comments after post-processing, so it doesn't actually get to the doctors.
You can store the data as real PDF metadata. For example, with CAM::PDF you can write metadata like this:
use CAM::PDF;
my $pdf = CAM::PDF->new('temp.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
$info->{PRN} = CAM::PDF::Node->new('dictionary', {
DOC_TYPE => CAM::PDF::Node->new('string', 'Clinical'),
DOC_NUM => CAM::PDF::Node->new('number', 192837475),
DOC_VER => CAM::PDF::Node->new('number', 1),
});
$pdf->cleanoutput('out.pdf');
The Info node of the PDF then looks like this:
8 0 obj
<< /CreationDate (D:20080916083455-04'00')
/ModDate (D:20080916083729-04'00')
/PRN << /DOC_NUM 192837475 /DOC_TYPE (Clinical) /DOC_VER 1 >> >>
endobj
You can read the PRN data back out like so (simplistic code...)
my $pdf = CAM::PDF->new('out.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
my $prn = $info->{PRN};
if ($prn) {
my $prndict = $pdf->getValue($prn);
for my $key (sort keys %{$prndict}) {
print "$key = ", $pdf->getValue($prndict->{$key}), "\n";
}
}
Which makes output like this:
DOC_NUM = 192837475
DOC_TYPE = Clinical
DOC_VER = 1
PDF supports arbitrarily nested arrays, dictionaries and references so just about any data can be represented. For example, I built an entire filesystem embedded in a PDF just for fun!
At one point we were changing some Acrobat JS code by doing a text replace in a plain (unencrypted) PDF. The trick was that the lengths of each PDF block were hard coded in the document. So, we could not change the number of characters. We would just add extra spaces.
It worked great, the JS code executed an all.
Have you thought about using XMP?