Merge PDF files with reversible process (extract original files) - pdf

There are countless questions and answers that present different solutions to merge 2 or more PDF files and how to extract specific pages and create a PDF with this subset.
Unfortunately I could not find a way (either using a library or command line tools, since it will be scripted) to merge files, such that the resulting file is a valid PDF and later "split back" this file in separate files, using the same page ranges, to obtain the exact same original files (at the binary level).
Is this possible?

Once you have merged the PDF files, you cannot split the result and obtain the exact same original files at the binary level. The source PDF files are not included as opaque binary blocks in the merged file.
One possible solution, as @mkl said, is to use a PDF portfolio to embed the source files as they are. When viewing the portfolio you will see each file as it is, not as one long merged PDF file.
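Whichever embedding route is used, the "exact same original files" requirement can be checked in a script by comparing hashes. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration and the extraction step itself is assumed to happen elsewhere.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(original_path, extracted_path):
    """True when both files hash to the same digest, i.e. the round trip kept the bytes."""
    return sha256_of(original_path) == sha256_of(extracted_path)
```

Calling `is_byte_identical("a.pdf", "extracted/a.pdf")` after extraction (both paths are placeholders) would report whether the file came back unchanged.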

Related

Extract xml from ZUGFeRD PDF with Ghostscript

We would like to automate the processing of Zugferd invoices.
Is there a way to extract and save the xml files embedded in the PDF using Ghostscript?
As mentioned by KenS, Ghostscript can help assemble ZUGFeRD files but cannot extract their contents. The image below shows those contents in the source XML (lower part) and in a good (!?) PDF where the plain text is visible (the upper part of the image is the PDF viewed in WordPad) and can easily be extracted as text. However, nothing about PDF extraction is reliable, since the format of one PDF is rarely the same as the next unless you make it so.
Many PDF readers have the ability to export such attachments as the source file and many PDF libraries will allow for extraction of the named file in a scripted fashion.
The samples above are from a currently very up-to-date open source Java application: https://www.mustangproject.org/
For very simple cross-platform use there is pdfdetach, which can save a single attachment by name or all attachments at once.
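For scripted use, pdfdetach can be driven from Python. This is a sketch, assuming the Poppler version of pdfdetach is installed and on the PATH; the helper names are hypothetical. It relies on the `-saveall` option, with `-o` naming the target directory in that mode.

```python
import subprocess


def build_pdfdetach_command(pdf_path, output_dir):
    """Build the argument list that saves every embedded file into output_dir."""
    # With -saveall, -o is interpreted as the directory to write into.
    return ["pdfdetach", "-saveall", "-o", str(output_dir), str(pdf_path)]


def extract_all_attachments(pdf_path, output_dir):
    """Run pdfdetach; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_pdfdetach_command(pdf_path, output_dir), check=True)
```

`pdfdetach -list file.pdf` can be used first to see which attachment names a given invoice contains.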

Preserve directory structure when unpacking attachments from PDF with pdftk?

I am trying to pack and unpack attachments including a subdirectory hierarchy to a PDF with pdftk ... attach_files and pdftk ... unpack_files. However, while attach_files is capable of representing the subdirectory information by including the / separator in file names, unpack_files puts all files into one flat directory, silently overwriting files if the same name occurs multiple times. Is it possible to get preservation of the hierarchy when unpacking?
As workarounds I have used:
Packing the attachments into a zip file and attaching the zip file. However, this way the attachment hierarchy is no longer easily accessible.
Applying a bijective transformation on the path names, that maps the hierarchy to a flat structure and back. However, this way unpacking is possible only with a script doing the transformation.
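The second workaround can be sketched as follows. This is only one possible reversible mapping, assuming percent-encoding is acceptable for the attachment names; the function names are hypothetical and nothing here is part of pdftk.

```python
from urllib.parse import quote, unquote


def flatten_path(relative_path):
    """Map a path such as 'docs/a/report.txt' to a single flat file name."""
    # safe="" forces "/" to become %2F; "%" itself becomes %25, so the
    # mapping can always be undone without ambiguity.
    return quote(relative_path, safe="")


def restore_path(flat_name):
    """Invert flatten_path and recover the original relative path."""
    return unquote(flat_name)
```

Files would be attached under `flatten_path(name)` and, after unpacking, renamed to `restore_path(name)`, recreating the directories as needed.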
Being directly able to preserve the hierarchy information already stored in the PDF would be preferable.
Unfortunately not with the current version of pdftk: it is hardcoded to drop path information both when attaching and when unpacking files. In fact, I would be surprised if any hierarchy information got stored in the PDF using pdftk.
That being said, it would not be too hard to write a patch to change this behaviour. I suggest opening an issue with a feature request.

Can QPDF utility be used to extract attachments from a PDF file?

I have a PDF file with other PDF files attached to it. Acrobat shows them in the "Attachments" tab and allows opening them in turn.
The QPDF documentation says something about extracting attachments, but I failed to find any particular commands that do that.
Is it possible to extract these attachments and have them stored on the disk as separate PDF files?
UPDATE: A note to better explain what you can see in the UI: the "Attachments" tab was present in older versions of Acrobat, along with a special page of the container document recommending that you download a newer version of Acrobat (this page really seems to exist, as it is shown in other viewers as well as in the preview image). The latest versions of Acrobat (Reader) skip this page and take you to the first attached document, with the list of all attachments shown on the left side of the screen.
I found an old GitHub issue that clarifies the possibilities for attachment extraction a little:
It is possible to extract attachments from PDF files using the qpdf
library by understanding the PDF file structure and pulling the
attachments out "manually" by knowing which objects to extract. There
is nothing in the public API at the moment nor in the command-line
tool that enables you to work with attachments as a first-class thing,
but there is an item in the TODO list, and there is some private code
used internally to detect cases where attachments are encrypted
differently from the rest of the file. The main reason, aside from
lack of time, that attachments are not more directly supported is
because there have been various ways that they are stored in the file,
and I don't know whether I have examples of all of them. I'm reluctant
to add a feature for attachments that may miss some attachments in
some older PDF files.
https://github.com/qpdf/qpdf/issues/24
So it seems to be possible, but you have to examine the details of the PDF file.
Starting with qpdf 10.2, you can work with file attachments in PDF files from the command line. The following options are available:
http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.attachments
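Two of those options, `--list-attachments` and `--show-attachment=key`, are enough to extract attachments from a script. The sketch below assumes qpdf 10.2 or later is on the PATH; the helper names are hypothetical, and the attachment key is whatever `--list-attachments` prints for the file in question.

```python
import subprocess


def build_list_command(pdf_path):
    """Arguments for listing the attachment keys of a PDF."""
    return ["qpdf", "--list-attachments", str(pdf_path)]


def build_show_command(pdf_path, key):
    """Arguments for writing one attachment's raw bytes to standard output."""
    return ["qpdf", f"--show-attachment={key}", str(pdf_path)]


def save_attachment(pdf_path, key, destination):
    """Capture the attachment from qpdf's standard output and write it to disk."""
    result = subprocess.run(
        build_show_command(pdf_path, key), check=True, capture_output=True
    )
    with open(destination, "wb") as handle:
        handle.write(result.stdout)
```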

How to put files inside files

MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
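A header-plus-bytes container of that kind could look like the sketch below. The layout (a signature, a file count, then a name length, data length, name and data for each file) and the `WRAP1` signature are invented here purely for illustration; they do not correspond to any existing format.

```python
import struct

MAGIC = b"WRAP1"          # hypothetical signature identifying the container
ENTRY_HEADER = "<HQ"      # name length (2 bytes) and data length (8 bytes)


def pack_files(files):
    """Serialize a {name: bytes} mapping into one uncompressed blob."""
    parts = [MAGIC, struct.pack("<I", len(files))]
    for name, data in files.items():
        encoded = name.encode("utf-8")
        parts.append(struct.pack(ENTRY_HEADER, len(encoded), len(data)))
        parts.append(encoded)
        parts.append(data)
    return b"".join(parts)


def unpack_files(blob):
    """Read the blob back into a {name: bytes} mapping."""
    if blob[:len(MAGIC)] != MAGIC:
        raise ValueError("not a WRAP1 container")
    offset = len(MAGIC)
    (count,) = struct.unpack_from("<I", blob, offset)
    offset += 4
    files = {}
    for _ in range(count):
        name_len, data_len = struct.unpack_from(ENTRY_HEADER, blob, offset)
        offset += struct.calcsize(ENTRY_HEADER)
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        files[name] = blob[offset:offset + data_len]
        offset += data_len
    return files
```

Because nothing is compressed, every wrapped file appears byte for byte inside the container, and the container is larger than the sum of its parts only by the size of the headers.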
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured Storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are similar to ZIP files with a different extension. If you change a file's extension from .docx to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
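The ZIP approach also covers the "wrap without making it smaller" case, since ZIP allows members to be stored uncompressed. Here is a minimal sketch with Python's standard `zipfile` module; the member names in the usage line are only examples of the kind of entries found in such archives.

```python
import io
import zipfile


def build_archive(members):
    """Wrap a {name: bytes} mapping in a ZIP archive without compressing it."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def list_members(blob):
    """Return the member names stored in a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return archive.namelist()
```

For example, `list_members(build_archive({"word/document.xml": b"<doc/>"}))` returns the single stored name, and the same `list_members` logic applies to the bytes of a real .docx file.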
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in SQLite, a library that lets you handle a file (e.g. some *.sqlite file) as an SQL database.
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
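As an illustration of the SQLite suggestion, a single database file can serve as the container, with one row per wrapped file. This is a sketch using Python's built-in `sqlite3` module; the table and column names are arbitrary choices made for this example.

```python
import sqlite3


def store_files(connection, files):
    """Insert a {name: bytes} mapping into a 'files' table, creating it if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
        list(files.items()),
    )
    connection.commit()


def load_file(connection, name):
    """Return the stored bytes for name, or None if there is no such entry."""
    row = connection.execute(
        "SELECT data FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```

Opening the connection with `sqlite3.connect("bundle.sqlite")` (a placeholder file name) keeps everything in one file on disk, and entries can be added or replaced later without rewriting the rest.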

Extract embedded PDF file without a full parse

I want to build a utility to extract embedded files from a PDF (see section 7.11.4 of the spec). However I want the utility to be "small" and not depend on a full PDF parsing framework. I'm wondering if the file format is such that a simple tool could scan through the document for some token or sequence, and from that know where to start extracting the embedded file(s).
Potential difficulties include the possibility that the token or sequence that you scan for could validly exist elsewhere in the document leading to spurious or corrupt document extraction.
I'm not that familiar with the PDF spec, and so I'm looking for:
- confirmation that this is possible
- a general approach that would work
There are at least two scenarios that are going to make your life difficult: encrypted files, and object streams (a compressed object that contains a collection of objects inside).
About the second item (object streams): some PDF generation tools will take most of the objects (dictionaries) inside a PDF file, put them inside a single object, and compress this single object (usually with deflate compression). This means that you cannot just skim through a PDF file looking for some particular token in order to extract the piece of information you need while ignoring the rest. You will need to interpret the structure of PDF files at least partially.
Note that the embedded files that you want to extract are very likely to be compressed as well, even if an object stream is not used.
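The effect of deflate compression on a naive token scan can be demonstrated with the standard `zlib` module. The dictionary text below is a made-up stand-in for the contents of an object stream, not taken from a real file.

```python
import zlib

TOKEN = b"/EmbeddedFile"

# Hypothetical dictionaries, as they might appear inside an object stream
# before compression.
plain_objects = b"\n".join(
    b"<< /Type /EmbeddedFile /Length %d >>" % length for length in range(20)
)

# What actually ends up in the file when the stream is deflate-compressed.
compressed_stream = zlib.compress(plain_objects)


def contains_token(raw_bytes):
    """Naive scan: report whether the token occurs literally in the bytes."""
    return TOKEN in raw_bytes
```

Here `contains_token(plain_objects)` is true while `contains_token(compressed_stream)` is false; the token only becomes visible again after `zlib.decompress(compressed_stream)`, which is why a scanner has to locate and decode streams first.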
Your program will need to be able to do at least the following:
- Processing xref tables
- Processing object streams
- Applying decoding/decompression filters to a data stream.
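The first item in the list can be sketched as follows. This handles only the classic, uncompressed cross-reference table of a file without incremental updates; cross-reference streams, `/Prev` chains and encryption are deliberately left out, and the function names are hypothetical.

```python
import re


def find_xref_offset(pdf_bytes):
    """Return the byte offset recorded after the last 'startxref' keyword."""
    tail = pdf_bytes[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref entry found near the end of the file")
    return int(matches[-1])


def parse_classic_xref(pdf_bytes, offset):
    """Parse a classic xref table into {object_number: byte_offset}."""
    lines = pdf_bytes[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("not a classic xref table (possibly an xref stream)")
    entries = {}
    index = 1
    while index < len(lines) and lines[index].strip() != b"trailer":
        # Subsection header: first object number and number of entries.
        first, count = (int(value) for value in lines[index].split())
        index += 1
        for number in range(first, first + count):
            fields = lines[index].split()
            if fields[2] == b"n":  # "n" marks an in-use object, "f" a free one
                entries[number] = int(fields[0])
            index += 1
    return entries
```

Given the full file contents, `parse_classic_xref(data, find_xref_offset(data))` yields the offsets at which each in-use object starts, which is the starting point for reading object streams and applying filters.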
Once you are able to get all objects from the file, you could in theory go through all of them looking for dictionaries of type EmbeddedFile. This approach has the disadvantage that you might extract files that are no longer referenced from anywhere inside the document (for example, because a user deleted them at some point in the file's history).
Another approach is to navigate through the structure of the file, looking for embedded files in the locations specified by the PDF spec. You can find embedded files in at least the following elements (this list is off the top of my head; there might be many more than these):
- Names dictionary
- Document outlines
- Page annotations