Preserve directory structure when unpacking attachments from PDF with pdftk? - pdf

I am trying to pack and unpack attachments including a subdirectory hierarchy to a PDF with pdftk ... attach_files and pdftk ... unpack_files. However, while attach_files is capable of representing the subdirectory information by including the / separator in file names, unpack_files puts all files into one flat directory, silently overwriting files if the same name occurs multiple times. Is it possible to get preservation of the hierarchy when unpacking?
As workarounds I have used:
Packing the attachments into a zip file and attaching the zip file. However, this way the attachment hierarchy is no longer easily accessible.
Applying a bijective transformation on the path names, that maps the hierarchy to a flat structure and back. However, this way unpacking is possible only with a script doing the transformation.
Being directly able to preserve the hierarchy information already stored in the PDF would be preferable.

Unfortunately not with the current version of pdftk, it is hardcoded to drop path information both when attaching and unpacking files. In fact, I would be surprised if any hierarchy information got stored in the PDF using pdftk.
That being said, it would not be too hard to write a patch to change this behaviour, I suggest opening an issue with a feature request.

Related

Merge PDF files with reversible process (extract original files)

There are countless questions and answers that present different solutions to merge 2 or more PDF files and how to extract specific pages and create a PDF with this subset.
Unfortunately I could not find a way (either using a library or command line tools, since it will be scripted) to merge files, such that the resulting file is a valid PDF and later "split back" this file in separate files, using the same page ranges, to obtain the exact same original files (at the binary level).
Is this possible?
Once you merged the PDF files you cannot split the result and obtain the exact same original files at binary level. Source PDF files are not included as opaque binaries blocks in the merged file.
One possible solution solution, as #mkl said, is to use a PDF portfolio to embed the source files as they are. When viewing the portfolio you will see each file as it is, not as a long merged PDF file.

Prevent unprocessed DOT files from being deleted when using Doxywizard

I'm using Doxygen wizard to generate documentation for a big embedded C project. I'm able to generate graphs and class diagrams using Dot and Graphviz. However, I would like to edit some dependency graphs by hand - there's information that is not needed all the time e.g. the graph depth is too much.
I noticed when running Doxywizard that before the diagram files are generated and saved as PNG files, "raw files" for lack of a better word are created that hold the code to generate the graphs using Graphviz. These are DOT files that can be opened in the text editor. These files are deleted once the diagram images are generated.
I was able to access them by stopping the Doxywizard mid-process before they got deleted. Is there any way to prevent these DOT files from being deleted?
Doxygen has the possibility to retain the dot files. In the doxygen settings file (Doxyfile) there is the setting DOT_CLEANUP setting this to DOT_CLEANUP=NO will retain the files.
From the documentation:
DOT_CLEANUP
If the DOT_CLEANUP tag is set to YES, doxygen will remove the intermediate files that are used
to generate the various graphs.
Note: This setting is not only used for dot files but also for msc temporary files.
The default value is: YES
See also: https://www.doxygen.nl/manual/config.html and more specifically: https://www.doxygen.nl/manual/config.html#cfg_dot_cleanup
And in the doxywizard:
i.e.
expert tab
topics "pane" the last item dot
in the list of possibilities the last item DOT_CLEANUP

How to put files inside files

MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are similar to ZIP files with a different extensions. If you change a file with .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in Sqlite, which gives you a library to handle some (e.g.) *.sqlite file as an SQL database
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.

Extract embedded PDF file without a full parse

I want to build a utility to extract embedded files from a PDF (see section 7.11.4 of the spec). However I want the utility to be "small" and not depend on a full PDF parsing framework. I'm wondering if the file format is such that a simple tool could scan through the document for some token or sequence, and from that know where to start extracting the embedded file(s).
Potential difficulties include the possibility that the token or sequence that you scan for could validly exist elsewhere in the document leading to spurious or corrupt document extraction.
I'm not that familiar with the PDF spec, and so I'm looking for
confirmation that this is possible
a general approach that would work
There are at least two scenarios that are going to make your life difficult: encrypted files, and object streams (a compressed object that contains a collection of objects inside).
About the second item (object streams), some PDF generation tools will take most of the objects (dictionaries) inside a PDF file, put them inside a single object, and compress this single object (usually with deflate compression). These means that you cannot just skim through a PDF file looking for some particular token in order to extract some piece of information that you need while ignoring the rest. You will need to actually interpret the structure of PDF files at least partially.
Note that the embedded files that you want to extract are very likely to be compressed as well, even if an objects stream is not used.
Your program will need to be able to do at least the following:
- Processing xref tables
- Processing object streams
- Applying decoding/decompression filters to a data stream.
Once you are able to get all objects from the file, you could in theory go through all of them looking for dictionaries of type EmbeddedFile. This approach has the disadvantage that you might extract files that are not been referenced from anywhere inside the document (because a user deleted it at some point of the file's history for example)
Another approach could be to actually navigate through the structure of the file looking for embedded files on the locations specified by the PDF spec. You can find embedded files in at least the following elements (this list is from the top of my head, there might be a lot more that these):
- Names dictionary
- Document outlines
- Page annotations

How to remove .efs file extension from 1000's of recovered files in one folder

I recently recovered a 1.5TB external HDD that crashed. The program I used to recover the files was Active Undelete Enterprise, it's excellent. When the files were successfully recovered they were all saved with a .efs extension so files looked like mydocument.docx.efs. At first I thought they were encrypted and needed to be decrypted, I spent 10 mins on it and realized I just need to remove the .efs from the entire filename and the mydocument.docx works perfectly. Problem is now I have over 55,000 files within hundreds of folders where I need to simply remove the .efs after each file. Does anyone know how to do this?
From a command prompt window, navigate to the top level directory where these files reside.
Type the command
DIR /S/B >>filelist.txt
This command will give you a bare format file listing of the current directory plus all nested subdirectories without any extraneous information. The list will be contained in the text file named "filelist.txt" or whatever else you choose to call it. I would then use this text file in a text editor to convert every line of text from, for example,
C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs
to
rename c:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs file1.png
to give a simple example of a file that I just found on my system using this method.
You will need to use a text editor with a columnar editing capability since you have to modify som many files. Old programmer's editors such as CodeWright made this really simple while modern editors such as Eclipse or Notepad++ make this a little more difficult and may require a columnar editing plugin, depending on version. You basically have to make a columnar copy of all of the text in the file, and then paste the copy off to the far right - far enough that a second column of filenames and paths won't overwrite any of the existing file names and paths. You can then use columnar editing features to select and delete the path names of the text in the 2nd column since the rename command requires that the 2nd argument be simply the base filename and extension without the path information. You can use the columnar editing features to prepend every line with "RENAME ". If you attempt to do this without columnar editing features, you will find it slow going!
An alternate way to do this is to use a command formed from a "regular expression" to create the rename command. If you are not familiar with "regular expressions", ask a programmer friend as this is not an easy topic to learn from scratch. If you are familiar with regular expressions, this is probably the simplest way to perform this task. I haven't used them in many years and no longer recall the exact syntax to use or I would tell you myself.
Regardless of what kind of editor you use, the goal is to turn this ASCII file list of paths and filenames into a batch file (simply rename file1.txt to file1.bat when you are finished editing). You can then run the batch file by typing file1.bat at a command prompt.
I have just run into this same problem myself using the same really wonderful tool that you used. I am writing this while waiting for the undelete program to finish. That it restores files with this extra extension seems very anti-intuitive so I will look for an option to make it not do this when it finishes. If I find one, I will post a new answer here that is more specific to this tool. Otherwise, I am going to have rename all kazillion files just as you had to.
You experienced this problem because the disk that you recovered your files to "does not support encryption", according to the Active# UNDELETE documentation. The documentation offers no further explanation of what kind of disks support encryption, etc.
They offer a Decrypt command that restores the file's proper names as a post processing step. Unfortunately, this requires that you "include" each and every file to be decrypted, with no support for wildcards and parsing subdirectories so that is a non-starter, in my opinion given that both of us have hundreds of thousands of files to be renamed.
I did find that by selecting a normal fixed (non-removable) hard drive as the destination of the recovery effort, that the resulting files do not end up encrypted (i.e., they are recovered with the proper file name and extension). I originally chose a large USB based flash drive and the files were stored in their "encrypted" state (not really encrypted, but possibly potentially so and thus they give the .efs extension). Of course, this meant that I had to run the command all over again after switching to a regular hard drive (takes about 16 hours to recover 80GB worth of files due to presence of many sector CRC errors).