How Can I Share Referenced Resources Between PDF Files

I create hundreds of PDF files with the same images and fonts. Is there a way I can share these resources between all the files instead of having them embedded in each PDF? It sure would be a disk space saver.

No. PDFs are meant to be stand-alone files which fully encompass font information, vector graphics and whatnot in a single file. Sharing between files would break this. If you're looking to save space (and your application's requirements allow it), you might consider generating the PDFs on the fly.

You can embed external links for things like files if you just want to share linked files.

Is it possible to obfuscate PDF file binary data?

Is it possible to obfuscate the bytes that are visible when a PDF file is opened with a hex editor? Also, I wonder if there is any problem in viewing the contents of the PDF file even if it is obfuscated.
You will always be able to see whatever bytes are within a file using a hex editor.
There might be ways to generate your pdf pages using methods that don't involve directly writing the text into the pdf (for example using javascript that's obfuscated).
As answered above, the bytes of the file are always visible when viewed with a hex editor. However, there are some options to hide or protect data in the file:
You could encrypt either the whole pdf or partial datasets. Note that an encryption/decryption always requires a secret. When the file is fully encrypted you can't read it without the key.
You can add additional similar data frames but set them to be invisible in the PDF. Note that this technique inflates the size of the file.
You can use scripting languages which dynamically build up your PDF. Be aware that this could look suspicious to users or to anti-virus software.
You can use steganography tools to hide your data. One example of such a tool is steghide.
You can simply compress data streams in the PDF, e.g. with Flate (the same deflate algorithm gzip uses) or a similar compression method. That way the content can't be read directly. However, compression is easy to recognize and anyone can undo it.
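To illustrate the last point, here is a minimal sketch using Python's standard zlib module. The byte string is a made-up stand-in for a PDF content stream, not data from a real file. It only shows that compression hides the text from a casual look at the bytes while needing no secret to reverse.

```python
import zlib

# Hypothetical stand-in for the bytes of a PDF content stream.
content = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 20

# Deflate compression: the readable text disappears from the raw bytes.
compressed = zlib.compress(content)
print(b"Hello, world" in content)
print(b"Hello, world" in compressed)

# No key is involved, so anyone can restore the original bytes.
restored = zlib.decompress(compressed)
print(restored == content)
```

This is why compression counts as obscurity rather than protection. If the data must stay unreadable, encryption with a secret is the option to look at.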

Digital Asset Management tool for large files that are not photos or videos

Most DAMs that I have found are geared towards media like photos and videos. I need to manage large binary files like ISOs and IMG files.
Does anybody know of a DAM that can manage non-media files? Specifically something that is on premise? Going to a DAM in the cloud would be too expensive because of the amount of storage we would need and the bandwidth it would consume.
DAMs have specific functionality tailored towards visual content. For example, DAM systems will create previews for the files stored and also, possibly, extract metadata from the file itself. In addition to that, it will also provide you options to transform and download content in various formats. Considering that all these options are part of the DAM package, I would not expect too much from them with respect to previews, metadata extraction and transformations when it comes to large binary files, such as ISO and IMG files.
You can, however, use most DAMs to upload any file you want. The DAM will simply take the file and let you tag it with metadata. An example would be Elvis DAM, where you can upload content (I would use hot-folder uploads for large files) and tag it with metadata. You can create custom fields such as OS version, applications, etc. and store them against the ISO files. These fields become searchable, and the system will scale to hold all of this information and let you quickly find your content.
There might be other simpler and less expensive solutions out there that might just simply keep a file and assign metadata to it.
Try NeoFinder
Its original incarnation was as a catalog program for CDs, but it supports extensive metadata for tagging, as well as pulling metadata from images.
https://www.cdfinder.de
We solved our need by using Git Large File Storage (LFS) to manage our large binary files. We tried out git-annex as well, which worked well, but in the end we went with Git LFS.

How to put files inside files

MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
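A minimal sketch of that idea, assuming a made-up layout: a 4-byte signature, a file count, then for each file a small header (name length and data length) followed by the name and the raw bytes. The `WRAP` signature and the field sizes are invented for illustration and do not follow any existing format.

```python
import struct

MAGIC = b"WRAP"          # hypothetical 4-byte signature for this made-up format
RECORD_HEADER = "<HQ"    # name length (2 bytes), data length (8 bytes), little-endian

def pack(files):
    """Serialize a {name: bytes} mapping into one blob, uncompressed."""
    parts = [MAGIC, struct.pack("<I", len(files))]
    for name, data in files.items():
        encoded = name.encode("utf-8")
        parts.append(struct.pack(RECORD_HEADER, len(encoded), len(data)))
        parts.append(encoded)
        parts.append(data)
    return b"".join(parts)

def unpack(blob):
    """Read a blob produced by pack() back into a {name: bytes} mapping."""
    if blob[:4] != MAGIC:
        raise ValueError("not a WRAP container")
    (count,) = struct.unpack_from("<I", blob, 4)
    offset = 8
    files = {}
    for _ in range(count):
        name_len, data_len = struct.unpack_from(RECORD_HEADER, blob, offset)
        offset += struct.calcsize(RECORD_HEADER)
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        files[name] = blob[offset:offset + data_len]
        offset += data_len
    return files

original = {"a.txt": b"hello", "b.bin": bytes(range(256))}
blob = pack(original)
print(unpack(blob) == original)
```

The wrapped file is slightly larger than the inputs combined, since nothing is compressed and every entry carries its own header.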
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured Storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are ZIP files with a different extension. If you rename a file with a .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
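Since the question asked about wrapping files without making the result smaller, here is a small sketch of the ZIP route using Python's standard zipfile module with compression switched off (`ZIP_STORED`). The entry names and contents are invented and only imitate the folder layout you see inside a renamed .docx; the result is not a valid Word document.

```python
import io
import zipfile

# Build the archive in memory; ZIP_STORED keeps every entry uncompressed.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    # Hypothetical entries, named to resemble the folders inside a .docx.
    archive.writestr("word/document.xml", "<document>hypothetical body</document>")
    archive.writestr("docProps/core.xml", "<coreProperties/>")

# Reopen the bytes and list what is inside, as a compression tool would.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    body = archive.read("word/document.xml")

print(names)
print(body)
```

Because the entries are stored rather than deflated, their bytes sit unchanged inside the archive, surrounded by ZIP's own headers and directory.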
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in SQLite, which gives you a library to handle some file (e.g. *.sqlite) as an SQL database.
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
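As a rough sketch of the SQLite suggestion, the snippet below uses Python's standard sqlite3 module to keep named files as BLOB rows. The table and column names are made up, and an in-memory database stands in for a *.sqlite file on disk.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, data BLOB NOT NULL)")

def put_file(conn, name, data):
    """Store or overwrite one named file."""
    conn.execute("INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)", (name, data))
    conn.commit()

def get_file(conn, name):
    """Return the stored bytes, or None if the name is unknown."""
    row = conn.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
    return None if row is None else bytes(row[0])

put_file(conn, "notes.txt", b"first version")
put_file(conn, "notes.txt", b"second version")   # replaces the earlier row
put_file(conn, "logo.bin", bytes(range(16)))

print(get_file(conn, "notes.txt"))
print(get_file(conn, "missing.txt"))
```

The library takes care of replacing or removing an entry in the middle of the file, which is exactly the part a plain byte stream does not give you.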

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Is there something that can do this, or anything that could be reused (like an open source reader), so that I can get there a little faster and don't have to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of Windows builds available.
The open source tool from iText called iText RUPS does what you want: it shows you all the PDF commands for a particular PDF and lets you visualize the structure and relationships.
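A small sketch of the per-page conversion, assuming pdftops is installed and on the PATH. It relies on pdftops's `-f` (first page) and `-l` (last page) options. The helper names and the file names are hypothetical.

```python
import subprocess

def pdftops_command(pdf_path, ps_path, first_page, last_page):
    """Build the pdftops argument list for a page range (1-based, inclusive)."""
    return ["pdftops", "-f", str(first_page), "-l", str(last_page), pdf_path, ps_path]

def convert_pages(pdf_path, ps_path, first_page, last_page):
    """Run pdftops; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(pdftops_command(pdf_path, ps_path, first_page, last_page), check=True)

# Shows the command that would convert only page 3 of a hypothetical input.pdf.
print(" ".join(pdftops_command("input.pdf", "page3.ps", 3, 3)))
```

This only narrows the output down to a page, not to a selected area within it, so it is a starting point rather than a full answer to the question.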
http://sourceforge.net/projects/itextrups/

Providing an embedded webkit with resources from memory

I'm working on an application that embeds WebKit (via the Gtk bindings). I'm trying to add support for viewing CHM documents (Microsoft's bundled HTML format).
HTML files in such documents have links to images, CSS etc. of the form "/blah.gif" or "/layout.css" and I need to catch these to provide the actual data. I understand how to hook into the "resource-request-starting" signal and one option would be to unpack parts of the document to temporary files and change the uri at this point to point at these files.
What I'd like to do, however, is provide WebKit with the relevant chunk of memory. As far as I can see, you can't do this by catching resource-request-starting, but maybe there's another way to hook in?
An alternative is to base64-encode the image into a data: URI. It's not exactly better than using a temporary file, but it may be simpler to code.
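A minimal sketch of building such a URI with Python's standard base64 module. The helper name is made up, and the six bytes are just the start of a GIF header used as sample data. How the resulting string gets swapped into the HTML before it reaches WebKit is left to the application.

```python
import base64

def to_data_uri(data, mime_type):
    """Encode raw bytes as a data: URI that can replace a link such as "/blah.gif"."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Sample bytes standing in for an image extracted from the CHM archive.
sample = b"GIF89a"
print(to_data_uri(sample, "image/gif"))
```

Base64 makes the payload about a third larger than the raw bytes, so this suits small images and stylesheets better than large resources.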