For a linearized PDF, how to determine the length of the cross-reference stream in advance?

When generating a linearized PDF, a cross-reference table has to be stored at the very beginning of the file. If it is a cross-reference stream, the content of the table will be compressed, and the actual size of the cross-reference stream after compression is unpredictable.
So my question is:
How to determine the actual size of this cross-reference stream in advance?
If the actual size of the stream is unpredictable, then after the offsets of objects are written into the stream and the stream is written into the file, won't it change the actual offsets of the following objects again? Am I missing something here?
Any hints are appreciated.

How to determine the actual size of this cross-reference stream in advance?
First of all, you don't. At least not exactly. You described why.
But it suffices to have an estimate. Just add some bytes to the estimate and later on pad with whitespace. @VadimR pointed out that such padding can regularly be observed in linearized PDFs.
You can either use a rough estimate as in the QPDF source @VadimR referenced, or you can try for a better one.
You could, e.g. make use of predictors:
At the time you eventually have to create the cross-reference streams, all PDF objects can already be serialized in the order you need, with the exception of the cross-reference streams themselves and the linearization dictionary (which contains the final size of the PDF and some object offsets). Thus, you already know the differences between consecutive xref entry values for most of the entries.
If you use the PNG Up predictor, you essentially store only those differences. So you already know most of the data to compress. Changes in a few entries won't change the compressed result too much. So this probably gives you a better estimate.
Furthermore, as the first cross reference stream does not contain too many entries in general, you can try compressing that stream multiple times for different numbers of reserved bytes.
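To make that concrete, here is a minimal sketch of such an estimate in Python (zlib stands in for whatever Flate implementation you use; `guessed_rows` are assumed to be your xref entries already packed according to your /W widths, with placeholder values for the few offsets you don't know yet):

```python
import zlib

def up_predict(rows):
    """PNG 'Up' predictor (/Predictor 12): each row stores byte deltas to the previous row."""
    out, prev = bytearray(), bytes(len(rows[0]))
    for row in rows:
        out.append(2)                                    # PNG filter type 2 = "Up"
        out.extend((b - p) & 0xFF for b, p in zip(row, prev))
        prev = row
    return bytes(out)

def estimate_xref_stream_size(guessed_rows, margin=64):
    """Compress the best-guess entries now and reserve that size plus a safety margin."""
    return len(zlib.compress(up_predict(guessed_rows))) + margin

# Later, after writing the real cross-reference stream into the reserved space,
# fill any unused reserved bytes with PDF whitespace so the offsets of the
# following objects stay exactly where you predicted them.
```

If the final compressed stream still comes out larger than the reserved space, increase the margin and redo the pass, as described above.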
PS: I have no idea what Adobe uses in their linearization code. And I don't know whether it makes sense to fight for a few bytes more or less here; after all, linearization makes the most sense for big documents, for which a few bytes more or less hardly matter.

Related

PDF Entropy calculation

Last time mkl helped me a lot; hopefully he (or someone else) can help me with these questions too. Unfortunately I couldn't get access to the ISO standard (ISO 32000-1 or 32000-2).
Are these bytes used for padding? I have tried several files, and they all have padding characters. This is quite remarkable, as I would expect this substantial amount of low-entropy bytes to significantly lower the average entropy of the PDF file. However, this does not seem to be the case, as the average entropy of a PDF file is almost eight bits (see the snippet below for how I measure this).
Furthermore, this (meta)data should be part of an object stream, and therefore compressed, but this is not the case (is there a specific reason for this?). (Magenta = high entropy/random; the darker the color, the lower the entropy. I generated this image with http://binvis.io/#/)
These are the entropy values of a .doc file (**not** .docx) that I converted to a PDF with version 1.4, as this version should not contain object streams etc. However, the entropy values of this file are still quite high. I would expect the entropy of a PDF with version < 1.5 to be lower on average, as it does not use object streams, but the results are similar to a PDF with version 1.5.
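To be clear about what I mean by entropy, this is roughly the measurement I use: plain Shannon entropy over the byte frequencies of the whole file (`byte_entropy` is just my own helper name):

```python
import math
from collections import Counter

def byte_entropy(path):
    """Shannon entropy in bits per byte over the whole file (8.0 would look fully random)."""
    data = open(path, "rb").read()
    counts = Counter(data)
    return -sum(c / len(data) * math.log2(c / len(data)) for c in counts.values())
```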
I hope somebody can help me with these questions. Thank you.
Added part:
The trailer dictionary has a variable length, and with PDF 1.5 (or higher) it can be part of the central directory stream, so not only the length but also the position/offset of the trailer dictionary can vary (or can it? It seems that even when the trailer dictionary is part of the central directory stream, it is always at the end of the file, at least in all the PDFs I tested). The only thing I don't really understand is why the researchers of this study assumed that the trailer has a fixed size and a fixed position (the last 164 bytes of a file).
They also mention in Figure 8 that a PDF file encrypted by EasyCrypt has some structure in both the header and the trailer (which is why it has a lower entropy value compared to a PDF file encrypted with ransomware).
However, when I encrypt several PDF files (with different versions) with EasyCrypt (I tried three different symmetric encryption algorithms: AES 128-bit, AES 256-bit and RC2), I get fully encrypted files without any unencrypted structure/metadata (neither in the header nor in the trailer). However, when I encrypt a file with Adobe Acrobat Pro, the structure of the PDF file is preserved. This makes sense, since PDF has its own standardised format for encrypting files, but I don't really understand why they mention that EasyCrypt conforms to this standardised format.
PDF Header encrypted with EasyCrypt:
PDF Header encrypted with Adobe Acrobat Pro:
Unfortunately I couldn't get access to the ISO standard (ISO 32000-1 or 32000-2).
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
Are these bytes used for padding?
Those bytes are part of a metadata stream. The format of the metadata is XMP. According to the XMP spec:
Padding: It is recommended that applications allocate 2 KB to 4 KB of padding to the packet. This allows the XMP to be edited in place, and expanded if necessary, without overwriting existing application data. The padding must be XML-compatible whitespace; the recommended practice is to use the space character (U+0020) in the appropriate encoding, with a newline about every 100 characters.
So yes, these bytes are used for padding.
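Just to illustrate the shape of that recommendation, a tiny hypothetical generator of such XML-compatible whitespace (sizes and line length taken from the quote above):

```python
def xmp_padding(size=2048, line_length=100):
    """Spaces with a newline roughly every 100 characters, as the XMP spec recommends."""
    full_lines, rest = divmod(size, line_length + 1)   # each full line is line_length spaces + "\n"
    return (" " * line_length + "\n") * full_lines + " " * rest
```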
Furthermore, this (meta)data should be part of an object stream, and therefore compressed, but this is not the case (is there a specific reason for this?)
Indeed, there is. The PDF document-wide metadata streams are intended to be readable also by applications that don't know the PDF format but do know the XMP format. Thus, these streams should not be compressed or encrypted.
...
I don't see a question in that item.
Added part
the position/offset of the trailer dictionary can vary (or can it? It seems that even when the trailer dictionary is part of the central directory stream, it is always at the end of the file, at least in all the PDFs I tested)
Well, as the stream in question contains cross-reference information for the objects in the PDF, it usually is only finished pretty late in the process of creating the PDF and, therefore, added pretty late to the PDF file. Thus, an end-ish position usually is to be expected.
The only thing I don't really understand is that for some reason the researchers of this study assumed that the trailer has a fixed size and a fixed position (the last 164 bytes of a file).
As already discussed, assuming a fixed position or length of the trailer in general is wrong.
If you wonder why they assumed such a fixed size nonetheless, you should ask them.
If I were to guess why they did, I'd assume that their set of 200 PDFs simply was not generic. In the paper they don't mention how they selected those PDFs, so maybe they used a batch they had at hand without checking how special or how generic it was. If those files were generated by the same PDF creator, chances indeed are that the trailers have a constant (or near-constant) length.
If this assumption is correct, i.e. if they worked with a not-generic set of test files only, then their results, in particular their entropy values and confidence intervals and the concluded quality of the approach, are questionable.
They also mention in Figure 8 that a PDF file encrypted by EasyCrypt has some structure in both the header and the trailer (which is why it has a lower entropy value compared to a PDF file encrypted with ransomware).
However, when I encrypt several PDF files (with different versions) with EasyCrypt (I tried three different symmetric encryption algorithms: AES 128-bit, AES 256-bit and RC2), I get fully encrypted files without any unencrypted structure/metadata (neither in the header nor in the trailer).
In the paper they show a hex dump of their file encrypted by EasyCrypt:
Here there is some metadata (albeit not PDF specific) that should show less entropy.
As your EasyCrypt encryption results differ, there appear to be different modes of using EasyCrypt, some of which add this header and some don't. Or maybe EasyCrypt used to add such headers but doesn't anymore.
Either way, this again indicates that the research behind the paper is not generic enough, taking just the output of one encryption tool in one mode (or in one version) as a representative example for data encrypted by non-ransomware.
Thus, the results of the article are of very questionable quality.
PDF has its own standardised format for encrypting files, but I don't really understand why they mention that EasyCrypt conforms to this standardised format
If I haven't missed anything, they merely mention that "A constant regularity exists in the header portion of the normally encrypted files"; they don't say that this constant regularity conforms to this standardised format.

Reading specific bytes of data from a large text file... quickly

For argument's sake, let's say you have a single, enormous file to hold your map save data. The game that comes to mind as a great example is Terraria. They save all MapWidth*MapHeight tile data within a single map file (a horrible idea, really), but they render only what is visible within the camera (and some outlying tiles for smoothness' sake) based on the camera position.
So my question is, "How can they search through all of that data in real time starting at the camera position?"
That would entail reading through potentially millions of tile entries just to get to the screen coordinates. I understand you could skip bytes of data based on the x/y coordinates if the tile data were a consistent size (this is all I can find in my week or so of searching), but that is where my problem lies. The tile data is dynamic. If one tile is empty, the data beyond "isValid" is nonexistent, so that is fewer bytes to read through. If a tile has water, multiple states, a background, etc., it contains all the data and is the largest in terms of bytes. So it is not constant at all. In that case we cannot just skip X bytes, as that amount changes constantly as tiles are modified.
My current solutions are: read it line by line (ugh), use chunk files, or ensure fixed line sizes (padding? wasted data... ugh).
I know chunks would be the best option, but being able to reach that deep into text files quickly would still be a nice thing to know.
If you have chunk-based data, you need a chunk-based reader, simple as that.
Additionally, if you're only interested in certain parts of the data and you can process it first, another option is to build a second file/list that stores the offsets to the start of every object in the first file.
In that case, whenever you need to reference an object, you look up its offset first and then jump straight to it in your original file. This still requires you to read through the whole file at least once.
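A minimal sketch of that offset-file idea in Python (the file names and the fixed 8-byte offset records are illustrative choices, not anything Terraria actually does):

```python
import struct

OFFSET_SIZE = 8  # one little-endian uint64 per record in the index file

def write_records(data_path, index_path, records):
    """Write variable-length records and a side index of their start offsets."""
    with open(data_path, "wb") as data, open(index_path, "wb") as index:
        for rec in records:
            index.write(struct.pack("<Q", data.tell()))
            data.write(rec)

def read_record(data_path, index_path, i):
    """Look up record i's offset (and the next one's), then seek straight to it."""
    with open(index_path, "rb") as index:
        index.seek(i * OFFSET_SIZE)
        raw = index.read(2 * OFFSET_SIZE)
    start = struct.unpack_from("<Q", raw, 0)[0]
    with open(data_path, "rb") as data:
        data.seek(start)
        if len(raw) == 2 * OFFSET_SIZE:                    # a following record exists
            end = struct.unpack_from("<Q", raw, OFFSET_SIZE)[0]
            return data.read(end - start)
        return data.read()                                 # last record: read to EOF
```

Because the index entries are fixed-size, finding record i's offset is itself a straight seek to byte i*8, no matter how the variable-length records grow or shrink.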

Is it possible to memory map a compressed file?

We have large files with zlib-compressed binary data that we would like to memory map.
Is it even possible to memory map such a compressed binary file and access those bytes in an effective manner?
Are we better off just decompressing the data, memory mapping it, then after we're done with our operations compress it again?
EDIT
I think I should probably mention that these files can be appended to at regular intervals.
Currently, this data on disk gets loaded via NSMutableData and decompressed. We then have some arbitrary read/write operations on this data. Finally, at some point we compress and write the data back to disk.
Memory mapping is all about the 1:1 mapping of memory to disk. That's not compatible with automatic decompression, since it breaks the 1:1 mapping.
I assume these files are read-only, since random-access writing to a compressed file is generally impractical. I would therefore assume that the files are somewhat static.
I believe this is a solvable problem, but it's not trivial, and you will need to understand the compression format. I don't know of any easily reusable software to solve it (though I'm sure many people have solved something like it in the past).
You could memory map the file and then provide a front-end adapter interface to fetch bytes at a given offset and length. You would scan the file once, decompressing as you went, and create a "table of contents" file that mapped periodic nominal offsets to real offsets (this is just an optimization; you could "discover" this table of contents as you fetched data). Then the algorithm would look something like:
Given nominal offset n, look up greatest real offset m that maps to less than n.
Read from m-32k into a buffer (32 KB is the largest back-reference distance allowed in DEFLATE).
Begin the DEFLATE algorithm at m. Count decompressed bytes until you get to n.
Obviously you'd want to cache your solutions. NSCache and NSPurgeableData are ideal for this. Doing this really well and maintaining good performance would be challenging, but if it's a key part of your application it could be very valuable.
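To illustrate the idea, here is a rough in-memory sketch in Python that uses zlib decompressor snapshots as the "table of contents" (a production version, e.g. along the lines of zlib's zran.c example, would persist the 32 KB window and bit offset instead of live objects; `compressed` can be an mmap of the file):

```python
import zlib

CHECKPOINT_EVERY = 1 << 20   # snapshot roughly every 1 MiB of decompressed output
CHUNK = 64 * 1024

def build_index(compressed):
    """One decompression pass; record (decompressed_offset, compressed_offset, state)."""
    d = zlib.decompressobj()
    index = [(0, 0, d.copy())]
    out_pos, next_mark = 0, CHECKPOINT_EVERY
    for i in range(0, len(compressed), CHUNK):
        out_pos += len(d.decompress(compressed[i:i + CHUNK]))
        if out_pos >= next_mark:
            index.append((out_pos, i + CHUNK, d.copy()))
            next_mark = out_pos + CHECKPOINT_EVERY
    return index

def read_at(compressed, index, offset, length):
    """Random access: resume from the nearest checkpoint at or before `offset`."""
    out_off, comp_off, state = max(
        (e for e in index if e[0] <= offset), key=lambda e: e[0])
    d = state.copy()                      # don't disturb the stored checkpoint
    buf, pos = bytearray(), comp_off
    while len(buf) < (offset - out_off) + length and pos < len(compressed):
        buf += d.decompress(compressed[pos:pos + CHUNK])
        pos += CHUNK
    return bytes(buf[offset - out_off : offset - out_off + length])
```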

Write multiple streams to a single file without knowing the length of the streams?

For performance of reading and writing a large dataset, we have multiple threads compressing and writing out separate files to a SAN. I'm making a new file spec that will instead have all these files appended together into a single file. I will refer to each of these smaller blocks of data as a subset.
Since each subset will be an unknown size after compression there is no way to know what byte offset to write to. Without compression each writer can write to a predictable address.
Is there a way to append files together on the file-system level without requiring a file copy?
I'll write an example here of how I would expect the result to be on disk. Although I'm not sure how helpful it is to write it this way.
single-dataset.raw
[header 512B][data1-45MB][data2-123MB][data3-4MB][data5-44MB]
I expect the SAN to be NTFS for now in case there are any special features of certain file-systems.
If I make the subsets small enough to fit into ram, I will know the size after compression, but keeping them smaller has other performance drawbacks.
Use sparse files. Just position each subset at some offset "guaranteed" to be beyond the last subset. Your header can then contain the offset of each subset and the filesystem handles the big "empty" chunks for you.
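For instance, a minimal sketch in Python of what each writer would do (the slot size, header size and pre-created file are assumptions for illustration; on NTFS you also have to mark the file as sparse, e.g. via FSCTL_SET_SPARSE or `fsutil sparse setflag`, so the holes don't consume disk space):

```python
HEADER_SIZE = 512
SLOT_SIZE = 256 * 1024 * 1024   # assumed upper bound per compressed subset

def write_subset(path, slot, data):
    """Each writer seeks to its pre-assigned slot and writes independently.

    Assumes the file (with its 512-byte header) has already been created and
    flagged as sparse on NTFS.
    """
    offset = HEADER_SIZE + slot * SLOT_SIZE
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
    return offset, len(data)    # record these in the header afterwards
```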
The cooler solution is to write out each subset as a separate file and then use low-level filesystem functions to join the files by chaining the first block of the next file to the last block of the previous file (along with deleting the directory entries for all but the first file).

Will a BLOB Column always consume the defined size even if less data is inserted?

I need to store images in a DB2 BLOB field. The average image size is about 200 KB, but in rare cases there will be images of 2-4 MB. I don't want to reject these images, so I guess I'd define a BLOB(5M). Is this okay to do, or will this BLOB always consume the 5 MB even if most of it is unused?
What is the common way to deal with the Blob size if it is hard to find an average?
The BLOB will only use as much space as necessary. There is no overhead in defining a large maximum (think of it as a "constraint" rather than a physical allocation).