PDF Entropy calculation

Last time mkl helped me a lot; hopefully he (or someone else) can help me with these questions too. Unfortunately I couldn't get access to the ISO standard (ISO 32000-1 or 32000-2).
Are these bytes used for padding? I have tried several files, and they all contain these padding characters. This is quite remarkable, as I would expect such a substantial amount of low-entropy bytes to significantly lower the average entropy of the PDF file. However, this does not seem to be the case: the average entropy of a PDF file is almost eight bits per byte.
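For reference, this is a minimal sketch of how such a byte-level Shannon entropy value can be computed; the file name is just a placeholder:

```python
import math
from collections import Counter

def byte_entropy(path):
    """Shannon entropy of the file's byte distribution, in bits per byte (0..8)."""
    data = open(path, "rb").read()
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# print(byte_entropy("sample.pdf"))  # hypothetical file
```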
Furthermore, this (meta)data should be part of an object stream, and therefore compressed, but this is not the case (is there a specific reason for this?). (Magenta = high entropy/random; the darker the color, the lower the entropy. I generated this image with http://binvis.io/#/)
These are the entropy values of a .doc file (**not** .docx) that I converted to a PDF with version 1.4, as this version should not contain object streams etc. However, the entropy values of this file are still quite high. I would have thought that a PDF with version < 1.5 would have a lower entropy value on average, as it does not use object streams, but the results are similar to those of a PDF with version 1.5.
I hope somebody can help me with these questions. Thank you.
Added part:
The trailer dictionary has a variable length, and with PDF 1.5 (or higher) it can be part of the cross-reference stream, so not only the length but also the position/offset of the trailer dictionary can vary (or can it? It seems that even if the trailer dictionary is part of the cross-reference stream, it is always at the end of the file, at least in all the PDFs I tested). The only thing I don't really understand is that for some reason the researchers of this study assumed that the trailer has a fixed size and a fixed position (the last 164 bytes of a file).
They also mention in Figure 8 that a PDF file encrypted by EasyCrypt has some structure in both the header and the trailer (which is why it has a lower entropy value compared to a PDF file encrypted with ransomware).
However, when I encrypt several PDF files (with different versions) with EasyCrypt (I tried three different symmetric encryption algorithms: AES 128-bit, AES 256-bit and RC2), I get a fully encrypted file, without any unencrypted structure/metadata (neither in the header nor in the trailer). When I encrypt a file with Adobe Acrobat Pro, on the other hand, the structure of the PDF file is preserved. This makes sense, since the PDF format has its own standardised way of encrypting files, but I don't really understand why they mention that EasyCrypt conforms to this standardised format.
PDF Header encrypted with EasyCrypt:
PDF Header encrypted with Adobe Acrobat Pro:

Unfortunately I couldn't get access to the ISO standard (ISO 32000-1 or 32000-2).
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
Are these bytes used for padding?
Those bytes are part of a metadata stream. The format of the metadata is XMP. According to the XMP spec:
**Padding:** It is recommended that applications allocate 2 KB to 4 KB of padding to the packet. This allows the XMP to be edited in place, and expanded if necessary, without overwriting existing application data. The padding must be XML-compatible whitespace; the recommended practice is to use the space character (U+0020) in the appropriate encoding, with a newline about every 100 characters.
So yes, these bytes are used for padding.
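If you want to check this in your own files, here is a rough sketch (assuming the XMP packet is stored uncompressed, as discussed below) that locates the packet trailer and counts the whitespace directly in front of it; the function name is made up:

```python
def xmp_padding_bytes(path):
    """Rough count of the whitespace padding before the '<?xpacket end' trailer."""
    data = open(path, "rb").read()
    end = data.find(b"<?xpacket end")
    if end < 0:
        return None  # no uncompressed XMP packet found
    run = 0
    # count the run of XML whitespace characters preceding the packet trailer
    while run < end and data[end - 1 - run] in b" \t\r\n":
        run += 1
    return run

# print(xmp_padding_bytes("sample.pdf"))  # hypothetical file
```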
Furthermore, this (meta)data should be part of an object stream, and therefore compressed, but this is not the case (is there a specific reason for this?)
Indeed, there is. The PDF document-wide metadata streams are intended to be readable also by applications that don't know the PDF format but do know the XMP format. Thus, these streams should not be compressed or encrypted.
...
I don't see a question in that item.
Added part
the position/offset of the trailer dictionary can vary (or can it? It seems that even if the trailer dictionary is part of the cross-reference stream, it is always at the end of the file, at least in all the PDFs I tested)
Well, as the stream in question contains cross-reference information for the objects in the PDF, it usually is only finished pretty late in the process of creating the PDF and, therefore, added pretty late to the PDF file. Thus, a position near the end of the file is usually to be expected.
The only thing I don't really understand is that for some reason the researchers of this study assumed that the trailer has a fixed size and a fixed position (the last 164 bytes of a file).
As already discussed, assuming a fixed position or length of the trailer in general is wrong.
If you wonder why they assumed such a fixed size nonetheless, you should ask them.
If I were to guess why they did, I'd assume that their set of 200 PDFs simply was not generic. In the paper they don't mention how they selected those PDFs, so maybe they used a batch they had at hand without checking how special or how generic it was. If those files were generated by the same PDF creator, chances indeed are that the trailers have a constant (or near constant) length.
If this assumption is correct, i.e. if they worked only with a non-generic set of test files, then their results, in particular their entropy values and confidence intervals and the concluded quality of the approach, are questionable.
They also mention in Figure 8 that a PDF file encrypted by EasyCrypt has some structure in both the header and the trailer (which is why it has a lower entropy value compared to a PDF file encrypted with ransomware).
However, when I encrypt several PDF files (with different versions) with EasyCrypt (I tried three different symmetric encryption algorithms: AES 128-bit, AES 256-bit and RC2), I get a fully encrypted file, without any unencrypted structure/metadata (neither in the header nor in the trailer).
In the paper they show a hex dump of their file encrypted by EasyCrypt:
Here there is some metadata (albeit not PDF-specific) that should show lower entropy.
As your EasyCrypt encryption results differ, there appear to be different modes of using EasyCrypt, some of which add this header and some of which don't. Or maybe EasyCrypt used to add such headers but doesn't anymore.
Either way, this again indicates that the research behind the paper is not generic enough, taking just the output of one encryption tool in one mode (or in one version) as a representative example of data encrypted by non-ransomware.
Thus, the results of the article are of very questionable quality.
the PDF format has its own standardised way of encrypting files, but I don't really understand why they mention that EasyCrypt conforms to this standardised format
If I haven't missed anything, they merely mention that "a constant regularity exists in the header portion of the normally encrypted files"; they don't say that this constant regularity conforms to this standardised format.

Related

Does PDF support data degradation protection?

So we can add signatures to PDF files, which sign the content hash of the document.
However, if one bit flips due to bitrot, the file will be corrupt and the signature worthless.
Does PDF have some built in data integrity protection that would allow it to repair bitrot to a certain degree?
I'm aware that this can be achieved on a filesystem level, but I wonder if the PDF format itself also has facilities for this, and if so, how they can be enabled and whether they are included in PDF/A?
Does PDF have some built in data integrity protection that would allow it to repair bitrot to a certain degree?
No. Quite the contrary: data streams in PDFs may be (and often are) compressed using FLATE. In an uncompressed content stream, a bit flip usually damages only an instruction or two, often affecting only small parts of the page rendering. But in a compressed content stream it usually damages all instructions from the flip onward. If this happens early in the stream, the whole page cannot be rendered anymore.
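To illustrate the difference, here is a small sketch using zlib (which implements FLATE) on made-up content stream data; it is not tied to any particular PDF library:

```python
import zlib

# Made-up, repetitive "content stream" data as a stand-in for real page content.
plain = b"1 0 0 1 100 700 cm BT /F1 12 Tf (Hello) Tj ET\n" * 50

# A bit flip in the *uncompressed* stream changes exactly one byte,
# i.e. it damages at most one or two instructions.
damaged = bytearray(plain)
damaged[10] ^= 0x01
print(sum(a != b for a, b in zip(plain, damaged)), "byte(s) changed")

# A bit flip in the *FLATE-compressed* stream typically breaks everything
# from that point on; zlib usually refuses to decompress the data at all.
compressed = bytearray(zlib.compress(plain))
compressed[10] ^= 0x01
try:
    zlib.decompress(bytes(compressed))
except zlib.error as e:
    print("decompression failed:", e)
```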

For linearized PDF how to determine the length of the cross-reference stream in advance?

When generating a linearized PDF, a cross-reference table should be stored at the very beginning of the file. If it is a cross-reference stream, this means the content of the table will be compressed, and the actual size of the cross-reference stream after compression is unpredictable.
So my question is:
How to determine the actual size of this cross-reference stream in advance?
If the actual size of the stream is unpredictable, then after the object offsets are written into the stream and the stream is written into the file, the actual offsets of the following objects will change again, won't they? Am I missing something here?
Any hints are appreciated.
How to determine the actual size of this cross-reference stream in advance?
First of all you don't. At least not exactly. You described why.
But it suffices to have an estimate. Just add some bytes to the estimate and later on pad with whitespace. #VadimR pointed out that such padding can regularly be observed in linearized PDFs.
You can either use a rough estimate as in the QPDF source #VadimR referenced or you can try for a better one.
You could, e.g. make use of predictors:
At the time you eventually have to create the cross-reference streams, all PDF objects can already be serialized in the order you need, with the exception of the cross-reference streams and the linearization dictionary (which contains the final size of the PDF and some object offsets). Thus, you already know the differences between consecutive xref entry values for most of the entries.
If you use the Up predictor, you essentially only store those differences. So you already know most of the data to compress, and changes in a few entries won't change the compressed result too much. This probably gives you a better estimate.
Furthermore, as the first cross reference stream does not contain too many entries in general, you can try compressing that stream multiple times for different numbers of reserved bytes.
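A minimal sketch of that estimation idea, with made-up entry widths and offsets (this is not any particular library's API, just an illustration of compressing the already-known rows with the Up predictor and reserving some slack):

```python
import zlib

def up_predict(rows):
    """Apply the PNG 'Up' predictor (filter type 2) to fixed-width xref rows."""
    width = len(rows[0])
    prev = bytes(width)
    out = bytearray()
    for row in rows:
        out.append(2)  # PNG filter type byte: 2 = Up
        out.extend((row[i] - prev[i]) & 0xFF for i in range(width))
        prev = row
    return bytes(out)

def pack_entry(entry_type, offset, gen):
    """Pack one xref entry as 1 + 4 + 2 bytes (hypothetical /W [1 4 2] layout)."""
    return bytes([entry_type]) + offset.to_bytes(4, "big") + gen.to_bytes(2, "big")

# Rows for the objects whose offsets are already known (made-up numbers);
# the entries for the xref stream itself and the linearization dictionary
# are still unknown, so add some slack and pad the final stream with whitespace.
known_rows = [pack_entry(1, 17 + 80 * i, 0) for i in range(1000)]
estimate = len(zlib.compress(up_predict(known_rows), 9))
reserved = estimate + 32
print("estimated compressed size:", estimate, "reserving:", reserved)
```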
PS: I have no idea what Adobe uses in their linearization code. And I don't know whether it makes sense to fight for a few bytes more or less here; after all, linearization is most sensible for big documents, for which a few bytes more or less hardly count.

Are all PDF files compressed?

So there are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs as PDFs are already compressed.
My question is: Is this true for all PDFs including older version of the format?
Also I'm sure it's possible for someone (an idiot maybe) to place bitmaps into the PDF rather than JPEGs etc. Our company has a lot of PDFs in its DBs (some in older formats maybe). We are considering using gzip to compress during transmission but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
So to answer your question: no, you cannot assume that gzip compression adds only hassle and no benefit. You have to test it with a representative sample set of your files. Just gzip them and take note of the time used and of the space saved.
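A quick sketch of such a test (the directory and file pattern are placeholders):

```python
import glob
import gzip
import time

total_in = total_out = 0
start = time.time()
for path in glob.glob("samples/*.pdf"):  # hypothetical sample set
    data = open(path, "rb").read()
    total_in += len(data)
    total_out += len(gzip.compress(data))
elapsed = time.time() - start

if total_in:
    saved = 100 * (1 - total_out / total_in)
    print(f"gzip saved {saved:.1f}% over {total_in} bytes in {elapsed:.1f}s")
```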
It also depends on the type of PDF producing software which was used...
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.

Good library for Digital watermarking

Can somebody help me find a library, or a detailed description of an algorithm, that can embed a digital watermark (an invisible watermark, i.e. a kind of steganography) into a JPEG/PNG file? The algorithm needs to be of good quality: it should be possible to extract the mark after rotation and resizing (if possible) of the image.
The mark is just a 32-byte key.
I found a good site, but the algorithms there are made for the NetPBM format, which is dead...
I know there is the LSB method, but it does not survive resizing. Is there something better?
Changing the metadata is not suitable, because such changes are plainly visible.
This maybe won't really be an answer, as I don't think it is easy to give a magical, precise answer to this question. Watermarking is complex, and the best way to do it is by yourself: this will make things harder for an attacker trying to reverse engineer your code. One could even read your question here, guess which library you used, and attack your system more easily.
Making steganography resistant to resizing in JPEG images is also very hard, because the JPEG compression is reapplied after the resizing. There are in fact a bunch of JPEG steganography algorithms. Which one you should use depends on what exactly you require:
Data confidentiality ?
Message presence confidentiality ?
Message coherence after JPEG changes ?
Resistance to "Known Cover" attacks (when attackers try to find the message, based on the steganographic system) ?
Resistance to "Known Message" attacks (when attackers try to find the steganographic system used, based on the message) ?
From what I know, algorithms that resist JPEG changes (picture recompression) are usually much easier to attack, whereas algorithms that run the "encode" stage during the JPEG compression (after the DCT (lossy) transform and before the Huffman (lossless) coding) are more likely to resist.
Also, one key factor in steganography is scale: if you have only 32 bytes of data to encode in a, say, 256*256 px image, don't use an algorithm that can encode 512 bytes of data in an image of the same size. Either use a scalable algorithm, or use an algorithm at its efficient scale.
Also, the best way to do good steganography is to know its limitations, and to know how steganalyzers work. Try these tools, so you can understand what attackers will do to your picture.
Now, I cannot tell you what steganographic system will be the best for you, but I can give you some indications :
jSteg - quite old, I don't think it will resist JPEG changes
OutGuess - Quite old too, but one of the best algorithms
F5 (and F3/F4) - more recent, a good algorithm, with scientific research behind it.
stegHide
I think all of these are LSB based: the encoding is done during the JPEG compression, after the DCT and quantization. The only non-LSB-based steganography system I have heard of is mentioned in this research paper; however, I have not read it to the end yet, so I cannot tell whether it will meet your needs.
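As a toy illustration of that DCT-domain LSB idea (a made-up uniform quantization step, a single 8x8 block, and not the actual code of any of the tools above):

```python
import numpy as np
from scipy.fftpack import dct, idct

Q = 16  # made-up uniform quantization step

def dct2(block):
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def idct2(block):
    return idct(idct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def embed_bit(block, bit, pos=(2, 1)):
    """Set the LSB of one quantized mid-frequency DCT coefficient to `bit`."""
    coeffs = np.round(dct2(block.astype(float) - 128) / Q).astype(int)
    coeffs[pos] = (coeffs[pos] & ~1) | bit
    return np.clip(idct2(coeffs * Q) + 128, 0, 255).astype(np.uint8)

def extract_bit(block, pos=(2, 1)):
    coeffs = np.round(dct2(block.astype(float) - 128) / Q).astype(int)
    return int(coeffs[pos]) & 1

block = np.random.randint(0, 256, (8, 8), dtype=np.uint8)  # made-up cover block
print(extract_bit(embed_bit(block, 1)))  # usually 1, unless requantization flips it
```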
However, I'm not sure there exists a real steganography algorithm that resists JPEG compression, JPEG resizing and rotation, and visual and statistical attacks all at once. Or I'm not aware of it.
Sorry for the lack of precise answer, I tried to give you what I know on the subject, as it's always better to be more informed. Sorry also for the lack of proper English, I'm French, nobody's perfect :)
Pistache is right in what he told you regarding the watermarking implementation algorithms. I will try to help you by showing one algorithm for the given requirements.
Before explaining the algorithm, I think the distinction between the JPEG and PNG formats should be made first.
JPEG is a lossy format, i.e. the images are subject to a compression that can remove the watermark. When you open an image for manipulation and save it again, the writing procedure applies a DCT-based compression that discards some components of the image.
The PNG format, on the other hand, is lossless, which means that images are not subject to this kind of compression when stored after manipulation.
As a matter of fact, JPEG compression is used as an attack on watermarking schemes, because its lossy nature can remove the watermark if an attacker recompresses the image.
Now that you know the difference between both formats, I can tell you a suitable algorithm resistant to the attacks that you mentioned.
As a method to embed a watermark message in PNG files, you can use histogram embedding. The histogram embedding method changes values in the histogram by exchanging values between neighbouring bins. For example, imagine that you have a grayscale PNG image.
You then have only one channel for embedding, which means one histogram with 256 bins. By selecting the neighbouring bins x and x+1, you change their counts by moving pixels with value x to x+1, or the other way around, so that count(x)/count(x+1) > T to embed a '1', or count(x+1)/count(x) > T to embed a '0'.
You can repeat the same procedure along the whole histogram and therefore embed, in the best case, up to 128 bits. However, this payload is less than what you asked for, so I suggest splitting the image into parts, for example blocks; if you split one image into 4 components, you would be able to embed, in the best case, up to 512 bits, which means 64 bytes.
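A toy sketch of that bin-exchange idea for a single bit in a grayscale image (the bin index x, the threshold T and the function names are made up for illustration):

```python
import numpy as np

def embed_bit_hist(img, bit, x=100, T=1.1):
    """Embed one bit by moving pixels between the neighbouring histogram
    bins x and x+1 until the ratio required by the bit exceeds T."""
    img = img.copy()
    # bit 1: grow bin x at the expense of bin x+1; bit 0: the other way around
    src, dst = (x + 1, x) if bit == 1 else (x, x + 1)
    while True:
        hist = np.bincount(img.ravel(), minlength=256)
        if hist[src] == 0 or hist[dst] / hist[src] > T:
            return img
        row, col = np.argwhere(img == src)[0]
        img[row, col] = dst  # move one pixel into the bin that encodes the bit

def extract_bit_hist(img, x=100):
    hist = np.bincount(img.ravel(), minlength=256)
    return 1 if hist[x] > hist[x + 1] else 0

cover = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # made-up cover image
print(extract_bit_hist(embed_bit_hist(cover, 1)))  # prints 1
```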
This method, however, is very susceptible to filtering and compression if applied directly in the spatial domain. Therefore, I suggest first computing the DWT of the image and embedding in its low-frequency sub-band. This gives you better transparency and increased robustness against warping, resizing and similar attacks, as well as against compression and filtering.
There are other approaches such as LPM (Log-Polar Maps), but they are very complex to implement, and I think this approach would be fine for your case.
I can suggest two papers. The first is:
Watermarking digital image and video data. A state-of-the-art overview
This paper will give you some basic notions of watermarking and explain the LSB algorithm in more detail. The second paper is:
Real-Time Compressed-Domain Video Watermarking Resistance to Geometric Distortions
This paper explains the algorithm that I just described.
Cheers,
I do not know if you are considering approaches other than steganography. Instead of storing hidden data in the pixel data, you could create a new data block in the JPEG file and store encrypted data there.
Take a look at the JPEG file structure on Wikipedia
You can create an application-specific data block, using a marker 0xFF 0xEn. This way, any change to the image pixels does not change the information stored in the file. Moreover, many image editing programs respect custom data blocks and will keep them even after image manipulation.
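A minimal sketch of inserting such an application-specific segment (here APP15, marker 0xFF 0xEF) directly after the SOI marker; the function name and payload are placeholders:

```python
def add_app15_segment(jpeg_bytes, payload):
    """Insert a custom APP15 (0xFF 0xEF) segment right after the SOI marker.
    The two length bytes count themselves plus the payload (max 65533 bytes)."""
    assert jpeg_bytes[:2] == b"\xff\xd8", "not a JPEG (missing SOI marker)"
    segment = b"\xff\xef" + (len(payload) + 2).to_bytes(2, "big") + payload
    return jpeg_bytes[:2] + segment + jpeg_bytes[2:]

# Hypothetical usage:
# data = open("photo.jpg", "rb").read()
# open("photo_marked.jpg", "wb").write(add_app15_segment(data, b"32-byte key goes here..."))
```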

Encrypt / Decrypt uidata with "homemade" algorithm

I'm just working on an algorithm, and so far I can encrypt and decrypt a number, which works fine. My question now is: how do I go about encrypting an image? What does the UIdata look like, and should I convert the image to that before I start? I've never done anything on this level in terms of encryption, and any input would be great! Thanks!
You'll probably want to encrypt in small chunks - perhaps a byte or a word/int (4 bytes), maybe even a long (8 bytes) at a time, depending on how your algorithm is implemented.
I don't know the signature of your algorithm (i.e. what types of input it takes and what types of output it gives), but the most common ciphers are block ciphers, i.e. algorithms which take an input block of some fixed size (nowadays 128 bits = 16 bytes is a common block size) and produce a same-sized output, in addition to a key input (which should also be at least 128 bits).
To encrypt longer pieces of data (and actually also for short pieces, if you send multiple such pieces with the same key), you use a mode of operation (and probably additionally a padding scheme). This gives you an algorithm (or a pair of such) with an arbitrary-length plaintext input and a slightly bigger ciphertext output (which the decryption algorithm then undoes).
Some hints:
Don't use ECB mode (i.e. simply encrypting each block independently of the others).
You probably also should apply a MAC, to protect your data against malicious modifications (and also against breaking of the encryption scheme by chosen-ciphertext attacks). Some modes of operation already include a MAC; see the sketch below.
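For comparison, here is a sketch of what those hints look like when you use an established authenticated mode instead of a homemade cipher (AES-GCM, which already includes a MAC), via the `cryptography` package; the file name is a placeholder:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)     # 128-bit key, as discussed above
aesgcm = AESGCM(key)

image_bytes = open("photo.png", "rb").read()  # hypothetical input image
nonce = os.urandom(12)                        # never reuse a nonce with the same key
ciphertext = aesgcm.encrypt(nonce, image_bytes, None)  # auth tag is appended
assert aesgcm.decrypt(nonce, ciphertext, None) == image_bytes
```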