So we can add signatures to PDF files, which sign a hash of the document's content.
However, if one bit flips due to bitrot, the file will be corrupted and the signature worthless.
Does PDF have some built in data integrity protection that would allow it to repair bitrot to a certain degree?
I'm aware that this can be achieved on a filesystem level, but I wonder if the PDF format itself also has facilities for this, and if so, how they can be enabled and whether they are included in PDF/A?
Does PDF have some built in data integrity protection that would allow it to repair bitrot to a certain degree?
No. Quite the contrary: data streams in PDFs may be (and often are) compressed using FLATE. In an uncompressed content stream a bit flip usually damages only an instruction or two, often affecting just a small part of the page rendering. In a compressed content stream, however, it usually damages all instructions from the flip onward. If this happens early in the stream, the whole page cannot be rendered anymore.
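To see the effect, you can flip a single bit in a FLATE-compressed stream and try to inflate it again. A minimal sketch using Python's zlib (the sample content is made up, not taken from a real PDF):

```python
import zlib

# A stand-in for a content stream: many similar PDF-style drawing instructions.
content = b"\n".join(b"1 0 0 1 %d 700 cm BT /F1 12 Tf (Hello) Tj ET" % i
                     for i in range(200))

compressed = bytearray(zlib.compress(content))

# Flip one bit early in the compressed data (past the 2-byte zlib header).
compressed[10] ^= 0x01

try:
    zlib.decompress(bytes(compressed))
except zlib.error as e:
    # Typically everything from the damaged deflate block onward is lost.
    print("decompression failed:", e)
```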
Last time mkl helped me a lot; hopefully he (or someone else) can help me with these questions too. Unfortunately I couldn't get access to the ISO standard (ISO 32000-1 or 32000-2).
Are these bytes used for padding? I have tried several files, and they all contain these padding characters. This is quite remarkable, as I would expect such a substantial amount of low-entropy bytes to significantly lower the average entropy of the PDF file. However, this does not seem to be the case, as the average entropy of a PDF file is almost eight bits per byte.
Furthermore, this (meta)data should be part of an object stream, and therefore compressed, but this is not the case (is there a specific reason for this?). (Magenta = high entropy/random; the darker the color, the lower the entropy. I generated this image with http://binvis.io/#/)
These are the entropy values of a .doc file (not .docx) that I converted to a PDF with version 1.4, as this version should not contain object streams etc. However, the entropy values of this file are still quite high. I would expect a PDF with version < 1.5 to have a lower entropy value on average, as it does not use object streams, but the results are similar to those of a PDF with version 1.5.
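For reference, the kind of byte-level Shannon entropy discussed here can be computed with a short script like the following (a sketch; the file name is a placeholder):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

with open("example.pdf", "rb") as f:
    data = f.read()

# A mostly-compressed PDF typically comes out close to 8 bits per byte,
# even with a few kilobytes of whitespace padding in its XMP packet.
print(f"average entropy: {shannon_entropy(data):.3f} bits/byte")
```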
I hope somebody can help me with these questions. Thank you.
Added part:
The trailer dictionary has a variable length, and with PDF 1.5 (or higher) it can be part of the central directory stream, so not only the length but also the position/offset of the trailer dictionary can vary (or can it? It seems that even if the trailer dictionary is part of the central directory stream, it is always at the end of the file, at least in all the PDFs I tested). The only thing I don't really understand is that for some reason the researchers of this study assumed that the trailer has a fixed size and a fixed position (the last 164 bytes of a file).
They also mention in Figure 8 that a PDF file encrypted by EasyCrypt has some structure in both the header and the trailer (which is why it has a lower entropy value compared to a PDF file encrypted with ransomware).
However, when I encrypt several PDF files (with different versions) with EasyCrypt (I tried three different symmetric encryption algorithms: AES-128, AES-256 and RC2), I get a fully encrypted file without any unencrypted structure/metadata, neither in the header nor in the trailer. When I encrypt a file with Adobe Acrobat Pro, on the other hand, the structure of the PDF file is preserved. This makes sense, since the PDF format has its own standardised way of encrypting files, but I don't really understand why they mention that EasyCrypt conforms to this standardised format.
PDF Header encrypted with EasyCrypt:
PDF Header encrypted with Adobe Acrobat Pro:
Unfortunately I couldn't get access to the ISO standard (ISO 32000-1 or 32000-2).
ISO 32000-1 is available free of charge from Adobe: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
Are these bytes used for padding?
Those bytes are part of a metadata stream. The format of the metadata is XMP. According to the XMP spec:
Padding: It is recommended that applications allocate 2 KB to 4 KB of padding to the packet. This allows the XMP to be edited in place, and expanded if necessary, without overwriting existing application data. The padding must be XML-compatible whitespace; the recommended practice is to use the space character (U+0020) in the appropriate encoding, with a newline about every 100 characters.
So yes, these bytes are used for padding.
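As an illustration, padding that follows this recommendation could be generated like this (a sketch; the 2 KB size and the 100-character line length are simply the values suggested by the spec):

```python
# Build ~2 KB of XML-compatible whitespace: space characters (U+0020)
# with a newline roughly every 100 characters, as the XMP spec recommends.
line = " " * 99 + "\n"                    # 100 bytes per line
padding = line * (2048 // len(line))

# This padding sits inside the XMP packet, before the trailing
# '<?xpacket end="w"?>' processing instruction, so the metadata can later
# be edited in place without shifting the rest of the PDF.
print(len(padding), "bytes of padding")
```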
Furthermore, this (meta)data should be part of an object stream, and therefore compressed, but this is not the case (is there a specific reason for this?)
Indeed, there is. PDF document-wide metadata streams are intended to be readable also by applications that don't know the PDF format but do know the XMP format. Thus, these streams should not be compressed or encrypted.
...
I don't see a question in that item.
Added part
the position/offset of the trailer dictionary can vary (or can it? It seems that even if the trailer dictionary is part of the central directory stream, it is always at the end of the file, at least in all the PDFs I tested)
Well, as the stream in question contains cross-reference information for the objects in the PDF, it usually is finished only pretty late in the process of creating the PDF and is, therefore, added pretty late to the PDF file. Thus, a position near the end usually is to be expected.
The only thing I don't really understand is that for some reason the researchers of this study assumed that the trailer has a fixed size and a fixed position (the last 164 bytes of a file).
As already discussed, assuming a fixed position or length of the trailer in general is wrong.
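You can check this for your own files by inspecting the tail of each PDF: the startxref value, and with it the position and size of the trailing section, differs from file to file. A rough sketch (the file name is a placeholder):

```python
import re

def tail_info(path, tail_size=2048):
    """Return (file size, offset given after the last 'startxref' keyword)."""
    with open(path, "rb") as f:
        data = f.read()
    # The last startxref keyword is followed by the byte offset of the
    # last cross-reference section (table or stream).
    matches = re.findall(rb"startxref\s+(\d+)", data[-tail_size:])
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return len(data), int(matches[-1])

size, xref_offset = tail_info("example.pdf")
print(f"file size: {size}, last cross-reference section at offset {xref_offset}")
print(f"bytes from there to the end of the file: {size - xref_offset}")
```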
If you wonder why they assumed such a fixed size nonetheless, you should ask them.
If I were to guess why they did, I'd assume that their set of 200 PDFs simply was not generic. In the paper they don't mention how they selected those PDFs, so maybe they used a batch they had at hand without checking how special or how generic it was. If those files were generated by the same PDF creator, chances are indeed that the trailers have a constant (or near constant) length.
If this assumption is correct, i.e. if they worked only with a non-generic set of test files, then their results, in particular their entropy values and confidence intervals and the concluded quality of the approach, are questionable.
They also mention in Figure 8 that a PDF file encrypted by EasyCrypt has some structure in both the header and the trailer (which is why it has a lower entropy value compared to a PDF file encrypted with ransomware).
However, when I encrypt several PDF files (with different versions) with EasyCrypt (I tried three different symmetric encryption algorithms: AES-128, AES-256 and RC2), I get a fully encrypted file without any unencrypted structure/metadata, neither in the header nor in the trailer.
In the paper they show a hex dump of their file encrypted by EasyCrypt:
Here there is some metadata (albeit not PDF-specific) that should show lower entropy.
As your EasyCrypt encryption results differ, there appear to be different modes of using EasyCrypt, some of which add this header and some don't. Or maybe EasyCrypt used to add such headers but doesn't anymore.
Either way, this again indicates that the research behind the paper is not generic enough, taking just the output of one encryption tool in one mode (or in one version) as a representative example of data encrypted by something other than ransomware.
Thus, the results of the article are of very questionable quality.
the PDF format has its own standardised way of encrypting files, but I don't really understand why they mention that EasyCrypt conforms to this standardised format.
If I haven't missed anything, they merely mention that "a constant regularity exists in the header portion of the normally encrypted files"; they don't say that this constant regularity conforms to this standardised format.
We are trying to figure out the best way to create a web service that delivers high-quality textbooks to remote tablets and desktop clients. The books are copyrighted and sold to users, so the delivery must be protected as much as possible against copying. The books' layout is very complicated, with lots of images, pictures, textures, tables, diagrams and the like. They are produced by InDesign in PDF format.
So far, our best guess is to store the PDF in single pages (a PDF per page) and scramble them with asymmetric keys, so all the decryption can be processed in memory with no temporary file generated.
Our concern is that PDF is a proprietary format and sometimes the file is too big (quality is an important concern for the client).
Is there any Open Source alternative to PDF, capable of delivering high quality, complicated layouts in smaller files?
If the books are to be viewable offline, your only option is to encrypt the documents and issue licence keys that unlock them for viewing.
There are commercial packages that will allow you to do this, enabling you to limit the licence to a machine, a user or a time period.
Ultimately you can't stop people coming up with ingenious ways of copying it, just make it more difficult.
You can use high-quality raster images as a PDF alternative.
Are there any PDF tools that generate information regarding the loading time and memory usage needed to display a PDF in a browser, and also the total number of elements inside the PDF?
Unfortunately not really. I've done some of this research, not for PDF in a browser but for (and perhaps this is what you are looking at as well) PDF on mobile devices.
There are a number of factors that contribute and that to some extent can be tested for:
Whether or not big images exist in the PDF and what resolution they are. This is linked directly to memory usage.
What compression method is used for image compression. Decompressing JPEG-2000 images specifically can increase load time significantly. Even worse, as JPEG-2000 can be progressively decompressed, it can give the appearance of a really bad PDF until the images have been fully decompressed and loaded (this looks especially ugly on somewhat older tablets, for example).
How complex the transparency effects are that are used in the document.
How many fonts are used in the document.
How many line-art objects (vector elements) with a large number of nodes (points) are used on a page.
You can test what is in the document using Acrobat Pro to some extent (there is a well-hidden tool when you save an optimised PDF file that can audit what objects use how much of the space in a PDF document). You can also use a preflight solution such as pdfToolbox from callas (I'm affiliated with this company) or pitstop from enfocus; these tools would allow you to get a report with the results of custom checks such as image resolution, compression, vector objects, color spaces etc.
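If you would rather script such an audit, a rough starting point could look like the following (a sketch using the pikepdf library, assuming it is installed; the file name is a placeholder):

```python
import pikepdf

with pikepdf.open("input.pdf") as pdf:
    for page_no, page in enumerate(pdf.pages, start=1):
        # page.images maps resource names to image XObject streams.
        for name, image in page.images.items():
            width = int(image.Width)
            height = int(image.Height)
            # /Filter reveals the compression used: FlateDecode, DCTDecode (JPEG),
            # JPXDecode (JPEG 2000), JBIG2Decode, ...
            filt = image.get("/Filter", "uncompressed")
            print(f"page {page_no}: {name} {width}x{height}, filter {filt}")
```

This only covers the image-related factors; transparency, font count and vector complexity would still need a preflight tool or manual inspection.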
So there are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs as PDFs are already compressed.
My question is: Is this true for all PDFs including older version of the format?
Also, I'm sure it's possible for someone (an idiot maybe) to place bitmaps into the PDF rather than JPEGs etc. Our company has a lot of PDFs in its DBs (some older formats maybe). We are considering using gzip to compress during transmission but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
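qpdf, for example, can do this; a sketch of calling it from a script (assuming qpdf is installed and on the PATH):

```python
import subprocess

# Rewrite the PDF with all stream data uncompressed and object streams
# expanded back into individual objects; the result renders identically.
subprocess.run(
    ["qpdf", "--stream-data=uncompress", "--object-streams=disable",
     "input.pdf", "uncompressed.pdf"],
    check=True,
)
```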
So to answer your question: No, you cannot assume that a gzip compression is adding only hassle and no benefit. You have to test it with a representative sample set of your files. Just gzip them and take note of the time used and of the space saved.
It also depends on which PDF-producing software was used...
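A quick way to run that test over a representative sample set could look like this (a sketch; the directory path is a placeholder):

```python
import gzip
import time
from pathlib import Path

total_in = total_out = 0
start = time.perf_counter()

for path in Path("sample_pdfs").glob("*.pdf"):
    data = path.read_bytes()
    compressed = gzip.compress(data)   # roughly what HTTP gzip transfer would do
    total_in += len(data)
    total_out += len(compressed)

elapsed = time.perf_counter() - start
if total_in:
    print(f"original: {total_in} bytes, gzipped: {total_out} bytes")
    print(f"saved {100 * (1 - total_out / total_in):.1f}% in {elapsed:.2f} s")
```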
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
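For the in-format route, a library such as pikepdf (which wraps qpdf) can at least recompress streams, pack objects into object streams and drop unreferenced resources; a sketch, assuming pikepdf is installed (downsampling images and recoding them as JBIG2/JPEG would still need a dedicated optimizer):

```python
import pikepdf

with pikepdf.open("input.pdf") as pdf:
    # Drop resources (fonts, images, ...) that no page actually references.
    pdf.remove_unreferenced_resources()
    pdf.save(
        "optimized.pdf",
        compress_streams=True,                                 # FLATE any uncompressed streams
        recompress_flate=True,                                 # re-deflate existing FLATE streams harder
        object_stream_mode=pikepdf.ObjectStreamMode.generate,  # pack objects into object streams
    )
```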
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.
I have an input PDF file (usually, but not always, generated by pdfTeX) which I want to convert to an output PDF that is visually equivalent (no matter the resolution) and has the same metadata (Unicode text info, hyperlinks, outlines etc.), but whose file size is as small as possible.
I know about the following methods:
java -cp Multivalent.jar tool.pdf.Compress input.pdf (from http://multivalent.sourceforge.net/). This recompresses all streams, removes unused objects, unifies equivalent objects, compresses whitespace, removes default values, compresses the cross-reference table.
Recompressing suitable images with jbig2 and PNGOUT.
Re-encoding Type1 fonts as CFF fonts.
Unifying equivalent images.
Unifying subsets of the same font into a bigger subset.
Removing fillable forms.
When distilling or otherwise converting (e.g. gs -sDEVICE=pdfwrite), making sure it doesn't degrade image quality and doesn't increase (!) the image sizes.
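On that last point, a Ghostscript invocation that tries to leave the images alone could look roughly like this (a sketch run from Python; the flags are standard pdfwrite/distiller parameters, but verify them against your Ghostscript version):

```python
import subprocess

subprocess.run(
    ["gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH",
     # do not downsample any images
     "-dDownsampleColorImages=false",
     "-dDownsampleGrayImages=false",
     "-dDownsampleMonoImages=false",
     # do not let pdfwrite pick a (possibly lossy) filter on its own
     "-dAutoFilterColorImages=false",
     "-dAutoFilterGrayImages=false",
     "-dColorImageFilter=/FlateEncode",
     "-dGrayImageFilter=/FlateEncode",
     # newer Ghostscript versions can pass existing JPEG data through untouched
     "-dPassThroughJPEGImages=true",
     "-o", "output.pdf", "input.pdf"],
    check=True,
)
```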
I know about the following techniques, but they don't apply in my case, since I already have a PDF:
Use smaller and/or fewer fonts.
Use vector images instead of bitmap images.
Do you have any other ideas how to optimize PDF?
Optimize PDF Files
Avoid Refried Graphics
For graphics that must be inserted as bitmaps, prepare them for maximum compressibility and minimum dimensions. Use the best quality images that you can at the output resolution of the PDF. Inserting compressed JPEGs into PDFs and Distilling them may recompress JPEGs, which can create noticeable artifacts. Use black and white images and text instead of color images to allow the use of the newer JBIG2 standard that excels in monochromatic compression. Be sure to turn off thumbnails when saving PDFs for the Web.
Use Vector Graphics
Use vector-based graphics wherever possible for images that would normally be made into GIFs. Vector images scale perfectly, look marvelous, and their mathematical formulas usually take up less space than bitmapped graphics that describe every pixel (although there are some cases where bitmap graphics are actually smaller than vector graphics). You can also compress vector image data using ZIP compression, which is built into the PDF format. Acrobat Reader versions 5 and 6 also support the SVG standard.
Minimize Fonts
How you use fonts, especially in smaller PDFs, can have a significant impact on file size. Minimize the number of fonts you use in your documents to minimize their impact on file size. Each additional fully embedded font can easily take 40K in file size, which is why most authors create "subsetted" fonts that only include the glyphs actually used.
Flatten Fat Forms
Acrobat forms can take up a lot of space in your PDFs. New in Acrobat 8 Pro you can flatten form fields in the Advanced -> PDF Optimizer -> Discard Objects dialog. Flattening forms makes form fields unusable and form data is merged with the page. You can also use PDF Enhancer from Apago to reduce forms by 50% by removing information present in the file but never actually used. You can also combine a refried PDF with the old form pages to create a hybrid PDF in Acrobat (see "Refried PDF" section below).
see article
As of PDF specification version 1.5 there are two new methods of compression: object streams and cross-reference streams.
You mention that the Multivalent.jar compress tool compresses the cross reference table. This usually means the cross reference table is converted into a stream and then compressed.
The format of this cross reference stream is not fixed. You can change the bit size of the three "columns" of data. It's also possible to pre-process the stream data using a predictor function which will improve the compression level of the data. If you look inside the PDF with a text editor you might be able to find the /Predictor entry in the cross reference stream dictionary to check whether the tool you're using is taking advantage of this feature.
Using a predictor on the compression might be handy for images too.
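The gain from such a predictor is easy to demonstrate: cross-reference rows are fixed-width records whose values grow slowly, so storing each row as its byte-wise difference to the previous row (the PNG "Up" predictor, /Predictor 12) leaves mostly zeros, which FLATE compresses much better. A small self-contained illustration (not a PDF parser, just the idea; the row layout mimics /W [1 3 1]):

```python
import zlib

ROW_W = 5  # /W [1 3 1]: 1 type byte, 3 offset bytes, 1 generation byte

# Fake cross-reference rows: type-1 entries with slowly growing offsets.
rows = [bytes([1]) + (1000 + 37 * i).to_bytes(3, "big") + bytes([0])
        for i in range(5000)]
raw = b"".join(rows)

# PNG "Up" predictor: each row stores the byte-wise difference to the row
# above, prefixed with the filter-type byte 2.
predicted = bytearray()
prev = bytes(ROW_W)
for i in range(0, len(raw), ROW_W):
    row = raw[i:i + ROW_W]
    predicted.append(2)  # filter type "Up"
    predicted.extend((row[j] - prev[j]) & 0xFF for j in range(ROW_W))
    prev = row

print("plain FLATE:       ", len(zlib.compress(raw)))
print("predicted + FLATE: ", len(zlib.compress(bytes(predicted))))
```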
The second type of compression offered is the use of object streams.
Often in a PDF you have many similar objects. These can now be combined into a single object stream and then compressed. The documentation for the Multivalent Compress tool mentions that object streams are used but doesn't have many details on the actual choice of which objects to group together. The compression will be better if you group similar objects together into an object stream.
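The effect of grouping similar objects is easy to illustrate: compressing many small, similar dictionaries one by one gives each its own FLATE overhead and no shared history, while compressing them as one concatenated stream (roughly what an object stream does) exploits the redundancy between them. A rough, non-PDF-specific demonstration:

```python
import zlib

# Many small, similar PDF-style annotation dictionaries.
objects = [
    b"<< /Type /Annot /Subtype /Link /Border [0 0 0] /Rect [%d 700 %d 715] >>"
    % (50 + i, 150 + i)
    for i in range(2000)
]

individually = sum(len(zlib.compress(obj)) for obj in objects)
grouped = len(zlib.compress(b"\n".join(objects)))

print("compressed one by one:   ", individually, "bytes")
print("compressed as one stream:", grouped, "bytes")
```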