What is the maximum extent of compressing a PDF file?

Whenever I try to compress a PDF file to the smallest possible size, using Ghostscript, pdftk or pdfopt, I end up with a file roughly half the size of the original. But lately I have been getting files in the 1000 MB range, which compress down to only a few hundred MB. Can we reduce them further?
The PDFs are built from high-resolution JPEG images; can't we downsample those images and get some further reduction in size?

As far as I know, without degrading the JPEG streams and losing quality, you can try the special feature offered by Multivalent:
https://rg.to/file/c6bd7f31bf8885bcaa69b50ffab7e355/Multivalent20060102.jar.html
java -cp path/to.../multivalent.jar tool.pdf.Compress -compact file.pdf
The resulting output is compressed in a special way; the resulting file needs the Multivalent browser to be read again.
How much space you can save is unpredictable (often you cannot save any further space).
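Since the question also mentions Ghostscript and asks about downsampling the high-resolution JPEGs, here is a rough, lossy sketch using Ghostscript's standard distiller parameters; the 150/300 DPI targets and the file names are only illustrative assumptions, not part of the answer above.
# Re-distill the PDF, downsampling colour/grey images to ~150 DPI and mono images to 300 DPI (lossy)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
   -dDownsampleColorImages=true -dColorImageResolution=150 \
   -dDownsampleGrayImages=true -dGrayImageResolution=150 \
   -dDownsampleMonoImages=true -dMonoImageResolution=300 \
   -dNOPAUSE -dBATCH -dQUIET -sOutputFile=smaller.pdf input.pdf
The -dPDFSETTINGS=/ebook and /screen presets bundle similar options if you prefer a single flag.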

Related

How can I optimize my PDF repository after splitting it by page?

I have about 20 large PDFs which I have split by page for easier access. When I split them by page using qpdf, I observe roughly a 10x inflation in total size, which suggests that redundant data is being copied into every per-page PDF. The embedded fonts are very likely the cause of the bloat. Is there a way to externalize these fonts (e.g. so users can install them beforehand on their devices)? My goal is that once I split the PDFs by page, the total size stays within 1x-2x of the original, so that I can host it on my website.
Here is the sample pdf from repository
https://www.mea.gov.in/Images/CPV/Volume17_Part_III.pdf
Any help regarding PDF splitting is welcome.
Thanks!
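For reference, a per-page split with a recent qpdf (the step the question describes) looks roughly like this; the page-%d.pdf naming pattern is my assumption, not taken from the question:
# Split into one PDF per page; %d is replaced by the page number
qpdf --split-pages Volume17_Part_III.pdf page-%d.pdf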
I split the file into files of one page each and then tried to squeeze them. There is no unneeded data:
$ cpdf -squeeze 641.pdf -o out.pdf
Initial file size is 947307 bytes
Beginning squeeze: 2178 objects
Squeezing... Down to 1519 objects
Squeezing page data and xobjects
Recompressing document
Final file size is 945176 bytes, 99.78% of original.
So no luck there. About four fifths of the size of each file is the (uncompressed) XML metadata inherited from the main file. You may well not need this. If so, you can run:
cpdf -remove-metadata in.pdf -o small.pdf
on each output file. This reduces each file to about a fifth of its size. Obviously, if you are splitting into groups of more than one page, the effect will not be as large.
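To apply that to every split page in one go, a minimal shell loop (assuming the per-page files are named page-*.pdf, as in the split sketch above) could look like this:
# Strip the inherited XML metadata from each per-page file
for f in page-*.pdf; do
  cpdf -remove-metadata "$f" -o "small-$f"
done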

How to convert scanned document images to a PDF document with high compression?

I need to convert scanned document images to a PDF document with high compression. The compression ratio is very important. Can someone recommend a solution in C# for this task?
Best regards, Alexander
There is a free program called PDFBeads that can do it. It requires Ruby, ImageMagick and optionally jbig2enc.
The PDF format itself will probably add next to no overhead in your case; your images will account for most of the output file size.
So you should compress your images as aggressively as possible. For black-and-white images you will likely get the smallest output using the FAX4 (CCITT Group 4) or JBIG2 compression schemes (both supported in PDF files).
For other images (greyscale, colour), either use the smallest possible size, resolution and quality, or convert the images to black-and-white and use the FAX4/JBIG2 scheme.
Please note that you will most probably lose some image detail when converting to black-and-white.
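As a concrete sketch of the FAX4 route using ImageMagick (which PDFBeads already depends on), with a hypothetical scan.png as input rather than the C# library discussed below:
# Convert to bilevel and write a PDF whose image data is CCITT Group 4 (FAX4) compressed
convert scan.png -colorspace Gray -threshold 50% -compress Group4 scan-g4.pdf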
If you are looking for a library that can help you with recompression, then have a look at the Docotic.Pdf library (disclaimer: I am one of the developers of the library).
The "Optimize images" sample code shows how to recompress images before adding them to a PDF. The sample shows how to recompress with JPEG, but for FAX4 the code will be almost the same.

Batch files for saving several PDFs as reduced-size PDFs

I have several PDFs on my server machine, and they are taking up a lot of space. We are trying to reduce the size of the PDFs to optimize scrolling performance and cut storage consumption. We have thought of using the Adobe Acrobat 11 SDK. While using the trial, I found a way of reducing PDF size via
File > Save As Other > Reduced Size PDF
Is it possible to write a batch file that reduces the size of all PDFs in a given folder? Alternatively, a JavaScript or IAC solution would also be possible.
Any help would be greatly appreciated.
Thanks.
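No Acrobat-based answer is included here, but for the batch part a plain shell loop around Ghostscript (mentioned in the first question above) is a common alternative; this is only a sketch under the assumption that command-line tools are acceptable, not the Acrobat 11 SDK route the question asks about:
# Write a reduced-quality copy of every PDF in the current folder
for f in *.pdf; do
  gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -dQUIET \
     -sOutputFile="reduced-$f" "$f"
done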

Are all PDF files compressed?

There are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs, since PDFs are already compressed.
My question is: is this true for all PDFs, including older versions of the format?
Also, I'm sure it's possible for someone (an idiot, maybe) to place raw bitmaps into a PDF rather than JPEGs etc. Our company has a lot of PDFs in its databases (some in older formats, maybe). We are considering using gzip to compress during transmission, but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
So to answer your question: no, you cannot assume that gzip compression adds only hassle and no benefit. You have to test it with a representative sample set of your files: just gzip them and take note of the time used and the space saved.
It also depends on the type of PDF-producing software that was used...
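One way to run that test from the command line; the qpdf step merely demonstrates that the internal streams can be fully decompressed and is optional, and the file names are placeholders:
# Decompress all internal object streams; the file still renders identically
qpdf --stream-data=uncompress sample.pdf sample-uncompressed.pdf
# Compare how much gzip gains on the original versus the fully uncompressed copy
gzip -9 -c sample.pdf > sample.pdf.gz
gzip -9 -c sample-uncompressed.pdf > sample-uncompressed.pdf.gz
ls -l sample*.pdf*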
Instead of applying gzip compression, you would get much better gains by using PDF utilities to apply compression to the contents within the format, as well as removing things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which is far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level chosen to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to apply these reductions automatically.
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.

Efficient thumbnail generation of huge PDF files?

In a system I'm working on, we're generating thumbnails as part of the workflow.
Sometimes the PDF files are quite large (print size 3 m²) and can contain huge bitmap images.
Are there any thumbnail-generation programs optimized for a small memory footprint when handling such large PDF files?
The resulting thumbnail can be png or jpg.
ImageMagick is what I use for all my CLI graphics, so maybe it can work for you:
convert foo.pdf foo-%d.png
For a three-page PDF, this produces three separate PNG files:
foo-0.png
foo-1.png
foo-2.png
To create only one thumbnail, treat the PDF as if it were an array ([0] is the first page, [1] is the second, and so on):
convert foo.pdf[0] foo-thumb.png
Since you're worried about memory, you can restrict memory usage with the -cache option:
-cache threshold: megabytes of memory available to the pixel cache. Image pixels are stored in memory until threshold megabytes of memory have been consumed. Subsequent pixel operations are cached on disk. Operations to memory are significantly faster, but if your computer does not have a sufficient amount of free memory you may want to adjust this threshold value.
So, to thumbnail a PDF file and resize it, you could run this command, which should have a maximum memory usage of around 20 MB:
convert -cache 20 foo.pdf[0] -resize 10%x10% foo-thumb.png
Or you could use -density to control the resolution at which the PDF is rasterized (note that it needs to come before the input file to take effect, and that higher densities render more detail at the cost of more pixels and memory):
convert -cache 20 -density 900 foo.pdf[0] foo-thumb.png
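Note that -cache comes from older ImageMagick releases; in current versions the equivalent, as far as I know, is the -limit resource option, so a roughly equivalent command would be:
# Cap the pixel-cache RAM at about 20 MiB; excess spills to memory-mapped files, then disk
convert -limit memory 20MiB -limit map 40MiB foo.pdf[0] -resize 10%x10% foo-thumb.png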
Should you care? Current affordable servers have 512 GB of RAM. At 3 bytes per pixel, a full-colour uncompressed bitmap of a 3 m² print (roughly 68 x 68 inches) at 1200 dpi needs only about 20 GB, and 512 GB is enough for a bitmap several metres on a side at that resolution. The performance hit you take from using disk is large.