I need to automatically reduce the size of some user-uploaded PDFs so that they can be sent via email.
I have a little ImageMagick one-liner that reduces the size for me:
convert -density 120 -quality 10 -compress jpeg original.pdf output.pdf
It basically exports every page of the PDF as a JPEG, applies the new density and quality, and repacks the pages into a new PDF.
This works well, except that sometimes the output ends up bigger than the input, and I have to rerun the command, tweaking density and quality, to get the smallest size at which the text in the document is still readable.
I'm not sure how to automate this. I thought of using identify to get the characteristics of the files (height, width, density...) and then doing something like halving the figures, but I'm struggling to get this information out of the files.
Any suggestions?
Thanks,
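One way the manual "rerun and tweak" loop described above could be scripted is sketched below. It is only a sketch: the ladder of density:quality pairs and the names file_size and shrink_pdf are assumptions made for illustration, and the script cannot judge readability, so the most aggressive pair should be one that has been checked by eye.

```sh
# Sketch only. Walks a ladder of "density:quality" pairs from gentle to
# aggressive and stops at the first result at or under a byte limit.
# The ladder values and the function names are assumptions for illustration.

file_size() {
    # Print the size of file $1 in bytes.
    wc -c < "$1" | tr -d ' '
}

shrink_pdf() {
    # $1 = input PDF, $2 = output PDF, $3 = size limit in bytes
    sp_in=$1; sp_out=$2; sp_limit=$3

    # If the upload already fits, keep it untouched: re-encoding could make it bigger.
    if [ "$(file_size "$sp_in")" -le "$sp_limit" ]; then
        cp "$sp_in" "$sp_out"
        return 0
    fi

    sp_tmpdir=$(mktemp -d) || return 1
    sp_candidate=$sp_tmpdir/candidate.pdf
    for sp_setting in 150:40 120:25 120:10 96:10; do
        sp_density=${sp_setting%%:*}
        sp_quality=${sp_setting##*:}
        # Same command as the one-liner above, with the two knobs parameterised.
        convert -density "$sp_density" -quality "$sp_quality" -compress jpeg \
            "$sp_in" "$sp_candidate" || break
        if [ "$(file_size "$sp_candidate")" -le "$sp_limit" ]; then
            mv "$sp_candidate" "$sp_out"
            rm -rf "$sp_tmpdir"
            return 0
        fi
    done
    rm -rf "$sp_tmpdir"
    return 1   # nothing on the ladder fit under the limit
}
```

It would be called as, say, shrink_pdf original.pdf output.pdf 5000000, and a non-zero return means none of the settings got the file under the limit.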
Related
I can use Acrobat to reduce a PDF file of 30MB to 10MB. The input PDF is just the result of combining many monochrome TIFF files like the following.
$ file x.tiff
x.tiff: TIFF image data, little-endian, direntries=14, height=2957, bps=1, compression=bi-level group 4, PhotometricIntepretation=WhiteIsZero, orientation=upper-left, width=1627
The tiff files are converted to pdf files using the following command.
convert x.tiff x.pdf
The single page PDF files are then merged to a single PDF file by the following command.
cpdf input1.pdf input2.pdf ... -o output.pdf
The OCR (Searchable Image (Exact)) is done on the PDF file. I am surprised that the file size can be reduced to only 1/3.
Note that I don't see any changes in image resolution. For example, when I zoom in, I see squares for pixels. The image in pdf still looks black-white, there are no gray pixels.
What can be done to reduce the PDF files by such a big amount?
You may want to run the PDF through pdfsizeopt. For such a PDF, pdfsizeopt will most probably recompress the images with JBIG2 compression, which makes them smaller without loss of quality and without reducing the image resolution. However, it is unlikely to shrink the PDF by much more than a factor of 3.
pdfsizeopt --use-pngout=no output.pdf optimized_output.pdf
If you need an even smaller PDF, you may want to reduce the image resolution (number of image pixels) first (before running pdfsizeopt):
convert x.tiff -resize 50% x.pdf
If you are unsure what is using much space in a PDF, run:
pdfsizeopt --stats output.pdf
In my battle against cheating (I am a teacher), I would like to convert my LaTeX PDFs to images so that students cannot cut and paste out of the files. I am currently using ImageMagick to do this:
convert -density 300 mwe.pdf mwe_convert.pdf
While this works, when the PDF is not zoomed in far enough, aliasing causes unfortunate problems. My understanding is that antialiasing is on by default in convert (https://legacy.imagemagick.org/Usage/antialiasing/).
If I instead use the following command, things are better:
convert -density 800 -resample 300 mwe.pdf mwe_convert.pdf
With this PDF, as I zoom in and out it does a better job of preserving fine lines in the file. I'm guessing I'm getting some different type of effective antialiasing by doing this, but the options to convert have outsmarted me.
The problem is that using density and resample causes the paper size in mwe_convert.pdf to be 3.19 × 4.12 inch (according to identify). This means that viewing the file causes it to open small on the screen, and that you need crazy magnifications to make it readable on the screen.
So my question for the crowd is, is there (a) a way to do the density/resample and get the correct paper size at the end, or (b) a better way to achieve my goal.
I cannot include a PDF here as an MWE, but I can show what I'm seeing. Here's a screenshot of the original LaTeX PDF.
Here's a screen shot of the -density 300 without the resample:
Here's a screen shot of the -density 800 -resample 300. Note that the PDF was even smaller on the screen and the equals sign was still visible.
Here is a way to get back to letter-sized pages:
pdfjam --outfile letter_sized_pages.pdf --letterpaper converted_pdf_file.pdf
Said pdfjam command is found in the texlive-extra-utils package (for Debian).
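As a side note on where 3.19 × 4.12 inch comes from: the numbers are consistent with a letter page whose pixels were resampled to 300 dpi while the file still records 800 dpi. That reading is an assumption based only on the arithmetic, not something confirmed from ImageMagick's internals. The helper name reported_inches is made up for this sketch.

```sh
# Sketch: page size a viewer would report if the pixel count corresponds to
# one dpi but the file records another. Function name is hypothetical.
reported_inches() {
    # $1 = true page dimension in inches
    # $2 = dpi the pixels were resampled to
    # $3 = dpi recorded in the output file
    awk -v inches="$1" -v resampled="$2" -v recorded="$3" \
        'BEGIN { printf "%.4f\n", inches * resampled / recorded }'
}

reported_inches 8.5 300 800   # width of a letter page  -> 3.1875
reported_inches 11 300 800    # height of a letter page -> 4.1250
```

Both values match the size reported by identify, which is why rescaling the pages back to letter size restores the expected on-screen size.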
First question here.
So I was using the Ghostscript command below to shrink my PDFs, which yielded good results (around a 30-40% decrease in size). However, one day last week it stopped shrinking them and instead returned a PDF of the same size or even slightly larger (by around 1% or less). I don't know what's going on, since the command used to work fine and I was able to shrink some PDFs easily...
I will note that when I run gs on my PDFs it always returns an error about some glyphs missing in the GlyphLessFont, but I don't think that's related to my issue (though if you could point me towards fixing the GlyphLessFont problem, that would be much appreciated).
Here's the command I use:
`gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf input.pdf`
Here's also a PDF sample that was shrunk correctly (original file size 4.7 MB / shrunk version 2.9 MB): https://nofile.io/f/39Skta4n25R/bulletin1_ocr.pdf
EDIT: light version that worked for the file above: https://nofile.io/f/QOKfG34d5Cg/bulletin1_light.pdf
Here are the input and output files of another PDF that didn't work:
(input) https://nofile.io/f/sXsU0Mcv35A/bulletin15_ocr.pdf
(output through the gs command above) https://nofile.io/f/STdJYqqt6Fq/out.pdf
You'll notice that both the input and output files are 27.6 MB, whereas the first file was reduced.
I would also add that I've performed OCR on these PDFs using pdfocr and the Tesseract engine, and that's why I didn't try converting to PNG to reduce the size: I need the extra OCR layer so that we can publish those files on our website, and we want them to be lighter if possible.
Final info : ghostscript -v is 9.10 (2013-08-30) and tesseract is 3.03 with leptonica-1.70 and pdfocr is 0.1.4
Hope you guys can help !
EDIT2: While waiting for an answer I continued scanning and OCRing the documents, and it appears that after passing my PDF through pdfocr it was shrunk like it used to be with Ghostscript. So I wonder whether the pdfocr script does the shrinking with Ghostscript, since I know it invokes it for other tasks during the OCR process.
The PDF has a media size of 35.44 by 50.11 inches; is that really the size of the original?
Given that you appear to commonly use OCR, I assume that, in general, your PDF files simply consist of very large images. In that case the major impact on the file size is going to come from downsampling the images. If you look at the documentation you can see that the /screen settings downsample images to 72 dpi, with a threshold of 1.5 (so images over 72 * 1.5 = 108 dpi will be downsampled to 72; anything less is regarded as not worth it).
Your PDF file has a media size of 35.44 x 50.11 inches. It's rather a large file (26 pages), so I'll limit myself to considering page 1. On this page there is one image, and a bunch of invisible text, placed there by Tesseract. The image on page 1 is an 8-bit RGB image with dimensions 2481x3508, and it covers the entire page.
So the resolution of that image is 2481 / 35.44 by 3508 / 50.11 = 70.01 x 70.01 dpi.
Since that is less than 72 dpi, pdfwrite isn't going to downsample it.
Had your media been 8.5 x 11 inches, then the image would have had an effective resolution of 2481 / 8.5 by 3508 / 11 = 291.9 x 318.9 dpi, and so would have been downsampled by a factor of about 4.
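The arithmetic above can be reproduced with a couple of small helpers. This is just a sketch of the calculation as described in this answer (pixels divided by media size in inches, compared against target times threshold); the function names are made up, and it does not inspect a PDF or call Ghostscript.

```sh
# Sketch of the resolution check described above. Function names are
# hypothetical; the 72 dpi target and 1.5 threshold are the /screen values
# quoted in this answer.

effective_dpi() {
    # $1 = image dimension in pixels, $2 = media dimension in inches
    awk -v px="$1" -v inches="$2" 'BEGIN { printf "%.2f\n", px / inches }'
}

would_downsample() {
    # $1 = effective dpi, $2 = target dpi, $3 = threshold factor
    # Succeeds (exit 0) when the image is above target * threshold.
    awk -v dpi="$1" -v target="$2" -v factor="$3" \
        'BEGIN { if (dpi > target * factor) exit 0; exit 1 }'
}

effective_dpi 2481 35.44   # 70.01  (width on the oversized media)
effective_dpi 3508 50.11   # 70.01  (height on the oversized media)
effective_dpi 2481 8.5     # 291.88 (width if the media were 8.5 in wide)
effective_dpi 3508 11      # 318.91 (height if the media were 11 in tall)

if would_downsample 70.01 72 1.5; then
    echo "would be downsampled"
else
    echo "left alone"          # this branch runs: 70.01 is below 108
fi
```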
However..... for me your 'working' PDF file also has a large media size, and the images are also already below the downsampling resolution. When I run that file using your command line, the output file is essentially the same size as the input file.
So I'm at a loss to see how you could ever have experienced the reduced file size. Perhaps you could post the reduced file as well.
EDIT
So, the reason that your files are smaller after passing through Ghostscript is because the vast majority of the content is the scanned pages. These are stored in the PDF file as DCT encoded images (JPEG).
The resolution of the images is low enough (see above) that they are not downsampled. However, the way that old versions of Ghostscript work is that image data is always decompressed on reading, and then recompressed when writing.
Because JPEG is a lossy image format, this means that the decompressed and recompressed image is of lower quality than the original, and the way that loss of quality is applied means that the data compresses better.
So a quirk of the way that Ghostscript works results in you losing quality, but getting smaller files. Note that for current versions of Ghostscript, the JPEG data is passed through unchanged, unless your configuration requires it to be downsampled or colour converted.
So why doesn't it compress the other file? Well, for current code, of course, which is what I'm using, it won't, because the image doesn't need downsampling or anything.
Now, when I run it through an old version of Ghostscript which I have here (9.10, chosen because that's what your working reduced file is using) then I do indeed see the file size reduced. It goes down from 26MB to 15MB.
When I look at your 'not working' reduced file, I see that it has been produced by Ghostscript 9.23, not Ghostscript 9.10.
So the reason you see a difference in behaviour is because you have upgraded to a newer version of Ghostscript which does a better job of preserving the image data unchanged.
If you really want to reduce the quality of the images you can set -dPassThroughJPEGImages=false, but IMO you'd do better to either get the media size of the original PDF correct (surely the pages are not really 35x50 inches?) or set the ColorImageResolution to a lower value.
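For illustration, the two switches mentioned above could be added to the command from the question roughly like this. The wrapper name recompress_pdf and the idea of passing the resolution as a third argument are assumptions for the sketch; a suitable resolution value depends on the files and has to sit far enough below the image's effective resolution for the threshold to trigger.

```sh
# Sketch only: the question's gs command plus the two options discussed above.
# The function name and argument order are hypothetical.
recompress_pdf() {
    # $1 = input PDF, $2 = output PDF, $3 = target dpi for colour images
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
       -dPassThroughJPEGImages=false \
       -dColorImageResolution="$3" \
       -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile="$2" "$1"
}
```

For example, recompress_pdf bulletin15_ocr.pdf out.pdf 36 would ask for 36 dpi colour images; that figure is only a placeholder. Note that -dColorImageResolution is placed after -dPDFSETTINGS so that it overrides the preset's value.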
I'm organizing a large number of PDFs, some of which need to be inverted or have their contrast adjusted. But when I use convert to modify a PDF, the new file becomes much bigger than the original once I set -density and -quality high enough to match the original quality. A typical command looks like this:
convert -density 300 OrignalPDF.pdf -quality 100 -negate NewPDF.pdf
This results in a pdf that looks very nearly as sharp as the original, but when switching between the two (with the original inverted within the pdf viewer's settings (qpdfview)), one notices that the new one seems very slightly shrunken and that all the lines become slightly thicker/bolder. Obviously this isn't too bad, but shouldn't I be able to invert the colors with almost no noticeable changes?
This slight change becomes even more ridiculous when one notices the size disparity: the original image was 276 KB and the modified file is 28 MB. That's more than 100 times larger! Given that I have hundreds of PDFs, out of which more than 20 or 30 need to be (custom) modified, how can I keep the total size near the original total size, while retaining quality?
Imagemagick's documentation says:
However the reading of these formats is very complicated, as they are full computer languages designed specifically to generate a printed page on high quality laser printers. This is well beyond the scope of ImageMagick, and so it relies on a specialized delegate program known as "ghostscript" to read, and convert Postscript and PDF pages to a raster image.
So ImageMagick first converts the PDF to a raster image and then builds a simple PDF from that image. The output PDF is unsearchable and contains no vectors, no hidden text, etc., just a page-wide raster image. But PDF (and PostScript) is not just a set of images; it is a set of commands, text, vectors, fonts, and even sub-scripts (to calculate an output color, for example). A PDF is more like an application than a static image.
Anyway, I suppose you may have 2 types of input PDF files:
1. Files with page-wide images inside (for example, scanned documents). Only this 1st type should be processed with ImageMagick. These files will be converted to nearly the same file size.
2. Files with pure text and vectors inside (for example, PDF invoices). These should not be processed with ImageMagick, as the conversion damages the input file (and ends up increasing the output file size). If you still need to adjust the contrast or compression of images inside files of this type, consider using Ghostscript directly; check this tutorial.
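To get a feel for why rasterizing a text-and-vector PDF inflates it so much, it helps to count the raw pixel data of a single page. The sketch below assumes a letter-sized page and 3 bytes per pixel (8-bit RGB), neither of which is stated in the question, and the function name is made up. The real file size also depends on how the image is compressed.

```sh
# Sketch: uncompressed size of one rasterized page. Function name is
# hypothetical; page size and bytes per pixel are assumptions.
raster_bytes() {
    # $1 = page width in inches, $2 = page height in inches,
    # $3 = density in dpi, $4 = bytes per pixel
    awk -v win="$1" -v hin="$2" -v dpi="$3" -v bpp="$4" 'BEGIN {
        w = int(win * dpi + 0.5)     # pixels across
        h = int(hin * dpi + 0.5)     # pixels down
        printf "%d\n", w * h * bpp
    }'
}

raster_bytes 8.5 11 300 3   # 25245000 bytes, roughly 25 MB per page before compression
```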
I have a number of PDF textbooks, and some of them are upwards of 400 megabytes for 1000 pages while others (which look similar in quality) are only 10 megabytes for 1500 pages!! I thought it might be the image quality, but the images are fairly similar in quality. Next, I took a look at the text when I zoomed in, and saw that the larger books look like they have rasterized text while the smaller files looked like they had vector text. Is this it?
If so, how can I start making PDF files in vector format? Is it possible to scan a document / use OCR to recognize the text, and then somehow convert the rasterized text into vector format? Also, can you convert rasterized texts into vector format?
Cheers,
Evans
Check this command on a sample each from your two different PDF types:
pdfimages -list -f 1 -l 10 the.pdf
(Your version of pdfimages should be a recent one, the Poppler variant.) This gives you a list of all images from the first 10 pages. It also lists the image dimensions (width, height) in pixels, as well as the image size (in bytes) and the respective compression. If you can bear with it, you can also run:
pdfimages -list the.pdf
This gives you a list of all images from all pages.
I bet the larger one has more images listed.
PDFs from scans vs. PDFs "born digital" ?
Also run:
pdffonts -f 1 -l 10 the.pdf
and
pdffonts the.pdf
My guess is this: your large PDF types do not list any fonts. That means, very likely the pages of these PDFs originate from scanned papers.
The smaller ones were "born digital"...
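If there are many books to check, the pdffonts test can be wrapped in a small script. This is a rough sketch: it assumes the usual pdffonts layout of a header row followed by a separator row, and the function names are made up. Keep in mind that a scanned PDF that went through OCR may still list a font for its hidden text layer, so a non-zero count is only a hint.

```sh
# Sketch: count the fonts pdffonts reports and guess the PDF's origin.
# Assumes the first two lines of pdffonts output are a header and a
# separator row. Function names are hypothetical.

font_count() {
    # Reads pdffonts output on stdin, prints the number of font rows.
    awk 'NR > 2 && NF { n++ } END { print n + 0 }'
}

classify_pdf() {
    # $1 = PDF file to inspect
    cp_fonts=$(pdffonts "$1" | font_count) || return 1
    if [ "$cp_fonts" -eq 0 ]; then
        echo "$1: no fonts, likely scanned pages"
    else
        echo "$1: $cp_fonts fonts, likely born digital"
    fi
}
```

Running something like for f in *.pdf; do classify_pdf "$f"; done would then print one line per book.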