Large PDF sizes but lower quality

I'm organizing a large number of PDFs, some of which need to be inverted or have their contrast adjusted. But when I use convert to modify a PDF, the new file size becomes much bigger than the original, even when I use the density and quality options to try to preserve the original quality. A typical command looks like this:
convert -density 300 OriginalPDF.pdf -quality 100 -negate NewPDF.pdf
This results in a PDF that looks very nearly as sharp as the original, but when switching between the two (with the original inverted via the PDF viewer's settings (qpdfview)), one notices that the new one seems very slightly shrunken and that all the lines become slightly thicker/bolder. Obviously this isn't too bad, but shouldn't I be able to invert the colors with almost no noticeable changes?
This slight change becomes even more ridiculous when one notices the size disparity: the original file was 276 KB and the modified file is 28 MB. That's more than 100 times larger! Given that I have hundreds of PDFs, of which 20 or 30 need to be (custom) modified, how can I keep the total size near the original total size while retaining quality?

ImageMagick's documentation says:
However the reading of these formats is very complicated, as they are full computer languages designed specifically to generate a printed page on high quality laser printers. This is well beyond the scope of ImageMagick, and so it relies on a specialized delegate program known as "ghostscript" to read, and convert Postscript and PDF pages to a raster image.
So, ImageMagick first converts the PDF to a raster image and then makes a simple PDF from that raster image. The output PDF is unsearchable and contains no vectors, no hidden text, etc., just a page-wide raster image per page. But PDF (and PostScript) is not just a set of images: it is a set of commands, text, vectors, fonts, and even sub-scripts (to calculate an output color, for example). A PDF is more like an application than a static image.
Anyway, I suppose you may have two types of input PDF files:
with page-wide images inside (for example, scanned documents). Only this first type should be processed with ImageMagick; such files will be converted at nearly the same file size.
with pure text and vectors inside (for example, PDF invoices). Files of this type should not be processed with ImageMagick, as the conversion damages the input (and ultimately increases the output file size). If you still need to adjust the contrast or compression of images inside such files, consider using Ghostscript directly; check this tutorial, and see the sketch below.
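Hedged sketches of both routes (the -quality and resolution values below are illustrative assumptions, not values from the question, and should be tuned). For the first type, tell ImageMagick explicitly how to compress the raster it embeds; for the second, Ghostscript can recompress and downsample the embedded images without rasterizing the vector content:
convert -density 300 OriginalPDF.pdf -negate -compress JPEG -quality 85 NewPDF.pdf
gs -sDEVICE=pdfwrite -dDownsampleColorImages=true -dColorImageResolution=150 -dNOPAUSE -dBATCH -sOutputFile=output.pdf input.pdf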

Related

ghostscript shrinking pdf doesn't work anymore

first question here.
So I was using the Ghostscript command to shrink my PDFs, which yielded good results (around a 30-40% decrease in size). However, one day last week it stopped shrinking them and instead returned a PDF of the same size or even a bit larger (within around 1%). I don't know what's going on, since the command used to work fine and I was able to shrink PDFs easily...
I will note that when using gs on my PDFs it always returns an error about some glyphs missing in the GlyphLessFont, but I don't think that's related to my issue (though if you could point me to a fix for the GlyphLessFont, that would be much appreciated).
Here's the command I use:
`gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf input.pdf`
Here's also a PDF sample that was shrunk correctly (original file size 4.7 MB / shrunk version 2.9 MB): https://nofile.io/f/39Skta4n25R/bulletin1_ocr.pdf
EDIT: light version that worked for the file above: https://nofile.io/f/QOKfG34d5Cg/bulletin1_light.pdf
Here are the input and output files of another PDF that didn't work:
(input) https://nofile.io/f/sXsU0Mcv35A/bulletin15_ocr.pdf
(output through the gs command above) https://nofile.io/f/STdJYqqt6Fq/out.pdf
You'll notice that both the input and output files are 27.6 MB, whereas the first file was reduced.
I would also add that I've performed OCR on these PDFs using pdfocr and the Tesseract engine, and that's why I didn't just convert them to PNG to reduce the size: I need the extra OCR layer so that we can publish these files on our website, and we want them to be lighter if possible.
Final info: ghostscript -v is 9.10 (2013-08-30), tesseract is 3.03 with leptonica-1.70, and pdfocr is 0.1.4.
Hope you guys can help!
EDIT2: While waiting for an answer I continued scanning and OCRing my documents, and it appears that after passing a PDF through pdfocr it was shrunk like it used to be with Ghostscript. So I wonder whether the pdfocr script does the shrinking via Ghostscript, since I know it invokes gs for other tasks during the OCR process.
The PDF has a media size of 35.44 by 50.11 inches; is that really the size of the original?
Given that you appear to commonly use OCR, I assume that, in general, your PDF files simply consist of very large images. In that case the major impact on file size is going to come from downsampling the images. If you look at the documentation you can see that the /screen settings downsample images to 72 dpi, with a threshold of 1.5 (so images over 72 * 1.5 = 107 dpi will be downsampled to 72; anything less is regarded as not worth it).
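Those preset values can be overridden individually on the command line. A sketch (the numbers are placeholders, not recommendations); parameters given after -dPDFSETTINGS override the preset:
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -dColorImageResolution=100 -dColorImageDownsampleThreshold=1.0 -dNOPAUSE -dBATCH -sOutputFile=out.pdf in.pdf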
Your PDF file has a media size of 35.44 x 50.11 inches. It's rather a large file (26 pages), so I'll limit myself to considering page 1. On this page there is one image and a bunch of invisible text placed there by Tesseract. The image on page 1 is an 8-bit RGB image with dimensions 2481x3508, and it covers the entire page.
So the resolution of that image is 2481 / 35.44 by 3508 / 50.11 = 70.0 x 70.0 dpi.
Since that is less than 72 dpi, pdfwrite isn't going to downsample it.
Had your media been 8.5 x 11 inches, the image would have had an effective resolution of 2481 / 8.5 by 3508 / 11 = 291.9 x 318.9 dpi, and so it would have been downsampled by a factor of about 4.
However..... for me your 'working' PDF file also has a large media size, and the images are also already below the downsampling resolution. When I run that file using your command line, the output file is essentially the same size as the input file.
So I'm at a loss to see how you could ever have experienced the reduced file size. Perhaps you could post the reduced file as well.
EDIT
So, the reason that your files are smaller after passing through Ghostscript is because the vast majority of the content is the scanned pages. These are stored in the PDF file as DCT encoded images (JPEG).
The resolution of the images is low enough (see above) that they are not downsampled. However, the way that old versions of Ghostscript work is that image data is always decompressed on reading, and then recompressed when writing.
Because JPEG is a lossy image format, this means that the decompressed and recompressed image is of lower quality than the original, and the way that loss of quality is applied means that the data compresses better.
So a quirk of the way that Ghostscript works results in you losing quality, but getting smaller files. Note that for current versions of Ghostscript, the JPEG data is passed through unchanged, unless your configuration requires it to be downsampled or colour converted.
So why doesn't it compress the other file? Well, for current code, of course, which is what I'm using, it won't, because the image doesn't need downsampling or anything else.
Now, when I run it through an old version of Ghostscript which I have here (9.10, chosen because that's the version your working reduced file was produced with), then I do indeed see the file size reduced. It goes down from 26MB to 15MB.
When I look at your 'not working' reduced file, I see that it has been produced by Ghostscript 9.23, not Ghostscript 9.10.
So the reason you see a difference in behaviour is because you have upgraded to a newer version of Ghostscript which does a better job of preserving the image data unchanged.
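You can verify this yourself: Poppler's pdfinfo reports the application that produced a file in its Producer field, for example:
pdfinfo out.pdf | grep -i producer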
If you really want to reduce the quality of the images you can set -dPassThroughJPEGImages=false, but IMO you'd do better to either get the media size of the original PDF correct (surely the pages are not really 35x50 inches?) or set the ColorImageResolution to a lower value.
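For instance, combining the original command line with that switch to force the old decompress/recompress behaviour (a sketch; the switch only exists in versions new enough to do JPEG passthrough in the first place):
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dPassThroughJPEGImages=false -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf input.pdf
Alternatively, lower ColorImageResolution (and the downsample threshold) as described above.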

What makes some pdf files much smaller than others?

I have a number of PDF textbooks, and some of them are upwards of 400 megabytes for 1000 pages while others (which look similar in quality) are only 10 megabytes for 1500 pages!! I thought it might be the image quality, but the images are fairly similar in quality. Next, I took a look at the text when I zoomed in, and saw that the larger books look like they have rasterized text while the smaller files looked like they had vector text. Is this it?
If so, how can I start making PDF files in vector format? Is it possible to scan a document, use OCR to recognize the text, and then somehow convert the rasterized text into vector format?
Cheers,
Evans
Check this command on a sample each from your two different PDF types:
pdfimages -list -f 1 -l 10 the.pdf
(Your version of pdfimages should be a recent one, the Poppler variant.) This gives you a list of all images from the first 10 pages. It also lists the image dimensions (width, height) in pixels, as well as the image size (in bytes) and the respective compression. If you can bear with it, you can also run:
pdfimages -list the.pdf
This gives you a list of all images from all pages.
I bet the larger one has more images listed.
PDFs from scans vs. PDFs "born digital"?
Also run:
pdffonts -f 1 -l 10 the.pdf
and
pdffonts the.pdf
My guess is this: your large PDF types do not list any fonts. That means, very likely the pages of these PDFs originate from scanned papers.
The smaller ones were "born digital"...
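If you want to sort a whole directory along these lines, a rough shell sketch over pdffonts (assuming Poppler's usual two header lines before the font rows) might look like:
for f in *.pdf; do
  if [ -z "$(pdffonts "$f" | tail -n +3)" ]; then
    echo "$f: no fonts listed, probably scanned"
  else
    echo "$f: fonts listed, probably born digital"
  fi
done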

Quality degradation of a text pdf after pdf>png>pdf

I have a very specific requirement where I must automatically stamp every page of a PDF file (for a faxing application), so here's the process I've set up:
step 1: Convert the PDF to PNG, one PNG file per page
cmd1: gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -r400 -sOutputFile=image_raw.png input.pdf
cmd2: mogrify -resize 31.245% image_raw.png
input.pdf (input): https://www.dropbox.com/s/p2ajqxe99nc0h8m/input.pdf
image_raw.png (output): https://www.dropbox.com/s/4cni4w7mqnmr0t7/image_raw.png
step 2: Stamp every PNG file (using a third party tool ..)
image_stamped.png (output): https://www.dropbox.com/s/3ryiu1m9ndmqik6/image_stamped.png
step 3: Reconvert PNG files into one PDF file
cmd: convert -resize 1240x1753 -units PixelsPerInch -density 150x150 image_stamped.png output.pdf
output.pdf (output): https://www.dropbox.com/s/o9y0jp9b4pm08ci/output.pdf
The output file of the third step should, in theory, be the same as the input file from step 1 (plus the stamp on it), but it's not: the file is somewhat blurry, and it turns out to be unreadable for humans after faxing, since the blurred pixels don't survive fax transmission. Even if you see no difference between input.pdf and output.pdf, try zooming in and you'll find that the text characters are blurred at the edges.
What are the best parameters to play with at input (step 1) or output (step 3)?
Thanks !
You are using anti-aliasing (TextAlphaBits=4). This 'smooths' the edges of text by introducing grey pixels between the black pixels of the text edges. At low resolutions (such as displays) this prevents the 'jaggies' in text and gives a more readable result. At higher resolutions its value is highly debatable.
Fax is a 1-bit monochrome medium, so the grayscale values have to be recreated by dithering. As you have discovered, this is not a good idea in a limited resolution device as it leads to a loss of sharpness.
I believe that if you remove the -dTextAlphaBits=4 you will see an immediate improvement. I would also suggest that you remove the GraphicsAlphaBits as well, since this will have the same effect on linework.
If you believe that you still want anti-aliasing you could try reducing the aggressiveness; you currently have it set to 4, try reducing it to 2.
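Applied to the command from step 1, that suggestion gives (the original switches minus the anti-aliasing; add -dTextAlphaBits=2 back if you want a milder form of it):
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r400 -sOutputFile=image_raw.png input.pdf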
Regarding the other comments:
Kurt is quite correct, as is fourat, and I'm afraid MarcB is mistaken: the -r400 sets the resolution for rendering, in dots per inch. If only one number is given, it is used for both the x and y resolution. It is possible to produce a fixed-size raster using Ghostscript, but you use the -dFIXEDMEDIA and -sPAPERSIZE switches, or the -g switch, which also sets FIXEDMEDIA automatically.
While I do agree with yms and Kurt that converting the PDF to a bitmap format (PNG) and then back to PDF will result in a loss of quality, if the final PDF is only used for transmission via fax, it doesn't matter. The PDF must be rendered to a fax-resolution bitmap at some point in the process; it's not a big problem if it's done before the stamp is applied.
I don't agree with BitBank here: converting a vector representation to a bitmap means rasterising it at a particular resolution. Once this is done, the resulting image cannot be rescaled without loss of quality, whereas the original vector representation can be, as it is simply rendered again at a different resolution. 'Image' in PDF refers to a bitmap; you can't have a vector bitmap. The image posted by yms clearly shows the effect of rendering a vector representation into an image.
One last caveat. I'm not familiar with the other tools being used here, but two of the command lines at least imply a 'resize'. If you resize a bitmap then the chances are that the tool will introduce the same kinds of artefacts (anti-aliasing) that you are having a problem with. Once you have created the bitmap you should not alter it at all. It's important that you create the PNG at the correct size in the first place.
And finally.....
I just checked your original PDF file and I see that the content of the page is already an image. Not only that, it's a DCT (JPEG) image. JPEG is a really poor choice of format for a monochrome image: it's a lossy compression format and always introduces artefacts into the image. If you open your original PDF file in Acrobat (or a similar viewer) and zoom in, you can see that there are faint 'halos' around the text; you will also see that the text is already blurry.
You then render the image, quite probably at a different resolution to the original image resolution, and at the same time introduce more blurring by setting -dGraphicsAlphaBits. You then make further changes to the image data which I can't comment on. In the end you render the image again, to a monochrome bitmap. The dithering required to represent the grey pixels leads to your text being unreadable.
Here are some ways to improve this:
1) Don't convert text into images like this; it instantly leads to a quality loss.
2) Don't compress monochrome images using JPEG.
3) If you are going to work with images, don't keep converting them back and forth; work with the original until you are done, then make a PDF file from that, if you really must.
4) If you really insist on doing all this, don't compound the problem by using more anti-aliasing. Remove the -dGraphicsAlphaBits from the command line. You might as well remove -dTextAlphaBits as well, since your files contain no text. Please read the documentation before using switches and understand what it is you are doing.
You should really think about your workflow here. Obviously we don't know what you are doing or why, so there may well be good reasons why some things are not possible, but you should try and avoid manipulating images like this. Because these are not vector, every time you make a change to the image data you are potentially losing information which cannot be recovered at a later stage. By making many such transformations (and your workflow as depicted seems to perform as many as 5 transformations from the 'original' image data) you will unavoidably lose quality.
If possible retain everything as vector data. When it is unavoidable to move to image data, create the image data as you need it to be finally used; do not transform it further.
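For a fax workflow specifically, one way to follow that advice (a sketch, assuming the standard fax resolution of 204x196 dpi and that your stamping tool can handle TIFF input) is to render each page straight to a 1-bit G4 fax TIFF with Ghostscript's tiffg4 device and stamp that, instead of going through intermediate colour PNGs:
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r204x196 -sOutputFile=page_%03d.tif input.pdf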
I've had a closer look at the files you provided:
So, already the first image (image_raw), the result of the mogrify resize command, is fairly blurry at 1062x1375. While the blurriness does not get worse in the second image (image_stamped), which is the result of the third-party tool, the third image (extracted from your output.pdf), i.e. the result of that convert command, is even more blurred, which is due to the graphic being resized (something you explicitly tell it to do).
I don't know at which resolution your fax program works, but there is still more quality loss to come, at least from the conversion of 24-bit colour to black-and-white.
If you insist on this workflow (i.e. pdf -> png -> stamped png -> pdf -> fax), you should
in the initial rasterization already use the per-inch resolution your rastered image will have in all following steps (including fax transmission),
refrain from anti-aliasing and use of alpha bits (cf. KenS' answer), and
restrict the rasterized image to the colorspace available to the fax transmission, i.e. most likely black-and-white.
PS: As KenS pointed out, the original PDF is already merely a container for an image (with some blur to start with). Therefore, an alternative way to improve your workflow is to extract that image instead of rendering it, stamp that original image, and only resize it (again without anti-aliasing) when faxing.
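With Poppler's pdfimages, extracting the embedded JPEG losslessly is a one-liner (-j writes DCT-encoded image streams out as .jpg files unchanged):
pdfimages -j input.pdf page
This produces files like page-000.jpg, which you could stamp directly.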

PDF compression: How does Adobe do it?

This is a bit more of a fun question than a serious one, but how does the Adobe PDF format make documents so... portable?
I just created a small Word document, 235 KB in size, containing multiple color photos and a few textual phrases. A PDF created using CutePDF (which I understand isn't the most efficient method of PDF creation) is only 176 KB. That's a 25% reduction in size. When those files are placed into a compressed folder, the PDF can be compressed by a further 3%, where the .docx can only manage 2%. I'm sure that larger files would have even greater differences in size.
My question is, how does Adobe manage to make these files so much smaller? I understand that they are drawn from raster graphics, but my 3 bitmap files really can't be compressed that much further, can they?
If you have Acrobat 9 there is a nice tool built-in so you can see how the PDF was put together (and compressions used). There is a blog post explaining how to use it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects
There are a few ways it can be compressing this:
PDF files use LZW and Zip (Flate) compression.
If the image is scaled in the document, or has a higher dpi on disk than you allow for in CutePDF (for example, if CutePDF is set for 300 dpi and the image is 600 dpi), it can be downscaled in the PDF.
Microsoft stores TONS of info in the docx format, in XML, WAY more than is really needed just to export the content (for an example, try copying and pasting your text into a textbox cell and looking at the HTML that comes out; I had a size limit on a textbox for a CMS, and a 7-word sentence ballooned to 950 characters). This is so the document can be edited later, and there is a lot of esoteric info to make sure everything displays right in every possible permutation. The PDF doesn't need that info, so it can just record the font and size and strip out all the unnecessary info, saving a ton of space.
When you use such small files, any overhead in the document format will have a disproportionate effect, which is why you are seeing such large percentage differences.
I took a 2683 KB JPEG and inserted it into a new Word 2003 document. The resulting .doc file was 2725 KB (or 2697 KB as .docx). Turning this into a PDF gives me a 2701 KB PDF. So I am seeing a difference of about 25 KB, but only about a 1% difference, because of the size of the image data. That is about half of what you got; maybe the version of Word you have is more verbose when making a .docx?
For the PDF, Acrobat shows the space usage as 2691 KB of image, 8.27 KB of overhead, and 1 KB of fonts. PDF is quite a sparse format in its syntax, which limits the overhead, and much of it consists of repeating strings, so it is easily compressible.
If you want to see what the PDF contains in a tree-like view you can download the demo version of CosEdit.

How does PS/PDF store and compress bitmaps?

I am experimenting with a system to scan letters and convert the scanned bitmaps to PDF, with the goal of high resolution and a small PDF file size.
I am prototyping with a scanner, GIMP for bitmap manipulation, and ImageMagick for bitmap-to-PDF conversion.
My process looks as follows:
1. Scan in 3x8-bit color at 600 DPI; the LZW-compressed true-color TIFF file size is around 8 MB.
2. Use GIMP to convert the bitmap to an indexed image with a typical color table of 4 to 8 colors. That makes the image better compressible.
3. Use ImageMagick to convert the LZW-compressed indexed TIFF file to PDF, at around 500 KB per page.
Now, in order to make the image compress even better, I could make the bitmap more compression-friendly. Before experimenting here, I would like to know how PS/PDF stores bitmaps.
Are bitmaps in PS/PDF run-length-encoded? Then I would gain compression by removing single pixels from bitmap rows.
Do you have ideas for further optimizing here?
Do you know references to bitmap storage format in PS/PDF?
PDF supports many types of image compression, see: http://en.wikipedia.org/wiki/Pdf#Raster_images
I think you can specify which one to use with the ImageMagick -compress option: http://www.imagemagick.org/script/command-line-options.php#compress
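For example (a sketch; Zip and Group4 are standard -compress values): Flate for an indexed scan like yours, or CCITT Group 4 for purely black-and-white pages:
convert input.tiff -compress Zip output.pdf
convert input.tiff -monochrome -compress Group4 output.pdf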
A few companies (Luratech and CamiNova are the only ones I know) make a "Mixed Raster Content" model in PDF. The files are viewable in the standard Adobe Reader but are very, very small -- comparable to DjVu.
"Mixed Raster Content" means they segment the image into a high resolution B&W mask (hard edges, lines, letters) and lower resolution smooth tone image (background pictures). The mask gets stored using a bitonal compression algorithm (probably JBIG2) and the smooth tone image gets compressed using JP2K (probably).
For bitmaps, IIRC, PDF uses deflate. But PDF can also store images with more specific image compression algorithms, such as JPEG (lossy), CCITT (lossless), JBIG2 (lossy and lossless), and JPX (i.e. JPEG 2000; lossy and lossless).
Adobe's PDF reference might be a good place to start. From a very cursory look, it looks like images are stored uncompressed, but that doesn't feel right at all. It can also link to external images, in JPEG for instance.
The compression method is generally selected by the tool creating the PDF and you may have limited control over that.
If you have Acrobat 9.0 there is a really nice 'hidden' feature which allows you to see the object tree inside a PDF (you are interested in the XObjects under Resources). There is a short blog post on using it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects