PDF Font Embedding Significantly Affecting File Size - pdf

I have a PDF which uses 'Calibri' as a font. Our printers insist that it must be embedded into the document; however, when we do that, the PDF is approximately 3x larger.
We initially thought there isn't much we can do about this, but the printers sent over a document which has both 'Calibri' embedded and the smaller file size.
The difference between the two can be seen here:
[Screenshots of the two files' properties: ours and the printers']
It's clear that embedded Fonts are the culprit here.
How can we produce PDFs with this smaller file size?
The library we are using is Microsoft.Reporting.WebForms but I suspect we may need to do some post-processing to reduce the sizes, therefore do you have any suggestions?

Related

PDF Optimisation: pdftops -passfonts - How did it make PDF loads way faster?

A few weeks ago, our users pointed out that some large OCRed PDFs (ABBYY generated) load extremely slowly and asked us to do some optimisation on them.
After some investigation, the problem seems to be caused by the complex text embedded within the PDF. I tried different scripts to optimise the PDFs, such as ghostscript, qpdf, etc...
The only approach I found that made a significant improvement was to use pdftops (from poppler) with the -passfonts option and convert the result back to PDF with ghostscript's ps2pdf: pdftops -passfonts input.pdf output.ps && ps2pdf output.ps output.pdf.
However, the problem is that I have no idea how -passfonts makes the PDF load faster, or whether it has a side effect that I am not aware of...
So can PDF gurus shed some light on the reason/logic behind this optimisation?
Thank you all in advance!!
Jeffrey
from http://linux.die.net/man/1/pdftops
-passfonts
By default, references to non-embedded 8-bit fonts in the PDF file are substituted with the closest "Helvetica", "Times-Roman", or "Courier" font. This option passes references to non-embedded fonts through to the PostScript file
When the file opens, the reader will look on the system for the non-embedded fonts, and load them when it finds them. The more non-embedded fonts there are, the more checks it has to make. Sometimes fonts are not embedded for legal reasons, sometimes because they make the file size go out of proportion, and sometimes for various other reasons. By substituting the non-embedded fonts with a more common font, I'd say you are forcing the PDF to load a smaller number of fonts, and possibly forcing it to use fonts that have a smaller memory footprint, leading to a faster load time.
Compare the fonts list before and after. Maybe that will shed more light.
If you open the document in Adobe Acrobat:
File -> Properties -> Fonts
Be cautious with font substitution! It may completely ruin the look and feel of a document.
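If Acrobat is not at hand, the font lists can be compared with a rough script. The following is a minimal sketch, not a real PDF parser: it assumes the font dictionaries are stored as plain text in the file (it will miss fonts held in compressed object streams), and the function name list_fonts is made up for illustration. It relies on the PDF convention that a subset font's name starts with six uppercase letters and a plus sign.

```python
import re

# Matches "/BaseFont /SomeName" in raw PDF bytes.
FONT_NAME = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")
# Subset fonts carry a tag of six uppercase letters plus "+", e.g. "ABCDEF+Calibri".
SUBSET_TAG = re.compile(rb"^[A-Z]{6}\+")


def list_fonts(pdf_bytes):
    """Return {font name: True if the name carries a subset tag}."""
    fonts = {}
    for match in FONT_NAME.finditer(pdf_bytes):
        raw_name = match.group(1)
        fonts[raw_name.decode("latin-1")] = bool(SUBSET_TAG.match(raw_name))
    return fonts


# Hypothetical fragment standing in for the contents of a real file.
sample = (b"<< /Type /Font /Subtype /TrueType /BaseFont /ABCDEF+Calibri >>\n"
          b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>\n")
for name, is_subset in list_fonts(sample).items():
    print(name, "(subset)" if is_subset else "(no subset tag)")
```

For a real file, pass open(path, "rb").read() to the function and run it on the PDF before and after the conversion.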

Large PDF sizes but less quality

I'm organizing a large number of PDFs, some of which need to be inverted or have their contrast adjusted. But when I use convert to modify a PDF, the new file becomes much bigger than the original, even though I only use the density and quality options to match the original quality. A typical command looks like this:
convert -density 300 OrignalPDF.pdf -quality 100 -negate NewPDF.pdf
This results in a pdf that looks very nearly as sharp as the original, but when switching between the two (with the original inverted within the pdf viewer's settings (qpdfview)), one notices that the new one seems very slightly shrunken and that all the lines become slightly thicker/bolder. Obviously this isn't too bad, but shouldn't I be able to invert the colors with almost no noticeable changes?
This slight change becomes even more ridiculous when one notices the size disparity: the original image was 276 KB and the modified file is 28 MB. That's more than 100 times larger! Given that I have hundreds of PDFs, out of which more than 20 or 30 need to be (custom) modified, how can I keep the total size near the original total size, while retaining quality?
Imagemagick's documentation says:
However the reading of these formats is very complicated, as they are full computer languages designed specifically to generate a printed page on high quality laser printers. This is well beyond the scope of ImageMagick, and so it relies on a specialized delegate program known as "ghostscript" to read, and convert Postscript and PDF pages to a raster image.
So, ImageMagick first converts the PDF to a raster image and then builds a simple PDF from that image. The output PDF is unsearchable and contains no vectors, no hidden text, etc., just a page-wide raster image. But PDF (and PostScript) is not just a set of images: it is a set of commands, text, vectors, fonts, and even sub-scripts (to calculate output color, for example). PDF is more like an application than a static image.
Anyway, I suppose you may have 2 types of input PDF files:
with page-wide images inside (for example, scanned documents). Only this first type should be processed with ImageMagick. Files of this type will be converted to nearly the same file size.
with pure text and vectors inside (for example, PDF invoices). Files of this type should not be processed with ImageMagick, as the conversion damages the input file (and increases the output file size). If you still need to adjust the contrast or compression of images inside files of this type, consider using Ghostscript directly; check this tutorial.
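To get a feel for why rasterizing a text/vector page inflates the file, it helps to estimate the raw pixel data that -density 300 produces. This is a back-of-the-envelope sketch; the A4 page size and 24-bit RGB color are assumptions, since the question does not state the page dimensions or page count.

```python
def raster_bytes(width_in, height_in, dpi, channels=3, bits_per_channel=8):
    """Uncompressed size in bytes of one page rendered to pixels."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * bits_per_channel // 8


# Assumption: an A4 page (8.27 x 11.69 inches) rendered as 24-bit RGB.
size = raster_bytes(8.27, 11.69, dpi=300)
print(f"{size / 1e6:.1f} MB per page before compression")
```

Under these assumptions a single page is roughly 26 MB of pixels before any image compression is applied, so a file that was a few hundred KB of text and vector commands can easily end up in the tens of megabytes.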

Why such a big difference in size between almost identical documents

I have two PDFs, the first created with libharu and the second with PDF::API2. Apart from the coordinates, the content is the same, but the first PDF is four times larger than the second. The only difference I found is the type of font embedding shown in the Fonts tab of the document properties.
In the first:
Verdana (Embedded Subset)
Type: TrueType
Encoding: Custom
In the second:
Verdana
Type: TrueType
Encoding: Custom
Actual Font: Verdana
Actual font Type: TrueType
How to deal with that embedded subset?
There are many factors that affect the size of the PDF. Your problem may be in the way the PDF creation libraries handle font embedding, specifically:
"Embedded subset" means that the font program itself is stored in the file, but only for the glyphs the document actually uses rather than for the whole font.
If the font is not embedded, presumably it is loaded by the reader from the system, reducing the size of the file.
If the PDF is already small (only has one page, little text and no images), embedding fonts may make a relatively big difference on the size of the document. Still, in absolute terms, an embedded font shouldn't take a lot of space.
Another factor you should check is compression. A PDF's object structure is plain text, but its content streams are usually stored in compressed form. Try opening both PDFs in a plain text editor and see whether the stream contents are readable or gibberish. The gibberish (compressed) form will naturally take less space.
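Instead of eyeballing the file in a text editor, the same check can be roughed out in a few lines. This is only a sketch that scans raw bytes, not a PDF parser: it counts stream keywords and looks for a /Filter entry between the start of the object and the stream. The helper names count_streams and make_stream_object are invented for this example, and the sample data is a hand-built fragment rather than a complete PDF.

```python
import re
import zlib

# "stream" followed by a newline, but not the tail of "endstream".
STREAM_KEYWORD = re.compile(rb"(?<!end)stream\r?\n")


def count_streams(pdf_bytes):
    """Return (total streams, streams whose dictionary has a /Filter)."""
    total = filtered = 0
    for match in STREAM_KEYWORD.finditer(pdf_bytes):
        total += 1
        # The stream dictionary sits between "N 0 obj" and the keyword.
        obj_start = pdf_bytes.rfind(b" obj", 0, match.start())
        if b"/Filter" in pdf_bytes[max(obj_start, 0):match.start()]:
            filtered += 1
    return total, filtered


def make_stream_object(number, data, compress):
    """Build one stream object, optionally Flate-compressed (demo data only)."""
    body = zlib.compress(data) if compress else data
    filt = b" /Filter /FlateDecode" if compress else b""
    header = b"%d 0 obj\n<< /Length %d%s >>\nstream\n" % (number, len(body), filt)
    return header + body + b"\nendstream\nendobj\n"


content = b"BT /F1 12 Tf (Hello) Tj ET\n" * 50
plain = make_stream_object(1, content, compress=False)
packed = make_stream_object(2, content, compress=True)
sample = b"%PDF-1.4\n" + plain + packed

print(count_streams(sample), len(plain), len(packed))
```

If most streams in the larger file have no filter while those in the smaller file do, compression rather than fonts is the likely cause of the size gap.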
Finally, you can inspect the objects the PDF file is composed from using the many PDF inspectors out there, for example this one (I just googled it up, no guarantees it'll work as expected).
This is an old question but I had a similar issue.
Did you set libharu to compress your PDF?
In C++, from the documentation:
HPDF_SetCompressionMode (pdf, HPDF_COMP_ALL);

PDF compression: How does Adobe do it?

This is a bit more of a fun question than a serious one, but how does the Adobe PDF format make documents so... portable?
I just created a small Word document, 235kb in size, containing multiple color photos and a few textual phrases. A PDF created using CutePDF (which I understand isn't the most efficient method of PDF creation) is only 176kb. That's a 25% reduction in size. When those files are placed into a compressed folder, the PDF compresses by a further 3% where the .docx only manages 2%. I'm sure that larger files would have even greater differences in size.
My question is, how does Adobe manage to make their files so much smaller? I understand that they are drawn from raster graphics, but my 3 bitmap files really can't be helped from raster that much, can they?
If you have Acrobat 9 there is a nice tool built-in so you can see how the PDF was put together (and compressions used). There is a blog post explaining how to use it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects
There are a few ways it can be compressing this:
PDF files use LZW and Flate (zip-style deflate) compression for their streams.
If the image is scaled in the document, or has a higher dpi on disk than you allow for in CutePDF (for example, if CutePDF is set for 300 dpi and the image is 600 dpi), it can be downsampled in the PDF.
Microsoft stores TONS of info in the docx format, in xml. WAY more than is really needed to just export the info (for an example, try copying and pasting your text into a textbox cell, and look at the html info that comes out - I had a limit on a textbox size for a cms, and a 7 word sentence ballooned to 950 characters). This is so it can be later edited, and with a lot of esoteric info to make sure everything displays right in every possible permutation. The pdf doesn't need that info, and so it can just do the font and size, and strip out all the unnecessary info, saving a ton of space.
When you use such small files any overhead in the document format will have a disproportionate effect which is why you are seeing such large % differences.
I took a 2683KB JPEG and inserted it into a new word 2003 document. The resulting .doc file was 2725KB (or 2697KB as docx). Turning this into a PDF gives me a 2701KB PDF. So I am seeing a difference of 25KB, but only about 1% difference because of the size of the image data. It is about half what you got but maybe the version of word you have is more verbose when making docx?
For the PDF, acrobat shows space usage as 2691K image, 8.27K overhead and 1K fonts. PDF is quite a sparse format in its syntax which limits overhead and much of it has repeating strings so is easily compressible.
If you want to see what the PDF contains in a tree-like view you can download the demo version of CosEdit.

How does PS/PDF store and compress bitmaps?

I am experimenting with a system to scan letters and convert the scanned bitmaps to PDF with the goal to have a high resolution and a small PDF file size.
I am prototyping with scanner, GIMP for bitmap manipulation and ImageMagick for bitmap-to-PDF conversion.
My process looks as follows:
Scan in 3x8-bit color at 600 DPI to an LZW-compressed true-color TIFF file; the size is around 8 MB.
Use GIMP to convert the bitmap to an indexed image with a typical color table of 4 to 8 colors. That makes the image more compressible.
Use ImageMagick to convert the LZW-compressed indexed TIFF file to PDF, at around 500K per page.
Now in order to make the image even better compressible, I could make the bitmap more compression-friendly. Before experimenting here, I would like to know how PS/PDF stores bitmaps.
Are bitmaps in PS/PDF run-length-encoded? Then I would gain compression by removing single pixels from bitmap rows.
Do you have ideas for further optimizing here?
Do you know references to bitmap storage format in PS/PDF?
PDF supports many types of image compression, see: http://en.wikipedia.org/wiki/Pdf#Raster_images
I think you can specify which one to use with the imagemagick -compress option: http://www.imagemagick.org/script/command-line-options.php#compress
A few companies (Luratech and CamiNova are the only ones I know) make a "Mixed Raster Content" model in PDF. The files are viewable in the standard Adobe Reader but are very, very small -- comparable to DjVu.
"Mixed Raster Content" means they segment the image into a high resolution B&W mask (hard edges, lines, letters) and lower resolution smooth tone image (background pictures). The mask gets stored using a bitonal compression algorithm (probably JBIG2) and the smooth tone image gets compressed using JP2K (probably).
For bitmaps, IIRC, PDF uses deflate. But PDF can also store images with more specific image compression algorithms, such as JPEG (lossy), CCITT (lossless), JBIG2 (lossy and lossless) and JPX (i.e. JPEG 2000, lossy and lossless).
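If deflate is the filter in use, the effect of stray pixels can be tried out with Python's zlib module before touching the real scans. This is a toy experiment under stated assumptions: the bitmap is a made-up 1-bit image (white page, one thick black line), and make_bitmap and its parameters are hypothetical names for this sketch only.

```python
import random
import zlib

WIDTH_BYTES, HEIGHT = 100, 200  # hypothetical 800 x 200 pixel, 1-bit bitmap


def make_bitmap(noise_pixels=0, seed=0):
    """White page with a 3-row black line, plus optional isolated flipped pixels."""
    rows = [bytearray(b"\xff" * WIDTH_BYTES) for _ in range(HEIGHT)]
    for y in range(50, 53):
        rows[y] = bytearray(b"\x00" * WIDTH_BYTES)
    rng = random.Random(seed)  # fixed seed keeps the demo repeatable
    for _ in range(noise_pixels):
        y = rng.randrange(HEIGHT)
        x = rng.randrange(WIDTH_BYTES)
        rows[y][x] ^= 1 << rng.randrange(8)  # flip a single pixel
    return b"".join(bytes(row) for row in rows)


clean = make_bitmap()
noisy = make_bitmap(noise_pixels=300)
print("clean:", len(zlib.compress(clean)), "bytes")
print("noisy:", len(zlib.compress(noisy)), "bytes")
```

Both bitmaps are the same size uncompressed, but the noisy one compresses noticeably worse, which suggests that despeckling before conversion is worth trying even though deflate is not a pure run-length scheme.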
Adobe's PDF reference might be a good place to start. From a very cursory look, it looks like images are stored uncompressed, but that doesn't feel right at all. It can also link to external images, in JPEG for instance.
The compression method is generally selected by the tool creating the PDF and you may have limited control over that.
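To make the "Mixed Raster Content" idea mentioned above more concrete, here is a toy sketch of the segmentation step in plain Python. It is not how Luratech or CamiNova do it; the threshold, the block-averaging scheme and the name split_mrc are assumptions chosen for illustration. It splits a grayscale image into a full-resolution 1-bit mask and a lower-resolution background that ignores the masked pixels.

```python
def split_mrc(gray, threshold=128, factor=2):
    """gray: rows of 0..255 values. Returns (mask, background)."""
    height, width = len(gray), len(gray[0])
    # Full-resolution bitonal mask: 1 where the pixel is dark (text, lines).
    mask = [[1 if p < threshold else 0 for p in row] for row in gray]
    # Low-resolution background: average the unmasked pixels of each block.
    background = []
    for by in range(0, height, factor):
        out_row = []
        for bx in range(0, width, factor):
            vals = [gray[y][x]
                    for y in range(by, min(by + factor, height))
                    for x in range(bx, min(bx + factor, width))
                    if not mask[y][x]]
            # A fully masked block has no background; fill it with white.
            out_row.append(sum(vals) // len(vals) if vals else 255)
        background.append(out_row)
    return mask, background


page = [[255, 0, 255, 255],
        [255, 0, 200, 200],
        [250, 250, 0, 0],
        [250, 250, 0, 0]]
mask, background = split_mrc(page)
print(mask)
print(background)
```

In a real pipeline the mask would then go to a bitonal codec and the background to a continuous-tone codec, as described above; this sketch stops at the split.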
If you have Acrobat 9.0 there is a really nice 'hidden' feature which allows you to see the object tree inside a PDF (you are interested in the XObjects under Resources). There is a short blog on using it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects