Ghostscript embedded fonts and substitution - pdf

I'm converting PDF to JPG with gs.
Does gs substitute embedded fonts? How exactly does this work? If I embed all fonts that are used in the PDF, does gs still look for a substitution, or can it use the embedded font data?
So does embedding fonts in a PDF mean that all glyphs used in the PDF with that font are embedded, and I don't need to have that font in my gs font path?
Thanks!

When you output a JPEG file, you are in effect outputting an image. Ghostscript renders the page as an image, then compresses that image using JPEG. JPEG is lossy, so to avoid reduced legibility of the text, use a lossless format such as PNG instead; JPEG is really only a good fit for photographs, where lossless output would be much too big.
In a bitmap image there are no fonts, only pixels. So for text rendering (e.g. black text on a white page), Ghostscript creates a bitmap consisting only of greyscale pixels (the grey levels come from anti-aliasing), then saves that.
To be able to do that, Ghostscript must have access to the fonts at the time of PDF rendering and JPEG creation. This means the fonts must either be installed on the system (and in your font path) or be embedded in the PDF in the first place. If a font is embedded, Ghostscript uses the embedded glyph data and does not need that font in its font path; it only looks for a substitute when a font is not embedded. The fonts are not needed to view the JPEG file.
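As a rough sketch of the rendering step described above, something like the following should produce one anti-aliased PNG per page. The function name, file names and the 150 dpi default are placeholders of mine, not anything from the question; the gs switches themselves (png16m device, -r, -dTextAlphaBits, -dGraphicsAlphaBits) are standard ones.

```shell
# Render every page of a PDF to PNG with anti-aliased text and graphics.
# Usage: render_pdf_to_png input.pdf output-prefix [dpi]
# (function name, prefix and default dpi are illustrative assumptions)
render_pdf_to_png() {
  input=$1
  prefix=$2
  dpi=${3:-150}
  # png16m is lossless 24-bit colour; %03d is replaced by the page number.
  gs -dSAFER -dBATCH -dNOPAUSE \
     -sDEVICE=png16m \
     -r"$dpi" \
     -dTextAlphaBits=4 -dGraphicsAlphaBits=4 \
     -sOutputFile="${prefix}-%03d.png" \
     "$input"
}
```

Calling `render_pdf_to_png report.pdf page 200` would write page-001.png, page-002.png and so on. If you really need JPEG, swapping the device for `jpeg` should work the same way, at the cost of lossy compression.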

Related

imagemagick - generated PDFs are blurry/hazy

I have an image of a website which I'm trying to convert to PDF. I have the image in several formats: PSD, PNG, JPG, TIFF, all saved losslessly.
I'm using the following command to convert the image to PDF:
convert -density 93 foo.jpg bar.pdf
Here is part of the original image:
And here is the same part, after converting to PDF:
As you can see, the second one is ever so slightly hazy. What's causing this, and how can I eliminate it? I've seen PDFs with crisp graphics, so I know it's possible.
If you are seeing the same results with multiple input types, the fuzziness is likely caused by the anti-aliasing feature of your PDF viewer. If using Acrobat, you can turn off image anti-aliasing as follows:
Go to Edit > Preferences > Page Display.
Untick the option "Smooth Images" and hit "OK".
The crisp graphics you are seeing in other PDFs are likely vector graphics. ImageMagick creates a PDF and embeds your image inside it, and that image may be subject to compression.
Also:
When using JPEG as input, add "-quality 100" to your ImageMagick call to retain the highest quality possible.
Use a higher value for the "-density" parameter (I would recommend at least 150) to generate a higher resolution PDF.
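Putting those two suggestions together, a minimal sketch might look like this. The function name and file names are made up for illustration, and starting from the PNG rather than the JPG is my assumption, to avoid an extra lossy step. As far as I know, -density does not resample the pixels of a raster input here; it only sets how many pixels map to an inch on the PDF page.

```shell
# Wrap an image in a PDF at a chosen page density.
# Usage: image_to_pdf input.png output.pdf [dpi]
# (function name and default dpi are illustrative assumptions)
image_to_pdf() {
  input=$1
  output=$2
  dpi=${3:-150}
  # -quality 100 asks for the least lossy encoding if the image
  # ends up being stored as JPEG inside the PDF.
  convert -density "$dpi" "$input" -quality 100 "$output"
}
```

For example, `image_to_pdf foo.png bar.pdf` uses the 150 default, while `image_to_pdf foo.png bar.pdf 93` reproduces the density from the question.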

ps2pdf - force/keep plain text

How can I use ps2pdf and force it to keep plain text if the original PDF contains real text?
Sometimes, if a PDF has some areas with background colors, it converts the whole PDF to an image!?
How do I force ps2pdf to keep plain text?
Syntax:
pdf2ps file.pdf file.pdf.ps
ps2pdf -dPDFSETTINGS=/screen -dColorImageResolution=50 -dGrayImageResolution=50 file.pdf.ps file_output.pdf
PDF example
www.bluemachines.dk/pdf_comp/dyn.pdf
The first answer is: drive Ghostscript directly, don't use ps2pdf (or pdf2ps).
If you are getting text converted to an image, then it's most likely because the original PDF file has transparency, which cannot be represented in PostScript. The only way to deal with that is to render the area of transparency.
There is no way to force the Encoding of the text to be maintained, though in general it won't change. However, this is highly dependent on the font and encoding used in the input. I can't say more without seeing an example.
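"Driving Ghostscript directly" could look roughly like the sketch below: the PDF goes straight to the pdfwrite device, so there is no PostScript step in between where transparency would have to be rendered. The function name and file names are placeholders; the settings are simply the ones from the question, passed to gs instead of ps2pdf.

```shell
# Re-distill a PDF with pdfwrite, skipping the pdf2ps/ps2pdf round trip.
# Usage: shrink_pdf input.pdf output.pdf [image-dpi]
# (function name and default dpi are illustrative assumptions)
shrink_pdf() {
  input=$1
  output=$2
  dpi=${3:-50}
  gs -dSAFER -dBATCH -dNOPAUSE \
     -sDEVICE=pdfwrite \
     -dPDFSETTINGS=/screen \
     -dColorImageResolution="$dpi" \
     -dGrayImageResolution="$dpi" \
     -sOutputFile="$output" \
     "$input"
}
```

With the names from the question this would be `shrink_pdf file.pdf file_output.pdf`.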

Prevent Ghostscript from rasterizing text?

I'm trying to convert PDFs to PCL (using ghostscript, but I'd love to hear alternative suggestions), and every driver (ghostscript device), including all of the built-ins and gutenprint, generates PCL files many times larger than the input PDF. (This is the problem: I need my PCL to be about as small as the input.)
Given that the text doesn't show up in the PCL file, I guess that Ghostscript is rasterizing the text. Is there a way to prevent GS generally, or just gutenprint, from doing that? I'd rather either have it embed the fonts, or not even embed the fonts (leave it to the printer to render the fonts)?
Unfortunately, there doesn't seem to be any documentation on this point.
There are 3 (I think) types of font in PCL. There are rendered bitmaps, TrueType fonts (in later versions) and the HPGL stick font.
PDF and PostScript have Type 1, Type 2 (CFF), Type 3 and Type 42 (TrueType, but not the same as PCL) fonts, and CIDFonts based on any of the preceding types.
The only font type the two have in common is TrueType, so in order to retain text, any font which was not TrueType would have to be converted into TrueType. This is not a simple task. So Ghostscript simply renders the text, which is guaranteed to work.
PDF is, in general, a much richer format than PCL; there are many PDF constructs (fonts, shading, stroke/fill in a single operation, transparency) which cannot be represented in PCL. So it's entirely possible that the increase in size has nothing to do with text and fonts.
In fact, I believe that the PXL drivers in Ghostscript simply render the entire page to a bitmap at the required resolution, and then wrap that up with enough PCL to be successfully sent to a printer. (I could be mistaken on this point though)
Basically, you are not going to get PCL of a similar size to your PDF out of Ghostscript.
Here is a way to 'prevent Ghostscript from rasterizing text', but its output will be PostScript. You may, however, succeed in converting this PostScript to PCL5e in an additional step.
The method will convert all glyphs into outline shapes for its PostScript output, and it does not work for its PDF or PCL output. The key here is the -dNOCACHE parameter:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
Of course, converting font glyphs to outlines will take more space than keeping the original fonts embedded, because "fonts" are a space-optimized concept to store, retrieve and render glyph shapes.
Once you have this PostScript, you may be able to convert it to PCL5e with the help of either of the methods you tried before for PDF input (including {Apache?} FOP).
However, I have no idea if the output will be much smaller than versions with rasterized fonts (or even wholly rasterized pages). But it may be worth a test.
Update
Apparently, from version 9.15 (to be released during September/October 2014), Ghostscript will support a new command line parameter:
-dNoOutputFonts
which will cause the output devices pdfwrite, ps2write and eps2write "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
That means that the above command should be replaced by this:
gs -o somepdf.ps -dNoOutputFonts -sDEVICE=ps2write somepdf.pdf
Caveats: I've tested this with a few input files using a self-compiled Ghostscript based on current Git sources. It worked flawlessly in each case.

Embed JPG data properly in PDF files generated by Inkscape

There is a bug in Inkscape where JPEG images included in an SVG document are embedded as bitmaps rather than JPEG when exporting to PDF files.
The result is a huge increase in file size. For example, I have a simple SVG drawing which includes a 2 MB JPEG image; exporting to PDF results in a 14 MB file.
I am looking for a workaround. Is there a way to fix the resulting PDF by inserting the correctly-encoded JPG image, perhaps via some sort of pdftk trickery?
(In my case, the resulting PDF will be included as a figure in a LaTeX document rendered with pdflatex, so there may be workarounds other than directly fixing the PDF generated by Inkscape.)
One kludge is to use pdf2ps followed by ps2pdf, which will re-encode the bitmap data as JPEG:
pdf2ps made-by-inkscape.pdf foo.ps
ps2pdf foo.ps smaller-file.pdf
For my test case, the file sizes were:
original JPEG 2.1M
made-by-inkscape.pdf 15M
foo.ps 104M
smaller-file.pdf 1.5M
But of course, this involves re-encoding the JPEG data, which is best avoided.
I found that with Inkscape 0.48.1 exporting to EPS instead, and passing the resulting EPS file to the epstopdf script, produces good results. PNG/JPG files stay PNG/JPG within the PDF file, fonts look alright, etc.
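A sketch of that EPS route, for what it's worth. The function name and file names are invented for the example, and `--export-eps` is the command-line option as I remember it from the Inkscape 0.48 series; newer Inkscape releases changed their export options, so treat this as tied to 0.48.

```shell
# Export an SVG to EPS with Inkscape 0.48, then convert the EPS to PDF.
# Usage: svg_to_pdf_via_eps drawing.svg drawing.pdf
# (function name is an illustrative assumption)
svg_to_pdf_via_eps() {
  svg=$1
  pdf=$2
  # Put the intermediate EPS next to the PDF, e.g. drawing.pdf -> drawing.eps
  eps="${pdf%.pdf}.eps"
  # Only run epstopdf if the Inkscape export succeeded.
  inkscape "$svg" --export-eps="$eps" &&
  epstopdf "$eps" --outfile="$pdf"
}
```

Since the result is going into pdflatex anyway, `svg_to_pdf_via_eps figure.svg figure.pdf` followed by the usual \includegraphics should be enough.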

How does PS/PDF store and compress bitmaps?

I am experimenting with a system to scan letters and convert the scanned bitmaps to PDF with the goal to have a high resolution and a small PDF file size.
I am prototyping with scanner, GIMP for bitmap manipulation and ImageMagick for bitmap-to-PDF conversion.
My process looks as follows:
Scan in 3x8-bit color at 600 DPI to an LZW-compressed true-color TIFF file; size is around 8 MB.
Use GIMP to convert the bitmap to an indexed image with a typical color table of 4 to 8 colors. That makes the image better compressible.
Use ImageMagick to convert the LZW-compressed indexed TIFF file to PDF, with around 500K per page.
Now in order to make the image even better compressible, I could make the bitmap more compression-friendly. Before experimenting here, I would like to know how PS/PDF stores bitmaps.
Are bitmaps in PS/PDF run-length-encoded? Then I would gain compression by removing single pixels from bitmap rows.
Do you have ideas for further optimizing here?
Do you know references to bitmap storage format in PS/PDF?
PDF supports many types of image compression, see: http://en.wikipedia.org/wiki/Pdf#Raster_images
I think you can specify which one to use with the imagemagick -compress option: http://www.imagemagick.org/script/command-line-options.php#compress
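For instance, a small wrapper along these lines lets you try different methods on the same scan and compare file sizes. The function name, file names and the Zip default are my own choices; Zip, LZW, RLE, JPEG and Group4 are among the values -compress accepts, though which ones actually end up in the PDF may depend on your ImageMagick build.

```shell
# Convert a scanned image to PDF with an explicit compression method.
# Usage: scan_to_pdf input.tif output.pdf [compression]
# (function name and the Zip default are illustrative assumptions)
scan_to_pdf() {
  input=$1
  output=$2
  compression=${3:-Zip}
  convert "$input" -compress "$compression" "$output"
}
```

So `scan_to_pdf letter.tif letter-zip.pdf` and `scan_to_pdf letter.tif letter-g4.pdf Group4` give you two files to compare; Group4 is a black-and-white fax encoding, so it only makes sense if a bilevel image is acceptable.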
A few companies (Luratech and CamiNova are the only ones I know) make a "Mixed Raster Content" model in PDF. The files are viewable in the standard Adobe Reader but are very, very small -- comparable to DjVu.
"Mixed Raster Content" means they segment the image into a high resolution B&W mask (hard edges, lines, letters) and lower resolution smooth tone image (background pictures). The mask gets stored using a bitonal compression algorithm (probably JBIG2) and the smooth tone image gets compressed using JP2K (probably).
For bitmaps, IIRC, PDF uses deflate. But PDF can also store images with more specific image compression algorithms, such as JPEG (lossy), CCITT (lossless), JBIG2 (lossy and lossless) and JPX (or JPEG 2000, lossy and lossless).
Adobe's PDF reference might be a good place to start. From a very cursory look, it looks like images are stored uncompressed, but that doesn't feel right at all. It can also link to external images, in JPEG for instance.
The compression method is generally selected by the tool creating the PDF and you may have limited control over that.
If you have Acrobat 9.0 there is a really nice 'hidden' feature which allows you to see the object tree inside a PDF (you are interested in the XObjects under Resources). There is a short blog on using it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects