Prevent Ghostscript from rasterizing text? - pdf

I'm trying to convert PDFs to PCL (using Ghostscript, but I'd love to hear alternative suggestions), and every driver (Ghostscript device), including all of the built-ins and gutenprint, generates PCL files many times larger than the input PDF. (This is the problem: I need my PCL to be about as small as the input.)
Given that the text doesn't show up in the PCL file, I guess that Ghostscript is rasterizing the text. Is there a way to prevent GS generally, or just gutenprint, from doing that? I'd rather have it either embed the fonts or not embed them at all (and leave it to the printer to render the text).
Unfortunately, there doesn't seem to be any documentation on this point.

There are 3 (I think) types of font in PCL. There are rendered bitmaps, TrueType fonts (in later versions) and the HPGL stick font.
PDF and PostScript have Type 1, Type 2 (CFF), Type 3 and Type 42 (TrueType, but not the same as PCL) fonts, and CIDFonts based on any of the preceding types.
The only font type the two have in common is TrueType, so in order to retain text, any font which was not TrueType would have to be converted into TrueType. This is not a simple task. So Ghostscript simply renders the text, which is guaranteed to work.
PDF is, in general, a much richer format than PCL; there are many PDF constructs (fonts, shading, stroke/fill in a single operation, transparency) which cannot be represented in PCL. So it's entirely possible that the increase in size has nothing to do with text and fonts.
In fact, I believe that the PXL drivers in Ghostscript simply render the entire page to a bitmap at the required resolution, and then wrap that up with enough PCL to be successfully sent to a printer. (I could be mistaken on this point though)
Basically, you are not going to get PCL of a similar size to your PDF out of Ghostscript.

Here is a way to 'prevent Ghostscript from rasterizing text'. But its output will be PostScript. You may, however, succeed in converting this PostScript to PCL5e in an additional step.
The method will convert all glyphs into outline shapes for its PostScript output, and it does not work for its PDF or PCL output. The key here is the -dNOCACHE parameter:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
Of course, converting font glyphs to outlines will take more space than keeping the original fonts embedded, because "fonts" are a space-optimized concept to store, retrieve and render glyph shapes.
Once you have this PostScript, you may be able to convert it to PCL5e with the help of either of the methods you tried before for PDF input (including {Apache?} FOP).
However, I have no idea if the output will be much smaller than versions with rasterized fonts (or even wholly rasterized pages). But it may be worth a test.
Now vote down this answer too...
Update
Apparently, from version 9.15 (to be released during September/October 2014), Ghostscript will support a new command line parameter:
-dNoOutputFonts
which will cause the output devices pdfwrite, ps2write and eps2write "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
That means that the above command should be replaced by this:
gs -o somepdf.ps -dNoOutputFonts -sDEVICE=ps2write somepdf.pdf
Caveats: I've tested this with a few input files using a self-compiled Ghostscript based on current Git sources. It worked flawlessly in each case.
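For scripted use, the command above can be assembled in a few lines of Python. This is only a sketch: the helper name `outline_text_command` is made up for illustration, the flags are exactly the ones quoted above, and it assumes a Ghostscript build (9.15 or later) reachable as `gs`.

```python
def outline_text_command(pdf_path, ps_path, gs="gs"):
    """Build a gs argument list that writes glyphs as plain marking operations.

    Hypothetical helper: it only arranges the flags quoted in the answer
    and does not run anything itself.
    """
    return [
        gs,
        "-o", ps_path,          # output file; -o also implies -dBATCH -dNOPAUSE
        "-dNoOutputFonts",      # flatten glyphs instead of writing fonts
        "-sDEVICE=ps2write",
        pdf_path,
    ]


# To actually run it (assumes gs is installed and the input file exists):
#   import subprocess
#   subprocess.run(outline_text_command("somepdf.pdf", "somepdf.ps"), check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems with file names that contain spaces.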

Related

Ghostpcl PCL to PDF conversion shading resolution

I am trying to use GhostPCL to convert PCL files to PDF on Linux. In the main, this is working well and the majority of documents convert well. However, some documents have boxes and shading, and these are not rendering well at all. The resolution is very poor and, as a result, any text on top of the shading is almost unreadable. Additionally, some alignment is slightly out down the right-hand margin.
I have also used Visual Software pcl2pdf, which does a good job on the shading but unfortunately does not substitute all of the fonts correctly.
the pcl file can be found here
https://dl.dropboxusercontent.com/u/86110783/20170215102450_65702421.pcl
the ghostpcl converted pdf
https://dl.dropboxusercontent.com/u/86110783/ghostpcl20170215102450_65702421.pdf
the pcl2pdf pdf
The command I am using for converting the PCL to PDF is
/opt/ghostpcl/gs -sDEVICE=pdfwrite -sFONTPATH=/opt/fonts -dBATCH -dNOPAGEPROMPT -dNOPAUSE -dQUIET -sOutputFile=$1.pdf $1.pcl
I have tried various different switches to no avail.
Any ideas would be greatly appreciated.
If the pdfwrite device can't handle a graphic primitive 'as is' it will render it to an image. The default resolution is 720 dpi which is ordinarily good for most purposes, but you can alter it with the -r switch.
Note that for PCL it's probably important to set the resolution to 300 or 600 dpi, as that is the only resolution PCL is defined for. The 'shading' you are talking about is, I think, a pattern, and that will only repeat properly at the precise resolution (or integer multiples thereof) for which it's intended.
Even if you run at 600 dpi, it's probably going to look odd as you zoom in and out of the PDF file.
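As a rough illustration of the -r suggestion, here is the command from the question with a resolution switch added, assembled in Python. The helper name `ghostpcl_pdf_command` is hypothetical, the paths are the ones from the question, and 600 is just one of the two values mentioned above.

```python
def ghostpcl_pdf_command(pcl_path, pdf_path, dpi=600,
                         gs="/opt/ghostpcl/gs", font_path="/opt/fonts"):
    """Build the question's GhostPCL command line with an explicit -r switch.

    Hypothetical helper: every flag comes from the question except -r,
    which is the switch the answer suggests.
    """
    return [
        gs,
        "-sDEVICE=pdfwrite",
        f"-r{dpi}",                     # resolution used when content is rendered to an image
        f"-sFONTPATH={font_path}",
        "-dBATCH", "-dNOPAGEPROMPT", "-dNOPAUSE", "-dQUIET",
        f"-sOutputFile={pdf_path}",
        pcl_path,
    ]


# Print the command for inspection; run it with subprocess.run(...) if desired.
print(" ".join(ghostpcl_pdf_command("input.pcl", "output.pdf")))
```

Trying both 300 and 600 and comparing the shaded areas is probably the quickest way to see which one matches the pattern in the PCL file.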
I'm not sure exactly what you are complaining about regarding alignment. Do you mean the fact that the text doesn't fit in the box? That will be because it's using a substitute for the missing font, and the substitute has different metrics from the original.
I don't see a link to the pcl2pdf file.

GhostScript PS to PDF conversion - No Color

Referring to this post, GhostScript Conversion Font Issues, is it safe to assume that GhostScript's PS-to-PDF conversions still do not guarantee cut-&-paste text from the converted document? Because I too am getting garbled copy-&-paste results with formatted documents, although it works with plain text files.
sample Word document .DOC
printed to PostScript by MS PS Driver
converted to PDF by GhostScript
On the color issue, I am using the Microsoft PS Class Driver to print documents to PostScript format files, and then convert them to PDF format with the GhostScript v9.20 DLL (sample source and outputs attached above). The options used are as follows:
-dNOPAUSE
-dBATCH
-dSAFER
-sDEVICE=pdfwrite
-sColorConversionStrategy=/RGB
-dProcessColorModel=/DeviceRGB
However, it is converted without color. Have I missed some option?
You can never guarantee getting a PDF file with text you can cut and paste from a PostScript program. There is no guarantee that there is any ToUnicode information in the PostScript program, and without that, if the font is subset as here, then there is no way to know what the Unicode code point for a given glyph is.
Regarding colour, the PostScript file you have supplied contains no colour, so it's not Ghostscript; the problem is in the way you have produced the PostScript. At a guess, you have used a Printer Definition (PPD file) which is for a monochrome printer.
You might be able to improve the text by playing with the options for downloading fonts. The basic problem is that your PostScript program doesn't contain the information we need to be able to construct a ToUnicode CMap. Without that we are forced to assume that the character codes are ASCII, and in your case, because the fonts are subset, they are not ASCII.
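A quick way to see whether an output PDF ended up with any ToUnicode information is to scan its raw bytes. The sketch below is deliberately naive and the function name is made up: it only sees dictionaries stored uncompressed in the file, so a PDF that keeps its objects in compressed object streams will report zero even when ToUnicode CMaps are present, and descendant CIDFont dictionaries are counted as fonts too.

```python
import re

# \b stops "/Type /Font" from also matching "/Type /FontDescriptor".
FONT_DICT = re.compile(rb"/Type\s*/Font\b")
TOUNICODE = re.compile(rb"/ToUnicode\b")


def tounicode_summary(pdf_bytes):
    """Return (font dictionaries seen, /ToUnicode keys seen) in raw PDF bytes."""
    return len(FONT_DICT.findall(pdf_bytes)), len(TOUNICODE.findall(pdf_bytes))


# Usage (hypothetical file name):
#   with open("converted.pdf", "rb") as fh:
#       print(tounicode_summary(fh.read()))
```

If the second number is well below the first, some fonts have no ToUnicode CMap and copy and paste from text set in them is likely to be unreliable.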
For some reason the content of your PostScript appears to be downloading the font as bitmaps. This is ugly, doesn't scale well, and might be the source of your inability to get ToUnicode data inserted. It may also be caused by the fonts you are using, you might try some standard system fonts (if you aren't already) like TimesNewRoman.
While it's great that you supplied an example to look at, I'd suggest that in future you make the example smaller, much smaller. There's really no need for 13 pages of multiply repeated content in this case. More content means it takes more time to decipher, so try to keep example files to the minimum required to demonstrate the problem.
In short, it looks like both your problems are due to the way you are (or the application) generating the PostScript.

Ghostscript loses emdash characters and replaces with hyphens

When I run a PDF which was originally created with LibreOffice on Linux, through ghostscript 9.19 on OSX, to produce another (flattened) PDF, the output is perfect except for one problem. All emdashes in the entire document have been replaced with a standard hyphen (awkwardly followed by half of a space.) Oddly enough, if I highlight the resulting "hyphen+space", my context menu shows that I've selected an emdash, so the underlying text is still an emdash, it is just rendering the wrong glyph.
I can reproduce this on multiple documents from the same source, and I'm assuming there's a setting or switch somewhere that can help resolve this.
I don't know whether the font used makes a difference, but for the sake of reference, the body text of my document is set in Arno Pro. When I use a modern version of LibreOffice on OS X to make a sample document also containing an emdash in Arno Pro, the same problem is not exhibited, so it seems to be specific to the software which originally made these PDF files.
These PDFs are of legacy projects that I am not set up to reproduce at this time, so I need to prepare them for reprinting using the existing files.
How do I retain emdash glyphs when running a command such as the following?
gs -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite \
-sColorConversionStrategy=/LeaveColorUnchanged \
-dAutoFilterColorImages=true -dAutoFilterGrayImages=true \
-sOutputFile=output.pdf input.pdf
I can add an example of the input PDF to this question if needed.
Without seeing the PDF file it isn't possible to give you an answer. Most likely the font isn't embedded, or if it is embedded doesn't have an emdash glyph.
Copy and paste uses the ToUnicode CMap, so it isn't dependent on the font. It's simply a list of character codes and the Unicode code point associated with each, when using a given font.
Note that this doesn't mean 'the underlying text is still an emdash'. The ToUnicode information is utterly separate from the font end of things, it is effectively metadata and bears no real relation to the font or rendering.
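To make the 'list of character codes' point concrete, here is a small sketch that reads the bfchar sections of a ToUnicode CMap once its stream has been extracted and decompressed. The function name is made up, and bfrange sections, which real CMaps often use as well, are ignored to keep it short.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Map character codes to Unicode strings from the bfchar sections of a CMap."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for code_hex, uni_hex in HEX_PAIR.findall(block):
            # The destination is UTF-16BE text written as hex digits.
            mapping[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return mapping


sample = "2 beginbfchar\n<01> <2014>\n<02> <0041>\nendbfchar"
print(parse_bfchar(sample))
```

In the sample, code 0x01 maps to U+2014 (the emdash) whatever glyph the font actually draws for code 0x01, which is why the selection can report an emdash while a hyphen is displayed.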
Put the file on DropBox and post the URL and someone can look into it. I'll be on vacation for the next few days though, but maybe someone else will look.
Note that in PDF you don't necessarily specify characters and positions as a list of consecutive characters; you can specify the position of each individually, or you can specify widths which override the width in the font, etc. So there almost certainly is only one glyph; the 'white space' you refer to is probably just that, white space, not another glyph.
I should also point out (I do this a lot) that Ghostscript never 'flattens', concatenates, merges, or performs any similar operation on PDF files. When using Ghostscript and the pdfwrite device, the original input (in whatever format) is fully interpreted into graphics marking operations and sent to the device. The device executes the marking operations; in the case of a rendering device, it scan-converts and writes to a bitmap. In the case of pdfwrite, it creates PDF operators.
The result of this is that the output PDF file bears no relation to the input PDF, other than its visual appearance.
You also don't say which version of Ghostscript you are using....

Ghostscript PDF to PDF/A conversion font issues

I am exploring tools to convert PDF documents to PDF/A. Ghostscript seems to give out-of-the-box support for such a conversion. One issue seems to be that some TrueType fonts that are part of the original PDF document are not converted correctly. If I copy text from the converted PDF/A document and paste it into Notepad, the copied text appears garbled.
The original document text can be copied to notepad just fine.
I am using the following script:
gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf
I have uploaded a sample 1 page source PDF in Google Drive:
SampleInput
A sample output PDF/A document generated from the command is in Google drive here:
SampleOutput
Running the above query on this PDF in a windows machine will reproduce the issue.
Are there any settings / commands that make the PDF/A conversion work properly?
Copy and paste from a PDF is not guaranteed. Subset fonts will not have a usable Encoding (such as ASCII or UTF-8), in which case they will only be amenable to cut/paste/search if they have an associated ToUnicode CMap. Many PDF files do not contain ToUnicode CMaps.
Of course, the PDF/A specification states (oddly in my opinion) that you should not use subset fonts, but it's not always possible to tell whether a font is subset (not all creators follow the XXXXXX+ prefix convention), and even if the font isn't subset there still isn't any guarantee that its Encoding is one that is usable.
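For what it's worth, the prefix convention itself is easy to test for: a subset font name normally starts with six uppercase letters and a plus sign. The sketch below (the function name is made up) checks only that, so, as noted above, a name without the tag is not proof that the font is complete.

```python
import re

# Six uppercase letters followed by '+', e.g. "ABCDEF+FreeSansBold".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def has_subset_tag(font_name):
    """Return True if a PDF font name carries the usual subset prefix."""
    # Tolerate a leading slash, as in a raw /BaseFont value.
    return SUBSET_TAG.match(font_name.lstrip("/")) is not None


for name in ("ABCDEF+FreeSansBold", "Arial,Bold"):
    print(name, has_subset_tag(name))
```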
Looking at the file you have posted, it does not contain one of the fonts it uses (Arial,Bold) and so Ghostscript substitutes with DroidSansFallback, and the font it does contain (FreeSansBold) is a subset (FWIW this font doesn't actually seem to be used....). The fallback font is a CIDFont, so there is no real prospect of the text being 'correct'.
I believe that if you make a real font available to Ghostscript to replace Arial,Bold then it will probably work correctly. This would also fix the rather more obvious problem of the spacing of the characters being incorrect (in one place, wildly incorrect), which is caused by the fallback font having different widths to the original.
NB: as the warning messages have already told you, don't use -dUseCIEColor.
The fact that you cannot copy/paste/search a PDF does not mean that it is not a valid PDF/A-1b file though, so this does not mean that the creation (NOT conversion) of the PDF/A-1b is not 'proper'.

How to confirm a TrueType PDF font is missing glyphs

I have a PDF which renders fine in Acrobat but fails to print during the PDF to PS conversion process on our printer's RIP. After uncompressing with pdftk and editing I've found if I replace the usage of a certain font it will print.
The font is a strange one, a TrueType subset with a single character (space).
If I pass the PDF through Ghostscript it reports no errors, however an Acrobat pre-flight check will report a missing glyph for space. This error is not reported for the original file. I'm just using a basic command: gswin32c -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -o gs.pdf original_sample.pdf
I've pulled out the font data from the original PDF and saved it. Running TTFDUMP.exe produces an interesting result where it seems that the 'glyf' table is missing:
4. 'glyf' - chksm = 0x00000000, off = 0x00000979, len = 0
5. 'head' - chksm = 0xE463EA67, off = 0x00000979, len = 54
Just wondering, am I interpreting this result correctly? Is it valid to run TTFDUMP like this on extracted data from a PDF? I think a 'glyf' table is required based on the spec, at least for the first 4 necessary characters.
TTFDUMP run on the ghostscript PDF produces a similar result but with a 1-byte 'glyf' table.
If so it seems that Acrobat doesn't particularly care about the missing space while other programs (including the printer) do. It's odd it isn't reported as missing though until it runs through Ghostscript.
The PDF is created by Adobe InDesign and the font is copyrighted like most so I can't share it.
Edit - I've accepted Ken's answer as he helped me on the Ghostscript bug tracker. In summary, it seems the font is broken as suspected due to the missing glyf table. Until I hear otherwise I'll have to suppose this is a bug in InDesign, and will continue investigating.
Yes, you can run ttfdump on an embedded subset font; it's still a perfectly valid font.
A missing glyph is not specifically a problem, because the .notdef glyph is used instead. A missing .notdef, however, means a font isn't legal.
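If you want to double-check what ttfdump reports, the table directory at the start of the extracted font data is simple enough to read by hand. The following is a minimal sketch with made-up function names; it assumes the bytes are a plain TrueType (sfnt) font program, and it only lists the directory without validating checksums or table contents.

```python
import struct


def table_directory(font_bytes):
    """Return {tag: (checksum, offset, length)} from an sfnt table directory."""
    # Header: uint32 version, uint16 numTables, then three uint16 search fields (12 bytes).
    _version, num_tables = struct.unpack_from(">IH", font_bytes, 0)
    tables = {}
    for i in range(num_tables):
        # Each record: 4-byte tag, uint32 checksum, uint32 offset, uint32 length (16 bytes).
        tag, checksum, offset, length = struct.unpack_from(">4sIII", font_bytes, 12 + 16 * i)
        tables[tag.decode("latin-1")] = (checksum, offset, length)
    return tables


def empty_tables(font_bytes):
    """Return the sorted tags of tables whose directory entry has length 0."""
    return sorted(tag for tag, (_c, _o, length) in table_directory(font_bytes).items()
                  if length == 0)
```

A 'glyf' entry showing up in `empty_tables` would correspond to the `len = 0` line in the dump quoted in the question.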
I think you are mistaken about the legality of sharing the PDF file (from the point of view of font embedding). Practically every PDF file you see will contain copyright fonts, but these are permitted to be embedded and distributed as part of a PDF (or indeed PostScript) file. TrueType fonts contain flags which control the DRM of the font, and which can deny embedding in PDF (or other formats). Ghostscript honours these embedding flags in the font, as do Acrobat Distiller and other Adobe products.
There were some fonts which inadvertently shipped with DRM which prevented embedding, and there's a list somewhere of these, along with an explicit statement from the font foundry that it's permissible to embed these fonts. I think this was somewhere on the Adobe web site a few years back.
So if you have a PDF file with the font embedded in it (especially if it was produced by an Adobe application) then I would be comfortable that it's legal to share.
I'm having some trouble figuring out what the problem actually is, and how you are using Ghostscript. If you are running the PDF->PS and then back to PDF then all bets are off frankly. Round-tripping files will often provoke problems.
In any event I'm happy to look at the file but you will have to make it available.