How to convert the font objects in a PDF to CFF - pdf

PDFs which use CFF fonts instead of Type 1 fonts are much smaller. Running a PDF through a ghostscript conversion will indeed produce a PDF with Type1C (CFF) fonts. But this also "re-renders" the PDF, getting rid of certain factors such as the print/screen distinction (if used). It there any way to convert a PDF with Type 1 fonts to a PDF which is otherwise identical, but which uses Type1C fonts instead, i.e., just converting the font objects?

Related

open a PDF file with automatically replaced Fonts

I am not a programmer, but a normal user who uses Linux.
I want to use Ghostscript to DISPLAY Pdf files, not to CREATE Pdf files. (I have never used Ghostscript until now).
But I want Ghostscript to automatically replace all fonts with other fonts when I open the PDF. No matter if the fonts are embedded or not.
With which fonts should the fonts be replaced?
Answer: I want to create a list of fonts, that I want to be available for replacement.
But which of these fonts on the list should be used?
Answer: The one that best matches the metric of the font to be replaced.
Is it possible to do this somehow?
You can't get Ghostscript to do what you are asking. If a PDF file contains fonts Ghostscript will use those fonts, it will only substitute if it cannot find an embedded font.
The reason for this is simple; the font embedded in the PDF file is the correct font. It's Metrics are correct, and the mapping form character code to the appropriate glyph selector in the font will be correct.
It's also a non-trivial problem to select from a list of fonts the one which 'best matches the metrics of the font to be replaced'. What characteristics should be considered ? How should those be determined ?
When a font is not embedded then Ghostscript will consult its own list of fonts and CIDFonts. Both of these lists can be customised, the documentation is here
But since a substitute font is always going to be a compromise, you can't tell Ghostscript not to use the embedded fonts in a PDF. Well technically you could, by modifying the PDF interpreter, but you say you aren't a programmer, so I doubt you will want to try that.

The 14 standard PDF fonts and character encoding

I'm having difficulty producing PDFs that make use of the 14 standard PDF fonts. Let's use Times-Roman as an example.
I create a Font dictionary of type Type1, with BaseFont set to Times-Roman. If I omit the Encoding entry to the Font dictionary, or add an Encoding dictionary without a BaseEncoding set, the PDF viewer application should use the font's built-in encoding. For Times-Roman, this is AdobeStandardEncoding.
This works fine for ASCII characters. However, something more exotic like the 'fi' ligature (AdobeStandardEncoding code 174) is not displayed correctly by all PDF viewers:
Adobe Reader shows ® (unicode index 174) for Times-Roman and Ă for Times-Italic
SumatraPDF (wine) shows ® for both fonts
Mozilla's PDF.js shows the 'AE' ligature both fonts
All other PDF viewers I've tried, display the 'fi' ligature properly. They also display the € symbol correctly, which is additionally mapped using the Differences array in the Encoding dictionary (because it is not included in AdobeStandardEncoding):
Apple Preview/Skim
GhostScript
PDF-XChange Viewer (wine)
Foxit Reader (wine)
Chromium's internal PDF viewer
Evince (homebrew)
Opening Adobe Reader's Document Properties window shows:
Times-Roman
Type: Type1
Encoding: Custom
Actual Font: Times-Roman
Actual Font Type: TrueType
I suspect the fact that a TrueType font is being used instead of a Type1 font might be related to the problem. The PDF specification:
StandardEncoding Adobe standard Latin-text encoding. This is the
built-in encoding defined in Type 1 Latin-text font programs (but
generally not in TrueType font programs).
It also says WinAnsiEncoding and MacRomanEncoding can be used with TrueType fonts. So should we avoid using the built-in or StandardEncoding for the standard 14 fonts? Its effects seem to be undefined. It seems Adobe Reader doesn't bother performing a proper mapping from glyph names to glyphs in the TrueType font being used.
Will providing a Differences array when using the Win or Mac encodings produce proper results? Since these map codepoints to Type1/Postscript glyphs names, there is no direct link to TrueType glyphs.
EDIT Mmm, I have a feeling the Font Descriptor Flags might be important for these standard fonts. I set the flags to 4 up to now for all fonts, which seemed to work fine for True/OpenType fonts.
Turns out the Flags in the FontDescriptor dictionary is important. For Times, the Nonsymbolic flag (bit 6) needs to be set. The fact that Times is actually being typeset using a TrueType font has nothing to do with it.
To use the built-in encoding of the font, the Encoding entry of the Type1 Font dictionary should not be set. You may only add an Encoding dictionary (with BaseEncoding omitted) if it contains a non-empty Differences array, or Adobe Reader will error.
With these precautions, the generated PDF displays correctly on all 9 viewer applications listed above.

Ghostscript PDF to PDF/A conversion font issues

I am exploring tools to convert PDF documents to PDF/A. Ghostscript seems to give out of the box support for such a conversion. One issue seems to be that some true type fonts that are a part of the original PDF document are not converted correctly. If I copy a text from the converted PDF/A document, and paste it in notepad, the copied text appears to be garbled text.
The original document text can be copied to notepad just fine.
I am using the following script:
gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf
I have uploaded a sample 1 page source PDF in Google Drive:
SampleInput
A sample output PDF/A document generated from the command is in Google drive here:
SampleOutput
Running the above query on this PDF in a windows machine will reproduce the issue.
Are there any settings / commands make the PDF/A conversion to be handled properly?
Copy and paste from a PDF is not guaranteed. Subset fonts will not have a usable Encoding (such as ASCII or UTF-8), in which case they will only be amenable to cut/paste/search if they have an associated ToUnicode CMap, many PDF files do not contain ToUnicode CMaps.
Of course, the PDF/A specification states (oddly in my opinion) that you should not use subset fonts, but its not always possible to tell whether a font is subset (not all creators follow the XXXXX+ convention), and even if the font isn't subset there still isn't any guarantee that its Encoding is one that is usable.
Looking at the file you have posted, it does not contain one of the fonts it uses (Arial,Bold) and so Ghostscript substitutes with DroidSansFallback, and the font it does contain (FreeSansBold) is a subset (FWIW this font doesn't actually seem to be used....). The fallback font is a CIDFont, so there is no real prospect of the text being 'correct'.
I believe that if you make a real font available to Ghostscript to replace Arial,Bold then it will probably work correctly. This would also fix the rather more obvious problem of the spacing of the characters being incorrect (in one place, wildly incorrect), which is caused by the fallback font having different widths to the original.
NB as the warning messages have already told you don't use -dUseCIEColor.
The fact that you cannot copy/paste/search a PDF does not mean that it is not a valid PDF/A-1b file though, so thsi does not mean that the creation (NOT conversion) of the PDF/A-1b is not 'proper'.

Ghostscript embedded fonts and substitution

I'm converting PDF to JPG with gs.
Does gs substitute embedded fonts? How exactly this works? Like if i embed all fonts that is used in PDF does gs still look for some substitution or can it use that embedded font data?
So does embedding fonts in PDF mean that all glyphs used in PDF with that font is being embedded and i don't need to have that font in my gs font path?
Thanks!
When you’re outputting a JPEG file, you’re in effect outputting an image. This means that Ghostscript renders the page as image, then compresses the image using JPEG (lossy – to prevent reduced legibility of the text, use a lossless compression format such as PNG instead; JPEG is basically only good for photography because lossless would be much too big there).
In a bitmap image, there are no fonts, only pixels – so, for text rendering (e.g. black text on a white page), Ghostscript will create a bitmap image consisting only of greyscale pixels (by means of anti-aliasing), then save that.
To be able to do that, Ghostscript must have access to the fonts at the time of PDF rendering and JPEG creation. This means that the fonts either must be installed on the system (and in your font path), or embedded in the PDF in the first place. They are not necessary to view the JPEG file.

Prevent Ghostscript from rasterizing text?

I'm trying to convert PDFs to PCL (using ghostscript, but I'd love to hear alternative suggestions), and every driver (ghostscript device), including all of the built-ins and gutenprint generate PCL files many times larger than the input PDF. (This is the problem - I need my PCL to be about as small as the input).
Given that the text doesn't show up in the PCL file, I guess that Ghostscript is rasterizing the text. Is there a way to prevent GS generally, or just gutenprint, from doing that? I'd rather either have it embed the fonts, or not even embed the fonts (leave it to the printer to render the fonts)?
Unfortunately, there doesn't seem to be any documentation on this point.
There are 3 (I think) types of font in PCL. There are rendered bitmaps, TrueType fonts (in later versions) and the HPGL stick font.
PDF and PostScript Have type 1, 2 (CFF), 3 and 42 (TrueType, but not the same as PCL) and CIDFonts based on any of the preceding types.
The only font type the two have in common is TrueType, so in order to retain text, any font which was not TrueType would have top be converted into TrueType. This is not a simple task. So Ghostscript simply renders the text, which is guaranteed to work.
PDF is, in general, a much richer format than PCL< there are many PDF constructs (fonts, shading, stroke/fill in a single operation, transparency) which cannot be represented in PCL. So its entirely possible that the increase in size is nothing to do with text and fonts.
In fact, I believe that the PXL drivers in Ghostscript simply render the entire page to a bitmap at the required resolution, and then wrap that up with enough PCL to be successfully sent to a printer. (I could be mistaken on this point though)
Basically, you are not going to get PCL of a similar size to your PDF out of Ghostscript.
Here is a way to 'prevent Ghostscript from rasterizing text'. But its output will be PostScript. You may however succeed convert this PostScript to a PCL5e in an additional step.
The method will convert all glyphs into outline shapes for its PostScript output, and it does not work for its PDF or PCL output. The key here is the -dNOCACHE parameter:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
Of course, converting font glyphs to outlines will take more space than keeping the original fonts embedded, because "fonts" are a space-optimized concept to store, retrieve and render glyph shapes.
Once you have this PostScript, you may be able to convert it to PCL5e with the help of either of the methods you tried before for PDF input (including {Apache?} FOP).
However, I have no idea if the output will be much smaller than versions with rasterized fonts (or even wholesome rasterized pages). But it may be worth a test.
Now vote down this answer too...
Update
Apparently, from version 9.15 (to be released during September/October 2014), Ghostscript will support a new command line parameter:
-dNoOutputFonts
which will cause the output devices pdfwrite, ps2write and eps2write to "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
That means that the above command should be replaced by this:
gs -o somepdf.ps -dNoOutputFonts -sDEVICE=ps2write somepdf.pdf
Caveats: I've tested this with a few input files using a self-compiled Ghostscript based on current Git sources. It worked flawlessly in each case.