Could you tell me how can I convert TIFF to PDF with using Ghostscript or Postscript?
I tried to use this command:
gswin32c.exe -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=o.pdf test.tif
But it doesn't work.
It produces an error:
GPL Ghostscript 9.06 (2012-08-08)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Error: /undefined in II*
Operand stack:
Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1910 1 3 %oparray_pop 1909 1 3 %oparray_pop 1893 1 3 %oparray_pop 1787 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval--
Dictionary stack:
--dict:1169/1684(ro)(G)-- --dict:0/20(G)-- --dict:77/200(L)--
Current allocation mode is local
Current file position is 4
GPL Ghostscript 9.06: Unrecoverable error, exit code 1
The libtiff software package (available on all major OS platforms) ships with a command line tool called tiff2pdf.
$ tiff2pdf -h
LIBTIFF, Version 4.0.3
Copyright (c) 1988-1996 Sam Leffler
Copyright (c) 1991-1996 Silicon Graphics, Inc.
usage: tiff2pdf [options] input.tiff
options:
-o: output to file name
-j: compress with JPEG
-z: compress with Zip/Deflate
-q: compression quality
-n: no compressed data passthrough
-d: do not compress (decompress)
-i: invert colors
-u: set distance unit, 'i' for inch, 'm' for centimeter
-x: set x resolution default in dots per unit
-y: set y resolution default in dots per unit
-w: width in units
-l: length in units
-r: 'd' for resolution default, 'o' for resolution override
-p: paper size, eg "letter", "legal", "A4"
-F: make the tiff fill the PDF page
-f: set PDF "Fit Window" user preference
-e: date, overrides image or current date/time default, YYYYMMDDHHMMSS
-c: sets document creator, overrides image software default
-a: sets document author, overrides image artist default
-t: sets document title, overrides image document name default
-s: sets document subject, overrides image image description default
-k: sets document keywords
-b: set PDF "Interpolate" user preference
-h: usage
So a simple command to get your PDF would be:
$ tiff2pdf -o output.pdf -p A4 -F test.tif
Ghostscript reads PDF and PostScript as input, it doesn't read image formats, and in particular doesn't read TIFF. However PostScript is a programming language, so it is entirely possible to write a PostScript program to read a TIFF file (the viewgif.ps and viewjpeg.ps program supplied with Ghostscript do this for GIF and JPEG formats)
I do have a program which does this, up to a point, and which has been posted on comp.lang.postscript a few times. Its kind of large to share here (33Kb) but I can email you a copy if you are interested.
Use the gdal_translate utility. It's designed for geospatial raster imagery, but it doesn't care if it's just a regular image.
gdal_translate -of pdf \path\to\someimage.tif test.pdf
Additional info on the geo-pdf driver and it's options:
http://www.gdal.org/frmt_pdf.html
The default compression applied is DEFLATE, which is good because it's lossless, but doesn't produce very small files. Usually using PREDICTOR and TILED options can increase compression (but not always, test with your data).
gdal_translate -of pdf ^
--config COMPRESS=DEFLATE --config PREDICTOR=2 --config TILED=YES ^
in.tif deflate.pdf
For smallest files use JPEG. For combined smallest and least lossy use JPEG2000, but test in client's pdf reader because support is not universal (recent Adobe Reader is fine).
gdal_translate -of pdf -co compress=jpeg -co jpeg_quality=85 ^
inimage.tif outdoc.pdf
gdal_translate -of pdf -co compress=jpeg2000 ...
-co is for brevity, it's interchangeable with --config.
Upper case in first example is convention only, on the command line it doesn't matter.
^ is Windows character to suppress linefeed, omit when all in one line.
Otaining pre-built binaries:
http://trac.osgeo.org/gdal/wiki/DownloadingGdalBinaries
Figuring out which distribution package to use will be a bit of a pain if you don't know anything about Geographic Information Systems. If all you want is to stuff the program somewhere and run it, grab "compiled binaries in single zip" from GIS Internals, 2015-Jan 32bit stable release here, unpack and start from SDKShell.bat.
The best tool I've used so far has been Irfanview. It allowed us to do a very obscure version of TIFF files, and also let us do mixed TIFF to PDF conversions where some pages where grayscale, some were color, and some were CCITT4 fax encoded.
I created a .NET program that converted millions of documents using Irfanview using this command line:
C:\Program Files (x86)\IrfanView\i_view32.exe TIFFFile.tiff /convert=OutputPDFFile.pdf /silent /cmdexit
The .NET program essentially just passed in the full paths to the TIFF input and PDF output files.
Setup
Download Irfanview 32bit. Not all plugins are available for 64-bit version, so I used 32-bit.
Then download the plugin package for Irfanview and install it.
Then open Irfanview, and you need to define the settings for either the IMPdf Plugin (legacy 32-bit only plugin) or the newer PDF Plugin by going to File>>Batch Conversion/Rename.
The settings in the Batch Conversion/Rename window stay in effect for every subsequent execution of the program as they are stored in the Irfanview INI file.
Click Output Format, then choose PDF.
Then click the OPTIONS button. That will allow you to control parameters on what will happen to the input TIFF file when it is converted to PDF.
Then try irfanview to do it manually for a few TIFF files to ensure the output is desired.
Then you can automate it using a program...
I Convert the Tiff file to Pdf File by use Imagick
code :
$document = new Imagick(test.tiff);
$document->setImageFormat("pdf");
$document->writeImages("test.pdf", true);
I Convert the PDF file to Tiff File by Ghostscript with Java on ubuntu.
snippets code :
String convertCommand = "gs -dNOPAUSE -q -sDEVICE=tiff24nc -sCompression=lzw -dBATCH -sOutputFile=" + outputFile + " " + sourceFile;
Runtime rt = Runtime.getRuntime();
Process pr = rt.exec(convertCommand);
pr.waitFor();
If you want compression then simply replace command to,
String convertCommand = "gs -dNOPAUSE -q -sDEVICE=tifflzw -dBATCH -sOutputFile=" + outputFile + " " + sourceFile;
Please install Ghostscript before using it,
1. sudo apt-get install ghostscript libtiff-tools
Related
I know that Adobe Acrobat Reader DC can select the Microsoft Print to PDF printer to output to a PDF file with Print As Image checked in the Advanced Print Setup dialog. However, I want to use a command to do this. I tried the following command, as a result it failed to convert each page to images (Note the output file is still PDF).
gs -o 0.999.watermask.compact.screen.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true -dPDFSETTINGS=/screen 0.999.watermask.pdf
References
7.4 PDF file output
iText 7 iText 7 for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring.
itext-rups-7.1.14.jar iText RUPS iText® 7.1.14 ©2000-2020 iText Group NV (AGPL-version)
Your -switches include -dDetectDuplicateImages=true which under the circumstances should be superfluous and the device selection can be from one of four as pointed out by KenS.
gs -o 0.999.watermask.compact.screen.pdf -sDEVICE=pdfimage32 -dPDFSETTINGS=/screen 0.999.watermask.pdf
If you want to emulate MS Print As Image PDF on Windows you would find the result in some ways inferior (and often many times bigger). But for comparison it would be,
NOTE:- "%%printer%%... is for a batch file for a command line use "%printer%...
gswin64c.exe -sDEVICE=mswinpr2 -dNoCancel -o "%%printer%%Microsoft Print to PDF" -dPDFSETTINGS=/screen -f "0.999.watermask.pdf"
I need to create a PDF/A from a Folder of Tiff Files.
Creating a PDF (1.5) is working with ImageMagick.
But Converting this PDF to a PDF/A using Ghostscript is a problem.
My GhostScript cmd:
-dPDFA=2 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o "C:\Temp\TestData\TIFF to PDF Imagemagick\pdfa.pdf" "C:\Temp\TestData\TIFF to PDF Imagemagick\PDFA_def.ps" -dPDFACompatibilityPolicy=1 "C:\Temp\TestData\TIFF to PDF Imagemagick\test.pdf"
Also tryed:
-dPDFA=2 -dBATCH -dNOPAUSE -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile="C:\Temp\TestData\TIFF to PDF Imagemagick\pdfa.pdf" "C:\Temp\TestData\TIFF to PDF Imagemagick\PDFA_def.ps" "C:\Temp\TestData\TIFF to PDF Imagemagick\test.pdf"
my PDFA_def.ps is the GS standard with:
/ICCProfile (AdobeRGB1998.icc) % Customise
The created PDF/? is not passing the "Verify compliance with PDF/A-2b" preflight in Adobe Acrobat:
Error
Metadata missing (XMP)
PDF/A entry missing
Syntax problem: Indirect object “endobj” keyword not preceded by an EOL marker
Syntax problem: Stream dictionary improperly formatted
Also not the https://www.pdf-online.com/osa/validate.aspx validator:
File pdfa.pdf
Compliance pdf1.5
Result Document does not conform to PDF/A.
Details
Validating file "pdfa.pdf" for conformance level pdf1.5
XML line 10:212: xmlParseCharRef: invalid xmlChar value 0.
The document does not conform to the requested standard.
The document's meta data is either missing or inconsistent or corrupt.
The document does not conform to the PDF 1.5 standard.
Done.
Also tryed VeraPDF ....
What kind of settings have I forgotten?
Well there's quite a few problems here.
You haven't said what version of Ghostscript you are using, nor have you supplied an example file to experiment with. You also haven't given the back channel output which might contain additional information.
You can't use the supplied model PFA_def.ps without modification, at the very least you need to modify the /ICCProfile entry to point to a real valid ICC profile. I suspect this has caused pdfwrite to abort PDF/A-2 production, which would normally be mentioned in the back channel output.
You haven't set -dColorConversionStrategy, just setting the ProcessColorModel is not sufficient, pdfwrite will mostly ignore that. If you don't tell pdfwrite that you want colours converted to a different space, it will preserve them unchanged, regardless of the Process color model.
With this command its now running:
-dPDFA=2 -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dNOPAUSE -dBATCH -o "C:\Temp\TestData\tiff2pdfa\pdfatest.pdf" "C:\Temp\TestData\tiff2pdfa\PDFA\PDFA_def.ps" "C:\Temp\TestData\tiff2pdfa\test.pdf"
Thanks to:
Batch Convert PDF to PDF/A - MARK BERRY
But i have still some Error:
GPL Ghostscript 9.25: UTF16BE text string detected in DOCINFO cannot be represented
in XMP for PDF/A 1, discarding DOCINFO
Processing pages 1 through 56.
Page 1
GPL Ghostscript 9.25: Setting Overprint Mode to 1
not permitted in PDF/A-2, overprint mode not set
Should I be thinking about this "Overpirnt Mode"?
I'm trying to render this pdf on Linux, but the produced png shows missing font blocks.
gs -sDEVICE=png16m -o test.png AGREE\ II\ lijst.pdf
The pdf was generated on a Mac (using its print to pdf feature), and contains embedded fonts:
$> pdffonts AGREE\ II\ lijst.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
EFPVPD+ArialMT TrueType MacRoman yes yes no 12 0
RFRCNT+TimesNewRomanPSMT TrueType MacRoman yes yes no 10 0
FLZUKV+Arial-BoldMT TrueType MacRoman yes yes no 11 0
VQHCHW+Arial-ItalicMT TrueType MacRoman yes yes no 13 0
evince and qpdfview show the pdf just fine, but printing fails, probably because my computer uses the same ghostscript to render the pdf to bitmap.
Help would be highly appreciated.
$> gs -help
GPL Ghostscript 9.22 (2017-10-04)
Copyright (C) 2017 Artifex Software, Inc. All rights reserved.
Usage: gs [switches] [file1.ps file2.ps ...]
Most frequently used switches: (you can use # in place of =)
-dNOPAUSE no pause after page | -q `quiet', fewer messages
-g<width>x<height> page size in pixels | -r<res> pixels/inch resolution
-sDEVICE=<devname> select device | -dBATCH exit after last file
-sOutputFile=<file> select output file: - for stdout, |command for pipe,
embed %d or %ld for page #
Input formats: PostScript PostScriptLevel1 PostScriptLevel2 PostScriptLevel3 PDF
Default output device: x11alpha
Available devices:
alc1900 alc2000 alc4000 alc4100 alc8500 alc8600 alc9100 ap3250 atx23
atx24 atx38 bbox bit bitcmyk bitrgb bitrgbtags bj10e bj10v bj10vh bj200
bjc600 bjc800 bjc880j bjccmyk bjccolor bjcgray bjcmono bmp16 bmp16m
...
plibk plibm png16 png16m png256 png48 pngalpha pnggray pngmono pngmonod
pnm pnmraw ppm ppmraw pr1000 pr1000_4 pr150 pr201 ps2write psdcmyk
psdcmykog psdrgb pwgraster pxlcolor pxlmono r4081 rinkj rpdl samsunggdi
...
xpswrite
Search path:
/usr/share/ghostscript/9.22/Resource/Init :
/usr/share/ghostscript/9.22/lib :
/usr/share/ghostscript/9.22/Resource/Font :
/usr/share/ghostscript/fonts : /usr/share/fonts/Type1 : /usr/share/fonts
Ghostscript is also using fontconfig to search for font files
For more information, see /usr/share/ghostscript/9.22/doc/Use.htm.
Please report bugs to bugs.ghostscript.com.
I even have the fonts installed on my computer:
$> fc-list : family | egrep -i arial\|times
Times New Roman
Times
Arial
Ghostscript 9.22 built from the tagged release source (available at the bottom of the table here currently) on Fedora 23, renders the file as expected for me.
So it seems likely that its not 'gs9.22 on Linux' its 'the build of Ghostscript 9.22 on my flavour of Linux, compiled by the package maintainer on the Linux distribution I'm using' which is most probably at fault.
There could be many reasons for this. Firstly the various distributions apply their own patches to the Ghostscript source tree. Secondly, the packagers insist on using system shared libraries, instead of the 3rd party library sources shipped with the Ghostscript source. We, the Ghostscript development team, know those work, because that's what we test. We can't test every possible version (and patch) of all the system libraries shipped with every flavour of Linux and you may have hit some kind of incompatbility (most likely with FreeType since its fonts).
Note that what you are seeing there are not 'missing font blocks' but missing glyphs. That's the /.notdef glyph which is used to render in place of a glyph when that glyph is not present in the font. You haven't given the back channel output from Ghostscript when rendering the file, if the fonts were missing then there would be some warnings about that.
I would suggest that you fetch the Ghostscript source, either the tarball from the link above, or use Git and get the latest sources. Build it yourself (you will need gcc, make and autotools installed) and test that. If that doesn't work, then please file a bug report here Please be sure to describe how you built Ghostscript, specify the command line you used, and attach the offending file to the bug report.
There is a pre-built Linux binary at the same location, and you could try that, but there's a reasonable chance that it simply won't work on your Linux.
By the way, having the fonts available on your system won't help at all, because the fonts in the PDF file have been subset and re-encoded. There's essentially no way to use the system fonts to replace the ones embedded in the PDF file, the interpreter must use the embedded fonts.
In brief, I'm dealing with a problematic PDF, which:
Cannot be fully rendered in a document viewer like evince, because of missing font information;
However - ghostscript can fully render the same PDF.
Thus -- regardless of what ghostscript uses to fill in the blanks (maybe fallback glyphs, or a different method to accessing fonts) -- I'd like to be able to use ghostscript to produce ("distill") an output PDF, where pretty much nothing will be changed, except font information added, so evince can render the same document in the same manner as ghostscript can.
My question is thus - is this possible to do at all; and if so, what would be command line be to achieve something like that?
Many thanks in advance for any answers,
Cheers!
Details:
I'm actually on an older Ubuntu 10.04, and I might be experiencing - not a bug - but an installation problem with evince (lack of poppler-data package), as noted in Bug #386008 “Some fonts fail to display due to “Unknown font tag...” : Bugs : “poppler” package : Ubuntu.
However, that is exactly what I'd like to handle, so I'll use the fontspec.pdf attached to that post ("PDF triggering the bug.", // v.) to demonstrate the problem.
evince
First, I open this pdf's page 3 in evince; and evince complains:
$ evince --page-label=3 fontspec.pdf
Error: Missing language pack for 'Adobe-Japan1' mapping
Error: Unknown font tag 'F5.1'
Error (7597): No font in show
Error: Unknown font tag 'F5.1'
Error (7630): No font in show
Error: Unknown font tag 'F5.1'
Error (7660): No font in show
Error: Unknown font tag 'F5.1'
...
The rendering looks like this:
... and it is obvious that some font shapes are missing.
Adobe acroread
Just a note on how Adobe's Acrobat Reader for Linux behaves; the following command line:
$ ./Adobe/Reader9/bin/acroread /a "page=3" fontspec.pdf
... generates no output to terminal whatsoever (for more on /a switch, see Man page acroread) -- and the program has absolutely no problem displaying the fonts.
Also, while I'd like to avoid the roundtrip to postscript - however, note that acroread itself can be used to convert a PDF to postscript:
$ ./Adobe/Reader9/bin/acroread -v
9.5.1
$ ./Adobe/Reader9/bin/acroread -toPostScript \
-rotateAndCenter -choosePaperByPDFPageSize \
-start 3 -end 3 \
-level3 -transQuality 5 \
-optimizeForSpeed -saveVM \
fontspec.pdf ./
Again, the above command line will generate no output to terminal; -optimizeForSpeed -saveVM are there because apparently they deal with fonts; the last argument ./ is the output directory (output file is automatically called fontspec.ps).
Now, evince can display the previously missing fonts in the fontspec.ps output - but again complains:
$ evince fontspec.ps
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
...
... and furthermore, all text seems to be flattened to curves in the postscript - so now one cannot select the text in the .ps file in evince anymore (note that the .ps file cannot be opened in acroread). However, one can convert this .ps back into .pdf again:
$ pstopdf fontspec.ps # note, `pstopdf` has no output filename option;
# it will automatically choose 'fontspec.pdf',
# and overwrite previous 'fontspec.pdf' in
# the same directory
... and now the text in the output of pstopdf is selectable in evince, all fonts are there, and evince doesn't complain anymore. However, as I noted, I'd like to avoid roundtrip to postscript files altogether.
display (from imagemagick)
We can also observe the page in the same document with imagemagicks display (note that image panning from the commandline using 'display' is apparently still not available, so I've used -crop below to adjust the viewport):
$ display -density 150 -crop 740x450+280+200 fontspec.pdf[2]
**** Warning: considering '0000000000 00000 n' as a free entry.
...
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
... which generates some ghostscripish errors - and results with something like this:
... where it's obvious that the missing fonts that evince couldn't render, are now shown here, with imagemagicks display, properly.
ghostscript
Finally, we can use ghostscript as x11 viewer itself -- to observe the same page, same document:
$ gs -sDevice=x11 -g740x450 -r150x150 -dFirstPage=3 \
-c '<</PageOffset [-120 520]>> setpagedevice' \
-f fontspec.pdf
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 74.
Page 3
>>showpage, press <return> to continue<<
^C
... and results with this output:
In conclusion: ghostscript (and apparently by extension, imagemagick) can seemingly find the missing font (or at least some replacement for it), and render a page with that -- even if evince fails at that for the same document.
I would, therefore, simply like to export a PDF version from ghostscript, that would have only the missing fonts embedded, and no other processing; so I try this:
$ gs -dBATCH -dNOPAUSE -dSAFER \
-dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
-dAutoFilterMonoImages=false \
-dAutoFilterGrayImages=false \
-dAutoFilterColorImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-sDEVICE=pdfwrite \
-dFirstPage=3 -dLastPage=3 \
-sOutputFile=mypg3out.pdf -f fontspec.pdf
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 3.
Page 3
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
... but it doesn't work - the output file mypg3out.pdf suffers from the exact same problems in evince as noted previously.
Note: While I'd like to avoid the postscript roundtrip, a good example of gs command line with from pdf to ps with font embedding is here: (#277826) pdf - How to make GhostScript PS2PDF stop subsetting fonts; but the same command line switches for .pdf to .pdf to not seem to have any effect on the problem described above.
OK point 1; you CANNOT use Ghostscript and pdfwrite to create a PDF file 'without any additional processing'.
The way that pdfwrite and Ghostscript work is to fully interpret the incoming data (PostScript, PDF, XPS, PCL, whatever), creating a series of graphics primitives, which are passed to the pdfwrite device. The pdfwrite device then reassembles these into a brand new PDF file.
So its not possible to take a PDF file as input and manipulate it, it will always create a new file.
Now, I would suggest that you upgrade your 9.02 Ghostscript to 9.05 to start with. Missing CIDFonts are much better handled in 9.05 (and will be further improved in 9.06 later this year). (The font you are missing 'Osaka Mono' is in fact a CIDFont, not a regular font)
Using the current bleeding edge Ghostscript code produces a PDF file for me which has the missing font embedded. I can't tell if this will work for you because my copy of evince renders the original file perfectly well.
Added later
Examining the original PDF file I see that the fonts there are indeed embedded (as I would expect, since they are subsets). So in fact as you say in your own answer above, the problem is not font embedding, but the use of CIDFonts.
My answer here will not help you, as pdfwrite will still produce a CIDFont in the output. Basically this is a flaw in your version or installation of evince.
The problem with 'remapping' the characters is that a font is limited to 256 glyphs, while a CIDFont has effectively no limit. So there is no way to put a CIDFont into a Font. The only way to do this would be to create multiple Fonts each of which contained a portion of the original, and then switch between them as required. Slow and klunky.
If you convert to PostScript using the ps2write device then it will do this for you, but you stand a great risk that in the process it will convert the vector glyph data into bitmaps, which will not scale well.
Fundamentally you can't really achieve what you want to do (convert 1 CIDFont into N regular Fonts) with Ghostscript, or in fact with any other tool that I know of. While its technically possible, there is no real point since all PDF consumers should be able to handle CIDFonts. If they can't then its a bug in the PDF consumer.
Right, I got a bit further on this (but not completely) - so I'll post a partial answer/comment here.
Essentially, this is not a problem about font embedding in PDF - this is a problem with font mapping.
To show that, let's analyse the mypg3out.pdf, which was extracted by gs in the OP (from the 3rd page of the fontspec.pdf document):
$ pdffonts mypg3out.pdf
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Error: Missing language pack for 'Adobe-Japan1' mapping
CAAAAA+Osaka-Mono-Identity-H CID TrueType yes yes yes 19 0
GBWBYF+CMMI9 Type 1C yes yes yes 28 0
FDFZUN+Skia-Regular_wght13333_wdth11999 TrueType yes yes yes 16 0
ZRLTKK+Optima-Regular TrueType yes yes yes 30 0
ZFQZLD+FPLNeu-Bold Type 1C yes yes yes 8 0
DDRFOG+FPLNeu-Italic Type 1C yes yes no 22 0
HMZJAO+FPLNeu-Regular Type 1C yes yes yes 10 0
RDNKXT+FPLNeu-Regular Type 1C yes yes yes 32 0
GBWBYF+Skia-Regular_wght13333_wdth11999 TrueType yes yes no 26 0
As the output shows - all fonts are, indeed, embedded; so something else is the problem. (It would have been more difficult to observe this in the complete fontspec.pdf, as there are a ton of fonts there, and a ton of error messages.)
The crucial point (I think) here, is that there is:
only one "Error: Missing language pack for 'Adobe-Japan1' mapping" message; and
only one CID TrueType font, which is CAAAAA+Osaka-Mono-Identity-H
There seems to be an obvious relationship between the CID TrueType and the 'Adobe-Japan1' mapping error; and I got that finally clarified by CID fonts - How to use Ghostscript:
CID fonts are PostScript resources containing a large number of glyphs (e.g. glyphs for Far East languages, Chinese, Japanese and Korean). Please refer to the PostScript Language Reference, third edition, for details.
CID font resources are a different kind of PostScript resource from fonts. In particular, they cannot be used as regular fonts. CID font resources must first be combined with a CMap resource, which defines specific codes for glyphs, before it can be used as a font. This allows the reuse of a collection of glyphs with different encodings.
All good - except here we're dealing with PDF fonts, not PostScript fonts as such; let's demonstrate that a bit.
For instance, 5.3. Using Ghostscript To Preview Fonts - Making Fonts Available To Ghostscript - Font HowTo describes how the Ghostscript-installed script called prfont.ps can be used to render a table of fonts.
However, here it would be easier with just Listing Ghostscript Fonts [gs-devel], and using resourcestatus operator to query for a specific font - which doesn't require a special .ps script:
$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c 'currentpagedevice (*) {=} 100 string /Font resourceforall'
...
Processing pages 1 through 1.
Page 1
URWAntiquaT-RegularCondensed
Palatino-Italic
Hershey-Gothic-Italian
...
$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c '/TimesNewRoman findfont pop [/TimesNewRoman /Font resourcestatus]'
....
Processing pages 1 through 1.
Page 1
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Loading TimesNewRomanPSMT font from /usr/share/fonts/truetype/msttcorefonts/times.ttf... 2549340 1142090 3496416 1237949 1 done.
We got a list of fonts; however, those are system fonts available to ghostscript - not the fonts embedded in the PDF!
(Basically,
gs -o /dev/null -dNODISPLAY -f mypg3out.pdf -c 'currentpagedevice (*) {=} 100 string /Font resourceforall' | grep -i osaka will return nothing, and
-c '/CAAAAA+Osaka-Mono-Identity-H findfont pop [/CAAAAA+Osaka-Mono-Identity-H /Font resourcestatus]' will conclude with "Didn't find this font on the system! Substituting font Courier for CAAAAA+Osaka-Mono-Identity-H.")
To list the fonts in the PDF, the pdf_info.ps script file from Ghostscript (not installed, in sources) can be used:
$ wget "http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/toolbin/pdf_info.ps" -O pdf_info.ps
$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsNeeded pdf_info.ps
...
No system fonts are needed.
$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsUsed -dShowEmbeddedFonts pdf_info.ps
...
Font or CIDFont resources used:
CAAAAA+Osaka-Mono
DDRFOG+FPLNeu-Italic
FDFZUN+Skia-Regular_wght13333_wdth11999
GBWBYF+CMMI9
GBWBYF+Skia-Regular_wght13333_wdth11999
GTIIKZ+Osaka-Mono
HMZJAO+FPLNeu-Regular
RDNKXT+FPLNeu-Regular
ZFQZLD+FPLNeu-Bold
ZRLTKK+Optima-Regular
So finally we can observe the CAAAAA+Osaka-Mono in Ghostscript - although I wouldn't know how to query more specific information about it from within ghostscript.
In the end, I guess my question boils down to: how could ghostscript be used, to map the glyphs from a CID embedded font - into a font with a different "encoding" (or "character map"?), which will not require additional language files?
Addendum
I have also experimented with these approaches:
pdffonts on the output here will not have the Osaka-Mono listed, but it will still complain "Error: Missing language pack for 'Adobe-Japan1' mapping": $ wget http://whalepdfviewer.googlecode.com/svn/trunk/cmaps/japanese/Adobe-Japan1-UCS2
$ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH -f mypg3out.pdf Adobe-Japan1-UCS2
same as previously - this (via Ghostscript's "Use.htm") also makes Osaka-Mono disappear from pdffonts list: gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
-c '/CIDSystemInfo << /Registry (Adobe) /Ordering (Unicode) /Supplement 1 >>' \
-f mypg3out.pdf
this crashes with Error: /undefinedresource in findresource: gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
-c '/Osaka-Mono-Identity-H /H /CMap findresource [/Osaka-Mono-Identity /CIDFont findresource] == ' \
-f mypg3out.pdf
Note finally that some of the .ps scripts ghostscript installs, it may use automatically; for instance, you can find gs_ttf.ps:
$ locate gs_ttf.ps
/usr/share/ghostscript/9.02/Resource/Init/gs_ttf.ps
... and then using sudo nano locate gs_ttf.ps, you can add the statement (Hello from gs_ttf.ps\n) print at the beginning of the code; then whenever one of the above gs commands is called, the printout will be visible in stdout.
References
Adding your own fonts - Fonts and font facilities supplied with Ghostscript
About "CIDFnmap" of Ghostscript - Features to support CJK CID-keyed in Ghostscript
Bug 689538 – GhostScript can not handle an embedded TrueType CID-Font
Bug 692589 – "Error CIDSystemInfo and CMap dict not compatible" when converting merged file to PDF/A - #1522
Adobe Forums: CMap resources versus PDF mapping resources: Please keep in mind that a CMap resource unidirectionally maps character codes to CIDs. Those other resources that Acrobat uses are best referred to as PDF mapping resources. Among them, there is a special category called ToUnicode mapping resources that unidirectionally map CIDs to UTF-16BE character codes
Adobe CIDs and glyphs in CJK TrueType font
Ghostscript and Japanese TrueType font
Installation guide: GS and CID font
Debian -- Filelist of package poppler-data/sid/all
I'm aware of the pdftk.exe utility that can indicate which fonts are used by a PDF, and wether they are embedded or not.
Now the problem: given I had PDF files with embedded fonts -- how can I extract those fonts in a way that they are re-usable as regular font files? Are there (preferably free) tools which can do that? Also: can this be done programmatically with, say, iText?
You have several options. All these methods work on Linux as well as on Windows or Mac OS X. However, be aware that most PDFs do not include to full, complete fontface when they have a font embedded. Mostly they include just the subset of glyphs used in the document.
Using pdftops
One of the most frequently used methods to do this on *nix systems consists of the following steps:
Convert the PDF to PostScript, for example by using XPDF's pdftops (on Windows: pdftops.exe helper program.
Now fonts will be embedded in .pfa (PostScript) format + you can extract them using a text editor.
You may need to convert the .pfa (ASCII) to a .pfb (binary) file using the t1utils and pfa2pfb.
In PDFs there are never .pfm or .afm files (font metric files) embedded (because PDF viewer have internal knowledge about these). Without these, font files are hardly usable in a visually pleasing way.
Using fontforge
Another method is to use the Free font editor FontForge:
Use the "Open Font" dialogbox used when opening files.
Then select "Extract from PDF" in the filter section of dialog.
Select the PDF file with the font to be extracted.
A "Pick a font" dialogbox opens -- select here which font to open.
Check the FontForge manual. You may need to follow a few specific steps which are not necessarily straightforward in order to save the extracted font data as a file which is re-usable.
Using mupdf
Next, MuPDF. This application comes with a utility called pdfextract (on Windows: pdfextract.exe) which can extract fonts and images from PDFs. (In case you don't know about MuPDF, which still is relatively unknown and new: "MuPDF is a Free lightweight PDF viewer and toolkit written in portable C.", written by Artifex Software developers, the same company that gave us Ghostscript.)
(Update: Newer versions of MuPDF have moved the former functionality of 'pdfextract' to the command 'mutool extract'. Download it here: mupdf.com/downloads)
Note: pdfextract.exe is a command-line program. To use it, do the following:
c:\> pdfextract.exe c:\path\to\filename.pdf # (on Windows)
$> pdfextract /path/tofilename.pdf # (on Linux, Unix, Mac OS X)
This command will dump all of the extractable files from the pdf file referenced into the current directory. Generally you will see a variety of files: images as well as fonts. These include PNG, TTF, CFF, CID, etc. The image names will be like img-0412.png if the PDF object number of the image was 412. The fontnames will be like FGETYK+LinLibertineI-0966.ttf, if the font's PDF object number was 966.
CFF (Compact Font Format) files are a recognized format that can be converted to other formats via a variety of converters for use on different operating systems.
Again: be aware that most of these font files may have only a subset of characters and may not represent the complete typeface.
Update: (Jul 2013) Recent versions of mupdf have seen an internal reshuffling and renaming of their binaries, not just once, but several times. The main utility used to be a 'swiss knife'-alike binary called mubusy (name inspired by busybox?), which more recently was renamed to mutool. These support the sub-commands info, clean, extract, poster and show. Unfortunatey, the official documentation for these tools isn't up to date (yet). If you're on a Mac using 'MacPorts': then the utility was renamed in order to avoid name clashes with other utilities using identical names, and you may need to use mupdfextract.
To achieve the (roughly) equivalent results with mutool as its previous tool pdfextract did, just run mubusy extract ....*
So to extract fonts and images, you may need to run one of the following commandlines:
c:\> mutool.exe extract filename.pdf # (on Windows)
$> mutool extract filename.pdf # (on Linux, Unix, Mac OS X)
Downloads are here: mupdf.com/downloads
Using gs (Ghostscript)
Then, Ghostscript can also extract fonts directly from PDFs. However, it needs the help of a special utility program named extractFonts.ps, written in PostScript language, which is available from the Ghostscript source code repository.
Now use it, you need to run both, this file extractFonts.ps and your PDF file. Ghostscript will then use the instructions from the PostScript program to extract the fonts from the PDF. It looks like this on Windows (yes, Ghostscript understands the 'forward slash', /, as a path separator also on Windows!):
gswin32c.exe ^
-q -dNODISPLAY ^
c:/path/to/extractFonts.ps ^
-c "(c:/path/to/your/PDFFile.pdf) extractFonts quit"
or on Linux, Unix or Mac OS X:
gs \
-q -dNODISPLAY \
/path/to/extractFonts.ps \
-c "(/path/to/your/PDFFile.pdf) extractFonts quit"
I've tested the Ghostscript method a few years ago. At the time it did extract *.ttf (TrueType) just fine. I don't know if other font types will also be extracted at all, and if so, in a re-usable way. I don't know if the utility does block extracting of fonts which are marked as protected.
Using pdf-parser.py
Finally, Didier Stevens' pdf-parser.py: this one is probably not as easy to use, because you need to have some know-how about internal PDF structures. pdf-parser.py is a Python script which can do a lot of other things too. It can also decompress and extract arbitrary streams from objects, and therefore it can extract embedded font files too.
But you need to know what to look for. Let's see it with an example. I have a file named big.pdf. As a first step I use the -s parameter to search the PDF for any occurrence of the keyword FontFile (pdf-parser.py does not require a case sensitive search):
pdf-parser.py -s fontfile big.pdf
In my case, for my big1.pdf, I get this result:
obj 9 0
Type: /FontDescriptor
Referencing: 15 0 R
<<
/Ascent 728
/CapHeight 716
/Descent -210
/Flags 32
/FontBBox [ -665 -325 2000 1006 ]
/FontFile2 15 0 R
/FontName /ArialMT
/ItalicAngle 0
/StemV 87
/Type /FontDescriptor
/XHeight 519
>>
obj 11 0
Type: /FontDescriptor
Referencing: 16 0 R
<<
/Ascent 728
/CapHeight 716
/Descent -210
/Flags 262176
/FontBBox [ -628 -376 2000 1018 ]
/FontFile2 16 0 R
/FontName /Arial-BoldMT
/ItalicAngle 0
/StemV 165
/Type /FontDescriptor
/XHeight 519
>>
It tells me that there are two instances of FontFile2 inside the PDF, and these are in PDF objects no. 15 and no. 16, respectively. Object no. 15 holds the /FontFile2 for font /ArialMT, object no. 16 holds the /FontFile2 for font /Arial-BoldMT.
To show this more clearly:
pdf-parser.py -s fontfile big1.pdf | grep -i fontfile
/FontFile2 15 0 R
/FontFile2 16 0 R
A quick peeking into the PDF specification reveals the the keyword /FontFile2 relates to a 'stream containing a TrueType font program' (/FontFile would relate to a 'stream containing a Type 1 font program' and /FontFile3 would relate to a 'stream containing a font program whose format is specified by the Subtype entry in the stream dictionary' {hence being either a Type1C or a CIDFontType0C subtype}.)
To look specifically at PDF object no. 15 (which holds the font /ArialMT), one can use the -o 15 parameter:
pdf-parser.py -o 15 big1.pdf
obj 15 0
Type:
Referencing:
Contains stream
<<
/Length1 778552
/Length 1581435
/Filter /ASCIIHexDecode
>>
This pdf-parser.py output tells us that this object contains a stream (which it will not directly display) that has a length of 1.581.435 Bytes and is encoded ( == "compressed") with ASCIIHexEncode and needs to be decoded ( == "de-compressed" or "filtered") with the help of the standard /ASCIIHexDecode filter.
To dump any stream from an object, pdf-parser.py can be called with the -d dumpname parameter. Let's do it:
pdf-parser.py -o 15 -d dumped-data.ext big1.pdf
Our extracted data dump will be in the file named dumped-data.ext. Let's see how big it is:
ls -l dumped-data.ext
-rw-r--r-- 1 kurtpfeifle staff 1581435 Apr 11 00:29 dumped-data.ext
Oh look, it is 1.581.435 Bytes. We saw this figure in the previous command's output. Opening this file with a text editor confirms that its content is ASCII hex encoded data.
Opening the file with a font reading tool like otfinfo (this is a part of the lcdf-typetools package) will lead to some disappointment at first:
otfinfo -i dumped-data.ext
otfinfo: dumped-data.ext: not an OpenType font (bad magic number)
OK, this is because we did not (yet) let pdf-parser.py make use of its full magic: to dump a filtered, decoded stream. For this we have to add the -f parameter:
pdf-parser.py -o 15 -f -d dumped-data-decoded.ext big1.pdf
What's the size is this new file?
ls -l dumped-data-decoded.ext
-rw-r--r-- 1 kurtpfeifle staff 778552 Apr 11 00:39 dumped-data-decoded.ext
Oh, look: that exact number was also already stored in the PDF object no. 15 dictionary as the value for key /Length1...
What does file think it is?
file dumped-data-decoded.ext
dumped-data-decoded.ext: TrueType font data
What does otfinfo tell us about it?
otfinfo -i dumped-data-decoded.ext
Family: Arial
Subfamily: Regular
Full name: Arial
PostScript name: ArialMT
Version: Version 5.10
Unique ID: Monotype:Arial Regular:Version 5.10 (Microsoft)
Designer: Monotype Type Drawing Office - Robin Nicholas, Patricia Saunders 1982
Manufacturer: The Monotype Corporation
Trademark: Arial is a trademark of The Monotype Corporation.
Copyright: © 2011 The Monotype Corporation. All Rights Reserved.
License Description: You may use this font to display and print content as permitted by
the license terms for the product in which this font is included.
You may only (i) embed this font in content as permitted by the
embedding restrictions included in this font; and (ii) temporarily
download this font to a printer or other output device to help
print content.
Vendor ID: TMC
So Bingo!, we have a winner: pdf-parser.py did indeed extract a valid font file for us. Given the size of this file (778.552 Bytes), it looks like this font had been embedded even completely in the PDF...
We could rename it to arial-regular.ttf and install it as such and happily make use of it.
Caveats:
In any case you need to follow the license that applies to the font. Some font licences do not allow free use and/or distribution. Pirating fonts is like pirating any software or other copyrighted material.
Most PDFs which are in the wild out there do not embed the full font anyway, but only subsets. Extracting a subset of a font is only useful in a very limited scope, if at all.
Please do also read the following about Pros and (more) Cons regarding font extraction efforts:
http://typophile.com/node/34377 — not available anymore, but can bee seen on Wayback Machine at https://web.archive.org/web/20110717120241/typophile.com/node/34377
Use online service http://www.extractpdf.com. No need to install anything.
Even though this question is 10 years old, it is still valid and as technology changes so does a valid answer.
In searching the current answers noticed none of them note WOFF (Web Open Font Format) (W3C) (Wikipedia) which can be used to recreate the individual characters (glyphs) and display them in a web page accurately.
Using the free online web page by IDR Solutions, PDF to HTML5 (link), convert a PDF to a zip file. In the resulting zip will be a font directory of woff file types. Current Internet browsers support woff files if you were not aware. (reference) These can be examined at the online site FontDrop! (link).
WOFF files can be converted to/from OTF or TTF at WOFFer – WOFF font converter
Also the zip file from PDF to HTML5 will contain an HTML file for each page of the PDF that can be opened in an Internet browser and is one of the best and most accurate PDF translations I have found or seen.
Eventually found the FontForge Windows installer package and opened the PDF through the installed program. Worked a treat, so happy.
http://www.verypdf.com/app/pdf-font-extractor/pdf-font-extracting-tool.html
IMO easiest way to extract fonts (Windows).
PDF2SVG version 6.0 from PDFTron does a reasonable job. It produces OpenType (.otf) fonts by default. Use --preserve_fontnames to preserve "the font/font-family naming scheme as obtained from the source file."
PDF2SVG is a commercial product, but you can download a free demo executable (which includes watermarks on the SVG output but doesn't otherwise restrict usage). There may be other PDFTron products that also extract fonts, but I only recently discovered PDF2SVG myself.
One of the best online tools currently available to extract pdf fonts is http://www.pdfconvertonline.com/extract-pdf-fonts-online.html
This is a followup to the font-forge section of #Kurt Pfeifle's answer, specific to Red Hat (and possibly other Linux distros).
After opening the PDF and selecting the font you want, you will want to select "File -> Generate Fonts..." option.
If there are errors in the file, you can choose to ignore them or save the file and edit them. Most of the errors can be fixed automatically if you click "Fix" enough times.
Click "Element -> Font Info...", and "Fontname", "Family Name" and "Name for Humans" are all set to values you like. If not, modify them and save the file somewhere. These names will determine how your font appears on the system.
Select your file name and click "Save..."
Once you have your TTF file, you can install it on your system by
Copying it to folder /usr/share/fonts (as root)
Running fc-cache -f /usr/share/fonts/ (as root)