Ghostscript should not embed fonts but only list a substitute - pdf

I have a PDF-generating pipeline where Ghostscript (on Linux) gets called at the end to turn PostScript input into a PDF. The PDF must be as small as possible, so the general command line used is
ps2pdf13 -dSAFER -dPDFSETTINGS=/default -dEmbedAllFonts=false -dNoOutputFonts -dFastWebView infile outfile
That generates nice PDF files without fonts included, as wanted; the assumption is that the target system should then substitute whatever fonts it has available. Yes, this can mean that different systems use slightly different fonts and thus render slightly differently.
This mostly works: there are 7 different fonts listed in the PDF's properties, and everything renders nicely on Linux.
On Windows, Acrobat Reader complains about one of them being missing and then doesn't render any of that font's characters.
I know I can let gs embed the fonts, except that increases the PDF size by 50%. I would like to avoid that (while it is only around 6000 bytes per file, this multiplies by approximately 30000 runs, and as such it does count).
I would love to have a way to "embed" in the PDF a hint like "for the font Helvetica-Narrow just use Arial Narrow" (or similar).
Does that exist?
[Edit]
Sorry for the late reply, busy. :(
Well, ok. I was thinking of a list of possible options for font selection; coming from that angle, the question may have gone in the wrong direction.
The options, by the way, do produce different sizes, though it seems to be -dEmbedAllFonts that is responsible for the size difference; -dNoOutputFonts doesn't actually seem to have any effect.
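The size effect of the two switches can be checked in isolation, e.g. (a sketch, assuming in.ps is a representative input file):
# compare output sizes with and without font embedding
ps2pdf13 -dSAFER -dEmbedAllFonts=true in.ps out_embed.pdf
ps2pdf13 -dSAFER -dEmbedAllFonts=false in.ps out_noembed.pdf
ls -l out_embed.pdf out_noembed.pdf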
I have to compare against a (very old) Distiller, which we are trying to replace. Using pdffonts, I get the following tables:
ps2pdf:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Helvetica-Narrow                     Type 1            Custom           no  no  no      11  0
Helvetica-Bold                       Type 1            Custom           no  no  no       9  0
Helvetica-Narrow-Bold                Type 1            WinAnsi          no  no  no      13  0
Courier                              Type 1            Custom           no  no  no      15  0
Courier-Bold                         Type 1            Standard         no  no  no      10  0
Helvetica                            Type 1            Custom           no  no  no       8  0
Times-Italic                         Type 1            Standard         no  no  no      21  0
distiller:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Helvetica                            Type 1            Custom           no  no  no       4  0
Helvetica-Bold                       Type 1            Custom           no  no  no       5  0
Courier                              Type 1            Custom           no  no  no       6  0
Courier-Bold                         Type 1            Custom           no  no  no       7  0
Helvetica-Narrow                     Type 1            Custom           no  no  no       8  0
Helvetica-Narrow-Bold                Type 1            Custom           no  no  no       9  0
Times-Italic                         Type 1            Custom           no  no  no      15  0
With the ps2pdf-created PDF file, Acrobat Reader complains "Font Helvetica-Narrow cannot be found". The Distiller one works.
I don't get it. It's the same list, at least for that font.
And obviously the output then looks crap.
One solution is to embed fonts. Then the font list turns into
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
XVQNWP+Helvetica-Narrow              Type 1C           Custom           yes yes no      11  0
Helvetica-Bold                       Type 1            Custom           no  no  no       9  0
LBTZEH+Helvetica-Narrow-Bold         Type 1C           WinAnsi          yes yes no      13  0
Courier                              Type 1            Custom           no  no  no      15  0
Courier-Bold                         Type 1            Standard         no  no  no      10  0
Helvetica                            Type 1            Custom           no  no  no       8  0
Times-Italic                         Type 1            Standard         no  no  no      21  0
and the file size goes up a lot, which we want to avoid. Distiller shows it's possible, but not how.

No, you cannot define a substitute font for a missing one; that is entirely at the discretion of the viewer. How would it help anyway? If the substitute you define isn't available to the viewer, then it would have to fall back to its own substitution anyway, or fail altogether.
A few comments on your command line:
If you are using -dNoOutputFonts then your PDF file should not contain any fonts, or font references, at all. It would also be (considerably) larger than merely disabling font embedding, and possibly larger than the same PDF with subset fonts embedded, because all the text will be included as path data; for even moderate amounts of text the repetition of the path data will exceed the font size.
It's hard to see how you are managing to produce a file which ends up referencing fonts but doesn't include them.
You don't need to specify -dPDFSETTINGS=/default because that is the default...
If you want a smaller file, do not specify -dFastWebView; that produces a linearised PDF file, which is larger (because of the format) than a non-linearised file. Very few viewers honour it; even those that do can only accelerate the first page view, and if the file is very small it's pointless, since the entire file will arrive as fast as the early portion of the linearised file.
Forcing the version to 1.3 will likely make the file size larger too, at least in the future.
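Taken together, a trimmed command line following these comments might look like the sketch below (whether to keep each switch is of course your call):
# drop -dPDFSETTINGS=/default (the default anyway), -dNoOutputFonts and
# -dFastWebView; keep -dEmbedAllFonts=false to suppress font embedding;
# plain ps2pdf instead of ps2pdf13 avoids forcing the version to 1.3
ps2pdf -dSAFER -dEmbedAllFonts=false infile outfile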

Related

Why can't I convert certain TIF files that I received in a split archive?

I received a large number of document files, where each document has its own split archive for each page (i.e. file1.001, file1.002, file2.001, file3.001). These are meant to be TIF files that can easily be combined and converted into PDF documents.
However, some of these files will not convert through ImageMagick. Some can simply be converted using a different program, which works fine, but there are some files where this doesn't work either. I tried converting them to .jpg and then to .tif, but they won't convert to .jpg. Things got weird when I converted them to .png, as some of these files produced multiple output files.
This is hard to explain, but I'll try to give an example: file1.001 and file1.002 both have the same image present when converted to .tif and opened. However, when either of the .tif documents is converted to a .png, two .png files are created. One has the original page, but the other one has a second page of the document that I could not view previously.
What could be causing this weird behavior, and how can I convert these to PDF more reliably?
I also used BlueBeam Staple to convert the files, if that helps at all.
Edit:
I've verified I'm on the latest imagemagick release, and I've been using it through PHP to process files. I'm running Windows 10.
Also, here are some example files to play around with. The first TIF actually shows the second page instead of the page I normally see when I open the file.
Edit 2: Sorry, I thought uploading the image would preserve the file type. Here's a link to some test samples
When I convert your TIFF to PNG, I get two files using IM 7.1.0-10 Q16-HDRI or IM 6.9.12-25 Q16, both on Mac OSX Sierra.
magick -quiet 294944.tif x.png
This produces two PNG files, one per page of the TIFF (the two output images are omitted here).
Is this not what you get or expect?
P.S.
What are the other two files, 327924.001 and 327924.002?
If those are some kind of split TIFF, then it does not look like libtiff, which ImageMagick uses to read TIFFs, can handle them; I get errors when attempting to use identify on them.
You definitely have some issue with whatever attempted to write those TIFFs.
instrument 294944 page 1 of 2 = G4 199 dpi sheet 2 of 2 294944.tif (25.17 x 17.53 inches)
instrument 294944 page 2 of 2 = G4 199 dpi sheet 1 of 2 294944.tif (24.12 x 17.63 inches)
instrument 327501 page 1 of 1 = UN 72 dpi sheet 1 of 1 327924.001 (124.78 x 93.86 inches)
instrument 327924 page 1 of 2 = G4 400 dpi sheet 1 of 2 327924.002 (23.80 x 17.53 inches)
instrument 327924 page 2 of 2 = G4 400 dpi sheet 2 of 2 327924.002 (23.84 x 17.41 inches)
Two are identified as CCITT Group 4 fax encoding, which is common for TIFFs of this type.
TIFF is a multi-image format: a multi-page fax can be viewed as one file, or four CMYK printing plates could be sent as one image file, either overlaid as one check print or printed one at a time for quality inking.
The extension .tif (or .tiff) is usually applied to files with one or more pages (even 400+ for a long novel).
The pattern part001.tif, part002.tif is usually applied to groups of multiple pages, OR single sequential pages are named part1.001.tif, part1.002.tif.
Unfortunately, you have a mix following a convention that seems to indicate the number of pages (002 = 2 pages), but in inconsistent order, so you need to check which convention was used for each file, as there is uncertainty.
Also, the internal instrument number does not always reflect the filename (note 327501 inside 327924.001); perhaps a transfer of interest?
In addition, you have a mix of compression methods and resolutions, so you cannot be sure of the correct scale to apply.
The best way to resolve this is to decide how you wish the pages to be regrouped/sequenced, apply the correct scale to each page or group of pages, then recombine them into a PDF as desired.
For a large number of files it would help to tabulate the pages by number, scale, size, compression, etc., and then process identical groups together before reordering and merging; a sketch of such a tabulation follows.
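A hypothetical tabulation using ImageMagick's identify (a sketch; the .001/.002 files that libtiff rejects will still produce errors, so this covers the readable TIFFs only):
# one line per page: filename, page index, compression, density, pixel size
magick identify -format "%f page %p: %C, %x x %y dpi, %w x %h px\n" *.tif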

Find tagged content in PDF/A-1a using pdfbox

I have what I presume to be a PDF/A-1a file that was generated by Apache FOP and has an overlay letterhead put on using OverlayPDF from pdfbox. preflight recognizes the file as OK (but obviously only as PDF/A-1b), and Acrobat Reader says it is in "PDF/A" mode with "Tagged: yes" in the document properties. I would like to see how that looks so I could maybe tweak fop into some small improvements.
My question is: where can I look to see the tagged content (i.e. the text representation of what in PDF is a kerned sequence of character outputs), preferably without coding it myself, e.g. using the debugger/PDFReader from pdfbox? I'm a little lost there. Is there an alternative way of getting a textual dump of the document structure, e.g. into an XML file, to search it with an editor? TIA!
Edit
The letterhead(s) itself is originally PostScript, converted to PDF/A-1b using Ghostscript, then overlaid with
java -jar pdfbox-app-2.0.0-RC3.jar OverlayPDF letter_plain.pdf \
followingpages_letterhead.pdf -first firstpage_letterhead.pdf \
letter_with_head.pdf
The letter_plain.pdf is generated with fop using
fop -pdfprofile 'PDF/A-1a' -v -d -c my_fop_config.cfg -xml letter.xml \
-xsl letter_to_fo.xsl -pdf letter_plain.pdf
The versions used are pdfbox 2.0 and fop 1.1.
In case letter_with_head.pdf is no longer PDF/A-1a, the question applies to letter_plain.pdf instead, which should be 1a as per the fop call; I would then have to choose a different solution (like SVG) to get the letterhead in.
Edit 2
Example pdfs can be found here: https://www.magentacloud.de/share/j9qk7jfzyv - there is no need for a separate followingpages_letterhead.pdf as the sample is only one page.
Edit 3
I have a suspicion that the text is buried somewhere below Root/StructTreeRoot/ParentTree/Nums/[1]/[3]/P/P/P/P/P/P (assuming that the P's somehow map to the fo:blocks), but I can't get anywhere showing text from the PDF.
The structure tree entries in the PDF at hand map to marked content in the page's content stream. As an example, the entry at
Root/StructTreeRoot/K/[0]/K/[0]/K/[1]/K/[0]/K/[0]/K/[0]/K/[0]
maps to this part of the page's content stream:
/Span << /MCID 0 >> BDC
BT
/F15 11 Tf
1 0 0 -1 0 9.163 Tm
[ (Bes) 15 (tell-Nr) 48 (. 1) 34 (23) 6 (456) 29 (7) 40 (8) ] TJ
ET
EMC
As can be seen, there is no additional text definition, so there is no easily displayable text other than by parsing the TJ operator in this example sequence; concatenating its string elements yields "Bestell-Nr. 12345678" (the interspersed numbers are kerning adjustments, not text). So the tagging is only used to define the structure of the document, pointing to the different building blocks.
In addition there is some information for accessibility support, but that is limited to specifying the Lang attribute in the structure tree.
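For reference, the structure tree shown above can be browsed interactively, without writing code, using the GUI inspector shipped in the pdfbox app jar (an assumption that your 2.0.0-RC3 build already includes the PDFDebugger tool):
java -jar pdfbox-app-2.0.0-RC3.jar PDFDebugger letter_with_head.pdf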

How to convert a vector eps file to pdf?

I have an EPS file in vector format that I need to convert to PDF, retaining its vector format. I'm using a Windows 7 system, and I'm trying to find a tool that I can redistribute with my application. It can't be GUI or online based; I need my application to use it as a library or via a system call.
I have tried the following tools without success:
ghostscript 9.06 - ps2pdf - Outputs a blank pdf.
ImageMagick - Generates a pdf with the correct image, but it's a raster converter so it does not preserve the vector format.
UniConvertor - Outputs a blank pdf.
pstoedit - Outputs a blank pdf.
Of course, I'm not an expert with any of these tools listed so it's quite possible I'm just not running the tool with the correct configuration; if anyone recognizes a blank pdf as being a symptom of an incorrectly configured run with one of the tools, please let me know of possible fixes. Thank you for any help.
Here is the header of the eps file:
%!PS-Adobe-2.0 EPSF-1.2
%%Creator:Adobe Illustrator(TM) 1.1
%%For:OPS MANUAL FLOE
%%Title:ILLUS.MAC
%%CreationDate:7/27/87 3:40 PM
%%DocumentProcSets:Adobe_Illustrator_1.1 0 0
%%DocumentSuppliedProcSets:Adobe_Illustrator_1.1 0 0
%%DocumentFonts:Courier
%%+Helvetica
%%BoundingBox:000 -750 650 50
%%TemplateBox:288 -360 288 -360
%%EndComments
%%BeginProcSet:Adobe_Illustrator_1.1 0 0
The BoundingBox comment says the marks extend from (0, -750) to (650, 50).
So almost the entire content (750 of the 800 units) is below the page. Note that Ghostscript ignores DSC comments; they are, after all, comments.
In order to position this on the page, you must translate the origin and potentially scale the page. Please note that EPS files are intended for inclusion in other documents, not for printing on their own, and it's up to the document manager to read the BoundingBox comments and position the EPS correctly.
In the absence of a document manager, you will have to do this yourself. Note that changing the comments will have no effect at all.
I would suggest you start by prepending the line:
0 750 translate
which will move the origin 750 units vertically, so the content will then extend from (0, 0) to (650, 800), and see what effect that has.
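If Ghostscript performs the conversion, the same translation can be injected on the command line instead of editing the file (a sketch; -c executes PostScript fragments before -f reads the input, and the page-size switches are optional):
# shift the content up by 750 units and write a 650x800 pt vector PDF
gs -o out.pdf -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=650 -dDEVICEHEIGHTPOINTS=800 -c "0 750 translate" -f input.eps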

How to identify PDF files that need OCR?

I have over 30,000 PDF files. Some files are already OCR'd and some are not. Is there a way to find out which files have already been OCR'd and which PDFs are image-only?
It would take forever to run every single file through an OCR processor.
I would write a small script to extract the text from the PDF files and see if it is "empty". If there is text, the PDF was already OCR'd. You could use either Ghostscript or Xpdf to extract the text.
EDIT:
This should get you started:
# For each PDF, print its path, the length of the extracted text, and the
# text itself; a (near-)zero length suggests an image-only file needing OCR.
foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
    # "-" as the output file makes pdftotext write to stdout
    $pdftext = & "\path\to\xpdf\pdftotext.exe" $pdffile.FullName -
    Write-Host $pdffile.FullName
    Write-Host $pdftext.Length
    Write-Host $pdftext
    Write-Host "-------------------------------"
}
Unfortunately, even when you have only images in your PDF, pdftotext will extract some text, so you will have to do some more work to check whether you need to OCR the PDF; see the sketch below.
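The same idea as a POSIX-shell loop with a cutoff (a sketch; the 50-character threshold is an arbitrary assumption you would tune against your own files):
# flag PDFs whose extractable text, minus whitespace, is under 50 characters
for f in *.pdf; do
  n=$(pdftotext "$f" - 2>/dev/null | tr -d '[:space:]' | wc -c)
  [ "$n" -lt 50 ] && echo "needs OCR: $f"
done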
Xpdf worked for me in a different way, though I am not sure it is the right way.
My PDFs with images also gave text content, so I used pdffonts.exe to verify whether the fonts are embedded in the document. In my case all image-only files showed 'no' in the embedded ('emb') column:
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> Helvetica                            Type 1            no  no  no       7  0
Whereas all searchable PDFs gave 'yes':
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> ABCDEE+Calibri                       TrueType          yes yes no       7  0
> ABCDEE+Calibri,Bold                  TrueType          yes yes no       9  0
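Scripted over many files, that heuristic might look like this (a sketch; it reads the emb column by position from the end of each line, since font names and types may contain spaces):
# flag PDFs whose pdffonts listing shows no embedded fonts at all
for f in *.pdf; do
  n=$(pdffonts "$f" | awk 'NR>2 && $(NF-4)=="yes"' | wc -l)
  [ "$n" -eq 0 ] && echo "probably image-only: $f"
done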
I found that TotalCmd has a plugin that handles this:
https://totalcmd.net/plugring/pdfOCR.html
pdfOCR is wdx plugin that discovers how many pages of PDF file in
current directory needs character recognition (OCR), i.e. how many
pages in PDF file have no searchable text in their layout. This is
mostly needed when one is preparing PDF files for one’s documentation
or archiving system. Generally in one’s work with PDF files they need
to be transformed from scanned version to text searchable form before
they are included in any documentation to allow for manual or
automatic text search. The pdfOCR plugin for Total Commander fulfils a
librarian’s need by presenting the number of pages that are images
only with no text contained. The number of scanned pages are presented
in the column “needOCR”. By comparing the needOCR number of pages with
the number of total pages one can decide if a PDF file needs
additional OCR processing.
You can scan a folder or an entire drive using the desktop search tool dtSearch. At the end of the scan, it will show a list of all "image only" PDFs. In addition, it will also show a list of "encrypted" PDFs, if any.

Determine fonts used in postscript (.ps) file

Given a postscript file that has the following header
%!PS-Adobe-3.0
I would like to list all fonts used in the file. The output does not have to be perfect, but I need to make sure I get all references to any font being used. I am aware there are different types of fonts, and that a font may or may not be embedded in the postscript file.
My current best idea is to grep/search for the word "Font" case-insensitively and go from there.
Will this get me all the font references?
Any better way to achieve this?
I tend to use .NET/C# for development purposes, but any solution is appreciated.
Thanks,
Bernard
UPDATE:
lhf's answer solved the problem; due to formatting and length constraints I am adding a working usage example based on his recommendations.
Windows batch file that can be saved to a .cmd file and run from the command prompt:
REM Prerequisites:
REM - GPL Ghostscript 8.64 # http://pages.cs.wisc.edu/~ghost/doc/GPL/gpl864.htm
REM - pdffonts # 3.02pl4 win32 download # http://www.foolabs.com/xpdf/download.html
REM Add directories to path, contains ps2pdf and its dependency gswin32c.exe
SET PATH=%PATH%;C:\Program Files\gs\gs8.64\lib;C:\Program Files\gs\gs8.64\bin
REM Add pdffonts directory to path
SET PATH=%PATH%;c:\temp\path-toxpdf-3.02pl4-win32
REM Convert postscript file to pdf file
call ps2pdf input.ps temp.pdf
REM list pdf file fonts
call pdffonts temp.pdf
Sample output:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
DQRDAA+BCC128Medium-Identity-H       CID TrueType      yes yes no      21  0
MIAVUG+Verdana-Identity-H            CID TrueType      yes yes no      13  0
BKNKQN+Verdana-Identity-H            CID TrueType      yes yes no      10  0
Convert the file to pdf and then use pdffonts if you have it.
If you're into PS programming, you could run a mock PS interpreter (in PostScript) that ignores most things except findfont.
If the PostScript file conforms to the PostScript Language Document Structuring Conventions specification, you may look for PostScript comments starting with the strings:
%%DocumentNeededResources:
%%DocumentSuppliedResources:
%%DocumentFonts:
%%DocumentNeededFonts:
%%DocumentSuppliedFonts:
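Those comments are plain text near the top of the file, so a simple search suffices (a sketch using grep; a .NET Regex over the same patterns would work just as well from C#):
# list font-related DSC comments (%%+ lines continue the previous comment)
grep -E '^%%(Document(Needed|Supplied)Resources|Document(Needed|Supplied)?Fonts|\+)' input.ps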
Try the following regular expression:
#"/.*?\sfindfont"
It will give you some extra matches, but you can play with it from there.