Ghostscript Combine ps with pdf file and add the first line (header) - pdf

I need to combine ps with pdf file and add the first line "#S Company Name" for the combined file.
I use the command:
gswin64c.exe -o combine.ps -sDEVICE=ps2write -f fax003.ps -f warning.pdf
How can I add an extra line "#S Company Name" to the output file combine.ps?
Should get such a file:
#S Company Name **-- first line--**
%!PS-Adobe-3.0
%%BoundingBox: 0 0 595 842
%%HiResBoundingBox: 0 0 595.00 842.00
%%Creator: GPL Ghostscript 922 (ps2write)
%%LanguageLevel: 2
%%CreationDate: D:20180304115454+02'00'
%%Pages: 13
%%EndComments
%%BeginProlog
/DSC_OPDFREAD true def
/SetPageSize true def
/EPS2Write false def
currentdict/DSC_OPDFREAD known{
currentdict/DSC_OPDFREAD get
}{
false
}ifelse
10 dict begin
/DSC_OPDFREAD exch def
/this currentdict def
...

What you've written as the output file is not valid PostScript. The presence of the "#5...." text will cause a PostScript interpreter to throw an error.
So, assuming that you have a workflow where this will be stripped off. You can't do exactly this with an unmodified Ghostscript. Your only solution will be to open the file after it has been completely output and modify it. Or, of course, alter the pdfwrite device source code and rebuild the Ghostscript executable.
Since you appear to be using Ghostscript in a commercial environment, can I just draw your attention to the licence under which it is supplied; this is the AGPL v3, which covers usage such as Software-as-a-Service. I don't want you to think I'm accusing you of using Ghostscript other than in accordance with the terms of the licence, I just want to make certain you are aware of the terms.

Related

How do I crop pages 3&4 in a multipage pdf using ghostscript

I would like to crop just some pages in a multipage pdf keeping all pages, some cropped, others not. I tried the following but it "deletes" the non cropped pages...
gswin64.exe -o cropped.pdf -sDEVICE=pdfwrite -dFirstPage=3 -dLastPage=4 -c "[/CropBox [24 72 1559 1794]" -c " /PAGES pdfmark" -f input.pdf
I've seen the posts on different cropping on odd and even pages, but I could not figure out how to apply this to a certain page in a multipage document.
gswin64.exe -o cropped.pdf -sDEVICE=pdfwrite -c "<</EndPage {0 eq {2 mod 0 eq {[/CropBox [0 0 1612 1792] /PAGE pdfmark true}{[/CropBox [500 500 612 792] /PAGE pdfmark true} ifelse}{false}ifelse}>> setpagedevice" -f input.pdf
This does crop all pages according to the settings of the second CropBox. If anybody is wondering about the large margins... I apply this do large drawings.
I have also tried to substitute some operators to only apply the crop to a certain page number: "sub 4" instead of "2 mod" was one attempt to attain the " 0 eq" condition only when the current page number reaches 4.
OK first things first, Ghostscript and the pdfwrite device do not 'modify' an input PDF file. For regular readers; standard lecture here, if you've read it before you can skip the following paragraph.
The way this works is that the input file is completely interpreted into a sequence of graphics primitives which are sent to the device. Rendering devices then call the graphics library to render the primitives to a bitmap, which is then output. High level (vector) devices, such as pdfwrite, translate the primitives into equivalent operations in some high level page description language, and emit that.
So, when you select -dFirstPage and -dLastPage, those are only pages for the input file you are choosing to process. So pdfwrite isn't 'deleting' your pages, you never sent them to the device in the first place.
Now, Ghostscript is a PostScript interpreter, and therefore its action can be affected by writing PostScript programs. In your case you probably want to actually process all the pages (so drop -dFirstPage and -dLastPage), but only write the pdfmark on selected pages.
The way to do this is via a BeginPage or EndPage procedure. If you search here or in the PostScript tag you'll find a number of examples. Fundamentally both procedures are called with a reason code and a count of pages so far.
From memory you will want to check the reason code is 2. If it is, then you want to check the count of pages, and it it matches your criteria (in the case here, count is 3 or 4), execute the /PAGE pdfmark. In any case you want to return 'true' so that the page is emitted.
[EDIT added here]
Hmm, OK I see the problem. What's happening is that the PDF interpreter is calling 'setpagedevice' to set the page size for each page, in case the page size has altered. The problem is that this resets the page count back to 0 each time.
Now, I wouldn't normally suggest the following, because it relies on some undocumented aspects of Ghostscript's PDF interpreter. However, I happen to know that the PDF interpreter tracks the page number internally using a named object called /Page#.
So, if I take the code you wrote, and modify it slightly:
<<
/EndPage {
0 eq {
pop /Page# where {
/Page# get
3 eq {
(page 3) == flush
[/CropBox [0 0 1612 1792] /PAGE pdfmark
true
}
{
(not page 3) == flush
[/CropBox [500 500 612 792] /PAGE pdfmark
true
} ifelse
}{
true
} ifelse
}
{
false
}
ifelse
}
>> setpagedevice
Couple of things to note; there's some debug in there, the lines with '== flush' print out some stuff on the back channel so you know how each page is being handled. If /Page# isn't defined, then the code simply leaves everything alone, this is just some basic safety-first stuff.
Rather than type all this on the command line (which also loses indenting and is hard to read) I stuck it in a file, called test.ps, then invoked GS as:
gswin32c -sDEVICE=pdfwrite -sOutputFile=out.pdf test.ps input.pdf
Its not the neatest solution in the world, but it works for me.

Why does ghostscript replace fontnames to "CairoFont"?

I use ghostscript to optimize pdf files (mostly with respect to size), for which it does a great job. The command that I use is:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
-dCompatibilityLevel=1.4 -sOutputFile=out.pdf in.pdf
However, it seems that this replaces fonts (or subsets them) and does not preserve their names. It replaces it by CairoFont. How could I get ghostscript to preserve the fontnames?
Example:
A simple pdf file (created with Inkscape), with a single text element in it (Nimbus Roman) as an input (in.pdf):
for which pdffonts reports:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
PMLNBT+NimbusRomanNo9L Type 1 yes yes yes 5 0
However, after running ghostscript over the file pdffonts reports:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
OEPSCM+CairoFont-0-0 Type 1C yes yes no 8 0
So, is there a way to have ghostscript (or libcairo?) preserve the name of the font?
The input file is uploaded here.
Ghostscript doesn't change the font name, but there are, in fact, several different font 'names' in a PDF file.
In the case of your file the PDF FontDescriptor object has a name
<<
/Type /FontDescriptor
/FontName /PMLNBT+NimbusRomanNo9L
/Flags 4
/FontBBox [ -168 -281 1031 924 ]
/ItalicAngle 0
/Ascent 924
/Descent -281
/CapHeight 924
/StemV 80
/StemH 80
/FontFile 7 0 R
>>
which refers to a FontFile stream
/FontFile 7 0 R
That stream contains the following:
%!PS-AdobeFont-1.0: NimbusRomNo9L-Regu 1.06
%%Title: NimbusRomNo9L-Regu
%Version: 1.06
%%CreationDate: Thu Aug 2 13:14:49 2007
%%Creator: frob
%Copyright: Copyright (URW)++,Copyright 1999 by (URW)++ Design &
%Copyright: Development; Cyrillic glyphs added by Valek Filippov (C)
%Copyright: 2001-2005
% Generated by FontForge 20070723 (http://fontforge.sf.net/)
%%EndComments
FontDirectory/NimbusRomNo9L-Regu known{/NimbusRomNo9L-Regu findfont dup/UniqueID known pop false {dup
/UniqueID get 5020931 eq exch/FontType get 1 eq and}{pop false}ifelse
{save true}{false}ifelse}{false}ifelse
11 dict begin
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0 ]readonly def
/FontName /CairoFont-0-0 def
Do you see the FontName in the actual font ? Its called CairoFont-0-0
This brings me back to a point which I reiterate frequently here and elsewhere; when you process a PDF file with Ghostscript and emit a new PDF file using the pdfwrite device you are not 'optimising', 'converting', 'subsetting' or in a general sense manipulating the content of the original PDF file.
What Ghostscript does is interpret the PDF file, ths produces a set opf marking operations (such as 'stroke', 'fill', 'image' etc) which it sends to the selected Ghostscript device. Most Ghostscript devices will then use the graphics library to render the operations to a bitmap and when the page is complete will write the bitmap to a file. The 'high level' or 'vector' devices instead repackage the operations into another Page Description Language. In the case of pdfwrite, that's a PDF file.
What this means in practice is that the emitted PDF file has nothing (apart from appearance) in common with the original PDF file. In particular the description of the objects may be different.
So in your case, the pdfwrite device doesn't know what the font was called in the original PDF object. It does know that the font that was defined was called Cairo-0-0 so that's what it calls the font when it emits it.
Frankly this is another piss-poor example from Cairo, to go along with defining each page as containing transparency whether it does or not, the FontName in the Font object is supposed to be the same as the name in the Font stream.
Its pretty clear that the FontName has been altered, given the rest of the boilerplate there.

-dSubsetFonts=false option stops showing TrueType fonts /glyphshow

I have a PostScript that uses TrueType fonts. However, I want to include rarly used characters like registration marks (®) and right/left single/double quotes (’, “ etc).
So I used glyphshow and called the names of the glyphs
%!
<< /PageSize [419.528 595.276] >> setpagedevice
/DeviceRGB setcolorspace
% Page 1
%
% Set the Original to be the top left
%
0 595.276 translate
1 -1 scale
gsave
%
% Save this state before moving x y specifically for images
%
1 -1 scale
/BauerBodoniBT-Roman findfont 30 scalefont setfont % set the pt size %-3.792 - 16
1 0 0 setrgbcolor
10 -40 moveto /quoteright glyphshow
10 -80 moveto /registered glyphshow
/Museo-700 findfont 30 scalefont setfont % set the pt size %-3.792 - 16
1 0 1 setrgbcolor
10 -120 moveto /quoteright glyphshow
10 -180 moveto /registered glyphshow
showpage
When I execute this PostScript using the following command (due to my requirement for the pdf to be editable in Illustrator i.e. can be opened with all fonts intact) the PDF shows nothing but seems to contain the glyphs if you copy and paste from the pdf into a text file.
gs -o gly_subsetfalse.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dSubsetFonts=false -dPDFSETTINGS=/prepress textglyph.ps
However, this above command now causes issues with pulling it into Illustrator. The rare glyphs become unrecongisble (', Æ). Normal characters and regular glyphs seem fine i.e. /a glyphshow and just show text appear in pdf and illustrator.
So, it seems that having the SubsetFonts option as True shows rare glyphs but this stops me from pulling the PDF into Illustrator.
Attached are the TrueType Fonts for reference and two PDFs (one with subsetfont option being truw and the other not - default).
I have also tried the following command with the same ill results (no visible glyphs appearing on the PDF and Illustrator incorrectly shows the glyphs).
gs -o gly_subsetfalse_embedallfonts.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dPDFSETTINGS=/prepress -dSubsetFonts=false -dEmbedAllFonts=true textglyph.ps
But with this command I also get a PreFlight error from the PDF if that helps:
"Glyph width info in PDF does not match width info in embedded font"
Attached are all the files spoke about above - click here.
Encoding the font also does not produce good results.
I have encoded a TrueType(and a Type42) font in my PostScript and listed a few new characters to glyphshow.
Results are:
Command 1:
gs -o encode_ttf_subset_false.pdf -sDEVICE=pdfwrite -dSubsetFonts=false encode.ps
Results 1:
Open the PDF in Acrobat does NOT display any glyphshow characters.
Command 2:
gs -o encode_ttf_subset_true.pdf -sDEVICE=pdfwrite encode.ps
Results 2:
Open the PDF in Acrobat and it DOES show the glyphshow characters but not in Illustrator.
Command 3:
gs -o encode_ttf_subset_false_embedtrue.pdf -sDEVICE=pdfwrite -dSubsetFonts=false -dEmbedAllFonts=true encode.ps
Results 3:
Same as Result 1 (glyphshow characters do not appear).
Below is my new PostScript with Encoded TTF and Type42 (I've also included them in my file further below).
Is this a bug at least with Ghostscript?
/museobold findfont dup %%%%% This is the Type42 Font
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /oacute put
Encoding 2 /aacute put
Encoding 3 /eacute put
Encoding 4 /questiondown put
Encoding 5 /quotedblleft put
Encoding 6 /quoteright put
Encoding 7 /quotedblbase put
/museobold-Esp currentdict definefont pop
end
/museobold-Esp 18 selectfont
72 600 moveto
(\005D\001lnde est\002 el camino a San Jos\003? More characters \006 and \007) show
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
/BauerBodoniBT-Roman findfont dup
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /oacute put
Encoding 2 /aacute put
Encoding 3 /eacute put
Encoding 4 /questiondown put
Encoding 5 /quotedblleft put
Encoding 6 /quoteright put
Encoding 7 /quotedblbase put
/BauerBodoniBT-Roman-Esp currentdict definefont pop
end
/BauerBodoniBT-Roman-Esp 18 selectfont
72 630 moveto
(\005D\001lnde est\002 el camino a San Jos\003? More characters \006 and \007) show
showpage
Click here to downloading the following: BBBTRom.ttf (TrueType font); 3 pdfs (results 1, 2 and 3); museobold (TrueType font converted to Type42 using ttftotype42) and encode.ps.
This is back to your problem with using Illustrator as a general PDF application. It can't do that. Now as you note you've found ways round that in the past, this time I believe you are out of luck.
The PostScript glypshow operator doesn't have a PDF equivalent. Also, because of the way glyphshow works, we cannot simply use any existing font instance to store the glyph (because the glyph may not be, and probably isn't, present in the Encoding). As a result pdfwrite does the only thing it can. It makes a new font which consists only of the glyphs used by glyphshow from the specific original font's CharStrings.
Because we don;t have an Encoding to work from we have to use a custom (suymbolic) Encoding (because fonts in a PDF file have to have an Encoding) which from your previous experience I suspect means that Illustrator is unable to read the font we embed.
Using glyphshow with pdfwrite is something I would not encourage.
Now having said that, there should not be a problem with the PDF file when SubsetFonts is true, though I do have an open bug report which sounds similar. You haven't actually said which version of Ghostscript you are using, so I can't be sure if its the same problem. (nor do I have the same fonts etc). Note that this is not (I believe) related to your problem with Illustrator, that's caused by your use of glyphshow and some Illustrator limitation.
As a general rule I would not use -dPDFSETTINGS, certainly not while trying to debug a problem, nor would I limit the output to PDF 1.3.

Convert subset of postscript file to pdf documents

I have a system that generates large quantities of PostScript files that each contain multiple, multi-page documents. I want to write a script that takes these large PostScript documents and outputs multiple PDF documents from each.
For example one postscript file contains 200 letters to customers, each of which is 10 pages long. This postscript file contains 2000 pages. I want to output from this 1 ps document, 200x 10 page PDFs, one for each customer.
I'm thinking GhostScript is the way to go for this level of document manipulation but I'm not sure the best way to go - Is there a function in GhostScript to take 'pages 1-10' of the input ps file? Do I have to output the entire ps file as 2000 separate ps files (1 per page) then combine them back together again?
Or is there a much simpler way of acheiving my goal with something other than GhostScript?
Many Thanks,
Ben
Technically this will be possible in the next release of Ghostscript, or using the HEAD code in the Git repository. It is now possible to switch devices when using pdfwrite which will cause the device to close and complete the current PDF file. Switching back again will start a new one.
Combine this with a BeginPage and/or EndPage procedure in the page device dictionary, and you should be able to do something like what you want.
Caveat; I haven't tried any of this, and it will take some PostScript programming to get it to work.
Because of the nature of PostScript, there is no way to extract the 'N'th page from a file, so there is no way to specify a range of pages.
As lsemi suggests you could first convert to one large PDF file and then extract the ranges you want. Ghostscript is able to use the FirstPage and LastPage switches to do this (unlike PostScript, it is possible to extract a specific page from a PDF file).
Well, you might first make the PS into a PDF object collection (or directly generate a PDF from GhostScript by printing to the PDFWriter device), and then "cut" from the big PDF using pdftk, which would be quite fast.
Create the complete PDF file first with the help of Ghostscript:
gs \
-o 2000p.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
2000p.ps
Use pdftk to extract PDF files with 10 pages each:
for i in $(seq 0 10 199); do \
export start=$(( ${i} * 1 + 1 )); \
export end=$(( ${start} + 9 )); \
pdftk \
2000p.pdf \
cat ${start}-${end} \
output pages---${start}..${end}.pdf; \
done
You can have Ghostscript generate a 2000page sample+test PDF for you by first creating a sample PostScript file named '2000p.ps' with these contents:
%!PS
/H1 {/Helvetica findfont 48 scalefont setfont .2 .2 1 setrgbcolor} def
/pageframe {1 0 0 setrgbcolor 2 setlinewidth 10 10 575 822 rectstroke} def
/gopageno {H1 300 700 moveto } def
1 1 2000 {pageframe gopageno
4 string cvs
dup stringwidth pop
-1 mul 0 rmoveto
show
showpage} for
and then run this command:
gs -o 2000p.pdf -sDEVICE=pdfwrite -g5950x8420 2000p.ps

High-res images from PDFS

I'm working on a project in which I need to extract a TIFF per page from multi-page PDFs. The PDFs contain images only and there is one image per page (I believe they were made on some kind of photocopier/scanner, but haven't confirmed this). The TIFFs are then used to create several other derivative versions of the document so the higher the resolution the better.
I've found two recipes, both with helpful aspects, but neither is ideal. Hoping someone can help me tune one of them, or offer a third option.
Recipe 1, pdfimages and ImageMagick:
First do:
$ pdfimages $MY_PDF.pdf foo"
Which results in several .pbm files (named foo-000.pbm, foo-001.pbm), etc.
Then for each *.pbm do:
$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif
Pro: The resultant TIFFs are a healthy 3300+ pixels on the long dimension, (-resize just serves to normalize everything)
Con: The orientation of the pages is lost, and they come out rotated different directions (they follow logical patterns, so probably they are the orientation in which they were fed to the scanner??).
Recipe 2 Imagemagick solo:
convert +adjoin $MY_PDF.pdf pages.tif
This gives me a TIFF per page (pages-0.tif, pages-1.tif, etc.).
Pro: Orientation stays!
Con: The long dimension of the resultant file is < 800 px, which is too small to be useful, and it looks as though there is some compression applied.
How can I ditch the scaling of the image stream in the PDF, but retain the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?
Sorry for the noise on this old topic, but google took me here as one of the top results and it might take others, so I thought I'd post the solution for the TO's question that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick
In Short: You have to tell ImageMagick at which density it should scan the PDF.
so convert -density 600x600 foo.pdf foo.png will tell ImageMagick to treat the PDF as if it had a 600dpi resolution and thus output much larger PNGs. In my case, the resulting foo.png was sized 5000x6600px. You can optionally add -resize 3000x3000 or whatever size you require and it will be scaled down.
Note that as long as you only have vector images or text in your PDF-files, density might be set as high as needed. If the PDF contains rasterized images, it won't look good if you set it higher than those images' dpi, surprise! :)
Chris
I wanted to share my solution...it may not work for everyone, but since nothing else has come around maybe it will help someone else. I wound up going with the first option in my question, which was to use pdfimages to get large images that were rotated every which way. I then found a way to use OCR and word counts to guess at the orientation, which got me from (estimated) 25% rotated accurately to above 90%.
The flow is as follows:
Use pdfimages (apt-get install poppler-utils) to get a set of pbm
files (not shown below).
For each file:
Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).
OCR each. The two with the lowest word count are likely the right-side up and upside down versions. This was over 99% accurate in my set of images processed to date.
From the two with the lowest word count, run the OCR output through a spell check. The file with the least spelling errors (i.e. most recognizable words) is likely to be correct. For my set this was about 93% (up from 25%) accurate based on a sample of 500.
YMMV. My files are bitonal and highly textual. The source images are an average of 3300 px on the long side. I can't speak to greyscale or color, or files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed and not accuracy since I only need rough numbers and am throwing away the OCR. Re: performance, my nothing-special Linux desktop machine can run the whole script over about 2-3 files/per second.
Here's the implementation in a simple bash script:
#!/bin/bash
# Rotates a pbm file in place.
# Pass a .pbm as the only arg.
file=$1
TMP="/tmp/rotation-calc"
mkdir $TMP
# Dependencies:
# convert: apt-get install imagemagick
# ocrad: sudo apt-get install ocrad
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
DIRNAME="/usr/bin/dirname"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"
# Make copies in all four orientations (the src file is north; copy it to make
# things less confusing)
file_name=$(basename $file)
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"
cp $file $north_file
$CONVERT -rotate 90 $file $east_file
$CONVERT -rotate 180 $file $south_file
$CONVERT -rotate 270 $file $west_file
# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"
$OCRAD -f -F utf8 $north_file -o $north_text
$OCRAD -f -F utf8 $east_file -o $east_text
$OCRAD -f -F utf8 $south_file -o $south_text
$OCRAD -f -F utf8 $west_file -o $west_text
# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
wc_table="$TMP/wc_table"
echo "$($WC -w $north_text) $north_file" > $wc_table
echo "$($WC -w $east_text) $east_file" >> $wc_table
echo "$($WC -w $south_text) $south_file" >> $wc_table
echo "$($WC -w $west_text) $west_file" >> $wc_table
# Take the bottom two; these are likely right side up and upside down, but
# generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n $wc_table | $HEAD -2 > $bottom_two_wc_table
# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read record; do
txt=$(echo $record | $AWK '{ print $2 }')
misspelled_word_count=$(cat $txt | $ASPELL -l en list | wc -w)
echo "$misspelled_word_count $record" >> $misspelled_words_table
done < $bottom_two_wc_table
# Do the sort, overwrite the input file, save out the text
winner=$($SORT -n $misspelled_words_table | $HEAD -1)
rotated_file=$(echo $winner | $AWK '{ print $4 }')
mv $rotated_file $file
# Clean up.
if [ -d $TMP ]; then
rm -r $TMP
fi