Convert subset of postscript file to pdf documents - pdf

I have a system that generates large quantities of PostScript files that each contain multiple, multi-page documents. I want to write a script that takes these large PostScript documents and outputs multiple PDF documents from each.
For example one postscript file contains 200 letters to customers, each of which is 10 pages long. This postscript file contains 2000 pages. I want to output from this 1 ps document, 200x 10 page PDFs, one for each customer.
I'm thinking GhostScript is the way to go for this level of document manipulation but I'm not sure the best way to go - Is there a function in GhostScript to take 'pages 1-10' of the input ps file? Do I have to output the entire ps file as 2000 separate ps files (1 per page) then combine them back together again?
Or is there a much simpler way of acheiving my goal with something other than GhostScript?
Many Thanks,
Ben

Technically this will be possible in the next release of Ghostscript, or using the HEAD code in the Git repository. It is now possible to switch devices when using pdfwrite which will cause the device to close and complete the current PDF file. Switching back again will start a new one.
Combine this with a BeginPage and/or EndPage procedure in the page device dictionary, and you should be able to do something like what you want.
Caveat; I haven't tried any of this, and it will take some PostScript programming to get it to work.
Because of the nature of PostScript, there is no way to extract the 'N'th page from a file, so there is no way to specify a range of pages.
As lsemi suggests you could first convert to one large PDF file and then extract the ranges you want. Ghostscript is able to use the FirstPage and LastPage switches to do this (unlike PostScript, it is possible to extract a specific page from a PDF file).

Well, you might first make the PS into a PDF object collection (or directly generate a PDF from GhostScript by printing to the PDFWriter device), and then "cut" from the big PDF using pdftk, which would be quite fast.

Create the complete PDF file first with the help of Ghostscript:
gs \
-o 2000p.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
2000p.ps
Use pdftk to extract PDF files with 10 pages each:
for i in $(seq 0 10 199); do \
export start=$(( ${i} * 1 + 1 )); \
export end=$(( ${start} + 9 )); \
pdftk \
2000p.pdf \
cat ${start}-${end} \
output pages---${start}..${end}.pdf; \
done
You can have Ghostscript generate a 2000page sample+test PDF for you by first creating a sample PostScript file named '2000p.ps' with these contents:
%!PS
/H1 {/Helvetica findfont 48 scalefont setfont .2 .2 1 setrgbcolor} def
/pageframe {1 0 0 setrgbcolor 2 setlinewidth 10 10 575 822 rectstroke} def
/gopageno {H1 300 700 moveto } def
1 1 2000 {pageframe gopageno
4 string cvs
dup stringwidth pop
-1 mul 0 rmoveto
show
showpage} for
and then run this command:
gs -o 2000p.pdf -sDEVICE=pdfwrite -g5950x8420 2000p.ps

Related

Ghostscript Combine ps with pdf file and add the first line (header)

I need to combine ps with pdf file and add the first line "#S Company Name" for the combined file.
I use the command:
gswin64c.exe -o combine.ps -sDEVICE=ps2write -f fax003.ps -f warning.pdf
How can I add an extra line "#S Company Name" to the output file combine.ps?
Should get such a file:
#S Company Name **-- first line--**
%!PS-Adobe-3.0
%%BoundingBox: 0 0 595 842
%%HiResBoundingBox: 0 0 595.00 842.00
%%Creator: GPL Ghostscript 922 (ps2write)
%%LanguageLevel: 2
%%CreationDate: D:20180304115454+02'00'
%%Pages: 13
%%EndComments
%%BeginProlog
/DSC_OPDFREAD true def
/SetPageSize true def
/EPS2Write false def
currentdict/DSC_OPDFREAD known{
currentdict/DSC_OPDFREAD get
}{
false
}ifelse
10 dict begin
/DSC_OPDFREAD exch def
/this currentdict def
...
What you've written as the output file is not valid PostScript. The presence of the "#5...." text will cause a PostScript interpreter to throw an error.
So, assuming that you have a workflow where this will be stripped off. You can't do exactly this with an unmodified Ghostscript. Your only solution will be to open the file after it has been completely output and modify it. Or, of course, alter the pdfwrite device source code and rebuild the Ghostscript executable.
Since you appear to be using Ghostscript in a commercial environment, can I just draw your attention to the licence under which it is supplied; this is the AGPL v3, which covers usage such as Software-as-a-Service. I don't want you to think I'm accusing you of using Ghostscript other than in accordance with the terms of the licence, I just want to make certain you are aware of the terms.

Reordering large PDF files using command line tools

I'm working with PDF files that have hundreds of forms within them. Each form is 2 pages long, so in most files pages 1-2 is the first form, pages 3-4 is the second form, and so on.
However, there are several PDF files where the page order of the forms are reversed. In these cases, page 1 is the second page of the first form and page 2 is the first page of the first form, page 3 is the second page of the second form and page 4 is the first page of the second form, and so on.
I want to reorder the page order in these files so that the pages are in this sequence: (2(1), 2(1)-2, 2(2), 2(2)-1, 2(3), 2(3)-1,...,2n,2n-1), where n= total number of pages/2.
I've been looking for a way to do this using command line tools such as cpdf, pdftk, etc., but to no avail. The files are quite large so I would like to do it by using command line tools.
Any advice will be greatly appreciated!
CIB pdf toolbox of CIB (https://www.cib.de) has a (non free) command line tool version, which supports all possibilities of PDF merging in one run.
Did you try poppler-utils?
i think with the command line tools pdfseparate and pdfunite utilities you can achieve all you want.
http://www.manpagez.com/man/1/pdfseparate/
http://www.manpagez.com/man/1/pdfunite/
Would it matter to you if the order of the forms inside the PDF is changed? For example if instead of
form1-reversed,
form2-reversed,
form3-reversed
your resulting file would look like
form3,
form2,
form1
?
In this case you could just run PDFtk so that it completely reverses all the pages of the original file:
pdftk in.pdf cat r1-1 output reversed.pdf
(Prefixing a page number with the letter r references pages in reverse order. That means r1 is the last page...)
In case you are on an operating system which supports shell scripting (like Bash on Linux or macOS), you can achieve the output of your requested page numbers by something like this (assuming that your n==10):
for i in {1..10}; do
echo -n "$(( 2 * ${i} )) ";
echo -n "$(( 2 * ${i} -1 )) ";
done
This will output 2 1 4 3 6 5 8 7 10 9. Now you could use this PDFtk command for re-ordering the pages as you want:
pdftk in.pdf cat $(for i in {1..10};do echo -n "$((2 * ${i})) ";echo -n "$((2*${i}-1 )) ";done) output out.pdf

-dSubsetFonts=false option stops showing TrueType fonts /glyphshow

I have a PostScript that uses TrueType fonts. However, I want to include rarly used characters like registration marks (®) and right/left single/double quotes (’, “ etc).
So I used glyphshow and called the names of the glyphs
%!
<< /PageSize [419.528 595.276] >> setpagedevice
/DeviceRGB setcolorspace
% Page 1
%
% Set the Original to be the top left
%
0 595.276 translate
1 -1 scale
gsave
%
% Save this state before moving x y specifically for images
%
1 -1 scale
/BauerBodoniBT-Roman findfont 30 scalefont setfont % set the pt size %-3.792 - 16
1 0 0 setrgbcolor
10 -40 moveto /quoteright glyphshow
10 -80 moveto /registered glyphshow
/Museo-700 findfont 30 scalefont setfont % set the pt size %-3.792 - 16
1 0 1 setrgbcolor
10 -120 moveto /quoteright glyphshow
10 -180 moveto /registered glyphshow
showpage
When I execute this PostScript using the following command (due to my requirement for the pdf to be editable in Illustrator i.e. can be opened with all fonts intact) the PDF shows nothing but seems to contain the glyphs if you copy and paste from the pdf into a text file.
gs -o gly_subsetfalse.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dSubsetFonts=false -dPDFSETTINGS=/prepress textglyph.ps
However, this above command now causes issues with pulling it into Illustrator. The rare glyphs become unrecongisble (', Æ). Normal characters and regular glyphs seem fine i.e. /a glyphshow and just show text appear in pdf and illustrator.
So, it seems that having the SubsetFonts option as True shows rare glyphs but this stops me from pulling the PDF into Illustrator.
Attached are the TrueType Fonts for reference and two PDFs (one with subsetfont option being truw and the other not - default).
I have also tried the following command with the same ill results (no visible glyphs appearing on the PDF and Illustrator incorrectly shows the glyphs).
gs -o gly_subsetfalse_embedallfonts.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dPDFSETTINGS=/prepress -dSubsetFonts=false -dEmbedAllFonts=true textglyph.ps
But with this command I also get a PreFlight error from the PDF if that helps:
"Glyph width info in PDF does not match width info in embedded font"
Attached are all the files spoke about above - click here.
Encoding the font also does not produce good results.
I have encoded a TrueType(and a Type42) font in my PostScript and listed a few new characters to glyphshow.
Results are:
Command 1:
gs -o encode_ttf_subset_false.pdf -sDEVICE=pdfwrite -dSubsetFonts=false encode.ps
Results 1:
Open the PDF in Acrobat does NOT display any glyphshow characters.
Command 2:
gs -o encode_ttf_subset_true.pdf -sDEVICE=pdfwrite encode.ps
Results 2:
Open the PDF in Acrobat and it DOES show the glyphshow characters but not in Illustrator.
Command 3:
gs -o encode_ttf_subset_false_embedtrue.pdf -sDEVICE=pdfwrite -dSubsetFonts=false -dEmbedAllFonts=true encode.ps
Results 3:
Same as Result 1 (glyphshow characters do not appear).
Below is my new PostScript with Encoded TTF and Type42 (I've also included them in my file further below).
Is this a bug at least with Ghostscript?
/museobold findfont dup %%%%% This is the Type42 Font
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /oacute put
Encoding 2 /aacute put
Encoding 3 /eacute put
Encoding 4 /questiondown put
Encoding 5 /quotedblleft put
Encoding 6 /quoteright put
Encoding 7 /quotedblbase put
/museobold-Esp currentdict definefont pop
end
/museobold-Esp 18 selectfont
72 600 moveto
(\005D\001lnde est\002 el camino a San Jos\003? More characters \006 and \007) show
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
/BauerBodoniBT-Roman findfont dup
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /oacute put
Encoding 2 /aacute put
Encoding 3 /eacute put
Encoding 4 /questiondown put
Encoding 5 /quotedblleft put
Encoding 6 /quoteright put
Encoding 7 /quotedblbase put
/BauerBodoniBT-Roman-Esp currentdict definefont pop
end
/BauerBodoniBT-Roman-Esp 18 selectfont
72 630 moveto
(\005D\001lnde est\002 el camino a San Jos\003? More characters \006 and \007) show
showpage
Click here to downloading the following: BBBTRom.ttf (TrueType font); 3 pdfs (results 1, 2 and 3); museobold (TrueType font converted to Type42 using ttftotype42) and encode.ps.
This is back to your problem with using Illustrator as a general PDF application. It can't do that. Now as you note you've found ways round that in the past, this time I believe you are out of luck.
The PostScript glypshow operator doesn't have a PDF equivalent. Also, because of the way glyphshow works, we cannot simply use any existing font instance to store the glyph (because the glyph may not be, and probably isn't, present in the Encoding). As a result pdfwrite does the only thing it can. It makes a new font which consists only of the glyphs used by glyphshow from the specific original font's CharStrings.
Because we don;t have an Encoding to work from we have to use a custom (suymbolic) Encoding (because fonts in a PDF file have to have an Encoding) which from your previous experience I suspect means that Illustrator is unable to read the font we embed.
Using glyphshow with pdfwrite is something I would not encourage.
Now having said that, there should not be a problem with the PDF file when SubsetFonts is true, though I do have an open bug report which sounds similar. You haven't actually said which version of Ghostscript you are using, so I can't be sure if its the same problem. (nor do I have the same fonts etc). Note that this is not (I believe) related to your problem with Illustrator, that's caused by your use of glyphshow and some Illustrator limitation.
As a general rule I would not use -dPDFSETTINGS, certainly not while trying to debug a problem, nor would I limit the output to PDF 1.3.

ImageMagick generate pdf with special page numbering

I generate a PDF from a set of PNG files using this command:
convert -- $(ls -v -- src/*.png) out/book.pdf
where there're some files with names like -03.png, which I need to have smaller page numbers than others. But I get a PDF which has -01 having page number 1, -02 number 2, etc., and 01 starts from page number 6.
The PDF is a scanned book, which has some elements like table of contents etc. which aren't included in page numbering. I remember to have seen some PDFs which have special page numbers like vii before normal Arabic numbers start.
I've tried using -scene -5 to add an offset to page numbers, but this didn't change the result.
So what should I instead do to make page "01.png" have page number 1, etc., and previous ones have some other numbers (negative or Latin, anything) and appear at the beginning of the document?
First, you want to sort files numerically counting optional minus sign, which you won't do with command you show.
Second, you talk about PageLabels for PDF pages, which you can add using Ghostscript and pdfmark operator.
Try this command:
ls src/*.png | \
sort -n | \
convert #- pdf:- | \
gs \
-sDEVICE=pdfwrite \
-o out/book.pdf \
-c '[{Catalog}<</PageLabels<</Nums[0<</P(-3)>>1<</P(-2)>>2<</P(-1)>>3<</S/D>>]>>>>/PUT pdfmark' \
-f -
It's for 3 pages -3, -2 and -1, followed by any number of pages labelled 1, 2, 3 etc. Modify according to your needs.

combining pdf files with ghostscript, how to include original file names?

I have about 250 single-page pdf files that have names like:
file_1_100.pdf,
file_1_200.pdf,
file_1_300.pdf,
file_2_100.pdf,
file_2_200.pdf,
file_2_300.pdf,
file_3_100.pdf,
file_3_200.pdf,
file_3_300.pdf
...etc
I am using the following command to combine them to a single pdf file:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=finished.pdf file*pdf
It works perfectly, combining them in the correct order. However, when I am looking at finished.pdf, I want to have a reference that tells me the orignal filename for each page.
Does anyone have any suggestions? Can I add page names referencing the files or something?
It is fairly easy to put the file names into a list of Bookmarks which many PDF viewers can display.
This is done with PostScript using the 'pdfmark' distiller operator. For example, use the following
gs -sDEVICE=pdfwrite -o finished.pdf control.ps
where control.ps contains PS commands to print the pages and output the bookmark (/OUT) pdfmarks:
(examples/tiger.eps) run [ /Page 1 /Title (tiger.eps) /OUT pdfmark
(examples/colorcir.ps) run [ /Page 2 /Title (colorcir.ps) /OUT pdfmark
Note that you can also perform the enumeration using PS to automate the entire process:
/PN 1 def
(file*.pdf) {
/FN exch def
FN run
[ /Page PN /Title FN /OUT pdfmark % do the file and bookmark it by filename
/PN PN 1 add def % bump the page number
} 1000 string filenameforall
NB that the order of filenameforall enumeration is not specified, so you may want to sort the list
to control the order, using the Ghostscript extension .sort ( array lt .sort lt ).
Also after thinking about this, I also realized that if an imput file has more than one page, there is a better way to set the bookmark to the correct page number using the 'PageCount' device property.
[
(file*.pdf) { dup length string copy } 1000 string filenameforall
] % create array of filenames
{ lt } .sort % sort in increasing alphabetic order
/PN 1 def
{ /FN exch def
/PN currentpagedevice /PageCount get 1 add def % get current page count done (next is one greater)
FN run [ /Page PN /Title FN /OUT pdfmark % do the file and bookmark it by filename
} forall
The above creates an array of strings (copying them to unique string objects since filenameforall
just overwrites the string it is given), then sorts it, and finally processes the array of strings
using the forall operator. By using the PageCount device property to get the count of pages already produced, the page number (PN) for the bookmark will be correct. I have tested this snippet as 'control.ps'.
To stamp the filename on each page you can use a combination of ghostscript and pdftk. Taken from https://superuser.com/questions/171790/print-pdf-file-with-file-path-in-footer
gs \
-o outdir\footer.pdf \
-sDEVICE=pdfwrite \
-c "5 5 moveto /Helvetica findfont 9 scalefont setfont (foobar-filename.pdf) show"
pdftk \
foobar-filename.pdf \
stamp outdir\footer.pdf \
output outdir\merged_foobar-filename.pdf