combining pdf files with ghostscript, how to include original file names?

combining pdf files with ghostscript, how to include original file names? - pdf

I have about 250 single-page pdf files that have names like:
file_1_100.pdf,
file_1_200.pdf,
file_1_300.pdf,
file_2_100.pdf,
file_2_200.pdf,
file_2_300.pdf,
file_3_100.pdf,
file_3_200.pdf,
file_3_300.pdf
...etc
I am using the following command to combine them to a single pdf file:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=finished.pdf file*pdf
It works perfectly, combining them in the correct order. However, when I am looking at finished.pdf, I want to have a reference that tells me the orignal filename for each page.
Does anyone have any suggestions? Can I add page names referencing the files or something?

It is fairly easy to put the file names into a list of Bookmarks which many PDF viewers can display.
This is done with PostScript using the 'pdfmark' distiller operator. For example, use the following
gs -sDEVICE=pdfwrite -o finished.pdf control.ps
where control.ps contains PS commands to print the pages and output the bookmark (/OUT) pdfmarks:
(examples/tiger.eps) run [ /Page 1 /Title (tiger.eps) /OUT pdfmark
(examples/colorcir.ps) run [ /Page 2 /Title (colorcir.ps) /OUT pdfmark
Note that you can also perform the enumeration using PS to automate the entire process:
/PN 1 def
(file*.pdf) {
/FN exch def
FN run
[ /Page PN /Title FN /OUT pdfmark % do the file and bookmark it by filename
/PN PN 1 add def % bump the page number
} 1000 string filenameforall
NB that the order of filenameforall enumeration is not specified, so you may want to sort the list
to control the order, using the Ghostscript extension .sort ( array lt .sort lt ).
Also after thinking about this, I also realized that if an imput file has more than one page, there is a better way to set the bookmark to the correct page number using the 'PageCount' device property.
[
(file*.pdf) { dup length string copy } 1000 string filenameforall
] % create array of filenames
{ lt } .sort % sort in increasing alphabetic order
/PN 1 def
{ /FN exch def
/PN currentpagedevice /PageCount get 1 add def % get current page count done (next is one greater)
FN run [ /Page PN /Title FN /OUT pdfmark % do the file and bookmark it by filename
} forall
The above creates an array of strings (copying them to unique string objects since filenameforall
just overwrites the string it is given), then sorts it, and finally processes the array of strings
using the forall operator. By using the PageCount device property to get the count of pages already produced, the page number (PN) for the bookmark will be correct. I have tested this snippet as 'control.ps'.

To stamp the filename on each page you can use a combination of ghostscript and pdftk. Taken from https://superuser.com/questions/171790/print-pdf-file-with-file-path-in-footer
gs \
-o outdir\footer.pdf \
-sDEVICE=pdfwrite \
-c "5 5 moveto /Helvetica findfont 9 scalefont setfont (foobar-filename.pdf) show"
pdftk \
foobar-filename.pdf \
stamp outdir\footer.pdf \
output outdir\merged_foobar-filename.pdf

Related

How do I crop pages 3&4 in a multipage pdf using ghostscript

I would like to crop just some pages in a multipage pdf keeping all pages, some cropped, others not. I tried the following but it "deletes" the non cropped pages...
gswin64.exe -o cropped.pdf -sDEVICE=pdfwrite -dFirstPage=3 -dLastPage=4 -c "[/CropBox [24 72 1559 1794]" -c " /PAGES pdfmark" -f input.pdf
I've seen the posts on different cropping on odd and even pages, but I could not figure out how to apply this to a certain page in a multipage document.
gswin64.exe -o cropped.pdf -sDEVICE=pdfwrite -c "<</EndPage {0 eq {2 mod 0 eq {[/CropBox [0 0 1612 1792] /PAGE pdfmark true}{[/CropBox [500 500 612 792] /PAGE pdfmark true} ifelse}{false}ifelse}>> setpagedevice" -f input.pdf
This does crop all pages according to the settings of the second CropBox. If anybody is wondering about the large margins... I apply this do large drawings.
I have also tried to substitute some operators to only apply the crop to a certain page number: "sub 4" instead of "2 mod" was one attempt to attain the " 0 eq" condition only when the current page number reaches 4.

OK first things first, Ghostscript and the pdfwrite device do not 'modify' an input PDF file. For regular readers; standard lecture here, if you've read it before you can skip the following paragraph.
The way this works is that the input file is completely interpreted into a sequence of graphics primitives which are sent to the device. Rendering devices then call the graphics library to render the primitives to a bitmap, which is then output. High level (vector) devices, such as pdfwrite, translate the primitives into equivalent operations in some high level page description language, and emit that.
So, when you select -dFirstPage and -dLastPage, those are only pages for the input file you are choosing to process. So pdfwrite isn't 'deleting' your pages, you never sent them to the device in the first place.
Now, Ghostscript is a PostScript interpreter, and therefore its action can be affected by writing PostScript programs. In your case you probably want to actually process all the pages (so drop -dFirstPage and -dLastPage), but only write the pdfmark on selected pages.
The way to do this is via a BeginPage or EndPage procedure. If you search here or in the PostScript tag you'll find a number of examples. Fundamentally both procedures are called with a reason code and a count of pages so far.
From memory you will want to check the reason code is 2. If it is, then you want to check the count of pages, and it it matches your criteria (in the case here, count is 3 or 4), execute the /PAGE pdfmark. In any case you want to return 'true' so that the page is emitted.
[EDIT added here]
Hmm, OK I see the problem. What's happening is that the PDF interpreter is calling 'setpagedevice' to set the page size for each page, in case the page size has altered. The problem is that this resets the page count back to 0 each time.
Now, I wouldn't normally suggest the following, because it relies on some undocumented aspects of Ghostscript's PDF interpreter. However, I happen to know that the PDF interpreter tracks the page number internally using a named object called /Page#.
So, if I take the code you wrote, and modify it slightly:
<<
/EndPage {
0 eq {
pop /Page# where {
/Page# get
3 eq {
(page 3) == flush
[/CropBox [0 0 1612 1792] /PAGE pdfmark
true
}
{
(not page 3) == flush
[/CropBox [500 500 612 792] /PAGE pdfmark
true
} ifelse
}{
true
} ifelse
}
{
false
}
ifelse
}
>> setpagedevice
Couple of things to note; there's some debug in there, the lines with '== flush' print out some stuff on the back channel so you know how each page is being handled. If /Page# isn't defined, then the code simply leaves everything alone, this is just some basic safety-first stuff.
Rather than type all this on the command line (which also loses indenting and is hard to read) I stuck it in a file, called test.ps, then invoked GS as:
gswin32c -sDEVICE=pdfwrite -sOutputFile=out.pdf test.ps input.pdf
Its not the neatest solution in the world, but it works for me.

How to split single pdf page in half vertically using ghostscript

I have a pdf in landscape orientation and is a "Page Spread".
I need to split the page in half vertically in the middle.
I used this setting to cut the pdf in half: this one gets the left part of the page
"C:\Program Files (x86)\gs\gs9.10\bin\gswin32c.exe" -o output.pdf
-sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dSubsetFonts=true -dEmbedAllFonts=true -g 750x1085 -c
"<</BeginPage{0.95 0.96 scale 20 22 translate}>> setpagedevice" "<</PageOffset [0 0]>> setpagedevice" -f input.pdf
The above command works fine. My problem now is I need to set the Mediabox, Cropbox, Bleedbox, Trimbox, and artbox similar to the size of the splitted page. In the instance above; it should be 750x1085.
What should be the correct command/settings to run on GS so that the Mediabox, Cropbox, Bleedbox, Trimbox and artbox has the same sizes?
UPDATE
This is a sample of the PDF file I'm trying to cut in half:
PDF FILE TO SPLIT
I am now using /PAGE pdfmark and this is the command I'm trying to use:
"C:\Program Files (x86)\gs\gs9.10\bin\gswin32c.exe" -o output.pdf -sDEVICE=pdfwrite -dNOPAUSE -dBATCH
-dSAFER -dUseCropbox -dSubsetFonts=true -dEmbedAllFonts=true -g7500x10850
-c "[/CropBox [0 0 750 1085] /PAGES pdfmark"
"[/MediaBox [0 0 750 1085] /PAGES pdfmark"
"[/TrimBox [0 0 750 1085] /PAGES pdfmark"
"[/BleedBox [0 0 750 1085] /PAGES pdfmark"
"[/ArtBox [0 0 750 1085] /PAGES pdfmark"
"<</BeginPage{0.95 0.96 scale 20 22 translate}>> setpagedevice"
"<</PageOffset [0 0]>> setpagedevice"
-f input.pdf
I still can't achieve setting the Cropbox, MediaBox, TrimBox, BleedBox, ArtBox with the same size.
What should be the correct settings for the command?

Moving to an answer here, comments are too small.
Your basic problem is that the original PDF file already contains all of the Box parameters you are trying to modify. The Ghostscript PDF interpreter will, for certain classes of output device, preserve those boxes by writing its own version of a pdfmark to the device. Because this happens after your pdfmark, it replaces yours.
There is, unfortunately, no mechanism in place to disable this. Additionally, since you are using Windows, you have the resources built into the ROM file system, so you can't readily modify them.
So at this point there isn't really anything you can do about this. If you are comfortable downloading the source and meddling with some PostScript let me know and I think I can give you a solution (you don't need to rebuild GS, but the PostScript resources aren't available separately).
EDIT to include some suggested work-arounds
In pdf_main.ps:
% Test whether the current output device handles pdfmark.
/.writepdfmarkdict 1 dict dup /pdfmark //null put readonly def
/.writepdfmarks { % - .writepdfmarks <bool>
currentdevice //.writepdfmarkdict .getdeviceparams
mark eq { //false } { pop pop //true } ifelse
systemdict /DOPDFMARKS known or
} bind def
You could alter /.writepdmarks to be:
/.writepdfmarks { % - .writepdfmarks <bool>
false
} bind def
But note that a number of other things will stop being emitted in the output PDF file.
Instead you could look at the code in pdf_showpage_finish:
/pdfshowpage_finish { % <pagedict> pdfshowpage_finish -
save /PDFSave exch store
...
...
.writepdfmarks {
% Copy the boxes.
{ /CropBox /BleedBox /TrimBox /ArtBox } {
2 copy pget {
% .pdfshowpage_Install transforms the user space do same here with the boxes
oforce_elems
2 { Page pdf_cached_PDF2PS_matrix transform 4 2 roll } repeat
normrect_elems 4 index 5 1 roll fix_empty_rect_elems 4 array astore
mark 3 1 roll {/PAGE pdfmark} stopped {cleartomark} if
} {
pop
} ifelse
} forall
You can modify the line '{ /CropBox /BleedBox /TrimBox /ArtBox }', any Boxes you don't want to preserve you can remove from that line.
If you want to prevent any of them overriding your specifications, then remove the lines from '% Copy the boxes' up to '% Copy annotations and links'
Please note this only works on a system where the resources are not built into a ROM file system, or where the resources are at least available on disk, and the path to the Resource files is specified using the -I switch.

ImageMagick generate pdf with special page numbering

I generate a PDF from a set of PNG files using this command:
convert -- $(ls -v -- src/*.png) out/book.pdf
where there're some files with names like -03.png, which I need to have smaller page numbers than others. But I get a PDF which has -01 having page number 1, -02 number 2, etc., and 01 starts from page number 6.
The PDF is a scanned book, which has some elements like table of contents etc. which aren't included in page numbering. I remember to have seen some PDFs which have special page numbers like vii before normal Arabic numbers start.
I've tried using -scene -5 to add an offset to page numbers, but this didn't change the result.
So what should I instead do to make page "01.png" have page number 1, etc., and previous ones have some other numbers (negative or Latin, anything) and appear at the beginning of the document?

First, you want to sort files numerically counting optional minus sign, which you won't do with command you show.
Second, you talk about PageLabels for PDF pages, which you can add using Ghostscript and pdfmark operator.
Try this command:
ls src/*.png | \
sort -n | \
convert #- pdf:- | \
gs \
-sDEVICE=pdfwrite \
-o out/book.pdf \
-c '[{Catalog}<</PageLabels<</Nums[0<</P(-3)>>1<</P(-2)>>2<</P(-1)>>3<</S/D>>]>>>>/PUT pdfmark' \
-f -
It's for 3 pages -3, -2 and -1, followed by any number of pages labelled 1, 2, 3 etc. Modify according to your needs.

Is it possible in Ghostscript to add watermark to every page in PDF

I convert PDF -> many JPEG and many JPEG -> many PDF using ghostscript. I need to add watermark text on every converted JPEG (PDF) page. Is it possible using only Ghostscript and PostScript?
The only way I found:
gswin32c -q -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sOutputFile=output.pdf watermark.ps input.pdf
But this will insert watermark.ps watermark on first separate page in output.pdf.
Can I do this on output PDF pages directly?
Can I do this on output JPEG pages directly?
<<
/BeginPage
{ gsave
/Helvetica_Bold 120 selectfont
.85 setgray 130 70 moveto 50 rotate (Sample) show
grestore
} bind
>> setpagedevice
If I use /EndPage instead of /BeginPage - it says setpagedevice is not applicable...
How to remake this script for /EndPage?

Bit too big for a comment, so I've added a new answer. The EndPage procedure (see page 441 of the PostScript Language Reference Manual) takes two additional parameters on the stack, a count of pages emitted so far, and a reason code.
You can use the count of pages to do interesting things like duplexing, or only marking even pages or whatever, but I assume in this case you don't want it, so you just 'pop' it from the stack.
The reason code tells you why the page is being emitted, again you probably don't care so you just pop the value.
Finally the EndPage must return a boolean value to the interpreter saying whether or not to transmit the page (this allows you to do other interesting things, like only printing the first 10 pages and so on).
So you need to initially remove two values, execute your code and return a boolean. Pretty trivial:
<<
/EndPage
{ pop pop %% *BEFORE* gsave as that puts a gsave object on the stack
gsave
/Helvetica_Bold 120 selectfont
.85 setgray 130 70 moveto 50 rotate (Sample) show
grestore
true %% transmit the page, set to false to not transmit the page
} bind
>> setpagedevice

The accepted answer was inserting pages for me. The pages were blank aside from the watermark. If you run into this try adding the 2eq bit here
<<
/EndPage
{
2 eq { pop false }
{
gsave
/Helvetica_Bold 120 selectfont
.85 setgray 130 70 moveto 50 rotate (Sample) show
grestore
true
} ifelse
} bind
>> setpagedevice
I found the following site that pointed me in the correct direction
http://habjan.blogspot.com/2013/10/how-to-programmatically-add-watermark.html
Here's the calling syntax where the above file is saved as watermark.ps and gswin32c references the ghostscript exe
gswin32c -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=watermarked.pdf watermark.ps original.pdf

I don't know what you mean by 'directly'. Its possible, as you have found, to have a PostScript interpreter do many kinds of things on a per-page basis. PostScript is a programming language after all.
I would suggest that the /BeginPage and/or /EndPage procedures in the page device dictionary would be the place to start. These allow you to execute arbitrary PostScript at the start or end of every page.
If you define a /BeginPage procedure then it will be executed before any marking operations from the input program, if you define a /EndPage then it will be executed after the marking operations from the input program (on a page by page basis(.
This allows you to have your own marks lie 'under' or 'over' the marks from the program.

Convert subset of postscript file to pdf documents

I have a system that generates large quantities of PostScript files that each contain multiple, multi-page documents. I want to write a script that takes these large PostScript documents and outputs multiple PDF documents from each.
For example one postscript file contains 200 letters to customers, each of which is 10 pages long. This postscript file contains 2000 pages. I want to output from this 1 ps document, 200x 10 page PDFs, one for each customer.
I'm thinking GhostScript is the way to go for this level of document manipulation but I'm not sure the best way to go - Is there a function in GhostScript to take 'pages 1-10' of the input ps file? Do I have to output the entire ps file as 2000 separate ps files (1 per page) then combine them back together again?
Or is there a much simpler way of acheiving my goal with something other than GhostScript?
Many Thanks,
Ben

Technically this will be possible in the next release of Ghostscript, or using the HEAD code in the Git repository. It is now possible to switch devices when using pdfwrite which will cause the device to close and complete the current PDF file. Switching back again will start a new one.
Combine this with a BeginPage and/or EndPage procedure in the page device dictionary, and you should be able to do something like what you want.
Caveat; I haven't tried any of this, and it will take some PostScript programming to get it to work.
Because of the nature of PostScript, there is no way to extract the 'N'th page from a file, so there is no way to specify a range of pages.
As lsemi suggests you could first convert to one large PDF file and then extract the ranges you want. Ghostscript is able to use the FirstPage and LastPage switches to do this (unlike PostScript, it is possible to extract a specific page from a PDF file).

Well, you might first make the PS into a PDF object collection (or directly generate a PDF from GhostScript by printing to the PDFWriter device), and then "cut" from the big PDF using pdftk, which would be quite fast.

Create the complete PDF file first with the help of Ghostscript:
gs \
-o 2000p.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
2000p.ps
Use pdftk to extract PDF files with 10 pages each:
for i in $(seq 0 10 199); do \
export start=$(( ${i} * 1 + 1 )); \
export end=$(( ${start} + 9 )); \
pdftk \
2000p.pdf \
cat ${start}-${end} \
output pages---${start}..${end}.pdf; \
done
You can have Ghostscript generate a 2000page sample+test PDF for you by first creating a sample PostScript file named '2000p.ps' with these contents:
%!PS
/H1 {/Helvetica findfont 48 scalefont setfont .2 .2 1 setrgbcolor} def
/pageframe {1 0 0 setrgbcolor 2 setlinewidth 10 10 575 822 rectstroke} def
/gopageno {H1 300 700 moveto } def
1 1 2000 {pageframe gopageno
4 string cvs
dup stringwidth pop
-1 mul 0 rmoveto
show
showpage} for
and then run this command:
gs -o 2000p.pdf -sDEVICE=pdfwrite -g5950x8420 2000p.ps

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas