I want to extract pages from a PDF file which has custom page numbering, e.g. the first pages are numbered C1, C2, C3, and after that the numbering starts at 1, 2, 3, 4, etc.
When I use
$ gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
-dFirstPage=22 -dLastPage=36 \
-sOutputFile=outfile_p22-p36.pdf 100p-inputfile.pdf
FirstPage and LastPage are page indices, counted from the first page of the file, which is not what I want.
How can I tell Ghostscript to use the "real" page numbers?
You can, given a lot of knowledge about the internals of Ghostscript's PDF interpreter, access the page numbers. It will require a lot of looking around in the Resource/Init/pdf*.ps files (mostly just pdf_main.ps) and an understanding of PostScript, but it is possible. Just not for the faint of heart.
To see an example PS program which digs around inside a PDF to glean information, have a look at toolbin/pdf_info.ps.
If someone comes up with a patch to allow FirstPage/LastPage to take names as labels, then we will consider it. Part of this patch should be a change that adds an option to pdf_info.ps to print the labels and the real page numbers.
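Until such a patch exists, one workaround is to translate labels to page indices outside Ghostscript and pass the indices to -dFirstPage/-dLastPage. Below is a minimal sketch of that translation, assuming you already know how the label ranges are laid out. The `ranges` list is made up to match the question's C1, C2, C3 example, the page count of 100 is only a guess from the input file name, and `build_label_map` is a hypothetical helper, not part of Ghostscript.

```python
def build_label_map(ranges, page_count):
    """Map page labels to 1-based page indices.

    ranges: list of (first_index, prefix, first_number), sorted by
    first_index (1-based). Each range runs until the next one starts.
    """
    labels = {}
    for i, (first, prefix, start) in enumerate(ranges):
        # A range ends just before the next range, or at the last page.
        end = ranges[i + 1][0] - 1 if i + 1 < len(ranges) else page_count
        for index in range(first, end + 1):
            labels[f"{prefix}{start + index - first}"] = index
    return labels


# Assumed layout: pages 1-3 are labelled C1..C3, page 4 onwards 1, 2, 3, ...
ranges = [(1, "C", 1), (4, "", 1)]
label_to_index = build_label_map(ranges, page_count=100)

first = label_to_index["22"]
last = label_to_index["36"]

cmd = [
    "gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH", "-dSAFER",
    f"-dFirstPage={first}", f"-dLastPage={last}",
    "-sOutputFile=outfile_p22-p36.pdf", "100p-inputfile.pdf",
]
print(" ".join(cmd))
```

With three C-pages in front, label 22 is the 25th page of the file and label 36 is the 39th, so those are the values that end up on the command line.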
Some time ago I found that you can use PostScript to make changes to PDF documents with Ghostscript. The available examples make the same changes to every page:
gs \
-sDEVICE=pdfwrite \
-o /path/to/output/pdf-shifted-by-1-inch-to-left.pdf \
-dPDFSETTINGS=/prepress \
-c "<</PageOffset [-72 0]>> setpagedevice" \
-f /path/to/input/pdf-original.pdf
Source: How can I shift page images in PDF files more to the left or to the right?
See also: Cropping a PDF using Ghostscript 9.01
But how could I set different offsets for different pages, without splitting up the pdf into separate files? For example move some pages to the right and some to the left.
I know of a way of doing this using pdftex but I was hoping to avoid this dependency.
Well, basically this is a PostScript question. Ghostscript's PDF interpreter is (currently) written in PostScript, so you can make changes to the PostScript graphics state which will affect the PDF interpreter, and take advantage of PostScript's language features to do programmatic tasks.
To do different things on each page you need to use a BeginPage or EndPage procedure. BeginPage is called at the start of every page, before the page's content is interpreted, and EndPage is called when the page is complete (i.e. on execution of showpage).
You'll need a BeginPage procedure to modify the page setup before the page execution runs. This will be called with a count of the number of pages transmitted so far, so you can use that to make decisions about what you want to do.
NB the current PDF interpreter executes a setpagedevice on every page, because each page of a PDF can be a different size. This means some experimentation will be required to achieve your aims.
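As a rough sketch of the BeginPage idea, the Python helper below only assembles a gs command line whose -c argument installs a BeginPage procedure that shifts alternate pages in opposite directions. The helper name `build_shift_command`, the file names and the 72-point shifts are all made up for illustration. The PostScript fragment relies on the page count that BeginPage receives, and, per the note above about setpagedevice being executed on every page, it may well need adjusting before it behaves as intended on a real PDF.

```python
def build_shift_command(input_pdf, output_pdf, shift_a=72, shift_b=-72):
    """Build a gs argument list that shifts alternate pages horizontally.

    BeginPage is called with the number of pages output so far on the
    operand stack; the procedure consumes it and translates the page by
    shift_a points when the count is even, shift_b points when it is odd.
    """
    begin_page = (
        "<< /BeginPage { 2 mod 0 eq "
        f"{{ {shift_a} 0 translate }} {{ {shift_b} 0 translate }} ifelse "
        "} >> setpagedevice"
    )
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-o", output_pdf,
        "-c", begin_page,
        "-f", input_pdf,
    ]


# Hypothetical file names, for illustration only.
cmd = build_shift_command("pdf-original.pdf", "pdf-shifted.pdf")
print(cmd)
```

The list can be passed to subprocess.run(cmd, check=True) once the PostScript has been checked by hand. Keeping the arguments as a list avoids shell quoting problems with the braces and angle brackets.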
I am using (PDF)LaTeX to make a document, and I also need to embed already existing PDF documents in it. The problem is that I have PDF documents in several different page sizes (letter, a4, etc) and I want to compile all of them into a single b5 PDF document.
If I use the pdfpages package from CTAN, all hyperlinks from the original PDFs are removed. So I tried to do it with Ghostscript.
This sounds like something normal to do but I have failed to find a working solution.
In the meantime I have read a few questions and answers, but failed to figure out what I am doing wrong and what I am missing.
This doesn't seem to address my problem of scaling.
Neither does that.
This seems to go in the right direction but I couldn't make use of the information :-(.
To make the problem easier, let's just try to resize a single PDF so that:
its contents are scaled to fit the page
the new page has the size I want
Sounds easy, and it is easy to do, for example with pdfjam:
pdfjam --outfile b5-foo.pdf --paper b5paper foo.pdf
Now the problem with this is that pdfjam throws away hyperlinks. From its website:
A potential drawback of pdfjam and other scripts based upon it is that any hyperlinks in the source PDF are lost.
This must be because it seems to use pdfpages mentioned above.
Unlike pdfjam, Ghostscript keeps hyperlinks. However, it either:
crops the original when I downscale; or
does not put the scaled content on a page of the size I need -- instead, I get a page that seems to be scaled down, while keeping the original aspect ratio.
This is what I have installed:
$ gs --version
9.21
(Installed on Linux)
This is how I can use Ghostscript to crop the content:
gs -dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dFIXEDMEDIA -sPAPERSIZE=isob5 \
-o b5-foo.pdf foo.pdf
... and here is how I can use -dPDFFitPage to scale the content but also keep the aspect ratio of the original page size:
gs -dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dFIXEDMEDIA -sPAPERSIZE=isob5 -dPDFFitPage \
-o b5-foo.pdf foo.pdf
To be even clearer: I seem to get a page that is scaled so that it would fit inside the b5 I am asking for, but it is not b5: it still has the H/W ratio the original (letter) had!
I'd be happy if this can be done just using switches but if I need to use PostScript that's perfectly fine.
The solution seems to be to use -dPSFitPage instead of -dPDFFitPage. This might have something to do with the PDF files that I am trying to resize. Unfortunately, I cannot share those :-(. When I tried to reproduce this with files that I generated myself, the problem did not occur. I don't know why this is or how I should have known it.
To summarize, using PDF files for both input and output:
-dFitPage and -dPDFFitPage give me scaled pages with the original aspect ratio
-dPSFitPage gives me scaled content on the page size I request with -sPAPERSIZE="$PAPERSIZE"
This seems to go against what the documentation says.
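For reference, here is the arithmetic I would expect a "fit page" operation to perform when letter content goes onto a B5 page: one uniform scale factor (so the content keeps its aspect ratio) plus offsets that centre it on the new page. This is only a sketch of the geometry, not Ghostscript's actual implementation. `fit_page` is a made-up helper, and the B5 size is computed from the ISO dimensions of 176 mm x 250 mm.

```python
MM_TO_PT = 72 / 25.4  # PostScript points per millimetre


def fit_page(src_w, src_h, dst_w, dst_h):
    """Return (scale, dx, dy) that fits the source page inside the target.

    The scale is uniform, so the content keeps its aspect ratio; dx and dy
    centre the scaled content on the target page.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    dx = (dst_w - src_w * scale) / 2
    dy = (dst_h - src_h * scale) / 2
    return scale, dx, dy


letter = (612.0, 792.0)                 # US letter in points
b5 = (176 * MM_TO_PT, 250 * MM_TO_PT)   # ISO B5 in points

scale, dx, dy = fit_page(*letter, *b5)
print(f"scale={scale:.4f} dx={dx:.2f}pt dy={dy:.2f}pt")
```

Here the width is the limiting dimension, so the scaled letter content fills the B5 width and leaves a band of free space above and below. The output page itself should still be B5, which is the behaviour I get with -dPSFitPage.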
I have a library manual whose creator changed some of the LaTeX code, altering the page position and size, but didn't check the result before compiling, distilling and sending it off. He is currently unavailable, so if I want to print it I have to fix it myself.
I was able to use some Ghostscript commands to push the entire text down to something approaching centered on the page. The command is shown below:
/usr/bin/gs -sDEVICE=pdfwrite -o /home/user/shiftdown.pdf -dPDFSETTINGS=/prepress -c "<</PageOffset [0 -35]>> setpagedevice" -f /home/user/brokendoc.pdf
The issue is that while the page is now printable without hitting hardware margins, the chapter titles are still halfway cut off at the top. If I open the PDF in Acrobat or Reader, I can select the chapter title and copy it and it pastes the full text in the program of my choosing. When I tried printing it on a Xerox MFP with a partially incompatible driver it printed the header, but it wouldn't duplex and I didn't want to print 700+ pages and then use the copy to 1 -> 2 function.
Does anyone know of a way to fix these cut off headers such that they either appear correctly in the PDF file or at least reliably print correctly? I have ghostscript easily available, TeX relatively easily available and the standard version of Acrobat X.
[update:]
After downloading the demo of Acrobat Pro XI, I was able to go to the "Print Production" tab and click on "Edit Object". When I clicked on the cut off chapter titles it showed me two bounding boxes that covered the entire page with one just a little taller than the other. When I right clicked on it I got the option to Add Clip and Delete Clip. When I click on Delete Clip it shows the entire chapter title. If I click on Add Clip it says, "One or more of the selected regions already have a clipping region. Proceed with setting the clipping regions for the selected objects? [No] [Yes]"
With that added information, I know there has to be a way to fix the issue in batch mode. Does anyone know what command this translates into?
Without seeing the 'brokendoc.pdf' it's hard to know. If I see the file, I can tell you what's going on, and (probably) how to fix it or work around it.
I don't need the entire file, so just a shortened version that only has a few pages that shows the problem will suffice. You might be able to get this from the complete brokendoc.pdf using:
gs -sDEVICE=pdfwrite -o part.pdf -dLastPage=10 brokendoc.pdf
Also, you may want to try:
gs -sDEVICE=pdfwrite -o fitted.pdf -dPDFFitPage -sPAPERSIZE=letter -dFIXEDMEDIA brokendoc.pdf
The above will scale (and center) the page onto the specified page size. You can specify 'letter' or 'a4' or use -dDEVICEWIDTHPOINTS=_ -dDEVICEHEIGHTPOINTS=_ to get a specific output page size. The -dFIXEDMEDIA option causes gs to ignore the MediaBox in the file.
I am using
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=book.pdf -f front-matter.pdf fulltext-0.pdf fulltext-1.pdf back-matter.pdf
to create a single PDF document from a series of PDF documents. I was going to make up a new table of contents and include it using the pdfmark mechanism. Then I noticed that the original files already have bookmarks in them. They are, however, referenced to the original page numbers, not the ones in the combined document.
I am looking for one of two possible solutions: remove the original bookmarks, or make use of the original bookmarks but somehow update their page references...
As so often the case, someone has walked the same path before you...
unfolding disasters has worked out a solution to this very problem. His Python script pdf-merge.py first invokes pdftk with its dump_data switch to retrieve all the bookmark information. It then keeps track of the number of pages in each merged document and offsets the page number in each pdfmark instruction by the total page count of all the PDF documents included before the current one. So it is close to, but not the same as, the 2-pass approach of KenS: it first discovers the bookmarks using pdftk and then creates a new bookmark file with correct page numbers. It also manages to turn the original pdfmark instruction (that would normally be preserved by gs) into a noop. I won't pretend I understand how that last part worked ...
However, the script does all I need including the option of tweaking the bookmark file before the final writing. Very neat and hat tip to Trevor King.
In general pdfwrite doesn't know you are appending files, so it preserves bookmark and other 'metadata' information on the assumption that you will want it in the output.
However, when you are combining PDF files, preserving the information won't work, as the page numbers for the second and subsequent files will be incorrect.
So you need a 2-pass approach, first merge all the files, discarding the bookmarks, then 'convert' the merged file and add pdfmarks to set the correct bookmarks.
There is currently no option (with pdfwrite) to not preserve bookmarks. You will need to modify the Ghostscript PDF interpreter PostScript files to achieve this I think. You might try setting -dDOPDFMARKS=false, but I doubt that will work.
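The page arithmetic for the second pass is simple enough to sketch. The Python below takes, for each input file in merge order, its page count and its bookmarks (title plus page number within that file), and emits pdfmark lines with page numbers adjusted for the merged document. The page counts and titles are invented for illustration, `merged_bookmarks` is a hypothetical helper, and discarding the original bookmarks in the first pass is not covered here.

```python
def escape_ps_string(text):
    """Escape backslashes and parentheses for a PostScript string literal."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def merged_bookmarks(documents):
    """documents: list of (page_count, [(title, local_page), ...]) in merge order.

    Returns pdfmark lines whose /Page values refer to the merged document.
    """
    lines = []
    offset = 0  # total pages of all documents merged so far
    for page_count, bookmarks in documents:
        for title, local_page in bookmarks:
            lines.append(
                f"[/Title ({escape_ps_string(title)}) "
                f"/Page {local_page + offset} /OUT pdfmark"
            )
        offset += page_count
    return lines


# Invented example data: page counts and bookmarks per input file.
documents = [
    (4, [("Preface", 1)]),                        # front-matter.pdf
    (30, [("Chapter 1", 1), ("Chapter 2", 15)]),  # fulltext-0.pdf
    (25, [("Chapter 3 (draft)", 1)]),             # fulltext-1.pdf
]

pdfmarks = merged_bookmarks(documents)
print("\n".join(pdfmarks))
```

The resulting lines can be saved to a file and given to gs after the merged PDF in the second pass. Titles containing non-ASCII characters would need extra encoding work that this sketch does not attempt.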
I've been using Ghostscript to convert my single figure plots rendered in PDF to PNG:
gswin32c -sDEVICE=png16m -r300x300 -sOutputFile=junk.png ^
-dBATCH -dNOPAUSE Figure_001-a.pdf
This works in the sense I get a PNG out and it contains the plot.
But it contains a huge amount of white space as well (an example source image: http://cdsweb.cern.ch/record/1258681/files/Figure_001-a.pdf).
If you view it in Acrobat you'll note there is no white space around the plot. If you use the above command line you'll find the plot is only about 1/3 of the space.
When doing the same thing with an EPS file I run into the same problem. However, there is the command-line parameter -dEPSCrop that one can pass to get the PS rendering engine to pay attention to the BoundingBox.
I need a similar argument for rendering PDFs. I was not able to find it in the docs (nor even -dEPSCrop, actually).
I had exactly the same issue. I fixed it by adding the -dUseArtBox switch.
Example:
/usr/bin/gs -dUseArtBox -dBATCH -dNOPAUSE -sDEVICE=pngalpha -sOutputFile=output.png input.pdf
Note: the -dUseArtBox switch is supported since Ghostscript version 9.07
-dUseArtBox
Sets the page size to the ArtBox rather than the MediaBox. The art box defines the extent of the page's meaningful content (including potential white space) as intended by the page's creator. The art box is likely to be the smallest box. It can be useful when one wants to crop the page as much as possible without losing the content.
There are various options to control which "media size" Ghostscript renders a given input to:
-dPDFFitPage
-dUseTrimBox
-dUseCropBox
With PDFFitPage Ghostscript will render to the current page device size (usually the default page size).
With UseTrimBox it will use the TrimBox (and it will at the same time set the PageSize to that value).
With UseCropBox it will use the CropBox (and it will at the same time set the PageSize to that value).
By default (given no parameter), Ghostscript will render using the MediaBox.
For your example, it looks like adding "-dUseCropBox" will do the job you're expecting.
Note, you can additionally control the overall size of your output by using "-sPAPERSIZE" (select amongst all pre-defined values Ghostscript knows) or (for more flexibility) use "-dDEVICEWIDTHPOINTS=NNN -dDEVICEHEIGHTPOINTS=NNN".
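To see why the choice of box matters so much for the PNG, it helps to work out the pixel dimensions. A box measured in points (1/72 inch) rendered at a given resolution comes out at roughly points / 72 * dpi pixels per side. The sketch below uses a letter-sized MediaBox and a made-up, much smaller CropBox. Neither is taken from the actual file in the question, and Ghostscript's exact rounding may differ by a pixel.

```python
def rendered_pixels(box_width_pt, box_height_pt, dpi):
    """Approximate raster size for a box given in points at a given dpi."""
    return round(box_width_pt * dpi / 72), round(box_height_pt * dpi / 72)


# Assumed boxes, for illustration only.
media_box = (612, 792)   # letter-sized MediaBox
crop_box = (200, 150)    # hypothetical CropBox tight around a plot

print(rendered_pixels(*media_box, dpi=300))
print(rendered_pixels(*crop_box, dpi=300))
```

With these assumed sizes the plot occupies only a small part of the full-page raster, which matches the "huge amount of white space" symptom when the MediaBox is used.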
Have you tried pdfcrop, which uses pdftex (it comes with TeX Live, for example), or the Python script pdfcrop (which I have not tried yet)?
I have a similar workflow using the first tool mentioned.