Use ghostscript to delete a page (not extracting a range) - pdf

I know Ghostscript can use -dFirstPage and -dLastPage to make a file from a range of pages, but I need it (or another command-line program) to delete the 2nd page of any PDF without being told the total number of pages explicitly. I thought this would be far easier, because most printers let you specify "1,3-end", and I have been using PDFCreator to do it that way.
The one way I can think of doing it (very very messy) is to extract page 1, extract pages 3 to end, and then merge the two pdfs. But I also don't know how to have GS determine the number of pages.

Use the right tool for the job!
For reasons outlined by KenS, Ghostscript is not the best tool for what you want to achieve. A better tool for this task is pdftk. To remove the 2nd page from input.pdf, you should run this command line:
pdftk input.pdf cat 1 3-end output output.pdf
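If the page to drop is not always the 2nd one, the page ranges can be computed. This is only a sketch, assuming a POSIX shell with pdftk on the PATH; the function names pages_without and remove_page are made up for illustration and are not part of pdftk.

```shell
#!/bin/sh
# pages_without N: print the pdftk page ranges that keep every page except page N.
pages_without() {
    n=$1
    case $n in
        ''|*[!0-9]*|0) echo "page number must be a positive integer" >&2; return 1 ;;
    esac
    if [ "$n" -eq 1 ]; then
        echo "2-end"
    elif [ "$n" -eq 2 ]; then
        echo "1 3-end"
    else
        echo "1-$((n - 1)) $((n + 1))-end"
    fi
}

# remove_page INPUT N OUTPUT: hypothetical wrapper around the pdftk call shown above.
remove_page() {
    ranges=$(pages_without "$2") || return 1
    # $ranges is left unquoted on purpose so that it splits into separate arguments.
    pdftk "$1" cat $ranges output "$3"
}
```

Usage would be `remove_page input.pdf 2 output.pdf`. The sketch does not handle the case where N is the last page: the second range would then start past the end of the document, and I would expect pdftk to reject it.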

OK first things first, if you use Ghostscript's pdfwrite device you are NOT extracting, or deleting, or performing any other 'manipulation' operation on your source PDF file. I keep on reiterating this, but I'm going to say it again.
When you pass an input file through Ghostscript it is completely interpreted into a series of graphical primitives which are passed to the device. In general the device will render the primitives to a bitmap. In the case of the 'high level' devices such as pdfwrite, the primitives are re-assembled into a brand new file; in the case of pdfwrite, a PDF file.
This flexibility allows for input in a number of different page description languages (PostScript, PDF, PCL, PCL-XL, XPS) and then output in a few different high level formats (PostScript, EPS, flavours of PDF, XPS, PCL, PCL-XL).
But the new file bears no relation to the original, other than its appearance.
Now, having got that out of the way... You can use the pdf_info.ps PostScript program, supplied in the 'toolbin' directory of the Ghostscript installation, to get a variety of information about PDF files; one of the things you can get is the number of pages in the PDF. But you don't need to bother: run the file once with -dLastPage=1, then run it again with -dFirstPage=3 (don't set LastPage), then run both resulting files through Ghostscript together to create a single file with the pages from each combined.
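As a rough sketch of those three runs, assuming a POSIX shell with the executable available as gs (on Windows it would be gswin32c.exe or similar), and assuming page 2 is the one to drop, so that the second run starts at page 3. The function name delete_page2 and the temporary file names are made up for illustration.

```shell
#!/bin/sh
# delete_page2 INPUT OUTPUT: write a new PDF containing every page of INPUT except page 2.
# The body runs in a subshell so the helper variables do not leak into the caller.
delete_page2() (
    in=$1
    out=$2
    tmp=$(mktemp -d) || exit 1
    status=0

    # Run 1: page 1 only.
    # Run 2: page 3 to the end (no -dLastPage, so the page count is never needed).
    # Run 3: feed both intermediate files to pdfwrite to get one combined PDF.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dLastPage=1 \
       -sOutputFile="$tmp/first.pdf" "$in" &&
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dFirstPage=3 \
       -sOutputFile="$tmp/rest.pdf" "$in" &&
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -sOutputFile="$out" "$tmp/first.pdf" "$tmp/rest.pdf" || status=$?

    rm -rf "$tmp"
    exit "$status"
)
```

Usage would be `delete_page2 input.pdf output.pdf`. Inputs with fewer than three pages are not handled here, and as explained above, the result is a newly written PDF rather than an edited copy of the original.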

GhostScript creating extra page when font errors occur

I have a process that needs to write multiple postscript and pdf files to a single postscript file generated by, and that will continue to be modified by, word interop VB code. Each call to ghostscript results in an extra blank page. I am using GhostScript 9.27.
Since there are several technologies and factors here, I've narrowed it down: the problem can be demonstrated by converting a postscript file to postscript and then to pdf via command line. The problem does not occur going directly from postscript to pdf. Here's an example and an example of the errors.
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=C:\testfont.ps C:\smallexample.ps
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=C:\testfont.pdf C:\testfont.ps
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Didn't find this font on the system!
Substituting font Times-Roman for TimesNewRomanPSMT.
I'm starting with the assumption that the font errors are the cause of the extra page (if only to rule that out, I know it is not certain). Since my ps->pdf test does not exhibit this problem and my ps->ps->pdf does, I'm thinking ghostscript is not writing font data that was in the original postscript file to the one it is creating. I'm looking for a way to preserve/recreate that in the resulting postscript file. Or if that is not possible, I'll need a way to tell ghostscript how to use those fonts. I did not have success attempting to include them as described in the GS documentation here: https://www.ghostscript.com/doc/current/Use.htm#CIDFontSubstitution.
Any help is appreciated.
I've made this an answer, even though I'm aware it doesn't answer the question, because it won't fit as a comment.
I think your assumption that the missing fonts are causing your problem is flawed. Many PDF files do not embed all the fonts they require, I've seen many such examples and they do not emit extra pages.
You haven't been entirely clear in your description of what you are doing. You describe two processes, one going from PostScript to PDF and one going from PostScript on to PostScript (WHY ?) and then to PDF.
You haven't described why you are processing PostScript into a PostScript file.
In particular you haven't supplied an example file to look at. Without that there's no way to tell whether your experience is in fact correct.
For example, it's entirely possible that you have set /Duplex true and have an odd number of pages in your file. This will cause an extra blank page (quite properly) to be emitted, because duplexing requires an even number of pages.
The documentation you linked to is for CIDFont substitution, it has nothing to do with Font substitution, CIDFonts and Fonts are different things in PDF and (more particularly) PostScript. But I honestly doubt that is your problem.
I'd suggest that you put (at the least) 'smallexample.ps' somewhere public and post the URL here, so that we can follow the same steps you are doing and probably tell you what's going on. An explanation of why you're doing this would be useful too. I would normally strongly suggest that you don't do extra steps like this; each step carries the risk of degrading the output in some way.
Thank you for the response. I am posting as an answer as well due to the comment length restrictions:
I think you are correct that my assumption about fonts is wrong. I have found the extra page in the second ps file and do not encounter the font errors until the second conversion.
I have a process that uses VB MSWord Interop libraries to print multiple documents to a single ps file using a virtual printer set up with ghostscript and redmon. I am adding functionality to mix in pdf files too. It works, but results in an extra page. To narrow down where the problem actually was, I tried much simpler test cases via command line to identify the problem. I only get the extra page when ghostscript is converting ps to ps (whether or not there is a pdf as well). Converting ps to pdf I do not get the extra page. Interestingly, I can work around the problem by converting the ps to pdf and then both pdfs back to ps. That is slower and should not be necessary, however, so I would like to identify and resolve the extra page issue. I cannot share that particular file. I'll see if I can create an example I can share that also exhibits the problem. In the meantime, I can confirm that the source ps file is six pages and the duplexing settings are as follows. There is duplex definition in the resulting ps file with the extra page. Might there be some other common culprits I could check for in the source ps? Thank you.
featurebegin{
%%BeginFeature: *DuplexUnit NotInstalled
%%EndFeature
}featurecleanup
featurebegin{
%%BeginFeature: *Duplex None
<</Duplex false /Tumble false>> setpagedevice
%%EndFeature
}featurecleanup
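
One quick way to compare the source file with the converted one is to count the DSC page comments and list every Duplex setting in each. This is only a sketch: it assumes both files carry "%%Page:" comments (not every PostScript file does), it assumes a POSIX shell is available (on Windows, something like Git Bash or WSL), and the function names are made up for illustration.

```shell
#!/bin/sh
# count_dsc_pages FILE: count the "%%Page:" DSC comments in a PostScript file.
# -a makes grep treat the file as text even if it contains binary data.
count_dsc_pages() {
    LC_ALL=C grep -a -c '^%%Page:' "$1"
}

# find_duplex FILE: list every line that mentions Duplex, prefixed with its line number.
find_duplex() {
    LC_ALL=C grep -a -n 'Duplex' "$1"
}
```

Running `count_dsc_pages smallexample.ps` and `count_dsc_pages testfont.ps` should show whether the extra page is already declared in the intermediate file, and `find_duplex` shows which Duplex settings each file carries.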

Count PDF nodes using ghostscript?

I've already used Ghostscript to check PDF files, and now I need to identify PDFs with more than 1000 nodes. Is it possible to use Ghostscript to count the number of nodes a PDF has?
My knowledge with ghostscript is basic and I have difficulty finding a solution in ruby (PDF reader 1.3) or using tools like imageMagick.
Edit:
I cannot explain in a more technical way what kind of node I'm looking for. These nodes are equivalent to those found in CorelDRAW. Initially I thought there would be no equivalent in PDF; however, the PitStop plugin has the functionality to identify nodes.
Example of identified nodes by PitStop Pro
Those aren't 'nodes', they are the start and end points of path segments. Ghostscript doesn't have a device to extract paths. It could do so easily enough (and recover the curve control points which don't appear to be displayed in your PitStop screen grab).
However, Ghostscript isn't an editing tool, so it's not at all clear what you would intend to do with the information.
If you want this, then you are going to have to either parse the PDF yourself, or write a Ghostscript device to retrieve the information, or write a program for some other tool (eg MuPDF) to extract path information.
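For the MuPDF route, a rough sketch that avoids writing a program: mutool draw has a trace output format that dumps the drawing operations as XML, and the path segment elements in it can be counted. This assumes mutool is on the PATH and that your MuPDF version names the elements moveto, lineto and curveto; please check the trace output of your own version first. The function name count_path_points is made up for illustration.

```shell
#!/bin/sh
# count_path_points FILE: count moveto/lineto/curveto elements in MuPDF's trace output.
count_path_points() {
    mutool draw -F trace "$1" \
        | grep -o -E '<(moveto|lineto|curveto) ' \
        | wc -l \
        | tr -d ' '
}
```

Usage would be `count_path_points input.pdf`, and the result can then be compared against 1000. The count covers all pages and all paths together, so it is a coarse figure rather than a per-object node count like the one PitStop displays.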

Why does the combination pdf2ps / ps2pdf shrink the PDF?

When researching how to compress a bunch of PDFs with pictures inside (ideally in a lossless fashion, but I'll settle for lossy) I found that a lot of people recommend doing this:
$ pdf2ps file.pdf
$ ps2pdf file.ps
This works! The resulting file is smaller and looks at least good enough.
How / why does this work?
Which settings can I tweak in this process?
If there is some lossy conversion, which one is that?
Where is the catch?
People who recommend this procedure rarely do so from a background of expertise or knowledge -- it's rather based on gut feelings.
The detour of generating a new PDF via PostScript and back (also called "refrying a PDF") is never going to give you optimal results. Sometimes it is useful, e.g. in cases where the original PDF doesn't print at all, or cannot be processed by another application. But these cases are very rare.
In any case, this "roundtrip" conversion will never lead to the same PDF file as initially.
Also, the pdf2ps and ps2pdf tools aren't independent tools at all: they are just simple wrapper scripts around a Ghostscript (gs or gswin32c.exe) command line. You can check that yourself by doing:
cat $(which ps2pdf)
cat $(which pdf2ps)
This will also reveal the (default) parameters these simple wrappers use for the respective conversions.
If you are unlucky, you will have an ancient Ghostscript installed. The PostScript which is then generated by pdf2ps will be Level 1 PS, and this will be "lossy" for many fonts which could be used by more modern PDF files, resulting in rasterization of previous vector fonts. Not exactly the output you'd like to look at...
Since both tools are using Ghostscript anyway (but behind your back), you are better off to run Ghostscript yourself. This gives you more control over the parameters it uses. Especially advantageous is the fact that this way you can get a direct PDF->PDF conversion, without any detour via an intermediary PostScript file format.
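A minimal sketch of such a direct PDF->PDF run, assuming the executable is available as gs. The -dPDFSETTINGS presets (/screen, /ebook, /printer, /prepress, /default) are pdfwrite's predefined parameter sets; the smaller ones downsample and recompress images, which is where the lossy part comes from. The function name shrink_pdf and the choice of /ebook as the default are my own assumptions.

```shell
#!/bin/sh
# shrink_pdf INPUT OUTPUT [PRESET]: rewrite a PDF with pdfwrite in a single pass.
# PRESET is one of screen, ebook, printer, prepress, default (ebook if omitted).
shrink_pdf() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -dPDFSETTINGS="/${3:-ebook}" \
       -sOutputFile="$2" "$1"
}
```

Usage would be `shrink_pdf file.pdf smaller.pdf` or `shrink_pdf file.pdf smaller.pdf screen`. Compare the output visually with the original before throwing the original away.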
Here are a few answers which would give you some hints about what parameters you could use in order to drive the file size down in a semi-controlled way in your output PDF:
Optimize PDF files (with Ghostscript or other) (StackOverflow)
Remove / Delete all images from a PDF using Ghostscript or ImageMagick (StackOverflow)

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of windows builds available.
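To illustrate the per-page options mentioned above, here is a small sketch assuming the poppler utilities are on the PATH. Both pdftops and pdfimages take -f (first page) and -l (last page); the function names are made up for illustration.

```shell
#!/bin/sh
# page_to_ps INPUT PAGE OUTPUT: convert a single page of a PDF to PostScript.
page_to_ps() {
    pdftops -f "$2" -l "$2" "$1" "$3"
}

# page_images INPUT PAGE PREFIX: extract the bitmap images of a single page.
# pdfimages writes one file per image, named after PREFIX.
page_images() {
    pdfimages -f "$2" -l "$2" "$1" "$3"
}
```

Usage would be `page_to_ps input.pdf 2 page2.ps` or `page_images input.pdf 2 page2`. Note that this works on whole pages only, not on a selected area within a page.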
The open source tool from iText called iText RUPS does what you want, showing you all the PDF commands for a particular PDF and allow you to visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/

Cross-platform library for separating and converting PDFs to images?

I'm currently planning an application which involves manipulating PDFs. My goal is to have a program that I can pass a PDF to as input, and which then saves separate grayscale images of the colour channels that the PDF consists of as output. This is basically a simple RIP.
I'm currently using a solution based on GhostScript, but I want to rewrite the application to optimise speed and usability. (GhostScript doesn't separate PDFs, for example.)
Do you know of any other open source libraries that i may find useful to achieve this?
Did you ever try to run (I'm assuming Windows here):
mkdir separated
gswin32c.exe ^
-o separated/page_%04d.tif ^
-sDEVICE=tiffsep ^
d:/path/to/input.pdf
(You can also try -sDEVICE=tiffsep1.) Did you then look at the files you got in the separated subdirectory? Doesn't that count as Ghostscript separating PDFs, in your view?
The device tiffsep creates multiple output files:
One single 32-bit composite CMYK file (tiff32nc format) per PDF page.
Multiple tiffgray files (each compressed with LZW) per PDF page, one for each separation.
The device tiffsep1 behaves similarly, but...
...doesn't create the composite output file...
...and it creates tiffg4 output files for the separations.