How to use a Ghostscript command to validate all PDFs in a folder

I wish to validate the PDFs in a Windows 7 folder containing 30,000 files. I have discovered that some of them do not render properly and give an "Insufficient data for image" error.
How can I modify the following Ghostscript command to take all the PDFs in a folder as input, rather than a single PDF or a list of PDFs?
gswin32c.exe -o nul -sDEVICE=nullpage -r36x36 "D:/Pdf/04701.pdf"

OK, first: Ghostscript doesn't 'validate' PDFs. All your command line does is determine whether the default behaviour of Ghostscript, while interpreting a file, sends any messages to stdout or stderr.
There are several problems with that. Ghostscript doesn't flag even a warning for everything, and the default behaviour is not to throw an error but to emit a warning and continue. Since you aren't looking at the output (it's going to a bit bucket), it's entirely possible that you would miss the fact that swathes of output were simply blank. So you may well miss a faulty file.
'Insufficient data for an image' is simply one possible error; there are an awful lot of badly written PDF files out there.
If you want to handle every file in a folder, you can't do it with Ghostscript alone, as it requires each input file to be specified. However, you can write a command shell script easily enough. Since you are clearly on Windows, just use a for loop and have the 'do' clause call Ghostscript just as above.
Something like:
For %s in (c:\files\*.*) do gswin32c... %s
Just type
help for
at the command shell for information on 'for'. Note that inside a batch file the loop variable has to be written as %%s rather than %s.
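For anyone who would rather script the loop outside the command shell, here is a minimal Python sketch of the same idea. It assumes gswin32c.exe is on the PATH and reuses the switches from the question, adding -q on the assumption that it only suppresses the startup banner, so that the banner is not counted as a message. The function names are made up for the example. Any remaining text on stdout or stderr, or a non-zero exit code, marks a file as worth a closer look; as explained above, that is not real validation.

```python
import subprocess
from pathlib import Path

GS_EXE = "gswin32c.exe"  # assumption: the Ghostscript console executable is on PATH


def build_command(pdf_path, gs_exe=GS_EXE):
    """Build the command line from the question for one PDF (plus -q, see lead-in)."""
    return [gs_exe, "-q", "-o", "nul", "-sDEVICE=nullpage", "-r36x36", str(pdf_path)]


def check_pdf(pdf_path, runner=subprocess.run):
    """Run Ghostscript on one file and return (suspect, messages)."""
    result = runner(build_command(pdf_path), capture_output=True, text=True)
    messages = (result.stdout + result.stderr).strip()
    suspect = result.returncode != 0 or bool(messages)
    return suspect, messages


def check_folder(folder, runner=subprocess.run):
    """Check every *.pdf in a folder; return {file name: messages} for suspects."""
    suspects = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        suspect, messages = check_pdf(pdf_path, runner)
        if suspect:
            suspects[pdf_path.name] = messages
    return suspects
```

Calling check_folder("D:/Pdf") would then return the files that produced any output. With 30,000 files it is probably worth writing the result to a log file rather than printing it.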

Related

Ghostscript pdfwrite produces zero length output when reading from PDF files, no error reported, fine with other inputs

I have a requirement to append one PDF file to another.
I felt that Ghostscript was the way forward, and installed the 64 bit Windows version (9.53.0), but if I attempt to do anything with pdfwrite where the input is a PDF, e.g.
gswin64c -DNOSAFER -sDEVICE=pdfwrite -o output.pdf input.pdf
I get zero length output (with no error messages at all). This happens whether the PDF is one of Ghostscript's shipped examples, one generated using tcpdf, or one saved from a Windows application. It happens whether I try to read from a single PDF or from multiple ones (the latter being my use case).
If I convert the input PDFs to Postscript and then use pdfwrite on those, it works like a dream, e.g.
call pdf2ps input.pdf temp.ps
gswin64c -DNOSAFER -sDEVICE=pdfwrite -o output.pdf temp.ps
EPS inputs work fine also - the only problem seems to be with PDF ones. But Ghostscript can read and display any PDF (and indeed convert any PDF to Postscript), it just can't cope with PDFs as input to pdfwrite, as far as I can see.
I can find no reference anywhere to this particular issue.
This turned out not to be limited to PDF input; it's just easier to trigger it that way. The problem was that an internal data type was changed from a build-dependent size to always being 64 bits, but a #define'd value wasn't correctly updated, so the 64-bit Windows build was still using a value intended for 32-bit builds.
There's a commit to fix the problem here. However this seems serious enough that a new build 9.53.1 (so that's patch level 1 already...) will be forthcoming shortly (if it's not already there).
It would help a lot if people could report bugs when they find this kind of problem, and it would be even better if there were volunteers to try out the release candidates. We would really prefer not to make releases with serious problems.

GhostScript creating extra page when font errors occur

I have a process that needs to write multiple PostScript and PDF files to a single PostScript file generated by, and that will continue to be modified by, Word interop VB code. Each call to Ghostscript results in an extra blank page. I am using Ghostscript 9.27.
Since there are several technologies and factors here, I've narrowed it down: the problem can be demonstrated by converting a postscript file to postscript and then to pdf via command line. The problem does not occur going directly from postscript to pdf. Here's an example and an example of the errors.
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=C:\testfont.ps C:\smallexample.ps
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=C:\testfont.pdf C:\testfont.ps
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Didn't find this font on the system!
Substituting font Times-Roman for TimesNewRomanPSMT.
I'm starting with the assumption that the font errors are the cause of the extra page (if only to rule that out, I know it is not certain). Since my ps->pdf test does not exhibit this problem and my ps->ps->pdf does, I'm thinking ghostscript is not writing font data that was in the original postscript file to the one it is creating. I'm looking for a way to preserve/recreate that in the resulting postscript file. Or if that is not possible, I'll need a way to tell ghostscript how to use those fonts. I did not have success attempting to include them as described in the GS documentation here: https://www.ghostscript.com/doc/current/Use.htm#CIDFontSubstitution.
Any help is appreciated.
I've made this an answer, even though I'm aware it doesn't answer the question, because it won't fit as a comment.
I think your assumption that the missing fonts are causing your problem is flawed. Many PDF files do not embed all the fonts they require; I've seen many such examples and they do not emit extra pages.
You haven't been entirely clear in your description of what you are doing. You describe two processes, one going from PostScript to PDF and one going from PostScript on to PostScript (WHY ?) and then to PDF.
You haven't described why you are processing PostScript into a PostScript file.
In particular you haven't supplied an example file to look at. Without that there's no way to tell whether your experience is in fact correct.
For example, it's entirely possible that you have set /Duplex true and have an odd number of pages in your file. This will cause an extra blank page (quite properly) to be emitted, because duplexing requires an even number of pages.
The documentation you linked to is for CIDFont substitution; it has nothing to do with Font substitution. CIDFonts and Fonts are different things in PDF and (more particularly) PostScript. But I honestly doubt that is your problem.
I'd suggest that you put (at the least) 'smallexample.ps' somewhere public and post the URL here, so that we can at least follow the same steps you are doing. That way we can probably tell you what's going on. An explanation of why you're doing this would be useful too. I would normally strongly suggest that you don't do extra steps like this; each step carries the risk of degrading the output in some way.
Thank you for the response. I am posting as an answer as well due to the comment length restrictions:
I think you are correct that my assumption about fonts is wrong. I have found the extra page in the second ps file and do not encounter the font errors until the second conversion.
I have a process that uses VB MSWord Interop libraries to print multiple documents to a single ps file using a virtual printer set up with Ghostscript and RedMon. I am adding functionality to mix in pdf files too. It works, but results in an extra page. To narrow down where the problem actually was, I tried much simpler test cases via the command line. I only get the extra page when Ghostscript is converting ps to ps (whether or not there is a pdf as well). Converting ps to pdf, I do not get the extra page. Interestingly, I can work around the problem by converting the ps to pdf and then both pdfs back to ps. That is slower and should not be necessary, however, so I would like to identify and resolve the extra page issue. I cannot share that particular file. I'll see if I can create an example I can share that also exhibits the problem. In the meantime, I can confirm that the source ps file is six pages and the duplexing settings are as follows. There is duplex definition in the resulting ps file with the extra page. Might there be some other common culprits I could check for in the source ps? Thank you.
featurebegin{
%%BeginFeature: *DuplexUnit NotInstalled
%%EndFeature
}featurecleanup
featurebegin{
%%BeginFeature: *Duplex None
<</Duplex false /Tumble false>> setpagedevice
%%EndFeature
}featurecleanup

Ghostscript on Unix generating huge files

I use Ghostscript 9.14, the last one compiled for HP-Unix.
I need to create PDF/A-1b files from existing pdf files from different sources.
It is preferred that this happens on a HP-Unix server because that is the server that puts them in a DMS.
The command:
gs -q -dPDFA -dBATCH -dNOPAUSE -dNOOUTERSAVE \
-dCCFONTDEBUG -dCFFDEBUG -dCMAPDEBUG -dDOCIEDEBUG -dEPSDEBUG \
-dFAPIDEBUG -dINITDEBUG -dPDFDEBUG -dPDFOPTDEBUG -dPDFWRDEBUG \
-dSETPDDEBUG -dSTRESDEBUG -dTTFDEBUG -dVGIFDEBUG -dVJPGDEBUG \
-dColorConversionStrategy=/sRGB -dProcessColorModel=/DeviceRGB \
-sDEVICE=pdfwrite -sPDFACompatibilityPolicy=2 \
-sOutputFile=debug_0901ece380001a00.pdf /usr/../PDFA_def.ps \
/0901ece380001a00.pdf
The source pdf is filled with just non-OCRed images.
I have this working with the same command on a newer version on a Windows server (Ghostscript 9.19) without problems, but I can't seem to get it working on HP-Unix.
MS Office is installed on the Windows server.
The HP-Unix command generates a 9 MB file from a 300 KB source file, and it takes ages to generate.
Ghostscript seems to be single-threaded, but 9 minutes for 35 pages is a bit much.
When I check it with Preflight in Acrobat Pro 9 Extended, the 9 MB file is truly PDF/A-1b.
Do I need to install a kind of Office software on Unix to get this working?
Or an image editing tool?
Also, how do I check the debug lines? They aren't in a readable format and I can't find any info on that.
Maybe it is something that only can be checked by the Ghostscript developers?
Almost certainly the input file contains transparency. PDF/A-1 does not support transparency, and so when creating PDF/A-1 files any page which does contain transparency is rendered to an image, and then that image is embedded in the output.
Clearly this will take time (rendering a page at 720 dpi, full colour, and transparency processing is slow) and will result in a large file. However, it's the only way to preserve the appearance of the input file and still create a PDF/A-1 file.
Of course, in the absence of an example input file to examine, it's not possible to be certain of this.
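To get a feel for why that is expensive, here is a rough back-of-the-envelope sketch in Python. The 720 dpi figure comes from the explanation above; the US Letter page size and 3 bytes per pixel (8-bit RGB) are assumptions chosen for illustration. The result is the uncompressed raster that has to be produced for each affected page before any image compression is applied, not the size that ends up in the PDF.

```python
def raster_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size in bytes of one page rendered at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed example: one US Letter page (8.5 x 11 inches) at 720 dpi, 8-bit RGB.
size = raster_bytes(8.5, 11, 720)
print(f"{size / 1e6:.0f} MB per page before compression")
```

Under those assumptions each page is a 6120 x 7920 pixel image, roughly 145 MB of raw data, which goes some way to explaining both the run time and the growth in file size.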
The DEBUG switches are useless except to Ghostscript developers; don't bother to set them. You would never set so many anyway, as you'll be swamped with extraneous detail. I'm doubtful all the ones you have listed are even valid.
You say you have this 'working' with Ghostscript 9.19 on Windows, what do you mean by 'working' ? It seems to me that the 9.14 output 'works' as well.....
As far as I know we have never compiled a Ghostscript release for HP/UX, but the current version (9.22) is known to compile and run on HP/UX.
Finally Ghostscript does not rely on (and indeed cannot make use of) Microsoft Office. Nor does it rely on the operating system for anything except memory and file access.

ghostPCL: why is this file not converted properly to PDF?

I am using ghostpcl-9.18-win64. This is the script that I used to generate the pdf file:
gpcl6win64-9.18.exe -sDEVICE=pdfwrite -sOutputFile=%1.pdf -dNOPAUSE %1.txt
The file to test can be found here and the result of running ghostpcl can be found here.
If you take a look at the pdf file, it contains only one page (there should be 2) and some of the text is missing. Why is that? I always pictured in my mind that ghostpcl would produce a pdf identical to a printout. Am I missing something, parameters perhaps?
As a matter of fact, when I used the lpr command to print the file on RHEL it printed exactly what I expected. I wonder how reliable is the ghostpcl tool in converting pcl files to PDF. And if it's not that reliable, a broader question is: is there another tool to do it? I am interested mainly in the linux version.
The txt file is based on a file generated using SQR.
Thanks
In fact the OP did raise a bug report (but didn't mention it here):
http://bugs.ghostscript.com/show_bug.cgi?id=696509
The opinion of our PCL maintainer is that the output is correct, inasmuch as it matches at least one HP printer. See the URL above for slightly more details.
Based on the discussion on the bug thread, the input file is invalid because it should have had CRLFs instead of LFs only.
If I convert the LFs to CRLFs then my input file is converted as expected to PDF. However, converting LFs to CRLFs is not a general solution. According to support LFs can be used for images. In that case converting such an LF to CRLF could break the image.
It seems that there is one thing I was wrong about on the bug thread: on our system, lpr includes carriage returns as well in the final file that gets sent to the printer. I followed the instructions here: https://wiki.ubuntu.com/DebuggingPrintingProblems, specifically the 'Getting the data which would go to the printer' section, to print to a file, and the file includes carriage returns.
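As a rough illustration of the LF to CRLF workaround discussed above, the Python sketch below converts bare line feeds to CRLF pairs and leaves existing CRLF pairs alone. It is only appropriate for plain-text input such as this SQR report; as already noted, it could corrupt a PCL file that carries binary data such as images. The function names are made up for the example.

```python
import re

# A line feed that is not already preceded by a carriage return.
BARE_LF = re.compile(rb"(?<!\r)\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Replace bare LF bytes with CRLF, leaving existing CRLF pairs untouched."""
    return BARE_LF.sub(b"\r\n", data)


def convert_file(src_path, dst_path):
    """Write a copy of src_path with CRLF line endings to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(lf_to_crlf(data))
```

The converted copy, not the original, would then be passed to ghostpcl.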

Use ghostscript to delete a page (not extracting a range)

I know Ghostscript can use -dFirstPage and -dLastPage to make a file from only a range of pages, but I need to make it (or another command line program) delete the 2nd page of any PDF without being told the page count explicitly. I thought this would be far easier, because most printers let you specify "1,3-end", and I have been using PDFCreator to do it that way.
The one way I can think of doing it (very very messy) is to extract page 1, extract pages 3 to end, and then merge the two pdfs. But I also don't know how to have GS determine the number of pages.
Use the right tool for the job!
For reasons outlined by KenS, Ghostscript is not the best tool for what you want to achieve. A better tool for this task is pdftk. To remove the 2nd page from input.pdf, you should run this command line:
pdftk input.pdf cat 1 3-end output output.pdf
OK, first things first: if you use Ghostscript's pdfwrite device you are NOT extracting, or deleting, or performing any other 'manipulation' operation on your source PDF file. I keep on reiterating this, but I'm going to say it again.
When you pass an input file through Ghostscript it is completely interpreted to a series of graphical primitives which are passed to the device. In general the device will render the primitives to a bitmap. In the case of the 'high level' devices such as pdfwrite, the primitives are re-assembled into a brand new file, in the case of pdfwrite a PDF file.
This flexibility allows for input in a number of different page description languages (PostScript, PDF, PCL, PCL-XL, XPS) and then output in a few different high level formats (PostScript, EPS, flavours of PDF, XPS, PCL, PCL-XL).
But the new file bears no relation to the original, other than its appearance.
Now, having got that out of the way... You can use the pdf_info.ps PostScript program, supplied in the 'toolbin' directory of the Ghostscript installation, to get a variety of information about PDF files; one of the things you can get is the number of pages in the PDF. But you don't need to bother: run the file once with -dLastPage=1, then run it again with -dFirstPage=3 (don't set LastPage), then run both resulting files through pdfwrite to create a file with the pages from each combined.
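The three-pass approach can be sketched in Python as follows. To drop page 2, the first pass keeps everything up to page 1, the second pass starts at page 3, and the third pass feeds both intermediate files to pdfwrite. The executable name, the helper names and the temporary file names are assumptions made for the example, and the sketch does not handle removing the first or the last page.

```python
import subprocess

GS_EXE = "gswin32c.exe"  # assumption: the Ghostscript console executable is on PATH


def delete_page_commands(src, dst, page, gs_exe=GS_EXE):
    """Return the three Ghostscript command lines that rebuild src without `page`."""
    if page < 2:
        raise ValueError("this sketch only handles page >= 2")
    common = [gs_exe, "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite"]
    head = dst + ".head.pdf"  # hypothetical temporary file: pages before `page`
    tail = dst + ".tail.pdf"  # hypothetical temporary file: pages after `page`
    return [
        common + ["-sOutputFile=" + head, "-dLastPage=%d" % (page - 1), src],
        common + ["-sOutputFile=" + tail, "-dFirstPage=%d" % (page + 1), src],
        common + ["-sOutputFile=" + dst, head, tail],
    ]


def delete_page(src, dst, page, runner=subprocess.run):
    """Run the three passes in order, stopping at the first failure."""
    for command in delete_page_commands(src, dst, page):
        runner(command, check=True)
```

For example, delete_page("input.pdf", "output.pdf", 2) would write output.pdf without the second page. As stressed above, the result is a newly generated PDF rather than an edited copy of the original, which is why a tool such as pdftk is usually the better choice.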