I have a 100-page PDF that is about 50 MB. I am running the script below against it, and it's taking about 23 seconds per page. The PDF is a scan of a paper document.
gswin32.exe -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.3
-dPDFSETTINGS=/screen -sOutputFile=out4.pdf 09.pdf
Is there anything I can do to speed this up? I've determined that -dPDFSETTINGS=/screen is what is making it so slow, but I'm not getting good compression without it...
UPDATE:
OK, I tried updating it to what I have below. Am I using the -c 30000000 setvmthreshold portion correctly?
gswin32.exe -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.3
-dPDFSETTINGS=/screen -dNumRenderingThreads=2 -sOutputFile=out7.pdf
-c 30000000 setvmthreshold -f 09.pdf
If you are on a multicore system, make it use multiple CPU cores with:
-dNumRenderingThreads=<number of cpus>
Let it use up to 30 MB of RAM:
-c "30000000 setvmthreshold"
Try disabling the garbage collector:
-dNOGC
For more details, see the Improving Performance section of the Ghostscript docs.
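Putting the first two suggestions together with your original command, the invocation might look like the sketch below. The thread count and threshold are illustrative, and the final echo is a dry run so you can inspect the command before executing it:

```shell
# Sketch: combined command line (values illustrative -- tune to your machine).
GS=gswin32.exe                 # or plain `gs` on Linux/macOS
THREADS=2                      # set to your number of CPU cores
VMTHRESHOLD=30000000           # ~30 MB of RAM before garbage collection

CMD="$GS -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/screen \
 -dNumRenderingThreads=$THREADS -sOutputFile=out.pdf \
 -c \"$VMTHRESHOLD setvmthreshold\" -f in.pdf"
echo "$CMD"                    # dry run: prints the command instead of running it
```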
I was crunching a ~300 page PDF on a core i7 and found that adding the following options provided a significant speedup:
(the % marks a comment to the right of each option)
-dNumRenderingThreads=8 % increasing up to 64 didn't make much difference
-dBandHeight=100 % didn't matter much
-dBandBufferSpace=500000000 % (500MB)
-sBandListStorage=memory % may or may not need to be set when gs is compiled
-dBufferSpace=1000000000 % (1GB)
The -c 1000000000 setvmthreshold -f thing didn't make much difference for me, FWIW.
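For reference, those options assembled into one command line might look like the sketch below. The png16m device and file names are my illustrative assumptions (the banding options only apply when rendering to a raster device, not to pdfwrite), and the echo makes it a dry run:

```shell
# Sketch: the banding/threading options above on one command line.
# png16m and the file names are illustrative assumptions.
opts="-dNumRenderingThreads=8 -dBandHeight=100 -dBandBufferSpace=500000000 \
 -sBandListStorage=memory -dBufferSpace=1000000000"
echo gs -sDEVICE=png16m -r300 -o out.png $opts in.pdf   # drop `echo` to execute
```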
You don't say what CPU and what amount of RAM your computer is equipped with.
Your situation is this:
A scanned document as PDF, sized about 500 kB per page on average. That means each page basically is a picture, using the scan resolution (at least 200 dpi, maybe even 600 dpi).
You are re-distilling it with Ghostscript, using -dPDFSETTINGS=/screen. This setting will do quite a few things to make the file size smaller. Amongst the most important are:
Re-sample all (color or grayscale) images to 72dpi
Convert all colors to sRGB
Both these operations can be quite "expensive" in terms of CPU and/or RAM usage.
BTW, your setting of -dCompatibilityLevel=1.3 is not required; it's already implicitly set by -dPDFSETTINGS=/screen.
Try this:
gswin32.exe ^
-o output.pdf ^
-sDEVICE=pdfwrite ^
-dPDFSETTINGS=/screen ^
-dNumRenderingThreads=2 ^
-dMaxPatternBitmap=1000000 ^
-c "60000000 setvmthreshold" ^
-f input.pdf
Also, if you are on a 64-bit system, try installing the most recent 32-bit Ghostscript version (9.00). It performs better than the 64-bit version.
Let me tell you that downsampling a 600 dpi scanned page image to 72 dpi usually does not take 23 seconds for me, but less than one.
To speed up rasterizing a pdf with large bitmap graphics to a high-quality 300 ppi png image, I found that setting -dBufferSpace as high as possible and -dNumRenderingThreads to as many cores as available was the most effective for most files, with -dBufferSpace providing the most significant lift.
The specific values that worked the best were:
-dBufferSpace=2000000000 for 2 gigabytes of buffer space. This took the rasterization of one relatively small file from 14 minutes to just 50 seconds. For smaller files, there wasn't much difference from setting this to 1 gigabyte, but for larger files, it made a significant difference (sometimes 2x faster). Trying to go to 3 gigabytes or above for some reason resulted in an error on startup "Unrecoverable error: rangecheck in .putdeviceprops".
-dNumRenderingThreads=8 for a machine with 8 cores. This took the rasterization of that same file from 14 minutes to 4 minutes (and 8 minutes if using 4 threads). Combining this with the -dBufferSpace option above took it from 50 seconds to 25 seconds. When combined with -dBufferSpace, however, there appeared to be diminishing returns as the number of threads was increased, and for some files there was little effect at all. Strangely, for some larger files, setting the number of threads to 1 was actually faster than any other number.
The command overall looked like:
gs -sDEVICE=png16m -r300 -o document.png -dNumRenderingThreads=8 -dBufferSpace=2000000000 -f document.pdf
This was tested with Ghostscript 9.52, and came out of testing the suggestions in @wpgalle3's answer as well as the Improving performance section of the Ghostscript documentation.
A key takeaway from the documentation was that when ghostscript uses "banding mode" due to the raster image output being larger than the value for -dMaxBitmap, it can take advantage of multiple cores to speed up the process.
Options that were ineffective or counterproductive:
Setting -c "2000000000 setvmthreshold" (2 gigabytes) either alone or with -dBufferSpace didn't appear to make a difference.
Setting -sBandListStorage=memory resulted in a segmentation fault.
Setting -dMaxBitmap=2000000000 (2 gigabytes) significantly slowed down the process and apparently caused it to go haywire, writing hundreds of gigabytes of temporary files without any sign of stopping, prompting me to kill the process short.
Setting -dBandBufferSpace to half of -dBufferSpace didn't make a difference for smaller files, but actually slowed down the process rather significantly for larger files by 1.5-1.75x. In the Banding parameters section of the Ghostscript documentation, it's actually suggested not to use -dBandBufferSpace: "if you only want to allocate more memory for banding, to increase band size and improve performance, use the BufferSpace parameter, not BandBufferSpace."
I may be completely out of place here, but have you tried the DjVu file format? It works like a charm for scanned documents in general (even if there are lots of pictures), and it gives much better compressed files: I generally get a lossless factor-of-two size reduction on B&W scientific articles.
I have lots of PDF documents which take up a lot of space in the DB because they are scans of text documents with handwritten annotations. Each page is actually a JFIF image.
I tried a command from a similar SO question which resulted in a considerable reduction in size (11 MB to 1 MB) but also a considerable loss of quality, as the text becomes very blurry and hard to read:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -dCompatibilityLevel=1.4 -dEncodeColorImages=true -sOutputFile=/tmp/document-1.pdf /tmp/document.pdf
How do I reduce the size of my PDF with minor loss of quality?
From here: https://stackoverflow.com/a/50018211/8315843
gs -dBATCH -dNOPAUSE -sDEVICE=pdfimage8 -r150 -sOutputFile=/tmp/document-1.pdf /tmp/document.pdf
The /ebook profile also uses a resolution of 150 dpi, among lots of other parameters:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -sOutputFile=/tmp/document-1.pdf /tmp/document.pdf
Or you could try k2pdfopt.
k2pdfopt -mode copy.
It sticks to older compression algorithms because of compatibility constraints, so it may increase PDF file size, but it usually reduces PDF file size and almost always reduces loading time.
I am writing this answer because I also searched a lot for a one-liner that would let me downscale the scanned images and set the DPI of the resulting PDF.
The answer from Sybuser is correct and will produce the wanted effect, but I was looking for something more flexible than a predefined list of -dPDFSETTINGS profiles.
The official Ghostscript documentation is the resource I recommend for understanding what possibilities Ghostscript offers.
I used the following combination of switches:
gs -sDEVICE=pdfimage8 -dDownScaleFactor=2 -r150 -q -o output.pdf Clopotele-Hemingway.pdf
to reduce 9 pages of a scanned book from 2.3 MB down to 660 KB (screenshots comparing the original with the pdfimage8 / -dDownScaleFactor=2 / 150 dpi output omitted here).
You can find all the necessary details and more in the official Ghostscript documentation; by experimenting with the three switches, you can tune a setting that best suits your needs:
-sDEVICE=pdfimage8
-dDownScaleFactor=2
-r150
Happy compressing!
I am extracting text from a PDF, and for that I am using Ghostscript v9.52.
The time taken by Ghostscript with the default txtwrite command is ~400 ms, and the commands are:
-dSafer -dBATCH -dNOPAUSE -sPDFPassword=thispdf -sDEVICE=txtwrite -o - pdf.pdf
Then I tried lowering the rendering resolution, which saved some time and brought it down to ~300 ms:
-dSafer -dBATCH -dNOPAUSE -r2 -dDEVICEWIDTHPOINTS=50 -dDEVICEHEIGHTPOINTS=50 -dFIXEDMEDIA -sPDFPassword=thispdf -sDEVICE=txtwrite -o - pdf.pdf
I have no idea how setting a low resolution helps here.
How can I speed up text extraction to somewhere near 100 ms, if possible?
If that's how long it's taking, then that's the length of time it takes. The interpreter has to start up and establish a full working PostScript environment, then fully interpret the input, including all the fonts, and pass that to the output device. The output device records the font, point size, orientation, colour, and position, and attempts to calculate the Unicode code points for all the text. Then, depending on the options (which you haven't given), it may reorder the text before output. Then it outputs the text, closes the input and output files, releases all the memory used, and cleanly shuts down the interpreter.
You haven't given an example of the file you are using, but half a second doesn't seem like a terribly long time to do all that.
In part you can blame all the broken PDF files out there, every time another broken file turns up 'but Acrobat reads it' another test has to be made and a work-around established, generally all of which slow the interpreter down.
The resolution will have no effect, and I find it very hard to believe the media size makes any difference at all, since it's not used. Don't use NOGC, that's a debugging tool and will cause the memory usage to increase.
The most likely way to improve performance would be not to shut the interpreter down between jobs, since the startup and shutdown are probably the largest part of the time spent when it's that quick. Of course that means you can't simply fork Ghostscript, which likely means doing some development with the API and that would potentially mean you were infringing the AGPL, depending what your eventual plans are for this application.
If you would like to supply a complete command line and an example file I could look at it (you could also profile the program yourself) but I very much doubt that your goal is attainable, and definitely not reliably attainable for any random input file.
I have no experience of programming.
My PDFs won't display images on the iPad in PDFExpert or GoodNotes as the images are in JPEG2000, from what I could find on the internet.
These are large PDFs, up to 1500-2000 pages with images. One of these was an 80 MB or so file. I tried printing it with Foxit to convert the images from JPEG2000 to JPG, but the file size jumped to 800 MB...plus it's taking too long.
I stumbled upon Ghostscript, but I have NO clue how to use the command line interface.
I am very short on time. Pretty much need a step by step guide for a small script that converts all my PDFs in one go.
Very sorry about my inexperience and helplessness. Can someone spoon-feed me the steps for this?
EDIT: I want to switch the JPEG2000 to any other format that produces less of an increase in file size and causes a minimal loss in quality (within reason). I have no clue how to use Ghostscript. I basically want to change the compression on the images to something that will display correctly on the iPad while maintaining the quality of the rest of the text, as well as the embedded bookmarks.
I'll repeat that I have NO experience with command line...I don't even know how to point GS to the folder my PDFs are in...
You haven't really said what it is you want. 'Convert' PDFs how, exactly?
Note that switching from JPX (JPEG2000) to JPEG will result in a quality loss, because the image data will be quantised (with a different quantisation scheme to JPX) by the JPEG encoder. You can use a lossless compression scheme instead, but then you won't get the same kind of compression. You won't get the same compression ratio as JPX anyway, no matter what you use; the result will be larger.
A simple Ghostscript command would be:
gs -sDEVICE=pdfwrite -o out.pdf in.pdf
Because JPEG2000 encoding is (or at least, was) patent-encumbered, the pdfwrite device doesn't write images as JPX; by default it will write them several times with different compression schemes, and then use the one that gives the best compression (practically always JPEG).
Getting better results will require a more complex command line, but you'll also have to be more explicit about what exactly you want to achieve, and what the perceived problem with the simplistic command line is.
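For instance, if the priority is avoiding further quality loss rather than size, pdfwrite can be told to use lossless Flate on the images instead of auto-picking a filter. This is a sketch with illustrative file names, shown as a dry run via echo:

```shell
# Sketch: force lossless Flate compression on colour and grayscale images
# instead of letting pdfwrite auto-select (usually JPEG). The output will
# be larger, but the images are not re-quantised. File names illustrative.
opts="-dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
 -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode"
echo gs -sDEVICE=pdfwrite -o out.pdf $opts in.pdf   # drop `echo` to execute
```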
[EDIT]
Well, giving help on executing a command line is a bit off-topic for Stack Overflow, this is supposed to be a site for software developers :-)
Without knowing what operating system you are using it's hard to give you detailed instructions; I also have no idea what an iPad uses. I don't generally use Apple devices and my only experience is with Macs.
Presumably you know where (the directory) you installed Ghostscript. Either open a command shell there and type the command ./gs, or execute the command by giving the full path, such as:
/usr/bin/gs
I thought the arguments on the command line were self-explanatory, but....
The -sDEVICE=pdfwrite switch tells Ghostscript to use the pdfwrite device, as you might guess from the name, that device writes PDF files as its output.
The -o switch is the name (and full path if required) of the output file.
The final argument is the name (and again, full path if it's not in the current directory) of the input file.
So a command might look like:
/usr/bin/gs -sDEVICE=pdfwrite -o /home/me/output.pdf /home/me/input.pdf
Or if Ghostscript and the input file are in the same directory:
./gs -sDEVICE=pdfwrite -o out.pdf input.pdf
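As for converting all your PDFs in one go, a small shell loop over the files will do it. This is a sketch that assumes gs is on your PATH and the PDFs are in the current directory; the echo makes it a dry run so you can check the commands before removing it:

```shell
# Sketch: run pdfwrite over every PDF in the current directory.
# Results go to ./converted/ so the originals are never overwritten.
mkdir -p converted
for f in *.pdf; do
    [ -e "$f" ] || continue          # nothing to do if no PDFs matched
    echo gs -sDEVICE=pdfwrite -o "converted/$f" "$f"   # drop `echo` to execute
done
```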
I'm not sure if this is the right place to post this question.
I'm trying to reduce the size of multiple 7 MB PDF files, so I tried these Ghostscript commands I found online:
simple ghostscript with printer quality setting
gswin32c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/printer -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
tried this
gswin32c.exe -o output.pdf -sDEVICE=pdfwrite -dColorConversionStrategy=/LeaveColorUnchanged -dDownsampleMonoImages=false -dDownsampleGrayImages=false -dDownsampleColorImages=false -dAutoFilterColorImages=false -dAutoFilterGrayImages=false -dColorImageFilter=/FlateEncode -dGrayImageFilter=/FlateEncode input.pdf
and this
gswin32c.exe -o output.pdf -sDEVICE=pdfwrite -dColorConversionStrategy=/LeaveColorUnchanged -dEncodeColorImages=false -dEncodeGrayImages=false -dEncodeMonoImages=false input.pdf
but in all cases the PDF files obtained were 'bigger' than the original.
All these PDF files are basically collections of scanned images, so maybe I need a specific option to 'tell' Ghostscript to compress them?
The strange thing I found is that using the trial version of PhantomPDF I was able to reduce the size to 2-5 MB without visible loss of quality.
How do I do the same with Ghostscript?
Firstly, Ghostscript (or more accurately, Ghostscript's pdfwrite device) doesn't 'shrink' PDF files; it makes new ones, which may, or may not, be smaller.
Secondly, it's practically impossible to say what might be happening with a PDF file without an example to look at.
If your files really are scanned images, then (assuming sensible initial compression) there's probably no way to reduce the file size without reducing quality. You might not notice the reduction in quality, especially if you're just viewing on screen, but it will be there.
Random poking with command lines which you run across online is probably not going to result in useful output either; you really need to understand where the size is being used in your original files, and then select options which are likely to reduce that.
For example, you say the pages are scanned images; there are only two realistic ways to reduce the size of an image: downsample it to a lower resolution, or select a different (more efficient, possibly lossy) compression. Ghostscript already compresses image data (unless you tell it not to).
The latter two of your command lines explicitly disable image downsampling, so they are not likely to reduce the size of scanned images (by default the pdfwrite device doesn't downsample images; we try to preserve quality).
The middle option disables auto compression, and selects Flate compression. If your images were previously JPEG compressed, or are not contone images, then this is probably reasonable.
You also say that the PDF files got larger; most likely this is because the originals use compressed object streams and a compressed xref, a PDF 1.5 feature that the pdfwrite device doesn't support. However it's not likely to save you much space.
I'd say the most likely difference is that PhantomPDF is using more aggressive downsampling, which you could reproduce with pdfwrite.
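A sketch of what such explicit downsampling with pdfwrite might look like; the 150/300 dpi targets are illustrative choices, not what any particular tool uses, and the echo makes it a dry run:

```shell
# Sketch: spell out the downsampling that the -dPDFSETTINGS profiles apply
# implicitly, so you can pick your own target resolutions (values illustrative).
opts="-dDownsampleColorImages=true -dColorImageResolution=150 \
 -dDownsampleGrayImages=true -dGrayImageResolution=150 \
 -dDownsampleMonoImages=true -dMonoImageResolution=300"
echo gswin32c.exe -sDEVICE=pdfwrite -o output.pdf $opts input.pdf   # drop `echo` to execute
```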
I'm assuming, of course, that you are using a recent version of Ghostscript. Older versions unsurprisingly perform less well than recent ones.
I'm using ImageMagick to convert a few hundred thousand PDF files to PNG files. ImageMagick takes about ten seconds to do this. Now, most of these PDF files are automatically generated grading certificates, so it's basically just a bunch of PDF files with the forms filled in with different numbers. There are also a few simple raster images on each PDF. I mean, one option is to just throw computing power at it, but that means money, as well as making sure they all end up in the right place when they come back. Another option is to just wait it out on our current computer. But I did the calculations here, and we won't even be able to keep up with the certificates we get in real time.
Now, the option I'm hoping to pursue is to somehow take advantage of the fact that most of these files are very similar, so if we have some sort of pre-computed template to use, we can skip the process of calculating the entire PDF file every time the conversion is done. I'd do a quick check to see if the PDF fits any of the templates, run the optimized conversion if it does, and just do a full conversion if it doesn't.
Of course, my understanding of the PDF file format is intermediate at best, and I don't even know if this idea is practical or not. Would it require making a custom version of ImageMagick? Maybe contributing to the ImageMagick source code? Or is there some solution out there already that does exactly what I need it to? (We've all spent weeks on a project, then had this happen, I imagine.)
Ok, I have had a look at this. I took your PDF and converted it to a JPEG like this - till you tell me the actual parameters you prefer.
convert -density 288 image.pdf image.jpg
and it takes 8 seconds and results in a 2448x3168 pixel JPEG of 1.6MB - enough for a page size print.
Then I copied your file 99 times so I had 100 PDFs, and processed them sequentially like this:
#!/bin/bash
for p in *.pdf; do
    new="${p%%pdf}jpg"
    echo "$new"
    convert -density 288 "$p" "$new"
done
and that took 14 minutes 32 seconds, an average of 8.7 seconds per file.
Then I tried using GNU Parallel to do exactly the same 100 PDF files, like this:
time parallel -j 8 convert -density 288 {} {.}.jpg ::: *.pdf
which kept all 8 cores of my CPU very busy, and it processed the same 100 PDFs in 3 minutes 12 seconds, averaging 1.92 seconds each, or a 4.5x speed-up. I'd say well worth the effort for a pretty simple command line.
Depending on your preferred parameters for convert there may be further enhancements possible...
The solution in my case ended up being to use MuPDF (thanks @Vadim) from the command line, which is about ten times faster than Ghostscript (the library used by ImageMagick). MuPDF fails on about 1% of the PDF files though, due to improper formatting, which Ghostscript handles reasonably well, so I just wrote an exception handler to fall back to ImageMagick in those cases. Even so, it took about 24 hours on an 8-core server to process all the PDF files.