Quality of tiff output Imagemagick vs. Ghostscript - pdf

I'm currently working on a Google tesseract ocr workflow. There are two options for generating tif's from PDF:
Ghostscript:
gswin64c.exe -r300x300 -dBATCH -dNOPAUSE -sDEVICE=tiff24nc -sOutputFile=thetif.tif -sCompression=lzw thepdf.pdf -c quit -q
Imagemagick - convert:
convert -background white -alpha off -density 300 thepdf.pdf -depth 8 -compress zip thetif.tif
For an (arbitrary) example file, the tif extracted by gswin64c is about five times as large as the result of convert. Nevertheless, the text is much smoother and of higher quality with convert (!) than with gswin64c. So I would prefer to use convert, but unfortunately it takes about four times as long as gswin64c to extract e.g. 30 pages from a multipage pdf (170 sec vs. 40 sec).
Is there any chance to improve the quality of gswin64c (without making the output files extremely large) or to speed up convert?

To me this appears to be the usual trade-off of speed versus quality. You like the convert quality, but it's too slow; you like Ghostscript's speed, but you feel the quality is lower.
Surely that would suggest that you can't have both?
Anyway, do you realise that ImageMagick convert calls Ghostscript to render the PDF file? So whichever route you use, you are using Ghostscript.
It is (of course) entirely possible that convert is post-processing the image, but I would suspect it is not. If you look into how convert works you can probably find out what command line it's feeding to Ghostscript and use that.
It also looks like convert is using a different compression filter (Flate instead of LZW), and may be specifying anti-aliasing. You can get anti-aliasing either by using TextAlphaBits and GraphicsAlphaBits or the tiffscaled devices.
Of course, using anti-aliasing will result in smoother text (if you like blurred text) but it will take longer.
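As a rough sketch of the two anti-aliasing routes mentioned above, the function below just assembles a Ghostscript command line. The function name render_tiff, the GS override variable and the file names are illustrative assumptions, and the 600 dpi / factor 2 pairing is only one way to end up at 300 dpi.

```shell
# Sketch only. Set GS=gswin64c.exe on Windows, or GS=echo for a dry run.
render_tiff() {
    mode=$1
    pdf=$2
    tif=$3
    case $mode in
        alphabits)
            # Anti-alias text and graphics while rendering directly at 300 dpi.
            "${GS:-gs}" -q -dBATCH -dNOPAUSE -r300 -sDEVICE=tiff24nc \
                -sCompression=lzw -dTextAlphaBits=4 -dGraphicsAlphaBits=4 \
                -sOutputFile="$tif" "$pdf"
            ;;
        scaled)
            # Render at 600 dpi internally, then downscale by 2 to 300 dpi.
            "${GS:-gs}" -q -dBATCH -dNOPAUSE -r600 -sDEVICE=tiffscaled24 \
                -sCompression=lzw -dDownScaleFactor=2 \
                -sOutputFile="$tif" "$pdf"
            ;;
        *)
            echo "usage: render_tiff alphabits|scaled in.pdf out.tif" >&2
            return 2
            ;;
    esac
}
```

Calling render_tiff scaled thepdf.pdf thetif.tif and then render_tiff alphabits thepdf.pdf thetif2.tif on the same file should make it easy to compare size, speed and OCR results for both routes.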

I do not use the Google tesseract OCR workflow, but your command looks odd. Why two converts?
The input image usually comes after the convert but in your case the -density would come first.
I would try something like this and see what happens:
imagemagick - convert -density 300 thepdf.pdf -background white -alpha off -depth 8 -compress zip thetif.tif

Related

Ghostscript generate quality thumbnail jpg from pdf

I am generating .jpg thumbnails out of .pdf pages with ghostscript.
This is the code I'm using:
gswin64c -dNumRenderingThreads=4 -dNOPAUSE -sDEVICE=jpeg -g125x175 -dPDFFitPage -sOutputFile=./h%d.jpg -dJPEGQ=100 -r300 -q input.pdf -c quit
Everything is fine except the quality of thumbnails is really bad. I'm hoping for some ghostscript command to increase the quality to imagemagick quality.
Btw. Imagemagick generates good quality thumbnails, but it's too slow.
Here is an example thumbnail with ghostscript:
And here is the image I want. Generated by imagemagick:
It would be helpful to supply the original file; without that, it's speculation as regards better parameters.
Personally I wouldn't use JPEG, I doubt it offers much compression at such low resolution/media size. It also doesn't perform well on linework and text, which is what your page looks like to me. The combination leads to considerable artefacts in the output.
The ImageMagick output appears to be heavily anti-aliased, you can get that from Ghostscript by setting -dGraphicsAlphaBits, -dTextAlphaBits OR by oversampling the resolution and then downsampling, using -dDownScaleFactor.
Of course, the performance of Ghostscript when producing anti-aliased output will be reduced compared to the normal output. You can't get something for nothing; 'better quality' is going to cost you somewhere along the line.
Note that at the page size you are using -dNumRenderingThreads will have no effect whatsoever. You have to be running a display list for that to have any effect, and such a tiny page will be rendered as a bitmap in memory.
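A minimal sketch of the oversample-then-downsample route, assuming that -g describes the internal (pre-downscale) canvas so that a 500x700 canvas with a factor of 4 comes out at 125x175; that is worth verifying on your Ghostscript version. The function name render_thumb and the GS override are made up for illustration, and it writes PNG rather than JPEG, as suggested above.

```shell
# Sketch only. Set GS=gswin64c on Windows, or GS=echo for a dry run.
render_thumb() {
    pdf=$1
    out_pattern=$2      # e.g. ./h%d.png
    factor=${3:-4}      # oversampling factor, a small integer
    # Internal canvas is the 125x175 target multiplied by the factor.
    w=$((125 * factor))
    h=$((175 * factor))
    "${GS:-gs}" -q -dBATCH -dNOPAUSE -sDEVICE=png16m \
        -g"${w}x${h}" -dPDFFitPage -dDownScaleFactor="$factor" \
        -sOutputFile="$out_pattern" "$pdf"
}
```

For example, render_thumb input.pdf ./h%d.png 4 asks for a 500x700 internal rendering per page, downscaled by 4.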

Ghostscript convert PDF to JPG (CMYK profile) resolution error

I'm using Ghostscript to convert some PDF files to JPG. Everything works: the conversion honours the 600dpi resolution and the output JPEG quality is good.
Here is my code :
gs -sDEVICE=jpegcmyk -dTextAlphaBits=4 -r600 -dSAFER -dBATCH -dNOPAUSE -o my_output_file.jpg my_input_file.pdf
But when I open the file in Photoshop, the properties show 72dpi instead of the 600dpi I expected:
When I try with an RGB profile for the output, it is OK: I get 600dpi.
So what I want is CMYK + 600dpi in image properties.
As can be seen from your screenshots, both images are of the same dimensions, 6803 by 709 pixels.
And that is all that matters.
Also, the size of the CMYK version is bigger by about 33% compared to the RGB version -- as is to be expected for an image with 4 color channels instead of 3.
Ghostscript used the -r600 CLI parameter to correctly expand the number of pixels when converting the PDF file.
Ghostscript does not add any EXIF metadata to its output when converting a PDF to raster.
The DPI or PPI information would be an internal metadata hint to tell any compliant viewers how big to render the image on screen. It would not change anything substantial in the image information itself.
Why Photoshop thinks it should use 72 dpi for one but 600 dpi for the other is something you would have to ask Adobe.
I bet Photoshop also renders the 72dpi file about 8 times larger on screen than the other (600 / 72 is roughly 8.3). Is that the case?
P.S.: See also "What DPI do web images need to be?"
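To make the point about the DPI tag concrete: it only changes the physical size that a viewer derives from a fixed pixel count. A small sketch of that arithmetic, where the helper name inches is an assumption:

```shell
# Physical size (in inches) implied by a pixel count and a DPI tag.
inches() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px / dpi }'
}

inches 6803 600   # prints 11.34 (width if the file is tagged 600 dpi)
inches 6803 72    # prints 94.49 (width if the file is tagged 72 dpi)
```

The pixels are identical in both cases. If the tag itself matters, ImageMagick can normally rewrite it without resampling, along the lines of convert in.jpg -units PixelsPerInch -density 600 out.jpg, though whether that behaves well for CMYK JPEGs is something to test rather than assume.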

Compressing text heavy PDFs without ghostscript and only ImageMagick causes blurry text

I am trying to compress PDF versions of my school newspaper using code, and created the following script, which works perfectly:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -sOutputFile=$file $old;
I went to run it on the server and discovered that the version of ghostscript there was too old for the code to work, and I don't have permission to update gs (I'm on a shared hosting service). I do have ImageMagick on the server too and was wondering if anyone could help me compress text heavy PDFs with it. I tried some code similar to
convert -compress JPEG -quality 100 input.pdf output.pdf
but it made the PDF text very blurry (not good for reading newspapers.)
If anyone could help me, it would be greatly appreciated. Thank you!
ImageMagick also uses Ghostscript to convert your pdf file and it will use the same old version of Ghostscript.
If you want to get a more readable text you should set the density.
convert -density 150 input.pdf -compress JPEG output.pdf
If you want to get images with a higher quality you should not specify JPEG compression. If your PDF is monochrome you can use Group4 compression:
convert -density 150 input.pdf -compress group4 output.pdf
When your PDF is not monochrome you can use LZW/Zip compression:
convert -density 150 input.pdf -compress LZW output.pdf
convert -density 150 input.pdf -compress Zip output.pdf
You could start with 150 and increase it to improve the quality. But that will also increase the size of your file. ImageMagick will convert your pdf to an image and then convert it back to a PDF file that contains only images and not text. I am not sure if this will actually decrease the size of your file but you will have to test that yourself.
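Since the result may well be larger than the input, one cautious pattern is to write the converted PDF to a separate file and keep it only if it actually shrank. A sketch of that check; the function name keep_if_smaller and the file names are made up for illustration:

```shell
# Replace the original with the candidate only if the candidate is smaller.
keep_if_smaller() {
    original=$1
    candidate=$2
    # The arithmetic expansion strips any padding some wc versions print.
    orig_size=$(($(wc -c < "$original")))
    cand_size=$(($(wc -c < "$candidate")))
    if [ "$cand_size" -lt "$orig_size" ]; then
        mv -- "$candidate" "$original"
        echo "replaced"
    else
        rm -- "$candidate"
        echo "kept original"
    fi
}
```

Typical use would be to run one of the convert commands above with output.pdf as the target and then call keep_if_smaller input.pdf output.pdf. Keep in mind that the replacement contains only images, so the text is no longer selectable or searchable.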

How to convert PDF to low-resolution (but good quality) JPEG?

When I use the following ghostscript command to generate jpg thumbnails from PDFs, the image quality is often very poor:
gs -q -dNOPAUSE -dBATCH -sDEVICE=jpeggray -g465x600 -dUseCropBox -dPDFFitPage -sOutputFile=pdf_to_lowres.jpg test.pdf
By contrast, if I use ghostscript to generate a high-resolution png, and then use mogrify to convert the high-res png to a low-res jpg, I get pretty good results.
gs -q -dNOPAUSE -dBATCH -sDEVICE=pnggray -g2550x3300 -dUseCropBox -dPDFFitPage -sOutputFile=pdf_to_highres.png test.pdf
mogrify -thumbnail 465x600 -format jpg -write pdf_to_highres_to_lowres.jpg pdf_to_highres.png
Is there any way to achieve good results while bypassing the intermediate pdf -> high-res png step? I need to do this for a large number of pdfs, so I'm trying to minimize the compute time.
Here are links to the images referenced above:
test.pdf
pdf_to_lowres.jpg
pdf_to_highres.png
pdf_to_highres_to_lowres.jpg
One option that seems to improve the output a lot: -dDOINTERPOLATE. Here's what I got by running the same command as you but with the -dDOINTERPOLATE option:
I'm not sure what interpolation method this uses but it seems pretty good, especially in comparison to the results without it.
P.S. Consider outputting PNG images (-sDEVICE=pnggray) instead of JPEG. For most PDF documents (which tend to have just a few solid colors) it's a more appropriate choice.
Your PDF looks like it is just a wrapper around a jpeg already.
Try using the pdfimages program from xpdf to extract the actual image rather than rendering to a file.
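A minimal sketch of that extraction step, assuming pdfimages (from xpdf or poppler) is on the PATH. The function name extract_images, the PDFIMAGES override and the prefix "page" are illustrative, and the exact output file names differ between the two implementations.

```shell
# Sketch only. Set PDFIMAGES=echo for a dry run.
extract_images() {
    pdf=$1
    prefix=$2
    # -j writes embedded JPEG (DCT) images out as .jpg files instead of
    # converting them, so nothing is re-rendered.
    "${PDFIMAGES:-pdfimages}" -j "$pdf" "$prefix"
}
```

After extract_images test.pdf page, the extracted JPEG can be shrunk with the same mogrify -thumbnail 465x600 step used in the question, which skips the high-resolution rendering pass entirely.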

ImageMagick PDF to JPEG conversion results in green square where image should be

I'm attempting to convert a PDF to a JPEG using ImageMagick.
The PDF:
baby_aRCWTU.pdf
The command:
convert -density 260 -profile 'SWOP.icc' -profile 'sRGB.icm' 'baby_aRCWTU.pdf' 'baby_aRCWTU.jpg'
The resulting JPEG:
baby_aRCWTU.jpg
As you can see, the text is rendered nicely, but the embedded image shows up as a green square. Any ideas? This occurs with and without the colour profiles.
edit: reposted due to broken links
On our site we convert hundreds of PDFs to JPGs on a daily basis, and we found the only reliable approach was to convert the PDFs to PostScript first.
We use the "pdftops" command. Try:
pdftops baby_aRCWTU.pdf baby_aRCWTU.ps
then your convert command above, but on the ps. Works for me, the image is then included.