I am trying to compress PDF versions of my school newspaper and wrote the following script, which works perfectly on my local machine:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -sOutputFile=$file $old;
I went to run it on the server and discovered that the server's version of Ghostscript was too old for the script to work, and I don't have permission to update gs (I'm on a shared hosting service). The server does have ImageMagick, though, and I was wondering if anyone could help me compress text-heavy PDFs with it. I tried something similar to
convert -compress JPEG -quality 100 input.pdf output.pdf
but it made the PDF text very blurry (not good for reading newspapers).
If anyone could help me, it would be greatly appreciated. Thank you!
ImageMagick also uses Ghostscript to convert your PDF file, so it will use the same old version of Ghostscript.
If you want more readable text you should set the density:
convert -density 150 input.pdf -compress JPEG output.pdf
If you want higher-quality images, you should not specify JPEG compression. If your PDF is monochrome you can use Group4 compression:
convert -density 150 input.pdf -compress group4 output.pdf
If your PDF is not monochrome, you can use LZW or Zip compression:
convert -density 150 input.pdf -compress LZW output.pdf
convert -density 150 input.pdf -compress Zip output.pdf
You could start with a density of 150 and increase it to improve the quality, but that will also increase the size of your file. ImageMagick will rasterize your PDF and then wrap the images back into a PDF that contains only images, not text. I am not sure whether this will actually decrease the size of your file; you will have to test that yourself.
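If you want to experiment, a small loop like the following lets you compare quality and file size at a few densities. This is only a sketch: the filenames are hypothetical, and it is guarded so it just prints the commands when ImageMagick or the input file is missing.

```shell
# Try several densities and compare the resulting file sizes,
# so you can pick the smallest acceptable one.
for density in 150 200 300; do
    out="output-${density}.pdf"
    if command -v convert >/dev/null 2>&1 && [ -f input.pdf ]; then
        convert -density "$density" input.pdf -compress Zip "$out"
        ls -lh "$out"
    else
        echo "would run: convert -density $density input.pdf -compress Zip $out"
    fi
done
```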
I'm using ImageMagick to convert the following PDF to a PNG file.
Click here to download the PDF from IMSLP (Permalink if the direct download is broken)
In a PDF viewer it looks nice:
but when converting with convert -density 300 -background white -alpha off -alpha remove file.pdf /tmp/file.png
the image gets a large white margin:
I do not want to trim the image afterwards, I just want ImageMagick to somehow respect the view-port or however that viewing information is being encoded in the PDF. Does anyone know which command-line parameter might enable this behavior?
Edit 10.03.2022: I'm using ImageMagick 7.1.0.16 with Ghostscript 9.55.0 inside an Alpine Linux docker image.
I do not get your extra margin in ImageMagick 6.9.12-42 using Ghostscript 9.54. But changing the density does not seem to have any effect.
convert -density 300 -background white file.pdf[1] x2.png
The issue may be a malformed PDF. How was it created? Also, what version of Ghostscript are you using? It could be a GS version issue.
If this was a scanned PDF that is a raster image in a vector PDF shell, then you could just use pdfimages to extract the raster files. See https://manpages.debian.org/testing/poppler-utils/pdfimages.1.en.html
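For instance, a sketch of that approach, assuming poppler-utils is installed and the scan is named file.pdf (the -png option needs a reasonably recent poppler):

```shell
# First inspect what raster images the PDF embeds, then extract them
# losslessly as PNGs (file-000.png, file-001.png, ...).
if command -v pdfimages >/dev/null 2>&1 && [ -f file.pdf ]; then
    pdfimages -list file.pdf
    pdfimages -png file.pdf file
else
    echo "pdfimages or file.pdf not found"
fi
```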
The hint from KenS was exactly what I was looking for: the PDF defines a CropBox that ImageMagick 7.1.0 was not using by default. The solution is therefore to modify the command to include the following -define option:
convert -define pdf:use-cropbox=true file.pdf /tmp/file.png
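To confirm that a PDF really defines a CropBox before reaching for this option, you can print its page boxes. A sketch, assuming poppler-utils' pdfinfo is available:

```shell
# Print the page boxes (MediaBox, CropBox, ...) so you can see
# whether the PDF defines a CropBox smaller than the MediaBox.
if command -v pdfinfo >/dev/null 2>&1 && [ -f file.pdf ]; then
    pdfinfo -box file.pdf
else
    echo "pdfinfo or file.pdf not found"
fi
```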
Thank you all for your help!
I am trying to convert a multi-page PDF to one long png with the following command:
convert -append -flatten -density 300 in.pdf out.png
I am using -flatten to lose transparency, since I want a white background in the final PNG. The problem is that it takes only the first page instead of using all the pages.
How can I convert the PDF to one long PNG while losing the transparency and using a white background?
This command works for me on IM 6.9.9.22 Q16 Mac OSX with Ghostscript 9.21
convert -density 300 -colorspace sRGB itc101_13.pdf -alpha off -append out.png
If it does not work for you, what are your ImageMagick and Ghostscript versions?
You have your syntax wrong. You must read the PDF before applying append. Try
convert -density 300 -colorspace sRGB in.pdf +adjoin -append -background white -flatten out.png
If that does not work, then what is your ImageMagick version and platform? What is your Ghostscript version and your libpng version? Can you post a link to your PDF file?
Note that +adjoin is not usually necessary for output to PNG, but won't hurt.
I'm currently working on a Google Tesseract OCR workflow. There are two options for generating TIFFs from a PDF:
Ghostscript:
gswin64c.exe -r300x300 -dBATCH -dNOPAUSE -sDEVICE=tiff24nc -sOutputFile=thetif.tif -sCompression=lzw thepdf.pdf -c quit -q
Imagemagick - convert:
convert -background white -alpha off -density 300 thepdf.pdf -depth 8 -compress zip thetif.tif
For an (arbitrary) example file, the TIFF produced by gswin64c is about five times as large as the result of convert. Yet the text is much smoother and of higher quality with convert (!) than with gswin64c. So I would prefer to use convert, but it unfortunately takes about four times as long as gswin64c to extract, e.g., 30 pages from a multipage PDF (170 sec vs. 40 sec).
Is there any way to improve the quality of gswin64c's output (without greatly enlarging the files) or to speed up convert?
To me this appears to be the usual trade-off of speed versus quality. You like convert's quality, but it's too slow; you like Ghostscript's speed, but you feel the quality is lower.
Surely that suggests you can't have both?
Anyway, do you realise that ImageMagick's convert calls Ghostscript to render the PDF file? So whichever route you use, you are using Ghostscript.
It is (of course) entirely possible that convert is post-processing the image, but I suspect it is not. If you look into how convert works you can probably find out what command line it is feeding to Ghostscript and use that.
It also looks like convert is using a different compression filter (Flate instead of LZW), and may be specifying anti-aliasing. You can get anti-aliasing either by using TextAlphaBits and GraphicsAlphaBits or the tiffscaled devices.
Of course, using anti-aliasing will result in smoother text (if you like blurred text) but it will take longer.
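For example, the question's Ghostscript command with anti-aliasing switched on might look like this: the same flags as above, plus the two AlphaBits options (valid values are 1, 2, or 4; 4 is the smoothest). It is guarded so it only runs where gswin64c.exe exists.

```shell
# The original command with text and graphics anti-aliasing enabled.
if command -v gswin64c.exe >/dev/null 2>&1; then
    gswin64c.exe -r300x300 -dBATCH -dNOPAUSE -sDEVICE=tiff24nc \
        -dTextAlphaBits=4 -dGraphicsAlphaBits=4 \
        -sCompression=lzw -sOutputFile=thetif.tif thepdf.pdf -c quit -q
else
    echo "gswin64c.exe not found (on Linux/macOS the binary is gs)"
fi
```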
I do not use the Google Tesseract OCR workflow, but your command looks odd. Why two converts?
The input file usually comes right after convert, but in your case the -density needs to come before it.
I would try something like this and see what happens:
convert -density 300 thepdf.pdf -background white -alpha off -depth 8 -compress zip thetif.tif
I am trying to use a Linux application to convert .pdf files to .tiff for faxing; however, our clients have not been happy with the quality of Ghostscript's tiffg4 device.
In the image below, the left side shows a conversion using GhostScript tiffg4 and the right is from an online conversion service. We are unable to see which application is being used to attain that quality.
Note: The output TIFF must be black & white
Ghostscript Code:
gs -sDEVICE=tiffg4 -dNOPAUSE -dBATCH -dPDFFitPage -sPAPERSIZE=letter -g1728x2156 -sOutputFile=testg4.tiff test.pdf
We have tried these GhostScript devices:
tiffcrle
tiffg3
tiffg32d
tiffg4
tifflzw
tiffpack
My question is this: does anyone know which application and/or setting is used to achieve the quality on the right?
Extending BitBank's comment: you could write an RGB TIFF and then use ImageMagick to convert it to Group 4. ImageMagick allows you to control the dithering algorithm:
gs -sDEVICE=tiff24nc -dNOPAUSE -dBATCH -dPDFFitPage -sPAPERSIZE=letter -g1728x2156 -sOutputFile=intermediate.tiff your.pdf
convert intermediate.tiff -dither FloydSteinberg -compress group4 out.tiff
ImageMagick's manual has some background on the algorithm(s) and available options.
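To see which dither settings your build supports, you can ask ImageMagick directly (assumes convert is on the PATH):

```shell
# List the dither methods this ImageMagick build understands
# (typically None, FloydSteinberg, and Riemersma).
if command -v convert >/dev/null 2>&1; then
    convert -list dither
else
    echo "ImageMagick convert not found"
fi
```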
When I use the following ghostscript command to generate jpg thumbnails from PDFs, the image quality is often very poor:
gs -q -dNOPAUSE -dBATCH -sDEVICE=jpeggray -g465x600 -dUseCropBox -dPDFFitPage -sOutputFile=pdf_to_lowres.jpg test.pdf
By contrast, if I use ghostscript to generate a high-resolution png, and then use mogrify to convert the high-res png to a low-res jpg, I get pretty good results.
gs -q -dNOPAUSE -dBATCH -sDEVICE=pnggray -g2550x3300 -dUseCropBox -dPDFFitPage -sOutputFile=pdf_to_highres.png test.pdf
mogrify -thumbnail 465x600 -format jpg -write pdf_to_highres_to_lowres.jpg pdf_to_highres.png
Is there any way to achieve good results while bypassing the intermediate PDF -> high-res PNG step? I need to do this for a large number of PDFs, so I'm trying to minimize the compute time.
Here are links to the images referenced above:
test.pdf
pdf_to_lowres.jpg
pdf_to_highres.png
pdf_to_highres_to_lowres.jpg
One option that seems to improve the output a lot: -dDOINTERPOLATE. Here's what I got by running the same command as you but with the -dDOINTERPOLATE option:
I'm not sure what interpolation method this uses but it seems pretty good, especially in comparison to the results without it.
P.S. Consider outputting PNG images (-sDEVICE=pnggray) instead of JPEG. For most PDF documents (which tend to have just a few solid colors) it's a more appropriate choice.
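Putting both suggestions together, the one-step command might look like this: it is the question's own invocation with -dDOINTERPOLATE added and jpeggray swapped for pnggray, guarded so it only runs where gs and test.pdf are present.

```shell
# One-step low-res thumbnail with interpolation and grayscale PNG output.
if command -v gs >/dev/null 2>&1 && [ -f test.pdf ]; then
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pnggray -g465x600 \
       -dUseCropBox -dPDFFitPage -dDOINTERPOLATE \
       -sOutputFile=pdf_to_lowres.png test.pdf
else
    echo "gs or test.pdf not found"
fi
```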
Your PDF looks like it is just a wrapper around a JPEG already.
Try using the pdfimages program from xpdf to extract the actual image rather than rendering the page to a file.
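A minimal sketch of that approach, assuming pdfimages (xpdf or poppler-utils) is installed and the embedded image really is a JPEG; -j writes JPEG streams out directly as .jpg files instead of re-encoding them:

```shell
# Extract embedded images without rendering; JPEG streams are written
# as-is. Output names: thumb-000.jpg, thumb-001.jpg, ...
if command -v pdfimages >/dev/null 2>&1 && [ -f test.pdf ]; then
    pdfimages -j test.pdf thumb
    ls thumb-* 2>/dev/null || true
else
    echo "pdfimages or test.pdf not found"
fi
```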