I'm trying to apply OCR to a PDF file with Ghostscript, using this command:
gswin64c -sOCRLanguage="fast/eng" -sDEVICE=pdfocr24 -o out.pdf in.pdf
The original PDF is 300 DPI, in color, and 4.1 MB, but after running OCR I get a PDF of 42.5 MB.
I have tried adding -r300 -dDownScaleFactor=4, but these options reduce both the image quality and the OCR accuracy.
What I expect is a PDF only a little larger than the original (same quality as the original, but with OCR).
Is there any advice on how to accomplish that?
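For context: the pdfocr devices render every page to an image, which is usually why the output balloons. One thing that may be worth trying (a sketch, not a tested fix; the numbers are illustrative) is oversampling and then downscaling, so the embedded images still end up at an effective 300 DPI but with better quality than a straight 4x downscale:

```shell
# Render at 600 DPI and downscale by 2: images are stored at an effective 300 DPI
gswin64c -sOCRLanguage="fast/eng" -sDEVICE=pdfocr24 -r600 -dDownScaleFactor=2 -o out.pdf in.pdf
```

If the original content is mostly grayscale, switching to the grayscale device (-sDEVICE=pdfocr8) should also shrink the result considerably.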
I have converted this file to PDF using rsvg-convert.
I’m now embedding it within this LaTeX document:
\documentclass{standalone}
\usepackage{graphicx}
\begin{document}
\includegraphics{nucleosynthesis_periodic_table.pdf}
\end{document}
that I compile to test_lualatex.pdf, which I then process using this command:
gs -o test_output_lualatex.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress test_lualatex.pdf
(I normally do this on my bigger main document, where it helps a lot in reducing the PDF size; here I'm only providing a MWE.) And I get this: test_output_lualatex.pdf.
Why is the output so different after ghostscript processing? Is there anything I can do about it?
I am generating .jpg thumbnails out of .pdf pages with ghostscript.
This is the code I'm using:
gswin64c -dNumRenderingThreads=4 -dNOPAUSE -sDEVICE=jpeg -g125x175 -dPDFFitPage -sOutputFile=./h%d.jpg -dJPEGQ=100 -r300 -q input.pdf -c quit
Everything is fine except that the quality of the thumbnails is really bad. I'm hoping for some Ghostscript option that raises the quality to ImageMagick's level.
By the way, ImageMagick generates good-quality thumbnails, but it's too slow.
Here is an example thumbnail with ghostscript:
And here is the image I want. Generated by imagemagick:
It would be helpful to supply the original file; without that, it's speculation as regards better parameters.
Personally I wouldn't use JPEG, I doubt it offers much compression at such low resolution/media size. It also doesn't perform well on linework and text, which is what your page looks like to me. The combination leads to considerable artefacts in the output.
The ImageMagick output appears to be heavily anti-aliased. You can get that from Ghostscript by setting -dGraphicsAlphaBits and -dTextAlphaBits, or by oversampling the resolution and then downsampling with -dDownScaleFactor.
Of course, the performance of Ghostscript when producing anti-aliased output will be reduced compared to the normal output. You can't get something for nothing; 'better quality' is going to cost you somewhere along the line.
Note that at the page size you are using -dNumRenderingThreads will have no effect whatsoever. You have to be running a display list for that to have any effect, and such a tiny page will be rendered as a bitmap in memory.
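As a concrete sketch of the anti-aliasing suggestion (the flag values of 4 are the usual maximums; the rest of the command is taken from the question, with the ineffective -dNumRenderingThreads dropped):

```shell
# Thumbnail command with text and graphics anti-aliasing enabled
gswin64c -dNOPAUSE -dBATCH -sDEVICE=jpeg -g125x175 -dPDFFitPage -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -sOutputFile=./h%d.jpg -r300 -q input.pdf
```

Both AlphaBits settings accept 1, 2, or 4; 4 gives the strongest anti-aliasing at the highest rendering cost.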
I am trying to compress PDF versions of my school newspaper programmatically, and I created the following script, which works perfectly:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -sOutputFile=$file $old;
I went to run it on the server and discovered that the version of Ghostscript there was old, causing the command to fail, and I don't have permission to update gs (I'm on a shared hosting service). I do have ImageMagick on the server too, and was wondering if anyone could help me compress text-heavy PDFs with it. I tried code similar to
convert -compress JPEG -quality 100 input.pdf output.pdf
but it made the PDF text very blurry (not good for reading newspapers.)
If anyone could help me, it would be greatly appreciated. Thank you!
ImageMagick also uses Ghostscript to convert your pdf file and it will use the same old version of Ghostscript.
If you want to get a more readable text you should set the density.
convert -density 150 input.pdf -compress JPEG output.pdf
If you want to get images with a higher quality you should not specify JPEG compression. If your PDF is monochrome you can use Group4 compression:
convert -density 150 input.pdf -compress group4 output.pdf
When your PDF is not monochrome you can use LZW/Zip compression:
convert -density 150 input.pdf -compress LZW output.pdf
convert -density 150 input.pdf -compress Zip output.pdf
You could start with 150 and increase it to improve the quality. But that will also increase the size of your file. ImageMagick will convert your pdf to an image and then convert it back to a PDF file that contains only images and not text. I am not sure if this will actually decrease the size of your file but you will have to test that yourself.
I'm converting a PDF (created with Adobe Illustrator) into a transparent PNG file with the following command:
gs -q -sDEVICE=pngalpha -r300 -o target.png -f source.pdf
However, there are undesired white boxes in the resulting PNG; it looks like some bounding box auto-generated by Ghostscript (see attached image).
I tried both gs-9.05 and gs-9.10, with the same bad result.
I've tried to export to PNG file from Illustrator or Inkscape manually, the result is good.
What does Inkscape do to render it correctly, and
How could I eliminate those white boxes using ghostscript?
Try mudraw from the latest (1.3) muPDF; as far as I checked, it creates nice PNGs from PDF files with PDF 1.4 transparency:
mudraw -o out.png -c rgba in.pdf
"rgba" being, as you understand, RGB + alpha
In the general case, you can't. PDF does support transparency, but the underlying medium is always assumed to be white and opaque. So anywhere that marks are made on the medium is no longer transparent; it's white.
You don't say which version of Ghostscript you are using, but if it's earlier than 9.10 you could try upgrading.
When I use the following ghostscript command to generate jpg thumbnails from PDFs, the image quality is often very poor:
gs -q -dNOPAUSE -dBATCH -sDEVICE=jpeggray -g465x600 -dUseCropBox -dPDFFitPage -sOutputFile=pdf_to_lowres.jpg test.pdf
By contrast, if I use ghostscript to generate a high-resolution png, and then use mogrify to convert the high-res png to a low-res jpg, I get pretty good results.
gs -q -dNOPAUSE -dBATCH -sDEVICE=pnggray -g2550x3300 -dUseCropBox -dPDFFitPage -sOutputFile=pdf_to_highres.png test.pdf
mogrify -thumbnail 465x600 -format jpg -write pdf_to_highres_to_lowres.jpg pdf_to_highres.png
Is there any way to achieve good results while bypassing the intermediate pdf -> high-res png step? I need to do this for a large number of pdfs, so I'm trying to minimize the compute time.
Here are links to the images referenced above:
test.pdf
pdf_to_lowres.jpg
pdf_to_highres.png
pdf_to_highres_to_lowres.jpg
One option that seems to improve the output a lot: -dDOINTERPOLATE. Here's what I got by running the same command as you but with the -dDOINTERPOLATE option:
I'm not sure what interpolation method this uses but it seems pretty good, especially in comparison to the results without it.
P.S. Consider outputting PNG images (-sDEVICE=pnggray) instead of JPEG. For most PDF documents (which tend to have just a few solid colors) it's a more appropriate choice.
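For reference, this is the question's command with the suggested flag added (just the original invocation plus -dDOINTERPOLATE; nothing else changed):

```shell
# Low-res JPEG thumbnail with image interpolation enabled
gs -q -dNOPAUSE -dBATCH -sDEVICE=jpeggray -dDOINTERPOLATE -g465x600 -dUseCropBox -dPDFFitPage -sOutputFile=pdf_to_lowres.jpg test.pdf
```

Swapping in -sDEVICE=pnggray and a .png output name gives the PNG variant mentioned in the P.S.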
Your PDF looks like it is just a wrapper around a JPEG already.
Try using the pdfimages program from xpdf to extract the actual image, rather than rendering to a file.
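A minimal sketch of that approach (assuming the pdfimages tool from xpdf or poppler-utils; -j writes JPEG image streams directly to .jpg files, and "thumb" is just an output prefix chosen for illustration):

```shell
# Extract embedded images from test.pdf as thumb-000.jpg, thumb-001.jpg, ...
pdfimages -j test.pdf thumb
```

You would then downscale the extracted JPEG to thumbnail size with whatever image tool you already have, skipping the rendering step entirely.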