We are using ImageMagick and Tesseract to read information from documents, but we have not found the right configuration and combination of the two tools to optimize the original scanned TIFF and then run Tesseract on it to extract the information.
First we scan the document at 300 dpi; the resulting TIFF is usually around 170 KB.
Then we pre-process the image with ImageMagick before passing it to Tesseract 3.0.3, to produce a PDF with a text layer.
The first command we use is this one:
convert page.tiff -respect-parenthesis -compress LZW -density 300
-bordercolor black -border 1 -fuzz 1% -trim +repage -fill white -draw
"color 0,0 floodfill" -alpha off -shave 1x1 -bordercolor black -border 2
-fill white -draw "color 0,0 floodfill" -alpha off -shave 0x1 -fuzz 1%
-deskew 40 +repage temp.tiff
Then we run Tesseract on it like this:
tesseract -l spa temp.tiff temp pdf
This produces a rather heavy PDF https://drive.google.com/open?id=0B3CPIZ_TyzFXd2UtWldfajR4SVU but Tesseract cannot read data in cells, or in a table just under the table header when the header background is darker.
Then we have tried to use this command with convert:
convert page.tiff -compress LZW -fuzz 1% -trim -alpha off -shave 1x1 temp.tiff
And this produces a very light PDF https://drive.google.com/open?id=0B3CPIZ_TyzFXWFEwT3JucDBTVVU, but we still have the same problems.
Could someone point us to the approach we should follow to optimize the image so we can extract information like that in the example, or to guidelines for optimizing images to improve Tesseract's accuracy?
The documents we are trying to process vary widely, with different font types and sizes.
If on a Unix-based system, you could try my script, textcleaner, at http://www.fmwconcepts.com/imagemagick/index.php
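If installing a script is not an option, a plain-ImageMagick pre-pass along the same lines (grayscale, contrast stretch, deskew, binarize) is worth trying; the 40% deskew and 50% threshold values below are starting-point assumptions to tune per document:

```shell
# Grayscale, stretch contrast, straighten, and binarize before OCR.
# The deskew and threshold values are assumptions -- tune per document type.
convert page.tiff -colorspace Gray -normalize -deskew 40% -threshold 50% temp.tiff
tesseract -l spa temp.tiff temp pdf
```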
Related
I am trying to convert .pdf files to .jpg using image-magic
convert -limit map 300 -flatten -density 300 -quality 100 -crop '400x400+20+20' dummy.pdf[0] test.jpg
but the problem I am facing is that when I convert the file, it crops the area but marks all of the other area as white.
For example, if I convert a 1000x1000 PDF and crop it to 100x100, the output I get is a 1000x1000 image with the 100x100 area cropped from the PDF and the rest white space.
sample.pdf
I cannot use -trim, since my PDF may or may not have a white border, and trim would remove it.
Your syntax is not in the proper order for Imagemagick. Most of the settings and operators need to come after reading the input PDF. Using Imagemagick 6.9.10.71 Q16 Mac OSX Sierra:
convert -limit map 300 -density 300 dummy.pdf[0] -background white -flatten -crop '400x400+20+20' -quality 100 test.jpg
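If an output format that records the virtual canvas still shows the full page after cropping, adding +repage right after -crop discards the stored page geometry; a sketch of the same command with it:

```shell
# +repage drops the virtual canvas/offset information left behind by -crop.
convert -limit map 300 -density 300 dummy.pdf[0] -background white \
        -flatten -crop '400x400+20+20' +repage -quality 100 test.jpg
```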
How do I convert only page 2 of a pdf file to a jpg image file, using GraphicsMagick command line prompt?
What option can I use in the gm.exe convert command?
gm.exe convert testing.pdf testing.jpg
Add the page number (starting from zero) in square brackets after the PDF filename:
gm.exe convert testing.pdf[1] testing.jpg
By the way, you can use the same indexing technique for accessing specific frames of a GIF animation, or layers of multi-layer/directory TIFFs.
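The same indexing works in a loop; a sketch that extracts the first three pages (page count assumed known):

```shell
# Pull pages 0-2 of the PDF into separate JPEGs using the [n] page selector.
for i in 0 1 2; do
  gm convert "testing.pdf[$i]" "page_$i.jpg"
done
```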
Use the command below to get a high-quality PNG with a white background.
magick convert -density 300 -quality 100% -background white -alpha remove -alpha off ./646.04.pdf ./x.png
I want to switch from ImageMagick to GraphicsMagick, but I am encountering some issues with the syntax.
With imagemagick
First I need to merge some images into a PDF
convert -density 300 page_*.tif output.pdf
And then I need to create a thumbnail of the first page of the PDF
convert -density 300 file.pdf[0] -background white -alpha remove -resize 140x140 -strip -quality 40 thumb.jpg[0]
This works fine, but I want to switch the first command to GraphicsMagick.
With graphicsmagick/imagemagick
The graphicsmagick syntax here works fine
gm convert -density 300 page_*.tif output.pdf
But when I create the thumbnail with imagemagick, the output has the right size, but the actual image is downsized inside the image itself?!
Thumbnail with imagemagick
https://secure.dyndev.dk/data/voucher/30000/400/30435_eb7e5d0a9df71b2783e2fa89efd9de12fcdb9679.pdf
Thumbnail with graphicsmagick
https://secure.dyndev.dk/data/voucher/30000/400/30433_7710d6404534b0868ab8da41dd651e971b70e16b.pdf
Just hit the same issue, and found a solution here:
https://blog.josephscott.org/2009/11/16/imagemagick-convert-pdf-to-jpg-partial-image-size-problem/
You need to change your convert command into:
convert -density 300 -define "pdf:use-cropbox=true" file.pdf[0] -background white -alpha remove -resize 140x140 -strip -quality 40 thumb.jpg[0]
And perhaps add a -resize "2000x2000>" to limit the size of the resulting JPEG, especially with high density values.
I have to change a given PDF from A4 (210mm*297mm) to 216mm*303mm.
The additional 6 mm for each dimension should be set as white border of 3mm on each side. The original content of the PDF pages should be centered on the output pages.
I tried with convert:
convert in.pdf -bordercolor "#FFFFFF" -border 9 out.pdf
This gives me exactly the needed result, but I lose a great deal of sharpness in the original images in the PDF. Everything looks blurry.
I also checked with
convert in.pdf out.pdf
which makes no changes at all but also degrades the images.
So I tried Ghostscript but did not get any result. The best approach I have found so far, from a German site, is:
gs -sOutputFile=out.pdf -sDEVICE=pdfwrite -g6120x8590 \
-c "<</Install{1 1 scale 8.5 8.5}>> setpagedevice" \
-dNOPAUSE -dBATCH in.pdf
but I get Error: /typecheck in --.postinstall--.
By default, Imagemagick converts input PDF files into images at 72 dpi. This is awfully low resolution, as you experienced firsthand. The output of Imagemagick is always a raster image, so if your input PDF contained text, it will no longer be text.
If you don't mind the output PDFs getting bigger, you can simply increase the resolution at which Imagemagick rasterizes the original PDF using the -density option, like this:
convert -density 600 in.pdf -bordercolor "#FFFFFF" -border 9 out.pdf
I used 600 because it is the sweet spot that works well for OCR. I recommend trying 300, 450, 600, 900 and 1200 and picking the best one that doesn't get unwieldy.
Shifting the content on the media is not especially hard, but it does mean altering the content stream of the PDF file, which most PDF manipulation packages avoid, with good reason.
The code you quote above really won't work, it leaves garbage on the operand stack, and the PLRM explicitly states that it is followed by an implicit initgraphics which will reset all the standard parameters anyway.
You could try instead setting a /BeginPage procedure to translate the origin, which will probably work:
<</BeginPage {8.5 8.5 translate} >> setpagedevice
Note that you aren't simply manipulating the original PDF file; Ghostscript takes the original PDF file, interprets it into graphics primitives, then reassembles those primitives into a new PDF file, this has implications... For example, if an image is DCT encoded (a JPEG) in the original, it will be decompressed before being passed into the output file. You probably don't want to reapply DCT encoding as this will introduce visible artefacts.
A simpler alternative, but involving multiple processing steps and therefore more potential for problems, is to first convert the PDF to PostScript with the ps2write device, specifying your media size, and also the -dCenterPages switch, then use the pdfwrite device to turn the resulting PostScript into a new PDF file.
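That two-step route might look like this (a sketch; the -g geometry comes from the question, and the file names are placeholders):

```shell
# Step 1: PDF -> PostScript at the target media size, centering each page.
gs -o mid.ps -sDEVICE=ps2write -g6120x8590 -dCenterPages in.pdf
# Step 2: PostScript back to PDF.
gs -o out.pdf -sDEVICE=pdfwrite mid.ps
```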
Instead of
-g6120x8590 \
-c "<</Install{1 1 scale 8.5 8.5}>> setpagedevice"
(which is wrong), you should use:
-g6120x8590 \
-c "<</Install{8.5 8.5 translate}>> setpagedevice"
or
-g6120x8590 \
-c "<</Install{3 25.4 div 72 mul dup translate}>> setpagedevice"
(which lets Ghostscript calculate the "3mm == 8.5pt" itself...)
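The conversion is easy to check by hand: 3 mm at 72 points per inch, with 25.4 mm per inch, comes to roughly 8.5 points.

```shell
# 3 mm expressed in PostScript points: 3 / 25.4 * 72
awk 'BEGIN { printf "%.1f\n", 3 / 25.4 * 72 }'
# prints 8.5
```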
I have the following:
ghostscript-fonts-5.50-24
ImageMagick-6.7.2-1
ghostscript-9.02-1
Which I use to create a series of JPGs for each page using:
convert -density 175 -colorspace sRGB test.pdf -resize 50% -quality 95 test.jpg
When I run this on my Windows machine all appears to work OK, but on our Linux server we get the black-background problem.
The resulting JPGs have a black background, rendering the image unreadable. What am I missing, or is there something I should be doing to correct this?
I've been all over Google for days, but none of the suggestions seem to work for me.
Any help is much appreciated, thanks in advance :)
EDIT
Just noticed this output when converting one of the PDFs that produces the black background:
**** Warning: Fonts with Subtype = /TrueType should be embedded.
The following fonts were not embedded:
Arial
Arial,Bold
Arial,BoldItalic
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Microsoft® Word 2010 <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
This seems related, but as we don't have control over how the PDFs are produced, we need some way of fixing this server-side.
Thanks again
Ran into this one today, found this:
https://www.imagemagick.org/discourse-server/viewtopic.php?t=20234
Based on that, these should all work:
-flatten
-alpha flatten
-alpha remove
I'm currently using the below for my specific case which works great:
convert -thumbnail "1280x800>" -density 300 -background white -alpha remove in.pdf out.jpg
A simple fix for this issue is to use an image format that supports transparency, such as PNG.
So:
convert -density 175 -colorspace sRGB test.pdf -resize 50% -quality 95 test.png
Problem solved :)
If you want a high-quality result, use this command:
convert -density 700 input.pdf -resize 25% -append -quality 98 -alpha remove output.jpg
For Windows users, use magick instead of convert.