Tesseract searchable pdf creation doesn't work - pdf

I'm running Tesseract 4.0.0 and tried the following command to create a searchable PDF, but it doesn't seem to work:
tesseract input output pdf
It gives an error :
can't open file "\Program Files\...//pdf.ttf"!
error during processing
The PDF file gets created, but it cannot be opened.
I tried different image formats (jpg, tif, png) with no success.

It does work. I'm not sure which OS you are using, but I found that on Linux a full install was necessary:
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-all
Then, for a German document, for example (originally a multipage TIFF):
tesseract multipage-tiff.tif out pdf -l deu
The manual is useful: https://github.com/tesseract-ocr/tesseract/wiki
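For anyone scripting this, here is a minimal Python sketch of the same call. It assumes tesseract is on the PATH and that the language data (deu in the example) is installed. The helper names build_tesseract_pdf_command and make_searchable_pdf are made up for illustration and are not part of Tesseract.

```python
import subprocess


def build_tesseract_pdf_command(image_path, output_base, lang=None):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE [-l LANG] pdf."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    # "pdf" names a config file shipped with Tesseract; it is spelled in lowercase.
    cmd.append("pdf")
    return cmd


def make_searchable_pdf(image_path, output_base, lang=None):
    """Run tesseract and return the path of the PDF it is expected to write."""
    cmd = build_tesseract_pdf_command(image_path, output_base, lang)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if tesseract fails
    return output_base + ".pdf"


if __name__ == "__main__":
    # Only prints the command; call make_searchable_pdf() to actually run it.
    print(build_tesseract_pdf_command("multipage-tiff.tif", "out", lang="deu"))
```

Building the command as a list (rather than one shell string) avoids quoting problems with paths that contain spaces, such as those under "Program Files".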

Related

Running tesseract 4.1 with openjpeg2 - cannot produce pdf output

I have installed on my RedHat machine:
(py36_maw) [rvp#lib-archcoll box]$ tesseract -v
tesseract 4.1.0
leptonica-1.78.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libopenjp2 2.3.1
Found SSE
Following what docs I could find, I ran this to produce PDF output:
(py36_maw) [rvp#lib-archcoll box]$ time tesseract test.jp2 out -l eng PDF
read_params_file: Can't open PDF
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 275
That takes 10 seconds and produces the file out.txt, with a good OCR-to-text conversion.
However, it tries to read a file called PDF, and I cannot figure out how to get PDF output.
I have read various docs. The most promising advice seems to be to edit the config file, but the only docs I could find by googling 'tesseract 4.1 config' list many config variable names for older versions of tesseract, and none of them seems to indicate that I can request PDF output, much less for tesseract 4.1 specifically.
How can I invoke tesseract 4.1 (using libopenjp2 2.3.1) via CLI to produce pdf output from my jp2 input file? Bonus question: how can I get it to produce both txt and pdf output in one run?
Robert
After more digging, here are the steps that worked for me. They assume you have done some reading too and know what tesseract uses TESSDATA_PREFIX for.
Download the pdf.ttf file from: https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf
Copy pdf.ttf to the directory that $TESSDATA_PREFIX points to, and make sure that variable is exported in your shell.
TIP: Use the command tesseract --print-parameters to discover the variable names you can use in your own config file.
Go to the directory containing test.jp2 and create a file named config with these lines:
tessedit_create_pdf 1 Write .pdf output file
tessedit_create_txt 1 Write .txt output file
(Note: or you may be able to put the config file in the TESSDATA_PREFIX directory as well and let it always be the default. Not tested.)
Run in that dir:
$ tesseract test.jp2 outputbase -l eng config
Verify your success: it runs and produces files outputbase.txt and outputbase.pdf. The txt file looks good and the searchable pdf looks and works OK in a pdf viewer, that is, you can search and find text strings.
Hope this helps someone else!
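The steps above can be sketched in Python roughly as follows. The two parameter names come from the config lines above; the function names (pdf_font_available, write_tesseract_config, run_ocr) and the file name config are only illustrative, and the sketch assumes tesseract is on the PATH.

```python
import os
import subprocess


def pdf_font_available(tessdata_prefix):
    """Return True if pdf.ttf exists in the given tessdata directory."""
    return os.path.isfile(os.path.join(tessdata_prefix, "pdf.ttf"))


def write_tesseract_config(path):
    """Write a config file that asks Tesseract for both .pdf and .txt output."""
    with open(path, "w") as fh:
        fh.write("tessedit_create_pdf 1\n")
        fh.write("tessedit_create_txt 1\n")
    return path


def run_ocr(image_path, output_base, config_path, lang="eng"):
    """Run: tesseract IMAGE OUTPUTBASE -l LANG CONFIG, and return the expected outputs."""
    cmd = ["tesseract", image_path, output_base, "-l", lang, config_path]
    subprocess.run(cmd, check=True)
    return [output_base + ".pdf", output_base + ".txt"]


if __name__ == "__main__":
    # Only reports on the font; call run_ocr() yourself to start the conversion.
    prefix = os.environ.get("TESSDATA_PREFIX", "")
    if prefix and pdf_font_available(prefix):
        print("pdf.ttf found in", prefix)
    else:
        print("pdf.ttf not found; check TESSDATA_PREFIX")
```

Checking for pdf.ttf up front gives a clearer message than the "can't open file ... pdf.ttf" error that tesseract prints when the font is missing.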

Specify default substitution font when converting pdf to image using imagemagick and font is missing

I am using Spatie/pdfToImage, which builds on Ghostscript and ImageMagick, on my server to:
Take a multi-page PDF from an email using Mailgun routing.
Save the PDF in the folder /docs_pdf as file.pdf.
Use a foreach to loop through the pages and save each page as a PNG in /docs as file_#.png.
Locally, where I use Laravel Valet, everything works fine.
On my server (DigitalOcean, set up through Laravel Forge), the Swedish text in a multi-page PDF turns from normal Swedish into a bunch of random letters and signs.
The left is correct (yes, it's true, it's Swedish) and the right is wrong:
Someone suggested to me that this is probably a matter of the font missing on the server. The fonts used in the pdf:
<</StemV 68/FontName/PSQHMO+FoundrySans-Normal/FontFile2 216 0 R/FontStretch/Normal/FontWeight 400/Flags 32/Descent -240/FontBBox[-40 -240 960 916]/Ascent 916/FontFamily(FoundrySans-Normal)/CapHeight 667/XHeight 465/Type/FontDescriptor/ItalicAngle 0>>
<</StemV 100/FontName/MLHPWU+FoundrySans-Medium/FontFile2 217 0 R/FontStretch/Normal/FontWeight 400/Flags 32/Descent -241/FontBBox[-42 -241 1008 916]/Ascent 916/FontFamily(FoundrySans-Medium)/CapHeight 667/XHeight 470/Type/FontDescriptor/ItalicAngle 0>>
<</StemV 68/FontName/SUEECI+FoundrySans-Normal/FontFile2 218 0 R/FontStretch/Normal/FontWeight 400/Flags 4/Descent -240/FontBBox[-40 -240 960 916]/Ascent 916/FontFamily(FoundrySans-Normal)/CapHeight 667/XHeight 465/Type/FontDescriptor/ItalicAngle 0>>
<</StemV 48/FontName/KIDDUY+FoundrySans-Light/FontFile2 9 0 R/FontStretch/Normal/FontWeight 400/Flags 32/Descent -248/FontBBox[-28 -248 978 924]/Ascent 924/FontFamily(FoundrySans-Light)/CapHeight 667/XHeight 458/Type/FontDescriptor/ItalicAngle 0>>
Here is configuration of fonts in imagemagick and ghostscript:
https://www.imagemagick.org/script/resources.php
how can this be solved?
Update:
I have now made a clean install on a new server.
Installed Imagick and spatie/pdfToImage
As suggested by KenS I ran
gs -sDEVICE=png16m -o out%d.png file.pdf
terminal output
forge#Server:~/app/storage/app/public/files$ gs -sDEVICE=png16m -o test_out%d.png file.pdf
GPL Ghostscript 9.22 (2017-10-04)
Copyright (C) 2017 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 2.
Page 1
Page 2
The document rendered the same way, i.e. wrong.
I am at a complete loss and don't know what the next step might be.
Update 2:
I also ran the ImageMagick convert command, and the image rendered the same way.
So whether I use Ghostscript alone, ImageMagick, or spatie/pdfToImage, I get the same output.
Well, the current version of Ghostscript (9.25) renders this acceptably for me; that is the text appears to be correct. All the fonts are embedded, so there shouldn't be any problems.
And this means that even if you did replace the default font substitution, it wouldn't help, because Ghostscript shouldn't be using the default font, it will be using the fonts embedded in the PDF file.
Without knowing what version of Ghostscript you are using (I see from a later comment that it's 9.25), or the command line that is used to start it, I can't really do a like-for-like comparison. It's hard for me to see how you could be getting such a different result, though. That looks like Ghostscript has failed to find the embedded fonts.
It's possible that whatever package you are using has done something 'unfortunate'. The various package maintainers on Linux add their own patches, and sometimes modify the way that Ghostscript is built. Possibly that has broken something.
If you are able to build Ghostscript yourself you could try cloning our Git repository and doing that. You could also try downloading the Linux binaries off our website. They won't work with every Linux distribution (different ABI) but you can try, you might be lucky.
You could also try running Ghostscript directly on the PDF file. Something like:
gs -sDEVICE=png16m -o out%d.png input.pdf
should produce 2 PNG files, out1.png and out2.png. It will also produce a bunch of stuff on the terminal. That back channel output is valuable information for me so if you can reproduce the problem, I'd like to see that too.
One last thought: it's possible to have more than one version of Ghostscript installed simultaneously, so your current setup may be using an old version of Ghostscript.
I can't help you with ImageMagick or Spatie, but if you can debug those to the point where you can reproduce the problem with a plain Ghostscript command line then I can look further at it.
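To check the "more than one version installed" possibility, here is a rough Python sketch that lists every gs executable on the PATH and asks each one for its version. It assumes a Unix-like system where the binary is called gs; the helper names find_all_on_path and ghostscript_version are invented for this example.

```python
import os
import subprocess


def find_all_on_path(name, path=None):
    """Return every executable called `name` in the PATH directories, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if directory and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            found.append(candidate)
    return found


def ghostscript_version(gs_path):
    """Return the version string printed by `gs --version`."""
    result = subprocess.run(
        [gs_path, "--version"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The first entry is the one a plain `gs` command would run.
    for gs in find_all_on_path("gs"):
        print(gs, ghostscript_version(gs))
```

If this prints more than one line, a wrapper such as ImageMagick may be picking up a different Ghostscript from the one you tested by hand.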
Finally got it to work. First I want to give kudos to KenS, who really helped me; without him it would not have worked.
This is what I did:
1 - I removed Ghostscript:
sudo apt-get purge --auto-remove ghostscript
then
wget https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs925/ghostscript-9.25.tar.gz
tar xvf ghostscript-9.25.tar.gz
Enter the unpacked folder and do
./configure
make
make install
then
sudo ln -s /usr/local/bin/gs /usr/bin/gs
On top of the above I did:
sudo add-apt-repository ppa:glasen/freetype2
and then:
sudo apt update && sudo apt install freetype2-demos

error Converting PDF to PNG - Python 3.6 and GhostScript

I am having a lot of trouble getting code to convert a PDF file to PNG working on Python 3.6, Windows 10.
I know what you are going to say: google it!
But almost everything I've found was for Python 2.7, and some packages haven't been updated.
From what I've seen so far, the best way to do it is with Wand, right? (I have installed ImageMagick before.)
from wand.image import Image
# Converting first page into JPG
with Image(filename='0.pdf') as img:
    img.save(filename="/temp.jpg")
# Resizing this image
Here was my second error :
wand.exceptions.DelegateError: PDFDelegateFailed
`The system cannot find the file specified.' # error/pdf.c/ReadPDFImage/809
So I read that I need Ghostscript. I installed it, but the package is for Python 2.7 and it doesn't work. I found python3-ghostscript 0.5.0: https://pypi.python.org/pypi/python3-ghostscript/0.5.0
New error :
RuntimeError: Can not find Ghostscript DLL in registry
So here I needed to install Ghostscript 9 :
https://www.ghostscript.com/download/gsdnld.html
First of all, it's not a GPL license... And it's not even a package but a program. I don't know how I can use it in my future Python code.
and there is still an error :
RuntimeError: Can not find Ghostscript DLL in registry
And I can't find anything about it.
Ghostscript is licensed under the AGPL; the licence can be found in /Program Files (x86)/gs/gs9.21/doc. If you want the sources, they are available from the Ghostscript Git repository. Note that I'm assuming you are running on Windows, since you refer to the Registry.
If you install the prebuilt binary, it will create an entry in the Windows Registry. I assume that's what your Python code is looking for, but I can't be sure. You should make sure you install the version with the word size (32- or 64-bit) required by Python, if it cares.
You can, of course, simply run Ghostscript to render a PDF file and produce PNG output.
gswin32c -sDEVICE=png16m -sOutputFile=out%d.png input.pdf
This will create one file per page of the input PDF file. Use gswin64c for the 64-bit version.
You can alter the resolution of the output with the -r switch, e.g. -r300.
I presume you can simply fork a process from Python. Otherwise you'll have to get someone to tell you what the Python script is looking for in the Registry. Perhaps it's looking for a specific version of Ghostscript, or the 32-bit version, or something.
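Forking a process from Python could look something like the sketch below. It assumes the Ghostscript console executable (gswin64c here, gswin32c for the 32-bit build) is on the PATH. The function names are hypothetical, and -dBATCH and -dNOPAUSE are my addition so that Ghostscript exits instead of waiting at its interactive prompt when it is launched from a script.

```python
import subprocess


def build_gs_png_command(pdf_path, output_pattern="out%d.png",
                         resolution=None, executable="gswin64c"):
    """Return the Ghostscript argument list that renders each PDF page to a PNG."""
    # -dBATCH/-dNOPAUSE (added here, not in the command above) make gs quit when done.
    cmd = [executable, "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m"]
    if resolution is not None:
        cmd.append("-r{}".format(resolution))  # e.g. -r300 for 300 dpi
    cmd += ["-sOutputFile=" + output_pattern, pdf_path]
    return cmd


def render_pdf_to_png(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_png_command(pdf_path, **options), check=True)


if __name__ == "__main__":
    # Only prints the command; call render_pdf_to_png() to actually run it.
    print(build_gs_png_command("input.pdf", resolution=300))
```

This avoids the Registry lookup entirely, since the script never needs to locate the Ghostscript DLL.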

Convert: Postscript delegate failed

I am trying to convert a PDF to JPEG:
$ convert pdf-test.pdf pdf-test.pdf.jpg
However, I am getting this error:
convert: Postscript delegate failed `pdf-test.pdf': No such file or directory # error/pdf.c/ReadPDFImage/664.
convert: missing an image filename `pdf-test.pdf.jpg' # error/convert.c/ConvertImageCommand/3015.
Currently I am using this version of GS and ImageMagick on Mac OS X Lion:
$ gs -v
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
$ convert -version
Version: ImageMagick 6.7.1-1 2011-07-21 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2011 ImageMagick Studio LLC
Features: OpenMP
Can anybody enlighten me on this?
I was receiving the same error message. After I installed gs, the same command worked properly.
Try installing GS:
$ brew install gs
Well, it's telling you there is no such file or directory. Presumably you have checked that the file exists. Have you tried using ./pdf-test.pdf, or a fully qualified path?
Have you tried opening the file directly with GS rather than through ImageMagick, just to check that GS is working properly? Something like:
gs ./pdf-test.pdf
ought to open the PDF file in a window.
ImageMagick sometimes throws this error when you choose too high a resolution. Use the -density parameter to limit it, e.g. -density 200.
I encountered this problem today, and it seemed to be related to the /tmp volume filling up. Specifically, it was the magick-* files that were filling the storage.
Freeing up space in /tmp solved the problem for me.
I encountered the same problem with MAMP 3.05 on Mac OS X 10.6.8 when trying to convert PDF files with PHP and Imagick to other formats. Conversion doesn't work and gives an error like "Postscript delegate failed... No such file...".
There is already a "gs" file in /Applications/MAMP/Library/bin/lib, which comes with the MAMP 3.05 package. Unfortunately, this file does not seem to be in the right location, which may explain why Ghostscript doesn't work.
The right place for the "gs" file is /usr/bin. I tried to put an alias of the "gs" file from the MAMP folder in /usr/bin, but it didn't work.
The method that works is a fresh install of GS. Download the installer package from http://pages.uoregon.edu/koch/. The latest update is 9.14, but the site says it is buggy in some cases. For this reason, I preferred to install Ghostscript 9.10.
Once downloaded, launch the Ghostscript package. It's very easy! GS installs itself in /usr/local/bin. Copy the "gs" alias from /usr/local/bin to /usr/bin. To do this, you must reveal hidden files in the Finder with a tool like Onyx; choose your Mac OS X version at http://www.titanium.free.fr/downloadonyx.php
Restart MAMP and/or your computer. Now GS works properly, and PDF files can be converted to other picture formats.
Hope this is helpful.
I got an extremely similar error message from PHP/Imagick/GS; it turned out the PDF in question was password-protected / encrypted. So maybe that's another possible cause.
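Several of the causes mentioned in this thread (missing input file, gs not installed, temporary storage full) can be checked before calling convert. Here is a rough Python sketch; the function name diagnose_pdf_conversion and the 100 MB threshold are arbitrary choices for illustration, and it assumes ImageMagick writes its magick-* files to the system temp directory.

```python
import os
import shutil
import tempfile


def diagnose_pdf_conversion(pdf_path, min_free_bytes=100 * 1024 * 1024):
    """Return a list of likely problems; an empty list means none of these checks failed."""
    problems = []
    if not os.path.isfile(pdf_path):
        problems.append("input file not found: " + pdf_path)
    if shutil.which("gs") is None:
        problems.append("gs (Ghostscript) is not on the PATH")
    # Free space in the temp directory (assumed to be where magick-* files go).
    free = shutil.disk_usage(tempfile.gettempdir()).free
    if free < min_free_bytes:
        problems.append("temporary directory is low on space")
    return problems


if __name__ == "__main__":
    for problem in diagnose_pdf_conversion("pdf-test.pdf"):
        print(problem)
```

It does not detect password-protected PDFs or an oversized -density value; those still have to be checked by hand.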

Linux: Command Line Utility Convert RTF to PDF?

Any recommendations to convert an RTF to a PDF? I need to do this from my LAMP application, so a command line utility like GhostScript would be ideal.
Alternatively, you can use libreoffice for this task:
libreoffice --headless --invisible --norestore --convert-to pdf source-file.rtf
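If you need to call this from a script, a minimal Python sketch might look like the following. It assumes the libreoffice binary is on the PATH; the helper names are made up, and the optional --outdir argument is a LibreOffice option that is not used in the command above.

```python
import subprocess


def build_libreoffice_pdf_command(source_file, out_dir=None):
    """Return the argument list for a headless LibreOffice RTF-to-PDF conversion."""
    cmd = ["libreoffice", "--headless", "--invisible", "--norestore",
           "--convert-to", "pdf"]
    if out_dir:
        cmd += ["--outdir", out_dir]  # write the PDF somewhere other than the cwd
    cmd.append(source_file)
    return cmd


def convert_rtf_to_pdf(source_file, out_dir=None):
    """Run the conversion; raises CalledProcessError if LibreOffice fails."""
    subprocess.run(build_libreoffice_pdf_command(source_file, out_dir), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert_rtf_to_pdf() to actually run it.
    print(build_libreoffice_pdf_command("source-file.rtf"))
```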
sudo apt-get install ted
/usr/share/ted/Ted/rtf2pdf.sh source-file dest-file
In my Ubuntu 10.4 I have unrtf, which "converts RTF to HTML, LaTeX, Postscript". From Postscript it should be a trivial application of ps2pdf to get PDFs.
unoconv does this very conveniently. (FYI, I'm currently using version 0.5-1.) I first have to run the unoconv --listener & command, followed by, for example, unoconv *.rtf.
UPDATE: I can verify that version 0.6 of unoconv behaves as described above on my Debian Jessie machine. However, the unoconv --listener & command is no longer necessary (indeed, it seems to cause difficulties if you later try to open a LibreOffice file).
Under CentOS 6, these steps worked for me to convert RTF to PDF from a PHP file:
yum install Ted
yum install ghostscript
Download rtf2pdf.sh to some path like: /var/www/html/lib, where Apache has sufficient permissions
shell_exec('sh /lib/rtf2pdf.sh /files/test.rtf /files/test.pdf');