How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility?

I have installed Apache Tika 1.8 and it runs perfectly, except that the OCR part is not working. I have Tesseract installed, and it also works properly.
When I try to send a PDF with an image in it, I get the following:
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via
TikaConfig.
Can I configure the TikaConfig using the command line utility, or do I have to clone the project, update the POMs, and rebuild? I really do not want to have to do that.
There is some info here on how to use the command line utility and the TikaConfig but I cannot figure out how to enable TesseractOCRParser with it.
Any help is greatly appreciated.

OK, so with the help of this post on the Apache Tika forum (thank you, guys), I managed to get it working.
It's a hack, but it works. What I did was extract the tika-app JAR file, then locate PDFParser.properties and change the following properties like this:
extractInlineImages true
extractUniqueInlineImagesOnly false
ocrStrategy ocr_and_text_extraction
Then locate TesseractOCRConfig.properties
and change this one property to 1:
enableImageProcessing=1
Save both properties files and zip everything up again.
Use the newly zipped JAR file, and it will now extract both the regular text and the text from images in a PDF file.
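For anyone who struggles to re-zip the JAR by hand, the same edit can be sketched in Python with the standard zipfile module. This is only a sketch under assumptions: the entry names PDF_PROPS and OCR_PROPS are my guess at where the two properties files live inside the JAR (check with a zip viewer first), and patch_tika_jar is a made-up helper name, not part of Tika.

```python
import re
import zipfile

# Assumed locations of the two files inside the tika-app JAR; verify before use.
PDF_PROPS = "org/apache/tika/parser/pdf/PDFParser.properties"
OCR_PROPS = "org/apache/tika/parser/ocr/TesseractOCRConfig.properties"


def set_property(text, key, value, sep=" "):
    """Return properties text with `key` set to `value`, appending it if absent."""
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r")([ \t]*[=:][ \t]*|[ \t]+)[^\r\n]*",
        re.MULTILINE,
    )
    if pattern.search(text):
        # Keep the key and its original separator, swap only the value.
        return pattern.sub(lambda m: m.group(1) + m.group(2) + value, text, count=1)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + key + sep + value + "\n"


def rewrite_jar(src, dst, edits):
    """Copy archive `src` to `dst`, passing selected entries through an edit function."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            edit = edits.get(info.filename)
            if edit is not None:
                # Java properties files are ISO-8859-1 by convention.
                data = edit(data.decode("latin-1")).encode("latin-1")
            zout.writestr(info, data)


def patch_tika_jar(src, dst):
    """Apply the property changes described above and write a new JAR."""
    def pdf_edit(text):
        text = set_property(text, "extractInlineImages", "true")
        text = set_property(text, "extractUniqueInlineImagesOnly", "false")
        return set_property(text, "ocrStrategy", "ocr_and_text_extraction")

    def ocr_edit(text):
        return set_property(text, "enableImageProcessing", "1", sep="=")

    rewrite_jar(src, dst, {PDF_PROPS: pdf_edit, OCR_PROPS: ocr_edit})
```

Calling patch_tika_jar("tika-app.jar", "tika-app-ocr.jar") would write the patched copy next to the original, leaving all other entries (including the manifest) as they were.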

I tried user3250052's approach but I was unable to recompress the jar file in a way that was executable. That's owing to my own inexperience with Java, but regardless, the less hacky way is to call a custom tika config file when calling tika:
java -jar tika-app.jar --config=tika-config.xml image.pdf
This is what my tika-config.xml looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
<!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
<service-loader dynamic="true" loadErrorHandler="IGNORE"/>
<encodingDetectors>
<encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
</encodingDetectors>
<translator class="org.apache.tika.language.translate.DefaultTranslator"/>
<detectors>
<detector class="org.apache.tika.detect.DefaultDetector"/>
</detectors>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser"/>
<parser class="org.apache.tika.parser.pdf.PDFParser">
<params>
<param name="extractInlineImages" type="bool">true</param>
</params>
</parser>
</parsers>
</properties>
To build that config file, I first ran:
java -jar tika-app.jar --dump-current-config
That dumps the default config. I took that, put it into tika-config.xml, and added:
<parser class="org.apache.tika.parser.pdf.PDFParser">
<params>
<param name="extractInlineImages" type="bool">true</param>
</params>
</parser>
which I gleaned from https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox) (option 1).
Even though tesseract is enabled by default (so OCR will work out of the box on image files), PDFs do not get OCRed without that option set because, as noted in the above link, "by default, extracting inline images is turned off because some rare PDFs contain thousands of inline images per page, and it has a big hit on performance, both memory usage and time".
Now everything (OCR on image files, OCR of images in or image-based PDFs, and also naturally text extraction of text-based PDFs) works with the java app tika. I found plenty of documentation on getting this to work on the java server tika but very little on the java app tika, so I'm hoping this saves someone the few hours it took me to figure that out (let me know).
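If you would rather script that step than edit the dump by hand, here is a rough Python sketch using xml.etree.ElementTree. The function names are hypothetical, and it assumes the dumped config has already been saved to a file; note that ElementTree drops XML comments when it re-serialises.

```python
import xml.etree.ElementTree as ET

PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"


def add_pdf_parser_params(root, params):
    """Append a PDFParser <parser> element with the given params to <parsers>.

    `root` is the <properties> element of a dumped Tika config;
    `params` maps a param name to a (type, value) pair of strings.
    """
    parsers = root.find("parsers")
    if parsers is None:
        parsers = ET.SubElement(root, "parsers")
    parser = ET.SubElement(parsers, "parser", {"class": PDF_PARSER})
    params_el = ET.SubElement(parser, "params")
    for name, (type_, value) in params.items():
        param = ET.SubElement(params_el, "param", {"name": name, "type": type_})
        param.text = value
    return root


def write_config(dump_path, out_path, params):
    """Read a dumped config file, add the PDFParser block, and write it out."""
    tree = ET.parse(dump_path)
    add_pdf_parser_params(tree.getroot(), params)
    tree.write(out_path, encoding="UTF-8", xml_declaration=True)
```

For the config above, the call would be something like write_config("default-config.xml", "tika-config.xml", {"extractInlineImages": ("bool", "true")}), where default-config.xml is whatever file you saved the dump to.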

I would recommend using ocrStrategy auto.
This tries to extract the text first and then falls back to OCR.

Related

securimage_show works, example_form fails

I'm adding security to a 'forgot password' process and using the securimage code. I know it is not recommended, but it is only 1 step in a multi-step validation ... just using the tools I have available until something better comes along
I simply installed the securimage files in a root directory under public_html and ran the compatibility check (all good except the LAME MP3 support).
When I run securimage_show by itself, I get a good image. When I run example_form or captcha.html, I get a background image with 'Failed to load TTF font file'. I looked for errors and did some debugging, and I do not see any difference in the debug log with respect to the TTF file location or name.
Any pointers would be appreciated!

Running tesseract 4.1 with openjpeg2 - cannot produce pdf output

I have installed on my RedHat machine:
(py36_maw) [rvp#lib-archcoll box]$ tesseract -v
tesseract 4.1.0
leptonica-1.78.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libopenjp2 2.3.1
Found SSE
I try to run, per what docs I can find, to produce pdf output:
(py36_maw) [rvp#lib-archcoll box]$ time tesseract test.jp2 out -l eng PDF
read_params_file: Can't open PDF
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 275
That takes 10 seconds and produces file out.txt with fine OCR to text conversion evident.
However, it tries to read a file called PDF, but I cannot figure how to get PDF output.
I have read various docs. The most promising advice seems to be to edit the config file, but the only docs I can find that look relevant (by googling 'tesseract 4.1 config') list many config variable names for older versions of tesseract, and none of them seems to indicate that I can request PDF output, much less specifically for tesseract 4.1.
How can I invoke tesseract 4.1 (using libopenjp2 2.3.1) via CLI to produce pdf output from my jp2 input file? Bonus question: how can I get it to produce both txt and pdf output in one run?
Robert
After more surfing and digging, assuming the reader also has done some and knows what TESSDATA_PREFIX is used for by tesseract, here are the steps that worked for me:
Download the pdf.ttf file from: https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf
Copy pdf.ttf to your directory $TESSDATA_PREFIX and make sure that variable is exported to your shell.
TIP: Use command: tesseract --print-parameters # to discover defined variable names you can use in your own config file
Go to the directory containing the test.jp2 file and create a file named config with these lines:
tessedit_create_pdf 1 Write .pdf output file
tessedit_create_txt 1 Write .txt output file
(Note: or you may be able to put the config file in the TESSDATA_PREFIX directory as well and let it always be the default. Not tested.)
Run in that dir:
$ tesseract test.jp2 outputbase -l eng config
Verify your success: it runs and produces files outputbase.txt and outputbase.pdf. The txt file looks good and the searchable pdf looks and works OK in a pdf viewer, that is, you can search and find text strings.
Hope this helps someone else!
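For batch use, the steps above can be wrapped in a small Python script. This is a sketch only: the helper names are made up, it writes just the two variables without the trailing descriptions, and it assumes tesseract is on the PATH and pdf.ttf is already in $TESSDATA_PREFIX.

```python
import subprocess
from pathlib import Path


def write_tesseract_config(path, pdf=True, txt=True):
    """Write a config file that switches the .pdf and .txt outputs on or off."""
    lines = [
        f"tessedit_create_pdf {int(pdf)}",
        f"tessedit_create_txt {int(txt)}",
    ]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


def tesseract_command(image, output_base, config, lang="eng"):
    """Build the same command line as in the steps above."""
    return ["tesseract", str(image), str(output_base), "-l", lang, str(config)]


def run_ocr(image, output_base, config="config", lang="eng"):
    """Write the config, then run tesseract; raises if tesseract exits non-zero."""
    write_tesseract_config(config)
    subprocess.run(tesseract_command(image, output_base, config, lang), check=True)
```

run_ocr("test.jp2", "outputbase") should then leave outputbase.txt and outputbase.pdf in the current directory, as in the manual run.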

error Converting PDF to PNG - Python 3.6 and GhostScript

I am having a lot of trouble getting code to convert a PDF file to PNG on Python 3.6, Windows 10.
I know what you are going to say: google it!
But nearly everything I've found was for Python 2.7, and some packages haven't been updated.
From what I've seen so far, the best way to do it is using Wand, right? (I installed ImageMagick beforehand.)
from wand.image import Image
# Converting first page into JPG
with Image(filename='0.pdf') as img:
    img.save(filename="/temp.jpg")
# Resizing this image
Here was my second error :
wand.exceptions.DelegateError: PDFDelegateFailed
`The system cannot find the file specified.' # error/pdf.c/ReadPDFImage/809
So I read that I need Ghostscript. I installed it, but the package is for Python 2.7 and it doesn't work. I found python3-ghostscript 0.5.0: https://pypi.python.org/pypi/python3-ghostscript/0.5.0
New error:
RuntimeError: Can not find Ghostscript DLL in registry
So here I needed to install Ghostscript 9:
https://www.ghostscript.com/download/gsdnld.html
First of all, it's not a GPL license... And it's not even a package but a program. I don't know how I can use it in my future Python code...
And there is still an error:
RuntimeError: Can not find Ghostscript DLL in registry
and I can't find anything about it.
Ghostscript is licensed under the AGPL; the licence can be found in /Program Files (x86)/gs/gs9.21/doc. If you want the sources, they are available from the Ghostscript Git repository. Note that I'm assuming you are running on Windows, since you refer to the Registry.
If you install the prebuilt binary, it will create an entry in the Windows Registry. I assume that's what your Python code is looking for, but I can't be sure. You should make sure you install the version with the word size (32 or 64 bit) required by Python, if it cares.
You can, of course, simply run Ghostscript to render a PDF file and produce PNG output.
gswin32c -sDEVICE=png16m -sOutputFile=out%d.png input.pdf
This will create one file per page of the input PDF file. Use gswin64c for the 64-bit version.
You can alter the resolution of the output with the -r switch, e.g. -r300.
I presume you can simply fork a process from Python. Otherwise you'll have to get someone to tell you what the Python script is looking for in the Registry. Perhaps it's looking for a specific version of Ghostscript, or the 32-bit version or something.
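Forking the process from Python could look roughly like the sketch below, using the standard subprocess module. The function names are made up, it assumes gswin64c (or gswin32c) is on your PATH, and I have added -dBATCH and -dNOPAUSE so that Ghostscript exits when it is done instead of waiting at its prompt.

```python
import subprocess


def gs_png_command(pdf_path, out_pattern="out%d.png", dpi=300, exe="gswin64c"):
    """Build a Ghostscript command line that renders each PDF page to a PNG."""
    return [
        exe,
        "-dBATCH", "-dNOPAUSE",         # run non-interactively and exit when done
        "-sDEVICE=png16m",              # 24-bit colour PNG output device
        f"-r{dpi}",                     # output resolution in dpi
        f"-sOutputFile={out_pattern}",  # %d is replaced by the page number
        str(pdf_path),
    ]


def pdf_to_png(pdf_path, out_pattern="out%d.png", dpi=300, exe="gswin64c"):
    """Run Ghostscript as a child process; raises if it exits non-zero."""
    subprocess.run(gs_png_command(pdf_path, out_pattern, dpi, exe), check=True)
```

Because the arguments are passed as a list, no shell is involved and the %d in the output pattern needs no escaping. This also avoids the Registry lookup entirely, since Python only needs to find the executable.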

python-django Ghostscript apache problem

When I run my app, which converts PDF to PNG, from the Django development server, the conversion works fine. But when I run it from an Apache server, I get this error: GhostscriptError: Fatal. Reading from the stderr of Ghostscript, it says
Initialization file gs_init.ps does
not begin with an integer.
It looks like an initialization error to me, but I have no idea how to fix it.
I'm using Ubuntu, by the way. The gs folder is in the PATH, so I'm not sure whether that is causing the problem.
Here's my code that generates the images:
import os
import ghostscript

def PDF_to_png(input, output):
    # Render each page of the PDF as a 300 dpi PNG named <name>_<page>.png
    args = [
        "-dSAFER",
        "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m",
        "-r300",
        "-sOutputFile=" + os.path.join(output, input.file_name_without_extension) + "_%d.png",
        input
    ]
    ghostscript.Ghostscript(*args)
The error is telling you that the file gs_init.ps which is normally found in gs/Resource/Init/ is not valid. From the header of the file:
------------------------------------------------------------------------
% Interpreter library version number
% NOTE: the interpreter code requires that the first non-comment token
% in this file be an integer, and that it match the compiled-in version!
902
------------------------------------------------------------------------
You can build GS with the resources built in or on disk. I don't know which build you get with Ubuntu, but it sounds like there is a gs_init.ps in the GS search path which has been damaged. This probably means you are using a version with the resources on disk.
You should first try just starting up Ghostscript. If that works, then it's something to do with the environment, which is different when you run the failing instance. Look for environment variables which begin with GS_ (especially GS_LIB). You should also try explicitly defining where GS should look, by including something like this on the command line:
-I/usr/src/gs/Resource
The -I switch includes the specified directory as a search path for Ghostscript (NB: GS does not use the PATH environment variable). GS will search here for initialisation files first, before proceeding to its fallback mechanism.
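To compare the two environments, something like the following Python sketch can be run both from the Django development server and under Apache. The helper names are made up, and the Resource path is just the example from above; use whatever directory your install actually has.

```python
import os


def gs_env_vars(environ=None):
    """Return the environment variables whose names start with GS_ (e.g. GS_LIB)."""
    environ = os.environ if environ is None else environ
    return {name: value for name, value in environ.items() if name.startswith("GS_")}


def gs_command_with_search_path(args, resource_dir, exe="gs"):
    """Build a gs command line with -I<resource_dir> placed before the other switches."""
    return [exe, f"-I{resource_dir}", *args]
```

Logging gs_env_vars() in both setups shows whether Apache hands the process a different GS_LIB, and the list from gs_command_with_search_path(["-dBATCH", "-dNOPAUSE", "input.pdf"], "/usr/src/gs/Resource") can be passed to subprocess.run to test the -I switch directly.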

How to take webpage screenshots with wkhtmltopdf?

I can create a PDF file with wkhtmltopdf on Ubuntu Linux: "wkhtmltopdf www.stackoverflow.com file.pdf"
How do I take webpage screenshots?
You can use wkhtmltoimage as suggested by Mathew. You can go to the first URL and download the file suitable for your architecture (or its source code). Usage is simple:
./wkhtmltoimage-i386 http://www.google.com google.png
wkhtmltoimage - this also uses the webkit render engine for excellent results and is available as standalone binary with no install problems anticipated, just trial and error to get the results you need.