Tesseract cannot recognize my image correctly - captcha

I am developing an Android app now, it needs to recognize captcha from website.
I utilize the tess-two to recognize captcha and follow TrainingTesseract3 instructions to train my own traineddata (using jTessBoxEditor to correct characters), but it cannot recognize correctly and even cannot recognize it.
The below TIFF image is that I use to train my Tesseract, I collect many captchas and merge them into a image.
TIFF image
The image that I want to recognize
For example, the expected result of the above image should be k8666, but the actual result is only 66.
Does anyone give me a help? Thanks.

I tried your images using a .NET wrapper for tesseract-ocr Tesseract-ocr .Net Wrapper by Charliesw.
I got some better results like (K8EEE, K8656), i think you have to increase the text font and make it bold and i saved the image in tiff format with 96DPI resolution to get a better results than mine.

Related

Is it possible to convert fabricjs svg output to pdf without rasterizing?

We are building a web app where the user can make a design by using fabric.js and at the end he should receive a pdf file with his work.
At first, we tried to use JSPDF because it was prefered to have a cliente-side solution. However by doing pdf.addImage(canvas.toDataURL(),...) we are rasterizing the design.
In second place, we tried server side solution using WKHTMLTOPDF, sending canvas.toSVG(), but there are some issues with fonts and shapes rendering.
The designs are complex as they can have text, shapes, images and svg.
We also tried INKSCAPE (inkscape --without-gui --export-pdf ...), MPDF and MUPDF without good results. IMAGEMAGICK is not a solution has it also rasterize the design.
The main goal is to get a vector pdf, where it's possible to increase size and where the elements of the design are selectable, and if possible that pdf should be ready to print (300 dpi and cmyk)
Yes its possible using TCPDF library.
Please check this ImageSVG api for more information for converting SVG to PDF.
https://tcpdf.org/examples/example_058/
Export the canvas to svg and use pdflib to make the pdf.
You can find an exemple here:https://www.pdflib.com/pdflib-cookbook/graphics/starter-svg/

White image while inserting a SVG image in TCPDF

I'm trying to insert some SVG images in a PDF using TCPDF with the method TCPDF::ImageSVG, but when I try this I get a white space.
If I try to enable TCPDF::setRasterizeVectorImages the image shows in the PDF file, but it is rasterized of course and so its quality is not good.
Do you have any idea?
Thank you very much for your help!
Unfortunately, TCPDF's SVG handling is quite limited, and the cause of your issue depends on the SVG you are trying to use. Later versions of TCPDF support more SVG functionality, so if you haven't done so, try using a later version of TCPDF.
If an update doesn't resolve the issue, and you're forced to use raster images, you can improve quality at the cost of file size. You can do this by rasterizing them at a high DPI yourself outside of TCPDF. Once you've done this, take your new high-resolution raster image and add it to your PDF with the Image method like any other raster image. At work we usually rasterize to 300dpi, but your application may call for more or less.
If your image gets added to the PDF far larger on the page than you expected, specify at least one of the dimensions so TCPDF knows how much of the page you're intending the image to use.

PDFBox : Converting to image : Quality loss when converting PDF containing scanned documents

My use case is pretty simple. I need to convert the PDFs to images.I tried using apache pdfbox and i am having some trouble in converting pdfs which contains scanned images. when i convert scanned image the image clarity is lost due to compression/scaling. So i was trying to extract the image data from the PDF and then store it. But the problem is i may get PDF files which will contain images and text in which case i would need to fallback to image conversion mode. The problem is how to differentiate between the pages/documents having only image and the ones with composite data. I was thinking i could use ProcSet defenition for this purpose but looks like it is marked as obsolete and non-reliable according to PDF specifications. Other possibility is to check all the objects linked to that page and see if it contains anything other than images. Please let me know if there is an easier way of doing this
Thanks
If your intention is convert pdf to image, It is better to use ImageMagick for that. If you use ImageMagick, there is a lot options to change the quality of the image. And converting pdf to image is pretty simple using ImageMagick.

Magick++ - Reading JPEG2000 images

I'm trying to read JPEG2000 images in Magick++ (the C++ API of ImageMagick). To read an image I use the following code:
Image img("path/to/my/image.jp2");
But when I try to do this, ImageMagick throws an Exception and doesn´t load the image.
I extract the images out of PDF files. Could it be that something´s different to normal JPEG2000 images? To extract the images I read the stream of Image objects which have a JPXDecode-filter and save them to a file.
Hope someone can help me!
ImageMagick uses a package called JasPer to handle JPEG2000's. According to the wikipedia page on OpenJpeg, JasPer does not completely support the JPEG2000 specification. I have several extrected JPEG2000 that open fine in QuickTime, but fail to decode with ImageMagick.
I have had better results using OpenJpeg to decode the the Jpeg2000. The interface is less flexible, it will convert to PNG and BMP.

Search and Highlight text in PDF for IPad

I am working on the PDF App for iPad and facing an issue: how to search a text in PDF and also how to highlight that text?
Yours is the same big problem I'm having. My understanding is that, currently on iOS 4.0, the main public API is CGPDF . It allows us to parse PDF, and with it we can search strings in it. See also this Quartz 2D document. It also allows us to render it on the screen using CGContextDrawPage. However, it's not yet possible to get the position of a text in the rendered image. (On OS X it's possible using PDFKit.)
So, I'm afraid that you need to implement the PDF spec yourself to get that info. I think GoodReader etc. is working very very hard to implement these.
I had the same trouble recently and then I found FastPDFKit. Have tested the package and it's working great.
http://mobfarm.eu/fastpdfkit