Could someone explain how to train Tesseract OCR? - windows-8

I'm trying to go through the training process, but I don't even understand how to start. I would like to train it to read numbers. My images are from the real world, so recognition didn't go well.
The documentation says that I need a ".tif" image with the examples. Is that a single image for every number (in this case), or one image with a lot of different numbers (same font, though)?
And what about makebox? The command didn't work here.
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
Could someone explain it better, or at least explain how to start?
I saw some software that does this more quickly, but the one I tried (SunnyPage 1.8) isn't free. Does anyone know of free software that does this? Or a good tutorial?
Using Tesseract 3, Windows 8 (32-bit).

It is important to patiently follow the training wiki on the Google Code project site, several times if needed. Tesseract is an open source library and is constantly evolving.
You will have to create a training image (TIFF) with a lot of different types of numbers. It should contain all the numbers you wish the engine to recognize.
Please consider posting the exact error message you got with makebox.
I think Tesseract is the best free solution available. You have to keep working at it and seek help from the community.
There is a very good post from Cédric here explaining the training process for Tesseract.
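As a rough orientation for the makebox step, here is a minimal Python sketch. It assumes the Tesseract 3 naming convention from the training wiki ([lang].[fontname].exp[num].tif) and the six-field box-file line format (character, left, bottom, right, top, page). The language code eng, the font name digits, and both helper names are made up for illustration. The sketch only builds the command line and parses one box line; it does not run Tesseract itself.

```python
from typing import List, NamedTuple


def makebox_command(lang: str, font: str, num: int = 0) -> List[str]:
    """Build the Tesseract 3 command that generates a box file for one training image."""
    base = f"{lang}.{font}.exp{num}"
    # The training image is expected to be named <lang>.<font>.exp<num>.tif
    return ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"]


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<char> <left> <bottom> <right> <top> <page>'."""
    char, left, bottom, right, top, page = line.split()
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


if __name__ == "__main__":
    # "eng" and "digits" are placeholder names, not values from the question.
    print(" ".join(makebox_command("eng", "digits")))
    print(parse_box_line("7 12 30 28 58 0"))
```

Each line of the generated box file describes one character. The wiki expects you to correct wrong characters or coordinates by hand before continuing with the remaining training steps.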

A good free OCR program is PDF OCR X, which is also based on Tesseract. I used it to extract text from my German notes, which I had scanned at 1200 dpi, and the results were commendable but not perfect. I found that this website - http://onlineocr.net - is a lot more accurate. If you are not registered, it accepts files of up to 4 MB in most image formats (BMP, PNG, JPEG etc.) and PDF. It can output the result as a Word file, an Excel file or a txt file.
Hope this helps.

Related

What methods are there for recognizing handwritten sentences?

I mean recognition per sentence, not per letter, for example a doctor's prescription handwriting, which is hard to read. Not just normal handwriting.
For example:
I use data mining or machine learning to train on handwritten paper.
The user scans a paper with hard-to-read writing.
The application does the image processing.
The output is the sentences from the paper.
And what device should I use? (Scanner or webcam)
I am a newbie. If possible, I need some examples in VB.NET with EmguCV/OpenCV, and research journals.
Any help would be appreciated.
Welcome to Stack Overflow! The answer to your question is twofold:
a. If you want to recognize handwriting that has already happened, i.e. it is presented to you as an image, you are in trouble. Computer vision is still not good enough to give you reasonable accuracy.
b. If you have a chance to recognize handwriting “as it's happening”, you are in luck. Download, for example, the Gesture Search app from the Android Play Store and you are in business.
The difference between the two scenarios is subtle but significant. In the second case you have an extra piece of information that makes handwriting recognition possible: the timing of each stroke. In other words, instead of an image with handwriting you have a set of strokes that are all labeled with their time stamps. You can think about it as a sequence of lines and curves or as image segmentation; either way it provides a big hint for the system. Additional help comes from the dictionary on your phone, but this is typically used by any handwriting system.
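To make the idea of time-stamped strokes concrete, here is a small Python sketch. The Stroke class, the millisecond timestamps, and both helper functions are assumptions made for illustration and do not come from any particular handwriting library. It shows the extra information an online recognizer gets: the drawing order and the pauses between strokes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stroke:
    # Each point is (x, y, t): pen position plus a timestamp in milliseconds.
    points: List[Tuple[float, float, float]]

    @property
    def start_time(self) -> float:
        return self.points[0][2]

    @property
    def end_time(self) -> float:
        return self.points[-1][2]


def writing_order(strokes: List[Stroke]) -> List[Stroke]:
    """Return the strokes in the order they were drawn."""
    return sorted(strokes, key=lambda s: s.start_time)


def pauses(strokes: List[Stroke]) -> List[float]:
    """Gap between consecutive strokes; a long gap may hint at a new character."""
    ordered = writing_order(strokes)
    return [b.start_time - a.end_time for a, b in zip(ordered, ordered[1:])]
```

A scanned image contains none of this: the recognizer first has to guess where one stroke ends and the next begins.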
Android of course has an open source library for stroke recognition (find more on your own). If you still want to go for recognizing images though, you have to first detect text (e.g. as a bounding box) and second use any of the existing engines to process detected regions. For text detection I can recommend MSER. But be careful trying to implement even text detection on your own - you are entering a world of pain here ;). Here is an article that can help.
As for learning how to recognize text from images on the Internet - this can be your plan B or C or Z once you have mastered the stages mentioned above. Don’t try to abuse learning methods and make them do the hard work for you - you will hit a wall if you don’t understand what’s going on under the hood.

How can I parse a captcha image whose data changes?

How do I parse a captcha image or get the data from it? The data is part of the image, and it changes on every reload. How do I get the data in the image? Can I do anything with the data URL of the image?
The following is an example captcha:
http://enquiry.indianrail.gov.in/ntes/CaptchaServlet?action=getNewCaptchaImg&t=1400870602238
Using OCR (Optical Character Recognition) is the first step. Below are 2 examples for such tools/APIs that can help you with that.
Try Tesseract.
Tesseract is probably the most accurate open source OCR engine
available. Combined with the Leptonica Image Processing Library it can
read a wide variety of image formats and convert them to text in over
60 languages.
for more info check: https://code.google.com/p/tesseract-ocr/
You can also try OCRopus
OCRopus is an OCR system written in Python, NumPy, and SciPy focusing
on the use of large scale machine learning for addressing problems in
document analysis.
for more info check: https://code.google.com/p/ocropus/
For detailed info with a code sample on how to do this, check Ben Boyter's article Decoding CAPTCHA’s at: http://www.boyter.org/decoding-captchas/

Reducing the size of pdf generated from software using proprietary fonts

I am trying to bring an Indian magazine online. This magazine is typeset in CorelDraw using a proprietary Devanagari font (http://www.modular-infotech.com/html/shreelipi.html). The vendor provides a USB dongle that has to be attached to the machine whenever you want to access the fonts, and this software has been in use for the past 10 years.
To put the magazine online, we've tried to convert it to PDF (by printing). The resulting PDF is on the order of 30-50 MB, even though it does not contain a single image. I am guessing it converts the whole text into an image.
It would be really difficult for users to read this magazine given its size. When I convert it to .swf format (to add flipbook-style functionality), the size drops to 5-6 MB. But there are people who like to download the magazine and then read it. I have had no luck reducing the size of the PDF.
I have done a lot of research on the web. PostScript and PrimoPDF do not help much. The best I could get was a 30% reduction using the DocuCom PDF printer, but that is still 20 MB. I have tried playing with resolution, compression and quality, but the best I could get was 18 MB.
Ideally I would like to reduce it to less than 2MB.
I would be really grateful if you could help me reduce the size of the pdf! Considering that it has no images, I am hopeful that I can get some really good compression.
The (35MB) magazine can be downloaded from: http://merajhola.in/jin-march.pdf
I can't see any easy way to reduce the size of this PDF. There are no embedded fonts, and all the text is drawn using vector graphics primitives. No amount of tweaking the resolution, compression and quality will make a significant difference.
One possible option would be to embed the font as a subset rather than use vector graphics. That will almost certainly make a big difference, however I doubt the proprietary font license will allow it.
I'm sorry, but this Shree-Lipi thing just sounds wrong in 2012. It would be much better to use proper OpenType fonts with modern (say InDesign) or free (say LuaTeX) software.

A batch PNG processor for Windows to pass Google's Page Speed test

I have Google's Page Speed plugin installed: http://code.google.com/speed/page-speed/
It says that I have a lot of PNGs on my site that aren't compressed.
I tried using the RIOT image optimizer: http://luci.criosweb.ro/riot/
However, after attempts with multiple settings, I couldn't get it to pass.
Any ideas? Thanks!
You could try pngcrush, but presumably you'll get much greater savings from converting to JPEG with quality slightly less than 100 (I usually find 92 pretty good). ImageMagick would be the tool of choice for bulk processing.
I never managed to create paletted PNGs, but in principle those should be pretty efficient when you're dealing with illustrations that only use a few colours.
The good PNG optimizers are:
pngout http://advsys.net/ken/utils.htm
pngcrush http://pmt.sourceforge.net/pngcrush/
optipng http://optipng.sourceforge.net/
advpng http://advancemame.sourceforge.net/comp-readme.html
For best results use all 4 in that order.
You can also use pngnq http://pngnq.sourceforge.net/ to reduce the image even more at the cost of some quality. (And after using pngnq, run the image through the optimizers.)
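A batch run of the four optimizers could be sketched in Python roughly as follows. This assumes all four tools are installed and on the PATH, that pngout, optipng and advpng -z rewrite the file in place, and that pngcrush is given a separate output file that then replaces the original. The function names and the ".crushed.png" temporary name are made up for illustration, and the tools' tuning flags are deliberately left out.

```python
import os
import subprocess
from pathlib import Path
from typing import List


def optimizer_commands(png: Path) -> List[List[str]]:
    """Commands for one PNG, in the order suggested above."""
    # pngcrush writes to a separate output file (hypothetical temp name).
    tmp = png.with_name(png.stem + ".crushed.png")
    return [
        ["pngout", str(png)],
        ["pngcrush", str(png), str(tmp)],
        ["optipng", str(png)],
        ["advpng", "-z", str(png)],
    ]


def optimize(png: Path) -> None:
    """Run every optimizer on one file, keeping whatever each tool produced."""
    for cmd in optimizer_commands(png):
        # check=False: a tool may report a non-zero status when it saves nothing.
        subprocess.run(cmd, check=False)
        if cmd[0] == "pngcrush" and Path(cmd[2]).exists():
            os.replace(cmd[2], png)  # swap the crushed copy in for the original


def optimize_folder(folder: str) -> None:
    """Process every PNG below the given folder."""
    for png in Path(folder).rglob("*.png"):
        optimize(png)
```

Run it on a copy of the site's image folder first, since the files are overwritten.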
Thanks for all the suggestions guys. I think in my case the easiest method is to grab the cache files that Google Page Speed produces. Here is the info: http://code.google.com/speed/page-speed/docs/using_firefox.html#savefiles
You'll also need to run it in Firefox, as Chrome doesn't produce the same files.

How to give best chance of success to an OCR software?

I am using Tesseract OCR (via pytesser) and PIL (Python Imaging Library) for automated testing of an application.
I check that the displayed text is correct by taking a screenshot and extracting the text with Tesseract.
I had some issues in the beginning, and it seems to work better since I increased the size of the screenshot using PIL's bicubic interpolation.
Unfortunately, I still get some mistakes, like confusion between '0' and 'O'. I can imagine that I will run into other similar issues in the future.
I would like to know if there are techniques for preparing an image to help the OCR. Any ideas are welcome.
Thanks in advance
Shameless plug and disclaimer: my company packages Tesseract for use in .NET
Tesseract is an OK OCR engine. It can miss a lot and gets readily confused by non-text. The best thing you can do for it is to make sure it gets text only. The next best thing is to give it something sanely binarized (adaptive or dynamic threshold to get there) or grayscale and let it try to do binarization.
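To illustrate what an adaptive threshold does, here is a dependency-free Python sketch that compares each pixel with the mean of its neighbourhood. It is a simple local-mean variant chosen for illustration, not the method of any particular library; the function name, the radius and the offset are assumptions. In practice you would use an image library's implementation, which is far faster.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def adaptive_threshold(img: Gray, radius: int = 1, offset: int = 5) -> Gray:
    """Binarize: a pixel becomes black (0) if it is darker than its local mean minus offset."""
    height, width = len(img), len(img[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(0 if img[y][x] < local_mean - offset else 255)
        out.append(row)
    return out
```

Because the cut-off follows the local brightness, text stays separable from the background even when the lighting varies across the image, which a single global threshold cannot guarantee.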
Train tesseract to recognize your font
Make image extra clean and with enough free space around characters
Profit :)
Here are a few real-world examples.
The first image is the original image (cropped power meter numbers).
The second image was slightly cleaned up in GIMP: around 50% OCR accuracy in Tesseract.
The third image is completely cleaned up: 100% recognized by OCR without any training!
Even under the best conditions OCR variants will sneak up on you. Your best option will be to design your tests to be aware of them.
For distinguishing between 0 and O, one simple solution is to choose a font that distinguishes between the two (e.g. one where 0 has a dash or dot in its middle). Would that be acceptable in your application?
Another solution is to apply a dictionary-based step after the character-by-character analysis of the text - feeding the recognized text into some form of spell-checker or validator to differentiate between difficult characters.
For instance, a round symbol followed by other numbers is most likely to be a zero, while the same symbol followed by letters is most likely to be a capital o. It's a trivial example, but it shows how context is necessary to make a more reliable OCR system.
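The context rule from the last paragraph can be sketched in a few lines of Python. This is a deliberately naive post-processing step with a made-up function name: it looks only at the immediate alphanumeric neighbours of each 'O' or '0' and leaves the character alone when the context is missing or mixed.

```python
def fix_zero_oh(text: str) -> str:
    """Replace 'O'/'0' with whichever one fits the neighbouring characters."""
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch not in "O0":
            continue
        # Use only unambiguous letters or digits right next to the symbol.
        neighbours = [
            text[j]
            for j in (i - 1, i + 1)
            if 0 <= j < len(text) and text[j].isalnum() and text[j] not in "O0"
        ]
        if neighbours and all(n.isdigit() for n in neighbours):
            chars[i] = "0"  # surrounded by digits: most likely a zero
        elif neighbours and all(n.isalpha() for n in neighbours):
            chars[i] = "O"  # surrounded by letters: most likely a capital o
    return "".join(chars)


if __name__ == "__main__":
    print(fix_zero_oh("2O14 was 0K"))
```

A real validator would go further, for example by checking whole words against a dictionary or the expected strings of the test.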