How can I parse a captcha image whose data changes on reload? - captcha

How can I parse a captcha image or extract the data from it? The data is part of the image, and it changes with every reload. How can I read the data off the image? Can I do anything with the image's data URL?
The following is an example of such a captcha:
http://enquiry.indianrail.gov.in/ntes/CaptchaServlet?action=getNewCaptchaImg&t=1400870602238

Using OCR (Optical Character Recognition) is the first step. Below are two examples of tools/APIs that can help you with that.
Try Tesseract.
Tesseract is probably the most accurate open source OCR engine
available. Combined with the Leptonica Image Processing Library it can
read a wide variety of image formats and convert them to text in over
60 languages.
For more info check: https://code.google.com/p/tesseract-ocr/
You can also try OCRopus
OCRopus is an OCR system written in Python, NumPy, and SciPy focusing
on the use of large scale machine learning for addressing problems in
document analysis.
For more info check: https://code.google.com/p/ocropus/
For detailed info with a code sample on how to do this, check Ben Boyter's article Decoding CAPTCHA's at: http://www.boyter.org/decoding-captchas/
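As a starting point, here is a minimal sketch of running Tesseract over the captcha from Python. The pytesseract wrapper is an assumption (any Tesseract binding would do), and a raw pass like this will usually need the kind of preprocessing Boyter's article describes before the accuracy is usable:

    import urllib.request

    from PIL import Image
    import pytesseract  # assumed wrapper; Tesseract itself must be installed

    # Fetch the current captcha image (it changes on every request)
    url = ("http://enquiry.indianrail.gov.in/ntes/CaptchaServlet"
           "?action=getNewCaptchaImg&t=1400870602238")
    urllib.request.urlretrieve(url, "captcha.png")

    # Raw OCR pass; expect poor accuracy without cleanup/binarization
    text = pytesseract.image_to_string(Image.open("captcha.png"))
    print(text.strip())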

Related

create a geotiff file from undocumented tif image

I have an undocumented TIFF image which I need to use with a piece of software that can read only GeoTIFF files. My simplest idea was to pretend the image is at 0N, 0W with a pixel size of 0.00000899928° (1 m) in both directions.
I have read the thread here but I was unable to reproduce the answer.
Thanks for helping. I am a dummy in geodesy, GIS and the like.
You are attempting to georeference a raster, which is often a difficult task with multiple possible techniques. It's not possible to provide an answer to your question given the information you have supplied. Also, never assume that lengths in degrees can be converted to lengths in metres (the Earth isn't flat).
Search around GIS.SE for ideas, e.g. using the [georeferencing] tag. There are tools available in QGIS to help manually georeference rasters against other geospatial data.
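If, with those caveats in mind, you still want to stamp a nominal origin and pixel size onto the raster, a minimal sketch with GDAL's Python bindings could look like this (the file names and geotransform values are placeholders taken from the question):

    from osgeo import gdal, osr

    # Copy the undocumented TIFF so the original stays untouched
    src = gdal.Open("input.tif")
    dst = gdal.GetDriverByName("GTiff").CreateCopy("output_geo.tif", src)

    # Geotransform: (origin x, pixel width, row rotation,
    #                origin y, column rotation, pixel height)
    # Placeholders per the question: origin at 0N 0W, ~1 m expressed in degrees
    dst.SetGeoTransform((0.0, 0.00000899928, 0.0, 0.0, 0.0, -0.00000899928))

    # Declare WGS84 as the coordinate reference system
    srs = osr.SpatialReference()
    srs.ImportFromEPSG(4326)
    dst.SetProjection(srs.ExportToWkt())

    dst.FlushCache()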

Could someone explain to me about the training Tesseract OCR?

I'm trying to do the training process, but I don't understand even how to start. I would like to train it to read numbers. My images are from the real world, so the reading process didn't go so well.
It says that I have to have a ".tif" image with the examples... is that a single image of every number (in this case), or one image with a lot of different kinds of numbers (same font, though)?
And what about makebox? The command didn't work for me.
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
Could someone explain it to me better, at least how to start?
I saw a few programs that do this more quickly; I tried one (SunnyPage 1.8), but it isn't free. Does anyone know any free software that does this? Or a good tutorial?
I'm using Tesseract 3 on Windows 8 (32-bit).
It is important to patiently follow the training wiki on the Google Code project site, multiple times if needed. It is an open source library and is constantly evolving.
You will have to create a training image (TIFF) with a lot of different kinds of numbers; it should probably contain every number you wish the engine to recognize.
Please consider posting the exact error message you got with makebox (the usual invocation is sketched below).
I think Tesseract is the best free solution available. You have to keep working at it and seek help from the community.
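For reference, the box-file generation step from the Tesseract 3 training wiki is a single command; here is a sketch of invoking it from Python, where the eng.myfont.exp0 names are placeholders following the wiki's [lang].[font].exp[N] convention:

    import subprocess

    # Generate a .box file for one training image (placeholder file names)
    subprocess.run(
        ["tesseract", "eng.myfont.exp0.tif", "eng.myfont.exp0",
         "batch.nochop", "makebox"],
        check=True,
    )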
There is a very good post from Cédric here explaining the training process for Tesseract.
A good free OCR program is PDF OCR X, which is also based on Tesseract. I tried to copy my notes in German, which I had scanned at 1200 dpi, and the results were commendable but not perfect. I found that this website - http://onlineocr.net - is a lot more accurate. If you are not registered, it allows a maximum file size of 4 MB from most image formats (BMP, PNG, JPEG, etc.) and PDF. It can output the result as a Word file, an Excel file or a txt file.
Hope this helps.

jpeg2000 file structure viewer

For an application we are developing it is important for us to know how much data (in bytes) is stored in a JPEG2000 code stream for each resolution and quality layer. Does anybody know an application / library that can easily reveal this information?
Take a look at the free JP2 Metadata Editor:
http://j2k-codec.com/mde.html
It doesn't show you data by resolution but at least it shows you the compressed codestream size for each tile. Maybe it will be helpful to you.
I have been using jpylyzer and/or pirl (jp2info) in those cases.
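As a sketch of the jpylyzer route, using its checkOneFile entry point, which returns the file's properties as an XML tree (the element names queried below, such as layers and psot, are assumptions about its codestream reporting and should be checked against your version):

    from jpylyzer import jpylyzer

    # Validate the file and collect its properties as an XML element tree
    result = jpylyzer.checkOneFile("image.jp2")

    # Walk the tree for codestream parameters: quality layers, decomposition
    # levels, and tile-part lengths (psot, in bytes) -- assumed tag names
    for elem in result.iter():
        if elem.tag in ("layers", "levels", "psot"):
            print(elem.tag, elem.text)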

import/embed xml ocr/text info from one pdf to a different pdf

I am trying to optimize the quality/file size of an image-scanned PDF while retaining OCR quality.
I could downsample after OCR of the high-quality PDF document, but the tools I'm using (primarily Acrobat) do not create file sizes as small as exporting lower-DPI, optimized pages from Photoshop and using those pages to create a PDF.
A better solution, if possible, would be to take an image PDF document (800 MB in the current case) that has been OCRed and apply its OCR layer to a lower-resolution, downsampled document.
I can successfully extract the OCR info with coordinates as XML with pdfminer, but I would like to take this and apply it to the same file after it has been downsampled with Photoshop. I thought I read this was possible with pdftk, but I can no longer find this information.
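For reference, the extraction step looks roughly like this with pdfminer.six (the maintained fork); the file name is a placeholder:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    # Print each recognized text block with its bounding box in PDF points
    for page in extract_pages("highres_ocr.pdf"):
        for element in page:
            if isinstance(element, LTTextContainer):
                print(element.bbox, element.get_text().strip())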
Any suggestions would be greatly appreciated.
jack
Can you describe the current way you create your PDFs?
With iText it's possible to set the compression level of the images you add, which may be useful.

How to give best chance of success to an OCR software?

I am using Tesseract OCR (via pytesser) and PIL (Python Imaging Library) for automated testing of an application.
I am checking that the displayed text is OK by taking a screenshot and extracting the text with Tesseract.
I had some issues in the beginning, and it seems to work better since I increased the size of the screenshot using PIL's bicubic interpolation.
Unfortunately, I still get some mistakes, like confusion between '0' and 'O'. I can imagine that I will have other similar issues in the future.
I would like to know if there are techniques to prepare an image in order to help the OCR. Any idea is welcome.
Thanks in advance
Shameless plug and disclaimer: my company packages Tesseract for use in .NET
Tesseract is an OK OCR engine. It can miss a lot and gets readily confused by non-text. The best thing you can do for it is to make sure it gets text only. The next best thing is to give it something sanely binarized (use an adaptive or dynamic threshold to get there) or grayscale, and let it try the binarization itself.
1. Train Tesseract to recognize your font.
2. Make the image extra clean, with enough free space around the characters.
3. Profit :)
Here are a few real-world examples.
The first image is the original (cropped power meter numbers).
The second image is slightly cleaned up in GIMP, giving around 50% OCR accuracy in Tesseract.
The third image is the completely cleaned version - 100% recognized without any training!
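A minimal sketch of doing that cleanup programmatically with PIL: convert to grayscale, upscale, then apply a simple global threshold. The fixed cutoff of 128 is an assumption; an adaptive threshold usually handles uneven lighting better:

    from PIL import Image

    img = Image.open("screenshot.png").convert("L")  # grayscale

    # Upscale; small text benefits from interpolation before OCR
    img = img.resize((img.width * 3, img.height * 3), Image.BICUBIC)

    # Simple global binarization with an assumed cutoff of 128
    img = img.point(lambda p: 255 if p > 128 else 0)
    img.save("prepared.png")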
Even under the best conditions, OCR variants will sneak up on you. Your best option is to design your tests to be aware of them.
For distinguishing between 0 and O, one simple solution is to choose a font that distinguishes between the two (e.g. one where 0 has a dash or dot in its middle). Would that be acceptable in your application?
Another solution is to apply a dictionary-based step after the character-by-character analysis of the text: feed the recognized text into some form of spell-checker or validator to differentiate between difficult characters.
For instance, a round symbol followed by other numbers is most likely a zero, while the same symbol followed by letters is most likely a capital O. It's a trivial example, but it shows how context is necessary to make a more reliable OCR system.
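As a trivial sketch of that kind of context rule (the token-level logic here is an illustration, not a full validator):

    def disambiguate(token: str) -> str:
        # Look at the unambiguous characters around the round glyphs
        rest = [ch for ch in token if ch not in "0Oo"]
        if rest and all(ch.isdigit() for ch in rest):
            # Surrounded by digits: round glyphs are most likely zeros
            return token.replace("O", "0").replace("o", "0")
        if rest and all(ch.isalpha() for ch in rest):
            # Surrounded by letters: round glyphs are most likely capital O
            return token.replace("0", "O")
        return token

    # e.g. "2O14" -> "2014", "W0RD" -> "WORD"
    print(disambiguate("2O14"), disambiguate("W0RD"))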