The intuition behind the attention mechanism in Optical Character Recognition - TensorFlow

I am working on Automatic Number Plate Recognition. So far, I have collected 2428 images and manually labelled them with the license number. I went through architectures such as CRNN, attention-OCR and STN-OCR, and tried CRNN: the results were satisfying on a synthetic dataset but too vague on real images, so I am planning to use attention-OCR.

Before implementing attention, I manually checked what the features look like when the plate image is passed through MobileNet. I observed that the 5th channel of the output of the layer block_5_depthwise_BN focuses strongly on the text region of the plate, but the other channels do not behave the same way. My doubt is: if I pass this layer to an attention block, will it be able to focus more on this channel? I would appreciate suggestions for the architecture.
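For reference, this is roughly the kind of probing I did (a minimal sketch that assumes the stock MobileNetV2 from tf.keras.applications; my actual preprocessing and input size differ):

```python
import tensorflow as tf

# Sketch only: stock MobileNetV2 backbone; the real plate preprocessing is omitted.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

# Expose the intermediate feature map I looked at.
feature_extractor = tf.keras.Model(
    inputs=backbone.input,
    outputs=backbone.get_layer("block_5_depthwise_BN").output)

plate = tf.random.uniform((1, 224, 224, 3))   # stand-in for a preprocessed plate image
features = feature_extractor(plate)           # shape (1, H, W, C)

# Channel 5 is the one that appeared to respond to the text region.
channel_5 = features[0, :, :, 5]
print(features.shape, float(tf.reduce_max(channel_5)))
```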

Related

What are the types of problems TensorFlow can help solve? [closed]

The TensorFlow home page describes its purpose as 'a software library for numerical computation'. Looking through the sample problems it looks like a problem is always formulated as follows:
1. Input
2. Model parameters
3. Desired output
Given some training data for 1) and 3), 2) can be computed.
I can see how this can be used to create bots, self-driving cars, image classifiers etc.
Given the broad definition of 'numerical computation', am I missing a class of other problems this can be used for? Can this be used for, say, more classical numerical computations such as the airflow around an aircraft or deformation of a structure under stress? Do you have any examples of how these classical problems would have to be formulated to fit the form above?
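To make this concrete, here is a toy sketch (my own, not taken from the TensorFlow samples) of the kind of classical computation I mean, written directly with TensorFlow ops: a few Jacobi relaxation steps for Laplace's equation on a grid, with no learning involved. I do not know whether this is an idiomatic use of the library.

```python
import tensorflow as tf

# Toy example: solve Laplace's equation on a square grid by Jacobi relaxation.
# The top edge is held at 1, the other edges at 0; no model parameters, no training.
n = 64
top = tf.ones([1, n])
bottom = tf.zeros([1, n])
left = right = tf.zeros([n - 2, 1])

def assemble(interior):
    """Put the interior values back between the fixed boundary rows/columns."""
    u = tf.concat([left, interior, right], axis=1)
    return tf.concat([top, u, bottom], axis=0)

u = assemble(tf.zeros([n - 2, n - 2]))
for _ in range(500):
    # Each interior point becomes the average of its four neighbours.
    interior = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    u = assemble(interior)

print(float(tf.reduce_mean(u)))   # summary statistic of the steady-state field
```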
Here is a nice discussion on what artificial neural networks could do; the fact that our brain is a neural network might imply that eventually an artificial neural network will be able to do the same tasks.
Some more examples of artificial neural networks used today: music creation, image-based location, PageRank, Google Voice, stock trade predictions, NASA star classification, traffic management.
Some fields I know of but do not have a good reference for:
optical quantum mechanics test set-up generator
medical diagnosis (the reference I have is only about safety)
the Sharp LogiCook microwave oven (wiki, NASA mention)
I think there are many millions of "problems" that can be solved with an ANN; deciding on the data representation (input, output) will be a challenge for some of these. Some useful and useless examples I have been thinking about:
a home thermostat that learns your preferences for different weather conditions
bakery production prediction
recognizing go stones on a board and mapping their locations
a personal activity guesser that turns on the appropriate device
recognizing a person based on mouse movement
Given the right data and network, these examples will work.
My dad has a PC controlling the heating system back home; I trained a network based on his 10 years of heating data (outside temperature, inside temperature, humidity, etc.), but unfortunately I am not allowed to hook it up.
My aunt and uncle have a bakery; based on 6 years of sales data I trained a network to predict how many breads and buns they should make. It showed me how important the correct inputs are: first I used the day of the year, but when I switched to the day of the week I saw a 15% increase in accuracy.
Currently I am working on a network that will detect a go board in a given image and map all 361 locations, telling me whether a black stone, a white stone, or no stone is present.
Two examples that showed me how much information can be stored in a single neuron and showed different ways to represent data:
Image example, neuron example (unfortunately you have to train both examples yourself, so give them a little time).
On to your example: airflow around an aircraft.
I know next to nothing about airflow calculations, and my attempt would be a really huge 3D input layer where you can "draw" an airplane along with the direction and speed of the airflow.
It might work, but it would require a tremendous amount of computational power; somebody who knows more about this specific topic probably knows a more abstract way of representing the data, resulting in a more manageable network.
This NASA paper talks about a neural network for calculating airflow around a wing. Unfortunately, I do not understand what kind of input they use; maybe it is clearer to you.
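Purely as an illustration of the 3D-input idea (the voxel grid size, channels and outputs below are made up by me; nothing here comes from the NASA paper), something like this tf.keras sketch is what I have in mind:

```python
import tensorflow as tf

# Hypothetical setup: channel 0 holds the voxelized airplane geometry,
# channel 1 the inflow speed along a chosen direction.
voxels = tf.keras.Input(shape=(32, 32, 32, 2))
x = tf.keras.layers.Conv3D(16, 3, activation="relu")(voxels)
x = tf.keras.layers.Conv3D(32, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling3D()(x)
targets = tf.keras.layers.Dense(4)(x)   # e.g. lift/drag summary values (illustrative)

model = tf.keras.Model(voxels, targets)
model.compile(optimizer="adam", loss="mse")
model.summary()
```

A real solver would predict full flow fields rather than a handful of summary numbers, which is where the computational cost explodes.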

What methods can be used to recognize sentence handwriting?

I mean recognition per sentence, not per letter. Think of a doctor's prescription handwriting, which is hard to read, not just normal handwriting.
For example:
I use data mining or machine learning to train on handwritten samples from paper.
A user scans a paper with hard-to-read writing.
The application does image processing.
The output is the sentences from the paper.
Also, what device should be used (scanner or webcam)?
I am a newbie. If possible, I need some examples in VB.NET with EmguCV/OpenCV, and research journals.
Any help would be appreciated.
Welcome to Stack Overflow! The answer to your question is twofold:
a. If you want to recognize handwriting that has already happened, i.e. it is presented to you as an image, you are in trouble. Computer vision is still not good enough to provide you with reasonable accuracy.
b. If you have a chance to recognize handwriting "as it's happening", you are in luck. Download, for example, the Gesture Search app from the Android Play Store and you are in business.
The difference between the two scenarios is subtle but significant. In the second case you have an extra piece of information that makes handwriting recognition possible: the timing of each stroke. In other words, instead of an image with handwriting you have a bunch of strokes that are all labeled with their time stamps. You can think about it as a sequence of lines and curves or as image segmentation; either way, this provides a big hint for the system. Additional help comes from the dictionary on your phone, but this is typically used by any handwriting system.
Android of course has an open source library for stroke recognition (find more on your own). If you still want to go for recognizing images though, you have to first detect text (e.g. as a bounding box) and second use any of the existing engines to process detected regions. For text detection I can recommend MSER. But be careful trying to implement even text detection on your own - you are entering a world of pain here ;). Here is an article that can help.
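For the text-detection step, a minimal MSER sketch with OpenCV's Python bindings could look roughly like this (the file names are placeholders, and in practice you would pass the detected regions to an OCR engine instead of just drawing them):

```python
import cv2

img = cv2.imread("scanned_page.png")              # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)        # candidate text regions

# Draw the candidate boxes so you can inspect what would be sent to the OCR engine.
for (x, y, w, h) in bboxes:
    cv2.rectangle(img, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 1)

cv2.imwrite("text_candidates.png", img)
```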
As for learning how to recognize text from images on the Internet - this can be your plan B or C or Z once you master the above-mentioned stages. Don't try to abuse learning methods and make them do the hard work for you - you will hit a wall if you don't understand what's going on under the hood.

Kinect SDK skeletonization method

I was wondering if there's a way to modify the depth map prior to sending it to the skeletonization algorithm used by the Kinect, for example, if we want to run the skeletonization on the output of a segmented depth image. So far I have reviewed the methods in the SDK, but I haven't been able to find a skeletonization method exposed. It's like you either turn the skeleton on or off, but you have no control over its inputs.
If anyone has any idea regarding this topic I will be much obliged.
Shamita: skeletonization means tracking the joints of the user in real time. I am editing because I can't comment (not enough reputation).
All the joints give a depth coordinate, and I don't think you can mess with the Kinect hardware input stream. But you can categorize the joints according to depth segments. For example, with the live stream you assign each joint to the corresponding category: if its depth is below 10 and above 5, it is in category A. This can be done with the live stream itself because it is just a simple calculation.
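In Python-like pseudocode (the joint names, depth values and thresholds below are made up for illustration; the real depths would come from the Kinect SDK's skeleton stream):

```python
# Stand-in for joint depths read from the skeleton stream (units are illustrative).
joint_depths = {"head": 4.2, "left_hand": 7.5, "right_hand": 11.0}

def depth_category(depth):
    """Assign a joint to a coarse depth segment (thresholds are arbitrary)."""
    if depth < 5:
        return "near"
    if depth <= 10:
        return "A"        # the example above: above 5 and below 10 falls in category A
    return "far"

for joint, depth in joint_depths.items():
    print(joint, depth_category(depth))
```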

Automatic feature extraction from chess board positions

I am working on a project where I take a chess board position (a FEN string converted to binary) and its evaluation score and feed it to a neural network. My aim is to make the neural network differentiate between good and bad positions.
How I encode the position: there are 12 unique pieces in chess, i.e. pawn, rook, knight, bishop, queen and king, for white as well as black. I encode each piece using 4 bits, with 0000 denoting an empty square. So the 64 squares are encoded into 256 bits, and I use 6 more bits to denote game state, like whose turn it is to move, castling status, etc.
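Roughly, the encoding can be built like this (a sketch that uses python-chess only to parse the FEN; the particular piece-to-code mapping and the choice of the 6 state bits are my own illustrative picks):

```python
import chess  # python-chess, used only to parse the FEN string

PIECE_CODES = {  # one 4-bit code per piece type/colour, 0 = empty square
    "P": 1, "N": 2, "B": 3, "R": 4, "Q": 5, "K": 6,
    "p": 7, "n": 8, "b": 9, "r": 10, "q": 11, "k": 12,
}

def encode_position(fen):
    board = chess.Board(fen)
    bits = []
    for square in chess.SQUARES:                  # 64 squares * 4 bits = 256 bits
        piece = board.piece_at(square)
        code = PIECE_CODES[piece.symbol()] if piece else 0
        bits.extend(int(b) for b in format(code, "04b"))
    # 6 game-state bits: side to move, four castling rights, en passant available.
    bits.append(int(board.turn))
    bits.extend(int(board.has_kingside_castling_rights(c)) for c in (chess.WHITE, chess.BLACK))
    bits.extend(int(board.has_queenside_castling_rights(c)) for c in (chess.WHITE, chess.BLACK))
    bits.append(int(board.ep_square is not None))
    return bits                                   # 262 bits in total

print(len(encode_position(chess.STARTING_FEN)))   # 262
```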
Problem: since the input space for chess positions is neither smooth nor uni-modal (one small change in the board position can result in a huge change in the evaluation score), the neural network doesn't learn well. So the next logical step is to somehow extract useful features (like material difference, center control, etc.) and feed them to the network.
I do not want to hand pick the features as I want the network to learn everything by itself. Therefore I am thinking of extracting features automatically using autoencoders. Is there any better way to accomplish this?
Summary: what is the best way to automatically extract features from a chess board position so that they can be fed into a neural network?
UPDATE: to generate training data, I have modified Stockfish to dump its evaluation process into a log file. So every new move (position) it considers is written to a file as a FEN string along with its eval score.
Neural networks can approximate any function. The only consideration is the dimensionality of the search space, which constrains the amount of data you need to get a good approximation.
For a supervised network (you use autoencoders, so I think you use some variant of backpropagation), it's difficult for me to imagine how you plan to do the training using single positions, because you need similar positions in your training set. Maybe your approach is different, but I'm convinced that the second strategy (using features) is more promising. I think using raw positions requires a huge amount of training data to get good results.
For features, take a look here, and at the classical work of Shannon.
I also took useful information from the source code of Crafty.
But you have to extract this information from the FEN string.
Autoencoders are a way to reduce the dimensionality of the data (good because it improves performance). The use of Principal Component Analysis seems to be better, as reported here.
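If you do try the autoencoder route, a minimal tf.keras sketch over the 262-bit position vectors could look like this (layer sizes, training settings and the random stand-in data are arbitrary placeholders):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(262,))                              # encoded board position
encoded = tf.keras.layers.Dense(64, activation="relu")(inputs)     # compressed features
decoded = tf.keras.layers.Dense(262, activation="sigmoid")(encoded)

autoencoder = tf.keras.Model(inputs, decoded)
encoder = tf.keras.Model(inputs, encoded)          # reusable feature extractor
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Stand-in data: random bit vectors in place of real encoded positions.
positions = tf.cast(tf.random.uniform((1000, 262)) > 0.5, tf.float32)
autoencoder.fit(positions, positions, epochs=2, batch_size=32, verbose=0)

features = encoder(positions[:1])                  # 64-dimensional learned features
print(features.shape)
```

The encoder output (or, analogously, the leading principal components) would then be the input to the evaluation network.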
I hope this can help you.

API to break voice into phonemes / synthesize new speech given speech samples?

You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes? Which they can then use to type in any phrase, and make it seem as if the target is saying it?
Does that software exist in an API Version? I don't even know what to Google.
There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
If your aim is to mimic someone else's voice, then another approach is to convert your own voice (instead of assembling phonemes). It is (surprisingly) called voice conversion, e.g. http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm
The technology is called "voice synthesis" and "voice recognition".
The Java API for this can be found here: Java voice JSAPI.
Apple has an API for this: Apple speech.
Microsoft has several; one is discussed here: Vista speech.
Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.
You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonance of the vocal tract (formant), reversing its effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal impulse. This allows the scaling of pitch and formants independently, which leads to a much better gender conversion result than simple pitch shifting.
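As a rough illustration of that source-filter split (my own numpy/scipy sketch, not taken from any particular product; framing, windowing and stability checks are omitted):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(frame, order=12):
    """Fit LPC predictor coefficients by the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

# Stand-in for one 25 ms speech frame at 16 kHz (400 samples of noise here).
frame = np.random.randn(400)
a = lpc_coefficients(frame)

# Inverse filter A(z) = 1 - sum(a_k z^-k) removes the vocal-tract resonances,
# leaving the residual ("glottal") signal that can be re-filtered after
# shifting pitch and formants independently.
residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
print(residual.shape)
```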
I don't know about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com