Can we use YOLO to detect and recognize text in an image - TensorFlow

Currently I am using a deep learning model called "YOLOv2" for object detection, and I want to use it to extract text and save it to disk, but I don't know how to do that. If anyone knows more about this, please advise me.
I use TensorFlow.

If you use the pretrained model, you would need to save its outputs (the detected regions) and feed those image regions into a character recognition network, if you are using a neural net, or into another recognition approach.
What you are doing is "scene text recognition". You can check out the Reading Text in the Wild with Convolutional Neural Networks paper, here's a demo and homepage. Github user chongyangtao has a whole list of resources on the topic.
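The two-stage idea above (detect regions, then recognize each one) can be sketched roughly as follows. This is only an illustration: the image is a plain list of pixel rows standing in for a real array, the boxes are assumed to be the detector's output as (x_min, y_min, x_max, y_max) pixel coordinates, and `recognize` is a hypothetical callable standing in for whatever character recognition model you choose.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
Image = Sequence[Sequence[int]]   # rows of pixel values; stands in for a real array


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the region covered by `box` out of `image`, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x_min, y_min, x_max, y_max = box
    x_min, y_min = max(0, x_min), max(0, y_min)
    x_max, y_max = min(width, x_max), min(height, y_max)
    return [list(row[x_min:x_max]) for row in image[y_min:y_max]]


def extract_text(
    image: Image,
    boxes: Iterable[Box],
    recognize: Callable[[List[List[int]]], str],  # hypothetical recognizer
) -> List[str]:
    """Run the recognizer on every detected region, roughly top to bottom, left to right."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [recognize(crop(image, box)) for box in ordered]


def save_text(lines: Iterable[str], path: str) -> None:
    """Write one recognized string per line to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines))
```

With a real detector you would convert its box format to pixel coordinates first, then call something like `save_text(extract_text(image, boxes, recognize), "output.txt")`.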

I had a similar question, and I am building a digit detection model with the SVHN dataset. It is not a finished project yet, but it seems to work well. You can see the code at Yolo-digit-detector.


How to do transfer learning or fine-tune YOLOv4-darknet with some layers frozen?

I'm a beginner in the object detection field.
First, I followed the YOLOv4 custom-training tutorial from here, and I completed it successfully. Then I started to think: if I have a new task that is similar to the one the pre-trained YOLOv4 was trained on (COCO, 80 classes) and I have only a small dataset, it would be great if I could fine-tune the model (unfreezing only the last layer) to keep or even improve the detector's performance using only a small, similar dataset. This reference seems to support my idea about the fine-tuning I want to do.
Then I went to Alexey's GitHub here to check how to freeze layers, and found that I should use stopbackward=1. It says:
"...set param stopbackward=1 for layer-136 in cfg-file"
But I have no idea where "layer-136" is in the cfg-file here, and I also have no idea where to put stopbackward=1 if I only want to unfreeze the last layer (freezing all the other layers). So, to summarize my questions:
Where (on which line) should I put stopbackward=1 in yolov4-custom.cfg if I want to unfreeze the last layer and freeze the other layers?
What is the "layer-136" mentioned in Alexey's GitHub reference? (Is it one of the classifier layers, or something else?)
On which line of yolov4-custom.cfg should I put stopbackward=1 for that layer-136?
Any further information from you is really appreciated. Please advise.
Thank you in advance.
The "layer-136" is located before the head of YOLOv4. To make it easy to see, try visualizing the .cfg file in the Netron app while reading the .cfg in a text editor, so you can understand where each layer is. You can see the input and output (the x-layer) of each layer when you analyze it with Netron.
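If you would rather find the layer by script than by eye, here is a rough sketch. It assumes that darknet numbers layers from 0 in file order and does not count the [net] (or [network]) section, and that stopbackward=1 stops backpropagation at the layer it is set on, so layers before it would not be updated. Please check both against the darknet source or README for your version. The helper names `list_layers` and `add_stopbackward` are made up for this example.

```python
import re
from typing import List, Tuple

SECTION = re.compile(r"^\s*\[(\w+)\]\s*$")
NOT_LAYERS = {"net", "network"}  # assumed: these sections hold options, not layers


def list_layers(cfg_text: str) -> List[Tuple[int, str]]:
    """Return (index, section type) for every layer section, numbered from 0."""
    layers: List[Tuple[int, str]] = []
    for line in cfg_text.splitlines():
        match = SECTION.match(line)
        if match and match.group(1) not in NOT_LAYERS:
            layers.append((len(layers), match.group(1)))
    return layers


def add_stopbackward(cfg_text: str, layer_index: int) -> str:
    """Insert `stopbackward=1` right below the header of the given layer."""
    out: List[str] = []
    current = -1  # index of the most recent layer header seen
    for line in cfg_text.splitlines():
        out.append(line)
        match = SECTION.match(line)
        if match is None or match.group(1) in NOT_LAYERS:
            continue
        current += 1
        if current == layer_index:
            out.append("stopbackward=1")
    if not 0 <= layer_index <= current:
        raise IndexError(f"cfg has no layer {layer_index}")
    return "\n".join(out) + "\n"
```

Printing `list_layers(open("yolov4-custom.cfg").read())[130:140]` would show which section types sit around index 136, and `add_stopbackward(text, 136)` returns the patched text for you to save under a new file name.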

How to know what TensorFlow actually "sees"?

I'm using a CNN built with Keras (TensorFlow) to do visual recognition.
I wonder if there is a way to know what my own TensorFlow model "sees".
Google had a news story showing the cat face in the AI brain.
Can anybody tell me how to extract such an image from my own CNN?
For example, what does my own CNN model recognize as a car?
We have to distinguish between what TensorFlow actually sees:
As we go deeper into the network, the feature maps look less like the
original image and more like an abstract representation of it. As you
can see in block3_conv1 the cat is somewhat visible, but after that it
becomes unrecognizable. The reason is that deeper feature maps encode
high level concepts like “cat nose” or “dog ear” while lower level
feature maps detect simple edges and shapes. That’s why deeper feature
maps contain less information about the image and more about the class
of the image. They still encode useful features, but they are less
visually interpretable by us.
and what we can reconstruct from it as the result of some kind of reverse "deconvolution" process (which is not in fact a true mathematical deconvolution).
To answer your actual question, there are many good example solutions out there; one you can study is: Visualizing output of convolutional layer in tensorflow.
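To make the point about low-level feature maps detecting simple edges concrete, here is a small sketch in plain Python rather than Keras. The kernel is hand-written for illustration (a trained network learns its own filters), and the tiny "image" is just a dark left half next to a bright right half.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` without padding (the cross-correlation CNN layers compute)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k_h) for j in range(k_w))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map: Matrix) -> Matrix:
    """Keep positive responses and zero out the rest."""
    return [[max(0.0, value) for value in row] for row in feature_map]


# 4x6 image: dark (0.0) on the left, bright (1.0) on the right.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]

# Hand-written vertical-edge kernel: responds where brightness rises from left to right.
kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]

feature_map = relu(conv2d_valid(image, kernel))
for row in feature_map:
    print(row)
```

The response is nonzero only in the columns whose window straddles the dark-to-bright boundary, which is the sense in which such a feature map "sees" an edge rather than the picture itself. For a real Keras model, the linked answer shows how to read the activations of an intermediate layer instead.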
When you build a model for visual recognition, you give it labelled data (pictures, in this case) so that it can adjust its weights according to the training data. If you wish to build a model that can recognize a car, you have to train it on a large training set of labelled pictures. This type of recognition is basically classification.
You can experiment with the MNIST dataset, which provides pictures of handwritten digits for image recognition.

Fixing error output from seq2seq model

I want to ask you how we can effectively re-train a trained seq2seq model to remove/mitigate a specific observed error output. I'm going to give an example about Speech Synthesis, but any idea from different domains, such as Machine Translation and Speech Recognition, using seq2seq model will be appreciated.
I learned the basics of seq2seq with attention model, especially for Speech Synthesis such as Tacotron-2.
Using a distributed well-trained model showed me how naturally our computer could speak with the seq2seq (end-to-end) model (you can listen to some audio samples here). But still, the model fails to read some words properly, e.g., it fails to read "obey [əˈbā]" in multiple ways like [əˈbī] and [əˈbē].
The reason seems clear: the word "obey" appears too rarely in our dataset (LJ Speech), only three times out of 225,715 words, so the model had little chance to learn it.
So, how can we re-train the model to overcome the error? Adding extra audio clips containing the "obey" pronunciation sounds impractical, but reusing the three audio clips has the danger of overfitting. And also, I suppose we use a well-trained model and "simply training more" is not an effective solution.
Now, this is one of the drawbacks of seq2seq models that is not talked about much. The model successfully simplified the pipelines of traditional systems; e.g., for Speech Synthesis, it replaced an acoustic model, a text analysis frontend, etc. with a single neural network. But we lost control over the model entirely: it becomes impossible to make the system read a word in a specific way.
Again, if you use a seq2seq model in any field and get an undesirable output, how do you fix that? Is there a data-scientific workaround to this problem, or maybe a cutting-edge Neural Network mechanism to gain more controllability in seq2seq model?
I found an answer to my own question in Section 3.2 of the Deep Voice 3 paper.
They trained both phoneme-based and character-based models, mainly using phoneme inputs and falling back to character input for words that cannot be converted to their phoneme representations.
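As I understand it, the input side of that idea can be sketched like this. It is not the paper's code: the tiny dictionary, its ARPAbet-style entries, and the "@" prefix used to tell phoneme tokens apart from character tokens are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical, tiny pronunciation dictionary (ARPAbet-style symbols, for illustration only).
PRONUNCIATIONS: Dict[str, List[str]] = {
    "obey": ["OW0", "B", "EY1"],
    "the": ["DH", "AH0"],
}


def to_model_input(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Turn text into input tokens: phonemes when the word is known, characters otherwise."""
    tokens: List[str] = []
    for position, word in enumerate(text.lower().split()):
        if position > 0:
            tokens.append(" ")  # keep word boundaries as their own token
        phonemes = lexicon.get(word)
        if phonemes is not None:
            # "@" marks a phoneme token; this marker is a convention assumed here.
            tokens.extend("@" + phoneme for phoneme in phonemes)
        else:
            # Out-of-dictionary word: fall back to one token per character.
            tokens.extend(word)
    return tokens


print(to_model_input("Obey the rulez", PRONUNCIATIONS))
```

With a model trained on inputs like these, a misread word such as "obey" could in principle be fixed by adding or editing its dictionary entry instead of retraining on more audio.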

How can I detect and localize objects using TensorFlow and a convolutional neural network?

My problem statement is as follows :
" Object Detection and Localization using Tensorflow and convolutional neural network "
What I did:
I am done with cat detection from images using the tflearn library. I successfully trained a model using 25000 images of cats, and it is working fine with good accuracy.
Current Result :
What I want to do:
If my image contains two or more objects, for example a cat and a dog together, my result should be 'cat and dog'. Apart from this, I have to find the exact location of each of these objects in the image (bounding box).
I came across many high-level libraries like Darknet and SSD, but I was not able to grasp the concept behind them.
Please guide me on an approach to solve the problem.
Note : I am using supervised learning techniques.
Expected Result :
You have several ways to go about it.
The most straightforward way is to get some suggested bounding boxes using a region proposal algorithm like selective search, and run the classification net that you already trained on each one of the suggestions. This is the approach taken by R-CNN.
For more advanced algorithms based on the above approach, I suggest you read about Fast R-CNN and Faster R-CNN.
Look at Object detection with R-CNN? for some basic explanation.
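A very rough sketch of that "propose regions, then classify each one" loop is below. To keep it self-contained it uses fixed-size sliding windows in place of selective search (a much cruder proposal method), and `classify` is a hypothetical callable standing in for the classifier you already trained; it is assumed to return a (label, score) pair.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, str, float]     # (box, label, score)
Image = Sequence[Sequence[int]]        # rows of pixel values


def sliding_windows(width: int, height: int, size: int, stride: int) -> Iterator[Box]:
    """Yield square candidate boxes; a crude stand-in for selective search."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)


def detect(
    image: Image,
    classify: Callable[[List[Sequence[int]]], Tuple[str, float]],  # hypothetical classifier
    size: int,
    stride: int,
    min_score: float = 0.5,
    background: str = "background",
) -> List[Detection]:
    """Classify every candidate box and keep the confident, non-background ones."""
    height, width = len(image), len(image[0])
    detections: List[Detection] = []
    for box in sliding_windows(width, height, size, stride):
        x_min, y_min, x_max, y_max = box
        patch = [row[x_min:x_max] for row in image[y_min:y_max]]
        label, score = classify(patch)
        if label != background and score >= min_score:
            detections.append((box, label, score))
    return detections
```

Note that a classifier used this way needs some notion of "background" (no object), and in practice overlapping boxes for the same object still have to be merged, for example with non-maximum suppression.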
Darknet and SSD are based on a different approach; if you want to understand them, you can read about them on
Image localization is a complex problem with many different implementations achieving the same result with different efficiency.
There are 2 main types of implementation
-Localize objects with regression
-Single Shot Detectors
Read this to get a better idea.
I have done a similar project (detection + localization) on Indian currencies using PyTorch and ResNet34. Following is the link to my Kaggle notebook; I hope you find it helpful. I manually collected images from the internet, drew bounding boxes around the objects, and saved the annotation files (Pascal VOC) using the "LabelImg" annotation tool.
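For reference, Pascal VOC annotation files are XML, and reading the boxes back needs only the standard library. A minimal sketch follows; the file name and the label in the sample are made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# A made-up annotation in the Pascal VOC layout.
SAMPLE_XML = """
<annotation>
  <filename>note_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>100_rupees</name>
    <bndbox><xmin>48</xmin><ymin>60</ymin><xmax>400</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>
"""


def read_voc_boxes(xml_text: str) -> List[Tuple[str, int, int, int, int]]:
    """Return (label, xmin, ymin, xmax, ymax) for every object in one annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(float(bndbox.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, *coords))
    return boxes


print(read_voc_boxes(SAMPLE_XML))
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.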

Object detection using CNTK

I am very new to CNTK.
I want to train on a set of images (to detect objects like alcohol glasses/bottles) using CNTK with ResNet/Fast R-CNN.
I am trying to follow the documentation below from GitHub; however, it does not appear to be a straightforward procedure.
I cannot find proper documentation on generating ROIs for images with different sizes and shapes. And how do I create object labels based on the trained models? Can someone point out to a proper documentation or training link using which I can work on the cntk model? Please see the attached image, in which I was able to load a sample image with the default ROIs in the script. How do I properly set the size and label the object in the image? Thanks in advance!
sample image loaded for training
Not sure what you mean by proper documentation. This is an implementation of the paper. It looks like you are trying to generate ROIs. Can you look through the helper functions documented at the site to find what you might need:
To run the toy example, make sure that in the datasetName is set to "grocery".
Run to generate the input ROIs for training and testing.
Run to train a Fast R-CNN model using the CNTK Python API and compute test results.
The algorithm works on several candidate regions and then generates two outputs: one for the classes of the objects and another for the bounding boxes of the objects belonging to those classes. Please refer to the code for the details of the implementation.
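To illustrate how those two outputs are typically combined for one candidate region, here is a sketch that picks the best-scoring class and applies that class's box offsets to the ROI. It uses the (dx, dy, dw, dh) parameterization from the R-CNN papers; whether the CNTK implementation scales or normalizes its regression targets in exactly this way is an assumption, so treat this as a conceptual sketch and check the code. The function and class names are made up.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def decode_box(roi: Box, deltas: Sequence[float]) -> Box:
    """Shift and rescale `roi` by (dx, dy, dw, dh) as in the R-CNN box regression."""
    x_min, y_min, x_max, y_max = roi
    width, height = x_max - x_min, y_max - y_min
    center_x, center_y = x_min + 0.5 * width, y_min + 0.5 * height
    dx, dy, dw, dh = deltas
    new_center_x = center_x + dx * width    # shift is relative to the ROI size
    new_center_y = center_y + dy * height
    new_width = width * math.exp(dw)        # size change is predicted in log space
    new_height = height * math.exp(dh)
    return (
        new_center_x - 0.5 * new_width,
        new_center_y - 0.5 * new_height,
        new_center_x + 0.5 * new_width,
        new_center_y + 0.5 * new_height,
    )


def decode_roi(
    roi: Box,
    class_scores: Sequence[float],
    class_deltas: Sequence[Sequence[float]],
    class_names: List[str],
) -> Tuple[str, float, Box]:
    """Pick the best class for one ROI and refine the box with that class's deltas."""
    best = max(range(len(class_scores)), key=lambda index: class_scores[index])
    return class_names[best], class_scores[best], decode_box(roi, class_deltas[best])
```

For example, `decode_roi((10, 20, 30, 60), [0.1, 0.7, 0.2], deltas, ["__background__", "bottle", "glass"])` would return the "bottle" label, its score, and the refined box.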
Can someone point out to a proper documentation or training link using which I can work on the cntk model?
You can take a look at my repository on GitHub.
It will guide you through all the steps required to train your own model for object detection and classification with CNTK.
But in short the proper steps should look something like this:
Setup environment
Prepare data
Tag images (ground truth)
Download pretrained model and create mappings for your custom dataset
Run training
Evaluate the model on test set