Multiple labels on a single bounding box for tensorflow SSD mobilenet - tensorflow

I have configured SSD MobileNet v1 and have trained the model previously as well. However, in my dataset each bounding box has multiple class labels. My dataset is of faces; each face has 2 labels: age and gender. Both labels have the same bounding box coordinates.
After training on this dataset, the problem I encounter is that the model only labels the gender of the face and not the age. In YOLO, however, both gender and age can be shown.
Is it possible to achieve multiple labels on a single bounding box using SSD MobileNet?

It depends on the implementation, but SSD uses a softmax layer to predict a single class per bounding box, whereas YOLO predicts an individual sigmoid confidence score for each class. So in SSD a single class is picked with argmax, but in YOLO you can accept every class whose score is above a threshold.
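A rough numerical illustration of that difference (the logits and the 0.5 threshold below are made up for the example):

```python
import numpy as np

# Hypothetical raw class scores (logits) for one predicted box,
# e.g. classes = [male, female, young, adult, senior].
logits = np.array([2.1, 0.3, 1.8, 0.5, -0.7])

# SSD-style: softmax over all classes, then argmax -> exactly one label survives.
softmax = np.exp(logits) / np.sum(np.exp(logits))
ssd_label = np.argmax(softmax)            # a single class index

# YOLO-style: an independent sigmoid per class, keep everything above a threshold.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
yolo_labels = np.where(sigmoid > 0.5)[0]  # possibly several class indices

print(ssd_label, yolo_labels)
```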
However, what you really have is a multi-task learning problem with two types of outputs, so you should extend these models to predict both types of classes jointly.
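Here is a minimal Keras sketch of that multi-task idea: one shared feature vector per box feeds a gender head, an age head and the usual box regressor. The layer sizes, names and number of age buckets are assumptions for illustration, not the actual SSD MobileNet code:

```python
import tensorflow as tf

# Shared per-box features (illustrative size) feeding two class heads and one box head.
box_features = tf.keras.Input(shape=(256,), name="per_box_features")

gender_logits = tf.keras.layers.Dense(2, name="gender")(box_features)  # male / female
age_logits = tf.keras.layers.Dense(5, name="age")(box_features)        # e.g. 5 age buckets
box_offsets = tf.keras.layers.Dense(4, name="box")(box_features)       # shared coordinates

model = tf.keras.Model(box_features, [gender_logits, age_logits, box_offsets])
model.compile(
    optimizer="adam",
    loss={
        "gender": tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        "age": tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        "box": "mse",
    },
)
```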

Related

ResNet testing accuracy calculation

I have taken the stock ResNet and trained it only to identify a small set of custom objects. My training set consists of images of these custom objects and their associated bounding box tags. I then train the ResNet model via TensorFlow and try to test it on a few other images, which do not have associated tags/bounding boxes. When I do the testing, the model identifies the objects with a bounding box, which is great, but each of those boxes comes with an accuracy percentage, e.g. 92%. How is ResNet calculating this accuracy percentage? I can't seem to find the actual calculation it is doing.
Thanks

Tensorflow: Load pre-trained model weights into a different architecture

I have a TensorFlow model that works reasonably well for detecting an object in an image and generating a bounding rectangle. The output includes one softmax and 4 analog values for location. I need to add one more analog output for predicting the object orientation. How can I import the pre-trained model weights and freeze them so that only the part dealing with the orientation in the last layer will be trained?
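A rough sketch of the general idea in Keras, not the asker's actual model; the checkpoint file name and the "shared_features" layer name are placeholders:

```python
import tensorflow as tf

# Assumes the detector loads as a Keras model whose shared feature layer is
# named "shared_features" -- both names are placeholders for this sketch.
base = tf.keras.models.load_model("pretrained_detector.h5")
base.trainable = False                               # freeze all pre-trained weights

shared = base.get_layer("shared_features").output
orientation = tf.keras.layers.Dense(1, name="orientation")(shared)  # new analog output

# Train only the new head: this model shares the frozen backbone with `base`.
train_model = tf.keras.Model(base.input, orientation)
train_model.compile(optimizer="adam", loss="mse")

# For inference, expose the original outputs plus the new orientation prediction.
full_model = tf.keras.Model(base.input, base.outputs + [orientation])
```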

Pre Trained LeNet Model for License plate Recognition

I have implemented a form of the LeNet model via TensorFlow and Python for a car number plate recognition system. My model was trained solely on my train data and tested on the test data. My dataset contains segmented images wherein every image has only one character in it. My model does not perform very well, so I'm now looking for models I can use via transfer learning. Since most models are already trained on a humongous dataset, I looked over a few like AlexNet, ResNet, GoogLeNet and Inception v2. Most of these models have not been trained on the type of data that I want, which would be letters and digits.
Question: Should I still go forward with one of these models and train them on my dataset, or are there better models which would help? For such models, would Keras be a better option since it is more high-level than TensorFlow?
Question: I'd prefer to work with the LeNet model itself, since training the other models would definitely take a long time due to the insufficient specs of my laptop. So is there any implementation of the model which uses machine-printed character images to train the model, which I could use to then train the final layers of the model on my data?
To get good results you should use a model explicitly designed for text recognition.
First, (roughly) crop the input image to the region around the text.
Then, feed the image of the text into a neural network (NN) to detect the text.
A typical NN for text recognition extracts relevant features (with convolutional NN), propagates those features through the image (with recurrent NN) and finally predicts a character score for each position in the image.
Usually, those networks are trained with the CTC loss.
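A minimal sketch of such a CRNN-style network in TensorFlow/Keras; the input size, layer widths and character set are illustrative assumptions:

```python
import tensorflow as tf

num_chars = 37  # e.g. 0-9, A-Z plus the CTC blank; an assumption for this sketch

inputs = tf.keras.Input(shape=(32, 128, 1))                 # H x W x 1 plate crop
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(x)          # -> 16 x 64 x 64
x = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPool2D(pool_size=(2, 1))(x)          # -> 8 x 64 x 128
x = tf.keras.layers.Permute((2, 1, 3))(x)                   # width becomes the time axis
x = tf.keras.layers.Reshape((64, 8 * 128))(x)               # 64 time steps of features
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(x)
outputs = tf.keras.layers.Dense(num_chars, activation="softmax")(x)  # char scores per step

model = tf.keras.Model(inputs, outputs)
# Training would use the CTC loss (e.g. tf.keras.backend.ctc_batch_cost) to align
# the 64 per-step character distributions with the variable-length plate text.
```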
As a starting point I would suggest looking at the CRNN implementation (they also provide a pre-trained model) [1] and the corresponding paper [2]. There is, as far as I remember, also a TensorFlow implementation on GitHub.
You can use any framework (e.g. TensorFlow or CNTK or ...) you like as long as it features convolutional and recurrent NNs and the CTC loss.
I once attended a presentation about CNTK where they claimed that they have a very fast implementation of recurrent NN - so maybe CNTK would be a good choice for your slow computer?
[1] CRNN implementation: https://github.com/bgshih/crnn
[2] Shi - An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Problems recognizing color for pre-trained Inception v3

I'm retraining the last layer of a pre-trained Inception V3 model: https://www.tensorflow.org/tutorials/image_retraining
My image recognition task is to try to identify the species of mushrooms.
When I look at the prediction results for misclassified images, the model is often assigning a mushroom to a species with a similar shape, but a totally different color. So, it seems like the Inception model is more sensitive to shape and less sensitive to color. Unfortunately, color is a very important feature for this problem.
Any idea what I can do?

Object detection with R-CNN?

What does R-CNN actually do? Is it like using features extracted by a CNN to detect classes in a specified window area?
Is there any tensorflow implementation for this?
R-CNN uses the following algorithm:
Get region proposals for object detection (using selective search).
For each region, crop the area from the image and run it through a CNN which classifies the object.
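A rough sketch of that per-region loop; selective search itself is an external step, and `classifier` here stands for any image-classification CNN supplied by the caller:

```python
import tensorflow as tf

def rcnn_detect(image, proposals, classifier):
    """Crop each proposed region and classify it with a separate CNN forward pass."""
    detections = []
    for (y1, x1, y2, x2) in proposals:                 # proposals from selective search
        crop = image[y1:y2, x1:x2]                     # cut the region out of the image
        crop = tf.image.resize(crop, (224, 224))       # the CNN expects a fixed input size
        scores = classifier(tf.expand_dims(crop, 0))   # one forward pass per region
        detections.append(((y1, x1, y2, x2), int(tf.argmax(scores, axis=-1)[0])))
    return detections
```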
There are more advanced algorithms that are built upon this, like Fast R-CNN and Faster R-CNN.
Fast R-CNN:
Run the entire image through the CNN
For each region from the region proposals, extract the area using an "RoI pooling" layer and then classify the object.
Faster R-CNN:
Run the entire image through the CNN
Using the features computed by the CNN, find region proposals with a region proposal network.
For each object proposal, extract the area using an "RoI pooling" layer and then classify the object.
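One common way to sketch the "RoI pooling" step in TensorFlow is tf.image.crop_and_resize, which cuts each proposal out of the shared feature map and resizes it to a fixed grid; the shapes below are illustrative:

```python
import tensorflow as tf

feature_map = tf.random.normal([1, 38, 50, 512])      # backbone output for one image
proposals = tf.constant([[0.1, 0.2, 0.5, 0.6],        # normalized [y1, x1, y2, x2] boxes
                         [0.3, 0.1, 0.9, 0.4]])
box_indices = tf.zeros([2], dtype=tf.int32)           # both boxes belong to image 0

# Each proposal becomes a fixed 7x7x512 block cut from the *shared* feature map,
# so the expensive convolution runs once per image instead of once per region.
rois = tf.image.crop_and_resize(feature_map, proposals, box_indices, crop_size=(7, 7))
print(rois.shape)  # (2, 7, 7, 512); each block then feeds the classification head
```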
There are a lot of implementations in TensorFlow, specifically for Faster R-CNN, which is the most recent variant; just google "faster R-CNN tensorflow".
Good luck
R-CNN is the daddy-algorithm of all the mentioned algos; it really paved the way for researchers to build more complex and better algorithms on top of it. I will try to explain R-CNN and the other variants of it.
R-CNN, or Region-based Convolutional Neural Network
R-CNN consists of 3 simple steps:
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN:
Fast R-CNN immediately followed R-CNN. Fast R-CNN is faster and better by virtue of the following points:
Performing feature extraction over the image before proposing regions, thus only running one CNN over the entire image instead of 2000 CNNs over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model.
Intuitively it makes a lot of sense to remove the 2000 separate CNN passes and instead run the convolution once over the whole image and extract the region boxes on top of that shared feature map.
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow selective search algorithm, so Faster R-CNN introduced something called the Region Proposal Network (RPN).
Here is how the RPN works:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d). For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes).
Each region proposal consists of:
An “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc.
For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are, giving 2k objectness scores and 4k coordinates per sliding-window location.
The 2k scores represent the softmax probability of each of the k bounding boxes being an "object". Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an "objectness" score above a certain threshold, that box's coordinates get passed forward as a region proposal.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
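A small Keras sketch of such an RPN head, following the 3x3 sliding window and the 2k/4k outputs described above; the channel widths and the value of k are illustrative:

```python
import tensorflow as tf

k = 9                                                   # anchor boxes per location
feature_map = tf.keras.Input(shape=(None, None, 512))   # last backbone feature map

# 3x3 "sliding window" conv mapping each location to a lower-dimensional vector,
# then two 1x1 convs emitting 2k objectness scores and 4k box coordinates per location.
x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(feature_map)
objectness = tf.keras.layers.Conv2D(2 * k, 1, name="rpn_cls")(x)   # object vs. not object
box_deltas = tf.keras.layers.Conv2D(4 * k, 1, name="rpn_reg")(x)   # anchor box offsets

rpn = tf.keras.Model(feature_map, [objectness, box_deltas])
```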
Linking some TensorFlow implementations:
https://github.com/smallcorgi/Faster-RCNN_TF
https://github.com/CharlesShang/FastMaskRCNN
You can find a lot of implementations on GitHub.
P.S. I borrowed a lot of material from Joyce Xu's Medium blog.