Object Detection API: what do the COCO model names mean?

I just started using the TensorFlow Object Detection API and have trained a few models. I then noticed that the COCO model names differ, and the accuracy is also poor. What is the main difference between faster_rcnn_inception_resnet_v2_atrous_coco, faster_rcnn_inception_resnet_v2_atrous_lowproposals_coco and faster_rcnn_resnet50_coco? What do the terms "atrous" and "lowproposals" mean, and why does the ResNet-50 model use neither? The model zoo is here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf1_detection_zoo.md

The names reflect the variants submitted to the COCO competition and described in the corresponding papers.
They are all versions of Faster R-CNN, which originally used VGG-16 for feature extraction.
Without going too deep: the ResNet variant of Faster R-CNN, as the name implies, uses ResNet for feature extraction. Atrous and low proposals are further variations of the model.
Atrous:
Atrous Region Proposal Network (ARPN) is proposed to explore object
contexts at multiple scales by sliding a set of atrous filters with
increasing dilation rates over the last convolutional feature map.
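To make "dilation rate" concrete, here is a minimal pure-Python sketch of a 1-D atrous (dilated) convolution. It only illustrates the operation itself: it is not the ARPN from the quote and not the API's implementation, and the function names are made up for this example.

```python
def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with `dilation - 1` skipped inputs between taps."""
    span = receptive_field(len(kernel), dilation)
    outputs = []
    for start in range(len(signal) - span + 1):
        taps = (signal[start + i * dilation] for i in range(len(kernel)))
        outputs.append(sum(w * x for w, x in zip(kernel, taps)))
    return outputs


signal = list(range(10))   # toy "feature map": 0, 1, ..., 9
kernel = [1, 0, -1]        # same 3 weights for every dilation rate

for rate in (1, 2, 4):
    print(rate, receptive_field(len(kernel), rate), dilated_conv1d(signal, kernel, rate))
```

The kernel keeps its three weights, but each output looks at 3, 5 and then 9 input positions as the rate grows. That is the sense in which atrous filters see more context without adding parameters.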
As for low proposals, I'm not sure where the term comes from, but judging by the name I would guess it simply generates fewer proposals in the Region Proposal Network (RPN), which makes inference faster (as you can see in the model zoo table).
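If that guess is right, the idea boils down to keeping only the highest-scoring proposals before the second stage. A rough sketch with made-up boxes and scores follows. The helper name is hypothetical, and a real pipeline also applies non-maximum suppression, which is left out here. I believe the relevant pipeline config field is first_stage_max_proposals, but check the config shipped with each model.

```python
def keep_top_proposals(proposals, max_proposals):
    """Keep the `max_proposals` highest-scoring (box, score) pairs."""
    ranked = sorted(proposals, key=lambda item: item[1], reverse=True)
    return ranked[:max_proposals]


# Toy RPN output: (ymin, xmin, ymax, xmax) boxes with objectness scores.
proposals = [
    ((0, 0, 10, 10), 0.20),
    ((5, 5, 20, 20), 0.90),
    ((1, 1, 4, 4), 0.60),
    ((2, 2, 8, 8), 0.75),
]

kept = keep_top_proposals(proposals, max_proposals=2)
print([score for _, score in kept])
```

Fewer proposals means fewer crops for the second-stage classifier to process, which is where the speed-up would come from, usually at some cost in recall.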

Related

What is the purpose of a pre-trained network in Faster R-CNN?

I am not able to understand the purpose of a pre-trained network. From what I read, it is used for the RPN and the classification network, but I don't understand how.
CNNs take a notoriously long time to train, especially more complex models at higher resolutions. To avoid days of training on a high-end GPU, pre-trained models have been made available. You then only have to train on your specific data (assuming your data is similar to the data used for pre-training). For instance, if you want to train a CNN to recognize cats in high-resolution images, you might start from a pre-trained model that recognizes dogs. Training should take far less time, because many of the same underlying patterns have already been learned and all your training needs to do is differentiate cats from dogs.
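As a toy illustration of that idea (not Faster R-CNN itself), the sketch below treats a fixed function as the frozen "pre-trained" feature extractor and trains only a small logistic head on top of it. All names and data are invented for the example.

```python
import math


def pretrained_features(x):
    """Stand-in for a frozen pre-trained backbone: it is never updated below."""
    return [x, x * x]


def predict(head, feats):
    """Logistic head applied to the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], feats))
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_head(samples, epochs=2000, lr=0.1):
    """Train only the head with plain SGD on the log loss."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for x, label in samples:
            feats = pretrained_features(x)
            error = predict(head, feats) - label  # d(log loss) / d(logit)
            head["weights"] = [w - lr * error * f
                               for w, f in zip(head["weights"], feats)]
            head["bias"] -= lr * error
    return head


# Toy task: label 1 when |x| is large. Not separable in x alone,
# but easy once the "backbone" supplies x*x as a feature.
samples = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (1.5, 1), (2.0, 1)]
head = fine_tune_head(samples)
print([round(predict(head, pretrained_features(x))) for x, _ in samples])
```

The head has three parameters and learns quickly only because the useful feature already exists. In Faster R-CNN the backbone plays that role for both the RPN and the classifier, which share its feature maps.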

When should I stop the object detection model training while mAP are not stable?

I am re-training SSD MobileNet with 900 images from the Berkeley Deep Drive dataset, and evaluating on 100 images from that dataset.
The problem is that after about 24 hours of training, the TotalLoss seems unable to go below 2.0:
And the corresponding mAP score is quite unstable:
In fact, I have tried training for about 48 hours, and the TotalLoss still cannot go below 2.0; it stays in the range of 2.5 to 3.0. During that time, the mAP is even lower.
So here is my question: given my situation (I really don't need a "high-precision" model; as you can see, I picked 900 images for training and simply want to do a proof-of-concept training and prediction run), when should I stop the training and obtain a reasonably performing model?
Indeed, for detection you need to fine-tune the network. Since you are using SSD, there are already some sources out there:
https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html (This one specifically for an SSD Model, uses mxnet but you can use the same with TF)
You can watch a very nice finetuning intro here
This repo has a nice fine tuning option enabled as long as you write your dataloader, check it out here
In general, your error can be attributed to many factors: the learning rate you are using, or the characteristics of the images themselves (are they normalized?). If the SSD network you are using was trained with normalized data and you don't normalize when retraining, you'll get stuck while learning. Also, what learning rate are they using?
From the model zoo I can see that for SSD there are models trained on COCO and models trained on Open Images.
If, for example, you are using ssd_inception_v2_coco, there is a truncated_normal_initializer in the input layers, so take that into consideration. Also make sure the input sizes are the same as the ones you provide to the model.
You can get very good detections even with little data if you also include many augmentations and take into account the other things I mentioned. More details on your code would help to see where the problem lies.
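On the literal question of when to stop: a common rule of thumb is to smooth the eval mAP and stop once it has not improved for a few evaluations, then keep the checkpoint from the best one. Here is a minimal sketch of that rule. The window and patience values and the mAP numbers are arbitrary assumptions, not recommendations from the API.

```python
def smooth(values, window=3):
    """Trailing moving average to damp noise from a small eval set."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def early_stop_index(map_history, patience=3, window=3, min_delta=0.0):
    """Return the index of the best smoothed evaluation once `patience`
    evaluations have passed without improvement, or None to keep training."""
    smoothed = smooth(map_history, window)
    best_idx = 0
    for i in range(1, len(smoothed)):
        if smoothed[i] > smoothed[best_idx] + min_delta:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx
    return None


# Made-up mAP values, one per evaluation run.
map_history = [0.10, 0.18, 0.22, 0.30, 0.25, 0.31, 0.27, 0.24, 0.26, 0.22]
print(early_stop_index(map_history))
```

With only 100 eval images, single mAP readings will jump around a lot, so comparing smoothed values is more forgiving than reacting to every dip.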

Low validation accuracy after mobilenet transfer learning

I need a tensorflow model which recognizes a dog's breed. I downloaded the Stanford Dogs Dataset - 20,580 images in 120 categories (=breeds). I followed the procedure described in TensorFlow For Poets to retrain mobilenet_1.0_224. I used --how_many_training_steps=4000 and defaults for everything else. I got this tensorboard graph:
Training and validation accuracy
The validation accuracy is only about 80%.
What can I do to improve it?
In the research paper MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, the test accuracy using the 'MobileNet_1.0_224' architecture on the Stanford Dogs dataset is 83.3%, which seems in line with your results.
When you visually examine the Stanford Dogs Dataset, you will find that a lot of the breeds look similar, which makes it hard to reach a higher accuracy, even with state-of-the-art image classifiers. You might improve your results by merging similar-looking breeds into broader categories.
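To see how much such grouping could help before retraining anything, you can re-score existing predictions with a breed-to-group mapping. The grouping and the predictions below are invented for illustration. Which breeds to merge is your call.

```python
# Hypothetical grouping of look-alike breeds; unlisted breeds keep their own label.
COARSE_GROUP = {
    "Siberian_husky": "sled_dog",
    "malamute": "sled_dog",
    "Norfolk_terrier": "small_terrier",
    "Norwich_terrier": "small_terrier",
}


def accuracy(y_true, y_pred, mapping=None):
    """Fraction of matching labels, optionally after mapping to coarse groups."""
    if mapping is not None:
        y_true = [mapping.get(label, label) for label in y_true]
        y_pred = [mapping.get(label, label) for label in y_pred]
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up ground truth and predictions.
y_true = ["Siberian_husky", "malamute", "Norfolk_terrier", "Norwich_terrier", "pug"]
y_pred = ["malamute", "malamute", "Norwich_terrier", "Norwich_terrier", "pug"]

print("fine-grained:", accuracy(y_true, y_pred))
print("coarse:", accuracy(y_true, y_pred, COARSE_GROUP))
```

If most errors disappear under the coarse mapping, the model is mainly confusing look-alike breeds, and merged categories are probably the more realistic target.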
Alternatively, you might tweak the training settings of the retrain.py script in the TensorFlow for Poets tutorial, but the gains will likely be marginal.

Pre Trained LeNet Model for License plate Recognition

I have implemented a form of the LeNet model with TensorFlow and Python for a car number plate recognition system. My model was trained solely on my training data and tested on the test data. My dataset contains segmented images, each containing a single character. This is what my data looks like. My model does not perform very well, so I'm now looking for models I can use via transfer learning. Since most models are already trained on a huge dataset, I looked at a few such as AlexNet, ResNet, GoogLeNet and Inception v2. Most of these models have not been trained on the type of data I need, namely letters and digits.
Question: Should I still go forward with one of these models and train it on my dataset, or are there better models that would help? For such models, would Keras be a better option since it is more high-level than TensorFlow?
Question: I'd prefer to work with the LeNet model itself since training the other models would definitely take a long time due to the insufficient specs of my laptop. So is there any implementation of the model which uses machine printed character images to train the model which I could use to then train the final layers of the model on my data?
To get good results, you should use a model explicitly designed for text recognition.
First, (roughly) crop the input image to the region around the text.
Then, feed the image of the text into a neural network (NN) to recognize the text.
A typical NN for text recognition extracts relevant features (with convolutional NN), propagates those features through the image (with recurrent NN) and finally predicts a character score for each position in the image.
Usually, those networks are trained with the CTC loss.
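The CTC loss is what the framework computes during training. At inference time, the per-position scores still have to be turned into a string. The simplest way is greedy CTC decoding: take the best symbol at each position, collapse repeats, then drop blanks. A small sketch with an invented alphabet and invented scores:

```python
BLANK = "-"  # CTC blank symbol; the choice of character is arbitrary here


def ctc_greedy_decode(per_step_scores, alphabet):
    """Greedy CTC decoding: argmax per step, merge repeats, remove blanks."""
    best_path = []
    for scores in per_step_scores:
        best_idx = max(range(len(scores)), key=scores.__getitem__)
        best_path.append(alphabet[best_idx])

    decoded = []
    previous = None
    for symbol in best_path:
        # A repeated symbol only counts again if a blank separated the two.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "A", "B", "7"]
scores = [
    [0.10, 0.80, 0.05, 0.05],  # A
    [0.10, 0.70, 0.10, 0.10],  # A (repeat, merged)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.60, 0.20, 0.10],  # A (new character after the blank)
    [0.20, 0.10, 0.60, 0.10],  # B
    [0.10, 0.10, 0.10, 0.70],  # 7
    [0.10, 0.10, 0.10, 0.70],  # 7 (repeat, merged)
]
print(ctc_greedy_decode(scores, alphabet))
```

The blank is what lets the network output a doubled character such as "AA", which matters for plates with repeated digits or letters.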
As a starting point, I would suggest looking at the CRNN implementation (they also provide a pre-trained model) [1] and the corresponding paper [2]. There is, as far as I remember, also a TensorFlow implementation on GitHub.
You can use any framework you like (e.g. TensorFlow or CNTK), as long as it features convolutional and recurrent NNs and the CTC loss.
I once attended a presentation about CNTK where they claimed that they have a very fast implementation of recurrent NN - so maybe CNTK would be a good choice for your slow computer?
[1] CRNN implementation: https://github.com/bgshih/crnn
[2] Shi et al., "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition"

Selecting tensorflow object detection API training hyper parameters

I am setting up an object detection pipeline based on the recently released TensorFlow Object Detection API. I am using the arXiv paper as guidance. I would like to understand the following for training on my own dataset:
1. It is not clear how they selected the learning rate schedules. How does the learning rate schedule change with the number of GPUs available for training? The paper mentions that 9 GPUs were used. How should I change the learning rate if I only want to use 1 GPU?
2. The released sample training config file for Pascal VOC using Faster R-CNN has an initial learning rate of 0.0001. This is 10x lower than what was published in the original Faster R-CNN paper. Is this due to an assumption about the number of GPUs available for training, or is there a different reason?
3. When I start training from the COCO detection checkpoint, how should the training loss decrease? Looking at TensorBoard, the training loss on my dataset is low: between 0.8 and 1.2 per iteration (with a batch size of 1). The image below shows the various losses from TensorBoard. Is this expected behavior?
For questions 1 and 2: our implementation differs in a few small details compared to the original paper and internally we train all of our detectors with asynchronous SGD with ~10 GPUs. Our learning rates are calibrated for this setting (which you will also have if you decide to train via Cloud ML Engine as in the Pets walkthrough). If you use another setting, you will have to do a bit of hyperparameter exploration. For a single GPU, leaving the learning rate alone probably won't hurt performance, but you may be able to get faster convergence by increasing it.
For question 3: Training losses decrease erratically and you can only see the decrease if you smooth the plots quite a bit over time. Moreover, it's hard to explicitly say how well you are doing with respect to eval metrics just by looking at the training losses. I recommend looking at the mAP plots over time as well as the image visualizations to really get an idea of whether your model has "lifted off".
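To illustrate what "smoothing the plots" means, here is a small sketch of an exponential moving average over per-iteration losses. It is similar in spirit to the smoothing slider in TensorBoard but not guaranteed to match it exactly, and the loss values are made up.

```python
def exponential_moving_average(values, weight=0.9):
    """Smooth a noisy series; a higher `weight` means heavier smoothing."""
    if not values:
        return []
    smoothed = []
    last = values[0]  # start from the first observation
    for value in values:
        last = weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed


# Made-up per-iteration losses that bounce between 0.7 and 1.2 (batch size 1).
raw_losses = [1.2, 0.8, 1.1, 0.9, 1.0, 0.7, 1.05, 0.75]
smoothed_losses = exponential_moving_average(raw_losses, weight=0.9)

print("raw range:     ", max(raw_losses) - min(raw_losses))
print("smoothed range:", max(smoothed_losses) - min(smoothed_losses))
```

With a batch size of 1 each point is the loss on a single image, so the raw curve mostly reflects which image was sampled. The slow trend only becomes visible after smoothing.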
Hope this helps.