SSD Inception v2. Is the VGG16 feature extractor replaced by the Inception v2? - tensorflow

In the original SSD paper, the authors used a VGG16 network for feature extraction. I am using the SSD Inception v2 model from the TensorFlow model zoo, and I do not know how its architecture differs. This Stack Overflow post suggests that for other models, such as SSD MobileNet, the VGG16 feature extractor is replaced by the MobileNet feature extractor.
I thought the same would apply to SSD Inception, but this paper has me confused. From it, it seems that the Inception part is added to the SSD part of the model, while the VGG16 feature extractor remains at the beginning of the architecture.
What is the architecture of the SSD Inception v2 model?

In the TensorFlow Object Detection API, the ssd_inception_v2 model uses inception_v2 as the feature extractor; that is, the vgg16 part in the first figure (figure (a)) is replaced with inception_v2.
In SSD models, the feature layer extracted by the feature extractor (e.g. vgg16, inception_v2, mobilenet) is further processed to produce extra feature layers of different resolutions. In figure (a) above, there are 6 output feature layers; the first two (19x19) are taken directly from the feature extractor. How are the other 4 layers (10x10, 5x5, 3x3, 1x1) generated?
They are generated by extra convolutional operations (these conv operations are sort of like very shallow feature extractors, aren't they?). The implementation is here, and it is well documented. The documentation says:
Note that the current implementation only supports generating new layers
using convolutions of stride 2 (resulting in a spatial resolution reduction
by a factor of 2)
That is how each extra feature map shrinks by a factor of 2. If you read the function multi_resolution_feature_maps, you will find slim.conv2d operations being used, which indicates that these extra layers are obtained with extra convolution layers (just one layer each!).
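To make the "factor of 2" concrete, here is a minimal sketch of the output-size arithmetic for a stride-2 convolution. It assumes TensorFlow's usual 'SAME' and 'VALID' padding conventions and a 3x3 kernel; the helper names are made up for illustration and are not part of the Object Detection API.

```python
import math


def conv_output_size(size, kernel=3, stride=2, padding="SAME"):
    """Spatial output size of a square convolution (TensorFlow padding rules)."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'SAME' or 'VALID'")


def extra_feature_map_sizes(base_size, num_extra_layers):
    """Sizes of the extra maps produced by repeated stride-2 'SAME' convolutions."""
    sizes = []
    size = base_size
    for _ in range(num_extra_layers):
        size = conv_output_size(size)
        sizes.append(size)
    return sizes


print(extra_feature_map_sizes(19, 4))  # [10, 5, 3, 2]
```

Under these assumptions a 19x19 map shrinks to 10x10, 5x5, 3x3 and then 2x2. The exact sizes in a given figure depend on the padding and kernel used; for example, a 3x3 'VALID' convolution with stride 1 turns a 3x3 map into a 1x1 map (conv_output_size(3, kernel=3, stride=1, padding="VALID") returns 1).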
Now we can explain what is improved in the paper you linked. The authors propose replacing the extra feature layers with an Inception block. There is no inception_v2 model involved, simply an Inception block. The paper reports improved classification accuracy from using the Inception block.
So the answer to the question should now be clear: SSD models with vgg16, inception_v2 or mobilenet as the feature extractor are all valid, but the Inception in the paper refers only to an Inception block, not to the Inception network.

Related

SegNet Implementation

I am working on biomedical image segmentation, and for this I need an implementation of the SegNet model. I have searched for SegNet implementations in many places, but none of them is a correct implementation. The ones I found do not use a pre-trained encoder. From the SegNet paper, I know that SegNet uses a pre-trained encoder, which is a trimmed portion of the VGG-16 network trained on the ImageNet dataset. I need the implementation in Keras.
N.B. There is a pretrained VGG-16 network available in Keras, but it lacks the Batch Normalization layers that are present in the original SegNet paper.
P.S. I cannot retrain the VGG-16 network on my own because of scarce computational resources.

Can I generate heat map using method such as Grad-CAM in concatenated CNN?

I am trying to apply GradCAM to my pre-trained CNN model to generate heat maps of layers. My custom CNN design is shown as follows:
- It adopted all the convolution layers and the pre-trained weights from the VGG16 model.
- Extract lower level features (early convolution layers) from VGG16.
- Train the fully connected layers of both normal/high and lower level features from VGG16.
- Concatenate outputs of both normal/high- and lower-level f.c. layers and then train more f.c. layers before the final prediction.
[Figure: model design]
I want to use GradCAM to visualize the feature maps of the low-level route and the normal/high-level route. I have already produced such heatmaps on a non-concatenated fine-tuned VGG using the last convolutional layers. My question is: on a concatenated CNN model, can the Grad-CAM method still work using the gradient of the prediction with respect to the low- and high-level feature maps respectively? If not, are there other methods that can produce heatmap visualizations for such a model? Is using the shared fully connected layer an option?
Any ideas and suggestions are much appreciated!

Where can I find the pretrained models of fasterRCNN / R-FCN with Mobilenet Feature extractor trained on COCO dataset?

I want to train on a custom dataset with FasterRCNN using Mobilenet v1 or v2. I want to use the pre-trained models in the TensorFlow model zoo, but I can't find a Faster RCNN model with Mobilenet as the base feature extractor. Where can I get it?
I have already checked the TensorFlow model zoo on GitHub. I have previously used the SSD+Mobilenet config for the same task. Now I want to compare the results with FasterRCNN and R-FCN with Mobilenet.
The official repo has not released Faster RCNN with Mobilenet models yet. If you want, you can still reuse other Mobilenet-based models trained on COCO, but the process is a bit complicated.
There are two important steps.
The first is to have the corresponding feature extractor class. For Faster RCNN, the models directory already contains a faster_rcnn_mobilenet feature extractor implementation, so this step is covered. For R-FCN, you will have to implement the feature extractor class yourself.
The second is to change how tensor names are matched against the checkpoint. For example, if you use ssd_mobilenet_v1_xxx as the checkpoint, all tensors within the mobilenet scope are named FeatureExtractor/MobilenetV1/XXX, whereas in the faster_rcnn_mobilenet_v1 model the tensor names within the mobilenet scope are FirstStageFeatureExtractor/MobilenetV1/XXX (and SecondStageFeatureExtractor/MobilenetV1/XXX). So essentially you need to remove FirstStage (as well as SecondStage) from the names of all feature extractor tensors; these tensors will then have exactly the same names as in the checkpoint and will be restored correctly. If you do this, the function you need to modify is
def restore_map(self,
                fine_tune_checkpoint_type='detection',
                load_all_detection_checkpoint_vars=False):
in file faster_rcnn_meta_arch.py.
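The renaming step can be sketched in plain Python as follows. This is only an illustration of the name mapping described above, not the actual restore_map code; the helper names and the example variable suffixes (Conv2d_0/weights and so on) are hypothetical.

```python
import re

# Matches the FirstStage/SecondStage prefix only when it is directly followed
# by the feature-extractor scope quoted in the answer.
_STAGE_PREFIX = re.compile(r"^(First|Second)Stage(?=FeatureExtractor/)")


def to_ssd_checkpoint_name(model_var_name):
    """Map a Faster R-CNN feature-extractor name to the SSD checkpoint name."""
    return _STAGE_PREFIX.sub("", model_var_name)


def build_name_mapping(model_var_names):
    """Return {model variable name: checkpoint name} for renamed variables only."""
    mapping = {}
    for name in model_var_names:
        ckpt_name = to_ssd_checkpoint_name(name)
        if ckpt_name != name:  # skip variables outside the feature extractor
            mapping[name] = ckpt_name
    return mapping


# Hypothetical variable names, for illustration only.
names = [
    "FirstStageFeatureExtractor/MobilenetV1/Conv2d_0/weights",
    "SecondStageFeatureExtractor/MobilenetV1/Conv2d_13_pointwise/weights",
    "FirstStageBoxPredictor/weights",
]
for model_name, ckpt_name in build_name_mapping(names).items():
    print(model_name, "->", ckpt_name)
```

In a real restore_map, a mapping like this would be used to look up each model variable under its checkpoint name; variables that are not part of the feature extractor (such as the box predictor above) are left out and trained from scratch.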

Pre Trained LeNet Model for License plate Recognition

I have implemented a form of the LeNet model with TensorFlow and Python for a car number plate recognition system. My model was trained solely on my training data and tested on the test data. My dataset contains segmented images in which every image has only one character. This is what my data looks like. The model I created does not perform very well, so I'm now looking for models that I can use via transfer learning. Since most models are already trained on a huge dataset, I looked over a few, such as AlexNet, ResNet, GoogLeNet and Inception v2. Most of these models have not been trained on the type of data that I want, which would be letters and digits.
Question: Should I still go forward with one of these models and train it on my dataset, or are there better models that would help? For such models, would Keras be a better option, since it is more high-level than TensorFlow?
Question: I'd prefer to work with the LeNet model itself, since training the other models would definitely take a long time given the insufficient specs of my laptop. Is there an implementation of the model trained on machine-printed character images, so that I could then train only its final layers on my data?
To get good results, you should use a model explicitly designed for text recognition.
First, (roughly) crop the input image to the region around the text.
Then, feed the image of the text into a neural network (NN) to recognize the text.
A typical NN for text recognition extracts relevant features (with convolutional NN), propagates those features through the image (with recurrent NN) and finally predicts a character score for each position in the image.
Usually, those networks are trained with the CTC loss.
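To illustrate how per-position character scores turn into text, here is a minimal sketch of greedy (best-path) CTC decoding in plain Python. It assumes class index 0 is the CTC blank and uses a tiny made-up alphabet; it covers only the decoding step, not the CTC loss itself, and it is not taken from the CRNN code.

```python
def ctc_greedy_decode(scores, alphabet, blank_index=0):
    """Greedy CTC decoding.

    scores: one list of class scores per horizontal position in the image.
    alphabet: alphabet[i] is the character for class i (blank entry unused).
    """
    # 1. Pick the best-scoring class at every position.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in scores]
    # 2. Collapse consecutive repeats, then 3. drop blanks.
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank_index:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


# Toy example: one-hot scores along the path A A - A B B - 7 ("-" is blank).
alphabet = ["", "A", "B", "7"]
path = [1, 1, 0, 1, 2, 2, 0, 3]
scores = [[1.0 if c == p else 0.0 for c in range(len(alphabet))] for p in path]
print(ctc_greedy_decode(scores, alphabet))  # AAB7
```

The blank between the second and third position is what allows the repeated "A" to survive: without it, the two runs of "A" would be collapsed into one character.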
As a starting point I would suggest looking at the CRNN implementation (they also provide a pre-trained model) [1] and the corresponding paper [2]. There is, as far as I remember, also a TensorFlow implementation on github.
You can use any framework you like (e.g. TensorFlow, CNTK, ...) as long as it provides convolutional and recurrent NNs and the CTC loss.
I once attended a presentation about CNTK where they claimed that they have a very fast implementation of recurrent NN - so maybe CNTK would be a good choice for your slow computer?
[1] CRNN implementation: https://github.com/bgshih/crnn
[2] Shi - An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Tensorflow SSD-Mobilenet model accuracy drop after quantization using transform_graph

I am working on the recently released "SSD-Mobilenet" model by google for object detection.
Model downloaded from following location: https://github.com/tensorflow/models/blob/master/object_detection/g3doc/detection_model_zoo.md
The frozen graph file downloaded from the site is working as expected, however after quantization the accuracy drops significantly (mostly random predictions).
I built TensorFlow r1.2 from source and used the following command to quantize:
bazel-bin/tensorflow/tools/graph_transforms/transform_graph --in_graph=frozen_inference_graph.pb --out_graph=optimized_graph.pb --inputs='image_tensor' --outputs='detection_boxes','detection_scores','detection_classes','num_detections' --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,224,224,3") fold_constants(ignore_errors=true) fold_batch_norms fold_old_batch_norms quantize_weights strip_unused_nodes sort_by_execution_order'
I tried various combinations in the "transforms" part. The transforms mentioned above sometimes gave correct predictions, but the results were nowhere close to those of the original model.
Is there any other way to improve performance of the quantized model?
In this case, SSD uses MobileNet as its feature extractor in order to increase speed. If you read the MobileNet paper, it describes a lightweight convolutional neural network that uses separable convolutions in order to reduce the number of parameters.
As I understand it, separable convolution can lose information because of the channel-wise convolution.
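As a rough illustration of the parameter savings, the sketch below compares the weight count of a standard convolution with that of a depthwise separable one (a depthwise k x k convolution followed by a 1x1 pointwise convolution). Biases and batch norm parameters are ignored, and the layer size is just an example, not one taken from the SSD-Mobilenet graph.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard k x k convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


def separable_conv_params(kernel, in_channels, out_channels):
    """Weights in a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per channel
    pointwise = in_channels * out_channels     # 1x1 conv that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 256, 256)
separable = separable_conv_params(3, 256, 256)
print(standard)   # 589824
print(separable)  # 67840
```

For this example layer the separable version has fewer than one eighth of the weights, which is the kind of reduction that makes the network light but also leaves it with less redundancy.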
So when quantizing a graph, the TF implementation converts the floating-point ops and weights to 8 bits. If you read the TF tutorial on quantization, it clearly mentions that this operation is more like adding some noise to an already trained net, in the hope that the model has generalized well.
So this works really well, and is almost lossless in terms of accuracy, for a heavy model like Inception, ResNet etc. But given the lightness and simplicity of SSD with MobileNet, it really can cause an accuracy loss.
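The "adding noise" view can be sketched with a simple min/max linear quantization to 8 bits. This is a generic illustration in plain Python, not necessarily the exact scheme applied by transform_graph, and the weight values are made up.

```python
def quantize_8bit(weights):
    """Map floats to integer codes 0..255 using the tensor's min/max range."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant tensors
    codes = [round((w - lo) / scale) for w in weights]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate float values from the 8-bit codes."""
    return [lo + c * scale for c in codes]


weights = [-0.5, -0.1, 0.0, 0.3, 0.7]  # hypothetical weight values
codes, lo, scale = quantize_8bit(weights)
restored = dequantize(codes, lo, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(codes)
print(max(errors) <= scale / 2 + 1e-12)  # True
```

Each weight moves by at most half a quantization step. A large, redundant network can usually absorb that perturbation, while a small network with few parameters per layer has less slack, which is consistent with the accuracy drop described in the question.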
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
How to Quantize Neural Networks with TensorFlow