Can I generate a heat map using a method such as Grad-CAM in a concatenated CNN? - tensorflow

I am trying to apply Grad-CAM to my pre-trained CNN model to generate heat maps for its layers. My custom CNN is designed as follows:
- It adopts all the convolution layers and the pre-trained weights from the VGG16 model.
- It extracts lower-level features (from early convolution layers) of VGG16.
- It trains fully connected layers on both the normal/high-level and the lower-level features from VGG16.
- It concatenates the outputs of the normal/high-level and lower-level f.c. layers, then trains more f.c. layers before the final prediction.
[Figure: model design]
I want to use Grad-CAM to visualize the feature maps of the low-level route and the normal/high-level route. I have already produced such heatmaps on a non-concatenated fine-tuned VGG using the last convolutional layers. My question is: on a concatenated CNN model, can the Grad-CAM method still work using the gradient of the prediction with respect to the low-level and high-level feature maps respectively? If not, are there other methods that can produce heatmap visualizations for such a model? Is using the shared fully connected layer an option?
Any ideas and suggestions are much appreciated!
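For reference, the Grad-CAM computation itself only needs one convolutional feature map and the gradient of the class score with respect to it, so it can be written once and applied to each route separately. The sketch below is a minimal NumPy version of that step, not the asker's code: the array shapes and the random placeholder data are assumptions, and in a real TensorFlow model the gradients would come from something like tf.GradientTape on the chosen layer's output. It shows that the heatmap is defined per branch; it does not say how informative each branch's map will be.

```python
import numpy as np


def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap for one convolutional layer.

    feature_maps: (H, W, K) activations of the chosen conv layer.
    gradients:    (H, W, K) gradient of the class score w.r.t. those activations.
    """
    # Channel weights: global-average-pool the gradients over height and width.
    weights = gradients.mean(axis=(0, 1))  # shape (K,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(feature_maps, weights, axes=([2], [0])), 0.0)
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam


rng = np.random.default_rng(0)
# Placeholder data with hypothetical shapes: one low-level and one high-level map.
low_maps, low_grads = rng.normal(size=(56, 56, 256)), rng.normal(size=(56, 56, 256))
high_maps, high_grads = rng.normal(size=(14, 14, 512)), rng.normal(size=(14, 14, 512))

# The same class score is differentiated w.r.t. each branch's feature map.
low_heatmap = grad_cam(low_maps, low_grads)
high_heatmap = grad_cam(high_maps, high_grads)
print(low_heatmap.shape, high_heatmap.shape)
```

Each heatmap has the spatial size of its own layer, so it would still need to be resized to the input image before overlaying.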

Related

SSD Inception v2. Is the VGG16 feature extractor replaced by the Inception v2?

In the original SSD paper they used a VGG16 network to do the feature extraction. I am using the SSD Inception v2 model from the TensorFlow model zoo and I do not know how its architecture differs. This Stack Overflow post suggests that for other models like SSD MobileNet the VGG16 feature extractor is replaced by the MobileNet feature extractor.
I thought the same would apply to SSD Inception, but this paper has me confused. From here it seems that the Inception part is added to the SSD part of the model and the VGG16 feature extractor remains at the beginning of the architecture.
What is the architecture of the SSD Inception v2 model?
In the TensorFlow Object Detection API, the ssd_inception_v2 model uses inception_v2 as the feature extractor; that is, the vgg16 part in the first figure (figure (a)) is replaced with inception_v2.
In SSD models, the feature layers extracted by the feature extractor (e.g. vgg16, inception_v2, mobilenet) are further processed to produce extra feature layers of different resolutions. In figure (a) above there are 6 output feature layers, and the first two (19x19) are taken directly from the feature extractor. How are the other 4 layers (10x10, 5x5, 3x3, 1x1) generated?
They are generated by extra convolutional operations (these conv operations are sort of like very shallow feature extractors, aren't they?). The implementation details are here, and the code is well documented. The documentation says:
Note that the current implementation only supports generating new layers
using convolutions of stride 2 (resulting in a spatial resolution reduction
by a factor of 2)
That is how each extra feature map shrinks by a factor of 2. If you read the function multi_resolution_feature_maps, you will find slim.conv2d operations being used, which indicates that these extra layers are obtained with extra convolution layers (just one layer each!).
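To make the size reduction concrete, here is a small sketch of the standard output-size arithmetic for a convolution along one dimension. The helper names are made up for illustration and are not part of the Object Detection API. With SAME padding a stride-2 convolution gives ceil(n / 2), which reproduces 19 -> 10 -> 5 -> 3. The last step in the figure, 3x3 -> 1x1, is not a halving; one configuration that produces it is a 3x3 convolution with VALID padding and stride 1, but whether the model in the figure does exactly that is an assumption here.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


def stride2_chain(size, num_layers, kernel=3):
    """Sizes after repeatedly applying a stride-2 SAME convolution."""
    sizes = []
    for _ in range(num_layers):
        size = conv_output_size(size, kernel, stride=2, padding="SAME")
        sizes.append(size)
    return sizes


print(stride2_chain(19, 3))  # [10, 5, 3]
print(conv_output_size(3, kernel=3, stride=1, padding="VALID"))  # 1
```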
Now we can explain what is improved in the paper you linked. They propose replacing the extra feature layers with Inception blocks. There is no inception_v2 model involved, simply an Inception block. The paper reports improved classification accuracy from using the Inception block.
This should answer the question: an SSD model with vgg16, inception_v2 or mobilenet as the feature extractor is fine, but the Inception in the paper refers only to an Inception block, not the Inception network.

Pre Trained LeNet Model for License plate Recognition

I have implemented a form of the LeNet model in TensorFlow and Python for a car number plate recognition system. My model was trained solely on my training data and tested on the test data. My dataset contains segmented images, each containing only one character. This is what my data looks like. The model does not perform very well, so I'm now looking for models that I can use via transfer learning. Since most models are already trained on a huge dataset, I looked over a few like AlexNet, ResNet, GoogLeNet and Inception v2. Most of these models have not been trained on the type of data I need, which is letters and digits.
Question: Should I still go forward with one of these models and train it on my dataset, or are there better models that would help? For such models, would Keras be a better option since it is more high-level than TensorFlow?
Question: I'd prefer to work with the LeNet model itself, since training the other models would definitely take a long time given the limited specs of my laptop. Is there an implementation of the model trained on machine-printed character images, so that I could then train only the final layers on my data?
To get good results you should use a model explicitly designed for text recognition.
First, (roughly) crop the input image to the region around the text.
Then, feed the image of the text into a neural network (NN) to recognize the text.
A typical NN for text recognition extracts relevant features (with convolutional NN), propagates those features through the image (with recurrent NN) and finally predicts a character score for each position in the image.
Usually, those networks are trained with the CTC loss.
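As an illustration of how the per-position character scores turn into a string under CTC, here is a minimal best-path (greedy) decoding sketch. It is not taken from the CRNN code: the label set, the blank index and the score values are made up for the example. The rule is to take the best label at each position, merge consecutive repeats, and drop the blank label.

```python
def ctc_greedy_decode(scores, labels, blank=0):
    """Best-path CTC decoding.

    scores: one list of per-label scores for each horizontal position.
    labels: label for each score index; labels[blank] is the CTC blank.
    """
    # Pick the highest-scoring label index at every position.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in scores]
    decoded = []
    previous = None
    for index in best_path:
        # Keep a label only if it differs from the previous position and is not blank.
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


# Hypothetical label set: index 0 is the blank, then three characters.
labels = ["", "A", "B", "7"]
scores = [
    [0.10, 0.80, 0.05, 0.05],  # A
    [0.10, 0.70, 0.10, 0.10],  # A (repeat, merged)
    [0.90, 0.05, 0.03, 0.02],  # blank (separates the two A's)
    [0.20, 0.60, 0.10, 0.10],  # A
    [0.10, 0.10, 0.70, 0.10],  # B
    [0.10, 0.20, 0.60, 0.10],  # B (repeat, merged)
    [0.10, 0.10, 0.10, 0.70],  # 7
]
print(ctc_greedy_decode(scores, labels))  # AAB7
```

The blank between the second and third positions is what lets the decoder output a doubled character, which matters for plates with repeated letters or digits.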
As a starting point I would suggest looking at the CRNN implementation (they also provide a pre-trained model) [1] and the corresponding paper [2]. There is, as far as I remember, also a TensorFlow implementation on GitHub.
You can use any framework you like (e.g. TensorFlow or CNTK or ...) as long as it features convolutional and recurrent NNs and the CTC loss.
I once attended a presentation about CNTK where they claimed that they have a very fast implementation of recurrent NN - so maybe CNTK would be a good choice for your slow computer?
[1] CRNN implementation: https://github.com/bgshih/crnn
[2] Shi - An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

VGG16 Transfer learning with an additional input source

I am trying to use Tensorflow for transfer learning using a pre-trained VGG16 model.
However, the input to the model in my problem is an RGB image with an extra channel functioning as a binary mask. This is different than the original input on which the model was trained (224x224 RGB images).
I think that using the pretrained model is still possible in this case. How do I assign weights for connections between the first convolutional layer and the extra channel? Is transfer learning still applicable in such a scenario?
Thanks!
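One option sometimes used for this situation, offered here as an assumption rather than something from this thread, is to keep the pretrained RGB weights of the first convolution and append a new input-channel slice initialised to zero. The network then initially ignores the mask and reproduces the pretrained response, and the new slice is learned during fine-tuning. The sketch below shows only the kernel surgery in NumPy; the random array stands in for VGG16's first conv kernel, which has shape (3, 3, 3, 64) in TensorFlow's height/width/in/out layout, and the helper name is hypothetical.

```python
import numpy as np


def expand_first_conv_kernel(rgb_kernel, extra_channels=1):
    """Append zero-initialised input channels to a conv kernel.

    rgb_kernel: (kh, kw, 3, filters) kernel in height/width/in/out layout.
    """
    kh, kw, _, filters = rgb_kernel.shape
    extra = np.zeros((kh, kw, extra_channels, filters), dtype=rgb_kernel.dtype)
    return np.concatenate([rgb_kernel, extra], axis=2)


rng = np.random.default_rng(0)
# Stand-in for the pretrained first-layer kernel (placeholder values, not real weights).
rgb_kernel = rng.normal(size=(3, 3, 3, 64))
rgbm_kernel = expand_first_conv_kernel(rgb_kernel)

# Response of every filter to a single 3x3 patch, with and without the mask channel.
patch = rng.normal(size=(3, 3, 4))
out_rgb = np.einsum("hwc,hwcf->f", patch[..., :3], rgb_kernel)
out_rgbm = np.einsum("hwc,hwcf->f", patch, rgbm_kernel)
print(rgbm_kernel.shape, np.allclose(out_rgb, out_rgbm, atol=1e-6))
```

Other initialisations (for example the mean of the RGB slices) are possible; zero has the property that the starting point matches the pretrained model exactly.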

How to use a bidirectional RNN layer in TensorFlow?

When we add a bidirectional RNN layer, I understand that we have to concatenate the hidden states. If we use a bidirectional RNN layer in an encoder-decoder model, do we have to train the bidirectional RNN layer separately?
No. To quote from the abstract of Bidirectional Recurrent Neural Networks by Schuster and Paliwal:
The BRNN can be trained without the limitation of using input information just
up to a preset future frame. This is accomplished by training it
simultaneously in positive and negative time direction.
I guess you are talking about tf.nn.static_bidirectional_rnn.
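To see why no separate training is needed, it may help to look at the forward computation. The sketch below is a plain NumPy tanh RNN, not TensorFlow code, with made-up sizes and random weights: one pass reads the sequence left to right, the other right to left, and their hidden states are concatenated per time step. Since both passes feed the same output, a loss computed on that output sends gradients into both sets of weights in the same training step.

```python
import numpy as np


def rnn_pass(inputs, w_in, w_rec, reverse=False):
    """Run a tanh RNN over inputs (T, D); return hidden states (T, H) in input order."""
    num_steps = len(inputs)
    steps = range(num_steps - 1, -1, -1) if reverse else range(num_steps)
    hidden = np.zeros(w_rec.shape[0])
    states = np.zeros((num_steps, w_rec.shape[0]))
    for t in steps:
        hidden = np.tanh(inputs[t] @ w_in + hidden @ w_rec)
        states[t] = hidden
    return states


rng = np.random.default_rng(0)
T, D, H = 5, 3, 4  # hypothetical sequence length, input size, hidden size
inputs = rng.normal(size=(T, D))
fw = (rng.normal(size=(D, H)), rng.normal(size=(H, H)))  # forward-direction weights
bw = (rng.normal(size=(D, H)), rng.normal(size=(H, H)))  # backward-direction weights

forward_states = rnn_pass(inputs, *fw)
backward_states = rnn_pass(inputs, *bw, reverse=True)

# Per time step, concatenate the two directions' hidden states.
outputs = np.concatenate([forward_states, backward_states], axis=1)
print(outputs.shape)
```

In Keras the same wiring is provided by the tf.keras.layers.Bidirectional wrapper, which is trained as an ordinary layer of the model.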

Manipulating pretrained layers of convnet in Tensorflow

I am learning convolutional networks in TensorFlow. I wonder if there are any tutorials on using TF to investigate a pre-trained convnet model, like these excellent tutorials for Caffe: this and this. I mean how to access the middle layers, get their learned parameters and blobs, customize the input shape to accept arbitrary image or batch sizes, etc.
It's not quite the same thing, but there's a codelab here that shows you how to remove the top layer of a pretrained network and train up a new one on your own data:
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/index.html?index=..%2F..%2Findex#0
It might give you some ideas on how to approach this in TensorFlow.