making a pre-trained mxnet network fully convolutional - mxnet

I wish to convert one of the existing pre-trained mxnet models available here to a fully convolutional one.
This means being able to input an image of any size, specifying the stride, and getting a full output.
For instance, assume the model was trained on 224x224x3 images. I want to input an image which is 226x226x3 and specify stride=1, in order to get a 3x3xnum-classes output.
I'm not asking "theoretically", but rather for example code :-)
Thanks!

According to this example: https://github.com/dmlc/mxnet-notebooks/blob/master/python/tutorials/predict_imagenet.ipynb
you can change the data shape when binding the model:
mod.bind(for_training=False, data_shapes=[('data', (1,3,226,226))])
Then you can input a 3 x 226 x 226 image.
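For concreteness, a minimal sketch of that pattern might look like the following. The checkpoint prefix 'resnet-50' is only an assumption (use whichever pre-trained model you downloaded, ideally one with global pooling before its classifier), and note that this alone changes the accepted input size; producing a dense, strided class map still requires replacing the classifier head.
import mxnet as mx

# Load a pre-trained checkpoint (hypothetical prefix 'resnet-50', epoch 0)
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
# Bind with the new 226x226 input instead of the original 224x224
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 226, 226))])
mod.set_params(arg_params, aux_params, allow_missing=True)
# Forward a dummy image of the new size and inspect the output shape
mod.forward(mx.io.DataBatch([mx.nd.ones((1, 3, 226, 226))]))
print(mod.get_outputs()[0].shape)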
Another example: http://mxnet.io/how_to/finetune.html
This example replaces the last layer of the pre-trained model with a new fc layer; a rough sketch of that step is below.
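The core of that fine-tuning recipe, roughly sketched (the checkpoint prefix 'resnet-50', the cut-point name 'flatten0_output', and num_classes are all assumptions that depend on your model):
import mxnet as mx

# Load the pre-trained symbol and weights
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)
# Cut the network just before its original classifier
all_layers = sym.get_internals()
net = all_layers['flatten0_output']   # layer name depends on the model
# Attach a new fc layer and softmax for your own number of classes
net = mx.symbol.FullyConnected(data=net, num_hidden=num_classes, name='fc_new')
net = mx.symbol.SoftmaxOutput(data=net, name='softmax')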

Related

Last fc layers in VGG16

The VGG16 architecture takes 224x224x3 images as input. I want to use 48x48x3 inputs, but to do this in Keras we remove the last fc layers, which have 4096 neurons each. Why do we have to do this? And do we need to add fc layers of a different size for this input?
The final pooling layer of VGG16 has dimension 7x7x512 for a 224x224 input image. From there VGG16 uses a fully connected layer of (7x7x512)x4096 to get a 4096-dimensional output. However, since your input size is different, the feature output of the final pooling layer will also have a different dimension (1x1x512 for a 48x48 input). So you need to change the matrix dimension of the fully connected layer to make it work. You have two other options though:
Use global average pooling across the spatial dimensions to get a 512-dimensional feature and then use a few fully connected layers to get to your number of classes (a sketch of this option follows below).
Resize your input image to 224x224x3 and you won't need to change anything in the model architecture.
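A minimal sketch of the first option, assuming the standard Keras VGG16 application with a 48x48x3 input (num_classes is a placeholder):
from keras.applications.vgg16 import VGG16
from keras.layers import Input, GlobalAveragePooling2D, Dense
from keras.models import Model

inputs = Input(shape=(48, 48, 3))                      # the new, smaller input
base = VGG16(include_top=False, weights='imagenet', input_tensor=inputs)
x = GlobalAveragePooling2D()(base.output)              # 512-dimensional feature
x = Dense(256, activation='relu')(x)                   # a small fc head
outputs = Dense(num_classes, activation='softmax')(x)  # num_classes is yours
model = Model(inputs=inputs, outputs=outputs)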
Removing the last FC layers is for fine-tuning or transfer learning, where you adapt an existing network to a new problem, such as changing the number of categories that your classifier can choose between.
You are adapting the network to take a different sized input, so you need to adjust the first layer(s) of the network.

Obtaining Logits of the output from deeplab model

I'm using a pre-trained DeepLab model (from here) to obtain segmentations for an input image. I'm able to obtain the semantic labels (i.e. SemanticPredictions), which is argmax applied to the logits (link).
I was wondering if there is an easy way to obtain the logits before the argmax? I was hoping to find the output tensor name and simply pass it into my tf session, as in the following:
tf_session.run(
    self.OUTPUT_TENSOR_NAME,
    feed_dict={self.INPUT_TENSOR_NAME: [np.asarray(input_image)]})
But I have not been able to locate such a tensor name in the code, one that reveals the logits or softmax outputs.
For a model trained from MobileNet_V2, setting self.OUTPUT_TENSOR_NAME = 'ResizeBilinear_2:0' retrieves the logits before the argmax is performed.
I suspect this is the same for xception, but have not verified it.
I arrived at this answer by loading my model in TensorFlow, printing the names of all layers in the loaded graph, and then taking the name of the final output layer before the last 'ArgMax' layer and running some inference with it.
Here is a link to a stackoverflow question on printing the names of the layers in a graph. I found the answer by Ted to be most helpful.
By the way, the output layer of DeepLabV3 models does not apply softmax, so you cannot simply take the raw values of the output vector as confidences.
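As a rough sketch of that procedure: assuming a TF1-style frozen DeepLab graph already loaded into a variable graph, the demo's 'ImageTensor:0' input name, and an RGB array input_image (all assumptions that may differ in your setup), listing the op names and fetching the pre-argmax tensor could look like this:
import numpy as np
import tensorflow as tf

with tf.Session(graph=graph) as sess:
    # Print every op name to locate the layer just before the final 'ArgMax'
    for op in graph.get_operations():
        print(op.name)
    # Fetch the logits tensor instead of 'SemanticPredictions:0'
    logits = sess.run(
        'ResizeBilinear_2:0',
        feed_dict={'ImageTensor:0': [np.asarray(input_image)]})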

visualizing 2nd convolutional layer

So via tf.summary I visualized the first convolutional layer in my model, of shape [5,5,3,32], as a set of individual images, one per filter. This layer has filters of spatial size 5x5 and depth 3, and there are 32 of them, so I view these filters as 5x5 color (RGB) images.
I'm wondering how to generalize this to the second convolutional layer, and the third, and so on.
The shape of the second convolutional layer is [5,5,32,64].
My question is: how would I transform that tensor into individual 5x5x3 images?
With the first conv layer of shape [5,5,3,32], I visualize it by first transposing with tf.transpose(W_conv1, (3,0,1,2)), which gives 32 images of shape 5x5x3.
Doing tf.transpose(W_conv2, (3,0,1,2)) would produce a shape of [64,5,5,32]. How would I then use those "32 color channels"? (I know it's not that simple :) ).
Visualization of higher-level filters is usually done indirectly. To visualize a particular filter, you look for an image that the filter would respond to most strongly. To find it, you perform gradient ascent in the space of images (instead of changing the parameters of the network as you do during training, you change the input image).
You will understand it more easily if you play with the following Keras code: https://github.com/keras-team/keras/blob/master/examples/conv_filter_visualization.py
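The heart of that script fits in a few lines. This sketch assumes a channels-last Keras model named model with a TF1 backend, a 128x128 RGB input, and a hypothetical layer name 'conv2d_2' for the layer whose filters you want to visualize:
import numpy as np
from keras import backend as K

layer_output = model.get_layer('conv2d_2').output
filter_index = 0                                    # which filter to visualize

# Maximize the mean activation of one filter with respect to the input image
loss = K.mean(layer_output[:, :, :, filter_index])
grads = K.gradients(loss, model.input)[0]
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)   # normalize the gradient
iterate = K.function([model.input], [loss, grads])

# Start from a noisy gray image and take a few gradient-ascent steps
img = np.random.random((1, 128, 128, 3)) * 20 + 128.0
for _ in range(40):
    loss_value, grads_value = iterate([img])
    img += grads_value                              # img now excites the chosen filter more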

How to do fine-tuning in tensorflow with notop layers and define my own input image size

There are many examples of how to do fine-tuning with TensorFlow. Almost all of these examples resize the images to the specific size that the existing model needs; for example, 224×224 is the input size that VGG19 expects. However, in Keras we can change the input size by setting include_top to False:
base_model = VGG19(include_top=False, weights="imagenet", input_shape=(input_size, input_size, input_channels))
Then we no longer have to fix the image size at 224×224. Can we do this kind of fine-tuning using the official pre-trained models in TensorFlow? I have not been able to find a solution so far; can anyone help?
Yes, it is possible to do this kind of fine-tuning. You would just have to ensure that you also fine-tune some of the first few layers (to account for changed input) of the original network in addition to the last few layers (to account for changed output).
I work with TensorFlow using Keras. If you are open to that, then there is a code snippet that shows the general fine-tuning flow here:
https://keras.io/applications/
Specifically, I had to write the following code to make it work for my case:
# img_width, img_height is the size of your new input; 3 is the number of channels
import keras
from keras.layers import Input
from keras.models import Model

input_tensor = Input(shape=(img_width, img_height, 3))
base_model = keras.applications.vgg19.VGG19(include_top=False, weights='imagenet', input_tensor=input_tensor)
# instantiate whatever other layers you need on top of base_model.output here
# predictions is the new logistic layer added to account for the new classes
model = Model(inputs=base_model.inputs, outputs=predictions)
Hope this helps.

How to change number of channels to fine tune VGG16 net in Keras

I would like to fine tune the VGG16 model using my own grayscale images. I know I can fine tune/add my own top layers by doing something like:
base_model = keras.applications.vgg16.VGG16(include_top=False, weights='imagenet', input_tensor=None, input_shape=(im_height,im_width,channels))
but only when channels = 3 according to the documentation.
I have thought of simply adding two redundant channels to my image, but this seems like a waste of computation and could make the classification worse. I could also replicate the same image across three channels, but I am similarly unsure how that would perform.
Keras pre-trained models were trained on color images, and if you want to use their full power you should use color images for fine-tuning. However, if you have grayscale images you can still use these pre-trained models by repeating your grayscale image over the three channels. Obviously, it will not work as well as using color images as input.
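For the channel-repetition idea, a one-liner with NumPy is enough (gray_images is a hypothetical batch of shape (N, H, W, 1)):
import numpy as np

# Repeat the single gray channel three times so the batch matches VGG16's RGB input
rgb_images = np.repeat(gray_images, 3, axis=-1)     # (N, H, W, 1) -> (N, H, W, 3)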
The Keras VGG model uses the function keras.applications.imagenet_utils._obtain_input_shape.
This function was tailored for ImageNet data and thus enforces the number of input channels to be 3. One possible workaround is to copy the VGG16 module and replace the line:
input_shape = _obtain_input_shape(input_shape, default_size=224, min_size=48, data_format=K.image_data_format(), include_top=include_top)
with:
input_shape = (im_height, im_width, 1)
As a side note, you will not be able to load ImageNet weights since your input space has changed and the first layer convolutions will not match.