Which final layer to choose for best representative features of objects in Mask-RCNN? - object-detection

I would like to extract the features from the final layer of Mask R-CNN (https://medium.com/@jonathan_hui/image-segmentation-with-mask-r-cnn-ebe6d793272, https://github.com/matterport/Mask_RCNN, https://github.com/multimodallearning/pytorch-mask-rcnn) to feed into another network. Would it be best to take gt_mask, det_masks (I was looking at https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/inspect_model.ipynb), or is there another recommended layer?
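For reference, the inspect_model.ipynb notebook in the matterport repo pulls intermediate tensors out of the graph with model.run_graph; a minimal sketch along those lines (it assumes a MaskRCNN model already built in inference mode with weights loaded and an image loaded as a numpy array; the layer names come from the repo's model.py):

# model: an mrcnn.model.MaskRCNN instance in inference mode, weights loaded
# image: a single input image as a numpy array
outputs = model.run_graph([image], [
    ("detections", model.keras_model.get_layer("mrcnn_detection").output),        # final boxes + classes
    ("masks", model.keras_model.get_layer("mrcnn_mask").output),                  # per-ROI mask logits
    ("roi_features", model.keras_model.get_layer("roi_align_classifier").output)  # pooled per-ROI features
])
roi_features = outputs["roi_features"]  # candidate features to feed into another network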

Related

Can I use a final pooling layer to find the best common features after concatenating a deep features vector and a handcrafted features vector?

I have two feature vectors. One is a deep feature vector extracted by a CNN, and the other is a handcrafted feature vector extracted with uniform local binary patterns. I want to find the best common features after concatenating these two feature vectors. I want to use a final pooling layer for this. Is it possible?
After you have concatenated the two feature vectors, a final pooling layer would help reduce the size of the concatenated vector.
Could you explain more about what you aim to do and which pooling layer you want to use?
I'm not sure I understand correctly what you meant by "final pooling layer"
But in my opinion, adding ONLY a pooling layer after the concatenation layer and before the output layer (e.g., Dense + softmax) may not help much in this case, since pooling layers have no learnable parameters and they operate over each activation map independently, only reducing the size of the activation maps.
One simple feature-fusion method I would suggest is to apply another subnet (a set of layers such as convolution, pooling, and dense layers) to the concatenated tensor, so the model can keep learning to enhance the good features; a sketch of this idea follows below.
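A minimal Keras sketch of that idea, assuming a 2048-d CNN feature vector, a 59-bin uniform-LBP histogram, and 10 classes (all of these sizes are placeholders, not values from the question):

import tensorflow as tf

# two hypothetical feature inputs: deep CNN features and a handcrafted LBP histogram
cnn_feat = tf.keras.Input(shape=(2048,), name="cnn_features")
lbp_feat = tf.keras.Input(shape=(59,), name="lbp_features")

# concatenate the two feature vectors
merged = tf.keras.layers.Concatenate()([cnn_feat, lbp_feat])

# small trainable fusion subnet instead of a single (parameter-free) pooling layer
x = tf.keras.layers.Dense(512, activation="relu")(merged)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)
output = tf.keras.layers.Dense(10, activation="softmax")(x)  # 10 classes assumed

model = tf.keras.Model(inputs=[cnn_feat, lbp_feat], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy")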

Where are the filter image data in this TensorFlow example?

I'm trying to work through this tutorial by Google on using the TensorFlow Estimator API to train and recognise images: https://www.tensorflow.org/tutorials/estimators/cnn
The data I can see in the tutorial are: train_data, train_labels, eval_data, eval_labels:
(train_data, train_labels), (eval_data, eval_labels) = tf.keras.datasets.mnist.load_data()
In the convolutional layers, shouldn't there be filter (kernel) data to multiply with the input image data? I don't see it anywhere in the code.
According to this guide, the input image data is multiplied (convolved) with filter data to check for low-level features (curves, edges, etc.), so there should be filter data somewhere too (the right matrix in the guide's illustration): https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks
The filters are the weight matrices of the Conv2D layers used in the model; they are not pre-loaded images like the "butt curve" in your example. If they were, we would need to provide the CNN with all possible types of shapes, curves, and colours, and hope that any unseen data we feed the model contains this finite set of patterns somewhere for the model to recognise.
Instead, we allow the CNN to learn the filters it requires to successfully classify from the data itself, and hope it generalises to new data. Over many iterations and a lot of data (which CNNs do require), the model iteratively crafts the best set of filters for successfully classifying the images. The random initialisation at the start of training ensures that the filters in each layer learn to identify different features in the input image.
The fact that earlier layers usually correspond to colours and edges (as above) is not predefined; the network has realised that looking for edges in the input is the only way to create context in the rest of the image, and thereby classify it (humans do the same initially).
The network uses these primitive filters in earlier layers to generate more complex interpretations in deeper layers. This is the power of distributed learning: representing complex functions through multiple applications of much simpler functions.
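To make this concrete, here is a minimal standalone Keras sketch (not the Estimator code from the tutorial) showing that the filters are just trainable weight tensors that start out random and are updated during training:

import tensorflow as tf

# a single convolutional layer with 32 filters of size 5x5
conv = tf.keras.layers.Conv2D(32, kernel_size=5, activation="relu", input_shape=(28, 28, 1))
model = tf.keras.Sequential([conv, tf.keras.layers.Flatten(), tf.keras.layers.Dense(10)])

# the "filter image data": a randomly initialised tensor, learned during training
kernels, biases = conv.get_weights()
print(kernels.shape)  # (5, 5, 1, 32) - one 5x5 kernel per input channel per filter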

What is meant by visualizing an embedding space(neural network)?

I was reading an activity recognition paper, https://arxiv.org/pdf/1705.07750.pdf, in which 3D convolutions are applied to Inception V1 to perform activity recognition. In a talk about it, the speaker mentioned visualizing the embedding space of the features extracted from the video.
1) What does it mean to visualize an embedding space? Are you looking at the filters that it has learnt or are you looking for clusterings of similar activities?
2) Do you just visualize the weight matrix for seeing the features that it is capturing? If yes, which weight matrix?
3) Does tf.summary.image() help in visualizing the weight matrix?
The embedding space is the space of the features produced by some learning algorithm. In the specific case of a (convolutional) neural network, this usually means one of the output feature maps (flattened) at some predefined layer or the output of one of the fully connected layers.
What one would visualize is not the weight matrix, but the values of the features produced for some input test data. For example, one takes the full test set, passes it through the network, computes the features for each image at a specific layer, and then visualizes those values.
TensorBoard has functionality to automatically visualize embeddings and other feature spaces; you should take a look at it.
Note that in some application contexts like NLP an embedding has a slightly different definition but the use is the same.
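A minimal sketch of the usual workflow, using a generic pretrained image classifier rather than the I3D model from the paper (MobileNetV2 and the dummy data below are placeholders): cut the network at a chosen layer, compute features for the test set, and project them to 2D for plotting.

import numpy as np
import tensorflow as tf
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# pretrained CNN used purely as a feature extractor; pooling="avg" makes the model
# output the globally averaged feature map (the "embedding") for each image
extractor = tf.keras.applications.MobileNetV2(weights="imagenet",
                                              include_top=False, pooling="avg")

# placeholder batch - replace with your own test images and labels
images = np.random.rand(100, 224, 224, 3).astype("float32")
labels = np.random.randint(0, 5, size=100)

features = extractor.predict(images)                    # shape (100, 1280)
points = TSNE(n_components=2).fit_transform(features)   # project embeddings to 2D
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.show()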

Can we use batch normalization with transfer learning for an instance with different data distribution?

This tutorial has the TensorFlow implementation of a batch normalization layer for the training and testing phases.
When we use transfer learning, is it OK to use batch normalization layers, especially when the data distributions are different?
In the inference phase, a BN layer uses fixed estimates of the mean and variance (moving averages computed from the training distribution).
So if our model sees data from a different distribution, can it give wrong results?
With transfer learning, you're transferring the learned parameters from one domain to another.
Usually, this means keeping the learned values of the convolutional layers fixed while adding new fully connected layers that learn to classify the features extracted by the CNN.
When you add batch normalization to every layer, you're using statistics estimated from that layer's input distribution to force its output to be (approximately) normally distributed.
To do that, you compute an exponential moving average of the layer's output statistics during training, and in the testing phase you normalize the layer output with these fixed values.
Although data dependent, these statistics (one set per convolutional layer) are computed on the output of the layer, thus on the transformation the layer has learned.
Thus, in my opinion, the various averages that the BN layers subtract from their convolutional layers' outputs are general enough to be transferred: they are computed on the transformed data and not on the original data.
Moreover, convolutional layers learn to extract local patterns, so they're more robust and harder to throw off.
Thus, in short and in my opinion:
you can apply transfer learning to convolutional layers with batch norm applied. But on fully connected layers, the influence of the computed statistics (which see the whole input and not only local patches) can be too data dependent, so I'd avoid it there.
However, as a rule of thumb: if you're unsure about something, just try it and see if it works! A sketch of one common way to handle the BN statistics when fine-tuning follows below.
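One common recipe, shown here as a minimal Keras sketch (the backbone, input size, and class count are assumptions, not part of the question), is to keep the transferred BN layers in inference mode during fine-tuning so their moving mean/variance are not updated on the new, differently distributed data:

import tensorflow as tf

# pretrained backbone; its BN layers carry moving statistics from the source domain
backbone = tf.keras.applications.ResNet50(weights="imagenet",
                                          include_top=False, pooling="avg")
backbone.trainable = False  # freeze the transferred weights

inputs = tf.keras.Input(shape=(224, 224, 3))
x = backbone(inputs, training=False)  # training=False keeps BN using its moving averages
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)  # 5 target classes assumed

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")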

Making sense of Inception V3 layers

The last layers of the Inception V3 network include an 8x8x2048 "mixed10" layer followed by a 1x1x2048 "avg_pool" layer. What is the real difference between these two layers, i.e. does the "mixed10" layer capture all the features of an image, for example, or is that only accomplished in the "avg_pool" layer?
As the name suggests, the avg_pool layer is obtained by taking, for each of the 2048 feature maps in mixed10, the global average over the 64 values in the 8x8 map. Depending on your application, it can make sense to use either: if you want to use the network as a feature extractor and train a classifier on top, the average-pooled version makes sense.
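A minimal Keras sketch showing how to read out either tensor (the layer names "mixed10" and "avg_pool" are the ones used by tf.keras.applications.InceptionV3; the random input is just a placeholder):

import numpy as np
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet")

# "mixed10": the last 8x8x2048 convolutional block output
# "avg_pool": its global average, a single 2048-d vector per image
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=[base.get_layer("mixed10").output, base.get_layer("avg_pool").output])

image = np.random.rand(1, 299, 299, 3).astype("float32")  # placeholder input
mixed10, avg_pool = extractor.predict(image)
print(mixed10.shape, avg_pool.shape)  # (1, 8, 8, 2048) (1, 2048)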