How does Spatial Dropout work at inference time compared to Dropout? - tensorflow2.0

I would like to deploy a trained Keras model on a microcontroller, but the Spatial Dropout layer is not supported there. I thought about removing the layer from the graph, as is done with the Dropout layer. However, I couldn't find any indication of how Spatial Dropout behaves at inference.
I have looked through the documentation and at similar questions but couldn't find anything about it.
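For what it's worth, Keras implements its SpatialDropout layers as variants of Dropout that drop whole feature maps (channels) instead of single activations during training. At inference both pass the input through unchanged, so removing the layer from the graph should be as safe as removing a Dropout layer. Below is a minimal pure-Python sketch of that behavior for the 1D case. The function name spatial_dropout_1d and its arguments are hypothetical and are not the Keras API.

```python
import random


def spatial_dropout_1d(x, rate, training, rng=None):
    """Sketch of spatial dropout on a sequence.

    x is a list of timesteps, each a list of channel values.
    Hypothetical helper for illustration, not the Keras implementation.
    """
    if not training or rate == 0.0:
        # Inference: the layer is the identity, so it can be removed.
        return x
    rng = rng or random.Random()
    channels = len(x[0])
    # One keep/drop decision per channel, shared by all timesteps.
    keep = [rng.random() >= rate for _ in range(channels)]
    # Inverted dropout: scale kept channels so the expected value is unchanged.
    scale = 1.0 / (1.0 - rate)
    return [
        [v * scale if keep[c] else 0.0 for c, v in enumerate(step)]
        for step in x
    ]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(spatial_dropout_1d(x, rate=0.5, training=False))  # unchanged input
print(spatial_dropout_1d(x, rate=0.5, training=True))   # whole channels zeroed
```

Because the scaling is applied during training, no compensation is needed at inference, which is why the exported graph can skip the layer entirely.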

Related

Multi-Head attention layers - what is a wrapper multi-head layer in Keras?

I am new to attention mechanisms and want to learn more about them through practical examples. I came across a Keras implementation of multi-head attention, which I found on PyPI (keras multi-head). I found two different ways to use it in Keras.
One way is to use multi-head as a Keras wrapper layer around either an LSTM or a CNN.
This is a snippet that uses multi-head as a wrapper layer with an LSTM in Keras. The example is taken from the keras multi-head page:
import keras
from keras_multi_head import MultiHead
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding'))
model.add(MultiHead(keras.layers.LSTM(units=64), layer_num=3, name='Multi-LSTMs'))
model.add(keras.layers.Flatten(name='Flatten'))
model.add(keras.layers.Dense(units=4, activation='softmax', name='Dense'))
model.build()
model.summary()
The other way is to use it separately as a stand-alone layer.
This is a snippet of the second implementation, multi-head as a stand-alone layer, also taken from keras multi-head:
import keras
from keras_multi_head import MultiHeadAttention
input_layer = keras.layers.Input(shape=(2, 3), name='Input')
att_layer = MultiHeadAttention(head_num=3, name='Multi-Head')(input_layer)
model = keras.models.Model(inputs=input_layer, outputs=att_layer)
model.compile(optimizer='adam', loss='mse', metrics={})
I have been trying to find documentation that explains this, but I have not found any yet.
Update:
What I have found is that the second implementation (MultiHeadAttention) is closer to the Transformer paper "Attention Is All You Need". However, I am still struggling to understand the first implementation, the wrapper layer.
Would the first one (the wrapper layer) combine the outputs of the multiple heads with the LSTM?
I was wondering if someone could explain the idea behind them, especially the wrapper layer.
I understand your confusion. From my experience, the MultiHead wrapper duplicates (or parallelizes) a layer to form a kind of multichannel architecture, where each channel can be used to extract different features from the input.
For instance, each channel can have a different configuration, and the channel outputs are later concatenated to make an inference. So MultiHead can be used to wrap conventional architectures to form a multi-head CNN, multi-head LSTM, etc.
Note that the attention layer is different. You may stack attention layers to form a new architecture. You may also parallelize the attention layer (MultiHeadAttention) and configure each layer as explained above. See here for different implementations of the attention layer.
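To make the "parallel channels" idea concrete, here is a small framework-free sketch. It only illustrates the general pattern described above (several independently parameterized copies of a layer applied to the same input, with their outputs combined); it is not the keras multi-head implementation. The helpers make_head and multi_head are hypothetical names, and each "head" is just a tiny linear map standing in for an LSTM or CNN.

```python
import random


def make_head(in_dim, out_dim, seed):
    """Build one 'head': a linear map with its own randomly initialized weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def head(x):
        # One output value per weight row (dot product with the input).
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return head


def multi_head(heads, x):
    """Apply every head to the same input and concatenate the results."""
    outputs = [head(x) for head in heads]
    return [value for output in outputs for value in output]


# Three heads with different weights, analogous to layer_num=3 in the question.
heads = [make_head(in_dim=4, out_dim=2, seed=s) for s in range(3)]
x = [0.5, -1.0, 2.0, 0.0]
features = multi_head(heads, x)
print(len(features))  # 3 heads * 2 outputs each = 6
```

Each head sees the same input but has its own weights, so it can learn to respond to different features; a downstream layer (the Flatten and Dense layers in the question's snippet) then consumes the combined output.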

Predicting using pre-trained model in tf.keras

What is the difference between rescaling and not rescaling images when predicting with a tf.keras ResNet50 pre-trained on ImageNet?
Is it necessary? How much of an impact does it have on the predictions?
It is the difference between the model working as expected and not working at all. If you do not apply the same normalization that was applied to the training set, the model usually behaves erratically, for example always producing the same output.
So always use exactly the same scaling and normalization that was used to train the model.
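As an illustration, the ResNet50 weights shipped with Keras expect "caffe"-style preprocessing: RGB is converted to BGR and the per-channel ImageNet mean is subtracted, without dividing by 255. In practice you would call tf.keras.applications.resnet50.preprocess_input on the raw 0-255 image. The sketch below reproduces the arithmetic for a single pixel in plain Python to show what goes wrong if the image is rescaled to [0, 1] first; the helper name caffe_preprocess_pixel is hypothetical.

```python
# Per-channel ImageNet means in BGR order, as used by the "caffe" mode.
IMAGENET_BGR_MEAN = (103.939, 116.779, 123.68)


def caffe_preprocess_pixel(rgb):
    """Convert one RGB pixel (0-255 range) to zero-centered BGR."""
    r, g, b = rgb
    bgr = (b, g, r)
    return tuple(v - m for v, m in zip(bgr, IMAGENET_BGR_MEAN))


pixel = (255.0, 128.0, 0.0)  # a bright orange pixel

# Correct: feed raw 0-255 values to the preprocessing step.
expected = caffe_preprocess_pixel(pixel)

# Wrong: rescaling to [0, 1] first collapses every pixel to roughly -mean.
rescaled_first = caffe_preprocess_pixel(tuple(v / 255.0 for v in pixel))

print(expected)
print(rescaled_first)
```

With the extra rescaling, every pixel ends up within about one unit of the negative channel means, so all images look nearly identical to the network. That is consistent with the "always the same output" symptom described above.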

Preventing overfitting in transfer learning using TensorFlow and Keras

I've got a TensorFlow 2 model with a pre-trained Keras layer coming from TensorFlow Hub. I want to fine-tune the weights in this sub-model to suit my dataset, but if I do that naively by setting trainable=True and training=True, my model grossly overfits.
If I had the actual layers of the underlying model under my control, I would insert dropout layers or set an L2 coefficient on those individual layers. But the layers are imported into my network through TensorFlow Hub's KerasLayer. Also, I suspect that the underlying model is quite complicated.
I wonder what the standard practice is for solving this kind of issue.
Maybe there is a way to force regularization on the whole network somehow? I know that in TensorFlow 1 there were optimizers like ProximalAdagradOptimizer that took L2 coefficients. In TensorFlow 2, the only optimizer like this is FTRL, but it's hard for me to make it work for my dataset.
I "solved" it by
pretraining non-transferred parts of the model,
then turning on learning for the shared layers,
introducing early stopping,
and configuring the optimizer to go really slow.
This way, I managed to not damage the transferred layers too much. Anyway, I still wonder whether this is the best one can do.
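The early-stopping step in the recipe above can be sketched independently of any framework. In Keras this role is played by tf.keras.callbacks.EarlyStopping (with its monitor, patience and restore_best_weights arguments); the function below is only a hypothetical illustration of the patience logic, and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never triggered: training ran for all epochs.
    return len(val_losses) - 1, best_epoch


# Illustrative (made-up) validation losses during fine-tuning.
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9]
stop, best = early_stopping_epoch(losses, patience=2)
print(stop, best)  # 4 2
```

Restoring the weights from best_epoch rather than from stop_epoch is what keeps the transferred layers from drifting too far once the validation loss starts to rise.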

Manipulating pretrained layers of convnet in Tensorflow

I am learning convolutional networks in TensorFlow. I wonder if there are any tutorials on using TF to investigate a pre-trained convnet model, like these excellent tutorials for Caffe: this and this. I mean how to access intermediate layers, get their learned parameters and blobs, customize the input shape to accept an arbitrary image size or batch size, etc.
It's not quite the same thing, but there's a codelab here that shows you how to remove the top layer of a pretrained network and train up a new one on your own data:
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/index.html?index=..%2F..%2Findex#0
It might give you some ideas on how to approach this in TensorFlow.

Prevention of overfitting in convolutional layers of a CNN

I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given HW and time limitations) to identify the simplest model that can achieve close to 0% loss on training data.
Unfortunately, even after all these steps, I'm finding that I can't achieve much better than about 3% test error. (It's not terrible, but for the application to be viable, I'll need to improve that substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for cnns)
DropConnect
Regularization (see my master's thesis, page 85 for examples)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
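As one concrete instance of the regularization item above, an L2 penalty can be applied to convolutional kernels just as it is to fully-connected weights. In Keras this would typically be done by passing kernel_regularizer=tf.keras.regularizers.l2(...) to the convolutional layer. The sketch below shows only the arithmetic, with hypothetical helper names and made-up weights, assuming the penalty is the coefficient times the sum of squared weights.

```python
def l2_penalty(kernels, coeff):
    """Sum of squared weights over all kernels, scaled by the coefficient.

    Each kernel is given as a flat list of its weights.
    """
    return coeff * sum(w * w for kernel in kernels for w in kernel)


def regularized_loss(data_loss, kernels, coeff=1e-4):
    """Total training loss: data term plus L2 penalty on the conv kernels."""
    return data_loss + l2_penalty(kernels, coeff)


# Two tiny made-up kernels, flattened for the example.
conv_kernels = [[1.0, -2.0], [3.0]]
print(l2_penalty(conv_kernels, coeff=0.1))
print(regularized_loss(0.25, conv_kernels, coeff=0.1))
```

Large kernel weights increase the total loss, so the optimizer is pushed toward smaller weights in the convolutional layers as well, not only in the fully-connected ones.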
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
Misc
"The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting."
I don't understand how this is connected. You can have hundreds of labels without a problem of overfitting.