I am following the Transfer learning and fine-tuning guide on the official TensorFlow website. It points out that during fine-tuning, batch normalization layers should be in inference mode:
Important notes about BatchNormalization layer
Many image models contain BatchNormalization layers. That layer is a
special case on every imaginable count. Here are a few things to keep
in mind.
BatchNormalization contains 2 non-trainable weights that get updated during training. These are the variables tracking the mean and variance of the inputs.
When you set bn_layer.trainable = False, the BatchNormalization layer will run in inference mode, and will not update its mean & variance statistics. This is not the case for other layers in general, as weight trainability & inference/training modes are two orthogonal concepts. But the two are tied in the case of the BatchNormalization layer.
When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing training=False when calling the base model. Otherwise the updates applied to the non-trainable weights will suddenly destroy what the model has learned.
You'll see this pattern in action in the end-to-end example at the end
of this guide.
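A minimal sketch of the pattern the guide describes, assuming a tiny hypothetical network (`base`, with layers named `base_dense` and `base_bn`) in place of a real pretrained model. The base is fully unfrozen, but because it is called with training=False, its BatchNormalization moving statistics should stay put during fit while the other weights train:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in for a pretrained base (hypothetical; a real one would come
# from tf.keras.applications with pretrained weights).
base = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4, name="base_dense"),
    tf.keras.layers.BatchNormalization(name="base_bn"),
])
base.trainable = True  # unfreeze everything for fine-tuning

inputs = tf.keras.Input(shape=(8,))
x = base(inputs, training=False)  # BN inside `base` stays in inference mode
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3), loss="mse")

bn = base.get_layer("base_bn")
dense = base.get_layer("base_dense")
mean_before = bn.moving_mean.numpy().copy()
kernel_before = dense.kernel.numpy().copy()

# Random data whose mean (about 5) differs from the initial moving mean (0).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, size=(32, 8)).astype("float32")
y_train = rng.normal(size=(32, 1)).astype("float32")
model.fit(x_train, y_train, epochs=1, batch_size=8, verbose=0)

# Compare the BN statistics and the Dense kernel before and after training.
mean_unchanged = np.allclose(mean_before, bn.moving_mean.numpy())
kernel_changed = not np.allclose(kernel_before, dense.kernel.numpy())
print("moving mean unchanged:", mean_unchanged)
print("dense kernel changed:", kernel_changed)
```

The data, learning rate and layer sizes are arbitrary choices for the illustration.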
However, some other sources, for example this article (titled Transfer Learning with ResNet), say something completely different:
for layer in resnet_model.layers:
    if isinstance(layer, BatchNormalization):
        layer.trainable = True
    else:
        layer.trainable = False
Anyway, I know that in TensorFlow the training argument and the trainable attribute are two different things.
I am loading my model from file, like so:
model = tf.keras.models.load_model(path)
And I am unfreezing (or actually freezing the rest) some of the top layers in this way:
model.trainable = True
for layer in model.layers:
    if layer not in model.layers[idx:]:
        layer.trainable = False
Now, about the batch normalization layers. I can either do:
for layer in model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False
or
for layer in model.layers:
    if layer.name.startswith('bn'):
        layer.call(layer.input, training=False)
Which one should I do? And, in the end, is it better to freeze the batch norm layers or not?
Not sure about the training vs trainable difference, but personally I've gotten good results setting trainable = False.
Now as to whether to freeze them in the first place: I've had good results with not freezing them. The reasoning is simple: the batch norm layer learns the moving averages of the initial training data. This may be cats, dogs, humans, cars, etc. But when you're transfer learning, you could be moving to a completely different domain. The moving averages of this new domain of images can be far different from those of the prior dataset.
By unfreezing those layers and freezing the CNN layers, my model saw a 6-7% increase in accuracy (82 -> 89% ish). My dataset was far different from the initial ImageNet dataset that EfficientNet was trained on.
P.S. Depending on how you plan on running the model post training, I would advise you to freeze the batch norm layers once the model is trained. For some reason, if you ran the model online (1 image at a time), the batch norm would get all funky and give irregular results. Freezing them post training fixed the issue for me.
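One plausible contributor to the single-image weirdness (this is an assumption, since the answer does not say in which mode the model was run) is that batch statistics become degenerate for a batch of one. A rough NumPy sketch with made-up numbers, normalizing per feature over the batch axis only:

```python
import numpy as np

def normalize_with_batch_stats(x, eps=1e-3):
    # Training-mode behaviour: statistics come from the current batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

def normalize_with_moving_stats(x, moving_mean, moving_var, eps=1e-3):
    # Inference-mode behaviour: statistics were accumulated during training.
    return (x - moving_mean) / np.sqrt(moving_var + eps)

single_sample = np.array([[2.0, -1.0, 7.0]])  # a "batch" of one sample

# With one sample, the batch mean equals the sample itself, so every feature
# collapses to zero and the information in the input is lost.
out_batch = normalize_with_batch_stats(single_sample)

# Made-up moving statistics, for illustration only.
out_moving = normalize_with_moving_stats(
    single_sample,
    moving_mean=np.array([1.0, 0.0, 5.0]),
    moving_var=np.array([4.0, 1.0, 9.0]),
)
print(out_batch)
print(out_moving)
```

Convolutional feature maps are also averaged over height and width, so the collapse is less extreme there, but the statistics of a single image are still much noisier than the accumulated ones.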
Use the code below to see whether the batch norm layers are frozen or not. It prints not only the layer names but also whether they are trainable.
def print_layer_trainable(conv_model):
    for layer in conv_model.layers:
        print("{0}:\t{1}".format(layer.trainable, layer.name))
In my case, I tested your method, but it did not freeze my model's batch norm layers:
for layer in model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False
The code below worked nicely for me. In my case the model is a ResNetV2 and the batch norm layers are named with the suffix "preact_bn". By using the code above for printing layers, you can see how the batch norm layers are named and configure them as you want.
for layer in new_model.layers[:]:
    if 'preact_bn' in layer.name:
        trainable = False
    else:
        trainable = True
    layer.trainable = trainable
Just to add to @luciano-dourado's answer:
In my case, I started by following the Transfer Learning guide as is, that is, freezing BN layers throughout the entire training (classifier + fine-tuning).
What I saw is that training the classifier worked without problems but as soon as I started fine-tuning, the loss went to NaN after a few batches.
After running the usual checks (input data without NaNs, loss functions yielding correct values, etc.), I checked whether the BN layers were running in inference mode (trainable = False).
But in my case, the dataset was so different from ImageNet that I needed to do the contrary and set the trainable attribute of all BN layers to True. I found this empirically, just as @zwang commented. Just remember to freeze them after training, before you deploy the model for inference.
By the way, just as an informative note: ResNet50V2, for example, has a total of 49 BN layers, of which only 16 are pre-activation BNs. This means that the remaining 33 layers were updating their mean and variance values.
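To check such counts on your own model, something along these lines may help. It is only a sketch: `count_bn_layers` is a hypothetical helper, and the toy model with its layer names is made up to mimic the ResNetV2 naming scheme.

```python
import tensorflow as tf

def count_bn_layers(model, name_fragment="preact_bn"):
    """Return (total BN layers, BN layers whose name contains name_fragment)."""
    bn_layers = [
        layer for layer in model.layers
        if isinstance(layer, tf.keras.layers.BatchNormalization)
    ]
    matching = [layer for layer in bn_layers if name_fragment in layer.name]
    return len(bn_layers), len(matching)

# Toy model whose layer names imitate the ResNetV2 convention (made up).
inputs = tf.keras.Input(shape=(4,))
x = tf.keras.layers.BatchNormalization(name="block1_preact_bn")(inputs)
x = tf.keras.layers.Dense(4, name="block1_dense")(x)
x = tf.keras.layers.BatchNormalization(name="block1_1_bn")(x)
x = tf.keras.layers.BatchNormalization(name="post_bn")(x)
toy_model = tf.keras.Model(inputs, x)

total, preact = count_bn_layers(toy_model)
print(total, preact)
```

For the real thing, you could pass e.g. tf.keras.applications.ResNet50V2(weights=None) instead of the toy model.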
Yet another case where one has to run several empirical tests to find out why the "standard" approach does not work in his/her case. I guess this further reinforces the importance of data in Deep Learning :)
As far as I know, a CNN's last layers identify objects as a whole, which is irrelevant to a dataset of signatures. Thus, I want to remove them and add additional layers on top of the model, freezing VGG16 during training. How would the removal of layers potentially affect the model's performance, or should I just leave them and delete only the dense layers?
I need to add additional layers on top anyway for the school report about the effect of convolutional layers' configurations on the model's performance.
P.S. My dataset is really small: it contains nearly 700 samples, which I know is extremely small (I tried augmenting the data).
I have a dataset with Chinese signatures, but I thought that it is better to train it separately.
I am not proficient in this field and I started my acquaintance with it through deep learning, so please correct me if you notice any misconception in my explanation.
The easiest way is to use VGG with include_top=False, weights='imagenet', and pooling='max'. This instantiates the model with ImageNet weights and removes the top classification layers, and the output of the VGG model is a flat vector you can feed directly into a dense layer. My typical code for this is shown below. In the final layer, class_count is the number of classes in the training data.
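import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model

# img_shape and class_count must be defined beforehand for your dataset.
base_model = tf.keras.applications.VGG16(include_top=False, weights="imagenet", input_shape=img_shape, pooling='max')
x = base_model.output
x = keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001)(x)
x = Dense(256, kernel_regularizer=regularizers.l2(0.016), activity_regularizer=regularizers.l1(0.006),
          bias_regularizer=regularizers.l1(0.006), activation='relu')(x)
x = Dropout(rate=.45, seed=123)(x)
output = Dense(class_count, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=output)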
How would the removal of layers potentially affect the model's performance, or should I just leave and delete only dense layers?
This is hard to answer, because which performance are you talking about? VGG16 was originally built for the ImageNet problem with 1000 classes, so if you use it without any modifications it probably won't work at all.
Now, if you are talking about transfer learning, then yes, the last dense layers can be replaced to classify your dataset, because the CNN layers in VGG16 form a good pattern recognizer. The fully connected layers at the end work as a classifier for these patterns, and you should replace them and train them again for your specific problem. VGG16 has 3 dense layers (FC1, FC2 and FC3) at the end. Keras only lets you drop all three at once (through include_top=False), so if you want to replace just the last one, you will need to remove all three and rebuild FC1 and FC2.
The key is what you are going to train after that. You could:
Use the original weights (ImageNet) in the CNN layers and start your training from there, just fine-tuning with a small learning rate. A good choice when your dataset is similar to the original one and you have a good amount of data.
Use the original weights (ImageNet) in the CNN layers, but freeze them, and train only the weights of the dense layers you replaced. A good choice when your dataset is small.
Don't use the original weights and retrain the whole model. Usually not a good choice, because you will need expertise in tuning the parameters, tons of data, and a lot of computational power to make it work.
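The first two options could look roughly like this. It is a sketch under assumptions: `build_transfer_model` is a hypothetical helper, a tiny untrained convolutional stack stands in for VGG16 so the snippet runs without downloading weights, and the learning rates are illustrative values only.

```python
import tensorflow as tf

def build_transfer_model(base_model, input_shape, class_count, freeze_base=True):
    """Put a new softmax classifier on top of a convolutional base."""
    base_model.trainable = not freeze_base
    inputs = tf.keras.Input(shape=input_shape)
    x = base_model(inputs, training=False)  # keeps any BN layers in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(class_count, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

# Stand-in for VGG16(include_top=False, weights="imagenet") (hypothetical).
toy_base = tf.keras.Sequential([
    tf.keras.Input(shape=(16, 16, 3)),
    tf.keras.layers.Conv2D(4, 3, activation="relu"),
])

# Frozen base: only the new classifier head is trained.
model = build_transfer_model(toy_base, (16, 16, 3), class_count=5, freeze_base=True)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy")
trainable_when_frozen = len(model.trainable_weights)  # head kernel + bias

# Fine-tuning: unfreeze the base and recompile with a small learning rate.
toy_base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy")
trainable_when_unfrozen = len(model.trainable_weights)  # plus conv kernel + bias

predictions = model(tf.zeros((2, 16, 16, 3)))
print(trainable_when_frozen, trainable_when_unfrozen, predictions.shape)
```

Recompiling after changing trainable is what makes the change take effect for fit. The third option would simply mean building the base with weights=None and leaving everything trainable.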
I am training a binary text classification model using BERT as follows:
def create_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)
    # Neural network layers
    l1 = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
    l2 = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l1)
    # Use inputs and outputs to construct a final model
    model = tf.keras.Model(inputs=[text_input], outputs=[l2])
    return model
This code is borrowed from the example on tfhub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4.
I want to extract feature embeddings from the penultimate layer and use them for comparison, clustering, visualization, etc between examples. Should this be done before dropout (l1 in the model above) or after dropout (l2 in the model above)?
I am trying to figure out whether this choice makes a significant difference, or is it fine either way? For example, if I extract feature embeddings after dropout and compute feature similarities between two examples, this might be affected by which nodes are randomly set to 0 (but perhaps this is okay).
In order to answer your question let's recall how a Dropout layer works:
The Dropout layer is usually used as a means to mitigate overfitting. Suppose two layers, A and B, are connected through a Dropout layer. Then during the training phase, neurons in layer A are being randomly dropped. That prevents layer B from becoming too dependent upon specific neurons in layer A, as these neurons are not always available. Therefore, layer B has to take into consideration the overall signal coming from layer A, and (hopefully) cannot cling to some noise which is specific to the training set.
An important point to note is that the Dropout mechanism is activated only during the training phase. While predicting, Dropout does nothing.
If I understand you correctly, you want to know whether to take the features before or after the Dropout (note that in your network l1 denotes the features after Dropout has been applied). If so, I would take the features before Dropout, because technically it does not really matter (Dropout is inactive during prediction) and it is more reasonable to do so (Dropout is not meaningful without a following layer).
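A small sketch of how the features could be pulled out on either side of the Dropout layer. It assumes a toy model in which a Dense layer named `pooled_output` stands in for the BERT encoder, so the snippet runs without TF Hub; the layer names and sizes are made up.

```python
import numpy as np
import tensorflow as tf

# Toy classifier: a Dense layer plays the role of BERT's pooled output.
inputs = tf.keras.Input(shape=(6,))
pooled = tf.keras.layers.Dense(4, activation="tanh", name="pooled_output")(inputs)
dropped = tf.keras.layers.Dropout(0.5, name="dropout")(pooled)
outputs = tf.keras.layers.Dense(1, activation="sigmoid", name="output")(dropped)
model = tf.keras.Model(inputs, outputs)

# Sub-models exposing the features before and after the Dropout layer.
dropout_layer = model.get_layer("dropout")
before_dropout = tf.keras.Model(model.input, dropout_layer.input)
after_dropout = tf.keras.Model(model.input, dropout_layer.output)

x = np.random.default_rng(0).normal(size=(3, 6)).astype("float32")
emb_before = before_dropout(x, training=False).numpy()
emb_after = after_dropout(x, training=False).numpy()

# In inference mode Dropout is the identity, so both embeddings coincide.
same_at_inference = np.allclose(emb_before, emb_after)

# In training mode each unit is either zeroed or scaled by 1 / (1 - rate).
emb_after_train = after_dropout(x, training=True).numpy()
print(same_at_inference)
```

So as long as the embeddings are extracted with predict or training=False, similarities between examples are not affected by random masking.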
The TensorFlow guide on transfer learning says:
When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing training=False when calling the base model.
What I understand from this is that, even when I unfreeze layers, if the pre-trained model contains BatchNormalization layers, I should set training=False, just like in the code below:
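from tensorflow.keras import Input, regularizers
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D

resnet = ResNet50(weights='imagenet', include_top=False)
resnet.trainable = True  # unfreeze
inputs = Input(shape=(150, 150, 3))
x = resnet(inputs, training=False)  # because of BN
x = GlobalAveragePooling2D()(x)
x = Dropout(0.2)(x)
outputs = Dense(150, kernel_regularizer=regularizers.l2(0.005), activation='softmax')(x)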
However, I got very low accuracy and the model hardly learned, whereas when I set training to True the accuracy was satisfactory.
So, these are my questions:
Is it wrong to set training to True for a model with BN layers?
What does training=False mean? I thought it was related to back-propagation.
Thanks in advance!
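There are 4 parameters per channel in a BN layer: 2 are trainable (the scale factor gamma and the offset beta), and the other 2 are the moving mean and moving variance of the input features (for this BN layer).
Therefore:
Generally, we set training=True during the training procedure.
However, when it comes to transfer learning, it's optional; that is, both True and False are acceptable. The former lets the BN layers normalize with the statistics of the new data and update their moving averages, while the latter uses the statistics the BN layers learned on the previous data set.
training=False means the BN layer normalizes with its stored moving mean and variance and does not update them. (Whether the scale and offset are updated is controlled separately by the trainable attribute.) When testing, it's necessary to set training=False; otherwise the test batches would be normalized with their own statistics and would update the moving averages, leaking test data into the model and making the test accuracy unreliable.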
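To make the two modes concrete, here is a rough NumPy re-implementation of what a BN layer does for 2-D inputs. `TinyBatchNorm` is a hypothetical class written for illustration, not the Keras source; the moving-average update follows the formula given in the Keras documentation.

```python
import numpy as np

class TinyBatchNorm:
    """Simplified batch normalization over the batch axis of a 2-D input."""

    def __init__(self, num_features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(num_features)        # trainable scale
        self.beta = np.zeros(num_features)        # trainable offset
        self.moving_mean = np.zeros(num_features)  # non-trainable statistic
        self.moving_var = np.ones(num_features)    # non-trainable statistic
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Normalize with the current batch and update the moving averages.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Normalize with the stored statistics and leave them untouched.
            mean, var = self.moving_mean, self.moving_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

bn = TinyBatchNorm(num_features=2)
batch = np.array([[10.0, 0.0], [12.0, 4.0]])

out_train = bn(batch, training=True)       # updates moving_mean / moving_var
mean_after_train = bn.moving_mean.copy()
out_infer = bn(batch, training=False)      # statistics stay as they are
mean_after_infer = bn.moving_mean.copy()
print(mean_after_train, mean_after_infer)
```

Note that gamma and beta never change inside the forward pass in either mode; they only change through gradient updates, which is what trainable controls.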
I am new to transfer learning (TensorFlow 2.x). In the tutorials of a course we used weights=None, which means we are randomly initializing the weights. The same tutorial also said to set layer.trainable = False. So, my question is: how will our model learn? Any help is appreciated. Thank you.
Also, I am using InceptionV3.
The model will not learn. The weights will be initialized randomly, and since the layers in the Inception model are set to be non-trainable, they will stay frozen at their initial values. I recommend you set weights="imagenet". This makes use of the pre-trained weights the model learned from the ImageNet data set. Set include_top=False and pooling='max'. Then add a dense layer. The number of nodes in the dense layer should be equal to the number of classes you have.
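A quick way to see what can and cannot learn is to look at trainable_weights. The sketch below assumes a tiny randomly initialized network as a stand-in for InceptionV3(weights=None), purely so it runs fast; the names are hypothetical.

```python
import tensorflow as tf

# Stand-in for a randomly initialized, frozen base model (hypothetical).
base = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4, activation="relu"),
])
base.trainable = False

# Frozen base on its own: nothing is left for the optimizer to update.
base_trainable = len(base.trainable_weights)

# Adding a new dense head gives the model something to train.
inputs = tf.keras.Input(shape=(8,))
features = base(inputs, training=False)
outputs = tf.keras.layers.Dense(3, activation="softmax")(features)
model = tf.keras.Model(inputs, outputs)

head_trainable = len(model.trainable_weights)      # head kernel + bias
frozen_in_model = len(model.non_trainable_weights)  # base kernel + bias
print(base_trainable, head_trainable, frozen_in_model)
```

With random frozen features the head alone can still fit something, but it has far less to work with than on top of pre-trained weights.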
I am using transfer learning and keras.applications.InceptionV3. I managed to train the model successfully.
However, when I want to generate "activation maximisation" images (e.g. the input image that maximizes the activation of one of the custom classes, ref. e.g. https://arxiv.org/pdf/1512.02017v3.pdf ) I struggle to use the pre-trained model, since I do not manage to use it in "fit" mode with all dropouts etc. disabled.
What I do is that I combine the pre-trained model in a tf.keras.Sequential to do gradient descent on the weights of the first layer (the input image).
Despite setting base_model.trainable = False, however, it seems as if the pre-trained model is put into training mode (although its weights are not updated) when calling model.fit(data) on the outer Sequential model.
Is there any way to force the base_model (a child of a Sequential) to be in "predict" mode when calling fit on the outer?
I just came across the same question. After reading some documentation and looking at the source code of TensorFlow's implementations of tf.keras.layers.Layer, tf.keras.layers.Dense, and tf.keras.layers.BatchNormalization, I came to the following understanding.
If training=False is passed when calling the layer, it will run in inference mode. This has nothing to do with the attribute trainable, which means something different. It would probably cause less confusion if it had been called training_mode instead.
When doing transfer learning or fine-tuning, training=False should be passed when calling the base model itself. As far as I have seen so far, this only affects layers like tf.keras.layers.Dropout and tf.keras.layers.BatchNormalization and has no effect on other layers.
Running in inference mode via training=False makes tf.keras.layers.Dropout apply no dropout at all.
As tf.keras.layers.Dropout has no trainable weights, setting the attribute trainable = False will have no effect at all,