Should I delete last 7 layers of VGG16 as I am going to use it as a pretrained model for a signature verification task? - tensorflow

As far as I know, a CNN's last layers identify objects as a whole, which is irrelevant to a dataset of signatures. So I want to remove them and add new layers on top of the model, with the VGG16 weights frozen. How would removing those layers affect the model's performance, or should I leave them and delete only the dense layers?
I need to add layers on top anyway, for a school report on how the configuration of convolutional layers affects a model's performance.
P.S. My dataset is really small: about 700 samples, which I know is extremely small (I tried augmenting the data).
I also have a dataset of Chinese signatures, but I thought it would be better to train on it separately.
I am not proficient in this field and deep learning is where I started, so please correct me if you notice any misconception in my explanation.

The easiest way is to use VGG16 with include_top=False, weights='imagenet', and pooling='max'. This instantiates the model with ImageNet weights and without the fully connected classifier at the top, and the max pooling makes the output of the VGG model a flat vector you can feed directly into a dense layer. My typical code for this is shown below. In the final layer, class_count is the number of classes in the training data.
import tensorflow as tf
from tensorflow.keras import Model, regularizers
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout

# img_shape (e.g. (224, 224, 3)) and class_count are defined by your own data.
base_model = tf.keras.applications.VGG16(include_top=False, weights="imagenet", input_shape=img_shape, pooling='max')
x = base_model.output
x = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001)(x)
x = Dense(256, kernel_regularizer=regularizers.l2(0.016), activity_regularizer=regularizers.l1(0.006),
          bias_regularizer=regularizers.l1(0.006), activation='relu')(x)
x = Dropout(rate=.45, seed=123)(x)
output = Dense(class_count, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=output)

How would the removal of layers potentially affect the model's performance, or should I just leave and delete only dense layers?
This is hard to answer, because it depends on which performance you mean. VGG16 was originally built for the ImageNet problem with 1000 classes, so used without any modification it probably won't work at all on your task.
If you are talking about transfer learning, then yes, the last dense layers can be replaced to classify your dataset, because the convolutional layers of VGG16 are a good pattern recognizer. The fully connected layers at the end act as a classifier for those patterns, and you should replace them and train them again for your specific problem. VGG16 has three dense layers at the end (FC1, FC2 and FC3). With include_top=False, Keras removes all three, so if you want to replace just the last one, you need to remove all three and rebuild FC1 and FC2.
The key is what you are going to train after that. You could:
Use the original (ImageNet) weights in the convolutional layers and start your training from there, fine-tuning with a small learning rate. A good choice when your dataset is similar to the original one and you have a good amount of data.
Use the original (ImageNet) weights in the convolutional layers, but freeze them, and train only the weights of the dense layers you replaced. A good choice when your dataset is small.
Don't use the original weights and retrain the whole model. Usually not a good choice, because it takes expertise in tuning the parameters, tons of data, and a lot of computational power to make it work.
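To make the question's idea concrete, here is a minimal sketch of cutting VGG16 at an earlier block instead of only dropping the dense layers. The cut point `block4_pool`, the head sizes, the single sigmoid output and the helper name `build_truncated_vgg16` are illustrative assumptions, not something from the thread. `weights=None` is used only so the snippet runs without a download; for transfer learning you would pass `weights="imagenet"`.

```python
import tensorflow as tf


def build_truncated_vgg16(cut_layer="block4_pool", input_shape=(224, 224, 3), weights=None):
    """Frozen VGG16 convolutional base cut at `cut_layer`, plus a small trainable head."""
    base = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # freeze every VGG16 layer

    # Take the features from an intermediate block instead of the last one.
    features = base.get_layer(cut_layer).output
    x = tf.keras.layers.GlobalMaxPooling2D()(features)
    x = tf.keras.layers.Dense(128, activation="relu")(x)  # hypothetical head size
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # assumed binary output
    return tf.keras.Model(inputs=base.input, outputs=outputs)


model = build_truncated_vgg16()
layer_names = [layer.name for layer in model.layers]
print(model.output_shape)
print(len(model.trainable_weights), "trainable weight tensors (the new head only)")
```

Changing `cut_layer` (for example to `block3_pool` or `block5_pool`) and comparing validation scores is one way to measure how the removed layers affect performance on your own data.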

Related

Is validation curve slight greater or lower in CNN models good?

Can you tell me which of the two is the better validation vs. training plot?
Both models use the same Keras Sequential layers, but the second one was trained on more samples, i.e. on the augmented dataset.
I'm a little confused by the zigzags in the first plot; otherwise I think it is better than the second.
In the second plot there are no zigzags, but the validation accuracy tends to be a little higher than the training accuracy. Is that overfitting, or is it acceptable?
It is an image detection model; the first model's dataset had 5170 samples and the second had 9743.
The convolutional layers defined for the model building:
tf.keras.layers.Conv2D(128,(3,3), activation = 'relu', input_shape = (150,150,3)),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Conv2D(64,(3,3), activation = 'relu'),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Conv2D(32,(3,3), activation = 'relu'),
tf.keras.layers.MaxPool2D(2,2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(512,activation='relu'),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(128,activation='relu'),
tf.keras.layers.Dropout(0.25),
tf.keras.layers.Dense(1,activation='sigmoid')
Can the model be improved?
Judging from the graphs, the second one, where you have more samples, is better. The reason is that with more samples the model is trained on a much wider sample of the image distribution, so when validation is run you have a better chance of correctly classifying each image. You have a lot of dropout in your model. This is good for preventing overfitting; however, because dropout is only active during training, it lowers the training accuracy relative to the validation accuracy. Your model seems to be doing well. It might improve if you add more convolution and max pooling layers. The alternative, of course, is to use transfer learning; I would recommend EfficientNetB3. I also recommend using an adjustable learning rate. The Keras callback ReduceLROnPlateau works well for that purpose and is described in the Keras callbacks documentation. The code below shows my recommended settings.
rlronp = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    verbose=1,
    mode='auto'
)
In model.fit, include callbacks=[rlronp].
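To see what those settings do, here is a simplified pure-Python simulation of the plateau rule: when the monitored loss has not improved for `patience` epochs, the learning rate is multiplied by `factor`. It is a sketch of the idea only; the real callback also has `min_delta` and `cooldown` options that are ignored here, and the function name and the loss values are made up for illustration.

```python
def simulate_reduce_on_plateau(val_losses, lr=1e-3, factor=0.5, patience=2, min_lr=0.0):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the learning rate.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation losses: the loss stalls at epochs 3 and 4.
lr_history = simulate_reduce_on_plateau([1.0, 0.9, 0.95, 0.93, 0.92, 0.8])
print(lr_history)
```

With `patience=2`, the rate is halved after the second consecutive epoch without a new best loss.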

How to freeze batch-norm layers during Transfer-learning

I am following the Transfer learning and fine-tuning guide on the official TensorFlow website. It points out that during fine-tuning, batch normalization layers should be in inference mode:
Important notes about BatchNormalization layer
Many image models contain BatchNormalization layers. That layer is a
special case on every imaginable count. Here are a few things to keep
in mind.
BatchNormalization contains 2 non-trainable weights that get updated during training. These are the variables tracking the mean and variance of the inputs.
When you set bn_layer.trainable = False, the BatchNormalization layer will run in inference mode, and will not update its mean & variance statistics. This is not the case for other layers in general, as weight trainability & inference/training modes are two orthogonal concepts. But the two are tied in the case of the BatchNormalization layer.
When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing training=False when calling the base model. Otherwise the updates applied to the non-trainable weights will suddenly destroy what the model has learned.
You'll see this pattern in action in the end-to-end example at the end
of this guide.
However, some other sources, for example this article (titled Transfer Learning with ResNet), say something completely different:
for layer in resnet_model.layers:
    if isinstance(layer, BatchNormalization):
        layer.trainable = True
    else:
        layer.trainable = False
Anyway, I know that there is a difference between the training argument and the trainable attribute in TensorFlow.
I am loading my model from a file, like so:
model = tf.keras.models.load_model(path)
And I am unfreezing (or actually freezing the rest) some of the top layers in this way:
model.trainable = True
for layer in model.layers:
    if layer not in model.layers[idx:]:
        layer.trainable = False
NOW about batch normalization layers: I can either do:
for layer in model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False
or
for layer in model.layers:
    if layer.name.startswith('bn'):
        layer.call(layer.input, training=False)
Which one should I do? And, in the end, is it better to freeze the batch norm layers or not?
I'm not sure about the training vs. trainable difference, but personally I've gotten good results by setting trainable = False.
As for whether to freeze them in the first place: I've had good results with not freezing them. The reasoning is simple: a batch norm layer learns the moving averages of the original training data, which may be cats, dogs, humans, cars, etc. When you're transfer learning, you could be moving to a completely different domain, and the moving averages of the new domain's images can be far from those of the prior dataset.
By unfreezing those layers and freezing the CNN layers, my model gained 6-7 percentage points of accuracy (82 -> 89% or so). My dataset was very different from the ImageNet dataset that EfficientNet was originally trained on.
P.S. Depending on how you plan to run the model after training, I would advise you to freeze the batch norm layers once the model is trained. For some reason, when I ran the model online (one image at a time), the batch norm layers would misbehave and give irregular results. Freezing them after training fixed the issue for me.
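On the training vs. trainable question, the behaviour described in the guide quoted in the question can be checked directly on a single layer. The sketch below uses a standalone BatchNormalization layer and random data (both are assumptions for illustration): with trainable left at True, a call with training=True updates the moving mean; after setting trainable = False, the same call leaves it untouched.

```python
import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
# Random batch with a clearly non-zero mean, so that updates are visible.
x = tf.constant(np.random.default_rng(0).normal(5.0, 2.0, size=(32, 4)), dtype=tf.float32)

# Trainable layer called in training mode: moving statistics are updated.
bn(x, training=True)
mean_after_training_call = bn.moving_mean.numpy().copy()

# Frozen layer: it runs in inference mode even though training=True is passed.
bn.trainable = False
bn(x, training=True)
mean_after_frozen_call = bn.moving_mean.numpy().copy()

print(mean_after_training_call)
print(mean_after_frozen_call)
```

So `layer.trainable = False` is enough to stop a batch norm layer from updating its statistics; there is no need to call `layer.call(...)` by hand.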
Use the code below to see whether the batch norm layers are frozen or not. It prints each layer's name along with whether the layer is trainable.
def print_layer_trainable(conv_model):
    for layer in conv_model.layers:
        print("{0}:\t{1}".format(layer.trainable, layer.name))
In my case, I tested your method, but it did not freeze my model's batch norm layers:
for layer in model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False
The code below worked nicely for me. In my case the model is a ResNetV2 and the batch norm layers are named with the suffix "preact_bn". By using the code above for printing layers, you can see how the batch norm layers are named and configure them as you want.
for layer in new_model.layers[:]:
    if 'preact_bn' in layer.name:
        trainable = False
    else:
        trainable = True
    layer.trainable = trainable
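As a type-based alternative to matching on layer names, the sketch below toggles every top-level batch norm layer and reports how many it touched. The helper name `set_batchnorm_trainable` and the toy model are hypothetical; the loop only looks at `model.layers`, so batch norm layers inside a nested sub-model would need a separate pass.

```python
import tensorflow as tf


def set_batchnorm_trainable(model, trainable):
    """Set `trainable` on each top-level BatchNormalization layer; return how many were found."""
    count = 0
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.trainable = trainable
            count += 1
    return count


# Toy model standing in for a real network.
toy = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(2),
    tf.keras.layers.BatchNormalization(),
])

frozen = set_batchnorm_trainable(toy, False)
for layer in toy.layers:
    print("{0}:\t{1}".format(layer.trainable, layer.name))
```

If the returned count is zero, the isinstance check did not match any layer, which is worth verifying before assuming the freeze worked.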
Just to add to #luciano-dourado's answer:
In my case, I started by following the Transfer Learning guide as is, that is, freezing the BN layers throughout the entire training (classifier + fine-tuning).
What I saw is that training the classifier worked without problems, but as soon as I started fine-tuning, the loss went to NaN after a few batches.
After running the usual checks (input data without NaNs, loss functions yielding correct values, etc.), I checked whether the BN layers were running in inference mode (trainable = False).
But in my case, the dataset was so different from ImageNet that I needed to do the opposite and set the trainable attribute of all BN layers to True. I found this empirically, just as #zwang commented. Just remember to freeze them after training, before you deploy the model for inference.
By the way, as an informative note, ResNet50V2, for example, has a total of 49 BN layers, of which only 16 are pre-activation BNs. This means that the remaining 33 layers were updating their mean and variance values.
Yet another case where one has to run several empirical tests to find out why the "standard" approach does not work in one's own case. I guess this further reinforces the importance of data in Deep Learning :)
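One possible explanation for the irregular single-image results mentioned in the answers above can be shown with plain NumPy. This is a simplified sketch, not Keras code: it normalizes per feature over the batch axis only and leaves out the learned scale and offset, and the statistics and input values are made up. In training mode a batch of one is normalized with its own statistics and collapses to zero, while inference mode uses the stored moving statistics.

```python
import numpy as np


def batch_norm(x, moving_mean, moving_var, training, eps=1e-3):
    """Simplified batch norm over the batch axis, without learned scale and offset."""
    if training:
        # Training mode: statistics come from the current batch.
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Inference mode: statistics come from the stored moving averages.
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps)


moving_mean = np.array([5.0, -2.0])  # hypothetical statistics learned during training
moving_var = np.array([4.0, 1.0])
single_sample = np.array([[7.0, -1.0]])  # "online" prediction: a batch of one

train_mode = batch_norm(single_sample, moving_mean, moving_var, training=True)
infer_mode = batch_norm(single_sample, moving_mean, moving_var, training=False)
print(train_mode)
print(infer_mode)
```

Whether this is what happened in those answers is an assumption, but it shows why batch norm layers should run in inference mode at deployment time.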

Purpose of pooling layer after text embedding layer

I'm following the tutorial on the TensorFlow site (https://www.tensorflow.org/tutorials/text/word_embeddings#create_a_simple_model) to learn word embeddings, and I am confused about the purpose of having a GlobalAveragePooling1D layer right after the embedding layer, as follows:
model = keras.Sequential([
    layers.Embedding(encoder.vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)
])
I understand what pooling means and how it's done. If someone can explain why we need a pooling layer, and what would change if we didn't use it, I'd appreciate it.
The purpose of this tutorial is to get you to understand word embeddings through a simple toy task: binary sentiment analysis.
To start with, they have you code a simple model: take the average of all the embeddings in a sentence and add a feed-forward neural net to classify this aggregated input. GlobalAveragePooling1D does this averaging.
Obviously, in the real world you'd want to use more complex models such as RNNs, LSTMs, bidirectional models, atrous-convolution-based models or Transformers, but that's not the point of this tutorial.
The "simple model" they mention is a feed-forward neural net, so it expects a fixed input dimension. When you have sequential data of variable length you need to address this somehow: averaging, padding, cropping, etc. Here they average with this GlobalAveragePooling1D layer.
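The averaging described above can be checked on a small tensor. The shapes below (2 sentences, 5 tokens, 8 embedding dimensions) are arbitrary choices for illustration, and no masking is used: the layer turns a (batch, sequence, embedding) tensor into a (batch, embedding) tensor by taking the mean over the sequence axis.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the output of an Embedding layer: (batch, sequence_length, embedding_dim).
embedded = tf.random.normal((2, 5, 8), seed=0)

pooled = tf.keras.layers.GlobalAveragePooling1D()(embedded)
manual = tf.reduce_mean(embedded, axis=1)  # mean over the token axis

print(embedded.shape, "->", pooled.shape)
print(np.allclose(np.asarray(pooled), np.asarray(manual), atol=1e-6))
```

Without the pooling layer, the Dense layers would be applied to every token separately and the model would produce one output per token instead of one per sentence.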

What is the difference between conv1d with kernel_size=1 and dense layer?

I am building a CNN with Conv1D layers, and it trains pretty well. I'm now looking into how to reduce the number of features before feeding them into a Dense layer at the end of the model, so I've been reducing the size of the Dense layer, but then I came across this article. The article talks about the effect of using Conv2D filters with kernel_size=(1,1) to reduce the number of features.
I was wondering what the difference is between using a Conv2D layer with kernel_size=(1,1) tf.keras.layers.Conv2D(filters=n,kernel_size=(1,1)) and using a Dense layer of the same size tf.keras.layers.Dense(units=n)? From my perspective (I'm relatively new to neural nets), a filter with kernel_size=(1,1) is a single number, which is essentially equivalent to weight in a Dense layer, and both layers have biases, so are they equivalent, or am I misunderstanding something? And if my understanding is correct, in my case where I am using Conv1D layers, not Conv2D layers, does that change anything? As in is tf.keras.layers.Conv1D(filters=n, kernel_size=1) equivalent to tf.keras.layers.Dense(units=n)?
Please let me know if you need anything from me to clarify the question. I'm mostly curious about if Conv1D layers with kernel_size=1 and Conv2D layers with kernel_size=(1,1) behave differently than Dense layers.
Yes. Since a Dense layer is applied on the last dimension of its input (see this answer), so that the same weights are used at every position, Dense(units=N) and Conv1D(filters=N, kernel_size=1) (or Dense(units=N) and Conv2D(filters=N, kernel_size=1)) are basically equivalent to each other, both in terms of connections and in number of trainable parameters.
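A quick numerical check of that equivalence, as a sketch with arbitrary sizes (batch of 2, 10 steps, 6 input channels, 4 units): after copying the convolution kernel into the Dense layer, both layers give the same output and have the same parameter count.

```python
import numpy as np
import tensorflow as tf

x = np.random.default_rng(0).normal(size=(2, 10, 6)).astype("float32")

conv = tf.keras.layers.Conv1D(filters=4, kernel_size=1)
dense = tf.keras.layers.Dense(units=4)

conv_out = conv(x)  # first call builds the layer; kernel shape (1, 6, 4)
dense(x)            # first call builds the layer; kernel shape (6, 4)

# Copy the conv weights into the dense layer, dropping the kernel-width axis of size 1.
kernel, bias = conv.get_weights()
dense.set_weights([kernel[0], bias])
dense_out = dense(x)

print(conv.count_params(), dense.count_params())
print(np.allclose(np.asarray(conv_out), np.asarray(dense_out), atol=1e-5))
```

The two layers differ only in how the kernel is stored, (1, in, out) versus (in, out).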
In a 1D CNN, the kernel moves in 1 direction. The input and output data of a 1D CNN are 2-dimensional (per sample, not counting the batch axis). Mostly used on time-series data, natural language processing tasks, etc. Definitely gonna see people using it in Kaggle NLP competitions and notebooks.
In a 2D CNN, the kernel moves in 2 directions. The input and output data of a 2D CNN are 3-dimensional. Mostly used on image data. Definitely gonna see people using it in Kaggle CNN image processing competitions and notebooks.
In a 3D CNN, the kernel moves in 3 directions. The input and output data of a 3D CNN are 4-dimensional. Mostly used on 3D image data (MRI, CT scans). Haven't personally seen an applied version in competitions.
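The dimension counts above can be read off the tensor shapes. In this sketch the input sizes and the filter count are arbitrary; each shape has one extra leading axis for the batch and a trailing axis for the channels.

```python
import numpy as np
import tensorflow as tf

# (batch, steps, channels): 2 dimensions per sample
out1d = tf.keras.layers.Conv1D(8, 3)(np.zeros((1, 20, 3), dtype="float32"))
# (batch, height, width, channels): 3 dimensions per sample
out2d = tf.keras.layers.Conv2D(8, 3)(np.zeros((1, 20, 20, 3), dtype="float32"))
# (batch, depth, height, width, channels): 4 dimensions per sample
out3d = tf.keras.layers.Conv3D(8, 3)(np.zeros((1, 10, 10, 10, 1), dtype="float32"))

shape1d, shape2d, shape3d = tuple(out1d.shape), tuple(out2d.shape), tuple(out3d.shape)
print(shape1d, shape2d, shape3d)
```

With the default "valid" padding, a kernel of size 3 shortens each axis it slides along by 2.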

Is my training data set too complex for my neural network?

I am new to machine learning and Stack Overflow, and I am trying to interpret two graphs from my regression model.
Training error and Validation error from my machine learning model
My case is similar to the one in "Very large loss values when training multiple regression model in Keras", but my MSE and RMSE are very high.
Is my model underfitting? If yes, what can I do to solve this problem?
Here is my neural network I used for solving a regression problem
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
As for my data set, I have 500 samples, 10 features and 1 target.
Quite the opposite: it looks like your model is overfitting. A low error on your training set means that your model has learned the training data well and can reproduce its results accurately. If your validation error is high afterwards, however, that means the information learned from your training data is not being applied successfully to new data. This is because your model has 'fit' your training data too closely and has only learned to predict well when the prediction is based on that data.
To solve this, we can apply common techniques for reducing overfitting. A very common one is to use Dropout layers. They randomly remove some of the nodes during training so that the model cannot rely on them too heavily, which reduces the dependency on those nodes and forces the model to 'learn' with the other nodes too. I've included an example below that you can test; try playing with the value and with other techniques to see what works best. And as a side note: are you sure that you need that many nodes in your dense layers? It seems like quite a lot for your data set, and that may be contributing to the overfitting too.
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dropout(0.2),  # randomly zeroes 20% of the previous layer's outputs during training
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
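For intuition about what a Dropout layer does, here is a NumPy sketch of the usual "inverted dropout" rule. It is an illustration, not the Keras implementation, and the function name and input values are made up: during training each activation is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate); at inference the input passes through unchanged.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of the entries and rescale the rest."""
    if not training or rate == 0.0:
        return x  # inference: dropout does nothing
    keep_mask = rng.random(x.shape) >= rate  # True where the activation survives
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((4, 5))

train_out = dropout(activations, rate=0.2, training=True, rng=rng)
infer_out = dropout(activations, rate=0.2, training=False, rng=rng)
print(train_out)
print(infer_out)
```

Because the layer is only active during training, the training error is measured on a handicapped network while the validation error is not.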
Well, I think your model is overfitting.
There are several things that can help you:
1- Reduce the network's capacity, which you can do by removing layers or reducing the number of elements in the hidden layers
2- Dropout layers, which randomly remove certain features by setting them to zero
3- Regularization
To give a brief explanation of each:
- Reduce the network's capacity:
Some models have a large number of trainable parameters. The higher this number, the more easily the model can memorize the target for each training sample. Obviously, this is not ideal for generalizing to new data. By lowering the capacity of the network, you force it to learn the patterns that matter, i.e. the ones that minimize the loss. But remember, reducing the network's capacity too much will lead to underfitting.
- Regularization:
This page can help you a lot:
https://towardsdatascience.com/handling-overfitting-in-deep-learning-models-c760ee047c6e
- Dropout layer:
You can add a layer like this:
model.add(layers.Dropout(0.5))
This is a dropout layer that sets each input to zero with a probability of 50% during training.
For more details you can see this page:
https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/
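To put a number on the capacity point above, this small helper counts the weights and biases of a stack of Dense layers. The question's network (10 features into 128, 64 and 1 units) has 9,729 parameters for 500 samples; the smaller 16-8-1 layout is only a hypothetical comparison, not a recommendation from the thread.

```python
def dense_param_count(n_inputs, layer_units):
    """Total number of weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus one bias per unit
        fan_in = units
    return total


original = dense_param_count(10, [128, 64, 1])  # the network from the question
smaller = dense_param_count(10, [16, 8, 1])     # hypothetical reduced network
print(original, smaller)
```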
As mentioned in the existing answer by #omoshiroiii, your model does in fact seem to be overfitting, and that is why the RMSE and MSE are too high.
Your model has learned the detail and noise in the training data to the extent that it now hurts the model's performance on new data.
One remedy is therefore dropout: randomly removing some of the nodes during training so that the model cannot rely on them too heavily.