Unable to understand Transfer Learning with Vgg16 - tensorflow

So, I have to work with VGG16 for my semester group project, and I was following this tutorial to do transfer learning. I don't understand CNNs much, but I am currently learning.
The very first problem was that VGG16 is supposed to have 16 layers, whereas base_model.summary() shows 26 layers when initialised with VGGFace(include_top=True) and 19 with VGGFace(include_top=False). It looks like the 16 layers are the ones with weights.
Now, the tutorial uses include_top=False and does:
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
x = Dense(1024, activation='relu')(x)
x = Dense(512, activation='relu')(x)
preds = Dense(NO_CLASSES, activation='softmax')(x)
model = Model(base_model.input, preds)
As far as I understand, we take the output of base_model and add 5 layers to it: 1 GlobalAveragePooling2D and 4 Dense layers.
My question is: why does this modify the VGG16 layer structure? Why do we need to replace the last 7 layers with 5 different layers? Couldn't we set the same 7 layers as trainable, or just add identical layers? What is the actual advantage of this replacement?
Before replacement: (layer list from base_model.summary(), not reproduced here)
After replacement: 'global_average_pooling2d_11', 'dense_42', 'dense_43', 'dense_44', 'dense_45'
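A quick way to see where these counts come from is to list the layers and check which ones actually carry weights. The sketch below uses the stock tf.keras.applications.VGG16 rather than the VGGFace variant, so the exact totals may differ from the 26/19 above, but the "16" should match the number of layers with weights:
# Inspection sketch using the stock Keras VGG16 (not keras_vggface); layer
# counts for the VGGFace model may differ because of its extra top layers.
from tensorflow.keras.applications import VGG16

for include_top in (True, False):
    model = VGG16(weights=None, include_top=include_top)
    layers_with_weights = [l.name for l in model.layers if l.weights]
    print(f"include_top={include_top}: {len(model.layers)} layers in summary, "
          f"{len(layers_with_weights)} layers with weights")
    # Input, pooling and flatten layers carry no weights, which is why
    # summary() lists more layers than the 16 the name refers to.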

Related

Text classification CNN nan loss

I'm attempting to train a neural network for text classification (sarcasm detection on reddit comments). I have the comment itself, its parent comment, and the subreddit in which these comments were made.
I engineer features such as the positivity, negativity, and neutrality of the comment, the same for the parent comment, and engineer 2 more numerical features based on the subreddit. This is a total of 8 engineered features.
The neural network I use is structured as follows: I run a bunch of filters over my (word-embedding-translated) comment and a bunch of filters over my (word-embedding-translated) parent comment, using a 1D CNN. I subsequently do some pooling and feed the outputs to a fully connected layer.
This works fine.
However, I wish to add those 8 engineered features to the inputs of the fully connected layer. When I do so, the loss decreases (rather quickly) at first, but then suddenly turns to nan and the accuracy drops.
Below is the code within my function that produces the model:
embedding_layer = Embedding(num_words,
                            embedding_dim,
                            weights=[embeddings],
                            input_length=max_sequence_length,
                            trainable=False)
parent_embedding_layer = Embedding(parent_num_words,
                                   embedding_dim,
                                   weights=[parent_embeddings],
                                   input_length=max_sequence_length,
                                   trainable=False)

sequence_input = Input(shape=(2, 205))

# Some tensor manipulation that slices the combined input into:
#   comment_sequence_input  := length-200 tensor of words in the comment
#   parent_sequence_input   := length-200 tensor of words in the parent comment
#   non_text_comment_input  := length-5 tensor with 5 engineered features
#   non_text_parent_input   := length-5 tensor with 5 engineered features

embedded_sequences = embedding_layer(comment_sequence_input)
parent_embedded_sequences = parent_embedding_layer(parent_sequence_input)

convs = []

# Convolutions
for filter_size in filter_sizes:
    l_conv = Conv1D(filters=filters, kernel_size=filter_size,
                    activation='relu')(embedded_sequences)
    l_pool = GlobalMaxPooling1D()(l_conv)
    convs.append(l_pool)
for filter_size in parent_filter_sizes:
    parent_l_conv = Conv1D(filters=parent_filters, kernel_size=filter_size,
                           activation='relu')(parent_embedded_sequences)
    parent_l_pool = GlobalMaxPooling1D()(parent_l_conv)
    convs.append(parent_l_pool)
# End of convolutions

# Inclusion of engineered features
convs.append(non_text_comment_input)
convs.append(non_text_parent_input)
l_merge = concatenate(convs, axis=1)
# End of section

x = Dropout(0.30)(l_merge)

# Fully connected layers
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
preds = Dense(labels_index, activation='sigmoid')(x)

adam_optimizer = Adam(learning_rate=0.000001)

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',
              optimizer=adam_optimizer,
              metrics=['acc'])
model.summary()
return model
The input provided is a tensor of shape (batch_dim, 2, 205): 200 for the number of words in each entry, and 5 for the 3 sentiment features and 2 subreddit features. The 2 in the middle dimension is for the comment and the parent comment.
Things I have tried:
Lowering the learning rate
Adding dropout
Normalising inputs
Normalising inputs and scaling them down by a factor of 100
Different optimizers (RMSprop, Nadam, etc)
Adding regularization (bias reg, kernel reg, activity reg) of various levels to my layers.
I tried testing this by setting all my engineered feature values to 0, and the network trains perfectly fine. Not sure what I should do here.
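One thing worth checking (not something the poster mentions, just a common culprit for a loss that suddenly becomes nan) is whether any of the engineered features contain non-finite values or wildly different scales before they are concatenated with the pooled CNN outputs. A minimal sketch, assuming the features sit in a hypothetical NumPy array called engineered_features:
import numpy as np

# engineered_features is a hypothetical (num_samples, 8) array holding the
# sentiment features and the subreddit features described above.
engineered_features = np.asarray(engineered_features, dtype=np.float32)

# A single inf/NaN in the inputs is enough to turn the loss into nan.
print("non-finite values:", np.count_nonzero(~np.isfinite(engineered_features)))
print("per-feature min:", engineered_features.min(axis=0))
print("per-feature max:", engineered_features.max(axis=0))

# Standardize each feature; the small epsilon guards against zero variance.
mean = engineered_features.mean(axis=0)
std = engineered_features.std(axis=0) + 1e-8
engineered_features = (engineered_features - mean) / std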

Masking before Conv1D in keras

I want to mask zeros before a Conv1D layer, but this is not supported. Do you have a solution for this problem?
Here's a piece of my code:
# Defining the neural network architecture. This contains five TDNN layers,
# statistics layer, two fully connected layers followed by the softmax output.
input_shape_data = (1865, 20)
main_input = Input(shape=input_shape_data, dtype='float', name='main_input')
u = Masking(mask_value=0.)(main_input)
x1 = Conv1D(512, kernel_size=5, strides=1, padding='same', dilation_rate=1,
            activation='relu', kernel_initializer='glorot_uniform',
            bias_initializer='zeros')(u)
x1 = BatchNormalization()(x1)
Thank you in advance
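Since Conv1D ignores the mask produced by Masking, one possible workaround (a sketch of my own, not something from the original post) is to compute a binary mask from the zero-padded input and re-apply it to the convolution output, so the padded timesteps stay at zero for the layers that follow:
from keras.layers import Input, Conv1D, BatchNormalization, Lambda
import keras.backend as K

input_shape_data = (1865, 20)
main_input = Input(shape=input_shape_data, dtype='float', name='main_input')

# 1.0 where the timestep has any non-zero feature, 0.0 where it is pure padding
mask = Lambda(lambda t: K.cast(K.any(K.not_equal(t, 0.), axis=-1, keepdims=True),
                               'float32'))(main_input)

x1 = Conv1D(512, kernel_size=5, strides=1, padding='same', dilation_rate=1,
            activation='relu', kernel_initializer='glorot_uniform',
            bias_initializer='zeros')(main_input)
x1 = Lambda(lambda t: t[0] * t[1])([x1, mask])  # zero out padded timesteps again
x1 = BatchNormalization()(x1)
Note that BatchNormalization statistics will still include the zeroed positions; this only keeps padding from leaking through the convolution.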

How is my model working if all the base layer trainables are set to false?

This is the model I make for my deep learning project, and I am getting decent accuracy out of it. My question is: if I froze the weights of the initial model (which is my VGG19 base model), how did I manage to train the whole model? Also, after adding the VGG19 layers with their weights frozen, I got better results than I achieved with only a few CNN layers of my own. Could it be because the weights of VGG19 were initialized into my CNN layers?
img_h = 224
img_w = 224

initial_model = applications.vgg19.VGG19(weights='imagenet', include_top=False,
                                         input_shape=(img_h, img_w, 3))
last = initial_model.output

for layer in initial_model.layers:
    layer.trainable = False

x = Conv2D(128, kernel_size=3, strides=1, activation='relu')(last)
x = Conv2D(64, kernel_size=3, strides=1, activation='relu')(x)
x = Flatten()(x)
x = Dense(512, activation='relu')(x)
x = Dense(256, activation='relu')(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.1)(x)
preds = Dense(2, activation='sigmoid')(x)
"Freezing the layers" just means you don't update the weights on those layers when you backpropagate the error. Therefore, you'll just update the weights on those layers that are not frozen, which enables your neural net to learn.
You are adding some layers after VGG. I don't know if this is a common approach, but it makes sense that it works reasonably well, assuming you are interpreting your metrics correctly.
Your VGG has already been pre-trained on ImageNet, so it's a pretty good baseline for many use-cases. You are basically using VGG as your encoder. Then, on the output of this encoder (which we can call latent representation of your input), you train a neural net.
I would also try out more mainstream transfer-learning techniques, where you gradually unfreeze layers starting from the end, or use a gradually smaller learning rate.
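A minimal sketch of what gradual unfreezing could look like for the model above; the block name 'block5', the learning rate, and the loss are assumptions, since the original compile/fit steps aren't shown:
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Assumed to have been built from the layers above (not shown in the question):
model = Model(initial_model.input, preds)

# Unfreeze only the last convolutional block; VGG19 layer names look like
# 'block5_conv1', 'block5_conv2', ...
for layer in initial_model.layers:
    layer.trainable = layer.name.startswith('block5')

# Recompile with a much smaller learning rate before continuing training.
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss='binary_crossentropy',  # placeholder; the original compile step isn't shown
              metrics=['accuracy'])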

Batch normalization layer for CNN-LSTM

Suppose that I have a model like this (this is a model for time series forecasting):
ipt = Input((data.shape[1] ,data.shape[2])) # 1
x = Conv1D(filters = 10, kernel_size = 3, padding = 'causal', activation = 'relu')(ipt) # 2
x = LSTM(15, return_sequences = False)(x) # 3
x = BatchNormalization()(x) # 4
out = Dense(1, activation = 'relu')(x) # 5
Now I want to add a batch normalization layer to this network. Considering the fact that batch normalization doesn't work with LSTM, can I add it before the Conv1D layer? I think it's rational to have a batch normalization layer after the LSTM.
Also, where can I add Dropout in this network? The same places? (after or before batch normalization?)
What about adding AveragePooling1D between Conv1D and LSTM? Is it possible to add batch normalization between Conv1D and AveragePooling1D in this case without any effect on LSTM layer?
Update: the LayerNormalization implementation I was using was inter-layer, not recurrent as in the original paper; results with the latter may prove superior.
BatchNormalization can work with LSTMs - the linked SO gives false advice; in fact, in my application of EEG classification, it dominated LayerNormalization. Now to your case:
"Can I add it before Conv1D"? Don't - instead, standardize your data beforehand, else you're employing an inferior variant to do the same thing
Try both: BatchNormalization before an activation, and after - apply to both Conv1D and LSTM
If your model is exactly as you show it, BN after the LSTM may be counterproductive due to its ability to introduce noise, which can confuse the classifier layer - but this is about being one layer before the output, not about the LSTM itself
If you aren't using stacked LSTM with return_sequences=True preceding return_sequences=False, you can place Dropout anywhere - before LSTM, after, or both
Spatial Dropout: drop units / channels instead of random activations (see bottom); it was shown to be more effective at reducing coadaptation in CNNs in a paper by LeCun et al., with ideas applicable to RNNs. It can considerably increase convergence time, but also improve performance
recurrent_dropout is still preferable to Dropout for LSTM - however, you can do both; just do not use it with activation='relu', for which LSTM is unstable per a bug
For data of your dimensionality, any sort of Pooling is redundant and may harm performance; scarce data is better transformed via a non-linearity than simple averaging ops
I strongly recommend a SqueezeExcite block after your Conv; it's a form of self-attention - see paper; my implementation for 1D below
I also recommend trying activation='selu' with AlphaDropout and 'lecun_normal' initialization, per paper Self Normalizing Neural Networks
Disclaimer: above advice may not apply to NLP and embed-like tasks
Below is an example template you can use as a starting point; I also recommend the following SO's for further reading: Regularizing RNNs, and Visualizing RNN gradients
from keras.layers import Input, Dense, LSTM, Conv1D, Activation
from keras.layers import AlphaDropout, BatchNormalization
from keras.layers import GlobalAveragePooling1D, Reshape, multiply
from keras.models import Model
import keras.backend as K
import numpy as np

def make_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = ConvBlock(ipt)
    x = LSTM(16, return_sequences=False, recurrent_dropout=0.2)(x)
    # x = BatchNormalization()(x)  # may or may not work well
    out = Dense(1, activation='relu')(x)

    model = Model(ipt, out)
    model.compile('nadam', 'mse')
    return model

def make_data(batch_shape):  # toy data
    return (np.random.randn(*batch_shape),
            np.random.uniform(0, 2, (batch_shape[0], 1)))

batch_shape = (32, 21, 20)
model = make_model(batch_shape)
x, y = make_data(batch_shape)
model.train_on_batch(x, y)
Functions used:
def ConvBlock(_input):  # cleaner code
    x = Conv1D(filters=10, kernel_size=3, padding='causal', use_bias=False,
               kernel_initializer='lecun_normal')(_input)
    x = BatchNormalization(scale=False)(x)
    x = Activation('selu')(x)
    x = AlphaDropout(0.1)(x)
    out = SqueezeExcite(x)
    return out

def SqueezeExcite(_input, r=4):  # r == "reduction factor"; see paper
    filters = K.int_shape(_input)[-1]
    se = GlobalAveragePooling1D()(_input)
    se = Reshape((1, filters))(se)
    se = Dense(filters // r, activation='relu', use_bias=False,
               kernel_initializer='he_normal')(se)
    se = Dense(filters, activation='sigmoid', use_bias=False,
               kernel_initializer='he_normal')(se)
    return multiply([_input, se])
Spatial Dropout: pass noise_shape = (batch_size, 1, channels) to Dropout so that entire channels are dropped instead of individual activations; see the Git gist for code, and the minimal sketch below:
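For illustration, here is a minimal sketch of that noise_shape trick; the batch size, channel count, and dropout rate are arbitrary placeholders rather than values from the gist:
from keras.layers import Dropout, Input
from keras.models import Model

batch_size, timesteps, channels = 32, 21, 10   # arbitrary example sizes
ipt = Input(batch_shape=(batch_size, timesteps, channels))

# noise_shape=(batch_size, 1, channels) shares one keep/drop decision across all
# timesteps of each channel, i.e. whole channels are dropped ("spatial" dropout).
x = Dropout(0.2, noise_shape=(batch_size, 1, channels))(ipt)

model = Model(ipt, x)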

How to add a few layers before the model in transfer learning with tensorflow

I am trying to use transfer learning in TensorFlow. I know the high-level paradigm:
base_model = MobileNet(weights='imagenet', include_top=False)  # imports the MobileNet model and discards the last 1000-neuron layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)  # we add dense layers so that the model can learn more complex functions and classify for better results
x = Dense(1024, activation='relu')(x)  # dense layer 2
x = Dense(512, activation='relu')(x)   # dense layer 3
preds = Dense(120, activation='softmax')(x)  # final layer with softmax activation
and then one builds the model with
model = Model(inputs=base_model.input, outputs=preds)
However, I want there to be a few other layers before base_model.input. I want to add adversarial noise to the images that come in, and a few other things. So effectively I want to know how to do:
base_model = MobileNet(weights='imagenet', include_top=False)  # imports the MobileNet model and discards the last 1000-neuron layer
x = somerandomelayers(x_in)
base_model.input = x_in
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)  # we add dense layers so that the model can learn more complex functions and classify for better results
x = Dense(1024, activation='relu')(x)  # dense layer 2
x = Dense(512, activation='relu')(x)   # dense layer 3
preds = Dense(120, activation='softmax')(x)  # final layer with softmax activation
model = Model(inputs=x_in, outputs=preds)
but the line base_model.input = x_in is apparently not the way to do it, as it throws a "can't set attribute" error. How do I go about achieving the desired behavior?
You need to define an input layer. It's rather straightforward; just be sure to set the right shapes. For example, you can use any predefined model from Keras:
base_model = keras.applications.any_model(...)
input_layer = keras.layers.Input(shape)
x = keras.layers.Layer(...)(input_layer)
...
x = base_model(x)
...
output = keras.layers.Dense(num_classes, activation)(x)
model = keras.Model(inputs=input_layer, outputs=output)
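A concrete version of that template for the MobileNet case in the question might look like the sketch below; GaussianNoise is only a stand-in for whatever custom layers are actually wanted, and the 224x224 input size and 120-class head are assumptions carried over from the snippets above:
# Hedged sketch: GaussianNoise stands in for the "adversarial noise and a few
# other things" mentioned in the question; shapes and class count are assumptions.
from tensorflow import keras

base_model = keras.applications.MobileNet(weights='imagenet', include_top=False)

x_in = keras.layers.Input(shape=(224, 224, 3))   # assumed input resolution
x = keras.layers.GaussianNoise(0.1)(x_in)        # placeholder preprocessing layer
x = base_model(x)                                # run the pretrained base on the new graph
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dense(1024, activation='relu')(x)
x = keras.layers.Dense(1024, activation='relu')(x)
x = keras.layers.Dense(512, activation='relu')(x)
preds = keras.layers.Dense(120, activation='softmax')(x)

model = keras.Model(inputs=x_in, outputs=preds)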