Deep Learning model (LSTM) predicts same class label - tensorflow

I am trying to solve the Spoken Digit Recognition task using an LSTM model: the audio files are converted into spectrograms and fed into an LSTM, followed by Global Average Pooling. Here is the architecture:
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, GlobalAveragePooling1D, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

tf.keras.backend.clear_session()

# input layer
input_ = Input(shape=(64, 35))
lstm = LSTM(100, activation='tanh', return_sequences=True, kernel_regularizer=l2(0.000001),
            recurrent_initializer='glorot_uniform')(input_)
lstm = GlobalAveragePooling1D(data_format='channels_first')(lstm)
dense = Dense(20, activation='relu', kernel_regularizer=l2(0.000001), kernel_initializer='glorot_uniform')(lstm)
drop = Dropout(0.8)(dense)
dense1 = Dense(25, activation='relu', kernel_regularizer=l2(0.000001), kernel_initializer='he_uniform')(drop)
drop = Dropout(0.95)(dense1)
output = Dense(10, activation='softmax', kernel_regularizer=l2(0.000001), kernel_initializer='glorot_uniform')(drop)
model_2 = Model(inputs=[input_], outputs=output)
model_2.summary()
The model summary output (from model_2.summary()) is omitted here.
I need to calculate the F1 score to check the performance of the model. I have implemented a custom callback and also tried the TensorFlow Addons F1 score, but I don't get a sensible result: for every epoch I get the same constant F1 score.
On further digging, I found that the model predicts the same class label for every sample throughout an epoch, even though it should be spreading its predictions over the 10 class labels present in the data.
Here are my model.compile and model.fit calls; I have used the TensorFlow Addons metric here:
from tensorflow import keras
opt = keras.optimizers.Adam(0.001, clipnorm=0.8)
model_2.compile(loss='categorical_crossentropy', optimizer=opt, metrics = metric)
hist = model_2.fit([X_train_spectrogram],
[y_train_converted],
validation_data= ([X_test_spectrogram], [y_test_converted]),
epochs = 10,
verbose =1,
callbacks=[tensorBoard_callbk2, ClearMemory()],
# steps_per_epoch = 3,
batch_size=32)
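The metric object passed to model.compile above is not defined in the snippet; presumably it is the TensorFlow Addons F1 score mentioned earlier. A plausible definition (an assumption on my part, not shown in the original post) would be:
import tensorflow_addons as tfa

# Assumed definition of `metric`: macro-averaged F1 over the 10 digit classes,
# computed against the one-hot encoded targets used with categorical_crossentropy.
metric = [tfa.metrics.F1Score(num_classes=10, average='macro')]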
Here is what I mean by getting the same prediction: the entire prediction array is filled with the same predicted class (the array itself is not reproduced here).
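One quick way to confirm the collapse is to look at the distribution of predicted class indices over the test set; this diagnostic sketch reuses the variables from the question:
import numpy as np

# If the model has collapsed to a single class, one label will receive
# (almost) all of the counts printed here.
preds = np.argmax(model_2.predict(X_test_spectrogram), axis=1)
print(np.unique(preds, return_counts=True))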
Why is the model predicting the same class label, and how can I rectify it?
I have tried increasing the number of trainable parameters and increasing/decreasing the batch size, but it doesn't help. If anyone knows what is wrong, can you please help me out?

Related

RNN LSTM network for inputting a sequence of numbers

I'm trying to feed a simple dataset into LSTM networks; it has multiple different sequences of numbers that represent musical data. The data is just a bunch of NumPy arrays of floating point numbers, with each song being one array. The data looks like this:
Song 1: [0.00013487907, 0.0002517006, 0.00021654845, ...]
Song 2: [-0.007279772, -0.011207076, -0.010082608, ...]
Song 3: [-0.00060827745, -0.00082834775, -0.0006534484, ...]
..and so on
I have done this before for MIDI files, but those require embeddings of the different characters. This, however, is continuous rather than discrete data, so I'm not sure what the input to the model should look like, or how the data can be loaded for this particular task. For example, the MIDI file project's model had an embedding layer at the input:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, TimeDistributed, Dense, Activation

batch_size = 16
seq_length = 64
num_epochs = 100
optimizer_ = tf.keras.optimizers.Adam()
model = Sequential()
model.add(Embedding(input_dim = num_unique_chars, output_dim = 512, batch_input_shape = (batch_size, seq_length)))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences = True, stateful = True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(num_unique_chars)))
model.add(Activation("softmax"))
model.compile(loss = "categorical_crossentropy", optimizer = optimizer_, metrics = ["accuracy"])
I want to know how to do the same without tokenization/embedding, feed each song into the model separately, and then be able to generate samples from it.
I've tried looking for examples of this but everything related to LSTM networks seems to be text-based. Would appreciate any help/guidance with this!
Thanks
If you already have continuous values, you do not need an Embedding layer. Either you pass the data directly into the LSTMs, or you can use a Dense layer in between. Additionally, you can add a Masking layer (depending on your data).
You also have to adjust the shape of your data to (batch_size, seq_len, 1), since you only have one feature but the time dimension still has to be recognizable as such.
Here is a minimal working example with a Dense layer in place of the Embedding layer (which does not apply to continuous input):
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import Sequential
batch_size = 16
seq_length = 64
num_epochs = 100
num_unique_chars = 55 # I just picked any number
optimizer_ = tf.keras.optimizers.Adam()
model = Sequential()
model.add(layers.Dense(256, use_bias=False))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(256, return_sequences = True, stateful = True))
model.add(layers.Dropout(0.2))
model.add(layers.TimeDistributed(layers.Dense(num_unique_chars)))
model.add(layers.Activation("softmax"))
model.compile(loss = "categorical_crossentropy", optimizer = optimizer_, metrics = ["accuracy"])
test_data = tf.random.normal(shape=(batch_size, seq_length, 1))
test_out = model(test_data)
print(test_out.shape)
Output: (16, 64, 55)
P. S.: With Dense layers the TimeDistributed-layer is optional. The Dense layer will just manipulate the last dimension of its input tensor.
P. P. S.: I think that for your limited number of features, three LSTM layers with a dimension of 256 might easily result in over-fitting or other unpleasant effects, so it might be useful to reduce the number of layers and their dimension. (Of course, this does not address your initial question.)
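As for feeding each song into the model separately: one approach is to slice every song into fixed-length windows shaped (num_windows, seq_length, 1) and batch those windows. The helper below is only an illustrative sketch; make_windows and the random stand-in data are not from the original post.
import numpy as np

# Hypothetical helper: slice one song (a 1-D float array) into
# fixed-length windows shaped (num_windows, seq_length, 1).
def make_windows(song, seq_length=64):
    n = (len(song) // seq_length) * seq_length  # drop the incomplete tail
    return np.asarray(song[:n], dtype="float32").reshape(-1, seq_length, 1)

song_1 = np.random.randn(10_000).astype("float32")  # stand-in for real song data
x = make_windows(song_1)
print(x.shape)  # (156, 64, 1)
The resulting windows can then be grouped into batches of batch_size before being fed to the stateful LSTMs above.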

Keras accuracy not increasing

I am trying to perform sentiment classification using Keras with a basic neural network (no RNN or other more complex model type). However, when I run the script I see no increase in accuracy during training/evaluation. I suspect I am setting up the output layer incorrectly, but I am not sure. y_train is a list such as [1,2,3,1,2,4,5] (5 different labels) containing the targets belonging to the features in X_train_seq_padded. The setup is as follows:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

padding_len = 24  # length of each tokenized sentence
neurons = 16      # 2/3 the length of the text that is padded
model = Sequential()
model.add(Dense(neurons, input_dim = padding_len, activation = 'relu', name = 'hidden-1'))
model.add(Dense(neurons, activation = 'relu', name = 'hidden-2'))
model.add(Dense(neurons, activation = 'relu', name = 'hidden-3'))
model.add(Dense(1, activation = 'sigmoid', name = 'output_layer'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])
callbacks = [EarlyStopping(monitor = 'accuracy', patience = 5, mode = 'max')]
history = model.fit(X_train_seq_padded, y_train, epochs = 100, batch_size = 64, callbacks = callbacks)
First of all, in your setup above, if you choose sigmoid as the last-layer activation, which is generally used for binary or multi-label classification, then the loss function should be binary_crossentropy.
But if your labels are multi-class and transformed into one-hot encodings, then your last layer should be Dense(num_classes, activation='softmax') and the loss function should be categorical_crossentropy.
But if you keep your multi-class labels as integers instead of one-hot encoding them, then your last layer and loss function should be
Dense(num_classes) # with logits
SparseCategoricalCrossentropy(from_logits= True)
Or, (#Frightera)
Dense(num_classes, activation='softmax') # with probabilities
SparseCategoricalCrossentropy(from_logits=False)
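Putting that together for the setup in the question, a minimal sketch of the integer-label variant might look like this; the shift of the labels from 1..5 to 0..4 is my assumption, since sparse labels are expected to start at 0:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

padding_len = 24
num_classes = 5

model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(padding_len,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

# labels in the question run from 1 to 5, so shift them to 0..4 before fitting:
# model.fit(X_train_seq_padded, np.asarray(y_train) - 1, epochs=100, batch_size=64)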

'Channels first' training accuracy very low compared to 'channels last'

My issue:
I am trying to train a semantic segmentation model in tf.keras. It works very well when I use channels_last (HWC) mode, reaching 96%+ validation accuracy. I wanted to train it in channels_first (CHW) mode so that the weights are compatible with TensorRT. When I do this, the ~80% training accuracy of the first few epochs drops to around 0.020% and stays there permanently.
It is useful to know that the base of my model is a tf.keras.applications.MobileNet() model with the pre-trained 'imagenet' weights. (Model architecture at the bottom.)
The transformation process:
I followed the guidelines provided and changed only a few things:
Set tf.keras.backend.set_image_data_format() to 'channels_first'.
I changed the channel order in the input tensor from input_tensor=Input(shape=(376, 672, 3)) to input_tensor=Input(shape=(3, 376, 672)).
In my image preprocessing (using tf.data.Dataset), I use tf.transpose(img, perm=[2, 0, 1]) on both my input image and the one-hot encoded mask to change the channel order. I checked this with an equality assertion to make sure it's correct, and it seems to be fine.
When I make these changes the training starts fine, but as I said the training accuracy drops to almost zero. When I revert them, everything is fine again.
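For concreteness, the transpose step described above might look like this inside the tf.data pipeline; the function name and exact shapes here are my own illustration, not code from the question:
import tensorflow as tf

def to_channels_first(img, mask):
    # img: (376, 672, 3) -> (3, 376, 672); mask: (H, W, classes) -> (classes, H, W)
    img = tf.transpose(img, perm=[2, 0, 1])
    mask = tf.transpose(mask, perm=[2, 0, 1])
    return img, mask

# dataset = dataset.map(to_channels_first)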
Possible leads:
What am I doing wrong or what could be the problematic part here? My suspicions are around these questions:
Are the pre-trained ImageNet weights also converted to 'channels_first' order when I set the backend? Is this something I should consider at all?
Could it be that the tf.transpose() function messes up the mask's one-hot encoding? (I have 3 classes represented by 3 colors: lane, opposing lane, background)
Maybe I am not seeing something obvious. I can provide further code and answers as needed.
EDIT:
08/17: This is still an ongoing issue; I have tried several things:
I checked with a NumPy assertion whether the image and the mask are correct after the transpose; they seem to be.
I suspected that the loss function was being computed over the wrong axis, so I customized the loss function to use the first axis (where the channels are). Here it is:
def ReverseAxisLoss(y_true, y_pred):
    return K.categorical_crossentropy(y_true, y_pred, from_logits=True, axis=1)
My main suspicion is that the 'channels first' backend setting does nothing to transpose the pretrained 'imagenet' weights for the mobilenet part. Is there an updated way for TF2.x / Keras to transpose the pre-trained weights into CHW format?
Here is the architecture I use (skipNet() is the head network, MobileNet is the base, and they are connected in the create_model() function):
def skipNet(encoder_output, feed1, feed2, classes):
    # random initializer and regularizer
    stddev = 0.01
    init = RandomNormal(stddev=stddev)
    weight_decay = 1e-3
    reg = l2(weight_decay)

    score_feed2 = Conv2D(kernel_size=(1, 1), filters=classes, padding="SAME",
                         kernel_initializer=init, kernel_regularizer=reg)(feed2)
    score_feed2_bn = BatchNormalization()(score_feed2)

    score_feed1 = Conv2D(kernel_size=(1, 1), filters=classes, padding="SAME",
                         kernel_initializer=init, kernel_regularizer=reg)(feed1)
    score_feed1_bn = BatchNormalization()(score_feed1)

    upscore2 = Conv2DTranspose(kernel_size=(4, 4), filters=classes, strides=(2, 2),
                               padding="SAME", kernel_initializer=init,
                               kernel_regularizer=reg)(encoder_output)
    height_pad1 = ZeroPadding2D(padding=((1, 0), (0, 0)))(upscore2)
    upscore2_bn = BatchNormalization()(height_pad1)

    fuse_feed1 = add([score_feed1_bn, upscore2_bn])

    upscore4 = Conv2DTranspose(kernel_size=(4, 4), filters=classes, strides=(2, 2),
                               padding="SAME", kernel_initializer=init,
                               kernel_regularizer=reg)(fuse_feed1)
    height_pad2 = ZeroPadding2D(padding=((0, 1), (0, 0)))(upscore4)
    upscore4_bn = BatchNormalization()(height_pad2)

    fuse_feed2 = add([score_feed2_bn, upscore4_bn])

    upscore8 = Conv2DTranspose(kernel_size=(16, 16), filters=classes, strides=(8, 8),
                               padding="SAME", kernel_initializer=init,
                               kernel_regularizer=reg, activation="softmax")(fuse_feed2)

    return upscore8
def create_model(classes):
    base_model = tf.keras.applications.MobileNet(input_tensor=Input(shape=IMG_SHAPE),
                                                 include_top=False,
                                                 weights='imagenet')

    conv4_2_output = base_model.get_layer(index=43).output
    conv3_2_output = base_model.get_layer(index=30).output
    conv_score_output = base_model.output

    head_model = skipNet(conv_score_output, conv4_2_output, conv3_2_output, classes)

    for layer in base_model.layers:
        layer.trainable = False

    model = Model(inputs=base_model.input, outputs=head_model)
    return model

How to compute saliency map using keras backend

I am trying to construct a basic "vanilla gradient" saliency heatmap (gradient-based feature attribution) for MNIST using keras. I know there are libraries such as this one to compute saliency heatmaps, but I would like to construct this from scratch since the vanilla gradient approach seems conceptually straightforward to implement. I have trained the following digit classifier in Keras using functional model definition:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10  # MNIST digits

input = layers.Input(shape=(28, 28, 1), name='input')
conv2d_1 = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(input)
maxpooling2d_1 = layers.MaxPooling2D(pool_size=(2, 2), name='maxpooling2d_1')(conv2d_1)
conv2d_2 = layers.Conv2D(64, kernel_size=(3, 3), activation='relu')(maxpooling2d_1)
maxpooling2d_2 = layers.MaxPooling2D(pool_size=(2, 2))(conv2d_2)
flatten = layers.Flatten(name='flatten')(maxpooling2d_2)
dropout = layers.Dropout(0.5, name='dropout')(flatten)
dense = layers.Dense(num_classes, activation='softmax', name='dense')(dropout)
model = keras.models.Model(inputs=input, outputs=dense)
Now, I want to compute the saliency map for a single MNIST image. Since the final layer has a softmax activation and the denominator is a normalization term (so that the output nodes add up to 1), I believe that I need to either take the pre-softmax output or change the activation of the trained model to linear when computing saliency maps. I will do the latter.
model.layers[-1].activation = tf.keras.activations.linear  # swap activation to linear
input = model.layers[0].input
output = model.layers[-1].output
input_image = x_test[0]  # shape is (28, 28, 1)
pred = np.argmax(model.predict(np.expand_dims(input_image, axis=0)))  # predicted class
However, I am not sure what to do beyond this. I know I can use the following K.gradients(output, input) to compute gradients. That being said, I believe I should compute the gradient of the predicted class with respect to the input image, versus computing the gradient of the entire output. How would I do this? Also, I'm not sure how to evaluate the saliency heatmap for a specific image/prediction. I imagine I will have to use sess = tf.keras.backend.get_session() and sess.run(), but not sure exactly. I would greatly appreciate any help with completing the saliency heatmap code. Thanks!
If you add the activation as a single layer after the last dense layer with:
keras.layers.Activation('softmax')
you can do:
linear_model = keras.Model(inputs=model.input, outputs=model.layers[-2].output)
Then you can compute the gradients like this:
def get_saliency_map(model, image, class_idx):
    with tf.GradientTape() as tape:
        tape.watch(image)
        predictions = model(image)
        loss = predictions[:, class_idx]

    # get the gradients of the loss w.r.t. the input image
    gradient = tape.gradient(loss, image)

    # take the maximum across channels
    gradient = tf.reduce_max(gradient, axis=-1)

    # convert to numpy
    gradient = gradient.numpy()

    # normalize between 0 and 1
    min_val, max_val = np.min(gradient), np.max(gradient)
    smap = (gradient - min_val) / (max_val - min_val + keras.backend.epsilon())

    return smap
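A possible way to call it on a single test image (this usage example is mine, not the answerer's; x_test is assumed to be the MNIST test set from the question, and a batch axis is added so the model receives a (1, 28, 28, 1) tensor):
import numpy as np
import tensorflow as tf

# Wrap the image in a float tensor so GradientTape can watch it.
image = tf.convert_to_tensor(np.expand_dims(x_test[0], axis=0), dtype=tf.float32)
class_idx = int(np.argmax(linear_model.predict(image)))
smap = get_saliency_map(linear_model, image, class_idx)  # shape (1, 28, 28)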

Keras: inconsistent results when fitting ConvNets in two different ways

I'm trying to use the VGG16 network to do image classification. I've tried two different ways of doing it which, as far as I understand, should be approximately equivalent, yet the results are very different.
Method 1: Extract features using VGG16 and fit these features using a custom fully connected network. Here is the code:
model = vgg16.VGG16(include_top=False, weights='imagenet',
                    input_shape=(imsize, imsize, 3),
                    pooling='avg')

model_pred = keras.Sequential()
model_pred.add(keras.layers.Dense(1024, input_dim=512, activation='sigmoid'))
model_pred.add(keras.layers.Dropout(0.5))
model_pred.add(keras.layers.Dense(512, activation='sigmoid'))
model_pred.add(keras.layers.Dropout(0.5))
model_pred.add(keras.layers.Dense(num_categories, activation='sigmoid'))

model_pred.compile(loss=keras.losses.categorical_crossentropy,
                   optimizer=keras.optimizers.Adadelta(), metrics=['accuracy'])

(xtr, ytr) = tools.extract_features(model, 3000, imsize, datagen,
                                    rootdir + '/train',
                                    pickle_name=rootdir + '/testpredstrain.pickle')
(xv, yv) = tools.extract_features(model, 300, imsize, datagen,
                                  rootdir + '/valid1',
                                  pickle_name=rootdir + '/testpredsvalid.pickle')

model_pred.fit(xtr, ytr, epochs=10, validation_data=(xv, yv), verbose=1)
(The function extract_features() simply uses Keras ImageDataGenerator to generate sample images and returns the output after using model.predict() on those images)
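For context, here is a hypothetical sketch of what such an extract_features() helper might do; the real implementation is not given in the question, so the signature, the flow_from_directory parameters, and the omission of the pickling step are all my assumptions:
import numpy as np

def extract_features(model, n_samples, imsize, datagen, directory, pickle_name=None):
    # Stream images from disk and run the frozen VGG16 on each batch.
    gen = datagen.flow_from_directory(directory, target_size=(imsize, imsize),
                                      batch_size=32, class_mode='categorical')
    xs, ys, seen = [], [], 0
    for x_batch, y_batch in gen:
        xs.append(model.predict(x_batch))  # (batch, 512) because pooling='avg'
        ys.append(y_batch)
        seen += len(x_batch)
        if seen >= n_samples:
            break
    # pickle_name is accepted to mirror the call sites above; caching to disk is omitted here.
    return np.concatenate(xs), np.concatenate(ys)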
Method 2: Take the VGG16 network without the top part, set all the convolutional layers to non-trainable and add a few densely connected layers that are trainable. Then fit using keras fit_generator(). Here is the code:
model2 = vgg16.VGG16(include_top=False, weights='imagenet',
                     input_shape=(imsize, imsize, 3),
                     pooling='avg')

for ll in model2.layers:
    ll.trainable = False

out1 = keras.layers.Dense(1024, activation='softmax')(model2.layers[-1].output)
out1 = keras.layers.Dropout(0.4)(out1)
out1 = keras.layers.Dense(512, activation='softmax')(out1)
out1 = keras.layers.Dropout(0.4)(out1)
out1 = keras.layers.Dense(num_categories, activation='softmax')(out1)
model2 = keras.Model(inputs=model2.input, outputs=out1)

model2.compile(loss=keras.losses.categorical_crossentropy,
               optimizer=keras.optimizers.Adadelta(),
               metrics=['accuracy'])

model2.fit_generator(train_gen,
                     steps_per_epoch=100,
                     epochs=10,
                     validation_data=valid_gen,
                     validation_steps=10)
The number of epochs, samples, etc. are not exactly the same in both methods, but they don't need to be to notice the inconsistency: method 1 yields a validation accuracy of 0.47 after just one epoch, and gets as high as 0.7-0.8 (and even better) when I use a larger number of samples to fit. Method 2, however, gets stuck at a validation accuracy of 0.1-0.15 and never gets any better no matter how much I train.
Also, method 2 is considerably slower than method 1 even though it seems to me that they should be approximately as fast (when taking into account the time it takes to extract the features in method 1).
With your first method you extract features with the pre-trained VGG16 model once and then train/fine-tune your small network on those features, while in your second approach you are constantly passing your images through every layer, including VGG's layers, at every epoch. That is why your model runs slower with the second method.
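A minimal sketch of the difference, assuming plain NumPy arrays x_train and one-hot y_train instead of the generators used in the question:
# Method 1 in miniature: the frozen VGG16 forward pass runs exactly once,
# and only the small dense head sees the data on every epoch.
features = model.predict(x_train, batch_size=32)   # shape (n, 512) because pooling='avg'
model_pred.fit(features, y_train, epochs=10)

# Method 2 in miniature: every epoch re-runs the full VGG16 forward pass on every image.
model2.fit(x_train, y_train, epochs=10, batch_size=32)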