TensorFlow model with custom loss function gets no training done

I created a custom loss function as below:
import tensorflow as tf
import tensorflow.keras.backend as K

def custom_loss(y_true, y_pred):
    y_true = K.cast(y_true, tf.float32)
    y_pred = K.cast(y_pred, tf.float32)
    # mask is 1 where the signs of y_true and y_pred disagree, 0 where they agree
    mask = K.sign(y_true) * K.sign(y_pred)
    mask = ((mask * -1) + 1) / 2
    losses = K.abs(y_true * mask)
    return K.sum(losses)
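To sanity-check what the mask computes, here is a quick evaluation on hand-picked values (illustrative numbers of my own): the loss only counts targets whose sign was predicted wrongly.
import numpy as np
# first two pairs agree in sign -> mask 0; last two disagree -> mask 1
y_true = np.array([ 1.0, -1.0,  2.0, -3.0], dtype=np.float32)
y_pred = np.array([ 0.5, -0.2, -0.1,  0.4], dtype=np.float32)
print(custom_loss(y_true, y_pred).numpy())  # |2.0| + |-3.0| = 5.0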
However, when I try to train the model using this loss function, no training happens at all.
The model trains normally with other loss functions such as mse and mae, and I've tried a wide range of learning rates and model complexities.
Below is how I know that no training is being done.
model = get_compiled_model()
print(model.predict(train_x)[:10])
model.fit(train_x, train_y, epochs=5, verbose=1)
print(model.predict(train_x)[:10])
model.fit(train_x, train_y, epochs=5, verbose=1)
print(model.predict(train_x)[:10])
[[0.19206487]
[0.19201839]
[0.19199933]
[0.19199185]
[0.19206186]
[0.19208357]
[0.1920282 ]
[0.19203594]
[0.1919941 ]
[0.19202243]]
Epoch 1/5
1/1 [==============================] - 0s 1ms/step - loss: 0.0179
Epoch 2/5
1/1 [==============================] - 0s 2ms/step - loss: 0.0179
Epoch 3/5
1/1 [==============================] - 0s 1ms/step - loss: 0.0179
Epoch 4/5
1/1 [==============================] - 0s 1ms/step - loss: 0.0179
Epoch 5/5
1/1 [==============================] - 0s 2ms/step - loss: 0.0179
[[0.19206487]
[0.19201839]
[0.19199933]
[0.19199185]
[0.19206186]
[0.19208357]
[0.1920282 ]
[0.19203594]
[0.1919941 ]
[0.19202243]]
Epoch 1/5
1/1 [==============================] - 0s 1ms/step - loss: 0.0179
Epoch 2/5
1/1 [==============================] - 0s 2ms/step - loss: 0.0179
Epoch 3/5
1/1 [==============================] - 0s 2ms/step - loss: 0.0179
Epoch 4/5
1/1 [==============================] - 0s 951us/step - loss: 0.0179
Epoch 5/5
1/1 [==============================] - 0s 1ms/step - loss: 0.0179
[[0.19206487]
[0.19201839]
[0.19199933]
[0.19199185]
[0.19206186]
[0.19208357]
[0.1920282 ]
[0.19203594]
[0.1919941 ]
[0.19202243]]
The 2D array in the output above holds the first 10 predictions of the model, and it does not change in the slightest even after two rounds of 5 epochs of training.
My intuition tells me something is wrong with the loss function, but I have no idea what.
The model is defined as follows:
def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, input_dim=2*training_size+1, activation='softmax'),
        tf.keras.layers.Dense(10, activation='softmax'),
        tf.keras.layers.Dense(1, activation='tanh')
    ])
    opt = tf.keras.optimizers.Adam(learning_rate=0.0005)
    model.compile(optimizer=opt,
                  loss=custom_loss,
                  metrics=[])
    return model

I was playing around with some fake data using your model and loss function, and I wanted to check the derivatives.
import numpy
import tensorflow as tf
import tensorflow.keras.backend as K

if __name__ == "__main__":
    m = get_compiled_model()
    x = numpy.random.random((1000, 21))
    x = numpy.array(x, dtype="float32")
    exp_y = numpy.random.random((1000, 1))
    exp_y = (exp_y > 0.5) * 1.0
    with tf.GradientTape() as tape:
        y = m(x)
        # note: arguments are (y_pred, y_true) here, the reverse of Keras's (y_true, y_pred) convention
        loss = custom_loss(y, exp_y)
        # loss = keras.losses.mse(y, exp_y)
    grad = tape.gradient(loss, m.trainable_variables)
    for var, g in zip(m.trainable_variables, grad):
        # despite the 'shape' label, this prints the sum of squared gradient entries
        print(f'{var.name}, shape: {K.sum(g*g)}')
For the mse loss function:
dense/kernel:0, shape: 2817.013671875
dense/bias:0, shape: 530.52197265625
dense_1/kernel:0, shape: 3826.3974609375
dense_1/bias:0, shape: 25160.9375
dense_2/kernel:0, shape: 125238.34375
dense_2/bias:0, shape: 1241268.5
For the custom loss function:
dense/kernel:0, shape: 34.87071228027344
dense/bias:0, shape: 6.609962463378906
dense_1/kernel:0, shape: 107.27591705322266
dense_1/bias:0, shape: 824.83740234375
dense_2/kernel:0, shape: 5944.91796875
dense_2/bias:0, shape: 59201.58203125
We can see that the sums of squared derivatives differ by orders of magnitude: the gradients from the custom loss are far smaller than those from MSE. Even with this random data, the MSE loss function will cause the output of the model to change over time.
This might only be the case for the fake data I made, though.
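A related check (a minimal sketch of my own, beyond the experiment above): TensorFlow registers the gradient of the sign op as zero everywhere, so no gradient can flow through K.sign(y_pred) itself; any gradient reaching y_pred has to come from elsewhere in the loss.
import tensorflow as tf

x = tf.Variable([0.5, -1.5, 2.0])
with tf.GradientTape() as tape:
    s = tf.reduce_sum(tf.sign(x))
g = tape.gradient(s, x)
print(g)  # tf.Tensor([0. 0. 0.], shape=(3,), dtype=float32) -- sign contributes no gradient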

Related

Is the loss function wrong in the following code for binary classification of images using soft labels? Or is there some other problem?

We are using a CNN to classify images with labels 0 and 1 in TensorFlow.
However, in reality, images have probability values between 0 and 1, not hard labels of 0 and 1. Images with probabilities in the range [0, 0.5) are labeled 0, and images in the range [0.5, 1.0] are labeled 1. I want to check whether classification performance is better when binary classification is performed with soft labels between 0 and 1 instead of hard 0/1 labels.
The code below is an example of binary classification using only the data labeled 0 and 1 in the cifar10 dataset.
In the code below, the accuracy is about 98% without the 'making soft labels part', but about 48% with it.
Should I modify the 'BinaryCrossEntropy_custom' function, which is the loss function, to solve the problem? Or is something else wrong?
This answer says that using logits solves it. I understand the soft_labels argument, but what value should I put in the logits argument in this example code?
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications import vgg16
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
# Rewrite the binary cross entropy function. We will modify this function to return a loss that fits the soft label later.
def BinaryCrossEntropy_custom(y_true, y_pred):
    y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
    term_0 = (1 - y_true) * K.log(1 - y_pred + K.epsilon())
    term_1 = y_true * K.log(y_pred + K.epsilon())
    return -K.mean(term_0 + term_1, axis=0)
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
# Only data with labels 0 and 1 are used.
train_ind01 = np.where((train_labels == 0) | (train_labels == 1))[0]
test_ind01 = np.where((test_labels == 0) | (test_labels == 1))[0]
train_images = train_images[train_ind01, :, :, :]
test_images = test_images[test_ind01, :, :, :]
train_labels = train_labels[train_ind01, :]
test_labels = test_labels[test_ind01, :]
train_labels = np.array(train_labels).astype('float64')
test_labels = np.array(test_labels).astype('float64')
# making soft labels part start
# Samples with label 0 are replaced with labels in the range [0,0.2],
# and samples with label 1 are replaced by labels in the range [0.8, 1.0].
sampl_train = np.random.uniform(low=-0.2, high=0.2, size=train_labels.shape)
sampl_test = np.random.uniform(low=-0.2, high=0.2, size=test_labels.shape)
train_labels = train_labels + sampl_train
test_labels = test_labels + sampl_test
train_labels = np.clip(train_labels, 0.0, 1.0)
test_labels = np.clip(test_labels, 0.0, 1.0)
# making soft labels part end
vgg = vgg16.VGG16(include_top=False, weights='imagenet', input_shape=(32, 32, 3))
output = vgg.layers[-1].output
output = layers.Flatten()(output)
output = layers.Dense(512, activation='relu')(output)
output = layers.Dropout(0.2)(output)
output = layers.Dense(256, activation='relu')(output)
output = layers.Dropout(0.2)(output)
predictions = layers.Dense(units=1, activation="sigmoid")(output)
model = Model(inputs=vgg.input, outputs=predictions)
model.compile(optimizer=Adam(learning_rate=.0001), loss=BinaryCrossEntropy_custom, metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=100,
                    validation_data=(test_images, test_labels))
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(test_acc)
The console output with the 'making soft labels part' enabled is:
Epoch 1/100
2022-09-16 15:29:29.136931: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8101
313/313 [==============================] - 17s 42ms/step - loss: 0.2951 - accuracy: 0.4779 - val_loss: 0.2775 - val_accuracy: 0.4650
Epoch 2/100
313/313 [==============================] - 12s 38ms/step - loss: 0.2419 - accuracy: 0.4931 - val_loss: 0.2488 - val_accuracy: 0.4695
Epoch 3/100
313/313 [==============================] - 12s 39ms/step - loss: 0.2290 - accuracy: 0.4978 - val_loss: 0.2424 - val_accuracy: 0.4740
Epoch 4/100
313/313 [==============================] - 12s 39ms/step - loss: 0.2161 - accuracy: 0.5002 - val_loss: 0.2404 - val_accuracy: 0.4765
Epoch 5/100
313/313 [==============================] - 12s 39ms/step - loss: 0.2139 - accuracy: 0.5007 - val_loss: 0.2620 - val_accuracy: 0.4730
Epoch 6/100
313/313 [==============================] - 12s 38ms/step - loss: 0.2118 - accuracy: 0.5023 - val_loss: 0.2480 - val_accuracy: 0.4745
Epoch 7/100
313/313 [==============================] - 12s 38ms/step - loss: 0.2097 - accuracy: 0.5019 - val_loss: 0.2350 - val_accuracy: 0.4775
Epoch 8/100
313/313 [==============================] - 12s 39ms/step - loss: 0.2098 - accuracy: 0.5024 - val_loss: 0.2289 - val_accuracy: 0.4780
Epoch 9/100
313/313 [==============================] - 12s 38ms/step - loss: 0.2034 - accuracy: 0.5039 - val_loss: 0.2364 - val_accuracy: 0.4780
Epoch 10/100
313/313 [==============================] - 12s 39ms/step - loss: 0.2025 - accuracy: 0.5040 - val_loss: 0.2481 - val_accuracy: 0.4720
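Regarding the logits question: logits are the raw outputs of the final Dense layer before the sigmoid is applied. A minimal sketch of the logits-based variant (my adaptation of the model above, not tested against this exact setup):
import tensorflow as tf

# hypothetical change: final layer without activation, so the model outputs raw logits
predictions = layers.Dense(units=1, activation=None)(output)
model = Model(inputs=vgg.input, outputs=predictions)
# the built-in loss applies the sigmoid internally when from_logits=True
model.compile(optimizer=Adam(learning_rate=.0001),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])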

Loss when starting fine tuning is higher than loss from transfer learning

Since I start fine-tuning with the weights learned during transfer learning, I would expect the loss to be the same or lower. However, it looks like fine-tuning starts from a different set of weights.
Code to start transfer learning:
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')
base_model.trainable = False
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(units=3, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
epochs = 1000
callback = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
history = model.fit(train_generator,
                    steps_per_epoch=len(train_generator),
                    epochs=epochs,
                    validation_data=val_generator,
                    validation_steps=len(val_generator),
                    callbacks=[callback])
Output from last epoch:
Epoch 29/1000
232/232 [==============================] - 492s 2s/step - loss: 0.1298 - accuracy: 0.8940 - val_loss: 0.1220 - val_accuracy: 0.8937
Code to start fine tuning:
model.trainable = True
# Fine-tune from this layer onwards
fine_tune_at = -20
# Freeze all the layers before the `fine_tune_at` layer
for layer in model.layers[:fine_tune_at]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history_fine = model.fit(train_generator,
                         steps_per_epoch=len(train_generator),
                         epochs=epochs,
                         validation_data=val_generator,
                         validation_steps=len(val_generator),
                         callbacks=[callback])
But this is what I see for the first few epochs:
Epoch 1/1000
232/232 [==============================] - ETA: 0s - loss: 0.3459 - accuracy: 0.8409
/usr/local/lib/python3.7/dist-packages/PIL/Image.py:960: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
  "Palette images with Transparency expressed in bytes should be "
232/232 [==============================] - 509s 2s/step - loss: 0.3459 - accuracy: 0.8409 - val_loss: 0.7755 - val_accuracy: 0.7262
Epoch 2/1000
232/232 [==============================] - 502s 2s/step - loss: 0.1889 - accuracy: 0.9066 - val_loss: 0.5628 - val_accuracy: 0.8881
Eventually the loss drops and passes the transfer learning loss:
Epoch 87/1000
232/232 [==============================] - 521s 2s/step - loss: 0.0232 - accuracy: 0.8312 - val_loss: 0.0481 - val_accuracy: 0.8563
Why was the loss in the first epoch of fine tuning higher than the last loss from transfer learning?
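One thing worth verifying before comparing losses (a quick diagnostic sketch, assuming the model above): which layers are actually trainable after the freeze loop. Note that model here is the three-layer Sequential (base_model, pooling, dense), so model.layers[:fine_tune_at] with fine_tune_at = -20 may not slice the layers you expect; the MobileNetV2 layers live in base_model.layers.
import numpy as np

# count parameters on each side of the freeze after recompiling
trainable = int(np.sum([np.prod(w.shape) for w in model.trainable_weights]))
frozen = int(np.sum([np.prod(w.shape) for w in model.non_trainable_weights]))
print(f'trainable params: {trainable}, frozen params: {frozen}')

# inspect the inner layers explicitly
for layer in base_model.layers[-5:]:
    print(layer.name, layer.trainable)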

Why does accuracy stay at zero in a Keras LSTM while other metrics improve during training?

I have constructed an LSTM model for a binary text classification problem. The labels are one-hot encoded. The following is my model:
model = Sequential()
model.add(Embedding(input_dim=dictionary_length, output_dim=60, input_length=max_word_count))
model.add(LSTM(600))
model.add(Dense(units=max_word_count, activation='tanh', kernel_regularizer=regularizers.l2(0.04), activity_regularizer=regularizers.l2(0.015)))
model.add(Dense(units=max_word_count, activation='relu', kernel_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01)))
model.add(Dense(2, activation='softmax', kernel_regularizer=regularizers.l2(0.001)))
adam_optimizer = Adam(lr=0.001, decay=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=adam_optimizer,
              metrics=[tf.keras.metrics.Accuracy(), metrics.AUC(), metrics.Precision(), metrics.Recall()])
When I fit this model, the accuracy stays at 0 the whole time while the other metrics improve. What is the issue here?
Epoch 1/3
94/94 [==============================] - 5s 26ms/step - loss: 3.4845 - accuracy: 0.0000e+00 - auc_4: 0.7583 - precision_4: 0.7251 - recall_4: 0.7251
Epoch 2/3
94/94 [==============================] - 2s 24ms/step - loss: 0.4772 - accuracy: 0.0000e+00 - auc_4: 0.9739 - precision_4: 0.9249 - recall_4: 0.9249
Epoch 3/3
94/94 [==============================] - 3s 27ms/step - loss: 0.1786 - accuracy: 0.0000e+00 - auc_4: 0.9985 - precision_4: 0.9860 - recall_4: 0.9860
You need CategoricalAccuracy (https://www.tensorflow.org/api_docs/python/tf/keras/metrics/CategoricalAccuracy) to measure the accuracy for this problem.
model.compile(loss='categorical_crossentropy', optimizer=adam_optimizer,
              metrics=[tf.keras.metrics.CategoricalAccuracy(), metrics.AUC(), metrics.Precision(), metrics.Recall()])
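The difference: Accuracy counts how often y_pred exactly equals y_true element-wise, which essentially never happens with softmax probabilities, while CategoricalAccuracy compares argmaxes. A small illustration with toy values:
import tensorflow as tf

y_true = [[0.0, 1.0], [1.0, 0.0]]
y_pred = [[0.1, 0.9], [0.8, 0.2]]  # softmax outputs are never exactly 0 or 1

acc = tf.keras.metrics.Accuracy()
acc.update_state(y_true, y_pred)
print(acc.result().numpy())  # 0.0 -- no element matches exactly

cat_acc = tf.keras.metrics.CategoricalAccuracy()
cat_acc.update_state(y_true, y_pred)
print(cat_acc.result().numpy())  # 1.0 -- the argmax matches for both samples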

Why masking input produces the same loss as unmasked input on Keras?

I am experimenting with LSTMs on variable-length input, due to this reason. I wanted to be sure that the loss is calculated correctly under masking, so I trained the model below, which uses a Masking layer on padded sequences.
from tensorflow.keras.layers import LSTM, Masking, Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import models, losses
import tensorflow as tf
import numpy as np
import os
"""
For generating reproducible results, set seed.
"""
def set_seed(seed):
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
"""
Set some right most indices to mask value like padding
"""
def create_padded_seq(num_samples, timesteps, num_feats, mask_value):
feats = np.random.random([num_samples, timesteps, num_feats]).astype(np.float32) # Generate samples
for i in range(0, num_samples):
rand_index = np.random.randint(low=2, high=timesteps, size=1)[0] # Apply padding
feats[i, rand_index:, 0] = mask_value
return feats
set_seed(42)
num_samples = 100
timesteps = 6
num_feats = 1
num_classes = 3
num_lstm_cells = 1
mask_value = -100
num_epochs = 5
X_train = create_padded_seq(num_samples, timesteps, num_feats, mask_value)
y_train = np.random.randint(num_classes, size=num_samples)
cat_y_train = to_categorical(y_train, num_classes)
masked_model = models.Sequential(name='masked')
masked_model.add(Masking(mask_value=mask_value, input_shape=(timesteps, num_feats)))
masked_model.add(LSTM(num_lstm_cells, return_sequences=False))
masked_model.add(Dense(num_classes, activation='relu'))
masked_model.compile(loss=losses.categorical_crossentropy, optimizer='adam', metrics=["accuracy"])
print(masked_model.summary())
masked_model.fit(X_train, cat_y_train, batch_size=1, epochs=5, verbose=True)
This is the verbose output:
Model: "masked"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
masking (Masking) (None, 6, 1) 0
_________________________________________________________________
lstm (LSTM) (None, 1) 12
_________________________________________________________________
dense (Dense) (None, 3) 6
=================================================================
Total params: 18
Trainable params: 18
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 2/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 3/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 4/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 5/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
I also removed the Masking layer and trained another model on the same data to see the effect of masking. This is the model:
unmasked_model = models.Sequential(name='unmasked')
unmasked_model.add(LSTM(num_lstm_cells, return_sequences=False, input_shape=(timesteps, num_feats)))
unmasked_model.add(Dense(num_classes, activation='relu'))
unmasked_model.compile(loss=losses.categorical_crossentropy, optimizer='adam', metrics=["accuracy"])
print(unmasked_model.summary())
unmasked_model.fit(X_train, cat_y_train, batch_size=1, epochs=5, verbose=True)
And this is the verbose output:
Model: "unmasked"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 1) 12
_________________________________________________________________
dense (Dense) (None, 3) 6
=================================================================
Total params: 18
Trainable params: 18
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
100/100 [==============================] - 0s 1ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 2/5
100/100 [==============================] - 0s 2ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 3/5
100/100 [==============================] - 0s 1ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 4/5
100/100 [==============================] - 0s 1ms/step - loss: 10.6379 - accuracy: 0.3400
Epoch 5/5
100/100 [==============================] - 0s 1ms/step - loss: 10.6379 - accuracy: 0.3400
The losses are the same in both outputs. What is the reason for that? It seems like the Masking layer has no effect on the loss; is that correct? If not, how can I observe the effect of the Masking layer?
In the case of a multi-class classification task, the problem seems to be the last activation function.
If you change relu to softmax, your network can produce probabilities in the range [0, 1].
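A minimal sketch of that change applied to the masked model above (same data and setup assumed):
masked_model = models.Sequential(name='masked')
masked_model.add(Masking(mask_value=mask_value, input_shape=(timesteps, num_feats)))
masked_model.add(LSTM(num_lstm_cells, return_sequences=False))
masked_model.add(Dense(num_classes, activation='softmax'))  # softmax instead of relu
masked_model.compile(loss=losses.categorical_crossentropy, optimizer='adam', metrics=["accuracy"])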

Training loss and validation loss not decreasing when using ELMo embeddings with Keras

I am building an LSTM network using ELMo embeddings with Keras. My objective is to minimize the RMSE. The ELMo embeddings are obtained using the following code segment:
def ElmoEmbedding(x):
    return elmo_model(inputs={
                          "tokens": tf.squeeze(tf.cast(x, tf.string)),
                          "sequence_len": tf.constant(batch_size*[max_len])
                      },
                      signature="tokens",
                      as_dict=True)["elmo"]
The model is defined as below:
def create_model(max_len):
    input_text = Input(shape=(max_len,), dtype=tf.string)
    embedding = Lambda(ElmoEmbedding, output_shape=(max_len, 1024))(input_text)
    x = Bidirectional(LSTM(units=512, return_sequences=False,
                           recurrent_dropout=0.2, dropout=0.2))(embedding)
    out = Dense(1, activation="relu")(x)
    model = Model(input_text, out)
    return model
The model is compiled as:
model.compile(optimizer="rmsprop", loss=root_mean_squared_error,
              metrics=[root_mean_squared_error])
And then trained as:
model.fit(np.array(X_tr), y_tr, validation_data=(np.array(X_val), y_val),
          batch_size=batch_size, epochs=5, verbose=1)
The root_mean_squared_error function is defined as:
def root_mean_squared_error(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))
My dataset has 9,652 samples, each consisting of a sentence and a numeric label. The dataset is divided into a training and a validation set. The maximum sentence length is 142, so I pad each sentence with '__PAD__' tokens to length 142. A sentence before and after padding looks like this:
['france', 'is', 'hunting', 'down', 'its', 'citizens', 'who', 'joined', 'twins', 'without', 'trial', 'in', 'iraq']
['france', 'is', 'hunting', 'down', 'its', 'citizens', 'who', 'joined', 'twins', 'without', 'trial', 'in', 'iraq', '__PAD__', '__PAD__', '__PAD__',...., '__PAD__']
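For reference, padding like this can be done with a small helper (a sketch; max_len and the '__PAD__' token as described above):
def pad_sentence(tokens, max_len, pad_token='__PAD__'):
    # truncate overly long sentences, then right-pad short ones with the pad token
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))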
When I train this model, I get the following output:
Train on 8704 samples, validate on 928 samples
Epoch 1/5
8704/8704 [==============================] - 655s 75ms/step - loss: 0.9960 - root_mean_squared_error: 0.9960 - val_loss: 0.9389 - val_root_mean_squared_error: 0.9389
Epoch 2/5
8704/8704 [==============================] - 650s 75ms/step - loss: 0.9354 - root_mean_squared_error: 0.9354 - val_loss: 0.9389 - val_root_mean_squared_error: 0.9389
Epoch 3/5
8704/8704 [==============================] - 650s 75ms/step - loss: 0.9354 - root_mean_squared_error: 0.9354 - val_loss: 0.9389 - val_root_mean_squared_error: 0.9389
Epoch 4/5
8704/8704 [==============================] - 650s 75ms/step - loss: 0.9354 - root_mean_squared_error: 0.9354 - val_loss: 0.9389 - val_root_mean_squared_error: 0.9389
Epoch 5/5
8704/8704 [==============================] - 650s 75ms/step - loss: 0.9354 - root_mean_squared_error: 0.9354 - val_loss: 0.9389 - val_root_mean_squared_error: 0.9389
Neither the loss nor the metric improves; both stay the same from epochs 2 through 5.
I am not sure what is wrong here. Any help would be appreciated.
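One way to diagnose what is happening (a sketch, mirroring the prediction check used in the first question above): print a few predictions before and after a round of training and see whether they change at all, and whether the relu output has collapsed to a constant.
preds = model.predict(np.array(X_val[:10]))
print(preds)  # if every value is identical (e.g. all 0.0 from the relu), the output has collapsed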