Neural Network isn't learning anything meaningful using Triplet Loss - tensorflow

I am working on a triplet-loss-based model for this Kaggle competition.
Short description: in this competition, we are challenged to build an algorithm to identify individual whales in images by analyzing a database of more than 25,000 images, gathered from research institutions and public contributors.
https://www.kaggle.com/c/humpback-whale-identification?rvi=1
I decided to use a Siamese network architecture and train it to produce encodings that I can then use to calculate the distance between two pictures of whales. If this distance is below a particular threshold, the two pictures belong to the same whale; if it is greater, they are different whales.
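For illustration, this is the matching rule I have in mind, as a minimal sketch (the threshold value here is a placeholder, and the embeddings are assumed to be L2-normalized):

import numpy as np

def same_whale(emb_a, emb_b, threshold=0.7):
    # Squared Euclidean distance between the two embedding vectors.
    dist = np.sum(np.square(emb_a - emb_b))
    return dist < threshold  # True -> same whale, False -> different whales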
This is the triplet loss function I used (learnt from Andrew Ng's deep learning specialization), except that I also L2-normalize the encodings to make the loss more interpretable (it is easier to choose a margin and a split point) across different models, if that makes sense. (I first tried it without the normalization, and when that didn't work, I tried normalizing.) I have also tried varying alpha (the margin) from 0.2 to 0.6.
import tensorflow as tf
from tensorflow.nn import l2_normalize as norm_l2

def triplet_loss(y_true, y_pred, alpha=0.3):
    """
    Arguments:
    y_true -- true labels, required when you define a loss in Keras; not used in this function.
    y_pred -- list containing three tensors:
        anchor   -- the encodings for the anchor images,   of shape (None, 128)
        positive -- the encodings for the positive images, of shape (None, 128)
        negative -- the encodings for the negative images, of shape (None, 128)
    Returns:
    loss -- real number, value of the loss
    """
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    anchor, positive, negative = norm_l2(anchor), norm_l2(positive), norm_l2(negative)
    # Step 1: compute the (encoding) distance between the anchor and the positive
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), axis=-1)
    # Step 2: compute the (encoding) distance between the anchor and the negative
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), axis=-1)
    # Step 3: subtract the two previous distances and add alpha
    basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
    # Step 4: take the maximum of basic_loss and 0.0, and sum over the training examples
    loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))
    return loss
This is an example of one of the model architectures I tried. So far I have tried pretrained FaceNet, ResNet, DenseNet and Xception, and I have tried freezing different numbers of layers in each.
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, BatchNormalization as bn, Input
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam

R = tf.keras.applications.ResNet50(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
lr = 0.0001
optimizer = Adam(learning_rate=lr)
R.compile(optimizer=optimizer, loss=triplet_loss)

for layer in R.layers[0:30]:
    layer.trainable = False

em_Rmodel = Sequential([
    R,
    GlobalAveragePooling2D(),
    # tf.keras.layers.GlobalMaxPooling2D(),
    Dense(512, activation='relu'),
    bn(),  # BatchNormalization
    Dense(256, activation='sigmoid'),
    Dense(128, activation='sigmoid')
])
def make_tripletModel(model):
    # I was manually changing the input shape to fit the default shape of pretrained networks
    A = Input(shape=(224, 224, 3), name='anchor')
    P = Input(shape=(224, 224, 3), name='anchorPositive')
    N = Input(shape=(224, 224, 3), name='anchorNegative')
    enc_A = model(A)
    enc_P = model(P)
    enc_N = model(N)
    tripletModel = Model(inputs=[A, P, N], outputs=[enc_A, enc_P, enc_N])
    return tripletModel
tripletModel = make_tripletModel(em_Rmodel)
I have been training using semi-hard triplets, and I have also been augmenting the data to generate more training images.
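For reference, by semi-hard I mean negatives that are farther from the anchor than the positive but still within the margin; a minimal sketch of the selection test (distances assumed precomputed):

def is_semi_hard(d_ap, d_an, margin=0.3):
    # d_ap: squared anchor-positive distance, d_an: squared anchor-negative distance.
    # Semi-hard: the negative is farther than the positive, but still violates the margin.
    return d_ap < d_an < d_ap + margin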
This is the batch generator I used for training. crop_batch is a function that crops images to show only the whale's tail, which is what one uses to identify whales. It uses a DenseNet trained on more than 1,000 images of whale tails with their bounding boxes, and it does the job sufficiently well.
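crop_batch itself is not shown here; as a crude stand-in so the generator below can run on its own, one could replace it with a plain resize (this placeholder skips the tail detection entirely):

import tensorflow as tf

def crop_batch(images, batch_size, img_shape=(224, 224, 3)):
    # Placeholder for the DenseNet-based tail cropper: just resizes the whole image.
    return tf.image.resize(images, img_shape[:2]).numpy()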
import numpy as np

def batch_generator_RN(batch_size=batch_size, ishape=(256, 256, 3), model_input_shape=(224, 224, 3)):
    # get_triplets() yields one (anchor, positive, negative) image triple at a time (defined elsewhere)
    triplet_generator = get_triplets()
    y_val = np.zeros((batch_size, 2, 1))
    anchors = np.zeros((batch_size, ishape[0], ishape[1], ishape[2]))
    positives = np.zeros((batch_size, ishape[0], ishape[1], ishape[2]))
    negatives = np.zeros((batch_size, ishape[0], ishape[1], ishape[2]))
    while True:
        for i in range(batch_size):
            anchors[i], positives[i], negatives[i] = next(triplet_generator)
        anc = crop_batch(anchors, batch_size=batch_size, img_shape=model_input_shape)
        pos = crop_batch(positives, batch_size=batch_size, img_shape=model_input_shape)
        neg = crop_batch(negatives, batch_size=batch_size, img_shape=model_input_shape)
        x_data = {'anchor': anc,
                  'anchorPositive': pos,
                  'anchorNegative': neg
                  }
        yield (x_data, [y_val, y_val, y_val])
And finally, this is how, in general, I have been training these models. I have tried reducing and increasing the learning rate; batch_size = 16.
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

lr = 0.0001
optimizer = Adam(learning_rate=lr)
tripletModel.compile(optimizer=optimizer, loss=triplet_loss)
es = EarlyStopping(monitor='loss', patience=20, min_delta=0.05, restore_best_weights=True)
# mc = ModelCheckpoint('Rmodel.h5', monitor='loss', save_best_only=True, save_weights_only=True)
rlr = ReduceLROnPlateau(monitor='loss', min_delta=0.05, factor=0.1, patience=5, verbose=1, min_lr=0)
gen = batch_generator_RN(batch_size)
tripletModel.fit(gen, steps_per_epoch=64, epochs=40, callbacks=[es, rlr])
After training all these models: in some, the triplet loss does go down for a while but then plateaus, and the model learns nothing meaningful (meaning that just by looking at the distance between two embeddings, I can't tell whether they are the same whale or not). In other models, the weights converge immediately after the first or second epoch and don't change at all, so nothing is learned.
I have tried a very wide range of learning rates, and I am fairly sure that the learning rate isn't the problem.
Please tell me if I should add all the code files for you to understand the problem better. The only reason I haven't done so yet is that I haven't cleaned the code up, but I will gladly do it if required. Thanks.

When you say that it doesn't learn anything, is it that the loss reaches a plateau and stops decreasing, or does it decrease significantly, yet when you predict, the embeddings of both same and different whales come out similar in value?
The triplet_loss() and batch_generator_RN() functions are correct; the problem is not related to the data generation.
However, I suspect that your learning rate is too high while you also freeze a lot of layers, i.e. numerous parameters are not trainable, which may leave your network unable to converge.
My suggestion is to unfreeze all the layers, decrease the learning rate to 0.00001, and start training again, regardless of the architecture you use (Xception/ResNet, etc.).
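A minimal sketch of that suggestion applied to the ResNet-based model above (all names taken from the question):

# Unfreeze every layer of the base network and recompile with a lower learning rate.
for layer in R.layers:
    layer.trainable = True

tripletModel.compile(optimizer=Adam(learning_rate=0.00001), loss=triplet_loss)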

Related

Output logits with softmax aren't extreme when using TensorFlow; no prediction is very confident

I trained a text classification model in TensorFlow and plotted the softmax confidence for correct predictions as well as for incorrect predictions. When I did this, I noticed that there were no output predictions with a high logit/class probability. For example, predicting between 4 classes gave these results:
(TensorFlow version)
input text: "Text that fits into class 0"
logits: [.3928, 0.2365, 0.1854, 0.1854]
label: class 0
I would hope that the logit output for class 0 would be higher than 0.3928! Looking at the graph, you can see that none of the predictions output a number higher than 0.5.
Next, I retrained the exact same model and dataset, but in PyTorch. With PyTorch, I got the results I was looking for. Both models had the exact same validation accuracy after training (90% val accuracy).
(PyTorch version)
input text: "Text that fits into class 0"
logits: [0.8778, 0.0532, 0.0056, 0.0635]
label: class 0
Here is what I understand about the softmax function: it transforms its inputs, which can be seen as logits, into a probability distribution over the classes. The maximum value of the softmax is 1, so if your largest logit is 0.5, the highest probability assigned by the softmax will be relatively low (less than 1/2).
To get more extreme outputs, you need larger logits. One way to do this is to train the model with more data and a more powerful architecture, so that it can learn more complex relationships between inputs and outputs. On the other hand, it may not be desirable for a model to have extremely high confidence in its predictions, as that could indicate overfitting to the training data. The appropriate level of confidence depends on the specific use case and the desired trade-off between precision and recall.
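To illustrate the first point with a standalone sketch (not part of either model below): scaling the same logits up makes the softmax distribution more extreme without changing the predicted class.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 0.3, 0.1, 0.2])
print(softmax(logits))      # moderate confidence for class 0
print(softmax(4 * logits))  # same ranking, far more extreme confidence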
My TensorFlow model:
from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny', from_pt=True)
encoder = TFAutoModel.from_pretrained('prajjwal1/bert-tiny', from_pt=True)

# Define input layers with token ids and attention mask
input_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="attention_mask")
# Call the BERT encoder with the inputs
pooler_output = encoder(input_ids, attention_mask=attention_mask)[1]  # [1] is the pooler output
# Define a dense layer on top of the pooled output
# params['fc_layer_size'] and params['dropout'] come from my config (not shown)
x = tf.keras.layers.Dense(units=params['fc_layer_size'])(pooler_output)
x = tf.keras.layers.Dropout(params['dropout'])(x)
outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
# Define a model with the inputs and dense layers
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.0008)
metrics = [tf.metrics.SparseCategoricalAccuracy()]
# Compile the model with the SGD optimizer defined above
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
My PyTorch model:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')
encoder = AutoModel.from_pretrained('prajjwal1/bert-tiny')

class EscalationClassifier(nn.Module):
    def __init__(self, encoder):
        super(EscalationClassifier, self).__init__()
        self.encoder = encoder
        self.fc1 = nn.Linear(128, 312)
        self.fc2 = nn.Linear(312, 4)
        self.dropout = nn.Dropout(0.2)

    def forward(self, input_ids, attention_mask):
        pooled_output = self.encoder(input_ids, attention_mask=attention_mask)[1]  # [0] is the last hidden state, [1] the pooler output
        x = self.fc1(pooled_output)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = EscalationClassifier(encoder)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0008)
Can anyone help me explain why my TensorFlow outputs aren't more confident, like the PyTorch outputs? (The problem doesn't seem to be with the softmax itself.)

Deep Learning model (LSTM) predicts same class label

I am trying to solve the Spoken Digit Recognition task using an LSTM model, where the audio files are converted into spectrograms and fed into an LSTM after Global Average Pooling. Here is its architecture:
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, GlobalAveragePooling1D, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

tf.keras.backend.clear_session()
# input layer
input_ = Input(shape=(64, 35))
lstm = LSTM(100, activation='tanh', return_sequences=True, kernel_regularizer=l2(0.000001),
            recurrent_initializer='glorot_uniform')(input_)
lstm = GlobalAveragePooling1D(data_format='channels_first')(lstm)
dense = Dense(20, activation='relu', kernel_regularizer=l2(0.000001), kernel_initializer='glorot_uniform')(lstm)
drop = Dropout(0.8)(dense)
dense1 = Dense(25, activation='relu', kernel_regularizer=l2(0.000001), kernel_initializer='he_uniform')(drop)
drop = Dropout(0.95)(dense1)
output = Dense(10, activation='softmax', kernel_regularizer=l2(0.000001), kernel_initializer='glorot_uniform')(drop)
model_2 = Model(inputs=[input_], outputs=output)
model_2.summary()
(The model summary image is omitted here.)
I need to calculate the F1 score to check the performance of the model. I have implemented a custom callback and have also used the TensorFlow Addons F1 score, but I don't get a correct result: I get a constant F1 score value for every epoch.
On further digging, I found out that my model predicts the same class label for the entire epoch, whereas it is supposed to predict among the 10 class labels present.
Here are my model.compile and model.fit commands (I used TensorFlow Addons here):
from tensorflow import keras
import tensorflow_addons as tfa

# F1 metric from TensorFlow Addons (exact arguments assumed; not shown in the question)
metric = [tfa.metrics.F1Score(num_classes=10, average='macro')]

opt = keras.optimizers.Adam(0.001, clipnorm=0.8)
model_2.compile(loss='categorical_crossentropy', optimizer=opt, metrics=metric)
hist = model_2.fit([X_train_spectrogram],
                   [y_train_converted],
                   validation_data=([X_test_spectrogram], [y_test_converted]),
                   epochs=10,
                   verbose=1,
                   callbacks=[tensorBoard_callbk2, ClearMemory()],  # defined elsewhere
                   # steps_per_epoch=3,
                   batch_size=32)
Here is what I mean by getting the same prediction: the entire prediction array is filled with the same values.
Why is the model predicting the same class label, and how do I rectify it?
I have tried increasing the number of trainable parameters, and increasing and decreasing the batch size, but it didn't help. If anyone knows, can you please help me out?

How to improve similarity learning neural network with low precision but high recall?

I am currently training a similarity learning neural network using triplet loss.
I have employed semi-hard negative mining to train the network to learn the features.
The trained model has high accuracy (recall = 88%), but it has low average precision.
Specifics:
Margin used in triplet loss = 1.6.
Embeddings are normalized using the L2 norm.
Image pairs with the lowest squared distance are identified as matches.
Scores are negated and sorted in decreasing order (lower distance represents higher confidence).
The PR (precision-recall) curve is plotted across the predictions after the scores are sorted in decreasing order of confidence (high-confidence scores are plotted first, followed by lower-confidence ones).
Confidence (or score) = -(squared distance between the two image-pair embeddings); a sketch of this setup follows below.
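As a concrete sketch of this scoring and PR-curve setup (scikit-learn based; emb_a, emb_b and y are placeholders for the pair embeddings and ground-truth match labels):

import numpy as np
from sklearn.metrics import precision_recall_curve

# emb_a, emb_b: L2-normalized embeddings of each candidate image pair, shape (n_pairs, d)
# y: 1 if the pair is a true match, 0 otherwise
scores = -np.sum(np.square(emb_a - emb_b), axis=1)  # negated squared distance = confidence
precision, recall, thresholds = precision_recall_curve(y, scores)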
Problem:
The PR curve shows a dip up to recall = 0.05, i.e. the highest-confidence scores are pretty bad.
This improves after recall = 0.05.
Question: how can I investigate and try to improve the precision of the highest-confidence scores?
Any thoughts or pointers?
What I have tried:
Tested for bugs; the code looks good, and the accuracy (high recall) is correct.
Tested the validation accuracy (pairs of triplets randomly visualized).
Lowered ALPHA to 0.2 (the default from the triplet loss paper), but it yields lower recall (accuracy) and lower average precision.
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# ALPHA, IM_SIZE, imsize and LR are global config values (defined elsewhere)

def triplet_loss(x, alpha=ALPHA):
    # Triplet loss function.
    anchor, positive, negative = x
    # distance between the anchor and the positive
    pos_dist = K.sum(K.square(anchor - positive), axis=1)
    # distance between the anchor and the negative
    neg_dist = K.sum(K.square(anchor - negative), axis=1)
    # compute loss
    basic_loss = pos_dist - neg_dist + alpha
    loss = K.maximum(basic_loss, 0.0)
    return loss

def identity_loss(y_true, y_pred):
    return K.mean(y_pred)

def my_norm(ip):
    return K.l2_normalize(ip, axis=-1)

def embedding_model():
    # used for the embedding model
    base_cnn = keras.applications.ResNet50(weights="imagenet", input_shape=IM_SIZE + (3,), include_top=False)
    flatten = keras.layers.Flatten()(base_cnn.output)
    drop1 = keras.layers.Dropout(rate=0.25)(flatten)
    dense1 = keras.layers.Dense(256, activation="relu")(drop1)
    dense1 = keras.layers.BatchNormalization()(dense1)
    output = keras.layers.Dense(256)(dense1)
    output = Lambda(my_norm)(output)
    # freeze everything before conv5_block1_out, fine-tune the rest
    trainable = False
    for layer in base_cnn.layers:
        if layer.name == "conv5_block1_out":
            trainable = True
        layer.trainable = trainable
    mdl = Model(inputs=base_cnn.input, outputs=output, name="Embedding")
    return mdl

def complete_model(base_model, alpha=0.2):
    # Create the complete model with three
    # embedding models and minimize the loss
    # between their output embeddings
    input_1 = Input((imsize, imsize, 3))
    input_2 = Input((imsize, imsize, 3))
    input_3 = Input((imsize, imsize, 3))
    A = base_model(input_1)
    P = base_model(input_2)
    N = base_model(input_3)
    loss = Lambda(triplet_loss)([A, P, N])
    model = Model(inputs=[input_1, input_2, input_3], outputs=loss)
    model.compile(loss=identity_loss, optimizer=Adam(LR))
    return model

def get_model_name():
    return "resnet50Reg0.8"

def preprocess(x):
    return keras.applications.resnet50.preprocess_input(x)

How to compute saliency map using keras backend

I am trying to construct a basic "vanilla gradient" saliency heatmap (gradient-based feature attribution) for MNIST using keras. I know there are libraries such as this one to compute saliency heatmaps, but I would like to construct this from scratch since the vanilla gradient approach seems conceptually straightforward to implement. I have trained the following digit classifier in Keras using functional model definition:
from tensorflow import keras
from tensorflow.keras import layers

input = layers.Input(shape=(28, 28, 1), name='input')
conv2d_1 = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(input)
maxpooling2d_1 = layers.MaxPooling2D(pool_size=(2, 2), name='maxpooling2d_1')(conv2d_1)
conv2d_2 = layers.Conv2D(64, kernel_size=(3, 3), activation='relu')(maxpooling2d_1)
maxpooling2d_2 = layers.MaxPooling2D(pool_size=(2, 2))(conv2d_2)
flatten = layers.Flatten(name='flatten')(maxpooling2d_2)
dropout = layers.Dropout(0.5, name='dropout')(flatten)
dense = layers.Dense(num_classes, activation='softmax', name='dense')(dropout)
model = keras.models.Model(inputs=input, outputs=dense)
Now, I want to compute the saliency map for a single MNIST image. Since the final layer has a softmax activation, and the denominator is a normalization term (so that the output nodes add up to 1), I believe I need to either take the pre-softmax output or change the activation of the trained model to linear before computing saliency maps. I will do the latter.
model.layers[-1].activation = tf.keras.activations.linear  # swap the activation to linear
input = model.layers[0].input
output = model.layers[-1].output
input_image = x_test[0]  # shape is (28, 28, 1)
pred = np.argmax(model.predict(np.expand_dims(input_image, axis=0)))  # predicted class
However, I am not sure what to do beyond this. I know I can use K.gradients(output, input) to compute gradients. That said, I believe I should compute the gradient of the predicted class with respect to the input image, rather than the gradient of the entire output. How would I do this? Also, I'm not sure how to evaluate the saliency heatmap for a specific image/prediction. I imagine I will have to use sess = tf.keras.backend.get_session() and sess.run(), but I'm not sure exactly how. I would greatly appreciate any help with completing the saliency heatmap code. Thanks!
If you add the activation as a separate layer after the last dense layer, i.e.:
keras.layers.Activation('softmax')
you can build a second model that stops at the pre-softmax output:
linear_model = keras.Model(inputs=model.input, outputs=model.layers[-2].output)
and then compute the gradients like this:
def get_saliency_map(model, image, class_idx):
    with tf.GradientTape() as tape:
        tape.watch(image)
        predictions = model(image)
        loss = predictions[:, class_idx]
    # Get the gradients of the loss w.r.t. the input image.
    gradient = tape.gradient(loss, image)
    # Take the maximum across the channels.
    gradient = tf.reduce_max(gradient, axis=-1)
    # Convert to numpy.
    gradient = gradient.numpy()
    # Normalize between 0 and 1.
    min_val, max_val = np.min(gradient), np.max(gradient)
    smap = (gradient - min_val) / (max_val - min_val + keras.backend.epsilon())
    return smap
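Hypothetical usage with the variables from the question (the image needs a batch dimension and must be a float tensor so the tape can watch it):

img = tf.convert_to_tensor(np.expand_dims(input_image, axis=0), dtype=tf.float32)
smap = get_saliency_map(linear_model, img, pred)  # saliency map of shape (1, 28, 28)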

How to define a TensorFlow graph with more than one input of different dims and combine the different-dim layers into one layer?

After setting each layer's name, my code below runs well.
=================== old ===============
How do I define a TensorFlow graph with more than one input of different dims?
For example, I have inputs (X1, X2, X3) with different dims (d1, d2, d3).
How do I define a multi-input layer combined with differently sized hidden-1 layers, then combine the three hidden-1 layers into a hidden-2 layer, followed by an output layer?
Thanks for all!
I tried some code like this:
def model_fn(features, labels, mode, params):
    input_layers = [tf.feature_column.input_layer(features=features, feature_columns=params["feature_columns"][i])
                    for i, fi in enumerate(FEA_DIM)]
    hidden1 = [tf.layers.dense(input_layers[i], H1_DIM[i], tf.nn.selu) for i, _ in enumerate(FEA_DIM)]
    hidden1_c = tf.concat(hidden1, -1, "concat")
    hidden2 = tf.layers.dense(inputs=hidden1_c, units=32, activation=tf.nn.selu)
    predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=tf.nn.softmax)
    labels = tf.contrib.layers.one_hot_encoding(labels, NCLASS)
    loss = tf.losses.sigmoid_cross_entropy(labels, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate=1)
    train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
But it doesn't work: the accuracy does not change during training.
(The TensorBoard model graph, where the dense_xx nodes are the hidden1 tensors, is omitted here.)
The biggest problem lies in these lines:
predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=tf.nn.softmax)
labels = tf.contrib.layers.one_hot_encoding(labels, NCLASS)
loss = tf.losses.sigmoid_cross_entropy(labels, predictions)
First, since you have multiple classes, you should use softmax_cross_entropy, or better, sparse_softmax_cross_entropy to dispense with the one-hot encoding.
Second, the input to softmax_cross_entropy or sigmoid_cross_entropy should be unnormalized scores, so activation=tf.nn.softmax is wrong. All deep learning frameworks combine the softmax/sigmoid with the cross entropy in one step because the combined operation has better performance and numerical stability, so you should not compute the softmax yourself first.
Third, your learning rate is too high. Even 0.0025 is, under most circumstances, still too high. You should start with 0.001 and then tune it up and down from there.
Finally, I don't understand why you apply dense layers first and then concat. Why not just concatenate all the features and then transform them together?
As for how to concat the layers, here is my complete running code as an example:
import math

input_layers = [tf.feature_column.input_layer(features=features, feature_columns=params["feature_columns"][i])
                for i, fi in enumerate(FEA_DIM)]
hidden1 = [tf.layers.dense(input_layers[i], H1_DIM[i], tf.nn.selu, name="h_1_%s" % i,
                           kernel_regularizer=tf.contrib.layers.l1_l2_regularizer(scale_l1=1e-3, scale_l2=1e-2),
                           kernel_initializer=tf.truncated_normal_initializer(stddev=1.0 / math.sqrt(H1_DIM[i] + FEA_DIM[i])))
           for i, _ in enumerate(FEA_DIM)]
hidden1_c = tf.concat(hidden1, -1, "concat")
hidden2 = tf.layers.dense(inputs=hidden1_c, units=128, activation=tf.nn.selu, name="h_2",
                          kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=1e-2),
                          kernel_initializer=tf.truncated_normal_initializer(stddev=1.0 / math.sqrt(128 + H1_DIM[i])))
predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=None,
                              kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=1e-2),
                              kernel_initializer=tf.truncated_normal_initializer(stddev=0.1), name="output")
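Following the advice above, the loss would then be computed directly from these unnormalized scores. A minimal sketch using the same TF 1.x API as the question (integer class labels assumed, so no one-hot encoding is needed):

# predictions holds raw logits (activation=None above)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=predictions)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())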