How to make use of class_weights to calculate a custom loss function while using a custom training loop (i.e. not using .fit) - tensorflow

I have written my custom training loop using tf.GradientTape(). My data has 2 classes. The classes are not balanced; class1 contributes almost 80% of the data and class2 the remaining 20%. Therefore, in order to correct for this imbalance, I was trying to write a custom loss function that takes the imbalance into account, applies the corresponding class weights, and calculates the loss, i.e. I want to use class_weights = [0.2, 0.8]. I am not able to find similar examples.
However, all the examples I am seeing use the model.fit approach, where it's easier to pass the class_weights. I am not able to find an example that uses class_weights with a custom training loop using tf.GradientTape.
I did go through the suggestions of using sample_weight; however, I don't have data for which I can specify per-sample weights, so my preference is to use class weights.
I am using BinaryCrossentropy as the loss function, but I want to weight the loss based on the class_weights. That's where I am stuck: how do I tell BinaryCrossentropy to take the class_weights into account?
Is my approach of using a custom loss function correct, or is there a better way to make use of class_weights while training with a custom training loop (not using model.fit)?

You can write your own loss function: inside it, call BinaryCrossentropy, multiply the result by the weight you want, and return that.
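A minimal sketch of that idea (the weight values are illustrative; it uses the unreduced form of BinaryCrossentropy so that each sample can be weighted by its true class):
import tensorflow as tf

# Hypothetical class weights: class 0 makes up ~80% of the data, class 1 ~20%.
class_weights = tf.constant([0.2, 0.8])

# reduction=NONE gives one loss value per sample instead of a batch scalar.
bce = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)

def weighted_bce(y_true, y_pred):
    per_sample = bce(y_true, y_pred)                   # shape: (batch,)
    idx = tf.cast(tf.reshape(y_true, [-1]), tf.int32)  # 0 or 1 per sample
    w = tf.gather(class_weights, idx)                  # weight of each sample's true class
    return tf.reduce_mean(w * per_sample)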

Here's an implementation that should work for n classes instead of just 2.
For your example of 80:20 split, calculate weights as below (assuming 100 samples in total).
Weight calculation (ref: Handling Class Imbalance: TensorFlow):
total_samples, num_classes = 100, 2
count_for_class_0, count_for_class_1 = 80, 20
weight_class_0 = (1 / count_for_class_0) * (total_samples / num_classes)  # class 0 (80%) -> 0.625
weight_class_1 = (1 / count_for_class_1) * (total_samples / num_classes)  # class 1 (20%) -> 2.5
class_wts = tf.constant([weight_class_0, weight_class_1])
Loss function: Requires labels to be sparse and logits unscaled (no activations applied).
# Example: logits = [[-3.2, 2.0], [1.2, 0.5], ...], (sparse) labels = [0, 1, ...]
def weighted_sparse_categorical_crossentropy(labels, logits, weights):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    class_weights = tf.gather(weights, labels)
    return tf.reduce_mean(class_weights * loss)
You can supply this loss function to custom training loops.
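For example, a sketch of how it might be plugged into a tf.GradientTape loop (model, optimizer and train_dataset are assumed to already exist):
for x_batch, y_batch in train_dataset:
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)  # unscaled outputs, shape (batch, num_classes)
        loss = weighted_sparse_categorical_crossentropy(y_batch, logits, class_wts)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))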

Related

Regression custom loss return value in Keras with and without custom loop

When a custom loss is defined in a Keras model, online sources seem to indicate that the loss should return an array of values (a loss for each sample in the batch), something like this:
def custom_loss_function(y_true, y_pred):
    squared_difference = tf.square(y_true - y_pred)
    return tf.reduce_mean(squared_difference, axis=-1)

model.compile(optimizer='adam', loss=custom_loss_function)
In the example above, I have no idea when, or if, the model takes the batch sum or mean with tf.reduce_sum() or tf.reduce_mean().
In another situation, when we want to implement a custom training loop with a custom loss function, the template to follow according to the Keras documentation is this:
for epoch in range(epochs):
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            y_batch_pred = model(x_batch_train, training=True)
            loss_value = custom_loss_function(y_batch_train, y_batch_pred)
        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
So by the book, if I understand correctly, we are supposed to take the mean of the batch gradients. Therefore, the loss value above should be a single value per batch.
However, the example will work with both of the following variations:
tf.reduce_mean(squared_difference, axis=-1) # array of loss for each sample
tf.reduce_mean(squared_difference) # mean loss for batch
So, why does the first option (array loss) above still work? Is apply_gradients applying small changes for each value sequentially? Is this wrong although it works?
What is the correct way without a custom loop, and with a custom loop?
Good question. In my opinion, this is not well documented in the TensorFlow/Keras API. By default, if you do not provide a scalar loss_value, TensorFlow will add the per-sample values up (the updates are not applied sequentially). Essentially, this is equivalent to summing the losses along the batch axis.
Currently, the losses in the TensorFlow API include a reduction argument (for example, tf.losses.MeanSquaredError) that allows specifying how to aggregate the loss along the batch axis.
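For example (a small sketch; the tensors are made up just to show the shapes):
import tensorflow as tf

# Default: the per-sample losses are averaged into one scalar per batch.
mse_mean = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)

# NONE: one loss value per sample is returned; you reduce it yourself.
mse_none = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)

y_true = tf.constant([[1.0], [2.0]])
y_pred = tf.constant([[1.5], [1.0]])
print(mse_mean(y_true, y_pred))  # scalar
print(mse_none(y_true, y_pred))  # shape (2,)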

Custom loss function in Keras that penalizes output from intermediate layer

Imagine I have a convolutional neural network to classify MNIST digits, such as this Keras example. This is purely for experimentation so I don't have a clear reason or justification as to why I'm doing this, but let's say I would like to regularize or penalize the output of an intermediate layer. I realize that the visualization below does not correspond to the MNIST CNN example and instead just has several fully connected layers. However, to help visualize what I mean let's say I want to impose a penalty on the node values in layer 4 (either pre or post activation is fine with me).
In addition to having a categorical cross entropy loss term which is typical for multi-class classification, I would like to add another term to the loss function that minimizes the squared sum of the output at a given layer. This is somewhat similar in concept to l2 regularization, except that l2 regularization is penalizing the squared sum of all weights in the network. Instead, I am purely interested in the values of a given layer (e.g. layer 4) and not all the weights in the network.
I realize that this requires writing a custom loss function using keras backend to combine categorical crossentropy and the penalty term, but I am not sure how to use an intermediate layer for the penalty term in the loss function. I would greatly appreciate help on how to do this. Thanks!
Actually, what you are interested in is regularization, and in Keras there are two different kinds of built-in regularization approaches available for most of the layers (e.g. Dense, Conv1D, Conv2D, etc.):
Weight regularization, which penalizes the weights of a layer. Usually, you can use kernel_regularizer and bias_regularizer arguments when constructing a layer to enable it. For example:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., kernel_regularizer=l1_l2, bias_regularizer=l1_l2)
Activity regularization, which penalizes the output (i.e. activation) of a layer. To enable this, you can use activity_regularizer argument when constructing a layer:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., activity_regularizer=l1_l2)
Note that you can set activity regularization through activity_regularizer argument for all the layers, even custom layers.
In both cases, the penalties are summed into the model's loss function, and the result would be the final loss value which would be optimized by the optimizer during training.
Further, besides the built-in regularization methods (i.e. L1 and L2), you can define your own custom regularizer method (see Developing new regularizers). As always, the documentation provides additional information which might be helpful as well.
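For instance, a minimal sketch of a custom regularizer (just a callable returning a scalar penalty; the function name and rate are illustrative) that penalizes the squared sum of a layer's output via activity_regularizer:
import tensorflow as tf

def squared_sum_penalty(rate=0.01):
    # Penalize the squared sum of the layer's output (activation).
    def regularizer(x):
        return rate * tf.reduce_sum(tf.square(x))
    return regularizer

layer = tf.keras.layers.Dense(64, activity_regularizer=squared_sum_penalty(0.01))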
Just specify the hidden layer as an additional output. As tf.keras.Models can have multiple outputs, this is totally allowed. Then define your custom loss using both values.
Extending your example:
input = tf.keras.Input(...)
x1 = tf.keras.layers.Dense(10)(input)
x2 = tf.keras.layers.Dense(10)(x1)
x3 = tf.keras.layers.Dense(10)(x2)
model = tf.keras.Model(inputs=[input], outputs=[x3, x2])
For the custom loss function, I think it's something like this:
def custom_loss(y_true, y_pred):
    x3, x2 = y_pred      # same order as outputs=[x3, x2]
    label = y_true       # you might need to provide a dummy target for x2
    return f1(x2) + f2(label, x3)  # whatever you want to do with f1, f2
Another way to add loss based on input or calculations at a given layer is to use the add_loss() API. If you are already creating a custom layer, the custom loss can be added directly to the layer. Or a custom layer can be created that simply takes the input, calculates and adds the loss, and then passes the unchanged input along to the next layer.
Here is the code taken directly from the documentation (in case the link is ever broken):
import tensorflow as tf
from tensorflow.keras.layers import Layer

class MyActivityRegularizer(Layer):
    """Layer that creates an activity sparsity regularization loss."""

    def __init__(self, rate=1e-2):
        super(MyActivityRegularizer, self).__init__()
        self.rate = rate

    def call(self, inputs):
        # We use `add_loss` to create a regularization loss
        # that depends on the inputs.
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return inputs
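Continuing from that snippet, a sketch of how the layer might be dropped into a model (the architecture here is purely illustrative): it passes its input through unchanged while adding the penalty to model.losses, which Keras sums into the training loss.
inputs = tf.keras.Input(shape=(784,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = MyActivityRegularizer(rate=1e-2)(x)   # adds the activity penalty as a side effect
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')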

Sparse annotation in U-Net

I am training a U-Net image segmentation on whole slide pathology images. I was wondering how can I handle un-annotated areas? I am working with huge tissues and it’s impossible to annotate all or the vast majority of the tissue, so I have annotations from a pathologist who has annotated selected tissue structures of interest to us. That means that in many tiles I’m generating, there is a segment that’s not annotated.
Would it affect the U-Net negatively by indirectly indicating that the un-annotated area is negative to one category or another, although it’s not negative? How do I handle this important case? Does it make sense to mask the image to only the annotated parts, such that un-annotated regions are black?
Thanks
One way to deal with this is to use a weighted loss function where you simply assign a weight of zero to the class that you don't want to include. Essentially, you're treating the unannotated area as an additional class that doesn't contribute to the loss function. You can find the GitHub repo to a fully functional Keras implementation here.
Specifically, I would use a weighted categorical cross-entropy loss function. You can find an implementation for Keras here:
import numpy as np
from keras import backend as K

def weighted_categorical_crossentropy(weights):
    """
    A weighted version of keras.objectives.categorical_crossentropy

    Variables:
        weights: numpy array of shape (C,) where C is the number of classes

    Usage:
        weights = np.array([0.5, 2, 10])  # class 1 weighted 0.5x, class 2 weighted 2x, class 3 weighted 10x
        loss = weighted_categorical_crossentropy(weights)
        model.compile(loss=loss, optimizer='adam')
    """
    weights = K.variable(weights)

    def loss(y_true, y_pred):
        # scale predictions so that the class probas of each sample sum to 1
        y_pred /= K.sum(y_pred, axis=-1, keepdims=True)
        # clip to prevent NaN's and Inf's
        y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
        # calculate the weighted cross-entropy
        loss = y_true * K.log(y_pred) * weights
        loss = -K.sum(loss, -1)
        return loss

    return loss
And you can then compile your model for training like this:
model.compile(optimizer='adam', loss=weighted_categorical_crossentropy(np.array([background_weight, foreground_weight, 0])), metrics=['accuracy'])
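To make the zero-weight trick concrete, here is a small sketch of how the ground-truth masks could be encoded (the class indices and weight values are illustrative): the un-annotated pixels get their own class index, and that class receives weight 0 so it never contributes to the loss.
import numpy as np

# 0 = background, 1 = structure of interest, 2 = un-annotated (ignored)
mask = np.array([[0, 0, 2],
                 [1, 1, 2],
                 [1, 0, 2]])
num_classes = 3
y_true = np.eye(num_classes)[mask]    # one-hot, shape (3, 3, 3)
weights = np.array([1.0, 5.0, 0.0])   # illustrative weights; weight 0 silences class 2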

Where exactly are the KL losses used after the forward pass?

I've noticed that the KL part of the loss is added to the list self._losses of the Layer class when self.add_loss is called from the call method of the DenseVariational layer (i.e. during the forward pass).
But how is this list self._losses (or the losses method of the same Layer class) treated during training? Where is it called from during training? For example, are they summed or averaged before being added to the final loss? I would like to see the actual code.
I would like to know how exactly these losses are combined with the loss that you specify when compiling and fitting the model. Can you provide me with the code that combines them? Note that I am interested in the Keras that is shipped with TensorFlow (because that's the one I am using).
Actually, the part where the total loss is computed is in the compile method of the Model class, specifically in this line:
# Compute total loss.
# Used to keep track of the total loss value (stateless).
# eg., total_loss = loss_weight_1 * output_1_loss_fn(...) +
# loss_weight_2 * output_2_loss_fn(...) +
# layer losses.
self.total_loss = self._prepare_total_loss(masks)
The _prepare_total_loss method adds the regularization and layer losses to the total loss (i.e. so all the losses are summed together) and then averages them over the batch axis in these lines:
# Add regularization penalties and other layer-specific losses.
for loss_tensor in self.losses:
    total_loss += loss_tensor

return K.mean(total_loss)
Actually, self.losses is not an attribute of the Model class; rather, it's an attribute of the parent class, i.e. Network, which returns all the layer-specific losses as a list. Further, to resolve any confusion, total_loss in the above code is a single tensor which is equal to the sum of all the losses in the model (i.e. loss function values and layer-specific losses). Note that loss functions by definition must return one loss value per input sample (not for the whole batch). Therefore, K.mean(total_loss) averages all these values over the batch axis into one final loss value, which is minimized by the optimizer.
As for tf.keras, this is more or less the same as in native Keras; however, the structure and flow of things are a bit different, as explained below.
First, in the compile method of the Model class, a loss container is created which holds and computes the values of the loss functions:
self.compiled_loss = compile_utils.LossesContainer(
    loss, loss_weights, output_names=self.output_names)
Next, in the train_step method of the Model class, this container is called to compute the loss value for a batch:
loss = self.compiled_loss(
    y, y_pred, sample_weight, regularization_losses=self.losses)
As you can see above, self.losses is passed to this container. As in the native Keras implementation, self.losses contains all the layer-specific loss values, with the only difference that in tf.keras it's implemented in the Layer class (instead of the Network class as in native Keras). Note that Model is a subclass of Network, which itself is a subclass of Layer. Now, let's see how regularization_losses is treated in the __call__ method of LossesContainer (these lines):
if (loss_obj.reduction == losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE or
        loss_obj.reduction == losses_utils.ReductionV2.AUTO):
    loss_value = losses_utils.scale_loss_for_distribution(loss_value)
loss_values.append(loss_value)
loss_metric_values.append(loss_metric_value)

if regularization_losses:
    regularization_losses = losses_utils.cast_losses_to_common_dtype(
        regularization_losses)
    reg_loss = math_ops.add_n(regularization_losses)
    loss_metric_values.append(reg_loss)
    loss_values.append(losses_utils.scale_loss_for_distribution(reg_loss))

if loss_values:
    loss_metric_values = losses_utils.cast_losses_to_common_dtype(
        loss_metric_values)
    total_loss_metric_value = math_ops.add_n(loss_metric_values)
    self._loss_metric.update_state(
        total_loss_metric_value, sample_weight=batch_dim)
    loss_values = losses_utils.cast_losses_to_common_dtype(loss_values)
    total_loss = math_ops.add_n(loss_values)
    return total_loss
As you can see, regularization_losses is added to total_loss, which holds the sum of the layer-specific losses plus the sum of the batch-averaged values of all the loss functions (therefore, it is a single value).
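As a quick way to see this from user code (a sketch, not the framework source): losses registered on layers, whether through activity_regularizer or add_loss(), show up in model.losses as a list, and that list is what gets summed into the total training loss as shown above.
import tensorflow as tf

inputs = tf.keras.Input(shape=(4,))
x = tf.keras.layers.Dense(8, activity_regularizer=tf.keras.regularizers.l2(0.01))(inputs)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
model.add_loss(0.001 * tf.reduce_sum(tf.square(x)))  # another layer-style loss

print(model.losses)  # the extra loss tensors that get added to the compiled loss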

How to define weight decay for individual layers in TensorFlow?

In CUDA ConvNet, we can write something like this (source) for each layer:
[conv32]
epsW=0.001
epsB=0.002
momW=0.9
momB=0.9
wc=0
where wc=0 refers to the L2 weight decay.
How can the same be achieved in TensorFlow?
You can add all the variables you want to apply weight decay to into a collection named 'variables', and then calculate the L2-norm weight decay over the whole collection.
# Create your variables
weights = tf.get_variable('weights', collections=['variables'])

with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(  # tf.pack was renamed to tf.stack in TF 1.0
            [tf.nn.l2_loss(i) for i in tf.get_collection('variables')]
        ),
        name='weights_norm'
    )

# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the collection losses
# ...

# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')
get_variable(
    name,
    shape=None,
    dtype=None,
    initializer=None,
    regularizer=None,
    trainable=True,
    collections=None,
    caching_device=None,
    partitioner=None,
    validate_shape=True,
    use_resource=None,
    custom_getter=None)
This is the signature of TensorFlow's get_variable function. You can easily specify a regularizer to do weight decay.
Following is an example:
weight_decay = tf.constant(0.0005, dtype=tf.float32) # your weight decay rate, must be a scalar tensor.
W = tf.get_variable(name='weight', shape=[4, 4, 256, 512], regularizer=tf.contrib.layers.l2_regularizer(weight_decay))
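Note that the regularizer passed to get_variable only registers the penalty in the REGULARIZATION_LOSSES collection; you still have to add it to the loss you minimize. A sketch (your_task_loss is a placeholder for your actual loss):
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
total_loss = your_task_loss + tf.add_n(reg_losses)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_loss)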
Both current answers are wrong in that they do not give you "weight decay as in cuda-convnet" but instead L2-regularization, which is different.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dw
gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay as for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" at page 10)
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step (a sketch of this follows below). A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    # define the network.

loss = # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then sum up and optimize with plain SGD, which, as shown above, corresponds to actual weight decay.
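For completeness, a rough sketch of the first work-around mentioned above (TF 1.x graph mode): an explicit decay op that shrinks selected weights after every optimizer step. The weight_decay value and the 'kernel' name filter are illustrative assumptions.
train_op = optimizer.minimize(loss, global_step=global_step)
with tf.control_dependencies([train_op]):
    decay_ops = [w.assign(w * (1.0 - weight_decay))      # w <- w - weight_decay * w
                 for w in tf.trainable_variables() if 'kernel' in w.name]
    train_op = tf.group(*decay_ops)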
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
Edit: see also this PR which just got merged into TF.
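For reference, using the decoupled-weight-decay optimizers from that PR looked roughly like the sketch below (the module path and argument names are from memory, so treat them as assumptions and check the tf.contrib.opt docs for your TF version):
# Assumed tf.contrib.opt interface (TF 1.x); verify against your installed version.
optimizer = tf.contrib.opt.AdamWOptimizer(weight_decay=1e-4, learning_rate=1e-3)
train_op = optimizer.minimize(loss, global_step=global_step)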