Where exactly are the KL losses used after the forward pass? - tensorflow

I've noticed that the KL part of the loss is added to the list self._losses of the Layer class when self.add_loss is called from the call method of the DenseVariational (i.e. during the forward pass).
But how is this list self._losses (or the losses property of the same Layer class) treated during training? Where is it used during training? For example, are the values summed or averaged before being added to the final loss? I would like to see the actual code.
I would like to know how exactly these losses are combined with the loss that you specify in the compile method (and that is minimized during fit). Can you provide me with the code that combines them? Note that I am interested in the Keras that is shipped with TensorFlow (because that's the one I am using).

Actually, the part where the total loss is computed is in the compile method of the Model class, specifically in these lines:
# Compute total loss.
# Used to keep track of the total loss value (stateless).
# eg., total_loss = loss_weight_1 * output_1_loss_fn(...) +
# loss_weight_2 * output_2_loss_fn(...) +
# layer losses.
self.total_loss = self._prepare_total_loss(masks)
The _prepare_total_loss method adds the regularization and layer-specific losses to the total loss (i.e. all the losses are summed together) and then averages the result over the batch axis, in these lines:
# Add regularization penalties and other layer-specific losses.
for loss_tensor in self.losses:
  total_loss += loss_tensor

return K.mean(total_loss)
Actually, self.losses is not an attribute of the Model class; rather, it's a property of the parent class, Network, which returns all the layer-specific losses as a list. Further, to resolve any confusion, total_loss in the code above is a single tensor which is equal to the sum of all the losses in the model (i.e. loss function values and layer-specific losses). Note that loss functions by definition must return one loss value per input sample (not for the whole batch). Therefore, K.mean(total_loss) averages all these values over the batch axis into one final loss value which is minimized by the optimizer.
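To make that concrete, here is a small runnable sketch (my own, not the Keras source) of what that computation amounts to, using placeholder numbers:
from tensorflow.keras import backend as K

# Conceptual sketch only: placeholder values, not the actual Keras internals.
per_sample_loss = K.constant([0.3, 0.7, 0.5])        # loss-function value per sample
layer_losses = [K.constant(0.1), K.constant(0.05)]   # e.g. KL terms registered via add_loss

total_loss = per_sample_loss                          # shape (batch_size,)
for layer_loss in layer_losses:                       # what the loop over self.losses does
    total_loss = total_loss + layer_loss              # scalar broadcast over the batch
final_loss = K.mean(total_loss)                       # single scalar given to the optimizer
print(K.eval(final_loss))                             # 0.5 + 0.1 + 0.05 = 0.65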
As for tf.keras, it is more or less the same as native Keras; however, the structure and flow of things is a bit different, as explained below.
First, in compile method of Model class a loss container is created which holds and computes value of loss functions:
self.compiled_loss = compile_utils.LossesContainer(
    loss, loss_weights, output_names=self.output_names)
Next, in train_step method of Model class this container is called to compute the loss value of a batch:
loss = self.compiled_loss(
    y, y_pred, sample_weight, regularization_losses=self.losses)
As you can see above, self.losses is passed to this container. As in the native Keras implementation, self.losses contains all the layer-specific loss values, with the only difference that in tf.keras it's implemented in the Layer class (instead of the Network class as in native Keras). Note that Model is a subclass of Network, which is itself a subclass of Layer. Now, let's see how regularization_losses is treated in the __call__ method of LossesContainer (these lines):
if (loss_obj.reduction == losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE or
    loss_obj.reduction == losses_utils.ReductionV2.AUTO):
  loss_value = losses_utils.scale_loss_for_distribution(loss_value)

loss_values.append(loss_value)
loss_metric_values.append(loss_metric_value)

if regularization_losses:
  regularization_losses = losses_utils.cast_losses_to_common_dtype(
      regularization_losses)
  reg_loss = math_ops.add_n(regularization_losses)
  loss_metric_values.append(reg_loss)
  loss_values.append(losses_utils.scale_loss_for_distribution(reg_loss))

if loss_values:
  loss_metric_values = losses_utils.cast_losses_to_common_dtype(
      loss_metric_values)
  total_loss_metric_value = math_ops.add_n(loss_metric_values)
  self._loss_metric.update_state(
      total_loss_metric_value, sample_weight=batch_dim)

  loss_values = losses_utils.cast_losses_to_common_dtype(loss_values)
  total_loss = math_ops.add_n(loss_values)
  return total_loss
As you can see, regularization_losses is added to total_loss, which therefore holds the sum of the layer-specific losses plus the batch-averaged values of all the loss functions (so it is a single value).
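If you want to verify this yourself, here is a minimal sketch (my own code, not from TensorFlow) with a hypothetical layer that registers a penalty via add_loss inside call, which is the same mechanism DenseVariational uses for its KL term:
import tensorflow as tf

# Hypothetical layer that mimics DenseVariational's pattern: it registers a
# scalar penalty via add_loss inside call(), i.e. during the forward pass.
class PenaltyDense(tf.keras.layers.Dense):
    def call(self, inputs):
        self.add_loss(0.01 * tf.reduce_sum(tf.square(self.kernel)))
        return super().call(inputs)

model = tf.keras.Sequential([
    PenaltyDense(4, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

x = tf.random.normal((8, 3))
_ = model(x)            # forward pass: call() runs and add_loss registers the penalty
print(model.losses)     # these tensors are what train_step passes as regularization_losses
During fit, the loss reported for a batch is then the batch mean of the mse plus the sum of these tensors, as in the LossesContainer code above.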

Related

Training RNN with error evaluation at every time step

I have a SimpleRNN / LSTM that I'm trying to train on a sequential classification task using TensorFlow. There is a sequence of data (300 time steps) that predicts a label at t=300. For my task I would like the RNN to evaluate the error at every time step (not just at the final time point) and propagate it backwards (as in the figure below).
After some responses below it seems I need to do a few things: use the return_sequences flag; use the TimeDistributed layer to access the output from the LSTM/RNN; and also define a custom loss function.
model = Sequential()
layer1 = LSTM(n_neurons, input_shape=(length, 1), return_sequences=True)
model.add(layer1)
layer2 = TimeDistributed(Dense(1))
model.add(layer2)

# Define custom loss
def custom_loss(layer1):
    # Create a loss function
    def loss(y_true, y_pred):
        # access layer1 at every time point and compute mean error
        # UNCLEAR HOW TO RUN AT EVERY TIME STEP
        err = K.mean(layer1(X) - y_true, axis=-1)
        return err
    # Return a function
    return loss

# Compile the model
model.compile(optimizer='adam', loss=custom_loss(layer1), metrics=['accuracy'])
For now I'm a bit confused about the custom_loss function, as it's not clear how I can pass in layer1 and compute the error inside the innermost loss function.
Anyone has a suggestion or can point me to a more detailed answer?
The question is not easy to answer since it is not clear what you're trying to achieve (it wouldn't be the same with a FFNN or an RNN, and what works best definitely depends on the application).
Anyway, you might be confusing the training steps (say, the forward and backward propagation over a minibatch of sequences) with the "internal" time steps of the RNN. A single sequence (or a single minibatch) will always "unroll" entirely through time during the forward pass before any output is made available: only afterwards (thus, at the end of that part of the training step) can you use the predictions and compute the losses to backpropagate.
What you can do is return sequences of outputs (one y_predicted for every internal time step) by including the argument return_sequences=True in SimpleRNN(...). This will give you a sequence of 300 predictions, each of which depends only on the past inputs with respect to the considered internal time step. You can then use the outputs that you need to compute the loss, possibly in a custom loss function.
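For example, a minimal sketch of that setup (the layer size and the binary cross-entropy loss are my assumptions, not something from the question):
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import SimpleRNN, TimeDistributed, Dense

length = 300  # time steps, as in the question

# With return_sequences=True the model emits one prediction per time step, so a
# standard per-element loss is evaluated at every step and averaged.
model = Sequential([
    SimpleRNN(32, input_shape=(length, 1), return_sequences=True),
    TimeDistributed(Dense(1, activation='sigmoid')),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
# y_train then needs shape (num_samples, 300, 1): a label for every time step.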
I hope I've been clear enough. Otherwise, let me know if I can help further.

How to make use of class_weights to calculate a custom loss function while using a custom training loop (i.e. not using .fit)

I have written my custom training loop using tf.GradientTape(). My data has 2 classes. The classes are not balanced; class1 data contributes almost 80% and class2 contributes the remaining 20%. Therefore, in order to remove this imbalance, I was trying to write a custom loss function which takes this imbalance into account, applies the corresponding class weights, and calculates the loss, i.e. I want to use class_weights = [0.2, 0.8]. I am not able to find similar examples.
However, all the examples I am seeing use the model.fit approach, where it's easier to pass the class_weights. I am not able to find an example which uses class_weights with a custom training loop using tf.GradientTape.
I did go through the suggestions of using sample_weight; however, I don't have data in which I can specify weights for individual samples, so my preference is to use class weights.
I am using BinaryCrossentropy as the loss function, but I want to change the loss based on the class_weights. That's where I am stuck: how do I tell BinaryCrossentropy to consider the class_weights?
Is my approach of using a custom loss function correct, or is there a better way to make use of class_weights while training with a custom training loop (not using model.fit)?
You can write your own loss function. In that loss function, call BinaryCrossentropy, multiply the result by the weight you want, and return that.
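A minimal sketch of that idea for the binary case (the [0.2, 0.8] weights are from the question; labels are assumed to be 0/1 with shape (batch, 1)):
import tensorflow as tf

class_weights = tf.constant([0.2, 0.8])
bce = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)

def weighted_bce(y_true, y_pred):
    # per-sample binary cross-entropy, shape (batch,)
    per_sample = bce(y_true, y_pred)
    # pick the weight of each sample's true class
    weights = tf.gather(class_weights, tf.cast(tf.squeeze(y_true, axis=-1), tf.int32))
    return tf.reduce_mean(weights * per_sample)

# Inside the custom loop:
#   with tf.GradientTape() as tape:
#       loss_value = weighted_bce(y_batch, model(x_batch, training=True))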
Here's an implementation that should work for n classes instead of just 2.
For your example of 80:20 split, calculate weights as below (assuming 100 samples in total).
Weight calculation (ref: Handling Class Imbalance: TensorFlow):
weight_class_0 = (1/count_for_class_0) * (total_samples / num_classes) # (80%) 0.625
weight_class_1 = (1/count_for_class_1) * (total_samples / num_classes) # (20%) 2.5
class_wts = tf.constant([weight_class_0, weight_class_1])
Loss function: Requires labels to be sparse and logits unscaled (no activations applied).
# Example logits=[[-3.2, 2.0], [1.2, 0.5], ...], (sparse)labels=[0, 1, ...]
def weighted_sparse_categorical_crossentropy(labels, logits, weights):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels, logits)
    class_weights = tf.gather(weights, labels)
    return tf.reduce_mean(class_weights * loss)
You can supply this loss function to custom training loops.
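For example (x_batch, y_batch, model and optimizer are placeholders for whatever your loop already defines):
# The model is assumed to output unscaled logits and y_batch to hold sparse
# integer class ids, as required by the loss above.
with tf.GradientTape() as tape:
    logits = model(x_batch, training=True)
    loss_value = weighted_sparse_categorical_crossentropy(y_batch, logits, class_wts)
grads = tape.gradient(loss_value, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))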

Regression custom loss return value in Keras with and without custom loop

When a custom loss is defined in a Keras model, online sources seem to indicate that the loss should return an array of values (a loss for each sample in the batch). Something like this:
def custom_loss_function(y_true, y_pred):
    squared_difference = tf.square(y_true - y_pred)
    return tf.reduce_mean(squared_difference, axis=-1)

model.compile(optimizer='adam', loss=custom_loss_function)
In the example above, I have no idea when or whether the model takes the batch sum or mean with tf.reduce_sum() or tf.reduce_mean().
In another situation, when we want to implement a custom training loop with a custom loss function, the template to follow according to the Keras documentation is this:
for epoch in range(epochs):
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            y_batch_pred = model(x_batch_train, training=True)
            loss_value = custom_loss_function(y_batch_train, y_batch_pred)
        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
So by the book, if I understand correctly, we are supposed to take the mean of the batch gradients. Therefore, the loss value above should be a single value per batch.
However, the example will work with both of the following variations:
tf.reduce_mean(squared_difference, axis=-1) # array of loss for each sample
tf.reduce_mean(squared_difference) # mean loss for batch
So, why does the first option (array loss) above still work? Is apply_gradients applying small changes for each value sequentially? Is this wrong although it works?
What is the correct way without a custom loop, and with a custom loop?
Good question. In my opinion, this is not well documented in the TensorFlow/Keras API. By default, if you do not provide a scalar loss_value, TensorFlow will add up its elements before computing gradients (and the updates are not applied sequentially). Essentially, this is equivalent to summing the losses along the batch axis.
Currently, the losses in the TensorFlow API include a reduction argument (for example, tf.losses.MeanSquaredError) that allows specifying how to aggregate the loss along the batch axis.
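For example, a quick sketch of how the reduction argument changes what the loss object returns:
import tensorflow as tf

mse_per_sample = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)
mse_batch_mean = tf.keras.losses.MeanSquaredError()  # default: SUM_OVER_BATCH_SIZE

y_true = tf.constant([[1.0], [2.0], [3.0]])
y_pred = tf.constant([[1.5], [2.0], [2.0]])
print(mse_per_sample(y_true, y_pred))  # per-sample losses, shape (3,)
print(mse_batch_mean(y_true, y_pred))  # a single scalar, averaged over the batch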

Sparse annotation in U-Net

I am training a U-Net for image segmentation on whole-slide pathology images. I was wondering how I can handle un-annotated areas. I am working with huge tissues and it’s impossible to annotate all or even the vast majority of the tissue, so I have annotations from a pathologist who has annotated selected tissue structures of interest to us. That means that in many tiles I'm generating, there is a segment that's not annotated.
Would it affect the U-Net negatively by indirectly indicating that the un-annotated area is a negative for one category or another, even though it isn't? How do I handle this important case? Does it make sense to mask the image to only the annotated parts, such that un-annotated regions are black?
Thanks
One way to deal with this is to use a weighted loss function where you simply assign a weight of zero to the class that you don't want to include. Essentially, you're treating the unannotated area as an additional class that doesn't contribute to the loss function. You can find the GitHub repo to a fully functional Keras implementation here.
Specifically, I would use a weighted categorical cross-entropy loss function. You can find an implementation for Keras here:
from keras import backend as K

def weighted_categorical_crossentropy(weights):
    """
    A weighted version of keras.objectives.categorical_crossentropy

    Variables:
        weights: numpy array of shape (C,) where C is the number of classes

    Usage:
        weights = np.array([0.5, 2, 10]) # Class one at 0.5, class 2 twice the normal weights, class 3 10x.
        loss = weighted_categorical_crossentropy(weights)
        model.compile(loss=loss, optimizer='adam')
    """
    weights = K.variable(weights)

    def loss(y_true, y_pred):
        # scale predictions so that the class probas of each sample sum to 1
        y_pred /= K.sum(y_pred, axis=-1, keepdims=True)
        # clip to prevent NaN's and Inf's
        y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
        # calc
        loss = y_true * K.log(y_pred) * weights
        loss = -K.sum(loss, -1)
        return loss

    return loss
And you can then compile your model for training like this:
model.compile(optimizer='adam', loss=weighted_categorical_crossentropy(np.array([background_weight, foreground_weight, 0])), metrics=['accuracy'])
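For completeness, a small sketch of how the un-annotated area could be encoded as that extra class (the class ids and example weights are assumptions that mirror the compile call above):
import numpy as np

# Assumed class ids: 0 = background, 1 = annotated structure, 2 = un-annotated.
mask = np.array([[0, 1, 2],
                 [2, 1, 0]])                 # (H, W) integer annotation per pixel
y_true = np.eye(3)[mask]                     # (H, W, 3) one-hot target for the U-Net
class_weights = np.array([0.5, 2.0, 0.0])    # last entry zeroes out the un-annotated
                                             # class in the weighted loss above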

Tensorflow CIFAR10 Multi GPU - Why Combined Loss?

In the TensorFlow CIFAR10 example, trained over multiple GPUs, the loss seems to be combined for each "tower", and the gradient is calculated from this combined loss.
# Build the portion of the Graph calculating the losses. Note that we will
# assemble the total_loss using a custom function below.
_ = cifar10.loss(logits, labels)
# Assemble all of the losses for the current tower only.
losses = tf.get_collection('losses', scope)
# Calculate the total loss for the current tower.
total_loss = tf.add_n(losses, name='total_loss')
# Attach a scalar summary to all individual losses and the total loss; do the
# same for the averaged version of the losses.
for l in losses + [total_loss]:
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
    tf.contrib.deprecated.scalar_summary(loss_name, l)

return total_loss
I'm new to TensorFlow, but from my understanding, every time cifar10.loss is called, tf.add_to_collection('losses', cross_entropy_mean) is run and the loss from the current batch is stored in the collection.
Then losses = tf.get_collection('losses', scope) is called, retrieving all the losses from the collection, and the tf.add_n op adds all the retrieved loss tensors from this "tower" together.
I expected the loss to be just from the current training step/batch, not all batches.
Am I misunderstanding something? Or is there a reason for combining the losses together?
If weight decay is enabled, it is also added to the losses collection.
Therefore, for each tower (scope), tf.add_n sums all of its losses: cross_entropy_mean and weight_decay.
Then gradients are calculated for each tower (scope). At the end, the gradients from the different towers (scopes) are averaged in average_gradients.
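For reference, the idea behind average_gradients is roughly the following (a sketch, not the tutorial's exact code):
import tensorflow as tf

def average_gradients(tower_grads):
    # tower_grads: one list of (grad, var) pairs per GPU/tower, same variable order.
    averaged = []
    for grads_and_vars in zip(*tower_grads):      # group the same variable across towers
        grads = tf.stack([g for g, _ in grads_and_vars])
        avg_grad = tf.reduce_mean(grads, axis=0)  # average the per-tower gradients
        _, var = grads_and_vars[0]                # the variable is shared across towers
        averaged.append((avg_grad, var))
    return averaged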
Why combined loss
The example you are referring to is an example of data parallelism over multiple GPUs. Data parallelism helps with training deeper models with a bigger batch_size. In this setting you need to combine the losses from the GPUs, as each GPU holds one part of the input batch (and the loss and gradients corresponding to that part). An illustration is provided in the TensorFlow data parallelism example.
Note: In the case of model parallelism, different subgraphs of the model run on separate GPUs and the intermediate outputs are collected by the master.
Example
If you want to train a deeper model (for example, ResNet/Inception) with a batch size of 256, it may not fit into a single GPU (for example, one with 8 GB of memory). So you can split the batch into two batches of size 128, do the forward pass of the model with the two batches on separate GPUs, and compute the loss and gradients on each. The computed (loss, gradients) from each of the GPUs are collected and averaged; the averaged gradient is used to update the model parameters.