How do I properly combine several loss functions that have different shapes in CNTK?

In CNTK, I'd like to combine several loss functions that have different shapes. The loss I'd like to use has four parts, and each contributes a gradient for training the network:
loss = rpn_loss_cls + rpn_loss_bbox + loss_cls + loss_bbox
where the individual shapes are
rpn_loss_cls: (33489,1)
rpn_loss_bbox: (33489,1)
loss_cls: (100,1)
loss_bbox: (100,1)
Obviously I can't just add them up. Do I have to stack them using splice before passing the loss to the trainer? Is there already a Python example that does this?

You could use reduce_sum or reduce_mean with all_axes() or all_static_axes() to reduce each loss to a scalar, then combine the scalars however you like.
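A minimal sketch of what that could look like, assuming rpn_loss_cls, rpn_loss_bbox, loss_cls and loss_bbox are your existing CNTK output nodes and that a model and learner are already set up (the names here are placeholders, not code from an official example):

import cntk as C

# Reduce each per-anchor / per-ROI loss to a scalar over all static axes ...
rpn_cls_scalar  = C.reduce_sum(rpn_loss_cls,  axis=C.Axis.all_static_axes())
rpn_bbox_scalar = C.reduce_sum(rpn_loss_bbox, axis=C.Axis.all_static_axes())
cls_scalar      = C.reduce_sum(loss_cls,      axis=C.Axis.all_static_axes())
bbox_scalar     = C.reduce_sum(loss_bbox,     axis=C.Axis.all_static_axes())

# ... then the scalars can simply be added (per-term weights could be applied here too).
total_loss = rpn_cls_scalar + rpn_bbox_scalar + cls_scalar + bbox_scalar

trainer = C.Trainer(model, (total_loss, None), [learner])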

Related

tf.keras multiple output model combined loss

I have a three-output network with three custom loss functions defined. During training, Keras returns the three loss values I would expect, but also an additional value, which I suspect is a combined loss. How is it defined, or what does it represent? I didn't find anything in the documentation; clarification is appreciated.
Also, if it really is a combined loss, does it just serve as an indicator, or does it affect the gradients in any way?
Implementation example:
losses = [my_loss(config1), my_loss(config2), my_loss(config3)]
model.compile(optimizer=optimizer, loss=losses, run_eagerly=False)
model.fit(...) # training returns 4 loss values - 'loss', 'my_loss1', 'my_loss2' and 'my_loss3'
EDIT:
Example loss training curves (plot not shown). It's clear that the sum of my losses is not the combined loss. And I do not use any weights in the compile method.

Convolutional Neural Network Loss

While calculating the loss function, can I manually calculate the loss like
Loss = tf.reduce_mean(tf.square(np.array(Prediction) - np.array(Y)))
and then optimize this loss using the Adam optimizer?
No.
TensorFlow loss functions typically accept tensors as input and also output a tensor, so np.array() wouldn't work.
In the case of CNNs, you'd generally come across loss functions like cross-entropy, softmax cross-entropy, sigmoid cross-entropy, etc. These are already built into the tf.losses module, so you can use them directly.
The loss function you're trying to apply looks like a mean-squared loss. This is built into tf.losses as well: tf.losses.mean_squared_error.
Having said that, I've also implemented a few loss functions like cross-entropy using a hand-coded formula such as -tf.reduce_mean(tf.reduce_sum(targets * logProb)). This works equally well, as long as the inputs targets and logProb are computed as tensors and not as numpy arrays.
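For illustration, here is a minimal TF 1.x-style sketch of the same idea; the placeholder shapes and the single dense layer are stand-ins for the question's CNN, not code from the question:

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=[None, 32])   # stand-in input
Y = tf.placeholder(tf.float32, shape=[None, 10])   # ground-truth tensor
Prediction = tf.layers.dense(X, 10)                # stand-in for your CNN output

# Hand-rolled mean-squared error, written purely with tensor ops (no np.array()):
loss_manual = tf.reduce_mean(tf.square(Prediction - Y))

# Equivalent built-in version:
loss_builtin = tf.losses.mean_squared_error(labels=Y, predictions=Prediction)

train_op = tf.train.AdamOptimizer(1e-3).minimize(loss_manual)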
No, you actually need to use tensors (e.g. a tf.Variable or an op's output) for the loss, not numpy arrays like np.array(Prediction), since TensorFlow evaluates these tensors inside its own engine.

TensorFlow softmax_cross_entropy_with_logits: are "labels" also trained (if differentiable)?

The softmax cross-entropy with logits loss function is used to reduce the difference between the logits and labels provided to the function. Typically, the labels are fixed for supervised learning and the logits are adapted. But what happens when the labels come from a differentiable source, e.g., another network? Do both networks, i.e., the "logits network" and the "labels network" get trained by the subsequent optimizer, or does this loss function always treat the labels as fixed?
TLDR: Does tf.nn.softmax_cross_entropy_with_logits() also provide gradients for the labels (if they are differentiable), or are they always considered fixed?
Thanks!
You need to use tf.nn.softmax_cross_entropy_with_logits_v2 to get gradients with respect to the labels.
The gradient is calculated from the loss provided to the optimizer. If the "labels" are coming from another trainable network, then yes, those will be modified, since they influence the loss. The correct way to use another network's outputs for your own is to define that network as untrainable, or to make a list of all variables you want to train and pass them to the optimizer explicitly.
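As a hedged sketch of both options (TF 1.x; the two toy networks and the scope names are purely illustrative, not from the original question):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 16])

# Two toy networks standing in for the "logits network" and the "labels network".
with tf.variable_scope('logits_net'):
    logits = tf.layers.dense(x, 5)
with tf.variable_scope('labels_net'):
    soft_labels = tf.nn.softmax(tf.layers.dense(x, 5))

# Option 1: explicitly block gradients from flowing into the labels network.
loss_fixed = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
    labels=tf.stop_gradient(soft_labels), logits=logits))

# Option 2: let _v2 backprop into both arguments, but hand the optimizer only
# the variables of the logits network.
loss_both = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
    labels=soft_labels, logits=logits))
logits_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='logits_net')
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss_both, var_list=logits_vars)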

what's the difference between softmax_cross_entropy_with_logits and losses.log_loss?

What's the primary difference between tf.nn.softmax_cross_entropy_with_logits and tf.losses.log_loss? Both methods accept one-hot labels and logits to calculate the cross-entropy loss for classification tasks.
Those methods are not so different in theory; however, they have a number of differences in implementation:
1) tf.nn.softmax_cross_entropy_with_logits is designed for single-class labels, while tf.losses.log_loss can be used for multi-class classification. tf.nn.softmax_cross_entropy_with_logits won't throw an error if you feed it multi-class labels, but your gradients won't be calculated correctly and training will most probably fail.
From official documentation:
NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.
2) tf.nn.softmax_cross_entropy_with_logits calculates (as its name suggests) the softmax function on top of your predictions first, while log_loss doesn't do this.
3) tf.losses.log_loss has slightly wider functionality, in the sense that you can weight each element of the loss function, or specify the epsilon used in the calculations to avoid log(0).
4) Finally, tf.nn.softmax_cross_entropy_with_logits returns the loss for every entry in the batch, while tf.losses.log_loss returns a reduced value (by default, the sum divided by the number of non-zero weights) which can be used directly in an optimizer.
UPD: Another difference is the way they calculate the loss. Logarithmic loss takes into account the negative classes (those where you have 0s in the label vector). In short, cross-entropy loss forces the network to produce the maximum input for the correct class and does not care about the negative classes. Logarithmic loss does both at the same time: it forces the correct classes to have larger values and the negative ones smaller. In mathematical terms it looks as follows:
Cross-entropy loss: L = -sum_i( y_i * log(p_i) )
Logarithmic loss: L = -sum_i( y_i * log(p_i) + (1 - y_i) * log(1 - p_i) )
where i is the corresponding class, y_i is the label (0 or 1) and p_i is the predicted (softmax) probability.
So for example, if you have labels=[1,0] and predictions_with_softmax = [0.7,0.3], then:
1) Cross-entropy loss: -(1 * log(0.7) + 0 * log(0.3)) = 0.3567
2) Logarithmic loss: -(1 * log(0.7) + (1-1) * log(1-0.7) + 0 * log(0.3) + (1-0) * log(1-0.3)) = -(log(0.7) + log(0.7)) = 0.7133
If you then use the default reduction of tf.losses.log_loss, you need to divide the hand-computed log-loss value by the number of non-zero weights (here it's 2). So finally: tf.losses.log_loss = 0.7133 / 2 = 0.3566.
In this case we get equal outputs; however, that is not always the case.
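If you want to check these numbers yourself, here is a quick TF 1.x-style sketch; the logits are chosen so that softmax(logits) equals [0.7, 0.3], matching the example:

import tensorflow as tf

labels = tf.constant([[1.0, 0.0]])
probs  = tf.constant([[0.7, 0.3]])
logits = tf.log(probs)   # softmax(log(p)) == p, so these logits reproduce the example

xent = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
logl = tf.losses.log_loss(labels=labels, predictions=probs)  # default reduction

with tf.Session() as sess:
    print(sess.run(xent))   # ~[0.3567]
    print(sess.run(logl))   # ~0.3566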
There are basically two differences between them:
1) The labels used in tf.nn.softmax_cross_entropy_with_logits are the one-hot version of the labels used in tf.losses.log_loss.
2) tf.nn.softmax_cross_entropy_with_logits calculates the softmax of the logits internally before computing the cross-entropy.
Notice that tf.losses.log_loss also accepts one-hot encoded labels, whereas tf.nn.softmax_cross_entropy_with_logits only accepts labels in one-hot encoding.
Hope this helps.

Do we need to add the regularization loss into the total loss in tensorflow models?

In the model definition, I passed kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=0.00001) to tf.layers.conv2d() to regularize the convolution kernels in each convolution layer.
My question is: to compute the total loss of the whole network for a batch of inputs, do we need to manually add the regularization loss as follows:
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
reg_constant = 0.01 # Choose an appropriate one.
loss = my_normal_loss + reg_constant * sum(reg_losses)
And if so, how do we determine reg_constant above? What's the relationship between scale and reg_constant? Thanks.
You are right.
You technically do not need reg_constant. You can control each layer's regularization via the scale param, which can be the same for all layers. In that case you can just set reg_constant=1.
The only advantage I see of using reg_constant over scale (i.e. multiplying the regularization loss by reg_constant) is perhaps the readability of your code.
If you're using a standard architecture, I suggest starting with reg_constant=1 and setting scale to some small scalar, say 0.001. If you have the resources, a better approach is to apply a grid search to find the value that empirically minimizes your validation loss, e.g. over [0.0001, 0.1].
If you suspect a particular layer should be regularized more aggressively, you can follow the first approach but set that layer's scale to a different value, then apply a grid search as before, only this time over the two different scale values.
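A hedged end-to-end sketch of this pattern (TF 1.x; the toy network, shapes, and scale value are illustrative only, not the question's actual model):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
y = tf.placeholder(tf.float32, shape=[None, 10])

conv = tf.layers.conv2d(
    x, filters=32, kernel_size=3, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=0.001))
logits = tf.layers.dense(tf.layers.flatten(conv), 10)

my_normal_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

# tf.layers only *collects* the per-layer regularization terms; you add them yourself.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = my_normal_loss + tf.add_n(reg_losses)   # reg_constant = 1, strength set via scale
# (tf.losses.get_regularization_loss() is an equivalent convenience sum.)

train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)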