TensorFlow softmax_crossentropy_with logits: are "labels" also trained (if differentiable)? - tensorflow

The softmax cross-entropy with logits loss function is used to reduce the difference between the logits and labels provided to the function. Typically, the labels are fixed for supervised learning and the logits are adapted. But what happens when the labels come from a differentiable source, e.g., another network? Do both networks, i.e., the "logits network" and the "labels network" get trained by the subsequent optimizer, or does this loss function always treat the labels as fixed?
TLDR: Does tf.nn.softmax_cross_entropy_with_logits() also provide gradients for the labels (if they are differentiable), or are they always considered fixed?
Thanks!

You need to use tf.softmax_cross_entropy_with_logits_v2 to get gradients with respect to labels.

The gradient is calculated from loss provided to the optimizer, if the "labels" are coming from another trainable network, then yes, these will be modified, since they influence the loss. The correct way of using another networks outputs for your own is to define it as untrainable, or make a list of all variables you want to train and pass them to the optimizer explicitly.

Related

Intersection over Union used as Metric or Loss

Im currently struggling to understand the use of the IoU. Is the IoU just a Metric to monitor the quality of a network, or is used as a loss function where the value has some impact on the backprop?
For a measure to be used as a loss function, it must be differentiable, with non-trivial gradients.
For instance, in image classification, accuracy is the most common measure of success. However, if you try to differentiate accuracy, you'll see that the gradients are zero almost everywhere and therefore one cannot train a model with accuracy as a loss function.
Similarly, IoU, in its native form, also has meaningless gradients and cannot be used as a loss function. However, extensions to IoU that preserve gradients exist and can be effectively used as a loss function for training.

where does class_weights or weighted loss penalize the network?

I am working on a Semantic segmentation project where I have to work on multiclass data which is highly imbalanced. I searched for optimizing it during training using the model.fit parameter and in that to use class_weights or sample_weights.
I can implement a following using a class_weight dictionary as
{ 0:1, 1:10,2:15 }
I also saw a method of updating weights in loss function
But at what point do these weights get updated?
If class_weights are used where will it get penalized? I already have a kernel_regularizer for each layer so if my classes have to be penalized based on my class weights then will it penalize the output of each layer y=Wx+b or only at the final layer?
Same if I use a weighted loss function will it get penalized only on the final layer before loss calculation or on each layer and then the final loss is calculated?
Any explanation on this would be very useful.
The class_weights you mentioned in your dictionary are there to account for your imbalanced data. They will never change, they are only there to increase the penalty for misclassified instances of minority classes (that way your network pays more attention to them and the gradients returned treat one 'Class2' instance as if it was 15 times more important than one 'Class0' instance).
The kernel_regularizer you mention resides at your loss function and penalizes large weight norms for weight matrices throughout the network (if you use kernel_regularizer = tf.keras.regularizers.l1(0.01) in a Dense layer, it only affects that layer). So that is a different weight that has nothing to do with classes, only with weights inside your network. Your eventual loss will be something like loss = Cross_entropy + a * norm(Weight_matrix) and that way the network will have as an additional task assigned to it to minimize the classification loss (cross entropy) while the weight norms remain low.

What does `training=True` mean when calling a TensorFlow Keras model?

In TensorFlow's offcial documentations, they always pass training=True when calling a Keras model in a training loop, for example, logits = mnist_model(images, training=True).
I tried help(tf.keras.Model.call) and it shows that
Help on function call in module tensorflow.python.keras.engine.network:
call(self, inputs, training=None, mask=None)
Calls the model on new inputs.
In this case `call` just reapplies
all ops in the graph to the new inputs
(e.g. build a new computational graph from the provided inputs).
Arguments:
inputs: A tensor or list of tensors.
training: Boolean or boolean scalar tensor, indicating whether to run
the `Network` in training mode or inference mode.
mask: A mask or list of masks. A mask can be
either a tensor or None (no mask).
Returns:
A tensor if there is a single output, or
a list of tensors if there are more than one outputs.
It says that training is a Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode. But I didn't find any information about this two modes.
In a nutshell, I don't know what is the influence of this argument. And what if I missed this argument when training?
Some neural network layers behave differently during training and inference, for example Dropout and BatchNormalization layers. For example
During training, dropout will randomly drop out units and correspondingly scale up activations of the remaining units.
During inference, it does nothing (since you usually don't want the randomness of dropping out units here).
The training argument lets the layer know which of the two "paths" it should take. If you set this incorrectly, your network might not behave as expected.
Training indicating whether the layer should behave in training mode or in inference mode.
training=True: The layer will normalize its inputs using the mean and variance of the current batch of inputs.
training=False: The layer will normalize its inputs using the mean and variance of its moving statistics, learned during training.
Usually in inference mode training=False, but in some networks such as pix2pix_cGAN‍‍‍‍‍‍ At both times of inference and training, training=True.

Convolutional Neural Network Loss

While Calculating the Loss Function. Can i manually Calculate Loss like
Loss = tf.reduce_mean(tf.square(np.array(Prediction) - np.array(Y)))
and then Optimize this Loss using Adam Optimizer
No.
Tensorflow loss functions typically accept tensors as input and also outputs a tensor. So np.array() wouldn't work.
In case of CNNs, you'd generally come across loss functions like cross-entropy, softmax corss-entropy, sigmoid cross-entropy etc. These are already in-built in tf.losses module. So you can use them directly.
The loss function that you're trying to apply looks like a Mean-squared loss. This is built in tf.losses as well. tf.losses.mean_squared_error.
Having said that, I've also implemented a few loss functions like cross-entropy using hand-coded formula such as: -tf.reduce_mean(tf.reduce_sum(targets * logProb)). This works equally fine, as long as the inputs targets and logProb are computed as tensors and not as numpy arrays.
No, actually you need to use tensor Variable for Loss, not use numpy.array(np.array(Prediction)).
Since tensorflow will eval these tensors in tensorflow engine.

How to cancel BP in some layers in tensorflow?

when I try to fine-tune a VGG network, I only want to update the weights after 5th convolution layers ,in caffe , we can cancel BP in configure file. What should I do in tensorflow ? thanks !
Just use tf.stop_gradient() on the input of your 5th layer. Tensorflow will not backpropagate the error below. tf.stop_gradient() is an operation that acts as the identity function in the forward direction, but stops the gradient in the backward direction.
From documentation:
tf.stop_gradient
Stops gradient computation.
When executed in a graph, this op outputs its input tensor as-is.
When building ops to compute gradients, this op prevents the
contribution of its inputs to be taken into account. Normally, the
gradient generator adds ops to a graph to compute the derivatives of a
specified 'loss' by recursively finding out inputs that contributed to
its computation. If you insert this op in the graph it inputs are
masked from the gradient generator. They are not taken into account
for computing gradients.
Otherwise you can use optimizer.minimize(loss, variables_of_fifth_layer). Here you are running backpropagation and updating only on the variables of your 5th layer.
For a fast selection of the variables of interest you could:
Define as trainable=False all the variables that you don't want to update, and use variables_of_fifth_layer=tf.trainable_variables().
Divide layers by defining specific scopes and then variables_of_fifth_layer = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,"scope/of/fifth/layer")