I'm currently struggling to understand the use of IoU. Is IoU just a metric to monitor the quality of a network, or is it also used as a loss function whose value has some impact on backprop?
For a measure to be used as a loss function, it must be differentiable, with non-trivial gradients.
For instance, in image classification, accuracy is the most common measure of success. However, if you try to differentiate accuracy, you'll see that the gradients are zero almost everywhere and therefore one cannot train a model with accuracy as a loss function.
Similarly, IoU, in its native form, also has meaningless gradients and cannot be used as a loss function. However, extensions to IoU that preserve gradients exist and can be effectively used as a loss function for training.
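For example, a common differentiable surrogate (often called a soft IoU or Jaccard loss) replaces the hard intersection and union counts with sums over the predicted probabilities. A minimal sketch in TensorFlow/Keras follows; the function name and the smoothing constant are my own choices, not from any particular paper:

```python
import tensorflow as tf

def soft_iou_loss(y_true, y_pred, smooth=1e-6):
    """Differentiable IoU (Jaccard) surrogate for binary segmentation.

    y_true: ground-truth mask in {0, 1}, shape (batch, H, W, 1)
    y_pred: predicted probabilities in [0, 1], same shape
    """
    y_true = tf.cast(y_true, y_pred.dtype)
    # Soft intersection/union: sums of probabilities instead of hard counts.
    intersection = tf.reduce_sum(y_true * y_pred, axis=[1, 2, 3])
    union = tf.reduce_sum(y_true + y_pred - y_true * y_pred, axis=[1, 2, 3])
    iou = (intersection + smooth) / (union + smooth)
    # Minimizing (1 - IoU) maximizes overlap and has non-trivial gradients.
    return 1.0 - iou
```

You can then pass soft_iou_loss to model.compile() as the loss while still reporting the hard IoU as a metric.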
Related
I'm new to Deep Learning and I saw this for the first time: having MAE as the loss function and MSE as the metric. What is the purpose of this and what is gained?
model.compile(loss=tf.keras.losses.MeanAbsoluteError(), metrics=[tf.keras.metrics.MeanSquaredError()])
In some cases it is useful to have a loss function different from the metric you are going to evaluate.
Consider the case in which you want to denoise an image: you design a network that takes a noisy image as input and outputs its clean version. Here, your metric might be the Peak Signal-to-Noise Ratio (PSNR) or some structural similarity measure (SSIM) between your output and the ground-truth clean image. However, during training you might choose a different loss function, such as L1 (MAE), L2 (MSE) or even a perceptual loss such as the VGG loss, because these have been shown to lead to better results than directly optimizing for PSNR or SSIM.
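As a concrete sketch of that split between loss and metric (the tiny model below is just a placeholder, and PSNR is only one possible metric), you could train a denoiser with an L1 loss while monitoring PSNR in Keras roughly like this:

```python
import tensorflow as tf

def psnr_metric(y_true, y_pred):
    # PSNR is used for monitoring only; it is not used for backprop here.
    return tf.image.psnr(y_true, y_pred, max_val=1.0)

# A stand-in image-to-image model; your real denoiser would go here.
denoiser = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                           input_shape=(None, None, 3)),
    tf.keras.layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
])

denoiser.compile(
    optimizer="adam",
    loss=tf.keras.losses.MeanAbsoluteError(),  # L1 drives the gradients
    metrics=[psnr_metric],                     # PSNR is only reported
)
```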
Under the hood, is a single gradient computed with respect to the whole batch, or is it the mean of gradients for each training pair? I'm writing a custom loss function and would like to include a loss component that is a function of the aggregate statistics over the batch. I'm wondering if this is consistent with the framework. My actual use case is complicated, but as an example, consider that I want my loss function to be whether the categories are correct (dog or cat) plus a term pushing for a 50/50 split between dog and cats in the batch. It's easy enough to program this into the loss function, but will the gradients do the right thing?
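To make the dog/cat example concrete, such a loss could be sketched as below; balance_weight and the binary dog-vs-cat setup are my own illustrative assumptions. Since the extra term is just a differentiable function of the whole batch, autodiff differentiates the resulting scalar like any other loss.

```python
import tensorflow as tf

def loss_with_batch_balance(y_true, y_pred, balance_weight=0.1):
    """Per-sample classification loss plus a batch-aggregate balance term.

    y_true: 0/1 labels (cat/dog), shape (batch, 1)
    y_pred: predicted probability of 'dog', shape (batch, 1)
    balance_weight: hypothetical weight for the batch-level term.
    """
    y_true = tf.cast(y_true, y_pred.dtype)
    # Ordinary per-sample term, averaged over the batch.
    per_sample = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(y_true, y_pred))

    # Batch-aggregate term: push the mean predicted 'dog' probability
    # toward 0.5. It depends on the whole batch, and autodiff simply
    # differentiates the resulting scalar, so it still backpropagates.
    batch_dog_fraction = tf.reduce_mean(y_pred)
    balance = tf.square(batch_dog_fraction - 0.5)

    return per_sample + balance_weight * balance
```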
I have used 100,000 samples to train a general model in Keras and achieved good performance. Then, for one particular sample, I want to use the trained weights as the initialization and continue optimizing the weights to further reduce the loss on that sample.
However, a problem occurred. First, I load the trained weights easily with the Keras API; then I evaluate the loss on the one particular sample, and it is close to the validation loss seen during training of the model, which seems normal. However, when I use the trained weights as the initialization and further optimize them on that one sample with model.fit(), the loss is really strange: it is much higher than the evaluate result and only gradually becomes normal after several epochs.
I find it strange that, for the same single sample and the same loaded weights, model.fit() and model.evaluate() return different results. I used batch normalization layers in my model and wonder whether that may be the reason. The result of model.evaluate() seems normal, as it is close to what I saw on the validation set before.
So what causes the difference between fit and evaluate? How can I solve it?
I think your core issue is that you are observing two different loss values during fit and evaluate. This has been discussed extensively in several related questions.
The fit() function loss includes contributions from:
Regularizers: L1/L2 regularization loss will be added during training, increasing the loss value
Batch norm variations: during fit(), batch normalization normalizes each batch with that batch's own mean and variance (while updating the running statistics), whereas evaluate() uses the stored running mean and variance, irrespective of whether the batch norm layers are set to trainable or not. So the normalized activations, and hence the loss, differ between the two (see the sketch at the end of this answer).
Multiple batches: the loss that fit() reports is a running average over the batches seen so far in the epoch. So if you take the average of the first 100 batches and then evaluate on the 100th batch only, the results will be different.
evaluate(), by contrast, just does a forward pass and reports the loss; nothing stochastic happens there.
The bottom line is that you should not compare training and validation loss (or fit and evaluate loss) directly; those functions do different things. Look at other metrics to check whether your model is training well.
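To illustrate the batch-norm point in isolation (a toy sketch, not the asker's actual model): the same single sample gives different losses depending on whether the batch norm layer runs in training or inference mode.

```python
import tensorflow as tf

# A toy model with batch normalization, purely for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal((1, 8))   # a single sample, as in the question
y = tf.random.normal((1, 1))

# Inference mode: batch norm uses its stored moving mean/variance,
# which is what model.evaluate() reports.
loss_eval_like = loss_fn(y, model(x, training=False))

# Training mode: batch norm normalizes with the statistics of this one
# sample (mean of a single value, variance ~0), which is what the loss
# inside model.fit() is computed from, hence the surprising values.
loss_fit_like = loss_fn(y, model(x, training=True))

print(float(loss_eval_like), float(loss_fit_like))
```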
The softmax cross-entropy with logits loss function is used to reduce the difference between the logits and the labels provided to the function. Typically, the labels are fixed for supervised learning and only the logits are adapted. But what happens when the labels come from a differentiable source, e.g., another network? Do both networks, i.e., the "logits network" and the "labels network", get trained by the subsequent optimizer, or does this loss function always treat the labels as fixed?
TLDR: Does tf.nn.softmax_cross_entropy_with_logits() also provide gradients for the labels (if they are differentiable), or are they always considered fixed?
Thanks!
You need to use tf.nn.softmax_cross_entropy_with_logits_v2 to get gradients with respect to the labels.
The gradient is calculated from the loss provided to the optimizer; if the "labels" come from another trainable network, then yes, that network will be modified, since its outputs influence the loss. The correct way to use another network's outputs as targets for your own is to mark that network as untrainable, or to make a list of all the variables you want to train and pass them to the optimizer explicitly.
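A rough sketch of the two options in current TensorFlow (in TF 2.x, tf.nn.softmax_cross_entropy_with_logits already has the _v2 behavior; the tiny Dense "networks" below are placeholders for the question's two networks):

```python
import tensorflow as tf

# Two tiny stand-ins for the "logits network" and the "labels network".
x = tf.random.normal((4, 8))
student = tf.keras.layers.Dense(3)
teacher = tf.keras.layers.Dense(3)

with tf.GradientTape(persistent=True) as tape:
    student_logits = student(x)
    teacher_probs = tf.nn.softmax(teacher(x))

    # The _v2 behavior backprops into `labels` as well as `logits`,
    # so both networks would receive gradients from this loss.
    loss_both = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=teacher_probs, logits=student_logits))

    # To treat the labels as fixed targets, cut the graph with
    # tf.stop_gradient so only the "logits network" is trained.
    loss_student_only = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.stop_gradient(teacher_probs), logits=student_logits))

# Gradients w.r.t. the teacher exist for the first loss and are None for the second.
print(tape.gradient(loss_both, teacher.trainable_variables))
print(tape.gradient(loss_student_only, teacher.trainable_variables))
```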
cifar10_multi_gpu_train.py
At this line, the loss for each tower in the multi-GPU setup is calculated.
However, these losses are not averaged, and it seems like only the loss from the last GPU is used as the returned loss.
Is this on purpose (if yes, why?) or is it a bug in the code?
At this line, note that each loss lives in a different name scope (tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i))); so, if I understand correctly, it is not that only the loss of the last GPU is used; rather, each GPU's loss is used under its corresponding name scope.
Each tower (corresponding to one GPU) has its own loss, which is used to calculate that tower's gradients. The losses are not averaged; instead, the gradients of all towers are averaged at line 196.
Note that in this figure from the tutorial there is no aggregation of the individual losses; it is the gradients that are averaged.
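For reference, the gradient-averaging step works roughly like the simplified sketch below (a paraphrase of the idea, not the verbatim tutorial code): each tower contributes its own (gradient, variable) pairs, and only the gradients are averaged.

```python
import tensorflow as tf

def average_gradients(tower_grads):
    """Average per-variable gradients elementwise over all towers.

    tower_grads: list (one entry per GPU) of lists of (gradient, variable)
    pairs, as returned by optimizer.compute_gradients() on each tower.
    The individual tower losses are never combined; only their gradients
    are averaged and then applied once to the shared variables.
    """
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        # grads_and_vars is e.g. ((grad0_gpu0, var0), (grad0_gpu1, var0), ...)
        grads = tf.stack([g for g, _ in grads_and_vars], axis=0)
        mean_grad = tf.reduce_mean(grads, axis=0)
        _, var = grads_and_vars[0]  # the variable is shared across towers
        averaged.append((mean_grad, var))
    return averaged
```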