Why would I choose a loss function differing from my metrics? - tensorflow

When I look through tutorials on the internet or at models posted here on SO, I often see that the loss function differs from the metrics used to evaluate the model. This might look like:
model.compile(loss='mse', optimizer='adadelta', metrics=['mae', 'mape'])
Anyhow, following this example, why wouldn't I optimize 'mae' or 'mape' as the loss instead of 'mse', when I don't even care about 'mse' among my metrics (hypothetically speaking, if this were my model)?

In many cases the metric you are interested in might not be differentiable, so you cannot use it as a loss. This is the case for accuracy, for example, where the cross-entropy loss is used instead because it is differentiable.
For metrics that are differentiable, you just want to get additional information out of the learning process, as each metric measures something different. For example, the MSE is on a squared scale relative to the data/predictions, so to get a value on the same scale you have to use the RMSE or the MAE. The MAPE gives you relative (not absolute) error, so all of these metrics measure something different that might be of interest.
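For instance, a minimal sketch (the model and data shapes are made up): MSE stays the optimized loss, while tf.keras.metrics.RootMeanSquaredError reports the error on the original scale of the targets and MAPE reports a relative error.

import tensorflow as tf

# Hypothetical regression model: MSE is the (differentiable, well-behaved) loss,
# RMSE/MAE are reported on the original scale of the targets, MAPE as relative error.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(
    loss='mse',
    optimizer='adadelta',
    metrics=[tf.keras.metrics.RootMeanSquaredError(), 'mae', 'mape'],
)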
In the case of accuracy, this metric is used because it is easily interpretable by a human, while the cross-entropy loss is less intuitive to interpret.

That is a very good question.
Knowing your modeling problem, you should choose a convenient loss function to minimize in order to achieve your goals.
But to evaluate your model, you will use metrics to report the quality of its generalization.
For many reasons, the evaluation part might differ from the optimization criterion.
To give you an example: in Generative Adversarial Networks, many papers suggest that minimizing an MSE loss leads to blurrier images, while MAE helps produce sharper output. You might want to track both of them in your evaluation to see how much this really changes things.
Another possible case is when you have a customized loss, but you still want to report the evaluation based on accuracy.
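As a sketch of that case (the label-smoothed cross-entropy here is just an assumed example of a customized loss):

import tensorflow as tf

# Customized loss (cross-entropy with label smoothing, purely as an example),
# while plain accuracy is still what gets reported during training/evaluation.
def smoothed_crossentropy(y_true, y_pred):
    return tf.keras.losses.categorical_crossentropy(
        y_true, y_pred, label_smoothing=0.1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(loss=smoothed_crossentropy, optimizer='adam', metrics=['accuracy'])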
I can also think of cases where you set up the loss function so that training converges faster or more stably, but you measure the quality of the model with some other metrics.
Hope this can help.

I just asked myself that question when I came across a GAN implementation that uses MAE as the loss. I already knew that some metrics are not differentiable and thought that MAE is an example, albeit only at x=0. So is there simply an exception there, like just assuming a slope of 0? That would make sense to me.
I also wanted to add that I learned to use MSE instead of MAE because small errors shrink when squared while bigger errors grow in relative magnitude. So bigger errors are penalized more with MSE.
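A quick check of which subgradient TensorFlow actually picks at the kink (a small self-contained snippet; in TF the gradient of |x| evaluates to sign(x), which is 0 at x = 0):

import tensorflow as tf

# MAE on a single prediction against a target of 0 is just |x|; check the
# gradient TensorFlow returns exactly at the non-differentiable point x = 0.
x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    loss = tf.abs(x)
grad = tape.gradient(loss, x)
print(grad.numpy())  # 0.0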

Related

GAN - loss and evaluation of model

I'm struggling with understanding how to "objectively" evaluate a GAN (that is, not simply looking at what it generates and saying "this looks good/bad").
My understanding is that the discriminator should get a head start and, in theory, discriminator loss and generator loss both ought to converge to 0.5 - at which point both are equally "good".
I'm currently training a model, and I get discriminator loss beginning at 0.7 but quickly converging toward 0.25, and generator loss beginning at 50 and converging toward 0.35 (possibly less with further training).
This doesn't entirely make sense. How can both be better than 0.5?
Are my loss functions incorrect, or what else am I missing? How should performance be measured?
In a GAN setting, it is normal for the losses to look better than that equilibrium, because you are only training one of the networks at a time (so it temporarily beats the other network).
You can evaluate the generated output with metrics such as PSNR, SSIM, FID, L2, LPIPS, or a VGG-based perceptual distance (depending on your particular task). How to objectively evaluate an image is still an ongoing area of research, and these measures are also commonly used as loss objectives in certain tasks.
I recommend looking at something like Analysis and Evaluation of Image Quality Metrics
I would also recommend looking at the generator metrics over time to see whether it is improving, and obviously confirming that visually as well. You can use logging or visualization tools such as TensorBoard or wandb to track the metric changes.
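For the pixel-level metrics, TensorFlow already ships PSNR and SSIM; a minimal sketch (random tensors stand in for real and generated batches; FID and LPIPS need extra packages and are omitted):

import tensorflow as tf

# Placeholder batches standing in for reference and generated images in [0, 1].
real = tf.random.uniform((8, 64, 64, 3))
fake = tf.random.uniform((8, 64, 64, 3))

psnr = tf.image.psnr(real, fake, max_val=1.0)  # per-image PSNR, higher is better
ssim = tf.image.ssim(real, fake, max_val=1.0)  # per-image SSIM, higher is better
print("mean PSNR:", float(tf.reduce_mean(psnr)))
print("mean SSIM:", float(tf.reduce_mean(ssim)))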

RNN/GRU Increasing validation loss but decreasing mean absolute error

I am new to deep learning and I try to implement an RNN (with 2 GRU layers).
At first, the network seems to do its job quite well. However, I am currently trying to understand the loss and accuracy curves. I attached the pictures below. The dark-blue line is the training set and the cyan line is the validation set.
After 50 epochs the validation loss increases. My assumption is that this indicates overfitting. However, I am unsure why the validation mean absolute error still decreases. Do you maybe have an idea?
One idea I had in mind was that this could be caused by some big outliers in my dataset. Thus I already tried to clean it up, and I also tried to scale it properly. I also added a few dropout layers for further regularization (rate=0.2). However, these are just normal dropout layers, because cuDNN does not seem to support recurrent_dropout in TensorFlow.
Remark: I am using the negative log-likelihood as the loss function and a TensorFlow Probability distribution as the output (dense) layer.
Any hints what I should investigate?
Thanks in advance
Edit: I also attached the non-probabilistic plot as recommended in the comment. It seems like here the mean absolute error behaves normally (it does not improve all the time).
What are the outputs of your model? It sounds pretty strange that you're using the negative log-likelihood (which basically "works" with distributions) as the loss function but MAE as a metric, which is suited to deterministic continuous values.
I don't know what your task is, and perhaps this is meaningful in your specific case, but perhaps the strange behavior comes from there.
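For reference, this kind of setup typically looks roughly like the sketch below (the Normal output distribution, layer sizes and feature count are assumptions, not the OP's code). Note that the NLL is computed on the predicted distribution, while the MAE metric is computed on values drawn from it, which is one way the two curves can move differently.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Negative log-likelihood of the targets under the predicted distribution.
negloglik = lambda y, dist: -dist.log_prob(y)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),                 # (timesteps, features), assumed
    tf.keras.layers.GRU(32, return_sequences=True),
    tf.keras.layers.GRU(32),
    tf.keras.layers.Dense(2),                        # location and scale parameters
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(t[..., 1:]))),
])
model.compile(loss=negloglik, optimizer='adam', metrics=['mae'])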

Unusual behavior of ADAM optimizer with AMSGrad

I am trying some 1-, 2-, and 3-layer LSTM networks to classify the land cover of some selected pixels from Landsat time-series spectral data. I tried different optimizers (as implemented in Keras) to see which of them is better, and generally found the AMSGrad variant of Adam doing a relatively better job in my case. However, one thing that is strange to me is that with AMSGrad, the training and test accuracies start at a relatively high value from the first epoch (instead of increasing gradually) and change only slightly after that, as you can see in the graphs below.
(Plots attached: performance of the Adam optimizer with AMSGrad on vs. AMSGrad off.)
I have not seen this behavior with any other optimizer. Does it indicate a problem in my experiment? What could explain this phenomenon?
Pay attention to the number of LSTM layers. They are notorious for easily overfitting the data. Try a smaller model initially (fewer layers), and gradually increase the number of units in a layer. If you notice poor results, then try adding another LSTM layer, but only after the previous step has been done.
As for the optimizers, I have to admit I have never used AMSGrad. However, the accuracy plot does look much better with AMSGrad off. You can see that when you use AMSGrad, the accuracy on the training set is much better than on the test set, which is a strong sign of overfitting.
Remember to keep things simple, experiment with simple models and generic optimizers.
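For completeness, in Keras the AMSGrad variant is just a flag on the Adam optimizer, so the two runs above should differ only in that argument (the learning rate shown is the default):

import tensorflow as tf

adam = tf.keras.optimizers.Adam(learning_rate=1e-3, amsgrad=False)
adam_amsgrad = tf.keras.optimizers.Adam(learning_rate=1e-3, amsgrad=True)

# e.g. model.compile(loss='categorical_crossentropy',
#                    optimizer=adam_amsgrad, metrics=['accuracy'])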

Neural Network High Confidence Inaccurate Predictions

I have trained a neural network on a classification task, and it is learning, although its accuracy is not high. I am trying to figure out which test examples it is not confident about, so that I can gain some more insight into what is happening.
In order to do this, I decided to use the standard softmax probabilities in TensorFlow. To do this, I called tf.nn.softmax(logits) and used the probabilities it provides. I noticed that many times the probabilities were 99%, yet the prediction was still wrong. As a result, even when I only consider examples whose predicted probability is higher than 99%, I get poor accuracy, only 2-3 percent higher than my original accuracy.
Does anyone have any ideas as to why the network is so confident about wrong predictions? I am still new to deep learning, so am looking for some ideas to help me out.
Also, is using the softmax probabilities the right way to determine the confidence of a neural network's predictions? If not, is there a better way?
Thanks!
Edit: From the answer below, it seems like my network is just performing poorly. Is there another way to identify which predictions the network makes are likely to be wrong besides looking at the confidence (since the confidence doesn't seem to work well)?
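For reference, the confidence filtering described above might look roughly like this (the random logits and labels are placeholders for the real test-set outputs and ground truth):

import numpy as np
import tensorflow as tf

# Placeholders standing in for the model's test-set logits and the true labels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10)).astype("float32")
labels = rng.integers(0, 10, size=1000)

probs = tf.nn.softmax(logits, axis=-1).numpy()
preds = probs.argmax(axis=-1)
confidence = probs.max(axis=-1)       # softmax "confidence" of each prediction

mask = confidence > 0.99
print("overall accuracy:", (preds == labels).mean())
if mask.any():
    print("accuracy on >99%-confident predictions:",
          (preds[mask] == labels[mask]).mean())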
Imagine your samples are split by a vertical line, but your NN classifier learnt a horizontal line; in this case any prediction given by your classifier will always sit at 50% accuracy. However, the NN will assign higher confidence to the samples that are further away from that horizontal line.
In short, when your model is doing a poor job at classification, higher confidence contributes little to nothing to accuracy.
Suggestion: check whether the information needed to make the correct classification is actually present in the data, then improve the overall accuracy first.

Neural network weights explode in linear unit

I am currently implementing a simple neural network and the backprop algorithm in Python with NumPy. I have already tested my backprop method using central differences, and the resulting gradients match.
However, the network fails to approximate a simple sine curve. The network has one hidden layer (100 neurons) with tanh activation functions and an output layer with a linear activation function. Each unit also has a bias input. The training is done by simple gradient descent with a learning rate of 0.2.
The problem is that the gradient gets larger with every epoch, and I don't know why. Furthermore, the problem remains unchanged if I decrease the learning rate.
EDIT: I have uploaded the code to pastebin: http://pastebin.com/R7tviZUJ
There are two things you can try, maybe in combination:
Use a smaller learning rate. If it is too high, you may be overshooting the minimum in the current direction by a lot, and so your weights will keep getting larger.
Use smaller initial weights. This is related to the first item. A smaller learning rate would fix this as well.
I had a similar problem (with a different library, DL4J), even in the case of extremely simple target functions. In my case, the issue turned out to be the cost function. When I changed from negative log likelihood to Poisson or L2, I started to get decent results. (And my results got MUCH better once I added exponential learning rate decay.)
It looks like you don't use regularization. If you train your network long enough, it will start to learn the exact data rather than the abstract pattern.
There are a couple of methods to regularize your network, such as early stopping, putting a high cost on large weights, or more complex ones such as dropout. If you search the web or books you will probably find many options for this.
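A minimal sketch of the "cost on large weights" idea, i.e. plain L2 weight decay folded into the gradient-descent update (grad_w would come from the existing backprop code; the constants are assumed values):

import numpy as np

LEARNING_RATE = 0.01
WEIGHT_DECAY = 1e-4   # strength of the L2 penalty

def sgd_step(w, grad_w):
    """Gradient-descent update for loss + (WEIGHT_DECAY / 2) * ||w||^2."""
    return w - LEARNING_RATE * (grad_w + WEIGHT_DECAY * w)

# Example: the penalty shrinks the weights even when the data gradient is zero.
w = np.array([5.0, -5.0])
print(sgd_step(w, np.zeros_like(w)))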
A learning rate that is too big can fail to converge, and can even diverge; that is the point.
The gradient can diverge for this reason: when you overshoot the position of the minimum, the resulting point may not just be a bit past it; it can end up even further from the minimum than you started, on the other side. Repeat the process, and it will keep diverging. In other words, the rate at which the gradient varies around the optimal position can simply be too big for the chosen learning rate.
Source: my understanding of the following video (watch near 7:30).
https://www.youtube.com/watch?v=Fn8qXpIcdnI&list=PLLH73N9cB21V_O2JqILVX557BST2cqJw4&index=10
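To make the divergence point concrete, here is a toy sketch on f(w) = w^2 (not the OP's network): the update is w <- w - lr*2w = (1 - 2*lr)*w, so for lr > 1 the factor has magnitude greater than 1 and the iterates blow up.

# Gradient descent on f(w) = w^2 with two learning rates: one that converges
# and one whose steps overshoot so far that the iterates grow without bound.
for lr in (0.2, 1.5):
    w = 1.0
    for _ in range(10):
        w -= lr * 2 * w            # gradient of w^2 is 2w
    print(f"lr={lr}: w after 10 steps = {w:.4g}")
# lr=0.2 -> w ~= 0.006 (converging); lr=1.5 -> w = 1024 (diverging)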