Neural Network High Confidence Inaccurate Predictions - tensorflow

I have trained a neural network on a classification task, and it is learning, although its accuracy is not high. I am trying to figure out which test examples it is not confident about, so that I can gain some more insight into what is happening.
In order to do this, I decided to use the standard softmax probabilities in TensorFlow: I called tf.nn.softmax(logits) and used the probabilities it provides. I noticed that many times the probabilities were 99%, but the prediction was still wrong. As such, even when I only consider examples whose prediction probabilities are higher than 99%, I get poor accuracy, only 2-3 percent higher than my original accuracy.
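For reference, this is roughly what I am doing (a simplified sketch in eager style; the logits here are just toy numbers):

    import tensorflow as tf

    # toy logits standing in for the output of the final dense layer:
    # shape [batch_size, num_classes]
    logits = tf.constant([[4.0, -1.0, 0.5],
                          [0.1, 0.2, 0.0]])

    probs = tf.nn.softmax(logits)              # per-class probabilities
    confidence = tf.reduce_max(probs, axis=1)  # top-1 probability per example
    prediction = tf.argmax(probs, axis=1)      # predicted class per example

    # keep only the "confident" examples, e.g. top-1 probability above 0.99
    confident_mask = confidence > 0.99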
Does anyone have any ideas as to why the network is so confident about wrong predictions? I am still new to deep learning, so I am looking for some ideas to help me out.
Also, is using the softmax probabilities the right way to determine the confidence of a network's predictions? If not, is there a better way?
Thanks!
Edit: From the answer below, it seems like my network is just performing poorly. Is there another way to identify which of the network's predictions are likely to be wrong, besides looking at the confidence (since the confidence doesn't seem to work well)?

Imagine your samples are split by a vertical line, but your NN classifier learnt a horizontal line. In this case, any prediction given by your classifier can only ever achieve 50% accuracy. However, the NN will assign higher confidence to the samples that are further away from the horizontal line.
In short, when your model is doing poor classification, higher confidence contributes little to nothing to accuracy.
Suggestion: check whether the information needed to make the correct classification is actually in the data, then improve the overall accuracy first.
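To make this concrete, here is a small self-contained sketch (purely synthetic data) of a classifier whose boundary is oriented the wrong way: its accuracy hovers around 50%, yet its confidence is high for points far from its (wrong) boundary:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] > 0).astype(int)       # true boundary: the vertical line x0 = 0

    # a "bad" model that learnt the horizontal line x1 = 0 instead;
    # its logit grows with distance from its own (wrong) boundary
    logit = 10.0 * X[:, 1]
    p = 1.0 / (1.0 + np.exp(-logit))    # sigmoid "probability" of class 1
    pred = (p > 0.5).astype(int)

    conf = np.maximum(p, 1 - p)         # confidence of the predicted class
    hi = conf > 0.99                    # the "very confident" subset
    print((pred == y).mean())           # ~0.5 overall accuracy
    print((pred[hi] == y[hi]).mean())   # still ~0.5 on the confident subset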

Related

RNN/GRU Increasing validation loss but decreasing mean absolute error

I am new to deep learning and I am trying to implement an RNN (with 2 GRU layers).
At first, the network seems to do its job quite fine. However, I am currently trying to understand the loss and accuracy curves. I attached the pictures below. The dark-blue line is the training set and the cyan line is the validation set.
After 50 epochs the validation loss increases. My assumption is that this indicates overfitting. However, I am unsure why the validation mean absolute error still decreases. Do you maybe have an idea?
One idea I had in mind was that this could be caused by some big outliers in my dataset, so I already tried to clean it up. I also tried to scale it properly, and I added a few dropout layers for further regularization (rate=0.2). However, these are just normal dropout layers, because cuDNN does not seem to support recurrent_dropout in TensorFlow.
Remark: I am using the negative log-likelihood as the loss function and a TensorFlow Probability distribution as the output layer.
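For context, the output layer and loss look roughly like this (a simplified sketch; the layer sizes and feature count are placeholders):

    import tensorflow as tf
    import tensorflow_probability as tfp

    tfd = tfp.distributions

    model = tf.keras.Sequential([
        tf.keras.layers.GRU(32, return_sequences=True, input_shape=(None, 8)),
        tf.keras.layers.GRU(32),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(2),  # parameters of the output distribution
        tfp.layers.DistributionLambda(
            lambda t: tfd.Normal(loc=t[..., :1],
                                 scale=1e-3 + tf.math.softplus(t[..., 1:]))),
    ])

    # negative log-likelihood of the target under the predicted distribution
    negloglik = lambda y, rv_y: -rv_y.log_prob(y)
    model.compile(optimizer='adam', loss=negloglik, metrics=['mae'])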
Any hints what I should investigate?
Thanks in advance
Edit: I also attached the non-probabilistic plot as recommended in the comment. It seems that here the mean absolute error behaves normally (it does not improve all the time).
What are the outputs of your model? It sounds pretty strange that you're using the negative log-likelihood (which basically "works" with distributions) as the loss function, but MAE as a metric, which is suited for deterministic continuous values.
I don't know what your task is, and perhaps this is meaningful in your specific case, but perhaps the strange behavior comes from there.

Why would I choose a loss-function differing from my metrics?

When I look through tutorials on the internet, or at models posted here on SO, I often see that the loss function differs from the metrics used to evaluate the model. This might look like:
model.compile(loss='mse', optimizer='adadelta', metrics=['mae', 'mape'])
Anyhow, following this example, why wouldn't I optimize 'mae' or 'mape' as the loss instead of 'mse', when I don't even care about 'mse' in my metrics (hypothetically speaking, if this were my model)?
In many cases the metric you are interested in might not be differentiable, so you cannot use it as a loss. This is the case for accuracy, for example, where the cross-entropy loss is used instead because it is differentiable.
For metrics that are already differentiable, you just want to get additional information from the learning process, as each metric measures something different. For example, the MSE has a scale that is the square of the scale of the data/predictions, so to get back to the same scale you have to use the RMSE or the MAE. The MAPE gives you relative (not absolute) error. All of these metrics measure something different that might be of interest.
In the case of accuracy, this metric is used because it is easily interpretable by a human, while the cross-entropy loss is less intuitive to interpret.
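As a toy illustration of the scale point (the numbers are made up):

    import numpy as np

    y_true = np.array([100.0, 200.0, 300.0])
    y_pred = np.array([110.0, 190.0, 330.0])

    err = y_pred - y_true
    mse  = np.mean(err ** 2)                    # 366.67, in squared units
    rmse = np.sqrt(mse)                         # 19.15, back on the data scale
    mae  = np.mean(np.abs(err))                 # 16.67, also on the data scale
    mape = np.mean(np.abs(err / y_true)) * 100  # 8.33, relative error in percent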
That is a very good question.
Knowing your modeling problem, you should choose a convenient loss function to minimize in order to achieve your goals.
But to evaluate your model, you will use metrics to report the quality of its generalization.
For many reasons, the evaluation part might differ from the optimization criterion.
To give you an example: in Generative Adversarial Networks, many papers suggest that minimizing an MSE loss leads to fuzzier images, while MAE helps to get a clearer output. You might want to trace both of them in your evaluation to see how each really changes things.
Another possible case is when you have a customized loss, but you still want to report the evaluation based on accuracy.
I can think of cases where you set the loss function in a way that converges faster or more stably, but where you still measure the quality of the model with some other metrics.
Hope this can help.
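To make the GAN example concrete, a minimal sketch (the model is a stand-in; the point is the compile call): optimize MAE, but keep tracing MSE as a metric:

    import tensorflow as tf

    # a stand-in model; in the GAN case this would be the generator
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(16,))])

    # minimize MAE (the sharper-output choice in the example above),
    # but still report MSE during training and evaluation
    model.compile(loss='mae', optimizer='adam', metrics=['mse'])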
I just asked myself that question when I came across a GAN implementation that uses MAE as the loss. I already knew that some metrics are not differentiable and thought that MAE is an example, albeit only at x = 0. So is there simply an exception, like just assuming a slope of 0? That would make sense to me.
I also wanted to add that I learned to use MSE instead of MAE, because a small error stays smaller when squared, while bigger errors increase in relative magnitude. So bigger errors are penalized more with MSE.

Why is the graph of mAP not ascending as training steps increase?

I trained my own SSD COCO model with 1000 training pictures and 100 test pictures. I was just curious why the mAP is not directly proportional to the number of training steps, i.e. why it is lower at certain training steps, as shown in the image below.
Neural network optimizer functions such as gradient descent and its variations (http://ruder.io/optimizing-gradient-descent/) attempt to update the weights of your model at each time step in such a way as to get closer to the smallest possible loss. Sometimes the optimizer steps in the wrong direction; sometimes it steps in the right direction, but the step is too big, so it steps right past the minimum.
Sophisticated optimizer functions such as Adam try to mitigate this problem by making the steps more consistent and progressively smaller over time.
What you are seeing above is therefore completely normal: the mAP jumps up and down, but over time it increases.
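A tiny illustration of the overshooting effect on the one-dimensional loss f(w) = w^2 (made-up numbers, plain gradient descent):

    # gradient descent on f(w) = w^2, whose gradient is f'(w) = 2w
    def descend(lr, w=1.0, steps=5):
        path = [w]
        for _ in range(steps):
            w = w - lr * 2 * w
            path.append(w)
        return path

    print(descend(lr=0.1))  # 1.0, 0.8, 0.64, ...   smooth approach to the minimum
    print(descend(lr=0.9))  # 1.0, -0.8, 0.64, ...  overshoots and oscillates
    print(descend(lr=1.1))  # 1.0, -1.2, 1.44, ...  step too big: it diverges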

Unusual behavior of ADAM optimizer with AMSGrad

I am trying some 1-, 2-, and 3-layer LSTM networks to classify the land cover of some selected pixels from Landsat time-series spectral data. I tried different optimizers (as implemented in Keras) to see which of them is better, and generally found the AMSGrad variant of Adam doing a relatively better job in my case. However, one strange thing to me is that for the AMSGrad variant, the training and test accuracies start at a relatively high value from the first epoch (instead of increasing gradually), and they change only slightly after that, as you can see in the graphs below.
[Plot: performance of the Adam optimizer with AMSGrad on]
[Plot: performance of the Adam optimizer with AMSGrad off]
I have not seen this behavior in any other optimizer. Does it show a problem in my experiment? What can be the explanation for this phenomenon?
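For reference, this is roughly how I toggle the variant in Keras (the learning rate is just a placeholder):

    from tensorflow import keras

    opt_amsgrad = keras.optimizers.Adam(learning_rate=1e-3, amsgrad=True)
    opt_plain   = keras.optimizers.Adam(learning_rate=1e-3, amsgrad=False)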
Pay attention to the number of LSTM layers. They are notorious for easily overfitting the data. Try a smaller model initially (fewer layers), and gradually increase the number of units in a layer. If you notice poor results, then try adding another LSTM layer, but only after the previous step has been done.
As for the optimizers, I have to admit I have never used AMSGrad. However, the accuracy plot does look much better with AMSGrad off. You can see that when you use AMSGrad, the accuracy on the training set is much better than that on the test set, which is a strong sign of overfitting.
Remember to keep things simple, experiment with simple models and generic optimizers.
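As a concrete starting point in that spirit (a sketch; the sizes and feature count are placeholders to tune):

    import tensorflow as tf

    num_classes = 5  # placeholder

    # start simple: a single modest LSTM layer and a generic optimizer
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(None, 6)),  # (timesteps, features)
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])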

High variability in the loss of neural networks

I'm getting really high variability in both the accuracy and the loss between epochs, as high as 10%. It happens with my accuracy all the time, and with my loss once I start adding in dropout. However, I really need the dropout; any ideas on how to smooth it out?
It is hard to say anything concrete without knowing what you do. But since you mentioned that your dataset is very small (500 samples), your 10% performance jumps are not surprising. Still, a few ideas:
definitely use a bigger dataset if you can; if it is not possible to collect a bigger dataset, try to augment whatever you have
try a smaller dropout rate and see how it goes, and try different regularizers; dropout is not the only option (see the sketch after this list)
your data is small, so you can afford to run more than 200 iterations
see how your model performs on the test set; it is possible that it just severely overfitted the data
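For the regularizer point, a sketch of one alternative combination to try (the rates and sizes are placeholders, not tuned values):

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    # a smaller dropout rate plus L2 weight decay as an extra regularizer
    model = tf.keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=(20,),
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.2),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])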
Besides the fact that the dataset is very small: during training with dropout regularization, the loss function is no longer well defined, and I presume the accuracy is also biased. Therefore, any tracked metric should be assessed without dropout. It seems that Keras does not switch dropout off while calculating the accuracy during training.
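A quick way to check this, assuming a Keras model (a self-contained toy sketch): the metrics printed during fit() come from forward passes with dropout active, whereas evaluate() runs the network in inference mode with dropout switched off:

    import numpy as np
    import tensorflow as tf

    # toy data and a small model with dropout (shapes are placeholders)
    x = np.random.rand(500, 20).astype('float32')
    y = (x.sum(axis=1) > 10).astype('float32')

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

    # metrics reported during fit() use dropout-active forward passes
    model.fit(x, y, epochs=5, verbose=0)

    # evaluate() disables dropout; these are the numbers to trust
    loss, acc = model.evaluate(x, y, verbose=0)
    print(loss, acc)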