Keras model has very good loss after 1 epoch but doesn't really get better with more epochs

Hello, I just wanted to ask this theoretical question.
What could cause a model to reach a very good loss (0.004 on normalized data) after a single epoch, while the loss barely decreases afterwards (after 10 epochs it is still 0.0032)?
Shouldn't it normally decrease much more over time?
The dataset is fairly big, with a bit more than a million data points, and I didn't expect such a good loss after just 1 epoch.
So what could I change about this model, or what am I doing wrong? (It's a densely connected NN doing regression with Adam and MSE.)

There are multiple possibilities, but the problem needs some clarification.
Could you specify the range of your target?
0.004 might sound low as a loss, but it's not if your target ranges from 0 to 0.0001 for example.
What are the metrics on your validation and test sets? The training loss by itself does not say much without the validation loss.
If the 0.004 is too good to be true, your model might be overfitting.
Try adding dropout to reduce overfitting.
If your model is not overfitting, Adam might be overshooting a (local) minimum. Try lowering its learning rate, or try SGD with custom hyperparameters. This does take a lot of tuning.
There is a free course on Coursera called Machine Learning by Stanford. This covers theory on these concepts (and more) in a good way.
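One way to judge whether 0.004 is actually a good MSE is to compare it with a trivial baseline that always predicts the mean of the targets. The sketch below uses made-up targets confined to a narrow band (an assumption, since the real target range is unknown); the names `mse`, `y` and `baseline_mse` are illustrative only.

```python
import random
import statistics

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical normalized targets clustered in a narrow band (illustration only).
random.seed(0)
y = [0.5 + random.uniform(-0.05, 0.05) for _ in range(10_000)]

# Baseline: always predict the mean of the targets.
mean_prediction = statistics.fmean(y)
baseline_mse = mse(y, [mean_prediction] * len(y))

print(f"baseline MSE (predict the mean): {baseline_mse:.6f}")
print("model MSE reported in the question: 0.004")
```

With targets this narrow, the do-nothing baseline already beats 0.004, so the loss would not be impressive at all. If the baseline on your real data is far above 0.004, the model is doing something useful.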

Related

Unstable loss in binary classification for time-series data - extremely imbalanced dataset

I am working on a deep learning model to detect regions of timesteps with anomalies. This model should classify each timestep as containing an anomaly or not.
My labels are something like this:
labels = [0 0 0 1 0 0 0 0 1 0 0 0 ...]
The 0s represent 'normal' timesteps and the 1s represent the existence of an anomaly. In reality, my dataset is very very imbalanced:
My training set consists of over 7000 samples, of which only 1400 (20%) contain at least 1 anomaly (timestep = 1).
I am feeding samples with 4096 timesteps each. The average number of anomalies, in the samples that contain them, is around 2. So, assuming there is an anomaly, the % of anomalous timesteps ranges from 0.02% to 0.04% for each sample.
With that said, I do need to shift from the standard binary cross entropy to something that highlights the anomalous timesteps from the anomaly free timesteps.
So, I experimented adding weights to the anomalous class in such a way that the model is forced to learn from the anomalies and not just reduce its loss from the anomaly-free timesteps. It actually worked well and the model seems to learn to detect anomalous timesteps. One problem however is that training can become quite unstable (and unpredictable), with sudden loss spikes appearing and affecting the learning process. Below, you can see the effects on the loss and metrics charts for two of my trainings:
After going through a debugging process for the trainings, I am confident that the problem comes from occasional predictions given for the anomalous timesteps. That is, in some samples of a certain epoch, and in some anomalous timesteps, the model gives a very low prediction, e.g. 0.01, for a label of 1 (it should be close to 1, of course). Considering the very high (but supposedly necessary) weights given to the anomalous timesteps, the penalty is really extreme and the loss just skyrockets.
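To make the size of that penalty concrete, here is a rough sketch of a weighted binary cross-entropy averaged over one 4096-timestep sample. The weight of 2000 and the predicted probabilities are assumptions chosen for illustration; the actual values used in training are not stated here.

```python
import math

def weighted_bce(label, prob, pos_weight, eps=1e-7):
    """Binary cross-entropy for one timestep; positives are scaled by pos_weight."""
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() finite
    if label == 1:
        return -pos_weight * math.log(prob)
    return -math.log(1.0 - prob)

POS_WEIGHT = 2000.0  # hypothetical weight, not taken from the post

# 4094 normal timesteps, each predicted confidently as normal.
normal = [weighted_bce(0, 0.001, POS_WEIGHT)] * 4094

# Case 1: both anomalous timesteps predicted at 0.99.
good = normal + [weighted_bce(1, 0.99, POS_WEIGHT)] * 2
# Case 2: one of the two anomalous timesteps predicted at 0.01.
bad = normal + [weighted_bce(1, 0.99, POS_WEIGHT),
                weighted_bce(1, 0.01, POS_WEIGHT)]

good_loss = sum(good) / len(good)
bad_loss = sum(bad) / len(bad)
print(f"both anomalies caught: {good_loss:.4f}")
print(f"one anomaly missed:    {bad_loss:.4f}")
```

A single missed timestep out of 4096 moves the sample loss by orders of magnitude, which matches the kind of jump described.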
Going deeper, if I inspect the losses of the sample where the jump happened and look for the batch right before the loss jumped, I see that the losses are all around 10^-2 - 0.0053, 0.004, 0.0041... - not a single sample with a loss over those values. Overall, an average loss of 0.005. However, if I inspect the loss of the following batch, in that same sample, the avg. loss of the batch is already 3.6, with a part of the samples with a low loss but another part with a very high loss - e.g. 9.2, 7.7, 8.9... I can confirm that all the high losses come from the penalties given at predicting the 1s timesteps. The following batches of the same sample and some of the batches of the next epoch get affected and take some time to start decreasing again and going back to a stable learning process.
With this said, I have been having this problem for some weeks already and really need some guidance on what I could try to deal with the spikes, which I assume arise from the gradient updates associated with anomalous timesteps that are harder to learn.
I am currently using a simple Keras model with two LSTM layers of 64 units each, followed by a final 1-unit dense layer with sigmoid activation. As for the optimizer, I am using Adam. I am training with batch size 128. Some things to consider also:
I have tried changes in weights and other loss functions. Ultimately, if I reduce the weights given to the anomalous timesteps, the model doesn't give so much importance to them and the loss decreases by considering only the anomaly-free timesteps. I have also considered focal binary cross-entropy loss, but it doesn't seem to do anything that could avoid those jumps as, in the end, it is all about adding or reducing weights for certain timesteps.
My current learning rate is Adam's default, 10⁻³. I have tried reducing the learning rate, which leads to less impactful spikes (they're still there though), but the model also takes much more time or gets stuck. Not sure if it would be the way to go in this case, as the training seems to go well except for these cases. A decaying learning rate might also not make much sense, as the spikes can happen early in the training and not only in later epochs.
I am still investigating gradient clipping as a solution. I am still not sure what values to use or whether it is actually an effective solution for my case, but from what I understand, it should help counter those jumps resulting from those 'almost' exploding gradients.
The spikes could originate from sample noise / bad samples. However, since I am already using batch size 128 and I have already tested training with simple synthetic samples I have created and the spikes were still there, I guess it is not a problem with specific samples.
The imbalance obviously plays the bigger role here. Not sure if undersampling the majority class of samples of 4096 timesteps (like increasing from 20% to 50% the amount of samples with at least an anomalous timestep) would make a big difference here since each sample of timesteps is by itself very imbalanced as it contains around 2 timesteps with anomalies. It is a problem with the imbalance within each sample.
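For reference, gradient clipping by norm amounts to the rescaling sketched below. In Keras, optimizers accept `clipnorm` and `clipvalue` arguments for this; the plain-Python version here only illustrates the arithmetic, and the gradient values and the threshold of 1.0 are made up.

```python
import math

def global_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; direction is kept."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

typical = [0.02, -0.01, 0.03]   # hypothetical gradient on an ordinary batch
spike = [12.0, -9.0, 20.0]      # hypothetical gradient on a batch with a missed anomaly

clipped_typical = clip_by_global_norm(typical, max_norm=1.0)
clipped_spike = clip_by_global_norm(spike, max_norm=1.0)

print("typical norm:", global_norm(typical), "->", global_norm(clipped_typical))
print("spike norm:  ", global_norm(spike), "->", global_norm(clipped_spike))
```

Ordinary updates pass through untouched, while the spike is scaled down to the threshold, so a reasonable starting threshold is something a little above the gradient norms seen during stable training.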
I know this is quite a lot of context, but honestly I am at my limit after trying things for weeks.
The solutions I am inclined to go for next are either gradient clipping or just changing my samples to be more centered around the anomalous timesteps, in such a way that it contains less anomaly free timesteps and hopefully allows for convergence without having to apply such drastic weights to anomalous timesteps. This last option is more difficult for me to opt for due to some restrictions, but I might look at it if I have nothing else available.
What do you think? I am able to provide more information if needed.

How do I know when to stop training my CNN?

I've been training my CNN and got the following as results:
I just know that the training and validation accuracy needs to both be high, but are these numbers good enough? How do I know when to stop? Should I concern myself with the losses, or only accuracy? Which epoch shows the best result so far?

The loss value indicates how poorly or well a model behaves after each optimization iteration, whereas accuracy is usually computed once the model parameters have been learned and is expressed as a percentage.
Yes, training and validation accuracy should both be high. Whether the numbers are good enough depends on the subject area. In the medical domain, for example, these numbers would not be good enough.
If you have serious class imbalance, your model will maximize accuracy by simply always picking the most common class, but this would not be a useful model. In this case cross-entropy or log-loss would be a better loss function to optimize.
Generally, the lower the loss the better the model, unless the model has overfitted the training data.
The 10th epoch is the best: that is where you got higher validation accuracy and lower validation loss.
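A common rule for "when to stop" is early stopping: keep the epoch with the lowest validation loss and stop once it has not improved for a few epochs. Keras offers this as the `EarlyStopping` callback with a `patience` argument. The sketch below shows the rule itself on made-up validation losses; the function name and the numbers are illustrative assumptions.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` epochs pass without a new lowest loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56]
best, stopped = best_epoch_with_patience(val_losses, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```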

Validation loss oscillates a lot, validation accuracy > learning accuracy, but test accuracy is high. Is my model overfitting?

I am training a model using the original author's learning rate (I use their GitHub code too). I get a validation loss that oscillates a lot: it decreases, then suddenly jumps to a large value, then decreases again, but it never really converges. The lowest it gets is 2, while the training loss converges to 0.0-something, well below 1.
At each epoch I get the training accuracy and at the end, the validation accuracy. Validation accuracy is always greater than the training accuracy.
When I test on real test data, I get good results, but I wonder if my model is overfitting. I expect a good model's val loss to converge in a similar fashion with training loss, but this doesn't happen and the fact that the val loss oscillates to very large values at times worries me.
By adjusting the learning rate, the scheduler and so on, I got the validation loss and training loss to trend downward with less oscillation, but this time my test accuracy remains low (as do the training and validation accuracies).
I did try a couple of optimizers (Adam, SGD, Adagrad) with a step scheduler and also PyTorch's plateau scheduler, and I played with step sizes etc., but it didn't really help; neither did clipping gradients.
Is my model overfitting?
If so, how can I reduce the overfitting besides data augmentation?
If not (I read that some people on Quora said it is nothing to worry about, though I would think it must be overfitting), how can I justify it? Even if I got similar results in a k-fold experiment, would it be good enough? I don't feel it would justify the oscillation. How should I proceed?

The training loss at each epoch is usually computed on the entire training set.
The validation loss at each epoch is usually computed on one minibatch of the validation set, so it is normal for it to be noisier.
Solution: you can report the exponential moving average of the validation loss across epochs to get fewer fluctuations.
It is not overfitting since your validation accuracy is not less than the training accuracy. In fact, it sounds like your model is underfitting since your validation accuracy > training accuracy.
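A minimal sketch of the exponential moving average mentioned above, assuming a smoothing factor `alpha` of 0.3 and made-up per-epoch validation losses:

```python
def exponential_moving_average(values, alpha=0.3):
    """Smooth a sequence: each output mixes the new value with the previous output."""
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)  # first point has nothing to average with
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical oscillating validation losses, one per epoch.
val_losses = [2.0, 5.0, 2.2, 6.0, 2.1, 4.8, 2.0]
smoothed = exponential_moving_average(val_losses, alpha=0.3)

for epoch, (raw, ema) in enumerate(zip(val_losses, smoothed), start=1):
    print(f"epoch {epoch}: raw={raw:.2f} ema={ema:.2f}")
```

A smaller `alpha` smooths more but reacts more slowly to a real change in the trend.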

Tensorflow: loss decreasing, but accuracy stable

My team is training a CNN in Tensorflow for binary classification of damaged/acceptable parts. We created our code by modifying the cifar10 example code. In my prior experience with Neural Networks, I always trained until the loss was very close to 0 (well below 1). However, we are now evaluating our model with a validation set during training (on a separate GPU), and it seems like the precision stopped increasing after about 6.7k steps, while the loss is still dropping steadily after over 40k steps. Is this due to overfitting? Should we expect to see another spike in accuracy once the loss is very close to zero? The current max accuracy is not acceptable. Should we kill it and keep tuning? What do you recommend? Here is our modified code and graphs of the training process.
https://gist.github.com/justineyster/6226535a8ee3f567e759c2ff2ae3776b
Precision and Loss Images

A decrease in binary cross-entropy loss does not imply an increase in accuracy. Consider label 1, predictions 0.2, 0.4 and 0.6 at timesteps 1, 2 and 3, and a classification threshold of 0.5. Going from timestep 1 to timestep 2 produces a decrease in loss but no increase in accuracy.
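The numbers from that example can be checked directly. For a label of 1, binary cross-entropy reduces to -log(p):

```python
import math

label = 1
threshold = 0.5
predictions = [0.2, 0.4, 0.6]  # predicted probability of class 1 at timesteps 1, 2, 3

# Binary cross-entropy for label 1 is -log(p).
losses = [-math.log(p) for p in predictions]
# A prediction counts as correct when it lands on the label's side of the threshold.
correct = [int((p >= threshold) == bool(label)) for p in predictions]

for step, (p, loss, ok) in enumerate(zip(predictions, losses, correct), start=1):
    print(f"timestep {step}: p={p:.1f} loss={loss:.3f} correct={ok}")
```

The loss falls at every step, but the prediction only becomes correct once it crosses the threshold at timestep 3.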
Ensure that your model has enough capacity by overfitting the training data. If the model is overfitting the training data, avoid overfitting by using regularization techniques such as dropout, L1 and L2 regularization and data augmentation.
Last, confirm your validation data and training data come from the same distribution.

Here are my suggestions. One possible problem is that your network is starting to memorize the data, so yes, you should increase regularization.
Update:
Here I want to mention one more problem that may cause this:
The class balance in the validation set is far from the one in the training set. As a first step, I would recommend trying to understand what your test data (real-world data, which your model will face at inference time) looks like descriptively: its class balance and other similar characteristics. Then try to build a train/validation set with almost the same descriptive statistics as the real data.
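A quick way to compare the class balance of two splits is to count labels, as in this sketch. The label lists are invented for illustration; in practice they would come from your own train and validation sets.

```python
from collections import Counter

def positive_ratio(labels):
    """Fraction of labels equal to 1 in a binary label list."""
    counts = Counter(labels)
    return counts[1] / len(labels)

# Hypothetical binary labels (1 = damaged, 0 = acceptable).
train_labels = [1] * 50 + [0] * 50
val_labels = [1] * 5 + [0] * 95

train_ratio = positive_ratio(train_labels)
val_ratio = positive_ratio(val_labels)
print(f"train positives: {train_ratio:.2%}, validation positives: {val_ratio:.2%}")
```

A gap like this one means accuracy on the two splits is not directly comparable.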

Well, I faced a similar situation when I used a softmax function in the last layer instead of a sigmoid for binary classification.
My validation loss and training loss were decreasing, but the accuracy of both remained constant. This taught me why sigmoid is used for binary classification.
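One plausible reading of this, assuming the last layer had a single output unit (the answer does not say), is that softmax over one value always returns 1.0, so the thresholded prediction never changes. A small sketch of that effect:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# With a single output unit, softmax ignores the logit entirely.
for logit in (-3.0, 0.0, 3.0):
    print(f"logit={logit:+.1f} softmax={softmax([logit])[0]:.3f} sigmoid={sigmoid(logit):.3f}")
```

With two output units and one-hot labels, softmax works fine for binary classification; the degenerate case is specific to a single unit.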

How to interpret increase in both loss and accuracy

I have run deep learning models (CNNs) using TensorFlow. Many times during an epoch, I have observed that both loss and accuracy increased, or both decreased. My understanding was that the two are always inversely related. In what scenario could both increase or decrease simultaneously?

The loss decreases as the training process goes on, except for some fluctuation introduced by the mini-batch gradient descent and/or regularization techniques like dropout (that introduces random noise).
If the loss decreases, the training process is going well.
The accuracy (validation accuracy, I suppose) is instead a measure of how good your model's predictions are.
If the model is learning, the accuracy increases. If the model is overfitting, the accuracy stops increasing and can even start to decrease.
If the loss decreases and the accuracy decreases, your model is overfitting.
If the loss increases and the accuracy increases too, it is because your regularization techniques are working well and you're fighting the overfitting problem. This is true only if the loss then starts to decrease while the accuracy continues to increase.
Otherwise, if the loss keeps growing, your model is diverging and you should look for the cause (usually a learning rate that is too high).

I think the top-rated answer is incorrect.
I will assume you are talking about cross-entropy loss, which can be thought of as a measure of 'surprise'.
Loss and accuracy increasing/decreasing simultaneously on the training data tells you nothing about whether your model is overfitting. This can only be determined by comparing loss/accuracy on the validation vs. training data.
If loss and accuracy are both decreasing, it means your model is becoming more confident on its correct predictions, or less confident on its incorrect predictions, or both, hence decreased loss. However, it is also making more incorrect predictions overall, hence the drop in accuracy. Vice versa if both are increasing. That is all we can say.
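A small numeric case of loss and accuracy decreasing together, using invented probabilities assigned to the true class of four examples (a prediction counts as correct when that probability exceeds 0.5, as in binary classification):

```python
import math

def cross_entropy(true_class_probs):
    """Mean cross-entropy given the probability assigned to each true class."""
    return sum(-math.log(p) for p in true_class_probs) / len(true_class_probs)

def accuracy(true_class_probs, threshold=0.5):
    """Fraction of examples whose true class gets more than `threshold` probability."""
    return sum(p > threshold for p in true_class_probs) / len(true_class_probs)

# Hypothetical probabilities of the true class, before and after some training.
before = [0.55, 0.55, 0.55, 0.10]  # three barely right, one badly wrong
after = [0.95, 0.95, 0.45, 0.45]   # two confidently right, two barely wrong

for name, probs in (("before", before), ("after", after)):
    print(f"{name}: loss={cross_entropy(probs):.3f} accuracy={accuracy(probs):.2f}")
```

The model became more confident where it is right and less wrong where it is wrong, so the loss dropped, yet one more example slipped below the threshold.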

I'd like to add a possible option here for all those who struggle with a model training right now.
If your validation data is a bit dirty, you might find that at the beginning of training the validation loss is low, as is the accuracy, and that the more you train your network, the more the accuracy increases side by side with the loss. This happens because the model hits the outliers in your dirty data and gets a very high loss there. Your accuracy therefore grows as it gets more data right, but the loss grows with it.

This is just what I think, based on the math behind the loss and the accuracy.
Note: I assume your data is categorical.
Your model's output (used to calculate the loss):
[0.1, 0.9, 0.9009, 0.8]
Maxed output (used to calculate the accuracy):
[0, 0, 1, 0]
Expected output:
[0, 1, 0, 0]
Let's clarify what loss and accuracy calculate:
Loss: the overall error between y and ypred.
Acc: only whether y and maxed(ypred) are equal.
So overall our model almost nailed it, resulting in a low loss.
But the maxed output has no notion of "almost": the two either match completely (1) or they do not (0).
This results in a low accuracy as well.
Things to try:
Check the MAE of the model.
Remove regularization.
Check whether you are using the correct loss.
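The accuracy side of this example can be reproduced with the output and expected vectors given above. This is only a sketch of the argmax comparison; it does not compute any particular loss.

```python
output = [0.1, 0.9, 0.9009, 0.8]  # model output from the example
expected = [0, 1, 0, 0]           # one-hot expected output

# Accuracy only looks at which position holds the largest value.
predicted_class = max(range(len(output)), key=output.__getitem__)
true_class = expected.index(1)
is_correct = int(predicted_class == true_class)

# How far the true class was from winning.
margin = output[predicted_class] - output[true_class]

print(f"predicted class: {predicted_class}, true class: {true_class}")
print(f"counted as correct: {is_correct}, margin: {margin:.4f}")
```

The true class loses by a tiny margin, yet the prediction is counted as fully wrong.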

You should check your class indices (for both the training and validation sets) during training. They might be sorted in different ways. I had this problem in Colab.
