Mixed precision training causes numerical overflow in Conv2D - TensorFlow

I'm training a CNN model in Keras 2.4 and 2.9 using mixed precision.
The model sporadically diverges during training with loss=nan.
Inspecting the model's layers, I found the cause: a Conv2D multiplies two values and the result exceeds 65,504, the float16 maximum.
I'm not sure why 'mixed precision' doesn't handle such a case, or whether I have implemented it incorrectly.
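For reference, here is a minimal sketch of the usual mitigations with the standard tf.keras mixed-precision API (the layer sizes are placeholders, not the poster's model): keep loss scaling enabled to protect the gradients, and selectively force numerically sensitive layers, such as the one that overflows and the final output, back to float32. Note that loss scaling guards the backward pass against underflow/overflow; forward-pass overflow in an activation usually needs input/activation normalization or a float32 dtype on the offending layer.

    import tensorflow as tf
    from tensorflow.keras import layers, mixed_precision

    # Compute in float16, keep variables in float32.
    mixed_precision.set_global_policy("mixed_float16")

    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)  # float16
    # Hypothetical "offending" layer pinned to float32 to avoid exceeding 65,504.
    x = layers.Conv2D(64, 3, padding="same", activation="relu", dtype="float32")(x)
    x = layers.GlobalAveragePooling2D()(x)
    # Keep the output layer in float32 so the logits and loss stay in float32.
    outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)
    model = tf.keras.Model(inputs, outputs)

    # Model.fit applies dynamic loss scaling automatically under this policy;
    # a custom training loop would wrap the optimizer explicitly:
    opt = mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD(0.01))
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")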

Related

ResNet training - L2 loss decreases while cross-entropy stays around 0.69

I am using this official TensorFlow implementation of ResNet (https://github.com/tensorflow/models/tree/master/official/resnet) to train a binary classifier on my own dataset. I modified the input_fn in imagenet_main.py a little to do my own image loading and preprocessing. But after many rounds of parameter tuning, I can't make the model train properly. I can only find a set of parameters that lets training accuracy climb to 100%, while validation accuracy stays around 50% forever. The implementation uses a piecewise learning rate. I tried initial learning rates from 0.1 to 1e-5 and weight decay from 1e-2 to 1e-5, and found no convergence on the validation set.
A suspicious observation is that during training, the L2 loss decreases slowly and steadily while the cross-entropy is very reluctant to decrease, staying around 0.69 (roughly ln 2, the loss of a binary classifier that always predicts 50/50).
Any idea what I can try further?
Regarding my dataset and image preprocessing: the training set is around 100K images and the validation set is around 10K. I just resize each image to 224*224 while keeping the aspect ratio, subtract 127 from each channel, and divide by 255.
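For concreteness, here is a minimal sketch of that preprocessing in tf.image (the resize_with_pad call is my assumption about how the aspect ratio is preserved):

    import tensorflow as tf

    def preprocess(image):
        """Resize to 224x224 while preserving aspect ratio (via padding),
        then shift by 127 and scale by 255, giving values roughly in [-0.5, 0.5]."""
        image = tf.image.resize_with_pad(tf.cast(image, tf.float32), 224, 224)
        return (image - 127.0) / 255.0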
Actually @Hua, ResNet has a lot of trainable parameters and was built for ImageNet, which has 1k classes, while your dataset has only two. The dense layers of ResNet have about 4k neurons, which further increases the number of trainable parameters, and the number of parameters is directly related to the risk of overfitting. This means the full ResNet model may not be suitable for your data; try making some changes to ResNet to decrease the number of parameters. That may help –
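In that spirit, one common way to shrink the head is to keep a Keras ResNet backbone and attach a small binary classifier on top. The sketch below illustrates the idea with tf.keras (the dropout rate and optimizer are my own placeholders, not the questioner's setup):

    import tensorflow as tf
    from tensorflow.keras import layers

    # ResNet50 backbone without the 1000-class ImageNet head.
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=None, input_shape=(224, 224, 3), pooling="avg")

    # A tiny binary head instead of the original 1000-way classifier.
    x = layers.Dropout(0.5)(base.output)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(1e-3, momentum=0.9),
                  loss="binary_crossentropy", metrics=["accuracy"])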

Validation loss >> train loss, same data, binary classifier

I implemented this paper's neural net, with some differences (img below), for EEG classification; train_on_batch performance is excellent, with very low loss - but test_on_batch performance, though on the same data, is poor: the net seems to almost always predict '1':
        TRAIN (loss, acc)    VAL (loss, acc)
'0' --  (0.06269842, 1)      (3.7652588, 0)
'1' --  (0.04473557, 1)      (0.3251827, 1)
Data is fed as 30-sec segments (12000 timesteps) (10 mins per dataset) from 32 (= batch_size) datasets at once (img below)
Any remedy?
Troubleshooting attempted:
Disabling dropout
Disabling all regularizers (except batch-norm)
Randomly, val_acc('0','1') = (~.90, ~.12) - then back to (0,1)
Additional details:
Keras 2.2.4 (TensorFlow backend), Python 3.6, Spyder 3.3.4 via Anaconda
CuDNN LSTM stateful
CNNs pretrained, LSTMs added on afterwards (and both trained)
BatchNormalization after every CNN & LSTM layer
reset_states() applied between different datasets
squeeze_excite_block inserted after every CNN block except the last (a sketch of the block follows this list)
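For readers unfamiliar with it, here is a minimal sketch of a standard squeeze-and-excitation block for 1D feature maps, shown with tf.keras imports; the reduction ratio of 16 is an assumption, not necessarily what was used here.

    from tensorflow.keras import layers

    def squeeze_excite_block(x, ratio=16):
        """Channel-wise attention: squeeze spatially, then rescale each channel."""
        channels = x.shape[-1]
        se = layers.GlobalAveragePooling1D()(x)                  # squeeze
        se = layers.Dense(channels // ratio, activation="relu")(se)
        se = layers.Dense(channels, activation="sigmoid")(se)    # excite
        se = layers.Reshape((1, channels))(se)
        return layers.Multiply()([x, se])                        # reweight channels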
UPDATE:
Progress was made; batch_normalization and dropout are the major culprits. Major changes:
Removed LSTMs, GaussianNoise, SqueezeExcite blocks (img below)
Implemented batch_norm patch
Added sample_weights to reflect class imbalance - varied between 0.75 and 2 (a sketch follows this list)
Trained with various warmup schemes for both MaxPool and Input dropouts
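As an illustration of the sample-weighting step, one way to pass per-sample weights to train_on_batch is sketched below; the class weights 2.0 and 0.75 are just the endpoints of the range mentioned above, not the actual values used.

    import numpy as np

    # Hypothetical class weights within the 0.75-2.0 range mentioned above.
    class_weights = {0: 2.0, 1: 0.75}

    def make_sample_weights(y_batch):
        """Map each label in the batch to its class weight."""
        labels = np.asarray(y_batch).ravel().astype(int)
        return np.array([class_weights[label] for label in labels])

    # loss = model.train_on_batch(x_batch, y_batch,
    #                             sample_weight=make_sample_weights(y_batch))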
Considerable improvements were observed - but not nearly total. Train vs. validation loss behavior is truly bizarre - flipping class predictions, and bombing the exact same datasets it had just trained on:
Also, BatchNormalization outputs during train vs. test time differ considerably (img below)
UPDATE 2: All other suspicions were ruled out: BatchNormalization is the culprit. Using Self-Normalizing Networks (SNNs) with SELU & AlphaDropout in place of BatchNormalization yields stable and consistent results.
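For anyone wanting to try the same swap, a minimal sketch of an SNN-style convolutional block (SELU activation with lecun_normal initialization, AlphaDropout instead of BatchNorm) is below; the 1D convolution, filter count, kernel size, and dropout rate are placeholders, not the actual architecture.

    from tensorflow.keras import layers

    def snn_conv_block(x, filters, dropout_rate=0.05):
        """Conv block using SELU + AlphaDropout in place of BatchNorm + ReLU."""
        x = layers.Conv1D(filters, kernel_size=8, padding="same",
                          activation="selu",
                          kernel_initializer="lecun_normal")(x)
        return layers.AlphaDropout(dropout_rate)(x)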
I'd inadvertently left in one non-standardized sample (in a batch of 32), with sigma = 52, which severely disrupted the BN layers; after standardizing it, I no longer observe a strong discrepancy between train & inference modes - if anything, any differences are difficult to spot.
Furthermore, the entire preprocessing pipeline was faulty; after redoing it correctly, the problem no longer recurs. As a debugging tip, try identifying whether any particular training dataset sharply alters layer activations during inference.
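A minimal sketch of that debugging approach with tf.keras (the layer name and batches are illustrative): build a probe model up to the suspect layer and compare its activation statistics per batch in training vs. inference mode, where BatchNorm behaves differently.

    import tensorflow as tf

    def activation_stats(model, layer_name, batches):
        """Print per-batch activation mean/std in training vs. inference mode."""
        probe = tf.keras.Model(model.input, model.get_layer(layer_name).output)
        for i, x in enumerate(batches):
            train_out = probe(x, training=True).numpy()
            infer_out = probe(x, training=False).numpy()
            print(f"batch {i}: train {train_out.mean():.3f}/{train_out.std():.3f}  "
                  f"infer {infer_out.mean():.3f}/{infer_out.std():.3f}")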

Training Inception V2 from scratch - diverging

As a learning exercise, I'm training the Inception (v2) model from scratch using the ImageNet dataset from the Kaggle competition. I've heard people say it took them a week or so of training on a GPU to get this model to converge on this same dataset. I'm currently training it on my MacBook Pro (single CPU), so I expect it to take at least a month or so to converge.
Here's my implementation of the Inception model. Input is 224x224x3 images, with values in range [0, 1].
The learning rate was set to a static 0.01 and I'm using the stochastic gradient descent optimizer.
My question
After 48 hours of training, the training loss seems to indicate that it's learning from the training data, but the validation loss is beginning to get worse. Ordinarily, this would suggest the model is overfitting. Does it look like something might be wrong with my model or dataset, or is this perfectly expected, since I've only trained for 5.8 epochs?
My training and validation loss and accuracy after 1.5 epochs.
Training and validation loss and accuracy after 5.8 epochs.
Some input images as seen by the model, as well as the output of one of the early convolution layers.

Why does the accuracy drop to zero in each epoch, while training convlstm layers in keras?

I am trying to use ConvLSTM layers in Keras 2 to train an action recognition model. The model has 3 ConvLSTM layers and 2 Fully Connected ones.
In each and every epoch, the accuracy for the first batch (usually more than one batch) is zero, and then it increases to something higher than where the previous epoch ended. For example, the first epoch finishes at 0.3, the next finishes at 0.4, and so on.
My question is: why does it drop back to zero at the start of each epoch?
p.s.
The ConvLSTM is stateless.
The model is compiled with SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True); for some reason it does not converge using Adam.
So - in order to understand why something like this happens, you need to understand how Keras computes accuracy during batch-wise training:
At the start of each epoch - the running counters (correctly classified examples and examples seen) are reset to zero.
After each batch - the number of correctly classified examples is added to the counter, and the displayed accuracy is that running count divided by the number of examples seen so far in the epoch.
As your accuracy is pretty low, it's highly probable that in the first few batches none of the examples will be classified correctly, especially with a small batch size. This makes the accuracy 0 at the beginning of every epoch; it is not the model regressing, just the running average being reset.
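A tiny numerical sketch of that running-average behaviour (the per-batch correct counts are made up, purely to illustrate the reset):

    # Simulate the accuracy Keras displays: a running average over the epoch,
    # whose counters are reset when a new epoch starts.
    batch_size = 16
    correct_per_batch = [0, 0, 3, 7, 10, 12]   # hypothetical counts

    for epoch in range(2):
        correct_so_far, seen_so_far = 0, 0     # reset at the start of each epoch
        for b, correct in enumerate(correct_per_batch):
            correct_so_far += correct
            seen_so_far += batch_size
            print(f"epoch {epoch} batch {b}: acc = {correct_so_far / seen_so_far:.3f}")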

Tensorflow: loss decreasing, but accuracy stable

My team is training a CNN in Tensorflow for binary classification of damaged/acceptable parts. We created our code by modifying the cifar10 example code. In my prior experience with Neural Networks, I always trained until the loss was very close to 0 (well below 1). However, we are now evaluating our model with a validation set during training (on a separate GPU), and it seems like the precision stopped increasing after about 6.7k steps, while the loss is still dropping steadily after over 40k steps. Is this due to overfitting? Should we expect to see another spike in accuracy once the loss is very close to zero? The current max accuracy is not acceptable. Should we kill it and keep tuning? What do you recommend? Here is our modified code and graphs of the training process.
https://gist.github.com/justineyster/6226535a8ee3f567e759c2ff2ae3776b
Precision and Loss Images
A decrease in binary cross-entropy loss does not imply an increase in accuracy. Consider label 1, predictions 0.2, 0.4, and 0.6 at timesteps 1, 2, 3, and a classification threshold of 0.5. Timesteps 1 and 2 will produce a decrease in loss but no increase in accuracy.
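You can check that numerically; a quick sketch:

    import numpy as np

    label, threshold = 1.0, 0.5
    for t, p in enumerate([0.2, 0.4, 0.6], start=1):
        bce = -(label * np.log(p) + (1 - label) * np.log(1 - p))
        correct = int((p > threshold) == bool(label))
        print(f"timestep {t}: prediction={p}, loss={bce:.3f}, correct={correct}")
    # The loss falls at every step (1.609 -> 0.916 -> 0.511), but the prediction
    # only crosses the 0.5 threshold -- and accuracy only improves -- at timestep 3.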
Ensure that your model has enough capacity by first confirming it can overfit the training data. If the model is overfitting the training data, counter it with regularization techniques such as dropout, L1 and L2 regularization, and data augmentation.
Last, confirm your validation data and training data come from the same distribution.
Here are my suggestions. One possible problem is that your network has started to memorize the data; yes, you should increase regularization.
Update:
Here I want to mention one more problem that may cause this:
The class balance ratio in the validation set is far from what you have in the training set. As a first step, I would recommend understanding what your test data (the real-world data your model will face at inference time) looks like: its class balance ratio and other descriptive characteristics. Then try to build training and validation sets with roughly the same characteristics as that real data.
Well, I faced a similar situation when I used a softmax function in the last layer instead of a sigmoid for binary classification.
My validation loss and training loss were decreasing, but the accuracy of both remained constant. This taught me why a sigmoid is used for binary classification.
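For reference, here is a minimal sketch of the two conventional output-layer setups for binary classification in tf.keras (this is my illustration of the general point, not the answerer's actual model; the feature size 128 is a placeholder):

    from tensorflow.keras import layers, models

    # Option A: a single sigmoid unit with binary cross-entropy (the usual choice).
    binary_head = models.Sequential([
        layers.Input(shape=(128,)),
        layers.Dense(1, activation="sigmoid"),
    ])
    binary_head.compile(optimizer="adam", loss="binary_crossentropy",
                        metrics=["accuracy"])

    # Option B: two softmax units paired with (sparse) categorical cross-entropy
    # and integer labels 0/1. Mixing a softmax output with binary_crossentropy
    # and single-column labels is a classic source of misleading metrics.
    softmax_head = models.Sequential([
        layers.Input(shape=(128,)),
        layers.Dense(2, activation="softmax"),
    ])
    softmax_head.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])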