Unusual behavior of ADAM optimizer with AMSGrad - optimization

I am trying some 1, 2, and 3 layer LSTM networks to classify land cover of some selected pixels from a Landsat time-series spectral data. I tried different optimizers (as implemented in Keras) to see which of them is better, and generally found AMSGrad variant of ADAM doing a relatively better job in my case. However, one strange thing to me is that for the AMSGrad variant, the training and test accuracies start at a relatively high value from the first epoch (instead of increasing gradually) and it changes only slightly after that, as you see in the below graph.
Performance of ADAM optimizer with AMSGrad on
Performance of ADAM optimizer with AMSGrad off
I have not seen this behavior in any other optimizer. Does it show a problem in my experiment? What can be the explanation for this phenomenon?

Pay attention to the number of LSTM layers. They are notorious for easily overfitting the data. Try a smaller model initially(less number of layers), and gradually increase the number of units in a layer. If you notice poor results, then try adding another LSTM layer, but only after the previous step has been done.
As for the optimizers, I have to admit I have never used AMSGrad. However, the plot with regard to the accuracy does seem to be much better in case of the AMSGrad off. You can see that when you use AMSGrad the accuracy on the training set is much better than that on the test set, which a strong sign of overfitting.
Remember to keep things simple, experiment with simple models and generic optimizers.

Related

How to improve the performance of CNN Model for a specific Dataset? Getting Low Accuracy on both training and Testing Dataset

We were given an assignment in which we were supposed to implement our own neural network, and two other already developed Neural Networks. I have done that and however, this isn't the requirement of the assignment but I still would want to know that what are the steps/procedure I can follow to improve the accuracy of my Models?
I am fairly new to Deep Learning and Machine Learning as a whole so do not have much idea.
The given dataset contains a total of 15 classes (airplane, chair etc.) and we are provided with about 15 images of each class in training dataset. The testing dataset has 10 images of each class.
Complete github repository of my code can be found here (Jupyter Notebook file): https://github.com/hassanashas/Deep-Learning-Models
I tried it out with own CNN first (made one using Youtube tutorials).
Code is as follows,
X_train = X_train/255.0
model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape = X_train.shape[1:]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64))
model.add(Dense(16)) # added 16 because it model.fit gave error on 15
model.add(Activation('softmax'))
For the compiling of Model,
from tensorflow.keras.optimizers import SGD
model.compile(loss='sparse_categorical_crossentropy',
optimizer=SGD(learning_rate=0.01),
metrics=['accuracy'])
I used sparse categorical crossentropy because my "y" label was intenger values, ranging from 1 to 15.
I ran this model with following way,
model_fit = model.fit(X_train, y_train, batch_size=32, epochs=30, validation_split=0.1)
It gave me an accuracy of 0.2030 on training dataset and only 0.0733 on the testing dataset (both the datasets are present in the github repository)
Then, I tried out the AlexNet CNN (followed a Youtube tutorial for its code)
I ran the AlexNet on the same dataset for 15 epochs. It improved the accuracy on training dataset to 0.3317, however accuracy on testing dataset was even worse than my own CNN, at only 0.06
Afterwards, I tried out the VGG16 CNN, again following a Youtube Tutorial.
I ran the code on Google Colab for 10 Epochs. It managed to improve to 100% accuracy on training dataset in the 8th epoch. But this model gave the worst accuracy of all three on testing dataset with only 0.0533
I am unable to understand this contrasting behavior of all these models. I have tried out different epoch values, loss functions etc. but the current ones gave the best result relatively. My own CNN was able to get to 100% accuracy when I ran it on 100 epochs (however, it gave very poor results on the testing dataset)
What can I do to improve the performance of these Models? And specifically, what are the few crucial things that one should always try to follow in order to improve efficiency of a Deep Learning Model? I have looked up multiple similar questions on Stackoverflow but almost all of them were working on datasets provided by the tensorflow like mnist dataset and etc. and I didn't find much help from those.
Disclaimer: it's been a few years since I've played with CNNs myself, so I can only pass on some general advice and suggestions.
First of all, I would like to talk about the results you've gotten so far. The first two networks you've trained seem to at least learn something from the training data because they perform better than just randomly guessing.
However: the performance on the test data indicates that the network has not learned anything meaningful because those numbers suggest the network is as good as (or only marginally better than) a random guess.
As for the third network: high accuracy for training data combined with low accuracy for testing data means that your network has overfitted. This means that the network has memorized the training data but has not learned any meaningful patterns.
There's no point in continuing to train a network that has started overfitting. So once the training accuracy increases and testing accuracy decreases for a few epochs consecutively, you can stop training.
Increase the dataset size
Neural networks rely on loads of good training data to learn patterns from. Your dataset contains 15 classes with 15 images each, that is very little training data.
Of course, it would be great if you could get hold of additional high-quality training data to expand your dataset, but that is not always feasible. So a different approach is to artificially expand your dataset. You can easily do this by applying a bunch of transformations to the original training data. Think about: mirroring, rotating, zooming, and cropping.
Remember to not just apply these transformations willy-nilly, they must make sense! For example, if you want a network to recognize a chair, do you also want it to recognize chairs that are upside down? Or for detecting road signs: mirroring them makes no sense because the text, numbers, and graphics will never appear mirrored in real life.
From the brief description of the classes you have (planes and chairs and whatnot...), I think mirroring horizontally could be the best transformation to apply initially. That will already double your training dataset size.
Also, keep in mind that an artificially inflated dataset is never as good as one of the same size that contains all authentic, real images. A mirrored image contains much of the same information as its original, we merely hope it will delay the network from overfitting and hope that it will learn the important patterns instead.
Lower the learning rate
This is a bit of side note, but try lowering the learning rate. Your network seems to overfit in only a few epochs which is very fast. Obviously, lowering the learning rate will not combat overfitting but it will happen more slowly. This means that you can hopefully find an epoch with better overall performance before overfitting takes place.
Note that a lower learning rate will never magically make a bad-performing network good. It's just one way to locate a set of parameters that performs a tad bit better.
Randomize the training data order
During training, the training data is presented in batches to the network. This often happens in a fixed order over all iterations. This may lead to certain biases in the network.
First of all, make sure that the training data is shuffled at least once. You do not want to present the classes one by one, for example first all plane images, then all chairs, etc... This could lead to the network unlearning much of the first class by the end of each epoch.
Also, reshuffle the training data between epochs. This will again avoid potential minor biases because of training data order.
Improve the network design
You've designed a convolutional neural network with only two convolution layers and two fully connected layers. Maybe this model is too shallow to learn to differentiate between the different classes.
Know that the convolution layers tend to first pick up small visual features and then tend to combine these in higher level patterns. So maybe adding a third convolution layer may help the network identify more meaningful patterns.
Obviously, network design is something you'll have to experiment with and making networks overly deep or complex is also a pitfall to watch out for!

Why machine learning algorithms focus on speed and not accuracy?

I study ML and I see that most of the time the focus of the algorithms is run time and not accuracy. Reducing features, taking sample from the data set, using approximation and so on.
Im not sure why its the focus since once I trained my model I dont need to train it anymore if my accuracy is high enough and for that if it will take me 1 hours or 10 days to train my model it does not really matter because I do it only 1 time and my goal is to predict as better as I can my outcomes (minimum loss).
If I train a model to differ between cats and dogs I want it to be the most accurate it can be and not the fasted since once I trained this model I dont need to train any more models.
I can understand why models that depends on fasting changing data need this focus of speed but for general training models I dont understand why the focus is on speed.
Speed is relative term. Accuracy is also relative depending on the difficulty of the task. Currently the goal is to achieve human-like performance for application at reasonable costs because this will replace human labor and cut costs.
From what I have seen in reading papers, people usually focus on accuracy first to produce something that works. Then do ablation studies - studies where pieces of the models are removed or modified - to achieve the same performance in less time or memory requirements.
The field is very experimentally validated. There really isn't much of a theory that states why CNN work so well other than that it can model any function given non-linear activations functions. (https://en.wikipedia.org/wiki/Universal_approximation_theorem) There have been some recent efforts to explain why it works well. One I recall is MobileNetV2: Inverted Residuals and Linear Bottlenecks. The explaination of embedding data into a low dimensional space without losing information might be worth reading.

tf-slim batch norm: different behaviour between training/inference mode

I'm attempting to train a tensorflow model based on the popular slim implementation of mobilenet_v2 and am observing behaviour I cannot explain related (I think) to batch normalization.
Problem Summary
Model performance in inference mode improves initially but starts producing trivial inferences (all near-zeros) after a long period. Good performance continues when run in training mode, even on the evaluation dataset. Evaluation performance is impacted by batch normalization decay/momentum rate... somehow.
More extensive implementation details below, but I'll probably lose most of you with the wall of text, so here are some pictures to get you interested.
The curves below are from a model which I tweaked the bn_decay parameter of while training.
0-370k: bn_decay=0.997 (default)
370k-670k: bn_decay=0.9
670k+: bn_decay=0.5
Loss for (orange) training (in training mode) and (blue) evaluation (in inference mode). Low is good.
Evaluation metric of model on evaluation dataset in inference mode. High is good.
I have attempted to produce a minimal example which demonstrates the issue - classification on MNIST - but have failed (i.e. classification works well and the problem I experience is not exhibited). My apologies for not being able to reduce things further.
Implementation Details
My problem is 2D pose estimation, targeting Gaussians centered at the joint locations. It is essentially the same as semantic segmentation, except rather than using a softmax_cross_entropy_with_logits(labels, logits) I use tf.losses.l2_loss(sigmoid(logits) - gaussian(label_2d_points)) (I use the term "logits" to describe unactivated output of my learned model, though this probably isn't the best term).
Inference Model
After preprocessing my inputs, my logits function is a scoped call to the base mobilenet_v2 followed by a single unactivated convolutional layer to make the number of filters appropriate.
from slim.nets.mobilenet import mobilenet_v2
def get_logtis(image):
with mobilenet_v2.training_scope(
is_training=is_training, bn_decay=bn_decay):
base, _ = mobilenet_v2.mobilenet(image, base_only=True)
logits = tf.layers.conv2d(base, n_joints, 1, 1)
return logits
Training Op
I have experimented with tf.contrib.slim.learning.create_train_op as well as a custom training op:
def get_train_op(optimizer, loss):
global_step = tf.train.get_or_create_global_step()
opt_op = optimizer.minimize(loss, global_step)
update_ops = set(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
update_ops.add(opt_op)
return tf.group(*update_ops)
I'm using tf.train.AdamOptimizer with learning rate=1e-3.
Training Loop
I'm using the tf.estimator.Estimator API for training/evaluation.
Behaviour
Training initially goes well, with an expected sharp increase in performance. This is consistent with my expectations, as the final layer is rapidly trained to interpret the high-level features output by the pretrained base model.
However, after a long period (60k steps with batch_size 8, ~8 hours on a GTX-1070) my model begins to output near-zero values (~1e-11) when run in inference mode, i.e. is_training=False. The exact same model continues to improve when run in *training mode, i.e.is_training=True`, even on the valuation set. I have visually verified this is.
After some experimentation I changed the bn_decay (batch normalization decay/momentum rate) from the default 0.997 to 0.9 at ~370k steps (also tried 0.99, but that didn't make much of a difference) and observed an immdeiate improvement in accuracy. Visual inspection of the inference in inference mode showed clear peaks in the inferred values of order ~1e-1 in the expected places, consistent with the location of peaks from training mode (though values much lower). This is why the accuracy increases significantly, but the loss - while more volative - does not improve much.
These effects dropped off after more training and reverted to all zero inference.
I further dropped the bn_decay to 0.5 at step ~670k. This resulted in improvements to both loss and accuracy. I'll likely have to wait until tomorrow to see the long-term effect.
Loss and an evaluation metric plots given below. Note the evaluation metric is based on the argmax of the logits and high is good. Loss is based on the actual values, and low is good. Orange uses is_training=True on the training set, while blue uses is_training=False on the evaluation set. The loss of around 8 is consistent with all zero outputs.
Other notes
I have also experimented with turning off dropout (i.e. always running the dropout layers with is_training=False), and observed no difference.
I have experimented with all versions of tensorflow from 1.7 to 1.10. No difference.
I have trained models from the pretrained checkpoint using bn_decay=0.99 from the start. Same behaviour as using default bn_decay.
Other experiments with a batch size of 16 result in qualitatively identical behaviour (though I can't evaluate and train simultaneously due to memory constraints, hence quantitatively analysing on batch size of 8).
I have trained different models using the same loss and using tf.layers API and trained from scratch. They have worked fine.
Training from scratch (rather than using pretrained checkpoints) results in similar behaviour, though takes longer.
Summary/my thoughts:
I am confident this is not an overfitting/dataset problem. The model makes sensible inferences on the evaluation set when run with is_training=True, both in terms of location of peaks and magnitude.
I am confident this is not a problem with not running update ops. I haven't used slim before, but apart from the use of arg_scope it doesn't look too much different to the tf.layers API which I've used extensively. I can also inspect the moving average values and observe that they are changing as training progresses.
Chaning bn_decay values significantly effected the results temporarily. I accept that a value of 0.5 is absurdly low, but I'm running out of ideas.
I have tried swapping out slim.layers.conv2d layers for tf.layers.conv2d with momentum=0.997 (i.e. momentum consistent with default decay value) and behaviour was the same.
Minimal example using pretrained weights and Estimator framework worked for classification of MNIST without modification to bn_decay parameter.
I've looked through issues on both the tensorflow and models github repositories but haven't found much apart from this. I'm currently experimenting with a lower learning rate and a simpler optimizer (MomentumOptimizer), but that's more because I'm running out of ideas rather than because I think that's where the problem lies.
Possible Explanations
The best explanation I have is that my model parameters are rapidly cycling in a manner such that the moving statistics are unable to keep up with the batch statistics. I've never heard of such behaviour, and it doesn't explain why the model reverts to poor behaviour after more time, but it's the best explanation I have.
There may be a bug in the moving average code, but it has worked perfectly for me in every other case, including a simple classification task. I don't want to file an issue until I can produce a simpler example.
Anyway, I'm running out of ideas, the debug cycle is long, and I've already spent too much time on this. Happy to provide more details or run experiments on demand. Also happy to post more code, though I'm worried that'll scare more people off.
Thanks in advance.
Both lowering the learning rate to 1e-4 with Adam and using Momentum optimizer (with learning_rate=1e-3 and momentum=0.9) resolved this issue. I also found this post which suggests the problem spans multiple frameworks and is an undocumented pathology of some networks due to the interaction between optimizer and batch-normalization. I do not believe it is a simple case of the optimizer failing to find a suitable minimum due to the learning rate being too high (otherwise performance in training mode would be poor).
I hope that helps others experiencing the same issue, but I'm a long way from satisfied. I'm definitely happy to hear other explanations.

Neural Network High Confidence Inaccurate Predictions

I have a trained a neural network on a classification task, and it is learning, although it's accuracy is not high. I am trying to figure out which test examples it is not confident about, so that I can gain some more insight into what is happening.
In order to do this, I decided to use the standard softmax probabilities in Tensorflow. To do this, I called tf.nn.softmax(logits), and used the probabilities provided here. I noticed that many times the probabilities were 99%, but the prediction was still wrong. As such, even when I only consider examples who have prediction probabilities higher than 99%, I get a poor accuracy, only 2-3 percent higher than my original accuracy.
Does anyone have any ideas as to why the network is so confident about wrong predictions? I am still new to deep learning, so am looking for some ideas to help me out.
Also, is using the softmax probabilities the right way to do determine confidence of predictions from a neural network? If not, is there a better way?
Thanks!
Edit: From the answer below, it seems like my network is just performing poorly. Is there another way to identify which predictions the network makes are likely to be wrong besides looking at the confidence (since the confidence doesn't seem to work well)?
Imagine your samples are split by a vertical line but you NN classifier learnt a horizontal line, in this case any prediction given by your classifier can only obtain 50% accuracy always. However NN will assign higher confidence to the samples which are further away from the horizontal line.
In short, when your model is doing poor classification higher confidence has little to none contribution to accuracy.
Suggestion: Check if the information you needed to do the correct classification are in the data then improve the overall accuracy first.

High variability loss of neural networks

I'm getting really high variability in both the accuracy and loss between each epoch, as high as 10%. It happens to my accuracy all the time, and my loss when I start adding in dropout. However I really need the dropout, any ideas on how to smooth it out?
It is hard to say anything concrete without knowing what you do. But because you mentioned that your dataset is very small: 500 samples, I say that your 10% performance jumps are not surprising. Still a few ideas:
definitely use a bigger dataset if you can. If it is not possible to collect a bigger dataset, try to augment whatever you have.
try a smaller dropout and see how it goes, try different regularizers (dropout is not the only option)
you data is small, you can afford to run more than 200 iterations
see how your model performs on the test set, it is possible that it just severely overfitted the data
Beside the fact that the data set is very small, during a training with a dropout regularization the loss function is not anymore well defined and I presume the accuracy is also biased. Therefore any tracked metric should be assessed without dropout. It seams that keras does not switch it off while calculating the accuracy during training.