How to remove "glitches" in the loss graph of training phase? - tensorflow

When training a deep learning model, I noticed that the training loss was a little weird. There were some "glitches" at certain epochs, as seen in the figure below.
What causes them, and how can I get rid of them?
Thank you

This may be completely normal and due to how the learning process works.
In practice, with stochastic gradient descent (SGD) you optimize the loss function by approximating the whole-dataset loss landscape with the loss landscape of the current minibatch, so the optimization process becomes noisy and spiky.
In each iteration, you evaluate the loss obtained by the model on the current minibatch and then update the model parameters based on this loss. However, this loss value is not necessarily the value you would have obtained by making a prediction on the whole dataset. For example, in a binary classification problem, imagine what happens if, due to randomness, your current minibatch contains only samples of class A instead of both classes A and B: the current loss does not take class B into account, and you update the model parameters based only on the results for class A. As a consequence, if the next minibatch contains an equal number of samples of classes A and B, your results will be worse than usual.
Even though class imbalance is usually addressed with balanced minibatches or weighted loss functions, more generally the same thing can happen inside a single class. Suppose you have a large heterogeneity inside class A: your model could update the parameters based more on certain features than on others.
For more theoretical aspects, which I really encourage you to read, you can have a look at this:
http://ruder.io/optimizing-gradient-descent/
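To make the point about minibatch noise concrete, here is a minimal NumPy sketch. The dataset, the fixed "model" probabilities and the batch size of about 16 are all assumptions made up for illustration; it only shows that the loss measured on individual minibatches scatters around the whole-dataset loss depending on which samples each batch happens to contain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary dataset (an assumption for illustration).
n_samples, batch_size = 1000, 16
labels = rng.integers(0, 2, size=n_samples)
# Predicted probability of class 1 from a hypothetical model that is
# better on class 1 than on class 0.
probs = np.where(labels == 1, 0.9, 0.4)

def bce(p, y):
    """Mean binary cross-entropy of probabilities p against labels y."""
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

# Loss over the whole dataset.
full_loss = bce(probs, labels)

# Loss on each minibatch of one shuffled epoch.
order = rng.permutation(n_samples)
batch_losses = [
    bce(probs[idx], labels[idx])
    for idx in np.array_split(order, n_samples // batch_size)
]

print(f"whole-dataset loss: {full_loss:.3f}")
print(f"minibatch loss: min={min(batch_losses):.3f}, max={max(batch_losses):.3f}")
```

Batches that happen to contain more class-0 samples report a higher loss than batches dominated by class 1, even though the model itself has not changed.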

Related

GAN - loss and evaluation of model

I'm struggling with understanding how to "objectively" evaluate a GAN (that is, not simply look at what it generates saying "this looks good/bad").
My understanding is that the discriminator should get a head start and, in theory, discriminator loss and generator loss both ought to converge to 0.5 - at which point both are equally "good".
I'm currently training a model, and I get discriminator loss beginning at 0.7 but quickly converging toward 0.25, and generator loss beginning at 50 and converging toward 0.35 (possibly less with further training).
This doesn't entirely make sense. How can both be better than 0.5?
Are my loss functions incorrect, or what else am I missing? How should performance be measured?
In a GAN setting, it is normal for you to have the losses be better because you are training only one of the networks at a time (thus beating the other network).
You can evaluate the generated output with metrics such as PSNR, SSIM, FID, L2, LPIPS, VGG-based measures, or something similar (depending on your particular task). How to objectively evaluate an image is still an ongoing area of research, and these metrics are also commonly used as loss objectives in certain tasks.
I recommend looking at something like Analysis and Evaluation of Image Quality Metrics.
I would recommend you look at the generator metrics over time to see if it's improving, and obviously confirm that visually as well. You can log the metric values or use a visualization tool such as TensorBoard or wandb for this.
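As an illustration of one of the metrics mentioned above, here is a minimal NumPy sketch of PSNR. The images are random placeholders and the 8-bit value range (max_value=255) is an assumption; for real work an established library implementation is probably the safer choice.

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Hypothetical 8-bit images: a reference and a noisy copy of it.
rng = np.random.default_rng(0)
real = rng.integers(0, 256, size=(32, 32, 3))
fake = np.clip(real + rng.normal(0, 5, size=real.shape), 0, 255)
print(f"PSNR: {psnr(real, fake):.2f} dB")
```

Logging a value like this for a fixed set of reference images at regular intervals gives a curve you can track over time, which is easier to interpret than the raw generator and discriminator losses.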

Is it possible to estimate the time needed to train a machine learning model given a size of data and hardware specification?

I am planning to make a small TensorFlow image classification project, which is expected to run on machines with low processing power, and one of the concerns I was asked about was the time needed to train the model.
The project is still in the conception stage and no clear boundaries have been set.
But assuming that we will use TensorFlow for Python, with a simple neural network and a data set of, say, n images, is there a way to estimate or predict the time required to train the model before performing the training, given the hardware in use?
I have asked one of my colleagues who works with neural networks, and he said that maybe we could calculate the time needed by measuring the time for the first epoch and estimating how many epochs are needed afterwards. Is this a valid way? If yes, is it even possible to estimate the number of epochs needed? And in either case, is there a way to calculate it before performing any training?
There is no definite way of finding the number of epochs after which the model converges. It is one of the hyperparameters.
Apart from the type of model you are training, convergence also depends on the distribution of the data and the optimizer you are using.
You can make a rough estimate by looking at the number of parameters in your model, timing one epoch, and getting a rough idea of the number of epochs from "experience". BUT you always have to look at the training and validation loss curves to check for convergence.
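The "time one epoch and extrapolate" idea can be sketched as follows. This is only a rough illustration: dummy_epoch is a hypothetical stand-in for a real training epoch, and the epoch counts are guesses, since the number of epochs to convergence is not known in advance.

```python
import time

def time_one_epoch(run_epoch):
    """Return the wall-clock seconds taken by one call to run_epoch()."""
    start = time.perf_counter()
    run_epoch()
    return time.perf_counter() - start

def estimate_total_seconds(epoch_seconds, expected_epochs):
    """Extrapolate total training time from one measured epoch."""
    return epoch_seconds * expected_epochs

def dummy_epoch():
    # Stand-in for one real training epoch (hypothetical workload).
    sum(i * i for i in range(100_000))

measured = time_one_epoch(dummy_epoch)
for epochs in (10, 50, 100):  # guessed range, since convergence is unknown
    print(f"{epochs} epochs: ~{estimate_total_seconds(measured, epochs):.2f} s")
```

The first epoch often includes one-off setup costs, so timing a later epoch, or averaging a few, may give a more representative figure.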

Memory requirements for back propagation - why not use the mean activation?

I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).
For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.
My question is about the back propagation process. It seems that the process requires an additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, this means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.
Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.
(*) This is according to my own little experiment on the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirements.
The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins, and at the difference between standard SGD and mini-batch SGD, makes it clearer why.
The standard stochastic gradient descent algorithm passes a single sample through the model, then back-propagates its gradients and updates the model weights before repeating the process with the next sample. The main downside is that it is a serial process (you can't run samples simultaneously because each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition, using just a single sample for each update results in a very noisy gradient.
Mini-batch SGD uses the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smoother gradient during training and enables passing multiple samples through the model in parallel. This is the algorithm used when training with keras/tensorflow in mini-batches (commonly called batches, but that term actually refers to batch gradient descent, which is a slightly different algorithm).
I haven't found any work on using the mean of the gradients in each layer for the update. It would be interesting to check the results of such an algorithm. It would be more memory efficient, but it is likely that it would also be less capable of reaching good minimum points.
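One way to see what would be lost by keeping only the mean activation is to look at a single dense layer. The sketch below uses random placeholder values, not data from any real model. The batch-averaged weight gradient of such a layer is the average of per-sample outer products, and that is in general not the outer product of the averages:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 4, 3, 2

inputs = rng.normal(size=(batch_size, n_in))    # activations entering the layer
deltas = rng.normal(size=(batch_size, n_out))   # dLoss/dOutput for each sample

# Batch-averaged weight gradient of a dense layer: mean_i outer(x_i, delta_i).
true_grad = inputs.T @ deltas / batch_size

# Shortcut that keeps only the mean activation and the mean delta.
shortcut_grad = np.outer(inputs.mean(axis=0), deltas.mean(axis=0))

# Largest absolute difference between the two gradients.
print(np.abs(true_grad - shortcut_grad).max())
```

The difference between the two is the covariance between activations and back-propagated errors across the batch, which is exactly the per-sample information the shortcut would discard.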

tf-slim batch norm: different behaviour between training/inference mode

I'm attempting to train a tensorflow model based on the popular slim implementation of mobilenet_v2 and am observing behaviour I cannot explain related (I think) to batch normalization.
Problem Summary
Model performance in inference mode improves initially but starts producing trivial inferences (all near-zeros) after a long period. Good performance continues when run in training mode, even on the evaluation dataset. Evaluation performance is impacted by batch normalization decay/momentum rate... somehow.
More extensive implementation details below, but I'll probably lose most of you with the wall of text, so here are some pictures to get you interested.
The curves below are from a model whose bn_decay parameter I changed during training.
0-370k: bn_decay=0.997 (default)
370k-670k: bn_decay=0.9
670k+: bn_decay=0.5
Loss for (orange) training (in training mode) and (blue) evaluation (in inference mode). Low is good.
Evaluation metric of model on evaluation dataset in inference mode. High is good.
I have attempted to produce a minimal example which demonstrates the issue - classification on MNIST - but have failed (i.e. classification works well and the problem I experience is not exhibited). My apologies for not being able to reduce things further.
Implementation Details
My problem is 2D pose estimation, targeting Gaussians centered at the joint locations. It is essentially the same as semantic segmentation, except rather than using a softmax_cross_entropy_with_logits(labels, logits) I use tf.losses.l2_loss(sigmoid(logits) - gaussian(label_2d_points)) (I use the term "logits" to describe unactivated output of my learned model, though this probably isn't the best term).
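For readers unfamiliar with this kind of target, here is a minimal NumPy sketch of the loss described above for a single joint. The heatmap size, the joint location and sigma are made-up values, the helper names are hypothetical, and the "sum of squares divided by 2" form is an assumption that follows the convention of TensorFlow's tf.nn.l2_loss; the actual model code is not shown in the question.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Return a (height, width) map with a unit-peak Gaussian at center_xy."""
    cx, cy = center_xy
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def half_squared_error(logits, target):
    """Sum of squared differences between sigmoid(logits) and target, halved."""
    return float(np.sum((sigmoid(logits) - target) ** 2) / 2.0)

target = gaussian_heatmap(16, 16, center_xy=(5, 9))
logits = np.full((16, 16), -20.0)  # hypothetical "all near-zero" prediction
print(half_squared_error(logits, target))
```

With an all-near-zero prediction the loss reduces to roughly half the summed squared target, which is why a constant loss value can be a sign of trivial outputs.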
Inference Model
After preprocessing my inputs, my logits function is a scoped call to the base mobilenet_v2 followed by a single unactivated convolutional layer to make the number of filters appropriate.
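import tensorflow as tf
from slim.nets.mobilenet import mobilenet_v2

# is_training, bn_decay and n_joints are defined elsewhere in the model code.
def get_logits(image):
    # The training scope configures batch norm (mode and decay) for the base network.
    with mobilenet_v2.training_scope(
            is_training=is_training, bn_decay=bn_decay):
        base, _ = mobilenet_v2.mobilenet(image, base_only=True)
    # 1x1 convolution with no activation, one output channel per joint.
    logits = tf.layers.conv2d(base, n_joints, 1, 1)
    return logits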
Training Op
I have experimented with tf.contrib.slim.learning.create_train_op as well as a custom training op:
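def get_train_op(optimizer, loss):
    global_step = tf.train.get_or_create_global_step()
    opt_op = optimizer.minimize(loss, global_step)
    # Group the ops in UPDATE_OPS (e.g. batch-norm moving-average updates)
    # with the optimizer step so that they all run on every training iteration.
    update_ops = set(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
    update_ops.add(opt_op)
    return tf.group(*update_ops)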
I'm using tf.train.AdamOptimizer with learning rate=1e-3.
Training Loop
I'm using the tf.estimator.Estimator API for training/evaluation.
Behaviour
Training initially goes well, with an expected sharp increase in performance. This is consistent with my expectations, as the final layer is rapidly trained to interpret the high-level features output by the pretrained base model.
However, after a long period (60k steps with batch_size 8, ~8 hours on a GTX-1070) my model begins to output near-zero values (~1e-11) when run in inference mode, i.e. is_training=False. The exact same model continues to improve when run in training mode, i.e. is_training=True, even on the evaluation set. I have verified this visually.
After some experimentation I changed the bn_decay (batch normalization decay/momentum rate) from the default 0.997 to 0.9 at ~370k steps (also tried 0.99, but that didn't make much of a difference) and observed an immediate improvement in accuracy. Visual inspection of the inference in inference mode showed clear peaks in the inferred values of order ~1e-1 in the expected places, consistent with the location of peaks from training mode (though the values are much lower). This is why the accuracy increases significantly, but the loss - while more volatile - does not improve much.
These effects dropped off after more training and reverted to all zero inference.
I further dropped the bn_decay to 0.5 at step ~670k. This resulted in improvements to both loss and accuracy. I'll likely have to wait until tomorrow to see the long-term effect.
Loss and evaluation metric plots are given below. Note the evaluation metric is based on the argmax of the logits and high is good. Loss is based on the actual values, and low is good. Orange uses is_training=True on the training set, while blue uses is_training=False on the evaluation set. The loss of around 8 is consistent with all zero outputs.
Other notes
I have also experimented with turning off dropout (i.e. always running the dropout layers with is_training=False), and observed no difference.
I have experimented with all versions of tensorflow from 1.7 to 1.10. No difference.
I have trained models from the pretrained checkpoint using bn_decay=0.99 from the start. Same behaviour as using default bn_decay.
Other experiments with a batch size of 16 result in qualitatively identical behaviour (though I can't evaluate and train simultaneously due to memory constraints, hence quantitatively analysing on batch size of 8).
I have trained different models using the same loss and using tf.layers API and trained from scratch. They have worked fine.
Training from scratch (rather than using pretrained checkpoints) results in similar behaviour, though takes longer.
Summary/my thoughts:
I am confident this is not an overfitting/dataset problem. The model makes sensible inferences on the evaluation set when run with is_training=True, both in terms of location of peaks and magnitude.
I am confident this is not a problem with not running update ops. I haven't used slim before, but apart from the use of arg_scope it doesn't look too much different to the tf.layers API which I've used extensively. I can also inspect the moving average values and observe that they are changing as training progresses.
Changing bn_decay values significantly affected the results temporarily. I accept that a value of 0.5 is absurdly low, but I'm running out of ideas.
I have tried swapping out slim.layers.conv2d layers for tf.layers.conv2d with momentum=0.997 (i.e. momentum consistent with default decay value) and behaviour was the same.
Minimal example using pretrained weights and Estimator framework worked for classification of MNIST without modification to bn_decay parameter.
I've looked through issues on both the tensorflow and models github repositories but haven't found much apart from this. I'm currently experimenting with a lower learning rate and a simpler optimizer (MomentumOptimizer), but that's more because I'm running out of ideas rather than because I think that's where the problem lies.
Possible Explanations
The best explanation I have is that my model parameters are rapidly cycling in a manner such that the moving statistics are unable to keep up with the batch statistics. I've never heard of such behaviour, and it doesn't explain why the model reverts to poor behaviour after more time, but it's the best explanation I have.
There may be a bug in the moving average code, but it has worked perfectly for me in every other case, including a simple classification task. I don't want to file an issue until I can produce a simpler example.
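To get a feel for how slowly the moving statistics follow the batch statistics at different decay values (the first explanation above), here is a small pure-Python sketch. It assumes the usual exponential moving average update, moving = decay * moving + (1 - decay) * batch_value, and a batch statistic that suddenly jumps from 0 to 1; the function name and the 5% tolerance are made up for illustration.

```python
def steps_to_track(decay, target=1.0, tolerance=0.05, start=0.0):
    """Count updates until an exponential moving average gets within
    `tolerance` of a batch statistic that jumped from `start` to `target`."""
    moving, steps = start, 0
    while abs(moving - target) > tolerance:
        # Same update form as a batch-norm moving mean/variance.
        moving = decay * moving + (1.0 - decay) * target
        steps += 1
    return steps

for decay in (0.997, 0.9, 0.5):
    print(decay, steps_to_track(decay))
```

With the default decay of 0.997 the moving average needs on the order of a thousand steps to catch up with a shift in the batch statistics, versus a handful of steps at 0.5, so a low decay can mask a mismatch that a high decay exposes.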
Anyway, I'm running out of ideas, the debug cycle is long, and I've already spent too much time on this. Happy to provide more details or run experiments on demand. Also happy to post more code, though I'm worried that'll scare more people off.
Thanks in advance.
Both lowering the learning rate to 1e-4 with Adam and using Momentum optimizer (with learning_rate=1e-3 and momentum=0.9) resolved this issue. I also found this post which suggests the problem spans multiple frameworks and is an undocumented pathology of some networks due to the interaction between optimizer and batch-normalization. I do not believe it is a simple case of the optimizer failing to find a suitable minimum due to the learning rate being too high (otherwise performance in training mode would be poor).
I hope that helps others experiencing the same issue, but I'm a long way from satisfied. I'm definitely happy to hear other explanations.

One class classification - interpreting the models accuracy

I am using LIBSVM for classification of data. I am mainly doing One Class Classification.
My training set consists of data of only one class, and my testing data consists of data of two classes (one which belongs to the target class and one which doesn't).
After applying svmtrain and svmpredict to both the training and testing datasets, the accuracy is 48% for the training set and 34.72% for the testing set.
Is it good? How can I know whether LIBSVM is classifying the datasets correctly?
Whether it is good or not depends entirely on the data you are trying to classify. You should look up the state-of-the-art accuracy of SVM models for your kind of classification; then you will be able to tell whether your model is good or not.
What I can say from your results is that the testing accuracy is worse than the training accuracy, which is normal, as a classifier usually performs better on data it has already seen.
What you can try now is to play with the regularization parameter (C if you are using a linear kernel) and see if the performance improves on the testing set.
You can also plot learning curves to see whether your classifier overfits, which will help you decide whether you need to increase or decrease the regularization.
In your case, you might want to apply weighting to the classes, as the data is often skewed in favor of negative examples.
To know whether LIBSVM is classifying the dataset correctly, you can look at which examples it predicted correctly and which ones it predicted incorrectly. Then you can try to change your features to improve its results.
If you are worried about your code being correct, you can try to code a toy example and play with it, or take an example from someone on the web and replicate their results.
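Looking at which examples were predicted correctly and incorrectly can start with a simple per-class breakdown rather than a single accuracy number. The sketch below assumes labels of +1 for the target class and -1 for everything else, which is how one-class SVM predictions are usually encoded; the label lists are made up and the function name is hypothetical.

```python
def per_class_report(true_labels, predicted_labels):
    """Count outcomes for labels in {+1 (target), -1 (outlier)}."""
    counts = {"tp": 0, "fn": 0, "tn": 0, "fp": 0}
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1:
            counts["tp" if pred == 1 else "fn"] += 1
        else:
            counts["tn" if pred == -1 else "fp"] += 1
    n_target = counts["tp"] + counts["fn"]
    n_outlier = counts["tn"] + counts["fp"]
    # Fraction of each true class that was predicted correctly.
    counts["target_recall"] = counts["tp"] / n_target if n_target else float("nan")
    counts["outlier_recall"] = counts["tn"] / n_outlier if n_outlier else float("nan")
    return counts

# Hypothetical labels and predictions.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, -1, 1, -1, 1, 1, -1, -1]
print(per_class_report(y_true, y_pred))
```

A low overall accuracy can come from rejecting too many target samples, accepting too many outliers, or both, and the two recall values show which of these is happening.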