I have a question about TensorFlow's estimators in tf.estimator, in particular DNNClassifier. It says in the documentation:
max_steps: Number of total steps for which to train model. If None, train forever or train until input_fn generates the OutOfRange error or StopIteration exception
In the doc on datasets for estimators it mentions that for training you need to use the shuffle(), repeat(), and batch() methods, so that the iterator on the dataset does not stop after it has gone through the data once.
Does this mean that the pre-made estimators such as DNNClassifier have no stopping criterion based on the learning rate or changes in the loss? Is it really the case you can only have these models stop training based on how you specify the input function or by giving a maximum number of steps?
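The two stopping conditions quoted from the documentation can be illustrated with a plain-Python sketch. This is not the Estimator implementation: `train`, `input_fn`, and `step_fn` are hypothetical names, and an exhausted Python iterator stands in for the OutOfRange error.

```python
import itertools


def train(input_fn, step_fn, max_steps=None):
    """Run step_fn on batches until max_steps is reached or the input runs out."""
    steps = 0
    batches = input_fn()
    while max_steps is None or steps < max_steps:
        try:
            batch = next(batches)
        except StopIteration:
            # Input exhausted: the analogue of the OutOfRange error.
            break
        step_fn(batch)
        steps += 1
    return steps


data = [1, 2, 3, 4, 5]


def one_pass():
    # Yields each element once, then stops (no repeat()).
    return iter(data)


def repeated():
    # Never stops on its own (analogue of repeat()).
    return itertools.cycle(data)


def noop(batch):
    # Placeholder for a real training step.
    return None


steps_one_pass = train(one_pass, noop)                # stops after 5 steps
steps_repeated = train(repeated, noop, max_steps=12)  # stops after 12 steps
```

With a repeating input and max_steps=None, this loop would never terminate, which is the "train forever" case in the quoted documentation.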
TensorFlow will not presume to know at what learning rate or loss it should stop. This is reasonable because those thresholds are problem-dependent. You could reasonably argue that it could infer a sensible limit based on round-off error for the given data types (if they're consistent, e.g. float32), but many problems should be stopped earlier than that. So there is no sensible, broadly applicable default.
However, you can control this behaviour yourself using callbacks. TensorFlow includes the EarlyStopping callback (tf.keras.callbacks.EarlyStopping). You can find the (Python) documentation for it here.
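To show what a loss-based stopping criterion does, here is a minimal plain-Python sketch of patience-based early stopping. It is not TensorFlow's implementation: `EarlyStopper` and `should_stop` are hypothetical names, and the `patience` and `min_delta` parameters are only modelled loosely on the arguments of the same name in tf.keras.callbacks.EarlyStopping.

```python
class EarlyStopper:
    """Signal a stop when the loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Made-up loss values, one per evaluation.
losses = [1.0, 0.8, 0.7, 0.71, 0.70, 0.72, 0.69]
stopper = EarlyStopper(patience=3, min_delta=0.0)
stopped_at = None
for step, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = step
        break
```

Both the patience and the minimum improvement have to be chosen for the problem at hand, which is the point made above about there being no broadly applicable default.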
I'm training a TensorFlow Keras Model using Model.fit(). I'm also using callbacks to log my training accuracy metrics after every batch using TensorFlow's on_train_batch_end() syntax. In addition, I'm using another callback to run Model.evaluate() every 1,000 batches to compute validation set accuracy and update the logs dict passed around the callbacks during Model.fit().
Looking at the logged metrics vs. batch number shows very perplexing results. After the Model.evaluate() run, the training accuracy experiences a significant 'jolt', initially triggering a rapid increase in the logged training accuracy and subsequently a significant drop in training accuracy, followed by a slower recovery (see attached images).
My guess is that it's something to do with the Model.evaluate()'s call to reset_metrics(), which loops through and calls the reset_states() method on each metric. I can't work out what reset_states() is doing and if this is relevant to the behaviour I'm observing. It seems to relate to the Mean parent class of CategoricalAccuracy. I haven't been able to find anything helpful in the TensorFlow docs yet.
Are the metrics shown during Model.fit() actually some form of moving averages rather than the batch-wise metric? In that case, the reset_states() method would be resetting the moving average, possibly producing the jolting behaviour.
Can anyone with a better grasp of TensorFlow's inner workings help?
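If the logged metrics are running means accumulated since the last reset, as the question suspects, the effect of a reset can be sketched in plain Python. `RunningMean` below is a hypothetical stand-in that only mimics the total/count bookkeeping of a mean-style metric. It is not the Keras class, and the batch accuracies are made up.

```python
class RunningMean:
    """Keeps a total and a count; result() is the mean since the last reset."""

    def __init__(self):
        self.reset_states()

    def reset_states(self):
        self.total = 0.0
        self.count = 0

    def update_state(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count if self.count else 0.0


metric = RunningMean()
batch_accuracies = [0.2, 0.4, 0.6, 0.8]  # made-up per-batch values
logged = []
for i, acc in enumerate(batch_accuracies):
    if i == 3:
        # Stand-in for a reset triggered by an evaluation run in between.
        metric.reset_states()
    metric.update_state(acc)
    logged.append(metric.result())
```

Before the reset the logged value lags behind the improving per-batch accuracy (about 0.4 at the third batch, although that batch scored 0.6). Right after the reset it jumps to the latest batch value (0.8) and is then averaged over very few batches, so it is both higher and noisier for a while.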
I'm attempting to train a tensorflow model based on the popular slim implementation of mobilenet_v2 and am observing behaviour I cannot explain related (I think) to batch normalization.
Problem Summary
Model performance in inference mode improves initially but starts producing trivial inferences (all near-zeros) after a long period. Good performance continues when run in training mode, even on the evaluation dataset. Evaluation performance is impacted by batch normalization decay/momentum rate... somehow.
More extensive implementation details below, but I'll probably lose most of you with the wall of text, so here are some pictures to get you interested.
The curves below are from a model whose bn_decay parameter I tweaked during training.
0-370k: bn_decay=0.997 (default)
370k-670k: bn_decay=0.9
670k+: bn_decay=0.5
Loss for (orange) training (in training mode) and (blue) evaluation (in inference mode). Low is good.
Evaluation metric of model on evaluation dataset in inference mode. High is good.
I have attempted to produce a minimal example which demonstrates the issue - classification on MNIST - but have failed (i.e. classification works well and the problem I experience is not exhibited). My apologies for not being able to reduce things further.
Implementation Details
My problem is 2D pose estimation, targeting Gaussians centered at the joint locations. It is essentially the same as semantic segmentation, except rather than using a softmax_cross_entropy_with_logits(labels, logits) I use tf.nn.l2_loss(sigmoid(logits) - gaussian(label_2d_points)) (I use the term "logits" to describe the unactivated output of my learned model, though this probably isn't the best term).
Inference Model
After preprocessing my inputs, my logits function is a scoped call to the base mobilenet_v2 followed by a single unactivated convolutional layer to make the number of filters appropriate.
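import tensorflow as tf
from slim.nets.mobilenet import mobilenet_v2

def get_logits(image):
    # is_training, bn_decay and n_joints are defined elsewhere in the script.
    with mobilenet_v2.training_scope(
            is_training=is_training, bn_decay=bn_decay):
        base, _ = mobilenet_v2.mobilenet(image, base_only=True)
    # 1x1 convolution without activation: one output channel per joint.
    logits = tf.layers.conv2d(base, n_joints, 1, 1)
    return logits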
Training Op
I have experimented with tf.contrib.slim.learning.create_train_op as well as a custom training op:
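def get_train_op(optimizer, loss):
    global_step = tf.train.get_or_create_global_step()
    opt_op = optimizer.minimize(loss, global_step)
    # Group the collected update ops (e.g. batch-norm moving averages)
    # with the optimizer step so they all run on each training step.
    update_ops = set(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
    update_ops.add(opt_op)
    return tf.group(*update_ops)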
I'm using tf.train.AdamOptimizer with learning rate=1e-3.
Training Loop
I'm using the tf.estimator.Estimator API for training/evaluation.
Behaviour
Training initially goes well, with an expected sharp increase in performance. This is consistent with my expectations, as the final layer is rapidly trained to interpret the high-level features output by the pretrained base model.
However, after a long period (60k steps with batch_size 8, ~8 hours on a GTX-1070) my model begins to output near-zero values (~1e-11) when run in inference mode, i.e. is_training=False. The exact same model continues to improve when run in training mode, i.e. is_training=True, even on the evaluation set. I have verified this visually.
After some experimentation I changed the bn_decay (batch normalization decay/momentum rate) from the default 0.997 to 0.9 at ~370k steps (also tried 0.99, but that didn't make much of a difference) and observed an immediate improvement in accuracy. Visual inspection of the inference in inference mode showed clear peaks in the inferred values of order ~1e-1 in the expected places, consistent with the location of peaks from training mode (though with much lower values). This is why the accuracy increases significantly, but the loss, while more volatile, does not improve much.
These effects dropped off after more training and reverted to all zero inference.
I further dropped the bn_decay to 0.5 at step ~670k. This resulted in improvements to both loss and accuracy. I'll likely have to wait until tomorrow to see the long-term effect.
Plots of the loss and an evaluation metric are given below. Note the evaluation metric is based on the argmax of the logits and high is good. Loss is based on the actual values, and low is good. Orange uses is_training=True on the training set, while blue uses is_training=False on the evaluation set. The loss of around 8 is consistent with all zero outputs.
Other notes
I have also experimented with turning off dropout (i.e. always running the dropout layers with is_training=False), and observed no difference.
I have experimented with all versions of tensorflow from 1.7 to 1.10. No difference.
I have trained models from the pretrained checkpoint using bn_decay=0.99 from the start. Same behaviour as using default bn_decay.
Other experiments with a batch size of 16 result in qualitatively identical behaviour (though I can't evaluate and train simultaneously due to memory constraints, hence quantitatively analysing on batch size of 8).
I have trained different models using the same loss and using tf.layers API and trained from scratch. They have worked fine.
Training from scratch (rather than using pretrained checkpoints) results in similar behaviour, though takes longer.
Summary/my thoughts:
I am confident this is not an overfitting/dataset problem. The model makes sensible inferences on the evaluation set when run with is_training=True, both in terms of location of peaks and magnitude.
I am confident this is not a problem with not running update ops. I haven't used slim before, but apart from the use of arg_scope it doesn't look too much different to the tf.layers API which I've used extensively. I can also inspect the moving average values and observe that they are changing as training progresses.
Changing bn_decay values significantly affected the results temporarily. I accept that a value of 0.5 is absurdly low, but I'm running out of ideas.
I have tried swapping out slim.layers.conv2d layers for tf.layers.conv2d with momentum=0.997 (i.e. momentum consistent with default decay value) and behaviour was the same.
Minimal example using pretrained weights and Estimator framework worked for classification of MNIST without modification to bn_decay parameter.
I've looked through issues on both the tensorflow and models github repositories but haven't found much apart from this. I'm currently experimenting with a lower learning rate and a simpler optimizer (MomentumOptimizer), but that's more because I'm running out of ideas rather than because I think that's where the problem lies.
Possible Explanations
The best explanation I have is that my model parameters are rapidly cycling in a manner such that the moving statistics are unable to keep up with the batch statistics. I've never heard of such behaviour, and it doesn't explain why the model reverts to poor behaviour after more time, but it's the best explanation I have.
There may be a bug in the moving average code, but it has worked perfectly for me in every other case, including a simple classification task. I don't want to file an issue until I can produce a simpler example.
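The "moving statistics cannot keep up" hypothesis above can be made concrete with a small plain-Python sketch of the exponential-moving-average update that batch normalization applies to its moving mean and variance. `update_moving` and `steps_to_track` are hypothetical helpers, and the step change from 0 to 1 in the batch statistic is an idealised assumption rather than anything measured on the model in question.

```python
def update_moving(moving, batch_value, decay):
    """One exponential-moving-average update of a batch-norm statistic."""
    return decay * moving + (1.0 - decay) * batch_value


def steps_to_track(decay, tolerance=0.01):
    """Count updates until the moving value is within `tolerance` of a
    batch statistic that jumped from 0 to 1 and then stayed there."""
    moving, target, steps = 0.0, 1.0, 0
    while abs(target - moving) > tolerance:
        moving = update_moving(moving, target, decay)
        steps += 1
    return steps


# Number of updates needed for each decay value tried above.
lag = {decay: steps_to_track(decay) for decay in (0.997, 0.9, 0.5)}
```

Under this assumption the moving value needs roughly 1,500 updates to catch up at decay 0.997, 44 at 0.9, and 7 at 0.5. That is at least consistent with lower decay values temporarily improving inference-mode results if the batch statistics are drifting quickly, though it does not explain the later relapse.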
Anyway, I'm running out of ideas, the debug cycle is long, and I've already spent too much time on this. Happy to provide more details or run experiments on demand. Also happy to post more code, though I'm worried that'll scare more people off.
Thanks in advance.
Both lowering the learning rate to 1e-4 with Adam and using Momentum optimizer (with learning_rate=1e-3 and momentum=0.9) resolved this issue. I also found this post which suggests the problem spans multiple frameworks and is an undocumented pathology of some networks due to the interaction between optimizer and batch-normalization. I do not believe it is a simple case of the optimizer failing to find a suitable minimum due to the learning rate being too high (otherwise performance in training mode would be poor).
I hope that helps others experiencing the same issue, but I'm a long way from satisfied. I'm definitely happy to hear other explanations.
I'm learning TensorFlow's Estimator API. Doing a classification task on a small dataset with TensorFlow's tf.contrib.learn.DNNClassifier (I know there is tf.estimator.DNNClassifier but I have to work on TensorFlow 1.2), I get the following accuracy graph on my test dataset. I wonder why there are these negative peaks.
I thought they could occur because of overfitting and self repairing. The next datapoint after the peak seems to have the same value as the point before.
I tried to look into the code to find any proof that estimator's train function has such a mechanism but did not find any.
So, is there such a mechanism or are there other possible explanations?
I don't think that the Estimator's train function has any such mechanism.
Some possible theories:
Does your training restart at any point? It's possible that if you have an exponential moving average (EMA) in your model, the moving average has to be recomputed upon restart.
Is your input data randomized? If not, it's possible that a patch of input data is all misclassified, and again the EMA then smooths it out.
This is pretty mysterious to me. If you do find out what the real issue is please do share!
I am setting up an object detection pipeline based on the recently released TensorFlow Object Detection API. I am using the arXiv paper as guidance. I am looking to understand the points below for training on my own dataset.
1. It is not clear how they selected the learning rate schedules. How does the learning rate schedule change based on the number of GPUs available for training? The paper mentions that 9 GPUs were used. How should I change the learning rate if I only want to use 1 GPU?
2. The released sample training config file for Pascal VOC using Faster R-CNN has initial learning rate = 0.0001. This is 10x lower than what was published in the original Faster R-CNN paper. Is this due to an assumption about the number of GPUs available for training, or due to a different reason?
3. When I start training from the COCO detection checkpoint, how should the training loss decrease? Looking at TensorBoard, on my dataset the training loss is low: between 0.8 and 1.2 per iteration (with batch size of 1). The image below shows the various losses from TensorBoard. Is this expected behavior?
For questions 1 and 2: our implementation differs in a few small details compared to the original paper and internally we train all of our detectors with asynchronous SGD with ~10 GPUs. Our learning rates are calibrated for this setting (which you will also have if you decide to train via Cloud ML Engine as in the Pets walkthrough). If you use another setting, you will have to do a bit of hyperparameter exploration. For a single GPU, leaving the learning rate alone probably won't hurt performance, but you may be able to get faster convergence by increasing it.
For question 3: Training losses decrease erratically and you can only see the decrease if you smooth the plots quite a bit over time. Moreover, it's hard to explicitly say how well you are doing with respect to eval metrics just by looking at the training losses. I recommend looking at the mAP plots over time as well as the image visualizations to really get an idea of whether your model has "lifted off".
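To illustrate the smoothing mentioned above, here is a plain-Python sketch of exponential smoothing applied to a noisy, slowly decreasing loss curve. The `smooth` helper and the synthetic `raw` series are assumptions made for illustration. The idea is similar in spirit to the smoothing slider in TensorBoard, but it is not TensorBoard's code.

```python
import random


def smooth(values, weight=0.9):
    """Exponentially smooth a series; a higher weight smooths more heavily."""
    smoothed = []
    last = values[0]
    for v in values:
        last = weight * last + (1.0 - weight) * v
        smoothed.append(last)
    return smoothed


random.seed(0)
# Synthetic loss: a slow downward trend buried in per-step noise.
raw = [1.2 - 0.0004 * i + random.uniform(-0.3, 0.3) for i in range(1000)]
smoothed = smooth(raw, weight=0.98)
```

In the raw series the step-to-step noise is far larger than the per-step decrease, so the trend is only visible in the smoothed series.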
Hope this helps.
I'm trying to implement a fully convolutional network and train it on the Pascal VOC dataset. However, after reading up on the labels in the set, I see that I need to somehow ignore the "void" label. In Caffe, the softmax function has an argument to ignore labels, so I'm wondering what the mechanism is, so that I can implement something similar in TensorFlow.
Thanks
In tensorflow you're feeding the data in feed_dict right? Generally you'd want to just pre-process the data and remove the unwanted samples - don't give them to tensorflow for processing.
My preferred approach is a producer-consumer model where you fire up a TensorFlow queue and load it with samples from a loader thread, which simply skips enqueuing your void samples.
When training, dequeue the samples inside the model (you don't use feed_dict in the optimize step). This way you're not bothering to write out a whole new dataset with the specific preprocessing step you're interested in today (tomorrow you're likely to find you want to do some other preprocessing step).
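The producer-consumer pattern described above can be sketched with Python's standard queue and threading modules instead of a TensorFlow queue. Everything here is an assumption made for illustration: the `loader` function, the `_DONE` sentinel, the toy `samples`, and the use of 255 as a per-sample "void" marker.

```python
import queue
import threading

VOID = 255        # assumed marker for samples that should be skipped
_DONE = object()  # sentinel telling the consumer that loading has finished


def loader(samples, q):
    """Producer: enqueue every sample except those labelled VOID."""
    for features, label in samples:
        if label == VOID:
            continue  # never hand void samples to the training side
        q.put((features, label))
    q.put(_DONE)


samples = [("a", 0), ("b", VOID), ("c", 1), ("d", VOID), ("e", 2)]
q = queue.Queue(maxsize=2)  # bounded, so the loader cannot run far ahead
thread = threading.Thread(target=loader, args=(samples, q))
thread.start()

consumed = []
while True:
    item = q.get()
    if item is _DONE:
        break
    consumed.append(item)  # a training step would run here
thread.join()
```

The filtering lives entirely in the loader, so swapping in a different preprocessing rule later only touches that one function and the stored dataset stays unchanged.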
As a side comment, I think tensorflow is a little more do-it-yourself than some other frameworks. But I tend to like that, it abstracts enough to be convenient, but not so much that you don't understand what's happening. When you implement it you understand it, that's the motto that comes to mind with tensorflow.