Why does setting training=True in tf.keras.layers.Dropout during testing lead to lower training loss and higher prediction accuracy? - tensorflow

I'm using dropout layers in my model implemented in TensorFlow (tf.keras.layers.Dropout). I set training=True during training and training=False while testing. The performance is poor. I accidentally set training=True during testing too, and the results got much better. What is happening, and why does it affect the training loss values? I'm not making any changes to the training, and the whole testing process happens after training. However, setting training=True in testing affects the training process, bringing the training loss closer to zero, and the testing results are then better. Any possible explanation?
Thanks,

Sorry for the late response, but the answer from Celius is not quite correct.
The training parameter of the Dropout Layer (and for the BatchNormalization layer as well) defines whether this layer should behave in training or inference mode. You can read this in the official documentation.
However, the documentation is a bit unclear on how this affects the execution of your network. Setting training=False does not mean that the Dropout layer is not part of your network. Contrary to Celius's explanation, it is by no means ignored; it just behaves in inference mode. For Dropout, this means that no dropout will be applied. For BN, it means that BN will use the parameters estimated during training instead of computing new parameters for every mini-batch. The other way around, if you set training=True, the layer will behave in training mode and dropout will be applied.
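To make the two modes concrete, here is a minimal pure-Python sketch of inverted dropout, the scheme the Keras documentation describes (units are zeroed with probability `rate` and the survivors are scaled by 1/(1 - rate)). It is not TensorFlow's implementation; the function name `dropout` and its `rng` argument are made up for illustration.

```python
import random


def dropout(values, rate, training, rng=None):
    """Inverted dropout on a list of floats (illustrative sketch only)."""
    if not training or rate == 0.0:
        # Inference mode: the layer is still part of the network,
        # it simply passes its input through unchanged.
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    # Training mode: zero each unit with probability `rate` and rescale
    # the survivors so the expected sum stays the same.
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


x = [1.0] * 10
inference_out = dropout(x, rate=0.5, training=False)
training_out = dropout(x, rate=0.5, training=True, rng=random.Random(0))
print(inference_out)  # identical to x
print(training_out)   # a mix of 0.0 and 2.0
```

With training=True at test time, information is thrown away at random, which is why better test results in that mode are surprising.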
Now to your question: The behavior of your network does not make sense. If dropout was applied on unseen data, there is nothing to learn from that. You only throw away information, hence your results should be worse. But I think your problem is not related to the Dropout layer anyway. Does your network also make use of BatchNormalization layers? If BN is applied in a poor way, it can mess up your final results. But I haven't seen any code, so it is hard to fully answer your question as is.

Related

RNN/GRU Increasing validation loss but decreasing mean absolute error

I am new to deep learning and I try to implement an RNN (with 2 GRU layers).
At first, the network seems to do its job quite well. However, I am currently trying to understand the loss and accuracy curves. I attached the pictures below. The dark-blue line is the training set and the cyan line is the validation set.
After 50 epochs the validation loss increases. My assumption is that this indicates overfitting. However, I am unsure why the validation mean absolute error still decreases. Do you have any idea why?
One idea I had in mind was that this could be caused by some big outliers in my dataset, so I already tried to clean it up. I also tried to scale it properly, and I added a few dropout layers for further regularization (rate=0.2). However, these are just normal dropout layers because cuDNN does not seem to support recurrent_dropout from tensorflow.
Remark: I am using the negative log-likelihood as loss function and a tensorflow probability distribution as the output dense layer.
Any hints what I should investigate?
Thanks in advance
Edit: I also attached the non-probabilistic plot as recommended in the comment. It seems that here the mean absolute error behaves normally (it does not improve all the time).
What are the outputs of your model? It sounds pretty strange that you're using the negative log-likelihood (which basically "works" with distributions) as the loss function but the MAE as a metric, which is suited for deterministic continuous values.
I don't know what your task is, and perhaps this is meaningful in your specific case, but perhaps the strange behavior comes from there.
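One way the two curves can diverge is sketched below, assuming the output distribution is a Normal with a predicted mean and standard deviation (the question does not say which distribution is used) and that the MAE is computed on the predicted mean. The numbers are invented: if the means get closer to the targets while the predicted standard deviation becomes too small, the MAE drops but the negative log-likelihood rises.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under Normal(mu, sigma)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)


def evaluate(targets, predictions):
    """Return (MAE of the predicted means, mean NLL)."""
    n = len(targets)
    mae = sum(abs(y - mu) for y, (mu, _) in zip(targets, predictions)) / n
    nll = sum(gaussian_nll(y, mu, s) for y, (mu, s) in zip(targets, predictions)) / n
    return mae, nll


targets = [0.0, 1.0, 2.0, 3.0]
# Hypothetical earlier epoch: means off by 1.0, honest uncertainty.
early = [(y + 1.0, 1.0) for y in targets]
# Hypothetical later epoch: means off by only 0.5, but overconfident sigma.
late = [(y + 0.5, 0.1) for y in targets]

early_mae, early_nll = evaluate(targets, early)
late_mae, late_nll = evaluate(targets, late)
print(f"early: MAE={early_mae:.3f} NLL={early_nll:.3f}")
print(f"late:  MAE={late_mae:.3f} NLL={late_nll:.3f}")
```

Whether this is what happens in the asker's model cannot be told from the question; plotting the predicted standard deviation on the validation set would be one way to check.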

Strategies for pre-training models for use in tfjs

This is a more general version of a question I've already asked: Significant difference between outputs of deep tensorflow keras model in Python and tensorflowjs conversion
As far as I can tell, the layers of a tfjs model when run in the browser (so far only tested in Chrome and Firefox) will have small numerical differences in the output values when compared to the same model run in Python or Node. The cumulative effect of these small differences across all the layers of the model can cause fairly significant differences in the output. See here for an example of this.
This means a model trained in Python or Node will not perform as well in terms of accuracy when run in the browser. And the deeper your model, the worse it will get.
Therefore my question is, what is the best way to train a model to use with tfjs in the browser? Is there a way to ensure the output will be identical? Or do you just have to accept that there will be small numerical differences and, if so, are there any methods that can be used to train a model to be more resilient to this?
This answer is based on my personal observations. As such, it is debatable and not backed by much evidence. Some things that I follow to get the accuracy of 16-bit models close to that of 32-bit models are:
Avoid using activations that have small upper and lower bounds, such as sigmoid or tanh, for hidden layers. These activations cause the weights of the next layer to become very sensitive to small values, and hence, small changes. I prefer using ReLU for such models. Since it is now the standard activation for hidden layers in most models, you should be using it in any case.
Avoid weight decay and L1/L2 regularization on weights while training (the kernel_regularizer parameter in keras), since these increase the sensitivity of the weights. Use Dropout instead; I didn't observe a major drop in performance on TFLite when using it instead of numerical regularizers.
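The accumulation of small per-layer differences described in the question can be simulated roughly as follows. This is only a sketch under assumptions: the reduced-precision backend is imitated by rounding every activation to IEEE half precision with Python's struct module, the network is a stack of random dense ReLU layers, and the names `to_float16` and `layer_drift` are hypothetical. It says nothing about what a particular browser or tfjs backend actually does.

```python
import math
import random
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def layer_drift(depth=10, width=8, seed=0):
    """Max absolute difference between full- and reduced-precision activations, per layer."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / width)  # keeps activations at a moderate scale
    x_ref = [rng.uniform(-1.0, 1.0) for _ in range(width)]
    x_low = list(x_ref)
    drift = []
    for _ in range(depth):
        weights = [[rng.uniform(-limit, limit) for _ in range(width)]
                   for _ in range(width)]
        # Both paths share the same weights; only the activations are rounded.
        x_ref = [max(0.0, sum(w * v for w, v in zip(row, x_ref)))
                 for row in weights]
        x_low = [to_float16(max(0.0, sum(w * v for w, v in zip(row, x_low))))
                 for row in weights]
        drift.append(max(abs(a - b) for a, b in zip(x_ref, x_low)))
    return drift


drift = layer_drift()
for i, d in enumerate(drift, start=1):
    print(f"layer {i:2d}: max abs difference = {d:.2e}")
```

Running the same comparison on a real model (Python output versus browser output, layer by layer) would show where the difference actually starts to matter.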

Unusual behavior of ADAM optimizer with AMSGrad

I am trying some 1, 2, and 3 layer LSTM networks to classify the land cover of some selected pixels from Landsat time-series spectral data. I tried different optimizers (as implemented in Keras) to see which of them is better, and generally found the AMSGrad variant of ADAM doing a relatively better job in my case. However, one thing that seems strange to me is that for the AMSGrad variant, the training and test accuracies start at a relatively high value from the first epoch (instead of increasing gradually) and change only slightly after that, as you see in the graph below.
Performance of ADAM optimizer with AMSGrad on
Performance of ADAM optimizer with AMSGrad off
I have not seen this behavior in any other optimizer. Does it show a problem in my experiment? What can be the explanation for this phenomenon?
Pay attention to the number of LSTM layers. They are notorious for easily overfitting the data. Try a smaller model initially (fewer layers), and gradually increase the number of units in a layer. If you still get poor results, then try adding another LSTM layer, but only after the previous step has been done.
As for the optimizers, I have to admit I have never used AMSGrad. However, the accuracy plot does seem to be much better with AMSGrad off. You can see that when you use AMSGrad the accuracy on the training set is much better than that on the test set, which is a strong sign of overfitting.
Remember to keep things simple, experiment with simple models and generic optimizers.
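For readers who have not used the variant either, the sketch below shows the one place where AMSGrad differs from plain Adam: it divides by the running maximum of the second-moment estimate instead of the current estimate, so the effective step size cannot grow back after a large gradient. This is a scalar illustration of the published update rule with Adam-style bias correction; the function name `run_optimizer` is made up and the Keras implementation may differ in details.

```python
import math


def run_optimizer(grads, amsgrad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the per-step updates applied to a single scalar parameter."""
    m = v = v_max = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
        v_max = max(v_max, v)                      # only used by AMSGrad
        m_hat = m / (1.0 - beta1 ** t)
        v_used = v_max if amsgrad else v
        v_hat = v_used / (1.0 - beta2 ** t)
        steps.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


# One large gradient followed by many small ones (invented sequence).
grads = [10.0] + [0.1] * 200
adam_steps = run_optimizer(grads, amsgrad=False)
ams_steps = run_optimizer(grads, amsgrad=True)
print(f"last Adam step:    {adam_steps[-1]:.3e}")
print(f"last AMSGrad step: {ams_steps[-1]:.3e}")
```

The sketch does not by itself explain the accuracy curves in the question; it only shows that AMSGrad takes steps that are never larger than Adam's for the same gradient history.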

tf-slim batch norm: different behaviour between training/inference mode

I'm attempting to train a tensorflow model based on the popular slim implementation of mobilenet_v2 and am observing behaviour I cannot explain related (I think) to batch normalization.
Problem Summary
Model performance in inference mode improves initially but starts producing trivial inferences (all near-zeros) after a long period. Good performance continues when run in training mode, even on the evaluation dataset. Evaluation performance is impacted by batch normalization decay/momentum rate... somehow.
More extensive implementation details below, but I'll probably lose most of you with the wall of text, so here are some pictures to get you interested.
The curves below are from a model whose bn_decay parameter I changed during training.
0-370k: bn_decay=0.997 (default)
370k-670k: bn_decay=0.9
670k+: bn_decay=0.5
Loss for (orange) training (in training mode) and (blue) evaluation (in inference mode). Low is good.
Evaluation metric of model on evaluation dataset in inference mode. High is good.
I have attempted to produce a minimal example which demonstrates the issue - classification on MNIST - but have failed (i.e. classification works well and the problem I experience is not exhibited). My apologies for not being able to reduce things further.
Implementation Details
My problem is 2D pose estimation, targeting Gaussians centered at the joint locations. It is essentially the same as semantic segmentation, except rather than using a softmax_cross_entropy_with_logits(labels, logits) I use tf.nn.l2_loss(sigmoid(logits) - gaussian(label_2d_points)) (I use the term "logits" to describe unactivated output of my learned model, though this probably isn't the best term).
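A small pure-Python sketch of this kind of loss is given below, assuming the loss is half the sum of squared differences (the convention of TensorFlow's l2_loss op). The heatmap size, the standard deviation and the helper names are hypothetical; the question does not state them. It shows that when the sigmoid outputs collapse to around 1e-11, the loss becomes a constant determined only by the targets.

```python
import math


def gaussian_heatmap(height, width, center, sigma):
    """2D Gaussian with peak value 1.0 at `center` (row, col)."""
    cy, cx = center
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def l2_loss(logits, target):
    """Half the sum of squared differences between sigmoid(logits) and target."""
    return 0.5 * sum((sigmoid(z) - t) ** 2
                     for z_row, t_row in zip(logits, target)
                     for z, t in zip(z_row, t_row))


target = gaussian_heatmap(height=32, width=32, center=(16, 16), sigma=2.0)
# Logits of -25 give sigmoid outputs around 1e-11, like the collapsed model.
collapsed_logits = [[-25.0] * 32 for _ in range(32)]
collapsed_loss = l2_loss(collapsed_logits, target)
target_energy = 0.5 * sum(t * t for row in target for t in row)
print(collapsed_loss, target_energy)
```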
Inference Model
After preprocessing my inputs, my logits function is a scoped call to the base mobilenet_v2 followed by a single unactivated convolutional layer to make the number of filters appropriate.
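    import tensorflow as tf  # TF 1.x API
    from slim.nets.mobilenet import mobilenet_v2

    def get_logits(image):
        # is_training, bn_decay and n_joints are assumed to be defined
        # in the enclosing scope.
        with mobilenet_v2.training_scope(
                is_training=is_training, bn_decay=bn_decay):
            base, _ = mobilenet_v2.mobilenet(image, base_only=True)
        # Single unactivated 1x1 convolution: one output channel per joint.
        logits = tf.layers.conv2d(base, n_joints, 1, 1)
        return logits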
Training Op
I have experimented with tf.contrib.slim.learning.create_train_op as well as a custom training op:
    def get_train_op(optimizer, loss):
        global_step = tf.train.get_or_create_global_step()
        opt_op = optimizer.minimize(loss, global_step)
        update_ops = set(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
        update_ops.add(opt_op)
        return tf.group(*update_ops)
I'm using tf.train.AdamOptimizer with learning rate=1e-3.
Training Loop
I'm using the tf.estimator.Estimator API for training/evaluation.
Behaviour
Training initially goes well, with an expected sharp increase in performance. This is consistent with my expectations, as the final layer is rapidly trained to interpret the high-level features output by the pretrained base model.
However, after a long period (60k steps with batch_size 8, ~8 hours on a GTX-1070) my model begins to output near-zero values (~1e-11) when run in inference mode, i.e. is_training=False. The exact same model continues to improve when run in training mode, i.e. is_training=True, even on the evaluation set. I have visually verified this.
After some experimentation I changed the bn_decay (batch normalization decay/momentum rate) from the default 0.997 to 0.9 at ~370k steps (also tried 0.99, but that didn't make much of a difference) and observed an immediate improvement in accuracy. Visual inspection of the inference in inference mode showed clear peaks in the inferred values of order ~1e-1 in the expected places, consistent with the location of peaks from training mode (though with much lower values). This is why the accuracy increases significantly, but the loss, while more volatile, does not improve much.
These effects dropped off after more training and reverted to all zero inference.
I further dropped the bn_decay to 0.5 at step ~670k. This resulted in improvements to both loss and accuracy. I'll likely have to wait until tomorrow to see the long-term effect.
Loss and evaluation metric plots are given below. Note the evaluation metric is based on the argmax of the logits and high is good. Loss is based on the actual values, and low is good. Orange uses is_training=True on the training set, while blue uses is_training=False on the evaluation set. The loss of around 8 is consistent with all zero outputs.
Other notes
I have also experimented with turning off dropout (i.e. always running the dropout layers with is_training=False), and observed no difference.
I have experimented with all versions of tensorflow from 1.7 to 1.10. No difference.
I have trained models from the pretrained checkpoint using bn_decay=0.99 from the start. Same behaviour as using default bn_decay.
Other experiments with a batch size of 16 result in qualitatively identical behaviour (though I can't evaluate and train simultaneously due to memory constraints, hence quantitatively analysing on batch size of 8).
I have trained different models using the same loss and using tf.layers API and trained from scratch. They have worked fine.
Training from scratch (rather than using pretrained checkpoints) results in similar behaviour, though takes longer.
Summary/my thoughts:
I am confident this is not an overfitting/dataset problem. The model makes sensible inferences on the evaluation set when run with is_training=True, both in terms of location of peaks and magnitude.
I am confident this is not a problem with not running update ops. I haven't used slim before, but apart from the use of arg_scope it doesn't look too much different to the tf.layers API which I've used extensively. I can also inspect the moving average values and observe that they are changing as training progresses.
Changing bn_decay values significantly affected the results temporarily. I accept that a value of 0.5 is absurdly low, but I'm running out of ideas.
I have tried swapping out slim.layers.conv2d layers for tf.layers.conv2d with momentum=0.997 (i.e. momentum consistent with default decay value) and behaviour was the same.
Minimal example using pretrained weights and Estimator framework worked for classification of MNIST without modification to bn_decay parameter.
I've looked through issues on both the tensorflow and models github repositories but haven't found much apart from this. I'm currently experimenting with a lower learning rate and a simpler optimizer (MomentumOptimizer), but that's more because I'm running out of ideas rather than because I think that's where the problem lies.
Possible Explanations
The best explanation I have is that my model parameters are rapidly cycling in a manner such that the moving statistics are unable to keep up with the batch statistics. I've never heard of such behaviour, and it doesn't explain why the model reverts to poor behaviour after more time, but it's the best explanation I have.
There may be a bug in the moving average code, but it has worked perfectly for me in every other case, including a simple classification task. I don't want to file an issue until I can produce a simpler example.
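To put numbers on the first explanation, the sketch below uses the exponential moving average update that batch normalization applies to its moving mean and variance, moving = decay * moving + (1 - decay) * batch_stat, and counts how many training steps the moving value needs to close 99% of the gap after the batch statistic jumps to a new constant value. The helper names and the 99% threshold are arbitrary choices for illustration.

```python
def update_moving_stat(moving, batch_stat, decay):
    """One exponential-moving-average update of a batch-norm statistic."""
    return decay * moving + (1.0 - decay) * batch_stat


def steps_to_close_gap(decay, fraction=0.99):
    """Steps until the moving stat covers `fraction` of a jump from 0.0 to 1.0."""
    moving, steps = 0.0, 0
    while moving < fraction:
        moving = update_moving_stat(moving, 1.0, decay)
        steps += 1
    return steps


for decay in (0.997, 0.9, 0.5):
    print(f"bn_decay={decay}: {steps_to_close_gap(decay)} steps")
```

With the default decay the moving statistics lag the batch statistics by on the order of a thousand steps, while with 0.5 they follow within a handful of steps, which fits the observation that lowering bn_decay helped at least temporarily.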
Anyway, I'm running out of ideas, the debug cycle is long, and I've already spent too much time on this. Happy to provide more details or run experiments on demand. Also happy to post more code, though I'm worried that'll scare more people off.
Thanks in advance.
Both lowering the learning rate to 1e-4 with Adam and using Momentum optimizer (with learning_rate=1e-3 and momentum=0.9) resolved this issue. I also found this post which suggests the problem spans multiple frameworks and is an undocumented pathology of some networks due to the interaction between optimizer and batch-normalization. I do not believe it is a simple case of the optimizer failing to find a suitable minimum due to the learning rate being too high (otherwise performance in training mode would be poor).
I hope that helps others experiencing the same issue, but I'm a long way from satisfied. I'm definitely happy to hear other explanations.

Tensorflow Autoencoder with 0 hidden units learns something

I am currently running some tests with simple Autoencoders. I wrote an Autoencoder myself entirely in Tensorflow and in addition copied and pasted the code from this keras blog entry: https://blog.keras.io/building-autoencoders-in-keras.html (just to have a different Autoencoder implementation).
When I was testing different architectures, I started with a single layer and a couple of hidden units in this layer. I noticed that when I reduce the number of hidden units to only a single (!) hidden unit, I still get the same training and test losses I get with bigger architectures (up to a couple of thousand hidden units). In my data, the worst loss is 0.5. Any architecture I've tried achieves ~ 0.15.
Just out of curiosity, I reduced the number of hidden units in the only existing hidden layer to zero (which I know doesn't make any sense). However, I still get a training and test loss of 0.15. I assumed that this strange behavior might be due to the bias in the decoding layer (when I reconstruct the input). Initially, I've set the bias variable (in TF) to trainable=True. So now I guess even without any hidden units, the model still learns the bias in the decoding layer which might lead to the reduction of my loss from 0.5 to 0.15.
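The bias hypothesis can be checked on toy data. The sketch below assumes a mean-squared-error reconstruction loss (the question does not name the loss) and uses an invented 4x3 dataset. With zero hidden units every sample is reconstructed as the decoder bias alone, and gradient descent on that bias converges to the per-feature mean of the data, which already lowers the loss considerably.

```python
def mse(data, bias):
    """Mean squared error when every sample is reconstructed as `bias`."""
    n = len(data) * len(bias)
    return sum((x - b) ** 2 for row in data for x, b in zip(row, bias)) / n


# Invented toy dataset: 4 samples, 3 features.
data = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
]
n_features = len(data[0])
zero_bias = [0.0] * n_features
mean_bias = [sum(row[j] for row in data) / len(data) for j in range(n_features)]

# Plain gradient descent on the decoder bias alone, starting from zero.
learned_bias = list(zero_bias)
for _ in range(100):
    grads = [2.0 * sum(b - row[j] for row in data) / (len(data) * n_features)
             for j, b in enumerate(learned_bias)]
    learned_bias = [b - 1.0 * g for b, g in zip(learned_bias, grads)]

print("loss with bias fixed at zero:", mse(data, zero_bias))
print("loss with learned bias:      ", mse(data, learned_bias))
print("learned bias vs feature mean:", learned_bias, mean_bias)
```

If larger networks reach the same loss as this bias-only baseline on the real data, that would suggest they are learning little beyond the feature means.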
In the next step, I set the bias in the decoding layer to trainable=False. Now the model (with no hidden units) doesn't learn anything, just as I would have expected (loss=0.5). With one hidden unit, however, I again get test and training losses of around 0.15.
Following this line of thought, I set the bias in the encoding layer to trainable=False, since I wanted to avoid that my architecture only learns the bias. So now, only the weights of my autoencoder are trainable. This still works for a single hidden unit (and of course just a single hidden layer). Surprisingly, this only works in case of a single-layer network. As soon as I increase the number of layers (independent of the numbers of hidden units), the network again doesn't learn anything (in case only the weights get updated).
All the things I reported are true for the training loss as well as for the test loss (in a completely independent dataset the machine never sees). This makes it even more curious to me.
My question is: How can it happen that I learn as much from a 1 node "network" as from a bigger one (both for training and testing)? Second, how can it be that even larger nets seem to never overfit (training and test error change slightly, but are always comparable)? Any suggestions would be very helpful!
Thanks a lot!
Nils