Tensorflow. Conditionally trainable variables and stochastic depth neural networks - tensorflow

I have come across with a problem when started implementing stochastic depth regularization approch using Tensorflow. The paper (https://arxiv.org/pdf/1603.09382.pdf) states that the model can converge faster if we drop randomly some residual units during training. Current Torch implementation works perfectly. In Tensoflow, I can put conditions on residual unit branches, so that during forward step the activations for it will be cancelled, but the weights still will be updated during backward step. There is no way to tell that these weights (in the residual branch which we cancelled) are no longer trainable and they should not be included in optimization for current session run.
I have created the issue on github, where I covered how this problem can be solved in naive way, of course there is something under-hood that will prevent applying an easy fix, otherwise it is really strange why the tf.Variable's trainable parameter does not allow boolean Tensor as a value. If someone has clue for this question, I would really appreciate if you restore my faith in Tensoflow :)

The trainable parameter is used to control whether a graph to train that variable is built or not. Using a conditional stopgradient (a tf.cond with tf.identity in one branch and tf.stopgradient on the other) will deal with stopping the gradient from that variable.
However, if its value was not used during a forward step the gradient computed is guaranteed to be 0, and hence the update will be a no-op.

Related

Can we permanently prune Tensorflow low magnitude weights?

I have a TensorFlow model where I can apply the pruner.prune_low_magnitude layer to the output of my Dense layers. This seems to work according to the instructions, and I get almost the same results down to 95% sparsity. The Processing time in GPU and CPU seems to be the same. It seems the pruning layer is calculating all the weights, then setting the smallest weights to zero to fit the sparsity figure. Is this right?
It would be nice to get a speed-up, of course, but for my present purposes, this is fine.
I would like to prune the weights and have them stay zero thereafter. I would prune some weights; then continue training to allow the model to recover from the pruning; then prune a bit more. I feel this should be a bit closer to what real neurones do. Is there some way of doing this?
My solution (which does not work yet) is to add a custom layer with Trainable=false. This has a mask array that starts off as all ones, and is set to zero if the corresponding weight is zero. The layer multiplies the weights by this mask array, so once a weight goes to zero, it will stay zero. Should this work? Is there a better way?
It would be nice to get a speed-up, of course, but for my present
purposes, this is fine.
To get a reduction in inference time, the op (Dense in your example) implementation needs to be able to take advantage of the sparsity and the particular hardware on which it runs.
The Tensorflow runtime on CPU/GPU does not support this yet, but TFLite does. The
Pruning for on-device inference w/ XNNPACK tutorial demonstrates that.
I would like to prune the weights and have them stay zero thereafter.
I would prune some weights; then continue training to allow the model
to recover from the pruning; then prune a bit more. I feel this should
be a bit closer to what real neurones do. Is there some way of doing
this?
This is the responsibility of the pruning schedule that is passed as argument to prune_low_magnitude().
Existing implementations, such as ConstantSparsity or PolynomialDecay perform what you described: prune only at some steps and let the model recover in between. The begin_step, end_step, and frequency arguments let you control when and how frequently the pruning is applied during training.
My solution (which does not work yet) is to add a custom layer with
Trainable=false. This has a mask array that starts off as all ones,
and is set to zero if the corresponding weight is zero. The layer
multiplies the weights by this mask array, so once a weight goes to
zero, it will stay zero. Should this work? Is there a better way?
This is basically how Tensorflow Model Optimization does it under the hood: pruning_impl.py. You just need to apply prune_low_magnitude() as per the Pruning with Keras tutorial.

RNN/GRU Increasing validation loss but decreasing mean absolute error

I am new to deep learning and I try to implement an RNN (with 2 GRU layers).
At first, the network seems to do it's job quite fine. However, I am currently trying to understand the loss and accuracy curve. I attached the pictures below. The dark-blue line is the training set and the cyan line is the validation set.
After 50 epochs the validation loss increases. My assumption is that this indicates overfitting. However, I am unsure why the validation mean absolute error still decreases. Do you maybe got an idea?
One idea I had in mind was that this could be caused by some big outliers in my dataset. Thus I already tried to clean it up. I also tried to scale it properly. I also added a few dropout layers for further regularization (rate=0.2). However these are just normal dropout layers because cudnn does not seem to support recurrent_dropout from tensorflow.
Remark: I am using the negative log-likelihood as loss function and a tensorflow probability distribution as the output dense layer.
Any hints what I should investigate?
Thanks in advance
Edit: I also attached the non-probabilistic plot as recommended in the comment. Seems like here the mean-absolute-error behaves normal (does not improve all the time).
What are the outputs of your model? It sounds pretty strange that you're using the negative log-likelihood (which basically "works" with distributions) as the loss function but the MAE as a metric, which is suited for deterministic continuous values.
I don't know what is your task and perhaps this is meaningful in your specific case, but perhaps the strange behavior comes out from there.

Why is L2 regularization not added back into original loss function?

I'm aware that when using a kernal regularizer, particularly, l2 loss, I should bee add it back into the loss function and this is what is being done in other posts. However, in Keras, they are not following this process. Why is this so?
For instance, consider this and this notebook. They are using l2 loss as a kernal regularizer in some layers but not adding back into the original loss. Is this because of the particular loss, or is this a behavior followed in just Keras or am I completely misunderstanding everything?
Keras hides a lot of complexity (and this is not always a good thing).
You're using the Model abstraction: this model contains inside all the required information about the architecture and the training procedure.
When you invoke the method compile or train or train_on_batch you specify the loss function but under the hood what happens is:
Instantiate the loss function specified (e.g. categorical cross entropy)
Fetch from the model the regularizations applied and add all of them to the loss term previously instantiated
You can see the operations that are going to be added to the loss term accessing to the property .losses of the model instance (that's a list of tensorflow operations, usually all multilication operations, since the regularizations are in the for regularization_strenght * norm_p(variable).
The L2 regularization (or any weight regularization) in Keras is still added to the loss function in the same way as you would expect. It just happens behind the scene, so the user doesn't need to worry about it.
The notebooks you linked are the right way to use weight regularization in Keras.

tf-slim batch norm: different behaviour between training/inference mode

I'm attempting to train a tensorflow model based on the popular slim implementation of mobilenet_v2 and am observing behaviour I cannot explain related (I think) to batch normalization.
Problem Summary
Model performance in inference mode improves initially but starts producing trivial inferences (all near-zeros) after a long period. Good performance continues when run in training mode, even on the evaluation dataset. Evaluation performance is impacted by batch normalization decay/momentum rate... somehow.
More extensive implementation details below, but I'll probably lose most of you with the wall of text, so here are some pictures to get you interested.
The curves below are from a model which I tweaked the bn_decay parameter of while training.
0-370k: bn_decay=0.997 (default)
370k-670k: bn_decay=0.9
670k+: bn_decay=0.5
Loss for (orange) training (in training mode) and (blue) evaluation (in inference mode). Low is good.
Evaluation metric of model on evaluation dataset in inference mode. High is good.
I have attempted to produce a minimal example which demonstrates the issue - classification on MNIST - but have failed (i.e. classification works well and the problem I experience is not exhibited). My apologies for not being able to reduce things further.
Implementation Details
My problem is 2D pose estimation, targeting Gaussians centered at the joint locations. It is essentially the same as semantic segmentation, except rather than using a softmax_cross_entropy_with_logits(labels, logits) I use tf.losses.l2_loss(sigmoid(logits) - gaussian(label_2d_points)) (I use the term "logits" to describe unactivated output of my learned model, though this probably isn't the best term).
Inference Model
After preprocessing my inputs, my logits function is a scoped call to the base mobilenet_v2 followed by a single unactivated convolutional layer to make the number of filters appropriate.
from slim.nets.mobilenet import mobilenet_v2
def get_logtis(image):
with mobilenet_v2.training_scope(
is_training=is_training, bn_decay=bn_decay):
base, _ = mobilenet_v2.mobilenet(image, base_only=True)
logits = tf.layers.conv2d(base, n_joints, 1, 1)
return logits
Training Op
I have experimented with tf.contrib.slim.learning.create_train_op as well as a custom training op:
def get_train_op(optimizer, loss):
global_step = tf.train.get_or_create_global_step()
opt_op = optimizer.minimize(loss, global_step)
update_ops = set(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
update_ops.add(opt_op)
return tf.group(*update_ops)
I'm using tf.train.AdamOptimizer with learning rate=1e-3.
Training Loop
I'm using the tf.estimator.Estimator API for training/evaluation.
Behaviour
Training initially goes well, with an expected sharp increase in performance. This is consistent with my expectations, as the final layer is rapidly trained to interpret the high-level features output by the pretrained base model.
However, after a long period (60k steps with batch_size 8, ~8 hours on a GTX-1070) my model begins to output near-zero values (~1e-11) when run in inference mode, i.e. is_training=False. The exact same model continues to improve when run in *training mode, i.e.is_training=True`, even on the valuation set. I have visually verified this is.
After some experimentation I changed the bn_decay (batch normalization decay/momentum rate) from the default 0.997 to 0.9 at ~370k steps (also tried 0.99, but that didn't make much of a difference) and observed an immdeiate improvement in accuracy. Visual inspection of the inference in inference mode showed clear peaks in the inferred values of order ~1e-1 in the expected places, consistent with the location of peaks from training mode (though values much lower). This is why the accuracy increases significantly, but the loss - while more volative - does not improve much.
These effects dropped off after more training and reverted to all zero inference.
I further dropped the bn_decay to 0.5 at step ~670k. This resulted in improvements to both loss and accuracy. I'll likely have to wait until tomorrow to see the long-term effect.
Loss and an evaluation metric plots given below. Note the evaluation metric is based on the argmax of the logits and high is good. Loss is based on the actual values, and low is good. Orange uses is_training=True on the training set, while blue uses is_training=False on the evaluation set. The loss of around 8 is consistent with all zero outputs.
Other notes
I have also experimented with turning off dropout (i.e. always running the dropout layers with is_training=False), and observed no difference.
I have experimented with all versions of tensorflow from 1.7 to 1.10. No difference.
I have trained models from the pretrained checkpoint using bn_decay=0.99 from the start. Same behaviour as using default bn_decay.
Other experiments with a batch size of 16 result in qualitatively identical behaviour (though I can't evaluate and train simultaneously due to memory constraints, hence quantitatively analysing on batch size of 8).
I have trained different models using the same loss and using tf.layers API and trained from scratch. They have worked fine.
Training from scratch (rather than using pretrained checkpoints) results in similar behaviour, though takes longer.
Summary/my thoughts:
I am confident this is not an overfitting/dataset problem. The model makes sensible inferences on the evaluation set when run with is_training=True, both in terms of location of peaks and magnitude.
I am confident this is not a problem with not running update ops. I haven't used slim before, but apart from the use of arg_scope it doesn't look too much different to the tf.layers API which I've used extensively. I can also inspect the moving average values and observe that they are changing as training progresses.
Chaning bn_decay values significantly effected the results temporarily. I accept that a value of 0.5 is absurdly low, but I'm running out of ideas.
I have tried swapping out slim.layers.conv2d layers for tf.layers.conv2d with momentum=0.997 (i.e. momentum consistent with default decay value) and behaviour was the same.
Minimal example using pretrained weights and Estimator framework worked for classification of MNIST without modification to bn_decay parameter.
I've looked through issues on both the tensorflow and models github repositories but haven't found much apart from this. I'm currently experimenting with a lower learning rate and a simpler optimizer (MomentumOptimizer), but that's more because I'm running out of ideas rather than because I think that's where the problem lies.
Possible Explanations
The best explanation I have is that my model parameters are rapidly cycling in a manner such that the moving statistics are unable to keep up with the batch statistics. I've never heard of such behaviour, and it doesn't explain why the model reverts to poor behaviour after more time, but it's the best explanation I have.
There may be a bug in the moving average code, but it has worked perfectly for me in every other case, including a simple classification task. I don't want to file an issue until I can produce a simpler example.
Anyway, I'm running out of ideas, the debug cycle is long, and I've already spent too much time on this. Happy to provide more details or run experiments on demand. Also happy to post more code, though I'm worried that'll scare more people off.
Thanks in advance.
Both lowering the learning rate to 1e-4 with Adam and using Momentum optimizer (with learning_rate=1e-3 and momentum=0.9) resolved this issue. I also found this post which suggests the problem spans multiple frameworks and is an undocumented pathology of some networks due to the interaction between optimizer and batch-normalization. I do not believe it is a simple case of the optimizer failing to find a suitable minimum due to the learning rate being too high (otherwise performance in training mode would be poor).
I hope that helps others experiencing the same issue, but I'm a long way from satisfied. I'm definitely happy to hear other explanations.

Which operations support automatic gradients?

I have a fairly complex quantisation layer in my network which contains, among other operations, tf.tile and tf.expand_dims ops. I noticed that my network did not train well. Looking at some debug output, I saw that the fully connected layer before this quantisation layer got zero gradients for its weights (I used optimizer.compute_gradients to determine this). Does this mean that what ever is before the quantisation layer does not update in training?
In general: How do I figure out which operations let gradients pass through and which do not? For instance, do the above mentionied tf.tile and tf.expand_dims let gradients pass through`
If there is an operation without gradients in your model you will get an error:
LookupError: No gradient defined for operation [...]
So your problem seems to be somewhere else, maybe you have a multiplication by zero somewhere which kills the gradients. There is not enough information in your question to find the real reason of your problem.
Edit:
I didn't directly answer the question which operations support automatic gradients.
It is not listed in the documentation and I think you can only see it by checking the source code or using the operation and see if you get the mentioned error when you try to optimize the model.
For tf.tile and tf.expand_dims there are gradients defined.